# Text cleaning

To get accurate results from your NLP model, you need to give it as much information as you can, as long as that information is useful and well-formatted.

Some of that information is visual: for example, an image placed before each important word in a text, or blocks of text separated by a lot of spaces.

Let's take a concrete use-case:

![text](https://i.imgur.com/2METpwn.png)

In this image, you could provide the text like this:

```
Becode 1st December 2020 Cantersteen 10 Bruxelles 1000 Bruxelles Dear learners,
```

But it will be hard for your model to extract meaningful information out of it. Even for you, this text would not be easy to read.

A first solution could be to format and sort it.

```
Becode
Cantersteen 10
1000 Bruxelles

1st December 2020
Bruxelles

Dear learners,
```

A bit better, but it's still not perfect: the model doesn't understand your line breaks. It only understands text and spaces (which are part of the text, too). So we can add a tag.

## HTML tags

![html](https://cdn.lynda.com/course/170427/170427-637363828865101045-16x9.jpg)

By convention, people often reuse the HTML tag `<br>`, which stands for line **br**eak.

So we can do something like:
```html
Becode
Cantersteen 10
1000 Bruxelles

1st December 2020
Bruxelles
<br>
Dear learners,
```
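
If your extracted text still contains real line breaks, a small helper can turn them into tags automatically. This is only a minimal sketch: the function name `mark_line_breaks` is made up for this example, and it assumes you want a tag for every line break.

```python
def mark_line_breaks(text: str, tag: str = "<br>") -> str:
    """Replace every line break with an explicit tag surrounded by spaces."""
    # Trim each line so the tag is always separated by exactly one space.
    lines = [line.strip() for line in text.splitlines()]
    return f" {tag} ".join(lines)


letter = "Becode\nCantersteen 10\n1000 Bruxelles"
print(mark_line_breaks(letter))
# Becode <br> Cantersteen 10 <br> 1000 Bruxelles
```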

As you saw in previous [chapters](../../2.python/2.python_advanced/05.Scraping/2.beautifulsoup_advanced.ipynb), HTML tags will also be important when scraping data from the web.

## Create your own tags

Sometimes, you want to add visual information that is not in the text. It could be emojis, recurring images at specific places in front of the text, and so on.

In those cases, you can create your own tags, but **be careful to only do that if**:
1. You are sure that this information will help the model
2. There is enough repetition of this tag to allow the model to understand the meaning of it.

For example, in our letter, we could tell the model that the address and the date are on different sides of the page. We could decide to add a tag `<LEFT_SECTION>` (the name is just an arbitrary choice). If you only give the model this one document, or add other documents that don't contain this tag, the model will not understand what it means! But if you give it 100 documents that all use the same tag in the same way, the model could start to understand the link.

```
Becode
Cantersteen 10
1000 Bruxelles

<LEFT_SECTION>
1st December 2020
Bruxelles
</LEFT_SECTION>


Dear learners,
```
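
To apply the same tag consistently across many documents, it helps to generate it in code rather than typing it by hand. The helper below is a hypothetical sketch (`wrap_in_tag` is not a library function); it simply surrounds a block of text with an opening and a closing tag.

```python
def wrap_in_tag(block: str, tag: str) -> str:
    """Surround a block of text with an opening and a closing custom tag."""
    return f"<{tag}>\n{block}\n</{tag}>"


date_block = "1st December 2020\nBruxelles"
print(wrap_in_tag(date_block, "LEFT_SECTION"))
# <LEFT_SECTION>
# 1st December 2020
# Bruxelles
# </LEFT_SECTION>
```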

## Tags are sometimes dangerous

**Pro tip:** Sometimes you will find those tags in the extracted text. You should always ask yourself: *Does it make sense or is it confusing?*

For example, in this text:

```
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of

 type and scrambled it to make a type specimen book. 
```

Here the line breaks don't add any information; they are only there for style and readability.
So if you extract those line breaks (and you will with some document formats), you get:
```
Lorem Ipsum is simply dummy text of the printing and typesetting industry.<br>

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of<br>

 type and scrambled it to make a type specimen book. 
```

You should remove them! You could, for example, use regular expressions for that.
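
As a rough sketch of the regular-expression approach, the function below replaces each `<br>` tag, together with the whitespace around it, with a single space. The function name is made up for this example, and the pattern assumes the tag is written as `<br>`, `<br/>` or `<br />`.

```python
import re


def remove_line_break_tags(text: str) -> str:
    """Replace each <br> tag (and the whitespace around it) with one space."""
    without_tags = re.sub(r"\s*<br\s*/?>\s*", " ", text, flags=re.IGNORECASE)
    return without_tags.strip()


extracted = "took a galley of<br>\n\n type and scrambled it"
print(remove_line_break_tags(extracted))
# took a galley of type and scrambled it
```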

You will also encounter formatting tags like `<b>` or `<i>` (bold and italic). Once again, depending on your task, you may want to remove them. If you do document classification, they can bias the model. If you do Named Entity Recognition (which we'll see in a later chapter), they could actually help the model.

Try to always ask yourself the question: *Would it help me to do the task or not?* If the answer is no, just remove them.
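
If you decide that only the formatting tags have to go, you can target them specifically and leave every other tag in place. A minimal sketch, assuming the tags carry no attributes (plain `<b>`, `</b>`, `<i>`, `</i>`):

```python
import re

# Matches opening and closing <b> and <i> tags only; other tags are left alone.
FORMATTING_TAG = re.compile(r"</?(?:b|i)>", flags=re.IGNORECASE)


def remove_formatting_tags(text: str) -> str:
    """Drop <b>/<i> markers while keeping the words they surround."""
    return FORMATTING_TAG.sub("", text)


sample = "Send it to <b>BeCode</b> before <i>Friday</i>.<br>"
print(remove_formatting_tags(sample))
# Send it to BeCode before Friday.<br>
```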

## Practice time!

Remove all the HTML tags in the text below.

```python
import re

text = "<p> WWF's mission is to stop the <strong> degradation </strong> of our planet's natural environment. </p>"

def remove_tags(text):
    # Non-greedy match: delete everything from a "<" up to the next ">"
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text

# Example usage
html_text = '<p>This is <b>some</b> text with <a href="#">tags</a>.</p>'
cleaned_text = remove_tags(html_text)
print(cleaned_text)

# repr() keeps the quotes, so the leftover spaces stay visible
print(repr(remove_tags(text)))
```

Output:

```
This is some text with tags.
" WWF's mission is to stop the  degradation  of our planet's natural environment. "
```

# Text casing and extra spaces

## Text casing

Text casing will also influence your model. If you are looking for addresses, names, and so on, casing will help the model. But if you want to do document classification, the model will treat `doctor` and `Doctor` as two different words, and you don't want that. One way to avoid this is to convert all the text to lowercase.

## Trailing spaces
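
In Python, lowercasing is a single call to the built-in `str.lower()` method. The two sample sentences below are made up for the illustration.

```python
documents = ["The Doctor is in.", "Ask the doctor."]

# str.lower() returns a new string; the original documents are unchanged.
lowercased = [document.lower() for document in documents]
print(lowercased)
# ['the doctor is in.', 'ask the doctor.']
```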

When you extract text from documents, sometimes you will have additional spaces after a sentence or a double space where there shouldn't be one. These are just formatting errors, but your model will be affected.

Example:

```
 This  is   some text where some spaces  have been added.  
   Remove them! 
```

In this text, it would be easy to remove the spaces, but you will not do it by hand for each document!
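
One possible way to automate this is sketched below: collapse every run of spaces or tabs into one space, then trim both ends of each line. The function name `normalize_spaces` is made up for this example.

```python
import re


def normalize_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs and trim both ends of every line."""
    cleaned_lines = []
    for line in text.splitlines():
        collapsed = re.sub(r"[ \t]+", " ", line)
        cleaned_lines.append(collapsed.strip())
    return "\n".join(cleaned_lines)


messy = " This  is   some text where some spaces  have been added.  \n   Remove them! "
print(normalize_spaces(messy))
# This is some text where some spaces have been added.
# Remove them!
```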


## Practice time!

Remove all the double spaces, as well as the single spaces at the start and end of the text.

```python
import re

text = " Lorem Ipsum is  simply dummy text  of the printing and typesetting industry. Lorem Ipsum has been  the industry's standard dummy text ever since the  1500s, when an unknown printer took a galley  of type and scrambled it to  make a type specimen  book. It has survived  not only five centuries, but  also the leap  into electronic typesetting ,  remaining essentially unchanged . It was popularised  in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop  publishing software like Aldus PageMaker including versions of Lorem Ipsum. "

# Remove all the useless spaces in 'text'
text = re.sub(r' {2,}', ' ', text)  # collapse runs of spaces into one
text = text.replace(' . ', '. ')    # no space before a full stop
text = text.replace(' , ', ', ')    # no space before a comma
text = text.strip()                 # drop the leading and trailing space

print(text)
```

Output:

```
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
```


# Stop words

## What are stop words?

A stop word is a word that has very little meaning by itself, such as `the`, `a`, `and`, `an`, and so on.
Most search engines remove these stop words when you do a search.

![stop words](https://i2.wp.com/xpo6.com/wp-content/uploads/2009/04/stop-words.png?fit=837%2C499)

## How to remove these stop words?

You could remove them by hand with the `replace()` function, but if you want to go faster, you can use libraries like `spaCy`, `NLTK`, `Gensim`, and more. Each library behaves slightly differently, but usually not enough to make a big difference to your model.
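
To see what these libraries do for you, here is a minimal sketch of the manual approach. It filters whole words instead of chaining `replace()` calls, because `replace("a", "")` would also delete the letter `a` inside other words. The tiny `STOP_WORDS` set is purely illustrative, and the sketch does not handle punctuation attached to a word.

```python
# Illustrative stop-word list only; real libraries ship far longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def remove_stop_words(text: str) -> str:
    """Keep only the whitespace-separated words that are not stop words."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("The cat and the dog went to a park"))
# cat dog went park
```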

## Practice time!

Using the library of your choice, remove all the stop words of this text:

```python
import spacy

# Remove all my stop words
text = "At BeCode, we like to learn. Sometime, we play games not win a price but to have fun!"

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(text)

# Remove stop words
filtered_text = ' '.join([token.text for token in doc if not token.is_stop])

# Print the result
print("Original Text:")
print(text)

print("\nText without Stop Words:")
print(filtered_text)
```

Output:

```
Original Text:
At BeCode, we like to learn. Sometime, we play games not win a price but to have fun!

Text without Stop Words:
BeCode , like learn . , play games win price fun !
```


As a list of tokens, the result should be something like this:
```
['BeCode', ',', 'like', 'learn', '.', ',', 'play', 'games', 'win', 'price', 'fun', '!']
```

As you can see, whether you can exclude stop words depends on the kind of information you want to extract. For document classification or semantic search, for example, you usually do not need them.

## Customize your stop words

You can also add words to, or remove words from, the stop word list that a library uses. If a specific word in your documents should not be treated as a stop word because the model absolutely needs it, remove it from the list. If a word carries no useful information in your documents, add it.
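
The idea can be sketched with plain Python sets. `DEFAULT_STOP_WORDS` below is a hypothetical stand-in for a library's default list, and the choices made (keeping `not`, dropping `becode`) are only examples. The exact way to edit the list differs per library, so check its documentation.

```python
# Hypothetical base list standing in for a library's default stop words.
DEFAULT_STOP_WORDS = {"the", "a", "an", "and", "not", "to", "we"}

# Keep "not" (negation changes the meaning) and drop the domain word "becode".
custom_stop_words = (DEFAULT_STOP_WORDS - {"not"}) | {"becode"}


def filter_words(text: str, stop_words: set) -> str:
    """Remove the whitespace-separated words found in the given stop-word set."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


sentence = "We do not play to win at BeCode"
print(filter_words(sentence, DEFAULT_STOP_WORDS))
# do play win at BeCode
print(filter_words(sentence, custom_stop_words))
# do not play win at
```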


## Additional resources
* [NLP Essentials: Removing stopwords and performing Text Normalization using NLTK and spaCy in Python](https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/)
* [Removing stop words from strings in Python](https://stackabuse.com/removing-stop-words-from-strings-in-python/#usingpythonsnltklibrary)
* [Dropping common terms: stop words](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)