# Text Preprocessing

Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in almost all natural language processing (NLP) pipelines, including voice recognition software, search engine lookup, and machine learning model training. It is an essential step because text data varies widely: both its source format (website, text message, voice recognition) and the people who create it (language, dialect) can introduce noise into your data.

The ultimate goal of cleaning and preparing text data is to reduce the text to only the words that you need for your NLP goals.

We will cover a few common approaches for cleaning and processing text data. They include:

1. Using Regex & NLTK libraries

2. Removing unnecessary characters and formatting

3. Tokenization – break multi-word strings into smaller components

4. Normalization – a catch-all term for processing data; this includes stemming and lemmatization

Cleaning and preparation are crucial for many tasks, and NLP is no exception. Text preprocessing is usually the first step one will take when faced with an NLP task.

Without preprocessing, your computer interprets "the", "The", and `"<p>The"` as entirely different words. Regex and NLTK handle most of this work for you. Common tasks include:

1. Noise removal – stripping text of formatting (e.g., HTML tags).

2. Tokenization – breaking text into individual words.

3. Normalization – cleaning text data in any other way:

a. Stemming is a blunt axe that chops off word prefixes and suffixes. “booing” and “booed” become “boo”, but “sing” may become “s” and “sung” would remain “sung.”

b. Lemmatization is a scalpel that brings words down to their root forms. For example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”

c. Other common tasks include lowercasing, stopword removal, spelling correction, etc.


## Noise Removal

Text cleaning is a technique that developers use in a variety of domains. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such as:

a. punctuation and accents

b. special characters

c. numeric digits

d. leading, trailing, and vertical whitespace

e. HTML formatting

The type of noise that you need to remove from text usually depends on its source. For example, you could access data via the Twitter API, by scraping a webpage, or from voice recognition software. Fortunately, you can use the .sub() method in Python’s regular expression (re) library for most of your noise removal needs.

The .sub() method has three required arguments:

1. pattern – a regular expression that is searched for in the input string. Writing the pattern as a raw string (prefixed with r) is recommended, though not strictly required, because a raw string treats backslashes as literal characters instead of escape sequences.

2. replacement_text – text that replaces all matches in the input string

3. input – the input string that will be edited by the .sub() method

The method returns a string with all instances of the pattern replaced by the replacement_text. Let’s see a few examples of using this method to remove and replace text from a string.
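
As a quick illustration of the three arguments, here is a minimal sketch that strips digits and then punctuation from a string. The sample text and variable names are made up for this example.

```python
import re

# Hypothetical input string, invented for this example.
text = "preprocessing101, tokens42!"

# pattern: one or more digits; replacement: empty string; input: text
no_digits = re.sub(r"\d+", "", text)

# pattern: any character that is neither a word character nor whitespace
no_punctuation = re.sub(r"[^\w\s]", "", no_digits)

print(no_digits)       # preprocessing, tokens!
print(no_punctuation)  # preprocessing tokens
```

Note that .sub() returns a new string each time; the original text variable is left unchanged.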

### Examples

First, let’s consider how to remove HTML p tags from a string:

![img](https://i.imgur.com/cgjjFf5.jpg)

Notice that we replace the tags with an empty string ''. This is a common approach for removing text.
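
For readers who cannot view the screenshot, a minimal text sketch of the same idea might look like the following. The sample string and the exact pattern are assumptions and may differ from the code in the image.

```python
import re

# Hypothetical HTML snippet, invented for this example.
text = "<p>    This is a paragraph</p>"

# Match an opening <p> or a closing </p> tag and replace it with an empty string.
result = re.sub(r"</?p>", "", text)

print(repr(result))  # '    This is a paragraph'
```

The `/?` in the pattern makes the slash optional, so a single call handles both the opening and the closing tag.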

Next, let’s remove the whitespace from the beginning of the text. The whitespace consists of four spaces.

![img1](https://i.imgur.com/P7OhCGK.jpg)
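
A text sketch of this step, again with an assumed sample string, could be written as:

```python
import re

# Hypothetical string that starts with four spaces.
text = "    This is a paragraph"

# ^ anchors the match to the start of the string;
# \s{4} matches exactly four whitespace characters.
result = re.sub(r"^\s{4}", "", text)

# \s+ removes leading whitespace of any length instead.
result_any_length = re.sub(r"^\s+", "", text)

print(result)  # This is a paragraph
```

When the amount of leading whitespace is not known in advance, the `^\s+` variant is usually the safer choice.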

## Tokenization

For many natural language processing tasks, we need access to each word in a string. To access each word, we first have to break the text into smaller components. The method for breaking text into smaller components is called tokenization and the individual components are called tokens.

A few common operations that require tokenization include:

a. Finding how many words or sentences appear in text

b. Determining how many times a specific word or phrase exists

c. Accounting for which terms are likely to co-occur

While tokens are usually individual words or terms, they can also be sentences or other size pieces of text.
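
To make the operations listed above concrete, here is a rough sketch that uses plain regular expressions as stand-in tokenizers. That choice is an assumption made to keep the example self-contained; NLTK's tokenizers handle punctuation, abbreviations, and contractions more carefully. The sample text is invented.

```python
import re
from collections import Counter

# Hypothetical text, invented for this example.
text = "The squid jumped. The squid swam away. Nobody noticed the squid."

# Stand-in tokenizers (assumptions): a word is a run of word characters,
# and a sentence ends at ., ! or ? followed by whitespace.
words = re.findall(r"\w+", text.lower())
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# a. How many words or sentences appear in the text
print(len(words))      # 11
print(len(sentences))  # 3

# b. How many times a specific word appears
word_counts = Counter(words)
print(word_counts["squid"])  # 3

# c. Which terms co-occur: count adjacent word pairs (bigrams)
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(1))  # [(('the', 'squid'), 3)]
```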

To tokenize individual words, we can use NLTK’s word_tokenize() function. The function accepts a string and returns a list of words:

![i](https://i.imgur.com/N1IV9On.jpg)

To tokenize at the sentence level, we can use sent_tokenize() from the same module.

![i](https://i.imgur.com/mXFuCyG.jpg)

## Normalization

Tokenization and noise removal are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization. Text normalization is a catch-all term for various text pre-processing tasks. We’ll cover a few of them:

a. Upper or lowercasing

b. Stopword removal

c. Stemming – bluntly removing prefixes and suffixes from a word

d. Lemmatization – replacing a single-word token with its root

The simplest of these approaches is to change the case of a string. We can use Python’s built-in string methods .upper() and .lower() to make a string all uppercase or lowercase:

![n](https://i.imgur.com/Vvm0CK1.jpg)
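
A minimal text version of this step, using an invented sample string, might look like:

```python
# Hypothetical sample string.
text = "Hello, NLP World!"

lowered = text.lower()
uppered = text.upper()

print(lowered)  # hello, nlp world!
print(uppered)  # HELLO, NLP WORLD!

# After lowercasing, differently cased spellings compare as equal.
print("The".lower() == "the")  # True
```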

## Stopword Removal

Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language and don’t provide any information about the tone of a statement. They include words such as “a”, “an”, and “the”.

NLTK provides a corpus of these words (it may need to be downloaded once with nltk.download('stopwords')). You can import them using the following statements:

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

We store the stop words in a set because checking whether a word is in a set is faster than checking whether it is in a list.

Now that we have the words saved to stop_words, we can use tokenization and a list comprehension to remove them from a sentence:

![i](https://i.imgur.com/KzTCyzd.jpg)

In this code, we first tokenized our string, nbc_statement, then used a list comprehension to return a list with all of the stopwords removed.
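
The filtering step can be sketched without any downloads by swapping in a tiny hand-written stop-word set and whitespace splitting. Both are stand-ins assumed for illustration; in practice you would use NLTK's stop_words set and word_tokenize() as described above. The sample sentence is invented.

```python
# Small stand-in stop-word set (an assumption for this sketch);
# NLTK's English list is much longer.
stop_words = {"a", "an", "the", "in", "it", "was", "is", "of"}

# Hypothetical sentence; split() is a simple stand-in for word_tokenize().
statement = "The network was founded in a small studio"
tokens = statement.split()

# Keep only the tokens whose lowercase form is not a stop word.
no_stops = [word for word in tokens if word.lower() not in stop_words]

print(no_stops)  # ['network', 'founded', 'small', 'studio']
```

Lowercasing each token before the membership check matters here, because the stop words are stored in lowercase and "The" would otherwise slip through.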



## Stemming

In natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes). For example, stemming would cast the word “going” to “go”. This is a common method used by search engines to improve matching between user input and website hits.

NLTK has a built-in stemmer called PorterStemmer. You can use it with a list comprehension to stem each word in a tokenized list of words.

First, you must import and initialize the stemmer:

    from nltk.stem import PorterStemmer
    
    stemmer = PorterStemmer()

Now that we have our stemmer, we can apply it to each word in a list using a list comprehension:

![s](https://i.imgur.com/ZGWikub.jpg)

Notice that words like ‘was’ and ‘founded’ became ‘wa’ and ‘found’, respectively. Reducing words in this way is useful for many language processing applications. However, we need to be careful when stemming strings, because words can often be converted to something unrecognizable.
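
As a text sketch of the stemming step, the snippet below applies PorterStemmer to a short, hand-made token list (the list is an assumption; the stems match those reported elsewhere in this article). PorterStemmer needs NLTK installed but no extra data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical pre-tokenized input.
tokens = ["was", "founded", "jumping", "tightly"]

# Stem each token with a list comprehension.
stemmed = [stemmer.stem(token) for token in tokens]

print(stemmed)  # ['wa', 'found', 'jump', 'tightli']
```

Stems such as 'wa' and 'tightli' are not real words, which illustrates why stemmed output is better suited to matching and counting than to display.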



## Lemmatization

Lemmatization is a method for casting words to their root forms. This is a more involved process than stemming, because it requires the method to know the part-of-speech for each word. Since lemmatization requires the part of speech, it is a less efficient approach than stemming.

We can use NLTK’s WordNetLemmatizer to lemmatize text:

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    
Once we have the lemmatizer initialized, we can use a list comprehension to apply the lemmatize operation to each word in a list:

![t](https://i.imgur.com/fyNQaoT.jpg)

The result, saved to lemmatized, contains 'wa', while the rest of the words remain the same. Not too useful. This happened because lemmatize() treats every word as a noun by default. To take advantage of the power of lemmatization, we need to tag each word in our text with the most likely part of speech.


## Part-of-Speech Tagging
To improve the performance of lemmatization, we need to find the part of speech for each word in our string. The get_part_of_speech() function, shown in full at the end of this section, accepts a word and returns the most common part of speech for that word. Let’s break down the steps:

1. Import wordnet and Counter

WordNet is a lexical database that we use for contextualizing words. Counter is a dictionary subclass that stores elements as keys and their counts as values.

    from nltk.corpus import wordnet
    from collections import Counter

2. Get synonyms

Inside of our function, we use the wordnet.synsets() function to get the synsets (sets of synonyms) that contain the word:

    def get_part_of_speech(word):    
      probable_part_of_speech = wordnet.synsets(word)
      
3. Use synonyms to determine the most likely part of speech

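Next, we create a Counter() object and set each value to the number of synsets that fall into each part of speech. The lines below count nouns ("n"); the same pattern is repeated for verbs ("v"), adjectives ("a"), and adverbs ("r"):

    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])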
    
4. Return the most common part of speech

Now that we have a count for each part of speech, we can use the .most_common() counter method to find and return the most likely part of speech: 

    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
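
To see what .most_common(1)[0][0] does in isolation, here is a small sketch with made-up counts. The numbers are hypothetical and do not come from WordNet.

```python
from collections import Counter

# Hypothetical synset counts per part of speech for some word.
pos_counts = Counter()
pos_counts["n"] = 2
pos_counts["v"] = 7
pos_counts["a"] = 0
pos_counts["r"] = 0

# most_common(1) returns a one-element list of (key, count) pairs,
# so [0][0] extracts the key with the highest count.
print(pos_counts.most_common(1))  # [('v', 7)]
most_likely = pos_counts.most_common(1)[0][0]
print(most_likely)  # v

# When every count is zero (for example, a word with no synsets),
# ties keep insertion order, so "n" is returned.
unknown = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
fallback = unknown.most_common(1)[0][0]
print(fallback)  # n
```

The tie-breaking behavior means that words without any synsets fall back to being treated as nouns, which matches the lemmatizer's default.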
    
Now that we can find the most probable part of speech for a given word, we can pass this into our lemmatizer when we find the root for each word. Let’s take a look at how we would do this for a tokenized string:

![pos](https://i.imgur.com/tgs6lnJ.jpg)

The complete script, combining the steps above:

    # regex for removing punctuation
    import re
    from collections import Counter

    # NLTK preprocessing tools; word_tokenize() needs the Punkt data and the
    # lemmatizer needs the WordNet data (both installable via nltk.download())
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet


    def get_part_of_speech(word):
        # Count how many of the word's synsets fall into each part of speech
        probable_part_of_speech = wordnet.synsets(word)
        pos_counts = Counter()
        pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
        pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
        pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
        pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
        # Return the part of speech with the most synsets
        most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
        return most_likely_part_of_speech


    text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

    # Replace every run of non-word characters with a single space
    cleaned = re.sub(r'\W+', ' ', text)
    tokenized = word_tokenize(cleaned)

    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(token) for token in tokenized]

    # Lemmatize each token using its most likely part of speech
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

    print("Stemmed text:")
    print(stemmed)
    print("\nLemmatized text:")
    print(lemmatized)

Output:

    Stemmed text:
    ['So', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'I', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

    Lemmatized text:
    ['So', 'many', 'squid', 'be', 'jump', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'pack', 'valise', 'I', 'go', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minute', 'of', 'arrive', 'She', 'hardly', 'even', 'notice']
