### Noise Removal
Text cleaning is a technique that developers use in a variety of domains. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such as:
- Punctuation and accents
- Special characters
- Numeric digits
- Leading, trailing, and vertical whitespace
- HTML formatting

The type of noise that you need to remove from text usually depends on its source. For example, you could get data from the Twitter API, by scraping a webpage, or from voice recognition software. Fortunately, you can use the `.sub()` method in Python’s regular expression (`re`) library for most of your noise removal needs.

The `.sub()` method has three required arguments:
- pattern – a regular expression that is searched for in the input string. Writing the pattern as a raw string (prefixed with `r`) is recommended, because a raw string treats backslashes as literal characters instead of Python escape sequences.
- replacement_text – text that replaces all matches in the input string
- input – the input string that will be edited by the `.sub()` method

The method returns a string with all instances of the pattern replaced by the replacement_text.
For example, let’s consider how to remove HTML `<p>` tags from a string:

```
import re

text = "<p>    This is a paragraph</p>"
# </?p> matches both the opening <p> and the closing </p> tag
result = re.sub(r'</?p>', '', text)

print(result)
#    This is a paragraph
```


In [None]:
import re

headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'
tweet = '@fat_meats, veggies are better than you think.'

#Remove the opening and closing h1 tags
headline_no_tag = re.sub(r'</?h1>', '', headline_one)

#Remove @ from the tweet
tweet_no_at = re.sub(r'@','',tweet)
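
The same function can handle other kinds of noise from the list at the start of this section. The following is a minimal sketch rather than a definitive cleaning routine: the sample string and variable names are made up for illustration, and it assumes that only ASCII punctuation needs to be removed.

```python
import re
import string

raw = "  Hello,   World! It's 2024...\n\n"

# Remove ASCII punctuation. re.escape() makes characters such as ] and \
# safe to place inside a character class.
no_punct = re.sub(r"[" + re.escape(string.punctuation) + r"]", "", raw)

# Remove numeric digits.
no_digits = re.sub(r"\d+", "", no_punct)

# Collapse runs of whitespace (including newlines) into one space,
# then trim the leading and trailing whitespace.
clean = re.sub(r"\s+", " ", no_digits).strip()

print(clean)
# Hello World Its
```

Note that removing punctuation also removes the apostrophe in "It's". Whether that is acceptable depends on the goal of your project.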


### Tokenization
For many natural language processing tasks, we need access to each word in a string. To access each word, we first have to break the text into smaller components. The method for breaking text into smaller components is called tokenization and the individual components are called tokens.

A few common operations that require tokenization include:
- Finding how many words or sentences appear in text
- Determining how many times a specific word or phrase exists
- Accounting for which terms are likely to co-occur

While tokens are usually individual words or terms, they can also be sentences or other-sized pieces of text.

To tokenize individual words, we can use NLTK’s `word_tokenize()` function. The function accepts a string and returns a list of words:
```
from nltk.tokenize import word_tokenize

text = "Tokenize this text"
tokenized = word_tokenize(text)
```

To tokenize at the sentence level, we can use `sent_tokenize()` from the same module.
```
from nltk.tokenize import sent_tokenize

text = "Tokenize this sentence. Also, tokenize this sentence."
tokenized = sent_tokenize(text)
```
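
Once text is tokenized, the operations listed earlier in this section become short computations over a list. The sketch below is illustrative only: the `tokens` list is written out by hand to approximate word-level tokens for the sentence pair above, so that the example runs without NLTK, and adjacent-word pairs are used as a deliberately simple stand-in for co-occurrence.

```python
from collections import Counter

# Hand-written tokens (an assumption, not actual tokenizer output) for:
# "Tokenize this sentence. Also, tokenize this sentence."
tokens = ["Tokenize", "this", "sentence", ".", "Also", ",",
          "tokenize", "this", "sentence", "."]

# Keep alphabetic tokens only and lowercase them so "Tokenize" == "tokenize".
words = [token.lower() for token in tokens if token.isalpha()]

# How many words appear in the text
word_count = len(words)

# How many times each word occurs
frequencies = Counter(words)

# Which words appear next to each other (adjacent pairs)
pair_counts = Counter(zip(words, words[1:]))

print(word_count)                           # 7
print(frequencies["tokenize"])              # 2
print(pair_counts[("this", "sentence")])    # 2
```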

In [1]:
from nltk import sent_tokenize, word_tokenize

ecg_text = 'An electrocardiogram is used to record the electrical conduction through a person\'s heart. The readings can be used to diagnose cardiac arrhythmias.'

tokenized_by_word = word_tokenize(ecg_text)
tokenized_by_sentence = sent_tokenize(ecg_text)


### Normalization
Tokenization and noise removal are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization. Text normalization is a catch-all term for various text pre-processing tasks. In the next few sections, we’ll cover a few of them:
- Upper or lowercasing
- Stopword removal
- Stemming – bluntly removing prefixes and suffixes from a word
- Lemmatization – replacing a single-word token with its root

The simplest of these approaches is to change the case of a string. We can use Python’s built-in string methods `.upper()` and `.lower()` to make a string all uppercase or all lowercase.
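
As a quick illustration, with a sample string made up for this sketch, changing case lets two tokens that differ only in capitalization compare as equal:

```python
headline = "NLTK Makes Text Processing Easier"

lowered = headline.lower()
uppered = headline.upper()

print(lowered)  # nltk makes text processing easier
print(uppered)  # NLTK MAKES TEXT PROCESSING EASIER

# Without normalization, "The" and "the" are different strings.
same_before = "The" == "the"
same_after = "The".lower() == "the".lower()
print(same_before, same_after)  # False True
```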

### Stopword Removal
Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language and don’t provide any information about the tone of a statement. They include words such as “a”, “an”, and “the”.

NLTK provides a built-in list of these words in its `stopwords` corpus.
The process:
1. Import `stopwords` from `nltk.corpus` and `word_tokenize` from `nltk.tokenize`.
2. Initialize the stopwords as a set: `set(stopwords.words('english'))`.
3. Tokenize the text.
4. Use a list comprehension to create a list of the words that are not stopwords.


In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

survey_text = 'A YouGov study found that American\'s like Italian food more than any other country\'s cuisine.'

stop_words = set(stopwords.words('english'))
tokenized_survey = word_tokenize(survey_text.lower())
text_no_stops = [word for word in tokenized_survey if word not in stop_words]

### Stemming
In natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes). For example, stemming would cast the word “going” to “go”. This is a common method used by search engines to improve matching between user input and website hits.

NLTK has a built-in stemmer called PorterStemmer. You can use it with a list comprehension to stem each word in a tokenized list of words.

First, you must import and initialize the stemmer:
```
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# Now that we have our stemmer, we can apply it to each word in a list using a list comprehension:
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the', 'oldest', 'major', 'broadcast', 'network', '.']

stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# ['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']
```
Notice that words like ‘was’ and ‘founded’ became ‘wa’ and ‘found’, respectively. Reducing words in this way is useful for many language processing applications. However, you need to be careful when stemming strings, because words can often be converted to something unrecognizable.
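
To see why stemming is described as blunt, consider the toy suffix stripper below. It is not the Porter algorithm, which applies a much larger set of ordered rules. The `naive_stem` function and its three-suffix list are hypothetical and exist only to show how rule-based suffix removal can produce non-words.

```python
# Hypothetical, deliberately tiny rule set; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Require at least two characters to remain after stripping.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 1:
            return lowered[: -len(suffix)]
    return lowered

tokens = ["going", "was", "founded", "This", "makes", "network"]
stems = [naive_stem(token) for token in tokens]

print(stems)
# ['go', 'wa', 'found', 'thi', 'make', 'network']
```

The rule that turns "makes" into "make" is the same rule that turns "was" into "wa", which is why stemmed output should be checked before it is relied on.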

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

populated_island = 'Java is an Indonesian island in the Pacific Ocean. It is the most populated island in the world, with over 140 million people.'

stemmer = PorterStemmer()

island_tokenized = word_tokenize(populated_island)
stemmed = [stemmer.stem(token) for token in island_tokenized]

### Lemmatization
Lemmatization is a method for casting words to their root forms. This is a more involved process than stemming, because it requires the method to know the part of speech for each word, which also makes it less efficient than stemming. By default, NLTK’s `lemmatize()` treats every word as a noun. To take advantage of the power of lemmatization, we need to tag each word in our text with the most likely part of speech.

### Part-of-Speech Tagging
To improve the performance of lemmatization, we need to find the part of speech for each word in our string. The code cell at the end of this section defines a part-of-speech tagging function. The function accepts a word, then returns the most common part of speech for that word. Let’s break down the steps:

1. Import wordnet and Counter
```
from nltk.corpus import wordnet
from collections import Counter
```
- `wordnet` is a lexical database that we use for contextualizing words
- `Counter` is a container that stores elements as dictionary keys and their counts as dictionary values

2. Get synonyms
Inside our function, we use the `wordnet.synsets()` function to get the synsets (sets of synonyms) for the word:
```
def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
```
The returned synonyms come with their part of speech.

3. Use synonyms to determine the most likely part of speech
Next, we create a Counter() object and set each value to the count of the number of synonyms that fall into each part of speech:

`pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )`

This line counts the number of nouns in the synonym set.

4. Return the most common part of speech
Now that we have a count for each part of speech, we can use the .most_common() counter method to find and return the most likely part of speech:

`most_likely_part_of_speech = pos_counts.most_common(1)[0][0]`
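
The counting and selection in steps 3 and 4 can be tried without WordNet. In the sketch below, `synset_tags` is a hypothetical list standing in for the `item.pos()` values of a word’s synsets. It is invented for illustration and is not real WordNet output.

```python
from collections import Counter

# Hypothetical part-of-speech tags, one per synset of some word.
synset_tags = ["n", "v", "v", "v", "a"]

# Count how many tags fall into each part of speech.
pos_counts = Counter()
for tag in ("n", "v", "a", "r"):
    pos_counts[tag] = synset_tags.count(tag)

# most_common(1) returns a list holding one (tag, count) pair.
most_likely = pos_counts.most_common(1)[0][0]
print(most_likely)  # v

# If a word has no synsets, every count is 0. Ties are returned in
# insertion order, so the first key ("n") is chosen.
empty_counts = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
fallback = empty_counts.most_common(1)[0][0]
print(fallback)  # n
```

The second case suggests why setting the noun count first is a reasonable choice: unknown words, such as numbers and punctuation, fall back to the noun tag.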

Now that we can find the most probable part of speech for a given word, we can pass it to our lemmatizer when we find the root of each word. With the part of speech passed in, “is” is cast to its root, “be”, and words like “was” and “were” are cast to “be” as well.

In [5]:
# The get part of speech function
import nltk
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
  pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
  pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
  pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
  
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'

tokenized_string = word_tokenize(populated_island)

lemmatized_pos = [ lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized_string]
lemmatized_pos[:3]
tokenized_string

['Indonesia',
 'was',
 'founded',
 'in',
 '1945',
 '.',
 'It',
 'contains',
 'the',
 'most',
 'populated',
 'island',
 'in',
 'the',
 'world',
 ',',
 'Java',
 ',',
 'with',
 'over',
 '140',
 'million',
 'people',
 '.']

1. Book
- Processing Raw Text
https://www.nltk.org/book/ch03.html

2. Videos
- Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences
https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
- Stop Words - Natural Language Processing With Python and NLTK p.2
https://pythonprogramming.net/stop-words-nltk-tutorial/
- Stemming - Natural Language Processing With Python and NLTK p.3
https://pythonprogramming.net/stemming-nltk-tutorial/
- Lemmatizing - Natural Language Processing With Python and NLTK p.8
https://pythonprogramming.net/lemmatizing-nltk-tutorial/