## Text Cleaning for NLP

[https://monkeylearn.com/blog/text-cleaning/](https://monkeylearn.com/blog/text-cleaning/)

1. Normalize Text
2. Remove Unicode Characters
3. Remove Stopwords
4. Perform stemming and lemmatization


Further Data Cleaning
1. Part of Speech Tagging
2. Translation
3. Typo Correction
4. Number Unification: phone number standardization, address standardization
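
As an illustration of the phone number standardization item, here is a minimal sketch. It assumes 10-digit North American numbers with an optional leading country code `1`; the function name `standardize_phone` and the output format `+1-NNN-NNN-NNNN` are choices made for this example, not something the source article prescribes.

```python
import re

def standardize_phone(raw, country_code="1"):
    """Return the number as +1-NNN-NNN-NNNN, or None if it does not fit the assumed shape."""
    # Keep digits only, dropping spaces, dots, dashes, parentheses and '+'
    digits = re.sub(r"\D", "", raw)
    # A bare 10-digit number gets the assumed country code prepended
    if len(digits) == 10:
        digits = country_code + digits
    # Anything that is not country code + 10 digits is rejected
    if len(digits) != 11 or not digits.startswith(country_code):
        return None
    return "+{}-{}-{}-{}".format(digits[0], digits[1:4], digits[4:7], digits[7:])

print(standardize_phone("(555) 123-4567"))
print(standardize_phone("+1 555.123.4567"))
```

Both calls print the same string, which is the point of unification: differently written inputs collapse to one canonical form.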


### Normalization

Normalizing text is the process of standardizing
text so that, through NLP, computer models can better understand
human input. The end goal is to perform sentiment analysis and
other types of analysis on your customer feedback more
effectively.

Normalizing text with Python and the NLTK library starts with standardizing
capitalization, so that machine models don't treat capitalized words
as different from their lowercase counterparts.


In [2]:
text = "Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first FIX THIS ASAP! @AmazonHelp"

text = text.lower()

print(text)

hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first fix this asap! @amazonhelp
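
Lowercasing with `lower()` is enough for the ASCII example above. As a hedged extension, the sketch below also folds Unicode compatibility forms and uses `casefold()`, which is intended for caseless matching. The helper name `normalize_case` is made up for illustration and is not part of the original walkthrough.

```python
import unicodedata

def normalize_case(text):
    # NFKC maps compatibility characters (e.g. full-width letters) to their plain forms
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower(), e.g. the German sharp s becomes "ss"
    return text.casefold()

print(normalize_case("FIX THIS ASAP! @AmazonHelp"))
```

For plain English text this behaves like `lower()`; the difference only shows up on non-ASCII input.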


### Remove Unicode Characters

In [4]:
import re

text = "hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp"

text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)

print(text)

hey amazon  my package never arrived  please fix asap
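
The single pattern above bundles several clean-up rules into one expression and leaves double spaces where text was removed. A sketch of the same idea as separate, named steps, followed by a whitespace collapse, is shown below. The names `strip_noise`, `URL_RE`, `MENTION_RE`, `NON_ALNUM_RE` and `SPACES_RE` are invented for this example, and allowing `_` inside a handle is an assumption.

```python
import re

URL_RE = re.compile(r"\w+://\S+")              # anything shaped like scheme://...
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")     # @handles (underscore allowed by assumption)
NON_ALNUM_RE = re.compile(r"[^0-9A-Za-z \t]")  # everything except ASCII letters, digits, space, tab
SPACES_RE = re.compile(r"\s+")                 # runs of whitespace

def strip_noise(text):
    # Order matters: URLs and mentions go first, while their punctuation is still intact
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = NON_ALNUM_RE.sub("", text)
    # Collapse the gaps left behind by the removals
    return SPACES_RE.sub(" ", text).strip()

cleaned = strip_noise(
    "hey amazon - my package never arrived "
    "https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first "
    "please fix asap! @amazonhelp"
)
print(cleaned)  # hey amazon my package never arrived please fix asap
```

Splitting the rules makes it easier to see which one removed what, and to switch one off if, for example, mentions should be kept.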


### Remove Stopwords

In [7]:
import nltk.corpus
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = stopwords.words('english')
text = "my package from amazon never arrived fix this asap"
text = " ".join([word for word in text.split() if word not in stop])

print(text)

package amazon never arrived fix asap


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nirajan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
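
The cell above checks each word against a Python list. A minimal sketch of the same filter using a set for faster membership tests follows. So that it runs without any download, it uses a tiny hand-written stopword set; that set and the name `remove_stopwords` are stand-ins for this example, and in practice the set could be built from `stopwords.words('english')`.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer.
STOPWORDS = frozenset({"my", "from", "this", "the", "a", "is"})

def remove_stopwords(text, stopwords=STOPWORDS):
    # Set membership is a hash lookup, whereas a list is scanned on every check
    return " ".join(word for word in text.split() if word not in stopwords)

print(remove_stopwords("my package from amazon never arrived fix this asap"))
# package amazon never arrived fix asap
```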


### Stemming and Lemmatization

Stemming groups words by their root stem. This allows us to recognize that 'jumping', 'jumps' and 'jumped' all share the same root verb (jump) and thus refer to similar problems.

Lemmatization groups words by their dictionary form (lemma) and takes part of speech into account. NLTK's `WordNetLemmatizer` treats each word as a noun unless a `pos` argument is passed, so in the cells below 'jumps' and 'jump' are grouped as 'jump', while every instance of 'jumped' (past tense) and 'jumping' (continuous) is left unchanged and stays in its own group. Passing `pos="v"` reduces all four forms to 'jump'.

In [10]:
import nltk
nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

words = ["jump", "jumped", "jumps", "jumping"]
stemmer = PorterStemmer()
for word in words:
    print(word + " = " + stemmer.stem(word))

[nltk_data] Downloading package wordnet to /home/nirajan/nltk_data...


jump = jump
jumped = jump
jumps = jump
jumping = jump


In [11]:
lemmatizer = WordNetLemmatizer()

for word in words:
    print(word + " = " + lemmatizer.lemmatize(word))

jump = jump
jumped = jumped
jumps = jump
jumping = jumping


In [13]:
sentence = " ".join(words)
sentence

'jump jumped jumps jumping'

In [14]:
print(stemmer.stem(sentence))
print(lemmatizer.lemmatize(sentence))

jump jumped jumps jump
jump jumped jumps jumping
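
Both calls in the cell above receive the whole sentence as one string and treat it as a single word: the stemmer only trims the end of the string and the lemmatizer leaves it unchanged. A minimal sketch of applying the stemmer token by token is shown below, assuming whitespace splitting is good enough for this input. The helper name `stem_sentence` is made up for the example.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_sentence(sentence):
    # Stem each whitespace-separated token on its own, then rejoin
    return " ".join(stemmer.stem(token) for token in sentence.split())

print(stem_sentence("jump jumped jumps jumping"))  # jump jump jump jump
```

The same per-token loop works for the lemmatizer by swapping `stemmer.stem` for `lemmatizer.lemmatize`.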


### Part of Speech Tagging

In [16]:
import nltk 
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("amazon package never arrived fix asap")
print(tokens)
pos = nltk.pos_tag(tokens)
print(pos)

[nltk_data] Downloading package punkt to /home/nirajan/nltk_data...


KeyboardInterrupt: 