# Text Preprocessing

[Text preprocessing for ML & NLP](https://kavita-ganesan.com/text-preprocessing-tutorial/)

[Code Snippets](https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/text-pre-processing)

Bring text into a form that is **predictable** and **analyzable**

TASK = Combination of *approach* and *domain*

*Ex. extract top keywords with tfidf (approach) from Tweets (domain) is a Task*

Different data calls for different preprocessing techniques - there is no single correct recipe

## Types of Preprocessing Techniques

### Lowercasing
One of the simplest and most effective forms of preprocessing - applicable to most NLP problems

`lower_words=[word.lower() for word in texts]`
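
A minimal sketch of why lowercasing helps, using only the standard library. The token list is a made-up sample, not data from the tutorial; it shows casing variants of one word collapsing into a single vocabulary entry.

```python
from collections import Counter

# Hypothetical sample tokens: the same words in different casings
tokens = ["Canada", "CANADA", "canada", "Toronto", "toronto"]

# Without lowercasing, every casing variant is its own vocabulary entry
raw_counts = Counter(tokens)

# With lowercasing, the variants collapse into one entry per word
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts))   # 5
print(lower_counts)      # Counter({'canada': 3, 'toronto': 2})
```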

### Stemming
Process of reducing inflection in words to their root form

Note: word embeddings may be a better approach in some cases

Ex. dances, dancing, danced => danc

Crude process: chops off the ends of words in the hope of correctly transforming each word to its root

Porter's algorithm - most common for stemming

Useful for dealing with sparsity and standardizing vocabulary

Has found success in search applications in particular
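
To make the "chop off the ends" idea concrete, here is a toy suffix stripper. It is not Porter's algorithm, only an illustration; the `SUFFIXES` list, the `naive_stem` name, and the minimum stem length are assumptions chosen for this example.

```python
# Toy suffix stripper: NOT Porter's algorithm, only an illustration of the idea
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["dances", "dancing", "danced"]]
print(stems)  # ['danc', 'danc', 'danc']
```

The result is a shared token, not a real word, which is exactly the trade-off stemming makes.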

### Lemmatization
Similar to stemming - remove inflections and map to root word

Difference - it tries to map each word to its actual dictionary root rather than just chopping off the end

May use a dictionary such as WordNet for mappings

Ex. dances, dancing, danced => dance OR goose, geese => goose

Found no significant improvement over stemming, and it needs more processing power
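
The dictionary-mapping idea can be sketched with a plain lookup table. The `LEMMA_TABLE` below is hand-written for illustration and `lookup_lemma` is a hypothetical helper; a real lemmatizer would draw on a resource such as WordNet and usually needs part-of-speech information.

```python
# Hypothetical hand-written lemma table; a real system would use a resource such as WordNet
LEMMA_TABLE = {
    "dances": "dance",
    "dancing": "dance",
    "danced": "dance",
    "geese": "goose",
}


def lookup_lemma(word):
    """Return the dictionary form if the word is known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)


lemmas = [lookup_lemma(w) for w in ["dances", "dancing", "danced", "goose", "geese"]]
print(lemmas)  # ['dance', 'dance', 'dance', 'goose', 'goose']
```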

### Stop-word Removal
Remove commonly used words in a language (ex. 'a', 'the', 'and') so the analysis can focus on the important words

Commonly applied in search systems, text classification applications, topic modeling, topic extraction

Stop lists can come from pre-established sets or be customized

sklearn - lets you drop words that appear in more than X% of documents (the `max_df` parameter of its text vectorizers)
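
Both ways of getting a stop list can be sketched in plain Python. The corpus, the `STOP_WORDS` set, and the `frequent_terms` helper are all assumptions made for this example; the second option imitates a document-frequency cutoff without using sklearn itself.

```python
# Hypothetical toy corpus, already lowercased; tokenized on whitespace
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
tokenized = [doc.split() for doc in docs]

# Option 1: a pre-established (here hand-written) stop list
STOP_WORDS = {"a", "the", "on", "in", "and"}
filtered = [[t for t in doc if t not in STOP_WORDS] for doc in tokenized]


# Option 2: a corpus-derived list of terms found in more than max_doc_ratio of documents
def frequent_terms(tokenized_docs, max_doc_ratio):
    doc_freq = {}
    for doc in tokenized_docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    n_docs = len(tokenized_docs)
    return {t for t, df in doc_freq.items() if df / n_docs > max_doc_ratio}


print(filtered[0])                     # ['cat', 'sat', 'mat']
print(frequent_terms(tokenized, 0.9))  # {'the'}
```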

### Normalization
Transform text into a canonical (standard) form

Ex. "gooood" and "gud" => "good"

Or map near-identical words to a single form

Ex. "stopwords", "stop-words" and "stop words" => "stopwords"

Important for social media content, text messages, and comments with lots of abbreviations and misspellings

No standard way to normalize text - depends on task

Approaches: dictionary mappings (easiest), statistical machine translation (SMT), spelling correction
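
The dictionary-mapping approach, the easiest of the three, might look like the sketch below. `NORMALIZATION_MAP` and `normalize` are hypothetical names, the map only covers the examples in these notes, and the function also lowercases its input as a simplifying assumption.

```python
import re

# Hypothetical mapping from noisy variants to a canonical form
NORMALIZATION_MAP = {
    "gooood": "good",
    "gud": "good",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}

# One combined pattern; longer variants first so multi-word keys are matched whole
_variants = sorted(NORMALIZATION_MAP, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in _variants) + r")\b")


def normalize(text):
    """Lowercase the text and replace each known variant with its canonical form."""
    return _pattern.sub(lambda m: NORMALIZATION_MAP[m.group(1)], text.lower())


print(normalize("That movie was gooood, really gud"))
print(normalize("Remove stop-words and stop words"))
```

A fixed map like this only handles variants someone has already listed, which is why statistical approaches or spelling correction are used for open-ended noise.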

### Noise Removal

Removing characters, digits, and pieces of text that interfere with analysis

One of the most essential steps, and highly domain dependent

Ex. the stemmer does not reduce noisy tokens such as "..trouble..", "trouble<", or "1.trouble" to the clean stem "troubl"

Must remove the noise first for stemming to work

Typical targets: punctuation, special characters, numbers, HTML formatting, etc.

[Basic Removal](https://github.com/kavgan/nlp-in-practice/blob/master/text-pre-processing/Text%20Preprocessing%20Examples.ipynb)


## MUST DO:
- Noise Removal
- Lowercasing

## Should do:
- Simple Normalization

## Task Dependent:
- advanced normalization
- stop-word removal
- stemming / lemmatization
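
One way to read the checklist above is as a pipeline in which the must-do steps always run and the task-dependent ones are switched on per task. This is a rough sketch under stated assumptions: `preprocess`, its flag, the stop list, and the ASCII-letters-only rule are all choices made for the example, not part of the tutorial.

```python
import re

DEFAULT_STOP_WORDS = frozenset({"a", "an", "the", "and", "of"})  # hypothetical stop list


def preprocess(text, remove_stop_words=False, stop_words=DEFAULT_STOP_WORDS):
    """Always apply noise removal and lowercasing; remove stop words only on request."""
    text = re.sub(r"<[^>]*>", " ", text)      # noise removal: strip HTML-like tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # noise removal: keep ASCII letters and whitespace
    tokens = text.lower().split()             # lowercasing, then whitespace tokenization
    if remove_stop_words:                     # task-dependent step
        tokens = [t for t in tokens if t not in stop_words]
    return tokens


print(preprocess("<p>The 2 Cats and the DOG!</p>"))
# ['the', 'cats', 'and', 'the', 'dog']
print(preprocess("<p>The 2 Cats and the DOG!</p>", remove_stop_words=True))
# ['cats', 'dog']
```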


## Porter Stemmer

```python
import nltk
import pandas as pd
from nltk.stem import PorterStemmer

# Initialize the stemmer
porter_stemmer = PorterStemmer()

# Stem variations of "connect"
words = ["connect", "connected", "connection", "connections", "connects"]
stemmed_words = [porter_stemmer.stem(word=word) for word in words]

stemdf = pd.DataFrame({'original_word': words, 'stemmed_word': stemmed_words})
stemdf
```

| | original_word | stemmed_word |
|---|---|---|
| 0 | connect | connect |
| 1 | connected | connect |
| 2 | connection | connect |
| 3 | connections | connect |
| 4 | connects | connect |


```python
# Stem variations of "trouble"
words = ["trouble", "troubled", "troubles", "troublemsome"]
stemmed_words = [porter_stemmer.stem(word=word) for word in words]

stemdf = pd.DataFrame({'original_word': words, 'stemmed_word': stemmed_words})
stemdf
```

| | original_word | stemmed_word |
|---|---|---|
| 0 | trouble | troubl |
| 1 | troubled | troubl |
| 2 | troubles | troubl |
| 3 | troublemsome | troublemsom |


## Lemmatization

```python
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize variations of "trouble", treating each word as a verb (pos='v')
words = ["trouble", "troubling", "troubled", "troubles"]
lemmatized_words = [lemmatizer.lemmatize(word=word, pos='v') for word in words]
lemmatizeddf = pd.DataFrame({'original_word': words, 'lemmatized_word': lemmatized_words})
lemmatizeddf = lemmatizeddf[['original_word', 'lemmatized_word']]
lemmatizeddf
```



| | original_word | lemmatized_word |
|---|---|---|
| 0 | trouble | trouble |
| 1 | troubling | trouble |
| 2 | troubled | trouble |
| 3 | troubles | trouble |


```python
# Lemmatize variations of "goose", treating each word as a noun (pos='n')
words = ["goose", "geese"]
lemmatized_words = [lemmatizer.lemmatize(word=word, pos='n') for word in words]
lemmatizeddf = pd.DataFrame({'original_word': words, 'lemmatized_word': lemmatized_words})
lemmatizeddf = lemmatizeddf[['original_word', 'lemmatized_word']]
lemmatizeddf
```

| | original_word | lemmatized_word |
|---|---|---|
| 0 | goose | goose |
| 1 | geese | goose |


## Noise Removal

```python
import nltk
import pandas as pd
import re
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem raw words that still contain noise
raw_words = ["..trouble..", "trouble<", "trouble!", "<a>trouble</a>", '1.trouble', "trouble_"]
stemmed_words = [porter_stemmer.stem(word=word) for word in raw_words]
stemdf = pd.DataFrame({'raw_word': raw_words, 'stemmed_word': stemmed_words})
stemdf
```

| | raw_word | stemmed_word |
|---|---|---|
| 0 | `..trouble..` | `..trouble..` |
| 1 | `trouble<` | `trouble<` |
| 2 | `trouble!` | `trouble!` |
| 3 | `<a>trouble</a>` | `<a>trouble</a>` |
| 4 | `1.trouble` | `1.troubl` |
| 5 | `trouble_` | `trouble_` |


```python
def scrub_words(text):
    """Basic cleaning of texts."""

    # Remove HTML markup
    text = re.sub(r"(<.*?>)", "", text)

    # Replace non-word characters and digits with a space
    # (the underscore counts as a word character, so it is kept)
    text = re.sub(r"(\W|\d)", " ", text)

    # Strip leading and trailing whitespace
    text = text.strip()
    return text

# Stem the words after cleaning them
cleaned_words = [scrub_words(w) for w in raw_words]
cleaned_stemmed_words = [porter_stemmer.stem(word=word) for word in cleaned_words]
stemdf = pd.DataFrame({'raw_word': raw_words, 'cleaned_word': cleaned_words, 'stemmed_word': cleaned_stemmed_words})
stemdf = stemdf[['raw_word', 'cleaned_word', 'stemmed_word']]
stemdf
```

| | raw_word | cleaned_word | stemmed_word |
|---|---|---|---|
| 0 | `..trouble..` | trouble | troubl |
| 1 | `trouble<` | trouble | troubl |
| 2 | `trouble!` | trouble | troubl |
| 3 | `<a>trouble</a>` | trouble | troubl |
| 4 | `1.trouble` | trouble | troubl |
| 5 | `trouble_` | `trouble_` | `trouble_` |
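
In the last row above, `trouble_` survives cleaning because the regex class `\W` treats the underscore as a word character. If underscores count as noise for the task at hand, one possible variant is sketched below; `scrub_words_strict` is a hypothetical name, not a function from the tutorial.

```python
import re


def scrub_words_strict(text):
    """Variant of scrub_words that also treats underscores as noise."""
    text = re.sub(r"<.*?>", "", text)     # remove HTML markup
    text = re.sub(r"[\W\d_]", " ", text)  # replace non-word characters, digits, and underscores
    return text.strip()                   # trim leading and trailing whitespace


raw_words = ["..trouble..", "trouble<", "trouble!", "<a>trouble</a>", "1.trouble", "trouble_"]
cleaned = [scrub_words_strict(w) for w in raw_words]
print(cleaned)  # ['trouble', 'trouble', 'trouble', 'trouble', 'trouble', 'trouble']
```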
