Text preprocessing is an important step when using unstructured text documents for any type of data mining, information retrieval, or text analytics. We'll use NLTK and scikit-learn to look at the following concepts:
* Stop Words
* Stemming
* Lemmatization

In [61]:
import pprint, string, nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

## Stop words

Text documents often contain many occurrences of the same word. For example, words such as "a", "the", "of", and "it" are likely to occur very frequently. When classifying a document based on the number of times specific words occur in it, these words can bias the result, especially since they are common in nearly all text documents you might want to classify.

As a result, the concept of stop words was invented. 

https://en.wikipedia.org/wiki/Stop_words

Stop words are the most commonly occurring words, and they are typically removed during tokenization to improve subsequent text analytics.
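To make the effect on word counts concrete, here is a minimal sketch in plain Python. The stop list `STOP_WORDS`, the helper `count_tokens`, and the sample sentence are all hypothetical and chosen for illustration; a real stop list is much longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "the", "of", "it", "is", "on"}

def count_tokens(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and count tokens not in stop_words."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return Counter(t for t in tokens if t and t not in stop_words)

sample = "The cat sat on the mat. It is the mat of the cat."
raw_counts = count_tokens(sample)
filtered_counts = count_tokens(sample, STOP_WORDS)

print(raw_counts.most_common(3))       # "the" dominates the raw counts
print(filtered_counts.most_common(3))  # only content words remain
```

Without filtering, "the" is the most frequent token even though it says nothing about the topic; after filtering, the counts are driven by "cat" and "mat".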

In [62]:
cv = CountVectorizer(analyzer = "word", lowercase = True)

In [63]:
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [64]:
myText = "Hello! My name is Jacky Zhao. I'm an aspiring data scientist. Follow me on twitter @iamdatabear. The is my text analysis practice!"

In [65]:
myText

"Hello! My name is Jacky Zhao. I'm an aspiring data scientist. Follow me on twitter @iamdatabear. The is my text analysis practice!"

In [66]:
# cv1 keeps every word; cv2 drops scikit-learn's built-in English stop words
cv1 = CountVectorizer(lowercase = True)
cv2 = CountVectorizer(stop_words = "english", lowercase = True)

In [67]:
# build_analyzer() returns a callable that preprocesses and tokenizes a string
tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

In [68]:
pp = pprint.PrettyPrinter(indent = 2, depth = 1, width = 80, compact = True)

In [69]:
example1 = tk_func1(myText)
print("Tokenization for 1: \n", example1)

Tokenization for 1: 
 ['hello', 'my', 'name', 'is', 'jacky', 'zhao', 'an', 'aspiring', 'data', 'scientist', 'follow', 'me', 'on', 'twitter', 'iamdatabear', 'the', 'is', 'my', 'text', 'analysis', 'practice']


In [70]:
example2 = tk_func2(myText)
print("Tokenization for 2: \n", example2)

Tokenization for 2: 
 ['hello', 'jacky', 'zhao', 'aspiring', 'data', 'scientist', 'follow', 'twitter', 'iamdatabear', 'text', 'analysis', 'practice']


## Stemming
We have looked at the removal of common, low-information words (stop words). However, an issue still exists because different forms of the same base term are counted separately, for example compute, computer, computed, and computing. Stemming is the process of reducing words to their root term (basic term) so that token frequencies reflect the root token rather than being spread across multiple tokens.

https://en.wikipedia.org/wiki/Stemming
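As a rough illustration of how stemming merges token frequencies, the sketch below strips a handful of suffixes by hand. The suffix list and the `toy_stem` helper are assumptions made up for this example; this is not the Porter algorithm, which applies a much more careful set of rules.

```python
from collections import Counter

# Toy suffix list, checked in order; an assumption for this example only.
SUFFIXES = ("ing", "ed", "er", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["compute", "computer", "computed", "computing", "compute"]
unstemmed_counts = Counter(words)
stemmed_counts = Counter(toy_stem(w) for w in words)

print(unstemmed_counts)  # counts spread over four distinct tokens
print(stemmed_counts)    # all five occurrences fall under one stem
```

Note that the shared stem "comput" is not itself an English word. Stems only need to be consistent, not readable.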

One of the most popular stemmers is the "Porter Stemmer". It was originally published by Martin Porter in 1980. Since then, an improved version was released in 2000. NLTK includes the Porter Stemmer.

In [71]:
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

In [72]:
stemmer = PorterStemmer()

In [73]:
for w in example_words:
    print(stemmer.stem(w))

python
python
python
python
pythonli


In [74]:
newText = "It is important to be very pythonly while you are pythoning with pythong. All pythoners have pythoned poorly at least once!"

In [75]:
newText

'It is important to be very pythonly while you are pythoning with pythong. All pythoners have pythoned poorly at least once!'

In [76]:
# Requires NLTK's Punkt tokenizer data to be downloaded beforehand
tokens = nltk.word_tokenize(newText)

In [77]:
print(tokens)

['It', 'is', 'important', 'to', 'be', 'very', 'pythonly', 'while', 'you', 'are', 'pythoning', 'with', 'pythong', '.', 'All', 'pythoners', 'have', 'pythoned', 'poorly', 'at', 'least', 'once', '!']


In [78]:
# Drop single-character punctuation tokens such as "." and "!"
tokens = [token for token in tokens if token not in string.punctuation]

In [79]:
print(tokens)

['It', 'is', 'important', 'to', 'be', 'very', 'pythonly', 'while', 'you', 'are', 'pythoning', 'with', 'pythong', 'All', 'pythoners', 'have', 'pythoned', 'poorly', 'at', 'least', 'once']


In [80]:
for w in tokens:
    print(stemmer.stem(w))

It
is
import
to
be
veri
pythonli
while
you
are
python
with
pythong
all
python
have
python
poorli
at
least
onc


## Lemmatization
Lemmatization in linguistics is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. To inflect a word is to change its form to express a particular grammatical function or attribute, typically tense, mood, person, number, case, or gender.

In computational linguistics, lemmatization is the algorithmic process of determining the lemma for a given word. The process may involve complex tasks such as understanding the context and determining the parts of speech of a word. 

In many languages, words appear in several inflected forms. For example, the verb "to walk" may appear as "walk", "walked", "walks", and "walking". The base form "walk" is called the lemma of the word. 

Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of its context, and therefore cannot discriminate between words that have different meanings depending on the part of speech. However, stemmers are typically easier to implement and run much faster. The reduced accuracy may not matter for some applications.
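The role of part of speech can be sketched with a small lookup table. The `LEMMAS` table and the `toy_lemmatize` helper below are hypothetical and exist only for this example; a real lemmatizer relies on a full dictionary. NLTK's `WordNetLemmatizer`, imported at the top of this notebook, takes a part-of-speech argument in a similar way through `lemmatize(word, pos=...)`.

```python
# Hypothetical lookup table mapping (word, part of speech) to a lemma.
# A real lemmatizer draws on a full dictionary such as WordNet.
LEMMAS = {
    ("walked", "v"): "walk",
    ("walks", "v"): "walk",
    ("walking", "v"): "walk",
    ("meeting", "v"): "meet",      # "we are meeting tomorrow"
    ("meeting", "n"): "meeting",   # "the meeting ran long"
    ("better", "a"): "good",       # no suffix rule could recover this
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

examples = [("walking", "v"), ("meeting", "v"), ("meeting", "n"), ("better", "a")]
for word, pos in examples:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

The same surface form "meeting" maps to two different lemmas depending on its part of speech, which is exactly the distinction a context-free stemmer cannot make.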