Text preprocessing is an important step when using unstructured text documents for any type of data mining, information retrieval, or text analytics. We'll use NLTK and scikit-learn and look at the following concepts:
* Stop Words
* Stemming
* Lemmatization

In [103]:
import pprint, string, nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

## Stop words

Text documents often contain many occurrences of the same word. For example, words such as "a", "the", "of", and "it" likely occur very frequently. When classifying a document based on how often specific words occur in it, these words can bias the result, especially since they are common in nearly ALL text documents you might want to classify.

As a result, the concept of stop words was invented. 

https://en.wikipedia.org/wiki/Stop_words

Stop words are the most commonly occurring words, and they are typically removed during the tokenization process in order to improve subsequent text analytics efforts.
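
Before turning to the library version, here is a minimal sketch of the idea using only the standard library. The stop word list and the two sample sentences are made up for illustration; real stop word lists are much longer.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "it", "is", "on", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Made-up sample documents.
docs = [
    "The cat sat on the mat and the dog sat on the rug.",
    "The price of the stock is the talk of the town.",
]

for doc in docs:
    tokens = tokenize(doc)
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Without filtering, "the" dominates both documents.
    print("All tokens:        ", Counter(tokens).most_common(2))
    # After filtering, the remaining counts reflect the actual content.
    print("Stop words removed:", Counter(kept).most_common(2))
```

In both sample documents the most frequent raw token is "the", which says nothing about the topic. Once the stop words are dropped, only content words remain.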

In [104]:
cv = CountVectorizer(analyzer = "word", lowercase = True)

In [105]:
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [106]:
myText = "Hello! My name is Jacky Zhao. I'm an aspiring data scientist. Follow me on twitter @iamdatabear. The is my text analysis practice!"

In [107]:
myText

"Hello! My name is Jacky Zhao. I'm an aspiring data scientist. Follow me on twitter @iamdatabear. The is my text analysis practice!"

In [108]:
cv1 = CountVectorizer(lowercase = True)
cv2 = CountVectorizer(stop_words = "english", lowercase = True)

In [109]:
tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

In [110]:
pp = pprint.PrettyPrinter(indent = 2, depth = 1, width = 80, compact = True)

In [111]:
example1 = tk_func1(myText)
print("Tokenization for 1: \n", example1)

Tokenization for 1: 
 ['hello', 'my', 'name', 'is', 'jacky', 'zhao', 'an', 'aspiring', 'data', 'scientist', 'follow', 'me', 'on', 'twitter', 'iamdatabear', 'the', 'is', 'my', 'text', 'analysis', 'practice']


In [112]:
example2 = tk_func2(myText)
print("Tokenization for 2: \n", example2)

Tokenization for 2: 
 ['hello', 'jacky', 'zhao', 'aspiring', 'data', 'scientist', 'follow', 'twitter', 'iamdatabear', 'text', 'analysis', 'practice']


## Stemming
We have looked at the removal of redundant or unimportant words (stop words). However, an issue still exists because of different word forms of the same base term, for example: compute, computer, computed, and computing. Stemming is the process of reducing words to their root term (basic term), so that token frequencies are counted against the root token rather than being spread across multiple tokens.

https://en.wikipedia.org/wiki/Stemming

The most popular stemmer is the "Porter Stemmer". It was originally published by Martin Porter in 1980. Since then, an improved version was released in 2000. NLTK includes the Porter Stemmer.
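
To make the effect on token frequencies concrete, here is a toy suffix stripper. It is NOT the Porter algorithm: the suffix list and the minimum stem length are assumptions chosen just for this example, and the function name `toy_stem` is hypothetical.

```python
from collections import Counter

# Hypothetical suffix list, checked in order. This is a toy, not Porter.
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

words = ["compute", "computer", "computed", "computing"]

# Before stemming, each form is counted separately.
raw_counts = Counter(words)
# After stemming, all four forms are counted under one root token.
stem_counts = Counter(toy_stem(w) for w in words)

print(raw_counts)
print(stem_counts)
```

The four forms collapse into the single root "comput", which is not a real word. That is typical of stemmers, including the Porter Stemmer used below.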

In [113]:
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

In [114]:
stemmer = PorterStemmer()

In [115]:
for w in example_words:
    print(stemmer.stem(w))

python
python
python
python
pythonli


In [116]:
newText = "It is important to be very pythonly while you are pythoning with pythong. All pythoners have pythoned poorly at least once!"

In [117]:
newText

'It is important to be very pythonly while you are pythoning with pythong. All pythoners have pythoned poorly at least once!'

In [118]:
tokens = nltk.word_tokenize(newText)

In [119]:
print(tokens)

['It', 'is', 'important', 'to', 'be', 'very', 'pythonly', 'while', 'you', 'are', 'pythoning', 'with', 'pythong', '.', 'All', 'pythoners', 'have', 'pythoned', 'poorly', 'at', 'least', 'once', '!']


In [120]:
tokens = [token for token in tokens if token not in string.punctuation]

In [121]:
print(tokens)

['It', 'is', 'important', 'to', 'be', 'very', 'pythonly', 'while', 'you', 'are', 'pythoning', 'with', 'pythong', 'All', 'pythoners', 'have', 'pythoned', 'poorly', 'at', 'least', 'once']
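
One caveat about the filter above: `token not in string.punctuation` is a substring test against one long string, so it only removes tokens that happen to appear in that string. Multi-character punctuation tokens such as "..." or "``" slip through. A slightly more robust sketch (the helper name `is_punctuation` and the sample tokens are made up) checks every character instead:

```python
import string

def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Made-up token list containing multi-character punctuation tokens.
sample_tokens = ["Wait", "...", "``", "ok", "!"]

# Substring test, as used in the cell above.
naive = [t for t in sample_tokens if t not in string.punctuation]
# Character-by-character test.
robust = [t for t in sample_tokens if not is_punctuation(t)]

print(naive)
print(robust)
```

For the example sentence in this section both filters give the same result, because its only punctuation tokens are "." and "!".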


In [122]:
for w in tokens:
    print(stemmer.stem(w))

It
is
import
to
be
veri
pythonli
while
you
are
python
with
pythong
all
python
have
python
poorli
at
least
onc


## Lemmatization
Lemmatization in linguistics is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Inflection means changing the form of a word to express a particular grammatical function or attribute, typically tense, mood, person, number, case, and gender.

In computational linguistics, lemmatization is the algorithmic process of determining the lemma for a given word. The process may involve complex tasks such as understanding the context and determining the parts of speech of a word. 

In many languages, words appear in several inflected forms. For example, the verb "to walk" may appear as "walk", "walked", "walks", and "walking". The base form "walk" is called the lemma of the word. 

Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without the knowledge of the context and therefore cannot discriminate between words which have different meanings depending on the part of speech. However, stemmers are typically easier to implement and run much faster. The reduced accuracy may not matter for some applications.
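
The role of the part of speech can be sketched with a hand-made lookup table. The table and the function name `toy_lemmatize` are hypothetical; a real lemmatizer such as NLTK's WordNetLemmatizer consults WordNet rather than a small dictionary like this one.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "v"): "meet",     # "We are meeting"
    ("meeting", "n"): "meeting",  # "We had a meeting"
    ("walked", "v"): "walk",
    ("walks", "v"): "walk",
    ("walking", "v"): "walk",
    ("went", "v"): "go",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

examples = [("meeting", "v"), ("meeting", "n"), ("went", "v"), ("went", "n")]
for word, pos in examples:
    print(word, "as", pos, "becomes", toy_lemmatize(word, pos))
```

The same surface form "meeting" maps to different lemmas depending on whether it is used as a verb or as a noun. A stemmer, which only sees the string, cannot make that distinction.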

In [123]:
lemmatizer = WordNetLemmatizer()

In [124]:
lemmatizer.lemmatize("dogs")

'dog'

### Lemmatization vs Stemming

In [125]:
words = ["going", "gone", "goes", "went"]

In [126]:
print("Stemming: \n")
for w in words:
    print(w,"becomes", stemmer.stem(w))

Stemming: 

going becomes go
gone becomes gone
goes becomes goe
went becomes went


In [127]:
print("Lemmatize without context:\n")
for w in words:
    print(w, "becomes", lemmatizer.lemmatize(w))

Lemmatize without context:

going becomes going
gone becomes gone
goes becomes go
went becomes went


In [128]:
print("Lemmatize WITH context: \n")
for w in words:
    print(w, "becomes", lemmatizer.lemmatize(w, pos = "v"))

Lemmatize WITH context: 

going becomes go
gone becomes go
goes becomes go
went becomes go


We can observe that stemming does not necessarily produce a real word, only a root form (for example, "goe"). The lemmatizer, on the other hand, returns real words, but without a POS tag it treats every word as a noun by default. Hence verb forms such as "going" and "went" are left unchanged.

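The context is provided by the POS tag ("v" for verb). Specifying the POS tag by hand for every word in a text is impractical, so NLTK can generate POS tags automatically with the pos_tag() function. Note that pos_tag() returns Penn Treebank tags (such as "VBZ" or "NN"), while the lemmatizer expects WordNet POS values ("n", "v", "a", "r"), so the tags have to be mapped before they are passed to the lemmatizer.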
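
Because the tagger and the lemmatizer use different tag sets, a small mapping step is needed between them. The helper below is a sketch: the name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (assumed mapping)."""
    if tag.startswith("JJ"):
        return "a"  # adjective: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverb: RB, RBR, RBS
    return "n"      # everything else is treated as a noun

# Tagged tokens copied by hand in the (token, tag) format that pos_tag() returns.
tagged = [("We", "PRP"), ("are", "VBP"), ("meeting", "VBG"), ("simple", "JJ")]

for token, tag in tagged:
    print(token, tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the pos argument of lemmatizer.lemmatize().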

In [129]:
s = "This is a simple sentence. Let's SeE IF iT cAn fiGuRe tHiS cRaZy sEnTenCe ouT!"
s

"This is a simple sentence. Let's SeE IF iT cAn fiGuRe tHiS cRaZy sEnTenCe ouT!"

In [130]:
tokens = word_tokenize(s)

In [132]:
print(tokens)

['This', 'is', 'a', 'simple', 'sentence', '.', 'Let', "'s", 'SeE', 'IF', 'iT', 'cAn', 'fiGuRe', 'tHiS', 'cRaZy', 'sEnTenCe', 'ouT', '!']


In [133]:
tokens_pos = pos_tag(tokens)

In [134]:
print(tokens_pos)

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN'), ('.', '.'), ('Let', 'VB'), ("'s", 'POS'), ('SeE', 'NNP'), ('IF', 'IN'), ('iT', 'JJ'), ('cAn', 'JJ'), ('fiGuRe', 'NN'), ('tHiS', 'NN'), ('cRaZy', 'NN'), ('sEnTenCe', 'VBZ'), ('ouT', 'RP'), ('!', '.')]


In [144]:
word_and_pos = {}

for tp in tokens_pos:
    word_and_pos[tp[0]] = tp[1]

In [145]:
len(word_and_pos)

18

In [146]:
print(word_and_pos)

{'This': 'DT', 'is': 'VBZ', 'a': 'DT', 'simple': 'JJ', 'sentence': 'NN', '.': '.', 'Let': 'VB', "'s": 'POS', 'SeE': 'NNP', 'IF': 'IN', 'iT': 'JJ', 'cAn': 'JJ', 'fiGuRe': 'NN', 'tHiS': 'NN', 'cRaZy': 'NN', 'sEnTenCe': 'VBZ', 'ouT': 'RP', '!': '.'}


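for k, v in word_and_pos.items():
    # Map the Penn Treebank tag to a WordNet POS (a, v, r); default to noun
    wn_pos = {"JJ": "a", "VB": "v", "RB": "r"}.get(v[:2], "n")
    print(k, "becomes", lemmatizer.lemmatize(k, pos = wn_pos))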

Stop words, stemming, and lemmatization are important pre-processing steps in text analytics applications. You can incorporate the off-the-shelf solutions offered by NLTK into your own text analysis applications. Additionally, many code libraries and applications that perform more advanced text analytics incorporate these techniques by default.

In [154]:
stringAction = "We are meeting"
stringNoun = "We had a meeting"

In [155]:
cv = CountVectorizer(stop_words = "english", lowercase = True)
tk_function = cv.build_analyzer()

In [156]:
pp = pprint.PrettyPrinter(indent = 2, depth = 1, width = 80, compact = True)

In [163]:
print("Tokenizations:\n")
print("'{}':".format(stringAction))
pp.pprint(tk_function(stringAction))

Tokenizations:

'We are meeting':
['meeting']


In [167]:
print("Tokenization:\n")
print("'{}':".format(stringNoun))
pp.pprint(tk_function(stringNoun))

Tokenization:

'We had a meeting':
['meeting']


In [169]:
stemmer = PorterStemmer()

In [170]:
print(stemmer.stem(stringAction))

we are meet


In [171]:
print(stemmer.stem(stringNoun))

we had a meet


In [172]:
lem = WordNetLemmatizer()

In [175]:
stringActionTokens = nltk.word_tokenize(stringAction)
sat = [t for t in stringActionTokens if t not in string.punctuation]
print(sat)

['We', 'are', 'meeting']


In [176]:
stringNounTokens = nltk.word_tokenize(stringNoun)
snt = [t for t in stringNounTokens if t not in string.punctuation]
print(snt)

['We', 'had', 'a', 'meeting']


In [177]:
for w in sat:
    print(lem.lemmatize(w))

We
are
meeting


In [178]:
for w in snt:
    print(lem.lemmatize(w))

We
had
a
meeting


In [179]:
print(pos_tag(stringActionTokens))

[('We', 'PRP'), ('are', 'VBP'), ('meeting', 'VBG')]


In [180]:
print(pos_tag(stringNounTokens))

[('We', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('meeting', 'NN')]
