# Word Frequency with Lemmatization
____

[Wikipedia](https://en.wikipedia.org/wiki/Lemmatisation): Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In other words, lemmatization merges the inflected forms of a word into its canonical or "dictionary" form. We will use three different lemmatizers to explore how lemmatization impacts word frequency in a bag of words model.
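
To make the effect on a bag of words concrete, here is a minimal sketch. The token list and the `LEMMA_OF` mapping are made up for illustration; they stand in for a real corpus and a real lemmatizer.

```python
from collections import Counter

# Hypothetical tokens and a hand-written lemma table (illustration only)
tokens = ["play", "plays", "played", "playing", "goose", "geese", "play"]
LEMMA_OF = {"plays": "play", "played": "play", "playing": "play", "geese": "goose"}

# Count the surface forms as they appear
raw_counts = Counter(tokens)

# Count again after mapping each token to its lemma (unknown tokens stay as-is)
lemma_counts = Counter(LEMMA_OF.get(t, t) for t in tokens)

print(raw_counts.most_common())
print(lemma_counts.most_common())
```

Without lemmatization the count for "play" is spread across four surface forms; after the mapping they collapse into a single entry, which is the kind of shift the comparisons in this notebook look for.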

In [1]:
import spacy
import pattern
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from collections import Counter

stop_words = set(stopwords.words('english'))

Initialize a dataset object. 

In [2]:
from tdm_client import Dataset

# The following line returns the default "shakespeare" dataset
# dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c') 

# Custom datasets
# dset = Dataset('0b5d9a99-d008-f971-3a0e-e2ac7e952fbb')
# dset = Dataset('86ff2482-071c-3774-ebba-75704205e8de')
dset = Dataset('3fe1090f-5769-ec6a-89b2-25132cbbf576')

Print the text of the query that built this dataset.

In [3]:
dset.query_text()

'"lynching, communism" from JSTOR from 1900 - 2020'

Find the total number of documents in the dataset using the `len()` function.

In [4]:
len(dset)

7629

In [5]:
from urllib.parse import unquote

# Show the raw (URL-encoded) query, then decode it for readability
print(dset.query())
unquote(dset.query())

q=lynching%2C%20communism&fq=yearPublished%3A%5B1900%20TO%202020%5D&fq=provider%3A(%22jstor%22%20OR%20%22portico%22)


'q=lynching, communism&fq=yearPublished:[1900 TO 2020]&fq=provider:("jstor" OR "portico")'

The next code block prints a set of words and their corresponding lemmas. You can modify the word list to experiment with NLTK's lemmatization function.

In [6]:
words = ['be', 'am', 'are', 'is', 'being', 'was', 'were', 'been', 'am not', "aren't", "isn't", "wasn't", "weren't"]
words += ['play', 'plays', 'playing', 'played', 'goose', 'geese', 'good','better', 'best', 'nice','nicely']
words += ['fly', 'flown', 'flew']
words += ['foot', 'feet', 'drink', 'drank', 'drunk', 'drunks', 'swim', 'swims', 'swam']

wnl = WordNetLemmatizer()  # create the lemmatizer once and reuse it

print("Word".ljust(20), "Lemma (NLTK)")
print("----".ljust(20), "-----")
for word in words:
    print(word.ljust(20), wnl.lemmatize(word))

Word                 Lemma (NLTK)
----                 -----
be                   be
am                   am
are                  are
is                   is
being                being
was                  wa
were                 were
been                 been
am not               am not
aren't               aren't
isn't                isn't
wasn't               wasn't
weren't              weren't
play                 play
plays                play
playing              playing
played               played
goose                goose
geese                goose
good                 good
better               better
best                 best
nice                 nice
nicely               nicely
fly                  fly
flown                flown
flew                 flew
foot                 foot
feet                 foot
drink                drink
drank                drank
drunk                drunk
drunks               drunk
swim                 swim
swims                swim
swam        

As you can see from the results above, NLTK's lemmatizer is limited by default. In our word list, it only lemmatized nouns. Let's lemmatize the same word list again, but this time taking into consideration each word's part of speech (POS). There are several Python libraries that accomplish this task (see [here](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#stanfordcorenlplemmatization) for a discussion of several). We will use the [spaCy](https://spacy.io/) library, which you installed earlier.
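
The sketch below illustrates why part of speech matters. It is a toy lookup table keyed by (word, POS), not how NLTK or spaCy work internally; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the entries are hand-picked for illustration.

```python
# Toy lemma table keyed by (word, part of speech). The entries are
# illustrative only; real lemmatizers use much larger dictionaries and rules.
TOY_LEMMAS = {
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("flew", "VERB"): "fly",
    ("geese", "NOUN"): "goose",
}


def toy_lemmatize(word, pos="NOUN"):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)


# The same surface form can map to different lemmas depending on its POS
for word, pos in [("was", "NOUN"), ("was", "VERB"), ("better", "ADJ"), ("better", "ADV")]:
    print(word.ljust(10), pos.ljust(6), toy_lemmatize(word, pos))
```

A lemmatizer that assumes every word is a noun has no entry to match for "was" or "flew", so it returns them unchanged, much like the default NLTK output above.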

In [7]:
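# Note: spaCy 3+ no longer accepts the 'en' shortcut; there, load a named
# model such as 'en_core_web_sm' instead.
nlp = spacy.load('en', disable=['parser','ner'])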

print("Word".ljust(20), "Lemma (spaCy)")
print("----".ljust(20), "-----------")
for word in words:
    print(word.ljust(20), nlp(word)[0].lemma_)

Word                 Lemma (spaCy)
----                 -----------
be                   be
am                   be
are                  be
is                   be
being                be
was                  be
were                 be
been                 be
am not               be
aren't               be
isn't                be
wasn't               be
weren't              be
play                 play
plays                play
playing              play
played               play
goose                goose
geese                geese
good                 good
better               well
best                 good
nice                 nice
nicely               nicely
fly                  fly
flown                fly
flew                 fly
foot                 foot
feet                 foot
drink                drink
drank                drank
drunk                drunk
drunks               drunk
swim                 swim
swims                swim
swam                 swam


Let's try one more popular lemmatizer: [Pattern](https://www.clips.uantwerpen.be/pages/pattern). We will lemmatize the same set of words. Pattern is a web mining library for Python. In addition to natural language processing, it has tools for data mining, machine learning, and network analysis and visualization.

In [8]:
from pattern.en import lemma

print("Word".ljust(20), "Lemma (Pattern)")
print("----".ljust(20), "---------------")
for word in words:
    print(word.ljust(20), lemma(word))

Word                 Lemma (Pattern)
----                 ---------------
be                   be
am                   be
are                  be
is                   be
being                be
was                  be
were                 be
been                 be
am not               be
aren't               be
isn't                be
wasn't               be
weren't              be
play                 play
plays                play
playing              play
played               play
goose                goose
geese                geese
good                 good
better               better
best                 best
nice                 nice
nicely               nicely
fly                  fly
flown                fly
flew                 fly
foot                 foot
feet                 feet
drink                drink
drank                drink
drunk                drink
drunks               drunk
swim                 swim
swims                swim
swam                 swim


Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:

* lowercases all tokens
* discards all tokens shorter than 4 characters
* discards non-alphabetic tokens (e.g. `--9`)
* removes stopwords using NLTK's stopword list
* optionally lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html) (when `do_lemmatizer` is `True`)

In [9]:
def process_token(token, do_lemmatizer = False):
    token = token.lower()
    if len(token) < 4:
        return
    if not(token.isalpha()):
        return
    if token in stop_words:
        return
    if do_lemmatizer:
        return WordNetLemmatizer().lemmatize(token)
    else:
        return token

Next, we will loop through and count all of the words in the first 25 articles of your dataset. This code block will NOT lemmatize words; the next two blocks will. The objective is to compare word frequencies with and without lemmatizing your dataset.

In [10]:
aggr_doc = []

for n, unigram_count in enumerate(dset.get_features()):
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token, False)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    aggr_doc += this_doc
    if n >= 24:
        break

word_freq = Counter(aggr_doc)
print("** 25 Most Frequent Words with NO Lemmatization **")
for token, count in word_freq.most_common(25):
    print(token.ljust(20), str(count).rjust(4, ' '))

** 25 Most Frequent Words with NO Lemmatization **
american             1793
jewish                958
world                 881
year                  736
would                 721
also                  612
america               576
book                  563
history               487
united                474
time                  450
people                442
university            412
jews                  412
even                  411
york                  411
black                 407
states                397
like                  392
many                  389
press                 389
political             385
first                 384
could                 376
life                  370


Repeat the code block above but with NLTK Lemmatization.

In [11]:
aggr_doc = []

for n, unigram_count in enumerate(dset.get_features()):
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token, True)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    aggr_doc += this_doc
    if n >= 24:
        break

word_freq = Counter(aggr_doc)
print("** 25 Most Frequent Words with NLTK Lemmatization **")
nltk_results = []
for token, count in word_freq.most_common(25):
    mystr = token.ljust(20) + str(count).rjust(4, ' ')
    nltk_results.append(mystr)
    print(mystr)

** 25 Most Frequent Words with NLTK Lemmatization **
american            2130
year                 985
jewish               958
world                894
would                721
state                703
book                 659
also                 612
america              584
time                 547
history              515
black                491
people               484
united               474
life                 437
right                432
university           419
jew                  412
even                 411
york                 411
like                 394
press                393
many                 389
political            385
first                384
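
Both counting loops above expand each count into a list of repeated tokens before counting. As an alternative sketch, the counts can be added to a `Counter` directly, which avoids building large intermediate lists. The `docs` data and the `keep` helper below are hypothetical stand-ins for `dset.get_features()` and `process_token`, so the example runs on its own.

```python
from collections import Counter

# Hypothetical per-document unigram counts (token -> count), shaped like
# the dictionaries the loops above iterate over
docs = [
    {"American": 3, "the": 10, "history": 2, "1900": 1},
    {"american": 1, "History": 4, "jews": 2},
]


def keep(token):
    """Simplified stand-in for process_token: lowercase the token and
    drop it if it is shorter than 4 characters or not alphabetic."""
    token = token.lower()
    if len(token) < 4 or not token.isalpha():
        return None
    return token


word_freq = Counter()
for unigram_count in docs:
    for token, count in unigram_count.items():
        clean = keep(token)
        if clean is not None:
            # Add the whole count at once instead of repeating the token
            word_freq[clean] += count

print(word_freq.most_common())
```

The resulting frequencies are the same as with the list-based approach; only the memory use differs.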


Now, compare NLTK lemmatization with spaCy and Pattern lemmatization. This may take a few minutes to complete!

In [12]:
# First, we need to define token processors like the one for NLTK for spaCy and Pattern
def process_token_spacy(token):
    token = token.lower()
    if len(token) < 4:
        return
    if not(token.isalpha()):
        return
    if token in stop_words:
        return
    return nlp(token)[0].lemma_


def process_token_pattern(token):
    token = token.lower()
    if len(token) < 4:
        return
    if not(token.isalpha()):
        return
    if token in stop_words:
        return
    return lemma(token)
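
Because the same token shows up in many documents, one way to shorten the run is to cache lemma lookups so each distinct token is lemmatized only once. The sketch below uses `functools.lru_cache` from the standard library; `slow_lemma` is a hypothetical placeholder for an expensive call such as `nlp(token)[0].lemma_`, and its suffix stripping is not real lemmatization.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function actually runs


def slow_lemma(token):
    """Hypothetical stand-in for an expensive lemmatizer call."""
    global calls
    calls += 1
    return token.rstrip("s")  # crude placeholder, not real lemmatization


@lru_cache(maxsize=None)
def cached_lemma(token):
    """Return the lemma, computing it only the first time a token is seen."""
    return slow_lemma(token)


tokens = ["states", "state", "states", "books", "states"]
lemmas = [cached_lemma(t) for t in tokens]
print(lemmas, "expensive calls:", calls)
```

Here five tokens trigger only three expensive calls, one per distinct token. Whether this helps in practice depends on how often tokens repeat across the documents being processed.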

In [13]:
if 'nlp' not in locals():
    nlp = spacy.load('en', disable=['parser','ner'])

# Check for the `lemma` function itself; the `pattern` module is already imported above
if 'lemma' not in locals():
    from pattern.en import lemma

aggr_doc_spacy = []
aggr_doc_pattern = []
print("WORKING_", end = '')
for n, unigram_count in enumerate(dset.get_features()):
    if n % 2:
        print("o_m_", end = '')
    else:
        print("_m_o", end = '')
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token_spacy(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
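    aggr_doc_spacy += this_doc
    # Start a fresh list for Pattern. Without this reset, the spaCy tokens are
    # counted again under Pattern, which is why the Pattern column in the
    # recorded output below is roughly double the other two.
    this_doc = []
    for token, count in unigram_count.items():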
        clean_token = process_token_pattern(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    aggr_doc_pattern += this_doc
    if n >= 24:
        break

word_freq_spacy = Counter(aggr_doc_spacy)
spacy_results = []
for token, count in word_freq_spacy.most_common(25):
    mystr = token.ljust(20) + str(count).rjust(4, ' ')
    spacy_results.append(mystr)
    
word_freq_pattern = Counter(aggr_doc_pattern)
pattern_results = []
for token, count in word_freq_pattern.most_common(25):
    mystr = token.ljust(20) + str(count).rjust(4, ' ')
    pattern_results.append(mystr)

print("o_m__DONE!!!", end = '')

WORKING__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__m_oo_m__DONE!!!

Now, let's print the results below:

In [24]:
nltk_flag = 'nltk_results' in locals() and len(nltk_results) > 24

print("********* 25 Most Frequent Words with Pattern, spaCy, and NLTK Lemmatization *********")
print("------------------------".ljust(30), "------------------------".ljust(30), "------------------------".ljust(30))
print("Pattern Results".ljust(30), "spaCy Results".ljust(29), "NLTK Results".ljust(30))
print("------------------------".ljust(30), "------------------------".ljust(30), "------------------------".ljust(30))
for i in range(25):  # print all 25 rows, matching the title
    if nltk_flag:
        print(pattern_results[i] + "   |   " + spacy_results[i]  + "   |   " + nltk_results[i])
    else:
        print(pattern_results[i] + "   |   " + spacy_results[i]  + "   |   " + "N/A")

********* 25 Most Frequent Words with Pattern, spaCy, and NLTK Lemmatization *********
------------------------       ------------------------       ------------------------      
Pattern Results                spaCy Results                 NLTK Results                  
------------------------       ------------------------       ------------------------      
american            3923   |   american            1793   |   american            2130
year                1970   |   year                 985   |   year                 985
jewish              1916   |   jewish               958   |   jewish               958
world               1788   |   world                894   |   world                894
would               1442   |   would                721   |   would                721
state               1432   |   state                716   |   state                703
book                1318   |   book                 659   |   book                 659
make                1230  