# Lemmatization

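Lemmatization reduces a word to its dictionary base form, called the lemma. Stemming is a related but cruder technique that strips suffixes by rule, so a stem is not guaranteed to be a real word.

e.g. running -> run, cats -> cat

This is useful for language processing because the inflected forms of a word collapse into a single token, which makes similar words easier to compare.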
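To make the comparison benefit concrete, here is a minimal sketch that measures token overlap between two sentences before and after normalization. It does not use NLTK: `TOY_LEMMAS` is a hand-written lookup table and `normalize` and `jaccard` are hypothetical helpers, all invented for illustration only.

```python
# Toy lemma table for illustration only; a real pipeline would use a lemmatizer.
TOY_LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "is": "be", "are": "be"}


def normalize(sentence):
    """Lowercase, split on whitespace, and map each token through the toy table."""
    return {TOY_LEMMAS.get(tok, tok) for tok in sentence.lower().split()}


def jaccard(a, b):
    """Fraction of distinct tokens shared by two token sets."""
    return len(a & b) / len(a | b)


s1 = "The cats ran"
s2 = "The cat is running"

# Overlap on raw lowercase tokens: only "the" is shared.
raw_overlap = jaccard(set(s1.lower().split()), set(s2.lower().split()))

# Overlap after mapping tokens to their toy lemmas: "the", "cat", "run" are shared.
lemma_overlap = jaccard(normalize(s1), normalize(s2))

print(round(raw_overlap, 3))  # 0.167
print(lemma_overlap)          # 0.75
```

The two sentences share one token out of six before normalization and three out of four afterwards, which is the effect the lemmatizer is meant to provide on real text.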

# Imports

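The WordNet corpus must be downloaded before NLTK's lemmatizer can be used. The `averaged_perceptron_tagger` model is downloaded as well, because the POS tagger used later depends on it.

Each `nltk.download(...)` call should return **True** if successful.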

In [6]:
import nltk
import pandas as pd

print(nltk.download('wordnet'))
print(nltk.download('averaged_perceptron_tagger'))

True
True


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kevin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kevin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [7]:
lemmatizer = nltk.stem.WordNetLemmatizer()
stemmatizer = nltk.stem.snowball.SnowballStemmer('english')

In [8]:
df = pd.DataFrame()
df['title1_en'] = ['There are a lot of rocks',
                   'I fell into some cacti',
                   'I went for a run',
                   'John and I went to the store today',
                   'I\'m better at running']

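def lemmatize_df_col(text):
    # Lemmatize each whitespace-separated token and rejoin into one string
    return ' '.join([lemmatizer.lemmatize(w) for w in text.split()])

def stemmatize_df_col(text):
    # Stem each whitespace-separated token (the Snowball stemmer also lowercases)
    return ' '.join([stemmatizer.stem(w) for w in text.split()])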

df['title1_en_lemmatized'] = df.title1_en.apply(lemmatize_df_col)
df['title1_en_stemmatized'] = df.title1_en.apply(stemmatize_df_col)

df

| | title1_en | title1_en_lemmatized | title1_en_stemmatized |
|---|---|---|---|
| 0 | There are a lot of rocks | There are a lot of rock | there are a lot of rock |
| 1 | I fell into some cacti | I fell into some cactus | i fell into some cacti |
| 2 | I went for a run | I went for a run | i went for a run |
| 3 | John and I went to the store today | John and I went to the store today | john and i went to the store today |
| 4 | I'm better at running | I'm better at running | i'm better at run |
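Rows 2 to 4 come back from the lemmatizer unchanged. One way to investigate is to ask for a single word's lemma under each WordNet part of speech. The sketch below assumes only that the object passed in exposes `lemmatize(word, pos=...)`, as `WordNetLemmatizer` does; `lemma_by_pos` and `WORDNET_POS` are hypothetical names introduced for this example.

```python
# WordNet part-of-speech codes accepted by WordNetLemmatizer.lemmatize
WORDNET_POS = {"noun": "n", "verb": "v", "adjective": "a", "adverb": "r"}


def lemma_by_pos(lemmatizer, word):
    """Return {pos_name: lemma} for `word` under each WordNet part of speech."""
    return {name: lemmatizer.lemmatize(word, pos=code)
            for name, code in WORDNET_POS.items()}
```

Calling `lemma_by_pos(lemmatizer, 'running')` or `lemma_by_pos(lemmatizer, 'went')` with the `lemmatizer` created above should show whether the verb reading gives a different lemma from the noun reading.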


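# Finding the POS tag

NLTK also includes an automatic part-of-speech (POS) tagger. Its tags can be passed to the lemmatizer through the `pos` argument, which defaults to noun (`'n'`).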

In [9]:
nltk.tag.pos_tag(['Hello','how','are','you','doing'])

[('Hello', 'NNP'),
 ('how', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('doing', 'VBG')]

In [10]:
def lemmatize_df_col_tagged(text, lemmatizer):
    # Tag each token, then lemmatize it with the matching WordNet POS code
    words = []
    for word, tag in nltk.tag.pos_tag(text.split()):
        if tag.startswith('NN'):    # nouns
            word = lemmatizer.lemmatize(word, pos='n')
        elif tag.startswith('VB'):  # verbs
            word = lemmatizer.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):  # adjectives
            word = lemmatizer.lemmatize(word, pos='a')
        # Tokens with any other tag are kept as they are
        words.append(word)
    return ' '.join(words)

df['title1_en_lemmatized_tagged'] = df.title1_en.apply(lemmatize_df_col_tagged, lemmatizer=lemmatizer)
df

| | title1_en | title1_en_lemmatized | title1_en_stemmatized | title1_en_lemmatized_tagged |
|---|---|---|---|---|
| 0 | There are a lot of rocks | There are a lot of rock | there are a lot of rock | There be a lot of rock |
| 1 | I fell into some cacti | I fell into some cactus | i fell into some cacti | I fell into some cactus |
| 2 | I went for a run | I went for a run | i went for a run | I go for a run |
| 3 | John and I went to the store today | John and I went to the store today | john and i went to the store today | John and I go to the store today |
| 4 | I'm better at running | I'm better at running | i'm better at run | I'm good at run |
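The chain of `startswith` checks in `lemmatize_df_col_tagged` can also be written as a lookup from Penn Treebank tag prefixes to WordNet POS codes. This is a sketch, not the notebook's own code: `PENN_PREFIX_TO_WORDNET`, `penn_to_wordnet`, and `lemmatize_tagged` are hypothetical names, and the adverb entry (`'RB'` to `'r'`) is an addition that the original function does not have. The function takes already-tagged `(word, tag)` pairs, such as the output of `nltk.tag.pos_tag`.

```python
# Penn Treebank tag prefix -> WordNet POS code.
# The 'RB' (adverb) entry is an assumption added here; the notebook's
# function only handles nouns, verbs, and adjectives.
PENN_PREFIX_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}


def penn_to_wordnet(tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None if unmapped."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:2])


def lemmatize_tagged(tagged_words, lemmatizer):
    """Lemmatize (word, tag) pairs, leaving words with unmapped tags unchanged."""
    words = []
    for word, tag in tagged_words:
        pos = penn_to_wordnet(tag)
        words.append(lemmatizer.lemmatize(word, pos=pos) if pos else word)
    return " ".join(words)
```

Keeping the mapping in one dictionary makes it easier to see which tags are handled and to extend the list without adding another branch.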
