# Stemming TED Talks

29,000 features still seems like a lot. One possible way to reduce the feature set is to stem (or lemmatize) the words involved. (For more on the difference between the two, see the note below from the Stanford NLP group[1].)

But reducing word forms to a base form throws away information. Is there a way to retain it, e.g., whether a word form occurred more often as a noun, verb, adjective, or adverb? There is: run part-of-speech tagging first, then grab the words from the PoS tuples. The more we consider this possibility, however, the more it seems that any interest in a particular word would be just as well served by searching back through the corpus, using something like KWiC, to explore nuances.
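To make the KWiC (keyword in context) idea concrete, here is a minimal sketch using only the standard library. The `kwic` function, its `width` parameter, and the sample sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def kwic(text, keyword, width=30):
    """Return (left, match, right) context triples for each whole-word hit."""
    hits = []
    pattern = rf"\b{re.escape(keyword)}\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

# Hypothetical sample text; in the notebook this would be one of the talks
sample = "We work hard. The work matters, and it works because people worked on it."
for left, word, right in kwic(sample, "work", width=15):
    print(f"{left:>15} [{word}] {right}")
```

Because the pattern matches whole words only, "works" and "worked" are not returned for "work". That is the kind of nuance a stemmed feature would hide and a concordance search can recover.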

1. "Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. … The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma." [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
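The distinction in the note above can be illustrated with a deliberately crude toy. Neither function below is the Porter stemmer or the WordNet lemmatizer: `toy_stem`, `toy_lemma`, the suffix list, and the lookup table are all made up for illustration.

```python
# Toy suffix list, checked in order; a real stemmer uses many more rules
SUFFIXES = ("ation", "ing", "es", "ed", "s")

# Toy vocabulary lookup standing in for a real morphological dictionary
LEMMAS = {
    "organizes": "organize",
    "organizing": "organize",
    "was": "be",
    "mice": "mouse",
}

def toy_stem(word, min_stem=3):
    """Chop off the first matching suffix, leaving at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Return the dictionary form if the word is in the lookup, else the word itself."""
    return LEMMAS.get(word, word)

for w in ("organizes", "organizing", "organization", "was", "mice"):
    print(f"{w:<14} stem={toy_stem(w):<10} lemma={toy_lemma(w)}")
```

The toy stemmer collapses the verb forms and the derived noun "organization" into one non-word stem, while the lookup keeps "organization" separate and handles irregular forms such as "mice" that suffix stripping cannot.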

So, the first task in this notebook is to stem/lemmatize the corpus to see what reduction in features can be achieved.

We will begin with one block that loads all the usual suspects: modules, data, etc.

In [6]:
import time

In [3]:
# IMPORTS

import pandas as pd, re, csv, nltk, string
from sklearn.feature_extraction.text import CountVectorizer

# LOCAL FUNCTION to remove parentheticals (See Terms-O1b)

def remove_parentheticals(text):
    # Lowercase once, then replace each stage direction with a space
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text

# Raw strings keep the backslash escapes intact for the regex engine
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]

# DATA

# Load the dataset
df = pd.read_csv('../output/TEDall_speakers.csv')

# Grab the texts
texts = df.text.tolist()

For the sake of comparison, here's our baseline document-term matrix (DTM) again:

In [14]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer(   lowercase = True,
                                preprocessor = remove_parentheticals, 
                                min_df = 2 )

tic = time.perf_counter()
X_0 = vectorizer.fit_transform(texts)
toc = time.perf_counter()
# see how many features we have
t0 = toc - tic
print(X_0.shape, t0)
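The tic/toc pattern recurs in every cell that follows. One possible way to factor it out is a small helper like the one below. The name `timed` is hypothetical and not used elsewhere in this notebook, and the workload here is a stand-in so the sketch runs on its own.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    tic = time.perf_counter()
    result = func(*args, **kwargs)
    toc = time.perf_counter()
    return result, toc - tic

# Stand-in workload; in this notebook the call would look something like
#   X_0, t0 = timed(vectorizer.fit_transform, texts)
total, elapsed = timed(sum, range(1_000_000))
print(total, f"{elapsed:.4f}s")
```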

In [9]:
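stemmer = nltk.PorterStemmer()
# Default analyzer: lowercases and tokenizes, with no custom preprocessor
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # Tokenize with the default analyzer, then stem each token
    return (stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = CountVectorizer(  analyzer=stemmed_words,
                                    min_df = 2 )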

tic = time.perf_counter()
X_1 = stem_vectorizer.fit_transform(texts)
toc = time.perf_counter()
t1 = toc - tic
# see how many features we have
print(X_1.shape, t1)

(1747, 18243) 42.66327483299983


In [11]:
# Note: scikit-learn ignores lowercase and preprocessor when analyzer is a callable
stem_vectorizer = CountVectorizer(  analyzer=stemmed_words,
                                    min_df = 2,
                                    lowercase = True,
                                    preprocessor = remove_parentheticals )

tic = time.perf_counter()
X2 = stem_vectorizer.fit_transform(texts)
toc = time.perf_counter()
t2 = toc - tic


In [12]:
print(X2.shape, t2)

(1747, 18243) 42.99891504199991
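The two stemmed runs report identical shapes. That is consistent with scikit-learn's documented behavior: `preprocessor` only applies when `analyzer` is not callable, so the parenthetical removal never runs in the second stemmed vectorizer. A minimal sketch of one way around this is to build the base analyzer from a vectorizer that already has the preprocessor set. The two-document corpus, `strip_stage_directions`, `identity`, and `make_vectorizer` are hypothetical stand-ins (for the TED texts, `remove_parentheticals`, and `stemmer.stem` respectively), chosen so the sketch runs without NLTK.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def strip_stage_directions(text):
    # Stand-in for remove_parentheticals: lowercase, drop two stage directions
    return re.sub(r"\((laughter|applause)\)", " ", text.lower())

def identity(word):
    # Stand-in for stemmer.stem; swap in a real stemmer here
    return word

docs = ["We laughed (Laughter) and laughed",
        "They clapped (Applause) and laughed"]

# Base analyzer without the preprocessor (as in the cells above)
plain_analyzer = CountVectorizer().build_analyzer()
# Base analyzer that applies the preprocessor before tokenizing
clean_analyzer = CountVectorizer(
    preprocessor=strip_stage_directions).build_analyzer()

def make_vectorizer(base_analyzer, stem=identity):
    # The callable analyzer does all the work; any preprocessor passed to
    # this outer CountVectorizer would be ignored
    return CountVectorizer(
        analyzer=lambda doc: [stem(w) for w in base_analyzer(doc)])

vocab_plain = sorted(make_vectorizer(plain_analyzer).fit(docs).vocabulary_)
vocab_clean = sorted(make_vectorizer(clean_analyzer).fit(docs).vocabulary_)
print(vocab_plain)
print(vocab_clean)
```

With the plain analyzer, "laughter" and "applause" survive as features; with the preprocessor baked into the base analyzer, they do not. Whether this changes the feature count on the full corpus depends on whether those words also occur in the talks themselves.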


In [13]:
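wnl = nltk.WordNetLemmatizer()
analyzer = CountVectorizer().build_analyzer()

def lemmed_words(doc):
    # lemmatize() defaults to pos='n', so every token is treated as a noun
    return (wnl.lemmatize(w) for w in analyzer(doc))

# Note: scikit-learn ignores lowercase and preprocessor when analyzer is a callable
lem_vectorizer = CountVectorizer(  analyzer = lemmed_words,
                                    min_df = 2,
                                    lowercase = True,
                                    preprocessor = remove_parentheticals )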

tic = time.perf_counter()
X3 = lem_vectorizer.fit_transform(texts)
toc = time.perf_counter()
t3 = toc - tic

# see how many features we have
print(X3.shape, t3)

(1747, 25312) 11.054988249999951


In [26]:
print (f""" 
Shapes and Times \n
Base:           {X_0.shape} / {t0:.2f} \n
Stemmed:        {X_1.shape} / {t1:.2f} \n
Stemmed + ():   {X2.shape} / {t2:.2f} \n
Lemmatized:     {X3.shape} / {t3:.2f} \n
""")

 
Shapes and Times 

Base:           (1747, 29340) / 3.35 

Stemmed:        (1747, 18243) / 42.66 

Stemmed + ():   (1747, 18243) / 43.00 

Lemmatized:     (1747, 25312) / 11.05 
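To put a number on the reduction, the sketch below computes the percentage drop in feature count relative to the baseline, using the shapes reported above. The `reduction` helper and the `feature_counts` dictionary are introduced here for illustration only.

```python
# Feature counts copied from the shapes reported above
feature_counts = {"Base": 29340, "Stemmed": 18243, "Lemmatized": 25312}

def reduction(baseline, reduced):
    """Percentage of features removed relative to the baseline."""
    return 100 * (baseline - reduced) / baseline

base = feature_counts["Base"]
for label, n in feature_counts.items():
    print(f"{label:<12}{n:>7}  {reduction(base, n):5.1f}% fewer features")
```

On these figures, stemming removes roughly 38% of the features and noun-only lemmatization roughly 14%, at about 13 and 3 times the baseline run time respectively.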


