**IST664/CIS668: Week 3 Lab: Analyzing Syntax**

In the realm of natural language processing, syntax is one level up from morphology. Whereas morphology pertains to the components of words, syntax examines how words are sequenced together.  

Although contemporary deep learning methods tend to hide a lot of these details behind the veil of the neural network, syntactical analysis remains a key part of effective NLP solutions - which is why it is such a core process in spaCy. Your ability to create, debug, and successfully modify a natural language system will be enhanced by deepening your understanding of how we use code to assign meaning to various parts of speech as well as the ways that sentences fit together.

This lab begins by reading a complete text from the Project Gutenberg website. We are downloading Dostoevsky's Crime and Punishment, as plain text, in a translation by Constance Garnett. 

In [1]:
import nltk # We'll be using lots of facilities from this
nltk.download('punkt') # Download, as not included in basic colab

# text from online gutenberg
from urllib import request # We will need this to read from the URL

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw), len(raw)
#1176812 characters present in the text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


(str, 1176812)

In [2]:
# Over one million characters. Let's look at the first few.
raw[:178]

'\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world'

In [3]:
# We'll begin our processing with tokenization 
crimetokens = nltk.word_tokenize(raw)
crimetokens[112:122]


['Release', 'Date', ':', 'March', ',', '2001', '[', 'eBook', '#', '2554']

In [4]:
# Let's keep track of how many unique tokens we're starting with.
len(set(crimetokens))
#11516 unique tokens

11516

In [5]:
# Let's normalize to lower case to reduce the number of unique tokens
crimetokens = [w.lower() for w in crimetokens]
crimetokens[112:122]

['release', 'date', ':', 'march', ',', '2001', '[', 'ebook', '#', '2554']

We discussed stemmers in class last week. Let's compare three stemmers provided by NLTK.

In [6]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.SnowballStemmer('english')
type(porter), type(lancaster), type(snowball)


(nltk.stem.porter.PorterStemmer,
 nltk.stem.lancaster.LancasterStemmer,
 nltk.stem.snowball.SnowballStemmer)

Computer scientist Martin Porter wrote and published the Porter Stemmer more than 40 years ago. The Porter stemmer is a rule-based algorithm (i.e., no dictionary) for "suffix stripping." The algorithm was subsequently implemented by other coders in more than two dozen different computer languages. Eventually, Porter got tired of hearing about the implementation errors in some of these other versions, so he rewrote the algorithm in C about 20 years ago. He also created a programming framework, called "Snowball" that can be used to create additional stemmers including the third one above, which is also known as the Porter2 stemmer. 

The Lancaster stemmer, also known as the Paice/Husk stemmer, was created at Lancaster University and has the advantage that the "rule book" it uses is external to the algorithm itself and can therefore be adapted to languages other than English.

In [7]:
# From a data reduction standpoint, which stemmer results in the greatest
# reduction in the number of unique tokens? Remember that we started with 11516
# unique tokens before lowercasing.
crimePstem = [porter.stem(t) for t in crimetokens]
crimeLstem = [lancaster.stem(t) for t in crimetokens]
crimeSstem = [snowball.stem(t) for t in crimetokens]

len(set(crimePstem)), len(set(crimeLstem)), len(set(crimeSstem))

#Lancaster stemmer gives the greatest reduction in the number of unique tokens
#lowest - 6399

(7363, 6399, 7174)

In [8]:
# What proportion of reduction have we achieved with the Porter stemmer?
len(set(crimePstem))/len(set(crimetokens))
# Porter stemmer: stemmed vocabulary is 68.9% of the original size (about a 31.1% reduction)

0.6890323788134007
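The number printed above is the size of the stemmed vocabulary relative to the original one, so the reduction itself is one minus that ratio. Here is a minimal sketch of a helper that reports both. The function name `vocab_reduction` and the toy token lists are illustrative assumptions, not part of the lab code.

```python
def vocab_reduction(original_tokens, reduced_tokens):
    """Return (size ratio, fractional reduction) of the unique-token counts."""
    n_original = len(set(original_tokens))
    n_reduced = len(set(reduced_tokens))
    ratio = n_reduced / n_original
    return ratio, 1 - ratio

# Toy data (made up): five distinct word forms collapse to two distinct stems.
toy_tokens = ["walk", "walks", "walked", "talk", "talks"]
toy_stems = ["walk", "walk", "walk", "talk", "talk"]

ratio, reduction = vocab_reduction(toy_tokens, toy_stems)
print(f"ratio = {ratio:.2f}, reduction = {reduction:.2f}")
```

Read this way, a ratio of roughly 0.69 corresponds to a reduction of roughly 31%.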

In [9]:
# The Porter stemmer produced a set roughly 69% of the size of the original
# vocabulary. Now calculate and show the percent reduction in the number of 
# tokens for the other two stemmers.

# 3.1: Compute and display reduction ratio for the Lancaster stemmer
len(set(crimeLstem))/len(set(crimetokens))
# Lancaster stemmer: stemmed vocabulary is 59.9% of the original size (about a 40.1% reduction)

0.5988208871420551

In [10]:
#
# 3.2: Compute and display reduction ratio for the Snowball stemmer
len(set(crimeSstem))/len(set(crimetokens))
# Snowball stemmer: stemmed vocabulary is 67.1% of the original size (about a 32.9% reduction)

0.6713456859442261

In [11]:
# Let's compare the highest frequency tokens from the three stemmers
from tabulate import tabulate
from nltk import FreqDist
pdist = FreqDist(crimePstem)
ldist = FreqDist(crimeLstem)
sdist = FreqDist(crimeSstem)

# zip() is a cool built-in function for zipping together two or
# more lists/tuples into a single iterator.
compare = zip(pdist.most_common(20),
              ldist.most_common(20),
              sdist.most_common(20))

print(tabulate(compare, headers=["Porter", "Lancaster", "Snowball"]))

Porter          Lancaster       Snowball
--------------  --------------  --------------
(',', 16177)    (',', 16177)    (',', 16177)
('.', 8908)     ('.', 8908)     ('.', 8908)
('the', 8006)   ('the', 8038)   ('the', 8006)
('and', 7031)   ('and', 7031)   ('and', 7031)
('to', 5350)    ('to', 5350)    ('to', 5350)
('he', 4769)    ('he', 4769)    ('he', 4769)
('a', 4651)     ('a', 4651)     ('a', 4651)
('i', 4397)     ('i', 4397)     ('i', 4397)
('you', 4086)   ('you', 4094)   ('you', 4086)
('’', 4039)     ('’', 4039)     ('’', 4039)
('“', 3980)     ('“', 3980)     ('“', 3980)
('”', 3929)     ('”', 3929)     ('”', 3929)
('of', 3927)    ('of', 3927)    ('of', 3927)
('it', 3474)    ('it', 3474)    ('it', 3474)
('that', 3282)  ('that', 3282)  ('that', 3282)
('in', 3248)    ('in', 3261)    ('in', 3248)
('wa', 2826)    ('was', 2826)   ('was', 2826)
('!', 2364)     ('on', 2606)    ('!', 2364)
('?', 2275)     ('!', 2364)     ('?', 2275)
('hi', 2114)    ('?', 2275)     ('his', 2113)


#Comments from discussion
The Lancaster stemmer appears to reduce other words (for example "these") to "the", which would explain why its count for "the" is higher than for the other two stemmers. It is more aggressive overall, collapsing more word forms into the same stem, so it yields fewer unique tokens.
The Porter stemmer strips a trailing -s as if it marked a plural, so "was" becomes "wa" and "his" becomes "hi". It does not recognize that "was" is a past-tense verb rather than a plural noun.
Snowball appears to be a better stemmer than the other two. Even though it does not reduce the number of unique tokens as much as Lancaster, it leaves common words such as "was" and "his" intact.

#Discuss with Your Partner

There's a lot going on in the display just above. All three stemmers agree on commas, periods, the word "and," and the close double quote. Can you think of some reasons why the word "the" has a different count for the Lancaster stemmer? Discuss this point with your lab partner.

What's going on near the end of the list where we have the following output:

(('wa', 2826), ('was', 2826), ('was', 2826))

The counts match, but what has the Porter stemmer done differently? Even based on the small amount of evidence above, what conclusions can you draw about the advantages and disadvantages of various stemmers?
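One way to see where "wa" comes from is to apply a few suffix rules by hand. The sketch below is a toy, loosely modeled on the plural-stripping rules described in Porter's original paper (SSES becomes SS, IES becomes I, SS stays, a lone final S is dropped). It is not the full algorithm and not NLTK's implementation, and the function name `strip_plural` is made up for illustration.

```python
def strip_plural(word):
    """Apply four plural-style suffix rules, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # glass stays glass
    if word.endswith("s"):
        return word[:-1]      # cats -> cat, but also was -> wa
    return word

for w in ["cats", "caresses", "ponies", "glass", "was", "his"]:
    print(w, "->", strip_plural(w))
```

The rules have no notion of part of speech, so a verb like "was" and a pronoun like "his" are treated exactly like plural nouns.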



Because stemming is quite variable in the results it produces, some NLP processing methods use lemmatization instead. A lemma is the root form of a word. In English one of the most striking sets of lemmas comes from the verb "to be." The words "am," "is," "are," and "be," despite their unique spellings and pronunciations, all lemmatize to "be." Let's try this with the WordNet Lemmatizer:


In [24]:
nltk.download('wordnet')
nltk.download('omw-1.4')
wnl = nltk.WordNetLemmatizer()
type(wnl)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


nltk.stem.wordnet.WordNetLemmatizer

In [25]:
wnl.lemmatize("am", pos ="v")

'be'

In [31]:
# Now lemmatize "is" and "are" using wnl.lemmatize(). 
# Also test what happens if you leave out the pos argument? 
# Write a comment describing what the pos argument does.

# 3.3: Lemmatize is, using pos="v"
print("is : ",wnl.lemmatize("is", pos="v"))

#is : be

is :  be


In [30]:
#
# 3.4: Lemmatize are, using pos="v"
print("are : ",wnl.lemmatize("are", pos="v"))
#are : be

are :  be


In [32]:
#
# 3.5: Test the lemmatize method without the pos argument
#
print("is : ",wnl.lemmatize("is"))

# Without pos="v", "is" comes back unchanged: the default pos is "n" (noun)

is :  is


In [33]:
# Let's lemmatize Crime and Punishment to see what we get:
crimelemma = [wnl.lemmatize(t) for t in crimetokens]

len(set(crimelemma))
#Unique lemmatized tokens :9793

9793

In [34]:
# What proportion of reduction have we achieved with the lemmatizer?
len(set(crimelemma))/len(set(crimetokens))

# The lemmatizer achieves less reduction than any of the three stemmers
#0.9164327157027887

0.9164327157027887

At about 92%, the WordNet lemmatizer does not achieve as much data reduction as the stemming methods. WordNet really seems to require that the user specify the part of speech. Without that specification, there are likely to be errors.
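To make the role of the part of speech concrete, here is a toy dictionary-based lemmatizer. It is only a sketch: the lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical, and the real WordNet lemmatizer is far richer. The one behavior it copies is that the part of speech defaults to noun.

```python
# Hypothetical lookup table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("am", "v"): "be",
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("is", pos="v"))  # found under the verb entry
print(toy_lemmatize("is"))           # looked up as a noun, so unchanged
print(toy_lemmatize("mice"))         # nouns work without an explicit pos
```

When every token is looked up as a noun, verb forms pass through untouched, which is consistent with the small reduction seen above.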

Switching gears for a moment, one way of capturing more contextual information in our token lists is to analyze tokens in sets of two or more. Two tokens together is called a bigram, three is called a trigram, and more generally any number "n" is called an ngram. NLTK and other language packages contain numerous tools for working with bigrams. Let's look at the output of the NLTK ngrams() function:

In [35]:
# Rather than a whole book, let's begin by working with one sentence:
sentence = "thomas jefferson began building monticello at the age of twenty-six."
len(sentence)

68

In [36]:
from nltk.util import ngrams
import re # Regular expressions library
pattern = re.compile('[a-z]+')
tokens = pattern.findall(sentence)
list(ngrams(tokens, 2))

[('thomas', 'jefferson'),
 ('jefferson', 'began'),
 ('began', 'building'),
 ('building', 'monticello'),
 ('monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', 'twenty'),
 ('twenty', 'six')]

In [37]:
# We can easily repeat the process for trigrams:
list(ngrams(tokens, 3))

[('thomas', 'jefferson', 'began'),
 ('jefferson', 'began', 'building'),
 ('began', 'building', 'monticello'),
 ('building', 'monticello', 'at'),
 ('monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', 'twenty'),
 ('of', 'twenty', 'six')]

It should be pretty clear what's happening in both of the previous cells. Also, it may not seem especially useful, but research has shown that there is  value in understanding the context around words - i.e., the other words that occur nearby. In fact, this was an idea called "the distributional hypothesis" imagined by linguist Zellig Harris, that words with similar meanings tend to occur in similar contexts.
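If the sliding-window idea is not obvious, it can be reproduced in a line of plain Python. This is a sketch for illustration only (the helper name `simple_ngrams` is made up); in the lab itself we keep using `nltk.util.ngrams`.

```python
import re

def simple_ngrams(tokens, n):
    """Zip the token list against shifted copies of itself to get n-grams."""
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "thomas jefferson began building monticello at the age of twenty-six."
tokens = re.findall("[a-z]+", sentence)

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(len(tokens), len(bigrams), len(trigrams))
print(bigrams[0], trigrams[-1])
```

A list of k tokens always yields k - n + 1 n-grams, which is why the trigram list is one item shorter than the bigram list.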

In [38]:
# We can create ngram token strings from these lists.
bigrams = [" ".join(w) for w in ngrams(tokens, 2)]
print(bigrams) # If we were going to use CountVectorizer, this would be the input

['thomas jefferson', 'jefferson began', 'began building', 'building monticello', 'monticello at', 'at the', 'the age', 'age of', 'of twenty', 'twenty six']


In [40]:
#
# 3.5: Build trigram tokens from the Thomas Jefferson tokens.
# Make sure the trigram tokens have spaces between the component words as shown
# in the bigram example in the code block just above.
#
trigrams = [" ".join(w) for w in ngrams(tokens, 3)]
print(trigrams) 

['thomas jefferson began', 'jefferson began building', 'began building monticello', 'building monticello at', 'monticello at the', 'at the age', 'the age of', 'age of twenty', 'of twenty six']


In [41]:
# Given only one sentence, that result is not very exciting, but what if
# we did a whole book?
nltk.download('stopwords')
nltk_stops = nltk.corpus.stopwords.words('english')

crimenopunct = [w for w in crimetokens if w.isalnum()]
crimenostops = [w for w in crimenopunct if w not in nltk_stops]
crimebigrams = [" ".join(w) for w in ngrams(crimenostops, 2)]

fdist = FreqDist(crimebigrams) # This creates a list of frequencies for bigrams
len(fdist) # This is the total number of unique bigrams

#75363 number of unique bigrams in the book

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


75363

In [42]:
fdist.most_common(10) # What do you notice about the most frequent bigrams

#Most frequent bigrams are Names of the characters in book. 
#Most likely first and last name appears together

[('katerina ivanovna', 215),
 ('pyotr petrovitch', 172),
 ('pulcheria alexandrovna', 123),
 ('avdotya romanovna', 112),
 ('old woman', 91),
 ('rodion romanovitch', 82),
 ('porfiry petrovitch', 81),
 ('marfa petrovna', 76),
 ('sofya semyonovna', 71),
 ('amalia ivanovna', 54)]
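The same filter-then-pair pipeline can be traced on a handful of tokens. The sketch below uses only the standard library; the token list and the tiny stop list `TOY_STOPS` are made up for illustration and are much smaller than NLTK's.

```python
from collections import Counter

TOY_STOPS = {"the", "and", "to", "a", "of"}  # hypothetical stop list

tokens = ["the", "old", "woman", "and", "the", "old", "woman",
          "went", "to", "the", "old", "house"]

# Keep alphanumeric tokens that are not stop words, then pair neighbors.
content = [t for t in tokens if t.isalnum() and t not in TOY_STOPS]
bigram_counts = Counter(" ".join(pair) for pair in zip(content, content[1:]))
print(bigram_counts.most_common(1))
print(bigram_counts["woman old"])
```

Notice that "woman old" is counted even though those two words were never adjacent in the original sequence: removing stop words before pairing can create bigrams that span the removed words.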

**Part Two**

At this point we know that the WordNet lemmatizer does not work very well unless we already know the part of speech of the word we are trying to lemmatize. This is a significant limitation, which is also reflected in the fact that we only achieved an 8% reduction in the number of unique tokens using this lemmatizer.

Given the limitations of stemmers and simple lemmatizers, it is time to take a more serious look at part of speech tagging. For this, we are going to graduate from NLTK to our first effort with spaCy. Whereas NLTK was designed for teaching and research, spaCy was architected so that it can serve as the basis of a production-grade NLP pipeline. Unlike some other NLP toolkits (e.g., Stanford CoreNLP), spaCy was written in Python and Cython, so it is convenient for use directly from the Jupyter notebook environment. We will do a thorough examination of many of spaCy's capabilities in the Week 5 lab. For now, we will just try out a few basic techniques.


In [43]:
import spacy
nlp = spacy.load('en_core_web_sm') # sm means small - some pipeline capabilities not loaded
type(nlp) # This is our pipeline: an instantiated class that we can use to process any string
# You can ignore this warning if you see it: "UserWarning: Can't initialize NVML"

spacy.lang.en.English

In [44]:
# Let's process a small example from NLPIA first
sentence = "The faster Harry got to the store, the faster Harry would get home."
spsent = nlp(sentence)
type(spsent), len(spsent)

(spacy.tokens.doc.Doc, 15)

In [45]:
# But this is no ordinary set of string tokens:
# What bound methods and attributes are available for this parsed object?
[m for m in dir(spsent) if m[0] != '_']

['cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_ents',
 'set_extension',
 'similarity',
 'spans',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_bytes',
 'to_dict',
 'to_disk',
 'to_json',
 'to_utf8_array',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']

In [46]:
# So there are quite a number of attributes and bound methods for
# this collection of tokens. We will learn more of them eventually
# but for now, let's just look at one attribute.
spsent.has_annotation("TAG") # What does this one tell us?

# True means the tokens in this Doc carry fine-grained part-of-speech ("TAG") annotations

True

In [47]:
# So spaCy has guessed the part of speech for each token. We can easily list
# all of the tags.
tags = [(i, i.pos_) for i in spsent]
print(tabulate(tags, headers=["Token", "POS Tag"]))

Token    POS Tag
-------  ---------
The      DET
faster   ADJ
Harry    PROPN
got      VERB
to       ADP
the      DET
store    NOUN
,        PUNCT
the      PRON
faster   ADJ
Harry    PROPN
would    AUX
get      VERB
home     ADV
.        PUNCT


In [48]:
# SpaCy has also stored the lemmas for each token
# Let's show the lemmas and clean up our output. We can use the
# tabulate package to make clean, simple display tables.
from tabulate import tabulate

# Make a little dataset for tabulate() to work on.
poslist = [ (i, i.lemma_, i.pos_) for i in spsent]

print(tabulate(poslist,  headers=["Token", "Lemma", "Tag"]))


Token    Lemma    Tag
-------  -------  -----
The      the      DET
faster   fast     ADJ
Harry    Harry    PROPN
got      get      VERB
to       to       ADP
the      the      DET
store    store    NOUN
,        ,        PUNCT
the      the      PRON
faster   fast     ADJ
Harry    Harry    PROPN
would    would    AUX
get      get      VERB
home     home     ADV
.        .        PUNCT


In [49]:
# Take a peek at those tags. Some of them should be pretty obvious. For how many
# of them can you guess what part of speech it is referring to? Also take a 
# look at the lemmas.

# There's even more info in there for every token. In particular, pay attention
# to the "is_" tests: There are 18 tests that you can do on any token to
# see how spaCy has categorized it.
[m for m in dir(spsent[0]) if m[0:2] == 'is']

['is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper']

In [50]:
# Let's make a more detailed table with a few of these fields.
poslist = [ (i, i.head, i.lemma_, i.pos_, i.tag_, i.is_alpha) for i in spsent]

print(tabulate(poslist,  headers=["Token", "Head", "Lemma", "Tag", "Details","Alpha?"]))


Token    Head    Lemma    Tag    Details    Alpha?
-------  ------  -------  -----  ---------  --------
The      Harry   the      DET    DT         True
faster   Harry   fast     ADJ    JJR        True
Harry    got     Harry    PROPN  NNP        True
got      get     get      VERB   VBD        True
to       got     to       ADP    IN         True
the      store   the      DET    DT         True
store    to      store    NOUN   NN         True
,        get     ,        PUNCT  ,          False
the      faster  the      PRON   DT         True
faster   Harry   fast     ADJ    JJR        True
Harry    get     Harry    PROPN  NNP        True
would    get     would    AUX    MD         True
get      get     get      VERB   VB         True
home     get     home     ADV    RB         True
.        get     .        PUNCT  .          False


The table above just scratches the surface, but there's still a lot of interesting stuff happening there. In the first column we have the token itself, which can be a word, a number, or punctuation. The second column starts to unpack the idea of dependency grammar - that each word in a sentence represents a portion of a tree, with "ancestors" that it depends on and "children" that depend on it. "Head" refers to the immediate ancestor of a word. So for instance, the proper noun "Harry" depends on the corresponding verb "got." Next we have the lemmas and the simple part of speech tag as before. By the way, you can find an explanation of these tags here:

https://universaldependencies.org/docs/u/pos/
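The "Head" column is enough to rebuild the whole dependency tree. The sketch below does this in plain Python. The `heads` list is an assumption read off the table above: where a head word occurs twice in the sentence (for example "Harry"), the index was chosen by hand, so treat it as illustrative rather than as spaCy's exact output. As in spaCy, the root token is recorded as its own head.

```python
tokens = ["The", "faster", "Harry", "got", "to", "the", "store", ",",
          "the", "faster", "Harry", "would", "get", "home", "."]

# heads[i] is the index of token i's head (ASSUMED from the table above).
heads = [2, 2, 3, 12, 3, 6, 4, 12, 9, 10, 12, 12, 12, 12, 12]

def children(i):
    """Indices of the tokens whose immediate head is token i."""
    return [j for j, h in enumerate(heads) if h == i and j != i]

def ancestors(i):
    """Follow head links upward until reaching the root."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain

print([tokens[j] for j in ancestors(6)])   # path from "store" up to the root
print([tokens[j] for j in children(12)])   # everything hanging off "get"
```

With a real spaCy token the same information is available through the `head`, `children`, and `ancestors` attributes.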

Finally, there is a fine-grained part of speech - a more complicated tag provided by spaCy. These are unique to each language model, but there is a function call that will provide information about any of the tags:

In [22]:
spacy.explain("JJR")

'adjective, comparative'

#Checkpoint! Use spacy.explain("RB")

In the empty code box below, add and run spacy.explain("RB"). The tag "RB" is from the detail field of the last word (home) in the tabular output just above. The method will return the part of speech that "RB" refers to. Write that part of speech on the whiteboard next to your name.

In [23]:
spacy.explain("RB")

'adverb'

In [51]:
# Let's practice by tagging another sentence. Here's some text extracted from
# Wikipedia's article on kites.
kites = """A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces. 
A kite consists of wings, tethers and anchors. Kites often have a bridle and tail to guide the face of the kite so the wind can lift it. 
Some kite designs don’t need a bridle; box kites can have a single attachment point. 
A kite may have fixed or moving anchors that can balance the kite. 
One technical definition is that a kite is “a collection of tether-coupled wing sets“.
The name derives from its resemblance to a hovering bird."""

spkites = nlp(kites)
type(spkites), len(spkites)

(spacy.tokens.doc.Doc, 130)

In [None]:
# Add code to conduct the following analyses:

# 3.6: Display tokens, lemmas, and parts of speech for spkites. Try using a
# nice, neat tabular format for the output.

poslist2 = [ (i, i.lemma_, i.pos_) for i in spkites]

print(tabulate(poslist2,  headers=["Token", "Lemma", "Tag"]))


In [53]:
# It might be more convenient to work with individual sentences:
kitespans = list(spkites.sents)

kitespans[0] # Let's view just the first sentence

A kite is a tethered heavier-than-air or lighter-than-air craft with wing surfaces that react against the air to create lift and drag forces. 

In [54]:
# One other neat trick: We can use spaCy to display a graphical
# version of the dependency tree for any sentence or document.
# You saw this in an exercise in class.
from spacy import displacy 
displacy.render(kitespans[0], style="dep", jupyter=True)

In [55]:
# Apply displacy to another sentence from the same Wikipedia article.

# 3.7: Add a dependency structure graph for the second sentence in kites.
displacy.render(kitespans[1], style="dep", jupyter=True)

Let's close the loop on the idea of data reduction by seeing how many unique lemmas spaCy creates for Crime and Punishment. Recall that we were unsatisfied with the lemmatizer from NLTK because - in order for it to work effectively - we needed to know the POS for each token before calling the lemmatizer. The spaCy nlp() call ingests our whole text, applies tags, and determines lemmas, all based on a swappable language model.

In [56]:
# Process Crime and Punishment with spaCy: takes a minute!
nlp.max_length = 1200000 # Increase from the default of 1 million characters

# Note that this call takes about a minute to complete.
crimespacy = nlp(raw) # We're going back to the original raw text data!

type(crimespacy), len(crimespacy)

(spacy.tokens.doc.Doc, 274697)

In [57]:
# Let's count unique lemmas
newcrimelemma = [l.lemma_ for l in crimespacy]
len(set(newcrimelemma))
# 7844 unique lemmas

7844

In [58]:
# What percentage reduction have we achieved with the lemmatizer? How does that
# compare with the stemmers we tested at the beginning of this lab?
len(set(newcrimelemma))/len(set(crimetokens))

# The lemma vocabulary is about 73% of the original size (about a 27% reduction)
# The reduction is smaller than for the three stemmers and larger than for the WordNet lemmatizer

0.7340445442635224

**Part Three**

So using the spaCy text preprocessing we have generated a complete sequence of approximately 275,000 tokens, considerably more than the NLTK word tokenizer produced. Can you guess why there are so many more? Even starting from this larger base, however, the spaCy lemmatizer - with the advantage of knowing the POS for each token - has cut things down to about 8000 unique tokens. This is not as aggressive as the Lancaster stemmer, but on the other hand, all of the lemmatized words in the spaCy list are real words (note that the spaCy list will also include tokens that were not lemmatized, such as proper names).

What can we do with this large collection of lemmatized tokens? One essential way of representing a corpus is to transform the token counts into a "document term matrix" (DTM) or a transposed version of the same thing, a "term document matrix." The most basic DTM contains word frequencies in each cell. A more advanced DTM contains adjusted values known as TF-IDF (term frequency, inverse document frequency). Creating a DTM, either with counts or with TF-IDF values begins with a process called vectorization. Let's vectorize Crime and Punishment, treating each sentence as a document. 
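Before handing this job to a library, it may help to build a tiny count-based DTM by hand. The three "documents" below are made up, and whitespace splitting stands in for real tokenization; this is a sketch of the idea, not of how CountVectorizer is implemented.

```python
from collections import Counter

docs = ["the kite flew", "the wind lifted the kite", "wind and rain"]
tokenized = [d.split() for d in docs]

# Columns: the sorted vocabulary. Rows: one per document.
vocab = sorted({w for doc in tokenized for w in doc})
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# The term-document matrix is simply the transpose.
tdm = [list(col) for col in zip(*dtm)]

print(vocab)
for row in dtm:
    print(row)
print("total count of 'kite':", sum(tdm[vocab.index("kite")]))
```

Even with three short documents most cells are zero, which is why real DTMs are stored as sparse matrices.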

In [59]:
# It might be more convenient to work with individual sentences:
crimespans = list(crimespacy.sents)

crimespans[42:45] # Let's view three sample sentences

[A few months later Dostoevsky died., He was followed to the grave by a
 vast multitude of mourners, who “gave the hapless man the funeral of a
 king.”, He is still probably the most widely read writer in Russia.
 ]

In [60]:
type(crimespans[42]) # Check on the type of a single sentence

#Type is spacy tokens span

spacy.tokens.span.Span

In [61]:
# Create a vectorizer using the powerful scikit-learn library.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a vectorizer, removing stopwords, setting min doc frequency
vectorizer = CountVectorizer(min_df=1, stop_words='english', lowercase=True) 

crimesparse = vectorizer.fit_transform([ t.text for t in crimespans])
type(crimesparse)

#Converted to scipy sparse matrix to represent document term frequency -DTM


scipy.sparse.csr.csr_matrix

In [62]:
# A sparse matrix DTM is excellent for efficient storage, but to do useful 
# manipulations, we will need to blow it up into a data frame.
import pandas as pd
dtmDF = pd.DataFrame(crimesparse.toarray(),
                      columns=vectorizer.get_feature_names_out())

dtmDF.shape # Make sure you know what these numbers are: Confirm with your partner!

# 14713 rows (sentences)
# 9557 columns (words)

(14713, 9557)

In [63]:
dtmDF # We can get a preview of the data frame: first five and last five rows
# And remember that the magic wand tool lets you take a closer look.

Unnamed: 0,000,14,1500,1849,1859,1861,1864,1880,1887,20,...,zest,zeus,zigzags,zimmerman,zossimov,æsthetic,æsthetically,æsthetics,éternelle_,êtes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14708,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14709,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14710,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0




In [64]:
# The output above shows us that we have about 9,600 terms (columns) in our
# data frame and nearly 15000 sentences (rows). We can look up any word in
# the DTM by name and find out how frequently it occurs.
dtmDF['priest'].sum() # We're computing the column sum of word counts

#The word priest occurs 19 times

19

In [68]:
# Choose another word that you think should be in the DTM. What 
# happens if you try a stop word?

# 3.8: Get a total frequency count for a different word.
print("happy : ",dtmDF['happy'].sum())

# Stop words don't exist in the data frame, so looking one up raises a KeyError

happy :  30


In [89]:
# Let's make a complete frequency list of all words (columns)
wordfreqs = [ (word, dtmDF[word].sum()) for word in vectorizer.get_feature_names_out()] 

# Now we can sort, using the count as a key. This code uses a lambda function,
# an anonymous temporary function, to choose the second element of each tuple 
# as the basis for the sort. You can read this lambda function as saying, 
# "I receive as input one of the tuples and I call it w. Given w, I will 
# return the second element (the count) to be used as the sorting key."
wordfreqs.sort(key=lambda w: w[1], reverse=True)

# Show the top 20 items
wordfreqs[0:20]

[('raskolnikov', 785),
 ('know', 530),
 ('said', 519),
 ('did', 497),
 ('come', 480),
 ('man', 479),
 ('don', 464),
 ('like', 453),
 ('sonia', 402),
 ('time', 385),
 ('went', 356),
 ('razumihin', 347),
 ('dounia', 325),
 ('thought', 306),
 ('ivanovna', 304),
 ('say', 296),
 ('looked', 293),
 ('suddenly', 293),
 ('little', 288),
 ('petrovitch', 287)]

In [73]:
# Rodion Raskolnikov and Dmitri Prokofych Razumikhin are focal characters in the book,
# so it is pretty cool that their names are among the most frequently
# appearing terms in our DTM.

# Next, make a list of *row sums* from our dtm using dtmDF.sum(axis=1).  
# Examine this list to see if there are any documents that have a row 
# sum of zero. What would this imply, if you found it?

# 3.9: Count the number of rows where row sum equals zero
dtmDF[dtmDF.sum(axis=1)==0]
# Write a comment indicating what (if anything) you should do with 
# rows whose sum is zero?

# If a row's sum is zero, none of the words in that sentence made it into the vocabulary.
# Such sentences mostly contain only stop words (or single-character tokens and punctuation),
# which CountVectorizer drops when stop_words='english'. These rows can be removed.

Unnamed: 0,000,14,1500,1849,1859,1861,1864,1880,1887,20,...,zest,zeus,zigzags,zimmerman,zossimov,æsthetic,æsthetically,æsthetics,éternelle_,êtes
72,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
176,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
213,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
269,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
298,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14664,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14667,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14672,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14679,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
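To see how a sentence can end up as an all-zero row, the filtering can be imitated with the standard library. This is a sketch under stated assumptions: the regular expression mirrors the default token pattern documented for CountVectorizer (tokens of two or more word characters), while the stop list `TOY_STOPS` and the three sentences are made up for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")   # tokens of 2+ word characters
TOY_STOPS = {"he", "was", "it", "is", "not", "so", "what", "then"}  # hypothetical

def surviving_terms(sentence):
    """Lowercase, tokenize, and drop stop words, as a vectorizer would."""
    words = TOKEN_PATTERN.findall(sentence.lower())
    return [w for w in words if w not in TOY_STOPS]

sentences = ["What then?", "So it was not he!", "Raskolnikov smiled."]
empty = [s for s in sentences if not surviving_terms(s)]

print(len(empty), "of", len(sentences), "sentences would be all-zero rows")
print(empty)
```

Short, emphatic sentences made up entirely of function words are exactly the kind that disappear.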


In [75]:
# As a hint about the previous exercise, let's find out all of the words
# that are included in the default list of stop words used by CountVectorizer
from sklearn.feature_extraction import text 

stop_words = text.ENGLISH_STOP_WORDS
len(stop_words)

#318 number of stop words

318

In [None]:
# 
# 3.9a: Print out the contents of stop_words. Review it carefully. Are 
# there any surprises?
#
stop_words
#There are words like "together","twelve","two","beforehand" - which are considered 
#as stopwords, which could mean the sentence has become empty.

In [77]:
# Let's conclude with a primitive analysis of the dtm. First we'll make two 
# subsets of our data, based on mentions of characters:

raskolDF = dtmDF[dtmDF.raskolnikov > 0]
razumihinDF = dtmDF[dtmDF.razumihin > 0]
raskolDF.shape, razumihinDF.shape

#raskolnikov is present in 777 sentences
#razumihin is present in 343 sentences

((777, 9557), (343, 9557))

In [78]:
# This creates a ratio of the number of times the word good is mentioned 
# in each of the two data subsets. What's another word we could probe
# to get a sense of how these two characters are discussed.
raskolDF['good'].sum()/razumihinDF['good'].sum()

# "good" occurs twice as often in the Raskolnikov subset as in the Razumihin subset

2.0
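One caution when reading this ratio: the two subsets differ in size (777 sentences versus 343), so a ratio of raw counts partly reflects how many sentences mention each character. A minimal sketch of a size-adjusted comparison follows. The helper name `per_sentence_ratio` is made up; the inputs are the ratio and subset sizes shown above.

```python
def per_sentence_ratio(raw_ratio, n_sentences_a, n_sentences_b):
    """Rescale a ratio of raw counts (a / b) to a ratio of per-sentence rates.

    (count_a / n_a) / (count_b / n_b) == (count_a / count_b) * (n_b / n_a)
    """
    return raw_ratio * n_sentences_b / n_sentences_a

adjusted = per_sentence_ratio(2.0, 777, 343)
print(round(adjusted, 3))
```

A value below 1 here would mean that, sentence for sentence, the word is slightly less common in the Raskolnikov subset despite the larger raw count.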

In [90]:
#
# 3.10: Now obtain a ratio of total word frequency for a word other than good
#
raskolDF['mother'].sum()/razumihinDF['mother'].sum()

# "mother" occurs about 3.3 times as often in the Raskolnikov subset as in the Razumihin subset

3.3333333333333335

There are many more things we can do with our vectorized sentences, and we will learn some more of them in future weeks. Note that unigram vectorization based on word frequencies is really the most primitive numeric representation of a document. Where we are headed is to use contemporary vector representations (first, word by word, and later with full sentences) of document contents.

You are probably near the end of the lab period, so don't forget to submit your lab file AND make a note of how far you got.

If there is some remaining time in the lab session, go back to the cell where CountVectorizer is invoked and use that code as the starting point for a new set of code cells that use *TfidfVectorizer* from the scikit-learn package. You can find the documentation for the TfidfVectorizer here:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
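As background for the exercises below, here is the textbook form of TF-IDF worked out on three made-up documents. This is a sketch of the basic idea only: TfidfVectorizer by default applies a smoothed IDF and normalizes each row, so its numbers will differ from these. The names `docs`, `df`, and `tfidf` are illustrative.

```python
import math
from collections import Counter

docs = [["kite", "wind", "kite"], ["wind", "rain"], ["rain", "rain", "sun"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    """Raw term count times log(N / document frequency)."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf

for term in ["kite", "wind"]:
    print(term, round(tfidf(term, docs[0]), 3))
```

"kite" scores higher than "wind" in the first document both because it occurs twice there and because it appears in fewer documents overall.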

After vectorizing with TF-IDF, convert your sparse matrix to a data frame and do some diagnostics on it as we did above. Finally, break your data frame into subsets, with one for Raskolnikov and one for Razumihin. Then repeat your ratio tests on good and other words to see if anything has changed as compared with using raw frequency counts.

In [None]:
#
# 3.11: Revectorize Crime and Punishment sentences with TF-IDF
#

In [None]:
#
# 3.12: Convert vectorization results (the TF-IDF DTM) to pandas data frame
#

In [None]:
#
# 3.13: Repeat one or more of diagnostic tests demonstrated for the 
# count vectorization.
#

In [None]:
#
# 3.14: Repeat ratio tests, comparing the contents of the DTMs for the two characters.
#

##Bonus Content!##

One last advanced topic for this lab is to consider the concept of "pointwise mutual information" (PMI) mentioned in today's presentation. When we are dealing with data such as lists of bigrams or trigrams, we often would like to have a method of sifting the data to find the most interesting examples. PMI compares the observed probability of two words co-occurring with the probability we would expect if the two words occurred independently. Here's an example: Let's say that "fish" occurs five times in 100 words, while "cake" appears eight times. The combination "fish cake" appears 3 times. Now run the code below:

In [None]:
pfish = 5/100
pcake = 8/100
pfishcake = 3/100

import math # We will need the log2() function
pmi = math.log2( pfishcake / (pfish * pcake))
print(pmi)

So based on this result (a PMI of about 2.9, which corresponds to a ratio of 7.5), "fish" and "cake" are occurring together considerably more frequently than would be expected based on how often they appear independently. You can fiddle around with the probability values to see how it affects the PMI calculation.
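To make that fiddling easier, the calculation can be wrapped in a small function that works from raw counts. This is a sketch; the function name `pmi` and its argument names are illustrative, and the second and third sets of counts are made up to show the independent and below-chance cases.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

fish_cake = pmi(3, 5, 8, 100)      # the example above
independent = pmi(1, 10, 10, 100)  # joint probability equals the product
below_chance = pmi(1, 20, 20, 100) # pair is rarer than independence predicts

print(round(fish_cake, 3), round(independent, 3), round(below_chance, 3))
```

A PMI near zero means the pair co-occurs about as often as chance predicts, a positive value means more often, and a negative value means less often.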

To do this kind of analysis at scale, we'll pull in some code from the nltk.collocations module.

In [None]:
from nltk.collocations import BigramAssocMeasures # We need two classes
from nltk.collocations import BigramCollocationFinder

# Here we are creating instances of two classes. The first is a
# bigram measurer of the class nltk.metrics.association.BigramAssocMeasures.
# The second is a finder object that is initialized with the tokens from Crime and Punishment.
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(crimetokens)
type(bigram_measures), type(finder)

In [None]:
# The NLTK Pointwise Mutual Information scoring function, PMI, scores
# the bigrams by taking into account the frequency of the two
# component words. When infrequent words make a bigram they get
# a boost in PMI. A higher score thus means a more interesting
# bigram.
finder.apply_freq_filter(2) # Let's ignore hapaxes
scored = finder.score_ngrams(bigram_measures.pmi)

# Examine the pairs with PMI greater than 16.5 (an arbitrary number chosen
# simply to keep the list short).
[bg for bg in scored if bg[1] > 16.5]

In [None]:
# 
# 3.15: Lower the PMI threshold to 14 or 15 and examine some of the 
# additional bigrams. What do you see? Are high PMI bigrams useful
# at telling us something about the corpus?