## LEGALST-190 Lab 3/6

---

In this lab, you will learn about one of the dominant text representations in natural language processing and the basics of how to implement it in Python. We'll be using the data you extracted in the last lab (un-debates-2001-clean.csv).


In [1]:
# dependencies
# from datascience import *
import numpy as np
import pandas as pd

### Overview

Here we will discuss one widely used representation of text:
- <b>Bag-of-Words Encoding</b>: encodes text by the frequency of each word

This model was very popular in early text analysis and continues to be used today. The models that have largely replaced it are still difficult to interpret, which gives the BoW approach an advantage when we want to understand why a model makes certain decisions. Once we have a BoW representation, each text becomes a vector in a high-dimensional space, which lets us measure how similar texts are and cluster them.
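To make the vector-space idea concrete, here is a minimal sketch of cosine similarity between bag-of-words counts. The three toy documents and the helper name `cosine_similarity` are made up for illustration and are not part of the lab data.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy documents (hypothetical, written in the stemmed style of the lab's tokens)
doc_peace = Counter("peace secur peace nation".split())
doc_peace2 = Counter("peace nation secur develop".split())
doc_trade = Counter("trade tariff export market".split())

sim_related = cosine_similarity(doc_peace, doc_peace2)
sim_unrelated = cosine_similarity(doc_peace, doc_trade)
print(sim_related, sim_unrelated)
```

Documents that share vocabulary score closer to 1, and documents with no words in common score 0.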

In [3]:
## retrieve our data
data = pd.read_csv('data/un-debates-2001-clean.csv', index_col=0)
data.head()

Unnamed: 0,session,year,country,text,tokens
7318,56,2001,COM,"﻿On\nbehalf of the Comorian delegation, which ...",﻿on behalf comorian deleg i honour lead behalf...
7319,56,2001,RWA,"﻿It is a\ngreat honour for me, on behalf of th...",﻿it great honour behalf rwandan deleg join pre...
7320,56,2001,MMR,﻿On behalf of the\ndelegation of the Union of ...,﻿on behalf deleg union myanmar i wish extend w...
7321,56,2001,PHL,﻿Let me begin by\ncongratulating Your Excellen...,﻿let begin congratul your excel mr han seungso...
7322,56,2001,MRT,﻿I\nam delighted to be able to congratulate yo...,﻿i delight abl congratul sir behalf deleg isla...


In [8]:
## let's store our text and our tokens into a list
text_list = list(data['text'])
tokens_list = list(data['tokens'])
print(tokens_list[:15])
type(tokens_list)

['\ufeffon behalf comorian deleg i honour lead behalf i offer sir warmest congratul elect presid general assembl session we express ardent hope work enlighten leadership success my deleg i pay ring tribut predecessor mr harri holkeri excel manner led work previous session as secretarygener mr kofi annan i prais merit man great talent exemplari wisdom i also pay tribut dedic servic world organ the nobel peac prize award togeth organ concret proof outstand valu on 11 septemb entir world plung gloom anarchi terrorist network defi entir intern communiti reprehens attack american interest new york global hospit cosmopolitan citi — capit entir world thus i fail duti convey rostrum deep sympathi compass govern peopl comoro american peopl govern follow pain tragic unfortun event we offer griefstricken condol particular famili victim whose terribl pain share follow sudden death furthermor deepli move loss live aeroplan accid took place last monday new york we extend sincer condol govern peopl u

list

## Bag-of-Words Encoding

The bag-of-words encoding is widely used and a standard representation for text in many of the popular text clustering algorithms.

__Key Things to Note:__

1. __Stop words are removed.__ Stop-words are words like 'is' and 'about' that in isolation contain very little information about the meaning of the sentence.
2. __Word order information is lost.__ 
3. __Capitalization and punctuation__ are typically removed.
4. __Sparse encoding__ is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings), so storing a 0 for every word that does not appear in a record would be incredibly inefficient.

Why is it called a __bag-of-words__?

__SOLUTION:__ <b>Because it is just an unordered collection of words and their frequencies. We don't know anything about syntax, context, or anything other than that a word occurs n times in a document.</b> Despite discarding that information, the representation is still widely used.
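A small sketch of what "word order information is lost" means in practice. The two sentences are invented for illustration: they say opposite things, yet their bags of words are identical.

```python
from collections import Counter

# Two made-up sentences with the same words in a different order
sentence_a = "the council condemns the attack"
sentence_b = "the attack condemns the council"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

# The bags only record which words occur and how often
print(bag_a)
print(bag_a == bag_b)
```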

### Implementing the Bag-of-words Model

### Review of Tokens

If you remember from the last lab, we created tokens and added them to our table. Normally at this stage you would create the tokens yourself, so let's introduce a tool for counting them: `Counter`.

The easiest way to count tokens is using the `Counter` object from `collections`. This will give you back a dictionary with the token counts:

In [10]:
from collections import Counter
# extract the first speech from text_list (the raw text), split it on single spaces, then put it into a Counter
first_speech = text_list[0].split(' ')
counter = Counter(first_speech) 
counter

Counter({'\ufeffOn\nbehalf': 1,
         'of': 107,
         'the': 146,
         'Comorian': 1,
         'delegation,': 1,
         'which': 10,
         'I': 11,
         'have': 4,
         'the\nhonour': 1,
         'to': 75,
         'lead,': 1,
         'and': 56,
         'on': 14,
         'my': 8,
         'own': 1,
         'behalf,': 1,
         'offer': 1,
         'you,': 1,
         'Sir,\nour': 1,
         'warmest': 1,
         'congratulations': 1,
         'your': 1,
         'election': 1,
         'the\npresidency': 1,
         'General': 2,
         'Assembly': 2,
         'at': 3,
         'this': 19,
         'session.': 1,
         'We\nexpress': 1,
         'ardent': 1,
         'hope': 1,
         'that': 30,
         'our': 25,
         'work,': 1,
         'under': 3,
         'your\nenlightened': 1,
         'leadership,': 1,
         'will': 8,
         'be': 10,
         'successful.\nMy': 1,
         'delegation': 1,
         'pay': 2,
         'a': 26,


The `most_common()` method can be called on a Counter to return the most common tokens and their counts. What are the most common tokens in the first speech? What do these common words tell you about the content or tone of the speech?

In [9]:
# in a UN speech the most common tokens would likely be about current international problems and probably about
#     the home country of the speaker (well, I was assuming stop words cut out)
counter.most_common()

[('the', 146),
 ('of', 107),
 ('to', 75),
 ('and', 56),
 ('in', 47),
 ('is', 31),
 ('that', 30),
 ('a', 26),
 ('our', 25),
 ('for', 20),
 ('this', 19),
 ('on', 14),
 ('by', 14),
 ('I', 11),
 ('as', 11),
 ('which', 10),
 ('be', 10),
 ('it', 10),
 ('all', 10),
 ('Government', 9),
 ('we', 9),
 ('my', 8),
 ('will', 8),
 ('has', 8),
 ('from', 8),
 ('national', 8),
 ('with', 7),
 ('Comoros', 7),
 ('The', 7),
 ('peace', 7),
 ('world', 6),
 ('people', 6),
 ('its', 6),
 ('respect', 6),
 ('international', 5),
 ('Republic', 5),
 ('Nations', 5),
 ('This', 5),
 ('United', 5),
 ('an', 5),
 ('order', 5),
 ('island', 5),
 ('have', 4),
 ('led', 4),
 ('great', 4),
 ('must', 4),
 ('responsibility', 4),
 ('are', 4),
 ('not', 4),
 ('framework', 4),
 ('In', 4),
 ('country,', 4),
 ('Comoros,', 4),
 ('part', 4),
 ('made', 4),
 ('Comoran', 4),
 ('at', 3),
 ('under', 3),
 ('Mr.', 3),
 ('such', 3),
 ('also', 3),
 ('entire', 3),
 ('We', 3),
 ('particular', 3),
 ('these', 3),
 ('were', 3),
 ('took', 3),
 ('peoples
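The counts above are dominated by function words, and they include entries such as `'the\nhonour'` and `'country,'` because splitting on a single space leaves newlines and punctuation attached. The sketch below shows one simple way to clean raw text before counting. The helper `simple_tokens` is hypothetical, the sample string is a shortened opening modeled on the first speech, and the ASCII-only pattern is a simplification that would drop accented letters.

```python
import re
from collections import Counter

def simple_tokens(raw_text):
    """Lowercase, drop the byte-order mark, and keep runs of ASCII letters/digits."""
    cleaned = raw_text.replace("\ufeff", "").lower()
    return re.findall(r"[a-z0-9]+", cleaned)

# Shortened sample modeled on the start of the first speech
sample = "\ufeffOn\nbehalf of the Comorian delegation, which I have the\nhonour to lead"
sample_counts = Counter(simple_tokens(sample))
print(sample_counts.most_common(3))
```

After this step `'the'` and `'the\nhonour'` no longer count as unrelated tokens, and `'delegation,'` is counted together with `'delegation'`.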

## Document-Term Matrix

We can use sklearn to construct a bag-of-words representation of text. Create an instance of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the vectorizer.
cv = CountVectorizer()

After creating the CountVectorizer object, use `fit_transform` on it. `fit_transform` takes in the list of documents we want to represent (in this case, the list of tokenized text).

In [12]:
dtm = cv.fit_transform(tokens_list)
dtm

<189x7874 sparse matrix of type '<class 'numpy.int64'>'
	with 109126 stored elements in Compressed Sparse Row format>

What's this? A sparse matrix is one in which most cells are zero, so only the nonzero entries are actually stored.
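For the matrix above, 109126 stored elements out of 189 × 7874 = 1,488,186 cells means only about 7% of the cells are nonzero. The sketch below illustrates the idea of storing only nonzero cells on a tiny made-up matrix. It uses a plain dictionary keyed by (row, column), which is a simplification and not the Compressed Sparse Row layout that scipy actually uses.

```python
# Toy dense count matrix: 3 documents x 6 vocabulary words (values made up)
dense = [
    [0, 2, 0, 0, 1, 0],
    [0, 0, 0, 3, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

# Keep only the nonzero cells, keyed by (row, column)
stored = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
density = len(stored) / total_cells
print(len(stored), "stored elements out of", total_cells, "cells")
print("density:", round(density, 3))
```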

We can get a better look at what's going on by turning the sparse matrix into a data frame. First, get the list of words in our 'bag-of-words' by using `get_feature_names_out()` on your CountVectorizer (scikit-learn versions before 1.0 call this method `get_feature_names()`).

In [13]:

# create labels for columns.
word_list = cv.get_feature_names_out().tolist()
word_list

['024',
 '03',
 '033',
 '04',
 '07',
 '071',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '105',
 '106',
 '10817',
 '10year',
 '11',
 '1135',
 '117',
 '1192',
 '12',
 '120',
 '120000',
 '1244',
 '125',
 '127',
 '1278',
 '12month',
 '12step',
 '13',
 '1300',
 '130000',
 '1306',
 '1314',
 '1325',
 '1333',
 '134',
 '1343',
 '1345',
 '135',
 '1359',
 '1365',
 '1368',
 '1371',
 '1373',
 '1375',
 '1376',
 '1377',
 '1378',
 '14',
 '140',
 '14000',
 '1419',
 '142',
 '1440s',
 '145',
 '147',
 '1492',
 '15',
 '150',
 '15000',
 '150000',
 '1514',
 '1580',
 '15th',
 '15year',
 '16',
 '160',
 '164',
 '167',
 '17',
 '1700',
 '17000',
 '18',
 '1800',
 '181',
 '182',
 '1833',
 '187',
 '1884',
 '189',
 '1890',
 '19',
 '1907',
 '1917',
 '1920s',
 '1930s',
 '194',
 '1940s',
 '1942',
 '1945',
 '1946',
 '1947',
 '1948',
 '1949',
 '1950s',
 '1952',
 '1955',
 '1958',
 '19591960',
 '1960s',
 '1961',
 '1963',
 '1964',
 '1967',
 '1970',
 '1971',
 '1972',
 '1973',
 '1974',
 '1976',
 '1977',
 '1978',
 '1979',
 '

You can then de-sparsify the sparse matrix by turning it into an array. Try using `toarray()` on it.

In [14]:
# de-sparsify by turning dtm into an array
desparse = dtm.toarray()
desparse

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

You now have everything you need to convert your sparse matrix to a DataFrame. Double-check the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for a reminder of how to construct the frame.

In [15]:
# create a dataframe with words as columns and the sparse matrix data as the data
dtm_df = pd.DataFrame(data=desparse, columns=word_list)
dtm_df.head()

Unnamed: 0,024,03,033,04,07,071,10,100,1000,10000,...,zia,ziaur,zimbabw,zimbabwean,zine,zionist,zone,àvis,état,être
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This is what we call a Document Term Matrix, a core concept in NLP and text analysis.

As you can see, there is one column for each word in the vocabulary and one row for each text. Each value is the count of that word in the corresponding text. Note that there are many 0s, which is why the matrix is called 'sparse'.

Why are there so many zeros?

__SOLUTION__: There are so many zeros because in most of the documents (speeches) a given word never occurs.
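To see where the zeros come from, here is a minimal hand-built document-term matrix in plain Python. It is a simplified stand-in for what `CountVectorizer` produces, not its actual implementation, and the three toy documents are made up.

```python
from collections import Counter

# Toy pre-tokenized documents (hypothetical; stems imitate the lab's token style)
docs = [
    "peac secur peac",
    "trade develop",
    "secur develop trade trade",
]

counts_per_doc = [Counter(doc.split()) for doc in docs]

# Columns: every distinct word in the corpus, in sorted order
vocabulary = sorted(set(word for counts in counts_per_doc for word in counts))

# Rows: one count per vocabulary word; a Counter returns 0 for absent words
matrix = [[counts[word] for word in vocabulary] for counts in counts_per_doc]

print(vocabulary)
for row in matrix:
    print(row)
```

Every word that appears anywhere in the corpus gets a column, so each document has a zero in the column of every word it never uses. With thousands of columns and a few hundred distinct words per speech, most cells end up as zeros.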

To see the frequency of a word across documents, index that word's column.

In [16]:
# can easily find the frequencies for each of the given words
dtm_df['zone']

0      0
1      0
2      0
3      0
4      0
5      2
6      0
7      0
8      1
9      0
10     0
11     0
12     0
13     0
14     0
15     2
16     0
17     0
18     0
19     0
20     3
21     0
22     0
23     0
24     0
25     1
26     0
27     0
28     0
29     2
      ..
159    0
160    0
161    0
162    0
163    0
164    1
165    0
166    0
167    0
168    0
169    0
170    0
171    0
172    0
173    0
174    0
175    0
176    0
177    0
178    1
179    0
180    0
181    2
182    1
183    0
184    0
185    0
186    0
187    0
188    0
Name: zone, Length: 189, dtype: int64

In [17]:
# what's the total number of times the word 'zone' pops up?
dtm_df['zone'].sum()

41

In [18]:
# how many tokens are counted in the documents at index 100 and index 150?
print('100th doc words: ', dtm_df.loc[100].sum())
print('150th doc words: ', dtm_df.loc[150].sum())

100th doc words:  1553
150th doc words:  1121


## Normalization

Let's take another step and make the texts comparable with one another. We can normalize the values by dividing each word count by the total number of words in the text. To get those totals we sum along axis=1 (across each row, since each row is a text) rather than down each column.

Once we have the total number of words in a text, dividing each of its counts by that total gives the proportion of the text that each word accounts for. Applying this to every row normalizes the whole matrix.

In [19]:
# see if you can fill this out on your own following the steps listed above.

row_sums = desparse.sum(axis=1) # sum up the desparse on axis=1
print('row_sums shape: ', np.shape(row_sums))
print('desparse shape: ', np.shape(desparse))
normed = desparse / row_sums[:, np.newaxis] # divide each row by its total (broadcast across the columns)
# new_row_sums = normed.sum(axis=1)
# every entry of new_row_sums should equal 1, and it does!

dtm_df = pd.DataFrame(data=normed, columns=word_list) # create a data frame using the word_list and the new data

dtm_df.head()

row_sums shape:  (189,)
desparse shape:  (189, 7874)


Unnamed: 0,024,03,033,04,07,071,10,100,1000,10000,...,zia,ziaur,zimbabw,zimbabwean,zine,zionist,zone,àvis,état,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000924,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.001126,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.001653,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


When would it be most important to normalize word counts?
1. When you have a lot of documents
2. When you have very few documents
3. When the documents are of many different lengths
4. When the documents are all around the same length

Why?

__SOLUTION__: <b>It is most important to normalize word counts when the documents are of many different lengths, since longer documents will tend to have higher word frequencies merely because they are long.</b> The number of documents doesn't matter so much, and if they are all the same length, you wouldn't need to standardize the frequency count.
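A small worked example of why length matters, using a made-up count matrix in which the second document is ten times longer than the first but uses the words in the same proportions. This is a sketch assuming no document is empty, since an empty row would cause a division by zero.

```python
import numpy as np

# Toy count matrix: 2 documents x 3 words (values made up)
counts = np.array([
    [2, 1, 1],     # short document, 4 tokens
    [20, 10, 10],  # ten times longer, same proportions
])

row_totals = counts.sum(axis=1, keepdims=True)  # shape (2, 1)
proportions = counts / row_totals               # broadcasts across the columns

print(proportions)
```

The raw counts make the two documents look very different, while the normalized rows are identical.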

## Streamlining

Overall, this was a lot of work, and if it is as common in NLP as we say, shouldn't someone have streamlined it already? In fact, we can simply instruct CountVectorizer to drop stop words itself (so we could use it on our non-tokenized text), and another class, TfidfTransformer, handles the normalization.

In [21]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

engl_stop_words = list(ENGLISH_STOP_WORDS)

# fill out this beginning part
# when you create your CountVectorizer, set the stop_words argument equal to engl_stop_words
cv = CountVectorizer(stop_words = engl_stop_words)
dtm = cv.fit_transform(tokens_list)

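# this is what allows us to easily streamline:
# norm='l1' with use_idf=False divides each row by its total count, the same normalization we did by hand
tt = TfidfTransformer(norm='l1', use_idf=False)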
dtm_tf = tt.fit_transform(dtm)
dtm_tf

<189x7690 sparse matrix of type '<class 'numpy.float64'>'
	with 99385 stored elements in Compressed Sparse Row format>

Fantastic! You don't need to write up an answer, but think about how we could also remove the numbers from the matrix in addition to the stop words.
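One possible approach, sketched here with the standard `re` module: use a token pattern that only accepts tokens made entirely of letters. `CountVectorizer` takes a `token_pattern` argument (its default is `(?u)\b\w\w+\b`), so a pattern like the one below could be passed there. The name `LETTERS_ONLY`, the pattern itself, and the sample string are assumptions for illustration and have not been run against the lab data.

```python
import re

# Assumed pattern: whole tokens of two or more letters, with no digits or underscores
LETTERS_ONLY = r"(?u)\b[^\W\d_]{2,}\b"

# Made-up sample in the style of the lab's tokens
sample_tokens = "on 11 septemb 2001 resolut 1373 10year zone"
kept = re.findall(LETTERS_ONLY, sample_tokens)
print(kept)
```

Pure numbers such as `2001` and mixed tokens such as `10year` are both skipped, because the match must span a whole token. Whether mixed tokens should be dropped or kept is a design choice that depends on the analysis.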

---
## Bibliography

- Document Term Matrix, normalization markdown and code adapted from materials by Chris Hench: https://github.com/henchc/textxd-2017/blob/master/06-DTM.ipynb

---
Notebook developed by: Gibson Chu

Data Science Modules: http://data.berkeley.edu/education/modules
