In this problem set, we'll do exploratory data analysis on a text corpus. Follow the instructions in the notebook and fill in the parts marked for you to complete.

Although you're free to run the notebook in your own environment, I strongly recommend using Google Colab. You can upload this notebook to Google Colab by following the steps below.

1. Open [colab.research.google.com](https://colab.research.google.com)
2. Click on the upload tab
3. Upload the .ipynb file by choosing the right file from your local disk


**Submission instructions**

1. When you're ready to submit, save the notebook as QTM340-PS1-Firstname-Lastname.ipynb; for example, if your name is Harry Potter, save the file as `QTM340-PS1-Harry-Potter.ipynb`. This can be done in Google Colab by editing the filename and then following File --> Download --> .ipynb

2. Upload this file to Canvas.

**Objective**: In this notebook, we'll learn to:

a. Find stylometrics (3 points)

b. Find distinctive terms by comparing multiple corpora (2 points)

c. Find similar documents to a given document (3 points)

We'll use congressional speeches from the 114th Congress as our corpus for analysis. To get the data, execute the following lines:



In [1]:
!wget https://raw.githubusercontent.com/sandeepsoni/QTM340-Fall23/main/data/114_speeches.tar.gz
!tar -xzvf 114_speeches.tar.gz

--2023-10-02 03:55:58--  https://raw.githubusercontent.com/sandeepsoni/QTM340-Fall23/main/data/114_speeches.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43102274 (41M) [application/octet-stream]
Saving to: ‘114_speeches.tar.gz’


2023-10-02 03:56:00 (205 MB/s) - ‘114_speeches.tar.gz’ saved [43102274/43102274]

114/
114/speeches_114.txt
114/README.txt
114/114_SpeakerMap.txt


The above execution should create a directory named `114` with the following structure:

```
114/
114/speeches_114.txt
114/README.txt
114/114_SpeakerMap.txt
```

## 0. Setup

Let's load all the speeches for which we have additional metadata.

The speaker info file is delimited by `|` and the columns are named.

In [2]:
!head -n 5 114/114_SpeakerMap.txt

speakerid|speech_id|lastname|firstname|chamber|state|gender|party|district|nonvoting
114120480|1140000007|MCMORRIS RODGERS|CATHY|H|WA|F|R|5|voting
114118560|1140000009|BECERRA|XAVIER|H|CA|M|D|34|voting
114121890|1140000011|MASSIE|THOMAS|H|KY|M|R|4|voting
114122500|1140000013|BRIDENSTINE|JIM|H|OK|M|R|1|voting


Similarly, the speeches file is delimited by `|` and contains each speech and its ID.

In [3]:
!head -n 5 114/speeches_114.txt

speech_id|speech
1140000001|The Representativeselect and their guests will please remain standing and join in the Pledge of Allegiance.
1140000002|As directed by law. the Clerk of the House has prepared the official roll of the Representativeselect. Certificates of election covering 435 seats in the 114th Congress have been received by the Clerk of the House. and the names of those persons whose credentials show that they were regularly elected as Representatives in accordance with the laws of their respective States or of the United States will be called. The Representativeselect will record their presence by electronic device and their names will be reported in alphabetical order by State. beginning with the State of Alabama. to determine whether a quorum is present. Representatives- elect will have a minimum of 15 minutes to record their presence by electronic device. Representatives- elect who have not obtained their voting ID cards may do so now in the Speakers lobby.
1140000003|F

We'll use the `pandas` library to load both the speeches and the speaker info. If you are familiar with `R`, then pandas can be thought of as providing much the same functionality for constructing and manipulating dataframes. You can read more about it [here](https://pandas.pydata.org/docs/user_guide/10min.html).

We'll also import other libraries and configure them so they're ready to use later in the notebook.

In [4]:
# Import the general libraries
import math
import pandas as pd
from tqdm import tqdm
import numpy as np
from collections import defaultdict, Counter
import matplotlib.pyplot as pyplot
%matplotlib inline

# Import spacy and configure the nlp pipeline for spacy
import spacy
# The named entity recognizer and the parser are not needed, so they are disabled at load time
nlp = spacy.load ("en_core_web_sm", disable=["ner", "parser"])

# Import nltk and download the punkt tokenizer models
import nltk
nltk.download ("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:
# Let's read the speeches
speeches = pd.read_csv ("114/speeches_114.txt", #name of the file
                        sep='|', #delimiter
                        encoding="utf-8", #encoding of the characters
                        encoding_errors="ignore", #ignore any errors in encoding
                        on_bad_lines="skip" #skip lines which contain ill-formatted speeches
                       )

In [6]:
# Let's read the speaker info.
speaker_map = pd.read_csv ("114/114_SpeakerMap.txt", #name of the file
                        sep='|', #delimiter
                        encoding="utf-8", #encoding of the characters
                        encoding_errors="ignore", #ignore any errors in encoding
                        on_bad_lines="skip" #skip lines which contain ill-formatted speeches
                       )

Let's see a few rows in both dataframes. We can do this by calling the `.head` method of the pandas dataframe.

In [7]:
speeches.head (5)

Unnamed: 0,speech_id,speech
0,1140000001,The Representativeselect and their guests will...
1,1140000002,As directed by law. the Clerk of the House has...
2,1140000003,Four hundred and one Represent ativeselect hav...
3,1140000004,Credentials. regular in form. have been receiv...
4,1140000005,The Clerk is in receipt of a letter from the H...


In [8]:
speaker_map.head (5)

Unnamed: 0,speakerid,speech_id,lastname,firstname,chamber,state,gender,party,district,nonvoting
0,114120480,1140000007,MCMORRIS RODGERS,CATHY,H,WA,F,R,5.0,voting
1,114118560,1140000009,BECERRA,XAVIER,H,CA,M,D,34.0,voting
2,114121890,1140000011,MASSIE,THOMAS,H,KY,M,R,4.0,voting
3,114122500,1140000013,BRIDENSTINE,JIM,H,OK,M,R,1.0,voting
4,114120780,1140000017,PELOSI,NANCY,H,CA,F,D,12.0,voting


Now we'll merge both dataframes into one. We can do this by calling `pd.merge` as follows (if you're familiar with SQL, this is a join of the two tables on the `speech_id` field they have in common).

In [9]:
overall_data = pd.merge (speeches,
                         speaker_map,
                         how="inner",
                         on="speech_id")

The resulting dataframe can be accessed with the variable `overall_data`.
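To make the join behavior concrete, here is a minimal sketch on two made-up tables. The names `toy_speeches`, `toy_speakers`, and `toy_merged` and all the values are hypothetical and exist only for illustration; only `pd.merge` with `how="inner"` and `on="speech_id"` comes from the notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are made up for illustration.
toy_speeches = pd.DataFrame({
    "speech_id": [1, 2, 3],
    "speech": ["first speech", "second speech", "third speech"],
})
toy_speakers = pd.DataFrame({
    "speech_id": [2, 3, 4],
    "party": ["D", "R", "D"],
})

# An inner join keeps only the speech_ids present in BOTH tables (here 2 and 3).
toy_merged = pd.merge(toy_speeches, toy_speakers, how="inner", on="speech_id")
print(toy_merged)
```

Speech 1 has no speaker row and speaker row 4 has no speech, so both are dropped. This is why `overall_data` starts at speech 1140000007 rather than 1140000001.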

In [10]:
overall_data.head (5)

Unnamed: 0,speech_id,speech,speakerid,lastname,firstname,chamber,state,gender,party,district,nonvoting
0,1140000007,RODGERS. Madam Clerk. it is an honor to addres...,114120480,MCMORRIS RODGERS,CATHY,H,WA,F,R,5.0,voting
1,1140000009,Madam Clerk. first I would like to recognize e...,114118560,BECERRA,XAVIER,H,CA,M,D,34.0,voting
2,1140000011,Madam Clerk. I present for election to the off...,114121890,MASSIE,THOMAS,H,KY,M,R,4.0,voting
3,1140000013,Madam Clerk. I present for the election of the...,114122500,BRIDENSTINE,JIM,H,OK,M,R,1.0,voting
4,1140000015,Madam Clerk. I rise to place in a nomination f...,114120060,KING,STEVE,H,IA,M,R,4.0,voting


Now let's randomly pick 10000 speeches from the dataframe for our analysis. We can do this by calling the `.sample` method on the dataframe and passing an argument to it to indicate the number of rows we want to get post-sampling. Notice that we set the `random_state` to a fixed number (and not just any number but the answer to [the ultimate question of life](https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42))) which is done to ensure replicability.

In [11]:
n=10000
overall_data = overall_data.sample (n=n, random_state=42)
print (len (overall_data))

10000


In [31]:
sample = overall_data #rename variable to sample

## 1. Stylometrics

#### 1.1 Word length metrics

To warm up, let's calculate the average word length in the entire corpus. We'll use the `nltk` library to do this as shown below.

In [12]:
def get_all_tokens_from_corpus (frame: pd.DataFrame, field_name:str="speech") -> list:
  """ Iterate over the speeches in the dataframe and tokenize the speeches

  :params:
  :frame (pd.DataFrame): The dataframe that contains the entire tabular dataset
  :field_name (str): The field name in the dataframe that contains the speeches

  :returns:
  :all_tokens (list): List of tokens.

  """
  all_tokens = list ()
  for i,row in tqdm(frame.iterrows ()):
    tokens = nltk.word_tokenize (row[field_name]) # tokenize each speech
    tokens = [token.lower () for token in tokens if token.isalpha()]
    all_tokens.extend (tokens)
  return all_tokens

def get_average_token_length (tokens: list) -> float:
  """Given a list of tokens, calculate the average token length

  :params:
  :tokens (list): The list of tokens.

  :returns:
  :avg_token_length (float): The average token length over all the speeches

  """
  avg_token_length = sum([len (token) for token in tokens]) / len (tokens)
  return avg_token_length

In [13]:
# Tokenize all the speeches in the corpus
all_tokens = get_all_tokens_from_corpus (overall_data)

10000it [00:44, 226.54it/s]


In [14]:
# Get the average token length in the corpus.
avg_token_length = get_average_token_length (all_tokens)
print (f"The average token length is {avg_token_length:.2f}")

The average token length is 4.77


We used f-strings to format the output messages. You should learn more about them [here](https://builtin.com/data-science/python-f-string)

**Your turn:** Q1. Calculate the variance of the word length [0.5 points]

You'll implement a function that takes in the tokens as input and spits out a number which is the variance of the token lengths. The variance is calculated as follows:

$$
Var (X) = \frac{\sum\limits_{i=1}^{n} (x_i - \bar{x})^2}{n-1},
$$

where $X = \{x_1, x_2, \dots, x_n\}$ and

$$\bar{x} = \frac{\sum\limits_{i=1}^{n} x_i}{n}.$$


You can reuse the `get_average_token_length` function to calculate the variance.


In [17]:
def get_variance_token_length (tokens: list) -> float:
  """ Given a list of tokens, calculates the variance of the token length

  :params:
  :tokens (list): The list of tokenized speeches.

  :returns:
  :variance (float): The variance of token lengths over all the speeches

  """
  variance = 0.0
  # Write your code below
  avg_token_length = get_average_token_length (tokens)
  diff_square = [(len(token)-avg_token_length) **2 for token in tokens]
  numerator = np.sum(diff_square)
  denominator = (len(tokens) - 1)
  variance = numerator/denominator
  return variance

In [18]:
var_token_length = get_variance_token_length (all_tokens)
print (f"The variance of the token length is {var_token_length:.2f}")

The variance of the token length is 7.28


**Sanity check**: The variance is almost 1.5 times the average.

In [19]:
check = 4.77*1.5
check

7.154999999999999
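As an independent check of the $n-1$ formula, the sketch below computes the sample variance of token lengths by hand on a five-word toy list and compares it with `statistics.variance` from the Python standard library, which also divides by $n-1$. The names `toy_tokens`, `manual_var`, and `library_var` are hypothetical.

```python
import statistics

# Toy tokens (made up); their lengths are [3, 4, 4, 2, 2].
toy_tokens = ["the", "show", "must", "go", "on"]
toy_lengths = [len(token) for token in toy_tokens]

# Mean length: 15 / 5 = 3.0
mean_len = sum(toy_lengths) / len(toy_lengths)

# Sample variance: squared deviations summed, divided by n - 1.
manual_var = sum((x - mean_len) ** 2 for x in toy_lengths) / (len(toy_lengths) - 1)

# statistics.variance also uses the n - 1 denominator.
library_var = statistics.variance(toy_lengths)

print(manual_var, library_var)
```

The squared deviations are 0, 1, 1, 1, 1, so both values come out to 4 / 4 = 1.0.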

**Your turn** Q2. Report the average word length stratified by gender. Are female speakers using longer words in their speeches?  [0.5 points]

You'll need to filter the dataframe by the `gender` column. There are two labels "M" and "F" corresponding to male and female speakers.

_Tip_: Try to reuse the `get_average_token_length` function on slices of the corpus.

In [32]:
# Write code below
male_sample = sample[sample['gender'] == 'M']
female_sample = sample[sample['gender'] == 'F']


female_tokens = get_all_tokens_from_corpus(female_sample, field_name="speech")
male_tokens = get_all_tokens_from_corpus(male_sample, field_name="speech")

avg_token_length_female = get_average_token_length(female_tokens)
avg_token_length_male = get_average_token_length(male_tokens)

print (f"The average token length for female is {avg_token_length_female}")
print (f"The average token length for male is {avg_token_length_male}")


1795it [00:08, 213.94it/s]
8205it [00:29, 275.07it/s]


The average token length for female is 4.810179624952658
The average token length for male is 4.76110368088013


#### 1.2 Vocabulary richness by characteristic

In class we discussed type to token ratio as a way to quantify the richness in vocabulary.

[This paper](http://ccc.inaoep.mx/~villasen/bib/Quantitative%20Authorship%20AttributionQuantitative%20Authorship%20Attribution.pdf) suggests other ways of calculating the richness of the vocabulary (see 2.2.3). We'll implement Eq. 2 and 9 in this homework

**Your turn**: Q3. Implement a function that takes the tokens as input and calculates the metric in Eq.2 of the paper. [1 point]

$$
\frac{1}{N^2}\sum\limits_i i^2V_i - N
$$

Where $N$ is the number of tokens, and $V_i$ is the number of words that occur i times in the text. This statistic is also referred to as _The Characteristic_ in literature.

**Interpretation**: $V_i$ will be larger if words are being repeated. Thus, if the vocabulary is not rich, we expect this value -- and, in turn, our statistic -- to be high; conversely, if the reciprocal is high, then the vocabulary is rich.


To calculate the above equation, we first need the word counts. In class, we have seen that the word counts can be obtained by passing the tokens to `Counter` as follows.

In [20]:
# We can access the tokens using the variable all_tokens
all_speeches_counter = Counter (all_tokens)

Now write the two functions below [0.5 points each]:

1. The first function takes the counter object as input and returns a dictionary whose keys are counts and whose values are the number of vocabulary items with that count. For example, the returned dictionary should be of the form `{1: 213, 2: 124, ...}`, which means that there are 213 words that occur exactly once, 124 words that occur exactly twice, and so on.

2. The second takes the output of the first function as input and returns the richness metric as shown in "The characteristic" expression above.
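The "counts of counts" dictionary described in the first item can be illustrated on a tiny made-up token list. This is only a sketch of the data structure, with hypothetical names (`toy_tokens`, `word_counts`, `counts_of_counts`), not a reference solution.

```python
from collections import Counter

# Made-up tokens: "a" occurs once, "b" and "c" twice, "d" three times.
toy_tokens = ["a", "b", "b", "c", "c", "d", "d", "d"]

# Word -> number of occurrences.
word_counts = Counter(toy_tokens)

# Occurrence count i -> number of distinct words V_i that occur exactly i times.
counts_of_counts = Counter(word_counts.values())
print(dict(counts_of_counts))

# Summing i * V_i over all i recovers the total number of tokens.
n_tokens = sum(i * v for i, v in counts_of_counts.items())
print(n_tokens)
```

Here one word occurs once, two words occur twice, and one word occurs three times, and $1 \cdot 1 + 2 \cdot 2 + 3 \cdot 1 = 8$ matches the length of the token list.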

In [23]:
def counts_map (counter: Counter) -> dict:
  """ Create a dictionary with counts as keys and number of vocabulary
      items as values

      :params:
      :counter(Counter): wordcounts for the entire vocabulary

      :returns:
      value_counts (dict): dictionary contains a map of word count as keys and
                           number of vocabulary items that have that count as
                           values
  """
  # Your code below
  value_counts = Counter(counter.values())
  return value_counts

def vocab_richness_by_characteristic (V:dict, N: int) -> float:
  """ Calculate the richness of vocabulary based on the repetition
      characteristic

      :params:
      :V (dict): key=wordcount; value=number of vocabulary items
      :N (int): The number of tokens in the corpus

      :returns:
      richness (float): The calculated vocabulary richness

  """
  richness = 0.0
  # Write code below
  for key, value in V.items():
    richness += (key**2 * value - N)/N**2
  return richness

Before we calculate the richness for a bigger text collection, let's see what we get for a couple of cooked up examples.

Calculate the richness for

$ V = \{1:1, 2:1, 3:1, 4:1, 5:1\}$, $N=5$

and

$ V = \{1:2, 2:2, 3:0, 4:0, 5:1\}$, $N=5$

In [24]:
V = {1:1, 2:1, 3:1, 4:1, 5:1}
N=5
print (vocab_richness_by_characteristic (V, 5))

1.2000000000000002


In [None]:
V = {1:2, 2:2, 3:0, 4:0, 5:1}
N=5
print (vocab_richness_by_characteristic (V, 5))

0.4


**Sanity check**: The richness metric for the second map of counts should be smaller; I get 0.4 for the second dictionary.

In [25]:
V = counts_map (all_speeches_counter)
richness_characteristic = vocab_richness_by_characteristic (V, len (all_tokens))
print (f"Vocabulary richness according to characteristic={richness_characteristic}")

Vocabulary richness according to characteristic=0.009369180304238165


#### 1.3 Richness based on entropy

**Your turn**: Q4. Implement a function that takes in the tokens as input and calculates the metric in Eq.9 of the paper. [0.5 point]

$$
Entropy = - \sum p_v \log p_v,
$$

where $p_v$ is the relative frequency or the probability of seeing the token $v$ in text.

Write a function that takes the counter object and creates a dictionary for the probabilities of the words.

In [26]:
def vocab_richness_by_entropy (counter:Counter) -> float:
  """ Calculate the richness by calculating the entropy of
  a probability distribution

  :params:
  :counter(Counter): wordcounts for the corpus

  :returns:
  :entropy (float): The entropy of the distribution
  """
  entropy = 0.0
  # Write your code below
  total_word = sum(counter.values())
  for value in counter.values():
    prob = (value/total_word)
    entropy -= prob *np.log(prob)
  return entropy

Again, before we calculate the "richness by entropy" for a bigger text collection, let's see what we get for a couple of cooked up examples.

Calculate the richness for

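the word counts $\{"the":2, "show":2, "must":2, "go":2, "on":2\}$

and

the word counts $\{"the":3, "show":3, "must":1, "go":1, "on":2\}$ (both examples contain 10 tokens over 5 distinct words; the function normalizes the counts into the probabilities $p_v$).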

In [27]:
print (vocab_richness_by_entropy ({"the":2, "show":2, "must":2, "go":2, "on":2}))
print (vocab_richness_by_entropy ({"the":3, "show":3, "must":1, "go":1, "on":2}))

1.6094379124341005
1.5047882836811908


**Sanity check**: We expect a flat distribution to be richer and, consequently, to have higher entropy. I got 1.61 for the first example.
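The value 1.61 is no accident: for a uniform distribution over $k$ words every $p_v = 1/k$, so the entropy reduces to $\log k$, and $\log 5 \approx 1.609$. The sketch below checks this with the standard library only; `toy_entropy`, `uniform`, and `skewed` are hypothetical names, and the natural logarithm is assumed because it matches the outputs above.

```python
import math
from collections import Counter

def toy_entropy(counts):
    """Entropy (natural log) of the distribution obtained by normalizing counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

uniform = Counter({"the": 2, "show": 2, "must": 2, "go": 2, "on": 2})
skewed = Counter({"the": 3, "show": 3, "must": 1, "go": 1, "on": 2})

# A uniform distribution over 5 words has entropy log(5).
print(toy_entropy(uniform), math.log(5))

# Any non-uniform distribution over the same 5 words has lower entropy.
print(toy_entropy(skewed))
```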

In [28]:
richness_entropy = vocab_richness_by_entropy (all_speeches_counter)
print (richness_entropy)

6.790311510032257


#### 1.4 Comparative analysis

**Your turn**: Q5. Report the two richness metrics for Republican and Democratic speeches, respectively. Whose vocabulary is richer: Democrats or Republicans? [0.5 points]

You may want to use the `party` column from the `overall_data` dataframe.

In [33]:
# Write code below
rep_sample = sample[sample['party'] == 'R']
republican_tokens = get_all_tokens_from_corpus (rep_sample)
republican_counter = Counter(republican_tokens)

5681it [00:21, 270.51it/s]


In [34]:
dem_sample = sample[sample['party'] == 'D']
democrat_tokens = get_all_tokens_from_corpus (dem_sample)
democrat_counter = Counter(democrat_tokens)

4270it [00:18, 232.23it/s]


In [None]:
print (f"Vocabulary richness by characteristic for republicans={vocab_richness_by_characteristic (counts_map (republican_counter), len (republican_tokens))}")
print (f"Vocabulary richness by characteristic for democrats={vocab_richness_by_characteristic (counts_map (democrat_counter), len (democrat_tokens))}")


Vocabulary richness by characteristic for republicans=0.00938091923801958
Vocabulary richness by characteristic for democrats=0.009044343113440817


In [35]:
print (f"Vocabulary richness by entropy for republicans={vocab_richness_by_entropy (republican_counter)}")
print (f"Vocabulary richness by entropy for democrats={vocab_richness_by_entropy (democrat_counter)}")

Vocabulary richness by entropy for republicans=6.75557443469104
Vocabulary richness by entropy for democrats=6.77655489344723


## 2. Find distinctive terms

Our objective in this section is to find terms that are distinctive of the two parties i.e. words that indicate partisanship. We'll now implement the method to find such terms.


#### 2.1 Odds ratio or log-odds difference

The odds $o_{w}^{(i)}$ for a word $w$ in group $i$ is given by:

$$
o_{w}^{(i)} = \frac{f_w^{(i)}}{1-f_w^{(i)}},
$$

where $f_w^{(i)}$ is the relative frequency of $w$ in group $i$.

**Note** The relative frequency of a word is nothing but its normalized counts or proportion of occurrences of a word in a corpus.

To get words that are distinctive of democrats, we can calculate the odds ratio for every word $w$

$$
r_w = \frac{o_{w}^{(dem)}}{o_{w}^{(rep)}}
$$

or the log of the odds ratio

$$
r_w^{*} = \log o_{w}^{(dem)} - \log o_{w}^{(rep)}
$$
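A small numeric sketch may help fix the sign convention. The counts below are made up: a word that accounts for 30 of 1000 Democratic tokens and 10 of 1000 Republican tokens. The names `odds`, `f_dem`, and `f_rep` are hypothetical.

```python
import math

def odds(f):
    """Odds corresponding to a relative frequency f (0 < f < 1)."""
    return f / (1 - f)

# Made-up relative frequencies of one word in each group.
f_dem = 30 / 1000
f_rep = 10 / 1000

# Odds ratio and its logarithm, computed in the two equivalent ways.
ratio = odds(f_dem) / odds(f_rep)
log_ratio = math.log(odds(f_dem)) - math.log(odds(f_rep))

print(ratio, log_ratio)
```

A ratio above 1 (equivalently, a positive log ratio) marks a word used relatively more by Democrats; swapping the two groups flips the sign of the log ratio, which is why the Republican words in the lists further down have negative scores.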




**Your turn**: Q. Implement a function to calculate the odds of a word given wordcounts [0.5 points]

To avoid zero counts, we'll add a small non-zero number to the count for every word.

In [135]:
def word_odds (word_counts: dict, smoothing=0.0) -> dict:
  """ Calculate the odds for each word in the word_counts dictionary/Counter

  :params:
  :word_counts (dict): count of words
  :smoothing (float): add a little smoothing to avoid zero counts (default:0.0)

  :returns:
  :odds (dict): odds for all the words

  """
  odds = dict ()
  # Write your code below
  num_words = sum(word_counts.values())

  for key, value in word_counts.items():
    odds[key] = float((value+smoothing*len(word_counts))/(num_words+smoothing*len(word_counts)))
    odds[key] = odds[key]/(1-odds[key])

  return odds
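For comparison, one common smoothing convention is additive (add-$\alpha$) smoothing, where $\alpha$ is added to each word's count and $\alpha \cdot |V|$ to the total, so the smoothed relative frequencies still sum to 1. The sketch below assumes that convention, which may differ from the one intended in this assignment; `smoothed_relative_freqs` is a hypothetical helper and the counts are made up.

```python
def smoothed_relative_freqs(word_counts, vocab, alpha):
    """Add-alpha smoothing (an assumed convention): (count + alpha) / (total + alpha * |V|)."""
    total = sum(word_counts.get(word, 0) for word in vocab) + alpha * len(vocab)
    return {word: (word_counts.get(word, 0) + alpha) / total for word in vocab}

# Made-up counts; "climate" is missing from this group but present in the shared vocabulary.
toy_counts = {"gun": 3, "water": 1}
toy_vocab = ["gun", "water", "climate"]

toy_freqs = smoothed_relative_freqs(toy_counts, toy_vocab, alpha=0.5)
print(toy_freqs)
print(sum(toy_freqs.values()))
```

Under this convention an unseen word gets a small positive frequency, so its odds and log odds are finite.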

**Your turn**: Q. Implement a function to calculate the log odds given wordcounts from each group [0.5 points]

**Hints**
- You'll first have to take the union of the vocabularies from individual groups and then recalculate the word counts based on this common vocabulary for both the groups.
- You will have to get the word_odds

In [136]:
def log_odds_diff (word_counts_group1: dict, word_counts_group2: dict, smoothing=0.0) -> dict:
  """ Calculate the log odds difference for each word in the word_counts dictionary/Counter

  :params:
  :word_counts_group1 (dict): count of words in group 1
  :word_counts_group2 (dict): count of words in group 2
  :smoothing (float): add a little smoothing to avoid zero counts (default:0.0)


  :returns:
  :log_odds: give the log odds diff between group 1 and group 2
  """
  log_odds = dict ()
  # Write your code below
  # Take the union of the two vocabularies
  words = set(word_counts_group1) | set(word_counts_group2)

  # Align both groups to the shared vocabulary; copies keep the callers' counters unchanged
  counts_group1 = {word: word_counts_group1.get(word, 0) for word in words}
  counts_group2 = {word: word_counts_group2.get(word, 0) for word in words}

  word_odds_group1 = word_odds(counts_group1, smoothing)
  word_odds_group2 = word_odds(counts_group2, smoothing)

  for word in words:
    log_odds[word] = word_odds_group1.get(word)/word_odds_group2.get(word)
    log_odds[word] = np.log(log_odds[word])

  return log_odds

**Your turn** Write code to create the word counter by aggregating all the republican speeches; then create a word counter from all democrat speeches; and finally get the difference of the log odds between democrat and republican speeches [0.5 points]

In [106]:
rep_sample = sample[sample['party'] == 'R']
republican_tokens = get_all_tokens_from_corpus (rep_sample)
republican_counter = Counter(republican_tokens)
dem_sample = sample[sample['party'] == 'D']
democrat_tokens = get_all_tokens_from_corpus (dem_sample)
democrat_counter = Counter(democrat_tokens)

5681it [00:19, 287.09it/s]
4270it [00:18, 226.68it/s]


In [137]:
# Write code below
log_odds_score = log_odds_diff (democrat_counter, republican_counter, smoothing=0.01)

In [138]:
print ("Distinctive words for democrats")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=True)[0:10]:
  print (word, f"{score:0.2f}")

Distinctive words for democrats
gun 0.80
republican 0.65
republicans 0.60
climate 0.59
water 0.54
violence 0.53
women 0.45
clean 0.44
puerto 0.41
students 0.37


In [139]:
print ("Distinctive words for republicans")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=False)[0:10]:
  print (word, f"{score:0.2f}")

Distinctive words for republicans
obamacare -0.59
unanimous -0.58
administration -0.55
consent -0.54
ask -0.49
patent -0.42
obama -0.42
president -0.40
religious -0.39
authorized -0.38


**Your turn** Try different values of smoothing from $\{0.001, 0.01, 0.1, 1\}$. Report if the top 10 lists differ for different smoothing values. Give a brief explanation (2-3 sentences) of why you see some of these words strongly associated with Republican or Democratic speeches, respectively [0.5 points]



In [143]:
# Your code below
log_odds_score = log_odds_diff (democrat_counter, republican_counter, smoothing=0.001)
print ("Distinctive words for democrats")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=True)[0:10]:
  print (word, f"{score:0.2f}")
print ("Distinctive words for republicans")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=False)[0:10]:
  print (word, f"{score:0.2f}")

Distinctive words for democrats
gun 1.84
climate 1.42
flint 1.29
violence 1.21
guns 1.20
poverty 1.15
shooting 1.15
tobacco 1.15
carbon 1.11
wage 1.11
Distinctive words for republicans
obamacare -1.81
patent -1.69
garrison -1.40
unborn -1.20
religious -1.19
authorized -1.13
revise -1.12
christians -1.10
arkansas -1.09
entitled -1.07


In [144]:
log_odds_score = log_odds_diff (democrat_counter, republican_counter, smoothing=0.1)
print ("Distinctive words for democrats")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=True)[0:10]:
  print (word, f"{score:0.2f}")
print ("Distinctive words for republicans")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=False)[0:10]:
  print (word, f"{score:0.2f}")

Distinctive words for democrats
republican 0.14
gun 0.13
republicans 0.12
water 0.11
women 0.10
health 0.10
climate 0.09
violence 0.08
public 0.08
are 0.07
Distinctive words for republicans
president -0.20
he -0.17
his -0.16
ask -0.15
unanimous -0.14
consent -0.14
administration -0.13
senate -0.13
committee -0.12
speaker -0.12


In [145]:
log_odds_score = log_odds_diff (democrat_counter, republican_counter, smoothing=1)
print ("Distinctive words for democrats")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=True)[0:10]:
  print (word, f"{score:0.2f}")
print ("Distinctive words for republicans")
for word, score in sorted (log_odds_score.items(), key=lambda x:x[1], reverse=False)[0:10]:
  print (word, f"{score:0.2f}")

Distinctive words for democrats
we 0.02
are 0.02
a 0.02
republican 0.01
gun 0.01
republicans 0.01
health 0.01
water 0.01
in 0.01
women 0.01
Distinctive words for republicans
the -0.04
president -0.04
he -0.03
his -0.03
speaker -0.02
ask -0.02
i -0.02
senate -0.02
unanimous -0.02
committee -0.02


Your answer here: The distinctive words are not the same across smoothing values, but the lists for each party overlap. These words reflect the focus of each party. For example, the words that are usually distinctive for Democrats are gun, health, water, and women; these are social issues that Democrats aim to address. The distinctive words for Republicans include obamacare, president, and unanimous, which likewise indicate the common topics covered in Republican speeches.

#### 2.2 [Extra credit]


[Monroe et al.](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) propose a model-based approach to finding distinctive words (Section 3.3 of their paper) in which they treat the log-odds ratio as a random variable. They estimate the log-odds ratio under their model by Eq. 16 and calculate the variance and the z-score of the log-odds in Eq. 21 and 22, respectively.

**Your turn** Implement Eq. 16, 21, and 22 from the Monroe et al. paper. Report how the word lists obtained by this method differ from those of the simple log-odds in the previous section. [1 point]

In [None]:
# Your code below

Your answer here:

## 3. Find similar documents


In this section, we'll write functions to calculate similarity between documents. We'll first start by representing documents as _tfidf_ vectors.

In [146]:
def get_speeches_as_tokens (frame: pd.DataFrame,
                            speech_field:str="speech",
                            id_field="speech_id") -> dict:
  """ Iterate over the speeches in the dataframe and tokenize the speeches

  :params:
  :frame (pd.DataFrame): The dataframe that contains the entire tabular dataset
  :speech_field (str): The field name in the dataframe that contains the speeches
  :id_field (str): The field name in the dataframe that contains the speech_id

  :returns:
  :all_tokens (dict): Dictionary of tokenized speeches

  """
  all_tokens = dict ()
  for i,row in tqdm(frame.iterrows ()):
    tokens = nltk.word_tokenize (row[speech_field]) # tokenize each speech
    tokens = [token.lower () for token in tokens if token.isalpha()]
    all_tokens[row[id_field]] = tokens
  return all_tokens

In [147]:
# Get the speeches as a tokenized stream
speeches_as_tokens = get_speeches_as_tokens (overall_data,
                                             speech_field="speech",
                                             id_field="speech_id")

# Create a counter
all_tokens_counter = Counter ([token for speech_id in speeches_as_tokens for token in speeches_as_tokens[speech_id]])

# Create a two-way map of word to index and index to word, but only consider
# the most popular 5000 words
iwx = {word: i for i, (word, count) in enumerate (all_tokens_counter.most_common (5000))}
iiwx = {value: key for key,value in iwx.items()}

# Keep only the speeches that contain at least one word from the top 5000 words vocabulary
final_speeches_as_tokens = {speech_id: tokens
                             for speech_id, tokens in speeches_as_tokens.items()
                             if any (token in iwx for token in tokens)}

# Create a two-way map between speech_ids and the row numbers
idx = {speech_id: i for i, speech_id in enumerate (final_speeches_as_tokens)}
iidx = {value: key for key,value in idx.items()}

10000it [00:38, 263.01it/s]


We'll use the `final_speeches_as_tokens` variable

#### 3.1 Term frequency

**Your turn** Write a function to calculate the term frequency matrix. [0.5 points]


- The term frequency matrix has number of rows equal to the number of documents, and the number of columns as the size of the vocab
- Every cell in the matrix corresponds to word $w$ and document $d$; the term frequency or $tf (w,d)$ is

$$
tf(w,d) = \log (1+f_{w,d}),
$$

where $f_{w,d}$ is the relative frequency of word $w$ in document $d$.

In [None]:
def get_term_frequency_matrix (tokens: dict, vocab_index: tuple) -> np.array:
  """ Get the term frequency matrix

  :params:
  :tokens (dict): dictionary of lists; every key is a speech id, and every value is a tokenized speech
  :vocab_index (tuple): contains a mapping of word to index and another mapping from index to word
  :returns:
  :tf_mat (np.array): term frequency matrix

  """
  w2i, i2w = vocab_index
  return tf_mat

In [148]:
tf_mat = get_term_frequency_matrix (final_speeches_as_tokens, (iwx, iiwx))
print (tf_mat[idx[1140046217], iwx['of']])

NameError: ignored

**Sanity check** I get the tf for the word "of" in speech "1140046217" as 0.047.
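One way the term frequency matrix could be built is sketched below on toy data. It assumes that the relative frequency $f_{w,d}$ is the word's count divided by the total number of tokens in the document (including out-of-vocabulary tokens), that the logarithm is natural, and that rows follow the iteration order of the dictionary; none of these choices is confirmed by the assignment, so the result may not reproduce the 0.047 above. The name `toy_term_frequency_matrix` and the toy inputs are hypothetical.

```python
import numpy as np
from collections import Counter

def toy_term_frequency_matrix(docs, w2i):
    """tf[d, w] = log(1 + count(w, d) / len(d)) for in-vocabulary words (assumed definition)."""
    doc_ids = list(docs)
    tf = np.zeros((len(doc_ids), len(w2i)))
    for row, doc_id in enumerate(doc_ids):
        tokens = docs[doc_id]
        if not tokens:
            continue  # an empty document keeps an all-zero row
        for word, count in Counter(tokens).items():
            if word in w2i:
                tf[row, w2i[word]] = np.log1p(count / len(tokens))
    return tf

# Made-up documents and a three-word vocabulary.
toy_docs = {101: ["the", "show", "must", "go", "on"], 102: ["the", "the", "show"]}
toy_w2i = {"the": 0, "show": 1, "must": 2}

toy_tf = toy_term_frequency_matrix(toy_docs, toy_w2i)
print(toy_tf)
```

In document 102 the word "the" accounts for 2 of 3 tokens, so its entry is $\log(1 + 2/3)$, while "must" does not occur and stays at 0.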

#### 3.2 Inverse document frequency

**Your turn** Write a function to calculate the inverse document frequency vector [0.5 points]


- The inverse document frequency vector has number of rows equal to the number of words
- Every cell in the vector corresponds to word $w$; the inverse document frequency or $idf (w,D)$ is

$$
idf(w,D) = \log \left(\frac{N}{n_w}\right),
$$

where $N$ is the number of documents in the collection $D$ and $n_{w}$ is the number of documents in which the word $w$ occurs.


**Hint**: You can compute the idf vector from the term frequency matrix
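The hint works because a word occurs in a document exactly when its tf entry is positive, so counting the positive entries in each column gives $n_w$. The sketch below shows this on a made-up 3×3 matrix; it assumes every vocabulary word occurs in at least one document (otherwise the division would be by zero), and the names `toy_tf`, `doc_freq`, and `toy_idf` are hypothetical.

```python
import numpy as np

# Made-up tf matrix: 3 documents (rows) by 3 words (columns).
toy_tf = np.array([[0.2, 0.0, 0.1],
                   [0.3, 0.4, 0.0],
                   [0.1, 0.0, 0.0]])

# N is the number of documents (rows).
n_docs = toy_tf.shape[0]

# n_w: number of documents with a positive tf entry for each word.
doc_freq = (toy_tf > 0).sum(axis=0)

# idf(w, D) = log(N / n_w); assumes doc_freq has no zeros.
toy_idf = np.log(n_docs / doc_freq)
print(doc_freq, toy_idf)
```

The first word occurs in every document, so its idf is $\log(3/3) = 0$; the other two occur in one document each and get $\log 3$.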

In [None]:
def get_inverse_document_vector (tf_mat: np.array) -> np.array:
  """ Calculate the inverse document vector

  :params:
  :tf_mat (np.array): The term frequency matrix
  :returns:
  :idf_vec (np.array): The inverse document frequency vector
  """

  # Write code here


In [None]:
idf_vec = get_inverse_document_vector (tf_mat)

#### 3.3 TF-IDF

**Your turn** Write a function to calculate the TFIDF matrix [0.5 points]


- The TFIDF matrix should have number of rows equal to number of documents, and number of columns equal to the size of the vocabulary
- Every cell in the matrix corresponds to document $d$ and word $w$; the $tfidf (w,d)$ is

$$
tfidf(w,d) = tf(w,d) \cdot idf(w,D).
$$


**Hint**: You can calculate the tf-idf matrix if you have the tf matrix and the idf vector using numpy's broadcast operations
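Broadcasting here means that multiplying a `(documents, vocab)` matrix by a `(vocab,)` vector with `*` scales each column by the matching vector entry. The toy sketch below, with made-up numbers and hypothetical names, also contrasts this with `np.dot`, which sums over the vocabulary and returns one number per document instead of a matrix.

```python
import numpy as np

# Made-up tf matrix (2 documents x 3 words) and idf vector (3 words).
toy_tf = np.array([[1.0, 2.0, 0.0],
                   [0.0, 1.0, 3.0]])
toy_idf = np.array([0.0, 0.5, 2.0])

# Element-wise product: the (3,) vector is broadcast across both rows.
toy_tfidf = toy_tf * toy_idf
print(toy_tfidf)

# A matrix-vector product collapses the vocabulary axis, which is not what tf-idf needs.
collapsed = np.dot(toy_tf, toy_idf)
print(collapsed.shape)
```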

In [None]:
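def get_tfidf_mat (tf_mat: np.array, idf_vec: np.array) ->np.array:
  """ Scale every column of the tf matrix by the idf of the corresponding word """
  # Element-wise product; an idf_vec of shape (vocab_size,) broadcasts across the rows
  tf_idf_mat = tf_mat * idf_vec
  return tf_idf_mat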


#### 3.4 Similarity metrics

**Your turn**: Write a function to calculate the cosine distance between two vectors. [0.5 points]

The cosine distance is defined as follows.

$$
cos\_dist (x,y) = 1 - \frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}
$$

In [None]:
def cosine_dist (x: np.array, y:np.array) -> float:
  """ cosine distance

  :params:
  :x (np.array): vector
  :y (np.array): vector

  :returns:
  dist (float): distance between two vectors
  """
  dist = 0.0
  # Write code below
  # Divide by the product of the two norms (note the parentheses around the product)
  dist = 1 - np.dot(x,y)/(np.linalg.norm(x)*np.linalg.norm(y))
  return dist
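Two properties make handy test cases for any cosine distance implementation: orthogonal vectors are at distance 1, and vectors pointing in the same direction are at distance 0 regardless of their lengths. The sketch below checks both on made-up vectors with a standalone helper, `toy_cosine_dist` (a hypothetical name), and assumes neither vector is all zeros.

```python
import numpy as np

def toy_cosine_dist(x, y):
    """1 - cos(angle between x and y); assumes both vectors are non-zero."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])   # orthogonal to a
c = np.array([3.0, 0.0])   # same direction as a, three times as long

print(toy_cosine_dist(a, b))  # orthogonal vectors
print(toy_cosine_dist(a, c))  # parallel vectors
```

An implementation that multiplies by one of the norms instead of dividing by both would fail the second check, because the result would then depend on the vector lengths.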

#### 3.5 Pairwise distances


Consider three sets:

- A is a set of pairs where both the speeches are made by republicans
- B is a set of pairs where both the speeches are made by democrats
- C is a set of pairs where the first speech is made by a republican and the second speech is made by a democrat

**Your turn**: Calculate and report the average pairwise distance between speeches in sets A, B, and C. Interpret the result [1 point]


**Hints**:

- This is likely to be a computationally intensive calculation. If you cannot average over all the pairs in the set, then randomly sample 100 pairs from each set and average over them. Repeat the random sampling 10 times and report the range of the average.
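The sampling procedure in the hint can be sketched as follows. Everything here is illustrative: the matrix is random rather than a real tf-idf matrix, the row index lists stand in for the Republican and Democratic speeches, and the names `average_sampled_distance`, `toy_mat`, `rep_rows`, and `dem_rows` are hypothetical. Pairs that would compare a speech with itself are skipped and redrawn.

```python
import random
import numpy as np

def toy_cosine_dist(x, y):
    """1 - cos(angle between x and y); assumes both vectors are non-zero."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def average_sampled_distance(mat, rows_a, rows_b, n_pairs, rng):
    """Average cosine distance over n_pairs random (row from A, row from B) pairs."""
    dists = []
    while len(dists) < n_pairs:
        i, j = rng.choice(rows_a), rng.choice(rows_b)
        if i == j:
            continue  # skip a speech paired with itself
        dists.append(toy_cosine_dist(mat[i], mat[j]))
    return float(np.mean(dists))

# Stand-in data: 20 random non-negative "documents" with 8 features each.
toy_mat = np.random.default_rng(42).random((20, 8)) + 0.01
rep_rows = list(range(0, 10))    # pretend rows 0-9 are Republican speeches
dem_rows = list(range(10, 20))   # pretend rows 10-19 are Democratic speeches

# Sample 100 pairs, repeat 10 times, and report the range of the averages.
rng = random.Random(42)
averages = [average_sampled_distance(toy_mat, rep_rows, dem_rows, 100, rng)
            for _ in range(10)]
print(min(averages), max(averages))
```

For sets A and B, the same function can be called with the same row list passed twice; the self-pair check then ensures that only distinct speeches are compared.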


In [None]:
# Your code here
def get_all_tokens_by_speech(frame: pd.DataFrame, field_name: str = "speech") -> list:
    return [
        [token.lower() for token in nltk.word_tokenize(row[field_name]) if token.isalpha()]
        for _, row in tqdm(frame.iterrows())
    ]


Your answer here