**The following notebook demonstrates a possible solution for task 3**


We start by importing the relevant libraries. For the NLP applications, we will use the NLTK library. For summarizing results we will use pandas.

In [14]:
#Imports:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import string

For the purposes of this project, I chose to use the Gutenberg archive, as some of its texts are readily available in the NLTK library. More specifically, my corpus will be sourced from 'Emma', a novel by Jane Austen.

In [15]:
#download corpus, tokenizer models, stopwords and WordNet
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

#Get the gutenberg data:
nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

As the corpus is already tokenized, we will start directly with pre-processing. This will include stop-word removal (English in our case), punctuation removal and lemmatization. Since we will be exploring bi-grams, lemmatization is preferred over stemming, as we want our words to remain readable when exploring the results.

As the corpus has some additional tokens that do not carry any contextual meaning, we add those to the stop words (alongside the punctuation) in order to remove them at the pre-processing stage.

In [16]:
#define stop words list and lemmatizer:
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

In [17]:
#append some additional tokens for preprocessing:
custom_list = ['--',',"','."','----']
stop_words+=custom_list

In [18]:
#load the tokenized words and sentences of 'Emma':
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
emma_sentences = nltk.corpus.gutenberg.sents('austen-emma.txt')

Below we define a function to pre-process sentences. The function will remove stop words and punctuation, transform words to lowercase, and lemmatize where possible.

In [19]:
def preprocess_sentence(sent):
  '''
  Takes a tokenized sentence and returns its
  preprocessed form.
  Performs stop-word and punctuation removal, case
  normalization and lemmatization.
  '''
  new_sent = []
  for word in sent:
    #note: the stop-word check runs before lowercasing, so capitalised
    #tokens such as 'I' pass the check and are kept as 'i'
    if word not in stop_words and word not in string.punctuation:
      word = word.lower()
      word = lemmatizer.lemmatize(word)
      new_sent.append(word)
  return new_sent
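
The order of operations matters here: NLTK's English stop-word list is lowercase, so a capitalised token such as 'I' or 'My' is not matched when the check runs before lowercasing. The following is a minimal sketch of a variant that lowercases first. It is not the version used for the results in this notebook; `toy_stop_words` is a hypothetical stand-in for the NLTK list so that the snippet runs without downloads, and lemmatization is left out.

```python
import string

# Hypothetical miniature stop-word list (stand-in for stopwords.words('english')).
toy_stop_words = {"i", "my", "the", "a", "to"}

def preprocess_case_insensitive(sent, stop_words=toy_stop_words):
    """Lowercase each token first, then drop stop words and punctuation."""
    cleaned = []
    for word in sent:
        word = word.lower()
        # Skip stop words and tokens made up only of punctuation characters.
        if word in stop_words or all(ch in string.punctuation for ch in word):
            continue
        cleaned.append(word)
    return cleaned

example = ["I", "dare", "say", "My", "dear", "Emma", ",", "--"]
result = preprocess_case_insensitive(example)
print(result)
```

With this ordering, 'I' and 'My' are removed together with the punctuation tokens, so pronoun-initial bigrams such as ('i', 'think') would no longer be formed.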

With the pre-processing function defined, we edit our corpus and create a pandas dataframe that has a pre-processed sentence at each row. The dataframe step is not necessary but it helps us visualise the results.

In [20]:
#create pandas dataframe with preprocessed sentences:
data = []
for sent in emma_sentences:
  new_sent = preprocess_sentence(sent)
  if len(new_sent) >= 2:
    data.append(new_sent)

df = pd.DataFrame()
df['processed'] = data

Before analysing the collocations, we want to break the corpus down into bi-grams. To do this, we first merge all the sentences into one list and then use the tools provided by NLTK to extract the bigrams.

In [21]:
#create a single list from all sentences
merged_sents = [word for sent in df['processed'] for word in sent]

In [22]:
bigrams = nltk.collocations.BigramAssocMeasures()

In [23]:
find_bigrams = nltk.collocations.BigramCollocationFinder.from_words(merged_sents)

With the bigrams extracted, we need to decide on the methods we will use to analyse collocations. We will take three different approaches and compare the results. 

We start with a simple frequency count. This naive approach only takes into account the frequency of each bigram in the corpus and is not expected to give great results. However, it will be used as a baseline for comparison. To extract the most frequent bigrams, we simply count and sort in descending order. The 10 most frequent bigrams are shown below.

In [24]:
bigram_frequency = find_bigrams.ngram_fd.items()

In [25]:
#create frequency based table:
bigram_freq_table = pd.DataFrame(list(bigram_frequency), columns=['bigram','frequency']).sort_values(by='frequency', ascending=False)

In [26]:
bigram_freq_table.head(10).reset_index(drop=True)

|   | bigram | frequency |
|---|---|---|
| 0 | (mr, weston) | 420 |
| 1 | (mr, elton) | 374 |
| 2 | (mr, knightley) | 301 |
| 3 | (miss, woodhouse) | 173 |
| 4 | (frank, churchill) | 151 |
| 5 | (mr, woodhouse) | 135 |
| 6 | (i, think) | 132 |
| 7 | (every, thing) | 126 |
| 8 | (miss, fairfax) | 125 |
| 9 | (i, shall) | 122 |


We notice that the majority of the bigrams consist of a title and a name. In the context of a novel (our corpus), these are genuine collocations, as they most likely refer to main characters. However, the table also includes bigrams such as 'i think' and 'i shall' that do not carry any contextual information.
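
The frequency baseline amounts to counting adjacent token pairs. As a minimal pure-Python sketch of the same idea, the snippet below uses `collections.Counter` on a made-up token list (the tokens are illustrative and not taken from the corpus):

```python
from collections import Counter

# Illustrative tokens only; in the notebook the input would be merged_sents.
tokens = ["mr", "weston", "said", "mr", "elton", "met", "mr", "weston"]

# Pair each token with its successor to form bigrams.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Most frequent bigrams first.
top = bigram_counts.most_common(2)
print(top)
```

Because the sentences are merged into a single list, pairs that straddle a sentence boundary are counted as well, both in this sketch and when a finder is built from one flat list of words.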

In [27]:
#keep only bigrams that occur at least 25 times (affects the scores computed below)
find_bigrams.apply_freq_filter(25)

A more sophisticated approach describes the likelihood of two words occurring together while taking their individual frequencies into account. This criterion is referred to as Pointwise Mutual Information (PMI) and is given by:

$\mathrm{PMI}(w_1,w_2) = \log_2\left(\frac{P(w_1,w_2)}{P(w_1)P(w_2)}\right)$

If either of the two words is rare on its own but the bigram is common, the pair is likely to be a collocation, whereas if both words are common, a frequent bigram might simply be the product of chance co-occurrence. PMI has a known drawback: it tends to overrate bigrams made of rare words, which can be marked as significant after only a handful of occurrences. The frequency filter applied above (a minimum of 25 occurrences) limits this effect.
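
As a worked example, PMI can be computed directly from counts. The sketch below uses base-2 logarithms, which is what NLTK's `pmi` measure uses. The counts are assumptions inferred from the scores reported in this notebook, not values read from the corpus: a bigram seen 31 times whose two words each occur 31 times, in a corpus of 83,571 tokens.

```python
import math

def pmi(count_bigram, count_w1, count_w2, total):
    """PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))), with probabilities from raw counts."""
    p_bigram = count_bigram / total
    p_w1 = count_w1 / total
    p_w2 = count_w2 / total
    return math.log2(p_bigram / (p_w1 * p_w2))

# Assumed counts (inferred, not measured): the two words only ever occur together.
score = pmi(count_bigram=31, count_w1=31, count_w2=31, total=83571)
print(round(score, 4))

# Two frequent words that co-occur exactly as often as chance predicts
# score approximately 0 (up to floating-point rounding).
independent = pmi(count_bigram=100, count_w1=1000, count_w2=1000, total=10000)
print(independent)
```

Under these assumed counts the score comes out at about 11.3965, in line with the value reported for ('maple', 'grove'), while the second call shows that frequent words co-occurring at chance level receive no credit.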

In [28]:
#create PMI based table:
bigram_pmi_table = pd.DataFrame(list(find_bigrams.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)

In [29]:
bigram_pmi_table.head(10)

|   | bigram | PMI |
|---|---|---|
| 0 | (maple, grove) | 11.396518 |
| 1 | (robert, martin) | 9.704998 |
| 2 | (colonel, campbell) | 9.681323 |
| 3 | (frank, churchill) | 7.993862 |
| 4 | (great, deal) | 7.782759 |
| 5 | (dare, say) | 7.702622 |
| 6 | (young, man) | 7.281553 |
| 7 | (young, lady) | 7.205037 |
| 8 | (my, dear) | 7.135071 |
| 9 | (john, knightley) | 7.124386 |


The 10 highest-scoring PMI bigrams (shown above) seem to capture more context and collocation than the frequency-based method. Although names are again present, verbs are less common. For ('dare', 'say') we can be reasonably confident of significance given the context of the book, where 'dare say' is a recurring expression. We also notice 'Maple Grove', the home of the Sucklings, Mrs. Elton's sister and brother-in-law.

Finally, as a third statistic we will use the well-established chi-squared test. Here the null hypothesis is that the two words occur independently of each other. The statistic is given by:

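$\chi^2 = \sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$

where the sum runs over the cells of the 2×2 contingency table of the bigram, $E_{ij}$ denotes the count expected if the two words were independent, and $O_{ij}$ denotes the observed count. As an example, $E(\text{dare say}) = \frac{\mathrm{Count}(\text{dare})\,\mathrm{Count}(\text{say})}{D}$, where $D$ is the corpus size.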
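
The statistic can be computed by hand from the 2×2 contingency table of observed counts, with the expected counts derived from the row and column totals. The following is a minimal sketch, not NLTK's implementation. The counts in the usage example are assumptions inferred from the scores reported in this notebook (both words occurring 31 times, always together, in a corpus of 83,571 tokens), not values read from the corpus.

```python
def chi_squared(n_both, n_w1, n_w2, total):
    """Pearson chi-squared for a bigram (w1, w2) from a 2x2 contingency table."""
    # Observed counts: rows = w1 present/absent, columns = w2 present/absent.
    observed = [
        [n_both, n_w1 - n_both],
        [n_w2 - n_both, total - n_w1 - n_w2 + n_both],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / D.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Assumed counts (inferred, not measured): both words occur 31 times, always together.
stat = chi_squared(n_both=31, n_w1=31, n_w2=31, total=83571)
print(round(stat, 2))
```

When two words never occur apart, the statistic equals the corpus size $D$, which would explain a score of exactly 83571.0 for ('maple', 'grove').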

In [30]:
#create chi-squared based table
bigram_chisq_table = pd.DataFrame(list(find_bigrams.score_ngrams(bigrams.chi_sq)), columns=['bigram','chi-sq']).sort_values(by='chi-sq', ascending=False)

In [31]:
bigram_chisq_table.head(10)

|   | bigram | chi-sq |
|---|---|---|
| 0 | (maple, grove) | 83571.0 |
| 1 | (frank, churchill) | 38395.113403 |
| 2 | (robert, martin) | 25851.55473 |
| 3 | (colonel, campbell) | 22964.431733 |
| 4 | (mr, weston) | 17598.13772 |
| 5 | (mr, elton) | 16322.448142 |
| 6 | (jane, fairfax) | 14111.8192 |
| 7 | (great, deal) | 14025.649105 |
| 8 | (miss, woodhouse) | 13117.607255 |
| 9 | (young, man) | 12967.796057 |


The highest-scoring bigrams are similar to the PMI results. In order to visually compare all three measures, we summarise the top 10 of each in the table below:

In [32]:
no_elements = 10
top_freq = bigram_freq_table[:no_elements].bigram.values
top_pmi = bigram_pmi_table[:no_elements].bigram.values
top_chi = bigram_chisq_table[:no_elements].bigram.values

In [33]:
comparison_table = pd.DataFrame([top_freq,top_pmi,top_chi]).T
comparison_table.columns = ['Frequency-Based','PMI','Chi-Squared']

In [34]:
comparison_table

|   | Frequency-Based | PMI | Chi-Squared |
|---|---|---|---|
| 0 | (mr, weston) | (maple, grove) | (maple, grove) |
| 1 | (mr, elton) | (robert, martin) | (frank, churchill) |
| 2 | (mr, knightley) | (colonel, campbell) | (robert, martin) |
| 3 | (miss, woodhouse) | (frank, churchill) | (colonel, campbell) |
| 4 | (frank, churchill) | (great, deal) | (mr, weston) |
| 5 | (mr, woodhouse) | (dare, say) | (mr, elton) |
| 6 | (i, think) | (young, man) | (jane, fairfax) |
| 7 | (every, thing) | (young, lady) | (great, deal) |
| 8 | (miss, fairfax) | (my, dear) | (miss, woodhouse) |
| 9 | (i, shall) | (john, knightley) | (young, man) |


From a visual comparison alone, we can immediately set aside the frequency-based model. Of the other two statistics, PMI seems to perform better, as it returns fewer title-related bigrams (mr weston, mr elton). It is worth noting that the frequency-based approach is prone to ranking pronoun-based bigrams as important. As this was largely not an issue for the other statistics (only 'my dear' appears in the PMI top 10), no measures were taken to mitigate it. However, as an additional step, we could POS-tag our bigrams and filter out combinations that start with pronouns.
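
As a rough illustration of that last step, the sketch below drops bigrams whose first word is a pronoun. To stay self-contained it swaps the POS tagger for a small hand-written pronoun set (`PRONOUNS`, a hypothetical stand-in); with NLTK, the same test could instead be driven by tags such as `PRP` and `PRP$` returned by `nltk.pos_tag`.

```python
# Hypothetical stand-in for a POS tagger: a hand-written set of pronouns.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "my", "your", "his", "her", "our", "their"}

def drop_pronoun_initial(scored_bigrams, pronouns=PRONOUNS):
    """Keep only (bigram, score) pairs whose first word is not a pronoun."""
    return [(bigram, score) for bigram, score in scored_bigrams
            if bigram[0] not in pronouns]

# A few rows taken from the frequency table earlier in the notebook.
rows = [(("mr", "weston"), 420), (("i", "think"), 132),
        (("every", "thing"), 126), (("i", "shall"), 122)]
filtered = drop_pronoun_initial(rows)
print(filtered)
```

A fixed word list is cruder than tagging, since it cannot tell a possessive 'her' from an object 'her', but it is enough to show the shape of the filter.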