# Part 3 - Text analysis

---

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code provided in this notebook is for orientation only. The imports below are sufficient to implement a valid answer.

### Imports, data loading and helper functions

We first import pandas, numpy and some useful nltk and collections modules, download the nltk resources we need, and then load the reviews dataframe.

In [1]:
import os
import re
import string
from collections import Counter, defaultdict
from datetime import datetime

import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

In [2]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/azureuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# load stopwords
sw = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

In [4]:
# load data
df = pd.read_csv('reviews.csv')
# replace missing reviews with empty strings
df.comments = df.comments.fillna('')

In [5]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [6]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that takes the original dataframe as input and returns it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [7]:
def process_text(comment):
    '''
    Tokenize and clean one review; applied to each row of the 'comments' column

    Parameters
    ----------
    comment: string
        raw text of one row of the 'comments' column

    Returns
    ----------
    tokens: list of str
        cleaned tokens (original case is preserved)

    '''
    # Remove punctuation
    comment = comment.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokenized_comment = word_tokenize(comment)
    # Remove stop words. The check is case-sensitive and sw is lower case,
    # so capitalised stop words such as 'The' are kept.
    tokens = [word for word in tokenized_comment if word not in sw]
    # Lemmatization
    tokens = [wordnet_lemmatizer.lemmatize(word) for word in tokens]
    # Remove single-character tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

In [8]:
def process_reviews(df):
  '''
  Processes the raw reviews in the original dataframe and returns it with three additional columns:
  tokenized, tagged and lower_tagged

  Parameters
  ----------
  df: pandas.DataFrame
    Dataframe containing the Airbnb reviews

  Returns
  ----------
  df: pandas.DataFrame
    The original dataframe with three additional columns:
      tokenized - lower-cased tokens of each review
      tagged - Part-of-Speech (PoS) tags computed on the original-case tokens
      lower_tagged - PoS tags computed on the lower-cased tokens
  '''
  # Tokenize and clean each review (case is still preserved here)
  df['tokenized'] = df['comments'].apply(process_text)
  # PoS tagging on the original-case tokens
  df['tagged'] = df['tokenized'].apply(pos_tag)
  # Lower-case the tokens
  df['tokenized'] = df['tokenized'].apply(lambda x: [word.lower() for word in x])
  # PoS tagging on the lower-cased tokens
  df['lower_tagged'] = df['tokenized'].apply(pos_tag)
  return df

In [9]:
df = process_reviews(df)
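
One detail of `process_text` is easier to see in isolation: the stop-word check runs before the tokens are lower-cased, so capitalised stop words survive the filter. The sketch below reproduces this with the standard library only. The three-word stop list and the whitespace split are assumptions that stand in for the nltk stop-word list and `word_tokenize`; the names `toy_stopwords` and `strip_and_split` are hypothetical.

```python
import string

# Hypothetical miniature stop list; the notebook uses nltk's English list.
toy_stopwords = {"the", "was", "and"}

def strip_and_split(text):
    """Remove ASCII punctuation, then split on whitespace (stand-in tokenizer)."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

tokens = strip_and_split("The place was nice, and the host was great!")

# Case-sensitive filter, as in process_text: "The" is kept.
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Case-insensitive variant: "The" is removed as well.
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)    # ['The', 'place', 'nice', 'host', 'great']
print(case_insensitive)  # ['place', 'nice', 'host', 'great']
```

Whether the case-insensitive variant is preferable depends on whether the original-case `tagged` column should keep sentence-initial stop words; the notebook keeps them.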

### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [10]:
vocab_size = 1000

In [11]:
def get_vocab(df):
  '''
  Takes the dataframe generated above and returns two lists: the 1000 most frequent nouns and the 1000 most frequent verbs or
  adjectives

  Parameters
  ----------
  df: pandas.DataFrame
    Dataframe containing the Airbnb reviews and the additional columns

  Returns
  ----------
  cent_vocab: List
    list containing the 1000 most frequent nouns (center words)
  
  cont_vocab: List
    list containing the 1000 most frequent verbs or adjectives (context words)
  '''

  # Flatten the per-review lists of (word, tag) pairs into a single list.
  # A comprehension is linear in the number of tokens; sum(list_of_lists, []) is quadratic.
  tagged_list = [pair for review in df['lower_tagged'] for pair in review]
  # Extract nouns (Penn Treebank tags starting with 'N')
  nouns = [word for word, tag in tagged_list if tag[0] == 'N']
  # Extract adjectives ('J') and verbs ('V')
  verbs_adjs = [word for word, tag in tagged_list if tag[0] in ('J', 'V')]

  # Most frequent nouns
  cent_vocab = [word for word, count in Counter(nouns).most_common(vocab_size)]

  # Most frequent verbs and adjectives
  cont_vocab = [word for word, count in Counter(verbs_adjs).most_common(vocab_size)]

  return cent_vocab, cont_vocab

In [12]:
cent_vocab, cont_vocab = get_vocab(df)
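
The selection logic in `get_vocab` can be checked on a handful of hand-written pairs. This is a minimal sketch on made-up data; `toy_tagged` and `most_frequent` are hypothetical names, and the tags only imitate the Penn Treebank style that `pos_tag` returns.

```python
from collections import Counter

# Made-up (word, tag) pairs in Penn Treebank style.
toy_tagged = [
    ("room", "NN"), ("clean", "JJ"), ("host", "NN"), ("room", "NN"),
    ("loved", "VBD"), ("clean", "JJ"), ("rooms", "NNS"), ("great", "JJ"),
]

def most_frequent(tagged, prefixes, size):
    """Return the `size` most frequent words whose tag starts with any prefix."""
    counts = Counter(word for word, tag in tagged if tag.startswith(prefixes))
    return [word for word, _ in counts.most_common(size)]

toy_center = most_frequent(toy_tagged, ("N",), size=2)        # nouns
toy_context = most_frequent(toy_tagged, ("J", "V"), size=2)   # adjectives and verbs
print(toy_center, toy_context)
```

Note that a word tagged as a noun in one review and as a verb in another can end up in both vocabularies, which is one of the exceptional cases the next step has to handle.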

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [13]:
def get_coocs(df, cent_vocab, cont_vocab):
  '''
  Takes the dataframe and lists generated above and returns a dictionary of dictionaries that maps each center word
  to the counts of its associated context words

  Parameters
  ----------
  df: pandas.DataFrame
    Dataframe containing the Airbnb reviews and the additional columns
  
  cent_vocab: List
    list containing the 1000 most frequent nouns (center words)
  
  cont_vocab: List
    list containing the 1000 most frequent verbs or adjectives (context words)

  Returns
  ----------
  coocs: Dict
    dictionary mapping each center word to a dictionary of {context word: co-occurrence count}
  '''

  token_sent_list = df['tokenized'].to_list()
  cont_set = set(cont_vocab)  # built once for fast membership tests
  # One Counter per center word; a word is never counted as its own context
  coocs = {ii:Counter({jj:0 for jj in cont_vocab if jj!=ii}) for ii in cent_vocab}

  k=5  # Window size: the context is the k tokens on either side of the center word

  for sen in token_sent_list:  # each sen is the token list of one full review
      used_cent_words = set()  # only the first occurrence of a center word in a review is counted
      for ii, word in enumerate(sen):
          if word in coocs and word not in used_cent_words:
              used_cent_words.add(word)
              # Slicing clips at the end of the list, so only the left edge needs max()
              window = set(sen[max(0, ii-k):ii+k+1])
              # Each context word is counted at most once per window
              c = Counter(window & cont_set)
              del c[word]  # drop the center word itself if it is also a context word
              coocs[word] += c  # in-place Counter addition drops zero counts

  coocs = {ii:dict(coocs[ii]) for ii in cent_vocab}  # final dictionary of cent_vocab and cont_vocab
  return coocs

In [14]:
coocs = get_coocs(df, cent_vocab, cont_vocab)
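
The windowing rule is easiest to verify on a short token list. The sketch below follows the same conventions as `get_coocs` (symmetric window, first occurrence of each center word only, each context word counted once per window) but uses a window of 2 and invented tokens; `window_coocs` and the toy names are hypothetical.

```python
from collections import Counter

def window_coocs(tokens, centers, contexts, k=2):
    """Count context words within k tokens of the first occurrence of each center word."""
    counts = {c: Counter() for c in centers}
    seen = set()
    for i, tok in enumerate(tokens):
        if tok in counts and tok not in seen:
            seen.add(tok)
            # Clip the window at the left edge; slicing clips the right edge.
            window = set(tokens[max(0, i - k): i + k + 1])
            # A center word never counts as its own context.
            hits = (window & contexts) - {tok}
            counts[tok].update(hits)
    return counts

toy_tokens = ["great", "room", "clean", "quiet", "street", "noisy", "room"]
toy_counts = window_coocs(
    toy_tokens,
    centers={"room", "street"},
    contexts={"great", "clean", "quiet", "noisy"},
    k=2,
)
print(dict(toy_counts["room"]))
print(dict(toy_counts["street"]))
```

Here the second `room` is skipped, so `noisy` is not counted for `room` even though it sits next to that later occurrence. Counting every occurrence instead would be an equally defensible choice; it would weight reviews that repeat a noun more heavily.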

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [15]:
def cooc_dict2df(coocs):
    '''
    Takes the coocs dictionary and converts it into a dataframe. Also adds in missing columns or rows

    Parameters
    ----------
    coocs: Dict
        Dictionary mapping each center word to a dictionary of its associated context words and counts

    Returns
    ----------
    coocdf: pandas.DataFrame
        Dataframe with one row per center word and one column per context word
    '''

    # Convert the dictionary of dictionaries to a dataframe (one row per center word)
    coocdf = pd.DataFrame.from_dict(coocs, orient='index')
    # Add any missing center word (row) or context word (column) and fix their order.
    # Note: this relies on the notebook-level cent_vocab and cont_vocab lists.
    coocdf = coocdf.reindex(index=cent_vocab, columns=cont_vocab)
    # Pairs that never co-occur get a count of 0
    coocdf = coocdf.fillna(0).astype(int)

    return coocdf

In [16]:
coocdf = cooc_dict2df(coocs)
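
The zero-filling requirement can be illustrated without pandas. The following sketch, on made-up counts, expands a sparse dictionary of dictionaries into a dense table; `to_dense` and the toy names are hypothetical and are not part of the notebook.

```python
def to_dense(coocs, rows, cols):
    """Expand a sparse dict of dicts into a dense row-major table, using 0 for missing pairs."""
    return [[coocs.get(r, {}).get(c, 0) for c in cols] for r in rows]

toy_coocs = {"room": {"clean": 3, "quiet": 1}, "street": {"noisy": 2}}
toy_rows = ["room", "street", "towel"]   # "towel" has no entry at all
toy_cols = ["clean", "quiet", "noisy"]

dense = to_dense(toy_coocs, toy_rows, toy_cols)
for word, row in zip(toy_rows, dense):
    print(word, row)
```

Both kinds of gap are covered: a pair missing inside an existing row (`room`/`noisy`) and a center word with no entries at all (`towel`).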

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [17]:
def cooc2pmi(df):
    '''
    Takes the coocdf dataframe and replaces the raw co-occurrences with the PMI scores.
    Negative scores are clipped to 0, so the result is positive PMI (PPMI).

    Parameters
    ----------
    df: pandas.DataFrame
        Dataframe which is the coocs dictionary converted into a dataframe    
    

    Returns
    ----------
    pmidf: pandas.DataFrame
        Dataframe with the PMI scores

    '''
    arr = df.to_numpy()  # numpy array of raw counts
    
    # p(y|x): each count divided by its row (center word) total
    row_totals = arr.sum(axis=1).astype(float)
    prob_cols_given_row = (arr.T / row_totals).T

    # p(y): each column (context word) total divided by the grand total
    col_totals = arr.sum(axis=0).astype(float)
    prob_of_cols = col_totals / col_totals.sum()
    
    # PMI(x, y) = log( p(y|x) / p(y) ) = log( p(x, y) / (p(x) p(y)) )
    ratio = prob_cols_given_row / prob_of_cols
    ratio[ratio==0] = 0.00001  # avoid log(0); these cells are clipped to 0 below
    _pmi = np.log(ratio)
    _pmi[_pmi < 0] = 0  # keep only positive associations

    pmidf = pd.DataFrame(_pmi, columns=df.columns, index=df.index)

    return pmidf

In [18]:
pmidf = cooc2pmi(coocdf)
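
A worked example on a 2x2 table makes the arithmetic concrete. The counts below are invented, and `ppmi` is a hypothetical helper that computes the same clipped score cell by cell using only the standard library.

```python
import math

# Made-up co-occurrence counts: outer keys are center words, inner keys are context words.
counts = {
    "coffee": {"hot": 8, "clean": 2},
    "room":   {"hot": 2, "clean": 8},
}
total = sum(sum(row.values()) for row in counts.values())  # 20

def ppmi(center, context):
    """Positive PMI: max(0, log(p(x, y) / (p(x) * p(y))))."""
    joint = counts[center][context] / total
    p_center = sum(counts[center].values()) / total
    p_context = sum(row[context] for row in counts.values()) / total
    if joint == 0:
        return 0.0  # a pair that never co-occurs gets a score of 0
    return max(0.0, math.log(joint / (p_center * p_context)))

print(round(ppmi("coffee", "hot"), 3))  # log(0.4 / 0.25) = log(1.6), about 0.47
print(ppmi("coffee", "clean"))          # log(0.1 / 0.25) is negative, clipped to 0.0
```

Dividing the joint probability by `p(x) * p(y)` is the same as dividing `p(y|x)` by `p(y)`, which is the form used in `cooc2pmi`.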

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in descending order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [19]:
def topk(df, center_word, N=10):
    '''
    Takes the pmidf dataframe and finds the top N context words for a given center word

    Parameters
    ----------
    df: pandas.DataFrame
        Dataframe containing the PMI scores

    center_word: string
        string of a chosen noun, e.g 'location'
    
    N: int
        Used to specify the number of context words to find.
        Default = 10

    Returns
    ----------
    list
        returns a list of N strings, in descending order of their PMI score with the center word

    '''
    try:
        # Sort the center word's row by PMI score, highest first
        values = df.loc[center_word, :].sort_values(ascending=False)
        # Keep the top N context words
        return values.index[:N].tolist()
    except KeyError:
        # center_word is not a row of df
        return "Center Word does not exist."

In [20]:
topk(pmidf, 'coffee')

['tea',
 'kettle',
 'nespresso',
 'microwave',
 'complimentary',
 'fridge',
 'bread',
 'snack',
 'dish',
 'cheese']
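
The ranking step itself is a sort followed by a slice. A minimal sketch on invented scores, with the hypothetical helper `top_n` operating on a plain dictionary instead of a dataframe row:

```python
def top_n(scores, n=3):
    """Return the n keys with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Made-up PMI scores for one center word.
toy_scores = {"hot": 0.47, "fresh": 1.2, "clean": 0.0, "strong": 0.9}
print(top_n(toy_scores, n=2))  # ['fresh', 'strong']
```

Because negative scores are clipped to 0 in the previous step, many context words tie at 0. If `N` exceeds the number of positively scored words, the order of the remaining entries carries no information.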

In [21]:
coocdf.shape

(1000, 1000)

In [22]:
pmidf.shape

(1000, 1000)