# Part 3 - Text analysis

# 3.a Computing PMI

In this assessment, you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code provided in this notebook is for orientation only. The imports below are sufficient to implement a valid answer.

### Imports, data loading and helper functions

We first connect to Google Drive, import pandas, numpy, and some useful nltk and collections modules, then load the DataFrame and define a function for printing the current time, which is useful for logging progress in some of the tasks.

In [None]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
import string
tqdm.pandas()


In [None]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# load stopwords
sw = set(stopwords.words('english'))

In [None]:
p = ""
df = pd.read_csv(os.path.join(p,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [None]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [None]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that takes the original DataFrame as input and returns it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.
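
As a rough illustration of the `tagged` and `lower_tagged` steps, the sketch below strips punctuation from a toy review and lowercases the words of a tagged list. The tags are written by hand for the example (they are not tagger output), and the helper names `strip_punct_and_split` and `lower_tagged` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def strip_punct_and_split(text):
    """Remove punctuation, then split the text on whitespace."""
    return text.translate(STRIP_PUNCT).split()

def lower_tagged(tagged):
    """Lowercase the word of every (word, tag) pair; tags stay unchanged."""
    return [(word.lower(), tag) for word, tag in tagged]

tokens = strip_punct_and_split("Daniel is really cool. The place was nice!")

# Hand-written tags for illustration only (assumed, not produced by a tagger).
tagged = [("Daniel", "NNP"), ("is", "VBZ"), ("really", "RB"), ("cool", "JJ")]
lowered = lower_tagged(tagged)
```

Lowercasing after tagging means the tagger still sees the original capitalisation, while the later counting steps treat `Daniel` and `daniel` as the same word.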

In [None]:
def process_reviews(df):
  '''
  Add three columns to the reviews DataFrame:

  1- tokenized: each review split into sentences

  2- tagged: each review with punctuation removed, split into words and POS-tagged

  3- lower_tagged: same as tagged, with every word lowercased (tags unchanged)

  return:
    dataframe with the new columns
  '''
  # sentence-tokenize each review
  df["tokenized"] = df.comments.apply(sent_tokenize)

  # remove punctuation, split on whitespace and POS-tag the words of each review
  strip_punct = str.maketrans('', '', string.punctuation)
  df["tagged"] = [pos_tag(comment.translate(strip_punct).split()) for comment in df.comments]

  # lowercase the words only after tagging, so the tagger sees the original casing
  df["lower_tagged"] = [[(word.lower(), tag) for word, tag in tagged] for tagged in df.tagged]

  return df

In [None]:
df = process_reviews(df)

In [None]:
df.head(10)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel is really cool., The place was nice an...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NNP), (is, VBZ), (really, RB), (cool..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel is the most amazing host!, His place i...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NNP), (is, VBZ), (the, DT), (most, R..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We had such a great time in Amsterdam., Danie...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very professional operation., Room is very cl...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel is highly recommended., He provided al...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NNP), (is, VBZ), (highly, RB), (reco..."
5,2818,4748,2009-06-29,20192,Jie,Daniel was a great host! He made everything so...,"[Daniel was a great host!, He made everything ...","[(Daniel, NNP), (was, VBD), (a, DT), (great, J...","[(daniel, NNP), (was, VBD), (a, DT), (great, J..."
6,2818,5202,2009-07-07,23055,Vanessa,Daniele is an amazing host! He provided everyt...,"[Daniele is an amazing host!, He provided ever...","[(Daniele, NNP), (is, VBZ), (an, DT), (amazing...","[(daniele, NNP), (is, VBZ), (an, DT), (amazing..."
7,2818,9131,2009-09-06,26343,Katja,You can´t have a nicer start in Amsterdam. Dan...,"[You can´t have a nicer start in Amsterdam., D...","[(You, PRP), (can´t, VBP), (have, VB), (a, DT)...","[(you, PRP), (can´t, VBP), (have, VB), (a, DT)..."
8,2818,12103,2009-10-01,40999,Marie-Eve,Daniel was a fantastic host. His place is calm...,"[Daniel was a fantastic host., His place is ca...","[(Daniel, NNP), (was, VBD), (a, DT), (fantasti...","[(daniel, NNP), (was, VBD), (a, DT), (fantasti..."
9,2818,16196,2009-11-04,38623,Graham,Daniel was great. He couldn.t do enough for us...,"[Daniel was great., He couldn.t do enough for ...","[(Daniel, NNP), (was, VBD), (great, JJ), (He, ...","[(daniel, NNP), (was, VBD), (great, JJ), (he, ..."


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 
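
One possible way to pick the most frequent words for a group of tags is sketched below on toy data. It is a minimal sketch, not the required solution: the helper name `top_words` and the sample reviews are hypothetical, and it counts each word across all of its matching tags, which is one of several reasonable choices.

```python
from collections import Counter

def top_words(tagged_reviews, tag_prefixes, k):
    """Return the k most frequent words whose tag starts with one of tag_prefixes.

    tag_prefixes is a string or a tuple of strings, as accepted by str.startswith.
    """
    counts = Counter(
        word
        for review in tagged_reviews
        for word, tag in review
        if tag.startswith(tag_prefixes)
    )
    return [word for word, _ in counts.most_common(k)]

# Toy input in the shape of the lower_tagged column (made-up data).
reviews = [
    [("place", "NN"), ("was", "VBD"), ("nice", "JJ")],
    [("place", "NN"), ("is", "VBZ"), ("nice", "JJ"), ("host", "NN")],
]

center = top_words(reviews, "N", 1)           # nouns
context = top_words(reviews, ("J", "V"), 1)   # adjectives and verbs
```

With the real data, `k` would be 1000 for both calls.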

In [None]:
def get_vocab(df):
  '''
  Count every (word, tag) pair in lower_tagged and rank the pairs by frequency.
  If the tag starts with N, the word is a center word; if the tag starts with J or V, it is a context word.

  Take the 1000 most frequent center words and the 1000 most frequent context words.

  return:
      list of the 1000 most frequent center words and list of the 1000 most frequent context words
  '''
  # count every (word, tag) pair over all reviews
  freq = Counter()
  for review in df.lower_tagged:
    freq.update(review)

  def top_words(prefixes, k=1000):
    # walk the pairs from most to least frequent and keep the first k distinct words
    words = set()
    for (word, tag), _ in freq.most_common():
      if tag.startswith(prefixes):
        words.add(word)
        if len(words) == k:
          break
    return list(words)

  cent_vocab = top_words("N")         # nouns: NN, NNS, NNP, NNPS
  cont_vocab = top_words(("J", "V"))  # adjectives (JJ*) and verbs (VB*)

  return cent_vocab, cont_vocab

In [None]:
cent_vocab, cont_vocab = get_vocab(df)

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 
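
To make the target structure concrete, here is a minimal sketch of sentence-level counting on toy data. It assumes that the context is a single sentence and that a center word is matched once per sentence; the helper name `sentence_coocs` and the sample sentences are hypothetical, and other definitions of context are equally valid.

```python
from collections import Counter, defaultdict

def sentence_coocs(sentences, center_vocab, context_vocab):
    """Count, per center word, how often each context word shares a sentence with it."""
    center_vocab, context_vocab = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)
    for sentence in sentences:
        counts = Counter(sentence.lower().split())
        # Occurrences of context words in this sentence.
        context_counts = {w: n for w, n in counts.items() if w in context_vocab}
        # Each center word present in the sentence receives those counts once.
        for center in center_vocab.intersection(counts):
            coocs[center].update(context_counts)
    return {center: dict(inner) for center, inner in coocs.items()}

sentences = [
    "the coffee was fresh",
    "fresh coffee and fresh tea",
    "the host was nice",
]
result = sentence_coocs(sentences, {"coffee", "host"}, {"fresh", "nice", "was"})
```

Counting tokens rather than substrings matters here: a substring search for `is` would also match inside `this`.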

In [None]:
df.head(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel is really cool., The place was nice an...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NNP), (is, VBZ), (really, RB), (cool..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel is the most amazing host!, His place i...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NNP), (is, VBZ), (the, DT), (most, R..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We had such a great time in Amsterdam., Danie...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very professional operation., Room is very cl...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel is highly recommended., He provided al...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NNP), (is, VBZ), (highly, RB), (reco..."


In [None]:
df_test = df.head(10)

In [None]:
def get_coos(df, cent_vocab, cont_vocab):
  '''
  Context is a single sentence, which is a tighter unit of meaning than a full review.

  - a center word is matched once per sentence, even if it is repeated there
  - every token occurrence of a context word in that sentence is counted
  - a word that is in both vocabularies also co-occurs with itself; cooc2pmi sets those cells to 0 later

  return:
    dict of dicts: {center word: {context word: count}}
  '''
  strip_punct = str.maketrans('', '', string.punctuation)
  cent_set, cont_set = set(cent_vocab), set(cont_vocab)
  coocs = defaultdict(Counter)  # this is returned as the result

  for review in df.tokenized:
    for sentence in review:
      # lowercase, remove punctuation and count whole tokens
      # (str.count on the raw sentence would also match substrings, e.g. "is" inside "this")
      counts = Counter(sentence.translate(strip_punct).lower().split())
      cont_counts = {w: n for w, n in counts.items() if w in cont_set}
      # a set intersection avoids looping over all 1000 center words for every sentence
      for center in cent_set.intersection(counts):
        coocs[center].update(cont_counts)  # also registers a center word with no context word here

  return {center: dict(counts) for center, counts in coocs.items()}

In [None]:
len(df)

452143

In [None]:
import time
start = time.process_time()
coocs = get_coos(df, cent_vocab, cont_vocab)
print(time.process_time() - start)


11475.934205631


### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 
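
The zero-filling step can be illustrated without pandas. The sketch below turns a small dictionary of dictionaries into row labels, column labels and a dense matrix; the helper name `dense_rows` and the toy counts are hypothetical, and in the notebook the same layout would be held in a DataFrame.

```python
def dense_rows(coocs):
    """Convert {center: {context: count}} into (centers, contexts, matrix).

    Pairs that never co-occur are filled with 0.
    """
    centers = list(coocs)
    # Sorted so that the column order is deterministic.
    contexts = sorted({word for inner in coocs.values() for word in inner})
    matrix = [[coocs[center].get(word, 0) for word in contexts] for center in centers]
    return centers, contexts, matrix

# Toy counts (made-up data).
coocs_example = {
    "coffee": {"fresh": 3, "was": 1},
    "host": {"nice": 1, "was": 1},
}
centers, contexts, matrix = dense_rows(coocs_example)
```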

In [None]:
def cooc_dict2df(coocs):
  '''
  1- the keys of coocs are the center words (rows)

  2- the keys of the inner dictionaries are the context words (columns)

  3- create a zero-filled matrix with center words as index and context words as columns

  4- copy every count from coocs into the matrix

  return:
  1000 * 1000 co-occurrence matrix
  '''

  center_w = list(coocs.keys())  # center words

  context_w = set()
  for values in coocs.values():  # collect every context word seen with any center word
    context_w.update(values)

  # pass a list: a set has no defined order and is rejected as columns by recent pandas versions
  coocdf = pd.DataFrame(0, index=center_w, columns=list(context_w))
  for center, values in coocs.items():  # pairs that never co-occur keep the initial 0
    for context, count in values.items():
      coocdf.at[center, context] = count
  return coocdf

In [None]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)

In [None]:
coocdf.to_csv("coocdf.csv")

In [None]:
coocdf

Unnamed: 0,sleep,petite,recommend,climbing,answered,aber,natural,needed,shower,vélos,se,liegt,looking,checking,end,justice,dans,shared,coming,hard,u,based,doing,longer,nicest,white,suitable,said,centre,uns,love,cosy,little,séjour,say,nous,bad,compact,typical,mais,...,ready,calm,taking,told,saying,dining,cool,ran,pretty,neighborhood,deal,loved,try,cual,chambre,playing,4th,miss,wellequipped,arrived,particular,show,returning,woke,everything,uncomfortable,chez,found,works,recomend,european,eine,accueil,seemed,detail,think,nest,late,suggested,thats
daniel,2,1,67,0,10,3,1,47,6,1,11,0,4,4,4,0,9,5,2,2,11,1,1,1,1,0,0,4,9,24,5,9,18,16,8,58,0,0,1,5,...,6,0,2,2,1,0,9,1,3,11,6,15,4,1,1,0,0,0,2,41,1,6,2,1,80,2,19,8,3,1,2,7,5,1,3,4,1,13,3,3
really,363,2,3688,14,248,0,73,1105,631,0,3,0,374,133,290,143,1,167,190,148,202,28,62,115,27,26,82,109,1790,0,523,1450,1448,0,341,3,220,14,104,0,...,169,446,128,96,13,70,1766,16,513,1641,135,1315,119,0,0,30,18,65,60,645,35,127,31,15,4178,44,2,380,76,77,16,0,1,88,128,247,15,509,54,148
cool,46,2,195,3,3,3,18,79,94,4,22,0,92,10,35,6,31,8,25,11,40,7,5,6,7,8,2,14,244,5,33,131,280,14,21,40,9,0,26,16,...,16,56,11,19,2,14,9528,3,225,673,7,129,17,1,7,4,1,9,4,34,6,13,2,0,404,4,4,35,12,6,3,4,9,12,7,21,2,37,14,41
was,1568,4,4339,78,1468,293,333,11511,3780,0,14,60,1397,1353,1467,145,7,993,890,735,600,149,298,559,190,133,182,817,4748,511,870,3360,6069,2,1316,2,821,73,291,1,...,1404,722,543,991,129,364,2000,178,1998,5060,791,3223,337,0,1,135,124,161,290,6475,160,886,134,123,29955,332,6,1561,226,56,180,314,1,377,514,847,42,4008,235,289
place,933,25,19239,37,86,0,126,1752,556,28,125,0,3439,141,589,291,136,189,496,249,535,64,60,259,109,40,145,189,2893,1,1517,2421,2647,55,782,287,213,60,173,97,...,225,795,179,169,38,88,1599,23,803,2554,143,3217,327,0,43,27,47,135,82,725,52,286,85,13,6333,46,19,837,91,229,56,0,22,60,123,561,22,392,77,313
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
martijn,1,1,31,0,11,0,0,21,2,1,2,0,3,1,4,0,4,1,2,0,1,0,0,0,1,0,0,1,7,8,3,3,14,11,7,26,1,0,1,4,...,8,1,0,4,0,0,5,1,0,5,2,7,1,0,1,0,1,1,0,17,0,1,0,1,41,0,5,2,0,0,0,0,3,1,1,2,0,8,0,0
mirjam,2,1,38,0,2,0,0,25,6,1,11,0,3,3,3,0,10,4,1,1,9,0,1,1,2,0,1,1,6,12,3,6,8,32,3,50,0,0,1,5,...,3,0,0,2,1,0,0,0,5,2,2,3,0,0,12,0,0,0,0,19,0,1,1,0,46,0,22,1,0,0,0,13,13,1,3,4,0,10,2,2
henk,0,3,19,0,2,6,0,16,0,6,9,0,5,0,3,0,8,2,1,1,3,0,0,0,1,0,0,0,8,45,8,4,12,12,5,43,0,0,0,3,...,4,1,5,2,0,1,1,0,3,2,1,12,0,0,1,1,0,0,0,17,0,1,0,0,16,0,12,1,0,0,0,20,2,0,1,1,0,6,0,1
jeroen,0,0,28,0,5,3,0,16,1,3,20,0,3,3,0,0,11,1,1,3,10,0,0,0,0,0,0,1,9,21,2,5,13,19,4,53,0,0,1,9,...,6,1,2,7,0,0,10,0,1,6,3,7,1,0,3,0,0,0,0,23,0,3,0,0,53,0,11,3,0,2,0,2,8,0,1,1,1,10,0,1


### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 
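
For reference, the usual textbook definition is PMI(x, y) = log2(P(x, y) / (P(x) P(y))), where the probabilities are estimated from the co-occurrence counts. The sketch below applies it to a tiny hand-made count table in plain Python; the counts and the helper name `pmi_table` are made up for illustration, and the formula given in the assignment should take precedence if it differs.

```python
import math

def pmi_table(counts):
    """Compute log2 PMI for every pair with a non-zero count.

    counts has the form {center: {context: count}}.
    """
    total = sum(sum(row.values()) for row in counts.values())
    row_totals = {x: sum(row.values()) for x, row in counts.items()}
    col_totals = {}
    for row in counts.values():
        for y, n in row.items():
            col_totals[y] = col_totals.get(y, 0) + n

    pmi = {}
    for x, row in counts.items():
        pmi[x] = {}
        for y, n in row.items():
            if n > 0:  # log2(0) is undefined, so zero counts are skipped
                # P(x,y) / (P(x) P(y)) simplifies to n * total / (row_total * col_total)
                pmi[x][y] = math.log2(n * total / (row_totals[x] * col_totals[y]))
    return pmi

toy_counts = {
    "coffee": {"fresh": 3, "was": 1},
    "host": {"nice": 1, "was": 1},
}
scores = pmi_table(toy_counts)
```

A positive score means the pair co-occurs more often than expected under independence, and a negative score means less often; in the toy table, `host` and `nice` score positively while `coffee` and `was` score negatively.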

In [None]:
def cooc2pmi(df):
  '''
  First, find the words that appear both as center and as context words. Their self co-occurrence
  is set to 0, because a word trivially co-occurs with itself and would distort the list of related words.

  Then divide each observed count by the count expected if center and context word were independent.
  This gives the ratio P(x, y) / (P(x) * P(y)); the usual definition of PMI takes the logarithm of this ratio.
  The logarithm is monotonic, so the ranking used by topk is the same either way.

  return:
  matrix with the association scores
  '''
  df = df.copy()  # work on a copy of the argument rather than on the global coocdf

  common_words = set(df.index) & set(df.columns)  # words present in both vocabularies
  for word in common_words:
    df.at[word, word] = 0

  col_totals = df.sum(axis=0)
  total = col_totals.sum()
  row_totals = df.sum(axis=1)
  expected = np.outer(row_totals, col_totals) / total  # expected counts under independence
  pmidf = df / expected

  return pmidf

In [None]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

(1000, 1000)

In [None]:
pmidf.head(10)

Unnamed: 0,sleep,petite,recommend,climbing,answered,aber,natural,needed,shower,vélos,se,liegt,looking,checking,end,justice,dans,shared,coming,hard,u,based,doing,longer,nicest,white,suitable,said,centre,uns,love,cosy,little,séjour,say,nous,bad,compact,typical,mais,...,ready,calm,taking,told,saying,dining,cool,ran,pretty,neighborhood,deal,loved,try,cual,chambre,playing,4th,miss,wellequipped,arrived,particular,show,returning,woke,everything,uncomfortable,chez,found,works,recomend,european,eine,accueil,seemed,detail,think,nest,late,suggested,thats
daniel,0.244641,0.309659,0.875782,0.0,2.56604,0.440434,0.452841,1.658897,0.414872,0.306798,0.385089,0.0,0.274325,1.466618,0.507948,0.0,0.30113,1.177295,0.379674,0.723901,1.485041,1.393833,0.811379,0.307504,0.872769,0.0,0.0,1.379768,0.138334,1.47327,0.49785,0.500594,0.538604,0.862582,1.183954,1.074648,0.0,0.0,0.320049,0.340186,...,1.531317,0.0,0.651541,0.630787,1.871066,0.0,0.713371,1.65195,0.310487,0.304555,2.554748,0.72609,1.315343,0.773456,0.099134,0.0,0.0,0.0,1.618272,2.743181,1.364194,2.05407,1.800915,1.9836,0.97898,1.964463,2.972716,0.901215,1.822997,1.390522,3.019522,0.537505,0.732357,0.745864,1.274477,0.740996,0.434845,1.184395,3.177977,1.055585
really,1.217735,0.016985,1.322086,0.626314,1.745269,0.0,0.9066,1.069624,1.196573,0.0,0.00288,0.0,0.703436,1.337384,1.009961,3.47264,0.000918,1.078398,0.989195,1.469124,0.747901,1.070326,1.37963,0.969829,0.646265,1.029731,1.61782,1.031146,0.75455,0.0,1.428163,2.211864,1.188263,0.0,1.384034,0.001524,1.246872,0.600985,0.912846,0.0,...,1.1829,1.889122,1.143587,0.830369,0.667083,0.668456,3.838935,0.724877,1.456084,1.246031,1.576442,1.74571,1.073182,0.0,0.0,1.226344,0.604419,1.383813,1.331435,1.183526,1.309458,1.192381,0.765548,0.816005,1.402166,1.18526,0.008582,1.174004,1.266559,2.936405,0.662484,0.0,0.004017,1.800071,1.491311,1.254874,0.178885,1.271798,1.568811,1.428174
cool,1.298699,0.142944,0.588313,1.12951,0.177679,0.101656,1.881352,0.643577,1.500177,0.283247,0.177764,0.0,1.456282,0.846271,1.025839,1.226252,0.239401,0.434768,1.095401,0.918955,1.246401,2.251961,0.936366,0.425847,1.410099,2.666522,0.332086,1.114619,0.865624,0.070842,0.758394,1.68177,1.93378,0.174205,0.717326,0.171061,0.429286,0.0,1.920623,0.251258,...,0.94251,1.996264,0.827098,1.383116,0.863717,1.125143,0.0,1.143853,5.374728,4.300709,0.687934,1.441255,1.290268,0.17852,0.160167,1.376119,0.282599,1.612543,0.747022,0.525051,1.889207,1.02721,0.415667,0.0,1.141083,0.90683,0.144448,0.910036,1.683054,1.925667,1.045398,0.070892,0.304262,2.065822,0.686374,0.897898,0.200732,0.778049,3.423022,3.329722
was,1.338465,0.008644,0.395798,0.88792,2.628765,0.300185,1.05233,2.835287,1.823965,0.0,0.00342,0.130139,0.668597,3.461921,1.300024,0.895997,0.001634,1.631649,1.179053,1.856517,0.565274,1.449303,1.687339,1.199566,1.157218,1.340346,0.913698,1.966666,0.509285,0.218904,0.60452,1.304201,1.267291,0.000752,1.359137,0.000259,1.184015,0.797395,0.649938,0.000475,...,2.500595,0.778175,1.234452,2.181166,1.684384,0.884487,1.10628,2.052009,1.443044,0.977655,2.350366,1.088734,0.773341,0.0,0.000692,1.404235,1.059504,0.872177,1.637502,3.02324,1.523205,2.116703,0.842035,1.702634,2.558086,2.275695,0.006551,1.227167,0.958375,0.543411,1.896457,0.168258,0.001022,1.96229,1.523829,1.094968,0.127452,2.548256,1.737239,0.70963
place,1.649184,0.11187,3.634064,0.872182,0.318897,0.0,0.824527,0.893603,0.555554,0.124137,0.063236,0.0,3.408211,0.747076,1.080844,3.723558,0.065757,0.643081,1.360665,1.302379,1.04373,1.289079,0.703499,1.150903,1.374721,0.834741,1.50739,0.942099,0.642576,0.000887,2.182746,1.945927,1.144562,0.042848,1.672401,0.076844,0.636092,1.357152,0.800113,0.095369,...,0.829822,1.774327,0.842663,0.770244,1.027452,0.442791,1.831513,0.549052,1.200953,1.021839,0.879876,2.250292,1.553873,0.0,0.0616,0.581562,0.831582,1.514395,0.958791,0.700967,1.025105,1.414877,1.10604,0.372637,1.119907,0.652921,0.042958,1.362551,0.799088,4.601529,1.221757,0.0,0.046566,0.646695,0.7551,1.501784,0.138244,0.516093,1.178715,1.591494
nice,1.091456,0.019643,0.562446,0.275932,1.150255,0.003104,1.417106,0.986616,1.913803,0.006487,0.003701,0.006572,0.920819,0.829537,1.162631,0.748913,0.000707,1.23968,0.638224,0.734709,1.29883,0.559963,0.892118,0.760731,0.811983,1.801566,0.943096,0.846056,0.939573,0.0,0.591603,2.692055,1.809497,0.0,0.75102,0.000784,0.519989,0.860512,1.312845,0.0,...,1.413873,3.046875,0.757704,0.980314,0.474751,1.472492,1.69944,0.768448,1.6719,3.123804,0.918316,0.666308,0.820458,0.0,0.0,1.10308,0.750778,0.853523,1.385803,0.961999,0.778816,1.14371,0.323674,0.671072,1.376545,0.436142,0.0,0.912289,1.0022,1.440685,0.638459,0.001624,0.0,2.144833,0.799458,0.685471,0.110334,0.963204,1.29913,1.086225
clean,0.78629,0.02351,0.407221,0.123848,0.402629,0.0,2.590027,2.460898,2.614348,0.007764,0.000886,0.0,0.803594,0.584585,0.385648,1.098052,0.007621,2.872186,0.384345,0.430524,1.076234,0.564392,0.41068,0.451366,0.728893,2.412111,1.092372,0.785668,0.509965,0.0,0.647608,4.63397,0.873125,0.002729,0.636714,0.002345,0.4707,3.169048,0.777568,0.012053,...,1.756844,2.040322,0.371001,0.478911,0.284113,0.572785,1.225644,0.543489,1.760117,1.034212,0.495684,0.779124,0.216373,0.0,0.017562,0.301776,0.49578,0.687599,6.102223,1.085373,0.794062,1.005014,0.319037,0.3012,3.222375,0.621447,0.023758,1.091912,1.26104,0.703815,0.917001,0.0,0.003707,0.924924,1.225649,0.506326,0.187083,0.27899,0.294898,0.667857
very,1.189708,0.005302,0.498354,0.963672,2.443046,0.0,1.190275,1.475309,1.263341,0.002627,0.002698,0.0,0.440383,1.977704,0.73497,0.621661,0.001432,1.020068,0.76391,1.143508,0.917761,0.668286,0.81278,0.69242,0.792079,1.186965,1.934026,0.856461,0.73682,0.000526,0.664094,2.403001,0.898711,0.000462,0.698168,0.00111,0.601581,1.621576,1.515323,0.001165,...,1.470593,2.360367,0.873011,0.953215,0.672825,0.965909,1.823493,1.159777,1.261814,1.567125,1.491021,0.706207,0.546191,0.0,0.0,0.816747,1.111189,0.731094,1.877389,1.677851,1.40159,1.143121,0.63218,0.985021,1.492693,1.86694,0.002679,1.323294,1.285065,0.785752,1.227992,0.0,0.001254,1.590093,1.218482,0.653456,0.100522,1.818272,1.070225,0.780253
quiet,5.113943,0.013936,0.501778,0.367053,0.092384,0.0,1.304268,0.663956,0.51655,0.0,0.0,0.0,2.219095,0.247508,1.548709,0.199245,0.001506,0.540414,0.87141,0.667843,0.777673,0.9409,0.438174,0.359803,0.981931,0.454928,1.230252,0.822742,1.955486,0.0,0.501867,2.065088,1.279269,0.0,0.379631,0.0,0.260406,1.127067,1.325092,0.0,...,0.436455,3.252858,0.67439,0.255486,0.084204,0.76783,1.344793,0.297371,2.850458,9.683829,0.421562,1.078315,0.325569,0.0,0.0,1.811128,0.275505,0.524022,2.694602,0.162594,0.675321,0.261911,0.607849,1.339019,1.451679,0.221017,0.0,0.968308,0.738363,0.500621,0.339719,0.0,0.0,1.544041,0.344132,0.508542,0.29354,0.553514,0.095346,2.185207
neighborhood,1.152834,0.0,0.328196,0.360327,0.396772,0.0,1.100316,0.608129,0.229104,0.0,0.0,0.0,1.110929,0.620931,0.73866,0.456386,0.002464,0.416087,1.034358,0.559663,0.357854,1.026287,0.418196,0.316983,1.927874,0.744319,0.688606,0.812745,0.513809,0.0,1.129036,0.991098,1.284469,0.0,0.326907,0.0,0.213029,0.230503,1.602447,0.0,...,0.394631,9.450113,0.911494,0.928904,0.137768,1.717753,3.921933,0.364902,2.278511,0.0,0.282161,2.730147,0.774796,0.0,0.0,1.426742,0.090152,0.800208,1.012811,0.394111,1.205357,1.058696,0.464108,0.292107,1.263251,0.072322,0.01152,1.11148,0.715884,0.20477,0.222329,0.0,0.0,3.459858,0.469202,0.46376,0.256143,0.58362,1.013984,1.943083


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument called `N` with a default value of 10; and returns a list of `N` strings, in descending order of their PMI score with the `center_word`. You do not need to handle cases where `center_word` is not found in `df`.
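
The retrieval itself is a sort followed by a cut-off. The sketch below shows the idea on a single row represented as a plain dictionary; the helper name `topk_from_row` and the scores are hypothetical.

```python
def topk_from_row(scores, n=10):
    """Return the n context words with the highest score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Made-up scores for one center word.
row = {"tea": 2.5, "fresh": 1.2, "was": -0.4}
best_two = topk_from_row(row, n=2)
```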

In [None]:
def topk(df, center_word, N=10):
  '''
  Select the row of the center word, sort its scores in descending order
  and take the N context words with the highest scores.

  return:
  list of the N most related context words
  '''
  top_words = df.loc[center_word].sort_values(ascending=False).head(N).index.to_list()
  return top_words

In [None]:
topk(pmidf, 'quick')

['respond',
 'reply',
 'responding',
 'answer',
 'responds',
 'efficient',
 'answering',
 'communicate',
 'ride',
 'thorough']

In [None]:
topk(pmidf, 'hotel',5)

['compared', 'cheaper', 'decided', 'expensive', 'glad']

In [None]:
topk(pmidf, 'coffee')

['tea',
 'complimentary',
 'cheese',
 'fresh',
 'stocked',
 'shops',
 'italian',
 'wine',
 'delicious',
 'included']