# Reviews analysis

In [1]:
!pip install ipython-autotime

%load_ext autotime

time: 2.6 ms (started: 2021-06-04 06:55:38 +00:00)


In [2]:
pip install --user -U nltk

Requirement already up-to-date: nltk in /root/.local/lib/python3.7/site-packages (3.6.2)
time: 2.94 s (started: 2021-06-04 06:55:38 +00:00)


# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
time: 3.23 ms (started: 2021-06-04 06:55:41 +00:00)


In [4]:
import pandas as pd
from nltk.tag import pos_tag,pos_tag_sents
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
import string

time: 801 ms (started: 2021-06-04 06:55:41 +00:00)


  from pandas import Panel


In [5]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
from nltk.util import ngrams
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...


time: 221 ms (started: 2021-06-04 06:55:42 +00:00)


[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [6]:
# load stopwords
sw = set(stopwords.words('english'))

time: 6.32 ms (started: 2021-06-04 06:55:42 +00:00)


In [7]:
p = 'some_directory'
df = pd.read_csv(os.path.join(p,'/content/drive/MyDrive/colab data/reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

time: 4.08 s (started: 2021-06-04 06:55:42 +00:00)


In [8]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


time: 23.7 ms (started: 2021-06-04 06:55:46 +00:00)


In [9]:
df.shape

(452143, 6)

time: 4.85 ms (started: 2021-06-04 06:55:46 +00:00)


### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [10]:
def process_reviews(df):
  ''' This function perform task of tokenizing ,tagging and transform the tagged word into lower case.
    argument = DataFrame
    return = DataFrame with three additional columns''' 
  # word tokenizing
  df['tokenized'] = df['comments'].apply(word_tokenize)
  # tagging using pos_tag
  tag = []
  for comment in df.comments:
    tag.append(pos_tag(comment.translate(str.maketrans('', '', string.punctuation)).split()))
  df["tagged"] = tag
  # converting all the tagged words to lower to reduce memory usuage.
  lower_tag = []
  for tag in df.tagged:
    lwr_tag = []
    for word in tag:
      wrd = (word[0].lower(), word[1])
      lwr_tag.append(wrd)
    lower_tag.append(lwr_tag)
  df["lower_tagged"] = lower_tag
  return df

time: 14.9 ms (started: 2021-06-04 06:55:46 +00:00)


In [11]:
df = process_reviews(df)

time: 23min 4s (started: 2021-06-04 06:55:46 +00:00)


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [12]:
def get_vocab(df):
  ''' This function generate two list, first one contains most common 1000 nouns and second one contains
  most common 1000 verb/adjective.
  argument: DataFrame
  return: two lists.'''
  new_list=[]
  new_list1=[]
  for i in range(len(df.tagged)):
    x=df["lower_tagged"][i]
    new_list.append(x)
  for j in range(len(new_list)):
    t=new_list[j]
    for k in range(len(t)):
      p=new_list[j][k]
      new_list1.append(p)
  noun_list =[]
  verb_list = []
  noun_in = ['NNP','NNS','NNPS','NN'] # list of noun tags to check whether tag is noun or not
  verb_in = ['JJS','JJ','JJR']        # list of verb/adjective tags to check whether tag is verb/adjective or not
  # foor loop to check tag type.
  for tok,tag in new_list1:
    if tag in noun_in:
      noun_list.append(tok)           # if tag is noun it will be added to noun list
    elif tag in verb_in:
      verb_list.append(tok)           # if tag is verb/adjective it will be added to adjective

  # using Counter function to count the the occurance of a word.
  noun_count = Counter(noun_list)
  verb_count = Counter(verb_list)
  noun_sorted = noun_count.most_common()
  verb_sorted = verb_count.most_common()
  new_verb=[]
  new_noun=[]
  for i in tqdm(range(len(verb_sorted))):
    L=verb_sorted[i][0]
    new_verb.append(L)
  for i in tqdm(range(len(noun_sorted))):
    L=noun_sorted[i][0]
    new_noun.append(L)
  # removing puntuation who got tagged as noun or verb/adjective
  new_noun = [''.join(c for c in s if c not in string.punctuation) for s in new_noun]
  new_noun = [s for s in new_noun if s]
  new_verb = [''.join(c for c in s if c not in string.punctuation) for s in new_verb]
  new_verb = [s for s in new_verb if s]
  # to keep unique values in both the lists
  final_noun=[]
  for i in new_noun:
    if i not in new_verb:
      final_noun.append(i)
  # extracting only top 1000 words in both the vocabs
  cent_vocab = final_noun[:1000]
  cont_vocab = new_verb[:1000]
  return cent_vocab, cont_vocab

time: 50.4 ms (started: 2021-06-04 07:18:50 +00:00)


In [13]:
cent_vocab, cont_vocab = get_vocab(df)

100%|██████████| 41728/41728 [00:00<00:00, 1125124.83it/s]
100%|██████████| 169268/169268 [00:00<00:00, 1216996.10it/s]


time: 2min 4s (started: 2021-06-04 07:18:50 +00:00)


### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [14]:
def get_coocs(df, cent_vocab, cont_vocab):
  '''This function get_coocs(df, center_vocab, context_vocab) which takes as 
  argument: the DataFrame generated in step 1, and the lists generated in step 2  
  returns a dictionary of dictionaries, of the form in the example below
     ‘A big restaurant served delicious food in big dishes’
     {‘restaurant’: {‘big’: 2, ‘served’:1, ‘delicious’:1}} '''
  # we are taking context as a full review
  coocs = {}
  #dataframe comments into list as reviews
  df["lower_tokenized"]=df.comments.apply(str.lower).apply(word_tokenize) # lower casing the comment section and applying word tokenizing to reach each word using for loop.
  reviews = df["lower_tokenized"].to_list() # storing complete lower_tokeized column into reviews as list.
  #First loop in cent_vocab
  for cent in tqdm(cent_vocab):
    #context coocs dictionary
    cont_coocs = defaultdict(int)
    #Second loop in reviews
    for review in reviews:
      #check the center vocab  in review or not
      if cent in review:
        #3rd loop in cont_vocab
        for cont in cont_vocab:
          #check the context vocab  in review or not
          if cont in review:
            #make a dictionary occurrence for context vocab
            if cont in cont_coocs:
              cont_coocs[cont] += 1
            else:
              cont_coocs[cont] = 1
          else:
            cont_coocs[cont] = 0
          #store the center key with value as context dictionry
      coocs[cent] = cont_coocs 
  return coocs

time: 13.4 ms (started: 2021-06-04 07:20:54 +00:00)


In [15]:
coocs = get_coocs(df, cent_vocab, cont_vocab)

100%|██████████| 1000/1000 [40:00<00:00,  2.40s/it]

time: 44min 42s (started: 2021-06-04 07:20:54 +00:00)





### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [16]:
def cooc_dict2df(coocs):
  ''' This function takes dictionary of dictionaries as argument and return a DataFrame'''
  coocdf = pd.DataFrame.from_dict(coocs,orient= 'index',dtype='Int64').fillna(0)
  return coocdf

time: 13.8 ms (started: 2021-06-04 08:05:37 +00:00)


In [17]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(996, 1000)

time: 1.17 s (started: 2021-06-04 08:05:37 +00:00)


### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [18]:
def cooc2pmi(df):
  ''' This function takes step 4 DataFrame as argunment and return new Dataframe with PMI score instead of raw co-occurence count.'''
  row_totals = df.sum(axis=1).astype(float)         # take the total sum of all rows in dataframe
  prob_cols_given_row = (df.T / row_totals).T       # calculating the probability of each index against total sum of rows
  col_totals = df.sum(axis=0).astype(float)         # calculating sum of all rows.
  prob_of_cols = col_totals / sum(col_totals)       # calculating the probability of each index against total sum of columns.
  ratio = prob_cols_given_row / prob_of_cols        # calculating ratio
  ratio[ratio==0] = 0.00001                         # replacing ratios that have zero value with 0.00001 to avoid mathematical error.
  pmidf = np.log(ratio)                             # calculating log of ratio using numpy library log function
  pmidf[pmidf < 0] = 0
  pmidf = pmidf.fillna(0.00001)
  return pmidf

time: 11 ms (started: 2021-06-04 08:05:39 +00:00)


In [19]:
pmidf = cooc2pmi(coocdf)
pmidf.shape

(996, 1000)

time: 726 ms (started: 2021-06-04 08:05:39 +00:00)


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [20]:
def topk(df, center_word, N=10):
  ''' This function takes PMI score filled in Dataframe,center_word and an optional N argument with default value 10 as input 
      return a list of N strings in order of their PMI score with the center_word'''
  top_words = df[center_word].sort_values(ascending = False).head(N) # finding the top N PMI score with there index values.
  top_words= list(top_words.index.values) # storing index values in a list
  return top_words

time: 3.82 ms (started: 2021-06-04 08:05:39 +00:00)


In [21]:
topk(pmidf,'coffee')

['maker',
 'mornings',
 'cup',
 'stroopwaffels',
 'pods',
 'difficulties',
 'office',
 'sanders',
 'inconvenience',
 'ducks']

time: 6.63 ms (started: 2021-06-04 08:05:39 +00:00)
