# HOTEL REVIEWS: ENTITY EXTRACTIONS: BOW, BIGRAMS AND TRIGRAMS FEATURIZATION

The main objective of this exercise is to device several featurization process that help us to perform Entity Recognition (ER) in a corpus of hotel reviews

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

from pyspark.sql.types import  *
from pyspark.sql.functions import *
from pyspark.sql import *

from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, NGram

from pyspark.ml import Pipeline, PipelineModel

In [2]:
UDF_docCleaner= udf(lambda doc: " ".join(re.sub('[^0-9a-zA-Z]+', ' ', doc).
                                             lower().
                                             split()
                                        ), StringType() 
                   )

In [3]:
corpus = spark.read.\
    option("sep", ",").\
    option("header", "true").\
    option("inferSchema", True).\
    csv("gs://manualrg-formacion/hot_rev/data/Hotels_Reviews.csv").\
    withColumn('doc', UDF_docCleaner('review')).persist()

In [12]:
n = corpus.count()
print('Number of reviews: ', n)

('Number of reviews: ', 515679)


In [14]:
n_hotels = corpus.select(col('Hotel_Name')).distinct().count()
print('Number of hotels: ', n_hotels)

('Number of hotels: ', 1492)


The dataset contains 515.679 from 1492 hotels placed across Europe, each review is splitted in positive and nevative and a score is given. Our aim is to detect what concepts lead customers to a good or a bad review.

In [4]:
corpus.printSchema()

root
 |-- Hotel_Address: string (nullable = true)
 |-- Additional_Number_of_Scoring: integer (nullable = true)
 |-- Review_Date: string (nullable = true)
 |-- Average_Score: double (nullable = true)
 |-- Hotel_Name: string (nullable = true)
 |-- Reviewer_Nationality: string (nullable = true)
 |-- Negative_Review: string (nullable = true)
 |-- Review_Total_Negative_Word_Count: integer (nullable = true)
 |-- Total_Number_of_Reviews: integer (nullable = true)
 |-- Positive_Review: string (nullable = true)
 |-- Review_Total_Positive_Word_Count: integer (nullable = true)
 |-- Total_Number_of_Reviews_Reviewer: integer (nullable = true)
 |-- Reviewer_Score: integer (nullable = true)
 |-- Tags: string (nullable = true)
 |-- days_since_review_old: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- lng: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- idhotel: integer (nullable = true)
 |-- hotel_country: string (nullable = true)
 |-- review: string (nullable = true

In [11]:
subset = corpus.limit(10).toPandas()
subset[['Hotel_Name','Negative_Review', 'Positive_Review','review', 'Reviewer_Score']]

Unnamed: 0,Hotel_Name,Negative_Review,Positive_Review,review,Reviewer_Score
0,11 Cadogan Gardens,Thought the prise of drinks at the bar a littl...,We were particularly impressed by the very war...,We were particularly impressed by the very war...,10
1,11 Cadogan Gardens,Nothing in particular just the usual problems ...,The atmosphere and staff were excellent just w...,The atmosphere and staff were excellent just w...,9
2,11 Cadogan Gardens,I found the floors in the corridors to be a bi...,Bed was amazingly comfortable The building is ...,Bed was amazingly comfortable The building is ...,9
3,11 Cadogan Gardens,I thought I had booked the refundable rate but...,Lovely hotel,Lovely hotel I thought I had booked the refund...,10
4,11 Cadogan Gardens,Room was far too small for 2 No shelves drawer...,Concierge service excellent Bed very comfortab...,Concierge service excellent Bed very comfortab...,8
5,11 Cadogan Gardens,No Negative,Customer service was above and beyond from all...,Customer service was above and beyond from all...,10
6,11 Cadogan Gardens,No Negative,Everything Most comfortable bed Extremely Clen...,Everything Most comfortable bed Extremely Clen...,10
7,11 Cadogan Gardens,Room size and outlook very disappointing Bathr...,Egg white omelette for breakfast Concierge ver...,Egg white omelette for breakfast Concierge ver...,8
8,11 Cadogan Gardens,There was nothing to dislike,The bed was spacious and very comfy Loads of h...,The bed was spacious and very comfy Loads of h...,9
9,11 Cadogan Gardens,No Negative,slightly older property but still very quaint ...,slightly older property but still very quaint ...,10


In [21]:
corpus_positive = corpus.where(col('Positive_Review') != 'No Positive')
n_pos = corpus_positive.count()
n_pos

479609

In [19]:
corpus_negative = corpus.where(col('Negative_Review') != 'No Negative')
n_neg = corpus_negative.count()
n_neg

386999

## Build a featurization pipeline and obtain three doc-ngram matrices:
* BOW with simple tokenization
* BOW with bigrams
* BOW with trigrams

In [88]:
regexTokenizer = RegexTokenizer().setInputCol("Positive_Review").setOutputCol("tokens").\
    setPattern("\\s+").\
    setMinTokenLength(2)
remover = StopWordsRemover().setInputCol("tokens").setOutputCol("tokens_rm").\
    setCaseSensitive(False)
bigram = NGram().setInputCol("tokens_rm").setOutputCol("bigrams").setN(2)
trigram = NGram().setInputCol("tokens_rm").setOutputCol("trigrams").setN(3)

stages=[regexTokenizer, remover, bigram, trigram]


featEng_pl = Pipeline().setStages(stages).fit(corpus_positive)
pos_docTerm_df = featEng_pl.transform(corpus_positive)
pos_docTerm_df.limit(5).toPandas()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Count,Total_Number_of_Reviews,Positive_Review,...,hotel_country,review,label,flg_same_country,days_since_review,doc,tokens,tokens_rm,bigrams,trigrams
0,11 Cadogan Gardens Sloane Square Kensington an...,101,08/03/2017,8.7,11 Cadogan Gardens,United Kingdom,Thought the prise of drinks at the bar a littl...,13,393,We were particularly impressed by the very war...,...,United Kingdom,We were particularly impressed by the very war...,1,1,0,we were particularly impressed by the very war...,"[we, were, particularly, impressed, by, the, v...","[particularly, impressed, warm, welcome, recei...","[particularly impressed, impressed warm, warm ...","[particularly impressed warm, impressed warm w..."
1,11 Cadogan Gardens Sloane Square Kensington an...,101,07/21/2017,8.7,11 Cadogan Gardens,Australia,Nothing in particular just the usual problems ...,27,393,The atmosphere and staff were excellent just w...,...,United Kingdom,The atmosphere and staff were excellent just w...,1,0,13,the atmosphere and staff were excellent just w...,"[the, atmosphere, and, staff, were, excellent,...","[atmosphere, staff, excellent, expect, small, ...","[atmosphere staff, staff excellent, excellent ...","[atmosphere staff excellent, staff excellent e..."
2,11 Cadogan Gardens Sloane Square Kensington an...,101,07/16/2017,8.7,11 Cadogan Gardens,United Arab Emirates,I found the floors in the corridors to be a bi...,41,393,Bed was amazingly comfortable The building is ...,...,United Kingdom,Bed was amazingly comfortable The building is ...,1,0,18,bed was amazingly comfortable the building is ...,"[bed, was, amazingly, comfortable, the, buildi...","[bed, amazingly, comfortable, building, full, ...","[bed amazingly, amazingly comfortable, comfort...","[bed amazingly comfortable, amazingly comforta..."
3,11 Cadogan Gardens Sloane Square Kensington an...,101,07/10/2017,8.7,11 Cadogan Gardens,United States of America,I thought I had booked the refundable rate but...,32,393,Lovely hotel,...,United Kingdom,Lovely hotel I thought I had booked the refund...,1,0,24,lovely hotel i thought i had booked the refund...,"[lovely, hotel]","[lovely, hotel]",[lovely hotel],[]
4,11 Cadogan Gardens Sloane Square Kensington an...,101,07/05/2017,8.7,11 Cadogan Gardens,Ireland,Room was far too small for 2 No shelves drawer...,48,393,Concierge service excellent Bed very comfortab...,...,United Kingdom,Concierge service excellent Bed very comfortab...,1,0,29,concierge service excellent bed very comfortab...,"[concierge, service, excellent, bed, very, com...","[concierge, service, excellent, bed, comfortab...","[concierge service, service excellent, excelle...","[concierge service excellent, service excellen..."


In [89]:
regexTokenizer = RegexTokenizer().setInputCol("Negative_Review").setOutputCol("tokens").\
    setPattern("\\s+").\
    setMinTokenLength(2)
#TODO: erase negations and conjuntions like: but, however, etc.
remover = StopWordsRemover().setInputCol("tokens").setOutputCol("tokens_rm").\
    setCaseSensitive(False)
bigram = NGram().setInputCol("tokens_rm").setOutputCol("bigrams").setN(2)
trigram = NGram().setInputCol("tokens_rm").setOutputCol("trigrams").setN(3)

stages=[regexTokenizer, remover, bigram, trigram]

featEng_pl = Pipeline().setStages(stages).fit(corpus_negative)
neg_docTerm_df = featEng_pl.transform(corpus_negative)
neg_docTerm_df.limit(5).toPandas()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Count,Total_Number_of_Reviews,Positive_Review,...,hotel_country,review,label,flg_same_country,days_since_review,doc,tokens,tokens_rm,bigrams,trigrams
0,11 Cadogan Gardens Sloane Square Kensington an...,101,08/03/2017,8.7,11 Cadogan Gardens,United Kingdom,Thought the prise of drinks at the bar a littl...,13,393,We were particularly impressed by the very war...,...,United Kingdom,We were particularly impressed by the very war...,1,1,0,we were particularly impressed by the very war...,"[thought, the, prise, of, drinks, at, the, bar...","[thought, prise, drinks, bar, little, excessive]","[thought prise, prise drinks, drinks bar, bar ...","[thought prise drinks, prise drinks bar, drink..."
1,11 Cadogan Gardens Sloane Square Kensington an...,101,07/21/2017,8.7,11 Cadogan Gardens,Australia,Nothing in particular just the usual problems ...,27,393,The atmosphere and staff were excellent just w...,...,United Kingdom,The atmosphere and staff were excellent just w...,1,0,13,the atmosphere and staff were excellent just w...,"[nothing, in, particular, just, the, usual, pr...","[nothing, particular, usual, problems, staying...","[nothing particular, particular usual, usual p...","[nothing particular usual, particular usual pr..."
2,11 Cadogan Gardens Sloane Square Kensington an...,101,07/16/2017,8.7,11 Cadogan Gardens,United Arab Emirates,I found the floors in the corridors to be a bi...,41,393,Bed was amazingly comfortable The building is ...,...,United Kingdom,Bed was amazingly comfortable The building is ...,1,0,18,bed was amazingly comfortable the building is ...,"[found, the, floors, in, the, corridors, to, b...","[found, floors, corridors, bit, squeaky, much,...","[found floors, floors corridors, corridors bit...","[found floors corridors, floors corridors bit,..."
3,11 Cadogan Gardens Sloane Square Kensington an...,101,07/10/2017,8.7,11 Cadogan Gardens,United States of America,I thought I had booked the refundable rate but...,32,393,Lovely hotel,...,United Kingdom,Lovely hotel I thought I had booked the refund...,1,0,24,lovely hotel i thought i had booked the refund...,"[thought, had, booked, the, refundable, rate, ...","[thought, booked, refundable, rate, found, cou...","[thought booked, booked refundable, refundable...","[thought booked refundable, booked refundable ..."
4,11 Cadogan Gardens Sloane Square Kensington an...,101,07/05/2017,8.7,11 Cadogan Gardens,Ireland,Room was far too small for 2 No shelves drawer...,48,393,Concierge service excellent Bed very comfortab...,...,United Kingdom,Concierge service excellent Bed very comfortab...,1,0,29,concierge service excellent bed very comfortab...,"[room, was, far, too, small, for, no, shelves,...","[room, far, small, shelves, drawers, store, cl...","[room far, far small, small shelves, shelves d...","[room far small, far small shelves, small shel..."


## Compute an inverted index from every doc-term matrix
An inverted index is a data structure that maps from tokens to the set of documents that every token belongs to
Then, for every token (or ngram) statistics are computed and an analysis is carried out in order to get the most relevant ngrams, that will be a close estimation of the set of entities to extract

In [90]:
def computeStats(df, label):
  stats= df.agg(sum(label).alias('N1'), count('*').alias('N') ).toPandas()
  stats['N0'] = stats['N']-stats['N1']
  stats['baseline_accuracy'] = stats['N0']/stats['N']
  stats['apriori'] = 1.0-stats['baseline_accuracy']

  return stats

In [91]:
#corpus_stats = computeStats(docTerm_df, 'label')
#N0 = stats['N0'].values[0]
#N1 = stats['N1'].values[0]
#N = stats['N'].values[0]


pos_stats = computeStats(pos_docTerm_df, 'label')
pos_N0 = stats['N0'].values[0]
pos_N1 = stats['N1'].values[0]
pos_N = stats['N'].values[0]


neg_stats = computeStats(neg_docTerm_df, 'label')
neg_N0 = stats['N0'].values[0]
neg_N1 = stats['N1'].values[0]
neg_N = stats['N'].values[0]

corpus_stats

Unnamed: 0,N1,N,N0,baseline_accuracy,apriori
0,499761,515679,15918,0.030868,0.969132


In [35]:
def invIdx(tokens_rdd):
    return tokens_rdd.map(lambda doc: (map(lambda tok: (tok, (doc[0], doc[3]) ), doc[2] )) ).\
    flatMap(lambda x: x).groupByKey().map(lambda idx: (idx[0], list(idx[1])) ).\
    map(lambda pair: (pair[0], pair[1], Counter(pair[1]).values(), Counter(pair[1]).keys() )  ).\
    map(lambda tup: (tup[0], tup[1], 
                              reduce(lambda x, y: (x+y), tup[2]), len(tup[2]), 
                              reduce(lambda x, y: x+y, map(lambda pair: pair[1], tup[1])) ) 
       )

In [37]:
invIdx_schema= StructType().add(StructField('token', StringType(), True)).\
                                add(StructField('TotalFreq', IntegerType(), True)).\
                               add(StructField('DocFreq', IntegerType(), True)).\
                               add(StructField('E', IntegerType(), True))

Build an inverted index for positive reviews

In [92]:
pos_rdd_tokens = pos_docTerm_df.rdd.map(lambda row: (row['id'], row['doc'], row['tokens_rm'], row['label']) )
pos_invIdx1_rdd = invIdx(pos_rdd_tokens)
pos_invIdx1_df = pos_invIdx1_rdd.map(lambda row: (row[0], row[2], row[3], row[4]) ).toDF(invIdx_schema)
    
pos_invIdx1_df.limit(10).toPandas()

Unnamed: 0,token,TotalFreq,DocFreq,E
0,povnovic,1,1,1
1,unimaginative,2,2,2
2,svietlana,1,1,1
3,divinely,4,4,4
4,midmost,3,2,3
5,nunnery,1,1,1
6,hisotrical,1,1,1
7,foun,1,1,0
8,ubeatable,1,1,1
9,advices,25,25,25


In [93]:
pos_rdd_bigrams = pos_docTerm_df.rdd.map(lambda row: (row['id'], row['doc'], row['bigrams'], row['label']) )
pos_invIdx2_rdd = invIdx(pos_rdd_bigrams)
pos_invIdx2_df = pos_invIdx2_rdd.map(lambda row: (row[0], row[2], row[3], row[4]) ).toDF(invIdx_schema)

pos_invIdx2_df.limit(10).toPandas()

Unnamed: 0,token,TotalFreq,DocFreq,E
0,food corner,3,3,3
1,pool greatly,1,1,1
2,including evening,2,2,2
3,wanted last,2,2,2
4,evening tasty,1,1,1
5,london ritz,3,2,3
6,sry pleasant,1,1,1
7,helpful wi,11,11,11
8,palladium also,2,2,2
9,high rises,1,1,1


In [94]:
pos_rdd_trigrams = pos_docTerm_df.rdd.map(lambda row: (row['id'], row['doc'], row['trigrams'], row['label']) )
pos_invIdx3_rdd = invIdx(pos_rdd_trigrams)
pos_invIdx3_df = pos_invIdx3_rdd.map(lambda row: (row[0], row[2], row[3], row[4]) ).toDF(invIdx_schema)

pos_invIdx3_df.limit(10).toPandas()

Unnamed: 0,token,TotalFreq,DocFreq,E
0,beds good check,1,1,1
1,romantic inner courtyard,1,1,1
2,business vic hotel,1,1,1
3,good varied service,2,2,2
4,comfiest bed pretty,1,1,1
5,parisien view try,1,1,1
6,upgraded staff friendly,4,4,4
7,far though walk,1,1,1
8,piazza duomo 24,1,1,1
9,like time travel,1,1,1


Compute statistics for every token in each inverted index

In [48]:
pos_df_list = [pos_invIdx1_df, pos_invIdx2_df, pos_invIdx3_df]
pos_df_list2 = map(lambda df:
    df.withColumn('review_type', lit('Positive')).\
    withColumn('NE', col('TotalFreq')-col('E')).\
    withColumn('eprop', col('E')/col('TotalFreq')).\
    withColumn('neprop', col('NE')/col('TotalFreq')).\
    withColumn("N1", lit(pos_N1)).withColumn("N0", lit(pos_N0)).\
    withColumn("P1",col("E")/col("N1")).withColumn("P0",col("NE")/col("N0")).\
    withColumn("x",when(col("P1") ==0.0, float('-inf')).\
        when(col("P0") == 0.0, float('inf')).\
        otherwise(col("P1")/col("P0"))).\
    withColumn("WOEValue",when(col("x")==float('-inf'), float('-inf')).\
        when(col("x")==float('inf'),float('inf')).\
        otherwise(log(col("P1")/col("P0")))).\
    withColumn("token_length", length(col("token")).cast(DoubleType())).\
    withColumn("WOEAbs", abs(col("WOEValue"))).\
    withColumn("flg_singleLabel", when( (col("E") ==0.0) | (col("NE") == 0.0), 1).otherwise(0)).\
    sort(col('DocFreq').desc()), df_list)

pos_invIdx1_pd = pos_df_list2[0].limit(100).toPandas()
pos_invIdx2_pd = pos_df_list2[1].limit(100).toPandas()
pos_invIdx3_pd = pos_df_list2[2].limit(100).toPandas()

Build an inverted index for negative reviews

In [109]:
neg_rdd_tokens = neg_docTerm_df.rdd.map(lambda row: (row['id'], row['doc'], row['tokens_rm'], row['label']) )
neg_invIdx1_rdd = invIdx(neg_rdd_tokens)
neg_invIdx1_df = neg_invIdx1_rdd.map(lambda row: (row[0], row[2], row[3], row[4]) ).toDF(invIdx_schema)
    
neg_invIdx1_df.limit(10).toPandas()

Unnamed: 0,token,TotalFreq,DocFreq,E
0,fawn,1,1,1
1,roomsinclude,1,1,1
2,divinely,1,1,1
3,colony,3,3,3
4,wonderfull,5,5,5
5,foun,1,1,0
6,apnea,1,1,1
7,pony,2,2,1
8,revelers,12,12,12
9,gag,3,3,3


In [96]:
neg_rdd_bigrams = neg_docTerm_df.rdd.map(lambda row: (row['id'], row['doc'], row['bigrams'], row['label']) )
neg_invIdx2_rdd = invIdx(neg_rdd_bigrams)
neg_invIdx2_df = neg_invIdx2_rdd.map(lambda row: (row[0], row[2], row[3], row[4]) ).toDF(invIdx_schema)

neg_invIdx2_df.limit(10).toPandas()

Unnamed: 0,token,TotalFreq,DocFreq,E
0,deal promptly,1,1,1
1,games served,1,1,0
2,cathedral came,1,1,1
3,forever next,1,1,1
4,since past,1,1,1
5,contacted guest,1,1,1
6,tax child,1,1,1
7,need experienced,1,1,1
8,experiencing breakfast,1,1,1
9,opinion slippers,1,1,1


In [97]:
neg_rdd_trigrams = neg_docTerm_df.rdd.map(lambda row: (row['id'], row['doc'], row['trigrams'], row['label']) )
neg_invIdx3_rdd = invIdx(neg_rdd_trigrams)
neg_invIdx3_df = neg_invIdx3_rdd.map(lambda row: (row[0], row[2], row[3], row[4]) ).toDF(invIdx_schema)

neg_invIdx3_df.limit(10).toPandas()

Unnamed: 0,token,TotalFreq,DocFreq,E
0,sea get opposite,1,1,1
1,entrance difficult strollers,1,1,1
2,menu quality good,1,1,1
3,money website hotels,1,1,1
4,guests afford stay,1,1,0
5,guests worthy business,1,1,0
6,wasn working wait,2,2,2
7,receptionist best room,1,1,1
8,see hotel fenced,1,1,1
9,curtains dirty corridors,1,1,1


In [98]:
neg_df_list = [neg_invIdx1_df, neg_invIdx2_df, neg_invIdx3_df]
neg_df_list2 = map(lambda df:
    df.withColumn('review_type', lit('Negative')).\
    withColumn('NE', col('TotalFreq')-col('E')).\
    withColumn('eprop', col('E')/col('TotalFreq')).\
    withColumn('neprop', col('NE')/col('TotalFreq')).\
    withColumn("N1", lit(neg_N1)).withColumn("N0", lit(neg_N0)).\
    withColumn("P1",col("E")/col("N1")).withColumn("P0",col("NE")/col("N0")).\
    withColumn("x",when(col("P1") ==0.0, float('-inf')).\
        when(col("P0") == 0.0, float('inf')).\
        otherwise(col("P1")/col("P0"))).\
    withColumn("WOEValue",when(col("x")==float('-inf'), float('-inf')).\
        when(col("x")==float('inf'),float('inf')).\
        otherwise(log(col("P1")/col("P0")))).\
    withColumn("token_length", length(col("token")).cast(DoubleType())).\
    withColumn("WOEAbs", abs(col("WOEValue"))).\
    withColumn("flg_singleLabel", when( (col("E") ==0.0) | (col("NE") == 0.0), 1).otherwise(0)).\
    sort(col('DocFreq').desc()), df_list)

neg_invIdx1_pd = neg_df_list2[0].limit(100).toPandas()
neg_invIdx2_pd = neg_df_list2[1].limit(100).toPandas()
neg_invIdx3_pd = neg_df_list2[2].limit(100).toPandas()

Append both DataFrames and write then to GS

In [99]:
ap=[]
for i,j in zip(pos_df_list2, neg_df_list2):
    df = i.union(j)
    ap.append(df)

In [100]:
invIdx1_rank_df = ap[0]
invIdx1_rank_df.persist()
invIdx1_rank_df.write.mode("overwrite").option("sep", ",").\
    option("header", "true").\
    csv("gs://manualrg-formacion/hot_rev/data/InvIdx1/")

In [101]:
invIdx2_rank_df = ap[1]
invIdx2_rank_df.persist()
invIdx2_rank_df.write.mode("overwrite").option("sep", ",").\
    option("header", "true").\
    csv("gs://manualrg-formacion/hot_rev/data/InvIdx2/")

In [102]:
invIdx3_rank_df = ap[2]
invIdx3_rank_df.persist()
invIdx3_rank_df.write.mode("overwrite").option("sep", ",").\
    option("header", "true").\
    csv("gs://manualrg-formacion/hot_rev/data/InvIdx3/")

## Entity Candidates: top 20 most frequent ngrams

In [103]:
pos_invIdx1_rank_pd = invIdx1_rank_df.where(col('review_type') == lit('Positive')).limit(20).toPandas()
pos_invIdx1_rank_pd

Unnamed: 0,token,TotalFreq,DocFreq,E,review_type,NE,eprop,neprop,N1,N0,P1,P0,x,WOEValue,token_length,WOEAbs,flg_singleLabel
0,staff,229290,209738,224386,Positive,4904,0.978612,0.021388,499761,15918,0.448987,0.308079,1.457375,0.376637,5.0,0.376637,0
1,room,298282,203214,286060,Positive,12222,0.959025,0.040975,499761,15918,0.572394,0.76781,0.745489,-0.293715,4.0,0.293715,0
2,location,202687,196217,198528,Positive,4159,0.979481,0.020519,499761,15918,0.397246,0.261277,1.520404,0.418976,8.0,0.418976,0
3,hotel,191256,138335,183150,Positive,8106,0.957617,0.042383,499761,15918,0.366475,0.509235,0.719659,-0.328978,5.0,0.328978,0
4,negative,129478,129373,128935,Positive,543,0.995806,0.004194,499761,15918,0.257993,0.034112,7.563053,2.023275,8.0,2.023275,0
5,breakfast,137864,121677,135065,Positive,2799,0.979697,0.020303,499761,15918,0.270259,0.175839,1.536972,0.429814,9.0,0.429814,0
6,good,131381,107988,129103,Positive,2278,0.982661,0.017339,499761,15918,0.258329,0.143108,1.805131,0.590633,4.0,0.590633,0
7,great,114921,97389,114407,Positive,514,0.995527,0.004473,499761,15918,0.228923,0.03229,7.0895,1.958615,5.0,1.958615,0
8,friendly,89468,88015,88877,Positive,591,0.993394,0.006606,499761,15918,0.177839,0.037128,4.789918,1.566513,8.0,1.566513,0
9,helpful,80039,78368,79476,Positive,563,0.992966,0.007034,499761,15918,0.159028,0.035369,4.496284,1.503251,7.0,1.503251,0


In [104]:
neg_invIdx1_rank_pd = invIdx1_rank_df.where(col('review_type') == lit('Negative')).limit(20).toPandas()
neg_invIdx1_rank_pd

Unnamed: 0,token,TotalFreq,DocFreq,E,review_type,NE,eprop,neprop,N1,N0,P1,P0,x,WOEValue,token_length,WOEAbs,flg_singleLabel
0,staff,229290,209738,224386,Negative,4904,0.978612,0.021388,499761,15918,0.448987,0.308079,1.457375,0.376637,5.0,0.376637,0
1,room,298282,203214,286060,Negative,12222,0.959025,0.040975,499761,15918,0.572394,0.76781,0.745489,-0.293715,4.0,0.293715,0
2,location,202687,196217,198528,Negative,4159,0.979481,0.020519,499761,15918,0.397246,0.261277,1.520404,0.418976,8.0,0.418976,0
3,hotel,191256,138335,183150,Negative,8106,0.957617,0.042383,499761,15918,0.366475,0.509235,0.719659,-0.328978,5.0,0.328978,0
4,negative,129478,129373,128935,Negative,543,0.995806,0.004194,499761,15918,0.257993,0.034112,7.563053,2.023275,8.0,2.023275,0
5,breakfast,137864,121677,135065,Negative,2799,0.979697,0.020303,499761,15918,0.270259,0.175839,1.536972,0.429814,9.0,0.429814,0
6,good,131381,107988,129103,Negative,2278,0.982661,0.017339,499761,15918,0.258329,0.143108,1.805131,0.590633,4.0,0.590633,0
7,great,114921,97389,114407,Negative,514,0.995527,0.004473,499761,15918,0.228923,0.03229,7.0895,1.958615,5.0,1.958615,0
8,friendly,89468,88015,88877,Negative,591,0.993394,0.006606,499761,15918,0.177839,0.037128,4.789918,1.566513,8.0,1.566513,0
9,helpful,80039,78368,79476,Negative,563,0.992966,0.007034,499761,15918,0.159028,0.035369,4.496284,1.503251,7.0,1.503251,0


In [105]:
pos_invIdx2_rank_pd = invIdx2_rank_df.where(col('review_type') == 'Positive').limit(20).limit(20).toPandas()
pos_invIdx2_rank_pd

Unnamed: 0,token,TotalFreq,DocFreq,E,review_type,NE,eprop,neprop,N1,N0,P1,P0,x,WOEValue,token_length,WOEAbs,flg_singleLabel
0,great location,30426,30344,30351,Positive,75,0.997535,0.002465,499761,15918,0.060731,0.004712,12.889554,2.556417,14.0,2.556417,0
1,staff friendly,25411,25336,25174,Positive,237,0.990673,0.009327,499761,15918,0.050372,0.014889,3.383218,1.218827,14.0,1.218827,0
2,friendly staff,24859,24832,24787,Positive,72,0.997104,0.002896,499761,15918,0.049598,0.004523,10.965227,2.394729,14.0,2.394729,0
3,friendly helpful,21738,21696,21673,Positive,65,0.99701,0.00299,499761,15918,0.043367,0.004083,10.620178,2.362756,16.0,2.362756,0
4,good location,19883,19846,19607,Positive,276,0.986119,0.013881,499761,15918,0.039233,0.017339,2.262706,0.816562,13.0,0.816562,0
5,staff helpful,17569,17508,17418,Positive,151,0.991405,0.008595,499761,15918,0.034853,0.009486,3.67407,1.3013,13.0,1.3013,0
6,helpful staff,17050,17025,17016,Positive,34,0.998006,0.001994,499761,15918,0.034048,0.002136,15.940601,2.768869,13.0,2.768869,0
7,excellent location,11928,11914,11909,Positive,19,0.998407,0.001593,499761,15918,0.023829,0.001194,19.964012,2.993931,18.0,2.993931,0
8,location great,10877,10868,10775,Positive,102,0.990622,0.009378,499761,15918,0.02156,0.006408,3.364676,1.213332,14.0,1.213332,0
9,location good,10575,10551,10180,Positive,395,0.962648,0.037352,499761,15918,0.02037,0.024815,0.820875,-0.197385,13.0,0.197385,0


In [None]:
neg_invIdx2_rank_pd = invIdx2_rank_df.where(col('review_type') == 'Negative').limit(20).toPandas()
neg_invIdx2_rank_pd

In [112]:
pos_invIdx3_rank_pd = invIdx3_rank_df.where(col('review_type') == 'Positive').limit(20).toPandas()
pos_invIdx3_rank_pd

Unnamed: 0,token,TotalFreq,DocFreq,E,review_type,NE,eprop,neprop,N1,N0,P1,P0,x,WOEValue,token_length,WOEAbs,flg_singleLabel
0,staff friendly helpful,9634,9627,9601,Positive,33,0.996575,0.003425,499761,15918,0.019211,0.002073,9.266776,2.226436,22.0,2.226436,0
1,friendly helpful staff,6891,6889,6884,Positive,7,0.998984,0.001016,499761,15918,0.013775,0.00044,31.323405,3.444366,22.0,3.444366,0
2,location friendly staff,3626,3626,3617,Positive,9,0.997518,0.002482,499761,15918,0.007237,0.000565,12.800653,2.549496,23.0,2.549496,0
3,staff helpful friendly,3138,3138,3128,Positive,10,0.996813,0.003187,499761,15918,0.006259,0.000628,9.963063,2.298885,22.0,2.298885,0
4,good value money,2930,2917,2905,Positive,25,0.991468,0.008532,499761,15918,0.005813,0.001571,3.701112,1.308633,16.0,1.308633,0
5,within walking distance,2532,2518,2517,Positive,15,0.994076,0.005924,499761,15918,0.005036,0.000942,5.344636,1.676093,23.0,1.676093,0
6,hotel great location,2361,2361,2353,Positive,8,0.996612,0.003388,499761,15918,0.004708,0.000503,9.368242,2.237325,20.0,2.237325,0
7,friendly staff negative,2137,2137,2135,Positive,2,0.999064,0.000936,499761,15918,0.004272,0.000126,34.001183,3.526395,23.0,3.526395,0
8,staff great location,2102,2102,2102,Positive,0,1.0,0.0,499761,15918,0.004206,0.0,inf,inf,20.0,inf,1
9,staff extremely helpful,1970,1969,1967,Positive,3,0.998477,0.001523,499761,15918,0.003936,0.000188,20.883786,3.038973,23.0,3.038973,0


In [None]:
neg_invIdx3_rank_pd = invIdx3_rank_df.where(col('review_type') == 'Negative').limit(20).toPandas()
neg_invIdx3_rank_pd