In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from scipy.sparse import hstack, vstack, coo_matrix, csr_matrix, bmat
sns.set_style("darkgrid")
data_path = "data/"

# Model building

To give travellers better recommendations, it is essential to know the price point of a shop or restaurant. A student living off a part-time job has a different budget than a business traveller. If you often visit cheap eats, you will likely enjoy the areas at your destination that serve street food and the like, while someone paying with the company card will have no qualms about an expensive meal.

In this section a model to predict prices will be built. The model will be based on the following features:
- Bag of Words or TF-IDF features
- Raw review length
- LIX number ([see here](https://en.wikipedia.org/wiki/Lix_(readability_test)))
- Sentiment score (VADER)

First, the review text is used to predict the price of the reviewed place. Either a bag-of-words (BoW) count or TF-IDF is used, with the hypothesis that TF-IDF performs better. The BoW count simply records how many times a word occurs in a review, whereas TF-IDF reflects the relative importance of a word in the review, taking all reviews into account.
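The difference between the two representations can be seen on a tiny example. The sketch below uses a made-up three-review corpus (the `toy_` names and the texts are assumptions for illustration, not part of the data set) and scikit-learn's default TF-IDF settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical, already-cleaned toy reviews (not from the data set)
toy_reviews = [
    "great food great service",
    "food cheap",
    "expensive food elegant service",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)          # raw BoW counts
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)  # re-weighted counts

vocab = toy_vect.vocabulary_
first_counts = toy_counts[0].toarray().ravel()
first_tfidf = toy_tfidf[0].toarray().ravel()

# Compare count and TF-IDF weight for three words of the first review
for word in ["great", "food", "service"]:
    idx = vocab[word]
    print(word, first_counts[idx], round(first_tfidf[idx], 3))
```

In the first review "food" and "service" both occur once, so their counts are equal. Because "food" appears in every toy review, TF-IDF gives it a lower weight than "service".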

Second, the length of the review is thought to be related to the price in some way. The direction of the effect is unclear, as long reviews tend to be either very positive or very negative.

The LIX number is an index of the readability of a text. It yields a number from zero upwards (though usually below 100) that indicates how hard the text is to understand. Scores below 25 correspond to children's books, and 35 to 45 is around the level of most newspapers. Scores above 55 are usually reserved for difficult academic papers. The hypothesis is that more eloquent reviews are associated with a higher price.
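As a rough illustration of how such an index is computed, the sketch below implements the standard LIX formula (words per sentence plus the percentage of words longer than six letters). The function name `lix` and the simple regex-based word and sentence splitting are assumptions; the `LIX` column in the data may have been computed with different tokenisation rules.

```python
import re

def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words."""
    # Words are runs of letters; digits and punctuation are ignored
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    # Simplification: a sentence ends at a run of '.', '!', '?' or ':'
    sentences = max(1, len(re.findall(r"[.!?:]+", text)))
    # Long words have more than six letters
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / sentences + 100 * long_words / len(words)

sample = "The food was good. The waiter recommended an extraordinary dessert."
score = lix(sample)
print(score)  # 10 words, 2 sentences, 3 long words -> 5 + 30 = 35.0
```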

Finally, the sentiment score is calculated using the VADER (Valence Aware Dictionary and sEntiment Reasoner) framework. This is a popular framework for sentiment analysis. Three of its outputs are used here: the degree to which a text is positive, negative and neutral.

---

First, the preprocessed NLP data is loaded and prepared for analysis.

In [2]:
df = pd.read_csv("data/NLP_data.csv")
df['reviewTextClean'] = df['reviewTextClean'].str.lower()
df.dropna(subset=["reviewTextClean"], inplace=True)

In [131]:
df.head(3)

Unnamed: 0,rating,reviewText,reviewTextClean,posReviewPercent,negReviewPercent,midReviewPercent,price,LIX,NumberOfWords,gPlusPlaceId,LIXNorm,NoWNorm,pred_price
0,3.0,This is a very cute hotel with good amenities ...,cute hotel good amenity nice location great cr...,0.141,0.109,0.751,3,28.5,50,106689630448064755324,0.12206,0.064644,3
1,4.0,Love this place. The Great/Good: Massage an...,love place massage facial technician best loun...,0.234,0.051,0.716,3,82.047619,63,108256990636148259283,0.352293,0.081794,2
2,5.0,"service is amazing, the line goes so fast.",service amazing line go fast,0.352,0.0,0.648,2,33.0,8,105947477166033397439,0.141409,0.009235,1


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow_counts = count_vect.fit_transform(df['reviewTextClean'].values)
bow_counts.shape

(96001, 33795)

In [5]:
print("Length of vocabulary:", len(count_vect.vocabulary_))
print("Length of corpus:", bow_counts.shape[0])

Length of vocabulary: 33795
Length of corpus: 96001


Create the TF-IDF representation of `reviewTextClean`.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
bow_tfidf = tfidf.fit_transform(bow_counts)

Add the remaining features to both `bow_counts` and `bow_tfidf`. LIX and the number of words are first rescaled with min-max scaling so that they take values between zero and one, matching the scale of the TF-IDF weights.
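Min-max scaling maps each column to the unit interval via x' = (x - min) / (max - min). The following sketch checks this formula against `MinMaxScaler` on a small made-up array (the values and the `toy` names are illustrative assumptions).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows of (LIX, number of words)
toy = np.array([[28.5, 50.0],
                [82.0, 63.0],
                [33.0, 8.0]])

# Manual min-max scaling, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# Same operation through scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
```

Note that in this notebook the scaler is fitted on all rows before the data is split. Fitting it on the training rows only would be the stricter choice.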

In [7]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["LIXNorm", "NoWNorm"]] = scaler.fit_transform(df[["LIX", "NumberOfWords"]])

In [8]:
df["price"] = df["price"].astype(int)

The neutral sentiment score is not added as a feature. It would be heavily correlated with the positive and negative sentiment scores, since the three sum to one. 
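The redundancy can be made concrete with a small sketch. The sentiment proportions below are made up for illustration. Because the neutral share is fully determined by the other two, adding it to a design matrix that already has an intercept does not increase the matrix rank.

```python
import numpy as np

# Hypothetical sentiment proportions (not taken from the data set)
pos = np.array([0.141, 0.234, 0.352, 0.500])
neg = np.array([0.109, 0.051, 0.000, 0.200])
neu = 1.0 - pos - neg  # neutral share follows from the other two

intercept = np.ones_like(pos)
with_neu = np.column_stack([intercept, pos, neg, neu])
without_neu = np.column_stack([intercept, pos, neg])

# The extra column adds no new information, so the rank stays the same
rank_with = np.linalg.matrix_rank(with_neu)
rank_without = np.linalg.matrix_rank(without_neu)
print(rank_with, rank_without)
```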

In [9]:
rem_feats = csr_matrix(df[["posReviewPercent", "negReviewPercent", "LIXNorm", "NoWNorm"]])
bow_counts_full = hstack([bow_counts, rem_feats]).tocsr()
bow_tfidf_full = hstack([bow_tfidf, rem_feats]).tocsr()

In [10]:
bow_counts

<96001x33795 sparse matrix of type '<class 'numpy.int64'>'
	with 2035355 stored elements in Compressed Sparse Row format>

In [11]:
bow_counts_full

<96001x33799 sparse matrix of type '<class 'numpy.float64'>'
	with 2350414 stored elements in Compressed Sparse Row format>

In [12]:
print(bow_tfidf_full[0,:])

  (0, 1254)	0.2865966896963391
  (0, 2533)	0.10711259765276591
  (0, 6048)	0.27831494510811655
  (0, 7158)	0.28031377173992783
  (0, 7583)	0.16834943938545552
  (0, 8103)	0.2545201118484472
  (0, 9298)	0.10483606009852262
  (0, 11508)	0.1564140656440828
  (0, 12864)	0.06794621292653356
  (0, 13111)	0.0694752219899452
  (0, 14447)	0.3576194807022489
  (0, 16802)	0.14417594583527765
  (0, 17415)	0.1284114453908536
  (0, 19909)	0.13346342235554148
  (0, 20051)	0.09669954162366268
  (0, 26347)	0.16408146827974746
  (0, 27946)	0.2000790176860044
  (0, 28372)	0.09847745742261445
  (0, 30105)	0.12602822034467961
  (0, 30254)	0.2728413629251984
  (0, 30982)	0.16176057418419473
  (0, 31186)	0.25223180565723413
  (0, 31797)	0.25635987216511785
  (0, 32542)	0.15739559923055757
  (0, 33301)	0.2697086464659349
  (0, 33795)	0.141
  (0, 33796)	0.109
  (0, 33797)	0.1220604978234178
  (0, 33798)	0.06464379947229551


First, the data is split into training and test sets. To keep a completely clean test set, two splits are made. The data is first split into a development set and a test set. The test set is set aside for final testing once the models have been tuned. The development set is then split into training and validation sets, such that a model can be trained and evaluated as usual. This is done for both the bag-of-words set and the TF-IDF set in order to test the hypothesis that TF-IDF performs better than a simple bag-of-words count.
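With a 20% test split followed by a 10% validation split of the remainder, the overall proportions come out at roughly 72% training, 8% validation and 20% test. The sketch below shows this on synthetic data. The `_toy` arrays and the fixed `random_state` are assumptions added for reproducibility and are not used in the notebook itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, three imbalanced classes
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.55, 0.25])

# First split: development (80%) versus test (20%)
X_dev_toy, X_test_toy, y_dev_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# Second split: training (90% of dev) versus validation (10% of dev)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_dev_toy, y_dev_toy, test_size=0.1, stratify=y_dev_toy, random_state=0)

print(len(y_train_toy), len(y_val_toy), len(y_test_toy))  # 720 80 200
```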

In [13]:
from sklearn.model_selection import train_test_split
y = df.price.values-1

# Split into development and test sets (BoW, TF-IDF and TF-IDF with extra features)
X_bow_dev, X_bow_test, X_tfidf_dev, X_tfidf_test, X_tfidf_f_dev, X_tfidf_f_test, y_dev, y_test = train_test_split(bow_counts, bow_tfidf, bow_tfidf_full, y, test_size=0.2, stratify=y)

# Split development into training and validation sets (same three feature sets)
X_bow_train, X_bow_val, X_tfidf_train, X_tfidf_val, X_tfidf_f_train, X_tfidf_f_val, y_train, y_val = train_test_split(X_bow_dev, X_tfidf_dev, X_tfidf_f_dev, y_dev, test_size=0.1, stratify=y_dev)

In [14]:
# Function that prints train/validation accuracy, weighted F1 and the confusion matrix
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
def classification_metrics(header_text, clf, X_train, X_test, y_train, y_test):
    print("="*35)
    print(" "*int((35-len(header_text))/2), header_text)
    print("="*35)
    y_pred = clf.predict(X_train)
    print("Train Accuracy:", accuracy_score(y_train, y_pred))
    y_pred = clf.predict(X_test)
    print("Val Accuracy:", accuracy_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred, average="weighted"))
    print(confusion_matrix(y_test, y_pred))

First, a simple dummy classifier is tested on the bag-of-words counts. It always predicts the most frequent class in the training set. As expected, it reaches an accuracy of 55.82% on both the training and the validation set. The result is the same with the additional features, since the most frequent class in y does not depend on the features.

In [15]:
# Dummy classifier predicting most frequent class
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_bow_train, y_train)
y_pred = dummy.predict(X_bow_val)
classification_metrics("Dummy classifier", dummy, X_bow_train, X_bow_val, y_train, y_val)

          Dummy classifier
Train Accuracy: 0.5582465277777777
Val Accuracy: 0.558203125
F1 Score: 0.3999359566934069
[[   0 1382    0]
 [   0 4287    0]
 [   0 2011    0]]
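The weighted F1 score of the dummy classifier is much lower than its accuracy, which follows directly from the confusion matrix. The sketch below redoes the arithmetic in plain Python using the class counts from the matrix above. It is meant as a check of the reasoning, not as a replacement for `f1_score`.

```python
# Validation class counts, read off the dummy classifier's confusion matrix
support = [1382, 4287, 2011]
total = sum(support)
majority = max(range(len(support)), key=lambda k: support[k])

# Every observation is predicted as the majority class
accuracy = support[majority] / total
precision = support[majority] / total  # correct share of all predictions
recall = 1.0                           # every majority-class row is found
f1_majority = 2 * precision * recall / (precision + recall)

# The other classes are never predicted, so their F1 is 0.
# The weighted F1 is the support-weighted average over all classes.
weighted_f1 = f1_majority * support[majority] / total

print(round(accuracy, 4), round(weighted_f1, 4))  # 0.5582 0.3999
```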


The next step is to test the TF-IDF data. A logistic regression is used, since the target is a discrete class. It performs clearly better than the dummy classifier, with an accuracy of 64.9% on the validation set. Adding the extra features does not help: validation accuracy stays at 64.9% and the weighted F1 score is essentially unchanged (0.6294 versus 0.6296).

In [16]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver="saga")
lr.fit(X_tfidf_train, y_train)
classification_metrics("LogisticRegression tf-idf", lr, X_tfidf_train, X_tfidf_val, y_train, y_val)

      LogisticRegression tf-idf
Train Accuracy: 0.7214554398148149
Val Accuracy: 0.64921875
F1 Score: 0.6296259411112886
[[ 505  818   59]
 [ 212 3607  468]
 [  30 1107  874]]


In [129]:
lr = LogisticRegression(solver="saga")
lr.fit(X_tfidf_f_train, y_train)
classification_metrics("LogisticRegression tf-idf full", lr, X_tfidf_f_train, X_tfidf_f_val, y_train, y_val)

   LogisticRegression tf-idf full
Train Accuracy: 0.7216145833333333
Val Accuracy: 0.64921875
F1 Score: 0.6294342783846033
[[ 504  818   60]
 [ 210 3611  466]
 [  29 1111  871]]


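XGBoost is then used to see how well gradient boosting classifies the price points of the establishments. It does not perform better than the logistic regression: validation accuracy is about 58.7%, only slightly above the dummy baseline, and most observations are assigned to the majority class.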

In [21]:
import xgboost
xgb = xgboost.XGBClassifier(eval_metric="mlogloss", use_label_encoder=False)
xgb.fit(X_tfidf_train, y_train)
classification_metrics("XGBoost tf-idf", xgb, X_tfidf_train, X_tfidf_val, y_train, y_val)

           XGBoost tf-idf
Train Accuracy: 0.5934751157407407
Val Accuracy: 0.58671875
F1 Score: 0.47844368750126176
[[ 130 1243    9]
 [  38 4190   59]
 [   5 1820  186]]


In [22]:
xgb.fit(X_tfidf_f_train, y_train)
classification_metrics("XGBoost tf-idf full", xgb, X_tfidf_f_train, X_tfidf_f_val, y_train, y_val)

         XGBoost tf-idf full
Train Accuracy: 0.5942708333333333
Val Accuracy: 0.5859375
F1 Score: 0.47892009874393476
[[ 129 1247    6]
 [  44 4178   65]
 [   6 1812  193]]


Finally, as logistic regression performs best so far, an ordinal logistic regression is tested. This could be useful because the price points have an inherent order: `$$` is more expensive than `$`, and so on (`$` < `$$` < `$$$`).

In [23]:
from mord import LogisticAT
from sklearn.preprocessing import StandardScaler
# y must be an integer array (price was cast to int in df above)
# with_mean=False keeps the sparse matrix sparse
scaler = StandardScaler(with_mean=False)
X_tfidf_train = scaler.fit_transform(X_tfidf_train)
lat = LogisticAT(alpha=1.0, verbose=0)
lat.fit(X_tfidf_train, y_train)
# Note: X_tfidf_val is passed unscaled here, unlike X_tfidf_train
classification_metrics("Ordinal Logistic Regression tf-idf", lat, X_tfidf_train, X_tfidf_val, y_train, y_val)

 Ordinal Logistic Regression tf-idf
Train Accuracy: 0.783839699074074
Val Accuracy: 0.558203125
F1 Score: 0.3999359566934069
[[   0 1382    0]
 [   0 4287    0]
 [   0 2011    0]]


In [24]:
X_tfidf_f_train = scaler.fit_transform(X_tfidf_f_train)
lat.fit(X_tfidf_f_train, y_train)
classification_metrics("Ordinal Logistic Regression tf-idf full", lat, X_tfidf_f_train, X_tfidf_f_val, y_train, y_val)

 Ordinal Logistic Regression tf-idf full
Train Accuracy: 0.7839265046296297
Val Accuracy: 0.558203125
F1 Score: 0.3999359566934069
[[   0 1382    0]
 [   0 4287    0]
 [   0 2011    0]]


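The ordinal regression, the dummy classifier and XGBoost all score lower on the validation set than the plain logistic regression. Note that the ordinal model's validation predictions are identical to the dummy classifier's, despite a training accuracy of 78%. The validation matrix was not passed through the `StandardScaler` fitted on the training matrix, which most likely explains this, so the result says little about ordinal regression as such. The logistic regression is chosen for further tuning, and its hyperparameters are tuned with a grid search.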
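The ordinal regression cells above scale the training matrix with `StandardScaler`. One way to make sure that validation data always receives the same transformation is to wrap the scaler and the model in a scikit-learn `Pipeline`. The sketch below uses random sparse data and `LogisticRegression` as a stand-in model (both are assumptions). `LogisticAT` from `mord` could presumably be used in its place, provided it follows the scikit-learn estimator interface.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random sparse stand-in data (hypothetical, not the review features)
rng = np.random.default_rng(0)
X_tr = sparse_random(200, 30, density=0.2, format="csr", random_state=0)
X_va = sparse_random(50, 30, density=0.2, format="csr", random_state=1)
y_tr = rng.integers(0, 3, size=200)

# The scaler is fitted on the training data only, inside fit();
# predict() then applies the same fitted scaling to new data
pipe = make_pipeline(
    StandardScaler(with_mean=False),    # with_mean=False preserves sparsity
    LogisticRegression(max_iter=1000),  # stand-in for the ordinal model
)
pipe.fit(X_tr, y_tr)
val_pred = pipe.predict(X_va)
print(val_pred.shape)
```

A pipeline also leaves the original training matrix untouched, so later cells that reuse it see unscaled data.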

In [130]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from scipy.stats import loguniform

# As the goal is to classify, the RepeatedStratifiedKFold is used
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=1)

space = dict()
space['solver'] = ['saga']
space['penalty'] = ['none', 'l1', 'l2']
space['C'] = [0.1, 1, 10]

search = GridSearchCV(lr, space, cv=cv)
result = search.fit(X_tfidf_f_train, y_train)





The best hyperparameters found are shown below. They are used for predicting the price points of the full data set.

In [132]:
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: 0.6408613040123456
Best Hyperparameters: {'C': 1, 'penalty': 'l2', 'solver': 'saga'}


In [113]:
X = bow_tfidf_full
# lr uses the default C=1 and l2 penalty, which match the grid-search optimum
predict = lr.predict(X)
# Map prices back from 0-1-2 to the original 1-2-3 format
df["pred_price"] = predict+1

In [114]:
# Fraction of predictions that match the real price (all rows, including training rows)
(df["price"] == df["pred_price"]).mean()

0.7022531015301924

In [125]:
# Count the reviews per place Id and predicted price
p_choice = df.groupby(['gPlusPlaceId', 'pred_price']).size()
# Convert to a dataframe with a 'count' column
df_p = p_choice.to_frame('count').reset_index()
# Find the largest count for each place Id
df_pmax = df_p.groupby('gPlusPlaceId')['count'].max().reset_index()
# Keep the price point(s) that reach the largest count
df_pfin = df_pmax.merge(df_p, on=['gPlusPlaceId', 'count'])
# Drop the counts
df_pfin = df_pfin.drop(['count'], axis=1)
# If several price points tie for the largest count, pick the highest price
df_pfin = df_pfin.groupby('gPlusPlaceId')['pred_price'].max().reset_index()
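The aggregation above amounts to a majority vote per place, with ties resolved in favour of the higher price. A small sketch on a made-up frame (the place Ids `a`, `b`, `c` and the `toy` names are hypothetical) shows the intended behaviour.

```python
import pandas as pd

# Hypothetical per-review predictions for three places
toy = pd.DataFrame({
    "gPlusPlaceId": ["a", "a", "a", "b", "b", "c"],
    "pred_price":   [1, 1, 3, 2, 3, 2],
})

# Count reviews per (place, predicted price)
counts = (toy.groupby(["gPlusPlaceId", "pred_price"])
             .size().to_frame("count").reset_index())
# Largest count per place
top = counts.groupby("gPlusPlaceId")["count"].max().reset_index()
# Keep the price point(s) that reach the largest count
winners = top.merge(counts, on=["gPlusPlaceId", "count"]).drop(columns="count")
# Resolve ties by taking the highest price
place_price = winners.groupby("gPlusPlaceId")["pred_price"].max().reset_index()

print(place_price)
```

Place `a` gets price 1 (two votes against one), place `b` has a tie between 2 and 3 and gets 3, and place `c` gets 2.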

In [127]:
#print(df_pfin['pred_price'].value_counts())

#df_pori = df.groupby('gPlusPlaceId').mean()['price'].reset_index()
#print(df_pori['price'].value_counts())

Finally, the prices are added to the places data, such that every establishment has a price associated with it.

In [128]:
df_places = pd.read_csv("data/places_final.csv")
df_places = pd.merge(df_places, df_pfin, how="left", on="gPlusPlaceId")
df_places.head()

Unnamed: 0,gPlusPlaceId,name,price,lat,lon,city,address,Grid,category,pred_price
0,101742583391038750118,Carpo London,,51.509499,-0.135762,London,"16 Piccadilly, London W1J 0DE, United Kingdom",L159,Retail,
1,100574642292837870712,Premium Cars,,51.514637,-0.06498,London,"10 Commercial Road Premium Cars First Floor, S...",L186,Other,
2,105185983265572241970,eSpares Ltd,,51.479416,-0.179209,London,"Chelsea Wharf, 15 Lots Rd, London, Chelsea SW1...",L40,Wholesale,
3,104500852703501308358,Superdrug,,51.494537,-0.145769,London,"Unit 35, Victoria Railway Station, London SW1V...",L101,Retail,
4,107519298595557659572,Kura,2.0,51.502122,-0.163029,London,"3-4 Park Close, London SW1X 7PQ, United Kingdom",L137,Restaurant,3.0
