<a href="https://colab.research.google.com/github/karencfisher/hotel-reviews/blob/main/notebooks/hotel_classification_quick_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install scikit-multilearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [64]:
!wget https://raw.githubusercontent.com/karencfisher/hotel-reviews/main/data/reviews_sample.csv

--2023-02-13 19:28:44--  https://raw.githubusercontent.com/karencfisher/hotel-reviews/main/data/reviews_sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23866047 (23M) [text/plain]
Saving to: ‘reviews_sample.csv’


2023-02-13 19:28:44 (155 MB/s) - ‘reviews_sample.csv’ saved [23866047/23866047]



In [58]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer 
from skmultilearn.adapt import MLkNN 
from sklearn.metrics import hamming_loss, accuracy_score 
from sklearn.pipeline import Pipeline

In [66]:
df_reviews = pd.read_csv('reviews_sample.csv')
df_reviews.drop(columns=['Overall', 'Location'], inplace=True)
df_reviews

Unnamed: 0,Cleanliness,Rooms,Service,Sleep Quality,Value,Content,average_score,sentiment
0,4,3,4,2,3,great choice views service restaurant smallish...,3.285714,0
1,3,3,3,3,3,good location family stayed thanksgiving weeke...,3.285714,0
2,3,3,3,4,3,economy comfort booked hotel day best thing ho...,3.142857,0
3,2,2,3,3,2,disappointed booked stay thru priceline cheape...,2.428571,0
4,2,1,1,1,1,hate hotel stayed standard hate hotel stayed s...,1.428571,0
...,...,...,...,...,...,...,...,...
35117,4,4,2,5,4,pool awesome service last two days great today...,3.714286,1
35118,5,4,5,4,5,hidden little gem stayed many great hotels tha...,4.571429,1
35119,3,5,4,5,4,pretty perfect westin really gave true resort ...,4.428571,1
35120,5,3,4,3,3,generallly ok bad place price deal especially ...,3.714286,1


Convert the ratings for Cleanliness, Rooms, Service, Sleep Quality, and Value to binary tags: 1 if the rating is below 4 (the aspect needs attention), 0 otherwise.

In [67]:
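tags = ['Cleanliness', 'Rooms', 'Service', 'Sleep Quality', 'Value']
# astype returns a new frame, so assign the result back
df_reviews[tags] = df_reviews[tags].astype(int)

# Flag each aspect as needing attention (1) when its rating is below 4
for tag in tags:
  df_reviews[tag] = df_reviews[tag].apply(lambda x: 1 if x < 4 else 0)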

df_reviews

Unnamed: 0,Cleanliness,Rooms,Service,Sleep Quality,Value,Content,average_score,sentiment
0,0,1,0,1,1,great choice views service restaurant smallish...,3.285714,0
1,1,1,1,1,1,good location family stayed thanksgiving weeke...,3.285714,0
2,1,1,1,0,1,economy comfort booked hotel day best thing ho...,3.142857,0
3,1,1,1,1,1,disappointed booked stay thru priceline cheape...,2.428571,0
4,1,1,1,1,1,hate hotel stayed standard hate hotel stayed s...,1.428571,0
...,...,...,...,...,...,...,...,...
35117,0,0,1,0,0,pool awesome service last two days great today...,3.714286,1
35118,0,0,0,0,0,hidden little gem stayed many great hotels tha...,4.571429,1
35119,1,0,0,0,0,pretty perfect westin really gave true resort ...,4.428571,1
35120,0,1,0,1,1,generallly ok bad place price deal especially ...,3.714286,1
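
The same thresholding can also be written without a per-column loop. The following is a minimal sketch, not the notebook's own code: it rebuilds only the first five rating rows shown in the first table (the `ratings` frame is a stand-in for `df_reviews[tags]`) and compares the whole frame against the threshold at once.

```python
import pandas as pd

tags = ['Cleanliness', 'Rooms', 'Service', 'Sleep Quality', 'Value']

# First five rating rows as displayed in the first table (stand-in for df_reviews[tags]).
ratings = pd.DataFrame(
    [[4, 3, 4, 2, 3],
     [3, 3, 3, 3, 3],
     [3, 3, 3, 4, 3],
     [2, 2, 3, 3, 2],
     [2, 1, 1, 1, 1]],
    columns=tags,
)

# Comparing the frame with the threshold yields booleans;
# casting to int gives 1 for "needs attention" and 0 otherwise.
flags = (ratings < 4).astype(int)
print(flags)
```

The resulting rows should agree with the first five rows of the converted table.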


Split the sampled data set into training (70%) and test (30%) sets.

In [68]:
y = np.asarray(df_reviews[tags])
x_train, x_test, y_train, y_test = train_test_split(df_reviews['Content'], 
                                                    y, 
                                                    test_size=0.3, 
                                                    random_state=42)
x_train.shape, y_train.shape

((24585,), (24585, 5))

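Construct a consistent pipeline: wrap the TF-IDF vectorizer and the MLkNN classifier in one class so that the vocabulary fitted on the training text is reused at prediction time.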

In [69]:
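class CLF:
  """TF-IDF vectorizer followed by a multi-label k-nearest-neighbors classifier."""

  def __init__(self, max_features, k_neighbors):
    # Keep only the max_features most frequent terms in the vocabulary
    self.vectorize = TfidfVectorizer(max_features=max_features)
    self.clf = MLkNN(k=k_neighbors)

  def fit(self, x, y):
    # Learn the vocabulary and vectorize the training text in one pass
    x = self.vectorize.fit_transform(x)
    self.clf.fit(x, y)
    # Also return the vectorized training data so it can be inspected
    return self, x

  def predict(self, x):
    # Reuse the vocabulary learned in fit()
    x = self.vectorize.transform(x)
    return self.clf.predict(x)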
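
The notebook imports `Pipeline` from scikit-learn but does not use it. As a rough sketch of how the same vectorize-then-classify idea could look with `Pipeline`, the example below swaps MLkNN for scikit-learn's `KNeighborsClassifier`, which accepts a 2-D label matrix directly. This substitution is an assumption made to keep the sketch self-contained; the toy documents, labels, and the name `toy_pipe` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews with two binary tags each (1 = needs attention).
docs = [
    "room dirty noisy no sleep",
    "dirty room rude staff",
    "spotless room friendly staff",
    "great value friendly staff quiet",
]
labels = np.array([
    [1, 1],
    [1, 1],
    [0, 0],
    [0, 0],
])

# The pipeline fits the vectorizer on the training text and reuses the
# same vocabulary when predict() is called on new text.
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=500)),
    ("knn", KNeighborsClassifier(n_neighbors=1)),  # stand-in for MLkNN
])
toy_pipe.fit(docs, labels)

pred = toy_pipe.predict(["dirty noisy room"])
print(pred)
```

A pipeline like this avoids returning the vectorized data from `fit`, at the cost of needing `toy_pipe.named_steps["tfidf"]` to inspect it.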


In [81]:
pipe = CLF(500, 5)
_, x = pipe.fit(x_train, y_train)



The problem is that we are applying k-nearest neighbors to a very sparse input: most entries of the TF-IDF matrix are zero.

In [82]:
x.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.11661247, 0.        , 0.        , ..., 0.13311904, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
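
Rather than reading sparsity off the printed array, it can be quantified as the fraction of non-zero entries. The sketch below uses a small hypothetical corpus (`toy_docs`), not the review data, and relies on the `nnz` attribute of the SciPy sparse matrix that `TfidfVectorizer` returns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the review text.
toy_docs = [
    "room dirty noisy no sleep",
    "dirty room rude staff",
    "spotless room friendly staff",
    "great value friendly staff quiet",
]

X = TfidfVectorizer(max_features=500).fit_transform(toy_docs)

# nnz counts the stored non-zero entries of the sparse matrix.
n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)
print(f"shape={X.shape}, density={density:.3f}")
```

On real review text with a vocabulary of several hundred terms, the density would be expected to be far lower than on a toy corpus like this one, since each review uses only a small part of the vocabulary.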

In [83]:
y_hat = pipe.predict(x_test)


In [84]:
y_hat.toarray()[:5]

array([[0, 1, 0, 1, 1],
       [1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0]])

In [85]:
y_test[:5]

array([[1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [75]:
x_test[:5]

4520     noisiest hotel ever stayed j johnson north eas...
13214    annoying deposit second time ive stayed hotel ...
5706     great view need renovation husband went chicag...
29850    fantastic central location would use room grea...
34791    would go back great location quiet guest house...
Name: Content, dtype: object

In [86]:
print(accuracy_score(y_test, y_hat)) 
print(hamming_loss(y_test, y_hat))

0.40922463699345163
0.2645155167504982
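
For multi-label targets, `accuracy_score` reports subset accuracy: a sample counts as correct only if all five tags match. `hamming_loss` is the fraction of individual tag predictions that are wrong. As a small worked example, the sketch below recomputes both by hand with NumPy on only the five predicted and true rows printed above, so the numbers differ from the full test-set scores.

```python
import numpy as np

# First five predictions and true labels, copied from the outputs above.
y_hat_5 = np.array([[0, 1, 0, 1, 1],
                    [1, 1, 1, 1, 1],
                    [0, 1, 1, 1, 1],
                    [0, 0, 0, 0, 1],
                    [0, 0, 0, 0, 0]])
y_true_5 = np.array([[1, 1, 1, 1, 1],
                     [0, 0, 1, 1, 1],
                     [1, 1, 0, 0, 1],
                     [0, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0]])

# Subset accuracy: share of rows where every tag matches.
subset_accuracy = np.all(y_hat_5 == y_true_5, axis=1).mean()

# Hamming loss: share of individual tags that differ.
hamming = (y_hat_5 != y_true_5).mean()

print(subset_accuracy)  # 0.2  (only the last row matches exactly)
print(hamming)          # 0.32 (8 of 25 tags differ)
```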


In [87]:
myreview = "This is the niceset hotel I have stayed in. The staff was really helpful and kind. The room was spotless. I will recommend this to all my friends!"
result = pipe.predict([myreview])


In [88]:
result.toarray()

array([[0, 0, 0, 0, 0]])

In [89]:
badreview = "I had no sleep at all. The room was filthy. The service really did not care at all."
result = pipe.predict([badreview])
result.toarray()

array([[1, 1, 0, 0, 1]])

In [80]:
tags

['Cleanliness', 'Rooms', 'Service', 'Sleep Quality', 'Value']
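
A prediction row is easier to read once it is mapped back to tag names. This is a minimal sketch using the row predicted for `badreview` above; the variable names `row` and `flagged` are illustrative only.

```python
tags = ['Cleanliness', 'Rooms', 'Service', 'Sleep Quality', 'Value']

# Prediction row for badreview, copied from the output above.
row = [1, 1, 0, 0, 1]

# Keep the names of the tags whose flag is set.
flagged = [tag for tag, flag in zip(tags, row) if flag]
print(flagged)  # ['Cleanliness', 'Rooms', 'Value']
```

Read this way, the model flags Cleanliness, Rooms, and Value for that review but misses Service and Sleep Quality, even though the text mentions both.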

In [90]:
review = "Hardly worth the price"
result = pipe.predict([review])
result.toarray()