# Recommendation system based on reviews

## Context:
You, as a Data Scientist, have been asked to build a recommendation service for users on a vacation rental platform based on their previous experience.
## Task:
Your task is to develop a model that recommends new properties to returning users based on their past reviews. Assume that our platform lists only vacation houses in London and that we want to recommend new properties only to our loyal returning users.
## Data:
As input you get the London Airbnb Dataset, which contains user reviews and general information about listings.

## Deliverables / outcome:
Upon completion of your analysis, your presentation should encompass the following:

• insights and challenges that you’ve faced during the discovery process,

• results of the sentiment analysis and how you extracted signals for the recommendation model,

• recommendation model itself: what approach and algorithm was selected, why and how it can be evaluated.


# Idea of the solution

## Algorithm

We use the k-means algorithm to cluster all listings based on their reviews. The features are the TF-IDF scores of the review text, and the clusters are formed without supervision.


**How do we do that?**

Each listing's reviews are collected and concatenated into a single string, so each listing is represented by the TF-IDF scores of that concatenated string. These TF-IDF vectors are then used to compute the Euclidean distance between points in feature space, which is what k-means needs to assign listings to clusters.
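
The concatenation step itself is not shown in the notebook cells, which read an already merged file. A minimal sketch of how it could look, assuming a reviews frame with `listing_id` and `comments` columns (the toy rows and the `docs` name are invented for illustration):

```python
import pandas as pd

# Toy reviews table; the rows are made up for illustration only.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": ["Great location.", "Very clean flat.", None],
})

# Treat missing reviews as empty text, then build one document per listing
# by joining all of its reviews with a space.
docs = (
    reviews.assign(comments=reviews["comments"].fillna(""))
    .groupby("listing_id")["comments"]
    .agg(" ".join)
    .reset_index()
)
print(docs)
```

Each row of `docs` then plays the role of one document in the TF-IDF step.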


**What is a TF-IDF score?**

Given a **document** (the concatenated review string of one listing) in a **corpus** (the reviews of all listings), the TF-IDF score of a word is high when the word occurs frequently in that particular document and rarely across the rest of the corpus.

**Example for intuition**
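
For reference, this is how scikit-learn's `TfidfVectorizer` computes the score with its default settings (`smooth_idf=True`, `norm='l2'`), which the pipeline in this notebook does not override. Here $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, $n$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)
$$

Each document vector is then scaled to unit Euclidean length, so long and short documents become comparable.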

Consider comparing reviews of chocolates. Let's assume there are three variants of chocolate available on the market.

***Review for Variant 1***: This is the best chocolate in the world.

***Review for Variant 2***: I liked this chocolate.

Since similarity here is measured by the Euclidean distance between TF-IDF vectors, these two reviews end up close to each other because they share the word "chocolate".

Such a simple measure can produce noise and misallocations. We expect this to be limited, because reviews of rental places usually carry enough context to express the guest's opinion. In addition, we concatenate all reviews of a listing, which reduces noise because each word's TF-IDF score is then based on all reviews of the listing rather than on a single one.
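
The intuition can be checked on the two chocolate reviews. The sketch below is illustrative only: the third text is a hypothetical, unrelated sentence added for contrast, and the vectorizer uses scikit-learn's built-in English stop word list rather than the exact notebook pipeline.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = [
    "This is the best chocolate in the world.",  # review for variant 1
    "I liked this chocolate.",                   # review for variant 2
    "The flat was close to the station.",        # hypothetical unrelated text
]

# TF-IDF vectors with English stop words removed, one row per text.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
distances = euclidean_distances(vectors)

dist_chocolate = distances[0, 1]  # the two chocolate reviews
dist_unrelated = distances[0, 2]  # a chocolate review vs. the unrelated text

print(f"chocolate vs chocolate: {dist_chocolate:.3f}")
print(f"chocolate vs unrelated: {dist_unrelated:.3f}")
```

Because the rows are L2-normalized, two texts that share no terms are exactly √2 apart, which is the largest possible distance. Any shared term, such as "chocolate", pulls the distance below that.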


In [1]:
import pandas as pd

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
import random

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

# Read data

In [2]:
df_reviews= pd.read_csv('review_listings_merged.csv')
df_reviews.sample(3)

| Unnamed: 0 | listing_id | reviewer_id | comments | id | name | description | neighborhood_overview |
|---|---|---|---|---|---|---|---|
| 720318 | 18767858 | 12124772 | Fantastic one bedroom apartment in Hunter Stre... | 18767858 | Fantastic one bedroom apartment in Hunter Stre... | One Bedroom apartment perfect for couples, sol... | Bloomsbury is within LB of Camden and bordered... |
| 191174 | 3252847 | 38134121 | Perfect Shoreditch 2 bedroom Zone1Location Loc... | 3252847 | Perfect Shoreditch 2 bedroom Zone1 | Location Location Location !!! PLEASE KINDLY A... | |
| 467072 | 12076769 | 114721342 | DOUBLE ROOM NEXT TO EXCEL & CITYA strategicall... | 12076769 | DOUBLE ROOM NEXT TO EXCEL & CITY | A strategically located cosy room next to the ... | |


# Modeling

In [3]:
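# Build a pipeline that extracts TF-IDF scores from the text (stop words removed) and clusters them with k-means.
# n_init=10 is the value the original run used by default; setting it explicitly avoids the FutureWarning.
pipeline = Pipeline(steps=[('tfidf', TfidfVectorizer(lowercase=True,
                                                     max_features=240,
                                                     stop_words=list(ENGLISH_STOP_WORDS))),
                           ('model', KMeans(n_clusters=60, n_init=10))])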

In [4]:
# Fit the pipeline on the review texts; the cluster predictions are added to df_reviews in the next cell
model = pipeline.fit(df_reviews['comments'])



In [5]:
df_reviews['Cluster'] = model.predict(df_reviews['comments'])

In [6]:
df_reviews.sample(3)

| Unnamed: 0 | listing_id | reviewer_id | comments | id | name | description | neighborhood_overview | Cluster |
|---|---|---|---|---|---|---|---|---|
| 1088621 | 34107611 | 257727801 | Single room - 3min walk from Sudbury Hill stat... | 34107611 | Single room - 3min walk from Sudbury Hill station | The room is always clean and freshly painted. ... | | 4 |
| 245125 | 4671441 | 58069122 | Poppyseed Studio .. Home from home!Poppyseed S... | 4671441 | Poppyseed Studio .. Home from home! | Poppyseed Studio is a gorgeous, sunny studio r... | I love the local shopkeepers and the village a... | 2 |
| 1126202 | 37589784 | 9582336 | Cozy Double Bedroom in Euston London (13)A spa... | 37589784 | Cozy Double Bedroom in Euston London (13) | A spacious private bedroom with wood flooring,... | Euston station is located in Camden in Euston ... | 50 |


# Recommendation

In [7]:
df_listings= pd.read_csv('recommendations/listings.csv')



In [8]:
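def suggest_listings(df: pd.DataFrame, reviewer_id: int, n: int = 3) -> list[int]:
    """Return up to `n` random unseen listings from the cluster of the reviewer's first review."""
    user_rows = df[df['reviewer_id'] == reviewer_id]
    seen_listings = set(user_rows['listing_id'])
    cluster = user_rows['Cluster'].iloc[0]
    cluster_listings = set(df[df['Cluster'] == cluster]['listing_id'])
    # random.sample() on a set is deprecated since Python 3.9, so convert to a sorted list first
    candidates = sorted(cluster_listings - seen_listings)
    return random.sample(candidates, min(n, len(candidates)))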

In [9]:
suggest_listings(df=df_reviews, reviewer_id=1621287)



[31325432, 20098244, 35024701]

In [10]:
df_reviews[df_reviews['reviewer_id']==1621287]['description'].values

array(["A quiet, light flat in a very quiet road in central London. The building is Georgian-style and in a leafy quiet street, five mins walk from Borough tube. Beautiful garden. A whole flat in a quiet, Georgian-style building, two double bedrooms and a living room which can be used as a third double bedroom.  Its in a very quiet leafy street and there is a really nice garden as well as access to two beautiful private gardens shared with the other residents of Trinity Street, twenty seconds walk away in the street, so plenty of choice in case we ever get a sunny day!  There's wifi and an equipped kitchen that you're very welcome to use. Its a lovely, historical area, you are five minutes walk from Borough market,  fifteen from Waterloo, twenty from the Southbank. Everywhere. It's an amazing neighbourhood - real Dickensian London! We're five minutes walk from Borough tube station, and ten minutes from London Bridge and the world-famous Borough Market. A few minutes further is the rive

In [11]:
df_listings[df_listings['id'].isin(suggest_listings(df=df_reviews, reviewer_id=1621287))]['description'].values



array(["Homely but peaceful apartment in Hackney Central (less than 10 minutes' walk from the nearest station), beautiful building in a fun area of London with lots of amenities on your doorstep or easy access to the centre of London.",
       "Please READ the WHOLE LISTING Attentively! NO other party (friend, relatives, etc) bookings! Unless you email me and I approve. Please read the house rules and only if you are happy with Everything you can book/request. A lovely and spacious room - suits travellers & people who come to London for business/work. Ideal for those who work for BBC (just (Phone number hidden by Airbnb) min away Walk), Westfield Mall (10 minute away Walk). Locked room. Check in ONLY BY 10PM! NO guests who work nights and sleep during daytime. The kitchen will be shared only with 1 person and is available for light cooking and eating. NO Cooking with strong spices (ex. garlic, onions, etc). The bathroom is shared with 1 or 2 people, depending on the circumstances.  In 

# Outline

## Evaluation

Due to lack of time, we did not evaluate the recommender. A starting point would be classification-style metrics such as Precision@k (the fraction of the top-k recommended items that are relevant to the user) and Recall@k (the fraction of the items relevant to the user that appear in the top-k recommendations). As with any other task, however, the metrics should be selected based on the business objective.
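
Both metrics are simple to compute once a set of relevant items per user is available, for example listings the user booked in a held-out period (an assumption, since no such split exists in this notebook). A minimal sketch with hypothetical listing ids:

```python
def precision_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(item in relevant for item in top_k) / len(top_k)


def recall_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in recommended[:k]) / len(relevant)


# Hypothetical example: three recommendations, two held-out relevant listings.
recommended = [101, 102, 103]
relevant = {102, 999}
print(precision_at_k(recommended, relevant, k=3))  # 1 hit out of 3 recommendations
print(recall_at_k(recommended, relevant, k=3))     # 1 hit out of 2 relevant listings
```

Averaging these values over all evaluated users gives one number per metric to compare model variants, for example different numbers of clusters.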

## Next Steps

At the moment our model has several limitations:
- listings without reviews are not included
- suggestions are not ranked
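
One possible way to address the missing ranking, sketched here as an assumption rather than as part of the current model, is to order the candidate listings by the cosine similarity of their TF-IDF vectors to a listing the user has already reviewed. The listing ids and texts below are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-listing review documents, keyed by listing id.
listing_texts = {
    10: "quiet flat with a lovely garden",  # already reviewed by the user
    20: "lovely garden and a quiet leafy street",
    30: "busy room next to the tube station",
    40: "quiet room with garden view",
}
seen_id = 10
candidate_ids = [20, 30, 40]

ids = list(listing_texts)
vectors = TfidfVectorizer(stop_words="english").fit_transform(list(listing_texts.values()))

# Similarity of every candidate to the listing the user has already reviewed.
seen_vector = vectors[ids.index(seen_id)]
candidate_rows = [ids.index(i) for i in candidate_ids]
scores = cosine_similarity(seen_vector, vectors[candidate_rows]).ravel()

# Highest similarity first.
ranked = [candidate_ids[i] for i in np.argsort(-scores)]
print(ranked)
```

In the real setting the candidates would be the unseen listings of the user's cluster, so the ranking would replace the random sampling in `suggest_listings`.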