# Machinaal Leren - Sprint 2: Text Data

## Task:
What insights can we gain from the text data (title, description and reviews)?

Possible tasks:
* Detect duplicate listings
* Extract keywords from reviews / descriptions
* Automatically make a list of positive and negative points for a listing based on the reviews
* Recognize listings from the same owner
* Detect anomalies (listings/reviews that are very different from other listings/reviews)
* Detect reviews that are very similar
* Perform sentiment analysis on a review
* ...

## Work table


|Task|Peter Bonnarens|Philip Kukoba|Lennert Franssens|
|------|------|------|------|
|Detect duplicate listings  |X  |X  | X |
|Extract keywords  |_  |_  | _ |
|Automatically make a list  |_  |_  | _ |
|Recognize listings from the same owner  |_  |_  | _ |
|Detect anomalies  |_  |_  | _ |
|Detect reviews that are very similar  |_  |_  | _ |
|Perform sentiment analysis on a review  |_  |_  | _ |

# 1 - Loading the dataset

In [8]:
%matplotlib inline

# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd  
import seaborn as sns 
from matplotlib import rcParams
import warnings

# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.width', None)
# pd.set_option('display.max_colwidth', None)  # None means no limit (-1 is deprecated)
# suppress warnings in the output
warnings.filterwarnings('ignore')

# figure size in inches
plt.rcParams['figure.figsize'] = 15, 12

# loading the datasets into pandas dataframes
reviews = pd.read_csv("data/reviews.csv")
listings = pd.read_csv("data/listings.csv")

In [9]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,187870,540088,2011-09-17,1003634,Dominick,"This is a very lovely room. The room, bathroom..."
1,187870,581572,2011-09-29,178012,Nancy,Tamara's place in Ghent was really nice and cl...
2,187870,715167,2011-11-13,1391583,Charlotte,We spend one night at Tamara's and it was perf...
3,187870,834756,2012-01-03,1513484,Ger Y Flo,"Pasamos unos dias increibles en Gante, la habi..."
4,187870,855004,2012-01-10,1503813,Max,My girlfriend and I had the most wonderful tim...


In [10]:
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,Unnamed: 74
0,187870,https://www.airbnb.com/rooms/187870,20210716195504,2021-07-16,Luxury studio 'Spiegelhof' in the centre of Ghent,The top floor of our house in the center of Gh...,We live in a quiet and pleasant neighborhood w...,https://a0.muscache.com/pictures/26c46224-795c...,904277,https://www.airbnb.com/users/show/904277,...,4.58,4.85,,t,1,0,1,0,3.61,
1,203806,https://www.airbnb.com/rooms/203806,20210716195504,2021-07-16,Flat close to Sint-Pieters Station!,This well-located and comfortable one-bedroom ...,The house is located in a dead-end street - so...,https://a0.muscache.com/pictures/fe477b7f-93ac...,999847,https://www.airbnb.com/users/show/999847,...,4.49,4.54,,f,2,2,0,0,2.84,
2,204245,https://www.airbnb.com/rooms/204245,20210716195504,2021-07-16,Modern studio's in Ghent centre,<b>The space</b><br />We offer luxury studio's...,,https://a0.muscache.com/pictures/1766072/02965...,1003009,https://www.airbnb.com/users/show/1003009,...,4.97,4.63,,f,2,2,0,0,0.29,
3,216715,https://www.airbnb.com/rooms/216715,20210716195504,2021-07-16,converted loft,Please read full desription for how to get the...,It's close to all that you will want or need v...,https://a0.muscache.com/pictures/1927009/20760...,911466,https://www.airbnb.com/users/show/911466,...,4.89,4.76,,t,1,1,0,0,4.31,
4,252269,https://www.airbnb.com/rooms/252269,20210716195504,2021-07-16,Large & bright town House - Center Ghent - max 8p,"bright, spacious, authentic & beautifully rest...","our neighbourhood is quiet, but nicely vibrati...",https://a0.muscache.com/pictures/69675b54-3e78...,1195314,https://www.airbnb.com/users/show/1195314,...,4.98,4.72,,f,1,1,0,0,2.23,


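Some rows are shifted one column to the right, starting at the `host_id` column (which now contains garbage data). We find these rows, collect their ids in a mask, and shift them back one column to the left. A shifted row is recognized by a numeric value in `host_verifications`.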

In [11]:
# a numeric value in host_verifications marks a shifted row
shifted_lines = listings[pd.to_numeric(listings["host_verifications"], errors='coerce').notnull()].id
mask = listings['id'].isin(shifted_lines)

# shift the affected rows one column to the left
listings.loc[mask, 'host_since':'reviews_per_month'] = listings.loc[mask, 'host_since':'reviews_per_month'].shift(-1, axis=1)
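
The repair above can be tried on a toy dataframe first. This is a minimal sketch, not the notebook's data: the `toy` frame, its values and the `overflow` column are made up for illustration. Only the detection rule (numeric `host_verifications`) and the `shift(-1, axis=1)` call mirror the cell above.

```python
import pandas as pd

# Hypothetical toy data: row 1 was parsed with one extra field, so every value
# from "host_since" onward sits one column too far to the right.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "host_since": ["2011-08-01", "stray text"],
        "host_listings_count": ["1", "2012-03-05"],
        "host_verifications": ["['email']", "2"],
        "overflow": [None, "['phone']"],
    },
    dtype=object,  # a single dtype keeps the column-wise shift simple
)

# A numeric value in host_verifications marks a shifted row.
is_shifted = pd.to_numeric(toy["host_verifications"], errors="coerce").notna()

# Move the affected rows one column to the left; the last column becomes NaN.
toy.loc[is_shifted, "host_since":"overflow"] = (
    toy.loc[is_shifted, "host_since":"overflow"].shift(-1, axis=1)
)

print(toy)
```

Note that the value pushed out of the first shifted column (`"stray text"` here) is lost, and the last column of the slice is left empty, so the slice should end at the column that holds the overflow.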

In [12]:
# drop() returns a new dataframe, so the result is assigned back to keep the change
listings = listings.drop(['host_response_time','host_response_rate','host_acceptance_rate','host_is_superhost','host_listings_count','host_total_listings_count',
            'host_has_profile_pic','host_identity_verified','latitude','longitude','accommodates','bathrooms','bedrooms','beds','price','minimum_nights',
            'maximum_nights','minimum_minimum_nights','maximum_minimum_nights','minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm',
            'maximum_nights_avg_ntm','calendar_updated','has_availability','availability_30','availability_60','availability_90','availability_365',
            'number_of_reviews','number_of_reviews_ltm','number_of_reviews_l30d','first_review','last_review','license','instant_bookable','calculated_host_listings_count',
            'calculated_host_listings_count_entire_homes','calculated_host_listings_count_private_rooms','calculated_host_listings_count_shared_rooms',
            'reviews_per_month'], axis=1)
listings

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,amenities,calendar_last_scraped,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,Unnamed: 74
0,187870,https://www.airbnb.com/rooms/187870,20210716195504,2021-07-16,Luxury studio 'Spiegelhof' in the centre of Ghent,The top floor of our house in the center of Gh...,We live in a quiet and pleasant neighborhood w...,https://a0.muscache.com/pictures/26c46224-795c...,904277,https://www.airbnb.com/users/show/904277,...,"[""Microwave"", ""Keypad"", ""Extra pillows and bla...",2021-07-16,4.85,4.90,4.95,4.90,4.87,4.58,4.85,
1,203806,https://www.airbnb.com/rooms/203806,20210716195504,2021-07-16,Flat close to Sint-Pieters Station!,This well-located and comfortable one-bedroom ...,The house is located in a dead-end street - so...,https://a0.muscache.com/pictures/fe477b7f-93ac...,999847,https://www.airbnb.com/users/show/999847,...,"[""Dishwasher"", ""Ethernet connection"", ""Microwa...",2021-07-16,4.59,4.72,4.73,4.80,4.79,4.49,4.54,
2,204245,https://www.airbnb.com/rooms/204245,20210716195504,2021-07-16,Modern studio's in Ghent centre,<b>The space</b><br />We offer luxury studio's...,,https://a0.muscache.com/pictures/1766072/02965...,1003009,https://www.airbnb.com/users/show/1003009,...,"[""Shampoo"", ""Wifi"", ""TV"", ""Heating"", ""Dedicate...",2021-07-16,4.79,4.73,4.93,4.83,4.70,4.97,4.63,
3,216715,https://www.airbnb.com/rooms/216715,20210716195504,2021-07-16,converted loft,Please read full desription for how to get the...,It's close to all that you will want or need v...,https://a0.muscache.com/pictures/1927009/20760...,911466,https://www.airbnb.com/users/show/911466,...,"[""Long term stays allowed"", ""Microwave"", ""Cook...",2021-07-16,4.75,4.81,4.90,4.89,4.85,4.89,4.76,
4,252269,https://www.airbnb.com/rooms/252269,20210716195504,2021-07-16,Large & bright town House - Center Ghent - max 8p,"bright, spacious, authentic & beautifully rest...","our neighbourhood is quiet, but nicely vibrati...",https://a0.muscache.com/pictures/69675b54-3e78...,1195314,https://www.airbnb.com/users/show/1195314,...,"[""Hot water kettle"", ""Dishwasher"", ""Microwave""...",2021-07-16,4.89,4.92,4.93,4.92,4.97,4.98,4.72,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
918,51022259,https://www.airbnb.com/rooms/51022259,20210716195504,2021-07-16,Full apartment near Sint Pieters Station,This is a 2 bedroom apartment. Guests will en...,Apartment is located in neighborhood of Sint P...,the main station in Ghent. The street is very...,https://a0.muscache.com/pictures/47c0402d-8d28...,97573561,...,"[""Microwave"", ""Fire extinguisher"", ""Dishes and...",2021-07-16,,,,,,,,
919,51036302,https://www.airbnb.com/rooms/51036302,20210716195504,2021-07-16,Guesthouse Flora,Geniet van de moderne en ouderwetse charme van...,De Flora is het meest noordelijke gedeelte van...,https://a0.muscache.com/pictures/186f59af-f089...,171955140,https://www.airbnb.com/users/show/171955140,...,"[""Hot water kettle"", ""Microwave"", ""Extra pillo...",2021-07-16,,,,,,,,
920,51053223,https://www.airbnb.com/rooms/51053223,20210716195504,2021-07-16,Large modern house 10min from downtown Ghent/Gent,Just 10 minutes from historic downtown Ghent. ...,,https://a0.muscache.com/pictures/7de8d4c4-0155...,2650664,https://www.airbnb.com/users/show/2650664,...,"[""Baby safety gates"", ""Dishwasher"", ""Backyard""...",2021-07-16,,,,,,,,
921,51056846,https://www.airbnb.com/rooms/51056846,20210716195504,2021-07-16,"Comfort, quiet en green in ancient part of center",,,https://a0.muscache.com/pictures/a7faff71-0c08...,45920980,https://www.airbnb.com/users/show/45920980,...,"[""Air conditioning"", ""Paid parking off premise...",2021-07-16,,,,,,,,


Drop all rows without comments.

In [13]:
# only the comments column decides whether a review is dropped
reviews = reviews.dropna(subset=['comments'])

In [14]:
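# remove bold section headings such as "<b>The space</b>" entirely
listings['description'] = listings['description'].str.replace(r'<b>.*?<\/b>', '', regex=True)
# replace every remaining HTML tag (e.g. "<br />") with a space
listings['description'] = listings['description'].str.replace(r'<[^<>]*>', ' ', regex=True)
# keep only the listings that have a description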
listings = listings[listings['description'].notna()]
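
To see what the two patterns do, here is a small standard-library sketch that applies them to a single string. The `sample` text and the `strip_markup` helper are hypothetical. The final whitespace cleanup is an extra step that the cell above does not perform.

```python
import re

# Same two patterns as in the cell above.
BOLD_HEADING = re.compile(r"<b>.*?</b>")
ANY_TAG = re.compile(r"<[^<>]*>")


def strip_markup(text):
    """Remove bold headings, turn other tags into spaces, tidy whitespace."""
    text = BOLD_HEADING.sub("", text)
    text = ANY_TAG.sub(" ", text)
    # Extra step (assumption, not in the notebook): collapse repeated spaces.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical description in the style of the listings data.
sample = "<b>The space</b><br />We offer luxury studio's<br /><br />in the centre."
cleaned = strip_markup(sample)
print(cleaned)
```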

## Detect duplicate listings
To detect duplicates, we compare listings on the `description` column. The column must be chosen carefully: comparing on `name`, for example, could flag distinct listings as duplicates when they happen to share the same generic name.

In [15]:
# duplicated() flags every occurrence after the first one
print(listings[listings.duplicated(subset=['description'])])

           id                            listing_url       scrape_id  \
122   8296779   https://www.airbnb.com/rooms/8296779  20210716195504   
211  14963441  https://www.airbnb.com/rooms/14963441  20210716195504   
241  17415990  https://www.airbnb.com/rooms/17415990  20210716195504   
242  17416223  https://www.airbnb.com/rooms/17416223  20210716195504   
244  17416437  https://www.airbnb.com/rooms/17416437  20210716195504   
245  17416518  https://www.airbnb.com/rooms/17416518  20210716195504   
389  26390559  https://www.airbnb.com/rooms/26390559  20210716195504   
403  26984036  https://www.airbnb.com/rooms/26984036  20210716195504   
420  28424935  https://www.airbnb.com/rooms/28424935  20210716195504   
465  32184650  https://www.airbnb.com/rooms/32184650  20210716195504   
467  32354302  https://www.airbnb.com/rooms/32354302  20210716195504   
547  36794750  https://www.airbnb.com/rooms/36794750  20210716195504   
572  38261340  https://www.airbnb.com/rooms/38261340  2021071619
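
Exact comparison misses descriptions that differ only in case or spacing, and it hides the first member of each duplicate group. The sketch below shows one possible refinement on made-up data. The `toy` frame and the normalization steps are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical listings: ids 1, 2 and 4 describe the same place.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "description": [
            "Cosy loft near the station.",
            "cosy loft  near the station. ",
            "Bright town house.",
            "Cosy loft near the station.",
        ],
    }
)

# Exact match, as in the cell above: only the literal repeat (id 4) is flagged.
exact_duplicates = toy[toy.duplicated(subset=["description"])]

# Normalize case and whitespace before comparing (an assumption).
normalized = (
    toy["description"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# keep=False flags every member of a duplicate group, including the first.
in_group = normalized.duplicated(keep=False)
groups = toy[in_group].groupby(normalized[in_group])["id"].apply(list)

print(exact_duplicates["id"].tolist())
print(groups)
```

Grouping the ids per normalized description makes it easy to inspect each set of suspected duplicates side by side.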

## Extract keywords from reviews / descriptions

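We extract keywords with TF-IDF. The term frequency (TF) measures how often a word occurs in a text, and the inverse document frequency (IDF) lowers the weight of words that occur in many places. We first work through the computation by hand on a single sample review, treating each sentence of the review as a document.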

### Import packages and English stop words

In [16]:
import math
import nltk

from nltk import tokenize
from operator import itemgetter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# this has to be run once (it downloads the stop word lists)
# nltk.download('stopwords')

# TODO: extend to Dutch (the download contains Dutch stop words as well)

stop_words = set(stopwords.words('english'))

0    The top floor of our house in the center of Gh...
1    This well-located and comfortable one-bedroom ...
Name: description, dtype: object

### Find total number of words and sentences
These are used later to compute TF and IDF.

In [31]:
#for testing purposes
sample_review = "This is a great little apartment, a 10 minute walk from the town centre and close to the canal. It has everything that you would need. I stayed for 3 weeks and it was super comfortable. Jeroen is a great host, providing a wealth of information on the local area and tourist advice for visiting the sights. The no.2. tram stops right outside the apartment which is super convenient for Sint Pieter's station. The Van Hecke bakery across the road is great, as is the frituur (amazing chips!). There's also a lovely quiet spot to sit by the canal if you walk down Iepenstraat. The only negatives for me are that the walls/windows are a little thin, so (as a light sleeper) I could hear the road and my neighbour quite late at night (with earplugs though it was not a problem). For the kitchen, be aware that the oven is actually a microwave/grill (in case you are preparing a meal plan in advance), and there are no scales or measuring jugs. Otherwise everything was perfect and I would happily return to stay there again!"
total_words = sample_review.split()
total_word_length = len(total_words)
print("The total number of words is " + str(total_word_length))

# the punkt tokenizer data must be downloaded once before using sent_tokenize
# TODO: move this download to the setup cell
nltk.download('punkt')

total_sentences = tokenize.sent_tokenize(sample_review)
total_sent_len = len(total_sentences)
print("The total number of sentences is " + str(total_sent_len))

The total number of words is 185
The total number of sentences is 11




### Calculate TF for each word

In [32]:
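# Count how often each non-stop word occurs in the sample review.
# Tokens are compared case-sensitively and only full stops are stripped, so
# capitalized stop words ('This', 'I') and tokens such as 'apartment,' are kept.
tf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.','')
    if each_word not in stop_words:
        if each_word in tf_score:
            tf_score[each_word] += 1
        else:
            tf_score[each_word] = 1

# Divide each count by the total number of words to get the term frequency
tf_score = {word: count / total_word_length for word, count in tf_score.items()}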
print(tf_score)

{'This': 0.005405405405405406, 'great': 0.010810810810810811, 'little': 0.010810810810810811, 'apartment,': 0.005405405405405406, '10': 0.005405405405405406, 'minute': 0.005405405405405406, 'walk': 0.010810810810810811, 'town': 0.005405405405405406, 'centre': 0.005405405405405406, 'close': 0.005405405405405406, 'canal': 0.010810810810810811, 'It': 0.005405405405405406, 'everything': 0.010810810810810811, 'would': 0.010810810810810811, 'need': 0.005405405405405406, 'I': 0.016216216216216217, 'stayed': 0.005405405405405406, '3': 0.005405405405405406, 'weeks': 0.005405405405405406, 'super': 0.010810810810810811, 'comfortable': 0.005405405405405406, 'Jeroen': 0.005405405405405406, 'host,': 0.005405405405405406, 'providing': 0.005405405405405406, 'wealth': 0.005405405405405406, 'information': 0.005405405405405406, 'local': 0.005405405405405406, 'area': 0.005405405405405406, 'tourist': 0.005405405405405406, 'advice': 0.005405405405405406, 'visiting': 0.005405405405405406, 'sights': 0.0054054
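
The output above still contains tokens such as `'This'` and `'apartment,'`. A possible refinement is to lowercase the text and strip punctuation before counting. The sketch below uses only the standard library. The `term_frequencies` helper, the regular expression and the tiny stop word set are assumptions for illustration, whereas the notebook uses the NLTK stop word list.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (assumption); the notebook uses NLTK's list.
STOP_WORDS = {"this", "is", "a"}


def term_frequencies(text, stop_words):
    """Return count / total tokens for every non-stop word in `text`."""
    # Lowercase first, then keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(token for token in tokens if token not in stop_words)
    # Like the cell above, divide by the total number of tokens,
    # stop words included.
    return {word: count / len(tokens) for word, count in counts.items()}


tf = term_frequencies("This is a great apartment, a great host.", STOP_WORDS)
print(tf)
```

With this normalization `'great'` and `'great,'` would be counted as the same word, which changes the scores compared with the output above.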

### Calculate IDF for each word

In [33]:
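def check_sent(word, sentences): # helper function
    # Count the sentences that match `word`.
    # Caveat: `word` is a string, so `for w in word` iterates over its characters;
    # a sentence matches when it contains every character of the word.
    final = [all([w in x for w in word]) for x in sentences] 
    sent_len = [sentences[i] for i in range(0, len(final)) if final[i]]
    return int(len(sent_len))

idf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.','')
    if each_word not in stop_words:
        # first occurrence: assume the word appears in one sentence;
        # repeated occurrence: count the matching sentences
        if each_word in idf_score:
            idf_score[each_word] = check_sent(each_word, total_sentences)
        else:
            idf_score[each_word] = 1

# IDF = log(number of sentences / number of matching sentences)
idf_score.update((x, math.log(int(total_sent_len)/y)) for x, y in idf_score.items())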

print(idf_score)

{'This': 2.3978952727983707, 'great': 0.3184537311185346, 'little': 0.3184537311185346, 'apartment,': 2.3978952727983707, '10': 2.3978952727983707, 'minute': 2.3978952727983707, 'walk': 1.0116009116784799, 'town': 2.3978952727983707, 'centre': 2.3978952727983707, 'close': 2.3978952727983707, 'canal': 0.4519851237430572, 'It': 2.3978952727983707, 'everything': 1.0116009116784799, 'would': 0.3184537311185346, 'need': 2.3978952727983707, 'I': 0.7884573603642703, 'stayed': 2.3978952727983707, '3': 2.3978952727983707, 'weeks': 2.3978952727983707, 'super': 0.20067069546215124, 'comfortable': 2.3978952727983707, 'Jeroen': 2.3978952727983707, 'host,': 2.3978952727983707, 'providing': 2.3978952727983707, 'wealth': 2.3978952727983707, 'information': 2.3978952727983707, 'local': 2.3978952727983707, 'area': 2.3978952727983707, 'tourist': 2.3978952727983707, 'advice': 2.3978952727983707, 'visiting': 2.3978952727983707, 'sights': 2.3978952727983707, 'The': 0.7884573603642703, 'no2': 2.39789527279837
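
A word-level variant of the sentence count could look like the sketch below, which tokenizes every sentence and counts the sentences whose token set contains the word. It is an alternative under stated assumptions, not the notebook's method: the `sentence_idf` helper, the regular expression and the example sentences are made up.

```python
import math
import re


def sentence_idf(sentences, stop_words=frozenset()):
    """IDF per word, treating every sentence as one document."""
    # One set of lowercase tokens per sentence.
    token_sets = [set(re.findall(r"[a-z0-9']+", s.lower())) for s in sentences]
    vocabulary = set().union(*token_sets) - set(stop_words)
    n_sentences = len(sentences)
    return {
        word: math.log(n_sentences / sum(word in tokens for tokens in token_sets))
        for word in vocabulary
    }


# Hypothetical sentences in the style of the sample review.
example_sentences = ["The canal is great.", "Great host!", "Close to the canal."]
idf = sentence_idf(example_sentences, stop_words={"the", "is", "to"})
print(idf)
```

Here `'great'` occurs in two of the three sentences and gets `log(3/2)`, while `'host'` occurs in one and gets `log(3)`.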

### Calculate TF*IDF

In [34]:
tf_idf_score = {key: tf_score[key] * idf_score.get(key, 0) for key in tf_score.keys()}
print(tf_idf_score)

{'This': 0.012961596069180382, 'great': 0.003442743039119293, 'little': 0.003442743039119293, 'apartment,': 0.012961596069180382, '10': 0.012961596069180382, 'minute': 0.012961596069180382, 'walk': 0.010936226072199783, 'town': 0.012961596069180382, 'centre': 0.012961596069180382, 'close': 0.012961596069180382, 'canal': 0.004886325662087105, 'It': 0.012961596069180382, 'everything': 0.010936226072199783, 'would': 0.003442743039119293, 'need': 0.012961596069180382, 'I': 0.012785795032934113, 'stayed': 0.012961596069180382, '3': 0.012961596069180382, 'weeks': 0.012961596069180382, 'super': 0.0021694129239151487, 'comfortable': 0.012961596069180382, 'Jeroen': 0.012961596069180382, 'host,': 0.012961596069180382, 'providing': 0.012961596069180382, 'wealth': 0.012961596069180382, 'information': 0.012961596069180382, 'local': 0.012961596069180382, 'area': 0.012961596069180382, 'tourist': 0.012961596069180382, 'advice': 0.012961596069180382, 'visiting': 0.012961596069180382, 'sights': 0.012961
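
As a check, the score for 'great' can be reproduced by hand from the printed values. The sample review has 185 words and 11 sentences, the token 'great' was counted twice, and its printed IDF of about 0.3185 corresponds to a sentence count of 8. The symbol $n_w$ for the sentence count of word $w$ is notation introduced here.

$$
\mathrm{tf}(w) = \frac{\mathrm{count}(w)}{185}, \qquad
\mathrm{idf}(w) = \ln\frac{11}{n_w}, \qquad
\mathrm{tfidf}(w) = \mathrm{tf}(w)\cdot\mathrm{idf}(w)
$$

$$
\mathrm{tfidf}(\text{great}) = \frac{2}{185}\cdot\ln\frac{11}{8} \approx 0.01081 \times 0.31845 \approx 0.00344
$$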

### Get the n most significant words

In [35]:
def get_top_n(dict_elem, n):
    # sort by score, highest first, and keep the first n entries
    result = dict(sorted(dict_elem.items(), key=itemgetter(1), reverse=True)[:n])
    return result

print(get_top_n(tf_idf_score, 5))

{'This': 0.012961596069180382, 'apartment,': 0.012961596069180382, '10': 0.012961596069180382, 'minute': 0.012961596069180382, 'town': 0.012961596069180382}
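
All five words above share the same score, so which ones make the cut depends only on their order in the dictionary. If a reproducible ranking is wanted, ties can be broken alphabetically. This is a minimal sketch: the `top_n_keywords` name and the example scores are hypothetical.

```python
def top_n_keywords(scores, n):
    """Return the n highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


# Hypothetical scores with a tie between 'bakery' and 'canal'.
example_scores = {"canal": 0.5, "bakery": 0.5, "tram": 0.9, "host": 0.1}
top = top_n_keywords(example_scores, 3)
print(top)
```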


## Keyword extraction from descriptions using TF-IDF

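Instead of computing TF-IDF by hand, we use the `texthero` library to compute a TF-IDF vector for every listing description.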

TODO: expand to reviews as well

In [36]:
# texthero is a third-party package and must be installed first (pip install texthero)
import texthero as hero
listings['tfidf'] = hero.tfidf(listings['description'])
listings.head(2)

ModuleNotFoundError: No module named 'texthero'
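
Until `texthero` is installed, the idea can be tried with the standard library alone. The sketch below treats every description as one document, so the IDF is computed across descriptions rather than across sentences. It is not the `texthero` implementation and its numbers will differ from what `hero.tfidf` returns. The `tfidf_per_document` helper and the example texts are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_per_document(documents):
    """Return one {word: tf * idf} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty texts
        scores.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return scores


# Hypothetical descriptions in the style of the listings data.
docs = [
    "Luxury studio in the centre of Ghent",
    "Flat close to the station in Ghent",
    "Modern studio in Ghent centre",
]
doc_scores = tfidf_per_document(docs)
print(doc_scores[0])
```

Words that occur in every description, such as "ghent" in this example, get a score of zero, which is the behaviour that makes TF-IDF useful for picking out listing-specific keywords.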