# Project Machine Learning: Group 25
### Peter Bonnarens, Lennert Franssens & Philip Kukoba

# Sprint 1 : Tabular Data

### Possible tasks:
* Thorough exploratory data analysis, e.g.:
    * Are there substantial price differences between neighbourhoods ?
    * Are there hosts with more than one listing ? How does this impact the price ?
    * What is the correlation between the review score and the price ?
    * ...

    Not enough to just show a plot! Clearly describe WHAT question you investigated, WHY you think this is a relevant question
    and WHAT you deduce/conclude from the results of your data analysis

* Are there outliers ?
* A new Airbnb owner needs to pick an appropriate price:
    * Train a model to predict the price based on a selection of features
    * Find the most similar listings
    
* ...

# Table of work (who did what)

<br>

## Exploratory Data Analysis (EDA)
|                   	| EDA step 1 	| EDA step 2A 	| EDA step 2B 	| EDA step 2C 	| EDA step 2D 	| EDA step 3A 	| EDA step 3B 	|
|:-----------------:	|:----------:	|:-----------:	|:-----------:	|:-----------:	|:-----------:	|:-----------:	|:-----------:	|
|  Peter Bonnarens  	|      X     	|             	|      X      	|             	|             	|             	|             	|
| Lennert Franssens 	|      X     	|      X      	|      X      	|             	|      X      	|      X       	|             	|
|   Philip Kukoba   	|      X     	|             	|             	|      X      	|             	|             	|             	|

<br>

## Linear Regression Model (LR)
|                   	| LR step 1 	| LR step 2 	| LR step 3 	| LR step 4 	|
|:-----------------:	|:---------:	|:---------:	|:---------:	|:---------:	|
|  Peter Bonnarens  	|     X     	|           	|           	|     X     	|
| Lennert Franssens 	|           	|           	|           	|           	|
|   Philip Kukoba   	|           	|           	|           	|           	|

<br>

## K Nearest Neighbors Model (KNN)
|                   	| KNN step 1 	| KNN step 2 	| KNN step 3 	| KNN step 4 	|
|:-----------------:	|:----------:	|:----------:	|:----------:	|:----------:	|
|  Peter Bonnarens  	|            	|            	|            	|            	|
| Lennert Franssens 	|     X      	|     X       	|     X       	|     X      	|
|   Philip Kukoba   	|            	|            	|            	|            	|

# Exploratory Data Analysis (EDA)

## Step 1: imports & loading the dataset
In this step we import the needed libraries and read the dataset into a pandas dataframe.

In [None]:
%matplotlib inline

# imports
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KDTree

# suppress warnings in the output
warnings.filterwarnings('ignore')

# figure size in inches
plt.rcParams['figure.figsize'] = 20,16

# loading the dataset into pandas dataframe
listings = pd.read_csv("data/listings.csv")

## Step 2: preprocessing

Before we can start our EDA, we need to preprocess our data. This means changing string values to integers, removing NaN values, removing garbage data...
Let us first take a look at our dataset. 

In [None]:
listings.head()

In [None]:
print(listings.shape)

Notice that the dataset contains a lot of text values that are not needed for this sprint, and that we start with 923 rows of data and 75 features. The first step is to make a selection of the features we think can be useful or can give us insights.

### 2A : Feature subset selection

Here is a list of the features that we think could possibly be useful during this sprint:
* **id**: (int64) unique identifier for the listing
* **host_id**: (object) unique identifier for the host
* **host_response_time**: (object) description of how long it usually takes the host to respond
* **host_response_rate**: (object) the % rate at which the host responds
* **host_acceptance_rate**: (object) the % rate at which the host accepts booking requests
* **host_total_listings_count**: (int64) the number of listings the host has
* **host_verifications**: (object) array containing the different types of verification methods the host supports
* **host_has_profile_pic**: (object) boolean value that indicates if the host has a profile picture or not
* **host_identity_verified**: (object) boolean value that indicates whether the host is verified or not
* **neighbourhood_cleansed**: (object) the neighbourhood as geocoded using the latitude and longitude
* **latitude**: (object) latitude of listing
* **longitude**: (object) longitude of listing
* **room_type**: (object) room type
* **accommodates**: (object) maximum capacity of the listing
* **bedrooms**: (object) number of bedrooms in the listing
* **beds**: (float64) number of beds
* **price**: (object) daily price in local currency
* **minimum_nights**: (object) minimum number of nights for a stay at the listing
* **maximum_nights**: (int64) maximum number of nights for a stay at the listing
* **number_of_reviews**: (object) the number of reviews the listing has
* **number_of_reviews_ltm**: (int64) the number of reviews the listing has (in the last 12 months)
* **last_review**: (object) the date of the last/newest review
* **review_scores_rating**: (object) overall rating of the listing
* **instant_bookable**: (object) boolean value that indicates whether the guest can automatically book the listing without the host having to accept the booking request
* **reviews_per_month**: (float64) the average number of reviews per month over the lifetime of the listing

We noticed that some rows in the dataset contained data that was shifted 1 column to the right starting from the 'host_since' column. Instead of removing these rows from the dataset, we decided to shift these rows 1 column back to the left.

In [None]:
# find the shifted rows and build a mask: in these rows the data is shifted 1 column to the right starting at the host_since column, so host_verifications holds a numeric value
shifted_lines = listings[pd.to_numeric(listings["host_verifications"], errors='coerce').notnull()].id
mask = listings['id'].isin(shifted_lines)

# shift lines 1 to the left
listings.loc[mask, 'host_since':'reviews_per_month'] = listings.loc[mask, 'host_since':'reviews_per_month'].shift(-1, axis=1)
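To make the effect of the shift easier to see, here is a minimal sketch on a toy dataframe. The column names `a` to `e` and all values are made up for illustration and do not come from the listings dataset.

```python
import pandas as pd

# Toy frame: row 1 is shifted one column to the right from column "b" onwards
toy = pd.DataFrame(
    {
        "a": ["x", "y"],
        "b": ["1", "junk"],
        "c": ["2", "1"],
        "d": ["3", "2"],
        "e": [None, "3"],
    },
    dtype=object,
)

# Mask of the rows to repair (here we simply pick row 1 by hand)
toy_mask = toy.index == 1

# Shift the affected columns one position to the left;
# the last column of the shifted rows becomes missing
toy.loc[toy_mask, "b":"e"] = toy.loc[toy_mask, "b":"e"].shift(-1, axis=1)
print(toy)
```

After the shift, row 1 lines up with row 0 again, and only its last column is empty.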

Some of these features do not have the types we expect. This is because there are still NaN/garbage values in the dataset, and some columns still need to be cast to the correct type. After selecting the columns, we can check their types like this:

In [None]:
# prepare extra columns to split number of bathrooms per type
listings['priv_bath'] = listings['bathrooms_text']
listings['bathrooms'] = listings['bathrooms_text']

# filter columns

listings = listings[["id", "host_id", "host_response_time", "host_response_rate", "host_acceptance_rate", "host_is_superhost", "host_total_listings_count", 
    "host_verifications", "host_identity_verified", "neighbourhood_cleansed", "latitude", "longitude", "property_type", "room_type",
    "accommodates", "priv_bath", "bathrooms", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights","availability_90",
    "number_of_reviews", "number_of_reviews_ltm", "last_review", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication",
    "review_scores_location", "review_scores_value", "instant_bookable", "reviews_per_month"]]

In [None]:
print(listings.dtypes.to_string())

### 2B : cleaning the data
Now that all of the data is in the right columns, we can clean the individual records.
First we drop the rows where NaN values are present. We consider these rows corrupt: they contain invalid data or too little data to work with.
After dropping these rows, we can convert columns with numeric data saved as strings to numeric types.

Some other columns still contain textual data that can easily be transformed to numeric data.


In [None]:
# host_response_time
# 0 = best response time; 1, 2, ... are worse
# listings["host_response_time"].unique()

#listings["host_response_time"] = [0 if x == 'within an hour' 
#                                    else 1 if x == 'within a few hours' 
#                                    else 2 if x == 'within a day' 
#                                    else 3 if x == 'a few days or more' 
#                                    else 4
#                                    for x in listings["host_response_time"]]

# clean property types
listings['property_type'] = ['Room'    if re.match('.*room.*', x, re.IGNORECASE) 
                                            else 'House' if re.match('.*house.*', x, re.IGNORECASE) 
                                            else 'Apartment' if re.match('.*apartment.*', x, re.IGNORECASE) 
                                            else 'Other'
                                            for x in listings["property_type"]]

# convert percentage to float
listings["host_response_rate"] = listings['host_response_rate'].str.rstrip('%').astype('float') / 100.0
listings['host_acceptance_rate'] = listings['host_acceptance_rate'].str.rstrip('%').astype('float') / 100.0

# convert to number of verification types
# (ast.literal_eval only parses Python literals, which is safer than eval)
import ast
listings['host_verifications'] = listings['host_verifications'].apply(ast.literal_eval).apply(len)

# convert booleans: 1 if true, 0 if false
listings["host_identity_verified"] = listings["host_identity_verified"].apply(lambda x: 1 if x == 't' else 0 if x == 'f' else None)
listings["instant_bookable"] = listings["instant_bookable"].apply(lambda x: 1 if x == 't' else 0 if x == 'f' else None)
listings["host_is_superhost"] = listings["host_is_superhost"].apply(lambda x: 1 if x == 't' else 0 if x == 'f' else None)

# private bathroom and shared bathroom
listings['bathrooms'] = listings['bathrooms'].replace(r'\s.*', '', regex=True)
listings['bathrooms'] = listings['bathrooms'].replace('^[a-zA-Z].*', '0.5', regex=True)
listings['priv_bath'] = listings['priv_bath'].replace('.*ared.*', '0', regex=True)
listings['priv_bath'] = listings['priv_bath'].replace('.*[a-zA-Z].*', '1', regex=True)

listings['bathrooms'] = listings['bathrooms'].astype(float)
listings['priv_bath'] = listings['priv_bath'].astype(float)

# convert currency to float
listings['price'] = listings['price'].replace(r'[\$,)]', '', regex=True).astype(float)
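As a small illustration of the string conversions in this cell, the sketch below applies the same patterns to a handful of made-up values. The dataframe `raw` and its contents are assumptions for the example, not rows from our dataset.

```python
import pandas as pd

# Made-up sample values in the same textual format as the raw columns
raw = pd.DataFrame({
    "host_response_rate": ["100%", "85%", None],
    "price": ["$1,250.00", "$80.00", "$45.50"],
    "bathrooms_text": ["1 bath", "1.5 shared baths", "Half-bath"],
})

# "85%" -> 0.85 (missing values stay missing)
rate = raw["host_response_rate"].str.rstrip("%").astype(float) / 100.0

# "$1,250.00" -> 1250.0
price = raw["price"].replace(r"[\$,]", "", regex=True).astype(float)

# number of bathrooms: keep the leading number, text-only entries count as 0.5
bathrooms = (
    raw["bathrooms_text"]
    .replace(r"\s.*", "", regex=True)
    .replace(r"^[a-zA-Z].*", "0.5", regex=True)
    .astype(float)
)

# private bathroom flag: 0 if the text mentions "shared", 1 otherwise
priv_bath = (
    raw["bathrooms_text"]
    .replace(r".*ared.*", "0", regex=True)
    .replace(r".*[a-zA-Z].*", "1", regex=True)
    .astype(float)
)

print(pd.DataFrame({"rate": rate, "price": price,
                    "bathrooms": bathrooms, "priv_bath": priv_bath}))
```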

In [None]:
# first we drop the rows that contain missing values
listings.dropna(inplace=True)

# next we convert some columns to numeric values
numeric_columns = ["accommodates", "bedrooms", "minimum_nights", "number_of_reviews", "review_scores_rating",
    "longitude", "beds", "host_id", "host_total_listings_count", "maximum_nights", "number_of_reviews_ltm"]
listings[numeric_columns] = listings[numeric_columns].apply(pd.to_numeric)

## Step 3 : plots

### 3A: Histograms of most interesting numeric features

In [None]:
listings.hist(figsize=(25,25))

### 3B: Distribution and pairplot
The first graph shows the distribution of the prices.
The second figure shows a scatterplot of each feature against the price, which helps us spot good predictors for our target.

In [None]:
plt.figure(figsize=(16,12))
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#dataset = listings[["price", "neighbourhood_cleansed", "host_response_time", "host_response_rate", "host_acceptance_rate", "host_total_listings_count", "host_verifications", "host_identity_verified", "accommodates", "bedrooms", "beds", "minimum_nights", "maximum_nights", "number_of_reviews", "number_of_reviews_ltm", "review_scores_rating", "instant_bookable", "reviews_per_month", "room_type", "rt_Entire home/apt", "rt_Hotel room", "rt_Private room", "rt_Shared room"]]
dataset = listings[["neighbourhood_cleansed", "host_response_time", "host_response_rate", "host_acceptance_rate", "host_is_superhost", "host_total_listings_count", 
    "host_verifications", "host_identity_verified", "property_type","room_type",
    "accommodates", "priv_bath", "bathrooms", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "availability_90",
    "number_of_reviews", "number_of_reviews_ltm", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication",
    "review_scores_location", "review_scores_value", "instant_bookable", "reviews_per_month"]]
# histplot replaces the deprecated distplot; kde=True adds the density curve
sns.histplot(dataset.price, bins=50, kde=True)
plt.xlabel("Price")
plt.xticks(np.arange(min(dataset.price.to_numpy()), max(dataset.price.to_numpy()), 50.0))
plt.show()

#sns.pairplot(dataset, vars=["host_response_time", "host_response_rate", "host_acceptance_rate", "host_total_listings_count", "host_verifications", "host_identity_verified", "accommodates", "bedrooms", "beds", "number_of_reviews", "number_of_reviews_ltm", "review_scores_rating", "instant_bookable", "reviews_per_month", "room_type", "rt_Entire home/apt", "rt_Hotel room", "rt_Private room", "rt_Shared room"])

plt.figure(figsize=(40,40))
for i, k in enumerate(dataset.keys()):
    plt.subplot(7,7,1+i)
    plt.xticks(rotation=90)
    sns.scatterplot(x=dataset[k], y=dataset["price"])


### 3C: Boxplots
Some boxplots to gain more insight into the price range per number of accommodated guests, room type and neighbourhood.

In [None]:
plt.figure(figsize=(16,12))
sns.boxplot(x="accommodates", y="price", data=dataset)

In [None]:
plt.figure(figsize=(16,12))
sns.boxplot(x=dataset.room_type, y=dataset.price)

In [None]:
plt.figure(figsize=(16,12))
plt.xticks(rotation=90)
sns.boxplot(x=dataset.neighbourhood_cleansed, y=dataset.price)
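With seaborn's default settings, the points drawn beyond the whiskers in these boxplots are the values that lie more than 1.5 times the interquartile range (IQR) away from the box. A minimal sketch of that rule is given below, as one possible way to answer the outlier question. The helper `iqr_outlier_mask` and the toy prices are hypothetical and not part of our analysis.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)

# Made-up prices with one obviously expensive listing
toy_prices = pd.Series([50, 60, 65, 70, 75, 80, 90, 400])
outliers = toy_prices[iqr_outlier_mask(toy_prices)]
print(outliers)
```

Applied to a real column, the mask could be used either to inspect the flagged listings or to filter them out before training.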

### 3D: Impact on the price if host has more than one listing

### 3E: Correlation between review score and price

### 3F: Correlation between review score and number of reviews

### 3G: Number of listings per number of accommodates, bathrooms, bedrooms and beds

In [None]:
dataset[['accommodates', 'bathrooms', 'bedrooms', 'beds']].hist()

### 3H: Distribution of property and room type

In [None]:
dataset['room_type'].hist()
dataset['room_type'].value_counts(normalize=True)

In [None]:
dataset['property_type'].hist()
dataset['property_type'].value_counts(normalize=True)

### 3I: One-hot encodings and correlation matrix
We do a one-hot encoding for categorical features. After that we can plot a correlation matrix.

In [None]:
# Replace categorical features with one-hot encodings
a = pd.get_dummies(listings['host_response_time'], prefix = "hrt")
b = pd.get_dummies(listings['room_type'], prefix = "rt")
c = pd.get_dummies(listings['property_type'], prefix = "pt")
frames = [listings, a, b, c]
listings = pd.concat(frames, axis = 1)

In [None]:
# only numeric columns can be correlated; numeric_only=True (pandas 1.5 or newer) skips the remaining text columns
correlation_matrix = listings.corr(numeric_only=True).round(2)
plt.figure(figsize=(20,14))
sns.heatmap(data=correlation_matrix, annot=True)

### 3J: Log transformation of columns that could benefit from it
We can see that there are some features that could benefit from a log transformation. This can be concluded from the histograms from 3A.

After the transformation we can see that some features are approximately normally distributed.

In [None]:
# .copy() gives us an independent dataframe, so the assignments below do not touch listings
tfo_listings = listings[["host_total_listings_count", "accommodates", "bathrooms", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews"]].copy()

for col in tfo_listings.keys():
    tfo_listings[col] = tfo_listings[col].astype('float64').replace(0.0, 0.01)
    tfo_listings[col] = np.log(tfo_listings[col])
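The cell above replaces zeros by 0.01 because the logarithm of 0 is undefined. As an alternative sketch (not what we do in this notebook), `np.log1p` computes log(1 + x), which maps 0 to 0 without an ad hoc constant, and `np.expm1` is its exact inverse. The toy dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values, including a zero count
toy = pd.DataFrame({
    "number_of_reviews": [0, 1, 10, 100],
    "price": [40.0, 60.0, 90.0, 250.0],
})

# log(1 + x): defined for x = 0, so no replacement value is needed
toy_log = np.log1p(toy)

# exp(x) - 1 undoes the transformation
restored = np.expm1(toy_log)

print(toy_log)
print(restored)
```

Note that the two transformations are not interchangeable: a model trained on `log1p` values must be inverted with `expm1`, and one trained on `log` values with `exp`.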

In [None]:
tfo_listings.hist()

Now we can drop some columns that are not needed anymore.

In [None]:
# drop the categorical columns in one call
dataset = dataset.drop(columns=['neighbourhood_cleansed', 'room_type', 'property_type', 'host_response_time'])

# Linear Regression Model

## Step 1 : Train - Test - Split

In [None]:
x = tfo_listings.drop(['price'], axis=1) # pd.concat((), axis=1)
y = tfo_listings['price']

# a fixed random_state makes the split (and therefore the scores) reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

## Step 2: Normalizing the data

In [None]:
# TODO
# formula: (x - xmin) / (xmax - xmin)
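The min-max formula from the TODO, (x - xmin) / (xmax - xmin), could be sketched as follows. This is only a possible implementation under our own assumptions: the helper names `min_max_fit` and `min_max_transform` and the toy data are hypothetical. The minimum and maximum are taken from the training set only and then reused for the test set, so that no information from the test set leaks into the model.

```python
import pandas as pd

def min_max_fit(train: pd.DataFrame):
    """Return the per-column minimum and maximum of the training data."""
    return train.min(), train.max()

def min_max_transform(frame: pd.DataFrame, col_min: pd.Series, col_max: pd.Series) -> pd.DataFrame:
    """Apply (x - xmin) / (xmax - xmin) column by column."""
    # avoid division by zero for constant columns
    span = (col_max - col_min).replace(0, 1)
    return (frame - col_min) / span

# Made-up training and test samples
toy_train = pd.DataFrame({"beds": [1, 2, 5], "accommodates": [2, 4, 6]})
toy_test = pd.DataFrame({"beds": [3], "accommodates": [8]})

col_min, col_max = min_max_fit(toy_train)
scaled_train = min_max_transform(toy_train, col_min, col_max)
scaled_test = min_max_transform(toy_test, col_min, col_max)

print(scaled_train)
print(scaled_test)
```

Values in the test set can fall outside the range [0, 1] when they are smaller or larger than anything seen during training.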

## Step 3 : Training the model

In [None]:
lin_model = LinearRegression()
lin_model.fit(x_train, y_train)

y_train_predict = lin_model.predict(x_train)
y_test_predict = lin_model.predict(x_test)

## Step 4 : Measure the performance of the model

In [None]:
# model evaluation for training set
n_train = len(x_train)  # sample size
p_train = len(x_train.columns)  # number of independent variables
R2_train = r2_score(y_train, y_train_predict)
RMSE_train = (np.sqrt(mean_squared_error(y_train, y_train_predict)))
# use the adjusted R² score to counter accidental increase of score with number of input features.
adj_R2_train = 1 - ((1-R2_train) * (n_train-1)/(n_train-p_train-1))   #Adj R2 = 1-(1-R2)*(n-1)/(n-p-1)

print("Model train performance")
print("--------------------------------------")
print('RMSE is {}'.format(RMSE_train))
print('R2 score is {}'.format(R2_train))
print('adjusted R2 score is {}'.format(adj_R2_train))
print("\n")

# model evaluation for testing set
n_test = len(x_test)
p_test = len(x_test.columns)
R2_test = r2_score(y_test, y_test_predict)
RMSE_test = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
adj_R2_test = 1 - ((1-R2_test) * (n_test-1)/(n_test-p_test-1))   #Adj R2 = 1-(1-R2)*(n-1)/(n-p-1)

print("Model test performance")
print("--------------------------------------")
print('RMSE is {}'.format(RMSE_test))
print('R2 score is {}'.format(R2_test))
print('adjusted R2 score is {}'.format(adj_R2_test))
print("\n")


print("Model parameters")
print("--------------------------------------")
print(lin_model.coef_)
print(lin_model.intercept_)
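The adjusted R² formula used in the cell above, 1 - (1 - R²) * (n - 1) / (n - p - 1), is computed twice. A small helper could avoid the repetition. The function name `adjusted_r2` and the example numbers are our own illustration, not results from the model.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator

# Illustrative numbers only: R2 of 0.60 with 100 samples and 8 features
example = adjusted_r2(0.60, n_samples=100, n_features=8)
print(example)
```

The adjusted score is always at most the plain R² score, and the gap grows when features are added without a matching gain in fit.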

Compare the predicted values with the actual output.

In [None]:
y_test_array = np.array(list(y_test))
y_test_predict_array = np.array(y_test_predict)
compare_table = pd.DataFrame({'Truth': y_test_array.flatten(), 'Predicted': y_test_predict_array.flatten()})
compare_table
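Because the price column in `tfo_listings` was log-transformed, the values in this table and the RMSE above are expressed in log units. A minimal sketch of how predictions could be converted back to the original price scale with `np.exp` is shown below. The arrays are made-up stand-ins for `y_test` and `y_test_predict`.

```python
import numpy as np

# Made-up stand-ins for the log-transformed truth and predictions
log_truth = np.log(np.array([50.0, 80.0, 120.0]))
log_pred = np.log(np.array([55.0, 70.0, 130.0]))

# np.exp undoes np.log, giving prices in the original currency
truth_price = np.exp(log_truth)
pred_price = np.exp(log_pred)

# RMSE in price units instead of log units
rmse_price = np.sqrt(np.mean((truth_price - pred_price) ** 2))
print(rmse_price)
```

An RMSE in price units is easier to interpret for a host who wants to know how far off a suggested price may be.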

# Random Forest Regression

## Step 1: Setup

In [None]:
#helper functions

def run_predictions(tree, x):
    # predict all samples of the dataframe x in one vectorized call;
    # predicting row by row fails because predict expects a 2D input
    return list(tree.predict(x))
    
def visualize_results(predictions, ground_truth):
    
    plt.scatter(ground_truth, predictions, alpha=0.5)
    plt.xlabel("Ground truth price")
    plt.ylabel("Predicted price")
    plt.show()
    
    rmse = (np.sqrt(mean_squared_error(ground_truth, predictions)))
    r2 = r2_score(ground_truth, predictions)
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))

## Step 2: Split dataset

In [None]:
#Split dataset in training and test set 
#TODO choose features more wisely

ds_copy = tfo_listings #dataset[["price", "host_response_time", "host_total_listings_count", "host_verifications", "host_identity_verified", "accommodates", "bathrooms", "bedrooms", "beds", "minimum_nights", "maximum_nights", "instant_bookable"]]
X = ds_copy.drop(['price'], axis=1)
Y = ds_copy["price"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

## Step 3: Run random forest regression

In [None]:
regr = RandomForestRegressor(max_depth=8, random_state=0)
regr.fit(X_train, Y_train)

predictions = regr.predict(X_test)

visualize_results(predictions, Y_test)

The RMSE and R2 scores show that the random forest regression does not yet deliver a strong result. The scatter plot nevertheless follows the expected y = x shape, with some noise.
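One way to work on the "choose features more wisely" TODO is to look at the feature importances of a fitted forest. The sketch below uses synthetic data, in which one feature drives the target and the other is pure noise, so the names `toy_X`, `toy_y` and `forest` and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends on "accommodates", not on "noise"
toy_X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=200),
    "noise": rng.normal(size=200),
})
toy_y = 30.0 * toy_X["accommodates"] + rng.normal(scale=5.0, size=200)

forest = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
forest.fit(toy_X, toy_y)

# feature_importances_ sums to 1; higher means the feature was more useful for the splits
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On our own data, the same two lines after `regr.fit(...)` would rank the columns of `X_train`.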

# K Nearest Neighbors Model

## Step 1 : Use one-hot encodings

In [None]:
# done in first part of this notebook
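As a reminder of what the one-hot encoding from the EDA part produces, here is a minimal sketch on a made-up dataframe. The values are for illustration only.

```python
import pandas as pd

# Made-up listings with one categorical column
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [40.0, 95.0, 55.0],
})

# One column per category, 1 where the listing belongs to that category
dummies = pd.get_dummies(toy["room_type"], prefix="rt", dtype=int)

# Replace the text column by its one-hot columns
toy_encoded = pd.concat([toy.drop(columns="room_type"), dummies], axis=1)
print(toy_encoded)
```

Distance-based models such as KNN need this step because they cannot compute a distance between text values.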

## Step 2: Normalizing and splitting the data

In [None]:
knn_dataset = tfo_listings #dataset[["price", "host_response_time", "host_total_listings_count", "host_verifications", "host_identity_verified", "accommodates", "bathrooms", "bedrooms", "beds", "minimum_nights", "maximum_nights", "instant_bookable"]]
#columns = ["host_total_listings_count", "host_verifications", "accommodates", "bathrooms", "bedrooms", "beds"]
#knn_dataset[columns] = (knn_dataset[columns] - np.min(knn_dataset[columns])) / (np.max(knn_dataset[columns]) - np.min(knn_dataset[columns])).values
#knn_dataset.describe()
#knn_dataset.dtypes

# Split data in a training and test set
y = knn_dataset.price.values
x = knn_dataset.drop(['price'], axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0)
print(len(x_train))
print(len(x_test))
print(len(y_train))
print(len(y_test))

## Step 3 : Training the model

In [None]:
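# The algorithm: despite its name, this class performs k-nearest-neighbours regression,
# because it predicts the mean price of the k nearest training listings.
class MyKNeighborsClassifier:
    def __init__(self, k):
        self.k = k  # number of neighbours to average over
    def fit(self, x, y):
        # a KD-tree makes the nearest-neighbour lookups fast
        self.tree = KDTree(x)
        self.y = y
    def predict(self, x):
        # indices of the k nearest training samples for every row of x
        _, ind = self.tree.query(x, k=self.k)
        return self.y[ind].mean(axis=1)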


knn = MyKNeighborsClassifier(3)
knn.fit(x_train, y_train)
predictions = knn.predict(x_test)

predictions
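As a sanity check, averaging the targets of the k nearest neighbours found by a `KDTree` should give the same result as scikit-learn's `KNeighborsRegressor` with its default uniform weights and Euclidean distance. The sketch below verifies this on random data. All names and values are made up for the example.

```python
import numpy as np
from sklearn.neighbors import KDTree, KNeighborsRegressor

rng = np.random.default_rng(0)

# Random training data and query points
toy_x = rng.normal(size=(50, 3))
toy_y = rng.normal(size=50)
query = rng.normal(size=(5, 3))

# Manual KNN regression: mean target of the 3 nearest training points
_, ind = KDTree(toy_x).query(query, k=3)
manual = toy_y[ind].mean(axis=1)

# Reference implementation from scikit-learn
reference = KNeighborsRegressor(n_neighbors=3).fit(toy_x, toy_y).predict(query)

print(np.allclose(manual, reference))
```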

## Step 4: Measure the performance of the model

In [None]:
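# TODO: replace this tolerance-based accuracy with regression metrics (e.g. RMSE, R2) computed from y_test and the predictions.
#accuracy = (predictions == y_test).mean()

# TODO: Remove outliers (less common listings) - where price > 170
# The price column of tfo_listings was log-transformed, so we convert back to the original scale
# before checking whether a prediction lies within 20 price units of the true price.
accuracy = (np.abs(np.exp(predictions) - np.exp(y_test)) < 20).mean()
print(accuracy)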

#TP = (predictions[y_test == 1] == 1).sum()
#print(TP)

#TN = (predictions[y_test == 0] == 0).sum()
#print(TN)

#FP = (predictions[y_test == 1] == 0).sum()
#print(FP)

#FN = (predictions[y_test == 0] == 1).sum()
#print(FN)

#accuracy = (TP+TN)/(TP+TN+FN+FP)
#print(accuracy)

#precision = TP / (TP + FP)
#print(precision)

#recall = TP / (TP + FN)
#print(recall)

#F1 = 2 *  (precision*recall)/(precision+recall)
#print(F1)

#accuracies = []
#for k in range(1, 50):
#    knn = MyKNeighborsClassifier(k)
#    knn.fit(x_train, y_train)
#    predictions = knn.predict(x_test)  > 0.5
#    accuracies.append((predictions == y_test).mean())
#plt.plot(accuracies)


#knn = MyKNeighborsClassifier(5)
#knn.fit(x_train, y_train)
#predictions = knn.predict(x_test)

#fpr, tpr, thresholds = roc_curve(y_test, predictions)
#roc_auc = auc(fpr, tpr)
        
#plt.plot(fpr, tpr, color='darkorange', label='ROC curve (AUC = %0.2f)' % roc_auc)
#plt.plot([0, 1], [0, 1], color='navy', linestyle='--')

#for x, y, txt in zip(fpr, tpr, thresholds):
#    plt.annotate(np.round(txt,2), (x, y-0.04))
    
#plt.xlabel('False Positive Rate')
#plt.ylabel('True Positive Rate')
#plt.title('Receiver operating characteristic')
#plt.legend(loc="lower right")
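The commented-out code above comes from a classification setting (accuracy, precision, recall, ROC), while our target is a continuous price. A regression counterpart of the sweep over k could look like the sketch below, which records the RMSE for every k. It uses synthetic data and scikit-learn's `KNeighborsRegressor`, so the data and the names `x_tr`, `x_te`, `y_tr`, `y_te` and `rmse_per_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: the target depends on the first feature
toy_x = rng.uniform(0, 10, size=(300, 2))
toy_y = 2.0 * toy_x[:, 0] + rng.normal(scale=1.0, size=300)

x_tr, x_te, y_tr, y_te = train_test_split(toy_x, toy_y, test_size=0.2, random_state=0)

# RMSE on the held-out data for every k from 1 to 20
rmse_per_k = {}
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    rmse_per_k[k] = float(np.sqrt(mean_squared_error(y_te, model.predict(x_te))))

best_k = min(rmse_per_k, key=rmse_per_k.get)
print(best_k, rmse_per_k[best_k])
```

In a final evaluation the value of k should be chosen on a separate validation set (or with cross-validation), so that the test set is only used once to report the score.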