<a href="https://colab.research.google.com/github/isaacchunn/SC1015_MiniPrj_Airbnb/blob/main/Airbnb_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset : Airbnb Singapore Dataset from InsideAirbnb
#### Question : If we were an Airbnb host, how could we maximise our profit?


Dataset from Airbnb : **"Singapore, 29 December 2022"**  
Source: http://insideairbnb.com/get-the-data/


# Contents
  1. KMeans
  2. Random Forest

---

### Essential Libraries

Import essential libraries such as numpy, pandas, matplotlib and seaborn.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Isaac Chun Jun Heng U2221389B
# J'sen Ong Jia Xuan  U2220457J
# Tang Teck Meng U2221809C

In [None]:
#Basic libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt #We only need pyplot
sb.set() #Set the default Seaborn style for graphics

### Additional Libraries

Import additional libraries

> sklearn : Machine-learning library, used here for train/test splitting, linear regression and error metrics

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### General Utility Functions

In [None]:
def countOutliers(df):
    #Count the points that lie outside the box-plot whiskers (1.5 * IQR beyond q1 and q3).
    #Note: df is expected to be a single numeric column (a pandas Series).
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    #Interquartile range
    iqr = q3 - q1
    #Calculate whiskers
    leftWhisker = q1 - (1.5 * iqr)
    rightWhisker = q3 + (1.5 * iqr)
    #Vectorised count of the values outside the whiskers
    return int(((df < leftWhisker) | (df > rightWhisker)).sum())

In [None]:
def removeOutliers(df, colName):
  q1 = df[colName].quantile(0.25)
  q3 = df[colName].quantile(0.75)
  iqr = q3-q1
  low = q1 - 1.5 * iqr
  high = q3 + 1.5 * iqr
  result = df.loc[(df[colName] >= low) & (df[colName] <= high)]
  return result

In [None]:
def remove_outliers(df, columns, factor=1.5):
    # loop through each column and remove outliers based on the IQR method
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        upper_bound = q3 + factor * iqr
        lower_bound = q1 - factor * iqr
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
    return df
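
As a quick illustration of the IQR rule used by the helper functions above, here is a minimal sketch on a made-up price column. The values are hypothetical and are not taken from the Airbnb dataset.

```python
import pandas as pd

# Hypothetical nightly prices; 900 is far above the rest.
prices = pd.Series([100, 120, 130, 150, 160, 180, 900])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Whiskers sit 1.5 * IQR beyond the first and third quartiles.
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

# Anything outside the whiskers is treated as an outlier.
is_outlier = (prices < low) | (prices > high)
kept = prices[~is_outlier]

print("Whiskers:", low, high)
print("Outliers found:", int(is_outlier.sum()))
print("Values kept:", kept.tolist())
```

The same bounds are what `removeOutliers` and `remove_outliers` use to filter rows.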

### Mount Google Drive (unused; uncomment if you need to load anything from Google Drive)

In [None]:
# from google.colab import drive 
# drive.mount('/content/gdrive')

---

>## Hypothesis  

1. The number of amenities a listing provides affects its price: the more amenities, the higher the listing price.
2. Variables related to a listing's reviews are positively correlated with the listing's price.

---

>## Import the Dataset  

We import the dataset that was cleaned during the EDA done in our other files.

In [None]:
url = "https://raw.githubusercontent.com/isaacchunn/SC1015_MiniPrj_Airbnb/main/listings_cleaned.csv"
airDF = pd.read_csv(url)
airDF.head()

In [None]:
airDF.info()

In [None]:
print(airDF.dtypes)

---

>## Cleaning our DataFrame/Dataset

### 1. Drop properties with an N/A or 0% acceptance rate, as these properties are not stayed in by visitors.

In [None]:
#Drop all the properties that have no host acceptance rate
airDF = airDF.dropna(subset=["host_acceptance_rate"])
#Then remove all the properties with a 0% acceptance rate
airDF = airDF[airDF["host_acceptance_rate"] != 0]
#Reset our indexes
airDF = airDF.reset_index(drop=True)
airDF.head(n=5)

### 2. Clean the price column, as its values contain "$", "," and "."


In [None]:
airDF["price"]

In [None]:
#Strip the "$" and "," characters, then drop the cents by truncating to an integer.
#This avoids chained assignment, which pandas may silently apply to a copy.
airDF["price"] = (
    airDF["price"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
    .astype("int32")
)
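
To make the price clean-up concrete, the sketch below applies a vectorised version of the same idea to a few made-up price strings. The sample values are hypothetical, and the regular expression is one possible way to strip the symbols rather than the only one.

```python
import pandas as pd

# Hypothetical raw prices in the "$1,250.00" style.
raw = pd.Series(["$1,250.00", "$89.00", "$45,000.00"])

cleaned = (
    raw.str.replace(r"[$,]", "", regex=True)  # drop "$" and thousands separators
       .astype(float)                         # "1250.00" -> 1250.0
       .astype("int32")                       # truncate the cents
)

print(cleaned.tolist())
print(cleaned.dtype)
```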

We also remove extreme outliers, as it is unrealistic for a property to cost $45,000 or more per night.

In [None]:
airDF = airDF[airDF.price < 45000]
#Reset our indexes
airDF = airDF.reset_index(drop=True)
airDF["price"]

### 3. Convert the amenities column to a list, and add a new column with the number of amenities to be used for our prediction.

In [None]:
airDF["amenities"]

In [None]:
#Convert each amenities string into a list of amenity names
def parseAmenities(s):
    for ch in ['[', ']', '"']:
        s = s.replace(ch, "")
    return s.replace(", ", ",").split(",")

#Assign whole columns at once instead of using chained assignment in a loop
airDF["amenities"] = airDF["amenities"].apply(parseAmenities)
#Add a new column holding the number of amenities of each listing
airDF["no_amenities"] = airDF["amenities"].apply(len)
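
The string-to-list conversion can be checked on a single made-up amenities string. The sketch below repeats the replace-and-split steps and compares them with `json.loads`, which is an alternative that only applies under the assumption that the column holds valid JSON lists.

```python
import json

# Hypothetical amenities string, in the same style as the dataset column.
sample = '["Wifi", "Kitchen", "Air conditioning"]'

# Replace-and-split approach: strip brackets and quotes, then split on commas.
stripped = sample.replace("[", "").replace("]", "").replace('"', "")
by_split = stripped.replace(", ", ",").split(",")

# Alternative (assumption: the string is valid JSON).
by_json = json.loads(sample)

print(by_split)
print(len(by_split))
print(by_split == by_json)
```

One difference worth knowing: splitting on commas would also split an amenity name that itself contains a comma, whereas JSON parsing would keep it whole.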

In [None]:
airDF["amenities"].head(n=5)

In [None]:
airDF["no_amenities"].head(n=5)

In [None]:
#Count how often each amenity appears, so we can decide which amenities to keep and keep our amenity count reliable.
amenityCount = {}
for x in airDF["amenities"]:
    for item in x:
        if item in amenityCount:
            amenityCount[item] += 1
        else:
            amenityCount[item] = 1
        
#Add it to a DF
amenityCountDF = pd.DataFrame(columns = ["amenity", "count"])
count = 0
for keys, values in amenityCount.items():
    amenityCountDF.loc[count] = [keys, values]
    count += 1

#Sort the DF
amenityCountDF = amenityCountDF.sort_values(by="count", ascending = False)
amenityCountDF.head(n=15)

In [None]:
amenityCountDF.tail(n=10)

We decided to count only the amenities that appear in many listings, so that the number of amenities stays consistent and is not inflated by rare entries that matter little. For example, we do not know what Fire TV is.

In [None]:
#Adjustable cutoff, chosen by us, to check the robustness of our model
amenityCutOff = 30

In [None]:
uselessAmenityList = amenityCountDF[amenityCountDF["count"] <= amenityCutOff]["amenity"].values.tolist()

In [None]:
#Remove every amenity that appears in our useless amenity list, then recount
uselessAmenitySet = set(uselessAmenityList) #Set lookups are faster than list lookups
airDF["amenities"] = airDF["amenities"].apply(lambda x: [i for i in x if i not in uselessAmenitySet])
airDF["no_amenities"] = airDF["amenities"].apply(len)
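
The counting and cutoff steps can be sketched on a tiny made-up table. The listings, the cutoff value and the use of `collections.Counter` are all illustrative assumptions; the notebook itself builds the counts with a dictionary.

```python
from collections import Counter

import pandas as pd

# Hypothetical listings, each with an already-parsed list of amenities.
listings = pd.DataFrame({
    "amenities": [
        ["Wifi", "Kitchen"],
        ["Wifi", "Fire TV"],
        ["Wifi", "Kitchen", "Pool"],
    ]
})

cutoff = 1  # hypothetical cutoff: drop amenities seen in at most 1 listing

# Count how many times each amenity appears across all listings.
counts = Counter(item for row in listings["amenities"] for item in row)
rare = {name for name, n in counts.items() if n <= cutoff}

# Keep only the common amenities, then recount per listing.
listings["amenities"] = listings["amenities"].apply(
    lambda row: [a for a in row if a not in rare]
)
listings["no_amenities"] = listings["amenities"].apply(len)

print(counts.most_common())
print(listings)
```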

### 4. Fill in the N/A values in host_response_time, as we use this column to gather insights

In [None]:
print("Null values:", airDF["host_response_time"].isnull().sum())

In [None]:
airDF["host_response_time"].value_counts()

In [None]:
#Treat hosts with no recorded response time as the worst case and fill with the slowest category
airDF = airDF.fillna(value = {"host_response_time": "a few days or more"})
None

In [None]:
print("Null values:", airDF["host_response_time"].isnull().sum())

In [None]:
airDF["host_response_time"].value_counts()

---

>## Splitting the Dataset

In [None]:
#Split the dataset into train and test in 80:20 ratio
train_data, test_data = train_test_split(airDF, test_size = 0.2, random_state = 55)

#Print out what we have in our test and train data
print("Train Data :")
print("Data type : ", type(train_data))
print("Data dim : ", train_data.shape)
print("---------------------------------------")
print("Test Data :")
print("Data type : ", type(test_data))
print("Data dim : ", test_data.shape)
print("---------------------------------------")

---

>## 1. Multi-variate K Means

In [None]:
priceDF = airDF["price"]
priceDF.head(n=5)

In [None]:
#Input the numerical values we had identified beforehand
kmeansDF = airDF[["accommodates","no_amenities","number_of_reviews", "price","review_scores_rating"]].copy()
# filling in null values with median
kmeansDF.fillna(kmeansDF.median(), inplace = True)

#Plot its data on 2d grids
sb.pairplot(kmeansDF)

In [None]:
# Import kmeans model from sklearn
from sklearn.cluster import KMeans

#Vary the number of clusters
minClusterRange = 1
maxClusterRange = 20

#We want to use the elbow method, so we compute the SSE for each "k" and store it in our sse list
#init = "k-means++" spreads out the initial centroids, which usually speeds up convergence
sse = [] 
for k in range(minClusterRange, maxClusterRange + 1):
  kmeans = KMeans(n_clusters = k, init = "k-means++", n_init= 100)
  kmeans.fit(kmeansDF)
  sse.append(kmeans.inertia_)

#Plot the SSE curve to find our elbow point
f = plt.figure(figsize=(16,4))
plt.plot(range(minClusterRange, maxClusterRange+1), sse)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.xticks(np.arange(minClusterRange, maxClusterRange+1, step = 1))
plt.show()
None
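
To show what the SSE curve measures, here is a minimal sketch on made-up 2-D points. It recomputes the SSE by hand and compares it with `inertia_`. The toy data, `n_init = 10` and `random_state = 0` are assumptions chosen for a small, repeatable example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

sse = []
manual_sse = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    # SSE: squared distance from every point to the centroid of its own cluster.
    manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
    manual_sse.append(manual)
    sse.append(km.inertia_)

for k, value in zip(range(1, 5), sse):
    print(k, round(value, 3))
```

On data like this, the SSE drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the "elbow" the plot is meant to reveal.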

The elbow of the SSE curve appears at k = 2 or k = 3. To choose between them, we also compute the silhouette coefficient for each k.

In [None]:
from sklearn.metrics.cluster import silhouette_score
silhouette_coefficients = []
minClusterRange = 2 #Start at 2 for silhouette coefficient
maxClusterRange = 20

for k in range(minClusterRange, maxClusterRange +1):
  kmeans = KMeans(n_clusters = k, init= "k-means++", n_init = 100)
  kmeans.fit(kmeansDF)
  score = silhouette_score(kmeansDF, kmeans.labels_)
  silhouette_coefficients.append(score)

In [None]:
#Plot out what we have found based on our silhouette coefficients
plt.plot(range(minClusterRange, maxClusterRange + 1), silhouette_coefficients)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.xticks(np.arange(minClusterRange, maxClusterRange+1, step = 1))
plt.show()
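
As a small sanity check of the silhouette approach, the sketch below scores a few values of k on made-up points that form two obvious groups. The toy data, `n_init = 10` and `random_state = 0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X)
    # Scores near 1 mean points sit close to their own cluster and far from others.
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest silhouette coefficient.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

The silhouette coefficient ranges from -1 to 1, and it needs at least two clusters, which is why the loop starts at k = 2.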

k = 2 has the highest silhouette coefficient, so we use it for now.

In [None]:
k = 2

#Use our kmeans with our newly found k
kmeans = KMeans(n_clusters = k,         
               init = "k-means++",
               n_init = 100)                 

#Fit the kmeans onto our DF
kmeans.fit(kmeansDF)
#Then call predict 
kmeansPrediction = kmeans.predict(kmeansDF)
None

In [None]:
kmeans_labeled = kmeansDF.copy()
kmeans_labeled["Cluster"] = pd.Categorical(kmeansPrediction)

# Catplot the counts in our clusters
sb.catplot(y = "Cluster", data = kmeans_labeled, kind = "count")

In [None]:
#Plot all our clusters on 2d grids using cluster column
sb.pairplot(kmeans_labeled, vars = kmeansDF.columns.values, hue = "Cluster")

In [None]:
# Boxplots for all Features against the Clusters
f, axes = plt.subplots(5, 1, figsize=(20,35))
sb.boxplot(x = 'accommodates', y = 'Cluster', data = kmeans_labeled, ax = axes[0])
sb.boxplot(x = 'no_amenities', y = 'Cluster', data = kmeans_labeled, ax = axes[1])
sb.boxplot(x = 'number_of_reviews', y = 'Cluster', data = kmeans_labeled, ax = axes[2])
sb.boxplot(x = 'price', y = 'Cluster', data = kmeans_labeled, ax = axes[3])
sb.boxplot(x = 'review_scores_rating', y = 'Cluster', data = kmeans_labeled, ax = axes[4])

In [None]:
# Average Behaviour of each Cluster
cluster_data = pd.DataFrame(kmeans_labeled.groupby(by = "Cluster").mean())
cluster_data.plot.bar(figsize = (16,6))
     

In [None]:
# Create a data frame containing our centroids
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=kmeansDF.columns)
centroids['Cluster'] = centroids.index

f, axes = plt.subplots(1, 1, figsize=(16,10))
pd.plotting.parallel_coordinates(centroids, 'Cluster', color=('#556270', '#4ECDC4', '#C7F464'))