# Predicting Destinations with the Airbnb Dataset using KNN Classifier

This notebook demonstrates the entire process of building a predictive model using KNN classifier to suggest the first destination of new Airbnb Users. All the processes involved, such as data wrangling, exploratory data analysis, inferential statistics.

## Data Wrangling
In the first section of the notebook, I will attempt at cleaning the Airbnb Kaggle Competition Data and wrangling into a form that is suitable for further analysis. The entire data wrangling process will be done using the Python Pandas library.

I have used functions for data cleaining.

In [2]:
import pandas as pd
import numpy as np
from random import randint
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

The first step is to load all the data available to us in a Pandas Dataframe and extract basic information such as number of samples, number of null values rows, number of features, etc. Here I have used KNN classifier.

The next step would be to deal with the missing values using a suitable method (dropping, interpolating, etc.) and convert certain features into a more suitable form for applying inferential statistics and machine learning algorithms.

In [3]:
def findNA(df):
    df = df.replace(r'\s+', np.nan, regex=True)
    df = df.replace('-unknown-',np.nan, regex=False)
    df = df.replace('Other/Unknown',np.nan, regex=False)
    df = df.dropna(thresh=10)
    return df

This function will remove '-unknown-' and 'Other/Unknown' values from the CSV file and return cleaned data frame.
Here I have set thresh value to 10. It means that in a single row at least 10 N/A values are allowed.

In [4]:
def encodeDate(df):
    df['date_account_created']=pd.to_datetime(df['date_account_created']).dt.dayofweek
    df['date_first_booking']=pd.to_datetime(df['date_first_booking']).dt.dayofweek
    return df

encodeDate function will extract day of week from date_account_created and date_first_booking columns.

In [5]:
def weightedRandomImputation(df):
    for col in df:
        nan_count=df[col].isnull().sum()
        if col=='age':
            df=handleOutlierAge(df)
            
        # For parameters other then age, compute their missing value using stratified methodology of missing value imputation    
        if nan_count>0 and col!='age': 
            df_counts=df[col].value_counts()
            Total_minus_unknown = 0
            Total_minus_unknown = len(df[col]) - len(df_counts)
            ratio_list=[]
            for i in range(len(df_counts)):
                ratio_list.append(float(df_counts[i])*100/float(Total_minus_unknown))
            min_ratio = min(ratio_list)
            ratio_list = [int(x/min_ratio) for x in ratio_list]
            counts_list=df_counts.index.tolist()
            pairs = list(zip(ratio_list,counts_list))
            df[col]=df[col].apply(lambda x: weightedRandomHelper(pairs) if(pd.isnull(x)) else x)

        # Creating bins for signup_flow parameter
        if col=='signup_flow': 
            bins = [-1,5,10,15,20,28]
            group_names = [0,1,2,3,4]
            df['signup_flow_bins'] = pd.cut(df['signup_flow'], bins, labels=group_names)

    return df


The function weightedRandomImputation() takes data frame as an argument and removes outliers from age column.
For the columns other than age , if N/A value count is greater than zero then it will identify those values and replace it with the mean values.

Here I have created bins for signup_flow column.

In [6]:
def weightedRandomHelper(pairs):  
    total = sum(pair[0] for pair in pairs)
    r = randint(1, total)
    for (weight, value) in pairs:
        r -= weight
        if r <= 0: return value

In [7]:
def handleOutlierAge(df):
    df['age']=df['age'].apply(lambda x: datetime.now().year-x if x>1900 else x)
    
    #Valid age range between 14 to 90 as per data, otherwise check if its outlier or not
    df['age']=df['age'].apply(lambda x: x if 14<=x<=90 else np.nan)     
    mean = df['age'].mean()
    mean = int(mean)
    df['age']=df['age'].apply(lambda x: mean if np.isnan(x) else x) 
    return df

This function will remove outliers from age column.Here I have taken valid age range is between 14 to 90

In [9]:
def KNNClassifier(df,df_test):
    print("\nLearning the KNN Classifier Model...")
    Y_train = df.country_destination
    X_train = df.drop('country_destination', 1)
    X_train = X_train.drop('id', 1)

    #preprocess of test
    Y_test = df_test.country_destination
    X_test = df_test.drop('country_destination', 1)
    X_test = X_test.drop('id', 1)

    # encode Y train
    le = LabelEncoder()
    Y_train = le.fit_transform(Y_train)

    X_train = X_train.apply(LabelEncoder().fit_transform)
    X_test= X_test.apply(LabelEncoder().fit_transform)

    # Encode Y Test 
    le_t = LabelEncoder()
    Y_test = le_t.fit_transform(Y_test)

    #dropping columns as they dont improve accuracy
    X_train = X_train.drop('timestamp_first_active', 1)
    X_train = X_train.drop('language', 1)
    X_train = X_train.drop('signup_app', 1)
    X_test = X_test.drop('timestamp_first_active', 1)
    X_test = X_test.drop('language', 1)
    X_test = X_test.drop('signup_app', 1)

    n_neighbors = 300
    #for weights in ['uniform', 'distance']:
    for weights in ['distance']:
    # we create an instance of Neighbours Classifier and fit the data.
        #clf = KNeighborsClassifier(n_neighbors, weights=weights)
        clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights,algorithm='ball_tree')
        clf.fit(X_train, Y_train)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
        Y_pred = clf.predict(X_test)
        prediction_knn = [round(value) for value in Y_pred]
        accuracy = accuracy_score(Y_test, prediction_knn)
        print("Accuracy with KNN is : %.2f%%" % (accuracy * 100.0))
        
    print("The confusion matrix is : \n",confusion_matrix(Y_test, Y_pred ))
    print("Mean Absolute error is :",mean_absolute_error(Y_test, Y_pred ))
    print("Evaluation Metrics :\n",classification_report(Y_test, Y_pred ))

In [10]:
df = pd.read_csv('train_users_2.csv')   #load data

print("Doing Preprocessing")
print("Handling Missing Values")
df = findNA(df)
original_data  = df.copy()
original_data=encodeDate(original_data)   #convert date to the day of the week with Monday=0, Sunday=6
original_data=weightedRandomImputation(original_data) # Missing Value Imputation

df,df_test = train_test_split( df, test_size=0.3, stratify=df['country_destination'])

df=encodeDate(df)   #convert date to the day of the week with Monday=0, Sunday=6
df=weightedRandomImputation(df) # Missing Value Imputation

#preprocess of test
df_test = encodeDate(df_test)
df_test = weightedRandomImputation(df_test)

Doing Preprocessing
Handling Missing Values


In [11]:
KNNClassifier(df,df_test)


Learning the KNN Classifier Model...
Accuracy with KNN is : 60.80%
The confusion matrix is : 
 [[    0     0     0     0     0     0     0   120     0     0    42     0]
 [    0     0     0     0     0     0     0   322     0     0   106     0]
 [    0     0     0     0     0     0     0   225     0     0    93     0]
 [    0     0     0     0     0     0     0   524     0     0   151     0]
 [    0     0     0     0     0     0     0  1193     0     0   314     0]
 [    0     0     0     0     0     0     0   538     0     0   159     0]
 [    0     0     0     0     0     0     0   670     0     0   181     0]
 [    0     2     0     0     1     0     1 33920     0     0  3439     0]
 [    0     0     0     0     0     0     0   174     0     0    55     0]
 [    0     0     0     0     0     0     0    50     0     0    15     0]
 [    0     0     0     0     0     0     0 13698     0     0  5015     0]
 [    0     0     0     0     0     0     0  2312     0     0   716     0]]
Mea

  'precision', 'predicted', average, warn_for)


By using KNN Classifier , model has achieved 60.80% accuracy.

References

Repositories

https://github.com/karvenka/kaggle-airbnb/blob/master/notebooks/Venkatesan_Karthick_Final_Project_Report.ipynb

https://github.com/Sapphirine/Airbnb-New-User-Bookings-Prediction/blob/master/preprocessing%26prediction.ipynb

https://github.com/Currie32/AirBnB-Predicting-Destination/blob/master/Predicting_Destination.ipynb

Kaggle Competition

https://www.kaggle.com/meicher/predicting-first-destination-4-models

https://www.kaggle.com/svpons/three-level-classification-architecture

The code in the document by Kandarp Vyas is licensed under the MIT License https://opensource.org/licenses/MIT