# Keras Neural Network

This notebook demonstrates the entire process of building a predictive model with a Keras Sequential model to suggest the first destination of new Airbnb users. The processes involved include data wrangling, exploratory data analysis, and inferential statistics.

In [1]:
import pandas as pd
import numpy as np
from random import randint
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import classification_report
import os
import matplotlib.pyplot as plt
import time
from keras.models import Sequential
from keras.layers import Dense
from keras.layers.advanced_activations import LeakyReLU
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils

Using TensorFlow backend.


The first step is to load all the available data into a pandas DataFrame and extract basic information such as the number of samples, the number of rows with null values, the number of features, etc. Keras is used here with TensorFlow as the backend.

The next step would be to deal with the missing values using a suitable method (dropping, interpolating, etc.) and convert certain features into a more suitable form for applying inferential statistics and machine learning algorithms.
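
A minimal sketch of that first inspection pass. It uses a small made-up frame in place of the Airbnb CSV; the values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the users table; values are made up for illustration.
users = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, np.nan],
    "gender": ["FEMALE", "-unknown-", "MALE", "FEMALE"],
    "country_destination": ["US", "NDF", "NDF", "FR"],
})

n_samples, n_features = users.shape                        # rows and columns
nulls_per_column = users.isnull().sum()                    # NaN count per column
rows_with_nulls = int(users.isnull().any(axis=1).sum())    # rows with at least one NaN

print(n_samples, n_features)
print(nulls_per_column)
print(rows_with_nulls)
```

Placeholder strings such as '-unknown-' are not counted as null at this stage; they only show up in these counts after they have been converted to NaN.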

In [2]:
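def findNA(df):
    # Treat empty or whitespace-only cells as missing; r'\s+' would also match
    # any value that merely contains a space (e.g. a two-word category name)
    df = df.replace(r'^\s*$', np.nan, regex=True)
    df = df.replace('-unknown-',np.nan, regex=False)
    df = df.replace('Other/Unknown',np.nan, regex=False)
    df = df.dropna(thresh=10) # Keep only rows with at least 10 non-missing values
    return df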

This function marks the placeholder values '-unknown-' and 'Other/Unknown' (along with blank cells) as missing (NaN) and returns the cleaned data frame.
The thresh value is set to 10, which means a row is kept only if it has at least 10 non-missing values; rows with fewer are dropped.
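
The meaning of thresh is easy to misread, so here is a small sketch on a made-up three-column frame, with thresh=2 standing in for the notebook's 10. The frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 5, np.nan],
    "b": [2, 2, np.nan],
    "c": ["x", "-unknown-", "   "],
})

# Turn whitespace-only cells and the placeholder string into NaN.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
cleaned = cleaned.replace("-unknown-", np.nan, regex=False)

# thresh=2 keeps only rows that have at least 2 non-missing values.
kept = cleaned.dropna(thresh=2)
print(kept)
```

Row 0 has three non-missing values and row 1 has two, so both survive; row 2 has none and is dropped.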

In [3]:
def encodeDate(df):
    df['date_account_created']=pd.to_datetime(df['date_account_created']).dt.dayofweek
    df['date_first_booking']=pd.to_datetime(df['date_first_booking']).dt.dayofweek
    return df

The encodeDate function replaces the date_account_created and date_first_booking columns with their day of the week (Monday=0, Sunday=6).
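
As a quick illustration of the encoding, the sketch below converts three made-up date strings (one of them missing) the same way; the dates are not from the dataset.

```python
import pandas as pd

# A Monday, a Sunday, and a missing date.
dates = pd.Series(["2014-01-06", "2014-01-12", None])
day_of_week = pd.to_datetime(dates).dt.dayofweek
print(day_of_week)
```

A missing date stays missing (NaN) after the conversion, so date columns with gaps still need imputation afterwards.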

In [4]:
def weightedRandomImputation(df):
    for col in df:
        nan_count=df[col].isnull().sum()
        if col=='age':
            df=handleOutlierAge(df)
            
        # For parameters other than age, impute missing values by sampling from the observed values in proportion to their frequency
        if nan_count>0 and col!='age': 
            df_counts=df[col].value_counts()
            # Number of non-missing entries (value_counts() skips NaN)
            Total_minus_unknown = len(df[col]) - nan_count
            ratio_list=[]
            for i in range(len(df_counts)):
                # .iloc keeps the lookup positional even when the index holds numbers
                ratio_list.append(float(df_counts.iloc[i])*100/float(Total_minus_unknown))
            min_ratio = min(ratio_list)
            ratio_list = [int(x/min_ratio) for x in ratio_list]
            counts_list=df_counts.index.tolist()
            pairs = list(zip(ratio_list,counts_list))
            df[col]=df[col].apply(lambda x: weightedRandomHelper(pairs) if(pd.isnull(x)) else x)

        # Creating bins for signup_flow parameter
        if col=='signup_flow': 
            bins = [-1,5,10,15,20,28]
            group_names = [0,1,2,3,4]
            df['signup_flow_bins'] = pd.cut(df['signup_flow'], bins, labels=group_names)

    return df


The weightedRandomImputation() function takes a data frame as an argument and cleans the age column through handleOutlierAge().
For columns other than age, if the count of missing values is greater than zero, each missing value is replaced with an observed value drawn at random, weighted by how often that value occurs in the column.

It also creates bins for the signup_flow column, stored in signup_flow_bins.
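
The same idea can be sketched in vectorised form. This is an alternative illustration, not the notebook's code: it uses NumPy's Generator.choice instead of the integer-weight helper, and the function name impute_by_frequency, the seed argument, and the sample values are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_by_frequency(series, seed=0):
    """Fill NaNs with observed values sampled in proportion to their frequency."""
    rng = np.random.default_rng(seed)
    freqs = series.value_counts(normalize=True)   # observed values and their shares
    missing = series.isnull()
    filled = series.copy()
    filled[missing] = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    return filled

# Made-up column with two missing entries.
gender = pd.Series(["FEMALE", "MALE", np.nan, "FEMALE", np.nan, "FEMALE"])
filled = impute_by_frequency(gender)
print(filled)
```

Sampling this way roughly preserves the observed distribution of each column, whereas filling every gap with the most frequent value would inflate that value's share.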

In [5]:
def weightedRandomHelper(pairs):  
    total = sum(pair[0] for pair in pairs)
    r = randint(1, total)
    for (weight, value) in pairs:
        r -= weight
        if r <= 0: return value
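
The helper draws an integer r between 1 and the sum of the weights, then walks through the pairs subtracting each weight until r reaches zero or below; a value with weight 3 is therefore returned three times as often as a value with weight 1. The sketch below repeats the helper unchanged and tallies a large number of draws. The pairs and the fixed seed are assumptions for illustration.

```python
from collections import Counter
from random import randint, seed

def weightedRandomHelper(pairs):
    total = sum(pair[0] for pair in pairs)
    r = randint(1, total)
    for (weight, value) in pairs:
        r -= weight
        if r <= 0:
            return value

seed(0)  # fixed seed only so the illustration is repeatable
pairs = [(3, "FEMALE"), (1, "MALE")]  # hypothetical integer weights
draws = Counter(weightedRandomHelper(pairs) for _ in range(10000))
print(draws)
```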

In [6]:
def handleOutlierAge(df):
    # Values above 1900 are treated as birth years and converted to ages
    df['age']=df['age'].apply(lambda x: datetime.now().year-x if x>1900 else x)
    
    # Ages outside the 14 to 90 range are treated as invalid and set to NaN
    df['age']=df['age'].apply(lambda x: x if 14<=x<=90 else np.nan)     
    mean = df['age'].mean()
    mean = int(mean)
    df['age']=df['age'].apply(lambda x: mean if np.isnan(x) else x) 
    return df

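This function cleans the age column: values above 1900 are treated as birth years, and ages outside the valid range of 14 to 90 (including missing ones) are replaced with the truncated mean of the valid ages.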
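
The three steps can be traced on a handful of made-up ages. This sketch uses a fixed REFERENCE_YEAR, which is an assumption made here so the result is reproducible; the notebook itself uses the current year.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2015  # assumption: fixed year instead of datetime.now().year

ages = pd.Series([25.0, 1985.0, 5.0, 105.0, np.nan, 40.0])

# Values above 1900 are read as birth years and converted to ages.
ages = ages.where(~(ages > 1900), REFERENCE_YEAR - ages)
# Anything outside 14..90 (and anything missing) becomes NaN.
ages = ages.where(ages.between(14, 90))
# NaNs are filled with the truncated mean of the valid ages.
ages = ages.fillna(int(ages.mean()))
print(ages.tolist())
```

Here the valid ages are 25, 30, and 40, so the three invalid or missing entries are all filled with 31.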

In [7]:
df = pd.read_csv('train_users_2.csv')   #load data

print("Doing Preprocessing")
print("Handling Missing Values")
df = findNA(df)
original_data  = df.copy()
original_data=encodeDate(original_data)   #convert date to the day of the week with Monday=0, Sunday=6
original_data=weightedRandomImputation(original_data) # Missing Value Imputation

df,df_test = train_test_split( df, test_size=0.3, stratify=df['country_destination'])

df=encodeDate(df)   #convert date to the day of the week with Monday=0, Sunday=6
df=weightedRandomImputation(df) # Missing Value Imputation

#preprocess of test
df_test = encodeDate(df_test)
df_test = weightedRandomImputation(df_test)

Doing Preprocessing
Handling Missing Values


In [8]:
def ANN(df,df_test):

    print("\nLearning the Keras Neural Network Classifier Model...")
    Y_train = df.country_destination
    X_train = df.drop(['country_destination', 'id'], axis=1)

    #preprocess of test
    Y_test = df_test.country_destination
    X_test = df_test.drop(['country_destination', 'id'], axis=1)

    # encode class values as integers; the encoder is fitted on the training
    # labels and reused for the test labels so both share one class mapping
    le = LabelEncoder()
    Y_train = le.fit_transform(Y_train)
    Y_test = le.transform(Y_test)

    #dropping columns as they dont improve accuracy
    dropped_cols = ['timestamp_first_active', 'language', 'signup_app']
    X_train = X_train.drop(dropped_cols, axis=1)
    X_test = X_test.drop(dropped_cols, axis=1)

    print(Y_train)
    print(Y_test)
    # convert integers to dummy variables (i.e. one hot encoded)
    n_classes = len(le.classes_)
    Y_train = pd.DataFrame(np_utils.to_categorical(Y_train, num_classes=n_classes))
    Y_test = pd.DataFrame(np_utils.to_categorical(Y_test, num_classes=n_classes))
    
    # Stack train and test so each categorical level maps to the same one-hot column
    train = pd.concat([X_train, X_test])
    encoded_frames = []

    for col in list(train.columns):
        if col=='age': 
            bins = [13,20,30,40,50,60,70,80,91]
            group_names = [0,1,2,3,4,5,6,7]
            train['age_bins'] = pd.cut(train['age'], bins, labels=group_names)
            train=train.drop('age', axis=1)
            col = 'age_bins'
        encoder = LabelEncoder()
        encoded_Col = encoder.fit_transform(train[col])
        encoded_frames.append(pd.DataFrame(np_utils.to_categorical(encoded_Col)))
    # Rows 0..len(X_train)-1 are the training rows; the remaining rows are the test rows
    df_encoded = pd.concat(encoded_frames, axis=1)
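    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=len(df_encoded.columns), activation='relu'))
    model.add(Dense(12, activation='relu'))
    # Output layer: 12 units, one per destination class. ReLU is kept here as in
    # the recorded run; softmax is the usual choice with categorical_crossentropy.
    model.add(Dense(12, activation='relu'))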
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    #model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
    history = model.fit(df_encoded.values[:len(X_train)], Y_train.values, epochs=15, batch_size=1000)
    scores = model.evaluate(df_encoded.values[:len(X_train)], Y_train.values)
    print("\nTraining Score: %.2f" % (scores[1]*100))
    scores = model.evaluate(df_encoded.values[len(X_train):], Y_test.values)
    print("\nTesting Score: %.2f" % (scores[1]*100))

    Y_pred = model.predict(df_encoded.values[len(X_train):])
    print("The confusion matrix is : \n",confusion_matrix(Y_test.values.argmax(axis=1), Y_pred.argmax(axis=1)))
    print("Mean Absolute error is :",mean_absolute_error(Y_test.values.argmax(axis=1), Y_pred.argmax(axis=1)))
    print("Evaluation Metrics : \n",classification_report(Y_test.values.argmax(axis=1), Y_pred.argmax(axis=1)))

Here a Keras Sequential model is used. It has three Dense layers of 12 units each, and every layer, including the output layer, uses the ReLU activation function.
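
Multi-class classifiers trained with categorical cross-entropy more commonly end in a softmax layer, which turns raw scores into class probabilities. The NumPy sketch below illustrates the difference on made-up scores for three classes. It is not part of the notebook's model, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    clipped = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(clipped), axis=1)))

# Made-up pre-activation outputs for two samples and three classes.
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, -2.0, -4.0]])
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])

probs = softmax(logits)             # each row is a probability distribution
relu_out = np.maximum(logits, 0.0)  # the second row collapses to all zeros

print(probs)
print(relu_out)
print(categorical_crossentropy(y_true, probs))
```

With ReLU, a sample whose scores are all negative produces an all-zero output that carries no ranking between classes. In Keras the corresponding change would be `activation='softmax'` on the final Dense layer, though how much it would change this notebook's scores is untested here.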

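Using the categorical cross-entropy loss and the Adam optimizer, the model reaches an accuracy of 58.38% on the training set and 58.39% on the test set. The confusion matrix in the output below shows that almost every sample is predicted as class 7, so this accuracy is close to the share of that single majority class.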
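
An accuracy figure is easier to judge next to a trivial baseline: the accuracy obtained by always predicting the most frequent class. A minimal sketch, using made-up integer labels in place of the encoded test destinations; majority_baseline_accuracy is a hypothetical helper name.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class in y_true."""
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / counts.sum()

# Made-up labels; class 7 appears in 6 of the 10 entries.
y_test = np.array([7, 7, 7, 10, 7, 4, 7, 10, 7, 1])
baseline = majority_baseline_accuracy(y_test)
print(baseline)
```

A model is only adding information when its test accuracy clearly exceeds this baseline.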

In [9]:
ANN(df,df_test)


Learning the Keras Neural Network Classifier Model...
[10  7 11 ...  7 11 11]
[10  7 10 ... 11  1  7]
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15

Training Score: 58.38

Testing Score: 58.39
The confusion matrix is : 
 [[    0     0     0     0     0     0     0   162     0     0     0     0]
 [    0     0     0     0     0     0     0   427     0     0     1     0]
 [    0     0     0     0     0     0     0   318     0     0     0     0]
 [    0     0     0     0     0     0     0   673     0     0     2     0]
 [    0     0     0     0     0     0     0  1505     0     0     2     0]
 [    0     0     0     0     0     0     0   695     0     0     2     0]
 [    0     0     0     0     0     0     0   851     0     0     0     0]
 [    0     0     0     0     0     0     0 37349     0     0    14     0]
 [    0     0     0     0     0     0     0   228



References

Repositories

https://github.com/karvenka/kaggle-airbnb/blob/master/notebooks/Venkatesan_Karthick_Final_Project_Report.ipynb

https://github.com/Sapphirine/Airbnb-New-User-Bookings-Prediction/blob/master/preprocessing%26prediction.ipynb

https://github.com/Currie32/AirBnB-Predicting-Destination/blob/master/Predicting_Destination.ipynb

Kaggle Competition

https://www.kaggle.com/meicher/predicting-first-destination-4-models

https://www.kaggle.com/svpons/three-level-classification-architecture

The code in the document by Kandarp Vyas is licensed under the MIT License https://opensource.org/licenses/MIT