<h1>Using Renthop data to predict whether a rental listing will draw high, medium, or low interest</h1>

In [1]:
#begin with standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#import modeling algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier

# Modelling Helpers
from sklearn.preprocessing import Imputer , Normalizer , scale
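# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; model_selection is its replacement
from sklearn.model_selection import train_test_split , StratifiedKFold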
from sklearn.feature_selection import RFECV
from sklearn.metrics import log_loss



In [17]:
train_df = pd.read_json('renthop/train.json/train.json')
test_df = pd.read_json('renthop/test.json/test.json')
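# stack train and test so the same feature engineering applies to both
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
full_df = pd.concat([train_df, test_df], ignore_index = True)
# the first len(train_df) rows are the training rows
df = full_df[:len(train_df)]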
print('The test dataset:', test_df.shape, 'The train dataset', train_df.shape, 'The whole dataset: ', full_df.shape)

('The test dataset:', (74659, 14), 'The train dataset', (49352, 15), 'The whole dataset: ', (124011, 15))


<h4>The test dataset has one fewer column because only the train dataset includes the target, interest_level</h4>

In [18]:
df.shape

(49352, 15)

<h3>Convert the 'created' column to datetime for easier analysis</h3>

In [24]:
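# convert from full_df, not df: df holds only the train rows, so the test rows would become NaT
full_df['created'] = pd.to_datetime(full_df['created'])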
full_df['year']=full_df['created'].dt.year
full_df['month'] = full_df['created'].dt.month
full_df['day'] = full_df['created'].dt.day
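One thing to keep in mind with this kind of column assignment is that pandas aligns on the index, so a Series built from only the first rows of a frame leaves the remaining rows as NaT. The sketch below shows this on a made-up three-row frame; the names `toy_full`, `toy_head`, `partial` and `complete` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up timestamps in the same string format as the 'created' column
toy_full = pd.DataFrame({'created': ['2016-06-24 07:54:24',
                                     '2016-04-17 03:26:41',
                                     '2016-05-02 11:00:00']})
toy_head = toy_full[:2]  # stands in for a train-only slice

# Assigning from the slice: row 2 has no matching index, so it becomes NaT
partial = toy_full.copy()
partial['created'] = pd.to_datetime(toy_head['created'])
n_missing = int(partial['created'].isna().sum())

# Assigning from the full frame: every row is converted
complete = toy_full.copy()
complete['created'] = pd.to_datetime(toy_full['created'])
complete['month'] = complete['created'].dt.month
months = complete['month'].tolist()

print(n_missing, months)
```

Checking `isna().sum()` on the converted column is a cheap way to confirm that every row was parsed.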

<h3>Several columns (photos, features, description) hold lists or free text rather than numbers. The simplest way to feed them to a model is to count their contents: the number of photos, the number of listed features, and the number of words in the description</h3>

In [21]:
full_df['len_photos'] = full_df['photos'].apply(len)
full_df['len_features'] = full_df['features'].apply(len)
# use full_df here as well so the test rows also get a word count
full_df['len_desc'] = full_df['description'].apply(lambda x: len(x.split(" ")))
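To make the count features concrete, here is a small sketch on a made-up two-row frame (the values and the extra `len_desc_ws` column are illustrative assumptions, not notebook data). It also shows one quirk of `split(" ")`: an empty description counts as one word, whereas `split()` with no argument counts it as zero.

```python
import pandas as pd

# Hypothetical listings: one with content, one nearly empty
toy = pd.DataFrame({
    'photos': [['a.jpg', 'b.jpg'], []],
    'features': [['Doorman', 'Elevator', 'Laundry'], ['Pets Allowed']],
    'description': ['Sunny two bedroom near the park', ''],
})

# Same count features as in the cell above
toy['len_photos'] = toy['photos'].apply(len)
toy['len_features'] = toy['features'].apply(len)
toy['len_desc'] = toy['description'].apply(lambda x: len(x.split(" ")))

# Alternative word count that splits on any whitespace run (illustrative only)
toy['len_desc_ws'] = toy['description'].apply(lambda x: len(x.split()))

print(toy[['len_photos', 'len_features', 'len_desc', 'len_desc_ws']])
```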

In [28]:
cols_to_keep = ['bathrooms', 'bedrooms', 'latitude', 
                'longitude', 'price', 'len_photos', 
                'len_features', 'len_desc', 'year', 
                'month', 'day']
# re-slice the train rows so df picks up the new feature columns
df = full_df[:len(train_df)]

In [29]:
X = df[cols_to_keep]
y = df['interest_level']
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.33)

In [31]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.63200451267855529
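The score above is the multiclass log loss: the mean negative log of the probability the model assigned to the true class, so lower is better. The sketch below recomputes it by hand on made-up probabilities and compares it with `log_loss`. The labels, `y_true` and `proba` are invented for illustration, and the column order is assumed to be alphabetical, which is how scikit-learn sorts string classes in `clf.classes_`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Assumed column order: alphabetical, as in clf.classes_ for string labels
labels = ['high', 'low', 'medium']

# Made-up true classes and predicted probabilities (each row sums to 1)
y_true = ['low', 'medium', 'high', 'low']
proba = np.array([
    [0.10, 0.70, 0.20],
    [0.20, 0.30, 0.50],
    [0.60, 0.30, 0.10],
    [0.05, 0.90, 0.05],
])

# Pick the probability given to the true class in each row
idx = [labels.index(c) for c in y_true]
true_class_proba = proba[np.arange(len(y_true)), idx]

# Mean negative log-probability of the true class
manual = -np.mean(np.log(true_class_proba))
sk = log_loss(y_true, proba, labels=labels)

print(manual, sk)
```

Passing `labels` explicitly makes the mapping between probability columns and class names unambiguous.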

In [32]:
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.64302378233253077

In [33]:
clf = GaussianNB()
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

1.9071312968953897

In [34]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

0.72439762583433065

In [35]:
clf = KNeighborsClassifier(n_neighbors = 3)
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
log_loss(y_val, y_val_pred)

6.0017451126330039
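One plausible contributor to this large value (an interpretation, not something verified in this notebook) is that a 3-neighbour classifier with uniform weights can only output vote fractions of 0, 1/3, 2/3 or 1. Whenever the true class gets no votes its probability is 0, which log loss punishes very heavily. The sketch below uses made-up vote counts to show the effect and, purely for comparison, a softened version with one pseudo-vote added per class.

```python
import numpy as np
from sklearn.metrics import log_loss

labels = ['high', 'low', 'medium']
y_true = ['low', 'low', 'medium', 'high']

# Hypothetical neighbour votes per class for k = 3 (rows sum to 3)
votes = np.array([
    [0, 3, 0],
    [1, 2, 0],
    [0, 3, 0],   # true class 'medium' gets no votes
    [0, 2, 1],   # true class 'high' gets no votes
])

# Raw vote fractions, the kind of output a 3-neighbour classifier gives
hard = votes / 3.0

# Illustrative softening: one pseudo-vote per class (not used in the notebook)
soft = (votes + 1) / (3.0 + len(labels))

hard_loss = log_loss(y_true, hard, labels=labels)
soft_loss = log_loss(y_true, soft, labels=labels)

print(hard_loss, soft_loss)
```

A larger `n_neighbors` gives finer-grained probabilities, which is one reason to revisit this setting before ruling the model out.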

<h1>Now that we have baseline log loss scores, let's do some feature engineering</h1>