# Craigslist Capstone Project - Pre-processing and Training Data Development

#### The goal of this capstone is to predict market rent prices in the San Francisco Bay Area. The metros of interest are San Francisco, the Peninsula, and the East Bay.

Exploratory data analysis (EDA) was performed in a previous notebook, where the correlations between variables were plotted.

#### This notebook covers the following steps
1. Splitting the data into testing and training datasets
2. Feature engineering for categorical variables
3. Imputing missing values
4. Removing outliers
5. Removing extra columns not used
6. Standardizing the magnitude of numeric features using a scaler

## Imports

In [61]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

## Load Data

In [62]:
# The data is in the interim directory
df = pd.read_csv('/Users/pandabear/springboard/CapstoneTwoProject/data/interim/listing_df_EDA.csv',index_col='listing_id')
# Drop listing_city 
df.drop(['listing_city'], axis=1, inplace=True)
df.head()

listing_id,listing_nh,listing_price,listing_sqft,animals_cats,animals_dogs,smoking,wheelchair accessible,has_AC,hasEVCharging,laundry_in_bldg,...,walk_score,transit_score,bike_score,is_rent_controlled,pets_allowed,has_amenities,no_bedrooms,no_bathrooms,laundry_none,parking_detached
7495842903,palo alto,1695,400.0,0,0,0,0,0,0,0,...,85.0,46.0,100.0,0,1,1,1,1.0,0,0
7495966009,mission district,4999,927.0,0,0,0,0,0,0,0,...,97.0,83.0,100.0,0,1,1,1,1.0,0,0
7496082921,oakland east,2125,505.0,1,1,1,0,0,0,0,...,89.0,72.0,91.0,0,0,1,0,1.0,0,0
7496134063,pacific heights,4500,,0,0,1,0,0,0,1,...,94.0,82.0,57.0,0,0,1,2,1.0,0,0
7496160361,oakland east,2315,605.0,0,0,0,0,0,0,0,...,89.0,72.0,91.0,0,0,1,1,1.0,0,0


## 1. Split the data into testing and training datasets

In [63]:
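y = df['listing_price']
# listing_price stays in X for now: the pipeline below uses it to compute
# neighborhood averages and drops it before scaling
X = df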

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=1,shuffle=True)


In [64]:
X_train.shape, X_test.shape

((9661, 32), (4141, 32))

In [65]:
X_train.head()

listing_id,listing_nh,listing_price,listing_sqft,animals_cats,animals_dogs,smoking,wheelchair accessible,has_AC,hasEVCharging,laundry_in_bldg,...,walk_score,transit_score,bike_score,is_rent_controlled,pets_allowed,has_amenities,no_bedrooms,no_bathrooms,laundry_none,parking_detached
7514897829,SOMA / south beach,4425,655.0,1,1,1,0,1,1,1,...,97.0,100.0,81.0,0,0,1,1,1.0,0,1
7525474147,richmond / seacliff,3695,1078.0,1,1,0,0,0,0,0,...,95.0,64.0,92.0,0,0,1,2,1.0,0,0
7538296288,inner sunset / UCSF,3045,,0,0,0,0,0,0,1,...,97.0,74.0,92.0,0,1,1,2,1.0,0,0
7531584813,oakland lake merritt / grand,1999,,1,0,0,0,0,0,1,...,78.0,55.0,69.0,0,0,0,1,1.0,0,0
7532134267,north beach / telegraph hill,1195,,0,0,1,0,0,0,0,...,100.0,90.0,71.0,0,0,0,1,0.5,1,0


In [66]:
y_train.shape, y_test.shape

((9661,), (4141,))

## 2. Feature engineering for categorical variables
The categorical location features are neighborhood and city. Since a city includes multiple neighborhoods, the neighborhood feature replaces the city feature entirely, which is why listing_city was dropped when the data was loaded.
The most common rental unit size is 1 bedroom/1 bathroom. There are too many neighborhoods to one-hot encode, so each neighborhood is instead encoded as the mean price of its 1 bedroom/1 bathroom listings in the training set. Neighborhoods without such listings fall back to the mean of the neighborhood averages.
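
The encoding described above can be tried on a toy dataframe first. This is only a minimal sketch: the neighborhoods and prices are made up for illustration, and `fallback_price` is a hypothetical name for the value used when a neighborhood was not seen in the training data.

```python
import pandas as pd

# Toy training data (hypothetical values, not taken from the Craigslist dataset)
train = pd.DataFrame({
    "listing_nh": ["mission", "mission", "soma", "soma", "palo alto"],
    "no_bedrooms": [1, 1, 1, 2, 1],
    "no_bathrooms": [1.0, 1.0, 1.0, 1.0, 1.0],
    "listing_price": [3000, 3200, 3500, 5000, 2500],
})

# Keep only 1 bedroom/1 bathroom listings, then average the price per neighborhood
one_by_one = train[(train["no_bedrooms"] == 1) & (train["no_bathrooms"] == 1)]
nh_mean_price = one_by_one.groupby("listing_nh")["listing_price"].mean()

# Fallback for neighborhoods that never appear in the training data
fallback_price = nh_mean_price.mean()

# Encode a test set that contains one known and one unseen neighborhood
test = pd.DataFrame({"listing_nh": ["soma", "berkeley"]})
test["average_1bed1bath_price_by_nh"] = (
    test["listing_nh"].map(nh_mean_price).fillna(fallback_price)
)
print(test)
```

The 2 bedroom listing in "soma" does not affect the encoding, and the unseen neighborhood receives the fallback value instead of a missing value.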

In [67]:
# List all the unique neighborhoods in the entire dataset
unique_nh = df.listing_nh.unique()
print(f'There are {len(unique_nh)} unique neighborhoods in the dataframe')

There are 83 unique neighborhoods in the dataframe


Let's create our own transformer that replaces the categorical neighborhood feature with the average listing price of 1 bedroom/1 bathroom listings in each neighborhood.

In [68]:
class NHAveragePrice:
        
    def fit(self, X, y=None):
        # Select only one bedroom/one bathroom listings from the data passed to fit
        # (use X, not the global X_train, so the transformer is self-contained)
        onebed_onebath_listings = X[(X['no_bedrooms'] == 1) & (X['no_bathrooms'] == 1)]

        # Group by neighborhood and find the mean price for each neighborhood
        nh_mean_price = onebed_onebath_listings.groupby('listing_nh')['listing_price'].mean()

        # Convert to a dictionary of {neighborhood: mean price}
        self.nh_mean_price_dict = nh_mean_price.to_dict()
        
        # Fallback for neighborhoods without 1 bedroom/1 bathroom listings:
        # the mean of the neighborhood means
        self.mean_price = nh_mean_price.mean()
        
        return self
    
    def transform(self, X):
        # Work on a copy so the caller's dataframe is not modified
        X = X.copy()
        # Add a new column called 'average_1bed1bath_price_by_nh'
        X['average_1bed1bath_price_by_nh'] = X['listing_nh'].apply(lambda x: (self.nh_mean_price_dict.get(x, self.mean_price)))
        return X

## 3. Handle missing values

In [69]:
missing = pd.concat([X_train.isnull().sum(), 100 * X_train.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count',ascending=False)

,count,%
listing_sqft,3162,32.729531
transit_score,1705,17.648277
bike_score,1273,13.17669
walk_score,294,3.043163
listing_nh,0,0.0
housing_condo,0,0.0
laundry_none,0,0.0
no_bathrooms,0,0.0
no_bedrooms,0,0.0
has_amenities,0,0.0


Missing addresses are OK to ignore since that feature will not be used directly. The neighborhood and the walk, transit, and bike scores represent how desirable a property is better than the exact address does.

### Impute any missing square footage values 
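Let's use the median square footage of listings with the same number of bedrooms and bathrooms. If a bedroom/bathroom combination has no known square footage, fall back to the average of the medians for that number of bedrooms.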
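
As a rough illustration of this group-median imputation, here is a sketch on made-up data. The values are hypothetical, and the names `group_median` and `bedroom_mean` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy training data (hypothetical values)
train = pd.DataFrame({
    "no_bedrooms":  [1, 1, 1, 2, 2, 2],
    "no_bathrooms": [1.0, 1.0, 1.0, 1.0, 1.0, 2.0],
    "listing_sqft": [600.0, 700.0, np.nan, 900.0, 1100.0, np.nan],
})

# Median sqft per (bedrooms, bathrooms) group; a group with no known sqft stays NaN
group_median = (
    train.groupby(["no_bedrooms", "no_bathrooms"])["listing_sqft"]
    .median()
    .rename("listing_sqft_median")
    .reset_index()
)

# Fallback: average of the group medians for the same number of bedrooms
bedroom_mean = group_median.groupby("no_bedrooms")["listing_sqft_median"].mean()
group_median["listing_sqft_median"] = group_median["listing_sqft_median"].fillna(
    group_median["no_bedrooms"].map(bedroom_mean)
)

# Left join the medians and fill only the rows where listing_sqft is missing
imputed = train.merge(group_median, on=["no_bedrooms", "no_bathrooms"], how="left")
imputed["listing_sqft"] = imputed["listing_sqft"].fillna(imputed["listing_sqft_median"])
imputed = imputed.drop(columns="listing_sqft_median")
print(imputed)
```

The 2 bedroom/2 bathroom row has no observed square footage in its own group, so it takes the fallback for 2 bedroom listings.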

In [70]:
class ImputeMissingSqft:
        
    def fit(self, X, y=None):
        # Group by bedrooms and bathrooms to find median square footage
        # (select listing_sqft before aggregating so non-numeric columns are never aggregated)
        self.missing_sqft = X.groupby(by=['no_bedrooms','no_bathrooms'])['listing_sqft'].median().reset_index()
        self.missing_sqft.rename(columns = {'listing_sqft':'listing_sqft_median'}, inplace = True)
        
        # Create a dictionary of the average of those medians by number of bedrooms
        self.mean_sqft_by_bedroom_dict = self.missing_sqft.groupby(by='no_bedrooms')['listing_sqft_median'].mean().to_dict()
        
        # Groups with no known square footage have a null median;
        # fall back to the average for that number of bedrooms
        fallback = self.missing_sqft['no_bedrooms'].map(self.mean_sqft_by_bedroom_dict)
        self.missing_sqft['listing_sqft_median'] = self.missing_sqft['listing_sqft_median'].fillna(fallback)
        return self
    
    def transform(self, X):
        # Left join on bedrooms and bathrooms. Note: merge resets the index,
        # so listing_id is replaced by 0..n-1 (row order is preserved)
        combined_df = X.merge(self.missing_sqft, on=['no_bedrooms','no_bathrooms'], how='left')
        
        # Use the median column only where listing_sqft is null
        combined_df['listing_sqft'] = combined_df['listing_sqft'].fillna(combined_df['listing_sqft_median'])
        combined_df.drop(['listing_sqft_median'], axis=1, inplace=True)
        return combined_df

### Impute any missing Walk score, Transit Score, Bike Score values
Let's use the median value for the listing's neighborhood. All three scores range from 0 to 100.
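
A minimal sketch of neighborhood-median imputation on made-up scores follows. It uses `Series.map` rather than a join, and `fallback` is a hypothetical name for the value applied to neighborhoods that were not seen when the medians were computed.

```python
import numpy as np
import pandas as pd

score_cols = ["walk_score", "transit_score", "bike_score"]

# Toy training data (hypothetical values)
train = pd.DataFrame({
    "listing_nh": ["soma", "soma", "soma", "mission", "mission"],
    "walk_score": [90.0, 98.0, np.nan, 80.0, 84.0],
    "transit_score": [100.0, np.nan, 96.0, 70.0, 74.0],
    "bike_score": [80.0, 82.0, 84.0, np.nan, 90.0],
})

# Median of each score per neighborhood (NaN values are skipped)
nh_median = train.groupby("listing_nh")[score_cols].median()

# Fallback for unseen neighborhoods: mean of the neighborhood medians
fallback = nh_median.mean()

test = pd.DataFrame({
    "listing_nh": ["soma", "berkeley"],
    "walk_score": [np.nan, np.nan],
    "transit_score": [95.0, np.nan],
    "bike_score": [np.nan, 60.0],
})

for col in score_cols:
    # First try the neighborhood median, then the overall fallback
    test[col] = test[col].fillna(test["listing_nh"].map(nh_median[col]))
    test[col] = test[col].fillna(fallback[col])
print(test)
```

Scores that are already present are left untouched; only the missing entries are filled.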

In [71]:
class ImputeMissingWalkscore:
        
    def fit(self, X, y=None):
        # Group by neighborhood to find median walk_score, transit_score, bike_score
        self.median_scores = X.groupby(by=['listing_nh'])[['walk_score','transit_score','bike_score']].median().reset_index()
        self.median_scores.rename(columns = {'walk_score':'median_walk_score','transit_score':'median_transit_score','bike_score':'median_bike_score'}, inplace = True)
        # Neighborhoods with no known value for a score get a median of 0
        self.median_scores.fillna(0, inplace=True)
        return self
    
    def transform(self, X):
        # Left join on neighborhoods
        nh_df = X.merge(self.median_scores, on=['listing_nh'], how='left')

        for score in ['walk_score','transit_score','bike_score']:
            # Use the neighborhood median if the score is null
            nh_df[score] = nh_df[score].fillna(nh_df['median_' + score])
            # In case there is a new neighborhood that has no median score,
            # use the mean of the neighborhood medians to fill null values
            nh_df[score] = nh_df[score].fillna(self.median_scores['median_' + score].mean())
        
        # Drop unused columns
        nh_df.drop(['median_walk_score','median_transit_score','median_bike_score','listing_nh'], axis=1,inplace=True)
        return nh_df

## 4. Remove outliers for studio listings
During EDA, it became clear that studio listing prices had a very long tail - possibly because some listings were misclassified as studios when they should have been 1 or 2 bedroom apartments. 
To limit the effect of these outliers, cap studio prices at the 99th percentile of studio prices in the training set.
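
The capping step can be sketched on a toy price series as follows. The prices are invented for illustration, and `Series.quantile` is used with its default linear interpolation.

```python
import pandas as pd

# Toy data: five studios (one with a suspiciously high price) and two 1 bedroom listings
train = pd.DataFrame({
    "no_bedrooms": [0, 0, 0, 0, 0, 1, 1],
    "listing_price": [1500, 1600, 1700, 1800, 9000, 3000, 12000],
})

is_studio = train["no_bedrooms"] == 0

# 99th percentile of studio prices only
upper_lim = train.loc[is_studio, "listing_price"].quantile(0.99)

# Cap studio prices at the limit; other listings are left unchanged
capped = train["listing_price"].astype(float)
capped.loc[is_studio] = capped.loc[is_studio].clip(upper=upper_lim)
print(upper_lim)
print(capped)
```

Only the studio above the limit is pulled down; the expensive 1 bedroom listing keeps its price because the cap applies to studios alone.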

In [72]:
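class RemoveOutliersStudios:
    def fit(self, X, y=None):
        # Find the 99th percentile of listing_price for studios
        self.upper_lim = X[X['no_bedrooms'] == 0]['listing_price'].quantile(q = 0.99)
        return self
    
    def transform(self, X):
        # Cap studio prices above the 99th percentile
        X.loc[(X['listing_price'] > self.upper_lim) & (X['no_bedrooms'] == 0),'listing_price'] = self.upper_lim
        # Drop listing_price column so the target does not leak into the features.
        # Note: the capped prices are discarded with this column; y_train and y_test
        # were split off earlier and still hold the uncapped prices.
        X.drop(['listing_price'], axis=1, inplace=True)
        return X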

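## 6. Standardize numeric features
The numeric features are rescaled to the [0, 1] range with a MinMaxScaler fitted on the training set. Step 5, removing unused columns, already happens inside the transformers above: listing_nh is dropped by ImputeMissingWalkscore and listing_price by RemoveOutliersStudios.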

In [73]:
class DataframeMinMaxScaler:
    def __init__(self, columns):
        self.columns = columns
        self.scaler = MinMaxScaler()
        
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns])
        return self
    
    def transform(self, X):
        X[self.columns] = self.scaler.transform(X[self.columns])
        return X
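
To see what this kind of column-wise min-max scaling does, here is a small sketch on made-up data. It calls scikit-learn's `MinMaxScaler` directly rather than the wrapper class; the column values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train/test frames (hypothetical values); has_AC is a binary flag left unscaled
train = pd.DataFrame({"listing_sqft": [400.0, 800.0, 1200.0], "has_AC": [0, 1, 0]})
test = pd.DataFrame({"listing_sqft": [600.0, 1600.0], "has_AC": [1, 0]})

cols = ["listing_sqft"]

# Learn the min and max from the training set only
scaler = MinMaxScaler().fit(train[cols])

# Apply the same training-set min and max to both sets
train[cols] = scaler.transform(train[cols])
test[cols] = scaler.transform(test[cols])
print(train)
print(test)
```

Because the minimum and maximum come from the training set, a test value larger than the training maximum is mapped above 1. That is expected behavior and not a sign of leakage.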

In [74]:
nh = NHAveragePrice()
sqft = ImputeMissingSqft()
walk = ImputeMissingWalkscore()
outliers = RemoveOutliersStudios()
# Named min_max_scaler to avoid shadowing sklearn.preprocessing.scale imported above
min_max_scaler = DataframeMinMaxScaler(['listing_sqft','walk_score','transit_score','bike_score','no_bedrooms','no_bathrooms','average_1bed1bath_price_by_nh'])

p = Pipeline([
    ('nh_average_price', nh),
    ('impute_missing_sqft', sqft),
    ('impute_missing_walkscore', walk),
    ('remove_outliers_studios', outliers),
    ('min_max_scaler', min_max_scaler)
])

# Fit every step on the training set only, then apply the fitted steps to the test set
X_train = p.fit_transform(X_train)
X_test = p.transform(X_test)

In [77]:
X_train

,listing_sqft,animals_cats,animals_dogs,smoking,wheelchair accessible,has_AC,hasEVCharging,laundry_in_bldg,laundry_in_unit,laundry_onsite,...,transit_score,bike_score,is_rent_controlled,pets_allowed,has_amenities,no_bedrooms,no_bathrooms,laundry_none,parking_detached,average_1bed1bath_price_by_nh
0,0.044703,1,1,1,0,1,1,1,0,0,...,1.000,0.81,0,0,1,0.166667,0.142857,0,1,0.583913
1,0.086818,1,1,0,0,0,0,0,1,0,...,0.640,0.92,0,0,1,0.333333,0.142857,0,0,0.378340
2,0.067105,0,0,0,0,0,0,1,0,0,...,0.740,0.92,0,1,1,0.333333,0.142857,0,0,0.350351
3,0.047192,1,0,0,0,0,0,1,0,0,...,0.550,0.69,0,0,0,0.166667,0.142857,0,0,0.257338
4,0.037087,0,0,1,0,0,0,0,0,0,...,0.900,0.71,0,0,0,0.166667,0.000000,1,0,0.519050
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9656,0.047192,0,0,1,0,0,0,0,0,0,...,0.470,0.82,0,0,0,0.166667,0.142857,1,0,0.163383
9657,0.064118,0,0,0,0,0,0,0,1,0,...,1.000,0.97,0,0,1,0.333333,0.428571,0,0,0.583913
9658,0.156710,0,0,1,0,0,0,0,1,0,...,0.325,0.67,0,1,0,0.666667,0.428571,0,0,0.508036
9659,0.026185,0,0,0,0,0,0,0,0,1,...,0.480,1.00,0,0,1,0.000000,0.142857,0,1,0.421000
