<h2><center>Predicting Home Values in Los Angeles’ South Bay</center></h2>
<h3><center>Springboard | Capstone #1 - In Depth Analysis</center></h3>
<h4><center>By: Lauren Broussard</center></h4>
---

Using some of our previous findings, we will now use machine learning to see how well we can predict housing prices in the South Bay area. 

Because home price is a continuous variable (as opposed to a discrete one), we'll treat this as a regression problem. Further, since we already have labeled data (features and housing prices), we'll take a supervised learning approach. Additionally, we will try to determine which features are most important in predicting home prices in this area.

For this problem, we'll be using Random Forest Regression, an ensemble method that expands on the Decision Tree approach. 
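To make the "ensemble of decision trees" idea concrete, here is a minimal sketch on synthetic data (not the South Bay dataset; `X_demo` and `y_demo` are made-up stand-ins). It checks that a fitted forest's prediction is the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the South Bay dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is stored in `estimators_`
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging the per-tree predictions should reproduce the forest's prediction
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_demo)
print(np.allclose(manual_average, forest_pred))
```

Each tree is trained on a bootstrap sample of the rows, so the individual trees disagree; averaging them is what reduces the variance of a single deep tree.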

In [1]:
# import relevant libraries
import numpy as np
import pandas as pd


%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# import south_bay dataset 
south_bay = pd.read_csv('south_bay_cleaned.csv', parse_dates = ['SOLD DATE'])

### Data Preparation and Encoding 

#### Drop Columns & Set Dates/Seasons

In [3]:
# drop address and MLS number, which are identifiers and won't help the model
# drop $/square feet (it is derived from price, so it would leak the target) and city
south_bay.drop(['ADDRESS','MLS#','$/SQUARE FEET', 'CITY'], axis=1,inplace=True)

In [4]:
# handle dates 
# create separate date columns 

south_bay['SOLD_YEAR'] = south_bay['SOLD DATE'].dt.year
south_bay['SOLD_MONTH'] = south_bay['SOLD DATE'].dt.month
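# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
south_bay['SOLD_WEEK'] = south_bay['SOLD DATE'].dt.isocalendar().week.astype(int)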
south_bay['SOLD_DAY'] = south_bay['SOLD DATE'].dt.day


# add column - seasons to dataset 
# create dictionary mapping of months (1-12) to seasons
seasons = ['Winter', 'Winter', 'Spring', 'Spring', 'Spring', 'Summer', 'Summer',\
           'Summer', 'Fall', 'Fall', 'Fall', 'Winter']

month_to_season = dict(zip(range(1,13), seasons))

# map months to seasons and create new column 
south_bay['SEASON'] = south_bay['SOLD DATE'].dt.month.map(month_to_season) 

In [5]:
# drop datetime column 
south_bay.drop(['SOLD DATE'], axis=1,inplace=True)

#### One-Hot Encoding

We'll use one-hot encoding to change categorical columns to binary values before putting them in the model. 
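As a small illustration of what `pd.get_dummies` does, here is a sketch on a made-up three-row frame (the `toy` DataFrame is hypothetical). Numeric columns pass through unchanged, while each category of a text column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "BEDS": [2, 3, 4],
    "SEASON": ["Winter", "Spring", "Winter"],
})

# Only the object-typed column is expanded into indicator columns
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Depending on the pandas version, the indicator columns are shown as 0/1 integers or as booleans; both work as model inputs.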

In [6]:
# one hot encoding on all categorical variables
south_bay_f = pd.get_dummies(south_bay)

# Display the first 5 rows of the last columns
south_bay_f.iloc[:,10:].head(5)

Unnamed: 0,LONGITUDE,SOLD_YEAR,SOLD_MONTH,SOLD_WEEK,SOLD_DAY,PROPERTY TYPE_Condo/Co-op,PROPERTY TYPE_Mobile/Manufactured Home,PROPERTY TYPE_Single Family Residential,PROPERTY TYPE_Townhouse,NEIGHBORHOOD_Alondra Park,...,NEIGHBORHOOD_Rolling Hills Estates,NEIGHBORHOOD_San Pedro,NEIGHBORHOOD_Torrance,NEIGHBORHOOD_Watts,NEIGHBORHOOD_Westchester,NEIGHBORHOOD_Wilmington,SEASON_Fall,SEASON_Spring,SEASON_Summer,SEASON_Winter
0,-118.271532,2019,2,5,1,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,1
1,-118.280823,2018,5,22,31,0,0,1,0,0,...,0,0,0,0,0,1,0,1,0,0
2,-118.26543,2019,10,44,31,0,0,1,0,0,...,0,0,0,0,0,1,1,0,0,0
3,-118.270032,2019,4,16,15,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,0
4,-118.248288,2018,2,7,12,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1


#### Separate Feature Data from Target (Price) Data

In [7]:
# split data into target and features 
X = south_bay_f.drop(['PRICE'],axis=1) 

y = south_bay_f['PRICE']

## Random Forest Regression

#### Split into Training and Testing Data

In [8]:
# split data into training and testing 
from sklearn.model_selection import train_test_split 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=42)

#### Establish Baseline

If we had to predict the price of each property without any other knowledge, a reasonable guess would be the median home price. We'll use that as our baseline (computed here from the test set prices) and see whether our model can outperform it.

In [9]:
# get median price error

baseline_preds = np.median(y_test)

baseline_errors = round(np.mean(abs(baseline_preds - y_test)),2)

print("Mean Absolute Error (Baseline): ${d}".format(d=baseline_errors))

Mean Absolute Error (Baseline): $485739.85
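One caveat: taking the median of the test prices uses information from the test set itself. A stricter baseline takes the median from the training prices only. The sketch below shows that variant on made-up, right-skewed prices (the `prices` array is a hypothetical stand-in for the `PRICE` column), using scikit-learn's `DummyRegressor` and checking it against the manual calculation.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical right-skewed prices standing in for the PRICE column
prices = rng.lognormal(mean=13.5, sigma=0.6, size=500)
features = np.zeros((len(prices), 1))  # DummyRegressor ignores the features

f_train, f_test, p_train, p_test = train_test_split(
    features, prices, test_size=0.3, random_state=42
)

# Baseline that always predicts the median of the *training* prices
baseline = DummyRegressor(strategy="median").fit(f_train, p_train)
baseline_mae = mean_absolute_error(p_test, baseline.predict(f_test))

# The same quantity computed by hand
manual_mae = np.mean(np.abs(np.median(p_train) - p_test))
print(round(baseline_mae, 2), round(manual_mae, 2))
```

In the notebook, the equivalent change would be to compute the median from `y_train` and evaluate it against `y_test`.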


#### Run Initial Random Forest Regressor

In [10]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 100 decision trees and a maximum depth of 10
randomforest = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 42)


# Train the model on training data
randomforest.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=10, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

#### Predict and Estimate Results

In [11]:
# predict price based on trained model
y_pred = randomforest.predict(X_test)

In [12]:
# score() returns the R^2 coefficient of determination for a regressor
print("R^2 Score (Training): {d}".format(d = randomforest.score(X_train,y_train)))
print("R^2 Score (Testing): {d}".format(d = randomforest.score(X_test,y_test)))

R^2 Score (Training): 0.9579185771822071
R^2 Score (Testing): 0.837605960188698


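For a regressor, `score` returns R² rather than accuracy. The model explains about 96% of the variance in the training data (R² ≈ 0.96), but only about 84% in the testing data (R² ≈ 0.84). This gap suggests the model is overfitting the training data to some degree.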
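A single train/test split gives only one estimate of out-of-sample performance. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data (`X_demo` and `y_demo` are assumptions, not the South Bay features) with the same forest settings as above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the South Bay dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)

# Five R^2 scores, one per held-out fold
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print("R^2 per fold:", cv_scores)
print("Mean R^2:", cv_scores.mean(), "Std:", cv_scores.std())
```

If the cross-validated scores sit well below the training score, that supports the overfitting reading; lowering `max_depth` or raising `min_samples_leaf` are the usual knobs to try.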

In [13]:
from sklearn import metrics

# print metrics: Mean Absolute Error, Root Mean Squared Error, R2
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R^2 Score:', metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 129338.74704593819
Root Mean Squared Error: 366855.16132501746
R^2 Score: 0.837605960188698


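The mean absolute error indicates that, on average, the model's predictions are off by about $129,339, compared with about $485,740 for the median baseline. The much larger root mean squared error (about $366,855) suggests that a small number of homes are mispredicted by a wide margin.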
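For reference, the three metrics can be written out by hand. The sketch below uses four made-up prices (`y_true` and `y_hat` are hypothetical) and compares the manual formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted prices
y_true = np.array([500_000.0, 750_000.0, 1_200_000.0, 2_000_000.0])
y_hat = np.array([520_000.0, 700_000.0, 1_300_000.0, 1_700_000.0])

residuals = y_true - y_hat

# Mean absolute error: average size of the miss, in dollars
mae = np.mean(np.abs(residuals))

# Root mean squared error: squares the misses first, so large ones dominate
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, metrics.mean_absolute_error(y_true, y_hat))
print(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
print(r2, metrics.r2_score(y_true, y_hat))
```

Because of the single $300,000 miss, the RMSE here comes out noticeably larger than the MAE, which is the same pattern seen in the model's results.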

In [14]:
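errors = (y_pred - y_test)

# Signed percentage error for each home (positive = over-prediction)
# Note: this is not MAPE, which would use the absolute value of the errors
pct_error = 100 * (errors / y_test)

# Report 100 minus the mean signed percentage error
accuracy = 100 - np.mean(pct_error)
print('100 - Mean Signed Percentage Error: {d}%'.format(d=round(accuracy, 2)))

100 - Mean Signed Percentage Error: 97.22%


This figure is 100 minus the mean signed percentage error, so on average the model over-predicts prices by about 2.8%. Because over- and under-predictions cancel each other out in a signed average, this measures the model's bias rather than the typical size of its errors.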
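Mean absolute percentage error (MAPE) takes the absolute value of each percentage error before averaging, so that over- and under-predictions cannot cancel out. A minimal sketch on made-up prices (`y_true` and `y_hat` are hypothetical), contrasting the signed and absolute versions:

```python
import numpy as np

# Hypothetical prices: one over-prediction, one under-prediction, one exact
y_true = np.array([400_000.0, 800_000.0, 1_000_000.0])
y_hat = np.array([440_000.0, 720_000.0, 1_000_000.0])

# Signed percentage errors: +10%, -10%, 0%
pct_errors = 100 * (y_hat - y_true) / y_true

# Signed mean: the +10% and -10% cancel, hiding the errors
mean_signed = np.mean(pct_errors)

# MAPE: average of the absolute percentage errors
mape = np.mean(np.abs(pct_errors))

print("Mean signed percentage error:", mean_signed)
print("MAPE:", round(mape, 2))
```

In the notebook, the same calculation would apply `np.abs` to `errors / y_test` before taking the mean.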

#### Feature Importance

In [15]:
# Get numerical feature importances

importances = list(randomforest.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:15} Importance: {}'.format(*pair))

Variable: SQUARE FEET     Importance: 0.57
Variable: LONGITUDE       Importance: 0.15
Variable: LOT SIZE        Importance: 0.06
Variable: NEIGHBORHOOD_Manhattan Beach Importance: 0.06
Variable: YEAR BUILT      Importance: 0.03
Variable: LATITUDE        Importance: 0.03
Variable: ZIP OR POSTAL CODE Importance: 0.01
Variable: BEDS            Importance: 0.01
Variable: BATHS           Importance: 0.01
Variable: DAYS ON MARKET  Importance: 0.01
Variable: HOA/MONTH       Importance: 0.01
Variable: SOLD_WEEK       Importance: 0.01
Variable: SOLD_DAY        Importance: 0.01
Variable: NEIGHBORHOOD_Hermosa Beach Importance: 0.01
Variable: NEIGHBORHOOD_Redondo Beach Importance: 0.01
Variable: SOLD_YEAR       Importance: 0.0
Variable: SOLD_MONTH      Importance: 0.0
Variable: PROPERTY TYPE_Condo/Co-op Importance: 0.0
Variable: PROPERTY TYPE_Mobile/Manufactured Home Importance: 0.0
Variable: PROPERTY TYPE_Single Family Residential Importance: 0.0
Variable: PROPERTY TYPE_Townhouse Importance: 0.0


#### Parameter Tuning: max_depth & n_estimators

In [16]:
# TODO Parameter tuning: n_estimators, max_depth
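One possible way to approach this tuning step is a cross-validated grid search. The sketch below is not the notebook's result: it runs on synthetic data (`X_demo` and `y_demo` are assumptions), and the grid values are illustrative choices only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the South Bay dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

# Illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, 10, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", -search.best_score_)
```

On the real data, `search.fit(X_train, y_train)` would keep the test set untouched for a final evaluation of `search.best_estimator_`.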

### Visualize

In [17]:
# TODO: Plot variable importance 
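A horizontal bar chart is one straightforward way to show the importances. The sketch below hard-codes the six largest values from the printed feature importance list so that it runs on its own; the `top_importances` name is introduced here for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Six largest importances, copied from the printed output
top_importances = pd.Series({
    "SQUARE FEET": 0.57,
    "LONGITUDE": 0.15,
    "LOT SIZE": 0.06,
    "NEIGHBORHOOD_Manhattan Beach": 0.06,
    "YEAR BUILT": 0.03,
    "LATITUDE": 0.03,
})

fig, ax = plt.subplots(figsize=(8, 4))

# Sort ascending so the most important feature ends up at the top of the chart
top_importances.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("Random forest feature importances (top 6)")
fig.tight_layout()
```

In the notebook, the series could instead be built directly from the fitted model with `pd.Series(randomforest.feature_importances_, index=X.columns)`.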

In [18]:
# TODO: Plot y values over predictions (home prices )
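A scatter plot of actual against predicted prices, with a diagonal marking perfect predictions, is a common way to see where a model misses. The sketch below uses made-up prices (`actual` and `predicted` are hypothetical stand-ins for `y_test` and `y_pred`).

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for y_test and y_pred
actual = rng.lognormal(mean=13.5, sigma=0.6, size=200)
predicted = actual * rng.normal(loc=1.0, scale=0.15, size=200)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.4)

# Diagonal reference line: points on it are predicted exactly
limits = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(limits, limits, color="red", linestyle="--", label="Perfect prediction")

ax.set_xlabel("Actual price ($)")
ax.set_ylabel("Predicted price ($)")
ax.set_title("Actual vs. predicted home prices")
ax.legend()
fig.tight_layout()
```

Points above the diagonal are over-predictions and points below are under-predictions; a widening spread at higher prices would indicate that expensive homes are harder to predict.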

## Conclusion

In [19]:
# TODO 