# The Dataset

The dataset I used for my final project documents property sales in Brooklyn from 2003 to 2017. I found it on Kaggle. The target variable I tried to predict was the sale price of each property. In Part 1 I wanted an accuracy score of 80%, but after many attempts I did not come close to that mark.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

In [2]:
#Reading in the dataset

brooklyn = pd.read_csv("brooklyn_sales_map.csv")

In [3]:
#A brief look at the dataset

brooklyn.head()

Unnamed: 0.1,Unnamed: 0,borough,neighborhood,building_class_category,tax_class,block,lot,building_class,address,zip_code,...,XCoord,YCoord,ZoneMap,ZMCode,Sanborn,TaxMap,EDesigNum,Version,SHAPE_Leng,SHAPE_Area
0,1,3,DOWNTOWN-METROTECH,28 COMMERCIAL CONDOS,4,140,1001,R5,330 JAY STREET,11201,...,,,,,,,,,,
1,2,3,DOWNTOWN-FULTON FERRY,29 COMMERCIAL GARAGES,4,54,1,G7,85 JAY STREET,11201,...,988208.0,195011.0,12d,,302 016,30101.0,,17V1.1,1560.0,140132.0
2,3,3,BROOKLYN HEIGHTS,21 OFFICE BUILDINGS,4,204,1,O6,29 COLUMBIA HEIGHTS,11201,...,985952.0,195007.0,12d,,302 004,30106.0,,17V1.1,891.0,34656.0
3,4,3,MILL BASIN,22 STORE BUILDINGS,4,8470,55,K6,5120 AVENUE U,11234,...,1006597.0,161424.0,23b,,319 077,32502.0,,17V1.1,3730.0,797555.0
4,5,3,BROOKLYN HEIGHTS,26 OTHER HOTELS,4,230,1,H8,21 CLARK STREET,11201,...,985622.0,193713.0,12d,,302 014,30106.0,,17V1.1,621.0,21360.0


# Cleaning of the Data

During Part 2 of this project I realized my dataset had repetitive columns, columns that were mostly null values, columns that had a lot of values equal to zero when they shouldn't have been, and columns that carried no information, like "borough" (every sale is in Brooklyn). In order to get the most accurate model possible, I had to do a lot of cleaning before the dataset could be properly analyzed.
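
A quick way to spot these kinds of columns is to compute, for each column, the share of nulls, the share of zeros, and the number of distinct values. The sketch below is only an illustration on a made-up toy frame. The helper name `audit_columns` and its 50% thresholds are assumptions, not something from the original analysis.

```python
import numpy as np
import pandas as pd

def audit_columns(df, null_threshold=0.5, zero_threshold=0.5):
    """Summarize, per column, the share of nulls, the share of zeros, and the unique count."""
    null_frac = df.isna().mean()
    # The zero share is only computed for numeric columns; other columns get NaN.
    zero_frac = (df.select_dtypes("number") == 0).mean().reindex(df.columns)
    report = pd.DataFrame({
        "null_frac": null_frac,
        "zero_frac": zero_frac,
        "n_unique": df.nunique(),  # nulls are not counted as a value
    })
    report["mostly_null"] = report["null_frac"] > null_threshold
    report["mostly_zero"] = report["zero_frac"] > zero_threshold
    report["constant"] = report["n_unique"] <= 1
    return report

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "borough": [3, 3, 3, 3],
    "gross_sqft": [0.0, 0.0, 0.0, 1200.0],
    "EDesigNum": [np.nan, np.nan, np.nan, "E-1"],
    "sale_price": [500000, 0, 750000, 1200000],
})
report = audit_columns(toy)
print(report)
```

Columns flagged as `mostly_null`, `mostly_zero`, or `constant` are candidates for dropping, but each one still deserves a manual look before it is removed.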

### Dropping Unnecessary columns

In [4]:
#Dropping the first column, a leftover row counter from the CSV

brooklyn = brooklyn.drop(brooklyn.columns[0], axis=1)

In [5]:
brooklyn = brooklyn.drop(['borough','address','building_class','ZipCode','Address','BldgClass','YearAlter1','YearAlter2',
                          'BoroCode','ZMCode','EDesigNum','tax_class','OwnerType','OwnerName','YearBuilt','SanitBoro','SanitSub',
                          'Version', 'UnitsTotal','block','lot','SHAPE_Leng','SHAPE_Area','ZoneMap','XCoord','YCoord','CD','CT2010',
                         'CB2010','year_of_sale','Council','FireComp','year_built','HealthCent','HealthArea','SanitDistr','ZoneDist1',
                         'SplitZone','LandUse','Easements','ComArea','ResArea','AreaSource',
                         'NumBldgs','zip_code','UnitsRes','LotDepth','BldgDepth','ProxCode',
                         'IrrLotCode','LotType','BsmtCode','AssessLand','ExemptLand','ExemptTot','BuiltFAR','ResidFAR','CommFAR','FacilFAR',
                          'BBL','Tract2010','Sanborn','TaxMap','building_class_at_sale','neighborhood','building_class_category'], axis = 'columns')

### Disregarding null and nonsensical values

In [6]:
brooklyn = brooklyn.dropna()

In [7]:
brooklyn = brooklyn[brooklyn.sale_price >= 1000]

In [8]:
brooklyn = brooklyn[brooklyn.gross_sqft >= 100.0]

In [9]:
brooklyn = brooklyn[brooklyn.AssessTot >= 500.0]

In [10]:
brooklyn = brooklyn[brooklyn.land_sqft != 0]

In [11]:
brooklyn = brooklyn[brooklyn.total_units != 0]

In [12]:
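#Note: combined with the residential_units filter in the next cell,
#this keeps only properties that have both commercial and residential units

brooklyn = brooklyn[brooklyn.commercial_units != 0]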

In [13]:
brooklyn = brooklyn[brooklyn.residential_units != 0]

In [14]:
brooklyn = brooklyn[brooklyn.LotArea != 0]

In [15]:
brooklyn = brooklyn[brooklyn.BldgArea != 0]

In [16]:
brooklyn = brooklyn[brooklyn.NumFloors != 0]

In [17]:
brooklyn = brooklyn[brooklyn.LotFront != 0]

In [18]:
brooklyn = brooklyn[brooklyn.BldgFront != 0]

In [19]:
brooklyn = brooklyn[brooklyn.tax_class_at_sale != 0]

In [20]:
brooklyn = brooklyn[brooklyn.PolicePrct != 0]
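
The cell-by-cell filters above can also be written as one combined boolean mask. The sketch below shows the idea on a made-up toy frame with only three of the columns. The helper `apply_filters` and the way the thresholds are passed in are assumptions for illustration, not the original code.

```python
import pandas as pd

def apply_filters(df, minimums, nonzero_cols):
    """Keep rows that meet every minimum and are nonzero in every listed column."""
    mask = pd.Series(True, index=df.index)
    for col, lowest in minimums.items():
        mask &= df[col] >= lowest
    for col in nonzero_cols:
        mask &= df[col] != 0
    return df[mask]

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "sale_price": [0, 650000, 900000, 1200000],
    "gross_sqft": [2000.0, 0.0, 1800.0, 2400.0],
    "total_units": [2, 2, 0, 3],
})
cleaned = apply_filters(
    toy,
    minimums={"sale_price": 1000, "gross_sqft": 100.0},
    nonzero_cols=["total_units"],
)
print(cleaned.shape)
```

Keeping the thresholds in one dictionary makes it easier to see at a glance which rules were applied and to change them later.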

### Shape of dataset post-cleaning

In [21]:
brooklyn.shape

(11353, 15)

In [22]:
brooklyn.dtypes

residential_units      int64
commercial_units       int64
total_units            int64
land_sqft            float64
gross_sqft           float64
tax_class_at_sale      int64
sale_price             int64
SchoolDist           float64
PolicePrct           float64
LotArea              float64
BldgArea             float64
NumFloors            float64
LotFront             float64
BldgFront            float64
AssessTot            float64
dtype: object

### Converting datatypes from numerical into categorical values

In [23]:
brooklyn = brooklyn.astype({'tax_class_at_sale':str, 'SchoolDist':str, 'PolicePrct': str})

# Cleaned Linear Regression Model Part 1

After all of the cleaning, here was one of the best models I came up with. I only used numerical values because the addition of categorical values did not actually change the outcome.
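
The notebook does not show how the categorical values were added. One common approach is one-hot encoding with `pd.get_dummies`, sketched below on a made-up toy frame. The column values are invented, and `drop_first=True` is an assumption made here to avoid perfectly collinear dummy columns in a linear model.

```python
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "gross_sqft": [1200.0, 3000.0, 2500.0, 1800.0],
    "tax_class_at_sale": ["1", "2", "4", "2"],
    "SchoolDist": ["13.0", "15.0", "13.0", "22.0"],
    "sale_price": [500000, 900000, 1500000, 700000],
})

# Each category becomes its own 0/1 column; the first level of each
# variable is dropped and acts as the reference category.
X_cat = pd.get_dummies(
    toy.drop("sale_price", axis="columns"),
    columns=["tax_class_at_sale", "SchoolDist"],
    drop_first=True,
)
print(X_cat.columns.tolist())
```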

In [24]:
brooklyn_numeric = brooklyn.select_dtypes(['int64', 'float64'])

In [25]:
#Separating the variables between X and y

X = brooklyn_numeric.drop('sale_price', axis='columns')
y = brooklyn_numeric.loc[:, 'sale_price']

In [26]:
#Setting up a training set and a test set
#test_size = .3 means 30% of the data is set aside for the test set. 70% of the data is used for the training set
#You could also use train_size if you wish

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=30)

In [27]:
#Setting up a linear regression model using the training set

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [28]:
#Scoring the model on the training set and test set.
#These are the R-squared values for the training set and test set. 

print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))

0.39868403947888487
0.129957525673436


### Observations

My model clearly suffered from overfitting. You can tell because of the large discrepancy in R-squared values from the training set to the test set. Because the test set performed worse than the training set, there is evidence of high variance in my model. The only good news is that at least my model performed better than a null model, so it has some predictive value, even if it's very little.
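
The comparison with a null model can be made explicit with scikit-learn's `DummyRegressor`, which always predicts the mean of the training target. The sketch below uses synthetic data rather than the Brooklyn dataset, and it scores both models with 5-fold cross-validation so the result does not depend on a single split. The variable names and the choice of 5 folds are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: two informative features, two pure-noise features.
rng = np.random.default_rng(30)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=2.0, size=300)

# The null model ignores the features and predicts the training mean.
null_scores = cross_val_score(DummyRegressor(strategy="mean"), X_demo, y_demo, cv=5, scoring="r2")
lr_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("null model R-squared per fold:", null_scores)
print("linear model R-squared per fold:", lr_scores)
```

A null model scores at or slightly below zero on held-out data, so any model with a clearly positive test R-squared has at least some predictive value.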

# Random Forest Regression

The next best model I came up with was a random forest regression.

In [29]:
#Separating the variables between X and y

X2 = brooklyn_numeric.drop('sale_price', axis='columns')
y2 = brooklyn_numeric.loc[:, 'sale_price']

In [30]:
#Setting up a training set and a test set
#test_size = .3 means 30% of the data is set aside for the test set. 70% of the data is used for the training set
#You could also use train_size if you wish

from sklearn.model_selection import train_test_split

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = .3, random_state=30)

In [31]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=200)
rfr.fit(X2_train, y2_train)
print(rfr.score(X2_train, y2_train))
print(rfr.score(X2_test, y2_test))

0.8812324475685837
0.4265660936875144


### Observations

At first glance it may seem like I came up with a strongly predictive model, since the R-squared for the training set shows that the model explains 88% of the variance in the training data. However, once you look at the test set, you realize my model once again has a severe overfitting problem. The test R-squared (0.43) is less than half of the training R-squared (0.88). That is not a good model to rely on. The random forest regression did, however, score better on both the training and test sets than the standard linear regression model.
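
One standard way to narrow the gap between training and test scores in a random forest is to stop the trees from memorizing individual rows, for example with `min_samples_leaf`. The sketch below compares a default forest with a constrained one on synthetic data. It is not a result from the Brooklyn dataset, and the value of 20 is an arbitrary assumption that would need tuning.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, noisy data standing in for the real features and target.
rng = np.random.default_rng(30)
X_demo = rng.normal(size=(600, 5))
y_demo = X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(scale=2.0, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=30)

def fit_and_score(model):
    """Fit on the training split and return (train R-squared, test R-squared)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

# Default forest: leaves may hold a single sample, so the trees can fit noise.
default_train, default_test = fit_and_score(
    RandomForestRegressor(n_estimators=100, random_state=30))

# Constrained forest: every leaf must hold at least 20 training samples.
leafy_train, leafy_test = fit_and_score(
    RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=30))

print("default:     train", default_train, "test", default_test)
print("constrained: train", leafy_train, "test", leafy_test)
```

The constrained forest fits the training data less closely. Whether its test score also improves depends on the data, so this is something to check rather than assume.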

# Further Cleaning

Although the best way to reduce the variance in a model is to collect more data, that was not an option here. Instead, the next best thing I could do was make the model less complex. When I computed a correlation matrix for all the variables in my model, I noticed some pairs of feature variables had correlation coefficients of 0.9 or more, which to me signified the dreaded collinearity problem. I dropped the variables I believed suffered from collinearity in order to make my model less complex.
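
The correlation matrix itself is not shown in the notebook. A minimal sketch of how highly correlated pairs can be listed is given below, using a made-up toy frame. The helper `high_corr_pairs` is a hypothetical name, and the 0.9 cutoff mirrors the threshold mentioned above.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(features, threshold=0.9):
    """Return feature pairs whose absolute correlation is at or above the threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy frame (made-up values): LotArea is almost a copy of land_sqft.
rng = np.random.default_rng(30)
land = rng.uniform(1000, 5000, size=200)
toy = pd.DataFrame({
    "land_sqft": land,
    "LotArea": land + rng.normal(scale=50, size=200),
    "NumFloors": rng.integers(1, 6, size=200).astype(float),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

For each flagged pair, one of the two variables can be dropped. The target column should be left out of the frame passed to the helper so that only feature-to-feature correlations are listed.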

In [32]:
new_brooklyn = brooklyn_numeric.drop(['residential_units','total_units','land_sqft','LotArea','gross_sqft'], axis = 'columns')

# Better Linear Regression Model

After a second round of cleaning I ran a standard linear regression model again.

In [33]:
#Separating the variables between X and y

X3 = new_brooklyn.drop('sale_price', axis='columns')
y3 = new_brooklyn.loc[:, 'sale_price']

In [34]:
#Setting up a training set and a test set
#test_size = .3 means 30% of the data is set aside for the test set. 70% of the data is used for the training set
#You could also use train_size if you wish

from sklearn.model_selection import train_test_split

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size = .3, random_state=30)

In [35]:
#Setting up a linear regression model using the training set

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X3_train, y3_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [36]:
#Scoring the model on the training set and test set.
#These are the R-squared values for the training set and test set. 

print(lr.score(X3_train, y3_train))
print(lr.score(X3_test, y3_test))

0.2882759971197526
0.18442172375582888


### Observations

Although my new model still wasn't very predictive, I believe it was much improved over the first linear regression model. For one, the test R-squared value increased. For another, the discrepancy in R-squared values between the training set and the test set was much smaller than on the first try. I still have a high variance problem, but I was happy that it was reduced.

# Random Forest Regression Part 2

I wanted to see if my cleaned-up model would lead to a better predictive result through random forest regression.

In [37]:
#Separating the variables between X and y

X4 = new_brooklyn.drop('sale_price', axis='columns')
y4 = new_brooklyn.loc[:, 'sale_price']

In [38]:
#Setting up a training set and a test set
#test_size = .3 means 30% of the data is set aside for the test set. 70% of the data is used for the training set
#You could also use train_size if you wish

from sklearn.model_selection import train_test_split

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size = .3, random_state=30)

In [39]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=200)
rfr.fit(X4_train, y4_train)
print(rfr.score(X4_train, y4_train))
print(rfr.score(X4_test, y4_test))

0.8660278199645169
0.408682955945329


### Observations

This result really had no discernible difference from the first random forest regression.

# Other Options

In Part 2 I tried several transformations of the data, like log and polynomial transformations, and those did not improve the predictive capabilities of my model at all. The biggest problem I continued to have, besides low R-squared values, was the high variance. I tried to address the overfitting problem by using Ridge Regression and XGBoost.
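
The notebook does not show how the log transformation was applied. One way to do it without manually converting predictions back to dollars is scikit-learn's `TransformedTargetRegressor`, sketched below on synthetic, right-skewed data. The data and variable names are assumptions, and the sketch does not claim the transformation would have helped on the Brooklyn data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed "prices": linear in the features on the log scale.
rng = np.random.default_rng(30)
X_demo = rng.normal(size=(500, 3))
y_demo = np.exp(13.0 + 0.6 * X_demo[:, 0] + rng.normal(scale=0.4, size=500))
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=30)

# The regressor is fit on log1p(y); predictions are mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_tr, y_tr)
pred = log_model.predict(X_te)

# score() reports R-squared on the original (untransformed) price scale.
print(log_model.score(X_te, y_te))
```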

### Ridge Regression

In [40]:
X5 = new_brooklyn.drop('sale_price', axis='columns')
y5 = new_brooklyn.loc[:, 'sale_price']

In [41]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
ridge = Ridge(alpha=100)
X5_scaled = scaler.fit_transform(X5)
ridge.fit(X5_scaled, y5)


Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [42]:
ridge.score(X5_scaled, y5)

0.2723957427753054

### Observations

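I will admit I am not 100% sure I used ridge regression correctly. The code seemed semi-confusing. Numerically, a score of 0.27 is higher than any test score from the standard linear regression models I ran. However, I fit and scored the ridge model on the full dataset without a train/test split, so 0.27 is a training score. It is comparable to the training score of 0.29 from the second linear regression model, not to the test scores, and it does not show that ridge regression generalizes better.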
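
For a score that is comparable to the other models, the ridge model needs the same train/test split, and the scaler should be fit on the training rows only. A minimal sketch using a scikit-learn pipeline is shown below on synthetic data. The data and variable names are assumptions, and `alpha=100` is simply carried over from the cell above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, standing in for the real data.
rng = np.random.default_rng(30)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = 3.0 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(scale=2.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=30)

# The pipeline fits the scaler on the training rows only, then applies
# the same scaling to the test rows before scoring.
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=100))
ridge_pipe.fit(X_tr, y_tr)

train_r2 = ridge_pipe.score(X_tr, y_tr)
test_r2 = ridge_pipe.score(X_te, y_te)
print(train_r2)
print(test_r2)
```

Putting the scaler inside the pipeline keeps information from the test rows out of the scaling step, which is easy to get wrong when `fit_transform` is called on the full dataset.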

### XGBoost

In [43]:
X6 = new_brooklyn.drop('sale_price', axis='columns')
y6 = new_brooklyn.loc[:, 'sale_price']

In [44]:
import xgboost as xgb
xgb_reg = xgb.XGBRegressor()

In [45]:
from sklearn.model_selection import train_test_split

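#No test_size is given here, so scikit-learn's default of 25% is used (the earlier splits used 30%)

X6_train, X6_test, y6_train, y6_test = train_test_split(X6, y6, random_state=30)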

In [46]:
xgb_reg.fit(X6_train, y6_train)


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, importance_type='gain',
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1)

In [47]:
print(xgb_reg.score(X6_train, y6_train))
print(xgb_reg.score(X6_test, y6_test))

0.815403235116115
0.33989316192020724


### Observations

The XGBoost model looked like it suffered the same fate as the random forest regressor model. Both had high R-squared values on the training set, but the R-squared values on the test sets were so much smaller that it's not worth touting the models.

# Final Thoughts

After many different variations, my modeling of the sale price of Brooklyn properties did not end up being very predictive, and it suffered from overfitting and high variance. If I had another week, I would try to learn how to rank feature variables. I know we covered it, but I don't know it well enough to use it. Although the random forest models had higher R-squared values than the linear regression models on both the training and test sets, I still feel more comfortable with the linear regression models because the gap between their training and test scores was much smaller.
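
As a starting point for ranking feature variables, scikit-learn offers `permutation_importance`, which measures how much a fitted model's score drops when one feature's values are shuffled. The sketch below uses synthetic data with column names borrowed from the dataset. The relationship between the features and the target is invented, so the ranking it prints says nothing about the real Brooklyn data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on BldgArea, weakly on
# NumFloors, and not at all on LotFront.
rng = np.random.default_rng(30)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)),
                      columns=["BldgArea", "NumFloors", "LotFront"])
y_demo = 5.0 * X_demo["BldgArea"] + 0.5 * X_demo["NumFloors"] + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=30)

model = RandomForestRegressor(n_estimators=100, random_state=30).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the mean drop in R-squared.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=30)
ranking = pd.Series(result.importances_mean, index=X_te.columns).sort_values(ascending=False)
print(ranking)
```

Computing the importances on the test set rather than the training set keeps the ranking from rewarding features the model has merely memorized.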