[View in Colaboratory](https://colab.research.google.com/github/mad100141/final15388/blob/master/Final.ipynb)

# Real Estate Value Estimation

Our goal for this project is to determine whether the exterior appearance of a building has predictive power for the real estate value of a home.

We started with Boston property assessment data, which contains many features that can indicate the value of a home. To get an accurate picture of real market value, we scraped sale prices from Zillow for a subset of the properties. We then scraped Google Street View images for these properties, extracted 25 features describing qualities of each home, and added those features as predictor variables.

# Boston Data

Our Boston data comes from data.boston.gov. The Boston Assessing Department, charged with determining the value of property in Boston for the purposes of taxation, has helpfully released its information into the public domain. Its assessments cover several property types; we focus on residential properties and use the address, property value, building style, exterior and interior condition, and several other indicators of value. A full list of fields resides [here](https://data.boston.gov/dataset/property-assessment/resource/b8e32ddf-671f-4a35-b99f-c060bae958e5). We decided to focus on the 2018 assessments, acquired [here](https://data.boston.gov/dataset/property-assessment/resource/fd351943-c2c6-4630-992d-3f895360febd), so that our analysis uses the most up-to-date information on Boston residential property.

In [0]:
import pandas as pd
boston = pd.read_csv("ast2018full.csv", dtype = {15:str,60:str,63:str})

In [0]:
boston.head()

In [0]:
boston.LU = boston.LU.astype("category")
boston.R_BLDG_STYL = boston.R_BLDG_STYL.astype("category")

residential_category = ["A","CD","R1","R2","R3","R4","RL"]
# Keep only residential land-use codes; .copy() avoids modifying a view of `boston`
residential = boston[boston.LU.isin(residential_category)].copy()

residential.dropna(subset = ["R_BLDG_STYL"], inplace = True)

residential.drop(labels = ['MAIL_ADDRESSEE', 'MAIL_ADDRESS', 'MAIL CS', 
                           'MAIL_ZIPCODE','PID', 'CM_ID', 'GIS_ID','OWNER',
                           'S_BLDG_STYL', 'S_UNIT_RES', 'S_UNIT_COM',
                           'S_UNIT_RC', 'S_EXT_FIN', 'S_EXT_CND'], 
                 axis = 1, inplace = True)


In [0]:
residential = residential.dropna(axis = 1, how = "all") #removing columns with only NA's
residential = residential.dropna(axis = 0, how = "all") #removing rows with only NA's, there's none

In [0]:
residential["ST_ADDRESS"] = residential['ST_NUM'] + " " + residential['ST_NAME'] + " " + residential['ST_NAME_SUF']

In [0]:
residential = residential.drop_duplicates("ST_ADDRESS")

# Scraping Zillow Data

To scrape our Zillow response variable we used Chris Muir's Zillow scraper, which uses Python and Selenium to scrape current home listings for given search terms. After acquiring it [here](https://github.com/ChrisMuir/Zillow), we edited zillow_runfile.py to change the input search area to the Boston zip codes we wanted to scrape.



In [0]:
zillow = pd.read_csv("../Zillow/2018-05-07_210318.csv")

Before combining our data from Boston and Zillow, we ran several cleaning steps to handle unrealistic sold prices, NA's in crucial predictor columns, inconsistent address formatting, and odd price inputs. We then merged the Boston and Zillow data sets on the addresses that appear in both, adding the sold prices for the properties with full information. We did this partly by hand in Excel and partly with Python commands, so recreating the full data cleaning process is somewhat difficult. Our final CSV file, containing around 2000 properties, is loaded below.

In [0]:
residential_zillow = pd.read_csv("residential_zillow.csv")
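
Since the original cleaning steps were partly manual, the following is only a rough sketch of how the address matching described above could be done in code. The suffix map, the function names, and the sample addresses are all hypothetical; the real rules used to build `residential_zillow.csv` are not recorded in this notebook.

```python
import re

# Hypothetical suffix map (an assumption, not the rules actually used)
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def normalize_address(address):
    """Upper-case, strip punctuation, collapse whitespace, abbreviate suffixes."""
    tokens = re.sub(r"[^\w\s]", "", address.upper()).split()
    return " ".join(SUFFIXES.get(tok, tok) for tok in tokens)

def match_sales(assessed, sold):
    """Return {normalized address: sale price} for addresses found in both inputs.

    assessed: iterable of address strings from the assessment data
    sold: iterable of (address, sale_price) pairs from the scraped listings
    """
    assessed_keys = {normalize_address(a) for a in assessed}
    matched = {}
    for address, price in sold:
        key = normalize_address(address)
        if key in assessed_keys:
            matched[key] = price
    return matched

# Made-up example rows, for illustration only
boston_addresses = ["12 Main ST", "7 Elm AVE"]
zillow_sales = [("12 main street", 650000), ("99 Oak Rd.", 480000)]
print(match_sales(boston_addresses, zillow_sales))
```

Normalizing both sides to the same key before joining is what lets "12 main street" and "12 Main ST" match; anything the normalizer does not cover would still need manual review.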

# Scraping Street View Data
Using the Google Street View API, we acquired images for all the properties in our combined data. Scraping the images took around 3 minutes, but we then had to manually review every image and remove the invalid ones: those where the property was occluded by trees or trucks, or where the capture was otherwise poor because of angle or height.

In [0]:
import os
import urllib.parse
import urllib.request

myloc = r"streetView" #replace with your own location
key = "&key=AIzaSyBYG7d1Nml_Z6emFfRdqSnJj6065HBFekY"
def GetStreet(Add, SaveLoc, count):
    """Download the Street View image for address `Add` to SaveLoc/<count>.jpg."""
    base = "https://maps.googleapis.com/maps/api/streetview?size=640x480&fov=90&location="
    MyUrl = base + urllib.parse.quote_plus(Add) + key  #added url encoding
    fi = str(count) + ".jpg"
    urllib.request.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))

# `names` is the list of address strings to fetch; it is defined outside this excerpt
for count, location in enumerate(names):
    GetStreet(Add=location, SaveLoc=myloc, count=count)
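
Part of the manual filtering might be automated by asking the Street View Static API's metadata endpoint whether imagery exists before downloading. The sketch below only builds the request URL and interprets an already-decoded JSON response; the helper names are hypothetical, and this check would catch addresses with no imagery at all, not occlusion by trees or trucks.

```python
import urllib.parse

METADATA_BASE = "https://maps.googleapis.com/maps/api/streetview/metadata"

def metadata_url(address, api_key):
    """Build the metadata request URL for one address (hypothetical helper)."""
    query = urllib.parse.urlencode({"location": address, "key": api_key})
    return METADATA_BASE + "?" + query

def has_imagery(metadata):
    """metadata: the decoded JSON dict returned by the metadata endpoint."""
    return metadata.get("status") == "OK"

# "YOUR_API_KEY" is a placeholder, not a working key
print(metadata_url("12 Main St, Boston", "YOUR_API_KEY"))
```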

In [0]:
#Example Output
from PIL import Image
import matplotlib.pyplot as plt

im = Image.open("streetView/0.jpg")
plt.imshow(im)

In [0]:
import pickle
with open('streetViewX.pickle', 'rb') as handle:
    X_SV = pickle.load(handle)
with open('streetViewy.pickle', 'rb') as handle:
    y_SV = pickle.load(handle)

# Architectural Style Dataset

# Processing Images

In [0]:
import keras
from keras.applications.vgg19 import VGG19
from keras.applications.vgg19 import preprocess_input
import keras.preprocessing.image as KerasImage
import sys

We used the VGG19 model to process images. The model can be downloaded through Keras with weights pretrained on the ImageNet dataset. You can read more about the model [here](https://arxiv.org/pdf/1409.1556.pdf).

The model was originally trained to classify objects like cats and dogs, which isn't the task at hand. However, we can still use the model to process our images and output useful feature vectors.

We can modify the model relatively easily with Keras, which allows us to strip the classifying layers from the model and specify a custom input image shape.

In [0]:
model = VGG19(include_top=False, input_shape=(800,600,3))

The default model outputs a four-dimensional array, so we have to flatten it into a 1D vector in order to work with the outputs later on.

In [0]:
output = model.get_layer('block5_pool').output
output = keras.layers.Flatten()(output)
model = keras.models.Model(model.input, output)
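
The length of the flattened vector can be checked by hand. VGG19 has five 2x2 max-pooling stages, each of which halves the height and width (rounding down), and its last convolutional block has 512 channels. The small helper below is a hypothetical illustration of that arithmetic rather than part of Keras.

```python
def vgg19_flat_length(height, width, n_pools=5, channels=512):
    """Length of the flattened block5_pool output for a given input size."""
    for _ in range(n_pools):
        height //= 2  # each pooling stage halves the size, rounding down
        width //= 2
    return height * width * channels

print(vgg19_flat_length(800, 600))  # 25 * 18 * 512 = 230400
```

This is where the constant 230400 used when collecting the feature vectors comes from.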

Our architectural style dataset was organized such that each unique style classification had its own folder containing the images which belonged to that class. 

The following code traverses the parent folder, pushes each image through the VGG model, and labels the resulting vectors with the appropriate class.

In [0]:
import os
import numpy as np

# Open the specified folder and get all the subdirectories
dname = "D:\\DataScience\\arcDataset"
sdnames = os.listdir(dname)

# Collect the processed feature vectors and their labels
X_rows = []
y_rows = []

count = 0

for i, sdname in enumerate(sdnames):
    sdpath = os.path.join(dname, sdname)

    # Ignore non-directory files
    if not os.path.isdir(sdpath):
        continue

    for j, fname in enumerate(os.listdir(sdpath)):
        # Ignore non-image files
        try:
            im = KerasImage.load_img(os.path.join(sdpath, fname), target_size=(800,600))
        except Exception:
            continue

        # Print out a message to show progression
        count += 1
        sys.stdout.write("\rsdir: " + str(i) + " im: " + str(j))
        sys.stdout.flush()

        # Format image properly: add a batch dimension
        im = KerasImage.img_to_array(im)
        im = np.expand_dims(im, axis=0)

        # Push image through the model
        im = preprocess_input(im)
        im = model.predict(im)

        # Correct shape of output vector: (1, 230400) -> (230400,)
        im = np.squeeze(im)

        # Add vector to X and label (the subdirectory index) to y
        X_rows.append(im)
        y_rows.append(i)

# Stack the features into a (count, 230400) array
X = np.array(X_rows).reshape(count, 230400)
y = np.array(y_rows)

We went through a similar process for our street view images. The code is nearly identical, just with a few lines commented out because the data was unlabeled and because we did not need to traverse a file structure in the same way.

We've omitted this code for the sake of brevity.

# Logistic Regression

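To turn these feature vectors into architectural style predictions, we trained a logistic regression classifier on the architectural style dataset.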

In [0]:
import pickle

import numpy as np
import pandas as pd
import scipy.stats
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import train_test_split

In [0]:
with open('arcDataX.pickle', 'rb') as handle:
    X = pickle.load(handle)
with open('arcDatay.pickle', 'rb') as handle:
    y = pickle.load(handle)

In [0]:
X_arctr, X_arctest, y_arctr, y_arctest = train_test_split(X, y, test_size=0.3, random_state=5)
X_arctest, X_arcval, y_arctest, y_arcval = train_test_split(X_arctest, y_arctest, test_size=0.5, random_state=5)
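
The two chained splits above first hold out 30% of the data and then cut that held-out part in half. As a quick sanity check, the overall fractions can be worked out as follows; the helper name is made up for illustration.

```python
def chained_split_fractions(first_test_size, second_test_size):
    """Overall (train, test, validation) fractions after two chained splits."""
    train = 1.0 - first_test_size
    held_out = first_test_size
    validation = held_out * second_test_size  # second split's test portion
    test = held_out - validation              # second split's train portion
    return train, test, validation

train_frac, test_frac, val_frac = chained_split_fractions(0.3, 0.5)
print(train_frac, test_frac, val_frac)
```

So the data ends up roughly 70% training, 15% test, and 15% validation.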

In [0]:
model_arc = LR(random_state = 0, dual=False, max_iter=3000)
model_arc.fit(X_arctr, y_arctr)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=3000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [0]:
model_arc.score(X_arctest, y_arctest)

0.702928870292887

# Adding Features to Original Dataset

In [0]:
data = pd.read_csv("residential_zillow.csv")

In [0]:
#indexed with valid images
df_final = data.iloc[y_SV.astype(int)]

In [0]:
X_SVC = model_arc.predict_proba(X_SV)

X_SVCdf = pd.DataFrame(data=X_SVC, index=y_SV);

X_SVCdf.index = X_SVCdf.index.astype(int)

df_final1 = pd.concat([df_final, X_SVCdf], axis=1)

In [0]:
y_final = df_final1['SALE_PRICE']

X_final = df_final1.drop('SALE_PRICE', axis=1)
X_final = X_final.drop(['ST_NUM', 'ST_NAME', 'ST_NAME_SUF', 'AV_LAND', 'AV_BLDG', 'AV_TOTAL', 'GROSS_TAX', 'ST_ADDRESS','PTYPE'], axis=1)

X_final = pd.get_dummies(X_final, columns=['R_BLDG_STYL','LU','OWN_OCC','STRUCTURE_CLASS','R_ROOF_TYP','R_EXT_FIN','R_EXT_CND','R_BTH_STYLE','R_BTH_STYLE2','R_BTH_STYLE3','R_KITCH_STYLE','R_KITCH_STYLE2','R_KITCH_STYLE3','R_HEAT_TYP','R_AC','R_OVRALL_CND','R_INT_CND','R_INT_FIN','R_VIEW'])
X_final = X_final.astype(np.float32).fillna(0)

# X_final = X_final.reset_index().drop('index', axis=1)
# y_final = y_final.reset_index().drop('index', axis=1)

In [0]:
X_final = X_final.drop('Unnamed: 0', axis=1).reset_index(drop=True)
y_final = y_final.reset_index(drop=True)

In [0]:
X_final.assign(Value =y_final.squeeze()).apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)['Value']

ZIPCODE             0.079714
LAND_SF             0.263169
YR_BUILT            0.053118
YR_REMOD            0.060519
GROSS_AREA          0.301567
LIVING_AREA         0.303270
NUM_FLOORS          0.072153
R_TOTAL_RMS        -0.009428
R_BDRMS            -0.014479
R_FULL_BTH          0.013468
R_HALF_BTH          0.005019
R_KITCH            -0.055054
R_FPLACE            0.078625
0                   0.371457
1                   0.371457
2                   0.371457
3                   0.371457
4                   0.371457
5                   0.371457
6                   0.371457
7                   0.371457
8                   0.371457
9                   0.371457
10                  0.371457
11                  0.371457
12                  0.371457
13                  0.371457
14                  0.371457
15                  0.371457
16                  0.371457
                      ...   
R_KITCH_STYLE3_N   -0.034522
R_KITCH_STYLE3_S   -0.056608
R_HEAT_TYP_E       -0.022175
R_HEAT_TYP_F  

In [0]:
X_finaltr, X_finaltest, y_finaltr, y_finaltest = train_test_split(X_final, y_final, test_size=0.3, random_state=5)
X_finaltest, X_finalval, y_finaltest, y_finalval = train_test_split(X_finaltest, y_finaltest, test_size=0.5, random_state=5)

## Cross-Validating Hyperparameters

In [0]:
to_try = [2,3,1]
numTrials = 3
best = 0
bestVal = 0
for val in to_try: 
    score = 0
    for i in range(numTrials):
        # NOTE: `val` is not passed to the model here, so every candidate fits the
        # same configuration; substitute `val` for the hyperparameter under test.
        GBR = GradientBoostingRegressor(n_estimators = 2000, loss='quantile', learning_rate=.014, max_depth=2,\
                                    subsample=.99,alpha=.89)
        GBR.fit(X_finaltr, y_finaltr.squeeze())
        curScore = GBR.score(X_finalval, y_finalval.squeeze())
        score += curScore
        print("testing " + str(val) + " - score: " + str(curScore))
    score /= numTrials  # average over the trials
    print("avg for " + str(val) + ": " + str(score))
    if score > best: 
        best = score
        bestVal = val
print("best choice is: " + str(bestVal))

testing 2 - score: 0.563488374663454
testing 2 - score: 0.5497977215495489
testing 2 - score: 0.564611005381296
avg for 2: 0.5592990338647663
testing 3 - score: 0.550701541561935
testing 3 - score: 0.5742114205959534
testing 3 - score: 0.5532390351072125
avg for 3: 0.559383999088367
testing 1 - score: 0.5656458563598071
testing 1 - score: 0.5625804955060273


KeyboardInterrupt: 
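
The selection loop above can be written more generally as a small helper that averages a validation score over several trials per candidate. This is a sketch under assumptions: `select_best` and `toy_score` are hypothetical names, and in the real notebook the scoring function would fit a fresh `GradientBoostingRegressor` with the candidate value plugged into whichever hyperparameter is being tuned, then return its score on the validation split.

```python
import random
from statistics import mean

def select_best(candidates, score_fn, num_trials=3):
    """Average score_fn(value) over num_trials and return the best candidate.

    score_fn(value) is assumed to fit a fresh model using `value`
    and return its validation score (higher is better).
    """
    averages = {}
    for value in candidates:
        averages[value] = mean(score_fn(value) for _ in range(num_trials))
    best_value = max(averages, key=averages.get)
    return best_value, averages

# Toy stand-in for model fitting: peaks at 2, with a little seeded noise
_rng = random.Random(0)
def toy_score(value):
    return 1.0 - 0.1 * abs(value - 2) + _rng.uniform(-0.01, 0.01)

best, averages = select_best([2, 3, 1], toy_score)
print("best choice is:", best)
```

Passing the candidate value into the scoring function is what makes the comparison meaningful; if it is ignored, the averages differ only by random noise.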

In [0]:
GBR = GradientBoostingRegressor(n_estimators = 2000, loss='quantile', learning_rate=.014, max_depth=2,\
                            subsample=.99,alpha=.89)
GBR.fit(X_finaltr, y_finaltr.squeeze())
GBR.score(X_finalval, y_finalval.squeeze())

0.7901517153918347
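
For scikit-learn regressors, `score` returns the coefficient of determination R², so the value above is a share of variance explained rather than an accuracy. A minimal hand computation of the same quantity, using made-up numbers, looks like this:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

prices = [1.0, 2.0, 3.0, 4.0]             # illustrative values only
print(r_squared(prices, prices))          # perfect predictions
print(r_squared(prices, [2.5] * 4))       # always predicting the mean
```

A model that always predicts the mean scores 0, and a model can score below 0 if it does worse than that.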

In [0]:
X_nSV = X_final.drop(labels=list(range(25)), axis=1)
y_nSV = y_final

In [0]:
X_nSVtr, X_nSVtest, y_nSVtr, y_nSVtest = train_test_split(X_nSV, y_nSV, test_size=0.3, random_state=5)
X_nSVtest, X_nSVval, y_nSVtest, y_nSVval = train_test_split(X_nSVtest, y_nSVtest, test_size=0.5, random_state=5)

In [0]:
GBRnSV = GradientBoostingRegressor(n_estimators = 2000, loss='quantile', learning_rate=.014, max_depth=2,\
                                    subsample=.99,alpha=.89)
GBRnSV.fit(X_nSVtr, y_nSVtr)

GBRnSV.score(X_nSVval, y_nSVval.squeeze())

0.36443930091932547

In [0]:
GBR.score(X_finalval, y_finalval.squeeze())

0.7901517153918347


In [0]:
print(scipy.stats.kruskal(GBRnSV.predict(X_nSVtest), y_nSVtest))
print(scipy.stats.kruskal(GBR.predict(X_finaltest),y_finaltest))
print(scipy.stats.kruskal(GBR.predict(X_finaltest),GBRnSV.predict(X_nSVtest)))

KruskalResult(statistic=94.59606977209059, pvalue=2.3348143801153684e-22)
KruskalResult(statistic=98.20801470522338, pvalue=3.7666239843010957e-23)
KruskalResult(statistic=0.031638759587793386, pvalue=0.8588228138954208)


In [0]:
scipy.stats.kruskal((GBRnSV.predict(X_nSVtest), y_nSVtest),(y_finaltest,y_finaltest))

KruskalResult(statistic=876810.1861107722, pvalue=0.0)