# Predicting Home Value
Because of the dataset we found, our team has changed its scope from predicting home features to predicting home prices using different statistical approaches. Together we will try multiple algorithms through trial and error and explore their utility in predicting home prices. Across these algorithms we will use a common set of metrics, such as mean squared error (MSE), to compare the relative success of each model.
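
Since MSE is the metric we will use to compare models, here is a minimal sketch of how it is computed. The helper name `mean_squared_error` and the prices below are made up for illustration and do not come from the dataset.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical prices, for illustration only
actual = [300_000, 450_000, 500_000]
predicted = [310_000, 440_000, 520_000]

mse = mean_squared_error(actual, predicted)
rmse = mse ** 0.5  # the square root puts the error back in dollars
print(mse, rmse)
```

Because MSE is expressed in squared dollars, the square root (RMSE) is often easier to read next to actual home prices.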

## Our Scope
A real estate investment firm has tasked our Group1 consulting team with developing a model to predict home prices based on a set of given parameters. We know location is typically the biggest indicator of home prices, but our team will use a combination of other home features to estimate the value of a home.

## Our Data
We will be using a publicly available dataset from Kaggle that contains house listings for Austin, TX. It was scraped in January 2021 and is highly ranked on Kaggle for being clean and usable. Below is the link to the dataset.
https://www.kaggle.com/datasets/ericpierce/austinhousingprices?resource=download

# Getting Familiar with the Dataset

In [4]:
# import pandas for EDA
import pandas as pd

In [5]:
file_path = 'data/austinHousingData.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,zpid,city,streetAddress,zipcode,description,latitude,longitude,propertyTaxRate,garageSpaces,hasAssociation,...,numOfMiddleSchools,numOfHighSchools,avgSchoolDistance,avgSchoolRating,avgSchoolSize,MedianStudentsPerTeacher,numOfBathrooms,numOfBedrooms,numOfStories,homeImage
0,111373431,pflugerville,14424 Lake Victor Dr,78660,"14424 Lake Victor Dr, Pflugerville, TX 78660 i...",30.430632,-97.663078,1.98,2,True,...,1,1,1.266667,2.666667,1063,14,3.0,4,2,111373431_ffce26843283d3365c11d81b8e6bdc6f-p_f...
1,120900430,pflugerville,1104 Strickling Dr,78660,Absolutely GORGEOUS 4 Bedroom home with 2 full...,30.432672,-97.661697,1.98,2,True,...,1,1,1.4,2.666667,1063,14,2.0,4,1,120900430_8255c127be8dcf0a1a18b7563d987088-p_f...
2,2084491383,pflugerville,1408 Fort Dessau Rd,78660,Under construction - estimated completion in A...,30.409748,-97.639771,1.98,0,True,...,1,1,1.2,3.0,1108,14,2.0,3,1,2084491383_a2ad649e1a7a098111dcea084a11c855-p_...
3,120901374,pflugerville,1025 Strickling Dr,78660,Absolutely darling one story home in charming ...,30.432112,-97.661659,1.98,2,True,...,1,1,1.4,2.666667,1063,14,2.0,3,1,120901374_b469367a619da85b1f5ceb69b675d88e-p_f...
4,60134862,pflugerville,15005 Donna Jane Loop,78660,Brimming with appeal & warm livability! Sleek ...,30.437368,-97.65686,1.98,0,True,...,1,1,1.133333,4.0,1223,14,3.0,3,2,60134862_b1a48a3df3f111e005bb913873e98ce2-p_f.jpg


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15171 entries, 0 to 15170
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   zpid                        15171 non-null  int64  
 1   city                        15171 non-null  object 
 2   streetAddress               15171 non-null  object 
 3   zipcode                     15171 non-null  int64  
 4   description                 15171 non-null  object 
 5   latitude                    15171 non-null  float64
 6   longitude                   15171 non-null  float64
 7   propertyTaxRate             15171 non-null  float64
 8   garageSpaces                15171 non-null  int64  
 9   hasAssociation              15171 non-null  bool   
 10  hasCooling                  15171 non-null  bool   
 11  hasGarage                   15171 non-null  bool   
 12  hasHeating                  15171 non-null  bool   
 13  hasSpa                      151

As the initial info() output shows, the dataset is clean and usable. Little pre-processing is needed because there are no null values and the majority of features are already in a usable format.

### Dropping string columns and converting booleans
The only pre-processing we need to do is to drop the string columns city, streetAddress, and description. We still have longitude and latitude, so location is still represented in the dataset. We also want to change the True and False values to 1s and 0s so that the boolean columns become numerical.

In [7]:
# Dropping the columns that are strings
col_drop_list = ['city', 'streetAddress', 'description']
df = df.drop(col_drop_list, axis=1)

df.shape

(15171, 44)

In [8]:
#Changing bool to int
col_bool_list = ['hasAssociation', 'hasCooling', 
                 'hasGarage', 'hasHeating', 'hasSpa', 'hasView']

for col in col_bool_list:
    name = col + '_int'
    df[name] = df[col].astype(int)

df.shape

(15171, 50)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15171 entries, 0 to 15170
Data columns (total 50 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   zpid                        15171 non-null  int64  
 1   zipcode                     15171 non-null  int64  
 2   latitude                    15171 non-null  float64
 3   longitude                   15171 non-null  float64
 4   propertyTaxRate             15171 non-null  float64
 5   garageSpaces                15171 non-null  int64  
 6   hasAssociation              15171 non-null  bool   
 7   hasCooling                  15171 non-null  bool   
 8   hasGarage                   15171 non-null  bool   
 9   hasHeating                  15171 non-null  bool   
 10  hasSpa                      15171 non-null  bool   
 11  hasView                     15171 non-null  bool   
 12  homeType                    15171 non-null  object 
 13  parkingSpaces               151

In [10]:
df.hasAssociation_int.value_counts()

1    8007
0    7164
Name: hasAssociation_int, dtype: int64

In [11]:
#Drop the bool columns
df = df.drop(col_bool_list, axis=1)
df.shape

(15171, 44)
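
The convert-then-drop steps above could also be done in a single pass by letting pandas find the boolean columns. This is a minimal sketch on a small made-up frame (`toy` is hypothetical, with column names borrowed from the dataset), not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the listings data
toy = pd.DataFrame({
    "hasGarage": [True, False, True],
    "hasSpa": [False, False, True],
    "garageSpaces": [2, 0, 1],
})

# Find every bool column instead of listing them by hand
bool_cols = toy.select_dtypes(include="bool").columns

# Cast those columns to int in place of the originals (True -> 1, False -> 0)
toy = toy.astype({col: int for col in bool_cols})

print(toy.dtypes)
```

This keeps the original column names, so there is no need for the `_int` suffix or a separate drop step.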

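Now the data should be cleaned and ready to be used in the analysis. Note that homeType is still an object (string) column, as the info() output above shows, so the dataset is not yet entirely numeric.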

### Creating testing and training data
In this next step we will create the training and testing data for our algorithm, using 70% of the rows for training and holding out the remaining 30% for testing.

In [12]:
#Creating the initial testing and training datasets
train_df = df.sample(frac=.7, random_state=1)
test_df = df.drop(train_df.index)

print(train_df.shape, test_df.shape)

(10620, 44) (4551, 44)
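
An alternative to the `sample`/`drop` split above is scikit-learn's `train_test_split`, which can separate the features and the target in one call. The sketch below uses a small made-up frame (`toy` is hypothetical); on the real data the exact row counts may differ slightly from the output above because the two methods round the 70/30 split differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same target column name as the dataset
toy = pd.DataFrame({
    "garageSpaces": [2, 0, 1, 2, 0, 3, 1, 2, 0, 1],
    "yearBuilt": [1988, 2005, 1972, 2015, 1999, 2010, 1965, 1980, 2001, 1995],
    "latestPrice": [300, 450, 280, 600, 390, 700, 250, 310, 420, 360],
})

X = toy.drop(columns="latestPrice")  # features
y = toy["latestPrice"]               # target

# Hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```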


In [13]:
#Create the target dataset
X_train = train_df.copy()
X_test = test_df.copy()

y_train = X_train.pop('latestPrice')
y_test = X_test.pop('latestPrice')

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(10620, 43) (4551, 43) (10620,) (4551,)


In [14]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [15]:
print(tf.__version__)

2.3.1


## Normalizing the data

In [16]:
train_df.describe().transpose()[['mean', 'std']]

Unnamed: 0,mean,std
zpid,101064300.0,306911500.0
zipcode,78735.8,19.10261
latitude,30.29167,0.09733537
longitude,-97.77896,0.08497445
propertyTaxRate,1.994417,0.05366666
garageSpaces,1.219115,1.341673
parkingSpaces,1.213559,1.342534
yearBuilt,1988.541,21.65107
latestPrice,509382.9,447250.7
numPriceChanges,3.03484,2.508205


In [24]:
from sklearn.preprocessing import Normalizer

In [26]:
norm = Normalizer().fit(train_df)

ValueError: could not convert string to float: 'Single Family'
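
The error above comes from the homeType column, which still holds strings such as 'Single Family'. Note also that scikit-learn's `Normalizer` rescales each row to unit norm, while `StandardScaler` standardizes each column to zero mean and unit variance, which is probably closer to what the mean/std table above is aiming at. The following is a minimal sketch of that alternative on a made-up frame (`toy_train` and the 'Condo' and 'Townhouse' values are hypothetical), assuming we only scale the numeric columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: one string column plus two numeric columns
toy_train = pd.DataFrame({
    "homeType": ["Single Family", "Condo", "Single Family", "Townhouse"],
    "yearBuilt": [1988, 2005, 1972, 2015],
    "garageSpaces": [2, 0, 1, 2],
})

# Keep only numeric columns so the scaler never sees strings
numeric_cols = toy_train.select_dtypes(include="number").columns

# Fit on the training data only, then reuse the same scaler on test data
scaler = StandardScaler().fit(toy_train[numeric_cols])
scaled = pd.DataFrame(
    scaler.transform(toy_train[numeric_cols]),
    columns=numeric_cols,
    index=toy_train.index,
)

print(scaled.describe().transpose()[["mean", "std"]])
```

In the notebook it would likely make sense to fit the scaler on `X_train` rather than `train_df`, so that the target latestPrice is not scaled along with the features.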

## Come back to normalization after running the initial model
Here we will create a deep neural network to try to build a price predictor.