# Predicting Home Value
After finding this dataset, our team changed the project scope from predicting home features to predicting home prices using several statistical approaches. Together we will try multiple algorithms and explore their utility in predicting home prices. Across these algorithms we will use common metrics, such as MSE, to compare the relative success of each model.

## Our Scope
A real estate investment firm has tasked our Group1 consulting team with developing a model that predicts home prices from a given set of parameters. Location is widely regarded as the biggest indicator of home prices, but our team will use a combination of other home features to estimate the value of a home.

## Our Data
We will be using a publicly available dataset from Kaggle. The dataset contains house listings for Austin, TX. It was scraped in January 2021 and is highly ranked on Kaggle for being clean and usable. Below is the link to the dataset.
https://www.kaggle.com/datasets/ericpierce/austinhousingprices?resource=download

# Getting Familiar with the Dataset

In [1]:
# import pandas for EDA
import pandas as pd

In [2]:
file_path = 'austinHousingData.csv'
df = pd.read_csv(file_path)
df.head(3)

Unnamed: 0,zpid,city,streetAddress,zipcode,description,latitude,longitude,propertyTaxRate,garageSpaces,hasAssociation,...,numOfMiddleSchools,numOfHighSchools,avgSchoolDistance,avgSchoolRating,avgSchoolSize,MedianStudentsPerTeacher,numOfBathrooms,numOfBedrooms,numOfStories,homeImage
0,111373431,pflugerville,14424 Lake Victor Dr,78660,"14424 Lake Victor Dr, Pflugerville, TX 78660 i...",30.430632,-97.663078,1.98,2,True,...,1,1,1.266667,2.666667,1063,14,3.0,4,2,111373431_ffce26843283d3365c11d81b8e6bdc6f-p_f...
1,120900430,pflugerville,1104 Strickling Dr,78660,Absolutely GORGEOUS 4 Bedroom home with 2 full...,30.432672,-97.661697,1.98,2,True,...,1,1,1.4,2.666667,1063,14,2.0,4,1,120900430_8255c127be8dcf0a1a18b7563d987088-p_f...
2,2084491383,pflugerville,1408 Fort Dessau Rd,78660,Under construction - estimated completion in A...,30.409748,-97.639771,1.98,0,True,...,1,1,1.2,3.0,1108,14,2.0,3,1,2084491383_a2ad649e1a7a098111dcea084a11c855-p_...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15171 entries, 0 to 15170
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   zpid                        15171 non-null  int64  
 1   city                        15171 non-null  object 
 2   streetAddress               15171 non-null  object 
 3   zipcode                     15171 non-null  int64  
 4   description                 15171 non-null  object 
 5   latitude                    15171 non-null  float64
 6   longitude                   15171 non-null  float64
 7   propertyTaxRate             15171 non-null  float64
 8   garageSpaces                15171 non-null  int64  
 9   hasAssociation              15171 non-null  bool   
 10  hasCooling                  15171 non-null  bool   
 11  hasGarage                   15171 non-null  bool   
 12  hasHeating                  15171 non-null  bool   
 13  hasSpa                      151

As the initial info() output shows, the dataset is very clean and usable. Little pre-processing is needed because there are no null values and the majority of features are already in a usable format.

### Dropping non-numeric columns and converting booleans
The only pre-processing we need to do is drop the columns that are not numeric or boolean, such as city, streetAddress, and description. We still have longitude and latitude, so location remains in the dataset. We also want to change the True and False values to 1's and 0's to make the entire dataset numerical.

In [4]:
# Dropping the columns that hold strings
col_drop_list = ['city', 'streetAddress', 'description', 'homeType', 'latest_saledate', 'latestPriceSource', 'homeImage']
df = df.drop(col_drop_list, axis=1)

df.shape

(15171, 40)

In [5]:
#Changing bool to int
col_bool_list = ['hasAssociation', 'hasCooling', 
                 'hasGarage', 'hasHeating', 'hasSpa', 'hasView']

for col in col_bool_list:
    name = col + '_int'
    df[name] = df[col].astype(int)

df.shape

(15171, 46)

In [6]:
df.hasAssociation_int.value_counts()

1    8007
0    7164
Name: hasAssociation_int, dtype: int64

In [7]:
#Drop the bool columns
df = df.drop(col_bool_list, axis=1)
df.shape

(15171, 40)
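
The drop-and-convert steps above could also be written without listing column names by hand. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`, with invented values) in place of the real Austin data. It keeps the numeric and boolean columns with `select_dtypes` and casts the booleans to integers. Unlike the cells above, it keeps the original column names instead of adding an `_int` suffix.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "city": ["austin", "pflugerville"],
    "latitude": [30.43, 30.41],
    "garageSpaces": [2, 0],
    "hasSpa": [True, False],
})

# Keep only numeric and boolean columns; string columns are left out
numeric = toy.select_dtypes(include=["number", "bool"])

# Cast every boolean column to 0/1 integers
bool_cols = numeric.select_dtypes(include="bool").columns
numeric = numeric.astype({col: int for col in bool_cols})

print(numeric.dtypes)
```

Selecting by dtype avoids maintaining two hand-written column lists, at the cost of being less explicit about which columns are removed.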

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15171 entries, 0 to 15170
Data columns (total 40 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   zpid                        15171 non-null  int64  
 1   zipcode                     15171 non-null  int64  
 2   latitude                    15171 non-null  float64
 3   longitude                   15171 non-null  float64
 4   propertyTaxRate             15171 non-null  float64
 5   garageSpaces                15171 non-null  int64  
 6   parkingSpaces               15171 non-null  int64  
 7   yearBuilt                   15171 non-null  int64  
 8   latestPrice                 15171 non-null  int64  
 9   numPriceChanges             15171 non-null  int64  
 10  latest_salemonth            15171 non-null  int64  
 11  latest_saleyear             15171 non-null  int64  
 12  numOfPhotos                 15171 non-null  int64  
 13  numOfAccessibilityFeatures  151

Now the data should be cleaned, all numeric, and ready to be used in the analysis.

## Analysis
In this linear regression model we will use all of the features.

### Creating testing and training data
In this next step we will be creating the testing and training data for our algorithm. 

In [9]:
# Getting column names
column_names=df.columns.tolist()
type(column_names)

list

In [11]:
# Removing the dependent variable 'latestPrice' from the list of feature names
column_names.remove('latestPrice')

In [12]:
# Independent values
X = df[column_names]

# Dependent values
y = df['latestPrice']

In [13]:
# Importing train_test_split to create the testing and training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# Comparing size of X sets
print(X.shape, X_train.shape, X_test.shape)

(15171, 39) (12136, 39) (3035, 39)
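
The split sizes printed above can be checked with a few lines of arithmetic. This is a sketch under one assumption, marked in the code: that the test count is rounded up and the training set receives the remaining rows.

```python
import math

n_samples = 15171   # rows in the cleaned dataset (from df.shape)
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the test count is rounded up and training gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the counts are 12136 and 3035, which agrees with the shapes of X_train and X_test.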


### Linear regression with StatsModels

In [15]:
import statsmodels.api as sm

In [16]:
# Creating the model
model = sm.OLS(endog=y_train, exog=sm.add_constant(X_train))

In [17]:
# Get model results
results = model.fit()

In [18]:
# Display results summary
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:            latestPrice   R-squared:                       0.443
Model:                            OLS   Adj. R-squared:                  0.442
Method:                 Least Squares   F-statistic:                     247.1
Date:                Tue, 25 Jul 2023   Prob (F-statistic):               0.00
Time:                        15:14:59   Log-Likelihood:            -1.7149e+05
No. Observations:               12136   AIC:                         3.431e+05
Df Residuals:                   12096   BIC:                         3.434e+05
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const               

#### The coefficient of determination (R-squared) is 0.443, which is low; values closer to 1 indicate a better fit
This model explains only about 44% of the variance in latestPrice on the training data.
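
To make the meaning of R-squared concrete, here is a small sketch that computes it by hand as one minus the ratio of unexplained to total variation. The prices and predictions are made-up values for illustration only and are not taken from the Austin dataset.

```python
# Hypothetical observed prices and model predictions (made-up values)
y_true = [300000.0, 450000.0, 500000.0, 750000.0]
y_pred = [320000.0, 430000.0, 560000.0, 690000.0]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: variation the model fails to explain
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))

# Total sum of squares: variation of the prices around their mean
ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R-squared of 0.443 therefore means the residual sum of squares is still about 56% of the total sum of squares.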

### Evaluation

In [19]:
# Get prediction with X_train using the model
y_train_pred = results.predict(sm.add_constant(X_train))
y_train_pred.shape

(12136,)

In [20]:
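# RMSE (squared=False returns the square root of the MSE; recent scikit-learn
# releases provide root_mean_squared_error for this instead)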
from sklearn.metrics import mean_squared_error
RMSE_train=mean_squared_error(y_train, y_train_pred, squared=False)
RMSE_train

331574.1996992208

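On the training data, our predictions are typically off by roughly 331,574 dollars. RMSE weights large errors more heavily than small ones, so a few badly mispriced homes can inflate it.

That is a large error for a home price model.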
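
As a sketch of what the RMSE measures, the snippet below computes it by hand and compares it with the mean absolute error. The values are made up for illustration and are not taken from the model above.

```python
import math

# Hypothetical observed and predicted prices in dollars (made-up values)
y_true = [300000.0, 450000.0, 500000.0, 750000.0]
y_pred = [320000.0, 430000.0, 560000.0, 690000.0]

errors = [yt - yp for yt, yp in zip(y_true, y_pred)]

# Mean squared error, then its square root to get back to dollars
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

# Mean absolute error for comparison; it weights all errors equally
mae = sum(abs(e) for e in errors) / len(errors)

print(round(rmse, 2), mae)
```

The RMSE comes out larger than the mean absolute error here because squaring gives the two 60,000-dollar misses more weight than the two 20,000-dollar ones.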

In [21]:
# Get predictions for X_test using the model
y_test_pred = results.predict(sm.add_constant(X_test))

In [22]:
RMSE_test=mean_squared_error(y_test, y_test_pred, squared=False)
RMSE_test

375425.40475539543

In [23]:
# Gap between test and train RMSE as a percentage of the test RMSE;
# a small gap suggests the model performs consistently on unseen data
diff = (RMSE_test - RMSE_train)/ RMSE_test *100
diff

11.680404282908201
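
The gap above can be reproduced directly from the two reported RMSE values. The sketch below also expresses it relative to the training RMSE, a choice of denominator that is our assumption and not part of the original analysis.

```python
rmse_train = 331574.1996992208   # training RMSE reported above
rmse_test = 375425.40475539543   # test RMSE reported above

# Gap relative to the test error (as computed in the notebook)
gap_vs_test = (rmse_test - rmse_train) / rmse_test * 100

# Gap relative to the training error (alternative denominator, our assumption)
gap_vs_train = (rmse_test - rmse_train) / rmse_train * 100

print(round(gap_vs_test, 2), round(gap_vs_train, 2))
```

Either way, the test error is more than 10% higher than the training error, so the model does somewhat worse on unseen homes than on the data it was fitted to.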