## Train and Test Model

In this notebook, we train and test the model that we developed from our EDA to verify that it gives similar results for different samples of the same data.

In [1]:
#imports
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

In [2]:
#load the cleaned data from the Data Cleaning notebook and verify that all columns are numeric
df_clean = pd.read_csv('data/cleaned_kc_house_data.csv')
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21595 entries, 0 to 21594
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   price              21595 non-null  float64
 1   bedrooms           21595 non-null  int64  
 2   bathrooms          21595 non-null  float64
 3   sqft_living        21595 non-null  int64  
 4   sqft_lot           21595 non-null  int64  
 5   floors             21595 non-null  float64
 6   sqft_above         21595 non-null  int64  
 7   yr_built           21595 non-null  int64  
 8   yr_renovated       21595 non-null  float64
 9   zipcode            21595 non-null  int64  
 10  lat                21595 non-null  float64
 11  long               21595 non-null  float64
 12  sqft_living15      21595 non-null  int64  
 13  sqft_lot15         21595 non-null  int64  
 14  sqft_basment_calc  21595 non-null  int64  
 15  grades             21595 non-null  float64
 16  waterfront         215

In [3]:
#assign the predictor columns for the linear regression to X and the target (price) to y
relevant_columns = [
    'sqft_living',
    'grades',
    'condition',     
    'waterfront',      
    'sqft_basment_calc',  
    'sqft_lot15', 
    'sqft_living15',    
    'long', 
    'lat',   
    'yr_renovated',     
    'yr_built',  
    'sqft_lot', 
    'floors',
    'bathrooms',
    'bedrooms',      
]

y = df_clean['price']
X = df_clean.loc[:, relevant_columns]

In [4]:
#randomly split the data into train (75%) and test (25%) sets, the train_test_split default
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [5]:
# Verify that the shapes of the training data are suitable for modeling
print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns")
print(f"y_train is a Series with {y_train.shape[0]} values")

# X should always have the same number of rows as y has values
assert X_train.shape[0] == y_train.shape[0]

X_train is a DataFrame with 16196 rows and 15 columns
y_train is a Series with 16196 values
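
The row count above follows from `train_test_split` holding out 25% of the rows when no `test_size` is given. A minimal sketch of that arithmetic, using placeholder lists in place of the real features and target (an assumption made only to keep the example self-contained):

```python
from sklearn.model_selection import train_test_split

n_rows = 21595  # number of rows reported by df_clean.info()

# Placeholder features and target; only their lengths matter here
X_demo = list(range(n_rows))
y_demo = list(range(n_rows))

# With no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))
```

Passing `test_size` explicitly would make the split proportion visible in the notebook rather than relying on the default.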


In [6]:
#fit a linear regression on the training data and cross-validate it (3 folds, one r-squared value per fold)
model = LinearRegression()
model.fit(X_train, y_train)

cross_val_score(model, X_train, y_train, cv=3)

array([0.67768467, 0.69474781, 0.68082473])
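
Because `cross_val_score` returns one r-squared value per fold, it can help to summarize the folds as a mean and a spread. A small sketch using the three scores printed above (the variable names are illustrative, not part of the original notebook):

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above
cv_scores = [0.67768467, 0.69474781, 0.68082473]

cv_mean = mean(cv_scores)
cv_spread = stdev(cv_scores)  # sample standard deviation across folds

print(f"mean r-squared: {cv_mean:.3f}, std: {cv_spread:.3f}")
```

A small spread relative to the mean suggests the fit does not depend much on which rows land in which fold.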

In [7]:
#refit the same linear regression on the test split and score it on that split (in-sample r-squared for a second sample)
model.fit(X_test, y_test)
model.score(X_test, y_test)

0.6893386927222527

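The r-squared values for the two samples are similar: the three cross-validation folds on the training split score between 0.678 and 0.695, and the model refit on the test split scores 0.689, which falls within that range. Note that the test figure is an in-sample fit on the test split rather than a held-out prediction score, so the comparison shows that the same predictors explain a similar share of the variance in price in each sample. Because the model behaves consistently across different samples of the dataset, we feel confident using it to explain variance in price with the chosen independent variables.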
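
A complementary check, not part of the original analysis, is to score the model fit on the training split directly on the test split without refitting it. The sketch below shows the pattern on synthetic data from `make_regression` (a stand-in, since the housing CSV is not loaded here); on the real data it would amount to calling `model.score(X_test, y_test)` right after `model.fit(X_train, y_train)`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=1000, n_features=5, noise=20.0, random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Fit once, on the training split only
held_out_model = LinearRegression()
held_out_model.fit(X_tr, y_tr)

# Compare the in-sample score with the score on rows the model has not seen
train_r2 = held_out_model.score(X_tr, y_tr)
test_r2 = held_out_model.score(X_te, y_te)

print(f"train r-squared: {train_r2:.3f}")
print(f"held-out test r-squared: {test_r2:.3f}")
```

A held-out score close to the training score would indicate that the fitted coefficients generalize, which is a stronger statement than two separate fits having similar r-squared values.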

**Random Forest Regression**

We suspect that a Random Forest regression may predict price better than a linear regression on this dataset. Factors such as location (lat and long) clearly affect price but are not linearly related to it. We therefore run a Random Forest regression as an experiment, using every column of the cleaned data other than price as a predictor.
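
To illustrate why a location effect can defeat a linear model, the sketch below builds hypothetical coordinates where the target is highest near a central point and falls off with distance. The data, names, and noise level are all assumptions chosen for the illustration and are unrelated to the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical coordinates: the target peaks at (0, 0) and drops with distance
coords = rng.uniform(-1.0, 1.0, size=(2000, 2))
distance = np.sqrt((coords ** 2).sum(axis=1))
value = 1.0 - distance + rng.normal(0.0, 0.05, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(coords, value, random_state=42)

# A straight-line fit in each coordinate cannot represent a central peak
linear_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# A forest can split the plane into regions with different typical values
forest = RandomForestRegressor(n_estimators=15, random_state=42)
forest_r2 = forest.fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear r-squared: {linear_r2:.2f}")
print(f"random forest r-squared: {forest_r2:.2f}")
```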

In [8]:
#use every column except price as a predictor
y = df_clean['price']
X = df_clean.drop('price', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#fit a random forest of 15 trees on the training split
rf = RandomForestRegressor(n_estimators=15)
rf.fit(X_train, y_train)

#r-squared on the training split and on the held-out test split
print("Training set score: {:.2f}".format(rf.score(X_train, y_train)))
print("Test set score: {:.2f}".format(rf.score(X_test, y_test)))

Training set score: 0.98
Test set score: 0.88


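The Random Forest regression explains much more of the variance in price (an r-squared of 0.88 on the test set, against roughly 0.69 for the linear model), although the gap between its training score (0.98) and its test score (0.88) suggests some overfitting. We opt not to use it because this project is based on using linear regression to build explanatory models. In a different context, however, we would use a Random Forest regression to better understand the factors contributing to price.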
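
One way a Random Forest can be used to examine contributing factors is through the `feature_importances_` attribute of a fitted `RandomForestRegressor`. The sketch below uses made-up features (the column names and the relationship are assumptions for illustration); these impurity-based importances are a rough guide and do not give a direction or an effect size the way regression coefficients do.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features: only 'strong_feature' drives the target much
X_demo = pd.DataFrame({
    "strong_feature": rng.normal(size=500),
    "weak_feature": rng.normal(size=500),
    "noise_feature": rng.normal(size=500),
})
y_demo = (
    5.0 * X_demo["strong_feature"]
    + 0.5 * X_demo["weak_feature"]
    + rng.normal(scale=0.5, size=500)
)

forest = RandomForestRegressor(n_estimators=15, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per column, summing to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```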

## High Budget Model

Over the course of our project, we changed our business problem: rather than all houses in the dataset, we decided to look at which variables contribute to the prices of high value houses. We define high value houses as those with a sale price of \$800,000 or greater.

Other than the sample of houses, the steps that we took to perform analysis and build our model were the same. We recreate that model for just this sample below.

In [9]:
#keep only the houses that sold for 800,000 or more
target_df = df_clean.loc[df_clean['price'] >= 800000]
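
Since restricting the sample shrinks the data available for fitting, it may be worth checking how many rows survive the filter. A minimal sketch on a hypothetical four-row DataFrame (the prices are invented, including one exactly at the threshold to show that `>=` keeps it):

```python
import pandas as pd

# Hypothetical prices, including one exactly at the threshold
demo = pd.DataFrame({"price": [450000.0, 800000.0, 799999.0, 1250000.0]})

threshold = 800000
high_value = demo.loc[demo["price"] >= threshold]

# Number of rows kept and their share of the original sample
share_kept = len(high_value) / len(demo)
print(len(high_value), share_kept)
```

On the real data, the same two quantities would be `len(target_df)` and `len(target_df) / len(df_clean)`.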

In [10]:
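#same predictors and target as before, restricted to high value houses
y = target_df['price']
X = target_df.loc[:, relevant_columns]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#fit a linear regression on the training data and cross-validate it (3 folds)
model = LinearRegression()
model.fit(X_train, y_train)

cross_val_score(model, X_train, y_train, cv=3)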

array([0.52583095, 0.51695284, 0.54535075])

In [11]:
#refit the same linear regression on the test split and score it on that split
model.fit(X_test, y_test)
model.score(X_test, y_test)

0.5383712985216296

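The r-squared values for the two samples are again similar: the three cross-validation folds on the training split score between 0.517 and 0.545, and the model refit on the test split scores 0.538, which is an in-sample fit rather than a held-out score. These values are noticeably lower than the roughly 0.69 obtained for all houses, so the chosen variables explain less of the variance in price among high value houses. Because the model behaves consistently across different samples of the high value houses, we still feel confident using it to explain variance in price with the chosen independent variables, bearing in mind that it accounts for only about half of that variance.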
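
For an explanatory model, the fitted coefficients are usually the next thing to read. The sketch below pairs each coefficient of a `LinearRegression` with its column name; the data and the relationship are invented for illustration, and the column names only mimic those used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical predictors; the names only mimic the notebook's columns
X_demo = pd.DataFrame({
    "sqft_living": rng.uniform(1000, 5000, size=300),
    "bathrooms": rng.integers(1, 5, size=300).astype(float),
})

# Made-up relationship: 200 per square foot, 10000 per bathroom, plus noise
y_demo = (
    200.0 * X_demo["sqft_living"]
    + 10000.0 * X_demo["bathrooms"]
    + rng.normal(scale=5000.0, size=300)
)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name for easier reading
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
print("intercept:", demo_model.intercept_)
```

On the real data, the same pattern would be `pd.Series(model.coef_, index=relevant_columns)`, with each value read as the change in price associated with a one-unit change in that column, holding the others fixed.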