# Linear Regression Modeling of King County Real Estate Sale Prices

## Overview

## Business Problem

## Data Understanding
For this analysis, we use the "King County Housing Price from May 2014- May 2015" dataset created by the Center for Spatial Data Science. It contains 21,597 entries, each representing a unique real estate transaction. The data contain 21 columns. Two of these columns, "id" and "date", describe the transaction identity. The remaining columns cover a variety of characteristics of the property. There is a mix of categorical and numerical columns.

In [1]:
#import necessary packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics

from random import gauss
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats as stats
from sklearn.feature_selection import RFE
from scipy.stats import kurtosis, skew
from sklearn.dummy import DummyRegressor

%matplotlib inline

In [2]:
#ignore pairplot and graph warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
#read in csv file containing the house data
housing = pd.read_csv('data/kc_house_data.csv')
housing.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


In [4]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

The dataset contains a variety of categorical and numerical data. Columns such as "bedrooms" and "bathrooms" are stored as integers and floats, but may also be thought of as categorical data.  

Most of the columns describe either the transaction or the property. "sqft_living15" and "sqft_lot15" refer to the average living-space and lot square footage of the 15 closest neighbors.

In [5]:
housing.isnull().sum()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

There are 3 columns with null values:
* waterfront
* view
* yr_renovated

In [6]:
housing.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,17755.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,4580474000.0,540296.6,3.3732,2.115826,2080.32185,15099.41,1.494096,1788.596842,1970.999676,83.636778,98077.951845,47.560093,-122.213982,1986.620318,12758.283512
std,2876736000.0,367368.1,0.926299,0.768984,918.106125,41412.64,0.539683,827.759761,29.375234,399.946414,53.513072,0.138552,0.140724,685.230472,27274.44195
min,1000102.0,78000.0,1.0,0.5,370.0,520.0,1.0,370.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,322000.0,3.0,1.75,1430.0,5040.0,1.0,1190.0,1951.0,0.0,98033.0,47.4711,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,1560.0,1975.0,0.0,98065.0,47.5718,-122.231,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10685.0,2.0,2210.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,9410.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


## Data Preparation 

Our data cleaning process involved the following:
* treatment of null values in columns
* replacing or removing unexpected values in columns
* setting column values to an appropriate data type
* investigating duplicate id values
* creating additional feature columns

In [7]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

### Dealing with Null Values

In [8]:
#check nulls in 'waterfront'
print(housing['waterfront'].isna().sum())
housing['waterfront'].value_counts()

2376


NO     19075
YES      146
Name: waterfront, dtype: int64

The 'waterfront' column contains 2376 null values. It has two non-null values: "NO" and "YES". There are 19,075 "NO" counts compared to 146 "YES" counts. Dropping these rows would cause us to lose more than 10% of our data. Our initial approach was to replace all null values in this column with "NO" since the data indicated that "NO" was the most common value.
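As a quick check of the "more than 10%" figure, the share of rows with a missing 'waterfront' value can be computed from the counts reported above. This is only an arithmetic sketch and the variable names are our own; it comes out at roughly 11%.

```python
# Counts taken from the isnull() and info() output above
null_waterfront = 2376
total_rows = 21597

# Share of rows that would be lost if the null 'waterfront' rows were dropped
share_lost = null_waterfront / total_rows
print(f"Share of rows with null waterfront: {share_lost:.1%}")
```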

The 'view' column has 63 nulls. There are five values in this column: "NONE", "AVERAGE", "GOOD", "FAIR", and "EXCELLENT". Our understanding of the data led us to believe that the 'view' and 'waterfront' columns were related. It seemed unlikely that a waterfront view would coincide with "NONE" view. We examined a subset of the data where the 'waterfront' value was "YES" to evaluate if our assumption about 'view' was accurate.

In [10]:
print(housing['view'].isna().sum())
housing['view'].value_counts()

63


NONE         19422
AVERAGE        957
GOOD           508
FAIR           330
EXCELLENT      317
Name: view, dtype: int64

In [None]:
#subset of data where 'waterfront' == YES
yes_waterfront = housing[housing['waterfront']=="YES"]
yes_waterfront['view'].value_counts()

For the 146 "YES" values of 'waterfront', the 'view' values did not contain a single "NONE". This aligned with our understanding of the connection between 'waterfront' and 'view'.

In [None]:
#subset of data where 'waterfront' is null
na_waterfront = housing[housing['waterfront'].isna()]
na_waterfront['view'].value_counts()

Any property with a waterfront will have a non-NONE view. Since there are 2110 "NONE" views in our null 'waterfront' subset, it seems safer to assume "NO" as the default for waterfront.
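The comparison between 'waterfront' status and 'view' can also be done in a single table. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the dataset) and labels missing 'waterfront' values as "MISSING" so that they show up as their own row in a cross-tabulation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "waterfront": ["YES", "NO", np.nan, np.nan, "NO", "YES", np.nan],
    "view": ["EXCELLENT", "NONE", "NONE", "GOOD", "NONE", "GOOD", "NONE"],
})

# Label missing waterfront values explicitly so they appear as their own row
waterfront_status = toy["waterfront"].fillna("MISSING")

# Rows: waterfront status; columns: view rating
view_by_waterfront = pd.crosstab(waterfront_status, toy["view"])
print(view_by_waterfront)
```

On the real data, a "YES" row with zero "NONE" views and a "MISSING" row dominated by "NONE" views would support the same conclusion we reached with the two subsets.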

In [None]:
#replace null values with "NO"
housing['waterfront'].fillna("NO", inplace=True)

In [None]:
#confirm that there are no more nulls in waterfront
housing['waterfront'].isna().sum()

The subset of null values in "view" contains 63 rows. Of these 63 rows, the 'waterfront' values consisted of 62 "NO" and 1 "YES". Since almost every null 'view' was not on a waterfront, we decided to replace the null values of "view" with "NONE".

In [None]:
na_view = housing[housing['view'].isna()]
na_view['waterfront'].value_counts()

In [None]:
#replace null values with "NONE"
housing['view'].fillna("NONE", inplace=True)

In [None]:
#confirm that there are no more nulls in view
housing['view'].isna().sum()

The 'yr_renovated' column contains 3,842 nulls. We wanted to replace these values rather than drop a large number of rows. Since 'yr_renovated' appears to contain a year stored as a float, we inferred that a value of 0.0 meant either that a house had never been renovated or that no renovation data was available.

In order to preserve the value as a year, we chose to replace 0.0 values with the value from the 'yr_built' column. A house that had matching values in 'yr_built' and 'yr_renovated' would indicate that the house had never been renovated before. We chose to treat nulls as no renovation data available and treated them the same as 0.0 values.

In [None]:
print(housing['yr_renovated'].isna().sum())
housing['yr_renovated'].value_counts()

In [None]:
#replace null values with 0
housing['yr_renovated'].fillna(0, inplace=True)

In [None]:
housing['yr_renovated'].value_counts()

In order to include the renovation status of the house, we created an additional column 'was_renovated'. This column contains a boolean value of whether or not the house had been renovated. This boolean value was set by checking if the 'yr_renovated' column was not equal to 0 (i.e. a year renovated was provided by the dataset).

In [None]:
#create 'was_renovated' column as boolean
#False if 'yr_renovated' == 0, True otherwise
housing['was_renovated'] = housing['yr_renovated'] != 0.0

In [None]:
housing[housing['was_renovated'] == False]

In [None]:
housing['was_renovated'].value_counts()

In [None]:
renovation = housing[housing['yr_renovated'] > 0]
renovation

In order to check the validity of our 'was_renovated' boolean column, we had to check that no rows in the dataset had a matching 'yr_renovated' and 'yr_built' pair of values. This check would also validate our understanding of the meaning of 0.0 in the 'yr_renovated' column.

In [None]:
renovation_check = housing[(housing['yr_renovated']) == (housing['yr_built'])]
renovation_check

In [None]:
housing['yr_built'].describe()

The renovation status of each property was now stored in our dataframe as a new column. We thought that time between 'yr_built' and 'yr_renovated' might be a feature to examine in our modeling. In order to calculate this difference, we replaced the values of 0 in 'yr_renovated' with the associated year in 'yr_built'.

In [None]:
#replace yr_renovated == 0 with the associated year in yr_built

no_renovation = housing['yr_renovated'] == 0
housing.loc[no_renovation, 'yr_renovated'] = housing.loc[no_renovation, 'yr_built']

In [None]:
housing['yr_renovated'].value_counts()

In [None]:
housing['yr_renovated'].describe()

### Dealing with Data Types and Unexpected Values

In [None]:
housing.head()

The 'sqft_basement' column was an object datatype. This did not align with our expectations since 'sqft_above' was an integer datatype.

In [None]:
housing['sqft_living'].head()

In [None]:
housing['sqft_basement'].head()

In [None]:
#454 rows with ? for a value.
housing['sqft_basement'].value_counts()

There were 454 instances of "?" in 'sqft_basement'. Rather than replacing these within the column, we decided to create an alternate 'sqft_basement2' column. Our understanding was that the square footage of the basement should equal the square footage of the living space minus the square footage of the above-ground area. This would give us a column containing the correct numerical values for basement square footage, with a datatype matching the 'sqft_living' and 'sqft_above' columns.
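One way to test the assumption that basement area equals living area minus above-ground area is to compare the derived values against the original entries wherever they are known. The sketch below uses a small made-up dataframe; `pd.to_numeric` with `errors="coerce"` turns "?" into NaN so those rows can be excluded from the comparison.

```python
import pandas as pd

# Toy rows (illustrative values only); "?" marks an unknown basement size
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 1960],
    "sqft_above": [1180, 2170, 1050],
    "sqft_basement": ["0.0", "400.0", "?"],
})

# Derived basement size: living space minus above-ground space
toy["sqft_basement2"] = toy["sqft_living"] - toy["sqft_above"]

# Parse the original column, turning "?" into NaN instead of raising
parsed = pd.to_numeric(toy["sqft_basement"], errors="coerce")

# Where the original value is known, it should match the derived value
known = parsed.notna()
mismatches = toy[known & (parsed != toy["sqft_basement2"])]
print(f"{len(mismatches)} known basement values disagree with the derived column")
```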

In [None]:
#create new column 'sqft_basement2' which is 'sqft_living' - 'sqft_above'
#addresses the "?" values in 'sqft_basement'
housing['sqft_basement2'] = housing['sqft_living'] - housing['sqft_above']
housing.head()

In [None]:
housing['sqft_basement2'].value_counts()

We ran into an unexpected value for number of bedrooms.

In [None]:
housing['bedrooms'].value_counts()

It seems very unlikely that the bedroom count would jump suddenly from 11 to 33. In order to understand this situation, we examined the rows with 9 or more bedrooms.

In [None]:
#investigating bedrooms > 8
housing[housing['bedrooms'] > 8]

It is very unlikely that a house with only 1,620 square feet of living space would have 33 bedrooms. The next highest bedroom count was 11, with a 'sqft_living' of 3000. We felt confident that this was most likely a data entry error and that 33 bedrooms should be 3 bedrooms instead.
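The reasoning above amounts to checking living space per bedroom. A minimal sketch of that check follows; the first two rows mirror the figures quoted above, the third row and the 100 sq ft threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Toy rows; the first two mirror the figures quoted above, the third is illustrative
toy = pd.DataFrame({
    "bedrooms": [33, 11, 3],
    "sqft_living": [1620, 3000, 1680],
})

# Average living space per bedroom; implausibly small values suggest a typo
toy["sqft_per_bedroom"] = toy["sqft_living"] / toy["bedrooms"]

# Hypothetical threshold: flag anything under 100 sq ft per bedroom
suspect = toy[toy["sqft_per_bedroom"] < 100]
print(suspect)
```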

In [None]:
#replacing 33 bedrooms with 3
housing['bedrooms'] = housing['bedrooms'].replace(33, 3)

In [None]:
housing['bedrooms'].value_counts()

### Investigating Duplicate ID Values

There were multiple entries with the same id. These were probably the same properties sold in multiple transactions. We checked an example to see if this was the case.

In [None]:
housing['id'].value_counts()

In [None]:
#id = 795000620 appears 3 times in our dataset
multiple_id = housing[housing['id'] == 795000620]
multiple_id

In this example, we saw that both 'date' and 'price' had changed, indicating that this property was sold for different amounts at each transaction. The other column values were still unchanged. The unchanged values matched our understanding that the ID references a specific property. Column values such as 'yr_built' or 'lat' should remain unchanged for multiple instances of the same property. 

In [None]:
#count non-unique entries in the ID column
#represents homes that appear multiple times in our data
housing['id'].duplicated().value_counts()

We created a boolean 'dup_id' column flagging rows whose 'id' value had already appeared earlier in the dataset. This column was used to find instances of properties with multiple transactions in our dataset.

In [None]:
#finding list of duplicate ID values
housing_dupes = housing
housing_dupes["dup_id"] = housing_dupes['id'].duplicated()
housing_dupes[housing_dupes["dup_id"]==True]['id']

In [None]:
duplicate_id_values = list(housing_dupes[housing_dupes["dup_id"]==True]['id'].values)

In [None]:
#investigating a few duplicate values
housing[housing['id']==duplicate_id_values[2]]

Other examples of multiple id instances aligned with our expectations for consistent column values.
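Note that `duplicated()` with its default `keep='first'` flags only the second and later occurrences of an id, not the first sale. The sketch below, on made-up ids, contrasts this with `keep=False`, which flags every row belonging to a repeated id. Which one is appropriate depends on whether the first transaction should count as a "duplicate".

```python
import pandas as pd

# Toy ids (illustrative); id 101 appears three times, id 102 once, id 103 twice
toy = pd.DataFrame({"id": [101, 102, 101, 103, 101, 103]})

# Default keep='first': only the second and later occurrences are flagged
later_occurrences = toy["id"].duplicated()

# keep=False: every row whose id appears more than once is flagged
all_occurrences = toy["id"].duplicated(keep=False)

print(later_occurrences.sum(), "rows flagged with the default")
print(all_occurrences.sum(), "rows flagged with keep=False")
```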

### Creating Additional Feature Columns

Additional features we wanted to investigate in our model were the age of the house at time of sale and the age of renovation at time of sale.

'age_at_sale' was calculated by subtracting 'yr_built' from the sale year in 'date'.

'age_renovation_at_sale' was calculated by subtracting 'yr_renovated' from the sale year in 'date'.

In [None]:
#date is stored as a string type object in M/D/Y format
housing['date']

In [None]:
#creating age column based on yr_built and yr_renovated
housing['age_at_sale'] = (housing['date'].str[-4:].astype(int) - housing['yr_built']).astype(int)

housing['age_renovation_at_sale'] = (housing['date'].str[-4:].astype(int) - housing['yr_renovated']).astype(int)
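The cell above takes the sale year by slicing the last four characters of the date string. An alternative sketch, shown here on made-up rows, parses the string into a datetime first and reads the year from the `.dt` accessor. This assumes the dates follow the M/D/YYYY format seen in the data.

```python
import pandas as pd

# Toy rows (illustrative), with dates in the same M/D/YYYY string format
toy = pd.DataFrame({
    "date": ["10/13/2014", "2/25/2015"],
    "yr_built": [1955, 1933],
})

# Parse the strings once, then read the year from the datetime accessor
sale_year = pd.to_datetime(toy["date"], format="%m/%d/%Y").dt.year

toy["age_at_sale"] = sale_year - toy["yr_built"]
print(toy)
```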

In [None]:
housing.head()

### Our Cleaned Data

In [None]:
housing.info()

In [None]:
housing.columns

After cleaning, our dataframe had all null values replaced. 

5 additional columns were created: 'was_renovated', 'sqft_basement2', 'dup_id', 'age_at_sale', and 'age_renovation_at_sale'.

We saved our cleaned dataframe with additional columns as a separate file in the 'data' folder in this project's GitHub repository.

In [None]:
#exporting cleaned data frame
housing.to_csv('./data/kc_house_data_cleaned.csv')

## Feature Analysis
Let's further explore some of our variables.

In [None]:
![housing_location](./images/housing_location.png)

In [None]:
##read in the version of the data exported from QGIS, which adds a distance-to-waterbody column
cleaned_housing = pd.read_csv('data/kc_water_dist_homes.csv')
cleaned_housing.head()

In [None]:
#use descriptive names so the built-in min() and max() are not shadowed
price_min = cleaned_housing['price'].min()
price_max = cleaned_housing['price'].max()
price_mean = cleaned_housing['price'].mean()

print(f"The sale price range of homes sold is {price_min} to {price_max}")
print(f"The mean sale price of homes was {price_mean}")

In [None]:
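#distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(cleaned_housing['price'], kde=True);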

In [None]:
label = cleaned_housing['price']
fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=4)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=4)

# Plot the boxplot   
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('price')
fig.suptitle('Price Distribution');

In [None]:
print ('Skewness =', stats.skew(cleaned_housing['price']))
print ('Kurtosis =', stats.kurtosis(cleaned_housing['price']))

- The price distribution is unimodal with a pronounced right tail, so it is not normally distributed.
- Since the peak of the distribution is to the left of our mean, price is positively skewed.
- This means that more than half of the houses in our dataset sold for less than the average price of about $540,000.
- Our kurtosis and skew are both high, which is consistent with the positive skew and long tail seen in the histogram.
- The box plot illustrates this clearly: a number of outliers sold for significantly more than our average.
- Moving forward, let's remove some of our highest-priced homes and focus on homes that sold for less than $1,500,000.
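To illustrate how trimming the upper tail affects skew, the sketch below draws synthetic right-skewed "prices" from a log-normal distribution (not the King County data) and compares the skew of the raw sample, a capped sample, and the log-transformed sample. The 99th-percentile cap is a hypothetical threshold, and the log transform is shown only as an alternative that keeps every row.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw); not the King County data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

# Option 1: cap the sample at a hypothetical threshold (the 99th percentile here)
capped = prices[prices < prices.quantile(0.99)]

# Option 2: keep every row but work with log(price) instead
logged = np.log(prices)

print("skew (raw):   ", prices.skew())
print("skew (capped):", capped.skew())
print("skew (log):   ", logged.skew())
```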

In [None]:
cleaned_housing.drop(cleaned_housing[ cleaned_housing['price'] >= 1500000 ].index, inplace = True)

In [None]:
#sns.distplot(cleaned_housing['price']);

label = cleaned_housing['price']
fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=4)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=4)

# Plot the boxplot   
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('price')
fig.suptitle('Price Distribution');

In [None]:
## creating new column ratios to adjust for multicollinearity between two independent variables
cleaned_housing['bed_bath_ratio'] = (cleaned_housing['bedrooms'] /  cleaned_housing['bathrooms'])
cleaned_housing['sqft_living_to_bedroom_ratio'] = (cleaned_housing['sqft_living'] /  cleaned_housing['bedrooms'])
cleaned_housing['sqft_living_to_bathroom_ratio'] = (cleaned_housing['sqft_living'] /  cleaned_housing['bathrooms'])
cleaned_housing['ratio_sqft_lot_living'] = (cleaned_housing['sqft_lot'] /  cleaned_housing['sqft_living'])
cleaned_housing['ratio_sqft_living_lot'] = (cleaned_housing['sqft_living'] /  cleaned_housing['sqft_lot'])

In [None]:
cleaned_housing['date'] =  pd.to_datetime(cleaned_housing['date'])

#cleaned_housing['date'] =  cleaned_housing['date'].astype(str)

#cleaned_housing['date'] = cleaned_housing['date'].dt.strftime('%d/%m/%Y')
#cleaned_housing['date'] = pd.to_datetime(cleaned_housing['date'], format='%Y/%m/%d')
#cleaned_housing['date'] = pd.to_datetime(cleaned_housing['date'], format='%m/%d/%Y')


In [None]:
#cleaned_housing['age_at_sale'] = (cleaned_housing['date'].str[4:].astype(int) - cleaned_housing['yr_built']).astype(int)

#cleaned_housing['age_renovation_at_sale'] = (cleaned_housing['date'].str[4:].astype(int) - cleaned_housing['yr_renovated']).astype(int)

In [None]:
cleaned_housing.rename(columns = {'Hub distance_HubDist':'Distance_to_Water'}, inplace = True)

In [None]:
cleaned_housing = cleaned_housing.drop(['field_1'], axis=1)
cleaned_housing = cleaned_housing.drop(['sqft_basement'], axis=1)
cleaned_housing['zipcode'] = cleaned_housing['zipcode'].astype(str)


In [None]:
cleaned_housing.info()

In [None]:
cleaned_housing['condition'].value_counts()

## Train-Test Split

In [None]:
#Setting up train test split
X = cleaned_housing.drop('price', axis=1)
y = cleaned_housing['price']

X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=57)

In [None]:
#Combining X_train and y_train to get train_df
train_df = pd.concat([y_train, X_train], axis=1)
train_df.head()

## Simple Regression

In [None]:
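#numeric_only=True skips the string and datetime columns, which recent pandas versions refuse to correlate
cleaned_housing.corr(numeric_only=True)['price'].sort_values(ascending=False)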

In [None]:
#compute the absolute correlation matrix once (numeric columns only) and reuse it
cor = cleaned_housing.corr(numeric_only=True).abs()
mask = np.triu(np.ones_like(cor, dtype=bool))

plt.figure(figsize=(20,10))

sns.heatmap(cor, mask=mask, annot=True);

In [None]:
#Model 1 - Simple Regression 1
simple_formula = 'price ~ sqft_living'
#fit once and keep the results object; residuals and fitted values live on it
simple_model = sm.formula.ols(formula=simple_formula, data=train_df).fit()
simple_model_summary = simple_model.summary()

simple_model_summary

In [None]:
sns.histplot(simple_model.resid);

In [None]:
#keep the residuals under their own name so the fitted model is not overwritten
resid_simple_model = simple_model.resid

fig, ax = plt.subplots(1,2,figsize=(15, 5))
ax[0].scatter(x=simple_model.fittedvalues,y=resid_simple_model)
ax[0].set_xlabel("Predicted Values")
ax[0].set_ylabel("Residual Error")
ax[0].set_title(label="Test for Homoscedasticity")


ax[1].hist(resid_simple_model)
ax[1].set_xlabel("Residual Error")
ax[1].set_ylabel("Count")
ax[1].set_title(label="Histogram of Residual Error");

plt.style.use('ggplot')
fig = sm.graphics.qqplot(resid_simple_model, dist=stats.norm, line='45', fit=True)

In [None]:
y_max = y.max()
y_min = y.min()

#fitted values exist only for the training rows, so compare them with y_train
ax = sns.scatterplot(x=simple_model.fittedvalues, y=y_train)
ax.set(ylim=(y_min, y_max))
ax.set(xlim=(y_min, y_max))
ax.set_xlabel("Predicted Sale Price")
ax.set_ylabel("Actual Sale Price")

#reference line where predicted price equals actual price
X_ref = y_ref = np.linspace(y_min, y_max, 100)
plt.plot(X_ref, y_ref, color='red', linewidth=1)
plt.show()

In [None]:
#simple_model_1_train_preds = simple_model.predict(sm.add_constant(X_train['sqft_living']))
#simple_model_1_train_preds
simple_train_preds = simple_model.predict(X_train['sqft_living'])

In [None]:
#Plot our points, sqft_living vs price, as a scatterplot
plt.scatter(train_df['sqft_living'], train_df['price'])

# Plot the line of best fit!
plt.plot(train_df['sqft_living'], simple_train_preds, color='black')

plt.ylabel('Home Sale Price')
plt.xlabel('Sqft_living')
plt.title('Relationship between Home Sale Price and Sqft living space')
plt.show()

In [None]:
# One last thing - we can visualize both the train and test sets!

# Build the test frame and test predictions (not defined earlier in this notebook)
test_df = pd.concat([y_test, X_test], axis=1)
simple_test_preds = simple_model.predict(X_test['sqft_living'])

# Plot our training data
plt.scatter(train_df['sqft_living'], train_df['price'], color='blue', label='Training')
# Plot our testing data
plt.scatter(test_df['sqft_living'], test_df['price'], color='green', label='Testing')


# Plot the line of best fit
plt.plot(train_df['sqft_living'], simple_train_preds, color='black')
# Plotting for the test data just to show it's the same!
plt.plot(test_df['sqft_living'], simple_test_preds, color='red')

plt.ylabel('Home Sale Price')
plt.xlabel('Sqft_living')
plt.title('Relationship between Home Sale Price and Sqft living space')
plt.legend()
plt.show()

In [None]:
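#ResidualsPlot is assumed to come from the yellowbrick package (it was not imported above)
from yellowbrick.regressor import ResidualsPlot

#yellowbrick needs a scikit-learn estimator, so refit the same single-feature model
visualizer = ResidualsPlot(LinearRegression(), hist=False, qqplot=True)
visualizer.fit(X_train[['sqft_living']], y_train)
visualizer.score(X_test[['sqft_living']], y_test)
visualizer.show()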

### Observations

- Fitted line: price = 198.74 * sqft_living - 99,610.
- sqft_living accounts for about 43.6% of the variance in sale price.
- Each additional square foot of living space is associated with an average increase of about $200 in sale price.
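To make the fitted line concrete, the sketch below plugs the reported slope and intercept into a small helper and evaluates it for a hypothetical 2,000 sq ft home. The function name and the example sizes are our own choices for illustration.

```python
# Coefficients as reported in the observations above
SLOPE = 198.74        # dollars per square foot of living space
INTERCEPT = -99_610   # dollars

def predict_price(sqft_living):
    """Return the fitted line's predicted sale price for a given living area."""
    return SLOPE * sqft_living + INTERCEPT

# A hypothetical 2,000 sq ft home, and the effect of adding 100 sq ft
base = predict_price(2000)
bigger = predict_price(2100)
print(f"Predicted price at 2,000 sq ft: ${base:,.0f}")
print(f"Change for +100 sq ft: ${bigger - base:,.0f}")
```

Because the intercept is negative, the line gives implausible predictions for very small homes, which is one reason to read it only within the range of living areas seen in the data.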


In [None]:
#Model 2 - Simple Regression 2
simple_formula_2 = 'price ~ sqft_living15'
simple_model_2 = sm.formula.ols(formula=simple_formula_2, data=train_df)
simple_model_2summary = simple_model_2.fit().summary()

simple_model_2summary

In [None]:
##input assumption visuals ## y_scaled= np.log(y)??

### Observations

- Fitted line: price = 225.16 * sqft_living15 - 60,800.
- The living-space square footage of the 15 nearest homes accounts for about 35% of the variance in sale price.
- Each additional square foot of sqft_living15 is associated with an average increase of about $225 in sale price.


In [None]:
cat_cols = [c for c in train_df.columns if train_df[c].dtype == 'O']
cat_cols

In [None]:
train_df.columns

In [None]:
# create an encoder object. This will help us to convert categorical variables to new columns
encoder = OneHotEncoder(handle_unknown='error',
                        drop='first', 
                        categories='auto')

ct = ColumnTransformer(transformers=[('ohe', encoder, cat_cols)],
                       remainder='passthrough')
ct.fit(X_train) 
X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)

In [None]:
#create dummy variables for the "condition" column
#(X_train is used here because X_train_condition was never defined in this notebook)
condition_dummies = pd.get_dummies(X_train['condition'], drop_first=True)
condition_dummies
#drops 'Average', creates 4 additional columns

In [None]:
X_train_dummies = pd.concat([X_train, condition_dummies], axis=1)
X_train_dummies

In [None]:
#Model 3 - Multiple Regression 1
Multiple_formula = 'price ~ sqft_living + yr_built + Distance_to_Water + bed_bath_ratio'
Multiple_model = sm.formula.ols(formula=Multiple_formula, data=train_df)
Multiple_model_summary = Multiple_model.fit().summary()

Multiple_model_summary

In [None]:
#Model 4 - Multiple Regression 2
#TODO: add in condition and zip code (the formula below is still identical to Model 3)
Multiple_formula_2 = 'price ~ sqft_living + yr_built + Distance_to_Water + bed_bath_ratio'
Multiple_model_2 = sm.formula.ols(formula=Multiple_formula_2, data=train_df)
Multiple_model_2summary = Multiple_model_2.fit().summary()

Multiple_model_2summary

In [None]:
sns.pairplot(housing)
plt.show()

In [None]:
lr_rfe = LinearRegression()
select = RFE(lr_rfe, n_features_to_select =4)

In [None]:
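#scale only the numeric predictors; StandardScaler cannot handle string or datetime columns
numeric_features = cleaned_housing.drop('price', axis=1).select_dtypes(include='number')
ss = StandardScaler()
ss.fit(numeric_features)
cleaned_housing_scaled = ss.transform(numeric_features)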

In [None]:
cleaned_housing_scaled

In [None]:
select.fit(X=cleaned_housing_scaled, y=cleaned_housing['price'])

In [None]:
select.support_

In [None]:
select.ranking_

In [None]:
## use sqft_living  yr_built  sqft_living15  sqft_lot15

In [None]:
#Polynomials using all except categorical values

In [None]:
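#keep numeric columns only, since PolynomialFeatures cannot expand categorical values
X = cleaned_housing.drop('price', axis=1).select_dtypes(include='number')
y = cleaned_housing['price']
pf = PolynomialFeatures(degree=3)
pf.fit(X)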

In [None]:
pf.transform(X)

In [None]:
pf.transform(X).shape

In [None]:
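#get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
pf.get_feature_names_out()

In [None]:
polynomial_df = pd.DataFrame(pf.transform(X), columns=pf.get_feature_names_out())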

In [None]:
lr = LinearRegression()
lr.fit(polynomial_df, y)

In [None]:
lr.score(polynomial_df, y)

## DO NOT RUN CELL BELOW 