# Housing Dataset Analysis

## Introduction
In this project we perform a linear regression analysis that covers both inferential and predictive modeling. The analysis concludes with a final report on all of our findings.

## Business Understanding
Our client is a real estate agency situated in King County, Washington, that helps homeowners buy and sell homes. They want a better understanding of which features of a house matter most when estimating a home's price in that area, and they also want us to come up with a pricing algorithm that can help them price future homes.

## Data Understanding

We have been provided access to data on more than 21,000 homes (21,597 rows) together with their respective attributes. The files are contained in the Data folder, where:
 1. kc_house_data.csv contains data on the different homes together with their attributes
 2. column_names.md contains a breakdown of the column names and their descriptions for the King County dataset

## Load the Data
 
In the cells below, we load the relevant libraries and the data.

In [1]:
# load the imports
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.model_selection import train_test_split

In [2]:
# Load the data
data = pd.read_csv('Data/kc_house_data.csv')
data.head()

|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | 0.0 | ... | 7 | 1180 | 0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0.0 | 0.0 | ... | 7 | 2170 | 400 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000 | 2 | 1.0 | 770 | 10000 | 1.0 | 0.0 | 0.0 | ... | 6 | 770 | 0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0.0 | 0.0 | ... | 7 | 1050 | 910 | 1965 | 0.0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0.0 | 0.0 | ... | 8 | 1680 | 0 | 1987 | 0.0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


## Data Cleaning

### Check the Data

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  int64  
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  float64
 9   view           21534 non-null  float64
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

Based on the summary above, the dataset has 21 columns and 21,597 entries. Three columns have missing values (waterfront, view, yr_renovated). Most columns are int64 or float64; two columns (date and sqft_basement) have the object data type.
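
The `sqft_basement` column is reported as `object` even though it holds square footage, which usually means at least one entry is not a plain number. Below is a minimal sketch of how such a column could be converted, using a small hypothetical frame. The `'?'` placeholder is an assumption for illustration, not something the output above confirms.

```python
import pandas as pd

# Hypothetical stand-in for an object-typed numeric column.
# The '?' placeholder is an assumption, not taken from the real data.
toy = pd.DataFrame({"sqft_basement": ["0.0", "400.0", "?", "910.0"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["sqft_basement"] = pd.to_numeric(toy["sqft_basement"], errors="coerce")

print(toy.dtypes)                         # the column is now float64
print(toy["sqft_basement"].isna().sum())  # number of entries that could not be parsed
```

Coercing first and counting the resulting NaN values makes it easy to see how many entries were not numeric before deciding whether to drop or impute them.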

Below, I drop the id and date columns, as they won't be used in the analysis.

In [4]:
data.drop(['id', 'date'], axis=1, inplace=True)

### Check for Null Values

In [5]:
data.isna().sum()

price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

As shown above, three columns have missing values. I drop every row that contains at least one missing value.


In [6]:
data = data.dropna()
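
Dropping rows is the simplest option, but it discards a row as soon as any one column is missing. Here the dataset goes from 21,597 to 15,762 rows, roughly 27% fewer. The sketch below shows one way the cost could be measured before committing to it. It uses four columns of the five rows shown by `data.head()` above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Four columns of the five rows shown by data.head(); NaN marks the missing entries.
toy = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000, 510000],
    "waterfront": [np.nan, 0.0, 0.0, 0.0, 0.0],
    "view": [0.0, 0.0, 0.0, 0.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0, 0.0],
})

rows_before = len(toy)
cleaned = toy.dropna()          # removes any row with at least one NaN
rows_after = len(cleaned)
share_lost = 1 - rows_after / rows_before

print(f"{rows_before} -> {rows_after} rows ({share_lost:.0%} dropped)")

# Share of missing values per column, to see which columns drive the loss.
print(toy.isna().mean())
```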

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15762 entries, 1 to 21596
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          15762 non-null  int64  
 1   bedrooms       15762 non-null  int64  
 2   bathrooms      15762 non-null  float64
 3   sqft_living    15762 non-null  int64  
 4   sqft_lot       15762 non-null  int64  
 5   floors         15762 non-null  float64
 6   waterfront     15762 non-null  float64
 7   view           15762 non-null  float64
 8   condition      15762 non-null  int64  
 9   grade          15762 non-null  int64  
 10  sqft_above     15762 non-null  int64  
 11  sqft_basement  15762 non-null  object 
 12  yr_built       15762 non-null  int64  
 13  yr_renovated   15762 non-null  float64
 14  zipcode        15762 non-null  int64  
 15  lat            15762 non-null  float64
 16  long           15762 non-null  float64
 17  sqft_living15  15762 non-null  int64  
 18  sqft_l

### Check For Multicollinearity

Multicollinearity occurs when two or more predictors are strongly correlated with each other. In this case our predictors are all of the attributes other than price, which is our target variable. This affects a linear regression model because if two predictors are highly correlated with the target and also highly correlated with each other, it is hard to separate the effect of one predictor on the target from the effect of the other.

This makes the coefficient estimates unstable and harder to interpret, which weakens the inferential side of the analysis. Therefore, it is important to check for multicollinearity before performing the analysis.
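
One common way to quantify multicollinearity is the variance inflation factor (VIF), which the notebook imports from statsmodels as `variance_inflation_factor`. The sketch below is a NumPy-only illustration, not the notebook's own code. It relies on the identity that the VIF of predictor j, 1 / (1 - R_j^2) from regressing it on the other predictors with an intercept, equals the j-th diagonal entry of the inverse correlation matrix. The helper name `vif_from_correlation` and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(predictors: pd.DataFrame) -> pd.Series:
    """Return each column's VIF as the diagonal of the inverse correlation matrix."""
    corr = predictors.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=predictors.columns, name="VIF")


# Synthetic data (not the King County data): sqft_above is built to track
# sqft_living closely, while condition is independent of both.
rng = np.random.default_rng(0)
n = 500
living = rng.normal(2000, 500, n)
toy = pd.DataFrame({
    "sqft_living": living,
    "sqft_above": 0.8 * living + rng.normal(0, 50, n),
    "condition": rng.integers(1, 6, n).astype(float),
})

vif = vif_from_correlation(toy)
print(vif.sort_values(ascending=False))
```

The two nearly redundant columns get large VIFs while the independent one stays close to 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though that cutoff is a convention rather than a hard rule.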



The first step is to check the pairwise correlation coefficients between our attributes.

In [8]:
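# Calculate the pairwise correlation coefficients.
# numeric_only=True (available in pandas 1.5+) restricts the calculation to
# numeric columns, so the object-typed sqft_basement is skipped explicitly.
data.corr(numeric_only=True)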

|  | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| price | 1.0 | 0.305489 | 0.526155 | 0.706189 | 0.084504 | 0.259505 | 0.274212 | 0.396862 | 0.034367 | 0.664146 | 0.612014 | 0.049345 | 0.122731 | -0.049502 | 0.306607 | 0.021215 | 0.581572 | 0.079402 |
| bedrooms | 0.305489 | 1.0 | 0.512243 | 0.573575 | 0.02546 | 0.180485 | -0.005833 | 0.080577 | 0.020074 | 0.354243 | 0.474272 | 0.153229 | 0.01743 | -0.147255 | -0.005917 | 0.12937 | 0.39072 | 0.025217 |
| bathrooms | 0.526155 | 0.512243 | 1.0 | 0.753846 | 0.080362 | 0.505187 | 0.065688 | 0.180923 | -0.130287 | 0.664748 | 0.685677 | 0.504841 | 0.046988 | -0.199625 | 0.02993 | 0.222755 | 0.56929 | 0.081984 |
| sqft_living | 0.706189 | 0.573575 | 0.753846 | 1.0 | 0.165336 | 0.359407 | 0.111491 | 0.285506 | -0.062319 | 0.764251 | 0.876176 | 0.31422 | 0.050232 | -0.196537 | 0.058394 | 0.239521 | 0.756676 | 0.17682 |
| sqft_lot | 0.084504 | 0.02546 | 0.080362 | 0.165336 | 1.0 | -0.009924 | 0.025982 | 0.077073 | -0.016036 | 0.10895 | 0.174216 | 0.051578 | 0.002147 | -0.129494 | -0.084304 | 0.231638 | 0.145393 | 0.718489 |
| floors | 0.259505 | 0.180485 | 0.505187 | 0.359407 | -0.009924 | 1.0 | 0.018382 | 0.027518 | -0.261013 | 0.459843 | 0.529101 | 0.487052 | -0.00072 | -0.05813 | 0.05819 | 0.129769 | 0.281982 | -0.013571 |
| waterfront | 0.274212 | -0.005833 | 0.065688 | 0.111491 | 0.025982 | 0.018382 | 1.0 | 0.409773 | 0.016454 | 0.083034 | 0.077165 | -0.024068 | 0.0878 | 0.030391 | -0.015935 | -0.042324 | 0.090588 | 0.029636 |
| view | 0.396862 | 0.080577 | 0.180923 | 0.285506 | 0.077073 | 0.027518 | 0.409773 | 1.0 | 0.046354 | 0.248679 | 0.170726 | -0.056645 | 0.098386 | 0.086479 | 0.008403 | -0.0785 | 0.277778 | 0.071496 |
| condition | 0.034367 | 0.020074 | -0.130287 | -0.062319 | -0.016036 | -0.261013 | 0.016454 | 0.046354 | 1.0 | -0.14781 | -0.157958 | -0.366938 | -0.060845 | 0.001685 | -0.02225 | -0.105823 | -0.096336 | -0.005139 |
| grade | 0.664146 | 0.354243 | 0.664748 | 0.764251 | 0.10895 | 0.459843 | 0.083034 | 0.248679 | -0.14781 | 1.0 | 0.758289 | 0.443286 | 0.011795 | -0.18412 | 0.117425 | 0.20068 | 0.717031 | 0.116671 |
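
Scanning a matrix this size by eye gets tedious. Below is a minimal sketch of how strongly correlated predictor pairs could be listed programmatically. The helper `high_correlation_pairs`, the 0.75 cutoff, and the synthetic data are assumptions for illustration and are not part of the original notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, target: str, threshold: float = 0.75) -> pd.DataFrame:
    """List predictor pairs whose absolute correlation exceeds `threshold`."""
    corr = df.drop(columns=[target]).corr()
    rows = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)  # each unordered pair once
        if abs(corr.loc[a, b]) > threshold
    ]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"])
    # Strongest relationships first, regardless of sign.
    return pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)


# Synthetic data (not the King County data): sqft_above tracks sqft_living,
# condition is unrelated to both.
rng = np.random.default_rng(42)
n = 300
living = rng.normal(2000, 600, n)
toy = pd.DataFrame({
    "price": 250 * living + rng.normal(0, 50_000, n),
    "sqft_living": living,
    "sqft_above": 0.85 * living + rng.normal(0, 100, n),
    "condition": rng.integers(1, 6, n),
})

pairs = high_correlation_pairs(toy, target="price", threshold=0.75)
print(pairs)
```

In the matrix above, pairs such as sqft_living and sqft_above (0.876) or sqft_living and grade (0.764) would exceed a cutoff of this kind.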


Below, I plot a heatmap to give a visual summary of the correlation coefficients, and rank the predictors by the absolute value of their correlation with price.

In [8]:
# Rank every numeric column by the strength of its correlation with price.
target_corr = data.corr(numeric_only=True)['price'].abs().sort_values(ascending=False)
target_corr


price            1.000000
sqft_living      0.701917
grade            0.667951
sqft_above       0.605368
sqft_living15    0.585241
bathrooms        0.525906
view             0.395734
bedrooms         0.308787
lat              0.306692
waterfront       0.276295
floors           0.256804
yr_renovated     0.129599
sqft_lot         0.089876
sqft_lot15       0.082845
yr_built         0.053953
zipcode          0.053402
condition        0.036056
long             0.022036
Name: price, dtype: float64