## Context

As a Data Scientist, you work for Hass Consulting Company, a real estate leader with over 25 years of experience. You have been tasked with studying the factors that affect housing prices, using information on real estate properties collected over the past few months. You will then build a model that lets the company accurately predict sale prices from the predictor variables.

## Experimental design undertaken

## Exploring the dataset

Here we will seek to understand the dataset: data types, dimensionality, missing data, correlations, etc.

In [0]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sb
import scipy as sp

from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline

In [51]:
## --PERSONAL NOTE-- ##
#Newer versions of matplotlib have broken Seaborn. Until it gets fixed, you gotta downgrade son.


!pip install matplotlib==3.1.0

Collecting matplotlib==3.1.0
  Downloading https://files.pythonhosted.org/packages/da/83/d989ee20c78117c737ab40e0318ea221f1aed4e3f5a40b4f93541b369b93/matplotlib-3.1.0-cp36-cp36m-manylinux1_x86_64.whl (13.1MB)
ERROR: plotnine 0.6.0 has requirement matplotlib>=3.1.1, but you'll have matplotlib 3.1.0 which is incompatible.
ERROR: mizani 0.6.0 has requirement matplotlib>=3.1.1, but you'll have matplotlib 3.1.0 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
Installing collected packages: matplotlib
  Found existing installation: matplotlib 3.1.2
    Uninstalling matplotlib-3.1.2:
      Successfully uninstalled matplotlib-3.1.2
Successfully installed matplotlib-3.1.0


In [3]:
# Load the dataset

housedf = pd.read_csv('house_data.csv')
housedf2 = housedf.copy()  # independent backup; plain assignment would only alias housedf

housedf.head(5)

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
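
A caveat on keeping a backup frame: plain assignment binds a second name to the same object, so an in-place change shows up under both names, whereas `.copy()` gives an independent frame. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"price": [221900.0, 538000.0], "bedrooms": [3, 3]})

alias = toy          # same object under a second name
backup = toy.copy()  # independent copy of the data

toy.loc[0, "price"] = 0.0  # in-place edit of the original

print(alias.loc[0, "price"])   # follows the edit
print(backup.loc[0, "price"])  # keeps the original value
```
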


---
**Dataset glossary**

price - Price of the house

bedrooms - Number of bedrooms

bathrooms - Number of bathrooms

sqft_living - Square footage of the living area

sqft_lot - Square footage of the lot

floors - Number of floors

waterfront - Whether the house has a waterfront or not

view - Number of views

grade - Grade of the house

sqft_basement - Square footage of the basement

yr_built - Year the house was built

yr_renovated - Year the house was renovated

zipcode - Zip code of the house

lat - Latitude of the house

long - Longitude of the house

---

## Understanding the dataset

In [5]:
# Checking shape, data types

housedf.shape

(21613, 20)

In [6]:
housedf.dtypes

id                 int64
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

It seems odd that the _floors_ and _bathrooms_ columns are of type float. In real estate, though, a bathroom can be counted as a half if it has only a toilet and no shower. I'm not sure what a fractional value means for floors, so I'll pull a few random records to get a better idea before proceeding.

In [7]:
# Sample random rows

housedf.take(np.random.permutation(len(housedf))[:15])

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
5990,1089000190,925000.0,4,2.25,2590,13894,2.0,0,0,4,9,2590,0,1975,0,98005,47.6351,-122.165,2720,13894
3733,6744700900,795000.0,4,2.5,2570,13450,1.0,0,4,3,8,1510,1060,1948,0,98155,47.7429,-122.285,3470,12615
6592,6616000010,814000.0,4,2.5,2840,8820,1.0,0,2,5,8,1420,1420,1952,0,98118,47.5542,-122.265,2310,8750
19960,3821700038,305000.0,3,3.0,1290,1112,3.0,0,0,3,7,1290,0,2008,0,98125,47.7282,-122.296,1230,9000
2567,6071700160,603500.0,6,2.75,2660,8400,1.0,0,0,5,8,1550,1110,1962,0,98006,47.549,-122.173,2280,8400
8171,795000405,285950.0,2,1.0,1170,6000,1.0,0,0,3,6,1170,0,1948,0,98168,47.5033,-122.331,1130,7500
7331,5101402472,340500.0,2,1.0,940,5413,1.0,0,0,3,7,940,0,1923,0,98115,47.6956,-122.304,1340,5296
17250,1442700430,499950.0,5,2.5,3180,23809,1.0,0,0,3,9,3180,0,1978,0,98038,47.3727,-122.054,2500,15778
5249,8802400415,205000.0,3,1.0,1050,8498,1.0,0,0,3,7,1050,0,1958,0,98031,47.4038,-122.203,1340,8498
6192,3726800285,346000.0,2,1.0,1070,2196,1.0,0,0,4,7,880,190,1917,0,98144,47.5726,-122.308,1160,3600


Upon closer inspection, there are indeed fractional (.5) floors in the data. After researching, I can see that it's not an error. A half floor typically refers to an upper level that covers only part of the floor below it, such as a finished attic under a sloped roof.
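
A random sample can easily miss rare values, so counting them directly is more reliable. The sketch below shows one way to do that; the `floors` series here is a toy stand-in with made-up values, not the real column:

```python
import pandas as pd

# Toy stand-in for the `floors` column; values are illustrative only.
floors = pd.Series([1.0, 1.5, 2.0, 2.0, 1.0, 3.5, 1.5])

# Frequency of each distinct value, ordered by value.
counts = floors.value_counts().sort_index()
print(counts)

# Share of records whose floor count has a fractional part.
has_half = floors % 1 != 0
print(f"{has_half.mean():.1%} of records have a half floor")
```

On the real data, the same two steps would be applied to `housedf['floors']`.
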

In [8]:
# Checking for null values

housedf.isnull().any()

id               False
price            False
bedrooms         False
bathrooms        False
sqft_living      False
sqft_lot         False
floors           False
waterfront       False
view             False
condition        False
grade            False
sqft_above       False
sqft_basement    False
yr_built         False
yr_renovated     False
zipcode          False
lat              False
long             False
sqft_living15    False
sqft_lot15       False
dtype: bool

In [10]:
# Check for duplicate rows

duplicates = housedf[housedf.duplicated()]
duplicates.shape

(0, 20)

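The original run of this check found 3 duplicate rows, which are dropped below. The `(0, 20)` shown above seems to come from re-running the cell after the drop: the row count falls from 21,613 at load time to 21,610 afterwards.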

In [0]:
# Drop duplicate rows

housedf = housedf.drop_duplicates()
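
With default arguments, `duplicated()` only flags rows where every column matches an earlier row. A property listed more than once under the same `id` but with a different price would not be caught. A small sketch of the difference, on a toy frame with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    "id":    [101, 101, 202, 303, 303],
    "price": [250000.0, 250000.0, 400000.0, 310000.0, 335000.0],
})

# Exact duplicates: every column matches an earlier row.
full_dupes = toy[toy.duplicated()]

# Repeated ids: same id, whether or not the other columns match.
repeated_ids = toy[toy.duplicated(subset="id", keep=False)]

deduped = toy.drop_duplicates()
print(len(full_dupes), len(repeated_ids), len(deduped))
```

Whether repeated ids should be dropped is a modelling decision; they may be legitimate resales rather than errors.
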

Our dataset is complete: there are no null values. Note that this doesn't rule out other errors, for example from transcription or other data-recording steps. The data formats, column headers, etc. all look fine, and I now have a much better understanding of the dataset, so I can move on to descriptive analysis.
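
Since a null check cannot catch recording errors, a few plausibility rules can help surface suspicious records. This is only a sketch: the toy values and the cut-offs (for example, more than 15 bedrooms) are my own assumptions, not domain facts.

```python
import pandas as pd

toy = pd.DataFrame({
    "bedrooms":    [3, 0, 33, 4],
    "bathrooms":   [1.0, 0.0, 1.75, 2.5],
    "sqft_living": [1180, 290, 1620, 2570],
})

# Hypothetical plausibility rules; the thresholds are assumptions.
rules = {
    "no_bedrooms": toy["bedrooms"] == 0,
    "no_bathrooms": toy["bathrooms"] == 0,
    "too_many_bedrooms": toy["bedrooms"] > 15,
}

flags = pd.DataFrame(rules)
suspect = toy[flags.any(axis=1)]
print(flags.sum())  # number of rows tripping each rule
print(suspect)
```

Flagged rows are candidates for a closer look, not automatic removals.
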

## Exploratory Data Analysis

In [11]:
housedf.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0,21610.0
mean,4580161000.0,540178.9,3.370847,2.114739,2079.881212,15108.29,1.494239,0.007543,0.234197,3.40944,7.656779,1788.347894,291.533318,1971.003609,84.322351,98077.945673,47.560049,-122.21391,1986.518695,12769.031976
std,2876547000.0,367387.6,0.93011,0.770204,918.500299,41423.23,0.539994,0.086523,0.766136,0.650764,1.1755,828.138723,442.596699,29.372639,401.499264,53.505373,0.138572,0.140833,685.425781,27305.972464
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,321612.5,3.0,1.75,1425.5,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.470925,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7619.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.5718,-122.231,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10688.75,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


I will use Pandas Profiling to perform univariate and bivariate analysis, and check to see how strongly each variable is correlated to all the others.

In [4]:
import pandas_profiling

pandas_profiling.ProfileReport(housedf)

Profile report overview:

Number of variables,20
Number of observations,21613
Total Missing (%),0.0%
Total size in memory,3.3 MiB
Average record size in memory,160.0 B
Variable types,19 numeric and 1 Boolean (waterfront)

Per-variable summary, condensed from the report (quartiles are in the `describe()` output above):

Variable,Distinct count,Zeros (%),Minimum,Median,Maximum,Mean,Standard deviation,Skewness,Kurtosis,Most frequent value (count)
id,21436,0.0%,1000102,3904900000,9900000190,4580300000,2876600000,0.24333,-1.2605,795000620 (3)
price,3625,0.0%,75000,450000,7700000,540180,367360,4.0217,34.522,450000.0 (172)
bedrooms,13,0.1%,0,3,33,3.3708,0.93006,1.9743,49.064,3 (9824)
bathrooms,30,0.0%,0,2.25,8,2.1148,0.77016,0.51111,1.2799,2.5 (5380)
sqft_living,1038,0.0%,290,1910,13540,2079.9,918.44,1.4716,5.2431,1300 (138)
sqft_lot,9782,0.0%,520,7618,1651359,15107,41421,13.06,285.08,5000 (358)
floors,6,0.0%,1,1.5,3.5,1.4943,0.53999,0.61618,-0.48472,1.0 (10680)
waterfront,2,99.2%,-,-,-,0.0075418,-,-,-,0 (21450)
view,5,90.2%,0,0,4,0.2343,0.76632,3.3957,10.893,0 (19489)
condition,5,0.0%,1,3,5,3.4094,0.65074,1.0328,0.52576,3 (14031)
grade,12,0.0%,1,7,13,7.6569,1.1755,0.7711,1.1909,7 (8981)
sqft_above,946,0.0%,290,1560,9410,1788.4,828.09,1.4467,3.4023,1300 (212)
sqft_basement,306,60.7%,0,0,4820,291.51,442.58,1.578,2.7156,0 (13126)
yr_built,116,0.0%,1900,1975,2015,1971,29.373,-0.46981,-0.65741,2014 (559)
yr_renovated,70,95.8%,0,0,2015,84.402,401.68,4.5495,18.701,0 (20699)
zipcode,70,0.0%,98001,98065,98199,98078,53.505,0.40566,-0.85348,98103 (602)
lat,5034,0.0%,47.156,47.572,47.778,47.56,0.13856,-0.48527,-0.67631,47.5491 (17)
long,752,0.0%,-122.52,-122.23,-121.31,-122.21,0.14083,0.88505,1.0495,-122.29 (116)
sqft_living15,777,0.0%,399,1840,6210,1986.6,685.39,1.1082,1.5971,1540 (197)
sqft_lot15,8689,0.0%,651,7620,871200,12768,27304,9.5067,150.76,5000 (427)

Notable values from the report's frequency tables: the 33-bedroom value appears once and 0 bedrooms 13 times; 0 bathrooms appears 10 times; 163 of the 21,613 properties (0.8%) are waterfront; the prices 450000.0 and 350000.0 tie as the most frequent, at 172 records each.


A few observations:
1. What is immediately apparent is that the measures of square footage are positively correlated with price: strongly for living space and above-ground area, moderately for the basement, and only weakly for the lot.
2. Location, which is usually a key indicator of property value in the real estate business, doesn't help much here as given, since it comes in latitude and longitude format. One could go further by checking which places the coordinates translate to and clustering the properties accordingly. Only then could one use the resulting categorical data as part of the analysis.
3. Year built and year renovated don't seem to have much bearing on price in our dataset either, unlike real-world examples where older houses in certain areas, with a certain je ne sais quoi, may attract higher prices.
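
The clustering idea from observation 2 can be sketched with k-means on the coordinates. Everything here is an assumption for illustration: the coordinates are synthetic, the `location_cluster` column name is hypothetical, and the number of clusters would need tuning on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic neighbourhoods; coordinates are made up, not taken from the dataset.
north = rng.normal(loc=[47.72, -122.30], scale=0.005, size=(50, 2))
south = rng.normal(loc=[47.40, -122.20], scale=0.005, size=(50, 2))
coords = pd.DataFrame(np.vstack([north, south]), columns=["lat", "long"])

# n_clusters=2 matches the toy data; on real data it would need tuning.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
coords["location_cluster"] = km.fit_predict(coords[["lat", "long"]])

print(coords["location_cluster"].value_counts())
```

The cluster label is categorical, so it would need one-hot encoding (for example with `pd.get_dummies`) before going into a linear model.
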

#### Multivariate Analysis

In [0]:
# Preprocessing: separate the predictors from the target
X = housedf.drop(columns='price')
y = housedf['price']

# splitting the data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardization (zero mean, unit variance), fitted on the training set only
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA
from sklearn.decomposition import PCA

pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
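
The cell above fits PCA but does not inspect the result. One way to see how much each component contributes is `explained_variance_ratio_`. The sketch below uses synthetic features (not the housing data), with one column built as a near copy of another so that a single component dominates:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features: the second column is nearly a copy of the first.
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
X_toy = np.column_stack([a, b, c])

X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA().fit(X_scaled)

ratios = pca_toy.explained_variance_ratio_
print(ratios)             # variance share per component, largest first
print(np.cumsum(ratios))  # cumulative share
```

A component with a near-zero share signals redundancy among the original features, which is the same issue the VIF check targets.
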

##### Checking for correlations (VIF)


In [6]:
corr = housedf.corr()
corr



Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
id,1.0,-0.016797,0.001286,0.00516,-0.012258,-0.132109,0.018525,-0.002721,0.011592,-0.023783,0.00813,-0.010842,-0.005151,0.02138,-0.016907,-0.008224,-0.001891,0.020799,-0.002901,-0.138798
price,-0.016797,1.0,0.308338,0.525134,0.702044,0.089655,0.256786,0.266331,0.397346,0.036392,0.667463,0.605566,0.323837,0.053982,0.126442,-0.053168,0.306919,0.021571,0.585374,0.082456
bedrooms,0.001286,0.308338,1.0,0.515884,0.576671,0.031703,0.175429,-0.006582,0.079532,0.028472,0.356967,0.4776,0.303093,0.154178,0.018841,-0.152668,-0.008931,0.129473,0.391638,0.029244
bathrooms,0.00516,0.525134,0.515884,1.0,0.754665,0.08774,0.500653,0.063744,0.187737,-0.124982,0.664983,0.685342,0.28377,0.506019,0.050739,-0.203866,0.024573,0.223042,0.568634,0.087175
sqft_living,-0.012258,0.702044,0.576671,0.754665,1.0,0.172826,0.353949,0.103818,0.284611,-0.058753,0.762704,0.876597,0.435043,0.318049,0.055363,-0.19943,0.052529,0.240223,0.75642,0.183286
sqft_lot,-0.132109,0.089655,0.031703,0.08774,0.172826,1.0,-0.005201,0.021604,0.07471,-0.008958,0.113621,0.183512,0.015286,0.05308,0.007644,-0.129574,-0.085683,0.229521,0.144608,0.718557
floors,0.018525,0.256786,0.175429,0.500653,0.353949,-0.005201,1.0,0.023698,0.029444,-0.263768,0.458183,0.523885,-0.245705,0.489319,0.006338,-0.059121,0.049614,0.125419,0.279885,-0.011269
waterfront,-0.002721,0.266331,-0.006582,0.063744,0.103818,0.021604,0.023698,1.0,0.401857,0.016653,0.082775,0.072075,0.080588,-0.026161,0.092885,0.030285,-0.014274,-0.04191,0.086463,0.030703
view,0.011592,0.397346,0.079532,0.187737,0.284611,0.07471,0.029444,0.401857,1.0,0.04599,0.251321,0.167649,0.276947,-0.05344,0.103917,0.084827,0.006157,-0.0784,0.280439,0.072575
condition,-0.023783,0.036392,0.028472,-0.124982,-0.058753,-0.008958,-0.263768,0.016653,0.04599,1.0,-0.144674,-0.158214,0.174105,-0.361417,-0.060618,0.003026,-0.014941,-0.1065,-0.092824,-0.003406


In [0]:
plt.figure(figsize = (15, 10))
sb.heatmap(X.corr(), annot = True) 
plt.title('Correlation Heatmap')
plt.show()

In [0]:
# This function will calculate VIF and drop highly correlated variables

from statsmodels.stats.outliers_influence import variance_inflation_factor    
def calculate_vif_(X, thresh):
  cols = X.columns
  variables = np.arange(X.shape[1])
  dropped=True
  while dropped:
      dropped=False
      c = X[cols[variables]].values
      vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]
      maxloc = vif.index(max(vif))
      if max(vif) > thresh:
          print('dropping \'' + X[cols[variables]].columns[maxloc] + '\' at index: ' + str(maxloc))
          variables = np.delete(variables, maxloc)
          dropped=True
  print('Remaining variables:')
  print(X.columns[variables])
  return X[cols[variables]]

In [9]:
# X is the 'housedf' dataset, minus the price column, which is what we're trying to predict using our model.
# 4 is the threshold we're setting for VIF

calculate_vif_(X, 4)



dropping 'sqft_living' at index: 3
dropping 'sqft_above' at index: 9
Remaining variables:
Index(['id', 'bedrooms', 'bathrooms', 'sqft_lot', 'floors', 'waterfront',
       'view', 'condition', 'grade', 'sqft_basement', 'yr_built',
       'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')


Unnamed: 0,id,bedrooms,bathrooms,sqft_lot,floors,waterfront,view,condition,grade,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,3,1.00,5650,1.0,0,0,3,7,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,3,2.25,7242,2.0,0,0,3,7,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,2,1.00,10000,1.0,0,0,3,6,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,4,3.00,5000,1.0,0,0,5,7,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,3,2.00,8080,1.0,0,0,3,8,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,3,2.50,1131,3.0,0,0,3,8,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,4,2.50,5813,2.0,0,0,3,8,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,2,0.75,1350,2.0,0,0,3,7,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,3,2.50,2388,2.0,0,0,3,8,0,2004,0,98027,47.5345,-122.069,1410,1287


Let's assess the heatmap. Using a threshold of 0.7 for our assessment, we can see which variables are highly correlated:



1. bathrooms vs sqft_living
2. sqft_living vs grade
3. sqft_living vs sqft_above
4. grade vs sqft_above

There's nothing unusual here. The number of bathrooms will go up as square footage goes up.
Also, assuming the grades are ordinal in nature, it makes sense that the higher-grade houses have more square footage. Once again, nothing unusual there.
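
Reading pairs off a heatmap by eye is error prone, so here is a minimal sketch of how the same 0.7 cut-off could be applied programmatically. The helper name `high_corr_pairs` and the synthetic `demo` frame are made up for illustration; they are not part of this notebook's data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.7):
    """Return (col_a, col_b, corr) for every column pair with |corr| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Synthetic stand-in data (NOT the housing dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # strongly tied to 'a'
    'c': rng.normal(size=200),                 # independent of both
})

pairs = high_corr_pairs(demo)
print(pairs)
```

On the real data the call would presumably be `high_corr_pairs(X)`, with `X` being the predictor frame used for the heatmap.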

These results are supported by the output of the custom VIF function above. It's no surprise that the two features it picked as having VIFs beyond our desired threshold, _sqft_living_ and _sqft_above_, both appear in the pairs listed.

All the measures of square footage are tied into almost everything else.
If we choose to remove one variable from each of the pairs listed above, we get the same result as we did with the VIF assessment function.
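
For intuition, the VIF of a feature is `1 / (1 - R²)`, where R² comes from regressing that feature on all the others (this is the formula echoed in the function's output above). Below is a rough NumPy sketch of that idea on synthetic data. The function `vif` is a hypothetical helper, and adding an intercept column is an assumption of this sketch, so its numbers need not match the statsmodels function exactly.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept: an assumption of this sketch
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic stand-in data (NOT the housing dataset)
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)  # nearly the sum of x1 and x2
x4 = rng.normal(size=500)                       # unrelated to the rest
X_demo = np.column_stack([x1, x2, x3, x4])

vifs = [vif(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)
```

A feature that is an exact linear combination of the others has R² = 1, which makes the denominator zero and the VIF infinite.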

We can remove the sqft_living and sqft_above features and see how they affect our accuracy, compared to accuracy when we leave them in.




# Modeling
## Multiple Linear Regression

In [0]:
# Subsetting the data
X = housedf[['bedrooms', 'sqft_lot', 'sqft_above', 'condition', 'yr_built','grade','sqft_basement']]
y = housedf['price']

I feel that removing the _sqft_living_ and _sqft_above_ variables is a crazy thing to do, seeing as they're our two most prominent measures of square footage, but the accuracy will tell all.

In [0]:
# Dividing our data into training and test sets
# ---
# 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [12]:
# Training the Algorithm
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [13]:
# Coefficients learned for each feature from the training set.
## PERSONAL NOTE ##
# Coefficient = slope

coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

Unnamed: 0,Coefficient
bedrooms,-41929.781901
sqft_lot,-0.329274
sqft_above,197.264599
condition,19973.750571
yr_built,-3349.760677
grade,142619.467402
sqft_basement,210.115954


Below are our predictions compared side by side with the actual, expected values

In [0]:
# Making Predictions
# 
y_pred = regressor.predict(X_test)

# To compare the  output values (without sqft_living and sqft_above)
# 
price_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
price_df

Unnamed: 0,Actual,Predicted
20188,602000.0,452347.721441
7573,320000.0,373921.914343
12873,245000.0,442744.430295
209,464000.0,629456.463528
19155,190000.0,12621.214042
...,...,...
11080,368888.0,422043.965175
18910,734200.0,650067.034180
15838,411000.0,702926.979414
8695,410000.0,705791.314014


Oof! The predictions are way off. As stated before, the two features that the VIF function recommended dropping are our key measures of square footage, so it is hard to make good predictions without them, and this table bears that out.

Let's try it with just the _sqft_above_ feature, which has a lower correlation index compared to _sqft_living_. Modifications are made by adding and subtracting features from X and y above.

In [14]:
# This is the version that includes sqft_above ONLY
# Making Predictions
# 
y_pred = regressor.predict(X_test)

# To compare the  output values 
# 
price_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
price_df

Unnamed: 0,Actual,Predicted
957,323000.0,4.995287e+05
14143,690000.0,6.330798e+05
19452,142000.0,-6.493529e+05
20510,560000.0,3.358024e+05
18354,545000.0,3.550805e+05
...,...,...
5349,386591.0,3.817564e+05
11082,344950.0,5.254779e+05
4413,1150000.0,1.129544e+06
17461,622200.0,4.243678e+05


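At a glance our results look better, although a handful of rows is only a rough guide (note the negative prediction for row 19452). Next we can look at the version where we keep both features. We will do proper accuracy checks later as well.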

In [39]:
# This is the version that includes both sqft_above and sqft_living
# Making Predictions
# 
y_pred = regressor.predict(X_test)

# To compare the  output values 
# 
price_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
price_df

Unnamed: 0,Actual,Predicted
20188,602000.0,342260.514727
7573,320000.0,373414.905415
12873,245000.0,378083.784647
209,464000.0,549113.424868
19155,190000.0,140518.290604
...,...,...
11080,368888.0,424518.621360
18910,734200.0,750301.931671
15838,411000.0,587348.015428
8695,410000.0,666443.556808


In [32]:
# This is the version that includes sqft_living alone

# 
y_pred = regressor.predict(X_test)

# To compare the  output values 
# 
price_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
price_df

Unnamed: 0,Actual,Predicted
20188,602000.0,342260.514727
7573,320000.0,373414.905415
12873,245000.0,378083.784647
209,464000.0,549113.424868
19155,190000.0,140518.290604
...,...,...
11080,368888.0,424518.621360
18910,734200.0,750301.931671
15838,411000.0,587348.015428
8695,410000.0,666443.556808


Judging just from observation, the predictions with _sqft_living_ alone are identical to those we get when both features are included.

## Evaluating our linear regression model

I will do three versions of these tests, just to see in numbers the difference between the model when you include only one of our variables in question and when you include both.

In [40]:
# Evaluating the Algorithm, when you include both features
# ---
# 
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 148921.50835837732
Mean Squared Error: 57719331207.42597
Root Mean Squared Error: 240248.4780543385


In [33]:
# Evaluating the Algorithm, when you include ONLY sqft_living
# ---
# 
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 148921.50835837715
Mean Squared Error: 57719331207.42588
Root Mean Squared Error: 240248.4780543383


In [15]:
# Evaluating the Algorithm, when you include ONLY sqft_above
# ---
# 
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 148317.1684772434
Mean Squared Error: 57090904671.09287
Root Mean Squared Error: 238937.03076562425


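All three versions described above produce nearly the same MAE, MSE and RMSE. The "both" and "_sqft_living_ only" runs are identical to several decimal places, while the "_sqft_above_ only" run is marginally lower (RMSE of about 238,937 vs about 240,248).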

Going forward I'll use ONLY _sqft_above_.
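
Rather than editing `X` by hand and re-running the cells, the feature sets could be compared in one loop with the split held fixed. The sketch below does this on synthetic data. The assumption that living area equals above-ground plus basement area is mine, as are the names `demo`, `feature_sets` and `mae`; under that assumption all three sets carry the same information once _sqft_basement_ is included, so matching metrics are what one would expect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; living = above + basement is an ASSUMPTION of this sketch
rng = np.random.default_rng(10)
n = 1000
above = rng.uniform(500, 3000, n)
basement = rng.uniform(0, 1000, n)
demo = pd.DataFrame({
    'sqft_above': above,
    'sqft_basement': basement,
    'sqft_living': above + basement,
    'grade': rng.integers(3, 13, n),
})
demo['price'] = 200 * demo['sqft_living'] + 50000 * demo['grade'] + rng.normal(0, 50000, n)

feature_sets = {
    'above only': ['sqft_above', 'sqft_basement', 'grade'],
    'living only': ['sqft_living', 'sqft_basement', 'grade'],
    'both': ['sqft_above', 'sqft_living', 'sqft_basement', 'grade'],
}

mae = {}
for name, cols in feature_sets.items():
    # Same random_state every time, so each feature set is scored on the same rows
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[cols], demo['price'], test_size=0.2, random_state=10)
    model = LinearRegression().fit(X_tr, y_tr)
    mae[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(name, round(mae[name], 2))
```

Keeping the split fixed matters: if the rows in the test set change between runs, small differences in the metrics can come from the split rather than from the features.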

In [47]:
# R2 score

from sklearn.metrics import r2_score
m = r2_score(y_test, y_pred)
m

0.6102276905307119
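
To make the metrics above less of a black box, here is a small worked example computing them by hand with NumPy on four made-up values. Note that the R2 score is the share of variance in the target that the model explains, not a percentage of correct predictions.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_hat - y_true
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # same units as the target
# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```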

In [0]:
# Plotting the residual plot
# Residuals have been calculated by subtracting the test value from the predicted value
# 

residuals = np.subtract(y_pred, y_test)

# Plotting the residual scatterplot
#
plt.scatter(y_pred, residuals, color='black')
plt.title('Residual Plot')
plt.ylabel('residual')
plt.xlabel('fitted values')
plt.axhline(y = float(residuals.mean()), color='red', linewidth=1)
plt.show()


In [19]:

# Bartlett's test for heteroskedasticity.


from scipy import stats
test_result, p_value = stats.bartlett(y_pred, residuals)
# To interpret the results we must also compute a critical value of the chi squared distribution
# Note: if p_value is 0, probability below is 1 and chi2.ppf returns inf,
# in which case the comparison further down can never reject the null hypothesis

degree_of_freedom = len(y_pred)-1
probability = 1 - p_value
critical_value = stats.chi2.ppf(probability, degree_of_freedom)
print(critical_value)

# If the test_result is greater than the critical value, then we reject our null
# hypothesis. This would mean that there are patterns to the variance of the data.
# Otherwise, we can identify no patterns, and we fail to reject the null hypothesis that
# the variance is homogeneous across our data
if (test_result > critical_value):
  print('the variances are unequal, and the model should be reassessed')
else:
  print('the variances are homogeneous!')

inf
the variances are homogeneous!
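
A critical value of `inf` deserves a second look, since `chi2.ppf` returns infinity when the probability passed in is 1, and no test statistic can exceed that. A simpler way to read the test is to compare the p-value directly with a significance level. The sketch below is an alternative I'm suggesting, not the method used above: it splits the residuals into a low-fitted and a high-fitted half and asks Bartlett's test whether the two halves share a variance. The data, the 50/50 split and `alpha = 0.05` are all assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in: the residual spread grows with the fitted value
rng = np.random.default_rng(42)
fitted = rng.uniform(1.0, 10.0, 500)
resid = rng.normal(0.0, 1.0, 500) * fitted

# Split residuals by fitted value: lower half vs upper half
order = np.argsort(fitted)
low, high = resid[order[:250]], resid[order[250:]]

# Bartlett's test: the null hypothesis is that both groups have equal variance
stat, p_value = stats.bartlett(low, high)
alpha = 0.05  # assumed significance level
heteroskedastic = p_value < alpha

print(stat, p_value, heteroskedastic)
```

A small p-value here means the spread of the residuals differs between the two halves, which is the pattern a funnel-shaped residual plot would show.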


# Quantile Regression

In [21]:
import statsmodels.formula.api as smf

# Dividing our data into training and test sets
# ---
# 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# Finding the regression coefficients for the conditional median, 0.5 quantile
# Note: the model is fitted on the full dataset (y and X), not on the training split above
#
mod = smf.quantreg('y ~ X', housedf)
res = mod.fit(q=.5)

# Then print out the summary of our model
#
print(res.summary())

                         QuantReg Regression Results                          
Dep. Variable:                      y   Pseudo R-squared:               0.3647
Model:                       QuantReg   Bandwidth:                   2.524e+04
Method:                 Least Squares   Sparsity:                    3.483e+05
Date:                Sun, 02 Feb 2020   No. Observations:                21613
Time:                        19:32:40   Df Residuals:                    21605
                                        Df Model:                            7
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   2.756e+06   9.56e+04     28.820      0.000    2.57e+06    2.94e+06
X[0]       -2.756e+04   1588.628    -17.348      0.000   -3.07e+04   -2.44e+04
X[1]          -0.0937      0.029     -3.201      0.001      -0.151      -0.036
X[2]         132.2671      2.589     51.091      0.0
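
As a reminder of what the 0.5 quantile means here: quantile regression minimises the so-called pinball loss instead of squared error, and at q = 0.5 that loss is minimised by the median. The toy sketch below (the helper `pinball_loss` and the five values are made up) shows the median winning over the mean when there is one extreme value.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Average quantile (pinball) loss of a prediction `pred` at quantile q."""
    diff = y - pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Made-up values with one extreme outlier
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Try every constant prediction from 0 to 100 and keep the one with the lowest loss
candidates = np.arange(0.0, 101.0)
losses = [pinball_loss(y_demo, c, 0.5) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(best, np.median(y_demo), y_demo.mean())
```

This robustness to extreme values is one reason a median fit can be attractive for house prices, where a few very expensive properties can pull a mean-based fit upwards.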



#### Making predictions

In [22]:
 
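# Note: `regressor` is the LinearRegression model fitted earlier, not the quantile model `res`,
# so these are the linear model's predictions
y_pred = regressor.predict(x_test)

# To compare the actual output values for x_test with the predicted values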
# 
quan = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
quan

Unnamed: 0,Actual,Predicted
957,323000.0,4.995287e+05
14143,690000.0,6.330798e+05
19452,142000.0,-6.493529e+05
20510,560000.0,3.358024e+05
18354,545000.0,3.550805e+05
...,...,...
5349,386591.0,3.817564e+05
11082,344950.0,5.254779e+05
4413,1150000.0,1.129544e+06
17461,622200.0,4.243678e+05


#### Evaluating the Algorithm


In [23]:

# 
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 148317.1684772434
Mean Squared Error: 57090904671.09287
Root Mean Squared Error: 238937.03076562425


In [24]:
#R2 score
from sklearn.metrics import r2_score
q = r2_score(y_test, y_pred)
q

0.6096865094006017

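Our R2 score is just about the same as with the linear model: 0.6097 vs 0.6102, roughly 61% in both cases.

That is expected, because the predictions above were generated with `regressor` (the linear model) rather than with the quantile model `res`, which is also why the MAE, MSE and RMSE match the _sqft_above_ linear results exactly.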

In [0]:
# Plotting the residual plot
# Residuals have been calculated by subtracting the test value from the predicted value
# 

residuals = np.subtract(y_pred, y_test)
# Plotting the residual scatterplot
#
plt.scatter(y_pred, residuals, color='black')
plt.title('Residual Plot')
plt.ylabel('residual')
plt.xlabel('fitted values')
plt.axhline(y = float(residuals.mean()), color='red', linewidth=1)
plt.show()

# Ridge Regression

In [0]:

# Importing our libraries
# 
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [0]:

# Setting predictors and predicted features
#
x = housedf[['bedrooms', 'bathrooms', 'sqft_above', 'sqft_lot', 'floors', 'condition', 'grade', 'sqft_basement', 'sqft_living15',
'sqft_lot15']]
y = housedf['price']

In [0]:
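# Splitting the dataset into training and testing sets, then standardizing the predictors
# The scaler is fitted on the training set only and applied to both sets
#
from sklearn.preprocessing import StandardScaler
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 10)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)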

In [30]:
# Creating our baseline regression model
# This is a model that has no regularization
# 
regression = LinearRegression()
regression.fit(x_train,y_train)
y_pred=regression.predict(x_test)
first_model = (mean_squared_error(y_test, y_pred))
# first_model = (mean_squared_error(y_true=y,y_pred=regression.predict(x_test)))
print(first_model)

62903720909.19179


In [0]:
# determining the most appropriate value for the l2 regularization.
 
ridge = Ridge(normalize=True)
search = GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

In [32]:
# We now use the .fit method to run the search and then use the .best_params_ and
# .best_score_ attributes to determine the model's strength.
# 
search.fit(x_train,y_train)
search.best_params_

{'alpha': 0.001}

In [33]:
search.best_score_

-57942891435.690384
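
Two things are worth keeping in mind when reading this score. First, `best_score_` is the mean cross-validated score (negative MSE) over the training folds, so it is not directly comparable with an MSE computed on the held-out test set. Second, the `normalize` argument of `Ridge` was removed in scikit-learn 1.2, so on newer versions the scaling has to happen in a separate step. Below is a sketch of one way to set that up with a `Pipeline` on synthetic data. The names `pipe`, `grid` and `search_demo` are made up, and the alpha found this way is not interchangeable with the one found using `normalize=True`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(300, 5)) * np.array([1, 10, 100, 1000, 10000])
y_demo = X_demo @ np.array([5.0, 0.5, 0.05, 0.005, 0.0005]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=10)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])
grid = {'ridge__alpha': np.logspace(-5, 2, 8)}
search_demo = GridSearchCV(pipe, param_grid=grid,
                           scoring='neg_mean_squared_error', cv=10)
search_demo.fit(X_tr, y_tr)

print(search_demo.best_params_)
print(-search_demo.best_score_)        # cross-validated MSE on the training folds
print(-search_demo.score(X_te, y_te))  # MSE on the held-out test set
```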

In [34]:
# We can confirm this by fitting our model with the ridge information and finding the mean squared error below
#
ridge = Ridge(normalize=True,alpha=0.001)
ridge.fit(x_train,y_train)
second_model = (mean_squared_error(y_true=y_test,y_pred=ridge.predict(x_test)))
print(second_model)

62909465959.67396


In [35]:
# Making predictions
#
y_pred = ridge.predict(x_test)
y_pred
# To compare the actual output values for x_test with the predicted values
# 
r = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
r

Unnamed: 0,Actual,Predicted
957,323000.0,5.085278e+05
14143,690000.0,5.376131e+05
19452,142000.0,-5.099233e+05
20510,560000.0,4.025400e+05
18354,545000.0,2.616187e+05
...,...,...
6939,800000.0,7.342564e+05
19910,430000.0,4.225951e+05
20466,750000.0,7.696969e+05
14961,1480000.0,1.156522e+06


In [36]:
# Checking accuracy using R2 score


from sklearn.metrics import r2_score
r = r2_score(y_test, y_pred)
r

0.5530780215725002

# Conclusion

Using the models above, our R2 score fluctuated between 55% and 61%, with Ridge regression scoring lowest. It is worth noting, though, that the Ridge model was not set up like the others: it added _bathrooms_, _floors_, _sqft_living15_ and _sqft_lot15_, left out _yr_built_, and used a 25% test split instead of 20%. So the comparison is not like for like, but it does suggest that increasing the complexity of the dataset (more features) may not necessarily yield better accuracy.

This accuracy can be improved by scrutinizing our features more, and deciding what to leave out.

Also, if we were to assign weights to the different features, anything to do with square footage would carry far more sway than the others.