# Linear Regression using the Housing Market in King County USA

In this example we look at the housing market in King County, USA. We will try to predict the prices of houses in the county from attributes of those houses. Attributes that could have an effect on the price include square footage, number of bedrooms, number of bathrooms, and so on. In machine learning these attributes are called features. Which features might make a house more expensive? Because we are predicting a continuous value (the price) from a set of features, this is a regression problem, and here we treat it as a linear regression problem.

In [3]:
import numpy as np
import pandas as pd
from sklearn import preprocessing,linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [4]:
df = pd.read_csv('kc_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [5]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
id               21613 non-null int64
date             21613 non-null object
price            21613 non-null float64
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
grade            21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
lat              21613 non-null float64
long             21613 non-null float64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.4+ MB


In [7]:
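# Drop the columns that are not used as features; price is the target.
# The columns= keyword replaces the positional axis argument, which newer pandas versions reject.
dfx = df.drop(columns=['date', 'id', 'price'])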

In [8]:
dfx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 18 columns):
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
grade            21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
lat              21613 non-null float64
long             21613 non-null float64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: float64(4), int64(14)
memory usage: 3.0 MB


In [9]:
y = df['price']
y.head()

0    221900.0
1    538000.0
2    180000.0
3    604000.0
4    510000.0
Name: price, dtype: float64

In [10]:
x = np.array(dfx)
y = np.array(y)

In [11]:
x = preprocessing.scale(x)
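
`preprocessing.scale` standardizes each feature column to zero mean and unit variance. A small sketch, using only the `bedrooms` and `sqft_living` values from the first four rows shown by `df.head()` above (the variable names `features`, `scaled`, and `manual` are illustrative), checks this against the formula written out by hand:

```python
import numpy as np
from sklearn import preprocessing

# bedrooms and sqft_living for the first four rows of the dataset preview.
features = np.array([[3.0, 1180.0],
                     [3.0, 2570.0],
                     [2.0, 770.0],
                     [4.0, 1960.0]])

scaled = preprocessing.scale(features)

# Same result by hand: subtract the column mean, divide by the column standard deviation.
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.mean(axis=0))          # approximately zero for every column
print(scaled.std(axis=0))           # approximately one for every column
print(np.allclose(scaled, manual))  # the two computations agree
```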

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y)

In [13]:
clf = linear_model.LinearRegression()

In [14]:
clf.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [15]:
# score() returns the coefficient of determination (R^2) on the test set.
confidence = clf.score(x_test, y_test)
print(confidence)

0.698402047409
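
The number printed by `clf.score` is the coefficient of determination, R^2, rather than a confidence in the statistical sense. A minimal sketch, assuming synthetic data in place of `kc_house_data.csv` (the feature matrix, coefficients, and noise level below are made up for illustration), shows that `score` agrees with `r2_score` and how `mean_squared_error` gives an error in dollars:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
true_coef = np.array([50000.0, 20000.0, -10000.0])  # made-up coefficients
y = x @ true_coef + 300000.0 + rng.normal(scale=25000.0, size=500)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

clf = linear_model.LinearRegression()
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
r2_from_score = clf.score(x_test, y_test)   # R^2 computed by the model
r2_from_metric = r2_score(y_test, y_pred)   # same quantity from sklearn.metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # typical error, in dollars

print("R^2 via score():  ", r2_from_score)
print("R^2 via r2_score: ", r2_from_metric)
print("RMSE in dollars:  ", rmse)
```

An R^2 of 1.0 would mean the model explains all of the variance in the test prices; the RMSE is often easier to interpret because it is in the same units as the price.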


In [33]:
# predict() expects a 2D array of shape (n_samples, n_features)
clf.predict(x_test[0].reshape(1, -1))



array([ 417825.96464797])

In [34]:
x_test[0]

array([-0.39873715,  1.47406291, -0.62052215, -0.33467207,  2.78843855,
       -0.08717263, -0.30575946, -0.62918687, -0.55883575, -0.33619148,
       -0.65868104,  1.22545173, -0.21012839,  1.02908966,  1.22146986,
       -0.91676561, -0.69531597, -0.42204975])

In [35]:
# Residual for the first test house: actual price minus predicted price
y_test[0] - clf.predict(x_test[0].reshape(1, -1))



array([-97875.96464797])

In [None]:
Simply put, you may remember linear regression as the equation of a line:

y = m*x + b

Not too difficult, huh? So how can we use that equation to predict house prices in King County? Well, the y value is the price of the house. Okay, we're good so far, but what about m and x? Remember that m is the slope of the line and x is the input value, which here is an attribute of the house. Using that logic, let's think of an attribute of a house that would have a bearing on its price. Number of bedrooms is a good feature: intuitively, more bedrooms means a higher price. OK, let's see how this would work:

y = m*x + b

x will be the number of bedrooms
b, the intercept, will be zero for now
m will be the slope, which we will have to calculate

For a simple example let's consider the following:

A 1-bedroom house costs $100,000
A 2-bedroom house costs $200,000
A 3-bedroom house costs $300,000

OK, cool. Now let's use a slightly more complicated equation you learned back in middle school:

m = (



Fitting the line that would make: 

100000 = 1*m
200000 = 2*m
300000 = 3*m
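
All three equations are satisfied by the same slope. A minimal sketch, assuming the toy prices above and b fixed at zero (the variable names are illustrative), computes m two ways: as rise over run between two of the points, and as the least-squares slope of a line through the origin:

```python
import numpy as np

# Toy data from the text: x is the room count, y is the price in dollars.
x = np.array([1.0, 2.0, 3.0])
y = np.array([100000.0, 200000.0, 300000.0])

# Slope from two points: change in y divided by change in x.
m_two_points = (y[1] - y[0]) / (x[1] - x[0])

# Least-squares slope with the intercept b fixed at zero: m = sum(x*y) / sum(x*x).
m_least_squares = np.sum(x * y) / np.sum(x * x)

print(m_two_points)
print(m_least_squares)

# With b = 0, the predicted price for x = 4 is simply m * 4.
print(m_least_squares * 4)
```

Because the toy points lie exactly on a line through the origin, both calculations give the same slope. On real data the points will not line up this neatly, which is why a least-squares fit is used.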