<h1>Housing Price Prediction</h1>

<p>We will predict housing prices using a dataset of 100,000 entries. Prices are recorded for 1995 through 2015; the task is to predict housing prices for 2016.</p>

<h2>Exploratory Data Analysis:</h2>

In [27]:
#Let's start with our fundamental libraries Pandas and Numpy
import numpy as np
import pandas as pd

In [28]:
data = pd.read_csv('data.csv')

In [29]:
print(data.shape) #let's better understand the shape of our data
print(data.head()) #let's examine a few initial lines

(100000, 13)
   ID     Price        Date  Postcode Property_Type Old_New Duration  \
0   0  117000.0  1995-01-01  SW17 9QF             T       N        F   
1   1   40000.0  1995-01-03   EN1 1DN             F       N        L   
2   2   31000.0  1995-01-03  DE22 3SE             T       N        F   
3   3   76000.0  1995-01-03    W2 6HD             F       N        L   
4   4   19000.0  1995-01-03   M28 3LB             T       N        L   

                                     Street Locality        Town  \
0                      196 CROWBOROUGH ROAD   LONDON      LONDON   
1                    7 POYNTER ROAD, FLAT B      NaN     ENFIELD   
2                           69 WOLFA STREET    DERBY       DERBY   
3  READING HOUSE, HALLFIELD ESTATE, FLAT 32   LONDON      LONDON   
4                          29 ALFRED STREET  WORSLEY  MANCHESTER   

              District              County PPD_Category_Type  
0           WANDSWORTH      GREATER LONDON                 A  
1              ENFI

<p>Initially, there are a few important things to note. First, we'll need to handle the Date column as datetimes. Second, the ID column looks redundant, since it duplicates the default row index.</p>

In [30]:
data.set_index('ID') #preview the frame indexed by ID; set_index returns a new DataFrame, so data itself keeps its ID column

ID,Price,Date,Postcode,Property_Type,Old_New,Duration,Street,Locality,Town,District,County,PPD_Category_Type
0,117000.0,1995-01-01,SW17 9QF,T,N,F,196 CROWBOROUGH ROAD,LONDON,LONDON,WANDSWORTH,GREATER LONDON,A
1,40000.0,1995-01-03,EN1 1DN,F,N,L,"7 POYNTER ROAD, FLAT B",,ENFIELD,ENFIELD,GREATER LONDON,A
2,31000.0,1995-01-03,DE22 3SE,T,N,F,69 WOLFA STREET,DERBY,DERBY,DERBY,DERBYSHIRE,A
3,76000.0,1995-01-03,W2 6HD,F,N,L,"READING HOUSE, HALLFIELD ESTATE, FLAT 32",LONDON,LONDON,CITY OF WESTMINSTER,GREATER LONDON,A
4,19000.0,1995-01-03,M28 3LB,T,N,L,29 ALFRED STREET,WORSLEY,MANCHESTER,SALFORD,GREATER MANCHESTER,A
5,85000.0,1995-01-03,BN15 8LJ,D,N,F,206 BRIGHTON ROAD,LANCING,LANCING,ADUR,WEST SUSSEX,A
6,34000.0,1995-01-03,FY4 2LW,S,N,F,"50A, BELVERE AVENUE",BLACKPOOL,BLACKPOOL,BLACKPOOL,BLACKPOOL,A
7,100000.0,1995-01-03,TW5 0QN,S,N,F,33 ALDERNEY AVENUE,HOUNSLOW,HOUNSLOW,HOUNSLOW,GREATER LONDON,A
8,125000.0,1995-01-04,BN6 9RH,T,N,F,"YEOMANS, WEST FURLONG LANE",HURSTPIERPOINT,HASSOCKS,MID SUSSEX,WEST SUSSEX,A
9,93000.0,1995-01-04,RG40 1RH,S,N,F,49 BEAN OAK ROAD,WOKINGHAM,WOKINGHAM,WOKINGHAM,WOKINGHAM,A


In [31]:
type(data['Date'][0]) #We need to check if the Date section is already a datetime object

str

In [32]:
data['Date'] = pd.to_datetime(data['Date']) #It's not a datetime object. So, we need to convert it.
type(data['Date'][0]) #And now we check that it worked

pandas.tslib.Timestamp
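
<p>As a small aside, the conversion can also be checked at the column level rather than on the first element. The sketch below uses a toy frame with made-up values (not the real data) and assumes that unparseable strings should become missing values rather than raise an error.</p>

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({'Date': ['1995-01-01', '1995-01-03', 'not a date']})

# errors='coerce' turns unparseable strings into NaT instead of raising.
toy['Date'] = pd.to_datetime(toy['Date'], errors='coerce')

# Column-level checks: the dtype of the whole column and the number of failed parses.
print(toy['Date'].dtype)
n_bad = int(toy['Date'].isna().sum())
print(n_bad)

# The .dt accessor exposes the year used later for grouping.
print(toy['Date'].dt.year)
```

<p>Counting the NaT values after conversion is a cheap way to see whether any dates failed to parse.</p>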

<p>We'll need to calculate an inflation index when looking at real estate prices. That's because real estate prices vary with the money supply, so prices from different years are not directly comparable.</p>

<h2>Calculate Inflation Index: Normalize Prices</h2>

<p> We'll use 2015 as our price baseline. That's because 2015 prices will most closely reflect 2016 prices. To calculate our inflation index, we'll need to group the dataframe by year.</p>
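
<p>The normalization can be written out as follows. The symbols are notation introduced for this note only: \bar{P}_t is the mean sale price in year t, I_t is the price index for that year, P_i is the recorded price of sale i, and t(i) is the year of that sale.</p>

```latex
I_t = \frac{\bar{P}_t}{\bar{P}_{2015}},
\qquad
P_i^{(2015)} = \frac{P_i}{I_{t(i)}} = P_i \cdot \frac{\bar{P}_{2015}}{\bar{P}_{t(i)}}
```

<p>By construction I_{2015} = 1, so 2015 prices are left unchanged and earlier prices are scaled up to 2015 levels.</p>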

In [33]:
ydata = data.groupby(data['Date'].dt.year)['Price'].agg(['mean']) #Group data by year & aggregate the Price column by mean. 
print(ydata)

               mean
Date               
1995   68660.632542
1996   73383.390315
1997   78657.197796
1998   87103.476171
1999   98177.520404
2000  106836.469288
2001  120481.960712
2002  138926.335859
2003  153174.375533
2004  177231.923583
2005  186010.195248
2006  204160.713852
2007  220733.704603
2008  214670.558594
2009  211182.480743
2010  235332.610411
2011  225632.925040
2012  244244.278894
2013  260611.373401
2014  266799.890543
2015  280843.503962
2016            NaN


In [34]:
#Let's divide each year's mean price by the 2015 mean to get the pricing index for that year
ydata = ydata/ydata.loc[2015,'mean']
ydata.columns = ['Pindex']
print(ydata)

        Pindex
Date          
1995  0.244480
1996  0.261296
1997  0.280075
1998  0.310150
1999  0.349581
2000  0.380413
2001  0.429000
2002  0.494675
2003  0.545408
2004  0.631070
2005  0.662327
2006  0.726955
2007  0.785967
2008  0.764378
2009  0.751958
2010  0.837949
2011  0.803412
2012  0.869681
2013  0.927959
2014  0.949995
2015  1.000000
2016       NaN


In [35]:
#Now multiply individual housing prices by 1/Pindex for that year. A vectorized implementation would be better, but this is a timed exercise.
count = 0
for index,row in data.iterrows():
    count += 1
    if count % 10000 == 0: #progress check: print the year and original price of every 10,000th row
        print(row['Date'].year)
        print(row['Price'])
    yr = row['Date'].year
    newprice = row['Price']*1/ydata.loc[yr,'Pindex']
    data.at[index,'Price'] = round(newprice,2) #DataFrame.set_value was removed from pandas; .at is the replacement

1997
67500.0
1999
46000.0
2001
42000.0
2002
220000.0
2004
335500.0
2006
169950.0
2007
160000.0
2011
155000.0
2014
122500.0
2016
nan
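
<p>For reference, here is a minimal sketch of what a vectorized version of the adjustment could look like. It runs on a toy frame with made-up prices; the names <code>toy</code>, <code>pindex</code>, and <code>row_index</code> are illustrative and do not appear in the notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; the real frame has 100,000 rows.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['1995-01-01', '1995-06-01', '2015-03-01', '2016-02-01']),
    'Price': [100.0, 300.0, 800.0, np.nan],
})

# Yearly mean price divided by the 2015 mean gives the index.
yearly_mean = toy.groupby(toy['Date'].dt.year)['Price'].mean()
pindex = yearly_mean / yearly_mean.loc[2015]

# Look up each row's index by its year, then divide the whole column at once.
row_index = toy['Date'].dt.year.map(pindex)
toy['Price'] = (toy['Price'] / row_index).round(2)
print(toy)
```

<p>Mapping the year onto the index Series replaces the per-row lookup, so the division happens in a single column operation instead of 100,000 Python-level iterations.</p>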


In [36]:
print(data.head())

   ID      Price       Date  Postcode Property_Type Old_New Duration  \
0   0  478566.67 1995-01-01  SW17 9QF             T       N        F   
1   1  163612.54 1995-01-03   EN1 1DN             F       N        L   
2   2  126799.71 1995-01-03  DE22 3SE             T       N        F   
3   3  310863.82 1995-01-03    W2 6HD             F       N        L   
4   4   77715.95 1995-01-03   M28 3LB             T       N        L   

                                     Street Locality        Town  \
0                      196 CROWBOROUGH ROAD   LONDON      LONDON   
1                    7 POYNTER ROAD, FLAT B      NaN     ENFIELD   
2                           69 WOLFA STREET    DERBY       DERBY   
3  READING HOUSE, HALLFIELD ESTATE, FLAT 32   LONDON      LONDON   
4                          29 ALFRED STREET  WORSLEY  MANCHESTER   

              District              County PPD_Category_Type  
0           WANDSWORTH      GREATER LONDON                 A  
1              ENFIELD      GREA

<p>Great, now we have our prices set to the 2015 price index. Let's start our learning on the data set.</p>

<h2>Random Forest Regressor:</h2>

<p>First, we need to one-hot encode the categorical columns.</p>

In [37]:
#pd.get_dummies on the full frame keeps crashing the kernel. This is likely a memory issue. So, we encode one column at a time and concat the result.

#First let's drop the location columns and the Date column
data = data.drop(['Street', 'Locality', 'District', 'County', 'Postcode', 'Town', 'Date'], axis=1)

#Encode each categorical column, append the dummy columns, then drop the original column
data = pd.concat([data, pd.get_dummies(data['Property_Type'],prefix = 'Property_Type_')],axis=1)
data = data.drop('Property_Type', axis=1)
data = pd.concat([data, pd.get_dummies(data['Old_New'],prefix = 'Old_New_')],axis=1)
data = data.drop('Old_New', axis=1)
data = pd.concat([data, pd.get_dummies(data['Duration'],prefix = 'Duration_')],axis=1)
data = data.drop('Duration', axis=1)
data = pd.concat([data, pd.get_dummies(data['PPD_Category_Type'],prefix = 'PPD_Category_Type_')],axis=1)
data = data.drop('PPD_Category_Type', axis=1)
print("One-hot encoded the data!")

One-hot encoded the data!
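
<p>Once the high-cardinality columns are gone, the per-column concat pattern can likely be collapsed into one call. The sketch below is an assumption about how that might look, shown on a toy frame with made-up rows; it uses the <code>columns</code> and <code>dtype</code> parameters of <code>pd.get_dummies</code>.</p>

```python
import pandas as pd

# Toy frame with hypothetical rows; only two of the categorical columns are shown.
toy = pd.DataFrame({
    'Price': [100.0, 200.0, 300.0],
    'Property_Type': ['T', 'F', 'T'],
    'Old_New': ['N', 'N', 'Y'],
})

# columns= encodes the listed columns and drops the originals in one step.
# dtype=float gives 0.0/1.0 indicator columns.
encoded = pd.get_dummies(toy, columns=['Property_Type', 'Old_New'], dtype=float)
print(encoded.columns.tolist())
```

<p>With the default prefix and separator, the new columns are named after the source column and the category, for example <code>Property_Type_T</code>.</p>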


In [38]:
print(data.shape)
print(data.head())

(100000, 14)
   ID      Price  Property_Type__D  Property_Type__F  Property_Type__O  \
0   0  478566.67               0.0               0.0               0.0   
1   1  163612.54               0.0               1.0               0.0   
2   2  126799.71               0.0               0.0               0.0   
3   3  310863.82               0.0               1.0               0.0   
4   4   77715.95               0.0               0.0               0.0   

   Property_Type__S  Property_Type__T  Old_New__N  Old_New__Y  Duration__F  \
0               0.0               1.0         1.0         0.0          1.0   
1               0.0               0.0         1.0         0.0          0.0   
2               0.0               1.0         1.0         0.0          1.0   
3               0.0               0.0         1.0         0.0          0.0   
4               0.0               1.0         1.0         0.0          0.0   

   Duration__L  Duration__U  PPD_Category_Type__A  PPD_Category_Type__B  

<p>Great. Our data is cleaned and encoded. Now, let's fit a random forest model to predict housing prices.</p>

In [39]:
#now, let's cut the data into training and evaluation sets. We'll subsequently split the training data as well.
mdata = data[:98818] #a 98,818-row slice is what yields the (74113, 13) training shape printed below
tdata = data[98818:99999]

#segment the label data and the feature data for training
labels = np.array(mdata['Price'])
features= mdata.drop('Price', axis = 1)
feature_list = list(features.columns)
features = np.array(features)

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)
print(train_features.shape)

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 500 decision trees
rf = RandomForestRegressor(n_estimators = 500, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);

#Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'pounds.')


(74113, 13)
Mean Absolute Error: 157573.33 pounds.
