# Regression modelling


Before solving a problem, it helps to identify what type of problem it is and which kind of solution fits it.  
In this case, the problem is to estimate the price of a house from a handful of variables. The target (price) is continuous rather than categorical, so this is a regression problem, not a classification problem.  
That is why we are implementing a regression model.

In [1]:
# Import necessary packages.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as tts
import pickle
import json


# Load the previously processed dataframe
%store -r df5

In [2]:
df = df5
df.head()

Unnamed: 0,location,total_sqft,bath,price,bedrooms
0,1st Block Jayanagar,2850.0,4.0,428.0,4
1,1st Block Jayanagar,1630.0,3.0,194.0,3
2,1st Block Jayanagar,1875.0,2.0,235.0,3
3,1st Block Jayanagar,1200.0,2.0,130.0,3
4,1st Block Jayanagar,1235.0,2.0,148.0,2


There are over 200 locations in the dataset, and they are all strings.<br>
A regression model works on numbers, not strings (words), hence<br>
I will use a one-hot encoding approach, facilitated by `pd.get_dummies`, to create<br>
a new dataframe of zeros and ones that represents the locations as numbers instead of strings.<br>
For example, if the location is 'Electronic City', the 'Electronic City' column will have a value of 1 and every<br>
other column will be zero.
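To make the encoding concrete, here is a minimal sketch on a toy dataframe. The three rows are made up for illustration (only the location names are borrowed from the dataset), and `dtype=int` is passed so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data; not taken from the real dataframe.
toy = pd.DataFrame({"location": ["Electronic City", "Whitefield", "Electronic City"]})

# One column per distinct location; each row has exactly one 1.
toy_dummies = pd.get_dummies(toy.location, dtype=int)
print(toy_dummies)
```

Each row sums to 1, because a house belongs to exactly one location.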

In [3]:
dummies = pd.get_dummies(df.location)
dummies.tail()

Unnamed: 0,1st Block Jayanagar,1st Phase JP Nagar,2nd Phase Judicial Layout,2nd Stage Nagarbhavi,5th Block Hbr Layout,5th Phase JP Nagar,6th Phase JP Nagar,7th Phase JP Nagar,8th Phase JP Nagar,9th Phase JP Nagar,...,Vishveshwarya Layout,Vishwapriya Layout,Vittasandra,Whitefield,Yelachenahalli,Yelahanka,Yelahanka New Town,Yelenahalli,Yeshwanthpur,other_locations
9975,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9976,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9979,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9980,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Good! Now, we have this dataframe that represents the locations.  
We can now attach this to our original dataset and drop the location column as it is no longer needed.

In [4]:
# New dataframe formed by concatenating the original dataframe and dummies. 
df1 = pd.concat([df, dummies.drop('other_locations', axis=1)], axis=1)
df1 = df1.drop(['location'], axis=1)
df1.head()

Unnamed: 0,total_sqft,bath,price,bedrooms,1st Block Jayanagar,1st Phase JP Nagar,2nd Phase Judicial Layout,2nd Stage Nagarbhavi,5th Block Hbr Layout,5th Phase JP Nagar,...,Vijayanagar,Vishveshwarya Layout,Vishwapriya Layout,Vittasandra,Whitefield,Yelachenahalli,Yelahanka,Yelahanka New Town,Yelenahalli,Yeshwanthpur
0,2850.0,4.0,428.0,4,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1630.0,3.0,194.0,3,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1875.0,2.0,235.0,3,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1200.0,2.0,130.0,3,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1235.0,2.0,148.0,2,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


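Notice I dropped the 'other_locations' column.  
This is done to avoid the dummy variable trap: with every dummy column present, any one of them can be derived from the rest. After the drop, a row whose location columns are all zero represents 'other_locations'.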
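The effect of dropping one dummy column can be checked numerically. The sketch below uses a made-up four-row dataframe and adds the intercept column that `LinearRegression` fits by default. It is an illustration of the idea, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three location categories.
toy = pd.DataFrame(
    {"location": ["Whitefield", "Yelahanka", "other_locations", "Whitefield"]}
)
full = pd.get_dummies(toy.location, dtype=int)
reduced = full.drop("other_locations", axis=1)

# Prepend a column of ones to stand in for the model's intercept.
ones = np.ones((len(toy), 1))
full_design = np.hstack([ones, full.values])
reduced_design = np.hstack([ones, reduced.values])

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

# With all dummies the columns are linearly dependent (rank < column count).
print(rank_full, full_design.shape[1])
# After dropping one dummy the design matrix has full column rank.
print(rank_reduced, reduced_design.shape[1])
```

The third toy row, whose location is 'other_locations', ends up with zeros in every remaining dummy column.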

In [5]:
df1.shape

(7061, 245)

The data is now in a much better form for training a regression model.  
As we saw in the first notebook, we started with over 13000 samples. After a series of cleaning and outlier-filtering steps, we got rid of over 6000 of them. We are left with 7061 samples (the row count reported by `df1.shape` above), which we'll use to train our model.  
<br>
Let's get to it...  
We'll define our features and target: the features are the independent variables (total_sqft, bath, bedrooms, locations) and the target is the dependent variable (price).

In [6]:
# Assign features for training to X.
X = df1.drop('price', axis=1)
X

Unnamed: 0,total_sqft,bath,bedrooms,1st Block Jayanagar,1st Phase JP Nagar,2nd Phase Judicial Layout,2nd Stage Nagarbhavi,5th Block Hbr Layout,5th Phase JP Nagar,6th Phase JP Nagar,...,Vijayanagar,Vishveshwarya Layout,Vishwapriya Layout,Vittasandra,Whitefield,Yelachenahalli,Yelahanka,Yelahanka New Town,Yelenahalli,Yeshwanthpur
0,2850.0,4.0,4,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1630.0,3.0,3,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1875.0,2.0,3,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1200.0,2.0,3,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1235.0,2.0,2,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9975,1200.0,2.0,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9976,1800.0,1.0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9979,1353.0,2.0,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9980,812.0,1.0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Assign targets for training to Y.
Y = df1.price
Y

0       428.0
1       194.0
2       235.0
3       130.0
4       148.0
        ...  
9975     70.0
9976    200.0
9979    110.0
9980     26.0
9983    400.0
Name: price, Length: 7061, dtype: float64

A common practice before training a model is to split the dataset into two chunks, one for __training__ and one for __testing__.  
This lets us evaluate the model's performance on unseen data. At times, a model will perform exceptionally well on the training data but poorly on unseen data. This is called overfitting; the testing dataset lets us catch it and adjust the model accordingly.

In [8]:
# Split the data into training and testing chunks.
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size=.1, random_state=1234)
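As a rough illustration of why the held-out chunk matters, the sketch below deliberately overfits a linear model on synthetic data (random numbers, not the housing data) by giving it as many features as training samples. All names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 40 samples, 30 features, only the first feature matters.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(40, 30))
y_noise = 5.0 * X_noise[:, 0] + rng.normal(size=40)

# 30 training samples and 10 testing samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=1234
)

overfit_model = LinearRegression().fit(X_tr, y_tr)
train_r2 = overfit_model.score(X_tr, y_tr)
test_r2 = overfit_model.score(X_te, y_te)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The training score is essentially perfect because the model can memorise 30 samples with 30 features, while the score on the unseen chunk is lower. That gap is the signal the test split is there to expose.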

We are using a linear regression model for this problem.  
We'll fit the model on the training dataset and evaluate it on the testing dataset.

In [9]:
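model = LinearRegression()

# Training step!
model.fit(X_train.values, Y_train.values)
# For a regressor, score() returns R² (the coefficient of determination), shown here as a percentage.
accuracy = (model.score(X_test.values, Y_test.values))*100 
print(f'Accuracy of the model: {(accuracy):.1f}%')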

Accuracy of the model: 89.9%
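For reference, the number printed above comes from `LinearRegression.score`, which for regressors is the coefficient of determination R². The sketch below recomputes it by hand on synthetic data to show what the figure measures. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data (not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, demo_model.score(X_demo, y_demo))
```

A value of 1 means the predictions match the targets exactly, and 0 means the model does no better than always predicting the mean price.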


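Voilà! The model scored about 90% on the test set. Strictly speaking, this is the R² score that `LinearRegression.score` returns rather than a classification-style accuracy, but it is still quite amazing.  
I am usually keen to implement K-Fold cross-validation, but I will let it slide this time.  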
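
For readers who do want the cross-validation step, a minimal sketch might look like the following. It runs on synthetic data rather than the housing dataframe, and the fold count and seed are arbitrary choices, not something this notebook prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and Y.
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(200, 3))
y_cv = X_cv @ np.array([3.0, 1.5, -2.0]) + rng.normal(scale=0.3, size=200)

# Five shuffled folds; each fold serves as the test set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=1234)
scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=cv)

print(scores)
print(scores.mean())
```

Each entry in `scores` is the R² on one held-out fold, so the spread across folds gives a sense of how much a single train/test split could be flattering the model.
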
Now, we'll write a function that takes __location, sqft, bath and bedrooms,__ and returns an estimated price for the particular house.

In [10]:
def predictPrice(location, sqft, bath, bedrooms):
    """
    Take a few inputs describing the house a user is hoping for
    and return an estimated price...
    
    First, look up the column-wise index of the passed location on "X".
    
    Second, create an array of zeros which will serve as a sample.
    
    Then, assign sqft as the first feature, bath as the second feature,
    and bedrooms as the third feature.
    
    Finally, if the location has its own column, assign 1 to that index on 'x'.
    
    
    NB: a location with no column in "X" leaves every location dummy at zero,
        which is how "other_locations" is encoded.
    """
    matches = np.where(X.columns == location)[0]
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bedrooms
    # Indexing [0][0] directly would raise IndexError for an unknown location,
    # so check for a match first and otherwise fall back to all zeros.
    if matches.size > 0:
        x[matches[0]] = 1
        
    # Return predicted price.
    return model.predict([x])[0] 

__NB:__ In the above function, the order of the values in the __x__ array is very important because it must match<br>
the column order of the data the model was trained on.
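
One way to make that ordering harder to get wrong is to fill the sample by column name instead of by position. The sketch below is an alternative illustration, not the function used in this notebook; `feature_columns` is a hypothetical, shortened stand-in for `X.columns`.

```python
import pandas as pd

# Hypothetical, shortened column list standing in for X.columns.
feature_columns = ["total_sqft", "bath", "bedrooms", "Electronic City", "Whitefield"]

def build_sample(location, sqft, bath, bedrooms, columns):
    """Return a 1-D array whose entries follow the order of `columns`."""
    sample = pd.Series(0.0, index=columns)
    sample["total_sqft"] = sqft
    sample["bath"] = bath
    sample["bedrooms"] = bedrooms
    # Unknown locations leave all location dummies at zero.
    if location in sample.index:
        sample[location] = 1.0
    return sample.to_numpy()

print(build_sample("Whitefield", 1500, 2, 3, feature_columns))
```

Because the values are assigned by label, reordering `feature_columns` reorders the output with it, so the sample always lines up with whatever column list the model was trained on.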

Okay!<br>
The function is ready. Let's see some estimated prices.

In [11]:
price = predictPrice(location='1st Block Jayanagar', sqft=2500, bath=4, bedrooms=4)
print(f"Price: {price:.2f}Lakh")

Price: 305.44Lakh


In [12]:
price = predictPrice(location='1st Block Jayanagar', sqft=2500, bath=6, bedrooms=6)
print(f"Price: {price:.2f}Lakh")

Price: 297.76Lakh


In [13]:
price = predictPrice(location='Electronic City', sqft=1500, bath=4, bedrooms=4)
print(f"Price: {price:.2f}Lakh")

Price: 83.56Lakh


In [14]:
price = predictPrice(location='Electronic City', sqft=2000, bath=4, bedrooms=4)
print(f"Price: {price:.2f}Lakh")

Price: 127.26Lakh


In [15]:
price = predictPrice('2nd Stage Nagarbhavi', 1000, 2, 3)
print(f"Price: {price:.2f}Lakh")

Price: 180.71Lakh


In [16]:
price = predictPrice('Vittasandra', 1000, 2, 3)
print(f"Price: {price:.2f}Lakh")

Price: 37.82Lakh


Fantastic!!!<br>
There is something interesting about the first two prices. Both houses are 2500 sqft, but the one with fewer bathrooms and bedrooms costs more than the one with more of them. This could mean that, since the house has fewer rooms, each room is bigger, and perhaps the market favours houses with bigger rooms over houses with more rooms...  

Also, we can see how much location contributes to the price of a house: the same 1000 sqft, 2-bath, 3-bedroom house is estimated at 180.71 Lakh in 2nd Stage Nagarbhavi but only 37.82 Lakh in Vittasandra.

The next stage of this project is building a website where a user can enter a few details of their prospective home and get an estimated price.  
We'll save the trained model and the column names in the same order as `X.columns` (lower-cased). The order of these columns is very important when using the model.

In [17]:
# Export the model as a pickle file.
with open('house_price_model.pickle', 'wb') as f:
    pickle.dump(model, f)

In [18]:
# Export the columns in their current order. That order is extremely important for utilising the model.
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open('coulmns.json', 'w') as f:
    f.write(json.dumps(columns))
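
To show how the two exported files are meant to work together, here is a self-contained sketch of the round trip: save a model and its column list, load them back, and build a sample from the loaded columns. It uses a tiny stand-in model, made-up numbers, and temporary file names, so none of the values relate to the real housing model.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model and columns; in the notebook these would be `model` and X.columns.
demo_columns = ["total_sqft", "bath", "bedrooms", "whitefield"]
X_small = np.array(
    [[1000, 2, 2, 0], [1500, 3, 3, 1], [2000, 3, 4, 0], [1200, 2, 2, 1], [1800, 2, 3, 0]],
    dtype=float,
)
y_small = np.array([50.0, 120.0, 110.0, 95.0, 90.0])
demo_model = LinearRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_model.pickle"
    columns_path = Path(tmp) / "demo_columns.json"

    # Save both artefacts, mirroring the two export cells above.
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(columns_path, "w") as f:
        json.dump({"data_columns": demo_columns}, f)

    # Load them back, as a web server would at start-up.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(columns_path) as f:
        loaded_columns = json.load(f)["data_columns"]

# Column names were lower-cased on export, so lower-case the query too.
location = "Whitefield".lower()
x = np.zeros(len(loaded_columns))
x[:3] = [1500, 3, 3]
if location in loaded_columns:
    x[loaded_columns.index(location)] = 1

print(loaded_model.predict([x])[0])
```

The lower-casing step matters because the export cell stores `col.lower()`, so a lookup with the original capitalisation would silently fall through to the all-zeros encoding.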

In [19]:
# Delete all stored variables.
%store -z

The rest of this project is on [my GitHub](https://github.com/ifunanyaScript/House-Pricing-Regression).  
You should check it out...

In [20]:
# ifunanyaScript