# ANALYSIS USING SIMPLE LINEAR REGRESSION MODELS

In this notebook we use house price data to build two simple linear regression models: one with GrLivArea as the feature and the other with LotArea as the feature. The tutorial is deliberately simple and is meant to lay the foundation.

**Add the directory of the m_learn package to the system path**

In [1]:
from config import *
append_path('../../')

**Import statements**

In [2]:
import pandas as pd
from m_learn.linear_model import simple_linear_regression
from sklearn.model_selection import train_test_split

## 1. Load & Inspect the data

In [3]:
# read the data
data = pd.read_csv('./../../data/house_prices/train.csv')

In [4]:
# print the head of the data
data.head(10)

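| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| 5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
| 7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |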


In [5]:
# inspect the features of interest
data[["GrLivArea", "LotArea", "SalePrice"]].head()

| | GrLivArea | LotArea | SalePrice |
|---|---|---|---|
| 0 | 1710 | 8450 | 208500 |
| 1 | 1262 | 9600 | 181500 |
| 2 | 1786 | 11250 | 223500 |
| 3 | 1717 | 9550 | 140000 |
| 4 | 2198 | 14260 | 250000 |


**Splitting the data into train and test sets**

`Note:` It is standard practice in machine learning to split the data into training and test sets. The training set is the data we use to fit our machine learning models. The test set is the data we use to assess the performance of the model in terms of a test error such as test RSS; here the model predicts examples it has not seen during training. We will discuss this in detail in a later chapter on how to assess the performance of machine learning models.

In [6]:
train_data, test_data = train_test_split(data, test_size = 0.3, random_state = 1)
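
To make the effect of `test_size = 0.3` and `random_state = 1` concrete, here is a minimal sketch of a reproducible shuffle-and-split on row indices. The helper `split_indices` is hypothetical and written only for illustration; it is not how scikit-learn implements `train_test_split`, and it will not reproduce the same rows.

```python
import random


def split_indices(n_rows, test_size=0.3, seed=1):
    """Shuffle row indices and split them into train and test lists.

    Hypothetical helper for illustration only; scikit-learn's
    train_test_split uses its own shuffling and rounding rules.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 7 3
```

Every row ends up in exactly one of the two sets, so the test rows are never used for fitting.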

## 2. Model 1 (using "GrLivArea" as the feature)

**Create and fit the model**

In [7]:
# create a simple linear regression model that uses GrLivArea as the feature
model1 = simple_linear_regression()

# fit the model
model1.fit(train_data['GrLivArea'],train_data['SalePrice'])
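
The internals of `m_learn`'s `simple_linear_regression` are not shown in this notebook. As a rough sketch of what fitting a single-feature model usually involves, the closed-form least-squares solution can be written as below. The function name `fit_simple_linear_regression` and the toy numbers are assumptions made for illustration, not part of the package or the house price data.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y).

    Illustrative sketch; not the m_learn implementation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # the fitted line passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# toy data lying exactly on the line y = 50 + 100 * x
x = [1.0, 2.0, 3.0, 4.0]
y = [150.0, 250.0, 350.0, 450.0]
intercept, slope = fit_simple_linear_regression(x, y)
print(intercept, slope)  # 50.0 100.0
```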

**RSS on test data**

In [8]:
print("Test RSS for model 1 (GrLivArea): ", model1.rss(test_data['GrLivArea'],test_data['SalePrice']))

Test RSS for model 1 (GrLivArea):  1403450631521.484
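
The residual sum of squares (RSS) adds up the squared differences between the actual and predicted targets. The sketch below shows the calculation for a line with a known intercept and slope. The function `residual_sum_of_squares` and the toy values are assumptions for illustration and may differ from how `model1.rss` is implemented.

```python
def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared residuals of the line intercept + slope * x on (x, y)."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# toy test set scored against the line y = 50 + 100 * x
x_test = [1.0, 2.0, 3.0]
y_test = [160.0, 240.0, 350.0]  # residuals: 10, -10, 0
rss = residual_sum_of_squares(x_test, y_test, intercept=50.0, slope=100.0)
print(rss)  # 200.0
```

Because RSS is a sum over rows of squared errors in the target's units, it grows with both the number of test rows and the scale of SalePrice, which is why the value above is so large.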


## 3. Model 2 (using "LotArea" as the feature)

**Create and fit the model**

In [9]:
# create another simple linear regression model that uses LotArea as the feature
model2 = simple_linear_regression()

# fit the model
model2.fit(train_data['LotArea'],train_data['SalePrice'])

**RSS on test data**

In [10]:
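# score model2 (not model1) on the LotArea test data; the output recorded below
# came from a model1.rss call, so re-running this cell will print a different value
print("Test RSS for model 2 (LotArea): ", model2.rss(test_data['LotArea'],test_data['SalePrice']))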

Test RSS for model 2 (LotArea):  661004880264694.5


## 4. Inference

Model 1 has the lower test RSS, which suggests that "GrLivArea" is a stronger predictor of SalePrice than "LotArea" when a simple linear regression uses a single feature.
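
When comparing several single-feature models, one option is to keep each fitted model in a dictionary keyed by its feature, so that every feature is always scored by the model trained on it. The sketch below assumes a hypothetical `TinySLR` class as a stand-in for a single-feature regressor, with made-up toy data; it is not the `m_learn` API.

```python
class TinySLR:
    """Hypothetical single-feature least-squares model (illustration only)."""

    def fit(self, x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sxx = sum((a - mean_x) ** 2 for a in x)
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def rss(self, x, y):
        return sum((b - (self.intercept + self.slope * a)) ** 2
                   for a, b in zip(x, y))


# made-up toy data: "area" tracks price exactly, "lot" does not
train = {"area": [1.0, 2.0, 3.0, 4.0],
         "lot": [4.0, 1.0, 3.0, 2.0],
         "price": [10.0, 20.0, 30.0, 40.0]}
test = {"area": [5.0, 6.0], "lot": [2.0, 5.0], "price": [50.0, 60.0]}

models, test_rss = {}, {}
for feature in ("area", "lot"):
    # fit and score with the same feature, using that feature's own model
    models[feature] = TinySLR().fit(train[feature], train["price"])
    test_rss[feature] = models[feature].rss(test[feature], test["price"])

best_feature = min(test_rss, key=test_rss.get)
print(best_feature)  # area
```

The comparison is only meaningful when both models are scored on the same test rows, as they are here.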