# Linear Regression

## Case study - House price prediction
This example demonstrates how to build a linear regression model on a house price dataset.

In [None]:
import pandas as pd
house_data = pd.read_csv('LR_house_price.csv')          # Load house price data.
house_data.head()                                       # Observe several data samples.

Unnamed: 0,Sqft,Floor,Bedroom,Living.Room,Bathroom,Price
0,1177.698,2.0,2.0,2.0,2.0,62000.0
1,2134.8,5.0,4.0,2.0,2.0,78000.0
2,1138.56,5.0,2.0,2.0,1.0,58000.0
3,1458.78,2.0,3.0,2.0,2.0,45000.0
4,967.776,11.0,3.0,2.0,2.0,45000.0


In [None]:
# Pre-process data, determine feature x and label y
columns = house_data.columns.tolist()     # Get column names.
columns.remove('Price')                   # Remove 'Price' (label y column)  
feature_data = house_data[columns]        # Assign a variable to features x, including all columns except 'Price'
target_data = house_data.Price            # Assign 'Price' to label y

In [None]:
from sklearn.model_selection import train_test_split
trainX,testX, trainY,testY = train_test_split(feature_data, target_data, train_size=0.70)     # Split the data into two subsets for training and testing.
print('Training:' + str(trainX.shape))     # Count data samples in Training set.
print('Test:' + str(testX.shape))          # Count data samples in Test set.

Training:(8851, 5)
Test:(3794, 5)


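Parameter ``train_size`` takes a float between 0.0 and 1.0 giving the fraction of samples assigned to the Training set; the remaining samples go to the Test set.
- ``train_size``: ``None`` by default, in which case it is set to 1 - ``test_size``.
- ``test_size``: ``None`` by default, in which case it is set to 1 - ``train_size``. If both are ``None``, ``test_size`` is set to 0.25 (leaving 0.75 for training).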
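
To make the split ratios concrete, here is a minimal sketch on hypothetical toy data (100 samples, not the house price dataset). The variable names and the `random_state=0` seed are assumptions added for reproducibility; they are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 3 features each.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# Only train_size is given, so the Test set receives the remaining 30%.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.70, random_state=0)
print('Training:' + str(X_tr.shape))
print('Test:' + str(X_te.shape))

# Neither size is given, so scikit-learn falls back to a 75/25 split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=0)
print('Training:' + str(X_tr2.shape))
print('Test:' + str(X_te2.shape))
```

Fixing `random_state` makes the shuffle repeatable, which helps when comparing models across runs.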

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)     # Initialize Linear Regression model with normalization (the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2).

Parameter ``normalize`` normalizes each feature x by subtracting its mean and dividing by the L2-norm of the centered values, which brings the values into the range [-1,1]. For instance, x=[1,2,5] has mean(x) = 8/3 ≈ 2.67, so the centered vector is [-1.67, -0.67, 2.33] with $|| x - mean(x) ||_2$ = sqrt($1.67^2 + 0.67^2 + 2.33^2$) ≈ 2.94. After normalization, we have x ≈ [-1.67/2.94, -0.67/2.94, 2.33/2.94] ≈ [-0.57, -0.23, 0.79].
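
The arithmetic of this normalization can be checked with a few lines of NumPy. This is a sketch under the assumption that the L2-norm is taken over the mean-centered values; it illustrates the computation only and does not call scikit-learn.

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0])

centered = x - x.mean()            # subtract the mean (8/3)
norm = np.linalg.norm(centered)    # L2-norm of the centered values
x_normalized = centered / norm     # every entry now lies in [-1, 1]

print(x.mean(), norm)
print(x_normalized)
```

The normalized vector has unit L2-norm, so no entry can exceed 1 in absolute value.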


In [None]:
model.fit(trainX,trainY)                                # Learn Linear Regression using (x, y) pairs in the Training set.
print("Model intercept: " + str(model.intercept_))      # Observe the bias (theta_0) parameter.
print("Model coefficients: " + str(model.coef_))        # Observe 5 coefficients corresponding to 5 features x after learning.

Model intercept: 49.92762188031338
Model coefficients: [   35.46789579  1584.36955485  -204.89885422 -4405.69237182
  3730.19008227]


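Because ``normalize`` was deprecated in scikit-learn 1.0 (and removed in 1.2), fitting the model prints a warning with the following advice: if you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior: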

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), LinearRegression())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)




In [None]:
# To fix the above warning, use a Pipeline with a StandardScaler in a preprocessing stage.
#from sklearn.preprocessing import StandardScaler
#from sklearn.pipeline import make_pipeline
#model = make_pipeline(StandardScaler(), LinearRegression())
#model.fit(trainX,trainY)
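
As a runnable illustration of the pipeline approach, the sketch below fits a `StandardScaler` + `LinearRegression` pipeline on hypothetical synthetic data (the data, the seed, and the names `pipe` and `plain` are assumptions, not part of the notebook). It also shows that, for ordinary least squares, scaling the features changes the learned coefficients but not the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the house features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5])    # noise-free linear target

# Scaling happens inside the pipeline, so fit/predict take raw features.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

# Plain model without scaling, for comparison.
plain = LinearRegression().fit(X, y)

print(np.allclose(pipe.predict(X), plain.predict(X)))
print(plain.coef_)                                    # coefficients in original units
print(pipe.named_steps['linearregression'].coef_)     # coefficients per standard deviation
```

Keeping the scaler inside the pipeline means the scaling statistics are learned from the training data only and reused automatically at prediction time.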

In [None]:
testX[:5]     # Observe 5 features x of the first 5 samples in the Test set.

Unnamed: 0,Sqft,Floor,Bedroom,Living.Room,Bathroom
8207,5760.402,16.0,4.0,2.0,3.0
8332,1067.4,7.0,4.0,2.0,2.0
8468,878.826,8.0,3.0,2.0,2.0
1660,1444.548,3.0,4.0,2.0,3.0
10440,1152.792,9.0,4.0,2.0,1.0


In [None]:
testY[:5]     # Observe labels y of the first 5 samples in the Test set.

8207     230500.0
8332      46800.0
8468      41900.0
1660      57000.0
10440     49300.0
Name: Price, dtype: float64

In [None]:
model.predict(testX[:5])     # Make prediction on the first 5 samples in the Test set.

array([231268.76840658,  46828.34647171,  41929.29190085,  57597.70429444,
        49295.57005609])
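
For a fitted linear regression, each prediction is the intercept plus the dot product of the coefficients with the feature values. The sketch below verifies this on hypothetical synthetic data; `demo_model` and the generated arrays are illustrative names, not objects from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with 5 features, mirroring the 5 house features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = 3.0 + X @ np.array([1.0, 2.0, -1.0, 0.5, 4.0]) + rng.normal(scale=0.1, size=50)

demo_model = LinearRegression().fit(X, y)

# Manual prediction: theta_0 + sum_j theta_j * x_j for the first 5 samples.
manual = demo_model.intercept_ + X[:5] @ demo_model.coef_

print(np.allclose(manual, demo_model.predict(X[:5])))
```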

In [None]:
from sklearn.metrics import mean_absolute_error
pred = model.predict(testX)                                              # Make prediction on the whole Test set.
mean_absolute_error(y_pred=pred, y_true=testY)                           # Calculate mean absolute error to observe the performance of the learned model based on the predictions and the labels.

889.9037646716085

Formula to calculate the mean absolute error:

\begin{equation*}
Mean\_Absolute\_Error = \frac{1}{n}\sum_{i=1}^n \left| y_{pred}^i - y_{true}^i \right|
\end{equation*}
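
The formula can be checked against `mean_absolute_error` with a small sketch. The three label and prediction values below are hypothetical, chosen so the absolute errors are 500, 100 and 300.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical labels and predictions (not taken from the model above).
y_true = np.array([230500.0, 46800.0, 41900.0])
y_pred = np.array([231000.0, 46700.0, 42200.0])

# Direct implementation of the formula: average of absolute differences.
mae_manual = np.mean(np.abs(y_pred - y_true))

# Same quantity computed by scikit-learn.
mae_sklearn = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print(mae_manual, mae_sklearn)
```

Because the error is expressed in the same unit as the label, it can be read directly as the average price deviation per house.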