# Linear Regression in scikit-learn - Introduction

This dataset concerns housing values in the suburbs of Boston. The original dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University; here it is downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/).

Your goal is to create and train a model that estimates the median value of owner-occupied homes (the `MEDV` column).

### Dataset description (columns)

     1. CRIM     per capita crime rate by town
     2. ZN       proportion of residential land zoned for lots over 
                 25,000 sq.ft.
     3. INDUS    proportion of non-retail business acres per town
     4. CHAS     Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
     5. NOX      nitric oxides concentration (parts per 10 million)
     6. RM       average number of rooms per dwelling
     7. AGE      proportion of owner-occupied units built prior to 1940
     8. DIS      weighted distances to five Boston employment centres
     9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000's of dollars
    

In [1]:
import pandas as pd 
import numpy as np


Load and display data.

In [None]:
# Uncomment this if you are using Google Colab
#!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv

In [3]:
data = pd.read_csv('housing.csv')
data.shape

(506, 14)

### Task 1
Select X (columns `['CRIM', 'TAX', 'RM']`) and y (column `MEDV`)

In [9]:
# Enter your code here
X = data[['CRIM', 'TAX', 'RM']]
X.shape

(506, 3)

In [11]:
# Enter your code here
y = data['MEDV']
y.shape

(506,)

### Task 2
Split data into two subsets
- train subset: 70% of data
- test subset: 30% of data
- set random_state to 1

In [13]:
# Enter your code here
from sklearn.model_selection import train_test_split

In [14]:
# Enter your code here
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.3, random_state=1)

### Task 3
Create and train a linear regression model.

In [16]:
# Enter your code here
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)

### Task 4
Compute the $R^2$ coefficient for the train and test datasets using `model.score()`.

$$R^2=1-\frac{\Sigma{(y-\hat{y})^2}}{\Sigma{(y-\overline{y})^2}}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $\overline{y}$ - mean value of `y`
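As a sanity check on the formula above, the following minimal sketch computes $R^2$ by hand and compares it with `model.score()`. It uses small synthetic data (an assumption made so the snippet runs without `housing.csv`); the names `X_demo`, `y_demo`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration only (not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R^2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, demo_model.score(X_demo, y_demo))
```

For regressors, `score()` returns this same coefficient, so the two printed numbers should match.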

In [17]:
# Enter your code here
print ('R2 train score:', model.score(X_train, y_train))
print ('R2 test score:', model.score(X_test, y_test))

R2 train score: 0.5096603576929333
R2 test score: 0.690189333092642


### MAPE - Mean Absolute Percentage Error

$$MAPE = \frac{1}{n} \sum{ \left\lvert{\frac{y-\hat{y}}{y}}\right\rvert}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $n$ - number of samples
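To make the formula concrete, here is a small worked sketch on hand-picked numbers (assumed for illustration). It also compares the result with `sklearn.metrics.mean_absolute_percentage_error`, which should be available in scikit-learn 0.24 and later. Both return a fraction; multiply by 100 to express it as a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hand-picked values, assumed for illustration only
y_true = np.array([10.0, 20.0, 40.0])
y_hat = np.array([11.0, 18.0, 40.0])

# Relative errors are 0.1, 0.1 and 0.0, so the mean is 0.2 / 3
mape_manual = np.mean(np.abs((y_true - y_hat) / y_true))

# scikit-learn's implementation returns a fraction, not a percentage
mape_sklearn = mean_absolute_percentage_error(y_true, y_hat)

print(100 * mape_manual, 100 * mape_sklearn)
```

Note that $MAPE$ is undefined when any true value is zero; this is not a concern for `MEDV`, which is strictly positive.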

In [22]:
y_pred = model.predict(X_train)
mape_train = np.mean(np.abs((y_train - y_pred)  / y_train) ) * 100

### Task 5
Create a function `mape` that returns the $MAPE$ value given $X$, $y$, and the model used to create the $\hat{y}$ estimates. Then use your function to compute $MAPE$ for the train and test datasets.

In [26]:
def mape(model, X, y):
    # Enter your code here
    y_pred = model.predict(X)
    return 100 * np.mean(np.abs((y - y_pred)/y))

In [27]:
# Enter your code here
print('Train mape {:.3f}%'.format(mape(model,X_train,y_train)))
print('Test mape {:.3f}%'.format(mape(model,X_test,y_test)))

Train mape 21.552%
Test mape 20.784%


## Random forest regressor

In [30]:
# Enter your code here
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train,y_train)
print('Train mape {:.3f}%'.format(mape(model,X_train,y_train)))
print('Test mape {:.3f}%'.format(mape(model,X_test,y_test)))

Train mape 7.598%
Test mape 16.510%


### Task 6
Experiment with the `min_samples_leaf` parameter to reduce overfitting.

In [31]:
# Enter your code here
model = RandomForestRegressor(min_samples_leaf=12)
model.fit(X_train,y_train)

print('Train mape {:.3f}%'.format(mape(model,X_train,y_train)))
print('Test mape {:.3f}%'.format(mape(model,X_test,y_test)))

Train mape 16.342%
Test mape 18.688%
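One way to run this experiment more systematically is to sweep several `min_samples_leaf` values and watch the gap between train and test error. The sketch below does this on synthetic data (an assumption, so that it runs without `housing.csv`). The helper `mape_pct` mirrors the `mape` function defined earlier, and fixing `random_state` on the forest is an extra assumption to make the runs repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def mape_pct(model, X, y):
    """MAPE in percent, using the same formula as the notebook's mape()."""
    return 100 * np.mean(np.abs((y - model.predict(X)) / y))

# Synthetic, strictly positive targets (assumed for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(400, 3))
y_demo = (20 + 10 * np.sin(3 * X_demo[:, 0]) + 5 * X_demo[:, 1]
          + rng.normal(scale=2.0, size=400))

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

results = {}
for leaf in [1, 5, 20, 50]:
    rf = RandomForestRegressor(min_samples_leaf=leaf, random_state=0)
    rf.fit(Xtr, ytr)
    results[leaf] = (mape_pct(rf, Xtr, ytr), mape_pct(rf, Xte, yte))
    print('leaf={:>2}  train {:.2f}%  test {:.2f}%'.format(leaf, *results[leaf]))
```

Larger leaves typically raise the train error and narrow the train/test gap; whether the test error improves depends on the data.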


# Part 2

### Task 7
Select all 13 features as $X$ and split the dataset into two subsets (using the same split ratio and random state as before).

In [34]:
# Enter your code here
# Use the columns keyword; the positional axis argument no longer works in pandas 2.0+
X = data.drop(columns='MEDV')
X.head()

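|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 |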


In [35]:
# Enter your code here
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(354, 13)
(152, 13)
(354,)
(152,)


### Task 8
Train and test a linear regression model. Compare the results with the previous ones.

In [36]:
# Enter your code here
model = LinearRegression()
model.fit(X_train, y_train)

print('Train mape {:.3f}%'.format(mape(model,X_train,y_train)))
print('test mape {:.3f}%'.format(mape(model,X_test,y_test)))

Train mape 16.715%
test mape 16.208%


### Task 9
Train and test a random forest model (keep all parameters at their defaults). Does your model suffer from overfitting or underfitting?

In [37]:
# Enter your code here
model = RandomForestRegressor()
model.fit(X_train, y_train)

print('Train mape {:.3f}%'.format(mape(model,X_train,y_train)))
print('test mape {:.3f}%'.format(mape(model,X_test,y_test)))


Train mape 4.266%
test mape 11.243%


### Task 10
Try modifying the `min_samples_leaf` parameter to get the best model possible.

In [38]:
# Enter your code here
model = RandomForestRegressor(min_samples_leaf=16)
model.fit(X_train, y_train)

RandomForestRegressor(min_samples_leaf=16)

In [None]:

print('Train mape {:.3f}%'.format(mape(model,X_train,y_train)))
print('test mape {:.3f}%'.format(mape(model,X_test,y_test)))