# CCMVI2085U Session 04 Linear Regression | Exercise Solution

[**CCMVI2085U Machine Learning for Predictive Analytics in Business @ CBS ISUP 2020**](https://kursuskatalog.cbs.dk/2019-2020/KAN-CCMVI2085U.aspx?lang=en-GB)

Course coordinator: [Bowei Chen](https://boweichen.github.io/) | Email: [bc.acc@cbs.dk](mailto:bc.acc@cbs.dk)

-----

<div class="alert alert-info">
In this workshop practice, you are going to use this Jupyter notebook to fit a few linear regression models for the following marketing analytics project. Enter your solution (i.e., Python code) under each question.
</div>

## Data loading and descriptive summary

Q1. Load `Advertising.csv` using the `pandas` library and save the dataset into `df`

In [1]:
import pandas as pd
df = pd.read_csv("data/Advertising.csv", header=0)

Q2. Check the dimension of the dataset

In [2]:
df.shape

(200, 4)

Q3. Show the last 10 rows/instances in the dataset. 

Hint: use the `tail` function. Type the following in a code cell to see how to use it:

```python
? pd.DataFrame.tail
```

or see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html

In [3]:
df.tail(10)

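|     | TV    | Radio | Newspaper | Sales |
|-----|-------|-------|-----------|-------|
| 190 | 39.5  | 41.1  | 5.8       | 10.8  |
| 191 | 75.5  | 10.8  | 6.0       | 9.9   |
| 192 | 17.2  | 4.1   | 31.6      | 5.9   |
| 193 | 166.8 | 42.0  | 3.6       | 19.6  |
| 194 | 149.7 | 35.6  | 6.0       | 17.3  |
| 195 | 38.2  | 3.7   | 13.8      | 7.6   |
| 196 | 94.2  | 4.9   | 8.1       | 9.7   |
| 197 | 177.0 | 9.3   | 6.4       | 12.8  |
| 198 | 283.6 | 42.0  | 66.2      | 25.5  |
| 199 | 232.1 | 8.6   | 8.7       | 13.4  |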


Q4. Extract the variable names from the dataset, save them into `col_names`, and show the variables' data types

In [4]:
col_names = df.columns
print(col_names)
print(df.dtypes)

Index(['TV', 'Radio', 'Newspaper', 'Sales'], dtype='object')
TV           float64
Radio        float64
Newspaper    float64
Sales        float64
dtype: object


Q5. Summarise the basic descriptive statistics of the numerical variables in the dataset

In [5]:
df.describe()

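|       | TV        | Radio     | Newspaper | Sales    |
|-------|-----------|-----------|-----------|----------|
| count | 200.0     | 200.0     | 200.0     | 200.0    |
| mean  | 147.0425  | 23.264    | 30.554    | 14.0225  |
| std   | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
| min   | 0.7       | 0.0       | 0.3       | 1.6      |
| 25%   | 74.375    | 9.975     | 12.75     | 10.375   |
| 50%   | 149.75    | 22.9      | 25.75     | 12.9     |
| 75%   | 218.825   | 36.525    | 45.1      | 17.4     |
| max   | 296.4     | 49.6      | 114.0     | 27.0     |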


Q6. Check whether there are any missing values in the dataset.

Hint: use the `isnull()` method

In [6]:
df.isnull()

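|     | TV    | Radio | Newspaper | Sales |
|-----|-------|-------|-----------|-------|
| 0   | False | False | False     | False |
| 1   | False | False | False     | False |
| 2   | False | False | False     | False |
| 3   | False | False | False     | False |
| 4   | False | False | False     | False |
| ... | ...   | ...   | ...       | ...   |
| 195 | False | False | False     | False |
| 196 | False | False | False     | False |
| 197 | False | False | False     | False |
| 198 | False | False | False     | False |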


The full Boolean table is hard to scan. Chaining `sum()` onto `isnull()` returns the number of missing values in each column, rather than a single overall total.


In [7]:
df.isnull().sum() 

TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64
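
If a single overall count or a yes/no answer is more convenient than per-column counts, the same chain can be extended. The following is a minimal sketch on a small hypothetical frame; the name `demo` and its values are made up for illustration and are not taken from `Advertising.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not from Advertising.csv)
demo = pd.DataFrame({"TV": [230.1, np.nan, 17.2],
                     "Radio": [37.8, 39.3, np.nan]})

per_column = demo.isnull().sum()                 # missing values per column
total = int(per_column.sum())                    # missing values in the whole frame
has_missing = bool(demo.isnull().values.any())   # True if anything is missing

print(per_column)
print(total, has_missing)
```

For the advertising data above, every per-column count is 0, so both the total and the yes/no check would report no missing values.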

## Data preparation

Q1. We need to convert the variables from the dataframe structure into ndarrays. Convert the feature/predictor variables into feature vectors and save them into an $N \times D$ ndarray named $x$. Convert the response variable `Sales` into a one-dimensional ndarray of length $N$ named $y$

In [8]:
x = df[col_names[:-1]].values
y = df['Sales'].values
y = y.flatten()
print(x)
print(y)

[[230.1  37.8  69.2]
 [ 44.5  39.3  45.1]
 [ 17.2  45.9  69.3]
 [151.5  41.3  58.5]
 [180.8  10.8  58.4]
 [  8.7  48.9  75. ]
 [ 57.5  32.8  23.5]
 [120.2  19.6  11.6]
 [  8.6   2.1   1. ]
 [199.8   2.6  21.2]
 [ 66.1   5.8  24.2]
 [214.7  24.    4. ]
 [ 23.8  35.1  65.9]
 [ 97.5   7.6   7.2]
 [204.1  32.9  46. ]
 [195.4  47.7  52.9]
 [ 67.8  36.6 114. ]
 [281.4  39.6  55.8]
 [ 69.2  20.5  18.3]
 [147.3  23.9  19.1]
 [218.4  27.7  53.4]
 [237.4   5.1  23.5]
 [ 13.2  15.9  49.6]
 [228.3  16.9  26.2]
 [ 62.3  12.6  18.3]
 [262.9   3.5  19.5]
 [142.9  29.3  12.6]
 [240.1  16.7  22.9]
 [248.8  27.1  22.9]
 [ 70.6  16.   40.8]
 [292.9  28.3  43.2]
 [112.9  17.4  38.6]
 [ 97.2   1.5  30. ]
 [265.6  20.    0.3]
 [ 95.7   1.4   7.4]
 [290.7   4.1   8.5]
 [266.9  43.8   5. ]
 [ 74.7  49.4  45.7]
 [ 43.1  26.7  35.1]
 [228.   37.7  32. ]
 [202.5  22.3  31.6]
 [177.   33.4  38.7]
 [293.6  27.7   1.8]
 [206.9   8.4  26.4]
 [ 25.1  25.7  43.3]
 [175.1  22.5  31.5]
 [ 89.7   9.9  35.7]
 [239.9  41.5
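
Checking `.shape` is usually quicker than printing the full arrays. The sketch below uses a hypothetical three-row frame with the same column layout as the exercise (the names `demo`, `x_demo`, and `y_demo` and the values are illustrative only). It shows that `.values` on several columns gives a 2-D array, that `.values` on a single column gives a 1-D array, and how to obtain an explicit $1 \times N$ row vector if one is needed.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as the exercise
demo = pd.DataFrame({"TV": [230.1, 44.5, 17.2],
                     "Radio": [37.8, 39.3, 45.9],
                     "Newspaper": [69.2, 45.1, 69.3],
                     "Sales": [22.1, 10.4, 9.3]})

x_demo = demo[demo.columns[:-1]].values   # 2-D array, shape (N, D)
y_demo = demo["Sales"].values             # 1-D array, shape (N,)

print(x_demo.shape)                  # (3, 3)
print(y_demo.shape)                  # (3,)
print(y_demo.reshape(1, -1).shape)   # (1, 3), an explicit 1 x N row vector
```

scikit-learn estimators accept the 1-D form of the response directly, so the reshape is only needed when a 2-D row vector is explicitly required.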

Q2. Create a training set and a test set from the original data by using `train_test_split` from `sklearn.model_selection`. You can import this function using

```python
from sklearn.model_selection import train_test_split
```
See the documentation for how to use this function:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=200)
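
To see what `test_size=0.4` and `random_state` do, here is a minimal sketch on synthetic data with the same shape as the exercise data (200 rows, 3 features). The arrays `x_demo` and `y_demo` are random stand-ins, not the advertising data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the exercise data (200 x 3)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = rng.normal(size=200)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.4, random_state=200)

# 40% of the 200 rows go to the test set, the remaining 60% to training
print(x_tr.shape, x_te.shape)   # (120, 3) (80, 3)

# Reusing the same random_state reproduces exactly the same split
x_tr2, _, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.4, random_state=200)
print(np.array_equal(x_tr, x_tr2))   # True
```

Fixing `random_state` keeps the split, and therefore the reported errors, reproducible across runs.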

## Implementing linear regression models

Q1. Check the lecture HTML and import `LinearRegression`, `Ridge`, and `Lasso` from the `sklearn` library.

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

Q2. Fit multiple linear regression, ridge regression, and lasso regression models on the training data

In [11]:
reg = LinearRegression()
reg.fit(x_train, y_train)
pred_train = reg.predict(x_train)
pred_test = reg.predict(x_test)

reg_ridge = Ridge(alpha=0.5)
reg_ridge.fit(x_train, y_train)
pred_ridge_train = reg_ridge.predict(x_train)
pred_ridge_test = reg_ridge.predict(x_test)

reg_lasso = Lasso(alpha=0.5)
reg_lasso.fit(x_train, y_train)
pred_lasso_train = reg_lasso.predict(x_train)
pred_lasso_test = reg_lasso.predict(x_test)
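
The three models differ only in how they penalise the coefficients, so comparing the fitted `coef_` attributes is a quick way to see the effect of the penalty. The sketch below uses synthetic data with hypothetical true coefficients (3.0, 1.5, and 0.0) rather than the advertising data, and the same `alpha=0.5` as above. On data like this, the lasso typically shrinks the coefficients more visibly than ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: the third feature has no effect on the response
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = (3.0 * x_demo[:, 0] + 1.5 * x_demo[:, 1]
          + rng.normal(scale=0.5, size=100))

models = {"ols": LinearRegression(),
          "ridge": Ridge(alpha=0.5),
          "lasso": Lasso(alpha=0.5)}

coefs = {}
for name, model in models.items():
    model.fit(x_demo, y_demo)
    coefs[name] = model.coef_            # one fitted coefficient per feature
    print(name, model.intercept_, model.coef_)
```

The same inspection can be applied to `reg`, `reg_ridge`, and `reg_lasso` from the cell above via their `coef_` and `intercept_` attributes.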

Q3. Test and compare these models' performance on both the training set and the test set based on mean squared errors

Note: the solution below calls `mean_absolute_error`, so the values it reports are mean absolute errors (MAE). To report the mean squared error (MSE) instead, import `mean_squared_error` from `sklearn.metrics` and use it in place of `mean_absolute_error`.

In [12]:
from sklearn.metrics import mean_absolute_error

print("=====\n")
print("Training MAE\n")
print("Multiple linear reg: %0.5f" % mean_absolute_error(y_train, pred_train))
print("Ridge reg: %0.5f" % mean_absolute_error(y_train, pred_ridge_train))
print("Lasso reg: %0.5f" % mean_absolute_error(y_train, pred_lasso_train))

print("=====\n")
print("Test MAE\n")
print("Multiple linear reg: %0.5f" % mean_absolute_error(y_test, pred_test))
print("Ridge reg: %0.5f" % mean_absolute_error(y_test, pred_ridge_test))
print("Lasso reg: %0.5f" % mean_absolute_error(y_test, pred_lasso_test))

=====

Training MAE

Multiple linear reg: 1.19149
Ridge reg: 1.19149
Lasso reg: 1.19074
=====

Test MAE

Multiple linear reg: 1.34719
Ridge reg: 1.34720
Lasso reg: 1.34750
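
`sklearn.metrics` provides both `mean_absolute_error` and `mean_squared_error`, and the two are easy to confuse. A minimal sketch on three hypothetical values (not taken from the exercise data) shows how they differ: the squared error weights the single large miss much more heavily.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions with errors of -1, 0, and 3
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 10.0]

mae = mean_absolute_error(y_true, y_hat)   # (1 + 0 + 3) / 3
mse = mean_squared_error(y_true, y_hat)    # (1 + 0 + 9) / 3

print("MAE: %0.5f" % mae)   # MAE: 1.33333
print("MSE: %0.5f" % mse)   # MSE: 3.33333
```

Both functions take the true values first and the predictions second, so either can be applied to `y_train`/`pred_train` and `y_test`/`pred_test` in the same way.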
