# Regression Practical Assessment
This assessment determines how much you have learnt in the past sprint; the results will be used to decide how EDSA can best prepare you for the working world. The assessment consists of practical questions in Regression.

The answers for this test will be input into Athena as Multiple Choice Questions. The questions are included in this notebook and are made **bold** and numbered according to the Athena Questions.

As this is a time-constrained assessment, if you are struggling with a question, move on to a task you are better prepared to answer rather than spending unnecessary time on one question.

**_Good Luck!_**

## Honour Code
I **YOUR NAME, YOUR SURNAME**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).  

Non-compliance with the honour code constitutes a material breach of contract.

### Download the data

Download the Notebook and data files here: https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Machine_Learning_Assessment.zip

### Imports

In [194]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

### Reading in the data
For this assessment we will be using a dataset about the quality of wine. Read in the data and take a look at it.

**Note:** the variable we will be predicting is `quality`, i.e. the label is `quality`.

In [231]:
df = pd.read_csv('winequality.csv')
df.head()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


## Task 1 - Data pre-processing

Write a function to pre-process the data so that we can run it through the model. The function should:
* Split the data into features and labels
* Standardise the features using sklearn's `StandardScaler`
* Split the data into 75% training and 25% testing data
* Set `random_state` to 16 for the train/test split
* If there are any NaN values, fill them with zeros

_**Function Specifications:**_
* Should take a dataframe as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.

**Note: be sure to pay attention to the test size and random state you use as the following questions assume you split the data correctly**
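
As a rough illustration of the NaN-filling and standardisation steps listed above, the sketch below uses plain NumPy on a made-up array (`toy`, `filled` and `standardised` are hypothetical names, not part of the assessment data). With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation, which is what the manual computation reproduces.

```python
import numpy as np

# Hypothetical toy feature matrix with one missing value (not the wine data).
toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0]])

# Step 1: replace NaN with zero, as the task requires.
filled = np.nan_to_num(toy, nan=0.0)

# Step 2: standardise each column by subtracting the column mean and dividing
# by the population standard deviation (ddof=0), as StandardScaler does by default.
standardised = (filled - filled.mean(axis=0)) / filled.std(axis=0)

print(filled)
print(standardised.mean(axis=0))  # approximately 0 for every column
print(standardised.std(axis=0))   # approximately 1 for every column
```

Note that filling with zero pulls the column mean towards zero, so the filled rows still influence the scaling of every other row.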

In [232]:
#question 11
for x in df:
    print(x, df[x].isna().mean())


type 0.0
fixed acidity 0.0015391719255040787
volatile acidity 0.001231337540403263
citric acid 0.00046175157765122367
residual sugar 0.00030783438510081576
chlorides 0.00030783438510081576
free sulfur dioxide 0.0
total sulfur dioxide 0.0
density 0.0
pH 0.0013852547329536709
sulphates 0.0006156687702016315
alcohol 0.0
quality 0.0


In [233]:
def data_preprocess(df):
    """Fill NaNs, standardise the features, and split 75/25 into train and test sets."""
    # Work on a filled copy so the caller's dataframe is left unchanged
    # and the function can safely be called more than once.
    df = df.fillna(0)

    # Split into the label (quality) and the features (every other column).
    y = df['quality'].values
    X = df.drop(columns='quality').values

    # Standardise the features to zero mean and unit variance.
    scaler = preprocessing.StandardScaler()
    X = scaler.fit_transform(X)

    # Hold out 25% of the rows for testing; fix the seed for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=16)

    return (X_train, y_train), (X_test, y_test)

In [234]:
((X_train, y_train), (X_test, y_test)) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[3])
print(y_test[3])

[ 1.75018984 -0.00412596  0.12564592  0.97278786 -0.70253493  0.51297929
 -0.36766435 -1.26942219  0.21456681  0.92881824  2.13682458  0.42611996]
7
[-0.57136659 -0.30574457 -0.54115965 -0.12776549  1.33614333 -0.37168026
  1.26631947  0.30531117  0.18121615 -0.91820734  0.32886116 -0.07697409]
6


In [66]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[3])
print(y_test[3])

[ 1.75018984 -0.01279612  0.12343265  0.97285518 -0.70302881  0.51268878
 -0.36766435 -1.26942219  0.21456681  1.13061485  2.14299274  0.42611996]
7
[-0.57136659 -0.32152108 -0.54511832 -0.12892086  1.33606028 -0.37231913
  1.26631947  0.30531117  0.18121615 -1.17289356  0.32795025 -0.07697409]
6


In [237]:
#Question11
X_train[12][5]

-0.8282787380653501

In [238]:
#Question 12
print(X_test[12, 5])

-0.17191842758365714


In [239]:
#Question 13
print(y_train[15])

6


In [240]:
#Question 14
print(y_test[15])

5


_**Expected Outputs:**_

```python
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[0])
print(y_train[0])
print(X_test[3])
print(y_test[3])

[ 1.75018984 -0.00412596  0.12564592  0.97278786 -0.70253493  0.51297929
 -0.36766435 -1.26942219  0.21456681  0.92881824  2.13682458  0.42611996]
7
[-0.57136659 -0.30574457 -0.54115965 -0.12776549  1.33614333 -0.37168026
  1.26631947  0.30531117  0.18121615 -0.91820734  0.32886116 -0.07697409]
6

```

**Q11. What is the result of printing out the 6th column and the 13th row of X_train?**

**Q12. What is the result of printing out the 6th column and the 13th row of X_test?**

**Q13. What is the result of printing out the 16th row of y_train?**

**Q14. What is the result of printing out the 16th row of y_test?**
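
Questions 11 to 14 count rows and columns from one, while NumPy indexes from zero, so the 13th row and 6th column correspond to index `[12, 5]`. A small sketch on made-up arrays (`grid` and `labels` are hypothetical names):

```python
import numpy as np

# Hypothetical 20x8 array whose entries encode their own position:
# value = 100 * row_index + column_index.
grid = np.arange(20)[:, None] * 100 + np.arange(8)[None, :]

# 13th row, 6th column (counting from one) -> zero-based index [12, 5].
value = grid[12, 5]
print(value)  # 1205

# For a 1-D label array, the 16th entry is index 15.
labels = np.arange(100, 120)
print(labels[15])  # 115
```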

## Task 2 - Train Linear Regression Model

Since this dataset is about predicting quality, which ranges from 1 to 10, let's try fitting the data to a regression model and see how well that performs.

Fit a model using sklearn's `LinearRegression` class with its default parameters. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `LinearRegression` model.
* The returned model should be fitted to the data.
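
For intuition about what fitting produces, the sketch below solves an ordinary least-squares problem directly with NumPy on made-up, noise-free data. It is not sklearn's implementation, only an illustration of where an intercept and a set of coefficients come from; every name in it is hypothetical.

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first solved parameter is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ params ~= y.
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = params[0], params[1:]

print(intercept)  # close to 1
print(coef)       # close to [2, -3]
```

Because the toy data are noise-free, the solver recovers the generating parameters up to floating-point error; on real data such as the wine set, the fit only minimises the squared residuals.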

In [241]:
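def train_model(X_train, y_train):
    """Fit a LinearRegression model with default parameters and return it."""
    # The fitted model is also stored in the global `reg`,
    # which later cells in this notebook refer to directly.
    global reg
    reg = LinearRegression()
    reg.fit(X_train, y_train)

    return reg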

In [242]:
#Question 15

train_model(X_train, y_train).intercept_

5.821226664268554

In [243]:
#Question 16
train_model(X_train, y_train).coef_[2]

-0.2561262855781778

**Q15. What is the result of printing out ***model.intercept_*** for the fitted model rounded to 3 decimal places?**

**Q16. What is the result of printing out ***model.coef_[2]*** for the fitted model rounded to 2 decimal places?**

## Task 3 - Test Regression Model

We would now like to test our regression model. This test should give the residual sum of squares, which for your convenience is written as
$$
RSS = \sum_{i=1}^N (p_i - y_i)^2,
$$
where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
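
As a quick worked instance of the formula, the sketch below evaluates the RSS on three made-up prediction/label pairs (the arrays `p` and `y` are hypothetical, not taken from the wine data).

```python
import numpy as np

# Made-up predictions and true values.
p = np.array([5.0, 6.0, 7.0])
y = np.array([6.0, 6.0, 5.0])

# Residuals: (-1, 0, 2); squared: (1, 0, 4); sum: 5.
rss = np.sum((p - y) ** 2)
print(round(float(rss), 2))  # 5.0
```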

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables. 
* Should return the residual sum of squares over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.

In [207]:
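def test_model(model, X_test, y_test):
    """Return the residual sum of squares of `model` on the test set, rounded to 2 decimal places."""
    y_pred = model.predict(X_test)
    rss = np.sum((y_pred - y_test) ** 2)

    # Return (rather than print) the value as a plain float, per the specification.
    return round(float(rss), 2)

In [244]:
#Question 17
test_model(reg, X_test, y_test)

882.3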


**Q17. What is the Residual Sum of Squares value for the fitted Linear Regression Model on the test set?**

## Task 4 - Train Decision Tree Regression Model

Let us try to improve on this result by training a model using sklearn's `DecisionTreeRegressor` class with a random state value of 42. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `DecisionTreeRegressor` model with a random state value of 42.
* The returned model should be fitted to the data.
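
One property worth keeping in mind before comparing models: with its default settings, `DecisionTreeRegressor` places no limit on tree depth, so on training data with distinct feature values it can reproduce every label exactly. The sketch below shows this on made-up data (`X_toy`, `y_toy` and `tree` are hypothetical names), which is why the comparison that matters is the one on the test set.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D data with distinct feature values.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([5.0, 5.0, 6.0, 6.0, 7.0, 8.0])

# Default settings: the tree keeps splitting until each leaf is pure.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_toy, y_toy)

# Residual sum of squares on the data the tree was trained on.
train_rss = np.sum((tree.predict(X_toy) - y_toy) ** 2)
print(train_rss)  # 0.0 on the training data
```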

In [245]:
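def train_dt_model(X_train, y_train):
    """Fit a DecisionTreeRegressor with random_state=42 and return it."""
    # The fitted model is also stored in the global `dt`,
    # which later cells in this notebook refer to directly.
    global dt
    dt = DecisionTreeRegressor(random_state=42)
    dt.fit(X_train, y_train)

    return dt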

In [246]:
train_dt_model(X_train, y_train)

DecisionTreeRegressor(random_state=42)

Now that you have trained your model, let's see how well it does on the test set. Use the `test_model` function you previously created to do this.

In [247]:
def r_err(predictions, y_test):
    
    #your code here
    rss = np.sum(np.square(y_test - predictions))
    print(rss)

In [248]:
r_err(reg.predict(X_test), y_test)

882.300568198362


In [249]:
test_model(dt, X_test, y_test)

1113.0


**Q18. What is the Residual Sum of Squares value for the fitted Decision Tree Regression Model on the test set?**

## Task 5 - Mean Absolute Error
Write a function to compute the Mean Absolute Error (MAE), which is given by:

$$
MAE = \frac{1}{N} \sum_{i=1}^N |p_i - y_i|
$$

where $p_i$ refers to the $i^{\rm th}$ `prediction`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
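
As a quick worked instance of the formula, the sketch below evaluates the MAE by hand with NumPy on three made-up prediction/label pairs (the arrays `p` and `y` are hypothetical and differ from the values used in the questions).

```python
import numpy as np

# Made-up predictions and true values.
p = np.array([3.0, 5.0, 2.0])
y = np.array([1.0, 5.0, 4.0])

# Absolute errors: (2, 0, 2); mean: 4 / 3.
mae = np.mean(np.abs(p - y))
print(round(float(mae), 3))  # 1.333
```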

_**Function Specifications:**_
* Should take two `arrays` as input. You can think of these as the `predictions` and `y_test` variables you get when testing a model. 
* Should return the mean absolute error over the input from the predicted values of `X_test` as compared to values of `y_test`.
* The output should be a `float` rounded to 3 decimal places.

In [252]:
def mean_abs_err(predictions, y_test):
    """Return the MAE between predictions and y_test, rounded to 3 decimal places."""
    from sklearn.metrics import mean_absolute_error

    # sklearn expects (y_true, y_pred); MAE is symmetric, but the conventional order is kept.
    mae = mean_absolute_error(y_test, predictions)
    return round(float(mae), 3)

In [253]:
print(mean_abs_err(np.array([7.5,7,1.2]),np.array([3.2,2,-2])))

4.167


**Q9. What is the result of printing out mean_abs_err(np.array([7.5,7,1.2]),np.array([3.2,2,-2]))?**

**Q10. Which regression model (Linear vs DecisionTree) has the lowest Mean Absolute error?**

In [255]:
mean_abs_err(dt.predict(X_test),y_test)

0.486

In [256]:
mean_abs_err(reg.predict(X_test),y_test)

0.577
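
The outputs above show the decision tree with a lower MAE (roughly 0.49 against 0.58) but a higher RSS (1113 against roughly 882) than the linear model. The two metrics can rank models differently because squaring weights large errors more heavily. A made-up illustration (the residual arrays are hypothetical and are not taken from either model):

```python
import numpy as np

# Hypothetical residuals for two models on the same four test points.
steady = np.array([1.0, 1.0, 1.0, 1.0])  # moderate error everywhere
spiky = np.array([0.0, 0.0, 0.0, 3.0])   # mostly exact, one large miss

steady_mae, steady_rss = np.mean(np.abs(steady)), np.sum(steady ** 2)
spiky_mae, spiky_rss = np.mean(np.abs(spiky)), np.sum(spiky ** 2)

print(steady_mae, steady_rss)  # 1.0 4.0
print(spiky_mae, spiky_rss)    # 0.75 9.0
```

Here the "spiky" residuals win on MAE but lose on RSS, the same pattern as the decision tree against the linear model, so the answer to "which model is better" depends on which error metric matters for the task.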