<a target="_blank" href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/labs/07%20-%20Regression/walkthrough/walkthrough_07.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<h1 align="center"> WALKTHROUGH 7</h1>

<div>
<td> 
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2b/Logo_Universit%C3%A9_de_Lausanne.svg/2000px-Logo_Universit%C3%A9_de_Lausanne.svg.png" style="padding-right:10px;width:240px;float:left"/></td>
<h2 style="white-space: nowrap">Business Intelligence and Analytics</h2></td>
<hr style="clear:both">
<p style="font-size:0.85em; margin:2px; text-align:justify">

</div>

Regression relates input variables to an output variable, either to predict outputs for new inputs or to understand the effect of the inputs on the output. In prediction, we want to predict the output for a new input vector. In interpretation, we want to understand how each input affects the output.

For both goals, we need to find a function that approximates the output “well enough” given some inputs:

$$y_n =f(\boldsymbol{x_{n}})$$

In Python, the **scikit-learn** library (imported as `sklearn`) provides regression along with many other machine learning and statistical tools.

This walkthrough will teach you how to use this library in the context of regression.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
%matplotlib inline

## 1. Load the dataset


From this library we import the `LinearRegression` class. In this section, we discuss the basics of the linear model using the weather dataset as an example. You will then be given a task and perform your own linear regression.

In [None]:
#Load the dataset
url = "https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/weather.csv"
weather = pd.read_csv(url).drop_duplicates().dropna()

# Display a sample of the data
display(weather.head())

# Print the data types
print(weather.dtypes)
print("Data matrix shape: ", weather.shape)

# Display the columns names
print("Columns names: ", weather.columns) 

In [None]:
# Display correlation of the features for numeric_only features
display(weather.corr(numeric_only=True))
display(weather.corrwith(weather['Temp3pm'], numeric_only=True))

**Note:** The purpose here is to predict the temperature from other features (like humidity or pressure). When several features are used as input, we call it multivariate linear regression in this walkthrough (the term multiple linear regression is also common); with a single feature, it is univariate. We will only work with values measured at **3pm** for simplicity.

A LinearRegression has this form for one feature: $$ Y_i = w_0 + w_1 X_i + \epsilon_i$$

The weights $w_0$ and $w_1$ are the coefficients of the model. Combined with the features (the $X$ matrix), they are used to predict the target variable (the $Y$ vector). The regression computes the values of $w_0$ and $w_1$ that minimize the squared error on the training data.
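For a single feature, the least-squares weights have a closed form: $w_1$ is the covariance of $X$ and $Y$ divided by the variance of $X$, and $w_0 = \bar{Y} - w_1 \bar{X}$. The sketch below applies these formulas to a few made-up values (they are assumptions for illustration, not rows of the weather dataset) and cross-checks the result with `np.polyfit`.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the weather dataset)
x = np.array([20.0, 35.0, 50.0, 65.0, 80.0])   # e.g. humidity in %
y = np.array([30.0, 26.0, 23.0, 19.0, 15.0])   # e.g. temperature in degrees

# Closed-form least-squares estimates for one feature
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Cross-check with NumPy's degree-1 polynomial fit (returns the slope first)
slope, intercept = np.polyfit(x, y, deg=1)

print("w0 =", round(w0, 3), "w1 =", round(w1, 3))
print("polyfit:", round(intercept, 3), round(slope, 3))
```

`LinearRegression` solves the same least-squares problem, so it should return the same weights on these values.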

For now we will focus on a simple linear regression with **one feature variable**. We would like to know whether we can use the humidity to predict the temperature. Let's separate the feature input from the target output.

In [None]:
X = weather[['Humidity3pm']] 
y = weather[['Temp3pm']]

## 2. Splitting the dataset

Sklearn has a very useful function, `train_test_split`, to separate your dataset into a training set and a testing set. The training set is used to find the best values of the weights from the input/output pairs, while the test set is used to evaluate the model. Since the model is trained on particular values, we want to test it on data it has not seen (the test set).

The test size here is 20% of the original data.
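As a rough illustration of what the split does, here is a sketch on a ten-row toy frame (the column names and values are assumptions, not the weather data). It shows the resulting sizes, that a fixed `random_state` reproduces the same split, and that features and targets stay aligned row by row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows (hypothetical values, not the weather dataset)
toy = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) * 2.0})
X_toy, y_toy = toy[["feature"]], toy[["target"]]

# 20% of 10 rows gives 2 test rows and 8 training rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Same random_state gives the same split; rows of X and y stay aligned by index
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_te.index.equals(X_te2.index), X_te.index.equals(y_te.index))
```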

In [None]:
# Split the features and the targets into training/testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)

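**Note:** In general, you should normalize the data right after splitting the dataset, fitting the scaler on the training set only so that no information from the test set leaks into training. Normalization puts the features on a comparable scale, which matters for regularized or distance-based models and makes coefficients easier to compare. For an ordinary least-squares regression with an intercept, it does not change the predictions. We skip this step for now.

Scikit-learn provides the `MinMaxScaler` class to normalize the data. This estimator scales and translates each feature individually so that its values on the training set lie in a given range, by default between zero and one.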

This is an example of how to use it:
```python
from sklearn.preprocessing import MinMaxScaler
#Define the scaler
scaler = MinMaxScaler()
#Fit the scaler
scaler.fit(X_train)
#Transform the train and the test set
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Alternatively, fit and transform the train set in one step (this replaces the separate fit/transform of X_train above)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
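The transformation applied by `MinMaxScaler` is $(x - \min) / (\max - \min)$, with the minimum and maximum taken from the training set. The sketch below, on made-up values, reproduces it by hand and shows that test values can fall outside $[0, 1]$ when they lie outside the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): one feature, train and test parts
X_tr = np.array([[10.0], [20.0], [40.0]])
X_te = np.array([[5.0], [30.0], [50.0]])

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learns min=10, max=40 from the training set only
X_te_scaled = scaler.transform(X_te)       # reuses the training min and max

# The same transformation written out by hand
manual = (X_te - X_tr.min(axis=0)) / (X_tr.max(axis=0) - X_tr.min(axis=0))

print(X_tr_scaled.ravel())  # training values lie in [0, 1]
print(X_te_scaled.ravel())  # test values may fall outside [0, 1]
```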

## 3. Create/Fit the model

To predict the target variable we will use a simple linear regression. The class is imported as follows (already done at the beginning of the notebook):

```python
from sklearn.linear_model import LinearRegression
```

**Note:** 
- We create a new `LinearRegression` model from sklearn.
- The `fit()` function fits the linear model to `X_train` (features) and `y_train` (target).
- The `score()` function returns the coefficient of determination R^2 of the prediction. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
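The R^2 score can be written as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The following sketch, on made-up numbers, computes it by hand and checks the three cases mentioned above against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions (hypothetical values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# A constant model that always predicts the mean of y_true scores 0
r2_constant = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# A model that is worse than the mean predictor scores below 0
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_score(y_true, y_pred), r2_constant, r2_bad)
```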

After fitting the model, we can easily retrieve the values of the coefficients (the intercept $w_0$ and the weight of each feature).

In [None]:
# There are three steps to model something with sklearn

# 1. Set up the model
model = LinearRegression(fit_intercept= True)

# 2. Use fit
model.fit(X_train, y_train)

# 3. Check the score (R^2 on the test set)
print("R^2 Score of the model: ", round(model.score(X_test, y_test), 3))


In [None]:
print("Intercept: ", model.intercept_[0]) 
print("Feature coefficients (weights): ", model.coef_.flatten()[0])  # Get the coefficient, w_1

**Note:** Considering this linear equation: $ Y_i = w_0 + w_1 X_i + \epsilon_i$

The intercept corresponds to the value of $w_0$. There is only one coefficient, $w_1$, linked to the humidity feature. Since the intercept and the coefficients are returned as arrays even when they hold a single value, we apply `flatten()` and `[0]` to extract the number.
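The array shapes depend on how the target is passed to `fit()`. The sketch below uses a tiny made-up dataset (an assumption, not the weather data) to compare a one-column DataFrame target, as used in this walkthrough, with a Series target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values) following y = 1 + 2x exactly
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
df["y"] = 1.0 + 2.0 * df["x"]

# Target as a one-column DataFrame (2D), as in this walkthrough
m2d = LinearRegression().fit(df[["x"]], df[["y"]])
print(m2d.intercept_.shape, m2d.coef_.shape)     # (1,) (1, 1)

# Target as a Series (1D): the intercept becomes a plain float
m1d = LinearRegression().fit(df[["x"]], df["y"])
print(np.ndim(m1d.intercept_), m1d.coef_.shape)  # 0 (1,)
```

With a Series target, `model.intercept_` can be printed directly, and only `coef_` needs indexing.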

## 4. Prediction/Evaluation

Once the model is trained, we can use the `predict()` function to predict the values of the test set using `X_test`. These predictions can be compared to the true values, i.e. `y_test`.

Here is an example for a single prediction. Our model takes a matrix as input (the X matrix, with one row per sample and one column per feature), so even to predict from a single scalar value we must pass it as `[[...]]`.
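A minimal sketch of a single prediction, assuming a toy model fitted on four made-up rows that reuse the walkthrough's column names. Passing a one-row DataFrame with the training column name is an alternative to a bare `[[28]]`; it has the same shape and avoids the feature-name warning that recent scikit-learn versions emit for unnamed input.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values), with the same column names as the walkthrough
toy = pd.DataFrame({"Humidity3pm": [20.0, 40.0, 60.0, 80.0],
                    "Temp3pm": [30.0, 25.0, 20.0, 15.0]})
toy_model = LinearRegression().fit(toy[["Humidity3pm"]], toy[["Temp3pm"]])

# One sample, one feature: the input must have shape (1, 1)
sample = pd.DataFrame({"Humidity3pm": [28.0]})
pred = toy_model.predict(sample)

print(pred.shape)          # 2D output because the target was a one-column DataFrame
print(pred.flatten()[0])   # scalar prediction
```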

In [None]:
print("Particular value of humidity: ", X_test.iloc[0].values)

# Compute the prediction for input 28 (humidity)
prediction = model.predict([[28]])
print("Prediction/Truth for humidity 28: ", prediction, y_test.iloc[0].values)

**Note:** Try to use `flatten()` and `[0]` to display the above values as plain numbers.

In [None]:
# Compute the prediction for input 28 (humidity) with flatten
# YOUR CODE HERE

## 5. Evaluation and plotting

To better understand why the prediction and the actual value differ, we can plot the predictions (line) and the true values from the test set (dots). Predicting on the test set is more informative than predicting on the train set, because the model has not seen these values during training.

In [None]:
# Model prediction from X_test
predictions = model.predict(X_test)

In [None]:
# Plot the prediction (the line) over the true value (the dots)
import seaborn as sns
sns.set_style("darkgrid")
plt.scatter(X_test, y_test)
plt.plot(X_test, predictions, 'r')
plt.title("Humidity against temperature")
plt.xlabel('Humidity')
plt.ylabel('Temperature')
plt.show()

We can measure the error of our model with metrics such as the **MAE (mean absolute error)**, the **MSE (mean squared error)** or the **R^2** score. Sklearn provides functions to compute these measures; they are imported at the beginning of the notebook.

These metrics take the `y_test` values and the `predictions` as arguments and quantify how far the predictions are from the true values. They are very helpful when looking for the best model.
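To make the definitions concrete, the sketch below computes the MAE and the MSE by hand on a few made-up values (assumptions, not model output) and compares them with the sklearn functions. It also shows the root of the MSE, which is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy targets and predictions (hypothetical values)
y_true = np.array([20.0, 22.0, 25.0, 30.0])
y_pred = np.array([21.0, 20.0, 25.0, 34.0])

errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))   # average size of the errors, in the units of y
mse_manual = np.mean(errors ** 2)      # squaring penalizes large errors more
rmse_manual = np.sqrt(mse_manual)      # back in the units of y

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(rmse_manual)
```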

In [None]:
# Compare the MAE the MSE and the R^2
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("MAE %.2f" % mae)
print("MSE %.2f" % mse)
print("R^2 %.2f" % r2)

It is also interesting to compare these metrics between the `test set` and the `train set`: a model that scores much better on the training data than on the test data is likely overfitting.

In [None]:
pred = model.predict(X_train)
mae = mean_absolute_error(y_train, pred)
mse = mean_squared_error(y_train, pred)
r2 = r2_score(y_train, pred)

print("MAE %.2f" % mae)
print("MSE %.2f" % mse)
print("R^2 %.2f" % r2)

Remember, the higher the R² value, the better the fit. In this case, the testing data yields a higher coefficient than the training data. While this might seem a bit counterintuitive, it can happen with a single random split, especially when the test set is small. Furthermore, the R² calculated with test data is an unbiased measure of your model’s prediction performance.

## 6. Multivariate Regression


Here, we apply the same method to several features. For instance, it is interesting to use humidity, pressure, sunshine and cloud data together to predict the temperature. We continue to work with the values measured at 3pm for simplicity.
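With several features, the model becomes $Y_i = w_0 + w_1 X_{i1} + \dots + w_p X_{ip} + \epsilon_i$, with one weight per feature. The sketch below fits a two-feature model on made-up data (an assumption for illustration) and checks that a prediction is the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values): 6 samples, 2 features
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                  [2.0, 1.0], [1.0, 3.0], [3.0, 2.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 0.5 * X_toy[:, 1]   # exact linear relation

lr = LinearRegression().fit(X_toy, y_toy)

# A prediction is the intercept plus the weighted sum of the features
manual = lr.intercept_ + X_toy @ lr.coef_

print(lr.intercept_, lr.coef_)
```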

In [None]:
X = weather[['Humidity3pm', 'Cloud3pm', 'Pressure3pm', 'Sunshine']] 
y = weather[['Temp3pm']]

### Split the data into a training set and a test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)

### Fit the model

In [None]:
# 1. Set up the model
model = LinearRegression()
# 2. Use fit
model.fit(X_train, y_train)
# 3. Check the score (R^2 on the test set)
print("R^2 Score of the model: ", round(model.score(X_test, y_test), 3))
# 4. Print the coefficients of the linear model
print("Intercept: ", model.intercept_) 
print("Feature coefficients (weights): ", model.coef_)  # Get the coefficients, w

### Prediction
We use the `predict()` function to predict the values of the test set using `X_test`.
These predictions can be compared to the true values, i.e. `y_test`.


In [None]:
print("Particular value of ['Humidity3pm', 'Cloud3pm', 'Pressure3pm', 'Sunshine']: ", X_test.iloc[0].values)
prediction = model.predict([[ 28.0, 7.0, 1018.2, 7.3]])
print("Prediction/Truth for [ 28.0, 7.0, 1018.2, 7.3]: ", prediction, y_test.iloc[0].values)

### Evaluation
Lastly, we use the MAE (mean absolute error), the MSE (mean squared error) and the R^2 score to analyse how far the predictions are from the true values.
These metrics take the `y_test` values and the `predictions` as arguments.

In [None]:
predictions = model.predict(X_test)

print("MAE %.2f" % mean_absolute_error(y_test, predictions))
print("MSE %.2f" % mean_squared_error(y_test, predictions))
print("R^2 %.2f" % r2_score(y_test, predictions))

In [None]:
predictions = model.predict(X_train)

print("MAE %.2f" % mean_absolute_error(y_train, predictions))
print("MSE %.2f" % mean_squared_error(y_train, predictions))
print("R^2 %.2f" % r2_score(y_train, predictions))

In [None]:
# Arrays to save the different errors
train_err = []
test_err = []

# Iterate over 1, 2, 3 and 4 features
for nbr_col in range(1, 5):
    # Select the first nbr_col features of X
    X_temp = X[X.columns[:nbr_col]]
    # Split the data set
    X_train, X_test, y_train, y_test = train_test_split(X_temp, y, test_size=0.2, random_state=10)
    # Normalize the data with a new MinMaxScaler
    scaler = MinMaxScaler()
    # Fit the scaler on the training set and transform it, then transform the test set
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Create the linear model (no intercept term)
    LR = LinearRegression(fit_intercept=False)
    # Fit the linear model
    LR.fit(X_train, y_train)
    
    # Compute and save the mean absolute error for the training and testing sets
    train_err.append(mean_absolute_error(y_train, LR.predict(X_train)))
    test_err.append(mean_absolute_error(y_test, LR.predict(X_test)))

# Print the train and the test errors
print("Train error: ", train_err)
print("Test error : ", test_err)

plt.title("Training and test error as a function of the number of features")
plt.plot(range(1,5), train_err, label="train_error")
plt.plot(range(1,5), test_err, label="test_error")
plt.legend(fontsize=10)
plt.xlabel("Number of features")
plt.ylabel("Error")
plt.show()


## QUIZ ON MOODLE

In [None]:
weather.head(5)

### QUIZ - QUESTION 1 ON MOODLE

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

X = weather[['Humidity9am', 'Temp9am', 'Cloud9am', 'WindSpeed9am', 'Sunshine']]
y = weather[['Evaporation']]

# Split the data set using a test size of 0.2 and a random state of 10
# YOUR CODE HERE
X_train, X_test, y_train, y_test = ...

# DO NOT NORMALIZE THE DATA

# Create the linear model
# YOUR CODE HERE
model = ...

# Fit the linear model
# YOUR CODE HERE

# Compute the prediction
# YOUR CODE HERE
predictions = ...

# Compute the mean absolute error
# YOUR CODE HERE

print("Intercept: ", model.intercept_) 
print("Feature coefficients (weights): ", model.coef_)

### QUIZ - QUESTION 2 ON MOODLE

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

X = weather[['Humidity9am', 'Temp9am', 'Cloud9am', 'WindSpeed9am', 'Sunshine']]
y = weather[['Evaporation']]

# Split the data set using a test size of 0.2 and a random state of 10
X_train, X_test, y_train, y_test = ...


# Normalize the data using the Normalizer
# Create new scaler from Normalizer()
scaler = Normalizer()
# Fit and transform the training data, then transform the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create the linear model
# YOUR CODE HERE
model = ...

# Fit the linear model
# YOUR CODE HERE

# Compute the prediction
# YOUR CODE HERE
predictions = ...

# Compute the R^2
r2 = ...

print("Intercept: ", model.intercept_) 
print("Feature coefficients (weights): ", model.coef_)