In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Sklearn for Linear Regression

In this notebook you will use Scikit-Learn to perform Linear Regression on an E-commerce Customer Dataset.

## Goal

The goal is to predict the ‘Yearly Amount Spent’ by a customer on an E-commerce platform, so that this information can be used to offer that customer personalized deals, a loyalty membership, and so on.

The notebook will help you with the initial data exploration, but you will perform the actual modelling!

## Linear regression reminder

*Single explanatory variable ($x$):*
 
$$ y = \mathbf{a} x + \mathbf{b} $$

*More than one explanatory variable ($x_1, x_2, ...$):*

$$ y(\mathbf{w}, \mathbf{x}) = w_0 + w_1 x_1 (+ w_2 x_2 + ...)$$

*A possible regression line*

<img src="images/linear-regression.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>

*Residuals can be computed to measure how good the regression line is*

<img src="images/residual.png" style="display: block;margin-left: auto;margin-right: auto;height: 400px"/>
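To make the idea of a regression line and its residuals concrete, here is a minimal sketch on synthetic data. The data, the noise level and the variable names are all made up for illustration; nothing here comes from the customer dataset.

```python
import numpy as np

# Synthetic data (invented for illustration): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Fit y = a*x + b by ordinary least squares (a degree-1 polynomial fit)
a, b = np.polyfit(x, y, deg=1)

# Residual = observed value minus the value predicted by the line
residuals = y - (a * x + b)

print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.2f}")
```

The fitted line is the one that makes the sum of squared residuals as small as possible, which is why that sum is a natural measure of how good the line is.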

## Loading in the data

This is the dataset we will be performing linear regression on.

We will try to predict ‘Yearly Amount Spent’, so this will be our target vector *or* dependent variable.

In [None]:
customers = (
    pd.read_csv('data/Ecomm-Customers.csv')
    .rename(str.lower, axis='columns')
)
customers.head()

### <mark>Exercise: Perform some preliminary analysis on the dataset.</mark>

* How many customers are in this data?

* What datatypes does the dataset contain? Are there any missing values?

* How many predictive features are there? Will all the features be informative for the target?

* How many different values are there in the 'Yearly Amount Spent' column? Write a sentence to argue why this is a regression task, rather than a classification task.
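The pandas calls that are typically useful for questions like these can be sketched on a tiny stand-in frame. The values in `toy` are invented and its column names are only assumed to resemble the real ones; run the same calls on `customers` to answer the exercise.

```python
import pandas as pd

# Tiny stand-in frame: values are invented, column names are assumptions
toy = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "time on app": [12.1, 11.3, None],
    "yearly amount spent": [587.95, 392.20, 487.55],
})

print(toy.shape)                             # (number of rows, number of columns)
print(toy.dtypes)                            # datatype of each column
print(toy.isna().sum())                      # missing values per column
print(toy["yearly amount spent"].nunique())  # number of distinct target values
```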

## Correlation

*Correlation - a statistic that measures the degree to which two variables move in relation to each other.*

We will try to learn a model that predicts the yearly amount spent by a customer. However, what if we are able to predict that already from one variable alone? Or what if some variables are not relevant for predicting the yearly amount spent? 

Let's investigate whether any of the other variables correlate with the yearly amount spent.
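For reference, the statistic usually meant by "correlation" is the Pearson coefficient, which is also the default in pandas. The sketch below computes it by hand on synthetic data and compares it with `np.corrcoef`; the data and variable names are made up for illustration.

```python
import numpy as np

# Synthetic data (invented): y depends linearly on x, plus noise
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)

# Pearson r = covariance(x, y) / (std(x) * std(y)), written with centred values
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = np.sum(x_c * y_c) / np.sqrt(np.sum(x_c ** 2) * np.sum(y_c ** 2))

# The same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual r = {r_manual:.3f}, numpy r = {r_numpy:.3f}")
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.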

### Visual inspection

Using pairplot we can visualise the relationship between the numerical columns.

In [None]:
sns.pairplot(customers)

From the plots, ‘Length of Membership’ and ‘Time on App’ appear to be the variables that correlate most strongly with the dependent variable.

### Correlation matrix

We can also compute the actual pairwise correlation of the columns.

In [None]:
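# numeric_only=True skips the text columns; recent pandas versions raise an error without it
customers.corr(numeric_only=True)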

### Heatmap

This correlation matrix can be visualised with a heatmap. 

In [None]:
# `linewidths` is the documented seaborn parameter for the gaps between cells
sns.heatmap(customers.corr(numeric_only=True), linewidths=0.5, annot=True);

Along with the aforementioned variables, we can see that Avg. Session Length may also be informative for predicting the dependent variable.

## <mark>Exercise: Build a Linear Regression Model </mark>

### <mark>1. Create Predictor variables 'X' and Target Variable 'y'</mark>

For the time being, let’s form our feature matrix using the variables that appear to have a high degree of correlation with the dependent variable.

Create the X and y variables. X should have the variables: `Time on App` and `Length of Membership`. Note that the column names were lower-cased when the data was loaded.
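The general pattern for separating a feature matrix from a target vector can be sketched as follows. The frame `toy` and its values are invented, and the lower-case column names are an assumption about what the real frame looks like after loading.

```python
import pandas as pd

# Stand-in frame: values are invented, column names are assumptions
toy = pd.DataFrame({
    "time on app": [12.1, 11.3, 13.0, 12.5],
    "length of membership": [4.1, 2.7, 3.5, 5.0],
    "yearly amount spent": [587.9, 392.2, 487.5, 610.3],
})

# A list of column names returns a 2-D DataFrame (the feature matrix)
X = toy[["time on app", "length of membership"]]

# A single column name returns a 1-D Series (the target vector)
y = toy["yearly amount spent"]

print(X.shape, y.shape)
```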


### <mark>2. Split the data into a training set and a testing set.</mark>

We will train our model on the training set and then use the test set to evaluate the model.

Reserve 30% of the data for testing and use a random seed of 100.
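A minimal sketch of such a split with scikit-learn's `train_test_split`, shown on placeholder arrays rather than the customer data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` means the same rows end up in the test set every time the cell is run, which keeps results comparable between runs.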

### <mark> 3. Import the model</mark>

Look up how to import a simple [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model in the scikit-learn documentation.

Import the model and then instantiate it.

### <mark>4. Train the model</mark>

Fit the model to the train set.
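The import, instantiate and fit pattern can be sketched on synthetic data where the true relationship is known. The names `X_toy`, `y_toy` and `toy_model` are made up for this illustration; for the exercise, fit on your own training split instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (invented): y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

# Instantiate the estimator, then learn the weights from the data
toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# The learned values should match the ones used to generate the data
print("intercept:", toy_model.intercept_)
print("coefficients:", toy_model.coef_)
```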

---

# Intercept and Coefficients

## 4. Investigate the model's intercept and coefficients.

What are they?

```python
# Print the intercept
print("Intercept", model.intercept_)

# Print the coefficients
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)
```

*TAKE CARE: Make sure you replace `model` if you have instantiated your model with a different name!*

### Interpreting the model

a) What does the intercept represent in this context?

b) What do the coefficients represent?

c) Are the coefficients comparable?
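As a hint for question c), the sketch below shows how a coefficient depends on the units of its feature. The data is invented: the same time variable is expressed once in hours and once in minutes, and the target is an exact linear function of it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: spend rises by 30 for every extra hour
rng = np.random.default_rng(0)
hours = rng.uniform(0, 5, size=50)
spend = 100.0 + 30.0 * hours

# Same information, different unit
minutes = hours * 60

coef_hours = LinearRegression().fit(hours.reshape(-1, 1), spend).coef_[0]
coef_minutes = LinearRegression().fit(minutes.reshape(-1, 1), spend).coef_[0]

print(f"per hour: {coef_hours:.2f}, per minute: {coef_minutes:.2f}")
```

The two models make identical predictions, yet the coefficients differ by a factor of 60, so the size of a coefficient only has meaning together with the unit of its feature.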

## 5. Model Predictions

Use your model to make predictions on the test set.



## 6. Evaluating Predictions

We want to evaluate whether we have made good predictions or not.

a) Why would accuracy not be a good metric here?

### Visual inspection

b) **Make a scatter plot** that plots the test set against your predictions.

```python
plt.scatter(y_test, y_pred)
plt.xlabel("Test")
plt.ylabel("Predictions")
```

c) If your linear regression was really accurate, what would you expect to see?
How does yours compare?

### Regression Metrics

Rather than trying to interpret our graph, we can also calculate some metrics.

For example, we could calculate the average error of our regression (i.e. how wrong we were on average). The model that gives the smallest average error would be considered the best by that metric.

It is common to report one of the following: 

Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE)

d) Why would it not be ok to just calculate Mean Error?

Import metrics from sklearn and then calculate the following:

e) Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE)

You can find these in sklearn.metrics: 
- mean_absolute_error
- mean_squared_error

f) What is the difference between the metrics? What will MSE/RMSE be more sensitive to than MAE?
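How these metrics are computed can be checked on a handful of made-up numbers that are small enough to verify by hand. The arrays below are invented for illustration; RMSE is taken as the square root of MSE with NumPy.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented values: the prediction errors are +2, -2, +1, -1
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 31.0, 39.0])

mean_error = np.mean(y_pred - y_true)      # signed errors, averaged
mae = mean_absolute_error(y_true, y_pred)  # average of |error|
mse = mean_squared_error(y_true, y_pred)   # average of error squared
rmse = np.sqrt(mse)                        # back in the units of y

print(f"ME = {mean_error}, MAE = {mae}, MSE = {mse}, RMSE = {rmse:.3f}")
```

Working through the four errors by hand and comparing with the printed values is a useful check before applying the same functions to your own predictions.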

## 7. Overfitting

By evaluating predictions on the data we trained on, and comparing them with the results on the test set, we can check for overfitting/underfitting.

a) Make predictions on the train set and compute the same metrics as before (MAE, MSE & RMSE).

b) What would you expect to see if we were overfitting?

In [None]:
# Replace `model` if you instantiated your model under a different name
train_pred = model.predict(X_train)

In [None]:
from sklearn import metrics  # needed if not already imported earlier

print('MAE:', metrics.mean_absolute_error(y_train, train_pred))
print('MSE:', metrics.mean_squared_error(y_train, train_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, train_pred)))
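One way to read train and test errors together is to compute the same metric on both splits side by side. The sketch below does this on synthetic data with a known noise level; the data and all names in it are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (invented): linear signal plus noise with standard deviation 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 5.0 + 2.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)
toy_model = LinearRegression().fit(X_tr, y_tr)

# Same metric on the data the model has seen and on the data it has not
rmse_train = np.sqrt(mean_squared_error(y_tr, toy_model.predict(X_tr)))
rmse_test = np.sqrt(mean_squared_error(y_te, toy_model.predict(X_te)))

print(f"train RMSE = {rmse_train:.2f}, test RMSE = {rmse_test:.2f}")
```

Here both errors sit close to the noise level of the generated data. What matters for diagnosing overfitting is the gap between the two numbers, not either number on its own.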

## <mark> Exercise: Incorporating more features </mark>

Remember that we left out a variable with a lower degree of positive correlation? Add that variable (Avg. Session Length) and see if it improves the model.

Compare performance using the same metrics as before (MAE, MSE & RMSE) when this extra information is included in the feature matrix.
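A comparison between two feature sets might be organised as in the sketch below. It uses synthetic data in which a third feature carries a small amount of signal; the data, the helper `rmse_on_test` and all other names are hypothetical and only illustrate the workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (invented): the third feature has a weaker effect than the others
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 3))
y_toy = (1.0 + 2.0 * X_all[:, 0] + 3.0 * X_all[:, 1] + 0.5 * X_all[:, 2]
         + rng.normal(scale=0.1, size=300))

# Split once so both models are evaluated on exactly the same test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_toy, test_size=0.3, random_state=100
)

def rmse_on_test(columns):
    """Fit on the chosen feature columns and return RMSE on the test split."""
    fitted = LinearRegression().fit(X_tr[:, columns], y_tr)
    return np.sqrt(mean_squared_error(y_te, fitted.predict(X_te[:, columns])))

rmse_two = rmse_on_test([0, 1])
rmse_three = rmse_on_test([0, 1, 2])

print(f"two features: {rmse_two:.3f}, three features: {rmse_three:.3f}")
```

Reusing one split for both models keeps the comparison fair, since any difference in error then comes from the features rather than from which rows happened to land in the test set.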

# Conclusion

We have now built a regression model and seen that the steps follow those we used in the classification examples. We also saw why regression needs different metrics.