<img src='../images/gdd-logo.png' align=right width=300px>

# Sklearn for Linear Regression

In this notebook you will use Scikit-Learn to perform Linear Regression on an E-commerce Customer Dataset.

## Goal

The goal is to predict the ‘Yearly Amount Spent’ by a customer on an E-commerce platform, so that this information can be used to give that customer personalized offers, a loyalty membership, and so on.

The notebook will help you with the initial data exploration, but you will perform the actual modelling!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading in the data

This is the dataset we will be performing linear regression on.

We will try to predict ‘Yearly Amount Spent’, so this will be our target vector *or* dependent variable.

In [None]:
customers = (
    pd.read_csv('../data/Ecomm-Customers.csv')
    .rename(str.lower, axis='columns')
)
customers.head()

### <mark>Exercise: Perform some preliminary analysis on the dataset.</mark>

* How many customers are in this data?

In [None]:
customers['email'].nunique()

* What datatypes does the dataset contain? Are there any missing values?

In [None]:
customers.dtypes

In [None]:
customers.isnull().sum()

* How many predictive features are there? Will all the features be informative for the target?

In [None]:
customers.shape

In [None]:
customers['avatar'].nunique()

* How many different values are there in the 'Yearly Amount Spent' column? Write a sentence to argue why this is a regression task, rather than classification task.

In [None]:
customers['yearly amount spent'].nunique()

In [None]:
customers['yearly amount spent'].min()

## Correlation

*Correlation - a statistic that measures the degree to which two variables move in relation to each other.*

We will try to learn a model that predicts the yearly amount spent by a customer. However, what if we are able to predict that already from one variable alone? Or what if some variables are not relevant for predicting the yearly amount spent? 

Let's investigate whether any of the other variables correlate with the yearly amount spent.
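To make the definition concrete, here is a minimal sketch of how the Pearson correlation coefficient can be computed by hand and checked against NumPy. The arrays `x_toy` and `y_toy` are made-up illustration data, not values from the customer dataset.

```python
import numpy as np

# Hypothetical toy data: y_toy increases almost linearly with x_toy.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation: the covariance divided by the product of the spreads.
x_centered = x_toy - x_toy.mean()
y_centered = y_toy - y_toy.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# np.corrcoef returns the full 2x2 correlation matrix; take the off-diagonal entry.
r_numpy = np.corrcoef(x_toy, y_toy)[0, 1]

print(r_manual, r_numpy)
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.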

### Visual inspection

Using pairplot we can visualise the relationship between the numerical columns.

In [None]:
sns.pairplot(customers)

From the plots, ‘Length of Membership’ and ‘Time on App’ appear to be the variables that correlate most strongly with the dependent variable.

### Correlation matrix

We can also compute the actual pairwise correlation of the columns.

In [None]:
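# numeric_only=True skips text columns such as email and avatar (requires pandas >= 1.5)
customers.corr(numeric_only=True)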

### Heatmap

This correlation matrix can be visualised with a heatmap. 

In [None]:
# Correlate the numeric columns only and draw thin lines between the cells
sns.heatmap(customers.corr(numeric_only=True), linewidths=0.5, annot=True);

Along with the aforementioned variables, we can see that Avg. Session Length may also be informative for predicting the dependent variable.

## <mark>Exercise: Build a Linear Regression Model </mark>

### <mark>1. Create Predictor variables 'X' and Target Variable 'y'</mark>

For the time being, let’s form our feature matrix using the variables that appear to have a high degree of correlation with the dependent variable.

Create the X and y variables. X should have the variables: `Time on App` and `Length of Membership`.


In [None]:
feature_columns = ['time on app', 'length of membership']
X = customers.loc[:, feature_columns]
y = customers.loc[:, 'yearly amount spent']

### <mark>2. Split the data into a training set and a testing set.</mark>

We will train our model on the training set and then use the test set to evaluate the model.

Reserve 30% of the data for testing and use a random seed of 100.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.3)
print(f'X_train shape: {X_train.shape}', 
      f'X_test shape: {X_test.shape}', 
      f'y_train shape: {y_train.shape}', 
      f'y_test shape: {y_test.shape}',
     sep='\n')

### <mark> 3. Import the model</mark>

Look up how to import a simple [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model.

Import the model and then instantiate it.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

### <mark>4. Train the model</mark>

Fit the model to the train set.

In [None]:
model.fit(X_train, y_train)

---

# Intercept and Coefficients

## 4. Investigate the model's intercept and coefficients.

What are they?

```python
# Print the intercept
print("Intercept", model.intercept_)

# Print the coefficients
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)
```

*TAKE CARE: Make sure you replace `model` if you have instantiated your model with a different name!*

In [None]:
# Print the intercept
print("Intercept", model.intercept_)

# Print the coefficients
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)

### Interpreting the model

a) What does the intercept represent in this context?

b) What do the coefficients represent?

c) Are the coefficients comparable?

***yearly amount spent = 36.7 * Time on App + 64.8 * Length of Membership - 177.6***

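The coefficients are not directly comparable, because `time on app` and `length of membership` are measured in different units and on different scales. Each coefficient is the change in the target per one unit of its own feature, so comparing their relative importance requires putting the features on a common scale first.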
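One common way to put coefficients on a comparable footing is to standardise the features before fitting. The sketch below uses synthetic data and hypothetical names (`minutes_toy`, `years_toy`, `X_toy`, `y_toy`); it illustrates the idea and is not part of the exercise solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features on very different scales.
minutes_toy = rng.normal(loc=600, scale=60, size=200)
years_toy = rng.normal(loc=3.5, scale=1.0, size=200)
X_toy = np.column_stack([minutes_toy, years_toy])

# Both features contribute about equally to the spread of the target:
# 0.5 * 60 = 30 and 30 * 1 = 30.
y_toy = 0.5 * minutes_toy + 30.0 * years_toy + rng.normal(scale=1.0, size=200)

# Raw coefficients: change in the target per one unit of each feature.
raw_coefs = LinearRegression().fit(X_toy, y_toy).coef_

# Standardised coefficients: change in the target per one standard deviation.
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_coefs = LinearRegression().fit(X_scaled, y_toy).coef_

print("raw:   ", raw_coefs)
print("scaled:", scaled_coefs)
```

On the raw scale the second coefficient looks far larger, but after standardisation the two are of similar size, which reflects how the toy target was constructed.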

## 5. Model Predictions

Use your model to make predictions on the test set.



In [None]:
y_pred = model.predict(X_test)

In [None]:
np.array([y_test[:10], y_pred[:10]]).T
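A prediction from a linear regression is simply the fitted equation applied to each row: the intercept plus the sum of each feature value multiplied by its coefficient. The sketch below checks this on synthetic data; `X_toy`, `y_toy` and `toy_model` are hypothetical names chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with a known linear relationship plus a little noise.
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)

toy_model = LinearRegression().fit(X_toy, y_toy)

# Apply the fitted equation by hand: intercept + X @ coefficients.
manual_pred = toy_model.intercept_ + X_toy @ toy_model.coef_

# The manual result should agree with .predict() up to floating-point error.
print(np.allclose(manual_pred, toy_model.predict(X_toy)))
```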

## 6. Evaluating Predictions

We want to evaluate whether we have made good predictions or not.

a) Why would accuracy not be a good metric here?
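As a hint for this question, the following sketch shows what happens when accuracy (the share of exact matches) is applied to continuous values. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical true values and close-but-not-identical predictions.
toy_true = np.array([499.25, 512.80, 530.10, 478.65])
toy_pred = np.array([498.90, 513.40, 529.75, 480.00])

# Accuracy counts exact matches, which almost never occur for continuous targets.
exact_match_accuracy = np.mean(toy_true == toy_pred)

# An error-based metric still reflects how close the predictions are.
mean_abs_error = np.mean(np.abs(toy_true - toy_pred))

print(exact_match_accuracy, mean_abs_error)
```

Every prediction here is within a couple of units of the truth, yet the exact-match accuracy is zero, so accuracy says nothing useful about the quality of a regression.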

### Visual inspection

b) **Make a scatter plot** that plots the test set against your predictions.

```python
plt.scatter(y_test, y_pred)
plt.xlabel("Test")
plt.ylabel("Predictions")
```

c) If your linear regression was really accurate, what would you expect to see?
How does yours compare?

In [None]:
plt.scatter(y_test, y_pred)
plt.xlabel("Test")
plt.ylabel("Predictions")
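A reference line can make this kind of plot easier to read. The sketch below, on synthetic data with hypothetical names (`toy_test`, `toy_pred`), adds the diagonal `y = x`, on which every point would fall if the predictions were perfect.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical true values and noisy predictions around them.
toy_test = rng.uniform(300, 700, size=50)
toy_pred = toy_test + rng.normal(scale=20, size=50)

fig, ax = plt.subplots()
ax.scatter(toy_test, toy_pred, alpha=0.7)

# Diagonal y = x spanning the range of both axes.
lims = [min(toy_test.min(), toy_pred.min()), max(toy_test.max(), toy_pred.max())]
ax.plot(lims, lims, color="red", linestyle="--", label="perfect prediction")

ax.set_xlabel("Test")
ax.set_ylabel("Predictions")
ax.legend()
```

The further the points scatter away from the dashed line, the larger the prediction errors.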

### Regression Metrics

Rather than trying to interpret our graph, we can also calculate some metrics.

For example, we could calculate the average error of our regression (e.g. how wrong we were on average). The model that gives you the smallest average error would be the best.

#### **Coefficient of Determination**, denoted $R^2$ 
We can use the `.score()` method here, which returns $R^2$ when applied to a Linear Regression model.

Find the coefficient of determination of your model using the following:

```python
model.score(X_test, y_test)
```

In [None]:
model.score(X_test, y_test)

The **Coefficient of Determination**, or $R^2$ score, is a **goodness-of-fit** measurement for regression models. 

It measures the percentage of variance in the target that the features collectively explain. 

Accordingly, the $R^2$ score indicates how well your model explains the target, usually on a scale between **0 and 100%**. On unseen data it can even be negative, which means the model predicts worse than always guessing the mean of the target.
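The definition can be written as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the target from its mean. The sketch below computes this by hand on made-up numbers and compares it with `sklearn.metrics.r2_score`; the helper `r_squared` and the `toy_*` arrays are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_good = np.array([2.5, 5.5, 7.0, 9.5])   # close to the truth
toy_bad = np.array([9.0, 7.0, 5.0, 3.0])    # worse than predicting the mean


def r_squared(y_true, y_pred):
    """Coefficient of determination computed from its definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot


r2_good = r_squared(toy_true, toy_good)
r2_bad = r_squared(toy_true, toy_bad)

print(r2_good, r2_score(toy_true, toy_good))
print(r2_bad, r2_score(toy_true, toy_bad))
```

The second set of predictions gives a negative score, which shows that $R^2$ is not strictly bounded below by zero when a model fits worse than the mean.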

As well as $R^2$, it is common to report one of the following: 

- **Mean Absolute Error (MAE)**
    -  MAE is the mean of the absolute differences between the target and the predicted values. It measures the average magnitude of the errors in a set of predictions, without considering their direction.
- **Mean Squared Error (MSE)**
    - The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”), squaring them and averaging the result. The squaring removes any negative signs and gives more weight to larger errors.
- **Root Mean Squared Error (RMSE)**
    - The RMSE is the square root of the MSE, so it is expressed in the same units as the target. It indicates the absolute fit of the model to the data: how close the observed data points are to the model's predicted values. 
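The three definitions above can be sketched directly in NumPy. The arrays below are invented illustration values, not results from the customer data.

```python
import numpy as np

toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.0, 5.0, 8.0, 13.0])

errors = toy_true - toy_pred        # [1, 0, -1, -4]

mae = np.mean(np.abs(errors))       # mean of the absolute errors
mse = np.mean(errors ** 2)          # mean of the squared errors
rmse = np.sqrt(mse)                 # back in the units of the target

print(mae, mse, rmse)
```

Here the MAE is 1.5 and the MSE is 4.5; the RMSE (about 2.12) is larger than the MAE because the single error of 4 dominates once the errors are squared.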

d) Why would it not be ok to just calculate Mean Error?
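As a hint, consider what happens to the plain mean of the errors when the model overshoots as often as it undershoots. The values below are made up for illustration.

```python
import numpy as np

toy_true = np.array([100.0, 100.0, 100.0, 100.0])
toy_pred = np.array([90.0, 110.0, 95.0, 105.0])

errors = toy_true - toy_pred             # [10, -10, 5, -5]

mean_error = np.mean(errors)             # positive and negative errors cancel
mean_abs_error = np.mean(np.abs(errors)) # magnitudes are kept

print(mean_error, mean_abs_error)
```

The mean error is 0 even though every single prediction is off, while the mean absolute error of 7.5 reflects the typical size of the mistakes.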

<font color='blue'> There are three metrics which are generally used for evaluation of Regression problems (like Linear Regression, Decision Tree Regression, Random Forest Regression etc.):

- ***Mean Absolute Error (MAE):*** This measures the average absolute distance between the real data and the predicted data. It weights all errors linearly, so large errors are not penalised more heavily than small ones.
- ***Mean Squared Error (MSE):*** This measures the average squared distance between the real data and the predicted data. Here, larger errors carry more weight than in MAE. The disadvantage is that the result is expressed in squared units of the target, which makes it hard to interpret and to compare with the original data.
- ***Root Mean Squared Error (RMSE):*** This is the square root of the MSE. Taking the root brings the metric back to the units of the target, which solves the problem of squared units.

Hence, all of the metrics above measure the average prediction error of the model. They range from 0 to infinity and are negatively oriented, which means that the lower the value, the better your model.
Which metric to choose depends on your model and data. In short, MSE is expressed in squared units, and the absolute value used by MAE is not differentiable at zero, which is inconvenient mathematically; this often makes RMSE the preferred choice of the three.

Import metrics from sklearn and then calculate the following:

e) Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE)

You can find these in sklearn.metrics: 
- mean_absolute_error
- mean_squared_error

In [None]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

f) What is the difference between the metrics? What will MSE/RMSE be more sensitive to than MAE?
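A small experiment can help with this question. The sketch below compares MAE and RMSE on two invented sets of errors that differ in a single value; the helpers `mae` and `rmse` are hypothetical names defined only for this illustration.

```python
import numpy as np


def mae(errors):
    """Mean absolute error of an array of errors."""
    return np.mean(np.abs(errors))


def rmse(errors):
    """Root mean squared error of an array of errors."""
    return np.sqrt(np.mean(errors ** 2))


errors_even = np.array([1.0, 1.0, 1.0, 1.0])      # all errors the same size
errors_outlier = np.array([1.0, 1.0, 1.0, 9.0])   # one large error

print("even:   ", mae(errors_even), rmse(errors_even))
print("outlier:", mae(errors_outlier), rmse(errors_outlier))
```

With equal errors both metrics are 1. With one large error the MAE rises to 3, while the RMSE rises to about 4.58, so the squared metrics react more strongly to a few large mistakes.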

## 7. Overfitting

By evaluating predictions on the data we trained on we can check for overfitting/underfitting.

a) Make predictions on the train set and compute the same metrics as before (MAE, MSE & RMSE).

b) What would you expect to see if we were overfitting?

In [None]:
train_pred = model.predict(X_train)

In [None]:
print('MAE:', metrics.mean_absolute_error(y_train, train_pred))
print('MSE:', metrics.mean_squared_error(y_train, train_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, train_pred)))
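For comparison, here is a sketch of what overfitting can look like. It deliberately fits a linear regression with many uninformative features on very few rows of synthetic data; all names (`X_toy`, `y_toy`, `toy_model`, and so on) are hypothetical and the setup is an assumption chosen to provoke the effect, not a description of the customer dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 60 rows and 25 features, of which only the first one influences the target.
n_samples, n_features = 60, 25
X_toy = rng.normal(size=(n_samples, n_features))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=1.0, size=n_samples)

# Only 30 training rows for 25 features: the model can almost memorise them.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)
toy_model = LinearRegression().fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, toy_model.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, toy_model.predict(X_te)))

print("train RMSE:", train_rmse)
print("test RMSE: ", test_rmse)
```

A training error that is much lower than the test error is the typical signature of overfitting; similar errors on both sets suggest the model generalises.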

## <mark> Exercise: Incorporating more features </mark>

Remember that we left out a variable with a lower degree of positive correlation? Add that variable (Avg. Session Length) and see if it improves the model.

In [None]:
feature_columns = ['time on app', 'length of membership', 'avg. session length']
X = customers[feature_columns]
y = customers.loc[:, 'yearly amount spent']

In [None]:
from sklearn.model_selection import train_test_split

# Use the same split as before (30% test) so the two models are compared fairly
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.3)
print(f'X_train shape: {X_train.shape}', 
      f'X_test shape: {X_test.shape}', 
      f'y_train shape: {y_train.shape}', 
      f'y_test shape: {y_test.shape}',
     sep='\n')

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
train_pred = model.predict(X_train)

Compare the performance using the same metrics as before (MAE, MSE & RMSE) now that this extra information is included in the feature matrix.

In [None]:
print('TEST DATA:')
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
print('TRAIN DATA:')
print('MAE:', metrics.mean_absolute_error(y_train, train_pred))
print('MSE:', metrics.mean_squared_error(y_train, train_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, train_pred)))

# Conclusion

We have now seen regression models, and how the steps for building a regression model follow the same pattern as in the classification examples. We also saw the need for different metrics.