<a href="https://colab.research.google.com/github/jonathandsouza/jy-notebooks/blob/main/MLP1_1_Scikit_learn_and_LR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline 

## Using real-life data in Google Colab

In this course we will be using some real-life datasets. Let's first briefly discuss how to use real-life data in Google Colab:

Pandas can read many kinds of datasets, but in practice the most commonly used formats are probably Excel and CSV files.

Since we work with Google Colab, there is an extra step involved: we also have to make the data available in this online environment.

For that we use the code below, to mount our Google Drive.

In [None]:
from google.colab import drive
drive.mount("/content/drive")  # makes your Drive available under /content/drive

* You can then click on the folder icon, here on the left, to see your drive-folder.
* You can now upload the dataset(s) you need to your Drive. You can then load the data into your notebook using the pandas read_csv and read_excel functions.

You can find the path (in quotes) to which you need to refer by clicking on the three dots in your folder structure and choosing "copy path". You can then paste that path into your read function.

I have put my folder datasets directly in my main folder on Google Drive:

In [None]:
df_iris = pd.read_csv("/content/drive/MyDrive/datasets/iris.csv")

df_iris.head(3)

The folder with (open) datasets we use in this course can be found [here](https://drive.google.com/drive/folders/1QIgKvGSeltdzH6HMLlEM_ddYYObsrBO2?usp=sharing).

So please download the entire folder and upload it to your own Drive.
Now we can start using the data.

## Churn case
In this course we will keep returning to one case: churn (customers who stop using a product or service).

Our ultimate goal is to predict which Telco customers will churn, but we will try some other predictions first.

We have a dataset (churn_data.csv) of 10,000 customers. We know these things about the customers:
- What technology they use (e.g. 4G or cable)
- Age
- How long they have been a customer
- How many times they called the helpdesk last year
- The average monthly invoice amount
- A churn indicator: indicates whether the customer churned

Let's first take a look at the data:

In [None]:
df_churn = pd.read_csv("/content/drive/MyDrive/datasets/churn_data.csv", delimiter=";", decimal=",")
df_churn.head(10)
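
The `delimiter=";"` and `decimal=","` arguments in the cell above matter because this file uses semicolons between fields and commas as decimal marks. The sketch below illustrates the effect on a small in-memory sample; the two rows and their values are made up for illustration and are not taken from churn_data.csv.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same style as churn_data.csv:
# semicolon-separated fields and decimal commas.
raw = "Age;AverageBill\n34;45,50\n51;72,25\n"

# With decimal=",", pandas parses "45,50" as the number 45.5.
sample = pd.read_csv(io.StringIO(raw), delimiter=";", decimal=",")

# Without decimal=",", the same column is read as text instead of numbers.
sample_as_text = pd.read_csv(io.StringIO(raw), delimiter=";")

print(sample.dtypes)
print(sample_as_text.dtypes)
```

If `describe()` skips a column that should be numeric, a wrong decimal setting is a likely cause.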

In [None]:
df_churn.describe()

Because we're just starting out, we will start with a little experiment: trying to predict the AverageBill of customers, only based on their Age and number of SupportCallsLastYear.

# Modelling data with scikit-learn - the basics

Among other things, with scikit-learn we can:
* split datasets into training data and test data
* train different types of models
* analyze the performance of models
* optimize hyperparameters
* select features



Below we split the columns into two categories:
*   the features, these are the independent variables
*   the target, this is the variable we want to predict (AverageBill)

So the question is: can we predict the bill, based on the given characteristics?

In [None]:
features = df_churn[["Age", "SupportCallsLastYear"]]
target = df_churn.AverageBill

To make sure that a trained model is suitable for making predictions on new data, we usually split a dataset into a training set and a test set.

We train the model on the training set and then we test the performance on the test set.

In [None]:
# import sklearn train_test_split functionality
from sklearn.model_selection import train_test_split

We are now going to split the data into different sets:
* the features_train, consists of the feature columns, and from those we take the largest part of the rows to use for training
* the features_test, again the feature columns, but these rows go into the test set
* target_train, this is the target and again the rows we use for training
* target_test, the remaining rows for the test set

scikit-learn's train_test_split function returns these four sets.

We pass the features and target to the train_test_split() function and set the size of the test set to 20% of the total.

NB: We set a random_state here. The rows for the test and training sets are selected at random, and random_state seeds that selection. Fixing this value ensures that you get the same split the next time you run this notebook.
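
To see what fixing random_state does, here is a minimal sketch on a small synthetic dataset (the `demo_*` names and values are made up for illustration, not part of the churn data). Splitting twice with the same seed should select the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the churn dataset).
demo_features = pd.DataFrame({"x": np.arange(10)})
demo_target = pd.Series(np.arange(10) * 2.0)

# Two splits with the same random_state select identical rows.
a_train, a_test, _, _ = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=1)
same_split = a_test.index.equals(b_test.index)
print("Same test rows with the same seed:", same_split)

# A different seed will generally select different rows.
c_train, c_test, _, _ = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=2)
print(list(a_test.index), list(c_test.index))
```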

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, 
                                                                            target, 
                                                                            test_size=0.2, 
                                                                            random_state=1);

Let's double check whether the split went well.

In [None]:
features_train.shape, features_test.shape, target_train.shape, target_test.shape

We see that of the 10000 entries in the dataset, 2000 have ended up in the test set. That is indeed exactly 20%. Furthermore, we see that both features datasets contain 2 columns, and that for both target datasets the result appears to be a Series object (1 column).

Note that it is quite possible that by splitting the data sets the distributions of the training set and the test set no longer correspond. To ensure that the distribution matches after splitting, you can use "stratified sampling" (which we will not discuss here).
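
For the curious, a rough sketch of one way to stratify a regression target is shown below. It assumes you first bin the continuous target into quantiles (here with `pd.qcut`) and pass those bins to the `stratify` argument of train_test_split. The data and the choice of 5 bins are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, skewed target (not the churn dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": rng.exponential(size=1000),
})

# Bin the continuous target into 5 quantile bins (labels=False gives integer codes).
bins = pd.qcut(demo["target"], q=5, labels=False)

# stratify=bins keeps the share of each bin (roughly) equal in train and test.
train, test = train_test_split(demo, test_size=0.2, random_state=1, stratify=bins)

test_bin_counts = bins.loc[test.index].value_counts().sort_index()
print(test_bin_counts)
```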

In [None]:
features_train.head()

In [None]:
target_train.head()

# Training a model

We are going to initialize our first regression model with default settings.

Each regressor has:
* a `fit(X, y)` function to fit the model to the data
* a `predict(X)` function that uses the fitted model to do predictions

The `fit(X, y)` function expects a features DataFrame (with every feature in a column and each item in a single row), followed by a Series containing the target variable.

The `predict(X)` function expects a DataFrame containing the samples to predict.

Passing a pandas DataFrame usually works fine, but if it causes problems you can convert it to a NumPy array first: `matrix = df.values` (or the equivalent `df.to_numpy()`).
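
As a small illustration of the point above, the sketch below fits the same model once on pandas objects and once on NumPy arrays and checks that the predictions agree. The `demo_X` and `demo_y` values are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: each row is one item, each column one feature.
demo_X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
demo_y = pd.Series([3.0, 6.0, 7.0, 10.0])

# Fit and predict with pandas objects.
preds_df = LinearRegression().fit(demo_X, demo_y).predict(demo_X)

# Fit and predict with the equivalent NumPy arrays.
X_matrix = demo_X.to_numpy()
preds_np = LinearRegression().fit(X_matrix, demo_y.to_numpy()).predict(X_matrix)

print(X_matrix.shape)
print(np.allclose(preds_df, preds_np))
```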

In [None]:
# Initialize our model as a linear regression model (using built-in sklearn functionality)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Let's train the LinearRegression model on the split churn dataset. As input for the fit() function, we use the features_train and target_train variables we created.

In [None]:
lr.fit(features_train, target_train)

scikit-learn has now fitted the model that is stored in 'lr'. This is the trained regression model that tries to predict the AverageBill on the basis of the given features.
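
A fitted LinearRegression exposes what it learned through its `coef_` and `intercept_` attributes. The sketch below shows this on synthetic data where the true relationship is known; the column names `x1` and `x2` and the coefficients 2, 3 and 5 are assumptions chosen for illustration. You could inspect `lr.coef_` and `lr.intercept_` on the churn model in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known, noise-free linear relationship:
# y = 2 * x1 + 3 * x2 + 5
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "x1": rng.uniform(0, 10, 50),
    "x2": rng.uniform(0, 5, 50),
})
demo_y = 2.0 * demo_X["x1"] + 3.0 * demo_X["x2"] + 5.0

demo_lr = LinearRegression().fit(demo_X, demo_y)

# One coefficient per feature column, plus a single intercept.
print(dict(zip(demo_X.columns, demo_lr.coef_)))
print(demo_lr.intercept_)
```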

## Using the model for predictions on new data

We can use the trained model for predictions on new data from the test set.

In [None]:
preds = lr.predict(features_test)
preds

In [None]:
preds.shape

We see that the model has made a prediction for each of the 2000 entries in the test set. We get this back as a NumPy array (not as a pandas Series!).

But is the model right?

We also know what the real value was for these 2000 entries. So we can compare the results of the model with the real answer to see how well the model performs on new data.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# RMSE via np.sqrt: the squared=False argument has been removed in recent scikit-learn releases
print("RMSE",np.sqrt(mean_squared_error(target_test,preds)))
print("MAE",mean_absolute_error(target_test,preds))
print("R2",r2_score(target_test,preds))

Well... not the best model. The mean absolute error is almost half of the average bill.

And looking at the low R-squared score, we can also see that our model cannot explain the variation in AverageBill very well.
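
To make the three scores less of a black box, here is a minimal sketch that computes them by hand with NumPy, using the standard definitions of MAE, RMSE and R-squared. The four `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

# MAE: average size of the errors, in the same unit as the target.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large errors more.
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 minus (squared error of the model / squared error of predicting the mean).
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, rmse, r2)
```

An R-squared close to 0 means the model does little better than always predicting the mean, which is the idea behind the baseline comparison further on.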

And actually, that was to be expected. We didn't do any proper research upfront. For example, we did not check if there is even a linear relationship between the variables. 
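
One quick first check for a linear relationship is the Pearson correlation between each feature and the target, available through `DataFrame.corr()`. The sketch below uses synthetic columns (`LinearBill` and `UnrelatedBill` are hypothetical names, not columns of the churn dataset) to show what a strong and a weak linear relationship look like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 500)

demo = pd.DataFrame({
    "Age": age,
    # Depends linearly on Age, plus a little noise.
    "LinearBill": 0.5 * age + rng.normal(0, 2, 500),
    # Generated independently of Age.
    "UnrelatedBill": rng.normal(40, 10, 500),
})

# Pearson correlation: close to 1 or -1 suggests a linear relationship,
# close to 0 suggests there is no linear relationship.
corr = demo.corr()
print(corr["Age"])
```

Keep in mind that a correlation near 0 only rules out a linear relationship; a scatter plot is a useful complement.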

But for now, it is ok, we just created our first Machine Learning model!

## Using baseline scores
Interpreting performance scores can be quite tricky. A baseline can help: for example, always predicting the mean value, or the median value.
You can create such a dummy model and look at the performance scores of that strategy.
Those dummy scores are a useful starting point for comparison with your actual model.

In [None]:
from sklearn.dummy import DummyRegressor
# let's try always using the mean value
dummy = DummyRegressor(strategy="mean")

dummy.fit(features_train, target_train)
dummy_preds = dummy.predict(features_test)
print("MAE",mean_absolute_error(target_test,dummy_preds))
print("R2",r2_score(target_test,dummy_preds))

It seems our model at least outperforms the dummy regressor, so its predictions are more accurate than always predicting the mean AverageBill.
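
The median baseline mentioned earlier works the same way: pass `strategy="median"` to DummyRegressor. The sketch below compares both strategies on a small, deliberately skewed synthetic target (the values are made up). With an outlier like this, the median baseline should give the lower MAE.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Skewed synthetic target with one large outlier.
demo_y = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
# DummyRegressor ignores the features, so placeholder zeros are enough here.
demo_X = np.zeros((len(demo_y), 1))

results = {}
for strategy in ("mean", "median"):
    baseline = DummyRegressor(strategy=strategy).fit(demo_X, demo_y)
    results[strategy] = mean_absolute_error(demo_y, baseline.predict(demo_X))

print(results)
```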

# Assignment 1.1

**Question 1: making a LR model from scratch**

Below we first fetch an existing dataset to use in this question.

It is a dataset on California housing districts and their house values (the target is the median house value per district, in units of $100,000).

In [None]:
from sklearn.datasets import fetch_california_housing;

data = fetch_california_housing(as_frame=True);

# Get the features and the target
df_features = data['data']
df_target = data['target']


print('features shape',df_features.shape);
print('target shape',df_target.shape);


print('Independent variables:\n',df_features.describe())
print('\nTarget variable:\n',df_target.describe());
print(df_features.head());


features shape (20640, 8)
target shape (20640,)
Independent variables:
              MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  
count  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704  
std       10.386050      2.135952      2.003532  
min        0.692

Train and evaluate a Linear Regression model on this dataset.

Remember to train_test_split(), and you might want to create a dummy model to compare your scores to.

In [None]:
# 1. Split the data into training and testing sets

from sklearn.model_selection import train_test_split;

training_feature, testing_feature, training_target, testing_target = train_test_split(df_features, df_target, test_size=0.2, random_state=1);

print(training_feature.shape, training_target.shape);


(16512, 8) (16512,)


In [None]:
# Create a linear regression model and fit it to the training data

from sklearn.linear_model import LinearRegression;

lr_model = LinearRegression();

lr_model.fit(training_feature, training_target);


In [None]:
# Use the model to make predictions on the test set

predictions = lr_model.predict(testing_feature);


In [None]:
# Evaluate the predictions against the true test targets

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score;

def evaluateMetrics(testing_target, predictions):
    # np.sqrt replaces squared=False, which has been removed in recent scikit-learn releases
    print("RMSE:", np.sqrt(mean_squared_error(testing_target, predictions)));
    print("MAE:", mean_absolute_error(testing_target, predictions));
    print("R2:", r2_score(testing_target, predictions));

evaluateMetrics(testing_target, predictions);



RMSE: 0.7274202599183853
MAE: 0.5328685121247797
R2: 0.596596837481235


In [None]:
# Create a dummy baseline model (DummyRegressor predicts the training mean by default)

from sklearn.dummy import DummyRegressor;

dummy_model = DummyRegressor();

dummy_model.fit(training_feature, training_target);

dummy_preds = dummy_model.predict(testing_feature);

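# Note: this prints the MSE (no square root), so take its square root before comparing with the RMSE above
print("MSE:", mean_squared_error(testing_target, dummy_preds));
print("MAE:", mean_absolute_error(testing_target, dummy_preds));
print("R2:", r2_score(testing_target, dummy_preds));



MSE: 1.3136235330402446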
MAE: 0.9090896299308748
R2: -0.0014734336890012134


**Question 2: improve model**

Try improving the model performance by going back to the dataset and removing the outliers. 
After removing outliers, again divide the dataset into features and target, train_test_split, and train and evaluate your new model.
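
There are several ways to define an outlier. As one possible starting point, the sketch below uses the common 1.5 x IQR rule on a single column. The helper `remove_iqr_outliers`, the factor `k=1.5` and the small demo frame are assumptions for illustration; choose the columns and thresholds that make sense for the housing data yourself.

```python
import pandas as pd

# Small made-up frame with one extreme value in AveRooms.
demo = pd.DataFrame({
    "AveRooms": [4.0, 5.0, 5.5, 6.0, 5.2, 141.9],
    "target": [1.0, 2.0, 2.5, 3.0, 2.2, 5.0],
})

def remove_iqr_outliers(df, column, k=1.5):
    """Keep only rows where `column` lies within k * IQR of the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

cleaned = remove_iqr_outliers(demo, "AveRooms")
print(len(demo), "rows before,", len(cleaned), "rows after")
```

Because rows are dropped from the whole frame, features and target stay aligned when you split them afterwards.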