# **CLASSIFICATION AND REGRESSION PROBLEMS:**

**Two types of prediction problems:**

**1. Classification problems**
Here the outcome variable (the variable we want to predict) can only take a finite number of values (categories).

Examples (classification):
*	Classifying email as spam or not
*	Classifying a credit card transaction as fraudulent or not
*	Predicting whether a customer will buy a new product after a marketing campaign

**2. Regression problems**
Here the outcome variable (the variable we want to predict) is numerical.

Examples (regression):
*	Predicting next year's total sales
*	Predicting the price of a stock one month from now (stock market)

The price and the total sales can be any number, so the outcome variable is numerical.





# **REGRESSION PROBLEM**




**Dataset: red wine**

In [None]:
import pandas as pd
import numpy as np


In [None]:
df = pd.read_csv('/content/winequality-red.csv', sep=';')

In [None]:
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


This can be treated as a regression problem, where the outcome variable is `quality`.

In [None]:
from sklearn.linear_model import LinearRegression

scikit-learn expects two objects:

* `x`, the dataframe with the predictors.
* `y`, the series (or dataframe) with the outcome variable.

In [None]:
x = df.drop('quality', axis=1)  # axis=1 tells pandas that 'quality' is a column, not a row label
y = df['quality']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=45)
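
To make the split concrete, here is a minimal sketch of what a random hold-out split does, written in plain Python. The helper name `holdout_split` and the toy rows are illustrative assumptions. This is not how `train_test_split` is implemented internally; it only shows the same idea: shuffle the row indices with a fixed seed and keep 20% of the rows aside for testing.

```python
import random

def holdout_split(rows, test_size=0.2, seed=45):
    """Hypothetical helper: shuffle row indices and hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # fixed seed -> reproducible split
    n_test = round(len(rows) * test_size)     # number of rows kept for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(10))                         # ten toy "rows"
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))         # 8 training rows, 2 test rows
```

Fixing the seed plays the same role as `random_state=45`: running the split again gives exactly the same training and test rows.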

In [None]:
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


As can be seen, the predictors have quite different scales. This may negatively impact the predictive performance of the model.

To solve this problem, we can scale all predictors so that they have a similar scale.

For that, we need the `StandardScaler` class.

In [None]:
from sklearn.preprocessing import StandardScaler
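
As a rough illustration of what standardization does, the sketch below applies z = (x - mean) / std to a few values of `residual sugar` and `density` copied from the first rows of the table above. The helper name `standardize` is hypothetical. It uses the population standard deviation, which is the convention `StandardScaler` follows.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Hypothetical helper: z = (x - mean) / std, using the population std."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# A few values from the first rows of the dataset (very different scales)
residual_sugar = [20.7, 1.6, 6.9, 8.5]
density = [1.0010, 0.9940, 0.9951, 0.9956]

z_sugar = standardize(residual_sugar)
z_density = standardize(density)

# After scaling, both columns have mean 0 and standard deviation 1
print([round(z, 2) for z in z_sugar])
print([round(z, 2) for z in z_density])
```

After this transformation both columns are expressed in "standard deviations from the mean", so they become directly comparable.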

scikit-learn lets us group sequential steps into a pipeline.

The class is `Pipeline`.


In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# define the pipeline (here with a single scaling step)
scaler = Pipeline([
    ('scale', StandardScaler())
])


We now need to apply the scaler to the predictor columns.

For this, we need the `ColumnTransformer` class.

In [None]:
from sklearn.compose import ColumnTransformer

preprocesser = ColumnTransformer([
    # 'scale2' is the name of the scaler step inside the ColumnTransformer; it can be any name we like
    ('scale2', scaler, x.columns.to_list())
],
    remainder='passthrough'  # columns not listed above are passed through unchanged
)

Now, everything is ready to apply the linear regression model.

We will use a pipeline with two steps:

* preprocessing step;
* linear regression.

In the steps of a pipeline, the strings in quotes are the step names, and they can be whatever we want.


In [None]:
pipe = Pipeline([
        ('pre', preprocesser),
        ('ln', LinearRegression())
    ])

pipe.fit(x_train, y_train)

In [None]:
y_pred = pipe.predict(x_train)
y_pred

array([6.48553273, 4.72694379, 5.84088531, ..., 5.48660099, 6.34047368,
       5.4922784 ])

CONTINUATION (13-03)
First table I drew on the tablet

In [None]:
pd.DataFrame({
    'y_true': y_train,
    'y_pred': y_pred
})

Unnamed: 0,y_true,y_pred
1108,7,6.485533
709,5,4.726944
823,6,5.840885
4109,6,6.496332
1243,7,6.425209
...,...,...
4473,5,5.191853
580,5,5.176507
163,6,5.486601
4703,7,6.340474


scikit-learn has a function to calculate the mean absolute error (MAE). We first need to import it.

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mean_absolute_error(y_train, y_pred)

0.5859156936493961
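
To see what `mean_absolute_error` computes, here is a small worked example in plain Python. It uses only the first five rows of the `y_true` / `y_pred` table shown above, so the result differs from the value obtained on the full training set. The variable names are illustrative.

```python
# First five rows of the y_true / y_pred table
y_true = [7, 5, 6, 6, 7]
y_pred = [6.485533, 4.726944, 5.840885, 6.496332, 6.425209]

# Absolute error of each prediction, then their average
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae_by_hand = sum(abs_errors) / len(abs_errors)

print(round(mae_by_hand, 4))
```

The MAE is expressed in the same units as `quality`, so it can be read as "on average, the prediction is off by this many quality points".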

scikit-learn has a function to calculate the mean squared error (MSE). We first need to import it.

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mean_squared_error(y_train, y_pred)

0.5656942908707611

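Since the errors are squared, the MSE is not in the same units as the outcome variable. To compare it with the outcome variable more easily, we can use the square root of the mean squared error (RMSE).

Adding `squared=False` returns the root. Note that recent scikit-learn releases deprecate this parameter in favour of `root_mean_squared_error`; taking `np.sqrt` of the MSE, as done further down, works in any version.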

In [None]:
mean_squared_error(y_train, y_pred, squared=False)

0.7521265125434424

In [None]:
np.sqrt(0.5656942908707611)

0.7521265125434424
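
The relation between MSE and RMSE can also be checked by hand. The sketch below again uses only the first five rows of the `y_true` / `y_pred` table, so the numbers differ from the full training set results. The variable names are illustrative.

```python
import math

# First five rows of the y_true / y_pred table
y_true = [7, 5, 6, 6, 7]
y_pred = [6.485533, 4.726944, 5.840885, 6.496332, 6.425209]

# Squared error of each prediction, then their average (MSE)
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse_by_hand = sum(squared_errors) / len(squared_errors)

# RMSE is simply the square root of the MSE
rmse_by_hand = math.sqrt(mse_by_hand)

print(round(mse_by_hand, 4), round(rmse_by_hand, 4))
```

Because large errors are squared before averaging, the RMSE penalizes a few big mistakes more heavily than the MAE does.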

In [None]:
y_train.mean()

5.8797856049004595

In [None]:
mae = mean_absolute_error(y_train, y_pred)
mse = mean_squared_error(y_train, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE (works in any scikit-learn version)

In [None]:
print(f'MAE = {mae}')
print(f'MSE = {mse}')
print(f'RMSE = {rmse}')

MAE = 0.5859156936493961
MSE = 0.5656942908707611
RMSE = 0.7521265125434424


**COEFFICIENT OF DETERMINATION (R2)**

To calculate R2, we need to import the `r2_score` function.

R2 close to 1 => good predictions

R2 close to 0 => bad predictions

Example: R2 = 0.9 => very good predictions
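
To give some intuition for these values, here is a minimal sketch of R2 computed by hand, assuming the usual definition R2 = 1 - SS_res / SS_tot (residual sum of squares over total sum of squares). The helper name `r2_by_hand` and the toy data are illustrative.

```python
def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (assumed standard definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # error of the model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # error of "always predict the mean"
    return 1 - ss_res / ss_tot

y_true = [7, 5, 6, 6, 7]
mean_only = [sum(y_true) / len(y_true)] * len(y_true)

r2_perfect = r2_by_hand(y_true, y_true)     # perfect predictions
r2_mean = r2_by_hand(y_true, mean_only)     # always predicting the mean

print(r2_perfect, r2_mean)
```

So R2 = 1 means perfect predictions, and R2 = 0 means the model is no better than always predicting the mean of the outcome. A model that does worse than the mean gets a negative R2.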



In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(y_train, y_pred)

0.27577917074526737

In [None]:
r2 = r2_score(y_train, y_pred)

In [None]:
print(f'R2= {r2}')

R2= 0.27577917074526737


**ANALYSIS OF THE RESULT OF R2** 

The predictive performance of this model is poor, since R2 is low (about 0.28). This means that we should try another model, or it suggests that some important predictors are missing.

**NOTE ABOUT PREDICTION MODELS:** 

* There is no perfect model. No model is substantially better than all the others.

  1. Some work fine with a specific dataset, but not with another dataset.
  2. That's why we need to learn several models.

* If a model does poorly, that may mean that important predictors are missing.
* In addition, some problems cannot be predicted at all => example: the EuroMillions lottery.


To get an idea of how our model performs on unseen data, we will use the test set, which has not been used for anything so far.

In [None]:
y_pred = pipe.predict(x_test)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE (works in any scikit-learn version)
r2 = r2_score(y_test, y_pred)

In [None]:
print(f'MAE = {mae}')
print(f'MSE = {mse}')
print(f'RMSE = {rmse}')
print(f'R2= {r2}')

MAE = 0.5730106620077944
MSE = 0.5543180198119331
RMSE = 0.7445253654590508
R2= 0.30403261702378337


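The errors barely change between the training set and the test set, so the model is not overfitting, but its predictive performance stays poor on unseen data as well. On the next dataset, this model will work very well.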
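
The comparison between training and test metrics can be made explicit with a few lines of Python. This sketch only reuses the numbers printed above; the dictionary names are illustrative.

```python
# Metrics copied from the outputs above
train = {'MAE': 0.5859156936493961, 'RMSE': 0.7521265125434424, 'R2': 0.27577917074526737}
test = {'MAE': 0.5730106620077944, 'RMSE': 0.7445253654590508, 'R2': 0.30403261702378337}

# Difference between test and training value for each metric
gaps = {name: test[name] - train[name] for name in train}

for name, gap in gaps.items():
    print(f'{name}: train = {train[name]:.4f}, test = {test[name]:.4f}, gap = {gap:+.4f}')
```

All gaps are small, which is consistent with a model that generalizes as well (or as badly) as it fits the training data.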