This is a continuation of Sprint 3, so the first lines of code are copied from it.

In [16]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [17]:
# read the data
red_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
white_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')

In [18]:
# remove lines that have all values duplicated
red_wine.drop_duplicates(inplace=True)
white_wine.drop_duplicates(inplace=True)

In [19]:
# create a df with all wines for later analysis

# add color of wine as parameter
red_wine['red'] = 1
red_wine['white'] = 0
white_wine['white'] = 1
white_wine['red'] = 0

# combine the wine dfs
wine = pd.concat([red_wine, white_wine])

In [20]:
# make all column names lowercase and replace spaces with underscores
wine.rename(str.lower, axis='columns', inplace=True)  # make the names lowercase
wine.columns = wine.columns.str.replace(' ', '_')       # replace space with underscore in column names

In [21]:
y = wine['red']                          # --> what you're trying to predict
X = wine.drop(['red', 'white'], axis=1)  # --> the features we will keep to build our model (wine itself stays intact)
print(y)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)

0       1
1       1
2       1
3       1
5       1
       ..
4893    0
4894    0
4895    0
4896    0
4897    0
Name: red, Length: 5320, dtype: int64


# Epic 1: Pick your (ML) fighter
After all your hard work, it is finally time to choose the ML algorithm that you'll use to answer your question. Not all algorithms are created equal, so selecting one depends on the problem at hand: Is it a supervised or unsupervised learning problem? Is it a regression, a classification or a forecasting problem?

We won't be going into the details of each algorithm in this guide (there are too many!), but bear in mind that you will have to apply several algorithms to the same data set and compare the results.

**Read the Machine Learning Fundamentals chapters if you haven't.**

Please refer to the ML algorithm map we saw during the spike. You can [find it here](https://www.mdpi.com/2075-4426/11/1/32/htm) (figure 1).

We suggest that you explore the different algorithms and their pros and cons. Here is a [wonderful article](https://builtin.com/data-science/tour-top-10-algorithms-machine-learning-newbies) that explains in simple terms different algorithms and their applications. Also, in [this link](https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html), you can find more tips about how to choose your algorithm.

**Hint: start with a simple logistic regression**

### Fit your model
Fitting the model means training the model on training data using the .fit method provided in sklearn. For illustration purposes, we will use Logistic Regression.

In [22]:
# Example
# Fit the model
lr = LogisticRegression()
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
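
The warning above suggests two remedies: raising `max_iter` or scaling the features. Here is a minimal sketch of both combined in one pipeline. It runs on synthetic data from `make_classification` as a stand-in for the wine features (an assumption, so the example works offline), and names such as `scaled_lr` and `X_demo` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 wine features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=42)

# Stretch a few columns so the features live on very different scales
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 1, 1, 1, 1, 1, 1])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# StandardScaler is fitted on the training data only, then applied before the model
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xd_train, yd_train)

demo_accuracy = scaled_lr.score(Xd_test, yd_test)
print("Accuracy on the synthetic test set:", demo_accuracy)
```

With the wine data, the same pipeline could be fitted on `X_train` and `y_train` in place of the synthetic arrays.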


### Predict the test set
Now you will apply the .predict() method to make predictions on test data. These predictions are stored in 'pred_lr'.

In [23]:
# Make predictions
pred_lr = lr.predict(X_test)

In [24]:
pred_lr

array([0, 0, 0, ..., 0, 1, 0])

There is no point in making predictions if you do not evaluate the results. You will now measure the effectiveness of your trained models to determine and compare how well a model performs. These model-evaluation techniques are crucial in machine learning model development.

Here’s a list of evaluators used for classification problems.

# Epic 2: Evaluate your model

### The confusion matrix
The confusion matrix is not an evaluation metric but allows you to have a tabular visualisation of the predictions made by your model vs their actual class.

In this example, we want to classify whether a wine is red or white. After making predictions with our model, we want to know how many predictions were correct and how many were wrong in each category.

One of the easiest ways to visualise your results is with a confusion matrix, as shown below:



In [25]:
print("Confusion matrix:")
print(confusion_matrix(y_test, pred_lr))

Confusion matrix:
[[768   7]
 [ 10 279]]


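The metrics are calculated from the counts of true and false positives and true and false negatives. In this notebook the positive class is red (label 1), because `y` is the `red` column: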

- **TN / True Negative**: when a case was negative and predicted negative
- **TP / True Positive**: when a case was positive and predicted positive
- **FN / False Negative**: when a case was positive but predicted negative
- **FP / False Positive**: when a case was negative but predicted positive
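
As a small sketch, the four counts can be read straight off the matrix printed earlier. scikit-learn puts the actual classes in rows and the predicted classes in columns, both ordered by label, so for a binary problem the layout is `[[TN, FP], [FN, TP]]`. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed in this notebook
matrix = [[768, 7],
          [10, 279]]

# Rows are actual classes, columns are predicted classes (label 0 first, then label 1)
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
print(f"TN={tn} FP={fp} FN={fn} TP={tp} total={total}")
```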

### Accuracy score
Probably the simplest evaluation metric, defined as the number of correct predictions divided by the total number of predictions. But be very careful! When you have **imbalanced data**, where there are more samples of one category than of another, **the accuracy score can be misleading**.

To illustrate this, we’ll see a commonly used example: imagine you work at a hospital, and you have created an ML model to classify tumours as benign or malignant. You run your model and these are your results:
- Of the 91 benign tumours, the model correctly identifies 90 as benign. That's good.
- However, of the 9 malignant tumours, the model only correctly identifies 1 as malignant, meaning that 8 out of 9 malignancies go undiagnosed!

While 91% accuracy may seem good at first glance, the model detects only 1 of the 9 malignant tumours, so it is nearly useless for the task that matters. A model that always predicted "benign" would reach the same 91% accuracy.

**Accuracy Score = (TP + TN) / (TP + TN + FP + FN)**
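
To make the tumour example concrete, here is a short sketch that plugs its counts into the formula, treating "malignant" as the positive class. Recall (TP / (TP + FN)) is included as an extra comparison; the variable names are illustrative.

```python
# Counts from the tumour example, with "malignant" as the positive class
tp = 1    # malignant tumours correctly flagged
fn = 8    # malignant tumours missed
tn = 90   # benign tumours correctly identified
fp = 1    # benign tumour wrongly flagged as malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of malignant tumours the model actually finds

print(f"Accuracy: {accuracy:.2f}")  # high, because benign cases dominate
print(f"Recall:   {recall:.2f}")    # low, because most malignancies are missed
```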

In [26]:
print("Accuracy score:", accuracy_score(y_test, pred_lr))

Accuracy score: 0.9840225563909775


### The classification report
A classification report summarises the quality of a classifier's predictions. For each class it shows precision (the share of predictions for that class that were correct), recall (the share of actual members of that class that the model found), the f1-score (the harmonic mean of precision and recall) and support (the number of test samples in that class).

In [28]:
# target_names follow the sorted labels: 0 (white) comes before 1 (red)
print(classification_report(y_test, pred_lr, target_names=["white","red"]))

              precision    recall  f1-score   support

       white       0.99      0.99      0.99       775
         red       0.98      0.97      0.97       289

    accuracy                           0.98      1064
   macro avg       0.98      0.98      0.98      1064
weighted avg       0.98      0.98      0.98      1064
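
The per-class figures in the report can be reproduced by hand from the confusion matrix counts, as in the sketch below. The helper `precision_recall_f1` is hypothetical (not part of scikit-learn). The first row of the report corresponds to label 0 and the second to label 1.

```python
# Counts from the confusion matrix printed earlier: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 768, 7, 10, 279

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class from its raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Label 1 treated as the positive class
label1 = precision_recall_f1(tp, fp, fn)

# Label 0 treated as the positive class: the roles of the counts swap
label0 = precision_recall_f1(tn, fn, fp)

for name, scores in [("label 0", label0), ("label 1", label1)]:
    print(name, [round(s, 2) for s in scores])
```

Rounded to two decimals, the results should agree with the precision, recall and f1-score columns of the report.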

