# Universidade Federal do Rio Grande do Norte


## Programa de Pós-Graduação em Engenharia Elétrica e de Computação
## EEC1509 - Aprendizagem de Máquina


# Group

## João Lucas Correia Barbosa de Farias

## Júlio Freire Peixoto Gomes


# Project 1 - Red Wine Quality Classification


## About the Project
This project is divided into 8 files, including this one, each representing one step in the process of deploying a machine learning algorithm. We chose a Decision Tree classifier because of its simplicity and because it is the algorithm we studied in class. Other classifiers may achieve a better fit.

The dataset contains physicochemical characteristics of red wines together with a quality score. Our goal is to predict the quality of any red wine from the same features used to train the model.


### The details about the dataset are shown below.

For more information, read [Cortez et al., 2009].

### Input variables (based on physicochemical tests):


1. fixed acidity

2. volatile acidity

3. citric acid

4. residual sugar

5. chlorides

6. free sulfur dioxide

7. total sulfur dioxide

8. density

9. pH

10. sulphates

11. alcohol

### Output variable (based on sensory data):

12. quality (score between 0 and 10)

## The dataset was taken from Kaggle:
https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

# 1.0 Install and Load Libraries


In [None]:
!pip install wandb

In [None]:
import wandb
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
import tempfile
import os

# 2.0 Preprocessing

After the EDA, we move on to preprocessing the dataframe: removing duplicated rows, checking for missing or invalid values, renaming the columns, and converting `quality` into two classes.

## 2.1 Login to Weights & Biases

In [None]:
# login to wandb
!wandb login --relogin

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 2.2 Download raw_data.csv from wandb

In [None]:
# artifact names and metadata, stored in variables so they can be
# reused in the wandb calls below
input_artifact="ppgeec-ml-jj/red_wine_quality/raw_data.csv:latest"
artifact_name="preprocessed_data.csv"
artifact_type="clean_data"
artifact_description="Data After Preprocessing"

In [None]:
# initiate a run, syncing all steps taken on the notebook with wandb
run = wandb.init(project="red_wine_quality", save_code=True)

[34m[1mwandb[0m: Currently logged in as: [33mjuliofreire[0m ([33mppgeec-ml-jj[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
# download latest version of the artifact raw_data.csv
artifact_wandb = run.use_artifact(input_artifact)

# load the file raw_data.csv into a pandas dataframe
df = pd.read_csv(artifact_wandb.file())

In [None]:
# showing Dtype of all columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [None]:
# df.info() shows no missing values, so the first cleaning step is to count the duplicated rows
df.duplicated().sum()

240

In [None]:
# removing duplicated rows
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

0
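
The duplicate check and removal above can be illustrated on a small frame. This is a minimal sketch with made-up values (the `toy` frame is hypothetical, not taken from the dataset). It also shows that `drop_duplicates` keeps the original index labels, which is why the index still runs up to 1598 after the rows are removed.

```python
import pandas as pd

# Toy frame (hypothetical values) in which row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.0],
    "quality": [5, 5, 5, 6],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and the original index labels
deduped = toy.drop_duplicates()

# Optional: renumber the rows 0..n-1 if a contiguous index is preferred
deduped_reset = deduped.reset_index(drop=True)

print(n_duplicates)
print(list(deduped.index))
print(list(deduped_reset.index))
```

Resetting the index is optional here, because the CSV is later written with `index=False`.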

In [None]:
# replacing spaces with underscores in the column names (and lowercasing 'pH' to 'ph')
df.columns = ['fixed_acidity',
 'volatile_acidity',
 'citric_acid',
 'residual_sugar',
 'chlorides',
 'free_sulfur_dioxide',
 'total_sulfur_dioxide',
 'density',
 'ph',
 'sulphates',
 'alcohol',
 'quality']
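
As an alternative to typing the list by hand, the same names can be derived from the original ones. The sketch below assumes the raw column names are exactly those printed by `df.info()` above; the empty `toy` frame is only a stand-in for `df`.

```python
import pandas as pd

# Column names as printed by df.info() on the raw data
raw_columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

# Empty stand-in frame (hypothetical) carrying only the column names
toy = pd.DataFrame(columns=raw_columns)

# Replace spaces with underscores, then lowercase everything ('pH' -> 'ph')
toy.columns = toy.columns.str.replace(" ", "_", regex=False).str.lower()

print(list(toy.columns))
```

Deriving the names avoids a mismatch if the column order ever changes, since a hand-written list is assigned purely by position.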

In [None]:
# after dropping the 240 duplicates, 1359 of the original 1599 rows remain
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1359 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         1359 non-null   float64
 1   volatile_acidity      1359 non-null   float64
 2   citric_acid           1359 non-null   float64
 3   residual_sugar        1359 non-null   float64
 4   chlorides             1359 non-null   float64
 5   free_sulfur_dioxide   1359 non-null   float64
 6   total_sulfur_dioxide  1359 non-null   float64
 7   density               1359 non-null   float64
 8   ph                    1359 non-null   float64
 9   sulphates             1359 non-null   float64
 10  alcohol               1359 non-null   float64
 11  quality               1359 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 138.0 KB


In [None]:
# the median quality is used as the threshold: scores above it become 'good', the rest 'bad'
df['quality'].median()

6.0

In [None]:
# pd.cut uses right-closed intervals: (2, 6.5] -> 'bad', (6.5, 8] -> 'good'
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins=bins, labels=group_names)

In [None]:
df['quality'].value_counts()

bad     1175
good     184
Name: quality, dtype: int64
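
To make the binning rule explicit, here is a minimal sketch of how `pd.cut` treats the bin edges used above. The input scores are illustrative only. With the default settings the intervals are open on the left and closed on the right, and any value outside the outer edges becomes `NaN`.

```python
import pandas as pd

bins = (2, 6.5, 8)
group_names = ["bad", "good"]

# Illustrative scores inside the bin range
scores = pd.Series([3, 5, 6, 7, 8])
labels = pd.cut(scores, bins=bins, labels=group_names)

# Scores outside (2, 8] get no label: 2 sits on the open left edge, 9 is above the top edge
outside = pd.cut(pd.Series([2, 9]), bins=bins, labels=group_names)

print(list(labels))
print(int(outside.isna().sum()))
```

Since no rows are lost in the count that follows, every score in this dataset falls inside the bins. Data with scores of 2 or below, or above 8, would need wider edges.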

In [None]:
# count() ignores NaN, so 1359 confirms that every row received a label
df['quality'].count()

1359

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1359 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   fixed_acidity         1359 non-null   float64 
 1   volatile_acidity      1359 non-null   float64 
 2   citric_acid           1359 non-null   float64 
 3   residual_sugar        1359 non-null   float64 
 4   chlorides             1359 non-null   float64 
 5   free_sulfur_dioxide   1359 non-null   float64 
 6   total_sulfur_dioxide  1359 non-null   float64 
 7   density               1359 non-null   float64 
 8   ph                    1359 non-null   float64 
 9   sulphates             1359 non-null   float64 
 10  alcohol               1359 non-null   float64 
 11  quality               1359 non-null   category
dtypes: category(1), float64(11)
memory usage: 128.9 KB


In [None]:
# creating the preprocessed dataset file
df.to_csv(artifact_name,index=False)
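
For reference, the preprocessing steps of this notebook can be collected into one function. This is only a sketch: the `preprocess` helper and the toy input are hypothetical, and the Weights & Biases download and upload steps are left out.

```python
import pandas as pd


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the notebook's steps on a copy of the input."""
    # 1. remove duplicated rows
    out = raw.drop_duplicates().copy()
    # 2. normalise column names: spaces -> underscores, all lowercase
    out.columns = out.columns.str.replace(" ", "_", regex=False).str.lower()
    # 3. binarise quality: (2, 6.5] -> 'bad', (6.5, 8] -> 'good'
    out["quality"] = pd.cut(out["quality"], bins=(2, 6.5, 8), labels=["bad", "good"])
    return out


# Toy input (made-up values) using the raw column naming style
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.4, 11.2],
    "pH": [3.51, 3.20, 3.51, 3.16],
    "quality": [5, 5, 5, 7],
})

clean = preprocess(toy)
print(clean)
```

Keeping the steps in one function makes it easier to apply exactly the same transformation to new data later on.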

In [None]:
# creating a new artifact from the variables defined earlier and attaching the CSV file
artifact = wandb.Artifact(name=artifact_name,
                          type=artifact_type,
                          description=artifact_description)
artifact.add_file(artifact_name)

<ManifestEntry digest: jOOwdxRXWoxdZBgJIVCrLw==>

In [None]:
# uploading artifact to wandb
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7efdaf05fe10>

In [None]:
run.finish()
