# Universidade Federal do Rio Grande do Norte


## Programa de Pós-Graduação em Engenharia Elétrica e de Computação
## EEC1509 - Machine Learning (Aprendizagem de Máquina)


# Group

## João Lucas Correia Barbosa de Farias

## Júlio Freire Peixoto Gomes


# Project 1 - Red Wine Quality Classification


## About the Project
This project is divided into 8 files, including this one, each representing one step in the process of deploying a machine learning model. We chose a Decision Tree classifier because of its simplicity and because it is the algorithm we studied in class. Other classifiers may fit the data better.

The dataset describes physicochemical characteristics of red wines together with a quality score for each wine. Our goal is to predict the quality of a red wine from the same features used to train the model.


### The details about the dataset are shown below.

For more information, read [Cortez et al., 2009].

### Input variables (based on physicochemical tests):


1. fixed acidity

2. volatile acidity

3. citric acid

4. residual sugar

5. chlorides

6. free sulfur dioxide

7. total sulfur dioxide

8. density

9. pH

10. sulphates

11. alcohol

### Output variable (based on sensory data):

12. quality (score between 0 and 10)

## The dataset was taken from Kaggle:
https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

# 1.0 Install and Load Libraries


In [None]:
# install wandb
!pip install wandb

In [None]:
import logging
import tempfile
import pandas as pd
import os
import wandb
from sklearn.model_selection import train_test_split

# 2.0 Data Segregation
In this step we split the preprocessed data into training and test sets.

## 2.1 Login to Weights & Biases

In [None]:
# login to wandb
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 2.2 Define the ratios with which to segregate the dataset

In [None]:
# ratio used to split train and test data
test_size = 0.30

# seed used for reproducibility
seed = 13

# reference (column) to stratify the data
stratify = "quality"

# name of the input artifact
artifact_input_name = "red_wine_quality/preprocessed_data.csv:latest"

# type of the artifact
artifact_type = "segregated_data"
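
To illustrate what `test_size`, the seed, and `stratify` control, here is a minimal sketch on a small synthetic table. The toy data and the `toy_*` names are hypothetical and are not part of the project's dataset; only `train_test_split` and its parameters come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with an imbalanced "quality" label
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.05 * i for i in range(100)],
    "quality": [5] * 50 + [6] * 40 + [7] * 10,
})

# Same settings as above: 30% test, fixed seed, stratified on the label
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.30,
    random_state=13,
    stratify=toy["quality"],
)

# With stratification, each class keeps the same share in both splits
train_share = toy_train["quality"].value_counts(normalize=True).sort_index()
test_share = toy_test["quality"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Stratifying matters most for the rare quality scores: without it, a random split could leave very few examples of a rare class in the test set.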

In [None]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

# initiate wandb project
run = wandb.init(project="red_wine_quality", job_type="split_data")

wandb: Currently logged in as: juliofreire (ppgeec-ml-jj). Use `wandb login --relogin` to force relogin


In [None]:
logger.info("Downloading and reading artifact...")
artifact = run.use_artifact(artifact_input_name)
artifact_path = artifact.file()
df = pd.read_csv(artifact_path)

# we will first split the data into train and test sets
logger.info("Splitting data into train and test...")
splits = {}

splits["train"], splits["test"] = train_test_split(df,
                                                   test_size=test_size,
                                                   random_state=seed,
                                                   stratify=df[stratify])

# Save the artifacts. We use a temporary directory so we do not leave any trace behind
with tempfile.TemporaryDirectory() as tmp_dir:

    # use split_df so the loop does not overwrite the full dataframe df
    for split, split_df in splits.items():

        # Make the artifact name from the name of the split
        artifact_name = f"{split}.csv"

        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir, artifact_name)
        print(temp_path)

        logger.info(f"Uploading the {split} dataset to {artifact_name}...")

        # Save then upload to W&B
        split_df.to_csv(temp_path, index=False)

        artifact = wandb.Artifact(name=artifact_name,
                                  type=artifact_type,
                                  description=f"{split} split of dataset {artifact_input_name}",
        )
        artifact.add_file(temp_path)

        logger.info("Logging artifact...")
        run.log_artifact(artifact)

        # this waits for the artifact to be uploaded to wandb. if you
        # do not add this, the temp directory might be removed before
        # wandb has the chance to upload the datasets, and the upload
        # might fail
        artifact.wait()

29-05-2022 04:00:53 Downloading and reading artifact...
29-05-2022 04:00:56 Splitting data into train and test...
29-05-2022 04:00:56 Uploading the train dataset to train.csv...
29-05-2022 04:00:56 Logging artifact...


/tmp/tmpfkwwirg7/train.csv


29-05-2022 04:00:59 Uploading the test dataset to test.csv...
29-05-2022 04:00:59 Logging artifact...


/tmp/tmpfkwwirg7/test.csv
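
The temporary-directory pattern used in the cell above can be sketched without W&B. In this minimal example the two small data frames and the names `demo_splits`, `demo_dir`, and `restored` are hypothetical stand-ins. Reading the file back plays the role of the upload, to show that any work on the files has to finish before the `with` block exits.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the real train/test splits
demo_splits = {
    "train": pd.DataFrame({"alcohol": [9.4, 9.8, 10.5], "quality": [5, 5, 6]}),
    "test": pd.DataFrame({"alcohol": [11.2], "quality": [7]}),
}

restored = {}
with tempfile.TemporaryDirectory() as demo_dir:
    for name, frame in demo_splits.items():
        path = os.path.join(demo_dir, f"{name}.csv")
        frame.to_csv(path, index=False)
        # Anything that needs the file (an upload, a read-back) must
        # complete inside the "with" block
        restored[name] = pd.read_csv(path)

# After the block exits, the directory and its files are deleted
dir_removed = not os.path.exists(demo_dir)
print(dir_removed)
```

This is why the notebook calls `artifact.wait()` inside the loop: the files must still exist while the upload is in progress.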


In [None]:
# finishing the run
run.finish()
