# Universidade Federal do Rio Grande do Norte


## Programa de Pós-Graduação em Engenharia Elétrica e de Computação
## EEC1509 - Machine Learning (Aprendizagem de Máquina)


# Group

## João Lucas Correia Barbosa de Farias

## Júlio Freire Peixoto Gomes


# Project 1 - Red Wine Quality Classification


## About the Project
This project is divided into 8 files, including this one, each representing one step in the process of deploying a machine learning model. We chose a Decision Tree classifier for its simplicity and because it is the algorithm covered in class; other classifiers may achieve a better fit.

The dataset records physicochemical characteristics of red wines together with a quality score for each wine. Our goal is to predict the quality of a red wine from the same features used to train the model.


### Dataset details

For more information, see [Cortez et al., 2009].

### Input variables (based on physicochemical tests):


1. fixed acidity

2. volatile acidity

3. citric acid

4. residual sugar

5. chlorides

6. free sulfur dioxide

7. total sulfur dioxide

8. density

9. pH

10. sulphates

11. alcohol

### Output variable (based on sensory data):

12. quality (score between 0 and 10)
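
The data checks later in this notebook refer to columns by names such as `fixed_acidity` and `ph`, while the variables are listed above as `fixed acidity` and `pH`. The preprocessing step that does the renaming lives in another file and is not shown here, so the following is only a sketch of one plausible mapping. `RAW_COLUMNS` and `to_snake_case` are hypothetical names introduced for illustration.

```python
# Column names as listed above (assumed to match the header of the Kaggle CSV).
RAW_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]


def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# Hypothetical helper output: the names the data checks expect.
clean_columns = [to_snake_case(name) for name in RAW_COLUMNS]
print(clean_columns)
```

With pandas, the same mapping could be applied to a DataFrame with `df.columns = [to_snake_case(c) for c in df.columns]`.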

## The dataset was taken from Kaggle:
https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009

# 1.0 Install and Load Libraries


In [None]:
# install wandb
!pip install wandb

In [None]:
# install pytest
!pip install pytest pytest-sugar

In [None]:
import wandb

# 2.0 Data Check

After the preprocessing stage, we check that the data matches what we expect: row count, number of columns, column names and dtypes, and value ranges.

## 2.1 Login to Weights & Biases

In [None]:
# login to wandb
!wandb login --relogin

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 2.2 Write a .py file to run pytest on

In [None]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="red_wine_quality", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("red_wine_quality/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    Here we test if the dataset has more than 1000 rows
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    Here we test if the dataset has the number of columns (features) we expect.
    """
    assert data.shape[1] == 12


def test_column_presence_and_type(data):
    """
    Here we test if the columns have the appropriate dtypes
    """
    required_columns = {
        "fixed_acidity": pd.api.types.is_float_dtype,
        "volatile_acidity": pd.api.types.is_float_dtype,
        "citric_acid": pd.api.types.is_float_dtype,
        "residual_sugar": pd.api.types.is_float_dtype,
        "chlorides": pd.api.types.is_float_dtype,
        "free_sulfur_dioxide": pd.api.types.is_float_dtype,
        "total_sulfur_dioxide": pd.api.types.is_float_dtype,
        "density": pd.api.types.is_float_dtype,
        "ph": pd.api.types.is_float_dtype,
        "sulphates": pd.api.types.is_float_dtype,
        "alcohol": pd.api.types.is_float_dtype,
        "quality": pd.api.types.is_object_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_func in required_columns.items():

        assert format_verification_func(data[col_name]), f"Column {col_name} failed test {format_verification_func}"


def test_column_ranges(data):
    """
    Here we test if each numeric feature lies within its expected range (bounds inclusive)
    """
    ranges = {
        "fixed_acidity": (0, 20),
        "volatile_acidity": (0, 3),
        "citric_acid": (0, 2),
        "residual_sugar": (0, 20),
        "chlorides": (0, 1),
        "free_sulfur_dioxide": (0, 100),
        "total_sulfur_dioxide": (0, 500),
        "density": (0, 2),
        "ph": (0, 5),
        "sulphates": (0, 3),
        "alcohol": (0, 20),
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

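@pytest.fixture(scope="session", autouse=True)
def finish_run():
    # Finish the run after the last test. A module-level run.finish() would
    # execute at import time, before any test has used the run.
    yield
    run.finish()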

Overwriting test_data.py
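
To make the logic of `test_column_ranges` concrete, here is a plain-Python sketch of the same check, written without pandas so it can run anywhere. It mirrors two details of the pandas version: `dropna()` skips missing values, and `Series.between` includes both bounds by default. The sample rows and the helper names `is_missing` and `out_of_range` are hypothetical and are not part of the project.

```python
import math

# Hypothetical sample rows; NaN stands in for a missing measurement.
rows = [
    {"ph": 3.51, "alcohol": 9.4},
    {"ph": 3.20, "alcohol": float("nan")},  # missing value, skipped by the check
    {"ph": 5.00, "alcohol": 20.5},          # ph sits on the bound; alcohol is too high
]

# A subset of the ranges used in test_column_ranges.
ranges = {"ph": (0, 5), "alcohol": (0, 20)}


def is_missing(value):
    """Treat None and NaN as missing, like dropna() does."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def out_of_range(rows, ranges):
    """Return {column: [offending values]} using inclusive bounds."""
    problems = {}
    for col_name, (minimum, maximum) in ranges.items():
        values = [row[col_name] for row in rows if not is_missing(row[col_name])]
        bad = [v for v in values if not (minimum <= v <= maximum)]
        if bad:
            problems[col_name] = bad
    return problems


print(out_of_range(rows, ranges))  # {'alcohol': [20.5]}
```

A `ph` of exactly 5.00 passes because the bounds are inclusive, and the NaN in `alcohol` is ignored rather than reported as a violation.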


In [None]:
# running pytest
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.4

 test_data.py::test_data_length ✓                                 25% ██▌
 test_data.py::test_number_of_columns ✓                           50% █████
 test_data.py::test_column_presence_and_type ✓                    75% ███████▌
 test_data.py::test_column_ranges ✓                              100% ██████████

Results (11.44s):
       4 passed
