# Refactor: Wine Quality Analysis

* **Objective:** refactor the code to make it cleaner and more modular.
* **Dataset:** [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) - Contains physicochemical properties obtained from laboratory tests and quality ratings given by wine experts.

<hr />


# Table of contents
* [1) Import libraries](#import)
* [2) Read dataset](#dataset)
* [3) Renaming columns](#columns)
    * [3.1) Long solution](#cols1)
    * [3.2) Refactored solution](#cols2)
* [4) Analyzing features](#analysis)
    * [4.1) Long solution](#analysis1)
    * [4.2) Refactored solution](#analysis2)

## 1) Import libraries <a class="anchor" id="import"></a>

In [1]:
import pandas as pd

## 2) Read dataset <a class="anchor" id="dataset"></a>

In [2]:
df = pd.read_csv("./dataset/winequality-red.csv", sep=";")
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## 3) Renaming columns <a class="anchor" id="columns"></a>

Replacing the white space in the column names with underscores makes it possible to access the dataframe columns with dot (`.`) notation.
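To illustrate why the renaming matters, here is a minimal sketch on a small made-up dataframe. The name `demo` and its values are assumptions for this example only and are not part of the wine dataset.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
demo = pd.DataFrame({"fixed acidity": [7.4, 7.8], "pH": [3.51, 3.20]})

# A column name without spaces can be reached with dot notation.
ph_values = demo.pH

# A name containing a space is not a valid Python identifier,
# so that column is only reachable with bracket notation.
acidity_brackets = demo["fixed acidity"]

# After renaming, dot notation works for this column as well.
renamed = demo.rename(columns={"fixed acidity": "fixed_acidity"})
acidity_dot = renamed.fixed_acidity
```

Bracket notation keeps working after the rename, so existing code that uses `renamed["fixed_acidity"]` is unaffected.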

### 3.1) Long solution <a class="anchor" id="cols1"></a>

In [3]:
new_df = df.rename(columns={"fixed acidity": "fixed_acidity",
                            "volatile acidity": "volatile_acidity",
                            "citric acid": "citric_acid",
                            "residual sugar": "residual_sugar",
                            "free sulfur dioxide": "free_sulfur_dioxide",
                            "total sulfur dioxide": "total_sulfur_dioxide"
                           })
new_df.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


### 3.2) Refactored solution <a class="anchor" id="cols2"></a>

The solution in section 3.1 is prone to typing errors because every column name is written by hand. One possible way to avoid this is to generate the new names programmatically:

In [4]:
labels = list(df.columns)
print("labels = ", labels)
print("len(labels) = ",len(labels))

labels =  ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
len(labels) =  12


In [5]:
for i in range(len(labels)):
    labels[i] = labels[i].replace(" ", "_")
print(labels)

['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']


In [6]:
df.columns = labels  # update the column names
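The same renaming can also be sketched without an explicit loop by using the string methods that pandas exposes on the column index. The frame `demo` below is a hypothetical stand-in with column names of the same kind as the wine dataset.

```python
import pandas as pd

# Hypothetical empty frame; only the column labels matter here.
demo = pd.DataFrame(columns=["fixed acidity", "total sulfur dioxide", "pH"])

# Replace spaces in all column labels at once.
# regex=False treats " " as a literal string rather than a pattern.
demo.columns = demo.columns.str.replace(" ", "_", regex=False)

print(list(demo.columns))
# ['fixed_acidity', 'total_sulfur_dioxide', 'pH']
```

This is an alternative to the loop above, not a replacement for it: both approaches produce the same labels.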

## 4) Analyzing features <a class="anchor" id="analysis"></a>

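Check how different features relate to the wine quality rating. Each feature is split into "high" and "low" groups at its median, and the mean quality of the two groups is compared.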

### 4.1) Long solution <a class="anchor" id="analysis1"></a>

In [7]:
df1 = df.copy()

In [8]:
median_alcohol = df1.alcohol.median()
for i, alcohol in enumerate(df1.alcohol):
    if alcohol >= median_alcohol:
        df1.loc[i, 'alcohol'] = 'high'
    else:
        df1.loc[i, 'alcohol'] = 'low'
df1.groupby('alcohol').quality.mean()

alcohol
high    5.958904
low     5.310302
Name: quality, dtype: float64

In [9]:
median_pH = df1.pH.median()
for i, pH in enumerate(df1.pH):
    if pH >= median_pH:
        df1.loc[i, 'pH'] = 'high'
    else:
        df1.loc[i, 'pH'] = 'low'
df1.groupby('pH').quality.mean()

pH
high    5.598039
low     5.675607
Name: quality, dtype: float64

In [10]:
median_sugar = df1.residual_sugar.median()
for i, sugar in enumerate(df1.residual_sugar):
    if sugar >= median_sugar:
        df1.loc[i, 'residual_sugar'] = 'high'
    else:
        df1.loc[i, 'residual_sugar'] = 'low'
df1.groupby('residual_sugar').quality.mean()

residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64

In [11]:
df1.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | low | 0.076 | 11.0 | 34.0 | 0.9978 | high | 0.56 | low | 5 |
| 1 | 7.8 | 0.88 | 0.0 | high | 0.098 | 25.0 | 67.0 | 0.9968 | low | 0.68 | low | 5 |
| 2 | 7.8 | 0.76 | 0.04 | high | 0.092 | 15.0 | 54.0 | 0.997 | low | 0.65 | low | 5 |
| 3 | 11.2 | 0.28 | 0.56 | low | 0.075 | 17.0 | 60.0 | 0.998 | low | 0.58 | low | 6 |
| 4 | 7.4 | 0.7 | 0.0 | low | 0.076 | 11.0 | 34.0 | 0.9978 | high | 0.56 | low | 5 |


In [12]:
median_citric_acid = df1.citric_acid.median()
for i, citric_acid in enumerate(df1.citric_acid):
    if citric_acid >= median_citric_acid:
        df1.loc[i, 'citric_acid'] = 'high'
    else:
        df1.loc[i, 'citric_acid'] = 'low'
df1.groupby('citric_acid').quality.mean()

citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64

### 4.2) Refactored solution <a class="anchor" id="analysis2"></a>

One way to make the code more modular is to turn a repetitive task into a function.

In [13]:
def mean_wine(df, col_name):
    '''Replace the values of a column with the categories "high" and "low",
    depending on whether each value is at or above ("high") or below ("low")
    the median of that column.

    Args:
        df: dataframe
            dataset dataframe
        col_name: string
            name of the column that will be modified
    Returns:
        None. The dataframe is modified in place.
    '''
    median_col_name = df[col_name].median()
    for i, col_i in enumerate(df[col_name]):
        if col_i >= median_col_name:
            df.loc[i, col_name] = "high"
        else:
            df.loc[i, col_name] = "low"
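As a possible further refactoring, the row-by-row loop could be replaced by a vectorized comparison that leaves the dataframe untouched. The sketch below is only one way to do it: the helper name `mean_quality_by_level`, its `target` parameter, and the `demo` values are assumptions made for this example. It avoids two properties of `mean_wine`: the reliance on a default integer index in `df.loc[i, ...]`, and the writing of strings into a numeric column, which recent pandas versions may warn about.

```python
import numpy as np
import pandas as pd


def mean_quality_by_level(df, col_name, target="quality"):
    """Return the mean of `target` for rows at/above vs. below the median of `col_name`.

    Hypothetical helper: unlike `mean_wine`, it does not modify `df`.
    """
    median = df[col_name].median()
    # Label every row in one vectorized step instead of looping.
    level = pd.Series(
        np.where(df[col_name] >= median, "high", "low"),
        index=df.index,
        name=col_name,
    )
    # Group the target column by the labels without writing them back.
    return df[target].groupby(level).mean()


# Made-up values, used only to show the call.
demo = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0, 12.0], "quality": [5, 5, 6, 7]})
result = mean_quality_by_level(demo, "alcohol")
print(result)
```

Because the input is not modified, the helper can be called on the same dataframe for every feature without making a copy first.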

In [14]:
df2 = df.copy()

In [15]:
for col in list(df2.columns)[:-1]:
    mean_wine(df2, col)
    print(df2.groupby(col).quality.mean())
    print("________________")

fixed_acidity
high    5.726061
low     5.540052
Name: quality, dtype: float64
________________
volatile_acidity
high    5.392157
low     5.890166
Name: quality, dtype: float64
________________
citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64
________________
residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64
________________
chlorides
high    5.507194
low     5.776471
Name: quality, dtype: float64
________________
free_sulfur_dioxide
high    5.595268
low     5.677136
Name: quality, dtype: float64
________________
total_sulfur_dioxide
high    5.522981
low     5.750630
Name: quality, dtype: float64
________________
density
high    5.540574
low     5.731830
Name: quality, dtype: float64
________________
pH
high    5.598039
low     5.675607
Name: quality, dtype: float64
________________
sulphates
high    5.898917
low     5.351562
Name: quality, dtype: float64
________________
alcohol
high    5.958904
low     5.310302
Name: quality, dty