<a href="https://colab.research.google.com/github/larasacodes/ML_Project_LaraAmusan/blob/main/Wine_Quality_Predictor_LaraAmusan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Individual Project - Wine Quality Predictions - Lara Amusan**

## **1. Problem and Dataset**

##### **Project Objective**
The project objective is to predict the quality of wine based on its chemical properties.


##### **Methodology**
Two datasets containing various chemical properties of the red and white variants of the Portuguese "Vinho Verde" wine are used to train machine learning models that predict a wine quality score (between 0 and 10).


##### **Dataset**
The dataset is from UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/186/wine+quality

##### **Dataset Variables**
The (uncleaned) dataset variables include:
Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality (score between 0 and 10), color (red or white).

A thorough exploration of the property variables used for testing can be found below in the Data Preprocessing & Cleaning section.

## **2.  Data Preprocessing & Cleaning**

After importing the relevant libraries and loading the datasets via URL from the repository where they are stored, the available data can be assessed and the preprocessing steps determined.

In [1]:
# Import all libraries needed for the project

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import Callback
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RepeatedKFold
import re

First, the datasets were loaded from the GitHub repository. Note that sep=';' was used because the data is delimited by semicolons (;) rather than commas.
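
As a quick illustration of why the separator matters, the sketch below parses a small in-memory sample both with and without sep=';'. The sample is a simplified assumption (three of the columns, two rows copied from the red wine output shown further down, unquoted headers), not the real file.

```python
import io

import pandas as pd

# Simplified, illustrative sample in the same semicolon-delimited layout
sample = io.StringIO(
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.7;5\n"
    "7.8;0.88;5\n"
)

# Parsed with the correct separator: one column per variable
with_sep = pd.read_csv(sample, sep=";")

# Parsed with the default comma separator: each line becomes a single field
sample.seek(0)
without_sep = pd.read_csv(sample)

print(with_sep.shape)
print(without_sep.shape)
```

Without the separator, pandas falls back to its default comma delimiter and reads each line as one field, so the variables are not split into separate columns.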

In [10]:
# Loading the data, via GitHub repository. Please update the url as the token expires after some time.

# Red Wine Dataset
urlr=("https://raw.githubusercontent.com/larasacodes/ML_Project_LaraAmusan/main/winequality-red.csv?token=GHSAT0AAAAAACUZGD36FCFWASOGP2Q7W4QEZVAFA7A")
raw_data_red = pd.read_csv(urlr, sep=';')

# White Wine Dataset
urlw=("https://raw.githubusercontent.com/larasacodes/ML_Project_LaraAmusan/main/winequality-white.csv?token=GHSAT0AAAAAACUZGD36QMBCUSQTIGRQHGGUZVAFBDA")
raw_data_white = pd.read_csv(urlw, sep=';')

In [11]:
# Displaying first 5 rows of the data for the red wine dataset

raw_data_red.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [12]:
# Displaying first 5 rows of the data for the white wine dataset

raw_data_white.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


Now that the data has been loaded, it can be seen that both datasets contain the same variables, so the two datasets will be merged. A larger combined dataset gives the models more data to train on, and working with a single dataset makes the process more efficient.
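
The observation that both datasets share the same variables can also be checked programmatically rather than by eye. The following is a minimal sketch on two toy DataFrames; the names `red` and `white` and their values are hypothetical stand-ins for raw_data_red and raw_data_white.

```python
import pandas as pd

# Toy stand-ins for the two raw datasets (hypothetical values)
red = pd.DataFrame({"pH": [3.51, 3.20], "alcohol": [9.4, 9.8], "quality": [5, 5]})
white = pd.DataFrame({"pH": [3.00, 3.30], "alcohol": [8.8, 9.5], "quality": [6, 6]})

# Same column names in the same order
same_columns = list(red.columns) == list(white.columns)

# Same data type for every column
same_dtypes = red.dtypes.equals(white.dtypes)

print("Same columns:", same_columns)
print("Same dtypes:", same_dtypes)
```

If either check came back False, concatenating the two datasets would introduce missing values or mixed types, so it is worth confirming before merging.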

First, however, a new column containing the wine colour is added to each dataset so that red and white wines can still be told apart after merging.

In [13]:
# Adding a new column 'colour' and setting the value to 'red' for all rows in the red wine dataset

raw_data_red['colour'] = 'red'


# Displaying to check

raw_data_red.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


In [14]:
# Adding a new column 'colour' and setting the value to 'white' for all rows in the white wine dataset

raw_data_white['colour'] = 'white'


# Displaying to check

raw_data_white.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white


In [15]:
# Combine the red and white wine datasets

raw_red_white_data = pd.concat([raw_data_red, raw_data_white], ignore_index=True)  # ignore index so the index doesn't start at 0 again for the added dataset


# Displaying to check

raw_red_white_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,white
6493,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,white
6494,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,white
6495,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white


The top rows show red wine data and the bottom rows show white wine data, so we can assume that the data has been successfully merged and is ready for the remaining cleaning steps.

Remaining steps in the data cleaning process:
* Check the dataset for missing values.
* Check the dataset for duplicate rows and remove any that are found.
* Check the number of red and white wines to see whether the dataset is balanced.
* Sort the dataset by quality score in ascending order to check for obvious outliers at either end, which would then be removed. With too few samples around an extreme score, it would be difficult to validate whether the ML models perform well at that score.
* Once done, rename the dataset to mark the completion of cleaning.
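
The steps above can be sketched in one place as follows. This is only an illustrative summary on a toy DataFrame: the helper name `clean_wine_data` and the toy values are assumptions, and the notebook itself carries out each step in its own cell below.

```python
import pandas as pd


def clean_wine_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper bundling the cleaning steps listed above."""
    # Step 1: report missing values per column
    print("Missing values per column:")
    print(df.isnull().sum())

    # Step 2: drop exact duplicate rows and report how many were removed
    deduped = df.drop_duplicates()
    print("Duplicate rows removed:", len(df) - len(deduped))

    # Step 3: count wines per colour to see whether the data is balanced
    print(deduped["colour"].value_counts())

    # Step 4: sort by quality so both extremes are easy to inspect,
    # then reset the index so it runs from 0 again
    return deduped.sort_values(by="quality").reset_index(drop=True)


# Toy data: the third row duplicates the first
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 12.8],
    "quality": [5, 6, 5, 7],
    "colour": ["red", "red", "red", "white"],
})

cleaned = clean_wine_data(toy)
print(cleaned)
```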

In [16]:
# Check for missing values

missing = raw_red_white_data.isnull().sum()
print(missing)

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
colour                  0
dtype: int64


In [17]:
# Check for duplicates

raw_red_white_data[raw_red_white_data.duplicated()]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
4,7.4,0.700,0.00,1.90,0.076,11.0,34.0,0.99780,3.51,0.56,9.400000,5,red
11,7.5,0.500,0.36,6.10,0.071,17.0,102.0,0.99780,3.35,0.80,10.500000,5,red
27,7.9,0.430,0.21,1.60,0.106,10.0,37.0,0.99660,3.17,0.91,9.500000,5,red
40,7.3,0.450,0.36,5.90,0.074,12.0,87.0,0.99780,3.33,0.83,10.500000,5,red
65,7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.99620,3.41,0.39,10.900000,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6427,6.4,0.230,0.35,10.30,0.042,54.0,140.0,0.99670,3.23,0.47,9.200000,5,white
6449,7.0,0.360,0.35,2.50,0.048,67.0,161.0,0.99146,3.05,0.56,11.100000,6,white
6450,6.4,0.330,0.44,8.90,0.055,52.0,164.0,0.99488,3.10,0.48,9.600000,5,white
6455,7.1,0.230,0.39,13.70,0.058,26.0,172.0,0.99755,2.90,0.46,9.000000,6,white


In [18]:
# Removing duplicates

raw_red_white_data = raw_red_white_data.drop_duplicates()

In [19]:
# Check to see all duplicates have been dropped

raw_red_white_data[raw_red_white_data.duplicated()]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour


In [20]:
# Grouping wines by colour to determine the count

wine_colours = raw_red_white_data.groupby('colour').size().reset_index(name='Count')

print(wine_colours)

  colour  Count
0    red   1359
1  white   3961


There are clearly more white wines than red wines in the dataset (3,961 versus 1,359). To prevent model bias towards white wine, I will aim to use sampling techniques during model building so that both wine colours are well represented.
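
One sampling technique that could serve this purpose is a stratified train/test split, which keeps the red-to-white ratio the same in both subsets. The sketch below is an assumption about how this might be done, not necessarily the approach taken later in the project; the toy DataFrame and its values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the same kind of imbalance: 4 red wines and 12 white wines
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 9.5, 8.8, 9.5, 10.1, 9.9,
                11.2, 9.6, 12.8, 12.7, 12.4, 10.4, 12.9, 11.0],
    "colour": ["red"] * 4 + ["white"] * 12,
})

# stratify=toy["colour"] preserves the colour proportions in both subsets
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["colour"], random_state=42
)

print(train["colour"].value_counts())
print(test["colour"].value_counts())
```

Stratifying does not remove the imbalance, but it ensures that neither subset ends up with too few red wines by chance. Resampling the minority colour would be an alternative if a balanced training set is required.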

In [21]:
# Sorting by quality score in ascending order to check for outliers at either end.

raw_red_white_data = raw_red_white_data.sort_values(by='quality', ascending=True)

raw_red_white_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
3287,6.7,0.250,0.26,1.55,0.041,118.5,216.0,0.99490,3.55,0.63,9.4,3,white
1469,7.3,0.980,0.05,2.10,0.061,20.0,49.0,0.99705,3.31,0.55,9.7,3,red
4906,9.4,0.240,0.29,8.50,0.037,124.0,208.0,0.99395,2.90,0.38,11.0,3,white
4864,4.2,0.215,0.23,5.10,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3,white
6344,6.1,0.260,0.25,2.90,0.047,289.0,440.0,0.99314,3.44,0.64,10.5,3,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2475,6.9,0.360,0.34,4.20,0.018,57.0,119.0,0.98980,3.28,0.36,12.7,9,white
2419,6.6,0.360,0.29,1.60,0.021,24.0,85.0,0.98965,3.41,0.61,12.4,9,white
2373,9.1,0.270,0.45,10.60,0.035,28.0,124.0,0.99700,3.20,0.46,10.4,9,white
3204,7.1,0.260,0.49,2.20,0.032,31.0,113.0,0.99030,3.37,0.42,12.9,9,white


The quality scores range from 3 to 9 with no obvious outliers at either end, so no data will be removed at this stage.
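
A count of wines per score is one way to support this visual check, since a truncated display hides most rows. The following is a minimal sketch using a toy series; `toy_quality` and its values are hypothetical and stand in for the quality column.

```python
import pandas as pd

# Toy stand-in for the quality column (hypothetical values)
toy_quality = pd.Series([3, 5, 5, 6, 6, 6, 7, 9], name="quality")

# Number of wines per score, ordered from lowest to highest score
score_counts = toy_quality.value_counts().sort_index()
print(score_counts)

# The range of scores actually present in the data
print("Lowest score:", toy_quality.min())
print("Highest score:", toy_quality.max())
```

Scores with very few samples at either end of this table would be the candidates for removal described in the cleaning steps.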

In [22]:
# Resetting the index

raw_red_white_data = raw_red_white_data.reset_index(drop=True)

raw_red_white_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
0,6.7,0.250,0.26,1.55,0.041,118.5,216.0,0.99490,3.55,0.63,9.4,3,white
1,7.3,0.980,0.05,2.10,0.061,20.0,49.0,0.99705,3.31,0.55,9.7,3,red
2,9.4,0.240,0.29,8.50,0.037,124.0,208.0,0.99395,2.90,0.38,11.0,3,white
3,4.2,0.215,0.23,5.10,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3,white
4,6.1,0.260,0.25,2.90,0.047,289.0,440.0,0.99314,3.44,0.64,10.5,3,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5315,6.9,0.360,0.34,4.20,0.018,57.0,119.0,0.98980,3.28,0.36,12.7,9,white
5316,6.6,0.360,0.29,1.60,0.021,24.0,85.0,0.98965,3.41,0.61,12.4,9,white
5317,9.1,0.270,0.45,10.60,0.035,28.0,124.0,0.99700,3.20,0.46,10.4,9,white
5318,7.1,0.260,0.49,2.20,0.032,31.0,113.0,0.99030,3.37,0.42,12.9,9,white


As the cleaning process is now complete, we can finally rename our dataset to represent the fact that it is now cleaned.
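
When giving a DataFrame a new name, it helps to know that plain assignment binds a second name to the same object, whereas .copy() creates an independent DataFrame. The sketch below shows the difference on a toy DataFrame; all names and values in it are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({"quality": [3, 9]})

alias = raw               # second name for the same DataFrame
independent = raw.copy()  # separate DataFrame with the same contents

# Changing the original is visible through the alias but not the copy
raw.loc[0, "quality"] = 5

print(alias is raw)
print(alias.loc[0, "quality"])
print(independent.loc[0, "quality"])
```

If the raw data is not modified again after cleaning, either form works; a copy only matters when the two names need to evolve separately.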

In [23]:
# Storing the cleaned data under a new name (as an independent copy) and displaying

cleaned_wine_data = raw_red_white_data.copy()
cleaned_wine_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,colour
0,6.7,0.250,0.26,1.55,0.041,118.5,216.0,0.99490,3.55,0.63,9.4,3,white
1,7.3,0.980,0.05,2.10,0.061,20.0,49.0,0.99705,3.31,0.55,9.7,3,red
2,9.4,0.240,0.29,8.50,0.037,124.0,208.0,0.99395,2.90,0.38,11.0,3,white
3,4.2,0.215,0.23,5.10,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3,white
4,6.1,0.260,0.25,2.90,0.047,289.0,440.0,0.99314,3.44,0.64,10.5,3,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5315,6.9,0.360,0.34,4.20,0.018,57.0,119.0,0.98980,3.28,0.36,12.7,9,white
5316,6.6,0.360,0.29,1.60,0.021,24.0,85.0,0.98965,3.41,0.61,12.4,9,white
5317,9.1,0.270,0.45,10.60,0.035,28.0,124.0,0.99700,3.20,0.46,10.4,9,white
5318,7.1,0.260,0.49,2.20,0.032,31.0,113.0,0.99030,3.37,0.42,12.9,9,white
