# **GROUP 3 PROJECT PROPOSAL**

# Introduction


Of the five senses, taste is often thought of as the least understood because it is so difficult to quantify. This poses problems for industries in which sales depend on how a product tastes, and the wine industry is a prime example. Expert wine tasters are often needed to determine and certify the quality of different wines through sensory tests (Cortez et al., 2009). While this is the current industry standard, is there a way to certify the quality of a wine without depending on certified experts? This leads to our question:

**Is it possible to predict the quality classification of a wine based on quantifiable physicochemical properties using data analysis?**


Our dataset contains details of red and white *vinho verde* wine samples from the north of Portugal. Each sample has 11 quantified physicochemical properties and a quality score on an integer scale from 0 to 10.

<u> Dataset citation: </u> \
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. *Decision Support Systems*, 47(4), 547–553. doi:10.1016/j.dss.2009.05.016

# Preliminary Exploratory Data Analysis


In [1]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import (
    train_test_split,
)
! pip install altair==5.0.0rc1  # this version of altair is needed to concatenate the graphs below
import altair as alt



In [2]:
data_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";").assign(type = "red")
data_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";").assign(type = "white")

data_wine = pd.concat([data_red, data_white])

In [3]:
wine_train, wine_test = train_test_split(data_wine, test_size=0.25, random_state=123)

X_train = wine_train.drop('quality', axis=1)
y_train = wine_train['quality']
    
# One-hot encode the wine type so that summing each column counts the red and white samples.
ct = make_column_transformer(
    (OneHotEncoder(), ["type"]))

# OneHotEncoder orders categories alphabetically: column 0 is red, column 1 is white.
type_counts = pd.DataFrame(ct.fit_transform(wine_train)).astype("Int64").sum()

# One-row summary table: the mean of every numeric column plus the per-type sample counts.
stat_table = (wine_train.drop("type", axis=1).mean().to_frame().T
              .add_prefix("mean ")
              .assign(**{"red occurrences": type_counts[0],
                         "white occurrences": type_counts[1]}))

display(wine_train.reset_index(drop=True))
display(stat_table)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.8 | 0.390 | 0.26 | 9.90 | 0.059 | 33.0 | 181.0 | 0.99550 | 3.04 | 0.42 | 10.9 | 6 | white |
| 1 | 6.6 | 0.725 | 0.09 | 5.50 | 0.117 | 9.0 | 17.0 | 0.99655 | 3.35 | 0.49 | 10.8 | 6 | red |
| 2 | 5.8 | 0.360 | 0.50 | 1.00 | 0.127 | 63.0 | 178.0 | 0.99212 | 3.10 | 0.45 | 9.7 | 5 | white |
| 3 | 9.3 | 0.360 | 0.39 | 1.50 | 0.080 | 41.0 | 55.0 | 0.99652 | 3.47 | 0.73 | 10.9 | 6 | red |
| 4 | 7.5 | 0.180 | 0.31 | 11.70 | 0.051 | 24.0 | 94.0 | 0.99700 | 3.19 | 0.44 | 9.5 | 7 | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4867 | 6.8 | 0.450 | 0.28 | 26.05 | 0.031 | 27.0 | 122.0 | 1.00295 | 3.06 | 0.42 | 10.6 | 6 | white |
| 4868 | 8.2 | 0.260 | 0.33 | 2.60 | 0.053 | 11.0 | 71.0 | 0.99402 | 2.89 | 0.49 | 9.5 | 5 | white |
| 4869 | 6.1 | 0.590 | 0.01 | 2.10 | 0.056 | 5.0 | 13.0 | 0.99472 | 3.52 | 0.56 | 11.4 | 5 | red |
| 4870 | 8.0 | 0.220 | 0.28 | 14.00 | 0.053 | 83.0 | 197.0 | 0.99810 | 3.14 | 0.45 | 9.8 | 6 | white |

| | mean fixed acidity | mean volatile acidity | mean citric acid | mean residual sugar | mean chlorides | mean free sulfur dioxide | mean total sulfur dioxide | mean density | mean pH | mean sulphates | mean alcohol | mean quality | red occurrences | white occurrences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.21015 | 0.337659 | 0.319353 | 5.475062 | 0.055704 | 30.792898 | 116.315271 | 0.994681 | 3.218393 | 0.528136 | 10.499358 | 5.817734 | 1178 | 3694 |


In [4]:
np.random.seed(1234)
sample_size = 150                          # change this to change the sample size; the subtitle updates automatically
plot_data = wine_train.sample(sample_size) # plot a sample for readability and to stay well under Altair's default 5000-row limit

red_count = plot_data[plot_data["type"] == "red"].shape[0]     #for subtitle
white_count = plot_data[plot_data["type"] == "white"].shape[0] #for subtitle

def property_plots(properties):
    """Scatter plots of each listed property against quality, one panel per property."""
    return (alt.Chart(plot_data)
            .mark_circle(opacity=0.65)
            .encode(
                x=alt.X("quality", scale=alt.Scale(domain=[1,10], zero=False)),
                y=alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
                color="type:N")
            .properties(
                width=250,
                height=250)
            .repeat(
                column=properties)
            .resolve_scale(y="independent"))

wine_plot1 = property_plots(["fixed acidity", "volatile acidity", "citric acid", "residual sugar"])
wine_plot2 = property_plots(["chlorides", "free sulfur dioxide", "total sulfur dioxide", "density"])
wine_plot3 = property_plots(["pH", "sulphates", "alcohol"])

wine_type_plot = (alt.Chart(plot_data)
             .mark_circle(opacity=0.65)
             .encode(
                 x=alt.X("quality", scale=alt.Scale(domain=[1,10], zero=False)),
                 y=alt.Y("type", scale=alt.Scale(zero=False)),
                 color="type:N")
             .properties(
                 width=250,
                 height=250))

wine_plot4 = alt.hconcat(wine_plot3, wine_type_plot)

wine_plot = alt.VConcatChart(vconcat=(wine_plot1, wine_plot2, wine_plot4), 
                             title=alt.TitleParams("Plot of All Wine Properties vs Quality", fontSize=30, 
                                                   subtitle="(sample size "+str(sample_size)+": "+str(red_count)+" red "+str(white_count)+" white)", anchor="middle", dy=-10))

wine_plot
# If you get a schema error at the end, it's because of a bug that is fixed in altair 5.0.0rc1.
# It is installed above, but if that doesn't work, clear all kernel outputs and then run all cells.

# Methods


We will perform a classification using k-nearest neighbours to predict the quality of the observed *vinho verde* wines, using all of the properties provided in the data frame. We chose to use every property because the graphs in our Preliminary Exploratory Data Analysis did not show many strong relationships between individual wine properties and quality, or at least none that were immediately apparent. That is only to the human eye, however, so we hope that once all properties are analyzed together (something that cannot be represented in a 2-D graph), patterns will become apparent to our classification model. We also wish to include the wine type among the properties used in our model. How we do this will be decided while building the project, since we can think of two approaches:

1) We turn the categorical wine type into numerical data using OneHotEncoder() and incorporate it into our k-nearest neighbours model. We aren't sure whether this will work or whether it will confuse our model, since the two wine types can have very different physicochemical makeups. This is best demonstrated by our residual sugar graph. If this does seem to confuse our model and its accuracy is low, we may test our second method.

2) We keep our data separated into two datasets, one for red and one for white, then train a separate classification model on each. This might help our model avoid any confusion between the differing physicochemical makeups of the two wine types.
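
The two options above can be sketched as follows. This is a minimal sketch under stated assumptions, not our final pipeline: it runs on a small synthetic stand-in for the wine table (the `make_fake_wine` helper and its random values are hypothetical), uses only three of the eleven properties, standardizes the numeric columns with `StandardScaler` (our assumption, since k-nearest neighbours is distance based), and fixes `n_neighbors=5` arbitrarily.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_fake_wine(n=200, seed=123):
    """Hypothetical stand-in for the wine table; the values are random, not real data."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "alcohol": rng.normal(10.5, 1.0, n),
        "residual sugar": rng.normal(5.5, 3.0, n),
        "pH": rng.normal(3.2, 0.15, n),
        "type": rng.choice(["red", "white"], n),
        "quality": rng.integers(4, 8, n),  # integer labels 4..7
    })


numeric_cols = ["alcohol", "residual sugar", "pH"]
wine = make_fake_wine()
train, test = train_test_split(wine, test_size=0.25, random_state=123)

# Option 1: a single model; wine type is one-hot encoded alongside the scaled properties.
preprocess_all = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown="ignore"), ["type"]),
)
knn_all = make_pipeline(preprocess_all, KNeighborsClassifier(n_neighbors=5))
knn_all.fit(train.drop(columns="quality"), train["quality"])
acc_all = knn_all.score(test.drop(columns="quality"), test["quality"])

# Option 2: one model per wine type, each trained on that type's rows only.
acc_by_type = {}
for wine_type in ["red", "white"]:
    tr = train[train["type"] == wine_type]
    te = test[test["type"] == wine_type]
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    knn.fit(tr[numeric_cols], tr["quality"])
    acc_by_type[wine_type] = knn.score(te[numeric_cols], te["quality"])

print("option 1 accuracy:", acc_all)
print("option 2 accuracy by type:", acc_by_type)
```

Because the stand-in labels are random, the printed accuracies say nothing about real wines; the point is only that both options can be scored on the same test split and compared.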

We will visualize our results by creating a scatterplot of each property against quality using our testing data (or a sample of it) and then overlaying a line plot of our predicted quality. It will look similar to the graph layout in our Preliminary Exploratory Data Analysis, with the overlay added. We would also like to include a visualization of our prediction accuracy, but we are unsure how to go about it.
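
One possible way to show prediction accuracy, offered as a suggestion rather than a settled choice, is a confusion matrix: a table counting how often each actual quality score was predicted as each score. The sketch below uses scikit-learn's `confusion_matrix` and `accuracy_score`. The `y_true` and `y_pred` lists are made-up placeholders, not results from our model.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up placeholder labels; in the project these would be the test-set
# qualities and the model's predictions for them.
y_true = [5, 5, 6, 6, 6, 7, 5, 6, 7, 4]
y_pred = [5, 6, 6, 6, 5, 6, 5, 6, 7, 5]

labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: actual, columns: predicted
accuracy = accuracy_score(y_true, y_pred)

# Reshape to long format: one row per (actual, predicted) pair with its count.
cm_long = (pd.DataFrame(cm, index=labels, columns=labels)
           .rename_axis(index="actual quality", columns="predicted quality")
           .stack()
           .rename("count")
           .reset_index())

print(cm_long)
print("accuracy:", accuracy)
```

The long-format `cm_long` table could then be drawn as a heatmap in Altair, for example with `mark_rect()` and the count encoded as colour, so that it sits alongside our other plots.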

# Expected Outcomes and Significance


We expect that our model(s) will predict qualities between 4 and 7 most accurately, since that is where most of the training data falls. They may have trouble predicting wines whose quality lies far outside that range.
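
This expectation depends on how unevenly the quality scores are distributed, which can be checked directly. A minimal sketch, using a short made-up list of scores in place of `wine_train["quality"]`:

```python
import pandas as pd

# Made-up quality scores standing in for wine_train["quality"].
quality = pd.Series([5, 6, 6, 5, 7, 6, 4, 5, 6, 8], name="quality")

# Share of samples at each quality score, ordered by score.
class_share = quality.value_counts(normalize=True).sort_index()

# Fraction of samples inside the 4 to 7 range (both ends included).
share_4_to_7 = quality.between(4, 7).mean()

print(class_share)
print("share with quality 4 to 7:", share_4_to_7)
```

Scores with a very small share give a nearest-neighbours model few examples to vote with, which is why we expect weaker predictions at the extremes.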

Accurate prediction of wine quality from physicochemical properties could have large implications for how the wine industry conducts its quality certifications. It is doubtful that our model will be accurate enough to revolutionize the wine certification process on its own, but a small success with our method could encourage further research with more sophisticated methods. If those methods prove trustworthy in predicting wine quality, they could change the industry standard for how wines are certified in the future (assuming people accept this change).

Success here could also lead to future questions about the relationship between the physicochemical properties of various foods and what people think tastes "good".