# **GROUP 3 PROJECT PROPOSAL**

# Introduction


Of the five senses, taste is often considered the least understood because it is so difficult to quantify. This poses problems for industries in which business depends on how a product tastes, and the wine industry is a prominent example. Expert wine tasters are often needed to determine and certify the quality of different wines through sensory tests (Cortez et al., 2009). This leads to our question:

**Is it possible to predict the quality classification of a wine based on quantifiable physicochemical properties?**


Our dataset contains details of red and white *vinho verde* wine samples from the north of Portugal. Each sample has 11 quantified physicochemical properties, and its quality is classified on an integer scale from 0 to 10.

Dataset citation: \
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. *Decision Support Systems*, 47(4), 547–553. doi:10.1016/j.dss.2009.05.016

Links: \
https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub \
https://archive.ics.uci.edu/ml/datasets/Wine+Quality

# Preliminary exploratory data analysis


In [1]:
# Install the pinned Altair version before importing it;
# this version is needed to concatenate the charts below
! pip install altair==5.0.0rc1
import pandas as pd
import numpy as np
import altair as alt
from sklearn.model_selection import train_test_split

np.random.seed(123)  # fix the global NumPy seed so random sampling is reproducible



In [2]:
data_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";").assign(type = "red")
data_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";").assign(type = "white")

data_wine = pd.concat([data_red, data_white])

In [3]:
# red_train, red_test = train_test_split(data_red, test_size=0.25, random_state=123)
# white_train, white_test = train_test_split(data_white, test_size=0.25, random_state=123)

# X_red_train = red_train.drop('quality', axis=1)
# y_red_train = red_train['quality']
# X_white_train = white_train.drop('quality', axis=1)
# y_white_train = white_train['quality']

wine_train, wine_test = train_test_split(data_wine, test_size=0.25, random_state=123)

X_train = wine_train.drop('quality', axis=1)
y_train = wine_train['quality']
X_train

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1773 | 7.8 | 0.390 | 0.26 | 9.90 | 0.059 | 33.0 | 181.0 | 0.99550 | 3.04 | 0.42 | 10.9 | white |
| 1094 | 6.6 | 0.725 | 0.09 | 5.50 | 0.117 | 9.0 | 17.0 | 0.99655 | 3.35 | 0.49 | 10.8 | red |
| 4813 | 5.8 | 0.360 | 0.50 | 1.00 | 0.127 | 63.0 | 178.0 | 0.99212 | 3.10 | 0.45 | 9.7 | white |
| 853 | 9.3 | 0.360 | 0.39 | 1.50 | 0.080 | 41.0 | 55.0 | 0.99652 | 3.47 | 0.73 | 10.9 | red |
| 2185 | 7.5 | 0.180 | 0.31 | 11.70 | 0.051 | 24.0 | 94.0 | 0.99700 | 3.19 | 0.44 | 9.5 | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3619 | 6.8 | 0.450 | 0.28 | 26.05 | 0.031 | 27.0 | 122.0 | 1.00295 | 3.06 | 0.42 | 10.6 | white |
| 2461 | 8.2 | 0.260 | 0.33 | 2.60 | 0.053 | 11.0 | 71.0 | 0.99402 | 2.89 | 0.49 | 9.5 | white |
| 1346 | 6.1 | 0.590 | 0.01 | 2.10 | 0.056 | 5.0 | 13.0 | 0.99472 | 3.52 | 0.56 | 11.4 | red |
| 1855 | 8.0 | 0.220 | 0.28 | 14.00 | 0.053 | 83.0 | 197.0 | 0.99810 | 3.14 | 0.45 | 9.8 | white |


In [4]:
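# Plot a random subset of the training data; the explicit seed keeps the sample
# reproducible even if cells are run out of order
plot_data = wine_train.sample(75, random_state=123)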

wine_plot1 = (alt.Chart(plot_data)
             .mark_circle(opacity=0.65)
             .encode(
                 x=alt.X("quality", scale=alt.Scale(domain=[1,10], zero=False)),
                 y=alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
                 color="type:N")
             .properties(
                 width=250,
                 height=250)
             .repeat(
                 column=["fixed acidity", "volatile acidity", "citric acid", "residual sugar"])
              .resolve_scale(y="independent"))

wine_plot2 = (alt.Chart(plot_data)
             .mark_circle(opacity=0.65)
             .encode(
                 x=alt.X("quality", scale=alt.Scale(domain=[1,10], zero=False)),
                 y=alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
                 color="type:N")
             .properties(
                 width=250,
                 height=250)
             .repeat(
                 column=["chlorides", "free sulfur dioxide", "total sulfur dioxide", "density"])
              .resolve_scale(y="independent"))

wine_plot3 = (alt.Chart(plot_data)
             .mark_circle(opacity=0.65)
             .encode(
                 x=alt.X("quality", scale=alt.Scale(domain=[1,10], zero=False)),
                 y=alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
                 color="type:N")
             .properties(
                 width=250,
                 height=250)
             .repeat(
                 column=["pH", "sulphates", "alcohol"])
              .resolve_scale(y="independent"))

wine_type_plot = (alt.Chart(plot_data)
             .mark_circle(opacity=0.65)
             .encode(
                 x=alt.X("quality", scale=alt.Scale(domain=[1,10], zero=False)),
                 y=alt.Y("type", scale=alt.Scale(zero=False)),
                 color="type:N")
             .properties(
                 width=250,
                 height=250))

wine_plot4 = alt.hconcat(wine_plot3, wine_type_plot)

wine_plot = alt.VConcatChart(vconcat=(wine_plot1, wine_plot2, wine_plot4), 
                             title=alt.TitleParams("Plot of All Wine Properties vs Quality", fontSize=30, subtitle="(sample size 75)", anchor="middle", dy=-10))

wine_plot      


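- Loaded the red and white wine data into the notebook and combined them into one data frame
- The data are tidy: each variable forms a column, each observation forms a row, and each cell is a single measurement
- Split the data into training and test sets, then separated the training set into `X_train` (predictors) and `y_train` (quality)
- Created a plot of each property against quality using a sample of the training data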
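
The exploration above could also be complemented with a small summary table. The following is only a minimal sketch, assuming a training frame shaped like `wine_train` (a `quality` column and a `type` column). The helper name `summarize_training_data` and the `toy` frame are hypothetical and use made-up values, not rows from the dataset.

```python
import pandas as pd


def summarize_training_data(train_df, target="quality", group="type"):
    """Return (counts, missing) for a training data frame.

    counts:  number of observations per target value (rows) and group (columns)
    missing: number of missing values in each column
    """
    counts = train_df.groupby([target, group]).size().unstack(fill_value=0)
    missing = train_df.isna().sum()
    return counts, missing


# Tiny made-up frame for illustration only (hypothetical values)
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 10.1, 11.2, 12.0],
        "quality": [5, 5, 6, 7],
        "type": ["red", "red", "white", "white"],
    }
)

counts, missing = summarize_training_data(toy)
print(counts)
print(missing)
```

In the notebook, the same call would be `summarize_training_data(wine_train)`. The counts show how many observations fall into each quality class for each wine type, which is useful to know before choosing a classifier, and the missing-value counts confirm whether any cleaning is needed.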

# Methods


# Expected outcomes and significance
