# Is it possible to determine white wine quality based on its attributes?


## Introduction:

An alcoholic drink made from fermenting grapes or other fruits, wine has many attributes, including:

- fixed acidity (g(tartaric acid)/dm^3)
- volatile acidity (g(acetic acid)/dm^3)
- citric acid (g/dm^3)
- residual sugar (g/dm^3)
- chlorides (g(sodium chloride)/dm^3)
- free sulfur dioxide (mg/dm^3)
- total sulfur dioxide (mg/dm^3)
- density (g/cm^3)
- pH
- sulphates (g(potassium sulphate)/dm^3)
- alcohol (vol.%)

Quality, which is typically determined by professional wine tasters, could be affected by these attributes of wine. The goal of our project is to answer the following question: is it possible to determine white wine quality based on its attributes?

To answer this question, we will use the Wine Quality dataset from the UC Irvine Machine Learning Repository, which can be accessed [here](https://archive.ics.uci.edu/ml/datasets/Wine+Quality). There are two datasets to choose from: red wine and white wine. We chose the white wine dataset because it has more observations than the red wine dataset. The dataset includes the attributes listed above and a quality rating from 0 to 10 for each wine.

## Preliminary exploratory data analysis:
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

In [7]:
# Loading libraries required for notebook

library(tidyverse)
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 7)

After loading the required libraries, we read the dataset from the web, convert quality to a factor, and replace the spaces in the column names with periods so that the names are valid in R.

In [14]:
wine_data <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", delim = ";") |>
    mutate(quality = as.factor(quality))

colnames(wine_data) <- make.names(colnames(wine_data))

wine_data

Rows: 4898 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
7.0,0.27,0.36,20.7,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.6,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.28,0.40,6.9,0.050,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.40,9.9,6
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
6.5,0.24,0.19,1.2,0.041,30,111,0.99254,2.99,0.46,9.4,6
5.5,0.29,0.30,1.1,0.022,20,110,0.98869,3.34,0.38,12.8,7
6.0,0.21,0.38,0.8,0.020,22,98,0.98941,3.26,0.32,11.8,6
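
As a small illustration of the renaming step, the sketch below applies `make.names` to a few of the raw header names. The vector `raw_names` is only a hand-typed subset of the header, used here for demonstration.

```r
# A few column names as they appear in the raw file header (hand-typed subset)
raw_names <- c("fixed acidity", "volatile acidity", "pH", "alcohol")

# make.names() replaces characters that are not allowed in R names
# (here, the spaces) with a period and leaves valid names unchanged
clean_names <- make.names(raw_names)

print(clean_names)
```

Names without spaces, such as pH and alcohol, pass through unchanged, which matches the column names in the table above.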


We set the seed so that ```initial_split``` produces the same result every time, then split the data so that 90% is used for training and 10% for testing. The split is stratified by quality so that the training and testing sets have similar class proportions.

In [15]:
set.seed(1234) # Setting seed for reproducibility

wine_split <- initial_split(wine_data, prop = 0.90, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split) 
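
One way to sanity-check a stratified split is to compare the share of rows and the class proportions in each piece. The sketch below does this on a synthetic data frame (`toy`, with made-up classes `a`, `b` and `c`) rather than on the wine data, so every name and value in it is an illustrative assumption.

```r
library(rsample)

set.seed(1234)

# Synthetic stand-in data (hypothetical; not the wine data)
toy <- data.frame(
  x = rnorm(1000),
  class = factor(sample(c("a", "b", "c"), size = 1000,
                        replace = TRUE, prob = c(0.2, 0.5, 0.3)))
)

# Same kind of call as used for the wine data: 90/10 split, stratified by class
toy_split <- initial_split(toy, prop = 0.90, strata = class)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of rows that ended up in the training set (should be close to 0.90)
nrow(toy_train) / nrow(toy)

# Class proportions in each set (should be similar because of stratification)
prop.table(table(toy_train$class))
prop.table(table(toy_test$class))
```

The same three checks could be run on `wine_train` and `wine_test` by swapping in those objects and the quality column.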

The table below shows the number of observations in each quality class of the training dataset, computed with ```group_by``` and ```summarize```.

In [23]:
observation_table <- wine_train |>
    group_by(quality) |>
    summarize(count = n())

observation_table

quality,count
<fct>,<int>
3,17
4,145
5,1313
6,1981
7,790
8,157
9,4
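
The counts are strongly imbalanced: most wines are rated 5, 6 or 7, and only a handful are rated 3 or 9. To make this easier to see, the sketch below converts the counts from the table into proportions. The vector `counts` is typed in by hand from the table above, and the majority-class comparison is only a suggested reference point, not part of the planned analysis.

```r
# Class counts copied by hand from the training-set table
counts <- c("3" = 17, "4" = 145, "5" = 1313, "6" = 1981,
            "7" = 790, "8" = 157, "9" = 4)

# Share of the training set that falls in each quality class
proportions <- counts / sum(counts)
round(proportions, 3)

# The most common class; a classifier that always predicted this class
# would be a simple baseline to compare against
names(which.max(counts))
```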


In [24]:
pred_means <- wine_train |>
    select(-quality) |>
    map_df(mean)

pred_means

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
6.853472,0.2795598,0.3344679,6.425051,0.04592308,35.27275,138.6197,0.9940456,3.18806,0.4891967,10.51013
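
The predictor means above are computed with `map_df`. An alternative that stays within dplyr is `summarize` combined with `across`. The sketch below shows that form on a tiny data frame, `toy_train`, built from three alcohol and pH values that appear in the data preview earlier. It is a stand-in for illustration, not the real training set.

```r
library(dplyr)

# Tiny stand-in for the training data (three rows, two predictors)
toy_train <- tibble(
  alcohol = c(9.5, 10.1, 12.8),
  pH = c(3.30, 3.26, 3.34),
  quality = factor(c(6, 6, 7))
)

# Mean of every column except the class label
toy_means <- toy_train |>
  summarize(across(-quality, mean))

toy_means
```

Replacing `toy_train` with `wine_train` should give the same one-row table of means as the `map_df` version.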


In [25]:
sum(is.na(wine_train))
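
The call above counts missing cells across the whole training set. If a per-column or per-row breakdown is wanted, base R offers `colSums` and `complete.cases`. The sketch below shows both on a small made-up data frame (`toy`) that deliberately contains one missing value. Its values are hypothetical.

```r
# Tiny made-up data frame with one missing value (hypothetical values)
toy <- data.frame(alcohol = c(9.5, NA, 12.8), pH = c(3.30, 3.26, 3.34))

# Total number of missing cells, as in the cell above
sum(is.na(toy))

# Number of missing cells in each column
colSums(is.na(toy))

# Number of rows with at least one missing value
sum(!complete.cases(toy))
```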

## Methods:

- Explain how you will conduct either your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?
- Describe at least one way that you will visualize the results

- We will conduct data analysis by....
- The columns we will use for this project are fixed acidity, volatile acidity, residual sugar, and alcohol as our predictors, and quality as our target variable.
- We will visualize the results in scatter and line plots?

## Expected outcomes and significance:

- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?

We expect to find out whether white wine quality can be determined from our chosen predictors. These findings could help white wine manufacturers make better white wine by showing which components affect its quality. They could also help consumers understand which components affect quality when choosing wines to buy. This could lead to questions about whether other components of wine that were not included in the raw data set, such as the type of grapes used, affect white wine quality. It could also lead to questions about whether external factors, such as the environment and the harvesting and fermentation processes, affect white wine quality.