#### Title: Classifying Red Wine Quality - Group 21

#### Introduction

"Vinho Verde" is a Portuguese wine with three main variants, of which we will be considering the red. In the Kaggle dataset we are using from UCI Machine Learning, each wine is described by 11 physicochemical variables, such as acidity, chlorides, and density, along with a quality rating on a scale from 0 to 10. We will conduct our data analysis using three predictors from our dataset: fixed acidity, alcohol content, and pH.

Our ultimate goal is to create a K-nearest neighbors classifier (built from our dataset) to classify the quality of various red wines at different price points that are sourced from outside of the dataset. This will help us determine whether or not differences in the quality of the wines generally correspond to differences in their price. We can rephrase this as a predictive research question: Given a K-nearest neighbors classifier, does a difference in the predicted quality of a red wine reflect a difference in its price?

#### Methodology

We decided on the three predictors outlined above because this information is readily available online; many red wine companies do not share the finer contents of their wines, such as sulfur dioxide, density, and volatile and citric acids. However, alcohol content, fixed acidity, and pH are far more easily found and are therefore the most practical predictors for answering our question, which is especially important given that we are sourcing wines from the internet.

Since this is a classification task, we will build a classifier to categorize wines as either good or poor in quality, using the K-nearest neighbors classification algorithm. The classifier will be trained to recognize the quality of wine based on the three predictors outlined above. We will define good quality wine as having a quality rating over _6.5_ and anything below as poor quality - a threshold set by the creator of the dataset. The good quality wines will be assigned _1_ and the poor quality wines will be assigned _0_. This produces a binary classifier rather than a multiclass one, creating a more concrete distinction between the qualities of the wines and avoiding an unnecessarily complex model that may have lower accuracy. Additionally, since each variable has a small range of values, it would be difficult to accurately classify each wine into one of many specific quality ratings. A binary classifier reduces this problem, since there are only two broad classes to consider.

In our preliminary data visualization, we will use a ggpairs matrix to examine the general relationship between the wine class (good or poor) and our chosen predictors, giving us an idea of what to expect from our classifier. The plot will help us see the typical values of each predictor within each class. When we build the classifier, we will tune the number of neighbors and create a line plot to show the value of *k* at which the classifier reaches its highest accuracy. These are the visuals to be expected.

#### Preliminary exploratory data analysis

In [None]:
install.packages("themis")
install.packages("GGally")

library(dplyr)
library(tidyverse)
library(tidymodels)

library(themis)
library(ggplot2)
library(GGally)

After loading the necessary libraries, we read in the red wine data, using the "mutate" and "ifelse" functions together to create a new column distinguishing the wines as either good or poor. The original dataset sets the threshold for a good wine at >6.5 and we followed this in our code. We called this new column "new_quality" to make it distinct from the existing "quality" column; it contains only 0s and 1s to indicate the binary wine quality. We then replaced the spaces in the column names with periods so that the columns can be referenced easily later, tidying the data. Finally, to examine total acidity in our exploratory plots, we used the mutate function to add together the fixed and volatile acidity columns.

In [None]:
options(repr.matrix.max.rows = 6)

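red_wine <- read_csv("data/winequality-red.csv") |>
    # label wines rated above 6.5 as good (1) and all others as poor (0)
    mutate(new_quality = ifelse(quality < 6.5, 0, 1)) |>
    mutate(new_quality = as_factor(new_quality))

# replace spaces in the column names with periods (e.g. fixed.acidity)
colnames(red_wine) <- make.names(colnames(red_wine))

# total acidity = fixed acidity + volatile acidity
red_wine <- red_wine |>
    mutate(total.acidity = fixed.acidity + volatile.acidity)
red_wine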

Before we build our classifier, we can visualize and tabulate the original data to extract more information from it. We chose to create a simple ggpairs matrix to observe the relationships between our chosen predictors and the binary wine quality at the same time. Since we are interested in how the quality varies with each predictor, we focused on the rightmost four graphs in the matrix (3 boxplots and 1 bar graph).

Observing the graph of new quality vs. total acidity, we noticed that wines in the "good" class (class 1) tended to have a higher total acidity. In the graphs for the other two predictors, good quality wines tended to have a higher alcohol content and a lower pH. However, the differences between the two classes are small, suggesting that the classifier may have difficulty distinguishing between qualities given the narrow range of alcohol, acidity, and pH values.

The proportion of good wines to poor wines can also be seen in the bottom right bar graph, which shows that good wine is a minority class. This indicates that the data will have to be balanced by upsampling at some point in the analysis to avoid the classifier being biased towards the majority class - poor quality wine - when making predictions on the testing data.

In [None]:
options(repr.plot.height = 12, repr.plot.width = 12)

my_ggpairs <- ggpairs(red_wine, columns = c("total.acidity", "pH", "alcohol", "new_quality"))
my_ggpairs

Following the visualization, we created three more informative tibbles. The first shows the mean value of each variable which, though not directly relevant to our question, shows that the predictors are on different scales and will need to be standardized in the analysis to create a reliable classifier. The second tibble shows the exact proportion of poor to good wines, with good wines making up only 13.5% of the dataset - a further indication that we should balance it in the analysis. Finally, the third tibble confirms that there are no missing values in the dataset. This preliminary exploration of the data informs our decisions in the analysis portion of the project.

In [None]:
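# counts and proportions of poor (0) and good (1) wines
red_wine_quality_count <- red_wine |>
    group_by(new_quality) |>
    summarize(count = n()) |>
    mutate(proportion = count / sum(count))

# total number of missing values across all columns
missing_data <- tibble(missing_values = sum(is.na(red_wine)))

# mean of each physicochemical variable
red_wine_mean <- red_wine |>
    select(-quality) |>
    summarize(across(fixed.acidity:alcohol, mean))

red_wine_mean
red_wine_quality_count
missing_data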

To train the K-nearest neighbors classifier, we split the mutated red wine data into 75% training and 25% testing data. We stratified the split on new_quality, as this is the class we will be predicting, so that both sets contain a similar proportion of good and poor wines. To make our code reproducible, we set a seed at the beginning, so that the random split is the same each time the code is run.

In [None]:
set.seed(2020)
red_wine_split <- initial_split(red_wine, prop = 0.75, strata = new_quality)

red_wine_train <- training(red_wine_split)

red_wine_test <- testing(red_wine_split)
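
To illustrate what stratifying on the class does, here is a minimal base R sketch that samples 75% of the rows within each class of a small made-up label vector. The helper `stratified_split` and the toy labels are hypothetical and are not part of our analysis; `initial_split` takes care of stratification itself, and its internal procedure may differ from this sketch.

```r
# Hypothetical helper: sample a proportion of the row indices within each class.
stratified_split <- function(labels, prop = 0.75) {
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # index into idx so that a class with a single row is handled safely
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test = setdiff(seq_along(labels), train_idx))
}

set.seed(2020)
# Toy labels (made up): 80 poor (0) and 20 good (1) wines.
toy_labels <- factor(c(rep(0, 80), rep(1, 20)))
toy_split <- stratified_split(toy_labels)

# Class proportions in the training and testing parts
prop.table(table(toy_labels[toy_split$train]))
prop.table(table(toy_labels[toy_split$test]))
```

Because the sampling is done separately within each class, both parts keep the same share of good wines as the full toy data, which is what we want when one class is rare.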

With the training and testing data in hand, we created a recipe for the training data with the three predictors outlined in the introduction - alcohol, pH, and fixed acidity. We scaled and centered all the predictors in the recipe to ensure that every predictor has an equal influence on the model. We also chose to upsample the underrepresented "good" quality wines, as they comprised only 13.5% of the original data. This is done to prevent the classifier from being biased towards the overrepresented "poor" quality wines and making inaccurate predictions.

We then created a K-nearest neighbors model, specifying the task to be classification, and set the neighbors to tune(), since we needed to optimize the value of *k* that we would use in the final model. The numbers of neighbors considered in the cross-validation are the integers from 1 to 10.

In [None]:
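set.seed(2000)

# scale and center the predictors, then upsample the minority ("good") class
wine_recipe <- recipe(new_quality ~ alcohol + pH + fixed.acidity, data = red_wine_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) |>
    step_upsample(new_quality, over_ratio = 1, skip = FALSE)

# apply the recipe to the training data to obtain the standardized, upsampled version
wine_upscaled <- wine_recipe |>
    prep() |>
    bake(red_wine_train)

# recipe defined on the preprocessed training data (no further steps)
wine_recipe_upscaled <- recipe(new_quality ~ alcohol + pH + fixed.acidity, data = wine_upscaled)

# K-nearest neighbors model with the number of neighbors left to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 10-fold cross-validation on the training data, stratified by class
wine_vfold <- vfold_cv(red_wine_train, v = 10, strata = new_quality)

# candidate values of k: 1 to 10
kvals <- tibble(neighbors = seq(1, 10))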
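
To make the two preprocessing ideas above concrete, here is a minimal base R sketch of standardizing predictors and upsampling a minority class to a 1:1 ratio. The toy data frame and its values are made up for illustration, and the sketch only approximates what `step_scale`, `step_center`, and `step_upsample` do; it is not the code used in our analysis.

```r
# Toy training data (made-up values): 6 poor (0) and 2 good (1) wines.
toy_train <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 9.5, 10.5, 9.2, 12.0, 12.8),
  pH = c(3.51, 3.20, 3.26, 3.40, 3.35, 3.30, 3.15, 3.22),
  new_quality = factor(c(0, 0, 0, 0, 0, 0, 1, 1))
)

# Standardize: subtract each column's mean and divide by its standard deviation.
predictors <- c("alcohol", "pH")
toy_scaled <- toy_train
toy_scaled[predictors] <- lapply(toy_train[predictors],
                                 function(x) (x - mean(x)) / sd(x))

# Upsample: resample minority rows with replacement until the classes are balanced.
set.seed(2000)
counts <- table(toy_scaled$new_quality)
minority <- names(counts)[which.min(counts)]
minority_rows <- which(toy_scaled$new_quality == minority)
n_extra <- max(counts) - min(counts)
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
toy_upsampled <- rbind(toy_scaled, toy_scaled[extra_rows, ])

table(toy_upsampled$new_quality)
```

After standardizing, each predictor has mean 0 and standard deviation 1, so no single predictor dominates the distance calculation; after upsampling, both classes have the same number of rows.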

With the recipe and model created, we passed them into a workflow and estimated the mean cross-validation accuracy for each number of neighbors. To determine the optimal *k*, we plotted the mean accuracy against the number of neighbors and identified this value to be 7, as seen below.

In [None]:
set.seed(2000)
options(repr.plot.width = 6, repr.plot.height = 6)

wine_results <- workflow() |>
    add_recipe(wine_recipe_upscaled) |>
    add_model(knn_tune) |>
    tune_grid(resamples = wine_vfold, grid = kvals) |>
    collect_metrics() |>
    filter(.metric == "accuracy")|>
    arrange(desc(mean))

accuracy_versus_k <- ggplot(wine_results, aes(x = neighbors, y = mean))+
       geom_point() +
       geom_line() +
       labs(x = "Neighbors", y = "Accuracy Estimate")

accuracy_versus_k
wine_results
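
To show why the number of neighbors needs tuning, here is a minimal base R sketch of K-nearest neighbors classification by majority vote. The helper `knn_predict` and the toy points are hypothetical and made up for illustration; the kknn engine used in our analysis may differ in its details.

```r
# Hypothetical helper: classify one new point by majority vote among its
# k nearest training points (Euclidean distance, equal weights).
knn_predict <- function(train_x, train_y, new_x, k) {
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy standardized training points (made-up values): 4 poor (0), 2 good (1).
train_x <- data.frame(
  alcohol = c(-1.0, -0.8, -0.5, 0.2, 1.0, 1.2),
  pH = c(0.5, 0.1, -0.2, 0.3, -0.4, -0.6)
)
train_y <- factor(c(0, 0, 0, 0, 1, 1))
new_wine <- c(alcohol = 0.6, pH = 0.0)

# The predicted class for the same wine depends on k.
sapply(c(1, 3, 5), function(k) knn_predict(train_x, train_y, new_wine, k))
```

In this toy example the prediction is poor for *k* = 1, good for *k* = 3, and poor again for *k* = 5, which is why we compare several values of *k* using cross-validation rather than fixing one in advance.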

With our optimal *k*, we recreated our classification model and combined it with the recipe in a workflow, which we fit to our training data. The testing data was then used to generate accuracy metrics for the classifier, along with a confusion matrix. From this, we found that our final classifier has an accuracy of 90% using *k* = 7.

In [None]:
# final model using k = 7, the value selected during tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
    set_engine("kknn") |>
    set_mode("classification")

wine_fit <- workflow() |>
       add_recipe(wine_recipe) |>
       add_model(knn_spec) |>
       fit(data = red_wine_train)

wine_test_predictions <- predict(wine_fit, red_wine_test) |>
       bind_cols(red_wine_test) |>
       metrics(truth = new_quality, estimate = .pred_class) 

wine_mat <- predict(wine_fit, red_wine_test) |>
       bind_cols(red_wine_test) |> 
       conf_mat(truth = new_quality, estimate = .pred_class)

wine_test_predictions
wine_mat
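
As a rough illustration of how accuracy relates to a confusion matrix, the base R sketch below builds one from made-up truth and prediction vectors and compares the accuracy with the share of the majority class. The values are hypothetical and are not our results.

```r
# Toy truth and predictions (made-up values): 1 = good, 0 = poor.
truth <- factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true classes.
conf <- table(predicted = predicted, truth = truth)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(conf)) / sum(conf)

# Share of good wines that were correctly identified as good.
recall_good <- conf["1", "1"] / sum(conf[, "1"])

# Accuracy obtained by always predicting the majority class.
majority_baseline <- max(table(truth)) / length(truth)

conf
c(accuracy = accuracy, recall_good = recall_good, baseline = majority_baseline)
```

When one class is rare, as good wines are in our data, it can help to read the accuracy alongside the confusion matrix and the majority-class share, since a classifier that mostly predicts the majority class can still score a high accuracy.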

Using our knowledge of classification, we built a K-nearest neighbors model that can classify wines into two categories - good or poor - based on their pH, alcohol content, and fixed acidity. With this model, it is now possible to predict the quality of red wines sourced from outside of our dataset. For variety, we selected cheap, medium-priced, and expensive wines from a single website, recording their predictor values in a new tibble for each wine. We then passed these new observations to the predict function along with our fitted model.

In [None]:
cheap_wine <- tibble(pH = 3.58,
                     fixed.acidity = 6.3,
                     alcohol = 13.9)

cheap_wine_2 <- tibble(pH = 3.77,
                       fixed.acidity = 5.3,
                       alcohol = 13.5)

cheap_wine_3 <- tibble(pH = 3.76,
                       fixed.acidity = 5.3,
                       alcohol = 14.5)

medium_wine <- tibble(pH = 3.7,
                      fixed.acidity = 7.07,
                      alcohol = 14.5)

exp_wine <- tibble(pH = 3.62,
                   fixed.acidity = 6.9,
                   alcohol = 13.5)

cheap_wine_predict <- predict(wine_fit, cheap_wine)
cheap_wine_2_predict <- predict(wine_fit, cheap_wine_2)
medium_wine_predict <- predict(wine_fit, medium_wine)
exp_wine_predict <- predict(wine_fit, exp_wine)
cheap_wine_3_predict <- predict(wine_fit, cheap_wine_3)

cheap_wine_predict
medium_wine_predict
exp_wine_predict
cheap_wine_2_predict
cheap_wine_3_predict
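
As an alternative layout, the new wines could be collected in a single data frame with a label for the price tier, which makes the predictions easier to compare side by side. This is only a sketch: the `price_tier` column is a hypothetical addition, the predictor values are the ones listed above, and the commented-out line assumes that the fitted workflow `wine_fit` is available and ignores the extra column.

```r
# All five new wines in one data frame; price_tier is a hypothetical label column.
new_wines <- data.frame(
  price_tier = c("cheap", "cheap", "cheap", "medium", "expensive"),
  pH = c(3.58, 3.77, 3.76, 3.70, 3.62),
  fixed.acidity = c(6.3, 5.3, 5.3, 7.07, 6.9),
  alcohol = c(13.9, 13.5, 14.5, 14.5, 13.5)
)

# With the fitted workflow available, the predictions could be attached in one call
# (assumption: wine_fit accepts a data frame with an extra, unused column):
# cbind(new_wines, predict(wine_fit, new_wines))

new_wines
```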

#### Methods

We will conduct our data analysis using three predictors from our dataset: fixed acidity, alcohol content, and pH. We decided on these three predictors because this information is readily available online; many red wine companies do not share the finer contents of their wines, such as sulfur dioxide, density, and volatile and citric acids. However, alcohol content, fixed acidity, and pH are far more easily found and are therefore the most practical predictors for answering our question.

To visualize our data, we have created three histograms showing the distribution of wine quality according to each predictor. We have done this to focus on the effect of the predictors on the quality of wine. We will use this visualization to understand which values of each predictor (alcohol content, pH, and fixed acidity) correspond to a higher quality rating. These graphs hint at how the classifier we train will end up classifying wines as either higher or lower quality, and where the ranges of the predictors fall for higher quality wines.

#### Expected Outcomes & Significance

We expect to predict the quality of wine using the classifier we have created. Taking the Vinho Verde dataset, in which quality is rated from 0 to 10, we will use our classifier to distinguish between the various wines in the sample based on three predictors.

If the classifier successfully identifies the quality of red wine samples from outside the dataset, and that quality relates to their prices, it would demonstrate an efficiency and accuracy that could be useful for data beyond ours. This could be a pioneering method in the wine industry that can be utilized commercially for pricing wine based on physical quality, or allow wine connoisseurs to assess wines digitally.

One question emerging from our results could be whether K-nearest neighbor classification could change the way the wine industry prices new wines. Should this classifier become a routine part of the industry, would it become a quicker way to distinguish between wine qualities and price them appropriately? Additionally, could current wine prices change?
