# Red Wine Quality Classifier
## Introduction:
Red wine is a popular type of wine made from dark-colored grapes, known for its rich flavor and complexity. Wine quality is typically evaluated on factors such as aroma, taste, and acidity. We use the red wine quality dataset from the UCI Machine Learning Repository (Cortez et al., 2009) to address the question: “How can a business predict the quality of the wine it produces from its chemical composition?”


There are 11 input variables in this dataset, based on physicochemical tests, and one output variable:

  1 - **fixed acidity** ((grams of tartaric acid)/dm3)
  
  2 - **volatile acidity** ((grams of acetic acid)/dm3)    
  
  3 - **citric acid** (g/dm3)
  
  4 - **residual sugar** (g/dm3)
  
  5 - **chlorides** ((grams of sodium chloride)/dm3)
  
  6 - **free sulfur dioxide** (mg/dm3)
  
  7 - **total sulfur dioxide** (mg/dm3)
  
  8 - **density** (g/cm3)
  
  9 - **pH**
  
  10 - **sulphates** ((grams of potassium sulphate)/dm3)
  
  11 - **alcohol** (%)
  
  Output variable (based on sensory rating):
  
  12 - **quality** (score between 0 and 10): sensory rating by human experts.




## Preliminary exploratory data analysis: 

In [1]:
install.packages("themis")


In [None]:
install.packages("tidyverse")



In [2]:

library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(themis)
library(GGally)



In [None]:
options(repr.matrix.max.rows = 7)

wine_quality <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", delim = ";")
colnames(wine_quality) <- make.names(colnames(wine_quality))
wine_quality

# Make sure every predictor is numeric, then bin the 0-10 score into three categories
wine_quality <- mutate(wine_quality, across(-quality, as.numeric)) |>
                 mutate(new_quality = case_when(
                             quality <= 4 ~ "bad",
                             between(quality, 5, 6) ~ "moderate",
                             quality >= 7 ~ "good",
                             TRUE ~ "")) |>
                 arrange(quality)
wine_quality
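
As a quick sanity check of the binning rule, the sketch below applies the same thresholds to a few hand-picked scores. It is only an illustration: it uses base R `cut()` as a stand-in for the `case_when()` call, and `bin_quality` and `scores` are hypothetical names that are not part of the analysis.

```r
# Map 0-10 sensory scores to the three labels used in this report.
# Assumption: base R cut() stands in for the case_when() call in the cell above.
bin_quality <- function(quality) {
  as.character(cut(quality,
                   breaks = c(-Inf, 4, 6, Inf),   # (-Inf, 4], (4, 6], (6, Inf)
                   labels = c("bad", "moderate", "good")))
}

scores <- c(3, 4, 5, 6, 7, 8)   # hand-picked example scores
labels <- bin_quality(scores)
print(data.frame(quality = scores, new_quality = labels))
```

Scores of 4 or lower map to "bad", 5 and 6 to "moderate", and 7 or higher to "good".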

### Choosing Predictor Variables

In Figure 1, we created a matrix of pairwise plots for every variable in the original data set, colored by quality category: red for “bad” wine, green for “good” wine, and blue for “moderate” wine. We looked for the variables whose distributions separate the quality categories most distinctly, since these should make good predictors and improve the accuracy of the classifier. From this analysis, we chose volatile acidity, citric acid, sulphates, and alcohol as predictor variables and quality as the response variable. Based on the distribution of the 0-10 quality scores, we treat quality as a factor with three levels: "bad" (4 or lower), "moderate" (5 or 6), and "good" (7 or higher).


In [None]:
# correlations between variables
options(repr.plot.height = 20, repr.plot.width = 21)
library(ggplot2)

wine_quality_factored <- mutate(wine_quality, new_quality = as.factor(new_quality)) |>
                         select(-quality)
wine_quality_factored

wine_pairs <- ggpairs(wine_quality_factored, columns = 1:12, aes(color = new_quality )) +
                ggtitle("Figure 1: Correlations Between Variables") +
                theme(text = element_text(size = 20))
wine_pairs

In [None]:
wine_quality_factored <- select(wine_quality_factored, citric.acid, volatile.acidity, sulphates, alcohol, new_quality)
wine_quality_factored

## Summary Tables
Table 1 shows the average values for each predictor variable (citric acid, volatile acidity, sulphates, and alcohol) by quality. 

## Summary Visualizations
Figure 2: Bar plot of the number of observations for each quality score.

Figure 3: Scatter plot of volatile acidity versus quality, showing a negative correlation between the two variables.

Figure 4: Scatter plot of citric acid content versus quality, showing a positive correlation between the two variables.

Figure 5: Scatter plot of alcohol percentage versus quality, showing a positive correlation between the two variables.

Figure 6: Scatter plot of sulphates versus quality, showing a positive correlation between the two variables.


In [None]:
wine_summary <- group_by(wine_quality, new_quality) |>
                summarize(mean_citric = mean(citric.acid), 
                          mean_acidity = mean(volatile.acidity), 
                          mean_alcohol = mean(alcohol),
                         mean_sulphates = mean(sulphates))
wine_summary

wine_grouped <- group_by(wine_quality, quality) |>
                summarize(observations = n())
wine_grouped

options(repr.plot.width = 10, repr.plot.height = 8)
quality_num_plot <- ggplot(wine_grouped, aes(x = quality, y = observations)) +
                        geom_bar(stat = "identity") +
                        xlab("Quality Based on Sensory Rating (0-10)") +
                        ylab("Number of Observations") + 
                        ggtitle("Figure 2: Quality Based on Sensory Rating") +
                        theme(text = element_text(size=14))
quality_num_plot

acidity_quality_plot <- ggplot(wine_quality, aes(x = volatile.acidity, y = quality)) +
                        geom_point(alpha = 0.3) +
                        labs(x = "Volatile Acidity (g(acetic acid)/dm3)", 
                         y = "Quality Sensory Rating (0-10)", 
                         title = "Figure 3: Volatile Acidity by Quality") +
                        theme(text = element_text(size=14))
acidity_quality_plot

citric_quality_plot <- ggplot(wine_quality, aes(x = citric.acid, y = quality)) +
                        geom_point(alpha = 0.3) +
                        labs(x = "Citric Acid (g/dm3)", 
                         y = "Quality Sensory Rating (0-10)", 
                         title = "Figure 4: Citric Acid by Quality") +
                        theme(text = element_text(size=14))
citric_quality_plot


alc_quality_plot <- ggplot(wine_quality, aes(x = alcohol, y = quality)) +
                        geom_point(alpha = 0.3) +
                        labs(x = "Alcohol Content (%)", 
                         y = "Quality Sensory Rating (0-10)", 
                         title = "Figure 5: Alcohol Percentage by Quality") +
                        theme(text = element_text(size=14))
alc_quality_plot

sulph_quality_plot <- ggplot(wine_quality, aes(x = sulphates, y = quality)) +
                        geom_point(alpha = 0.3) +
                        labs(x = "Sulphate Level (g(potassium sulphate)/dm3)", 
                         y = "Quality Sensory Rating (0-10)", 
                         title = "Figure 6: Sulphate Level by Quality") +
                        theme(text = element_text(size=14))
sulph_quality_plot

## Creating the Model
After wrangling the dataset, we moved on to creating the K-nearest neighbors (KNN) classifier. As Figure 2 shows, the quality scores are roughly bell-shaped, with far more moderate observations than bad or good ones. To rebalance the classes, we added an upsampling step to our recipe.

We split the dataset into a 75% training set and a 25% test set, and use the training set to preprocess the data and create the model specification. Next, we use cross-validation on the training set to find the number of neighbors K that gives the highest estimated prediction accuracy. Lastly, we fit the classifier with that K value and use it to predict the quality of the wines in the test set. Comparing these predictions with the true labels gives metrics that show how accurate the model is.
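
The cross-validation step can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the tidymodels workflow used in this report: the built-in `iris` data stands in for the wine data, `knn_predict` is a hypothetical hand-written classifier rather than the `kknn` engine, the folds are random rather than stratified, and the seed is arbitrary.

```r
# Minimal k-NN classifier in base R (illustrative only; not the kknn engine)
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # Euclidean distance to every training row
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]             # majority vote among the k nearest
  })
}

set.seed(1)                                    # arbitrary seed for reproducibility
x <- as.matrix(iris[, 1:4])                    # stand-in predictors
y <- iris$Species                              # stand-in class labels
folds <- sample(rep(1:5, length.out = nrow(x)))  # assign each row to one of 5 folds
k_vals <- seq(1, 15, by = 2)                   # candidate numbers of neighbors

cv_accuracy <- sapply(k_vals, function(k) {
  fold_acc <- sapply(1:5, function(f) {
    held_out <- folds == f
    # Standardize with statistics from the training folds only
    train_x <- scale(x[!held_out, ])
    valid_x <- scale(x[held_out, ],
                     center = attr(train_x, "scaled:center"),
                     scale = attr(train_x, "scaled:scale"))
    mean(knn_predict(train_x, y[!held_out], valid_x, k) == y[held_out])
  })
  mean(fold_acc)                               # average accuracy over the 5 folds
})

results <- data.frame(neighbors = k_vals, mean_accuracy = cv_accuracy)
best_k <- results$neighbors[which.max(results$mean_accuracy)]
print(results)
print(best_k)
```

Each candidate K is scored only on rows that were held out while fitting, which is what makes the accuracy estimate usable for choosing K.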


In [None]:
ups_recipe <- recipe(new_quality ~ . , data = wine_quality_factored) |> 
    step_upsample(new_quality, over_ratio = 1, skip = FALSE) |>
    prep() 
ups_recipe

In [None]:
upsampled_data <- bake(ups_recipe,  wine_quality_factored)

upsampled_data |>
  group_by(new_quality) |>
  summarize(n = n())
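
To make the effect of upsampling concrete, here is a small base R sketch on hypothetical toy data. It shows one simple way to upsample (keep every original row and add randomly chosen duplicates until each class matches the largest one); the internals of `step_upsample()` may differ in detail, and `toy`, `upsample`, and the seed are assumptions made for illustration.

```r
set.seed(42)   # arbitrary seed for reproducibility

# Hypothetical imbalanced data: 2 "bad", 8 "moderate", and 3 "good" rows
toy <- data.frame(
  id = 1:13,
  new_quality = factor(rep(c("bad", "moderate", "good"), times = c(2, 8, 3)))
)

upsample <- function(df, class_col) {
  groups <- split(df, df[[class_col]])
  target <- max(sapply(groups, nrow))          # size of the largest class
  balanced <- lapply(groups, function(g) {
    # Draw the missing rows with replacement from the class itself
    extra <- g[sample(nrow(g), target - nrow(g), replace = TRUE), ]
    rbind(g, extra)                            # originals plus duplicated rows
  })
  do.call(rbind, balanced)
}

toy_up <- upsample(toy, "new_quality")
print(table(toy$new_quality))      # class counts before upsampling
print(table(toy_up$new_quality))   # class counts after upsampling
```

After upsampling every class has as many rows as the largest class, but the minority classes contain no new information, only repeated rows.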

In [None]:
data_split <- initial_split(upsampled_data, prop = 0.75, strata = new_quality) 
data_train <- training(data_split)
data_test <- testing(data_split)

In [None]:
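# Preprocessing: standardize all predictors so each contributes equally to distances
my_recipe <- recipe(new_quality ~ . , data = data_train) |>
    step_scale(all_predictors()) |> 
    step_center(all_predictors()) 

# Model specification: KNN with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func =  "rectangular", neighbors = tune()) |> 
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation, stratified by quality category
my_vfold <- vfold_cv(data_train, v = 5, strata = new_quality)

# Candidate K values: odd numbers from 1 to 49
k_vals <- tibble(neighbors = seq(from = 1, to = 50, by = 2))

# Tune K by cross-validation and collect the resulting metrics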
knn_fit <- workflow() |>
    add_recipe(my_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = my_vfold, grid = k_vals) |>
    collect_metrics()
knn_fit


In [None]:
# Cross-validation accuracy for each K, sorted from highest to lowest
df_knn_fit <- knn_fit |>
    filter(.metric == "accuracy") |>
    arrange(desc(mean))
df_knn_fit

## Accuracy Visualization:
Figure 7: A line plot of the cross-validation accuracy estimate against the number of neighbors K. This plot helps us choose the K value to use in the following steps.


In [None]:
# Second specification, using the number of neighbors chosen above
knn_spec2 <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit2 <- workflow() |>
    add_recipe(my_recipe) |>
    add_model(knn_spec2) |>
    fit(data = data_train)
knn_fit2

In [None]:
# Test the classifier: predict the quality category of each wine in the test set
data_test_predictions <- predict(knn_fit2, data_test) |>
    bind_cols(data_test) 
data_test_predictions

In [None]:
# Compare the predictions to the true values in the test set and keep the accuracy
accuracies <- data_test_predictions |> 
    metrics(truth = new_quality, estimate = .pred_class) |> 
    select(.metric, .estimate) |> 
    filter(.metric == "accuracy")
accuracies

# Compare the predictions to the true values in a confusion matrix
wine_cm <- data_test_predictions |> 
    conf_mat(truth = new_quality, estimate = .pred_class)
wine_cm
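
The link between a confusion matrix and the accuracy metric can be illustrated with base R. The sketch below uses hypothetical labels (`truth` and `estimate` are made up for illustration, not taken from the wine data) and follows the same layout as `conf_mat()`, with predictions in rows and true classes in columns.

```r
lvls <- c("bad", "moderate", "good")

# Hypothetical true labels and predictions for eight wines
truth <- factor(c("bad", "bad", "moderate", "moderate", "moderate",
                  "good", "good", "good"), levels = lvls)
estimate <- factor(c("bad", "moderate", "moderate", "moderate", "good",
                     "good", "good", "good"), levels = lvls)

# Confusion matrix: rows are predictions, columns are true classes
cm <- table(Prediction = estimate, Truth = truth)

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions divided by the true count of each class
recall <- diag(cm) / colSums(cm)

print(cm)
print(accuracy)
print(recall)
```

With imbalanced classes, per-class recall can be more informative than overall accuracy, because a classifier can score well overall while missing most of a rare class.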

In [None]:
# Cross-validation accuracy for each value of K
cv_accuracies <- knn_fit |>
                 filter(.metric == 'accuracy')

# Visualization of K values against accuracy, used to choose the best K
k_accuracy_plot <- ggplot(cv_accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(x = 'Neighbors', y = 'Accuracy Estimate',
                       title = 'Figure 7: Accuracy Estimate by Number of Neighbors') +
                  theme(text = element_text(size = 20)) 
k_accuracy_plot 
                   

In [None]:
# loading necessary libraries

library(yardstick)
library(ggplot2)

options(repr.plot.width = 10, repr.plot.height = 8)
cf_plot <- autoplot(wine_cm, type = "heatmap") +
    scale_fill_gradient(low = "light gray", high = "green") +
    theme(text = element_text(size = 20))
cf_plot

***Apply the classifier to the whole original data set, which is not upsampled, to check whether the test outcome is accurate (because the test data set is also upsampled).***

In [None]:
data_test_predictions <- predict(knn_fit2, wine_quality_factored) |>
    bind_cols(wine_quality_factored) 
data_test_predictions

In [None]:
# Compare the predictions to the true values in the original data set and keep the accuracy
accuracies <- data_test_predictions |> 
    metrics(truth = new_quality, estimate = .pred_class) |> 
    select(.metric, .estimate) |> 
    filter(.metric == "accuracy")
accuracies

# Compare the predictions on the original data set to the true values in a confusion matrix
wine_cm <- data_test_predictions |> 
    conf_mat(truth = new_quality, estimate = .pred_class)
wine_cm

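**The classifier still makes good predictions on the original data set, which is not upsampled, so we think the test result is accurate. Note, however, that many rows of the original data set were also used for training, so this check is not fully independent.**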

## Discussion:

Overall, our final classifier reaches an accuracy of around 95% on the test data.
In the k-nearest neighbors algorithm, the number of neighbors is an important hyperparameter: it determines how many of the nearest training samples are used for each prediction. With the number of neighbors set to 1, the model considers only the single nearest training sample. In that case the model can become overly complex and overly sensitive to the training data, leading to overfitting.
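
The sensitivity of K = 1 to the training data can be seen in a small base R sketch. Everything here is an assumption made for illustration: the two-class data are simulated, `knn_predict` is a hypothetical hand-written classifier rather than the `kknn` engine, and the seed is arbitrary.

```r
# Minimal k-NN classifier in base R (illustrative only; not the kknn engine)
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # Euclidean distance to every training row
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]             # majority vote among the k nearest
  })
}

set.seed(2023)   # arbitrary seed so the sketch is reproducible

# Hypothetical two-class data whose classes overlap
make_data <- function(n) {
  label <- factor(rep(c("a", "b"), each = n / 2))
  shift <- ifelse(label == "a", 0, 1)          # class "b" is shifted by one unit
  x <- cbind(x1 = rnorm(n, mean = shift), x2 = rnorm(n, mean = shift))
  list(x = x, y = label)
}
train <- make_data(100)
test <- make_data(100)

# Accuracy of a model fitted on `train` when predicting the rows of `new`
accuracy <- function(k, new) {
  mean(knn_predict(train$x, train$y, new$x, k) == new$y)
}

results <- data.frame(
  neighbors = c(1, 15),
  train_accuracy = c(accuracy(1, train), accuracy(15, train)),
  test_accuracy = c(accuracy(1, test), accuracy(15, test))
)
print(results)
```

With K = 1 every training point is its own nearest neighbor, so the training accuracy is perfect regardless of how well the model generalizes. Only the accuracy on data the model has not seen is informative.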

In addition, upsampling duplicates samples from the minority classes to increase their weight during training. This may cause the model to overfit the minority classes in the training data, which reduces generalization performance on unseen data. Because we upsampled before splitting the data, copies of the same observation can also appear in both the training and test sets, which can inflate the test accuracy.

Therefore, combining a number of neighbors of 1 with upsampling may give good performance on the training data but poor performance on truly unseen data, which indicates overfitting. To mitigate this issue, one can try a larger number of neighbors, and use model evaluation techniques such as cross-validation on data that was split before upsampling, so that hyperparameters are selected on observations the model has not seen.
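
The interaction between upsampling and splitting can be checked on hypothetical toy data. The base R sketch below is a simplification under stated assumptions: upsampling recycles each class's rows in order instead of sampling them at random, and a fixed every-fourth-row split stands in for the random `initial_split()`, so that the counts are easy to verify by hand. The names `toy`, `upsample`, and `split_rows` are illustrative only.

```r
# Hypothetical imbalanced data: 2 "bad", 8 "moderate", and 3 "good" rows
toy <- data.frame(
  id = 1:13,
  new_quality = factor(rep(c("bad", "moderate", "good"), times = c(2, 8, 3)),
                       levels = c("bad", "moderate", "good"))
)

# Deterministic upsampling: recycle each class's rows up to the largest class size
upsample <- function(df) {
  groups <- split(df, df$new_quality)
  target <- max(sapply(groups, nrow))
  do.call(rbind, lapply(groups, function(g) {
    g[rep_len(seq_len(nrow(g)), target), ]
  }))
}

# Deterministic 75/25 split: every fourth row goes to the test set
split_rows <- function(df) {
  is_test <- seq_len(nrow(df)) %% 4 == 0
  list(train = df[!is_test, ], test = df[is_test, ])
}

# (a) Upsample first, then split
a <- split_rows(upsample(toy))
leaked_a <- sum(a$test$id %in% a$train$id)   # test rows that are copies of training rows

# (b) Split first, then upsample only the training rows
b <- split_rows(toy)
b$train <- upsample(b$train)
leaked_b <- sum(b$test$id %in% b$train$id)

print(c(upsample_then_split = leaked_a, split_then_upsample = leaked_b))
```

In this toy case, 4 of the 6 test rows in (a) are copies of training rows, while (b) has none. With one nearest neighbor, such a copy is classified correctly by construction, which is one way the test accuracy can be inflated.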






**Impact of our findings:**

Based on our current findings, businesses can use our model to classify the quality of a new wine as good, moderate, or bad from its chemical composition. Producers who already have an estimate of a wine's quality can also use the model to check how accurate that estimate is.

**Future Questions:**

From our analysis, some future questions that we might consider are:

“How can the results of this study affect wine manufacturing processes to ensure higher quality?”

“How can the results of this analysis affect pricing strategies for wine?”



## Citations:
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016 

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
