# Introduction
- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal 

- Clearly state the question you will try to answer with your project 

- Identify and describe the dataset that will be used to answer the question

# Methods

- Explain how you will conduct either your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction? 

- Describe at least one way that you will visualize the results

# Expected outcomes and significance

- What do you expect to find? What impact could such findings have? What future questions could this lead to?

# Preliminary exploratory data analysis

- Demonstrate that the dataset can be read from the web into R 

- Clean and wrangle your data into a tidy format 

- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 

- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis).An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis. 


In [17]:
# Libaries
library(tidyverse)
library(dplyr)
librrary(tidymodels)

In [18]:
# Demonstrate that the dataset can be read from the web into R

red_wine_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
white_wine_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

red_data <- read_delim(red_wine_url, delim = ";")
white_data <- read_delim(white_wine_url, delim = ";")

red_data
white_data

Parsed with column specification:
cols(
  `fixed acidity` = [32mcol_double()[39m,
  `volatile acidity` = [32mcol_double()[39m,
  `citric acid` = [32mcol_double()[39m,
  `residual sugar` = [32mcol_double()[39m,
  chlorides = [32mcol_double()[39m,
  `free sulfur dioxide` = [32mcol_double()[39m,
  `total sulfur dioxide` = [32mcol_double()[39m,
  density = [32mcol_double()[39m,
  pH = [32mcol_double()[39m,
  sulphates = [32mcol_double()[39m,
  alcohol = [32mcol_double()[39m,
  quality = [32mcol_double()[39m
)

Parsed with column specification:
cols(
  `fixed acidity` = [32mcol_double()[39m,
  `volatile acidity` = [32mcol_double()[39m,
  `citric acid` = [32mcol_double()[39m,
  `residual sugar` = [32mcol_double()[39m,
  chlorides = [32mcol_double()[39m,
  `free sulfur dioxide` = [32mcol_double()[39m,
  `total sulfur dioxide` = [32mcol_double()[39m,
  density = [32mcol_double()[39m,
  pH = [32mcol_double()[39m,
  sulphates = [32mcol_double()[39m,
  

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.880,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
7.8,0.760,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
11.2,0.280,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.660,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
7.9,0.600,0.06,1.6,0.069,15,59,0.9964,3.30,0.46,9.4,5
7.3,0.650,0.00,1.2,0.065,15,21,0.9946,3.39,0.47,10.0,7
7.8,0.580,0.02,2.0,0.073,9,18,0.9968,3.36,0.57,9.5,7
7.5,0.500,0.36,6.1,0.071,17,102,0.9978,3.35,0.80,10.5,5


fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.50,0.058,47,186,0.9956,3.19,0.40,9.9,6
7.2,0.23,0.32,8.50,0.058,47,186,0.9956,3.19,0.40,9.9,6
8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6
6.2,0.32,0.16,7.00,0.045,30,136,0.9949,3.18,0.47,9.6,6
7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.22,0.43,1.50,0.044,28,129,0.9938,3.22,0.45,11.0,6


In [48]:
# Reading the data using read_delim 
# (using read_csv2 made it so some of the data lost their decimal points, using read_delim worked)
# Changing column names so they have underscores using colnames

red_wine_data <- read_delim("Data/winequality-red.csv", delim = ";")

colnames(red_wine_data) <- c("fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", 
                             "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density", 
                             "ph", "sulfates", "alcohol", "quality")

white_wine_data<-read_delim("Data/winequality-white.csv", delim = ";")

colnames(white_wine_data) <- c("fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", 
                             "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density", 
                             "ph", "sulfates", "alcohol", "quality")
red_wine_data
white_wine_data

Parsed with column specification:
cols(
  `fixed acidity` = [32mcol_double()[39m,
  `volatile acidity` = [32mcol_double()[39m,
  `citric acid` = [32mcol_double()[39m,
  `residual sugar` = [32mcol_double()[39m,
  chlorides = [32mcol_double()[39m,
  `free sulfur dioxide` = [32mcol_double()[39m,
  `total sulfur dioxide` = [32mcol_double()[39m,
  density = [32mcol_double()[39m,
  pH = [32mcol_double()[39m,
  sulphates = [32mcol_double()[39m,
  alcohol = [32mcol_double()[39m,
  quality = [32mcol_double()[39m
)

Parsed with column specification:
cols(
  `fixed acidity` = [32mcol_double()[39m,
  `volatile acidity` = [32mcol_double()[39m,
  `citric acid` = [32mcol_double()[39m,
  `residual sugar` = [32mcol_double()[39m,
  chlorides = [32mcol_double()[39m,
  `free sulfur dioxide` = [32mcol_double()[39m,
  `total sulfur dioxide` = [32mcol_double()[39m,
  density = [32mcol_double()[39m,
  pH = [32mcol_double()[39m,
  sulphates = [32mcol_double()[39m,
  

fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulfates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.880,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
7.8,0.760,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
11.2,0.280,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.660,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
7.9,0.600,0.06,1.6,0.069,15,59,0.9964,3.30,0.46,9.4,5
7.3,0.650,0.00,1.2,0.065,15,21,0.9946,3.39,0.47,10.0,7
7.8,0.580,0.02,2.0,0.073,9,18,0.9968,3.36,0.57,9.5,7
7.5,0.500,0.36,6.1,0.071,17,102,0.9978,3.35,0.80,10.5,5


fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulfates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.50,0.058,47,186,0.9956,3.19,0.40,9.9,6
7.2,0.23,0.32,8.50,0.058,47,186,0.9956,3.19,0.40,9.9,6
8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6
6.2,0.32,0.16,7.00,0.045,30,136,0.9949,3.18,0.47,9.6,6
7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.22,0.43,1.50,0.044,28,129,0.9938,3.22,0.45,11.0,6


In [50]:
#Let's split our data into testing and training data, and look at the training data so we can apply exploraty analysis

red_split<-initial_split(red_wine_data, prop=0.75, strata = quality)
red_train<-training(red_split)
red_test<-testing(red_split)

white_split<-initial_split(white_wine_data, prop=0.75, strata = quality)
white_train<-training(white_split)
white_test<-testing(white_split)

red_train
white_train

fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulfates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.880,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
7.8,0.760,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
11.2,0.280,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.660,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
7.3,0.650,0.00,1.2,0.065,15,21,0.9946,3.39,0.47,10.0,7
7.8,0.580,0.02,2.0,0.073,9,18,0.9968,3.36,0.57,9.5,7
7.5,0.500,0.36,6.1,0.071,17,102,0.9978,3.35,0.80,10.5,5
6.7,0.580,0.08,1.8,0.097,15,65,0.9959,3.28,0.54,9.2,5


fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulfates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.50,0.058,47,186,0.9956,3.19,0.40,9.9,6
8.1,0.28,0.40,6.90,0.050,30,97,0.9951,3.26,0.44,10.1,6
6.2,0.32,0.16,7.00,0.045,30,136,0.9949,3.18,0.47,9.6,6
7.0,0.27,0.36,20.70,0.045,45,170,1.0010,3.00,0.45,8.8,6
6.3,0.30,0.34,1.60,0.049,14,132,0.9940,3.30,0.49,9.5,6
8.1,0.22,0.43,1.50,0.044,28,129,0.9938,3.22,0.45,11.0,6
8.1,0.27,0.41,1.45,0.033,11,63,0.9908,2.99,0.56,12.0,5


In [27]:
#Let's summarize the data by figuring out how many wines are in each class (each quality ranking)
red_wine_quality_count<- red_wine_data %>% 
                       group_by(quality) %>%
                       summarize(count = n())
red_wine_quality_count

white_wine_quality_count<- red_wine_data %>% 
                       group_by(quality) %>%
                       summarize(count = n())
white_wine_quality_count

fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulfates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
7.9,0.350,0.46,3.6,0.078,15,37,0.99730,3.35,0.86,12.8,8
10.3,0.320,0.45,6.4,0.073,5,13,0.99760,3.23,0.82,12.6,8
5.6,0.850,0.05,1.4,0.045,12,88,0.99240,3.56,0.82,12.9,8
12.6,0.310,0.72,2.2,0.072,6,29,0.99870,2.88,0.82,9.8,8
11.3,0.620,0.67,5.2,0.086,6,19,0.99880,3.22,0.69,13.4,8
9.4,0.300,0.56,2.8,0.080,6,17,0.99640,3.15,0.92,11.7,8
10.7,0.350,0.53,2.6,0.070,5,16,0.99720,3.15,0.65,11.0,8
10.7,0.350,0.53,2.6,0.070,5,16,0.99720,3.15,0.65,11.0,8
5.0,0.420,0.24,2.0,0.060,19,50,0.99170,3.72,0.74,14.0,8
7.8,0.570,0.09,2.3,0.065,34,45,0.99417,3.46,0.74,12.7,8


`summarise()` ungrouping output (override with `.groups` argument)



quality,count
<dbl>,<int>
3,10
4,53
5,681
6,638
7,199
8,18
