Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
510 lines (373 sloc) 17.2 KB
Red Wine Analysis by Kyle Santana
========================================================
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2); theme_set(theme_bw())
library(tseries)
library(readr)
library(dplyr)
library(lmtest)
library(ggcorrplot)
library(ggthemes)
```
```{r echo=FALSE, Loading_Data}
# Loading Data
redwine <- read.csv('wineQualityReds.csv')
# Importing the red wine database
# Removing the X column
redwine$X <- NULL
```
```{r echo=FALSE, Loading_Titles}
fixed_acidity_title <- "Fixed Acidity (g/dm^3)"
volatile_acidity_title <- "Volatile Acidity (g/dm^3)"
citric_acid_title <- "Citric Acid (g/dm^3)"
sugar_title <- "Residual Sugar (g/dm^3)"
chlorides_title <- "Chlorides (g/dm^3)"
free_sulfur_title <- "Free Sulfur Dioxide (mg/dm^3)"
total_sulfur_title <- "Total Sulfur Dioxide (mg/dm^3)"
density_title <- "Density (g/dm^3)"
sulphates_title <- "Sulphates (g/dm^3)"
alcohol_title <- "Alcohol (%)"
```
> **Introduction**:
For this study I will analyze a Red Wine dataset created by
Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and
Jose Reis (CVRVV) @ 2009. This data set contains the following input variables:
fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free
sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and the
output variable quality.
A descrition of the variables are below:
1 - Fixed Acidity: Most acids involved with wine or fixed or nonvolatile
(do not evaporate readily) (tartaric acid - g/dm^3)
2 - Volatile Acidity: The amount of acetic acid in wine, which at too high of
levels can lead to an unpleasant, vinegar taste. (acetic acid - g/dm^3)
3 - Citric Acid: Found in small quantities, citric acid can add 'freshness'
and flavor to wines. (g/dm^3)
4 - Residual Sugar: The amount of sugar remaining after fermentation stops,
it's rare to find wines with less than 1 gram/liter and wines with greater
than 45 grams/liter are considered sweet. (g/dm^3)
5 - Chlorides: The amount of salt in the wine. (sodium chloride - g/dm^3)
6 - Free Sulfur Dioxide: The free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
growth and the oxidation of wine. (mg/dm^3)
7 - Total Sulfur Dioxide: Amount of free and bound forms of S02; in low
concentrations, SO2 is mostly undetectable in wine, but at free SO2
concentrations over 50 ppm, SO2 becomes evident in the nose and taste of
wine. (mg/dm^3)
8 - Density: The density of water is close to that of water depending on the
percent alcohol and sugar content. (g/cm^3)
9 - pH: Describes how acidic or basic a wine is on a scale from 0 (very
acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - Sulphates: A wine additive which can contribute to sulfur dioxide gas
(S02) levels, wich acts as an antimicrobial and antioxidant. (g/dm3)
11 - Alcohol: The percent alcohol content of the wine
Output variable (based on sensory data):
12 - Quality (Score between 0 and 10)
The goal of the project is to explore the data and see what inferences can be
drawn from how the variables interact with each other.
# Univariate Plots Section
```{r echo=FALSE, Univariate_Plot}
print("Looking at structure of the data")
str(redwine)
```
Looking at structure of the data
```{r echo=FALSE, Univariate_Plot2}
ggplot(aes(quality, fill = "red"), data = redwine) +
geom_bar() +
theme(plot.title = element_text(hjust = 0.5) ,
legend.position="none") +
labs(title = 'Histogram of Wine Quality',
x = 'Quality of Wine',
y = 'Number of Wines')
```
This barplot shows that most of the wines are rated either a 5 or 6. This plot
helped me to see how the wine ratings are grouped together. For example we can
see that there are really no wines that are rated 1,2,9, or 10.
```{r echo=FALSE}
rating_terms <- c("Good", "Normal", "Poor")
redwine$quality.rating <- ifelse(redwine$quality < 5, "Poor",
ifelse(redwine$quality < 7, "Normal", "Good"))
redwine$quality.rating <- factor(redwine$quality.rating,
levels=rating_terms, ordered=TRUE)
head(redwine$quality.rating, n=10)
#Resources: https://stackoverflow.com/questions/23396591/factors-ordered-vs-
#levels
```
Created a new quality rating column that categorizes a rating of 1-4 as Poor
5-6 as Normal and 7-10 as Good
```{r echo=FALSE}
# Summary of the information of each column in the Red Wine data set
summary(redwine)
```
Most of the variables have a similar median and mean which
would lead me to believe that their should be a symetrical distribution.
```{r echo=FALSE}
ggplot(redwine,aes(fixed.acidity, fill = "red")) +
geom_histogram(binwidth = .30) +
ggtitle('Fixed Acidity') +
xlab(fixed_acidity_title) +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(volatile.acidity, fill = "red")) +
geom_histogram(binwidth = .010) +
ggtitle('Volatile Acidity') +
xlab(volatile_acidity_title) +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(citric.acid, fill = "red")) +
geom_histogram(binwidth = .05) +
ggtitle('Citric Acid') +
xlab('Citric Acid') +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(residual.sugar, fill = "red")) +
geom_histogram(binwidth = .2) +
ggtitle('Fixed Acidity') +
xlab('Fixed Acidity') +
ylab('Number of Wines') +
scale_x_log10() +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(chlorides, fill = "red")) +
geom_histogram(binwidth = NULL) +
ggtitle('Chlorides') +
xlab('Chlorides') +
ylab('Number of Wines') +
scale_x_log10() +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(free.sulfur.dioxide, fill = "red")) +
geom_histogram(binwidth = .30) +
ggtitle('Free Sulfur Dioxide') +
xlab('Free Sulfur Dioxide') +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(total.sulfur.dioxide, fill = "red")) +
geom_histogram(binwidth = 30) +
ggtitle('Total Sulfur Dioxide') +
xlab('Total Sulfur Dioxide') +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(density, fill = "red")) +
geom_histogram(binwidth = NULL) +
ggtitle('Density') +
xlab('Density') +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(pH, fill = "red")) +
geom_histogram(binwidth = NULL) +
ggtitle('pH') +
xlab('pH') +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(sulphates, fill = "red")) +
geom_histogram(binwidth = NULL) +
ggtitle('Sulphates') +
xlab('Sulphates') +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
ggplot(redwine,aes(alcohol, fill = "red")) +
geom_histogram(binwidth = NULL) +
ggtitle('Alcohol (%)') +
xlab(alcohol_title) +
ylab('Number of Wines') +
theme(plot.title = element_text(hjust = 0.5) , legend.position="none")
barplot(table(redwine$quality.rating),
space = 0, # set space between bars to zero
ylab = "Number of Wines", main = "Quality Rating",
border="black", col="red",las=1)
```
I wanted to view the histograms of all the variable to help mee determine which
one may be good to analyze. During the process I determined that some of the
variables needed to be transformed to be able to fit into a more normal
distribution.
# Univariate Analysis
### What is the structure of your dataset?
There are 1599 observation and 12 attributes in this data set, the variables
are numeric.
Other observations include:
Most of the wines have a quality rating of 5 or 6 on the scale of 0-10.
Most of the wines have pH ranging between 3.2 and 3.4
### What is/are the main feature(s) of interest in your dataset?
The main feature of interest for me is quality. I would like to know what
variables will likely lead to a better quality wine.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
Sugar and alcohol content are figures that I would guess would be important in
the investigation.
### Did you create any new variables from existing variables in the dataset?
I created a new quality rating column that categorizes a rating of 1-4 as Poor
5-6 as Normal and 7-10 as Good.
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the \
form of the data? If so, why did you do this?
I removed the first row as it was an index column and was not needed.
# Bivariate Plots Section
```{r echo=FALSE, Bivariate_Plot1}
ggplot(redwine, aes(x = alcohol, y = quality)) +
geom_jitter(alpha = 0.5) +
geom_smooth(method = "lm") +
ggtitle('Quality vs Alcohol (%)') +
ylab('Quality') +
xlab(alcohol_title) +
theme(plot.title = element_text(hjust = 0.5))
#Resources: http://ggplot2.tidyverse.org/reference/geom_smooth.html
```
This is a graph of quality compared to alcohol. We can see an upward trend as
alchohol increases quality increases. This falls in line with our assumption
as the percentage of alcohol increases the quality increases.
```{r echo=FALSE, Plot}
ggplot(data = redwine, aes(x = quality.rating, y = alcohol,
fill = quality.rating))+
scale_fill_discrete(name = "Wine Quality")+
geom_boxplot() +
geom_smooth(method = 'lm') +
theme(plot.title = element_text(hjust = 0.5)) +
xlab('Quality') +
ylab(alcohol_title) +
ggtitle('How Alcohol Level Affects Compared to Wine Quality')
#Resources:
#http://ggplot2.tidyverse.org/reference/geom_boxplot.html
```
This graph shows how on average as you increase alcohol the quality of the wine
increases. The normal boxplot should be higher but since I combined 5 and 6
there are now a lot of outliers which aren't showing in the box plot.
```{r echo=FALSE, Bivariate_Plot2}
corr_redwine <- redwine[ , !names(redwine) %in% c("quality.rating")]
corr <- round(cor(corr_redwine), 1)
print(corr)
ggcorrplot(corr, lab = TRUE) +
ggtitle('Wine Correlation Matrix') +
theme(plot.title = element_text(hjust = 0.5))
#Resources: http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-cor
#relation-matrix-using-ggplot2
```
Correlation Matrix that displays the relationship (correlation) between the
different variables. Since my variable of interest is quality I am looking for
the closest number to one in the quality column which would be alcohol.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features \
in the dataset?
Alcohol correlated strongly with quality, sugar suprisingly did not correlate
with quality.
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
Density correlates strongly with fixed acidity.
### What was the strongest relationship you found?
Besides the acids and dioxides that were strongly correlated to each other
I found Density positively correlates with fixed acidity.
# Multivariate Plots Section
```{r echo=FALSE, Multivariate_Plot1}
ggplot(data = redwine, aes(y = alcohol,
x = density, color = quality.rating))+
geom_point(position=position_jitter(), alpha=1, size = 2) +
scale_color_brewer(type = 'qual', palette = 1,
guide = guide_legend(title = 'Quality', reverse = TRUE))+
ggtitle('Alcohol Content Vs Density Colored by Quality') +
theme(plot.title = element_text(hjust = 0.5)) +
ylab(alcohol_title) + xlab(density_title)
#Resources: http://ggplot.yhathq.com/docs/scale_color_brewer.html
```
This graph compares alcohol content to density. We can see as alcohol increases
and density decreases quality tends to get slightly better. While this is an
interesting observation I will not be using it for my final analysis.
```{r echo=FALSE, Multivariate_Plot_2}
ggplot(data = redwine, aes(y = citric.acid,
x = fixed.acidity, color = quality.rating))+
geom_point(position=position_jitter(), alpha=1, size = 2) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_brewer(type = 'qual', palette = 1,
guide = guide_legend(title = 'Quality', reverse = TRUE)) +
ggtitle('Fixed Acidity Vs Citric Acid by Quality') +
ylab(citric_acid_title) + xlab(fixed_acidity_title)
```
This graph compares fixed acidity to citric acid. We can see as both the acids
increase the quality slightly increases. While this is an interesting
observation I will not be using it for my final analysis.
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
I decided to look at the density vs alcohol content and saw as alcohol increases
and density decreases quality tends to get slightly better.
I also decided to look at fixed acidity vs citric acid and saw as both the acids
increase the quality slightly increases.
### Were there any interesting or surprising interactions between features?
I found it interesting/ suprising that as acidity increase overall quality of
the wine increased.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
ggplot(aes(quality, fill = "red"), data = redwine) +
geom_bar() +
theme(plot.title = element_text(hjust = 0.5) ,
legend.position="none") +
labs(title = 'Histogram of Wine Quality',
x = 'Quality of Wine',
y = 'Number of Wines')
```
### Plot One Summary
This barplot shows that most of the wines are rated either a 5 or 6.
### Plot Two
```{r echo=FALSE, Plot_Two}
ggplot(data = redwine, aes(x = quality.rating, y = alcohol,
fill = quality.rating))+
scale_fill_discrete(name = "Wine Quality")+
geom_boxplot() +
geom_smooth(method = 'lm') +
theme(plot.title = element_text(hjust = 0.5)) +
xlab('Quality') +
ylab(alcohol_title) +
ggtitle('How Alcohol Level Affects Compared to Wine Quality')
```
### Description Two
This graph shows how on average as you increase alcohol the quality of the wine
increases. The normal boxplot should be higher but since I combined 5 and 6
there are now a lot of outliers which aren't showing in the box plot.
### Plot Three
```{r echo=FALSE, Plot_Three}
ggplot(redwine, aes(x = alcohol, y = quality)) +
geom_jitter(alpha = 0.5) +
geom_smooth(method = "lm") +
ggtitle('Quality vs Alcohol') +
ylab("Quality") + xlab(alcohol_title) +
theme(plot.title = element_text(hjust = 0.5))
#Resources: http://ggplot2.tidyverse.org/reference/geom_smooth.html
```
### Description Three
This graph shows a strong positive correlation between alcohol rate and quality.
------
# Reflection
This was an interesting dataset to work with as we were able to explore the
different things that make up the quality of wine.
I started out by investigating each of the variable and plotting them on a
histogram. I created a barplot of the quality levels and saw that most of the
wines consisted of wines rated between a 5 and 6. I created a new quality
rating column that categorizes a rating of 1-4 as Poor 5-6 as Normal and 7-10
as Good. I then created a histogram of the variables to view their distribution.
From there I created a correlation plot showed a strong positive correlation
between alcohol rate and quality. After using the correlation plot to come up
with the variables that would probably be the most successful to test I put
together a box plot outlining the the effects of alcohol on wine quality.
Finally I plotted a scatter plot that displayed an overall increase in quality
as acohol increases.
Some of the difficulties that I ran into during this project was my own
assumptions. Coming into this I has the assumption that sugar content would be
a strong contributing factor to the quality of the wine. Also this is my first
attempt at using R and I was suprised as to how similar it is with using the
different plotting libraries in Python.
One thing I would like to investigate further if I had the chance would be to
find out if there is a specific reason that the wine ratings data range from
only 3 to 8. Did they not include data outside that range, or did none of the
wines warrant a score below 3 or above 8?
You can’t perform that action at this time.