# Project Proposal
#### Authors: Omer Tahir, Sam Zheng, Paul Huang, Longfei Guan
#### Group: 9

## Introduction
Begin by providing some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.

Clearly state the question you will try to answer with your project. Your question should involve one or more random variables of interest, spread across two or more categories that are interesting to compare. For example, you could consider the annual maxima river flow at two different locations along a river, or perhaps gender diversity at different universities. Of the response variable, identify one location parameter (mean, median, quantile, etc.) and one scale parameter (standard deviation, inter-quartile range, etc.) that would be useful in answering your question. Justify your choices.

**UPDATE (Mar 1, 2022):** If it doesn’t make sense to infer a scale parameter, you can choose another parameter, or choose a second variable altogether. Ultimately, we’re looking for a comprehensive inference analysis on one parameter spread across 2+ groups (with at least one hypothesis test), plus a bit more (such as an investigation on the variance, a quantile, or a different variable). In total, you should use both bootstrapping and asymptotics somewhere in your report at least once each. Also, your hypothesis test(s) need not be significant: it is perfectly fine to write a report claiming no significant findings (i.e. your p-value is large).

Identify and describe the dataset that will be used to answer the question. Remember, this dataset is allowed to contain more variables than you need – feel free to drop them!

Also, be sure to frame your question/objectives in terms of what is already known in the literature. Be sure to include at least two scientific publications that can help frame your study (you will need to include these in the References section). We have no specific citation style requirements, but be consistent.

**Substance testing has many practical applications in modern society, including forensics, sports integrity, and medicinal research (cite). Distinguishing substances and determining their presence is an essential technique that draws upon many scientific disciplines, notably chemistry and statistics.
Acknowledging that existing techniques are complex and vary with context and application, this report aims to use statistical inference to differentiate two groups of wine: red and white. In a real-life scenario this would be as simple as a colour comparison; however, the goal is to show that differences in properties between similar substances can be detected statistically. The dataset used for this report is the Wine Quality Data Set from the UCI Machine Learning Repository.**

**Notable quantitative variables include fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide, density, pH, alcohol, and quality. Our research question is:
Do alcohol, sulfur dioxide, and volatile acidity differ between red and white wine?
(These variables are explained in the subsequent sections.)**
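
One way to state this question formally, assuming the mean is chosen as the location parameter (an assumption, since the proposal does not fix this choice), is as one two-sided test per variable. Here $\mu_{v,\text{red}}$ and $\mu_{v,\text{white}}$ are illustrative symbols for the population mean of variable $v$ in each wine type:

$$
H_0: \mu_{v,\text{red}} = \mu_{v,\text{white}}
\quad \text{vs.} \quad
H_1: \mu_{v,\text{red}} \neq \mu_{v,\text{white}},
\qquad v \in \{\text{alcohol},\ \text{sulfur dioxide},\ \text{volatile acidity}\}
$$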

**In the full report, after investigating the relevance of these variables in relation to the wine groups, we plan to validate our findings with existing research on general health differences surrounding the two wines. Connecting the results of our hypothesis test to the broader picture will allow for a greater contextual understanding.**



## Preliminary Results

In this section, you will:

* Demonstrate that the dataset can be read from the web into R.
* Clean and wrangle your data into a tidy format.
* Plot the relevant raw data, tailoring your plot in a way that addresses your question.
* Compute estimates of the parameter you identified across your groups. Present this in a table. If relevant, include these estimates in your plot.

Be sure to not print output that takes up a lot of screen space.

### Loading relevant libraries

In [None]:
library(tidyverse)
library(infer)

### Reading & Wrangling the datasets from the web into R

In [None]:
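# Semicolon-separated CSV files from the UCI Machine Learning Repository
url_1 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
url_2 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

# Read each file and label its rows with the wine type
redwine_data <- 
    read.csv(url_1, sep=';') |>
    mutate(type = "red")

whitewine_data <- 
    read.csv(url_2, sep=';') |>
    mutate(type = "white")

# Stack both datasets, move `type` to the front, and treat quality as a factor
wine_data <-  
    rbind(redwine_data, whitewine_data) |> 
    select(type, fixed.acidity:quality) |>
    mutate(quality = as_factor(quality))

# Preview the first rows only, to keep the output short
head(wine_data)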

### Exploratory Data Analysis

In [None]:
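# Enlarge the plot area so all facets remain readable
options(repr.plot.width = 15, repr.plot.height = 10)

# Reshape to long format (one row per measurement), then draw one
# red-vs-white boxplot per variable, each facet with its own y-axis scale
wine_data |> 
    pivot_longer(cols = fixed.acidity:alcohol,
                names_to = "predictors",
                values_to = "values") |>
    ggplot() +
    geom_boxplot(aes(x = type, y = values, fill = type)) +
    facet_wrap(vars(predictors),
               scales = "free") +
    theme(legend.position = "none",
          text = element_text(size = 20))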
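
A minimal sketch of how the per-group estimates could be tabulated, assuming the mean is used as the location parameter and the standard deviation as the scale parameter. `toy_wine` is a made-up stand-in so the snippet runs on its own (its values are not from the dataset), `wine_estimates` is a hypothetical name, and the column names `volatile.acidity` and `total.sulfur.dioxide` are assumed to match what `read.csv` produces for this file.

```r
library(dplyr)

# Toy stand-in for wine_data: made-up values, NOT taken from the real dataset.
toy_wine <- data.frame(
    type = rep(c("red", "white"), each = 3),
    alcohol = c(10, 11, 12, 9, 10, 11),
    volatile.acidity = c(0.5, 0.6, 0.7, 0.2, 0.3, 0.4),
    total.sulfur.dioxide = c(40, 50, 60, 120, 140, 160)
)

# Mean (location) and standard deviation (scale) of each variable by wine type.
# across() names the new columns "<variable>_<statistic>", e.g. alcohol_mean.
wine_estimates <- toy_wine |>
    group_by(type) |>
    summarise(across(c(alcohol, volatile.acidity, total.sulfur.dioxide),
                     list(mean = mean, sd = sd)),
              .groups = "drop")

wine_estimates
```

In the notebook, the same pipeline could be applied to `wine_data` in place of `toy_wine` to produce a two-row table, one row per wine type.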

## Methods: Plan

The previous sections will carry over to your final report (you’ll be allowed to improve them based on feedback you get). Begin this Methods section with a brief description of “the good things” about this report – specifically, in what ways is this report trustworthy?

Continue by explaining why the plot(s) and estimates that you produced are not enough to give to a stakeholder, and what you should provide in addition to address this gap. Make sure your plans include at least one hypothesis test and one confidence interval. If possible, compare both the bootstrapping and asymptotics methods.

Finish this section by reflecting on how your final report might play out:

* What do you expect to find?
* What impact could such findings have?
* What future questions could this lead to?

**For a high-quality wine, it is important to control all factors that could cause negative effects during production. We choose the following three random variables as our variables of interest:**
 
**1) Volatile acidity (VA), one of the chemicals produced during production, which has the smell and taste of vinegar. Although there is no evidence that VA is harmful to the human body, many countries limit the VA concentration in order to assure the quality of the wine.**
 
**2) Alcohol, a key element that explains the rich flavors and tastes of wines. It makes the wine taste bitter, sour, sweet, and spicy all together. Higher-alcohol wines tend to taste bolder and more oily.**
 
**3) Sulfur dioxide, which is mainly added to kill bacteria and prevent oxidation of the wine. It can also occur naturally during the wine production process without being added artificially. Too much sulfur dioxide in the wine will kill good yeast and give off an undesirable odor and bitter taste.**
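
For any one of these variables, the comparison between wine types could be sketched as below: a bootstrap percentile confidence interval for the difference in means using `infer`, next to an asymptotic Welch two-sample t-test from base R. This is only an illustration under stated assumptions: the mean is taken as the parameter of interest, `toy_wine` is simulated placeholder data rather than the wine dataset, and the seed, the number of replicates, and the names `boot_ci` and `welch` are arbitrary choices.

```r
library(infer)

set.seed(2022)  # arbitrary seed so the resampling is reproducible

# Simulated stand-in for wine_data; these are NOT real measurements.
toy_wine <- data.frame(
    type = factor(rep(c("red", "white"), each = 50)),
    alcohol = c(rnorm(50, mean = 10.5, sd = 1), rnorm(50, mean = 10.0, sd = 1))
)

# Bootstrap: resample rows with replacement, recompute the difference in mean
# alcohol (red - white) each time, then take the 2.5% and 97.5% percentiles.
boot_ci <- toy_wine |>
    specify(alcohol ~ type) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = "diff in means", order = c("red", "white")) |>
    get_confidence_interval(level = 0.95, type = "percentile")

# Asymptotics: Welch two-sample t-test of H0 "equal means", which also
# reports a 95% confidence interval for the same difference (red - white).
welch <- t.test(alcohol ~ type, data = toy_wine)

boot_ci
welch$conf.int
welch$p.value
```

Placing the two intervals side by side is one way to compare the bootstrapping and asymptotic approaches on the same parameter.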



**We expect that the two wines will have different amounts of VA, alcohol, and sulfur dioxide. The report's results will describe how the wines differ in these three quantities, and may help explain why red wines are more popular than white wines. Some future questions are:**

**1) What are the potential reasons that make one wine healthier than the other?**

**2) What factors determine the quality of a wine?**

**3) Does alcohol level affect the popularity of a wine?**


## References

**P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.**

**Gleason, J. G., & Barnum, D. B. (1991, January). RISK: Health, Safety & Environment (1990–2002). https://scholars.unh.edu/cgi/viewcontent.cgi?article=1038&context=risk&httpsredir=1&referer=**

**Harper, L. (2017, July 31). An overview of forensic drug testing methods and their suitability for harm reduction point-of-care services - Harm Reduction Journal. BioMed Central. https://harmreductionjournal.biomedcentral.com/articles/10.1186/s12954-017-0179-5**

