## Instructions {-}

1. Please answer the following questions as part of your project proposal.

2. Write your answers in the *Markdown* cells of the Jupyter notebook. You don't need to write any code, but if you want to, you may use the *Code* cells.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The project proposal is worth 8 points, and is due on **18th April 2023 at 11:59 pm**. 

5. You must make one submission as a group, and not individually.

6. Maintaining a GitHub repository is optional, though encouraged for the project.

7. Share the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0) (optional).

# 1) Team name
Mention your team name.

*(0 points)*

Saturn.

# 2) Member names
Mention the names of your team members.

*(0 points)*

Members: Kaitlyn Hung, Amy Wang, Anastasia Wei, Lila Wells.

# 3) Link to the GitHub repository (optional)
Share the link of the team's project repository on GitHub.

Also, put the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0).

We believe there is no harm in having other teams view your GitHub repository. However, if you don't want anyone to see your team's work, you may make the repository *Private* and add your instructor and graduate TA as *Collaborators* in it.

*(0 points)*

Our GitHub repository can be accessed [here](https://github.com/notlilawells/Saturn-303-3).

# 4) Topic
Mention the topic of your course project.

*(0 points)*

Our topic is as follows: "Predicting Wine Quality based on Physical and Chemical Alcohol Attributes"

# 5) Problem statement

*(4 points)*

## 5a) The problem

The Vinho Verde region of northwest Portugal brings affordable wines of diverse flavor profiles and complexities to everyday consumers [1]. As this region continues to gain influence and grow as a wine exporter, it is crucial that vineyards and consumers keep pace in their ability to determine the quality of each wine variant. Quality is an important metric in a wine's certification process, and it helps vintners determine how to price and market a wine [2].

However, wine quality is (at least at the moment) largely determined by sommeliers [2]. Though these individuals are experts in wine-tasting, sommeliers are still humans with subjective opinions and personal taste preferences. Therefore, the industry has begun to support the *sommelier-determined measure* of quality by measuring objective physicochemical properties of wines, such as pH and alcohol content.

In an effort to expand this practice, **we will build a model that predicts sommelier-determined wine quality based on physicochemical properties of Vinho Verde wines.** Given the great diversity in climate, grapes, and winemaking methods across the Vinho Verde region [1], this model would likely generalize to predicting the quality of wines from other regions (especially those surrounding Vinho Verde).

## 5b) Type of response

Our dataset lends itself to a **regression problem**, where we will use the physicochemical properties of wine samples from the Vinho Verde region of Portugal to predict their quality. Our response variable, `quality`, is a continuous variable ranging from 1-10.


## 5c) Performance metric

To assess the accuracy of our regression model, we plan to optimize its **RMSE**. We want to accurately predict the quality of *each individual wine sample* in our dataset. Thus, we would like the error of each individual prediction to be minimized, and for our performance metric to be particularly sensitive to larger errors. RMSE accomplishes this by penalizing larger errors more than smaller ones. MAE would be useful if we were less interested in sensitivity to larger errors, or if we were indifferent as to how the error is distributed across wine samples. However, that is not our focus in this project, so we choose to optimize the model's RMSE.
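The difference between the two metrics can be seen in a small sketch. The quality scores and predictions below are hypothetical values chosen for illustration, not results from our data. Both sets of predictions have the same MAE, but RMSE is larger when the error is concentrated in a single wine.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical quality scores for four wines (illustrative values only).
actual = [5, 6, 7, 5]

# Both prediction sets have the same total absolute error (4 points),
# but the second concentrates it in a single wine.
spread_out = [6, 7, 6, 6]    # four errors of size 1
concentrated = [5, 6, 7, 9]  # one error of size 4

print("spread out:   MAE =", mae(actual, spread_out), "RMSE =", rmse(actual, spread_out))
print("concentrated: MAE =", mae(actual, concentrated), "RMSE =", rmse(actual, concentrated))
```

In this toy case the MAE is 1 for both sets, while the RMSE doubles from 1 to 2 for the concentrated error, which is the sensitivity we want.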

## 5d) Naive model accuracy
What is the accuracy of the naive model (standard deviation of the response in case of a continuous response / proportion of the majority class in case of a classification model)?

The standard deviation of the continuous response variable `quality` (representing a wine's quality) is **0.87.**
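The standard deviation serves as the naive benchmark because a model that predicts the mean quality for every wine has an RMSE equal to the population standard deviation of the response. The sketch below checks this on a short list of hypothetical quality scores (illustrative values, not our data).

```python
import math
import statistics

# Hypothetical quality scores (illustrative values, not the real data).
quality = [5, 6, 5, 7, 6, 4, 6, 8]

# Naive model: predict the mean quality for every wine.
mean_quality = statistics.fmean(quality)
naive_rmse = math.sqrt(sum((q - mean_quality) ** 2 for q in quality) / len(quality))

# The naive RMSE coincides with the population standard deviation.
population_sd = statistics.pstdev(quality)
print(round(naive_rmse, 4), round(population_sd, 4))
```

A sample standard deviation (dividing by n - 1) would differ slightly from this value, but with several thousand observations the gap is negligible.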

# 6) Data

## 6a) Source
What data sources will you use, and how will the data help solve the problem? Explain.
If the data is open source, share the link of the data.

*(0.5 point)*

We are using a data set entitled "Wine Quality Data Set" from the UCI Machine Learning Repository. The dataset can be accessed [here](http://archive.ics.uci.edu/ml/datasets/Wine+Quality). 

This dataset helps us address our problem by cataloging the quality and physicochemical properties of over 6,000 red and white samples of Portuguese "Vinho Verde" wine. A wine's quality is related to its physicochemical properties, and this dataset includes 11 such properties, which makes it well suited to our goal of predicting wine quality.

## 6b) Response & predictors
What is the response, and mention some of the predictors.

*(0.5 point)*

Our response variable is wine quality, represented by the variable `quality` in this dataset. This value ranges from 1 to 10. The predictors we will be using include physicochemical attributes of wine such as pH, density, fixed and volatile acidity, citric acid content, residual sugar, chlorides, free and total sulfur dioxide, sulphates, and alcohol content, as well as wine type (red or white).

## 6c) Size
What is the number of continuous predictors, categorical predictors, and observations in your dataset(s). If you are using multiple datasets, please provide the information for each dataset. When counting predictors, count only those that have sufficient non-missing values, and will be useful.

*(1 point)*

There are 12 total predictors in this dataset: **11 predictors are continuous** (the physicochemical attributes listed in 6b) and **one predictor is categorical** (wine type). There are **6,497 total observations** in this dataset, and each observation represents a different wine sample.
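These counts can be checked with a short sketch. It assumes the data arrive as two semicolon-separated files, one for red wines and one for white, and that wine type is added as a column when the two are combined. The in-memory strings, the three-column layout, and the helper `load_wines` are hypothetical stand-ins used only for illustration.

```python
import csv
import io

# Tiny stand-ins for the two source files, truncated to three columns.
# The values are hypothetical; the real files have many more rows and columns.
red_csv = "pH;alcohol;quality\n3.51;9.4;5\n3.20;9.8;5\n"
white_csv = "pH;alcohol;quality\n3.00;8.8;6\n"

def load_wines(text, wine_type):
    """Parse semicolon-separated text and tag each row with its wine type."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        parsed = {name: float(value) for name, value in row.items()}
        parsed["type"] = wine_type  # categorical predictor added on merge
        rows.append(parsed)
    return rows

wines = load_wines(red_csv, "red") + load_wines(white_csv, "white")

# Everything except the response is a predictor.
predictors = [name for name in wines[0] if name != "quality"]
continuous = [name for name in predictors if isinstance(wines[0][name], float)]
categorical = [name for name in predictors if isinstance(wines[0][name], str)]

print("observations:", len(wines))
print("continuous predictors:", continuous)
print("categorical predictors:", categorical)
```

Running the same logic on the full files would report the number of observations and the split between continuous and categorical predictors.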

# 7) Existing solutions
Are there existing solutions to your problem? Almost all Kaggle datasets have existing solutions. If yes, then how do you plan to build upon those solutions? **What is the highest model accuracy / performance achieved in the existing solutions?**

*(1 point)*

There are several existing solutions to our problem, as we are using an established dataset from the UCI Machine Learning Repository (a well-known website housing data resources). There are 1,408 notebooks on Kaggle, of varying quality and completeness, that use this dataset in some way. The highest accuracy a published Kaggle notebook has achieved when using this dataset to predict wine quality is 91% (using a Random Forest model).

We plan to improve upon existing solutions by implementing ensemble modeling with 8 modeling methods (MARS, decision trees with cost-complexity pruning, bagging, Random Forests, AdaBoost, gradient boosting, XGBoost, and Lasso/Ridge/Stepwise selection). Existing solutions on Kaggle largely address this problem using a single modeling method. By combining different modeling methods in an ensemble, we are confident that our model will outperform existing Kaggle solutions.
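One simple way to combine models is to average their predictions for each wine. The sketch below illustrates that idea only; it is not our final ensembling method, and the quality scores, model names, and predictions are hypothetical values chosen so that the individual models' errors partly offset one another.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical quality scores and predictions from three hypothetical models.
actual = [5, 6, 7, 5, 6]
predictions = {
    "model_a": [5.5, 6.5, 6.0, 5.5, 6.5],
    "model_b": [4.5, 5.0, 7.5, 4.0, 6.0],
    "model_c": [5.0, 6.5, 7.5, 5.5, 5.0],
}

# Simple ensemble: average the models' predictions for each wine.
ensemble = [sum(values) / len(values) for values in zip(*predictions.values())]

for name, predicted in predictions.items():
    print(name, round(rmse(actual, predicted), 3))
print("ensemble", round(rmse(actual, ensemble), 3))
```

Averaging helps here because the models err in different directions on the same wines. It is not guaranteed to beat the best individual model when the models make similar mistakes, so we would compare the ensemble against each individual model on held-out data.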

# 8) Stakeholders

Who are the stakeholders, and how will your project benefit them? Explain.

*(1 point)*

# 9) Work-split
*(This question is answered for you)*

How do you plan to split the project work amongst individual team members?

We will learn to develop and tune the following models in the STAT303 sequence:

1. MARS

2. Decision trees with cost-complexity pruning

3. Bagging (Bagging MARS / decision trees)

4. Random Forests

5. AdaBoost

6. Gradient boosting

7. XGBoost

8. Lasso / Ridge / Stepwise selection 

Each team member is required to develop and tune at least one of the above models. In the end, the team will combine all the developed models to create a model more accurate than each of the individual models.

*(0 points)*

We will all participate in exploratory data analysis to identify salient predictors for our models. We will divide the modeling processes as follows:

1. MARS - **Lila**

2. Decision trees with cost-complexity pruning - **Kaitlyn**

3. Bagging (Bagging MARS / decision trees) - **Lila**

4. Random Forests - **Amy**

5. AdaBoost - **Kaitlyn**

6. Gradient boosting - **Anastasia**

7. XGBoost - **Amy**

8. Lasso / Ridge / Stepwise selection - **Anastasia**

# 10) References 

[1] https://www.vinhoverde.pt/en/about-vinho-verde 

[2] https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub 