
## Background / Motivation

The Vinho Verde region of northwest Portugal brings affordable wines of diverse flavor profiles and complexities to everyday individuals [1]. As this region continues to gain influence and grow as a wine exporter, it is crucial that vineyards and consumers proportionally scale their ability to determine the quality of each wine variant. Quality is an important metric in a wine's market certification process, and it helps vintners determine how to properly price a wine, as well as how to market it [2].

However, wine quality is (at least at the moment) largely determined by sommeliers [2]. Though these individuals are experts in wine-tasting, sommeliers too are humans with subjective opinions and personal preferences in taste. Therefore, the industry has begun to support the sommelier-determined measure of quality through the measurement of objective physicochemical properties of wines, such as pH and alcohol values.

In an effort to expand this practice, we built a series of models to predict sommelier-determined wine quality based on physicochemical properties of Vinho Verde wines. We were motivated to approach this topic in particular as all of our group members are burgeoning young adults of drinking age. Thus, we hoped to better learn about and understand wine practices through our project so that we may better approach our interactions with wine in the future. We are also all interested in the world of sommeliers and oenologists, and hope to explore their practices and quality measurements through our research and modeling.

Given the great diversity in climate, grapes, and methods of winemaking in the Vinho Verde region and wine-production areas [1], this model would likely be generalizable when predicting the qualities of wines from other regions (especially those surrounding Vinho Verde). 

## Problem statement 

To approach our analysis, we first articulated our objective: to predict oenologist-determined wine qualities of Vinho Verde wines (on a 1 to 10 scale) using the physical and chemical attributes of those wines.

Although our objective involves predicting a response variable that has integer values from 1 to 10, we elected to approach this problem with regression (i.e., predicting the response on a continuous scale and rounding our predictions to integer values thereafter) instead of multi-class classification.

We chose to approach this problem with regression because our response values (from 1 to 10) have an inherent order (as ordinal values), and the magnitude of the difference between consecutive values is meaningful. For example, the difference between quality "1" and "2" wines may be similar to the difference between quality "9" and "10" wines. Regression helps to capture this relationship by considering the magnitude of the values. 
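
A minimal sketch of the "regress, then round" idea, using made-up numbers rather than our data. The helper names `round_to_quality` and `rmse` are hypothetical, and clipping to the 1 to 10 scale is an assumption added here for safety.

```python
import numpy as np

def round_to_quality(y_pred, low=1, high=10):
    """Round continuous predictions to the nearest integer and keep them on the quality scale."""
    return np.clip(np.rint(y_pred), low, high).astype(int)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative values only (not taken from the wine data)
y_true = np.array([5, 6, 7, 4])
y_cont = np.array([5.4, 6.6, 6.2, 3.7])   # continuous regression output
y_int = round_to_quality(y_cont)          # integer quality predictions

score = rmse(y_true, y_int)
```

Because RMSE squares each error, a prediction that is far from the true quality is penalized much more heavily than one that is off by a single quality unit.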


## Data sources

We elected to use the "Wine Quality Data Set" from the UCI Machine Learning Repository. The dataset can be accessed [here](https://archive.ics.uci.edu/ml/datasets/Wine+Quality). This dataset helped us to address our project by comprehensively cataloging the different qualities of over 6,000 red and white variants of Portuguese "Vinho Verde" wine from 2004 to 2007. A wine's quality is based on its physicochemical properties, and this data set includes 11 such properties, which supports its use in our goal to predict wine quality.

Our response variable here was wine quality, represented by the variable `quality` in this dataset. This value is scored on a scale from 1 to 10. The predictors we used were physicochemical attributes of wine (pH, density, fixed acidity, volatile acidity, citric acid content, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, sulphates, and alcohol content) along with wine type (red or white). There are 12 total predictors in the dataset: 11 continuous predictors and one categorical predictor. There are 6,497 total observations in this dataset, and each observation represents a different wine sample.

## Stakeholders

Our project addresses the interests of three stakeholder groups: (1) wine producers, (2) restaurants and bars, and (3) oenologists and sommeliers.

**Wine producers** have a vested interest in predicting a wine's quality, as quality often equates to both a wine's performance in the market and its return on investment, or ROI. Higher quality wines are typically more expensive, so they yield a higher ROI than lower quality wines, and this directly impacts wine producers. Wine producers invest copious amounts of time and money into cultivating different wine variants. Thus, our project benefits these stakeholders by offering them a way to predict a wine's quality with accuracy and reliability. They may then use this information either to gauge a wine's performance in the market or to choose how to best spend their time and money (perhaps on wines that will be more expensive and will thus benefit them more financially).

**Restaurant and bar owners** who buy wine for their businesses may also be interested in assessing a wine’s quality. These stakeholders often buy wine from vineyards to sell in their businesses. They buy different wine variants with the goal of later selling them to customers for an upcharge. These stakeholders need to know the quality of the wines they are buying, as this metric informs how much restaurants and bars can charge for a given glass. Thus, our model may allow these stakeholders to assess the quality of the wines that they are considering for their establishments. By developing a model to predict a wine's quality, our project allows these stakeholders to not only assess a wine's quality overall (and thus figure out how to best market and charge for that wine), but also explore the range of wines to include (as, typically, restaurants include a variety of wine types and qualities - a "good mix" is encouraged to cater to diverse customer wants and needs).

Our third and final stakeholders include **oenologists and sommeliers**. Oenologists and sommeliers are currently working to establish defined measures for determining a wine's quality by evaluating its physicochemical characteristics. By incorporating these characteristics as predictors in our modeling processes, our project may offer these stakeholders a method to more easily and accurately determine a wine's quality through its makeup, without having to spend exhaustive hours comparing one wine's composition to another's. This project offers a tool that, if accurate and reliable, could be leveraged by oenologists and sommeliers to better gauge the performance and quality of their industry's product.

## Data quality check / cleaning / preparation 

### Data Quality Check and Cleaning




![response_dist.png](attachment:response_dist.png) <br>
The response variable, wine quality, takes integer values, which we treat as continuous for modeling. It has a mean of 5.82 and a standard deviation of 0.87.



![predictor_dist.png](attachment:predictor_dist.png)


There are 11 continuous variables. There are no missing values for any predictors, and all values seem plausible (for example, the minimum and maximum pH values fall within the pH scale). The original data was split into two separate datasets--one for white wine and one for red wine. We decided to merge these datasets into one and create a new categorical variable, `type`, to describe whether the wine was red or white. Most wines in the new merged dataset are white, representing 75% of all samples.   

|             | `type`     |
| ----------- | ----------- |
| Levels      | 2 (White, Red) |
| Missing values   | 0        |
| Number of unique values   | 2        |
| Frequency at all levels   | {White: 4898, Red: 1599}   |
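
The merge described above can be sketched as follows. This is an illustration on tiny stand-in frames, not the code we ran: the values and the variable names (`red`, `white`, `wine`) are hypothetical, and in practice the two frames would be loaded from the red and white CSV files.

```python
import pandas as pd

# Tiny stand-in frames with a few of the real column names; values are illustrative only
red = pd.DataFrame({"pH": [3.51, 3.20], "alcohol": [9.4, 9.8], "quality": [5, 5]})
white = pd.DataFrame({"pH": [3.00, 3.30, 3.26], "alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

# Tag each frame with its wine type, then stack them into one dataset
wine = pd.concat(
    [white.assign(type="white"), red.assign(type="red")],
    ignore_index=True,
)

# Basic quality checks: missing values per column and frequency of each level of `type`
missing_per_column = wine.isna().sum()
type_counts = wine["type"].value_counts()
```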


### Data Preparation

We chose to use `MinMaxScaler` to scale the predictors. `StandardScaler` assumes that the predictors are normally distributed. The QQ plots of the first four predictors show that this assumption does not hold for all predictors (a normally distributed predictor would have its values align closely with the 45 degree line). 
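
To make the difference between the two scalers concrete, here is a small sketch on synthetic, right-skewed data (a stand-in for a predictor such as residual sugar). Fitting each scaler on the train split only and reusing it on the test split is an assumption about good practice, not a record of our exact pipeline.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.exponential(scale=5.0, size=(200, 1))  # skewed synthetic predictor
X_test = rng.exponential(scale=5.0, size=(50, 1))

# Fit on train only, then apply the same transformation to test
minmax = MinMaxScaler().fit(X_train)
standard = StandardScaler().fit(X_train)

X_train_mm = minmax.transform(X_train)      # train values mapped onto [0, 1]
X_train_std = standard.transform(X_train)   # train values with mean 0 and unit variance
X_test_mm = minmax.transform(X_test)        # test values may fall slightly outside [0, 1]
```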

"MinMaxScaler scales the data to a fixed range, typically between 0 and 1. On the other hand, StandardScaler rescales the data to have a mean of 0 and a standard deviation of 1. This results in a distribution with zero mean and unit variance." - [Source](https://vitalflux.com/minmaxscaler-standardscaler-python-examples/#:~:text=Differences%20between%20MinMaxScaler%20and%20StandardScaler,-Both%20MinMaxScaler%20and&text=MinMaxScaler%20scales%20the%20data%20to,zero%20mean%20and%20unit%20variance.)

<ins>Quantile-Quantile plots for the first four predictors: fixed acidity, volatile acidity, citric acid, and residual sugar. The remaining predictors also do not clearly follow the 45 degree line.</ins>

<img src="../Visualizations/fixed_acidity_qq.png"/>

<img src="../Visualizations/volatile_acidity_qq.png"/>

<img src="../Visualizations/citric_acid_qq.png"/>
<img src="../Visualizations/residual_sugar_qq.png"/>



## Exploratory data analysis

Our insights from our exploratory data analysis process were as follows: 

- **Insight 1**: The predictor distributions were not normal--residual sugar had a skewed distribution and total sulfur dioxide had a bimodal distribution. Thus, we scaled the predictors using MinMaxScaler instead of StandardScaler because the latter assumes that the predictors have normal distributions, which is not true for our dataset. MinMaxScaler scales the data to a range of 0 to 1 whereas StandardScaler scales the data to have a mean of 0 and standard deviation of 1. (*See graph below for a distribution of our predictors*)

<div style="text-align: center;">
   <img width="60%" height="60%" src="imgs/pred_dist_w_title.png">
</div>

- **Insight 2**: Alcohol, volatile acidity, and density showed distinct trends with wine quality, suggesting that these might be useful predictors. These predictors also had the highest correlation with wine quality. 

<div style="text-align: center;">
   <img width="80%" src="imgs/pred_vs_quality_2.png">
</div>

## Approach

What kind of models did you use? What performance metric(s) did you optimize and why?

>We chose to use all of the models we learned this quarter, including several tree-based and boosting methods. We also used Lasso as a baseline linear model against which to compare our other models. We optimized our models for RMSE because the size of each prediction error matters. For example, predicting a quality 2 wine as a quality 8 wine is worse than predicting a quality 5 wine as a quality 6 wine. 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction accuracy or your chosen metric?

>There are several existing solutions for our problem, as we are using an established data set from the UCI Machine Learning Repository. There are 1408 notebooks on Kaggle, of varying quality and completeness, that use this data set in some regard. We sought to improve upon existing solutions by implementing ensemble modeling with 8 different modeling methods (MARS, decision trees with cost-complexity pruning, bagging, Random Forest, AdaBoost, gradient boosting, XGBoost, and Lasso/Ridge/stepwise selection methods). Existing solutions on Kaggle largely address this problem with a single modeling method. By implementing ensemble modeling and leveraging different modeling methods, we hoped to outperform existing Kaggle solutions. The highest accuracy that a published Kaggle notebook has achieved when using this dataset to predict wine quality is 91%, using a Random Forest model. It is difficult for us to compare our model to existing solutions because we performed regression and optimized for RMSE rather than accuracy. We chose to perform regression and round to integer values because we have not learned multi-class classification and because we were curious whether this alternative approach might help us create a better model. 




## Developing the model: Hyperparameter tuning

### Intercept Base Model
*By Anastasia Wei*

For the intercept base model, I took the mean of the train data response as the intercept and used the rounded mean as the prediction for every test observation. This intercept model gave an RMSE of 0.9327.
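
The intercept-only baseline can be sketched in a few lines. The arrays below are made-up values, not our train and test splits, and the variable names are hypothetical.

```python
import numpy as np

# Illustrative responses only
y_train = np.array([5, 6, 6, 7, 5, 6])
y_test = np.array([6, 5, 7, 6])

# The "model" is just the rounded mean of the training response
baseline = int(np.rint(y_train.mean()))
y_pred = np.full_like(y_test, baseline)

baseline_rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
```

Any model worth keeping should achieve a lower test RMSE than this constant prediction.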

### Ridge and Lasso Regression
*By Anastasia Wei*

For Ridge regression, I used the `Ridge` class from the `sklearn` package. I chose a range from 15.8 to 5e-4 (`10**np.linspace(1.5,-3,200)*0.5`) for the tuning parameter alpha, and tuned the model using `RidgeCV`. The optimal alpha was 0.01636, which resulted in a test RMSE of 0.8167.

For Lasso regression, I used the `Lasso` class from the `sklearn` package. I chose a range from 50 to 5e-6 (`10**np.linspace(2,-5,200)*0.5`) for the tuning parameter alpha, and tuned the model using a for loop over these values, optimizing for model RMSE. The optimal alpha was found to be 0.0004301732208342255, which resulted in a train RMSE of 0.7806 and a test RMSE of 0.8138. 

See below for a plot of the train RMSE vs alpha.

![image-3.png](attachment:image-3.png)
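
As a rough illustration of the Ridge tuning step, the sketch below runs `RidgeCV` over the same alpha grid on synthetic data. The data from `make_regression` and the split are stand-ins, so the selected alpha and RMSE will not match the values reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled wine predictors and quality response
X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Alpha grid quoted in the text: roughly 15.8 down to 5e-4 on a log scale
alphas = 10 ** np.linspace(1.5, -3, 200) * 0.5

# RidgeCV picks the alpha with the best cross-validated error
ridge = RidgeCV(alphas=alphas).fit(X_train, y_train)
best_alpha = float(ridge.alpha_)

test_rmse = float(np.sqrt(np.mean((y_test - ridge.predict(X_test)) ** 2)))
```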

### MARS
*By Lila Wells*

I approached the MARS modeling process in three parts: (1) a coarse grid search to optimize `max_terms` and `max_degree`, (2) a fine grid search to optimize the same hyperparameters, and (3) building a residual model to predict the optimized model's residuals.

#### Coarse Grid Search 

I began the coarse grid search process with a nested for loop search over `max_terms` and `max_degree`. I iterated over the following hyperparameter values: 

> `max_terms`: range(400, 1201, 200) (i.e., from 400 to 1200 in steps of 200)

> `max_degree`: range(1, 11, 2) (i.e., from 1 to 9 in steps of 2)

After running the nested for loop search to optimize both hyperparameters simultaneously, I found the optimal `max_terms` value to be 400 and the optimal `max_degree` value to be 5. However, after computing this model's test RMSE, I was surprised to find that it was quite high: 1.566 quality units. Thus, I moved on to a finer grid search over a narrower set of hyperparameters.

#### Fine Grid Search

In the finer grid search, I iterated over the following range of hyperparameter values:

> `max_terms`: range(300, 801, 100) (i.e., from 300 to 800 in steps of 100)

> `max_degree`: [4, 5, 6, 7]

After running this nested for loop search to optimize both hyperparameters simultaneously, I found the optimal `max_terms` value to be 800 and the optimal `max_degree` value to be 7. This yielded a test RMSE of 0.790 -- far better than the RMSE of the original coarse grid search. 

Though both the optimal `max_terms` and `max_degree` in this fine grid search lay at the top of the range of hyperparameters I considered, they ultimately proved to be the best. I attempted several other fine grid searches over higher ranges to check whether 800 and 7 were the optimal values for `max_terms` and `max_degree` respectively, and the test RMSE increased as these hyperparameters were increased. Thus, this fine grid search proved to be the best in terms of determining the optimal hyperparameters for the MARS model.

Still, I wanted to see if I could decrease the model's RMSE even further. Thus, I decided to build a model to predict the residuals of my MARS model, so that I could essentially anticipate and correct for that model's error. 

#### Residual Model

Before building the residuals model, I began first by plotting the residuals of the optimized MARS model itself (see figure below).

<div style="text-align: center;">
   <img width="60%" height="60%" src="imgs/residual_plot_MARS.png">
</div>

The residuals seemed to be somewhat evenly distributed on either side of the line `y = 0`. 

I created the residuals model by fitting an `Earth()` instance with the predictions of my MARS model on train data as `X` (or the predictor) and the train data residuals of my MARS model (i.e., the difference between predicted values and actual values) as `y`, or the response. The residual model itself had an RMSE of 0.76 (slightly better than the performance of my MARS model). 

I then used the residual model to make predictions. I did so by feeding the model the test predictions of the MARS model (to output the predicted residuals). I then added those residuals to my test data predictions for a combined RMSE of 0.777 (a slight decrease in RMSE from the MARS model's 0.790).
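
The residual-correction idea can be sketched generically as below. This is not the MARS code: `LinearRegression` stands in for the tuned `Earth()` model and a shallow `DecisionTreeRegressor` stands in for the residual model, both on synthetic data. The sketch also assumes residuals are defined as actual minus predicted, so that adding the predicted residual moves a prediction toward the truth.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Stand-in for the tuned base model
base = LinearRegression().fit(X_train, y_train)
train_pred = base.predict(X_train).reshape(-1, 1)
test_pred = base.predict(X_test).reshape(-1, 1)

# Residual model: base-model predictions in, residuals out
# (assumed sign convention: actual minus predicted)
train_resid = y_train - train_pred.ravel()
resid_model = DecisionTreeRegressor(max_depth=2, random_state=1).fit(train_pred, train_resid)

# Corrected test prediction = base prediction + predicted residual
corrected = test_pred.ravel() + resid_model.predict(test_pred)
```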

### Decision Tree
*By Kaitlyn Hung*

First, I trained an untuned tree model and found that the depth was 27 and there were 1,469 leaves. Next, I tuned hyperparameters max_depth, max_leaf_nodes, and min_samples_leaf using the untuned tree model to determine hyperparameter ranges to consider. I used GridSearchCV with 5 fold cross validation optimized for neg_mean_squared_error. I used the following hyperparameter values: 

```python
import numpy as np

# Coarse grid for the decision tree search
coarse_grid = {
    'max_depth': np.arange(2, 27, 5),
    'max_leaf_nodes': np.arange(100, 1500, 250),
    'min_samples_leaf': np.arange(1, 10, 2)
}
```

![tree1.png](attachment:tree1.png)

From this coarse grid search, I found that the best parameters were a max_depth of 7, max_leaf_nodes of 100, and min_samples_leaf of 9. As some of these values were at the end of the range of values I considered, I did a fine grid search. I considered the following hyperparameter ranges: 

```python
# Fine grid for the decision tree search
fine_grid = {
    'max_depth': range(3, 12),
    'max_leaf_nodes': np.arange(2, 127, 25),
    'min_samples_leaf': [8, 9, 10]
}
```

![tree2.png](attachment:tree2.png)

This GridSearchCV gave optimal values of max_depth = 6, max_leaf_nodes = 52, and min_samples_leaf = 8. I trained a decision tree regressor using these optimal hyperparameter values. Then, I used this model to predict the quality of test data, rounding values to the closest integer. This gave me an RMSE of 0.804.
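
A compact sketch of this workflow (inspect the untuned tree, then grid search with 5 fold cross validation) is shown below on synthetic data. The grid `small_grid` is deliberately much smaller than the grids above so that it runs quickly, and all names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The untuned tree's size suggests upper bounds for the search ranges
untuned = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
untuned_depth, untuned_leaves = untuned.get_depth(), untuned.get_n_leaves()

# Small illustrative grid (not the grid used in the report)
small_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 9]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    small_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

cv_rmse = float(np.sqrt(-search.best_score_))   # scores are negative MSE
test_pred = np.rint(search.predict(X_test))     # round to the nearest integer
```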

### Bagging Decision Trees
*By Lila Wells*

I approached the bagging model in two phases: (1) a coarse grid search to identify the optimal hyperparameter space, and (2) a finer grid search to optimize hyperparameters. Before both steps, I wanted to determine (roughly) the number of trees that would be needed to stabilize the model's R-squared and RMSE. 

In the set of graphs below, I plotted the number of trees versus the out-of-bag RMSE and test RMSE. I noted that, while the test and out-of-bag R-squared trendlines seem to intersect and start to stabilize after the number of trees exceeds ~400, the out-of-bag and test RMSE trendlines seem to stabilize a little later, after around 500 trees. Thus, I chose 500 as my lower bound for `n_estimators` for the hyperparameter space.

<div style="text-align: center;">
   <img width="80%" src="imgs/bagging-graph1.png">
</div>
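
The out-of-bag curve can be computed along these lines. This is a sketch on synthetic data with far fewer trees than in the plot above, and the tree counts and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=1)

# Out-of-bag RMSE for a few ensemble sizes (illustrative values)
oob_rmse = {}
for n_trees in [50, 100, 200]:
    model = BaggingRegressor(
        DecisionTreeRegressor(random_state=1),
        n_estimators=n_trees,
        oob_score=True,      # keep out-of-bag predictions for each training row
        random_state=1,
    ).fit(X, y)
    oob_rmse[n_trees] = float(np.sqrt(np.mean((y - model.oob_prediction_) ** 2)))
```

Plotting `oob_rmse` against the number of trees shows where the error levels off, which is one way to choose a lower bound for `n_estimators`.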

#### Coarse Grid Search

I elected to optimize the following parameters: (1) `n_estimators`, (2) `max_samples`, (3) `max_features`, (4) `bootstrap`, and (5) `bootstrap_features`. 

I considered the following ranges of hyperparameters in my initial coarse grid search: 

```python
from sklearn.tree import DecisionTreeRegressor

# Parameters for the coarse grid search (the dict name `coarse_grid` is illustrative)
coarse_grid = {
    'estimator': [DecisionTreeRegressor(random_state = 1),
                  DecisionTreeRegressor(random_state = 1, max_depth = 6),
                  DecisionTreeRegressor(random_state = 1, max_depth = 10)],
    'n_estimators': range(500, 1001, 100),
    'max_samples': [0.5, 0.75, 1.0],
    'max_features': [0.5, 1.0],
    'bootstrap': [True, False],
    'bootstrap_features': [True, False]}
```

I considered three different types of decision trees. The first was a base decision tree with `random_state = 1` and no specified `max_depth` or other hyperparameters. The second tree had a `max_depth` of 6 (the optimal value found in Kaitlyn's decision tree optimization process). I included that tree to test whether specifying `max_depth` could have a beneficial impact on the model's RMSE. The final tree had a `max_depth` of 10, and was included as a way to test whether a deeper `max_depth` could yield positive results in the search process. 

I then used the above parameters as a grid to iterate through using `GridSearchCV()` and 2 fold cross-validation (to minimize runtime and identify the range of hyperparameters to focus in on in my finer grid search). 

This coarse grid search identified the following optimal hyperparameters:

1. 'estimator': DecisionTreeRegressor(random_state = 1),

2. 'n_estimators': 1000,

3. 'max_samples': 0.75,

4. 'max_features': 1.0,

5. 'bootstrap': False

6. 'bootstrap_features': True

With these hyperparameters, the model yielded a 0.667 test RMSE. I graphed and inspected the results of the `GridSearchCV()` search with 2-fold cross validation to identify which distributions I should inspect for my finer grid search (in an effort to reduce the model's RMSE). (*See below for the graphed distributions of the coarse grid search over hyperparameters*). 

<div style="text-align: center;">
   <img width="80%" src="imgs/bagging-init-search.png">
</div>

#### Finer Grid Search

Based on the above graphs and the optimal hyperparameters identified in my coarse grid search, I altered my hyperparameter distribution for the finer grid search as follows:

```python
# Parameters for the fine grid search (the dict name `fine_grid` is illustrative)
fine_grid = {
    'estimator': [DecisionTreeRegressor(random_state = 1)],  # Narrowed based on coarse grid results
    'n_estimators': range(800, 1101, 60),                    # Narrowed based on coarse grid results
    'max_samples': [0.6, 0.75, 0.9],                         # Narrowed based on coarse grid results
    'max_features': [0.5, 0.75, 0.85, 1.0],                  # Narrowed based on coarse grid results
    'bootstrap': [True, False],
    'bootstrap_features': [True, False]}
```

I then used the above parameters as a grid to iterate through using `GridSearchCV()` and 2 fold cross-validation (to minimize runtime). 

This fine grid search identified the following optimal hyperparameters:

1. 'estimator': DecisionTreeRegressor(random_state = 1),

2. 'n_estimators': 1040,

3. 'max_samples': 0.75,

4. 'max_features': 0.75,

5. 'bootstrap': False

6. 'bootstrap_features': False

With these hyperparameters, the model yielded a 0.665 test RMSE, a slight improvement from my original RMSE of 0.667. I graphed and inspected the results of the `GridSearchCV()` search with 2-fold cross validation to better inspect the distributions of my hyperparameter values.  (*See below for the graphed distributions of the fine grid search over hyperparameters*). 


<div style="text-align: center;">
   <img width="80%" src="imgs/bagging-fine-search.png">
</div>

I attempted several more grid searches with different values for `n_estimators` (as my optimal `n_estimators` value of 1040 was at the upper range of that hyperparameter space in the fine grid search). However, the test RMSE of my model did not improve. Thus, this fine grid search yielded the optimal hyperparameters for this model.

### Random Forest
*By Amy Wang*

### AdaBoost
*By Kaitlyn Hung*

I optimized the following hyperparameters: n_estimators, learning_rate, and max_depth of the base estimator using GridSearchCV to identify the optimal hyperparameter values. In my coarse grid search, I considered the following ranges: 

```python
grid = {}  # search space passed to GridSearchCV
grid['n_estimators'] = [10, 50, 100, 200]
grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
grid['base_estimator__max_depth'] = [3, 5, 10, 15]
```

![ada1.png](attachment:ada1.png)
This gave the optimal hyperparameters to be a max_depth of 15, n_estimators as 200, and learning_rate as 1. These were the max values I considered for each of the hyperparameters, so I did another grid search, increasing the values I considered.

```python
grid['n_estimators'] = [200, 500, 1000]
grid['learning_rate'] = [.5, .75, 1.0]
grid['base_estimator__max_depth'] = [12, 14, 16]
```

 ![ada2.png](attachment:ada2.png)
This gave the optimal hyperparameter values to be a max_depth of 14, learning_rate as 1.0, and n_estimators as 1000. I continued to narrow in to the optimal values with a finer grid search. 

```python
grid['n_estimators'] = [800, 1000, 1200, 1500]
grid['learning_rate'] = [.75, 1.0, 1.5]
grid['base_estimator__max_depth'] = [13, 14, 15, 16]
```

![ada3.png](attachment:ada3.png)
This gave the optimal values to be a max_depth of 13, learning rate of 1.5, and n_estimators as 1500. As some of these values are at the end of the range, I increased the values I considered. I tuned max_depth and learning_rate separately from n_estimators as thus far, the highest value of n_estimators has been best. 

```python
grid['learning_rate'] = [1.25, 1.5, 2]
grid['base_estimator__max_depth'] = [12, 13, 14]
```

![ada4.png](attachment:ada4.png)
I found that a max_depth of 13 and learning rate of 1.5 still gave the lowest RMSE. Using these optimal values, I next tuned n_estimators, considering the following values: 

```python
grid['n_estimators'] = [1000, 1500, 2000, 3000, 4000]
```

![ada5.png](attachment:ada5.png)
From the plot, n_estimators as 2000 appears to be best. However, when I calculated the RMSE using n_estimators as 1500 and 2000, I found that 1500 actually gave the lowest RMSE. I created an AdaBoostRegressor model using n_estimators as 1500, max_depth as 13, and learning_rate as 1.5, rounded the predictions to the nearest integer, and found that this model had an RMSE of 0.658.

### Gradient Boosting
*By Anastasia Wei* <br>

I used the `GradientBoostingRegressor` from the `sklearn.ensemble` module with the Huber loss function for the model. First, I used 5 fold cross validation to get a sense of the number of estimators needed to reach a stable cross validation RMSE and found that around 1500 trees would be sufficient.
![image.png](attachment:image.png)
With this information, I started with a coarse grid search with 4 fold cross validation (to speed up the training process) using `RandomizedSearchCV` with 50 iterations. 

```python
# Parameter grid for gradient boosting model
grid = {}
grid['n_estimators'] = [1200, 1400, 1600, 1800]
grid['learning_rate'] = [0.1, 0.2, 0.3]
grid['max_depth'] = [8, 10, 12, 14]
grid['subsample'] = [0.4, 0.6, 0.8, 1]
```

I found a best cross validation RMSE of 0.6237 using `'subsample': 0.8, 'n_estimators': 1400, 'max_depth': 8, 'learning_rate': 0.1` and test RMSE of 0.6777. I visualized the parameters as follows: 
![image.png](attachment:image.png)

Then I implemented a finer tuning, zooming in on the parameter space with the lowest k-fold RMSE. The search grid is as follows: 

```python
grid['n_estimators'] = [1300, 1350, 1400, 1450, 1500]
grid['learning_rate'] = [0.8, 0.1, 0.15]
grid['max_depth'] = [8, 9, 10]
grid['subsample'] = [0.7, 0.8, 0.9, 1]
```

I found a best cross validation RMSE of 0.6204 using `'subsample': 0.8, 'n_estimators': 1300, 'max_depth': 9, 'learning_rate': 0.1` and a test RMSE of 0.6713. I visualized the parameters as follows: 
![image.png](attachment:image.png)

After this fine tuning, I did some manual tuning by making minor changes to the optimal parameters and to combinations of parameters that are correlated (e.g. n_estimators and learning_rate). However, the performance of the model did not change significantly. The only change that improved the model significantly was increasing the subsample rate from 0.8 to 0.85. This resulted in a test RMSE of 0.6587.
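
The randomized search step can be sketched as follows. The data are synthetic and the grid `small_grid` is deliberately tiny so the example runs in seconds; only the overall pattern (Huber loss, 4 fold cross validation, `RandomizedSearchCV`) mirrors the description above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=1)

# Small illustrative search space (not the grid used in the report)
small_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(loss="huber", random_state=1),
    small_grid,
    n_iter=5,                                # number of sampled combinations
    cv=4,
    scoring="neg_root_mean_squared_error",
    random_state=1,
)
search.fit(X, y)

cv_rmse = -search.best_score_                # flip the sign back to a positive RMSE
best_params = search.best_params_
```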

### CatBoost and LightGBM
*By Anastasia Wei*

I fit the data with a `CatBoostRegressor` model from the `catboost` package using the default parameters, and it gave a train RMSE of 0.4761 and a test RMSE of 0.7131. Then I also fit the data with an `LGBMRegressor` model from the `lightgbm` package using the default parameters. It yielded a train RMSE of 0.5211 and a test RMSE of 0.7259. 

We decided to include CatBoost and LightGBM models to (1) increase the diversity of the ensemble models and (2) see whether they will perform better than our other base models. However, these two models are additional base models and were included more for experimentation than the purpose of rigorous optimization (thus, we did not continue to optimize the CatBoost and LightGBM hyperparameters beyond their defaults). 

## Model Ensemble 

### Voting ensemble
*By Anastasia Wei* <br>

I used the `VotingRegressor` from `sklearn.ensemble` to create the voting ensemble model. I chose to only use the XGBoost, AdaBoost, Random Forest, Gradient Boosting, and Bagging models in this ensemble, as the other base models (CatBoost, LightGBM, Lasso, Ridge, MARS) had higher test RMSEs (> 0.7). 

The voting ensemble model yielded a test RMSE of 0.651 (lower than each of the individual models). 


See below for a plot of the actual vs predicted response. 
![image.png](attachment:image.png)
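
A minimal sketch of a voting ensemble is shown below with two stand-in base models on synthetic data (the real ensemble used five tuned models). It also checks that an unweighted `VotingRegressor` prediction is simply the average of its members' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Two stand-in base models; names and settings are illustrative
voter = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=50, random_state=1)),
    ("gb", GradientBoostingRegressor(random_state=1)),
])
voter.fit(X_train, y_train)

vote_pred = voter.predict(X_test)

# Manual average of the fitted members' predictions
manual_avg = np.mean([est.predict(X_test) for est in voter.estimators_], axis=0)
```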

### Stacking ensemble
*By Anastasia Wei* <br>

I first used each of the models mentioned above to generate predictions on the train and test datasets. These predictions formed the new train and test datasets on which the metamodels were fit. See the following subsections for the individual metamodels.

#### Linear Regression Metamodel
Using `LinearRegression` from `sklearn.linear_model` on the new train data set, I obtained a test RMSE of 0.6714.
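
The sketch below shows one common way to build the metamodel datasets and fit a linear metamodel. It is an assumption-laden illustration rather than the code used here: it uses two stand-in base models on synthetic data, and it builds the train-side features from out-of-fold predictions via `cross_val_predict`, which may differ from how the new train data set was actually constructed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Stand-in base models (hypothetical settings)
base_models = {
    "lasso": Lasso(alpha=0.1),
    "rf": RandomForestRegressor(n_estimators=50, random_state=1),
}

# Train-side features: out-of-fold predictions, one column per base model
meta_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models.values()]
)
# Test-side features: predictions from base models refit on the full train split
meta_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models.values()]
)

metamodel = LinearRegression().fit(meta_train, y_train)
stacked_pred = metamodel.predict(meta_test)
```

Using out-of-fold predictions keeps the metamodel from seeing base-model predictions made on rows those base models were trained on.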

#### Lasso Metamodel
Using regularization parameter alphas in the range of `10**np.linspace(0, -3, 300)*0.5`, I tuned the model using LassoCV. With the optimal alpha of 0.0005, I obtained a test RMSE of 0.6598.

#### MARS Metamodel
I fit a MARS model using `Earth` from `pyearth` package with degree 1 to avoid overfitting. This gave a test RMSE of 0.6593.

#### Random Forest Metamodel
I tuned a random forest metamodel using the following parameter space with 4 fold cross validation with `RandomizedSearchCV` using 100 iterations. 

```python
# Parameter grid for the random forest metamodel
param_grid = {'n_estimators': [100],
              'max_depth': [8, 10, 12, 14],
              'max_leaf_nodes': [100, 500, 1000],
              'max_features': [2, 4, 6, 8],
              'max_samples': [1000, 2000, 3000]}
```

This search yielded the following optimal parameters: `'n_estimators': 100, 'max_samples': 3000, 'max_leaf_nodes': 1000, 'max_features': 8, 'max_depth': 12`. The test RMSE is 0.6482 using these optimal parameters.

#### XGBoost Metamodel

I tuned an XGBoost metamodel using the following parameter space with 4 fold cross validation with `RandomizedSearchCV` using 100 iterations. The parameter grid for this metamodel is detailed below:

```python
# Parameter grid for the XGBoost metamodel
param_grid = {'max_depth': [3, 4, 5, 6],
              'learning_rate': [0.008, 0.01, 0.025, 0.05],
              'reg_lambda': [0, 1, 5],
              'n_estimators': [500, 600, 800, 1000],
              'gamma': [0, 3, 5, 10],
              'subsample': [0.5, 0.75, 1.0],
              'colsample_bytree': [0.5, 0.75, 1.0]}
```

This search yielded the following optimal parameters: `'subsample': 0.75, 'reg_lambda': 0, 'n_estimators': 1000, 'max_depth': 6, 'learning_rate': 0.008, 'gamma': 0, 'colsample_bytree': 1`. The metamodel's test RMSE is 0.6754 using these optimal parameters.

### Ensemble of ensembled models

Using the above 5 metamodels, I built an ensemble of these metamodels using voting regression (i.e., simply averaging the predictions). I then tuned the cutoff for rounding and found an optimal rounding threshold of 0.6869. Predictions with a fractional part greater than the threshold are rounded up, and all others are rounded down. This resulted in a final RMSE of 0.6421, which is the lowest RMSE that we've achieved so far. 

See below for a plot of the actual vs predicted response for this final model.
![image.png](attachment:image.png)

## Limitations of the model with regard to prediction

If we had more time and resources, the following could be implemented to improve our model's predictions: <br>
The CatBoost and LightGBM models could be tuned to perform better. <br>
The ensemble models could perform better if we predicted the residuals of each model and subtracted them from that model's test predictions, tuned the rounding threshold for each model, and then used these residual-corrected predictions as the new train and test data. <br>
We could also build more ensemble models that are more distinct from each other to combine in the ensemble of metamodels.

## Future Work

Future work on wine quality prediction may benefit from more recent data. Our dataset was compiled between 2004 and 2007, but new wines have been produced each year in the Vinho Verde region under differing weather patterns and soil conditions that affect their chemical properties, taste, and quality. Modeling quality with more recent wines would let researchers study the wines most likely to be produced, bought, and consumed today (though many wines tested in this study are likely still being aged and consumed, just to a smaller extent). One avenue for future inquiry is therefore to study the quality and chemical composition of wines from the last decade, especially amid global warming, which affects weather patterns, soil qualities, and ultimately the chemical composition of wines in regions like Vinho Verde [3].

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? You may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

Add details of each team member's contribution, other than the models contributed, in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 25%;">
       <col span="1" style="width: 40%;">
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Individual Model</th>
    <th>Work other than individual model</th>    
    <th>Details of work other than individual model</th>
  </tr>
  <tr>
    <td>Kaitlyn Hung</td>
    <td>Decision Tree & AdaBoost</td>
    <td>EDA</td>    
    <td>EDA and visualizations of response and predictor distributions and relationships. Data preparation and approach.</td>
  </tr>
  <tr>
    <td>Amy Wang</td>
    <td>Random Forest & XGBoost</td>
    <td>Data Preparation</td>    
    <td>XXX</td>
  </tr>
    <tr>
    <td>Anastasia Wei</td>
    <td>Ridge and Lasso Regression & Gradient Boosting</td>
    <td>Ensembling</td>    
    <td>Ensembling the metamodels, making the intercept model, and some EDA</td>
  </tr>
    <tr>
    <td>Lila Wells</td>
    <td>MARS and Bagging Decision Trees</td>
    <td>Presentation Assets & some EDA</td>    
    <td>Designed all presentation assets for the project presentation component of the study. Some relevant EDA.</td> 
  </tr>
</table>
</body>
</html>

## References {-}

[1] Vinho Verde, "About Vinho Verde." https://www.vinhoverde.pt/en/about-vinho-verde. Supplied as additional material. 

[2] Cortez et al., "Modeling wine preferences by data mining from physicochemical properties." Decision Support Systems (2009). https://www.sciencedirect.com/science/article/abs/pii/S0167923609001377?via%3Dihub. Supplied as additional material. 

[3] Gambetta et al., "Global warming and wine quality: are we close to the tipping point?" OENO One: Vine and Wine Open Access Journal (2021). https://oeno-one.eu/article/view/4774. Supplied as additional material. 

## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.