# The Finest Wine

By now, you've become an accomplished chef, well-known worldwide. At this stage, you've honed your craft almost to perfection. The only remaining step is to source the finest ingredients for your restaurant. Your first focus is on testing wine quality. You've dispatched a team of researchers to gather data about various wines, while your task is to develop a model that predicts wine quality based on its distinctive characteristics.

* Task: Using the dataset below, find the model that best predicts `quality` in `y_test`.

Most of the code is given to you. Your task is to try out different algorithms using `x_train` and `y_train` and find the most accurate one. Then, use that model to generate predictions using `x_test`, save those predictions to a CSV, and email those predictions to emilio@northwestern.edu.

You don't have access to `y_test`. Emilio will calculate your predictive accuracy on `y_test`.

The number of points that you get for this puzzle depends on the accuracy of your model, as follows:
- $R^2$ lower than 0.2: 0 points
- $R^2$ between 0.2 and 0.4: 50 points
- $R^2$ between 0.41 and 0.5: 75 points
- $R^2$ above 0.5: 100 points
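
For reference, $R^2$ compares the model's squared error with that of always predicting the mean: $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand and with `r2_score` on a few made-up numbers (the values are illustrative only and do not come from the wine data).

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, made up purely for illustration (not from the wine data).
y_true = np.array([5.0, 6.0, 7.0, 5.0, 6.0])
y_hat = np.array([5.5, 6.0, 6.5, 5.0, 6.5])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# scikit-learn's r2_score computes the same quantity.
r2_sklearn = r2_score(y_true, y_hat)

# A model that always predicts the mean of y_true scores R^2 = 0.
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_baseline)
```

An $R^2$ of 0 therefore means "no better than predicting the average quality", and negative values are possible for models that do worse than that.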

## Importing libraries and setting options

In [None]:
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor
import pandas as pd

pd.set_option('display.max_columns', None)

## Reading data

In [None]:
x_train = pd.read_csv('https://raw.githubusercontent.com/emiliolehoucq/trainings/main/data/x_train.csv', index_col=0)
x_test = pd.read_csv('https://raw.githubusercontent.com/emiliolehoucq/trainings/main/data/x_test.csv', index_col=0)
y_train = pd.read_csv('https://raw.githubusercontent.com/emiliolehoucq/trainings/main/data/y_train.csv', index_col=0)

## Exploratory data analysis

In [None]:
y_train['quality'].plot(kind='hist')

In [None]:
y_train.head()

In [None]:
# Correlation of each feature with wine quality
x_train.corrwith(y_train['quality'])

In [None]:
for col in x_train.columns:
    plt.scatter(x_train[col], y_train, alpha=0.1)
    plt.title(f'{col} vs. quality')
    plt.show()

In [None]:
x_train.corr()

## Modeling / trying out different algorithms

You may want to take a look at the algorithms available for [supervised learning in Scikit-Learn](https://scikit-learn.org/stable/supervised_learning.html). You may also find [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) useful.

This is your task!
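
As one possible starting point, here is a minimal sketch of how `Pipeline` and `GridSearchCV` can be combined, with a held-out split to estimate $R^2$ since `y_test` is not available. Everything specific in it is an assumption made for illustration: the helper name `tune_and_validate`, the choice of `RandomForestRegressor`, the grid values, and the synthetic `X_demo` / `y_demo` data that stand in for the real training set. It is not meant as the intended solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_and_validate(X, y, random_state=0):
    """Hypothetical helper: grid-search a pipeline, then score it on a held-out split."""
    # Keep 20% of the training data aside to estimate out-of-sample R^2.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # Scaling and the model live in one pipeline, so the scaler is
    # re-fit inside every cross-validation fold.
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestRegressor(random_state=random_state)),
    ])

    # Illustrative grid only; '<step name>__<parameter>' addresses pipeline steps.
    param_grid = {
        'model__n_estimators': [50, 100],
        'model__max_depth': [None, 5],
    }

    search = GridSearchCV(pipe, param_grid, scoring='r2', cv=3)
    search.fit(X_fit, y_fit)

    # search.predict uses the best pipeline, refit on all of X_fit.
    val_r2 = r2_score(y_val, search.predict(X_val))
    return search, val_r2


# Synthetic stand-in data so the sketch runs on its own (not the wine data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y_demo = 2 * X_demo['a'] - X_demo['b'] + rng.normal(scale=0.1, size=200)

search, val_r2 = tune_and_validate(X_demo, y_demo)
print(search.best_params_, round(val_r2, 3))
```

In this notebook, the equivalent call would be `tune_and_validate(x_train, y_train['quality'])`; passing the `quality` column as a Series rather than the one-column DataFrame avoids scikit-learn's column-vector warning. You could then set `best_model = search.best_estimator_`, or refit the winning configuration on all of the training data before predicting.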

## Getting and saving predictions from best model

Replace `<name>` with your group name. **Make sure not to change anything else about the format of the CSV** (column names, index); otherwise we'll have trouble evaluating your submission and it will be counted as invalid. You may also need to replace `best_model` with the variable that holds your best model.

In [None]:
y_pred = best_model.predict(x_test)
pd.DataFrame({'predictions': y_pred}, index=x_test.index).to_csv('<name>.csv')
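
If you want to double-check the file format before emailing, a round-trip check along these lines may help. It is only a sketch: `x_demo` and `y_pred_demo` are made-up stand-ins for `x_test` and `y_pred`, and an in-memory buffer replaces the real `<name>.csv` file.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for x_test and y_pred (illustration only).
x_demo = pd.DataFrame({'feature': [0.1, 0.2, 0.3]}, index=[12, 47, 305])
y_pred_demo = np.array([5.1, 5.8, 6.3])

# Write the predictions in the same way as the cell above, but to memory.
buffer = io.StringIO()
pd.DataFrame({'predictions': y_pred_demo}, index=x_demo.index).to_csv(buffer)

# Read the CSV back and check the properties the format relies on.
buffer.seek(0)
submission = pd.read_csv(buffer, index_col=0)

assert list(submission.columns) == ['predictions']  # single, correctly named column
assert submission.index.equals(x_demo.index)        # same index as the test features
assert submission['predictions'].notna().all()      # no missing predictions
```

With the real data, you would read `<name>.csv` back with `pd.read_csv('<name>.csv', index_col=0)` and compare its index against `x_test.index`.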