# Predict Red Wine Quality

![red_wine](images/red_wine.jpg)

## Introduction to the dataset
### Start with reading the file

In [1]:
# Import the library needed to read the file


In [2]:
# Read the file using the relative path


### Take a look at the dataset and check its dimensions

In [3]:
# Look at the first five rows


In [4]:
# Look at the five last rows


In [5]:
# Check the dimension of the dataset
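
As a minimal sketch of the three inspection steps above, the snippet below uses a small made-up DataFrame named `toy` (a hypothetical stand-in, not the wine file) so that it runs on its own. The same `head()`, `tail()` and `shape` calls should apply to the DataFrame you read from the file.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine file (values are made up).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.5, 10.1],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.40, 3.30, 3.35],
    "quality": [5, 5, 6, 6, 7, 5, 6],
})

first_rows = toy.head()      # first five rows by default
last_rows = toy.tail()       # last five rows by default
n_rows, n_cols = toy.shape   # (number of rows, number of columns)

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
```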


### Understand the data

* __fixed_acidity__: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
* __volatile_acidity__: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
* __citric_acid__: found in small quantities, citric acid can add 'freshness' and flavor to wines
* __residual_sugar__: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter. Wines with more than 45 grams/liter are considered sweet.
* __chlorides__: the amount of salt in the wine. (chlorides > 0.06 is considered salty)
* __free_sulfur_dioxide__: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
* __total_sulfur_dioxide__: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
* __density__: wine density (density > 1 means fermentation is not over)
* __pH__: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. (The pH scale is logarithmic, which means pH=6 is 10 times more acidic than pH=7.)
* __sulphates__: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
* __alcohol__: the percent alcohol content of the wine. (alcohol < 10 is considered weak)
* __quality__: output variable (based on sensory data, score between 0 and 10)
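
The logarithmic pH scale mentioned in the list above can be checked with a few lines of arithmetic. This sketch assumes the usual textbook definition, pH = -log10 of the hydrogen ion concentration; the helper name `hydrogen_ion_concentration` is made up for illustration.

```python
import math

def hydrogen_ion_concentration(ph):
    """Return the hydrogen ion concentration (mol/L) implied by a pH value."""
    return 10 ** (-ph)

# Compare a solution at pH 6 with one at pH 7.
ratio = hydrogen_ion_concentration(6) / hydrogen_ion_concentration(7)

# The ratio is 10 up to floating-point rounding: one pH unit is a tenfold change.
print(math.isclose(ratio, 10.0))
```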

#### What are the features?

In [6]:
# print the name of the eleven features


#### What is the response?

In [7]:
# print the name of the response


### Let's dig a bit more into the numbers

In [8]:
# Use the Pandas built-in function describe() on the DataFrame.


#### What kind of additional information can be retrieved from the DataFrame?


### How is the quality distribution of the red wines?

In [9]:
# Import the library needed to plot


In [10]:
# Plot a histogram of the red wine quality (https://seaborn.pydata.org/generated/seaborn.distplot.html)


#### What conclusions can be drawn from the plot?



## Visualize the data
In the previous notebook, we used scatterplots to visualise the relationship between the features and the response. We could do it again for this dataset, but since we have many more features, a correlation plot may be easier to work with.

__Correlation__ is a way to measure how strong the relationship between two variables is. The correlation value can range between -1 and 1:

* -1: Perfect negative correlation
* 1: Perfect positive correlation
* 0: No linear relationship between the variables
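
To make the three reference values concrete, here is a small sketch on made-up numbers (the column names are hypothetical). It relies on `DataFrame.corr()` using the Pearson coefficient by default, which measures linear association only, so the `curve` column gets a correlation of 0 with `x` even though it is fully determined by `x`.

```python
import pandas as pd

# Made-up columns chosen to hit the three reference values.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],     # rises linearly with x
    "down": [10, 8, 6, 4, 2],   # falls linearly with x
    "curve": [4, 1, 0, 1, 4],   # (x - 3) squared: related to x, but not linearly
})

corr = toy.corr()  # pairwise Pearson correlation between all columns
print(corr["x"])
```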

In [11]:
# Use the built-in function corr() provided by Pandas to compute the correlation between all the variables


In [12]:
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,12)) # This line is used to increase the size of the heatmap figure

# Use a heatmap to visualize the result in the correlation matrix (https://seaborn.pydata.org/generated/seaborn.heatmap.html)



(<Figure size 1200x1200 with 1 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x11f24b1d0>)

* __annot__: if True, write the data value in each cell
* __linewidths__: width of the lines that will divide each cell.
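
A minimal sketch of how these two parameters might be passed to `seaborn.heatmap`. The DataFrame `toy` and its values are made up for illustration; in the exercise you would pass the correlation matrix computed from the wine data instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data (values are made up, not taken from the wine file).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.5],
    "volatile_acidity": [0.70, 0.88, 0.40, 0.28, 0.35, 0.66],
    "quality": [5, 5, 6, 6, 7, 5],
})

fig, ax = plt.subplots(figsize=(6, 6))
# annot=True writes each correlation value into its cell;
# linewidths sets the width of the lines separating the cells.
sns.heatmap(toy.corr(), annot=True, linewidths=0.5, ax=ax)
```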

#### Do we have to include all features in the following steps? Which features are more correlated with the response?

## Exercise
Now that we have a better understanding of the dataset, it is time to prepare the input, select a model, train it, and then predict the quality of red wine on a scale of 0-10.

## Prepare feature matrix "X" and response vector "y"

In [13]:
# create a Python list of the features you want to include


In [14]:
# use the list to select a subset of the original DataFrame


In [15]:
# print the first 5 rows


In [16]:
# select the response vector from the DataFrame


In [17]:
# print the first 5 values
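
One possible way to build `X` and `y`, sketched on a made-up DataFrame named `toy`. The two features in `feature_cols` are an arbitrary choice for illustration, not a recommendation; pick yours based on the correlation matrix.

```python
import pandas as pd

# Hypothetical toy data (values are made up, not taken from the wine file).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.5],
    "sulphates": [0.56, 0.68, 0.65, 0.58, 0.75, 0.56],
    "volatile_acidity": [0.70, 0.88, 0.40, 0.28, 0.35, 0.66],
    "quality": [5, 5, 6, 6, 7, 5],
})

feature_cols = ["alcohol", "sulphates"]  # arbitrary subset for illustration

X = toy[feature_cols]   # a list of column names selects a DataFrame
y = toy["quality"]      # a single column name selects a Series

print(X.head())
print(y.head())
```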


## Follow this procedure

1. Split the dataset into two pieces: a __training set__ and a __testing set__.
2. Train the model on the __training set__.
3. Test the model on the __testing set__, and evaluate how well we did.

### Step 1: Split the dataset into two pieces: a __training set__ and a __testing set__

In [18]:
# print the shapes of the new X objects


In [19]:
# print the shapes of the new y objects
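
A sketch of the split itself, assuming scikit-learn's `train_test_split` is the tool used here (the notebook does not name it). The arrays are randomly generated placeholders, and `random_state=1` is an arbitrary seed chosen only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features, integer scores from 3 to 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(3, 9, size=100)

# By default, 25% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```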


### Step 2: Train the model with linear regression on the training set

__Step 1:__ Import the class (model) you want to use

__Step 2:__ "Instantiate" the "estimator"

__Step 3:__ Fit the model with the training data (aka "model training")
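
The three sub-steps above can be sketched as follows. The training data is synthetic and noise-free, generated from a known linear rule, so that the fit can be checked by eye; the variable name `linreg` is an arbitrary choice.

```python
import numpy as np

# Step 1: import the class
from sklearn.linear_model import LinearRegression

# Synthetic training data following y = 5 + 2*x0 - 1*x1 exactly.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 5.0 + 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1]

# Step 2: instantiate the estimator with default settings
linreg = LinearRegression()

# Step 3: fit the model; this learns the intercept and coefficients
linreg.fit(X_train, y_train)
```

After fitting, the learned values are available as `linreg.intercept_` and `linreg.coef_`.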

### Interpreting the model coefficients

In [20]:
# Print the intercept and coefficients


In [21]:
# Pair the feature names with the coefficients
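
One way to pair names with coefficients is Python's built-in `zip`, sketched here on made-up data. The `quality` column is constructed from a known rule, so the fitted coefficients should come back close to 0.4 for `alcohol` and 2.0 for `sulphates`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; quality is built from a known linear rule.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 9.5, 11.5],
    "sulphates": [0.50, 0.60, 0.55, 0.70, 0.65, 0.50],
})
toy["quality"] = 1.0 + 0.4 * toy["alcohol"] + 2.0 * toy["sulphates"]

feature_cols = ["alcohol", "sulphates"]
linreg = LinearRegression()
linreg.fit(toy[feature_cols], toy["quality"])

# coef_ is ordered like the columns of X, so zip lines the two up.
paired = dict(zip(feature_cols, linreg.coef_))
print(paired)
```

Each coefficient is read as the change in predicted quality for a one-unit increase in that feature, with the other features held fixed.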


__Step 4:__ Predict the response (quality) on the testing set

### Step 3: Test the model on the testing set, and evaluate how well we did

In [22]:
# Import the library needed to evaluate the testing set


In [23]:
# Compute the MAE


In [24]:
# Compute the MSE


In [25]:
# Compute the RMSE
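
The three metrics can be sketched on a handful of made-up values. `y_true` and `y_pred` below are hypothetical; in the exercise they would be the testing-set response and the model's predictions.

```python
import numpy as np
from sklearn import metrics

# Made-up true scores and predictions; the absolute errors are 0.5, 0, 1, 0.5.
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.0, 5.5]

mae = metrics.mean_absolute_error(y_true, y_pred)   # mean of |error|: 0.5
mse = metrics.mean_squared_error(y_true, y_pred)    # mean of error squared: 0.375
rmse = np.sqrt(mse)                                 # square root of MSE, about 0.612

print(mae, mse, rmse)
```

RMSE is in the same units as the response, which makes it easier to interpret than MSE.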


#### What conclusion can be drawn from the evaluation metrics?



## Repeat the procedure, but this time with cross-validation

In [26]:
# Import the library needed for cross-validation


Compute the RMSE using cross-validation

In [27]:
# Compute the negative MSE


In [28]:
# fix the sign of MSE scores


In [29]:
# convert from MSE to RMSE


In [30]:
# calculate the average RMSE
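
The four cells above can be sketched in one go. This assumes scikit-learn's `cross_val_score` with the `"neg_mean_squared_error"` scorer; the data is synthetic, and `cv=10` is an arbitrary choice of fold count.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=100)

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse_scores = cross_val_score(
    LinearRegression(), X, y, cv=10, scoring="neg_mean_squared_error"
)

mse_scores = -neg_mse_scores       # fix the sign
rmse_scores = np.sqrt(mse_scores)  # convert each fold's MSE to RMSE
mean_rmse = rmse_scores.mean()     # average RMSE across the folds

print(mean_rmse)
```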


## Want to explore more?

Here are some suggestions:

* Try with a different combination of features
* Try other datasets
* Try some feature engineering (create your own features)
* Try new models