# The Taste of Machine Learning

How do you know if a wine is good or bad? A sommelier might tell you that the chemical properties of wine affect the taste and the quality of the wine. What if we could test this using machine learning?

**Please note**: This exercise is based on a Jupyter notebook, an interactive environment for writing and running code, and runs in Python. To get familiar with working in Jupyter notebooks, see our 2-minute [JupyterLab Tutorial](https://www.university4industry.com/player/chapter/jupyterlab-tutorial).

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer in the Answers.txt file. Please answer the question <b>before</b> continuing through the notebook. You can <b>double click on Answers.txt</b> in the Left Sidebar now to open it in a new tab. As you go through the notebook, navigate between the tabs to answer questions.
</div>

## Table of contents

1. [Introduction](#1.-Introduction)

2. [Get familiar with the data](#2.-Get-familiar-with-the-data)

3. [Further explore the data](#3.-Further-explore-the-data)

4. [Prepare the data](#4.-Prepare-the-data)

5. [Train and evaluate a model](#5.-Train-and-evaluate-a-model)

6. [Sources](#Sources)

## 1. Introduction

[[ go back to the top ]](#Table-of-contents)

In this challenge, we will build a predictive model to answer the question **"What makes a wine good or bad?"**. We will use a real data set of 1599 red Vinho Verde wine samples from northern Portugal. Our data includes a number of variables that come from chemical tests and a variable `Quality`, which is a score from 1-10 given by experts. Our hypothesis is that chemical properties contribute to wine quality and therefore make a wine "good" or "bad", and we want to create a model that can predict the quality of previously unseen wines.

To do so, we will first explore our data set and create some visualizations to gather insights about which variables in the data set might contribute to wine quality. We will then create a model using a part of the data set and evaluate it to determine how effective the model would be at predicting wine quality when given previously unseen wines.

By the end of this tutorial, you will have a basic understanding of how to analyze a large data set and extract insights from it. You will also learn how important the training data set and its quality are for creating a predictive machine learning model.

## 2. Get familiar with the data

[[ go back to the top ]](#Table-of-contents)

Before we start exploring, we need to import some libraries that will help us with our calculations and visualizations. 

*Remember to press ***Shift+Enter*** to run each code cell.*

In [None]:
# Import data analysis libraries
import pandas as pd
import numpy as np
import random as rnd

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Import logistic regression, data splitting, and evaluation tools
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn import metrics

# Note: this cell and some of the cells below produce no visible output
# The successful execution of a code cell is indicated by the number in the brackets [ ] on the left

Now, let's import our data set `winequality-red.csv` and take a look at it:

In [None]:
# Load data from file into a new object called "wine_data"
wine_data = pd.read_csv('winequality-red.csv', sep=';')

# Show first 15 rows of data set
wine_data.head(15)

Our data set includes 11 variables (the predictors of wine quality):\
<b>Fixed acidity</b> (tartaric acid - g / dm^3)\
<b>Volatile acidity</b> (acetic acid - g / dm^3)\
<b>Citric acid</b> (g / dm^3)\
<b>Residual sugar</b> (g / dm^3)\
<b>Chlorides</b> (sodium chloride - g / dm^3)\
<b>Free sulfur dioxide</b> (mg / dm^3)\
<b>Total sulfur dioxide</b> (mg / dm^3)\
<b>Density</b> (g / cm^3)\
<b>pH</b>\
<b>Sulfates</b> (potassium sulfate - g / dm^3)\
<b>Alcohol</b> (% by volume)

There is also one variable based on sensory data (the target variable):\
<b>Quality</b> (score from 1 to 10)
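
Before plotting anything, a few descriptive statistics can give a first feel for the data. The sketch below uses a tiny made-up table (`toy_wines`, a hypothetical stand-in with invented values) rather than the real file, so that it runs on its own:

```python
import pandas as pd

# Toy stand-in for wine_data: the values are made up for illustration only
toy_wines = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.8],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.35],
    "quality": [5, 5, 6, 6, 7],
})

# Number of rows and columns
print(toy_wines.shape)

# Count, mean, standard deviation, minimum, quartiles, and maximum per column
summary = toy_wines.describe()
print(summary)

# Number of missing values in each column
print(toy_wines.isnull().sum())
```

In the notebook itself, the same calls should work on the real data, for example `wine_data.describe()`.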


## 3. Further explore the data

[[ go back to the top ]](#Table-of-contents)

In addition to descriptive statistics, **data visualization** can be a powerful tool. Visualizations let us see trends, patterns, and relationships between variables, because our brains are very good at spotting patterns in pictures.

As our target variable, or the variable we want to predict with our model, is `Quality`, let's start by visualizing the quality of wines in our data set.

In [None]:
# Create histogram of the target variable (Quality)
sns.countplot(x='quality', data=wine_data)
plt.show()

<div class="alert alert-block alert-info">Pause! Answer <b>Question 1</b> in the Answers.txt file. 
    
Which 2 wine qualities are most common in our data set? </div>

We can also create a heat map to visualize which variables correlate with each other.

In [None]:
plt.subplots(figsize=(15, 10))
sns.heatmap(wine_data.corr(), annot = True, cmap = "coolwarm")
plt.show()

The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation (e.g., `quality` tends to go up when `alcohol` goes up). When the coefficient is close to –1, there is a strong negative correlation (e.g., you can see a small negative correlation between `volatile acidity` and `quality`). Finally, coefficients close to zero mean that there is little or no linear correlation.
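
To make the coefficient less of a black box, here is a minimal sketch that computes the Pearson correlation by hand and compares it with the pandas result. We assume here that the heat map uses the pandas default (Pearson) method; the table `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two columns of wine_data: made-up values
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 8.9, 6.3],
    "pH": [3.51, 3.20, 3.16, 3.25, 3.60],
})
x = toy["fixed acidity"].to_numpy()
y = toy["pH"].to_numpy()

# Pearson r: sum of products of the centered values, divided by
# the square root of the product of their sums of squares
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same coefficient computed by pandas
r_pandas = toy["fixed acidity"].corr(toy["pH"])

print(round(r_manual, 3), round(r_pandas, 3))
```

Both numbers should agree, and for these toy values the coefficient is negative: higher fixed acidity goes together with lower pH.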

There are a lot of numbers in this heat map so let's look a little closer at two specific variables: `pH` and `fixed acidity`.

In [None]:
#Visualize the correlation between pH and fixed acidity

#Create a new dataframe containing only pH and fixed acidity columns 
fixedAcidity_pH = wine_data[['pH', 'fixed acidity']]

#Initialize a joint-grid with the dataframe, using seaborn library
#(recent seaborn versions use `height` instead of the older `size` argument)
gridA = sns.JointGrid(x="fixed acidity", y="pH", data=fixedAcidity_pH, height=6)

#Draws a regression plot in the grid 
gridA = gridA.plot_joint(sns.regplot, scatter_kws={"s": 10})

#Draws a distribution plot in the same grid
#(histplot replaces the deprecated distplot)
gridA = gridA.plot_marginals(sns.histplot, kde=True)

This scatter plot shows that, as fixed acidity levels increase, pH levels drop. Makes sense, doesn't it? A lower pH level indicates higher acidity.

Let's try another one: alcohol and quality. Since there are several discrete categories of `quality`, we can use a bar graph to visualize this.

In [None]:
#Visualize the correlation between alcohol and quality

#Create a new dataframe containing only alcohol and quality columns 
alcohol_quality = wine_data[['alcohol', 'quality']]

fig, axs = plt.subplots(ncols=1,figsize=(10,6))
sns.barplot(x='quality', y='alcohol', data=alcohol_quality, ax=axs)
plt.title('quality VS alcohol')

plt.tight_layout()
plt.show()

<div class="alert alert-block alert-info">Pause! Answer <b>Question 2</b> in the Answers.txt file. 
    
Describe the relationship between the variables quality and alcohol in 1 sentence.</div>

## 4. Prepare the data

[[ go back to the top ]](#Table-of-contents)

Since we are trying to create a model that classifies wines, let's adapt our data set to reflect this and divide it into 3 classes:\
<b>Poor</b>: all wines rated 4 or lower\
<b>Average</b>: wines rated 5 and 6\
<b>Excellent</b>: wines rated 7 or higher

We can represent these 3 categories by numbers 1 (Poor), 2 (Average), and 3 (Excellent) in a 13th variable called `review`. 

In [None]:
# Add a `review` column derived from `quality`
reviews = []
for i in wine_data['quality']:
    if i <= 4:
        reviews.append('1')  # Poor: rated 4 or lower
    elif i <= 6:
        reviews.append('2')  # Average: rated 5 or 6
    else:
        reviews.append('3')  # Excellent: rated 7 or higher
wine_data['review'] = reviews
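
As an aside, the three classes described above (Poor: 4 or lower, Average: 5 and 6, Excellent: 7 or higher) can also be assigned without a loop. This is only a sketch of an alternative using `pd.cut`, not the approach this notebook relies on; `toy_quality` is a made-up series.

```python
import pandas as pd

# Made-up quality scores for illustration
toy_quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bins include their right edge: (0, 4] -> '1', (4, 6] -> '2', (6, 10] -> '3'
toy_review = pd.cut(
    toy_quality, bins=[0, 4, 6, 10], labels=["1", "2", "3"]
).astype(str)

print(toy_review.tolist())  # ['1', '1', '2', '2', '3', '3']
```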

In [None]:
#Show first 5 rows of data set with 13th variable added
wine_data.head(5)

Now, let's look at the count of each type of review (Poor (1), Average (2), Excellent (3)) in our data set.

In [None]:
# Create histogram of the target variable (Review)
sns.countplot(x='review', data=wine_data)
plt.show()

<div class="alert alert-block alert-info">Pause! Answer <b>Question 3</b> in the Answers.txt file. 
    
What do you notice about how many wines there are of each review type? Why might this be a problem for our machine learning model? </div>

## 5. Train and evaluate a model

[[ go back to the top ]](#Table-of-contents)

Before we train and test our model we need to split our data set into two groups: a training set (80%), and a test set (20%). The training set will be used to build our machine learning model, and the test set will be used to see how well the model performs on new, unseen data.

In [None]:
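#Define review variable as dependent variable (y) and the chemical variables as independent variables (x)
#Drop `quality` as well: `review` is derived from it, so keeping it would give the answer away to the model
y = wine_data.review
x = wine_data.drop(['quality', 'review'], axis=1)

#Split data into training set (80%) and test set (20%)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=3)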
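
One optional refinement, not used in this notebook: when some classes are rare, a random split can leave very few of them in the test set. The sketch below shows how the `stratify` argument of `train_test_split` keeps the class proportions the same in both sets. The data (`toy_x`, `toy_y`) and its class counts are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced stand-in for the review column (made-up counts)
toy_y = pd.Series(["1"] * 10 + ["2"] * 80 + ["3"] * 10)
toy_x = pd.DataFrame({"alcohol": range(100)})

# stratify=toy_y keeps the 10% / 80% / 10% class mix in both sets
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=3
)

# Class counts in the 20-sample test set
print(y_te.value_counts().sort_index())
```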

There are a number of models that we can use. Since we want to solve a classification problem (is a wine poor, average, or excellent?), we can use the logistic regression algorithm.

In [None]:
# Fit training data to model
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)

In [None]:
#Test model
y_pred = classifier.predict(x_test)

In [None]:
# Print accuracy score of model (share of test wines whose review was predicted correctly)
lr_acc_score = metrics.accuracy_score(y_test, y_pred)
print('The accuracy of the model for the test data is:')
print(lr_acc_score*100)

<div class="alert alert-block alert-info">Pause! Answer <b>Question 4</b> in the Answers.txt file. 
    
What does this score tell us about our model? How good is the model at predicting wine quality? Note: Think about the data we used to test the model. </div>

The confusion matrix compares the wine review values predicted by the model with the actual wine review values. The numbers of correct and incorrect predictions are counted and broken down by class. These counts are organized into a table, or matrix, where each row represents the actual values and each column represents the predicted values. Correct predictions therefore sit on the diagonal.
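
To see how such a matrix is read, here is a small sketch with made-up labels (`toy_true` and `toy_pred` are hypothetical, not results from our model). In scikit-learn's `confusion_matrix`, rows correspond to the actual labels and columns to the predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["1", "2", "3"]

# Made-up actual and predicted reviews for eight wines
toy_true = ["1", "2", "2", "2", "2", "3", "3", "2"]
toy_pred = ["2", "2", "2", "2", "3", "3", "2", "2"]

# Rows: actual review, columns: predicted review
toy_cm = confusion_matrix(toy_true, toy_pred, labels=labels)
print(toy_cm)

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(toy_cm) / toy_cm.sum()

# Share of each actual class that was predicted correctly
recall_per_class = toy_cm.diagonal() / toy_cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

In this toy example the overall accuracy looks acceptable, yet the single Poor wine is never recognized. The per-class numbers reveal what a single accuracy score hides.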

In [None]:
# Compute the confusion matrix (rows: true review, columns: predicted review)
labels = ['1', '2', '3']
cm = confusion_matrix(y_test, y_pred, labels=labels)

# Plot the confusion matrix as an annotated heat map
ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True writes the count in each cell

# Axis labels and tick labels
ax.set_xlabel('Predicted review')
ax.set_ylabel('True review')
ax.xaxis.set_ticklabels(['Poor', 'Average', 'Excellent'])
ax.yaxis.set_ticklabels(['Poor', 'Average', 'Excellent'])
plt.show()

How could we make our model better? Think especially about the data used to train it: our data set contains few poor and few excellent wines, so the model has few examples of these classes to learn from.