In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


### Let's take a look at the data...

In [None]:
df = pd.read_csv("data/winequality-red.csv")

In [None]:
df.head()

Ok, we want to classify the quality, but it's a numeric value.
Let's take a look at the distribution.

In [None]:
df["quality"].hist(bins=df["quality"].nunique())

Ok, let's call wines with quality 6, 7 or 8 "good". The rest is "not good".

In [None]:
df["isGood"] = 0
df.loc[df["quality"] >= 6, "isGood"] = 1

In [None]:
df["isGood"].value_counts()

Good, the distribution of 0s and 1s is pretty balanced. This should make it easier to classify and evaluate.
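To put a number on "pretty balanced", class shares are often easier to read than raw counts. The sketch below uses a small made-up `toy` frame (not the wine data) to show an equivalent one-line binarization and `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up quality scores, only for illustration (not the wine data).
toy = pd.DataFrame({"quality": [3, 5, 5, 6, 6, 7, 8, 4]})

# Same threshold as above: quality >= 6 counts as "good".
toy["isGood"] = (toy["quality"] >= 6).astype(int)

# normalize=True returns class shares instead of raw counts.
shares = toy["isGood"].value_counts(normalize=True)
print(shares)
```

The same call on `df["isGood"]` should show how far the real split is from 50/50.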

In [None]:
df = df.drop("quality", axis=1)

Ok, we've now binarized the target variable.

In [None]:
df.head()

In [None]:
# Save it for later
df.to_csv("/tmp/wine-binary.csv", index=False)

### A first model

Let's go with Logistic Regression.

Take a look at the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and try to import it.


In [None]:
from sklearn # ... YOUR CODE HERE

In [None]:
model = LogisticRegression()

In [None]:
X = df.drop("isGood", axis=1)  # Make sure that the target is not also part of the features!
y = df["isGood"]

Try to train the classifier below (on the full data set, so no train/test split yet):

Hint: take a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit).

In [None]:
model. # ... YOUR CODE HERE

Now, try to get a score below.

Again, you might want to check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score).

In [None]:
model.  # ... YOUR CODE HERE

What do you think about this score? Good? Bad?

Before you move on, try and see what happens if we're *not* dropping "isGood" from the dataset before fitting. What is the effect on the score?

Restore the drop before moving on.
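Leaving the target among the features is often called target leakage. As a rough illustration on synthetic data (the names `X_noise`, `y_toy` and `X_leaky` are made up for this sketch), a model that sees the label as a feature should score close to perfectly even when the other features carry no signal at all:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: three pure-noise features and random 0/1 labels.
X_noise = rng.normal(size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Fit on the noise features only.
clean_score = LogisticRegression().fit(X_noise, y_toy).score(X_noise, y_toy)

# Append the label itself as a fourth feature (this is the leak).
X_leaky = np.column_stack([X_noise, y_toy])
leaky_score = LogisticRegression().fit(X_leaky, y_toy).score(X_leaky, y_toy)

print(f"without leak: {clean_score:.3f}, with leak: {leaky_score:.3f}")
```

A suspiciously high score is therefore a good reason to double-check what is in `X`.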

To evaluate whether the score is good, it's always wise to compare it to a simple baseline.

Let's take another look at the number of 0s and 1s:

In [None]:
df["isGood"].value_counts()

What would the accuracy be if we always predict the majority class: 1?

In [None]:
# YOUR CODE HERE

So, how do you feel about the model's accuracy?
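One way to get such a baseline without computing it by hand is scikit-learn's `DummyClassifier`. The sketch below runs on hypothetical labels (`y_toy`, six 1s and four 0s), not on the wine data, so the number it prints is not the answer to the question above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: six 1s and four 0s (not the wine data).
y_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # this baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

The accuracy equals the share of the majority class, which is the number any real model has to beat.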

### Let's see some examples

In [None]:
# Draw one random wine and show its true label next to its features
random_example = X.sample(n=1)
print("True 'isGood' value: {}".format(df.loc[random_example.index, "isGood"].values[0]))
random_example

Now [predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) whether it's any good!

In [None]:
model. # YOUR CODE HERE

You might wonder why it's only showing 0 and 1. Logistic Regression can do probabilities, right?

Well, there is another function that returns probabilities. See if you can find it in the documentation.

In [None]:
model. # YOUR CODE HERE

Try to run this again for some other random samples.

### Inspecting the model

We can take a look at the coefficients of the model. Try to assign them here:

In [None]:
coef = # YOUR CODE HERE

In [None]:
# Pair each coefficient with its feature name (X holds the features, without the target)
coef_map = dict(zip(X.columns, coef[0]))
coef_map

Some of these stand out. Let's inspect some of them:

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x="isGood", y="alcohol", data=df)
plt.grid()

Any interesting findings?

### Our methodology was a bit flawed here...

If we train on the full dataset and then report the score on that same data, the score is usually too optimistic, because the model has already seen every example it is being graded on.

Below is an import for a new function that will help us make a training and testing split. Could you see if you can get it to work? [Docs are here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# YOUR CODE HERE

You should now have 4 new variables: X_train, X_test, y_train, y_test

### Open-ended assignment

- Re-train the model, but now only on X_train and y_train. 
- Evaluate on both X_train & y_train and X_test & y_test. Do you notice a difference?
- Increase the complexity: add polynomial features (manually or using [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)). What does this do to the scores on train & test?
- You might be able to get the train and test scores closer together by tuning the "C" parameter. Give that a go.
- What is your best score?
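As a starting point for the polynomial-features and "C" steps, here is one possible setup on synthetic stand-in data from `make_classification` (not the wine data). The `StandardScaler` step, the raised `max_iter`, the three values of `C` and all variable names are choices made for this sketch, not part of the assignment:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption: not the wine data).
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

results = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means stronger regularization.
    pipe = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        LogisticRegression(C=C, max_iter=5000),
    )
    pipe.fit(X_tr, y_tr)
    results[C] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))

for C, (train_acc, test_acc) in results.items():
    print(f"C={C}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Comparing the train and test columns for each `C` is the same check the bullets above ask for on the wine data.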

In [None]:
# YOUR CODE HERE