# Predicting Red Wine Quality Using Machine Learning

This notebook looks into using various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting whether a wine can be classified as bad, good, or great based on physicochemical properties.

We will take the following approach:
1. Problem Definition
2. Data 
3. Evaluation
4. Features
5. Modelling
6. Experimentation

## 1. Problem Definition

In a statement, 

> Given physicochemical inputs about a red wine, can we predict the sensory quality the wine will be given when tasted?

## 2. Data

The original data is from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality 

It is also available on Kaggle, https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

## 3. Evaluation 

> If we can reach 88% accuracy on wine classifications, then we should pursue building this model into an application.
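
Accuracy here is simply the fraction of wines whose predicted class matches the true class. A minimal sketch of how the 88% goal could be checked, assuming made-up labels and a hypothetical `meets_target` helper (neither comes from the data set):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_target(score, target=0.88):
    """True when the score reaches the accuracy goal (hypothetical helper)."""
    return score >= target


# Illustrative labels only, not taken from the wine data
demo_true = ["bad", "bad", "good", "bad", "good", "bad", "bad", "bad", "good", "bad"]
demo_pred = ["bad", "bad", "good", "bad", "bad", "bad", "bad", "bad", "good", "bad"]

demo_score = accuracy(demo_true, demo_pred)
print(demo_score, meets_target(demo_score))
```

Nine of the ten illustrative predictions match, so the score is 0.9 and the target is met.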

## 4. Features

This is where you will get information about each of the features of the data.

**Create data dictionary**

Input variables (based on physicochemical tests):
*  fixed acidity
*  volatile acidity
*  citric acid
*  residual sugar
*  chlorides
*  free sulfur dioxide
*  total sulfur dioxide
*  density
*  pH
*  sulphates
*  alcohol

Output variable (based on sensory data):

*  quality (score between 0 and 10)


## Prepare the Tools
We're going to use pandas, Matplotlib, and NumPy for the data analysis and manipulation.

In [None]:
# import all the tools we need

# Regular EDA (exploratory data analysis) and planning libraries
import sys
sys.path.append('/Users/nick/.virtualenvs/red_wine_project/lib/python3.7/site-packages')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# %matplotlib inline # we want our plots to appear inside the notebook

# Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation 
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay  # replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn import datasets, linear_model, metrics

## Load Data

In [None]:
df = pd.read_csv("data/winequality-red.csv")
df.shape # (rows, columns)

## Data Exploration (EDA)

The goal here is to learn more about the data and become a subject matter expert on the data set

1. What question(s) are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# Let's find out how many of each class there are
df["quality"].value_counts()

In [None]:
df["quality"].value_counts().plot(kind="bar");

In [None]:
df.info()


In [None]:
# Are there any missing values? 
df.isna().sum()

In [None]:
df.describe()

### Wine Quality according to Volatile Acidity and pH

In [None]:
df["volatile acidity"].value_counts()


In [None]:
# pd.crosstab(df.quality, df["volatile acidity"])
df.pH.value_counts()

In [None]:
# Create a plot of crosstab, this doesn't work so well because of the large amount of different data points
# pd.crosstab(df.quality, df["volatile acidity"]).plot(kind="bar", figsize=(10,6))

In [None]:
plt.figure(figsize=(10, 6))

# Add scatter with quality == 3
plt.scatter(df.pH[df.quality == 3], df["volatile acidity"][df.quality == 3], c="black")

# Add scatter with quality == 5
plt.scatter(df.pH[df.quality == 5], df["volatile acidity"][df.quality == 5], c="red")

# Add scatter with quality == 7
plt.scatter(df.pH[df.quality == 7], df["volatile acidity"][df.quality == 7], c="gold")

# Add some helpful info
plt.title("Wine Quality as a Function of pH and Volatile Acidity")
plt.xlabel("pH")
plt.ylabel("Volatile Acidity")
# Legend entries follow the order in which the scatters were added
plt.legend(["Quality of 3", "Quality of 5", "Quality of 7"])

## Check the distribution of the pH column with a Histogram

In [None]:
df.pH.plot.hist()

In [None]:
# Make a correlation matrix
df.corr()

In [None]:
# Let's make the correlation matrix a better visual
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt=".2f", cmap="YlGnBu")
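
By default `df.corr()` computes the Pearson correlation coefficient for each pair of columns. As a rough illustration of what one cell of the heatmap represents, here is a hand-written sketch using made-up alcohol and quality values (the numbers and the `pearson` helper are assumptions for illustration, not notebook output):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only
demo_alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
demo_quality = [5, 5, 6, 6, 7]

r = pearson(demo_alcohol, demo_quality)
print(round(r, 3))
```

The value always lies between -1 and 1: values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.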

In [None]:
# Take the average and plot in a bar graph due to the amount of different data points
df.groupby('quality')['alcohol'].mean().plot.bar()
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.title('Quality Avg. vs Alcohol Avg.')
plt.show()

In [None]:
df.groupby('quality')['volatile acidity'].mean().plot.bar()
plt.xlabel('quality')
plt.ylabel('volatile acidity')
plt.title('Quality Avg. vs Volatile Acidity Avg.')
plt.show()

## 5. Modelling

In [None]:
df.head()

In [None]:
# split the data into X and y
X = df.drop("quality", axis=1)

y = df["quality"]


In [None]:
X

In [None]:
y

In [None]:
# Split data into train and test sets
np.random.seed(42)

# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
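
To make the effect of `test_size=0.3` concrete, the sketch below splits stand-in arrays that have the same number of rows as the data set (1599) and checks that a fixed `random_state` reproduces the same split. The `X_demo` and `y_demo` arrays are placeholders, not the wine data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 1599 rows, the same row count as the wine data
X_demo = np.arange(1599 * 2).reshape(1599, 2)
y_demo = np.arange(1599) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(len(X_tr), len(X_te))

# The same random_state gives the same split on a second call
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

With 1599 rows, a 30% test share gives 480 test rows and 1119 training rows. Passing `random_state` directly is an alternative to calling `np.random.seed` beforehand.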



In [None]:
X_train

In [None]:
y_train

Now that we've got our data split into training and testing data sets, we can build our machine learning models.

We'll train them on the training set.

Then we will test them on the testing set.

We will try 3 different machine learning models:
1. Logistic Regression
2. K-Nearest Neighbours Classifier
3. Random Forest Classifier

In [None]:
# Put the models in a dictionary
models = {"Logistic Regression": LogisticRegression(), 
        "K-Nearest Neighbours Classifier": KNeighborsClassifier(),
        "Random Forest Classifier": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train,  y_test):
    """
    fits and evaluates given machine learning models. 
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : testing labels
    """
    # Set a random seed
    np.random.seed(42)
    # Make a dict to keep model scores 
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores


In [None]:
model_scores = fit_and_score(models=models, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)

model_scores

## Model Comparison

In [None]:
model_compare = pd.DataFrame(model_scores, index=["Accuracy"])

model_compare.T.plot.bar()

In [None]:
forest = RandomForestClassifier(n_estimators=40, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print(confusion_matrix(y_test, y_pred))





In [None]:
print(metrics.accuracy_score(y_test,y_pred))
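
The confusion matrix and the accuracy score are two views of the same counts: rows are true classes, columns are predicted classes, and accuracy is the diagonal total divided by the number of samples. A small hand-rolled sketch with made-up quality labels (not the notebook's predictions):

```python
from collections import Counter

# Made-up true and predicted quality scores for illustration
demo_true = [5, 5, 6, 6, 7, 5, 6, 7]
demo_pred = [5, 6, 6, 6, 6, 5, 5, 7]

# Count how often each (true, predicted) pair occurs
pair_counts = Counter(zip(demo_true, demo_pred))
labels = sorted(set(demo_true) | set(demo_pred))

# Rows are true labels, columns are predicted labels
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
for row in matrix:
    print(row)

# Accuracy is the sum of the diagonal divided by the number of samples
correct = sum(matrix[i][i] for i in range(len(labels)))
demo_accuracy = correct / len(demo_true)
print(demo_accuracy)
```

Off-diagonal entries show which classes get confused with each other, which a single accuracy number hides.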

We need to increase the accuracy of the random forest classifier.

Let's look at the following:
* Hyperparameter tuning
* Feature Importance
* Confusion Matrix
* Cross-validation
* Precision
* Recall
* F1 Score
* Classification Report
* ROC Curve
* Area under the curve (AUC)
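
For reference, precision, recall and F1 can be written out by hand for a two-class problem. The sketch below uses made-up 0/1 labels and a hypothetical `precision_recall_f1` helper. At this point in the notebook the target still has several quality classes, so the scikit-learn equivalents would need an averaging strategy (their `average` parameter) until the labels are made binary.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 1 = good wine, 0 = bad wine
demo_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
demo_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(demo_true, demo_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case the model finds only half of the good wines (recall 0.5) even though seven of ten predictions are correct, which is why these metrics are worth reading alongside accuracy.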

## Hyperparameter Tuning

In [None]:
# Let's tune KNN

train_scores = []
test_scores = []

# Create a list of different values for n_neighbours
neighbours = range(1, 21)

# Setup KNN instance
knn = KNeighborsClassifier()

# Loop through different n_neighbours
for i in neighbours:
    knn.set_params(n_neighbors=i)

    # fit the model
    knn.fit(X_train, y_train)

    # update the training scores list
    train_scores.append(knn.score(X_train, y_train))

    # update the testing scores list
    test_scores.append(knn.score(X_test, y_test))

In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbours, train_scores, label="Train Scores")
plt.plot(neighbours, test_scores, label="Test Scores")
plt.xlabel("Number of Neighbors")
plt.ylabel("Model Score")
plt.legend()
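
Besides reading the best value off the plot, the best-scoring `n_neighbors` can be picked programmatically. A minimal sketch with hypothetical scores (the numbers and the `best_k` helper are invented for illustration):

```python
def best_k(neighbours, scores):
    """Return the (k, score) pair with the highest score (hypothetical helper)."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return neighbours[best_index], scores[best_index]


# Hypothetical test scores for k = 1..5
demo_neighbours = range(1, 6)
demo_scores = [0.55, 0.52, 0.57, 0.61, 0.59]

k, score = best_k(demo_neighbours, demo_scores)
print(f"Best k: {k} with score {score:.2%}")
```

Choosing `k` from the test scores means the test set is no longer an independent check, so a separate validation split or cross-validation is usually the safer basis for this choice.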


## Hyperparameter tuning with RandomizedSearchCV

We're going to tune:

* LogisticRegression()
* RandomForestClassifier()

... using RandomizedSearchCV

In [None]:
# Create a hyperparameter grid for logistic regression
log_req_grid = {"C" : np.logspace(-4, 4, 20), "solver" : ["liblinear"]}

# create a hyperparameter grid for  RandomForestClassifier
rf_grid = {"n_estimators" : np.arange(10, 1000, 50), "max_depth" : [None, 3, 5, 10], "min_samples_split" : np.arange(2, 20, 2), "min_samples_leaf" : np.arange(1, 20, 2)}
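
It helps to know how large these grids are before sampling from them. The sketch below counts the combinations using plain `range` objects that mirror the `np.arange` calls above (`rf_grid_sizes` is a name introduced only for this illustration):

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring rf_grid
rf_grid_sizes = {
    "n_estimators": len(range(10, 1000, 50)),
    "max_depth": len([None, 3, 5, 10]),
    "min_samples_split": len(range(2, 20, 2)),
    "min_samples_leaf": len(range(1, 20, 2)),
}

total_combinations = prod(rf_grid_sizes.values())
n_iter = 20
print(rf_grid_sizes)
print(total_combinations, n_iter / total_combinations)
```

The random forest grid has 7200 combinations, so `n_iter=20` samples well under 1% of them. The logistic regression grid has only 20 combinations (20 values of `C` and one solver), so 20 iterations cover all of it.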

In [None]:
# Tune logistic Regression
np.random.seed(42)

rs_log_reg = RandomizedSearchCV(LogisticRegression(), param_distributions=log_req_grid, cv=5, n_iter=20, verbose=True)

rs_log_reg.fit(X_train, y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test, y_test)

In [None]:
# Now tune the RandomForestClassifier

# set seed
np.random.seed(42)

# setup random hyperparameters search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(), 
                            param_distributions=rf_grid,
                            cv=5,
                            n_iter=20, 
                            verbose=True)

# fit random hyperparameter search model for RFC()
rs_rf.fit(X_train, y_train)


In [None]:
# Find the best hyperparameters
rs_rf.best_params_

In [None]:
# Evaluate the randomized search RFC model
rs_rf.score(X_test, y_test)

In [None]:
df.head()

## Feature Scaling

Since the feature values live in different ranges (for example, chlorides look like 0.092 while total sulfur dioxide looks like 57.0), we need to scale them so that they fall in a comparable range.
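
`StandardScaler` subtracts each column's mean and divides by its (population) standard deviation. A hand-written sketch of that calculation, using the chlorides and total sulfur dioxide values from the first five rows of the data set (the `standardize` helper is only for illustration):

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


# First five rows of two columns that live on very different scales
chlorides = [0.076, 0.098, 0.092, 0.075, 0.076]
total_sulfur_dioxide = [34.0, 67.0, 54.0, 60.0, 34.0]

scaled_chlorides = standardize(chlorides)
scaled_so2 = standardize(total_sulfur_dioxide)

print([round(v, 3) for v in scaled_chlorides])
print([round(v, 3) for v in scaled_so2])
```

After scaling, both columns have a mean of about 0 and a standard deviation of about 1, so neither dominates a distance-based model such as KNN.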

In [None]:
scaler = StandardScaler()

In [None]:
# scale all features except last column
scaler.fit(df.drop('quality', axis=1))

In [None]:
scaled_features = scaler.transform(df.drop('quality', axis=1))

In [None]:
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
df_feat.head()
df.tail()



## Feature Selection

In [None]:
X = df_feat
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
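
The scaler above was fitted on the full data set before splitting, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic stand-in data rather than the wine data, is to put the scaler and the model in a pipeline so that the scaler is fitted on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features on very different scales
X_demo = np.column_stack(
    [rng.normal(0.09, 0.02, 200), rng.normal(50.0, 20.0, 200)]
)
y_demo = (X_demo[:, 0] > 0.09).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# fit() learns the scaling statistics from X_tr only, then trains KNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

# score() reuses the training statistics to scale X_te before predicting
demo_score = model.score(X_te, y_te)
print(demo_score)
```

The pipeline also removes the need to remember to scale new inputs by hand, because `predict` and `score` apply the stored scaler automatically.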


In [None]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
score = knn.score(X_test, y_test)
score

In [None]:
print(classification_report(y_test,pred))

In [None]:
X

## Using Random Forest with Scalers 

In [None]:
forest= RandomForestClassifier(n_estimators=40, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

In [None]:
print(confusion_matrix(y_test,y_pred))

In [None]:
print(metrics.accuracy_score(y_test,y_pred))

## Make the classification binary

We need to give the wine a binary classification, as in 'bad' and 'good'.

We may want to look into giving this a third classification as well, if it does not reduce our accuracy below 80%.

In [None]:
df = pd.read_csv("data/winequality-red.csv")
df.shape # (rows, columns)

bins = (2.0, 4.0, 8.0)
group_names = ['bad','good']
df['quality'] = pd.cut(df['quality'], bins=bins, labels=group_names)
df.head()
print(df['quality'].unique())
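
`pd.cut` uses right-inclusive intervals by default, so `bins = (2.0, 4.0, 8.0)` means (2, 4] is 'bad' and (4, 8] is 'good'. The sketch below applies two candidate thresholds to the six quality scores that occur in the data (3 to 8) to show where the boundary falls:

```python
import pandas as pd

# The quality scores that occur in the data set
demo_quality = pd.Series([3, 4, 5, 6, 7, 8])

# Threshold at 4: scores 3-4 are 'bad', 5-8 are 'good'
labels_cut_at_4 = pd.cut(demo_quality, bins=(2.0, 4.0, 8.0), labels=["bad", "good"])

# Threshold at 6: scores 3-6 are 'bad', 7-8 are 'good'
labels_cut_at_6 = pd.cut(demo_quality, bins=(2.0, 6.0, 8.0), labels=["bad", "good"])

print(labels_cut_at_4.tolist())
print(labels_cut_at_6.tolist())
```

Moving the middle bin edge changes which wines count as 'good', and with it the class balance that the models see.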

In [None]:
# Now let's assign labels to our quality variable
label_quality = LabelEncoder()


In [None]:
#Bad becomes 0 and good becomes 1 
df['quality'] = label_quality.fit_transform(df['quality'])
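
`LabelEncoder` sorts the class names and numbers them from 0, which is why 'bad' becomes 0 and 'good' becomes 1. A short sketch on a made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Made-up labels; the encoder sorts the classes before numbering them
codes = encoder.fit_transform(["good", "bad", "bad", "good"])

print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the numeric codes back to the original names
print(encoder.inverse_transform([0, 1]).tolist())
```

Keeping the fitted encoder around makes it possible to turn model predictions (0 or 1) back into readable labels later.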

In [None]:
df['quality'].value_counts()


In [None]:
sns.countplot(x=df['quality'])
plt.show()


In [None]:
df.head()

In [None]:
X = df_feat
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
score = knn.score(X_test, y_test)
score

In [None]:
forest= RandomForestClassifier(n_estimators=40, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
y_pred

In [None]:
print(confusion_matrix(y_test,y_pred))

In [None]:
print(metrics.accuracy_score(y_test,y_pred))
X_test

In [None]:
for key, value in X_test.items():
    print(key)
    print(value)

In [None]:
# -0.528360	0.961877	-1.391472	-0.453218	-0.243707	-0.466193	-0.379133	0.558274	1.288643	-0.579207	-0.960246

test_input_data = {
"fixed acidity": -0.528360,
"volatile acidity": 0.961877,
"citric acid": -1.391472,
"residual sugar": -0.453218,
"chlorides": -0.243707,
"free sulfur dioxide": -0.466193,
"total sulfur dioxide": -0.379133,
"density": 0.558274,
"pH": 1.288643,
"sulphates": -0.579207,
"alcohol": -0.960246
}

In [None]:
# need to turn the dict into a data frame
test_df = pd.DataFrame([test_input_data])

test_pred = forest.predict(test_df)
test_pred

# Below will be the start - finish data processing/modelling for this project

In [None]:
df = pd.read_csv("data/winequality-red.csv")
df.shape # (rows, columns)

In [None]:
df.describe()

## Plots that may be useful

In [None]:
# Below we compare two properties and see how they relate to the quality output

plt.figure(figsize=(10, 6))

# Add scatter for wines with quality < 6
plt.scatter(df.alcohol[df.quality < 6], df["density"][df.quality < 6], c="#8B0000")

# Scatter with quality == 5 (disabled)
# plt.scatter(df.pH[df.quality == 5], df["volatile acidity"][df.quality == 5], c="red")

# Add scatter for wines with quality >= 6
plt.scatter(df.alcohol[df.quality >= 6], df["density"][df.quality >= 6], c="#006400")

# Add some helpful info; the y-axis shows density, so it is labelled accordingly
plt.title("Wine Quality as a Function of Alcohol and Density")
plt.xlabel("Alcohol")
plt.ylabel("Density")
plt.legend(["Quality < 6", "Quality >= 6"])

In [None]:
# this is my function for getting the averages of each column with each quality.. 3,4,5,6,7,8

my_filter = df.quality==3

quality_is_three = df.quality==3
quality_is_three = df.loc[quality_is_three, 'fixed acidity']
quality_is_three

X=df.iloc[:,0:-1]
i=1
X
# qualities = [3,4,5,6,7,8]
# averages_for_each_feature = {}
# for col in X.columns:
#     averages_for_each_feature.setdefault(col, [])
#     for quality in qualities:
#         current_quality = df.quality==quality
#         current_quality = df.loc[current_quality, col]
#         current_quality_average = current_quality.std()
#         averages_for_each_feature[col].append(current_quality_average)
# averages_for_each_feature




In [None]:
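X = df.iloc[:, 0:-1]
i = 1
plt.figure(figsize=(30, 90))
for col in X.columns:
    plt.subplot(11, 2, i)
    # Histogram of one feature; palette is dropped because it has no effect without hue
    sns.histplot(X[col])
    plt.xticks(fontsize=25)
    plt.yticks(fontsize=25)
    # The x-axis holds the feature values and the y-axis the counts
    plt.xlabel(col, fontsize=25)
    plt.ylabel("Count", fontsize=25)
    print(X[col].count())

    i = i + 1
plt.show()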

In [None]:
data_for_x = df["sulphates"][df.quality > 6]
data_for_y = df.alcohol[df.quality > 6]
data_for_x = data_for_x.tolist()
data_for_y = data_for_y.tolist()

empty_list = []

for i in range(len(data_for_x)):
    temp_dict = {'x': data_for_x[i], 'y': data_for_y[i]}
    empty_list.append(temp_dict)

print (empty_list)


In [None]:
# Create the figure and the 2x3 grid of axes in one call
# (a separate plt.figure() call would leave an empty figure behind)
fig, axs = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(10, 7))

axs[0, 0].scatter(df.pH[df.quality == 3], df["volatile acidity"][df.quality == 3], c="black")
axs[0, 0].scatter(df.pH[df.quality == 5], df["volatile acidity"][df.quality == 5], c="red")
axs[0, 0].set_title("Wine Quality as a Function of pH and Volatile Acidity")

plt.tight_layout()
plt.show()

In [None]:
# below we use an average to derive this plot

# Take the average and plot in a bar graph due to the amount of different data points
df.groupby('quality')['alcohol'].mean().plot.bar()
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.title('Quality Avg. vs Alcohol Avg.')
plt.show()

In [None]:
# the output of df.describe() could be good data to load into the backend so that we have the mean, std, min, etc.

df.describe()

## starting the data processing

In [None]:
df.head()

In [240]:
# Here we split the quality values into good or bad based on a chosen threshold
# I've noticed so far that bins = (2.0, 4.0, 8.0) produces the highest accuracy for our models, at around 95%
# bins = (2.0, 5.0, 8.0) only gets us around 78% with feature scaling
# bins = (2.0, 6.0, 8.0) gets us to 88% WITHOUT feature scaling!!!

bins = (2.0, 6.0, 8.0)
group_names = ['bad','good']
df['quality'] = pd.cut(df['quality'], bins=bins, labels=group_names)
df.head()
print(df['quality'].unique())

['bad', 'good']
Categories (2, object): ['bad' < 'good']


In [None]:
# notice now that we have quality divided into good and bad wines
df.head()

In [241]:
# let's find examples of bad wines. Note: the first threshold gave only 63 rows of bad wine and a later one gave 744, which seemed more balanced; the current bins = (2.0, 6.0, 8.0) give 1382 (see the value counts below)
df.loc[df.quality == 'bad']

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | bad |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | bad |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | bad |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | bad |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | bad |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | bad |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | bad |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | bad |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | bad |


In [242]:
# now let's get only good values
df.loc[df.quality == 'good']

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15.0 | 21.0 | 0.99460 | 3.39 | 0.47 | 10.00 | good |
| 8 | 7.8 | 0.58 | 0.02 | 2.0 | 0.073 | 9.0 | 18.0 | 0.99680 | 3.36 | 0.57 | 9.50 | good |
| 16 | 8.5 | 0.28 | 0.56 | 1.8 | 0.092 | 35.0 | 103.0 | 0.99690 | 3.30 | 0.75 | 10.50 | good |
| 37 | 8.1 | 0.38 | 0.28 | 2.1 | 0.066 | 13.0 | 30.0 | 0.99680 | 3.23 | 0.73 | 9.70 | good |
| 62 | 7.5 | 0.52 | 0.16 | 1.9 | 0.085 | 12.0 | 35.0 | 0.99680 | 3.38 | 0.62 | 9.50 | good |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1541 | 7.4 | 0.25 | 0.29 | 2.2 | 0.054 | 19.0 | 49.0 | 0.99666 | 3.40 | 0.76 | 10.90 | good |
| 1544 | 8.4 | 0.37 | 0.43 | 2.3 | 0.063 | 12.0 | 19.0 | 0.99550 | 3.17 | 0.81 | 11.20 | good |
| 1549 | 7.4 | 0.36 | 0.30 | 1.8 | 0.074 | 17.0 | 24.0 | 0.99419 | 3.24 | 0.70 | 11.40 | good |
| 1555 | 7.0 | 0.56 | 0.17 | 1.7 | 0.065 | 15.0 | 24.0 | 0.99514 | 3.44 | 0.68 | 10.55 | good |


In [243]:
# Now let's assign numeric labels to our quality variable; we need to switch it to 0 and 1
label_quality = LabelEncoder()

In [244]:
# Bad becomes 0 and good becomes 1 
df['quality'] = label_quality.fit_transform(df['quality'])
df['quality'].value_counts()

0    1382
1     217
Name: quality, dtype: int64
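
The counts above show a strong class imbalance, which matters when reading accuracy numbers. As a quick sanity check, this sketch computes the accuracy of a trivial model that always predicts the majority class, using the two counts printed above:

```python
# Class counts from the value_counts() output above (0 = bad, 1 = good)
class_counts = {0: 1382, 1: 217}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class on data with this balance
baseline_accuracy = class_counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 4))
```

Always predicting 'bad' would already be right about 86% of the time on data with this balance, so scores in the high 80s are only slightly above that baseline. Precision and recall for the 'good' class give a better picture of how useful the model is.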

In [None]:
# simple plot showing how many bad and good wines we have
sns.countplot(x=df['quality'])
plt.show()

In [245]:
# split the data into X and y
X = df.drop("quality", axis=1)

y = df["quality"]

In [246]:
# set seed
np.random.seed(42)

# Split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [247]:
# let's run KNN to see how well it performs
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
score = knn.score(X_test, y_test)
score

0.8729166666666667

In [248]:
# now we try and run the RF 
forest= RandomForestClassifier(n_estimators=40, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,

In [249]:
# let's see how well our random forest did, 
rf_score = forest.score(X_test, y_test)
rf_score

0.8854166666666666

## Our results so far...

We initially got 0.6645833333333333 for our KNN model and 0.7854166666666667 for our random forest model.

That was after we had put our quality into two categories, so why was our accuracy still bad? The first thought was that we have to scale!

Since the features span very different ranges, we were most likely skewing our results, so we would need to scale the features.

**EDIT**

I have now determined that we don't need to scale the features: we can still achieve an accuracy of about 88%, which is what I wanted the minimum threshold for a good model to be.

This was achieved by adjusting the threshold for what constitutes a good wine versus a bad one. We broke them into the following bins:
- *bins = (2.0, 6.0, 8.0)*

In [None]:
# let's setup our scaler function
scaler = StandardScaler()

# then let's scale everything EXCEPT our last column
scaler.fit(df.drop('quality', axis=1)) # compute mean and std for later scaling

scaled_features = scaler.transform(df.drop('quality', axis=1)) # actually scale

# what does our data look now????
df_features_scaled = pd.DataFrame(scaled_features, columns=df.columns[:-1]) # create a new df with scaled features
df_features_scaled.head()

In [None]:
df_features_scaled.describe()

In [None]:
# let's now train the models on our scaled features

X = df_features_scaled
y = df.quality

In [None]:
# set seed
np.random.seed(42)

# Split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
# let's run KNN to see how well it performs
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
score = knn.score(X_test, y_test)
score

# hmm 0.7125 .... Better!
# we get 0.8729166666666667 !!

In [239]:
# now we try and run the RF 
forest = RandomForestClassifier(n_estimators=40, random_state=0)
forest.fit(X_train, y_train)
rf_score = forest.score(X_test, y_test)
rf_score

0.8854166666666666

# Okay, so let's test our models on unscaled data

We want to test on something that should be a bad wine, then on something that should be a good wine.
Let's see what we get...



In [None]:
# let's find a wine that has a quality of == 0
df.loc[df.quality == 0]

In [None]:
test_bad_wine = {
"fixed acidity": 7.4,
"volatile acidity": 0.700,
"citric acid": 0.00,
"residual sugar": 1.9,
"chlorides": 0.076,
"free sulfur dioxide": 11.0,
"total sulfur dioxide": 34.0,
"density": 0.99780,
"pH": 3.51,
"sulphates": 0.56,
"alcohol": 9.4
}

# need to turn the dict into a data frame
test_bad_wine_df = pd.DataFrame([test_bad_wine])

test_bad_wine_pred = forest.predict(test_bad_wine_df)
test_bad_wine_pred

## Great! That wine was predicted to be bad, as it should have been

## Now let's test on a wine that should be good!

In [None]:
# let's find a wine that has a quality of == 1
df.loc[df.quality == 1]

In [None]:
test_good_wine = {
"fixed acidity": 6.7,
"volatile acidity": 0.32,
"citric acid": 0.44,
"residual sugar": 2.4,
"chlorides": 0.061,
"free sulfur dioxide": 24.0,
"total sulfur dioxide": 34.0,
"density": 0.99484,
"pH": 3.29,
"sulphates": 0.80,
"alcohol": 11.6
}

# need to turn the dict into a data frame
test_good_wine_df = pd.DataFrame([test_good_wine])

test_good_wine_pred = forest.predict(test_good_wine_df)
test_good_wine_pred
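
When predictions are made from hand-built dictionaries like these, the column names and their order need to match what the model saw during training; scikit-learn estimators fitted on a DataFrame generally check this. A small sketch of a helper that enforces the order (the `to_model_frame` function and `FEATURE_COLUMNS` list are hypothetical names introduced here):

```python
import pandas as pd

# Column order used for training (the 11 input variables listed in section 4)
FEATURE_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]


def to_model_frame(sample, columns=FEATURE_COLUMNS):
    """Build a one-row DataFrame whose columns follow the training order."""
    missing = [name for name in columns if name not in sample]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return pd.DataFrame([sample], columns=columns)


# Same values as test_good_wine, deliberately given in a different key order
shuffled_sample = {
    "alcohol": 11.6, "pH": 3.29, "sulphates": 0.80, "density": 0.99484,
    "fixed acidity": 6.7, "volatile acidity": 0.32, "citric acid": 0.44,
    "residual sugar": 2.4, "chlorides": 0.061,
    "free sulfur dioxide": 24.0, "total sulfur dioxide": 34.0,
}

frame = to_model_frame(shuffled_sample)
print(list(frame.columns) == FEATURE_COLUMNS)
```

If the model was trained on scaled features, the same scaler would also have to be applied to this frame before calling `predict`.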

## Okay, so after playing with the inputs we finally got a good wine!

This does make sense: it should be harder to make a wine that is good, so we should be happy that we are able to reproduce that result.

# Let's now export our model using the Pickle library

In [None]:
import pickle

pick = {
    'rf': forest, 
    'knn': knn
}

# Use a context manager so the file is closed once the models are written
with open('models' + ".p", "wb") as f:
    pickle.dump(pick, f)
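
To use the exported file elsewhere, it has to be loaded back with `pickle.load`. The round trip is sketched below with plain dictionaries standing in for the fitted models and a temporary directory standing in for the project folder (both are assumptions made so the example runs on its own):

```python
import os
import pickle
import tempfile

# Stand-in objects; in the notebook these would be the fitted forest and knn
models_to_save = {"rf": {"n_estimators": 40}, "knn": {"n_neighbors": 2}}

# Write to a temporary location so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), "models.p")

with open(path, "wb") as f:
    pickle.dump(models_to_save, f)

# Load the dictionary back and look the models up by key
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded))
print(loaded == models_to_save)
```

Pickle files should only be loaded from trusted sources, and fitted scikit-learn models are best loaded with the same library version that saved them.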