# Wine Classifier
## Data Modeling

### Table of Contents

1. [Data Preparation](#preparation)
    1. [Load the data](#load)
    2. [Prepare the data](#prepare)
    3. [Train and test split](#split)
2. [Data Modeling](#modeling)
3. [Model Evaluation](#evaluation)
4. [Export the results](#export)

In [11]:
# Imports
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from IPython.display import display, Markdown as md
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures


# Config
%matplotlib notebook
pd.options.display.max_columns = None

## Data Preparation <a id="preparation"></a>
### Load the data <a id="load"></a>
Loads the raw *.csv* file, which has no header row, into a *pandas DataFrame* and supplies the column names explicitly.

In [12]:
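# The file has no header row, so the column names are passed in explicitly.
# The list includes the target ("label"), hence "column_names" rather than "features".
column_names = ["label", "alcohol", "malic_acid", "ash", "ash_alcalinity", "magnesium", "phenols", "flavanoids",
                "nonflavanoid_phenols", "proanthocyanins", "color_intensity", "hue", "od280_od315", "proline"]
df = pd.read_csv("data/raw/wine.csv", names=column_names)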

### Prepare the data <a id="prepare"></a>

In [13]:
# Define 'target' and 'features'
target = "label"
features = df.columns.values
features = features[features != target]

X = df.loc[:, features]
y = df.loc[:, [target]]
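
Selecting the target with a list of columns (`[target]`) keeps it as a single-column *DataFrame*, whereas scikit-learn estimators expect a 1-D target. A minimal sketch of the difference, using a small made-up frame (the `toy` values are illustrative and not taken from the wine data):

```python
import pandas as pd

# Toy frame standing in for the wine data (hypothetical values)
toy = pd.DataFrame({"label": [1, 2, 3, 1], "alcohol": [14.2, 13.2, 12.4, 13.7]})

y_frame = toy.loc[:, ["label"]]   # list selector -> DataFrame with one column
y_series = toy.loc[:, "label"]    # scalar selector -> 1-D Series
y_flat = y_frame.values.ravel()   # flatten the DataFrame to a 1-D NumPy array

print(y_frame.shape, y_series.shape, y_flat.shape)
```

Either the *Series* or the flattened array can be passed to `fit` as a 1-D target.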

### Train and test split <a id="split"></a>
Splits the dataset into a *train set* (80%) and a *test set* (20%), with a fixed `random_state` so the split is reproducible.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

display(md("*X_train* shape: {0} - *X_test* shape: {1}".format(X_train.shape, X_test.shape)))
display(md("*y_train* shape: {0} - *y_test* shape: {1}".format(y_train.shape, y_test.shape)))

*X_train* shape: (142, 13) - *X_test* shape: (36, 13)

*y_train* shape: (142, 1) - *y_test* shape: (36, 1)
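
The split above is not stratified, so class proportions in the two sets can drift from those of the full dataset. One possible variant, sketched here under the assumption that scikit-learn's bundled `load_wine` data can stand in for `data/raw/wine.csv` (same number of rows and features, but labels 0 to 2), passes `stratify` to keep the proportions aligned:

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled wine data stands in for data/raw/wine.csv
X_demo, y_demo = load_wine(return_X_y=True)

# stratify=y_demo keeps each class's share roughly equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("train class counts:", sorted(Counter(y_tr.tolist()).items()))
print("test class counts: ", sorted(Counter(y_te.tolist()).items()))
```

Switching to a stratified split would change which rows land in the test set, so the metrics reported in this notebook would need to be recomputed.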

## Data Modeling <a id="modeling"></a>

In [15]:
# cross_val_score is already imported in the first cell
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifier_names = [
    "Logistic Regression",
    "Decision Tree Classifier",
    "Random Forest Classifier",
]

classifier_models = [
    LogisticRegression(),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
]



In [16]:
model_metrics = []
for clf_name, clf_model in zip(classifier_names, classifier_models):
    # ravel() flattens the single-column target into the 1-D array scikit-learn expects
    clf_model.fit(X_train, y_train.values.ravel())
    preds = clf_model.predict(X_test)
    model_metrics.append([
        clf_name,
        accuracy_score(y_test, preds),
        recall_score(y_test, preds, average="weighted"),
        precision_score(y_test, preds, average="weighted"),
        f1_score(y_test, preds, average="weighted"),
    ])

# Column order matches the order in which the metrics are appended above
model_summary = pd.DataFrame(model_metrics, columns=["Model", "Accuracy", "Recall", "Precision", "F1-score"])

display(model_summary)

| | Model | Accuracy | Recall | Precision | F1-score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.972222 | 0.972222 | 0.974074 | 0.972187 |
| 1 | Decision Tree Classifier | 0.944444 | 0.944444 | 0.946296 | 0.943997 |
| 2 | Random Forest Classifier | 0.972222 | 0.972222 | 0.974074 | 0.972187 |
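
With `average="weighted"`, recall is the mean of the per-class recalls weighted by each class's support, which reduces algebraically to plain accuracy. A minimal sketch of that identity in pure Python, using made-up labels rather than the wine predictions:

```python
from collections import Counter

# Made-up labels for three classes (illustrative only, not the wine predictions)
y_true = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]

support = Counter(y_true)  # number of true samples per class
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class

accuracy = sum(hits.values()) / len(y_true)

# Per-class recall, then the support-weighted average
recall = {c: hits[c] / support[c] for c in support}
weighted_recall = sum(recall[c] * support[c] for c in support) / len(y_true)

print(accuracy, weighted_recall)
```

Because each class's support cancels out of its own term, the weighted sum collapses to the total number of correct predictions divided by the number of samples.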


## Model Evaluation <a id="evaluation"></a>

In [17]:
#TODO: this!
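
This cell is still a placeholder. One possible starting point, offered as a sketch and not as the notebook's intended method, combines a confusion matrix on the test set with 5-fold cross-validation on the training set. It assumes scikit-learn's bundled `load_wine` data as a stand-in for `data/raw/wine.csv`, and it adds a `random_state` to the decision tree purely so the sketch is repeatable:

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled wine data stands in for data/raw/wine.csv
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# random_state is added here only to make the sketch repeatable
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, tree.predict(X_te))
print(cm)

# 5-fold cross-validated accuracy, computed on the training portion only
cv_scores = cross_val_score(tree, X_tr, y_tr, cv=5, scoring="accuracy")
print("CV accuracy: {0:.3f} (std {1:.3f})".format(cv_scores.mean(), cv_scores.std()))
```

With only 36 test samples, a single misclassification shifts accuracy by almost three percentage points, so the cross-validated scores give a steadier comparison between models than the single split does.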

## Export the results <a id="export"></a>

In [18]:
#TODO: this!
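
This cell is also a placeholder. A minimal sketch of one way to export the metrics, assuming a CSV file is the desired format: the file name `model_summary.csv` and the output directory are hypothetical, and the frame below re-creates only the Model, Accuracy and F1-score values reported earlier so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Values copied from the metrics reported earlier in this notebook
summary = pd.DataFrame(
    {
        "Model": ["Logistic Regression", "Decision Tree Classifier", "Random Forest Classifier"],
        "Accuracy": [0.972222, 0.944444, 0.972222],
        "F1-score": [0.972187, 0.943997, 0.972187],
    }
)

# Hypothetical output location; a temporary directory keeps the sketch side-effect free
output_dir = tempfile.mkdtemp()
output_path = os.path.join(output_dir, "model_summary.csv")

# index=False drops the row index so the file contains only the metric columns
summary.to_csv(output_path, index=False)

# Read the file back to confirm the round trip
restored = pd.read_csv(output_path)
print(restored)
```

In the notebook itself, `model_summary` could be written the same way to whatever directory the project uses for results.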