# Men Classification Model

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

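# 'seaborn-poster' was renamed to 'seaborn-v0_8-poster' in matplotlib 3.6
style = 'seaborn-v0_8-poster' if 'seaborn-v0_8-poster' in plt.style.available else 'seaborn-poster'
plt.style.use(style)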
%matplotlib inline

The data contains the number of items in each category that a customer bought per year.

In [None]:
# import the data
data = pd.read_csv("men_basket.csv")

# print only the first 5 rows (the default for head())
print(data.head())
print(f'We have {data.shape[0]} data samples with {data.shape[1] - 1} features')

Report some descriptive statistics and print the unique labels.

In [None]:
print(data.describe())
data['label'].unique()

In [None]:
X = data[["beer", "pizza"]]
y = data["label"]

If possible, we would like to explore the data before modeling.  
In the scatter plot below, the boundary between the classes seems to be almost linear. That suggests the type of model we want to use to separate ('classify') these samples.

In [None]:
# let's have a look at the data first
colors = ['g', 'r']
symbols = ['^', '*']
plt.figure(figsize = (10,8))

for l, c, s in zip(y.unique(), colors, symbols):
    plt.scatter(X["pizza"][y == l], X["beer"][y == l],
                color = c, marker = s, s = 60,
                label = l)

plt.legend(loc = 2, scatterpoints = 1)
plt.xlabel("n_pizza / year")
plt.ylabel("n_beer / year")
plt.show()

We will now use a scikit-learn model called Gaussian Naive Bayes (`GaussianNB`). You can read more about it at:  
https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes


The API is similar for most algorithms in scikit-learn:  
**Step 1:** initialize the model  
**Step 2:** train the model using the *fit* function  
**Step 3:** predict using the *predict* function  

In [None]:
from sklearn.naive_bayes import GaussianNB

# Initialize the classifier model
clf = GaussianNB()

# Train the classifier with data
clf.fit(X,y)

# Predict on the same data used for training
y_hat = clf.predict(X)
print(y_hat)
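
The same three steps carry over to other scikit-learn estimators. The sketch below is only an illustration: it swaps in `LogisticRegression` (not used elsewhere in this notebook) and runs on a small synthetic table, since `men_basket.csv` is not reproduced here. The names `toy`, `X_toy`, `y_toy`, `lr`, and the labels `"a"`/`"b"` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for men_basket.csv: two well-separated groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "beer": np.concatenate([rng.normal(20, 3, 50), rng.normal(60, 3, 50)]),
    "pizza": np.concatenate([rng.normal(60, 3, 50), rng.normal(20, 3, 50)]),
    "label": ["a"] * 50 + ["b"] * 50,  # placeholder labels
})
X_toy = toy[["beer", "pizza"]]
y_toy = toy["label"]

# Step 1: initialize the model
lr = LogisticRegression(max_iter=1000)

# Step 2: train the model with fit
lr.fit(X_toy, y_toy)

# Step 3: predict with predict
y_hat_toy = lr.predict(X_toy)
print(y_hat_toy[:5])
```

Only the line that constructs the estimator changes; `fit` and `predict` are called exactly as with `GaussianNB`.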

We plot the model's predicted class probability over a grid of points to show its decision boundary.

In [None]:
# Plotting decision regions

x_min, x_max = X["pizza"].min() - 1, X["pizza"].max() + 1
y_min, y_max = X["beer"].min() - 1, X["beer"].max() + 1   
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Column order must match the training features: beer (yy) first, then pizza (xx)
Z = clf.predict_proba(np.c_[yy.ravel(), xx.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

plt.figure(figsize = (10, 8))
plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=.3)

for l, c, s in zip(y.unique(), colors, symbols):
    plt.scatter(X["pizza"][y == l], X["beer"][y == l],
                color = c, marker = s, s = 60,
                label = l)

plt.title("Men Classification")

plt.xlabel('n_pizza')
plt.ylabel('n_beer')
plt.show()
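
The `[:, 1]` in the cell above selects one column of `predict_proba`. As a minimal sketch of what that column means, the example below fits `GaussianNB` on synthetic data and prints `classes_`, whose order determines the column order of `predict_proba`. The arrays `X_toy` and `y_toy` and the labels `"a"`/`"b"` are hypothetical and are not taken from `men_basket.csv`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical toy data: columns are (beer, pizza), labels are placeholders
X_toy = np.vstack([
    rng.normal([20, 60], 3, size=(50, 2)),
    rng.normal([60, 20], 3, size=(50, 2)),
])
y_toy = np.array(["a"] * 50 + ["b"] * 50)

clf_toy = GaussianNB().fit(X_toy, y_toy)

# Column j of predict_proba is the probability of clf_toy.classes_[j]
print(clf_toy.classes_)

# Query one point near the center of each group
proba = clf_toy.predict_proba(np.array([[20.0, 60.0], [60.0, 20.0]]))
print(proba.round(3))
```

Each row of `proba` sums to 1, so with two classes the second column is enough to color the decision regions.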


The boundary found by the model is nearly linear and works well, as it separates most of the samples.  
How could you report 'how good' the model is?
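
One possible answer, sketched under assumptions: report accuracy, a confusion matrix, and the per-class precision and recall from `classification_report` (already imported at the top of this notebook), ideally on samples the model did not see during training. The example uses synthetic data and a held-out split via `train_test_split`; the split, the toy arrays, and the labels `"a"`/`"b"` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical toy data with some overlap: columns are (beer, pizza)
X_toy = np.vstack([
    rng.normal([20, 60], 10, size=(50, 2)),
    rng.normal([60, 20], 10, size=(50, 2)),
])
y_toy = np.array(["a"] * 50 + ["b"] * 50)

# Hold out 30% of the samples so the scores are not measured on training data
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

clf_toy = GaussianNB().fit(X_train, y_train)
y_pred = clf_toy.predict(X_test)

# Fraction of correctly classified test samples
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score per class
print(classification_report(y_test, y_pred))
```

With the notebook's own variables, the same calls would take `y` and `y_hat`, although scores computed on the training data tend to be optimistic.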