# Kelleher 2015, Chapter 6, Exercise 5

In this exercise, we're going to predict the preferred communication channel of policy holders at an insurance company, based on information about them.

Data are available here: 

In [1]:
import numpy as np
import pandas as pd

# Read in the training data AND the new data
input_file = "ch6ex5.csv"
df = pd.read_csv(input_file)

# Extract the new data: the last row of the file is the instance to predict
new_data = df.tail(1)
new_data = new_data.drop(columns=["Occupation", "PrefChannel"])
df = df.drop(df.tail(1).index)

# Get the training data ready to go
target_colname = "PrefChannel"
X = df.drop(target_colname, axis=1)
y = df[target_colname]

## 5a) Equal-Frequency Binning for Age

We saw in the chapter that there are multiple options for Naive Bayes models to handle continuous variables. In Exercise 3, we explored one of those options: we assumed Normality for each conditional distribution, and estimated the mean and standard deviation from the (limited) data we had. 

In this exercise, we'll take a different approach and use **equal-frequency binning** to convert the quantitative variable Age into a categorical variable. With 9 observations and 3 requested levels (young, middle-aged, mature), the youngest three policy holders will be "young", the next 3 will be "middle-aged", and the oldest three will be "mature". (Note that each bin has the same number of observations in it - thus, "equal-frequency.")

This wouldn't be hard to program manually, but pandas has a function to do this for us:

In [2]:
X.Age = pd.qcut(X.Age, 3, labels=["young", "middle-aged", "mature"])
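
For comparison, here is a rough sketch of the manual version next to `pd.qcut`. The ages are made up for illustration (they are an assumption, not the values in `ch6ex5.csv`), and `manual` is just a hypothetical name for the hand-rolled result.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only (assumption: not the CSV's values)
ages = pd.Series([23, 61, 35, 48, 19, 52, 40, 67, 30])
labels = ["young", "middle-aged", "mature"]

# Manual equal-frequency binning: sort positions by age, then cut the
# sorted positions into three equally sized chunks
order = np.argsort(ages.to_numpy(), kind="stable")
manual = np.empty(len(ages), dtype=object)
for label, chunk in zip(labels, np.array_split(order, len(labels))):
    manual[chunk] = label

# The pandas one-liner used in the cell above
binned = pd.qcut(ages, 3, labels=labels)

print(pd.DataFrame({"age": ages, "manual": manual, "qcut": binned}))
```

With nine distinct ages the two approaches agree; with ties at a bin boundary they can differ, because `qcut` cuts on quantile values rather than on row counts.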

## 5b) Excluding Features

**The obvious feature to exclude is "Occupation."** Every person in our training data has a different occupation, so knowing a person's occupation tells us nothing about his/her preferred communication channel. (Plus, dropping Occupation will mean far fewer probabilities to estimate, not that we're going to run into computational issues with a dataset this small.)

Gender can stay - 75% of females prefer phone, whereas only 40% of males prefer phone, so this seems possibly informative.

Age actually doesn't seem very informative: in all three of our categorical buckets, there are 2 of one label and 1 of the other. But it can stay for now.

PolicyType has potential to be informative - 75% of TypeC's prefer phone, only 33% of TypeA's prefer phone, and 50% of TypeB's prefer phone.
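
Percentages like these can be read off a row-normalized cross-tabulation. The sketch below is one way to do that, under a loud assumption: the `Gender` and `PolicyType` columns follow the encoded table printed further down, but the `PrefChannel` labels are one assignment chosen to be consistent with the conditional probabilities listed in 5c, not a copy of the CSV. The name `toy` is hypothetical.

```python
import pandas as pd

# Gender and PolicyType follow the encoded table shown in this write-up.
# ASSUMPTION: the PrefChannel labels are one assignment consistent with the
# probabilities listed in 5c, not the actual contents of ch6ex5.csv.
toy = pd.DataFrame({
    "Gender": ["female", "female", "male", "female", "male",
               "male", "male", "male", "female"],
    "PolicyType": ["C", "A", "A", "B", "C", "A", "C", "B", "C"],
    "PrefChannel": ["email", "phone", "email", "phone", "phone",
                    "email", "phone", "email", "phone"],
})

for feature in ["Gender", "PolicyType"]:
    # normalize="index" makes each row sum to 1, so a cell is the share of
    # that channel within one level of the feature
    shares = pd.crosstab(toy[feature], toy["PrefChannel"], normalize="index")
    print(shares.round(2), end="\n\n")
```

A feature whose rows all look alike carries little information about the target; rows that differ sharply suggest the feature is worth keeping.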

In [3]:
X = X.drop("Occupation", axis=1)

## 5c) Calculating Probabilities for Naive Bayes

Excluding Occupation and using equal-frequency binning for Age, we have the following probabilities:

* P(email) = 4/9
* P(phone) = 5/9
* P(female | email) = 1/4   =>   P(male | email) = 3/4
* P(female | phone) = 3/5   =>   P(male | phone) = 2/5

And so on.
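
To show how these pieces combine, here is a small sketch that multiplies the prior by the conditionals and normalizes. It uses only Gender and PolicyType: the Gender values are the ones listed above, and the PolicyType values are derived from the percentages in 5b (with 9 rows, those imply 3 A's, 2 B's and 4 C's). Age is left out because its conditionals are not listed here, no smoothing is applied, and the query is hypothetical.

```python
from fractions import Fraction as F

prior = {"email": F(4, 9), "phone": F(5, 9)}

# P(gender | channel), as listed above
p_gender = {
    "email": {"female": F(1, 4), "male": F(3, 4)},
    "phone": {"female": F(3, 5), "male": F(2, 5)},
}

# P(policy | channel), derived from the PolicyType percentages in 5b
p_policy = {
    "email": {"A": F(2, 4), "B": F(1, 4), "C": F(1, 4)},
    "phone": {"A": F(1, 5), "B": F(1, 5), "C": F(3, 5)},
}

# Hypothetical query (an assumption, not the row held out in new_data)
query = {"gender": "female", "policy": "A"}

# Naive Bayes score: prior times the product of per-feature conditionals
score = {
    c: prior[c] * p_gender[c][query["gender"]] * p_policy[c][query["policy"]]
    for c in prior
}
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}
print(posterior)  # {'email': Fraction(5, 11), 'phone': Fraction(6, 11)}
```

Exact fractions keep the arithmetic easy to check by hand: the unnormalized scores are 1/18 for email and 1/15 for phone.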

## 5d) Predicting

Unfortunately, older versions of sklearn don't directly handle non-binary categorical features for Naive Bayes (`CategoricalNB` only arrived in version 0.22). sklearn *does*, however, support Bernoulli Naive Bayes.

I did a little reading, and it seems like the strategy is to encode the various factor levels as indicators, and then use those with the BernoulliNB.

https://stackoverflow.com/questions/38621053/how-can-i-use-sklearn-naive-bayes-with-multiple-categorical-features
https://datascience.stackexchange.com/questions/9854/sklearn-naive-bayes-vs-categorical-variables

My first choice for doing this was pandas.get_dummies. But unfortunately, this is just a function, and it lacks a corresponding object-oriented transformer a la sklearn. This would make converting the new_data to the new Bernoulli format tedious.
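
A tiny sketch of why this gets tedious: `get_dummies` only creates columns for the levels it sees, so a single new row comes out with fewer columns than the training data and has to be realigned by hand. The two small frames below are made up for illustration.

```python
import pandas as pd

# Made-up frames for illustration (assumption: not the exercise data)
train = pd.DataFrame({"Gender": ["female", "male", "male"],
                      "PolicyType": ["A", "B", "C"]})
new = pd.DataFrame({"Gender": ["female"], "PolicyType": ["A"]})

train_enc = pd.get_dummies(train)
new_enc = pd.get_dummies(new)

print(list(train_enc.columns))
print(list(new_enc.columns))  # ['Gender_female', 'PolicyType_A']

# Manual fix: force the new row onto the training columns, filling the
# indicator columns it never saw with 0
aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(aligned)
```

A fitted transformer avoids this step because it remembers the training columns and applies them to any later input.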

Instead, I'll do it with sklearn.

In [14]:
# X_dropfirst = pd.get_dummies(X, drop_first=True)
# X_nodropfirst = pd.get_dummies(X)

# https://stackoverflow.com/questions/15021521/how-to-encode-a-categorical-variable-in-sklearn
from sklearn.feature_extraction import DictVectorizer
def one_hot_dataframe(data, cols, replace=False):
    vec = DictVectorizer()
    mkdict = lambda row: dict((col, row[col]) for col in cols)
    vecData = pd.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
    vecData.columns = vec.get_feature_names_out()  # get_feature_names() was removed in newer sklearn
    vecData.index = data.index
    if replace is True:
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData, vec)

X_nodropfirst, _, _ = one_hot_dataframe(X, list(X), replace=True)
X_dropfirst = X_nodropfirst.iloc[:, [0, 1, 3, 5, 6]]  # .ix was removed from pandas; keep one fewer indicator per feature
X_dropfirst


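| | Age=mature | Age=middle-aged | Gender=female | PolicyType=A | PolicyType=B |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |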


In [8]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X_dropfirst, y)
clf.predict(new_data)

ValueError: could not convert string to float: 'A'