<h1>Naive Bayes Classifier</h1>
<h3>Goal:</h3>
<p>Here we will use the Gaussian naive Bayes classifier, imported from sklearn, over census data to predict if a person makes above $50k a year. The data is brought in from Keras.</p>

In [234]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

<p>We need to update the column headers of the dataframe</p>

In [235]:
df = pd.read_csv('adult.csv', header=None)
column_names = ['age',
                'workclass',
                'fnlwgt', 
                'education', 
                'education_num', 
                'marital_status', 
                'occupation',
                'relationship',
                'race',
                'sex',
                'capital_gain', 
                'capital_loss', 
                'hours_per_week', 
                'native_country', 
                'income']
df.columns = column_names

<p>We need to do a quick check over the columns for their data type. If we find categorical data, we will perform One-Hot-Encoding</p>
<p>We also need to check for null values and act accordingly</p>

In [236]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

In [237]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

<p>A quick look into the data shows there are some empty fields which contain input ' ?'</p>
<p>We replace with those question marks with np.NaN and do a dropna. I do not think imputing is a good idea here. Imputing could introduce bias which would be counterproductive.</p>

In [238]:
df['workclass'].replace(' ?', np.NaN, inplace=True)
df['occupation'].replace(' ?', np.NaN, inplace=True)
df['native_country'].replace(' ?', np.NaN, inplace=True)
df.dropna(inplace=True)

In [239]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

<p>Next we split the data and perform one-hot-encoding over the categorical data. We must also transform the labels into something feasible (i.e., 0 and 1)</p>

In [240]:
# We split the data into label and features.
x = df.drop(['income'], axis=1)
y = df['income']
y = np.array(preprocessing.LabelEncoder().fit_transform(y))

# We separate the integer type data and categorical data so we can one-hot-encode the categorical
int_type = list(x.select_dtypes(include=[np.int64]).columns)
categorical = list(x.select_dtypes(include=[object]).columns)
encoded_df = pd.DataFrame()
for c in categorical:
    u = x[c].unique().tolist()
    new_names = [f"{c}_{i}" for i in u]
    ehc = preprocessing.OneHotEncoder(sparse=False).fit(x[c].to_numpy().reshape(-1,1))
    transformed = ehc.transform(x[c].to_numpy().reshape(-1,1))
    encoded_df[new_names] = transformed

# Join the data to properly maintain row structure when dropping nulls
encoded_df[int_type] = x[int_type]
encoded_df['income'] = y
encoded_df.dropna(inplace=True)

# Reseparate the data and perform a split
x = encoded_df.drop(['income'], axis=1)
y = encoded_df['income']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

  self[k1] = value[k2]
  encoded_df['income'] = y


<h3>Model</h3>
<p>Next we build the model, fit the data, make predictions, and lastly check the accuracy of our prediction.</p>
<p>We could go further and check the precision, recall, and F1 score, although this is merely a simple toy project to implement naive Bayes.</p>

In [241]:
model = GaussianNB()
model.fit(x_train, y_train)
pred = model.predict(x_test)

<p>The accuracy is not amazing but also isn't horrible. It is possible the choice of Gaussian naive Bayes is fundamentally wrong and the data is distributed differently.</p>

In [244]:
acc = accuracy_score(y_test, pred)
acc

0.7428162632645762