# Naive Bayes Categorical 

The notebook uses the same dataset (automobile.csv) as in the previous threads.

The notebook uses a version of Naive Bayes from the sklearn library:

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

This version works only with categorical features, like the version we learned in class (which you should review).

The first step is to execute the notebook in an environment with access to the automobile.csv file. Read all the comments and text in the notebook carefully; don't just zap through the cells. If you do not understand something, you are encouraged to either read about it in the Python/Pandas/sklearn documentation or ask in the thread (do not be shy, you are not graded on your previous knowledge).

The last part of the notebook computes training and validation accuracy of a trained Naive Bayes classifier. For some cases, the part that computes the validation accuracy will crash (!!). Please share the accuracies you get in the thread, and if your execution crashes, please share that in the thread as well.

If your execution crashed, try to guess why. If it didn't crash, but you want it to crash, try to run again and again (with different splits) until it does.

In the next forum we will fix the crash and look more closely at what Naive Bayes can give us.

**Advanced:** If you figured out why it crashes, see if you can figure out from the sklearn Naive Bayes documentation how to fix this.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("automobile.csv", na_values='?')

In [None]:
df.head(10)

In [None]:
df.describe(include="all")

In [None]:
# Let's see the types of the columns
df.dtypes

The target variable is Risk, ranging from -2 to 3. For this notebook, we will convert it to "0" for (-2, -1, 0) and "1" for (1, 2, 3).

In [None]:
df["Risk"] = df["Risk"].apply(lambda x: 0 if x<=0 else 1)
df["Risk"].hist()

Let's keep only the categoricals.

In [None]:
# Take only the non-numeric (categorical) columns, plus the target 'Risk'
from pandas.api.types import is_numeric_dtype
categoricals = [c for c in df.columns if ((not is_numeric_dtype(df[c])) or c=='Risk')]
df_cat = df[categoricals].astype(str).astype('category')

In [None]:
df_cat.dtypes

In [None]:
## JK: alternative way to select 'obj'
df_cat1 = df.select_dtypes(['object'])

#changing dtype to 'category'
df_cat1 = df_cat1.astype('category')
df_cat1.dtypes

In [None]:
df_cat1.head(10)

In [None]:
## JK: mfi and spfi each occur only once; a value that ends up only in the validation split can crash the model
df_cat1['Fuel-system'].value_counts(sort=True)

For Pandas categoricals, a missing value is not a category, it is just a missing value.

Let's replace the missing values with a new category "zzz".  This way, we are using the information that a value is missing as a category by itself.  As we discussed in class, this may be useful, especially if values are missing "not at random".

In [None]:
def add_dummy_category(series):
    # Register 'zzz' as a valid category first; fillna would otherwise reject it
    series = series.cat.add_categories(['zzz'])
    series = series.fillna('zzz')
    return series

In [None]:
for c in df_cat.columns: 
    df_cat[c] = add_dummy_category(df_cat[c])

In [None]:
### The 2 missing values were turned into the string 'nan' (most likely by astype(str) above),
### so they count as a regular category and were not replaced with 'zzz'
df_cat['Num-of-doors'].value_counts()

df_cat[df_cat['Num-of-doors'] == 'nan']

In [None]:
df_cat['Num-of-doors'].value_counts()
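
The 'nan' rows above most likely come from the earlier `astype(str)`, which can turn a missing value into the literal string 'nan' before the column becomes categorical. Here is a minimal sketch on a made-up column (not the real data), assuming we convert to 'category' directly so that missing values stay missing and can be replaced with 'zzz'. In the notebook the numeric Risk column would still need its own handling.

```python
import numpy as np
import pandas as pd

# Made-up column standing in for df['Num-of-doors'].
doors = pd.Series(['two', 'four', np.nan, 'two', np.nan])

# Converting straight to 'category' leaves the missing entries as NaN ...
doors_cat = doors.astype('category')

# ... so the dummy category can actually replace them.
doors_cat = doors_cat.cat.add_categories(['zzz']).fillna('zzz')

print(doors_cat.value_counts())
```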

In [None]:
from sklearn.naive_bayes import CategoricalNB #Naive Bayes (categorical)
from sklearn.model_selection import train_test_split

In [None]:
df_cat.head()

In [None]:
train, val = train_test_split(df_cat, train_size=0.7)
X_train = train.drop('Risk', axis=1)
y_train = train['Risk']
X_val = val.drop('Risk', axis=1)
y_val = val['Risk']

In [None]:
clf = CategoricalNB()

In [None]:
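## JK: fixing the error by making the model allocate enough categories per feature,
## even if some of them are missing from the training split
## (note: 3 is too few for columns with more categories, such as Fuel-system)

#clf = CategoricalNB(min_categories=3)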

In [None]:
clf.fit(X_train, y_train)

Oops, sklearn's categorical Naive Bayes implementation doesn't like string values. We have to convert all strings to numbers.

In [None]:
def categorical_to_int(series):
  categories = series.cat.categories
  categories = categories.sort_values()
  return series.replace(to_replace = categories, value = range(len(categories))).astype('string').astype('int32')

# see what happens if you try to convert to int32 without first converting to string...


In [None]:
for c in df_cat.columns: df_cat[c] = categorical_to_int(df_cat[c])
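
As a possible alternative to the `replace`-based helper, Pandas categoricals already carry an integer code for every value. The sketch below uses a made-up column (the name `fuel` is only for illustration). Note that the codes follow the order of `cat.categories` rather than a re-sorted order, and that a missing value gets the code -1, so missing values should be filled before this step.

```python
import pandas as pd

# Made-up column standing in for one column of df_cat.
fuel = pd.Series(['gas', 'diesel', 'gas', 'zzz'], dtype='category')

# .cat.codes maps each value to the position of its category
# in fuel.cat.categories (here: diesel -> 0, gas -> 1, zzz -> 2).
fuel_codes = fuel.cat.codes.astype('int32')

print(list(fuel.cat.categories))
print(fuel_codes.tolist())
```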


In [None]:
train, val = train_test_split(df_cat, train_size=0.7)
X_train = train.drop('Risk', axis=1)
y_train = train['Risk']
X_val = val.drop('Risk', axis=1)
y_val = val['Risk']

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_train, y_train) # check accuracy

In [None]:
clf.score(X_val, y_val)

In some cases the last line of code crashes.  Can you see why?

In [None]:
print(X_train.nunique())
print('')
print(y_train.nunique())
print('')
print(X_val.nunique())
print('')
print(y_val.nunique())


In [None]:
df_cat_unique = df_cat.nunique()
x_train_unique = X_train.nunique()
x_val_unique = X_val.nunique()

comparing = pd.DataFrame({'df_cat': df_cat_unique, 'x_train': x_train_unique, 'x_val': x_val_unique})  # .transpose()
comparing
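
Counting unique values shows that the splits differ, but it does not say which values are missing. As far as I understand, CategoricalNB infers the number of categories of each feature from the largest code it sees during fit, so comparing the maximum codes may point more directly at the risky columns. A small sketch on made-up, integer-coded splits (the `_toy` names and values are hypothetical):

```python
import pandas as pd

# Made-up integer-coded splits standing in for X_train and X_val.
X_train_toy = pd.DataFrame({'Fuel-type': [0, 1, 1, 0], 'Fuel-system': [0, 1, 2, 1]})
X_val_toy = pd.DataFrame({'Fuel-type': [1, 0], 'Fuel-system': [3, 1]})

# A column is at risk when validation holds a code larger than any code seen in training.
max_codes = pd.DataFrame({'train_max': X_train_toy.max(), 'val_max': X_val_toy.max()})
max_codes['at_risk'] = max_codes['val_max'] > max_codes['train_max']

print(max_codes)
```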

In [None]:
print(X_train['Fuel-system'].value_counts())
print("")
print(X_val['Fuel-system'].value_counts())
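
To see the failure in isolation, here is a minimal sketch on made-up data (not the automobile dataset): the model is trained on codes 0 and 1 only and is then asked about code 2. It also tries the `min_categories` parameter from the CategoricalNB documentation as one possible fix. The exact exception type may depend on the sklearn version, so the sketch catches both `IndexError` and `ValueError`.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Made-up data: one feature whose training codes are 0 and 1 only.
X_train_toy = np.array([[0], [1], [0], [1]])
y_train_toy = np.array([0, 1, 0, 1])
X_val_toy = np.array([[2]])  # code 2 never appears in training

clf_default = CategoricalNB().fit(X_train_toy, y_train_toy)
try:
    clf_default.predict(X_val_toy)
    crashed = False
except (IndexError, ValueError) as err:
    crashed = True
    print('default model failed:', err)

# Tell the model up front that the feature has at least 3 categories.
clf_fixed = CategoricalNB(min_categories=3).fit(X_train_toy, y_train_toy)
pred = clf_fixed.predict(X_val_toy)
print('with min_categories:', pred)
```

For the notebook's data, one option (an assumption, not verified here) would be to pass a per-feature array computed from the full dataset, such as `(df_cat.drop('Risk', axis=1).max() + 1).to_numpy()`, as `min_categories`.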