## NAIVE BAYES CLASSIFICATION PROBLEM

### Algorithm Intuition :

Given a data point, the algorithm uses Bayes' theorem to compute the probability that the point belongs to each class. It then outputs the class with the highest probability as the predicted class.

The Naive Bayes algorithm rests on two assumptions. First, the features are independent of one another given the class, i.e. no pair of features is dependent.
Second, each feature is given the same importance (weight) in predicting the outcome.
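
The intuition above can be written compactly. The notation here is introduced only for illustration: $y$ is a class and $x_1, \dots, x_n$ are the feature values of one data point. The first step is Bayes' theorem, the second applies the independence assumption and drops the denominator, which is the same for every class.

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
  \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```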

***
***

Types of Naive Bayes:

- Gaussian Naive Bayes algorithm

    When attribute values are continuous, we assume that the values associated with each class follow a Gaussian (normal) distribution. For example, suppose the training data contains a continuous attribute x. We first segment the data by class, and then compute the mean and variance of x in each class.
    
- Multinomial Naive Bayes algorithm

    With a Multinomial Naive Bayes model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p1, ..., pn), where pi is the probability that event i occurs. It is the preferred variant for data that is multinomially distributed, and it is one of the standard algorithms used in text classification.

- Bernoulli Naive Bayes algorithm

    In the multivariate Bernoulli event model, features are independent boolean variables (binary variables) describing inputs. Just like the multinomial model, this model is also popular for document classification tasks where binary term occurrence features are used rather than term frequencies.
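
To make the Gaussian variant described above concrete, here is a minimal pure-Python sketch for a single continuous feature. It segments the data by class, computes the mean and variance per class, and scores a new value with the normal density. The function names and the toy ages are hypothetical and are not part of this notebook's dataset or of any library.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(xs, ys):
    """Return {class: (prior, mean, variance)} for one continuous feature."""
    by_class = defaultdict(list)
    for x, y in zip(xs, ys):
        by_class[y].append(x)

    params = {}
    for label, values in by_class.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        params[label] = (len(values) / len(xs), mean, var)
    return params


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def predict(params, x):
    """Pick the class with the largest prior * likelihood."""
    scores = {
        label: prior * gaussian_pdf(x, mean, var)
        for label, (prior, mean, var) in params.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data: ages labelled with an income class.
ages = [22, 25, 28, 45, 50, 55]
income = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

params = fit_gaussian_nb(ages, income)
print(params)
print(predict(params, 24), predict(params, 52))
```

With several features, the per-feature densities would be multiplied together (or their logarithms summed), which is where the independence assumption comes in.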


### Applications of Naive Bayes :

It is a fast classification algorithm that works well with large amounts of data. It is used in applications such as :

- Spam filtering
- Text classification
- Sentiment analysis
- Recommender systems

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
## "skipinitialspace = True" skips the spaces that follow each delimiter, so cell values load without leading white space

column_names = ['age', 'workclass', 'fnlwgt' , 'education', 'education_num' , 'marital_status', 'occupation', 'relationship', 'race', 'sex','capital_gain' ,'capital_loss' , 'hours_per_week', 'native_country', 'income']
data_df = pd.read_csv('adult.csv', names=column_names, skipinitialspace = True)
data_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
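
The effect of `skipinitialspace` can be checked on a small in-memory sample. This sketch assumes the raw file separates fields with a comma followed by a space; the two rows below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row extract in a "comma + space" layout.
raw = "39, State-gov, <=50K\n50, ?, >50K\n"
names = ["age", "workclass", "income"]

untrimmed = pd.read_csv(io.StringIO(raw), names=names)
trimmed = pd.read_csv(io.StringIO(raw), names=names, skipinitialspace=True)

# Without the flag the leading space stays in the string,
# so a later comparison with '?' would silently fail.
print(repr(untrimmed.loc[1, "workclass"]))
print(repr(trimmed.loc[1, "workclass"]))
```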


In [3]:
# data_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
# data_df.head()

#### Check count of unique values in each column.

In [4]:
data_df.nunique().sort_values()

sex                   2
income                2
race                  5
relationship          6
marital_status        7
workclass             9
occupation           15
education            16
education_num        16
native_country       42
age                  73
capital_loss         92
hours_per_week       94
capital_gain        119
fnlwgt            21648
dtype: int64

In [5]:
cat_columns = data_df.select_dtypes(include=['object']).columns
num_columns = data_df.select_dtypes(exclude=['object']).columns

print(f"Categorical columns : {cat_columns}")
print(f"Integer columns : {num_columns}")

Categorical columns : Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'income'],
      dtype='object')
Integer columns : Index(['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week'],
      dtype='object')


In [6]:
for col in cat_columns:
    print(f"\nNo of unique values in column \'{col.upper()}\' : {data_df[col].nunique()} and their values :\n{'####'*30} \n {data_df[col].unique()}")


No of unique values in column 'WORKCLASS' : 9 and their values :
######################################################################################################################## 
 ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']

No of unique values in column 'EDUCATION' : 16 and their values :
######################################################################################################################## 
 ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']

No of unique values in column 'MARITAL_STATUS' : 7 and their values :
######################################################################################################################## 
 ['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']

No of unique values 

## From the above output, columns 'WORKCLASS', 'OCCUPATION' and 'NATIVE_COUNTRY' contain unknown values represented as '?'

## Looking for Null values in the dataset

In [7]:
data_df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

In [8]:
for col in data_df.columns:
    if data_df[col].dtype=='O':
        print(f'{col} -- ', data_df[col].dtype, data_df[col].dtype=='O') 
    
print("\n\n", data_df.select_dtypes(include=['object']).columns)

workclass --  object True
education --  object True
marital_status --  object True
occupation --  object True
relationship --  object True
race --  object True
sex --  object True
native_country --  object True
income --  object True


 Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'income'],
      dtype='object')


In [9]:
no_rows, no_cols = data_df.shape
no_rows

32561

### Finding count of unknown/null values in columns represented as '?'

In [10]:
cols_with_qm = ['workclass', 'occupation', 'native_country']

for col in cols_with_qm:
    print(f"Count of values with cell value as '?' for column {col.upper()} : {(data_df.loc[:, col]=='?').sum()} i.e {((data_df.loc[:, col]=='?').sum()/no_rows)*100} percentage of total values")


Count of values with cell value as '?' for column WORKCLASS : 1836 i.e 5.638647461687294 percentage of total values
Count of values with cell value as '?' for column OCCUPATION : 1843 i.e 5.660145572924664 percentage of total values
Count of values with cell value as '?' for column NATIVE_COUNTRY : 583 i.e 1.7904855501980899 percentage of total values


## Now replacing these question marks with np.nan values

In [11]:
# Assign the result back instead of calling replace(inplace=True) on a column selection
data_df[cols_with_qm] = data_df[cols_with_qm].replace('?', np.nan)

data_df[cols_with_qm].isnull().sum()

workclass         1836
occupation        1843
native_country     583
dtype: int64
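
An alternative, shown here only as a sketch on made-up rows, is to let `pd.read_csv` turn the `'?'` placeholders into missing values while loading, via its `na_values` parameter. The column subset and sample rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical rows; the second one has two unknown fields.
raw = "39, State-gov, United-States\n50, ?, ?\n"
names = ["age", "workclass", "native_country"]

df = pd.read_csv(
    io.StringIO(raw),
    names=names,
    skipinitialspace=True,  # strip the space after each comma first
    na_values="?",          # then treat a bare '?' as missing
)
print(df.isnull().sum())
```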

## Filling missing values using Imputation techniques

In [12]:
data_df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
income               0
dtype: int64

## Encoding labels

In [13]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
data_df['income']= label_encoder.fit_transform(data_df['income'])
data_df['income'].unique()

array([0, 1])

In [14]:
data_df['income'].value_counts()

0    24720
1     7841
Name: income, dtype: int64
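
To see which income band each integer stands for, the encoder's `classes_` attribute can be inspected. The sketch below uses a fresh encoder on three sample labels rather than the notebook's `label_encoder`.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["<=50K", ">50K", "<=50K"])

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
print(encoder.inverse_transform([0, 1]))
```

So in the counts above, `0` corresponds to `<=50K` and `1` to `>50K`.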

In [15]:
labels = data_df['income']
data_df.drop(['income'], inplace=True, axis=1)
data_df.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States


## Train Test Split data

In [16]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(data_df, labels, test_size=0.2, random_state=567, stratify=labels)

In [17]:
train_x.shape

(26048, 14)
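
The `stratify=labels` argument keeps the class proportions the same in both splits, which matters here because the two income classes are imbalanced. A small check on synthetic labels (hypothetical data, same `test_size` and `random_state` as above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% class 0, 25% class 1.
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=567, stratify=y
)

# The share of class 1 is preserved in both parts.
print(y_tr.mean(), y_te.mean())
```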

In [18]:
cat_cols = train_x.select_dtypes(include=['object']).columns
int_cols = train_x.select_dtypes(exclude=['object']).columns

print(cat_cols)
print(int_cols)

Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country'],
      dtype='object')
Index(['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week'],
      dtype='object')


## Imputation techniques

### MICE (Multi-Variate Imputation by Chained Equations) method

In [19]:
train_x[int_cols].head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
23429,36,215392,11,0,0,40
26543,41,193882,13,0,0,40
14032,48,155372,9,0,0,36
13229,53,83434,13,0,0,21
4897,36,398931,13,0,1485,50


In [20]:
from impyute.imputation.cs import mice

In [21]:
# Work on an explicit copy so the assignment below does not target a slice of data_df
train_x = train_x.copy()
train_x[int_cols] = train_x[int_cols].astype('float64')


In [22]:
train_x[int_cols].isnull().sum()

age               0
fnlwgt            0
education_num     0
capital_gain      0
capital_loss      0
hours_per_week    0
dtype: int64

In [23]:
## Since from the above output there are no null values in the numerical columns, we ignore this step

# train_x_int_mice_imputed = mice(train_x[int_cols].values)

### Using Simple Imputer for categorical columns with null values

In [24]:
from sklearn.impute import SimpleImputer

simp_imp = SimpleImputer(strategy='constant', fill_value='missing')
train_x = pd.DataFrame(data=simp_imp.fit_transform(train_x), columns=train_x.columns)


In [25]:
# Reuse the imputer fitted on the training data; do not refit it on the test set
test_x = pd.DataFrame(data=simp_imp.transform(test_x), columns=test_x.columns)

In [26]:
train_x.workclass.value_counts()

Private             18168
Self-emp-not-inc     2027
Local-gov            1686
missing              1450
State-gov            1033
Self-emp-inc          898
Federal-gov           768
Without-pay            12
Never-worked            6
Name: workclass, dtype: int64
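
A variation worth considering, sketched on a made-up frame, is to impute only the categorical columns so that the numeric columns keep their dtypes, and to fit the imputer on the training split only. The `most_frequent` strategy is used here purely to show that the fill value learned from the training data is the one applied to the test data; the frame and column values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": pd.Series(["Private", np.nan, "Private"], dtype=object),
})
test = pd.DataFrame({
    "age": [28],
    "workclass": pd.Series([np.nan], dtype=object),
})
cat_cols = ["workclass"]

imputer = SimpleImputer(strategy="most_frequent")
# Learn the fill value from the training split only ...
train[cat_cols] = imputer.fit_transform(train[cat_cols])
# ... and apply that same value to the test split.
test[cat_cols] = imputer.transform(test[cat_cols])

print(train)
print(test)
```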

### KNN Method

In [27]:
# from fancyimpute import KNN

# imputer = KNN()
# train_x_imputed = pd.DataFrame(np.round(imputer.fit_transform(train_x)),columns = train_x.columns)
# train_x_imputed.head()

### SKlearn KNN Impute method

In [28]:
# from sklearn.impute import KNNImputer
# from sklearn.preprocessing import LabelEncoder
                                     
# knn = KNNImputer(n_neighbors=10, add_indicator=True)
# # knn = KNNImputer(missing_values=np.nan, n_neighbors=5, weights='uniform', metric='nan_euclidean')

# knn.fit(train_x)
# train_x[cat_cols] = pd.DataFrame(data=knn.fit_transform(train_x[cat_cols]), columns=train_x.columns)

# train_x.head()

In [29]:
# from sklearn.impute import SimpleImputer
# strategies = ['mean', 'median', 'most_frequent', 'constant']

# sim_imp = SimpleImputer(strategy='most_frequent')

# for col in cols_with_qm:
#     data_df[col]=sim_imp.fit_transform(data_df[col].values.reshape(-1,1))[:, 0]

# data_df[cols_with_qm].head()

In [30]:
train_x.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
dtype: int64

## Encoding Categorical data

In [31]:
import category_encoders as ce

In [32]:
cat_enc = ce.OneHotEncoder(cols=cat_cols, return_df=True)

train_x = cat_enc.fit_transform(train_x)
test_x = cat_enc.transform(test_x)

In [33]:
train_x.head(5)

Unnamed: 0,age,workclass_1,workclass_2,workclass_3,workclass_4,workclass_5,workclass_6,workclass_7,workclass_8,workclass_9,...,native_country_33,native_country_34,native_country_35,native_country_36,native_country_37,native_country_38,native_country_39,native_country_40,native_country_41,native_country_42
0,36,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,41,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,48,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,53,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,36,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
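
For comparison, scikit-learn's own `OneHotEncoder` does the same job. This is a sketch of an alternative, not the encoder used in this notebook; `handle_unknown="ignore"` makes a category that appears only in the test data encode as all zeros instead of raising an error. The sample values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([["Private"], ["State-gov"], ["Private"]])
test_cats = np.array([["State-gov"], ["Never-worked"]])  # second value is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; densify it for display.
train_encoded = encoder.fit_transform(train_cats).toarray()
test_encoded = encoder.transform(test_cats).toarray()

print(encoder.categories_[0])
print(test_encoded)
```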


## Scaling Features

In [34]:
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

columns = train_x.columns

# scaler = RobustScaler()
# train_x  = pd.DataFrame( scaler.fit_transform(train_x), columns=train_x.columns)
# test_x = pd.DataFrame( scaler.transform(test_x), columns=test_x.columns)

scaler = MinMaxScaler()
train_x  = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)

In [35]:
# MinMaxScaler returns a NumPy array, so wrap it in a DataFrame to preview it
pd.DataFrame(train_x, columns=columns).head(3)
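
By default `MinMaxScaler` maps each column to the range [0, 1] using the minimum and maximum seen during `fit`. That keeps every training feature non-negative, which `MultinomialNB` requires. The arithmetic can be sketched with NumPy on one made-up column:

```python
import numpy as np

# Hypothetical values of one numeric column.
train_col = np.array([17.0, 40.0, 90.0])
test_col = np.array([30.0, 95.0])

# The minimum and maximum come from the training data only.
lo, hi = train_col.min(), train_col.max()

train_scaled = (train_col - lo) / (hi - lo)
test_scaled = (test_col - lo) / (hi - lo)

print(train_scaled)
print(test_scaled)
```

Note that a test value outside the training range, like 95 here, lands outside [0, 1].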

## Training Machine Learning Model using Naive Bayes Algorithm

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# nbc = GaussianNB()
nbc = MultinomialNB()
# nbc = BernoulliNB()

hist = nbc.fit(train_x, train_y)

In [None]:
test_preds = hist.predict(test_x)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve, recall_score, precision_score, roc_auc_score, roc_curve, classification_report 

acc_sc = accuracy_score( test_y, test_preds)
print(f"Accuracy Score : {acc_sc}")

In [None]:
print(f"Training Accuracy : {accuracy_score(train_y, hist.predict(train_x))}")

In [None]:
cls_rep = classification_report(test_y, test_preds)
print(cls_rep)

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cvs = cross_val_score(nbc, train_x, train_y, cv=10, scoring='roc_auc')

print(f"Cross Validation Scores : {cvs}")
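
One possible refinement, shown as a sketch on synthetic data, is to cross-validate a pipeline that contains the scaler, so that the scaler is refitted on the training part of every fold. The data and the choice of `cv=5` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic non-negative features; the label depends on the first column.
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 4)).astype(float)
y = (X[:, 0] > 25).astype(int)

# Scaling happens inside each fold because it is part of the estimator.
model = make_pipeline(MinMaxScaler(), MultinomialNB())
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print(f"Mean ROC AUC : {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and standard deviation is often easier to read than the raw list of fold scores.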

## Confusion Matrix

In [None]:
conf_mat = confusion_matrix(test_y, test_preds)
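# confusion_matrix puts actual classes on the rows and predicted classes on the columns, ordered 0 (<=50K) then 1 (>50K)
conf_mat = pd.DataFrame(conf_mat, index=["Actual <=50K", "Actual >50K"], columns=["Predicted <=50K", "Predicted >50K"])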

sns.heatmap(conf_mat, annot=True, fmt='d', cmap='RdYlBu')
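
When reading the heatmap, it helps to remember the layout that `confusion_matrix` returns: rows are the true classes, columns are the predicted classes, and the classes are sorted, so class 0 comes first. A small example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = (int(v) for v in matrix.ravel())
print(matrix)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```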

## ROC CURVE

In [None]:
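from sklearn.metrics import roc_curve

# Use the predicted probability of class 1 rather than hard 0/1 labels, so the curve is traced over many thresholds
test_probs = hist.predict_proba(test_x)[:, 1]
fpr, tpr, thresh = roc_curve(test_y, test_probs, pos_label=1)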
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], 'k--')
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve for predicting salaries")
plt.show()
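
A ROC curve is most informative when it is computed from scores or probabilities, because hard 0/1 predictions leave only a single interior point. The sketch below compares the two on made-up scores; the 0.5 cut-off is an assumption for the example.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.35, 0.7, 0.8]

# Hard labels obtained by thresholding the scores at 0.5.
hard = [int(s >= 0.5) for s in scores]

auc_scores = roc_auc_score(y_true, scores)
auc_hard = roc_auc_score(y_true, hard)

fpr_hard, tpr_hard, _ = roc_curve(y_true, hard)

print(f"AUC from probabilities : {auc_scores:.3f}")
print(f"AUC from hard labels   : {auc_hard:.3f}")
print(f"Points on the hard-label curve : {len(fpr_hard)}")
```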