<a class="anchor" id="0"></a>
# **Naive Bayes Classifier in Python**


Hello friends,

In machine learning, Naïve Bayes classification is a straightforward and powerful algorithm for classification tasks. In this kernel, I implement the Naive Bayes classification algorithm with Python and Scikit-Learn. I build a Naive Bayes classifier to predict whether a person makes over 50K a year.

So, let's get started.

**As always, I hope you find this kernel useful and your <font color="red"><b>UPVOTES</b></font> would be highly appreciated**.


# **2. Naive Bayes algorithm intuition** <a class="anchor" id="2"></a>

[Table of Contents](#0.1)


The Naïve Bayes classifier uses Bayes’ theorem to predict membership probabilities for each class, that is, the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class. This is also known as the **Maximum A Posteriori (MAP)** estimate.

The **MAP for a hypothesis A given evidence B is**

**MAP (A)**

= max (P (A | B))

= max (P (B | A) * P (A) / P (B))

= max (P (B | A) * P (A))


Here, the maximum is taken over the candidate hypotheses A, and P (B) is the evidence probability. It is used to normalize the result. It is the same for every candidate A, so removing it does not change which A attains the maximum.
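
To make the last step concrete, here is a minimal sketch with made-up numbers. The priors and likelihoods below are hypothetical values chosen for illustration, not estimates from the census data. It checks that the class picked from the unnormalised scores P (B | A) * P (A) is the same as the class picked from the full posteriors.

```python
# Hypothetical priors P(A) and likelihoods P(B | A) for two classes
# (illustrative numbers only, not taken from the dataset).
priors = {"<=50K": 0.76, ">50K": 0.24}
likelihoods = {"<=50K": 0.10, ">50K": 0.40}

# Unnormalised scores: P(B | A) * P(A)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(B) is the sum of the unnormalised scores over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# Dividing every score by the same constant does not change the winner.
map_from_scores = max(scores, key=scores.get)
map_from_posteriors = max(posteriors, key=posteriors.get)

print(map_from_scores, map_from_posteriors)
```

Both selections agree because every score is divided by the same positive constant.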


The Naïve Bayes classifier assumes that all the features are independent of each other given the class. Within a class, the presence or absence of a feature does not influence the presence or absence of any other feature.


In real-world datasets, we test a hypothesis given multiple pieces of evidence (the features), so the calculations become quite complicated. To simplify the work, the feature independence assumption is used to decouple the evidence and treat each feature as independent.
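
Under the independence assumption, the score of a class is its prior multiplied by one likelihood per feature. The sketch below illustrates this with hypothetical per-feature likelihoods (the numbers are invented for the example). It sums logarithms instead of multiplying probabilities, a common way to avoid numeric underflow when there are many features.

```python
import math

# Hypothetical per-feature likelihoods P(x_j | class) for one record
# with three features (illustrative numbers only).
feature_likelihoods = {
    "<=50K": [0.5, 0.2, 0.9],
    ">50K": [0.3, 0.6, 0.4],
}
priors = {"<=50K": 0.76, ">50K": 0.24}


def naive_log_score(cls):
    # log P(class) + sum_j log P(x_j | class): the product of the
    # independent likelihoods becomes a sum of logs.
    return math.log(priors[cls]) + sum(math.log(p) for p in feature_likelihoods[cls])


log_scores = {cls: naive_log_score(cls) for cls in priors}
predicted = max(log_scores, key=log_scores.get)
print(predicted)
```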


# **3. Types of Naive Bayes algorithm** <a class="anchor" id="3"></a>

[Table of Contents](#0.1)


There are 3 types of Naïve Bayes algorithm. They are listed below:

  1. Gaussian Naïve Bayes

  2. Multinomial Naïve Bayes

  3. Bernoulli Naïve Bayes

These 3 types of algorithm are explained below.


## **Gaussian Naïve Bayes algorithm**


When we have continuous attribute values, we assume that the values associated with each class are distributed according to a Gaussian (Normal) distribution. For example, suppose the training data contains a continuous attribute x. We first segment the data by class, and then compute the mean and variance of x in each class. Let µi be the mean and σi² be the variance of the values associated with the ith class. Suppose we have some observed value xi. Then the probability density of xi given class i can be computed by the following equation:


![Gaussian Naive Bayes algorithm](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQEWCcq1XtC1Yw20KWSHn2axYa7eY-a0T1TGtdVn5PvOpv9wW3FeA&s)
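
As a rough sketch of this computation, the snippet below evaluates the standard Gaussian density, P(x | class i) = exp(-(x - µi)² / (2σi²)) / sqrt(2πσi²), for a single feature. The `ages` list and the helper name `gaussian_likelihood` are assumptions made for illustration only.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Density of a Normal(mean, variance) distribution evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical values of one continuous attribute within a single class.
ages = [25.0, 35.0, 45.0, 55.0]

# Per-class mean and (population) variance of the attribute.
mean = sum(ages) / len(ages)
variance = sum((a - mean) ** 2 for a in ages) / len(ages)

# The density is highest at the class mean and falls off symmetrically.
density_at_mean = gaussian_likelihood(mean, mean, variance)
density_at_30 = gaussian_likelihood(30.0, mean, variance)
print(mean, variance, density_at_mean, density_at_30)
```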

## **Multinomial Naïve Bayes algorithm**

With a Multinomial Naïve Bayes model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p1, . . . ,pn) where pi is the probability that event i occurs. Multinomial Naïve Bayes is best suited to data that is multinomially distributed, such as word counts. It is one of the standard algorithms used in text classification.
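
A small sketch of the multinomial model on count data, assuming scikit-learn is installed. The tiny word-count matrix and the "spam"/"ham" labels are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts per document.
# Columns: "free", "offer", "meeting", "report"
X_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
])
y_labels = np.array(["spam", "spam", "ham", "ham"])

# alpha=1.0 is the default Laplace smoothing, so unseen words do not
# force a zero probability.
clf = MultinomialNB(alpha=1.0)
clf.fit(X_counts, y_labels)

# A new document that mentions "free" twice and "offer" once.
prediction = clf.predict(np.array([[2, 1, 0, 0]]))[0]
print(prediction)
```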

## **Bernoulli Naïve Bayes algorithm**

In the multivariate Bernoulli event model, features are independent boolean variables (binary variables) describing inputs. Just like the multinomial model, this model is also popular for document classification tasks where binary term occurrence features are used rather than term frequencies.
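
The Bernoulli model only looks at whether a term occurs, not how often. Here is a minimal sketch, again assuming scikit-learn and using an invented word-count matrix that is reduced to 0/1 occurrence features before fitting.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word counts per document.
# Columns: "free", "offer", "meeting", "report"
X_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
])
y_labels = np.array(["spam", "spam", "ham", "ham"])

# Keep only presence or absence of each term.
X_binary = (X_counts > 0).astype(int)

clf = BernoulliNB()
clf.fit(X_binary, y_labels)

# A new document containing "free" and "offer" but not the other terms.
prediction = clf.predict(np.array([[1, 1, 0, 0]]))[0]
print(prediction)
```

Unlike the multinomial model, the Bernoulli model also penalises a class for terms that are absent from the document.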

# **4. Applications of Naive Bayes algorithm** <a class="anchor" id="4"></a>

[Table of Contents](#0.1)



Naïve Bayes is one of the most straightforward and fastest classification algorithms. It is very well suited to large volumes of data. It is successfully used in various applications such as:

1. Spam filtering
2. Text classification
3. Sentiment analysis
4. Recommender systems

It uses Bayes’ theorem to predict the class of unseen records.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline



In [2]:
data=pd.read_csv(r"C:\Users\Murahari Chavali\Desktop\colab\Colab Notebooks\week7\1.Augest\30th, 31st\project\adult.csv")
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [3]:
data.shape

(32561, 15)

In [4]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [5]:
data.columns=['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship','race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [7]:
cat=[var for var in data.columns if data[var].dtype=="O"]

print(len(cat))
print(cat)

9
['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']


In [8]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

In [9]:
for i in cat:
    print(data[i].value_counts())

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital_status, dtype: int64
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         377

In [10]:
for i in cat:
    # np.float was removed in NumPy 1.24; plain division already yields floats
    print(data[i].value_counts() / len(data))

Private             0.697030
Self-emp-not-inc    0.078038
Local-gov           0.064279
?                   0.056386
State-gov           0.039864
Self-emp-inc        0.034274
Federal-gov         0.029483
Without-pay         0.000430
Never-worked        0.000215
Name: workclass, dtype: float64
HS-grad         0.322502
Some-college    0.223918
Bachelors       0.164461
Masters         0.052916
Assoc-voc       0.042443
11th            0.036086
Assoc-acdm      0.032769
10th            0.028654
7th-8th         0.019840
Prof-school     0.017690
9th             0.015786
12th            0.013298
Doctorate       0.012684
5th-6th         0.010227
1st-4th         0.005160
Preschool       0.001566
Name: education, dtype: float64
Married-civ-spouse       0.459937
Never-married            0.328092
Divorced                 0.136452
Separated                0.031479
Widowed                  0.030497
Married-spouse-absent    0.012837
Married-AF-spouse        0.000706
Name: marital_status, dtype: float64


In [11]:
data.workclass.unique()

array(['?', 'Private', 'State-gov', 'Federal-gov', 'Self-emp-not-inc',
       'Self-emp-inc', 'Local-gov', 'Without-pay', 'Never-worked'],
      dtype=object)

In [12]:
data.workclass.value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [13]:
# Treat "?" as a missing value (np.NaN was removed in NumPy 2.0, so use np.nan)
data["workclass"] = data["workclass"].replace("?", np.nan)
data.workclass.unique()

array([nan, 'Private', 'State-gov', 'Federal-gov', 'Self-emp-not-inc',
       'Self-emp-inc', 'Local-gov', 'Without-pay', 'Never-worked'],
      dtype=object)

In [14]:
#Next we see occupation
data.occupation.unique()

array(['?', 'Exec-managerial', 'Machine-op-inspct', 'Prof-specialty',
       'Other-service', 'Adm-clerical', 'Craft-repair',
       'Transport-moving', 'Handlers-cleaners', 'Sales',
       'Farming-fishing', 'Tech-support', 'Protective-serv',
       'Armed-Forces', 'Priv-house-serv'], dtype=object)

In [15]:
data.occupation.value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

In [16]:
# Treat "?" as a missing value; value_counts() skips NaN by default
data["occupation"] = data["occupation"].replace("?", np.nan)
data.occupation.value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

In [17]:
data.native_country.unique()

array(['United-States', '?', 'Mexico', 'Greece', 'Vietnam', 'China',
       'Taiwan', 'India', 'Philippines', 'Trinadad&Tobago', 'Canada',
       'South', 'Holand-Netherlands', 'Puerto-Rico', 'Poland', 'Iran',
       'England', 'Germany', 'Italy', 'Japan', 'Hong', 'Honduras', 'Cuba',
       'Ireland', 'Cambodia', 'Peru', 'Nicaragua', 'Dominican-Republic',
       'Haiti', 'El-Salvador', 'Hungary', 'Columbia', 'Guatemala',
       'Jamaica', 'Ecuador', 'France', 'Yugoslavia', 'Scotland',
       'Portugal', 'Laos', 'Thailand', 'Outlying-US(Guam-USVI-etc)'],
      dtype=object)

In [18]:
data.native_country.value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [19]:
# Treat "?" as a missing value
data["native_country"] = data["native_country"].replace("?", np.nan)

In [20]:
data.native_country.unique()

array(['United-States', nan, 'Mexico', 'Greece', 'Vietnam', 'China',
       'Taiwan', 'India', 'Philippines', 'Trinadad&Tobago', 'Canada',
       'South', 'Holand-Netherlands', 'Puerto-Rico', 'Poland', 'Iran',
       'England', 'Germany', 'Italy', 'Japan', 'Hong', 'Honduras', 'Cuba',
       'Ireland', 'Cambodia', 'Peru', 'Nicaragua', 'Dominican-Republic',
       'Haiti', 'El-Salvador', 'Hungary', 'Columbia', 'Guatemala',
       'Jamaica', 'Ecuador', 'France', 'Yugoslavia', 'Scotland',
       'Portugal', 'Laos', 'Thailand', 'Outlying-US(Guam-USVI-etc)'],
      dtype=object)

# Separating features and target

In [21]:
x=data.drop(["income"],axis=1)
y=data.income

# Checking the shapes of the features and target

In [22]:
x.shape,y.shape


((32561, 14), (32561,))

# Feature engineering

In [23]:
# Inspect the data type of each feature
x.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
dtype: object

In [24]:
cat=[i for i in x.columns if x.dtypes[i] =="O" ]
num=[i for i in x.columns if x.dtypes[i] !="O" ]



In [25]:
x.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
dtype: int64

# Encoding categorical data

In [26]:
x[cat].head()

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country
0,,HS-grad,Widowed,,Not-in-family,White,Female,United-States
1,Private,HS-grad,Widowed,Exec-managerial,Not-in-family,White,Female,United-States
2,,Some-college,Widowed,,Unmarried,Black,Female,United-States
3,Private,7th-8th,Divorced,Machine-op-inspct,Unmarried,White,Female,United-States
4,Private,Some-college,Separated,Prof-specialty,Own-child,White,Female,United-States


In [27]:
import category_encoders as ce


In [28]:
# Pass the column list by keyword; the first positional parameter is verbose
encoder=ce.OneHotEncoder(cols=cat)


x=encoder.fit_transform(x)


# Data splitting

In [29]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [30]:
X_train.isnull().sum()

age                  0
workclass_1          0
workclass_2          0
workclass_3          0
workclass_4          0
                    ..
native_country_38    0
native_country_39    0
native_country_40    0
native_country_41    0
native_country_42    0
Length: 108, dtype: int64

In [31]:
X_test.isnull().sum()

age                  0
workclass_1          0
workclass_2          0
workclass_3          0
workclass_4          0
                    ..
native_country_38    0
native_country_39    0
native_country_40    0
native_country_41    0
native_country_42    0
Length: 108, dtype: int64

In [32]:
# Derive both lists from the training set so they are consistent
cat=[i for i in X_train.columns if X_train.dtypes[i] =="O" ]
num=[i for i in X_train.columns if X_train.dtypes[i] !="O" ]

In [33]:
null_cat=[ i for i in cat if x[i].isnull().sum() != 0]
null_num=[ i for i in num if x[i].isnull().sum() != 0]

In [34]:
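# The one-hot encoder has already given missing values their own column,
# so both lists are empty here and these loops change nothing.
for i in null_cat:
    mode_value = X_train[i].mode()[0]  # statistic taken from the training set
    X_train[i] = X_train[i].fillna(mode_value)
    X_test[i] = X_test[i].fillna(mode_value)

for i in null_num:
    mean_value = X_train[i].mean()  # statistic taken from the training set
    X_train[i] = X_train[i].fillna(mean_value)
    X_test[i] = X_test[i].fillna(mean_value)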

In [35]:
X_test.isna().sum()

X_train.isna().sum()

age                  0
workclass_1          0
workclass_2          0
workclass_3          0
workclass_4          0
                    ..
native_country_38    0
native_country_39    0
native_country_40    0
native_country_41    0
native_country_42    0
Length: 108, dtype: int64

In [36]:
X_train.head()

Unnamed: 0,age,workclass_1,workclass_2,workclass_3,workclass_4,workclass_5,workclass_6,workclass_7,workclass_8,workclass_9,...,native_country_33,native_country_34,native_country_35,native_country_36,native_country_37,native_country_38,native_country_39,native_country_40,native_country_41,native_country_42
15282,41,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24870,25,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
18822,25,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26404,53,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7842,24,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
X_test.head()

Unnamed: 0,age,workclass_1,workclass_2,workclass_3,workclass_4,workclass_5,workclass_6,workclass_7,workclass_8,workclass_9,...,native_country_33,native_country_34,native_country_35,native_country_36,native_country_37,native_country_38,native_country_39,native_country_40,native_country_41,native_country_42
22278,56,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8950,19,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7838,23,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16505,37,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19140,49,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Feature Scaling

In [38]:
cols=X_train.columns

In [39]:
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

model = GaussianNB()

# Scalers to compare; the result rows follow this order (std, nor, rob)
scalers = {"std": StandardScaler(), "nor": Normalizer(), "rob": RobustScaler()}

z = []
k = ["confusion_matrix", "accuracy_score", "classification_report", "bias", "variance"]

for scaler_name, scaler in scalers.items():
    # Fit the scaler on the training set only, then apply it to the test set
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model.fit(X_train_scaled, Y_train)
    Y_pred = model.predict(X_test_scaled)
    cm = confusion_matrix(Y_test, Y_pred)
    accu_score = accuracy_score(Y_test, Y_pred)
    cr = classification_report(Y_test, Y_pred)
    # "bias" and "variance" here are simply the training-set and test-set accuracy
    bias = model.score(X_train_scaled, Y_train)
    variance = model.score(X_test_scaled, Y_test)
    z.append([cm, accu_score, cr, bias, variance])

m = pd.DataFrame(z, columns=k)


In [40]:
m

Unnamed: 0,confusion_matrix,accuracy_score,classification_report,bias,variance
0,"[[1538, 3428], [45, 1502]]",0.466759,precision recall f1-score ...,0.471975,0.466759
1,"[[4632, 334], [868, 679]]",0.815446,precision recall f1-score ...,0.816608,0.815446
2,"[[4005, 961], [311, 1236]]",0.804698,precision recall f1-score ...,0.798219,0.804698


In [41]:
for i in m.classification_report:
    print(i)

              precision    recall  f1-score   support

       <=50K       0.97      0.31      0.47      4966
        >50K       0.30      0.97      0.46      1547

    accuracy                           0.47      6513
   macro avg       0.64      0.64      0.47      6513
weighted avg       0.81      0.47      0.47      6513

              precision    recall  f1-score   support

       <=50K       0.84      0.93      0.89      4966
        >50K       0.67      0.44      0.53      1547

    accuracy                           0.82      6513
   macro avg       0.76      0.69      0.71      6513
weighted avg       0.80      0.82      0.80      6513

              precision    recall  f1-score   support

       <=50K       0.93      0.81      0.86      4966
        >50K       0.56      0.80      0.66      1547

    accuracy                           0.80      6513
   macro avg       0.75      0.80      0.76      6513
weighted avg       0.84      0.80      0.81      6513



In [42]:
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report,recall_score,precision_score

# Note: BernoulliNB is imported, but the model fitted in this cell is still
# GaussianNB, so the results below repeat the Gaussian run.
# Set model = BernoulliNB() to evaluate the Bernoulli variant instead.
model = GaussianNB()

scalers = {"std": StandardScaler(), "nor": Normalizer(), "rob": RobustScaler()}

z = []
k = ["confusion_matrix", "accuracy_score", "classification_report", "bias", "variance"]

for scaler_name, scaler in scalers.items():
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model.fit(X_train_scaled, Y_train)
    Y_pred = model.predict(X_test_scaled)
    cm = confusion_matrix(Y_test, Y_pred)
    accu_score = accuracy_score(Y_test, Y_pred)
    cr = classification_report(Y_test, Y_pred)
    # "bias" and "variance" are the training-set and test-set accuracy
    bias = model.score(X_train_scaled, Y_train)
    variance = model.score(X_test_scaled, Y_test)
    z.append([cm, accu_score, cr, bias, variance])

m = pd.DataFrame(z, columns=k)

In [43]:
for i in m.classification_report:
    print(i)

              precision    recall  f1-score   support

       <=50K       0.97      0.31      0.47      4966
        >50K       0.30      0.97      0.46      1547

    accuracy                           0.47      6513
   macro avg       0.64      0.64      0.47      6513
weighted avg       0.81      0.47      0.47      6513

              precision    recall  f1-score   support

       <=50K       0.84      0.93      0.89      4966
        >50K       0.67      0.44      0.53      1547

    accuracy                           0.82      6513
   macro avg       0.76      0.69      0.71      6513
weighted avg       0.80      0.82      0.80      6513

              precision    recall  f1-score   support

       <=50K       0.93      0.81      0.86      4966
        >50K       0.56      0.80      0.66      1547

    accuracy                           0.80      6513
   macro avg       0.75      0.80      0.76      6513
weighted avg       0.84      0.80      0.81      6513



In [44]:
print(m)

             confusion_matrix  accuracy_score  \
0  [[1538, 3428], [45, 1502]]        0.466759   
1   [[4632, 334], [868, 679]]        0.815446   
2  [[4005, 961], [311, 1236]]        0.804698   

                               classification_report      bias  variance  
0                precision    recall  f1-score   ...  0.471975  0.466759  
1                precision    recall  f1-score   ...  0.816608  0.815446  
2                precision    recall  f1-score   ...  0.798219  0.804698  


In [45]:
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Note: MultinomialNB is imported, but the model fitted in this cell is still
# GaussianNB, so the results below repeat the Gaussian run.
# MultinomialNB needs non-negative features, so it cannot be fitted on
# StandardScaler or RobustScaler output, which contains negative values.
model = GaussianNB()

scalers = {"std": StandardScaler(), "nor": Normalizer(), "rob": RobustScaler()}

z = []
k = ["confusion_matrix", "accuracy_score", "classification_report", "bias", "variance"]

for scaler_name, scaler in scalers.items():
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model.fit(X_train_scaled, Y_train)
    Y_pred = model.predict(X_test_scaled)
    cm = confusion_matrix(Y_test, Y_pred)
    accu_score = accuracy_score(Y_test, Y_pred)
    cr = classification_report(Y_test, Y_pred)
    # "bias" and "variance" are the training-set and test-set accuracy
    bias = model.score(X_train_scaled, Y_train)
    variance = model.score(X_test_scaled, Y_test)
    z.append([cm, accu_score, cr, bias, variance])

m = pd.DataFrame(z, columns=k)

In [46]:
m

Unnamed: 0,confusion_matrix,accuracy_score,classification_report,bias,variance
0,"[[1538, 3428], [45, 1502]]",0.466759,precision recall f1-score ...,0.471975,0.466759
1,"[[4632, 334], [868, 679]]",0.815446,precision recall f1-score ...,0.816608,0.815446
2,"[[4005, 961], [311, 1236]]",0.804698,precision recall f1-score ...,0.798219,0.804698


In [47]:
for i in m.classification_report:
    print(i)

              precision    recall  f1-score   support

       <=50K       0.97      0.31      0.47      4966
        >50K       0.30      0.97      0.46      1547

    accuracy                           0.47      6513
   macro avg       0.64      0.64      0.47      6513
weighted avg       0.81      0.47      0.47      6513

              precision    recall  f1-score   support

       <=50K       0.84      0.93      0.89      4966
        >50K       0.67      0.44      0.53      1547

    accuracy                           0.82      6513
   macro avg       0.76      0.69      0.71      6513
weighted avg       0.80      0.82      0.80      6513

              precision    recall  f1-score   support

       <=50K       0.93      0.81      0.86      4966
        >50K       0.56      0.80      0.66      1547

    accuracy                           0.80      6513
   macro avg       0.75      0.80      0.76      6513
weighted avg       0.84      0.80      0.81      6513
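
The three result tables above are identical because each run fits `GaussianNB`. The sketch below shows one possible way to compare the three variants side by side. It is not the run that produced the numbers above: it uses a small synthetic dataset as a stand-in for the encoded census features, and it swaps in `MinMaxScaler(clip=True)` (an assumption on my part, available in recent scikit-learn versions) so that every feature stays non-negative, which `MultinomialNB` requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, for illustration only):
# two classes whose count-like features have different rates.
rng = np.random.default_rng(0)
X_low = rng.poisson(lam=[1.0, 5.0, 2.0], size=(100, 3))
X_high = rng.poisson(lam=[5.0, 1.0, 2.0], size=(100, 3))
X_demo = np.vstack([X_low, X_high]).astype(float)
y_demo = np.array(["<=50K"] * 100 + [">50K"] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scale to [0, 1]; clip=True keeps test values inside that range as well.
scaler = MinMaxScaler(clip=True)
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

models = {
    "gaussian": GaussianNB(),
    "bernoulli": BernoulliNB(),      # binarizes features at 0.0 by default
    "multinomial": MultinomialNB(),  # requires non-negative features
}

results = {}
for name, clf in models.items():
    clf.fit(X_tr_s, y_tr)
    results[name] = {
        "train_accuracy": clf.score(X_tr_s, y_tr),
        "test_accuracy": clf.score(X_te_s, y_te),
    }

for name, scores in results.items():
    print(name, scores)
```

Naming the columns `train_accuracy` and `test_accuracy` also makes it clearer what the `bias` and `variance` columns above actually measure.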

