# Naive Bayes Classifier

__What is the Naive Bayes algorithm?__

It is a supervised learning method based on __Bayes' theorem__ that assumes all features (inputs) used to predict the target value are __mutually independent__ given the class.

This is a strong assumption that rarely holds exactly in real-life datasets, but we make it because __it makes our calculations much simpler yet still efficient__.

Bayes' theorem provides a way to calculate the probability that a piece of data belongs to a class (target), given prior knowledge (the features or inputs). It can be stated as:

__P(class|data) = (P(data|class) * P(class)) / P(data)__

where P(class|data) is the probability of the class given the provided data.

Naive Bayes is a classification algorithm used for binary and multiclass classification problems.

## Example (spam filter):

Imagine that you receive a lot of emails in your Gmail inbox and need to filter out unwanted messages (spam). This is the kind of classification problem where __Naive Bayes__ works well.

For example, you have 12 emails and want to filter spam based on the data you have, which we will call the training data from now on:
Of the 12 emails in your inbox, 8 are regular and 4 are spam, and the number next to each word is the number of times that word occurs.

![emails](assets/emails.png)

- From the figure above, to calculate the probability of seeing the word "Dear" in a regular email we use:
P("Dear"|regular) = total number of occurrences of the word "Dear" in regular emails / total number of words in regular emails = 8 / 14

- P("Friend"|regular) = 5 / 14

- P("Money"|regular) = 1 / 14

- The probability that an email is regular, regardless of what it says, is P(regular) = number of regular emails / total number of emails = 8 / 12

### and for spam

With the same steps:

- P("Friend"|spam) = 1 / 7

- P("Money"|spam) = 4 / 7

- P("Dear"|spam) = 2 / 7

- P(spam) = 4 / 12

__then__

Using these values, we can estimate whether the next email is spam or not (the estimate will not be very accurate because the dataset is so small) with the following steps.


If an email has "Dear Friend" in its body, we find its class by computing a score for each class:

### for regular emails

P(regular|"Dear Friend") ∝ P("Dear"|regular) * P("Friend"|regular) * P(regular) = 8/14 * 5/14 * 8/12 ≈ 0.14 -> (1)

__Note__: If you look closely, this expression is Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), with the denominator removed. P(B) is the same for every class, so dropping it does not change which class has the largest score, and the largest score is all we need.


### for SPAM

P(spam|"Dear Friend") ∝ P("Dear"|spam) * P("Friend"|spam) * P(spam) = 2/7 * 1/7 * 4/12 ≈ 0.01 -> (2)



The value in (1) is much larger than the value in (2), so we conclude that the email is not spam. This is how the algorithm works. Next we use it to solve a classification problem on a heart disease dataset, but with a slightly different approach called Gaussian Naive Bayes.
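
The spam example above can be reproduced in a few lines of Python. This is only a sketch: the dictionary names `word_counts` and `email_counts` and the helper `class_score` are illustrative names, not from any library, and the counts are the ones used in the worked example.

```python
# Word counts per class, taken from the worked example above.
word_counts = {
    "regular": {"Dear": 8, "Friend": 5, "Money": 1},
    "spam": {"Dear": 2, "Friend": 1, "Money": 4},
}
# Number of emails per class (8 regular, 4 spam).
email_counts = {"regular": 8, "spam": 4}


def class_score(words, label):
    """Unnormalized score: P(class) times the product of P(word|class)."""
    total_words = sum(word_counts[label].values())
    score = email_counts[label] / sum(email_counts.values())
    for word in words:
        score *= word_counts[label][word] / total_words
    return score


message = ["Dear", "Friend"]
scores = {label: class_score(message, label) for label in word_counts}
prediction = max(scores, key=scores.get)

# Dividing by the sum of the scores restores the denominator P(data),
# turning the scores into probabilities that add up to 1.
posteriors = {label: s / sum(scores.values()) for label, s in scores.items()}

print(scores)
print(prediction)
print(posteriors)
```

Normalizing changes the numbers but not their order, which is why the class with the largest unnormalized score is also the most probable class.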

# Heart Disease Dataset

- Our goal is to predict the presence of heart disease in a patient, which helps in treating the disease at an early phase.


- This is a binary classification problem, so the target takes one of two values:
    - 1 for the presence of heart disease (disease)
    - 0 for a patient in the safe zone (no disease)
    
    
- Our dataset `heart.csv` has 1025 rows with no missing data and includes:
    - 13 columns (inputs or features)
        1. age
        2. sex
        3. chest pain type (4 values)
        4. resting blood pressure
        5. serum cholesterol in mg/dl
        6. fasting blood sugar > 120 mg/dl
        7. resting electrocardiographic results (values 0,1,2)
        8. maximum heart rate achieved
        9. exercise induced angina
        10. oldpeak = ST depression induced by exercise relative to rest
        11. the slope of the peak exercise ST segment
        12. number of major vessels (0-3) colored by fluoroscopy
        13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
    
    - 1 output column (target)

# Our Approach To Solve The Problem

The solution to this problem is broken down into 4 steps:
1. __getting our data ready to be processed__
2. __get some statistics for each column separated by class (0/no disease and 1/disease)__
3. __calculate Probability for each input row__
4. __calculate the class Probability (the model prediction)__

## step 1: getting our data ready to be processed


Before anything else, we import numpy and pandas to help with our calculations.


In [2]:
import numpy as np
import pandas as pd

# read heart.csv file and put it into pandas dataframe
heart_disease = pd.read_csv('heart.csv')
heart_disease

# x deliberately keeps the 'target' column so the training rows can be split by class later
x = heart_disease
y = heart_disease["target"]

x

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


In [3]:
# split the dataset into a training group and a test group
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .2)

In [4]:
# separate our data by target class value
diseased = x_train[x_train['target'] == 1].drop('target', axis=1)
not_diseased = x_train[x_train['target'] == 0].drop('target', axis=1)

all diseased entries

In [5]:
diseased

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
831,58,1,1,125,220,0,1,144,0,0.4,1,4,3
982,67,0,0,106,223,0,1,142,0,0.3,2,2,2
663,58,0,0,100,248,0,0,122,0,1.0,1,0,2
639,58,0,0,130,197,0,1,131,0,0.6,1,0,2
249,42,1,2,130,180,0,1,150,0,0.0,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
782,64,0,0,130,303,0,1,122,0,2.0,1,2,2
932,51,0,2,140,308,0,0,142,0,1.5,2,1,2
745,51,1,2,100,222,0,1,143,1,1.2,1,0,2
648,71,0,0,112,149,0,1,125,0,1.6,1,0,2


all not diseased entries

In [6]:
not_diseased

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
454,65,0,0,150,225,0,0,114,0,1.0,1,3,3
900,61,1,3,134,234,0,1,145,0,2.6,1,2,2
844,60,1,0,140,293,0,0,170,0,1.2,1,2,3
629,65,1,3,138,282,1,0,174,0,1.4,1,1,2
32,57,1,0,130,131,0,1,115,1,1.2,1,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
706,57,1,2,128,229,0,0,150,0,0.4,1,1,3
634,52,1,0,125,212,0,1,168,0,1.0,2,2,3
59,57,1,1,154,232,0,0,164,0,0.0,2,1,2
787,51,1,0,140,298,0,1,122,1,4.2,1,3,3


## step 2: get some statistics for each column separated by class

In this step we calculate the mean and the standard deviation of each column for the two dataframes.


These statistics will help us calculate the probabilities we want.

In [7]:
# calculate the mean and the standard deviation of each column in the diseased dataframe and collect the results in an array
diseased_statistics = np.array([(diseased[column].mean(), diseased[column].std()) for column in diseased.columns])

In [8]:
diseased_statistics

array([[5.23701149e+01, 9.60551535e+00],
       [5.67816092e-01, 4.95950015e-01],
       [1.34022989e+00, 9.32819393e-01],
       [1.29119540e+02, 1.60590632e+01],
       [2.40678161e+02, 5.35576898e+01],
       [1.33333333e-01, 3.40326039e-01],
       [6.04597701e-01, 4.98825359e-01],
       [1.58340230e+02, 1.91002477e+01],
       [1.37931034e-01, 3.45224624e-01],
       [5.78620690e-01, 7.79328513e-01],
       [1.60000000e+00, 5.84752498e-01],
       [3.74712644e-01, 8.85002850e-01],
       [2.10114943e+00, 4.69212069e-01]])

In [9]:
# calculate the mean and the standard deviation of each column in the not_diseased dataframe and collect the results in an array
not_diseased_statistics = np.array([(not_diseased[column].mean(), not_diseased[column].std()) for column in not_diseased.columns])

In [10]:
not_diseased_statistics

array([[5.66337662e+01, 8.02394870e+00],
       [8.33766234e-01, 3.72774783e-01],
       [4.90909091e-01, 9.10325866e-01],
       [1.33615584e+02, 1.85225866e+01],
       [2.48963636e+02, 4.96351661e+01],
       [1.63636364e-01, 3.70426658e-01],
       [4.31168831e-01, 5.31374922e-01],
       [1.38979221e+02, 2.21582750e+01],
       [5.45454545e-01, 4.98577522e-01],
       [1.58077922e+00, 1.26348741e+00],
       [1.17142857e+00, 5.74378546e-01],
       [1.19740260e+00, 1.04706609e+00],
       [2.50129870e+00, 6.96436896e-01]])
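
The per-class means and standard deviations can also be obtained in one call with pandas `groupby`. The sketch below uses a tiny made-up dataframe (`toy`, with invented values) instead of `heart.csv`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the training rows; values are invented.
toy = pd.DataFrame({
    "age":     [50, 60, 70, 40, 45, 55],
    "thalach": [150, 140, 130, 170, 165, 160],
    "target":  [0, 0, 0, 1, 1, 1],
})

# One row per class, with a (mean, std) pair for every feature column.
stats = toy.groupby("target").agg(["mean", "std"])
print(stats)

# Statistics of one feature for one class, e.g. age among target == 1.
age_mean_diseased = stats.loc[1, ("age", "mean")]
age_std_diseased = stats.loc[1, ("age", "std")]
print(age_mean_diseased, age_std_diseased)
```

Both `Series.std()` and the grouped `std` use the sample standard deviation (`ddof=1`) by default, so this approach should agree with the column-by-column calculation above.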

## step 3: calculate Probability using the Gaussian Probability Density Function


Calculating the probability of observing a given real value directly is difficult.

We can assume that our values are drawn from a known distribution to make the probability calculation easier. We use the Gaussian distribution because it can be computed from the mean and the standard deviation that we collected from our data earlier.

### Gaussian Probability Density Function:-

__f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x - mean)^2 / (2 * sigma^2)))__


In [11]:
# a function to calculate the probability density of a given value from a precalculated mean and standard deviation
def calc_probability(x, mean, stdev):
    exponent = np.exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (np.sqrt(2 * np.pi) * stdev)) * exponent

In [12]:
calc_probability(0, 1, 1)

0.24197072451914337
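
As a sanity check, the hand-written density can be compared with Python's standard library. The sketch below re-implements the formula with `math` (under the illustrative name `gaussian_pdf`) and compares it with `statistics.NormalDist.pdf` for a few arbitrary inputs.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mean, stdev):
    """Gaussian probability density, written out from the formula."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


# Compare against the standard library for a few (x, mean, stdev) triples.
for x, mean, stdev in [(0, 1, 1), (52, 56.6, 8.0), (1.0, 0.58, 0.78)]:
    ours = gaussian_pdf(x, mean, stdev)
    reference = NormalDist(mu=mean, sigma=stdev).pdf(x)
    assert math.isclose(ours, reference, rel_tol=1e-12)
    print(x, mean, stdev, ours)
```

Note that the function returns a density, not a probability: for a small standard deviation the value can exceed 1. That is fine for classification, because only the comparison between classes matters.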

## step 4: calculate the class Probability (the model prediction)


In this step, we calculate the probability that an input belongs to each of our classes, using the statistics computed from the training data (the mean and standard deviation of each column).



We apply the previous step to every row (input), which gives us a score for each class given the input data. Since we are interested in the predicted class rather than the actual probability, we pick the class with the largest value.



The following expression scores how likely it is that a piece of data belongs to a class:


__P(class|input) ∝ P(input|class) * P(class)__  -> (1)



This is Bayes' theorem without the division by P(input). That denominator is the same for every class, so dropping it simplifies the calculation without changing which class scores highest.



The input features (the columns of the row) are treated separately, which is why the method is called '__Naive__ Bayes': we assume the variables are independent, so expression (1) can be rewritten in the form below:



__P(class|row) ∝ P(input1|class) * P(input2|class) * P(input3|class) * ... * P(class)__


where 
- P(class|row) is the probability that the row belongs to the class
- P(inputn|class) is the probability of the value in column n given the class
- P(class) is the probability of the class (calculated from the training data)
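
Multiplying many small densities can underflow to zero in floating point. A common remedy is to add logarithms instead of multiplying; the largest log-score still identifies the same class because the logarithm is an increasing function. The sketch below illustrates this with made-up inputs; `log_gaussian_pdf` and `log_score` are illustrative names, not part of the notebook.

```python
import math


def gaussian_pdf(x, mean, stdev):
    """Gaussian probability density."""
    return math.exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (math.sqrt(2 * math.pi) * stdev)


def log_gaussian_pdf(x, mean, stdev):
    """Logarithm of the Gaussian density, computed without calling exp."""
    return -((x - mean) ** 2) / (2 * stdev ** 2) - math.log(math.sqrt(2 * math.pi) * stdev)


def log_score(row, stats, prior):
    """log P(class) plus the sum of log P(input_i|class)."""
    return math.log(prior) + sum(log_gaussian_pdf(x, m, s) for x, (m, s) in zip(row, stats))


# Invented example: 200 features, each value three standard deviations from its mean.
row = [3.0] * 200
stats = [(0.0, 1.0)] * 200

product = 0.5  # start from the class prior
for x, (m, s) in zip(row, stats):
    product *= gaussian_pdf(x, m, s)

print(product)                     # the plain product underflows
print(log_score(row, stats, 0.5))  # the log-score stays finite
```

With only 13 features the plain product used in this notebook is unlikely to underflow, so the log form matters mainly when the number of features grows.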


In [13]:
def predict_class(row):
    # calculate a score for every class given the input row
    total_rows = x_train.shape[0]
    not_diseased_prob = not_diseased.shape[0] / total_rows
    diseased_prob = diseased.shape[0] / total_rows
    # start each score at the class prior P(class)
    classes_probs = { 'not_diseased': not_diseased_prob, 'diseased': diseased_prob }
    # multiply in P(input_i|class) for every feature column
    for i in range(not_diseased_statistics.shape[0]):
        mean, stdev = not_diseased_statistics[i]
        classes_probs['not_diseased'] *= calc_probability(row[i], mean, stdev)
    for i in range(diseased_statistics.shape[0]):
        mean, stdev = diseased_statistics[i]
        classes_probs['diseased'] *= calc_probability(row[i], mean, stdev)
    
    # predict the class with the higher score
    return 1 if classes_probs['diseased'] > classes_probs['not_diseased'] else 0

In [14]:
# example: returns 1 for diseased and 0 for not diseased
predict_class([69,0,3,140,239,0,1,151,0,1.8,2,2,2])

1

In [17]:

# params: test => np.array of feature rows (.2 of the data, target not included)
# return: np.array of predicted classes (0 or 1), one per row
# relies on the class statistics computed from the training data in steps 1 and 2
def naive_bayes(test):
    # steps 3 & 4: score each row and predict its target
    predictions = np.array([])
    for row in test:
        output = predict_class(row)
        predictions = np.append(predictions, [output])
    return predictions


In [18]:
preds = naive_bayes(x_test.drop('target', axis=1).values)
preds

array([1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 1.,
       0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1.,
       1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1.,
       1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.,
       0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1.,
       0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0.,
       1., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.,
       0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1.,
       1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0.,
       1., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1.,
       1.])

In [19]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test.values, naive_bayes(x_test.drop('target', axis=1).values))


0.8390243902439024
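
Accuracy is the fraction of predictions that match the true labels, and it does not distinguish between the two kinds of error. For a screening task it can help to also count missed cases (false negatives) and false alarms (false positives). A minimal sketch with invented labels follows; `confusion_counts` is a hypothetical helper, not an sklearn function.

```python
def confusion_counts(true, pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 means diseased."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


# Invented labels, only to illustrate the calculation.
true = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(true, pred)
accuracy = (tp + tn) / len(true)
print(tp, tn, fp, fn, accuracy)
```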

# Summarizing everything in an OOP approach

In [21]:
from sklearn.metrics import accuracy_score

class NaiveBayes:
    def __init__(self):
        self.diseased_data = np.array([])
        self.not_diseased_data = np.array([])
        self.diseased_statistics = np.array([])
        self.not_diseased_statistics = np.array([])
        self.train_rows_number = 0
        self.train_diseased_prob = 0
        self.train_not_diseased_prob = 0

    # params: data: pandas dataframe of training rows, 'target' column included
    # return: [diseased rows, not diseased rows, number of training rows]
    def fit(self, data):
        self.train_rows_number = data.shape[0]

        # 1- separate the training data by class
        diseased = data[data['target'] == 1].drop('target', axis=1)
        not_diseased = data[data['target'] == 0].drop('target', axis=1)

        # class priors P(class)
        self.train_diseased_prob = diseased.shape[0] / self.train_rows_number
        self.train_not_diseased_prob = not_diseased.shape[0] / self.train_rows_number

        # 2- calculate statistics (mean, standard deviation) for each class
        self.diseased_statistics = np.array([(diseased[column].mean(), diseased[column].std()) for column in diseased.columns])
        self.not_diseased_statistics = np.array([(not_diseased[column].mean(), not_diseased[column].std()) for column in not_diseased.columns])

        return [diseased, not_diseased, self.train_rows_number]

    # params: row: np.array holding the feature values of one input
    # return: 1 for diseased,
    #         0 for not diseased
    def predict_input_class(self, row):
        # initialize the class scores with the class priors
        classes_probs = { 'not_diseased': self.train_not_diseased_prob, 'diseased': self.train_diseased_prob }
        # use the statistics stored on the instance, not the notebook globals
        for i in range(self.not_diseased_statistics.shape[0]):
            mean, stdev = self.not_diseased_statistics[i]
            classes_probs['not_diseased'] *= self.calculate_prob(row[i], mean, stdev)
        for i in range(self.diseased_statistics.shape[0]):
            mean, stdev = self.diseased_statistics[i]
            classes_probs['diseased'] *= self.calculate_prob(row[i], mean, stdev)

        # predict the class with the higher score
        return 1 if classes_probs['diseased'] > classes_probs['not_diseased'] else 0

    # Gaussian probability density of x for the given mean and standard deviation
    def calculate_prob(self, x, mean, stdev):
        exponent = np.exp(-((x-mean)**2 / (2 * stdev**2 )))
        return (1 / (np.sqrt(2 * np.pi) * stdev)) * exponent

    # params: test: pandas dataframe of rows to predict ('target' column included; it is dropped here)
    # return: np.array of predicted values (0 or 1), one per row
    def predict(self, test):
        # prepare test data to be processed
        test_data = test.drop('target', axis=1).values
        # debug output: show the feature matrix that is about to be scored
        print(test_data)

        # 3&4- score each row to predict the target
        predictions = np.array([])
        for row in test_data:
            output = self.predict_input_class(row)
            predictions = np.append(predictions, [output])
        return predictions

    # params: true: np.array of true values (0 or 1)
    #         pred: np.array of values predicted by the model (0 or 1)
    # return: accuracy of the model as a fraction between 0 and 1
    def score(self, true, pred):
        return accuracy_score(true, pred)
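
The per-row Python loops can be replaced by array operations. The sketch below is one possible vectorized variant, not the class above: it works on plain NumPy arrays, sums log-densities instead of multiplying densities, and is exercised on synthetic data. The function names `fit_gaussian_nb` and `predict_gaussian_nb` are made up for this illustration.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Return class labels, priors, and per-class feature means and standard deviations."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # ddof=1 matches the sample standard deviation that pandas .std() computes
    stds = np.array([X[y == c].std(axis=0, ddof=1) for c in classes])
    return classes, priors, means, stds


def predict_gaussian_nb(X, classes, priors, means, stds):
    """Pick, for every row, the class with the largest log-score."""
    # Shapes: X[:, None, :] is (rows, 1, features); means and stds are (classes, features).
    z = (X[:, None, :] - means) / stds
    log_pdf = -0.5 * z ** 2 - np.log(np.sqrt(2 * np.pi) * stds)
    log_scores = np.log(priors) + log_pdf.sum(axis=2)
    return classes[np.argmax(log_scores, axis=1)]


# Synthetic, well-separated data: two classes with different feature means.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
X1 = rng.normal(loc=4.0, scale=1.0, size=(100, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

params = fit_gaussian_nb(X, y)
preds = predict_gaussian_nb(X, *params)
print((preds == y).mean())
```

Broadcasting scores every row against every class at once, which avoids both the Python loop over rows and the repeated `np.append` calls.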

In [22]:
a = heart_disease
b = heart_disease["target"]

#separate the dataset to training group and test group
from sklearn.model_selection import train_test_split
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size = .2)

In [23]:
clf = NaiveBayes()


In [24]:
clf.fit(a_train)[2]

820

In [25]:
clf.diseased_data

array([], dtype=float64)

In [26]:
preds = clf.predict(a_test)

[[66.  0.  0. ...  1.  2.  3.]
 [66.  1.  0. ...  2.  1.  2.]
 [61.  1.  0. ...  1.  1.  2.]
 ...
 [43.  1.  0. ...  1.  0.  2.]
 [54.  1.  2. ...  1.  0.  3.]
 [57.  1.  2. ...  1.  1.  3.]]


In [27]:
b_test.shape

(205,)

In [28]:
a_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
737,67,1,0,120,229,0,0,129,1,2.6,1,2,3,0
399,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
354,57,1,1,124,261,0,1,141,0,0.3,2,0,3,0
189,64,1,2,125,309,0,1,131,1,1.8,1,0,3,0
375,66,1,0,160,228,0,0,138,0,2.3,2,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
708,60,0,2,120,178,1,1,96,0,0.0,2,0,2,1
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1
297,58,1,0,150,270,0,0,111,1,0.8,2,0,3,0
1024,54,1,0,120,188,0,1,113,0,1.4,1,1,3,0


In [29]:
b_test.values

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0], dtype=int64)

In [30]:
preds

array([0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 0., 1.,
       1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 0., 1., 1., 0., 1.,
       1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0.,
       1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1., 1., 0., 0.,
       1., 1., 0., 1., 1., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 0., 1.,
       0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1.,
       1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 1.,
       1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1.,
       1., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0.,
       0., 1., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       1.])

In [31]:
clf.score(b_test.values, preds)

0.8341463414634146