<h1>Naive Bayes Classifier</h1>
<h2>Machine Learning in Python</h2>
<hr>
<h3>Objective</h3>
<p style="margin-top:20px">Our goal is to estimate the probability of a patient's <b>semen diagnosis</b> from the following 9 attributes:</p>
<ol>
    <li>Season of Analysis</li>
    <li>Age at the time of analysis</li>
    <li>Childhood diseases (chicken pox, measles, mumps, polio)</li>
    <li>Accident or serious trauma</li>
    <li>Surgical Intervention</li>
    <li>High fevers in last year</li>
    <li>Frequency of alcohol consumption</li>
    <li>Smoking Habit</li>
    <li>Number of hours spent sitting per day</li>
</ol>


<h3>First, we will load the data into a DataFrame using pandas</h3>

In [4]:
import pandas as pd
# The file has no header row, so the column names are supplied explicitly
columns = ["Season of Analysis", "Age of Analysis", "Childish Disease",
           "Accident of Serious Trauma", "Surgical Intervention",
           "High fevers in last year", "Frequency of alcohol consumption",
           "Smoking Habit", "Number of hours spent sitting per day",
           "Semen Diagnosis"]
df = pd.read_csv('fertility_Diagnosis_Data_Group5_8.txt', delimiter=",", header=None, names=columns)
df

Unnamed: 0,Season of Analysis,Age of Analysis,Childish Disease,Accident of Serious Trauma,Surgical Intervention,High fevers in last year,Frequency of alcohol consumption,Smoking Habit,Number of hours spent sitting per day,Semen Diagnosis
0,-1.0,0.53,1,1,1,0,0.8,1,0.50,0
1,-1.0,0.56,1,1,0,0,0.8,1,0.50,0
2,-1.0,0.58,1,0,1,-1,0.8,1,0.50,0
3,-1.0,0.56,1,0,0,0,1.0,-1,0.44,0
4,-1.0,0.53,1,1,0,1,1.0,0,0.31,0
...,...,...,...,...,...,...,...,...,...,...
95,-1.0,0.78,1,1,0,1,0.6,-1,0.38,0
96,-1.0,0.78,1,0,1,0,1.0,-1,0.25,0
97,-1.0,0.56,1,0,1,0,1.0,-1,0.63,0
98,-1.0,0.67,0,0,1,0,0.6,0,0.50,1


<h3>Next, we will separate the dependent variable, Semen Diagnosis, into its own Series</h3>

In [15]:
target = df['Semen Diagnosis']  # Dependent variable (class label)
inputs = df.drop('Semen Diagnosis', axis='columns')  # The remaining 9 columns are the features (inputs)
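
<p>Before modelling, it can help to look at how the classes are distributed in the target, because the accuracy reported later is only meaningful relative to that balance. The sketch below uses a small made-up Series (<code>demo_target</code> is a hypothetical stand-in, not the real data) to show the idea:</p>

```python
import pandas as pd

# Hypothetical stand-in for df['Semen Diagnosis']; the values are made up
demo_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], name="Semen Diagnosis")

# Absolute number of rows per class
class_counts = demo_target.value_counts()

# Fraction of rows per class
class_share = demo_target.value_counts(normalize=True)

print(class_counts)
print(class_share)
```

<p>On the real data the same two calls would be applied to <code>target</code>.</p>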

<h3>Check whether any column contains missing (null) values</h3>

In [14]:
inputs.columns[inputs.isna().any()]  # An empty Index means no column has missing values

Index([], dtype='object')
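
<p>To see what this check would report if values were missing, here is a minimal sketch on a made-up frame (<code>demo</code> and its values are hypothetical) with one deliberately missing entry:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value
demo = pd.DataFrame({
    "Age of Analysis": [0.53, np.nan, 0.58],
    "Smoking Habit": [1, -1, 0],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()

# Names of the columns that contain at least one missing value
columns_with_missing = demo.columns[demo.isna().any()]

print(missing_per_column)
print(list(columns_with_missing))
```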

<h3>We will split the data into training and test sets at an 80:20 ratio</h3>

In [75]:
from sklearn.model_selection import train_test_split
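# No random_state is passed, so the split (and therefore the scores below) changes from run to run
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)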

In [21]:
len(X_train), len(X_test)

(80, 20)
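
<p>The split above is random and unstratified, so a rare class can end up entirely in the training set. As a possible alternative, the sketch below passes <code>stratify</code> and <code>random_state</code> to <code>train_test_split</code>. The data here are synthetic (<code>demo_inputs</code>, <code>demo_target</code> and the 90:10 class ratio are assumptions for illustration only):</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 10 of them in class 1 (made-up ratio)
rng = np.random.default_rng(0)
demo_inputs = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])
demo_target = pd.Series([0] * 90 + [1] * 10, name="Semen Diagnosis")

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_inputs, demo_target,
    test_size=0.2, stratify=demo_target, random_state=42,
)

print(len(X_tr), len(X_te))
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```

<p>With stratification, both classes appear in the test set in proportion to their overall frequency.</p>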

<h3>scikit-learn offers several Naive Bayes variants; we use Gaussian Naive Bayes here</h3>

In [76]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()
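
<p>To make clearer what <code>GaussianNB</code> computes, the sketch below rebuilds its class probabilities by hand: a class prior, times an independent Gaussian likelihood per feature, normalised over the classes. The data are synthetic and every name other than <code>GaussianNB</code> is hypothetical. The variance smoothing term mirrors scikit-learn's default <code>var_smoothing=1e-9</code>.</p>

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data with 3 continuous features (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),
               rng.normal(1.5, 1.0, size=(10, 3))])
y = np.array([0] * 40 + [1] * 10)

model = GaussianNB().fit(X, y)

# Small constant added to every variance for numerical stability
eps = 1e-9 * X.var(axis=0).max()

log_joint = []
for c in (0, 1):
    Xc = X[y == c]
    prior = len(Xc) / len(X)          # P(class = c)
    mean = Xc.mean(axis=0)            # per-feature mean within the class
    var = Xc.var(axis=0) + eps        # per-feature variance within the class
    # Sum of per-feature Gaussian log-densities (the "naive" independence assumption)
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
    log_joint.append(np.log(prior) + log_lik)

# Normalise over classes to obtain posterior probabilities
log_joint = np.column_stack(log_joint)
log_joint -= log_joint.max(axis=1, keepdims=True)
manual_proba = np.exp(log_joint)
manual_proba /= manual_proba.sum(axis=1, keepdims=True)

print(np.abs(manual_proba - model.predict_proba(X)).max())
```

<p>Because the model treats every feature as Gaussian, coded categorical columns such as Smoking Habit are also modelled as continuous values, which is a simplification.</p>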

In [77]:
model.score(X_test, y_test)  # Mean accuracy on the test set

0.9

In [78]:
X_test[:20]

Unnamed: 0,Season of Analysis,Age of Analysis,Childish Disease,Accident of Serious Trauma,Surgical Intervention,High fevers in last year,Frequency of alcohol consumption,Smoking Habit,Number of hours spent sitting per day
95,-1.0,0.78,1,1,0,1,0.6,-1,0.38
14,-0.33,0.61,1,0,1,0,1.0,-1,0.63
1,-1.0,0.56,1,1,0,0,0.8,1,0.5
60,-0.33,0.69,0,1,1,0,0.8,0,0.88
94,1.0,0.56,1,1,1,0,1.0,-1,0.63
13,-0.33,0.58,1,1,1,-1,0.8,0,0.19
90,1.0,0.61,1,0,1,0,1.0,-1,0.63
69,1.0,0.61,1,0,0,0,1.0,-1,0.25
80,1.0,0.67,0,0,1,0,0.8,-1,0.25
78,1.0,0.75,1,1,1,0,1.0,1,0.25


In [84]:
y_test[:20]

95    0
14    0
1     0
60    0
94    0
13    0
90    0
69    0
80    0
78    0
36    0
99    0
49    0
70    0
38    0
20    0
31    0
5     0
43    0
85    0
Name: Semen Diagnosis, dtype: int64

In [81]:
model.predict(X_test[:20])

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
      dtype=int64)
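
<p>The 0.9 score can be reproduced from the two outputs above: all 20 test labels are 0, and the model predicts 1 for two of them. The sketch below copies those values and compares the model with a trivial baseline that always predicts 0 (the names <code>y_true</code>, <code>y_pred</code> and <code>baseline</code> are introduced here for illustration):</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# The 20 test labels shown above are all 0
y_true = np.zeros(20, dtype=int)

# Predictions copied from the model.predict output above
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
                   0, 0, 0, 0, 0, 1, 0, 0, 0, 0])

accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the majority class 0
baseline = accuracy_score(y_true, np.zeros_like(y_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(accuracy, baseline)
print(cm)
```

<p>Since this particular test set contains no class 1 samples, the score says nothing about how well class 1 is detected; the two predicted 1s are both false positives.</p>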

In [82]:
model.predict_proba(X_test[:20])  # Columns follow model.classes_: [P(class 0), P(class 1)]

array([[0.98059694, 0.01940306],
       [0.91904867, 0.08095133],
       [0.98120262, 0.01879738],
       [0.58767781, 0.41232219],
       [0.94074356, 0.05925644],
       [0.907075  , 0.092925  ],
       [0.77274924, 0.22725076],
       [0.8680547 , 0.1319453 ],
       [0.2706084 , 0.7293916 ],
       [0.86584114, 0.13415886],
       [0.89182338, 0.10817662],
       [0.96813129, 0.03186871],
       [0.95525215, 0.04474785],
       [0.70190782, 0.29809218],
       [0.99461205, 0.00538795],
       [0.34484921, 0.65515079],
       [0.93045731, 0.06954269],
       [0.99187064, 0.00812936],
       [0.98443923, 0.01556077],
       [0.82690852, 0.17309148]])
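
<p>Each row above holds the probabilities of class 0 and class 1 and sums to 1; <code>predict</code> returns the class with the larger probability. The sketch below illustrates this with four rows copied from the output (rows 1, 4, 9 and 16), and also shows how a different cut-off could be applied. The 0.4 threshold is an arbitrary example, not a recommendation.</p>

```python
import numpy as np

# Four rows copied from the predict_proba output above: [P(class 0), P(class 1)]
proba = np.array([
    [0.98059694, 0.01940306],
    [0.58767781, 0.41232219],
    [0.2706084, 0.7293916],
    [0.34484921, 0.65515079],
])

classes = np.array([0, 1])  # assumed to match model.classes_

# Same rule as model.predict: pick the most probable class per row
labels = classes[proba.argmax(axis=1)]

# Alternative: flag a row as class 1 once P(class 1) reaches a chosen threshold
threshold = 0.4  # arbitrary value for illustration
flagged = proba[:, 1] >= threshold

print(labels)
print(flagged)
```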