#### Problem Statement  
This notebook implements Naive Bayes by hand on the small Play Tennis dataset.

In [38]:
#Importing the required modules
import numpy as np
import pandas as pd

In [39]:
#Reading data from csv to a dataframe
data = pd.read_csv("play_tennis.csv")
data.head(5)

,day,outlook,temp,humidity,wind,play
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes


In [40]:
#Checking dataset shape
data.shape

(14, 6)

There are only 14 rows and 6 columns in the dataset.

In [41]:
#Dropping the day column: it is only a row identifier and carries no predictive information
data.drop(columns="day", inplace=True)

In [42]:
data.head(5)

,outlook,temp,humidity,wind,play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   outlook   14 non-null     object
 1   temp      14 non-null     object
 2   humidity  14 non-null     object
 3   wind      14 non-null     object
 4   play      14 non-null     object
dtypes: object(5)
memory usage: 692.0+ bytes


There are no null values in the dataset, and every column has dtype `object` (the values are strings).

##### <u>Problem</u> -
##### Given Input: outlook = Sunny, temp = Hot, humidity = High, wind = Weak. For the given input, will play be 'Yes' or 'No'?

##### Solution:
We will solve this by implementing **Naive Bayes**. The target column **play** has two class labels, **Yes** and **No**, so Naive Bayes computes a score for each class label __given__ the input.   

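By Bayes' theorem, each posterior also has the denominator P(Sunny,Hot,High,Weak). It is the same for both classes, so it can be dropped when comparing them, which leaves a proportionality (∝) rather than an equality. The second line of each pair uses the naive assumption that the features are conditionally independent given the class.

1st Probability -  
**P(Yes|Sunny,Hot,High,Weak) ∝ P(Sunny,Hot,High,Weak|Yes) * P(Yes)**  
    **==>P(Yes|Sunny,Hot,High,Weak) ∝ P(Sunny|Yes) * P(Hot|Yes) * P(High|Yes) * P(Weak|Yes) * P(Yes)**   

2nd Probability -  
**P(No|Sunny,Hot,High,Weak) ∝ P(Sunny,Hot,High,Weak|No) * P(No)**  
    **==>P(No|Sunny,Hot,High,Weak) ∝ P(Sunny|No) * P(Hot|No) * P(High|No) * P(Weak|No) * P(No)**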

Naive Bayes compares the two values and predicts the class with the larger one. This is the maximum a posteriori (MAP) rule.
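
For reference, the same rule can be written compactly. The notation is introduced here only for illustration and is not used elsewhere in the notebook: x_i is the value of the i-th feature, y is a class label, and ŷ is the predicted label. The second line holds only under the naive conditional-independence assumption.

```latex
\begin{aligned}
P(y \mid x_1,\dots,x_n) &= \frac{P(x_1,\dots,x_n \mid y)\,P(y)}{P(x_1,\dots,x_n)} \\
&\propto P(y)\prod_{i=1}^{n} P(x_i \mid y) \\
\hat{y} &= \arg\max_{y \in \{\text{Yes},\,\text{No}\}} \; P(y)\prod_{i=1}^{n} P(x_i \mid y)
\end{aligned}
```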

In [44]:
data

,outlook,temp,humidity,wind,play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


##### Manually creating a lookup table that stores the probability of each feature value given each class label. Behind the scenes, Naive Bayes builds this lookup table during the training phase and uses it during the testing phase.

##### -Calculating P(Yes) and P(No)

In [45]:
data['play'].value_counts()

play
Yes    9
No     5
Name: count, dtype: int64

In [46]:
#Calculating the probability of each class label of the target column
P_Yes = 9/(9+5)
P_No = 5/(9+5)

In [47]:
print(P_Yes)
print(P_No)

0.6428571428571429
0.35714285714285715
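
The same priors can be obtained without typing the counts by hand. The sketch below is one possible way to do it. To stay self-contained it rebuilds the target column from the counts shown above (9 Yes, 5 No) in a variable named `play`, which is an assumption of this example; in the notebook, `data['play']` would take its place.

```python
import pandas as pd

# Target column rebuilt from the counts shown above (9 "Yes", 5 "No").
# In the notebook this would simply be data["play"].
play = pd.Series(["Yes"] * 9 + ["No"] * 5, name="play")

# normalize=True returns relative frequencies instead of raw counts
priors = play.value_counts(normalize=True)

P_Yes = priors["Yes"]
P_No = priors["No"]
print(P_Yes, P_No)
```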


##### -Calculating the conditional probability of each outlook value given each class label

In [48]:
pd.crosstab(data['outlook'], data['play'])

outlook,No,Yes
Overcast,0,4
Rain,2,3
Sunny,3,2


In [49]:
P_Overcast_No = 0/(0+2+3)
P_Rain_No = 2/(0+2+3)
P_Sunny_No = 3/(0+2+3)

P_Overcast_Yes = 4/(4+3+2)
P_Rain_Yes = 3/(4+3+2)
P_Sunny_Yes = 2/(4+3+2)

In [50]:
print(P_Overcast_Yes,P_Overcast_No,P_Rain_Yes,P_Rain_No,P_Sunny_Yes,P_Sunny_No)

0.4444444444444444 0.0 0.3333333333333333 0.4 0.2222222222222222 0.6
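
As a possible shortcut, `pd.crosstab` can do the division itself through its `normalize` argument. The sketch below rebuilds the outlook/play pairs from the counts in the crosstab above, so the names `outlook_counts`, `df` and `outlook_given_play` are assumptions of this example. With the notebook's dataframe, the call would be `pd.crosstab(data['outlook'], data['play'], normalize='columns')`.

```python
import pandas as pd

# (outlook, play) -> count, copied from the crosstab output above
outlook_counts = {
    ("Overcast", "No"): 0, ("Overcast", "Yes"): 4,
    ("Rain", "No"): 2, ("Rain", "Yes"): 3,
    ("Sunny", "No"): 3, ("Sunny", "Yes"): 2,
}

# Expand the counts back into one row per observation
rows = [pair for pair, n in outlook_counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["outlook", "play"])

# normalize="columns" divides each cell by its column total,
# which gives P(outlook value | play label)
outlook_given_play = pd.crosstab(df["outlook"], df["play"], normalize="columns")
print(outlook_given_play)
```

Each column of the result sums to 1, which is a quick sanity check on the hand-computed values.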


##### -Calculating the conditional probability of each temp value given each class label

In [51]:
pd.crosstab(data['temp'],data['play'])

temp,No,Yes
Cool,1,3
Hot,2,2
Mild,2,4


In [52]:
P_Cool_No = 1/(1+2+2)
P_Hot_No = 2/(1+2+2)
P_Mild_No = 2/(1+2+2)

P_Cool_Yes = 3/(3+2+4)
P_Hot_Yes = 2/(3+2+4)
P_Mild_Yes = 4/(3+2+4)

In [53]:
print(P_Cool_Yes,P_Cool_No,P_Hot_Yes,P_Hot_No,P_Mild_Yes,P_Mild_No)

0.3333333333333333 0.2 0.2222222222222222 0.4 0.4444444444444444 0.4


##### -Calculating the conditional probability of each humidity value given each class label

In [54]:
pd.crosstab(data['humidity'], data['play'])

humidity,No,Yes
High,4,3
Normal,1,6


In [55]:
P_High_No = 4/(4+1)
P_Normal_No = 1/(4+1)

P_High_Yes = 3/(3+6)
P_Normal_Yes = 6/(3+6)

In [56]:
print(P_High_Yes,P_High_No,P_Normal_Yes,P_Normal_No)

0.3333333333333333 0.8 0.6666666666666666 0.2


##### -Calculating the conditional probability of each wind value given each class label

In [57]:
pd.crosstab(data['wind'], data['play'])

wind,No,Yes
Strong,3,3
Weak,2,6


In [58]:
P_Strong_No = 3/(3+2)
P_Weak_No = 2/(3+2)

P_Strong_Yes = 3/(3+6)
P_Weak_Yes = 6/(3+6)

In [59]:
print(P_Strong_Yes,P_Strong_No,P_Weak_Yes,P_Weak_No)

0.3333333333333333 0.6 0.6666666666666666 0.4
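
The individual variables above can also be collected into a single lookup table. The following is a minimal sketch, assuming a nested dictionary layout (`lookup[feature][value][label]`) chosen here for illustration; the counts are copied from the four crosstab outputs above, and the names `counts`, `class_totals` and `lookup` are hypothetical.

```python
# Counts copied from the four crosstab outputs above:
# feature -> value -> {"No": count, "Yes": count}
counts = {
    "outlook": {
        "Overcast": {"No": 0, "Yes": 4},
        "Rain": {"No": 2, "Yes": 3},
        "Sunny": {"No": 3, "Yes": 2},
    },
    "temp": {
        "Cool": {"No": 1, "Yes": 3},
        "Hot": {"No": 2, "Yes": 2},
        "Mild": {"No": 2, "Yes": 4},
    },
    "humidity": {
        "High": {"No": 4, "Yes": 3},
        "Normal": {"No": 1, "Yes": 6},
    },
    "wind": {
        "Strong": {"No": 3, "Yes": 3},
        "Weak": {"No": 2, "Yes": 6},
    },
}

# Number of rows per class label (from value_counts above)
class_totals = {"No": 5, "Yes": 9}

# lookup[feature][value][label] = P(value | label)
lookup = {
    feature: {
        value: {label: n / class_totals[label] for label, n in by_label.items()}
        for value, by_label in values.items()
    }
    for feature, values in counts.items()
}

print(lookup["humidity"]["High"])
```

With this layout, `lookup["wind"]["Weak"]["Yes"]` plays the role of `P_Weak_Yes`.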


#### Solving Problem  
Given Input: outlook = Sunny, temp = Hot, humidity = High, wind = Weak.  
For the given input, will play be 'Yes' or 'No'?

In [60]:
# Unnormalized score for P(Yes|Sunny,Hot,High,Weak), using the formula mentioned above
P_Yes_SHHW = P_Sunny_Yes * P_Hot_Yes * P_High_Yes * P_Weak_Yes * P_Yes

# Unnormalized score for P(No|Sunny,Hot,High,Weak), using the formula mentioned above
P_No_SHHW = P_Sunny_No * P_Hot_No * P_High_No * P_Weak_No * P_No

In [61]:
print("P(Yes|Sunny,Hot,High,Weak) is ", P_Yes_SHHW)
print("P(No|Sunny,Hot,High,Weak) is ", P_No_SHHW)

P(Yes|Sunny,Hot,High,Weak) is  0.007054673721340387
P(No|Sunny,Hot,High,Weak) is  0.02742857142857143
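
The two printed values do not sum to 1 because the shared denominator P(Sunny,Hot,High,Weak) was left out. If actual posterior probabilities are wanted, the scores can be divided by their sum, which is the value the model assigns to that denominator. The sketch below recomputes the scores from the fractions above; the variable names are assumptions of this example.

```python
# Unnormalized scores, recomputed from the conditional probabilities above
score_yes = (2 / 9) * (2 / 9) * (3 / 9) * (6 / 9) * (9 / 14)
score_no = (3 / 5) * (2 / 5) * (4 / 5) * (2 / 5) * (5 / 14)

# Under the model, the shared denominator is the sum of both scores
evidence = score_yes + score_no

posterior_yes = score_yes / evidence
posterior_no = score_no / evidence
print(round(posterior_yes, 3), round(posterior_no, 3))  # 0.205 0.795
```

Normalizing does not change which class wins; it only rescales both scores by the same factor.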


**Result -**    
Since **P(No|Sunny,Hot,High,Weak) > P(Yes|Sunny,Hot,High,Weak)**, the predicted class is **No**.

In other words, for the given input, i.e. **outlook = Sunny, temp = Hot, humidity = High, wind = Weak,** the Naive Bayes model predicts that **no** tennis game will be played.