<center><h2>Artificial and Computational Intelligence (Assignment - 2)</h2></center>

## Problem Statement

As part of the 2nd Assignment, we'll implement Bayesian Networks and also learn to use the pomegranate library.

You are required to create a Bayesian network model that predicts the probability of a match outcome. The detailed problem description and the marking scheme are attached to this assignment as a PDF.

### What is a Bayesian Network?

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). 

Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. 
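
The disease and symptom example can be made concrete with a two-node network, Disease → Symptom. The sketch below is only an illustration: the numbers and the names `p_disease`, `p_symptom_given` and `posterior_disease` are assumptions chosen for the example, not values from the assignment.

```python
# Hypothetical two-node network: Disease -> Symptom.
# All probabilities below are made-up illustration values.
p_disease = {"flu": 0.10, "none": 0.90}          # prior P(Disease)
p_symptom_given = {"flu": 0.80, "none": 0.05}    # P(fever observed | Disease)

def posterior_disease(prior, likelihood):
    """Return P(Disease | symptom observed) using Bayes' rule."""
    # Joint probability of each disease together with the observed symptom
    joint = {d: prior[d] * likelihood[d] for d in prior}
    evidence = sum(joint.values())  # P(symptom observed)
    return {d: joint[d] / evidence for d in joint}

posterior = posterior_disease(p_disease, p_symptom_given)
print(posterior)
```

With these assumed numbers, observing the symptom raises the probability of the disease well above its prior, which is the kind of "cause given effect" reasoning described above.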

### Dataset

The dataset can be downloaded from https://drive.google.com/drive/folders/1oMtKmmvPkN4O8DmrHMJe6M8CbB93Z5kw (accessible only with your BITS ID). The same dataset is also attached to the assignment.

#### Dataset Description
##### Sample Tuple

Y	won	5wickets	lost	2nd	vWest_Indies	Home	6-Nov-11

##### Explanation
- The first column indicates whether Ashwin was in the playing eleven (`Y`) or not (`N`).
- The second column is the result of the match; `won` indicates that India won the match.
- The third column is the margin of victory or loss.
- The fourth column is the result of the toss; `won` indicates that India won the toss.
- The fifth column is the batting order: whether India batted 1st or 2nd.
- The sixth column is the opponent.
- The seventh column is the location of the match: Home (India) or Away.
- The last column is the start date of the match.
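
As a small illustration of the column layout, the sample tuple can be split on tabs and paired with column names. The names below are the ones this notebook uses when it later drops and recodes columns; the assumption that they appear in the file in exactly this order follows the explanation above and has not been checked against the spreadsheet. The variable names `columns`, `sample` and `record` are hypothetical.

```python
# Column names as used later in this notebook; their order is assumed
# to follow the column-by-column explanation above.
columns = ["Ashwin", "Result", "Margin", "Toss", "Bat",
           "Opposition", "Location", "Start Date"]

# The sample tuple from the dataset description (tab-separated)
sample = "Y\twon\t5wickets\tlost\t2nd\tvWest_Indies\tHome\t6-Nov-11"

# Pair each value with its column name
record = dict(zip(columns, sample.split("\t")))

for name, value in record.items():
    print(f"{name}: {value}")
```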


### Evaluation
We wish to evaluate based on
- the coding practices followed
- comments that explain the code and the logic behind it
- your understanding and explanation of the data
- how well the model performs

### BITS Roll Numbers, Names

#### 2018AH04612 , Prabhu Koduri
#### 2018AH04599 , K L N Rao
#### 2018AH04543 , Naveen
#### 2018AH04573 , Sagnik

In [1]:
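# Import libraries (these class names belong to the pomegranate 0.x API)
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, State)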
import pandas as pd

#### Bayesian Net Graph:
**Note:** Please include the file `Test.png` in the same directory as this notebook.

<img src="Test.png" alt="Drawing" style="width: 300px;"/>

In [2]:
# Read the data and create a DataFrame from it

df = pd.read_excel('India_Test_Stats.xlsx')

#### Data Preprocessing:

##### Understanding of the data:
<br> a. From the Bayesian Net graph we can identify the irrelevant columns and drop them
<br> b. Choosing to bat depends only on the toss on that day
<br> c. Ashwin playing depends on the location of the match
<br> d. Match result depends on the batting order and on Ashwin playing

This understanding of the data helps us identify the conditional dependence between attributes and construct the Bayesian Network accordingly.

Please note that, as per the Bayesian Net definition, the probability of an attribute can be calculated given only its parents.
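
The dependencies listed above can be written down as a parent map, from which the chain-rule factorization of the joint distribution follows directly. This is a minimal sketch: the `parents` dictionary and the `factorization` helper are hypothetical names introduced here, and the node names are the column names used in this notebook.

```python
# Parents of each node, as read from the dependencies listed above
parents = {
    "Toss": [],
    "Location": [],
    "Bat": ["Toss"],
    "Ashwin": ["Location"],
    "Result": ["Bat", "Ashwin"],
}

def factorization(parents):
    """Build the chain-rule factorization implied by the parent sets."""
    terms = []
    for node, pars in parents.items():
        if pars:
            terms.append(f"P({node} | {', '.join(pars)})")
        else:
            terms.append(f"P({node})")
    return " * ".join(terms)

print(factorization(parents))
```

Each factor corresponds to one table that has to be estimated from the data: a prior for the two root nodes and a conditional table for each of the other three.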


In [3]:
# Removing redundant columns

df=df.drop(["Margin",'Opposition','Start Date'],axis=1)

In [4]:
# Streamlining data
# Correcting data to follow the same format across nodes

df['Toss']   = df['Toss'].replace('won','Win')
df['Toss']   = df['Toss'].replace('lost','Loss')

df['Result'] = df['Result'].replace('won','Win')
df['Result'] = df['Result'].replace('lost','Loss')
df['Result'] = df['Result'].replace('draw','Draw')

df['Ashwin'] = df['Ashwin'].replace('Y','Yes')
df['Ashwin'] = df['Ashwin'].replace('N','No')

1) Create a function to calculate prior probability of any given variable. The function should read in an array and output a dictionary of prior probability of each possible outcome.

e.g. {'A': 1/4, 'B': 1/2, 'C': 1/4}
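
The example dictionary above can be reproduced with a few lines of plain Python. This is only a sketch of the counting idea, using a hypothetical helper `prior_from_values` and a made-up input list; unlike a version driven by a fixed list of domain values, it reports only the outcomes that actually occur in the input.

```python
from collections import Counter

def prior_from_values(values):
    """Return {outcome: relative frequency} for a sequence of outcomes."""
    counts = Counter(values)
    total = len(values)
    return {outcome: count / total for outcome, count in counts.items()}

# Made-up input chosen to match the example: one A, two Bs, one C
example_prior = prior_from_values(["A", "B", "B", "C"])
print(example_prior)
```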

In [5]:
# Create domain dictionary to hold all possible domain values for each node.

domainDict             = {}
domainDict["Toss"]     = ['Win','Loss']
domainDict["Location"] = ['Home','Away']
domainDict["Ashwin"]   = ['Yes','No']
domainDict["Result"]   = ['Win','Loss','Draw']
domainDict["Bat"]      = ['1st','2nd']

In [6]:
# This method generates prior probability for the passed in variable

def CalculatePrior(df,var):
    tempDict = {}
    for i in domainDict[var]:
        #print((df[var] == i).sum())
        tempDict[i] = (df[var] == i).sum()/df[var].count()
    return tempDict

In [7]:
# Invoke CalculatePrior to generate prior probabilities for all nodes

locationPrior = CalculatePrior(df,'Location')
tossPrior = CalculatePrior(df,'Toss')
battingPrior = CalculatePrior(df,'Bat')
ashwinPrior = CalculatePrior(df,'Ashwin')
resultPrior = CalculatePrior(df,'Result')
#locationPrior,tossPrior,battingPrior,ashwinPrior,resultPrior

2) Create a function to calculate conditional probability. The function should read in multiple arrays and calculate the posterior probability of the last array with respect to the previous arrays. For example, if you pass arrays “Location” and “Ashwin Playing” the output should be:


[[ 'home', 'Y', 0.xx ],
 [ 'home', 'N', 0.xx ],
 [ 'away', 'Y', 0.xx ],
 [ 'away', 'N', 0.xx ]]

In [8]:
# Method to calculate the conditional probability of a target node
# given either one or two parent nodes.
# Pass 'ZZ' as var1 to condition on var2 alone.
# This method returns the probabilities as a 2-dimensional list enumerated over all possible
# values for each of the nodes passed as arguments

def CalculateCond(df,var1,var2,target):
    probTable = []
    if target is None or var1 is None or var2 is None:
        return None
    if var1 != 'ZZ':
        # Two parents: P(target | var1, var2)
        for srcValue1 in domainDict[var1]:
            for srcValue2 in domainDict[var2]:
                # Rows that match this combination of parent values
                subset = df[(df[var1] == srcValue1) & (df[var2] == srcValue2)]
                for targetValue in domainDict[target]:
                    val = (subset[target] == targetValue).sum()/subset.shape[0]
                    probTable.append([srcValue1,srcValue2,targetValue,val])
        return probTable
    elif var2 != 'ZZ':
        # One parent: P(target | var2)
        for srcValue in domainDict[var2]:
            # Rows that match this parent value
            subset = df[df[var2] == srcValue]
            for targetValue in domainDict[target]:
                val = (subset[target] == targetValue).sum()/subset.shape[0]
                probTable.append([srcValue,targetValue,val])
        return probTable
    # No parent was supplied
    return None

In [9]:
CalculateCond(df,'Bat','Ashwin','Result')

[['1st', 'Yes', 'Win', 0.7027027027027027],
 ['1st', 'Yes', 'Loss', 0.1891891891891892],
 ['1st', 'Yes', 'Draw', 0.10810810810810811],
 ['1st', 'No', 'Win', 0.5555555555555556],
 ['1st', 'No', 'Loss', 0.2222222222222222],
 ['1st', 'No', 'Draw', 0.2222222222222222],
 ['2nd', 'Yes', 'Win', 0.48484848484848486],
 ['2nd', 'Yes', 'Loss', 0.24242424242424243],
 ['2nd', 'Yes', 'Draw', 0.2727272727272727],
 ['2nd', 'No', 'Win', 0.0],
 ['2nd', 'No', 'Loss', 0.8333333333333334],
 ['2nd', 'No', 'Draw', 0.16666666666666666]]
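
A quick sanity check on a conditional probability table is that, for each combination of parent values, the probabilities over the target sum to 1. The sketch below applies that check to the rows printed above; `rows` is a copy of that output and `row_sums` is a hypothetical helper name.

```python
from collections import defaultdict

# Rows copied from the CalculateCond(df,'Bat','Ashwin','Result') output above
rows = [
    ['1st', 'Yes', 'Win', 0.7027027027027027],
    ['1st', 'Yes', 'Loss', 0.1891891891891892],
    ['1st', 'Yes', 'Draw', 0.10810810810810811],
    ['1st', 'No', 'Win', 0.5555555555555556],
    ['1st', 'No', 'Loss', 0.2222222222222222],
    ['1st', 'No', 'Draw', 0.2222222222222222],
    ['2nd', 'Yes', 'Win', 0.48484848484848486],
    ['2nd', 'Yes', 'Loss', 0.24242424242424243],
    ['2nd', 'Yes', 'Draw', 0.2727272727272727],
    ['2nd', 'No', 'Win', 0.0],
    ['2nd', 'No', 'Loss', 0.8333333333333334],
    ['2nd', 'No', 'Draw', 0.16666666666666666],
]

def row_sums(table):
    """Sum the probabilities for each combination of parent values."""
    sums = defaultdict(float)
    # The last two entries are the target value and its probability;
    # everything before them identifies the parent combination.
    for *parent_values, _target, prob in table:
        sums[tuple(parent_values)] += prob
    return dict(sums)

for parent_values, total in row_sums(rows).items():
    print(parent_values, round(total, 6))
```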

#### Construct Bayesian Network with Pomegranate Model

In [10]:
# Creating DiscreteDistribution objects from the prior probabilities.
# Only Location and Toss (the root nodes) are used directly in the network below.

Location = DiscreteDistribution(CalculatePrior(df,'Location'))
Toss     = DiscreteDistribution(CalculatePrior(df,'Toss'))

Ashwin   = DiscreteDistribution(CalculatePrior(df,'Ashwin'))
Result   = DiscreteDistribution(CalculatePrior(df,'Result'))
Bat      = DiscreteDistribution(CalculatePrior(df,'Bat'))


In [11]:
# Creating conditional probability tables (with reference to the Bayesian Net graph) for
# i.  level 2 nodes, conditioned on the level 1 node distributions
# ii. the level 3 node, conditioned on the level 2 conditional probability tables

AshwinCond = ConditionalProbabilityTable(CalculateCond(df,'ZZ','Location','Ashwin'),[Location])

BatCond    = ConditionalProbabilityTable(CalculateCond(df,'ZZ','Toss','Bat'),[Toss])

ResultCond = ConditionalProbabilityTable(CalculateCond(df,'Bat','Ashwin','Result'),[ BatCond, AshwinCond])

In [12]:
# Create the required states for pomegranate library
# Each node in the graph is considered as a state.

s1  = State(Toss,       name="Toss")
s2  = State(Location,   name="Location")
s3  = State(BatCond,    name="BatToss")
s4  = State(AshwinCond, name="AshwinLocation")
s5  = State(ResultCond, name="Result")

In [13]:
# Create the Bayesian network object with a useful name
model = BayesianNetwork("Spinning the Bayes Net")

# Add the states to the network 
model.add_states(s1, s2, s3, s4, s5)

# Add edges which represent conditional dependencies, where the second node is
# conditionally dependent on the first node.
# These edges should exactly match the Bayesian Net graph provided above.

model.add_edge(s1, s3)
model.add_edge(s2, s4)
model.add_edge(s3, s5)
model.add_edge(s4, s5)

model.bake()

### Solution:

4) Use the Bayesian Network model created to calculate the probability of:
  <br> a. India winning, batting 2nd, Ashwin playing
  <br> b. India winning, batting 2nd, Ashwin not playing
  <br> c. India losing, batting 2nd, Ashwin playing
  <br> d. India losing, batting 2nd, Ashwin not playing
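
If the four cases are read as conditional probabilities of the form P(Result | Bat, Ashwin), which is one possible reading of the question, the values can be looked up directly in the conditional table computed earlier. The sketch below is an assumption-laden cross-check rather than the required pomegranate solution; `cpt_rows` copies the `Bat == '2nd'` rows of the `CalculateCond` output and `lookup` is a hypothetical helper.

```python
# Rows for Bat == '2nd', copied from the CalculateCond output shown earlier
cpt_rows = [
    ['2nd', 'Yes', 'Win', 0.48484848484848486],
    ['2nd', 'Yes', 'Loss', 0.24242424242424243],
    ['2nd', 'Yes', 'Draw', 0.2727272727272727],
    ['2nd', 'No', 'Win', 0.0],
    ['2nd', 'No', 'Loss', 0.8333333333333334],
    ['2nd', 'No', 'Draw', 0.16666666666666666],
]

def lookup(table, bat, ashwin, result):
    """Return P(Result = result | Bat = bat, Ashwin = ashwin) from the table."""
    for row_bat, row_ashwin, row_result, prob in table:
        if (row_bat, row_ashwin, row_result) == (bat, ashwin, result):
            return prob
    raise KeyError((bat, ashwin, result))

# The four cases a-d, read as P(Result | Bat = 2nd, Ashwin)
queries = {
    "a": ('2nd', 'Yes', 'Win'),
    "b": ('2nd', 'No', 'Win'),
    "c": ('2nd', 'Yes', 'Loss'),
    "d": ('2nd', 'No', 'Loss'),
}
answers = {label: lookup(cpt_rows, *query) for label, query in queries.items()}

for label, prob in answers.items():
    print(label, round(prob, 4))
```

Under this reading the result is the quantity being predicted, so it would not be passed as evidence; a joint probability P(Result, Bat, Ashwin) would instead need the Bat and Ashwin distributions as well.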

In [14]:
# Invoke predict_proba with no evidence to get the marginal distribution of each node

model.predict_proba({})

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Win" :0.47058823529411764,
            "Loss" :0.5294117647058824
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.5058823529411764,
            "Away" :0.49411764705882366
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "1st" :0.5411764705882353,
            "2nd" :0.4588235294117648
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Yes" :0.8235294117647056,
            "No" :0.1764705882352944
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribut

In [15]:
# a. India winning, batting 2nd, Ashwin playing
# Evidence is passed as a list in state order (Toss, Location, BatToss, AshwinLocation, Result);
# None marks a node that is not observed

model.predict_proba([None, None, '2nd', 'Yes', 'Win'])

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Win" :0.10256410256410274,
            "Loss" :0.8974358974358972
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.6142857142857142,
            "Away" :0.3857142857142858
        }
    ],
    "frozen" :false
},
       '2nd', 'Yes', 'Win'], dtype=object)
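
Because Toss influences the other nodes only through the batting order, the Toss distribution in the output above is P(Toss | Bat = 2nd); the Ashwin and Result evidence does not change it. As a cross-check, Bayes' rule can be rearranged to recover P(Bat = 2nd | Toss = Win) from numbers already printed in this notebook. The variable names below are hypothetical and the values are copied from the `predict_proba` outputs.

```python
# Values copied from the predict_proba outputs above
p_toss_win = 0.47058823529411764            # P(Toss = Win), no evidence
p_bat_2nd = 0.4588235294117648              # P(Bat = 2nd), no evidence
p_toss_win_given_2nd = 0.10256410256410274  # P(Toss = Win | Bat = 2nd)

# Bayes' rule rearranged:
# P(Bat = 2nd | Toss = Win) = P(Toss = Win | Bat = 2nd) * P(Bat = 2nd) / P(Toss = Win)
p_2nd_given_toss_win = p_toss_win_given_2nd * p_bat_2nd / p_toss_win

print(round(p_2nd_given_toss_win, 4))
```

The result comes out at roughly 0.1, suggesting that in this dataset India chose to bat second in about one in ten of the matches where it won the toss.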

In [16]:
# a. India winning, batting 2nd, Ashwin playing
# (same query as above, with the evidence passed as a dict keyed by state name)
model.predict_proba({'BatToss': '2nd', 'AshwinLocation':'Yes','Result' :'Win'})

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Win" :0.10256410256410274,
            "Loss" :0.8974358974358972
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.6142857142857142,
            "Away" :0.3857142857142858
        }
    ],
    "frozen" :false
},
       '2nd', 'Yes', 'Win'], dtype=object)

In [17]:
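# b. India winning, batting 2nd, Ashwin not playing
# Note: the conditional table above gives P(Win | 2nd, No) = 0.0, so this combination
# never occurs in the dataset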
model.predict_proba({'BatToss': '2nd', 'AshwinLocation':'No','Result' :'Win'})

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Win" :0.47058823529411764,
            "Loss" :0.5294117647058824
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.5058823529411764,
            "Away" :0.49411764705882366
        }
    ],
    "frozen" :false
},
       '2nd', 'No', 'Win'], dtype=object)

In [18]:
# c. India losing, batting 2nd, Ashwin playing

model.predict_proba({'BatToss': '2nd', 'AshwinLocation':'Yes','Result' :'Loss'})

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Win" :0.10256410256410274,
            "Loss" :0.8974358974358972
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.6142857142857142,
            "Away" :0.3857142857142858
        }
    ],
    "frozen" :false
},
       '2nd', 'Yes', 'Loss'], dtype=object)

In [19]:
#d. India losing, batting 2nd, Ashwin not playing
model.predict_proba({'BatToss': '2nd', 'AshwinLocation':'No','Result' :'Loss'})

array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Win" :0.10256410256410274,
            "Loss" :0.8974358974358972
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "Home" :0.0,
            "Away" :1.0
        }
    ],
    "frozen" :false
},
       '2nd', 'No', 'Loss'], dtype=object)

<h3><center> Happy Coding!</center></h3>