## Approach

Our approach uses three functions. The first computes the proportions that appear inside the entropy formula, the second plugs those proportions into the entropy formula, and the third computes the information gain from the resulting entropies.
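For reference, the quantities these three functions compute can be written out in standard notation. The symbols below ($Y$ for the dependent variable, $X$ for an independent variable, $c$ for a class of $Y$, $g$ for a group of $X$, $n_g$ for the number of rows in group $g$ and $n$ for the total number of rows) are labels chosen here for illustration and are not names used in the code.

$$
H(Y) = -\sum_{c} p(c)\,\log_2 p(c)
$$

$$
H(Y \mid X = g) = -\sum_{c} p(c \mid g)\,\log_2 p(c \mid g)
$$

$$
IG(Y, X) = H(Y) - \sum_{g} \frac{n_g}{n}\, H(Y \mid X = g)
$$

The proportions $p(c)$ and $p(c \mid g)$ are the "fraction part" computed by the first function, with the usual convention that a zero proportion contributes nothing to the sum.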

In [1]:
# Importing the required libraries
import pandas as pd
import math

### Function to get the fraction part of the entropy formula

The *prob* function takes its inputs through \*\*kwargs, which passes a variable-length set of keyword arguments. The arguments must be given in a fixed order: the name of the data file, then the name of the dependent variable, then the names of the independent variables. The first for loop goes through the unique values of the dependent variable and records the proportion of rows in which each value occurs, under keys of the form "Prob_(Dependent variable)_(Group of the dependent variable)". The second for loop is nested: the outer loop goes through the unique values of the dependent variable, the middle loop goes through the independent variables passed as arguments, and the innermost loop goes through the groups of each independent variable. Each iteration adds an entry to the dictionary whose key has the form "prob_(Independent variable)_(Group of the independent variable)_(Group of the dependent variable)", e.g. "prob_Pulse_Strong_Pos". The value for that key is the proportion of strong-pulse cases for which the Oracle is Positive.
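To illustrate how a fixed argument order can be recovered from \*\*kwargs, here is a minimal sketch. The name `show_order` is a hypothetical helper introduced for this example and is not part of the notebook. It relies on keyword arguments keeping the order in which they were written at the call site, which Python guarantees from version 3.6 onward.

```python
def show_order(**kwargs):
    # kwargs is a dict that preserves the order of the keyword arguments
    # at the call site, so its values can be read positionally.
    return list(kwargs.values())


args = show_order(data="sample(1).data", dep="Oracle", i1="Pulse", i2="BP")

data_file = args[0]        # first keyword: the data file
dependent = args[1]        # second keyword: the dependent variable
independents = args[2:]    # remaining keywords: the independent variables
```

The keyword names themselves are ignored with this pattern, so swapping two arguments at the call site changes the result even if the names stay the same.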

In [2]:
def prob(**kwargs):
    # Keyword order matters: data file, dependent variable, then independent variables.
    args = list(kwargs.values())
    df = pd.read_csv(args[0], sep=" ")
    dep = args[1]
    d = {}

    # Proportion of each class of the dependent variable.
    for i in df[dep].unique():
        d['Prob_'+dep+'_'+i] = df[df[dep] == i].shape[0]/df.shape[0]

    # Proportion of class i within each group j of every independent variable.
    for i in df[dep].unique():
        for var in args[2:]:
            for j in df[var].unique():
                in_group = df[df[var] == j]
                d["prob_"+var+'_'+j+"_"+i] = in_group[in_group[dep] == i].shape[0] / in_group.shape[0]

    return d

In [3]:
p = prob(data= "sample(1).data", dep= "Oracle", i1= "Pulse", i2= "BP", i3= "Age")
p

{'Prob_Oracle_Pos': 0.45,
 'Prob_Oracle_Neg': 0.55,
 'prob_Pulse_Strong_Pos': 0.4166666666666667,
 'prob_Pulse_Weak_Pos': 0.5,
 'prob_BP_Normal_Pos': 0.7777777777777778,
 'prob_BP_Abnormal_Pos': 0.18181818181818182,
 'prob_Age_Teen_Pos': 0.7142857142857143,
 'prob_Age_Adult_Pos': 0.5,
 'prob_Age_Senior_Pos': 0.14285714285714285,
 'prob_Pulse_Strong_Neg': 0.5833333333333334,
 'prob_Pulse_Weak_Neg': 0.5,
 'prob_BP_Normal_Neg': 0.2222222222222222,
 'prob_BP_Abnormal_Neg': 0.8181818181818182,
 'prob_Age_Teen_Neg': 0.2857142857142857,
 'prob_Age_Adult_Neg': 0.5,
 'prob_Age_Senior_Neg': 0.8571428571428571}
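As a cross-check on the conditional proportions, the same quantity can be obtained with `pd.crosstab`. The sketch below uses a small hypothetical DataFrame (`toy`), not the contents of sample(1).data, and the names `cond` and `flat` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data, made up for this example.
toy = pd.DataFrame({
    "Pulse":  ["Strong", "Strong", "Strong", "Weak", "Weak"],
    "Oracle": ["Pos",    "Neg",    "Neg",    "Pos",  "Pos"],
})

# Rows are Pulse groups, columns are Oracle classes.
# normalize="index" divides each row by its total, so every row sums to 1.
cond = pd.crosstab(toy["Pulse"], toy["Oracle"], normalize="index")

# Flatten the table into the same key scheme that prob uses.
flat = {
    f"prob_Pulse_{group}_{cls}": cond.loc[group, cls]
    for group in cond.index
    for cls in cond.columns
}
```

In this toy data one of the three strong-pulse rows is Pos, so `flat["prob_Pulse_Strong_Pos"]` is one third, and a class that never occurs in a group gets a proportion of zero rather than a missing key.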

### Function to get the entropy

The *entropy* function also takes its inputs through \*\*kwargs. As with *prob*, the arguments must be given in a fixed order: the name of the data file, then the dictionary returned by *prob*, then the name of the dependent variable, then the names of the independent variables. The first for loop goes through the unique values of the dependent variable and accumulates the entropy of the response variable. The second for loop is nested: it goes through each independent variable and each of its groups, sums the entropy terms of every matching entry in the *prob* dictionary, and stores the result in a dictionary that the function returns. A proportion of zero raises a ValueError in the logarithm, which is caught and treated as a contribution of zero.

In [4]:
def entropy(**kwargs):
    # Keyword order matters: data file, dictionary from prob, dependent variable,
    # then independent variables.
    args = list(kwargs.values())
    df = pd.read_csv(args[0], sep=" ")
    probs, dep = args[1], args[2]
    entr = {}

    # Entropy of the dependent variable.
    cls_ent = 0
    for i in df[dep].unique():
        cls_ent += -probs['Prob_'+dep+"_"+i] * math.log2(probs['Prob_'+dep+"_"+i])
    entr["ent_"+dep] = cls_ent

    # Entropy of the dependent variable within each group of each independent variable.
    for var in args[3:]:
        for j in df[var].unique():
            ent = 0
            for k in probs:
                # The trailing "_" keeps a group name from matching a longer one.
                if k.startswith('prob_'+var+'_'+j+'_'):
                    try:
                        ent += -probs[k] * math.log2(probs[k])
                    except ValueError:
                        # log2(0) is undefined; a zero proportion contributes nothing.
                        ent += 0
            entr["ent_"+var+"_"+j] = ent
    return entr

In [5]:
e = entropy(data= "sample(1).data", dic = p, dep= "Oracle", i1= "Pulse", i2= "BP", i3= "Age")
e

{'ent_Oracle': 0.9927744539878084,
 'ent_Pulse_Strong': 0.9798687566511528,
 'ent_Pulse_Weak': 1.0,
 'ent_BP_Normal': 0.7642045065086203,
 'ent_BP_Abnormal': 0.6840384356390417,
 'ent_Age_Teen': 0.863120568566631,
 'ent_Age_Adult': 1.0,
 'ent_Age_Senior': 0.5916727785823275}
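Two of these values can be reproduced by hand from the proportions printed earlier. The sketch below is a standalone check, not the notebook's code. `binary_entropy` is a hypothetical helper that only covers the two-class case, which applies here because Oracle takes the values Pos and Neg.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class split with proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: a zero proportion contributes nothing
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Prob_Oracle_Pos = 0.45 in the output of prob.
ent_oracle = binary_entropy(0.45)

# prob_Pulse_Weak_Pos = 0.5 in the output of prob.
ent_pulse_weak = binary_entropy(0.5)
```

Here `ent_oracle` agrees with the printed `ent_Oracle` value of about 0.9928, and `ent_pulse_weak` is exactly 1.0, the maximum for an even two-class split.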

### Function to get the information gain

The *infogain* function also takes its inputs through \*\*kwargs. As with *prob*, the arguments must be given in a fixed order: the name of the data file, then the dictionary returned by *entropy*, then the name of the dependent variable, then the names of the independent variables. The function has a single nested loop over the independent variables. For each variable, the gain variable starts at the entropy of the dependent variable. We then go through the keys of the entropy dictionary and pick out those whose name matches the independent variable. For each matching group, its entropy is weighted by the proportion of rows that fall in that group and subtracted from the gain. The final gain for each variable is stored in a dictionary, which is returned.

In [6]:
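def infogain(**kwargs):
    # Keyword order matters: data file, dictionary from entropy, dependent variable,
    # then independent variables.
    args = list(kwargs.values())
    df = pd.read_csv(args[0], sep=" ")
    entr, dep = args[1], args[2]
    gains = {}
    for var in args[3:]:
        # Start from the entropy of the dependent variable.
        gain = entr['ent_'+dep]
        prefix = "ent_"+var+"_"

        for k in entr:
            # The trailing "_" in prefix keeps a variable name from matching a longer one.
            if k.startswith(prefix):
                # Keys look like "ent_<variable>_<group>".
                group = k[len(prefix):]
                weight = df[df[var] == group].shape[0]/df.shape[0]
                gain = gain - (entr[k] * weight)

        gains['infogain_'+var] = gain

    return gains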

In [7]:
infogain(data= "sample(1).data", dic = e, dep= "Oracle", i1= "Pulse", i2= "BP", i3= "Age")

{'infogain_Pulse': 0.004853199997116753,
 'infogain_BP': 0.27266128645745624,
 'infogain_Age': 0.18359678248567296}
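The BP figure can be checked by hand. The sketch below is a standalone calculation under one stated assumption: the counts are the smallest whole numbers consistent with the printed proportions (9 Pos out of 20 rows overall, 7 Pos out of 9 Normal, 2 Pos out of 11 Abnormal). The actual file could be a multiple of these counts, which would leave the result unchanged. `binary_entropy` and `groups_bp` are hypothetical names introduced for this example.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class split with proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: a zero proportion contributes nothing
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Assumed counts, inferred from the printed proportions (see the note above).
n_total = 20
n_pos = 9
groups_bp = {"Normal": (7, 9), "Abnormal": (2, 11)}  # group: (Pos count, group size)

# Entropy of Oracle before splitting.
parent_entropy = binary_entropy(n_pos / n_total)

# Entropy of each BP group, weighted by the share of rows in that group.
weighted_entropy = sum(
    (size / n_total) * binary_entropy(pos / size)
    for pos, size in groups_bp.values()
)

gain_bp = parent_entropy - weighted_entropy
```

Under these assumed counts, `gain_bp` agrees with the printed `infogain_BP` value of about 0.2727.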

## Conclusion

It was a fairly straightforward project that involved applying the same formulae repeatedly across several independent variables. We could have written a single function that returned whichever quantity was requested (proportion, entropy or information gain), but we chose to write three separate functions so that the code would be easier to read and follow.

In [12]:
import math
math.exp(-1.28)

0.27803730045319414

In [13]:
1+0.27803730045319414

1.2780373004531942

In [14]:
1/1.2780373004531942

0.7824497764231124

In [11]:
0.82*1.5-0.95+1

1.28

In [15]:
0.78*(1-0.78)*(1-0.78)

0.03775199999999999

In [16]:
0.82*(1-0.82)*(1.5*0.038)

0.008413200000000001

In [17]:
0.95*(1-0.95)*(-1*0.038)

-0.0018050000000000015

In [18]:
1.5+0.5*0.038*0.82

1.51558