# DS 2500 HW 7

Due: Mon Apr 03 @ 11:59PM

### Submission Instructions
Please submit both of the following to the corresponding [gradescope](https://www.gradescope.com/courses/478298) assignment:
- this `.ipynb` file
     -  <span style="color:red">give a fresh Kernel > Restart & Run All just before uploading</span>
         - there is no autograder for hw7, so this step is extra important here!

- a `.py` file consistent with your `.ipynb`
    - `File > Download as ...`

### Tips for success
- Start early
- Make use of [Piazza](https://course.ccs.neu.edu/ds2500/admin_piazza.html)
- Make use of [Office Hours](https://course.ccs.neu.edu/ds2500/office_hours.html)
- Remember that [Documentation / style counts for credit](https://course.ccs.neu.edu/ds2500/python_style.html)
- [No student may view or share their ungraded homework with another](https://course.ccs.neu.edu/ds2500/syllabus.html#academic-integrity-and-conduct)

| part                                        |       | ex cred   |   part total |
|:--------------------------------------------|:------|:----------|-------------:|
| Part 1: `BayesNetwork.add_prior_node`       | 20.0  |           |           20 |
| Part 2: `BayesNetwork.add_conditional_node` | 25.0  |           |           25 |
| Part 3: `BayesNetwork.get_prob`             | 20.0  |           |           20 |
| Part 4: `BayesNetwork.get_conditional_prob` | 15.0  |           |           15 |
| Part 5: Gardening                           | 15.0  |           |           15 |
| Part 6: Memory Analysis                     | 5.0   |           |            5 |
| Part 7: Build-your-own                      |       | 4.0       |            4 |
| total                                       | 100.0 | 4.0       |          104 |

# Suggestions:

- only modify the code in the cell immediately below
    - modifying the tests can alter the intended behavior of the test
- test your code by giving a fresh restart & run for each run
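    - the tests are built to be run in the given sequence, running a code cell twice can break them (for example, adding the same node a second time trips the `non-unique node` assertion)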
    
# Hints:

- this `BayesNetwork` class operates just like the "manual spreadsheet" computation shown in class.  Before diving into the syntax and programming challenge of building it, be sure you're comfortable with the mathematics and the "manual" computation method.
- [hw7_hint](hw7_hint.ipynb) has a few constructions which could be useful   


In [1]:
from copy import copy
from collections import defaultdict
import pandas as pd


class BayesNetwork:
    """ Bayes Net, computes full joint table

    Attributes:
        df_joint (pd.DataFrame): a column per random variable plus another col
            for probability.  each row contains the outcomes of the
            corresponding random variable or the joint prob of entire row
    """

    def __init__(self):
        # note: we specify type of prob as float with 1.0 below
        self.df_joint = pd.DataFrame({'prob': [1.0]})

    def add_prior_node(self, rv_name, prob_dist):
        """ adds a prior node to joint distribution table

        Args:
            rv_name (str): name of random variable (must be unique in df_joint)
            prob_dist (dict): keys are outcomes of random variable, values are
                probability of each
        """
        assert rv_name not in self.df_joint.columns, \
            f'non-unique node: {rv_name}'
        
        #if the table still holds a single row, overwrite it with one row per outcome of the new variable
        if len(self.df_joint) == 1:
            row_idx = 0
            for key, value in prob_dist.items():
                self.df_joint.loc[row_idx, rv_name] = key
                self.df_joint.iloc[row_idx,0] = value
                row_idx += 1
        
        else:
            #create a copy of the current df_joint
            df_joint_copy = copy(self.df_joint)
        
            #initialize empty row list
            row_list = list()
            for idx, row in df_joint_copy.iterrows():
                #initialize previous row default dict
                prev_row = defaultdict(lambda: 0)
                
                #iterate through each column and then create dictionary using keys as the column name and values as the
                #value of the (row, col)
                for col in df_joint_copy.columns:
                    prev_row[col] = df_joint_copy.loc[idx, col]
            
                #get the probability of the row
                prev_prob = df_joint_copy.loc[idx, 'prob']
            
                #initialize default dict
                dict_row = defaultdict(lambda: 0)
                
                #loop through the prior probability dictionary
                for key, value in prob_dist.items():
                    
                    #get the probability of the outcome
                    prob = prev_prob * value
                    
                    #add to row dictionary
                    dict_row['prob'] = prob
                    
                    #add the outcome to the row
                    dict_row[rv_name] = key
                
                    #copy previous row then update that copy and add to the row list
                    prev_row_copy = copy(prev_row)
                    prev_row_copy.update(dict_row)
                    row_list.append(prev_row_copy)
                    
            #the new df_joint will be a dataframe created from the row list
            self.df_joint = pd.DataFrame(row_list)
            
    def add_conditional_node(self, cond_dist):
        """ adds a conditional node to joint distribution table

        Args:
            cond_dist (ConditionalProb): a conditional probability of some new
                random variable.  (conditioned on random variables already in
                df_joint)
        """
        # check that all conditioned variables are in joint already
        assert set(cond_dist.condition_list).issubset(self.df_joint.columns), \
            f'condition rvs not in joint table: {cond_dist.condition_list}'
        
        # check that target variable is not in joint already
        assert cond_dist.target not in self.df_joint.columns, \
            f'random variable already in network: {cond_dist.target}'
        
        #create a copy of the current df_joint
        df_joint_copy = copy(self.df_joint)
        
        #initialize empty row list
        row_list = list()

        #iterate through the rows in df_joint
        for idx, row in df_joint_copy.iterrows():
            #initialize previous row default dict
            prev_row = defaultdict(lambda: 0)
            
            #iterate through each column and then create dictionary using keys as the column name and values as the
            #value of the (row, col)
            for col in df_joint_copy.columns:
                prev_row[col] = df_joint_copy.loc[idx, col]
        
            #get the probability of the row
            prev_prob = df_joint_copy.loc[idx, 'prob']
            
            #get the outcome of the condition_list variables for the row
            cond_outcome = tuple(row.loc[cond_dist.condition_list].values)
        
            #initialize default dict
            dict_row = defaultdict(lambda: 0)
    
            #look up the target distribution for this row's condition outcomes
            outcome_dict = cond_dist.cond_prob_dict[cond_outcome]
            #loop through the conditional probability dictionary
            for key, value in outcome_dict.items():
                #get the probability of the outcome
                prob = prev_prob * value
                
                #add to row dictionary
                dict_row['prob'] = prob
                
                #add all the outcomes from the condition list to the row dictionary
                #(zip avoids reusing the outer loop's idx variable)
                for rv, outcome in zip(cond_dist.condition_list, cond_outcome):
                    dict_row[rv] = outcome
                    
                #add the outcome of the target to row dictionary
                dict_row[cond_dist.target] = key
                
                #copy previous row then update that copy and add to the row list
                prev_row_copy = copy(prev_row)
                prev_row_copy.update(dict_row)
                row_list.append(prev_row_copy)
        
        #the new df_joint will be a dataframe created from the row list
        self.df_joint = pd.DataFrame(row_list)
        
    def get_prob(self, state):
        """ sums all rows which satisfy state (marginalization)

        Args:
            state (dict): keys are random variable, values are corresponding
                outcomes
                
        Returns:
            prob (float): probability of the given state
        """
        
        prob = 0
        #iterate through each row in the dataframe
        for _, row in self.df_joint.iterrows():
            #create a dict out of the dataframe row
            row_dict = dict(row)
    
            #check if every outcome in state matches the row, if so, add to probability
            if set(state.items()).issubset(row_dict.items()):
                prob = prob + row_dict['prob']
                
            
        return prob

    def get_conditional_prob(self, state, condition):
        """ computes conditional probability of state given condition:

        P(ABC|XYZ) = P(ABCXYZ) / P(XYZ)

        above ABC are state variables while XYZ are conditional variables

        Args:
            state (dict): keys are random variable, values are corresponding
                outcomes
            condition (dict): keys are random variable, values are
                corresponding outcomes
                
        Returns:
            prob (float): probability of the given state given condition
        """
        # check that no variable is in state & conditional
        rv_double = set(state.keys()).intersection(condition.keys())
        assert not rv_double, \
            f'same random variable before & after conditional: {rv_double}'
        #combine the state and condition to describe the event "state and condition"
        combined_dict = state | condition

        #use get_prob to get the numerator and denominator of the formula
        prob_num = self.get_prob(combined_dict)
        prob_dem = self.get_prob(condition)

        return prob_num / prob_dem

# Part 1: `BayesNetwork.add_prior_node` (20 points)

We validate whether the nodes have been added properly by constructing a known example: 

<img src="https://miro.medium.com/max/640/1*9OsQV0PqM2juaOtGqoRISw.jpeg" width=500>

and comparing output `bayes_net.df_joint` to expected dataframes, which are stored in the [expected_csv](expected_csv) folder.

In [2]:
# for example, after adding the cloudy node to the network, bayes_net.df_joint should look as below:
df_expected = pd.read_csv('expected_csv/prob_cloudy.csv', index_col=False)
df_expected

Unnamed: 0,prob,Cloudy
0,0.5,c0
1,0.5,c1


In [3]:
# build bayes net with cloudy node
bayes_net = BayesNetwork()
bayes_net.add_prior_node('Cloudy', prob_dist={'c0': .5, 'c1': .5})

# manually check output dataframe (just this first time, to see how to debug below)
bayes_net.df_joint

Unnamed: 0,prob,Cloudy
0,0.5,c0
1,0.5,c1


In [4]:
from df_compare import assert_df_equal_no_idx

# automatically compare expected to actual dataframe
# (this is somewhat challenging because the rows or cols of
# two dataframes can be shuffled while the tables remain
# equivalent for our purposes; see df_compare.py for details,
# though reading it isn't necessary to complete the assignment)
assert_df_equal_no_idx(bayes_net.df_joint, df_expected)

# Part 2: `BayesNetwork.add_conditional_node` (25 points)

Hint:
- Inspect and study the given output DataFrames via their [expected_csv](expected_csv) before implementing!

In [5]:
from conditional import ConditionalProb

# add rain conditional prob
cond_prob_rain = \
    ConditionalProb(target='Rain',
                    condition_list=['Cloudy'],
                    cond_prob_dict={('c1',): {'r1': .8, 'r0': .2},
                                    ('c0',): {'r1': .2, 'r0': .8}})
bayes_net.add_conditional_node(cond_prob_rain)

# check that rain conditional prob was added properly
df_joint_expected = pd.read_csv('expected_csv/prob_cloudy_rain.csv', index_col=False)
assert_df_equal_no_idx(df_joint_expected, bayes_net.df_joint)

In [6]:
# add sprinkler conditional prob
cond_prob_sprinkler = \
    ConditionalProb(target='Sprinkler',
                    condition_list=['Cloudy'],
                    cond_prob_dict={('c1',): {'s1': .1, 's0': .9},
                                    ('c0',): {'s1': .5, 's0': .5}})
bayes_net.add_conditional_node(cond_prob_sprinkler)

# check that sprinkler conditional prob was added properly
df_joint_expected = pd.read_csv('expected_csv/prob_cloudy_rain_sprinkler.csv', index_col=False)
assert_df_equal_no_idx(df_joint_expected, bayes_net.df_joint)

In [7]:
df_joint_expected

Unnamed: 0,prob,Cloudy,Rain,Sprinkler
0,0.05,c0,r1,s1
1,0.05,c0,r1,s0
2,0.2,c0,r0,s1
3,0.2,c0,r0,s0
4,0.04,c1,r1,s1
5,0.36,c1,r1,s0
6,0.01,c1,r0,s1
7,0.09,c1,r0,s0
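
The eight rows above follow from the chain rule, P(Cloudy, Rain, Sprinkler) = P(Cloudy) P(Rain | Cloudy) P(Sprinkler | Cloudy). The sketch below redoes that "manual spreadsheet" multiplication with plain dictionaries. It is only an illustration: the names `p_cloudy`, `p_rain`, `p_sprinkler` and `joint` are made up here, and it deliberately avoids pandas and the `BayesNetwork` class.

```python
from itertools import product
from math import isclose

# distributions copied from the cells above
p_cloudy = {'c0': .5, 'c1': .5}
p_rain = {'c1': {'r1': .8, 'r0': .2},
          'c0': {'r1': .2, 'r0': .8}}
p_sprinkler = {'c1': {'s1': .1, 's0': .9},
               'c0': {'s1': .5, 's0': .5}}

# one joint entry per combination of outcomes (2 * 2 * 2 = 8 rows)
joint = {}
for c, r, s in product(p_cloudy, ['r1', 'r0'], ['s1', 's0']):
    # chain rule: prior of Cloudy times each conditional given Cloudy
    joint[(c, r, s)] = p_cloudy[c] * p_rain[c][r] * p_sprinkler[c][s]

for outcome, prob in joint.items():
    print(outcome, round(prob, 4))

# a valid joint distribution sums to one
assert isclose(sum(joint.values()), 1.0)
```

Each printed row can be checked by hand against the table above, which is a quick way to debug `add_conditional_node`.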


In [8]:
# add wet grass conditional prob
cond_prob_grass_wet = \
    ConditionalProb(target='WetGrass',
                    condition_list=['Rain', 'Sprinkler'],
                    cond_prob_dict={('r1', 's1'): {'w1': .99, 'w0': .01},
                                    ('r0', 's1'): {'w1': 0.9, 'w0': .1},
                                    ('r1', 's0'): {'w1': 0.9, 'w0': .1},
                                    ('r0', 's0'): {'w1': 0.0, 'w0': 1}})
bayes_net.add_conditional_node(cond_prob_grass_wet)

# check that wet grass conditional prob was added properly
df_joint_expected = pd.read_csv('expected_csv/prob_cloudy_rain_sprinkler_grass.csv', index_col=False)
assert_df_equal_no_idx(df_joint_expected, bayes_net.df_joint)

# Part 3: `BayesNetwork.get_prob` (20 points)

In [9]:
from math import isclose

assert isclose(bayes_net.get_prob({'Cloudy': 'c1'}), .5)

assert isclose(bayes_net.get_prob({'Sprinkler': 's1', 'Cloudy': 'c1'}), .05)
assert isclose(bayes_net.get_prob({'Sprinkler': 's1', 'Cloudy': 'c0'}), .25)
assert isclose(bayes_net.get_prob({'Sprinkler': 's1'}), .3)

assert isclose(bayes_net.get_prob({'Rain': 'r1', 'Cloudy': 'c1'}), .4)
assert isclose(bayes_net.get_prob({'Rain': 'r1', 'Cloudy': 'c0'}), .1)
assert isclose(bayes_net.get_prob({'Rain': 'r1'}), .5)

#### extra math note (not needed for HW completion, helpful for probability fluency though)

The chunks of three assert statements immediately above demonstrate marginalization: 

- there are only two ways the sprinkler is on: 
    - when it's cloudy or when it's clear outside (.3 = .05 + .25)
- there are only two ways it's raining:     
    - when it's cloudy or when it's clear outside (.5 = .4 + .1)
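
As a rough illustration of the same marginalization, the sketch below sums matching rows of the Cloudy / Rain / Sprinkler joint table printed in Part 2 (WetGrass is left out, which does not change these particular sums). The helper name `marginal` and the tuple layout are assumptions made for this example, not part of the assignment code.

```python
from math import isclose

columns = ('Cloudy', 'Rain', 'Sprinkler')
# (prob, Cloudy, Rain, Sprinkler), copied from the joint table printed in Part 2
rows = [
    (0.05, 'c0', 'r1', 's1'),
    (0.05, 'c0', 'r1', 's0'),
    (0.20, 'c0', 'r0', 's1'),
    (0.20, 'c0', 'r0', 's0'),
    (0.04, 'c1', 'r1', 's1'),
    (0.36, 'c1', 'r1', 's0'),
    (0.01, 'c1', 'r0', 's1'),
    (0.09, 'c1', 'r0', 's0'),
]


def marginal(state):
    """Sum the probability of every row that agrees with `state`."""
    total = 0.0
    for prob, *outcomes in rows:
        row = dict(zip(columns, outcomes))
        if all(row[rv] == outcome for rv, outcome in state.items()):
            total += prob
    return total


p_s1_c1 = marginal({'Sprinkler': 's1', 'Cloudy': 'c1'})
p_s1_c0 = marginal({'Sprinkler': 's1', 'Cloudy': 'c0'})
p_s1 = marginal({'Sprinkler': 's1'})

# summing over both outcomes of Cloudy recovers the marginal of Sprinkler
assert isclose(p_s1, p_s1_c0 + p_s1_c1)
print(round(p_s1_c1, 2), round(p_s1_c0, 2), round(p_s1, 2))
```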

# Part 4: `BayesNetwork.get_conditional_prob` (15 points)

To validate `.get_conditional_prob()` we reproduce known conditional probs from the bayes net definition:

In [10]:
# what's the prob the sprinkler is on given it's cloudy?
assert isclose(bayes_net.get_conditional_prob(state={'Sprinkler': 's1'}, condition={'Cloudy': 'c1'}), .1)

In [11]:
# what's the prob it's not raining given it's not cloudy?
assert isclose(bayes_net.get_conditional_prob(state={'Rain': 'r0'}, condition={'Cloudy': 'c0'}), .8)

In [12]:
# what's the prob the lawn is wet given the sprinkler is on and it's raining?
assert isclose(bayes_net.get_conditional_prob(state={'WetGrass': 'w1'}, condition={'Sprinkler': 's1',
                                                                                    'Rain': 'r1'}), .99)
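
The ratio behind these checks, P(state | condition) = P(state, condition) / P(condition), can be sketched with two numbers from the Part 3 asserts. The helper `conditional` is hypothetical; the zero-probability guard is an addition for illustration, since the ratio is undefined when the condition never occurs.

```python
from math import isclose


def conditional(p_joint, p_condition):
    """P(state | condition) from P(state, condition) and P(condition)."""
    if p_condition == 0:
        # conditioning on an impossible event has no defined answer
        raise ValueError('condition has zero probability')
    return p_joint / p_condition


# values taken from the Part 3 asserts
p_sprinkler_and_cloudy = .05   # P(Sprinkler = s1, Cloudy = c1)
p_cloudy = .5                  # P(Cloudy = c1)

p = conditional(p_sprinkler_and_cloudy, p_cloudy)
print(p)

# matches the sprinkler distribution given for cloudy days
assert isclose(p, .1)
```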

# Part 5: Gardening (15 points)

A gardener wants their newly planted lawn to have (at least) a 70% chance of being wet while using their sprinkler as little as possible, to conserve water.  Each morning they step outside their house and observe only if it is cloudy or not.  With only this evidence, they want to know whether they must turn their sprinkler on.

- on clear days, should the gardener turn on their sprinkler?
- on cloudy days, should the gardener turn on their sprinkler?
- is it possible for the gardener to always ensure at least 70% chance of having a wet lawn?

Call a few methods of the bayes net above to investigate the questions immediately above.  Write a summary of results in 2-3 sentences which is easily understood by a gardener who knows little of probability or Bayes Nets.

In [13]:
prob_lawn_wet_c0 = bayes_net.get_conditional_prob(state={'WetGrass': 'w1'}, condition={'Cloudy': 'c0', 'Sprinkler': 's0'})
print(f'The probability that the lawn is wet if it is a clear day and the sprinkler is off is {prob_lawn_wet_c0:.3f}')

prob_lawn_wet_c0_sprinkler = bayes_net.get_conditional_prob(state={'WetGrass': 'w1'}, condition={'Cloudy': 'c0', 
                                                                                              'Sprinkler': 's1'})
print(f'The probability that the lawn is wet if it is a clear day and the sprinkler is on is {prob_lawn_wet_c0_sprinkler:.3f}')

The probability that the lawn is wet if it is a clear day and the sprinkler is off is 0.180
The probability that the lawn is wet if it is a clear day and the sprinkler is on is 0.918


On a clear day, the gardener should turn on the sprinkler because without the sprinkler the probability of having a wet lawn is less than 70%, but with the sprinkler the probability is over 90%.

In [14]:
prob_lawn_wet_c1 = bayes_net.get_conditional_prob(state={'WetGrass': 'w1'}, condition={'Cloudy': 'c1', 'Sprinkler': 's0'})
print(f'The probability that the lawn is wet if it is a cloudy day and the sprinkler is off is {prob_lawn_wet_c1}')

The probability that the lawn is wet if it is a cloudy day and the sprinkler is off is 0.72


On a cloudy day, the probability that the lawn is wet without the sprinkler is over 70%, so the gardener should not turn on the sprinkler.

In [15]:
prob_lawn_wet_c0_sprinkler = bayes_net.get_conditional_prob(state={'WetGrass': 'w1'}, condition={'Cloudy': 'c0', 
                                                                                              'Sprinkler': 's1'})
print(f'The probability that the lawn is wet if it is a clear day and the sprinkler is on is {prob_lawn_wet_c0_sprinkler:.3f}')

The probability that the lawn is wet if it is a clear day and the sprinkler is on is 0.918


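Yes: with the sprinkler on, a clear day has about a 92% chance of a wet lawn, and a cloudy day already reaches 72% with the sprinkler off, so the gardener can always ensure at least a 70% chance of a wet lawn.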
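
The gardener's rule can be written as a threshold check on the chance of a wet lawn with the sprinkler off. This is a small sketch using the two probabilities printed above; the names `THRESHOLD`, `p_wet_sprinkler_off` and `decision` are invented for the example.

```python
THRESHOLD = 0.7  # the gardener wants at least a 70% chance of a wet lawn

# P(WetGrass = w1 | sky, sprinkler off), copied from the outputs above
p_wet_sprinkler_off = {'clear': 0.180, 'cloudy': 0.72}

# True means the sprinkler is needed to reach the threshold
decision = {sky: p_wet < THRESHOLD
            for sky, p_wet in p_wet_sprinkler_off.items()}

for sky, needs_sprinkler in decision.items():
    action = 'turn the sprinkler on' if needs_sprinkler else 'leave the sprinkler off'
    print(f'{sky} morning: {action}')
```

Running the sprinkler only when this check fails uses the least water while keeping the lawn above the threshold.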

# Part 6: Memory Analysis (5 points)

Let's consider the liver disease bayes net example shown in class.  Assuming it has 40 total nodes, and each is binary, how much memory would it cost to store the probability column of `df_joint` as shown above?  Assume that every combination of variables must be stored as a float which uses `np.ones(1).nbytes / 1e6` megabytes of space.

Summarize your computation in 2 sentences so a non-technical reader can understand the drawback.  (Note: this memory problem lies with our implementation, there are methods to avoid it)

Hint:
- it's a big number, so don't try this line of code, as you'll run out of memory before you get an answer:
    - `np.ones(2 ** 40).nbytes / 1e6` megabytes of space

When you add a binary node to the joint probability table, the number of rows doubles, so adding 40 binary nodes doubles the table 40 times, giving 2^40 (about 1.1 trillion) rows. If each row's probability takes `np.ones(1).nbytes / 1e6` megabytes (8 bytes), the probability column alone takes 2^40 times that, the same as `np.ones(2**40).nbytes / 1e6` megabytes. This is roughly 8.8 terabytes of memory, far more than a typical computer has.
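
The size can be computed with integer arithmetic alone, without allocating the array. The sketch assumes each probability is an 8-byte float64, the size `np.ones(1).nbytes` would report; the variable names are illustrative.

```python
BYTES_PER_FLOAT = 8  # assumption: float64, as reported by np.ones(1).nbytes

n_nodes = 40               # binary random variables in the network
n_rows = 2 ** n_nodes      # one row per combination of outcomes
total_bytes = n_rows * BYTES_PER_FLOAT

print(f'{n_rows:,} rows')
print(f'{total_bytes / 1e6:,.0f} megabytes')
print(f'{total_bytes / 1e12:.1f} terabytes')
```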

# Part 7: Build-your-own (4 ex cred pts)

Build your own Bayes Net problem!

1. Provide a graphical representation which contains a graph and all necessary distributions
    - see the thief, alarm, dog, doorbell, earthquake example in class
    - include it as an embedded image directly below
1. Implement it as a `BayesNetwork`
1. Write a few questions which tell a "data story".  Answer them by querying your network and interpreting results.
    - again, see the thief example in class for a "data story"
    
Grab your project team's data if you'd like :)

I'd love a few more beautiful examples for use in future coursework.  Make a super-clean figure and a compelling datastory to earn the full four points of credit.  

If you're willing to share this in future coursework (for any course or instructor) please shoot me a copy via email saying "You, or other instructors, are welcome to use this in any future course".  Also, let us know if you'd like us to cite you or whether you'd like us to give credit to an anonymous DS2500 student.  Your consent to use / share won't impact whether you score extra credit points.

![graph](https://imagizer.imageshack.com/img924/244/jW0rai.png)

In [16]:
#initialize bayes net
#(an outcome ending in 1 means true, an outcome ending in 0 means false)
bayes_net_class = BayesNetwork() 

#add study node
bayes_net_class.add_prior_node('Study', prob_dist={'st0': .5, 'st1': .5})
bayes_net_class.df_joint

Unnamed: 0,prob,Study
0,0.5,st0
1,0.5,st1


In [17]:
#add sleep node
cond_prob_sleep = \
    ConditionalProb(target='Sleep',
                    condition_list=['Study'],
                    cond_prob_dict={('st1',): {'sl1': .2, 'sl0': .8},
                                    ('st0',): {'sl1': .9, 'sl0': .1}})
bayes_net_class.add_conditional_node(cond_prob_sleep)

In [18]:
# add test node
cond_prob_test = \
    ConditionalProb(target='Test',
                    condition_list=['Study', 'Sleep'],
                    cond_prob_dict={('st1', 'sl1'): {'t1': .9, 't0': .1},
                                    ('st0', 'sl1'): {'t1': .45, 't0': .55},
                                    ('st1', 'sl0'): {'t1': 0.7, 't0': .3},
                                    ('st0', 'sl0'): {'t1': 0.01, 't0': .99}})
bayes_net_class.add_conditional_node(cond_prob_test)

#add quiz node
cond_prob_quiz = \
    ConditionalProb(target='Quiz',
                    condition_list=['Study', 'Sleep'],
                    cond_prob_dict={('st1', 'sl1'): {'q1': .95, 'q0': .05},
                                    ('st0', 'sl1'): {'q1': 0.7, 'q0': .3},
                                    ('st1', 'sl0'): {'q1': 0.8, 'q0': .2},
                                    ('st0', 'sl0'): {'q1': 0.4, 'q0': .6}})
bayes_net_class.add_conditional_node(cond_prob_quiz)

In [19]:
#add class node
cond_prob_class = \
    ConditionalProb(target='Class',
                    condition_list=['Test', 'Quiz'],
                    cond_prob_dict={('t1', 'q1'): {'c1': .99, 'c0': .01},
                                    ('t0', 'q1'): {'c1': 0.55, 'c0': .45},
                                    ('t1', 'q0'): {'c1': 0.85, 'c0': .15},
                                    ('t0', 'q0'): {'c1': 0.01, 'c0': .99}})
bayes_net_class.add_conditional_node(cond_prob_class)

bayes_net_class.df_joint

Unnamed: 0,prob,Study,Sleep,Test,Quiz,Class
0,0.140332,st0,sl1,t1,q1,c1
1,0.001417,st0,sl1,t1,q1,c0
2,0.051637,st0,sl1,t1,q0,c1
3,0.009112,st0,sl1,t1,q0,c0
4,0.095288,st0,sl1,t0,q1,c1
5,0.077963,st0,sl1,t0,q1,c0
6,0.000743,st0,sl1,t0,q0,c1
7,0.073508,st0,sl1,t0,q0,c0
8,0.000198,st0,sl0,t1,q1,c1
9,2e-06,st0,sl0,t1,q1,c0


In [20]:
prob_pass_class = bayes_net_class.get_prob({'Class': 'c1'})
print(f'The probability of passing the class is {prob_pass_class:.3f}')

The probability of passing the class is 0.716


The student has a .716 overall chance of passing the class.

In [21]:
# overall probability that the student sleeps
prob_sleep = bayes_net_class.get_prob({'Sleep': 'sl1'})
prob_sleep

0.55

In [22]:
prob_pass_class = bayes_net_class.get_conditional_prob(state = {'Class': 'c1'}, condition = {'Sleep': 'sl1', 'Study': 'st0'})
print(f'The probability of passing the class given that the student sleeps and does not study is {prob_pass_class:.3f}')

The probability of passing the class given that the student sleeps and does not study is 0.640


In [23]:
prob_pass_class = bayes_net_class.get_conditional_prob(state = {'Class': 'c1'}, condition = {'Study': 'st1', 'Sleep': 'sl0'})
print(f'The probability of passing the class given that the student studies and does not sleep is {prob_pass_class:.3f}')

The probability of passing the class given that the student studies and does not sleep is 0.806


The student has a better chance of passing the class if they study and skip sleep (0.806) than if they sleep and skip studying (0.640).

In [24]:
prob_pass_quiz = bayes_net_class.get_conditional_prob(state={'Quiz': 'q1'}, condition={'Study': 'st1', 'Sleep': 'sl0'})
prob_pass_test = bayes_net_class.get_conditional_prob(state={'Test': 't1'}, condition={'Study': 'st1', 'Sleep': 'sl1'})

print(f'the probability of passing the quiz while studying and not sleeping is {prob_pass_quiz:.3f}')
print(f'the probability of passing the test with studying and sleep is {prob_pass_test:.3f}')

the probability of passing the quiz while studying and not sleeping is 0.800
the probability of passing the test with studying and sleep is 0.900
