# Conditional Probability (Discrete)

**Context:** You've already spent some time conducting a preliminary exploratory data analysis (EDA) of IHH's ER data. You noticed that considering variables separately can result in misleading information. As such, today you will continue your EDA, this time also considering the *relationship between variables*. For example, you may want to know:

* Are there certain conditions that are more likely to occur on certain days?
* What makes a patient likely to need hospitalization?

**Challenge:** So far, however, we've only seen ways of characterizing the variability/stochasticity of a single random variable, in isolation from other variables. So how can we describe the relationship between variables? Answer: conditional probability.

**Outline:** 
1. Introduce and practice the concepts, terminology, and notation behind discrete conditional probability distributions (leaving continuous distributions to a later time).
2. Answer the above questions using this new toolset.

Before getting started, let's load in our IHH ER data:

In [1]:
# Import a bunch of libraries we'll be using below
import pandas as pd
import matplotlib.pyplot as plt
import numpyro
import numpyro.distributions as D
import jax
import jax.numpy as jnp

# Load in the data into a pandas dataframe
csv_fname = 'IHH-ER.csv'
data = pd.read_csv(csv_fname, index_col='Patient ID')

# Print a random sample of patients, just to see what's in the data
data.sample(15, random_state=0)

Patient ID,Day-of-Week,Condition,Hospitalized,Antibiotics,Attempts-to-Disentangle
9394,Friday,Allergic Reaction,No,No,
898,Sunday,Allergic Reaction,Yes,Yes,
2398,Saturday,Entangled Antennas,No,No,4.0
5906,Saturday,Allergic Reaction,No,No,
2343,Monday,High Fever,Yes,No,
8225,Thursday,High Fever,Yes,No,
5506,Tuesday,High Fever,No,No,
6451,Thursday,Allergic Reaction,No,No,
2670,Sunday,Intoxication,No,No,
3497,Tuesday,Allergic Reaction,No,No,


## Terminology and Notation for Discrete Conditional Probability

As with (non-conditional) discrete probability, the statistical language---terminology and notation---we introduce here will allow us to precisely specify to a computer how to model our data. In the future, we will translate statements in this language directly into code that a computer can run.

**Example:** Suppose you're working at the IHH ER, and you want to determine the probability that the next patient comes in with `Condition == "Intoxication"`. Given previously collected data, you can estimate this probability by counting the number of patients for which `Condition == "Intoxication"` and dividing by the total number of patients.

In [19]:
num_intoxicated = len(data[data['Condition'] == 'Intoxication'])
num_total = len(data)

print('Portion with Intoxication =', round(num_intoxicated / float(num_total), 3))

Portion with Intoxication = 0.171
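
The same count-and-divide estimate can be computed for every condition at once. The sketch below is only an illustration: it uses a tiny hypothetical table (`toy`, with made-up rows, not the IHH data) and pandas' `value_counts(normalize=True)`, which returns the fraction of rows taking each value.

```python
import pandas as pd

# Hypothetical toy data standing in for the IHH ER table (values are made up)
toy = pd.DataFrame({
    'Condition': ['Intoxication', 'High Fever', 'Allergic Reaction',
                  'Intoxication', 'High Fever'],
})

# Fraction of rows taking each value of 'Condition'
condition_probs = toy['Condition'].value_counts(normalize=True)

# Equivalent shortcut for one value: the mean of a boolean column
# is the fraction of rows where the comparison is True
p_intoxication = (toy['Condition'] == 'Intoxication').mean()

print(condition_probs)
print('Portion with Intoxication =', round(p_intoxication, 3))
```

Because the fractions are taken over all rows, they sum to 1 across conditions, as a probability distribution should.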


However, you also know that even in the far reaches of the outer universe, beings work Mondays through Fridays, taking Saturdays and Sundays off. Therefore, you suspect they might drink more on the weekend. You decide to check whether your intuition holds here. If it does, will knowing the day of the week improve your ability to predict how likely the next patient is to come in with intoxication?

In [18]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Iterate over the days of the week
for day in days_of_week:
    # Select all patients that came in on the specific day of the week
    patients_on_day = data[(data['Day-of-Week'] == day)]

    # Of the selected patients, further select patients with intoxication
    patient_intoxicated_on_day = patients_on_day[patients_on_day['Condition'] == 'Intoxication']

    # Compute the portion of patients with intoxication on this day
    portion_intoxicated_on_day = float(len(patient_intoxicated_on_day)) / float(len(patients_on_day))

    # Print the day and the proportion
    print(day, round(portion_intoxicated_on_day, 3))

Monday 0.095
Tuesday 0.093
Wednesday 0.094
Thursday 0.108
Friday 0.105
Saturday 0.408
Sunday 0.415


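As you can see, knowing the day of the week changes the estimated probability of a patient arriving with intoxication *substantially*: it is roughly 0.09 to 0.11 on weekdays but above 0.4 on Saturdays and Sundays, compared with 0.171 overall. Thus, it is usually better to condition on additional information if you have it!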
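
The per-day loop above can also be written as a single cross-tabulation. The following is a minimal sketch on a hypothetical toy table (`toy`, with made-up rows, not the IHH records); it assumes only the column names `Day-of-Week` and `Condition` from the data above and uses `pd.crosstab` with `normalize='index'`, which divides each row by its own total.

```python
import pandas as pd

# Hypothetical toy data (made up for illustration, not the IHH records)
toy = pd.DataFrame({
    'Day-of-Week': ['Monday', 'Monday', 'Monday', 'Monday',
                    'Saturday', 'Saturday', 'Saturday', 'Saturday'],
    'Condition': ['High Fever', 'Allergic Reaction', 'High Fever', 'Intoxication',
                  'Intoxication', 'Intoxication', 'High Fever', 'Intoxication'],
})

# Rows are days, columns are conditions. normalize='index' divides each
# row by its total, so each row is the distribution of conditions on that day.
cond_table = pd.crosstab(toy['Day-of-Week'], toy['Condition'], normalize='index')

# The 'Intoxication' column holds the per-day proportions
print(cond_table['Intoxication'])
```

Each row of `cond_table` sums to 1, which is what distinguishes it from a table of joint proportions normalized over all patients.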

**Notation:** A conditional probability distribution is a probability distribution that changes depending on the value of another random variable. You can therefore think of a conditional probability as the "if/else-expression of probability." Continuing with the above example:
* Let $D$ denote the day of the week.
* Let $I$ denote whether the patient arrives with intoxication.

Here, $p_I(\cdot)$ describes the (non-conditional) probability that a patient arrives with intoxication. It represents our initial *naive* prediction. In contrast, $p_{I | D}(\cdot | d)$ describes the *conditional* probability of intoxication given the day. In this notation, what comes on the right side of the vertical line is the "condition" (here, $D = d$). For different values of $d$, the conditional distribution $p_{I | D}(\cdot | d)$ also changes.
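
The "if/else" reading can be made literal. The sketch below hard-codes the per-day proportions printed earlier (estimates from the data, rounded to three decimals) in a dictionary; the names `P_INTOX_GIVEN_DAY` and `p_intox_given_day` are hypothetical and chosen only for this illustration.

```python
# Estimated probability of intoxication for each day d,
# copied from the per-day output above (rounded estimates)
P_INTOX_GIVEN_DAY = {
    'Monday': 0.095, 'Tuesday': 0.093, 'Wednesday': 0.094, 'Thursday': 0.108,
    'Friday': 0.105, 'Saturday': 0.408, 'Sunday': 0.415,
}

def p_intox_given_day(i, d):
    """Return the estimated p_{I|D}(i | d), where i is True or False."""
    p_yes = P_INTOX_GIVEN_DAY[d]  # which distribution we use depends on d
    return p_yes if i else 1.0 - p_yes

print(p_intox_given_day(True, 'Saturday'))
print(p_intox_given_day(True, 'Monday'))
```

For each fixed day `d`, the values over `i` sum to 1: the condition selects *which* distribution over $I$ applies, and each selected distribution is a valid probability distribution on its own.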