# Module 2 Lab

## Basics Continued (+) Conditional Probabilities with `Sepsis` Dataset

In [1]:
# Import necessary libraries - common libraries include pandas, numpy, matplotlib, and sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## If Statements and Boolean Operators

The cell below assigns several variables that describe Jenny, including the Boolean `sick = True`, which explicitly marks her as sick. The if statement that follows uses this Boolean to decide which message to print.

In [10]:
name = 'Jenny'

# Boolean => True or False
sick = True

# List => in this case a list of ages
age = [21, 20, 18, 19, 33]

# Integer => number without decimals.
age = 21

# Float => number with decimals.
weight = 60.25

# Dictionary
variables = {'Age': 21, 'Height': 60, 'Weight': 60.25}

Here, we see that Jenny is a 21-year-old young adult who weighs 60.25 pounds and is sick. It is best practice to choose explicit variable names that describe the data they hold.

In [11]:
sick = True

if sick:
    print(f'{name} is sick. Please take her to the doctor.')
else:
    print(f'{name} is not sick. She is fine.')

Jenny is sick. Please take her to the doctor.


In [12]:
sick = False

if sick:
    print(f'{name} is sick. Please take her to the doctor.')
else:
    print(f'{name} is not sick. She is fine.')

Jenny is not sick. She is fine.


## For Loops with Lists

Earlier, we created a list of ages.

In [13]:
age = [21, 20, 18, 19, 33]

## Z-Normalization

$$\frac{x-\mu}{\sigma} $$

In [14]:
# Initialize an empty list; z-scores are appended to it below
age_znorm = []

# For each age in the age list
print('Original Age \t Z-Normalized Age')
for item in age:

    # Z-Score calculation with NumPy library
    z_score = (item - np.mean(age))/np.std(age)

    # Print age and Z-score, rounding to 2 decimal places
    print('%3i\t\t\t%.2f' % (item, round(z_score, 2)))

    # Append the Z-score to a list
    age_znorm.append(round(z_score, 2))

# Print the new list of Z-scores
print('\n List of Z-Normalized Ages:')
print(age_znorm)

Original Age 	 Z-Normalized Age
 21			-0.22
 20			-0.40
 18			-0.76
 19			-0.58
 33			1.97

 List of Z-Normalized Ages:
[-0.22, -0.4, -0.76, -0.58, 1.97]
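
The loop above recomputes the mean and standard deviation on every pass. As a minimal sketch, the same z-scores can be obtained in one vectorized step by converting the list to a NumPy array. The names `ages` and `ages_z` are illustrative and are not used elsewhere in this lab. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is also what the loop uses.

```python
import numpy as np

# Same ages as in the list above, stored as a NumPy array
ages = np.array([21, 20, 18, 19, 33])

# Vectorized z-normalization: (x - mean) / std, applied to every element at once
ages_z = (ages - ages.mean()) / ages.std()

print(np.round(ages_z, 2))

# Sanity check: z-normalized values have mean ~0 and standard deviation ~1
print(np.isclose(ages_z.mean(), 0.0), np.isclose(ages_z.std(), 1.0))
```

The printed values should match the list produced by the loop, since both apply the same formula.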


## Function Definitions

Functions make code modular: the logic is written once and reused with different inputs, instead of being rewritten for each new variable. For example, a single function can compute z-scores for ages, for weights, or for any other list of numbers, as follows.

In [15]:
age = [21, 20, 18, 19, 33]
weight = [165, 78, 60.3, 115, 120]

In [16]:
def zscore_calc(values_list):

    # Initialize an empty list; z-scores are appended to it below
    values_list_znorm = []

    # For each value in the list
    print('Original Value \t Z-Normalized Value')
    for item in values_list:

        # Z-Score calculation with NumPy library
        z_score = (item - np.mean(values_list))/np.std(values_list)

        # Print value and Z-score, rounding to 2 decimal places
        print('%3i\t\t%.2f'%(item, round(z_score, 2)))

        # Append the Z-score to a list
        values_list_znorm.append(round(z_score, 2))

    # Print the final new list of z scores
    print('\n List of Z-Normalized Values:')
    print(values_list_znorm)

In [None]:
zscore_calc(weight)

Original Value 	 Z-Normalized Value
165		1.58
 78		-0.82
 60		-1.30
115		0.20
120		0.34

 List of Z-Normalized Values:
[1.58, -0.82, -1.3, 0.2, 0.34]


In [None]:
zscore_calc(age)

Original Value 	 Z-Normalized Value
 21		-0.22
 20		-0.40
 18		-0.76
 19		-0.58
 33		1.97

 List of Z-Normalized Values:
[-0.22, -0.4, -0.76, -0.58, 1.97]
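
`zscore_calc` prints its results but does not return them, so the z-scores cannot be reused in later cells. One possible variant, sketched here under the assumption that a plain list of rounded floats is the desired output, returns the values instead. The name `zscore_values` and the `decimals` parameter are hypothetical additions, not part of the original lab.

```python
import numpy as np

def zscore_values(values_list, decimals=2):
    """Return the z-scores of values_list as a list of rounded floats."""
    values = np.asarray(values_list, dtype=float)
    # Population standard deviation (ddof=0), matching np.std's default
    z_scores = (values - values.mean()) / values.std()
    return [float(z) for z in np.round(z_scores, decimals)]

# The returned list can be stored and reused
weight_z = zscore_values([165, 78, 60.3, 115, 120])
print(weight_z)
```

Returning the result keeps the computation separate from the printing, which makes the function easier to reuse and to test.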


## Bayesian Probability

### Using the Sepsis Survival minimal clinical records dataset from the UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/machine-learning-databases/00628/

Source:
Davide Chicco, Giuseppe Jurman, “Survival prediction of patients with sepsis from age, sex, and septic episode number alone”. Scientific Reports 10, 17156 (2020). [Web Link](https://doi.org/10.1038/s41598-020-73558-3)

Four clinical features:
- `age_years`: integer
- `sex_0male_1female`:  binary
- `episode_number`: integer
- `hospital_outcome_1alive_0dead`: binary

In [18]:
# Read in the sepsis data file
# Import necessary modules
from google.colab import drive
import pandas as pd # Import pandas

# Mount Google Drive
drive.mount('/content/drive')

# Access file using the full path
sepsis = pd.read_csv('/content/drive/My Drive/san-diego-assignment/assignment-2/sepsis_survival.csv')

# View the first few rows of the dataset
sepsis.head()

Mounted at /content/drive


Unnamed: 0,age_years,sex_0male_1female,episode_number,hospital_outcome_1alive_0dead
0,21,1,1,1
1,20,1,1,1
2,21,1,1,1
3,77,0,1,1
4,72,0,1,1


## Hypothesis-Testing Framework

The following information is available in the last (target) column of the dataframe without any additional or conditional information thus far.

$H_0: \text{the patient is dead == 0.}$  
$H_1: \text{the patient is alive == 1.}$

The probability of living or dying on its own, before conditioning on any other information, is called the `prior`.

Now, let's assign $X_1$ to age and $X_2$ to sex.  Suppose we want to establish what the probability is that any given patient will die given other information (e.g., their sex or age).

Updating this prior with additional information, for example that the patient is younger than 50 years old, yields the `posterior`, which can be written using the following notation:

$$P(H_0|X_1<50)$$

We can also determine the probability that a patient was younger than 50 given that they did not survive (the likelihood) as follows:


$$P((X_1<50)|H_0) $$

Following Bayes' Theorem, we have:

$$\qquad\qquad P(H_0| X_1<50) = P(X_1<50|H_0)\, \frac{P(H_0)}{P(X_1<50)}. $$
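
Bayes' Theorem follows from the definition of conditional probability. The joint probability that a patient dies and is under 50 can be factored in two ways:

$$ P(H_0 \cap X_1<50) = P(H_0|X_1<50)\, P(X_1<50) = P(X_1<50|H_0)\, P(H_0). $$

Dividing both sides of the second equality by $P(X_1<50)$ gives the expression for $P(H_0|X_1<50)$ stated above.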

## Example 1.

Compute the probability that a sepsis patient will die.

In [19]:
## Print out the number of patients that died and survived:
print('Number of patients that survived and died:')
print(sepsis['hospital_outcome_1alive_0dead'].value_counts())
print(' ')

## For proportions instead of counts, pass `normalize=True` to `value_counts`;
## select label 0 (dead) explicitly rather than relying on the sort order
percent_mortality = sepsis['hospital_outcome_1alive_0dead'] \
                    .value_counts(normalize=True).loc[0]

print('Mortality rate from sepsis = %.3f'%percent_mortality)

Number of patients that survived and died:
hospital_outcome_1alive_0dead
1    102099
0      8105
Name: count, dtype: int64
 
Mortality rate from sepsis = 0.074


## Example 2.
Calculate the probability that a sepsis patient is under 50 years old.

In [25]:
# Keep full precision here (this value is reused later); round only when printing
sepsis_50 = len(sepsis[sepsis['age_years'] < 50]) / len(sepsis['age_years'])
print('The probability that sepsis patients are under 50 is %.2f.' \
       %sepsis_50)

The probability that sepsis patients are under 50 is 0.24.


## Example 3.

Calculate the probability of being under 50 years old given death.

In [26]:
sepsis_death = sepsis[sepsis['hospital_outcome_1alive_0dead'] == 0]
sepsis_50_death = len(sepsis_death[sepsis_death['age_years'] < 50]) \
                 /len(sepsis_death['age_years'])
sepsis_50_death

0.03689080814312153

## Example 4.

Calculate the probability of dying given that the sepsis patient is under 50 years old.

$$ P(H_0|X_1<50) = P(X_1<50|H_0)\, \frac{P(H_0)}{P(X_1<50)} $$

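Substituting the rounded values from Examples 1 to 3:

$$ P(H_0|X_1<50) \approx 0.037 \times \frac{0.074}{0.24} \approx 0.011 $$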

In [27]:
posterior_death = sepsis_50_death * (percent_mortality/sepsis_50)
print('The probability of dying given that the patient is under 50 is %.2f.' \
      %posterior_death)

The probability of dying given that the patient is under 50 is 0.01.
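
As a sanity check on Bayes' Theorem, the posterior can also be computed directly by restricting the data to patients under 50 and taking the fraction who died. The sketch below uses a small, made-up DataFrame (`toy`, not the real sepsis records) with the same column names as the dataset, and shows that the two routes agree.

```python
import pandas as pd

# Hypothetical toy data for illustration only, NOT the real sepsis records
toy = pd.DataFrame({
    'age_years': [25, 34, 47, 52, 61, 68, 73, 80, 45, 90],
    'hospital_outcome_1alive_0dead': [1, 1, 0, 1, 1, 0, 1, 0, 1, 0],
})

under_50 = toy['age_years'] < 50                    # event X1 < 50
died = toy['hospital_outcome_1alive_0dead'] == 0    # event H0

# Route 1: Bayes' Theorem, P(X1<50|H0) * P(H0) / P(X1<50)
p_death = died.mean()
p_under_50 = under_50.mean()
p_under_50_given_death = under_50[died].mean()
posterior_bayes = p_under_50_given_death * p_death / p_under_50

# Route 2: direct conditional, the fraction of under-50 patients who died
posterior_direct = died[under_50].mean()

print(posterior_bayes, posterior_direct)
```

On the real data, the same direct calculation could be applied to `sepsis` in place of `toy`.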


## Example 5.

**(*Hint*)**: Similar to HW Problem # 2.4.

For a storyboard, a photographer is asked to match 7 images of couples to the 7 weddings at which they were taken, with each wedding representing a different concept.

(a) Set up a sample space for the 7 guesses.  
(b) With random guessing, find the probability of getting all 7 correct.

In [None]:
# Import the permutations function
from itertools import permutations

# Set up the sample space: every ordering of the 7 guesses.
# permutations() returns a one-shot iterator, so store it as a list
# before using it more than once.
wedding_guesses_list = list(permutations([1,2,3,4,5,6,7]))
print(wedding_guesses_list[:5])

# Count the permutations in the sample space
n_permutations = len(wedding_guesses_list)
print('There are {} possible permutations'.format(n_permutations))
print('The probability of getting all 7 concepts correct is {}'.format(round(1/n_permutations, 4)))

[(1, 2, 3, 4, 5, 6, 7), (1, 2, 3, 4, 5, 7, 6), (1, 2, 3, 4, 6, 5, 7), (1, 2, 3, 4, 6, 7, 5), (1, 2, 3, 4, 7, 5, 6)]
There are 5040 possible permutations
The probability of getting all 7 concepts correct is 0.0002
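
The number of orderings of 7 distinct items is $7!$, so the size of the sample space can be cross-checked in closed form without counting permutations one by one. A minimal sketch using only the standard library follows; `n_images`, `n_outcomes`, and `p_all_correct` are illustrative names.

```python
from itertools import permutations
from math import factorial

n_images = 7

# Closed form: the number of orderings of n distinct items is n!
n_outcomes = factorial(n_images)

# Cross-check against explicit enumeration of the sample space
n_enumerated = len(list(permutations(range(1, n_images + 1))))
print(n_outcomes == n_enumerated)

# Only one ordering matches every image correctly
p_all_correct = 1 / n_outcomes
print(n_outcomes, round(p_all_correct, 4))
```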