<a href="https://colab.research.google.com/github/fundamentals-of-data-science/course-materials/blob/master/labs/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 Lab - Probability

## General Instructions

In this course, Labs are the chance to apply concepts and methods discussed in the module.
They are a low stakes (pass/fail) opportunity for you to try your hand at *doing*.
Please make sure you follow the general Lab instructions, described in the Syllabus.
The summary is:

* Discussions should start as students work through the material, beginning on the Wednesday that starts the new Module week.
* Labs are due by **Sunday**.
* Lab solutions are released Monday.  
* Post Self Evaluation and Lab to Lab Group on Blackboard and Lab to Module on Blackboard on **Monday**.

The last part is important because the Problem Sets will require you to perform the same or similar tasks without guidance.
Problem Sets are your opportunity to demonstrate that you understand how to apply the concepts and methods discussed in the relevant Modules and Labs.

## Specific Instructions

1.  For Blackboard submissions, if there are no accompanying files, you should submit *only* your notebook and it should be named using *only* your JHED id: fsmith79.ipynb for example if your JHED id were "fsmith79". If the assignment requires additional files, you should name the *folder/directory* your JHED id and put all items in that folder/directory, ZIP it up (only ZIP...no other compression), and submit it to Blackboard.

    * do **not** use absolute paths in your notebooks. All resources should be located in the same directory as the rest of your assignments.
    * the directory **must** be named your JHED id and **only** your JHED id.
    * do **not** return files provided by us (data files, .py files)


2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.

3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.

4. Submit a cleanly executed notebook. The first code cell should say `In [1]` and each successive code cell should increase by 1 throughout the notebook.

In [None]:
from pprint import pprint

## Manipulating and Interpreting Probability

Given the following *joint probability distribution*, $P(A,B)$, for $A$ and $B$,

```
|    | a1   | a2   |
|----|------|------|
| b1 | 0.37 | 0.16 |
| b2 | 0.23 | 0.24 |
```

Answer the following questions.

**1\. What is $P(A=a2, B=b2)$?**

**$P(A=a2, B=b2)$ = 0.24**

**2\. If I observe events from this probability distribution, what is the probability of seeing (a1, b1) then (a2, b2)?**

*P(A = a1, B = b1) * P(A = a2, B = b2) = 0.37 * 0.24 = 0.0888*

**3\. Calculate the marginal probability distribution, $P(A)$.**

```
| a1   | a2   |
|------|------|
| 0.60 | 0.40 |
```

**4\. Calculate the marginal probability distribution, $P(B)$.**

```
|    | P(B) |
|----|------|
| b1 | 0.53 |
| b2 | 0.47 |
```

**5\. Calculate the conditional probability distribution, $P(A|B)$.**

*P(A = a1 | B = b1) = 0.37/0.53 = 0.6981*<br>
*P(A = a2 | B = b1) = 0.16/0.53 = 0.3019*<br>
*P(A = a1 | B = b2) = 0.23/0.47 = 0.4894*<br>
*P(A = a2 | B = b2) = 0.24/0.47 = 0.5106*

**6\. Calculate the conditional probability distribution, $P(B|A)$.**

*P(B = b1 | A = a1) = 0.37/0.60 = 0.6167* <br>
*P(B = b2 | A = a1) = 0.23/0.60 = 0.3833* <br>
*P(B = b1 | A = a2) = 0.16/0.40 = 0.4000* <br>
*P(B = b2 | A = a2) = 0.24/0.40 = 0.6000*
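
The marginal and conditional calculations in questions 3 through 6 can be checked with a few lines of plain Python. This is a minimal sketch; the dictionary layout and the helper name `marginal` are choices made here for illustration, not part of the lab.

```python
# Joint distribution P(A, B) from the table above, keyed by (a, b).
joint = {
    ("a1", "b1"): 0.37, ("a2", "b1"): 0.16,
    ("a1", "b2"): 0.23, ("a2", "b2"): 0.24,
}

def marginal(joint, axis):
    """Sum the joint over the other variable: axis 0 gives P(A), axis 1 gives P(B)."""
    result = {}
    for key, p in joint.items():
        result[key[axis]] = result.get(key[axis], 0.0) + p
    return result

p_a = marginal(joint, 0)
p_b = marginal(joint, 1)

# A conditional is each joint cell divided by the marginal of the conditioning variable.
p_a_given_b = {(a, b): p / p_b[b] for (a, b), p in joint.items()}
p_b_given_a = {(b, a): p / p_a[a] for (a, b), p in joint.items()}

for (a, b), p in sorted(p_a_given_b.items()):
    print(f"P(A={a} | B={b}) = {p:.4f}")
for (b, a), p in sorted(p_b_given_a.items()):
    print(f"P(B={b} | A={a}) = {p:.4f}")
```

Each conditional distribution sums to 1 over its first variable, which is a quick sanity check on hand calculations.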

**7\. Does $P(A|B) = P(B|A)$? What do we call the belief that these are always equal?**

*From this joint probability, P(A|B) != P(B|A)* <br>
The belief that these are always equal is called the **Inverse Probability Fallacy**.

**8\. Does $P(A) = P(A|B)$? What does that mean about the independence of $A$ and $B$?**

*From this joint probability, P(A) != P(A|B)* <br>
This means that A and B are not independent
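
An equivalent independence test compares every joint cell with the product of its marginals: $A$ and $B$ are independent only if $P(A, B) = P(A)P(B)$ holds in all cells. A short sketch using the values from the tables above (the tolerance `1e-9` is an arbitrary choice):

```python
joint = {
    ("a1", "b1"): 0.37, ("a2", "b1"): 0.16,
    ("a1", "b2"): 0.23, ("a2", "b2"): 0.24,
}
p_a = {"a1": 0.60, "a2": 0.40}
p_b = {"b1": 0.53, "b2": 0.47}

# Compare each joint cell with the product of the matching marginals.
for (a, b), p in joint.items():
    print(f"P({a}, {b}) = {p:.4f}   P({a})P({b}) = {p_a[a] * p_b[b]:.4f}")

independent = all(abs(p - p_a[a] * p_b[b]) < 1e-9 for (a, b), p in joint.items())
print("independent:", independent)
```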

**9\. Using $P(A)$, $P(B|A)$, $P(B)$ from above, calculate,**

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

Does it match your previous calculation for $P(A|B)$?

Let's consider A = a2, B = b2 as follows <br>
$P(A = a2|B = b2) = \frac{P(B = b2|A = a2)P(A = a2)}{P(B = b2)}$ <br>

$P(A = a2|B = b2) = \frac{0.6000 * 0.4000}{0.4700} = 0.5106$ <br>

**Yes, it matches**
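
The same check can be run for all four cells, not only $(a2, b2)$. The sketch below hard-codes $P(A)$, $P(B)$ and $P(B|A)$ from the earlier questions; the function name `bayes` is illustrative.

```python
p_a = {"a1": 0.60, "a2": 0.40}
p_b = {"b1": 0.53, "b2": 0.47}
# P(B | A), keyed by (b, a).
p_b_given_a = {
    ("b1", "a1"): 0.37 / 0.60, ("b2", "a1"): 0.23 / 0.60,
    ("b1", "a2"): 0.16 / 0.40, ("b2", "a2"): 0.24 / 0.40,
}

def bayes(a, b):
    """P(A=a | B=b) = P(B=b | A=a) * P(A=a) / P(B=b)."""
    return p_b_given_a[(b, a)] * p_a[a] / p_b[b]

for b in p_b:
    for a in p_a:
        print(f"P(A={a} | B={b}) = {bayes(a, b):.4f}")
```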


**10\. If we let A = H (some condition, characteristic, hypothesis) and B = D (some data, evidence, a test result), then how do we interpret each of the following: $P(H)$, $P(D)$, $P(H|D)$, $P(D|H)$, $P(H, D)$?**

**P(H)** = prior probability - how probable the hypothesis is before any data are observed. <br>
**P(D)** = normalizer (evidence) - the overall probability of observing the data, whether or not the hypothesis holds. <br>
**P(H|D)** = posterior probability - how probable the hypothesis is once the data have been observed. <br>
**P(D|H)** = likelihood - how probable the data are if the hypothesis is true. <br>
**P(H, D)** = joint probability - the probability that the hypothesis is true and the data are observed together.

## Bayes Rule

Bayes Rule will be an important part of our toolset in the weeks to come, especially in terms of Bayesian Inference. Work through the following problems in Bayes Rule.

### Problem 1 (Regular)

You might be interested in the relationship between alcoholism and liver disease, in which case “being an alcoholic” (or not) is a test (evidence) for liver disease (or not).

Let `D` be the presence or absence of liver disease (`d` they have it; `~d`, "not d", they don't). Past data tells you that 10% of patients entering your clinic have liver disease. Let `A` be alcoholic (`a`) or not alcoholic (`~a`). 5% of the clinic’s patients are alcoholics.

You know that among those patients diagnosed with liver disease, 7% are alcoholics and among those without liver disease, 95.2% are non-alcoholics.

1. What is Bayes Rule for this problem? (write it out symbolically)
2. From the above word problem, what values of Bayes Rule do you have? Which ones are missing?
3. Calculate the missing values.
4. Calculate the posterior probability *distributions* using Bayes Rule.
5. Describe what each individual posterior probability means *in words*.

1. Bayes rule for this problem: <br>
$P(D|A) = \frac{P(A|D)P(D)}{P(A)}$ <br>

2. What we have: <br>
$P(A = a|D = d) = 0.070$ <br>
$P(A = \sim a|D = \sim d) = 0.952$ <br>
$P(D = d) = 0.100$ <br>
$P(A = a) = 0.050$ <br>

3. Calculate missing values: <br>
$P(A = \sim a|D = d) = 1.000 - 0.070 = 0.930$ <br>
$P(A = a|D = \sim d) = 1.000 - 0.952 = 0.048$ <br>
$P(D = \sim d) = 1.000 - 0.100 = 0.900$ <br>
$P(A = \sim a) = 1.000 - 0.050 = 0.950$ <br>

4. Calculate the posterior probability *distributions* using Bayes Rule: <br>
$P(d | a) = \frac{P(a | d)P(d)}{P(a)} = \frac{0.070 * 0.100}{0.050} = 0.140$<br>
$P(\sim d | a) = \frac{P(a | \sim d)P(\sim d)}{P(a)} = \frac{0.048 * 0.900}{0.050} = 0.864$<br>
$P(d | \sim a) = \frac{P(\sim a | d)P(d)}{P(\sim a)} = \frac{0.930 * 0.100}{0.950} = 0.098$<br>
$P(\sim d | \sim a) = \frac{P(\sim a | \sim d)P(\sim d)}{P(\sim a)} = \frac{0.952 * 0.900}{0.950} = 0.902$<br>

5. In words: <br>
Given that a patient is alcoholic, the patient has a 14% chance of having liver disease. <br>
Given that a patient is alcoholic, the patient has an 86.4% chance of not having liver disease. <br>
Given that a patient is non-alcoholic, the patient has a 9.8% chance of having liver disease. <br>
Given that a patient is non-alcoholic, the patient has a 90.2% chance of not having liver disease. <br>
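
The stated figures are slightly inconsistent with one another: Total Probability applied to the prior and the likelihoods implies $P(a) = 0.0502$ rather than the stated 0.05. As a result, the two posteriors given `a` sum to 1.004 when the stated value is used as the normalizer. The sketch below recomputes the normalizer from the likelihoods so that the posterior distribution sums to 1. The variable names are illustrative.

```python
# Values stated in the problem.
p_d = 0.10             # P(d): prior probability of liver disease
p_a_given_d = 0.07     # P(a | d)
p_na_given_nd = 0.952  # P(~a | ~d)

# Complements.
p_nd = 1 - p_d
p_a_given_nd = 1 - p_na_given_nd

# Total Probability gives the normalizer implied by the prior and likelihoods.
p_a = p_a_given_d * p_d + p_a_given_nd * p_nd

# Bayes Rule with the implied normalizer.
p_d_given_a = p_a_given_d * p_d / p_a
p_nd_given_a = p_a_given_nd * p_nd / p_a

print(f"implied P(a) = {p_a:.4f}")
print(f"P(d | a)  = {p_d_given_a:.4f}")
print(f"P(~d | a) = {p_nd_given_a:.4f}")
```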

### Problem 2 (Harder)

In a particular pain clinic, 10% of patients are prescribed narcotic pain killers. Overall, five percent of the clinic’s patients are addicted to narcotics (including pain killers and illegal substances). Out of all the people prescribed pain pills, 8% are addicts. What is the relationship between addiction and pain pill prescriptions?

1. What is Bayes Rule for this problem? (write it out symbolically)
2. From the above word problem, what values of Bayes Rule do you have? Which ones are missing?
3. Calculate the missing values.
4. Calculate the posterior probability *distributions* using Bayes Rule.
5. Describe what each individual posterior probability means *in words*.

(Note: this problem is structured slightly differently than usual. You will need to use Total Probability and the Axioms of Probability as well as solving simultaneous equations to get the answer).

Let $A$ be the addiction status, with $a$ = addicted, $\sim a$ = not addicted. <br>
Let $K$ be the narcotic pain killer prescription status, with $k$ = prescribed, $\sim k$ = not prescribed. <br>

1. Bayes rule for this problem: <br>
$P(K|A) = \frac{P(A|K)P(K)}{P(A)}$ <br>

2. What we have: <br>
$P(k) = 0.1$ <br>
$P(a | k) = 0.08$ <br>
$P(a) = 0.05$ <br>

3. Calculate missing values: <br>
$P(\sim k) = 1 - 0.1 = 0.9$ <br>
$P(\sim a) = 1 - 0.05 = 0.95$ <br>
$P(\sim a | k) = 1 - 0.08 = 0.92$ <br>
By Total Probability, $P(a) = P(a | k)P(k) + P(a | \sim k)P(\sim k)$, so <br>
$P(a | \sim k) = \frac{P(a) - P(a | k)P(k)}{P(\sim k)} = \frac{0.05 - 0.08 * 0.1}{0.9} = 0.0467$ <br>
$P(\sim a | \sim k) = 1 - 0.0467 = 0.9533$ <br>

4. Calculate the posterior probability *distributions* using Bayes Rule: <br>
$P(k | a) = \frac{P(a | k)P(k)}{P(a)} = \frac{0.08 * 0.1}{0.05} = 0.16$<br>
$P(\sim k | a) = \frac{P(a | \sim k)P(\sim k)}{P(a)} = \frac{0.0467 * 0.9}{0.05} = 0.84$<br>
$P(k | \sim a) = \frac{P(\sim a | k)P(k)}{P(\sim a)} = \frac{0.92 * 0.1}{0.95} = 0.0968$<br>
$P(\sim k | \sim a) = \frac{P(\sim a | \sim k)P(\sim k)}{P(\sim a)} = \frac{0.9533 * 0.9}{0.95} = 0.903$<br>

5. In words: <br>
Given that a patient is addicted, the patient has a 16% chance of having been prescribed pain killers. <br>
Given that a patient is addicted, the patient has an 84% chance of not having been prescribed pain killers. <br>
Given that a patient is not addicted, the patient has a 9.7% chance of having been prescribed pain killers. <br>
Given that a patient is not addicted, the patient has a 90.3% chance of not having been prescribed pain killers. <br>
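
The one value that cannot be found by taking a complement is $P(a | \sim k)$. Solving the Total Probability equation $P(a) = P(a | k)P(k) + P(a | \sim k)P(\sim k)$ for it, and then applying Bayes Rule, can be sketched as follows. The variable names are illustrative.

```python
# Values stated in the problem.
p_k = 0.10          # P(k): prescribed narcotic pain killers
p_a = 0.05          # P(a): addicted
p_a_given_k = 0.08  # P(a | k)

# Complement, then Total Probability solved for the missing likelihood P(a | ~k).
p_nk = 1 - p_k
p_a_given_nk = (p_a - p_a_given_k * p_k) / p_nk

# Posterior distributions via Bayes Rule.
posterior = {
    "P(k | a)": p_a_given_k * p_k / p_a,
    "P(~k | a)": p_a_given_nk * p_nk / p_a,
    "P(k | ~a)": (1 - p_a_given_k) * p_k / (1 - p_a),
    "P(~k | ~a)": (1 - p_a_given_nk) * p_nk / (1 - p_a),
}
for name, value in posterior.items():
    print(f"{name} = {value:.4f}")
```

Each pair of posteriors that shares a condition sums to 1, which is a useful check that the missing likelihood was derived correctly.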

## Titanic

In [1]:
import pandas as pd
from pandasql import sqldf
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
pysqldf = lambda q: sqldf(q, globals())

Make sure you worked through the Titanic case study. This is a continuation of that analysis. *Feel free to copy code blocks from the case study as you see fit*

We start by loading the data:

In [3]:
titanic = pd.read_csv("https://raw.githubusercontent.com/fundamentals-of-data-science/datasets/master/titanic.csv")

## Conditional Probabilities

1. Calculate $P(survived|parch)$

(Remember...every "calculation" includes discuss/code/discuss. In this case, describe what the conditional probability is, what you expect to see, calculate it, and then discuss the results relative to your hypothesized values).

In [4]:
def summarize_category(series):
    """Return the count and relative frequency of each category, sorted by category value."""
    # Use the public Series.value_counts and pd.concat rather than pandas.core internals.
    counts = series.value_counts()
    frequencies = series.value_counts(normalize=True)
    result = pd.concat([counts, frequencies], axis=1, keys=['Count', 'Frequency'])
    return result.sort_index()

# print(titanic["parch"])
# print(titanic)
# titanic.info()
# passengers = len(titanic)

pd.DataFrame(titanic["parch"].describe())
parch_counts = summarize_category(titanic["parch"])
parch_counts




def conditional_probability(df, target, givens, cell="index"):
    """
    Calculates a simple conditional probability (only one target variable),
    P(target|givens...), based off of:
    https://stackoverflow.com/questions/54040923/change-order-of-pandas-multiindex

    df: the DataFrame to use for the calculation
    target: the string name of the target variable
    givens: a string or List of strings that represent the "givens"
    cell: a column that is neither target nor givens to "count". Should be a column
          without missing values. The default assumes you have added a column:
          df["index"] = df.index to your DataFrame.
    """
    if isinstance(givens, str):
        givens = [givens]
    print(f"P({target}|{', '.join(givens)})")
    columns = [target] + givens
    # handling multiple targets would require a more sophisticated join.
    result = (df.groupby(columns).count() / df.groupby(givens).count())[cell]
    # this makes sure the target is always the column
    result = result.reorder_levels(givens + [target]).sort_index()
    # this flattens the hierarchical index and should fill in missing values.
    result = result.unstack(fill_value=0.0)
    return pd.DataFrame(result)

print(conditional_probability(titanic, "survived", ["parch"], "name"))




P(survived|parch)
survived       0.0       1.0
parch                       
0.0       0.664671  0.335329
1.0       0.411765  0.588235
2.0       0.495575  0.504425
3.0       0.375000  0.625000
4.0       0.833333  0.166667
5.0       0.833333  0.166667
6.0       1.000000  0.000000
9.0       1.000000  0.000000
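
As a cross-check on a table like this one, `pd.crosstab` with `normalize="index"` produces the same kind of row-normalized result in a single call. The sketch below runs on a small made-up DataFrame (not the Titanic file) so that the arithmetic is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the Titanic file).
toy = pd.DataFrame({
    "parch":    [0, 0, 0, 0, 1, 1, 2, 2],
    "survived": [0, 0, 0, 1, 1, 0, 1, 1],
})

# normalize="index" divides each row by its row total,
# so each row is the distribution P(survived | parch).
table = pd.crosstab(toy["parch"], toy["survived"], normalize="index")
print(table)
```

Rows with very few passengers, such as `parch` values of 6 or 9 in the real data, give probabilities of exactly 0 or 1 and deserve little weight.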


2. Calculate $P(survived|fare)$

In [5]:
pd.DataFrame(titanic["ticket"].describe())
print(conditional_probability(titanic, "survived", ["ticket"], "name"))


P(survived|ticket)
survived          0.0       1.0
ticket                         
110152       0.000000  1.000000
110413       0.333333  0.666667
110465       1.000000  0.000000
110469       1.000000  0.000000
110489       1.000000  0.000000
...               ...       ...
W./C. 6608   1.000000  0.000000
W./C. 6609   1.000000  0.000000
W.E.P. 5734  0.500000  0.500000
W/C 14208    1.000000  0.000000
WE/P 5735    0.500000  0.500000

[929 rows x 2 columns]
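
Conditioning on a column with many distinct values, such as the 929 tickets above, leaves many groups with only a handful of passengers, so the estimates are mostly 0, 0.5 or 1. For a continuous variable such as `fare`, one common option is to bin the values before conditioning. A minimal sketch, using made-up toy fares rather than the Titanic file; the column name `fare_band` and the band labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the Titanic file).
toy = pd.DataFrame({
    "fare":     [7.25, 8.05, 13.0, 26.0, 30.0, 52.0, 80.0, 151.55],
    "survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Split fares into four equal-sized groups (quartiles) with readable labels.
toy["fare_band"] = pd.qcut(
    toy["fare"], q=4, labels=["low", "mid-low", "mid-high", "high"]
)

# Each row is P(survived | fare_band).
by_band = pd.crosstab(toy["fare_band"], toy["survived"], normalize="index")
print(by_band)
```

The number of bins is a judgment call: fewer bins give steadier estimates, while more bins show finer structure at the cost of smaller groups.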


## Naive Bayes Classifier

In [6]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

1. Calculate the Naive Bayes Classifier for $P(survived|pclass, sex, decade, parch, sibsp)$ and make 5 predictions.

(Remember...discuss/code/discuss. This is especially true for the predictions...when you make up each passenger, do you expect them to survive or perish?)

In [7]:

titanic["decade"] = (titanic["age"] // 10) * 10

clf = CategoricalNB()
encoder = OrdinalEncoder()
with_age = titanic[titanic["age"].notnull()]


encoder.fit(with_age[["pclass", "sex", "decade", "parch", "sibsp"]])
clf.fit(encoder.transform(with_age[["pclass", "sex", "decade", "parch", "sibsp"]]), with_age["survived"])
encoder.categories_


# Prediction 1: 
# classify whether a first class, female, 20 years old, 0 parch, 0 siblings will survive
clf.predict(encoder.transform([(1, 'female', 20, 0, 0)]))
clf.predict_proba(encoder.transform([(1, 'female', 20, 0, 0)]))
# conclusion, she survives with a raw probability of survival = 0.81251143

# Prediction 2: 
# classify whether a first class, male, 40 years old, 0 parch, 0 siblings will survive
clf.predict(encoder.transform([(1, 'male', 40, 0, 0)]))
clf.predict_proba(encoder.transform([(1, 'male', 40, 0, 0)]))
# conclusion, he perishes with raw probability of survival = 0.28578456

# Prediction 3: 
# classify whether a second class, male, 40 years old, 0 parch, 0 siblings will survive
clf.predict(encoder.transform([(2, 'male', 40, 0, 0)]))
clf.predict_proba(encoder.transform([(2, 'male', 40, 0, 0)]))
# conclusion, he perishes with raw probability of survival = 0.15285209

# Prediction 4: 
# classify whether a first class, male, 10 years old, 1 parch, 0 siblings will survive
clf.predict(encoder.transform([(1, 'male', 10, 1, 0)]))
clf.predict_proba(encoder.transform([(1, 'male', 10, 1, 0)]))
# conclusion, he survives with raw probability of survival = 0.52385029

# Prediction 5: 
# classify whether a first class, female, 10 years old, 1 parch, 0 siblings will survive
clf.predict(encoder.transform([(1, 'female', 10, 1, 0)]))
clf.predict_proba(encoder.transform([(1, 'female', 10, 1, 0)]))
# conclusion, she survives with raw probability of survival = 0.9275582


# survived_pclass_and_sex = conditional_probability(titanic, "survived", ["pclass", "sex", "decade", "parch", "sibsp"], "name")
# survived_pclass_and_sex



array([[0.0724418, 0.9275582]])
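
To make the classifier's arithmetic concrete, here is a minimal pure-Python sketch of categorical Naive Bayes with additive (Laplace) smoothing, in the spirit of `CategoricalNB` with its default `alpha=1.0`. The toy rows and the helper names `fit` and `predict_proba` are made up for illustration; this is not scikit-learn's implementation and it does not use the Titanic data.

```python
from collections import Counter, defaultdict

# Hypothetical toy passengers: ((pclass, sex), survived). Made up for illustration.
rows = [
    ((1, "female"), 1), ((1, "female"), 1), ((1, "male"), 0), ((1, "male"), 1),
    ((3, "female"), 1), ((3, "female"), 0), ((3, "male"), 0), ((3, "male"), 0),
]

def fit(rows, alpha=1.0):
    """Count classes and per-class feature values; alpha is the smoothing constant."""
    class_counts = Counter(label for _, label in rows)
    n_features = len(rows[0][0])
    # feature_counts[i][label][value] = rows with that label whose feature i equals value
    feature_counts = [defaultdict(Counter) for _ in range(n_features)]
    categories = [set() for _ in range(n_features)]
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[i][label][value] += 1
            categories[i].add(value)
    return class_counts, feature_counts, categories, alpha

def predict_proba(model, features):
    """Return P(class | features) under the naive independence assumption."""
    class_counts, feature_counts, categories, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(features):
            count = feature_counts[i][label][value]
            # smoothed estimate of P(feature_i = value | class)
            score *= (count + alpha) / (n_label + alpha * len(categories[i]))
        scores[label] = score
    norm = sum(scores.values())  # plays the role of P(features)
    return {label: s / norm for label, s in scores.items()}

model = fit(rows)
proba = predict_proba(model, (1, "female"))
print(proba)
```

The smoothing term keeps a single unseen feature value from forcing a class probability to zero, which matters for sparse combinations such as large `parch` or `sibsp` values.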