<a href="https://colab.research.google.com/github/fundamentals-of-data-science/course-materials/blob/master/labs/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 Lab - Probability

## General Instructions

In this course, Labs are your chance to apply the concepts and methods discussed in the module.
They are a low-stakes (pass/fail) opportunity for you to try your hand at *doing*.
Please make sure you follow the general Lab instructions, described in the Syllabus.
The summary is:

* Discussions should start as students work through the material, beginning on the Wednesday that starts the new Module week.
* Labs are due by **Sunday**.
* Lab solutions are released Monday.  
* On **Monday**, post your Self Evaluation and Lab to your Lab Group on Blackboard, and post your Lab to the Module on Blackboard.

The last part is important because the Problem Sets will require you to perform the same or similar tasks without guidance.
Problem Sets are your opportunity to demonstrate that you understand how to apply the concepts and methods discussed in the relevant Modules and Labs.

## Specific Instructions

1. For Blackboard submissions, if there are no accompanying files, submit *only* your notebook, named using *only* your JHED id: for example, `fsmith79.ipynb` if your JHED id were "fsmith79". If the assignment requires additional files, name the *folder/directory* with your JHED id, put all items in that folder/directory, ZIP it up (only ZIP, no other compression), and submit it to Blackboard.

    * Do **not** use absolute paths in your notebooks. All resources should be located in the same directory as the rest of your assignments.
    * The directory **must** be named your JHED id and **only** your JHED id.
    * Do **not** return files provided by us (data files, .py files).


2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.

3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.

4. Submit a cleanly executed notebook. The first code cell should say `In [1]` and each successive code cell should increase by 1 throughout the notebook.

In [1]:
from pprint import pprint

## Manipulating and Interpreting Probability

Given the following *joint probability distribution*, $P(A,B)$, for $A$ and $B$,

```
|    | a1   | a2   |
|----|------|------|
| b1 | 0.37 | 0.16 |
| b2 | 0.23 | ?    |
```

Answer the following questions.
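Before starting, it may help to see the mechanics on a different table. The sketch below is a minimal illustration, assuming a made-up joint distribution $P(X, Y)$ stored as a dictionary; the variable names, the helper functions `marginal` and `conditional_x_given_y`, and all numbers are hypothetical and are not the lab's values. It shows that a marginal is a sum over the other variable and that a conditional is the joint divided by the marginal of the conditioning variable.

```python
# Hypothetical joint distribution P(X, Y); these numbers are made up
# and are NOT the lab's table.
joint = {
    ("x1", "y1"): 0.10,
    ("x2", "y1"): 0.20,
    ("x1", "y2"): 0.30,
    ("x2", "y2"): 0.40,
}


def marginal(joint, position):
    """Sum the joint over the other variable (position 0 -> X, 1 -> Y)."""
    result = {}
    for outcome, p in joint.items():
        key = outcome[position]
        result[key] = result.get(key, 0.0) + p
    return result


def conditional_x_given_y(joint):
    """P(X | Y) = P(X, Y) / P(Y), stored as {y: {x: probability}}."""
    p_y = marginal(joint, 1)
    table = {}
    for (x, y), p in joint.items():
        table.setdefault(y, {})[x] = p / p_y[y]
    return table


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
p_x_given_y = conditional_x_given_y(joint)

print(p_x)
print(p_y)
print(p_x_given_y)
```

Each inner dictionary of `p_x_given_y` is its own probability distribution, so it should sum to 1. That is a quick sanity check worth repeating on your own answers.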

**1\. What is $P(A=a2, B=b2)$?**

*answer here. You can type each answer as a combination of Markdown and/or code cells as you see fit. Add more cells if you like.*

**2\. If I observe events from this probability distribution, what is the probability of seeing (a1, b1) then (a2, b2)?**

*answer here*

**3\. Calculate the marginal probability distribution, $P(A)$.**

*answer here*

**4\. Calculate the marginal probability distribution, $P(B)$.**

*answer here*

**5\. Calculate the conditional probability distribution, $P(A|B)$.**

*answer here*

**6\. Calculate the conditional probability distribution, $P(B|A)$.**

*answer here*

**7\. Does $P(A|B) = P(B|A)$? What do we call the belief that these are always equal?**

*answer here*

**8\. Does $P(A) = P(A|B)$? What does that mean about the independence of $A$ and $B$?**

*answer here*

**9\. Using $P(A)$, $P(B|A)$, $P(B)$ from above, calculate,**

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

Does it match your previous calculation for $P(A|B)$?
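As a reminder of where the formula comes from, here is a short derivation sketch. It uses only the definition of conditional probability, which lets the joint probability be factored in two ways:

$$
\begin{aligned}
P(A, B) &= P(A|B)\,P(B) \\
P(A, B) &= P(B|A)\,P(A) \\
\Rightarrow\quad P(A|B)\,P(B) &= P(B|A)\,P(A) \\
\Rightarrow\quad P(A|B) &= \frac{P(B|A)\,P(A)}{P(B)} \qquad \text{for } P(B) > 0
\end{aligned}
$$

The equality is applied value by value, so each pair such as $(a1, b1)$ gets its own calculation.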

*answer here*

**10\. If we let A = H (some condition, characteristic, hypothesis) and B = D (some data, evidence, a test result), then how do we interpret each of the following: $P(H)$, $P(D)$, $P(H|D)$, $P(D|H)$, $P(H, D)$?**

*answer here*

## Bayes Rule

Bayes Rule will be an important part of our toolset in the weeks to come, especially in terms of Bayesian Inference. Work through the following problems in Bayes Rule.

### Problem 1 (Regular)

You might be interested in the relationship between alcoholism and liver disease, in which case “being an alcoholic” (or not) is a test for (evidence of) liver disease (or not).

Let `D` be the presence or absence of liver disease (`d` they have it; `~d`, "not d", they don't). Past data tells you that 10% of patients entering your clinic have liver disease. Let `A` be alcoholic (`a`) or not alcoholic (`~a`). 5% of the clinic’s patients are alcoholics.

You know that among those patients diagnosed with liver disease, 7% are alcoholics and among those without liver disease, 95.2% are non-alcoholics.

1. What is Bayes Rule for this problem? (write it out symbolically)
2. From the above word problem, what values of Bayes Rule do you have? Which ones are missing?
3. Calculate the missing values.
4. Calculate the posterior probability *distributions* using Bayes Rule.
5. Describe what each individual posterior probability means *in words*.

*answer here* (again, either use Markdown here or change to a code cell if you want to program the answer. Use Markdown cells for comments. I suggest calculating the missing information then using the code from Fundamentals to calculate the answers.)
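If you decide to program the answer, the following is a minimal sketch of Bayes Rule for a binary hypothesis. The function `posterior` is a hypothetical helper written for this illustration (it is not the code from Fundamentals), and the input numbers are made up and unrelated to the clinic data. The normalizer $P(d)$ is obtained from Total Probability.

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """Return P(h | d) and P(~h | d) for a binary hypothesis via Bayes Rule."""
    # Total Probability gives the normalizer P(d).
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1.0 - prior_h)
    p_h_given_d = p_d_given_h * prior_h / p_d
    # The two posteriors form a distribution, so they sum to 1.
    return {"h": p_h_given_d, "~h": 1.0 - p_h_given_d}


# Hypothetical inputs, unrelated to the clinic data in the problem.
result = posterior(prior_h=0.2, p_d_given_h=0.9, p_d_given_not_h=0.3)
print(result)
```

Calling the helper once per observed value of the evidence (`d` and `~d`) would give the full set of posterior distributions rather than a single number.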

### Problem 2 (Harder)

In a particular pain clinic, 10% of patients are prescribed narcotic pain killers. Overall, five percent of the clinic’s patients are addicted to narcotics (including pain killers and illegal substances). Out of all the people prescribed pain pills, 8% are addicts. What is the relationship between addiction and pain pill prescriptions?

1. What is Bayes Rule for this problem? (write it out symbolically)
2. From the above word problem, what values of Bayes Rule do you have? Which ones are missing?
3. Calculate the missing values.
4. Calculate the posterior probability *distributions* using Bayes Rule.
5. Describe what each individual posterior probability means *in words*.

(Note: this problem is structured slightly differently than usual. You will need to use Total Probability and the Axioms of Probability, and solve simultaneous equations, to get the answer.)
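One way to read this note, sketched for two generic binary variables $X$ and $Y$ (the symbols are placeholders; mapping them onto the problem is left to you): if $P(x)$, $P(y)$, and $P(x|y)$ are known, Total Probability can be rearranged to recover the missing conditional, and the Axioms of Probability supply the complements.

$$
\begin{aligned}
P(x) &= P(x|y)\,P(y) + P(x|\lnot y)\,P(\lnot y) \\
P(\lnot y) &= 1 - P(y) \\
\Rightarrow\quad P(x|\lnot y) &= \frac{P(x) - P(x|y)\,P(y)}{1 - P(y)} \\
P(\lnot x|y) &= 1 - P(x|y), \qquad P(\lnot x|\lnot y) = 1 - P(x|\lnot y)
\end{aligned}
$$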

*answer here*

## Titanic

In [None]:
import pandas as pd
from pandasql import sqldf
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
pysqldf = lambda q: sqldf(q, globals())

Make sure you have worked through the Titanic case study; this lab continues that analysis. *Feel free to copy code blocks from the case study as you see fit.*

We start by loading the data:

In [None]:
titanic = pd.read_csv("https://raw.githubusercontent.com/fundamentals-of-data-science/datasets/master/titanic.csv")

## Conditional Probabilities

1. Calculate $P(survived|parch)$

(Remember...every "calculation" includes discuss/code/discuss. In this case, describe what the conditional probability is, what you expect to see, calculate it, and then discuss the results relative to your hypothesized values).

2. Calculate $P(survived|fare)$
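The two calculations above can be sketched with `pd.crosstab`, shown here on a tiny made-up DataFrame rather than the Titanic data. The frame `toy`, the bin edges, and the band labels are all assumptions chosen for illustration. A discrete variable such as `parch` can be conditioned on directly, whereas a continuous variable such as `fare` usually needs to be binned first; how to bin it is a modeling decision you should discuss.

```python
import pandas as pd

# Made-up rows for illustration only; this is NOT the Titanic data.
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 0, 1, 1, 0],
    "parch":    [0, 0, 0, 1, 1, 1, 2, 2],
    "fare":     [7.25, 8.05, 71.28, 26.0, 9.5, 30.0, 120.0, 15.0],
})

# P(survived | parch): normalize="index" makes each row (one parch value) sum to 1.
p_survived_given_parch = pd.crosstab(toy["parch"], toy["survived"], normalize="index")

# fare is continuous, so bin it first; these bin edges are arbitrary assumptions.
fare_band = pd.cut(toy["fare"], bins=[0, 10, 50, 600], labels=["low", "mid", "high"])
p_survived_given_fare = pd.crosstab(fare_band, toy["survived"], normalize="index")

print(p_survived_given_parch)
print(p_survived_given_fare)
```

Rows with only a handful of observations produce unstable estimates, so it is worth reporting the counts alongside the proportions.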

## Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

1. Calculate the Naive Bayes Classifier for $P(survived|pclass, sex, decade, parch, sibsp)$ and make 5 predictions.

(Remember...discuss/code/discuss. This is especially true for the predictions...when you make up each passenger, do you expect them to survive or perish?)
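As a starting point, here is a minimal sketch of how `OrdinalEncoder` and `CategoricalNB` fit together, using a tiny made-up training set with only two features. The rows, the feature subset, and the example passenger are assumptions for illustration; the lab itself asks for five features and the real Titanic data.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Made-up training rows for illustration only; this is NOT the Titanic data.
features = pd.DataFrame({
    "pclass": ["1", "1", "3", "3", "2", "3"],
    "sex": ["female", "male", "male", "female", "female", "male"],
})
survived = [1, 0, 0, 1, 1, 0]

# CategoricalNB expects non-negative integer codes, so encode the labels first.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(features)

model = CategoricalNB()  # default alpha=1.0 applies Laplace smoothing
model.fit(encoded, survived)

# A made-up passenger; every category must already appear in the training rows.
passenger = pd.DataFrame({"pclass": ["3"], "sex": ["female"]})
passenger_encoded = encoder.transform(passenger)
prediction = model.predict(passenger_encoded)
probabilities = model.predict_proba(passenger_encoded)

print(prediction)
print(probabilities)
```

Reusing the fitted `encoder` for new passengers matters: fitting a fresh encoder on the prediction rows could assign different integer codes to the same categories.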