<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Custom Discrete Distribution in Python</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/custom-discrete-distributions-in-python/">https://discovery.cs.illinois.edu/microproject/custom-discrete-distributions-in-python/</a></div>
</h1>

<hr style="color: #DD3403;">

## Random Variables and Distributions

In statistics and data science, random variables are used to model events that have uncertain outcomes.  For example, in DISCOVERY, we explore the **binomial distribution** to model flipping a coin, drawing from a deck of cards, guessing on a multiple choice exam, and many other events with a single, fixed probability of success.  However, what if there are multiple different outcomes?  This MicroProject will explore creating custom discrete distributions in Python to model complex events!

In this MicroProject, you will explore a dataset of the final scores of students in DISCOVERY!  Before we get to that, let's nerd out with the basics of a distribution! :)

<hr style="color: #DD3403;">

## Random Variable #1: Modeling Flipping a Coin Twice

In DISCOVERY, we introduce flipping a coin twice as an example binomial distribution.  Create a variable that contains a binomial distribution called `COIN` that models the distribution of the number of heads we see when we flip a coin two times:

(Not sure?  Check out the DISCOVERY page on "Python Functions for Random Distributions" here:
https://discovery.cs.illinois.edu/learn/Polling-Confidence-Intervals-and-Hypothesis-Testing/Python-Functions-for-Random-Distributions/)
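As a hedged illustration of the general pattern (not the answer to this section), the sketch below builds a binomial distribution for a hypothetical experiment of ten coin flips using `scipy.stats.binom`.  The name `TEN_FLIPS` and the choice of ten flips are assumptions made only for this example:

```python
from scipy.stats import binom

# Hypothetical example: number of heads in 10 flips of a fair coin.
# n = number of trials, p = probability of success on each trial.
TEN_FLIPS = binom(n=10, p=0.5)

expected_heads = TEN_FLIPS.mean()   # equals n * p
spread = TEN_FLIPS.std()            # equals sqrt(n * p * (1 - p))
print(expected_heads, spread)
```

The same two arguments, the number of trials and the probability of success, describe any binomial distribution.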

In [None]:
COIN = ...

There are three different outcomes of flipping two coins and counting the number of heads:

| Number of Heads | Probability |
| --------------: | ----------: |
| 0 heads | 25% |
| 1 head | 50% |
| 2 heads | 25% |

The **expected value** of the distribution is the weighted sum of the possible outcomes.  This means we multiply each outcome by its probability and add the results together:

- The outcome of zero heads (0), multiplied by the probability of getting zero heads,
- The outcome of one head (1), multiplied by the probability of getting one head, and
- The outcome of two heads (2), multiplied by the probability of getting two heads.

Mathematically, it's the following equation:

$$EV_{COIN} = ((0\text{ heads}) * 25\%) + ((1\text{ head}) * 50\%) + ((2\text{ heads}) * 25\%)$$


Solving the equation:

- $EV_{COIN} = ((0\text{ heads}) * 25\%) + ((1\text{ head}) * 50\%) + ((2\text{ heads}) * 25\%)$
- $EV_{COIN} = (0\text{ heads}) + (0.5\text{ heads}) + (0.5\text{ heads})$
- $EV_{COIN} = 1\text{ head}$
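
The same arithmetic can be sketched in plain Python.  This is only a hand calculation for checking the math; the list names `outcomes` and `probabilities` are chosen for this example and are not required by the MicroProject:

```python
# Outcomes (number of heads) and their probabilities from the table above.
outcomes = [0, 1, 2]
probabilities = [0.25, 0.50, 0.25]

# Expected value: multiply each outcome by its probability, then add.
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(expected_value)
```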


### Verifying our `COIN` Distribution in Python

Use `COIN.mean()` to verify the expected value in Python:

In [None]:
COIN.mean()

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"
import math
assert("COIN" in vars())
assert(COIN.mean() == 1)
assert(math.isclose(COIN.std(), 2**(1/2)/2))
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #2: The Value of a Dice Roll

A common example in statistics is modeling the outcome of rolling a die.  Unfortunately, each trial in a binomial distribution has only two possible results: a zero (not successful) or a one (successful).  However, a single die has six equally likely outcomes: 1, 2, 3, 4, 5, or 6.

To model this more complex event, we will use a **custom discrete distribution**.

### Requirements of a Custom Discrete Distribution

Similar to the binomial distribution, any custom discrete distribution we create must have three properties:

1. The event we are modeling must have an **outcome that is independent** (it does not matter what happened in the past),

2. The event we are modeling must have **probabilities that do not change** (no external factor changes the probability), **AND**

3. The event we are modeling must have a **finite number of outcomes** (as a counter-example, the normal distribution can have any possible Z-score, like 0.000332, 0.094322, or any number you can imagine; it is not finite.)

### Dice Distribution

The following table describes the distribution of a six-sided die:

| Outcome | Probability |
| ------: | ----------: |
| 1 | 1/6 |
| 2 | 1/6 |
| 3 | 1/6 |
| 4 | 1/6 |
| 5 | 1/6 |
| 6 | 1/6 |

### Creating a Custom Discrete Distribution

In Python, we must provide two parallel lists, similar to the columns of the table above.  One list will contain all the outcomes and the other list will contain the probability of each outcome, in the same order.

For example:

```
outcomes = [ 1, 2, 3, 4, 5, 6 ]
probability = [ 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 ]
```
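
Before building a distribution, it can help to sanity-check the two lists.  The sketch below is an optional check, not part of the required solution; the fact that probabilities should total 1 is a general property of probability distributions rather than something specific to this notebook, and the names `same_length`, `sums_to_one`, and `expected_value` are made up for this example:

```python
import math

outcomes = [1, 2, 3, 4, 5, 6]
probability = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

# The lists are parallel: one probability per outcome, in the same order.
same_length = len(outcomes) == len(probability)

# The probabilities of all outcomes should total 1 (allowing for rounding).
sums_to_one = math.isclose(sum(probability), 1.0)

# Expected value by hand: the weighted sum of the outcomes.
expected_value = sum(x * p for x, p in zip(outcomes, probability))
print(same_length, sums_to_one, expected_value)
```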

Once we have our two lists, `rv_discrete` from `scipy.stats` can be used to make our distribution using the following code:

```
from scipy.stats import rv_discrete
DICE = rv_discrete( values=(outcomes, probability) )
```

Create the `DICE` distribution below:


In [None]:
...

Let's check the expected value:

In [None]:
DICE.mean()

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

import math
assert("DICE" in vars())
assert(DICE.mean() == 3.5)
assert(math.isclose(DICE.std(), 1.707825127659933))
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #3: Customer in a Tea Shop

Let's create a distribution to model a tea shop!  When a customer arrives, we have historical data to suggest the following pattern:

| Description | Outcome | Probability |
| ----------- | ------: | ----------: |
| Customer buys a simple tea | $ 4.49 | 20% |
| Customer buys a bubble tea | $ 5.69 | 40% |
| Customer buys a simple tea and treat | $ 7.69 | 15% |
| Customer buys a bubble tea and treat | $ 8.89 | 15% |
| Customer buys nothing | $ 0.00 | 10% |

Create a custom discrete distribution for this tea shop called `TEA`.  (If you're not sure, re-read the previous section on how to create a custom distribution.)


In [None]:
...

Let's check the expected value (in this case, the amount of money we expect an "average" person to spend):

In [None]:
round(TEA.mean(), 2)

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

import math
assert("TEA" in vars())
assert(math.isclose(TEA.mean(), 5.661))
assert(math.isclose(TEA.std(), 2.379237062589603))
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #4: Final Scores in DISCOVERY

The provided code below sets up the `DISCOVERY` distribution, which includes the **final course score for students at The University of Illinois in Fall 2022**.  The grading of DISCOVERY is out of 1,000 points and any student who earns at least 930 points earns an `A` in the course:

In [None]:
outcomes = [ 1045, 1038, 1036, 1035, 1034, 1033, 1032, 1031, 1030, 1029, 1028, 1027, 1026, 1025, 1023, 1022, 1021, 1020, 1019, 1018, 1017, 1016, 1015, 1014, 1013, 1012, 1011, 1010, 1009, 1008, 1007, 1006, 1005, 1004, 1003, 1002, 1001, 1000, 999, 998, 997, 996, 995, 994, 993, 992, 991, 990, 989, 988, 987, 986, 985, 984, 983, 982, 981, 980, 979, 978, 977, 976, 975, 974, 973, 972, 971, 970, 969, 968, 967, 966, 965, 964, 963, 962, 961, 960, 959, 958, 957, 956, 955, 954, 953, 952, 951, 950, 949, 948, 947, 946, 945, 944, 943, 942, 941, 940, 939, 938, 936, 935, 934, 933, 932, 931, 930, 929, 928, 926, 925, 924, 923, 922, 921, 920, 919, 918, 917, 916, 915, 914, 913, 912, 910, 909, 907, 906, 905, 903, 902, 901, 900, 899, 898, 897, 895, 894, 893, 889, 888, 886, 884, 883, 882, 881, 879, 878, 877, 875, 873, 872, 870, 869, 868, 867, 866, 864, 861, 858, 856, 854, 852, 847, 837, 835, 830, 828, 826, 824, 821, 820, 819, 818, 817, 816, 814, 813, 812, 804, 802, 793, 792, 791, 790, 786, 784, 783, 780, 779, 777, 773, 770, 765, 762, 755, 753, 748, 746, 716, 714, 707, 704, 693, 691, 687, 654, 618, 600, 595, 566, 556, 464, 401, 366 ]
probability = [ 1/574, 3/574, 1/574, 2/574, 2/574, 1/574, 3/574, 1/574, 1/574, 1/574, 3/574, 3/574, 1/574, 1/574, 1/574, 2/574, 4/574, 4/574, 2/574, 6/574, 4/574, 2/574, 2/574, 5/574, 6/574, 5/574, 5/574, 4/574, 3/574, 3/574, 2/574, 4/574, 3/574, 5/574, 7/574, 7/574, 3/574, 3/574, 3/574, 2/574, 4/574, 4/574, 8/574, 5/574, 4/574, 4/574, 5/574, 3/574, 7/574, 4/574, 6/574, 9/574, 5/574, 4/574, 11/574, 3/574, 6/574, 2/574, 1/574, 3/574, 4/574, 5/574, 10/574, 4/574, 4/574, 3/574, 4/574, 2/574, 7/574, 9/574, 6/574, 7/574, 5/574, 2/574, 9/574, 4/574, 4/574, 6/574, 2/574, 7/574, 3/574, 5/574, 5/574, 3/574, 5/574, 4/574, 5/574, 1/574, 4/574, 4/574, 3/574, 1/574, 2/574, 1/574, 5/574, 3/574, 4/574, 3/574, 3/574, 2/574, 1/574, 4/574, 2/574, 2/574, 3/574, 4/574, 2/574, 1/574, 1/574, 7/574, 2/574, 1/574, 3/574, 3/574, 2/574, 2/574, 1/574, 3/574, 1/574, 4/574, 1/574, 1/574, 2/574, 1/574, 1/574, 1/574, 1/574, 4/574, 1/574, 1/574, 2/574, 1/574, 1/574, 3/574, 3/574, 1/574, 1/574, 4/574, 4/574, 1/574, 3/574, 3/574, 1/574, 2/574, 2/574, 3/574, 3/574, 1/574, 3/574, 1/574, 1/574, 1/574, 2/574, 1/574, 1/574, 1/574, 1/574, 3/574, 1/574, 1/574, 2/574, 2/574, 1/574, 1/574, 1/574, 1/574, 2/574, 1/574, 1/574, 1/574, 1/574, 2/574, 1/574, 2/574, 3/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 2/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 3/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 2/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574, 1/574 ]

DISCOVERY = rv_discrete( values=(outcomes, probability) )

### Puzzle #1: Average Score

Using the `DISCOVERY` distribution, what is the average score (or "expected value"), in points, in DISCOVERY?  Store your result in `avg_score`:

In [None]:
avg_score = ...
avg_score

### Puzzle #2: Median Score

Using the `DISCOVERY` distribution, what is the median score (50%-tile), in points, in DISCOVERY?  Store your result in `median_score`:

In [None]:
median_score = ...
median_score

### Puzzle #3: Earning an "A" in DISCOVERY

What percentage of students earned an "A" in DISCOVERY?  Remember, an "A" requires a student to earn at least 930 points.  Store the percentage of people, as a decimal between 0 and 1, in `pct_A`:

(Not sure how to find this?  Check out the DISCOVERY page on "Python Functions for Random Distributions" here:
https://discovery.cs.illinois.edu/learn/Polling-Confidence-Intervals-and-Hypothesis-Testing/Python-Functions-for-Random-Distributions/)
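
With a discrete distribution, "at least" questions need some care about which outcomes are included.  The sketch below uses the six-sided die from earlier rather than the `DISCOVERY` data, and the names `DIE_EXAMPLE`, `p_at_least_5`, and `p_at_least_5_sf` are assumptions for this illustration only:

```python
from scipy.stats import rv_discrete

# The six-sided die from earlier in this MicroProject.
outcomes = [1, 2, 3, 4, 5, 6]
probability = [1/6] * 6
DIE_EXAMPLE = rv_discrete(values=(outcomes, probability))

# cdf(k) is P(X <= k), so P(X >= 5) is everything NOT at or below 4.
p_at_least_5 = 1 - DIE_EXAMPLE.cdf(4)

# sf(k) is the "survival function", P(X > k), so sf(4) is the same quantity.
p_at_least_5_sf = DIE_EXAMPLE.sf(4)
print(p_at_least_5, p_at_least_5_sf)
```

Note that `1 - DIE_EXAMPLE.cdf(5)` would leave out the outcome 5 itself, which is why the cutoff passed to `cdf` is one step below the "at least" value.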

In [None]:
pct_A = ...
pct_A

### Puzzle #4: Being a Part of the Top 10%

How many points do you need to earn so that you would be among the top 10% of the course?  Store the number of points in `top10pct`:
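
One way to approach a "top X%" question is with a percentile lookup.  The sketch below again uses the six-sided die instead of the `DISCOVERY` data; `DIE_EXAMPLE` and `cutoff` are hypothetical names, and reading "top 10%" as the 90th percentile is an assumption of this example:

```python
from scipy.stats import rv_discrete

# The six-sided die from earlier in this MicroProject.
outcomes = [1, 2, 3, 4, 5, 6]
probability = [1/6] * 6
DIE_EXAMPLE = rv_discrete(values=(outcomes, probability))

# ppf(q) returns the smallest outcome x where P(X <= x) is at least q.
# ppf(0.90) therefore marks the 90th percentile of the die.
cutoff = DIE_EXAMPLE.ppf(0.90)
print(cutoff)
```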

In [None]:
top10pct = ...
top10pct

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

import math
assert("DISCOVERY" in vars())
assert("avg_score" in vars())
assert(math.isclose(avg_score, 941.1271777003483))
assert(math.isclose(median_score, 965))
assert(pct_A > 0.50), "There were more than 50% As... are you sure the function is doing what you expect it to be doing?"
assert(math.isclose(pct_A, 1 - 0.2944250871080144))
assert(top10pct > 1000), "To be in the top 10%, you need more than 1000 points!  We love extra credit! :)"
assert(top10pct - 13 == 1000)
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your MicroProject to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject!