In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("06-exercise-pids2024.ipynb")

# Exercise sheet 6
**Hello everyone!**

# Points: 15

Topics of this exercise sheet are:
* Working with probability distributions


Please let us know if you have questions or problems! <br>
Contact us during the exercise session or on [Piazza](https://piazza.com/unibas.ch/spring2024/63982).

**Automatic Feedback**

This notebook can be graded automatically using Otter grader. To find out how many points you get, run `grader.check_all()` in a new cell.

## Introduction:
This exercise is designed to help you become familiar with the fundamental concepts of probability and statistics. To ensure that you have a strong grasp of these concepts, we recommend reading this article:
https://towardsdatascience.com/an-in-depth-crash-course-on-random-variables-a3905d03e322

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

## Question 1 (6 points)

### Binomial distribution:
The binomial distribution with parameters $n$ and $p$ is the discrete probability distribution of <b>the number of successes in a sequence of $n$ independent experiments</b>, each asking a yes–no question and each with a Boolean-valued outcome: success (with probability $p$) or failure (with probability $1 - p$). Let $X$ denote a random variable with a binomial distribution. The probability of observing exactly $x$ successes in $n$ trials is:
$$
    p(x) = \binom{n}{x}p^x (1-p)^{n-x} 
$$
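
As a minimal sketch, the formula above can be evaluated by hand with the standard library. The helper name `binomial_pmf` and the values `n_demo = 4`, `p_demo = 0.5` are illustrative assumptions, not part of the exercise.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Illustrative parameters only (not the ones used in this exercise)
n_demo, p_demo = 4, 0.5
pmf_values = [binomial_pmf(k, n_demo, p_demo) for k in range(n_demo + 1)]

print(pmf_values)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf_values))  # 1.0
```

The probabilities over all possible outcomes $0, \dots, n$ add up to 1, as they must for any probability distribution.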

In [None]:
n = 20 # The total number of trials
p = 0.5 # The probability of success in each trial
X = stats.binom(n, p)
# Please check this page for more information: 
# https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.binom.html#scipy.stats.binom

### 1a) (1 point)
Make a visual representation of the binomial distribution.
To do that, generate 10,000 independent realizations of the random variable $X$. These 10,000 samples follow a binomial distribution with parameters $p=0.5$ and $n=20$.
**Save the samples in a variable named 'x'**.

In other words, this is equivalent to running 10,000 independent experiments, where each experiment consists of flipping 20 fair coins and counting how many of them land on 'heads' (or, equivalently, on 'tails').
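
The coin-flipping picture can also be simulated directly, without a distribution object. The sketch below is only an illustration: the generator `rng`, the number of experiments, and the encoding of heads as 1 are assumptions, and it is not the `X.rvs(...)` approach that the exercise asks for.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (illustrative choice)
n_experiments, n_coins = 1000, 20

# Each row is one experiment; 1 encodes 'heads' and 0 encodes 'tails'
flips = rng.integers(0, 2, size=(n_experiments, n_coins))

# Counting the heads in each row gives one binomial sample per experiment
heads_per_experiment = flips.sum(axis=1)

print(heads_per_experiment[:10])
print(heads_per_experiment.mean())
```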

Plot a histogram of the generated samples.

Hint: You can use "X.rvs(...)" to generate independent samples from the random variable $X$.

In [None]:
class Question1a:
    n_samples=10000
    np.random.seed(0)
    
    x = ...
    ...

In [None]:
grader.check("Question 1a")

### 1b) (1 point)
Calculate the probabilities $\text{P}(X = 8)$, $\text{P}(X = 10)$ and $\text{P}(X = 12)$ and store them in the variables 'p8', 'p10', and 'p12', respectively.

Hint: Use X.pmf()

In [None]:
class Question1b:
    p8 = ...
    p10 = ...
    p12 = ...
    print('P(X=8)={:.4f} \nP(X=10)={:.4f} \nP(X=12)={:.4f}'.format(p8, p10, p12))

In [None]:
grader.check("Question 1b")

Which of these three values of $X$ has the highest probability? Why?

Your answer:


### 1c) (1 point)
Calculate the probabilities $\text{P}(X \leq 8)$, $\text{P}(X \leq 10)$, $\text{P}(X \leq 12)$ and $\text{P}(X \leq 20)$ and store them in the variables 'p_le8', 'p_le10', 'p_le12' and 'p_le20' respectively.

Hint: Use X.cdf()
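
For a discrete random variable, the cumulative probability $\text{P}(X \leq x)$ is the sum of the point probabilities of all outcomes up to and including $x$. The sketch below illustrates this with hand-written helpers; the names `binomial_pmf` and `binomial_cdf` and the parameters `n_demo = 3`, `p_demo = 0.5` are assumptions made for the illustration.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial distribution."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x): add up the PMF over the outcomes 0, 1, ..., x."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


# Illustrative parameters only (not the ones used in this exercise)
n_demo, p_demo = 3, 0.5
cdf_values = [binomial_cdf(k, n_demo, p_demo) for k in range(n_demo + 1)]

print(cdf_values)  # [0.125, 0.5, 0.875, 1.0]
```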

In [None]:
class Question1c:
    p_le8 = ...
    p_le10 = ...
    p_le12 = ...
    p_le20 = ...
    
    print('P(X<=8)={:.4f}\nP(X<=10)={:.4f}\nP(X<=12)={:.4f}\nP(X<=20)={:.4f}'.format(p_le8, p_le10, p_le12,p_le20))

In [None]:
grader.check("Question 1c")

Is there an increasing trend in the calculated probabilities? Also, can you explain why $\text{P}(X \leq 20)$ takes the largest possible value of 1?

Your answer:

### 1d) (2 points)
Calculate the mean, standard deviation, and median of the samples generated in the variable 'x' (Question 1a) and store them in the variables 'mean_x', 'std_x' and 'median_x', respectively.

Hint: use numpy package!
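
As a small sketch of the NumPy functions involved, the snippet below computes the three summary statistics for a toy array. The array `toy` is a made-up example. Note that `np.std` uses `ddof=0` by default, that is, it divides by the number of samples.

```python
import numpy as np

toy = np.array([1, 2, 3, 4, 10])  # made-up data for illustration

toy_mean = np.mean(toy)      # (1 + 2 + 3 + 4 + 10) / 5
toy_std = np.std(toy)        # population standard deviation (ddof=0)
toy_median = np.median(toy)  # middle value of the sorted array

print(toy_mean)           # 4.0
print(round(toy_std, 4))  # 3.1623
print(toy_median)         # 3.0
```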

In [None]:
class Question1d:
    mean_x = ...
    std_x = ...
    median_x = ...
    
    print('Mean={:.4f} \nStd={:.4f} \nMedian={:.4f}'.format(mean_x, std_x, median_x))

In [None]:
grader.check("Question 1d")

Does the mean value appear to be close to $n \times p$? Can you explain why we would expect this?

Your answer:

## Question 2 (6 points)
### Multimodal distributions:
We will now consider samples from two different binomial distributions with distinct values of $p$.

Let $X_1$ denote a random variable with a binomial distribution with parameters $p_1=0.2$ and $n=20$.
Let $X_2$ denote a random variable with a binomial distribution with parameters $p_2=0.8$ and $n=20$.

In [None]:
n = 20
p1 = 0.2
p2 = 0.8
X1 = stats.binom(n, p1)
X2 = stats.binom(n, p2)

### 2a) (2 points)

Generate 5000 random samples from each of the two binomial distributions and store them in the variables 'x1' and 'x2', respectively. Concatenate the samples to create a unified vector 'x12', and plot a histogram of the concatenated samples.

In [None]:
class Question2a:
    n_samples=5000
    np.random.seed(0)
    
    
    x1 = ...
    x2 = ...
    x12 = ...
    
    ...

In [None]:
grader.check("Question 2a")

### 2b) (4 points)

Calculate the mean, median, and the first two modes of the concatenated samples (from Question 2a), and store them in the variables 'mean_x12', 'median_x12', 'mode1_x12', and 'mode2_x12', respectively.

Hint: You can use np.unique and np.argsort to find the first and the second modes
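
A minimal sketch of the technique from the hint, applied to a made-up array `toy`: `np.unique` with `return_counts=True` returns the distinct values and how often each occurs, and `np.argsort` orders those counts. All variable names are illustrative.

```python
import numpy as np

toy = np.array([1, 2, 2, 3, 3, 3, 7])  # made-up data for illustration

# Distinct values (sorted) and the number of occurrences of each
values, counts = np.unique(toy, return_counts=True)

# argsort sorts in ascending order, so reverse it to get the most frequent first
order = np.argsort(counts)[::-1]

most_common = values[order[0]]
second_most_common = values[order[1]]

print(most_common, second_most_common)  # 3 2
```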

In [None]:
class Question2b:
    mean_x12 = ...
    median_x12 = ...
    
    ...
    
    mode1_x12 = ...
    mode2_x12 = ...
    
    
    print('Mean: {:.2f} \nMedian: {:.2f} \nFirst mode: {:.2f}\nSecond mode: {:.2f}'.format(mean_x12, median_x12, mode1_x12, mode2_x12))

In [None]:
grader.check("Question 2b")

Which of these estimates (mean, median, or modes) best represents this particular type of data? Why?

Your answer:

## Question 3 (3 points)

### Normal distribution:
A normal (Gaussian) distribution is a type of continuous probability distribution characterized by a bell-shaped curve and defined by its mean $\mu$ and standard deviation $\sigma$:
$$
    p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - \mu)^2}{2\sigma^2}}
$$
We let $Y$ denote a random variable following this distribution.
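
As a minimal sketch, the density formula above can be evaluated with the standard library. The helper name `normal_pdf` and the choice of a standard normal ($\mu = 0$, $\sigma = 1$) are assumptions made for the illustration.

```python
from math import exp, pi, sqrt


def normal_pdf(y: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((y - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)


# Standard normal: the density peaks at the mean and is symmetric around it
peak = normal_pdf(0.0, 0.0, 1.0)

print(round(peak, 4))                                           # 0.3989
print(normal_pdf(-1.0, 0.0, 1.0) == normal_pdf(1.0, 0.0, 1.0))  # True
```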

In [None]:
mu = 3.0
sigma = 2.0
Y = stats.norm(loc = mu , scale = sigma)

# Please check this page for more information: 
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html

### 3a) (1 point)
Generate 10,000 independent samples in a variable named 'y'.

Visualize the distribution of the data using a histogram. Add a smooth curve obtained from a kernel density estimate (KDE) to the histogram.

Hint: you can use sns.histplot.

In [None]:
class Question3a:
    n_samples=10000
    np.random.seed(0)
    
    y = ...
    ...

In [None]:
grader.check("Question 3a")

### 3b) (1 point, bonus)
Calculate the probabilities $\text{P}(Y = 2)$, $\text{P}(Y = 3)$ and $\text{P}(Y = 4)$ and store them in the variables 'p2', 'p3', and 'p4', respectively.


In [None]:
class Question3b:
    p2 = ...
    p3 = ...
    p4 = ...
    
    print('P(Y=2)={:.4f} \nP(Y=3)={:.4f} \nP(Y=4)={:.4f}'.format(p2, p3, p4))

In [None]:
grader.check("Question 3b")

### 3c) (1 point)
Calculate the probabilities $\text{P}(Y \leq 2)$, $\text{P}(2 < Y \leq 3)$ and $\text{P}(Y \geq 4)$ and store them in the variables 'p_le2', 'p_gr2_le3', and 'p_gr4', respectively.
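
Probabilities of intervals and upper tails follow from the cumulative distribution function $F(y) = \text{P}(Y \leq y)$: $\text{P}(a < Y \leq b) = F(b) - F(a)$ and $\text{P}(Y > a) = 1 - F(a)$. The sketch below illustrates this for a standard normal distribution; the helper `normal_cdf`, written with the error function, and the chosen bounds are assumptions for the illustration.

```python
from math import erf, sqrt


def normal_cdf(y: float, mu: float, sigma: float) -> float:
    """P(Y <= y) for a normal distribution, expressed via the error function."""
    return 0.5 * (1.0 + erf((y - mu) / (sigma * sqrt(2.0))))


# Standard normal (mu = 0, sigma = 1), not the distribution of this exercise
p_interval = normal_cdf(1.0, 0.0, 1.0) - normal_cdf(-1.0, 0.0, 1.0)  # P(-1 < Y <= 1)
p_upper_tail = 1.0 - normal_cdf(1.0, 0.0, 1.0)                       # P(Y > 1)

print(round(p_interval, 4))    # 0.6827
print(round(p_upper_tail, 4))  # 0.1587
```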


In [None]:
class Question3c:
    p_le2 = ...
    p_gr2_le3 = ...
    p_gr4 = ...
    
    print('P(Y<=2)={:.4f}\nP(2<Y<=3)={:.4f}\nP(Y>=4)={:.4f}'.format(p_le2, p_gr2_le3, p_gr4))

In [None]:
grader.check("Question 3c")

Do the probabilities $\text{P}(Y \leq 2)$ and $\text{P}(Y \geq 4)$ have the same value? What is the reason for your answer?

Your answer:

### 3d) (1 point)

Calculate the mean, standard deviation, and median of the samples generated in the variable 'y' (Question 3a) and store them in the variables 'mean_y', 'std_y' and 'median_y', respectively.

In [None]:
class Question3d:
    mean_y = ...
    std_y = ...
    median_y = ...
    
    print('Mean={:.4f} \nStd={:.4f} \nMedian={:.4f}'.format(mean_y, std_y, median_y))

In [None]:
grader.check("Question 3d")

Are the mean and the standard deviation close to $\mu$ and $\sigma$? What about the median? Why?

Your answer:

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()