# Exercise 1: Basic concepts on random variables and probability distributions

## Introduction
This notebook includes a series of cells with code (Python) and descriptive text (markdown) and provides a number of exercises.
Work through the exercises and add, as appropriate, new cells with your code or descriptive text to answer the questions.
All required data are provided in your exercise folder.

To submit your work, follow the steps '**Before you submit**' and '**How to submit**' in the *intro2notebook.ipynb* file contained in the *ex0_introduction_to_python.zip*.
Please note that this is the same for all exercises.

### Part I: Basic statistics and probability calculation

In [2]:
# First, import the basic Python packages we want to use:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

#### Question I.1

From monthly data (file *P_uppmonth1.dat*): compute for each month (column-wise) mean, median, mode, standard deviation, coefficient of variation, minimum and maximum.
Plot these values (except the mode) to show the annual variation.
Note that the coefficient of variation is dimensionless, unlike the other statistics, and thus needs to be plotted in a separate figure.

In [16]:
# load data
precip_mon = pd.read_table('P_uppmonth1.dat', index_col=0)

In [5]:
precip_mon

year|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec
-|-|-|-|-|-|-|-|-|-|-|-|-
1981|32|26|62|22|23|98|80|163|13|122|144|109
1982|42|30|54|50|192|37|52|77|48|46|74|52
1983|78|10|107|46|36|105|80|25|188|58|36|78
1984|72|29|34|11|29|97|67|51|145|114|51|51
1985|83|37|42|69|23|25|104|50|76|53|74|63
1986|50|12|69|51|77|39|121|234|56|45|54|89
1987|23|28|22|4|55|77|92|122|65|34|54|24
1988|72|64|40|37|41|51|106|173|36|69|38|66
1989|9|28|51|55|38|42|11|68|23|74|55|59
1990|74|79|56|46|30|25|128|42|153|79|60|42
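As a minimal sketch of the column-wise statistics asked for in Question I.1, the block below uses only the Jan and Feb columns copied from the table above. The names `sample`, `stats` and `modes` are illustrative, and the choice of the sample standard deviation (pandas default, `ddof=1`) is an assumption, not something the exercise prescribes.

```python
import pandas as pd

# Jan and Feb columns copied from the table above (1981 to 1990);
# the exercise itself uses all twelve months of precip_mon.
sample = pd.DataFrame(
    {
        "Jan": [32, 42, 78, 72, 83, 50, 23, 72, 9, 74],
        "Feb": [26, 30, 10, 29, 37, 12, 28, 64, 28, 79],
    },
    index=pd.Index(range(1981, 1991), name="year"),
)

# Each method reduces along the rows by default, giving one value per month.
stats = pd.DataFrame(
    {
        "mean": sample.mean(),
        "median": sample.median(),
        "std": sample.std(),  # pandas default is the sample estimate (ddof=1)
        "min": sample.min(),
        "max": sample.max(),
    }
)
stats["cv"] = stats["std"] / stats["mean"]  # dimensionless ratio

# mode() returns a DataFrame because a column can have several modes.
modes = sample.mode()

print(stats.round(2))
print(modes)
```

Calling `stats[["mean", "median", "std", "min", "max"]].plot()` and `stats["cv"].plot()` in separate cells is one way to produce the two figures.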


#### Question I.2

For daily data (file *P_Uppsala.dat*, values in [mm]): Estimate the probabilities that daily precipitation

1. equals zero
2. is more than zero
3. is more than 10 mm
4. is more than 10 mm on a day with precipitation
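The probabilities in this list can be estimated as relative frequencies. The sketch below shows the idea on a short, made-up array (`precip` is hypothetical and is not the Uppsala record); the last estimate is a conditional probability, obtained by restricting the sample to wet days first.

```python
import numpy as np

# Hypothetical daily precipitation in mm (NOT the Uppsala record).
precip = np.array([0.0, 0.0, 2.5, 12.0, 0.0, 0.4, 15.1, 0.0, 0.0, 3.2])

# The mean of a boolean array is the fraction of True values.
p_zero = np.mean(precip == 0)
p_wet = np.mean(precip > 0)
p_gt10 = np.mean(precip > 10)

# Conditional probability: restrict the sample to wet days first.
p_gt10_given_wet = np.mean(precip[precip > 0] > 10)

print(p_zero, p_wet, p_gt10, p_gt10_given_wet)
```

The conditional estimate equals `p_gt10 / p_wet`, which is a quick consistency check.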

#### Question I.3

From the daily data, calculate the maximum 1-day, 3-day and 5-day average rainfall amount, and state the date/period.

*Hint: to compute average over x days, you can check out the pandas DataFrame method [rolling](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html).*
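A minimal sketch of the `rolling` approach, using a made-up series (`rain` and its dates are hypothetical). Note that `rolling` labels each window by its last day, so the start of the period has to be derived from the window length.

```python
import pandas as pd

# Hypothetical daily rainfall in mm with made-up dates.
rain = pd.Series(
    [0.0, 5.0, 20.0, 1.0, 0.0, 12.0, 9.0, 3.0],
    index=pd.date_range("2000-01-01", periods=8, freq="D"),
)

window = 3
rolling_mean = rain.rolling(window).mean()  # first window-1 values are NaN

# rolling() labels each window by its LAST day, so step back to get the start.
end = rolling_mean.idxmax()
start = end - pd.Timedelta(days=window - 1)

print(f"max {window}-day mean: {rolling_mean.max():.2f} mm, "
      f"{start.date()} to {end.date()}")
```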

#### Question I.4

From the daily rainfall data, compute the maximum dry-spell duration (largest number of consecutive days without rain) and the maximum wet-spell duration (largest number of consecutive days with rain). When do they occur?
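One possible way to find spell lengths is to label runs of consecutive wet or dry days and then measure each run. The sketch below does this on a made-up series; `rain`, `run_id` and `runs` are illustrative names, and treating any value above zero as "rain" is an assumption.

```python
import pandas as pd

# Hypothetical daily rainfall in mm with made-up dates.
rain = pd.Series(
    [0.0, 0.0, 3.1, 0.2, 0.0, 0.0, 0.0, 1.0, 4.2, 0.5],
    index=pd.date_range("2000-01-01", periods=10, freq="D"),
)

wet = rain > 0  # assumption: any amount above zero counts as a wet day

# A new run starts wherever the wet/dry state differs from the previous day.
run_id = wet.astype(int).diff().ne(0).cumsum()

dates = rain.index.to_series()
runs = pd.DataFrame(
    {
        "is_wet": wet.groupby(run_id).first().astype(bool),
        "length": wet.groupby(run_id).size(),
        "start": dates.groupby(run_id).min(),
        "end": dates.groupby(run_id).max(),
    }
)

# idxmax() returns the first run if several share the maximum length.
longest_dry = runs.loc[runs.loc[~runs["is_wet"], "length"].idxmax()]
longest_wet = runs.loc[runs.loc[runs["is_wet"], "length"].idxmax()]

print(longest_dry)
print(longest_wet)
```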

#### Question I.5

The depth of clarity $\left(D\right)$ of Lake Tahoe was measured (in inches) at several locations. Measurements are available in the file *Tahoe.dat*.

1. Plot the histogram (relative frequency) with class intervals of length 5.
2. Plot the cumulative relative frequency and estimate $\Pr\left(D \leqslant 40\right)$ and $\Pr\left(15 \leqslant D \leqslant 30\right)$.
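The relative and cumulative relative frequencies can be computed before plotting. The sketch below uses made-up depths (`depth` is hypothetical, not the contents of *Tahoe.dat*) and estimates the two probabilities directly from the sample rather than reading them off a plot.

```python
import numpy as np

# Hypothetical clarity depths in inches (NOT the Tahoe.dat values).
depth = np.array([12, 18, 22, 25, 27, 31, 33, 38, 41, 47], dtype=float)

# Class intervals of length 5: edges at 10, 15, ..., 50.
edges = np.arange(10, 55, 5)
counts, _ = np.histogram(depth, bins=edges)

rel_freq = counts / depth.size        # relative frequency per class
cum_rel_freq = np.cumsum(rel_freq)    # cumulative relative frequency

# Empirical probabilities taken directly from the sample.
p_le_40 = np.mean(depth <= 40)
p_15_30 = np.mean((depth >= 15) & (depth <= 30))

print(rel_freq, cum_rel_freq, p_le_40, p_15_30)
```

For the plots, `plt.bar(edges[:-1], rel_freq, width=5, align="edge")` and `plt.step(edges[1:], cum_rel_freq, where="post")` are one option.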

#### Question I.6

Given the function $f$

$$
f\left(x\right) = 
\begin{cases}
  c x^2 & \text{if } 0 < x \leqslant 1 \\ \\
  0 & \text{otherwise}
\end{cases}
$$

solve the following problems analytically:

1. Find the value of $c$ such that $f\left(x\right)$ is the probability density function of a continuous random variable $x$.
   Use the condition $\int_{-\infty }^{\infty }f\left(x\right)dx = 1$.
2. Find the distribution function, $F\left(x\right)$
3. Calculate $\Pr\left(x < 0\right)$, $\Pr\left(x = 0.5\right)$, $\Pr\left(x > 1\right)$,
   $\Pr\left(0 \leqslant x \leqslant 0.5\right)$ and $\Pr\left(0 < x < 0.5\right)$
4. Find the median
5. Find the mode
6. Calculate the expected value $\textrm{E}\left(x\right)$
7. Calculate the variance $\textrm{Var}\left(x\right)$
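
For reference, the general definitions behind these steps are collected below for a density $f$ with distribution function $F$. The symbols $x_{\text{med}}$ and $x_{\text{mode}}$ are notation introduced here for the median and the mode; applying the definitions to the given $f$ is left as the exercise.

$$
\begin{aligned}
F\left(x\right) &= \int_{-\infty}^{x} f\left(t\right)dt, &
\Pr\left(a < x \leqslant b\right) &= F\left(b\right) - F\left(a\right), \\
F\left(x_{\text{med}}\right) &= 0.5, &
x_{\text{mode}} &= \arg\max_{x} f\left(x\right), \\
\textrm{E}\left(x\right) &= \int_{-\infty}^{\infty} x\, f\left(x\right)dx, &
\textrm{Var}\left(x\right) &= \textrm{E}\left(x^2\right) - \left[\textrm{E}\left(x\right)\right]^2 .
\end{aligned}
$$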

### Part II: Probability distributions

#### Question II.1
The number of rainy days in July and August at a meteorological station is given in the table below.


Year|1|2|3|4|5|6|7|8|9|10|
-|-|-|-|-|-|-|-|-|-|-|
July|10|15|17|8|9|19|17|14|20|4
August|4|9|8|3|0|10|12|2|8|6

##### a) Use the Hypergeometric, Binomial and Poisson distributions to calculate 
1) What is the probability of 10 rainy days in each of the months of July and August? 

2) What is the probability of 20 rainy days in the 2 month period? 

*Note: to use the factorial function, import the `math` module with `import math` and then call `math.factorial(x)`.*


* **Hint: Hypergeometric distribution:**
    ```python
    from scipy.stats import hypergeom   
    hypergeom.pmf(x, N, k, n)
    ```
    where
        - k is the number of "successes" in the population
        - x is the number of "successes" in the sample
        - N is the size of the population
        - n is the number sampled
        

* **Hint: Binomial distribution**
    ```python
    from scipy.stats import binom
    binom.pmf(x,n,p)
    ```
    where
        - p is the probability of success
        - n and x same as for the hypergeometric distribution


* **Hint: Poisson distribution**
    ```python
    from scipy.stats import poisson
    poisson.pmf(x, λ)
    ```
    where $\lambda=p \cdot n$ (p and n are the same as for the binomial distribution)
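
As a quick sanity check of the three calls, the sketch below evaluates them for illustrative numbers (not the rainy-day data) and compares two of them with the textbook formulas. In scipy, the positional arguments of `hypergeom.pmf` are, in order: successes in the sample, population size, successes in the population, number sampled.

```python
import math
from scipy.stats import binom, hypergeom, poisson

# Illustrative numbers only (NOT the exercise data).
N = 50  # population size
k = 10  # "successes" in the population
n = 5   # number sampled
x = 2   # "successes" in the sample

# scipy order: pmf(x, population size, successes in population, number sampled)
p_hyper = hypergeom.pmf(x, N, k, n)

p = k / N                       # probability of success per draw
p_binom = binom.pmf(x, n, p)

lam = n * p                     # Poisson rate, lambda = p * n
p_pois = poisson.pmf(x, lam)

# Cross-check against the textbook formulas.
p_hyper_manual = math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)
p_pois_manual = math.exp(-lam) * lam**x / math.factorial(x)

print(p_hyper, p_binom, p_pois)
```

The three values are close but not identical, which reflects the different assumptions (sampling without replacement, with replacement, and the rare-event limit).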

##### b) Which assumptions in each method are likely violated by this problem?

##### c) What is the probability that the sixth rainy day of August occurs on 30 August, assuming that rain occurrences are independent events?
**Hint: use the negative binomial distribution**
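
A minimal sketch of the negative binomial call, with illustrative numbers rather than the August data. The point to watch is that `scipy.stats.nbinom` counts the number of failures before the required number of successes, not the trial on which the last success occurs.

```python
import math
from scipy.stats import nbinom

# Illustrative numbers only (NOT the exercise data).
r = 3    # number of successes required
t = 5    # trial on which the r-th success should occur
p = 0.5  # probability of success per trial

# scipy counts FAILURES before the r-th success, so pass t - r.
prob = nbinom.pmf(t - r, r, p)

# Textbook form: C(t-1, r-1) * p^r * (1-p)^(t-r)
manual = math.comb(t - 1, r - 1) * p**r * (1 - p) ** (t - r)

print(prob, manual)
```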

#### Question II.2
A real estate developer tells a group of concerned citizens
that a new housing project will be well prepared to tackle
the problems associated with a 10-year flood
and that there is nothing to worry about for the next 9 years.
In the 10th year, when the flood will occur as he says, the local
flood protection authority will have prepared everything.
Compute the probability that the real estate developer is actually
right and that the 10-year flood occurs for the first time in the 10th year.
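
One way to approach this kind of question is the geometric distribution, assuming that years are independent and that a $T$-year event has annual probability $1/T$ (both are modelling assumptions, not statements from the exercise). The sketch below uses a hypothetical 5-year event first occurring in year 3.

```python
from scipy.stats import geom

# Hypothetical numbers (NOT the exercise values).
T = 5        # return period in years
p = 1 / T    # assumed annual probability of the event
year = 3     # year of first occurrence

# geom.pmf(k, p) is the probability that the first success is on trial k.
prob_first = geom.pmf(year, p)

# Textbook form: (1 - p)^(k - 1) * p
manual = (1 - p) ** (year - 1) * p

print(prob_first, manual)
```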

#### Question II.3
Assume that the annual maximum discharge at a river station is normally distributed with a mean
of 75 m$^3$/s and a standard deviation of 10 m$^3$/s.
What is the probability for any given year to have
a maximum flow that is

a) less than 70 m$^3$/s?

b) larger than 95 m$^3$/s?

c) between 60 and 80 m$^3$/s?

d) What is the flow that is not exceeded with 90% probability?

e) What is the flow that is exceeded with 80% probability?

f) In which interval (centred on the mean) would 50% of the flows fall?

 * **Hint**: you may use the following functions to find probability/quantile values of a normal distribution:
     * *scipy.stats.norm(mean, sd).cdf(discharge)*
     * *scipy.stats.norm(mean, sd).ppf(cumulative_probability)*, where the cumulative probability is expressed as a fraction between 0 and 1 (e.g. 0.9 for 90%)
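
A short sketch of these calls on a hypothetical distribution (mean 100, standard deviation 15, not the discharge values of the exercise). Note that `ppf` expects a probability between 0 and 1; `sf` and `interval` are additional scipy methods not mentioned in the hint.

```python
from scipy.stats import norm

# Hypothetical parameters (NOT the exercise values).
dist = norm(100, 15)  # mean, standard deviation

p_below = dist.cdf(90)                     # Pr(X < 90)
p_above = dist.sf(120)                     # Pr(X > 120), same as 1 - cdf(120)
p_between = dist.cdf(110) - dist.cdf(90)   # Pr(90 < X < 110)

q90 = dist.ppf(0.90)                       # value not exceeded with 90% probability

# Interval centred on the mean that contains 50% of the probability.
low, high = dist.interval(0.5)

print(p_below, p_above, p_between, q90, low, high)
```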


#### Question II.4

a) Plot the probability mass function of the Poisson distribution for $\lambda = 3$.

b) Approximate the Poisson distribution by a normal distribution and plot the normal approximations on the same graph.

c) (Optional) Do the same as in a) and b) but with $\lambda = 8$.

The Poisson distribution can be approximated by a normal distribution. The standardised variable (Z-score) is: $$ Z = \frac{X - \mu}{\sigma}$$
with $\mu = \lambda$ and $\sigma = \sqrt{\lambda}$.
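
A minimal sketch of the comparison, computed for an illustrative $\lambda = 5$ (deliberately not one of the values asked for). Evaluating the normal density at integer points without a continuity correction is an assumption made here for simplicity.

```python
import numpy as np
from scipy.stats import norm, poisson

lam = 5                 # illustrative rate (NOT one of the exercise values)
x = np.arange(0, 21)    # integer support shown in the comparison

pmf = poisson.pmf(x, lam)

# Normal approximation with mu = lambda and sigma = sqrt(lambda),
# evaluated at the same integer points (no continuity correction).
approx = norm.pdf(x, loc=lam, scale=np.sqrt(lam))

max_diff = np.max(np.abs(pmf - approx))
print(f"largest absolute difference: {max_diff:.4f}")
```

For the figure, `plt.bar(x, pmf)` together with `plt.plot(x, approx)` on the same axes is one option.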