# CE 93: Engineering Data Analysis
# LAB 02 Elements of Probability Theory

**Full Name:** *replace text here*

## Instructions 

Welcome to Lab 02! 

Please save your work after every question! At the end, you will have to submit your Jupyter Notebook as a PDF file in the bCourses quiz. The notebook should be consistent with your quiz answers. Not submitting a PDF file will result in a grade of 0 on the lab assignment. You will also receive a 0 if your answers to the quiz are inconsistent with your PDF.

If you see cells with "..." make sure to replace the "..." with your code even if they are not listed with a "Question". 
Please remember to label all axes with the quantity and units being plotted. 

Any part listed as a "<font color='red'>**Question**</font>" should be answered in the bCourses quiz to receive credit.

We will use the following Python packages:

* NumPy
* pandas
* Matplotlib

## Load the required libraries 

The following code loads the required libraries. Run this cell first.

In [None]:
# import python library / packages 
import numpy as np                           # ndarrays for gridded data
import pandas as pd                          # DataFrames for tabular data
import matplotlib.pyplot as plt              # plotting

## About Lab 02


In this lab, we will be using the frequency notion of probability. According to this notion, the probability of an event, $E$, is estimated as the proportion of times $E$ would occur in the long run, if the experiment were to be repeated over and over again.

More specifically, let $n$ denote the number of observations of the phenomenon of interest and $n_E$ the number of observations during which the event $E$ occurred. Then the probability $P(E)$ is formally defined as: 

$$
P(E) = \lim_{n\to\infty} \frac{n_E}{n} 
$$

In practice, the number of observations is finite, so we obtain only an estimate of the probability. The estimate generally becomes more accurate as the sample size $n$ increases.
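
To make the limit above concrete, here is a minimal simulation sketch. It assumes a hypothetical experiment (rolling a fair six-sided die, with $E$ = "roll a 6"), which is not part of the lab data, and shows how the ratio $n_E/n$ settles near the true value of $1/6$ as $n$ grows.

```python
import numpy as np

# Hypothetical experiment (not the lab data): fair die, event E = "roll a 6"
rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
true_p = 1 / 6

estimates = {}
for n in [10, 100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=n)   # n observations with values 1..6
    n_E = np.sum(rolls == 6)             # number of observations where E occurred
    estimates[n] = n_E / n               # frequency estimate of P(E)
    print(f"n = {n:>9,d}   n_E/n = {estimates[n]:.4f}   (true value {true_p:.4f})")
```

With small $n$ the estimate can be far from $1/6$; with large $n$ it is typically very close.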

Other notions of probability exist. We will discuss them in class and in future assignments.



In response to the drought, the San Francisco Public Utilities Commission is considering a program to incentivize rainwater harvesting for new residential and commercial buildings in the city. The program would create a \$1000 rebate for high-volume rainwater cisterns. Before making the expensive investment, the Commission wants to better understand the rainfall statistics for the area. They are asking you, a Civil and Environmental Engineering consultant, to analyze historical rainfall data for the city.

<img src="rain.png" width='450'/>

Fig 1. Rainfall at Fort Mason, San Francisco http://www.sfgate.com/bayarea/article/First-Bay-Area-rain-this-fall-could-uproot-trees-9966631.php

### Load the data

In Lab 02 we will be working with a San Francisco rainfall data set covering the 1849-50 through 2021-2022 rainfall seasons. The file is named `SFrainfall_2021.csv`.

Source: https://ggweather.com/sf/season.html

Let's load the provided data set `SFrainfall_2021.csv`. These are all the features:

|Feature|Units|Description|
|:-|:-|:-|
|year|yr|The year over which data was recorded|
|days|days|Number of rainy days in the year|
|rain|inch|Cumulative rainfall in the year|

* Load the file using the pandas `read_csv()` function.

In [None]:
# read a .csv file in as a DataFrame
df = pd.read_csv('SFrainfall_2021.csv')

# returns the first 5 rows of the data set by default
df.head()

### Create Variables from the DataFrame

We want to generate data vectors, one for each column in the dataset (one for years, one for days, and one for rain). 

<font color='red'>**Question 1.**</font> What command(s) can we use to get the data vector for years? Select your answer(s) from the options in bCourses. You can refer to Lab 01 to check your answer.

Go back to bCourses and start the quiz to answer the first question.

Using the correct command, create a separate variable for each column in the DataFrame.
- Create a variable `year` for the year
- Create a variable `days` for the number of rainy days in a year
- Create a variable `rain` for the cumulative rainfall (inches)

In [None]:
# create variables for year, days, and rain
# replace ... with your code

year = ...
days = ...
rain = ...

## Graphical Summaries

Let's visualize the data in different ways.

<font color='red'>**Question 2.**</font> Match each plot number from the plots below with its correct graph type. Refer to the bCourses quiz to answer this question.

#### You DO NOT have to make any plots yourself. Simply go to bCourses and match the plot number with the graph type. You will see the options there.

<font color='red'>**Question 3.**</font> Match each graph type with what it is used for. Refer to the bCourses quiz to answer this question.

<img src="Graphical_Summaries.png"/>

## Symbolic Expressions of Events

Let's define the following events: 	

- $E_1$ = the number of rainy days in SF in a given year is > 80 days
- $E_2$ = amount of cumulative annual rainfall in SF in a given year is > 30 inches

The plots below show cumulative rainfall versus number of rainy days for every year in our data set. So, each dot represents data for a single year. The blue dots represent all of the outcomes/years.

<font color='red'>**Question 4.**</font> Let the orange dots represent different events. Match each plot number with the correct symbolic expression that the orange dots represent based on the definition of $E_1$ and $E_2$ above. Refer to bCourses to answer this question.

#### You DO NOT have to make any plots yourself. Simply go to bCourses and match the plot number with the event. You will see the options there.

<img src="Symbolic_Expressions.png"/>

## Plot and Interpret CDF 

In Lab 01, we saw how to plot histograms using `plt.hist()`. We also saw in the lecture how to plot a cumulative diagram from a histogram plot. So, let's generate cumulative distribution function (CDF) plots for our data.

We can do this using `plt.hist()` by specifying the following parameters:
* `cumulative`
* `histtype`

In Lab 01, and by default, `cumulative=False` and `histtype='bar'`.

To show a cumulative diagram, pass `cumulative=True, histtype='step'` as arguments to `plt.hist()`.

To show a cumulative diagram based on **proportions** rather than frequencies, pass `cumulative=True, histtype='step', density=True`.
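
To see what these options compute, the sketch below reproduces the numbers by hand with NumPy. It uses a hypothetical random sample (not the lab data); the variable names are illustrative only. The cumulative frequency is the running sum of the bin counts, and dividing by the total count gives the cumulative proportion, which reaches 1 in the last bin.

```python
import numpy as np

# Hypothetical sample (not the lab data)
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=50, scale=10, size=200)

counts, edges = np.histogram(sample, bins=20)   # frequency in each of the 20 bins
cum_freq = np.cumsum(counts)                    # running total of the frequencies
cum_prop = cum_freq / counts.sum()              # cumulative proportion (0 to 1)

print(cum_freq[-1])   # equals the total number of observations
print(cum_prop[-1])   # the last bin reaches 1.0
```

Reading a value off the cumulative proportion curve at the right edge of a bin gives the fraction of observations that are less than or equal to that value.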

Make two cumulative proportion diagrams with `bins=20`. 

1. The first for `days`. (The code for `days` is already written for you. Copy and edit it to make a similar plot for `rain`.)
2. The second for `rain`.

<font color='red'>**Question 5.**</font> Based on your CDF for `days`, what can you tell from this figure? Select your answer(s) from the options in bCourses.


<font color='red'>**Question 6.**</font> Based on your CDFs for `rain` and `days`, what can you tell from these figures? Select your answer(s) from the options in bCourses.

In [None]:
# Edit the code below
# The code to plot the CDF for days is provided. Modify it to plot the CDF for rain in subplot 2.

# specify number of bins
N=20

# initialize figure with (15,5) width by height
fig=plt.figure(figsize=(15,5))

# create an empty list to hold the subplot axes
axs=[]

# add/append first subplot in a 1x2 grid
axs.append(fig.add_subplot(121))

# Create your first subplot below, with title and axes labels
axs[0].hist(days,bins=N,cumulative=True, density=True, histtype='step')
axs[0].set_title('cumulative proportion diagram of number of rainy days')
axs[0].set_ylabel('cumulative proportion')
axs[0].set_xlabel('number of rainy days')
axs[0].grid()

####################################################################################################

# add/append second subplot in a 1x2 grid
...

# Create your second subplot below, with title and axes labels
...

# display all figures
plt.tight_layout()
plt.show()

## Estimating Probabilities Using Frequencies

We previously defined: 	

- $E_1$ = the number of rainy days in SF in a given year is > 80 days
- $E_2$ = amount of cumulative annual rainfall in SF in a given year is > 30 inches

Let's try to calculate these probabilities based on the data we have.

We can use comparison and logical operators (<, >, <=, >=, ==, &, |, etc.) to estimate probabilities in this case. Suppose $E_3$ is the event that the amount of cumulative rainfall in a given year is less than 10 inches. We first count the number of times (i.e., the frequency) that this event has occurred in the past, and then divide by the total number of observations to estimate the probability of the event.

$$ P(E_3) = \frac{n_{E_{3}}}{n} $$

We can calculate this probability in Python using logical operators as follows:


1. Calculate the total number of observations, n, using the `len()` function (length of an object)

    `n = len(rain) # total number of observations`
    
    
2. Calculate the total number of times event $E_3$ occurred. The expression `rain<10` returns a Boolean array (True or False) with one entry per year, indicating whether the condition is satisfied for that year. To count the number of times the condition holds, we can use the `sum()` function, which counts True as 1 and False as 0. So, `sum(rain<10)` is the frequency, i.e., the total number of years where rain<10.

    `n_E3 = sum(rain<10) # frequency that rain is less than 10`
    

3. Finally, we calculate the probability by dividing the numbers above.

    `P_E3 = n_E3/n`
    
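We can combine multiple events using `&` (and) and `|` (or). Wrap each comparison in parentheses, because `&` and `|` take precedence over the comparison operators. For example, if we want the number of times rain was greater than 10 but less than 20, we can use: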

`n_E4 = sum((rain>10)&(rain<20)) # frequency that rain is greater than 10 AND less than 20`
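
The steps above can be put together as follows. This is a minimal sketch on hypothetical data (`temp`, `wind`, and the events `A` and `B` are made-up names and values, not the lab data), showing single events, an intersection with `&`, and a union with `|`.

```python
import pandas as pd

# Hypothetical measurements (not the lab data)
temp = pd.Series([12, 18, 25, 31, 9, 22, 28, 15])
wind = pd.Series([5, 20, 8, 25, 30, 12, 22, 7])

n = len(temp)        # total number of observations
A = temp > 20        # Boolean Series: True where event A occurred
B = wind > 15        # Boolean Series: True where event B occurred

P_A = sum(A) / n             # 4 of 8 observations
P_B = sum(B) / n             # 4 of 8 observations
P_A_and_B = sum(A & B) / n   # both conditions hold in 2 of 8 observations
P_A_or_B = sum(A | B) / n    # at least one condition holds in 6 of 8 observations

print(P_A, P_B, P_A_and_B, P_A_or_B)   # 0.5 0.5 0.25 0.75
```

As a check, the results satisfy $P(A\cup B) = P(A) + P(B) - P(A\cap B)$: here $0.5 + 0.5 - 0.25 = 0.75$.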


Write Python code to compute the following probabilities using the frequency notion, similar to what we did for event $E_3$ above:
- $ P(E_1) $ 
- $ P(E_2) $
- $P(E_1\cap E_2)$
- $P(E_1\cup E_2)$

Enter your code in the cell below to compute these probabilities, then answer these questions in bCourses.

<font color='red'>**Question 7.**</font> What is $P(E_1)$? Add your answer in the bCourses quiz.

<font color='red'>**Question 8.**</font> What is $P(E_2)$? Add your answer in the bCourses quiz.

<font color='red'>**Question 9.**</font> What is $P(E_1\cap E_2)$? Add your answer in the bCourses quiz.

<font color='red'>**Question 10.**</font> What is $P(E_1\cup E_2)$? Add your answer in the bCourses quiz.

In [None]:
# insert code below

...

<font color='red'>**Question 11.**</font> Based on the above probabilities, what can we say about $E_1$ and $E_2$? Select your answer(s) from the options on bCourses.

In [None]:
# if you need to make any calculations to answer the question above, add your code below

...

## What If We Had Less Data??

This data set has observations from 173 years! It is not always possible to have this many observations. The fewer the observations, the less accurate our probability estimates will be. Remember, the definition of probability is based on the limit as $n$ goes to infinity. 

So, let's examine if the number of observations influences the estimated probabilities.

First, let's review the basics of indexing for a 1D pandas Series (a single column of a DataFrame).
* `rain[0]` will return the rain value for the **first** year in the data set
* `rain[:10]` will return the rain values for the **first 10** years in the data set
* `rain[-10:]` will return the rain values for the **last 10** years in the data set
* `rain[10:20]` will return the rain values for the **11th through 20th** years in the data set
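
The indexing rules above can be tried on a small example. The Series `s` below is hypothetical (six made-up values with the default index 0 to 5), not the lab data. The last line uses `.iloc`, which is the explicitly position-based form of the same slice.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])   # hypothetical values, default index 0..5

print(s[0])                  # first value: 10
print(s[:2].tolist())        # first 2 values: [10, 20]
print(s[-2:].tolist())       # last 2 values: [50, 60]
print(s[2:4].tolist())       # 3rd through 4th values: [30, 40]
print(s.iloc[-2:].tolist())  # same as s[-2:], selected by position: [50, 60]
```

Note that the end of a slice is excluded, so `s[2:4]` returns two values (positions 2 and 3), and a slice of the last `k` values always has length `k`.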


### Next, suppose the data was available only for the last 25 years.

Recalculate the following probabilities:
- $ P(E_1) $ 
- $ P(E_2) $
- $P(E_1\cap E_2)$
- $P(E_1\cup E_2)$

The values that you should get are listed below. Use these to check your values and make sure you are doing things correctly. You do not have to answer any questions for this part. Just write the code and verify your answers. You will then copy your code and edit it in the next part to answer questions in the quiz.

- $P(E_1)$ = 0.2
- $P(E_2)$ = 0.16
- $P(E_1\cap E_2)$ = 0.16
- $P(E_1\cup E_2)$ = 0.2

*Hint: First, define a new variable, say `days_last_25`, and set it equal to the last 25 values in `days`. Second, define a new variable, say `rain_last_25`, and set it equal to the last 25 values in `rain`. Then, apply the same code you used above to calculate probabilities but now using `days_last_25` and `rain_last_25`. Note that both the frequency of the events and the total number of observations will change when we use the last 25 years of the data.*

In [None]:
# recalculate the probabilities using the last 25 years of data
# insert your code below

...

### Next, suppose the data was available only for the first 25 years.

Adapt your Python routine from above to recalculate the same probabilities again using the first 25 years of data:
- $ P(E_1) $ 
- $ P(E_2) $
- $P(E_1\cap E_2)$
- $P(E_1\cup E_2)$

Enter your code in the cell below to compute these probabilities, then answer these questions in bCourses.

<font color='red'>**Question 12.**</font> What is $P(E_1)$ when using the first 25 years of data? Add your answer in the bCourses quiz.

<font color='red'>**Question 13.**</font> What is $P(E_2)$ when using the first 25 years of data? Add your answer in the bCourses quiz.

<font color='red'>**Question 14.**</font> What is $P(E_1\cap E_2)$ when using the first 25 years of data? Add your answer in the bCourses quiz.

<font color='red'>**Question 15.**</font> What is $P(E_1\cup E_2)$ when using the first 25 years of data? Add your answer in the bCourses quiz.

In [None]:
# recalculate the probabilities using the first 25 years of data
# insert your code below

...

<font color='red'>**Question 16.**</font> Compare the probabilities of the different events when using the full dataset, first 25 years, and last 25 years. What do you observe? Select your answer in the bCourses quiz. 

<font color='red'>**Question 17.**</font> What can you tell based on these results? Select your answer(s) in the bCourses quiz. 

## Conditional Probability


The Climate Prediction Center (CPC) issues maps showing the probabilities of temperature and precipitation deviation from normal. The precipitation outlook for January, February, and March 2023 is shown below. (https://www.cpc.ncep.noaa.gov/products/predictions/long_range/seasonal.php?lead=1)

  <img src="precp_outlook.gif" width='600'/>
  
Based on this outlook, San Francisco has equal chances of getting precipitation above and below normal for the winter.

Let's define 'equal chances of getting precipitation above and below normal for the winter' as getting cumulative precipitation within the interquartile range (IQR).

<font color='red'>**Question 18.**</font> Calculate the first and third quartiles of the `rain` data (using the full data set). You can refer to Lab 01 to see how to calculate quartiles. What are the first and third quartiles for the rain data set (using the full data set)? Select your answer in the bCourses quiz.
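
As a reminder of what the quartiles and the IQR represent, here is a minimal sketch on a hypothetical sample (not the lab data). It assumes `np.percentile()` with its default linear interpolation; Lab 01 may have used a different function or method, which can give slightly different values.

```python
import numpy as np

values = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18])   # hypothetical sample

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range

# Boolean array: True where a value lies within [q1, q3]
inside = (values >= q1) & (values <= q3)

print(q1, q3, iqr)    # 6.0 14.0 8.0
print(sum(inside))    # 5 of the 9 values lie within the IQR
```

Roughly half of the observations fall between the first and third quartiles, which is why the IQR is a natural way to describe "near normal" conditions.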

In [None]:
# add your code below to calculate the first and third quartiles

...

Assume that so far this year, the cumulative rainfall in SF has already exceeded 30 inches. Since cumulative rainfall cannot decrease (we can't get negative rain!), we know that by the end of the year, rain > 30 inches. Thus, we know that event $E_2$ occurred. Remember, we previously defined:

- $E_1$ = the number of rainy days in SF in a given year is > 80 days
- $E_2$ = amount of cumulative annual rainfall in SF in a given year is > 30 inches
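
Knowing that an event occurred means we only look at the observations where it occurred. Under the frequency notion, the conditional probability of $A$ given $B$ is estimated as $P(A|B) = n_{A\cap B}/n_B$. The sketch below illustrates this on hypothetical data (`temp`, `wind`, `A`, and `B` are made-up names and values, not the lab data).

```python
import pandas as pd

# Hypothetical measurements (not the lab data)
temp = pd.Series([12, 18, 25, 31, 9, 22, 28, 15])
wind = pd.Series([5, 20, 8, 25, 30, 12, 22, 18])

A = temp > 20    # event A
B = wind > 15    # event B (the event we know has occurred)

n_B = sum(B)                # observations where B occurred: 5
n_A_and_B = sum(A & B)      # observations where both A and B occurred: 2

P_A_given_B = n_A_and_B / n_B
print(P_A_given_B)          # 0.4

# Equivalent approach: keep only the observations where B occurred,
# then take the proportion of those where A also occurred
print((temp[B] > 20).mean())   # 0.4
```

Note that the denominator is $n_B$, not the total number of observations $n$.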

<font color='red'>**Question 19.1.**</font> Knowing that the cumulative precipitation will be > 30 inches, and based on the historical data that we have from 1849-50 to 2021-2022 (complete dataset), what is the probability that there will be more than 80 days of rain this year? Select your answer from the options in the bCourses quiz.

<font color='red'>**Question 19.2.**</font> What can you tell about the events $E_1$ and $E_2$? Select your answer from the options in the bCourses quiz.

In [None]:
# insert any code below

...

## Decision Making

Recent research shows that the drought and a changing climate have resulted in changes to historical precipitation trends in more recent years. In addition, UCB researchers have shown that prolonged droughts which are affecting California will decrease the number of total rainy days in the future. 

<font color='red'>**Question 20.**</font> Based on this and your statistical analysis, what would be your recommendation to the San Francisco Public Utilities Commission? Select your answer from the options in the bCourses quiz.

## Submit your work!

<font color='red'>**Question 21.** </font> Submit your PDF file.

I recommend that you save your .ipynb file and keep a copy of it so that you can refer to it in the future (e.g., when working on the project). 

Once done with answering ALL questions and you are ready to submit the quiz, follow these steps:

1. Run all cells in the notebook. You can do this by going to Cell > Run All. This makes sure that all your visuals and answers show up in the file you submit.

2. Then, go to "File > Download as > PDF via LaTeX (.pdf)" or "PDF via HTML (.html)" to generate a PDF file. Name the PDF file with your last name: "Lastname.pdf". Even if you choose PDF via HTML (.html), make sure that the downloaded file is a '.pdf' file.

3. If you have trouble generating the PDF file from Jupyter Notebook, use [datahub.berkeley.edu](http://datahub.berkeley.edu). Log in with your CalNet credentials. Upload the .ipynb file with your outputs and results to Jupyter. Then follow step 2.

4. Upload the PDF file to the bCourses quiz (more instructions there).


**Not submitting a PDF file will result in a grade of 0 on this lab assignment.**
**You will also receive a 0 if your answers to this quiz are inconsistent with your PDF.**