# CE 93: Engineering Data Analysis
# LAB 01 Numerical and Graphical Summaries of Data


**Full Name:** *replace text here*

## Instructions 

Welcome to Lab 01! 

Please save your work after every question! At the end, you will have to submit your Jupyter Notebook as a PDF file in the bCourses quiz. The notebook should be consistent with your quiz answers. Not submitting a PDF file will result in a grade of 0 on the lab assignment. You will also receive a 0 if your answers to the quiz are inconsistent with your PDF.

If you see cells with "..." make sure to replace the "..." with your code even if they are not listed with a "Question". 
Please remember to label all axes with the quantity and units being plotted. 

Any part listed as a "<font color='red'>**Question**</font>" should be answered in the bCourses quiz to receive credit.

We will use the following Python packages:

* NumPy
* pandas
* Matplotlib
* statistics
* SciPy

## Load the required libraries 

The following code loads the required libraries. Run this cell first.

In [None]:
# import python library / packages 
import numpy as np                           # ndarrays for gridded data
import pandas as pd                          # DataFrames for tabular data
import matplotlib.pyplot as plt              # plotting
import statistics as stats                   # statistics like mode
import scipy                                 # statistics

## About Lab 01
In Lab 01, we will work with a data set of biochemical oxygen demand (BOD), nitrate, and ammonia measurements along the Blackwater River. BOD is a measure of the amount of oxygen consumed by microorganisms as they decompose organic matter in a given volume of water. BOD can also be used to measure the oxygen removed from water by the oxidation of inorganic matter, which is a chemical reaction rather than a biological one.

 <img src="figure1.png">
 
 From EPA: Water Monitoring and Assessment: 
“BOD directly affects the amount of dissolved oxygen in rivers and streams. The greater the BOD, the more rapidly oxygen is depleted in the stream. This means less oxygen is available to higher forms of aquatic life. The consequences of high BOD are the same as those for low dissolved oxygen: aquatic organisms become stressed, suffocate, and die.”


Nitrates are a very common contaminant of drinking water, especially in rural areas near agriculture. The primary sources of nitrate contamination are fertilizer, sewage, and manure. Federal regulations for nitrates were put in place because high nitrate levels can result in methemoglobinemia, or "blue baby" disease. They are also an indicator for other major agricultural contaminants, such as pesticides and bacteria. 


Ammonia in waterways is the result of industrial processes, wastes, and fertilizers, as well as a byproduct of drinking water disinfection with chloramines. Ammonia is toxic for aquatic life, especially in oxygen-poor environments.


For Civil and Environmental Engineers, BOD, Nitrates, and Ammonia measurements are critical to protect the health of waterways that are impacted by various organic and inorganic pollution. The consequences of high BOD could be species depletion, damaged ecosystems, polluted water, regulatory challenges, and criminal investigations.  

### Load the data

Let's load the provided data set `nutrients.csv`. These are all the features:

|Feature|Units|Description|
|:-|:-|:-|
|BOD|mg/L|Biochemical oxygen demand measurements|
|Nitrates|mg/L|Nitrates measurements|
|Ammonia|mg/L|Ammonia measurements|

* load using the Pandas `read_csv()` function

In [None]:
# read a .csv file in as a DataFrame
df = pd.read_csv('nutrients.csv')

# returns the first 5 rows of the data set by default
df.head()

### Dataframe Shape

Let's check the shape (number of rows and columns) of our data set

To get the shape of a DataFrame, use `DataFrame.shape`, where `DataFrame` should be replaced with the name of your data (see the cell above). The `.shape` attribute returns (rows, columns).
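
As a minimal sketch of how `.shape` behaves, the example below builds a small made-up table. The name `toy` and its columns are hypothetical and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy table (NOT the lab data): 4 rows, 2 columns
toy = pd.DataFrame({
    "depth(m)": [0.5, 1.0, 1.5, 2.0],
    "temp(C)": [18.2, 17.9, 17.1, 16.4],
})

# .shape is an attribute, so it is written without parentheses
n_rows, n_cols = toy.shape
print(toy.shape)        # (4, 2)
print(n_rows, n_cols)   # 4 2
```

Because `.shape` is a tuple, you can unpack it into two variables as shown, or index it with `toy.shape[0]` and `toy.shape[1]`.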

<font color='red'>**Question 1.1.**</font> What is the shape of this data set?

Replace the "..." with your code.

In [None]:
# return the dataframe shape
...

### Column Labels

Each column in a DataFrame has a label/name. Getting column labels is useful when you want to access a specific column by name.
Let's check the column labels of the DataFrame:

* using the `.columns` attribute

The column labels will appear between square brackets after running the cell below.

In [None]:
# return the column labels of the Dataframe
df.columns

So we have three columns, with labels: 
1. 'BOD(mg/L)'
2. 'Nitrates(mg/L)'
3. 'Ammonia(mg/L)'

### Getting Specific Values

If you want to get a specific value from the DataFrame (formally known as indexing), say the value in the 10th row and 10th column, you can NOT simply use `DataFrame[9, 9]` as you would for a NumPy array.

So let's see how to do it. There are multiple ways to select and index rows and columns from DataFrames. You could use the column labels from above instead of the column number, or simply use the column and row number.

* Method 1: Indexing a DataFrame using `.iloc[ ]`

This indexer allows us to retrieve rows and columns by position. It is primarily integer position based (from 0 to length-1 of the rows or columns). The format is `DataFrame.iloc[row, column]`. Remember that Python is 0-indexed, so the first row is 0, the second is 1, and so on. The same applies to column positions. Remember this!

* Method 2:

You can also index a DataFrame using `DataFrame['column label'][row]`. This returns the value in the column 'column label' at index `row`, which is the (row + 1)-th row because of 0-indexing. You need to use the exact column label. You can refer to the output above for the three column labels of our DataFrame.

There are more ways to index a DataFrame, but these should be sufficient.
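
The two methods can be sketched on a small made-up table. The name `toy` and its columns are hypothetical and are not the lab data; the sketch also assumes the default row index (0, 1, 2, ...), which is what `read_csv()` produces when no index column is specified.

```python
import pandas as pd

# Hypothetical toy table (NOT the lab data)
toy = pd.DataFrame({
    "depth(m)": [0.5, 1.0, 1.5, 2.0],
    "temp(C)": [18.2, 17.9, 17.1, 16.4],
})

# Method 1: position based. 3rd row -> index 2, 2nd column -> index 1
by_position = toy.iloc[2, 1]

# Method 2: exact column label first, then the row index
by_label = toy["temp(C)"][2]

# Use print() to show more than one output from the same cell
print(by_position)   # 17.1
print(by_label)      # 17.1
```

Method 2 looks the row up by its index label, so it agrees with `.iloc` only when the row labels are the default 0, 1, 2, ... positions, as assumed here.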

<font color='red'>**Question 1.2.**</font> What is the BOD in mg/L for the 30$^{th}$ row? You can refer to the column labels in the previous code cell. You can see that the first label is for BOD, so BOD is in the first column.

Replace the "..." with your code. You can try both methods to confirm your answer, but you don't have to. Also, remember that to return multiple outputs in the same code cell, you need to use the `print()` function.

You now have the answers to Question 1 of the quiz. Go back to bCourses and start the quiz to answer the first question.

In [None]:
# Using iloc[]. Get cell value by row and column position
...

# Alternatively, get cell value using column label and row position
...

### Create Variables from the DataFrame

We will be working with different columns of the DataFrame (biochemical oxygen demand, nitrates, and ammonia). So, let's create different variables for each column in the Dataframe.

* Create a variable `bod` for biochemical oxygen demand
* Create a variable `nit` for nitrates
* Create a variable `amm` for ammonia

We can do this in two different ways.
* Using `DataFrame['column label']`
* Or if you don't want to type out the full column labels, you can use `DataFrame[DataFrame.columns[column]]`, where column would be the integer index of the column you want. Remember, indexing starts at 0!

Replace the "..." with your code.

In [None]:
# create variables for biochemical oxygen demand, nitrates, and ammonia
# bod is already created for you
# replace ... with your code to create nit and amm

bod = df['BOD(mg/L)'] # or df[df.columns[0]]
nit = ...
amm = ...

### Check your answer to Question 1.2.

You can check your answer to Question 1.2. using `bod[row]`, where row would be the **index** of the row you are interested in.

In [None]:
# return BOD value for the 30th row
...

## Get Numerical Summaries Using stats Module

In this part, we will calculate different measures of central tendency and variability for the `nit` data using the `stats` module.

* Use the `stats.mean()` function to get mean value
* Use the `stats.median()` function to get median
* Use the `stats.variance()` function to get sample variance
* Use the `stats.stdev()` function to get sample standard deviation
* (Not required in the questions) You can use the `stats.mode()` function to get mode
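
A minimal sketch of these functions on a short made-up list follows. The values in `sample` are hypothetical and are not the lab data. The coefficient of variation is computed here as the sample standard deviation divided by the mean.

```python
import statistics as stats

# Hypothetical sample values (NOT the lab data)
sample = [2.0, 4.0, 4.0, 5.0, 10.0]

mean = stats.mean(sample)         # arithmetic mean
median = stats.median(sample)     # middle value of the sorted data
var = stats.variance(sample)      # sample variance
sd = stats.stdev(sample)          # sample standard deviation
cv = sd / mean                    # coefficient of variation

print(mean, median)     # 5.0 4.0
print(mean > median)    # True
print(var, sd, cv)
```

The same calls accept a pandas column such as `nit`, since the `statistics` functions work on any sequence of numbers.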

Go back to bCourses to answer the questions below after you update the code cell.

<font color='red'>**Question 2.**</font> What is the mean of nitrates? Add your answer in the bCourses quiz.

<font color='red'>**Question 3.**</font> Calculate the median of nitrates, then use comparison operators in Python (>, <, ==) to compare the mean and median values. Select your answer(s) from the options in bCourses.

<font color='red'>**Question 4.**</font> What is the variance of nitrates? Select your answer in the bCourses quiz.

<font color='red'>**Question 5.**</font> What is the standard deviation of nitrates? Add your answer in the bCourses quiz.

<font color='red'>**Question 6.**</font> What is the coefficient of variation of nitrates? Add your answer in the bCourses quiz.

<font color='red'>**Question 7.**</font> What is the unit of the coefficient of variation of nitrates? Select your answer(s) from the options in bCourses.

In [None]:
# calculate the mean of nitrates
...

# calculate the median of nitrates
...

# compare the mean and median of nitrates
...

# calculate the variance of nitrates
...

# calculate the stdev of nitrates
...

# calculate the coefficient of variation of nitrates
...

## Get Numerical Summaries Using numpy Package

Numerical summaries can also be calculated using the `numpy` package. However, it is important to understand any possible differences between packages. Let's look at an example also for the `nit` data.

* Use the `np.std()` function with the default parameters

<font color='red'>**Question 8.**</font> What is the standard deviation of nitrates when you use `np.std()`? Add your answer in the bCourses quiz.

In [None]:
# calculate the stdev of nitrates using np.std()
...

### Why Are My Values Different?

If you got slightly different values from `stats.stdev()` and `np.std()`, then do not freak out! If you got the same values, you did something wrong, so go back.

Let's try to understand this difference.

To answer the next question, you need to understand how `np.std()` works. You can read more about it [here](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

<font color='red'>**Question 9.**</font> Why are your values from `stats.stdev()` and `np.std()` different? Select your answer(s) from the options in bCourses.

### Other Numerical Summaries in numpy:

The `numpy` package has other functions for numerical summaries.

<font color='red'>**Question 10.**</font> Which of the following `numpy` functions exist to calculate numerical summaries similar to the `stats` module? Select your answer(s) from the options in bCourses.
* np.mean()
* np.median()
* np.mode()

If a function does not exist and you try to run it, you will simply get an error. If that happens, add a '#' before the code to comment this line and not run it. You should submit your PDF file at the end with no errors, so make sure to comment any functions that do not exist in `numpy`.

In [None]:
# Try to calculate mean using numpy
...

# Try to calculate median using numpy
...

# Try to calculate mode using numpy
...

## Other Methods to Get Numerical Summaries

There are other packages besides `stats` and `numpy` to get numerical summaries. Always make sure you understand the default parameters and check your values using different packages when in doubt.

Finally, let's use the `describe()` method of DataFrames to get several descriptive statistics for the `nit` data.

* Use the `DataFrame.describe()` function with the default parameters

You can read more about it here: [`DataFrame.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

<font color='red'>**Question 11.**</font> Which of the following is (are) output(s), by default, of the `describe()` function? Select your answer(s) from the options in bCourses.
* Mean
* Median
* Mode
* 25th percentile

You can simply run it on the nit data and check the output to answer this question.

In [None]:
# get descriptive statistics for nitrates using DataFrame.describe(), and replace DataFrame with the variable you want
...

## Histogram Plots Using matplotlib.pyplot

In this part, we will plot histograms for the `bod` data using `matplotlib.pyplot`.

* Read more about `plt.hist()` [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).
* You can also run `?plt.hist` in a code cell to read more about it
* You will need to understand the following parameters to solve the next questions:
    * bins
    * weights

Note: If you scroll to the top of the code, where we loaded the required libraries, you can see that we used: `import matplotlib.pyplot as plt`

Thus, in your code, you can simply replace matplotlib.pyplot by plt.

So instead of `matplotlib.pyplot.hist()`, you **should** use `plt.hist()`.

### Histogram Plot with `bins=5`

Plot a histogram for the `bod` data with `bins=5`. To clearly see the bars, you can specify the edge color of the bins as follows, where `k` stands for black:
* `plt.hist(bod, bins=5, ec='k')`

<font color='red'>**Question 12.**</font> What is the maximum frequency that you observe and for how many bins? Select your answer(s) from the options in bCourses.

In [None]:
# learn more about matplotlib.pyplot.hist
?plt.hist

In [None]:
# plot a histogram of BOD using bins = 5 (simply copy the line of code from the text above)
...

# display all figures
plt.show()

### Label your axes!!

Whenever you are presenting a plot, whether for a lab, homework, or project (and in your professional career), ALWAYS label your axes and add units where relevant. You can also add a title to your plot.

Let's re-plot the histogram for the `bod`, now adding axes labels and controlling the appearance of the figure. Run the code below to check it out! In all of the remaining figures, follow a similar format.

In [None]:
# initialize figure
# (5,5) is width by height
fig = plt.figure(figsize=(5,5))

# In the next question, you will create a figure with several subplots
# One way to do so is using fig.add_subplot(nrows, ncols, index)
# This will add a subplot that will take the index position on a grid with nrows rows and ncols columns. 
# index in this case starts at 1 in the upper left corner and increases to the right. 

# While it is useless to have subplots for 1 figure in this case, let's do it just for illustration

# create an empty array axs to append to it each new subplot
axs = []

# add/append first subplot in a 1x1 grid
axs.append(fig.add_subplot(1,1,1))

# specify number of bins
N = 5

# plot histogram with bins = N
# to make sure we are plotting on the first subplot, we use axs[0]
axs[0].hist(bod, bins=N, ec='black')

# add title
axs[0].set_title('5 Bins bod histogram')

# add label for x-axis
axs[0].set_xlabel('BOD(mg/L)')

# add label for y-axis
axs[0].set_ylabel('Frequency')

# display all figures
plt.show()

### Try Different Bin Numbers

<font color='red'>**Question 13.**</font> Make three histograms for the `bod` data using (a) `bins = 5`, (b) `bins = 15`, and (c) `bins = 50`. Which number of bins is most appropriate for this dataset? Select your answer(s) from the options in bCourses.

In [None]:
# Edit the code below to make three histogram subplots in 1 figure
# I am providing you with many of the commands here for you to practice. In the next question, you will write the entire code.
# You will simply have to copy your code from here and edit it for the other questions

# initialize figure with (12,4) is width by height
fig = plt.figure(figsize=[12,4])

# create empty axs
axs = []

# add/append first subplot in a 1x3 grid
axs.append(fig.add_subplot(1,3,1))

# specify number of bins for first subplot
N = 5

# Create your first subplot below, with title and axes labels
# here you will use axs[0].hist to plot, axs[0].set_title to set the title etc. (see code above)
axs[0].hist(bod, bins=N, ec='black')
axs[0].set_title('5 Bins bod histogram')
axs[0].set_xlabel('BOD(mg/L)')
axs[0].set_ylabel('Frequency')

####################################################################################################

# add/append second subplot in a 1x3 grid
# Note that now we are using index 2: (1,3,2)
axs.append(fig.add_subplot(1,3,2))

# specify number of bins for second subplot
...

# Create your second subplot below, with title and axes labels
# here you will use axs[1].hist to plot, axs[1].set_title to set the title etc. where axs[1] calls the second subplot
...

####################################################################################################

# add/append third subplot in a 1x3 grid
# Note that now we should use index 3: (1,3,3)
...

# specify number of bins for third subplot
...

# Create your third subplot below, with title and axes labels
# here you will use axs[2].hist to plot, axs[2].set_title to set the title etc. where axs[2] calls the third subplot
...

# display all figures
plt.show()

### Histogram Plot with Frequency, Relative Frequency, and Density:

So far, we have been specifying the total number of bins, which results in equal bin widths. You can alternatively specify the values for the bin edges, and thus assign unequal bin widths. So in this part, we will use `bins=[2.2, 2.5, 3.2, 3.3, 3.8, 4.5]`, which again specifies the edges of the bins.

Also, by default, `plt.hist()` plots the frequency (i.e., count of the number of sample data that fall into each bin). We can alternatively plot proportions (i.e., relative frequency: frequency/sample size) or densities (proportions/bin width).
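
The relationship between the three quantities can be sketched numerically with `np.histogram()`, which counts values per bin without drawing anything. The `data` and `edges` values below are hypothetical and are not the lab data, and the sketch assumes every value falls inside the range covered by the bin edges.

```python
import numpy as np

# Hypothetical sample and unequal bin edges (NOT the lab data)
data = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 6.0, 7.0, 9.0])
edges = np.array([1.0, 2.0, 4.0, 10.0])

counts, _ = np.histogram(data, bins=edges)   # frequency: count per bin
proportions = counts / len(data)             # relative frequency
widths = np.diff(edges)                      # width of each bin
density = proportions / widths               # density

print(counts)        # [2 4 4]
print(proportions)   # [0.2 0.4 0.4]
print(density)

# Same densities as requesting them directly
np_density, _ = np.histogram(data, bins=edges, density=True)
print(np.allclose(density, np_density))      # True
```

With these definitions the proportions sum to 1, while for densities it is the bar areas (density times bin width) that sum to 1.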

Modify your plotting code from the previous question and plot, in the same figure, three histograms for `bod` all using `bins=[2.2, 2.5, 3.2, 3.3, 3.8, 4.5]`, such that:

1. The first plot should have frequency on the y-axis (similar to before): `plt.hist(bod, ec='black', bins=[2.2, 2.5, 3.2, 3.3, 3.8, 4.5])`
2. The second (middle) plot should have relative frequency on the y-axis. For this, pass the parameter `weights=np.ones_like(bod)/len(bod)` inside the parentheses, in addition to specifying the bins as defined above. This gives each observation a weight of 1/(sample size), so each bar height is a proportion instead of a count.
3. The third plot should have density on the y-axis. For this, pass `density=True` inside the parentheses (do not specify weights), in addition to specifying the bins as defined above.

Make sure you label the y-axes correctly for each plot. Also, add a title for each plot.

After generating the plot, right click on it and click 'Save image as' to download your figure as an image.

<font color='red'>**Question 14.**</font> Upload your figure to bCourses using the instructions there.

<font color='red'>**Question 15.**</font> Based on your plot of frequency, relative frequency, and density using the given bin edges, what can you tell about these plots? Select your answer(s) from the options in bCourses.

In [None]:
# Add your code below
# You can simply copy it from above and make the necessary changes for each subplot
# Label your axes and add a descriptive title for each subplot

...

## Percentiles and Quartiles Using numpy:

We can calculate percentiles and quartiles directly in Python. 

* For example, [`np.percentile(data, q)`](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html) returns the qth percentile for the specified data. 
* `np.percentile(bod, q=[20, 40])` returns the 20th and 40th percentile (in the order you specify the q values)
* If you use `per20, per40 = np.percentile(bod, q=[20, 40])`, variable `per20` will be the 20th percentile and variable `per40` will be the 40th percentile, and you can perform operations like `per40 - per20`

Remember that the interquartile range is the third quartile (75th percentile) - first quartile (25th percentile).
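
As a minimal sketch of this calculation, the example below uses a made-up array and NumPy's default percentile settings. The name `values` is hypothetical and is not the lab data.

```python
import numpy as np

# Hypothetical data (NOT the lab data): the integers 1 through 9
values = np.arange(1, 10)

# Request both quartiles in one call; results come back in the order of q
q1, q3 = np.percentile(values, q=[25, 75])

# Interquartile range = third quartile - first quartile
iqr = q3 - q1
print(q1, q3, iqr)   # 3.0 7.0 4.0
```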

* Use `np.percentile(data, q)` to calculate percentiles from which you can calculate the interquartile range for `bod`.

<font color='red'>**Question 16.**</font> What is the interquartile range for `bod`? Add your answer in the bCourses quiz.

In [None]:
# calculate interquartile range for bod
...

## Boxplots Using matplotlib.pyplot

Boxplots can be plotted using the [`plt.boxplot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html) function.

Let's look at an example. We will plot two boxplots for random data a and b.

In [None]:
# randomly generate an array with 100 numbers
a = np.random.randn(100)

# randomly generate another array with 100 numbers
b = 2*(np.random.randn(100)+1)

#initialize figure
fig = plt.figure() 
axs = []

#boxplot of dataset 'a'
axs.append(fig.add_subplot(121)) #first subplot
axs[0].boxplot(a)
axs[0].set_title('boxplot of a')
axs[0].set_ylabel('a')

#boxplot of dataset 'b'
axs.append(fig.add_subplot(122)) #second subplot
axs[1].boxplot(b)
axs[1].set_title('boxplot of b')
axs[1].set_ylabel('b')

plt.tight_layout()
plt.show()

If we wish to compare the boxplots of a and b, we should plot them on the same figure. Notice that the values for the y-axes above are different for a and b, and thus, the above plots are not an effective way to compare a and b.

Let's plot both boxplots together. This will provide a better graphical representation because we can effectively compare both data sets a and b.

In [None]:
# useful in comparing multiple data sets
fig = plt.figure() 
axs = []

#boxplot of a and b on one plot
axs.append(fig.add_subplot(111))
axs[0].boxplot([a,b])
axs[0].set_title('boxplot of a & b')
axs[0].set_xticklabels(["a","b"])

plt.tight_layout()
plt.show()

### Boxplots for `bod`, `nit`, `amm` 

<font color='red'>**Question 17.**</font> Create boxplots for `bod`, `nit`, and `amm` **on the same plot**. Based on your boxplots for `bod`, `nit`, and `amm`, what can you tell? Select your answer(s) from the options in bCourses.

In [None]:
# Add your code below
# You can simply copy the code above and edit it to plot boxplots for bod, nit, and amm
# Make sure to label your axes

...

## Scatter Plots Using matplotlib.pyplot

Scatter plots of bivariate data can be plotted using the [`plt.scatter()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) function.

You can control the marker shape, size, fill color, edge color and many more. Click on the function above to read more about some of its parameters.

Let's look at an example for data a and b.

In [None]:
#initialize figure
fig = plt.figure()
axs = []

# make the scatterplot
axs.append(fig.add_subplot(111))
axs[0].scatter(a,b, color='orange', edgecolors='k') #(x,y)
axs[0].set_xlabel('a')
axs[0].set_ylabel('b')
axs[0].set_title('Scatterplot of a versus b')

plt.show()

### Scatterplots for `bod`, `nit`, `amm` 

Make three scatter plots as follows: 

1. scatter plot of `bod` (y-axis) versus `nit` (x-axis)
2. scatter plot of `amm` (y-axis) versus `bod` (x-axis)
3. scatter plot of `amm` (y-axis) versus `nit` (x-axis)
    
Make sure to correctly label the axes to easily answer the next question.

<font color='red'>**Question 18.**</font> Based on your scatter plots, what can you tell about the associations between the three variables (positive, negative, none)? Select your answer(s) from the options in bCourses.

In [None]:
# Add your code below for the three scatter plots
# Try to create one figure with three subplots, similar to what you did when plotting frequency, relative frequency, and density

...

## Submit your work!

<font color='red'>**Question 19.** </font> Submit your PDF file.

I recommend that you save your .ipynb file and keep a copy of it so that you can refer to it in the future (e.g., when working on the project). 

Once done with answering ALL questions and you are ready to submit the quiz, follow these steps:

1. Run all cells in the notebook. You can do this by going to Cell > Run All. This makes sure that all your visuals and answers show up in the file you submit.

2. Then, go to "File > Download as > PDF via LaTeX (.pdf)" or "PDF via HTML (.html)" to generate a PDF file. Name the PDF file with your last name: "Lastname.pdf". Even if you choose PDF via HTML (.html), make sure that the downloaded file is a '.pdf'.

3. If you have trouble generating the PDF file from Jupyter Notebook, use [datahub.berkeley.edu](http://datahub.berkeley.edu). Log in with your CalNet credentials. Upload the ipynb file with your outputs and results to Jupyter. Then follow step 2.

4. Upload the PDF file to the bCourses quiz (more instructions there).


**Not submitting a PDF file will result in a grade of 0 on this lab assignment.**
**You will also receive a 0 if your answers to this quiz are inconsistent with your PDF.**