# Mini-Project: Pennypack Creek Water Quality
<font color='red'>**Read all of the instructions and background material carefully!** </font>

## Mini-Project Goals
At the end of the semester, you'll engage in a group project where you'll choose a data set, analyze it, formulate and test a hypothesis, evaluate your results, and suggest future work. Think of this mini-project as a dry-run where you will be coached through this process. There will be none of the usual checks, but lots of hints and some (not all) of the code is provided. 

**Remember:** Proficiency in Python coding is a secondary goal of this course; the primary goal is to hone your skills in data analysis. That includes data interpretation and communication of your results. **How well you do on this assignment depends as much on what you write in the markdown cells as it does on your code.** Scientists must write!

Read everything carefully, and <span style="color: blue;">pay particular attention to instructions in blue. </span>

## Data Analysis Steps
While your approach may vary depending on the dataset, here are the typical fundamental steps:
* Understanding the Data Context
* Cleaning the data
* Univariate Exploration
* Correlation Analysis
* Hypothesis Testing or Modeling
* Interpreting the Results
* Understanding the Limitations
* Thinking About Future Work

Together, we will work through all of these steps. Let's begin!

## Part 1: Understanding the Data Context

### Motivation
Sir Isaac Newton, the famous scientist and mathematician, once said, “If I have seen further, it is by standing on the shoulders of giants.” Every scientific study builds on the work of others. You don't want to reinvent the wheel. You also need domain knowledge -- you wouldn't expect a biologist to analyze data from the Hubble telescope, or an astronomer to report on DNA sequencing. At least not without reading up on the topic. So before you begin exploring your data, you need to *read.* Scientific journal articles always begin with a literature review that discusses previous research and points the reader to related studies. The literature review puts the current paper in context.

We'll begin the mini-project with some background information. You are encouraged to dig deeper on your own to explore any related resources. The less you know about a topic, the more background reading you need to do.

### Background: Pennypack Creek Water Quality
The data set you will be analyzing was provided by Dr. Laura Toran of Temple's Department of Earth and Environmental Science. Much of Dr. Toran's research has focused on how the urban environment affects the water quality of streams in Philadelphia and the surrounding suburbs. She has conducted this research with the help of numerous Temple graduate students, and undergraduate researchers like the ones in the photos below, all photographed while sampling Pennypack Creek.

![Undergrad researcher sampling the Pennypack](./data/sampling_side_by_side.png)

Pennypack Creek is a small stream that begins in the suburbs north of Philadelphia, flows through the northeast portion of the city and discharges into the Delaware River. The data set you will be working with is a "synoptic sample." In environmental science, synoptic sampling refers to the process of collecting samples from different locations within a study area during a single, short time period—essentially taking a “snapshot” of conditions across space, rather than over time. So you will be looking at data from water samples all collected on the same day but at points spanning the length of the stream. Your goal will be to examine how the geochemistry of the stream varies from the headwaters to where it discharges to the Delaware.

Many natural tributaries contribute to the flow of Pennypack Creek, but you will focus on a manmade source, a wastewater treatment plant run by the Upper Moreland Hatboro Joint Sewer Authority. You can read more about the plant [here](https://www.umhjsa.org/about.html#:~:text=Recent%20plant%20upgrades,environmental%20compliance). The aerial image below shows the location of the treatment plant relative to the creek, and the photo shows the plant discharging into the Pennypack.

![Treatment Plant](./data/wwtp_combined_side_by_side.png)

### Watch this video on wastewater treatment plants

[Link to Wastewater Treatment video on YouTube](https://www.youtube.com/watch?v=FvPakzqM3h8)

<span style="color: blue;">Write a paragraph or two about what you found most interesting about this video. Also discuss what impact you think discharging treated wastewater might have on the Pennypack Creek data.</span>

Answer here ...


### A First Look at the Data

In [2]:
# Import the Python modules needed for the analysis
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')

In [3]:
# Load the dataset
import pandas as pd
pp = Table.read_table('data/Pennypack_anions_2022-03-22.csv')
pp.show(9)

SiteCode,X_dec_deg,Y_dec_deg,km downstream,Cl (mg/L),NO3-N (mg/L),SO4 (mg/L)
PP1-1,-75.1555,40.1778,0.93,52.77,1.18,19.53
PP1-2,-75.142,40.1791,2.18,80.05,1.3,23.34
PP1-3,-75.1339,40.1799,2.96,87.15,1.19,23.09
PP1-4,-75.1278,40.1785,3.56,116.08,1.3,23.14
PP1-5,-75.1219,40.1774,4.12,114.89,1.17,21.91
PP1-6,-75.1164,40.177,4.75,120.32,1.17,21.96
PP1-7,-75.1108,40.1727,5.62,157.79,1.22,22.15
PP1-8,-75.1065,40.1664,6.5,159.72,1.21,22.36
PP1-9,-75.1061,40.1596,7.37,280.6,1.06,22.87


### Metadata
Metadata is data about the data. These samples were collected on March 22, 2022 by four teams of students along with Dr. Toran.
Here is a description of each of the columns:
* SiteCode: Simply an identifier. PP is short for Pennypack. PP1 is team 1, PP2 is team 2, etc., for the four sampling teams, followed by the sample number. So PP2-3 would be the third sample collected by Team 2.
* X_dec_deg: The x-location (longitude) of the sample collection point in decimal degrees.
* Y_dec_deg: The y-location (latitude) of the sample collection point in decimal degrees.
* km downstream: The distance along the stream of each collection point from the start of the stream.
* Cl (mg/L): The measured chloride ion concentration in milligrams/liter.
* NO3-N (mg/L): “NO3-N” specifically quantifies only the nitrogen portion of the nitrate molecule, not the entire mass of the nitrate ion.
* SO4 (mg/L): Measured sulfate ion concentration.

Note: Reporting as Nitrate-N means that only the mass of the nitrogen (atomic weight ≈ 14) is counted, rather than the mass of the whole nitrate ion, NO3 (formula weight ≈ 62). Nitrogen can exist in several forms (nitrate, nitrite, ammonia, organic nitrogen), but nitrate is the main form discharged by the wastewater treatment plant.
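
The conversion between the two reporting conventions is just a ratio of weights. The sketch below is for illustration only: the function names are made up for this example, and the atomic weights are rounded standard values.

```python
# Rounded standard atomic weights in g/mol (assumed values for illustration)
N_WEIGHT = 14.007
O_WEIGHT = 15.999
NO3_WEIGHT = N_WEIGHT + 3 * O_WEIGHT  # about 62

def nitrate_n_to_nitrate(no3_n_mg_per_l):
    """Convert a concentration reported as nitrate-N to mg/L of the whole nitrate ion."""
    return no3_n_mg_per_l * NO3_WEIGHT / N_WEIGHT

def nitrate_to_nitrate_n(no3_mg_per_l):
    """Convert mg/L of nitrate ion to mg/L reported as nitrate-N."""
    return no3_mg_per_l * N_WEIGHT / NO3_WEIGHT

# Example: the first sample in the table (PP1-1) reports 1.18 mg/L as NO3-N
as_nitrate = nitrate_n_to_nitrate(1.18)
print(f"1.18 mg/L NO3-N is about {as_nitrate:.2f} mg/L NO3")
```

The ratio is about 4.4, so the same water sample always shows a smaller number when reported as nitrate-N than when reported as nitrate.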

## Part 1: Mapping the Data
To further your understanding of the data, it helps to map the stream and the sample collection points. To do this, we will use the data we loaded to create an interactive map -- something that can be done quite easily with our datascience module.

You are encouraged to review this section of the textbook: [8.5.2. Drawing Maps](https://inferentialthinking.com/chapters/08/5/Bike_Sharing_in_the_Bay_Area.html#852-drawing-maps)

In [4]:
# Prepare to map the sample points
# Note that the Circle function expects the datatable column with the labels to be named 'labels'
# hence we create a new table with three columns.
# Also note that Circle expects latitude (y) before longitude (x)
map_data = pp.select('Y_dec_deg', 'X_dec_deg', 'SiteCode').relabel('SiteCode', 'labels')
map_data.show(3)

Y_dec_deg,X_dec_deg,labels
40.1778,-75.1555,PP1-1
40.1791,-75.142,PP1-2
40.1799,-75.1339,PP1-3


#### Stupid Historical Tradition: Why give latitude (y) before longitude (x)?
Early mapmakers and navigators established the habit of listing latitude first. Latitude was easier to determine accurately (using the angle of the sun or stars above the horizon), so it became the primary coordinate for navigation. Longitude, which required precise timekeeping, was much harder to measure until the 18th century.

International standards, such as those from the International Organization for Standardization (ISO), specify that coordinates should be written as (latitude, longitude). This is also the default in many global positioning systems (GPS) and mapping APIs.

### Explore the map (you can easily zoom and pan, or click on a circle for the ID)

In [11]:
# Map the data
Circle.map_table(map_data, color='green', radius=3)

The map above uses OpenStreetMap as a background, but you can also open a browser tab with Google Maps and turn on the satellite imagery to see what the stream looks like. For example, below is a screen capture of the headwaters of Pennypack Creek captured from the map above, and roughly the same area from Google Maps. The first sample was collected just downstream of the pond labeled the source of the Pennypack in the aerial image below.

![map](data/side_by_side_headwaters.png)


#### Some questions to answer based on your map exploration include:
<span style="color: blue;">How many small tributaries do you see in the headwaters of the Pennypack Creek? Use the interactive map to zoom in on the headwaters.</span>

**Answer in this cell:**<br>
...


A riparian zone is the area of land directly adjacent to a body of flowing water, such as a river, stream, or creek. Streams are typically healthier if they have a large buffer of vegetation rather than flowing directly through developed areas. <br>

<span style="color: blue;"> Give a sample point ID for a section of the Pennypack with a well-vegetated riparian zone and contrast this with one where the stream runs through a developed area.</span>

**Answer in this cell:**<br>
...

More data is often desirable.

<span style="color: blue;">If you could add one additional sample point, where would you add it and why? (There is no single right answer.)</span>

**Answer in this cell:**<br>
...

## Part 2: Cleaning the Data
***Write code to answer these questions:***

<span style="color: blue;">
How many sample points are there? (hint: num_rows)<br>
</span>

**What are the data types?**<br>
This is tricky, so the code is provided.

In [None]:
for col in pp.labels:
    print(f"The data type of {col} is {type(pp.column(col)[0])}")

<span style="color: blue;">How does the code above work?</span><br> Answer in this cell:<br>
...

<span style="color: blue;">Are there any missing values?</span>

It is common for datasets to have missing values, so ALWAYS check. Unfortunately, datatables do not have any built-in methods for doing this. Pandas "dataframes" (something you will learn to use if you continue in Python data science) are much more powerful, but they are also more complex. The datascience module provides an easy way to convert a datatable to a dataframe and back again. ***Watch how it is done and remember this trick for later.***



In [None]:
# Convert a datatable to a dataframe
df = pp.to_df()
df

In [None]:
print(df.isnull().sum())
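
For intuition about what `df.isnull().sum()` reports, the same per-column count can be sketched by hand in plain Python. Everything here is hypothetical: `count_missing` is a helper invented for this example, and the toy values are not the Pennypack measurements.

```python
import math

def count_missing(columns):
    """Return {column name: number of missing entries} for a dict of lists."""
    def is_missing(value):
        # Treat None and floating-point NaN as missing
        return value is None or (isinstance(value, float) and math.isnan(value))
    return {name: sum(is_missing(v) for v in values)
            for name, values in columns.items()}

# Hypothetical toy data with two gaps
toy = {
    "site": ["A", "B", "C"],
    "Cl (mg/L)": [50.0, float("nan"), 80.0],
    "SO4 (mg/L)": [20.0, 21.0, None],
}
print(count_missing(toy))
```

A column with no gaps reports zero, which is the result you hope to see for every column of a clean dataset.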

<span style="color: blue;">Are there any missing values?</span>

Answer here  ....

## Part 3: Univariate Exploration
"Univariate" means one variable. We have three anions of interest: Cl, Nitrate-N, and SO4. In this section we will explore them one at a time. Later we will look at them in combination ("multivariate").

But first, let's provide some more background.

**Cl is chloride:** typically a measure of the salinity (NaCl) of the water. In urban streams, elevated salinity is mainly caused by road salt runoff and wastewater. The concentration is reported in mg/L (milligrams per liter). A liter of water has a mass of about 1,000 grams, and a milligram is one thousandth of a gram, so 1 mg/L is equivalent to 1 part per million (1 ppm). While not usually a health risk at the current concentrations, water that is too salty is bad for aquatic life and tastes lousy. **The EPA drinking water standard for chloride is 250 mg/L.**

**Nitrate-N:** When you see “Nitrate-N” (or “NO₃-N”) in water quality data, it means the concentration is reported as the mass of nitrogen (N) contained within the nitrate ion, rather than the total mass of the nitrate molecule. Reporting as Nitrate-N allows for direct comparison between different forms of nitrogen (like nitrate, nitrite, or ammonia), since all are expressed as the mass of nitrogen. This is important for regulatory standards and health guidelines, which are typically based on the nitrogen content, not the full molecule. **[Drinking water standards](https://www.epa.gov/sdwa/drinking-water-regulations-and-contaminants) (such as the EPA’s 10 mg/L limit for nitrate-N) are based on the nitrogen portion, making it easier to compare risks from different nitrogen compounds. High nitrate levels are a particular problem for infants, causing "Blue Baby Syndrome," because nitrate can interfere with the blood's ability to carry oxygen.**

**SO4:** is sulfate. It can occur naturally from rock weathering and decay of organic matter, or through human actions such as fertilizer runoff, acid rain, weathering of concrete, and municipal wastewater from sewage treatment plants. Apart from creating an unpleasant taste, really high levels of sulfate (>500 mg/L) can cause diarrhea and other digestive issues. **The EPA drinking water standard for sulfate is 250 mg/L.**
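
The mg/L to ppm reasoning in the chloride description can be written out as a short calculation. This is a sketch only: `mg_per_l_to_ppm` is a name made up for this example, and it assumes the water has a density of 1,000 g/L, which is a good approximation for dilute stream water.

```python
def mg_per_l_to_ppm(conc_mg_per_l, density_g_per_l=1000.0):
    """Convert milligrams of solute per liter to parts per million by mass."""
    solute_g = conc_mg_per_l / 1000.0   # milligrams -> grams
    mass_fraction = solute_g / density_g_per_l
    return mass_fraction * 1_000_000    # fraction -> parts per million

# The 250 mg/L standards quoted above, expressed in ppm
print(mg_per_l_to_ppm(250))
```

Because the assumed density is exactly 1,000 g/L, the number in ppm equals the number in mg/L. For much saltier or denser water the two would begin to differ.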

### The five number summary
We begin our univariate data exploration by creating a function that will give us a five number summary for any numeric column in a table.

In [None]:
def five_num_table(table, column):
    array = table.column(column)
    summary = Table().with_columns(
        "Statistic", ["Min", "Max", "Mean", "Median", "Std"],
        "Value", [
            np.min(array),
            np.max(array),
            np.mean(array),
            np.median(array),
            round(np.std(array), 2)
        ]
    )
    return summary
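
Note that `five_num_table` reports the mean and standard deviation alongside the min, max, and median. In many statistics texts, "five-number summary" instead means the minimum, first quartile, median, third quartile, and maximum. For comparison, here is a minimal sketch of that version using Python's standard library; the function name and the toy values are hypothetical.

```python
import statistics

def classic_five_number_summary(values):
    """Return min, Q1, median, Q3, and max for a list of numbers."""
    # method="inclusive" interpolates between data points within the sample
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "Q1": q1, "median": median,
            "Q3": q3, "max": max(values)}

# Hypothetical toy values, not the Pennypack data
toy = [2, 4, 4, 5, 7, 9, 10]
print(classic_five_number_summary(toy))
```

The quartiles describe the spread of the middle half of the data and are less sensitive to extreme values than the standard deviation.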

Let's apply the function to chloride.

### Chloride

In [None]:
five_num_table(pp, 'Cl (mg/L)')

The mean and median give us an idea of how the data are centered. The max and min give us the range of values. The standard deviation tells us how tightly the values are clustered around the mean.

To visualize the distribution, we make a histogram. Be sure to compare the numbers in the five_num_table with the histogram below.

In [None]:
pp.hist('Cl (mg/L)')

Do a little research.

<span style="color: blue;"> 
    
* What are typical chloride levels in pristine streams? 
* How does the Pennypack Creek compare?
* Do any of the chloride values exceed the drinking water standard?
  
</span>

**Answer in this cell:**<br>
...

<span style="color: blue;">Any outliers?</span><br>
While you are conducting your univariate analysis you should check for any outliers. An outlier could be real, or it could be the result of an error in measurement or in recording the data. Either way, outliers deserve attention. What counts as an outlier is somewhat subjective, but points that are more than two standard deviations away from the mean are worth a second look, and points that are three or more standard deviations from the mean are certainly outliers.
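
As a hint, "how many standard deviations from the mean" is simply the distance from the mean divided by the standard deviation. The sketch below uses hypothetical toy numbers and a made-up function name; you will still need to work out the calculation for chloride yourself.

```python
import statistics

def std_devs_from_mean(value, values):
    """Return how many standard deviations `value` lies from the mean of `values`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation, like np.std
    return (value - mean) / std

# Hypothetical toy values with one suspiciously large reading
toy = [10, 12, 11, 13, 12, 11, 30]
print(round(std_devs_from_mean(max(toy), toy), 2))
print(round(std_devs_from_mean(min(toy), toy), 2))
```

A positive result means the value is above the mean and a negative result means it is below. In this toy example the largest value is a bit more than two standard deviations above the mean, so it would deserve a second look.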

In the case of chloride, how many standard deviations away from the mean are the maximum and minimum concentrations? <span style="color: blue;">Create code to calculate this using the values from the five-number summary.</span>

<span style="color: blue;">Are there any chloride values you would label as outliers?</span> <br>
....

<span style="color: blue;">Now create both a five-number summary and a histogram for nitrogen and sulfate and comment on your results in each case. Add code and markdown cells as needed.</span>

### Nitrogen

### Sulfate

## Part 4: Multivariate Analysis
We've looked at one variable at a time. Now it is time to look at combinations. 

### Changes from headwaters to outflow?
The first, and most important, variable to add is distance along the stream. After all, the whole point of synoptic sampling is to snapshot the spatial variation of stream geochemistry at an instant in time. Up to this point, you've been looking at the distribution of the data independent of stream location, but one would expect changing values as the water flows from its headwaters to where it finally discharges into the Delaware river.

In [None]:
# Scatterplot of Cl vs. distance downstream
# (pp.scatter creates its own figure; width and height set its size)
pp.scatter('km downstream', 'Cl (mg/L)', width=12, height=5)
plt.title('Chloride Concentration With Distance from Headwaters')

**Interesting!** Clearly, the chloride concentration changes along the stream. The lowest value is in the headwaters, followed by a gradual increase and then a surge at PP1-9, at around 7.4 km. The concentration falls at around 13 km and stabilizes over the last 10 km.

Let's repeat the plot but add a vertical line where the wastewater treatment plant discharges into the creek, just downstream of point PP1-9 at about 7.5 km from the headwaters. 

In [None]:
# Scatterplot of Cl vs. distance downstream,
# with a vertical line marking the wastewater treatment plant discharge
pp.scatter('km downstream', 'Cl (mg/L)', width=12, height=5)
plt.axvline(7.5)
plt.title('Chloride Concentration With Distance from Headwaters')

Clearly the chloride concentration increases steadily from the source and seems to spike just upstream of the treatment plant. While this treatment plant does contribute chloride, the main source to streams in northern US cities is road salt runoff.

<span style="color: blue;">Look at the interactive map again to see if you can identify the probable chloride source.</span>

Answer here ...

<span style="color: blue;">Make a similar plot for both nitrogen and sulfate.</span>

Compare and contrast the three plots of anion concentrations along the stream.

<span style="color: blue;">
    
* Which anions do you think are most affected by the treatment plant?
* Does this make sense given the nature of treated wastewater? (Do some research.)
* What additional data might you collect to test your conclusions?

</span>

Essay answer
...

## Correlation Analysis
Let's examine how the different anions compare with each other. When the concentration of one increases, does the concentration of the other also increase (positive correlation)? Or does one decrease when the other increases (negative correlation)? Or maybe there is no relationship (no correlation).
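
One common way to put a number on these three cases is the Pearson correlation coefficient r, which ranges from -1 (perfect negative correlation) through 0 (no linear relationship) to +1 (perfect positive correlation). Below is a minimal sketch with hypothetical toy values; `pearson_r` is a helper written for this example, not part of the datascience module.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical toy values, not the Pennypack data
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # y rises with x
print(pearson_r(x, [10, 8, 6, 4, 2]))   # y falls as x rises
print(pearson_r(x, [3, 1, 4, 1, 3]))    # no linear trend
```

A single r value can hide structure such as clusters, so it should always be read alongside a scatterplot.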

In [None]:
# Scatterplot of Cl vs. N
# (pp.scatter creates its own figure; width and height set its size)
pp.scatter('Cl (mg/L)', 'NO3-N (mg/L)', width=5, height=5)

What do we see? While there is clearly a positive correlation, it is by no means perfect. Rather, there appear to be three clusters. **Could these clusters be related to the sample locations along the stream?**

How can we figure this out? Since the points are in order, we can split the data at any point and plot the upstream points in one color and the downstream points in another.

In [None]:
# Divide data into upstream and downstream of the wastewater treatment plant (WWTP).
upstream_wwtp = pp.where("km downstream", are.below(8))
downstream_wwtp = pp.where("km downstream", are.above_or_equal_to(8))

# We have to use matplotlib directly to put both on the same plot
# To do this we must extract the x and y arrays from the tables
xcol = 'Cl (mg/L)'
ycol = 'NO3-N (mg/L)'
plt.figure(figsize=(6, 6))
plt.scatter(upstream_wwtp.column(xcol), upstream_wwtp.column(ycol), label = 'upstream')
plt.scatter(downstream_wwtp.column(xcol), downstream_wwtp.column(ycol), label = 'downstream')
plt.xlabel(xcol)
plt.ylabel(ycol)
plt.title(f"{xcol} vs {ycol}")
plt.legend();

<span style="color: blue;">Make similar plots for the other two combinations:</span>
* Cl vs SO4
* N vs SO4

<span style="color: blue;">
    
* What inferences do you draw from these three correlation plots? 
* Do you think you could use the concentration of one anion to predict another? 
* How about if you fit the data separately upstream and downstream of the wastewater treatment plant?</span><br>
Answer here...

## Part 5: Hypothesis Testing: Nitrate and the WWTP

Going into this study, Dr. Toran hypothesized that the wastewater treatment plant would be a major source of nitrogen. As you saw in the video, sewage treatment includes the breakdown of compounds by bacteria, but some nitrogen remains in the effluent: 10 mg/L total nitrogen is a common standard for wastewater treatment plant discharge in the U.S. That is greater than the concentrations measured in Pennypack Creek, but we would expect the treatment plant discharge to be diluted by the creek waters.

**The statistical question is whether or not there is a real difference in the mean concentration of Nitrate-N above and below the treatment plant.**

<span style="color: blue;">State the null hypothesis and the alternative hypothesis.</span>

**Null Hypothesis**

...


**Alternative Hypothesis**

...



### Test Statistic -- The difference in mean concentrations
We start by calculating the actual difference in mean Nitrate-N concentrations above and below the wastewater treatment plant.

In [None]:
mean_N_upstream = ...
mean_N_downstream = ...
# Absolute difference between the upstream and downstream means
test_statistic_N = np.abs(mean_N_upstream - mean_N_downstream)
test_statistic_N

### One simulation
If the null hypothesis is true, it doesn't matter whether a sample is labeled upstream or downstream. So we extract the full array of Nitrate-N values, shuffle the order, label the first nine values "upstream" and the rest "downstream," and compute the difference in means.

In [None]:
# Extract all of the N values
# np.random.shuffle works in place, so take a copy to leave the table pp untouched
N = pp.column('NO3-N (mg/L)').copy()
N

In [None]:
# Shuffle the array
np.random.shuffle(N)
N

In [None]:
# Calculate the difference in the means
sim_statistic = np.abs(np.mean(N[:9]) - np.mean(N[9:]))
sim_statistic

### Full simulation
For the full simulation, we repeatedly shuffle, draw values, and compute "upstream" and "downstream" means to build up a distribution of the test statistic under the null hypothesis.

In [None]:
sim_statistics = []
num_simulations = 10_000
for i in np.arange(num_simulations):
    np.random.shuffle(N)
    sim_statistic = ...
    sim_statistics.append(sim_statistic)

In [None]:
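# Histogram of the simulated statistics with the observed value marked in red
plt.hist(sim_statistics)
plt.xlim((0,5))
plt.axvline(test_statistic_N, color='red', linestyle='--', linewidth=2, label='Test Statistic')
plt.title("Simulated Mean Differences Under the Null Hypothesis")
plt.legend()  # show the label for the red line
plt.show()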

<span style="color: blue;">Calculate the p-value as the fraction of the simulations where the simulated test statistic is greater than or equal to the actual test statistic. </span>

In [None]:
p = ...

<span style="color: blue;">What is your conclusion?</span>

Answer here ...

<span style="color: blue;"> Perform a similar hypothesis test for mean sulfate concentrations above and below the treatment plant.</span>

In [None]:
test_statistic_S = ...

## Changes Over Time
Everything you've looked at so far involved synoptic sampling -- samples collected all on the same day. Streams are dynamic, and while it isn't practical to sample dozens of locations hourly for days on end, it is possible to put out equipment that logs readings at regular intervals at a few locations. Dr. Toran deployed a nitrate logger just downstream of the wastewater treatment plant to look at how concentrations varied over time.

NOTE: Datascience datatables are not great for plotting time series data, so again we'll dip into pandas dataframes to read and plot these data. You can just observe; you are not expected to learn Pandas in this class.

In [None]:
import pandas as pd

df = pd.read_csv('./data/WWTP_N.csv', usecols=[0, 1], skiprows=1, parse_dates=True)
df = df.set_index("Date")
df.index = pd.to_datetime(df.index, format='%m/%d/%y %H:%M')
df.head()

Notice that the data is hourly.

In [None]:
df.plot(figsize=(15, 5));

There is clearly a daily cycle.

Let's zoom in on a day.

In [None]:
# Plot one daily cycle from 3 AM to 3 AM on Wednesday 2015-04-29
start = pd.to_datetime('2015-04-29 03:00:00')
end = pd.to_datetime('2015-04-30 03:00:00')
df.loc[start:end].plot()
plt.show()

<span style="color: blue;">How would you interpret this daily trend?</span>

Answer ...

The plot above was for a weekday (Wednesday). Let's compare with a Saturday.

In [None]:
# Plot one daily cycle from 3 AM to 3 AM on Saturday 2015-05-02
start = pd.to_datetime('2015-05-02 03:00:00')
end = pd.to_datetime('2015-05-03 03:00:00')
df.loc[start:end].plot()
plt.show()
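
The weekday labels used in the two plots can be double-checked with Python's standard `datetime` module. This is a small sketch; `day_name` is a helper written for this example.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_name(year, month, day):
    """Return the English weekday name for a calendar date."""
    # date.weekday() returns 0 for Monday through 6 for Sunday
    return DAY_NAMES[date(year, month, day).weekday()]

print(day_name(2015, 4, 29))  # Wednesday
print(day_name(2015, 5, 2))   # Saturday
```

Checking the day of the week matters here because household water use, and therefore wastewater flow, may follow a different schedule on weekends.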

<span style="color: blue;">Do you see any difference in the timing of the increase in nitrate? If so, why?</span>

Answer here ...

## Reflection
We've covered a lot of ground in this mini-project. Take a moment to reflect on what you have learned.

<span style="color: blue;">Write a one or two paragraph summary. Include an idea for future work.</span>

Summary
...

### <font color=blue> **Feedback** </font>

Please include a reflection. 
* How did this mini-project go?
* What was the most interesting or surprising thing you learned?
* Was it difficult to write code without a template?
* Did you seek help from any of the instructors or class assistants?
* Were there questions using techniques you found especially challenging you would like your instructor to review in class? 
* How long did the project take you to complete?

  
Share your feedback so we can continue to improve this class!