# mini-project II: Mount Saint Helens' biodiversity after the ashes
Elements of Data Science

In [None]:
# Enter your name as a string
name = ...

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')

### Mount Saint Helens Eruption, 8:32 A.M. on May 18, 1980
We will explore data on ecosystem recovery following the volcanic eruption at Mount Saint Helens in Washington State.<br>
<img src='data/Eruption.jpg'><br><img src='data/800px-1980_St._Helens_ashmap.png'>

### <font color='green'>Data Sets
<font color='red'>**Mount Saint Helens erupted at 8:32 A.M. on May 18, 1980.** </font>
<br>
Professor Roger del Moral at the University of Washington (https://faculty.washington.edu/moral/) and his team set up circular land plots with a 9 meter radius near the volcano once it was safe to begin the study in 1984. The plots lie in several distinct regions around the volcanic cone so that the return of vegetation and biodiversity can be compared across positions relative to the cone (see the map below). We will use this data to assess the rate of plant succession. Measures in the data include yearly species richness, *RICHNESS*, defined as the number of species in a given region, in this case the 9 meter radius (about 250 m^2) land plot. We will use our data science tools to decide whether the changes over time in *RICHNESS* form a pattern (alternative hypothesis) or are just random fluctuations (null hypothesis). We will also study *COVER_%*, a measure of plant coverage, and test whether growth has occurred since the eruption. Plots are studied in 13 unique locations with different characteristics, including the type of blast they experienced. In designing their study, the researchers also collected data in up to 10 adjacent plots with the same name in order to assess statistical variation across an area.
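As a quick check on the plot size quoted above, the area of a circle with a 9 meter radius can be computed directly. The result, about 254 m^2, is close to the rounded 250 m^2 figure.

```python
import math

radius_m = 9  # plot radius in meters, from the study description
area_m2 = math.pi * radius_m ** 2  # area of a circle: pi * r^2
print(f"Plot area: {area_m2:.1f} m^2")
```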

**Our Overall Study Plan**
1. Select two distinct plot names with contrasting characteristics
2. Look at plant succession 15 years after the blast relative to the first year of data available for a given plot. Here we use a paired t-test.
3. Examine the time trend for plant succession using the `changes` function developed in Lab 07.

**Data collected:**
1. del Moral, Roger (2016): Thirty years of permanent vegetation plots, Mount St. Helens, Washington, USA. Wiley. Collection. https://doi.org/10.6084/m9.figshare.c.3303093.v1 
Source: https://figshare.com/collections/Thirty_years_of_permanent_vegetation_plots_Mount_St_Helens_Washington_USA/3303093

**Papers using this data:**
1. Del Moral, R.; Magnússon, B., "Surtsey and Mount St. Helens: a comparison of early succession rates". Biogeosciences 2014, 11 (7), 2099-2111.
https://faculty.washington.edu/moral/publications/2014%20delMoral%20Magnusson.pdf

2.  Cook, James E.; Halpern, Charles B., "Vegetation changes in blown-down and scorched forests 10–26 years after the eruption of Mount St. Helens, Washington, USA". Plant Ecology 2018, 219 (8), 957-972.
https://link.springer.com/content/pdf/10.1007/s11258-018-0849-8.pdf



### <font color='green'>Video background

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("UK--hvgP2uY")
# https://youtu.be/UK--hvgP2uY

[Google Earth 3D view](https://earth.google.com/web/search/Mount+Saint+Helens,+Washington/@46.19819667,-122.19529432,2095.30665214a,10392.19844707d,35y,224.30506563h,60t,0r/data=CigiJgokCcIkzMXzkzhAEcIkzMXzkzjAGcJo4PGMbkNAIYI6cQ6KIlHA)

### <font color='green'> ***Question 1***</font>
We will first explore the different study plots where data was collected. They are located in different environments, elevations, and locations relative to the volcano cone. <br>Create initial data tables and plots to explore the nature of the different plots included in "data/MSH_PLOT_DESCRIPTORS2.csv". Variables to consider include elevation, slope, aspect (direction), impact type. You could use group or pivot methods here.

In [None]:
# Plot description dataset
datafile = "data/MSH_PLOT_DESCRIPTORS2.csv"
MSH_PLOT = Table.read_table(datafile)
MSH_PLOT

#### Plot Impact Type

In [None]:
np.unique(MSH_PLOT['IMPACT_TYPE'])
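To see how many plots fall under each impact type, `np.unique` can also return counts. The sketch below uses a small made-up array of labels (the label strings are placeholders, not values from the dataset); with the real table you would pass `MSH_PLOT['IMPACT_TYPE']` instead.

```python
import numpy as np

# Placeholder labels for illustration only; not taken from the dataset.
impact = np.array(['type_A', 'type_B', 'type_A', 'type_C', 'type_A', 'type_B'])

# return_counts=True gives the number of occurrences of each unique label.
labels, counts = np.unique(impact, return_counts=True)
for label, count in zip(labels, counts):
    print(f'{label}: {count} plots')
```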

#### Mapping Biodiversity Data Collected following Mount Saint Helens Eruption
Data on developing biodiversity is collected annually on defined plots of land with given latitude and longitude. These locations can be mapped using the *map_table* method of a table object. Plots are studied in 13 unique locations relative to the volcanic cone with different characteristics, including the blast type they experienced. Zoom in on the map and click on each circle to view the label for the data series. There are replicate plots within each plot name in order to assess statistical variation. The `.map_table()` method uses [Folium](https://realpython.com/python-folium-web-maps-from-data/), a powerful mapping and GeoJSON utility.

In [None]:
MSH_PLOT = MSH_PLOT.with_columns('NLONG',-1*(MSH_PLOT.column('LONG')))
MSH_map = MSH_PLOT.select('LAT', 'NLONG', 'PLOT_CODE').relabel('PLOT_CODE', 'labels')
MSH_coordinates = (46.191387, -122.195618)
OurMap = Map(location=MSH_coordinates, zoom_start=10, width=500, height=500,)
#Circle.map_table(MSH_map, color='blue',radius_scale=9,radius_in_meters=True)
OurMap.overlay(MSH_map, color='blue', opacity=0.01)

### <font color='green'> ***Question 2***</font>
Identify two plots with unique `PLOT_NAME`'s to study based on mapped location and characteristics given in "data/MSH_PLOT_DESCRIPTORS2.csv" file. Use a detailed markdown cell to provide reasons for your two choices of plots. Include differences and similarities.

In [None]:
# Plot vegetation trend yearly dataset
datafile = "data/MSH_STRUCTURE_PLOT_YEAR.csv"
MSH_YEAR = Table.read_table(datafile)
MSH_YEAR

**Unique plot names**

In [None]:
np.unique(MSH_YEAR.column('PLOT_NAME')) #Return unique plot names

In [None]:
np.unique(MSH_YEAR.column('YEAR')) #Return unique plot years

**Select Plots**
Example exploratory data analysis below. Select from unique plot names above with complete justification for choice.

In [None]:
myplot1 = '...'

In [None]:
# Select a particular plot name based on examination of mapped data and descriptions in the plot description dataset.
PLT = myplot1 # Put the name for study here, e.g. 'STRD'
data = MSH_YEAR.where('PLOT_NAME',are.contained_in(PLT)).sort('YEAR',descending=False)
data

In [None]:
myplot2 = '...'

In [None]:
# Select a particular plot name based on examination of mapped data and descriptions in the plot description dataset.
PLT = myplot2 # Put the name for study here, e.g. 'STRD'
data = MSH_YEAR.where('PLOT_NAME',are.contained_in(PLT)).sort('YEAR',descending=False)
data

In [None]:
data.scatter('YEAR','RICHNESS')

Group and average to get better view of time trend.

In [None]:
data.group('YEAR', np.mean).plot('YEAR','RICHNESS mean')

In [None]:
data.stats()

In [None]:
def five_num_sum(table,column):
    nums=[]
    array = table.column(column)
    nums.append(np.min(array))
    nums.append(np.max(array))
    nums.append(np.mean(array))
    nums.append(np.median(array))
    nums.append(np.std(array))
    print(f'min: {nums[0]} \nmax: {nums[1]} \nmean: {nums[2]:.3f} \nmedian: {nums[3]:.3f} \nstd: {nums[4]:.3f}')
    return nums

In [None]:
five_num_sum(data,'RICHNESS')

### <font color='green'> ***Question 3***</font>
We want to understand how particular plot types evolve following the eruption. We can also look at the degree to which certain plot types recover differently based on their location and the type of transformation that occurred following the eruption. In Question 4 we look within each plot for these changes.
- Formulate a hypothesis regarding plant vegetation (*COVER_%*) and biodiversity (*RICHNESS*) following the eruption. You can refer to the above links and papers for ideas. Create a detailed markdown cell to detail this hypothesis.
- State the NULL hypothesis for each measure.
<br>Use below markdown cells

<font color='blue'>***COVER_% Hypothesis***

<font color='blue'>***RICHNESS Hypothesis***

<font color='blue'>***COVER_% NULL Hypothesis***

<font color='blue'>***RICHNESS NULL Hypothesis***

### <font color='green'> ***Question 4***</font>
Consider the change in COVER_% between the first year of your data and 15 years after the 1980 volcanic eruption. We will use the multiple data points at each year for each plot to perform a difference of means. Use the paired t-test as in Lab 07 to test your hypothesis regarding COVER_%.
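For orientation, the textbook form of a paired t-test works on the per-plot differences between the two years. The sketch below uses made-up COVER_% values and assumes the two arrays are matched row by row (same plot in the same position); the variable names are illustrative, and the exact recipe from Lab 07 may differ in detail, for example in whether the sample standard deviation uses `ddof=1`.

```python
import numpy as np

# Made-up COVER_% values for the same five plots in two years (illustration only).
cover_first = np.array([2.0, 3.5, 1.0, 4.0, 2.5])
cover_later = np.array([10.0, 12.5, 8.0, 15.0, 9.5])

diffs = cover_later - cover_first        # one difference per plot
n_pairs = len(diffs)
mean_d = np.mean(diffs)
sd_d = np.std(diffs, ddof=1)             # sample standard deviation of the differences
se_d = sd_d / np.sqrt(n_pairs)           # standard error of the mean difference
t_paired = mean_d / se_d
dof_paired = n_pairs - 1                 # paired test: number of pairs minus one

print(f'mean difference = {mean_d:.2f}, t = {t_paired:.2f}, dof = {dof_paired}')
```

If the two years do not contain the same plots in the same order, the subtraction above is not meaningful and the rows need to be matched first.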

<font color='blue'>Reminder: </font>The `data` Table variable is defined above for myplot1 and then reassigned for myplot2. Keep track of which plot you are examining, and make this clear in the markdown and code for each case. <font color='red'>Each plot covers a different time range, so you may need to adjust the time range in your hypothesis for specific plots.

In [None]:
data['YEAR'].min() # Using example plot

In [None]:
data['YEAR']

In [None]:
YEAR1 = 1984
YEAR2 = 1995

In [None]:
np.mean(data.where('YEAR',YEAR2)['COVER_%'])

In [None]:
np.mean(data.where('YEAR',YEAR1)['COVER_%'])

In [None]:
diff_means = np.mean(data.where('YEAR',YEAR2)['COVER_%'])-np.mean(data.where('YEAR',YEAR1)['COVER_%'])
diff_means

In [None]:
# COVER_% observations for each of the two years being compared
first = data.where('YEAR',YEAR1)['COVER_%']
later = data.where('YEAR',YEAR2)['COVER_%']

s1 = np.std(first)
s2 = np.std(later)
n1 = len(first)  # number of observations in YEAR1
n2 = len(later)  # number of observations in YEAR2
dof = n1 + n2 - 2

mean_diff = np.mean(later)-np.mean(first)

se1 = s1/np.sqrt(n1)
se2 = s2/np.sqrt(n2)
std_error =  np.sqrt((se1**2+se2**2)/2)

print(f'The mean COVER_% change is: {mean_diff:.2f}')
print(f'The standard deviations of COVER_% are: {s1:.3f} ({YEAR1}) and {s2:.3f} ({YEAR2})')
print(f'The standard error is: {std_error:.4f}')
print(f'The degrees of freedom is: {dof}')

In [None]:
t = ...
print("The t value is:", t)

<font color='blue'>***Find the p-value, accept or reject null hypothesis?***</font>
You can use either the t-test approach or the simulation approach, similar to what you did in Lab 07. Be sure to document your choice and all the steps in new cells below. The p-value is the probability of seeing a COVER_% increase at least as large as the observed one if the null hypothesis were true, that is, if the changes were only random fluctuations. So do we reject or fail to reject the null hypothesis? Explain your reasoning in a cell below the p-value check.
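One way to carry out a simulation approach is a permutation test: shuffle the year labels many times and see how often a shuffled difference in means is at least as large as the observed one. This is a minimal sketch on made-up values, and the shuffling scheme is an assumption; the simulation used in Lab 07 may be set up differently.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up COVER_% values for two years (illustration only).
first = np.array([2.0, 3.5, 1.0, 4.0, 2.5])
later = np.array([10.0, 12.5, 8.0, 15.0, 9.5])
observed = np.mean(later) - np.mean(first)

pooled = np.concatenate([first, later])
n_sims = 1000
sim_diffs = np.empty(n_sims)
for i in range(n_sims):
    shuffled = rng.permutation(pooled)   # break any link between value and year
    sim_diffs[i] = np.mean(shuffled[len(first):]) - np.mean(shuffled[:len(first)])

# One-sided p-value: share of shuffles with an increase at least as large as observed.
p_value = np.count_nonzero(sim_diffs >= observed) / n_sims
print(f'observed difference = {observed:.2f}, simulated p-value = {p_value:.3f}')
```

A small simulated p-value means that random relabeling rarely produces an increase as large as the one observed.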

<font color='blue'>**Now repeat the process to consider the change in RICHNESS**</font> between the first year of your data and 15 years after the 1980 volcanic eruption. We will use the multiple data points at each year for each plot to perform a difference of means. Use the paired t-test as in Lab 07 to test your hypothesis regarding RICHNESS.

### <font color='green'> ***Question 5***</font>
Now we will look at the degree to which certain plot types recover differently based on their location and the type of transformation that occurred following the eruption. We will look at COVER_% for our two selected plots. In this case we will use the unpaired t-test to test a hypothesis regarding COVER_%. <font color='red'>Each plot covers a different time range, so you may need to adjust the time range in your hypothesis for specific plots.

<font color='blue'>***COVER_% Hypothesis across my two selected plots***

<font color='blue'>***COVER_% Null Hypothesis across my two selected plots***

### unpaired t-test
The unpaired t-test has different assumptions from the paired test used above and in Lab 07. The main difference is that we are now comparing two different groups of plots, whereas the Question 4 test was applied to the same plots undergoing a 'treatment' of time passing after the eruption (paired).

* **IF** two samples are independent
* and **IF** the samples are random
* and **IF** both samples come from populations with a normal distribution
* and **IF** both populations have approximately the same standard deviation
* and **IF** dependent variable is measured on an incremental level, such as ratios or intervals. 
* **THEN** we can calculate the following t-statistic

 $$ 
 t = \frac{\bar{x_1} - \bar{x_2}}{SE} \tag{1} 
 $$
  
which is the difference between the means of the two samples divided by the average standard error of the mean, or standard error for short.

$$ 
SE^2 = \frac{(SE)_1^2 + (SE)_2^2}{2} \tag{2}
$$

What is the "standard error"? It is the sample standard deviation divided by the square root of the number of observations:

$$ SE = \frac{s}{\sqrt{n}}  \tag{3} $$

Here we need to compute the standard error (eq. 3) for each sample, SE, with a given number of observations, n. Then the overall standard error is computed by combining the two standard errors in equation 2.
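Equations 1 to 3 can be sketched in a few lines of NumPy. The arrays below are made-up values for illustration, the variable names are hypothetical, and the code follows the equations as written in this notebook, using `np.std` with its default settings as in the cells elsewhere in the notebook.

```python
import numpy as np

# Made-up measurements for two independent groups of plots (illustration only).
sample1 = np.array([0.20, 0.35, 0.10, 0.40, 0.25])
sample2 = np.array([0.60, 0.55, 0.70, 0.45, 0.65, 0.50])

se1 = np.std(sample1) / np.sqrt(len(sample1))   # eq. 3 for sample 1
se2 = np.std(sample2) / np.sqrt(len(sample2))   # eq. 3 for sample 2
se_combined = np.sqrt((se1**2 + se2**2) / 2)    # eq. 2
t_stat = (np.mean(sample1) - np.mean(sample2)) / se_combined  # eq. 1

print(f'SE = {se_combined:.4f}, t = {t_stat:.2f}')
```

The sign of `t_stat` only reflects which sample is listed first; its magnitude is what gets compared with a critical value.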

The degrees of freedom will be $\nu = n_1 + n_2 - 2$


<br>**<center>Critical Values of <i>t**
    <center>See: [NIST](https://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm)

|$\nu$<br>degrees of freedom|95%<br>p = 0.05|99%<br>p = 0.01|
|:-:|:--|:--|
|2|4.303|9.92|
|3|3.18|5.84|
|4|2.78|4.60|
|5|2.57|4.03|
|6|2.45|3.71|
|7|2.36|3.50|
|8|2.31|3.36|
|9|2.26|3.25|
|10|2.23|3.17|
|15|2.13|2.95|
|20|2.09|2.85|
|30|2.04|2.75|
|$\infty$|1.96|2.58|
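The table can be used in code as a simple lookup. The sketch below copies the p = 0.05 column into a dictionary; the helper names `critical_t` and `is_significant` are hypothetical. For a degrees-of-freedom value that is not listed, it falls back to the largest listed value below it, which gives a slightly larger (more cautious) critical value.

```python
# Critical values at p = 0.05, copied from the table above.
T_CRIT_05 = {2: 4.303, 3: 3.18, 4: 2.78, 5: 2.57, 6: 2.45, 7: 2.36,
             8: 2.31, 9: 2.26, 10: 2.23, 15: 2.13, 20: 2.09, 30: 2.04}

def critical_t(dof):
    """Return the tabulated critical value for the largest listed dof <= dof."""
    if dof < 2:
        raise ValueError('table starts at 2 degrees of freedom')
    usable = [d for d in T_CRIT_05 if d <= dof]
    return T_CRIT_05[max(usable)]

def is_significant(t, dof):
    """True if |t| exceeds the p = 0.05 critical value for this dof."""
    return abs(t) > critical_t(dof)

print(critical_t(12), is_significant(3.0, 8))
```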




First we need a measure to compare between plots that meets the above criterion of the unpaired t-test: <i>'dependent variable is measured on an incremental level, such as ratios or intervals.'</i>

In [None]:
YEAR1 = 1984
YEAR2 = 1995

In [None]:
myplot1 = '...'

In [None]:
# Select a particular plot name based on examination of mapped data and descriptions in the plot description dataset.
PLT = myplot1 # Put the name for study here, e.g. 'STRD'
data1 = MSH_YEAR.where('PLOT_NAME',are.contained_in(PLT)).sort('YEAR',descending=False)
data1

In [None]:
# Ratio of first-year cover to later-year cover; values below 1 mean cover increased
growth1 = data1.where('YEAR',YEAR1)['COVER_%']/data1.where('YEAR',YEAR2)['COVER_%']
growth1

In [None]:
s1 = np.std(growth1)

In [None]:
myplot2 = '...'

In [None]:
# Select a particular plot name based on examination of mapped data and descriptions in the plot description dataset.
PLT = myplot2 # Put the name for study here, e.g. 'STRD'
data2 = MSH_YEAR.where('PLOT_NAME',are.contained_in(PLT)).sort('YEAR',descending=False)
data2

In [None]:
growth2 = ...
growth2

In [None]:
s2 = np.std(growth2)

In [None]:
diffmean = np.mean(...) - np.mean(...)

In [None]:
s = ...
n = ...
SE = ...

In [None]:
t = ...
print("The t value is:", t)

In [None]:
p = ...

<font color='blue'>***Outcome of Hypothesis Test and conclusion about selected plots...***

## Part 2: Testing a trend

### <font color='green'> ***Question 6***</font>
Now we will look at the time trend of COVER_% and RICHNESS using the `changes` function you developed and used in Part 2 of Lab 07. With `changes` we are looking at the number of increases minus decreases over the time period.
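To make the statistic concrete, here is a small illustration on made-up yearly values. The helper name `increases_minus_decreases` is hypothetical and only compares consecutive values; it is not the Lab 07 `changes` function, which also takes a `years` offset.

```python
import numpy as np

def increases_minus_decreases(values):
    """Count year-to-year increases minus decreases (hypothetical helper)."""
    steps = np.diff(values)  # consecutive differences
    return np.count_nonzero(steps > 0) - np.count_nonzero(steps < 0)

# Made-up yearly RICHNESS means (illustration only).
richness = np.array([3, 5, 4, 6, 8, 8, 9])

# Steps are +2, -1, +2, +2, 0, +1: four increases, one decrease, one tie.
print(increases_minus_decreases(richness))  # prints 3
```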

`changes` function:

In [None]:
def diff_n(values, n):
    '''
    Parameters:
    values is an array of numbers
    n is the offset (how far apart the numbers are in the array)
    '''
    return np.array(values)[n:] - np.array(values)[:-n]

In [None]:
def changes(array, years = 1):
    "Return the number of increases minus the number of decreases"
    ...

In [None]:
test_stat = ...
print('Total increases minus total decreases, across all years:', test_stat)

### <font color='green'> ***Question 7***</font>
Carry out 1000 simulations and statistically test whether the data support the alternative hypothesis.

1. Compute a P-value. (Hint: you can use `np.count_nonzero()`.)
2. Using a 5% P-value cutoff, draw a conclusion about the null and alternative hypotheses.
3. Describe your findings using simple, non-technical language.
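One possible shape for such a simulation is sketched below. It assumes a null model in which each year-to-year step is equally likely to be an increase or a decrease (a coin flip); that null model, the number of steps, and the observed statistic are all made up for illustration and may not match the setup from Lab 07.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

n_steps = 10        # number of year-to-year comparisons (made up)
observed_stat = 6   # observed increases minus decreases (made up)

n_sims = 1000
sim_stats = np.empty(n_sims)
for i in range(n_sims):
    # Under the assumed null, each step is +1 or -1 with equal chance.
    steps = rng.choice([1, -1], size=n_steps)
    sim_stats[i] = steps.sum()

# Share of simulations at least as extreme as the observed statistic (one-sided).
p_value = np.count_nonzero(sim_stats >= observed_stat) / n_sims
print(f'simulated p-value = {p_value:.3f}')
```

The p-value is then compared with the 5% cutoff: below it, the null hypothesis is rejected; otherwise the data do not give enough evidence to reject it.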

### <font color='green'> ***Question 8: Conclusions***</font>
Summarize your conclusions from your study of two plots. Contrast the features of the two plots and how they might lead to different or similar conclusions in the magnitude and significance of the studied quantities, COVER_% or RICHNESS. Use a markdown table to summarize part of your conclusions. A markdown table uses `|` to divide columns and a `|---|` line to separate the header row from the rows below it. Below you will find an example of a started table; replace these entries with your results. There should be one row for each of the two tests.

|Plot Name|Hypothesis|p-value|conclusion|
|---|---|---|---|
|myplot1| COVER_% increases after eruption| p-value = 0.001, simulation, p-value = 0.000|reject Null hypothesis|

### <font color='green'> ***Question 9***</font>
- What techniques did you use from Lab 07?
- What part was the most challenging?
- How long did you spend on the lab?
- What did you learn from this 2nd mini-project?

In [None]:
# Last cell to execute
import datetime
now = datetime.datetime.now()
now = now.strftime('%H:%M:%S on %A, %B %d, %Y')
print(" Submitted by ", name, user, " at ", now )