# DSTEP20 // Assignment #1

assigned : **Jan 7, 2020**

DUE : **Jan 14, 2020 11:59pm**

## When will New Castle Battery Park be underwater?

![alt text](https://lh5.googleusercontent.com/p/AF1QipNu7_-CRmvQTqI7m_V693kOu_IzeeDVYw2Do2UT=w408-h306-k-no)

---

**READING**

You won't be quizzed or tested on readings for this class, but they can be invaluable for learning through example how to think, write, and reason like a Data Scientist!

1. [CMU Metro 21 Project with Pittsburgh Bureau of Fire Report](http://michaelmadaio.com/Metro21_FireRisk_FinalReport.pdf) - especially the Executive Summary.

2. The [first](https://jakevdp.github.io/blog/2014/06/10/is-seattle-really-seeing-an-uptick-in-cycling/) and [second](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/) blog posts about bicycle usage in Seattle from Jake VanderPlas's *Pythonic Perambulations* blog. <br> **NOTE: the goal here is not to understand all of the Python code, but rather to get a sense of what data storytelling is and the kinds of language and inferential thinking used in data science.**


---

### OVERVIEW

Sea level has been on the rise for at least 100 years, and as the climate changes and the Earth warms, the rate of that rise has been an active area of study given the potential consequences if sea level is strongly affected.

Measurements of sea level mostly come in two flavors: satellites and tide gauges.  The satellite measurements are primarily accomplished by firing radio waves toward the ocean surface and timing how long those radio waves take to bounce back and return to the satellite.  Since we know how fast light travels (radio waves are a form of light), the distance from the satellite to the surface is just the speed of light multiplied by half the round-trip time.  [TOPEX/Poseidon](https://en.wikipedia.org/wiki/TOPEX/Poseidon) has been one of the most successful satellite missions for these altimetry measurements of the ocean surface.  [Tide gauges](https://en.wikipedia.org/wiki/Tide_gauge), on the other hand, are ground-based instruments that directly measure the height of water relative to a stationary device.  They are less accurate and provide significantly reduced spatial and temporal coverage, but prior to satellite altimetry they were the only real method for measuring sea level.
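As a rough illustration of the altimetry timing idea, the sketch below converts a round-trip echo time into a one-way distance. The function name and the example echo time are made up for illustration, and the calculation ignores atmospheric delays and the other corrections that real altimetry processing applies.

```python
# Speed of light in vacuum (m/s); atmospheric delays are ignored in this sketch.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_from_round_trip(round_trip_s: float) -> float:
    """One-way satellite-to-surface distance (m) from a round-trip echo time (s)."""
    # The pulse travels down and back, so only half the elapsed time counts.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# Hypothetical example: an echo that returns after about 8.9 milliseconds.
example_range_m = range_from_round_trip(8.9e-3)
print(f"{example_range_m / 1000:.0f} km")
```

The factor of one half is the important part: the measured time covers the trip to the surface and back.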

There are three main goals of this assignment:

1. accessing and working with a century of sea level data
2. fitting linear models to that data for estimates of the rate of sea level rise
3. comparing multiple models for prediction of future sea level

<u>**Instructions for tasks that will be graded are in bold below.**</u>

### PART 1 - Background

Good data science (and data analysis more generally) depends on a clear understanding of the underlying problem/situation, the methods by which the data you are about to analyze are collected, and the situational context in which that data sits.  To that end:

<b>
Provide a brief (no more than 500 words) descriptive overview of sea level rise, including its historical significance, context within a changing climate, and projections for the future.  Potential topics to consider and address include:

1. description of historical sea level measurements
  
2. what open satellite data exist and where they can be located

3. characteristic numbers for sea level measurements over time

4. why sea level might change as the climate changes

5. variation in sea level rise across the globe

6. what projections exist for the future of sea level rise

7. how coastal communities are mitigating the consequences of sea level rise

8. potential economic costs of sea level rise

9. potential social costs of sea level rise

Please include *references* within the description via weblinks (like the TOPEX/Poseidon link in the Overview in the cell above).
</b>

TEXT FOR ANSWER HERE.

### PART 2 - Loading and plotting the data

The NOAA data covers roughly 30 years of sea level changes, but there are data that go back further that are available from CSIRO (Commonwealth Scientific and Industrial Research Organisation).  Descriptions of aggregated historical data from CSIRO can be found [here](https://research.csiro.au/slrwavescoast/sea-level/measurements-and-data/) and in the associated links.  The data we'll be using is available as a [CSV](https://datahub.io/core/sea-level-rise/r/epa-sea-level.csv) -- but please see the documentation and caveats associated with it in the README at the bottom of [this](https://datahub.io/core/sea-level-rise) page.

<b>Read in the CSIRO data from the link above labeled CSV.</b>

<b>Take the <u>minimum</u> across the "CSIRO Adjusted Sea Level" and "NOAA Adjusted Sea Level" columns and add it to the DataFrame as a column called "min_level".</b>

<b>The CSIRO sea level data is in inches.  Convert the min_level column to millimeters.</b>
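A minimal sketch of the row-wise minimum and the unit conversion, using a tiny hand-made DataFrame rather than the real file. The column names match those named above, but the values and the variable name `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

MM_PER_INCH = 25.4

# Toy rows standing in for the real file; the values are invented.
toy = pd.DataFrame({
    "CSIRO Adjusted Sea Level": [0.0, 4.0, 8.5],
    "NOAA Adjusted Sea Level": [np.nan, np.nan, 8.0],
})

# Row-wise minimum. pandas skips NaN by default, so rows that have
# no NOAA value fall back to the CSIRO value.
cols = ["CSIRO Adjusted Sea Level", "NOAA Adjusted Sea Level"]
toy["min_level"] = toy[cols].min(axis=1)

# Convert inches to millimeters.
toy["min_level"] = toy["min_level"] * MM_PER_INCH
print(toy)
```

Because missing values are skipped, it is worth checking how many rows actually have both measurements before interpreting the combined column.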

Notice that the CSIRO time data is actually a string and includes months and dates.  **Use the cell below to create a column called "year_int" that is the CSIRO year as an integer so that we don't have to worry about the months and dates from now on.**

csiro["year_int"] = [int(i[:4]) for i in csiro["Year"]]

**Make a plot of the sea-level as a function of time.**

**Describe what information is conveyed by this plot.** <small> i.e., write a caption for the plot. </small>

TEXT FOR YOUR ANSWER HERE

### PART 3 - Fitting a linear model

As we did in class, your goal here is to estimate the rate of sea level rise.  However, we now have a much longer <i>temporal baseline</i>.

<b>Using the statsmodels api, fit a linear model to the data:</b>

${\rm sea~level} = a_1 \times {\rm time} + a_0$

<b>What is the rate of sea level rise that you find with this linear model fit to the CSIRO data?</b>

**Plot the CSIRO data with your best fit linear model overlaid.**

**Describe what information is conveyed by this plot.**

**Would you consider this model a "good" fit to the data?  Why or why not?**

TEXT FOR ANSWER HERE.

### PART 4 - Comparing Multiple Model Predictions

Let's expand the model that we're using to fit the data by adding a quadratic term.  In this part, you will compare the two model fits and use each to predict when New Castle Battery Park will be under water due to rising sea level.

**Using the statsmodels api, fit a model to the CSIRO data that includes both a linear and quadratic dependence on time.**

**Plot the CSIRO data with both the linear model and quadratic model overlaid.**

**Describe what information is conveyed by this plot.**

**Use a likelihood ratio test to determine whether the quadratic model is a significantly better fit to the data than the linear model.**

**Use Google Earth to determine the elevation of New Castle Battery Park.  According to each model, in what year will it be under water due to rising sea level?**
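Once a model has been fit, finding the crossing time amounts to solving the fitted polynomial for a target height. The sketch below does this for hypothetical coefficients; the function name, coefficients, and target height are all invented, and it assumes the elevation and the modeled sea level are expressed in the same units relative to the same reference.

```python
import math


def crossing_time(a0: float, a1: float, a2: float, height: float):
    """Smallest t >= 0 with a2*t**2 + a1*t + a0 == height, or None if there is none."""
    c = a0 - height
    if a2 == 0.0:
        # Purely linear model.
        if a1 == 0.0:
            return None
        t = -c / a1
        return t if t >= 0.0 else None
    disc = a1 * a1 - 4.0 * a2 * c
    if disc < 0.0:
        return None  # the curve never reaches the target height
    root = math.sqrt(disc)
    candidates = [(-a1 - root) / (2.0 * a2), (-a1 + root) / (2.0 * a2)]
    future = [t for t in candidates if t >= 0.0]
    return min(future) if future else None


# Hypothetical coefficients (mm, mm/yr, mm/yr^2), with t in years
# since a chosen reference year, and a hypothetical 1000 mm target.
t_lin = crossing_time(0.0, 2.0, 0.0, 1000.0)
t_quad = crossing_time(0.0, 2.0, 0.01, 1000.0)
print(t_lin, t_quad)
```

Adding the returned time to the reference year gives a calendar year. Even a small positive quadratic coefficient can move the crossing much earlier than the linear extrapolation.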

**Summarize your findings throughout this notebook.  What are the key takeaways from your analysis?  What are some of the shortcomings, potential biases, inaccuracies, assumptions, or approximations that you have made throughout? (No more than 500 words)**

TEXT FOR YOUR ANSWER HERE.

---

**EXTRA CREDIT - Local Effects**

Notice this image:

![alt text](https://md.water.usgs.gov/gage_images/01482170.JPG)

That is a picture of a tide gauge in New Castle Battery Park.  So far we've been using global sea level as a data set for predicting when Battery Park will be underwater, but the rate of sea level rise is known to vary with location.  The USGS provides data going back to 2012.  Using the New Castle tide gauge data available from USGS, how would your answers about when New Castle Battery Park will be underwater change?$^{\dagger}$

<small><i>$^{\dagger}$ Note, this "Extra Credit" section is significantly more tricky than the previous sections! </i></small>

---