In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("project2_sp24.ipynb")

# Project 2: Climate Change—Temperatures and Precipitation

In this project, you will investigate data on climate change, or the long-term shifts in temperatures and weather patterns!

### Logistics

**Rules.** Don't share your code with anybody but your partner, if you have one. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.

**Support.** You are not alone! Come to office hours and talk to your classmates. If you're ever feeling overwhelmed or don't know how to make progress, email for help.

**Tests.** The tests that are given are **not comprehensive** and passing the tests for a question **does not** mean that you answered the question correctly. Tests usually only check that your table has the correct column labels. However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work! You might want to create your own checks along the way to see if your answers make sense. Additionally, before you submit, make sure that none of your cells take a very long time to run (several minutes).

**Free Response Questions:** Make sure that you put the answers to the written questions in the indicated cell we provide. **Every free response question should include an explanation** that adequately answers the question.

**Advice.** Develop your answers incrementally. To perform a complicated task, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along that line, **DO NOT** reuse the variable names that we use when we grade your answers. 

You **never** have to use just one line in this project or any others. Use intermediate variables and multiple lines as much as you would like!

All of the concepts necessary for this project are found in the textbook. If you are stuck on a particular problem, reading through the relevant textbook section often will help clarify the concept.


---

To get started, load `datascience`, `numpy`, `matplotlib`, and `d8error`. Make sure to also run the first cell of this notebook to load `otter`.

In [None]:
# Run this cell to set up the notebook, but please don't change it.
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')

import warnings
warnings.simplefilter('ignore')
import d8error

## Part 1: Temperatures

In the following analysis, we will investigate one of the 21st century's most prominent issues: climate change. While the details of climate science are beyond the scope of this course, we can start to learn about climate change just by analyzing public records of different cities' temperature and precipitation over time.

We will analyze a collection of historical daily temperature and precipitation measurements from weather stations in 209 U.S. cities. The dataset was compiled by Yuchuan Lai and David Dzombak [1]; a description of the data from the original authors and the data itself are available [here](https://kilthub.cmu.edu/articles/dataset/Compiled_daily_temperature_and_precipitation_data_for_the_U_S_cities/7890488).

[1] Lai, Yuchuan; Dzombak, David (2019): Compiled historical daily temperature and precipitation data for selected 209 U.S. cities. Carnegie Mellon University. Dataset.

### Part 1, Section 1: Cities

Run the following cell to load information about the `cities` and preview the first few rows.

In [None]:
cities = Table.read_table('city.csv', index_col=0)
cities.show(5)

The `cities` table has one row per weather station and the following columns:

1. `"Name"`: The name of the US city
2. `"ID"`: The unique identifier for the US city
3. `"Latitude"`: The latitude of the US city (measured in degrees of latitude)
4. `"Longitude"`: The longitude of the US city (measured in degrees of longitude)
5. `"Stn.Name"`: The name of the weather station in which the data was collected
6. `"Stn.stDate"`: A string representing the date of the first recording at that particular station
7. `"Stn.edDate"`: A string representing the date of the last recording at that particular station

The data lists the weather stations at which temperature and precipitation data were collected. Note that although some cities have multiple weather stations, only one is collecting data for that city at any given point in time. Thus, we are able to just focus on the cities themselves.

<!-- BEGIN QUESTION -->

**Question 1.1.1:** In the cell below, produce a scatter plot that plots the latitude and longitude of every city in the `cities` table so that the result places northern cities at the top and western cities at the left.

*Note*: It's okay to plot the same point multiple times!


In [None]:
...

<!-- END QUESTION -->

These cities are all within the continental U.S., and so the general shape of the U.S. should be visible in your plot. The shape will appear distorted compared to most maps for two reasons: the scatter plot is square even though the U.S. is wider than it is tall, and this scatter plot is an [equirectangular projection](https://en.wikipedia.org/wiki/Equirectangular_projection) of the spherical Earth. A geographical map of the same data uses the common [Pseudo-Mercator projection](https://en.wikipedia.org/wiki/Web_Mercator_projection).
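
To get a feel for the distortion in a latitude/longitude scatter plot, one can compare how much ground a single degree of longitude covers at different latitudes. The sketch below is only an illustration: it assumes a spherical Earth with a radius of 6,371 km (an approximation that is not part of the project data), and the name `km_per_degree_longitude` is made up for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # assumed mean radius of a spherical Earth

def km_per_degree_longitude(lat_degrees):
    """Approximate ground distance (km) spanned by one degree of longitude."""
    return np.radians(1) * EARTH_RADIUS_KM * np.cos(np.radians(lat_degrees))

# One degree of longitude covers less ground the farther north you go,
# but a latitude/longitude scatter plot draws every degree with the same width.
lats = np.array([25, 37, 49])
print(km_per_degree_longitude(lats))
```

Because the plot gives every degree the same width, northern parts of the country appear wider than they are on the ground.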

In [None]:
# Just run this cell
Marker.map_table(cities.select('Latitude', 'Longitude', 'Name').relabeled('Name', 'labels'))

<!-- BEGIN QUESTION -->

**Question 1.1.2** How do the city locations reflect the distribution of population centers and geographic diversity across the US?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.1.3:** Assign `unique_city_count` to the number of unique cities that appear in the `cities` table.


In [None]:
unique_city_count = ...

# Do not change this line
print(f"There are {unique_city_count} unique cities that appear within our dataset.")

In [None]:
grader.check("q1_1_3")

To investigate further, it will be helpful to determine which region of the United States each city is located in: Northeast, Northwest, Southeast, or Southwest. For our purposes, we will use the following geographical boundaries:

<img src= "usa_coordinates.png" alt="USA Coordinate Map" width="600"/>

1. A station is located in the `"Northeast"` region if its latitude is above or equal to 37 degrees and its longitude is greater than or equal to -100 degrees.
2. A station is located in the `"Northwest"` region if its latitude is above or equal to 37 degrees and its longitude is less than -100 degrees.
3. A station is located in the `"Southeast"` region if its latitude is below 37 degrees and its longitude is greater than or equal to -100 degrees.
4. A station is located in the `"Southwest"` region if its latitude is below 37 degrees and its longitude is less than -100 degrees.

**Question 1.1.4**: Define the `region_coordinates` function below. It should take in two arguments, a city's latitude (`lat`) and longitude (`lon`) coordinates, and output a string representing the region it is located in.


In [None]:
def region_coordinates(...):
    ...

In [None]:
grader.check("q1_1_4")

**Question 1.1.5**: Add a new column in `cities` labeled `Area` that contains the region in which the city is located. For full credit, you must use the `region_coordinates` function you defined rather than reimplementing its logic.


In [None]:
area_array = ...
cities = ...
cities.show(5)

In [None]:
grader.check("q1_1_5")

To confirm that you've defined your `region_coordinates` function correctly and successfully added the `Area` column to the `cities` table, run the following cell. Each region should have a different color in the result.

In [None]:
# Just run this cell
cities.scatter("Longitude", "Latitude", group="Area")

**Challenge Question 1.1.6 (OPTIONAL, ungraded)**: Create a new table called `cities_nearest`. It should contain the same columns as the `cities` table and an additional column called `"Nearest"` that contains the **name of the nearest city** that is in a different region from the city described by the row.

To approximate the distance between two cities, take the square root of the sum of the squared difference between their latitudes and the squared difference between their longitudes. Don't use a `for` statement; instead, use the `apply` method and array arithmetic.

*Hint*: We have defined a `distance` function for you, which can be called on numbers `lat0` and `lon0` and arrays `lat1` and `lon1`.
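
As a rough illustration of how a call with two numbers and two arrays behaves, the sketch below repeats the `distance` definition so that it runs on its own and applies it to three made-up places. The names `names`, `lats`, `lons`, and `closest` and all coordinates are hypothetical; they are not taken from the `cities` table.

```python
import numpy as np

def distance(lat0, lon0, lat1, lon1):
    "Approximate the distance between point (lat0, lon0) and (lat1, lon1) pairs in the arrays."
    return np.sqrt((lat0 - lat1) * (lat0 - lat1) + (lon0 - lon1) * (lon0 - lon1))

# Hypothetical coordinates, not taken from the cities table
names = np.array(['A', 'B', 'C'])
lats = np.array([40.0, 34.0, 37.0])
lons = np.array([-105.0, -118.0, -122.0])

# One call compares the point (38, -121) against all three places at once
dists = distance(38.0, -121.0, lats, lons)

# The smallest distance identifies the closest place
closest = names[np.argmin(dists)]
print(dists, closest)
```

Array arithmetic produces one distance per element, so no `for` statement is needed to compare one point against many.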

In [None]:
def distance(lat0, lon0, lat1, lon1):
    "Approximate the distance between point (lat0, lon0) and (lat1, lon1) pairs in the arrays."
    return np.sqrt((lat0 - lat1) * (lat0 - lat1) + (lon0 - lon1) * (lon0 - lon1))

def nearest(name):
    row = ...
    others = ...
    distances = ...
    return others.with_column('dist', distances).sort('dist').row(0).item('Name')

nearest_array = cities.apply(nearest, "Name")

cities_nearest = ...
cities_nearest.show(5)

### Part 1, Section 2: Welcome to Sacramento, California

Each city has a different CSV file full of daily temperature and precipitation measurements. The file for Sacramento, California is included with this project as `sacramento.csv`. The files for other cities can be downloaded [here](https://kilthub.cmu.edu/articles/dataset/Compiled_daily_temperature_and_precipitation_data_for_the_U_S_cities/7890488) by matching them to the ID of the city in the `cities` table.

Since Sacramento is the city located closest to the Bay Area, it is interesting to look at its temperatures.

Run the following cell to load in the `sacramento` table. It has one row per day and the following columns:

1. `"Date"`: The date of the recording, as a string in **YYYY-MM-DD** format
2. `"tmax"`: The maximum temperature for the day (°F)
3. `"tmin"`: The minimum temperature for the day (°F)
4. `"prcp"`: The recorded precipitation for the day (inches)

In [None]:
sacramento = Table.read_table("sacramento.csv", index_col=0)
sacramento.show(10)

**Question 1.2.1:** Assign the variable `highest_2020_average_date` to the date of the **highest average temperature** in Sacramento, California for any day between January 1st, 2020 and December 31st, 2020. Your answer should be a string in the "YYYY-MM-DD" format. Feel free to use as many lines as you need. An average temperature is calculated as the sum of the max and min temperatures for the day divided by 2.

*Hint*: To limit the values in a column to only those that *contain* a certain string, pick the right `are.` predicate from the [Python Reference Sheet](http://data8.org/sp22/python-reference.html).

*Note:* Do **not** re-assign the `sacramento` variable; please use the `sacramento_with_averages_2020` variable instead.
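
In the spirit of demonstrating a similar problem, here is a minimal sketch that uses plain NumPy arrays and made-up readings for a different year. It is not the intended `Table`-based solution. All names and values are hypothetical, and `np.char.startswith` stands in for the `are.` predicate mentioned in the hint.

```python
import numpy as np

# Hypothetical readings, not taken from sacramento.csv
dates = np.array(['2019-12-31', '2021-07-01', '2021-07-02', '2021-07-03', '2022-01-01'])
tmax = np.array([55, 98, 104, 101, 50])
tmin = np.array([40, 62, 66, 70, 38])

# Average temperature for each day: (max + min) / 2
averages = (tmax + tmin) / 2

# Keep only the days whose date string starts with the year of interest
in_2021 = np.char.startswith(dates, '2021')
dates_2021 = dates[in_2021]
averages_2021 = averages[in_2021]

# Date of the highest average temperature within that year
hottest_date = dates_2021[np.argmax(averages_2021)]
print(hottest_date)
```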


In [None]:
...
sacramento_with_averages_2020 = ...
highest_2020_average_date = ...
highest_2020_average_date

In [None]:
grader.check("q1_2_1")

We can look back to our `sacramento` table to check the temperature readings for our `highest_2020_average_date` to see if anything special is going on. Run the cell below to find the row of the `sacramento` table that corresponds to the date we found above. 

In [None]:
# Just run this cell
sacramento.where("Date", highest_2020_average_date)

ZOO WEE MAMA! Look at the maximum temperature for that day. That's hot. But not as hot as Sacramento's all-time high temperature reached on September 6, 2022 of 116 degrees!!

The function `get_month_from_date` takes a date string in the **YYYY-MM-DD** format and returns a string describing the **month**, such as `"04 (Apr)"`. The function `get_year_from_date` takes a date string and returns an integer representing the **year**. Run this cell; you do not need to understand how this code works or edit it.

In [None]:
# Just run this cell
import calendar

def get_month_from_date(date):
    "Return the month of a date string as a two-digit number and abbreviation, such as '04 (Apr)'."
    month = date[5:7]
    return f'{month} ({calendar.month_abbr[int(month)]})'

def get_year_from_date(date):
    """Returns an integer corresponding to the year of the input string's date."""
    return int(date[:4])


# Example
print('2022-04-01 has month', get_month_from_date('2022-04-01'),
      'and year', get_year_from_date('2022-04-01'))

**Question 1.2.2:** Add two new columns called `Month` and `Year` to the `sacramento` table that contain the month as a **string** (such as `"04 (Apr)"`) and the year as an **integer** for each day, respectively. 

*Note*: The functions above may be helpful!


In [None]:
months_array = ...
years_array = ...

...
sacramento.show(5)

In [None]:
grader.check("q1_2_2")

<!-- BEGIN QUESTION -->

**Question 1.2.3:** A recent study has found that all species appear to thrive at an "optimal" 20 degrees Celsius (68 degrees Fahrenheit). Using the `sacramento` table, create a line plot of the **average number of days the maximum temperature was above 68 degrees** for each year between 1920 and 2021.
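
One way to think about counting days above a threshold is to turn each day into `True` or `False` and then add up the `True` values within each year. The sketch below shows that idea on made-up data with plain NumPy arrays. The names `years`, `is_hot`, and `hot_days` are hypothetical, and the `Table` methods you use in the project will look different.

```python
import numpy as np

# Hypothetical daily records: one year label and one maximum temperature per day
years = np.array([2001, 2001, 2001, 2002, 2002, 2002])
tmax = np.array([70, 65, 80, 60, 69, 50])

threshold = 68
is_hot = tmax > threshold  # True where the day's maximum exceeds the threshold

# Sum the True values within each year to count the hot days
unique_years = np.unique(years)
hot_days = np.array([np.sum(is_hot[years == y]) for y in unique_years])
print(unique_years, hot_days)
```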

In [None]:
hot_day = ...
sacramento_hot = ...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.2.4:** Although still hotly debated (pun intended), many climate scientists agree that the effects of climate change began to surface in the early 1960s as a result of elevated levels of greenhouse gas emissions. How does the graph you produced in Question 1.2.3 support the claim that modern-day global warming began in the early 1960s? 


_Type your answer here, replacing this text._

<!-- END QUESTION -->

Averaging temperatures across an entire year can obscure some effects of climate change. For example, if summers get hotter but winters get colder, the annual average may not change much. Let's investigate how average **monthly** minimum temperatures have changed over time in Sacramento. 

**Question 1.2.5:** Create a `monthly_differences` table with one row per month and the following four columns in order: 
1. `"Month"`: The month (such as `"02 (Feb)"`)
2. `"Past"`: The average min temperature in that month from 1900-1979 (inclusive)
3. `"Present"`: The average min temperature in that month from 2016-2021 (inclusive)
4. `"Increase"`: The difference between the present and past average min temperatures in that month

First make a copy of the `sacramento` table and add a new column containing the corresponding **category** for each row. You may find the `categorize` function helpful. Then, use this new table to construct `monthly_differences`. Feel free to use as many lines as you need.

*Hint*: What table method can we use to get each **unique value** as its own column? 

*Note*: Please do **not** re-assign the `sacramento` variable!


In [None]:
def categorize(year):
    "Output if a year is in the Past, Present, or Other."
    if 1900 <= year <= 1979:
        return "Past"
    elif 2016 <= year <= 2021:
        return "Present"
    else:
        return "Other"
    
...
monthly_differences = ...
monthly_differences.show()


In [None]:
grader.check("q1_2_5")

### March in Sacramento

The `"Past"` column values are averaged over many decades, and so they are reliable estimates of the average minimum temperatures in those months before the effects of modern climate change. However, the `"Present"` column is based on only six years of observations. February, the shortest month, has the fewest total observations: only 170 days. Run the following cell to see this.

In [None]:
# Just run this cell
feb_present = sacramento.where('Year', are.between_or_equal_to(2016, 2021)).where('Month', '02 (Feb)')
feb_present.num_rows

Since February is a short month, we'll focus on the month with the next closest value, March. Look back at your `monthly_differences` table: compared to the other months, the increase for March is the next smallest. Run the following cell to print out our observed difference.

In [None]:
# Just run this cell
mar_present = sacramento.where('Year', are.between_or_equal_to(2016, 2021)).where('Month', '03 (Mar)')
print(f"March Difference: {monthly_differences.row(2).item('Increase')}")

Are March months really getting warmer (i.e., is the average minimum temperature increasing)? Perhaps that small difference is somehow due to chance! To investigate this idea requires a thought experiment.

We can observe all of the March minimum temperatures from 2016 to 2021 (the present period), so we have access to the census; there's no random sampling involved. But, we can imagine that if more years pass with the same present-day climate, there would be different but similar minimum temperatures in future March days. From the data we observe, we can try to estimate the **average minimum March temperature** in this imaginary collection of all future March days that would occur in our modern climate, assuming the climate doesn't change any further and many years pass.

We can also imagine that the minimum temperature each day is like a **random draw from a distribution of temperatures for that month**. Treating actual observations of natural events as if they were each *randomly* sampled from some unknown distribution is a simplifying assumption. These temperatures were not actually sampled at random (they occurred due to the complex interactions of the Earth's climate), but treating them as if they were random abstracts away the details of this naturally occurring process and allows us to carry out statistical inference. Conclusions are only as valid as the assumptions upon which they rest, but in this case thinking of daily temperatures as random samples from some unknown climate distribution seems at least plausible.

If we assume that the **actual temperatures were drawn at random from some large population of possible March days** in our modern climate, then we can not only estimate the population average of this distribution, but also quantify our uncertainty about that estimate using a confidence interval.

**We will just compute the lower bound of this confidence interval.** The upper bound of a confidence interval for a population average based on a sample is always higher than the sample average. We intend to compare our confidence interval to the historical average (i.e., the `Past` value in our `monthly_differences` table). In all months, the sample average we will consider (i.e., the `Present` value in our `monthly_differences` table) is higher than the historical average. As a result, we know in advance that the upper bound of the confidence interval will be higher as well, and there is no need to compute the upper bound explicitly. (But you can if you wish!)
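
To make the idea concrete, here is a minimal sketch of a percentile-bootstrap lower bound on a made-up sample, written with plain NumPy rather than the `datascience` tools used in this project. The sample, the seed, and the name `bootstrap_lower_bound` are all assumptions for illustration. `np.percentile` interpolates between values, so it may give slightly different numbers than the `percentile` function you use elsewhere in the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical sample standing in for observed daily minimum temperatures
sample = rng.normal(loc=46, scale=4, size=180)

def bootstrap_lower_bound(values, level, repetitions=4000):
    """Lower end of a level% percentile-bootstrap interval for the mean."""
    means = np.empty(repetitions)
    for i in range(repetitions):
        # Resample with replacement, keeping the original sample size
        resample = rng.choice(values, size=len(values), replace=True)
        means[i] = np.mean(resample)
    # For a 95% interval, the middle 95% starts at the 2.5th percentile
    return np.percentile(means, (100 - level) / 2)

lower = bootstrap_lower_bound(sample, 95)
print(lower, np.mean(sample))
```

The lower bound sits a little below the sample average, which is why comparing it against the historical average tells us whether the interval as a whole lies above that value.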

**Question 1.2.6.** Complete the implementation of the function `ci_lower`, which takes a one-column table `t` containing sample observations and a confidence `level` percentage such as 95 or 99. It returns the lower bound of a confidence interval for the population mean constructed using 4,000 bootstrap resamples.

After defining `ci_lower`, we have provided a line of code that calls `ci_lower` on the present-day March min temperatures to output the lower bound of a 95% confidence interval for the March average min temperature. The result should be around 45.5 degrees.


In [None]:
def ci_lower(t, level):
    """Compute a lower bound of a level% confidence interval of the 
    average of the population for which column 0 of Table t contains a sample.
    """
 
    stats = make_array()
    for k in np.arange(4000):
        stat = ...
        stats = ...
    ...

# Call ci_lower on the min temperatures in present-day March to find the lower bound of a 95% confidence interval.
mar_present_ci = ci_lower(mar_present.select('tmin'), 95)
mar_present_ci

In [None]:
grader.check("q1_2_6")

<!-- BEGIN QUESTION -->

**Question 1.2.7** The lower bound of the `mar_present_ci` 95% confidence interval is above the observed past March average minimum temperature of 42.9863 (from the `monthly_differences` table). What conclusion can you draw about the effect of climate change on March minimum temperatures in Sacramento from this information? Use a 5% p-value cutoff.

*Note*: If you're stuck on this question, re-reading the paragraphs under the *March* heading (particularly the first few) may be helpful.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### All Months

**Question 1.2.8.** Repeat the process of comparing the **lower bound of a 95% confidence interval** to the **past average** for each month. For each month, print out the name of the month (e.g., `02 (Feb)`), the observed past average, and the lower bound of a confidence interval for the present average.

Use the provided call to `print` in order to format the result as one line per month.

*Hint*: Your code should follow the same format as our code from above (i.e., the *March in Sacramento* section).


In [None]:
comparisons = make_array()
months = ...
for month in months:
    past_average = ...
    present_averages = ...
    present_lower_bound = ...
    
    # Do not change the code below this line
    below = past_average < present_lower_bound
    if below:
        comparison = '**below**'
    else:
        comparison = '*above*'
    comparisons = np.append(comparisons, comparison)
    
    print('For', month, 'the past avg', round(past_average, 1), 
          'is', comparison, 
          'the lower bound', round(present_lower_bound, 1),
          'of the 95% CI of the present avg. \n')

In [None]:
grader.check("q1_2_8")

<!-- BEGIN QUESTION -->

**Question 1.2.9.** Summarize your findings. After comparing the past average to the 95% confidence interval's lower bound for each month, what conclusions can we make about the monthly average minimum temperature in historical (1900-1979) vs. modern (2016-2021) times in the twelve months? In other words, what null hypothesis should you consider, and for which months would you reject or fail to reject the null hypothesis? Use a 5% p-value cutoff.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Congratulations

Congratulations, you made it this far!


--- 
The cell below will re-run all of the autograder tests for Part 1 to double check your work.

In [None]:
checkpoint_tests = ["q1_1_3", "q1_1_4", "q1_1_5",
                    "q1_2_1", "q1_2_2", "q1_2_5", "q1_2_6", "q1_2_8"]

for test in checkpoint_tests:
    display(grader.check(test))

## Submission
If your instructor would like you to submit the work in part one as a checkpoint to the project, follow the instructions below.

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**


**Reminders**:
- Make sure to wait until the autograder finishes running to ensure that your submission was processed properly and that you submitted to the correct assignment.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)

# Part 2: Drought

California is no stranger to drought; it is a recurring feature of our climate. According to the [California Department of Water Resources](https://water.ca.gov/water-basics/drought#:~:text=We%20recently%20experienced%20the%205,in%20the%201920s%20and%201930s.), California experienced drought periods from 2007-2009 and from 2012-2016. 

Let's look back at the Sacramento dataset and consider the precipitation data. The `sacramento.csv` file contains precipitation for each year since 1878. You may recall that `"prcp"` is the recorded precipitation for the day (inches). Run the cell below to look at the current `sacramento` table.

In [None]:
sacramento.show(5)

**Question 2.1.** The Sacramento dataset only has consistent precipitation data since 1960. Create a table `total_precip` that has one row for each year since 1960 (inclusive) in chronological order. It should contain the following columns:
1. `"Year"`: The year (a number)
2. `"Total Precipitation"`: The total precipitation in Sacramento that year


In [None]:
total_precip = ...
total_precip

In [None]:
grader.check("q2_1")

Run the cell below to plot the total precipitation in Sacramento over time, so that we can try to spot the drought visually. Pay careful attention to the drought years (2007-2009) and (2012-2016) identified by the CA Dept of Water Resources.

In [None]:
# Just run this cell
total_precip.plot('Year', 'Total Precipitation')

This plot isn't very revealing. Each year has a different amount of precipitation, and there is quite a bit of variability across years, as if each year's precipitation is a random draw from a distribution of possible outcomes. 

Could it be that these so-called "drought conditions" from 2007-2009 and 2012-2016 can be explained by chance? In other words, could it be that the annual precipitation amounts in Sacramento for these drought years are like **random draws from the same underlying distribution** as for other years? Perhaps nothing about the Earth's precipitation patterns has really changed, and Sacramento just happened to experience a few dry years close together. 

To assess this idea, let's conduct an A/B test in which **each year's total precipitation** is an outcome, and the condition is **whether or not the year is in the CA Water Department's drought period**.

This `precip_label` function distinguishes between drought years as described by the CA Water Department above (2007-2009 and 2012-2016) and other years. Note that the label "other" is perhaps misleading, since there were other droughts before 2007, such as the massive [1988 drought](https://en.wikipedia.org/wiki/1988%E2%80%9390_North_American_drought) that affected much of the U.S. However, if we're interested in whether these modern drought periods (2007-2009 and 2012-2016) are *normal* or *abnormal*, it makes sense to distinguish the years in this way. 

In [None]:
def precip_label(n):
    """Return the label for an input year n."""
    if 2007 <= n <= 2009 or 2012 <= n <= 2016:
        return 'drought'
    else:
        return 'other'

<!-- BEGIN QUESTION -->

**Question 2.2.** Define null and alternative hypotheses for an A/B test that investigates whether drought years are drier (have less precipitation) than other years.

*Note:* Please format your answer using the following structure.

- *Null hypothesis:* ...
- *Alternative hypothesis:* ...


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3.** First, define the table `precip`. It should contain one row per year and the following two columns:
- `"Label"`: Denotes if a year is part of a `"drought"` year or an `"other"` year
- `"Total Precipitation"`: The total precipitation in Sacramento that year

Then, construct an overlaid histogram of two observed distributions: the total precipitation in drought years and the total precipitation in other years. 

*Note*: Use the provided `bins` when creating your histogram, and do not re-assign the `sacramento` table. Feel free to use as many lines as you need!

*Hint*: The optional `group` argument in a certain function might be helpful!


In [None]:
bins = np.arange(1, 35, 3)
precip = ...
...

<!-- END QUESTION -->

Before you continue, inspect the histogram you just created and try to guess the conclusion of the A/B test. Building intuition about the result of hypothesis testing from visualizations is quite useful for data science applications. 

**Question 2.4.** Our next step is to choose a test statistic based on our alternative hypothesis in Question 2.2. Which of the following options are valid choices for the test statistic? Assign `ab_test_stat` to an array of integers corresponding to valid choices. Assume averages and totals are taken over the total precipitation sums for each year.

1. The **absolute** difference between the **total** precipitation in other years and the **total** precipitation in drought years.
2. The **total** precipitation in **drought** years.
3. The difference between the **total** precipitation in **other** years and the **total** precipitation in **drought** years.
4. The **average** precipitation in **drought** years.
5. The **absolute** difference between the **total** precipitation in other years and the **total** precipitation in drought years.
6. The difference between the **average** precipitation in **drought** years and the **average** precipitation in **other** years.


In [None]:
ab_test_stat = ...

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

**Question 2.5.** Fellow climate scientists Jim and Tisha point out that there are more **other** years than **drought** years, and we should disregard the **other** years because they are skewing the data. They conclude the only valid test statistic is the **total** precipitation in drought years. Do you agree with them? Why or why not?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

Before going on, check your `precip` table. It should have two columns `Label` and `Total Precipitation` with 62 rows, 8 of which are for `"drought"` years.

In [None]:
precip.show(5)

In [None]:
precip.group('Label')

**Question 2.6.** For our A/B test, we'll use the difference between the average precipitation in drought years and the average precipitation in other years as our test statistic:

$$\text{average precipitation in "drought" years} - \text{average precipitation in "other" years}$$

First, complete the function `test_stat`. It should take in a two-column table `t` with one row per year and two columns:
- `Label`: the label for that year (either `'drought'` or `'other'`)
- `Total Precipitation`: the total precipitation in Sacramento that year. 

Then, use the function you define to assign `observed_stat` to the observed test statistic.


In [None]:
def test_stat(t):
    ...

observed_stat = ...
observed_stat

In [None]:
grader.check("q2_6")

Now that we have defined our hypotheses and test statistic, we are ready to conduct our hypothesis test. We’ll start by defining a function to simulate the test statistic under the null hypothesis, and then call that function 4,000 times to construct an empirical distribution under the null hypothesis.

**Question 2.7.** Write a function to simulate the test statistic under the null hypothesis. The `simulate_null` function should simulate the null hypothesis once (not 4,000 times) and return the value of the test statistic for that simulated sample.

*Hint*: Using `t.with_column(...)` with a column name that already exists in a table `t` will replace that column with the newly specified values.
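
As a sketch of what one simulation under the null hypothesis can look like, the example below shuffles the labels of a small made-up dataset and recomputes a difference of group means. It uses plain NumPy arrays instead of a `Table`. The data and the names `difference_of_means` and `simulate_once` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the sketch is repeatable

# Hypothetical yearly totals (inches); not the Sacramento data
labels = np.array(['drought', 'drought', 'other', 'other', 'other', 'other'])
totals = np.array([10.0, 12.0, 20.0, 18.0, 22.0, 16.0])

def difference_of_means(labels, totals):
    """Mean of the 'drought' group minus mean of the 'other' group."""
    return totals[labels == 'drought'].mean() - totals[labels == 'other'].mean()

def simulate_once(labels, totals):
    """Shuffle the labels, leave the totals in place, and recompute the statistic."""
    shuffled_labels = rng.permutation(labels)
    return difference_of_means(shuffled_labels, totals)

observed = difference_of_means(labels, totals)
print(observed, simulate_once(labels, totals))
```

Shuffling only the labels keeps the group sizes fixed while breaking any real link between a year's label and its precipitation, which is what the null hypothesis describes.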


In [None]:
def simulate_null():
    ...

# Run your function a couple times to make sure that it works
simulate_null()

In [None]:
grader.check("q2_7")

**Question 2.8.** Fill in the blanks below to complete the simulation for the hypothesis test. Your simulation should compute 4,000 values of the test statistic under the null hypothesis and store the result in the array `simulated_values`.

*Hint:* You should use the `simulate_null` function you wrote in the previous question!

*Note:* Running this cell may take a few seconds. If it takes more than a minute, try to find a faster way to implement your `simulate_null` function.


In [None]:
simulated_values = ...

repetitions = ...
for i in np.arange(repetitions):
    ...

# Do not change these lines
Table().with_column('Difference Between Means', simulated_values).hist()
plt.scatter(observed_stat, 0, c="r", s=50);
plt.ylim(-0.01);

In [None]:
grader.check("q2_8")

**Question 2.9.** Compute the p-value for this hypothesis test, and assign it to the variable `p_value`.
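
For intuition, an empirical p-value is the proportion of simulated statistics that are at least as extreme as the observed one, in the direction of the alternative hypothesis. The sketch below uses made-up numbers: the observed value and the simulated array are stand-ins, not results from this project. It assumes a "group A minus group B" statistic whose alternative points toward smaller values.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded so the sketch is repeatable

# Stand-ins: a made-up observed statistic and made-up simulated statistics
observed = -2.5
simulated = rng.normal(loc=0, scale=1.5, size=4000)

# Under the assumed alternative, evidence against the null lies in the left tail,
# so count the simulated values that are at or below the observed value
p_value = np.count_nonzero(simulated <= observed) / len(simulated)
print(p_value)
```

If the alternative pointed toward larger values instead, the comparison would flip to `>=`.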


In [None]:
p_value = ...
p_value

In [None]:
grader.check("q2_9")

<!-- BEGIN QUESTION -->

**Question 2.10.** State a conclusion from this test using a p-value cutoff of 1%. What have you learned about the CA Water Department's statement on drought?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.11.** Does your conclusion from Question 2.10 apply to the entire state of California? Why or why not?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Conclusion

Data science plays a central role in climate change research because massive simulations of the Earth's climate are necessary to assess the implications of climate data recorded from weather stations, satellites, and other sensors. [Berkeley Earth](http://berkeleyearth.org/data/) is a common source of data for these kinds of projects.

In this project, we found ways to apply our statistical inference techniques that rely on random sampling even in situations where the data were not generated randomly, but instead by some complicated natural process that appeared random. We made assumptions about randomness and then came to conclusions based on those assumptions. Great care must be taken to choose assumptions that are realistic, so that the resulting conclusions are not misleading. However, making assumptions about data can be productive when doing so allows inference techniques to apply to novel situations.

**Congratulations on finishing Project 2! Time to submit.**

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)