In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw03.ipynb")

# Homework 03: Table Manipulation and Visualization

**Helpful Reference:**
* [Python Reference](https://www.data8.org/sp22/python-reference.html). Cheat sheet of helpful array & table methods used in this course!
  
**Reading**: 
* [Visualization](https://inferentialthinking.com/chapters/07/Visualization.html)

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, be sure not to reassign variables later in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. Additional tests will be run once your homework is submitted for grading. While you may pass all the tests you have access to before submission, you may not earn full credit if you do not pass the hidden tests as well.**

Many of the tests you have access to before submitting only check that your answer is formatted correctly and/or that it *could* make sense in context. For example, a test you have access to while completing the assignment may check that you selected a valid choice for a multiple choice problem (1, 2, or 3) or that your answer is an integer between 0 and 50 if asked to count a subset of states in the United States. The tests that are run after submission will evaluate your work for accuracy. **Do not assume that your answers are correct just because all your tests pass before submission!**

Consult with your teacher and course syllabus for information and policies regarding appropriate collaboration with other students, appropriate use of AI tools, and submission of late work.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## 1. Unemployment


The Federal Reserve Bank of St. Louis publishes data about jobs in the US.  Below, we've loaded data on unemployment in the United States. There are many ways of defining unemployment, and our dataset includes two notions of the unemployment rate:

1. Among people who are able to work and are looking for a full-time job, **the percentage who can't find a job**.  This is called the Non-Employment Index, or NEI.
2. Among people who are able to work and are looking for a full-time job, **the percentage who can't find any job** *or* **are only working at a part-time job**.  The latter group is called "Part-Time for Economic Reasons", so the acronym for this index is NEI-PTER.  (Economists are great at marketing.)

Both metrics are calculated **monthly** by the Federal Reserve Bank of Richmond. For further information about this data, see the Federal Reserve Bank of Richmond's page on the [Hornstein-Kudlyak-Lange Non-Employment Index](https://www.richmondfed.org/research/national_economy/non_employment_index). The original sources of the data are [here for NEI](https://fred.stlouisfed.org/series/NEIM156SFRBRIC) and [here for NEI-PTER](https://fred.stlouisfed.org/series/NEIPTERM156SFRBRIC).

#### Question 1.1.

The data are provided in a CSV file called `unemployment.csv`.  Load that file into a table called `unemployment`.

In [None]:
unemployment = ...
unemployment

In [None]:
grader.check("q1_1")

#### Question 1.2.

Sort the data in descending order by NEI, naming the sorted table `by_nei`.  Create another table called `by_nei_pter` that's sorted in descending order by NEI-PTER instead.

In [None]:
by_nei = ...
by_nei_pter = ...

In [None]:
grader.check("q1_2")

#### Question 1.3.

Use `take` to make a table containing only the `Date` and `NEI` columns, with rows for only the 10 months when NEI was greatest. Call the new Table `greatest_nei`, and sort it in descending order by the `NEI` column. Recall that each row of `unemployment` represents a month.

**Hint:** You will need to remember / look up how to keep only certain columns and rows from a Table.

In [None]:
greatest_nei = ...
greatest_nei

In [None]:
grader.check("q1_3")

#### Question 1.4.

It's believed that many people became PTER (recall: "Part-Time for Economic Reasons") in the "Great Recession" of 2008-2009 and during the first phase of the COVID-19 pandemic in 2020-2021. The next few questions will help you to create a set of data that will allow you to investigate this claim.

Recall that NEI-PTER is the percentage of people who are unemployed (and counted in the NEI) plus the percentage of people who are PTER. **Compute an array** containing the percentage of people who were only PTER in each month.  (The first element of the array should correspond to the first row of `unemployment`, and so on.)

*Note:* Use the original `unemployment` table for this.
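
As a small illustration of the array arithmetic involved, the sketch below subtracts one NumPy array from another, element by element. The names `combined` and `part` and all of the values are made up for this example; they are not taken from `unemployment.csv`.

```python
import numpy as np

# Hypothetical percentages, invented for this example.
combined = np.array([10.0, 12.5, 9.0])   # a total made of two parts
part = np.array([7.0, 8.5, 6.5])         # one of the two parts

# Subtraction between equal-length arrays works element by element,
# so remainder[i] == combined[i] - part[i] for every position i.
remainder = combined - part
print(remainder)
```

Because the operation is element-wise, the result keeps the same order as the inputs, which is why the first element of the result lines up with the first row of the source data.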

In [None]:
pter = ...
pter

In [None]:
grader.check("q1_4")

#### Question 1.5.

Add the `pter` array as a column named `PTER` to the original `unemployment` Table and sort the resulting Table by the `PTER` column in descending order. Call the new Table `by_pter`.

Try to do this with a single line of code, if you can.

In [None]:
by_pter = ...
by_pter

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

#### Question 1.6.

It would be helpful to create a line plot of the PTER over time to see how it has changed over the years. Unfortunately, the `Date` column is a string that can't be used as the independent variable of a line plot. You will need to create a new column, `Year`, that represents the year as a `float`. For example, January 1994 should be represented as `1994.0`, February 2000 should be represented as `2000.0833333333` (`0.0833333333` is $\approx \frac{1}{12}$), March 2020 should be represented as `2020.1666666667`, and so on.

First, think about how you could quickly create an array of these `float` values using a `numpy` function. Remember, the first row of the data starts with January 1994 (`1994.0`) and each row increases the year value by one twelfth of a year. Assign this array to the name `year_array`.

Then, add both the `year_array` array and the `pter` array to the original `unemployment` table. Label these columns `Year` and `PTER`. 

Lastly, generate a line plot using one of the table methods you've learned in class.
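
One possible way to build evenly spaced fractional years is sketched below on a tiny, made-up case. The names `start_year`, `num_months`, and `example_years` are hypothetical, and the start year and length are chosen only for illustration.

```python
import numpy as np

# Hypothetical example: six monthly values starting in January 1990.
start_year = 1990
num_months = 6

# np.arange(num_months) is [0, 1, 2, 3, 4, 5]; dividing by 12 turns each
# month count into a fraction of a year.
example_years = start_year + np.arange(num_months) / 12
print(example_years)
```

Calling `np.arange(start, stop, step)` with a fractional step is another option, though NumPy's documentation cautions that the length of the result can be hard to predict with non-integer steps. Counting months with integers first and dividing afterwards sidesteps that issue.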

In [None]:
year_array = ...
pter_over_time = ...
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Question 1.7.

Were PTER rates higher than usual during the Great Recession and during the COVID-19 pandemic (that is to say, were PTER rates particularly high in the years 2008-2011 and 2020-2021)? Write an explanation of your opinion backed by the data displayed in the line plot created in the previous question, and any additional statistical analysis you perform. Cite specific historical events that align with key dates when possible to add context to your explanation.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Birth Rates


The following table gives census-based population estimates for each state on both July 1, 2019 and July 1, 2020. The last four columns describe the components of the estimated change in population during this time interval. **For all questions below, assume that the word "states" refers to all 52 rows including Puerto Rico & the District of Columbia.**

The data was taken from the [US Census Website](https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/totals/). If you want to read more about the different column descriptions, click [here](https://www2.census.gov/programs-surveys/popest/datasets/2010-2016/national/totals/nst-est2016-alldata.pdf)!

The raw data is a bit messy and contains a large amount of additional data - run the cell below to pare down and clean the table to make it easier to work with.

In [None]:
# Don't change this cell; just run it.
pop = Table.read_table('nst-est2020-alldata.csv').where('SUMLEV', 40).select(["REGION", "NAME", 'POPESTIMATE2019', 'POPESTIMATE2020', 'BIRTHS2020', 'DEATHS2020', 'NETMIG2020', 'RESIDUAL2020'])
pop = pop.relabeled('POPESTIMATE2019', '2019').relabeled('POPESTIMATE2020', '2020')
pop = pop.relabeled('BIRTHS2020', 'BIRTHS').relabeled('DEATHS2020', 'DEATHS')
pop = pop.relabeled('NETMIG2020', 'MIGRATION').relabeled('RESIDUAL2020', 'OTHER')
pop = pop.with_columns("REGION", np.array([int(region) if region != "X" else 0 for region in pop.column("REGION")]))
pop.set_format([2, 3, 4, 5, 6, 7], NumberFormatter(decimals=0)).show(5)

Each of the columns `BIRTHS`, `DEATHS`, `MIGRATION`, and `OTHER` represents a change in the population of the state between 2019 and 2020, measured in individual people, due to: 

* **Births**. This number represents how much the population has increased due to births.
* **Deaths**. This number represents how much the population has decreased due to deaths. 
* **Migration**. This number represents how much the population has changed due to people moving in or out of the state. A positive number represents a net increase in the state population and a negative number represents a net decrease in the state population.
* **Other**. This number represents how much the population has changed due to all other reasons. A positive number represents a net increase in the state population and a negative number represents a net decrease in the state population.

#### Question 2.1.

Investigate the national birth *rate* during this time period. Assign `us_birth_rate` to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the total number of births in that period as a proportion of the population size at the start of the time period.

**Hint:** Which year corresponds to the start of the time period?
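
The sketch below illustrates, on made-up numbers, what "a total as a proportion of the starting population" means when the data is split across several rows. The arrays `start_population` and `events` are hypothetical and are unrelated to the `pop` table.

```python
import numpy as np

# Hypothetical counts for three regions, invented for this example.
start_population = np.array([1000, 2500, 1500])
events = np.array([12, 30, 8])

# Overall rate: total events divided by total starting population.
overall_rate = events.sum() / start_population.sum()

# For comparison only: the average of the three per-region rates.
mean_of_rates = np.mean(events / start_population)

print(overall_rate, mean_of_rates)
```

The two quantities generally differ, because averaging per-row rates gives small and large regions equal weight, while the overall rate weights each region by its population.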

In [None]:
us_birth_rate = ...
us_birth_rate

In [None]:
grader.check("q2_1")

#### Question 2.2.

Investigate the states that had a relatively large percentage of their population move either into the state or out of the state. Use Table and array methods to determine the **number of states** that had more than 0.5% of their population move in or out. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population size at the start of the period. 

Start by creating an extension of the `pop` Table named `migration_rates` that adds a new column containing the **annual *net* rate of migration** for each state. Then, determine how many states had net migration rates higher than 0.5%.
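
As a minimal sketch of counting how many values in an array exceed a threshold, consider the made-up rates below. The names `rates`, `threshold`, and `num_above` are hypothetical, and the values are not derived from the `pop` table.

```python
import numpy as np

# Hypothetical per-row rates (as proportions), invented for this example.
rates = np.array([0.002, 0.007, -0.009, 0.0051, 0.005])

# 0.5% written as a proportion is 0.005, not 0.5.
threshold = 0.005

# A comparison on an array gives an array of True/False values;
# np.count_nonzero counts the True entries.
num_above = np.count_nonzero(rates > threshold)
print(num_above)
```

Note that a value exactly equal to the threshold does not satisfy `>`, and that a negative rate never passes this particular comparison regardless of its size.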



In [None]:
migration_rates = ...
movers = ...
movers

In [None]:
grader.check("q2_2")

#### Question 2.3.

Investigate the total number of births that occurred in the Southeastern US between 2019 and 2020. Use Table and array methods to assign the total number of births, an integer, that occurred in region 3 (the Southeastern US) to `southeast_births`.

In [None]:
southeast_births = ...
southeast_births

In [None]:
grader.check("q2_3")

#### Question 2.4.

Assign `less_than_south_births` to the **number of states** that had a total population in 2019 that was smaller than the **total *number of births* in region 3 (the Southeastern US)** during this time interval.

In [None]:
less_than_south_births = ...
less_than_south_births

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

#### Question 2.5. 

In the next question, you will be creating a visualization to understand the relationship between birth and death rates. The annual death rate for a year-long period is the total number of deaths in that period as a proportion of the population size at the start of the time period.

What visualization is most appropriate to see if there is an association between birth and death rates during a given time interval?

1. Scatter Plot
2. Line Graph
3. Vertical Bar Chart
4. Horizontal Bar Chart
5. Histogram

Assign `visualization` below to the number corresponding to the correct visualization.

In [None]:
visualization = ...

In [None]:
grader.check("q2_5")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Question 2.6. 

In the code cell below, create a visualization that will help us determine if there is an association between birth rate and death rate during this time interval. You can take as many steps as you need, as you may find it helpful to create a new Table before creating the visualization.

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Question 2.7.

Is there an association between birth rate and death rate during this time interval? If so, how would you describe it? If not, how were you able to determine there was no association? **Write an explanation of your opinion backed by the data displayed in the visualization created in the previous question.** Offer an explanation as to why you believe it makes sense that there is or is not a correlation between these variables in the context of the data. In other words, why would it make sense for birth and death rates to have an association or not have an association?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 3. Marginal Histograms


Consider the following scatter plot: 

![](scatter.png)

The axes of the plot represent values of two variables: $x$ and $y$. 

Suppose we have a table called `t` that has two columns in it:

- `x`: a column containing the x-values of the points in the scatter plot
- `y`: a column containing the y-values of the points in the scatter plot

Below, you are given two histograms, each of which corresponds to either column `x` or column `y`. 

**Histogram A:** 

![](var1.png)

**Histogram B:** 

![](var2.png)

#### Question 3.1.

Suppose we run `t.hist('x')`. Which histogram does this code produce? Assign `histogram_column_x` to either 1 or 2.

1. Histogram A
2. Histogram B

In [None]:
histogram_column_x = ...

In [None]:
grader.check("q3_1")

<!-- BEGIN QUESTION -->

#### Question 3.2.

State at least one reason why you chose the histogram from Question 3.1. Make sure to indicate which histogram you selected (e.g., "I chose histogram A because ...") and cite one or more specific characteristics from the scatter plot to support your choice.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 4. Uber


Every day millions of people use ride sharing apps, like Uber and Lyft, to get around their cities and towns. The [Uber Movement](https://movement.uber.com) project has released data on how people are traveling throughout some of the largest cities in the world. Investigate how Uber rides might be different in two large cities: Boston, MA in the USA and Manila in the Philippines.

Each data set below contains 200,000 weekday Uber rides from one of the two cities. The `sourceid` and `dstid` columns contain codes corresponding to start and end locations of each ride. The `hod` column contains codes corresponding to the hour of the day the ride took place. The `ride time` column contains the length of the ride, in minutes.

Run the cells below to load the data for both cities.

In [None]:
boston = Table.read_table("boston.csv")
boston

In [None]:
manila = Table.read_table("manila.csv")
manila

<!-- BEGIN QUESTION -->

#### Question 4.1.

To start investigating the differences in these two cities, visualize the distribution of ride times in each of the cities.

Produce two histograms, one for each city, that show the distribution of all ride times. Use the provided bins stored in the array `equal_bins` and **assign the horizontal label to have the correct units**. Need to remember how? Reread [7.2.2: Histograms](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#histogram) in the textbook!
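
To make the "percent per unit" vertical scale concrete, the sketch below computes bar heights numerically with NumPy instead of drawing them. It is not the plotting call this question asks for; the `durations` values and the name `example_bins` are made up for illustration.

```python
import numpy as np

# Hypothetical durations in minutes, invented for this example.
durations = np.array([2, 3, 7, 8, 9, 12, 14, 16, 18, 19])
example_bins = np.arange(0, 25, 5)   # bin edges 0, 5, 10, 15, 20

# density=True gives heights as proportion per unit; multiplying by 100
# converts them to percent per unit (here, percent per minute).
density, edges = np.histogram(durations, bins=example_bins, density=True)
percent_per_minute = 100 * density

# Each bar's area (height * width) is the percent of values in that bin,
# so the areas add up to 100.
total_area = np.sum(percent_per_minute * np.diff(edges))
print(percent_per_minute, total_area)
```

On this scale the height of a bar depends on the bin width as well as on how many values fall in the bin, which is why the horizontal unit matters for labeling the axes.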

In [None]:
equal_bins = np.arange(0, 120, 5)
...
...

<!-- END QUESTION -->

#### Question 4.2.

Assign `boston_under_10` and `manila_under_10` to the percentage of rides that are less than 10 minutes in their respective metropolitan areas. The table below shows you the height of each bar in the histogram, measured in percent per minute, for the first few bars of the histograms you produced in the previous question.

| bin      | Boston | Manila |
|----------|--------|--------|
| [0, 5)   | 1.2    | 0.6    |
| [5, 10)  | 3.2    | 1.4    |
| [10, 15) | 4.95   | 2.25   |
| [15, 20) | 4.6    | 2.55   |

Your solution should be determined using only the information in this table, and you should not access the tables `boston` and `manila` in any way.
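
The relationship between bar height and the percent of data in a bin can be sketched with made-up numbers. The heights below are hypothetical and deliberately differ from the values in the table above.

```python
# Hypothetical bar heights in percent per minute, invented for this
# example; they are not the values from the table above.
bin_width = 5                      # each bin spans 5 minutes
heights = [2.0, 3.0, 4.0]          # bins [0, 5), [5, 10), [10, 15)

# Area of a bar = height * width, and the area is the percent of data
# that falls in that bin.
percent_in_each_bin = [h * bin_width for h in heights]

# Percent of values below 10 = total area of the first two bars.
percent_below_10 = sum(percent_in_each_bin[:2])
print(percent_below_10)
```

The key point is that the height alone is not a percentage; it becomes one only after multiplying by the width of the bin.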

In [None]:
boston_under_10 = ...
manila_under_10 = ...

In [None]:
grader.check("q4_2")

#### Question 4.3.

Let's take a closer look at the distribution of ride times in Manila. Assign `manila_median_bin` to an integer (1, 2, 3, or 4) that corresponds to the bin that contains the median time.

1. 0-15 minutes  
2. 15-40 minutes  
3. 40-60 minutes  
4. 60-80 minutes  

**Hint:** The median of a sorted list has half of the list elements to its left, and half to its right.
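
One way to apply this hint to binned data is to accumulate the percent of data bin by bin until the running total reaches 50. The sketch below uses made-up bins and percentages; `bin_labels` and `percent_in_bin` are hypothetical and do not describe the Manila data.

```python
# Hypothetical percentages of data in unequal bins, invented for this example.
bin_labels = ["0-10", "10-30", "30-50", "50-100"]
percent_in_bin = [20, 25, 35, 20]   # sums to 100

# Walk through the bins in order, accumulating percentages; the median
# lies in the first bin where the running total reaches 50.
running_total = 0
median_label = None
for label, percent in zip(bin_labels, percent_in_bin):
    running_total += percent
    if running_total >= 50:
        median_label = label
        break

print(median_label)
```

Here the running totals are 20, 45, and 80, so the 50% mark is crossed inside the third bin.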

In [None]:
manila_median_bin = ...

In [None]:
grader.check("q4_3")

<!-- BEGIN QUESTION -->

#### Question 4.4.

Describe the main differences between the two histograms / the differences between the distributions of Uber ride lengths in these two cities. What about these two cities might be causing the differences in the distributions of ride times? 

**Write an explanation** of your opinion backed by the data displayed in the histograms created in the previous questions, and any additional statistical analysis you perform. Cite specific qualities of these cities when possible to add context to your explanation.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `hw03_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `hw03_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)