# Assignment 2: Exploratory Data Analysis

This assignment covers **Chapters 6-8** from the textbook as well as lecture material from Weeks 3-4. Please complete this assignment by providing your answers in cells after each question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you have finished the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

This assignment is due by **11:59pm on Tuesday, February 28**.

In [None]:
import numpy as np
from datascience import *


# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

# This is if you want your plots to have the FiveThirtyEight style
plt.style.use('fivethirtyeight')

### Question 1. Unemployment

The Federal Reserve Bank of St. Louis publishes data about jobs in the US.  Below, we've loaded data on unemployment in the United States. There are many ways of defining unemployment, and our dataset includes two notions of the unemployment rate:

1. Among people who are able to work and are looking for a full-time job, the percentage who can't find a job.  This is called the Non-Employment Index, or NEI.
2. Among people who are able to work and are looking for a full-time job, the percentage who can't find any job *or* are only working at a part-time job.  The latter group is called "Part-Time for Economic Reasons", so the acronym for this index is NEI-PTER.  (Economists are great at marketing.)

The source of the data is [here](https://fred.stlouisfed.org/categories/33509).

**a)** The data are in a CSV file called `unemployment.csv`.  Load that file into a table called `unemployment`.

**b)** It's believed that many people became PTER (recall: "Part-Time for Economic Reasons") in the "Great Recession" of 2008-2009.  NEI-PTER is the percentage of people who are unemployed (and counted in the NEI) plus the percentage of people who are PTER.  Compute an array containing the percentage of people who were PTER in each quarter.  (The first element of the array should correspond to the first row of `unemployment`, and so on.) Call the array `pter`.

**c)** Add `pter` as a column to `unemployment` (named "PTER") and sort the resulting table by that column in descending order.  Call this new table `by_pter`.

**d)** We want to create a line plot of PTER over time. To do this, we first add the `year` array and the `pter` array to the `unemployment` table as columns labeled `Year` and `PTER`, respectively. Use the code below to create `pter_over_time`.

In [None]:
# Make sure your previous sections have created by_pter and pter properly for this to work!
year = 1994 + np.arange(by_pter.num_rows)/4

In [None]:
pter_over_time = unemployment.with_columns('Year', year, 'PTER', pter)
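
The `year` expression above turns row positions into fractional years. The provided code implies that the first row is the first quarter of 1994 and that each later row is one quarter further on, so row `i` maps to `1994 + i/4`. Here is a minimal sketch of that arithmetic, using a hypothetical six-row table rather than the real data (`num_rows` stands in for `by_pter.num_rows`):

```python
import numpy as np

# Hypothetical row count for illustration only; the real value comes
# from by_pter.num_rows in the cell above.
num_rows = 6

# np.arange(num_rows) is 0, 1, 2, ...; dividing by 4 converts a count
# of quarters into a fraction of a year.
year = 1994 + np.arange(num_rows) / 4

print(year.tolist())
# [1994.0, 1994.25, 1994.5, 1994.75, 1995.0, 1995.25]
```

A value such as `1994.25` therefore marks the second quarter of 1994, which is how the x-axis of the line plot should be read.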

Create a line plot using `pter_over_time`.

**e)** Were PTER rates high during or directly after the Great Recession (that is to say, were PTER rates particularly high in the years 2008 through 2011)?

### Question 2. Cards Against Humanity - Pulse of the Nation Public Opinion Poll

In this question, we'll explore the Cards Against Humanity poll dataset further using the tools we've learned. For your convenience, we have made a subset of the data available to you in a CSV file called `201709-CAH_PulseOfTheNation.csv`.

In [None]:
cah = Table.read_table('201709-CAH_PulseOfTheNation.csv')

**a)** Create a contingency table with the frequencies of people who believe in ghosts and level of education, with `Believe in Ghosts` going across the top and `Level of Education` on the side. Assign the table to `ghosts_by_education`.

**b)** Use the `ghosts_by_education` table to create an appropriate visualization.

**c)** Run the following cell and explain what it does. What does the `percents` function do? What does the `distributions` table represent?

*Note:* Make sure you set up the `ghosts_by_education` table properly in part a. You might need to change the order of the variables.

In [None]:
def percents(array_x):
    return np.round( (array_x/sum(array_x))*100, 2)

labels = ghosts_by_education.labels

distributions = Table().with_columns(labels[0], ghosts_by_education.column(0),
                                     labels[1], percents(ghosts_by_education.column(1)),
                                     labels[2], percents(ghosts_by_education.column(2)),
                                     labels[3], percents(ghosts_by_education.column(3)))
distributions.show()

**d)** Do people who are less educated tend to believe in ghosts more? Use the summaries and visualizations you created above to justify your answer.

**e)** Write a function called `is_not_zero` which takes one argument and returns `True` if that value isn't 0 and `False` if the value is 0.

**f)** The variable `Transformers Movies` records the number of Transformers movies that the respondent saw in the past year. Add a new column to `cah` called "Transformers" which is `True` if the person has watched any Transformers movies and `False` if they haven't.

**g)** How many people have watched any Transformers movies? How many have not? 

**h)** Is there a relationship between level of education and having watched at least one Transformers movie? Use the same analysis as in parts a-c to answer the question. 