# Lab 3: Visualizations

Please complete this lab by providing answers in cells after the question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
import numpy as np
from datascience import *

# Set up plotting: import matplotlib and display figures inline in the notebook.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

## Reviewing Table Operations

Let's review some of the table operations we went over last week. We used methods such as `group` and `pivot` to get summaries of categorical variables. Let's take a look again at how it works using the `imdb` and `population` datasets.

In [None]:
imdb = Table.read_table('imdb.csv')
imdb.show(5)

In [None]:
population_amounts = Table.read_table("world_population.csv")
years = np.arange(1950, 2016)
population = population_amounts.with_columns("Year", years)
population.show(5)

### `group`
The table method `group` takes as its argument a string representing a column name and returns the count of each category of that variable. **Remember to only use this with categorical variables, since it won't really make sense with numerical data.** Note that even if a variable contains numbers, that doesn't mean it's a numerical variable. `Decade`, for example, is stored as numbers, but each value labels a category, so we can find the number of movies that were released in each decade using `group` with the `Decade` variable.

In [None]:
movies_by_decade = imdb.group('Decade')
movies_by_decade
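
To see what `group` is counting, here is a minimal plain-Python sketch. It does not use the `datascience` library, and the decade labels are made up for illustration rather than taken from `imdb.csv`.

```python
from collections import Counter

# Made-up decade labels, one per movie (hypothetical data, not from imdb.csv).
decades = [1990, 2000, 1990, 2010, 2000, 1990]

# Counting how often each category appears is the idea behind `group`.
counts = Counter(decades)

# Print one line per category, in sorted order.
for decade in sorted(counts):
    print(decade, counts[decade])
```

Each printed line pairs a category with the number of rows in that category, which is the same kind of summary `group` gives for a single column.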

We can also use the `collect` parameter to summarize other variables split up by the categories in `Decade`.

In [None]:
imdb.group('Decade', collect = np.mean)
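
Conceptually, summarizing with `collect` happens in two steps: split the values up by category, then apply the summary function to each group. The sketch below shows those two steps in plain Python, using made-up (decade, rating) pairs and `statistics.mean` as a stand-in for `np.mean`.

```python
from collections import defaultdict
from statistics import mean

# Made-up (decade, rating) pairs (hypothetical data, not from imdb.csv).
movies = [(1990, 8.0), (2000, 9.0), (1990, 8.5), (2000, 8.0)]

# Step 1: split the ratings up by decade.
ratings_by_decade = defaultdict(list)
for decade, rating in movies:
    ratings_by_decade[decade].append(rating)

# Step 2: apply the summary function to each group's values.
mean_rating = {decade: mean(values) for decade, values in ratings_by_decade.items()}
print(mean_rating)
```

Swapping `mean` for a different summary function changes step 2 only; the splitting in step 1 stays the same.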

### `pivot`

You can use `pivot` to create contingency tables, which show the counts for each combination of two categorical variables. You can also use `pivot` to find, for example, the mean of a third variable within each combination of categories. We'll focus on the first use for now. Let's say we want to find out how the movies rated higher than 8.4 are distributed across decades.

In [None]:
imdb.pivot('Highly Rated','Decade')

In [None]:
imdb.pivot('Highly Rated','Decade', values = 'Rating', collect = np.mean)
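
A contingency table is a count for every combination of two categories. As a rough plain-Python sketch of that idea (made-up data; the `highly_rated` flag here is a hypothetical stand-in for the `Highly Rated` column):

```python
from collections import Counter

# Made-up (decade, highly_rated) pairs (hypothetical data, not from imdb.csv).
movies = [(1990, True), (1990, False), (2000, True), (1990, True), (2000, True)]

# Count each (row category, column category) combination.
cell_counts = Counter(movies)

# Print one row per decade and one column per value of highly_rated.
print("Decade", False, True)
for decade in sorted({d for d, _ in movies}):
    print(decade, cell_counts[(decade, False)], cell_counts[(decade, True)])
```

Combinations that never occur get a count of 0, which is why every decade row has an entry in both columns.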

<font color = 'red'>**Question 1. Suppose we wanted to find the median rating of movies by decade. How might we do this?**</font>

*Hint:* You might want to use `np.median`.

<font color = 'red'>**Question 2. What if we wanted to figure out the distribution of highly rated movies among movies more than 50 years old? How might we do this?**</font>

*Hint:* You can use an intermediate step and create a new Table.

## Visualizations

Why did we spend so much time on manipulating data? Because to get the visualization that we want, we're going to need to use many of these tools. In this section, we'll use the `imdb` dataset to explore some basic descriptive statistics and data visualizations.

### Histograms

We can make histograms by using the `.hist` method. To do this, though, the Table you call it on should contain only the variable you want to plot. That is, your data needs to be in this form:

In [None]:
imdb_rating = imdb.select('Rating')
imdb_rating.show(5)

We then use the `.hist` method to create a histogram.

In [None]:
imdb_rating.hist()

You can control characteristics of the histogram using parameters such as `bins`.

In [None]:
imdb_rating.hist(bins = 5)
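
To get a feel for why the number of bins matters, the sketch below counts values by hand. It assumes equal-width bins that span the smallest to the largest value, with the largest value placed in the last bin; that is a common convention, but treat it as an assumption rather than a description of exactly what `.hist` does. The helper `bin_counts` and the ratings are made up for illustration.

```python
# Made-up ratings (hypothetical data, not from imdb.csv).
ratings = [7.0, 7.25, 7.25, 7.5, 7.75, 8.0, 8.5, 9.0]

def bin_counts(values, num_bins):
    """Count how many values land in each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for value in values:
        # The largest value would index one past the end, so clamp it into the last bin.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

two_bins = bin_counts(ratings, 2)
four_bins = bin_counts(ratings, 4)
print(two_bins)
print(four_bins)
```

With two bins, all we can see is that more values sit in the lower half. With four bins, the dip in the third bin becomes visible. Too few bins hide detail, while too many leave most bins nearly empty.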

### Boxplots

Another visualization you can use for numerical data is a boxplot. As you might be able to guess, you can create this using the `boxplot` method. The data must be in a similar format.

In [None]:
imdb_rating.boxplot()
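
The box in a boxplot runs from the first quartile to the third quartile, with a line at the median. The sketch below computes those values for a made-up list of ratings using Python's `statistics` module. Quartiles can be computed in several slightly different ways, so the `"inclusive"` method used here is one reasonable choice, not necessarily the one the plotting code uses.

```python
from statistics import quantiles

# Made-up ratings (hypothetical data, not from imdb.csv).
ratings = [7.0, 7.25, 7.25, 7.5, 7.75, 8.0, 8.5, 9.0]

# Cut the data into four equal parts; q2 is the median.
q1, q2, q3 = quantiles(ratings, n=4, method="inclusive")

# The interquartile range (IQR) is the height of the box.
iqr = q3 - q1

print("min:", min(ratings))
print("Q1:", q1)
print("median:", q2)
print("Q3:", q3)
print("max:", max(ratings))
print("IQR:", iqr)
```

Here the median sits closer to Q1 than to Q3, which is the kind of asymmetry a boxplot makes easy to spot.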

<font color = 'red'>**Question 3. Make a histogram of the number of votes that movies got. What can you say about the shape and center of the distribution based on the histogram? Try using a few different bin values. Which one would be best to use?**</font>

<font color = 'red'>**Question 4. Make a boxplot of the number of votes that movies got. What can you say about the shape and center of the distribution?**</font>

### Line Plot

Let's say we want to look at the change in movie ratings over time. To investigate this, we'll plot the mean rating over time. 

First, we need to get the data in the form we want it. We want a column with the different decades in our dataset (the x-axis) as well as a column with the mean rating for that decade (the y-axis). To obtain a table with this data, we first `select` the two columns we need and then use the `group` method, passing `collect = np.mean` as an argument.

In [None]:
rating_by_decade = imdb.select('Decade','Rating').group('Decade', collect = np.mean)
rating_by_decade

In [None]:
# The variable to go on the x-axis is specified as the argument
rating_by_decade.plot('Decade')

<font color = 'red'>**Question 5. Make a line plot of the mean number of votes by decade. Describe the trend.**</font>

### Categorical Variables: Bar Chart

To make bar charts, we first need to create the summaries of the groups that we want to graph. Remember: bar charts are used for categorical variables. This means that we want to get the **counts** for each **category** in that variable. We can do this using the `group` method. 

In [None]:
movies_by_decade = imdb.group('Decade')
movies_by_decade.show(4)

We then call the `barh` method on this table, specifying the column that holds the category labels.

In [None]:
movies_by_decade.barh('Decade')

To create a bar chart with multiple categories, we can use the `pivot` method to create a contingency table and use that instead of the `group` method.

In [None]:
imdb.pivot('Highly Rated','Decade').barh('Decade')

<font color = 'red'>**Question 6. Create a bar graph with the counts of movies that were highly rated and not highly rated. Based on this graph and the graph of movies split by rating by decade, does any decade look particularly unusual to you?**</font>

## Practice with Descriptive Statistics and Visualizations

For the last few questions, we will use the Cards Against Humanity Pulse of the Nation dataset to look at what characteristics might be associated with supporting federal funding of scientific research. Let's start by bringing in the dataset and taking a look at the variables.

In [None]:
cah = Table.read_table('201709-CAH_PulseOfTheNation.csv')

In [None]:
cah.show(5)

The `Federal Funding of Scientific Research` column has answers to the question: "Is federal funding of scientific research too high, too low, or about right?" The `DK/REF` field refers to people who said they didn't know or refused to answer. 

<font color = 'red'>**Question 7. How many people who took this survey thought that federal funding of scientific research was too high? Too low? About right? Create a visualization that shows the distribution of answers to this question.**</font>

<font color = 'red'>**Question 8. Is level of education associated with opinions about federal funding of scientific research? Use descriptive statistics and visualizations to support your answer.**</font>

<font color = 'red'>**Question 9. Is income associated with opinions about federal funding of scientific research? Use descriptive statistics and visualizations to support your answer.**</font>