# Lesson 03: Table Operations and Visualizations

Welcome to Lesson 03! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to understand the task you need to complete.

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Lesson

In today's lab, you'll learn about:

- table operations.

- visualizations. 

Let's get started!

## Words of Caution

Remember to run the cell below. It's for setting up the environment so you can have access to what's needed for this lesson. For now, don't worry about what it means: we'll learn more about what's inside of it in the next few lessons.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Reading a Table from a File

In [None]:
du_bois = Table.read_table('data/du_bois.csv')
du_bois

**Question 1.** Which group ("`CLASS`") spent the highest percentage on rent?

In [None]:
du_bois.select('STATUS')

In [None]:
du_bois.column('STATUS').item(3)

In [None]:
du_bois.select('STATUS')

In [None]:
du_bois.column('STATUS')

In [None]:
du_bois.column('ACTUAL AVERAGE')

In [None]:
du_bois.column('FOOD')

**Question 2.** What is the dollar amount spent on food?

In [None]:
du_bois.column('ACTUAL AVERAGE') * du_bois.column('FOOD')

In [None]:
food_dollars = du_bois.column('ACTUAL AVERAGE') * du_bois.column('FOOD')
du_bois = du_bois.with_column(
    'Food $', food_dollars
)
du_bois

In [None]:
du_bois.set_format('FOOD', PercentFormatter)

In [None]:
du_bois.select('CLASS', 'ACTUAL AVERAGE', 'Food $', 'FOOD')

In [None]:
du_bois.column('FOOD')

In [None]:
du_bois.drop('OTHER')

## Selecting Data in a Column

In [None]:
movies = Table.read_table('data/movies_by_year_with_ticket_price.csv')
movies.show()

In [None]:
# Scale 'Total Gross' by one million to express it in dollars
gross_in_dollars = movies.column('Total Gross') * 1e6
gross_in_dollars

**Question 3.** How many tickets were sold for each movie?

In [None]:
tix_sold = ...
tix_sold

In [None]:
movies = movies.with_column('Tickets sold', tix_sold)

In [None]:
movies.show(4)

In [None]:
movies.set_format('Tickets sold', NumberFormatter)

In [None]:
movies.plot('Year', 'Tickets sold')

In [None]:
movies.where('Year', are.between(2000, 2005))

In [None]:
movies.where('Year', 2002)

In [None]:
movies.where('Year', are.equal_to(2002))

In [None]:
movies.where('#1 Movie', are.containing('Harry Potter'))

In [None]:
movies.take(np.arange(2, 5))

## Visualization ##

### Census 2017

In [None]:
full = Table.read_table('data/nc-est2017-agesex-res.csv')
full

In [None]:
full.sort('AGE')

In [None]:
partial = full.select('SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2017')
partial.show(4)

In [None]:
simple = partial.relabeled(2, '2010').relabeled(3, '2017')
simple.show(4)

In [None]:
simple.sort('AGE')

In [None]:
simple.sort('AGE', descending=True)

In [None]:
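# AGE 999 marks the all-ages total rows, so keep only ages below 999
no_999 = simple.where('AGE', are.below(999))
# SEX 0 is the combined count (1 is male, 2 is female)
everyone = no_999.where('SEX', 0).drop('SEX')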

In [None]:
everyone

### Line Chart 

We can use the [`plot`](http://data8.org/datascience/_autosummary/datascience.tables.Table.plot.html?highlight=plot#datascience.tables.Table.plot) method to draw a line chart.

In [None]:
everyone.plot('AGE')

To make the plot interactive we can use `Table.interactive_plots()`.

In [None]:
Table.interactive_plots()
everyone.plot('AGE')

In [None]:
everyone.plot('AGE', '2010')

In [None]:
Table.static_plots()
everyone.plot('AGE')

In [None]:
everyone.plot('AGE')
# Overlay a vertical line at age 81, running from y = 0 to y = 5,000,000
plt.plot([81, 81], [0, 5000000]);

### Census 2019

In [None]:
full = Table.read_table('data/nc-est2019-agesex-res.csv')
full

**Question 4.** Create a table with `SEX`, `AGE`, `POPESTIMATE2010`, `POPESTIMATE2019`.

In [None]:
partial = full.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2019')

**Question 5.** Relabel the columns `POPESTIMATE2010` and `POPESTIMATE2019` as `2010` and `2019`.

In [None]:
simple = partial.relabeled(2, '2010').relabeled(3, '2019')

**Question 6.** Remove the age totals (i.e., observations with age 999).

In [None]:
no_999 = simple.where('AGE', are.below(999))

**Question 7.** Remove the male-only and female-only rows (i.e., keep only the combined observations, where `SEX` is 0).

In [None]:
everyone = no_999.where('SEX', 0).drop('SEX')

## Line Plots ##

In [None]:
everyone.plot('AGE')

The plot above should be labeled. Here are two ways to label it.

In [None]:
everyone.plot('AGE')

# Print out a title 
print('US Population for 2010 and 2019')

In [None]:
everyone.plot('AGE')

# Use plt.title to put a title on the chart itself
plt.title('US Population for 2010 and 2019');

## Males and Females in 2019

Let's compare male and female counts per age.

In [None]:
males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')

Let's make a new table with the male and female population counts at each age for 2019.

In [None]:
pop_2019 = Table().with_columns(
    'Age', males.column('AGE'),
    'Males', males.column('2019'),
    'Females', females.column('2019')
)
pop_2019

In [None]:
pop_2019.plot('Age')

First let's find the total population for each age.

In [None]:
total = pop_2019.column('Males')+pop_2019.column('Females')
total

Now we can calculate the percent female for each age.

In [None]:
pct_female = pop_2019.column('Females')/total*100
pct_female
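
The arithmetic in the two cells above is element-wise: each entry of one array is paired with the entry at the same position in the other. As a minimal sketch, using hypothetical counts that are not taken from the census file (all names ending in `_demo` are made up for illustration):

```python
import numpy as np

# Hypothetical counts for three ages (not real census values).
males_demo = np.array([100, 100, 300])
females_demo = np.array([100, 300, 100])

# Element-wise addition gives one total per age.
total_demo = males_demo + females_demo

# Element-wise division gives the female share; multiplying by 100 makes it a percentage.
pct_female_demo = females_demo / total_demo * 100
print(pct_female_demo)
```

The first age has equal counts, so its entry is 50 percent; the other two follow the same share-of-total calculation.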

Let's round the percentage to two decimal places so it's easier to read. 

In [None]:
pct_female = np.round(pct_female, 2)
pct_female

Now we can add the `pct_female` column to the `pop_2019` table.

In [None]:
pop_2019 = pop_2019.with_column('Percent female', pct_female)
pop_2019

In [None]:
pop_2019.plot('Age', 'Percent female')

Look at the y-axis. The trend is not as dramatic as you might think.

In [None]:
pop_2019.plot('Age', 'Percent female')
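
One way to judge how dramatic the trend is, assuming the goal is to view it against the full 0 to 100 percent range, is to fix the y-axis limits. The sketch below uses plain Matplotlib with made-up values (`ages_demo` and `pct_demo` are hypothetical, not census data). With the table, calling `plt.ylim(0, 100)` right after the `plot` call should have a similar effect when static plots are in use.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: a percentage that drifts upward with age.
ages_demo = np.arange(0, 101)
pct_demo = np.linspace(49, 67, ages_demo.size)

fig, ax = plt.subplots()
ax.plot(ages_demo, pct_demo)

# Show the full percentage scale instead of letting the axis zoom in on the data.
ax.set_ylim(0, 100)
ax.set_xlabel('Age')
ax.set_ylabel('Percent female')
```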

## Scatter Plots ##

In [None]:
actors = Table.read_table('data/actors.csv')
actors

In [None]:
actors.scatter('Number of Movies', 'Total Gross')

In [None]:
actors.scatter('Number of Movies', 'Average per Movie')

In [None]:
actors.where('Average per Movie', are.above(400))

## Bar Charts ##

In [None]:
top_movies = Table.read_table('data/top_movies_2017.csv').sort('Gross (Adjusted)', descending=True)
top_movies

What are the top 10 movies by adjusted gross revenue?

In [None]:
np.arange(10)
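
`np.arange` produces consecutive integers that start at 0 by default and stop just before the end value, which is why it works as a list of row indices for `take`. A small sketch of that behavior, using a plain NumPy array of placeholder labels (`labels_demo` is hypothetical) in place of a table:

```python
import numpy as np

first_ten = np.arange(10)   # 0 through 9; the stop value 10 is excluded
middle = np.arange(2, 5)    # 2, 3, 4

# Indexing an array with these integers picks out the items at those positions,
# much as take picks out table rows by index.
labels_demo = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
print(labels_demo[middle])
```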

In [None]:
top10_adjusted = top_movies.take(np.arange(10))
top10_adjusted

Let's convert to millions of dollars for readability.

In [None]:
millions = np.round(top10_adjusted.column('Gross (Adjusted)')/1000000, 3)
top10_adjusted = top10_adjusted.with_column('Millions', millions)
top10_adjusted
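
Dividing by 1,000,000 and rounding to three decimal places keeps each value accurate to the nearest thousand dollars. A small sketch of the same conversion on hypothetical dollar amounts (`gross_demo` is made up, not read from the movie file):

```python
import numpy as np

# Hypothetical gross figures in dollars.
gross_demo = np.array([1_234_567_890, 987_654_321])

# Convert to millions and round to three decimal places.
millions_demo = np.round(gross_demo / 1000000, 3)
print(millions_demo)
```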

A line plot doesn't make sense here, because the ten movies are separate items rather than points along a continuous trend: don't do this!

In [None]:
top10_adjusted.plot('Year', 'Millions')

In [None]:
top10_adjusted.barh('Title', 'Millions')