# Lesson 04: More Visualizations

Welcome to Lesson 04! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to understand the task you need to complete.

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Lesson

In today's lab, you'll learn about:

- more table operations.

- visualizations (histograms). 

Let's get started!

## Words of Caution

Remember to run the cell below. It's for setting up the environment so you can have access to what's needed for this lesson. For now, don't worry about what it means: we'll learn more about what's inside of it in the next few lessons.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams["patch.force_edgecolor"]=True
Table.interactive_plots()

## Histograms

### Categorical Distribution

In [None]:
top_movies = Table.read_table('data/top_movies_2017.csv')
top_movies

In [None]:
top_movies = top_movies.with_column('Millions', np.round(top_movies.column('Gross')/1000000,3))
top_movies.take(np.arange(10)).barh('Title', 'Millions')

In [None]:
studios = top_movies.select('Studio')
studios

Use the `group` method to count up the number of occurrences of each category.

In [None]:
studio_distribution = studios.group('Studio')

In [None]:
studio_distribution

In [None]:
sum(studio_distribution.column('count'))
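To see what `group` is computing, here is a minimal sketch in plain Python that counts occurrences with `collections.Counter`. The studio labels are invented for illustration and are not rows from `top_movies_2017.csv`.

```python
from collections import Counter

# Hypothetical studio labels, one per movie (not the real data set).
studios_list = ['Disney', 'Fox', 'Disney', 'Universal', 'Fox', 'Disney']

# Count how many times each category appears, as group('Studio') does.
studio_counts = Counter(studios_list)

# Print the categories in alphabetical order for readability.
for studio, count in sorted(studio_counts.items()):
    print(studio, count)

# The counts add up to the number of rows, like the sum(...) check above.
total = sum(studio_counts.values())
print(total)
```

Every movie belongs to exactly one studio, so the counts always sum to the number of rows.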

## Bar Charts

In [None]:
studio_distribution.barh('Studio')

In [None]:
studio_distribution.sort('count', descending=True).barh('Studio')

## Numerical Distribution

In [None]:
ages = 2021-top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)

In [None]:
top_movies

## Binning

In [None]:
min(ages), max(ages)

In [None]:
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 105)
my_bins

In [None]:
binned_data = top_movies.bin('Age', bins=my_bins)
binned_data

In [None]:
sum(binned_data.column('Age count'))

In [None]:
np.arange(0, 126, 25)

In [None]:
top_movies.bin('Age', bins=np.arange(0, 126, 25))

In [None]:
top_movies.bin('Age', bins=np.arange(0, 101, 25))

In [None]:
np.arange(0, 101, 25)

In [None]:
top_movies.where('Age', 100)
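The cells above raise a question: when the bin edges stop at 100, where does a movie of age exactly 100 go? The sketch below assumes that `Table.bin` follows the same convention as NumPy's `np.histogram`: each bin includes its left edge and excludes its right edge, except the last bin, which includes both. The ages are invented so that they sit on bin edges.

```python
import numpy as np

# Hypothetical ages chosen to sit on bin edges (not the real data).
sample_ages = np.array([0, 24, 25, 75, 99, 100])

# Edges 0, 25, 50, 75, 100 give bins [0, 25), [25, 50), [50, 75), [75, 100].
edges_to_100 = np.arange(0, 101, 25)
counts_to_100, _ = np.histogram(sample_ages, bins=edges_to_100)
print(counts_to_100)   # age 100 is counted in the last bin, which is closed

# Edges 0, 25, ..., 125 add a final bin [100, 125] that holds age 100.
edges_to_125 = np.arange(0, 126, 25)
counts_to_125, _ = np.histogram(sample_ages, bins=edges_to_125)
print(counts_to_125)
```

Either way every value is counted exactly once; only the bin that holds the largest value changes.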

## Histograms

In [None]:
my_bins

In [None]:
binned_data

Let's make our first histogram.

In [None]:
top_movies.hist('Age', bins=my_bins, unit='Year')

Let's try equally spaced bins instead.

In [None]:
top_movies.hist('Age', bins=np.arange(0, 110, 10), unit='Year')

Let's try not specifying any bins.

In [None]:
top_movies.hist('Age', unit='Year') 

Add a column containing what percent of movies are in each bin.

In [None]:
binned_data = binned_data.with_column(
    'Percent', 100*binned_data.column('Age count')/200)

In [None]:
binned_data
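The percent column is each bin's count divided by the total number of movies, times 100. A small sketch of that arithmetic with invented counts, using the sum of the counts as the total so that no number is hard-coded:

```python
# Hypothetical bin counts (not the real 'Age count' column).
age_counts = [30, 50, 20, 60, 40]

# Total number of movies across all bins.
total_movies = sum(age_counts)

# Percent of movies that fall in each bin.
percents = [100 * count / total_movies for count in age_counts]
print(percents)

# The percents across all bins add up to 100.
print(sum(percents))
```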

## Height

### What is the height of the [40, 65) bin?

**Step 1:** Calculate % of movies in the [40, 65) bin.

In [None]:
percent = binned_data.where('bin', 40).column('Percent').item(0)

**Step 2:** Calculate the width of the [40, 65) bin.

In [None]:
width = 65-40

**Step 3:** The area of a bar is the percent of movies in its bin. Since the area of a rectangle is `height * width`, we get `height = percent / width`.

In [None]:
height = percent/width
height
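The three steps above rest on the area principle: the area of a bar equals the percent of values in its bin. Here is a worked sketch with an assumed value of 20 percent for the [40, 65) bin. The real value comes from `binned_data`, and the `example_` names are used only here.

```python
# Assumed percent of movies in the [40, 65) bin, for illustration only.
example_percent = 20.0

# Width of the bin in years.
example_width = 65 - 40

# area = height * width, and the area is the percent in the bin,
# so height = percent / width, measured in percent per year.
example_height = example_percent / example_width
print(example_height)

# Multiplying the height by the width recovers the percent (the bar's area).
example_area = example_height * example_width
print(example_area)
```

The height is a density (percent per year), not a percent, which is why a wide bin can hold many movies and still have a short bar.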

### What are the heights of the rest of the bins?

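**Step 1:** Get the left endpoint of each bin. This is every row of `binned_data` except the last, which only marks the right end of the final bin.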

In [None]:
bin_lefts = binned_data.take(np.arange(binned_data.num_rows-1))

**Step 2:** Get the bin widths.

In [None]:
bin_widths = np.diff(binned_data.column('bin'))
bin_lefts = bin_lefts.with_column('Width', bin_widths)

**Step 3:** Get the bin heights.

In [None]:
bin_heights = bin_lefts.column('Percent')/bin_widths
bin_lefts = bin_lefts.with_column('Height', bin_heights)

In [None]:
bin_lefts
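The same calculation can be sketched for every bin at once with NumPy, using invented ages and the lesson's unequal bin edges. The sketch also compares the hand-computed heights with `np.histogram(..., density=True)`, which returns proportion per unit, so multiplying it by 100 gives percent per unit. That `Table.hist` draws its bars at these heights is an assumption based on the area principle used in this lesson.

```python
import numpy as np

# Hypothetical ages (not the real data) and the lesson's unequal bin edges.
sample_ages = np.array([1, 3, 7, 12, 18, 22, 30, 45, 50, 80])
edges = np.array([0, 5, 10, 15, 25, 40, 65, 105])

# Count the ages in each bin, then convert the counts to percents.
counts, _ = np.histogram(sample_ages, bins=edges)
percents = 100 * counts / counts.sum()

# Width of each bin in years, then height in percent per year.
widths = np.diff(edges)
heights = percents / widths

# The bar areas (height * width) add back up to 100 percent.
total_area = np.sum(heights * widths)
print(total_area)

# density=True gives proportion per year; times 100 is percent per year.
density, _ = np.histogram(sample_ages, bins=edges, density=True)
print(np.allclose(heights, 100 * density))
```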

In [None]:
top_movies.hist('Age', bins=my_bins, unit='Year')