In [None]:
import pandas as pd
import requests

# Welcome to data exploration with Python and Jupyter!

## Here's what to expect:

### Morning

1. [Overview of Jupyter Notebooks](#1.-Overview-of-Jupyter)
2. [Overview of Python syntax with calculation](#2.-Overview-of-Python-syntax)
3. [Refactoring our calculation using Python data structures](#3.-Overview-of-Python-data-structures)

### After First Break

4. [Introducing pandas and cleaning text data](#4.-Introducing-pandas-and-text-cleaning)
5. [Exploration: movie quotes](#5.-Exploration:-movie-quotes)
6. [Introduction to city population data](#6.-Introduction-to-city-population-data)

### Afternoon/After Lunch

7. [City population data prep](#7.-City-population-data-prep)

### After Second Break

8. [Exploration: city population data](#8.-Exploration:-city-population-data)

### Generally:

# <center>CAT GIFS</center>

![yay](https://78.media.tumblr.com/42a41bd1ace113eb410b0005192c2275/tumblr_p8xzms08OS1qhy6c9o1_500.gif)

# 1. Overview of Jupyter

In [2]:
%pwd

'/Users/melissa/Documents/data-exploration-workshop/completed_notebooks'

In [22]:
%quickref

In [None]:
%who int

# 2. Overview of Python syntax

## How much money does making coffee at home save?

Simple calculation

In [5]:
# multiply coffee and tip to get cost

2 * 1.2

2.4

In [6]:
# name variables

latte = 4
drip_coffee = 2
tip = .2

How do the coffee and tip variables differ?

In [25]:
# Check `type`

print(type(latte))
print(type(tip))

<class 'int'>
<class 'float'>
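
One reason the types matter: arithmetic that mixes an `int` and a `float` gives back a `float`. The sketch below reuses the `latte` and `tip` values from above; the name `latte_with_tip` is introduced here only for illustration.

```python
latte = 4   # whole dollars, so Python stores an int
tip = .2    # a fraction, so Python stores a float

# Mixing an int and a float in arithmetic produces a float
latte_with_tip = latte * (1 + tip)

print(latte_with_tip)
print(type(latte_with_tip))
```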


# 3. Overview of Python data structures

Python's built-in data structures include lists, dictionaries, tuples and sets. However, we only need the first two today.

#### Create a `list` of coffee preparations that you might order.

In [26]:
coffee_preps = ['latte', 'espresso', 'drip_coffee']

#### Use a `dict` to allow comparison of multiple coffee types 

In [9]:
price_lookup = {'latte': 5, 'espresso': 3, 'drip_coffee': 2}

In [10]:
price_lookup['latte']

5
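
The list and the dict work well together: the list holds the names in order, and each name is a key into the dict. Here is a minimal sketch of that pairing, using the same values as above. The `cheapest` variable is a hypothetical extra, not part of the workshop code.

```python
coffee_preps = ['latte', 'espresso', 'drip_coffee']
price_lookup = {'latte': 5, 'espresso': 3, 'drip_coffee': 2}

# Walk through the list and use each name as a key into the dict
for prep in coffee_preps:
    print(prep, price_lookup[prep])

# Hypothetical extra: find the key whose price is lowest
cheapest = min(price_lookup, key=price_lookup.get)
print(cheapest)
```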

How do we figure out the amount of money saved per coffee made at home? For simplicity, let's just do this calculation with drip coffee.

Related figures:

- 1 oz. = 28.3495 grams
- A cup of French press coffee uses about 15 grams of coffee grounds

#### Assign the appropriate values to these variables

In [11]:
gram = 1
ounce = gram * 28.3495
serving = gram * 15
dollars_per_ounce = 0.75

#### Construct a calculation for comparing these:

In [32]:
home_coffee = (serving/ounce) * dollars_per_ounce

In [33]:
home_coffee

0.3968323956330799

In [34]:
drip_coffee/home_coffee

5.0399111111111115
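
So a cafe drip coffee costs roughly five times as much as the grounds for a cup at home. If you would rather see dollars saved than a ratio, the same figures can be wrapped in a small function. This is only a sketch: the function name `home_cost_per_cup` and the one-cup-a-day figure are assumptions made for illustration.

```python
drip_coffee = 2            # cafe price in dollars, from earlier
grams_per_ounce = 28.3495
grams_per_serving = 15
dollars_per_ounce = 0.75

def home_cost_per_cup(grams_per_serving, dollars_per_ounce):
    """Cost in dollars of the grounds for one home-brewed cup."""
    ounces_per_serving = grams_per_serving / grams_per_ounce
    return ounces_per_serving * dollars_per_ounce

home_coffee = home_cost_per_cup(grams_per_serving, dollars_per_ounce)
savings_per_cup = drip_coffee - home_coffee

# Assumption for illustration only: one cup every day of the year
cups_per_year = 365
print(round(savings_per_cup, 2))
print(round(savings_per_cup * cups_per_year, 2))
```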

# First Break

![kites](https://78.media.tumblr.com/2c05bccd7e1c9353966e4095dc5caf1b/tumblr_ootmmjLVYj1qhy6c9o1_500.gif)

# 4. Introducing pandas and text cleaning

In [12]:
quotes = pd.read_html('https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movie_Quotes', header=0)

What did we get back?

In [13]:
# Check type
type(quotes)

list

Given the data type, how do we check for the information we want?

In [14]:
quotes[2]

Unnamed: 0,Rank,Quotation,Character,Actor/Actress,Film,Year
0,1,"""Frankly, my dear, I don't give a damn..""",Rhett Butler,Clark Gable,Gone with the Wind,1939
1,2,"""I'm gonna make him an offer he can't refuse.""",Vito Corleone,Marlon Brando,The Godfather,1972
2,3,"""You don't understand! I coulda had class. I c...",Terry Malloy,Marlon Brando,On the Waterfront,1954
3,4,"""Toto, I've a feeling we're not in Kansas anym...",Dorothy Gale,Judy Garland,The Wizard of Oz,1939
4,5,"""Here's looking at you, kid.""",Rick Blaine,Humphrey Bogart,Casablanca,1942
5,6,"""Go ahead, make my day.""",Harry Callahan,Clint Eastwood,Sudden Impact,1983
6,7,"""All right, Mr. DeMille, I'm ready for my clos...",Norma Desmond,Gloria Swanson,Sunset Boulevard,1950
7,8,"""May the Force be with you.""",Han Solo,Harrison Ford,Star Wars,1977
8,9,"""Fasten your seatbelts. It's going to be a bum...",Margo Channing,Bette Davis,All About Eve,1950
9,10,"""You talkin' to me?""",Travis Bickle,Robert De Niro,Taxi Driver,1976
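
Guessing the index works, but you can also look at every table in the list before choosing. The sketch below uses two tiny made-up DataFrames as a stand-in for what `pd.read_html` returns, so it runs without a network connection. The names `tables` and `wanted` are hypothetical.

```python
import pandas as pd

# Stand-in for the list that pd.read_html returns: one DataFrame per HTML table.
# These tiny tables are made up for illustration.
tables = [
    pd.DataFrame({'Note': ['navigation box']}),
    pd.DataFrame({'Rank': [1, 2], 'Quotation': ['a', 'b'], 'Year': [1939, 1972]}),
]

# Print each table's position, shape and column names to spot the one we want
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))

# Pick the first table that has the column we care about
wanted = next(t for t in tables if 'Quotation' in t.columns)
print(wanted)
```

With the real `quotes` list you would loop over `quotes` instead of `tables`.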


In [15]:
df = quotes[2]

How do we know our data is as expected?

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
Rank             100 non-null int64
Quotation        100 non-null object
Character        100 non-null object
Actor/Actress    100 non-null object
Film             100 non-null object
Year             100 non-null int64
dtypes: int64(2), object(4)
memory usage: 4.8+ KB


In [17]:
df.head()

Unnamed: 0,Rank,Quotation,Character,Actor/Actress,Film,Year
0,1,"""Frankly, my dear, I don't give a damn..""",Rhett Butler,Clark Gable,Gone with the Wind,1939
1,2,"""I'm gonna make him an offer he can't refuse.""",Vito Corleone,Marlon Brando,The Godfather,1972
2,3,"""You don't understand! I coulda had class. I c...",Terry Malloy,Marlon Brando,On the Waterfront,1954
3,4,"""Toto, I've a feeling we're not in Kansas anym...",Dorothy Gale,Judy Garland,The Wizard of Oz,1939
4,5,"""Here's looking at you, kid.""",Rick Blaine,Humphrey Bogart,Casablanca,1942


# 5. Exploration: movie quotes

How often does Brando appear?

In [18]:
df.loc[df['Actor/Actress'].str.contains("Brando")]

Unnamed: 0,Rank,Quotation,Character,Actor/Actress,Film,Year
1,2,"""I'm gonna make him an offer he can't refuse.""",Vito Corleone,Marlon Brando,The Godfather,1972
2,3,"""You don't understand! I coulda had class. I c...",Terry Malloy,Marlon Brando,On the Waterfront,1954
44,45,"""Stella! Hey, Stella!""",Stanley Kowalski,Marlon Brando,A Streetcar Named Desire,1951
46,47,"""Shane. Shane. Come back!""",Joey Starrett,Brandon De Wilde,Shane,1953
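
Look closely at the last row: `str.contains("Brando")` is a substring match, so it also picks up Brandon De Wilde. One possible way to tighten the filter is a regular expression with word boundaries. The sketch below uses a three-row sample typed in from the output above rather than the full table.

```python
import pandas as pd

# A few rows copied from the output above
sample = pd.DataFrame({
    'Actor/Actress': ['Clark Gable', 'Marlon Brando', 'Brandon De Wilde'],
    'Film': ['Gone with the Wind', 'The Godfather', 'Shane'],
})

# Substring match: also catches "Brandon"
loose = sample.loc[sample['Actor/Actress'].str.contains('Brando')]

# Word-boundary regex: "Brando" must be a whole word
strict = sample.loc[sample['Actor/Actress'].str.contains(r'\bBrando\b', regex=True)]

print(len(loose), len(strict))
```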


Which year has the greatest number of quotes on the list?

In [19]:
# Hint: Start with df.groupby() and use shift + tab to look at its parameters

# df.groupby('Year').count().sort_values(by='Film', ascending=False)

grouped = df.groupby('Year')

In [20]:
grouped

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x1174ee860>

What is the GroupBy object?

You can read more in its documentation [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-object-attributes).

But you can also explore!
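
If you want a head start on exploring, here are a few GroupBy attributes and methods worth trying. This is a sketch on a three-row sample, not the full quotes table.

```python
import pandas as pd

sample = pd.DataFrame({
    'Film': ['Gone with the Wind', 'The Wizard of Oz', 'The Godfather'],
    'Year': [1939, 1939, 1972],
})
grouped = sample.groupby('Year')

print(grouped.ngroups)          # number of distinct years
print(grouped.size())           # rows in each group
print(grouped.get_group(1939))  # the rows for one year
```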

In [21]:
# Use tab to explore the `grouped` object

grouped.

SyntaxError: invalid syntax (<ipython-input-21-1930e7d85537>, line 3)

What does it look like to find the number of films per year?

In [None]:
grouped.count()

This is a start, but how do we sort what we've found?

In [None]:
grouped_by_count = grouped.count()

Find your method:

In [None]:
grouped_by_count.sort_values(by='Film')

This is useful! But it'd be more useful if it were sorted in descending order.

How do we figure out whether this is an option for the `sort_values` function?

In [None]:
grouped_by_count.sort_values(by='Film', )
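
Shift + tab on `sort_values` shows an `ascending` parameter that defaults to `True`. A minimal sketch of using it, on a made-up four-row sample rather than the full table:

```python
import pandas as pd

sample = pd.DataFrame({
    'Film': ['Gone with the Wind', 'The Wizard of Oz', 'The Godfather', 'Casablanca'],
    'Year': [1939, 1939, 1972, 1942],
})

grouped_by_count = sample.groupby('Year').count()

# ascending defaults to True; passing False puts the largest counts first
most_first = grouped_by_count.sort_values(by='Film', ascending=False)
print(most_first)
```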

# 6. Introduction to city population data

In [45]:
page = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population', header=0)

How do we find the data we're looking for?

In [46]:
city_df = page[4]

In [47]:
city_df.head()

Unnamed: 0,2017rank,City,State[5],2017estimate,2010Census,Change,2016 land area,2016 population density,Location,Unnamed: 9,Unnamed: 10
0,1,New York[6],New York,8622698,8175133,+5.47%,301.5 sq mi,780.9 km2,"28,317/sq mi","10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W
1,2,Los Angeles,California,3999759,3792621,+5.46%,468.7 sq mi,"1,213.9 km2","8,484/sq mi","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°W
2,3,Chicago,Illinois,2716450,2695598,+0.77%,227.3 sq mi,588.7 km2,"11,900/sq mi","4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W
3,4,Houston[7],Texas,2312717,2100263,+10.12%,637.5 sq mi,"1,651.1 km2","3,613/sq mi","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W
4,5,Phoenix,Arizona,1626078,1445632,+12.48%,517.6 sq mi,"1,340.6 km2","3,120/sq mi","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°W


Before we write any code here, let's talk about what this data can tell us -- and what it cannot.

# Lunch

![lunch](https://78.media.tumblr.com/548cfa1622853c30207245c3214df098/tumblr_oq7r3quYsh1qhy6c9o1_500.gif)

# 7. City population data prep

Let's reacquaint ourselves with the data.

In [48]:
city_df.head()

Unnamed: 0,2017rank,City,State[5],2017estimate,2010Census,Change,2016 land area,2016 population density,Location,Unnamed: 9,Unnamed: 10
0,1,New York[6],New York,8622698,8175133,+5.47%,301.5 sq mi,780.9 km2,"28,317/sq mi","10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W
1,2,Los Angeles,California,3999759,3792621,+5.46%,468.7 sq mi,"1,213.9 km2","8,484/sq mi","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°W
2,3,Chicago,Illinois,2716450,2695598,+0.77%,227.3 sq mi,588.7 km2,"11,900/sq mi","4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W
3,4,Houston[7],Texas,2312717,2100263,+10.12%,637.5 sq mi,"1,651.1 km2","3,613/sq mi","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W
4,5,Phoenix,Arizona,1626078,1445632,+12.48%,517.6 sq mi,"1,340.6 km2","3,120/sq mi","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°W


These column headers aren't as useful as they could be. How do we fix that?

In [37]:
city_df.rename?

#### We need a dictionary. How did we make one before?

In [38]:
useful_dict = {'foo': 5, 'bar': 'baz'}

This would take a lot of typing! Instead, you can make one by combining two lists.

#### What lists do you need?

In [49]:
current_columns = list(city_df.columns)

In [52]:
current_columns

['2017rank',
 'City',
 'State[5]',
 '2017estimate',
 '2010Census',
 'Change',
 '2016 land area',
 '2016 population density',
 'Location',
 'Unnamed: 9',
 'Unnamed: 10']

In [53]:
new_columns = ['2017rank',
               'City',
               'State',
               '2017estimate',
               '2010Census',
               'Change',
               '2016 land area (miles)',
               '2016 land area (km)',
               '2016 population density (miles)',
               '2016 population density (km)',
               'Location']

In [54]:
column_dict = dict(zip(current_columns, new_columns))
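
If `dict(zip(...))` looks like magic, it helps to take it apart. The sketch below uses two shortened lists (hypothetical names `old_names` and `new_names`) to show the intermediate pairs.

```python
old_names = ['State[5]', 'Unnamed: 9']           # shortened, for illustration
new_names = ['State', '2016 population density (km)']

# zip pairs items by position; dict turns each pair into key: value
pairs = list(zip(old_names, new_names))
mapping = dict(pairs)
print(pairs)
print(mapping)

# zip stops at the shorter list, so check that the lengths match
print(len(old_names) == len(new_names))
```

Because `zip` pairs by position, the order of the new names has to line up exactly with the order of the current ones.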

In [58]:
city_df.rename(columns=column_dict, inplace=True)

In [59]:
city_df.head()

Unnamed: 0,2017rank,City,State,2017estimate,2010Census,Change,2016 land area (miles),2016 land area (km),2016 population density (miles),2016 population density (km),Location
0,1,New York[6],New York,8622698,8175133,+5.47%,301.5 sq mi,780.9 km2,"28,317/sq mi","10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W
1,2,Los Angeles,California,3999759,3792621,+5.46%,468.7 sq mi,"1,213.9 km2","8,484/sq mi","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°W
2,3,Chicago,Illinois,2716450,2695598,+0.77%,227.3 sq mi,588.7 km2,"11,900/sq mi","4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W
3,4,Houston[7],Texas,2312717,2100263,+10.12%,637.5 sq mi,"1,651.1 km2","3,613/sq mi","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W
4,5,Phoenix,Arizona,1626078,1445632,+12.48%,517.6 sq mi,"1,340.6 km2","3,120/sq mi","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°W


# Break!

![break2](https://78.media.tumblr.com/ff9287842f26cd9993b8151736992244/tumblr_osj6r9gFH21qhy6c9o1_500.gif)

In [57]:
city_df.head()

Unnamed: 0,2017rank,City,State,2017estimate,2010Census,Change,2016 land area (miles),2016 land area (km),2016 population density (miles),2016 population density (km),Location
0,1,New York[6],New York,8622698,8175133,+5.47%,301.5 sq mi,780.9 km2,"28,317/sq mi","10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W
1,2,Los Angeles,California,3999759,3792621,+5.46%,468.7 sq mi,"1,213.9 km2","8,484/sq mi","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°W
2,3,Chicago,Illinois,2716450,2695598,+0.77%,227.3 sq mi,588.7 km2,"11,900/sq mi","4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W
3,4,Houston[7],Texas,2312717,2100263,+10.12%,637.5 sq mi,"1,651.1 km2","3,613/sq mi","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W
4,5,Phoenix,Arizona,1626078,1445632,+12.48%,517.6 sq mi,"1,340.6 km2","3,120/sq mi","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°W


# 8. Exploration: city population data

We'll figure out what questions we want to ask in the workshop, and we may not go in this direction. But here's a question you can pursue:

### What was the percent change in population *density* for each city between 2010 and 2016?

Optional: Remove footnote markup from city names

In [60]:
city_df['City'] = city_df['City'].str.replace(r"\[\d+\]", "", regex=True)
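
To see what the pattern matches, the sketch below applies the same replacement to a few hand-typed city names instead of the full column.

```python
import pandas as pd

cities = pd.Series(['New York[6]', 'Los Angeles', 'Houston[7]'])

# \[ and \] match literal square brackets; \d+ matches one or more digits
cleaned = cities.str.replace(r'\[\d+\]', '', regex=True)
print(cleaned.tolist())
```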

Optional: Drop columns you don't plan to use

In [66]:
# city_df.drop(['2017rank','2017estimate'], axis=1)
city_df

Unnamed: 0,2017rank,City,State,2017estimate,2010Census,Change,2016 land area (miles),2016 land area (km),2016 population density (miles),2016 population density (km),Location
0,1,New York,New York,8622698,8175133,+5.47%,301.5 sq mi,780.9 km2,"28,317/sq mi","10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W
1,2,Los Angeles,California,3999759,3792621,+5.46%,468.7 sq mi,"1,213.9 km2","8,484/sq mi","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°W
2,3,Chicago,Illinois,2716450,2695598,+0.77%,227.3 sq mi,588.7 km2,"11,900/sq mi","4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W
3,4,Houston,Texas,2312717,2100263,+10.12%,637.5 sq mi,"1,651.1 km2","3,613/sq mi","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W
4,5,Phoenix,Arizona,1626078,1445632,+12.48%,517.6 sq mi,"1,340.6 km2","3,120/sq mi","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°W
5,6,Philadelphia,Pennsylvania,1580863,1526006,+3.59%,134.2 sq mi,347.6 km2,"11,683/sq mi","4,511/km2",40°00′34″N 75°08′00″W﻿ / ﻿40.0094°N 75.1333°W
6,7,San Antonio,Texas,1511946,1327407,+13.90%,461.0 sq mi,"1,194.0 km2","3,238/sq mi","1,250/km2",29°28′21″N 98°31′30″W﻿ / ﻿29.4724°N 98.5251°W
7,8,San Diego,California,1419516,1307402,+8.58%,325.2 sq mi,842.3 km2,"4,325/sq mi","1,670/km2",32°48′55″N 117°08′06″W﻿ / ﻿32.8153°N 117.1350°W
8,9,Dallas,Texas,1341075,1197816,+11.96%,340.9 sq mi,882.9 km2,"3,866/sq mi","1,493/km2",32°47′36″N 96°45′59″W﻿ / ﻿32.7933°N 96.7665°W
9,10,San Jose,California,1035317,945942,+9.45%,177.5 sq mi,459.7 km2,"5,777/sq mi","2,231/km2",37°17′48″N 121°49′08″W﻿ / ﻿37.2967°N 121.8189°W


#### Task: Make land area fields usable in calculations

#### Task: Calculate 2010 population density

#### Task: Calculate difference
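
If you get stuck, here is one possible approach to the three tasks, sketched on three rows typed in from the table above. It is not the only way to do it. The helper `to_number` and the new column names are hypothetical, and the sketch assumes the 2016 land area is a fair stand-in for the 2010 land area. The scraped text may also contain non-breaking spaces, in which case the unit strings would need adjusting.

```python
import pandas as pd

# Three rows copied from the table above
sample = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago'],
    '2010Census': [8175133, 3792621, 2695598],
    '2016 land area (miles)': ['301.5 sq mi', '468.7 sq mi', '227.3 sq mi'],
    '2016 population density (miles)': ['28,317/sq mi', '8,484/sq mi', '11,900/sq mi'],
})

def to_number(series, unit):
    """Strip a trailing unit and thousands separators, then convert to float."""
    return (series.str.replace(unit, '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(float))

# Task 1: make the text fields usable in calculations
sample['area_sq_mi'] = to_number(sample['2016 land area (miles)'], ' sq mi')
sample['density_2016'] = to_number(sample['2016 population density (miles)'], '/sq mi')

# Task 2: 2010 density (assumption: land area unchanged since 2010)
sample['density_2010'] = sample['2010Census'] / sample['area_sq_mi']

# Task 3: percent change in density between 2010 and 2016
sample['density_pct_change'] = (
    (sample['density_2016'] - sample['density_2010']) / sample['density_2010'] * 100
)
print(sample[['City', 'density_2010', 'density_2016', 'density_pct_change']])
```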

Bonus: How do these cities' population densities compare to those of cities elsewhere in the world?

https://en.wikipedia.org/wiki/List_of_cities_proper_by_population

# <center> 💖 Fin! 💖</center>
![fin](https://78.media.tumblr.com/7381533edc6941a7ec0d98c71303e3cd/tumblr_p2bac8rqVI1qhy6c9o1_500.gif)