## A Quick Review from W1

Before pulling any data, we've gotta import all the packages we need

In [3]:
import pandas as pd
import numpy as np

Now we can read in the data from a link OR from a file in the same directory

In [4]:
movies = pd.read_csv('imdb.csv')

Take a quick look at the data

In [5]:
movies.head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
0,Avatar,7.9,PG-13,237000000.0,760505847.0,178.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Joel David Moore,Wes Studi,James Cameron,English,USA,1.78,2009.0
1,Pirates of the Caribbean: At World's End,7.1,PG-13,300000000.0,309404152.0,169.0,Action|Adventure|Fantasy,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski,English,USA,2.35,2007.0
2,Spectre,6.8,PG-13,245000000.0,200074175.0,148.0,Action|Adventure|Thriller,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes,English,UK,2.35,2015.0
3,The Dark Knight Rises,8.5,PG-13,250000000.0,448130642.0,164.0,Action|Thriller,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,English,USA,2.35,2012.0
4,Star Wars: Episode VII - The Force Awakens ...,7.1,,,,,Documentary,Doug Walker,Rob Walker,,Doug Walker,,,,


In [6]:
movies.shape

(5043, 15)

There are a lot of columns. Could we get a full list?

In [7]:
movies.columns

Index(['movie_title', 'imdb_score', 'content_rating', 'budget', 'gross',
       'duration', 'genres', 'actor_1_name', 'actor_2_name', 'actor_3_name',
       'director_name', 'language', 'country', 'aspect_ratio', 'title_year'],
      dtype='object')

How many movies does each director have on IMDB?

In [8]:
movies.director_name.value_counts()

Steven Spielberg       26
Woody Allen            22
Martin Scorsese        20
Clint Eastwood         20
Ridley Scott           17
                       ..
Eric Nicholas           1
Demian Lichtenstein     1
William Phillips        1
Dan Scanlon             1
Gérard Krawczyk         1
Name: director_name, Length: 2398, dtype: int64

What are the 5 longest movies on the list? (`.head()` returns 5 rows by default)

In [9]:
movies.sort_values(by='duration', ascending=False).head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
1710,Trapped,8.2,,,,511.0,Crime|Drama|Thriller,Ólafur Darri Ólafsson,Ingvar Eggert Sigurðsson,Björn Hlynur Haraldsson,,Icelandic,Iceland,16.0,
2466,Carlos,7.7,Not Rated,,145118.0,334.0,Biography|Crime|Drama|Thriller,Edgar Ramírez,Nora von Waldstätten,Katharina Schüttler,,English,France,2.35,
1501,"Blood In, Blood Out",8.0,R,35000000.0,4496583.0,330.0,Crime|Drama,Delroy Lindo,Jesse Borrego,Raymond Cruz,Taylor Hackford,English,USA,1.66,1993.0
1144,Heaven's Gate,6.8,R,44000000.0,1500000.0,325.0,Adventure|Drama|Western,Jeff Bridges,Sam Waterston,Isabelle Huppert,Michael Cimino,English,USA,2.35,1980.0
3311,The Legend of Suriyothai,6.6,R,400000000.0,454255.0,300.0,Action|Adventure|Drama|History|War,Sarunyu Wongkrachang,Chatchai Plengpanich,Mai Charoenpura,Chatrichalerm Yukol,Thai,Thailand,1.85,2001.0
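
If only the top few rows are needed, `DataFrame.nlargest` is an alternative to sorting and then slicing. Here is a minimal sketch on a made-up four-row table (the titles and durations are hypothetical, not taken from `imdb.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "duration": [90.0, 300.0, 150.0, 120.0],
})

# Sort descending, then keep the first 2 rows
top2_sorted = toy.sort_values(by="duration", ascending=False).head(2)

# Ask directly for the 2 rows with the largest duration
top2_nlargest = toy.nlargest(2, "duration")

print(top2_sorted.movie_title.tolist())
print(top2_nlargest.movie_title.tolist())
```

Both approaches pick the same rows here. Passing a number to `.head()` (for example `.head(3)`) controls how many rows come back.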


# Pandas Foundations

## Series vs. Dataframes 

So far we've been referring to *2-dimensional* tables using the `pd.DataFrame` object.
<br>An individual column (or row) is *1-dimensional*. We call this a `pd.Series` object

In [8]:
type(movies)

pandas.core.frame.DataFrame

We can access a series using either form: `df.column` OR `df['column']`

In [10]:
print(type(movies.imdb_score))
print(movies.imdb_score)

<class 'pandas.core.series.Series'>
0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
5038    7.7
5039    7.5
5040    6.3
5041    6.3
5042    6.6
Name: imdb_score, Length: 5043, dtype: float64


In [12]:
print(type(movies['imdb_score']))
print(movies['imdb_score'])

<class 'pandas.core.series.Series'>
0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
5038    7.7
5039    7.5
5040    6.3
5041    6.3
5042    6.6
Name: imdb_score, Length: 5043, dtype: float64
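
The two forms are not always interchangeable. The sketch below uses a small made-up table (hypothetical values) to show two cases where only the bracket form reaches the column: a column whose name clashes with an existing DataFrame attribute, and a column name containing a space.

```python
import pandas as pd

# Hypothetical toy data: "shape" clashes with the DataFrame.shape attribute,
# and "duration sec" contains a space.
toy = pd.DataFrame({
    "city": ["slidell", "milford"],
    "shape": ["sphere", "oval"],
    "duration sec": [7200.0, 15.0],
})

# For an ordinary column name, both forms give the same Series
same = toy.city.equals(toy["city"])

# Dot access returns the (rows, columns) tuple, not the column
attr_result = toy.shape

# Bracket access always returns the column
shape_col = toy["shape"]
spaced_col = toy["duration sec"]

print(same)
print(attr_result)
print(shape_col.tolist())
```

When in doubt, `df['column']` is the safer choice.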


To select a **row** as a series, we can use `df.iloc[]` with the row's integer position (here the position matches the index label because `movies` still has its default 0, 1, 2, ... index)

In [14]:
print(type(movies.iloc[3]))
movies.iloc[3]

<class 'pandas.core.series.Series'>


movie_title       The Dark Knight Rises 
imdb_score                           8.5
content_rating                     PG-13
budget                           2.5e+08
gross                        4.48131e+08
duration                             164
genres                   Action|Thriller
actor_1_name                   Tom Hardy
actor_2_name              Christian Bale
actor_3_name        Joseph Gordon-Levitt
director_name          Christopher Nolan
language                         English
country                              USA
aspect_ratio                        2.35
title_year                          2012
Name: 3, dtype: object
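
Position and index label stop matching as soon as the rows are reordered. A minimal sketch, assuming a made-up three-row table (hypothetical values), contrasts `.iloc[]` (position) with `.loc[]` (label):

```python
import pandas as pd

# Hypothetical toy data with the default 0, 1, 2 index
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "duration": [90.0, 150.0, 120.0],
})

# After sorting, the index labels appear in the order 1, 2, 0
by_length = toy.sort_values(by="duration", ascending=False)

first_by_position = by_length.iloc[0]  # first row in the sorted order
row_with_label_0 = by_length.loc[0]    # row whose index label is 0

print(first_by_position["movie_title"])
print(row_with_label_0["movie_title"])
```

Both results are `pd.Series` objects, but they describe different movies.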

## Subsetting Columns

Subsetting and filtering are among the most important, yet most confusing, topics when getting started.
<br>As we go along, feel free to run the code part by part to see what's going on in each step. `type()` is also a great tool here

Let's take a look at a smaller set of columns, say just the `movie_title` and each actor and director name

In [13]:
movies[['movie_title','actor_1_name','actor_2_name','actor_3_name','director_name']] # Note TWO square brackets

Unnamed: 0,movie_title,actor_1_name,actor_2_name,actor_3_name,director_name
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Star Wars: Episode VII - The Force Awakens ...,Doug Walker,Rob Walker,,Doug Walker
...,...,...,...,...,...
5038,Signed Sealed Delivered,Eric Mabius,Daphne Zuniga,Crystal Lowe,Scott Smith
5039,The Following,Natalie Zea,Valorie Curry,Sam Underwood,
5040,A Plague So Pleasant,Eva Boehnke,Maxwell Moody,David Chandler,Benjamin Roberds
5041,Shanghai Calling,Alan Ruck,Daniel Henney,Eliza Coupe,Daniel Hsia


Why did we use two square brackets? The outer pair is the subsetting operator, and the inner pair builds a Python `list` of column names that we pass to it.
<br> Pandas `DataFrame` objects know that whenever we place `[]` after them, we're looking to do some sort of selection or filtering operation
<br><br> This ends up being pretty helpful, because it gives us a shortcut to create similar subsets over and over

In [12]:
actor_cols = ['actor_1_name','actor_2_name','actor_3_name']
movies[actor_cols]

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name
0,CCH Pounder,Joel David Moore,Wes Studi
1,Johnny Depp,Orlando Bloom,Jack Davenport
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Doug Walker,Rob Walker,
...,...,...,...
5038,Eric Mabius,Daphne Zuniga,Crystal Lowe
5039,Natalie Zea,Valorie Curry,Sam Underwood
5040,Eva Boehnke,Maxwell Moody,David Chandler
5041,Alan Ruck,Daniel Henney,Eliza Coupe


In [13]:
movies[['movie_title']+actor_cols]

Unnamed: 0,movie_title,actor_1_name,actor_2_name,actor_3_name
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Star Wars: Episode VII - The Force Awakens ...,Doug Walker,Rob Walker,
...,...,...,...,...
5038,Signed Sealed Delivered,Eric Mabius,Daphne Zuniga,Crystal Lowe
5039,The Following,Natalie Zea,Valorie Curry,Sam Underwood
5040,A Plague So Pleasant,Eva Boehnke,Maxwell Moody,David Chandler
5041,Shanghai Calling,Alan Ruck,Daniel Henney,Eliza Coupe


Quick point of confusion: Check out the difference between `df['column']` vs `df[['column']]` ( try it out w/ `movies` below! )
<br><br> The former creates a *series*, since the input is just one *string*. The latter creates a *dataframe*, since the input is a *list* of columns. 

In [14]:
movies['content_rating']

0       PG-13
1       PG-13
2       PG-13
3       PG-13
4         NaN
        ...  
5038      NaN
5039    TV-14
5040      NaN
5041    PG-13
5042       PG
Name: content_rating, Length: 5043, dtype: object

In [16]:
movies[['content_rating']]

Unnamed: 0,content_rating
0,PG-13
1,PG-13
2,PG-13
3,PG-13
4,
...,...
5038,
5039,TV-14
5040,
5041,PG-13


## Filtering Rows With Conditions

Another common task is to filter rows based upon some criteria we have. We could:
1. Compare floats
2. Match strings
3. Check against multiple elements

In [17]:
movies[movies.imdb_score>8]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
3,The Dark Knight Rises,8.5,PG-13,250000000.0,448130642.0,164.0,Action|Thriller,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,English,USA,2.35,2012.0
17,The Avengers,8.1,PG-13,220000000.0,623279547.0,173.0,Action|Adventure|Sci-Fi,Chris Hemsworth,Robert Downey Jr.,Scarlett Johansson,Joss Whedon,English,USA,1.85,2012.0
27,Captain America: Civil War,8.2,PG-13,250000000.0,407197282.0,147.0,Action|Adventure|Sci-Fi,Robert Downey Jr.,Scarlett Johansson,Chris Evans,Anthony Russo,English,USA,2.35,2016.0
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0
58,WALL·E,8.4,G,180000000.0,223806889.0,98.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,Fred Willard,Jeff Garlin,Andrew Stanton,English,USA,2.35,2008.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4924,Butterfly Girl,8.7,,180000.0,,78.0,Documentary,Abigail Evans,Stacie Evans,Emily Gorell,Cary Bell,English,USA,,2014.0
4937,A Charlie Brown Christmas,8.4,TV-G,150000.0,,25.0,Animation|Comedy|Family,Peter Robbins,Bill Melendez,Christopher Shea,Bill Melendez,English,USA,1.33,1965.0
4945,The Brain That Sings,8.2,,125000.0,,62.0,Documentary|Family,,,,Amal Al-Agroobi,Arabic,United Arab Emirates,,2013.0
4972,"Peace, Propaganda & the Promised Land",8.3,,70000.0,,80.0,Documentary,Noam Chomsky,Seth Ackerman,Arik Ascherman,Sut Jhally,English,USA,,2004.0


In [18]:
movies[movies.content_rating=='G']

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
35,Monsters University,7.3,G,200000000.0,268488329.0,104.0,Adventure|Animation|Comedy|Family|Fantasy,Steve Buscemi,Tyler Labine,Sean Hayes,Dan Scanlon,English,USA,1.85,2013.0
41,Cars 2,6.3,G,200000000.0,191450875.0,106.0,Adventure|Animation|Comedy|Family|Sport,Joe Mantegna,Thomas Kretschmann,Eddie Izzard,John Lasseter,English,USA,2.35,2011.0
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0
58,WALL·E,8.4,G,180000000.0,223806889.0,98.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,Fred Willard,Jeff Garlin,Andrew Stanton,English,USA,2.35,2008.0
91,The Polar Express,6.6,G,165000000.0,665426.0,100.0,Adventure|Animation|Family|Fantasy,Tom Hanks,Eddie Deezen,Peter Scolari,Robert Zemeckis,English,USA,2.35,2004.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4427,Modern Times,8.6,G,1500000.0,163245.0,87.0,Comedy|Drama|Family,Paulette Goddard,Stanley Blystone,Fred Malatesta,Charles Chaplin,English,USA,1.37,1936.0
4591,A Lego Brickumentary,6.8,G,1000000.0,100240.0,93.0,Documentary,G.W. Krauss,Brian Whitaker,Jason Bateman,Kief Davidson,English,Denmark,,2014.0
4725,Benji,6.1,G,500000.0,39552600.0,86.0,Adventure|Family|Romance,Frances Bavier,Peter Breck,Edgar Buchanan,Joe Camp,English,USA,1.85,1974.0
4787,Rise of the Entrepreneur: The Search for a Bet...,8.2,G,450000.0,,52.0,Documentary,Bob Proctor,Jack Canfield,Eric Worre,Joe Kenemore,English,USA,,2014.0


We'll use the `.isin()` method to match against a group of values. Remember, `.isin()` expects a list-like collection (such as a `list`), not a single string

In [19]:
movies[movies.country.isin(['India','Sri Lanka', 'Pakistan', 'Bangladesh'])]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
1056,Earth,7.8,Unrated,,528972.0,110.0,Drama|Romance|War,Nandita Das,Gulshan Grover,Eric Peterson,Deepa Mehta,Hindi,India,,1998.0
1329,Baahubali: The Beginning,8.4,,18026148.0,6498000.0,159.0,Action|Adventure|Drama|Fantasy|War,Tamannaah Bhatia,Anushka Shetty,Prabhas,S.S. Rajamouli,Telugu,India,1.85,2015.0
2349,Ramanujan,7.0,,,,153.0,Biography|Drama|History,Mani Bharathi,Michael Lieber,Kevin McGowan,Gnana Rajasekaran,English,India,2.35,2014.0
3075,Kabhi Alvida Naa Kehna,6.0,R,700000000.0,3275443.0,193.0,Drama,Shah Rukh Khan,John Abraham,Preity Zinta,Karan Johar,Hindi,India,2.35,2006.0
3085,Housefull,5.3,,,1165104.0,144.0,Comedy,Arjun Rampal,Boman Irani,Riteish Deshmukh,Sajid Khan,Hindi,India,,2010.0
3208,Krrish,6.3,Not Rated,10000000.0,,168.0,Action|Adventure|Romance|Sci-Fi,Naseeruddin Shah,Rekha,Sharat Saxena,Rakesh Roshan,Hindi,India,2.35,2006.0
3273,Kites,6.0,,600000000.0,1602466.0,90.0,Action|Drama|Romance|Thriller,Bárbara Mori,Steven Michael Quezada,Kabir Bedi,Anurag Basu,English,India,,2010.0
3276,Jab Tak Hai Jaan,6.9,Not Rated,7217600.0,3047539.0,176.0,Drama|Romance,Shah Rukh Khan,Katrina Kaif,Vic Waghorn,Yash Chopra,Hindi,India,2.35,2012.0
3344,My Name Is Khan,8.0,PG-13,12000000.0,4018695.0,128.0,Adventure|Drama|Thriller,Shah Rukh Khan,Jimmy Shergill,Christopher B. Duncan,Karan Johar,Hindi,India,2.35,2010.0
3348,Namastey London,7.3,,,1207007.0,128.0,Comedy|Drama|Romance,Katrina Kaif,Clive Standen,Riteish Deshmukh,Vipul Amrutlal Shah,Hindi,India,,2007.0
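
As a small sketch of how `.isin()` behaves, the example below uses made-up rows (hypothetical values) to show that a `set` works as well as a `list`, and that `~` inverts the match:

```python
import pandas as pd

# Hypothetical toy data; the last row has a missing country
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "country": ["India", "USA", "Pakistan", None],
})

# Any list-like collection works; a single string such as "India" raises a TypeError
south_asia = {"India", "Sri Lanka", "Pakistan", "Bangladesh"}

in_group = toy[toy.country.isin(south_asia)]

# ~ flips True and False, keeping rows that are NOT in the group
not_in_group = toy[~toy.country.isin(south_asia)]

print(in_group.movie_title.tolist())
print(not_in_group.movie_title.tolist())
```

Note that the row with a missing country never matches, so it ends up in the inverted subset.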


### Masking

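What's going on under the hood? Each condition produces a boolean `Series` of `True` and `False` values (a *mask*), one per *row*, marking whether or not that row matches the criteria. Passing the mask inside `[]` keeps only the `True` rows.
<br><br>For example, with a string match like the one above: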

In [21]:
movies.content_rating == 'PG'

0       False
1       False
2       False
3       False
4       False
        ...  
5038    False
5039    False
5040    False
5041    False
5042     True
Name: content_rating, Length: 5043, dtype: bool

In [22]:
movies[   movies.content_rating == 'PG'   ]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
7,Tangled,7.8,PG,260000000.0,200807262.0,100.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,Brad Garrett,Donna Murphy,M.C. Gainey,Nathan Greno,English,USA,1.85,2010.0
9,Harry Potter and the Half-Blood Prince,7.5,PG,250000000.0,301956980.0,153.0,Adventure|Family|Fantasy|Mystery,Alan Rickman,Daniel Radcliffe,Rupert Grint,David Yates,English,UK,2.35,2009.0
16,The Chronicles of Narnia: Prince Caspian,6.6,PG,225000000.0,141614023.0,150.0,Action|Adventure|Family|Fantasy,Peter Dinklage,Pierfrancesco Favino,Damián Alcázar,Andrew Adamson,English,USA,2.35,2008.0
33,Alice in Wonderland,6.5,PG,200000000.0,334185206.0,108.0,Adventure|Family|Fantasy,Johnny Depp,Alan Rickman,Anne Hathaway,Tim Burton,English,USA,1.85,2010.0
38,Oz the Great and Powerful,6.4,PG,215000000.0,234903076.0,130.0,Adventure|Family|Fantasy,Tim Holmes,Mila Kunis,James Franco,Sam Raimi,English,USA,2.35,2013.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4962,The Lost Skeleton of Cadavra,7.0,PG,40000.0,110536.0,90.0,Comedy|Horror|Sci-Fi,Fay Masterson,Brian Howe,Larry Blamire,Larry Blamire,English,USA,1.85,2001.0
4963,"Dude, Where's My Dog?!",3.2,PG,20000.0,,82.0,Family,Kevin P. Farley,Gabriela Castillo,Brandon Middleton,Stephen Langford,English,USA,1.85,2014.0
4977,Super Size Me,7.3,PG,65000.0,11529368.0,100.0,Comedy|Documentary|Drama,Chemeeka Walker,Amanda Kearsan,Amelia Giancarlo,Morgan Spurlock,English,USA,1.78,2004.0
5001,The Last Waltz,8.2,PG,,321952.0,117.0,Documentary|Music,Ringo Starr,Levon Helm,Bob Dylan,Martin Scorsese,English,USA,1.85,1978.0
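
Because a mask is just a boolean `Series`, it can be stored in a variable and inspected on its own. A minimal sketch on made-up rows (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data; the last row has a missing rating
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "content_rating": ["PG", "R", "PG", None],
})

# One True/False value per row; a missing rating compares as False
mask = toy.content_rating == "PG"

# True counts as 1, so sum() is the number of matching rows
n_matches = int(mask.sum())

# Using the mask inside [] keeps only the True rows
subset = toy[mask]

print(mask.tolist())
print(n_matches)
print(subset.movie_title.tolist())
```

Counting with `mask.sum()` is a quick way to check how many rows a filter will keep before applying it.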


## Chaining

Chaining is super helpful in making our code more concise and readable

For example, if we wanted to combine some of the previous conditions and subsets:

In [48]:
df1 = movies[movies.imdb_score > 8]
df2 = df1[df1.content_rating == 'PG']
df3 = df2.sort_values(by='duration',ascending=False)
df3.head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
2088,Gandhi,8.1,PG,22000000.0,,240.0,Biography|Drama|History,Candice Bergen,Amrish Puri,John Gielgud,Richard Attenborough,English,UK,2.35,1982.0
2644,Lawrence of Arabia,8.4,PG,15000000.0,6000000.0,227.0,Adventure|Biography|Drama|History|War,Claude Rains,José Ferrer,Jack Hawkins,David Lean,English,UK,2.2,1962.0
3048,Barry Lyndon,8.1,PG,11000000.0,,184.0,Adventure|Drama|History|War,Ryan O'Neal,Steven Berkoff,Hardy Krüger,Stanley Kubrick,English,UK,1.66,1975.0
4066,The Bridge on the River Kwai,8.2,PG,3000000.0,27200000.0,161.0,Adventure|Drama|War,William Holden,Sessue Hayakawa,Jack Hawkins,David Lean,English,UK,2.35,1957.0
1504,Lion of the Desert,8.4,PG,35000000.0,,156.0,Biography|Drama|History|War,Oliver Reed,Rod Steiger,John Gielgud,Moustapha Akkad,English,Libya,2.35,1980.0


We could instead chain these steps together. First, note that `.loc[]` accepts the same boolean condition as plain `[]`:

In [25]:
movies.loc[movies.imdb_score > 8]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
3,The Dark Knight Rises,8.5,PG-13,250000000.0,448130642.0,164.0,Action|Thriller,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,English,USA,2.35,2012.0
17,The Avengers,8.1,PG-13,220000000.0,623279547.0,173.0,Action|Adventure|Sci-Fi,Chris Hemsworth,Robert Downey Jr.,Scarlett Johansson,Joss Whedon,English,USA,1.85,2012.0
27,Captain America: Civil War,8.2,PG-13,250000000.0,407197282.0,147.0,Action|Adventure|Sci-Fi,Robert Downey Jr.,Scarlett Johansson,Chris Evans,Anthony Russo,English,USA,2.35,2016.0
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0
58,WALL·E,8.4,G,180000000.0,223806889.0,98.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,Fred Willard,Jeff Garlin,Andrew Stanton,English,USA,2.35,2008.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4924,Butterfly Girl,8.7,,180000.0,,78.0,Documentary,Abigail Evans,Stacie Evans,Emily Gorell,Cary Bell,English,USA,,2014.0
4937,A Charlie Brown Christmas,8.4,TV-G,150000.0,,25.0,Animation|Comedy|Family,Peter Robbins,Bill Melendez,Christopher Shea,Bill Melendez,English,USA,1.33,1965.0
4945,The Brain That Sings,8.2,,125000.0,,62.0,Documentary|Family,,,,Amal Al-Agroobi,Arabic,United Arab Emirates,,2013.0
4972,"Peace, Propaganda & the Promised Land",8.3,,70000.0,,80.0,Documentary,Noam Chomsky,Seth Ackerman,Arik Ascherman,Sut Jhally,English,USA,,2004.0


In [23]:
movies[movies.imdb_score > 8].loc[movies.content_rating == 'PG'].sort_values(by='duration',ascending=False).head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
2088,Gandhi,8.1,PG,22000000.0,,240.0,Biography|Drama|History,Candice Bergen,Amrish Puri,John Gielgud,Richard Attenborough,English,UK,2.35,1982.0
2644,Lawrence of Arabia,8.4,PG,15000000.0,6000000.0,227.0,Adventure|Biography|Drama|History|War,Claude Rains,José Ferrer,Jack Hawkins,David Lean,English,UK,2.2,1962.0
3048,Barry Lyndon,8.1,PG,11000000.0,,184.0,Adventure|Drama|History|War,Ryan O'Neal,Steven Berkoff,Hardy Krüger,Stanley Kubrick,English,UK,1.66,1975.0
4066,The Bridge on the River Kwai,8.2,PG,3000000.0,27200000.0,161.0,Adventure|Drama|War,William Holden,Sessue Hayakawa,Jack Hawkins,David Lean,English,UK,2.35,1957.0
1504,Lion of the Desert,8.4,PG,35000000.0,,156.0,Biography|Drama|History|War,Oliver Reed,Rod Steiger,John Gielgud,Moustapha Akkad,English,Libya,2.35,1980.0


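This works because each step of our code returns a dataframe, so we can keep applying dataframe methods one after another. Note that the second condition is built from the full `movies` dataframe; `.loc[]` aligns it to the already-filtered rows by index label.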
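
Another way to apply two conditions at once is to combine the masks with `&` (and) or `|` (or), wrapping each condition in parentheses. The sketch below uses made-up rows (hypothetical values) and checks that the combined filter matches the step-by-step version:

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "imdb_score": [8.4, 8.1, 7.0, 8.8],
    "content_rating": ["PG", "R", "PG", "PG"],
    "duration": [120.0, 95.0, 140.0, 200.0],
})

# Step by step, with an intermediate dataframe at each stage
stepwise = toy[toy.imdb_score > 8]
stepwise = stepwise[stepwise.content_rating == "PG"]
stepwise = stepwise.sort_values(by="duration", ascending=False)

# One expression: both masks come from the same dataframe, joined with &
combined = (
    toy[(toy.imdb_score > 8) & (toy.content_rating == "PG")]
    .sort_values(by="duration", ascending=False)
)

print(combined.movie_title.tolist())
print(combined.equals(stepwise))
```

The parentheses matter: `&` binds more tightly than `>` or `==`, so leaving them out raises an error or gives the wrong result.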

Try it out: In one line, see if you can find the value counts of `content_rating` for movies with a gross revenue (`gross`) over `200000000` ($ 200 million)

In [27]:
movies[movies.gross > 200000000].content_rating.value_counts()

PG-13    97
PG       49
R        11
G        10
Name: content_rating, dtype: int64

# Practice w/ UFOs

Data sampled from the National UFO Reporting Center (NUFORC).
<br>With your breakout groups, open up `ufo.csv` and answer the following questions:

1. Among the West Coast states (California, Oregon, and Washington), how long (on average) did the fireball encounters last?
2. Which state saw the most encounters that lasted between 5 minutes and 1 hour?
3. There was one particularly interesting encounter on `2/11/2004 00:00` in West Palm Beach, Florida. What happened?

<br>Hint: Break down each question into parts, and chain them back together
<br>There are many ways of arriving at the answer; there's no particular 'right' way

In [29]:
ufo = pd.read_csv('ufo.csv')
ufo.head()

Unnamed: 0,datetime,city,state,country,shape,duration_sec,duration_hrs,comments,latitude,longitude
0,1/25/2014 22:00,tewksbury,ma,us,light,4.0,00:04,Green and red falling light over walgreens res...,42.610556,-71.234722
1,8/20/2004 22:27,lake in the hills,il,us,triangle,180.0,2-3 minutes,On August 20&#442004 at exactly 10:25 to 10:27...,42.181667,-88.330278
2,1/23/2009 19:00,slidell,la,us,sphere,7200.0,2 hours,Bright red&#44 green &#44 white and blue spher...,30.275,-89.781111
3,12/15/1994 21:00,alliance,oh,us,formation,1800.0,30 min +,Brightly Colored Orbs Over Portage County,40.915278,-81.106111
4,7/31/2011 21:00,milford,ct,us,oval,15.0,10-15 seconds,Bright orange&#44 silent orb&#44 moving steadi...,41.222222,-73.056944


In [32]:
ufo.loc[ufo.state.isin(['ca','or','wa'])].loc[ufo['shape']=='fireball'].duration_sec.mean()

238.88888888888889

In [53]:
ufo.loc[ufo.duration_sec >= 300].loc[ufo.duration_sec <= 3600].state.value_counts().head(1)

ca    392
Name: state, dtype: int64

In [50]:
ufo.loc[ufo.datetime == '2/11/2004 00:00'].loc[ufo.city == 'west palm beach'].comments.values[0]

'BLINDING LIGHT LIFTED MY DOG AND TOOK OFF INTO SPACE'

# Groupby Objects

From before: How would we get the number of movies in this dataset that received each possible content rating?

In [54]:
movies.content_rating.value_counts()

R            2118
PG-13        1461
PG            701
Not Rated     116
G             112
Unrated        62
Approved       55
TV-14          30
TV-MA          20
X              13
TV-PG          13
TV-G           10
Passed          9
NC-17           7
GP              6
M               5
TV-Y7           1
TV-Y            1
Name: content_rating, dtype: int64

This is a good summary statistic to examine the distribution of our dataset. But...

What if we want to know how movie performance differs by rating?

In [56]:
movies_byRating = movies.groupby(by='content_rating')
type(movies_byRating)

pandas.core.groupby.generic.DataFrameGroupBy

This creates a special **GroupBy object**. 

For now, let's think of it like a *collection* of dataframes, separated by each unique value of content rating (one group for `R`, one for `PG-13`, etc.).

We can't easily see what the entire GroupBy object looks like, but we can get a specific group

In [57]:
movies_byRating

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11c9de750>

In [58]:
print(type(movies_byRating.get_group('PG')))
movies_byRating.get_group('PG').head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
7,Tangled,7.8,PG,260000000.0,200807262.0,100.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,Brad Garrett,Donna Murphy,M.C. Gainey,Nathan Greno,English,USA,1.85,2010.0
9,Harry Potter and the Half-Blood Prince,7.5,PG,250000000.0,301956980.0,153.0,Adventure|Family|Fantasy|Mystery,Alan Rickman,Daniel Radcliffe,Rupert Grint,David Yates,English,UK,2.35,2009.0
16,The Chronicles of Narnia: Prince Caspian,6.6,PG,225000000.0,141614023.0,150.0,Action|Adventure|Family|Fantasy,Peter Dinklage,Pierfrancesco Favino,Damián Alcázar,Andrew Adamson,English,USA,2.35,2008.0
33,Alice in Wonderland,6.5,PG,200000000.0,334185206.0,108.0,Adventure|Family|Fantasy,Johnny Depp,Alan Rickman,Anne Hathaway,Tim Burton,English,USA,1.85,2010.0
38,Oz the Great and Powerful,6.4,PG,215000000.0,234903076.0,130.0,Adventure|Family|Fantasy,Tim Holmes,Mila Kunis,James Franco,Sam Raimi,English,USA,2.35,2013.0
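
One way to peek inside a GroupBy object is to loop over it: each iteration yields the group's key and the matching sub-dataframe. A minimal sketch on made-up rows (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "content_rating": ["PG", "R", "PG", "G", "R"],
    "imdb_score": [7.0, 8.0, 6.0, 5.5, 6.5],
})

by_rating = toy.groupby(by="content_rating")

# Each iteration gives (group key, dataframe of that group's rows)
group_sizes = {}
for rating, group in by_rating:
    group_sizes[rating] = len(group)

print(by_rating.ngroups)
print(group_sizes)
```

This is mainly useful for exploration; for actual summaries, the aggregation methods are shorter and faster.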


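When we apply **aggregation** functions to a `GroupBy` object, e.g. `mean()`, we get back averages for the numeric columns **broken down** by each content rating

In [61]:
# numeric_only=True skips text columns such as movie_title; pandas 2.0+ raises a TypeError without it
movies_byRating.mean(numeric_only=True)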

Unnamed: 0_level_0,imdb_score,budget,gross,duration,aspect_ratio,title_year
content_rating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Approved,7.325455,4142475.0,48145860.0,116.218182,1.735818,1954.781818
G,6.529464,44999130.0,82455160.0,98.973214,1.962523,1995.892857
GP,6.916667,5550000.0,43800000.0,110.833333,2.024,1970.666667
M,6.84,4700000.0,62554450.0,114.6,2.012,1968.4
NC-17,6.542857,7940714.0,4476870.0,102.0,1.855714,1994.571429
Not Rated,6.631034,4479213.0,2090037.0,110.318966,2.101739,1997.504348
PG,6.294437,48389560.0,72952730.0,104.857347,2.06125,2000.392297
PG-13,6.257495,54257340.0,65648540.0,111.365068,2.151456,2005.710472
Passed,7.166667,3031421.0,11003540.0,115.222222,1.351111,1941.222222
R,6.527101,33647550.0,29983650.0,108.781767,2.118677,2002.950425
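
To make the "broken down by group" idea concrete, the sketch below computes one group mean at a time by filtering, then compares the result with the groupby shortcut. The rows are made up (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "content_rating": ["PG", "R", "PG", "G", "R"],
    "imdb_score": [7.0, 8.0, 6.0, 5.5, 6.5],
})

# The long way: filter to each rating, then average its scores
manual = {}
for rating in toy.content_rating.unique():
    manual[rating] = toy[toy.content_rating == rating].imdb_score.mean()

# The groupby way: split, aggregate, and combine in one expression
grouped = toy.groupby(by="content_rating")["imdb_score"].mean()

print(manual)
print(grouped.to_dict())
```

Both routes produce the same numbers; groupby simply does the splitting and recombining for us.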


If we were just to apply `.mean()` to the entire dataframe, we'd only get back one row with summaries for the entire dataset

In [62]:
# numeric_only=True skips text columns, which pandas 2.0+ would otherwise reject
pd.DataFrame(movies.mean(numeric_only=True)).T

Unnamed: 0,imdb_score,budget,gross,duration,aspect_ratio,title_year
0,6.442138,39752620.0,48468410.0,107.201074,2.220403,2002.470517


We've got other ways to aggregate the data too.

Here, we're showing the mean, max, min, and count of `imdb_score` by subsetting to only that column and then applying the `.agg()` method, which accepts a list of aggregations to run on the groupby object

In [66]:
# Select the column first so that only imdb_score is aggregated
movies_byRating['imdb_score'].agg(['mean','max','min','count'])

Unnamed: 0,content_rating,mean,max,min,count
0,Approved,7.325455,8.9,4.1,55
1,G,6.529464,8.6,1.6,112
2,GP,6.916667,8.0,6.2,6
3,M,6.84,8.1,6.0,5
4,NC-17,6.542857,7.6,4.6,7
5,Not Rated,6.631034,8.9,2.0,116
6,PG,6.294437,8.8,1.7,701
7,PG-13,6.257495,9.0,1.9,1461
8,Passed,7.166667,8.1,6.3,9
9,R,6.527101,9.3,1.9,2118
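
`.agg()` also supports "named aggregation", where each output column gets an explicit name and its own source column. A minimal sketch on made-up rows (hypothetical values; the output names `mean_score` and `n_movies` are chosen here for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "content_rating": ["PG", "R", "PG", "G", "R"],
    "imdb_score": [7.0, 8.0, 6.0, 5.5, 6.5],
})

# Several aggregations of one column, passed as a list
summary = toy.groupby(by="content_rating")["imdb_score"].agg(
    ["mean", "max", "min", "count"]
)

# Named aggregation: output_name=(source column, aggregation)
named = toy.groupby(by="content_rating").agg(
    mean_score=("imdb_score", "mean"),
    n_movies=("movie_title", "count"),
)

print(summary)
print(named)
```

Named aggregation is handy when different columns need different summaries in the same table.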


In [74]:
# The count should align with the size of the PG-13 group from above
movies_byRating.get_group('PG-13').shape

(1461, 15)

The results of these aggregations are all dataframes; check it out with the `type()` function.
<br>This means we can start chaining together dataframe methods, for example `sort_values()`

<br> Try it out: Break down median revenues by country, and sort them highest to lowest
<br> If you finish early, try to show the same thing as above, but now *exclude* countries with fewer than 30 movies

In [69]:
# numeric_only=True skips text columns, which pandas 2.0+ would otherwise reject
movies.groupby(by='country').median(numeric_only=True).sort_values(by='gross',ascending=False).head()

Unnamed: 0_level_0,imdb_score,budget,gross,duration,aspect_ratio,title_year
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Taiwan,7.15,15000000.0,64340682.0,112.5,1.86,2007.5
Peru,5.4,45000000.0,57362581.0,110.0,1.85,1994.0
South Africa,6.25,17500000.0,45089048.0,100.5,1.85,2011.5
USA,6.5,20000000.0,32178777.0,103.0,1.85,2005.0
New Zealand,7.3,25000000.0,30465398.0,108.0,2.35,2005.0


In [74]:
# Select gross first; 'count' is the number of non-missing gross values per country
df = movies.groupby(by='country')['gross'].agg(['median','count']).sort_values(by='median', ascending=False)
df[df['count'] >= 30]

Unnamed: 0_level_0,median,count
country,Unnamed: 1_level_1,Unnamed: 2_level_1
USA,32178777.0,3235
Germany,20978074.5,88
Australia,17356110.0,42
UK,13401683.0,364
Canada,6854620.0,72
France,4291965.0,121
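
One subtlety when thresholding on group size: `count` only counts non-missing values, while `size` counts all rows in the group. The sketch below illustrates the difference on made-up rows (hypothetical values, with a threshold of 2 instead of 30 to suit the tiny table):

```python
import pandas as pd

# Hypothetical toy data; one USA movie has no gross figure
toy = pd.DataFrame({
    "country": ["USA", "USA", "USA", "UK", "UK", "Peru"],
    "gross": [10.0, 30.0, None, 5.0, 15.0, 100.0],
})

min_movies = 2  # hypothetical threshold for this toy table

# median and count ignore the missing value; size counts every row
stats = toy.groupby(by="country")["gross"].agg(["median", "count", "size"])

# Keep countries with enough movies, then rank by median gross
result = stats[stats["size"] >= min_movies].sort_values(
    by="median", ascending=False
)

print(stats)
print(result.index.tolist())
```

Which of the two to filter on depends on the question: `size` for "how many movies", `count` for "how many movies with a known gross".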


We can also group on multiple columns to get *all unique combinations* of those columns. 

<br> For example, we can see if the relationship between `content_rating` and `imdb_score` differs across countries.

In [77]:
movies_byCountryRating = movies.groupby(by=['country', 'content_rating'])

In [78]:
movies_byCountryRating.size()

country       content_rating
Afghanistan   PG-13              1
Argentina     R                  3
              Unrated            1
Aruba         R                  1
Australia     G                  2
                                ..
USA           Unrated           38
              X                 12
West Germany  M                  1
              PG                 1
              R                  1
Length: 176, dtype: int64

To get a specific group, we must now pass a `tuple`, since there are two "keys", or layers, that we've grouped by

In [80]:
movies_byCountryRating.get_group( ('USA','PG-13')   ).head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
0,Avatar,7.9,PG-13,237000000.0,760505847.0,178.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Joel David Moore,Wes Studi,James Cameron,English,USA,1.78,2009.0
1,Pirates of the Caribbean: At World's End,7.1,PG-13,300000000.0,309404152.0,169.0,Action|Adventure|Fantasy,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski,English,USA,2.35,2007.0
3,The Dark Knight Rises,8.5,PG-13,250000000.0,448130642.0,164.0,Action|Thriller,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,English,USA,2.35,2012.0
5,John Carter,6.6,PG-13,263700000.0,73058679.0,132.0,Action|Adventure|Sci-Fi,Daryl Sabara,Samantha Morton,Polly Walker,Andrew Stanton,English,USA,2.35,2012.0
6,Spider-Man 3,6.2,PG-13,258000000.0,336530303.0,156.0,Action|Adventure|Romance,J.K. Simmons,James Franco,Kirsten Dunst,Sam Raimi,English,USA,2.35,2007.0


In [82]:
movies_byCountryRating['imdb_score'].mean()

country       content_rating
Afghanistan   PG-13             7.400000
Argentina     R                 7.600000
              Unrated           7.200000
Aruba         R                 4.800000
Australia     G                 6.300000
                                  ...   
USA           Unrated           6.936842
              X                 6.466667
West Germany  M                 6.000000
              PG                7.400000
              R                 8.400000
Name: imdb_score, Length: 176, dtype: float64
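
The result of a two-key groupby has a two-level index, which can be looked up with a tuple or reshaped into a grid with `.unstack()`. A minimal sketch on made-up rows (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data standing in for the movies table
toy = pd.DataFrame({
    "country": ["USA", "USA", "USA", "UK", "UK"],
    "content_rating": ["PG", "PG", "R", "PG", "R"],
    "imdb_score": [6.0, 8.0, 7.0, 7.5, 6.5],
})

# One mean per (country, content_rating) combination
means = toy.groupby(by=["country", "content_rating"])["imdb_score"].mean()

# Look up a single combination with a tuple of keys
usa_pg = means.loc[("USA", "PG")]

# Move content_rating into the columns: countries as rows, ratings as columns
wide = means.unstack()

print(usa_pg)
print(wide)
```

The wide layout makes it easier to compare ratings side by side within each country.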

# Summary

That's it for now!

<br>Today you learned how to:
- **Import** a dataset as a pandas object
- Check out quick features, like `.head()`, `.shape`, and `.value_counts()`
- Distinguish between `pd.Series` and `pd.DataFrame` objects
- **Filter** rows based on some condition
- **Subset** columns to those we want

We also used special `GroupBy` objects to get specific drilled-down insights by:
<br>(1) first **splitting** a dataset into groups based on some criteria, then
<br>(2) computing some **aggregate** value, such that the result has one row for every value we grouped by

<br> In practice, if we wanted to get a mean score, broken down by every value in a given column, we would do:
<br>`df.groupby(by='group_column')['score_column'].mean()`