##  A Quick Review of Pandas

Before pulling any data, we need to import the packages we'll use.

In [33]:
import pandas as pd
import numpy as np

Now we can read in the data, either from a URL or from a file in the same directory.

In [34]:
movies = pd.read_csv('imdb.csv')

Take a quick look at the data

In [42]:
movies.head(3)

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
0,Avatar,7.9,PG-13,237000000.0,760505847.0,178.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Joel David Moore,Wes Studi,James Cameron,English,USA,1.78,2009.0
1,Pirates of the Caribbean: At World's End,7.1,PG-13,300000000.0,309404152.0,169.0,Action|Adventure|Fantasy,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski,English,USA,2.35,2007.0
2,Spectre,6.8,PG-13,245000000.0,200074175.0,148.0,Action|Adventure|Thriller,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes,English,UK,2.35,2015.0


How many rows are there? Columns?

In [37]:
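# Setup (run once): this cell builds the 'imdb.csv' file that is read above.
# It downloads the full dataset, keeps the columns used in this review, and saves them locally.
df = pd.read_csv('https://raw.githubusercontent.com/benartuso/nodef19/master/data/02-cleaning-intro-vis/movie.csv')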
df2 = df[['movie_title', 'imdb_score', 'content_rating', 'budget', 'gross','duration',
       'genres', 'actor_1_name', 'actor_2_name', 'actor_3_name',
       'director_name', 'language', 'country', 'aspect_ratio', 'title_year']]
df2.to_csv('imdb.csv',index=False)

In [38]:
movies.shape

(5043, 15)

There are a lot of columns. Could we get a full list?

In [43]:
movies.columns

Index(['movie_title', 'imdb_score', 'content_rating', 'budget', 'gross',
       'duration', 'genres', 'actor_1_name', 'actor_2_name', 'actor_3_name',
       'director_name', 'language', 'country', 'aspect_ratio', 'title_year'],
      dtype='object')

How many movies does each director have on IMDB?

In [146]:
movies.director_name.value_counts()

Steven Spielberg     26
Woody Allen          22
Martin Scorsese      20
Clint Eastwood       20
Ridley Scott         17
                     ..
Michael Cristofer     1
Mennan Yapo           1
Steve Carver          1
Keith Parmer          1
Ian Fitzgibbon        1
Name: director_name, Length: 2398, dtype: int64

What are the top 3 longest movies on the list?

In [58]:
movies.sort_values(by='duration', ascending=False).head(3)

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
1710,Trapped,8.2,,,,511.0,Crime|Drama|Thriller,Ólafur Darri Ólafsson,Ingvar Eggert Sigurðsson,Björn Hlynur Haraldsson,,Icelandic,Iceland,16.0,
2466,Carlos,7.7,Not Rated,,145118.0,334.0,Biography|Crime|Drama|Thriller,Edgar Ramírez,Nora von Waldstätten,Katharina Schüttler,,English,France,2.35,
1501,"Blood In, Blood Out",8.0,R,35000000.0,4496583.0,330.0,Crime|Drama,Delroy Lindo,Jesse Borrego,Raymond Cruz,Taylor Hackford,English,USA,1.66,1993.0


# Pandas Foundations

## Series vs. Dataframes 

So far we've been working with *2-dimensional tables*, which pandas represents as `pd.DataFrame` objects.
<br>An individual column or row is *1-dimensional*. We call this a `pd.Series` object.

In [60]:
type(movies)

pandas.core.frame.DataFrame

We can access a column as a series using either form: `df.column` OR `df['column']`.

In [119]:
print(type(movies.imdb_score))
movies.imdb_score

<class 'pandas.core.series.Series'>


0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
5038    7.7
5039    7.5
5040    6.3
5041    6.3
5042    6.6
Name: imdb_score, Length: 5043, dtype: float64

In [120]:
print(type(movies['imdb_score']))
movies['imdb_score']

<class 'pandas.core.series.Series'>


0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
5038    7.7
5039    7.5
5040    6.3
5041    6.3
5042    6.6
Name: imdb_score, Length: 5043, dtype: float64
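
The two forms are not always interchangeable. Here is a minimal sketch on a made-up `toy` frame (not the `movies` data) showing two cases where only the bracket form reaches the column: a name containing a space, and a name that collides with an existing DataFrame attribute such as `shape`.

```python
import pandas as pd

# Hypothetical toy data: one column name has a space,
# another shares its name with the DataFrame attribute `shape`.
toy = pd.DataFrame({
    "imdb_score": [7.9, 7.1],
    "title year": [2009, 2007],
    "shape": ["wide", "wide"],
})

# For a simple column name, both forms give the same Series.
same = toy.imdb_score.equals(toy["imdb_score"])
print(same)

# Bracket access works for any column name, including ones with spaces.
year_values = toy["title year"].tolist()
print(year_values)

# Dot access finds the DataFrame attribute first, not the column.
attribute_value = toy.shape            # the (rows, columns) tuple
column_values = toy["shape"].tolist()  # the actual column
print(attribute_value, column_values)
```

When in doubt, the bracket form is the safer habit.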

To select a row as a series, we can use `df.iloc[]` with the row's integer position.

In [249]:
print(type(movies.iloc[143]))
movies.iloc[143]

<class 'pandas.core.series.Series'>


movie_title                                      Mars Needs Moms 
imdb_score                                                    5.4
content_rating                                                 PG
budget                                                    1.5e+08
gross                                                 2.13793e+07
duration                                                       88
genres            Action|Adventure|Animation|Comedy|Family|Sci-Fi
actor_1_name                                    Elisabeth Harnois
actor_2_name                                           Dan Fogler
actor_3_name                                    Tom Everett Scott
director_name                                         Simon Wells
language                                                  English
country                                                       USA
aspect_ratio                                                 2.35
title_year                                                   2011
duration_h
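
`iloc` counts positions, which is not always the same thing as the index label shown on the left of a table. A small sketch on a hypothetical `toy` frame, contrasting `iloc` with the label-based `loc` after a sort:

```python
import pandas as pd

# Hypothetical toy data, not the full movies table.
toy = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Troy"],
    "duration": [178.0, 148.0, 196.0],
})

# After sorting, the index labels (0, 1, 2) no longer match row positions.
by_length = toy.sort_values(by="duration", ascending=False)

first_by_position = by_length.iloc[0]  # the first row as displayed
row_with_label_0 = by_length.loc[0]    # the row whose index label is 0

print(first_by_position["movie_title"])
print(row_with_label_0["movie_title"])
```

Both calls return a `pd.Series`, but they pick different rows once the frame has been reordered or filtered.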

## Subsetting

Subsetting and filtering are some of the most important, yet most confusing, topics when getting started.
<br>As we go along, feel free to run the code part by part to see what's going on in each step. `type()` is also a great tool here.

Let's take a look at a smaller set of columns, say just the `movie_title` and each actor and director name.

In [50]:
movies[['movie_title','actor_1_name','actor_2_name','actor_3_name','director_name']] # Note TWO square brackets

Unnamed: 0,movie_title,actor_1_name,actor_2_name,actor_3_name,director_name
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Star Wars: Episode VII - The Force Awakens ...,Doug Walker,Rob Walker,,Doug Walker
...,...,...,...,...,...
5038,Signed Sealed Delivered,Eric Mabius,Daphne Zuniga,Crystal Lowe,Scott Smith
5039,The Following,Natalie Zea,Valorie Curry,Sam Underwood,
5040,A Plague So Pleasant,Eva Boehnke,Maxwell Moody,David Chandler,Benjamin Roberds
5041,Shanghai Calling,Alan Ruck,Daniel Henney,Eliza Coupe,Daniel Hsia


Why did we use two square brackets? The outer pair is the indexing operator, and the inner pair is a Python `list` of column names that we pass to it.
<br> When we place `[]` after a pandas `DataFrame`, pandas knows we're selecting or filtering something.
<br><br> This ends up being pretty helpful, because we can store a list of columns once and reuse it to create similar subsets over and over.

In [55]:
actor_cols = ['actor_1_name','actor_2_name','actor_3_name']
movies[['movie_title']+actor_cols]

Unnamed: 0,movie_title,actor_1_name,actor_2_name,actor_3_name
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Star Wars: Episode VII - The Force Awakens ...,Doug Walker,Rob Walker,
...,...,...,...,...
5038,Signed Sealed Delivered,Eric Mabius,Daphne Zuniga,Crystal Lowe
5039,The Following,Natalie Zea,Valorie Curry,Sam Underwood
5040,A Plague So Pleasant,Eva Boehnke,Maxwell Moody,David Chandler
5041,Shanghai Calling,Alan Ruck,Daniel Henney,Eliza Coupe


Quick point of confusion: check out the difference between `df['column']` and `df[['column']]` (try it out with `movies` below!)
<br><br> The former creates a *series*, since the input is just one *string*. The latter creates a *dataframe*, since the input is a *list* of columns.

In [81]:
movies['content_rating']

0       PG-13
1       PG-13
2       PG-13
3       PG-13
4         NaN
        ...  
5038      NaN
5039    TV-14
5040      NaN
5041    PG-13
5042       PG
Name: content_rating, Length: 5043, dtype: object

In [79]:
movies[['content_rating']]

Unnamed: 0,content_rating
0,PG-13
1,PG-13
2,PG-13
3,PG-13
4,
...,...
5038,
5039,TV-14
5040,
5041,PG-13


## Filtering With Conditions

Another common task is to keep only the rows that meet some condition. We could:
1. Compare floats
2. Match strings
3. Check against multiple elements

In [91]:
movies[movies.imdb_score > 8]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year,duration_hrs
3,The Dark Knight Rises,8.5,PG-13,250000000.0,448130642.0,164.0,Action|Thriller,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,English,USA,2.35,2012.0,2.733333
17,The Avengers,8.1,PG-13,220000000.0,623279547.0,173.0,Action|Adventure|Sci-Fi,Chris Hemsworth,Robert Downey Jr.,Scarlett Johansson,Joss Whedon,English,USA,1.85,2012.0,2.883333
27,Captain America: Civil War,8.2,PG-13,250000000.0,407197282.0,147.0,Action|Adventure|Sci-Fi,Robert Downey Jr.,Scarlett Johansson,Chris Evans,Anthony Russo,English,USA,2.35,2016.0,2.450000
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0,1.716667
58,WALL·E,8.4,G,180000000.0,223806889.0,98.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,Fred Willard,Jeff Garlin,Andrew Stanton,English,USA,2.35,2008.0,1.633333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4924,Butterfly Girl,8.7,,180000.0,,78.0,Documentary,Abigail Evans,Stacie Evans,Emily Gorell,Cary Bell,English,USA,,2014.0,1.300000
4937,A Charlie Brown Christmas,8.4,TV-G,150000.0,,25.0,Animation|Comedy|Family,Peter Robbins,Bill Melendez,Christopher Shea,Bill Melendez,English,USA,1.33,1965.0,0.416667
4945,The Brain That Sings,8.2,,125000.0,,62.0,Documentary|Family,,,,Amal Al-Agroobi,Arabic,United Arab Emirates,,2013.0,1.033333
4972,"Peace, Propaganda & the Promised Land",8.3,,70000.0,,80.0,Documentary,Noam Chomsky,Seth Ackerman,Arik Ascherman,Sut Jhally,English,USA,,2004.0,1.333333


In [109]:
movies[movies.content_rating == 'G']

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year,duration_hrs
35,Monsters University,7.3,G,200000000.0,268488329.0,104.0,Adventure|Animation|Comedy|Family|Fantasy,Steve Buscemi,Tyler Labine,Sean Hayes,Dan Scanlon,English,USA,1.85,2013.0,1.733333
41,Cars 2,6.3,G,200000000.0,191450875.0,106.0,Adventure|Animation|Comedy|Family|Sport,Joe Mantegna,Thomas Kretschmann,Eddie Izzard,John Lasseter,English,USA,2.35,2011.0,1.766667
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0,1.716667
58,WALL·E,8.4,G,180000000.0,223806889.0,98.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,Fred Willard,Jeff Garlin,Andrew Stanton,English,USA,2.35,2008.0,1.633333
91,The Polar Express,6.6,G,165000000.0,665426.0,100.0,Adventure|Animation|Family|Fantasy,Tom Hanks,Eddie Deezen,Peter Scolari,Robert Zemeckis,English,USA,2.35,2004.0,1.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4427,Modern Times,8.6,G,1500000.0,163245.0,87.0,Comedy|Drama|Family,Paulette Goddard,Stanley Blystone,Fred Malatesta,Charles Chaplin,English,USA,1.37,1936.0,1.450000
4591,A Lego Brickumentary,6.8,G,1000000.0,100240.0,93.0,Documentary,G.W. Krauss,Brian Whitaker,Jason Bateman,Kief Davidson,English,Denmark,,2014.0,1.550000
4725,Benji,6.1,G,500000.0,39552600.0,86.0,Adventure|Family|Romance,Frances Bavier,Peter Breck,Edgar Buchanan,Joe Camp,English,USA,1.85,1974.0,1.433333
4787,Rise of the Entrepreneur: The Search for a Bet...,8.2,G,450000.0,,52.0,Documentary,Bob Proctor,Jack Canfield,Eric Worre,Joe Kenemore,English,USA,,2014.0,0.866667


We'll use the `.isin()` method to match against a group of values. Remember, `.isin()` expects a list-like collection (such as a `list`), not a single string.

In [110]:
movies[movies.country.isin(['India','Sri Lanka', 'Bangladesh', 'Pakistan'])]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year,duration_hrs
1056,Earth,7.8,Unrated,,528972.0,110.0,Drama|Romance|War,Nandita Das,Gulshan Grover,Eric Peterson,Deepa Mehta,Hindi,India,,1998.0,1.833333
1329,Baahubali: The Beginning,8.4,,18026148.0,6498000.0,159.0,Action|Adventure|Drama|Fantasy|War,Tamannaah Bhatia,Anushka Shetty,Prabhas,S.S. Rajamouli,Telugu,India,1.85,2015.0,2.65
2349,Ramanujan,7.0,,,,153.0,Biography|Drama|History,Mani Bharathi,Michael Lieber,Kevin McGowan,Gnana Rajasekaran,English,India,2.35,2014.0,2.55
3075,Kabhi Alvida Naa Kehna,6.0,R,700000000.0,3275443.0,193.0,Drama,Shah Rukh Khan,John Abraham,Preity Zinta,Karan Johar,Hindi,India,2.35,2006.0,3.216667
3085,Housefull,5.3,,,1165104.0,144.0,Comedy,Arjun Rampal,Boman Irani,Riteish Deshmukh,Sajid Khan,Hindi,India,,2010.0,2.4
3208,Krrish,6.3,Not Rated,10000000.0,,168.0,Action|Adventure|Romance|Sci-Fi,Naseeruddin Shah,Rekha,Sharat Saxena,Rakesh Roshan,Hindi,India,2.35,2006.0,2.8
3273,Kites,6.0,,600000000.0,1602466.0,90.0,Action|Drama|Romance|Thriller,Bárbara Mori,Steven Michael Quezada,Kabir Bedi,Anurag Basu,English,India,,2010.0,1.5
3276,Jab Tak Hai Jaan,6.9,Not Rated,7217600.0,3047539.0,176.0,Drama|Romance,Shah Rukh Khan,Katrina Kaif,Vic Waghorn,Yash Chopra,Hindi,India,2.35,2012.0,2.933333
3344,My Name Is Khan,8.0,PG-13,12000000.0,4018695.0,128.0,Adventure|Drama|Thriller,Shah Rukh Khan,Jimmy Shergill,Christopher B. Duncan,Karan Johar,Hindi,India,2.35,2010.0,2.133333
3348,Namastey London,7.3,,,1207007.0,128.0,Comedy|Drama|Romance,Katrina Kaif,Clive Standen,Riteish Deshmukh,Vipul Amrutlal Shah,Hindi,India,,2007.0,2.133333


### Masking

What's going on under the hood? Each condition produces a boolean `Series` (a *mask*) with one `True` or `False` per *row*, marking whether that row matches the criteria. Passing the mask inside `[]` keeps only the `True` rows.
<br><br>With the above examples:

In [114]:
movies.content_rating == 'G'

0       False
1       False
2       False
3       False
4       False
        ...  
5038    False
5039    False
5040    False
5041    False
5042    False
Name: content_rating, Length: 5043, dtype: bool

In [115]:
movies[  movies.content_rating=='G'  ]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year,duration_hrs
35,Monsters University,7.3,G,200000000.0,268488329.0,104.0,Adventure|Animation|Comedy|Family|Fantasy,Steve Buscemi,Tyler Labine,Sean Hayes,Dan Scanlon,English,USA,1.85,2013.0,1.733333
41,Cars 2,6.3,G,200000000.0,191450875.0,106.0,Adventure|Animation|Comedy|Family|Sport,Joe Mantegna,Thomas Kretschmann,Eddie Izzard,John Lasseter,English,USA,2.35,2011.0,1.766667
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0,1.716667
58,WALL·E,8.4,G,180000000.0,223806889.0,98.0,Adventure|Animation|Family|Sci-Fi,John Ratzenberger,Fred Willard,Jeff Garlin,Andrew Stanton,English,USA,2.35,2008.0,1.633333
91,The Polar Express,6.6,G,165000000.0,665426.0,100.0,Adventure|Animation|Family|Fantasy,Tom Hanks,Eddie Deezen,Peter Scolari,Robert Zemeckis,English,USA,2.35,2004.0,1.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4427,Modern Times,8.6,G,1500000.0,163245.0,87.0,Comedy|Drama|Family,Paulette Goddard,Stanley Blystone,Fred Malatesta,Charles Chaplin,English,USA,1.37,1936.0,1.450000
4591,A Lego Brickumentary,6.8,G,1000000.0,100240.0,93.0,Documentary,G.W. Krauss,Brian Whitaker,Jason Bateman,Kief Davidson,English,Denmark,,2014.0,1.550000
4725,Benji,6.1,G,500000.0,39552600.0,86.0,Adventure|Family|Romance,Frances Bavier,Peter Breck,Edgar Buchanan,Joe Camp,English,USA,1.85,1974.0,1.433333
4787,Rise of the Entrepreneur: The Search for a Bet...,8.2,G,450000.0,,52.0,Documentary,Bob Proctor,Jack Canfield,Eric Worre,Joe Kenemore,English,USA,,2014.0,0.866667
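
Because a mask is just a boolean Series, masks can be stored in variables, counted, and combined. A minimal sketch on hypothetical `toy` data (note that pandas uses `&`, `|`, and `~` here rather than `and`, `or`, and `not`):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "movie_title": ["Toy Story 3", "Cars 2", "Troy", "WALL·E"],
    "content_rating": ["G", "G", "R", "G"],
    "imdb_score": [8.3, 6.3, 7.2, 8.4],
})

is_g = toy.content_rating == "G"  # one True/False per row
is_high = toy.imdb_score > 8

n_g = int(is_g.sum())             # True counts as 1, so this counts matches
both = toy[is_g & is_high]        # rows matching both conditions
either = toy[is_g | is_high]      # rows matching at least one
not_g = toy[~is_g]                # rows that do not match

print(n_g)
print(both.movie_title.tolist())
print(not_g.movie_title.tolist())
```

When writing the conditions inline, wrap each one in parentheses, e.g. `toy[(toy.content_rating == 'G') & (toy.imdb_score > 8)]`.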


## Chaining

One more quick, helpful technique before we break out to practice: chaining several operations together in one expression.
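
A minimal sketch, assuming "chaining" here means calling one method directly on the result of the previous one. The `toy` frame is made up for illustration. Wrapping the expression in parentheses lets us put each step on its own line.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "movie_title": ["Toy Story 3", "Cars 2", "Troy", "WALL·E", "Benji"],
    "content_rating": ["G", "G", "R", "G", "G"],
    "imdb_score": [8.3, 6.3, 7.2, 8.4, 6.1],
})

# Filter, then sort, then take the top rows, then pull out one column.
top_g = (
    toy[toy.content_rating == "G"]
    .sort_values(by="imdb_score", ascending=False)
    .head(2)
    .movie_title
    .tolist()
)
print(top_g)
```

Each step returns a new DataFrame or Series, so you can cut the chain at any line and inspect the intermediate result with `type()`.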

## Practice w/ UFOs

Data sampled from the National UFO Reporting Center (NUFORC).
<br>With your breakout groups, open up `ufo.csv` and answer the following questions:

1. Among the West Coast states (California, Oregon, and Washington), how long (on average) did the fireball encounters last?
2. Which state saw the most encounters that lasted between 5 minutes and 1 hour?
3. There was one particularly interesting encounter on `2/11/2004 00:00` in West Palm Beach, Florida. What happened?

<br>Hint: Break down each question into parts, and chain them back together.
<br>There are many ways of arriving at the answer, and there's no particular 'right' way.

In [290]:
ufo = pd.read_csv('ufo.csv')
ufo.head()

Unnamed: 0,datetime,city,state,country,shape,duration_sec,duration_hrs,comments,latitude,longitude
0,1/25/2014 22:00,tewksbury,ma,us,light,4.0,00:04,Green and red falling light over walgreens res...,42.610556,-71.234722
1,8/20/2004 22:27,lake in the hills,il,us,triangle,180.0,2-3 minutes,On August 20&#442004 at exactly 10:25 to 10:27...,42.181667,-88.330278
2,1/23/2009 19:00,slidell,la,us,sphere,7200.0,2 hours,Bright red&#44 green &#44 white and blue spher...,30.275,-89.781111
3,12/15/1994 21:00,alliance,oh,us,formation,1800.0,30 min +,Brightly Colored Orbs Over Portage County,40.915278,-81.106111
4,7/31/2011 21:00,milford,ct,us,oval,15.0,10-15 seconds,Bright orange&#44 silent orb&#44 moving steadi...,41.222222,-73.056944


In [314]:
# 1
# Combine both conditions into one mask, then select the column to average
ufo.loc[ufo.state.isin(['ca','or','wa']) & (ufo['shape']=='fireball'), 'duration_sec'].mean()

238.88888888888889

In [316]:
# 2
ufo.loc[(ufo.duration_sec>=5*60) & (ufo.duration_sec<=60*60)].state.value_counts().head(1)

ca    392
Name: state, dtype: int64
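
As one of the "many ways" to express a range check, `Series.between()` builds the same mask as the two comparisons joined with `&` (both endpoints are included by default). A sketch on a hypothetical `toy` frame rather than the real `ufo.csv`:

```python
import pandas as pd

# Hypothetical toy data standing in for the UFO table.
toy = pd.DataFrame({
    "state": ["ca", "ca", "wa", "ca", "or"],
    "duration_sec": [4.0, 300.0, 1800.0, 3600.0, 7200.0],
})

# Two explicit comparisons joined with &.
explicit = (toy.duration_sec >= 5 * 60) & (toy.duration_sec <= 60 * 60)

# The same range check in one call.
compact = toy.duration_sec.between(5 * 60, 60 * 60)
masks_match = explicit.equals(compact)

top_state = toy.loc[compact, "state"].value_counts().head(1)
print(masks_match)
print(top_state)
```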

In [318]:
# 3
# Combine both conditions into one mask, then read the first matching comment
ufo.loc[(ufo.datetime == '2/11/2004 00:00') & (ufo.city == 'west palm beach'), 'comments'].values[0]

'BLINDING LIGHT LIFTED MY DOG AND TOOK OFF INTO SPACE'

# Groupby Objects

From last week: how would we get the number of movies in this dataset that received each possible content_rating?

In [0]:
#ANSWER:
movies.content_rating.value_counts()

R            2118
PG-13        1461
PG            701
Not Rated     116
G             112
Unrated        62
Approved       55
TV-14          30
TV-MA          20
TV-PG          13
X              13
TV-G           10
Passed          9
NC-17           7
GP              6
M               5
TV-Y            1
TV-Y7           1
Name: content_rating, dtype: int64

This is a good summary statistic to examine the distribution of our dataset. But...

**What if we want to know how movie performance differs by rating?**

In [70]:
rating_group = movies.groupby('content_rating')
type(rating_group)

pandas.core.groupby.generic.DataFrameGroupBy

This creates a **GroupBy object**. 

It's like a dataframe separated out by each unique value of the column we specified (content_rating). 

We can apply various operations to it and get **disaggregated values for each content rating**, instead of for the whole column. 
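
To make the "separated out" picture concrete, here is a small sketch on a hypothetical `toy` frame. Looping over a GroupBy object yields one `(key, sub-DataFrame)` pair per unique value, and an aggregation such as `.mean()` is applied to each piece.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "content_rating": ["G", "R", "G", "PG", "R"],
    "imdb_score": [8.0, 6.0, 7.0, 5.0, 7.0],
})

groups = toy.groupby("content_rating")

# Each iteration gives the group key and the rows belonging to it.
sizes = {}
for rating, frame in groups:
    sizes[rating] = len(frame)
print(sizes)

# Aggregating computes one value per group instead of one for the whole column.
means = groups["imdb_score"].mean()
print(means)
```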

In [0]:
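# numeric_only=True skips text columns, which recent pandas versions refuse to average
rating_group.mean(numeric_only=True)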

content_rating,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Approved,74.672727,116.218182,648.854545,185.236364,815.836364,48145860.0,44651.363636,1649.054545,1.254545,162.490909,4142475.0,1954.781818,285.290909,7.325455,1.735818,919.018182
G,105.162162,98.973214,156.446429,373.936364,3174.571429,82455160.0,89336.982143,4977.821429,0.5,197.232143,44999130.0,1995.892857,613.089286,6.529464,1.962523,4399.080357
GP,66.833333,110.833333,1884.0,112.333333,2349.833333,43800000.0,33355.5,2930.333333,1.166667,144.833333,5550000.0,1970.666667,308.333333,6.916667,2.024,25.0
M,77.4,114.6,2640.6,253.6,6729.4,62554450.0,50226.2,7812.8,1.2,199.0,4700000.0,1968.4,393.6,6.84,2.012,122.8
NC-17,184.285714,102.0,170.0,288.571429,4422.142857,4476870.0,56828.571429,5423.571429,0.428571,298.428571,7940714.0,1994.571429,372.0,6.542857,1.855714,5144.571429
Not Rated,92.482456,110.318966,269.478261,168.690265,1251.669565,2090037.0,25662.301724,2069.655172,1.077586,104.469565,4479213.0,1997.504348,380.904348,6.631034,2.101739,3333.431034
PG,119.71184,104.857347,776.071327,694.692857,6350.830243,72952730.0,73759.776034,9606.445078,1.259684,219.302425,48389560.0,2000.392297,1618.519258,6.294437,2.06125,5993.680456
PG-13,174.597107,111.365068,767.550992,954.518163,8452.399726,65648540.0,109344.096509,12783.44011,1.578007,361.326712,54257340.0,2005.710472,2214.359589,6.257495,2.151456,11051.508556
Passed,43.888889,115.222222,66.333333,221.111111,540.444444,11003540.0,35484.888889,1489.555556,2.444444,106.111111,3031421.0,1941.222222,389.888889,7.166667,1.351111,1829.555556
R,151.027423,108.781767,735.532106,557.777043,6819.372049,29983650.0,88584.095845,9855.895184,1.321834,289.30482,33647550.0,2002.950425,1708.955619,6.527101,2.118677,7476.420208


In [0]:
rating_group['imdb_score'].agg(['mean', 'min'])

content_rating,mean,min
Approved,7.325455,4.1
G,6.529464,1.6
GP,6.916667,6.2
M,6.84,6.0
NC-17,6.542857,4.6
Not Rated,6.631034,2.0
PG,6.294437,1.7
PG-13,6.257495,1.9
Passed,7.166667,6.3
R,6.527101,1.9
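
`.agg()` also accepts keyword arguments, which lets us choose the output column names ourselves. A minimal sketch on a hypothetical `toy` frame; the names `avg_score`, `lowest`, and `n_movies` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "content_rating": ["G", "R", "G", "PG", "R"],
    "imdb_score": [8.0, 6.0, 7.0, 5.0, 7.0],
})

# keyword = output column name, value = aggregation to apply
summary = toy.groupby("content_rating")["imdb_score"].agg(
    avg_score="mean",
    lowest="min",
    n_movies="count",
)
print(summary)
```

Readable column names pay off later, when the summary is sorted or plotted.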


In [0]:
rating_group['imdb_score'].mean()

content_rating
Approved     7.325455
G            6.529464
GP           6.916667
M            6.840000
NC-17        6.542857
Not Rated    6.631034
PG           6.294437
PG-13        6.257495
Passed       7.166667
R            6.527101
TV-14        7.250000
TV-G         6.920000
TV-MA        8.250000
TV-PG        7.353846
TV-Y         7.400000
TV-Y7        7.200000
Unrated      6.920968
X            6.500000
Name: imdb_score, dtype: float64

This is nice, but it's hard to read and would make for a poor visualization if we plotted it like this, since **the average scores are not in order**.

What method from Week 1 can we chain on to make this more orderly, and see the ratings from highest to lowest average score?

In [0]:
# Your turn: chain one more method onto rating_group['imdb_score'].mean()
# so that the ratings are ordered by their average score.


In [0]:
#ANSWER
rating_group['imdb_score'].mean().sort_values(ascending=False)

content_rating
TV-MA        8.250000
TV-Y         7.400000
TV-PG        7.353846
Approved     7.325455
TV-14        7.250000
TV-Y7        7.200000
Passed       7.166667
Unrated      6.920968
TV-G         6.920000
GP           6.916667
M            6.840000
Not Rated    6.631034
NC-17        6.542857
G            6.529464
R            6.527101
X            6.500000
PG           6.294437
PG-13        6.257495
Name: imdb_score, dtype: float64

We can also get the number of rows in each group with `.size()`.

In [0]:
type(rating_group)

pandas.core.groupby.generic.DataFrameGroupBy

In [0]:
rating_group.size()

content_rating
Approved       55
G             112
GP              6
M               5
NC-17           7
Not Rated     116
PG            701
PG-13        1461
Passed          9
R            2118
TV-14          30
TV-G           10
TV-MA          20
TV-PG          13
TV-Y            1
TV-Y7           1
Unrated        62
X              13
dtype: int64
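
The group sizes above add up to 4,740, fewer than the 5,043 rows in the table, which is consistent with rows that have no `content_rating` being left out of every group. The sketch below uses a hypothetical `toy` frame to show how missing values affect `.size()`, `.count()`, and the `dropna` option of `groupby`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing key and one missing value.
toy = pd.DataFrame({
    "content_rating": ["G", "G", "R", np.nan],
    "gross": [100.0, np.nan, 50.0, 75.0],
})

groups = toy.groupby("content_rating")

sizes = groups.size()             # rows per group; the NaN key is dropped
counts = groups["gross"].count()  # non-missing 'gross' values per group

# dropna=False keeps the rows whose key is missing as their own group.
with_missing = toy.groupby("content_rating", dropna=False).size()

print(sizes)
print(counts)
print(with_missing)
```

`.size()` counts rows, while `.count()` counts non-missing values, so the two can disagree on columns with gaps.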

In [0]:
rating_group.get_group('PG').head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
7,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,Brad Garrett,Tangled,294810,2036,M.C. Gainey,1.0,17th century|based on fairy tale|disney|flower...,http://www.imdb.com/title/tt0398286/?ref_=fn_t...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,Alan Rickman,Harry Potter and the Half-Blood Prince,321795,58753,Rupert Grint,3.0,blood|book|love|potion|professor,http://www.imdb.com/title/tt0417741/?ref_=fn_t...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
16,Color,Andrew Adamson,258.0,150.0,80.0,201.0,Pierfrancesco Favino,22000.0,141614023.0,Action|Adventure|Family|Fantasy,Peter Dinklage,The Chronicles of Narnia: Prince Caspian,149922,22697,Damián Alcázar,4.0,brother brother relationship|brother sister re...,http://www.imdb.com/title/tt0499448/?ref_=fn_t...,438.0,English,USA,PG,225000000.0,2008.0,216.0,6.6,2.35,0
33,Color,Tim Burton,451.0,108.0,13000.0,11000.0,Alan Rickman,40000.0,334185206.0,Adventure|Family|Fantasy,Johnny Depp,Alice in Wonderland,306320,79957,Anne Hathaway,0.0,alice in wonderland|mistaking reality for drea...,http://www.imdb.com/title/tt1014759/?ref_=fn_t...,736.0,English,USA,PG,200000000.0,2010.0,25000.0,6.5,1.85,24000
38,Color,Sam Raimi,525.0,130.0,0.0,11000.0,Mila Kunis,44000.0,234903076.0,Adventure|Family|Fantasy,Tim Holmes,Oz the Great and Powerful,175409,73441,James Franco,4.0,circus|magic|magician|oz|witch,http://www.imdb.com/title/tt1623205/?ref_=fn_t...,511.0,English,USA,PG,215000000.0,2013.0,15000.0,6.4,2.35,60000


#### You can also group on multiple columns to get all unique combinations of those columns. 

Example: we can use this to assess whether the relationship between `content_rating` and `imdb_score` is different across countries.



In [0]:
multi_group = movies.groupby(['country', 'content_rating'])

In [0]:
multi_group.size()

country         content_rating
Afghanistan     PG-13                1
Argentina       R                    3
                Unrated              1
Aruba           R                    1
Australia       G                    2
                PG                  11
                PG-13               11
                R                   26
                Unrated              1
Bahamas         R                    1
Belgium         R                    3
Brazil          R                    6
                Unrated              1
Bulgaria        R                    1
Cameroon        Not Rated            1
Canada          G                    1
                Not Rated            9
                PG                  10
                PG-13               25
                R                   65
                TV-14                1
                TV-G                 1
                TV-Y                 1
Chile           PG-13                1
China           PG               

In [0]:
multi_group['imdb_score'].mean().sort_values(ascending=False).head()

country     content_rating
Poland      TV-MA             9.1
Italy       Approved          8.9
Kyrgyzstan  PG-13             8.7
Italy       TV-MA             8.7
            PG-13             8.6
Name: imdb_score, dtype: float64

In [0]:
multi_group['imdb_score'].mean().sort_values(ascending=True).head()

country      content_rating
South Korea  PG                2.7
Canada       G                 2.8
China        PG                3.2
South Korea  PG-13             3.6
Bahamas      R                 4.4
Name: imdb_score, dtype: float64
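
The result of a multi-column groupby is a Series with a two-level index (country, then rating). A short sketch on a hypothetical `toy` frame showing three common ways to work with it: looking up one combination, flattening it back into columns, and pivoting one level into a wide table.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["USA", "USA", "UK", "UK", "USA"],
    "content_rating": ["R", "PG", "R", "R", "R"],
    "imdb_score": [7.0, 6.0, 8.0, 6.0, 5.0],
})

means = toy.groupby(["country", "content_rating"])["imdb_score"].mean()

# Look up one (country, rating) combination with a tuple.
usa_r = means.loc[("USA", "R")]

# Turn the index levels back into ordinary columns.
flat = means.reset_index()

# Move the rating level into columns: one row per country.
wide = means.unstack("content_rating")

print(usa_r)
print(flat)
print(wide)
```

Combinations that never occur in the data (here, a PG movie from the UK) show up as `NaN` in the wide table.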

In [0]:
multi_group.get_group(('USA', 'R')).head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
94,Color,Jonathan Mostow,280.0,109.0,84.0,191.0,M.C. Gainey,648.0,150350192.0,Action|Sci-Fi,Nick Stahl,Terminator 3: Rise of the Machines,305340,1769,Carolyn Hennesy,0.0,drifter|exploding truck|future|machine|skynet,http://www.imdb.com/title/tt0181852/?ref_=fn_t...,1676.0,English,USA,R,200000000.0,2003.0,284.0,6.4,2.35,0
126,Color,Lana Wachowski,275.0,138.0,0.0,30.0,Daniel Bernhardt,234.0,281492479.0,Action|Sci-Fi,Steve Bastoni,The Matrix Reloaded,421818,534,Helmut Bakaitis,0.0,car motorcycle chase|one against many|oracle|p...,http://www.imdb.com/title/tt0234215/?ref_=fn_t...,2789.0,English,USA,R,150000000.0,2003.0,198.0,7.2,2.35,0
136,Color,Joe Johnston,357.0,119.0,394.0,162.0,Simon Merrells,12000.0,61937495.0,Drama|Fantasy|Horror|Thriller,Anthony Hopkins,The Wolfman,89442,13071,Art Malik,0.0,asylum|death|full moon|transformation|werewolf,http://www.imdb.com/title/tt0780653/?ref_=fn_t...,432.0,English,USA,R,150000000.0,2010.0,490.0,5.8,1.85,0
147,Color,Wolfgang Petersen,220.0,196.0,249.0,844.0,Orlando Bloom,11000.0,133228348.0,Adventure,Brad Pitt,Troy,381672,17944,Julian Glover,2.0,greek|mythology|prince|trojan|troy,http://www.imdb.com/title/tt0332452/?ref_=fn_t...,1694.0,English,USA,R,175000000.0,2004.0,5000.0,7.2,2.35,0
158,Color,Edward Zwick,190.0,154.0,380.0,445.0,Tony Goldwyn,10000.0,111110575.0,Action|Drama|History|War,Tom Cruise,The Last Samurai,317166,11945,Chad Lindberg,1.0,captain|emperor|honor|japan|samurai,http://www.imdb.com/title/tt0325710/?ref_=fn_t...,928.0,English,USA,R,140000000.0,2003.0,956.0,7.7,2.35,0


In [0]:
multi_group.get_group(('USA', 'R')).movie_title.head()

94     Terminator 3: Rise of the Machines 
126                   The Matrix Reloaded 
136                           The Wolfman 
147                                  Troy 
158                      The Last Samurai 
Name: movie_title, dtype: object