# Exercise Set 5

Below are five problems, each worth 1 point. These problems are interleaved with short tutorials on Python. This assignment will be autograded to ensure a quick turnaround. After several of the problems, there are tests associated with your answer which will help you determine if you have solved the problem correctly. If you can run the cell after your answer and not get any errors, then you very likely have gotten the question right. If you have errors, hopefully the error will help you identify the mistake. However, for some questions that ask for information, the grading cell will not tell you whether or not you got it correct (there is no way to do so without giving away the answer). It will tell you whether or not your formatting is correct.

Note that just because you don't get errors on the questions that do check your answer doesn't mean that you got the question correct. I have some additional tests held back that I do not show here, though if you pass the ones shown, you will likely pass those as well.

When you are done with the assignment, you should save this notebook manually by clicking on the save button in the toolbar (the floppy disk icon). **Do not rely on autosave. Save manually!** Ensure that you have not renamed the file. **The autograder that is used to grade this notebook requires that the file be named `Exercise_V.ipynb`.** Once you save the notebook, follow the instructions in the `README.md` file to submit the assignment.

Also, do not remove any datasets from the folder before downloading. You should download the folder exactly as is.

Finally, you are encouraged to add new cells as you go through the notebook and experiment. Any cell that should not be copied or deleted is marked as such. As long as you leave those marked cells alone, feel free to experiment as much as you would like with this notebook.

## Pandas DataFrames

In this exercise, we will primarily be looking at how to manipulate data within a pandas dataframe. Some of this you will have seen before in starter code, but hopefully you will develop a deeper understanding and appreciation for how to manipulate data in Python. First, we will import pandas and call it `pd` (giving us access to all of the pandas functions by writing `pd.FUNCTION`).

In [19]:
import pandas as pd

I am deeply indebted to open source material available at a few places. The list (not exhaustive) of sources is the following:

  1. [Python Pandas Tutorial: A Complete Introduction for Beginners](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/). A large number of the examples were copied verbatim from here.
  2. [Pandas Exercises](https://github.com/guipsamora/pandas_exercises). A GitHub repository that provides a lot of examples around using Pandas. Some of the exercises were adapted from these examples, so do not use this as a resource until after submitting this and the next exercise sets.

## Problem Dataset - Chipotle Orders

In this assignment's questions, we are going to be analyzing some data from orders at Chipotle. You can read it in with the following cell. Notice the `sep = '\t'`. This is because the data set is tab separated, not comma separated as we are used to.

Note that the below cell drops the duplicate rows, and it creates a new column called `item_price_number`. These were transformations to the dataset that were done in Exercise III. If you would like to familiarize yourself with these steps, you are encouraged to go back to Exercise III.

In [20]:
chipo = pd.read_csv("chipotle.tsv", sep = '\t')

chipo.drop_duplicates(inplace=True)

def price_to_float(price):
    return float(price[1:-1])

chipo['item_price_number'] = chipo['item_price'].apply(price_to_float)
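
The slice `price[1:-1]` in `price_to_float` drops the first and last characters, so it assumes every price string starts with `$` and ends with exactly one character to discard. As a hedged alternative, the sketch below strips whitespace and the dollar sign explicitly. The name `price_to_float_safe` and the sample strings are hypothetical and are not part of the assignment.

```python
def price_to_float_safe(price):
    """Convert a string such as '$2.39 ' to the float 2.39."""
    # Remove surrounding whitespace first, then any leading dollar sign.
    return float(price.strip().lstrip("$"))


# Made-up sample strings with and without stray whitespace.
samples = ["$2.39 ", "$16.98", " $3.39"]
converted = [price_to_float_safe(s) for s in samples]
print(converted)
```

This version gives the same number whether or not the string has a trailing character, at the cost of being slightly more verbose.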

You should spend some time familiarizing yourself with the data.

In [21]:
chipo.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price,item_price_number
0,1,1,Chips and Fresh Tomato Salsa,,$2.39,2.39
1,1,1,Izze,[Clementine],$3.39,3.39
2,1,1,Nantucket Nectar,[Apple],$3.39,3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98,16.98


This data set is broken up by `order_id` which corresponds to a particular order, and then within each order, there may be multiple items. The particular items are indicated by `item_name`, and any choices associated with those items are indicated by `choice_description` (e.g. the second item in `order_id=1` is an "Izze" drink with "Clementine" flavor). Additionally, there is the `item_price` for each order and the `quantity` of the item ordered. Be careful with the `item_price` as it is not the price for the item, it is the price for the item times the number of items ordered. You can see this in row 4 where `order_id=2` orders 2 chicken bowls for `$16.98` where a single chicken bowl would be half of that.
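
To make the point about `item_price` concrete, here is a minimal sketch on a toy dataframe that mimics the structure described above. The column `unit_price` and the dataframe `toy` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy rows that mimic the structure described above.
toy = pd.DataFrame({
    "order_id": [1, 2],
    "quantity": [1, 2],
    "item_name": ["Izze", "Chicken Bowl"],
    "item_price_number": [3.39, 16.98],
})

# Price of a single item = line total divided by the quantity on that line.
toy["unit_price"] = toy["item_price_number"] / toy["quantity"]
print(toy[["item_name", "quantity", "unit_price"]])
```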

In [22]:
chipo.isnull().sum()

order_id                 0
quantity                 0
item_name                0
choice_description    1228
item_price               0
item_price_number        0
dtype: int64

## Advanced Pandas Dataframes

We will be looking at movies from IMDB. This data set is on Kaggle and you can find the original [here](https://www.kaggle.com/PromptCloudHQ/imdb-data/download).

In [23]:
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

Note that we just specified a particular column of the csv (in this case, the one called "Title") as the index.

In [24]:
movies_df

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0


What about our columns? We can see a full list with the following.

In [25]:
movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

Oftentimes it would be nice not to have spaces or parentheses in the column names, so let's get rid of them.

In [26]:
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)

In [27]:
movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
       'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')

That's better, and we remembered to do `inplace=True` this time.

Sometimes we would like everything to be lowercase as well, and it would be a pain to write it all out. However, we can use a list comprehension to make it easy. First, let's get the lowercase names.

In [28]:
lowercase_columns = [col.lower() for col in movies_df.columns]

In [29]:
lowercase_columns

['rank',
 'genre',
 'description',
 'director',
 'actors',
 'year',
 'runtime',
 'rating',
 'votes',
 'revenue_millions',
 'metascore']

Now, we can just set the columns of `movies_df` equal to this.

In [30]:
movies_df.columns = lowercase_columns

In [31]:
movies_df

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0


Now, they are all lowercase.
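
As an aside, pandas also exposes vectorized string methods on the column index, which can stand in for the list comprehension. The sketch below uses a toy dataframe, and the clean-up rule for parentheses is an illustrative assumption rather than what the cells above do.

```python
import pandas as pd

toy = pd.DataFrame({"Rank": [1], "Runtime (Minutes)": [121]})

# Same effect as the list comprehension: lowercase every column label.
toy.columns = toy.columns.str.lower()

# Illustrative clean-up: turn " (" into "_" and drop the closing ")".
toy.columns = [c.replace(" (", "_").replace(")", "") for c in toy.columns]
print(list(toy.columns))
```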

## Working with Missing Values

While we've seen how to impute values in other assignments, sometimes it's good to clean up missing values straight in the dataframe. Let's see if we have any.

In [32]:
movies_df.isnull()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,False,False,False,False
Split,False,False,False,False,False,False,False,False,False,False,False
Sing,False,False,False,False,False,False,False,False,False,False,False
Suicide Squad,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,False,False,False,False,False,False,False,False,False,True,False
Hostel: Part II,False,False,False,False,False,False,False,False,False,False,False
Step Up 2: The Streets,False,False,False,False,False,False,False,False,False,False,False
Search Party,False,False,False,False,False,False,False,False,False,True,False


That's not quite the best way to look at it. `.isnull()` tells us, using `True` or `False`, whether each particular value is missing. We'd really like to summarize over columns, and we can use `.sum()` for that (each `True` counts as 1). This is also an example of chaining methods together, which we often end up doing with pandas: the first method summarizes the data in one way, and then we summarize the summary.

In [33]:
movies_df.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

We can find the total number of null values using a double application of the `.sum()` method, once for all of the rows in each column and then once to sum over all of the columns.

In [34]:
movies_df.isnull().sum().sum()

192
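
The chain can also be unpacked one step at a time. The following is a small sketch on a made-up dataframe (the names `toy`, `mask`, `per_column`, and `total` are hypothetical) showing what each link in the chain returns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
    "c": [1, 2, 3],
})

mask = toy.isnull()        # DataFrame of True/False, same shape as toy
per_column = mask.sum()    # Series: number of missing values in each column
total = per_column.sum()   # single number: missing values in the whole frame

print(per_column)
print(total)
```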

### Problem #1 - 1 point

How many null values are there altogether in the Chipotle dataset? Store your answer in a variable called `null_values`. I.e., your answer should look like
``` python
null_values = x
```
where x is a numeric value. The grading cell will not tell you whether or not you are right, but it will tell you whether your answer is in the correct format.

In [44]:
import pandas as pd
import numbers
chipotle_data = pd.read_csv('chipotle.tsv', delimiter='\t')
null_values = chipotle_data.isnull().sum().sum()
print(null_values)
assert isinstance(null_values, numbers.Number)

1246


In [45]:
# THIS IS A GRADING CELL. DO NOT EDIT.
# This cell will not tell you whether or not your answer is right.
# If you have not assigned something to the variable null_values or you have not assigned a string,
# you will get an error, but if you just get the question wrong,
# you will not necessarily get an error.
from nose.tools import assert_equal, assert_true
import numbers
print(null_values)
assert_true(isinstance(null_values, numbers.Number))

1246


Now we can see that we are missing some `revenue_millions` and some `metascore` data. Let's just get rid of those rows that have missing values.

In [46]:
movies_df.dropna()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...
Resident Evil: Afterlife,994,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...,Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,140900,60.13,37.0
Project X,995,Comedy,3 high school seniors throw a birthday party t...,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,164088,54.72,48.0
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0


We didn't do `inplace=True`, so this won't stick to our original `movies_df` dataframe, but we can tell that it got rid of 162 rows. Maybe instead of getting rid of the rows, we would have preferred to get rid of the columns that have missing data. We can do that like this.

In [47]:
movies_df.dropna(axis=1)

Title,rank,genre,description,director,actors,year,runtime,rating,votes
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727
...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881


Remember, `axis=0` means the rows of a dataframe, and `axis=1` means the columns of a dataframe whenever we are applying things like `.dropna()`. We see that we ended up with fewer columns (we got rid of the columns `revenue_millions` and `metascore`), but have all of our rows.
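
A quick way to check the effect of `axis` is to compare shapes on a tiny dataframe. This is a minimal sketch with made-up data; `rows_kept` and `cols_kept` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4, 5, 6]})

rows_kept = toy.dropna()        # default axis=0: drop rows containing a NaN
cols_kept = toy.dropna(axis=1)  # axis=1: drop columns containing a NaN

print(toy.shape, rows_kept.shape, cols_kept.shape)
```

Because neither call uses `inplace=True`, `toy` itself still has all three rows and both columns afterwards.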

Of course, we could have imputed these values as well, and we've seen how to do that with `sklearn` functions. We can do something similar in pandas, but I don't recommend it.

## Sorting the Data

Oftentimes we want to sort the dataframe in order to answer some questions. We can do that with the `.sort_values()` method, where we tell it `by` which column.

In [48]:
movies_df.sort_values(by = "rating")

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Disaster Movie,830,Comedy,"Over the course of one evening, an unsuspectin...",Jason Friedberg,"Carmen Electra, Vanessa Lachey,Nicole Parker, ...",2008,87,1.9,77207,14.17,15.0
Don't Fuck in the Woods,43,Horror,A group of friends are going on a camping trip...,Shawn Burkett,"Brittany Blanton, Ayse Howard, Roman Jossart,N...",2016,73,2.7,496,,
Dragonball Evolution,872,"Action,Adventure,Fantasy",The young warrior Son Goku sets out on a quest...,James Wong,"Justin Chatwin, James Marsters, Yun-Fat Chow, ...",2009,85,2.7,59512,9.35,45.0
Tall Men,648,"Fantasy,Horror,Thriller",A challenged man is stalked by tall phantoms i...,Jonathan Holbrook,"Dan Crisafulli, Kay Whitney, Richard Garcia, P...",2016,133,3.2,173,,57.0
Wrecker,969,"Action,Horror,Thriller",Best friends Emily and Lesley go on a road tri...,Micheal Bafaro,"Anna Hutchison, Andrea Whitburn, Jennifer Koen...",2015,83,3.5,1210,,37.0
...,...,...,...,...,...,...,...,...,...,...,...
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Intouchables,250,"Biography,Comedy,Drama",After he becomes a quadriplegic from a paragli...,Olivier Nakache,"François Cluzet, Omar Sy, Anne Le Ny, Audrey F...",2011,112,8.6,557965,13.18,57.0
Dangal,118,"Action,Biography,Drama",Former wrestler Mahavir Singh Phogat and his t...,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",2016,161,8.8,48969,11.15,
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0


We may want to have it sorted descending instead.

In [49]:
movies_df.sort_values(by = "rating", ascending=False)

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
Dangal,118,"Action,Biography,Drama",Former wrestler Mahavir Singh Phogat and his t...,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",2016,161,8.8,48969,11.15,
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
Kimi no na wa,97,"Animation,Drama,Fantasy",Two strangers find themselves linked in a biza...,Makoto Shinkai,"Ryûnosuke Kamiki, Mone Kamishiraishi, Ryô Nari...",2016,106,8.6,34110,4.68,79.0
...,...,...,...,...,...,...,...,...,...,...,...
Wrecker,969,"Action,Horror,Thriller",Best friends Emily and Lesley go on a road tri...,Micheal Bafaro,"Anna Hutchison, Andrea Whitburn, Jennifer Koen...",2015,83,3.5,1210,,37.0
Tall Men,648,"Fantasy,Horror,Thriller",A challenged man is stalked by tall phantoms i...,Jonathan Holbrook,"Dan Crisafulli, Kay Whitney, Richard Garcia, P...",2016,133,3.2,173,,57.0
Dragonball Evolution,872,"Action,Adventure,Fantasy",The young warrior Son Goku sets out on a quest...,James Wong,"Justin Chatwin, James Marsters, Yun-Fat Chow, ...",2009,85,2.7,59512,9.35,45.0
Don't Fuck in the Woods,43,Horror,A group of friends are going on a camping trip...,Shawn Burkett,"Brittany Blanton, Ayse Howard, Roman Jossart,N...",2016,73,2.7,496,,


Notice that just like before, we haven't actually changed the underlying dataframe because we didn't sort `inplace`.

In [50]:
movies_df

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0


We can also sort by more than one column. The following sorts first by year and then by rating; because `ascending=False` is a single value, it applies to both columns.

In [51]:
movies_df.sort_values(by = ["year", "rating"], ascending=False)

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Dangal,118,"Action,Biography,Drama",Former wrestler Mahavir Singh Phogat and his t...,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",2016,161,8.8,48969,11.15,
Kimi no na wa,97,"Animation,Drama,Fantasy",Two strangers find themselves linked in a biza...,Makoto Shinkai,"Ryûnosuke Kamiki, Mone Kamishiraishi, Ryô Nari...",2016,106,8.6,34110,4.68,79.0
Koe no katachi,862,"Animation,Drama,Romance","The story revolves around Nishimiya Shoko, a g...",Naoko Yamada,"Miyu Irino, Saori Hayami, Aoi Yuki, Kenshô Ono",2016,129,8.4,2421,,80.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Paint It Black,479,Drama,A young woman attempts to deal with the death ...,Amber Tamblyn,"Alia Shawkat, Nancy Kwan, Annabelle Attanasio,...",2016,96,8.3,61,,71.0
...,...,...,...,...,...,...,...,...,...,...,...
Superman Returns,925,"Action,Adventure,Sci-Fi","Superman reappears after a long absence, but i...",Bryan Singer,"Brandon Routh, Kevin Spacey, Kate Bosworth, Ja...",2006,154,6.1,246797,200.07,72.0
The Fast and the Furious: Tokyo Drift,309,"Action,Crime,Thriller",A teenager becomes a major competitor in the w...,Justin Lin,"Lucas Black, Zachery Ty Bryan, Shad Moss, Dami...",2006,104,6.0,193479,62.49,45.0
The Break-Up,551,"Comedy,Drama,Romance",In a bid to keep their luxurious condo from th...,Peyton Reed,"Jennifer Aniston, Vince Vaughn, Jon Favreau, J...",2006,106,5.8,106381,118.68,45.0
Lady in the Water,774,"Drama,Fantasy,Mystery",Apartment building superintendent Cleveland He...,M. Night Shyamalan,"Paul Giamatti, Bryce Dallas Howard, Jeffrey Wr...",2006,110,5.6,82701,42.27,36.0
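
When sorting by several columns, `ascending` can also be a list with one entry per column, so each column gets its own direction. The sketch below uses a made-up four-row dataframe; the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2016, 2015, 2016, 2015],
    "rating": [7.0, 8.0, 9.0, 6.0],
})

# One flag for both columns: newest year first, highest rating first.
newest_best = toy.sort_values(by=["year", "rating"], ascending=False)

# One flag per column: oldest year first, but highest rating first within a year.
oldest_best = toy.sort_values(by=["year", "rating"], ascending=[True, False])

print(list(newest_best["title"]))
print(list(oldest_best["title"]))
```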


### Problem #2 - 1 point

What item was ordered the most times in a single order? Remember, the `quantity` column is the number of times an item was ordered in a single order. The grading cell will not tell you whether or not you are right. Your answer should look like
``` python
most_single_ordered = "Item Name Typed Exactly"
```
With `Item Name Typed Exactly` replaced with the exact item name in quotes (and not indented obviously). I.e., if the item ordered the most times in a single order was `Nantucket Nectar`, then your answer would look like `most_single_ordered = "Nantucket Nectar"`.

In [56]:
import pandas as pd
chipotle_data = pd.read_csv('chipotle.tsv', delimiter='\t')
item_quantity = chipotle_data.groupby('item_name')['quantity'].sum()
most_single_ordered = item_quantity.idxmax()
print(most_single_ordered)

Chicken Bowl


In [57]:
# THIS IS A GRADING CELL. DO NOT EDIT.
# This cell will not tell you whether or not your answer is right.
# If you have not assigned something to the variable most_single_ordered or you have not assigned a string,
# you will get an error, but if you just get the question wrong,
# you will not necessarily get an error.
from nose.tools import assert_equal, assert_in
assert_equal(type(most_single_ordered), str)
print(most_single_ordered)

Chicken Bowl


## Summarizing the Data

There are a couple of ways we can use the DataFrame to summarize itself. In fact, we've been using these methods to build up our `summarize_dataframe` function (after this exercise, take a look at it and see if it makes more sense). The first is the `.describe()` method.

In [58]:
movies_df.describe()

Unnamed: 0,rank,year,runtime,rating,votes,revenue_millions,metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,872.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,113.715,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


Note that the `.describe()` method only summarized the numeric columns. We can `.describe()` categorical columns too, but we have to call it on just those columns.

In [59]:
movies_df[['genre', 'description']].describe()

Unnamed: 0,genre,description
count,1000,1000
unique,207,1000
top,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...
freq,50,1
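
Instead of listing the categorical columns by hand, `.describe()` also accepts `include` and `exclude` arguments that select columns by dtype. The following is a minimal sketch on a made-up dataframe, assuming we simply want everything that is not numeric.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama"],
    "rating": [7.1, 6.4, 8.0],
})

numeric_summary = toy.describe()                   # numeric columns only
text_summary = toy.describe(exclude=[np.number])   # everything except numeric columns

print(numeric_summary)
print(text_summary)
```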


For a categorical value, we can tell the frequency of each category with `.value_counts()`. For a really long list (like `genre` which according to the above has 207 different values) we can use `.head()` to only get the top part by chaining methods together.

In [60]:
movies_df['genre'].value_counts().head(10)

genre
Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Animation,Adventure,Comedy    27
Action,Adventure,Fantasy      27
Comedy,Drama                  27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: count, dtype: int64
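
If proportions are more useful than raw counts, `.value_counts()` takes a `normalize=True` argument. Here is a short sketch on a made-up Series; `genres`, `counts`, `shares`, and `top_two` are hypothetical names.

```python
import pandas as pd

genres = pd.Series(
    ["Drama", "Comedy", "Drama", "Horror", "Drama", "Comedy"], name="genre"
)

counts = genres.value_counts()                 # sorted from most to least frequent
shares = genres.value_counts(normalize=True)   # fractions that sum to 1
top_two = counts.head(2)                       # chain .head() to shorten the list

print(top_two)
print(shares)
```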

For numeric variables, we can get the correlation as well. Note that we pass `numeric_only=True` as an argument. This is necessary when an operation expects numeric values but the dataframe contains a mixture of numeric and categorical columns.

In [61]:
movies_df.corr(numeric_only=True)

Unnamed: 0,rank,year,runtime,rating,votes,revenue_millions,metascore
rank,1.0,-0.261605,-0.221739,-0.219555,-0.283876,-0.271592,-0.191869
year,-0.261605,1.0,-0.1649,-0.211219,-0.411904,-0.12679,-0.079305
runtime,-0.221739,-0.1649,1.0,0.392214,0.407062,0.267953,0.211978
rating,-0.219555,-0.211219,0.392214,1.0,0.511537,0.217654,0.631897
votes,-0.283876,-0.411904,0.407062,0.511537,1.0,0.639661,0.325684
revenue_millions,-0.271592,-0.12679,0.267953,0.217654,0.639661,1.0,0.142397
metascore,-0.191869,-0.079305,0.211978,0.631897,0.325684,0.142397,1.0
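
As a small illustration of what `numeric_only=True` does, the sketch below builds a hypothetical three-column frame (`toy_df` is made up for this example) with one text column and two numeric ones. The text column is simply left out of the correlation matrix.

```python
import pandas as pd

# Hypothetical toy frame mixing a text column with numeric columns.
toy_df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Action", "Drama"],
    "runtime": [100, 110, 120, 130],
    "rating": [6.0, 6.5, 7.0, 7.5],
})

# Only the numeric columns (runtime, rating) take part in the correlation.
corr_matrix = toy_df.corr(numeric_only=True)
print(corr_matrix)
```

In this toy data `rating` rises in lockstep with `runtime`, so the off-diagonal correlation comes out as 1.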


## DataFrame Slicing, Dicing, and Extracting

We regularly only want to work with a subset of the data, either a set of columns or a set of rows. Fortunately, it is easy with DataFrames to get a subset. Let's start by pulling out a single column.

In [62]:
genre_col = movies_df['genre']

What did we actually get back?

In [63]:
type(genre_col)

pandas.core.series.Series

In [64]:
genre_col

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
                                     ...           
Secret in Their Eyes            Crime,Drama,Mystery
Hostel: Part II                              Horror
Step Up 2: The Streets          Drama,Music,Romance
Search Party                       Adventure,Comedy
Nine Lives                    Comedy,Family,Fantasy
Name: genre, Length: 1000, dtype: object

It's a Series! See, I told you that DataFrames were made up of Series.

If we want to get a DataFrame back, we need to give `movies_df` a _list_ of columns (it can be a list with a single item).

In [65]:
genre_df = movies_df[['genre']]

In [66]:
type(genre_df)

pandas.core.frame.DataFrame

In [67]:
genre_df

Title,genre
Guardians of the Galaxy,"Action,Adventure,Sci-Fi"
Prometheus,"Adventure,Mystery,Sci-Fi"
Split,"Horror,Thriller"
Sing,"Animation,Comedy,Family"
Suicide Squad,"Action,Adventure,Fantasy"
...,...
Secret in Their Eyes,"Crime,Drama,Mystery"
Hostel: Part II,Horror
Step Up 2: The Streets,"Drama,Music,Romance"
Search Party,"Adventure,Comedy"


If we are giving a list, we can give more columns.

In [68]:
subset_df = movies_df[['genre', 'rating']]

In [69]:
subset_df

Title,genre,rating
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",8.1
Prometheus,"Adventure,Mystery,Sci-Fi",7.0
Split,"Horror,Thriller",7.3
Sing,"Animation,Comedy,Family",7.2
Suicide Squad,"Action,Adventure,Fantasy",6.2
...,...,...
Secret in Their Eyes,"Crime,Drama,Mystery",6.2
Hostel: Part II,Horror,5.5
Step Up 2: The Streets,"Drama,Music,Romance",6.2
Search Party,"Adventure,Comedy",5.6
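
The difference between single and double brackets can be checked directly. This is a minimal sketch on a hypothetical two-row frame (`toy_df` is made up here, using two titles from the data above): a string gives back a one-dimensional Series, while a list gives back a two-dimensional DataFrame, even when the list has a single item.

```python
import pandas as pd

# Hypothetical two-row stand-in for `movies_df`.
toy_df = pd.DataFrame(
    {"genre": ["Horror,Thriller", "Animation,Comedy,Family"], "rating": [7.3, 7.2]},
    index=pd.Index(["Split", "Sing"], name="Title"),
)

as_series = toy_df["genre"]    # a string selects one column as a Series
as_frame = toy_df[["genre"]]   # a list selects columns as a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```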


For rows, we have two equivalent ways to get data. Both methods **loc**ate a row and return it.
  * `.loc` - **loc**ates by name
  * `.iloc` - **loc**ates by numerical **i**ndex
  
Let's get the data for the movie Prometheus.

In [70]:
prom = movies_df.loc["Prometheus"]

In [71]:
prom

rank                                                                2
genre                                        Adventure,Mystery,Sci-Fi
description         Following clues to the origin of mankind, a te...
director                                                 Ridley Scott
actors              Noomi Rapace, Logan Marshall-Green, Michael Fa...
year                                                             2012
runtime                                                           124
rating                                                            7.0
votes                                                          485820
revenue_millions                                               126.46
metascore                                                        65.0
Name: Prometheus, dtype: object

Alternatively, we can give it the row number. Remember, Prometheus is the second row in the DataFrame (and we start counting from 0).

In [72]:
prom = movies_df.iloc[1]

In [73]:
prom

rank                                                                2
genre                                        Adventure,Mystery,Sci-Fi
description         Following clues to the origin of mankind, a te...
director                                                 Ridley Scott
actors              Noomi Rapace, Logan Marshall-Green, Michael Fa...
year                                                             2012
runtime                                                           124
rating                                                            7.0
votes                                                          485820
revenue_millions                                               126.46
metascore                                                        65.0
Name: Prometheus, dtype: object

We can also get a range of rows, just like we can with a list. Let's look at a subset of the movies.

In [74]:
movie_subset = movies_df.loc['Prometheus':'Sing']

In [75]:
movie_subset

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0


In [76]:
movie_subset = movies_df.iloc[1:4]

In [77]:
movie_subset

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
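
One detail worth noticing in the two slices above is that `'Prometheus':'Sing'` and `1:4` return the same three rows. A `.loc` slice includes its end label, whereas an `.iloc` slice excludes its end position, just like a list slice. The sketch below shows this on a hypothetical five-row frame (`toy_df` is made up, reusing the first five titles).

```python
import pandas as pd

# Hypothetical five-row stand-in for `movies_df`, indexed by title.
toy_df = pd.DataFrame(
    {"rank": [1, 2, 3, 4, 5]},
    index=pd.Index(
        ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
        name="Title",
    ),
)

by_label = toy_df.loc["Prometheus":"Sing"]   # end label "Sing" is included
by_position = toy_df.iloc[1:4]               # end position 4 is excluded

print(by_label.index.tolist())
print(by_label.equals(by_position))
```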


We can also use `.loc` and `.iloc` to access specific columns. To do this, we use the "second" slot (the part after the comma) in either `.loc` or `.iloc`.

In [78]:
column_subset = movies_df.loc[:,"genre"]

In [79]:
column_subset

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
                                     ...           
Secret in Their Eyes            Crime,Drama,Mystery
Hostel: Part II                              Horror
Step Up 2: The Streets          Drama,Music,Romance
Search Party                       Adventure,Comedy
Nine Lives                    Comedy,Family,Fantasy
Name: genre, Length: 1000, dtype: object

In [80]:
column_subset = movies_df.loc[:,"genre":"director"]

In [81]:
column_subset

Title,genre,description,director
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn
Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott
Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan
Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet
Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer
...,...,...,...
Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray
Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth
Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu
Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong


Notice how we put `:` in the first slot (i.e. before the comma)? This says give us all the rows. Similarly with `.iloc`.

In [82]:
column_subset = movies_df.iloc[:,1]

In [83]:
column_subset

Title
Guardians of the Galaxy     Action,Adventure,Sci-Fi
Prometheus                 Adventure,Mystery,Sci-Fi
Split                               Horror,Thriller
Sing                        Animation,Comedy,Family
Suicide Squad              Action,Adventure,Fantasy
                                     ...           
Secret in Their Eyes            Crime,Drama,Mystery
Hostel: Part II                              Horror
Step Up 2: The Streets          Drama,Music,Romance
Search Party                       Adventure,Comedy
Nine Lives                    Comedy,Family,Fantasy
Name: genre, Length: 1000, dtype: object

In [84]:
column_subset = movies_df.iloc[:,1:4]

In [85]:
column_subset

Title,genre,description,director
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn
Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott
Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan
Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet
Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer
...,...,...,...
Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray
Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth
Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu
Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong


We can also access a subset of rows and columns _at the same time_.

In [86]:
data_subset = movies_df.loc["Prometheus":"Sing", "genre":"director"]

In [87]:
data_subset

Title,genre,description,director
Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott
Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan
Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet


In [88]:
data_subset = movies_df.iloc[1:4, 1:4]

In [89]:
data_subset

Title,genre,description,director
Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott
Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan
Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet
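
The same two-slot pattern also works with a single row and a single column, in which case we get back one value rather than a DataFrame. The sketch below assumes a hypothetical three-row frame (`toy_df` is made up, with values copied from the rows shown earlier).

```python
import pandas as pd

# Hypothetical three-row stand-in for `movies_df`.
toy_df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "genre": ["Action,Adventure,Sci-Fi", "Adventure,Mystery,Sci-Fi", "Horror,Thriller"],
        "rating": [8.1, 7.0, 7.3],
    },
    index=pd.Index(["Guardians of the Galaxy", "Prometheus", "Split"], name="Title"),
)

# Row slot first, column slot second.
by_label = toy_df.loc["Prometheus", "rating"]   # row and column by name
by_position = toy_df.iloc[1, 2]                 # second row, third column

print(by_label, by_position)
```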


## Filtering Data

Often, we want to select data based on some condition, that is, to filter it. We can do that with DataFrames quite easily as well. Suppose we only want movies that were directed by Ridley Scott.

In [90]:
movies_df[movies_df['director'] == "Ridley Scott"]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
The Martian,103,"Adventure,Drama,Sci-Fi",An astronaut becomes stranded on Mars after hi...,Ridley Scott,"Matt Damon, Jessica Chastain, Kristen Wiig, Ka...",2015,144,8.0,556097,228.43,80.0
Robin Hood,388,"Action,Adventure,Drama","In 12th century England, Robin and his band of...",Ridley Scott,"Russell Crowe, Cate Blanchett, Matthew Macfady...",2010,140,6.7,221117,105.22,53.0
American Gangster,471,"Biography,Crime,Drama","In 1970s America, a detective works to bring d...",Ridley Scott,"Denzel Washington, Russell Crowe, Chiwetel Eji...",2007,157,7.8,337835,130.13,76.0
Exodus: Gods and Kings,517,"Action,Adventure,Drama",The defiant leader Moses rises up against the ...,Ridley Scott,"Christian Bale, Joel Edgerton, Ben Kingsley, S...",2014,150,6.0,137299,65.01,52.0
The Counselor,522,"Crime,Drama,Thriller",A lawyer finds himself in over his head when h...,Ridley Scott,"Michael Fassbender, Penélope Cruz, Cameron Dia...",2013,117,5.3,84927,16.97,48.0
A Good Year,531,"Comedy,Drama,Romance",A British investment broker inherits his uncle...,Ridley Scott,"Russell Crowe, Abbie Cornish, Albert Finney, M...",2006,117,6.9,74674,7.46,47.0
Body of Lies,738,"Action,Drama,Romance",A CIA agent on the ground in Jordan hunts down...,Ridley Scott,"Leonardo DiCaprio, Russell Crowe, Mark Strong,...",2008,128,7.1,182305,39.38,57.0


What if we want only movies with ratings greater than or equal to 8.6?

In [91]:
movies_df[movies_df['rating'] >= 8.6]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
Kimi no na wa,97,"Animation,Drama,Fantasy",Two strangers find themselves linked in a biza...,Makoto Shinkai,"Ryûnosuke Kamiki, Mone Kamishiraishi, Ryô Nari...",2016,106,8.6,34110,4.68,79.0
Dangal,118,"Action,Biography,Drama",Former wrestler Mahavir Singh Phogat and his t...,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",2016,161,8.8,48969,11.15,
The Intouchables,250,"Biography,Comedy,Drama",After he becomes a quadriplegic from a paragli...,Olivier Nakache,"François Cluzet, Omar Sy, Anne Le Ny, Audrey F...",2011,112,8.6,557965,13.18,57.0
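
It can help to look at what the expression inside the brackets produces on its own. A comparison on a column returns a boolean Series (often called a mask) with one `True`/`False` per row, and indexing the DataFrame with that mask keeps only the `True` rows. A minimal sketch on a hypothetical three-row frame (`toy_df` is made up, with ratings taken from the tables above):

```python
import pandas as pd

# Hypothetical three-row stand-in for `movies_df`.
toy_df = pd.DataFrame(
    {"rating": [8.6, 7.0, 9.0]},
    index=pd.Index(["Interstellar", "Prometheus", "The Dark Knight"], name="Title"),
)

# The comparison alone gives one True/False per row.
mask = toy_df["rating"] >= 8.6
print(mask)

# Indexing with the mask keeps the rows where it is True.
high_rated = toy_df[mask]
print(high_rated)
```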


Often times we really want to filter by more than one criterion. For example, we might want to get movies that are directed either by Christopher Nolan **or** Ridley Scott. To do an **or** we just connect the conditions by `|`, like below.

In [92]:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
The Prestige,65,"Drama,Mystery,Sci-Fi",Two stage magicians engage in competitive one-...,Christopher Nolan,"Christian Bale, Hugh Jackman, Scarlett Johanss...",2006,130,8.5,913152,53.08,66.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
The Martian,103,"Adventure,Drama,Sci-Fi",An astronaut becomes stranded on Mars after hi...,Ridley Scott,"Matt Damon, Jessica Chastain, Kristen Wiig, Ka...",2015,144,8.0,556097,228.43,80.0
The Dark Knight Rises,125,"Action,Thriller",Eight years after the Joker's reign of anarchy...,Christopher Nolan,"Christian Bale, Tom Hardy, Anne Hathaway,Gary ...",2012,164,8.5,1222645,448.13,78.0
Robin Hood,388,"Action,Adventure,Drama","In 12th century England, Robin and his band of...",Ridley Scott,"Russell Crowe, Cate Blanchett, Matthew Macfady...",2010,140,6.7,221117,105.22,53.0
American Gangster,471,"Biography,Crime,Drama","In 1970s America, a detective works to bring d...",Ridley Scott,"Denzel Washington, Russell Crowe, Chiwetel Eji...",2007,157,7.8,337835,130.13,76.0
Exodus: Gods and Kings,517,"Action,Adventure,Drama",The defiant leader Moses rises up against the ...,Ridley Scott,"Christian Bale, Joel Edgerton, Ben Kingsley, S...",2014,150,6.0,137299,65.01,52.0


Given that we wanted to check whether or not a value was in a set of things (in this case whether or not "director" is in `["Christopher Nolan", "Ridley Scott"]`), we could have done something simpler and used the `.isin()` method.

In [93]:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
The Prestige,65,"Drama,Mystery,Sci-Fi",Two stage magicians engage in competitive one-...,Christopher Nolan,"Christian Bale, Hugh Jackman, Scarlett Johanss...",2006,130,8.5,913152,53.08,66.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
The Martian,103,"Adventure,Drama,Sci-Fi",An astronaut becomes stranded on Mars after hi...,Ridley Scott,"Matt Damon, Jessica Chastain, Kristen Wiig, Ka...",2015,144,8.0,556097,228.43,80.0
The Dark Knight Rises,125,"Action,Thriller",Eight years after the Joker's reign of anarchy...,Christopher Nolan,"Christian Bale, Tom Hardy, Anne Hathaway,Gary ...",2012,164,8.5,1222645,448.13,78.0
Robin Hood,388,"Action,Adventure,Drama","In 12th century England, Robin and his band of...",Ridley Scott,"Russell Crowe, Cate Blanchett, Matthew Macfady...",2010,140,6.7,221117,105.22,53.0
American Gangster,471,"Biography,Crime,Drama","In 1970s America, a detective works to bring d...",Ridley Scott,"Denzel Washington, Russell Crowe, Chiwetel Eji...",2007,157,7.8,337835,130.13,76.0
Exodus: Gods and Kings,517,"Action,Adventure,Drama",The defiant leader Moses rises up against the ...,Ridley Scott,"Christian Bale, Joel Edgerton, Ben Kingsley, S...",2014,150,6.0,137299,65.01,52.0
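
To confirm that the two approaches select the same rows, the sketch below runs both on a hypothetical three-row frame (`toy_df` is made up for this example) and compares the results with `.equals()`.

```python
import pandas as pd

# Hypothetical three-row stand-in for `movies_df`.
toy_df = pd.DataFrame(
    {"director": ["Ridley Scott", "Christopher Nolan", "James Gunn"]},
    index=pd.Index(
        ["Prometheus", "Interstellar", "Guardians of the Galaxy"], name="Title"
    ),
)

# Two conditions joined with | (each wrapped in parentheses) ...
with_or = toy_df[
    (toy_df["director"] == "Christopher Nolan") | (toy_df["director"] == "Ridley Scott")
]
# ... versus one membership test with .isin().
with_isin = toy_df[toy_df["director"].isin(["Christopher Nolan", "Ridley Scott"])]

print(with_or.equals(with_isin))
```

`.isin()` tends to scale better as the list of accepted values grows, since adding a name means extending the list rather than adding another `|` clause.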


However, we often want to chain together several filters and keep only the rows where all of them hold. That is an **and** statement, and we use `&` to get it. Suppose that we wanted movies released between 2005 and 2010 with a rating above 8.0 whose revenue was below the 25th percentile. We can do this with the code below.

In [94]:
movies_df[((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
          & (movies_df['rating'] > 8.0)
          & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
3 Idiots,431,"Comedy,Drama",Two friends are searching for their long lost ...,Rajkumar Hirani,"Aamir Khan, Madhavan, Mona Singh, Sharman Joshi",2009,170,8.4,238789,6.52,67.0
The Lives of Others,477,"Drama,Thriller","In 1984 East Berlin, an agent of the secret po...",Florian Henckel von Donnersmarck,"Ulrich Mühe, Martina Gedeck,Sebastian Koch, Ul...",2006,137,8.5,278103,11.28,89.0
Incendies,714,"Drama,Mystery,War",Twins journey to the Middle East to discover t...,Denis Villeneuve,"Lubna Azabal, Mélissa Désormeaux-Poulin, Maxim...",2010,131,8.2,92863,6.86,80.0
Taare Zameen Par,992,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...,Aamir Khan,"Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...",2007,165,8.5,102697,1.2,42.0
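
Two details of the filter above are easy to miss. First, each condition sits in its own parentheses, because `&` and `|` bind more tightly than comparisons like `>=` in Python. Second, a range check like the one on `year` can also be written with the Series method `.between()`, which includes both endpoints by default. The sketch below uses a hypothetical four-row frame (`toy_df` and its film names are made up) and is only one possible way to write this.

```python
import pandas as pd

# Hypothetical four-row stand-in for `movies_df`.
toy_df = pd.DataFrame(
    {"year": [2004, 2006, 2010, 2012], "rating": [8.5, 8.5, 7.0, 8.2]},
    index=pd.Index(["Film A", "Film B", "Film C", "Film D"], name="Title"),
)

# Range check written as two comparisons joined with &.
in_range = (toy_df["year"] >= 2005) & (toy_df["year"] <= 2010)
# The same range check with .between() (both endpoints included).
in_range_alt = toy_df["year"].between(2005, 2010)

# Masks can be stored in variables and combined afterwards.
selected = toy_df[in_range & (toy_df["rating"] > 8.0)]
print(selected)
```

Storing each mask in a named variable, as done here, can make a long filter easier to read than one large bracketed expression.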


### Problem #3 - 1 point

Create a new dataframe called `chipo_filtered` that consists of all rows where more than one `Chicken Bowl` with the same `choice_description` was ordered. Note that each row corresponds to a particular item with a particular choice description. The `quantity` column denotes how many of that item with that choice description were in the order.

The grade cell will check your answer. If you do not get any errors, you very likely have gotten this question correct. However, there are held back tests as well.

In [121]:
import pandas as pd

file_path = 'chipotle.tsv'
chipo_df = pd.read_csv(file_path, sep='\t')

# Each row is already one item with one choice_description, and `quantity`
# says how many of that item were ordered. So no grouping or merging is
# needed: keep the Chicken Bowl rows whose quantity is greater than 1.
chipo_filtered = chipo_df[(chipo_df['item_name'] == 'Chicken Bowl')
                          & (chipo_df['quantity'] > 1)]

In [122]:
# THIS IS A GRADING CELL. DO NOT EDIT AND DO NOT COPY.
# This will give an error if your answer is wrong.
from nose.tools import assert_equal, assert_in
assert_equal(len(chipo_filtered), 33)
assert_equal(chipo_filtered.iloc[3]['quantity'], 3)
assert_equal(chipo_filtered.iloc[24]['order_id'], 1374)

## Using `.groupby` to Answer Questions About Groups

A very useful method on dataframes is the `.groupby()` method. `.groupby()` allows you to **group** the data **by** a certain column. This is very similar to a pivot table in Excel. Once you have grouped the data, you can do useful operations like finding the minimum rating over all movies for a particular director. First, you need to group the data.

In [123]:
movies_grouped = movies_df.groupby('director')

We can now easily answer questions like how many directors are in the data set. That is just the number of groups in `movies_grouped`, which we can get with `len()`.

In [124]:
len(movies_grouped)

644
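
As a cross-check on what `len()` reports for a grouped object, the sketch below counts the groups three ways on a hypothetical five-row frame (`toy_df` is made up for this example). It also uses `.size()`, which gives the number of rows in each group.

```python
import pandas as pd

# Hypothetical five-row stand-in for `movies_df`.
toy_df = pd.DataFrame({
    "director": ["Ridley Scott", "Zack Snyder", "Ridley Scott", "Zack Snyder", "James Gunn"],
    "rating": [7.0, 6.7, 8.0, 7.7, 8.1],
})

grouped = toy_df.groupby("director")

# Three ways to count the distinct directors.
n_groups = len(grouped)
print(n_groups, grouped.ngroups, toy_df["director"].nunique())

# Number of rows (movies) per director.
counts = grouped.size()
print(counts)
```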

Once we have the movies grouped together by `director`, we can pull out individual directors and look at the movies they have worked on.

In [125]:
movies_grouped.get_group('Zack Snyder')

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Batman v Superman: Dawn of Justice,61,"Action,Adventure,Sci-Fi",Fearing that the actions of Superman are left ...,Zack Snyder,"Ben Affleck, Henry Cavill, Amy Adams, Jesse Ei...",2016,151,6.7,472307,330.25,44.0
300,114,"Action,Fantasy,War",King Leonidas of Sparta and a force of 300 men...,Zack Snyder,"Gerard Butler, Lena Headey, David Wenham, Domi...",2006,117,7.7,637104,210.59,52.0
Watchmen,148,"Action,Drama,Mystery","In 1985 where former superheroes exist, the mu...",Zack Snyder,"Jackie Earle Haley, Patrick Wilson, Carla Gugi...",2009,162,7.6,410249,107.5,56.0
Sucker Punch,286,"Action,Fantasy",A young girl is institutionalized by her abusi...,Zack Snyder,"Emily Browning, Vanessa Hudgens, Abbie Cornish...",2011,110,6.1,204874,36.38,33.0
Man of Steel,295,"Action,Adventure,Fantasy","Clark Kent, one of the last of an extinguished...",Zack Snyder,"Henry Cavill, Amy Adams, Michael Shannon, Dian...",2013,143,7.1,577010,291.02,55.0


We can also use methods like `.sum()`, `.min()`, `.max()`, `.median()`, `.count()`, and others to get a look into the data. Below we sum up all "summable" columns by each director.

In [126]:
movies_grouped.sum()

director,rank,genre,description,actors,year,runtime,rating,votes,revenue_millions,metascore
Aamir Khan,992,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...,"Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...",2007,165,8.5,102697,1.20,42.0
Abdellatif Kechiche,312,"Drama,Romance","Adèle's life is changed when she meets Emma, a...","Léa Seydoux, Adèle Exarchopoulos, Salim Kechio...",2013,180,7.8,103150,2.20,88.0
Adam Leon,784,"Comedy,Romance",A young man and woman find love in an unlikely...,"Callum Turner, Grace Van Patten, Michal Vondel...",2016,82,6.5,1031,0.00,77.0
Adam McKay,1910,"Biography,Comedy,DramaComedyAction,Comedy,Crim...",Four denizens in the world of high-finance pre...,"Christian Bale, Steve Carell, Ryan Gosling, Br...",8039,443,28.0,806827,438.14,262.0
Adam Shankman,1460,"Comedy,Drama,FamilyComedy,Drama,Musical",Pleasantly plump teenager Tracy Turnblad teach...,"John Travolta, Queen Latifah, Nikki Blonsky,Mi...",4019,240,12.6,167467,157.33,128.0
...,...,...,...,...,...,...,...,...,...,...
Xavier Dolan,1588,DramaDrama,"A widowed single mother, raising her violent s...","Anne Dorval, Antoine-Olivier Pilon, Suzanne Cl...",4030,236,15.1,44218,3.49,122.0
Yimou Zhang,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
Yorgos Lanthimos,479,"Comedy,Drama,RomanceDrama,Thriller","In a dystopian near future, single people, acc...","Colin Farrell, Rachel Weisz, Jessica Barden,Ol...",4024,213,14.4,172259,8.81,155.0
Zack Snyder,904,"Action,Adventure,Sci-FiAction,Fantasy,WarActio...",Fearing that the actions of Superman are left ...,"Ben Affleck, Henry Cavill, Amy Adams, Jesse Ei...",10055,683,35.2,2301544,975.74,240.0


We can then pull out individual directors and see their particular stats.

In [127]:
movies_grouped.sum().loc['Zack Snyder']

rank                                                              904
genre               Action,Adventure,Sci-FiAction,Fantasy,WarActio...
description         Fearing that the actions of Superman are left ...
actors              Ben Affleck, Henry Cavill, Amy Adams, Jesse Ei...
year                                                            10055
runtime                                                           683
rating                                                           35.2
votes                                                         2301544
revenue_millions                                               975.74
metascore                                                       240.0
Name: Zack Snyder, dtype: object

`.sum()` is probably not very useful here. However, we may want to know what Zack Snyder's best rated movie was. We can do that with `.max()`.

In [128]:
movies_grouped.max().loc['Zack Snyder']

rank                                                              295
genre                                              Action,Fantasy,War
description         King Leonidas of Sparta and a force of 300 men...
actors              Jackie Earle Haley, Patrick Wilson, Carla Gugi...
year                                                             2016
runtime                                                           162
rating                                                            7.7
votes                                                          637104
revenue_millions                                               330.25
metascore                                                        56.0
Name: Zack Snyder, dtype: object

You want to be a little careful about interpreting exactly what is going on above. The `.max()` method takes the maximum of each column independently, so the values in the output do not all come from the same movie. The highest metascore among Zack Snyder's films is 56, but the description above is not about that movie. Looking back at the list of all of Zack Snyder's films, we can see that this metascore belongs to "Watchmen", while the description is of the movie "300".
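
The sketch below reproduces this pitfall on a hypothetical three-row frame (`toy_df` is made up, with ratings and metascores copied from three of the Zack Snyder rows above). It also shows one alternative that is not used in this notebook: `.idxmax()` returns the index label of the row holding the maximum, so the whole row can then be pulled out with `.loc`.

```python
import pandas as pd

# Hypothetical stand-in: three of the Zack Snyder rows shown above.
toy_df = pd.DataFrame(
    {"rating": [7.7, 7.6, 7.1], "metascore": [52.0, 56.0, 55.0]},
    index=pd.Index(["300", "Watchmen", "Man of Steel"], name="Title"),
)

# Column-wise maxima: the rating comes from "300",
# the metascore from "Watchmen".
col_max = toy_df.max()
print(col_max)

# idxmax() gives the label of the row with the highest metascore,
# and .loc then returns that one movie's values together.
best_title = toy_df["metascore"].idxmax()
best_row = toy_df.loc[best_title]
print(best_title)
print(best_row)
```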

Often times, we really need to sort the values to answer these questions.

## Sorting Values while Grouping

Dataframes also make sorting easy. Suppose we wanted to quickly answer the question of which of Zack Snyder's movies had the highest metascore. We can first get the group of all of Zack Snyder's movies (like above), and then sort by metascore.

In [129]:
movies_grouped.get_group('Zack Snyder').sort_values(by=['metascore'], ascending=False)

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Watchmen,148,"Action,Drama,Mystery","In 1985 where former superheroes exist, the mu...",Zack Snyder,"Jackie Earle Haley, Patrick Wilson, Carla Gugi...",2009,162,7.6,410249,107.5,56.0
Man of Steel,295,"Action,Adventure,Fantasy","Clark Kent, one of the last of an extinguished...",Zack Snyder,"Henry Cavill, Amy Adams, Michael Shannon, Dian...",2013,143,7.1,577010,291.02,55.0
300,114,"Action,Fantasy,War",King Leonidas of Sparta and a force of 300 men...,Zack Snyder,"Gerard Butler, Lena Headey, David Wenham, Domi...",2006,117,7.7,637104,210.59,52.0
Batman v Superman: Dawn of Justice,61,"Action,Adventure,Sci-Fi",Fearing that the actions of Superman are left ...,Zack Snyder,"Ben Affleck, Henry Cavill, Amy Adams, Jesse Ei...",2016,151,6.7,472307,330.25,44.0
Sucker Punch,286,"Action,Fantasy",A young girl is institutionalized by her abusi...,Zack Snyder,"Emily Browning, Vanessa Hudgens, Abbie Cornish...",2011,110,6.1,204874,36.38,33.0


Above, we sorted the rows with `.sort_values()`, using `by=['metascore']` to choose the column to sort on and `ascending=False` to request descending order (ascending order is the default).
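
To make the effect of `ascending` concrete, here is a small sketch on a toy dataframe (the data and variable names are made up for illustration). It sorts the same column both ways and shows that `.sort_values()` returns a new dataframe rather than changing the original.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({"metascore": [52.0, 33.0, 56.0]}, index=["A", "B", "C"])

# Default behaviour: smallest value first.
ascending_order = toy.sort_values(by=["metascore"])

# ascending=False: largest value first.
descending_order = toy.sort_values(by=["metascore"], ascending=False)

print(list(ascending_order.index))
print(list(descending_order.index))

# The original dataframe keeps its row order.
print(list(toy.index))
```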

We can chain several of these operations to get a better sense of our data. Suppose we want to know which director has the highest average metascore. We can call `.mean()` on the groups to get the mean of every numeric column by group, and then sort by metascore. Again, note the use of `numeric_only=True`.

In [130]:
movies_grouped.mean(numeric_only=True).sort_values(by=['metascore'], ascending=False)

director,rank,year,runtime,rating,votes,revenue_millions,metascore
Barry Jenkins,42.0,2016.0,111.0,7.5,135095.0,27.85,99.0
Kenneth Lonergan,22.0,2016.0,137.0,7.9,134213.0,47.70,96.0
Todd Haynes,502.0,2015.0,118.0,7.2,77995.0,0.25,95.0
Kathryn Bigelow,540.0,2010.0,144.0,7.5,289342.0,55.71,94.5
Michael Goi,634.0,2011.0,85.0,4.9,6683.0,,94.0
...,...,...,...,...,...,...,...
S.S. Rajamouli,27.0,2015.0,159.0,8.3,76193.0,6.50,
Saul Dibb,821.0,2014.0,107.0,6.9,13711.0,,
Shawn Burkett,43.0,2016.0,73.0,2.7,496.0,,
Todor Chapkanov,124.0,2016.0,86.0,7.4,10428.0,,
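
The directors at the bottom of the output above have a missing average metascore. The sketch below, on hypothetical toy data, illustrates the same group-then-sort pattern and shows that `.sort_values()` places missing values last by default, even when sorting in descending order.

```python
import pandas as pd

# Hypothetical toy data; "Cy" has no metascore recorded.
toy = pd.DataFrame(
    {
        "director": ["Ann", "Ann", "Bob", "Cy"],
        "metascore": [90.0, 80.0, 70.0, None],
        "title": ["w", "x", "y", "z"],
    }
)

# Average the numeric columns within each director's group;
# numeric_only=True leaves out the text column "title".
mean_by_director = toy.groupby("director").mean(numeric_only=True)

# Sort from highest to lowest average; missing averages go to the end.
ranked = mean_by_director.sort_values(by=["metascore"], ascending=False)
print(ranked)
```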


It turns out Barry Jenkins has the highest average metascore. Let's see what he made.

In [131]:
movies_grouped.get_group('Barry Jenkins')

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Moonlight,42,Drama,"A chronicle of the childhood, adolescence and ...",Barry Jenkins,"Mahershala Ali, Shariff Earp, Duan Sanderson, ...",2016,111,7.5,135095,27.85,99.0


Now, let's see which director has made the most movies.

In [132]:
movies_grouped.count().sort_values(by=['year'], ascending=False)

director,rank,genre,description,actors,year,runtime,rating,votes,revenue_millions,metascore
Ridley Scott,8,8,8,8,8,8,8,8,8,8
M. Night Shyamalan,6,6,6,6,6,6,6,6,5,6
David Yates,6,6,6,6,6,6,6,6,6,6
Michael Bay,6,6,6,6,6,6,6,6,6,6
Paul W.S. Anderson,6,6,6,6,6,6,6,6,6,6
...,...,...,...,...,...,...,...,...,...,...
Ilya Naishuller,1,1,1,1,1,1,1,1,1,1
Ido Fluk,1,1,1,1,1,1,1,1,0,1
Hugo Gélin,1,1,1,1,1,1,1,1,0,0
Hope Dickson Leach,1,1,1,1,1,1,1,1,0,1


In this dataset, Ridley Scott has made the most movies. Notice that we sorted by `year`: `.count()` counts only the non-missing values in each column, so we needed a column with no missing values.
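
As an alternative sketch (not the approach used above), `.size()` counts the rows in each group regardless of missing values, so there is no need to pick a complete column. The toy data and names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with some missing revenue values.
toy = pd.DataFrame(
    {
        "director": ["Ann", "Ann", "Bob"],
        "year": [2015, 2016, 2016],
        "revenue_millions": [10.0, None, None],
    }
)
grouped = toy.groupby("director")

# .count() reports non-missing values per column, so columns can disagree.
counts = grouped.count()
print(counts)

# .size() reports the number of rows per group as a single Series.
sizes = grouped.size().sort_values(ascending=False)
print(sizes)
```

In this toy example, `counts` shows only one revenue value for Ann and none for Bob, while `sizes` reports two rows for Ann and one for Bob.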

### Problem #4 - 1 point

What is the most ordered item in the dataset? You likely will want to use both `.groupby()` and `.sum()` here. The grade cell will not tell you whether or not you are right. Your answer should look like
``` python
most_ordered = "Item Name Typed Exactly"
```
with `Item Name Typed Exactly` replaced by the exact item name in quotes (and not indented, obviously). For example, if the most ordered item were `Chips and Fresh Tomato Salsa`, then your answer would look like `most_ordered = "Chips and Fresh Tomato Salsa"`.

In [135]:
import pandas as pd
file_path = 'chipotle.tsv'
chipo_df = pd.read_csv(file_path, delimiter='\t')
items_grouped = chipo_df.groupby('item_name')['quantity'].sum()
most_ordered = items_grouped.idxmax()
most_ordered

'Chicken Bowl'
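
A brief sketch of why `.sum()` is suggested here rather than counting rows: a single row can carry a quantity greater than one. The toy orders below are hypothetical and only mimic the `item_name` and `quantity` columns.

```python
import pandas as pd

# Hypothetical toy orders: "Soda" appears in more rows,
# but "Burrito" has the larger total quantity.
toy = pd.DataFrame(
    {
        "item_name": ["Soda", "Soda", "Soda", "Burrito"],
        "quantity": [1, 1, 1, 5],
    }
)

# Counting rows per item ignores the quantity column.
rows_per_item = toy.groupby("item_name").size()
most_rows = rows_per_item.idxmax()

# Summing quantity per item gives the number of units ordered.
units_per_item = toy.groupby("item_name")["quantity"].sum()
most_units = units_per_item.idxmax()

print(most_rows, most_units)
```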

In [136]:
# THIS IS A GRADING CELL. DO NOT EDIT.
# This cell will not tell you whether or not your answer is right.
# If you have not assigned something to the variable most_ordered or you have not assigned a string,
# you will get an error, but if you just get the question wrong,
# you will not necessarily get an error.
from nose.tools import assert_equal, assert_in
assert_equal(type(most_ordered), str)
print(most_ordered)

Chicken Bowl


### Problem #5 - 1 point

What is the average revenue amount per order? You should assign your answer to the variable `average_revenue`. You can round to the second decimal place (i.e., `10.11` is okay, as is `10.11432`, but not `10.1`). The grade cell will not tell you whether or not you are right. Your answer should look something like the below:
``` python
average_revenue = x
```
with x replaced with your answer as a number (and not indented, obviously).

In [137]:
import pandas as pd
file_path = 'chipotle.tsv'
chipo_df = pd.read_csv(file_path, delimiter='\t')
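# Strip the literal "$" (regex=False so "$" is not treated as a regex anchor), then convert to float.
chipo_df['item_price'] = chipo_df['item_price'].str.replace('$', '', regex=False).astype(float)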
order_revenue = chipo_df.groupby('order_id')['item_price'].sum()
average_revenue = order_revenue.mean()
average_revenue = round(average_revenue, 2)
print(average_revenue)

18.81
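
The steps used above can be sketched on a toy table (hypothetical data that only mimics the `order_id` and `item_price` columns). Because `$` is also a regular-expression metacharacter and the default of the `regex` parameter has differed between pandas versions, the sketch passes `regex=False` to make the literal replacement explicit.

```python
import pandas as pd

# Hypothetical toy orders with prices stored as text.
toy = pd.DataFrame(
    {
        "order_id": [1, 1, 2],
        "item_price": ["$2.00", "$3.00", "$10.00"],
    }
)

# Remove the literal dollar sign, then convert the text to numbers.
toy["item_price"] = toy["item_price"].str.replace("$", "", regex=False).astype(float)

# Total revenue per order, then the average across orders.
order_totals = toy.groupby("order_id")["item_price"].sum()
toy_average = order_totals.mean()

print(order_totals.tolist(), toy_average)
```

Note that averaging the per-order totals (7.5 here) is different from averaging the individual item prices (5.0 here).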


In [138]:
# THIS IS A GRADING CELL. DO NOT EDIT.
# This cell will not tell you whether or not your answer is right.
# If you have not assigned something to the variable average_revenue or you have not assigned a number,
# you will get an error, but if you just get the question wrong,
# you will not necessarily get an error.
from nose.tools import assert_equal, assert_true
import numbers
assert_true(isinstance(average_revenue, numbers.Number))
print(average_revenue)

18.81
