#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Intermediate Pandas

Pandas is a powerful toolkit for data analysis and manipulation. In this colab we'll explore Pandas features such as filtering, grouping, merging, and more.

## Overview

### Learning Objectives

 * Learn and practice filtering of data using Pandas
 * Use column arithmetic
 * Group values in a `DataFrame`
 * Merge multiple `DataFrame`s


### Prerequisites

* Introduction to Pandas

### Estimated Duration

60 minutes

All these concepts are explained in more detail in [Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) of the Python Data Science Handbook.

We're going to be using a dataset about movies to try out processing some data with Pandas.

We start with some standard imports:

In [0]:
import pandas as pd
import numpy as np

[Download movies_metadata.csv](https://storage.cloud.google.com/amli/public/movies_metadata.csv?folder=true&organizationId=433637338589) and save a copy to your computer. Then upload the file into your colab runtime.

__Note__: The file is about 30 MB, so expect the download and upload to take a few minutes, depending on your internet connection speed.

In [0]:
# Upload the file you just downloaded from your computer to the colab runtime

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# List to verify that the file is uploaded to the colab runtime
!ls -l *

Then we load the data and check out the first few rows:

In [0]:
df = pd.read_csv('./movies_metadata.csv').dropna(axis=1, how='all')

df.head()


## Exploring the data

This dataset was obtained from [Kaggle](https://www.kaggle.com/rounakbanik/the-movies-dataset/home), where it was collected through the TMDB API.

The movies in this dataset correspond to the movies listed in the [MovieLens Latest Full Dataset](https://grouplens.org/datasets/movielens/latest/).

Let's see what data we have:

In [0]:
df.shape

Twenty-three columns of data for over 45,000 movies is going to be a lot to look at, but let's start by looking at what the columns represent:

In [0]:
df.columns

Here's an explanation of each column:
- __belongs_to_collection__: A stringified dictionary that identifies the collection that a movie belongs to (if any).
- __budget__: The budget of the movie in dollars.
- __genres__: A stringified list of dictionaries that list out all the genres associated with the movie.
- __homepage__: The Official Homepage of the movie.
- __id__: An arbitrary ID for the movie.
- __imdb_id__: The IMDB ID of the movie.
- __original_language__: The language in which the movie was filmed.
- __original_title__: The title of the movie in its original language.
- __overview__: A blurb of the movie.
- __popularity__: The Popularity Score assigned by TMDB.
- __poster_path__: The URL of the poster image (relative to http://image.tmdb.org/t/p/w185/).
- __production_companies__: A stringified list of production companies involved with the making of the movie.
- __production_countries__: A stringified list of countries where the movie was filmed or produced.
- __release_date__: Theatrical release date of the movie.
- __revenue__: World-wide revenue of the movie in dollars.
- __runtime__: Duration of the movie in minutes.
- __spoken_languages__: A stringified list of spoken languages in the film.
- __status__: Released, To Be Released, Announced, etc.
- __tagline__: The tagline of the movie.
- __title__: The official title of the movie.
- __video__: Indicates if there is a video present of the movie with TMDB.
- __vote_average__: The average rating of the movie on TMDB.
- __vote_count__: The number of votes by users, as counted by TMDB.

## Filtering the rows 

We often need to look at only a subset of the data that we are provided with. This is called filtering and Pandas provides us with many ways to filter our data.
We can filter a `DataFrame` by using Python's indexing notation `[]` and putting a boolean test inside the brackets.

For example, to consider films that earned less money than they cost to make, we could create a variable called `money_loser_df` that contains all columns for the movies whose revenue was less than their budget.

In [0]:
money_loser_df = df[df.revenue < df.budget]

print(money_loser_df.shape)
money_loser_df.head()


That's more than 5000 movies that lost money! Clearly a risky business.

Behind the scenes, Pandas has evaluated `df.revenue < df.budget` row by row and turned it into a `Series` of `True` and `False` values. Pandas then uses that `Series` to decide which rows to include in the output and which rows to exclude:

In [0]:
df.revenue < df.budget
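To see the mask at work on something small enough to read in full, here is a minimal sketch. The `toy_df` frame and its values are made up for illustration and are not part of the movies data:

```python
import pandas as pd

# Toy data standing in for the movies DataFrame (values are made up)
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [100, 200, 300, 400],
    'revenue': [150, 50, 300, 10],
})

# The comparison is evaluated row by row and yields a boolean Series
lost_money = toy_df.revenue < toy_df.budget
print(lost_money.tolist())

# True counts as 1, so sum() counts the rows that pass the test
n_losers = int(lost_money.sum())
print(n_losers)

# Indexing with the mask keeps only the rows where it is True
losers = toy_df[lost_money]
print(losers)
```

Storing the mask in a variable first is a handy habit: you can count or inspect the matches before you actually filter.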

We will usually use boolean operations over a column to filter our data but could similarly filter with a list that we had constructed some other way. For example, to select every other movie:

In [0]:
# Build one boolean per row so the list length always matches the DataFrame
trues_and_falses = [i % 2 == 0 for i in range(len(money_loser_df))]
money_loser_df[trues_and_falses]

One quirk of filtering with booleans is that when our test has multiple clauses (like A and B), we can't use the standard Python keywords `and`, `or` and `not`. Instead we must use the element-wise operators `&`, `|` and `~` respectively, and put parentheses around each clause. So, for example, to select all movies that lost money and cost more than 1 million to make, we would use:

In [0]:
expensive_failures = df[(df.revenue < df.budget) & (df.budget > 1000000)]

expensive_failures.head()
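Here is a small sketch of all three operators side by side. The `toy_df` frame and its values are made up for illustration. The parentheses matter because `&` and `|` bind more tightly than comparisons like `<`; naming each clause first, as below, is one way to sidestep that:

```python
import pandas as pd

# Toy data (made-up values) standing in for the movies DataFrame
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [500000, 2000000, 3000000, 4000000],
    'revenue': [100000, 1000000, 9000000, 0],
})

lost_money = toy_df.revenue < toy_df.budget
expensive = toy_df.budget > 1000000

both = toy_df[lost_money & expensive]     # "and": lost money AND was expensive
either = toy_df[lost_money | expensive]   # "or": lost money OR was expensive
profitable = toy_df[~lost_money]          # "not": did NOT lose money

print(both.title.tolist())
print(either.title.tolist())
print(profitable.title.tolist())
```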

## Filtering columns


Often, a dataset will also contain a lot of columns that we don't care about. We can filter columns by using the double-bracket notation and specifying the string names of the columns:

In [0]:
expensive_failures[['title', 'budget', 'revenue']]

*TIP: To remember this, generally remember that single brackets filter the rows while double brackets filter the columns.*
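To make the difference concrete, here is a small sketch on a made-up `DataFrame` (`toy_df` and its values are illustrative only). A single pair of brackets around one column name gives back a `Series`, while double brackets give back a `DataFrame` containing just the listed columns:

```python
import pandas as pd

# Toy data (made-up values)
toy_df = pd.DataFrame({
    'title': ['A', 'B'],
    'budget': [1, 2],
    'revenue': [3, 4],
})

one_column = toy_df['title']              # a single column name -> Series
sub_frame = toy_df[['title', 'budget']]   # a list of column names -> DataFrame

print(type(one_column).__name__)
print(type(sub_frame).__name__)
print(list(sub_frame.columns))
```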

We can create a `Series` object from a column in a `DataFrame` by referring to its `values`. When doing this, it's helpful to specify an index by which we want to access the values in the `Series`. For example, we can create a `Series` called `vote_lookup` such that we are able to use a call to `vote_lookup['Dead Presidents']` to find the vote average of that movie.

In [0]:
vote_lookup = pd.Series(money_loser_df['vote_average'].values, index=money_loser_df['title'])

vote_lookup['Dead Presidents']

There are other types of filters that we can use. For example, all string predicates are accessible via a column attribute called `str`. We can then use the `startswith` predicate to find all movies that start with a particular string or letter. 

`sort_index` and double-bracket notation (`[[]]`) allow us to find the first movie that starts with a 'P' or the last one that starts with an 'R':

In [0]:
# Use iloc for positional access: newer pandas versions deprecate
# position-based lookups with plain brackets on a string-indexed Series
print('First: ', vote_lookup[vote_lookup.index.str.startswith('P')].sort_index().iloc[[0]])
print('\nLast: ', vote_lookup[vote_lookup.index.str.startswith('R')].sort_index().iloc[[-1]])

Note that we could have used `iloc` with a single pair of brackets instead, but that only gives us the value, not the index with the title:

In [0]:
vote_lookup[vote_lookup.index.str.startswith('P')].sort_index().iloc[0]

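We can even slice a sorted index using strings to get all the movies whose titles fall alphabetically between P and R. Note that this also picks up titles starting with Q; to get only titles that start with P or R, we could combine two `startswith` tests with the `|` operator instead.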

In [0]:
vote_lookup_ps_and_rs = vote_lookup.sort_index()["P2":"Ryna"]

vote_lookup_ps_and_rs
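A tiny sketch may help show what a string slice actually selects. The titles and scores in `toy_lookup` below are made up for illustration. Label slices on a sorted index include both endpoints and everything that sorts between them:

```python
import pandas as pd

# Toy lookup Series (made-up scores), sorted by title so label slicing works
toy_lookup = pd.Series(
    [6.1, 7.2, 5.5, 8.0, 6.6],
    index=['Quiz Show', 'Psycho', 'Rocky', 'Se7en', 'Alien'],
).sort_index()

# Everything that sorts between 'P' and 'Rz', including the Q title
p_through_r = toy_lookup['P':'Rz']
print(p_through_r.index.tolist())

# Only the titles that start with P or R
is_p = toy_lookup.index.str.startswith('P')
is_r = toy_lookup.index.str.startswith('R')
starts_p_or_r = toy_lookup[is_p | is_r]
print(starts_p_or_r.index.tolist())
```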

## Column Arithmetic

As we saw in the previous colab, we can do arithmetic on columns as if they were numbers. This applies to using multiple columns too:

In [0]:
money_loser_df['budget'] - money_loser_df['revenue']

We can then assign this value back to the `DataFrame` to create a new column:

In [0]:
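# money_loser_df was created by filtering df, so we use assign() to build a new
# DataFrame with the extra column and avoid a SettingWithCopyWarning
money_loser_df = money_loser_df.assign(loss=money_loser_df['budget'] - money_loser_df['revenue'])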

money_loser_df.head()
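When the `DataFrame` we want to extend was itself produced by filtering another one, Pandas may warn that we are modifying a copy. One common way around that, sketched here on made-up data (`toy_df` and `toy_losers` are illustrative names), is to take an explicit `.copy()` before adding the column:

```python
import pandas as pd

# Toy data (made-up values)
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'budget': [100, 200, 300],
    'revenue': [150, 50, 100],
})

# Take an explicit copy of the filtered rows so we own the data we modify
toy_losers = toy_df[toy_df.revenue < toy_df.budget].copy()

# Column arithmetic is applied row by row
toy_losers['loss'] = toy_losers['budget'] - toy_losers['revenue']
print(toy_losers)
```

The original `toy_df` is left untouched; only the copy gains the `loss` column.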

## Grouping

It is often useful to group data by a categorical value. For example, we might choose to group movies by language, country of origin, movie studio, etc.
Let's try it out with language: 

In [0]:
df.groupby('original_language')

Note that the result is not a `DataFrame` but a `DataFrameGroupBy` object. This object has one row for each language represented in `df` but isn't a `DataFrame` because Pandas doesn't know what to do with the other columns. We need to tell Pandas how we want the entries in the other columns to be combined. Even in the case where all entries are the same (like status in the illustration below) or there's a single entry being "grouped" (like nl below), Pandas awaits our instructions:

![group by figure](https://storage.googleapis.com/amli/public/group_by.png)

In the simple case where we want to combine all values in the same way, we can call one of the combining functions directly on the `DataFrameGroupBy` object. The most common combining functions are `sum`, `mean` and `count`:

In [0]:
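# numeric_only=True averages just the numeric columns; recent pandas versions
# raise an error if asked to average text columns
original_language_averages = df.groupby('original_language').mean(numeric_only=True)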

original_language_averages.head(20)

There are a few things to notice here:
- non-numeric columns (like status or overview) were dropped from the output
- the column that we grouped by (__original_language__) is no longer a column of our output `DataFrame` but is its index instead
- some of these values are nonsensical: what does the average of an id mean?

We can now look up information for a particular language:

In [0]:
original_language_averages.loc["fr"]
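If you would rather keep the grouping column as a regular column instead of the index, `groupby` accepts `as_index=False`. Here is a minimal sketch on made-up data (`toy_df` and its values are illustrative only):

```python
import pandas as pd

# Toy data (made-up values)
toy_df = pd.DataFrame({
    'original_language': ['en', 'en', 'fr', 'fr', 'de'],
    'runtime': [100, 120, 90, 110, 95],
})

# Default: the grouping column becomes the index of the result
by_index = toy_df.groupby('original_language').mean()
print(by_index.loc['fr', 'runtime'])

# as_index=False: the grouping column stays a regular column
as_column = toy_df.groupby('original_language', as_index=False).mean()
print(as_column)
```

The first form is convenient for lookups with `loc`; the second is often easier to merge with other `DataFrame`s later.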

For more fine-grained analysis, you can specify individual combining functions using the `agg` method:

In [0]:
df.groupby('original_language').agg({'budget': 'sum', 'revenue': 'mean'}).head()
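`agg` also supports a keyword form, sometimes called named aggregation, that lets us choose the names of the output columns. The sketch below uses made-up data, and the output column names (`total_budget`, `mean_revenue`, `n_movies`) are our own choices for illustration:

```python
import pandas as pd

# Toy data (made-up values)
toy_df = pd.DataFrame({
    'original_language': ['en', 'en', 'fr'],
    'budget': [10, 30, 5],
    'revenue': [100, 300, 50],
})

# Each keyword is an output column: (input column, combining function)
summary = toy_df.groupby('original_language').agg(
    total_budget=('budget', 'sum'),
    mean_revenue=('revenue', 'mean'),
    n_movies=('budget', 'count'),
)
print(summary)
```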

## Merging

Frequently, data comes from different sources and has to be merged into a single data frame. For example, let's say that I have some notes about some of these movies that I want to merge:

In [0]:
# Take a dictionary of my data
my_notes_dict = {
    "Cutthroat Island": "Has one of my favorite stunts",
    "The Neverending Story III: Escape from Fantasia": "Too many sequels here",
    "Bio-Dome": "First Pauly Shore movie I ever saw",
    "The Empire Strikes Back": "My favorite in the SW series",
    "Mighty Aphrodite": "Features Helena Bonham Carter",
}

# Turn it into a DataFrame (print the intermediate value if you want to see the result)
my_notes = pd.DataFrame(pd.Series(my_notes_dict), columns=['my_notes'])

# Make sure the titles are in a column
my_notes['title'] = my_notes.index

# Then merge the data (just looking at the three columns I care about)
pd.merge(my_notes, money_loser_df)[["title", "my_notes", "loss"]]
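By default, `pd.merge` joins on the columns that both `DataFrame`s share and keeps only the rows that match on both sides. The sketch below, on made-up data (`notes` and `movies` are illustrative names), spells the join column out with `on` and shows how `how='left'` keeps unmatched rows from the first `DataFrame`:

```python
import pandas as pd

# Toy data (made-up values); 'Z' has no match in movies
notes = pd.DataFrame({'title': ['A', 'Z'], 'my_notes': ['good', 'unknown film']})
movies = pd.DataFrame({'title': ['A', 'B'], 'loss': [10, 20]})

# Inner join (the default): only titles present in both DataFrames
inner = pd.merge(notes, movies, on='title')
print(inner)

# Left join: keep every row of notes, filling missing values with NaN
left = pd.merge(notes, movies, on='title', how='left')
print(left)
```

Naming the join column explicitly is a little more typing, but it guards against accidentally joining on some other column that happens to share a name.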

# Exercises

Note: Use the initial DataFrame `df` for exercises, not `money_loser_df`

We'll continue using the movies data for these exercises:

## Exercise 1

I'd like to have a way to look up the budget of a particular movie.

Create a `Series` object called `budget_lookup` such that you are able to use a call to `budget_lookup['Dead Presidents']` to find the budget of that movie.

### Student Solution

In [0]:
budget_lookup = # Your code goes here
budget_lookup['Dead Presidents']

### Answer Key

**Solution**

In [0]:
budget_lookup = pd.Series(df["budget"].values, index = df["title"])
budget_lookup['Dead Presidents']

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 2

Create a `Series` that contains budget information for all the movies that start with an 'A' or a 'B'. 

HINT: You may need to check for NaN indices.

### Student Solution

In [0]:
budget_lookup_as_and_bs = # Your code goes here
budget_lookup_as_and_bs.shape

### Answer Key

**Solution**

In [0]:
# Trick: title has 3 NaN values, which results in NaN indices in budget_lookup;
# filtering on them raises "ValueError: cannot index with vector containing NA / NaN values"

# Need to drop NaN indices
budget_lookup = budget_lookup[budget_lookup.index.notnull()]
print(budget_lookup.shape)

# Step 1: Get the first movie that starts with an 'A'
budget_lookup_as = budget_lookup[budget_lookup.index.str.startswith('A')].sort_index().index[0]
print(budget_lookup_as)

# Step 2: Get the last movie that starts with a 'B': 
budget_lookup_bs = budget_lookup[budget_lookup.index.str.startswith('B')].sort_index().index[-1]
print(budget_lookup_bs)

# Step 3: Get all the movies that start with A or B
budget_lookup_as_and_bs = budget_lookup.sort_index()[budget_lookup_as:budget_lookup_bs]
budget_lookup_as_and_bs

# # one line solution - put it all together
# budget_lookup_as_and_bs = budget_lookup.sort_index()[budget_lookup[
#     budget_lookup.index.str.startswith('A')].sort_index().index[0] : 
#                                                      budget_lookup[budget_lookup.index.str.startswith('B')].sort_index().index[-1]]

# # Or: 
# budget_lookup_as_and_bs = budget_lookup[(budget_lookup.index.str.startswith('A') | budget_lookup.index.str.startswith('B'))].sort_index()

# budget_lookup_as_and_bs

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 3: Numbers as indices

Enough about movie budgets, it's time to budget my time instead. Because I schedule my day to the minute, I like to be able to look up movies by their runtime, so that when I have a spare two hours and 34 minutes, I can find all the movies that would fit precisely in that time slot (popcorn-making time is budgeted separately).

Create a `Series` called `time_scheduler` that is indexed by runtime and has the movie's title as its values. Note that you will need to use `sort_index()` in order to be able to look up movies by their duration.

While you're at it, remove any movie that is less than 10 minutes (can't get into it if it's too short) or longer than 3 hours (who's got time for that).

HINT: You'll have to use `pd.to_numeric` to force the runtimes to be numbers (instead of numbers in a string)
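As a quick illustration of the hint, here is a sketch of `pd.to_numeric` on a made-up `Series` (the values in `raw` are invented). With `errors='coerce'`, anything that can't be parsed as a number becomes `NaN` instead of raising an error:

```python
import pandas as pd

# Made-up raw values: numbers stored as strings, plus some junk
raw = pd.Series(['154', '90.0', 'unknown', None])

# Unparseable entries are turned into NaN rather than raising
runtimes = pd.to_numeric(raw, errors='coerce')
print(runtimes)
```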

### Student Solution

In [0]:
time_scheduler = # Your code goes here
time_scheduler

### Answer Key

**Solution**

In [0]:
# create time_scheduler - beware of 260 NaN runtimes and 3 NaN movie titles
time_scheduler = pd.Series(df.title.values, index=df['runtime'])

# drop NaN indices
time_scheduler = time_scheduler[time_scheduler.index.notnull()]
print(time_scheduler.shape)

# no NaN titles remain -> all 3 happened to be dropped along with the NaN indices
print(time_scheduler.isnull().sum())

# count the movies shorter than 10 minutes or longer than 3 hours
print("total number of movies that need to be dropped:",
      ((time_scheduler.index < 10) | (time_scheduler.index > 180)).sum())

# keep runtimes from 10 to 180 minutes (inclusive) and sort so we can look up by duration
time_scheduler = time_scheduler[(time_scheduler.index >= 10) & (time_scheduler.index <= 180)].sort_index()
time_scheduler.shape

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 4

Continuing with your solution from the exercise above, let's find all those two-hour-and-34-minute movies:

In [0]:
time_scheduler[154]

But what is the 154th shortest movie in this collection?

HINT: Use `iloc` to get it.

### Student Solution

In [0]:
movie_number_154 = # Your code goes here
movie_number_154

### Answer Key

**Solution**

In [0]:
movie_number_154 = time_scheduler.sort_index().iloc[153]
movie_number_154

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 5: Grouping

I'd like to find out the total budget for the movies in our data for recent years that we have in the records (we don't have a lot of budget info for really old films). Create a `DataFrame` with a row for each year from 1990 through 2017 (inclusive) and one column: the sum of all budgets in that year. Note that you'll need to process the __release_date__ column to extract the year.

### Student Solution

In [0]:
yearly_budgets = # Your code goes here

### Answer Key

**Solution**

In [0]:
# Drop NaN first since we need the year to be numerical to compare with
print(df.release_date.isnull().sum())
# .copy() so that adding the year column below doesn't trigger a SettingWithCopyWarning
df = df[df.release_date.notnull()].copy()

# Slice out year from release_date and convert to int
df['year'] = df.release_date.str.slice(stop=4).astype(int)

# Create grouped yearly_budgets
yearly_budgets = df[(df.year >= 1990) & (df.year <= 2017)][["year", "budget"]].groupby('year').agg('sum')
yearly_budgets
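An alternative to slicing the date strings is to let Pandas parse them. The sketch below uses a made-up `toy` frame, not the movies data, and assumes dates formatted as `YYYY-MM-DD`; it is one possible approach rather than the only correct one:

```python
import pandas as pd

# Toy data (made-up values), including one missing release date
toy = pd.DataFrame({
    'release_date': ['1995-10-30', '1995-12-15', '2010-07-16', '1985-07-03', None],
    'budget': [10, 20, 30, 40, 50],
})

# Parse the dates; missing or unparseable values become NaT
toy['year'] = pd.to_datetime(toy['release_date'], errors='coerce').dt.year

# Drop rows without a year, then store the year as an integer
toy = toy.dropna(subset=['year']).astype({'year': int})

# Keep 1990 through 2017 (inclusive) and total the budgets per year
toy_yearly = toy[toy.year.between(1990, 2017)].groupby('year')[['budget']].sum()
print(toy_yearly)
```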

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 6: Dealing with multiple DataFrames

Forget about budget or runtimes as criteria for selecting a movie, let's take a look at popular opinion. Our dataset has two relevant columns: `vote_average` and `vote_count`.

Let's create a variable called `df_high_rated` that only contains movies that have received more than 20 votes and whose average score is greater than 8.

### Student Solution

In [0]:
df_high_rated = # Your code goes here
df_high_rated[['title', 'vote_average', 'vote_count']]


### Answer Key

**Solution**

In [0]:
df_high_rated = df[(df.vote_count > 20) & (df.vote_average > 8)]
df_high_rated[['title', 'vote_average', 'vote_count']]

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 7: Dealing with multiple DataFrames (continued)


Here we have 178 high-quality movies, at least according to some people. But what about **my** opinion? 

Here are my favorite movies and their relative scores:

In [0]:
{
    "Star Wars": 9,
    "Paris is Burning": 8,
    "Dead Poets Society": 7,
    "The Empire Strikes Back": 9.5,
    "The Shining": 8,
    "Return of the Jedi": 8,
    "1941": 8,
    "Forrest Gump": 7.5,
}

Create a `DataFrame` called `compare_votes` that contains the title as an index and both the `vote_average` and `my_vote` as its columns. Also only keep the movies that are both my favorites and popular favorites.

HINT: You'll need to create two `DataFrame`s, one for my ratings and one that maps titles to `vote_average`.

### Student Solution

In [0]:
compare_votes = # Your code goes here

### Answer Key

**Solution**

In [0]:
# create DataFrame of popular votes
# (use .values for the index so it has no name that would clash
# with the 'title' column we add below)
popular_votes = pd.DataFrame(df_high_rated['vote_average'].values,
                             index=df_high_rated['title'].values,
                             columns=['vote_average'])

# create DataFrame of my votes
my_votes = pd.DataFrame(pd.Series({
        "Star Wars": 9,
        "Paris is Burning": 8,
        "Dead Poets Society": 7,
        "The Empire Strikes Back": 9.5,
        "The Shining": 8,
        "Return of the Jedi": 8,
        "1941": 8,
        "Forrest Gump": 7.5,
    }), columns=['my_vote'])

# add in common feature
my_votes['title'] = my_votes.index
popular_votes['title'] = popular_votes.index

# merge on the shared title column, then use the title as the index
compare_votes = pd.merge(popular_votes, my_votes, on='title').set_index('title')
compare_votes

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 8: Dealing with multiple DataFrames (continued)


There should be only 6 movies remaining.

Now add a column to `compare_votes` that measures the percentage difference between my rating and the popular rating for each movie. You'll need to take the difference between the `vote_average` and `my_vote` and divide it by `my_vote`.


### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
compare_votes['voting_diff'] = (compare_votes.vote_average - compare_votes.my_vote) / compare_votes.my_vote
compare_votes

**Validation**

In [0]:
# TODO(b/129330036)

## Exercise 9: Challenge (Ungraded)

Process the data in the __production_countries__ column to extract one production country (one is enough, you can just use the first one). Then download the total GDP dataset from [Gapminder](https://www.gapminder.org/data/) (the same source that we used for our `Linear Regression with Scikit` colab) and use it to compute, for each country, the percentage of that country's GDP that the movies produced there represent.

### Student Solution

In [0]:
# Your code goes here

### Answer Key

**Solution**

In [0]:
df['production_countries']

In [0]:
gdp = pd.read_csv("total_gdp_us_inflation_adjusted.csv")
gdp

**Validation**

In [0]:
# TODO(b/129330036)