# Pandas data manipulation

In [1]:
import pandas as pd

# Helpers for running and displaying the exercise solutions
from solutions import run_solution, show_solution

We will enhance our previous data set:

In [2]:
movie_titles = pd.read_parquet("../data/imdb_movie_titles.parquet")
movie_titles

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8,tt0000009,Miss Jerry,Miss Jerry,1894,45,Romance
498,tt0000502,Bohemios,Bohemios,1905,100,
570,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
587,tt0000591,The Prodigal Son,L'enfant prodigue,1907,90,Drama
610,tt0000615,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...,...
9185668,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,57,Documentary
9185695,tt9916680,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
9185707,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
9185718,tt9916730,6 Gunn,6 Gunn,2017,116,


## Basic manipulation

## Mathematics

## Comparison

## Filtering

Indexing (`[]`) on pandas Series and DataFrames also accepts boolean arrays (masks). Such a mask is typically obtained by applying a boolean operation to a Series or DataFrame.

The standard **comparison operators** `<`, `<=`, `==`, `>=`, `>`, `!=` all work element-wise and return a boolean Series.

As an example, find all movies from 2022:

In [3]:
is_from_2022 = (movie_titles["startYear"] == 2022)
is_from_2022.sample(10)

6705009    False
6463593    False
8540713    False
7112115     <NA>
6166768    False
3680922    False
5454553    False
8062571    False
8892029    False
6688448     <NA>
Name: startYear, dtype: boolean
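
The `<NA>` entries in the output above presumably correspond to rows with a missing `startYear`: comparing a missing value in a nullable column yields `<NA>`, not `False`. The following is a minimal sketch on made-up toy data (not taken from the data set) of how the treatment of missing values can be made explicit with `fillna(False)`:

```python
import pandas as pd

# Toy stand-in for the startYear column (hypothetical values)
years = pd.Series([2022, pd.NA, 1999], dtype="Int64")

# The comparison propagates the missing value into the mask
mask = years == 2022

# Treat "unknown year" explicitly as "not from 2022"
explicit_mask = mask.fillna(False)
selected = years[explicit_mask]
```

Whether a missing year should count as a non-match depends on the question being asked; filling with `False` is just one reasonable choice.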

Now we can apply the boolean mask directly. (Note: there is no magic here; you could construct the mask yourself.)

In [4]:
movie_titles[is_from_2022]

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
117686,tt0120589,A Dangerous Practice,A Dangerous Practice,2022,108,Drama
193052,tt0200940,Over-sexed Rugsuckers from Mars,Over-sexed Rugsuckers from Mars,2022,87,"Comedy,Sci-Fi"
254387,tt0265705,Saurians,Saurians,2022,83,"Action,Sci-Fi"
312786,tt0326716,5-25-77,'77,2022,132,"Comedy,Drama"
384150,tt0400871,Take Out,Take Out,2022,,Comedy
...,...,...,...,...,...,...
9174766,tt9893158,Clowning,Clowning,2022,96,"Crime,Romance"
9174767,tt9893160,No Way Out,No Way Out,2022,89,"Action,Crime,Thriller"
9175140,tt9894000,Twice As Strong: Made of Fire,Twice As Strong: Made of Fire,2022,122,Drama
9179935,tt9904252,"Nice & Naughty, A Christmas Story","Nice & Naughty, A Christmas Story",2022,,"Comedy,Drama,Fantasy"
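
To illustrate that a mask is nothing special, here is a small sketch on a toy DataFrame (the `toy` frame and its values are hypothetical, not part of the data set). A hand-written list of booleans selects the same rows as the mask computed by a comparison:

```python
import pandas as pd

# Toy data standing in for movie_titles (hypothetical values)
toy = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "startYear": [2021, 2022, 2022, 1999]}
)

# A mask is just one boolean per row, in row order
manual_mask = [False, True, True, False]
computed_mask = toy["startYear"] == 2022

by_hand = toy[manual_mask]
by_comparison = toy[computed_mask]
```

The hand-written list must have exactly one entry per row; the computed mask gets that for free.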


Boolean Series can also be combined with **logical operators**. Note, however, that `and`, `or` and `not` are Python keywords that cannot be overloaded; use `&`, `|` and `~` (the overloaded bitwise operators) instead.

So perhaps we want to list all comedies longer than 3 hours?

In [5]:
is_a_long_comedy = (movie_titles["runtimeMinutes"] > 180) & (movie_titles["genres"].str.contains("Comedy"))
movie_titles[is_a_long_comedy].sample(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
134279,tt0138280,Tre per sempre,Tre per sempre,1998,200,"Comedy,Drama"
3776650,tt15121776,Playing Earthbound with one hand holding a tom...,Playing Earthbound with one hand holding a tom...,2021,361,Comedy
4146569,tt1579931,Bahumathi,Bahumathi,2007,194,Comedy
116589,tt0119424,Jours de colère,Jours de colère,1997,182,"Biography,Comedy,Drama"
5446933,tt21233332,Smosh: Under the Influence,Smosh: Under the Influence,2022,238,Comedy
5005796,tt19263772,UGK David's Sit Down Stand-Up Comedy Special,UGK David's Sit Down Stand-Up Comedy Special,2022,181,Comedy
2163105,tt12146282,Gengsi Dong,Gengsi Dong,1980,196,Comedy
150753,tt0155567,Buddimantudu,Buddimantudu,1969,187,"Action,Comedy,Drama"
25967,tt0026435,Tailspin Tommy in the Great Air Mystery,Tailspin Tommy in The Great Air Mystery,1935,236,"Action,Adventure,Comedy"
74055,tt0075669,Amar Akbar Anthony,Amar Akbar Anthony,1977,184,"Action,Comedy,Drama"
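
A short sketch of the three operators on toy data (the `toy` frame is hypothetical). Two details are worth noting: the parentheses around each comparison are needed because `&` and `|` bind more tightly than `>` or `==`, and `na=False` is one possible way to treat a missing genre as "not a comedy":

```python
import pandas as pd

# Toy data standing in for movie_titles (hypothetical values)
toy = pd.DataFrame(
    {
        "runtimeMinutes": [95, 200, 190, 240],
        "genres": ["Comedy", "Comedy,Drama", "Drama", None],
    }
)

is_long = toy["runtimeMinutes"] > 180
# na=False: rows without a genre count as non-matching
is_comedy = toy["genres"].str.contains("Comedy", na=False)

long_comedies = toy[is_long & is_comedy]       # both conditions hold
long_or_comedy = toy[is_long | is_comedy]      # at least one holds
long_non_comedies = toy[is_long & ~is_comedy]  # long, but not a comedy
```

Giving each condition its own name, as here, avoids the precedence problem altogether and keeps the final expression readable.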


We may wonder why we have two title-like columns: `originalTitle` and `primaryTitle`.

In [6]:
different_title = (movie_titles["originalTitle"] != movie_titles["primaryTitle"])
different_title.name = "Different title"   # Series can have names (Note: DataFrames can't)
different_title.value_counts()

False    533888
True      76356
Name: Different title, dtype: Int64

In [7]:
movie_titles[different_title].sample(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
366703,tt0382586,Beating a Drum,Bukchineun yeoja,1987,86.0,Thriller
32269,tt0032865,Passione,Oltre l'amore,1940,96.0,Drama
6189626,tt3259302,The Bitter Sea,Kuhai,1934,,Drama
5871816,tt2498020,Under the Umbrella,Kasa no shita,2012,108.0,"Drama,Family,Romance"
66615,tt0068004,The Shadow Whip,Ying zi shen bian,1971,78.0,"Action,Drama"
1693507,tt11301882,Choosi Choodangane,Choosi Choodangaane,2020,113.0,Romance
315987,tt0330030,Red Cherry 4,Balgan aengdu 4,1988,95.0,
2739243,tt13218916,We Wanted to Change the World,Mes gribejam izmainit pasauli,2020,66.0,"Documentary,Music"
55803,tt0056905,Les Carabiniers,Les carabiniers,1963,75.0,"Comedy,Drama,War"
1677912,tt11273780,The American Dream,Giac Mo My,2017,103.0,"Drama,Romance"


## Sorting

In [8]:
# Display 5 longest movies 
movie_titles.sort_values("runtimeMinutes", ascending=False).head()

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8428894,tt8273150,Logistics,Logistics,2012,51420,Documentary
6447115,tt3854496,Ambiancé,Ambiancé,2020,43200,Documentary
2233947,tt12277054,Carnets Filmés (Liste Complète),Carnets Filmés (Liste Complète),2019,28643,Documentary
5937444,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400,Documentary
1438453,tt10844900,Qw,Qw,2019,10062,Drama


In [9]:
# Alternative: nlargest selects the top rows in a single call
movie_titles.nlargest(5, "runtimeMinutes")

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8428894,tt8273150,Logistics,Logistics,2012,51420,Documentary
6447115,tt3854496,Ambiancé,Ambiancé,2020,43200,Documentary
2233947,tt12277054,Carnets Filmés (Liste Complète),Carnets Filmés (Liste Complète),2019,28643,Documentary
5937444,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400,Documentary
1438453,tt10844900,Qw,Qw,2019,10062,Drama
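
The two approaches above can be compared on a toy frame that contains a missing runtime (the `toy` frame and its values are hypothetical). This sketch also shows the `na_position` parameter of `sort_values`, which controls where missing values end up:

```python
import pandas as pd

# Toy data with one missing runtime (hypothetical values)
toy = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "runtimeMinutes": [90, None, 240, 120]}
)

# Missing values are placed last by default, so they do not reach the top
top2_sorted = toy.sort_values("runtimeMinutes", ascending=False).head(2)
top2_nlargest = toy.nlargest(2, "runtimeMinutes")

# Ascending sort with the missing runtime moved to the front
shortest_first = toy.sort_values("runtimeMinutes", na_position="first")
```

Here both `top2_sorted` and `top2_nlargest` contain the rows for `C` and `D`, in that order.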


**Exercise 10_oldest:** Find the 10 oldest movies that are longer than 2 hours.

In [18]:
# run_solution("10_oldest")
# show_solution("10_oldest")

**Exercise longest_title:** Show the row with the movie having the longest (primary) title.

Hint: The `idxmax()` method of a Series returns the index of the item with the maximum value. You can't (or at least shouldn't) use the maximum value itself.
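
As a sketch of how `idxmax()` behaves, here is a toy numeric Series (the labels and values are made up and unrelated to the exercise data). The method returns an index label, which can then be passed to `.loc`:

```python
import pandas as pd

# Toy Series with string labels as its index (hypothetical values)
runtimes = pd.Series([90, 240, 120], index=["tt1", "tt2", "tt3"])

label = runtimes.idxmax()      # index label of the largest value
largest = runtimes.loc[label]  # look the item up by that label
```

Looking a row up by its label is more robust than filtering on the maximum value, which could match several rows.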

In [20]:
# run_solution("longest_title")
# show_solution("longest_title")
