# Pandas data manipulation

In [1]:
import pandas as pd

# Helpers for running and displaying the exercise solutions
from solutions import run_solution, show_solution

We will extend our previous analysis to all movies listed in IMDb.

In [2]:
movie_titles = pd.read_parquet("../data/imdb_movie_titles.parquet")
movie_titles.sample(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
498906,tt2153939,tvMovie,Reclaiming the American Dream,Reclaiming the American Dream,2010.0,,Documentary
515150,tt2334759,movie,Paladin,Paladin,,,Sci-Fi
155387,tt0256487,movie,Vejen til byen,Vejen til byen,1978.0,86.0,Drama
460645,tt18453100,movie,Wandering Soul,Wandering Soul,2017.0,92.0,"Drama,Fantasy"
147586,tt0241869,movie,Rabmadár,Rabmadár,1929.0,46.0,
714546,tt8393236,movie,Mercy Kill,Mercy Kill,,,"Drama,Thriller"
71833,tt0095563,movie,Mad About You,Mad About You,1989.0,92.0,Comedy
104736,tt0155780,movie,Joker,Joker,1949.0,,
552471,tt3301004,movie,Rod Taylor: Pulling No Punches,Rod Taylor: Pulling No Punches,2016.0,80.0,"Biography,Documentary"
98854,tt0138830,movie,Speedy Gonzales - noin 7 veljeksen poika,Speedy Gonzales - noin 7 veljeksen poika,1970.0,85.0,"Comedy,Western"


Let's explore it a bit:

In [6]:
movie_titles.describe()

Unnamed: 0,year,runtimeMinutes
count,662337.0,478673.0
mean,1991.605895,86.429439
std,28.507317,116.039519
min,1894.0,1.0
25%,1974.0,66.0
50%,2002.0,87.0
75%,2014.0,99.0
max,2029.0,51420.0


In [7]:
movie_titles["titleType"].value_counts()

movie      611121
tvMovie    137403
Name: titleType, dtype: Int64

In [9]:
movie_titles["genres"].value_counts()

Drama                          138437
Documentary                    124846
Comedy                          60044
Horror                          15891
Thriller                        15498
                                ...  
Comedy,Romance,Short                1
Crime,Music,Western                 1
Comedy,Sport,Western                1
Action,Documentary,Thriller         1
Action,Crime,Short                  1
Name: genres, Length: 1458, dtype: Int64

## Basic manipulation

### Adding a column

There are two ways:

In [None]:
movie_copy = movie_titles.copy()
movie_copy[""]
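
The cell above is unfinished, so the following is only a sketch of what the two ways presumably are: assigning to a new column key, which modifies the DataFrame in place, or calling `assign`, which returns a new DataFrame. The toy `movies` frame and the `is_long` column are hypothetical stand-ins for `movie_titles`.

```python
import pandas as pd

# Hypothetical toy data standing in for movie_titles
movies = pd.DataFrame({
    "primaryTitle": ["Miss Jerry", "Bohemios"],
    "runtimeMinutes": [45, 100],
})

# Way 1: item assignment modifies the DataFrame in place,
# so work on a copy to keep the original intact
movies_copy = movies.copy()
movies_copy["is_long"] = movies_copy["runtimeMinutes"] > 90

# Way 2: assign() returns a new DataFrame and leaves the original untouched
movies_assigned = movies.assign(is_long=movies["runtimeMinutes"] > 90)
```

The `assign` form is convenient in method chains, because nothing has to be copied explicitly beforehand.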


In [3]:
movie_titles["titleType"].astype("category")

0           movie
1           movie
2           movie
3           movie
4           movie
           ...   
748519      movie
748520    tvMovie
748521      movie
748522      movie
748523      movie
Name: titleType, Length: 748524, dtype: category
Categories (2, string): [movie, tvMovie]

### Adding a new row (and overwriting)

There is no simple non-destructive way of adding a new row, apart from creating a copy and operating on it, or concatenating two existing DataFrames.
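
A minimal sketch of both options, using hypothetical toy frames rather than `movie_titles`: `pd.concat` builds a new DataFrame, while `.loc` with a row label writes into the DataFrame it is called on.

```python
import pandas as pd

# Hypothetical toy data standing in for movie_titles
movies = pd.DataFrame(
    {"primaryTitle": ["Miss Jerry", "Bohemios"], "year": [1894, 1905]}
)
new_rows = pd.DataFrame({"primaryTitle": ["Paladin"], "year": [2022]})

# Non-destructive: concat builds a new DataFrame, the inputs stay unchanged
extended = pd.concat([movies, new_rows], ignore_index=True)

# Destructive: .loc with a new index label adds a row in place,
# and with an existing label it overwrites that row
movies_copy = movies.copy()
movies_copy.loc[2] = ["Paladin", 2022]
movies_copy.loc[0] = ["Miss Jerry (restored)", 1894]
```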



## Arithmetic & string manipulation

Standard arithmetic operators work on numerical columns too, and so do mathematical functions. Note that all such operations are performed element-wise (in a vectorized fashion), without an explicit loop.

In [None]:
movie_titles.assign(
    age=2022 - movie_titles["startYear"]
).sample(20)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,age
754278,tt0778787,Mulher de Proveta,Mulher de Proveta,1984.0,90.0,Comedy,38.0
3101083,tt13884444,American Dreamer,American Dreamer,2022.0,106.0,Comedy,0.0
906787,tt0936490,Planet Terry,Planet Terry,,,"Action,Comedy,Sci-Fi",
364908,tt0380742,Tapatan ng tapang,Tapatan ng tapang,1996.0,,Action,26.0
4342284,tt1626135,Balls to the Wall,Balls to the Wall,2011.0,85.0,Comedy,11.0
4883132,tt18566678,Mojo Savage,Mojo Savage,,,"Comedy,Drama",
70261,tt0071761,A Fu zheng chuan,A Fu zheng chuan,1974.0,,Comedy,48.0
8170235,tt7710832,Geschenk uit de bodem,Geschenk uit de bodem,2017.0,88.0,Documentary,5.0
2399092,tt12585702,Zeit läuft,Zeit läuft,2019.0,,Drama,3.0
7290055,tt5754978,Archaeology of the Future,Archaeology of the Future,,,Documentary,
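
To illustrate the point about mathematical functions, here is a small sketch on a hypothetical toy column (not the real data), assuming NumPy is available alongside pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; NaN marks a missing runtime
runtime = pd.Series([45.0, 100.0, np.nan], name="runtimeMinutes")

# Operators are applied element by element, no explicit loop needed
doubled = runtime * 2

# NumPy functions are vectorized too and return a Series
log_runtime = np.log10(runtime)

# Missing values propagate through the computation
print(doubled.tolist())
```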


**Exercise hour_length**: Calculate the length of movies in hours.

In [None]:
# run_solution("hour_length")
# show_solution("hour_length")

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres,runtimeHours
0,tt0000009,movie,Miss Jerry,Miss Jerry,1894,45,Romance,0.75
1,tt0000502,movie,Bohemios,Bohemios,1905,100,,1.666667
2,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography",1.166667
3,tt0000591,movie,The Prodigal Son,L'enfant prodigue,1907,90,Drama,1.5
4,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,1907,,Drama,
...,...,...,...,...,...,...,...,...
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary,1.666667
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama,1.1
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116,,1.933333


### Basic string operations

These are typically accessed using the `.str` "accessor" of the Series like this:
    
- `series.str.lower`
- `series.str.split`
- `series.str.startswith`
- `series.str.contains`
- ...
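
A short sketch of these methods in action, on a hypothetical toy Series of titles (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical toy titles; None marks a missing value
titles = pd.Series(["Mad About You", "Joker", None, "Wandering Soul"])

lowered = titles.str.lower()                # element-wise lowercase
words = titles.str.split(" ")               # each element becomes a list
starts_with_m = titles.str.startswith("M")  # one boolean per title

# na=False treats missing values as "no match", which is handy for filtering
has_soul = titles.str.contains("soul", case=False, na=False)

print(titles[has_soul])
```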

**Exercise pink:** Find all Pink Panther movies. Note that their title does not necessarily start with "Pink"

In [None]:
# is_pink = ...
# movie_titles[is_pink]

In [None]:
# run_solution("pink")
# show_solution("pink")

In [None]:
# String arithmetic works too!
url = "https://www.imdb.com/title/" + movie_titles["tconst"]
movie_titles[["primaryTitle"]].assign(url=url).sample(10)

Let's investigate the genres a bit:

In [None]:
split_genres = movie_titles.genres.str.split(",").dropna()
split_genres.sample(10)

251738         [Adventure, Comedy, Family]
6732771                      [Documentary]
59945                      [Comedy, Drama]
113187                   [Comedy, Romance]
7462                               [Drama]
80714                              [Drama]
1242242    [Biography, Documentary, Drama]
5351350                           [Comedy]
6138042           [Drama, Horror, Mystery]
1407479                            [Sport]
Name: genres, dtype: object
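
One possible next step, sketched here on hypothetical toy data rather than the real table, is to give each list element its own row with `explode` and count the individual genres:

```python
import pandas as pd

# Hypothetical toy column in the same comma-separated format
genres = pd.Series(["Drama,Fantasy", "Comedy", None, "Comedy,Western"])

split_genres = genres.str.split(",").dropna()

# explode() gives every list element its own row (the index label is repeated)
genre_counts = split_genres.explode().value_counts()
print(genre_counts)
```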

## Comparison

## Filtering

Indexing (`[]`) on pandas Series and DataFrames also supports boolean arrays (masks). Such a mask can be obtained by applying a boolean operation to a Series.

You can also use standard **comparison operators** like `<`, `<=`, `==`, `>=`, `>`, `!=`. 

As an example, let's find all movies from 2022:

In [None]:
is_from_2022 = (movie_titles["startYear"] == 2022)
is_from_2022.sample(10)

6705009    False
6463593    False
8540713    False
7112115     <NA>
6166768    False
3680922    False
5454553    False
8062571    False
8892029    False
6688448     <NA>
Name: startYear, dtype: boolean

Now we can apply the boolean mask directly. (Note: there is no magic here. You can construct the mask yourself.)
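
To show that a mask is nothing special, here is a sketch with a hand-written one on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical toy data standing in for movie_titles
movies = pd.DataFrame({
    "primaryTitle": ["Miss Jerry", "Bohemios", "Paladin"],
    "year": [1894, 1905, 2022],
})

# A mask is just one True/False value per row; a plain list works
manual_mask = [False, False, True]
picked = movies[manual_mask]

# The comparison produces the same selection as a boolean Series
computed_mask = movies["year"] == 2022
print(picked.equals(movies[computed_mask]))
```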

In [None]:
movie_titles[is_from_2022]

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
117686,tt0120589,A Dangerous Practice,A Dangerous Practice,2022,108,Drama
193052,tt0200940,Over-sexed Rugsuckers from Mars,Over-sexed Rugsuckers from Mars,2022,87,"Comedy,Sci-Fi"
254387,tt0265705,Saurians,Saurians,2022,83,"Action,Sci-Fi"
312786,tt0326716,5-25-77,'77,2022,132,"Comedy,Drama"
384150,tt0400871,Take Out,Take Out,2022,,Comedy
...,...,...,...,...,...,...
9174766,tt9893158,Clowning,Clowning,2022,96,"Crime,Romance"
9174767,tt9893160,No Way Out,No Way Out,2022,89,"Action,Crime,Thriller"
9175140,tt9894000,Twice As Strong: Made of Fire,Twice As Strong: Made of Fire,2022,122,Drama
9179935,tt9904252,"Nice & Naughty, A Christmas Story","Nice & Naughty, A Christmas Story",2022,,"Comedy,Drama,Fantasy"


**Logical operations** can be applied to boolean Series too. Note, however, that `and`, `or`, and `not` are Python keywords that cannot be overloaded, so you should use `&`, `|`, and `~` (the overloaded bitwise operators) instead.
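
A small sketch of the three operators on hypothetical toy data (titles and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for movie_titles
movies = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "year": [1950, 1990, 2010, 2020],
    "runtimeMinutes": [80, 200, 95, 190],
})

is_long = movies["runtimeMinutes"] > 180
is_recent = movies["year"] >= 2000

long_and_recent = movies[is_long & is_recent]  # both conditions hold
long_or_recent = movies[is_long | is_recent]   # at least one holds
not_long = movies[~is_long]                    # negation

# Inline comparisons need parentheses because & binds tighter than > and >=
same = movies[(movies["runtimeMinutes"] > 180) & (movies["year"] >= 2000)]
```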

So perhaps we want to list all comedies longer than 3 hours?

In [None]:
is_a_long_comedy = (movie_titles["runtimeMinutes"] > 180) & (movie_titles["genres"].str.contains("Comedy"))
movie_titles[is_a_long_comedy].sample(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
134279,tt0138280,Tre per sempre,Tre per sempre,1998,200,"Comedy,Drama"
3776650,tt15121776,Playing Earthbound with one hand holding a tom...,Playing Earthbound with one hand holding a tom...,2021,361,Comedy
4146569,tt1579931,Bahumathi,Bahumathi,2007,194,Comedy
116589,tt0119424,Jours de colère,Jours de colère,1997,182,"Biography,Comedy,Drama"
5446933,tt21233332,Smosh: Under the Influence,Smosh: Under the Influence,2022,238,Comedy
5005796,tt19263772,UGK David's Sit Down Stand-Up Comedy Special,UGK David's Sit Down Stand-Up Comedy Special,2022,181,Comedy
2163105,tt12146282,Gengsi Dong,Gengsi Dong,1980,196,Comedy
150753,tt0155567,Buddimantudu,Buddimantudu,1969,187,"Action,Comedy,Drama"
25967,tt0026435,Tailspin Tommy in the Great Air Mystery,Tailspin Tommy in The Great Air Mystery,1935,236,"Action,Adventure,Comedy"
74055,tt0075669,Amar Akbar Anthony,Amar Akbar Anthony,1977,184,"Action,Comedy,Drama"


We may wonder why we have two title-ish columns: originalTitle and primaryTitle.

In [None]:
different_title = (movie_titles["originalTitle"] != movie_titles["primaryTitle"])
different_title.name = "Different title"   # Series can have names (Note: DataFrames can't)
different_title.value_counts()

False    533888
True      76356
Name: Different title, dtype: Int64

In [None]:
movie_titles[different_title].sample(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
366703,tt0382586,Beating a Drum,Bukchineun yeoja,1987,86.0,Thriller
32269,tt0032865,Passione,Oltre l'amore,1940,96.0,Drama
6189626,tt3259302,The Bitter Sea,Kuhai,1934,,Drama
5871816,tt2498020,Under the Umbrella,Kasa no shita,2012,108.0,"Drama,Family,Romance"
66615,tt0068004,The Shadow Whip,Ying zi shen bian,1971,78.0,"Action,Drama"
1693507,tt11301882,Choosi Choodangane,Choosi Choodangaane,2020,113.0,Romance
315987,tt0330030,Red Cherry 4,Balgan aengdu 4,1988,95.0,
2739243,tt13218916,We Wanted to Change the World,Mes gribejam izmainit pasauli,2020,66.0,"Documentary,Music"
55803,tt0056905,Les Carabiniers,Les carabiniers,1963,75.0,"Comedy,Drama,War"
1677912,tt11273780,The American Dream,Giac Mo My,2017,103.0,"Drama,Romance"


## Sorting

In [None]:
# Display 5 longest movies 
movie_titles.sort_values("runtimeMinutes", ascending=False).head()

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8428894,tt8273150,Logistics,Logistics,2012,51420,Documentary
6447115,tt3854496,Ambiancé,Ambiancé,2020,43200,Documentary
2233947,tt12277054,Carnets Filmés (Liste Complète),Carnets Filmés (Liste Complète),2019,28643,Documentary
5937444,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400,Documentary
1438453,tt10844900,Qw,Qw,2019,10062,Drama


In [None]:
# Alternative
movie_titles.nlargest(5, "runtimeMinutes")

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8428894,tt8273150,Logistics,Logistics,2012,51420,Documentary
6447115,tt3854496,Ambiancé,Ambiancé,2020,43200,Documentary
2233947,tt12277054,Carnets Filmés (Liste Complète),Carnets Filmés (Liste Complète),2019,28643,Documentary
5937444,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400,Documentary
1438453,tt10844900,Qw,Qw,2019,10062,Drama
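
Beyond the two calls above, sorting by several columns and handling missing keys may be useful. The following is a sketch on hypothetical toy data, not on `movie_titles`:

```python
import pandas as pd

# Hypothetical toy data; None marks a missing year
movies = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "year": [2011, 2012, None, 2011],
    "runtimeMinutes": [90, 60, 120, 150],
})

# Sort by year (oldest first), break ties by runtime (longest first)
by_year = movies.sort_values(["year", "runtimeMinutes"], ascending=[True, False])

# Rows with a missing sort key go last by default; na_position moves them
missing_first = movies.sort_values("year", na_position="first")

# nsmallest mirrors nlargest
shortest_two = movies.nsmallest(2, "runtimeMinutes")
```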


**Exercise 10_oldest:** Find the 10 oldest movies that are longer than 2 hours.

In [None]:
# run_solution("10_oldest")
# show_solution("10_oldest")

**Exercise longest_title:** Show the row with the movie having the longest (primary) title.

Hint: The `idxmax()` method of a Series returns the index of the item with the maximum value. You cannot (or at least should not) use the maximum value itself.
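
As a generic illustration of the hint (on hypothetical toy numbers, not on title lengths), `idxmax()` pairs naturally with `.loc`:

```python
import pandas as pd

# Hypothetical toy data with a non-default index
movies = pd.DataFrame(
    {"primaryTitle": ["A", "B", "C"], "runtimeMinutes": [300, 120, 90]},
    index=[10, 20, 30],
)

# idxmax() returns the index label of the largest value, not the value itself
longest_label = movies["runtimeMinutes"].idxmax()

# .loc looks the row up by that label
longest_row = movies.loc[longest_label]
print(longest_row)
```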

In [None]:
# run_solution("longest_title")
# show_solution("longest_title")

## Simple visual analysis

In [5]:
import plotly.express as px

In [None]:
px.histogram
