# Pandas data manipulation

In the previous notebook, we learnt the basic data structures of pandas and how to look at them. In this notebook, we will manipulate them.

In [2]:
import pandas as pd

# Support the exercises
from solutions import run_solution, show_solution

We will extend our previous analysis to all movies listed in IMDB.

In [3]:
movie_titles = pd.read_parquet("../data/imdb_movie_titles.parquet")
movie_titles.sample(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
498906,tt2153939,tvMovie,Reclaiming the American Dream,Reclaiming the American Dream,2010.0,,Documentary
515150,tt2334759,movie,Paladin,Paladin,,,Sci-Fi
155387,tt0256487,movie,Vejen til byen,Vejen til byen,1978.0,86.0,Drama
460645,tt18453100,movie,Wandering Soul,Wandering Soul,2017.0,92.0,"Drama,Fantasy"
147586,tt0241869,movie,Rabmadár,Rabmadár,1929.0,46.0,
714546,tt8393236,movie,Mercy Kill,Mercy Kill,,,"Drama,Thriller"
71833,tt0095563,movie,Mad About You,Mad About You,1989.0,92.0,Comedy
104736,tt0155780,movie,Joker,Joker,1949.0,,
552471,tt3301004,movie,Rod Taylor: Pulling No Punches,Rod Taylor: Pulling No Punches,2016.0,80.0,"Biography,Documentary"
98854,tt0138830,movie,Speedy Gonzales - noin 7 veljeksen poika,Speedy Gonzales - noin 7 veljeksen poika,1970.0,85.0,"Comedy,Western"


And we will explore it a bit:

In [4]:
movie_titles.dtypes

tconst            string
titleType         string
primaryTitle      string
originalTitle     string
year               Int64
runtimeMinutes     Int64
genres            string
dtype: object

The `titleType` column contains many repeated values:

In [5]:
movie_titles["titleType"].value_counts()

movie      611121
tvMovie    137403
Name: titleType, dtype: Int64

This looks like a good candidate to convert to categorical (using the [`astype`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method):

In [6]:
movie_titles["titleType"].astype("category")

0           movie
1           movie
2           movie
3           movie
4           movie
           ...   
748519      movie
748520    tvMovie
748521      movie
748522      movie
748523      movie
Name: titleType, Length: 748524, dtype: category
Categories (2, string): [movie, tvMovie]
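
To see why a categorical column suits repeated values, here is a minimal sketch on a made-up Series (the data is illustrative and not taken from the IMDB file). A categorical stores each distinct value once and keeps only a small integer code per row.

```python
import pandas as pd

# Toy data standing in for a column with few distinct values (illustrative assumption)
title_type = pd.Series(["movie", "movie", "tvMovie", "movie"], dtype="string")
title_type_cat = title_type.astype("category")

# The distinct values are stored once ...
categories = list(title_type_cat.cat.categories)
# ... and every row holds an integer code pointing at one of them
codes = title_type_cat.cat.codes.tolist()

print(categories)
print(codes)
```

The `.cat` accessor is only available on categorical Series; it exposes the categories and the per-row codes.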

Perhaps we could do the same with genres...

In [7]:
movie_titles["genres"].value_counts()

Drama                          138437
Documentary                    124846
Comedy                          60044
Horror                          15891
Thriller                        15498
                                ...  
Comedy,Romance,Short                1
Crime,Music,Western                 1
Comedy,Sport,Western                1
Action,Documentary,Thriller         1
Action,Crime,Short                  1
Name: genres, Length: 1458, dtype: Int64

The `tconst` column looks like a good candidate for index. We might still want to check that it is unique:

In [8]:
movie_titles.set_index("tconst", verify_integrity=True)

tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
tt0000009,movie,Miss Jerry,Miss Jerry,1894,45,Romance
tt0000502,movie,Bohemios,Bohemios,1905,100,
tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
tt0000591,movie,The Prodigal Son,L'enfant prodigue,1907,90,Drama
tt0000615,movie,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...,...
tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
tt9916730,movie,6 Gunn,6 Gunn,2017,116,
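
What `verify_integrity=True` protects against can be sketched on a toy frame with a deliberately duplicated identifier (the data below is made up for illustration):

```python
import pandas as pd

# Toy frame with a duplicated identifier (illustrative data)
toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt2"], "year": [2000, 2001, 2002]})

try:
    toy.set_index("tconst", verify_integrity=True)
    duplicates_detected = False
except ValueError as error:
    # pandas refuses to build an index with duplicate keys
    duplicates_detected = True
    print(error)

# A cheaper check that does not build a new DataFrame
is_unique = toy["tconst"].is_unique
```

Without `verify_integrity=True`, `set_index` would silently accept the duplicated labels.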


## Basic manipulation

### Destructive or non-destructive?

Whenever you perform multiple operations that take you from one DataFrame (or Series) to another, you can follow one of two distinct approaches:

- **non-destructive approach**: In each step, you take an original DataFrame and create a new one from it. You can either save the new data structure to a variable or directly perform a new operation on it (chaining several such steps into one "pipeline"). As we recommend this approach, you will see it used frequently in the rest of the workshop.

- **destructive approach**: You have one object, you write directly to it, and wherever a method accepts the `inplace` argument, you pass it `True`. This is not encouraged, especially if your object lives in many cells / functions and it is not always clear which of its potentially many states it is in.

Sometimes the latter approach is more efficient performance-wise (this depends on the concrete situation), but it places a heavier cognitive burden on you and your readers. Use it wisely and always limit the changes to one cell or one function.
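
The two approaches can be compared side by side in a small sketch. The toy frame and the particular steps (sorting, resetting the index, renaming) are chosen only for illustration:

```python
import pandas as pd

original = pd.DataFrame({"title": ["B", "A", "C"], "year": [2001, 1999, 2010]})

# Non-destructive: each step returns a new object, so the steps can be chained
result = (
    original
    .sort_values("year")
    .reset_index(drop=True)
    .rename(columns={"title": "primaryTitle"})
)

# Destructive: work on an explicit copy and keep all changes in one place
working = original.copy()
working.sort_values("year", inplace=True)
working.reset_index(drop=True, inplace=True)
working.rename(columns={"title": "primaryTitle"}, inplace=True)
```

Both variants end up with the same data, and `original` is untouched in either case because the destructive variant operates on a copy.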

### Adding (or replacing) a column


In [9]:
# Destructive way
movie_copy = movie_titles.copy()  # We want to operate on a new object.
movie_copy["titleTypeCat"] = movie_titles["titleType"].astype("category")
movie_copy

In [10]:
# Non-destructive way
movie_titles.iloc[:,:2].assign(
    titleTypeCat=movie_titles["titleType"].astype("category"),
    constant=42
).sample()


The [`assign`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html) method takes keyword arguments: each keyword becomes a column name and its value becomes the column contents. Note that we can also pass a scalar value, which is then written to every row of that column.
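
Besides Series and scalars, `assign` also accepts a callable that receives the DataFrame being built. A minimal sketch on made-up data (the column names `source` and `runtimeHours` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "runtimeMinutes": [90, 120]})

extended = toy.assign(
    source="toy",                                       # scalar: written to every row
    runtimeHours=lambda df: df["runtimeMinutes"] / 60,  # callable: receives the DataFrame
)
```

The callable form is useful in method chains, where the intermediate DataFrame has no variable name of its own.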

### Adding a new row (and overwriting)

There is no simple non-destructive way of adding a new row (apart from creating a copy and operating on it, or concatenating two existing DataFrames). Also, there is no fundamental difference between adding and overwriting rows, just like when writing to a `dict`.
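
The concatenation route could look roughly like this. The toy frame and the index label `999` are assumptions made for the sake of the example:

```python
import pandas as pd

movies = pd.DataFrame(
    {"primaryTitle": ["Miss Jerry", "Bohemios"], "year": [1894, 1905]}
)
# The new row is a one-row DataFrame with its own index label (999 is arbitrary)
new_row = pd.DataFrame(
    {"primaryTitle": ["My Holiday Video"], "year": [2022]}, index=[999]
)

# pd.concat returns a new DataFrame; `movies` itself is left unchanged
extended = pd.concat([movies, new_row])
```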


In [11]:
movies_copy = movie_titles.copy()
movies_copy.loc[999999] = {
    "titleType": "phoneVideo",
    "primaryTitle": "My Pink Fluffy Unicorn Holiday 2022",
    "year": 2022,
    "runtimeMinutes": 5,
    "genres": "#insta:heart:",
}
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66.0,Drama
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116.0,
748523,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,49.0,Documentary
999999,,phoneVideo,My Pink Fluffy Unicorn Holiday 2022,,2022,5.0,#insta:heart:


In [12]:
movies_copy = movie_titles.copy()
movies_copy.iloc[-1] = {
    "titleType": "phoneVideo",
    "primaryTitle": "My Pink Fluffy Unicorn Holiday 2022",
    "year": 2022,
    "runtimeMinutes": 5,
    "genres": "#insta:heart:",
}
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100.0,Documentary
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66.0,Drama
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116.0,
748523,,phoneVideo,My Pink Fluffy Unicorn Holiday 2022,,2022,5.0,#insta:heart:


Writing to one particular cell uses indexers in the same way as when reading it.

In [13]:
movies_copy = movie_titles.iloc[:,:3].copy()
movies_copy.loc[748523, "primaryTitle"] = "CHANGED!!!"
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy
748521,tt9916706,movie,Dankyavar Danka
748522,tt9916730,movie,6 Gunn
748523,tt9916754,movie,CHANGED!!!


Note that pandas doesn't complain when you want to write to a non-existent column or when you use a string key instead of an integer one. It happily adds a new column and changes the index type!

In [14]:
movies_copy = movie_titles.iloc[:,:3].copy()  # Only the first three columns; copy so the original stays untouched
movies_copy.loc["748523","primarytitle"] = "ADDED!!!"
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle,primarytitle
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,
748521,tt9916706,movie,Dankyavar Danka,
748522,tt9916730,movie,6 Gunn,
748523,tt9916754,movie,Chico Albuquerque - Revelações,
748523,,,,ADDED!!!
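
If such silent enlargement is not what you want, one possible safeguard is to check the labels before writing. The helper `set_existing_cell` below is hypothetical (it is not part of pandas) and is shown on a toy frame:

```python
import pandas as pd


def set_existing_cell(df, row_label, column, value):
    """Hypothetical helper: write to a cell only if both labels already exist."""
    if row_label not in df.index:
        raise KeyError(f"unknown row label: {row_label!r}")
    if column not in df.columns:
        raise KeyError(f"unknown column: {column!r}")
    df.loc[row_label, column] = value


toy = pd.DataFrame({"primaryTitle": ["Miss Jerry", "Bohemios"]})
set_existing_cell(toy, 1, "primaryTitle", "CHANGED!!!")

try:
    set_existing_cell(toy, 1, "primarytitle", "ADDED!!!")  # misspelt on purpose
    typo_caught = False
except KeyError:
    typo_caught = True
```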


In [15]:
movies_copy = movie_titles.copy()
movies_copy.loc[1:3,["primaryTitle", "originalTitle"]] = "CENSORED"
movies_copy.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
0,tt0000009,movie,Miss Jerry,Miss Jerry,1894,45.0,Romance
1,tt0000502,movie,CENSORED,CENSORED,1905,100.0,
2,tt0000574,movie,CENSORED,CENSORED,1906,70.0,"Action,Adventure,Biography"
3,tt0000591,movie,CENSORED,CENSORED,1907,90.0,Drama
4,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,1907,,Drama


## Deleting columns and rows

There are basically three ways to do this (one non-destructive and two destructive):

- the non-destructive [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method
- the destructive variant of `drop` with `inplace=True`
- the `del` statement (rarely used, not shown)

Instead of removing, you can also explicitly select the rows/columns you want.
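
As a small sketch of that equivalence on made-up data, dropping the unwanted columns and rows gives the same result as selecting the wanted ones with `.loc`:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2", "tt3"],
        "titleType": ["movie", "movie", "movie"],
        "year": [1990, 2000, 2010],
    }
)

# Removing what you do not want ...
dropped = toy.drop(columns=["tconst", "titleType"]).drop(index=[0])
# ... gives the same data as selecting what you do want
selected = toy.loc[1:, ["year"]]
```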

In [16]:
# Non-destructive drop
movie_titles.drop(columns=["tconst", "titleType"])  # Returns a new object

Unnamed: 0,primaryTitle,originalTitle,year,runtimeMinutes,genres
0,Miss Jerry,Miss Jerry,1894,45,Romance
1,Bohemios,Bohemios,1905,100,
2,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
3,The Prodigal Son,L'enfant prodigue,1907,90,Drama
4,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...
748519,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
748520,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
748521,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,6 Gunn,6 Gunn,2017,116,


In [17]:
# Destructive drop
movies_copy = movie_titles.copy()
movies_copy.drop(columns=["tconst", "titleType"], inplace=True)  # Returns None
movies_copy

Unnamed: 0,primaryTitle,originalTitle,year,runtimeMinutes,genres
0,Miss Jerry,Miss Jerry,1894,45,Romance
1,Bohemios,Bohemios,1905,100,
2,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
3,The Prodigal Son,L'enfant prodigue,1907,90,Drama
4,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...
748519,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
748520,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
748521,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,6 Gunn,6 Gunn,2017,116,


In [18]:
# Non-destructive removal of the first 700000 rows
movie_titles.drop(labels=range(0, 700000))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
700000,tt7798040,movie,April Fools,4-gatsu no baka,2017,91,"Drama,Fantasy"
700001,tt7798090,movie,Ju jue lian kao de xiao zi,Ju jue lian kao de xiao zi,1979,,Drama
700002,tt7798098,movie,Persuasion Rooms,Ikna Odalari,2013,67,Documentary
700003,tt7798104,movie,Heavy Metal: A Mining Disaster in Northern Quebec,Heavy Metal: A Mining Disaster in Northern Quebec,2004,48,Documentary
700004,tt7798110,tvMovie,Just Fix It,Just Fix It,2019,25,Comedy
...,...,...,...,...,...,...,...
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116,


## Arithmetics & string manipulation

Standard arithmetic operators work on numerical columns too. Such operations are typically performed in a vector-like fashion.

In [22]:
movie_titles.assign(
    age=2022 - movie_titles["year"]
).sample(20)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,age
754278,tt0778787,Mulher de Proveta,Mulher de Proveta,1984.0,90.0,Comedy,38.0
3101083,tt13884444,American Dreamer,American Dreamer,2022.0,106.0,Comedy,0.0
906787,tt0936490,Planet Terry,Planet Terry,,,"Action,Comedy,Sci-Fi",
364908,tt0380742,Tapatan ng tapang,Tapatan ng tapang,1996.0,,Action,26.0
4342284,tt1626135,Balls to the Wall,Balls to the Wall,2011.0,85.0,Comedy,11.0
4883132,tt18566678,Mojo Savage,Mojo Savage,,,"Comedy,Drama",
70261,tt0071761,A Fu zheng chuan,A Fu zheng chuan,1974.0,,Comedy,48.0
8170235,tt7710832,Geschenk uit de bodem,Geschenk uit de bodem,2017.0,88.0,Documentary,5.0
2399092,tt12585702,Zeit läuft,Zeit läuft,2019.0,,Drama,3.0
7290055,tt5754978,Archaeology of the Future,Archaeology of the Future,,,Documentary,
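
The element-wise behaviour, including what happens to missing values, can be seen on a three-item toy Series (made-up values, using the same nullable `Int64` dtype as the `year` column):

```python
import pandas as pd

# Nullable integer column with one missing value
year = pd.Series([1984, None, 2022], dtype="Int64")

# The subtraction is applied to every element; missing values stay missing
age = 2022 - year
print(age)
```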


**Exercise hour_length**: Calculate the length of movies in hours.

In [None]:
# run_solution("hour_length")
# show_solution("hour_length")

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres,runtimeHours
0,tt0000009,movie,Miss Jerry,Miss Jerry,1894,45,Romance,0.75
1,tt0000502,movie,Bohemios,Bohemios,1905,100,,1.666667
2,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography",1.166667
3,tt0000591,movie,The Prodigal Son,L'enfant prodigue,1907,90,Drama,1.5
4,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,1907,,Drama,
...,...,...,...,...,...,...,...,...
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary,1.666667
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama,1.1
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116,,1.933333


### Universal operation: [`apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)

If you want to apply some operation to every row of a DataFrame or every item of a Series, the `apply` method comes in handy:

In [None]:
actors = pd.DataFrame()
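
A minimal sketch of both uses of `apply`, on a made-up two-row frame rather than the IMDB data:

```python
import pandas as pd

toy = pd.DataFrame(
    {"primaryTitle": ["Joker", "Paladin"], "year": [1949, 2012]}
)

# Series.apply calls the function once per item
title_lengths = toy["primaryTitle"].apply(len)

# DataFrame.apply with axis=1 calls the function once per row (passed as a Series)
labels = toy.apply(lambda row: f"{row['primaryTitle']} ({row['year']})", axis=1)
```

Because `apply` calls a Python function for each item or row, it is usually slower than the vectorized operations shown earlier. It is best kept for cases where no vectorized alternative exists.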

### Basic string operations

These are typically accessed using the `.str` accessor of the Series, for example:

- `series.str.lower`
- `series.str.split`
- `series.str.startswith`
- `series.str.contains`
- ...

More about this here: <https://pandas.pydata.org/docs/user_guide/text.html>
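
A short sketch of these methods on a toy Series with one missing value (the titles are just sample strings):

```python
import pandas as pd

titles = pd.Series(["Mad About You", "Joker", None], dtype="string")

lowered = titles.str.lower()
starts_with_j = titles.str.startswith("J")  # missing title gives a missing result
# na=False treats missing titles as non-matching, which yields a clean boolean mask
contains_about = titles.str.contains("About", na=False)
words = titles.str.split(" ")
```

The `na=False` argument matters when the result is used for filtering, since a mask without missing values is unambiguous.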

**Exercise pink:** Find all Pink Panther movies. Note that their title does not necessarily start with "Pink"

In [None]:
# is_pink = ...
# movie_titles[is_pink]


In [None]:
# run_solution("pink")
# show_solution("pink")

In [None]:
# String arithmetic works too!
url = "https://www.imdb.com/title/" + movie_titles["tconst"]
movie_titles[["primaryTitle"]].assign(url=url).sample(10)


Let's investigate the genres a bit:

In [None]:
split_genres = movie_titles.genres.str.split(",").dropna()
split_genres.sample(10)

251738         [Adventure, Comedy, Family]
6732771                      [Documentary]
59945                      [Comedy, Drama]
113187                   [Comedy, Romance]
7462                               [Drama]
80714                              [Drama]
1242242    [Biography, Documentary, Drama]
5351350                           [Comedy]
6138042           [Drama, Horror, Mystery]
1407479                            [Sport]
Name: genres, dtype: object

In [None]:
movie_titles.dropna(subset=["genres"]).assign(
    split_genres=lambda df: df["genres"].str.split(","),   # list of genres per movie
    genre_count=lambda df: df["split_genres"].apply(len),  # number of genres per movie
).sample(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres,split_genres,genre_count
500869,tt2171735,tvMovie,Imaginary Friend,Imaginary Friend,2012,86.0,Thriller,[Thriller],1
315829,tt1172198,movie,Redemption Song,Redemption Song,2008,90.0,Documentary,[Documentary],1
311770,tt11570180,tvMovie,"Isabel Marant, naissance d'une collection","Isabel Marant, naissance d'une collection",2019,51.0,Documentary,[Documentary],1
265157,tt0986232,movie,Mr. Kuka's Advice,Lekcje pana Kuki,2008,93.0,Comedy,[Comedy],1
319760,tt11859772,movie,Holding the Devil In,Holding the Devil In,2020,55.0,Thriller,[Thriller],1
218091,tt0402594,movie,XV en Zaachila,XV en Zaachila,2003,52.0,Documentary,[Documentary],1
322123,tt11948410,movie,Forest City: A Documentary Film,Forest City: A Documentary Film,2019,,Documentary,[Documentary],1
155553,tt0256898,movie,The Garden Was Full of Moon,Lunoy byl polon sad,2000,115.0,Romance,[Romance],1
197523,tt0350088,movie,Sana totoo na,Sana totoo na,2002,,Drama,[Drama],1
161683,tt0267869,movie,Rajendrudu Gajendrudu,Rajendrudu Gajendrudu,1993,152.0,"Comedy,Drama","[Comedy, Drama]",2


### Basic statistics

In [None]:
movie_titles["year"].describe()

count    662337.000000
mean       1991.605895
std          28.507317
min        1894.000000
25%        1974.000000
50%        2002.000000
75%        2014.000000
max        2029.000000
Name: year, dtype: float64

How many years of runtime do all the movies add up to?

In [13]:
movie_titles["runtimeMinutes"].sum() / 365.25 / 1440

78.65890752148452

To whet your visualization appetite and to check that your setup works, let's visualize how many movies were released at different times during the 128-year history of film. This is best represented by a [histogram](https://en.wikipedia.org/wiki/Histogram), a bar plot that shows how many observations fall within certain ranges of some axis. In this case, it counts how many movies have a certain value of the "year" column.

We will cover visualization in depth in the next notebook, so now without much explanation we will create our first plot using the [plotly](https://plotly.com/python/) library. This example also verifies that the installation worked for you.

In [22]:
import plotly.express as px
px.histogram(movie_titles, "year")

In [21]:
px.histogram(movie_titles[movie_titles["runtimeMinutes"] < 540], "runtimeMinutes", nbins=50)

In [15]:
movie_titles[["year", "runtimeMinutes"]].median()

year              2002.0
runtimeMinutes      87.0
dtype: float64

## Filtering and comparison

Indexing in pandas Series / DataFrames (`[]`) also supports boolean arrays (masks). These arrays can be obtained by applying boolean operations to the data.

You can also use standard **comparison operators** like `<`, `<=`, `==`, `>=`, `>`, `!=`. 

As an example, find all movies from 2022:

In [None]:
is_from_2022 = (movie_titles["startYear"] == 2022)
is_from_2022.sample(10)

6705009    False
6463593    False
8540713    False
7112115     <NA>
6166768    False
3680922    False
5454553    False
8062571    False
8892029    False
6688448     <NA>
Name: startYear, dtype: boolean

Now we can directly apply the boolean mask. (Note: This is no magic. You can construct the mask yourself)
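
To illustrate that the mask is nothing more than a sequence of booleans, here is a sketch on a toy frame where a computed mask and a hand-written one select the same rows:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C"], "year": [2021, 2022, 2022]})

# A mask produced by a comparison ...
computed_mask = toy["year"] == 2022
# ... and the same mask written out by hand
manual_mask = [False, True, True]

from_computed = toy[computed_mask]
from_manual = toy[manual_mask]
```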

In [None]:
movie_titles[is_from_2022]

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
117686,tt0120589,A Dangerous Practice,A Dangerous Practice,2022,108,Drama
193052,tt0200940,Over-sexed Rugsuckers from Mars,Over-sexed Rugsuckers from Mars,2022,87,"Comedy,Sci-Fi"
254387,tt0265705,Saurians,Saurians,2022,83,"Action,Sci-Fi"
312786,tt0326716,5-25-77,'77,2022,132,"Comedy,Drama"
384150,tt0400871,Take Out,Take Out,2022,,Comedy
...,...,...,...,...,...,...
9174766,tt9893158,Clowning,Clowning,2022,96,"Crime,Romance"
9174767,tt9893160,No Way Out,No Way Out,2022,89,"Action,Crime,Thriller"
9175140,tt9894000,Twice As Strong: Made of Fire,Twice As Strong: Made of Fire,2022,122,Drama
9179935,tt9904252,"Nice & Naughty, A Christmas Story","Nice & Naughty, A Christmas Story",2022,,"Comedy,Drama,Fantasy"


It is possible to apply **logical operators** to boolean Series too. But note that `and`, `or` and `not` are Python keywords that cannot be overloaded. You should use `&`, `|` and `~` instead (the overloaded bitwise operators).
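
A small sketch of the three operators on made-up data. Two details are worth noting: comparisons must be wrapped in parentheses because `&` and `|` bind more tightly than `>` or `==`, and `na=False` makes rows with missing genres count as non-matching.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "runtimeMinutes": [200, 95, 190, 240],
        "genres": ["Comedy,Drama", "Comedy", "Drama", None],
    }
)

# na=False makes rows without genres count as "not a comedy"
is_comedy = toy["genres"].str.contains("Comedy", na=False)

# Parentheses around the comparison are required when combining with & or |
long_comedies = toy[(toy["runtimeMinutes"] > 180) & is_comedy]
long_or_comedy = toy[(toy["runtimeMinutes"] > 180) | is_comedy]
not_comedy = toy[~is_comedy]
```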

So perhaps we want to list all comedies longer than 3 hours?

In [None]:
is_a_long_comedy = (movie_titles["runtimeMinutes"] > 180) & (movie_titles["genres"].str.contains("Comedy"))
movie_titles[is_a_long_comedy].sample(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
134279,tt0138280,Tre per sempre,Tre per sempre,1998,200,"Comedy,Drama"
3776650,tt15121776,Playing Earthbound with one hand holding a tom...,Playing Earthbound with one hand holding a tom...,2021,361,Comedy
4146569,tt1579931,Bahumathi,Bahumathi,2007,194,Comedy
116589,tt0119424,Jours de colère,Jours de colère,1997,182,"Biography,Comedy,Drama"
5446933,tt21233332,Smosh: Under the Influence,Smosh: Under the Influence,2022,238,Comedy
5005796,tt19263772,UGK David's Sit Down Stand-Up Comedy Special,UGK David's Sit Down Stand-Up Comedy Special,2022,181,Comedy
2163105,tt12146282,Gengsi Dong,Gengsi Dong,1980,196,Comedy
150753,tt0155567,Buddimantudu,Buddimantudu,1969,187,"Action,Comedy,Drama"
25967,tt0026435,Tailspin Tommy in the Great Air Mystery,Tailspin Tommy in The Great Air Mystery,1935,236,"Action,Adventure,Comedy"
74055,tt0075669,Amar Akbar Anthony,Amar Akbar Anthony,1977,184,"Action,Comedy,Drama"


We may wonder why we have two title-ish columns: originalTitle and primaryTitle.

In [None]:
different_title = (movie_titles["originalTitle"] != movie_titles["primaryTitle"])
different_title.name = "Different title"   # Series can have names (Note: DataFrames can't)
different_title.value_counts()

False    533888
True      76356
Name: Different title, dtype: Int64

In [None]:
# Show just the ones with differing titles
movie_titles[different_title].sample(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
366703,tt0382586,Beating a Drum,Bukchineun yeoja,1987,86.0,Thriller
32269,tt0032865,Passione,Oltre l'amore,1940,96.0,Drama
6189626,tt3259302,The Bitter Sea,Kuhai,1934,,Drama
5871816,tt2498020,Under the Umbrella,Kasa no shita,2012,108.0,"Drama,Family,Romance"
66615,tt0068004,The Shadow Whip,Ying zi shen bian,1971,78.0,"Action,Drama"
1693507,tt11301882,Choosi Choodangane,Choosi Choodangaane,2020,113.0,Romance
315987,tt0330030,Red Cherry 4,Balgan aengdu 4,1988,95.0,
2739243,tt13218916,We Wanted to Change the World,Mes gribejam izmainit pasauli,2020,66.0,"Documentary,Music"
55803,tt0056905,Les Carabiniers,Les carabiniers,1963,75.0,"Comedy,Drama,War"
1677912,tt11273780,The American Dream,Giac Mo My,2017,103.0,"Drama,Romance"


## Sorting

Previously, we were sorting by index (using the `sort_index` method). It is also possible to sort by values (or by arbitrary expressions).

In [None]:
# Display 5 longest movies 
movie_titles.sort_values("runtimeMinutes", ascending=False).head()

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8428894,tt8273150,Logistics,Logistics,2012,51420,Documentary
6447115,tt3854496,Ambiancé,Ambiancé,2020,43200,Documentary
2233947,tt12277054,Carnets Filmés (Liste Complète),Carnets Filmés (Liste Complète),2019,28643,Documentary
5937444,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400,Documentary
1438453,tt10844900,Qw,Qw,2019,10062,Drama


In [None]:
# Alternative
movie_titles.nlargest(5, "runtimeMinutes")

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
8428894,tt8273150,Logistics,Logistics,2012,51420,Documentary
6447115,tt3854496,Ambiancé,Ambiancé,2020,43200,Documentary
2233947,tt12277054,Carnets Filmés (Liste Complète),Carnets Filmés (Liste Complète),2019,28643,Documentary
5937444,tt2659636,Modern Times Forever,Modern Times Forever,2011,14400,Documentary
1438453,tt10844900,Qw,Qw,2019,10062,Drama


In [None]:
# Sort by multiple columns
movie_titles.dropna().sort_values(["year", "runtimeMinutes"])

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
0,tt0000009,movie,Miss Jerry,Miss Jerry,1894,45,Romance
506683,tt2210499,movie,Birmingham,Birmingham,1896,61,Documentary
141833,tt0230366,movie,Jeffries-Sharkey Contest,Jeffries-Sharkey Contest,1899,135,"Documentary,News,Sport"
161214,tt0266894,movie,The Republican National Convention,The Republican National Convention,1900,53,Documentary
172978,tt0291338,movie,May Day Parade,May Day Parade,1900,66,"Documentary,News"
...,...,...,...,...,...,...,...
674753,tt6857072,movie,Bonhoeffer: Holy Traitor,Bonhoeffer: Holy Traitor,2025,130,"Adventure,Biography,Drama"
721424,tt8712626,movie,b,b,2025,144,"Action,Sci-Fi"
667739,tt6587046,movie,How Do You Live?,Kimitachi wa dô ikiru ka,2026,125,"Adventure,Animation,Family"
446768,tt17371536,movie,King of Kuzaki Kingdom Life in Soviet Atlassia...,King of Kuzaki Kingdom Life in Soviet Atlassia...,2026,167,Romance
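
Two further variations, sketched on a toy frame: a separate sort direction per column, and sorting by an expression through the `key` argument of `sort_values` (here the title length, chosen only as an example):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "primaryTitle": ["b", "Joker", "Amar Akbar Anthony"],
        "year": [2025, 1949, 1977],
        "runtimeMinutes": [144, 90, 184],
    }
)

# Mixed directions: newest first, ties broken by the shortest runtime
by_year_desc = toy.sort_values(["year", "runtimeMinutes"], ascending=[False, True])

# Sorting by an expression: `key` receives the column and returns the values to sort by
by_title_length = toy.sort_values("primaryTitle", key=lambda s: s.str.len())
```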


**Exercise 10_oldest:** Find the 10 oldest movies that are longer than 2 hours.

In [None]:
# run_solution("10_oldest")
# show_solution("10_oldest")

**Exercise longest_title:** Show the row with the movie having the longest (primary) title.

Hint: The `idxmax()` method on a Series returns the index label of the item with the maximum value. You cannot (or at least should not) use the maximum value itself.
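
The difference between `idxmax()` and `max()` can be seen on a toy numeric Series (made-up runtimes, so this does not give away the exercise):

```python
import pandas as pd

runtime = pd.Series([45, 100, 70], index=["tt0000009", "tt0000502", "tt0000574"])

longest_label = runtime.idxmax()  # index label of the maximum, not the maximum itself
longest_value = runtime.max()     # the maximum value

# The label can be passed to .loc to fetch the corresponding item
same_value = runtime.loc[longest_label]
```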

In [None]:
# run_solution("longest_title")
# show_solution("longest_title")
