# Pandas data manipulation

In the previous notebook, we learnt the basic data structures of pandas and how to inspect them. In this notebook, we will manipulate them.

In [7]:
import pandas as pd

# Support the exercises
from solutions import run_solution, show_solution

We will extend our previous analysis to all movies listed in IMDb.

In [14]:
movie_titles = pd.read_parquet("../data/imdb_movie_titles.parquet")
movie_titles.sample(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
508479,tt2238032,movie,Skiptrace,Jue di tao wang,2016.0,107.0,"Action,Adventure,Comedy"
631316,tt5509142,movie,IceCream,IceCream,2016.0,122.0,"Drama,Romance"
578845,tt3961160,movie,Las aventuras de Moriana,Las aventuras de Moriana,2015.0,101.0,Comedy
466900,tt1905047,movie,The Icing,The Icing,2016.0,120.0,"Action,Crime,Thriller"
447817,tt1748227,movie,The Collection,The Collection,2012.0,82.0,"Action,Adventure,Horror"
159650,tt0263813,movie,Ommegang 1930. Eerste sortie,Ommegang 1930. Eerste sortie,1930.0,,Documentary
23037,tt0033669,movie,Golden Gate Girl,Golden Gate Girl,1941.0,110.0,Drama
311563,tt1156337,movie,Swagatam,Swagatam,2008.0,,Drama
556829,tt3416528,movie,Frisco,Frisco,,,Drama
722318,tt8747544,movie,The Other Side,The Other Side,,,Drama


And we will explore it a bit:

In [17]:
movie_titles.dtypes

tconst            string
titleType         string
primaryTitle      string
originalTitle     string
year               Int64
runtimeMinutes     Int64
genres            string
dtype: object

In [18]:
movie_titles["titleType"].value_counts()

movie      611121
tvMovie    137403
Name: titleType, dtype: Int64

This looks like a good candidate to convert to categorical (using the [`astype`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method):

In [23]:
movie_titles["titleType"].astype("category")

0           movie
1           movie
2           movie
3           movie
4           movie
           ...   
748519      movie
748520    tvMovie
748521      movie
748522      movie
748523      movie
Name: titleType, Length: 748524, dtype: category
Categories (2, string): [movie, tvMovie]

Perhaps we could do the same with genres...

In [12]:
movie_titles["genres"].value_counts()

Drama                          138437
Documentary                    124846
Comedy                          60044
Horror                          15891
Thriller                        15498
                                ...  
Comedy,Romance,Short                1
Crime,Music,Western                 1
Comedy,Sport,Western                1
Action,Documentary,Thriller         1
Action,Crime,Short                  1
Name: genres, Length: 1458, dtype: Int64

The `tconst` column looks like a good candidate for the index:

In [27]:
movie_titles.set_index("tconst", verify_integrity=True)

tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
tt0000009,movie,Miss Jerry,Miss Jerry,1894,45,Romance
tt0000502,movie,Bohemios,Bohemios,1905,100,
tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
tt0000591,movie,The Prodigal Son,L'enfant prodigue,1907,90,Drama
tt0000615,movie,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...,...
tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
tt9916730,movie,6 Gunn,6 Gunn,2017,116,
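
The `verify_integrity=True` argument makes `set_index` check that the new index contains no duplicates. A minimal sketch on made-up data with a deliberately duplicated identifier (the `tt1`/`tt2` values are illustrative):

```python
import pandas as pd

# Made-up data with a duplicated identifier.
movies = pd.DataFrame({"tconst": ["tt1", "tt2", "tt2"], "year": [1894, 1905, 1906]})

try:
    movies.set_index("tconst", verify_integrity=True)
    duplicates_found = False
except ValueError as error:
    # Raised because "tt2" occurs twice.
    duplicates_found = True
    print(error)

# Without the check, the duplicated labels are accepted silently.
indexed = movies.set_index("tconst")
print(indexed.index.is_unique)
```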


## Basic manipulation

### Destructive or non-destructive?

Whenever you perform multiple operations that take you from one DataFrame (or Series) to another one, you can follow one of two distinct approaches:

- **non-destructive approach**: In each step, you take an original DataFrame and create a new one from it. You can either save the new data structure to a variable or directly perform a new operation on it (chaining several such steps in one "pipeline"). As we recommend this approach, you will see it used frequently in the rest of the workshop.

- **destructive approach**: You have one object, you write directly to it, and wherever a method accepts the `inplace` argument, you pass it `True`. This is not encouraged, especially if your object lives in many cells / functions and it is not always clear which of its potentially many states it is in.

Sometimes the latter approach is more efficient performance-wise (this depends on the concrete situation), but it puts a heavier cognitive burden on you and your readers. Use it wisely and always limit the changes to one cell or one function.
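
The two styles can be contrasted in a small sketch. The data and the `decade` column are made up for illustration; both variants end up with the same result, but only the first one leaves its input untouched.

```python
import pandas as pd

# Made-up data, used only to contrast the two styles.
movies = pd.DataFrame({"primaryTitle": ["A", "B", "C"], "year": [1950, 2010, 1980]})

# Non-destructive: each step returns a new object and `movies` stays untouched.
by_decade = (
    movies
    .assign(decade=movies["year"] // 10 * 10)
    .drop(columns=["year"])
)

# Destructive: the copy is modified in place, step by step.
movies_copy = movies.copy()
movies_copy["decade"] = movies_copy["year"] // 10 * 10
movies_copy.drop(columns=["year"], inplace=True)

print(by_decade)
print(movies_copy)
```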

### Adding (or replacing) a column


In [25]:
# Destructive way
movie_copy = movie_titles.copy()  # We want to operate on a new object.
movie_copy["titleTypeCat"] = movie_titles["titleType"].astype("category")
movie_copy

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres,titleTypeCat
0,tt0000009,movie,Miss Jerry,Miss Jerry,1894,45,Romance,movie
1,tt0000502,movie,Bohemios,Bohemios,1905,100,,movie
2,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography",movie
3,tt0000591,movie,The Prodigal Son,L'enfant prodigue,1907,90,Drama,movie
4,tt0000615,movie,Robbery Under Arms,Robbery Under Arms,1907,,Drama,movie
...,...,...,...,...,...,...,...,...
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary,movie
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama,tvMovie
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,movie
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116,,movie


In [29]:
# Non-destructive way
movie_titles.iloc[:,:2].assign(
    titleTypeCat=movie_titles["titleType"].astype("category"),
    constant=42
)

Unnamed: 0,tconst,titleType,titleTypeCat,constant
0,tt0000009,movie,movie,42
1,tt0000502,movie,movie,42
2,tt0000574,movie,movie,42
3,tt0000591,movie,movie,42
4,tt0000615,movie,movie,42
...,...,...,...,...
748519,tt9916680,movie,movie,42
748520,tt9916692,tvMovie,tvMovie,42
748521,tt9916706,movie,movie,42
748522,tt9916730,movie,movie,42


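The [assign](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html) method allows adding (or replacing) several columns in one call and returns a new DataFrame, leaving the original one untouched.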
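
Besides ready-made values, `assign` also accepts callables, which receive the DataFrame being built. The following sketch uses made-up data and hypothetical column names (`age`, `is_old`) to show that a later argument can refer to a column defined just before it.

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame({"primaryTitle": ["A", "B"], "year": [1894, 2015]})

with_age = movies.assign(
    age=lambda df: 2022 - df["year"],    # computed from the frame being built
    is_old=lambda df: df["age"] > 50,    # may refer to the column defined just above
)
print(with_age)
```

This is handy inside longer chains, where the intermediate DataFrame has no name of its own.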

### Adding a new row (and overwriting)

There is no simple non-destructive way of adding a new row (apart from creating a copy and operating on it, or concatenating two existing DataFrames). Also, there is no fundamental difference between adding and overwriting rows, just like when writing to a `dict`.


In [40]:
movies_copy = movie_titles.copy()
movies_copy.loc[999999] = {
    "titleType": "phoneVideo",
    "primaryTitle": "My Pink Fluffy Unicorn Holiday 2022",
    "year": 2022,
    "runtimeMinutes": 5,
    "genres": "#insta:heart:",
}
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66.0,Drama
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116.0,
748523,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013,49.0,Documentary
999999,,phoneVideo,My Pink Fluffy Unicorn Holiday 2022,,2022,5.0,#insta:heart:


In [54]:
movies_copy = movie_titles.copy()
movies_copy.iloc[-1] = {
    "titleType": "phoneVideo",
    "primaryTitle": "My Pink Fluffy Unicorn Holiday 2022",
    "year": 2022,
    "runtimeMinutes": 5,
    "genres": "#insta:heart:",
}
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100.0,Documentary
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66.0,Drama
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116.0,
748523,,phoneVideo,My Pink Fluffy Unicorn Holiday 2022,,2022,5.0,#insta:heart:
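
The other option mentioned earlier, concatenating two DataFrames, could look like the following sketch. The data is made up and reduced to two columns; `pd.concat` returns a new object and leaves both inputs as they were.

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame(
    {"primaryTitle": ["Miss Jerry", "Bohemios"], "year": [1894, 1905]}
)
new_rows = pd.DataFrame(
    {"primaryTitle": ["My Pink Fluffy Unicorn Holiday 2022"], "year": [2022]},
    index=[999999],
)

# pd.concat returns a new DataFrame; both inputs stay unchanged.
extended = pd.concat([movies, new_rows])
print(extended)
```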


Writing to one particular cell uses indexers in the same way as when reading it.

In [63]:
movies_copy = movie_titles.iloc[:,:3].copy()
movies_copy.loc[748523, "primaryTitle"] = "CHANGED!!!"
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy
748521,tt9916706,movie,Dankyavar Danka
748522,tt9916730,movie,6 Gunn
748523,tt9916754,movie,CHANGED!!!


Note that pandas doesn't object if we write to a non-existent column or use a string key instead of an integer one. It happily adds a new column and changes the index type!

In [62]:
movies_copy = movie_titles.iloc[:,:3].copy()
movies_copy.loc["748523","primarytitle"] = "ADDED!!!"
movies_copy.tail()

Unnamed: 0,tconst,titleType,primaryTitle,primarytitle
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,
748521,tt9916706,movie,Dankyavar Danka,
748522,tt9916730,movie,6 Gunn,
748523,tt9916754,movie,Chico Albuquerque - Revelações,
748523,,,,6 Gun
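
The change of the index type is easy to miss in the table output. A minimal sketch on made-up data makes it visible by printing the index dtype before and after the write:

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame({"primaryTitle": ["Miss Jerry", "Bohemios"]})
print(movies.index.dtype)   # integer index before the write

# A string label and a misspelled column name: both get created silently.
movies.loc["1", "primarytitle"] = "ADDED!!!"
print(movies.index.dtype)   # the index now mixes integers and a string
print(movies)
```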


## Deleting columns and rows

There are basically three ways to do this (two destructive and one non-destructive):

- the non-destructive [drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method
- the destructive variant of drop with `inplace=True`
- the `del` statement (rarely used, not shown)

Instead of removing, you can also explicitly select the rows/columns you want.

In [67]:
# Non-destructive drop
movie_titles.drop(columns=["tconst", "titleType"])  # Returns a new object

Unnamed: 0,primaryTitle,originalTitle,year,runtimeMinutes,genres
0,Miss Jerry,Miss Jerry,1894,45,Romance
1,Bohemios,Bohemios,1905,100,
2,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
3,The Prodigal Son,L'enfant prodigue,1907,90,Drama
4,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...
748519,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
748520,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
748521,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,6 Gunn,6 Gunn,2017,116,


In [69]:
# Destructive drop
movies_copy = movie_titles.copy()
movies_copy.drop(columns=["tconst", "titleType"], inplace=True)  # Returns None
movies_copy

Unnamed: 0,primaryTitle,originalTitle,year,runtimeMinutes,genres
0,Miss Jerry,Miss Jerry,1894,45,Romance
1,Bohemios,Bohemios,1905,100,
2,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,70,"Action,Adventure,Biography"
3,The Prodigal Son,L'enfant prodigue,1907,90,Drama
4,Robbery Under Arms,Robbery Under Arms,1907,,Drama
...,...,...,...,...,...
748519,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
748520,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
748521,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,6 Gunn,6 Gunn,2017,116,


In [74]:
# Non-destructive removal of the first 700000 rows
movie_titles.drop(labels=range(0, 700000))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres
700000,tt7798040,movie,April Fools,4-gatsu no baka,2017,91,"Drama,Fantasy"
700001,tt7798090,movie,Ju jue lian kao de xiao zi,Ju jue lian kao de xiao zi,1979,,Drama
700002,tt7798098,movie,Persuasion Rooms,Ikna Odalari,2013,67,Documentary
700003,tt7798104,movie,Heavy Metal: A Mining Disaster in Northern Quebec,Heavy Metal: A Mining Disaster in Northern Quebec,2004,48,Documentary
700004,tt7798110,tvMovie,Just Fix It,Just Fix It,2019,25,Comedy
...,...,...,...,...,...,...,...
748519,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,2007,100,Documentary
748520,tt9916692,tvMovie,Teatroteka: Czlowiek bez twarzy,Teatroteka: Czlowiek bez twarzy,2015,66,Drama
748521,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
748522,tt9916730,movie,6 Gunn,6 Gunn,2017,116,
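
For completeness, here is a short sketch of the two options that the cells above do not cover: the `del` statement and explicit selection of the columns to keep. The data is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame(
    {"tconst": ["tt1", "tt2"], "titleType": ["movie", "tvMovie"], "year": [1894, 2015]}
)

# Destructive: `del` removes a single column from the object itself.
movies_copy = movies.copy()
del movies_copy["titleType"]

# Non-destructive: select what you want to keep instead of dropping the rest.
selected = movies[["tconst", "year"]]
print(list(movies_copy.columns), list(selected.columns))
```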


## Arithmetics & string manipulation

Standard arithmetic operators work on numerical columns too, and so do mathematical functions. Note that all such operations are performed in a vectorized fashion, i.e. element by element on the whole column at once.

In [None]:
movie_titles.assign(
    age=2022 - movie_titles["year"]
).sample(20)
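
Mathematical functions follow the same element-wise logic. The sketch below uses a made-up Series and NumPy's `np.log` as one example of such a function. Converting to `float64` first is an assumption made here so that the missing value becomes a plain `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up runtimes; pd.NA marks a missing value, as in the Int64 columns above.
runtime = pd.Series([45, 100, pd.NA, 90], dtype="Int64")

doubled = runtime * 2                             # arithmetic is applied element by element
log_runtime = np.log(runtime.astype("float64"))   # NumPy functions accept a whole Series

print(doubled)
print(log_runtime)
```

In both results the missing value stays missing; no explicit loop or `if` is needed.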

**Exercise hour_length**: Calculate the length of movies in hours.

In [None]:
# run_solution("hour_length")
# show_solution("hour_length")

### Basic string operations

These are typically accessed using the `.str` "accessor" of the Series like this:
    
- series.str.lower
- series.str.split
- series.str.startswith
- series.str.contains
- ...
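
A minimal sketch of these methods on a made-up Series (the titles are illustrative). Each call returns a new Series of the same length, and missing values stay missing unless you handle them explicitly, for example with the `na` argument of `str.contains`.

```python
import pandas as pd

# Made-up titles for illustration.
titles = pd.Series(["Miss Jerry", "The Prodigal Son", None], dtype="string")

lowered = titles.str.lower()
words = titles.str.split(" ")
starts_with_the = titles.str.startswith("The")
has_son = titles.str.contains("Son", na=False)   # treat missing titles as "no match"

print(lowered.tolist())
print(words.tolist())
print(starts_with_the.tolist())
print(has_son.tolist())
```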

**Exercise pink:** Find all Pink Panther movies. Note that their titles do not necessarily start with "Pink".

In [None]:
# is_pink = ...
# movie_titles[is_pink]

In [None]:
# run_solution("pink")
# show_solution("pink")

In [41]:
# String arithmetics work too!
url = "https://www.imdb.com/title/" + movie_titles["tconst"]
movie_titles[["primaryTitle"]].assign(url=url).sample(10)

Unnamed: 0,primaryTitle,url
223920,Gambling,https://www.imdb.com/title/tt0421059
699851,The Comedy Roast of Chris Gehrt,https://www.imdb.com/title/tt7790880
217782,Chroniques de la violence ordinaire,https://www.imdb.com/title/tt0401864
49054,The Policeman,https://www.imdb.com/title/tt0066374
262017,Green,https://www.imdb.com/title/tt0943964
665539,Le Styliste,https://www.imdb.com/title/tt6520630
566321,No Easy Walk to Freedom,https://www.imdb.com/title/tt3633242
587567,Vier kriegen ein Kind,https://www.imdb.com/title/tt4217144
656414,Obeah,https://www.imdb.com/title/tt6259816
465781,Winterreise,https://www.imdb.com/title/tt18926958


Let's investigate the genres a bit:

In [42]:
split_genres = movie_titles.genres.str.split(",").dropna()
split_genres.sample(10)

346337                   [Documentary]
656522                   [Documentary]
691196                [Drama, Romance]
129258                 [Comedy, Drama]
450803       [Adventure, Crime, Drama]
141588                         [Drama]
315945                        [Comedy]
498234                        [Action]
431104    [Adventure, Family, Fantasy]
355225                   [Documentary]
Name: genres, dtype: object

In [49]:
movie_titles.dropna(subset="genres").assign(
    split_genres = movie_titles.genres.str.split(","),
    genre_count = lambda df: df["split_genres"].apply(lambda item: len(item))
).sample(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,year,runtimeMinutes,genres,split_genres,genre_count
500869,tt2171735,tvMovie,Imaginary Friend,Imaginary Friend,2012,86.0,Thriller,[Thriller],1
315829,tt1172198,movie,Redemption Song,Redemption Song,2008,90.0,Documentary,[Documentary],1
311770,tt11570180,tvMovie,"Isabel Marant, naissance d'une collection","Isabel Marant, naissance d'une collection",2019,51.0,Documentary,[Documentary],1
265157,tt0986232,movie,Mr. Kuka's Advice,Lekcje pana Kuki,2008,93.0,Comedy,[Comedy],1
319760,tt11859772,movie,Holding the Devil In,Holding the Devil In,2020,55.0,Thriller,[Thriller],1
218091,tt0402594,movie,XV en Zaachila,XV en Zaachila,2003,52.0,Documentary,[Documentary],1
322123,tt11948410,movie,Forest City: A Documentary Film,Forest City: A Documentary Film,2019,,Documentary,[Documentary],1
155553,tt0256898,movie,The Garden Was Full of Moon,Lunoy byl polon sad,2000,115.0,Romance,[Romance],1
197523,tt0350088,movie,Sana totoo na,Sana totoo na,2002,,Drama,[Drama],1
161683,tt0267869,movie,Rajendrudu Gajendrudu,Rajendrudu Gajendrudu,1993,152.0,"Comedy,Drama","[Comedy, Drama]",2


## Filtering and comparison

Indexing (`[]`) in pandas Series / DataFrames also supports boolean (mask) arrays. Such arrays can be obtained by applying boolean operations to the data.

You can also use standard **comparison operators** like `<`, `<=`, `==`, `>=`, `>`, `!=`. 

As an example, find all movies from this year:

In [None]:
is_from_2022 = (movie_titles["year"] == 2022)
is_from_2022.sample(10)

Now we can directly apply the boolean mask. (Note: There is no magic here. You can construct the mask yourself.)

In [None]:
movie_titles[is_from_2022]
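
To illustrate that a mask is nothing special, the following sketch builds one by hand and compares it with a computed one. The three-row DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame({"primaryTitle": ["A", "B", "C"], "year": [2021, 2022, 2022]})

# A mask is just a sequence of booleans, one per row.
manual_mask = [False, True, True]
computed_mask = movies["year"] == 2022

print(movies[manual_mask])
print(movies[computed_mask])
```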

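You can combine boolean Series with **logical operators** too. Note, however, that `and`, `or` and `not` are Python keywords that cannot be overloaded, so pandas uses the bitwise operators `&`, `|` and `~` instead. Because these bind more tightly than comparisons, wrap each comparison in parentheses.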

So perhaps we want to list all comedies longer than 3 hours?

In [None]:
is_a_long_comedy = (movie_titles["runtimeMinutes"] > 180) & (movie_titles["genres"].str.contains("Comedy"))
movie_titles[is_a_long_comedy].sample(10)
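
The three operators can be tried out side by side on a small, made-up DataFrame:

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame(
    {
        "runtimeMinutes": [200, 95, 190, 80],
        "genres": ["Comedy", "Comedy,Drama", "Drama", "Horror"],
    }
)
is_long = movies["runtimeMinutes"] > 180
is_comedy = movies["genres"].str.contains("Comedy")

long_comedies = movies[is_long & is_comedy]    # both conditions hold
long_or_comedy = movies[is_long | is_comedy]   # at least one condition holds
not_comedies = movies[~is_comedy]              # negation

print(long_comedies)
print(long_or_comedy)
print(not_comedies)

# Note: `is_long and is_comedy` would raise a ValueError, because the truth
# value of a whole Series is ambiguous.
```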

We may wonder why we have two title-ish columns: `originalTitle` and `primaryTitle`.

In [None]:
different_title = (movie_titles["originalTitle"] != movie_titles["primaryTitle"])
different_title.name = "Different title"   # Series can have names (Note: DataFrames can't)
different_title.value_counts()

In [None]:
movie_titles[different_title].sample(10)

## Sorting

In [None]:
# Display 5 longest movies 
movie_titles.sort_values("runtimeMinutes", ascending=False).head()

In [None]:
# Alternative
movie_titles.nlargest(5, "runtimeMinutes")

**Exercise 10_oldest:** Find the 10 oldest movies that are longer than 2 hours.

In [None]:
# run_solution("10_oldest")
# show_solution("10_oldest")

**Exercise longest_title:** Show the row with the movie having the longest (primary) title.

Hint: The `idxmax()` method on a Series returns the index of the item with the maximum value. You can't (or at least shouldn't) use the maximum value itself.
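
As a sketch of how `idxmax()` behaves, here is an example on made-up numeric data (deliberately not the exercise itself). The method returns an index label, which can then be passed to `.loc` to fetch the whole row.

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame(
    {"primaryTitle": ["A", "B", "C"], "runtimeMinutes": [90, 150, 120]},
    index=["tt1", "tt2", "tt3"],
)

longest_label = movies["runtimeMinutes"].idxmax()   # an index label, not the value 150
print(longest_label)
print(movies.loc[longest_label])
```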

In [None]:
# run_solution("longest_title")
# show_solution("longest_title")