# Imports

In [1]:
import pandas as pd


# Data

In [2]:
links = pd.read_csv('../data/ml-25m/links.csv',
                    index_col='movieId', dtype={'imdbId': str, 'tmdbId': str, 'movieId': str})

movies25m = pd.read_csv('../data/ml-25m/movies.csv',
                        index_col='movieId', dtype={'movieId': str, 'title': str, 'genres': str})\
    .join(links)

movies1m = pd.read_csv('../data/ml-1m/movies.dat', sep='::',
                       engine='python',
                       encoding='latin-1',
                       names=['movieId', 'title', 'genres'],
                       index_col='movieId',
                       dtype={'movieId': str, 'title': str, 'genres': str})\
    .join(movies25m, lsuffix='_1m', rsuffix='_25m')
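
The chained joins above rely on `DataFrame.join` defaulting to a left join on the index and on the suffixes being applied only to overlapping column names. A minimal sketch of that behavior, using toy frames whose values are illustrative only (the names `small`, `large`, and `joined` are assumptions, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the 1M and 25M catalogues (illustrative values only).
small = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Unknown Film (1990)"],
     "genres": ["Animation", "Drama"]},
    index=pd.Index(["1", "999"], name="movieId"),
)
large = pd.DataFrame(
    {"title": ["Toy Story (1995)"],
     "genres": ["Animation|Comedy"],
     "imdbId": ["0114709"]},
    index=pd.Index(["1"], name="movieId"),
)

# join() is a left join on the index by default: every row of `small` is
# kept, and rows with no match get NaN in the right-hand columns.
# Only the overlapping columns (title, genres) receive the suffixes.
joined = small.join(large, lsuffix="_1m", rsuffix="_25m")

print(joined.columns.tolist())
print(joined["imdbId"].isna().sum())
```

This is why the cleanup step below can detect unmatched movies by looking for missing values in `imdbId`.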


# Cleanup

How many movies in `movies1m` have no IMDb id?

In [3]:
no_title_id_idx = movies1m["imdbId"].isna()
noid_movies1m = movies1m[no_title_id_idx]
noid_movies1m.shape


(34, 6)

Drop the movies that have no id:

In [4]:
movies1m = movies1m[~no_title_id_idx]
movies1m.shape


(3849, 6)

Check that no movies without an IMDb id remain:

In [5]:
# count NaN values in imdbId in movies1m
movies1m["imdbId"].isna().sum()


0

Are the movie titles in 1m the same as 25m?

In [6]:
# show rows where title_1m != title_25m
# show only the title columns
# assign it to diff_titles
diff_titles = movies1m[movies1m["title_1m"] !=
                       movies1m["title_25m"]][["title_1m", "title_25m"]]
diff_titles.shape


(516, 2)

In [7]:
# show 100 random samples of diff_titles
diff_titles.sample(100)


movieId,title_1m,title_25m
3832,"Black Sabbath (Tre Volti Della Paura, I) (1963)","Black Sabbath (Tre volti della paura, I) (1963)"
985,Small Wonders (1996),Small Wonders (1995)
989,Schlafes Bruder (Brother of Sleep) (1995),Brother of Sleep (Schlafes Bruder) (1995)
3228,Wirey Spindell (1999),Wirey Spindell (2000)
2665,Earth Vs. the Flying Saucers (1956),Earth vs. the Flying Saucers (1956)
...,...,...
3589,"Kill, Baby... Kill! (Operazione Paura) (1966)","Whom the Gods Wish to Destroy (Nibelungen, Tei..."
1830,Follow the Bitch (1998),Follow the Bitch (1996)
3558,"Law, The (Le Legge) (1958)","Law, The (a.k.a. Where the Hot Wind Blows!) (L..."
3140,"Three Ages, The (1923)",Three Ages (1923)


Most of the differences are minor (capitalization, article placement, or release year), so we keep all of them.
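
One way to put a number on "minor" is to label each pair by the kind of difference. The sketch below is not part of the original notebook: the helper names `split_title` and `kind_of_difference` are hypothetical, and it assumes titles end with a four-digit year in parentheses. The three pairs are copied from the sample above.

```python
import re

import pandas as pd

# Three pairs copied from the sample output above.
pairs = pd.DataFrame(
    {
        "title_1m": [
            "Small Wonders (1996)",
            "Earth Vs. the Flying Saucers (1956)",
            "Three Ages, The (1923)",
        ],
        "title_25m": [
            "Small Wonders (1995)",
            "Earth vs. the Flying Saucers (1956)",
            "Three Ages (1923)",
        ],
    },
    index=pd.Index(["985", "2665", "3140"], name="movieId"),
)

# Assumption: the year is the last "(YYYY)" group in the title.
YEAR = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title(title):
    """Return (case-folded name without the trailing year, year or None)."""
    match = YEAR.search(title)
    year = match.group(1) if match else None
    name = YEAR.sub("", title).casefold()
    return name, year


def kind_of_difference(row):
    """Label a title pair as 'case only', 'year only', or 'other'."""
    name_a, year_a = split_title(row["title_1m"])
    name_b, year_b = split_title(row["title_25m"])
    if name_a == name_b and year_a == year_b:
        return "case only"
    if name_a == name_b:
        return "year only"
    return "other"


pairs["difference"] = pairs.apply(kind_of_difference, axis=1)
print(pairs["difference"].value_counts())
```

Applied to `diff_titles`, the rows labeled `other` would be the ones worth reading by hand.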

Drop unneeded columns:

In [8]:
# drop columns with _25m suffix and tmdbId
movies1m = movies1m.drop(
    columns=[col for col in movies1m.columns if col.endswith("_25m")])
movies1m = movies1m.drop(columns=["tmdbId"])

# rename columns with _1m suffix
movies1m = movies1m.rename(
    columns={col: col[:-3] for col in movies1m.columns if col.endswith("_1m")})

movies1m


movieId,title,genres,imdbId
1,Toy Story (1995),Animation|Children's|Comedy,0114709
2,Jumanji (1995),Adventure|Children's|Fantasy,0113497
3,Grumpier Old Men (1995),Comedy|Romance,0113228
4,Waiting to Exhale (1995),Comedy|Drama,0114885
5,Father of the Bride Part II (1995),Comedy,0113041
...,...,...,...
3948,Meet the Parents (2000),Comedy,0212338
3949,Requiem for a Dream (2000),Drama,0180093
3950,Tigerland (2000),Drama,0170691
3951,Two Family House (2000),Drama,0202641


Add a URL to look up each movie on IMDb:

In [9]:
# create a new column called imdb_url
# set imdb_url to https://www.imdb.com/title/tt + the value of imdbId + /plotsummary
movies1m["imdb_url"] = "https://www.imdb.com/title/tt" + \
    movies1m["imdbId"] + "/plotsummary"
movies1m


movieId,title,genres,imdbId,imdb_url
1,Toy Story (1995),Animation|Children's|Comedy,0114709,https://www.imdb.com/title/tt0114709/plotsummary
2,Jumanji (1995),Adventure|Children's|Fantasy,0113497,https://www.imdb.com/title/tt0113497/plotsummary
3,Grumpier Old Men (1995),Comedy|Romance,0113228,https://www.imdb.com/title/tt0113228/plotsummary
4,Waiting to Exhale (1995),Comedy|Drama,0114885,https://www.imdb.com/title/tt0114885/plotsummary
5,Father of the Bride Part II (1995),Comedy,0113041,https://www.imdb.com/title/tt0113041/plotsummary
...,...,...,...,...
3948,Meet the Parents (2000),Comedy,0212338,https://www.imdb.com/title/tt0212338/plotsummary
3949,Requiem for a Dream (2000),Drama,0180093,https://www.imdb.com/title/tt0180093/plotsummary
3950,Tigerland (2000),Drama,0170691,https://www.imdb.com/title/tt0170691/plotsummary
3951,Two Family House (2000),Drama,0202641,https://www.imdb.com/title/tt0202641/plotsummary


Make the genres into a list:

In [10]:
# split genres column into a list of genres
movies1m["genres"] = movies1m["genres"].str.split("|")
movies1m


movieId,title,genres,imdbId,imdb_url
1,Toy Story (1995),"[Animation, Children's, Comedy]",0114709,https://www.imdb.com/title/tt0114709/plotsummary
2,Jumanji (1995),"[Adventure, Children's, Fantasy]",0113497,https://www.imdb.com/title/tt0113497/plotsummary
3,Grumpier Old Men (1995),"[Comedy, Romance]",0113228,https://www.imdb.com/title/tt0113228/plotsummary
4,Waiting to Exhale (1995),"[Comedy, Drama]",0114885,https://www.imdb.com/title/tt0114885/plotsummary
5,Father of the Bride Part II (1995),[Comedy],0113041,https://www.imdb.com/title/tt0113041/plotsummary
...,...,...,...,...
3948,Meet the Parents (2000),[Comedy],0212338,https://www.imdb.com/title/tt0212338/plotsummary
3949,Requiem for a Dream (2000),[Drama],0180093,https://www.imdb.com/title/tt0180093/plotsummary
3950,Tigerland (2000),[Drama],0170691,https://www.imdb.com/title/tt0170691/plotsummary
3951,Two Family House (2000),[Drama],0202641,https://www.imdb.com/title/tt0202641/plotsummary
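
With genres stored as lists, per-genre questions can be answered with `Series.explode`, which emits one row per list element. A small sketch on three rows taken from the table above (the names `movies` and `genre_counts` are illustrative, not from the notebook):

```python
import pandas as pd

# Three rows copied from the output above, before the split.
movies = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
        "genres": [
            "Animation|Children's|Comedy",
            "Adventure|Children's|Fantasy",
            "Comedy|Romance",
        ],
    },
    index=pd.Index(["1", "2", "3"], name="movieId"),
)

# Same split as in the cell above: a single-character separator is
# treated as a literal string, not as a regular expression.
movies["genres"] = movies["genres"].str.split("|")

# explode() repeats the movieId index once per genre, so value_counts()
# counts how many movies carry each genre.
genre_counts = movies["genres"].explode().value_counts()
print(genre_counts)
```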


In [11]:
# extract imdb_url as a list
imdb_urls = movies1m["imdb_url"].tolist()
# save imdb_urls to a file
with open("../data/imdb_urls.txt", "w") as f:
    f.write("\n".join(imdb_urls))


Now you can run the `scrape-movie-medata` target!

Once that's done:

In [12]:
# read the movie_metadata.jsonl file into a dataframe named movies_metadata
movies_metadata = pd.read_json("../data/movie_metadata.jsonl", lines=True)
# rename source_url to imdb_url
movies_metadata = movies_metadata.rename(columns={"source_url": "imdb_url"})
# drop title and id columns
movies_metadata = movies_metadata.drop(columns=["title", "id"])
# if imdb_url starts with https://m (mobile site), replace the m with www
# regex=False treats the pattern as a literal string
movies_metadata["imdb_url"] = movies_metadata["imdb_url"].str.replace(
    "https://m", "https://www", regex=False)
movies_metadata.head()


Unnamed: 0,plot,summary,poster_url,imdb_url
0,When two kids find and play a magical board ga...,"Jumanji, one of the most unique--and dangerous...",https://m.media-amazon.com/images/M/MV5BZTk2Zm...,https://www.imdb.com/title/tt0113497/plotsummary
1,John and Max resolve to save their beloved bai...,Things don't seem to change much in Wabasha Co...,https://m.media-amazon.com/images/M/MV5BMjQxM2...,https://www.imdb.com/title/tt0113228/plotsummary
2,George Banks must deal not only with his daugh...,"In this sequel to ""Father of the Bride"", Georg...",https://m.media-amazon.com/images/M/MV5BOTEyNz...,https://www.imdb.com/title/tt0113041/plotsummary
3,A group of high-end professional thieves start...,Hunters and their prey--Neil and his professio...,https://m.media-amazon.com/images/M/MV5BYjZjNT...,https://www.imdb.com/title/tt0113277/plotsummary
4,An ugly duckling having undergone a remarkable...,"While she was growing up, Sabrina Fairchild sp...",https://m.media-amazon.com/images/M/MV5BYjQ5Zj...,https://www.imdb.com/title/tt0114319/plotsummary
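
The URL normalization matters because `imdb_url` is used as the join key later on, so both frames must spell the host the same way. A hedged sketch of a stricter variant follows; it assumes the scraper returns the mobile host as `https://m.imdb.com`, which the notebook does not state explicitly, and it matches the full host so that other URLs beginning with `https://m` would be left alone.

```python
import pandas as pd

# Assumed input: one mobile-host URL and one already-canonical URL.
urls = pd.Series(
    [
        "https://m.imdb.com/title/tt0113497/plotsummary",
        "https://www.imdb.com/title/tt0113228/plotsummary",
    ]
)

# regex=False makes the pattern a literal string, so the dots in the
# host name are not interpreted as "any character".
normalized = urls.str.replace(
    "https://m.imdb.com", "https://www.imdb.com", regex=False
)
print(normalized.tolist())
```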


In [13]:
movies1m

movieId,title,genres,imdbId,imdb_url
1,Toy Story (1995),"[Animation, Children's, Comedy]",0114709,https://www.imdb.com/title/tt0114709/plotsummary
2,Jumanji (1995),"[Adventure, Children's, Fantasy]",0113497,https://www.imdb.com/title/tt0113497/plotsummary
3,Grumpier Old Men (1995),"[Comedy, Romance]",0113228,https://www.imdb.com/title/tt0113228/plotsummary
4,Waiting to Exhale (1995),"[Comedy, Drama]",0114885,https://www.imdb.com/title/tt0114885/plotsummary
5,Father of the Bride Part II (1995),[Comedy],0113041,https://www.imdb.com/title/tt0113041/plotsummary
...,...,...,...,...
3948,Meet the Parents (2000),[Comedy],0212338,https://www.imdb.com/title/tt0212338/plotsummary
3949,Requiem for a Dream (2000),[Drama],0180093,https://www.imdb.com/title/tt0180093/plotsummary
3950,Tigerland (2000),[Drama],0170691,https://www.imdb.com/title/tt0170691/plotsummary
3951,Two Family House (2000),[Drama],0202641,https://www.imdb.com/title/tt0202641/plotsummary


In [14]:
# move movieId out of the index and index movies1m by imdb_url
# index movies_metadata by imdb_url as well
# left join the metadata onto movies1m, then restore movieId as the index
movies1m = movies1m.reset_index().set_index("imdb_url")
movies_metadata = movies_metadata.set_index("imdb_url")
movies1m = movies1m.join(movies_metadata, how="left").reset_index().set_index("movieId")
# order the columns by title, plot, summary, genres, poster_url, and imdb_url
movies1m = movies1m[["title", "plot", "summary", "genres", "poster_url", "imdb_url"]]
movies1m.head()

movieId,title,plot,summary,genres,poster_url,imdb_url
3309,"Dog's Life, A (1920)",The Little Tramp and his dog companion struggl...,Poor Charlie lives in a vacant lot. He tries t...,[Comedy],https://m.media-amazon.com/images/M/MV5BYWFkMj...,https://www.imdb.com/title/tt0009018/plotsummary
3132,Daddy Long Legs (1919),An orphan discovers that she has an anonymous ...,Wealthy Jervis Pendleton acts as benefactor fo...,[Comedy],https://m.media-amazon.com/images/M/MV5BMWYwYT...,https://www.imdb.com/title/tt0010040/plotsummary
2821,Male and Female (1919),Lady Mary Lasenby is a spoiled maiden who alwa...,"Lord Brockelhurst, his unwilling betrothed Lad...","[Adventure, Drama]",https://m.media-amazon.com/images/M/MV5BODE2ZT...,https://www.imdb.com/title/tt0010418/plotsummary
2823,"Spiders, The (Die Spinnen, 1. Teil: Der Golden...",Kay Hoog finds a message that indicates that s...,"In San Francisco, the sportsman Kay Hoog tells...","[Action, Drama]",https://m.media-amazon.com/images/M/MV5BMTY2MD...,https://www.imdb.com/title/tt0010726/plotsummary
3231,"Saphead, The (1920)",The simple-minded son of a rich financier must...,Nick Van Alstyne owns the Henrietta silver min...,[Comedy],https://m.media-amazon.com/images/M/MV5BZDNiOD...,https://www.imdb.com/title/tt0011652/plotsummary
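
A join on a URL column silently duplicates rows if the metadata contains the same URL twice. One possible safeguard, sketched here with `DataFrame.merge` rather than the `join` used above, is to pass `validate="many_to_one"` (which raises if the right-hand keys are not unique) and `indicator=True` (which records which rows found a match). The frames and the plot text are placeholders, not real data.

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "title": ["Jumanji (1995)", "Costa Brava (1946)"],
        "imdb_url": [
            "https://www.imdb.com/title/tt0113497/plotsummary",
            "https://www.imdb.com/title/tt0038426/plotsummary",
        ],
    },
    index=pd.Index(["2", "770"], name="movieId"),
)
metadata = pd.DataFrame(
    {
        "imdb_url": ["https://www.imdb.com/title/tt0113497/plotsummary"],
        "plot": ["placeholder plot text"],  # illustrative value only
    }
)

merged = (
    movies.reset_index()
    # validate raises MergeError if metadata has duplicate imdb_url values;
    # indicator adds a _merge column with "both" or "left_only".
    .merge(metadata, on="imdb_url", how="left",
           validate="many_to_one", indicator=True)
    .set_index("movieId")
)

# Movies for which no metadata row was found.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["title"].tolist())
```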


Deal with movies without a plot:

In [15]:
# create a df containing movies without a plot
movies_without_plot = movies1m[movies1m["plot"].isna()]
movies_without_plot.shape


(26, 6)

In [16]:
# display the movies without a plot
movies_without_plot


movieId,title,plot,summary,genres,poster_url,imdb_url
770,Costa Brava (1946),,,[Drama],,https://www.imdb.com/title/tt0038426/plotsummary
2851,Saturn 3 (1979),,,"[Adventure, Sci-Fi, Thriller]",,https://www.imdb.com/title/tt0081454/plotsummary
2258,Master Ninja I (1984),,,[Action],,https://www.imdb.com/title/tt0087690/plotsummary
1155,"Invitation, The (Zaproszenie) (1986)",,,[Drama],,https://www.imdb.com/title/tt0092281/plotsummary
1107,Loser (1991),,,[Comedy],,https://www.imdb.com/title/tt0102336/plotsummary
752,Vermont Is For Lovers (1992),,,"[Comedy, Romance]",,https://www.imdb.com/title/tt0105737/plotsummary
1319,Kids of Survival (1993),,,[Documentary],,https://www.imdb.com/title/tt0107314/plotsummary
1421,Grateful Dead (1995),,,[Documentary],,https://www.imdb.com/title/tt0113212/plotsummary
791,"Last Klezmer: Leopold Kozlowski, His Life and ...",,,[Documentary],,https://www.imdb.com/title/tt0113610/plotsummary
1316,Anna (1996),,,[Drama],,https://www.imdb.com/title/tt0115548/plotsummary


In [17]:
# since only a small number of movies have no plot, remove all movies without a plot from movies1m
movies1m = movies1m[~movies1m["plot"].isna()]
# save movies1m to a parquet file named movies_postprocessed.parquet
movies1m.to_parquet("../data/movies_postprocessed.parquet")
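
The boolean-mask filter above can also be written with `DataFrame.dropna`, which some readers may find easier to scan. A minimal sketch on a toy frame (the plot text is a placeholder) showing that the two forms select the same rows:

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "title": ["Jumanji (1995)", "Costa Brava (1946)"],
        "plot": ["placeholder plot text", None],  # illustrative values only
    },
    index=pd.Index(["2", "770"], name="movieId"),
)

# Mask-based filter, as used in the cell above.
by_mask = movies[~movies["plot"].isna()]

# Equivalent: drop rows whose "plot" value is missing.
by_dropna = movies.dropna(subset=["plot"])

print(by_mask.equals(by_dropna))
```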


In [18]:
movies1m

movieId,title,plot,summary,genres,poster_url,imdb_url
3309,"Dog's Life, A (1920)",The Little Tramp and his dog companion struggl...,Poor Charlie lives in a vacant lot. He tries t...,[Comedy],https://m.media-amazon.com/images/M/MV5BYWFkMj...,https://www.imdb.com/title/tt0009018/plotsummary
3132,Daddy Long Legs (1919),An orphan discovers that she has an anonymous ...,Wealthy Jervis Pendleton acts as benefactor fo...,[Comedy],https://m.media-amazon.com/images/M/MV5BMWYwYT...,https://www.imdb.com/title/tt0010040/plotsummary
2821,Male and Female (1919),Lady Mary Lasenby is a spoiled maiden who alwa...,"Lord Brockelhurst, his unwilling betrothed Lad...","[Adventure, Drama]",https://m.media-amazon.com/images/M/MV5BODE2ZT...,https://www.imdb.com/title/tt0010418/plotsummary
2823,"Spiders, The (Die Spinnen, 1. Teil: Der Golden...",Kay Hoog finds a message that indicates that s...,"In San Francisco, the sportsman Kay Hoog tells...","[Action, Drama]",https://m.media-amazon.com/images/M/MV5BMTY2MD...,https://www.imdb.com/title/tt0010726/plotsummary
3231,"Saphead, The (1920)",The simple-minded son of a rich financier must...,Nick Van Alstyne owns the Henrietta silver min...,[Comedy],https://m.media-amazon.com/images/M/MV5BZDNiOD...,https://www.imdb.com/title/tt0011652/plotsummary
...,...,...,...,...,...,...
3539,"Filth and the Fury, The (2000)",A film about the career of the notorious punk ...,A documentary about the punk band The Sex Pist...,[Documentary],https://m.media-amazon.com/images/M/MV5BNDI5Zj...,https://www.imdb.com/title/tt0236216/plotsummary
3865,"Original Kings of Comedy, The (2000)",A concert film featuring four major African Am...,"February 26 and 27, 2000, the Original Kings o...","[Comedy, Documentary]",https://m.media-amazon.com/images/M/MV5BMTI5ND...,https://www.imdb.com/title/tt0236388/plotsummary
3851,I'm the One That I Want (2000),"November, 1999, Margaret Cho is home in San Fr...","November, 1999, Margaret Cho is home in San Fr...",[Comedy],https://m.media-amazon.com/images/M/MV5BMTQ4MD...,https://www.imdb.com/title/tt0251739/plotsummary
3890,Back Stage (2000),If you ever wanted to know what really goes on...,If you ever wanted to know what really goes on...,[Documentary],https://m.media-amazon.com/images/M/MV5BMTg1OD...,https://www.imdb.com/title/tt0259207/plotsummary
