## This notebook uses the movie sequel data scraped from Wikipedia to find each sequel's original film and to fetch TMDB data for the movies

In [5]:
import urllib.parse
import urllib.request
import json
import pandas as pd
from datetime import datetime
import numpy as np
from loguru import logger
from IPython.display import clear_output
import requests
import re
from scipy import stats

tmdb_key = "ad63716b3506edd1aaa3aef6c8ebd46b"

pd.options.mode.chained_assignment = None  # default='warn'

In [6]:
df = pd.read_csv("SequelIMDBlinks.csv")
df

Unnamed: 0,item,title,imdb
0,http://www.wikidata.org/entity/Q36951175,Mamma Mia! Here We Go Again,tt6911608
1,http://www.wikidata.org/entity/Q104814,Aliens (film),tt0090605
2,http://www.wikidata.org/entity/Q106571,From Russia with Love (film),tt0057076
3,http://www.wikidata.org/entity/Q108543,Alien 3,tt0103644
4,http://www.wikidata.org/entity/Q116928,The Twilight Saga: New Moon,tt1259571
...,...,...,...
2896,http://www.wikidata.org/entity/Q56877293,Money Trap,tt8442644
2897,item,title,imdb
2898,http://www.wikidata.org/entity/Q19893083,Let Hoi Decide,tt3701078
2899,http://www.wikidata.org/entity/Q10841742,Fool for Love (2010 film),tt1630038


Cleaning the dataframe by removing unwanted values: the `item` column, repeated header rows (such as row 2897 above), and rows with missing values.

In [None]:
df = df.drop(["item"], axis='columns')

In [7]:
df = df[df["imdb"] != "imdb"]
df = df.dropna()
df.size  # number of cells (rows x columns), not the number of rows

8589

Remove movies with the same IMDb id.

In [8]:
df = df[~df.duplicated(subset=["imdb"])]
df.size  # number of cells (rows x columns), not the number of rows

7875
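
The three cleaning steps above can be tried on a small in-memory table. This is only a sketch: the rows are a hypothetical miniature of `SequelIMDBlinks.csv` (one repeated header row, one missing IMDb id, one duplicated IMDb id), and `toy` is an illustrative name that does not appear elsewhere in the notebook.

```python
import io
import pandas as pd

# Hypothetical miniature of the input file, for illustration only.
raw = io.StringIO(
    "item,title,imdb\n"
    "Q1,Aliens (film),tt0090605\n"
    "item,title,imdb\n"
    "Q2,Alien 3,tt0103644\n"
    "Q3,No Id Film,\n"
    "Q4,Aliens,tt0090605\n"
)
toy = pd.read_csv(raw)

toy = toy[toy["imdb"] != "imdb"]            # drop repeated header rows
toy = toy.dropna()                          # drop rows with a missing value
toy = toy.drop_duplicates(subset=["imdb"])  # keep the first row per IMDb id

n_rows = len(toy)    # number of rows
n_cells = toy.size   # rows x columns
print(n_rows, n_cells)
```

Note that `len(toy)` counts rows while `toy.size` counts cells, so the two differ by a factor equal to the number of columns.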

Each of these functions takes a JSON response (parsed into a dict) and returns the requested field, or `None` if it is missing.

In [9]:
def get_tmdb_id(json_data):
    try:
        return json_data["movie_results"][0]["id"]
    except:
        return None

def get_collection_id(json_data):
    try:
        return json_data["belongs_to_collection"]["id"]
    except:
        return None

def get_original_movie(json_data):
    try:
        # gets the release date of the first movie in the collection and turns it into a datetime 
        oldest_movie = json_data["parts"][0]
        oldest_movie_date = datetime.strptime(oldest_movie["release_date"].replace("-","/"), '%Y/%m/%d')

        # goes through the rest of the movies in the collection, checking if there are any earlier movies.
        for movie in json_data["parts"][1:]:
            try:
                movie_date = datetime.strptime(movie["release_date"].replace("-","/"), '%Y/%m/%d')
                if(oldest_movie_date > movie_date):
                    oldest_movie = movie
                    oldest_movie_date = movie_date
            # if there is a problem with the release_date (eg it is missing) then skip it        
            except:
                continue
        # returns the id of the original movie (the outer except returns None on failure)
        original_movie_id = oldest_movie["id"]
        return str(original_movie_id)
    except:
        return None
    
def get_original_movie_name(json_data):
    try:
        return json_data["title"]
    except:
        return None
    
def get_original_movie_id(json_data):
    try:
        return json_data["imdb_id"]
    except:
        return None    
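
The "oldest movie in the collection" rule can be checked offline on a hand-written response. The sketch below is a compact variant, not the notebook's function: `earliest_part_id` and the `sample` dict (ids and dates) are hypothetical, and unlike `get_original_movie` it skips a first entry with a bad date instead of returning `None` for the whole collection.

```python
from datetime import datetime

def earliest_part_id(collection_json):
    """Return the id (as a string) of the part with the earliest release date, or None."""
    dated = []
    for part in (collection_json or {}).get("parts") or []:
        try:
            released = datetime.strptime(part["release_date"], "%Y-%m-%d")
            dated.append((released, part["id"]))
        except (KeyError, TypeError, ValueError):
            continue  # missing, empty, or malformed entry: skip it
    if not dated:
        return None
    return str(min(dated)[1])

# Hypothetical collection: three parts, one of them without a release date.
sample = {"parts": [
    {"id": 1, "release_date": "1990-06-01"},
    {"id": 2, "release_date": "1985-03-15"},
    {"id": 3, "release_date": ""},
]}
print(earliest_part_id(sample))
```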

"get_data" searches for a movie in tmdb using the movie's imdb id.

In [10]:
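def get_data(prefix, get_json_fun):
    # "external_source" tells the /find/ endpoint that the id in the url is an imdb id
    movie_json = fetch(prefix, {"external_source":"imdb_id"})
    data = get_json_fun(movie_json)
    # keep None as None so that dropna() can remove failed lookups;
    # str(None) would produce the string "None", which dropna() keeps
    return None if data is None else str(data)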

Calls the TMDB API and returns the response parsed as JSON, or `None` if the request fails.

In [11]:
def fetch(endpoint, params=None):
    # construct the url
    api_prefix = "https://api.themoviedb.org/3"
    url = api_prefix
    
    if not endpoint.startswith("/"):
        url += "/"
    
    url += endpoint
    
    params = dict(params or {})  # copy: never mutate the caller's dict or a shared default
    params["api_key"] = tmdb_key
    url += "?" + urllib.parse.urlencode(params)
    clear_output()
    logger.info(url)

    try:
        response = urllib.request.urlopen(url)
        raw_json = response.read().decode("utf-8")    
        return json.loads(raw_json)
    # if an error occurred (e.g. an HTTP error or invalid JSON) return None
    except Exception:
        return None
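
The URL construction can be separated from the network call, which makes it easy to inspect without contacting TMDB. The following is a minimal sketch under that assumption: `build_url` and the placeholder key `YOUR_KEY` are hypothetical names, while the API prefix and the `/find/` path are the ones used above.

```python
from urllib.parse import urlencode

API_PREFIX = "https://api.themoviedb.org/3"

def build_url(endpoint, api_key, params=None):
    """Build a request URL without performing the request."""
    query = dict(params or {})  # copy so the caller's dict is left untouched
    query["api_key"] = api_key
    if not endpoint.startswith("/"):
        endpoint = "/" + endpoint
    return API_PREFIX + endpoint + "?" + urlencode(query)

shared = {"external_source": "imdb_id"}
url = build_url("find/tt0090605", "YOUR_KEY", shared)
print(url)
print(shared)  # still has a single entry: the api key was not added to it
```

Copying the parameters matters because a dict passed to several calls, or used as a default argument value, would otherwise accumulate the `api_key` entry.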

Using an existing column in our dataframe, query the TMDB API for each row, extract a value with one of our "get" functions from earlier, and add it as a new column. Rows with missing values are then dropped.

In [12]:
def add_new_col_to_df(df, prefix, existing_df_col_name, get_json_fun, new_col_name):
    df[new_col_name] = df.apply(lambda x: get_data(prefix + x[existing_df_col_name], get_json_fun), axis =1)
    df = df.dropna()
    return df
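
The apply-then-`dropna` pattern can be illustrated without network access. In this sketch a dictionary stands in for the API: `FAKE_LOOKUP`, `lookup`, and the id `tt0000000` are hypothetical, while the two real id pairs are taken from the output tables in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the API: IMDb id -> TMDB id.
FAKE_LOOKUP = {"tt0090605": 679, "tt0103644": 8077}

def lookup(imdb_id):
    """Return the TMDB id as a string, or None when nothing is found."""
    value = FAKE_LOOKUP.get(imdb_id)
    return None if value is None else str(value)

toy = pd.DataFrame({"imdb": ["tt0090605", "tt0000000", "tt0103644"]})
toy["tmdb_id"] = toy["imdb"].apply(lookup)
toy = toy.dropna()  # removes the row whose lookup returned None
print(toy)

# Pitfall: the *string* "None" is not a missing value, so dropna() keeps it.
bad = pd.Series([str(None), "679"])
print(bad.isna().sum())
```

The last two lines show why the lookup has to return a real `None` rather than its string form if failed lookups are meant to be dropped.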

First we add the tmdb ID for each movie. This will be crucial for searching the tmdb api in future queries.

In [13]:
df = add_new_col_to_df(df, "/find/", "imdb", get_tmdb_id, "tmdb_id")
df.head()

2023-05-18 14:42:06.156 | INFO     | __main__:fetch:15 - https://api.themoviedb.org/3/find/tt1630038?external_source=imdb_id&api_key=ad63716b3506edd1aaa3aef6c8ebd46b


Unnamed: 0,item,title,imdb,tmdb_id
0,http://www.wikidata.org/entity/Q36951175,Mamma Mia! Here We Go Again,tt6911608,458423
1,http://www.wikidata.org/entity/Q104814,Aliens (film),tt0090605,679
2,http://www.wikidata.org/entity/Q106571,From Russia with Love (film),tt0057076,657
3,http://www.wikidata.org/entity/Q108543,Alien 3,tt0103644,8077
4,http://www.wikidata.org/entity/Q116928,The Twilight Saga: New Moon,tt1259571,18239


A collection in TMDB is a list of related movies. We can use this to find the original movie for a sequel.

In [14]:
df = add_new_col_to_df(df, "/movie/", "tmdb_id", get_collection_id, "collection_id")
df.head()

2023-05-18 14:49:24.978 | INFO     | __main__:fetch:15 - https://api.themoviedb.org/3/movie/138969?external_source=imdb_id&api_key=ad63716b3506edd1aaa3aef6c8ebd46b


Unnamed: 0,item,title,imdb,tmdb_id,collection_id
0,http://www.wikidata.org/entity/Q36951175,Mamma Mia! Here We Go Again,tt6911608,458423,458558
1,http://www.wikidata.org/entity/Q104814,Aliens (film),tt0090605,679,8091
2,http://www.wikidata.org/entity/Q106571,From Russia with Love (film),tt0057076,657,645
3,http://www.wikidata.org/entity/Q108543,Alien 3,tt0103644,8077,8091
4,http://www.wikidata.org/entity/Q116928,The Twilight Saga: New Moon,tt1259571,18239,33514


Adding a new column with the TMDB ID of each sequel's original movie. The original movie is taken to be the oldest movie in the collection.

In [15]:
df = add_new_col_to_df(df, "/collection/", "collection_id", get_original_movie, "original_movie_id")
df.head()

2023-05-18 14:54:48.566 | INFO     | __main__:fetch:15 - https://api.themoviedb.org/3/collection/390205?external_source=imdb_id&api_key=ad63716b3506edd1aaa3aef6c8ebd46b


Unnamed: 0,item,title,imdb,tmdb_id,collection_id,original_movie_id
0,http://www.wikidata.org/entity/Q36951175,Mamma Mia! Here We Go Again,tt6911608,458423,458558,11631
1,http://www.wikidata.org/entity/Q104814,Aliens (film),tt0090605,679,8091,348
2,http://www.wikidata.org/entity/Q106571,From Russia with Love (film),tt0057076,657,645,646
3,http://www.wikidata.org/entity/Q108543,Alien 3,tt0103644,8077,8091,348
4,http://www.wikidata.org/entity/Q116928,The Twilight Saga: New Moon,tt1259571,18239,33514,8966


Adds the title of the original movie.

In [17]:
df = add_new_col_to_df(df, "/movie/", "original_movie_id", get_original_movie_name, "original_movie_name")
df.head()

2023-05-18 15:00:08.368 | INFO     | __main__:fetch:15 - https://api.themoviedb.org/3/movie/138969?external_source=imdb_id&api_key=ad63716b3506edd1aaa3aef6c8ebd46b


Unnamed: 0,item,title,imdb,tmdb_id,collection_id,original_movie_id,original_movie_name
0,http://www.wikidata.org/entity/Q36951175,Mamma Mia! Here We Go Again,tt6911608,458423,458558,11631,Mamma Mia!
1,http://www.wikidata.org/entity/Q104814,Aliens (film),tt0090605,679,8091,348,Alien
2,http://www.wikidata.org/entity/Q106571,From Russia with Love (film),tt0057076,657,645,646,Dr. No
3,http://www.wikidata.org/entity/Q108543,Alien 3,tt0103644,8077,8091,348,Alien
4,http://www.wikidata.org/entity/Q116928,The Twilight Saga: New Moon,tt1259571,18239,33514,8966,Twilight


Lastly, adds a new column for the imdb ID of the original movie.

In [19]:
df = add_new_col_to_df(df, "/movie/", "original_movie_id", get_original_movie_id, "original_imdb_id")
df.head()

2023-05-18 15:03:38.616 | INFO     | __main__:fetch:15 - https://api.themoviedb.org/3/movie/138969?external_source=imdb_id&api_key=ad63716b3506edd1aaa3aef6c8ebd46b


Unnamed: 0,item,title,imdb,tmdb_id,collection_id,original_movie_id,original_movie_name,original_imdb_id
0,http://www.wikidata.org/entity/Q36951175,Mamma Mia! Here We Go Again,tt6911608,458423,458558,11631,Mamma Mia!,tt0795421
1,http://www.wikidata.org/entity/Q104814,Aliens (film),tt0090605,679,8091,348,Alien,tt0078748
2,http://www.wikidata.org/entity/Q106571,From Russia with Love (film),tt0057076,657,645,646,Dr. No,tt0055928
3,http://www.wikidata.org/entity/Q108543,Alien 3,tt0103644,8077,8091,348,Alien,tt0078748
4,http://www.wikidata.org/entity/Q116928,The Twilight Saga: New Moon,tt1259571,18239,33514,8966,Twilight,tt1099212


Renaming the sequel's title and IMDb ID columns.

In [21]:
df.rename(columns={"title": "sequel"}, inplace=True)
df.rename(columns={"imdb": "sequel_imdb_id"}, inplace=True)

Removes rows where the original movie's IMDb ID is the same as the sequel's, i.e. the listed "sequel" is itself the oldest movie in its collection.

In [22]:
df.drop(df[df.sequel_imdb_id == df.original_imdb_id].index, inplace=True)
df.shape

(2593, 8)

Storing the names and IMDb IDs of the originals and sequels in a file so we can get information for those movies from IMDb.

In [25]:
sequel_ids = df[["original_movie_name", "original_imdb_id", "sequel", "sequel_imdb_id"]]

# original name = prequel_sequel_ids.csv
sequel_ids.to_csv("sequel_ids.csv")

Constructing a new dataframe where each original movie in our dataframe has a dictionary of its sequels. The dictionary keys are the sequel titles and the values are lists holding the sequel's IMDb and TMDB IDs, in that order. This helps us avoid multiple requests to TMDB for original movies that have multiple sequels.

In [26]:
original_movies = {}
original_movies_id = []
original_movies_tmdb_id = []

for index, row in df.iterrows():
    original_movie = row["original_movie_name"]
    if original_movie:
        if original_movie not in original_movies:
            original_movies[original_movie] = {row["sequel"]: [row["sequel_imdb_id"], row["tmdb_id"]]}
            original_movies_id.append(row["original_imdb_id"])   
            original_movies_tmdb_id.append(row["original_movie_id"])   
        else:
            movie_dict = original_movies[original_movie]
            movie_dict[row["sequel"]] = [row["sequel_imdb_id"], row["tmdb_id"]]
            original_movies[original_movie] = movie_dict
            
tmdb_df = pd.DataFrame()
tmdb_df["original_movies"] = list(original_movies.keys())
tmdb_df["original_movie_imdb"] = original_movies_id
tmdb_df["original_movie_tmdb"] = original_movies_tmdb_id
tmdb_df["sequels"] = list(original_movies.values()) 
tmdb_df.head()

Unnamed: 0,original_movies,original_movie_imdb,original_movie_tmdb,sequels
0,Mamma Mia!,tt0795421,11631,"{'Mamma Mia! Here We Go Again': ['tt6911608', ..."
1,Alien,tt0078748,348,"{'Aliens (film)': ['tt0090605', '679'], 'Alien..."
2,Dr. No,tt0055928,646,"{'From Russia with Love (film)': ['tt0057076',..."
3,Twilight,tt1099212,8966,"{'The Twilight Saga: New Moon': ['tt1259571', ..."
4,King Kong,tt0074751,10730,"{'King Kong Lives': ['tt0091344', '31947'], 'S..."
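
The same grouping can also be written with `groupby`. The sketch below is a variation, not the cell above: it keys the result by the original's IMDb ID instead of its title (an assumption made here so that two different films sharing a title are not merged), and `toy` and `sequels_by_original` are illustrative names. The rows are copied from the tables shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "original_movie_name": ["Alien", "Alien", "Mamma Mia!"],
    "original_imdb_id": ["tt0078748", "tt0078748", "tt0795421"],
    "sequel": ["Aliens (film)", "Alien 3", "Mamma Mia! Here We Go Again"],
    "sequel_imdb_id": ["tt0090605", "tt0103644", "tt6911608"],
    "tmdb_id": ["679", "8077", "458423"],
})

# One entry per original; each value maps sequel title -> [IMDb id, TMDB id].
sequels_by_original = {
    original_id: {
        row.sequel: [row.sequel_imdb_id, row.tmdb_id]
        for row in group.itertuples(index=False)
    }
    for original_id, group in toy.groupby("original_imdb_id", sort=False)
}
print(sequels_by_original)
```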


This goes through each original movie and its sequels, fetches the full TMDB record for each one, and collects the records into a dataframe. The `is_sequel` column marks originals with 0 and sequels with 1.

In [28]:
list_of_movie_jsons = []
is_sequel_list = []

for index, row in tmdb_df.iterrows():
    response = None
    endpoint = "/movie/" + str(row["original_movie_tmdb"])
    response = fetch(endpoint)
    if response:
        list_of_movie_jsons.append(response)
        is_sequel_list.append(0)
    for sequel_id in row["sequels"].values():
        response = None
        endpoint = "/movie/" + sequel_id[1]
        response = fetch(endpoint)
        if response:
            list_of_movie_jsons.append(response)   
            is_sequel_list.append(1)

# removing missing values from list
list_of_movie_jsons = list(filter(lambda x: x is not None, list_of_movie_jsons))
movies_df = pd.DataFrame(list_of_movie_jsons)   
movies_df["is_sequel"] = is_sequel_list
movies_df.head()

2023-05-18 15:30:19.530 | INFO     | __main__:fetch:15 - https://api.themoviedb.org/3/movie/329851?api_key=ad63716b3506edd1aaa3aef6c8ebd46b


Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,is_sequel
0,False,/ns9T8glyF7mYwxrcUHXm22nMf9t.jpg,"{'id': 458558, 'name': 'Mamma Mia! Collection'...",52000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",http://www.mammamiamovie.com,11631,tt0795421,en,Mamma Mia!,...,609841637,108,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Take a trip down the aisle you'll never forget,Mamma Mia!,False,6.966,5868,0
1,False,/gtv2H1u9eGffjxVqNfJBZuFCKxR.jpg,"{'id': 458558, 'name': 'Mamma Mia! Collection'...",75000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",http://mammamiamovie.com,458423,tt6911608,en,Mamma Mia! Here We Go Again,...,395044706,113,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Discover how it all began.,Mamma Mia! Here We Go Again,False,7.136,3039,1
2,False,/AmR3JG1VQVxU8TfAvljUhfSFUOx.jpg,"{'id': 8091, 'name': 'Alien Collection', 'post...",11000000,"[{'id': 27, 'name': 'Horror'}, {'id': 878, 'na...",https://www.20thcenturystudios.com/movies/alien,348,tt0078748,en,Alien,...,104931801,117,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,In space no one can hear you scream.,Alien,False,8.142,12773,0
3,False,/jMBpJFRtrtIXymer93XLavPwI3P.jpg,"{'id': 8091, 'name': 'Alien Collection', 'post...",18500000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",https://www.20thcenturystudios.com/movies/aliens,679,tt0090605,en,Aliens,...,183316455,137,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,This time it's war.,Aliens,False,7.921,8445,1
4,False,/nEmOmbCWBXS3tHU2N49z693KDK.jpg,"{'id': 8091, 'name': 'Alien Collection', 'post...",50000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",https://www.20thcenturystudios.com/movies/alien-3,8077,tt0103644,en,Alien³,...,159773545,114,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The bitch is back.,Alien³,False,6.348,4894,1


The total number of original movies and sequels in this dataframe.

In [29]:
((movies_df["is_sequel"] == 0).sum(),(movies_df["is_sequel"] == 1).sum())

(1299, 2547)

Only keeping the columns we want and renaming some columns.

In [31]:
sequels_tmdb_df = movies_df[['title','imdb_id', 'id','release_date', 'is_sequel', 'runtime','vote_average','vote_count', 'popularity', 'budget', 
                       'revenue', 'genres','original_language', 'belongs_to_collection', 'production_companies', 'production_countries']]

sequels_tmdb_df = sequels_tmdb_df.rename(columns={"id": "tmdb_id"})

Storing the collection ids of movies. 

In [32]:
sequels_tmdb_df["collection_id"] = sequels_tmdb_df["belongs_to_collection"].apply(lambda x: x["id"] if isinstance(x, dict) else np.nan)
sequels_tmdb_df.drop("belongs_to_collection", inplace=True, axis=1)
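
A small worked example shows what this extraction does to a record with and without a collection. It is a sketch on hypothetical data: `toy`, `collection_id`, and the standalone film are made up for illustration, and the final cast to pandas' nullable `Int64` dtype is an optional extra that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Alien", "Standalone film (hypothetical)"],
    "belongs_to_collection": [{"id": 8091, "name": "Alien Collection"}, None],
})

def collection_id(value):
    """Return the collection id for a dict entry, NaN otherwise."""
    return value["id"] if isinstance(value, dict) else np.nan

toy["collection_id"] = toy["belongs_to_collection"].apply(collection_id)
toy = toy.drop(columns="belongs_to_collection")

# Optional: NaN forces a float column; the nullable integer dtype keeps whole numbers.
toy["collection_id"] = toy["collection_id"].astype("Int64")
print(toy)
```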

Saving output to csv.

In [39]:
sequels_tmdb_df.to_csv("sequels_tmdb_data.csv")
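
By default `to_csv` also writes the dataframe index, which comes back as an `Unnamed: 0` column when the file is read again. The round trip below sketches the difference using in-memory buffers; `toy` and the buffer names are illustrative, and whether to keep the index in `sequels_tmdb_data.csv` is a choice left to the notebook's author.

```python
import io
import pandas as pd

toy = pd.DataFrame({"title": ["Alien"], "tmdb_id": [348]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```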