Kristen Swerzenski

DSC 540

18 February 2024

## Project Milestone 4: Cleaning and Formatting API Data

### Introduction

For my API source, I chose The Movie Database (TMDb) API, which fills in extra details (budget, genres, user ratings, etc.) on movies that already appear in my other sources. Because the ultimate goal is to combine all three data sources on a common key, and because I want to be mindful of the number of requests I send to the API, I am only going to pull data on film titles that appear in both of my previous data sets. To do that, I will begin by reading in the titles from my two previously cleaned and saved .csv files and building a list of the titles common to both, which I will then loop through when calling the API.

In [1]:
# Importing necessary packages
import json
import requests
import csv
import pandas as pd
from pandas import json_normalize
import numpy as np

In [2]:
# Setting pandas display settings to view all data
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
# Defining a function to read titles from my previous data sets
def read_movie_titles(filename):
    # Creating an empty set to store titles
    titles = set()
    # Opening and reading the .csv file
    with open(filename, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        # Appending the data in the title column of each row into the titles set
        for row in reader:
            titles.add(row['title'])
    return titles

# Reading the movie title data from both files
titles_file1 = read_movie_titles('boxoffice.csv')
titles_file2 = read_movie_titles('streaming_movies.csv')

# Using .intersection() to find the common titles
common_titles = titles_file1.intersection(titles_file2)

# Converting the set of common titles to a list
common_titles_list = list(common_titles)

# Printing the list of common movie titles to check it
print("Common Movie Titles:")
for title in common_titles_list:
    print(title)

Common Movie Titles:
After We Collided
Supernova
Bombshell
Leave No Trace
The Addams Family
Beirut
Crawl
Honey Boy
Den of Thieves
Hotel Transylvania 3: Summer Vacation
Save Yourselves!
The Last Full Measure
Seberg
Everybody Knows
Apollo 11
Peppermint
The Secrets We Keep
The Happytime Murders
Colette
The Wretched
2 Hearts
Possessor
The Tax Collector
Brian Banks
She Dies Tomorrow
Free Solo
Hotel Mumbai
The Hate U Give
The Secret Life of Pets 2
Echo in the Canyon
I Feel Pretty
Adrift
Honest Thief
The Resort
Beautiful Boy
The Beach Bum
The Goldfinch
Infamous
The Last Black Man in San Francisco
Bharat
The Water Man
Jathi Ratnalu
Raya and the Last Dragon
Final Account
The Assistant
Black Panther
Les Misérables
Three Identical Strangers
Samson
Midsommar
Sorry to Bother You
After
The Hunt
Bad Samaritan
Five Feet Apart
Cats
47 Meters Down: Uncaged
The Best of Enemies
Jay and Silent Bob Reboot
Pain and Glory
Ava
Wild Mountain Thyme
Spider-Man: Into the Spider-Verse
Dora and the Lost City of Gold

Now that I know which titles I want to request from the API, I am going to call the API once for each title in that list and convert the returned JSON data into a pandas data frame.

In [4]:
# Opening the API key stored in a .json file
with open('tmdbapikey.json') as f:
    api_data = json.load(f)
    API_KEY = api_data['tmdb_apikey']

# Creating a function that searches the API for movies by title
def search_movie_by_title(title, api_key):
    # Building the url for the API call
    url = 'https://api.themoviedb.org/3/search/movie'
    # Passing the query through params so titles with spaces or punctuation are URL-encoded
    params = {'query': title, 'api_key': api_key}
    # Sending a get request to the API url
    response = requests.get(url, params=params)
    # Parsing the response as JSON data and returning it
    return response.json()

# Creating a list to store JSON data from API call (will be a list of dictionaries)
common_titles_data = []

# Looping the search_movie_by_title function through all titles in the common_titles_list to get data for each of the movies
for title in common_titles_list:
    # Fetching search results for a title using the API key
    search_results = search_movie_by_title(title, API_KEY)
    # Extending the common_titles_data list with the matching movies
    common_titles_data.extend(search_results.get('results', []))

# Converting the list of JSON data dictionaries to a pandas data frame
movie_data = pd.DataFrame(common_titles_data)

# Displaying the data frame to check it
display(movie_data)

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,"[10749, 18]",613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0
1,False,/przsU4aiGsdlDkCp5uV8kq8gbLe.jpg,"[878, 27, 53]",10384,en,Supernova,"Set in the 22nd century, when a battered salva...",13.953,/AmHGUjhgXnOYRYrF2ZYMtzfwxPe.jpg,2000-01-14,Supernova,False,4.916,350.0
2,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,"[10749, 18]",642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0
3,False,,"[878, 35]",254168,es,Supernova,"Count Nado, is in love with Fénix, an intergal...",1.332,/lHVKzIQttkvwsR3thbPaHr5orW7.jpg,1993-02-26,Supernova,False,3.8,4.0
4,False,/9n46ONadB3wz2gi2R7vR5NgRAQ4.jpg,"[878, 28, 18, 10770]",42904,en,Supernova,A international science conference is held in ...,4.104,/jdHJKwwGE9bLLItExwN7Uzfb9aP.jpg,2005-09-05,Supernova,False,4.375,44.0
5,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,"[53, 18]",632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0
6,False,,[18],255713,no,Supernova,A boy and his father spend a day at the beach ...,0.743,,2007-04-24,Supernova,False,6.0,1.0
7,False,,[18],1191038,pl,Supernova,,0.807,,2015-07-01,Supernova,False,0.0,0.0
8,False,/3PGNAiILVNhIYtpge5miP2AaT3H.jpg,[18],253702,nl,Supernova,"Meis is fifteen, lives in the back of beyond a...",1.68,/gV5gaGMfdZxWJQwFEUDLjMxbfOY.jpg,2014-04-10,Supernova,False,5.0,8.0
9,False,,[18],1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0


It looks like some titles returned multiple matches, so the results include a number of different films that share a title. I will need to make another call to the API to get more detailed information on the films now that I have the movie IDs, but first I am going to clean out some of the extraneous titles so that I do not overload the API.
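To get a sense of how widespread the repeated matches are before trimming, the search results can be counted per title. The sketch below is only an illustration: `toy_results` is a hypothetical stand-in built from a few of the rows shown above, not the full data frame.

```python
import pandas as pd

# Hypothetical stand-in for the search results; the real frame has many more rows and columns
toy_results = pd.DataFrame({
    'id': [613504, 10384, 642208, 254168],
    'title': ['After We Collided', 'Supernova', 'Supernova', 'Supernova'],
})

# Counting how many search results came back for each title
matches_per_title = toy_results['title'].value_counts()

# Keeping only the titles that matched more than one film
ambiguous_titles = matches_per_title[matches_per_title > 1]
print(ambiguous_titles)
```

In this toy frame only "Supernova" shows up more than once, which mirrors what the output above suggests for the real data.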

#### Step 1: Dropping all films that are not exact title matches to the titles in common_titles_list

The first thing I want to do is drop all titles that were not exact matches to the title in my common titles list. 

In [5]:
# Keeping only rows whose title is an exact match to a title in common_titles_list
movie_data = movie_data[movie_data['title'].isin(common_titles_list)]

# Resetting index after dropping rows
movie_data.reset_index(drop=True, inplace=True)

# Checking the length of the data frame to see if rows were dropped
len(movie_data)

493

The number of rows significantly decreased, so it seems like that worked relatively well. However, there are some duplicate films with exact title matches so I can do a little more trimming down before making the next API call.
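One thing worth keeping in mind is that `.isin()` compares strings exactly, so a difference in case or surrounding whitespace counts as a mismatch. The sketch below uses made-up titles to show the difference; the lower-casing step is an optional, assumed normalization and is not something applied in this milestone.

```python
import pandas as pd

# Hypothetical titles: an exact match, a longer title that only contains the
# search term, and a title that differs only by case
toy = pd.DataFrame({'title': ['Supernova', 'Supernova 2', 'supernova']})
wanted = ['Supernova']

# Exact matching keeps only the first row
exact = toy[toy['title'].isin(wanted)]

# Assumed normalization: compare stripped, lower-cased titles instead
wanted_normalized = [w.strip().lower() for w in wanted]
loose = toy[toy['title'].str.strip().str.lower().isin(wanted_normalized)]

print(len(exact), len(loose))
```

The exact filter keeps one row here, while the normalized comparison keeps two. Whether the looser match is desirable depends on how consistent the titles are across sources.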

#### Step 2: Changing release_date to datetime and removing films with release dates before 2018

Since my project focuses on films released in 2018 or later, I can do a little additional trimming before making another call to the API. I will first convert the release_date column to a datetime type and then drop any rows with release dates before 2018.

In [6]:
# Converting release_date to datetime
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])

# Keeping only rows with release dates in 2018 or later
movie_data = movie_data[movie_data['release_date'].dt.year >= 2018]

# Resetting index after dropping rows
movie_data.reset_index(drop=True, inplace=True)

# Checking length of the data frame to ensure additional rows were dropped
len(movie_data)

260

This is a much more manageable number of calls to make to the API, so I will stop the trimming here. Additional matching will be needed when the data frames are all combined in the last milestone (to account for duplicate titles, I will be merging on two common keys, release date and title), but for now this is a good amount of data to continue gathering from the API.
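As a rough preview of that later merge, the sketch below shows why a second key helps with films that share a title. Both frames are toy data: `other_toy` is a hypothetical stand-in for one of the earlier data sets, and all column names here are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for one of the earlier data sets (column names are assumed)
other_toy = pd.DataFrame({
    'title': ['Supernova'],
    'release_date': pd.to_datetime(['2020-11-20']),
})

# Two same-titled TMDb rows, using IDs and dates from the output above
tmdb_toy = pd.DataFrame({
    'title': ['Supernova', 'Supernova'],
    'release_date_tmdb': pd.to_datetime(['2020-11-20', '2019-11-22']),
    'id_tmdb': [642208, 632455],
})

# Merging on title alone keeps both TMDb rows
by_title = other_toy.merge(tmdb_toy, on='title')

# Merging on title and release date keeps only the matching film
by_title_and_date = other_toy.merge(
    tmdb_toy,
    left_on=['title', 'release_date'],
    right_on=['title', 'release_date_tmdb'],
)
print(len(by_title), len(by_title_and_date))
```

This only works cleanly when the release dates agree between sources. If they differ by a few days, matching on release year instead may be a safer key, though that is also an assumption to check against the real data.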

### Pulling additional data from the API using the movie ID and merging data frames

The next API call is made by passing a movie ID into the url. After the first API data retrieval, I now have IDs for all the movies in my data frame, which I will use to make the next call and fetch some additional movie details.
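Because this step sends one request per movie ID, pausing briefly between requests is one possible way to stay well within an API's limits. The sketch below is a generic pattern and not part of this milestone's code: `fetch_with_pause`, `fetch_one`, and `delay_seconds` are hypothetical names, and the appropriate delay depends on TMDb's current rate limits, which are not assumed here.

```python
import time

def fetch_with_pause(movie_ids, fetch_one, delay_seconds=0.25):
    """Call fetch_one(movie_id) for each ID, sleeping between calls.

    fetch_one and delay_seconds are hypothetical placeholders; the right
    delay depends on the API's rate limits.
    """
    results = []
    for position, movie_id in enumerate(movie_ids):
        # Pausing before every call except the first
        if position > 0:
            time.sleep(delay_seconds)
        results.append(fetch_one(movie_id))
    return results

# Usage with a stand-in function instead of a real API request
fake_fetch = lambda movie_id: {'id': movie_id}
print(fetch_with_pause([613504, 642208], fake_fetch, delay_seconds=0.01))
```

Taking the request function as an argument keeps the pacing logic separate from the request itself, so it can be tried out without touching the API.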

In [7]:
# Defining a function to fetch movie details from the API
def fetch_movie_details(movie_ids, api_key):
    # Constructing the base url
    url = 'https://api.themoviedb.org/3/movie/'
    # Creating an empty list to store movie data
    movie_details = []
    # Iterating over each movie ID
    for movie_id in movie_ids:
        # Constructing the full url by passing in movie ID and API key
        movie_url = f'{url}{movie_id}?api_key={api_key}'
        try:
            # Sending a get request to fetch data from API
            response = requests.get(movie_url)
            # Simple error handling - checking if response was successful
            if response.status_code == 200:
                # If response was successful, parse JSON data
                movie_data = response.json()
                
                # The following columns need some further JSON parsing as they return a list of dictionaries within the dataframe cell
                # Converting genres to comma-separated string
                movie_data['genres'] = ', '.join([genre['name'] for genre in movie_data.get('genres', [])])
                
                # Converting spoken_languages to comma-separated string
                movie_data['spoken_languages'] = ', '.join([lang['name'] for lang in movie_data.get('spoken_languages', [])])
                
                # Converting production_countries to comma-separated string
                movie_data['production_countries'] = ', '.join([country['name'] for country in movie_data.get('production_countries', [])])
                
                # Converting production_companies to comma-separated string
                movie_data['production_companies'] = ', '.join([company['name'] for company in movie_data.get('production_companies', [])])
                
                # Extracting the name of the collection to which the movie belongs if there is one
                belongs_to_collection = movie_data.get('belongs_to_collection')
                if belongs_to_collection:
                    movie_data['belongs_to_collection'] = belongs_to_collection.get('name', '')
                    
                # Appending movie data to the list
                movie_details.append(movie_data)
            # If data was not retrieved, print a message showing which movie ID failed
            else:
                print(f"Failed to retrieve data for movie ID {movie_id}")
        # If response was not received, print error that was encountered
        except Exception as e:
            print(f"An error occurred while processing movie ID {movie_id}: {e}")
    # Return movie_details list
    return movie_details


# Separating API calls into batches of 50 to better limit calls to API
batch_size = 50

# Splitting the movie IDs into batches
movie_ids_batches = [movie_data['id'].iloc[i:i+batch_size] for i in range(0, len(movie_data), batch_size)]

# Creating a list to store data from each batch
additional_details = []

# Looping through each batch of movie IDs and fetching additional details
for batch_index, movie_ids_batch in enumerate(movie_ids_batches):
    # Printing current batch being processed to track progress
    print(f"Processing batch {batch_index+1} out of {len(movie_ids_batches)}")
    # Using fetch_movie_details on current batch
    batch_details = fetch_movie_details(movie_ids_batch, API_KEY)
    # Extending additional_details list with each batch
    additional_details.extend(batch_details)

# Converting the list of dictionaries from the JSON data into a data frame
additional_details_df = pd.DataFrame(additional_details)

# Displaying the data frame
display(additional_details_df)

Processing batch 1 out of 6
Processing batch 2 out of 6
Processing batch 3 out of 6
Processing batch 4 out of 6
Processing batch 5 out of 6
Processing batch 6 out of 6


Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,After Collection,14000000,"Romance, Drama",,613504,tt10362466,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,2020-09-02,48000000,105,English,Released,Can love overcome the past?,After We Collided,False,7.195,5157
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,,0,"Romance, Drama",,642208,tt11169050,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2020-11-20,2506542,94,English,Released,,Supernova,False,7.0,291
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,,0,"Thriller, Drama",https://supernova-film.pl/movies/9922,632455,tt10666454,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,2019-11-22,0,78,Polski,Released,One moment can change your entire life,Supernova,False,6.891,22
3,False,,,0,Drama,,1179465,,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,,,2023-09-22,0,12,English,Released,,Supernova,False,6.0,1
4,False,,,0,,,925991,,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,,Brazil,2018-08-15,0,0,,Released,,Supernova,True,0.0,0
5,False,/nC86jat6h6WxlQeCRyZIQAjjCsb.jpg,,0,Music,https://youtu.be/FiKa5yIgvW4,919678,,en,Supernova,An experiment in atmosphere.,0.65,/AtxrCldnGgiDO9zQQgmwtVXfWso.jpg,,,2021-12-24,0,1,No Language,Released,,Supernova,False,0.0,0
6,False,,,0,"Science Fiction, Drama",,1162706,tt16384574,de,Supernova,,1.148,/tCvnJCvAd1JMWsdh2Ec7YTL2cZc.jpg,"Filmakademie Baden-Württemberg, Animationsinst...",,2021-10-28,0,0,,Released,,Supernova,False,0.0,0
7,False,,,0,,,641317,,fr,Supernova,At 8 years old jacob is diagnosed with a rare ...,0.6,/6VANHsx53OkEjnj3pDDSOdN5iQu.jpg,,,2019-07-11,0,11,,Released,,Supernova,False,0.0,0
8,False,,,0,,,926891,tt17007744,en,Supernova,An aged ex-Club Kid prepares to re-enter queer...,0.671,/9Z9JqpC5i22mEs5gzBgYXmUE8Ad.jpg,,,2022-02-18,0,12,English,Released,,Supernova,False,0.0,0
9,False,,,0,Drama,,1213722,tt27960800,fr,Supernova,"Basile, 17 years old, activist, wants to chang...",0.6,/dZPcalyHftmWw8CmaOlDdvhMe15.jpg,Tripode Productions,France,2023-04-04,0,43,Français,Released,,Supernova,False,0.0,0


Now that this data fetch was successful, I will go ahead and merge the two data frames into a single data frame:

#### Merging the movie_data and additional_details_df frames along common key movie ID

In [8]:
# Merging the data frames based on the common key movie ID
movie_data = pd.merge(movie_data, additional_details_df, on='id', how='inner')
# Displaying merged data frame
display(movie_data)

Unnamed: 0,adult_x,backdrop_path_x,genre_ids,id,original_language_x,original_title_x,overview_x,popularity_x,poster_path_x,release_date_x,title_x,video_x,vote_average_x,vote_count_x,adult_y,backdrop_path_y,belongs_to_collection,budget,genres,homepage,imdb_id,original_language_y,original_title_y,overview_y,popularity_y,poster_path_y,production_companies,production_countries,release_date_y,revenue,runtime,spoken_languages,status,tagline,title_y,video_y,vote_average_y,vote_count_y
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,"[10749, 18]",613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,After Collection,14000000,"Romance, Drama",,tt10362466,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,2020-09-02,48000000,105,English,Released,Can love overcome the past?,After We Collided,False,7.195,5157
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,"[10749, 18]",642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,,0,"Romance, Drama",,tt11169050,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2020-11-20,2506542,94,English,Released,,Supernova,False,7.0,291
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,"[53, 18]",632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,,0,"Thriller, Drama",https://supernova-film.pl/movies/9922,tt10666454,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,2019-11-22,0,78,Polski,Released,One moment can change your entire life,Supernova,False,6.891,22
3,False,,[18],1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0,False,,,0,Drama,,,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,,,2023-09-22,0,12,English,Released,,Supernova,False,6.0,1
4,False,,[],925991,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,2018-08-15,Supernova,True,0.0,0.0,False,,,0,,,,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,,Brazil,2018-08-15,0,0,,Released,,Supernova,True,0.0,0
5,False,/nC86jat6h6WxlQeCRyZIQAjjCsb.jpg,[10402],919678,en,Supernova,An experiment in atmosphere.,0.65,/AtxrCldnGgiDO9zQQgmwtVXfWso.jpg,2021-12-24,Supernova,False,0.0,0.0,False,/nC86jat6h6WxlQeCRyZIQAjjCsb.jpg,,0,Music,https://youtu.be/FiKa5yIgvW4,,en,Supernova,An experiment in atmosphere.,0.65,/AtxrCldnGgiDO9zQQgmwtVXfWso.jpg,,,2021-12-24,0,1,No Language,Released,,Supernova,False,0.0,0
6,False,,"[878, 18]",1162706,de,Supernova,,1.148,/tCvnJCvAd1JMWsdh2Ec7YTL2cZc.jpg,2021-10-28,Supernova,False,0.0,0.0,False,,,0,"Science Fiction, Drama",,tt16384574,de,Supernova,,1.148,/tCvnJCvAd1JMWsdh2Ec7YTL2cZc.jpg,"Filmakademie Baden-Württemberg, Animationsinst...",,2021-10-28,0,0,,Released,,Supernova,False,0.0,0
7,False,,[],641317,fr,Supernova,At 8 years old jacob is diagnosed with a rare ...,0.6,/6VANHsx53OkEjnj3pDDSOdN5iQu.jpg,2019-07-11,Supernova,False,0.0,0.0,False,,,0,,,,fr,Supernova,At 8 years old jacob is diagnosed with a rare ...,0.6,/6VANHsx53OkEjnj3pDDSOdN5iQu.jpg,,,2019-07-11,0,11,,Released,,Supernova,False,0.0,0
8,False,,[],926891,en,Supernova,An aged ex-Club Kid prepares to re-enter queer...,0.671,/9Z9JqpC5i22mEs5gzBgYXmUE8Ad.jpg,2022-02-18,Supernova,False,0.0,0.0,False,,,0,,,tt17007744,en,Supernova,An aged ex-Club Kid prepares to re-enter queer...,0.671,/9Z9JqpC5i22mEs5gzBgYXmUE8Ad.jpg,,,2022-02-18,0,12,English,Released,,Supernova,False,0.0,0
9,False,,[18],1213722,fr,Supernova,"Basile, 17 years old, activist, wants to chang...",0.6,/dZPcalyHftmWw8CmaOlDdvhMe15.jpg,2023-04-04,Supernova,False,0.0,0.0,False,,,0,Drama,,tt27960800,fr,Supernova,"Basile, 17 years old, activist, wants to chang...",0.6,/dZPcalyHftmWw8CmaOlDdvhMe15.jpg,Tripode Productions,France,2023-04-04,0,43,Français,Released,,Supernova,False,0.0,0


Now that I have all the data I was looking for from the API, I can continue my cleaning steps.

#### Step 3: Removing unnecessary columns after merge

Merging the data frames left me with both duplicate columns and columns that I do not really need (such as genre IDs since I was able to parse out the genres). I am going to drop any unnecessary columns.
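As an aside, one alternative to dropping the duplicates afterwards would be to merge in only the columns that the first data frame does not already have. The sketch below illustrates that idea on toy frames (`search_toy` and `details_toy` are hypothetical stand-ins); it is not the approach used in this milestone.

```python
import pandas as pd

# Toy versions of the two frames; 'id' and 'title' appear in both
search_toy = pd.DataFrame({'id': [613504, 642208],
                           'title': ['After We Collided', 'Supernova']})
details_toy = pd.DataFrame({'id': [613504, 642208],
                            'title': ['After We Collided', 'Supernova'],
                            'runtime': [105, 94]})

# Columns that exist only in the details frame
new_columns = details_toy.columns.difference(search_toy.columns).tolist()

# Merging just the key plus the new columns avoids any _x/_y suffixes
merged = search_toy.merge(details_toy[['id'] + new_columns], on='id', how='inner')

print(merged.columns.tolist())
```

With this approach the overlapping columns keep their original names, so no renaming step would be needed afterwards.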

In [9]:
# Creating a list of columns to drop
columns_to_drop = ['genre_ids', 'adult_y', 'backdrop_path_y', 'original_language_y', 'original_title_y', 'overview_y', 
                   'popularity_y', 'poster_path_y', 'release_date_y', 'title_y', 'video_y', 'vote_average_y', 'vote_count_y']
# Using .drop() to drop the columns in the list
movie_data = movie_data.drop(columns = columns_to_drop)

In [10]:
# Checking to make sure columns were dropped
movie_data.head()

Unnamed: 0,adult_x,backdrop_path_x,id,original_language_x,original_title_x,overview_x,popularity_x,poster_path_x,release_date_x,title_x,video_x,vote_average_x,vote_count_x,belongs_to_collection,budget,genres,homepage,imdb_id,production_companies,production_countries,revenue,runtime,spoken_languages,status,tagline
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0,After Collection,14000000,"Romance, Drama",,tt10362466,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,48000000,105,English,Released,Can love overcome the past?
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0,,0,"Romance, Drama",,tt11169050,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2506542,94,English,Released,
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0,,0,"Thriller, Drama",https://supernova-film.pl/movies/9922,tt10666454,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,0,78,Polski,Released,One moment can change your entire life
3,False,,1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0,,0,Drama,,,,,0,12,English,Released,
4,False,,925991,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,2018-08-15,Supernova,True,0.0,0.0,,0,,,,,Brazil,0,0,,Released,


#### Step 4: Renaming columns

The column names are also quite messy now, so I am going to rename them to make the data frame easier to work with:

In [11]:
# Creating a dictionary of column mapping, mapping original headers to new headers
column_mapping = {
    'adult_x': 'adult',
    'backdrop_path_x': 'backdrop_path',
    'id': 'id_tmdb',
    'original_language_x': 'original_language',
    'original_title_x': 'original_title',
    'overview_x': 'overview',
    'popularity_x': 'popularity',
    'poster_path_x': 'poster_path',
    'release_date_x': 'release_date_tmdb',
    'title_x': 'title',
    'video_x': 'video',
    'vote_average_x': 'vote_average',
    'vote_count_x': 'vote_count',
}

In [12]:
# Renaming the columns using .rename()
movie_data.rename(columns=column_mapping, inplace=True)
# Checking to make sure columns were renamed
movie_data.head()

Unnamed: 0,adult,backdrop_path,id_tmdb,original_language,original_title,overview,popularity,poster_path,release_date_tmdb,title,video,vote_average,vote_count,belongs_to_collection,budget,genres,homepage,imdb_id,production_companies,production_countries,revenue,runtime,spoken_languages,status,tagline
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0,After Collection,14000000,"Romance, Drama",,tt10362466,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,48000000,105,English,Released,Can love overcome the past?
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0,,0,"Romance, Drama",,tt11169050,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2506542,94,English,Released,
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0,,0,"Thriller, Drama",https://supernova-film.pl/movies/9922,tt10666454,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,0,78,Polski,Released,One moment can change your entire life
3,False,,1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0,,0,Drama,,,,,0,12,English,Released,
4,False,,925991,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,2018-08-15,Supernova,True,0.0,0.0,,0,,,,,Brazil,0,0,,Released,


#### Step 5: Replacing 0s in certain columns with NaN

The columns budget, runtime, and revenue all have a number of rows with 0s entered. However, it doesn't make sense for a film to have a budget or runtime of 0, and it is unlikely that a film made no money, so more than likely there is simply no data for these columns for certain films. I want to change these 0s to NaNs so that they do not throw off any potential future analyses.
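To illustrate why the placeholder zeros matter, the sketch below compares an average computed with the zeros left in against one computed after they are replaced with NaN. The runtimes are taken from the first five sample rows shown above.

```python
import numpy as np
import pandas as pd

# Runtime values from the sample rows above, where 0 stands for "unknown"
runtimes = pd.Series([105, 94, 78, 12, 0])

# With the zero left in, the placeholder drags the average down
mean_with_zeros = runtimes.mean()

# After replacing 0 with NaN, pandas skips the missing value by default
mean_with_nan = runtimes.replace(0, np.nan).mean()

print(mean_with_zeros, mean_with_nan)
```

Here the average runtime rises from 57.8 to 72.25 minutes once the missing value is no longer counted as a real 0.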

In [13]:
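# Replacing 0s in budget, revenue, and runtime with NaN since they represent missing data
movie_data['budget'] = movie_data['budget'].replace(0, np.nan)
movie_data['revenue'] = movie_data['revenue'].replace(0, np.nan)
movie_data['runtime'] = movie_data['runtime'].replace(0, np.nan)

# Displaying the data frame to check it
display(movie_data)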

Unnamed: 0,adult,backdrop_path,id_tmdb,original_language,original_title,overview,popularity,poster_path,release_date_tmdb,title,video,vote_average,vote_count,belongs_to_collection,budget,genres,homepage,imdb_id,production_companies,production_countries,revenue,runtime,spoken_languages,status,tagline
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0,After Collection,14000000.0,"Romance, Drama",,tt10362466,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,48000000.0,105.0,English,Released,Can love overcome the past?
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0,,,"Romance, Drama",,tt11169050,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2506542.0,94.0,English,Released,
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0,,,"Thriller, Drama",https://supernova-film.pl/movies/9922,tt10666454,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,,78.0,Polski,Released,One moment can change your entire life
3,False,,1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0,,,Drama,,,,,,12.0,English,Released,
4,False,,925991,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,2018-08-15,Supernova,True,0.0,0.0,,,,,,,Brazil,,,,Released,
5,False,/nC86jat6h6WxlQeCRyZIQAjjCsb.jpg,919678,en,Supernova,An experiment in atmosphere.,0.65,/AtxrCldnGgiDO9zQQgmwtVXfWso.jpg,2021-12-24,Supernova,False,0.0,0.0,,,Music,https://youtu.be/FiKa5yIgvW4,,,,,1.0,No Language,Released,
6,False,,1162706,de,Supernova,,1.148,/tCvnJCvAd1JMWsdh2Ec7YTL2cZc.jpg,2021-10-28,Supernova,False,0.0,0.0,,,"Science Fiction, Drama",,tt16384574,"Filmakademie Baden-Württemberg, Animationsinst...",,,,,Released,
7,False,,641317,fr,Supernova,At 8 years old jacob is diagnosed with a rare ...,0.6,/6VANHsx53OkEjnj3pDDSOdN5iQu.jpg,2019-07-11,Supernova,False,0.0,0.0,,,,,,,,,11.0,,Released,
8,False,,926891,en,Supernova,An aged ex-Club Kid prepares to re-enter queer...,0.671,/9Z9JqpC5i22mEs5gzBgYXmUE8Ad.jpg,2022-02-18,Supernova,False,0.0,0.0,,,,,tt17007744,,,,12.0,English,Released,
9,False,,1213722,fr,Supernova,"Basile, 17 years old, activist, wants to chang...",0.6,/dZPcalyHftmWw8CmaOlDdvhMe15.jpg,2023-04-04,Supernova,False,0.0,0.0,,,Drama,,tt27960800,Tripode Productions,France,,43.0,Français,Released,


#### Step 6: Turning Genres into more usable data using get_dummies()

This data is in relatively good shape; however, there is one last transformation I'd like to make: turning the genres column into more usable data. As it stands right now, all of a film's genres are crammed into a single cell and separated by commas. To make analysis by genre easier, I am going to break that column out using get_dummies from pandas, creating a separate column for each genre that holds a 1 if the movie falls under that genre and a 0 if it does not.

In [14]:
# Converting the genres column into dummy variables, using a comma and space as the separator
genre_dummies = movie_data['genres'].str.get_dummies(sep=', ')

# Renaming the dummy columns based on the genre type (e.g., genre_Action)
genre_dummies.columns = ['genre_' + column for column in genre_dummies.columns]

# Concatenating the dummy variables onto the original data frame
movie_data = pd.concat([movie_data, genre_dummies], axis=1)

In [15]:
# Checking the data frame
movie_data.head()

Unnamed: 0,adult,backdrop_path,id_tmdb,original_language,original_title,overview,popularity,poster_path,release_date_tmdb,title,video,vote_average,vote_count,belongs_to_collection,budget,genres,homepage,imdb_id,production_companies,production_countries,revenue,runtime,spoken_languages,status,tagline,genre_Action,genre_Adventure,genre_Animation,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Family,genre_Fantasy,genre_History,genre_Horror,genre_Music,genre_Mystery,genre_Romance,genre_Science Fiction,genre_TV Movie,genre_Thriller,genre_War,genre_Western
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0,After Collection,14000000.0,"Romance, Drama",,tt10362466,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,48000000.0,105.0,English,Released,Can love overcome the past?,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0,,,"Romance, Drama",,tt11169050,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2506542.0,94.0,English,Released,,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0,,,"Thriller, Drama",https://supernova-film.pl/movies/9922,tt10666454,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,,78.0,Polski,Released,One moment can change your entire life,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
3,False,,1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0,,,Drama,,,,,,12.0,English,Released,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,False,,925991,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,2018-08-15,Supernova,True,0.0,0.0,,,,,,,Brazil,,,,Released,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now the genre data is in a much more usable format.
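As a quick illustration of what the dummy columns make possible, the sketch below applies the same transformation to a few genre strings from the sample rows and then counts films per genre. The toy frame and the counting step are only an example of a possible follow-up, not part of this milestone.

```python
import pandas as pd

# Genre strings from the first few sample rows, including one film with no genres
toy = pd.DataFrame({'genres': ['Romance, Drama', 'Thriller, Drama', 'Drama', '']})

# Same transformation as above: one 0/1 column per genre
dummies = toy['genres'].str.get_dummies(sep=', ')
dummies.columns = ['genre_' + column for column in dummies.columns]

# Summing each column gives the number of films tagged with that genre
genre_counts = dummies.sum().sort_values(ascending=False)
print(genre_counts)
```

A film with an empty genres string simply gets 0 in every genre column, as in row 4 of the data frame above.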

### Printing and Exporting the Final Data Set

In [16]:
# Displaying the final data set
display(movie_data)

Unnamed: 0,adult,backdrop_path,id_tmdb,original_language,original_title,overview,popularity,poster_path,release_date_tmdb,title,video,vote_average,vote_count,belongs_to_collection,budget,genres,homepage,imdb_id,production_companies,production_countries,revenue,runtime,spoken_languages,status,tagline,genre_Action,genre_Adventure,genre_Animation,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Family,genre_Fantasy,genre_History,genre_Horror,genre_Music,genre_Mystery,genre_Romance,genre_Science Fiction,genre_TV Movie,genre_Thriller,genre_War,genre_Western
0,False,/6hgItrYQEG33y0I7yP2SRl2ei4w.jpg,613504,en,After We Collided,Tessa finds herself struggling with her compli...,74.446,/kiX7UYfOpYrMFSAGbI6j1pFkLzQ.jpg,2020-09-02,After We Collided,False,7.195,5157.0,After Collection,14000000.0,"Romance, Drama",,tt10362466,"Voltage Pictures, Offspring Entertainment, Fra...",United States of America,48000000.0,105.0,English,Released,Can love overcome the past?,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
1,False,/BjLgTDAcZc94KomcQAxMVye0yI.jpg,642208,en,Supernova,"Sam and Tusker, partners of 20 years, are trav...",15.737,/xpLi04zHu36TH8nuvFAwAF3LUkq.jpg,2020-11-20,Supernova,False,7.0,291.0,,,"Romance, Drama",,tt11169050,"Quiddity Films, The Bureau, BBC Film",United Kingdom,2506542.0,94.0,English,Released,,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
2,False,/xK5l1CMj4HllYfgCcOngC86E4o1.jpg,632455,pl,Supernova,"Three men, one place and one event that will c...",3.302,/xWDiMDQtNNsPAr3ANsQU2SVno6j.jpg,2019-11-22,Supernova,False,6.891,22.0,,,"Thriller, Drama",https://supernova-film.pl/movies/9922,tt10666454,"Canal+ Polska, Stowarzyszenie Filmowców Polskich",Poland,,78.0,Polski,Released,One moment can change your entire life,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
3,False,,1179465,en,Supernova,"On an awkward day on the beach, Hannah tries t...",0.6,,2023-09-22,Supernova,False,6.0,1.0,,,Drama,,,,,,12.0,English,Released,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,False,,925991,pt,Supernova,,0.6,/AbVuOfwlYeViMouqCrr6nJiEuuI.jpg,2018-08-15,Supernova,True,0.0,0.0,,,,,,,Brazil,,,,Released,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,False,/nC86jat6h6WxlQeCRyZIQAjjCsb.jpg,919678,en,Supernova,An experiment in atmosphere.,0.65,/AtxrCldnGgiDO9zQQgmwtVXfWso.jpg,2021-12-24,Supernova,False,0.0,0.0,,,Music,https://youtu.be/FiKa5yIgvW4,,,,,1.0,No Language,Released,,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
6,False,,1162706,de,Supernova,,1.148,/tCvnJCvAd1JMWsdh2Ec7YTL2cZc.jpg,2021-10-28,Supernova,False,0.0,0.0,,,"Science Fiction, Drama",,tt16384574,"Filmakademie Baden-Württemberg, Animationsinst...",,,,,Released,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
7,False,,641317,fr,Supernova,At 8 years old jacob is diagnosed with a rare ...,0.6,/6VANHsx53OkEjnj3pDDSOdN5iQu.jpg,2019-07-11,Supernova,False,0.0,0.0,,,,,,,,,11.0,,Released,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,False,,926891,en,Supernova,An aged ex-Club Kid prepares to re-enter queer...,0.671,/9Z9JqpC5i22mEs5gzBgYXmUE8Ad.jpg,2022-02-18,Supernova,False,0.0,0.0,,,,,tt17007744,,,,12.0,English,Released,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,False,,1213722,fr,Supernova,"Basile, 17 years old, activist, wants to chang...",0.6,/dZPcalyHftmWw8CmaOlDdvhMe15.jpg,2023-04-04,Supernova,False,0.0,0.0,,,Drama,,tt27960800,Tripode Productions,France,,43.0,Français,Released,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


In [17]:
# Exporting to .csv
movie_data.to_csv('tmdb_data.csv', index=False)

### Ethical Considerations of this Data Wrangling

This was one of my first experiences working with data retrieved through an API, so I took extra care to be mindful of where and how I was retrieving the data for this part of the project. First, the API I chose was The Movie Database (TMDb), which is a reputable and well-established database for film and television. After choosing the API, I made sure that my intentions for the data and what I would be asking of the API adhered to the policies outlined by TMDb; when requesting the API key, I had to submit these intentions and be approved. For my initial API call, I used my previous data sets to make more targeted requests, which limited the number of requests I sent to the API. By the second API call, I had also learned how to send requests in batches so as not to overload the API's servers with a large number of requests all at once. While working with the data in my notebook, I was also careful not to rerun the code blocks that make API calls unnecessarily, so as not to put extra load on the servers, even though my calls were most likely relatively small. Working with an API taught me a lot about being respectful and transparent about how I request data and what I request it for.
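
The batching idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `chunked` and `fetch_in_batches` are hypothetical helper names, `fetch_one` stands in for whatever function actually sends a single request to TMDb, and the batch size and pause length are arbitrary values rather than limits published by the API.

```python
import time
from typing import Any, Callable, Iterator, List


def chunked(items: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(
    movie_ids: List[int],
    fetch_one: Callable[[int], Any],
    batch_size: int = 20,        # assumed value, not a documented TMDb limit
    pause_seconds: float = 1.0,  # assumed pause between batches
) -> List[Any]:
    """Call `fetch_one` for every id, pausing between batches."""
    results = []
    batches = list(chunked(movie_ids, batch_size))
    for i, batch in enumerate(batches):
        for movie_id in batch:
            results.append(fetch_one(movie_id))
        # Pause between batches, but not after the final one
        if i < len(batches) - 1:
            time.sleep(pause_seconds)
    return results


# Example with a stand-in for the real request function (no network access)
demo = fetch_in_batches([613504, 642208, 632455], lambda mid: {"id": mid},
                        batch_size=2, pause_seconds=0)
print(demo)
```

Passing the request function in as an argument keeps the pacing logic separate from the API call itself, so the loop can be tried out with a stand-in function before any real requests are sent.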

As for the data wrangling steps themselves, the only consideration I can see is in my step of changing 0s in revenue, budget, and runtime to NaNs. While runtime is fairly self-explanatory (all films have at least some runtime, so it was safe to assume that a 0 meant no runtime was available), I made this same assumption for both budget and revenue. I assumed that all films, no matter how large or small, should have some type of budget or record of the cost of resources allocated to the project, but it very well could be the case that some films were somehow created with no budget. My assumptions for revenue were in a similar vein, though there very well could be films that generated no revenue, for which a 0 in this field is accurate. However, I weighed the potential cost of losing a few of these data points against the benefit of replacing 0s with NaN to cover cases where the 0 simply meant there was no data, and I decided that the benefit of handling these values at this point in my process outweighed the potential cost.
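
One possible way to limit the cost of this trade-off is to record which values were literally 0 before overwriting them, so that genuine zeros could be revisited later. The sketch below shows this on made-up toy data; the `_was_zero` flag columns are a hypothetical addition for illustration and are not part of the notebook's cleaning steps.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the TMDb results; titles and values are illustrative only
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "budget": [14000000, 0, 0],
    "revenue": [48000000, 2506542, 0],
    "runtime": [105, 94, 0],
})

for col in ["budget", "revenue", "runtime"]:
    # Keep a record of which entries were reported as exactly 0
    toy[f"{col}_was_zero"] = toy[col].eq(0)
    # Treat 0 as "no data available" by replacing it with NaN
    toy[col] = toy[col].replace(0, np.nan)

print(toy)
```

With the flags in place, the NaN values behave as missing data in later calculations, while the original distinction between "reported as 0" and "never reported" is still available if it turns out to matter.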

Looking at my project overall, the biggest ethical consideration I am taking into account at this time is data representation. Once I retrieved common movie titles from my previous two data sets in order to make more targeted calls to the API, I realized that the data pulled from my web source was exclusive to the top-performing films of each year. While this might give me a good data set for investigating common threads between monetarily successful films, it would also be beneficial to look at commonalities between films that didn't generate as much revenue. Looking ahead to the final milestone, as my data stands now, this may be an underrepresented group in my final merged data set.