## Data Acquisition
We have acquired a Kaggle dataset covering about 700,000 movies. We want to combine parts of it with data from the OMDb database to obtain movie poster images for our neural network to process, along with additional fields: the content rating (`Rated`: PG, R, etc.) and review scores (IMDb, Metacritic).

In [2]:
import pandas as pd

In [4]:
# The data from the Kaggle page
df = pd.read_csv('movies.csv')

https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies/data

In [4]:
df.head(1)

Unnamed: 0,id,title,genres,original_language,overview,popularity,production_companies,release_date,budget,revenue,runtime,status,tagline,vote_average,vote_count,credits,keywords,poster_path,backdrop_path,recommendations
0,615656,Meg 2: The Trench,Action-Science Fiction-Horror,en,An exploratory dive into the deepest depths of...,8763.998,Apelles Entertainment-Warner Bros. Pictures-di...,2023-08-02,129000000.0,352056482.0,116.0,Released,Back for seconds.,7.079,1365.0,Jason Statham-Wu Jing-Shuya Sophia Cai-Sergio ...,based on novel or book-sequel-kaiju,/4m1Au3YkjqsxF8iwQy0fPYSxE0h.jpg,/qlxy8yo5bcgUw2KAmmojUKp4rHd.jpg,1006462-298618-569094-1061181-346698-1076487-6...


#### Select relevant subset
For the purposes of our project, we focus on movies with `revenue` > 100,000 USD and an `original_language` of English (`en`).

In [5]:
# Filter the DataFrame for movies with revenue > 100,000
df_filtered = df[df['revenue'] > 100000][['title', 'release_date', 'original_language','genres','budget','revenue','runtime']].copy()
# Keep only English-language movies
df_filtered = df_filtered[df_filtered['original_language'] == 'en'].copy()

# Extract the year from the 'release_date' column
df_filtered['year'] = df_filtered['release_date'].str[:4]

# Check the filtered DataFrame
df_filtered.head(2)


Unnamed: 0,title,release_date,original_language,genres,budget,revenue,runtime,year
0,Meg 2: The Trench,2023-08-02,en,Action-Science Fiction-Horror,129000000.0,352056482.0,116.0,2023
1,The Pope's Exorcist,2023-04-05,en,Horror-Mystery-Thriller,18000000.0,65675816.0,103.0,2023


#### Create columns in our DF for OMDb data to load into

In [6]:
# Add empty columns for the OMDb data
df_filtered['Rated'] = None
df_filtered['Poster'] = None
df_filtered['Ratings'] = None
df_filtered['Metascore'] = None
df_filtered['imdbRating'] = None
df_filtered['imdbVotes'] = None
df_filtered['imdbID'] = None

# Check the DataFrame structure after adding new columns
df_filtered.head(2)


Unnamed: 0,title,release_date,original_language,genres,budget,revenue,runtime,year,Rated,Poster,Ratings,Metascore,imdbRating,imdbVotes,imdbID
0,Meg 2: The Trench,2023-08-02,en,Action-Science Fiction-Horror,129000000.0,352056482.0,116.0,2023,,,,,,,
1,The Pope's Exorcist,2023-04-05,en,Horror-Mystery-Thriller,18000000.0,65675816.0,103.0,2023,,,,,,,


In [7]:
# Remove films with troublesome characters
df_filtered = df_filtered[~df_filtered['title'].str.contains(r'[\\/]', regex=True)].copy()
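
Dropping rows is the simplest way to keep slashes out of poster file names, but it also discards those films. As a possible alternative, the title could be sanitized only when the file name is built. The sketch below is an assumption, not part of the pipeline: `poster_filename` is a hypothetical helper, and the example titles are only for illustration.

```python
import re

def poster_filename(title, year, folder="posters"):
    """Build a filesystem-safe poster path from a movie title and year."""
    # Collapse every run of characters that is not a letter, digit,
    # underscore, dot, or hyphen into a single underscore.
    safe_title = re.sub(r"[^\w.-]+", "_", title).strip("_")
    return f"{folder}/{safe_title}_{year}_photo.jpg"

print(poster_filename("Face/Off", "1997"))
print(poster_filename("Meg 2: The Trench", "2023"))
```

With a helper like this, the row filter above would no longer be needed for file naming, although titles that differ only in punctuation could then map to the same file.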

#### The following function accesses the OMDb API and acquires the necessary data
Where available, data will be stored in `df_filtered`. Movie posters will be saved in a separate `posters` folder, with the path to each image stored in the `Poster` column of the dataframe. When the function finishes, the updated data is saved as 'omdb_enriched_data.csv'.

While not perfect, the following code was successful in acquiring poster images for about 8600 films. A data cleaning step still needs to follow.
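
The cleaning step is not shown in this section. As a rough starting point, it might normalize vote counts and confirm that each recorded poster file really exists once the enrichment has run. The helpers below are hypothetical, and they assume that `imdbVotes` arrives as a comma-grouped string (for example `'1,234'`) or a placeholder such as `'N/A'`.

```python
import os

def parse_votes(value):
    """Convert a vote string such as '1,234' to int; return None if unusable."""
    if value is None:
        return None
    text = str(value).replace(",", "").strip()
    # Placeholders like 'N/A' and NaN (which stringifies to 'nan') fail isdigit()
    return int(text) if text.isdigit() else None

def poster_exists(path):
    """True only if path is a string pointing at a non-empty file."""
    return isinstance(path, str) and os.path.isfile(path) and os.path.getsize(path) > 0
```

These could be applied column-wise, for example `df_filtered['imdbVotes'].map(parse_votes)` and `df_filtered['Poster'].map(poster_exists)`.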

In [7]:
import requests
import os
import time
import pandas as pd

# Function to fetch OMDb data and download poster images, then save to CSV
def fetch_omdb_data_and_update_df(df_filtered, api_key, output_csv='omdb_enriched_data.csv', limit=2000, timeout_duration=10):
    count = 0
    # Ensure the poster directory exists
    if not os.path.exists('posters'):
        os.makedirs('posters')

    for index, row in df_filtered.iterrows():
        # Skip rows where OMDb data has already been fetched
        if pd.notna(row['imdbID']):
            print(f"Skipping {row['title']} ({row['year']}) - already processed.")
            continue

        title = row['title']
        year = row['year']

        # Ensure the year is a string and remove any decimals
        year = str(int(float(year))) if pd.notna(year) else ''

        print(f"Fetching data for: {title} ({year})")

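        # API request with title and year; passing params lets requests
        # URL-encode titles that contain characters such as '&' or '#'
        params = {'t': title, 'y': year, 'apikey': api_key}
        response = requests.get("http://www.omdbapi.com/", params=params)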

        if response.status_code == 200:
            data = response.json()

            # Check if the movie was found
            if data.get('Response', 'False') == 'True':
                # Populate the DataFrame with OMDb data
                df_filtered.at[index, 'Rated'] = data.get('Rated')
                df_filtered.at[index, 'Ratings'] = data.get('Ratings')  # is a list of ratings
                df_filtered.at[index, 'Metascore'] = pd.to_numeric(data.get('Metascore'), errors='coerce')
                df_filtered.at[index, 'imdbRating'] = pd.to_numeric(data.get('imdbRating'), errors='coerce')
                df_filtered.at[index, 'imdbVotes'] = data.get('imdbVotes')
                df_filtered.at[index, 'imdbID'] = data.get('imdbID')

                # Download the poster if available
                poster_url = data.get('Poster')
                if poster_url and poster_url != "N/A":
                    try:
                        img_data = requests.get(poster_url, timeout=timeout_duration).content
                        file_name = f"posters/{title.replace(' ', '_')}_{year}_photo.jpg"
                        with open(file_name, "wb") as img_file:
                            img_file.write(img_data)

                        # Store the poster path in the DataFrame
                        df_filtered.at[index, 'Poster'] = file_name
                        print(f"Poster downloaded and saved as {file_name}")
                    except requests.exceptions.Timeout:
                        print(f"Timeout occurred for {title} ({year}). Skipping poster.")
                else:
                    print(f"No poster available for {title} ({year})")

            else:
                print(f"Movie not found in OMDb for {title} ({year})")

        else:
            print(f"Failed to fetch data for {title} ({year}). Status code: {response.status_code}")

        # Sleep for 1 second to avoid overwhelming the API
        time.sleep(1)

        count += 1
        print(f"Progress: {count}/{limit} movies processed.")
        if count >= limit:
            break

    # Save the updated DataFrame to a CSV file
    df_filtered.to_csv(output_csv, index=False)
    print(f"Data saved to {output_csv}")

    return df_filtered
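
Titles such as `Deadpool & Wolverine` contain characters that have special meaning in a query string, so the title needs percent-encoding before it reaches the API. The sketch below uses only the standard library to show what an encoded request URL looks like. `build_omdb_url` is a hypothetical helper and `"KEY"` is a placeholder, not a real API key.

```python
from urllib.parse import urlencode

def build_omdb_url(title, year, api_key):
    """Return an OMDb lookup URL with every query parameter percent-encoded."""
    params = {"t": title, "y": year, "apikey": api_key}
    return "http://www.omdbapi.com/?" + urlencode(params)

print(build_omdb_url("Deadpool & Wolverine", "2024", "KEY"))
# http://www.omdbapi.com/?t=Deadpool+%26+Wolverine&y=2024&apikey=KEY
```

Without encoding, the `&` in the title would end the `t` parameter early and the lookup would search for `Deadpool ` only.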

In [8]:
# Load previously saved data if exists
try:
    df_filtered = pd.read_csv('omdb_enriched_data.csv')  
    print("Loaded existing OMDb enriched data.")
except FileNotFoundError:
    print("No previous data found. Starting fresh with df_filtered.")

api_key = "**********"
# Run function, set limit as desired
df_filtered = fetch_omdb_data_and_update_df(df_filtered, api_key, limit=800)

# Check the updated DataFrame
df_filtered.head(2)

Loaded existing OMDb enriched data.
Skipping Meg 2: The Trench (2023.0) - already processed.
Skipping The Pope's Exorcist (2023.0) - already processed.
Skipping Deadpool & Wolverine (2024.0) - already processed.
Skipping Transformers: Rise of the Beasts (2023.0) - already processed.
Skipping Dune: Part Two (2024.0) - already processed.
Skipping Ant-Man and the Wasp: Quantumania (2023.0) - already processed.
Skipping Creed III (2023.0) - already processed.
Skipping Insidious: The Red Door (2023.0) - already processed.
Skipping Despicable Me 4 (2024.0) - already processed.
Skipping Spider-Man: Across the Spider-Verse (2023.0) - already processed.
Skipping Kingdom of the Planet of the Apes (2024.0) - already processed.
Skipping Beetlejuice Beetlejuice (2024.0) - already processed.
Skipping Aquaman and the Lost Kingdom (2023.0) - already processed.
Skipping Shazam! Fury of the Gods (2023.0) - already processed.
Skipping Knights of the Zodiac (2023.0) - already processed.
Skipping Furiosa: 

KeyboardInterrupt: 
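
The run above ended with a `KeyboardInterrupt`, and the function writes the CSV only after its loop finishes, so rows fetched during an interrupted run are not saved. One way to guard against that is a `try`/`finally` around the loop. This is a generic sketch, not the function above: `process_with_checkpoint`, `process`, and `save` are hypothetical names.

```python
def process_with_checkpoint(items, process, save):
    """Apply `process` to each item and always call `save` on the way out."""
    done = []
    try:
        for item in items:
            done.append(process(item))
    finally:
        # Runs on normal completion, on errors, and on KeyboardInterrupt,
        # so partial results are persisted before the exception propagates.
        save(done)
    return done
```

In the fetch function, the equivalent change would be to wrap the `for` loop in `try:` and move the `df_filtered.to_csv(...)` call into the `finally:` block.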