## Film Content Insights



## Overview

This project analyzes current film-industry trends by examining how different genres perform at the box office. Using data from Box Office Mojo, IMDb, Rotten Tomatoes, TheMovieDB, and The Numbers, we aim to identify which types of films are currently most successful. The results are intended to indicate which genres hold the most promise for profitability and audience engagement, and so to guide decisions about film production, marketing, and distribution.

## Business Problem

The film industry is highly competitive and continuously evolving, with varying audience preferences and technological advancements shaping market dynamics. Understanding which film genres are performing well at the box office can enable the newly established movie studio to allocate resources effectively, maximize returns, and expand its market presence. By leveraging detailed box office data, the studio can make informed decisions about which types of films to produce, potentially leading to increased profitability and audience acclaim.

<img src="https://wallpapercave.com/wp/wp8021237.jpg" width="300" alt="Descriptive Text">


## Data Understanding

The datasets include a mix of structured data from well-known film databases, covering extensive details about film genres, box office earnings, ratings, and audience feedback across several years. Each film is identified uniquely, allowing for precise tracking of its performance from release to international earnings. This comprehensive data enables an in-depth analysis of market trends, audience preferences, and financial outcomes associated with different film types.

In [82]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os




# Setting visualisation styles
sns.set(style="whitegrid")

bom_movie_gross = pd.read_csv(r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData\bom.movie_gross.csv.gz', compression='gzip')

# Path to the zip file
zip_file_path = r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData\im.db.zip'
# Directory where the db file will be extracted
extraction_path = r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData'

# Unzip the database file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the directory
    zip_ref.extractall(extraction_path)

# Assuming the database file is named 'im.db' and is the only file in the zip
db_file_path = os.path.join(extraction_path, 'im.db')

# Connect to the SQLite database
conn = sqlite3.connect(db_file_path)

# Now you can perform database operations
# Example: Listing all tables
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
for table in tables:
    print(table)





# Join IMDb basics with ratings; USING (movie_id) keeps a single movie_id column in the result
query = """
SELECT *
FROM movie_basics
JOIN movie_ratings USING (movie_id);
"""
movie_data = pd.read_sql_query(query, conn)

('movie_basics',)
('directors',)
('known_for',)
('movie_akas',)
('movie_ratings',)
('persons',)
('principals',)
('writers',)
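
The unzip-then-connect step above depends on absolute paths on one machine. As a minimal sketch of the same flow that runs anywhere, the example below builds a throwaway database in a temporary directory, zips it, extracts it, and lists its tables. The helper names `extract_and_connect` and `list_tables` and the file `demo.db` are hypothetical and not part of the project.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def extract_and_connect(zip_path, member, dest):
    """Extract one named member from a zip archive and open it with sqlite3."""
    with zipfile.ZipFile(zip_path) as archive:
        # Fail early instead of assuming the archive holds the expected file
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        archive.extract(member, dest)
    return sqlite3.connect(Path(dest) / member)


def list_tables(conn):
    """Return the table names in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


# Self-contained demo: the database and archive here are illustrative only
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = Path(tmp_dir)
    source_db = tmp_dir / "demo.db"
    builder = sqlite3.connect(source_db)
    builder.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
    builder.commit()
    builder.close()

    with zipfile.ZipFile(tmp_dir / "demo.db.zip", "w") as zf:
        zf.write(source_db, arcname="demo.db")

    out_dir = tmp_dir / "out"
    out_dir.mkdir()
    conn_demo = extract_and_connect(tmp_dir / "demo.db.zip", "demo.db", out_dir)
    demo_tables = list_tables(conn_demo)
    conn_demo.close()  # Close before the temporary directory is removed

print(demo_tables)
```

Checking `namelist()` before extracting replaces the assumption that the archive contains exactly the expected file with an explicit error.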


In [83]:

def load_file(filepath, tablename, file_type='csv'):
    """Read a CSV/TSV file into a table on `conn` (pandas infers gzip from the .gz extension)."""
    separators = {'csv': ',', 'tsv': '\t'}
    if file_type not in separators:
        # Without this check an unknown file_type would leave `df` undefined
        raise ValueError(f"Unsupported file_type: {file_type!r}")
    encodings = ['utf-8', 'ISO-8859-1', 'cp1252']  # Encodings to try, in order
    for encoding in encodings:
        try:
            # Try reading the file with the current encoding
            df = pd.read_csv(filepath, sep=separators[file_type], encoding=encoding)
        except UnicodeDecodeError as e:
            print(f"Failed to read {filepath} with {encoding} encoding. Error: {str(e)}")
            continue
        # Read succeeded: save the DataFrame to the SQLite database
        df.to_sql(tablename, conn, if_exists='replace', index=False)
        print(f"Table {tablename} created successfully with {encoding} encoding.")
        break  # Stop trying encodings once one works
    else:
        # Runs only if every encoding failed (the loop never reached `break`)
        raise ValueError(f"All encoding attempts failed for {filepath}. Please check the file encoding and data.")



# List of files and their formats
files = [
    (r"C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\bom.movie_gross.csv.gz", "bom_movie_gross", 'csv'),
    (r"C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\rt.movie_info.tsv.gz", "rt_movie_info", 'tsv'),
    (r"C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\rt.reviews.tsv.gz", "rt_reviews", 'tsv'),
    (r"C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\tmdb.movies.csv.gz", "tmdb_movies", 'csv'),
    (r"C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\tn.movie_budgets.csv.gz", "tn_movie_budgets", 'csv')
]

# Import each file into the SQLite database
for file_path, table_name, file_type in files:
    load_file(file_path, table_name, file_type)

# Names of the tables created from the files above
table_names = ["bom_movie_gross", "rt_movie_info", "rt_reviews", "tmdb_movies", "tn_movie_budgets"]

# Print columns and attributes for each table
for table_name in table_names:
    print(f"Schema for table {table_name}:")
    cursor.execute(f"PRAGMA table_info({table_name});")
    columns = cursor.fetchall()
    for column in columns:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk);
        # notnull == 0 means the column is nullable
        print(f"Column: {column[1]}, Type: {column[2]}, Nullable: {'Yes' if column[3] == 0 else 'No'}, Default: {column[4]}")
    print("\n")  # Add a newline for better readability between tables


Table bom_movie_gross created successfully with utf-8 encoding.
Table rt_movie_info created successfully with utf-8 encoding.
Failed to read C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\rt.reviews.tsv.gz with utf-8 encoding. Error: 'utf-8' codec can't decode byte 0xa0 in position 4: invalid start byte
Table rt_reviews created successfully with ISO-8859-1 encoding.
Table tmdb_movies created successfully with utf-8 encoding.
Table tn_movie_budgets created successfully with utf-8 encoding.
Schema for table bom_movie_gross:
Column: title, Type: TEXT, Nullable: Yes, Default: None
Column: studio, Type: TEXT, Nullable: Yes, Default: None
Column: domestic_gross, Type: REAL, Nullable: Yes, Default: None
Column: foreign_gross, Type: TEXT, Nullable: Yes, Default: None
Column: year, Type: INTEGER, Nullable: Yes, Default: None


Schema for table rt_movie_info:
Column: id, Type: INTEGER, Nullable: Yes, Default: None
Column: synopsis, Type: TEXT, Nullable: Yes, Default: None
Col
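
The log above shows `rt_reviews` failing under utf-8 on byte 0xa0 and then loading with ISO-8859-1. As a rough illustration of that fallback order, the sketch below applies the same loop to raw bytes instead of files. The function `decode_with_fallback` and the sample bytes are hypothetical and stand in for the file contents.

```python
def decode_with_fallback(raw, encodings=("utf-8", "ISO-8859-1", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # Try the next encoding in the list
    raise ValueError("All encoding attempts failed")


# 0xa0 is not a valid utf-8 start byte; in ISO-8859-1 it is a non-breaking space
sample = b"id\t\xa0review\n"
text, used = decode_with_fallback(sample)
print(used)

# ISO-8859-1 assigns a character to every byte value, so it accepts any input
every_byte = bytes(range(256))
_, used_for_any = decode_with_fallback(every_byte)
print(used_for_any)
```

Because ISO-8859-1 decodes any byte sequence, a later entry such as cp1252 is never reached with this ordering. A successful ISO-8859-1 read therefore shows only that the bytes could be decoded, not that the resulting characters are the intended ones.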

In [None]:
# Confirm the connection object and list the tables now in the database
print(conn)
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
print("Available tables:", tables)
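
Table names from `sqlite_master` can be combined with `PRAGMA table_info` to inspect each schema. The sketch below, run against an in-memory database, shows how the fourth field of each `PRAGMA table_info` row (`notnull`) maps to nullability. The helper `describe_table` and the table `demo_movies` are hypothetical names used only for illustration.

```python
import sqlite3


def describe_table(conn, table):
    """Return one dict per column with its name, type, nullability, and default."""
    # PRAGMA statements cannot take bound parameters, so validate the name first
    known = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table';"
        )
    }
    if table not in known:
        raise ValueError(f"Unknown table: {table!r}")
    rows = conn.execute(f'PRAGMA table_info("{table}");').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk)
    return [
        {"name": r[1], "type": r[2], "nullable": r[3] == 0, "default": r[4]}
        for r in rows
    ]


demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_movies (id INTEGER NOT NULL, title TEXT, year INTEGER)"
)
schema = describe_table(demo, "demo_movies")
for column in schema:
    print(column)
demo.close()
```

Checking the table name against `sqlite_master` before formatting it into the statement prevents an arbitrary string from reaching the SQL text.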


In [104]:


sql_query3 = """
SELECT 
    mb.primary_title,
    mb.start_year,
    p.primary_name AS director_name,
    ri.box_office,
    mr.averagerating,
    mr.numvotes
FROM movie_basics mb
JOIN directors d ON mb.movie_id = d.movie_id
JOIN persons p ON d.person_id = p.person_id
JOIN movie_ratings mr ON mb.movie_id = mr.movie_id
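-- rt_movie_info is matched on director name only, so box_office comes from a
-- Rotten Tomatoes row by the same director, not necessarily from this film
JOIN rt_movie_info ri ON ri.director = p.primary_name
WHERE ri.director IS NOT NULL
-- One row per movie; SQLite takes the non-aggregated columns from an arbitrary row in each group
GROUP BY mb.movie_id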
ORDER BY mb.movie_id ASC;
"""

df_director_movies = pd.read_sql_query(sql_query3, conn)

# Keep only rows where 'box_office' is not missing
filtered_director_movies = df_director_movies[df_director_movies['box_office'].notna()]

# Drop duplicates based on 'primary_title' and 'averagerating'
unique_ratings1 = filtered_director_movies.drop_duplicates(subset=['primary_title', 'averagerating'])

# Sort the DataFrame by 'numvotes' in descending order
sorted_unique_ratings1 = unique_ratings1.sort_values(by='numvotes', ascending=False)

# Select the top 20 unique ratings
top_20_unique_ratings1 = sorted_unique_ratings1.head(20)

# Display the top 20 unique ratings
top_20_unique_ratings1


| | primary_title | start_year | director_name | box_office | averagerating | numvotes |
|---|---|---|---|---|---|---|
| 185 | Inception | 2010 | Christopher Nolan | 53100000 | 8.8 | 1841066 |
| 179 | The Dark Knight Rises | 2012 | Christopher Nolan | 53100000 | 8.4 | 1387769 |
| 37 | Interstellar | 2014 | Christopher Nolan | 53100000 | 8.6 | 1299334 |
| 42 | The Avengers | 2012 | Joss Whedon | 25335935 | 8.1 | 1183655 |
| 677 | Gone Girl | 2014 | David Fincher | 127490802 | 8.1 | 761592 |
| 740 | Avengers: Age of Ultron | 2015 | Joss Whedon | 25335935 | 7.3 | 665594 |
| 494 | X-Men: Days of Future Past | 2014 | Bryan Singer | 82989109 | 8.0 | 620079 |
| 86 | Skyfall | 2012 | Sam Mendes | 22877808 | 7.8 | 592221 |
| 149 | The Social Network | 2010 | David Fincher | 127490802 | 7.7 | 568578 |
| 38 | World War Z | 2013 | Marc Forster | 3349167 | 7.0 | 553751 |
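
In the output above, films by the same director share one `box_office` value (for example, all three Christopher Nolan titles show 53100000). This is consistent with the query matching `rt_movie_info` on director name rather than on the film itself. The sketch below reproduces the effect on a toy in-memory database. The table names, film titles, and figures are invented for illustration and are not taken from the project data.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE films (title TEXT, director TEXT);
    CREATE TABLE rt_info (director TEXT, box_office INTEGER);
    INSERT INTO films VALUES ('Film A', 'Director X'), ('Film B', 'Director X');
    INSERT INTO rt_info VALUES ('Director X', 100), ('Director X', 200);
    """
)

# Joining on director pairs every film with every rt_info row by that director
joined = demo.execute(
    """
    SELECT f.title, r.box_office
    FROM films f
    JOIN rt_info r ON r.director = f.director
    ORDER BY f.title, r.box_office;
    """
).fetchall()

# Counting candidates per film makes the ambiguity visible before any GROUP BY
per_film = demo.execute(
    """
    SELECT f.title, COUNT(*), MIN(r.box_office), MAX(r.box_office)
    FROM films f
    JOIN rt_info r ON r.director = f.director
    GROUP BY f.title
    ORDER BY f.title;
    """
).fetchall()

print(joined)
print(per_film)
demo.close()
```

When a film has more than one candidate row, a `GROUP BY` on the film alone returns whichever `box_office` SQLite happens to pick. The figures in the table above are therefore better read as per-director values than as per-film grosses.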


## Data Processing with Pandas and SQLite

Next, we combine SQL and Pandas operations to retrieve and process data from the IMDb movie database, focusing on movie ratings and related attributes. The query joins each film to its director, to that director's entries in the principals table, and to the film's rating. The result is then de-duplicated by title and average rating, sorted by vote count, and the top 20 films are displayed with title, director, characters, average rating, and total votes. This view is useful for analyzing viewer preferences and how ratings vary across directors.

In [24]:
# Combine films, directors, principals, and ratings
sql_query2 = """
SELECT 
    mb.primary_title, 
    p.primary_name, 
    pr.characters,
    mr.averagerating,
    mr.numvotes
FROM movie_basics mb
JOIN directors d ON mb.movie_id = d.movie_id
JOIN persons p ON d.person_id = p.person_id  -- Look up each director's name
JOIN principals pr ON p.person_id = pr.person_id  -- Matched on person only, so characters is not tied to this film
JOIN movie_ratings mr ON mb.movie_id = mr.movie_id
GROUP BY mb.movie_id, p.primary_name
ORDER BY mb.movie_id ASC;
"""


# Assuming 'conn' is already your active connection to the SQLite database
result = pd.read_sql_query(sql_query2, conn)
result.head()



# Drop duplicates based on 'primary_title' and 'averagerating'
unique_ratings = result.drop_duplicates(subset=['primary_title', 'averagerating'])

# Sort the DataFrame by 'numvotes' in descending order
sorted_unique_ratings = unique_ratings.sort_values(by='numvotes', ascending=False)

# Select the top 20 unique ratings
top_20_unique_ratings = sorted_unique_ratings.head(20)

# Display the top 20 unique ratings
top_20_unique_ratings

| | primary_title | primary_name | characters | averagerating | numvotes |
|---|---|---|---|---|---|
| 2635 | Inception | Christopher Nolan | | 8.8 | 1841066 |
| 2471 | The Dark Knight Rises | Christopher Nolan | | 8.4 | 1387769 |
| 303 | Interstellar | Christopher Nolan | | 8.6 | 1299334 |
| 13717 | Django Unchained | Quentin Tarantino | | 8.4 | 1211405 |
| 352 | The Avengers | Joss Whedon | | 8.1 | 1183655 |
| 555 | The Wolf of Wall Street | Martin Scorsese | | 8.2 | 1035358 |
| 1175 | Shutter Island | Martin Scorsese | | 8.1 | 1005960 |
| 17491 | Guardians of the Galaxy | James Gunn | | 8.1 | 948394 |
| 3117 | Deadpool | Tim Miller | | 8.0 | 820847 |
| 2782 | The Hunger Games | Gary Ross | | 7.2 | 795227 |
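
The `characters` column is empty in every row shown, and the query joins `principals` on `person_id` alone. The sketch below uses a toy in-memory database to compare that join with one that also matches on the film. It assumes that `principals` has a `movie_id` column, which the schema output in this notebook does not confirm. All rows are invented for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE directors (movie_id TEXT, person_id TEXT);
    -- Assumed layout: one principals row per (movie, person)
    CREATE TABLE principals (movie_id TEXT, person_id TEXT, characters TEXT);
    INSERT INTO directors VALUES ('m1', 'p1');
    -- p1 directed m1 (no character) and played a role in m2
    INSERT INTO principals VALUES ('m1', 'p1', NULL), ('m2', 'p1', 'Hero');
    """
)

# Person-only join: the director's role in another film attaches to m1
person_only = demo.execute(
    """
    SELECT d.movie_id, pr.characters
    FROM directors d
    JOIN principals pr ON pr.person_id = d.person_id
    ORDER BY pr.movie_id;
    """
).fetchall()

# Matching on both keys keeps only the director's entry for the same film
both_keys = demo.execute(
    """
    SELECT d.movie_id, pr.characters
    FROM directors d
    LEFT JOIN principals pr
      ON pr.person_id = d.person_id AND pr.movie_id = d.movie_id;
    """
).fetchall()

print(person_only)
print(both_keys)
demo.close()
```

Under this assumed schema, a director's own entry for a film typically carries no character. An empty `characters` column would then be expected, and cast information would have to come from the actors' rows for the same film.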
