## Film Content Insights



## Overview

This project analyzes current film-industry trends by examining how different genres perform at the box office. Using datasets from Box Office Mojo, IMDb, Rotten Tomatoes, TheMovieDB, and The Numbers, we aim to identify which types of films are currently most successful. The results inform forecasts of which genres hold the most promise for profitability and audience engagement, and so guide strategic decisions about film production, marketing, and distribution.

## Business Problem

The film industry is highly competitive and continuously evolving, with varying audience preferences and technological advancements shaping market dynamics. Understanding which film genres are performing well at the box office can enable the newly established movie studio to allocate resources effectively, maximize returns, and expand its market presence. By leveraging detailed box office data, the studio can make informed decisions about which types of films to produce, potentially leading to increased profitability and audience acclaim.

<img src="https://wallpapercave.com/wp/wp8021237.jpg" width="300" alt="Descriptive Text">


## Data Understanding

The datasets are structured tables from well-known film databases, covering film genres, box office earnings, ratings, and audience feedback across several years. Within the IMDb database each film has a unique `movie_id`, which the queries below use to join tables; the CSV and TSV files identify films by title or by a source-specific `id`. Together, these sources support analysis of market trends, audience preferences, and financial outcomes for different film types.

In [108]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os




# Setting visualisation styles
sns.set(style="whitegrid")


# Load TSV files
rt_movie_info = pd.read_csv(r'C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\rt.movie_info.tsv.gz', delimiter='\t', compression='gzip')
# Load TSV files using latin-1 encoding
rt_reviews = pd.read_csv(
    r'C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\rt.reviews.tsv.gz',
    delimiter='\t',
    compression='gzip',
    encoding='latin-1'
)


# Load CSV files
bom_movie_gross = pd.read_csv(r'C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\bom.movie_gross.csv.gz', compression='gzip')
tmdb_movies = pd.read_csv(r'C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\tmdb.movies.csv.gz', compression='gzip')
tn_movie_budgets = pd.read_csv(r'C:\Users\neali\Documents\Flatiron\2\dsc-phase-2-project\zippedData\tn.movie_budgets.csv.gz', compression='gzip')


# Path to the zip file
zip_file_path = r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData\im.db.zip'
# Directory where the db file will be extracted
extraction_path = r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData'

# Unzip the database file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the directory
    zip_ref.extractall(extraction_path)

# Assuming the database file is named 'im.db' and is the only file in the zip
db_file_path = os.path.join(extraction_path, 'im.db')

# Connect to the SQLite database
conn = sqlite3.connect(db_file_path)

# Now you can perform database operations
# Example: Listing all tables
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
for table in tables:
    print(table)





# Query movie basics joined with ratings; USING (movie_id) keeps a single movie_id column
query = """
SELECT *
FROM movie_basics
JOIN movie_ratings USING (movie_id);
"""
movie_data = pd.read_sql_query(query, conn)

('movie_basics',)
('directors',)
('known_for',)
('movie_akas',)
('movie_ratings',)
('persons',)
('principals',)
('writers',)
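
The loading cell hard-codes absolute Windows paths, so it only runs on the original machine. As a minimal sketch of one alternative, assuming a hypothetical `DATA_DIR` folder and a hypothetical helper `load_table` (neither appears in the notebook), the separator can be chosen from the file name while pandas infers gzip compression from the `.gz` suffix:

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the zipped data; adjust to your own checkout.
DATA_DIR = Path("zippedData")


def load_table(path, **kwargs):
    """Read a .csv.gz or .tsv.gz file into a DataFrame.

    The separator is picked from the file name, and compression is
    inferred by pandas from the .gz suffix. Extra keyword arguments
    (for example encoding="latin-1") are passed through to pd.read_csv.
    """
    path = Path(path)
    sep = "\t" if ".tsv" in path.suffixes else ","
    return pd.read_csv(path, sep=sep, compression="infer", **kwargs)
```

With this helper, a call such as `load_table(DATA_DIR / "rt.reviews.tsv.gz", encoding="latin-1")` would stand in for the explicit `pd.read_csv` call used for the reviews file.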


## Data Processing with Pandas and SQLite

Next, we use SQL and pandas to retrieve and process data from the IMDb movie database, focusing on movie ratings and related attributes. The query returns each movie's title, director, genres, average rating, and vote count. After dropping duplicate title/rating pairs, we sort by vote count and keep the 20 most-voted films. This view is useful for analyzing viewer preferences and how ratings vary by director.

In [120]:
# Combine movie, director, and rating data
sql_query2 = """

SELECT 
    mb.primary_title, 
    p.primary_name, 
    mb.genres,
    mr.averagerating,
    mr.numvotes
FROM 
    movie_basics mb
JOIN 
    directors d ON mb.movie_id = d.movie_id
JOIN 
    persons p ON d.person_id = p.person_id
JOIN 
    principals pr ON p.person_id = pr.person_id
JOIN 
    movie_ratings mr ON mb.movie_id = mr.movie_id
GROUP BY 
    mb.movie_id, p.primary_name
ORDER BY 
    mb.movie_id ASC;
"""



# Run the query on the open SQLite connection 'conn'
result = pd.read_sql_query(sql_query2, conn)



# Drop duplicates based on 'primary_title' and 'averagerating'
unique_ratings = result.drop_duplicates(subset=['primary_title', 'averagerating'])

# Sort the DataFrame by 'numvotes' in descending order
sorted_unique_ratings = unique_ratings.sort_values(by='numvotes', ascending=False)

# Select the top 20 unique ratings
top_20_unique_ratings = sorted_unique_ratings.head(20)

# Display the top 20 unique ratings
top_20_unique_ratings









| | primary_title | primary_name | genres | averagerating | numvotes |
|---|---|---|---|---|---|
| 2635 | Inception | Christopher Nolan | Action,Adventure,Sci-Fi | 8.8 | 1841066 |
| 2471 | The Dark Knight Rises | Christopher Nolan | Action,Thriller | 8.4 | 1387769 |
| 303 | Interstellar | Christopher Nolan | Adventure,Drama,Sci-Fi | 8.6 | 1299334 |
| 13717 | Django Unchained | Quentin Tarantino | Drama,Western | 8.4 | 1211405 |
| 352 | The Avengers | Joss Whedon | Action,Adventure,Sci-Fi | 8.1 | 1183655 |
| 555 | The Wolf of Wall Street | Martin Scorsese | Biography,Crime,Drama | 8.2 | 1035358 |
| 1175 | Shutter Island | Martin Scorsese | Mystery,Thriller | 8.1 | 1005960 |
| 17491 | Guardians of the Galaxy | James Gunn | Action,Adventure,Comedy | 8.1 | 948394 |
| 3117 | Deadpool | Tim Miller | Action,Adventure,Comedy | 8.0 | 820847 |
| 2782 | The Hunger Games | Gary Ross | Action,Adventure,Sci-Fi | 7.2 | 795227 |
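
The deduplicate, sort, and truncate steps can be checked on a small hand-made frame. This sketch uses three rows copied from the output above plus one made-up duplicate (a hypothetical co-director entry); `top_n_by_votes` is an illustrative helper, not part of the notebook.

```python
import pandas as pd


def top_n_by_votes(df, n=20):
    """Keep one row per (title, rating) pair, then return the n most-voted rows."""
    unique = df.drop_duplicates(subset=["primary_title", "averagerating"])
    return unique.sort_values(by="numvotes", ascending=False).head(n)


toy = pd.DataFrame({
    "primary_title": ["Interstellar", "Inception", "Inception", "Deadpool"],
    "primary_name": [
        "Christopher Nolan",
        "Christopher Nolan",
        "Hypothetical Co-Director",  # made-up row to trigger deduplication
        "Tim Miller",
    ],
    "averagerating": [8.6, 8.8, 8.8, 8.0],
    "numvotes": [1299334, 1841066, 1841066, 820847],
})

top2 = top_n_by_votes(toy, n=2)
print(top2[["primary_title", "primary_name", "numvotes"]])
```

Note that `drop_duplicates` keeps the first row it meets for each pair, so for a film with several directors the name that survives depends on row order.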


Next, we read and inspect the CSV and TSV files so that they can be merged with the data queried from the database.

In [110]:
# List columns of each DataFrame
dataframes = {
    'RT Movie Info': rt_movie_info,
    'RT Reviews': rt_reviews,
    'BOM Movie Gross': bom_movie_gross,
    'TMDB Movies': tmdb_movies,
    'TN Movie Budgets': tn_movie_budgets
}

# Print columns for each DataFrame
for name, df in dataframes.items():
    print(f"Columns in {name}:")
    print(df.columns.tolist())
    print()


Columns in RT Movie Info:
['id', 'synopsis', 'rating', 'genre', 'director', 'writer', 'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime', 'studio']

Columns in RT Reviews:
['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher', 'date']

Columns in BOM Movie Gross:
['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']

Columns in TMDB Movies:
['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title', 'popularity', 'release_date', 'title', 'vote_average', 'vote_count']

Columns in TN Movie Budgets:
['id', 'release_date', 'movie', 'production_budget', 'domestic_gross', 'worldwide_gross']



## Analysis 1: ROI of the Highest-Grossing Movies by Genre

The next step is to merge the DataFrames to find the highest-grossing and most-voted films, add a column for Return on Investment (ROI), and then sort and group the results by genre.
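
As a rough sketch of the ROI step: the budget and gross columns from The Numbers are strings such as `"$160,000,000"`, so they need converting to numbers first. The definition used here, ROI = (worldwide gross - production budget) / production budget, is an assumption, since the notebook does not spell out its formula. The helper `to_number` and the use of the median per genre are also illustrative choices.

```python
import pandas as pd


def to_number(series):
    """Convert strings like "$160,000,000" to floats."""
    return series.str.replace(r"[$,]", "", regex=True).astype(float)


# Three rows copied from the merged output in this notebook.
films = pd.DataFrame({
    "primary_title": ["Inception", "Django Unchained", "Deadpool"],
    "genres": ["Action,Adventure,Sci-Fi", "Drama,Western", "Action,Adventure,Comedy"],
    "production_budget": ["$160,000,000", "$100,000,000", "$58,000,000"],
    "worldwide_gross": ["$835,524,642", "$449,948,323", "$801,025,593"],
})

budget = to_number(films["production_budget"])
gross = to_number(films["worldwide_gross"])

# Assumed definition: ROI = (worldwide gross - budget) / budget.
films["roi"] = (gross - budget) / budget

# One row per (film, genre), then the median ROI for each genre.
by_genre = (
    films.assign(genre=films["genres"].str.split(","))
         .explode("genre")
         .groupby("genre")["roi"]
         .median()
         .sort_values(ascending=False)
)
print(by_genre)
```

Splitting the comma-separated `genres` string counts a film once under each of its genres, which is one of several reasonable ways to attribute ROI to a genre.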

In [122]:
# Rename 'title' to 'primary_title' in bom_movie_gross for consistency
bom_movie_gross = bom_movie_gross.rename(columns={'title': 'primary_title'})

# Merge unique_ratings with bom_movie_gross
# (only the title key is selected here, so no Box Office Mojo columns are added)
merged_df = pd.merge(unique_ratings, bom_movie_gross[['primary_title']],
                     on='primary_title', how='left')

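# Rename 'movie' to 'primary_title' in tn_movie_budgets so the merge key exists
tn_movie_budgets = tn_movie_budgets.rename(columns={'movie': 'primary_title'})

# Merge the result with tn_movie_budgets
final_merged_df = pd.merge(merged_df, tn_movie_budgets[['primary_title', 'production_budget', 'domestic_gross', 'worldwide_gross']],
                           on='primary_title', how='left')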

# Sort by vote count (descending) and keep the 30 most-voted rows for display
final2 = final_merged_df.sort_values(by='numvotes', ascending=False)

final_gross = final2.head(30)

final_gross


| | primary_title | primary_name | genres | averagerating | numvotes | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|---|---|---|
| 2394 | Inception | Christopher Nolan | Action,Adventure,Sci-Fi | 8.8 | 1841066 | $160,000,000 | $292,576,195 | $835,524,642 |
| 2249 | The Dark Knight Rises | Christopher Nolan | Action,Thriller | 8.4 | 1387769 | $275,000,000 | $448,139,099 | $1,084,439,099 |
| 281 | Interstellar | Christopher Nolan | Adventure,Drama,Sci-Fi | 8.6 | 1299334 | $165,000,000 | $188,017,894 | $666,379,375 |
| 12049 | Django Unchained | Quentin Tarantino | Drama,Western | 8.4 | 1211405 | $100,000,000 | $162,805,434 | $449,948,323 |
| 328 | The Avengers | Joss Whedon | Action,Adventure,Sci-Fi | 8.1 | 1183655 | $225,000,000 | $623,279,547 | $1,517,935,897 |
| 329 | The Avengers | Joss Whedon | Action,Adventure,Sci-Fi | 8.1 | 1183655 | $60,000,000 | $23,385,416 | $48,585,416 |
| 513 | The Wolf of Wall Street | Martin Scorsese | Biography,Crime,Drama | 8.2 | 1035358 | $100,000,000 | $116,900,694 | $389,870,414 |
| 1092 | Shutter Island | Martin Scorsese | Mystery,Thriller | 8.1 | 1005960 | $80,000,000 | $128,012,934 | $299,461,782 |
| 15279 | Guardians of the Galaxy | James Gunn | Action,Adventure,Comedy | 8.1 | 948394 | $170,000,000 | $333,172,112 | $770,867,516 |
| 2839 | Deadpool | Tim Miller | Action,Adventure,Comedy | 8.0 | 820847 | $58,000,000 | $363,070,709 | $801,025,593 |
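
The two rows for The Avengers (indices 328 and 329) show what a title-only merge does when the budget table holds more than one film with the same title: the left-hand row is repeated once per match, each time with different financial figures. A small sketch of how such cases might be detected, using values copied from the output above, follows. `validate="many_to_one"` is a standard `pd.merge` argument that raises `MergeError` when the right-hand keys are not unique; the toy frames are illustrative only.

```python
import pandas as pd

ratings = pd.DataFrame({
    "primary_title": ["The Avengers", "Deadpool"],
    "numvotes": [1183655, 820847],
})
budgets = pd.DataFrame({
    "primary_title": ["The Avengers", "The Avengers", "Deadpool"],
    "production_budget": ["$225,000,000", "$60,000,000", "$58,000,000"],
})

# Titles that occur more than once in the budget table.
is_dupe = budgets["primary_title"].duplicated(keep=False)
dupe_titles = budgets.loc[is_dupe, "primary_title"].unique()
print(dupe_titles)

# A plain left merge silently repeats the left-hand row for each match.
merged = pd.merge(ratings, budgets, on="primary_title", how="left")
print(len(ratings), "rows before merge,", len(merged), "rows after")

# With validate="many_to_one", pandas raises instead of duplicating rows.
try:
    pd.merge(ratings, budgets, on="primary_title", how="left",
             validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print("merge passed uniqueness check:", merge_ok)
```

Once duplicates are flagged, one option is to merge on a stricter key, such as title plus release year, though that would require a year column that the query above does not currently select.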
