## Film Content Insights



## Overview

This project analyzes current film-industry trends by examining how different genres perform at the box office. Drawing on datasets from Box Office Mojo, IMDb, Rotten Tomatoes, TheMovieDB, and The Numbers, we aim to identify which types of films are currently most successful. The results are intended to help forecast which genres hold the most promise for profitability and audience engagement, and so guide decisions about film production, marketing, and distribution.

## Business Problem

The film industry is highly competitive and constantly evolving, with shifting audience preferences and technological advances shaping the market. Knowing which genres perform well at the box office lets a newly established movie studio allocate resources effectively, maximize returns, and expand its market presence. With detailed box office data, the studio can make informed decisions about which types of films to produce, potentially increasing both profitability and audience acclaim.

<img src="https://wallpapercave.com/wp/wp8021237.jpg" width="300" alt="Descriptive Text">


## Data Understanding

The datasets are structured data from well-known film databases, covering film genres, box office earnings, ratings, and audience feedback across several years. Each film has a unique identifier, which allows its performance to be tracked from release through international earnings. Together these sources support an in-depth analysis of market trends, audience preferences, and financial outcomes for different film types.
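
One quick way to check that each film really has a unique identifier is to test the key column for duplicates before joining anything. The sketch below uses a small made-up DataFrame: the column name `movie_id` matches the SQLite tables queried later, but the rows and the helper name `find_duplicate_keys` are purely illustrative.

```python
import pandas as pd


def find_duplicate_keys(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return every row whose `key` value appears more than once."""
    return df[df.duplicated(subset=key, keep=False)]


# Made-up rows for illustration only; not taken from the project data.
films = pd.DataFrame(
    {
        "movie_id": ["tt001", "tt002", "tt002", "tt003"],
        "primary_title": ["Film A", "Film B", "Film B (re-release)", "Film C"],
    }
)

# False here, because "tt002" appears twice in the toy data.
print(films["movie_id"].is_unique)

duplicates = find_duplicate_keys(films, "movie_id")
print(duplicates)
```

If a check like this turns up duplicates in a real table, they are worth resolving before any join, since each repeated key multiplies the rows in the joined result.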

In [1]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os

# Setting visualisation styles
sns.set(style="whitegrid")

bom_movie_gross = pd.read_csv(r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData\bom.movie_gross.csv.gz', compression='gzip')

# Path to the zip file
zip_file_path = r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData\im.db.zip'
# Directory where the db file will be extracted
extraction_path = r'C:\Users\neali\Documents\Flatiron\2\phase-2-project-v3\dsc-phase-2-project-v3\zippedData'

# Unzip the database file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the directory
    zip_ref.extractall(extraction_path)

# Assuming the database file is named 'im.db' and is the only file in the zip
db_file_path = os.path.join(extraction_path, 'im.db')

# Connect to the SQLite database
conn = sqlite3.connect(db_file_path)

# Now you can perform database operations
# Example: Listing all tables
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
for table in tables:
    print(table)


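# Join basic film details with their ratings.
# USING (movie_id) keeps a single movie_id column in the result,
# whereas SELECT * with an ON clause would return it twice.
query = """
SELECT *
FROM movie_basics
JOIN movie_ratings USING (movie_id);
"""
movie_data = pd.read_sql_query(query, conn)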

('movie_basics',)
('directors',)
('known_for',)
('movie_akas',)
('movie_ratings',)
('persons',)
('principals',)
('writers',)
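
The absolute Windows paths in the cell above only resolve on one machine. A possible alternative, sketched here under assumptions, builds paths with `pathlib`, extracts the database only if it is not already present, and returns table names as plain strings. The helper names `open_zipped_sqlite` and `list_tables` are hypothetical, and the demonstration runs on a throwaway archive rather than the real `im.db.zip`.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def open_zipped_sqlite(zip_path: Path, db_name: str = "im.db") -> sqlite3.Connection:
    """Extract `db_name` next to the archive (if missing) and connect to it."""
    db_path = zip_path.parent / db_name
    if not db_path.exists():
        with zipfile.ZipFile(zip_path) as archive:
            archive.extract(db_name, path=zip_path.parent)
    return sqlite3.connect(db_path)


def list_tables(conn: sqlite3.Connection) -> list:
    """Return table names as plain strings rather than 1-tuples."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
    return [name for (name,) in rows]


# Demonstration on a throwaway archive so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "zippedData"  # stand-in for the project's data folder
    data_dir.mkdir()

    # Build a tiny database and zip it, mimicking im.db.zip.
    scratch_db = Path(tmp) / "im.db"
    scratch = sqlite3.connect(scratch_db)
    scratch.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
    scratch.commit()
    scratch.close()
    with zipfile.ZipFile(data_dir / "im.db.zip", "w") as archive:
        archive.write(scratch_db, arcname="im.db")

    demo_conn = open_zipped_sqlite(data_dir / "im.db.zip")
    tables = list_tables(demo_conn)
    print(tables)
    demo_conn.close()
```

In the project itself, `data_dir` would point at the repository's `zippedData` folder, for example relative to the notebook's working directory, so the same cell runs unchanged on another machine.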


In [None]:
# Column names and types for movie_basics
query_basics = "PRAGMA table_info(movie_basics);"
basics_info = pd.read_sql_query(query_basics, conn)
print("Columns in movie_basics:")
print(basics_info[['name', 'type']])

# Column names and types for movie_ratings
query_ratings = "PRAGMA table_info(movie_ratings);"
ratings_info = pd.read_sql_query(query_ratings, conn)
print("Columns in movie_ratings:")
print(ratings_info[['name', 'type']])

# Column names and types for movie_akas
query_akas = "PRAGMA table_info(movie_akas);"
akas_info = pd.read_sql_query(query_akas, conn)
print("Columns in movie_akas:")
print(akas_info[['name', 'type']])
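
The three near-identical blocks above could also be folded into a loop. The following is a minimal sketch, assuming a small helper (here called `table_columns`, a hypothetical name) and a toy in-memory database whose column types are invented for the example.

```python
import sqlite3

import pandas as pd


def table_columns(conn: sqlite3.Connection, table: str) -> pd.DataFrame:
    """Return the column names and declared types of one table."""
    # The table name is interpolated into the statement, so only pass
    # names you trust (for example, names read from sqlite_master).
    info = pd.read_sql_query(f"PRAGMA table_info({table});", conn)
    return info[["name", "type"]]


# Toy in-memory database for illustration; the column types are assumptions.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER)"
)
demo_conn.execute("CREATE TABLE movie_akas (movie_id TEXT, title TEXT, region TEXT)")

schemas = {}
for table in ["movie_ratings", "movie_akas"]:
    schemas[table] = table_columns(demo_conn, table)
    print(f"Columns in {table}:")
    print(schemas[table])
```

Keeping the results in a dictionary keyed by table name means each schema stays available for later inspection instead of being overwritten by the next query.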

In [22]:
# Combine film, director, character, and rating data
sql_query2 = """
SELECT
    mb.primary_title,
    p.primary_name,
    pr.characters,
    mr.averagerating,
    mr.numvotes
FROM movie_basics mb
JOIN directors d ON mb.movie_id = d.movie_id
JOIN persons p ON d.person_id = p.person_id       -- director's name
JOIN principals pr ON p.person_id = pr.person_id  -- matches on person only, not on film
JOIN movie_ratings mr ON mb.movie_id = mr.movie_id
GROUP BY mb.movie_id, p.primary_name
ORDER BY mb.movie_id ASC
LIMIT 10;
"""


# Run the query on the open SQLite connection `conn`
result = pd.read_sql_query(sql_query2, conn)

# Drop duplicates based on 'primary_title' and 'averagerating'
unique_ratings = result.drop_duplicates(subset=['primary_title', 'averagerating'])

# Sort the DataFrame by 'numvotes' in descending order
sorted_unique_ratings = unique_ratings.sort_values(by='numvotes', ascending=False)

# Keep at most 50 rows; the query's LIMIT 10 caps the result at 10 rows
top_20_unique_ratings = sorted_unique_ratings.head(50)

# Display the selected rows
top_20_unique_ratings









| | primary_title | primary_name | characters | averagerating | numvotes |
|---|---|---|---|---|---|
| 2 | The Other Side of the Wind | Orson Welles | | 6.9 | 4517 |
| 8 | Pál Adrienn | Ágnes Kocsis | | 6.8 | 451 |
| 7 | Joe Finds Grace | Anthony Harrison | ["Joseph Briteman"] | 8.1 | 263 |
| 4 | The Wandering Soap Opera | Raoul Ruiz | | 6.5 | 119 |
| 0 | Sunghursh | Harnam Singh Rawail | | 7.0 | 77 |
| 9 | So Much for Justice! | Miklós Jancsó | | 4.6 | 64 |
| 1 | One Day Before the Rainy Season | Mani Kaul | | 7.2 | 43 |
| 6 | Bigfoot | Mc Jones | | 4.1 | 32 |
| 3 | Sabse Bada Sukh | Hrishikesh Mukherjee | | 6.1 | 13 |
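
In the combined query, `principals` is matched on `person_id` alone, so a director's row can pick up `characters` from any film that person is credited in, and the `GROUP BY` leaves SQLite free to return any one of those values. A stricter variant is sketched below on a toy in-memory database. It assumes that `principals` also has a `movie_id` column, which the output shown here does not confirm, and all rows are made up.

```python
import sqlite3

import pandas as pd

# Toy schema mirroring the column names used in the query; all rows are invented.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
    CREATE TABLE directors (movie_id TEXT, person_id TEXT);
    CREATE TABLE persons (person_id TEXT, primary_name TEXT);
    CREATE TABLE principals (movie_id TEXT, person_id TEXT, characters TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);

    INSERT INTO movie_basics VALUES ('tt1', 'Film A'), ('tt2', 'Film B');
    INSERT INTO directors VALUES ('tt1', 'nm1'), ('tt2', 'nm1');
    INSERT INTO persons VALUES ('nm1', 'Jane Doe');
    INSERT INTO principals VALUES ('tt1', 'nm1', NULL), ('tt2', 'nm1', '["Narrator"]');
    INSERT INTO movie_ratings VALUES ('tt1', 7.0, 100), ('tt2', 6.0, 50);
    """
)

strict_query = """
SELECT DISTINCT
    mb.primary_title,
    p.primary_name,
    pr.characters,
    mr.averagerating,
    mr.numvotes
FROM movie_basics mb
JOIN directors d ON mb.movie_id = d.movie_id
JOIN persons p ON d.person_id = p.person_id
-- Match the credit on both person and film (assumes principals.movie_id exists);
-- LEFT JOIN keeps directors who have no principals row for that film.
LEFT JOIN principals pr
    ON pr.person_id = d.person_id AND pr.movie_id = mb.movie_id
JOIN movie_ratings mr ON mb.movie_id = mr.movie_id
ORDER BY mr.numvotes DESC;
"""

strict = pd.read_sql_query(strict_query, demo_conn)
print(strict)
```

With the extra join condition, the `characters` value on each row belongs to that film, so "Film A" stays empty instead of borrowing the role from "Film B". `DISTINCT` replaces the `GROUP BY` so that no non-aggregated column is picked arbitrarily.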
