# Exploratory Notebook

In [2]:
import pandas as pd
import sqlite3 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 🎬 Exploratory Data Analysis (EDA): Box Office Movie Insights

## 📌 Objective
Our goal is to explore what types of films are currently performing best at the box office, in order to guide our company’s new movie studio toward data-driven content creation.

---

## 1. 🗂️ Data Overview

- Load the dataset
- Preview the first few rows
- Check shape, column names, and data types
- Identify missing values and duplicates

```python
df.shape
df.columns
df.info()
df.head()
df.isnull().sum()
df.duplicated().sum()
```

> I will start with the two datasets I already have and pick the relevant data from them to begin my analysis.

### BOM Dataset EDA
Let us start by taking a first look at the Box Office Movies dataset to see what it contains.


In [131]:
df1 = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
df1.info(), df1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


(None,
                                          title studio  domestic_gross  \
 0                                  Toy Story 3     BV     415000000.0   
 1                   Alice in Wonderland (2010)     BV     334200000.0   
 2  Harry Potter and the Deathly Hallows Part 1     WB     296000000.0   
 3                                    Inception     WB     292600000.0   
 4                          Shrek Forever After   P/DW     238700000.0   
 
   foreign_gross  year  
 0     652000000  2010  
 1     691300000  2010  
 2     664300000  2010  
 3     535700000  2010  
 4     513900000  2010  )

In [115]:
# looking into df1
print(df1.info())
print("Shape of the dataset")
print(df1.shape)
print()
print("The columns on the dataset:\n", df1.columns)
print()
print("The sum of missing values", df1.isnull().sum())
print()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
None
Shape of the dataset
(3387, 5)

The columns on the dataset:
 Index(['title', 'studio', 'domestic_gross', 'foreign_gross', 'year'], dtype='object')

The sum of missing values title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64



###  Data Cleaning: Box Office Movies (BOM)

Before analysis, we first explore and clean the dataset to ensure it's ready for univariate, bivariate, and multivariate analyses.

---

####  Missing Values Summary

| Column          | Missing Values |
|-----------------|----------------|
| `title`         | 0              |
| `studio`        | 5              |
| `domestic_gross`| 28             |
| `foreign_gross` | 1350           |
| `year`          | 0              |
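
To put these counts in context, it can help to express them as a share of all rows. The following is a minimal sketch, not part of the original notebook: the helper name `missing_summary` and the small stand-in frame `sample` are assumptions made for illustration, and the real notebook would pass `df1` instead.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Hypothetical stand-in with the same columns as df1.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "WB"],
    "domestic_gross": [1.0, 2.0, np.nan, 4.0],
    "foreign_gross": ["1,000", None, None, "500"],
    "year": [2010, 2011, 2012, 2013],
})

summary = missing_summary(sample)
print(summary)
```

For the counts in the table above, 1350 missing `foreign_gross` values out of 3387 rows is roughly 40% of the dataset, which is why that column needs more care than `studio` or `domestic_gross`.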

---
#### Cleaning Steps

- `studio`: Fill missing values with `"Unknown"`  
- `domestic_gross`: Drop rows with missing values (only 28 rows)  
- `foreign_gross`: Convert from string to numeric (remove commas), then handle missing as needed  
- Add new columns:
  - `worldwide_gross` = domestic + foreign
  - `proportion_foreign` = foreign / worldwide
- Drop duplicates if any
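
The steps above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the notebook's actual cleaning code: the function name `clean_bom` and the stand-in frame `raw` are hypothetical, and treating a missing `foreign_gross` as 0 when computing `worldwide_gross` is one possible reading of "handle missing as needed".

```python
import numpy as np
import pandas as pd


def clean_bom(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a BOM-style frame."""
    # domestic_gross: drop rows with missing values
    out = df.dropna(subset=["domestic_gross"]).copy()
    # studio: fill missing values with "Unknown"
    out["studio"] = out["studio"].fillna("Unknown")
    # foreign_gross: remove commas, then convert the strings to numbers
    out["foreign_gross"] = pd.to_numeric(
        out["foreign_gross"].str.replace(",", "", regex=False),
        errors="coerce",
    )
    # Assumption: a missing foreign gross counts as 0 in the worldwide total
    out["worldwide_gross"] = out["domestic_gross"] + out["foreign_gross"].fillna(0)
    # Stays NaN where foreign_gross is missing
    out["proportion_foreign"] = out["foreign_gross"] / out["worldwide_gross"]
    # Drop exact duplicate rows, if any
    return out.drop_duplicates().reset_index(drop=True)


# Hypothetical stand-in for df1, including one duplicate row.
raw = pd.DataFrame({
    "title": ["A", "B", "C", "A"],
    "studio": ["BV", None, "WB", "BV"],
    "domestic_gross": [100.0, 50.0, np.nan, 100.0],
    "foreign_gross": ["1,300", None, "10", "1,300"],
    "year": [2010, 2011, 2012, 2010],
})

cleaned = clean_bom(raw)
print(cleaned)
```

Keeping `proportion_foreign` as NaN where the foreign figure is unknown avoids reporting a 0% foreign share for films that simply have no recorded value.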

---


In [125]:
df1['title']

0                                       Toy Story 3
1                        Alice in Wonderland (2010)
2       Harry Potter and the Deathly Hallows Part 1
3                                         Inception
4                               Shrek Forever After
                           ...                     
3382                                      The Quake
3383                    Edward II (2018 re-release)
3384                                       El Pacto
3385                                       The Swan
3386                              An Actor Prepares
Name: title, Length: 3387, dtype: object

### 🎬 IMDB Database Analysis

We have an ERD (Entity Relationship Diagram) that shows the schema of our IMDB movie database. This gives us a bird’s eye view of how the data tables are related, helping us make sense of what we are working with and which parts of the database will be most useful in answering business questions.

Below is the ERD for the IMDB dataset:

![IMDB ERD Diagram](movie_data_erd.jpeg)  <!-- Adjust the path as needed -->

---

### 🧭 Next Steps for Analysis

We will follow a structured process to explore and extract value from the database:

#### 🔍 1. **Explore the Tables**
- Load all tables into Pandas DataFrames
- Inspect the shape, columns, and sample data from each

#### 🧹 2. **Clean the Data**
- Handle missing values
- Standardize data types (e.g., dates, numeric fields)
- Remove duplicates
- Engineer new features where helpful (e.g., gross revenue, genre flags)

#### 🧱 3. **Understand Relationships**
- Identify primary and foreign keys
- Decide which joins will allow us to combine useful information (e.g., `movie_basics` + `movie_ratings`)

#### 📊 4. **Perform EDA**
- Univariate: Analyze individual features like genres, years, ratings
- Bivariate: Study relationships like genre vs rating, year vs vote count
- Multivariate: Combine features to reveal deeper patterns (e.g., genre + year + rating)

#### 📈 5. **Visualize Insights**
- Use bar plots, histograms, heatmaps, and scatter plots
- Highlight patterns and trends that are directly relevant to decision-making

#### 🧠 6. **Make Business Recommendations**
- Based on data, suggest what kind of content the company should invest in
- Consider genre performance, ideal release years, director patterns, etc.

---

Let’s begin by exploring and loading all tables from the database!


In [66]:
# Open the im.db SQLite database and list its tables

conn = sqlite3.connect('zippedData/im.db/im.db')
cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
tables

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]
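
Since the plan is to load all tables into Pandas DataFrames, the table names returned by `sqlite_master` can drive a single loop instead of one query per table. The sketch below is an assumption-laden illustration: `load_all_tables` is a hypothetical helper, and a small in-memory database stands in for `im.db`.

```python
import sqlite3

import pandas as pd


def load_all_tables(conn: sqlite3.Connection) -> dict:
    """Read every table in the database into a dict of DataFrames."""
    names = pd.read_sql(
        "SELECT name FROM sqlite_master WHERE type='table';", conn
    )["name"]
    # Names come from sqlite_master itself, so quoting them is sufficient here.
    return {name: pd.read_sql(f'SELECT * FROM "{name}"', conn) for name in names}


# Hypothetical in-memory database standing in for im.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")
demo_conn.execute("INSERT INTO movie_basics VALUES ('tt1', 'Demo')")
demo_conn.commit()

frames = load_all_tables(demo_conn)
for name, frame in frames.items():
    print(name, frame.shape)
```

With the real database, loading everything at once holds all eight tables in memory, so it may be preferable to load only the tables needed for a given question.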

In [144]:
# Loading movie_basics
q1 = """
SELECT * 
FROM movie_basics
"""
df_movie_basics = pd.read_sql(q1, conn)
df_movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [146]:
# Preview the movie_basics DataFrame
df_movie_basics.head()

|   | movie_id  | primary_title                   | original_title             | start_year | runtime_minutes | genres                     |
|---|-----------|---------------------------------|----------------------------|------------|-----------------|----------------------------|
| 0 | tt0063540 | Sunghursh                       | Sunghursh                  | 2013       | 175.0           | Action,Crime,Drama         |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din            | 2019       | 114.0           | Biography,Drama            |
| 2 | tt0069049 | The Other Side of the Wind      | The Other Side of the Wind | 2018       | 122.0           | Drama                      |
| 3 | tt0069204 | Sabse Bada Sukh                 | Sabse Bada Sukh            | 2018       | NaN             | Comedy,Drama               |
| 4 | tt0100275 | The Wandering Soap Opera        | La Telenovela Errante      | 2017       | 80.0            | Comedy,Drama,Fantasy       |
| 5 | tt0111414 | A Thin Life                     | A Thin Life                | 2018       | 75.0            | Comedy                     |
| 6 | tt0112502 | Bigfoot                         | Bigfoot                    | 2017       | NaN             | Horror,Thriller            |
| 7 | tt0137204 | Joe Finds Grace                 | Joe Finds Grace            | 2017       | 83.0            | Adventure,Animation,Comedy |
| 8 | tt0139613 | O Silêncio                      | O Silêncio                 | 2012       | NaN             | Documentary,History        |
| 9 | tt0144449 | Nema aviona za Zagreb           | Nema aviona za Zagreb      | 2012       | 82.0            | Biography                  |


In [137]:
# Loading the directors table
q2 = """
SELECT * 
FROM directors
"""
df_directors = pd.read_sql(q2, conn)
df_directors.info(), df_directors.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291174 entries, 0 to 291173
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   movie_id   291174 non-null  object
 1   person_id  291174 non-null  object
dtypes: object(2)
memory usage: 4.4+ MB


(None,
     movie_id  person_id
 0  tt0285252  nm0899854
 1  tt0462036  nm1940585
 2  tt0835418  nm0151540
 3  tt0835418  nm0151540
 4  tt0878654  nm0089502)
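
Rows 2 and 3 of the preview above are identical, which suggests the `directors` table contains repeated `(movie_id, person_id)` pairs. A minimal sketch of checking for and removing them, using only the five preview rows as a stand-in (`directors_sample` is a hypothetical name; the notebook itself would use `df_directors`):

```python
import pandas as pd

# Stand-in built from the five preview rows shown in the output above.
directors_sample = pd.DataFrame({
    "movie_id": ["tt0285252", "tt0462036", "tt0835418", "tt0835418", "tt0878654"],
    "person_id": ["nm0899854", "nm1940585", "nm0151540", "nm0151540", "nm0089502"],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = directors_sample.duplicated().sum()

# Keep the first occurrence of each (movie_id, person_id) pair.
directors_unique = directors_sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(directors_unique))
```

Removing such repeats before joining matters because each duplicate would otherwise multiply the matching rows in any join on `movie_id`.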

In [72]:
# Loading the known_for table
q3 = """
SELECT * 
FROM known_for
"""
df_4 = pd.read_sql(q3, conn)
df_4.info(), df_4.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638260 entries, 0 to 1638259
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   person_id  1638260 non-null  object
 1   movie_id   1638260 non-null  object
dtypes: object(2)
memory usage: 25.0+ MB


(None,
    person_id   movie_id
 0  nm0061671  tt0837562
 1  nm0061671  tt2398241
 2  nm0061671  tt0844471
 3  nm0061671  tt0118553
 4  nm0061865  tt0896534)

In [81]:
# Loading the movie_ratings table
q4 = """
SELECT * 
FROM movie_ratings
"""
df_5 = pd.read_sql(q4, conn)
df_5.info(), df_5.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


(None,
        averagerating      numvotes
 count   73856.000000  7.385600e+04
 mean        6.332729  3.523662e+03
 std         1.474978  3.029402e+04
 min         1.000000  5.000000e+00
 25%         5.500000  1.400000e+01
 50%         6.500000  4.900000e+01
 75%         7.400000  2.820000e+02
 max        10.000000  1.841066e+06)
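
The summary above shows that `numvotes` is heavily skewed: the median is 49 votes while the maximum is about 1.84 million, so many average ratings rest on very few votes. One possible way to combine `movie_basics` with `movie_ratings`, filter out thinly rated titles, and compare ratings per genre is sketched below. Everything here is illustrative: the in-memory database, the sample rows, and the `MIN_VOTES` threshold are assumptions, not values taken from the notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for im.db.
join_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["Alpha", "Beta", "Gamma"],
    "genres": ["Action,Drama", "Drama", None],
}).to_sql("movie_basics", join_conn, index=False)
pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "averagerating": [8.0, 6.0, 9.5],
    "numvotes": [1000, 500, 5],
}).to_sql("movie_ratings", join_conn, index=False)

MIN_VOTES = 100  # assumed cut-off; tune it against the real distribution

query = """
SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
FROM movie_basics AS b
JOIN movie_ratings AS r ON b.movie_id = r.movie_id
WHERE r.numvotes >= ?
"""
rated = pd.read_sql(query, join_conn, params=(MIN_VOTES,))

# Split the comma-separated genres so each (movie, genre) pair gets its own row.
by_genre = (
    rated.dropna(subset=["genres"])
    .assign(genre=lambda d: d["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")["averagerating"]
    .mean()
)
print(by_genre)
```

A film with several genres contributes to each of them in this layout, which is worth keeping in mind when reading per-genre averages.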