* Student Name: Daniel Mwaka
* Student Pace: DSF-FT12-Hybrid
* Instructor Name: Samuel Karu

# Box Office Performance Analysis for New Movie Studio

## Introduction 

The growing adoption of internet-hosted media-sharing platforms exposes audiences to a diverse and dense range of entertainment alternatives, as reflected in the rising number of companies entering the video streaming sector. Additionally, long-form video content increasingly faces stiff competition from short-form video on social media sites such as TikTok. Although venturing into movie production is a potentially profitable portfolio diversification strategy, data-driven decision making is vital in orienting the company toward producing captivating, engaging, and appealing films for its target market segments. This project examines these factors systematically using a data-driven approach.

# Problem Statement

The company plans to diversify its portfolio by launching a new division for movie production. Designing and setting up a new studio, sourcing talent, and covering its operational expenses is a costly endeavour. To ensure that the studio produces profitable movies, the company seeks data-driven insights to support appropriate corporate decisions.

## Analysis Focus

The project investigates how runtime minutes, genre, and () relate to the gross earnings of films in the market.

# Objectives

<strong> 1: Understanding the Dataset </strong>

* <strong> Goal: </strong> Gain an in-depth understanding of the datasets.

* <strong> Tasks: </strong>
    * Review shape, columns, and data types.
    * Drop unnecessary columns/fields.
    * Clean the data (remove duplicates and handle missing values).

<strong> 2: Industry Background </strong>

* <strong> Goal: </strong> Comprehend trends in the film industry and triangulate potential predictor variables for a film's total grossing.  

* <strong> Tasks: </strong>
    
    *
    
    *

In [20]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import math
%matplotlib inline


In [21]:
# Load the data from the .csv file as a DataFrame and display first five rows
movie_gross_data = pd.read_csv('/home/mwakad/Desktop/box-office-movie-insights/zipped-data/bom.movie_gross.csv')
movie_gross_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [22]:
# Check the DataFrame's shape
movie_gross_data.shape
print(f"DataFrame consists of {movie_gross_data.shape[0]} rows")
print(f"DataFrame consists of {movie_gross_data.shape[1]} columns")

DataFrame consists of 3387 rows
DataFrame consists of 5 columns


In [23]:
# Check column attributes
movie_gross_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


Several rows have missing values in the `studio`, `domestic_gross`, and `foreign_gross` columns.
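The missing entries can also be counted directly rather than read off the `info()` output. The following is a minimal sketch on a made-up frame; `toy` and `missing_summary` are hypothetical names and the values are illustrative only, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_gross_data (values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "Fox"],
    "domestic_gross": [1.0e6, np.nan, 3.0e6, np.nan],
    "foreign_gross": ["2,000,000", None, None, None],
})

# Count and percentage of missing entries per column
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})
print(missing_summary)
```

Seeing the share of missing values per column helps when choosing between dropping rows and imputing them.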

In [24]:
# Create a copy of the data DataFrame to perform data cleaning
data = movie_gross_data.copy()

In [25]:
# Check unique values for the `studio` column
data['studio'].value_counts()

IFC          166
Uni.         147
WB           140
Fox          136
Magn.        136
            ... 
IW             1
JS             1
Trafalgar      1
Triu           1
Viv.           1
Name: studio, Length: 257, dtype: int64

In [26]:
# Drop row entries with missing values for the 'studio' column
data = data.dropna(subset=['studio'])

In [27]:
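# Convert the year to a datetime object; format='%Y' reads each value as a calendar year
# (without a format, integers are interpreted as nanoseconds since 1970-01-01)
data['year'] = pd.to_datetime(data['year'].astype(str), format='%Y')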

In [28]:
data.dtypes

title                     object
studio                    object
domestic_gross           float64
foreign_gross             object
year              datetime64[ns]
dtype: object
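Integer years need some care with `pd.to_datetime`, because the resulting dtype is `datetime64[ns]` whether or not the values were parsed as years. A minimal sketch on a toy Series (the names `years`, `naive`, and `parsed` are hypothetical) compares the two behaviours:

```python
import pandas as pd

years = pd.Series([2010, 2011, 2018])

# Without a format, integers are read as nanoseconds since 1970-01-01
naive = pd.to_datetime(years)

# Casting to string and passing format="%Y" parses each value as a calendar year
parsed = pd.to_datetime(years.astype(str), format="%Y")

print(naive.dt.year.tolist())
print(parsed.dt.year.tolist())
```

Checking `.dt.year` after the conversion is a quick way to confirm that the years survived.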

In [29]:
# convert the foreign_gross from object to float64

# Remove commas
data['foreign_gross'] = data['foreign_gross'].astype(str).str.replace(',', '') 

# Convert to numeric (float64)
data['foreign_gross'] = pd.to_numeric(data['foreign_gross'], errors='coerce')  

In [30]:
# Confirm the columns are in the appropriate datatype
data.dtypes

title                     object
studio                    object
domestic_gross           float64
foreign_gross            float64
year              datetime64[ns]
dtype: object
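The comma removal followed by numeric coercion can be checked on a small example. This is a sketch with made-up strings (`raw` and `cleaned` are hypothetical names), assuming the non-numeric entries should become `NaN`:

```python
import pandas as pd

# Illustrative strings: a plain number, a comma-formatted number, a missing value, a bad entry
raw = pd.Series(["652000000", "1,131.6", None, "n/a"])

# Strip thousands separators, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```

Because `errors="coerce"` silently turns bad entries into `NaN`, comparing the non-null counts before and after the conversion shows whether any values were lost.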

In [31]:
# Check shape 
data.shape
print(f"DataFrame consists of {data.shape[0]} rows")
print(f"DataFrame consists of {data.shape[1]} columns")

DataFrame consists of 3382 rows
DataFrame consists of 5 columns


In [32]:
# Check column attributes
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3382 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   title           3382 non-null   object        
 1   studio          3382 non-null   object        
 2   domestic_gross  3356 non-null   float64       
 3   foreign_gross   2033 non-null   float64       
 4   year            3382 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 158.5+ KB


In [33]:
# Compute descriptive statistics for columns with numerical values
data.describe()

Unnamed: 0,domestic_gross,foreign_gross
count,3356.0,2033.0
mean,28771490.0,74954900.0
std,67006940.0,137514500.0
min,100.0,600.0
25%,120000.0,3700000.0
50%,1400000.0,18700000.0
75%,27950000.0,74900000.0
max,936700000.0,960500000.0


In [34]:
# Impute missing values with each column's median, which is robust to the strong right skew
# seen in the descriptive statistics (both means lie far above the medians)

# Calculate the medians
domestic_gross_median = data['domestic_gross'].median()
foreign_gross_median = data['foreign_gross'].median()

# Impute missing values with medians (assign back instead of calling inplace on a column)
data['domestic_gross'] = data['domestic_gross'].fillna(domestic_gross_median)
data['foreign_gross'] = data['foreign_gross'].fillna(foreign_gross_median)
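The choice of the median over the mean can be illustrated on a small right-skewed sample. The sketch below uses made-up figures (`gross`, `mean_value`, `median_value`, and `filled` are hypothetical names) in which one blockbuster pulls the mean far above a typical value:

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed grosses with one missing entry
gross = pd.Series([1.0e5, 1.2e6, 1.4e6, 2.8e7, 9.0e8, np.nan])

mean_value = gross.mean()      # dominated by the single very large value
median_value = gross.median()  # stays close to a typical film

# Fill the gap with the median and assign the result back
filled = gross.fillna(median_value)
print(mean_value, median_value)
```

Here the mean is more than a hundred times the median, so a mean-imputed row would look like a hit film rather than a typical one.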

In [35]:
# Check shape after cleaning
data.shape
print(f"DataFrame consists of {data.shape[0]} rows")
print(f"DataFrame consists of {data.shape[1]} columns")

DataFrame consists of 3382 rows
DataFrame consists of 5 columns


## Data Cleaning on the SQLite Dataset

In [36]:
# Create a connection to the SQLite database
conn = sqlite3.connect('/home/mwakad/Desktop/box-office-movie-insights/zipped-data/im.db')
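Before loading individual tables, it can help to list what the database contains. The following is a minimal sketch that uses an in-memory database in place of `im.db`; `demo_conn` and the two table definitions are assumptions made for illustration, while `sqlite_master` is SQLite's built-in schema catalogue.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for im.db (schema is illustrative only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master lists every schema object; keep only the tables
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    demo_conn,
)
print(tables["name"].tolist())
demo_conn.close()
```

The same query run against the project connection would show which tables are available to join.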

In [37]:
# Load data from the movie_basics table
imdb_basics = pd.read_sql("SELECT * FROM movie_basics", conn)
imdb_basics.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [38]:
imdb_basics.shape
print(f"DataFrame consists of {imdb_basics.shape[0]} rows")
print(f"DataFrame consists of {imdb_basics.shape[1]} columns")

DataFrame consists of 146144 rows
DataFrame consists of 6 columns


In [39]:
# Create a copy of the data DataFrame to perform data cleaning
imdb_basics_cleaned = imdb_basics.copy()

In [40]:
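# Convert the year to a datetime object; format='%Y' reads each value as a calendar year
# (without a format, integers are interpreted as nanoseconds since 1970-01-01)
imdb_basics_cleaned['start_year'] = pd.to_datetime(imdb_basics_cleaned['start_year'].astype(str), format='%Y')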

In [41]:
# Drop rows with missing values in any of the non-numerical columns
imdb_basics_cleaned = imdb_basics_cleaned.dropna(
    subset=['primary_title', 'original_title', 'genres']
)

In [42]:
# Compute descriptive statistics for the runtime_minutes column
imdb_basics_cleaned.describe()

Unnamed: 0,runtime_minutes
count,112232.0
mean,86.261556
std,167.896646
min,1.0
25%,70.0
50%,87.0
75%,99.0
max,51420.0


The maximum runtime_minutes value of 51420 lies far above both the mean and the median. This anomaly necessitates a closer look at the column's unique values.

In [43]:
imdb_basics_cleaned['runtime_minutes'].value_counts()

90.0      7050
80.0      3460
85.0      2882
100.0     2635
95.0      2518
          ... 
221.0        1
321.0        1
319.0        1
410.0        1
2160.0       1
Name: runtime_minutes, Length: 361, dtype: int64

In [44]:
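# Keep runtimes of at most 240 minutes; rows with a missing runtime also fail this test and are dropped
imdb_basics_cleaned = imdb_basics_cleaned[imdb_basics_cleaned['runtime_minutes'] <= 240]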

In [45]:
imdb_basics_cleaned['runtime_minutes'].value_counts()

90.0     7050
80.0     3460
85.0     2882
100.0    2635
95.0     2518
         ... 
187.0       1
236.0       1
221.0       1
234.0       1
213.0       1
Name: runtime_minutes, Length: 239, dtype: int64

In [46]:
# Compute descriptive statistics for the runtime_minutes column
imdb_basics_cleaned.describe()

Unnamed: 0,runtime_minutes
count,112032.0
mean,84.681528
std,27.758424
min,1.0
25%,70.0
50%,87.0
75%,99.0
max,240.0


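Since the distribution of runtime_minutes is now only mildly skewed (mean is 84.68, median is 87), the mean is a reasonable value for imputing missing runtimes. Note that the `<= 240` filter also excluded rows with a missing runtime (the count of 112032 matches the row count after cleaning), so the imputation below leaves the data unchanged.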

In [47]:
# Imputing missing values with the mean
imdb_basics_cleaned['runtime_minutes'] = imdb_basics_cleaned['runtime_minutes'].fillna(imdb_basics_cleaned['runtime_minutes'].mean())
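A comparison such as `<= 240` evaluates to `False` for `NaN`, so an outlier filter can drop missing rows before they reach the imputation step. The sketch below, on a toy frame with hypothetical names (`runtimes`, `strict`, `lenient`), shows one way to keep the missing rows so that the mean can be filled in. It is an alternative under stated assumptions, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

runtimes = pd.DataFrame({"runtime_minutes": [90.0, 120.0, np.nan, 51420.0]})

# A plain comparison is False for NaN, so the missing row is dropped too
strict = runtimes[runtimes["runtime_minutes"] <= 240]

# Keeping NaN explicitly leaves the missing row available for imputation
mask = runtimes["runtime_minutes"].isna() | (runtimes["runtime_minutes"] <= 240)
lenient = runtimes[mask].copy()
lenient["runtime_minutes"] = lenient["runtime_minutes"].fillna(
    lenient["runtime_minutes"].mean()
)
print(len(strict), len(lenient))
```

With the lenient mask, the outlier is still removed, but the row with the missing runtime is retained and filled with the mean of the remaining values.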

In [48]:
# checking for duplicates
imdb_basics_cleaned.duplicated().sum()

0

In [49]:
# Check shape after cleaning
imdb_basics_cleaned.shape
print(f"DataFrame consists of {imdb_basics_cleaned.shape[0]} rows")
print(f"DataFrame consists of {imdb_basics_cleaned.shape[1]} columns")

DataFrame consists of 112032 rows
DataFrame consists of 6 columns


Loading the movie_ratings table

In [50]:
# Load data from the movie_ratings table
imdb_ratings = pd.read_sql("SELECT * FROM movie_ratings", conn)
imdb_ratings.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [51]:
imdb_ratings.shape
print(f"DataFrame consists of {imdb_ratings.shape[0]} rows")
print(f"DataFrame consists of {imdb_ratings.shape[1]} columns")

DataFrame consists of 73856 rows
DataFrame consists of 3 columns


In [52]:
# Check column attributes
imdb_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [53]:
# print the count of missing values 
imdb_ratings.isna().sum()

movie_id         0
averagerating    0
numvotes         0
dtype: int64

In [54]:
# checking for duplicates
imdb_ratings.duplicated().sum()

0

The imdb_ratings DataFrame has neither missing values nor duplicates, so it does not require data cleaning.

Merging

In [None]:
merge
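One possible way to combine the two IMDb tables is an inner join on `movie_id`, the column that both share. The sketch below uses toy frames with made-up rows (`basics`, `ratings`, and `merged` are hypothetical names); the join type and key are assumptions, since the merge is not specified in this notebook.

```python
import pandas as pd

# Toy stand-ins for imdb_basics_cleaned and imdb_ratings (rows are made up)
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["Alpha", "Beta", "Gamma"],
    "runtime_minutes": [90.0, 105.0, 87.0],
})
ratings = pd.DataFrame({
    "movie_id": ["tt2", "tt3", "tt4"],
    "averagerating": [6.4, 8.3, 4.2],
    "numvotes": [20, 31, 50352],
})

# The inner join keeps only films present in both tables;
# validate raises an error if movie_id is repeated on either side
merged = basics.merge(ratings, on="movie_id", how="inner", validate="one_to_one")
print(merged)
```

An inner join discards films without ratings; `how="left"` would keep them with `NaN` ratings instead.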