* Student Name: Daniel Mwaka
* Student Pace: DSF-FT12-Hybrid
* Instructor Name: Samuel Karu

# Box Office Performance Analysis for New Movie Studio

## Introduction 

The growing adoption of internet-hosted media-sharing platforms exposes audiences to a diverse and dense range of entertainment alternatives. This is reflected in the rising number of companies entering the video streaming sector. Additionally, long-form video content increasingly faces stiff competition from short-form video content on social media sites such as TikTok. Although venturing into movie production is a potentially profitable portfolio diversification strategy, data-driven decision-making is vital in orienting the company toward producing captivating, engaging films that appeal to specific target market segments. This project examines these factors systematically using a data-driven approach.

## Problem Statement

The company plans to diversify its portfolio by launching a new movie production division. Designing and setting up a new studio, sourcing talent, and covering its operating expenses is a costly endeavour. To ensure that the studio produces profitable movies, the company seeks data-driven insights to support appropriate corporate decisions.

## Analysis Focus

The project investigates how runtime minutes, genre, and () relate to the grossing of films in the market.

## Objectives

<strong> 1: Understanding the Dataset </strong>

* <strong> Goal: </strong> Gain an in-depth understanding of the datasets.

* <strong> Tasks: </strong>
    * Review shape, columns, and data types.
    * Drop unnecessary columns/fields.
    * Clean the data (remove duplicates and handle missing values).

<strong> 2: Industry Background </strong>

* <strong> Goal: </strong> Comprehend trends in the film industry and triangulate potential predictor variables for a film's total grossing.  

* <strong> Tasks: </strong>
    
    *
    
    *

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import math
%matplotlib inline


In [2]:
# Load the data from the .csv file as a DataFrame and display first five rows
movie_gross_data = pd.read_csv('/home/mwakad/Desktop/box-office-movie-insights/zipped-data/bom.movie_gross.csv')
movie_gross_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [3]:
# Check the DataFrame's shape
print(f"DataFrame consists of {movie_gross_data.shape[0]} rows")
print(f"DataFrame consists of {movie_gross_data.shape[1]} columns")

DataFrame consists of 3387 rows
DataFrame consists of 5 columns


In [4]:
# Check column attributes
movie_gross_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


Several rows have missing values in the `studio`, `domestic_gross`, and `foreign_gross` columns.
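To quantify the gaps, the non-null counts printed by `info()` above can be turned into missing counts and percentages. The sketch below hard-codes those printed counts (the dictionary name `non_null` is introduced here for illustration); on the actual DataFrame, `movie_gross_data.isna().sum()` should give the same counts directly.

```python
# Non-null counts copied from the info() output above
total_rows = 3387
non_null = {
    "title": 3387,
    "studio": 3382,
    "domestic_gross": 3359,
    "foreign_gross": 2037,
    "year": 3387,
}

# Missing count per column = total rows minus non-null entries
missing = {col: total_rows - count for col, count in non_null.items()}

# Share of rows missing per column, as a percentage
missing_pct = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```

The `foreign_gross` column stands out: roughly two in five rows have no value, which matters when choosing an imputation strategy later.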

In [5]:
# Create a copy of the data DataFrame to perform data cleaning
data = movie_gross_data.copy()

In [6]:
# Check unique values for the `studio` column
data['studio'].value_counts()

IFC       166
Uni.      147
WB        140
Fox       136
Magn.     136
         ... 
Mon         1
Swen        1
BM&DH       1
PalUni      1
IVP         1
Name: studio, Length: 257, dtype: int64

In [7]:
# Drop row entries with missing values for the 'studio' column
data = data.dropna(subset=['studio'])

In [8]:
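# Convert the year to a datetime object; the explicit format prevents integer
# years from being read as nanosecond offsets from the Unix epoch
data['year'] = pd.to_datetime(data['year'].astype(str), format='%Y')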

In [9]:
data.dtypes

title                     object
studio                    object
domestic_gross           float64
foreign_gross             object
year              datetime64[ns]
dtype: object
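A note on the year conversion: `pd.to_datetime` treats bare integers as offsets from the Unix epoch (nanoseconds by default), so a value such as 2010 lands in 1970 unless a format is supplied. The resulting dtype is `datetime64[ns]` either way, so the dtype check alone does not reveal the problem. The sketch below uses a small hypothetical Series (`years` is not read from the dataset file) to contrast the two calls.

```python
import pandas as pd

# Hypothetical sample of integer release years, for illustration only
years = pd.Series([2010, 2011, 2018])

# Without a format, integers are read as nanoseconds since 1970-01-01
naive = pd.to_datetime(years)

# With an explicit format, each value is parsed as a calendar year
parsed = pd.to_datetime(years.astype(str), format='%Y')

print(naive.dt.year.tolist())
print(parsed.dt.year.tolist())
```

Inspecting `data['year'].dt.year.unique()` after the conversion is a quick way to confirm the years survived intact.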

In [10]:
# convert the foreign_gross from object to float64

# Remove commas
data['foreign_gross'] = data['foreign_gross'].astype(str).str.replace(',', '') 

# Convert to numeric (float64)
data['foreign_gross'] = pd.to_numeric(data['foreign_gross'], errors='coerce')  

In [11]:
# Confirm the columns are in the appropriate datatype
data.dtypes

title                     object
studio                    object
domestic_gross           float64
foreign_gross            float64
year              datetime64[ns]
dtype: object
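The comma-stripping step can be checked in isolation. The sketch below runs the same two operations on a hypothetical sample (the values in `sample`, including the comma-formatted entry and the `"n/a"` placeholder, are assumptions made for illustration and are not taken from the file).

```python
import pandas as pd

# Hypothetical raw values: a plain number, a comma-formatted number,
# a missing entry, and a non-numeric placeholder
sample = pd.Series(["652000000", "1,131.6", None, "n/a"], dtype="object")

# Strip thousands separators as literal text, not as a regex pattern
no_commas = sample.str.replace(",", "", regex=False)

# Anything that still cannot be parsed becomes NaN instead of raising
cleaned = pd.to_numeric(no_commas, errors="coerce")

print(cleaned)
```

Because `errors='coerce'` silently turns unparseable strings into `NaN`, comparing the non-null count before and after the conversion is a reasonable safeguard against losing values unnoticed.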

In [12]:
# Check shape
print(f"DataFrame consists of {data.shape[0]} rows")
print(f"DataFrame consists of {data.shape[1]} columns")

DataFrame consists of 3382 rows
DataFrame consists of 5 columns


In [13]:
# Check column attributes
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3382 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   title           3382 non-null   object        
 1   studio          3382 non-null   object        
 2   domestic_gross  3356 non-null   float64       
 3   foreign_gross   2033 non-null   float64       
 4   year            3382 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 158.5+ KB


In [14]:
# Compute descriptive statistics for columns with numerical values
data.describe()

Unnamed: 0,domestic_gross,foreign_gross
count,3356.0,2033.0
mean,28771490.0,74954900.0
std,67006940.0,137514500.0
min,100.0,600.0
25%,120000.0,3700000.0
50%,1400000.0,18700000.0
75%,27950000.0,74900000.0
max,936700000.0,960500000.0


For `domestic_gross`, the mean (about 28.8 million) is far higher than the median (1.4 million) and the standard deviation (about 67.0 million) is large, which indicates a right-skewed distribution. Similarly, for `foreign_gross` the mean (about 75.0 million) exceeds the median (18.7 million) and the standard deviation is large, again indicating right skewness.
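One rough way to put a number on this is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, which is positive for right-skewed data. The sketch below applies it to the rounded figures shown in the `describe()` output; the helper name `pearson_median_skewness` is introduced here for illustration, and this is a summary-statistic approximation rather than the moment-based skewness that `Series.skew()` computes.

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

# Summary statistics as displayed by data.describe() above
domestic_skew = pearson_median_skewness(28771490.0, 1400000.0, 67006940.0)
foreign_skew = pearson_median_skewness(74954900.0, 18700000.0, 137514500.0)

print(f"domestic_gross: {domestic_skew:.2f}")
print(f"foreign_gross:  {foreign_skew:.2f}")
```

Both coefficients come out clearly positive, consistent with a long right tail of a few very high-grossing films.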

In [15]:
# Both columns are right-skewed, so impute missing values with each column's
# median, which is less sensitive to extreme values than the mean.

# Calculate the medians
domestic_gross_median = data['domestic_gross'].median()
foreign_gross_median = data['foreign_gross'].median()

# Impute missing values with medians; assign the result back instead of calling
# fillna(inplace=True) on a column selection, which newer pandas versions warn about
data['domestic_gross'] = data['domestic_gross'].fillna(domestic_gross_median)
data['foreign_gross'] = data['foreign_gross'].fillna(foreign_gross_median)
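The effect of median imputation can be seen on a toy example. The sketch below uses a hypothetical five-row frame (`sample` and its values are assumptions made for illustration) and assigns the filled column back to the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with two missing entries
sample = pd.DataFrame(
    {"domestic_gross": [100.0, 1_400_000.0, np.nan, 415_000_000.0, np.nan]}
)

# median() and mean() skip NaN by default
median_before = sample["domestic_gross"].median()
mean_before = sample["domestic_gross"].mean()

# Fill the gaps with the median and assign the result back to the column
sample["domestic_gross"] = sample["domestic_gross"].fillna(median_before)

median_after = sample["domestic_gross"].median()
mean_after = sample["domestic_gross"].mean()

print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
```

The median is unchanged while the mean is pulled toward it. With roughly 40% of `foreign_gross` values imputed in the real data, the same effect will compress that column's spread, so it may be worth keeping a flag column marking which rows were imputed.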

In [16]:
# Check shape after cleaning
print(f"DataFrame consists of {data.shape[0]} rows")
print(f"DataFrame consists of {data.shape[1]} columns")

DataFrame consists of 3382 rows
DataFrame consists of 5 columns
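The objectives also list duplicate removal as a cleaning task. A minimal sketch of such a check is shown below on a hypothetical frame. The first two rows reuse titles from the `head()` output, the third row is a repeat added for illustration, and treating `title` plus `year` as the identifying key is an assumption.

```python
import pandas as pd

# Hypothetical sample: the last row deliberately repeats the second one
sample = pd.DataFrame(
    {
        "title": ["Toy Story 3", "Inception", "Inception"],
        "studio": ["BV", "WB", "WB"],
        "year": [2010, 2010, 2010],
    }
)

# Count rows that repeat an earlier (title, year) combination
n_duplicates = int(sample.duplicated(subset=["title", "year"]).sum())

# Keep the first occurrence of each (title, year) pair
deduped = sample.drop_duplicates(subset=["title", "year"], keep="first")

print(f"Duplicate rows found: {n_duplicates}")
print(f"Rows after de-duplication: {len(deduped)}")
```

On the real DataFrame, the same two calls on `data` would show whether any films appear more than once before the analysis proceeds.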
