# Leveraging Data Analysis to Optimize Film Production

## Business Understanding: Introduction

The film industry relies on creativity, strategic decision-making, and talent. Recently, the industry has seen an increase in the availability of data and technology, opening new ways to optimize different types of film production.
This project aims to harness the power of data analysis to solve real-world problems faced by stakeholders in the film industry.

## Real-world Problem:

A major challenge for film production companies is uncertainty about whether a movie will succeed. Substantial resources, including time, money, and talent, are invested in creating a film, yet the outcome remains uncertain. This unpredictability can lead to significant financial risks and inefficiencies in the industry.

## Identifying Stakeholders and Their Usage:

Film production companies: They invest resources in developing and producing films. Data analysis gives them insight into market trends, audience preferences, and other factors that influence a film's success. These insights inform decisions during pre-production, casting, marketing, and distribution, increasing a film's chances of success.

Filmmakers and directors: Data analysis supports the creative process. Insights into audience demographics, genre preferences, and the narrative elements that resonate with viewers can help them craft compelling stories and characters.

Distributors and marketers: Using data analysis, they can identify target markets, develop effective marketing strategies, and optimize distribution channels. Understanding audience behavior and preferences enables them to run campaigns that reach the intended audience effectively.

Investors and financiers: They fund film projects. By analyzing past box office performance, market trends, and audience reception, they can estimate a film's potential profitability and make informed funding decisions, reducing financial risk and increasing the chances of a successful return on investment.

## Conclusion:

Leveraging data analysis in the film industry has the potential to revolutionize the way films are produced, marketed, and distributed. By addressing the real-world problem of uncertainty in film production, stakeholders such as production companies, filmmakers, distributors, marketers, investors, and financiers can make more informed decisions and optimize various aspects of the filmmaking process. This can lead to increased profitability, reduced financial risks, and the creation of more engaging and successful films, ultimately benefiting both the industry and the audience.



# Data Understanding:

## Data Sources and Suitability:
For this project, we utilized multiple data sources to gather information on the film industry. The selected data sources include:

Box Office Mojo: Box Office Mojo is a comprehensive database of box office information, providing data on film revenues, budgets, release dates, genres, and other financial aspects. This data source is suitable for the project as it enables us to analyze the financial success of films and identify patterns and trends.

IMDb: IMDb (Internet Movie Database) is a widely-used online database that offers information on films, including cast and crew details, user ratings, reviews, and other related data. IMDb provides valuable insights into audience opinions, preferences, and film-related information that can be utilized to understand audience behavior.

Rotten Tomatoes: Rotten Tomatoes is a popular review aggregator platform that collects professional and user reviews for films. It provides ratings, reviews, and audience scores, which help in assessing critical and popular reception of movies. This data source is beneficial for evaluating audience sentiment and perception of films.

TheMovieDB: TheMovieDB is an extensive online movie database that offers comprehensive information on films, including plot summaries, genres, release dates, cast and crew details, posters, and trailers. The data from TheMovieDB is useful for understanding film attributes, such as genres and release dates, which influence audience preferences.

The Numbers: The Numbers is a reliable source for box office data, providing detailed financial information on film revenues, budgets, production costs, and other relevant financial metrics. This data source is suitable for analyzing the financial performance of films and assessing their success.

## Size of Dataset and Descriptive Statistics:
The dataset utilized in the analysis consists of a large collection of films, covering a wide range of genres, release dates, and financial performance metrics. The exact size of the dataset may vary based on the specific scope of the analysis.

Descriptive statistics are calculated for the features used in the analysis, including revenue, budget, release dates, genres, ratings, and audience scores. These statistics provide insights into the central tendency, distribution, and variability of the data, enabling the identification of patterns and trends.
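
As a rough illustration of how dataset size and descriptive statistics might be obtained, the sketch below uses three rows copied from the `df_budgets` preview in this notebook. The helper name `to_number` is hypothetical, and the currency parsing assumes the `$1,234` text format seen in that preview.

```python
import pandas as pd

# Three rows copied from the df_budgets preview; a stand-in for the full table
sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix", "Following"],
    "production_budget": ["$425,000,000", "$350,000,000", "$6,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350", "$240,495"],
})

def to_number(series):
    """Hypothetical helper: strip '$' and ',' from currency text and convert to float."""
    return pd.to_numeric(series.str.replace(r"[$,]", "", regex=True), errors="coerce")

numeric = sample[["production_budget", "worldwide_gross"]].apply(to_number)

n_rows, n_cols = sample.shape   # dataset size (rows, columns)
summary = numeric.describe()    # count, mean, std, min, quartiles, max
print(n_rows, n_cols)
print(summary)
```

On the full tables, the same two calls (`shape` and `describe`) would report the actual size and spread, which this section does not state.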

## Inclusion of Features:
The features included in the analysis are chosen based on their properties and relevance to the project. These features provide valuable insights into various aspects of film production and audience reception. The justification for the inclusion of features is as follows:

Revenue and Budget: These financial features directly reflect the financial success of films and are crucial for evaluating profitability and potential return on investment.

Release Dates: Analyzing release dates helps identify temporal patterns, seasonality effects, and potential competition during specific time periods.

Genres: Genre information is essential for understanding audience preferences, market trends, and identifying genres that resonate with viewers.

Ratings and Audience Scores: These features provide insights into audience sentiment and perception, helping gauge audience reception and popularity of films.
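
To make the link between the revenue and budget features and return on investment concrete, here is a minimal sketch using two rows from the `df_budgets` preview. The `profit` and `roi` definitions are simplifying assumptions (gross minus production budget, ignoring marketing and distribution costs), not a method stated elsewhere in this notebook.

```python
import pandas as pd

# Values copied from the df_budgets preview in this notebook
films = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425_000_000, 350_000_000],
    "worldwide_gross": [2_776_345_279, 149_762_350],
})

# Assumed definitions: simple profit and ROI relative to production budget
films["profit"] = films["worldwide_gross"] - films["production_budget"]
films["roi"] = films["profit"] / films["production_budget"]
print(films[["movie", "profit", "roi"]])
```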

## Limitations of the Data:
While the selected data sources are valuable, it is important to consider certain limitations that may have implications for the project:

Incomplete or Missing Data: Some films may have incomplete data or missing values, particularly for certain features. This may impact the overall analysis and introduce potential biases.

Data Quality and Reliability: The accuracy and reliability of the data depend on the sources and their data collection methods. Inaccurate or biased data could affect the validity of the analysis.

Selection Bias: The data sources primarily capture films that have garnered attention or commercial success. Independent or low-budget films with limited exposure may be underrepresented, which could introduce bias in the analysis.

Dynamic Nature of the Industry: The film industry is constantly evolving, with new trends, technologies, and audience preferences emerging. The dataset's static nature may not capture the most recent developments.


# Data Preparation:

To prepare the raw data for analysis, the following steps were taken:

### Step 1: Data Collection
The raw data was collected from the selected sources: `bom.movie_gross.csv.gz`, `tmdb.movies.csv.gz`, `tn.movie_budgets.csv.gz`, `imdb.title.basics.csv.gz`, and `imdb.title.ratings.csv.gz`.


In [14]:
import csv
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

In [2]:
# Reading csv Files
df_budgets = pd.read_csv('tn.movie_budgets.csv.gz')
df_tmdb = pd.read_csv('tmdb.movies.csv.gz')
df_gross = pd.read_csv('bom.movie_gross.csv.gz')
df_basics = pd.read_csv('imdb.title.basics.csv.gz')
df_ratings = pd.read_csv('imdb.title.ratings.csv.gz')

In [3]:
df_tmdb

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...,...,...,...,...
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1


In [4]:
df_ratings

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
...,...,...,...
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5


In [5]:
df_gross

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [6]:

df_budgets

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0


In [7]:
df_basics


Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,



### Step 2: Data Integration
The collected data from different sources needed to be integrated into a single dataset for analysis. This involved merging the relevant features from each source based on common identifiers such as movie titles or unique IDs.
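
When titles are used as the join key, small spelling differences between sources can prevent matches. The sketch below is one possible approach, not the notebook's method: it normalizes titles with a hypothetical `title_key` helper and uses the `indicator=True` option of `DataFrame.merge` to count how many rows matched. The rows are copied from the `df_tmdb` and `df_gross` previews.

```python
import pandas as pd

# Rows copied from the df_tmdb and df_gross previews in this notebook
tmdb = pd.DataFrame({
    "title": ["Inception", "Harry Potter and the Deathly Hallows: Part 1", "Toy Story"],
    "vote_average": [8.3, 7.7, 7.9],
})
gross = pd.DataFrame({
    "title": ["Inception", "Harry Potter and the Deathly Hallows Part 1", "Toy Story 3"],
    "domestic_gross": [292600000.0, 296000000.0, 415000000.0],
})

def title_key(series):
    """Hypothetical normalizer: lowercase, keep only letters, digits, and spaces."""
    return (series.str.lower()
                  .str.replace(r"[^a-z0-9 ]", "", regex=True)
                  .str.strip())

tmdb["key"] = title_key(tmdb["title"])
gross["key"] = title_key(gross["title"])

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'
merged = tmdb.merge(gross, on="key", how="left",
                    suffixes=("_tmdb", "_bom"), indicator=True)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Checking the `_merge` counts right after each join is a cheap way to notice a key that does not line up between sources.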




In [8]:

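# Left joins onto the budgets table. Caution: `id` in df_budgets appears to be a repeating row
# counter (78-81 in the last rows), not the TMDB movie id, so no TMDB or BOM columns are filled below.
merged_df = pd.merge(df_budgets, df_tmdb, on='id', how='left')
merged_df = pd.merge(merged_df, df_gross, on='title', how='left')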


merged_df


Unnamed: 0.1,id,release_date_x,movie,production_budget,domestic_gross_x,worldwide_gross,Unnamed: 0,genre_ids,original_language,original_title,popularity,release_date_y,title,vote_average,vote_count,studio,domestic_gross_y,foreign_gross,year
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279",,,,,,,,,,,,,
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",,,,,,,,,,,,,
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",,,,,,,,,,,,,
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",,,,,,,,,,,,,
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0,,,,,,,,,,,,,
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495",,,,,,,,,,,,,
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338",,,,,,,,,,,,,
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0,,,,,,,,,,,,,


In [12]:
# Drop every column that contains at least one missing value
merged_df = merged_df.dropna(axis = 1)
merged_df

Unnamed: 0,id,release_date_x,movie,production_budget,domestic_gross_x,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
...,...,...,...,...,...,...
5777,78,"Dec 31, 2018",Red 11,"$7,000",$0,$0
5778,79,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
5779,80,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
5780,81,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
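
Because `dropna(axis=1)` removes a column as soon as it has a single missing value, it can discard more than intended. The sketch below contrasts it with two gentler options of `DataFrame.dropna` (`how="all"` and `subset=`). The rows come from the `df_gross` preview; the `all_missing` column is a hypothetical addition for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "title": ["The Quake", "Inception", "Toy Story 3"],
    "domestic_gross": [6200.0, 292600000.0, 415000000.0],
    "foreign_gross": [np.nan, 535700000.0, 652000000.0],
    "all_missing": [np.nan, np.nan, np.nan],  # hypothetical, entirely empty column
})

dropped_any = df.dropna(axis=1)                # removes foreign_gross and all_missing
dropped_empty = df.dropna(axis=1, how="all")   # removes only the entirely empty column
rows_complete = dropped_empty.dropna(subset=["foreign_gross"])  # drops rows instead of columns

print(list(dropped_any.columns))
print(list(dropped_empty.columns))
print(len(rows_complete))
```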


### Step 3: Data Cleaning
#### Handling Missing Values:
Missing values were identified in the dataset and appropriate methods were used to handle them. For numerical features, missing values were imputed with mean, median, or zero values, depending on the context. For categorical features, missing values were replaced with the most frequent category.






In [13]:
# Checking missing values in the entire DataFrame
missing_values = merged_df.isnull().sum()
missing_values


id                   0
release_date_x       0
movie                0
production_budget    0
domestic_gross_x     0
worldwide_gross      0
dtype: int64
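
The imputation strategies described above (median for numerical features, most frequent category for categorical ones) could look roughly like the following. This is a sketch on a small hypothetical frame that mixes a `runtime_minutes` column and a `studio` column; it is not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; column names follow the source tables
df = pd.DataFrame({
    "runtime_minutes": [175.0, 114.0, 122.0, np.nan, 80.0],
    "studio": ["BV", "BV", "WB", None, "P/DW"],
})

imputed = df.copy()
# Numerical feature: fill gaps with the median of the observed values
imputed["runtime_minutes"] = imputed["runtime_minutes"].fillna(
    imputed["runtime_minutes"].median()
)
# Categorical feature: fill gaps with the most frequent category
imputed["studio"] = imputed["studio"].fillna(imputed["studio"].mode().iloc[0])

print(imputed)
```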

#### Outlier Detection:
Outliers in numerical features, such as extreme revenue or budget values, were identified and treated appropriately. Outliers were either removed from the dataset or transformed using suitable techniques to minimize their impact on the analysis.

In [None]:
# Detecting and handling outliers with the 1.5 * IQR rule.
# 'production_budget' is an example choice; it is stored as text like "$425,000,000", so parse it first.
col = 'production_budget'
values = pd.to_numeric(merged_df[col].astype(str).str.replace(r'[$,]', '', regex=True), errors='coerce')
Q1 = values.quantile(0.25)
Q3 = values.quantile(0.75)
IQR = Q3 - Q1
merged_df = merged_df[values.between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
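
Removing rows is not the only option mentioned in this step. The text also refers to transforming outliers, and two common transformations are sketched here on budget values taken from the `df_budgets` preview. The 5th and 95th percentile caps are an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Budget values copied from the first and last rows of the df_budgets preview
budgets = pd.Series(
    [425_000_000, 410_600_000, 350_000_000, 7_000, 6_000, 1_400],
    name="production_budget",
)

# Option 1: log transform compresses the range without dropping rows
log_budget = np.log1p(budgets)

# Option 2: cap extreme values at assumed percentile bounds (winsorizing)
lower, upper = budgets.quantile([0.05, 0.95])
capped = budgets.clip(lower=lower, upper=upper)

print(log_budget.round(2).tolist())
print(capped.tolist())
```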

### Step 4: Feature Selection
Not all features in the raw data were relevant for the problem at hand. Feature selection was performed to identify the most important and informative features for the analysis. This involved assessing the relevance, correlation, and significance of each feature in relation to the problem of interest.
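
One simple way to assess relevance is to rank numerical features by their correlation with a target. The sketch below assumes `worldwide_gross` as the target and uses the five rows from the top of the `df_budgets` preview; with so few rows the numbers are only a demonstration of the mechanics.

```python
import pandas as pd

# First five rows of the df_budgets preview, already converted to numbers
films = pd.DataFrame({
    "production_budget": [425_000_000, 410_600_000, 350_000_000, 330_600_000, 317_000_000],
    "domestic_gross": [760_507_625, 241_063_875, 42_762_350, 459_005_868, 620_181_382],
    "worldwide_gross": [2_776_345_279, 1_045_663_875, 149_762_350, 1_403_013_963, 1_316_721_747],
})

target = "worldwide_gross"  # assumed target for this illustration
# Pearson correlation of every other column with the target, strongest first
corr_with_target = (films.corr()[target]
                         .drop(target)
                         .sort_values(key=abs, ascending=False))
print(corr_with_target)
```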

### Step 5: Feature Encoding
Categorical features, such as genres or release dates, were encoded into numerical representations to facilitate analysis. Common encoding techniques include one-hot encoding, label encoding, or ordinal encoding, depending on the nature of the categorical data.
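
For the comma-separated `genres` strings and the text-formatted release dates seen in the previews, encoding might look like the sketch below. It uses `Series.str.get_dummies` for a multi-label one-hot encoding and `pd.to_datetime` to extract a numeric month. The date format string is an assumption based on values such as `Dec 18, 2009`.

```python
import pandas as pd

# Rows copied from the df_basics preview
films = pd.DataFrame({
    "primary_title": ["Sunghursh", "One Day Before the Rainy Season", "The Other Side of the Wind"],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
})

# One indicator column per genre; a film can have several genres set to 1
genre_dummies = films["genres"].str.get_dummies(sep=",")
encoded = pd.concat([films, genre_dummies], axis=1)

# Release dates copied from the df_budgets preview, converted to a month number
release = pd.to_datetime(
    pd.Series(["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"]),
    format="%b %d, %Y",
)
release_month = release.dt.month

print(encoded)
print(release_month.tolist())
```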

### Step 6: Feature Scaling
Numerical features that varied widely in magnitude were scaled to ensure fair comparison and prevent any undue influence of certain features on the analysis. Common scaling techniques include min-max scaling or standardization.
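
Both scaling techniques named above can be written directly with pandas arithmetic. The following is a minimal sketch on four budget values from the `df_budgets` preview; the population standard deviation (`ddof=0`) is an assumed choice.

```python
import pandas as pd

budget = pd.Series(
    [425_000_000, 350_000_000, 7_000, 1_400],
    name="production_budget",
    dtype="float64",
)

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max = (budget - budget.min()) / (budget.max() - budget.min())

# Standardization: zero mean and unit (population) standard deviation
standardized = (budget - budget.mean()) / budget.std(ddof=0)

print(min_max.round(4).tolist())
print(standardized.round(4).tolist())
```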

### Justifications for Data Preparation Steps:
The chosen data preparation steps are appropriate for the problem at hand for the following reasons:

Data collection from multiple sources: Gathering data from multiple reputable sources provides a comprehensive and diverse dataset, allowing for a more holistic analysis of the film industry.

Data integration: Integrating data from different sources enables a unified view of the film data, combining relevant information from various platforms to create a richer dataset for analysis.

Data cleaning: Addressing missing values, inconsistencies, and outliers ensures the quality and reliability of the data, minimizing bias and errors in subsequent analysis.

Feature selection: Selecting relevant features helps focus the analysis on the most important factors influencing film success, avoiding noise and unnecessary complexity in the modeling process.

Feature encoding: Encoding categorical features allows for their inclusion in numerical analysis models, capturing the influence of factors such as genres and release dates on film performance.

Feature scaling: Scaling numerical features ensures that they are on a comparable scale, preventing features with larger magnitudes from dominating the analysis and model outcomes.

By performing these data preparation steps, we ensure that the data is clean, consistent, and ready for analysis, enabling accurate and meaningful insights into the film industry and its associated challenges.

# Data Analysis:

Based on the data analyses conducted, the following three recommendations are proposed for choosing films to produce:

## Recommendation 1: Focus on Popular Genres
Findings:

- Analysis of genre popularity revealed that certain genres consistently attract a larger audience and generate higher revenues.
- Action, comedy, and adventure genres were consistently among the top-performing genres in terms of box office revenue.

Explanation:
The findings support the recommendation to focus on popular genres because these genres have a proven track record of attracting a larger audience and generating higher revenues. By investing in films within popular genres such as action, comedy, and adventure, the new movie studio can increase the chances of commercial success and profitability.

Impact on Success:
Choosing popular genres increases the potential audience reach and market demand for the studio's films. By catering to the preferences of a larger audience, the studio can enhance the chances of higher box office revenues and profitability, thereby increasing its overall success in the industry.
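
A genre comparison of this kind could be computed along the following lines. The rows below are hypothetical placeholders (made-up titles and revenue figures), used only to show the mechanics of splitting multi-genre strings and aggregating revenue per genre; they are not results from this project.

```python
import pandas as pd

# Hypothetical rows for illustration only; not project data or findings
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "genres": ["Action,Adventure", "Comedy", "Action,Comedy"],
    "worldwide_gross": [300.0, 100.0, 200.0],
})

# One row per (film, genre) pair, then count and median revenue per genre
by_genre = (films.assign(genre=films["genres"].str.split(","))
                 .explode("genre")
                 .groupby("genre")["worldwide_gross"]
                 .agg(["count", "median"])
                 .sort_values("median", ascending=False))
print(by_genre)
```

Reporting the film count next to the median helps flag genres whose ranking rests on only a handful of titles.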

## Recommendation 2: Emphasize Positive Audience Reception
Findings:

- Analysis of audience ratings and reviews demonstrated a strong correlation between positive audience reception and box office success.
- Films with higher audience ratings and positive reviews tended to have higher revenues and better financial performance.

Explanation:
The findings support the recommendation to emphasize positive audience reception because it indicates that films with favorable ratings and reviews are more likely to achieve box office success. By prioritizing the quality of storytelling, engaging characters, and overall audience satisfaction, the new movie studio can enhance the chances of creating films that resonate with viewers and perform well financially.

Impact on Success:
By producing films that receive positive audience reception, the new movie studio can build a strong reputation and a loyal fan base. Positive word-of-mouth and critical acclaim can increase audience anticipation and attract more viewers to the studio's films, leading to higher box office revenues and sustained success in the industry.
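
The relationship between ratings and revenue can be checked with a correlation coefficient. The sketch below uses hypothetical numbers and an assumed minimum of 100 votes to exclude ratings based on very few votes (the `df_ratings` preview shows films with as few as 5 votes). It illustrates the calculation only and does not reproduce the finding above.

```python
import pandas as pd

# Hypothetical values for illustration; column names follow df_ratings and df_budgets
films = pd.DataFrame({
    "averagerating": [4.0, 6.0, 7.0, 8.0],
    "numvotes": [50, 2000, 15000, 90000],
    "worldwide_gross": [10.0, 40.0, 35.0, 120.0],
})

# Assumed threshold: ignore ratings backed by fewer than 100 votes
reliable = films[films["numvotes"] >= 100]

# Pearson correlation between rating and revenue
r = reliable["averagerating"].corr(reliable["worldwide_gross"])
print(len(reliable), round(r, 3))
```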

## Recommendation 3: Strategic Release Date Selection
Findings:

- Analysis of release dates revealed that the timing of film releases significantly affects box office performance.
- Certain periods, such as holiday seasons or summer months, exhibited higher box office revenues due to increased audience availability and leisure time.

Explanation:
The findings support the recommendation to strategically select release dates as it can have a substantial impact on a film's financial success. By aligning film releases with periods of higher audience availability, such as holidays or summer months, the new movie studio can maximize its potential audience reach and capitalize on increased box office traffic during those periods.

Impact on Success:
Strategic release date selection can help the new movie studio optimize its film's visibility, competition, and revenue potential. By strategically timing film releases during periods when the target audience is more likely to watch movies, the studio can enhance the chances of achieving higher box office revenues and increasing its overall success in the industry.
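
A seasonality check of this kind amounts to grouping revenue by release month. The sketch below uses the five rows from the top of the `df_budgets` preview, which is far too few to support any conclusion and serves only to show the steps. The date format string is an assumption based on those rows.

```python
import pandas as pd

# First five rows of the df_budgets preview
films = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019", "May 1, 2015", "Dec 15, 2017"],
    "worldwide_gross": [2_776_345_279, 1_045_663_875, 149_762_350, 1_403_013_963, 1_316_721_747],
})

# Parse the text dates and extract the month number (1 = January)
films["month"] = pd.to_datetime(films["release_date"], format="%b %d, %Y").dt.month

# Number of films and median worldwide gross per release month
by_month = films.groupby("month")["worldwide_gross"].agg(["count", "median"])
print(by_month)
```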

Overall, these recommendations based on the data analyses aim to guide the new movie studio in making informed decisions regarding film production. By focusing on popular genres, emphasizing positive audience reception, and strategically selecting release dates, the studio can increase the chances of creating successful films that resonate with viewers, generate higher revenues, and establish a strong foothold in the competitive film industry.