# MICROSOFT'S NEW MOVIE STUDIO (Exploratory Data Analysis)

## 1. DEFINING THE QUESTION
 
   ### a) Specifying the Data Analytic Question
   
The movie industry spans many genres, including animation, action and comedy. When producing a movie, it is important to decide which genre to enter, and one way to do this is to identify the best-performing movies in the market. Genre performance can be assessed from several angles. The question for this analysis is: what are the top-performing movie genres in terms of box office revenue, and how can this information guide Microsoft's new movie studio in selecting the most promising genre for its film production?
  
  ### b) Defining the Metric for Success
  
The objective is to identify the movie genres that have performed exceptionally well in terms of box office revenue. By analyzing and comparing revenue figures across genres, we can determine which genres have a higher likelihood of achieving commercial success. Other metrics, such as return on investment (ROI), can also be used to evaluate success.
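
As a rough illustration of the ROI metric mentioned above, the sketch below computes ROI as (gross - budget) / budget. The function name `return_on_investment` and the choice of worldwide gross as the revenue figure are assumptions made for illustration. The sample figures are taken from rows displayed for `tn.movie_budgets.csv.gz` later in this notebook, already converted to plain numbers.

```python
import pandas as pd


def return_on_investment(gross: pd.Series, budget: pd.Series) -> pd.Series:
    """ROI as a fraction: (gross - budget) / budget, so 1.0 means a 100% return."""
    return (gross - budget) / budget


# Two rows from the budget table shown later, with the "$" and "," already removed.
sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425_000_000, 350_000_000],
    "worldwide_gross": [2_776_345_279, 149_762_350],
})

sample["roi"] = return_on_investment(
    sample["worldwide_gross"], sample["production_budget"]
)
print(sample[["movie", "roi"]])
```

A positive ROI means the movie earned back more than its production budget. A negative value means it did not, which is why ROI can rank genres differently from raw revenue.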

### c) Understanding the Context
We will consider the following key aspects:
 1. Market dynamics: Understand the current state of the movie industry, including market trends, audience preferences and competition. This will help identify opportunities and challenges.
 2. Business goals: Understand Microsoft's goals and objectives for entering the movie industry.
 3. Regulations: Familiarize ourselves with any relevant regulations or policies that may affect movie production and distribution.
 
 ### d) Recording the Experimental Design
 1. Data collection
 2. Data preprocessing
 3. Exploratory data analysis (EDA)
 4. Genre performance analysis
 5. Comparative analysis
 6. Insights and recommendations
 7. Sensitivity analysis 
 8. Reporting and Visualization
 
 ### e) Data Relevance
 1. Data selection: Use only data that is relevant to our analysis.
 2. Data quality: Check the completeness, accuracy, consistency and reliability of the data.
 3. Data scope: Consider the scope of the data in terms of the time period covered.
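
The completeness check in the list above could be sketched as follows. The helper name `completeness_report` and the small `demo` frame are hypothetical and only illustrate the idea. In the notebook the helper would be applied to the loaded DataFrames.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    })
    # Columns with the most missing values come first.
    return report.sort_values("missing", ascending=False)


# Hypothetical stand-in data, not taken from the real files.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "WB"],
    "foreign_gross": [None, None, "1000", None],
})

report = completeness_report(demo)
print(report)
```

Columns with a high `missing_pct` are candidates for dropping or imputing before any genre comparison is made.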


## 2. IMPORTING RELEVANT LIBRARIES

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 3. LOADING THE DATA

In [3]:
# loading Box Office Mojo (BOM) gross revenue data
bom_gross = pd.read_csv('bom.movie_gross.csv.gz')
bom_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [21]:
bom_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
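
The `info()` output above reports `foreign_gross` as `object` rather than a numeric type, so it cannot be summed or averaged as it stands. A minimal sketch of one way to convert it is shown below. It assumes the cause is thousands separators or other non-numeric text in some rows, which should be checked against the actual file. The helper name `to_numeric_gross` and the `demo` values are hypothetical.

```python
import pandas as pd


def to_numeric_gross(col: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to numbers; unparseable values become NaN."""
    cleaned = col.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical values covering a plain number, a separator, a missing value and junk text.
demo = pd.Series(["652000000", "1,131.6", None, "n/a"])
converted = to_numeric_gross(demo)
print(converted)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected text, at the cost of silently turning such rows into `NaN`, so it is worth counting the missing values before and after.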


In [4]:
# loading data of cast names
cast_names = pd.read_csv('imdb.name.basics.csv.gz')
cast_names

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"
...,...,...,...,...,...,...
606643,nm9990381,Susan Grobes,,,actress,
606644,nm9990690,Joo Yeon So,,,actress,"tt9090932,tt8737130"
606645,nm9991320,Madeline Smith,,,actress,"tt8734436,tt9615610"
606646,nm9991786,Michelle Modigliani,,,producer,


In [22]:
cast_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 6 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   nconst              606648 non-null  object 
 1   primary_name        606648 non-null  object 
 2   birth_year          82736 non-null   float64
 3   death_year          6783 non-null    float64
 4   primary_profession  555308 non-null  object 
 5   known_for_titles    576444 non-null  object 
dtypes: float64(2), object(4)
memory usage: 27.8+ MB


In [5]:
# loading data for movie titles
movie_title = pd.read_csv('imdb.title.akas.csv.gz')
movie_title.head()

Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


In [6]:
# loading data for directors and writers
crew_information = pd.read_csv('imdb.title.crew.csv.gz')
crew_information.head()

Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


In [7]:
# loading data of individuals involved and their roles
principals_information = pd.read_csv('imdb.title.principals.csv.gz')
principals_information.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"


In [8]:
# loading data for movie ratings
movie_ratings = pd.read_csv('imdb.title.ratings.csv.gz')
movie_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [9]:
# loading data for movie information
movie_info = pd.read_csv('rt.movie_info.tsv.gz', delimiter='\t')
movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [24]:
movie_info['rating']

0        R
1        R
2        R
3        R
4       NR
        ..
1555     R
1556    PG
1557     G
1558    PG
1559     R
Name: rating, Length: 1560, dtype: object

In [23]:
movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
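
The `genre` column shown above packs several genres into one pipe-separated string, so a movie cannot be counted per genre directly. The sketch below shows one possible way to reshape it into one row per movie and genre with `str.split` and `explode`. The `demo` frame reuses two genre strings from the output above plus a hypothetical missing value.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": [1, 3, 7],
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
        None,  # hypothetical missing genre
    ],
})

genres_long = (
    demo.dropna(subset=["genre"])                               # rows without a genre cannot be split
        .assign(genre=lambda d: d["genre"].str.split("|"))      # string -> list of genres
        .explode("genre")                                       # one row per (movie, genre)
        .reset_index(drop=True)
)

genre_counts = genres_long["genre"].value_counts()
print(genre_counts)
```

Note that a movie with three genres contributes three rows, so totals computed on the long table count such a movie once per genre.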


In [10]:
# loading data for ratings and reviews
rt_reviews = pd.read_csv('rt.reviews.tsv.gz', delimiter='\t', encoding='latin-1')
rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [25]:
rt_reviews['rating']

0          3/5
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
54427      NaN
54428      1/5
54429      2/5
54430    2.5/5
54431      3/5
Name: rating, Length: 54432, dtype: object
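
The `rating` values above are strings such as `3/5` and `2.5/5`, which need converting before they can be averaged. The sketch below is one possible approach: it parses `x/y` strings into a fraction between 0 and 1 and turns everything else into `NaN`. The helper name `fraction_to_score` is hypothetical, and the `B+` and `1/0` entries in `demo` are assumed edge cases rather than values confirmed in the file.

```python
import pandas as pd


def fraction_to_score(rating: pd.Series) -> pd.Series:
    """Convert 'x/y' strings to x / y; any other value becomes NaN."""
    parts = rating.str.extract(r"^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$")
    numerator = pd.to_numeric(parts[0], errors="coerce")
    denominator = pd.to_numeric(parts[1], errors="coerce")
    # Guard against a zero denominator, which would otherwise give infinity.
    return (numerator / denominator).where(denominator > 0)


demo = pd.Series(["3/5", None, "2.5/5", "B+", "1/0"])
scores = fraction_to_score(demo)
print(scores)
```

Normalising to a fraction puts ratings given on different scales (for example out of 5 and out of 10) on a common footing.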

In [11]:
# loading data of TMDB movies information
tdmb_movies_info = pd.read_csv('tmdb.movies.csv.gz')
tdmb_movies_info.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
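
Because a CSV file stores everything as text, the `genre_ids` values above arrive as strings that look like lists, not as Python lists. A minimal sketch of converting them with `ast.literal_eval` follows. The `demo` frame reuses two rows from the output above, and the mapping from IDs to genre names is not covered here.

```python
import ast

import pandas as pd

demo = pd.DataFrame({
    "title": ["Iron Man 2", "Toy Story"],
    "genre_ids": ["[12, 28, 878]", "[16, 35, 10751]"],
})

# literal_eval safely parses the text "[12, 28, 878]" into the list [12, 28, 878].
demo["genre_ids"] = demo["genre_ids"].apply(ast.literal_eval)

# One row per (title, genre id), ready to be joined to a table of genre names.
ids_long = demo.explode("genre_ids")
print(ids_long)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from a file.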


In [12]:
# loading data for movie budget
movie_budget = pd.read_csv('tn.movie_budgets.csv.gz')
movie_budget.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
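
The budget and gross columns above are formatted as currency strings such as `$425,000,000`, so they must be converted to numbers before any arithmetic. The sketch below shows one way this might be done. The helper name `money_to_number` is hypothetical, and the `demo` frame reuses two rows from the output above.

```python
import pandas as pd


def money_to_number(col: pd.Series) -> pd.Series:
    """Remove '$' and ',' from strings such as '$425,000,000' and convert to numbers."""
    cleaned = col.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


demo = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

money_cols = ["production_budget", "worldwide_gross"]
demo[money_cols] = demo[money_cols].apply(money_to_number)
print(demo.dtypes)
```

The same helper could be applied to `domestic_gross`, since it uses the same formatting in the output above.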


## 4. DATA WRANGLING

  ### 4.1 Dropping columns
  
Out of the several datasets that were collected, only some features and rows are relevant to this process. Therefore, in this step, the features that are not required from each dataset will be dropped. The remaining datasets will then be joined.

In [13]:
# dropping the studio column from the data set
new_bom_gross = bom_gross.drop("studio", axis=1)
new_bom_gross.head()

Unnamed: 0,title,domestic_gross,foreign_gross,year
0,Toy Story 3,415000000.0,652000000,2010
1,Alice in Wonderland (2010),334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,296000000.0,664300000,2010
3,Inception,292600000.0,535700000,2010
4,Shrek Forever After,238700000.0,513900000,2010


In [14]:
new_bom_gross.shape


(3387, 4)

The new dataset has 3387 rows and 4 columns.

In [15]:
# selecting the relevant columns
new_cast_names = cast_names[['nconst', 'primary_name', 'known_for_titles']]
new_cast_names.head()

Unnamed: 0,nconst,primary_name,known_for_titles
0,nm0061671,Mary Ellen Bauder,"tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,"tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,"tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,"tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,"tt0452644,tt0452692,tt3458030,tt2178256"


In [16]:
# select the relevant columns
new_movie_title = movie_title[['title_id', 'ordering', 'title', 'is_original_title']]
new_movie_title.head()

Unnamed: 0,title_id,ordering,title,is_original_title
0,tt0369610,10,Джурасик свят,0.0
1,tt0369610,11,Jurashikku warudo,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,0.0
3,tt0369610,13,O Mundo dos Dinossauros,0.0
4,tt0369610,14,Jurassic World,0.0
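
The table above lists several regional titles for the same `title_id`, which would duplicate a movie if the table were joined as it is. One possible way to keep a single title per movie is to filter on `is_original_title`, as sketched below. The third row of `demo` is a hypothetical row flagged as the original title, added only so the filter has something to select.

```python
import pandas as pd

demo = pd.DataFrame({
    "title_id": ["tt0369610", "tt0369610", "tt0369610"],
    "ordering": [12, 14, 20],
    "title": [
        "Jurassic World: O Mundo dos Dinossauros",
        "Jurassic World",
        "Jurassic World",  # hypothetical row marked as the original title
    ],
    "is_original_title": [0.0, 0.0, 1.0],
})

original_titles = (
    demo[demo["is_original_title"] == 1.0]      # keep only rows flagged as original
        .drop_duplicates(subset="title_id")     # at most one row per movie
        .loc[:, ["title_id", "title"]]
)
print(original_titles)
```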


In [17]:
# dropping the irrelevant columns from the data set
new_principals_information = principals_information.drop(['characters','job'], axis=1)
new_principals_information.head()

Unnamed: 0,tconst,ordering,nconst,category
0,tt0111414,1,nm0246005,actor
1,tt0111414,2,nm0398271,director
2,tt0111414,3,nm3739909,producer
3,tt0323808,10,nm0059247,editor
4,tt0323808,1,nm3579312,actress


In [19]:
# selecting useful columns
new_movie_info = movie_info[['id', 'genre', 'rating', 'director', 'writer']]
new_movie_info.head()

Unnamed: 0,id,genre,rating,director,writer
0,1,Action and Adventure|Classics|Drama,R,William Friedkin,Ernest Tidyman
1,3,Drama|Science Fiction and Fantasy,R,David Cronenberg,David Cronenberg|Don DeLillo
2,5,Drama|Musical and Performing Arts,R,Allison Anders,Allison Anders
3,6,Drama|Mystery and Suspense,R,Barry Levinson,Paul Attanasio|Michael Crichton
4,7,Drama|Romance,NR,Rodney Bennett,Giles Cooper
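
The join mentioned at the start of this section could look roughly like the sketch below. It assumes that `title_id` and `tconst` both hold the same IMDb title identifier and that the gross data can be matched on the movie title. Both assumptions should be verified on the real data. The identifiers and rating values in the sample frames are placeholders, not real records.

```python
import pandas as pd

# Placeholder rows standing in for the trimmed datasets.
titles = pd.DataFrame({
    "title_id": ["tt_a", "tt_b"],
    "title": ["Inception", "Toy Story 3"],
})
ratings = pd.DataFrame({
    "tconst": ["tt_a", "tt_c"],
    "averagerating": [7.0, 6.0],  # made-up values for illustration
})
gross = pd.DataFrame({
    "title": ["Inception", "Toy Story 3"],
    "domestic_gross": [292600000.0, 415000000.0],
})

# Inner join keeps only titles that have a rating.
rated_titles = titles.merge(
    ratings, left_on="title_id", right_on="tconst", how="inner"
)

# Left join keeps every rated title; validate raises if a title repeats in `gross`.
combined = rated_titles.merge(
    gross, on="title", how="left", validate="many_to_one"
)
print(combined)
```

Matching on title text is fragile, because different movies can share a title and the same movie can be spelled differently across sources, so adding the release year to the join key may be worth considering.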
