# MICROSOFT NEW MOVIE STUDIO.

## 1. Business Understanding.

#### a) SPECIFYING THE DATA ANALYTICS QUESTION.
Microsoft sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.

#### b) PROBLEM STATEMENT.
Microsoft wants to launch a new movie studio but does not know how to go about it. This analysis aims to provide insights that will guide the company's entry into film production.

#### c) DEFINING THE METRIC OF SUCCESS.
This project will be considered a success if it yields at least 3 actionable recommendations that help Microsoft make business-driven decisions on which movies to venture into, by identifying which genres are the most popular and how to invest in them.
These insights will be grounded in the knowledge gained by following the CRISP-DM methodology.

#### d) MAIN OBJECTIVE.
To produce actionable insights that Microsoft can implement when setting up its new movie studio.

#### e) SPECIFIC OBJECTIVES.
###### .To find out which genres of movies are most popular.
###### .To find out the budget for investment.
###### .To find out which publishers and studios are most popular.

#### f) EXPERIMENTAL DESIGN
##### 1.Data Collection
##### 2.Read and check the data
##### 3.Cleaning the data
##### 4.Exploratory Data Analysis
##### 5.Conclusions and Recommendations

#### g) DATA UNDERSTANDING.
The data used is contained in the zippedData folder, which holds movie datasets from:
##### .im.db.zip:
#####    .movie_basics 
#####    .movie_ratings
##### .bom.movie_gross.csv.gz

## 2. Importing Libraries.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## 3. Reading the Data.

In [2]:
# from bom movie gross
movie_gross= pd.read_csv('bom.movie_gross.csv.gz')
movie_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [3]:
#from title basics.
title_basics = pd.read_csv('imdb.title.basics.csv.gz')
title_basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [4]:
#from ratings.
ratings = pd.read_csv('imdb.title.ratings.csv.gz')
ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [5]:
movie_info= pd.read_csv('rt.movie_info.tsv.gz', delimiter='\t', encoding='latin1')
movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [6]:
# from movie budgets.
movie_budgets= pd.read_csv('tn.movie_budgets.csv.gz')
movie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [7]:
name_basics= pd.read_csv('imdb.name.basics.csv.gz')
name_basics.head()

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


In [8]:
movies_popularity= pd.read_csv('tmdb.movies.csv.gz')
movies_popularity.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
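
The tmdb preview above carries an extra `Unnamed: 0` column, which suggests the file was saved with its row index included. A minimal sketch of how that column arises and how `index_col=0` avoids it, using a small toy file rather than the project data (the file name `toy.movies.csv.gz` is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for a movie table; not the project data.
toy = pd.DataFrame({"title": ["Inception", "Toy Story"],
                    "vote_average": [8.3, 7.9]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.movies.csv.gz")  # hypothetical file name
    # to_csv writes the index as an unnamed first column;
    # gzip compression is inferred from the '.gz' extension.
    toy.to_csv(path)

    plain = pd.read_csv(path)                 # index comes back as 'Unnamed: 0'
    indexed = pd.read_csv(path, index_col=0)  # first column is used as the index

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Whether to pass `index_col=0` for the real files depends on whether that first column is really just a saved index, which is worth checking before relying on it.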


## 4. Data Wrangling.

### 4.1. Dropping Columns.

In [9]:
# from title basics.
title_basics_new = title_basics[['tconst','primary_title','genres']]
title_basics_new.head()

Unnamed: 0,tconst,primary_title,genres
0,tt0063540,Sunghursh,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,Drama
3,tt0069204,Sabse Bada Sukh,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,"Comedy,Drama,Fantasy"


In [10]:
#from movie_info
movie_info_new= movie_info[['id','writer','director']]
movie_info_new.head()

Unnamed: 0,id,writer,director
0,1,Ernest Tidyman,William Friedkin
1,3,David Cronenberg|Don DeLillo,David Cronenberg
2,5,Allison Anders,Allison Anders
3,6,Paul Attanasio|Michael Crichton,Barry Levinson
4,7,Giles Cooper,Rodney Bennett


In [11]:
# from movie budgets; .copy() gives an independent frame so later column assignments are safe.
movie_budgets_new= movie_budgets[['id','production_budget','domestic_gross','worldwide_gross']].copy()
movie_budgets_new.head()

Unnamed: 0,id,production_budget,domestic_gross,worldwide_gross
0,1,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"$350,000,000","$42,762,350","$149,762,350"
3,4,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"$317,000,000","$620,181,382","$1,316,721,747"


In [12]:
#from name basics.
name_basics_new= name_basics[['primary_name','known_for_titles']]
name_basics_new.head()

Unnamed: 0,primary_name,known_for_titles
0,Mary Ellen Bauder,"tt0837562,tt2398241,tt0844471,tt0118553"
1,Joseph Bauer,"tt0896534,tt6791238,tt0287072,tt1682940"
2,Bruce Baum,"tt1470654,tt0363631,tt0104030,tt0102898"
3,Axel Baumann,"tt0114371,tt2004304,tt1618448,tt1224387"
4,Pete Baxter,"tt0452644,tt0452692,tt3458030,tt2178256"


In [13]:
# from movie popularity.
movies_popularity_new= movies_popularity[['original_title','vote_average','vote_count','popularity']]
movies_popularity_new.head()

Unnamed: 0,original_title,vote_average,vote_count,popularity
0,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,33.533
1,How to Train Your Dragon,7.7,7610,28.734
2,Iron Man 2,6.8,12368,28.515
3,Toy Story,7.9,10174,28.005
4,Inception,8.3,22186,27.92


In [14]:
# from movie bom; .copy() gives an independent frame so later column assignments are safe.
movie_gross_new= movie_gross[['title','domestic_gross','foreign_gross','year']].copy()
movie_gross_new.head()

Unnamed: 0,title,domestic_gross,foreign_gross,year
0,Toy Story 3,415000000.0,652000000,2010
1,Alice in Wonderland (2010),334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,296000000.0,664300000,2010
3,Inception,292600000.0,535700000,2010
4,Shrek Forever After,238700000.0,513900000,2010


### 4.2. Checking for null values.

In [15]:
title_basics_new.isna().sum()

tconst              0
primary_title       0
genres           5408
dtype: int64

In [16]:
title_basics_new.describe()

Unnamed: 0,tconst,primary_title,genres
count,146144,146144,140736
unique,146144,136071,1085
top,tt0063540,Home,Documentary
freq,1,24,32185


In [17]:
title_basics_new= title_basics_new.dropna(subset=['genres'])
title_basics_new.head()

Unnamed: 0,tconst,primary_title,genres
0,tt0063540,Sunghursh,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,Drama
3,tt0069204,Sabse Bada Sukh,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,"Comedy,Drama,Fantasy"


In [18]:
title_basics_new.isna().sum()

tconst           0
primary_title    0
genres           0
dtype: int64

In [19]:
title_basics_new.head()

Unnamed: 0,tconst,primary_title,genres
0,tt0063540,Sunghursh,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,Drama
3,tt0069204,Sabse Bada Sukh,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,"Comedy,Drama,Fantasy"


In [20]:
title_basics_new.shape

(140736, 3)
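
The `describe()` output above reports 1,085 unique values in `genres`, most likely because each value is a comma-separated combination such as `Action,Crime,Drama`. To count individual genres, one option is to split the string and give each genre its own row. The sketch below uses a four-row toy frame built from the preview rows; `sample`, `genres_long` and `genre_counts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for title_basics_new, copied from the preview rows above.
sample = pd.DataFrame({
    "tconst": ["tt0063540", "tt0066787", "tt0069049", "tt0069204"],
    "primary_title": ["Sunghursh", "One Day Before the Rainy Season",
                      "The Other Side of the Wind", "Sabse Bada Sukh"],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama", "Comedy,Drama"],
})

# Split the comma-separated string into a list, then expand to one row per genre.
genres_long = sample.assign(genre=sample["genres"].str.split(",")).explode("genre")

# Count how many titles carry each individual genre.
genre_counts = genres_long["genre"].value_counts()
print(genre_counts)
```

A title with three genres contributes three rows here, so the counts describe how often a genre appears, not how many distinct films exist.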

In [21]:
#from movie info.
movie_info_new.isna().sum()

id            0
writer      449
director    199
dtype: int64

In [22]:
movie_info_new= movie_info_new.dropna(subset=['writer'])
movie_info_new.head()

Unnamed: 0,id,writer,director
0,1,Ernest Tidyman,William Friedkin
1,3,David Cronenberg|Don DeLillo,David Cronenberg
2,5,Allison Anders,Allison Anders
3,6,Paul Attanasio|Michael Crichton,Barry Levinson
4,7,Giles Cooper,Rodney Bennett


In [23]:
movie_info_new.dtypes

id           int64
writer      object
director    object
dtype: object

In [24]:
movie_info_new.isna().sum()

id           0
writer       0
director    68
dtype: int64

In [25]:
movie_info_new= movie_info_new.dropna(subset=['director'])
movie_info_new.head()

Unnamed: 0,id,writer,director
0,1,Ernest Tidyman,William Friedkin
1,3,David Cronenberg|Don DeLillo,David Cronenberg
2,5,Allison Anders,Allison Anders
3,6,Paul Attanasio|Michael Crichton,Barry Levinson
4,7,Giles Cooper,Rodney Bennett


In [26]:
movie_info_new.isna().sum()

id          0
writer      0
director    0
dtype: int64

In [27]:
movie_info_new.shape

(1043, 3)
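
The `writer` and `director` nulls are dropped in two separate passes above. As a sketch, the same rows should result from a single `dropna` call that lists both columns in `subset`. The frame below is a toy example with placeholder names, not the Rotten Tomatoes data.

```python
import pandas as pd

# Toy frame with placeholder names; None marks a missing value.
sample = pd.DataFrame({
    "id": [1, 3, 5, 6],
    "writer": ["Writer A", None, "Writer C", "Writer D"],
    "director": ["Director A", "Director B", None, "Director D"],
})

# Two passes, as in the cells above.
two_step = sample.dropna(subset=["writer"]).dropna(subset=["director"])

# One pass: a row is dropped if either listed column is missing.
one_step = sample.dropna(subset=["writer", "director"])

print(one_step.equals(two_step))
```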

In [28]:
# from movie_budgets
movie_budgets_new.isna().sum()

id                   0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

In [29]:
movie_budgets_new.shape

(5782, 4)

In [30]:
movie_budgets_new.head()

Unnamed: 0,id,production_budget,domestic_gross,worldwide_gross
0,1,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"$350,000,000","$42,762,350","$149,762,350"
3,4,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"$317,000,000","$620,181,382","$1,316,721,747"


In [32]:
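# Strip the dollar sign; regex=False makes '$' a literal character rather than a regex end-of-string anchor.
movie_budgets_new['production_budget'] = movie_budgets_new['production_budget'].str.replace('$', '', regex=False)
movie_budgets_new['domestic_gross'] = movie_budgets_new['domestic_gross'].str.replace('$', '', regex=False)
movie_budgets_new['worldwide_gross'] = movie_budgets_new['worldwide_gross'].str.replace('$', '', regex=False)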


In [33]:
# Strip the thousands separators; the values are still strings after this step.
movie_budgets_new['production_budget'] = movie_budgets_new['production_budget'].str.replace(',', '', regex=False)
movie_budgets_new['domestic_gross'] = movie_budgets_new['domestic_gross'].str.replace(',', '', regex=False)
movie_budgets_new['worldwide_gross'] = movie_budgets_new['worldwide_gross'].str.replace(',', '', regex=False)


In [35]:
movie_budgets_new.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0
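
`describe()` summarises only `id` here because the three money columns are still strings once the `$` and `,` characters are removed. A minimal sketch of one way to finish the conversion, shown on the first three preview rows rather than the full table (`sample` and `money_cols` are illustrative names):

```python
import pandas as pd

# First three rows of the budgets preview, kept as raw strings.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "production_budget": ["$425,000,000", "$410,600,000", "$350,000,000"],
    "domestic_gross": ["$760,507,625", "$241,063,875", "$42,762,350"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875", "$149,762,350"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    # Remove '$' and ',' in one pass (inside [...] the '$' is literal),
    # then cast the cleaned strings to 64-bit integers.
    sample[col] = sample[col].str.replace(r"[$,]", "", regex=True).astype("int64")

print(sample.dtypes)
print(sample[money_cols].describe())
```

With numeric dtypes in place, `describe()` reports all three money columns alongside `id`.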


In [None]:
#from name basics.
name_basics_new.isna().sum()

In [None]:
name_basics_new.info()

In [None]:
name_basics_new.describe()

In [None]:
name_basics_new= name_basics_new.dropna(subset = ['known_for_titles'])
name_basics_new.head()

In [None]:
name_basics_new.describe()

In [None]:
name_basics_new.isna().sum()

In [None]:
#from movie_popularity.
movies_popularity_new.isna().sum()

In [None]:
movies_popularity_new.info()

In [None]:
movies_popularity_new.describe()

In [None]:
#from bom movie
movie_gross_new.isna().sum()

In [None]:
movie_gross_new.info()

In [None]:
movie_gross_new.describe()

In [None]:
movie_gross_new['foreign_gross'].isna().sum()

In [None]:
movie_gross_new['domestic_gross'].isna().sum()

In [None]:
# Fill missing values with the mode in 'domestic_gross' column.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which may not update the parent frame.
domestic_mode = movie_gross_new['domestic_gross'].mode().values[0]
movie_gross_new['domestic_gross'] = movie_gross_new['domestic_gross'].fillna(domestic_mode)

# Fill missing values with the mode in 'foreign_gross' column
foreign_mode = movie_gross_new['foreign_gross'].mode().values[0]
movie_gross_new['foreign_gross'] = movie_gross_new['foreign_gross'].fillna(foreign_mode)


In [None]:
movie_gross_new.head()

In [None]:
movie_gross_new.isna().sum()

In [None]:
# Convert 'domestic_gross' column to string
movie_gross_new['domestic_gross'] = movie_gross_new['domestic_gross'].astype(str)

# Convert 'foreign_gross' column to string
movie_gross_new['foreign_gross'] = movie_gross_new['foreign_gross'].astype(str)

In [None]:
# Remove commas from 'domestic_gross' column values and convert to float
movie_gross_new['domestic_gross'] = movie_gross_new['domestic_gross'].str.replace(',', '').astype(float)

# Remove commas from 'foreign_gross' column values and convert to float
movie_gross_new['foreign_gross'] = movie_gross_new['foreign_gross'].str.replace(',', '').astype(float)
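
The cells above impute first and parse afterwards, so the mode of `foreign_gross` is taken over text values. An alternative ordering, sketched here on toy data, parses the strings first and then imputes on numbers. The values below (including the comma-formatted `1,131.6`) are hypothetical, and `pd.to_numeric(..., errors="coerce")` is swapped in for `astype(float)` so that unparseable entries become `NaN` instead of raising.

```python
import pandas as pd

# Toy frame with hypothetical values; None marks a missing entry.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "foreign_gross": ["652000000", "1,131.6", None, "652000000"],
})

# Remove thousands separators, then parse; anything unparseable becomes NaN.
cleaned = sample["foreign_gross"].str.replace(",", "", regex=False)
sample["foreign_gross"] = pd.to_numeric(cleaned, errors="coerce")

# Impute after parsing so the mode is computed on numbers, not strings.
foreign_mode = sample["foreign_gross"].mode().iloc[0]
sample["foreign_gross"] = sample["foreign_gross"].fillna(foreign_mode)

print(sample)
```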

In [None]:
movie_gross_new