## Final Project Submission

Please fill out:
* Student names: Christopher Noel, Margaret Nyairo, Victor Masinde, James Ngumo, Anthony Ekeno. 
* Student pace: full time 
* Instructor name: Maryann Mwikali


# Market Analysis & Insights for Strategic Movie Production

## Business Understanding

### Overview
This project supports a company's venture into the movie production industry through the launch of a new studio. Through comprehensive analysis of box office data, the project identifies current trends and provides actionable insights. These insights will guide the company in determining which types of movies are most successful in today's market, supporting strategic content creation and maximizing box office returns.

### The Problem Statement
The company needs to pinpoint what types of movies are most successful in the current market to propel the new studio's launch. Specifically, we aim to:

1. **Identify the Genres Performing Well at the Box Office**: Determine which movie genres are currently popular and yield high box office returns.
   
2. **Analyze Movie Budgets and Profitability**: Evaluate the relationship between production budgets, returns, and overall profitability to find the optimal investment range.
   
3. **Assess Audience Demographics Driving Success**: Understand which demographic segments are contributing significantly to box office revenues.
   
4. **Recommend Optimal Release Seasons or Windows**: Identify the best times of the year for releasing movies to maximize box office performance.

### Main objectives
To identify the most successful types of films currently at the box office and translate these insights into actionable strategies that guide the new movie studio's production choices, ensuring competitive and commercial success in the film industry.


## Data Wrangling

### Import Libraries
First, we import the packages needed for exploratory data analysis.


In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3 

%matplotlib inline


### Exploratory Data Analysis

The data used in this analysis was collected from popular movie sites such as Box Office Mojo, IMDb, and Rotten Tomatoes. It contains detailed information on movie titles, actors, directors, box office earnings, and movie ratings. We use four datasets:

1. im.db (`zippedData/im.db`)
2. Box Office Mojo gross (`zippedData/bom.movie_gross.csv`)
3. movie budgets (`zippedData/tn.movie_budgets.csv`)
4. movie info (`zippedData/rt.movie_info.tsv`)


 

### Import Data sets



##### First Dataset - im.db
This dataset forms the basis of the analysis.

In [2]:
# First dataset: im.db

# Establish a connection to the SQLite database
conn = sqlite3.connect("zippedData/im.db")
cur = conn.cursor()

# List the objects (tables) stored in the database

pd.read_sql("""
SELECT *
FROM sqlite_master
""", conn)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."
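
If only the table names are needed, the same `sqlite_master` query can be narrowed. The sketch below is a minimal illustration that uses an in-memory database with two made-up tables standing in for `im.db`, so it runs without the project files; `demo_conn` and `table_names` are hypothetical names.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for im.db; these two tables are illustrative only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master holds one row per schema object; keep only the tables.
table_names = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    demo_conn,
)["name"].tolist()
print(table_names)

demo_conn.close()
```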


###### An Entity Relationship Diagram (ERD)
- Below is an ERD showing the contents of the tables listed above.

- After studying the column names in the ERD, we are interested in the contents of two tables, **movie_basics** and **movie_ratings**, since they are vital to our analysis.

- We combine the two tables with a **JOIN** statement and then analyse the structure of the result.

![Alt text](movie_data_erd.jpeg)

###### JOIN tables (movie_basics + movie_ratings) to create a new dataframe **imdb**

In [3]:
## merge required tables then convert it to a dataframe
#Select relevant information from movie_basics table
#JOIN to movie_ratings

imdb = pd.read_sql("""
SELECT primary_title,start_year,runtime_minutes,genres,averagerating
FROM movie_basics
JOIN movie_ratings
USING("movie_id")
""",conn)
imdb

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating
0,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0
1,One Day Before the Rainy Season,2019,114.0,"Biography,Drama",7.2
2,The Other Side of the Wind,2018,122.0,Drama,6.9
3,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1
4,The Wandering Soap Opera,2017,80.0,"Comedy,Drama,Fantasy",6.5
...,...,...,...,...,...
73851,Diabolik sono io,2019,75.0,Documentary,6.2
73852,Sokagin Çocuklari,2019,98.0,"Drama,Family",8.7
73853,Albatross,2017,,Documentary,8.5
73854,La vida sense la Sara Amat,2019,,,6.6
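
To make the effect of `JOIN ... USING` concrete, here is a small sketch on an in-memory database. The three films and their ratings are invented for illustration; only the table and column names follow `im.db`. A plain `JOIN` is an inner join, so a movie without a rating does not appear in the result.

```python
import sqlite3

import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL);
INSERT INTO movie_basics VALUES ('tt1', 'Film A'), ('tt2', 'Film B'), ('tt3', 'Film C');
INSERT INTO movie_ratings VALUES ('tt1', 7.0), ('tt3', 6.5);
""")

# 'tt2' has no row in movie_ratings, so the inner join leaves it out.
joined = pd.read_sql("""
SELECT primary_title, averagerating
FROM movie_basics
JOIN movie_ratings
USING (movie_id)
ORDER BY movie_id
""", demo_conn)
demo_conn.close()

print(joined)
```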


##### Second Dataset - Box Office Mojo gross
Essential for understanding box office success and profitability.

In [4]:
## SECOND DATASET 
gross = pd.read_csv("zippedData/bom.movie_gross.csv")
gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


##### Third Dataset - movie budgets
Used to assess budget-related profitability.

In [5]:
## THIRD DATASET
budget = pd.read_csv("zippedData/tn.movie_budgets.csv")
budget.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


##### Fourth Dataset - movie info
Allows us to categorize movies, which is essential for identifying high-performing genres.


In [6]:
# FOURTH DATASET
movie_info = pd.read_csv("zippedData/rt.movie_info.tsv",sep="\t")
movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


####  Data Understanding 
In this section we examine the data to understand it better before working on it. We ascertain the number of rows and columns (`.shape`), get a concise summary showing the column names, the number of non-null values, and the dtype of each column (`.info`), and, for continuous data, obtain a brief statistical summary of the integer and float columns (`.describe`).
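
As a quick orientation, the sketch below runs the three inspection tools on a tiny made-up dataframe (`demo` is a hypothetical name, not one of the project datasets). It shows that `.describe()` summarises only the numeric columns and that its `count` row excludes missing values.

```python
import pandas as pd

# Toy data: one text column and two numeric columns, with one missing runtime.
demo = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film C"],
    "runtime_minutes": [90.0, None, 120.0],
    "averagerating": [6.5, 7.0, 8.1],
})

n_rows, n_cols = demo.shape   # (rows, columns)
demo.info()                   # prints column names, non-null counts, dtypes
summary = demo.describe()     # statistics for the numeric columns only
print(summary)
```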

###### .info Method
`.info()` prints a concise summary of the dataframe: column names, non-null counts, and dtypes.

In [7]:
# first dataset
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   primary_title    73856 non-null  object 
 1   start_year       73856 non-null  int64  
 2   runtime_minutes  66236 non-null  float64
 3   genres           73052 non-null  object 
 4   averagerating    73856 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 2.8+ MB


In [8]:
# second dataset 
gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [9]:
#  third dataset
budget.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [10]:
# fourth dataset 
movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


###### .columns Attribute
This returns the column labels of each dataframe.

In [11]:
# First DataSet
imdb.columns

Index(['primary_title', 'start_year', 'runtime_minutes', 'genres',
       'averagerating'],
      dtype='object')

In [12]:
# Second DataSet
gross.columns

Index(['title', 'studio', 'domestic_gross', 'foreign_gross', 'year'], dtype='object')

In [13]:
# Third DataSet
budget.columns

Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')

In [14]:
# Fourth DataSet 
movie_info.columns

Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

###### .describe Method
This returns descriptive statistics for the numeric columns of the dataframe.

In [15]:
# First DataSet
imdb.describe()

Unnamed: 0,start_year,runtime_minutes,averagerating
count,73856.0,66236.0,73856.0
mean,2014.276132,94.65404,6.332729
std,2.614807,208.574111,1.474978
min,2010.0,3.0,1.0
25%,2012.0,81.0,5.5
50%,2014.0,91.0,6.5
75%,2016.0,104.0,7.4
max,2019.0,51420.0,10.0


In [16]:
# Second DataSet 
gross.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [17]:
# Third Dataset
budget.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0


In [18]:
# Fourth DataSet
movie_info.describe()

Unnamed: 0,id
count,1560.0
mean,1007.303846
std,579.164527
min,1.0
25%,504.75
50%,1007.5
75%,1503.25
max,2000.0


###### .shape Attribute
This returns a tuple of (number of rows, number of columns).

In [19]:
# First DataSet
imdb.shape

(73856, 5)

In [20]:
# Second DataSet
gross.shape

(3387, 5)

In [21]:
# Third DataSet
budget.shape

(5782, 6)

In [22]:
# Fourth DataSet
movie_info.shape

(1560, 12)

We are working with four datasets in this project. The first, **imdb**, has 73856 rows and 5 columns and was sourced from the internet. It contains float, integer, and object columns. Its data helps us analyse the `genres, titles, runtime minutes and ratings of movies`.

The second dataset, **gross**, has 3387 rows and 5 columns and was sourced from the internet. It also contains float, integer, and object columns. It helps us analyse the income generated, since it contains columns with `domestic and foreign gross data`.

The third dataset, **budget**, has 5782 rows and 6 columns and was sourced from the internet. Its columns are integers and objects only. It includes the cost of producing each movie in `production_budget`, and its date column (`release_date`) lets us look at seasonal trends.

The last dataset, **movie_info**, has 1560 rows and 12 columns and was sourced from the internet. Its columns are integers and objects only. It has information about `writer, director, studio`, etc., that can help us make informed recommendations at the end of the project.
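
For the seasonal angle mentioned for the **budget** dataset, the `release_date` strings would first need parsing. The sketch below is one possible approach, assuming the dates follow the `"Dec 18, 2009"` pattern seen in `budget.head()`; `demo_budget`, `release_month`, and `release_year` are hypothetical names introduced here.

```python
import pandas as pd

# Three rows copied from budget.head(), keeping only the relevant columns.
demo_budget = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"],
    "movie": [
        "Avatar",
        "Pirates of the Caribbean: On Stranger Tides",
        "Dark Phoenix",
    ],
})

# Parse "Mon D, YYYY" strings into timestamps, then derive month and year.
parsed = pd.to_datetime(demo_budget["release_date"], format="%b %d, %Y")
demo_budget["release_month"] = parsed.dt.month
demo_budget["release_year"] = parsed.dt.year
print(demo_budget[["movie", "release_month", "release_year"]])
```

Grouping by `release_month` would then allow a comparison of gross figures across release windows.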


# Data cleaning
Modify the datasets to correct erroneous data and remove redundancies.

PART A

- Check for `missing values`

- Check for `duplicates`

PART B

- Decide whether to fill or drop missing values and duplicates

- Drop columns based on the two checks in Part A or on irrelevance
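
The two Part A checks can be bundled into one helper so that every dataset is inspected the same way. This is a minimal sketch, not the notebook's own code: `quality_report` and the toy dataframe `demo` are hypothetical.

```python
import pandas as pd


def quality_report(df):
    """Return (percentage of missing values per column, number of duplicated rows)."""
    missing_pct = df.isnull().mean() * 100
    duplicate_rows = int(df.duplicated().sum())
    return missing_pct, duplicate_rows


# Toy data: one fully duplicated row and one missing runtime.
demo = pd.DataFrame({
    "title": ["Film A", "Film B", "Film B", "Film C"],
    "runtime_minutes": [90.0, 100.0, 100.0, None],
})

missing_pct, duplicate_rows = quality_report(demo)
print(missing_pct)
print("duplicate rows:", duplicate_rows)
```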

#### Part A 
##### Checking for missing values in DataSets

In [23]:
## First DataSet %missing_Values
imdb.isnull().mean()*100

primary_title       0.000000
start_year          0.000000
runtime_minutes    10.317374
genres              1.088605
averagerating       0.000000
dtype: float64

In [24]:
# Second DataSet %missing_Values
gross.isnull().mean()*100

title              0.000000
studio             0.147623
domestic_gross     0.826690
foreign_gross     39.858282
year               0.000000
dtype: float64

In [25]:
# Third DataSet %missing_Values
budget.isnull().mean()*100

id                   0.0
release_date         0.0
movie                0.0
production_budget    0.0
domestic_gross       0.0
worldwide_gross      0.0
dtype: float64

In [26]:
# Fourth DataSet %missing_Values
movie_info.isnull().mean()*100

id               0.000000
synopsis         3.974359
rating           0.192308
genre            0.512821
director        12.756410
writer          28.782051
theater_date    23.012821
dvd_date        23.012821
currency        78.205128
box_office      78.205128
runtime          1.923077
studio          68.333333
dtype: float64

##### Checking for duplicates in DataSets

In [27]:
# FirstDataSet -duplicates--has 1 duplicate
imdb.duplicated().value_counts()

False    73855
True         1
dtype: int64

In [28]:
# SecondDataSet - duplicates--has no duplicates
gross.duplicated().value_counts()

False    3387
dtype: int64

In [29]:
# ThirdDataSet -duplicates--has no duplicates
budget.duplicated().value_counts()

False    5782
dtype: int64

In [30]:
# FourthDataSet -duplicates--has no duplicates
movie_info.duplicated().value_counts()

False    1560
dtype: int64

#### Part B
- Drop `duplicates`

- Drop `rows` with missing values

- Drop `columns` based on relevance and consistency

In [31]:
# code drops duplicated rows in the imdb DataSet
imdb.drop_duplicates(inplace=True)
imdb.duplicated().value_counts()

False    73855
dtype: int64

In [32]:
# Drop rows with missing values in runtime_minutes or genres (in place)
imdb.dropna(subset=['runtime_minutes', 'genres'], inplace=True)
imdb

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating
0,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0
1,One Day Before the Rainy Season,2019,114.0,"Biography,Drama",7.2
2,The Other Side of the Wind,2018,122.0,Drama,6.9
4,The Wandering Soap Opera,2017,80.0,"Comedy,Drama,Fantasy",6.5
6,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy",8.1
...,...,...,...,...,...
73849,Padmavyuhathile Abhimanyu,2019,130.0,Drama,8.4
73850,Swarm Season,2019,86.0,Documentary,6.2
73851,Diabolik sono io,2019,75.0,Documentary,6.2
73852,Sokagin Çocuklari,2019,98.0,"Drama,Family",8.7


In [33]:
# Check to confirm no more missing values in dataframe
imdb.isna().sum()

primary_title      0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
dtype: int64
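
Part B mentions deciding whether to fill or drop missing values. The notebook drops them; the sketch below contrasts that with one possible fill strategy (median for runtimes, an `"Unknown"` label for genres) on made-up data. The fill choices are assumptions for illustration, not a recommendation taken from the analysis.

```python
import pandas as pd

# Toy data with one missing runtime and one missing genre.
demo = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film C", "Film D"],
    "runtime_minutes": [80.0, None, 100.0, 120.0],
    "genres": ["Drama", "Comedy", None, "Action"],
})

# Option 1: drop any row missing either value (what the notebook does).
dropped = demo.dropna(subset=["runtime_minutes", "genres"])

# Option 2: keep every row and fill the gaps instead.
filled = demo.copy()
filled["runtime_minutes"] = filled["runtime_minutes"].fillna(
    filled["runtime_minutes"].median()
)
filled["genres"] = filled["genres"].fillna("Unknown")

print(len(dropped), "rows after dropping;", len(filled), "rows after filling")
```

Dropping keeps only observed values at the cost of rows, while filling keeps rows at the cost of introducing estimated values.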

##### Dropping Columns
- `movie_info` and `budget` share a column named `id`.

- On further analysis, the two `id` columns have no correlation with each other. To retain consistency, we drop `id` from both datasets.

In [34]:
## drop 'id' column from budget DataSet
budget.drop(columns=['id'], inplace=True)

###### For movie_info we drop the columns `currency` and `box_office`, since about 78% of their values are missing and they would cause inconsistencies in our data; `id` is dropped together with them

In [35]:
#We drop columns that have been noted 
movie_info.drop(columns=['currency', 'box_office', 'id'], inplace=True)

### Data Transformation
- We convert datatypes in our datasets to enable numeric manipulation.

- `gross['foreign_gross']` and `budget[['domestic_gross', 'production_budget', 'worldwide_gross']]` are stored with the `object` datatype, which means the figures are read as strings rather than numbers. We strip the `$` and `,` characters and convert the columns to `float` so that they can be used in calculations.

In [36]:
# Second DataSet-- gross
# Changing data type from an object to a float

gross['foreign_gross'] = gross['foreign_gross'].replace({r'\$': '', ',': ''}, regex=True).astype(float)
gross.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   float64
 4   year            3387 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 132.4+ KB


###### Converting budget columns `domestic_gross`, `production_budget`, `worldwide_gross` to `float datatype`


In [37]:
# List of columns to clean and convert to float
columns_converted = ['domestic_gross', 'production_budget', 'worldwide_gross']

# Loop through each column, replace the unwanted characters, and convert to float
for column in columns_converted:
    budget[column] = budget[column].replace({r'\$': '', ',': ''}, regex=True).astype(float)

In [38]:
budget.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   release_date       5782 non-null   object 
 1   movie              5782 non-null   object 
 2   production_budget  5782 non-null   float64
 3   domestic_gross     5782 non-null   float64
 4   worldwide_gross    5782 non-null   float64
dtypes: float64(3), object(2)
memory usage: 226.0+ KB
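
The same cleaning step could be wrapped in a reusable helper. The sketch below is an alternative to the `.replace(...).astype(float)` pattern, not the notebook's method: it uses `pd.to_numeric(..., errors="coerce")`, which turns entries that still cannot be parsed into `NaN` instead of raising. `currency_to_float` and the sample series are hypothetical; the two dollar amounts are taken from `budget.head()`.

```python
import pandas as pd


def currency_to_float(series):
    """Strip '$' and ',' from text values and convert them to numbers."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" maps anything unparseable to NaN rather than raising.
    return pd.to_numeric(cleaned, errors="coerce")


raw = pd.Series(["$425,000,000", "$42,762,350", None, "n/a"])
converted = currency_to_float(raw)
print(converted)
```

Coercing silently hides bad entries, so it is worth counting the resulting `NaN` values before relying on the converted column.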


# Data Merging
- With cleaned versions of the data, we combine the datasets using `.merge`.

- Merging dataframes combines multiple datasets into a single, unified dataset, allowing for a more comprehensive analysis.

- The merged dataset lets us build visualizations that support our objectives.

##### First Merge
- The `gross` and `imdb` datasets

- We perform an inner join, which keeps only the rows whose keys match in both dataframes. The keys are:

- `gross['title']` and `imdb['primary_title']`

In [39]:
## merge imdb & gross
mer = pd.merge(gross, imdb, left_on='title', right_on='primary_title', how='inner')
mer

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,primary_title,start_year,runtime_minutes,genres,averagerating
0,Toy Story 3,BV,415000000.0,652000000.0,2010,Toy Story 3,2010,103.0,"Adventure,Animation,Comedy",8.3
1,Inception,WB,292600000.0,535700000.0,2010,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8
2,Shrek Forever After,P/DW,238700000.0,513900000.0,2010,Shrek Forever After,2010,93.0,"Adventure,Animation,Comedy",6.3
3,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000.0,2010,The Twilight Saga: Eclipse,2010,124.0,"Adventure,Drama,Fantasy",5.0
4,Iron Man 2,Par.,312400000.0,311500000.0,2010,Iron Man 2,2010,124.0,"Action,Adventure,Sci-Fi",7.0
...,...,...,...,...,...,...,...,...,...,...
2970,Souvenir,Strand,11400.0,,2018,Souvenir,2016,90.0,"Drama,Music,Romance",6.0
2971,Souvenir,Strand,11400.0,,2018,Souvenir,2014,86.0,"Comedy,Romance",5.9
2972,Beauty and the Dogs,Osci.,8900.0,,2018,Beauty and the Dogs,2017,100.0,"Crime,Drama,Thriller",7.0
2973,The Quake,Magn.,6200.0,,2018,The Quake,2018,106.0,"Action,Drama,Thriller",6.2
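
In the output above, the single `gross` row for "Souvenir" is matched to two IMDb entries from different years. One possible refinement, sketched below on a few rows copied from the outputs, is to join on title and year together. This is an assumption-laden alternative rather than the notebook's approach: it also discards true matches whenever the two sources record different years for the same film.

```python
import pandas as pd

# Rows copied from the merge output above (subset of columns).
demo_gross = pd.DataFrame({
    "title": ["Toy Story 3", "Souvenir"],
    "year": [2010, 2018],
    "domestic_gross": [415000000.0, 11400.0],
})
demo_imdb = pd.DataFrame({
    "primary_title": ["Toy Story 3", "Souvenir", "Souvenir"],
    "start_year": [2010, 2016, 2014],
    "averagerating": [8.3, 6.0, 5.9],
})

# Title-only join: every same-titled IMDb entry is matched.
by_title = pd.merge(
    demo_gross, demo_imdb,
    left_on="title", right_on="primary_title", how="inner",
)

# Title-and-year join: stricter, keeps only rows where both keys agree.
by_title_year = pd.merge(
    demo_gross, demo_imdb,
    left_on=["title", "year"], right_on=["primary_title", "start_year"],
    how="inner",
)

print(len(by_title), "rows by title;", len(by_title_year), "rows by title and year")
```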


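##### Second Merge
- The second merge combines the newly merged dataset with `movie_info`.

- `mer` (gross + imdb) + `movie_info`

- We merge with an inner join (the default for `pd.merge`) on the shared column `studio`. Rows are matched on the studio label only, so the `movie_info` columns attached to a row (synopsis, director, and so on) may describe a different film from the one in `title`.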


In [40]:
#Merge the datasets on common columns---(gross,imdb) + movie_info
merged_df = pd.merge(mer, movie_info, on='studio')
merged_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,primary_title,start_year,runtime_minutes,genres,averagerating,synopsis,rating,genre,director,writer,theater_date,dvd_date,runtime
0,Inception,WB,292600000.0,535700000.0,2010,Inception,2010,148.0,"Action,Adventure,Sci-Fi",8.8,"Directed by Clint Eastwood, the mysterious dra...",R,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes
1,Due Date,WB,100500000.0,111200000.0,2010,Due Date,2010,95.0,"Adventure,Comedy",6.5,"Directed by Clint Eastwood, the mysterious dra...",R,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes
2,Yogi Bear,WB,100200000.0,101300000.0,2010,Yogi Bear,2010,80.0,"Adventure,Animation,Comedy",4.6,"Directed by Clint Eastwood, the mysterious dra...",R,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes
3,The Book of Eli,WB,94800000.0,62300000.0,2010,The Book of Eli,2010,118.0,"Action,Adventure,Drama",6.9,"Directed by Clint Eastwood, the mysterious dra...",R,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes
4,The Town,WB,92200000.0,61800000.0,2010,The Town,2010,125.0,"Crime,Drama,Thriller",7.6,"Directed by Clint Eastwood, the mysterious dra...",R,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3649,A Ghost Story,A24,1600000.0,,2017,A Ghost Story,2017,92.0,"Drama,Fantasy,Romance",6.8,Imagine the end of the world. Now imagine some...,R,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes
3650,Trespass Against Us,A24,5700.0,,2017,Trespass Against Us,2016,99.0,"Action,Crime,Drama",5.8,Imagine the end of the world. Now imagine some...,R,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes
3651,Hereditary,A24,44100000.0,35300000.0,2018,Hereditary,2018,127.0,"Drama,Horror,Mystery",7.3,Imagine the end of the world. Now imagine some...,R,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes
3652,The Children Act,A24,548000.0,17000000.0,2018,The Children Act,2017,105.0,Drama,6.7,Imagine the end of the world. Now imagine some...,R,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes
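
The last row index grows from 2973 after the first merge to 3652 here, which suggests that some studio labels repeat on both sides of the join. The sketch below uses made-up rows to show how a non-unique key multiplies rows, and how the `validate` argument of `pd.merge` can be used to test whether a key is unique. The director names are placeholders.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy data: the key "WB" appears twice on each side.
demo_mer = pd.DataFrame({
    "title": ["Inception", "Due Date"],
    "studio": ["WB", "WB"],
})
demo_info = pd.DataFrame({
    "studio": ["WB", "WB"],
    "director": ["Director X", "Director Y"],
})

# Many-to-many join: 2 left rows x 2 right rows = 4 combined rows.
many_to_many = pd.merge(demo_mer, demo_info, on="studio")

# validate="many_to_one" raises MergeError if the right-hand key repeats.
try:
    pd.merge(demo_mer, demo_info, on="studio", validate="many_to_one")
    right_key_is_unique = True
except MergeError:
    right_key_is_unique = False

print(len(many_to_many), "rows; right key unique:", right_key_is_unique)
```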


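#### Final Merge: gross + imdb + info + budget
- Combine all four datasets into a single dataframe.

- `merged_df` (gross + imdb + info) + `budget`

- We merge with an inner join on the shared column `domestic_gross`. Rows are paired when their `domestic_gross` values are equal, not by title, so the `title` and `movie` columns can name different films.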

In [41]:
#The next merge
#Merge the datasets on common columns---gross.imdb,info,budget
merged_df2 = pd.merge(merged_df, budget, on='domestic_gross')
merged_df2

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,primary_title,start_year,runtime_minutes,genres,averagerating,...,genre,director,writer,theater_date,dvd_date,runtime,release_date,movie,production_budget,worldwide_gross
0,The Losers,WB,23600000.0,5800000.0,2010,The Losers,2010,97.0,"Action,Adventure,Crime",6.4,...,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes,"Nov 21, 1946",The Best Years of Our Lives,2100000.0,23600000.0
1,The Losers,WB,23600000.0,5800000.0,2010,The Losers,2013,112.0,Drama,1.7,...,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes,"Nov 21, 1946",The Best Years of Our Lives,2100000.0,23600000.0
2,Blended,WB,46300000.0,81700000.0,2014,Blended,2014,117.0,"Comedy,Romance",6.5,...,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes,"Nov 7, 1963",It's a Mad Mad Mad Mad World,9400000.0,60000000.0
3,Jersey Boys,WB,47000000.0,20600000.0,2014,Jersey Boys,2014,134.0,"Biography,Drama,Music",6.8,...,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes,"Oct 17, 1978",Halloween,325000.0,70000000.0
4,Dolphin Tale 2,WB,42000000.0,15800000.0,2014,Dolphin Tale 2,2014,107.0,"Drama,Family",6.4,...,Drama|Mystery and Suspense,Clint Eastwood,Brian Helgeland,"Oct 8, 2003","Jun 8, 2004",137 minutes,"Oct 17, 1956",Around the World in 80 Days,6000000.0,42000000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,Obvious Child,A24,3100000.0,,2014,Obvious Child,2014,84.0,"Comedy,Drama,Romance",6.8,...,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes,"Aug 30, 1972",The Last House on the Left,87000.0,3100000.0
86,While We're Young,A24,7600000.0,9700000.0,2015,While We're Young,2014,97.0,"Comedy,Drama,Mystery",6.3,...,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes,"Sep 25, 1961",The Hustler,2000000.0,7600000.0
87,Amy,A24,8400000.0,,2015,Amy,2015,128.0,"Biography,Documentary,Music",7.8,...,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes,"Oct 19, 1979",Meteor,16000000.0,8400000.0
88,Amy,A24,8400000.0,,2015,Amy,2013,94.0,Horror,1.9,...,Drama|Horror,Trey Edward Shults,Trey Edward Shults,"Jun 9, 2017","Sep 12, 2017",91 minutes,"Oct 19, 1979",Meteor,16000000.0,8400000.0


##### Final Analysis of our new dataframe

In [42]:
# has 90 rows & 22 columns
merged_df2.shape

(90, 22)

In [43]:
# Final COLUMN NAMES
merged_df2.columns

Index(['title', 'studio', 'domestic_gross', 'foreign_gross', 'year',
       'primary_title', 'start_year', 'runtime_minutes', 'genres',
       'averagerating', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'runtime', 'release_date', 'movie',
       'production_budget', 'worldwide_gross'],
      dtype='object')

In [44]:
# Final DESCRIPTIVE STATISTICS -percentile, mean , standard deviation
merged_df2.describe()

Unnamed: 0,domestic_gross,foreign_gross,year,start_year,runtime_minutes,averagerating,production_budget,worldwide_gross
count,90.0,61.0,90.0,90.0,90.0,90.0,90.0,90.0
mean,12931290.0,84848840.0,2014.522222,2014.388889,105.122222,6.621111,8675615.0,17268530.0
std,20743530.0,168130400.0,2.678369,2.45725,15.887121,1.169989,13983110.0,32544790.0
min,5000.0,600.0,2010.0,2010.0,65.0,1.7,87000.0,5000.0
25%,2000000.0,11100000.0,2012.0,2012.0,95.0,6.2,1200000.0,2000000.0
50%,3850000.0,16600000.0,2015.0,2014.0,104.5,6.9,3742866.0,4358000.0
75%,8400000.0,68900000.0,2017.0,2017.0,116.0,7.3,10750000.0,12300000.0
max,86300000.0,542100000.0,2018.0,2019.0,180.0,8.1,110000000.0,195300000.0
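
With the budget and gross columns now numeric, the profitability objective could be approached by deriving profit and return on investment. The sketch below uses two rows from `budget.head()`; the definitions (worldwide gross minus production budget, and that difference divided by the budget) are simplifying assumptions that ignore marketing and distribution costs, and the `profit` and `roi` column names are hypothetical.

```python
import pandas as pd

# Two rows copied from budget.head(), already converted to floats.
demo_budget = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425000000.0, 350000000.0],
    "worldwide_gross": [2776345279.0, 149762350.0],
})

# Profit: what the film grossed worldwide beyond its production budget.
demo_budget["profit"] = (
    demo_budget["worldwide_gross"] - demo_budget["production_budget"]
)
# ROI: profit expressed as a multiple of the production budget.
demo_budget["roi"] = demo_budget["profit"] / demo_budget["production_budget"]

print(demo_budget[["movie", "profit", "roi"]])
```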
