## Final Project Submission

Please fill out:
* Student names: Charles Odhiambo, Savins Nanyaemuny, Tracy Gwehona, Amos Ledama, Brian Siele, Lilian Kaburo
* Student pace: full time
* Instructor name: Mwikali
* Blog post URL:


## Business Understanding

### Overview

Within today's highly competitive entertainment market, companies attract audiences and revenues with unique video content. The company has decided to open a new movie studio in order to take advantage of this trend, although it has little prior experience producing films.
This project analyzes current box office trends and the movie genres that are performing well. The analysis aims to identify key success factors related to genre, budget, release window, and audience preference.
The insights from this analysis will be translated into actionable recommendations to guide the new movie studio in choosing the right types of films to produce for optimal performance in the market.

### Business Problem

The company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies.

We aim to:
1. Identify film genres and types with strong box office performance.
2. Evaluate factors such as budget, release timing, and audience reception that influence a movie’s success.
3. Analyze data from multiple sources, including Box Office Mojo, IMDB, Rotten Tomatoes, TheMovieDB, and The Numbers.
4. Provide recommendations on the types of films to produce based on the analysis results.

The findings will help the head of the company's new movie studio decide what type of films to create.

## Data Understanding

The datasets used in this project were obtained from several sources: [IMDB](https://www.imdb.com/), [Box Office Mojo](https://www.boxofficemojo.com/), [Rotten Tomatoes](https://www.rottentomatoes.com/), [The Numbers](https://www.the-numbers.com/) and [TheMovieDB](https://www.themoviedb.org/). They contain records of films, including their genres, ratings, reviews, production years and more. We start by reviewing the properties of each dataset to see what it includes.

First, import the various libraries required.

In [1]:
# Import the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import sqlite3
import zipfile

#### 1) Loading the various datasets.

The IMDB database is in a zip file.

In [2]:
# # Extract the IMDB database

# def unzip_data(filename):
#     zip_ref = zipfile.ZipFile(filename, "r")
#     zip_ref.extractall()
#     zip_ref.close()

# unzip_data('zippedData/im.db.zip')

# # Connect to sqlite3

# path = 'im.db'

# conn = sqlite3.connect(path)

Looking at the tables in the database's ERD, 'movie_basics' and 'movie_ratings' seem the most relevant for our analysis.

In [3]:
# Join the two tables 

# q = """
# SELECT * 
# FROM movie_basics
# JOIN movie_ratings
#     USING(movie_id)
# ;
# """

# imdb = pd.read_sql(q, conn)

Closing the connection to the SQLite database.

In [4]:
# conn.close()
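The connect, join and close steps above can be tried out without the real `im.db` file. The following is a minimal sketch on an in-memory SQLite database; the two toy tables and their rows are made up for illustration and only borrow the table and column names used in this notebook.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy stand-ins for the two IMDB tables; the rows are invented for illustration.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
    conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")
    conn.executemany(
        "INSERT INTO movie_basics VALUES (?, ?)",
        [("tt1", "Film A"), ("tt2", "Film B")],
    )
    conn.execute("INSERT INTO movie_ratings VALUES ('tt1', 7.5)")

    # List the tables available in the database.
    tables = pd.read_sql(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;", conn
    )

    # Inner join: only movies that also have a rating are kept.
    imdb_demo = pd.read_sql(
        "SELECT * FROM movie_basics JOIN movie_ratings USING (movie_id);", conn
    )

print(tables)
print(imdb_demo)
```

Wrapping the connection in `closing` closes it even if a query fails, which avoids the need for a separate `conn.close()` cell.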

Loading the second dataset, the Box Office Mojo gross figures.

In [5]:
# Load the data from the compressed CSV file
bom = pd.read_csv('zippedData/bom.movie_gross.csv.gz', compression='gzip')

Loading the third dataset, the movie budgets from The Numbers.

In [6]:
tn = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip')

Loading the fourth dataset, the TheMovieDB movie records.

In [7]:
tmdb = pd.read_csv('zippedData/tmdb.movies.csv.gz', compression='gzip', index_col=0)
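As a small, self-contained illustration of the loading step, the sketch below writes a toy frame to a gzip-compressed CSV in a temporary directory and reads it back. The file name `demo.csv.gz` and the rows are hypothetical; the point is only that `index_col=0` restores a saved index and that pandas can infer gzip compression from the `.gz` suffix.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, not taken from the project files.
demo = pd.DataFrame({"title": ["Film A", "Film B"], "year": [2010, 2011]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv.gz")
    # The index is written as the first, unnamed column.
    demo.to_csv(path, compression="gzip")
    # compression defaults to 'infer', so the .gz suffix is enough here;
    # index_col=0 turns the first column back into the index.
    loaded = pd.read_csv(path, index_col=0)

print(loaded)
```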


#### 2) .head()

The .head() method previews the first five rows of a dataset, showing what the values in each column look like.

In [8]:
# imdb.head()

In [9]:
bom.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [10]:
tn.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [11]:
tmdb.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


#### 3) .info()

For each dataset, .info() shows the column names and their data types, the shape of the dataframe (number of rows and columns), and which columns have missing values.

In [12]:
# imdb.info()

In [13]:
bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [14]:
tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [15]:
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 2.0+ MB
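The facts that .info() prints can also be pulled out as values, which is handy when they feed later cleaning steps. A minimal sketch on a toy frame (the rows are invented; only the column names echo the bom dataset):

```python
import numpy as np
import pandas as pd

# Invented rows; column names mirror the bom dataset.
demo = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "studio": ["BV", None, "WB"],
    "domestic_gross": [1.0e6, np.nan, 2.5e6],
})

n_rows, n_cols = demo.shape      # size of the dataframe
missing = demo.isna().sum()      # missing values per column
dtypes = demo.dtypes             # data type of each column

print(n_rows, n_cols)
print(missing)
print(dtypes)
```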


#### 4) .describe()

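Here we look at the descriptive statistics of the numerical columns in each dataset. Columns stored as strings, such as the dollar amounts in tn, are left out of the summary until they are converted to numbers.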

In [16]:
# imdb.describe()

In [17]:
bom.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [18]:
tn.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0


In [19]:
tmdb.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0
mean,295050.15326,3.130912,5.991281,194.224837
std,153661.615648,4.355229,1.852946,960.961095
min,27.0,0.6,0.0,1.0
25%,157851.0,0.6,5.0,2.0
50%,309581.0,1.374,6.0,5.0
75%,419542.0,3.694,7.0,28.0
max,608444.0,80.773,10.0,22186.0
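The tn summary above contains only `id` because the money columns are stored as strings. The sketch below reproduces that behaviour on a three-row toy frame using the budget strings shown in `tn.head()`, and shows that `include="all"` at least reports counts and unique values for string columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": [1, 2, 3],
    "production_budget": ["$425,000,000", "$410,600,000", "$350,000,000"],
})

# Default describe() keeps numeric columns only.
numeric_summary = demo.describe()

# include="all" adds count/unique/top/freq for the string column.
full_summary = demo.describe(include="all")

print(numeric_summary)
print(full_summary)
```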


## Data Preparation

### Data Merging

In [20]:
# Data merging
# Check for missing values
# Duplicates
# Change the column names
# Column to drop
# Create a csv file and store the data

In [21]:
# Merge imdb with tmdb, then tn and bom, matching on the movie title
# merged_df =pd.merge(imdb,tmdb,how='inner',left_on='primary_title',right_on='title')
# merged_df=merged_df.merge(tn,how='inner',left_on='primary_title',right_on='movie').merge(
#   bom,how='inner',on='title'  
#  )

In [22]:
# merged_df.info()

In [23]:
# Write to a csv
# merged_df.to_csv('Merged_data.csv',index=False)
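To illustrate what a title-based inner merge does, here is a minimal sketch on toy frames. The `*_demo` names are hypothetical and the frames hold only a few values copied from the outputs in this notebook. It shows two effects that appear in the merged data: a title that occurs several times on one side is repeated in the result, and a column name shared by both sides receives `_x` and `_y` suffixes.

```python
import pandas as pd

imdb_demo = pd.DataFrame({
    "primary_title": ["On the Road", "On the Road", "The Secret Life of Walter Mitty"],
    "start_year": [2012, 2014, 2013],
})
tn_demo = pd.DataFrame({
    "movie": ["On the Road", "The Secret Life of Walter Mitty"],
    "domestic_gross": ["$720,828", "$58,236,838"],
})
bom_demo = pd.DataFrame({
    "title": ["On the Road", "The Secret Life of Walter Mitty"],
    "domestic_gross": [744000.0, 58200000.0],
})

merged_demo = (
    imdb_demo
    .merge(tn_demo, how="inner", left_on="primary_title", right_on="movie")
    .merge(bom_demo, how="inner", left_on="primary_title", right_on="title")
)

print(merged_demo.columns.tolist())
print(merged_demo)
```

Because different films can share a title, matching on title alone can attach one film's budget and gross to another film's IMDB record, so the merged rows deserve a check before analysis.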

### Data cleaning

In [24]:
# Check for missing values
# Duplicates
# Change the column names
# Column to drop

In [25]:
# Import the data
merged_data = pd.read_csv('Merged_data.csv')
merged_data.head()

Unnamed: 0,movie_id,primary_title,original_title_x,start_year,runtime_minutes,genres,averagerating,numvotes,genre_ids,id_x,...,id_y,release_date_y,movie,production_budget,domestic_gross_x,worldwide_gross,studio,domestic_gross_y,foreign_gross,year
0,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,"[12, 18]",83770,...,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302",IFC,744000.0,8000000,2012
1,tt4339118,On the Road,On the Road,2014,89.0,Drama,6.0,6,"[12, 18]",83770,...,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302",IFC,744000.0,8000000,2012
2,tt5647250,On the Road,On the Road,2016,121.0,Drama,5.7,127,"[12, 18]",83770,...,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302",IFC,744000.0,8000000,2012
3,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,"[12, 35, 18, 14]",116745,...,37,"Dec 25, 2013",The Secret Life of Walter Mitty,"$91,000,000","$58,236,838","$187,861,183",Fox,58200000.0,129900000,2013
4,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",6.5,105116,"[80, 18, 9648, 53]",169917,...,67,"Sep 19, 2014",A Walk Among the Tombstones,"$28,000,000","$26,017,685","$62,108,587",Uni.,26300000.0,26900000,2014


In [26]:
# List the columns to decide which ones to drop
merged_data.columns

Index(['movie_id', 'primary_title', 'original_title_x', 'start_year',
       'runtime_minutes', 'genres', 'averagerating', 'numvotes', 'genre_ids',
       'id_x', 'original_language', 'original_title_y', 'popularity',
       'release_date_x', 'title', 'vote_average', 'vote_count', 'id_y',
       'release_date_y', 'movie', 'production_budget', 'domestic_gross_x',
       'worldwide_gross', 'studio', 'domestic_gross_y', 'foreign_gross',
       'year'],
      dtype='object')

In [27]:
# Remove duplicated and redundant columns
remove_col = ['original_title_x', 'id_x', 'release_date_x', 'id_y', 'domestic_gross_x',
              'movie_id', 'title', 'movie', 'genre_ids']

merged_data.drop(columns=remove_col, inplace=True)



In [28]:
# Rename the columns
merged_data.rename(
    columns={
        'original_title_y': 'original_title',
        'release_date_y': 'release_date',
        'domestic_gross_y': 'domestic_gross',
    },
    inplace=True,
)

In [29]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1794 entries, 0 to 1793
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   primary_title      1794 non-null   object 
 1   start_year         1794 non-null   int64  
 2   runtime_minutes    1756 non-null   float64
 3   genres             1783 non-null   object 
 4   averagerating      1794 non-null   float64
 5   numvotes           1794 non-null   int64  
 6   original_language  1794 non-null   object 
 7   original_title     1794 non-null   object 
 8   popularity         1794 non-null   float64
 9   vote_average       1794 non-null   float64
 10  vote_count         1794 non-null   int64  
 11  release_date       1794 non-null   object 
 12  production_budget  1794 non-null   object 
 13  worldwide_gross    1794 non-null   object 
 14  studio             1794 non-null   object 
 15  domestic_gross     1793 non-null   float64
 16  foreign_gross      1507 

In [30]:
# Select the object dtypes
merged_data.select_dtypes(include='object').head()

Unnamed: 0,primary_title,genres,original_language,original_title,release_date,production_budget,worldwide_gross,studio,foreign_gross
0,On the Road,"Adventure,Drama,Romance",en,On the Road,"Mar 22, 2013","$25,000,000","$9,313,302",IFC,8000000
1,On the Road,Drama,en,On the Road,"Mar 22, 2013","$25,000,000","$9,313,302",IFC,8000000
2,On the Road,Drama,en,On the Road,"Mar 22, 2013","$25,000,000","$9,313,302",IFC,8000000
3,The Secret Life of Walter Mitty,"Adventure,Comedy,Drama",en,The Secret Life of Walter Mitty,"Dec 25, 2013","$91,000,000","$187,861,183",Fox,129900000
4,A Walk Among the Tombstones,"Action,Crime,Drama",en,A Walk Among the Tombstones,"Sep 19, 2014","$28,000,000","$62,108,587",Uni.,26900000


In [31]:
# Convert the monetary columns from strings to floats

# Remove dollar signs and commas from production_budget
merged_data['production_budget'] = (
    merged_data['production_budget'].astype(str).str.replace(r'[$,]', '', regex=True).astype(float)
)

# Remove dollar signs and commas from worldwide_gross
merged_data['worldwide_gross'] = (
    merged_data['worldwide_gross'].astype(str).str.replace(r'[$,]', '', regex=True).astype(float)
)

# Remove commas from foreign_gross (missing values stay NaN)
merged_data['foreign_gross'] = (
    merged_data['foreign_gross'].astype(str).str.replace(',', '', regex=False).astype(float)
)
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1794 entries, 0 to 1793
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   primary_title      1794 non-null   object 
 1   start_year         1794 non-null   int64  
 2   runtime_minutes    1756 non-null   float64
 3   genres             1783 non-null   object 
 4   averagerating      1794 non-null   float64
 5   numvotes           1794 non-null   int64  
 6   original_language  1794 non-null   object 
 7   original_title     1794 non-null   object 
 8   popularity         1794 non-null   float64
 9   vote_average       1794 non-null   float64
 10  vote_count         1794 non-null   int64  
 11  release_date       1794 non-null   object 
 12  production_budget  1794 non-null   float64
 13  worldwide_gross    1794 non-null   float64
 14  studio             1794 non-null   object 
 15  domestic_gross     1793 non-null   float64
 16  foreign_gross      1507 
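The same clean-up can be wrapped in one reusable helper. This is a minimal sketch, not the notebook's own method: `to_number` is a hypothetical name, and the toy frame reuses a few values from the outputs above plus one missing entry.

```python
import pandas as pd


def to_number(column: pd.Series) -> pd.Series:
    """Strip '$' and ',' and convert to float; unparseable or missing values become NaN."""
    cleaned = column.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce").astype(float)


demo = pd.DataFrame({
    "production_budget": ["$25,000,000", "$91,000,000"],
    "foreign_gross": ["8000000", None],
})

for col in ["production_budget", "foreign_gross"]:
    demo[col] = to_number(demo[col])

print(demo.dtypes)
print(demo)
```

Note that `errors="coerce"` silently turns any malformed string into NaN, so it is worth comparing the missing-value counts before and after the conversion.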

In [None]:
# Preview the cleaned production_budget values; reading the column directly keeps this cell runnable after the conversion above
merged_data['production_budget'].astype(str).str.replace(r'[$,]', '', regex=True).astype(float)

In [33]:
# Check for duplicates
merged_data.drop_duplicates().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1634 entries, 0 to 1793
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   primary_title      1634 non-null   object 
 1   start_year         1634 non-null   int64  
 2   runtime_minutes    1596 non-null   float64
 3   genres             1623 non-null   object 
 4   averagerating      1634 non-null   float64
 5   numvotes           1634 non-null   int64  
 6   original_language  1634 non-null   object 
 7   original_title     1634 non-null   object 
 8   popularity         1634 non-null   float64
 9   vote_average       1634 non-null   float64
 10  vote_count         1634 non-null   int64  
 11  release_date       1634 non-null   object 
 12  production_budget  1634 non-null   float64
 13  worldwide_gross    1634 non-null   float64
 14  studio             1634 non-null   object 
 15  domestic_gross     1633 non-null   float64
 16  foreign_gross      1379 

In [34]:
merged_data[merged_data.duplicated()].head()

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes,original_language,original_title,popularity,vote_average,vote_count,release_date,production_budget,worldwide_gross,studio,domestic_gross,foreign_gross,year
67,Dallas Buyers Club,2013,117.0,"Biography,Drama",8.0,402462,en,Dallas Buyers Club,12.389,7.9,4961,"Nov 1, 2013",5000000.0,60611845.0,Focus,27300000.0,27900000.0,2013
158,Justice League,2017,120.0,"Action,Adventure,Fantasy",6.5,329135,en,Justice League,34.953,6.2,7510,"Nov 17, 2017",300000000.0,655945209.0,WB,229000000.0,428900000.0,2017
194,Goosebumps,2015,103.0,"Adventure,Comedy,Family",6.3,72858,en,Goosebumps,18.957,6.2,2147,"Oct 16, 2015",58000000.0,158905324.0,Sony,80100000.0,70100000.0,2015
235,Blue Valentine,2010,112.0,"Drama,Romance",7.4,170089,en,Blue Valentine,8.994,6.9,1677,"Dec 29, 2010",1000000.0,16566240.0,Wein.,9700000.0,2600000.0,2010
281,Doctor Strange,2016,115.0,"Action,Adventure,Fantasy",7.5,514510,en,Doctor Strange,33.035,7.3,12582,"Nov 4, 2016",165000000.0,676404566.0,BV,232600000.0,445100000.0,2016


In [35]:
merged_data.drop_duplicates(inplace=True)
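A minimal sketch of the duplicate handling on a toy frame, using titles and years that appear in the outputs above. It counts exact duplicates, drops them, and resets the index (which `drop_duplicates` alone leaves with gaps, as the `0 to 1793` index range shows). It also shows that rows sharing a title but differing in another column are not exact duplicates and therefore remain.

```python
import pandas as pd

demo = pd.DataFrame({
    "primary_title": ["Justice League", "Justice League", "On the Road", "On the Road"],
    "start_year": [2017, 2017, 2012, 2014],
})

# Number of rows that exactly repeat an earlier row.
n_exact = int(demo.duplicated().sum())

# Drop exact duplicates and renumber the index from 0.
deduped = demo.drop_duplicates().reset_index(drop=True)

# Rows that still share a title after exact duplicates are removed.
repeated_titles = deduped[deduped.duplicated(subset="primary_title", keep=False)]

print(n_exact)
print(deduped)
print(repeated_titles)
```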

In [36]:
merged_data

Unnamed: 0,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes,original_language,original_title,popularity,vote_average,vote_count,release_date,production_budget,worldwide_gross,studio,domestic_gross,foreign_gross,year
0,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,en,On the Road,8.919,5.6,518,"Mar 22, 2013",25000000.0,9313302.0,IFC,744000.0,8000000.0,2012
1,On the Road,2014,89.0,Drama,6.0,6,en,On the Road,8.919,5.6,518,"Mar 22, 2013",25000000.0,9313302.0,IFC,744000.0,8000000.0,2012
2,On the Road,2016,121.0,Drama,5.7,127,en,On the Road,8.919,5.6,518,"Mar 22, 2013",25000000.0,9313302.0,IFC,744000.0,8000000.0,2012
3,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,en,The Secret Life of Walter Mitty,10.743,7.1,4859,"Dec 25, 2013",91000000.0,187861183.0,Fox,58200000.0,129900000.0,2013
4,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",6.5,105116,en,A Walk Among the Tombstones,19.373,6.3,1685,"Sep 19, 2014",28000000.0,62108587.0,Uni.,26300000.0,26900000.0,2014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1789,Uncle Drew,2018,103.0,"Comedy,Sport",5.7,9739,en,Uncle Drew,10.836,6.5,220,"Jun 29, 2018",18000000.0,46527161.0,LG/S,42500000.0,4200000.0,2018
1790,BlacKkKlansman,2018,135.0,"Biography,Crime,Drama",7.5,149005,en,BlacKkKlansman,25.101,7.6,3138,"Aug 10, 2018",15000000.0,93017335.0,Focus,49300000.0,44000000.0,2018
1791,"Paul, Apostle of Christ",2018,108.0,"Adventure,Biography,Drama",6.7,5662,en,"Paul, Apostle of Christ",12.005,7.1,98,"Mar 23, 2018",5000000.0,25529498.0,Affirm,17600000.0,5500000.0,2018
1792,Instant Family,2018,118.0,"Comedy,Drama",7.4,46728,en,Instant Family,22.634,7.6,782,"Nov 16, 2018",48000000.0,119736188.0,Par.,67400000.0,53200000.0,2018


In [37]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1634 entries, 0 to 1793
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   primary_title      1634 non-null   object 
 1   start_year         1634 non-null   int64  
 2   runtime_minutes    1596 non-null   float64
 3   genres             1623 non-null   object 
 4   averagerating      1634 non-null   float64
 5   numvotes           1634 non-null   int64  
 6   original_language  1634 non-null   object 
 7   original_title     1634 non-null   object 
 8   popularity         1634 non-null   float64
 9   vote_average       1634 non-null   float64
 10  vote_count         1634 non-null   int64  
 11  release_date       1634 non-null   object 
 12  production_budget  1634 non-null   float64
 13  worldwide_gross    1634 non-null   float64
 14  studio             1634 non-null   object 
 15  domestic_gross     1633 non-null   float64
 16  foreign_gross      1379 

In [38]:
# conn.close()