## Final Project Submission

Please fill out:
* Student names: Charles Odhiambo, Savins Nanyaemuny, Tracy Gwehona, Amos Ledama, Brian Siele, Lilian Kaburo
* Student pace: full time
* Instructor name: Mwikali
* Blog post URL:


## Business Understanding

### Overview

Within today's highly competitive entertainment market, companies attract audiences and revenues with unique video content. The company has decided to open a new movie studio in order to take advantage of this trend, although it has little prior experience producing films.
This project analyzes current box office trends and the genres of movies that are performing well. The analysis identifies key success factors related to movie genre, budget, release window, and audience preference.
The insights from this analysis will be translated into actionable recommendations to guide the new movie studio in choosing the right types of films to produce for optimal performance in the market.

### Business Problem

The company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies.

We aim to:
1. Identify film genres and types with strong box office performance.
2. Evaluate factors such as budget, release timing, and audience reception that influence a movie’s success.
3. Analyze data from multiple sources, including Box Office Mojo, IMDB, Rotten Tomatoes, TheMovieDB, and The Numbers.
4. Provide recommendations on the types of films to produce based on the analysis results.

The findings will help the head of the company's new movie studio decide what type of films to create.

## Data Understanding

The datasets used in this project were obtained from several sources: [IMDB](https://www.imdb.com/), [Box Office Mojo](https://www.boxofficemojo.com/), [Rotten Tomatoes](https://www.rottentomatoes.com/), [TheMovieDB](https://www.themoviedb.org/) and [The Numbers](https://www.the-numbers.com/). They contain records of films, including their genres, ratings, reviews, production year and more. We review the properties of each dataset below to see what it includes.

First, import the various libraries required.

In [1]:
# Import the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import sqlite3
import zipfile

#### 1) Loading the various datasets.

The IMDB database is in a zip file.

In [2]:
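# Extract the IMDB database from its zip archive

def unzip_data(filename):
    """Extract every file in the archive into the current working directory."""
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall()

unzip_data('zippedData/im.db.zip')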

# Connect to the extracted SQLite database

path = 'im.db'

conn = sqlite3.connect(path)

Looking at the various tables in the database from the ERD, 'movie_basics' and 'movie_ratings' seem more relevant for our analysis.
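
One way to confirm which tables exist before choosing what to join is to query SQLite's built-in `sqlite_master` catalog. The sketch below runs against a small in-memory database with two stand-in tables, so the table contents are illustrative and not the real `im.db`; the names `demo_conn` and `table_names` are introduced here only for the example.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for im.db; the schema below is illustrative only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master is SQLite's catalog of schema objects (tables, indexes, views).
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;",
    demo_conn,
)
table_names = tables["name"].tolist()
print(table_names)

demo_conn.close()
```

Running the same query against the real connection would list every table in the database, which can then be checked against the ERD.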

In [3]:
# Join movie_basics and movie_ratings on movie_id

q = """
SELECT * 
FROM movie_basics
JOIN movie_ratings
    USING(movie_id)
;
"""

imdb = pd.read_sql(q, conn)

Loading the second dataset, the Box Office Mojo gross figures.

In [4]:
# Load the data from the compressed CSV file
bom = pd.read_csv('zippedData/bom.movie_gross.csv.gz', compression='gzip')

Loading the third dataset, the budget figures from The Numbers.

In [5]:
# Load the data from the compressed CSV file
tn = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip')

Loading the fourth dataset, the Rotten Tomatoes reviews.

In [6]:
# Load the tab-separated file, read with ISO-8859-1 encoding
rt1 = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', encoding='ISO-8859-1')

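Loading the fifth dataset, the Rotten Tomatoes movie information.

In [7]:
# Load the tab-separated file, read with ISO-8859-1 encoding
rt2 = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t', encoding='ISO-8859-1')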


Loading the sixth dataset, the TheMovieDB records.

In [42]:
# Load the compressed CSV file, using its first column as the index
tmdb = pd.read_csv('zippedData/tmdb.movies.csv.gz', compression='gzip', index_col=0)


Closing the connection to the SQLite database.

In [8]:
conn.close()

#### 2) .head()

The .head() method previews the first five rows of each dataset, showing what the values in each column look like.

In [9]:
imdb.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119
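
The genres column packs several genres into one comma-separated string, so a per-genre analysis needs one row per genre. A minimal sketch of that reshaping follows, using three rows copied from the output above; the names `sample`, `by_genre` and `genre_counts` are introduced only for this example.

```python
import pandas as pd

# Three rows copied from the imdb.head() output above.
sample = pd.DataFrame(
    {
        "primary_title": ["Sunghursh", "The Other Side of the Wind", "Sabse Bada Sukh"],
        "genres": ["Action,Crime,Drama", "Drama", "Comedy,Drama"],
        "averagerating": [7.0, 6.9, 6.1],
    }
)

# Split the comma-separated string into a list, then give each genre its own row.
by_genre = sample.assign(genre=sample["genres"].str.split(",")).explode("genre")

# Count how many titles fall under each genre.
genre_counts = by_genre["genre"].value_counts()
print(genre_counts)
```

After exploding, a film with three genres contributes three rows, so averages computed per genre count that film once in each of its genres.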


In [10]:
bom.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [11]:
tn.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [12]:
rt1.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [13]:
rt2.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


#### 3) .info()

The .info() method lists each dataset's columns and their data types, gives the shape of the dataframe (number of rows and columns), and shows which columns have missing values.

In [14]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   averagerating    73856 non-null  float64
 7   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB


In [15]:
bom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [16]:
tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB
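
The summary above shows that production_budget, domestic_gross and worldwide_gross are stored as object (string) columns, because the values carry a dollar sign and thousands separators. They need converting before any arithmetic. The sketch below shows one possible conversion on two rows copied from the tn.head() output; the helper `to_number` and the frame `budgets` are hypothetical names used only for illustration.

```python
import pandas as pd

# Values copied from the tn.head() output above.
budgets = pd.DataFrame(
    {
        "movie": ["Avatar", "Dark Phoenix"],
        "production_budget": ["$425,000,000", "$350,000,000"],
        "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
    }
)


def to_number(series):
    """Strip '$' and ',' from a string column and convert it to a numeric dtype."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned)


# Convert each monetary column in turn.
for col in ["production_budget", "worldwide_gross"]:
    budgets[col] = to_number(budgets[col])

print(budgets.dtypes)
```

Once the columns are numeric, they appear in .describe() and can be used to compute profit or return on investment.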


In [17]:
rt1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [18]:
rt2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
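
In rt2, runtime and theater_date are also stored as strings ("104 minutes", "Oct 9, 1971"). A possible way to turn them into a number and a datetime is sketched below on values copied from the rt2.head() output. The date format string is an assumption based on those sample rows, and `info` is a name used only for this example.

```python
import pandas as pd

# Values copied from the rt2.head() output above; None marks a missing date.
info = pd.DataFrame(
    {
        "id": [1, 3, 7],
        "runtime": ["104 minutes", "108 minutes", "200 minutes"],
        "theater_date": ["Oct 9, 1971", "Aug 17, 2012", None],
    }
)

# Pull the leading digits out of strings such as "104 minutes".
info["runtime_minutes"] = info["runtime"].str.extract(r"(\d+)", expand=False).astype(float)

# Parse the dates; errors="coerce" turns unparseable values into NaT instead of raising.
info["theater_date"] = pd.to_datetime(info["theater_date"], format="%b %d, %Y", errors="coerce")

print(info[["runtime_minutes", "theater_date"]])
```

With a datetime column in place, the release month can be read from `.dt.month` for the release-timing analysis.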


In [41]:
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
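
The genre_ids column has dtype object because each value is a string that looks like a Python list, for example "[12, 18]". If the individual IDs are needed, the strings can be parsed with `ast.literal_eval`, as in this sketch. The first two values match entries that appear later in the merged data; the empty list is an illustrative assumption.

```python
import ast

import pandas as pd

# String representations of lists, as read from the CSV file.
ids = pd.Series(["[12, 18]", "[12, 35, 18, 14]", "[]"], name="genre_ids")

# literal_eval safely evaluates a literal such as "[12, 18]" into a real list.
parsed = ids.apply(ast.literal_eval)

# Number of genre IDs attached to each film.
n_genres = parsed.apply(len)
print(parsed.tolist(), n_genres.tolist())
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.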


#### 4) .describe()

The .describe() method summarizes the descriptive statistics of the numerical columns. Columns stored as strings, such as the monetary columns in tn, are excluded until they are converted to numbers.

In [19]:
imdb.describe()

Unnamed: 0,start_year,runtime_minutes,averagerating,numvotes
count,73856.0,66236.0,73856.0,73856.0
mean,2014.276132,94.65404,6.332729,3523.662
std,2.614807,208.574111,1.474978,30294.02
min,2010.0,3.0,1.0,5.0
25%,2012.0,81.0,5.5,14.0
50%,2014.0,91.0,6.5,49.0
75%,2016.0,104.0,7.4,282.0
max,2019.0,51420.0,10.0,1841066.0
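
The maximum runtime of 51420 minutes is far from the quartiles, which suggests outliers in runtime_minutes. One common screening rule, sketched below, flags values beyond 1.5 times the interquartile range from the quartiles. The quartiles come from the output above; the sample runtimes are the four non-missing values from imdb.head() plus the reported minimum and maximum. This is only one possible rule, not necessarily the one the analysis should adopt.

```python
import pandas as pd

# Quartiles taken from the imdb.describe() output above.
q1, q3 = 81.0, 104.0
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Runtimes from imdb.head() plus the reported minimum (3) and maximum (51420).
runtimes = pd.Series([175.0, 114.0, 122.0, 80.0, 3.0, 51420.0], name="runtime_minutes")

# Flag anything outside the fences.
is_outlier = (runtimes < lower_fence) | (runtimes > upper_fence)
outliers = runtimes[is_outlier].tolist()
print(lower_fence, upper_fence, outliers)
```

The rule also flags the 175-minute film from the head output, which is a plausible runtime, so a domain-based cap may suit this column better than a purely statistical fence.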


In [20]:
bom.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [21]:
tn.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0


In [22]:
rt1.describe()

Unnamed: 0,id,top_critic
count,54432.0,54432.0
mean,1045.706882,0.240594
std,586.657046,0.427448
min,3.0,0.0
25%,542.0,0.0
50%,1083.0,0.0
75%,1541.0,0.0
max,2000.0,1.0


In [23]:
rt2.describe()

Unnamed: 0,id
count,1560.0
mean,1007.303846
std,579.164527
min,1.0
25%,504.75
50%,1007.5
75%,1503.25
max,2000.0


In [None]:
# .info() prints its summary and returns None, so call it directly
data_name = [imdb, bom, tn, rt1, rt2]
for i in data_name:
    i.info()
    print()

In [None]:
# Data merging
# Check for missing values
# Duplicates
# Change the column names
# Column to drop
# Create a csv file and store the data

In [54]:
# Merge imdb with tmdb, tn and bom on the movie title
merged_df = pd.merge(imdb, tmdb, how='inner', left_on='primary_title', right_on='title')
merged_df = merged_df.merge(tn, how='inner', left_on='primary_title', right_on='movie').merge(
    bom, how='inner', on='title'
)

In [None]:
merged_df.info()

In [57]:
# Write the merged data to a CSV file (left commented out to avoid overwriting the saved file)
# merged_df.to_csv('Merged_data.csv',index=False)

### Data cleaning

In [None]:
# Check for missing values
# Duplicates
# Change the column names
# Column to drop
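
The first two checks in the list above can be sketched as follows. The toy frame, including the repeated row and the missing runtime, is illustrative and not taken from the merged data; `df`, `missing_per_column`, `n_duplicates` and `deduped` are names used only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame; the repeated row and the NaN are illustrative.
df = pd.DataFrame(
    {
        "primary_title": ["Inception", "Inception", "Toy Story 3"],
        "start_year": [2010, 2010, 2010],
        "runtime_minutes": [148.0, 148.0, np.nan],
    }
)

# Missing values per column.
missing_per_column = df.isna().sum()

# Rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each repeated row.
deduped = df.drop_duplicates().reset_index(drop=True)

print(missing_per_column, n_duplicates, len(deduped), sep="\n")
```

`duplicated()` compares whole rows by default; passing `subset=` restricts the comparison to chosen columns when rows differ only in irrelevant fields.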

In [70]:
# Load the merged data from the CSV file
merged_data = pd.read_csv('Merged_data.csv')
merged_data.head()

Unnamed: 0,movie_id,primary_title,original_title_x,start_year,runtime_minutes,genres,averagerating,numvotes,genre_ids,id_x,...,id_y,release_date_y,movie,production_budget,domestic_gross_x,worldwide_gross,studio,domestic_gross_y,foreign_gross,year
0,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,"[12, 18]",83770,...,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302",IFC,744000.0,8000000,2012
1,tt4339118,On the Road,On the Road,2014,89.0,Drama,6.0,6,"[12, 18]",83770,...,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302",IFC,744000.0,8000000,2012
2,tt5647250,On the Road,On the Road,2016,121.0,Drama,5.7,127,"[12, 18]",83770,...,17,"Mar 22, 2013",On the Road,"$25,000,000","$720,828","$9,313,302",IFC,744000.0,8000000,2012
3,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,"[12, 35, 18, 14]",116745,...,37,"Dec 25, 2013",The Secret Life of Walter Mitty,"$91,000,000","$58,236,838","$187,861,183",Fox,58200000.0,129900000,2013
4,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.0,"Action,Crime,Drama",6.5,105116,"[80, 18, 9648, 53]",169917,...,67,"Sep 19, 2014",A Walk Among the Tombstones,"$28,000,000","$26,017,685","$62,108,587",Uni.,26300000.0,26900000,2014
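
The first three rows above show a side effect of joining on title alone: three different IMDB entries named "On the Road" (2012, 2014 and 2016) all receive the same budget and gross figures. A possible refinement, sketched here with the values from that output, is to join on title and year together. The frames `imdb_demo` and `bom_demo` are stand-ins built for the example.

```python
import pandas as pd

# IMDB side: three different films share the title "On the Road" (from the output above).
imdb_demo = pd.DataFrame(
    {
        "movie_id": ["tt0337692", "tt4339118", "tt5647250"],
        "primary_title": ["On the Road"] * 3,
        "start_year": [2012, 2014, 2016],
    }
)

# Box Office Mojo side: a single record for the 2012 release.
bom_demo = pd.DataFrame(
    {"title": ["On the Road"], "year": [2012], "domestic_gross": [744000.0]}
)

# Title-only join: every IMDB row picks up the same gross figure.
by_title = imdb_demo.merge(bom_demo, left_on="primary_title", right_on="title")

# Title-and-year join: only the matching release survives.
by_title_year = imdb_demo.merge(
    bom_demo,
    left_on=["primary_title", "start_year"],
    right_on=["title", "year"],
)
print(len(by_title), len(by_title_year))
```

Years do not always agree across sources (the same film has a 2013 release date in tn but a year of 2012 in bom), so an exact-year match can also drop valid rows and the result should be checked against the title-only join.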


In [61]:
# List the columns to decide which ones to drop
merged_data.columns

Index(['movie_id', 'primary_title', 'original_title_x', 'start_year',
       'runtime_minutes', 'genres', 'averagerating', 'numvotes', 'genre_ids',
       'id_x', 'original_language', 'original_title_y', 'popularity',
       'release_date_x', 'title', 'vote_average', 'vote_count', 'id_y',
       'release_date_y', 'movie', 'production_budget', 'domestic_gross_x',
       'worldwide_gross', 'studio', 'domestic_gross_y', 'foreign_gross',
       'year'],
      dtype='object')

In [71]:
# Remove duplicated and redundant columns
remove_col = ['original_title_x', 'id_x', 'release_date_x', 'id_y', 'domestic_gross_x', 'movie_id', 'title', 'movie']

merged_data.drop(columns=remove_col, inplace=True)



In [74]:
# Rename the columns that kept a merge suffix
merged_data.rename(columns={'original_title_y': 'original_title', 'release_date_y': 'release_date',
                            'domestic_gross_y': 'domestic_gross'}, inplace=True)

In [76]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1794 entries, 0 to 1793
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   primary_title      1794 non-null   object 
 1   start_year         1794 non-null   int64  
 2   runtime_minutes    1756 non-null   float64
 3   genres             1783 non-null   object 
 4   averagerating      1794 non-null   float64
 5   numvotes           1794 non-null   int64  
 6   genre_ids          1794 non-null   object 
 7   original_language  1794 non-null   object 
 8   original_title     1794 non-null   object 
 9   popularity         1794 non-null   float64
 10  vote_average       1794 non-null   float64
 11  vote_count         1794 non-null   int64  
 12  release_date       1794 non-null   object 
 13  production_budget  1794 non-null   object 
 14  worldwide_gross    1794 non-null   object 
 15  studio             1794 non-null   object 
 16  domestic_gross     1793 
