## Final Project Submission

Please fill out:
* Student name: Caleb Ochieng
* Student pace: Part time
* Scheduled project review date/time: 24th July, 2023 at 12:00 am
* Instructor name: Maryann Mwikali
* Blog post URL:


# PHASE 1 PROJECT

Microsoft has presented a hypothetical scenario in which it wants to establish a movie studio, and has asked me to analyze historical movie data to identify patterns of success and potential pitfalls. I will gauge success with two key metrics: profit and viewer ratings. Microsoft's primary goal is a profitable studio, so my recommendations rest on elements that have proven profitable in past movies. Audience reception also matters, especially for the initial releases, because a favorable first impression shapes future viewer interest. I will therefore also explore trends among movies with both high and low ratings.

The analysis examines three primary factors: <span style='color:green'>**budget, genre, and star power**</span>. For budget, I will assess how much capital a movie realistically needs in order to generate substantial profits. For genre, I will study movies released in the US market to identify genres that have historically performed well both financially and critically.

The third factor is star power: the influence of recognizable actors and directors on a movie's financial success and ratings. This aspect is the hardest to evaluate because it requires a metric for how much "star power" an individual holds at each point in their career. Together, these three factors should yield insights that can guide Microsoft's movie studio venture.

In [1]:
#Importing libraries necessary for the project
import pandas as pd
import csv
import matplotlib.pyplot as plt
import sqlite3
import numpy as np
%matplotlib inline

In [2]:
#View the movie csv files
!ls

CONTRIBUTING.md
LICENSE.md
README.md
awesome.gif
bom.movie_gross.csv
name.basics.csv
rt.movie_info.tsv
rt.reviews.tsv
student.ipynb
title.akas.csv
title.basics.csv
title.crew.csv
title.principals.csv
title.ratings.csv
tmdb.movies.csv
tn.movie_budgets.csv
zippedData


# DATA INSPECTION

Inspect each movie dataset to see the info within

### **Box Office Mojo Dataset**

In [3]:
#Viewing the bom dataset
df = pd.read_csv('bom.movie_gross.csv')
#Viewing the first 10 entries
df.head(10)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
5,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000,2010
6,Iron Man 2,Par.,312400000.0,311500000,2010
7,Tangled,BV,200800000.0,391000000,2010
8,Despicable Me,Uni.,251500000.0,291600000,2010
9,How to Train Your Dragon,P/DW,217600000.0,277300000,2010


In [4]:
#Looking at what is in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [5]:
# To view all duplicate rows
df_duplicates_rows = df[df.duplicated(keep=False)]

# To view all duplicate columns
df_duplicates_columns = df.T.duplicated()

# Display the duplicate rows
print("Duplicate Rows:")
print(df_duplicates_rows)

# Display the duplicate columns
print("Duplicate Columns:")
print(df_duplicates_columns[df_duplicates_columns].index.tolist()) 
#The index.tolist() method is used to convert the index of the duplicates_columns Series into a list, 
#which gives you the names of the duplicate columns.

Duplicate Rows:
Empty DataFrame
Columns: [title, studio, domestic_gross, foreign_gross, year]
Index: []
Duplicate Columns:
[]


The <span style='color:blue'>**Box Office Mojo**</span> dataset comprises information such as movie title, studio, domestic and foreign gross, and the corresponding year of release. The dataset encompasses a total of 3387 movies.

The check above finds no fully duplicated rows and no duplicated columns. Individual values in title, studio, domestic_gross, foreign_gross and year do repeat, but that is expected (for example, many movies share a studio or a release year), so all five columns are kept.
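
To make the two checks in the cell above easier to read, here is a minimal sketch on a made-up frame. The frame `toy` and its values are hypothetical and exist only for illustration; the `duplicated` calls are the same ones used on the real data.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0, and column "b" repeats column "a".
toy = pd.DataFrame({
    "a": [1, 2, 1],
    "b": [1, 2, 1],
    "c": ["x", "y", "x"],
})

# keep=False flags every member of a duplicated group, not only the later copies.
dup_rows = toy[toy.duplicated(keep=False)]

# Transposing turns columns into rows, so duplicated() then compares whole columns.
dup_cols_mask = toy.T.duplicated()
dup_cols = dup_cols_mask[dup_cols_mask].index.tolist()

print(dup_rows.index.tolist())
print(dup_cols)
```

An empty result from both checks, as on the Box Office Mojo data, means no row is an exact copy of another and no column is an exact copy of another. It says nothing about repeated values inside a single column.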

This dataset has no budget, genre, or cast information, so it does not address my primary factors and I am unlikely to use it.

***

### **TheMovieDB Dataset**

In [6]:
#Viewing the dataset
db = pd.read_csv('tmdb.movies.csv')
#viewing the first 10 entries
db.head(10)

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
5,5,"[12, 14, 10751]",32657,en,Percy Jackson & the Olympians: The Lightning T...,26.691,2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229
6,6,"[28, 12, 14, 878]",19995,en,Avatar,26.526,2009-12-18,Avatar,7.4,18676
7,7,"[16, 10751, 35]",10193,en,Toy Story 3,24.445,2010-06-17,Toy Story 3,7.7,8340
8,8,"[16, 10751, 35]",20352,en,Despicable Me,23.673,2010-07-09,Despicable Me,7.2,10057
9,9,"[16, 28, 35, 10751, 878]",38055,en,Megamind,22.855,2010-11-04,Megamind,6.8,3635


In [7]:
#Looking at what is in the dataset
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [8]:
# To view all duplicate rows
db_duplicates_rows = db[db.duplicated(keep=False)]

# To view all duplicate columns
db_duplicates_columns = db.T.duplicated()

# Display the duplicate rows
print("Duplicate Rows:")
print(db_duplicates_rows)

# Display the duplicate columns
print("Duplicate Columns:")
print(db_duplicates_columns[db_duplicates_columns].index.tolist())
#The index.tolist() method is used to convert the index of the duplicates_columns Series into a list, 
#which gives you the names of the duplicate columns.

Duplicate Rows:
Empty DataFrame
Columns: [Unnamed: 0, genre_ids, id, original_language, original_title, popularity, release_date, title, vote_average, vote_count]
Index: []
Duplicate Columns:
[]


The <span style='color:blue'>TheMovieDB</span> dataset includes information such as genre, title, popularity, release date, vote average, and vote count. To use these columns well, I need to examine them more closely. Specifically, I will explore how popularity is defined and what drives it, and I will investigate whether a higher vote count necessarily indicates that a movie was well received.
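
As a first, rough look at the vote-count question, the sketch below computes the Pearson correlation between `vote_count` and `vote_average` for the ten rows shown in the `db.head(10)` output. The frame name `sample` is my own, and ten of the most popular titles are far from representative, so this only illustrates the mechanics and is not a finding.

```python
import pandas as pd

# Values copied from the db.head(10) output above (same row order).
sample = pd.DataFrame({
    "vote_average": [7.7, 7.7, 6.8, 7.9, 8.3, 6.1, 7.4, 7.7, 7.2, 6.8],
    "vote_count": [10788, 7610, 12368, 10174, 22186, 4229, 18676, 8340, 10057, 3635],
})

# Series.corr defaults to the Pearson correlation coefficient.
corr = sample["vote_count"].corr(sample["vote_average"])
print(f"Pearson correlation on the sample: {corr:.2f}")
```

On the full dataset the same call would be `db['vote_count'].corr(db['vote_average'])`. A value near zero would suggest that vote count alone says little about how well a movie was received.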

No column is an exact copy of another and all of them seem essential, but the <span style='color:red'>'id'</span> column, which is supposed to be unique for each row, contains repeated values. For this dataset to serve the analysis, the rows with a repeated 'id' must be dropped.
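
A small sketch of the idea, using a made-up four-row frame (the frame `toy` is hypothetical; the ids and titles echo rows from the output above). It shows how a repeated 'id' can be inspected and then removed while keeping the first occurrence.

```python
import pandas as pd

# Hypothetical toy frame in which id 862 appears twice.
toy = pd.DataFrame({
    "id": [862, 27205, 862, 863],
    "title": ["Toy Story", "Inception", "Toy Story", "Toy Story 2"],
})

# keep=False shows every row that shares its id with another row.
repeated = toy[toy.duplicated(subset="id", keep=False)]

# drop_duplicates returns a new frame; the default keep="first" retains the first row per id.
deduped = toy.drop_duplicates(subset="id").reset_index(drop=True)

print(repeated)
print(deduped["id"].is_unique)
```

Because `drop_duplicates` returns a new frame by default, the result has to be assigned back (or `inplace=True` passed) for the change to persist.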

In [23]:
#Viewing the duplicates in the 'id' column
db_duplicates = db[db.duplicated('id')]
db_duplicates 

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count,movie_original_title
2473,2473,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,Toy StoryToy Story
2477,2477,"[16, 35, 10751]",863,en,Toy Story 2,22.698,1999-11-24,Toy Story 2,7.5,7553,Toy Story 2Toy Story 2
2536,2536,"[12, 28, 878]",20526,en,TRON: Legacy,13.459,2010-12-10,TRON: Legacy,6.3,4387,TRON: LegacyTRON: Legacy
2673,2673,"[18, 10749]",46705,en,Blue Valentine,8.994,2010-12-29,Blue Valentine,6.9,1677,Blue ValentineBlue Valentine
2717,2717,"[35, 18, 14, 27, 9648]",45649,en,Rubber,8.319,2010-09-01,Rubber,5.9,417,RubberRubber
...,...,...,...,...,...,...,...,...,...,...,...
26481,26481,"[35, 18]",270805,en,Summer League,0.600,2013-03-18,Summer League,4.0,3,Summer LeagueSummer League
26485,26485,"[27, 53]",453259,en,Devils in the Darkness,0.600,2013-05-15,Devils in the Darkness,3.5,1,Devils in the DarknessDevils in the Darkness
26504,26504,"[27, 35, 27]",534282,en,Head,0.600,2015-03-28,Head,1.0,1,HeadHead
26510,26510,[99],495045,en,Fail State,0.600,2018-10-19,Fail State,0.0,1,Fail StateFail State


In [None]:
#Dropping the 'original_title' column because it largely repeats the 'title' column. Where the two differ,
#'original_title' holds the title in another language, such as 'Счастье мое' for 'My Joy' and
#'สวรรค์บ้านนา' for 'Agrarian Utopia', so 'title' is the column I keep.

db.drop(['original_title'], axis=1, inplace=True)

In [32]:
db.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,popularity,release_date,title,vote_average,vote_count,movie_original_title
0,0,"[12, 14, 10751]",12444,en,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,Harry Potter and the Deathly Hallows: Part 1Ha...
1,1,"[14, 12, 16, 10751]",10191,en,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,How to Train Your DragonHow to Train Your Dragon
2,2,"[12, 28, 878]",10138,en,28.515,2010-05-07,Iron Man 2,6.8,12368,Iron Man 2Iron Man 2
3,3,"[16, 35, 10751]",862,en,28.005,1995-11-22,Toy Story,7.9,10174,Toy StoryToy Story
4,4,"[28, 878, 12]",27205,en,27.92,2010-07-16,Inception,8.3,22186,InceptionInception


In [33]:
#Dropping the duplicates in the 'id' column (keeping the first occurrence) and viewing the new dataset info
db = db.drop_duplicates(subset='id')
db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25497 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            25497 non-null  int64  
 1   genre_ids             25497 non-null  object 
 2   id                    25497 non-null  int64  
 3   original_language     25497 non-null  object 
 4   popularity            25497 non-null  float64
 5   release_date          25497 non-null  object 
 6   title                 25497 non-null  object 
 7   vote_average          25497 non-null  float64
 8   vote_count            25497 non-null  int64  
 9   movie_original_title  25497 non-null  object 
dtypes: float64(2), int64(3), object(5)
memory usage: 2.1+ MB


The <span style='color:blue'>TheMovieDB</span> dataset is now free of unnecessary duplicates. As per our primary factors under examination, this dataset contains information on genre that could possibly assist in our research. Let's keep going!
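
Before the genre information can be used, the `genre_ids` values need unpacking. The sketch below assumes they are stored as strings such as `"[12, 28, 878]"`, which is consistent with the `object` dtype reported by `db.info()` but is not confirmed here. The three sample rows are copied from the `db.head()` output; the names `sample`, `by_genre` and `genre_counts` are my own.

```python
import ast

import pandas as pd

# Sample rows copied from the db.head() output; genre_ids are assumed to be strings.
sample = pd.DataFrame({
    "title": ["Iron Man 2", "Toy Story", "Inception"],
    "genre_ids": ["[12, 28, 878]", "[16, 35, 10751]", "[28, 878, 12]"],
})

# literal_eval safely parses the string "[12, 28, 878]" into a Python list of ints.
sample["genre_ids"] = sample["genre_ids"].apply(ast.literal_eval)

# explode() yields one row per (title, genre id) pair.
by_genre = sample.explode("genre_ids")

# Count how many sample movies carry each genre id.
genre_counts = by_genre["genre_ids"].value_counts()
print(genre_counts)
```

The ids are numeric codes, so a lookup table from id to genre name would still be needed before reporting results by genre.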

***

### **Rotten Tomatoes Dataset**

In [45]:
#Rotten Tomatoes movie information
rt = pd.read_csv('rt.movie_info.tsv', delimiter='\t')
#Displaying the dataset (pandas truncates the middle rows)
rt

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
...,...,...,...,...,...,...,...,...,...,...,...,...
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,


In [12]:
#Looking at what is in the dataset
rt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [13]:
# To view all duplicate rows
rt_duplicates_rows = rt[rt.duplicated(keep=False)]

# To view all duplicate columns
rt_duplicates_columns = rt.T.duplicated()

# Display the duplicate rows
print("Duplicate Rows:")
print(rt_duplicates_rows)

# Display the duplicate columns
print("Duplicate Columns:")
print(rt_duplicates_columns[rt_duplicates_columns].index.tolist())
#The index.tolist() method is used to convert the index of the duplicates_columns Series into a list, 
#which gives you the names of the duplicate columns.

Duplicate Rows:
Empty DataFrame
Columns: [id, synopsis, rating, genre, director, writer, theater_date, dvd_date, currency, box_office, runtime, studio]
Index: []
Duplicate Columns:
[]


#### **Rotten Tomatoes reviews**

In [16]:
#Rotten Tomatoes reviews
rtr = pd.read_csv('rt.reviews.tsv', delimiter = '\t', encoding = 'latin1')
#Looking for the first 10 entries
rtr.head(10)

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
5,3,... Cronenberg's Cosmopolis expresses somethin...,,fresh,Michelle Orange,0,Capital New York,"September 11, 2017"
6,3,"Quickly grows repetitive and tiresome, meander...",C,rotten,Eric D. Snider,0,EricDSnider.com,"July 17, 2013"
7,3,Cronenberg is not a director to be daunted by ...,2/5,rotten,Matt Kelemen,0,Las Vegas CityLife,"April 21, 2013"
8,3,"Cronenberg's cold, exacting precision and emot...",,fresh,Sean Axmaker,0,Parallax View,"March 24, 2013"
9,3,Over and above its topical urgency or the bit ...,,fresh,Kong Rithdee,0,Bangkok Post,"March 4, 2013"


In [17]:
rtr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In the <span style='color:blue'>Rotten Tomatoes</span> dataset, the movie info table touches two of my primary factors: genre and, through the currency and box_office columns, the financial side. However, currency and box_office are populated for only 340 of the 1560 movies, and the rest are NaN. Because of that, I will not use this dataset.
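
The share of missing values per column can be measured directly rather than read off the `info()` output. The sketch below uses a made-up frame (`toy` is hypothetical); on the real data the same expression would be `rt.isna().mean()`, where 340 non-null values out of 1560 correspond to roughly 78% missing for `box_office`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: box_office is missing in 3 of 4 rows, genre in 1 of 4.
toy = pd.DataFrame({
    "genre": ["Drama", "Comedy", None, "Horror"],
    "box_office": [np.nan, "600000", np.nan, np.nan],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```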

***

### **The Numbers Dataset**

In [46]:
#Viewing The Numbers dataset
tn = pd.read_csv('tn.movie_budgets.csv')
#Viewing the first 10 entries
tn.head(10)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
5,6,"Dec 18, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220"
6,7,"Apr 27, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200"
7,8,"May 24, 2007",Pirates of the Caribbean: At Worldâs End,"$300,000,000","$309,420,425","$963,420,425"
8,9,"Nov 17, 2017",Justice League,"$300,000,000","$229,024,295","$655,945,209"
9,10,"Nov 6, 2015",Spectre,"$300,000,000","$200,074,175","$879,620,923"


In [47]:
#Looking at what is in the dataset
tn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


<span style='color:blue'>The Numbers</span> dataset holds the production budget and gross figures I need for my budget factor, so it will be central to my analysis.
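
The `info()` output shows that the three money columns are stored as text (`object` dtype) with `$` signs and thousands separators, so they must be converted before any arithmetic. The sketch below does this for two rows copied from the `tn.head(10)` output. Defining profit as worldwide gross minus production budget is my working assumption, and the `sample` frame and `profit` column are names I introduce for illustration.

```python
import pandas as pd

# Two rows copied from the tn.head(10) output above.
sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

# Strip "$" and "," and convert to 64-bit integers (the values exceed the 32-bit range).
money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    sample[col] = sample[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Assumed definition: profit = worldwide gross - production budget.
sample["profit"] = sample["worldwide_gross"] - sample["production_budget"]
print(sample[["movie", "profit"]])
```

The same loop would apply to `tn` with `domestic_gross` added to `money_cols`. A negative profit, as for Dark Phoenix here, marks a movie that did not recoup its production budget at the box office.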

***

### **IMDB Dataset**

In [None]:
#IMDB title files available in the working directory
#title.akas.csv
#title.basics.csv
#title.crew.csv
#title.principals.csv
#title.ratings.csv

In [None]:
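#Loading the IMDB people, title and crew tables
#The directory listing contains 'name.basics.csv', not 'basics.csv'
persons = pd.read_csv('name.basics.csv')
basics = pd.read_csv('title.basics.csv')
crew = pd.read_csv('title.crew.csv')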