## Final Project Submission

Please fill out:
* Student name: Innocent Mbuvi 
* Student pace: full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


# Data-Driven Decision Making: Empowering Microsoft's Movie Studio Venture

# 1. Business Understanding

## Introduction
In a bid to diversify its portfolio and tap into the entertainment industry, Microsoft is launching a new movie studio. Because the company lacks expertise in film production, it wants data-driven insights drawn from films that have succeeded at the box office. As a data analyst, I have been tasked with analyzing movie industry data to provide actionable insights that will help Microsoft make informed decisions about the types of movies to produce.

## Business Problem
Microsoft sees the potential of creating original video content and has decided to create a new movie studio. However, the company lacks expertise in the film industry and is looking for data-driven insights to help it make informed decisions about the types of movies to produce. The goal of this analysis is to provide actionable insights that will help Microsoft maximize its return on investment and increase its chances of success in the movie industry.

## Objectives
The objectives of this analysis are to:
- Identify the most successful genres at the box office.
- Determine the most successful months for movie releases.
- Identify the most successful directors and actors.
- Identify the most successful production companies.
- Determine the relationship between movie budgets and box office revenue.
- Determine the relationship between production companies and box office revenue.
- Determine the relationship between directors and box office revenue.
- Determine the relationship between actors and box office revenue.
- Determine the relationship between genres and box office revenue.
- Determine the relationship between release months and box office revenue.

## Business Value
The insights from this analysis will help Microsoft make informed decisions about the types of movies to produce, the best time to release them, the best directors and actors to work with, and the best production companies to partner with. This will help Microsoft maximize its return on investment and increase its chances of success in the movie industry.

## Source of Data
1. https://www.boxofficemojo.com/
2. https://www.imdb.com/
3. https://www.rottentomatoes.com/
4. https://www.themoviedb.org/
5. https://www.the-numbers.com/





# 2. Data Understanding
In this section, the following will be carried out:
- Load the data and explore it to understand its structure and contents.
- Check for missing values and duplicates.
- Identify the relevant data for our analysis. 

### Importing the necessary libraries and loading the data


In [2]:
#Importing necessary libraries
import csv
import pandas as pd
import sqlite3  # public standard-library module (instead of the private _sqlite3)

In [3]:
#LOADING THE DATA

#Loading the data from the csv file
box_office = pd.read_csv('data/bom.movie_gross.csv')
the_movie = pd.read_csv('data/tmdb.movies.csv')
the_number = pd.read_csv('data/tn.movie_budgets.csv')

#Loading the data from the tsv file
rotten_tomatoes_movie = pd.read_csv('data/rt.movie_info.tsv', delimiter='\t')
rotten_tomatoes_review = pd.read_csv('data/rt.reviews.tsv', delimiter='\t', encoding='latin1')

#Loading data from a database
#Connecting to the database
conn = sqlite3.connect('data/im.db')

## Explore Data Characteristics

a. Box Office Mojo

In [4]:
# columns in the data
box_office.columns

Index(['title', 'studio', 'domestic_gross', 'foreign_gross', 'year'], dtype='object')

In [5]:
'''
The data contains 5 columns. The columns are:
- title
- studio
- domestic_gross
- foreign_gross
- year
'''

'\nThe data contains 5 columns. The columns are:\n- title\n- studio\n- domestic_gross\n- foreign_gross\n- year\n'

In [6]:
# Data types of the columns and total number of entries
box_office.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [7]:
'''
The various data types are: 
- object - title, studio, foreign_gross
- int64 - year
- float64 - domestic_gross

The total number of entries is 3387
'''

'\nThe various data types are: \n- object - title, studio, foreign_gross\n- int64 - year\n- float64 - domestic_gross\n\nThe total number of entries is 3387\n'
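Since foreign_gross is stored as object while domestic_gross is float64, it will need converting before any arithmetic. The sketch below shows one possible way to do that on a toy Series. It assumes the column is text because some values contain thousands separators, which has not been verified against the file; the sample values are illustrative only.

```python
import pandas as pd

# Toy stand-in for box_office['foreign_gross']; these values are made up.
foreign_gross_raw = pd.Series(["652000000", "1,131.6", None, "8400000"])

# Assumption: commas are the only non-numeric characters.
# Anything that still cannot be parsed becomes NaN instead of raising.
foreign_gross_num = pd.to_numeric(
    foreign_gross_raw.str.replace(",", "", regex=False),
    errors="coerce",
)

print(foreign_gross_num.dtype)
```

Using errors="coerce" keeps the conversion from failing on unexpected strings, at the cost of silently turning them into missing values, so the null count should be checked again afterwards.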

In [8]:
# Checking for missing values
null_values = box_office.isnull()
print(null_values.sum())

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64


In [9]:
'''
There are a total of 1383 missing values.
studio has 5 missing values
foreign_gross has 1350 missing values
domestic_gross has 28 missing values

'''

'\nThere are a total of 1383 missing values.\nstudio has 5 missing values\nforeign_gross has 1350 missing values\ndomestic_gross has 28 missing values\n\n'
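Raw counts are easier to judge next to the size of the dataset: 1350 missing foreign_gross values out of 3387 entries is roughly 40% of the column. A minimal sketch of a helper that reports both the count and the percentage is shown below. The function name missing_summary and the toy DataFrame are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the number and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy data for illustration; not taken from the real file.
toy = pd.DataFrame({
    "studio": ["BV", None, "WB", "Uni."],
    "year": [2010, 2011, 2012, 2013],
})

summary = missing_summary(toy)
print(summary)
```

The same helper could be applied to box_office or any of the other DataFrames loaded above.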

In [10]:
#Checking for duplicates
duplicates = box_office.duplicated()
print(duplicates.sum())

0


In [11]:
'''
There are no duplicates in the data.
'''

'\nThere are no duplicates in the data.\n'

b. IMDB 

In [12]:
# Fetch table names from the database
table_names = pd.read_sql('SELECT name FROM sqlite_master WHERE type="table";', conn)
table_names

Unnamed: 0,name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings
5,persons
6,principals
7,writers


In [13]:
''' 
The database contains the following tables which are:
- movie_basics
- directors
- known_for
- movie_akas
- movie_ratings
- persons
- principals
- writers
'''

' \nThe database contains the following tables which are:\n- movie_basics\n- directors\n- known_for\n- movie_akas\n- movie_ratings\n- persons\n- principals\n- writers\n'

In [14]:
#Column names and data types in the movie_basics table
movie_basics = pd.read_sql('PRAGMA table_info(movie_basics)', conn)
movie_basics

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,movie_id,TEXT,0,,0
1,1,primary_title,TEXT,0,,0
2,2,original_title,TEXT,0,,0
3,3,start_year,INTEGER,0,,0
4,4,runtime_minutes,REAL,0,,0
5,5,genres,TEXT,0,,0


In [15]:
'''
movie_basics table has the following columns:
- movie_id - TEXT
- primary_title - TEXT
- original_title - TEXT
- start_year - INTEGER
- runtime_minutes - REAL
- genres - TEXT
'''

'\nmovie_basics table has the following columns:\n- movie_id - TEXT\n- primary_title - TEXT\n- original_title - TEXT\n- start_year - INTEGER\n- runtime_minutes - REAL\n- genres - TEXT\n'

In [16]:
#Column names and data types in the directors table
directors = pd.read_sql('PRAGMA table_info(directors)', conn)
directors

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,movie_id,TEXT,0,,0
1,1,person_id,TEXT,0,,0


In [17]:
'''
The directors table has the following columns:
- movie_id - TEXT
- person_id - TEXT
'''

'\nThe directors table has the following columns:\n- movie_id - TEXT\n- person_id - TEXT\n'

In [18]:
#Column names and data types in the known_for table
known_for = pd.read_sql('PRAGMA table_info(known_for)', conn)
known_for



Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,person_id,TEXT,0,,0
1,1,movie_id,TEXT,0,,0


In [19]:
'''
The known_for table has the following columns:
- person_id - TEXT
- movie_id - TEXT
'''

'\nThe known_for table has the following columns:\n- person_id - TEXT\n- movie_id - TEXT\n'

In [20]:
#Column names and data types in the movie_akas table
movie_akas = pd.read_sql('PRAGMA table_info(movie_akas)', conn)
movie_akas



Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,movie_id,TEXT,0,,0
1,1,ordering,INTEGER,0,,0
2,2,title,TEXT,0,,0
3,3,region,TEXT,0,,0
4,4,language,TEXT,0,,0
5,5,types,TEXT,0,,0
6,6,attributes,TEXT,0,,0
7,7,is_original_title,REAL,0,,0


In [21]:
'''
The movie_akas table has the following columns:
- movie_id - TEXT
- ordering - INTEGER
- title - TEXT
- region - TEXT
- language - TEXT
- types - TEXT
- attributes - TEXT
- is_original_title - REAL
'''

'\nThe movie_akas table has the following columns:\n- movie_id - TEXT\n- ordering - INTEGER\n- title - TEXT\n- region - TEXT\n- language - TEXT\n- types - TEXT\n- attributes - TEXT\n- is_original_title - REAL\n'

In [22]:
#Column names and data types in the movie_ratings table
movie_ratings = pd.read_sql('PRAGMA table_info(movie_ratings)', conn)
movie_ratings



Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,movie_id,TEXT,0,,0
1,1,averagerating,REAL,0,,0
2,2,numvotes,INTEGER,0,,0


In [23]:
'''
Movie_ratings table has the following columns:
- movie_id - TEXT
- averagerating - REAL
- numvotes - INTEGER
'''


'\nMovie_ratings table has the following columns:\n- movie_id - TEXT\n- averagerating - REAL\n- numvotes - INTEGER\n'

In [24]:
#Column names and data types in the persons table
persons = pd.read_sql('PRAGMA table_info(persons)', conn)
persons

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,person_id,TEXT,0,,0
1,1,primary_name,TEXT,0,,0
2,2,birth_year,REAL,0,,0
3,3,death_year,REAL,0,,0
4,4,primary_profession,TEXT,0,,0


In [25]:
'''
persons table has the following columns:
- person_id - TEXT
- primary_name - TEXT
- birth_year - REAL
- death_year - REAL
- primary_profession - TEXT
'''

'\npersons table has the following columns:\n- person_id - TEXT\n- primary_name - TEXT\n- birth_year - REAL\n- death_year - REAL\n- primary_profession - TEXT\n'

In [26]:
#Column names and data types in the principals table
principals = pd.read_sql('PRAGMA table_info(principals)', conn)
principals


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,movie_id,TEXT,0,,0
1,1,ordering,INTEGER,0,,0
2,2,person_id,TEXT,0,,0
3,3,category,TEXT,0,,0
4,4,job,TEXT,0,,0
5,5,characters,TEXT,0,,0


In [27]:
'''
principals table has the following columns:
- movie_id - TEXT
- ordering - INTEGER
- person_id - TEXT
- category - TEXT
- job - TEXT
- characters - TEXT
'''

'\nprincipals table has the following columns:\n- movie_id - TEXT\n- ordering - INTEGER\n- person_id - TEXT\n- category - TEXT\n- job - TEXT\n- characters - TEXT\n'

In [28]:
#Column names and data types in the writers table
writers = pd.read_sql('PRAGMA table_info(writers)', conn)
writers


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,movie_id,TEXT,0,,0
1,1,person_id,TEXT,0,,0


In [29]:
'''
writers table has the following columns:
- movie_id - TEXT
- person_id - TEXT
'''

'\nwriters table has the following columns:\n- movie_id - TEXT\n- person_id - TEXT\n'

In [30]:
# Number of records in each table
record_movies_table = pd.read_sql('SELECT COUNT(*) FROM movie_basics', conn)
record_directors_table = pd.read_sql('SELECT COUNT(*) FROM directors', conn)
record_known_for = pd.read_sql('SELECT COUNT(*) FROM known_for', conn)
record_movies_akas = pd.read_sql('SELECT COUNT(*) FROM movie_akas', conn)
record_movie_ratings = pd.read_sql('SELECT COUNT(*) FROM movie_ratings', conn)
record_persons = pd.read_sql('SELECT COUNT(*) FROM persons', conn)
record_principals = pd.read_sql('SELECT COUNT(*) FROM principals', conn)
record_writers = pd.read_sql('SELECT COUNT(*) FROM writers', conn)

print(record_movies_table)
print(record_directors_table)
print(record_known_for)
print(record_movies_akas)
print(record_movie_ratings)
print(record_persons)
print(record_principals)
print(record_writers)


   COUNT(*)
0    146144
   COUNT(*)
0    291174
   COUNT(*)
0   1638260
   COUNT(*)
0    331703
   COUNT(*)
0     73856
   COUNT(*)
0    606648
   COUNT(*)
0   1028186
   COUNT(*)
0    255873


In [31]:
'''
The number of records in each table are:
- movie_basics - 146144
- directors - 291174
- known_for - 1638260
- movie_akas - 331703
- movie_ratings - 73856
- persons - 606648
- principals - 1028186
- writers - 255873
'''

'\nThe number of records in each table are:\n- movie_basics - 146144\n- directors - 291174\n- known_for - 1638260\n- movie_akas - 331703\n- movie_ratings - 73856\n- persons - 606648\n- principals - 1028186\n- writers - 255873\n'
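The per-table counts can also be collected in a loop over the names stored in sqlite_master, which avoids one query variable per table. The following is a sketch only: the helper name table_row_counts is hypothetical, and it is demonstrated on a small in-memory database rather than on im.db.

```python
import sqlite3
import pandas as pd

def table_row_counts(conn: sqlite3.Connection) -> pd.DataFrame:
    """Return one row per table with its record count."""
    names = pd.read_sql(
        "SELECT name FROM sqlite_master WHERE type='table';", conn
    )["name"]
    rows = []
    for name in names:
        # The names come from sqlite_master itself, so quoting them is enough here.
        count = pd.read_sql(f'SELECT COUNT(*) AS n FROM "{name}"', conn)["n"].iloc[0]
        rows.append({"table": name, "rows": int(count)})
    return pd.DataFrame(rows)

# Tiny in-memory database used purely for demonstration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movie_basics (movie_id TEXT)")
demo.executemany("INSERT INTO movie_basics VALUES (?)", [("tt1",), ("tt2",)])
demo.execute("CREATE TABLE directors (movie_id TEXT, person_id TEXT)")

counts = table_row_counts(demo)
print(counts)
demo.close()
```

Called as table_row_counts(conn) with the connection opened earlier, it should return the same eight counts in a single DataFrame.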

c. Rotten Tomatoes

In [32]:
# Columns in the dataset
rotten_tomatoes_movie.columns

Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

In [33]:
'''
Columns in the rotten_tomatoes_movie dataset are:
- id
- synopsis
- rating
- genre
- director
- writer
- theater_date
- dvd_date
- currency
- box_office
- runtime
- studio

'''

'\nColumns in the rotten_tomatoes_movie dataset are:\n- id\n- synopsis\n- rating\n- genre\n- director\n- writer\n- theater_date\n- dvd_date\n- currency\n- box_office\n- runtime\n- studio\n\n'

In [34]:
# Data types of the columns and total number of entries
rotten_tomatoes_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [35]:
'''
This dataset has the following data types:
- int64 - id
- object - synopsis, rating, genre, director, writer, theater_date, dvd_date, currency, box_office, runtime, studio

The total number of entries is 1560
'''

'\nThis dataset has the following data types:\n- int64 - id\n- object - synopsis, rating, genre, director, writer, theater_date, dvd_date, currency, box_office, runtime, studio\n\nThe total number of entries is 1560\n'

In [36]:
# Checking for missing values
null_values = rotten_tomatoes_movie.isnull()
print(null_values.sum())

id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64


In [37]:
'''
synopsis has 62 missing values
rating has 3 missing values
genre has 8 missing values
director has 199 missing values
writer has 449 missing values
theater_date has 359 missing values
dvd_date has 359 missing values
currency has 1220 missing values
box_office has 1220 missing values
runtime has 30 missing values
studio has 1066 missing values

There is a total of 4975 missing values in the dataset.
'''

'\nsynopsis has 62 missing values\nrating has 3 missing values\ngenre has 8 missing values\ndirector has 199 missing values\nwriter has 449 missing values\ntheater_date has 359 missing values\ndvd_date has 359 missing values\ncurrency has 1220 missing values\nbox_office has 1220 missing values\nruntime has 30 missing values\nstudio has 1066 missing values\n\nThere is a total of 4975 missing values in the dataset.\n'

In [38]:
#Checking for duplicates
duplicates = rotten_tomatoes_movie.duplicated()
print(duplicates.sum())


0


In [39]:
'There are no duplicates in the dataset'

'There are no duplicates in the dataset'

c.(i) Rotten Tomatoes Reviews

In [40]:
# Columns in the dataset
rotten_tomatoes_review.columns

Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')

In [41]:
'''
This dataset has the following columns:
- id
- review
- rating
- fresh
- critic
- top_critic
- publisher
- date
'''

'\nThis dataset has the following columns:\n- id\n- review\n- rating\n- fresh\n- critic\n- top_critic\n- publisher\n- date\n'

In [42]:
# Data types of the columns and total number of entries
rotten_tomatoes_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [43]:
'''
The dataset has the following datatypes:
- int64 - id, top_critic
- object - review, rating, fresh, critic, publisher, date

The total number of entries is 54432
'''

'\nThe dataset has the following datatypes:\n- int64 - id, top_critic\n- object - review, rating, fresh, critic, publisher, date\n\nThe total number of entries is 54432\n'

In [44]:
#Checking for missing values
null_values = rotten_tomatoes_review.isnull()
print(null_values.sum())

id                0
review         5563
rating        13517
fresh             0
critic         2722
top_critic        0
publisher       309
date              0
dtype: int64


In [45]:
'''
review has 5563 missing values
rating has 13517 missing values
critic has 2722 missing values
publisher has 309 missing values

There is a total of 22111 missing values in the dataset
'''

'\nreview has 5563 missing values\nrating has 13517 missing values\ncritic has 2722 missing values\npublisher has 309 missing values\n\nThere is a total of 22111 missing values in the dataset\n'

In [46]:
#Check for duplicates
duplicates = rotten_tomatoes_review.duplicated()
print(duplicates.sum())

9


In [47]:
'There are 9 duplicates in the dataset'

'There are 9 duplicates in the dataset'
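If this dataset were used, the duplicated rows would need to be removed first. A minimal sketch of that step with drop_duplicates is shown below on a toy DataFrame; the rows are made up, and only the column names mirror the reviews dataset.

```python
import pandas as pd

# Toy reviews with one fully duplicated row; values are illustrative only.
reviews = pd.DataFrame({
    "id": [3, 3, 5],
    "review": ["Great", "Great", "Dull"],
    "fresh": ["fresh", "fresh", "rotten"],
})

# duplicated() flags every repeat of an earlier identical row.
n_dupes = int(reviews.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = reviews.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```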

d. The Movie Database (TMDb)

In [48]:
# Columns in the dataset
the_movie.columns

Index(['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title',
       'popularity', 'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

In [49]:
'''
The columns in the dataset are:
- genre_ids
- id
- original_language
- original_title
- popularity
- release_date
- title
- vote_average
- vote_count

There is one unnamed column in the dataset
'''

'\nThe columns in the dataset are:\n- genre_ids\n- id\n- original_language\n- original_title\n- popularity\n- release_date\n- title\n- vote_average\n- vote_count\n\nThere is one unnamed column in the dataset\n'
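An 'Unnamed: 0' column usually appears when a DataFrame was saved to CSV together with its row index. Assuming that is what happened here (this has not been confirmed for tmdb.movies.csv), passing index_col=0 to read_csv would load it as the index instead of as a data column. The sketch below uses an in-memory CSV with made-up rows.

```python
import io
import pandas as pd

# A small CSV whose first header cell is empty, as happens when an index is saved.
csv_text = ",id,title\n0,100,Example A\n1,200,Example B\n"

# Default load: the empty header becomes a column called 'Unnamed: 0'.
default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))
print(list(indexed.columns))
```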

In [50]:
# Data types of the columns and total number of entries
the_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [51]:
'''
The dataset has the following data types:
- int64 - id, vote_count, Unnamed: 0
- object - genre_ids, original_language, original_title, release_date, title
- float64 - popularity, vote_average

'''

'\nThe dataset has the following data types:\n- int64 - id, vote_count, Unnamed: 0\n- object - genre_ids, original_language, original_title, release_date, title\n- float64 - popularity, vote_average\n\n'

In [52]:
# Checking for missing values
null_values = the_movie.isnull()
print(null_values.sum())

Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64


In [53]:
'There are no null values in the dataset'

'There are no null values in the dataset'

In [54]:
#Checking for duplicates
duplicates = the_movie.duplicated()
print(duplicates.sum())

0


In [55]:
'There are no duplicates in the dataset'

'There are no duplicates in the dataset'

e. The Numbers

In [56]:
# Columns in the dataset
the_number.columns

Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')

In [57]:
'''
The columns in this dataset are:
- id
- release_date
- movie
- production_budget
- domestic_gross
- worldwide_gross
'''

'\nThe columns in this dataset are:\n- id\n- release_date\n- movie\n- production_budget\n- domestic_gross\n- worldwide_gross\n'

In [58]:
# Data Types and total number of entries
the_number.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [59]:
'''
The dataset has the following data types:
- int64 - id
- object - release_date, movie, production_budget, domestic_gross, worldwide_gross

There are a total of 5782 entries in the dataset
'''

'\nThe dataset has the following data types:\n- int64 - id\n- object - release_date, movie, production_budget, domestic_gross, worldwide_gross\n\nThere are a total of 5782 entries in the dataset\n'
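The budget and gross columns are stored as object, so they cannot be summed or compared until they are converted to numbers. The sketch below shows one way this could be done, assuming the values are strings formatted like "$425,000,000". That format is an assumption that should be checked with the_number.head(), and the sample values are illustrative.

```python
import pandas as pd

# Illustrative values only; the real columns come from tn.movie_budgets.csv.
budgets = pd.DataFrame({
    "production_budget": ["$425,000,000", "$1,100,000"],
    "worldwide_gross": ["$2,776,345,279", "$0"],
})

money_columns = ["production_budget", "worldwide_gross"]
for col in money_columns:
    # Drop "$" and "," and parse what is left as a number.
    cleaned = budgets[col].str.replace(r"[$,]", "", regex=True)
    budgets[col] = pd.to_numeric(cleaned, errors="coerce")

print(budgets.dtypes)
```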

In [60]:
# Checking missing values
null_values = the_number.isnull()
print(null_values.sum())

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64


In [61]:
'There are no missing values in the dataset'

'There are no missing values in the dataset'

In [62]:
#Checking for duplicates
duplicates = the_number.duplicated()
print(duplicates.sum())

0


In [63]:
'There are no duplicates in the dataset'

'There are no duplicates in the dataset'

## Conclusion
Based on the data understanding and data quality checks carried out on the datasets provided, the following datasets will be used for the analysis:
- Box Office Mojo (bom.movie_gross.csv)
- The Numbers (tn.movie_budgets.csv)
- IMDB (im.db)

These datasets were chosen for the following reasons:
- Together they cover the questions this analysis sets out to answer.
- Neither CSV file contains duplicate rows. The Numbers has no missing values, and the missing values in Box Office Mojo are concentrated in foreign_gross (1350 of 3387 entries).


# 3. Data Preparation
In this section the following will be carried out:
- Clean the data by addressing missing values, correcting errors, and removing duplicates.
- Standardize data formats and units to facilitate uniform analysis.
- Perform feature engineering to create new variables or derive additional insights from existing data attributes.
- Transform categorical variables into numerical representations.
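As a rough illustration of the standardization and feature engineering steps listed above, the sketch below parses a release date, derives a release month, and computes a simple return on investment. The date format ("Dec 18, 2009"), the ROI definition (profit divided by budget), and all values are assumptions made for this example, not results from the datasets.

```python
import pandas as pd

# Toy data; the numbers are made up to keep the arithmetic easy to follow.
movies = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011"],
    "production_budget": [100.0, 50.0],
    "worldwide_gross": [300.0, 25.0],
})

# Standardize the date column (assumed format: abbreviated month, day, year).
movies["release_date"] = pd.to_datetime(movies["release_date"], format="%b %d, %Y")

# New feature: release month as an integer from 1 to 12.
movies["release_month"] = movies["release_date"].dt.month

# New feature: ROI, defined here as (gross - budget) / budget.
movies["roi"] = (
    movies["worldwide_gross"] - movies["production_budget"]
) / movies["production_budget"]

print(movies[["release_month", "roi"]])
```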