# Microsoft Movie Studios Recommendations

![Microsoft Movie Studios](data/logo.png)

<h2>Overview</h2>

In this project we will help guide the executives at the newly founded Microsoft Movie Studios. We will analyze the <a href="https://www.imdb.com">IMDb</a>, <a href="https://www.rottentomatoes.com/">Rotten Tomatoes</a>, and <a href="https://www.boxofficemojo.com/">Box Office Mojo</a> datasets, and use this information to recommend next steps for the creation of our first films, as well as potential acquisition decisions.

The first project our studio publishes will have major ramifications for our reputation both within the industry and without.
It's important for our future success to ensure our debut makes a major splash. 

<h2>Data Preparation</h2>

IMDb is the internet's largest movie database. We're interested in their data on film ratings, films, and the actors, directors, writers, and producers involved in their creation.

In [14]:
# Import necessary libraries
import pandas as pd
import sqlite3
# Create connection and cursor objects to execute our SQLite queries
conn = sqlite3.connect("data/im.db")
cursor = conn.cursor()

In [15]:
q = "SELECT name FROM sqlite_master WHERE type='table';"
cursor.execute(q)
tables = cursor.fetchall()
table_names = [table[0] for table in tables]

In [16]:
# Loop through table_names and, for each table, create a global variable with that name whose
# value is the DataFrame for that table. Each DataFrame is also appended to the list db.
db = []
for t in table_names:
    globals()[t] = pd.read_sql(f"SELECT * FROM {t}", conn)
    db.append(globals()[t])
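Creating one global variable per table works in a notebook, but the same load can also be kept in a dictionary keyed by table name. The sketch below is a minimal illustration of that alternative on an in-memory database; the toy tables, their contents, and the `demo_` names are assumptions for the example, not the contents of `data/im.db`.

```python
import sqlite3

import pandas as pd

# Toy in-memory database standing in for data/im.db (contents are illustrative)
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL);
    INSERT INTO movie_basics VALUES ('tt1', 'A'), ('tt2', 'B');
    INSERT INTO movie_ratings VALUES ('tt1', 7.5);
""")

# Same sqlite_master query as above
demo_table_names = [
    row[0]
    for row in demo_conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
]

# One DataFrame per table, looked up by name instead of by global variable
demo_frames = {
    name: pd.read_sql(f'SELECT * FROM "{name}"', demo_conn)
    for name in demo_table_names
}

for name, frame in demo_frames.items():
    print(name, frame.shape)
```

With this layout a table is accessed as `demo_frames["movie_basics"]`, and looping over every table no longer needs a parallel list of names.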

Let's first get an idea of the size, shape, columns, and datatypes within this database in order to inform further analysis.

In [17]:
for x in range(len(db)):
    print(f"{table_names[x]}:")
    print(db[x].info())
    print()

movie_basics:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
None

directors:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291174 entries, 0 to 291173
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   movie_id   291174 non-null  object
 1   person_id  291174 non-null  object
dtypes: object(2)
memory usage: 4.4+ MB
None

known_for:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638260 entries, 0 to 1638259
Data columns (

It's clear that much of this data will be unhelpful in our analysis, and that much of it is redundant. For example, the `directors`, `writers`, and `known_for` tables contain information that is duplicated in the `principals` table.
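One way to test the redundancy claim rather than assume it is an anti-join: list the `(movie_id, person_id)` pairs in `directors` that have no matching `director` row in `principals`. The sketch below runs that check on toy frames with pandas' `merge` and `indicator=True`; the data and the `demo_` names are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the directors and principals tables (values are illustrative)
demo_directors = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "person_id": ["nm1", "nm2", "nm3"],
})
demo_principals = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2"],
    "person_id": ["nm1", "nm9", "nm2"],
    "category": ["director", "actor", "director"],
})

# Director pairs as recorded in principals
principal_directors = demo_principals.loc[
    demo_principals["category"] == "director", ["movie_id", "person_id"]
].drop_duplicates()

# Left merge with an indicator column; "left_only" rows exist only in directors
checked = demo_directors.drop_duplicates().merge(
    principal_directors, on=["movie_id", "person_id"], how="left", indicator=True
)
missing = checked.loc[checked["_merge"] == "left_only", ["movie_id", "person_id"]]
print(missing)
```

On the real tables, an empty `missing` frame would support treating `directors` as redundant; a non-empty one would show which pairs `principals` lacks.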



For this reason, we'll be focusing on combining `primary_name`s from the `persons` table with their `category` and `movie_id` from the `principals` table.


Let's look a little closer at the `movie_akas` table.

As a new studio, it's important for us to garner a positive reputation by creating our own original films. Therefore, we can exclude non-original films from the analysis.

In [18]:
movie_akas_originals = movie_akas.loc[movie_akas['is_original_title'] == 1.0]
movie_akas_originals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44700 entries, 38 to 331700
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   movie_id           44700 non-null  object 
 1   ordering           44700 non-null  int64  
 2   title              44700 non-null  object 
 3   region             6 non-null      object 
 4   language           4 non-null      object 
 5   types              44700 non-null  object 
 6   attributes         0 non-null      object 
 7   is_original_title  44700 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 3.1+ MB


We see that nearly all the values in the `region`, `language`, and `attributes` columns are null. The `ordering`, `types`, and `is_original_title` columns are now redundant as well. We drop these columns and name the resulting table `originals`.

In [19]:
originals = movie_akas_originals.drop(labels=['region', 'language', 'attributes', 'is_original_title', 'types', 'ordering'], axis=1)
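The decision about which columns are mostly null can also be read off programmatically instead of from the `.info()` output. The sketch below computes the share of nulls per column on a toy frame and drops the columns above a cutoff; the data, the `demo_` names, and the 0.5 threshold are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for movie_akas_originals (values are illustrative)
demo_akas = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3", "tt4"],
    "title": ["A", "B", "C", "D"],
    "region": [None, None, None, "US"],
    "attributes": [None, None, None, None],
})

# Fraction of null values in each column
null_share = demo_akas.isna().mean()

# Columns where more than half the values are null (threshold is an assumption)
mostly_null = null_share[null_share > 0.5].index.tolist()
demo_trimmed = demo_akas.drop(columns=mostly_null)

print(null_share)
print(demo_trimmed.columns.tolist())
```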

Now that we have the original movies, let's join them with the `movie_basics` and `movie_ratings` tables to add the `start_year`, `runtime_minutes`, and `genres` columns from `movie_basics`, and the `averagerating` and `numvotes` columns from `movie_ratings`. The `title` column already comes from `movie_akas`, so the duplicate title columns from `movie_basics` are dropped.

In [20]:
originals = originals.set_index('movie_id')
movie_basics = movie_basics.set_index('movie_id')
movie_ratings = movie_ratings.set_index('movie_id')

In [21]:
originals = originals.join(movie_basics, how='inner', rsuffix='_mb')
originals = originals.join(movie_ratings, how='inner', rsuffix='_mb')
originals.drop(labels=['primary_title', 'original_title'], axis=1, inplace=True)
originals = originals.reset_index()
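The index-based joins above can also be written as column merges on `movie_id`, which avoids setting and resetting the index. The sketch below shows that on toy frames; the data and the `demo_` names are invented, and `validate="many_to_one"` is an optional extra check (not part of the original cells) that raises an error if `movie_id` is duplicated on the right-hand table.

```python
import pandas as pd

# Toy stand-ins for originals, movie_basics, and movie_ratings (values are illustrative)
demo_originals = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "title": ["A", "B", "C"],
})
demo_basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2"],
    "start_year": [2010, 2014],
    "genres": ["Drama", "Comedy"],
})
demo_ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt3"],
    "averagerating": [8.8, 6.1],
    "numvotes": [1000, 13],
})

# Two inner merges: only movies present in all three tables survive
demo_joined = (
    demo_originals
    .merge(demo_basics, on="movie_id", how="inner", validate="many_to_one")
    .merge(demo_ratings, on="movie_id", how="inner", validate="many_to_one")
)
print(demo_joined)
```

Because both merges are inner joins, a movie missing from either `movie_basics` or `movie_ratings` is silently dropped, which is the same behavior as the `join(..., how='inner')` calls above.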

In [23]:
originals.head()

| | movie_id | title | start_year | runtime_minutes | genres | averagerating | numvotes |
|---|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | 2013 | 175.0 | Action,Crime,Drama | 7.0 | 77 |
| 1 | tt0066787 | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama | 7.2 | 43 |
| 2 | tt0069049 | The Other Side of the Wind | 2018 | 122.0 | Drama | 6.9 | 4517 |
| 3 | tt0069204 | Sabse Bada Sukh | 2018 | | Comedy,Drama | 6.1 | 13 |
| 4 | tt0100275 | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy | 6.5 | 119 |


Let's filter this table for a minimum of 500k `numvotes`, and sort by `averagerating`, as a sanity check.

In [29]:
originals.loc[(originals['numvotes'] >= 500000)].sort_values('averagerating', ascending=False).head()

| | movie_id | title | start_year | runtime_minutes | genres | averagerating | numvotes |
|---|---|---|---|---|---|---|---|
| 1556 | tt1375666 | Inception | 2010 | 148.0 | Action,Adventure,Sci-Fi | 8.8 | 1841066 |
| 206 | tt0816692 | Interstellar | 2014 | 169.0 | Adventure,Drama,Sci-Fi | 8.6 | 1299334 |
| 22558 | tt4154756 | Avengers: Infinity War | 2018 | 149.0 | Action,Adventure,Sci-Fi | 8.5 | 670926 |
| 4404 | tt1675434 | Intouchables | 2011 | 112.0 | Biography,Comedy,Drama | 8.5 | 677343 |
| 14124 | tt2582802 | Whiplash | 2014 | 106.0 | Drama,Music | 8.5 | 616916 |


Now let's look at the `principals` table.

In [30]:
principals.head()

| | movie_id | ordering | person_id | category | job | characters |
|---|---|---|---|---|---|---|
| 0 | tt0111414 | 1 | nm0246005 | actor | | ["The Man"] |
| 1 | tt0111414 | 2 | nm0398271 | director | | |
| 2 | tt0111414 | 3 | nm3739909 | producer | producer | |
| 3 | tt0323808 | 10 | nm0059247 | editor | | |
| 4 | tt0323808 | 1 | nm3579312 | actress | | ["Beth Boothby"] |


In [31]:
principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   movie_id    1028186 non-null  object
 1   ordering    1028186 non-null  int64 
 2   person_id   1028186 non-null  object
 3   category    1028186 non-null  object
 4   job         177684 non-null   object
 5   characters  393360 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


We'll now limit the rows of this table to those that pertain to an original movie, as denoted by the `movie_id` within our `originals` DataFrame. We can then join the resulting table with the `persons` and `movie_ratings` tables to obtain the names of the individuals, as well as the ratings of their work.

In [32]:
# Keep only the principals rows whose movie_id appears in originals.
# Series.isin is vectorized, unlike a row-by-row membership test against a Python list.
principals_in_orig = principals.loc[principals['movie_id'].isin(originals['movie_id'])]

In [33]:
principals_names = principals_in_orig.set_index('person_id').join(persons.set_index('person_id'), how='inner')
principals_names = principals_names.reset_index()
prin_names_ratings = principals_names.set_index('movie_id').join(movie_ratings, how='inner')
prin_names_ratings = prin_names_ratings.reset_index()

We'll now define a function that filters `prin_names_ratings` by the `category` column, separating individuals by role: actors, actresses, directors, writers, producers, and composers. For each person, the function averages the rating and number of votes of their work, and adds a new column for the total number of votes across all of their work. We can use this information to draw conclusions about how best to cast roles for our first films.

In [34]:
def create_top_tables(srole, source):
    """
    INPUT:
    srole: a single role string, i.e. a value of the `category` column of the principals
        table that we are interested in (for example "actor" or "director").
    source: a cleaned principals table, with non-original works removed and already
        joined with the persons and movie_ratings tables.

    OUTPUT:
    A DataFrame with one row per person in that role, limited to people whose work
    averages at least 200,000 votes, ordered by total number of votes.
    """
    in_role = source.loc[source['category'] == srole]
    # Average rating and average number of votes per person
    top = in_role.groupby('person_id')[['averagerating', 'numvotes']].mean()
    # Total number of votes across all of a person's work
    top['total_numvotes'] = in_role.groupby('person_id')['numvotes'].sum()
    # Re-attach names and professions from the global persons table
    top = top.join(persons.set_index('person_id'), how='inner').drop(labels='death_year', axis=1)
    top = top.rename(columns={'numvotes': 'avg_numvotes'})
    top = top.loc[top['avg_numvotes'] >= 200000].sort_values('total_numvotes', ascending=False)
    top['avg_numvotes'] = top['avg_numvotes'].astype('int64')
    return top
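The per-person statistics that the function computes (average rating, average votes, total votes) can also be expressed as a single named aggregation. The sketch below shows that variant on a toy frame; the data and the `demo_` names are invented for illustration, and it leaves out the join back to `persons`.

```python
import pandas as pd

# Toy stand-in for prin_names_ratings (values are illustrative)
demo_source = pd.DataFrame({
    "person_id": ["nm1", "nm1", "nm2", "nm2", "nm3"],
    "category": ["actor", "actor", "actor", "director", "actor"],
    "averagerating": [8.0, 7.0, 6.0, 9.0, 5.0],
    "numvotes": [300000, 500000, 250000, 900000, 1000],
})

# One groupby with named aggregations: output column = (input column, function)
demo_actors = (
    demo_source.loc[demo_source["category"] == "actor"]
    .groupby("person_id")
    .agg(
        averagerating=("averagerating", "mean"),
        avg_numvotes=("numvotes", "mean"),
        total_numvotes=("numvotes", "sum"),
    )
)

# Same 200,000 average-vote cutoff and ordering as create_top_tables
demo_actors = demo_actors.loc[demo_actors["avg_numvotes"] >= 200000]
demo_actors = demo_actors.sort_values("total_numvotes", ascending=False)
print(demo_actors)
```

Naming the outputs in `agg` removes the need for a separate `sum()` pass and the later `rename` of `numvotes`.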

In [35]:
roles = ["actor", "actress", "director", "writer", "producer", "composer"]
# Create one global DataFrame per role (actor, actress, ...), named after the role
for role in roles:
    globals()[role] = create_top_tables(role, prin_names_ratings)

Let's do a `.head()` on one of the resulting tables as a sanity check.

In [43]:
actor.head()

| person_id | averagerating | avg_numvotes | total_numvotes | primary_name | birth_year | primary_profession |
|---|---|---|---|---|---|---|
| nm0000375 | 7.530769 | 488930 | 6356093 | Robert Downey Jr. | 1965.0 | actor,producer,soundtrack |
| nm0000138 | 8.088889 | 697068 | 6273617 | Leonardo DiCaprio | 1974.0 | actor,producer,writer |
| nm0262635 | 7.266667 | 367209 | 5508138 | Chris Evans | 1981.0 | actor,producer,director |
| nm0362766 | 7.361538 | 380252 | 4943284 | Tom Hardy | 1977.0 | actor,producer,writer |
| nm1165110 | 6.933333 | 323257 | 4848858 | Chris Hemsworth | 1983.0 | actor,soundtrack,producer |


 Rotten Tomatoes is a review-aggregation website. Box Office Mojo tracks box-office revenue in a systematic way. It was purchased by IMDb in 2008. Using these three resources we can 

<h2>Feature Engineering</h2>

<h2>Data Analysis</h2>

<h2>Conclusions</h2>