![image](./images/movie-business-consumer-demand.jpg)

# **Movie Industry Analysis**
### Client  :  Microsoft
> *Authored by: Patrick Anastasio*

## Introduction

In preparation for entering the movie production business, Microsoft has asked me to analyze current movie trends and suggest where to invest capital and how to produce profitable movies. I will analyze several datasets and draw inferences from financial information, ratings, and popularity scores. I will also look at established industry professionals to suggest which genres and types of films to invest in, and whom to attach to projects to create buzz and build an audience.

![image](./images/risky2.jpg)

## Business Problem

Movies are a 'risky business.' As a fledgling production house, Microsoft is unsure what kinds of movies to make and where to invest capital. It lacks the experience and industry knowledge that many of the top studios possess. Several factors go into producing a successful movie. This analysis focuses on three overarching features: (1) gross revenue of the top rated and top grossing movies of the modern film era, (2) popular and highly rated genres, and (3) industry professionals who were instrumental in creating these movies.

## The Data

I have pulled in multiple datasets from three industry-standard data aggregation sites:
- [Internet Movie Database (IMDB)](https://www.imdb.com/)
- [The Movie Database (TMDB)](https://www.themoviedb.org/?language=en-US)
- [The Numbers](https://www.the-numbers.com/)

My subsequent filtering and analysis of these datasets focused on the following metrics:
> - Financials: 
>    - Budget and Domestic Gross Revenue
> - Ratings and Popularity Scores
> - Movie Genres
> - Names of directors

## The Method

After merging the datasets of interest, I narrowed the scope of my analysis by filtering the data to include only movies made from 2010 forward. This constitutes the modern era of movie-making, which is characterized by new technologies, an explosion of investment, and bigger budgets.

I then converted data types as needed. Specifically, I converted the financial columns from objects (strings) to numbers so that I could work with them mathematically.

From these merged and cleaned datasets I pulled dataframes based on:
>1. ratings and popularity scores across all movies and averaged these into specific genres  
>2. domestic gross revenue across all movies, then homing in on the top thirty (30) grossing movies and their budgets  
>3. directors of the top thirty (30) grossing movies, as well as writers and actors.

In [None]:
# import the packages that will be used in this project

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

## Import the data
#### Read in the raw data files, and create the dataframes I will work with

In [None]:
names_by_id = pd.read_csv('data/zippedData/imdb.name.basics.csv.gz')

names_by_id.info()

In [None]:
title_ratings = pd.read_csv('data/zippedData/imdb.title.ratings.csv.gz')

title_ratings.info()

In [None]:
title_and_genre = pd.read_csv('data/zippedData/imdb.title.basics.csv.gz')

title_and_genre.info()

In [None]:
directors_and_writers = pd.read_csv('data/zippedData/imdb.title.crew.csv.gz')

directors_and_writers.info()

In [None]:
talent_list = pd.read_csv('data/zippedData/imdb.title.principals.csv.gz')

talent_list.info()

In [None]:
popularity_and_votes = pd.read_csv('data/zippedData/tmdb.movies.csv.gz')

popularity_and_votes.info()

In [None]:
budget_and_gross = pd.read_csv('data/zippedData/tn.movie_budgets.csv.gz')

budget_and_gross.info()

## Clean & Analyze the data
#### I am looking to draw inferences from
* hype measured by ratings and popularity
* budget and gross (for simplicity's sake I will only focus on domestic gross)
* talent attached to popular and profitable movies

### Genre by average rating
>I pull out a dataframe that contains information on a movie's genre and its rating

In [None]:
# Merge the datafiles based on common key 'tconst'

genre_by_rating = title_and_genre.merge(title_ratings, on='tconst')

genre_by_rating.info()

#### As stated above, the focus is on movies produced in the modern era  
>I need to filter out titles that were made before the year 2010

In [None]:
# notice from the info that 'start_year' is of dtype: float, which makes filtering easier than converting a dtype: object to datetime

# filter on movies produced from 2010 forward

genre_by_rating = genre_by_rating[genre_by_rating['start_year'] >= 2010.00]

genre_by_rating['start_year'].min()

In [None]:
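# take the slice we want
# .copy() gives an independent dataframe, so the in-place edits below do not trigger a SettingWithCopyWarning

genre_by_rating = genre_by_rating[['original_title', 'genres', 'averagerating', 'numvotes']].copy()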

genre_by_rating.info()

In [None]:
# drop the null values in 'genres'

genre_by_rating.dropna(subset=['genres'], axis=0, inplace=True)

genre_by_rating.info()

In [None]:
# drop duplicate titles

genre_by_rating.drop_duplicates(subset='original_title', inplace=True)

genre_by_rating.head()

In [None]:
# filter out movies that do not have many votes, we will set the threshold to at least 300 votes

low_votes = genre_by_rating[genre_by_rating['numvotes'] < 300.0].index
genre_by_rating.drop(low_votes, inplace=True)

genre_by_rating['numvotes'].min()

In [None]:
# filter out low ratings to focus on high rated movies, we will set the threshold at a rating score of 8.5

# filter out the low ratings

low_ratings = genre_by_rating[genre_by_rating['averagerating'] < 8.5].index
genre_by_rating.drop(low_ratings, inplace=True)

genre_by_rating['averagerating'].min()

In [None]:
genre_by_rating.head()

In [None]:
# notice that there are 'genres' values that have multiple genres listed separated by commas
# I will focus on movies with only one genre

# drop values with multiple genres

genre_by_rating_multigenre = genre_by_rating[genre_by_rating['genres'].str.contains(',')].index
genre_by_rating.drop(genre_by_rating_multigenre, inplace=True)

genre_by_rating['genres'].unique()
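
Dropping multi-genre titles discards every movie tagged with more than one genre. As an alternative, a minimal sketch could count each movie once per listed genre by splitting the `genres` string and exploding it into rows. The toy frame below uses made-up titles and ratings; only the column names come from the notebook.

```python
import pandas as pd

# toy data: hypothetical titles and ratings, same column names as genre_by_rating
toy = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Action,Drama', 'Drama', 'Action'],
    'averagerating': [9.0, 8.6, 8.8],
})

# split 'Action,Drama' into ['Action', 'Drama'], then give each genre its own row
exploded = toy.assign(genres=toy['genres'].str.split(',')).explode('genres')

# average rating per genre, with multi-genre movies counted in each of their genres
genre_means = (
    exploded.groupby('genres')['averagerating']
    .mean()
    .sort_values(ascending=False)
)
print(genre_means)
```

Whether this is preferable depends on the question being asked: it keeps more data, but a single movie then influences several genre averages.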

In [None]:
# then we group our dataframe by 'genres' and show the mean of average ratings for each genre

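# numeric_only=True skips the text column 'original_title', which recent pandas versions refuse to average
genre_by_rating_means = genre_by_rating.groupby('genres').mean(numeric_only=True).sort_values(by='averagerating', ascending=False)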

genre_by_rating_means

In [None]:
# create the plot

fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
fig.set_facecolor('mediumaquamarine')
ax.set_facecolor('black')
ax.stem(genre_by_rating_means.index, genre_by_rating_means['averagerating'], linefmt='w-', markerfmt='wo', basefmt='w')
ax.set(ylim=(8.4, 9.3),)
ax.set_xlabel('Genre', font='Andale Mono', fontsize=20, labelpad=10)
ax.set_ylabel('Rating', rotation=0, font='Andale Mono', fontsize=20, labelpad=20)
ax.set_title('Genres by Rating', font='Andale Mono', fontsize=50, loc="center", pad=10)
plt.xticks(font='Andale Mono', fontsize=14)
plt.yticks(font='Andale Mono', fontsize=14)
plt.axhline(y=9.0, ls='--', c='mediumaquamarine')
plt.axhline(y=8.8, ls='--', c='mediumaquamarine')
ax.tick_params(axis='x', labelrotation = 30)
# plt.savefig('./images/genres_by_rating.png')

### Gross Profit
>I pull out a dataframe that contains information on a movie's popularity and financials

In [None]:
# I will have to merge The Numbers dataset but there is no common key
# notice the 'movie' key is the same as the 'original_title' key

# rename the 'movie' key to 'original_title' to merge it

budget_and_gross.rename({'movie':'original_title'}, axis=1, inplace=True)

budget_and_gross.head(1)

In [None]:
# merge the required datasets

popularity_financials = title_and_genre.merge(
    popularity_and_votes, on='original_title', how='right').merge(
    budget_and_gross, on='original_title', how='right')

popularity_financials.info()

In [None]:
# filter on movies produced from 2010 forward

popularity_financials = popularity_financials[popularity_financials['start_year'] >= 2010.00]

popularity_financials['start_year'].min()

In [None]:
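# take the slice we want
# .copy() gives an independent dataframe, so later in-place edits and new columns do not trigger a SettingWithCopyWarning

popularity_financials = popularity_financials[['original_title', 'popularity', 'production_budget', 'domestic_gross']].copy()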

popularity_financials.info()

In [None]:
# drop duplicate titles

popularity_financials.drop_duplicates(subset='original_title', inplace=True)

popularity_financials.info()

#### Unfortunately, the values in the financial columns are of dtype: object
>I need to convert these values to a number dtype to work with them mathematically

In [None]:
# create a function that will take an object and transform it into a number

def drop_dollar_sign_and_commas(value):
    """
    Convert a currency string such as '$1,234,567' to a float.

    str.replace() removes every occurrence of a character, so a single call
    handles values with more than one comma. This also avoids removing items
    from a list while iterating over it, which can skip elements.
    """
    return float(value.replace('$', '').replace(',', ''))
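
The helper above handles one string at a time. As an alternative, a minimal sketch of a vectorized conversion with pandas string methods is shown below. The sample values are made up, and `errors='coerce'` is an assumption about how malformed entries should be treated (they become `NaN` instead of raising an error).

```python
import pandas as pd

# toy values in the same '$1,234,567' format as the financial columns
raw = pd.Series(['$425,000,000', '$0', '$1,500'])

# strip every '$' and ',' in one pass, then convert to a numeric dtype
# errors='coerce' turns anything unparseable into NaN instead of raising
cleaned = pd.to_numeric(
    raw.str.replace(r'[$,]', '', regex=True), errors='coerce'
).astype(float)

print(cleaned.tolist())
```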

In [None]:
# create new columns for the float values using .map() and our function above

popularity_financials['Budget'] = popularity_financials['production_budget'].map(drop_dollar_sign_and_commas)
popularity_financials['Domestic Gross'] = popularity_financials['domestic_gross'].map(drop_dollar_sign_and_commas)

popularity_financials.info()

In [None]:
# slice for just the floats

popularity_financials = popularity_financials[['original_title', 'popularity', 'Budget', 'Domestic Gross']].copy()

popularity_financials.info()

#### I need to find the gross profit of these movies
>The gross profit will be the domestic gross minus the budget

In [None]:
# create a column for gross profit

popularity_financials['Profit'] = popularity_financials['Domestic Gross'] - popularity_financials['Budget']

popularity_financials.head()

In [None]:
# pull the 30 most profitable movies

most_profitable = popularity_financials.sort_values(by='Profit', ascending=False).head(30)

most_profitable.head()

>Currency function below borrowed from [datavizpyr](https://datavizpyr.com/add-dollar-sign-on-axis-ticks-in-matplotlib/)

In [None]:
# plot the 30 most profitable movies

# we want to avoid scientific notation and put the tick numbers into short-form USD

# use [currency] function cited above

def currency(x, pos):
    """
    This function will format a tick of float type to currency
    The two args are the value and tick position
    """
    if x >= 1e6:
        s = '${:1.0f}M'.format(x*1e-6)
    else:
        s = '${:1.0f}K'.format(x*1e-3)
    return s


# create the plot

fig, ax = plt.subplots()
fig.set_size_inches(25, 20)
fig.set_facecolor('mediumaquamarine')
ax.set_facecolor('black')
ax.barh(most_profitable['original_title'], width=most_profitable['Profit'], height=0.5, color='hotpink')
ax.set_xlabel('Profit', font='Andale Mono', fontsize=30, labelpad=15)
ax.set_ylabel('Movie', rotation=30, font='Andale Mono', fontsize=30, labelpad=0)
ax.set_title('Most Profitable Movies', font='Andale Mono', fontsize=40, weight='bold', loc="center", pad=0)
plt.xticks(font='Andale Mono', fontsize=24, weight='bold')
plt.yticks(font='Andale Mono', fontsize=20, weight='bold')
ax.invert_yaxis()
plt.ticklabel_format(axis='x', style='plain')
ax.xaxis.set_major_formatter(currency)
# plt.savefig('./images/profit.png', dpi=200, bbox_inches='tight')
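
The tick formatter used above distinguishes only between values below and above one million. A hedged extension that also covers billions and negative values (a loss) might look like the sketch below. The name `currency_short` and the chosen precision are assumptions for illustration, not part of the cited datavizpyr function.

```python
def currency_short(x, pos=None):
    """Format a tick value as short-form USD, e.g. thousands as K and millions as M."""
    sign = '-' if x < 0 else ''
    x = abs(x)
    if x >= 1e9:
        return '{}${:.1f}B'.format(sign, x / 1e9)
    if x >= 1e6:
        return '{}${:.0f}M'.format(sign, x / 1e6)
    if x >= 1e3:
        return '{}${:.0f}K'.format(sign, x / 1e3)
    return '{}${:.0f}'.format(sign, x)


# try it on a few sample values
for value in (250, 12_000, 45_000_000, 1_200_000_000, -30_000_000):
    print(currency_short(value))
```

It takes the same `(x, pos)` arguments as `currency`, so it could be passed to `ax.xaxis.set_major_formatter` in the same way.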

### Most Popular
>I pull out a dataframe that contains information on a movie's popularity

In [None]:
# pull the 30 most popular movies

most_popular = popularity_financials.sort_values(by='popularity', ascending=False).head(30)

most_popular.head()

In [None]:
# create the plot

fig, ax = plt.subplots()
fig.set_size_inches(25, 20)
fig.set_facecolor('mediumaquamarine')
ax.set_facecolor('black')
ax.barh(most_popular['original_title'], width=most_popular['popularity'], height=0.5, color='darkorange')
ax.set_xlabel('Popularity Score', font='Andale Mono', fontsize=30, labelpad=15)
ax.set_ylabel('Movie', rotation=30, font='Andale Mono', fontsize=30, labelpad=10)
ax.set_title('Most Popular Movies', font='Andale Mono', fontsize=40, weight='bold', loc="center", pad=0)
plt.xticks(font='Andale Mono', fontsize=24, weight='bold')
plt.yticks(font='Andale Mono', fontsize=20, weight='bold')
ax.invert_yaxis()
# plt.savefig('./images/popularity_score.png', dpi=200, bbox_inches='tight')

### Popularity vs. Budget
>I want to see if there is a correlation between popularity and budget

In [None]:
# create a scatter plot with a regression line for popularity vs. budget

# create the plot

fig, ax = plt.subplots()
fig.set_size_inches(20, 20)
fig.set_facecolor('mediumaquamarine')
ax.set_facecolor('black')
ax.scatter(popularity_financials['Budget'], popularity_financials['popularity'], s=65)
ax.set_xlabel('Budget', font='Andale Mono', fontsize=30, labelpad=15)
ax.set_ylabel('Popularity', rotation=90, font='Andale Mono', fontsize=30, labelpad=10)
ax.set_title('Popularity vs Budget', font='Andale Mono', fontsize=40, weight='bold', loc="center", pad=0)
plt.xticks(font='Andale Mono', fontsize=24, weight='bold')
plt.yticks(font='Andale Mono', fontsize=20, weight='bold')
ax.xaxis.set_major_formatter(currency)


#add the regression line to scatterplot

m, b = np.polyfit(popularity_financials['Budget'], popularity_financials['popularity'], 1)
plt.plot(popularity_financials['Budget'], m*popularity_financials['Budget']+b, color='red', linewidth=5)

# plt.savefig('./images/popularity_vs_budget.png')
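
The regression line shows the direction of the relationship but not its strength. A minimal sketch of how the strength could be quantified with the Pearson correlation coefficient follows. It runs on synthetic data generated purely for illustration, so the printed numbers say nothing about the real dataset.

```python
import numpy as np

# synthetic data for illustration only: popularity rises weakly with budget
rng = np.random.default_rng(0)
budget = rng.uniform(1e6, 2e8, size=200)
popularity = 5 + 4e-8 * budget + rng.normal(0, 3, size=200)

# slope and intercept of the least-squares line, as in the plot above
m, b = np.polyfit(budget, popularity, 1)

# Pearson correlation coefficient r, and r**2 (share of variance explained by the line)
r = np.corrcoef(budget, popularity)[0, 1]
print(f'slope={m:.2e}, r={r:.2f}, r^2={r**2:.2f}')
```

On the real data, the same two calls could be applied to `popularity_financials['Budget']` and `popularity_financials['popularity']`, provided neither column contains missing values.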

### Most Profitable Directors
>I want to see who directed the most profitable movies

In [None]:
# merge the 'directors_and_writers' dataset with 'title_and_genre'

directors_merged = title_and_genre.merge(directors_and_writers, on='tconst')

directors_merged.info()

In [None]:
# we want to merge this with the 'names_by_id', but there is no common key

# change the name of the 'directors' column to a common key

directors_merged.rename({'directors':'nconst'}, axis=1, inplace=True)

directors_merged.head()

In [None]:
# merge on the common key

director_names_merged = directors_merged.merge(names_by_id, on='nconst')

director_names_merged.info()

In [None]:
# merge this with the dataset containing the most profitable movies on common key 'original_title'

top_profit_directors = most_profitable.merge(director_names_merged, on='original_title')

top_profit_directors.info()
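
An inner merge silently drops any top movie that has no matching director row. One way to see which titles fall out is a left merge with `indicator=True`, sketched below on toy frames. The titles, names, and profits are made up; only the column names mirror the notebook.

```python
import pandas as pd

# toy stand-ins for most_profitable and director_names_merged (values are made up)
top_movies = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B', 'Movie C'],
    'Profit': [300.0, 200.0, 100.0],
})
directors = pd.DataFrame({
    'original_title': ['Movie A', 'Movie C'],
    'primary_name': ['Director X', 'Director Y'],
})

# a left merge keeps every top movie; indicator=True adds a '_merge' column
# that says whether a director row was found ('both') or not ('left_only')
checked = top_movies.merge(directors, on='original_title', how='left', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'original_title'].tolist()
print(unmatched)
```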

In [None]:
top_profit_directors['primary_name'].sort_values()

In [None]:
# notice only 20 of the top movies have directors listed, and 2 are repeated
# so we will only have 18 names on the plot, their most profitable movie will be plotted
# the dataframe is already sorted by most profitable movies

# create the plot

fig, ax = plt.subplots()
fig.set_size_inches(20, 20)
fig.set_facecolor('mediumaquamarine')
ax.set_facecolor('black')
ax.barh(top_profit_directors['primary_name'], width=top_profit_directors['Profit'], height=0.5)
ax.set_xlabel('Profit', font='Andale Mono', fontsize=30, labelpad=15)
ax.set_ylabel('Director', rotation=30, font='Andale Mono', fontsize=30, labelpad=10)
ax.set_title('Most Profitable Directors', font='Andale Mono', fontsize=40, weight='bold', loc="center", pad=0)
plt.xticks(font='Andale Mono', fontsize=24, weight='bold')
plt.yticks(font='Andale Mono', fontsize=20, weight='bold')
ax.invert_yaxis()
plt.ticklabel_format(axis='x', style='plain')
ax.xaxis.set_major_formatter(currency)
# plt.savefig('./images/top_profit_directors.png', dpi=200, bbox_inches='tight')
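
The plot relies on repeated director names landing on the same bar. A minimal sketch of making the "one bar per director, most profitable movie only" rule explicit before plotting is shown below, using a toy frame with made-up names and profits.

```python
import pandas as pd

# toy stand-in for top_profit_directors (names and profits are made up)
toy = pd.DataFrame({
    'primary_name': ['Director X', 'Director Y', 'Director X'],
    'original_title': ['Movie A', 'Movie B', 'Movie C'],
    'Profit': [300.0, 200.0, 100.0],
})

# sort by profit, then keep only the first (most profitable) row per director
one_per_director = (
    toy.sort_values(by='Profit', ascending=False)
       .drop_duplicates(subset='primary_name', keep='first')
)
print(one_per_director)
```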

### Most Popular Directors
>I want to see who directed the most popular movies

In [None]:
# merge this with the dataset containing the most popular movies on common key 'original_title'

top_pop_directors = most_popular.merge(director_names_merged, on='original_title')

# top_pop_directors.drop(['Budget','Domestic Gross','tconst','primary_title','start_year','runtime_minutes','genres','nconst','writers','birth_year','death_year','primary_profession','known_for_titles'], axis=1, inplace=True)
top_pop_directors.info()

In [None]:
top_pop_directors['primary_name'].sort_values()

In [None]:
# notice only 26 of the top movies have directors listed, and there are 3 repeats
# so we will only have 23 names on the plot, their most popular movie will be plotted
# the dataframe is already sorted by most popular movies

# create the plot

fig, ax = plt.subplots()
fig.set_size_inches(20, 20)
fig.set_facecolor('mediumaquamarine')
ax.set_facecolor('black')
ax.barh(top_pop_directors['primary_name'], width=top_pop_directors['popularity'], height=0.5, color='yellow')
ax.set_xlabel('Movie Popularity Score', font='Andale Mono', fontsize=30, labelpad=15)
ax.set_ylabel('Director', rotation=30, font='Andale Mono', fontsize=30, labelpad=100)
ax.set_title('Most Popular Directors', font='Andale Mono', fontsize=40, weight='bold', loc="center", pad=0)
plt.xticks(font='Andale Mono', fontsize=24, weight='bold')
plt.yticks(font='Andale Mono', fontsize=20, weight='bold')
ax.invert_yaxis()
plt.ticklabel_format(axis='x', style='plain')
# plt.savefig('./images/most_popular_directors.png', dpi=200, bbox_inches='tight')

## Results

### Filtering the data on a minimum number of votes and a minimum rating threshold, my analysis shows the highest rated genres by average movie rating are:
    
> - **Action** is the top genre by a large margin
> - Other genres with high ratings are:
>    - Thriller
>    - Documentary
>    - Comedy
>    - Drama
>    - Animation
        
### Looking at the gross profit of the top 30 movies of the modern era, my analysis shows the following:

> - 15 were in the animation or computer-generated graphics genre, with many being franchises as well
> - 13 were in the action genre:
>    - All but 1 of those were part of a franchise, or connected series of movies
>    - 6 were super-hero / comic book movies, and all were part of a franchise
        


### Further, looking at the popularity scores of the top 30 most popular movies of the modern era, my analysis shows the following:

> - 22 were in the action genre, 19 were part of a franchise
>    - 16 of these were super-hero / comic book franchises
> - 3 were animation
> - 3 were drama
> - 2 were fantasy/adventure franchises

### We also looked at who the directors were on the most profitable and the most popular movies, with some directors appearing multiple times in these lists.

## In Conclusion
Based on these observations, there are three recommendations that I will put forth:

#### 1. Microsoft should acquire the rights to a super-hero / comic book franchise, or possibly another type of action franchise
>- The most popular and profitable genre overall is action.  
>- The most successful movies by both profitability and popularity were in the superhero / comic book sub-genre.  
>- All were franchises

#### 2. Microsoft should produce animated movies
>- 15 of the top 30 most profitable were animation

#### 3. Microsoft should attach top grossing and popular directors
>- Directors are the leaders on set and they can make or break a project. You want a proven and experienced director at the helm.  
>- They bring buzz and name recognition, and they attract top talent and collaborative investment to their projects

## Further Considerations

I would next look at the budgets of popular movies. The scatter plot showed a slight positive correlation between budget and popularity. This could be a function of larger marketing budgets, the pay scales of top talent, or something else. It could prove a worthwhile analysis of where to allocate capital within a budget, and of whether certain escalations could pay dividends for the bottom line.

#### Thank You!

Email: sudomakecoffee1@gmail.com  

GitHub: [@patrick-anastasio](https://github.com/patrick-anastasio)

LinkedIn: [patrickanastasio](https://linkedin.com/in/patrickanastasio/)
