![example](images/director_shot.jpeg)

# Project Title

**Authors:** Mitch Allison, Brenda De Leon, Matt Rubic
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

We are looking to find at least 3 actionable steps to ensure the greatest chance of a new movie's financial success. The questions to be answered are:
* What budget provides the greatest profit or ROI?
* What actors/actresses/directors lead to the greatest profit?
* What genres lead to the greatest profit at differing movie budgets?
* What is the best time of year to release a movie?

***
Notes:
* A new movie studio has no relationship with talent or catalogue of movies. We need to use our initial seed money intelligently, to foster future success.
* We picked these questions to answer after looking at the cleaned data and spotting certain trends.
* A new movie is expensive. We need to analyze data from the past decade of movies to maximize profit given a certain budget.
A flop at the start would likely be difficult for a new studio to recover from without additional investment.

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import sqlite3

%matplotlib inline
conn = sqlite3.connect('./zippedData/im.db')

In [2]:
# loading the datasets
TheNumbers = pd.read_csv('./zippedData/tn.movie_budgets.csv.gz')
RTinfo = pd.read_csv('./zippedData/rt.movie_info.tsv.gz', sep='\t')
RTreviews = pd.read_csv('./zippedData/rt.reviews.tsv.gz', sep='\t', encoding = 'unicode_escape')

We used data from the IMDB SQL database to reference talent and movies. We used data from TheNumbers csv to find financial data for movies.
Both datasets required some cleaning, especially in order to join them and find trends between movies' features and their financial data.

***
Notes:
* This data is publicly available from IMDB and TheNumbers. It relates to the questions because these are comprehensive sources of data about movies.
* The data represents movies and their features. It includes variables such as actors, directors, movies, budget, and gross.
* To answer our business questions, we needed to know each movie's financial information, how that movie was classified, when it was released,
and who was associated with that movie.
* We intend to use the correlation of budget with worldwide profit to find the movies that are most likely to see a good return on investment (budget).

### Rotten Tomatoes Movie Info dataset

In [3]:
RTinfo.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [4]:
RTinfo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


**Discovery**

We may need to locate the nulls and convert numeric strings to numbers. Most of the studio values are missing. Currency and box office appear to be populated in the same entries, but only for about 22% of the rows (340 of 1,560).
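
The checks described above could be sketched roughly as follows. The `sample_info` frame is a hypothetical stand-in with made-up values in the same shape as a few RTinfo columns, and the assumption that box office strings use comma thousands separators is ours, not something confirmed by the output above.

```python
import pandas as pd

# Hypothetical stand-in for a few RTinfo rows (values are illustrative only)
sample_info = pd.DataFrame({
    "currency": [None, "$", None, None],
    "box_office": [None, "600,000", None, None],
    "runtime": ["104 minutes", "108 minutes", "116 minutes", None],
    "studio": [None, "Entertainment One", None, None],
})

# Share of missing values per column, largest first
null_share = sample_info.isna().mean().sort_values(ascending=False)

# Check whether currency and box_office are missing in exactly the same rows
same_rows_missing = (
    sample_info["currency"].isna() == sample_info["box_office"].isna()
).all()

# Strip thousands separators, then coerce to numbers (unparseable entries become NaN)
box_office_num = pd.to_numeric(
    sample_info["box_office"].str.replace(",", "", regex=False), errors="coerce"
)

# Pull the leading integer out of strings such as "104 minutes"
runtime_min = pd.to_numeric(
    sample_info["runtime"].str.extract(r"(\d+)", expand=False), errors="coerce"
)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected strings, at the cost of silently turning them into NaN, so the null counts are worth re-checking afterwards.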

**How this data might be used**

Could use this dataset to connect information about movies that appear in this dataset and others, to aggregate all information about a movie

### Rotten Tomatoes Reviews dataset

In [5]:
RTreviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [6]:
RTreviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


**Discovery**

Rating is stored as a string and will need to be converted to a number to be useful. Many ratings are NaN and will need to be dropped, as will the many ratings given as letter grades. Numeric ratings are sometimes on different scales (that is, different denominators), but can be normalized to real numbers.

Fresh or rotten seems to exist for each entry

Dates are also strings and need to be normalized to be used

ID needs to be traced to the movie it represents, in order to aggregate reviews by movie.

May want a new group for just reviews by top critics
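
One way the rating conversion described above might look is sketched below. The helper name `rating_to_fraction` and the sample values are hypothetical, and the rule of discarding anything that is not a "numerator/denominator" string is an assumption about how letter grades should be handled.

```python
import pandas as pd

def rating_to_fraction(rating):
    """Convert 'numerator/denominator' strings to a score; anything else becomes NaN."""
    if not isinstance(rating, str) or "/" not in rating:
        return float("nan")
    numerator, _, denominator = rating.partition("/")
    try:
        numerator, denominator = float(numerator), float(denominator)
    except ValueError:
        return float("nan")
    if denominator <= 0:
        return float("nan")
    return numerator / denominator

# Illustrative values only: fractions on different scales, a missing value, a letter grade
sample_ratings = pd.Series(["3/5", None, "B+", "8/10", "2.5/4"])
scores = sample_ratings.map(rating_to_fraction)
```

Dividing by the denominator puts ratings from different scales on a common 0 to 1 range, which is what makes them comparable across critics.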

**How this data might be used**

Could be used to display movie ratings alongside other information for common movies from other datasets

### The Numbers dataset

In [7]:
TheNumbers.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [8]:
TheNumbers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


**Discovery**

All entries in this dataset are non-null. However, every column except id is stored as a string, so the dates and dollar amounts will need to be converted to be usable.

**How this data might be used**

Use to evaluate and compare different financial information across movies, and aggregate with other information from different datasets

### IMDB database

In [9]:
query = '''
SELECT
    *
FROM
    movie_basics
--LIMIT 15
    '''

In [None]:
movie_basics = pd.read_sql(query, conn)
movie_basics

In [None]:
movie_basics.info()

In [None]:
query = '''
SELECT
    *
FROM
    movie_ratings
--LIMIT 15
    '''

In [None]:
movie_ratings = pd.read_sql(query, conn)
movie_ratings

In [None]:
movie_ratings.info()

In [None]:
query = '''
SELECT
    *
FROM
    principals
--LIMIT 15
    '''

In [None]:
principals = pd.read_sql(query, conn)
principals

In [None]:
principals.info()

**Discovery**

This database contains a lot of data about movies that may help identify them, like id, title, release date, actors/directors, etc.

**How this data might be used**

It looks like this IMDB database will be helpful for lookups such as finding all movies from a certain director, but won't be as helpful for finding direct answers to our business questions.

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

**Cleaning data in TheNumbers**

TheNumbers represents our financial data we will use to analyze relationships between other variables and metrics derived from Gross revenue like profit, ROI, etc.

The following code converts dollar amounts in the dataframe to integer types so they can be manipulated as numbers.

In [None]:
punct = '$,'
money_to_num = str.maketrans(dict.fromkeys(punct, ''))

#stripping dollar-sign and commas from relevant columns, then converting number strings to int64 types
for col in ['worldwide_gross', 'domestic_gross', 'production_budget']:
    TheNumbers[col] = TheNumbers[col].str.translate(money_to_num).astype(np.int64)

The following code creates metrics for profit, profit margin, and ROI from the production budget and the domestic and worldwide gross in TheNumbers dataset. We will use these metrics to evaluate and compare a film's profitability and the efficiency of an investment, which is more useful than production budget or gross revenue alone when exploring the factors that contribute to profitability.

In [None]:
#created variables for domestic profit, worldwide profit, domestic & worldwide profit margins (profit / gross)
#and domestic & worldwide ROIs (profit / production budget)
TheNumbers['domestic_profit'] = TheNumbers['domestic_gross'] - TheNumbers['production_budget']
TheNumbers['worldwide_profit'] = TheNumbers['worldwide_gross'] - TheNumbers['production_budget']
TheNumbers['domestic_margin'] = TheNumbers['domestic_profit'] / TheNumbers['domestic_gross']
TheNumbers['worldwide_margin'] = TheNumbers['worldwide_profit'] / TheNumbers['worldwide_gross']
TheNumbers['domestic_roi'] = TheNumbers['domestic_profit'] / TheNumbers['production_budget']
TheNumbers['worldwide_roi'] = TheNumbers['worldwide_profit'] / TheNumbers['production_budget']
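
As a worked example of these definitions, the sketch below applies them by hand to the first row shown in the TheNumbers preview (Avatar). The variable names are ours and exist only for illustration.

```python
# Figures taken from the first row of TheNumbers.head() above
production_budget = 425_000_000
worldwide_gross = 2_776_345_279

# Profit is what remains of the gross after the budget is paid back
worldwide_profit = worldwide_gross - production_budget

# Margin expresses profit as a share of gross revenue
worldwide_margin = worldwide_profit / worldwide_gross

# ROI expresses profit as a multiple of the money invested
worldwide_roi = worldwide_profit / production_budget

print(worldwide_profit, round(worldwide_margin, 3), round(worldwide_roi, 2))
```

Margin is bounded above by 1, while ROI is not, which is why ROI separates very efficient films from merely profitable ones more clearly.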

The following code returns numbers rather than strings for release dates, so that this data can be more easily explored later 

In [None]:
#Normalize release dates as numbers (original dataset had month names)
TheNumbers['release_date'] = pd.to_datetime(TheNumbers['release_date'])
#Creates a new column in the dataset holding the calendar month and day ('MM-DD'), as we seek to explore
#the relationship between calendar dates and revenue
TheNumbers['calendar_day'] = TheNumbers['release_date'].astype(str).str.strip().str[-5:]
#Creates a function that assigns calendar day strings to a number. There is probably a simpler way to do this
def getdaynum(cal_day):
    day_list = []
    for entry in cal_day:
        if entry[:2] == '01':
            day_list.append(int(entry[3:]))
        elif entry[:2] == '02':
            day_list.append(int(entry[3:]) + 31)
        elif entry[:2] == '03':
            day_list.append(int(entry[3:]) + 60)
        elif entry[:2] == '04':
            day_list.append(int(entry[3:]) + 91)
        elif entry[:2] == '05':
            # Days before May 1 in a leap year: 31 + 29 + 31 + 30 = 121
            day_list.append(int(entry[3:]) + 121)
        elif entry[:2] == '06':
            day_list.append(int(entry[3:]) + 152)
        elif entry[:2] == '07':
            day_list.append(int(entry[3:]) + 182)
        elif entry[:2] == '08':
            day_list.append(int(entry[3:]) + 213)
        elif entry[:2] == '09':
            day_list.append(int(entry[3:]) + 244)
        elif entry[:2] == '10':
            day_list.append(int(entry[3:]) + 274)
        elif entry[:2] == '11':
            day_list.append(int(entry[3:]) + 305)
        elif entry[:2] == '12':
            day_list.append(int(entry[3:]) + 335)
    return pd.Series(day_list)
#creates new column in dataset for integer values of days
TheNumbers['release_day_num'] = getdaynum(TheNumbers['calendar_day'])
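
A shorter alternative to the month-by-month branches is sketched below. It assumes the same 1 to 366 numbering (a leap-year calendar) and pins every 'MM-DD' string to the year 2000, which is our choice of reference leap year. The function name `day_of_leap_year` is hypothetical.

```python
import pandas as pd

def day_of_leap_year(cal_day):
    """Map 'MM-DD' strings to day numbers 1-366 using the leap year 2000 as the calendar."""
    return pd.to_datetime("2000-" + cal_day, format="%Y-%m-%d").dt.dayofyear

# Illustrative inputs covering the start of the year, the leap day, and the year end
sample_days = pd.Series(["01-01", "02-29", "03-01", "05-01", "12-31"])
day_nums = day_of_leap_year(sample_days)
```

Because the result keeps the index of the input Series, it can be assigned back to a dataframe column even if rows have been filtered or reordered.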

**Creating dataframe to analyze IMDB data with corresponding columns from TheNumbers**

The following code creates a dataframe that takes the production_budget information from TheNumbers and assigns each movie its genres from IMDB. Connecting movie names to their genres lets us explore relationships between genre and financial performance, by grouping movies by genre and charting the groups against total budget, profit, and margins.

In [None]:
query = '''
SELECT
    primary_title,
    genres,
    start_year
FROM
    movie_basics
    '''
movie_genre = pd.read_sql(query, conn)

Split each movie's comma-separated genres string into a list.

In [None]:
movie_genre['genres_list'] = movie_genre['genres'].str.split(',')

Some movies have identical names. Make a new 'name_year' column to join on for both datasets.

In [None]:
movie_genre['name_year'] = movie_genre['primary_title'] + movie_genre['start_year'].apply(str)
# TheNumbers['release_date'] is a datetime64[ns] column, so the year can be formatted directly
TheNumbers['name_year'] = TheNumbers['movie'] + TheNumbers['release_date'].dt.strftime('%Y')
# Set name_year as index
movie_genre.set_index('name_year', inplace=True)
# Remove movies without genre
movie_genre.dropna(subset=['genres_list'], inplace=True)

Join movie_genre with TheNumbers

In [None]:
money_genre = TheNumbers.join(movie_genre, on='name_year', how = 'inner')
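
Because an inner join silently drops every movie whose name and year do not match exactly, it can be useful to measure how many rows are lost. The sketch below shows one way to do that with `merge` and `indicator=True`. The two small frames are hypothetical stand-ins, not rows from the real tables.

```python
import pandas as pd

# Hypothetical miniature versions of the budget and genre tables
budgets = pd.DataFrame({
    "name_year": ["Film A2009", "Film B2019", "Film C2015"],
    "production_budget": [425_000_000, 350_000_000, 1_000_000],
})
genres = pd.DataFrame({
    "name_year": ["Film A2009", "Film B2019"],
    "genres": ["Action,Adventure", "Action,Sci-Fi"],
})

# A left merge with indicator=True keeps every budget row and records whether it matched
checked = budgets.merge(genres, on="name_year", how="left", indicator=True)
match_counts = checked["_merge"].value_counts()

# Keeping only the matched rows gives the same rows as the inner join
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

If the `left_only` count is large, differences in title spelling or release year between the two sources are a likely cause.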

**Some additional cleaning and outlier removal for genres**

Some movies had a value of 0 for domestic or worldwide gross. We chose to exclude them, as they would skew the data for movies that were shown in theaters.

In [None]:
money_genre = money_genre[(money_genre.domestic_gross != 0) & (money_genre.worldwide_gross != 0)]

Some movies are listed with multiple genres. We wanted to explore the trends between each individual genre and financial performance, so we collected all unique genres and built a dictionary that maps each genre to a dataframe of the movies listed under it.

In [None]:
genres = money_genre['genres'].unique().tolist()
unique_genres = []
for x in genres:
    templist = x.split(",")
    for y in templist:
        unique_genres.append(y)
unique_genres = list(set(unique_genres))
# genres_df is dict where key = genre, value = df of movies with that genre
genres_df = {x: money_genre[money_genre.genres.str.contains(x)] for x in unique_genres}
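
One caveat with substring matching is that a genre name contained in another (for example 'Music' inside 'Musical') will match both. The sketch below, on a hypothetical three-row frame, contrasts substring matching with whole-name matching. Whether this affects the real data depends on which genre names actually occur in it.

```python
import pandas as pd

# Hypothetical rows chosen to show the overlap between two genre names
sample = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "genres": ["Drama,Music", "Comedy,Musical", "Action"],
})

# Substring matching: 'Music' also matches 'Musical'
substring_hits = sample[sample["genres"].str.contains("Music", regex=False)]

# Splitting on commas and testing list membership matches whole genre names only
genre_lists = sample["genres"].str.split(",")
exact_hits = sample[genre_lists.apply(lambda names: "Music" in names)]
```

Here the substring version returns movies A and B, while the whole-name version returns only movie A.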

**Creating dataframes to analyze Actor and Director data with corresponding columns from TheNumbers**

Much like we did for genre, we want to explore relationships between actors and the financial performance of the movies they appear in, so we create and merge this dataframe

In [None]:
# joined tables: movie_basics and persons to principals, selected only the matching records.
actors = pd.read_sql(
"""
SELECT
    category,
    pr.movie_id,
    pr.person_id,
    primary_name,
    primary_title 
FROM 
    principals AS pr
JOIN
    movie_basics AS mb using(movie_id)
JOIN
    persons AS pe using(person_id)
WHERE
    category = 'actor'
    or
    category = 'actress'
    or
    category = 'self'
""",conn)
actors

In [None]:
# merging "actors" DataFrame (extracted from "IMDB") with "TheNumbers" DataFrame on 'primary_title' and 'movie'
actors_df = pd.merge(actors,
                  TheNumbers,
                  left_on='primary_title',
                  right_on='movie')

actors_df

In [None]:
# joined tables: movie_basics and persons to principals, selected only the matching records.
directors = pd.read_sql(
"""
SELECT
    category,
    pr.movie_id,
    pr.person_id,
    primary_name,
    primary_title 
FROM 
    principals AS pr
JOIN
    movie_basics AS mb using(movie_id)
JOIN
    persons AS pe using(person_id)
WHERE
    category = 'director'
""",conn)
directors

In [None]:
# merging "directors" DataFrame (extracted from "IMDB") with "TheNumbers" DataFrame on 'primary_title' and 'movie'
directors_df = pd.merge(directors,
                  TheNumbers,
                  left_on='primary_title',
                  right_on='movie')

directors_df

We first cleaned the data. Many movies share names, so we needed to create an additional feature of 'movie name' + 'year released'. We could then join the movies
on these columns.
After joining the data, we were able to find relationships between genre, actors/directors, budget and release date versus worldwide profit.

***
Notes:
* This data required some standard cleaning to use, such as converting dollar amounts from strings to integers and excluding duplicates.
For our initial modeling we used standard methods of graphing data, and then broke the data up by actors/directors or genres.
* This approach is appropriate considering the size of the datasets. Our business problem did not have specific asks,
only to provide the best advice that we can find with these datasets.
***

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

### Exploratory data analyses

**Looking at relationships between variables in TheNumbers**

We performed various analyses to learn which variables seemed to have a strong relationship, in order to determine which trends to dig into and offer as recommendations.

In [None]:
# plot of budget v worldwide gross
fig, ax = plt.subplots()


plt.scatter(TheNumbers['production_budget'], TheNumbers['worldwide_gross'])


ax.set_xlabel('Production Budget')
ax.set_ylabel('Worldwide Gross')

# Add a title for the plot
ax.set_title('Budget v Worldwide Gross')

# Add a legend to the plot with legend() in lower right corner
#ax.legend(["Sample Data"], loc=4);

In [None]:
#correlation between production budget and worldwide gross
TheNumbers['production_budget'].corr(TheNumbers['worldwide_gross'])

This was the strongest correlation that we found in TheNumbers data. It suggests that the amount of money invested in a movie is closely tied to how much revenue it brings in, although correlation alone does not show that a larger budget causes a larger gross. This working assumption guides much of the analysis that follows.

In [None]:
#plot of budget v domestic ROI
fig, ax = plt.subplots()


plt.scatter(TheNumbers['production_budget'], TheNumbers['domestic_roi'])


ax.set_xlabel('Production Budget')
ax.set_ylabel('Domestic ROI')

# Add a title for the plot
ax.set_title('Budget v Domestic ROI')

# Add a legend to the plot with legend() in lower right corner
#ax.legend(["Sample Data"], loc=4);

In [None]:
#correlation between production budget and worldwide ROI
TheNumbers['production_budget'].corr(TheNumbers['worldwide_roi'])

ROI appears to fall off steeply, roughly like a decaying exponential, as production budget increases. This prompted questions about what makes a movie profit at the highest rate, and about which movies are the best investments.

In [None]:
# <= keeps films budgeted at exactly $5,000,000, which a strict < here and > below would drop from every bucket
low_bucket = TheNumbers.loc[TheNumbers['production_budget'] <= 5000000]
mid_bucket = TheNumbers.loc[(TheNumbers['production_budget'] <= 50000000) & (TheNumbers['production_budget'] > 5000000)]
high_bucket = TheNumbers.loc[TheNumbers['production_budget'] > 50000000]

In [None]:
fig, ax = plt.subplots()


plt.hist(TheNumbers['production_budget'])


ax.set_xlabel('Production Budget')

# Add a title for the plot
ax.set_title('Production budget')

# Add a legend to the plot with legend() in lower right corner
#ax.legend(["Sample Data"], loc=4);

Based upon the analysis above, we hypothesized that budget would have a significant effect on performance metrics like total revenue generated and ROI. For that reason, we created budget categories so that we could analyze movies within each category and compare trends across categories. The boundaries of the categories were informed by this blog post (https://www.studiobinder.com/blog/production-budget/), and we checked them against our data to make sure the distinction was a logical one.
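
The same categories can also be expressed as a single labelled column with `pd.cut`, which avoids maintaining three separate filters. This is a minimal sketch using the $5 million and $50 million boundaries from this notebook. Treating each upper boundary as inclusive is an assumption of the sketch, and the sample budgets are made up.

```python
import pandas as pd

# Illustrative budgets, including values that sit exactly on the boundaries
budgets = pd.Series([1_000_000, 5_000_000, 20_000_000, 50_000_000, 200_000_000])

# Bins are (0, 5M], (5M, 50M], (50M, inf); include_lowest also admits a budget of 0
levels = pd.cut(
    budgets,
    bins=[0, 5_000_000, 50_000_000, float("inf")],
    labels=["low", "mid", "high"],
    include_lowest=True,
)
```

With a column like this, per-category summaries become a single `groupby` on the label instead of one expression per bucket.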

In [None]:
#defines function that returns average worldwide gross for a release day number
#(returns NaN when no movie in the dataset was released on that day)
def avg_gross_by_day(day):
    same_day = TheNumbers['release_day_num'] == day
    return TheNumbers.loc[same_day, 'worldwide_gross'].mean()

In [None]:
fig, ax = plt.subplots()

plt.scatter(TheNumbers['release_day_num'], TheNumbers['worldwide_gross'], label='raw')
plt.scatter(range(1,367), [avg_gross_by_day(day) for day in range(1,367)], label='avg')

ax.set_xlabel('Calendar Day')
ax.set_ylabel('Worldwide Gross')

# Add a title for the plot
ax.set_title('Calendar Day v Worldwide Gross')

# Add a legend to the plot with legend() in lower right corner
plt.legend(loc='upper left')
plt.show()

We measured gross here rather than profit because we wanted to learn which release days draw the most moviegoers. Profit would instead have measured attendance relative to the budgets of the movies released on those days. Because moviegoing patterns were what we wanted to learn from this data, gross was the appropriate measure.

Measuring by individual release day allows us to see trends around certain periods of the year and to isolate days that perform unusually well each year (e.g. Christmas Eve). Unfortunately, it is not as easily interpreted as other presentations of similar data.

## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

In [None]:
#graph of budget vs profit for the low, mid and high buckets, each with its own line of fit
fig, ax = plt.subplots()

ax.scatter(low_bucket['production_budget'], low_bucket['worldwide_profit'], label = 'low')
ax.scatter(mid_bucket['production_budget'], mid_bucket['worldwide_profit'], label = 'mid')
ax.scatter(high_bucket['production_budget'], high_bucket['worldwide_profit'], label = 'high')


ax.set_xlabel('Production Budget (in hundred millions)')
ax.set_ylabel('Worldwide Profit (in billions)')

# Add a title for the plot
ax.set_title('Budget v Worldwide Profit')

x1 = low_bucket['production_budget']
y1 = low_bucket['worldwide_profit']
x2 = mid_bucket['production_budget']
y2 = mid_bucket['worldwide_profit']
x3 = high_bucket['production_budget']
y3 = high_bucket['worldwide_profit']

plt.plot(np.unique(x1), np.poly1d(np.polyfit(x1, y1, 1))(np.unique(x1)), color= 'red')
plt.plot(np.unique(x2), np.poly1d(np.polyfit(x2, y2, 1))(np.unique(x2)), color= 'red')
plt.plot(np.unique(x3), np.poly1d(np.polyfit(x3, y3, 1))(np.unique(x3)), color= 'red')

plt.legend(loc='upper left')
plt.show()

The lines of fit show how worldwide profit changes with budget within each budget category. We find that budget correlates moderately with profit for high-budget films, and that profit rises with budget at a much greater rate than for low- and mid-budget films. Because of this, we feel that high-budget films are the most predictable investment to make, and would favor them. We would also favor them for the sheer magnitude of profit that they generate, which is much greater than that of low- or mid-budget films.
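
The slope of a fitted line and the strength of a correlation answer different questions, so it can help to report both for each bucket. The sketch below uses small synthetic arrays (not our data) and a hypothetical helper, `slope_and_corr`, to show that a steeper line can still have a weaker correlation.

```python
import numpy as np

# Synthetic data for illustration only
budget = np.array([1.0, 2.0, 3.0, 4.0])
profit_tight = 2.0 * budget                                        # shallow slope, no scatter
profit_steep = 10.0 * budget + np.array([-8.0, 8.0, -8.0, 8.0])    # steep slope, more scatter

def slope_and_corr(x, y):
    """Return the least-squares slope and the Pearson correlation of y against x."""
    slope, _intercept = np.polyfit(x, y, 1)
    corr = np.corrcoef(x, y)[0, 1]
    return slope, corr

slope_tight, corr_tight = slope_and_corr(budget, profit_tight)
slope_steep, corr_steep = slope_and_corr(budget, profit_steep)
```

The slope speaks to how much profit changes per additional budget dollar, while the correlation speaks to how predictable that change is.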

In [None]:
fig, ax = plt.subplots()

plt.bar('low', (sum(low_bucket['worldwide_gross']) - sum(low_bucket['production_budget'])) / sum(low_bucket['production_budget']))
plt.bar('mid', (sum(mid_bucket['worldwide_gross']) - sum(mid_bucket['production_budget'])) / sum(mid_bucket['production_budget']))
plt.bar('high', (sum(high_bucket['worldwide_gross']) - sum(high_bucket['production_budget'])) / sum(high_bucket['production_budget']))


ax.set_xlabel('Budget level')
ax.set_ylabel('Average ROI')

# Add a title for the plot
ax.set_title('Budget v Average ROI')

# Add a legend to the plot with legend() in lower right corner
#ax.legend(["Sample Data"], loc=4);

This graph compares ROI across budget levels, computed for each level as total profit divided by total production budget.

We were surprised to see that ROI was greater for high-budget than for mid-budget films. As we would have predicted from the earlier graph of budget and ROI, low-budget films had the highest ROI. This analysis leads us to believe that low-budget films are the most efficient investments to make, as they return revenue per dollar invested at a higher rate than films from the other budget categories.

Although we still favor high-budget films for their predictability and higher net profits, we would diversify those investments with many small budget films as well, which seem to return on investment at a significantly higher rate than mid- and high-budget films.
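
Note that ROI for a group of films can be computed in two ways that may differ a great deal: pooled (total profit over total budget, as in the bar chart code above) or as the mean of each film's own ROI. The two-film example below is made up purely to show the difference.

```python
import pandas as pd

# Two hypothetical films: a small one with a very high ROI and a large one with a modest ROI
films = pd.DataFrame({
    "production_budget": [1_000_000, 100_000_000],
    "worldwide_gross": [11_000_000, 150_000_000],
})
profit = films["worldwide_gross"] - films["production_budget"]

# Pooled ROI: total profit over total budget, dominated by the larger film
pooled_roi = profit.sum() / films["production_budget"].sum()

# Mean per-film ROI: every film counts equally regardless of its size
mean_film_roi = (profit / films["production_budget"]).mean()
```

The pooled figure describes the return on a portfolio weighted by spend, while the per-film mean describes the typical individual film, so the choice should follow the investment question being asked.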

In [None]:
fig, ax = plt.subplots()

plt.bar(range(1,54), [avg_gross_by_day(day) for day in range(1,367,7)])
plt.bar(51, [avg_gross_by_day(day) for day in range(346,353)], label='Christmas')
plt.bar(47, [avg_gross_by_day(day) for day in range(318,325)], label = 'Thanksgiving')
plt.bar(21, [avg_gross_by_day(day) for day in range(136,143)], label = 'Memorial Day')
plt.bar(32, [avg_gross_by_day(day) for day in range(217,224)], label = 'Mid-August')


ax.set_xlabel('Calendar Week')
ax.set_ylabel('Avg Worldwide Gross (in hundred millions)')

# Add a title for the plot
ax.set_title('Release Date v Avg Worldwide Gross')

# Add a legend to the plot with legend() in lower right corner
ax.legend(loc='upper left');

This graph plots the average gross of release days within each week. This makes the data more easily interpretable, and it adjusts for annual fluctuations such as holidays that fall within a range of dates. It does obscure some of the highest performing days that were illustrated by the graph of average gross by release day, but captures very similar trends.

This graph shows that the top 10 highest performing weeks (80th percentile and above) of the calendar year all fall either between June and mid-August or in the weeks that contain Thanksgiving and Christmas Eve. Based on this analysis, we identify those as the best times to release a movie, because people see movies with those release dates more than movies released at other times in the year, as a general trend.
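
One possible way to average over every release day in a week, rather than sampling individual days, is to derive a week number from the day number and group on it. The sketch below uses a hypothetical `releases` frame with made-up gross values. The column name `release_week` is ours.

```python
import pandas as pd

# Hypothetical release days and gross values for illustration
releases = pd.DataFrame({
    "release_day_num": [1, 7, 8, 14, 15, 366],
    "worldwide_gross": [100, 300, 50, 150, 500, 40],
})

# Days 1-7 fall in week 1, days 8-14 in week 2, and so on up to week 53
releases["release_week"] = (releases["release_day_num"] - 1) // 7 + 1

# Mean gross over all movies released in each week
avg_gross_by_week = releases.groupby("release_week")["worldwide_gross"].mean()
```

Weeks with no releases are simply absent from the result, which can be filled in with `reindex(range(1, 54))` if a bar for every week is wanted.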

In [None]:
genres_corr = []
for genre,frame in genres_df.items():
    corr = frame['production_budget'].corr(frame['worldwide_profit'])
    genres_corr.append(tuple((genre, corr)))
#     print(f"{genre}: has a corr of {corr}")

        
# Sort list of tuples
genres_corr.sort(key = lambda x: x[1], reverse=True)



# Create list for genres, and list for corrs
genres_corrx = [x[0] for x in genres_corr]
genres_corry = [x[1] for x in genres_corr]

# Slice top 5 options for clarity
genres_corrx = genres_corrx[:5]
genres_corry = genres_corry[:5]

plt.bar(genres_corrx, genres_corry)
plt.xticks(rotation=40, ha='right')
plt.xlabel('Genre')
plt.ylabel('Correlation of Budget and WWP')
plt.title('Correlation of Budget and WWP per Genre(ALL BUDGETS)');

This graph shows, for each genre, how strongly production budget correlates with worldwide profit (WWP). Because budget correlates strongly with profit in these genres, we would recommend them as good film genres to produce in general.

In [None]:
# Group low/med/high budget genres
# genres_df is dict where key = genre, value = df of movies with that genre
# Inclusive upper bounds keep films budgeted at exactly $5,000,000 or $50,000,000 in a group
genres_df_low = {x: money_genre[(money_genre.genres.str.contains(x)) &
                                (money_genre.production_budget <= 5000000)] for x in unique_genres}
genres_df_med = {x: money_genre[(money_genre.genres.str.contains(x)) &
                                (money_genre.production_budget > 5000000) &
                                (money_genre.production_budget <= 50000000)] for x in unique_genres}
genres_df_high = {x: money_genre[(money_genre.genres.str.contains(x)) &
                                (money_genre.production_budget > 50000000)] for x in unique_genres}

We created this distinction to explore whether different genres performed better at certain budget levels.

In [None]:
genres_corr_low = []
for genre,frame in genres_df_low.items():
    if len(frame.index) >= 3:
        corr = frame['production_budget'].corr(frame['worldwide_profit'])
        genres_corr_low.append(tuple((genre, corr)))
#         print(f"{genre}: {corr}")
    else:
        print(f"{genre} only had {len(frame.index)} movies.")

        
# Sort list of tuples
genres_corr_low.sort(key = lambda x: x[1], reverse=True)

# Create list for genres, and list for corrs
genres_corr_lowx = [x[0] for x in genres_corr_low]
genres_corr_lowy = [x[1] for x in genres_corr_low]

# Slice top 5 options for clarity
genres_corr_lowx = genres_corr_lowx[:5]
genres_corr_lowy = genres_corr_lowy[:5]

plt.bar(genres_corr_lowx, genres_corr_lowy)
plt.xticks(rotation=40, ha='right')
plt.xlabel('Genre')
plt.ylabel('Correlation of Budget and WWP')
plt.title('Correlation of Budget and WWP per Genre(LOW BUDGET)');

We found that budget correlated moderately well with profit for 'Music' movies, the highest correlation among low-budget genres, and we would recommend it as a strong genre in which to produce low-budget films.

In [None]:
# Create list of tuples. [0]=genre, [1]=corr
genres_corr_high = []
for genre,frame in genres_df_high.items():
    if len(frame.index) >= 3:
        corr = frame['production_budget'].corr(frame['worldwide_profit'])
        genres_corr_high.append(tuple((genre, corr)))
#         print(f"{genre}: {corr}")
    else:
        print(f"{genre} only had {len(frame.index)} movies.")
# Sort list of tuples
genres_corr_high.sort(key = lambda x: x[1], reverse=True)
# Create list for genres, and list for corrs
genres_corr_highx = [x[0] for x in genres_corr_high]
genres_corr_highy = [x[1] for x in genres_corr_high]

# Slice top 5 options for clarity
genres_corr_highx = genres_corr_highx[:5]
genres_corr_highy = genres_corr_highy[:5]
plt.bar(genres_corr_highx, genres_corr_highy)
plt.xticks(rotation=40, ha='right')
plt.xlabel('Genre')
plt.ylabel('Correlation of Budget and WWP')
plt.title('Correlation of Budget vs WWP per Genre(HIGH BUDGET)') ;

We found that, among high-budget movies, the 'War' genre had the highest correlation between production budget and worldwide profit. The correlation is very strong, and we would recommend 'War' as a strong genre for high-budget films.

We did not include a final graph of genre correlations for the mid-budget bucket, in part because the budget-to-profit correlations per genre in that category were relatively low, and also because we found earlier that mid-budget movies produce the lowest ROI on average.
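
Because a correlation computed from only a handful of movies can be unstable, it may help to view each genre's sample size next to its correlation. The following is a minimal sketch on made-up data, assuming a single DataFrame with a `genre` column (the notebook instead keeps one frame per genre in a dict). The helper name `genre_budget_profit_corr`, the `min_movies` parameter, and all values are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; these are not values from the project datasets.
toy_movies = pd.DataFrame({
    'genre': ['War', 'War', 'War', 'Music', 'Music', 'Music', 'Music', 'Western'],
    'production_budget': [100, 150, 200, 5, 10, 15, 20, 50],
    'worldwide_profit': [200, 300, 400, 10, 30, 20, 40, 5],
})

def genre_budget_profit_corr(df, min_movies=3):
    """Per-genre Pearson correlation of budget and profit, with sample size."""
    rows = []
    for genre, frame in df.groupby('genre'):
        n_movies = len(frame)
        if n_movies < min_movies:
            continue  # too few movies for a meaningful correlation
        corr = frame['production_budget'].corr(frame['worldwide_profit'])
        rows.append({'genre': genre, 'n_movies': n_movies, 'corr': corr})
    result = pd.DataFrame(rows, columns=['genre', 'n_movies', 'corr'])
    # Highest correlation first, mirroring the sort used for the bar charts
    return result.sort_values('corr', ascending=False).reset_index(drop=True)

genre_summary = genre_budget_profit_corr(toy_movies)
print(genre_summary)
```

Keeping `n_movies` in the output makes it easier to judge whether a high correlation rests on three movies or on thirty.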

In [None]:
# Count movies per actor with value_counts and keep the 19 most frequent names.
# The head(19) cutoff is intended to keep only actors who appear in at least 15 movies,
# so that each actor's average is based on a reasonable number of films.
top_reliable_actors = actors_df.value_counts('primary_name').head(19)

In [None]:
# top_actors is now list of top_reliable_actors
top_actors = top_reliable_actors.index.tolist()

In [None]:
# used worldwide_profit to find mean of each actor in top_actors
avg_profit_of_movie_per_actor = []

for actor in top_actors:
    actor_filter = actors_df.loc[actors_df['primary_name'] == actor]
    avg_profit_of_movie_per_actor.append(actor_filter.worldwide_profit.mean())

In [None]:
# new DataFrame for top actors and their relative average profit
actor_profits_df = pd.DataFrame(list(zip(top_actors, avg_profit_of_movie_per_actor)),
               columns =['top_actors', 'average_profits'])

In [None]:
# sorted DataFrame to be in descending order from highest average_profits
actor_profits_df.sort_values(by='average_profits', inplace=True, ascending=False)

In [None]:
# Count movies per director with value_counts and keep the 28 most frequent names.
# The head(28) cutoff is intended to keep only directors who appear in at least 5 movies,
# so that each director's average is based on more than one or two films.
top_reliable_directors = directors_df.value_counts('primary_name').head(28)

In [None]:
# top_directors is now list of top_reliable_directors
top_directors = top_reliable_directors.index.tolist()

In [None]:
# used worldwide_profit to find mean of each director in top_directors
avg_profit_of_movie_per_director = []

for director in top_directors:
    director_filter = directors_df.loc[directors_df['primary_name'] == director]
    avg_profit_of_movie_per_director.append(director_filter.worldwide_profit.mean())

In [None]:
# new DataFrame for top directors and their relative average profit
director_profits_df = pd.DataFrame(list(zip(top_directors, avg_profit_of_movie_per_director)),
               columns =['top_directors', 'average_profits'])

In [None]:
# sorted DataFrame to be in descending order from highest average_profits
director_profits_df.sort_values(by='average_profits', inplace=True, ascending=False)

In [None]:
# narrowed list down to top 10 for bar graph visual
director_profits_df = director_profits_df.head(10)

In [None]:
# Visualizing Potential Impact of Actor
# horizontal bar graph to legibly present all names
# formatted x ticks in millions ("mil"), no decimals, with a leading "$"
# set appropriate labels, increased font size, reversed y axis
import matplotlib.ticker as mticker  # needed for FixedLocator below

fig, ax = plt.subplots(figsize=(9, 9))

ax.set_title("Actors with Highest Potential Profit", fontsize=16)
ax.set_xlabel("Average Profit Per Movie", fontsize=14)
ax.set_ylabel("Actors", fontsize=14)

ax.barh(width='average_profits', y='top_actors', data=actor_profits_df)
ax.invert_yaxis()

# Pin the tick positions first so the custom labels stay aligned with them
ticks_loc = ax.get_xticks().tolist()
ax.xaxis.set_major_locator(mticker.FixedLocator(ticks_loc))

# Label each tick in millions of dollars, e.g. 200000000 -> "$200 mil"
xlabels = ['${:,.0f} mil'.format(tick / 1000000) for tick in ticks_loc]
ax.set_xticklabels(xlabels);

We visualized the top-performing actors based on the average worldwide profit of their films. Because these actors appear in many movies and those movies profit well on average, we feel confident that they would be worth the investment and would boost the profits of a high-budget movie.
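
The per-actor averages are built with a loop over a fixed `head(19)` cutoff. As an alternative, the same idea can be sketched with `groupby` and an explicit minimum movie count. This is only a sketch on made-up data: the helper name `average_profit_by_name`, the `min_movies` parameter, and the toy rows are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy data for illustration only; names and profits are invented.
toy_credits = pd.DataFrame({
    'primary_name': ['Actor A', 'Actor A', 'Actor A', 'Actor B', 'Actor B', 'Actor C'],
    'worldwide_profit': [100.0, 200.0, 300.0, 50.0, 150.0, 900.0],
})

def average_profit_by_name(df, min_movies):
    """Average profit per person, keeping only people with enough movies."""
    stats = df.groupby('primary_name')['worldwide_profit'].agg(
        movie_count='count',
        average_profits='mean',
    )
    reliable = stats[stats['movie_count'] >= min_movies]
    return reliable.sort_values('average_profits', ascending=False)

actor_summary = average_profit_by_name(toy_credits, min_movies=2)
print(actor_summary)
```

With an explicit threshold, the cutoff stays correct if the underlying data changes. In the toy data, "Actor C" is dropped despite the largest single profit, because one movie is too few to average.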

In [None]:
# Visualizing Potential Impact of Director
# horizontal bar graph to legibly present all names
# formatted x ticks in millions ("mil"), no decimals, with a leading "$"
# set appropriate labels, increased font size, reversed y axis
import matplotlib.ticker as mticker  # needed for FixedLocator below

fig, ax = plt.subplots(figsize=(9, 9))

ax.set_title("Directors with Highest Potential Profit", fontsize=16)
ax.set_xlabel("Average Profit Per Movie", fontsize=14)
ax.set_ylabel("Directors", fontsize=14)

ax.barh(width='average_profits', y='top_directors', data=director_profits_df)
ax.invert_yaxis()

# Pin the tick positions first so the custom labels stay aligned with them
ticks_loc = ax.get_xticks().tolist()
ax.xaxis.set_major_locator(mticker.FixedLocator(ticks_loc))

# Label each tick in millions of dollars, e.g. 200000000 -> "$200 mil"
xlabels = ['${:,.0f} mil'.format(tick / 1000000) for tick in ticks_loc]
ax.set_xticklabels(xlabels);

We visualized the top-performing directors based on the average worldwide profit of their films. Because these directors have made several movies and those movies profit well on average, we feel confident that they would be worth the investment and would boost the profits of a high-budget movie.
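
Both bar charts label the x axis in millions of dollars by fixing the tick positions and then overwriting the labels. A possible alternative is to attach a formatter function to the axis, so the labels are recomputed whenever the ticks change. The sketch below assumes made-up director names and profits, and the helper `dollars_in_millions` is hypothetical.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

def dollars_in_millions(value, pos=None):
    """Format a raw dollar amount as whole millions, with a leading "$"."""
    return '${:,.0f} mil'.format(value / 1_000_000)

# Toy data for illustration only; names and profits are invented.
names = ['Director A', 'Director B']
average_profits = [250_000_000, 1_200_000_000]

fig, ax = plt.subplots(figsize=(9, 4))
ax.barh(names, average_profits)
ax.invert_yaxis()

# FuncFormatter calls the function with (tick_value, tick_position) for every tick
ax.xaxis.set_major_formatter(mticker.FuncFormatter(dollars_in_millions))
```

This avoids calling `set_xticklabels` by hand, at the cost of one small helper function.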

We found the following answers to the business questions:
* Profits generally increase as budgets increase. Potential ROI is much greater for smaller budgets (over 100x) than for larger budgets (up to 15x).
* Top actors/actresses/directors command greater profits.
* For an unknown budget, musicals have the best correlation of budget to profits. For small/medium/large budgets, we recommend the music/history/war genres, respectively.
* Movies released from June to August, or around Thanksgiving and Christmas, have the greatest average revenue.
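
To make the contrast in the first bullet concrete, here is a small worked example. It assumes ROI is expressed as profit divided by production budget; the exact definition used earlier in the notebook is not shown in this section. The function name `roi_multiple` and the dollar figures are made up for illustration.

```python
def roi_multiple(production_budget, worldwide_gross):
    """ROI as a multiple of the budget: (gross - budget) / budget."""
    if production_budget <= 0:
        raise ValueError("production_budget must be positive")
    profit = worldwide_gross - production_budget
    return profit / production_budget

# Hypothetical low-budget hit: $1M budget, $101M gross
low_budget_roi = roi_multiple(1_000_000, 101_000_000)

# Hypothetical high-budget hit: $200M budget, $3.2B gross
high_budget_roi = roi_multiple(200_000_000, 3_200_000_000)

print(f"Low budget:  {low_budget_roi:.0f}x ROI on $100M profit")
print(f"High budget: {high_budget_roi:.0f}x ROI on $3,000M profit")
```

In this toy comparison the low-budget film has the far higher ROI multiple, while the high-budget film earns far more in absolute profit. That is the trade-off between the two halves of the first bullet.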

***
Notes:
* We found these results in our initial data analysis. We interpret these factors as having a strong impact on a movie's profit.
* We are fairly confident in our results, given the size of the data and the methods used. Some outliers remain in the data behind our conclusions; we kept them because they provide useful points of comparison for similar movies.

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***

We were able to find some strong commonalities between movies that do well, given a certain budget. We were also able to pinpoint optimal times to
release a movie, and how actors/directors contribute to a movie's success.

We were somewhat limited by the datasets, as IMDb only provided data for movies from 2010 to 2018. That window ended 4 years ago, so the data may be
slightly outdated. Considering the span of movies covered, we still have good faith in this dataset and the conclusions we reached with it.

Notes:
* We would recommend the business follow our tips to maximize profit in a new movie studio. Although some of these tips may conflict
(Dwayne Johnson in a musical?), following our general guidelines will help to maximize profit and reduce investment risk.
* This analysis may not fully solve the business problem as it is a multifaceted problem. If the studio is looking for a higher-risk investment,
we did not account for that. Our goal was to provide the most sound suggestion for profit maximization with minimal risk.
* In the future, we could use more recent data to improve our observations. We would also like to look into promotional budgets to see how those may
have affected the profit of the movie. It would also be interesting to see how to optimize a movie for various audiences with additional data.