# Introduction and Background

## Research Question
Are there specific months for movie release dates that generate more revenue in the box office? Is there a correlation between movie budgets and their success in the box office?

## Background
This is a question of interest because the film industry may be able to find some different strategies to maximize their revenue, such as the best time to release their new movies or determine how much their budget will correlate to potential revenue. With so many movies always coming out, we wondered if there was a correlation between box office success and release dates in conjunction with how big a movie’s budgets is. In some of our background research we found that the number of user votes on IMDB for a movie corresponds with how well-known the film is, and that critics’ reviews don’t have that much of an influence on the success of a movie. 

We always hear about big budget movies because they’re the ones that usually have the ability to advertise the film to the point where everyone sees it. But sometimes movies with relatively small budgets rise to the top of the the box offices. Content has something to do with it, but we want to see how much of an impact the time at which you release a movie has on the success of the film. There have been some blog posts about which genre generally gets released for each month, and there was some indication that the winter months would be when the best movies (Oscar nominated ones) come out. 

Our question is important because it presents a visualization for the correlation between box office success and release dates in conjuction with movie budgets. We want to know if there is a strong correlation and if there is a pattern to movie release dates. It is unclear if there are popular months for movie releases or box office success. 

References (include links):
- Movie Budget and Financial Performance Records (https://www.the-numbers.com/movie/budgets/) 
-- Tables that show top 20 movies that compare the movie's budget and revenue (greatest lost and earnings).
- Analysis of factor affecting the success of the movie (http://rstudio-pubs-static.s3.amazonaws.com/233939_bbeb292c0c20440f97d31b616662c06f.html)
-- Analysis between different website user voting data and the success of the movie in the box office in conjuction with the movie's budget. 
- Movie Release Calendar Strategy (https://riotstudios.com/blog/movie-release-calendar-strategy/) 
-- Months associated with genre/movie style released.

## Hypothesis
We predict that if a movie is released during July or August, then it will generate more revenue in the box office. We also predict that if a movie has a big budget, then it will generate more revenue in the box office. 

# Data Description

This is a list of ~5000 movies from The Movie Database website. This dataset contains rows of movies with columns that dictate details of the movie. For example, the list contains keyword, revenue, budget, production company, imdb id, language, runtime, release date, and ratings concerning each movie.

- Dataset Name: TMDb 5000 Movie Dataset
- Link to the dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
- Number of observations: 4803

This is a list of ~45000 movies from The Movie Database. This list contains budget, genre, original language, popularity, revenue, imdb id, adulterated, production company, homepage, collection, poster, overview, video, tagline, title, vote count, id, runtime, release date, and vote average (0-10) for each movie. 

- Dataset Name: Movies Metadata
- Link to the dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset/data
- Number of observations: 45466

We may be merging the datasets and compare if there are any duplicates, if so, we will compare the ‘revenue’, ‘budget’, and ‘release date’ and remove the duplicates. If there are any conflicting comparisons, we will manually change the data to the correct values. 

## Data Cleaning/Pre-processing


In [156]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

!pip install easymoney
from easymoney.money import EasyPeasy


Collecting easymoney
  Downloading easymoney-1.5.0.tar.gz
Collecting pycountry (from easymoney)
  Downloading pycountry-18.2.23.tar.gz (9.7MB)
[K    100% |████████████████████████████████| 9.7MB 109kB/s ta 0:00:011
Collecting wbdata (from easymoney)
  Downloading wbdata-0.2.7.tar.gz
Building wheels for collected packages: easymoney, pycountry, wbdata
  Running setup.py bdist_wheel for easymoney ... [?25ldone
[?25h  Stored in directory: /Users/lixirui/Library/Caches/pip/wheels/c0/aa/d9/4cc4ce15d7722772b7c21de8bcc2fa1a2345fb4d5cd89a657b
  Running setup.py bdist_wheel for pycountry ... [?25ldone
[?25h  Stored in directory: /Users/lixirui/Library/Caches/pip/wheels/8e/41/09/9a91004166f9ed503f568eca67141648be1e263021a3d1a5d7
  Running setup.py bdist_wheel for wbdata ... [?25ldone
[?25h  Stored in directory: /Users/lixirui/Library/Caches/pip/wheels/6c/cc/78/79154effbbf225d0f06d9b6a270093fe100dd2dc7bbf74007a
Successfully built easymoney pycountry wbdata
Installing collected packages: py

In [194]:
# Import datafile  tmdb_5000_movies.csv to df1

df1 = pd.read_csv('tmdb_5000_movies.csv')

In [195]:
#Import datafile movies_metadata.csv to df2
df2 = pd.read_csv('movies_metadata.csv', low_memory=False)

In [196]:
# drop unrelated columns
df1.drop(['homepage', 'keywords', 'overview', 'runtime', 'tagline'], axis=1, inplace=True)
df2.drop(['adult', 'belongs_to_collection', 'homepage', 'imdb_id', 'overview', 'poster_path', 'runtime', 'video', 'tagline'], axis=1, inplace=True)

In [197]:
# change str to int and drop all unnemerical elements
pd.to_numeric(df1['revenue'], errors='coerce')
pd.to_numeric(df1['budget'], errors='coerce')
pd.to_numeric(df2['revenue'], errors='coerce')
pd.to_numeric(df2['budget'], errors='coerce')             
df1.dropna(subset = ["revenue","budget"],inplace=True)
df2.dropna(subset = ["revenue","budget"],inplace=True)
df2.budget = df2.budget.astype(np.int64)

In [198]:
# drop unrepresnrable/abnormal value for revenue and budget 
df1 = df1[df1.revenue >= 1000]
df2 = df2[df2.revenue >= 1000]
df1 = df1[df1.budget >= 1000]
df2 = df2[df2.budget >= 1000]

In [199]:
# Concat two dataset into one called df
df = pd.concat([df1, df2], ignore_index=True)

In [200]:
# drop repeated infomation.
df.drop_duplicates('id', inplace=True)

In [206]:
#split the release date to year month and date 
df['year'] = pd.DatetimeIndex(df['release_date']).year
df['month'] = pd.DatetimeIndex(df['release_date']).month
df['date'] = pd.DatetimeIndex(df['release_date']).day

In [205]:
#remove the movies that may be too old
df = df[df.year >= 1990]
#df['gender'].value_counts()

In [208]:
#adjust inflation rate 
ep = EasyPeasy()

In [212]:
# adjust the inflation of budget, make a new column called adjusted_budget
adjusted_budget = []
for index, row in df.iterrows():
    adjusted_budget.append(ep.normalize(amount=row['budget'], region="US", from_year=row['year'], to_year='latest',base_currency="USD", pretty_print=False))
df['adjusted_budget'] = adjusted_budget

Inflation (CPI) data for 2017 in 'United States' could not be obtained from the
International Monetary Fund database currently cached.
Falling back to 2016.
  warn(warn_msg % (year, natural_region_name, str(fall_back_year)))


In [220]:
# adjust the inflation of revenue, make a new column called adjusted_revenue
adjusted_revenue = []
for index, row in df.iterrows():
    adjusted_revenue.append(ep.normalize(amount=row['revenue'], region="US", from_year=row['year'], to_year='latest',base_currency="USD", pretty_print=False))
df['adjusted_revenue'] = adjusted_budget

Inflation (CPI) data for 2017 in 'United States' could not be obtained from the
International Monetary Fund database currently cached.
Falling back to 2016.
  warn(warn_msg % (year, natural_region_name, str(fall_back_year)))


In [221]:
df

Unnamed: 0,budget,genres,id,original_language,original_title,popularity,production_companies,production_countries,release_date,revenue,spoken_languages,status,title,vote_average,vote_count,year,month,date,adjusted_budget,adjusted_revenue
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,en,Avatar,150.438,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2.787965e+09,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Avatar,7.2,11800.0,2009,12,10,2.651370e+08,2.651370e+08
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,en,Pirates of the Caribbean: At World's End,139.083,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,9.610000e+08,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Pirates of the Caribbean: At World's End,6.9,4500.0,2007,5,19,3.472620e+08,3.472620e+08
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,en,Spectre,107.377,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,8.806746e+08,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,Spectre,6.3,4466.0,2015,10,26,2.480909e+08,2.480909e+08
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,en,The Dark Knight Rises,112.313,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1.084939e+09,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Dark Knight Rises,7.6,9106.0,2012,7,16,2.613388e+08,2.613388e+08
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,en,John Carter,43.927,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,2.841391e+08,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,John Carter,6.1,2124.0,2012,3,7,2.717923e+08,2.717923e+08
5,258000000,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 28, ""na...",559,en,Spider-Man 3,115.7,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-01,8.908716e+08,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Spider-Man 3,5.9,3576.0,2007,5,1,2.986454e+08,2.986454e+08
6,260000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",38757,en,Tangled,48.682,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2010-11-24,5.917949e+08,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Tangled,7.4,3330.0,2010,11,24,2.861742e+08,2.861742e+08
7,280000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",99861,en,Avengers: Age of Ultron,134.279,"[{""name"": ""Marvel Studios"", ""id"": 420}, {""name...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2015-04-22,1.405404e+09,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Avengers: Age of Ultron,7.3,6767.0,2015,4,22,2.835324e+08,2.835324e+08
8,250000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",767,en,Harry Potter and the Half-Blood Prince,98.8856,"[{""name"": ""Warner Bros."", ""id"": 6194}, {""name""...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2009-07-07,9.339592e+08,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Harry Potter and the Half-Blood Prince,7.4,5293.0,2009,7,7,2.796804e+08,2.796804e+08
9,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",209112,en,Batman v Superman: Dawn of Justice,155.79,"[{""name"": ""DC Comics"", ""id"": 429}, {""name"": ""A...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2016-03-23,8.732602e+08,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Batman v Superman: Dawn of Justice,5.7,7004.0,2016,3,23,2.500000e+08,2.500000e+08
