# Introduction and Background

## Research Question
Are there specific months for movie release dates that generate more revenue in the box office? Is there a correlation between movie budgets and their success in the box office?

## Background
This is a question of interest because the film industry may be able to find some different strategies to maximize their revenue, such as the best time to release their new movies or determine how much their budget will correlate to potential revenue. With so many movies always coming out, we wondered if there was a correlation between box office success and release dates in conjunction with how big a movie’s budgets is. In some of our background research we found that the number of user votes on IMDB for a movie corresponds with how well-known the film is, and that critics’ reviews don’t have that much of an influence on the success of a movie. 

We always hear about big budget movies because they’re the ones that usually have the ability to advertise the film to the point where everyone sees it. But sometimes movies with relatively small budgets rise to the top of the the box offices. Content has something to do with it, but we want to see how much of an impact the time at which you release a movie has on the success of the film. There have been some blog posts about which genre generally gets released for each month, and there was some indication that the winter months would be when the best movies (Oscar nominated ones) come out. 

Our question is important because it presents a visualization for the correlation between box office success and release dates in conjuction with movie budgets. We want to know if there is a strong correlation and if there is a pattern to movie release dates. It is unclear if there are popular months for movie releases or box office success. 

References (include links):
- Movie Budget and Financial Performance Records (https://www.the-numbers.com/movie/budgets/) 
-- Tables that show top 20 movies that compare the movie's budget and revenue (greatest lost and earnings).
- Analysis of factor affecting the success of the movie (http://rstudio-pubs-static.s3.amazonaws.com/233939_bbeb292c0c20440f97d31b616662c06f.html)
-- Analysis between different website user voting data and the success of the movie in the box office in conjuction with the movie's budget. 
- Movie Release Calendar Strategy (https://riotstudios.com/blog/movie-release-calendar-strategy/) 
-- Months associated with genre/movie style released.

## Hypothesis
We predict that if a movie is released during July or August, then it will generate more revenue in the box office. We also predict that if a movie has a big budget, then it will generate more revenue in the box office. 

# Data Description

This is a list of ~5000 movies from The Movie Database website. This dataset contains rows of movies with columns that dictate details of the movie. For example, the list contains keyword, revenue, budget, production company, imdb id, language, runtime, release date, and ratings concerning each movie.

- Dataset Name: TMDb 5000 Movie Dataset
- Link to the dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
- Number of observations: 4803

This is a list of ~45000 movies from The Movie Database. This list contains budget, genre, original language, popularity, revenue, imdb id, adulterated, production company, homepage, collection, poster, overview, video, tagline, title, vote count, id, runtime, release date, and vote average (0-10) for each movie. 

- Dataset Name: Movies Metadata
- Link to the dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset/data
- Number of observations: 45466

We may be merging the datasets and compare if there are any duplicates, if so, we will compare the ‘revenue’, ‘budget’, and ‘release date’ and remove the duplicates. If there are any conflicting comparisons, we will manually change the data to the correct values. 

## Data Cleaning/Pre-processing

In [30]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

In [35]:
# Import datafile  tmdb_5000_movies.csv to df1

df1 = pd.read_csv('tmdb_5000_movies.csv')


In [36]:
#Import datafile movies_metadata.csv to df1
df2 = pd.read_csv('movies_metadata.csv', low_memory=False)

In [37]:
df1.drop(['homepage', 'keywords', 'overview', 'runtime', 'tagline'], axis=1, inplace=True)
df2.drop(['adult', 'belongs_to_collection', 'homepage', 'imdb_id', 'overview', 'poster_path', 'runtime', 'video', 'tagline'], axis=1, inplace=True)

In [55]:
pd.to_numeric(df1['revenue'], errors='coerce')
pd.to_numeric(df1['budget'], errors='coerce')
pd.to_numeric(df2['revenue'], errors='coerce')
pd.to_numeric(df2['budget'], errors='coerce')             
df1.dropna(subset = ["revenue","budget"],inplace=True)
df2.dropna(subset = ["revenue","budget"],inplace=True)

In [56]:
df1 = df1[df1.revenue != 0]
df2 = df2[df2.revenue != 0]
df1 = df1[df1.budget != 0]
df2 = df2[df2.budget != 0]

In [59]:
df2['revenue'].value_counts()

12000000.0     20
10000000.0     19
11000000.0     19
2000000.0      18
6000000.0      17
5000000.0      14
500000.0       13
8000000.0      13
1.0            12
14000000.0     12
7000000.0      11
1000000.0      10
20000000.0     10
3000000.0      10
1500000.0       9
3.0             9
4000000.0       9
30000000.0      8
2500000.0       8
16000000.0      8
15000000.0      8
25000000.0      8
4100000.0       8
4300000.0       7
13000000.0      7
18000000.0      7
1400000.0       6
9000000.0       6
1300000.0       6
100000000.0     6
               ..
77156.0         1
95608995.0      1
17654912.0      1
195745823.0     1
41596251.0      1
1234254.0       1
441809770.0     1
46836394.0      1
20796847.0      1
2020700.0       1
408247917.0     1
7820688.0       1
15379253.0      1
36351350.0      1
99965753.0      1
3713768.0       1
8345056.0       1
3123963.0       1
40547440.0      1
288383523.0     1
406881.0        1
1614266.0       1
1440000.0       1
933959197.0     1
267045765.

In [None]:
df = pd.concat([df1, df2], ignore_index=True)
df

In [None]:
df.drop_duplicates('id', inplace=True)
df