# <h1><center>Show me the money!</center></h1>

![title](img/cuba_sho_me.png)

## Team members
* Jason O'Day
* John Clark
* Marianne Pagerit
* Nicole Fejfar
_______________________________________________________________________________________________________________________________
## Project Description

* Using the Open Movie Database (OMDB) API, we analyzed the top box-office and Academy Award-winning movies of the last 40 years (1980-2019) to determine the impact of release date and genre on a movie’s commercial and critical success.
_______________________________________________________________________________________________________________________________
## Hypotheses
* Movies released in the summer months (May through July) make more money than movies released at other times of the year. 
* The majority of Oscar-winning movies are released during "Oscar Season" (November and December).
* Bonus Analysis: Review the impact of genre on box office sales and awards. We suspect that action movies make the most money and dramas win more awards.
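The two release-window hypotheses can be made concrete with a small labeling helper. This is only a sketch: `release_window` and its three labels are hypothetical names that do not appear elsewhere in the notebook, and the month ranges are the ones stated in the hypotheses.

```python
def release_window(month: int) -> str:
    """Label a release month using the windows named in the hypotheses."""
    if month not in range(1, 13):
        raise ValueError(f"month must be 1-12, got {month}")
    if 5 <= month <= 7:
        return "Summer"        # May through July
    if month >= 11:
        return "Oscar Season"  # November and December
    return "Other"


print([release_window(m) for m in (1, 5, 7, 8, 11, 12)])
```

A helper like this could be mapped over a month column to group movies by release window.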
_______________________________________________________________________________________________________________________________
## Data Acquisition
* The Numbers Website: Used to compile a dataset of the top 40 movies by domestic box office (adjusted for inflation) over the past 40 years.
* OMDB API: Used to determine the Oscar wins, ratings and genres of our movie dataset.
_______________________________________________________________________________________________________________________________
## Limitations of our dataset
* Since we only pulled the top 40 movies per year, not all Oscar winners are included in our dataset.
* Not all data was available for every movie that we ran through the OMDB API.
* We assumed that movies without awards information did not win any awards.
_______________________________________________________________________________________________________________________________
<br>

## Win a completely fictitious Netflix subscription!
* Guess the top 5 box office movies 
* Guess the 3 movies that have won the most Academy Awards ever (tied at 11 each). *Hint: One was before 1980.* <br>
**No Googling allowed!**
_______________________________________________________________________________________________________________________________

## Data Cleaning

In [None]:
# Nicole's code starts here 

In [None]:
# Import Dependencies
import pandas as pd
import numpy as np
# from config import OMB_api_key
import requests
import json
import re
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Compiled the top 40 movies by box office sales for past 40 years
# Importing "The Numbers" data
numbers_df = pd.read_csv('DataFiles/TheNumbers_Original.csv')
print(numbers_df.shape)
numbers_df.head(1)

In [None]:
# Created a month released column & added it to the dataframe
numbers_df['Domestic Release Date'] = numbers_df['Domestic Release Date'].astype('datetime64[ns]')
numbers_df['Worldwide Release Date'] = numbers_df['Worldwide Release Date'].astype('datetime64[ns]')
month = pd.DatetimeIndex(numbers_df['Domestic Release Date']).month
numbers_df.insert(3, 'Month Released (Domestic)', month)
numbers_df.head(1)

In [None]:
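# Convert columns 11 onward from currency strings to integers
# regex=False so that '$' is treated as a literal dollar sign, not a regex end-of-string anchor
numbers_df[numbers_df.columns[11:]] = numbers_df[numbers_df.columns[11:]].apply(
    lambda x: x.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
).astype(np.int64)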
numbers_df.head(1)
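In regular-expression syntax `$` is an end-of-string anchor, so stripping dollar signs only works when the pattern is read literally. The sketch below illustrates the difference with the standard library; `currency_to_int` is a hypothetical helper and the amount is made up.

```python
import re


def currency_to_int(text: str) -> int:
    """Convert a string such as '$1,234,567' to the integer 1234567."""
    return int(text.replace("$", "").replace(",", ""))


amount = "$1,234,567"  # made-up value, not taken from the dataset

# Unescaped, '$' matches only the end of the string, so nothing is removed.
unescaped = re.sub("$", "", amount)

# Escaped, the dollar sign itself is matched and removed.
escaped = re.sub(re.escape("$"), "", amount)

print(unescaped)
print(escaped)
print(currency_to_int(amount))
```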

In [None]:
# Adding available oscar count per year.
numbers_df['Total Oscars Awarded in Year'] = ''
for index, row in numbers_df.iterrows():
    year = row['Year Released (Domestic)']
    if year == 1980:
        numbers_df.loc[index, 'Total Oscars Awarded in Year'] = 22
    elif year in range(1981,1995) or year == 1999:
        numbers_df.loc[index, 'Total Oscars Awarded in Year'] = 23
    elif year in range(2001,2020):
        numbers_df.loc[index, 'Total Oscars Awarded in Year'] = 25
    else:
        numbers_df.loc[index, 'Total Oscars Awarded in Year'] = 24
numbers_df.head(1)
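The same year-to-count rule can be written as a small function, which avoids the row-by-row loop. This is a sketch only: `oscars_awarded` is a hypothetical name, and the counts are copied from the loop above rather than checked against an outside source.

```python
def oscars_awarded(year: int) -> int:
    """Return the award count the loop above assigns to a release year."""
    if year == 1980:
        return 22
    if 1981 <= year <= 1994 or year == 1999:
        return 23
    if 2001 <= year <= 2019:
        return 25
    return 24  # every other year, e.g. 1995-1998 and 2000


print({y: oscars_awarded(y) for y in (1980, 1994, 1995, 1999, 2000, 2019)})
```

Assuming the column names used above, it could be applied with `numbers_df['Year Released (Domestic)'].map(oscars_awarded)`.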

In [None]:
# Limit dataset to 40 movies per year, part 1 - Sort dataframe by release year & revenue
numbers_df = numbers_df.sort_values(['Year Released (Domestic)', 'Infl. Adj. Dom. Box Office'],
                                    ascending = [True, False])
numbers_df = numbers_df.reset_index(drop = True)

In [None]:
# Limit dataset to 40 movies per year, part 2 - Create an 'index' based on release year & sales
numbers_df['Year Index'] = ''
year_compare = 1980
count = 0
for index, row in numbers_df.iterrows():
    year = row['Year Released (Domestic)']
    if year == year_compare:
        count += 1
        numbers_df.loc[index, 'Year Index'] = count
    else:
        count = 1
        numbers_df.loc[index, 'Year Index'] = count
        year_compare += 1
print(numbers_df.shape)
numbers_df[46:51]
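The loop above relies on every year from 1980 onward being present with no gaps. A per-group counter does not need that assumption. The sketch below uses `groupby(...).cumcount()` on made-up rows; the column names `Year` and `Gross` are placeholders, not the notebook's columns.

```python
import pandas as pd

# Made-up rows; 1981 is deliberately missing to show that gaps are harmless.
toy = pd.DataFrame({
    "Year": [1980, 1980, 1982, 1982, 1982],
    "Gross": [50, 90, 10, 30, 20],
})

# Sort by year, then by revenue from highest to lowest.
toy = toy.sort_values(["Year", "Gross"], ascending=[True, False]).reset_index(drop=True)

# cumcount() numbers the rows within each year starting at 0, so add 1.
toy["Year Index"] = toy.groupby("Year").cumcount() + 1
print(toy)
```

Filtering on `toy["Year Index"] <= 40` would then keep the top 40 rows of each year.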

In [None]:
# Limit dataset to 40 movies per year, part 3 - Drop all movies w/'index' > 40
top_40_df = numbers_df.loc[(numbers_df['Year Index'] <=40), ['Title', 'Domestic Release Date',
                                                             'Year Released (Domestic)',
                                                             'Month Released (Domestic)',
                                                             'Infl. Adj. Dom. Box Office',
                                                             'Domestic Box Office',
                                                             'Genre', 'Theatrical Distributor',
                                                             'Total Oscars Awarded in Year']]
top_40_df = top_40_df.sort_values('Infl. Adj. Dom. Box Office', ascending = False)
top_40_df = top_40_df.reset_index(drop = True)
top_40_df.to_csv('DataFiles/TheNumbers_Cleaned.csv')
print(top_40_df.shape)

In [None]:
# After the first API run, we discovered that movie titles did not always match between the two data sets
# We attempted to resolve some of the mismatches by replacing offending characters (apostrophes, colons, "Ep. xxx:")
# It turns out that fixing some titles broke others :-(

# Creating new 'Query_Title' column
top_40_df.insert(1, 'Query_Title', top_40_df['Title'])

# Replacing characters in the new column
# regex=False so that the "." in "Ep." is matched literally rather than as "any character"
top_40_df['Query_Title'] = (top_40_df['Query_Title']
                            .str.replace(':', '', regex=False)
                            .str.replace('Ep.', 'Episode', regex=False))
top_40_df[144:146]
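One way a replacement like `"Ep."` can fix some titles and break others is regular-expression matching, where `.` stands for any character. The sketch below shows the effect with the standard library; the two titles are illustrative strings, not rows copied from the dataset.

```python
import re

titles = ["Star Wars Ep. I: The Phantom Menace", "Epic"]  # illustrative only

# As a regular expression, '.' matches any character, so "Ep." also matches "Epi".
as_regex = [re.sub("Ep.", "Episode", t) for t in titles]

# As a literal string, only the exact text "Ep." is replaced.
as_literal = [t.replace("Ep.", "Episode") for t in titles]

print(as_regex)
print(as_literal)
```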

## API Requests

In [None]:
# Created dataframe to hold API request data
omdb_successes_df = top_40_df.copy()
omdb_successes_df['Awards'] = ''
omdb_successes_df['Metascore'] = ''
omdb_successes_df['IMDB'] = ''
omdb_successes_df['Rotten Tomatoes'] = ''
omdb_successes_df['Rated'] = ''
omdb_successes_df['Director'] = ''
omdb_successes_df['Runtime'] = ''
omdb_successes_df['Country'] = ''
omdb_successes_df.tail(1)

In [None]:
# # Initial API request, dropping unsuccessful calls from dataframe

# params = {"type": "movie", "apikey": OMB_api_key}
# url = "http://www.omdbapi.com/?t=&y="
# count = 0
# for index, row in omdb_successes_df.iterrows():
#     params["t"] = row["Query_Title"]
#     params["y"] = row["Year Released (Domestic)"]
#     response = requests.get(url, params).json()
#     if response['Response'] == 'True':
#         try:
#             omdb_successes_df.loc[index, 'Awards'] = response['Awards']
#             omdb_successes_df.loc[index, 'Metascore'] = response['Metascore']
#             omdb_successes_df.loc[index, 'IMDB'] = response['imdbRating']
#             omdb_successes_df.loc[index, 'Rated'] = response['Rated']
#             omdb_successes_df.loc[index, 'Director'] = response['Director']
#             omdb_successes_df.loc[index, 'Runtime'] = response['Runtime']
#             omdb_successes_df.loc[index, 'Country'] = response['Country']
#             omdb_successes_df.loc[index, 'Rotten Tomatoes'] = response['Ratings'][1]['Value']
#         except:
#             omdb_successes_df = omdb_successes_df.drop(count)
#             print(f'{row.Query_Title.upper()} (row {count}) has missing data')
#     else:
#         print(f'{row.Query_Title.upper()} (row {count}) was not found')
#         omdb_successes_df = omdb_successes_df.drop(count)
#     count += 1
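The request loop reads the Rotten Tomatoes score from position 1 of the `Ratings` list, so a response with fewer ratings is treated as missing data. A possible alternative is to look the score up by source name. The sketch below assumes each entry in `Ratings` has `Source` and `Value` keys; `extract_fields` is a hypothetical helper, the sample response is hand-written, and no request is sent.

```python
def extract_fields(response: dict) -> dict:
    """Pull the fields the notebook stores from one OMDB JSON response."""
    # Index the ratings by source name instead of relying on list position.
    ratings = {r["Source"]: r["Value"] for r in response.get("Ratings", [])}
    return {
        "Awards": response.get("Awards"),
        "Metascore": response.get("Metascore"),
        "IMDB": response.get("imdbRating"),
        "Rotten Tomatoes": ratings.get("Rotten Tomatoes"),  # None when absent
        "Rated": response.get("Rated"),
        "Director": response.get("Director"),
        "Runtime": response.get("Runtime"),
        "Country": response.get("Country"),
    }


# Hand-written stand-in for a response that lacks a Rotten Tomatoes rating.
sample = {
    "Response": "True",
    "Awards": "N/A",
    "imdbRating": "7.0",
    "Ratings": [{"Source": "Internet Movie Database", "Value": "7.0/10"}],
}
fields = extract_fields(sample)
print(fields)
```

Unlike the loop above, this keeps a movie with a missing rating and records `None` for it, which is a different design choice from dropping the row.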

<h2><center>Still 83 API failures! The data manipulation continues...</center></h2>

![title](img/show_me.png)

In [None]:
# # Writing successes to file
# omdb_successes_df.to_csv('DataFiles/OMDB_Successes.csv')

In [None]:
# Loading in Successes
omdb_successes_df = pd.read_csv('DataFiles/OMDB_Successes.csv')
print(omdb_successes_df.shape)
omdb_successes_df.tail(1)

In [None]:
# Create dataframe of the failed API calls
omdb_failures_df = top_40_df[~top_40_df['Title'].isin(omdb_successes_df['Title'])]
omdb_failures_df = omdb_failures_df.reset_index()
print(omdb_failures_df.shape)
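Matching on `Title` alone treats two movies with the same title as one, which matters when a movie was remade within the 40-year window. The sketch below shows an anti-join on title and year using `merge` with `indicator=True`; the frames and the column name `Year` are made up for illustration.

```python
import pandas as pd

# Made-up data: title "A" appears twice, in two different years.
all_movies = pd.DataFrame({"Title": ["A", "A", "B"], "Year": [1984, 2010, 1999]})
found = pd.DataFrame({"Title": ["A"], "Year": [2010]})

# indicator=True adds a "_merge" column saying where each row was found.
merged = all_movies.merge(found, on=["Title", "Year"], how="left", indicator=True)

# Rows marked "left_only" exist in all_movies but not in found.
missing = merged.loc[merged["_merge"] == "left_only", ["Title", "Year"]]
missing = missing.reset_index(drop=True)
print(missing)
```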

In [None]:
# Add necessary columns
omdb_failures_df = omdb_failures_df.copy()
omdb_failures_df['Awards'] = ''
omdb_failures_df['Metascore'] = ''
omdb_failures_df['IMDB'] = ''
omdb_failures_df['Rotten Tomatoes'] = ''
omdb_failures_df['Rated'] = ''
omdb_failures_df['Director'] = ''
omdb_failures_df['Runtime'] = ''
omdb_failures_df['Country'] = ''

In [None]:
# Overwrite movie titles of failed calls & re-run API until we captured as many as possible
omdb_failures_df.at[0,'Query_Title'] = "DEAD MAN'S"
omdb_failures_df.at[1,'Query_Title'] = 'THE RISE OF SKYWALKER'
omdb_failures_df.at[2,'Query_Title'] = 'THE CURSE OF'
omdb_failures_df.at[3,'Query_Title'] = 'THE CHRONICLES OF NARNIA'
omdb_failures_df.at[4,"Query_Title"] = "PIRATES OF THE CARIBBEAN AT WORLD'S END"
omdb_failures_df.at[5,'Query_Title'] = 'HARRY POTTER AND THE ORDER OF THE PHOENIX'
omdb_failures_df.at[6,'Query_Title'] = 'THREE MEN AND A BABY'
omdb_failures_df.at[7,'Query_Title'] = 'MISSION IMPOSSIBLE II'
omdb_failures_df.at[8,'Query_Title'] = '9 To 5'
omdb_failures_df.at[9,'Query_Title'] = 'X-MEN'
omdb_failures_df.at[10,'Query_Title'] = 'MEN IN BLACK'
omdb_failures_df.at[11,'Query_Title'] = 'THE HOBBIT'
omdb_failures_df.at[12,'Query_Title'] = 'DUMB AND DUMBER'
omdb_failures_df.at[13,'Query_Title'] = 'THE GRINCH'
omdb_failures_df.at[14,'Query_Title'] = 'FAST & FURIOUS 6'
omdb_failures_df.at[15,'Query_Title'] = 'MR & MRS SMITH'
omdb_failures_df.at[16,'Query_Title'] = 'THE LORAX'
omdb_failures_df.at[17,'Query_Title'] = 'CROCODILE DUNDEE II'
omdb_failures_df.at[18,'Query_Title'] = 'INTERVIEW WITH THE VAMPIRE'
omdb_failures_df.at[19,'Query_Title'] = 'NIGHT AT THE MUSEUM BATTLE'
omdb_failures_df.at[20,'Query_Title'] = 'SPIDER-MAN INTO THE SPIDER-VERSE'
omdb_failures_df.at[21,'Query_Title'] = 'tt0089050'
omdb_failures_df.at[22,'Query_Title'] = 'DEAD MEN TELL NO TALES'
omdb_failures_df.at[23,'Query_Title'] = 'RISE OF THE SILVER SURFER'
omdb_failures_df.at[24,'Query_Title'] = 'FROM THE FILES OF POLICE SQUAD'
omdb_failures_df.at[25,'Query_Title'] = "A SERIES OF UNFORTUNATE EVENTS"
omdb_failures_df.at[26,'Query_Title'] = 'DODGEBALL'
omdb_failures_df.at[27,'Query_Title'] = 'A CHRISTMAS CAROL'
omdb_failures_df.at[28,'Query_Title'] = 'X-FILES'
omdb_failures_df.at[29,'Query_Title'] = 'FANTASTIC BEASTS'
omdb_failures_df.at[30,'Query_Title'] = 'I NOW PRONOUNCE YOU CHUCK & LARRY'
omdb_failures_df.at[31,'Query_Title'] = 'THREE MEN AND A LITTLE LADY'
omdb_failures_df.at[32,'Query_Title'] = 'tt0087355'
omdb_failures_df.at[33,'Query_Title'] = "CHEECH AND CHONG'S NEXT MOVIE"
omdb_failures_df.at[34,'Query_Title'] = 'INSURGENT'
omdb_failures_df.at[35,'Query_Title'] = 'LEGALLY BLONDE 2'
omdb_failures_df.at[36,'Query_Title'] = 'tt0113676'
omdb_failures_df.at[37,'Query_Title'] = 'ISLAND OF LOST DREAMS'
omdb_failures_df.at[38,'Query_Title'] = 'BLADE II'
omdb_failures_df.at[39,'Query_Title'] = 'ARTIFICIAL INTELLIGENCE'
omdb_failures_df.at[40,'Query_Title'] = 'THE HANGOVER PART III'
omdb_failures_df.at[41,'Query_Title'] = 'SPONGEBOB SQUAREPANTS MOVIE'
omdb_failures_df.at[42,'Query_Title'] = 'GREYSTOKE'
omdb_failures_df.at[43,'Query_Title'] = 'MOUSEHUNT'
omdb_failures_df.at[44,'Query_Title'] = 'NICE DREAMS'
omdb_failures_df.at[45,'Query_Title'] = 'tt0098484'
omdb_failures_df.at[46,'Query_Title'] = 'MAMMA MIA! HERE WE GO AGAIN'
omdb_failures_df.at[47,'Query_Title'] = 'EPIC'
omdb_failures_df.at[48,'Query_Title'] = 'FANTASIA 2000'
omdb_failures_df.at[49,'Query_Title'] = 'tt0370263'
omdb_failures_df.at[50,'Query_Title'] = 'FORD V FERRARI'
omdb_failures_df.at[51,'Query_Title'] = 'COWBOYS & ALIENS'
omdb_failures_df.at[52,'Query_Title'] = 'HIGH SCHOOL MUSICAL 3'
omdb_failures_df.at[53,'Query_Title'] = 'GNOMEO & JULIET'
omdb_failures_df.at[54,'Query_Title'] = 'FRIDAY THE 13TH PART III'
omdb_failures_df.at[55,'Query_Title'] = 'GARFIELD'
omdb_failures_df.at[56,'Query_Title'] = "MARCH OF THE PENGUINS"
omdb_failures_df.at[57,'Query_Title'] = 'A NIGHTMARE ON ELM STREET 4 THE DREAM MASTER'
omdb_failures_df.at[58,'Query_Title'] = 'DIVINE SECRETS OF THE YA-YA SISTERHOOD'
omdb_failures_df.at[59,'Query_Title'] = 'THE CONJURING 2'
omdb_failures_df.at[60,'Query_Title'] = 'AUSTIN POWERS INTERNATIONAL MAN OF MYSTERY'
omdb_failures_df.at[61,'Query_Title'] = 'Halloween H20: 20 Years Later'
omdb_failures_df.at[62,'Query_Title'] = 'KILL BILL Vol. 1'
omdb_failures_df.at[63,'Query_Title'] = 'PRINCE OF PERSIA: THE SANDS OF TIME'
omdb_failures_df.at[64,'Query_Title'] = 'A NIGHTMARE ON ELM STREET 3'
omdb_failures_df.at[65,'Query_Title'] = 'PERCY JACKSON & THE OLYMPIANS'
omdb_failures_df.at[66,'Query_Title'] = 'BARNYARD'
omdb_failures_df.at[67,'Query_Title'] = 'PLANES'
omdb_failures_df.at[68,'Query_Title'] = 'CITY SLICKERS II'
omdb_failures_df.at[69,'Query_Title'] = "DON'T BREATHE"
omdb_failures_df.at[70,'Query_Title'] = 'JOHN WICK CHAPTER 2'
omdb_failures_df.at[71,'Query_Title'] = 'FRIDAY THE 13TH: THE FINAL CHAPTER'
omdb_failures_df.at[72,'Query_Title'] = 'tt0080919'
omdb_failures_df.at[73,'Query_Title'] = "CAN'T BUY ME LOVE"
omdb_failures_df.at[74,'Query_Title'] = "A MADEA FAMILY FUNERAL"
omdb_failures_df.at[75,'Query_Title'] = 'tt0086352'
omdb_failures_df.at[76,'Query_Title'] = 'QUEST FOR FIRE'
omdb_failures_df.at[77,'Query_Title'] = 'tt0083628'
omdb_failures_df.at[78,'Query_Title'] = 'tt0081760'
omdb_failures_df.at[79,'Query_Title'] = 'tt0088885'
omdb_failures_df.at[80,'Query_Title'] = 'tt0081439'
omdb_failures_df.at[81,'Query_Title'] = 'FRIDAY THE 13TH A NEW BEGINNING'
omdb_failures_df.at[82,'Query_Title'] = "A NIGHTMARE ON ELM STREET 2 FREDDY'S REVENGE"

In [None]:
# copying query titles to new df for eventual 3rd run
omdb_failures2_df = omdb_failures_df.copy()

In [None]:
# # API calls of failures with new title overrides, dropping successful calls from dataframe

# params = {"type": "movie", "apikey": OMB_api_key}
# url = "http://www.omdbapi.com/?t=&y="
# count = 0
# for index, row in omdb_failures_df.iterrows():
#     params["t"] = row["Query_Title"]
#     params["y"] = row["Year Released (Domestic)"]
#     response = requests.get(url, params).json()
#     if response['Response'] == 'True':
#         omdb_failures2_df = omdb_failures2_df.drop(count)
#         try:
#             omdb_failures_df.loc[index, 'Awards'] = response['Awards']
#             omdb_failures_df.loc[index, 'Metascore'] = response['Metascore']
#             omdb_failures_df.loc[index, 'IMDB'] = response['imdbRating']
#             omdb_failures_df.loc[index, 'Rated'] = response['Rated']
#             omdb_failures_df.loc[index, 'Director'] = response['Director']
#             omdb_failures_df.loc[index, 'Runtime'] = response['Runtime']
#             omdb_failures_df.loc[index, 'Country'] = response['Country']
#             omdb_failures_df.loc[index, 'Rotten Tomatoes'] = response['Ratings'][1]['Value']
#         except:
#             omdb_failures_df = omdb_failures_df.drop(count)
#             print(f'{row.Query_Title} has missing data')
#     else:
#         print(f'{row.Query_Title} was not found')
#         omdb_failures_df = omdb_failures_df.drop(count)
#     count += 1

In [None]:
# # Writing failures to file
# omdb_failures_df.to_csv('DataFiles/OMDB_Failures.csv')

In [None]:
# Loading in failures
omdb_failures_df = pd.read_csv('DataFiles/OMDB_Failures.csv')

In [None]:
# Resetting index for next API run
omdb_failures2_df = omdb_failures2_df.reset_index(drop=True)

In [None]:
# Running second set of failures on IMDB id since we couldn't get a match on title

# for index, row in omdb_failures2_df.iterrows():
#     imdb = row["Query_Title"]
#     url = f'http://www.omdbapi.com/?i={imdb}&apikey={OMB_api_key}'
#     response = requests.get(url).json()
#     if response['Response'] == 'True':
#         try:
#             omdb_failures2_df.loc[index, 'Awards'] = response['Awards']
#             omdb_failures2_df.loc[index, 'Metascore'] = response['Metascore']
#             omdb_failures2_df.loc[index, 'IMDB'] = response['imdbRating']
#             omdb_failures2_df.loc[index, 'Rated'] = response['Rated']
#             omdb_failures2_df.loc[index, 'Director'] = response['Director']
#             omdb_failures2_df.loc[index, 'Runtime'] = response['Runtime']
#             omdb_failures2_df.loc[index, 'Country'] = response['Country']
#             omdb_failures2_df.loc[index, 'Rotten Tomatoes'] = response['Ratings'][1]['Value']
#         except:
#             print(f'{row.Query_Title} has missing data')
#     else:
#         print(f'{row.Query_Title} was not found')
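Because the `Query_Title` overrides mix titles with IMDb ids such as `tt0089050`, the choice between the title parameters (`t`, `y`) and the id parameter (`i`) can be made per row. This is a minimal sketch: `build_params` is a hypothetical helper, `DEMO_KEY` is a placeholder key, and the rule "`tt` followed by digits means an IMDb id" is an assumption based on the overrides listed above.

```python
import re
from typing import Optional


def build_params(query: str, year: Optional[int], api_key: str) -> dict:
    """Build OMDB query parameters from a title or an IMDb id."""
    if re.fullmatch(r"tt\d+", query):
        # Looks like an IMDb id: query by id, as in the cell above.
        return {"i": query, "apikey": api_key}
    # Otherwise query by title and release year, as in the earlier runs.
    return {"t": query, "y": year, "type": "movie", "apikey": api_key}


by_id = build_params("tt0089050", None, "DEMO_KEY")
by_title = build_params("THE LORAX", 2012, "DEMO_KEY")
print(by_id)
print(by_title)
```

The resulting dictionary could be passed as `params=` to `requests.get("http://www.omdbapi.com/", ...)`, which would allow a single retry loop to cover both kinds of override.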

In [None]:
# # Writing failures to file
# omdb_failures2_df.to_csv('DataFiles/OMDB_Failures2.csv')

In [None]:
# Loading in next set of failures
omdb_failures2_df = pd.read_csv('DataFiles/OMDB_Failures2.csv')
print(omdb_failures2_df.shape)

In [None]:
# Concatenating successes & failures dataframes
frames = [omdb_successes_df, omdb_failures_df, omdb_failures2_df]
OMDB_Final_df = pd.concat(frames)
print(OMDB_Final_df.shape)
OMDB_Final_df.tail(2)

In [None]:
# Save to csv
OMDB_Final_df.to_csv('DataFiles/OMDB_Final.csv')

In [None]:
# Nicole's code ends here 

In [None]:
# marianne's code starts here

In [None]:
# In the Awards column, the description for Oscar winners begins with the word "Won"
# Flag Oscar winners by searching for 'Won'; rows with no awards text are treated as 'No'
won_mask = OMDB_Final_df['Awards'].str.contains('Won', regex=False, na=False)
OMDB_Final_df['Oscars Won'] = np.where(won_mask, 'Yes', 'No')

print(OMDB_Final_df.shape)
OMDB_Final_df.tail(2)

In [None]:
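# Filter down to movies that have won an Oscar
# .copy() gives an independent dataframe, so the later column assignments do not act on a slice
oscar_df = OMDB_Final_df.loc[OMDB_Final_df['Oscars Won'] == 'Yes'].copy()
oscar_df.reset_index(drop=True, inplace=True)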
print(oscar_df.shape)
oscar_df.tail(2)

In [None]:
# Pull the number of Oscars won (the first number in the Awards text) and save it in a new column
oscar_df['Number Oscars Won'] = oscar_df['Awards'].str.extract(r'(\d+)', expand=False)

oscar_df.tail(2)
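Searching for `'Won'` and taking the first number works when every awards string that starts with "Won" refers to Oscars. If some strings name other prizes (an assumption we have not verified against the data), a stricter pattern that requires the word "Oscar" may be safer. In the sketch below, `oscars_won` is a hypothetical helper and the sample strings are hand-written rather than real API output.

```python
import re

# Matches "Won 1 Oscar" or "Won 3 Oscars" and captures the number.
OSCAR_WIN = re.compile(r"Won (\d+) Oscars?")


def oscars_won(awards) -> int:
    """Return the number of Oscars named in an awards string, or 0."""
    if not isinstance(awards, str):
        return 0  # missing value such as None or NaN
    match = OSCAR_WIN.search(awards)
    return int(match.group(1)) if match else 0


samples = [
    "Won 3 Oscars. Another 10 wins & 5 nominations.",
    "Won 1 Golden Globe. Another 2 wins.",
    "Nominated for 2 Oscars. Another 4 wins.",
    None,
]
counts = [oscars_won(s) for s in samples]
print(counts)
```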

In [None]:
#drop the Awards column from the oscar_df to avoid duplicating in the final dataframe
oscar_df = oscar_df[['Title', 'Year Released (Domestic)', 'Number Oscars Won']]

OMDB_Final_df = pd.merge(OMDB_Final_df, oscar_df, 
                      how="left", on=['Title', 'Year Released (Domestic)'])

In [None]:
# Replace all NaNs in the 'Number Oscars Won' column with zero and store the counts as integers
OMDB_Final_df['Number Oscars Won'] = OMDB_Final_df['Number Oscars Won'].fillna(0).astype(int)

# Replace all NaNs in the 'Oscars Won' column with 'No'
OMDB_Final_df['Oscars Won'] = OMDB_Final_df['Oscars Won'].fillna('No')

In [None]:
print(OMDB_Final_df.shape)
OMDB_Final_df.tail(2)

In [None]:
# marianne's code ends here

In [None]:
# Nicole doing some final tidying

### Then we discovered that our dataframe contained duplicates with incorrect data, because some movies were remade during the 40-year period. So we ran the API calls again, this time using the year parameter.
### Once that was complete, we removed extraneous columns and saved our final dataframe for use in our analysis notebook.

In [None]:
# Re-sort & Re-Index
OMDB_Final_df = OMDB_Final_df.sort_values('Infl. Adj. Dom. Box Office', ascending = False)
OMDB_Final_df = OMDB_Final_df.reset_index(drop=True)

In [None]:
# Remove unnecessary columns
FINAL_CLEANED_DF = OMDB_Final_df[['Title', 'Domestic Release Date',
       'Year Released (Domestic)', 'Month Released (Domestic)',
       'Infl. Adj. Dom. Box Office', 'Domestic Box Office', 'Genre', 'Oscars Won', 'Number Oscars Won',
       'Total Oscars Awarded in Year', 'Awards',
       'Metascore', 'IMDB', 'Rotten Tomatoes', 'Rated', 'Director', 'Runtime',
       'Theatrical Distributor', 'Country']]

In [None]:
# Save to csv
FINAL_CLEANED_DF.to_csv('DataFiles/FINAL_CLEANED_DF.csv', index=False)

In [None]:
print(FINAL_CLEANED_DF.shape)
FINAL_CLEANED_DF.tail(2)

In [None]:
# Nicole's tidying complete

<h2><center>Celebrate Good Times, C'mon!!</center></h2>

![title](img/jm_kid.png)