# Mod1 Project: Empowering Us All: To Watch the Movies We Want
### Authors: Allan Gayahan and Phoebe Wong
### Date: Jun 19, 2019


## I. Purpose of analysis & why it matters

The goal of this project is to perform exploratory data analysis (EDA) and present preliminary findings on the attributes of films that are currently performing best at the box office, in order to advise the executives of a new movie studio on which movies they should consider creating.

Based on our qualitative research on recent trends in the global box office, we identified the foreign box office market as one of the most important factors determining the performance of a studio's movie portfolio. For its movies to perform well at foreign box offices, the new studio has to establish global distribution channels (international marketing, theater supply chains, etc.), which come with significant upfront costs. Our analysis focuses on whether such an effort would be cost-effective and what kinds of movies the studio should create to improve its chances of success in the global market.


## II. Methodology

### a. Data Cleaning

Because the datasets with box office and budget data identify movies only by title, and the titles are not spelled consistently across the data sources (e.g. "The Lego Movie" vs "The LEGO Movie"), special treatment is needed to avoid either duplicating data with an outer join or losing a substantial amount of data with an inner join. When merging datasets that do not include a title ID, we used the Levenshtein score to identify pairs of title strings that most likely refer to the same movie. The merging procedure was as follows:

1. We performed an inner join with the movie title as the key, which merged the records with identically spelled movie titles.
2. We created a list of the movie titles from each dataset and subtracted the merged titles from each list to obtain the titles in each dataset that did not get merged.
3. Using the StringDist package in Python, we computed the Levenshtein score for each combination of unmatched movie titles and, using a for loop, combined the scores with the respective movie titles from each dataset into a dataframe.
4. We deleted from this dataframe the title pairs that do not represent the same movie.
5. We used the resulting dataframe, which pairs movie titles from both datasets, as a map to merge the two datasets.

With this procedure, we were able to match most of our budget data with a title ID in the IMDB datasets.
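The matching steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: it implements the Levenshtein distance by hand instead of calling the StringDist package, the titles are toy data, and the `build_title_map` helper and its `max_distance` cutoff are hypothetical (in the project, false matches were removed by inspecting the candidate pairs).

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def build_title_map(titles_a, titles_b, max_distance=3):
    """Return exact title matches plus close candidate pairs for review.

    max_distance is an assumed cutoff used only to shorten the candidate list.
    """
    set_a, set_b = set(titles_a), set(titles_b)
    exact = set_a & set_b                  # step 1: identical spellings
    unmatched_a = sorted(set_a - exact)    # step 2: leftovers in each dataset
    unmatched_b = sorted(set_b - exact)
    candidates = []
    for title_a, title_b in product(unmatched_a, unmatched_b):  # step 3
        distance = levenshtein(title_a, title_b)
        if distance <= max_distance:
            candidates.append((title_a, title_b, distance))
    candidates.sort(key=lambda pair: pair[2])
    return exact, candidates


# Toy titles, made up for illustration only.
budget_titles = ["The Lego Movie", "Inception", "Frozen"]
imdb_titles = ["The LEGO Movie", "Inception", "Coco"]

exact, candidates = build_title_map(budget_titles, imdb_titles)
print(exact)
print(candidates)
```

The candidate list still needs a manual pass (step 4) before it is used as a merge map (step 5), since a small edit distance does not guarantee that two titles refer to the same movie.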

After merging the budget data with the movie feature data, we also created multiple features for our analysis that are not in the raw datasets. By merging the dataset containing the ages of cast and crew members with the principal cast and crew of each movie, we created features that could be correlated with box office demand, such as the average age of the principal actors and actresses at the time of each movie's release, and the share of actresses among each movie's principal cast.
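A rough sketch of how such cast features could be derived with pandas is shown below. The table layouts, column names (`title_id`, `person_id`, `category`, `birth_year`, `release_year`) and values are assumptions made up for illustration; they are not the actual schema or data used in the project.

```python
import pandas as pd

# Hypothetical toy tables; column names and values are illustrative only.
principals = pd.DataFrame({
    "title_id": ["m1", "m1", "m1", "m2", "m2"],
    "person_id": ["p1", "p2", "p3", "p2", "p4"],
    "category": ["actor", "actress", "actress", "actress", "actor"],
})
people = pd.DataFrame({
    "person_id": ["p1", "p2", "p3", "p4"],
    "birth_year": [1970, 1985, 1990, 1960],
})
movies = pd.DataFrame({
    "title_id": ["m1", "m2"],
    "release_year": [2015, 2018],
})

# Attach birth year and release year to every principal actor or actress.
cast = (
    principals[principals["category"].isin(["actor", "actress"])]
    .merge(people, on="person_id", how="left")
    .merge(movies, on="title_id", how="left")
)
cast["age_at_release"] = cast["release_year"] - cast["birth_year"]
cast["is_actress"] = cast["category"].eq("actress")

# One row per movie: average cast age and share of actresses.
features = cast.groupby("title_id").agg(
    avg_cast_age=("age_at_release", "mean"),
    actress_share=("is_actress", "mean"),
)
print(features)
```

Using birth year and release year gives ages that can be off by up to a year; exact dates would be needed for a more precise figure.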

We identified some errors in the Box Office Mojo dataset (some foreign box office observations are not on the correct scale), so Box Office Mojo data is used only when TheMovieDB.org data is not available for a movie that is included in the IMDB database. From this combined budget dataset, we created the Foreign Box Office variable by subtracting the domestic box office from the worldwide box office. We also created a Profit variable, which is the difference between Worldwide Box Office and Production Budget. To better understand the demand of the global movie market, we created a dummy variable that classifies movies according to their foreign box office as a share of worldwide box office: movies with a foreign share over 50% are labeled 'Global' and those with a share of 50% or under are labeled 'Domestic'. As a sensitivity test, we also created a dummy variable based on the total number of regions in which the movie was distributed, with over 10 regions labeled 'Global' and 10 or fewer labeled 'Domestic'.
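The derived variables described above amount to a few column operations. The sketch below assumes a pandas dataframe with hypothetical column names (`production_budget`, `domestic_gross`, `worldwide_gross`, `n_regions`) and made-up values; it only illustrates the definitions, not the project's actual code or data.

```python
import pandas as pd

# Toy values in arbitrary units, made up for illustration only.
box = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "production_budget": [50.0, 20.0, 100.0],
    "domestic_gross": [40.0, 30.0, 150.0],
    "worldwide_gross": [120.0, 45.0, 300.0],
    "n_regions": [25, 4, 10],
})

# Foreign gross and profit as defined in the text.
box["foreign_gross"] = box["worldwide_gross"] - box["domestic_gross"]
box["profit"] = box["worldwide_gross"] - box["production_budget"]

# Main dummy: a foreign share strictly above 50% is 'Global'.
labels = {True: "Global", False: "Domestic"}
box["foreign_share"] = box["foreign_gross"] / box["worldwide_gross"]
box["market"] = box["foreign_share"].gt(0.5).map(labels)

# Sensitivity dummy: distribution in more than 10 regions is 'Global'.
box["market_by_regions"] = box["n_regions"].gt(10).map(labels)

print(box[["title", "foreign_gross", "profit", "market", "market_by_regions"]])
```

Note that "Movie C" sits exactly at a 50% foreign share and at 10 regions, so both rules label it 'Domestic', matching the "50% or under" and "10 or fewer" boundaries.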

### b. Analysis

To explore the box office data, we created multiple charts to visualize the relationships between our variables of interest. The table below summarizes our main findings:

| Attribute                 | Finding      |
|---------------------------|--------------|
| Target Market             | Foreign      |
| Target Genre              | Adventure    |
| Budget Recommendation     | > 12 Million |
| Actor Age for Lead Role   | ~50 yrs old  |
| Actress Age for Lead Role | ~35 yrs old  |

The focus on the foreign box office is motivated by a comparison of foreign and domestic box office trends. The time trend from 2010 to 2018 indicates that while the domestic box office has been stagnant, the foreign box office has experienced substantial growth. This trend suggests that a movie portfolio with better global distribution channels is more likely to have better overall box office performance than one that focuses on the domestic market only.


## III. Results

With so many movies and genres having sold well to date, it is a challenge to decide which type of movie should be created. As a starting point, we consider which types of movies made a 'roar' in the industry.

![Genre vs Gender](GenreGross.jpg)

As the chart shows, 'Adventure' appears to be the highest-grossing genre. The chart also breaks down each genre by the gender of the lead actor or actress, to show which genre would be the most feasible to produce given the pool of available talent. For example, if the candidates for the lead are an actress with a forte for drama and an actor with a forte for romance, creating a movie in the 'Drama' genre would be the most feasible choice.

## Actor/Actress Age

Building on the previous finding, the age of the lead can also be factored in. Below is a quick comparison of the ages of actors and actresses when they took on lead roles.

![Lead Age](LeadAge.jpg)

## Geographical Market

With insights on which type of movie to create and who should lead it, the next vital question is where the movie would sell the most.

![Geographical Gross](MovieGross.jpg)

From the chart above, we can see that the foreign market is where all types of movies sell the most. Across all genres, foreign gross consistently makes up the majority of a movie's worldwide gross. Hence, we highly recommend creating a movie with the intent of a worldwide release in order to generate foreign market 'sales'.

## Runtime and Budget

When venturing into a new industry, knowing how large the budget should be is vital information. Hence, we provide insights on budget and the corresponding profit. Runtime should also be considered, since it might affect how the movie is accepted: the audience might become bored with, or expect too much from, a lengthy movie, or 'crave' a sequel if the movie is short.

![LengthvsBudgetvsProfit](ProfitParameters.jpg)

![image.png](attachment:image.png)

Based on this chart, there are numerous choices and strategies for how large the budget should be and how long the movie should be. However, when considering a small budget, be aware that the profit tends to be small as well. The chart shows a few small-budget outliers with profits greater than the average profit of a medium-budget movie, but these are rare cases.

## IV. Recommendation

Based on the insights gathered from almost a decade's worth of movie data, our recommendation is to create a movie with a budget of at least 12 Million in the Adventure, Action, or Drama genre, with content that appeals to foreign audiences.