# Microsoft's Potential Entry Into the Video Entertainment Market


#### Authors: Jake Oddi, Mike Rozenvasser

## Business Problem

Microsoft sees potential upside in moving into the proprietary entertainment space, following Netflix and Apple, both of which have started producing their own films. Using publicly available movie and industry data, we examine which courses of action would give this strategic move the best chance of success.

## Data Understanding

In [21]:
import pandas as pd
import numpy as np

##### rt_movie_info is a dataset from Rotten Tomatoes, while kaggle_movies is the MovieLens dataset, cleaned and uploaded to Kaggle

In [24]:
import code.data_preparation as dp
import code.visualizations as vz
rt_movie_info = pd.read_csv('./data/zippedData/rt.movie_info.tsv.gz', sep='\t')
kaggle_movies = pd.read_csv('./data/kaggleData/movie_production.csv', encoding='latin1')

ModuleNotFoundError: No module named 'code.data_preparation'; 'code' is not a package

While the Rotten Tomatoes dataset has many columns, we focus only on 'director' and 'box_office'.

In [7]:
rt_movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [9]:
rt_movie_info = dp.clean_rt(rt_movie_info) # Cleaning

NameError: name 'clean_rt' is not defined

### Gathering and Plotting the Top 10% of Directors by Average Gross
<p> Our goal is to select the directors whose movies, on average, rank in the top 10% at the box office. We group by director and take each director's mean box office, then use sum() to count the directors at or above the 90th percentile; there are 27. We then group by director and mean box office again, this time sorting the values in descending order, taking the top 27, and assigning the resulting series to top_10pct_rt_directors. Finally, we use a bar plot to display the mean box office for each of these directors. </p>

In [None]:
top_10pct_rt_directors = dp.top_10pct_rt_directors(rt_movie_info)
vz.plot_top_10pct_rt_directors(top_10pct_rt_directors)  # pass the series assigned on the line above
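
The selection described above can be sketched in a few lines of pandas. This is a minimal illustration and not the body of `dp.top_10pct_rt_directors`, which is not shown in this notebook; the function name `top_decile_directors` and the toy data are assumptions, and the real frame yields 27 directors, not 1.

```python
import pandas as pd

# Toy stand-in for the cleaned Rotten Tomatoes frame (illustrative values only).
toy_rt = pd.DataFrame({
    "director": ["A", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "box_office": [100.0, 300.0, 50.0, 20.0, 10.0, 5.0,
                   80.0, 60.0, 40.0, 30.0, 500.0],
})

def top_decile_directors(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: directors whose mean box office is in the top 10%."""
    # Mean box office per director.
    mean_gross = df.groupby("director")["box_office"].mean()
    # 90th-percentile cutoff across directors.
    cutoff = mean_gross.quantile(0.90)
    # Count the directors at or above the cutoff.
    n_top = int((mean_gross >= cutoff).sum())
    # Sort from highest to lowest and keep that many rows.
    return mean_gross.sort_values(ascending=False).head(n_top)

top_directors = top_decile_directors(toy_rt)
print(top_directors)
```

Sorting in descending order before calling `head` is what makes the result the highest-grossing directors and not the lowest.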

Microsoft increases its chances of producing a successful film by hiring any of the directors whose average revenue is in the top ten percent.

The Kaggle dataset proves much more useful: most if not all irregularities have already been removed, so minimal cleaning is required.

In [10]:
kaggle_movies.head()

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986


First, we clean up the 'rating' column, consolidating the variety of ratings.

In [None]:
kaggle_movies = dp.clean_kaggle_ratings(kaggle_movies) 
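
One way to consolidate rating labels is a simple replacement map. The sketch below is not the actual `dp.clean_kaggle_ratings`; the raw labels on the left-hand side of `RATING_MAP` are assumptions about what the column might contain, and only the NOT RATED category is taken from this notebook.

```python
import pandas as pd

# Hypothetical mapping: the source labels are assumed, not read from the data.
RATING_MAP = {
    "Not specified": "NOT RATED",
    "UNRATED": "NOT RATED",
    "NR": "NOT RATED",
}

def consolidate_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with equivalent rating labels merged."""
    out = df.copy()
    # Labels missing from the map are left unchanged.
    out["rating"] = out["rating"].replace(RATING_MAP)
    return out

toy = pd.DataFrame({"rating": ["R", "UNRATED", "PG", "Not specified", "G"]})
cleaned = consolidate_ratings(toy)
print(cleaned["rating"].value_counts())
```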

### Feature Engineering

We create a 'profit' column by subtracting 'budget' from 'gross'.

In [11]:
kaggle_movies = vz.create_kaggle_profit_column(kaggle_movies)

NameError: name 'create_kaggle_profit_column' is not defined
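
For reference, the profit column is a single vectorized subtraction. The helper below is a sketch under that assumption, not the notebook's own function; `add_profit_column` is a hypothetical name, and the two sample rows are copied from the `kaggle_movies.head()` output above.

```python
import pandas as pd

def add_profit_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with profit = gross - budget."""
    out = df.copy()
    out["profit"] = out["gross"] - out["budget"]
    return out

# Two rows taken from the head() output shown earlier.
sample = pd.DataFrame({
    "name": ["Stand by Me", "Top Gun"],
    "budget": [8000000.0, 15000000.0],
    "gross": [52287414.0, 179800601.0],
})
with_profit = add_profit_column(sample)
print(with_profit[["name", "profit"]])
```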

Next, we look at which movie studios are most profitable on average, which we do by using a groupby function.
We plot the 20 most profitable studios below.

In [None]:
kaggle_studio_vs_profit = dp.kaggle_studio_vs_profit(kaggle_movies)
vz.kaggle_studio_vs_profit_barplot(kaggle_studio_vs_profit)
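
The groupby step might look like the following. This is a sketch and not the code behind `dp.kaggle_studio_vs_profit`; the helper name `mean_profit_by` and the toy studios are assumptions, while the 'company' and 'profit' column names come from the dataset above.

```python
import pandas as pd

def mean_profit_by(df: pd.DataFrame, column: str, n: int = 20) -> pd.Series:
    """Mean profit per category in `column`, highest first, limited to n rows."""
    return (
        df.groupby(column)["profit"]
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )

# Toy data with made-up studio names and profits.
toy = pd.DataFrame({
    "company": ["Studio A", "Studio A", "Studio B", "Studio C"],
    "profit": [10.0, 30.0, 50.0, -5.0],
})
top_two = mean_profit_by(toy, "company", n=2)
print(top_two)
```

Because the grouping column is a parameter, the same helper could be reused for any other categorical column, such as 'genre'.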

Now let's examine the relationship between MPAA rating and profit.

In [None]:
vz.rating_vs_num_movies_barplot(kaggle_movies)
vz.rating_vs_average_profit_barplot(kaggle_movies)

As the first plot shows, R-rated movies are the most commonly produced, while G-rated movies are much less common; the second plot, however, shows that G-rated movies are the most profitable on average. This suggests that the market for G-rated movies is undersupplied, and that Microsoft would be best positioned to do well by focusing on G-rated movies. Taking this line of thinking further, we examined whether shorter or longer movies were more profitable for each rating, focusing on the G rating.
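
The two quantities being compared, number of movies and average profit per rating, can be computed side by side with one named aggregation. The numbers below are toy values chosen to mirror the pattern described above, not figures from the dataset, and `rating_summary` is an illustrative name.

```python
import pandas as pd

# Toy data: many R-rated movies with modest profit, one profitable G-rated movie.
toy = pd.DataFrame({
    "rating": ["R", "R", "R", "PG", "PG", "G"],
    "profit": [10.0, 20.0, 30.0, 40.0, 60.0, 100.0],
})

# One row per rating: how many movies there are and how profitable they are on average.
rating_summary = toy.groupby("rating").agg(
    num_movies=("profit", "count"),
    mean_profit=("profit", "mean"),
)
print(rating_summary.sort_values("mean_profit", ascending=False))
```

Reading the two columns together is what supports the supply-versus-profit comparison: a rating with a low count and a high mean is the pattern of interest.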

In [None]:
dfs_list = dp.create_kaggle_ratings_dfs_for_subplots(kaggle_movies)
vz.ratings_subplots(dfs_list)

The built-in regression plots show a very weak positive association between runtime and profit for every rating except NOT RATED, which displays a very weak negative association.
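
To put numbers on the direction and strength of these associations, one option is to fit a least-squares line and compute a Pearson correlation for each rating. The sketch below uses `np.polyfit` on toy data with perfectly linear relationships, so the values are illustrative only; the plotting functions in this notebook may compute their fits differently.

```python
import numpy as np
import pandas as pd

def runtime_profit_fit(group: pd.DataFrame) -> tuple:
    """Return (slope, Pearson r) of profit against runtime for one rating."""
    # Degree-1 polyfit returns the coefficients as [slope, intercept].
    slope, _intercept = np.polyfit(group["runtime"], group["profit"], deg=1)
    r = group["runtime"].corr(group["profit"])
    return float(slope), float(r)

# Toy data: profit rises with runtime for G and falls for NOT RATED.
toy = pd.DataFrame({
    "rating": ["G", "G", "G", "NOT RATED", "NOT RATED", "NOT RATED"],
    "runtime": [80, 90, 100, 80, 90, 100],
    "profit": [10.0, 20.0, 30.0, 30.0, 20.0, 10.0],
})

fits = {rating: runtime_profit_fit(g) for rating, g in toy.groupby("rating")}
for rating, (slope, r) in fits.items():
    print(f"{rating}: slope={slope:.2f}, r={r:.2f}")
```

The sign of the slope gives the direction of the association, while the magnitude of r indicates how weak or strong it is.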

Next, we examined which genre of movie tends to perform the best, measured by how profitable it is.

In [None]:
vz.genre_vs_profit_barplot(kaggle_movies)

## Conclusion

* Microsoft would benefit from recruiting a director in the top 10% by average box office of the movies he or she directed.
* Microsoft would benefit from working with one of the top 20 studios by average profit.

<h4> Most importantly, Microsoft is well positioned to take advantage of the imbalance between supply and demand in the market for G-rated movies. It can maximize its returns by producing G-rated movies that are around 100 minutes long and fall in the animation and adventure genres. </h4>

## Next Steps

* Given more time, we would have liked to examine the relationship between plot keywords from each movie and their returns.
    * With more sophisticated models, this approach would be even more effective.
* Given more time, we also would have liked to make API calls and scrape social media sites to measure how popular a movie is on social media compared to its performance at the box office.