# Project 1 - Analysis on New York Times Movie Reviews and Box Office Prediction

In [1]:
import pandas as pd
import numpy as np
import json
import math
import re
from textblob import TextBlob
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')

from IPython.display import HTML

HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
.MathJax_Display { 
    text-align: left !important;
    display: inline !important;
    margin-left: 300px!important;
}
</style>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

**Summary:** The film industry has grown immensely over the past few decades generating billions of dollars of revenue for the stakeholders. It would be very helpful for movie producers and directors to have some reference of profitability, or chance of receiving positive comments or awards based on past histories. Motivating from these interests, in this project, the movie reviews by the New York Times (NYT) and box offices of the movies in the past four years have been analyzed. The analysis tried to answer 1) what characteristics of movies affect NYT's choice to write a general review or listed as 'Crisic's Pick'? 2) does NTY reviews influence the box office? and 3) what characteristics of movies affect revenue? Review information from NYT Movie Reviews and movie information from movie databases such as OMDB, IMDB and the Numbers were extracted, cleaned and analyzed. Analysis result indicates that ...
# [ TODO... ]

Student Name: Jing Li

Student Number: 1004966174

Date: 12 Nov 2018

# I. Introduction

The film industry has grown immensely over the past few decades generating billions of dollars of revenue for the stakeholders. It would be very helpful for the movie producers and directors if there is a prediction system to assess the box office of new movies. It would increase the chance of profitablity and box office success when movie producers and directors need to make decision when making movie.

Movies are one of the most common entertainment for people nowadays. With easy access to Internet, audiences like to give comments to movies. This leads to the populairty of many movie review websites, such as IMDB, Rotten Tomatoes, etc. Besides the review websites, traditional media is still contributing on the movie reviews nowadays. For example, New York Times hosted a section on their website for movie reviews. Thoses scores and reviews could influence people's decision on whether to go and watch a movie.

For NYT movie reviews, critics often give their reviews a few days before the film is released. They comment on different aspects of the movies and indicate a 'Critic's Pick' if the movie is highly recommended to watch. People would like to go thearter to watch a high recommomded movie, and avoid to spend money on badly scored movies.

Apart from the critics' reviews, there are various factors that impact the box office, i.e. genre, story plot, director, celebreties in the cast. Nevertheless, timing of the release could affect box office, too. It is common to see more crowded thearter during December and other holiday seasons.

In this project, the movie reviews by the New York Times (NYT) and box offices of the movies in the past five years have been analyzed. Three main questions were discussed and presented in this report:

**1. What characteristics of movies affect NYT's choice to write a general review or listed as 'Crisic's Pick'?** To be more specific, what types of the movies appear more frequently in NTY reviews? Do writers have preferred types when selecting 'Critic's Pick'?

**2. Does NTY reviews influence the box office?** Is there any different on the impact between general review articles and those marked as 'Critic's Pick'? Does the sentiment from NYT articles affected the box office?

**3. What characteristics of movies affect revenue?** Can we forecast the movie box office?

The analysis is based on the movie reviews and other movie data for the past 5 years (2014 to 2018). There were no restrictions on the genre or any other characteristics on the movies that were analyzed. Detailed scope of the analysis will be described in next section.

The remainder of the report is arranged as follows:

## TODO

For the work has been done in the study, I orgainzed the report in this way: First introduced data collection and data cleaning based on study objectives. Followed by second part on Insights Exploration from collected data. Subsequently, a prediction model was built and tested by using Neural Network algorithm. Last part discussed on the limitations and conclusion of this study and further improvement for further research and study.


# II. Methods

## 2.1 Data Collection

The review data was retrieved through NTY Open API. The API responds with list of review articles in the order of publication date. Making use of the API, movie reviews that were published from Sep 2013 to Oct 2018 were collected. The data was then filtered by movie open date between 01 Jan 2014 to 31 Oct 2018. The collected data included short summary on the review, whether the article is a 'Critics' Pick', and other movie infomations such as movie title, MPAA rating, movie open date, etc.

The other set of data contains more detailed information for all the movies released over the past 5 years. Based on the search from IMDB.com, there were 29,442 movies (restricted only to those screened in theaters) that were released in this 5 years. Through OMDB open API, the movie data such as movie actors, director, box office, ratings from IMDB, Metacritics, Rotten Tomatoes, awards, etc. is extracted. Addtionally, we extracted movie budget data from IMDB website.


![Image of table](https://raw.githubusercontent.com/jingliuoft/Stats/master/335351542170361_.pic_hd.jpg)

In [2]:
nytdata = pd.read_csv('https://raw.githubusercontent.com/jingliuoft/Stats/master/nyt_reviews.csv')
omdbdata = pd.read_csv('https://raw.githubusercontent.com/jingliuoft/Stats/master/omdb.csv')
imdbdata = pd.read_csv('https://raw.githubusercontent.com/jingliuoft/Stats/master/imdb_movies.csv')

## 2.2 Data Clearning and Integration
Different data clean process were carried out for each dataset.

Im OMDB dataset, there were mainly three attributes that requires cleaning:
1. 'BoxOffice' was extracted and unified to USD;
2. 'Ratings' was in JSON string format, e.g. `[{'Source': 'Internet Movie Database', 'Value': '4.4/10'}, {'Source': 'Rotten Tomatoes', 'Value': '24%'}, {'Source': 'Metacritic', 'Value': '30/100'}]`, and could be extracted to IMDB rating, Rotten Tomatoes rating and Metacritic rating;
3. 'Awards' containing wins and nominations of the movie was splited to separate columns of wins, nominations, and wins and nominations of important awards, for example, Oscar.

After the cleaning, useful columns were extracted for integration.

In [3]:
# currency conversion
omdbdata['BoxOffice'] = omdbdata['BoxOffice'].replace({',' : ''}, regex = True)
omdbdata['temp1'] = omdbdata['BoxOffice'].str.extract('\$([0-9]*)', expand=True)
omdbdata.loc[~omdbdata['temp1'].isnull(), 'BoxOffice'] = omdbdata.loc[~omdbdata['temp1'].isnull(), 'temp1'].astype('float64')
omdbdata['temp2'] = omdbdata['BoxOffice'].str.extract('\&pound;([0-9]*)', expand=True)
omdbdata.loc[~omdbdata['temp2'].isnull(), 'BoxOffice'] = omdbdata.loc[~omdbdata['temp2'].isnull(), 'temp2'].astype('float64')*1.30

# rating extraction
omdbdata['Ratings.Internet Movie Database'] = np.NaN
omdbdata['Ratings.Rotten Tomatoes'] = np.NaN
omdbdata['Ratings.Metacritic'] = np.NaN

for index, row in omdbdata.iterrows():
    rating_str = row['Ratings']
    try:
        rating_arr = json.loads(rating_str.replace("\'", "\""))
        for rating in rating_arr:
            omdbdata.at[index, 'Ratings.' + rating['Source']] = eval(rating['Value'].replace('%', '/100'))
    except Exception as e:
        print(e, index, rating_str)
        
# award extraction
omdbdata['win'] = omdbdata['Awards'].str.extract('[ ]?([0-9]*) win.', expand=True)
omdbdata['nomination'] = omdbdata['Awards'].str.extract('[ ]?([0-9]*) nomination.', expand = True)
omdbdata['won_oscar'] = omdbdata['Awards'].str.extract('Won ([0-9]*) Oscar*', expand = True)
omdbdata['nominate_oscar'] = omdbdata['Awards'].str.extract('[nN]ominated for ([0-9]*) Oscar*', expand = True)
omdbdata['won_BAFTA'] = omdbdata['Awards'].str.extract('Won ([0-9]*) BAFTA Film Award*', expand = True)
omdbdata['nominate_BAFTA'] = omdbdata['Awards'].str.extract('[nN]ominated for ([0-9]*) BAFTA Film Award*', expand = True)
omdbdata['won_GG'] = omdbdata['Awards'].str.extract('Won ([0-9]*) Golden Globe*', expand = True)
omdbdata['nominate_GG'] = omdbdata['Awards'].str.extract('[nN]ominated for ([0-9]*) Golden Globe*', expand = True)
omdbdata['won_Emmys'] = omdbdata['Awards'].str.extract('Won ([0-9]*) Primetime Emmys*', expand = True)
omdbdata['nominate_Emmys'] = omdbdata['Awards'].str.extract('[nN]ominated for ([0-9]*) Primetime Emmys*', expand = True)
        
# extract only useful columns
omdb_clean = omdbdata[['Actors', 'BoxOffice', 'Director', 'Genre', 'Production', 'Released', 'Title', 'imdbID', 'imdbVotes', 'Ratings.Internet Movie Database',
                       'Ratings.Rotten Tomatoes', 'Ratings.Metacritic', 'win', 'nomination', 'won_oscar', 'nominate_oscar']]
omdb_clean.rename(columns={'Ratings.Internet Movie Database':'ImdbRating',
                        'Ratings.Rotten Tomatoes':'RottenTomatoesRating', 
                        'Ratings.Metacritic':'MetacriticRating'},inplace=True)

In IMDB dataset, complementing OMDB dataset, the most important features were budget, gross and first week gross. These features were cleaned and unified to USD.

After the data cleaning, OMDB dataset and IMDB dataset are merged. The gross data from the two dataset were combined in the following strategy:
1. If 'Box Office.Cumulative Worldwide Gross' is present, use it as 'Gross'; or else,
2. If 'BoxOffice' is present, use it as 'Gross'; or else,
3. If 'Box Office.Gross USA' is present, use it as 'Gross'; or else,
4. Put as NaN.

In [4]:
# remove unwanted texts on entry
imdbdata['Box Office.Gross USA'] = imdbdata['Box Office.Gross USA'].replace({', [0-9a-zA-Z, ]*' : ''}, regex = True)
imdbdata['Box Office.Cumulative Worldwide Gross'] = imdbdata['Box Office.Cumulative Worldwide Gross'].replace({', [0-9a-zA-Z, ]*' : ''}, regex = True)
imdbdata['Box Office.Opening Weekend'] = imdbdata['Box Office.Opening Weekend'].replace({'[ ]*\([0-9a-zA-Z, \(\)]*' : ''}, regex = True)
imdbdata['Box Office.Opening Weekend USA'] = imdbdata['Box Office.Opening Weekend USA'].replace({'[A-Z][0-9a-zA-Z, ]*' : ''}, regex = True)
imdbdata['Box Office.Budget'] = imdbdata['Box Office.Budget'].replace({'[ ]*\(estimated\)' : ''}, regex = True).str.strip()

# convert currency
# USD remove the $ sign
# for euro and pounds times the rate
# for rest, tried to use a library to do it, install via: pip install CurrencyConverter

attrs = [
    'Box Office.Budget',
    'Box Office.Gross USA',
    'Box Office.Cumulative Worldwide Gross',
    'Box Office.Opening Weekend',
    'Box Office.Opening Weekend USA'
]

# limitation: got few currencies that could not be converted, as the quantity are low (less than 10), just ignored and marked as NaN
from currency_converter import CurrencyConverter
c = CurrencyConverter()
def convert(row, attr):
    try:
        return c.convert(float(re.sub(r'[^0-9 \xa0]', '', row[attr])), re.sub(r'[0-9 \xa0]', '', row[attr]), 'USD')
    except:
        return np.NaN

for attr in attrs:
    # remove ',' in numbers
    imdbdata[attr] = imdbdata[attr].replace({',' : ''}, regex = True)
    imdbdata['temp'] = np.NaN
    imdbdata['temp1'] = np.NaN
    imdbdata['temp'] = imdbdata[~imdbdata[attr].isnull() & imdbdata[attr].str.contains('[A-Z]{3}')].apply(lambda row: convert(row, attr), axis=1)
    imdbdata['temp1'] = imdbdata[attr].str.extract('\$([0-9]*)', expand=True)
    imdbdata.loc[~imdbdata['temp1'].isnull(), 'temp'] = imdbdata.loc[~imdbdata['temp1'].isnull(), 'temp1'].astype('float64')
    imdbdata['temp1'] = imdbdata[attr].str.extract('\€([0-9]*)', expand=True)
    imdbdata.loc[~imdbdata['temp1'].isnull(), 'temp'] = imdbdata.loc[~imdbdata['temp1'].isnull(), 'temp1'].astype('float64')*1.13
    imdbdata['temp1'] = imdbdata[attr].str.extract('\£([0-9]*)', expand=True)
    imdbdata.loc[~imdbdata['temp1'].isnull(), 'temp'] = imdbdata.loc[~imdbdata['temp1'].isnull(), 'temp1'].astype('float64')*1.30
    imdbdata[attr] = imdbdata['temp']
    
imdb_clean = imdbdata[['title', 'imdb_id', 'year',
    'Box Office.Budget', 'Box Office.Gross USA', 'Box Office.Cumulative Worldwide Gross', 'Box Office.Opening Weekend', 'Box Office.Opening Weekend USA']]

merged = imdb_clean.merge(omdb_clean, how='inner', left_on='imdb_id', right_on='imdbID')

merged['Gross'] = merged['Box Office.Cumulative Worldwide Gross']
merged.loc[pd.isnull(merged['Gross']), 'Gross'] = merged.loc[pd.isnull(merged['Gross']), 'BoxOffice']
merged.loc[pd.isnull(merged['Gross']), 'Gross'] = merged.loc[pd.isnull(merged['Gross']), 'Box Office.Gross USA']

merged.loc[pd.isnull(merged['Box Office.Opening Weekend']), 'Box Office.Opening Weekend'] = merged.loc[pd.isnull(merged['Box Office.Opening Weekend']), 'Box Office.Opening Weekend USA']

merged.drop(['Box Office.Gross USA', 'Box Office.Cumulative Worldwide Gross', 'BoxOffice', 'Box Office.Opening Weekend USA'], axis=1, inplace =True)
merged.rename(columns={'Box Office.Budget':'Budget', 'Box Office.Opening Weekend': 'OpeningWeekend'},inplace=True)

ModuleNotFoundError: No module named 'currency_converter'

In NTY dataset, one of the main challenges was to match the review to the movie it commented. There were only movie title and movie release date can be used to identify a movie. Sometimes, the movie name varies across NYT dataset, IMDB dataset and OMDB dataset. In order to maximize the movie that can be matched programmably while limiting the error rate, title information for both IMDB dataset and OMDB dataset were used.

In [None]:
merged['year'] = merged['year'].astype('float64')
nytdata['year'] = pd.to_datetime(nytdata['opening_date'], errors='coerce').dt.year

# WARNING - long running process
# imdb_found: found an entry in imdb that has the same title name
# imdb_year_found: found an entry in imdb that has same title as well as year, as sometimes the name can be duplicated
# omdb_found: found an entry in omdb that has the same title name
# omdb_year_found: found an entry in omdb that has same title as well as year, as sometimes the name can be duplicated
for index, row in nytdata.iterrows():
    title = row['display_title']
    nytdata.loc[index, 'imdb_found'] = merged[merged['title'] == title].shape[0]
    nytdata.loc[index, 'imdb_year_found'] = merged[(merged['title'] == title) & (merged['year'] == row['year'])].shape[0]
    nytdata.loc[index, 'omdb_found'] = merged[merged['Title'] == title].shape[0]
    nytdata.loc[index, 'omdb_year_found'] = merged[(merged['Title'] == title) & (merged['year'] == row['year'])].shape[0]
    
# print(nytdata[nytdata['imdb_found'] > 0].shape)
# print(nytdata[(nytdata['imdb_found'] > 1) & (nytdata['imdb_year_found'] == 0)].shape)
# print(nytdata[nytdata['omdb_found'] > 0].shape)
# print(nytdata[(nytdata['omdb_found'] > 1) & (nytdata['omdb_year_found'] == 0)].shape)

nytdata.loc[nytdata['imdb_found']==1,'imdb_id'] = nytdata[nytdata['imdb_found']==1].apply(lambda row: merged[merged['title']==row['display_title']].iloc[0]['imdb_id'], axis=1)
nytdata.loc[nytdata['omdb_found']==1,'imdb_id'] = nytdata[nytdata['omdb_found']==1].apply(lambda row: merged[merged['Title']==row['display_title']].iloc[0]['imdb_id'], axis=1)
nytdata.loc[(nytdata['imdb_found']>1) & (nytdata['imdb_year_found']==1), 'imdb_id'] = nytdata[(nytdata['imdb_found']>1) & (nytdata['imdb_year_found']==1)].apply(lambda row: merged[(merged['title']==row['display_title']) & (merged['year']==row['year'])].iloc[0]['imdb_id'], axis=1)
nytdata.loc[(nytdata['omdb_found']>1) & (nytdata['omdb_year_found']==1), 'imdb_id'] = nytdata[(nytdata['omdb_found']>1) & (nytdata['omdb_year_found']==1)].apply(lambda row: merged[(merged['Title']==row['display_title']) & (merged['year']==row['year'])].iloc[0]['imdb_id'], axis=1)

To aid the analysis, data validation shows that:

In [None]:
print('1. There were', nytdata[~pd.isnull(nytdata['imdb_id'])].shape[0], 'out of', nytdata.shape[0], 'data in NYT dataset that can be matched with a imdb_id')
print('2. There were', merged[~pd.isnull(merged['Gross'])].shape[0], 'out of', merged.shape[0], 'data in OMDB dataset that contains box office information')
merged = merged.merge(nytdata, how='outer', left_on='imdb_id', right_on='imdb_id')
print('3. There were', merged[~pd.isnull(merged['Gross']) & ~pd.isnull(merged['byline'])].shape[0], 'data in NTY dataset that can be matched with box office information')

Taking note of the validity of the data, the data analysis were carried out with suitable subset of the merged data.

In [None]:
# print(merged.shape)
merged.drop(['Title', 'imdbID', 'date_updated', 'display_title', 'headline', 'link.suggested_link_text', 'link.type', 'link.url', 'mpaa_rating',
            'multimedia', 'multimedia.height', 'multimedia.src' ,'multimedia.type', 'multimedia.width', 'opening_date', 'publication_date',
            'year_y', 'imdb_found', 'imdb_year_found', 'omdb_found', 'omdb_year_found'], axis=1, inplace =True)
# print(merged.shape)

## 2.3 Data Exploration and Methods

### a. What characteristics of movies affect NYT's choice to write a general review or listed as 'Crisic's Pick'?

Each movie has at least one genre in our data set. To make use of this data, pandas gives us a great deal of control over how categorical variables are represented. We can dummify the "genre" column using get_dummies. Therefore, we can create corresponding 26 dummy variables (columns) for all movies. For each movie record, if the movie is under this genre, we mark it as 1, and 0 presents that it is not under that genre.

For each genre, when we sum up the total number of ones in the column, we can get the number of apperance in our datasets for each genre. Take it divided by total number of movie records we can get the frequency of appearing of each genre. Note that sum of frequency of all genres would be larger than 1, because a movie can have multiple genres.
$$ \begin{eqnarray}
Frequency_{genre} = Count_{genre} / Total& &number&of&movies 
\end{eqnarray}$$

From dummify the genre, we calculated the apprearance frequency of each genre. The highest genre is Drama that is around 36.88%, which indicates that for all NYT times article in our datasets published for the past five years, there are around 37% were writing about Drama movie. To better visualize for all genre, we plot the bar chart below.

In [None]:
subset = merged[~pd.isnull(merged['byline'])].copy()
subset['Genre'] = subset['Genre'].str.replace(' ', '')
genre = subset['Genre'].str.get_dummies(',')
genre['byline'] = subset['byline']
vis_genre = genre.describe().loc['mean']
vis_genre = vis_genre.sort_values(ascending = False)
vis_genre.plot.bar()
plt.title('Insights: What Type of Movie New York Times Reviews Prefer?')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.show()

Does it happen for critics' pick movies? We filter out the records of critics' pick, and did same approach. The result is pretty much similar. The highest genre is still Drama and it appeared in 42.60% of the articles, and second favorite is Comedy, followed by Thriller.

In summary, no matter is critics's pick or not, the top types of movies appearing are Drama, Comedy and Thriller. It happened could due to authors' preferences, or simply due to larger portion of those three types of movies in the market.

In [None]:
cp_genre = genre.copy()
cp_genre = cp_genre[subset['critics_pick'] == 1]
vis_genre_cp = cp_genre.describe().loc['mean']
vis_genre_cp = vis_genre_cp.sort_values(ascending = False)
vis_genre_cp.plot.bar()
plt.title('Insights: What Type of Movie Critics Pick prefer?')
plt.xlabel('Genre')
plt.ylabel('Frequency %')
plt.show()

Showing the types of the movies produced over the past 5 years, we can see that Drama, Comedy and Thriller were indeed the top 3 popular types of movies produced. Comparing the appearing frequency, production frequency of Drama is 45.50%, and review frequency is 36.88%; the production frequency of Comedy is 26.50%, while its review frequency drops significantly to 16.83%. Therefore, we can conclude that though the high review frequency for Drama, Comedy and Thriller is due to their high production frequency, NTY reviews had a small unfavor to Comedy.

In [None]:
genre_pop = merged['Genre'].str.replace(' ', '').str.get_dummies(',')
vis_genre_pop = genre_pop.describe().loc['mean']
vis_genre_pop = vis_genre_pop.sort_values(ascending = False)
vis_genre_pop.plot.bar()
plt.title('Types of Movies Produced')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.show()

### b . Does NYT reviews influence the box office?

First of all, is there make difference for **boxoffice** if the movie review marked as **'Critics' Pick'**?

Looking into this quesiton can help us to understand whether Critics' Pick has any impact to boxoffice. To answe this question, we can group the box office into two category, marked as or not as critics' pick. And for data groups, we can compare many properties to find out the difference in between. Visualization of two dataset distribution is most direct way of looking into their differences, and we can find out the relation between boxoffice and critics' pick from linear regression.

Let boxoffice be the independent variable and critics' pick be the dependent variable, through simple linear regression we could find out is there potential any impact from critics' pick to the box office. By looking at the regression result and p-value of critics' pick, we might find out the linear relation between them.

$$ \begin{eqnarray}
Box Office = a_0*criticspick +b_0  
\end{eqnarray}$$

We can draw the shape of distribution on boxoffice for critics pick and non-critics pick, which are the visualizations of the two datasets in boxplots and distribution diagram. From the box plot, the means and medians of boxoffice data for moviews with normal review  and critics' pick are pretty much similar, although movies with normal review articles has bigger variance. The probability distribution diagram is also telling similar story, that there is no obvious differences in the boxoffice distribution, even the movies are writen in different type of articles. The linear regression result indicates p-value for critics' pick to box office is greater than 0.05, which means that there is no statistic linear relation between critics' pick and box office revenue.

In [None]:
review = merged.loc[(merged['critics_pick'] == 0) & ~pd.isnull(merged['Gross']), 'Gross']
critics = merged.loc[(merged['critics_pick'] == 1) & ~pd.isnull(merged['Gross']), 'Gross']

plot = [review,critics]
fig, ax = plt.subplots()
ax.boxplot(plot, showfliers=False)
plt.xticks([1, 2], ['Normal Review', 'Critics Pick'])
plt.title('Insights: Boxoffice vs Critics Pick')
plt.ylabel('Boxoffice')

fig, ax = plt.subplots()
ax.hist(review, 20, density=True, color='navy',alpha= 0.5,log = True, label = 'Normal Review')
ax.hist(critics, 20, density=True, color='red',alpha = 0.5, log = True, label = 'Critics Pick')
ax.axvline(np.mean(review), color='navy', linestyle='dashed', linewidth=1, label='Normal Review Mean')
ax.axvline(np.mean(critics), color='red', linestyle='dashed', linewidth=1, label='Critics Pick Mean')
plt.title('Insights: Boxoffice vs Critics Pick')
plt.ylabel('Probability')
plt.legend();

In [None]:
OLS_bocp = smf.ols(formula = 'Gross ~ critics_pick', data = merged[~pd.isnull(merged['byline']) & ~pd.isnull(merged['Gross'])]).fit()
# print(OLS_bocp.summary())

We also curious is there a relation between **New York Times review sentiment** and **Critics' Pick**?

We can extract the sentiment of the short summary of each article. Textbolb library in python provides sentiment analysis result for paragrahs. The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where 0 means netrual. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. We will use the polarity data from the result as the review sentiment.

As we were curious about whether there is a relation between boxoffice and critics pick, and critics' pick is represeting in binary format, could logistic regression help to reveal the relation? Recall the equation for logistic regression, the probability of being marked as Critics' Pick is:

\begin{align}
\hat{p} & = \ e^ a / (1 + e^a) \\
\ a & = beta_0 + beta_1 * Sentiment Score\\
\end{align}

Before fit the numbers to the model, we plotted the regressio plot of sentiment and critics pick, which gave us more direct view on the possiblilty. Blow plot showed us there is a clear trend that higher the sentiment, more likely it could be marked as critics' pick.

In [None]:
subset = merged[~pd.isnull(merged['byline'])].copy()
subset['sentiment'] = subset.apply(lambda row: TextBlob(row['summary_short']).sentiment.polarity, axis=1)
import seaborn as sns
sns.regplot(x='sentiment', y='critics_pick', data=subset, logistic=True);

So to fit in the logistic model, we set criticis's pick is the dependant variable of the regression and sentiment is the independent variable. The results gave p value of sentiment is less 0.001 with coefficient of 0.60 and Odds Ratio of 1.82 indicated that, that the higher the sentiment, the more likely the article is a critics' pick.

In [None]:
# logitmod = smf.logit(formula = 'critics_pick ~ sentiment', data = subset[['sentiment','critics_pick']]).fit()
# print(logitmod.summary())

Based on the above two questions, we wanted to know does the **sentiment from NYT articles** affected the **boxoffice**.

Similarly, we could try to use linear regression model to find out the linear relation between these two factors, to let box office as dependent variable. If there is, it can be represented as:

\begin{align}
\ BoxOffice = a_1 * SentimentScore + b_1
\end{align}

We plotted the scartter plot to help us visualize the relation. The scatter diagram illustrates correlation between box office and article sentiment, with a bell curve like shape. Most of the sentiments are in the range of [-0.5, 0.5] and highest boxoffices are happened in the center. For movies with sentiment positive (=1) and sentiment negative (=-1), the boxoffice looks all relatively small. We fit the article sentiment and its square value as independent variables to linear regression model, and only resulted in small adj R-sqr value 0.005. There is no statistical significant relation between article sentiment and box office.

In [None]:
plt.figure(figsize=(7,6))
plt.title("NYT Ariticle Sentiment Vs BoxOffice")
plt.xlabel("NYT Ariticle Sentiment")
plt.ylabel("BoxOffice")
sen_bo_vis=plt.scatter(subset.sentiment, subset.Gross, c=subset.sentiment, alpha=0.5)
plt.colorbar(sen_bo_vis, fraction=.025)
plt.show()

In [None]:
bo_sentiment = subset[['sentiment', 'Gross']]
bo_sentiment['sentiment_s'] = bo_sentiment['sentiment'] * bo_sentiment['sentiment']
OLS_bo_sentiment = smf.ols(formula = 'Gross ~ sentiment ', data = bo_sentiment).fit()
# print(OLS_bo_sentiment.summary())

### c. What characteristics of movies affect revenue? Can we forecast success of the movie box office?

Forecasting the boxoffice in actual number was challenging, there are many factors that could impact the boxoffice. To simplify the problem, we pivot the question to: are we able to predict the success of the movie based on it's budget and revenue? For this case we classified the movies into two categories, success and failure. Taking into account the classification of success and failure used in previous works, we define an acceptable profit amount as shown in below equeation as a measure of the success of a movie. It is a widely used measure which has heuristically produced more accurate results than just revenue minus cost. The total box office revenue is divided by two in order to factor in marketing costs and other costs during distribution which are not publicly available.

\begin{align}
\ Profit & = \ 1/ 2 \ * Revenue - Budget\
\end{align}

There could be a lot of factors that affect the success of the box office. For this question, we'd like to explore the releation between box office success and individual variables - Genre, IMDB Rating, Production Company, Budget, number of awards win, etc.

#### C.1 Relationship between Genre and Box Office
First, explore the linear relation between genre and box office. Started with exploring what are the box offices for all genres. The bar chart illustrated the average box offices for different genres. We can tell that SciFi,Adventure,Fantasy and Family are top four genre with highest average boxoffice. The boxplot tell similar information that there are significant differences for box office under different genre. We fit in the top 7 genres as independent variables and boxoffice as dependent variable into linear regression model, to see is there any statistical relation between genre and boxoffice. From the regression result, all of independent variables had p-value less than 0.001. It means that they are significantly affecting boxoffice under 99% confidence level with a adj R-squared of 0.316.

In [None]:
genre_cate = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Musical',
       'Mystery', 'News', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Thriller',
       'War', 'Western']

vis_bo_genre = pd.Series()
subset = merged[~pd.isnull(merged['Gross'])]

for genre_item in genre_cate:
    vis_bo_genre[genre_item] = subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains(genre_item)]['Gross'].mean()
vis_bo_genre = vis_bo_genre.sort_values(ascending = False)
vis_bo_genre.plot.bar()
plt.title('Insights: BoxOffice of Each Genre')
plt.xlabel('Genre')
plt.ylabel('BoxOffice')
plt.show()

In [None]:
SciFi = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Sci-Fi')]['Gross'])
Adventure = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Adventure')]['Gross'])
Fantasy = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Fantasy')]['Gross'])
Family = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Family')]['Gross'])
Animation = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Animation')]['Gross'])
Action = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Action')]['Gross'])
Thriller = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Thriller')]['Gross'])
Comedy = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Comedy')]['Gross'])
Drama = list(subset[~pd.isnull(subset['Genre']) & subset['Genre'].str.contains('Drama')]['Gross'])
plot = [SciFi,Adventure,Fantasy,Family,Animation,Action,Thriller]
fig, ax = plt.subplots()
flierprops = dict(marker='o', markerfacecolor='r', markersize=5,
                  linestyle='none', markeredgecolor='y')
ax.boxplot(plot, showfliers = False)
plt.xticks([1, 2, 3, 4, 5, 6, 7], ['Sci-Fi','Adventure','Fantasy','Family','Animation','Action','Thriller'])
plt.show()

In [None]:
subset = merged[~pd.isnull(merged['byline'])]
df_bo_genre = genre.copy()
df_bo_genre['Gross'] = subset['Gross']
df_bo_genre.dropna(subset=['Gross'],inplace=True)

OLS_bo_gen = smf.ols(formula = 'Gross ~ Adventure + Action + Comedy + Drama + Fantasy + Family', data = df_bo_genre).fit()
# print(OLS_bo_gen.summary())

#### C.2 Relationship between IMDB ratings and Box Office
How about IMDB ratings to BoxOffice? Linear regression result gave p-value less than 0.001 indicated the statistically significant between box office and imdb ratings. Scatter plot helped to visualize the relation between these two factors. From the plot, there is a clear trend that the box office get larger as the IMDB rating become larger. Thus, imdb rating and other rating data might be predictors for boxoffice.

In [None]:
subset = merged[~pd.isnull(merged['Gross']) & ~pd.isnull(merged['ImdbRating'])]

plt.figure(figsize=(7,6))
plt.title("IMDB Rating Vs BoxOffice")
plt.xlabel("BoxOffice")
plt.ylabel("IMDB Rating")
imdbrating_bo_vis=plt.scatter(subset.ImdbRating, subset.Gross, c=subset.ImdbRating, alpha = 0.5)
plt.colorbar(imdbrating_bo_vis,fraction=.025)
plt.show()

In [None]:
OLS_bo_rating = smf.ols(formula = 'Gross ~ ImdbRating', data = subset).fit()
# print(OLS_bo_rating.summary())

#### C.3 Relationship between Production Company and Box Office
To find out whether production companies impact box office, we draw diagram below showing the bar chart on average boxoffice for top 10 production companies and the others. The second diagram showing the box plots for same information. Two diagrams tell same information that top 10 companies have higher box office compared with other smaller company.

To check the linear relation between companies and boxoffice, a linear regression modal was built with regard to boxoffice. Took the first 5 production companies at the input and boxoffice at the output. Below we can see the coefficient as well as the p-value of them. From the regression result, five of them all had p-value less than 0.001. It means that they are significantly affecting boxoffice under 99% confidence level with a adj R-squared with 0.262.

In [None]:
subset = merged[~pd.isnull(merged['Gross'])]
production_cat = subset['Production'].str.get_dummies()
production_cat1 = production_cat.copy()

for col in production_cat:
    production_cat[col] = production_cat[col]*subset['Gross']
vis_bo_pro = production_cat.mean()
vis_bo_pro = vis_bo_pro.sort_values(ascending = False)
rest = vis_bo_pro[10:].mean()
others = pd.Series({'Others':rest})
vis = vis_bo_pro.head(10).append(others)
vis.plot.bar()
plt.title('Insights: BoxOffice of Each Production Company')
plt.xlabel('Production Company')
plt.ylabel('Average BoxOffice')
plt.show()

In [None]:
WaltDisney = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Walt Disney Pictures')]['Gross'])
WarnerBros = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Warner Bros.')]['Gross'])
Universal = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Universal Pictures')]['Gross'])
Fox = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('20th Century Fox')]['Gross'])
Sony = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Sony Pictures')]['Gross'])
Paramount = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Paramount Pictures')]['Gross'])
Pixar = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Disney/Pixar')]['Gross'])
DreamWorks = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('DreamWorks Animation')]['Gross'])
Columbia = list(subset[~pd.isnull(subset['Production']) & subset['Production'].str.contains('Columbia Pictures')]['Gross'])
# Others = list(subset[~pd.isnull(subset['Production']) & ~subset['Production'].str.contains('Walt Disney Pictures','Warner Bros. Pictures', 'Warner Bros.','Universal Pictures','20th Century Fox','Sony Pictures','Paramount Pictures','Disney/Pixar','DreamWorks Animation','Columbia Pictures'])]['Gross'])

In [None]:
plot = [WaltDisney, WarnerBros, Universal, Fox, Sony, Paramount, Pixar, DreamWorks, Columbia]
fig, ax = plt.subplots()
flierprops = dict(marker='o', markerfacecolor='r', markersize=5,
                  linestyle='none', markeredgecolor='y')
ax.boxplot(plot, showfliers = False)
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 9], ['Walt Disney', 'WarnerBros', 'Universal', 
                                         '20th Century Fox', 'Sony', 'Paramount Pictures',
                                         'Disney/Pixar', 'DreamWorks', 'Columbia'], 
           rotation='vertical')
plt.show()

In [None]:
production_cat1['Gross'] = subset['Gross']
production_cat1.rename(columns={'Walt Disney Pictures':'Walt_Disney_Pictures',
                           'Warner Bros. Pictures':'Warner_Bros_Pictures',
                           'Universal Pictures': 'Universal_Pictures',
                           '20th Century Fox': 'CenturyFox',
                           'Sony Pictures': 'Sony_Pictures',
                            'Disney/Pixar': 'Pixar'},inplace=True)
OLS_bo_pro = smf.ols(formula = 'Gross ~ Walt_Disney_Pictures + Warner_Bros_Pictures + Universal_Pictures + CenturyFox + Sony_Pictures+Pixar', data = production_cat1).fit()
# print(OLS_bo_pro.summary())

#### C.4 Logistic Regression for success prediction

Throught the exploration on three factors, we can conclude that those factors may have relations to the boxoffice. Besides genre, IMDB rating, production company, we want to include other rating datasets, actor, director to try to build a logistic regression model to forecast the box office success.

After identified possible factors, we can then build a linear regression model with all factors be various independent variables. Conversion is needed:

1. There are many factors are not in numerical format. Take actors data as example - we compute each actor’s ‘Actor Power Score (APS)’ by counting the number of times this actor appears in our final data set. Then for each movie in our dataset, we add the power scores of all the actors in the movie which we call the ‘Total Actor Star-power (TAS)’ of the movie. We divide this by the number of actors (NAct) in the movie to calculate the ‘Average Actor Star-power (AvAS)’ and add it to the final data set against each movie. The computations are shown in equations below:

$$APS_{actors} = count_{actor}(all movies)$$

$$TAS_{movie} = \sum_{actor in movie}^{} APS_{actor}$$

$$AvAS_{movie} = TAS_{movie}/NAct_{movie}$$

    For data of direcor, production company are unique for each movie, hence there is no need to compute total or average power score. Therefore, the Director-Star power(DS) and Production-Star power(PS) are:

$$DS_{movie}=count_{director}(all movie)$$

$$PS_{movie}=count_{production\_company}(all movie)$$

2. For award data, to reduce number of total predictor and ease of model, we will be considering the factors of number of awards won and nominated, as well as the number of Oscar awards won and nominated.


3. For factor of release date, to convert it into numerical format, we categorize them into two groups. Month fall under summer holiday and year end holiday months, or not holiday seasons. Similar to critics' pick, we will 1 to represent holiday and 0 represent not holiday season.


4. And for Genre data, similarly we categorize them into two groups. Genre fall under top three genres as 1, and not under top three genre as 0.

I used 13 out of 24 variables as inputs for the Regression models: 

|Available Variables|Inputs for Linear Regression (Y/N)|Data Format|
 |-------------------|:-----------------------------:|-----------|
 |Budget             |N                              |Budget cost   |
 |Rotten Tomatoes Ratings|Y                          |Rating value [0,1]|
 |Metascore          |Y                              |Rating value [0,1]|
 |IMDB Ratings       |Y                              |Rating value [0,1]|
 |IMDB Votes         |Y                              |number of votes|
 |Critics Pick       |N                              |1(is critics' pick) or 0 (is not)     |
 |Article Sentiment  |N                              |from low to high [-1,1]    |
 |Movie Open Date    |Y                              |1 if month fall into summer and holiday period (7,8,11,12), else 0|
 |Actors             |Y                              |AvAS       |
 |Awards - Won       |Y                              |number of awards won|
 |Awards - Nomination|Y                              |number of award nominations|
 |Awards - won Oscar |Y                              |number of oscar won|
 |Awards - nominated Oscar|Y                        |number of oscar nominated|
 |Awards - won Golden Globe|N                        |number of Golden Globe won|
 |Awards - nomiated Golden Globe|N                   |number of Golden Globe nomiated|
 |Awards - won Emmys |N                              | -|
 |Awards - nomiated Emmys|N                          |number of Emmys nomiated|
 |Awards - won BAFTA |N                              |number of BAFTA  won|
 |Awards - nominated BAFTA|N                         |number of BAFTA  nominated|
 |Director           |Y                              |Director-Star power|
 |Genre              |Y                              |1 if it is top 3 genre, else 0 |
 |Production         |Y                              |Production-Star power|
 |Country            |N                              |-|
 |Language           |N                              |-|

We can then fit the variables to the logistic regression model to see the what could impact the success of a box office.

In [None]:
subset = merged[~pd.isnull(merged['imdbVotes']) & ~pd.isnull(merged['byline']) & ~pd.isnull(merged['Budget']) & ~pd.isnull(merged['Gross'])].copy()

In [None]:
subset['sentiment'] = subset.apply(lambda row: TextBlob(row['summary_short']).sentiment.polarity, axis=1)

In [None]:
#calculate director-star power
director = subset['Director'].str.get_dummies()
director = director.append(director.sum(0), ignore_index=True)
director = director*director.iloc[-1]
director['power'] = director.sum(1)
subset['director_power'] = director['power']

In [None]:
#calculate average actor-star power
subset['Actors'] = subset['Actors'].str.replace(' ', '')
actors = subset['Actors'].str.get_dummies(',')
actors['number'] = actors.sum(1)
num_act = pd.DataFrame(actors['number'])
actors.drop('number', axis=1, inplace=True)
actors = actors.append(actors.sum(0),ignore_index = True)
actors = actors*actors.iloc[-1]
actors['sum'] = num_act['number']
actors['total power'] = actors.sum(1)
actors['actor power'] = actors['total power']/actors['sum']
subset['actor_power'] = actors['actor power']

In [None]:
#calculate production-star power
production = subset['Production'].str.get_dummies()
production = production.append(production.sum(0),ignore_index=True)
production = production*production.iloc[-1]
production['power'] = production.sum(1)
subset['production_power'] = production['power']

In [None]:
# extract month data and categories into holiday and non-holiday period
subset['open_month'] = pd.to_datetime(subset['Released'], errors='coerce').dt.month
subset['holiday'] = 0
subset.loc[subset['open_month'].isin([7.0, 8.0, 10.0, 11.0, 12.0]), 'holiday'] = 1

In [None]:
subset['Genre'] = subset['Genre'].str.replace(' ', '')
genre = subset['Genre'].str.get_dummies(',')
subset['Adventure'] = genre['Adventure']
subset['Action'] = genre['Action']
subset['Comedy'] = genre['Comedy']

In [None]:
#convert the data type to numerical data
subset[['win', 'nomination', 'won_oscar', 'nominate_oscar']] = subset[['win', 'nomination', 'won_oscar', 'nominate_oscar']].astype('float64')

In [None]:
subset['success'] = 0
subset.loc[0.5*subset['Gross'] > subset['Budget'] ,'success'] = 1
subset.replace(np.NaN, 0, inplace=True)

In [None]:
subset['imdbVotes'] = subset['imdbVotes'].str.replace(',','').astype('float64')

In [None]:
Logit = smf.logit(formula = 'success ~  imdbVotes + holiday + ImdbRating + RottenTomatoesRating + MetacriticRating + production_power + win + nomination + won_oscar + nominate_oscar + sentiment + director_power + actor_power + critics_pick + Adventure + Action + Comedy', data = subset).fit()
# print(Logit.summary())
# print(np.exp(Logit.params))

#### C.5 Linear Regression for Box Office Prediction

Ultimately, it is always preferrable to have a forecasting model that generating the actual box office revenue based on some variables. For this study, I tried to fit in the factors above to linear regression model and took a bit deeper look into the result.

In [None]:
OLS = smf.ols(formula = 'Gross ~ imdbVotes + holiday + ImdbRating + RottenTomatoesRating + MetacriticRating + production_power + win + nomination + won_oscar + nominate_oscar + sentiment + director_power + actor_power + critics_pick + Adventure + Action + Comedy', data = subset).fit()

# III. Result

#### Forecast success - Logistic Regression

Fit in predictors identified earlier into logistic regression model, we got a results of R-squ 0.2448. IMDB, RottenTomatoes, actor power, Adventure and Action are statistically significant to the result. Among all factors, RottenTomato has highest Odds Ratio and the one factor with positive coefficient that can interpret as one unit increase in RottenTomato can increase the possbility of success of the movie.

In [None]:
# print(Logit.summary())
# print(np.exp(Logit.params))

#### Forecast the box office - Linear Regression

Further more, we'd like to find out if we can build a model that is be able to forecast the box office based on the predictors we identified. From the regressin results, we can conclude that number of votes (`imdbVote`, $p<0.001$), Rotten Tomato score (`RottenTomatoes`, $p<0.001$), number of award nomination (`nomination`, $p<0.05$), nominate oscar (`nominate_oscar`, $p<0.005$), Budget (`Budget`, $p<0.001$), and sentiment of the New York Times review (`sentiment`, $p<0.05$), whether it is a Adventure movie(`Adventure`, $p<0.05$) are statistically significant to total box office of the movie. The regression model has a Adj R-squred of 0.601.

Does the data fit into the linear regression model good? Let's review the linear regression assumptions. 

In [None]:
#print(OLS.summary())

There are four principal assumptions which justify the use of linear regression models for purposes of inference or prediction:

1. linearity and additivity of the relationship between dependent and independent variables:
   1. The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed.
   2. The slope of that line does not depend on the values of the other variables.
   3.  The effects of different independent variables on the expected value of the dependent variable are additive.

2. statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)

3.  homoscedasticity (constant variance) of the errors
    (a) versus time (in the case of time series data)
    (b) versus the predictions
    (c) versus any independent variable
4.  normality of the error distribution.

Linearity and Equal Variance both can be tested by plotting residuals vs. predictions, where residuals are prediction errors. From this residual scatter plot, we can see the points are not random enough, which indicate this model is not well fitted and need imporvement. Possible reason is factor missing.

In [None]:
pred_val = OLS.fittedvalues.copy()
true_val = subset['Gross'].values.copy()
residual = true_val - pred_val
fig, ax = plt.subplots(figsize=(7,7))
_ = ax.scatter(residual, pred_val, alpha = 0.5)
plt.title('residual - predicts plot')
plt.xlabel('predicted values')
plt.ylabel('residual')
plt.show()

For normality, We can apply normal probability plot to assess how the data (error) depart from normality visually. The good fit indicates that normality is a reasonable approximation.

In [None]:
import scipy as sp
fig, ax = plt.subplots(figsize=(6,2.5))
_, (__, ___, r) = sp.stats.probplot(residual, plot=ax, fit=True)
r**2

# IV. Conclusion & Discussion

In this study, we looked into analysis the datasets from New York Times and OMDB on movie data, to find out the relation between New York Times article and boxoffice. Here are some conclusions:

1. For both review aritcle and critics' pick, Drama took the highest frequency. It could due to drama movies took a large market ratio and also could because writers prefer drama movies because people enjoys to watch this type of movie thus interested in reviews of drama movies as well.

2. The boxoffice looked similar for review articles and critics' pick.

3. There is a statisic significant relation between review sentiment and critics' pick, the logistic regression summary support it with p-value < 0.05.

4. we built the linear regression model, aimed to predict the boxoffice. However, although we achieved significant p-value for most of the factors, the residual-fits plots doesn't appear randomly and in another word the model didn't fit the data well. It could because miss out important factor.

### Limitations and Further Study

For this three datasets, there are common issues that there are plenty of null data. In this study, most of the time we dropped the null value, to ease the calculation but it could lead us to miss out the important pattern or info from the missing value. In further study, we need to think about interpretation for the missing value (null value).

There are many other good factors that could be very useful for movie boxoffice prediction. For further study, we should include more possible factors to build the forecasting model.