# From Movie Reviews to Box Office
Yen-Ting Chen

## Overview
Recent studies on box office prediction aim only at predicting the opening weekend gross.<sup>1-3</sup>

This project aims to find the correlation between a movie's box office and its reviews on Amazon.com. Machine learning models are fitted to features extracted from the reviews and used to project a movie's box office.

This project has three main parts: 1) data scraping and cleaning, 2) feature extraction and preprocessing, and 3) prediction modeling. Parts of the code are shown here; the rest can be accessed on my [GitHub repository](https://github.com/janie128/Project-movies).

## 1) Data Scraping and Cleaning
Two main data sets are used in this project: movies with their box office grosses<sup>4</sup>, and Amazon review data with the accompanying metadata<sup>5-7</sup>.

#### Box office data
Box office data was scraped from Box Office Mojo for all available years and movies.

In [None]:
library(rvest)  # provides read_html(), html_nodes(), html_text()

# The following URL and CSS templates rely on the layout of the website. They may change over time.
pageUrlTemplate <- "http://www.boxofficemojo.com/daily/?view=bymovie&yr=%s&page=%d&sort=title&order=ASC"
years <- c("2015", "2014", "2013", "2012", "2011",
          "2010", "2009", "2008", "2007", "2006",
          "2005", "2004", "2003", "2002", "pre2002")
remainingPageLinkCss <- "font:nth-child(4) a" # how many page remaining
movieTitleCss <- "tr+ tr td:nth-child(1) font" # movie title
releaseGrossCss <- "tr+ tr td:nth-child(4) font" # release gross
releaseDateCss <- "tr+ tr td:nth-child(5) font" # release date

# Function for parsing the html page and extracting data of interest.
extractBoxOfficeFn <- function(page, boxOffice) {
  extractedTitles <- html_text(html_nodes(page, movieTitleCss))
  extractedReleaseDates <- html_text(html_nodes(page, releaseDateCss))
  # Extract release gross, strip "$" and ",", and cast to numeric.
  extractedReleaseGrosses <- html_text(html_nodes(page, releaseGrossCss))
  extractedReleaseGrosses <- gsub("\\$", "", extractedReleaseGrosses)
  extractedReleaseGrosses <- gsub(",", "", extractedReleaseGrosses)
  extractedReleaseGrosses <- as.numeric(extractedReleaseGrosses)

  extractedFromPage <- data.frame(
      title = extractedTitles,
      gross = extractedReleaseGrosses,
      releaseDate = extractedReleaseDates)
  return(extractedFromPage)
}

# Accumulate the results of all years and pages in one data frame
boxOffice <- data.frame()
for (year in years) {
  pageUrl <- sprintf(pageUrlTemplate, year, 1)
  page <- read_html(pageUrl)
  numRemainingPages <- length(html_text(html_nodes(page, remainingPageLinkCss)))

  boxOffice <- rbind(boxOffice, extractBoxOfficeFn(page, boxOffice))
  # seq_len() gives an empty loop when a year has a single page (1:0 would iterate twice)
  for (pageIndex in seq_len(numRemainingPages)) {
    pageUrl <- sprintf(pageUrlTemplate, year, pageIndex + 1)
    page <- read_html(pageUrl)
    boxOffice <- rbind(boxOffice, extractBoxOfficeFn(page, boxOffice))
  }
}

The data was then filtered to include only releases after May 1996, matching the period covered by the Amazon review data. Initially, the box office amounts were to be adjusted for inflation and for cultural factors (such as a possibly growing movie-going culture) by fitting a regression to the general (upward) trend and factoring it out. However, the share of small (lower-grossing) films increases over the time period, which disrupts any such trend. This is evident in Figure 1, which shows the percentage of movies in each box office bucket ($log_{10}(gross)$) over the years.  

![](./figures/box_office_years.png)
**Fig.1** Box office ($log_{10}(gross)$) sorted into 4 buckets and plotted over years. Percentage of highest grossing movies decreases while medium and lower grossing movies are on the rise.
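
The bucket shares behind a plot like Figure 1 could be computed roughly as follows. This is a minimal base-R sketch: the toy data, the column names `year` and `gross`, and the bucket boundaries are assumptions for illustration, since the boundaries used for the figure are not stated here.

```r
# Toy box office table (hypothetical values, not scraped data)
boxOfficeToy <- data.frame(
  year  = c(2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010),
  gross = c(5e3, 2e5, 3e7, 4e8, 8e3, 9e4, 6e5, 2e8)
)

# Assumed bucket edges on the log10(gross) scale, intervals closed on the left
bucketBreaks <- c(0, 4, 6, 8, Inf)
boxOfficeToy$bucket <- cut(log10(boxOfficeToy$gross), breaks = bucketBreaks,
                           labels = c("<1e4", "1e4-1e6", "1e6-1e8", ">1e8"),
                           right = FALSE)

# Row-wise percentages: share of each bucket within a year
bucketPct <- 100 * prop.table(table(boxOfficeToy$year, boxOfficeToy$bucket), margin = 1)
print(round(bucketPct, 1))
```

Each row of `bucketPct` sums to 100, so a shift between buckets over the years shows up directly as a change in composition rather than in absolute counts.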

Therefore, US inflation rates for 1995-2014 were obtained<sup>8</sup>, and inflation was the only factor used to adjust the gross amounts.
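
One way such an adjustment can be carried out is sketched below. The rates shown are made-up placeholders rather than the actual figures, and the choice of 2014 as the base year, along with the names `inflation` and `adjustGrossFn`, are assumptions for illustration.

```r
# Hypothetical annual inflation rates in percent (placeholders, not real data)
inflation <- data.frame(year = 2011:2014, rate = c(3.0, 2.0, 1.5, 1.6))

# Cumulative price index, then a factor converting each year's dollars to base-year dollars
inflation$index <- cumprod(1 + inflation$rate / 100)
baseIndex <- inflation$index[inflation$year == 2014]  # assumed base year
inflation$factor <- baseIndex / inflation$index

# Adjust a vector of grosses given their release years
adjustGrossFn <- function(gross, year) {
  gross * inflation$factor[match(year, inflation$year)]
}

adjusted <- adjustGrossFn(c(1e6, 1e6), c(2012, 2014))
print(adjusted)
```

A gross from the base year is left unchanged, while earlier grosses are scaled up by the compounded inflation of the years in between.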

#### Amazon movie review data
Amazon "Movies & TV" review data and the corresponding metadata files were downloaded in JSON format, then converted with a Python script into strict JSON for parsing in R. Review and metadata files are labeled with product titles rather than movie titles, so the movie titles from the box office data set had to be matched against the product titles in the metadata file. The matching process involved stripping punctuation, converting to lower case, removing a leading "the" for more comprehensive matching, and pattern matching with regular expressions.  
  
The title-matched data was further cleaned and unnecessary variables were removed. As some mismatching was unavoidable with extremely short titles, only movies whose titles are longer than 4 characters were used for the rest of the analysis.
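
The matching steps described above can be sketched in base R as follows. The helper `normalizeTitleFn`, the sample titles, and the anchored prefix pattern are illustrative assumptions, not the exact rules used in the project; the length filter is applied here to the normalized title, which is also an assumption.

```r
# Normalize a title: lower case, strip punctuation, drop a leading "the", squeeze spaces
normalizeTitleFn <- function(title) {
  title <- tolower(title)
  title <- gsub("[[:punct:]]", "", title)
  title <- sub("^the\\s+", "", title)
  title <- gsub("\\s+", " ", title)
  trimws(title)
}

# Hypothetical movie titles (box office side) and product titles (Amazon side)
movieTitles <- c("The Matrix", "Up", "Spider-Man 2")
productTitles <- c("Matrix, The [Blu-ray]", "The Matrix (Widescreen Edition)",
                   "Spider-Man 2 [DVD]", "Upstairs Downstairs")

movieKeys <- normalizeTitleFn(movieTitles)
productKeys <- normalizeTitleFn(productTitles)

# Keep only titles longer than 4 characters to limit spurious matches
keep <- nchar(movieKeys) > 4

# A product matches when its normalized title starts with the movie key as a whole word
matches <- lapply(movieKeys[keep], function(key) {
  productTitles[grepl(paste0("^", key, "\\b"), productKeys, perl = TRUE)]
})
names(matches) <- movieTitles[keep]
print(matches)
```

Because punctuation is stripped before the pattern is built, the keys contain no regular expression metacharacters. The short title "Up" is dropped by the length filter, which is the kind of title that would otherwise collide with unrelated products.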

## 2) Feature Extraction and Preprocessing
Raw variables of interest in the review data are the review text and the overall rating (score) of each review. Further meaningful features must be extracted from this data.

**(1) Review count (by movie)**: This was extracted using the group_by(), summarize() and count() functions. Figure 2 shows the distribution of review counts across all movies. A large proportion of movies have fewer than 50 reviews (red line). That is considered too few to be representative of review quality, so these movies are discarded from the analysis.  
![](./figures/review_count.png)
**Fig.2** Distribution of review counts per movie. Movies with fewer than 50 reviews (red line) are discarded.
  
**(2) Average score**: This was extracted using the group_by(), summarize() and mean() functions. Figure 3 shows the rating distributions of three sample movies. The distributions clearly differ from movie to movie, so the average score alone is not sufficient.  
![](./figures/review_ratings.png)
**Fig.3** Three samples of movie ratings distribution.  

Therefore, the frequency of each rating score (1 through 5) is also extracted as a feature.

**(3) Rating scores 1 through 5**
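
Features (1) to (3) can be derived per movie along the following lines. This sketch uses base R instead of the dplyr functions named above so that it runs without extra packages; the toy data, the column names `movieTitle` and `overall`, and the choice to express rating frequencies as fractions are assumptions.

```r
# Toy review table (hypothetical): one row per review
reviewsToy <- data.frame(
  movieTitle = c("a", "a", "a", "b", "b"),
  overall    = c(5, 5, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Fix the rating levels so every movie gets all five columns, even for unused scores
ratingFactor <- factor(reviewsToy$overall, levels = 1:5)
ratingCounts <- table(reviewsToy$movieTitle, ratingFactor)

# (1) review count and (2) average score per movie
movieFeatures <- data.frame(
  movieTitle  = rownames(ratingCounts),
  reviewCount = as.vector(rowSums(ratingCounts)),
  avgScore    = as.vector(tapply(reviewsToy$overall, reviewsToy$movieTitle, mean)),
  stringsAsFactors = FALSE
)

# (3) frequency of each rating score, as a fraction of the movie's reviews
ratingFreq <- as.data.frame.matrix(prop.table(ratingCounts, margin = 1))
names(ratingFreq) <- paste0("rating", 1:5)
movieFeatures <- cbind(movieFeatures, ratingFreq)

# Discard movies with too few reviews (the analysis uses 50; 3 here for the toy data)
minReviews <- 3
keptMovies <- movieFeatures[movieFeatures$reviewCount >= minReviews, ]
print(keptMovies)
```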

**(4) Average word count, word count percentiles (25, 50, 75%)**: The word count of each review was calculated using a regex search, then aggregated for each movie. Word count distributions are likely to differ between movies, so the low, mid and high percentiles of each movie's word count distribution were also calculated.

In [None]:
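library(dplyr)  # provides %>%, group_by(), summarize()

# ---------Reviews word count
# regmatches() returns character(0) for empty text, so empty reviews count as 0 words
reviewText <- as.character(reviewsTitles$reviewText)
reviewsTitles$wordCount <- lengths(regmatches(reviewText, gregexpr("\\S+", reviewText)))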

# ---------Generate table with review word count info including quantiles and average
wordCountInfo <- reviewsTitles %>% group_by(movieTitle) %>%
  summarize(wordCountLow = quantile(wordCount, probs=0.25), wordCountMid = quantile(wordCount, probs=0.5), 
            wordCountHigh = quantile(wordCount, probs=0.75), wordCountAvg = round(mean(wordCount),1))

**(5) Good/bad word TFIDF (review content word analysis)**: Finally, the review contents were tokenized with natural language processing tools, and sentiment analysis was performed to produce a "good" (positive sentiment) and a "bad" (negative sentiment) TFIDF score.  

The "tm" package was used to perform text mining on each review to obtain a list of occurring words and their frequencies. The usual text cleaning steps were applied: transformation to lower case, punctuation removal, and stripping of extra white space. Stopwords were not removed, because removing them could discard a considerable share of the information in relatively short reviews. Dictionaries of good (positive sentiment) and bad (negative sentiment) words were obtained<sup>9-10</sup> and compared to the generated list of occurring words to find the relevant words.  
Inverse document frequencies for the words in the two dictionaries were calculated:  

$$idf(t,D) = \log_{10}\frac{N}{|\{d \in D : t \in d\}| + 1}$$  

where $N$ is the total number of reviews and $|\{d \in D : t \in d\}|$ is the number of reviews in which the term $t$ appears. The plus one in the denominator avoids division by zero when the term does not appear in any review. This is essentially a weighting for how common a given word is in the set of reviews: the less common it is, the higher its importance and the stronger the weighting.

Parallel processing across multiple cores was used to speed up the process.

In [None]:
# -----------------------
# Adding IDF to goodDict
# -----------------------
totalDoc <- length(textAllReviews) # total number of documents

goodDictRegex <- goodDict$regex
goodDictContain <- numeric(length(goodDictRegex))

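# Parallel processing & exporting data needed in this segment
library(parallel)  # provides detectCores(), makeCluster(), clusterExport(), parLapply()
numCores <- max(1, min(3, detectCores() - 1))  # at least one worker on single-core machines
cluster <- makeCluster(numCores)
clusterExport(cluster, "textAllReviews")
# parLapply used instead of lapply for parallel processing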

# --Done in chunks to ensure data is saved if anything happens
# Count number of documents(reviews) that contain the good word (loops through goodDict)
limit <- length(goodDictRegex)
begin <- 1
end <- 50

while (begin <= limit){
  realEnd <- min(limit, end)
  
  results <- parLapply(cluster, goodDictRegex[begin:realEnd], function(x) {
    sum(grepl(paste("\\<", x, "\\>", sep = ""), textAllReviews))
  })
  # parLapply returns a list; unlist() keeps goodDictContain a numeric vector
  goodDictContain[begin:realEnd] <- unlist(results)
  
  # For monitoring progress (realEnd never exceeds limit)
  print(paste(realEnd, "out of", limit))
 
  begin <- begin + 50
  end <- end + 50
}
goodDict$contain <- goodDictContain
# Turn off parallel
stopCluster(cluster)
rm(cluster)
 
# Calculate inverse document frequency with denominator +1 (vectorized, yields a numeric column)
goodDict$IDF <- round(log10(totalDoc / (1 + unlist(goodDict$contain))), 4)

Subsequently, the term frequencies for each relevant word in a given review were calculated:  

$$tf(t,d) = \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}$$

where $f_{t,d}$ is the raw frequency of the term $t$ in review $d$ and $\max\{f_{t',d} : t' \in d\}$ is the maximum raw frequency of any term in that review; this normalization accounts for the differing lengths of the reviews. The term frequency-inverse document frequency (TFIDF), $tfidf(t,d,D) = tf(t,d)\times idf(t,D)$, is then calculated for each word, and the values are summed separately over the good words and the bad words of the review. This process is repeated for all reviews, and the sums are then averaged for each movie.
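
As a small worked example of the two formulas, the following base-R sketch computes good and bad TFIDF sums for a toy set of five reviews. The reviews, the two tiny dictionaries, and the helper names `idfFn`, `tfFn` and `reviewTfidfFn` are made up for illustration and are independent of the project code.

```r
# Toy corpus (hypothetical reviews) and tiny sentiment dictionaries
toyReviews <- c("great great movie", "awful plot", "great cast", "long movie", "fine")
tokens <- strsplit(toyReviews, "\\s+")
N <- length(tokens)
goodWords <- c("great", "fine")
badWords <- c("awful")

# idf(t, D) = log10(N / (number of reviews containing t + 1))
idfFn <- function(term) {
  docCount <- sum(vapply(tokens, function(d) term %in% d, logical(1)))
  log10(N / (docCount + 1))
}

# tf(t, d) = raw frequency of t in d divided by the maximum raw frequency in d
tfFn <- function(term, doc) {
  freq <- table(doc)
  if (!term %in% names(freq)) return(0)
  as.numeric(freq[term]) / max(freq)
}

# Sum of tf * idf over the dictionary words that occur in one review
reviewTfidfFn <- function(doc, dict) {
  sentiWords <- intersect(unique(doc), dict)
  sum(vapply(sentiWords, function(term) tfFn(term, doc) * idfFn(term), numeric(1)))
}

goodTfidf <- vapply(tokens, reviewTfidfFn, numeric(1), dict = goodWords)
badTfidf <- vapply(tokens, reviewTfidfFn, numeric(1), dict = badWords)
print(round(cbind(goodTfidf, badTfidf), 4))
```

In the first review, "great" is the most frequent term, so its term frequency is 1, and it occurs in two of the five reviews, so the good TFIDF is $\log_{10}(5/3)$. "fine" occurs in only one review and therefore receives the larger weight $\log_{10}(5/2)$.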

In [None]:
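library(tm)  # provides VectorSource(), Corpus(), tm_map(), DocumentTermMatrix()

# A) Function for generating the word frequencies of a review (returns the document-term matrix as a plain matrix)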
reviewToWordsFn <- function(text){
  text_source <- VectorSource(text)
  textCorpus <- Corpus(text_source)
  textCorpus <- tm_map(textCorpus, content_transformer(tolower))
  textCorpus <- tm_map(textCorpus, removePunctuation)
  textCorpus <- tm_map(textCorpus, stripWhitespace)
  dtm <- as.matrix(DocumentTermMatrix(textCorpus, control=list(wordLengths=c(1,Inf))))
  return(dtm)
}

# B) Function for generating good OR bad TF-IDF from a table of words; returns a df of sentiment words with TF, IDF and TF-IDF
wordsToSentiTfidfFn <- function(listOfWordsFreq, maxTF, sentiDict){
  listOfSentiWords <- intersect(listOfWordsFreq$words, sentiDict$words) # char vector of words that are good/bad
  listOfSentiWordsTFIDF <- subset(listOfWordsFreq, words %in% listOfSentiWords) # df of good/bad words with their raw frequencies
  listOfSentiWordsTFIDF$TF <- round(listOfSentiWordsTFIDF$freq/maxTF, 5) # raw frequency divided by the maximum term frequency in the review
  # Look up each word's IDF by name so the values stay aligned with the rows above,
  # regardless of the row order of the dictionary
  idfLookup <- match(listOfSentiWordsTFIDF$words, sentiDict$words)
  listOfSentiWordsTFIDF$IDF <- unlist(sentiDict$IDF)[idfLookup]
  listOfSentiWordsTFIDF$TFIDF <- listOfSentiWordsTFIDF$TF * listOfSentiWordsTFIDF$IDF # TF*IDF
  return(listOfSentiWordsTFIDF)
}

# C) Function for generating total TF-IDF from reviews. Calls functions A & B.
#    Returns a vector with sum of good words and bad words TFIDF (for each input review).
reviewToTfidfFn <- function(text){
  listOfWordsFreq <- reviewToWordsFn(text) # matrix of all words (as colname) with their frequencies
  maxTF <- max(listOfWordsFreq) # numeric, maximum frequency of above matrix
  listOfWords <- colnames(listOfWordsFreq) # char vector of all words
  listOfWordsFreq <- cbind(words=listOfWords, freq=listOfWordsFreq[1,]) # matrix of words & freq
  row.names(listOfWordsFreq) <- NULL
  rm(listOfWords)
  listOfWordsFreq <- as.data.frame(listOfWordsFreq, stringsAsFactors = FALSE) # df of words & freq
  listOfWordsFreq$freq <- as.numeric(listOfWordsFreq$freq)
 
  # call for good words TF-IDF list
  goodTFIDFList <- wordsToSentiTfidfFn(listOfWordsFreq, maxTF, goodDict)
  # call for bad words TF-IDF list
  badTFIDFList <- wordsToSentiTfidfFn(listOfWordsFreq, maxTF, badDict)
  
  goodTFIDF <- ifelse(dim(goodTFIDFList)[1]!=0, sum(goodTFIDFList$TFIDF), 0)
  badTFIDF <- ifelse(dim(badTFIDFList)[1]!=0, sum(badTFIDFList$TFIDF), 0)

  return(c(goodTFIDF, badTFIDF))
}

In Figure 4, the relation between the good and bad word TFIDF values and the average review scores suggests that the calculated TFIDF values are a good indication of the review content.  
![](./figures/TFIDF.png)
**Fig.4** Good and bad word TFIDF versus average review score.

## 3) Prediction Modeling

## References
1. [Predicting Box Office Success a Year in Advance from Ranker Data](http://blog.ranker.com/predicting-box-office-success-ranker-data/#.VotBL_krKUk)
2. "Predicting the Future with Social Media" S. Asur, B. A. Huberman; *2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology*, 2010, pp. 492-499
3. "Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data" M. Mestyan, T. Yasseri, J. Kertesz; *PLoS ONE 8(8): e71226*, 2013
4. [Box Office Mojo](http://www.boxofficemojo.com/daily/?view=bymovie&yr=all&sort=title&order=ASC&p=.htm)
5. [Dr. Julian McAuley's Amazon product data](http://jmcauley.ucsd.edu/data/amazon/)
6. "Image-based recommendations on styles and substitutes" J. McAuley, C. Targett, J. Shi, A. van den Hengel; *SIGIR*, 2015
7. "Inferring networks of substitutable and complementary products" J. McAuley, R. Pandey, J. Leskovec; *Knowledge Discovery and Data Mining*, 2015
8. [US Inflation Calculator](http://www.usinflationcalculator.com/inflation/historical-inflation-rates/)
9. [Hu and Liu's Opinion Lexicon](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon)
10. "Mining and Summarizing Customer Reviews" M. Hu, B. Liu; *Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004)*, Aug 22-25, 2004, Seattle, Washington, USA