In [1]:
using DataFrames, DataFramesMeta
using Plots
using Distributions
using StatsModels
gr()



Plots.GRBackend()

In [2]:
using ClobberingReload

In [3]:
using StatPlots

# Exploratory Analysis
Some things we need to do:
- set a 'good/not good' column
- empirical bayes for estimation of movie rating
- perform joins and see most successful actors, producers, writers, directors
- compare tastes of public vs tastes of critics

more?

## Loading the data, adding some calculated columns

In [4]:
basic_cols = [:id, :title, :metascore, :user_score, :release_date, :running_time, :rating, :company, :positive, :mixed, :negative]
review_cols = [:id, :score, :publication, :critic]
basics = readtable("../data/basics.csv", names = basic_cols,header=false);
reviews = readtable("../data/reviews.csv", names=review_cols, header=false);

In [5]:
string_to_float(str) = try parse(Float64, str) catch return(NA) end
basics[:user_score] = map(string_to_float, basics[:user_score]);

We now add a thumbs-up/thumbs-down column. We are mostly interested in the 'good' movies, and to preserve generalization we keep the notion of 'good' deliberately broad: the algorithm does not distinguish between excellent, very good, and good. We just want to recommend movies that you'll be happy watching.

In [6]:
function score_to_star(sc::Int64)
    return max(1, convert(Int64, round(sc/10)) )
end

score_to_star (generic function with 1 method)

In [7]:
cutoff = 70 # this could be changed
reviews_normalized = reviews[:score] .>= cutoff;
reviews[:stars] = score_to_star.(convert(Array{Int64, 1}, reviews[:score]));

In [8]:
reviews[:thumbsup] = reviews_normalized;
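
As a quick sanity check of the two derived columns, here is a minimal standalone sketch of the same mapping. It is written for current Julia (1.x) with no packages, the scores are made up, and `thumbs_up` is a hypothetical helper name that does not exist in the notebook.

```julia
# Map a 0-100 critic score to a 1-10 star bucket, as in the cell above.
score_to_star(sc::Integer) = max(1, round(Int, sc / 10))

# Thumbs up means the score reaches the cutoff (70 in this notebook).
thumbs_up(sc::Integer; cutoff::Integer = 70) = sc >= cutoff

scores = [0, 12, 65, 70, 75, 100]   # made-up scores for illustration
stars = score_to_star.(scores)
thumbs = thumbs_up.(scores)

for (s, st, t) in zip(scores, stars, thumbs)
    println(s, " -> ", st, " stars, thumbs up: ", t)
end
```

Note that Julia's `round` resolves exact ties to the nearest even integer, so a score of 65 lands in the 6-star bucket while 75 lands in the 8-star bucket.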

We also want to change the 'Staff (uncredited)' values to the name of the publication.

In [9]:
reviews[:critic_clean] = reviews[:critic]
for (index, critic) in enumerate(reviews[:critic_clean])
    if contains(critic, "Staff")
        reviews[:critic_clean][index] = reviews[:publication][index]
    end
end
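
Depending on the DataFrames version, `reviews[:critic_clean] = reviews[:critic]` may bind both column names to the same underlying array, in which case the loop above also rewrites `:critic`. A minimal sketch of a non-mutating alternative, assuming current Julia (1.x, where `occursin` does the substring test) and using made-up rows:

```julia
# Made-up rows; only the roles of the two columns mirror the reviews table.
critic = ["Staff (uncredited)", "A. Critic", "Staff (Not Credited)"]
publication = ["Paper One", "Paper Two", "Paper Three"]

# Build a new vector instead of mutating `critic` in place.
critic_clean = [occursin("Staff", c) ? p : c for (c, p) in zip(critic, publication)]

println(critic_clean)
```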

Some quick and dirty empirical Bayes to produce better per-movie rating averages. We suppose the ratings follow a Dirichlet-Categorical distribution, and estimate the prior using the entire dataset.

In [10]:
include("../src/EmpiricalBayes.jl")
using EmpiricalBayes

EmpiricalBayes

In [12]:
# if we made some iterations on the model
# creload("../src/EmpiricalBayes.jl")

INFO: Reloading ../src/EmpiricalBayes.jl

"../src/EmpiricalBayes.jl"

In [56]:
categories = reviews[reviews[:id] .== "avatar", :][:stars]

35-element DataArrays.DataArray{Int64,1}:
 10
 10
 10
 10
 10
 10
 10
 10
 10
 10
  9
  9
  9
  ⋮
  8
  8
  8
  8
  8
  8
  7
  6
  6
  6
  5
  5

In [57]:
fit(Categorical, 10, categories )

Distributions.Categorical{Float64}(
K: 10
p: [0.0,0.0,0.0,0.0,0.0571429,0.0857143,0.0285714,0.257143,0.285714,0.285714]
)


In [32]:
model = fit_bayes(reviews, :id, :stars);
aggr = EmpiricalBayes.return_posterior(reviews, :id, :stars, model);
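
The internals of `fit_bayes` and `return_posterior` are not shown here, so the following is only a sketch of the general Dirichlet-Categorical shrinkage idea, not the module's actual code. The star counts are the ones implied by the Categorical fit for "avatar" above (`p` times 35 reviews); the prior `alpha` is a made-up placeholder, whereas the real prior would be estimated from the whole dataset. It runs on current Julia (1.x) with no packages.

```julia
# Star counts for "avatar", read off the Categorical fit above:
# stars 5 to 10 occur 2, 3, 1, 9, 10 and 10 times (35 reviews in total).
counts = [0, 0, 0, 0, 2, 3, 1, 9, 10, 10]

# Hypothetical prior pseudo-counts: one pseudo-review per star level.
alpha = ones(10)

# A Dirichlet prior combined with Categorical data gives a
# Dirichlet(alpha + counts) posterior.
posterior = alpha .+ counts

# Expected star rating under the raw and the posterior mean probabilities.
stars = 1:10
raw_average = sum(stars .* counts) / sum(counts)
adjusted_average = sum(stars .* posterior) / sum(posterior)

println("raw: ", raw_average, "  adjusted: ", adjusted_average)
```

With few reviews the pseudo-counts dominate and the average is pulled toward the prior; with many reviews the data dominate.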

Let's add the average rating to the basics dataframe.

In [33]:
basics = join(basics, aggr[[:id, :adjusted_average]], on = :id, kind = :left);

Without further ado:
### The best movies according to Metacritic critics:


In [16]:
basics = basics[isna(basics[:adjusted_average]) .== false,:]
sort!(basics, cols=[:adjusted_average], rev=true)
basics[1:20, [:id, :adjusted_average]]

| | id | adjusted_average |
|---|---|---|
| 1 | boyhood | 9.4616654052544 |
| 2 | moonlight-2016 | 9.38102739004958 |
| 3 | pans-labyrinth | 9.195551965715314 |
| 4 | manchester-by-the-sea | 9.164145106291093 |
| 5 | the-social-network | 9.13127852510068 |
| 6 | gravity | 9.114046997566124 |
| 7 | army-of-shadows | 9.1087936475983 |
| 8 | carol | 9.06377206601342 |
| 9 | 4-months-3-weeks-and-2-days | 9.050377400786404 |
| 10 | ratatouille | 9.02618163996492 |


Obviously this shows some biases. Older movies look under-represented; this might be because some older movies have fewer critic reviews. Let's check this later.

First, let's look at our score vs. the user score:

In [17]:
basics_2 = basics[isna(basics[:user_score]) .== false, :];

In [34]:
# "new" means released after 2005; the year is the last four characters of the date string
date_to_oldnew(date) = parse(Int, date[end-3:end]) > 2005
basics_2[:isnew] = map(date_to_oldnew, basics_2[:release_date]);



In [46]:
plot(basics_2, :adjusted_average, :adjusted_average, linewidth=2,
    linecolor = "gray", lab="x = y")
scatter!(basics_2, :adjusted_average, :user_score, group=:isnew,
    markersize = 2,
    markerstrokewidth = 0,
    markeralpha = .5, lab=["old", "new"])

n = size(basics_2[:adjusted_average])[1]
bhat = [Array(basics_2[:adjusted_average]) ones(n)]\Array(basics_2[:user_score])

Plots.abline!(bhat..., linewidth = 1, linecolor = "navy", lab="regression")

We can clearly see some cool, expected features: users rate movies panned by critics *way higher* than the critics do, and users tend to be a little more dissatisfied with movies that received high critical acclaim.
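
The regression line in the plot comes from a least-squares solve with `\`. Here is a minimal standalone sketch of that step on made-up scores, for current Julia (1.x). The `crossover` value is an extra illustration that the notebook does not compute: it is the score at which the fitted line meets `x = y`.

```julia
using LinearAlgebra

# Made-up critic and user scores for illustration.
critic_score = [2.0, 4.0, 6.0, 8.0, 10.0]
user_score = [4.0, 5.0, 6.0, 7.0, 8.0]   # exactly 0.5 * critic_score + 3

# Design matrix: one column for the slope, one column of ones for the intercept.
X = [critic_score ones(length(critic_score))]

# For a tall matrix, `\` returns the least-squares solution.
slope, intercept = X \ user_score

# Score at which the fitted line crosses x = y.
crossover = intercept / (1 - slope)

println("slope = ", slope, ", intercept = ", intercept, ", crossover = ", crossover)
```

A slope below 1 with a positive intercept is the pattern described above: below the crossover users score higher than critics, above it they score lower.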

In [20]:
a = marginalhist(basics_2, :adjusted_average, :user_score, c=:matter)

## Who are the best directors?
Obviously we are going to need some shrinkage to prevent Pixar directors from taking everything.

In [21]:
director_cols = [:id, :director]
directors = readtable("../data/director.csv", names = director_cols, header=false);

In [22]:
director_basics = join(directors, basics_2, on=:id, kind=:left);

In [23]:
director_scores = by(director_basics, :director, df->DataFrame(score = mean(df[:adjusted_average]), 
        user_score = mean(df[:user_score])))
director_scores = director_scores[isna(director_scores[:score]) .== false, :]
director_scores = director_scores[isna(director_scores[:user_score]) .== false, :]
sort!(director_scores, cols=[:score], rev=true)[1:10, :]

| | director | score | user_score |
|---|---|---|---|
| 1 | Cristian Mungiu | 9.050377400786404 | 7.9 |
| 2 | Jan Pinkava | 9.02618163996492 | 8.6 |
| 3 | Ronaldo Del Carmen | 8.944239453886418 | 8.7 |
| 4 | Maren Ade | 8.828465136872255 | 7.0 |
| 5 | Andrey Zvyagintsev | 8.767335551723594 | 7.3 |
| 6 | Damien Chazelle | 8.734170181101062 | 8.65 |
| 7 | Jules Dassin | 8.71621170738931 | 8.3 |
| 8 | Florian Henckel von Donnersmarck | 8.70186804730541 | 8.9 |
| 9 | Andrew Jarecki | 8.678789111314432 | 8.0 |
| 10 | Fritz Lang | 8.622580857412743 | 8.3 |


Obviously this is again a problem of too few movies per director. Since we're doing this for the second time, let's use the factored-out code from the first time.

In [24]:
director_reviews = join(director_basics, reviews, on=:id, kind=:inner);

In [25]:
model_2 = fit_bayes(director_reviews, :director, :stars);

In [26]:
director_reviews_adj = EmpiricalBayes.return_posterior(director_reviews, :director, :stars, model_2);

In [27]:
sort!(director_reviews_adj, cols=[:adjusted_average], rev=true)
director_reviews_adj[1:20, [:director]]

| | director |
|---|---|
| 1 | Cristian Mungiu |
| 2 | Jan Pinkava |
| 3 | Ronaldo Del Carmen |
| 4 | Damien Chazelle |
| 5 | Barry Jenkins |
| 6 | Jean-Pierre Melville |
| 7 | Maren Ade |
| 8 | Andrey Zvyagintsev |
| 9 | Florian Henckel von Donnersmarck |
| 10 | Lee Unkrich |


Still not good. This is because each data point is a review, so a director with a single heavily reviewed movie still looks well supported. Let's use a movie as the data point instead. We need to do the empirical Bayes fun stuff on movies as well.

Let's reprocess the data so as to have one line per movie.

In [50]:
director_basics

| | id | director | title | metascore | user_score | release_date | running_time | rating | company | positive | mixed | negative | adjusted_average | isnew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10-cent-pistol | Michael C. Martin | 10 Cent Pistol | 37 | 6.5 | July 24, 2015 | 91 min | ['Rated R for violence', 'language throughout', 'some sexual references and drug use (rating surrendered)'] | Route 17 Entertainment | 2 | 1 | 1 | 4.935719426752363 | true |
| 2 | 10-cloverfield-lane | Dan Trachtenberg | 10 Cloverfield Lane | 76 | 7.7 | March 11, 2016 | 104 min | ['Rated PG-13 for thematic material including frightening sequences of threat with some violence', 'and brief language'] | Paramount Pictures | 632 | 89 | 50 | 7.543871241462177 | true |
| 3 | 10-items-or-less | Brad Silberling | 10 Items or Less | 54 | 5.8 | December 1, 2006 | 82 min | Rated R for language. | Revelations Entertainment | 13 | 5 | 6 | 5.879600631192561 | true |
| 4 | 10-things-i-hate-about-you | Gil Junger | 10 Things I Hate About You | 70 | 6.9 | March 31, 1999 | 97 min | ['Rated PG-13 for crude sex-related humor and dialogue', 'alcohol and drug-related scenes', 'all involving teens.'] | Touchstone Pictures | 142 | 35 | 20 | 6.925524233025344 | false |
| 5 | 10-years | Jamie Linden | 10 Years | 61 | 6.5 | September 14, 2012 | 100 min | ['Rated R for sexual content', 'and language throughout.'] | Temple Hill Entertainment | 10 | 2 | 3 | 6.227086358485928 | true |
| 6 | 1000-times-good-night | Erik Poppe | 1,000 Times Good Night | 57 | 6.0 | October 24, 2014 | 117 min | Not Rated | Film i Väst | 3 | 0 | 1 | 6.0406667381434005 | true |
| 7 | 10000-bc | Roland Emmerich | 10,000 BC | 34 | 4.6 | March 7, 2008 | 109 min | Rated PG-13 for sequences of intense action and violence. | Warner Bros. Pictures | 89 | 85 | 120 | 4.141911434152098 | true |
| 8 | 10000-km | Carlos Marques-Marcet | 10,000 km | 75 | 7.3 | July 10, 2015 | 99 min | ['Rated R for some strong sexual content including dialogue', 'language and brief graphic nudity'] | Televisión Española (TVE) | 6 | 2 | 0 | 7.106627727225048 | true |
| 9 | 101-dalmatians | Stephen Herek | 101 Dalmatians | 49 | 5.8 | November 27, 1996 | 103 min | G | Walt Disney Pictures | 28 | 64 | 11 | 5.715191533377097 | false |
| 10 | 101-reykjavik | Baltasar Kormákur | Reykjavík | 68 | 7.7 | July 25, 2001 | 88 min | Not Rated | Zentropa Entertainments | 13 | 0 | 3 | 6.676204798536832 | false |
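
With one line per movie, ranking directors by movie could be sketched as follows. This is a simplified stand-in (shrinking each director's mean toward the global mean), not the Dirichlet-Categorical model from `EmpiricalBayes.jl`. The director names, scores, and `prior_weight` are all made up for illustration, and the code targets current Julia (1.x) with only the standard library.

```julia
using Statistics

# Hypothetical (director, movie score) rows, one row per movie.
movies = [("Director A", 9.1),
          ("Director B", 8.4), ("Director B", 8.6), ("Director B", 8.8), ("Director B", 8.6),
          ("Director C", 5.0), ("Director C", 6.0)]

global_mean = mean(score for (_, score) in movies)
prior_weight = 3.0   # assumed strength of the prior, in pseudo-movies

# Group scores by director.
by_director = Dict{String, Vector{Float64}}()
for (director, score) in movies
    push!(get!(by_director, director, Float64[]), score)
end

# Shrink each director's mean toward the global mean;
# fewer movies means more shrinkage.
shrunk = Dict(d => (sum(s) + prior_weight * global_mean) / (length(s) + prior_weight)
              for (d, s) in by_director)

ranking = sort(collect(shrunk), by = last, rev = true)
foreach(println, ranking)
```

In this toy data the director with a single excellent movie no longer outranks the one with four very good movies.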


## Non-negative Matrix Factorization for similarity between user and critic
The goal here is to:
- give a certain number of movies to the user to rate
- identify the critic the user is the closest to
- recommend movies to the user based on that critic's preferences.
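
The three steps above can be sketched without any factorization at all, using plain cosine similarity over the movies that both the user and the critic have scored. This is a deliberately simpler stand-in for the NMF approach, with made-up scores, hypothetical critic names, and an arbitrary "liked" threshold of 70. It targets current Julia (1.x).

```julia
using LinearAlgebra

# Made-up scores: rows are critics, columns are films, 0 means "not reviewed".
critic_scores = [90 80 20  0 10;
                 20 30  0 85 90;
                 60  0 55 50 65]
critic_names = ["Critic A", "Critic B", "Critic C"]   # hypothetical names

# The user rates a handful of films on the same scale (0 = not rated).
user_scores = [85, 0, 30, 10, 0]

# Cosine similarity restricted to films both the critic and the user rated.
function similarity(critic, user)
    both = (critic .> 0) .& (user .> 0)
    any(both) || return 0.0
    c, u = critic[both], user[both]
    return dot(c, u) / (norm(c) * norm(u))
end

sims = [similarity(critic_scores[i, :], user_scores) for i in 1:size(critic_scores, 1)]
best = argmax(sims)

# Recommend films the closest critic liked (score >= 70) that the user has not rated.
recommended = findall((critic_scores[best, :] .>= 70) .& (user_scores .== 0))

println("closest critic: ", critic_names[best], ", recommended film columns: ", recommended)
```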

This is better (I think) than traditional collaborative filtering, because a typical user won't have seen as many movies as the critics. It could therefore allow recommending more obscure movies, and also suggest that the user read that critic's paper in the future.

We could imagine having a hierarchical improvement, where for each genre, we would select a critic. But maybe later.
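
The factorization itself is not implemented in this notebook. As a rough sketch of what it could look like, here is a masked NMF with multiplicative updates that only fits the observed (critic, film) entries. The function names `masked_nmf` and `masked_error`, the rank, the iteration count, and the score matrix are all assumptions for illustration. It runs on current Julia (1.x) with only the standard library.

```julia
using LinearAlgebra, Random

# Masked NMF with multiplicative updates: X ≈ W * H on observed entries only.
# M[i, j] is 1.0 where critic i scored film j, and 0.0 otherwise.
function masked_nmf(X, M, rank; iters = 200, rng = MersenneTwister(1))
    W = rand(rng, size(X, 1), rank) .+ 0.1
    H = rand(rng, rank, size(X, 2)) .+ 0.1
    tiny = 1e-9   # avoids division by zero
    for _ in 1:iters
        W .*= ((M .* X) * H') ./ ((M .* (W * H)) * H' .+ tiny)
        H .*= (W' * (M .* X)) ./ (W' * (M .* (W * H)) .+ tiny)
    end
    return W, H
end

# Squared reconstruction error over observed entries only.
masked_error(X, M, W, H) = sum(abs2, M .* (X .- W * H))

# Made-up 4 critics x 5 films score matrix (0 = not reviewed).
X = [9.0 8.0 0.0 2.0 1.0;
     8.0 0.0 9.0 1.0 2.0;
     2.0 1.0 2.0 0.0 9.0;
     1.0 2.0 0.0 9.0 8.0]
M = Float64.(X .> 0)

W1, H1 = masked_nmf(X, M, 2; iters = 1)     # one update, for comparison
W, H = masked_nmf(X, M, 2; iters = 200)

println("error after 1 iteration:    ", masked_error(X, M, W1, H1))
println("error after 200 iterations: ", masked_error(X, M, W, H))
```

Each row of `W` is then a low-dimensional taste profile for a critic; a user's ratings could be projected onto `H` in the same way and compared against those rows.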

In [28]:
film_ids = convert(Array, unique(reviews[:id]))
critic_ids = convert(Array, unique(reviews[:critic]));
film_dict = Dict(collect(zip(film_ids, 1:length(film_ids))))
critic_dict = Dict(collect(zip(critic_ids, 1:length(critic_ids))));
film_is = [film_dict[film] for film in reviews[:id]]
critic_is = [critic_dict[critic] for critic in reviews[:critic]];

In [29]:
critic_x_film = sparse(critic_is, film_is, reviews[:score])

2707×8297 sparse matrix with 175327 Int64 nonzero entries:
	[1   ,    1]  =  70
	[2   ,    1]  =  65
	[3   ,    1]  =  60
	[4   ,    1]  =  50
	[5   ,    1]  =  40
	[6   ,    1]  =  30
	[7   ,    1]  =  12
	[8   ,    2]  =  91
	[9   ,    2]  =  91
	[10  ,    2]  =  88
	⋮
	[2074, 8296]  =  88
	[2707, 8296]  =  75
	[21  , 8297]  =  80
	[62  , 8297]  =  25
	[66  , 8297]  =  50
	[67  , 8297]  =  75
	[71  , 8297]  =  42
	[91  , 8297]  =  50
	[108 , 8297]  =  30
	[162 , 8297]  =  70
	[225 , 8297]  =  60

In [30]:
means = Array{Float64}(size(critic_x_film, 2)) 
for i in 1:size(critic_x_film, 2)
   means[i] = mean(nonzeros(critic_x_film[:, i]))
end
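
For reference, here is a minimal standalone sketch of the same index mapping, sparse matrix construction, and per-film mean on made-up reviews. It assumes current Julia (1.x), where `sparse` lives in the `SparseArrays` standard library and `mean` in `Statistics`.

```julia
using SparseArrays, Statistics

# Made-up reviews: one (critic, film, score) triple per position.
critics = ["Critic A", "Critic B", "Critic A", "Critic C"]
films = ["film-1", "film-1", "film-2", "film-2"]
scores = [70, 65, 91, 88]

# Map each distinct name to a row or column index.
critic_index = Dict(name => i for (i, name) in enumerate(unique(critics)))
film_index = Dict(name => j for (j, name) in enumerate(unique(films)))

rows = [critic_index[c] for c in critics]
cols = [film_index[f] for f in films]

# Note: `sparse` adds up the values of repeated (row, col) pairs by default.
critic_x_film = sparse(rows, cols, scores)

# Mean score per film over the stored entries of each column.
film_means = [mean(nonzeros(critic_x_film[:, j])) for j in 1:size(critic_x_film, 2)]

println(film_means)
```

Because repeated (row, col) pairs are summed, it is worth checking that no critic appears twice for the same film before building the matrix.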

In [31]:
# `init` must be a Vector (comma-separated), not a 1×2 row matrix
fit_mle(Dirichlet, [.1 .1; .9 .9], init = [.1, .1])