In [1]:
using DataFrames, DataFramesMeta
using Plots
using Distributions
using StatsModels
using StatPlots
gr()



Plots.GRBackend()

In [2]:
using ClobberingReload
using EmpiricalBayes

# Exploratory Analysis
Some things we need to do:
- add a 'good/not good' column
- use empirical Bayes to estimate each movie's rating
- perform joins and see the most successful actors, producers, writers, directors
- compare the tastes of the public with the tastes of critics

More?

## Loading the data, adding some calculated columns

In [4]:
basic_cols = [:id, :title, :metascore, :user_score, :release_date, :running_time, :rating, :company, :positive, :mixed, :negative]
review_cols = [:id, :score, :publication, :critic]
basics = readtable("../data/basics.csv", names=basic_cols, header=false);
reviews = readtable("../data/reviews.csv", names=review_cols, header=false);

In [5]:
string_to_float(str) = try parse(Float64, str) catch return(NA) end
basics[:user_score] = map(string_to_float, basics[:user_score]);

We now add a thumbs-up/thumbs-down column. We are mostly interested in the 'good' movies, and to generalize well we keep the notion of 'good' deliberately broad: the algorithm does not distinguish between excellent, very good, and good. We just want to recommend movies that you'll be happy watching.

In [6]:
function score_to_star(sc::Int64)
    return max(1, convert(Int64, round(sc/10)) )
end

score_to_star (generic function with 1 method)
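
As a quick illustration of what this mapping does, here is a minimal sketch on a few hypothetical scores. It relies on the fact that Julia's `round` rounds ties to the nearest even integer by default, so 75 and 85 both end up at 8 stars. The variant `score_to_star_away` is not part of the notebook; it is only an assumption about what one might prefer if ties should round up.

```julia
# Same mapping as score_to_star above, written with round(Int, x).
score_to_star(sc::Integer) = max(1, round(Int, sc / 10))

# Hypothetical variant (not in the notebook): ties round away from zero.
score_to_star_away(sc::Integer) = max(1, round(Int, sc / 10, RoundNearestTiesAway))

scores = [0, 4, 65, 70, 75, 85, 95, 100]   # made-up critic scores

stars      = score_to_star.(scores)        # default rounding: ties go to the even integer
stars_away = score_to_star_away.(scores)   # ties go up
thumbsup   = scores .>= 70                 # same cutoff as the next cell

for (s, a, b, t) in zip(scores, stars, stars_away, thumbsup)
    println("score=", s, "  stars=", a, "  stars (ties away)=", b, "  thumbsup=", t)
end
```

Whether the tie behavior matters depends on how many scores end in 5; it is worth checking before relying on the star histogram.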

In [7]:
cutoff = 70 # this could be changed
reviews_normalized = reviews[:score] .>= cutoff;
reviews[:stars] = score_to_star.(convert(Array{Int64, 1}, reviews[:score]));

In [8]:
reviews[:thumbsup] = reviews_normalized;

We also want to change the 'Staff (uncredited)' values to the name of the publication.

In [9]:
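# copy, so that editing :critic_clean does not also overwrite :critic
reviews[:critic_clean] = copy(reviews[:critic])
for (index, critic) in enumerate(reviews[:critic_clean])
    if contains(critic, "Staff")
        # uncredited staff reviews are attributed to the publication
        reviews[:critic_clean][index] = reviews[:publication][index]
    end
end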

Some quick-and-dirty empirical Bayes to produce better rating averages per movie. We suppose each movie's star ratings follow a Dirichlet-Categorical model, and estimate the Dirichlet prior from the entire dataset.
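
To make the shrinkage concrete, here is a minimal sketch of the textbook Dirichlet-Categorical update: the posterior parameters are the prior pseudo-counts plus the observed star counts. This is an assumption about what the `EmpiricalBayes` package does internally, not a description of its code. The `alpha` values are copied from the output printed by the next cell, and the counts are recovered by multiplying the printed proportions for `robocop-2013` (41 reviews) and `now-forever` (6 reviews) by their review counts.

```julia
# Prior pseudo-counts for the 10 star levels (copied from the printed Dirichlet alpha).
alpha = [0.230368, 0.301161, 0.287171, 0.444957, 0.654671,
         0.648783, 0.515382, 0.674257, 0.321798, 0.251007]

# Conjugate update (assumed): posterior alpha = prior alpha + observed counts.
posterior_alpha(alpha, counts) = alpha .+ counts

# Expected star rating under a vector of (pseudo-)counts over stars 1..K.
expected_stars(a) = sum((1:length(a)) .* a) / sum(a)

counts_many = [0, 4, 0, 7, 9, 8, 5, 7, 0, 1]   # robocop-2013, 41 reviews
counts_few  = [0, 2, 1, 2, 1, 0, 0, 0, 0, 0]   # now-forever, 6 reviews

prior_mean  = expected_stars(alpha)
raw_many    = expected_stars(counts_many)
raw_few     = expected_stars(counts_few)
shrunk_many = expected_stars(posterior_alpha(alpha, counts_many))
shrunk_few  = expected_stars(posterior_alpha(alpha, counts_few))

println("prior mean: ", prior_mean)
println("41 reviews: raw ", raw_many, " -> shrunk ", shrunk_many)
println(" 6 reviews: raw ", raw_few, " -> shrunk ", shrunk_few)
```

The movie with 6 reviews moves much further toward the prior mean than the one with 41, which is the point of the adjustment.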

In [12]:
dc = dirichlet_from_df(reviews, :id, :stars)

EmpiricalBayes.DirichletCategorical{String}(Dict("robocop-2013"=>41,"now-forever"=>6,"in-the-line-of-fire"=>16,"bless-the-child"=>27,"war-witch"=>16,"a-life-less-ordinary"=>22,"american-pie-2"=>28,"bright-future"=>11,"duma"=>21,"the-devil-wears-prada"=>40…),Distributions.Dirichlet{Float64}(
alpha: [0.230368,0.301161,0.287171,0.444957,0.654671,0.648783,0.515382,0.674257,0.321798,0.251007]
)
,Dict("robocop-2013"=>Distributions.Categorical{Float64}(
K: 10
p: [0.0,0.097561,0.0,0.170732,0.219512,0.195122,0.121951,0.170732,0.0,0.0243902]
)
,"now-forever"=>Distributions.Categorical{Float64}(
K: 10
p: [0.0,0.333333,0.166667,0.333333,0.166667,0.0,0.0,0.0,0.0,0.0]
)
,"in-the-line-of-fire"=>Distributions.Categorical{Float64}(
K: 10
p: [0.0,0.0,0.0,0.0625,0.0,0.25,0.0,0.5,0.125,0.0625]
)
,"bless-the-child"=>Distributions.Categorical{Float64}(
K: 10
p: [0.407407,0.185185,0.037037,0.148148,0.148148,0.037037,0.0,0.037037,0.0,0.0]
)
,"war-witch"=>Distributions.Categorical{Float64}(
K: 10
p: [0.0,0.0,0

In [16]:
adjusted_rating(name) = try (mean(dc.posteriors[name])' * (1:10))[1] catch return(NA) end



adjusted_rating (generic function with 1 method)

Let's add the average rating to the basics dataframe.

In [19]:
basics[:adjusted_average] = map(adjusted_rating, basics[:id])

8431-element DataArrays.DataArray{Any,1}:
 [5.03533]
 [6.93488]
 [4.78915]
 [8.06292]
 [5.74368]
 [7.71266]
 [5.81323]
 [6.08175]
 [6.61672]
 [6.04067]
 [4.93572]
 [7.54387]
 [5.8796] 
 ⋮        
 [7.85138]
 [5.23237]
 [6.3593] 
 [5.07244]
 [7.36064]
 [6.61944]
 [6.36997]
 [4.18843]
 [3.60336]
 [5.68131]
 [7.77628]
 [5.48016]

In [22]:
basics[:adjusted_average] = map(x -> try x[1] catch return NA end, basics[:adjusted_average])

8431-element DataArrays.DataArray{Any,1}:
 5.03533
 6.93488
 4.78915
 8.06292
 5.74368
 7.71266
 5.81323
 6.08175
 6.61672
 6.04067
 4.93572
 7.54387
 5.8796 
 ⋮      
 7.85138
 5.23237
 6.3593 
 5.07244
 7.36064
 6.61944
 6.36997
 4.18843
 3.60336
 5.68131
 7.77628
 5.48016

Without further ado:
### The best movies according to Metacritic critics:


In [23]:
basics = basics[isna(basics[:adjusted_average]) .== false,:]
sort!(basics, cols=[:adjusted_average], rev=true)
basics[1:20, [:id, :adjusted_average]]

Unnamed: 0,id,adjusted_average
1,boyhood,9.461665405254395
2,moonlight-2016,9.38102739004958
3,pans-labyrinth,9.195551965715314
4,manchester-by-the-sea,9.16414510629109
5,the-social-network,9.131278525100678
6,gravity,9.114046997566124
7,army-of-shadows,9.1087936475983
8,carol,9.06377206601342
9,4-months-3-weeks-and-2-days,9.050377400786404
10,ratatouille,9.02618163996492


Obviously this shows some biases. Older movies look under-represented; this might be because older movies have fewer critic reviews, so their averages are pulled more strongly toward the prior. Let's check this later.

First let's look at our score VS the user score:

In [24]:
basics_2 = basics[isna(basics[:user_score]) .== false, :];

In [25]:
date_to_oldnew(date) = parse(Int, date[end-3:end]) > 2005  # the release year is the last four characters
basics_2[:isnew] = map(date_to_oldnew, basics_2[:release_date]);

In [26]:
plot(basics_2, :adjusted_average, :adjusted_average, linewidth=2,
    linecolor = "gray", lab="x = y")
scatter!(basics_2, :adjusted_average, :user_score, group=:isnew,
    markersize = 2,
    markerstrokewidth = 0,
markeralpha = .5, lab=["old", "new"])

n = size(basics_2[:adjusted_average])[1]
bhat = [Array(basics_2[:adjusted_average]) ones(n)]\Array(basics_2[:user_score])

Plots.abline!(bhat..., linewidth = 1, linecolor = "navy", lab="regression")

LoadError: MethodError: no method matching zero(::Type{Any})
Closest candidates are:
  zero(::Type{Base.LibGit2.Oid}) at libgit2/oid.jl:88
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:80
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:120
  ...
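
The `MethodError` most likely comes from the `\` solve: the `adjusted_average` column has element type `Any` (see the `DataArray{Any,1}` outputs earlier), and the least-squares solver needs `zero` of the element type. Below is a minimal sketch of the same regression on concretely typed data. The scores are made up for illustration, and the fix shown (converting to `Float64` first) is an assumption about what would resolve the cell above.

```julia
using LinearAlgebra

# Hypothetical critic-side and user-side scores, stored with element type Any
# to mimic the column that triggers the error.
critic_any = Any[5.0, 6.0, 7.0, 8.0, 9.0]
user_any   = Any[6.1, 6.4, 7.0, 7.3, 7.9]

# Convert to concrete Float64 vectors before doing linear algebra.
x = Float64.(critic_any)
y = Float64.(user_any)

# Design matrix [x 1]; the backslash operator returns the least-squares solution.
X = [x ones(length(x))]
slope, intercept = X \ y

println("user_score ≈ ", slope, " * adjusted_average + ", intercept)
```

A slope below 1 with a positive intercept is the pattern one would expect if users rate critically panned movies higher and acclaimed movies slightly lower than critics do.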

We can clearly see some cool, expected features: users tend to rate movies panned by critics *way* higher than the critics do, and users tend to be a little less satisfied with movies that have high critical acclaim.

In [27]:
a = marginalhist(basics_2, :adjusted_average, :user_score, c=:matter)

INFO: binning = auto

INFO: binning = auto

## Who are the best directors?
We're also going to shrink the ratings here, to prevent directors with a single movie from taking it all. We assume that a director's movie ratings are normally distributed with a known variance. NB: we estimate that variance using the entire dataset's variance, and I'm not sure how much sense this makes.
Anyway, the conjugate prior is normal as well, so we can estimate the posterior mean using an empirical prior.
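
For reference, here is a minimal sketch of the textbook normal-normal update with known observation variance. It is not necessarily what `EmpiricalBayes.normal_from_df` computes. The function name `normal_posterior` and all numbers below (prior mean, prior variance, observation variance, ratings) are hypothetical.

```julia
# Posterior of a group mean given observations xs, a Normal(mu0, tau2) prior on
# the mean, and a known observation variance sigma2.
function normal_posterior(xs, mu0, tau2, sigma2)
    n = length(xs)
    precision = 1 / tau2 + n / sigma2                       # precisions add
    mu_post = (mu0 / tau2 + sum(xs) / sigma2) / precision   # precision-weighted mean
    return mu_post, 1 / precision                           # posterior mean and variance
end

mu0, tau2 = 5.9, 1.2^2     # hypothetical empirical prior over director means
sigma2 = 1.0               # hypothetical known variance of a single movie's rating

one_movie = [9.0]          # a director with a single, very well rated movie
veteran   = fill(8.0, 8)   # a director with eight movies rated 8.0

mu_one, var_one = normal_posterior(one_movie, mu0, tau2, sigma2)
mu_vet, var_vet = normal_posterior(veteran, mu0, tau2, sigma2)

println("one movie at 9.0    -> posterior mean ", mu_one)
println("eight movies at 8.0 -> posterior mean ", mu_vet)
```

With these numbers the veteran ends up ranked above the one-movie director even though the raw average is lower, which is the behavior the shrinkage is meant to produce.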

In [28]:
director_cols = [:id, :director]
directors = readtable("../data/director.csv", names = director_cols, header=false);

In [51]:
nn = EmpiricalBayes.normal_from_df(director_basics, :director, :adjusted_average)

EmpiricalBayes.NormalNormal{String}(Dict("Rama Burshtein"=>1,"Zhang Ke Jia"=>6,"Shunji Iwai"=>2,"Matthew Kaufman"=>1,"David L. Cunningham"=>1,"Wally Pfister"=>1,"Kevin Tancharoen"=>1,"Lucrecia Martel"=>3,"James Moll"=>1,"Grant Heslov"=>1…),Distributions.Normal{Float64}(μ=5.919676106597595, σ=1.205234392730886),Dict("Rama Burshtein"=>Distributions.Normal{Float64}(μ=7.46621, σ=1.30531),"Zhang Ke Jia"=>Distributions.Normal{Float64}(μ=7.20598, σ=1.30531),"Shunji Iwai"=>Distributions.Normal{Float64}(μ=6.37063, σ=1.30531),"Matthew Kaufman"=>Distributions.Normal{Float64}(μ=6.15535, σ=1.30531),"David L. Cunningham"=>Distributions.Normal{Float64}(μ=4.73945, σ=1.30531),"Wally Pfister"=>Distributions.Normal{Float64}(μ=4.72431, σ=1.30531),"Kevin Tancharoen"=>Distributions.Normal{Float64}(μ=4.65052, σ=1.30531),"Lucrecia Martel"=>Distributions.Normal{Float64}(μ=7.34511, σ=1.30531),"James Moll"=>Distributions.Normal{Float64}(μ=4.41748, σ=1.30531),"Grant Heslov"=>Distributions.Normal{Float64}(μ=6.0554

In [52]:
adjusted_director_rating(name) = try nn.posteriors[name].μ catch return(NA) end

adjusted_director_rating (generic function with 1 method)

In [80]:
nn.posteriors["Steve McQueen"]

Distributions.Normal{Float64}(μ=6.920311468839487, σ=0.7327932609549568)

In [56]:
director_scores[:adjusted_director_rating] = map(adjusted_director_rating, director_scores[:director])

3713-element DataArrays.DataArray{Any,1}:
 7.36043
 7.34929
 7.31158
 7.2583 
 7.23017
 7.69372
 7.20664
 7.20004
 7.18942
 7.16355
 7.15756
 7.14749
 7.13773
 ⋮      
 4.44949
 4.40437
 4.39183
 4.36517
 4.36517
 4.36318
 4.36318
 4.33417
 4.33334
 4.32646
 4.21053
 4.21053

In [62]:
sort!(director_scores, cols=[:adjusted_director_rating], rev=true)[1:20, :]

Unnamed: 0,director,score,user_score,adjusted_director_rating
1,Lee Unkrich,8.43151357052462,8.775,7.861956555327411
2,Asghar Farhadi,8.395832474261493,8.3,7.834366117388497
3,Hayao Miyazaki,8.171001244644643,8.5,7.743212232538106
4,Damien Chazelle,8.734170181101062,8.65,7.69372251675656
5,Alexander Payne,7.917080895091959,7.171428571428572,7.63041787207395
6,John Lasseter,8.113178433566734,8.225000000000001,7.615803640763272
7,Spike Jonze,8.101017037330358,8.025,7.606399830812034
8,Paul Thomas Anderson,7.884742391083265,6.457142857142856,7.602720517106457
9,Joshua Oppenheimer,8.501108662429035,8.3,7.546817979173104
10,Nick Park,8.48497718799668,8.2,7.53664990507178


In [79]:
director_scores[director_scores[:director] .== "Steve McQueen", :]

Unnamed: 0,director,score,user_score,adjusted_director_rating


In [81]:
director_basics[director_basics[:id] .== "", :][:director]

0-element DataArrays.DataArray{String,1}

In [93]:
df = DataFrame(director = collect(keys(nn.posteriors)), score = collect(Base.values(nn.posteriors)));



In [95]:
df[:score] = map(x -> x.μ, df[:score])

4113-element DataArrays.DataArray{Any,1}:
 6.63139
 6.99564
 6.20392
 6.02813
 5.37653
 5.36957
 5.33561
 6.94444
 5.22836
 5.98217
 4.33417
 6.10811
 6.51228
 ⋮      
 6.44324
 5.80481
 5.69018
 4.81241
 5.22026
 6.68662
 6.58931
 4.95166
 5.64242
 5.46331
 6.61922
 5.71349

In [98]:
sort!(df, cols=[:score], rev=true)[1:20, :]

Unnamed: 0,director,score
1,Mike Leigh,7.891083239713566
2,Lee Unkrich,7.861956555327411
3,Asghar Farhadi,7.834366117388497
4,Hayao Miyazaki,7.743212232538106
5,Damien Chazelle,7.69372251675656
6,Jean-Pierre Dardenne,7.662738625354124
7,Luc Dardenne,7.662738625354124
8,Bennett Miller,7.659622758607599
9,Jafar Panahi,7.655152660012336
10,Alexander Payne,7.63041787207395


In [99]:
director_basics[director_basics[:director] .== "Mike Leigh", :]

Unnamed: 0,id,director,title,metascore,user_score,release_date,running_time,rating,company,positive,mixed,negative,adjusted_average,isnew,adjusted_director_rating
1,all-or-nothing,Mike Leigh,All or Nothing,72,7.4,"October 25, 2002",128 min,Rated R for pervasive language and some sexuality.,Les Films Alain Sarde,7,3,0,7.24741844320058,False,7.891083239713566
2,happy-go-lucky,Mike Leigh,Happy-Go-Lucky,84,8.0,"October 10, 2008",118 min,Rated R for language.,Ingenious Film Partners,199,14,42,8.036828744784522,True,7.891083239713566
3,life-is-sweet,Mike Leigh,Life Is Sweet,88,6.8,"October 25, 1991",103 min,Rated R for language and a scene of sensuality,British Screen Productions,3,1,1,8.152785379343861,False,7.891083239713566
4,mr-turner,Mike Leigh,Mr. Turner,94,6.9,"December 19, 2014",150 min,Rated R for some sexual content,Film4,83,22,18,8.898241880827932,True,7.891083239713566
5,naked,Mike Leigh,Naked,84,8.8,"December 15, 1993",131 min,Unrated,British Screen Productions,14,1,0,7.893612079432005,False,7.891083239713566
6,secrets-lies,Mike Leigh,Secrets & Lies,91,8.3,"September 27, 1996",136 min,Rated R for language.,CiBy 2000,46,3,4,8.523838591231263,False,7.891083239713566
7,topsy-turvy,Mike Leigh,Topsy-Turvy,90,7.7,"December 17, 1999",160 min,Rated R for a scene of risque nudity.,Thin Man Films,22,3,3,8.521139594088273,False,7.891083239713566
8,vera-drake,Mike Leigh,Vera Drake,83,7.6,"October 10, 2004",125 min,Rated R for depiction of strong thematic material.,Ingenious Film Partners,28,4,4,8.167193863049452,False,7.891083239713566


## Non-negative Matrix Factorization for similarity between user and critic
The goal here is to:
- give a certain number of movies to the user to rate
- identify the critic the user is the closest to
- recommend movies to the user based on that critic's preferences.

This is better (I think) than traditional collaborative filtering, because the typical user won't have seen as many movies as the critics. This could allow recommending more obscure movies, and it could also suggest that the user read that critic's publication in the future.

We could imagine having a hierarchical improvement, where for each genre, we would select a critic. But maybe later.
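
The three steps above can be sketched on a toy example. Note that this sketch swaps in plain cosine similarity on co-rated films rather than a matrix factorization, just to show the matching and recommendation logic. All names and scores are hypothetical, and a 0 is treated as "not reviewed", as in the sparse matrix built below.

```julia
using LinearAlgebra

# Rows are critics, columns are films; 0 means "not reviewed" (hypothetical data).
critic_names = ["critic A", "critic B", "critic C"]
film_names   = ["film 1", "film 2", "film 3", "film 4", "film 5"]
ratings = [90 80 20  0 85;
           30 40 90 80  0;
           85  0 30 20 95]

# Step 1: the user rates a few films (here the first three).
user = [100, 70, 10, 0, 0]

# Cosine similarity restricted to the films both parties have rated.
function similarity(u, c)
    both = (u .> 0) .& (c .> 0)
    count(both) == 0 && return 0.0
    a, b = u[both], c[both]
    return dot(a, b) / (norm(a) * norm(b))
end

# Step 2: find the closest critic.
sims = [similarity(user, ratings[i, :]) for i in 1:size(ratings, 1)]
best = argmax(sims)

# Step 3: recommend films that critic scored at least 70 and the user has not rated.
recommended = [film_names[j] for j in 1:length(user)
               if user[j] == 0 && ratings[best, j] >= 70]

println("closest critic: ", critic_names[best])
println("recommended: ", recommended)
```

A factorization-based version would compare the user and the critics in a low-dimensional latent space instead of on raw scores, which should behave better when the overlap between rated films is small.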

In [None]:
film_ids = convert(Array, unique(reviews[:id]))
critic_ids = convert(Array, unique(reviews[:critic]));
film_dict = Dict(collect(zip(film_ids, 1:length(film_ids))))
critic_dict = Dict(collect(zip(critic_ids, 1:length(critic_ids))));
film_is = [film_dict[film] for film in reviews[:id]]
critic_is = [critic_dict[critic] for critic in reviews[:critic]];

In [None]:
critic_x_film = sparse(critic_is, film_is, reviews[:score])

In [None]:
means = Array{Float64}(size(critic_x_film, 2)) 
for i in 1:size(critic_x_film, 2)
   means[i] = mean(nonzeros(critic_x_film[:, i]))
end

In [None]:
fit_mle(Dirichlet, [.1 .1; .9 .9], init= [.1 .1])