# Solutions 4 - Age of Recommendation

After the __Age of Search__ (see the PageRank notebook), we are now in the 
__Age of Recommendation__. This notebook is about Netflix recommendations using 
Simon Funk's algorithm as implemented in [IncrementalSVD.jl](https://github.com/aaw/IncrementalSVD.jl) by Aaron Windsor.

## Messages

Easy -> hard?

Present: __age of search__ -> _mathematics_

Future: __age of recommendation__  -> _mathematics_

(BigData, new technologies)


## Easy -> hard

weight:  1 _kg_  -> 1 _kg_ $\pm$ 0.000000001 _kg_

running: 100 _m_ -> 42,195 _m_ or 100 _m_ < 10 _s_

mathematics: exam -> state competition  -> [Olympiad](http://www.imo-official.org/)

search, recommending: good -> excellent

## Age of Search

google (and others)


* [50 billion pages](http://www.worldwidewebsize.com/), [3.5 billion queries daily](http://www.internetlivestats.com/google-search-statistics/)
* __PageRank__
* history, context - cookies, storing data (about you), [200+ parameters](http://backlinko.com/google-ranking-factors)

## Age of Recommendation

NetFlix, Amazon Prime, PickBox, ... - on-line streaming of movies and shows

[NetFlix](https://www.netflix.com/hr/)

 * [80 million users](https://www.statista.com/statistics/250934/quarterly-number-of-netflix-streaming-subscribers-worldwide/), 5,000 movies
 * [NetFlix Prize](http://www.kdd.org/kdd2014/tutorials/KDD%20-%20The%20Recommender%20Problem%20Revisited.pdf)
 

## Mathematics

The Netflix recommendation engine is based on approximating a (large and sparse) matrix
```
M = Users x Movies 
```
using (approximation of) singular value decomposition (SVD): 

* [IncrementalSVD.jl](https://github.com/aaw/IncrementalSVD.jl)
* [A parallel recommendation engine in Julia](http://juliacomputing.com/blog/2016/04/22/a-parallel-recommendation-engine-in-julia.html)
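
To make the idea concrete, here is a minimal sketch in the spirit of Simon Funk's approach: features are learned one at a time by gradient descent, using only the known ratings. The function `funk_sketch`, its parameters (`rate`, `reg`, `epochs`) and the toy data are assumptions made for illustration; this is not the code of IncrementalSVD.jl.

```julia
# Funk-style training sketch: learn one feature at a time by gradient
# descent on the known ratings only. All names here are illustrative.
function funk_sketch(ratings, nusers, nitems; nfeatures=2, epochs=200,
                     rate=0.01, reg=0.02)
    U = fill(0.1, nusers, nfeatures)   # user features
    V = fill(0.1, nitems, nfeatures)   # item features
    for f in 1:nfeatures
        for _ in 1:epochs
            for (u, m, r) in ratings
                # prediction from the features trained so far
                pred = sum(U[u, k] * V[m, k] for k in 1:f)
                err = r - pred
                uf, vf = U[u, f], V[m, f]
                # gradient step with a small regularization term
                U[u, f] += rate * (err * vf - reg * uf)
                V[m, f] += rate * (err * uf - reg * vf)
            end
        end
    end
    return U, V
end

# Toy data: (user, movie, rating) triples, most entries are unknown
ratings = [(1, 1, 5.0), (1, 2, 4.0), (2, 1, 4.0),
           (2, 3, 1.0), (3, 2, 5.0), (3, 3, 2.0)]
U, V = funk_sketch(ratings, 3, 3)

predict(u, m) = sum(U[u, :] .* V[m, :])
sse = sum((r - predict(u, m))^2 for (u, m, r) in ratings)
println("sum of squared errors on known ratings: ", sse)
println("predicted rating of user 1 for movie 3: ", predict(1, 3))
```

The point of the design is that unknown entries are never touched during training, yet the product of the two factors gives a value for every (user, movie) pair.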

## Similarities

Similarity of users $i$ and $k$:
$$
\cos \angle (i,k)=\frac{(M[i,:],M[k,:])}{\|M[i,:]\| \cdot \|M[k,:]\|}
$$
Similarity of movies $i$ and $k$:
$$
\cos \angle (i,k)=\frac{(M[:,i],M[:,k])}{\|M[:,i]\| \cdot \|M[:,k]\|}
$$
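
Both formulas can be tried on a small example. The matrix below is made up for illustration (0 stands for "not rated"), and `cosine` is a helper defined here, not a function of IncrementalSVD.jl.

```julia
using LinearAlgebra

# Toy ratings matrix: rows are users, columns are movies, 0 means "not rated"
M = [5.0 4.0 0.0 1.0;
     4.0 5.0 0.0 2.0;
     1.0 0.0 5.0 4.0]

# Cosine of the angle between two vectors
cosine(x, y) = dot(x, y) / (norm(x) * norm(y))

println("users 1 and 2:  ", cosine(M[1, :], M[2, :]))
println("users 1 and 3:  ", cosine(M[1, :], M[3, :]))
println("movies 1 and 2: ", cosine(M[:, 1], M[:, 2]))
```

Users 1 and 2 rate the same movies in a similar way, so their cosine is close to 1, while users 1 and 3 have nearly opposite tastes and a much smaller cosine.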

## Search

Row $M[u,:]$ - what user $u$ thinks about movies

Column $M[:,m]$ - what users think about movie $m$

Element $M[u,m]$ - what user $u$ thinks about movie $m$.

## Problem

The matrix $M$ is sparse: each user has rated only a small fraction of the movies, so most entries are unknown. For example, 

```
900188 marks / ( 6040 users x 3706 movies ) = 4%
```
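
The same ratio can be computed for any matrix stored in sparse form. The sketch below uses the standard `SparseArrays` library and a made-up 4 x 5 matrix; `density` is a helper defined here for illustration.

```julia
using SparseArrays

# (user, movie, mark) triples for a toy 4 x 5 ratings matrix
users  = [1, 1, 2, 3, 4]
movies = [1, 3, 2, 5, 4]
marks  = [5.0, 3.0, 4.0, 1.0, 2.0]
M = sparse(users, movies, marks, 4, 5)

# Fraction of stored (known) entries
density(A) = nnz(A) / length(A)
println(density(M))

# The same ratio for the counts quoted above
println(900188 / (6040 * 3706))
```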

## Approximation

The SVD $M=U\Sigma V^T$ is [approximated by a low-rank matrix](https://en.wikipedia.org/wiki/Low-rank_approximation) (for example, of rank 25).

![SVD decomposition](svd.png)

The approximation matrix is __full__ and __gives sufficiently good information__.

The prize for an efficient approximation algorithm was $\$$1,000,000.  
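
A low-rank approximation can be illustrated with the standard `svd` function from `LinearAlgebra`. The sketch assumes a small, fully known matrix, which is a simplification: in the recommendation setting most entries are missing, so the factors are fitted to the known entries only.

```julia
using LinearAlgebra

# Small dense matrix with two visible "taste groups"
A = [5.0 4.0 1.0 1.0;
     4.0 5.0 1.0 2.0;
     1.0 1.0 5.0 4.0;
     2.0 1.0 4.0 5.0]

F = svd(A)
k = 2   # rank of the approximation

# Keep only the k largest singular values and the matching vectors
Ak = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]

println(rank(Ak))
# The Frobenius error equals the norm of the discarded singular values
println(norm(A - Ak), " ≈ ", norm(F.S[k+1:end]))
```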

In [1]:
# pkg> add https://github.com/aaw/IncrementalSVD.jl   or
# pkg> add https://github.com/ivanslapnicar/IncrementalSVD.jl

In [2]:
include("../../../IncrementalSVD.jl/src/IncrementalSVD.jl")
using .IncrementalSVD

┌ Info: Precompiling InfoZIP [f4508453-b816-52ab-a864-26fc7f6211fc]
└ @ Base loading.jl:1260


In [3]:
varinfo(IncrementalSVD)

| name                         |        size | summary                              |
|:---------------------------- | -----------:|:------------------------------------ |
| IncrementalSVD               | 174.465 KiB | Module                               |
| Rating                       |   208 bytes | DataType                             |
| RatingSet                    |   216 bytes | DataType                             |
| RatingsModel                 |   224 bytes | DataType                             |
| cosine_similarity            |     0 bytes | typeof(cosine_similarity)            |
| get_predicted_rating         |     0 bytes | typeof(get_predicted_rating)         |
| item_features                |     0 bytes | typeof(item_features)                |
| item_search                  |     0 bytes | typeof(item_search)                  |
| items                        |     0 bytes | typeof(items)                        |
| load_book_crossing_dataset   |     0 bytes | typeof(load_book_crossing_dataset)   |
| load_large_movielens_dataset |     0 bytes | typeof(load_large_movielens_dataset) |
| load_small_movielens_dataset |     0 bytes | typeof(load_small_movielens_dataset) |
| rmse                         |     0 bytes | typeof(rmse)                         |
| similar_items                |     0 bytes | typeof(similar_items)                |
| similar_users                |     0 bytes | typeof(similar_users)                |
| split_ratings                |     0 bytes | typeof(split_ratings)                |
| train                        |     0 bytes | typeof(train)                        |
| truncate_model!              |     0 bytes | typeof(truncate_model!)              |
| user_features                |     0 bytes | typeof(user_features)                |
| user_ratings                 |     0 bytes | typeof(user_ratings)                 |
| users                        |     0 bytes | typeof(users)                        |


In [4]:
rating_set = load_small_movielens_dataset()

Reusing existing downloaded files...
.\ml-1m\ratings.dat


Loading ratings 100%|███████████████████████████████████| Time: 0:00:01


RatingSet(Rating[Rating(5394, 2741, 2.0f0), Rating(614, 17, 4.0f0), Rating(2506, 401, 4.0f0), Rating(3841, 1231, 3.0f0), Rating(5011, 993, 4.0f0), Rating(1788, 252, 3.0f0), Rating(2124, 1388, 1.0f0), Rating(2544, 273, 5.0f0), Rating(245, 197, 2.0f0), Rating(4609, 2648, 3.0f0)  …  Rating(1851, 1149, 2.0f0), Rating(3887, 1209, 1.0f0), Rating(4754, 353, 4.0f0), Rating(1137, 3129, 3.0f0), Rating(721, 820, 2.0f0), Rating(2895, 1377, 2.0f0), Rating(1977, 1863, 5.0f0), Rating(5021, 1562, 4.0f0), Rating(3718, 2525, 4.0f0), Rating(1418, 2200, 2.0f0)], Rating[Rating(235, 1940, 2.0f0), Rating(2544, 44, 4.0f0), Rating(3119, 87, 5.0f0), Rating(204, 97, 5.0f0), Rating(4060, 6, 3.0f0), Rating(1325, 566, 4.0f0), Rating(3356, 869, 5.0f0), Rating(1425, 513, 4.0f0), Rating(2063, 222, 1.0f0), Rating(3118, 720, 4.0f0)  …  Rating(4836, 1888, 1.0f0), Rating(3537, 3271, 1.0f0), Rating(4235, 125, 4.0f0), Rating(54, 1765, 4.0f0), Rating(169, 2061, 4.0f0), Rating(2756, 1082, 2.0f0), Rating(2389, 1311, 3.0f0), Ra

In [5]:
propertynames(rating_set)

(:training_set, :test_set, :user_to_index, :item_to_index)

In [6]:
# The format is (user, movie, mark)
rating_set.training_set

900188-element Array{Rating,1}:
 Rating(5394, 2741, 2.0f0)
 Rating(614, 17, 4.0f0)
 Rating(2506, 401, 4.0f0)
 Rating(3841, 1231, 3.0f0)
 Rating(5011, 993, 4.0f0)
 Rating(1788, 252, 3.0f0)
 Rating(2124, 1388, 1.0f0)
 Rating(2544, 273, 5.0f0)
 Rating(245, 197, 2.0f0)
 Rating(4609, 2648, 3.0f0)
 Rating(5880, 1320, 3.0f0)
 Rating(5636, 1189, 2.0f0)
 Rating(2018, 720, 4.0f0)
 ⋮
 Rating(232, 1076, 1.0f0)
 Rating(2777, 1666, 3.0f0)
 Rating(1851, 1149, 2.0f0)
 Rating(3887, 1209, 1.0f0)
 Rating(4754, 353, 4.0f0)
 Rating(1137, 3129, 3.0f0)
 Rating(721, 820, 2.0f0)
 Rating(2895, 1377, 2.0f0)
 Rating(1977, 1863, 5.0f0)
 Rating(5021, 1562, 4.0f0)
 Rating(3718, 2525, 4.0f0)
 Rating(1418, 2200, 2.0f0)

In [7]:
rating_set.test_set

100021-element Array{Rating,1}:
 Rating(235, 1940, 2.0f0)
 Rating(2544, 44, 4.0f0)
 Rating(3119, 87, 5.0f0)
 Rating(204, 97, 5.0f0)
 Rating(4060, 6, 3.0f0)
 Rating(1325, 566, 4.0f0)
 Rating(3356, 869, 5.0f0)
 Rating(1425, 513, 4.0f0)
 Rating(2063, 222, 1.0f0)
 Rating(3118, 720, 4.0f0)
 Rating(2116, 84, 4.0f0)
 Rating(3484, 252, 3.0f0)
 Rating(939, 1110, 3.0f0)
 ⋮
 Rating(4497, 1448, 3.0f0)
 Rating(996, 386, 4.0f0)
 Rating(4836, 1888, 1.0f0)
 Rating(3537, 3271, 1.0f0)
 Rating(4235, 125, 4.0f0)
 Rating(54, 1765, 4.0f0)
 Rating(169, 2061, 4.0f0)
 Rating(2756, 1082, 2.0f0)
 Rating(2389, 1311, 3.0f0)
 Rating(787, 244, 5.0f0)
 Rating(5747, 65, 5.0f0)
 Rating(2592, 2073, 4.0f0)

In [8]:
# Users and their IDs
rating_set.user_to_index

Dict{AbstractString,Int32} with 6040 entries:
  "4304" => 4304
  "3935" => 3935
  "5422" => 5422
  "5734" => 5734
  "2243" => 2243
  "1881" => 1881
  "5425" => 5425
  "4209" => 4209
  "1907" => 1907
  "2923" => 2923
  "599"  => 599
  "2491" => 2491
  "5944" => 5944
  "228"  => 228
  "2590" => 2590
  "3697" => 3697
  "5031" => 5031
  "2579" => 2579
  "5551" => 5551
  "1880" => 1880
  "2562" => 2562
  "3215" => 3215
  "3991" => 3991
  "4652" => 4652
  "4088" => 4088
  ⋮      => ⋮

In [9]:
# Movies and their IDs
rating_set.item_to_index

Dict{AbstractString,Int32} with 3706 entries:
  "Fried Green Tomatoes (1991)"                                       => 594
  "Milk Money (1994)"                                                 => 1361
  "From Russia with Love (1963)"                                      => 729
  "House II: The Second Story (1987)"                                 => 1247
  "Held Up (2000)"                                                    => 3549
  "Missing in Action 2: The Beginning (1985)"                         => 2177
  "Murder, My Sweet (1944)"                                           => 996
  "Hidden, The (1987)"                                                => 981
  "Cable Guy, The (1996)"                                             => 669
  "Big Kahuna, The (2000)"                                            => 893
  "Addams Family Values (1993)"                                       => 1857
  "Farinelli: il castrato (1994)"                                     => 1945
  "Education of Little T

In [10]:
# We can extract the titles ...
keys(rating_set.item_to_index)

Base.KeySet for a Dict{AbstractString,Int32} with 3706 entries. Keys:
  "Fried Green Tomatoes (1991)"
  "Milk Money (1994)"
  "From Russia with Love (1963)"
  "House II: The Second Story (1987)"
  "Held Up (2000)"
  "Missing in Action 2: The Beginning (1985)"
  "Murder, My Sweet (1944)"
  "Hidden, The (1987)"
  "Cable Guy, The (1996)"
  "Big Kahuna, The (2000)"
  "Addams Family Values (1993)"
  "Farinelli: il castrato (1994)"
  "Education of Little Tree, The (1997)"
  "In God's Hands (1998)"
  "Last Man Standing (1996)"
  "Sixth Sense, The (1999)"
  "Star Maps (1997)"
  "Girl, Interrupted (1999)"
  "Stand by Me (1986)"
  "Rob Roy (1995)"
  "Caligula (1980)"
  "Flirting With Disaster (1996)"
  "Hook (1991)"
  "Institute Benjamenta, or This Dream People Call Human Life (1995)"
  "Way We Were, The (1973)"
  ⋮

In [11]:
# or codes
values(rating_set.item_to_index)

Base.ValueIterator for a Dict{AbstractString,Int32} with 3706 entries. Values:
  594
  1361
  729
  1247
  3549
  2177
  996
  981
  669
  893
  1857
  1945
  2759
  2814
  1840
  39
  2670
  32
  85
  449
  2532
  1290
  657
  3397
  762
  ⋮

In [13]:
# Which movies did the user "3000" grade?
user_ratings(rating_set, "3000")

91-element Array{Tuple{Pair{Int32,SubString{String}},Float32},1}:
 (530 => "Sling Blade (1996)", 5.0)
 (2058 => "Eye of the Beholder (1999)", 5.0)
 (2177 => "Missing in Action 2: The Beginning (1985)", 5.0)
 (3176 => "Goodbye, Lover (1999)", 5.0)
 (854 => "Titan A.E. (2000)", 5.0)
 (1527 => "Sliding Doors (1998)", 5.0)
 (2861 => "Tales of Terror (1962)", 5.0)
 (1071 => "Last Emperor, The (1987)", 5.0)
 (802 => "Holy Man (1998)", 5.0)
 (594 => "Fried Green Tomatoes (1991)", 5.0)
 (2051 => "Bread and Chocolate (Pane e cioccolata) (1973)", 5.0)
 (772 => "Jerk, The (1979)", 5.0)
 (2196 => "Amazing Panda Adventure, The (1995)", 4.0)
 ⋮
 (2496 => "Poison Ivy: New Seduction (1997)", 2.0)
 (550 => "Hype! (1996)", 2.0)
 (373 => "Boys Don't Cry (1999)", 2.0)
 (1449 => "It Happened One Night (1934)", 1.0)
 (406 => "Keeping the Faith (2000)", 1.0)
 (2984 => "Steam: The Turkish Bath (Hamam) (1997)", 1.0)
 (3580 => "Sprung (1997)", 1.0)
 (2461 => "Meet Wally Sparks (1997)", 1.0)
 (1314 => "Stripes (

In [14]:
# Let us find the exact title and code for "Blade Runner"
for k in keys(rating_set.item_to_index)
    if occursin("Blade",k)
        println(k)
    end
end

Sling Blade (1996)
Blade (1998)
Blade Runner (1982)
Some Folks Call It a Sling Blade (1993)


In [15]:
# Did the user "3000" grade "Blade Runner"?
for k in user_ratings(rating_set,"3000")
    if occursin("Blade Runner",k[1][2])
        println(k)
    end
end

In [16]:
# How did the user "3000" grade "Sling Blade" ?
for k in user_ratings(rating_set,"3000")
    if occursin("Blade",k[1][2])
        println(k)
    end
end

(530 => "Sling Blade (1996)", 5.0f0)


In [17]:
get(rating_set.item_to_index,"Blade Runner (1982)",0)

744

In [18]:
# This takes about 2.5 - 3.5 minutes
model = train(rating_set, 25);

Computing truncated rank 25 SVD 100%|███████████████████| Time: 0:00:15


In [20]:
propertynames(model)

(:user_to_index, :item_to_index, :U, :S, :V)
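
Given factors named `U`, `S` and `V`, a natural reading, consistent with $M=U\Sigma V^T$, is that entry $(u,m)$ of the approximation is $\sum_k U[u,k]\,S[k]\,V[m,k]$. The sketch below checks this on tiny made-up factors. Whether the package computes predictions in exactly this way (for instance, with additional clamping to the rating range) is an assumption, and `predicted` is a helper defined here.

```julia
using LinearAlgebra

# Hypothetical tiny factors (2 users, 3 movies, rank 2);
# these numbers are NOT taken from the trained model
U = [0.6 0.8; 0.8 -0.6]
S = [10.0, 2.0]
V = [0.5 0.1; 0.4 -0.2; 0.3 0.9]

# Entry (u, m) of U * Diagonal(S) * V' without forming the full matrix
predicted(u, m) = dot(U[u, :] .* S, V[m, :])

println(predicted(1, 2))
println((U * Diagonal(S) * V')[1, 2])
```

Computing single entries this way avoids building the full users x movies matrix, which would be dense.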

In [21]:
model.U

6040×25 Array{Float32,2}:
 0.0107142   0.0118767    0.0169031   0.00807186  …   0.0177213   0.0109083
 0.014054    0.0112279    0.0101315   0.0102579       0.00793691  0.0104726
 0.0100172   0.00904886   0.0162754   0.00882626      0.00866408  0.018414
 0.00711395  0.006777     0.0157497   0.00436876      0.00874469  0.0151361
 0.0136305   0.0166518    0.00149312  0.0152443       0.00415827  0.00783235
 0.0108813   0.0165051    0.0151611   0.0154106   …   0.00459014  0.0195152
 0.00867741  0.00744987   0.018701    0.0044736       0.00586769  0.0163895
 0.0155244   0.0151719    0.0107715   0.0134915       0.0155342   0.00718333
 0.0134894   0.00893162   0.0110126   0.0073914       0.0133561   0.0075436
 0.0187854   0.0126696   -0.00998682  0.0248243       0.0123881   0.0104953
 0.0131322   0.0115955    0.00979953  0.0113361   …   0.00579716  0.00946938
 0.00671707  0.00838414   0.0147071   0.00635105      0.012219    0.0180386
 0.0123021   0.00970481   0.0130822   0.00742211      0.0100

In [22]:
model.S

25-element Array{Float32,1}:
 8190.8213
 2149.017
  830.6027
 1150.4939
  839.8949
 1158.9558
  652.8342
  392.10507
  339.68713
  271.391
  256.0176
  333.4429
  289.4227
  221.43741
  308.60135
  339.03073
  341.78983
  193.14778
  204.86087
  192.27917
  167.09984
  126.188484
  112.97897
  119.95354
  102.64159

In [23]:
model.V

3706×25 Array{Float32,2}:
 0.0359215   0.00816437  0.0691514   …  -0.000583241   0.0121019
 0.0252739   0.0149589   0.0201202      -0.00171532    0.0145751
 0.0305693   0.0139319   0.0311927       0.00789148    0.006726
 0.0320046   0.0114556   0.0539689       0.0150046    -0.0125299
 0.0315613   0.00747788  0.0558023       0.00758263   -0.00760921
 0.034729    0.00479607  0.0705086   …  -0.0214193     0.0130412
 0.0303202   0.00763446  0.0296873       0.00120213    0.00272048
 0.0332735   0.00382193  0.051395        0.0276164    -0.0201521
 0.0289202   0.00732115  0.0225404       0.00439462    0.0148727
 0.0336104   0.00486958  0.0537455       0.000312531   0.0251844
 0.0297867   0.00519216  0.0382883   …   0.00329647   -0.0133145
 0.0172581   0.0267268   0.00967494      0.0125154     0.00874973
 0.0257697   0.0196641   0.0126901       0.00252681    0.00634576
 ⋮                                   ⋱                
 0.00117918  0.00256827  0.00405728      0.0141199     0.0140804
 0.001

In [24]:
similar_items(model, "Friday the 13th (1980)")

10-element Array{SubString{String},1}:
 "Friday the 13th (1980)"
 "Amityville Horror, The (1979)"
 "Jaws 2 (1978)"
 "Omen, The (1976)"
 "Friday the 13th Part 2 (1981)"
 "Halloween II (1981)"
 "Cujo (1983)"
 "Candyman (1992)"
 "Hellraiser (1987)"
 "Pet Sematary (1989)"

In [25]:
# Take a look at the function
@which similar_items(model, "Friday the 13th (1980)")
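
One plausible way to find similar items is to rank all items by the cosine similarity of their feature rows. The sketch below does this for made-up features; `most_similar`, `titles` and `features` are hypothetical names, and the package's own implementation may differ.

```julia
using LinearAlgebra

# Hypothetical feature rows for five items (rank 2), one row per item
titles = ["A", "B", "C", "D", "E"]
features = [ 1.0 0.1;
             0.9 0.2;
             0.1 1.0;
             0.2 0.9;
            -1.0 0.0]

cosine(x, y) = dot(x, y) / (norm(x) * norm(y))

# Titles of the k items whose feature rows are closest in angle to item i
function most_similar(features, titles, i; k=3)
    scores = [cosine(features[i, :], features[j, :]) for j in 1:size(features, 1)]
    order = sortperm(scores, rev=true)
    return titles[order[1:k]]
end

println(most_similar(features, titles, 1))
```

In this sketch the queried item always comes first, since its cosine with itself is 1.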

In [26]:
similar_items(model, "Citizen Kane (1941)")

10-element Array{SubString{String},1}:
 "Citizen Kane (1941)"
 "Chinatown (1974)"
 "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
 "M*A*S*H (1970)"
 "Lawrence of Arabia (1962)"
 "Rear Window (1954)"
 "African Queen, The (1951)"
 "Boat, The (Das Boot) (1981)"
 "Casablanca (1942)"
 "Maltese Falcon, The (1941)"

In [27]:
similar_users(model,"3000")

10-element Array{SubString{String},1}:
 "3000"
 "5613"
 "882"
 "1248"
 "1360"
 "5931"
 "4608"
 "5622"
 "2202"
 "1222"

In [28]:
# What is the opinion of user "3000" about "Blade Runner (1982)" 
# in the approximate model (no true mark) ?
get_predicted_rating(model, "3000", "Blade Runner (1982)")

4.2478147f0

In [29]:
# What is the opinion of user "3000" about "Citizen Kane (1941)"
# (no true mark!) ?
IncrementalSVD.get_predicted_rating(model, "3000", "Citizen Kane (1941)")

4.223943f0

In [30]:
# What is the opinion of user "3000" about "Sling Blade (1996)"
# in the approximate model (true mark 5.0) ?
IncrementalSVD.get_predicted_rating(model, "3000", "Sling Blade (1996)")

4.043533f0

In [32]:
# What is the opinion of user "3000" about "Time to Kill, A (1996)"
# in the approximate model (true mark 1.0) ?
IncrementalSVD.get_predicted_rating(model, "3000", "Time to Kill, A (1996)")

3.080271f0
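
The quality of such predictions is commonly summarized by the root mean square error (RMSE). As a small illustration, the sketch below computes it for the two (predicted, true) pairs quoted in the cells above; `rmse_sketch` is a helper defined here and is not the package's `rmse` function.

```julia
# (predicted, true) pairs quoted above:
# "Sling Blade (1996)" and "Time to Kill, A (1996)" for user "3000"
quoted = [(4.043533, 5.0), (3.080271, 1.0)]

# Root mean square error over a list of (predicted, true) pairs
rmse_sketch(p) = sqrt(sum((pred - truth)^2 for (pred, truth) in p) / length(p))

println(rmse_sketch(quoted))
```

Two ratings are far too few to judge the model; a meaningful estimate would average over a held-out test set.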

## Thank you for your attention

### Questions?