# New Age

Fakulteta za matematiko in fiziko - Univerza v Ljubljani

Ivan Slapničar, Sveučilište u Splitu, FESB

April, 2017

## Messages

Easy -> hard?

Present: __age of search__ -> _mathematics_

Future: __age of recommendation__  -> _mathematics_

(BigData, new technologies)


## Easy -> hard

weight:  1 _kg_  -> 1 _kg_ $\pm$ 0.000000001 _kg_

running: 100 _m_ -> 42,195 _m_ or 100 _m_ < 10 _s_

mathematics: exam -> state competition  -> [Olympiad](http://www.imo-official.org/)

recommending: good -> excellent

## Age of Search

google (and others)


* [50 billion pages](http://www.worldwidewebsize.com/), [3.5 billion queries daily](http://www.internetlivestats.com/google-search-statistics/)
* __PageRank__
* history, context - cookies, storing data (about you), [200+ parameters](http://backlinko.com/google-ranking-factors)

## PageRank

* Graph theory and linear algebra
* [C. Moler, Google PageRank][Mol11]


[Mol11]: https://www.mathworks.com/moler/exm/chapters/pagerank.pdf "C. Moler, 'Google PageRank', MathWorks, 2011."


In [1]:
i = vec([ 2 6 3 4 4 5 6 1 1])
j = vec([ 1 1 2 2 3 3 3 4 6])
G=sparse(i,j,1.0)

6×6 sparse matrix with 9 Float64 nonzero entries:
	[2, 1]  =  1.0
	[6, 1]  =  1.0
	[3, 2]  =  1.0
	[4, 2]  =  1.0
	[4, 3]  =  1.0
	[5, 3]  =  1.0
	[6, 3]  =  1.0
	[1, 4]  =  1.0
	[1, 6]  =  1.0

In [2]:
full(G)

6×6 Array{Float64,2}:
 0.0  0.0  0.0  1.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0
 0.0  1.0  1.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  0.0

We do a _random walk_ on the graph, where every outgoing link of a page is followed with equal probability:

In [3]:
c=sum(G,1)
n=size(G,1)
for j=1:n
    if c[j]>0
        G[:,j]=G[:,j]/c[j]
    end
end
full(G)

6×6 Array{Float64,2}:
 0.0  0.0  0.0       1.0  0.0  1.0
 0.5  0.0  0.0       0.0  0.0  0.0
 0.0  0.5  0.0       0.0  0.0  0.0
 0.0  0.5  0.333333  0.0  0.0  0.0
 0.0  0.0  0.333333  0.0  0.0  0.0
 0.5  0.0  0.333333  0.0  0.0  0.0

* $p$: probability of following a link from the current page
* $1-p$: probability of jumping to some other page chosen at random
* Google reportedly uses $p=0.85$

In [5]:
p=0.85
z = ((1-p)*(c.!=0) + (c.==0))/n
A=p*G+ones(n)*z

6×6 Array{Float64,2}:
 0.025  0.025  0.025     0.875  0.166667  0.875
 0.45   0.025  0.025     0.025  0.166667  0.025
 0.025  0.45   0.025     0.025  0.166667  0.025
 0.025  0.45   0.308333  0.025  0.166667  0.025
 0.025  0.025  0.308333  0.025  0.166667  0.025
 0.45   0.025  0.308333  0.025  0.166667  0.025
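
The construction of $A$ can be checked on the six-page example: every column should sum to 1, and the column of the page without outgoing links (page 5) should be uniform. This is a minimal sketch, assuming Julia 1.x with the standard libraries `SparseArrays` and `LinearAlgebra` (the cells above use Julia 0.5 syntax); the names `D`, `Gn` and `colsums` are introduced only for this illustration.

```julia
using SparseArrays, LinearAlgebra

# Edges of the six-page example: page j[k] links to page i[k]
i = [2, 6, 3, 4, 4, 5, 6, 1, 1]
j = [1, 1, 2, 2, 3, 3, 3, 4, 6]
n = 6
G = sparse(i, j, 1.0, n, n)

c = vec(sum(G, dims=1))      # number of outgoing links of every page
p = 0.85

# Normalize only the columns of pages that have outgoing links
D = spdiagm(0 => [cj > 0 ? 1 / cj : 0.0 for cj in c])
Gn = G * D

# Random-jump term: (1-p)/n for ordinary pages, 1/n for pages without links
z = [cj > 0 ? (1 - p) / n : 1 / n for cj in c]
A = p * Matrix(Gn) + ones(n) * z'

colsums = vec(sum(A, dims=1))
println(colsums)             # every column should sum to 1 (up to rounding)
```

Since all columns sum to 1, multiplying by $A$ keeps the entries of a probability vector summing to 1.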

## Idea

Let us start a random walk from the vector $x_0=\begin{bmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{bmatrix}$.

The subsequent vectors are  

$$
\begin{aligned}
x_1&=A\cdot x_0 \\
x_2&=A\cdot x_1 \\
x_3&=A\cdot x_2\\
&\ \ \vdots
\end{aligned}
$$

When the vector __stabilizes__:

$$
A\cdot x\approx x,
$$

then $x[i]$ is the __rank of the page__ $i$.
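
The iteration can be sketched with an explicit stopping test instead of a fixed number of steps. The following is only an illustration, assuming Julia 1.x; the function name `power_iteration`, the tolerance `tol` and the 2×2 matrix are made up for this example and do not come from the talk.

```julia
using LinearAlgebra

# Repeat x <- A*x until two successive vectors differ by less than tol (1-norm).
function power_iteration(A::AbstractMatrix; tol=1e-10, maxsteps=1000)
    n = size(A, 1)
    x = fill(1 / n, n)              # x_0: all entries equal to 1/n
    for step in 1:maxsteps
        xnew = A * x
        xnew ./= norm(xnew, 1)      # guard against rounding drift
        if norm(xnew - x, 1) < tol
            return xnew, step
        end
        x = xnew
    end
    return x, maxsteps
end

# Toy column-stochastic matrix; its stationary vector is [5/6, 1/6]
A = [0.9 0.5;
     0.1 0.5]
x, steps = power_iteration(A)
println(x, " after ", steps, " steps")
```

For this toy matrix the stationary vector can be verified by hand: $0.9\cdot\frac{5}{6}+0.5\cdot\frac{1}{6}=\frac{5}{6}$.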

In [6]:
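function myPageRank(G::SparseMatrixCSC{Float64,Int64},steps::Int)
    # Note: G is modified in place (its nonzero columns are rescaled to sum to p)
    p=0.85
    c=sum(G,1)/p                    # column sums divided by p
    n=size(G,1)
    for i=1:n
        # scale the stored nonzeros of column i
        G.nzval[G.colptr[i]:G.colptr[i+1]-1]./=c[i]
    end
    e=ones(n)
    x=e/n                           # start from the uniform vector x_0
    # random-jump weights: (1-p)/n for pages with links, 1/n for pages without
    z = vec(((1-p)*(c.!=0) + (c.==0))/n)
    for j=1:steps
        x=G*x+(z⋅x)                 # one step x <- A*x without forming A
    end
    x/norm(x,1)                     # normalize so that the ranks sum to 1
end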

myPageRank (generic function with 1 method)

In [7]:
myPageRank(G,15)

6-element Array{Float64,1}:
 0.321024 
 0.170538 
 0.106596 
 0.136795 
 0.0643103
 0.200737 
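
For readers on a current Julia version, the same idea can be sketched without modifying the input matrix. This is an assumption-laden rewrite for Julia 1.x, not the code used in the talk: the name `pagerank` and the keyword `p` are introduced here, and the dense matrix $A$ is never formed, since $A x = pGx + (z\cdot x)\,e$.

```julia
using SparseArrays, LinearAlgebra

# Matrix-free PageRank sketch: G[i, j] != 0 means that page j links to page i.
function pagerank(G::SparseMatrixCSC, steps::Integer; p=0.85)
    n = size(G, 1)
    c = vec(sum(G, dims=1))                       # column sums
    scale = [cj > 0 ? p / cj : 0.0 for cj in c]
    pG = G * spdiagm(0 => scale)                  # new matrix, G is untouched
    z = [cj > 0 ? (1 - p) / n : 1 / n for cj in c]
    x = fill(1 / n, n)
    for _ in 1:steps
        x = pG * x .+ dot(z, x)                   # same as A*x
    end
    return x / norm(x, 1)
end

# The six-page example graph
i = [2, 6, 3, 4, 4, 5, 6, 1, 1]
j = [1, 1, 2, 2, 3, 3, 3, 4, 6]
G = sparse(i, j, 1.0, 6, 6)
x = pagerank(G, 15)
println(x)
```

With 15 steps the result should agree with the vector printed above up to rounding, with page 1 ranked highest.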

### [Stanford web graph](http://snap.stanford.edu/data/web-Stanford.html)

This is a somewhat larger test problem.

In [8]:
W=readdlm("web-Stanford.txt",Int)

2312497×2 Array{Int64,2}:
      1    6548
      1   15409
   6548   57031
  15409   13102
      2   17794
      2   25202
      2   53625
      2   54582
      2   64930
      2   73764
      2   84477
      2   98628
      2  100193
      ⋮        
 281849  165189
 281849  177014
 281849  226290
 281849  243180
 281849  244195
 281849  247252
 281849  281568
 281865  186750
 281865  225872
 281888  114388
 281888  192969
 281888  233184

In [9]:
S=sparse(W[:,2],W[:,1],1.0)

281903×281903 sparse matrix with 2312497 Float64 nonzero entries:
	[6548  ,      1]  =  1.0
	[15409 ,      1]  =  1.0
	[17794 ,      2]  =  1.0
	[25202 ,      2]  =  1.0
	[53625 ,      2]  =  1.0
	[54582 ,      2]  =  1.0
	[64930 ,      2]  =  1.0
	[73764 ,      2]  =  1.0
	[84477 ,      2]  =  1.0
	[98628 ,      2]  =  1.0
	⋮
	[168703, 281902]  =  1.0
	[180771, 281902]  =  1.0
	[266504, 281902]  =  1.0
	[275189, 281902]  =  1.0
	[44103 , 281903]  =  1.0
	[56088 , 281903]  =  1.0
	[90591 , 281903]  =  1.0
	[94440 , 281903]  =  1.0
	[216688, 281903]  =  1.0
	[256539, 281903]  =  1.0
	[260899, 281903]  =  1.0

In [10]:
@time x100=myPageRank(S,100);

  1.472047 seconds (846.68 k allocations: 542.596 MB, 11.52% gc time)


In [11]:
x101=myPageRank(S,101)
maxabs((x100-x101)./x100)

2.349138102570129e-7

In [12]:
sortperm(x100,rev=true), sort(x101,rev=true)

([89073,226411,241454,262860,134832,234704,136821,68889,105607,69358  …  281647,281700,281715,281728,281778,281785,281813,281849,281865,281888],[0.0113029,0.00926783,0.00829727,0.00302312,0.00300128,0.00257173,0.00245371,0.00243079,0.00239105,0.00236401  …  5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7,5.33369e-7])

## Age of Recommendation

Netflix, Amazon Prime, PickBox, ... - online streaming of movies and shows

[Netflix](https://www.netflix.com/hr/)

 * [80 million users](https://www.statista.com/statistics/250934/quarterly-number-of-netflix-streaming-subscribers-worldwide/), 5,000 movies
 * [Netflix Prize](http://www.kdd.org/kdd2014/tutorials/KDD%20-%20The%20Recommender%20Problem%20Revisited.pdf)
 

## Mathematics

The Netflix recommendation engine is based on the approximation of a (large and sparse) matrix
```
M = Users x Movies 
```
using (approximation of) singular value decomposition (SVD): 

* [IncrementalSVD.jl](https://github.com/aaw/IncrementalSVD.jl)
* [A parallel recommendation engine in Julia](http://juliacomputing.com/blog/2016/04/22/a-parallel-recommendation-engine-in-julia.html)

## Similarities

Similarity of users $i$ and $k$:
$$
\cos \angle (i,k)=\frac{(M[i,:],M[k,:])}{\|M[i,:]\| \cdot \|M[k,:]\|}
$$
Similarity of movies $i$ and $k$:
$$
\cos \angle (i,k)=\frac{(M[:,i],M[:,k])}{\|M[:,i]\| \cdot \|M[:,k]\|}
$$
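
Both formulas are the same cosine, applied once to rows and once to columns of $M$. A minimal sketch, assuming Julia 1.x; the function name `cosine` and the small rating matrix are hypothetical and serve only as an illustration (a zero stands for a missing rating).

```julia
using LinearAlgebra

# Cosine of the angle between two vectors
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Hypothetical ratings: 3 users (rows) x 4 movies (columns), 0 = not rated
M = [5.0 4.0 0.0 1.0;
     4.0 5.0 0.0 2.0;
     1.0 0.0 5.0 4.0]

user_sim_12 = cosine(M[1, :], M[2, :])   # users 1 and 2 rate alike
user_sim_13 = cosine(M[1, :], M[3, :])   # users 1 and 3 do not
movie_sim   = cosine(M[:, 1], M[:, 2])   # movies 1 and 2

println((user_sim_12, user_sim_13, movie_sim))
```

Users 1 and 2 give similar ratings, so their cosine is close to 1, while users 1 and 3 share few preferences and get a much smaller value.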

## Search

Row $M[u,:]$ - what user $u$ thinks about movies

Column $M[:,m]$ - what users think about movie $m$

Element $M[u,m]$ - what user $u$ thinks about movie $m$.

## Problem

The matrix $M$ is sparse, so we do not have enough information. For example,

```
900188 marks / ( 6040 users x 3706 movies ) = 4%
```
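
The 4% figure follows directly from the three counts. A short check, assuming Julia 1.x; the variable names are chosen here for illustration.

```julia
ratings = 900188                 # known ratings (marks)
users, movies = 6040, 3706

entries = users * movies         # size of the full matrix M
density = ratings / entries      # fraction of known entries
missing_entries = entries - ratings

println("known:   ", round(100 * density, digits=2), " %")
println("unknown: ", missing_entries, " entries")
```

So roughly 96% of the entries of $M$ have to be estimated.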

## Approximation

The SVD $M=U\Sigma V^T$ is approximated by a low-rank matrix (for example, $\mathrm{rank}=25$)

![SVD decomposition](svd.png)

The approximation matrix is __full__ and __gives sufficiently good information__.

The prize for an efficient approximation algorithm was $\$$1,000,000.
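
On a small dense matrix, a low-rank approximation can be sketched with the standard `svd` function. This only illustrates the truncation idea under the assumption of Julia 1.x and a fully known matrix; it is not the incremental algorithm used for large sparse rating data, and `lowrank` and the 4×4 matrix are hypothetical.

```julia
using LinearAlgebra

# Rank-k approximation built from the k largest singular triplets of M.
function lowrank(M::AbstractMatrix, k::Integer)
    F = svd(M)
    return F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
end

# Hypothetical 4 users x 4 movies rating matrix
M = [5.0 4.0 0.0 1.0;
     4.0 5.0 0.0 2.0;
     1.0 0.0 5.0 4.0;
     0.0 1.0 4.0 5.0]

k = 2
Mk = lowrank(M, k)
err = opnorm(M - Mk)     # spectral norm of the approximation error

println(round.(Mk, digits=2))
println(err)
```

By the Eckart-Young theorem, the spectral-norm error of the truncated SVD equals the first discarded singular value $\sigma_{k+1}$. Note that `Mk` has no exact zeros: the positions that were 0 in `M` now hold estimates.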

In [13]:
# Pkg.clone("git://github.com/aaw/IncrementalSVD.jl.git")

In [14]:
using IncrementalSVD

[1m[34mINFO: Recompiling stale cache file /home/slap/.julia/lib/v0.5/ProgressMeter.ji for module ProgressMeter.
[0m
Use "Dict(a=>b for (a,b) in c)" instead.


In [15]:
whos(IncrementalSVD)

                IncrementalSVD     57 KB     Module
                        Rating    136 bytes  DataType
                     RatingSet    148 bytes  DataType
                  RatingsModel    160 bytes  DataType
             cosine_similarity      0 bytes  IncrementalSVD.#cosine_similarity
          get_predicted_rating      0 bytes  IncrementalSVD.#get_predicted_rati…
                 item_features      0 bytes  IncrementalSVD.#item_features
                   item_search      0 bytes  IncrementalSVD.#item_search
                         items      0 bytes  IncrementalSVD.#items
    load_book_crossing_dataset      0 bytes  IncrementalSVD.#load_book_crossing…
  load_large_movielens_dataset      0 bytes  IncrementalSVD.#load_large_moviele…
  load_small_movielens_dataset      0 bytes  IncrementalSVD.#load_small_moviele…
                          rmse      0 bytes  IncrementalSVD.#rmse
                 similar_items      0 bytes  IncrementalSVD.#similar_items
                 similar_us

In [16]:
rating_set = load_small_movielens_dataset();

Downloading movie ratings data...



Extracting movie data...
Archive:  /tmp/ml-1m.zip
  inflating: /tmp/ml-1m/movies.dat   
Extracting ratings data...
Archive:  /tmp/ml-1m.zip
  inflating: /tmp/ml-1m/ratings.dat  






Loading ratings 100% Time: 0:00:03


In [17]:
fieldnames(rating_set)

4-element Array{Symbol,1}:
 :training_set 
 :test_set     
 :user_to_index
 :item_to_index

In [19]:
# The format is (user, movie, mark)
rating_set.training_set

900188-element Array{IncrementalSVD.Rating,1}:
 IncrementalSVD.Rating(4897,1459,3.0)
 IncrementalSVD.Rating(1340,2162,1.0)
 IncrementalSVD.Rating(4261,976,4.0) 
 IncrementalSVD.Rating(648,51,3.0)   
 IncrementalSVD.Rating(695,387,4.0)  
 IncrementalSVD.Rating(4447,678,4.0) 
 IncrementalSVD.Rating(3609,516,4.0) 
 IncrementalSVD.Rating(2496,601,5.0) 
 IncrementalSVD.Rating(1701,1005,4.0)
 IncrementalSVD.Rating(4212,1337,3.0)
 IncrementalSVD.Rating(5881,971,3.0) 
 IncrementalSVD.Rating(4169,1252,4.0)
 IncrementalSVD.Rating(1680,1297,4.0)
 ⋮                                   
 IncrementalSVD.Rating(866,86,5.0)   
 IncrementalSVD.Rating(3773,1441,2.0)
 IncrementalSVD.Rating(1902,1830,2.0)
 IncrementalSVD.Rating(635,2270,4.0) 
 IncrementalSVD.Rating(1422,1587,3.0)
 IncrementalSVD.Rating(1601,1031,5.0)
 IncrementalSVD.Rating(5557,377,4.0) 
 IncrementalSVD.Rating(3591,266,5.0) 
 IncrementalSVD.Rating(1605,406,4.0) 
 IncrementalSVD.Rating(627,583,3.0)  
 IncrementalSVD.Rating(4114,1467,5.0)
 In

In [20]:
rating_set.test_set

100021-element Array{IncrementalSVD.Rating,1}:
 IncrementalSVD.Rating(4169,197,3.0) 
 IncrementalSVD.Rating(3859,1656,4.0)
 IncrementalSVD.Rating(1425,1614,4.0)
 IncrementalSVD.Rating(3768,663,4.0) 
 IncrementalSVD.Rating(4517,816,3.0) 
 IncrementalSVD.Rating(2109,51,3.0)  
 IncrementalSVD.Rating(855,1296,4.0) 
 IncrementalSVD.Rating(877,39,4.0)   
 IncrementalSVD.Rating(3126,98,5.0)  
 IncrementalSVD.Rating(3192,168,5.0) 
 IncrementalSVD.Rating(245,896,2.0)  
 IncrementalSVD.Rating(5450,510,4.0) 
 IncrementalSVD.Rating(5990,206,3.0) 
 ⋮                                   
 IncrementalSVD.Rating(2180,1041,3.0)
 IncrementalSVD.Rating(3562,755,3.0) 
 IncrementalSVD.Rating(3390,1132,4.0)
 IncrementalSVD.Rating(4680,789,2.0) 
 IncrementalSVD.Rating(578,51,3.0)   
 IncrementalSVD.Rating(3483,2065,4.0)
 IncrementalSVD.Rating(5643,921,3.0) 
 IncrementalSVD.Rating(3410,542,3.0) 
 IncrementalSVD.Rating(3216,1353,4.0)
 IncrementalSVD.Rating(338,1979,1.0) 
 IncrementalSVD.Rating(1871,177,4.0) 
 In

In [21]:
# Users and their IDs
rating_set.user_to_index

Dict{AbstractString,Int32} with 6040 entries:
  "4304" => 4304
  "3935" => 3935
  "5422" => 5422
  "5734" => 5734
  "2243" => 2243
  "1881" => 1881
  "5425" => 5425
  "4209" => 4209
  "1907" => 1907
  "2923" => 2923
  "599"  => 599
  "2491" => 2491
  "5944" => 5944
  "228"  => 228
  "2590" => 2590
  "3697" => 3697
  "5031" => 5031
  "2579" => 2579
  "5551" => 5551
  "1880" => 1880
  "2562" => 2562
  "3215" => 3215
  "3991" => 3991
  "4652" => 4652
  "4088" => 4088
  ⋮      => ⋮

In [22]:
# Movies and their IDs
rating_set.item_to_index

Dict{AbstractString,Int32} with 3706 entries:
  "Fried Green Tomatoes (1991)"                                       => 594
  "Milk Money (1994)"                                                 => 1361
  "From Russia with Love (1963)"                                      => 729
  "House II: The Second Story (1987)"                                 => 1247
  "Held Up (2000)"                                                    => 3549
  "Missing in Action 2: The Beginning (1985)"                         => 2177
  "Murder, My Sweet (1944)"                                           => 996
  "Hidden, The (1987)"                                                => 981
  "Cable Guy, The (1996)"                                             => 669
  "Big Kahuna, The (2000)"                                            => 893
  "Addams Family Values (1993)"                                       => 1857
  "Farinelli: il castrato (1994)"                                     => 1945
  "Education of Little T

In [23]:
# We can extract the titles ...
keys(rating_set.item_to_index)

Base.KeyIterator for a Dict{AbstractString,Int32} with 3706 entries. Keys:
  "Fried Green Tomatoes (1991)"
  "Milk Money (1994)"
  "From Russia with Love (1963)"
  "House II: The Second Story (1987)"
  "Held Up (2000)"
  "Missing in Action 2: The Beginning (1985)"
  "Murder, My Sweet (1944)"
  "Hidden, The (1987)"
  "Cable Guy, The (1996)"
  "Big Kahuna, The (2000)"
  "Addams Family Values (1993)"
  "Farinelli: il castrato (1994)"
  "Education of Little Tree, The (1997)"
  "In God's Hands (1998)"
  "Last Man Standing (1996)"
  "Sixth Sense, The (1999)"
  "Star Maps (1997)"
  "Girl, Interrupted (1999)"
  "Stand by Me (1986)"
  "Rob Roy (1995)"
  "Caligula (1980)"
  "Flirting With Disaster (1996)"
  "Hook (1991)"
  "Institute Benjamenta, or This Dream People Call Human Life (1995)"
  ⋮

In [24]:
# or codes
values(rating_set.item_to_index)

Base.ValueIterator for a Dict{AbstractString,Int32} with 3706 entries. Values:
  594
  1361
  729
  1247
  3549
  2177
  996
  981
  669
  893
  1857
  1945
  2759
  2814
  1840
  39
  2670
  32
  85
  449
  2532
  1290
  657
  3397
  ⋮

In [25]:
# Which movies did the user "3000" grade?
user_ratings(rating_set, "3000")

93-element Array{Tuple{SubString{String},Float32},1}:
 ("Rock, The (1996)",5.0)                             
 ("Time Bandits (1981)",5.0)                          
 ("Babe (1995)",5.0)                                  
 ("Defending Your Life (1991)",5.0)                   
 ("One Flew Over the Cuckoo's Nest (1975)",5.0)       
 ("Brazil (1985)",5.0)                                
 ("When Harry Met Sally... (1989)",5.0)               
 ("American Beauty (1999)",5.0)                       
 ("Brothers McMullen, The (1995)",5.0)                
 ("Gattaca (1997)",5.0)                               
 ("Princess Bride, The (1987)",5.0)                   
 ("Caddyshack (1980)",5.0)                            
 ("Dances with Wolves (1990)",5.0)                    
 ⋮                                                    
 ("Romancing the Stone (1984)",2.0)                   
 ("Back to the Future (1985)",2.0)                    
 ("Mad Max 2 (a.k.a. The Road Warrior) (1981)",2.0)   
 ("Lost Wor

In [26]:
# Let us find the exact title and code for "Blade Runner"
for k in keys(rating_set.item_to_index)
    if contains(k,"Blade")
        println(k)
    end
end

Sling Blade (1996)
Blade (1998)
Blade Runner (1982)
Some Folks Call It a Sling Blade (1993)


LoadError: UnicodeError: invalid character index

In [27]:
get(rating_set.item_to_index,"Blade Runner (1982)",0)

744

In [28]:
# Did the user "3000" grade "Blade Runner" ?
for k in user_ratings(rating_set,"3000")
    if contains(k[1],"Blade")
        println(k)
    end
end

("Blade Runner (1982)",4.0f0)


In [29]:
# Did the user "3000" grade "Citizen Kane" ?
for k in user_ratings(rating_set,"3000")
    if contains(k[1],"Citizen")
        println(k)
    end
end

In [30]:
# This takes about 2.5 minutes
model = train(rating_set, 25);

Computing truncated rank 25 SVD 100% Time: 0:02:16


In [31]:
fieldnames(model)

5-element Array{Symbol,1}:
 :user_to_index
 :item_to_index
 :U            
 :S            
 :V            

In [32]:
model.U

6040×25 Array{Float32,2}:
 0.0110512   0.012132     0.0166615    0.00896368  …   0.0148011   0.015121  
 0.0140814   0.0110606    0.0104063    0.00985581      0.0108044   0.0111072 
 0.0104258   0.00908557   0.0167579    0.00823287      0.00923159  0.0129    
 0.00670599  0.00673408   0.0151975    0.00359263      0.0111083   0.0129639 
 0.0135804   0.0172373    0.00191356   0.0160518       0.00523406  0.00303589
 0.0103158   0.0165367    0.0152035    0.0161497   …   0.00952869  0.0130475 
 0.00877037  0.00742544   0.0183137    0.00462677      0.00735738  0.0165566 
 0.0155079   0.0156311    0.010671     0.0136543       0.00977127  0.0142739 
 0.0134644   0.00867299   0.0101724    0.00762071      0.00924866  0.0107447 
 0.0186961   0.0129793   -0.0103287    0.0245968       0.00453495  0.018872  
 0.0130506   0.0114443    0.00951001   0.0112533   …   0.0115466   0.00500921
 0.00626829  0.0077138    0.0143347    0.00556362      0.0136987   0.0149916 
 0.0124754   0.00987823   0.0127647   

In [33]:
model.S

25-element Array{Float32,1}:
 8193.12  
 2149.36  
  826.802 
 1149.36  
  844.998 
 1160.66  
  649.183 
  391.566 
  352.988 
  268.952 
  252.449 
  332.67  
  294.422 
  225.733 
  286.962 
  340.078 
  330.697 
  187.297 
  199.941 
  195.533 
  164.842 
  135.839 
  139.152 
  111.532 
   91.0994

In [34]:
model.V

3706×25 Array{Float32,2}:
 0.0360571   0.00828776  0.0676394   …   0.00146285  -0.000734074
 0.024943    0.0144925   0.019328       -0.00669053   0.00404501 
 0.0301463   0.0132188   0.0318842       0.00416647   0.00542532 
 0.0323433   0.0121883   0.0536472       0.0147098   -0.0119986  
 0.0314123   0.00737984  0.0552655       0.00644352  -0.00656439 
 0.0346807   0.00470635  0.0697331   …  -0.00802938  -0.0122308  
 0.0307082   0.0085307   0.0303255      -0.00056698   0.00655972 
 0.0331568   0.00379407  0.0513606       0.0283895   -0.0104648  
 0.0287726   0.00754044  0.0240452       0.00787751   0.0140193  
 0.0333241   0.0050905   0.0531577      -0.00319292   0.016238   
 0.0294117   0.00531615  0.0372617   …   0.00582767  -0.0121301  
 0.0175696   0.026541    0.00923563      0.0071839    0.0117079  
 0.025875    0.0199528   0.0128668       0.00632448   0.0017673  
 ⋮                                   ⋱                           
 0.0012003   0.00261312  0.00408757      0.0144919

In [35]:
similar_items(model, "Friday the 13th (1980)")

10-element Array{SubString{String},1}:
 "Friday the 13th (1980)"           
 "Cujo (1983)"                      
 "Friday the 13th Part 2 (1981)"    
 "Amityville Horror, The (1979)"    
 "Pet Sematary (1989)"              
 "Omen, The (1976)"                 
 "Jaws 2 (1978)"                    
 "Halloween II (1981)"              
 "Creepshow (1982)"                 
 "Friday the 13th Part 3: 3D (1982)"

In [36]:
@which similar_items(model, "Friday the 13th (1980)")

In [37]:
similar_items(model, "Citizen Kane (1941)")

10-element Array{SubString{String},1}:
 "Citizen Kane (1941)"                                                        
 "M*A*S*H (1970)"                                                             
 "Chinatown (1974)"                                                           
 "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
 "Rear Window (1954)"                                                         
 "Vertigo (1958)"                                                             
 "Lawrence of Arabia (1962)"                                                  
 "Boat, The (Das Boot) (1981)"                                                
 "African Queen, The (1951)"                                                  
 "Casablanca (1942)"                                                          

In [38]:
similar_users(model,"3000")

10-element Array{SubString{String},1}:
 "3000"
 "5613"
 "3479"
 "2202"
 "4501"
 "5619"
 "1360"
 "5572"
 "500" 
 "5419"

In [39]:
# What is the opinion of user "3000" about "Blade Runner"
# in the approximate model (the true mark is 4.0)?
get_predicted_rating(model, "3000", "Blade Runner (1982)")

4.0690255f0

In [40]:
# What is the opinion of user "3000" about Citizen Kane 
# (no true mark!) ?
IncrementalSVD.get_predicted_rating(model, "3000", "Citizen Kane (1941)")

4.018481f0
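
One way to read such a prediction: if a model stores factors `U`, `S` and `V` like the ones shown by `fieldnames(model)`, then entry $(u,m)$ of $U\Sigma V^T$ is $\sum_k U[u,k]\,S[k]\,V[m,k]$. The sketch below assumes Julia 1.x and uses factors from a plain truncated SVD of a hypothetical matrix; `predict` is a name introduced here, and this is not claimed to be how `get_predicted_rating` is implemented in IncrementalSVD.

```julia
using LinearAlgebra

# Entry (u, m) of U * Diagonal(S) * V', computed without forming the product.
predict(U, S, V, u, m) = dot(U[u, :] .* S, V[m, :])

# Hypothetical 4 x 4 rating matrix, used only to obtain some factors
M = [5.0 4.0 0.0 1.0;
     4.0 5.0 0.0 2.0;
     1.0 0.0 5.0 4.0;
     0.0 1.0 4.0 5.0]
F = svd(M)
k = 2
U, S, V = F.U[:, 1:k], F.S[1:k], F.V[:, 1:k]

Mk = U * Diagonal(S) * V'        # dense rank-k matrix, for comparison only
r = predict(U, S, V, 3, 2)       # approximate opinion of user 3 on movie 2
println(r)
```

Computing a single entry this way costs only $k$ multiplications, so the full approximation matrix never has to be stored.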

## Thank you for your attention

### Questions?