###### Introduction to Network Analysis 2023/24 (vi)

## Link betweenness, node similarity, errors & attacks

### II. Movie recommendations with PageRank

You are given a small knowledge graph of $1337$ movies in Pajek format ([movies_graph.net](http://lovro.fri.uni-lj.si/ina/nets/movies_graph.net)). Nodes represent either individual movies or their different *modes* such as language, country, genres, actors, director etc.



1. **(code)** Compute standard statistics of the network. Are the results expected?



In [None]:
import utils
import networkx as nx

movies = utils.read_pajek("movies_graph")
utils.info(movies, clustering_sample=len(movies))

if nx.is_bipartite(movies):
    print(f"{movies.name} is bipartite!")

  MultiGraph | 'movies_graph'
       Nodes | 6,577 (iso=0)
       Edges | 16,842 (loop=0)
      Degree | 5.12 (max=1,213)
         LCC | 100.0% (n=1)
  Clustering | 0.0000

movies_graph is bipartite!


Clustering is 0, as expected for a bipartite graph: it contains no odd cycles and therefore no triangles.

2. **(code)** Find the most important movies according to the PageRank algorithm $p_i=\alpha\sum_jA_{ij}\frac{p_j}{k_j}+\frac{1-\alpha}{n}$, where $A$ is the network adjacency matrix, $n$ is the number of network nodes, $k_i$ is the degree of node $i$ and $\alpha$ is the damping factor set to $0.85$. Which movies have the highest PageRank score?



In [None]:
G = movies
_ = utils.top_nodes(G, utils.pagerank(G), 'pagerank')

  Centrality | 'pagerank'
    0.000764 | 'Movie 43' (21)
    0.000604 | 'Parade' (18)
    0.000588 | 'Joyeux Noel' (23)
    0.000584 | 'Children of Men' (22)
    0.000583 | '7 Boxes' (17)
    0.000576 | 'Hitman's Bodyguard' (24)
    0.000568 | 'Nomad - The Warrior' (15)
    0.000563 | 'Hunting Party' (20)
    0.000560 | 'Moana' (18)
    0.000551 | 'Wonder Woman' (23)
    0.000548 | 'Saving Santa' (18)
    0.000546 | 'Turbo Kid' (17)
    0.000545 | 'Wild Life' (18)
    0.000543 | 'Valerian and the City of a Thousand Planets' (23)
    0.000542 | 'Blade Runner 2049' (23)



The top-ranked movie (*Movie 43*) is an anthology film featuring 14 different storylines and many famous actors, which boosts its PageRank score.
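The update rule from item 2 can be sketched as a plain power iteration. This is a minimal illustration, not the implementation behind `utils.pagerank`: the function name `pagerank`, the adjacency-list dictionary and the toy node names are assumptions made up for this example, and the code assumes an undirected graph without isolated nodes.

```python
def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for p_i = alpha * sum_j A_ij * p_j / k_j + (1 - alpha) / n."""
    n = len(adj)
    p = {i: 1.0 / n for i in adj}  # start from the uniform distribution
    for _ in range(max_iter):
        # every node first receives the uniform teleport share ...
        new = {i: (1.0 - alpha) / n for i in adj}
        # ... and then an equal share of each neighbour's current score
        for j, neighbours in adj.items():
            share = alpha * p[j] / len(neighbours)
            for i in neighbours:
                new[i] += share
        converged = sum(abs(new[i] - p[i]) for i in adj) < tol
        p = new
        if converged:
            break
    return p


# hypothetical toy movie-mode graph: one movie with many modes, one with few
toy = {
    "Anthology": ["actor A", "actor B", "actor C", "Comedy"],
    "Short": ["actor A", "Comedy"],
    "actor A": ["Anthology", "Short"],
    "actor B": ["Anthology"],
    "actor C": ["Anthology"],
    "Comedy": ["Anthology", "Short"],
}

scores = pagerank(toy)
for node in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[node]:.4f} | {node}")
```

In this toy graph the movie linked to the most modes collects the largest score, which mirrors the effect seen for *Movie 43*.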

3. **(code)** Consider random walks with restarts $p^t_i=\alpha\sum_jA_{ij}\frac{p^t_j}{k_j}+(1-\alpha)\delta_{it}$, where $t$ is a selected teleport node and $\delta$ is the Kronecker delta.
How could you use this algorithm to find movies similar to, e.g., *Moana*?



In [None]:
def similar_movies(title: str):
    tp = {utils.find_node(G, title)}
    pr = utils.pagerank(G, teleport=tp)  # every restart begins at the target movie (e.g. Moana)
    utils.top_nodes(G, pr, 'pagerank')
    return pr

moana_pr = similar_movies("Moana")

  Centrality | 'pagerank'
    0.223491 | 'Moana' (18)
    0.001624 | 'Saving Santa' (18)
    0.001515 | 'Frozen' (16)
    0.001493 | 'Smallfoot' (16)
    0.001449 | 'Lion King' (19)
    0.001444 | 'Hoodwinked' (17)
    0.001430 | 'Rio' (16)
    0.001341 | 'Book of Life' (16)
    0.001287 | 'Muppets Most Wanted' (15)
    0.001237 | 'Aladdin' (16)
    0.001196 | 'Into the Woods' (13)
    0.001133 | 'Greatest Showman' (11)
    0.001130 | 'Planet 51' (18)
    0.001033 | 'Sweeney Todd - The Demon Barber of Fleet Street' (14)
    0.000956 | 'Jumanji 2 - Welcome to the Jungle' (17)



Note that this heuristic was smart enough to find the live-action *Aladdin* among animated films.
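The random-walk reading of the formula in item 3 can also be simulated directly: a walker follows a random link with probability $\alpha$ and otherwise jumps back to the teleport node, and the visit frequencies approximate $p^t$. The sketch below is only an illustration on a made-up graph; `random_walk_with_restart`, its parameters and the node names are hypothetical and do not come from `utils`.

```python
import random
from collections import Counter


def random_walk_with_restart(adj, seed_node, alpha=0.85, steps=200_000, rng_seed=0):
    """Estimate restart-walk scores by counting visits of a simulated walker."""
    rng = random.Random(rng_seed)  # fixed seed so the run is repeatable
    visits = Counter()
    node = seed_node
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < alpha:
            node = rng.choice(adj[node])  # follow a random link
        else:
            node = seed_node  # restart at the teleport node
    return {v: visits[v] / steps for v in adj}


# hypothetical toy graph: movies A and B share three modes, A and C only one
toy = {
    "Movie A": ["Animation", "Musical", "English"],
    "Movie B": ["Animation", "Musical", "English"],
    "Movie C": ["Horror", "English"],
    "Animation": ["Movie A", "Movie B"],
    "Musical": ["Movie A", "Movie B"],
    "Horror": ["Movie C"],
    "English": ["Movie A", "Movie B", "Movie C"],
}

rwr = random_walk_with_restart(toy, "Movie A")
movies_only = [v for v in toy if v.startswith("Movie")]
for movie in sorted(movies_only, key=rwr.get, reverse=True):
    print(f"{rwr[movie]:.4f} | {movie}")
```

The teleport node itself ranks first, as *Moana* does above, and the movie that shares more modes with it ranks ahead of the one that shares fewer.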

In [None]:
# here you can play around with your favorite movies
for movie in ["Pulp Fiction", "The Room"]:
    print(f"similar to {movie}:")
    try:
        similar_movies(movie)
    except ValueError as err:
        print(err)

similar to Pulp Fiction:
  Centrality | 'pagerank'
    0.190888 | 'Pulp Fiction' (12)
    0.003497 | 'Funny Games' (11)
    0.003449 | 'Liability' (11)
    0.003126 | 'Hardcore Henry' (15)
    0.003075 | 'Incredible Hulk' (13)
    0.002930 | 'Death Proof' (10)
    0.002892 | 'Sin City' (12)
    0.002783 | 'Hateful Eight' (11)
    0.002775 | 'Django Unchained' (13)
    0.002756 | 'Inglourious Basterds' (11)
    0.002363 | 'I Am Wrath' (12)
    0.002288 | 'Bolt' (14)
    0.002224 | 'From Paris with Love' (16)
    0.002168 | 'Be Cool' (12)
    0.002102 | 'Killing Season' (11)

similar to The Room:
node 'The Room' not found in movies_graph


4. **(homework)** Consider the personalized PageRank algorithm $p^{[t]}_i=\alpha\sum_jA_{ij}\frac{p^{[t]}_j}{k_j}+(1-\alpha)[t]_i$, where $[t]$ is a selected personalization vector, $\sum_i[t]_i=1$. How could you use this algorithm to find movies similar to, e.g., dramas starring Tom Hanks, action and adventure movies with Johnny Depp or movies co-starring Brad Pitt and George Clooney?



In [None]:
conjunctive_queries = [
    {"m-Drama", "m-Tom Hanks"},
    {"m-Action", "m-Adventure", "m-Johnny Depp"},
    {"m-Brad Pitt", "m-George Clooney"}
]

for M in conjunctive_queries:
    # each mode-node is a candidate teleport target (with equal probability)
    utils.top_nodes(G, utils.pagerank(G, teleport = {utils.find_node(G, m) for m in M}), str(M))

  Centrality | '{'m-Tom Hanks', 'm-Drama'}'
    0.008388 | 'Forrest Gump' (10)
    0.008339 | 'Toy Story of Terror' (11)
    0.008328 | 'Polar Express' (11)
    0.008097 | 'Toy Story That Time Forgot' (14)
    0.008093 | 'Sully' (10)
    0.007797 | 'Captain Phillips' (14)
    0.007738 | 'Toy Story 3' (14)
    0.007590 | 'Angels & Demons' (10)
    0.007585 | 'Saving Private Ryan' (13)
    0.007544 | 'Saving Mr. Banks' (11)
    0.007077 | 'Inferno' (18)
    0.006968 | 'Da Vinci Code' (16)
    0.000786 | 'Kraftidioten' (19)
    0.000773 | 'Blade Runner 2049' (23)
    0.000714 | 'Blind Side' (11)

  Centrality | '{'m-Action', 'm-Johnny Depp', 'm-Adventure'}'
    0.003613 | 'Pirates of the Caribbean 2 - Dead Man's Chest' (15)
    0.003519 | 'Fantastic Beasts - The Crimes of Grindelwald' (13)
    0.003507 | 'Alice Through the Looking Glass' (11)
    0.003502 | 'Alice in Wonderland' (12)
    0.003489 | 'Pirates of the Caribbean 5 - Dead Men Tell No Tales' (16)
    0.003474 | 'Pirates of the C

5. **(discuss)** The examples above include only *positive* queries, which measure similarity between the movies and the selected modes. But how could we also handle *negative* queries, e.g., when you do not like romantic movies or a particular actor?

For *hard* negatives, we could simply skip unwanted query results. For *soft* negatives, we could do a separate run of personalized PageRank for undesired modes, and subtract these scores from the ones obtained for desired modes.
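Both strategies can be sketched on plain score dictionaries. The snippet is a hedged illustration only: `apply_negatives`, the `weight` factor and all scores are invented for this example, and in practice the two dictionaries would come from two separate personalized PageRank runs.

```python
def apply_negatives(positive, negative, banned=frozenset(), weight=1.0):
    """Combine a positive run with a negative run.

    positive, negative: dicts mapping movie -> personalized PageRank score
    banned: movies to drop entirely (hard negatives)
    weight: how strongly the negative scores count (an assumed tuning knob)
    """
    combined = {}
    for movie, score in positive.items():
        if movie in banned:
            continue  # hard negative: skip the result altogether
        # soft negative: subtract the score obtained for the undesired modes
        combined[movie] = score - weight * negative.get(movie, 0.0)
    return combined


# made-up scores, for illustration only
positive = {"Movie A": 0.008, "Movie B": 0.007, "Movie C": 0.006}
negative = {"Movie A": 0.005, "Movie B": 0.001}

result = apply_negatives(positive, negative, banned={"Movie C"})
ranked = sorted(result, key=result.get, reverse=True)
print(ranked)
```

Here the soft negative moves the hypothetical "Movie A" below "Movie B", while the hard negative removes "Movie C" from the list.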

Similarly, for *disjunctive* queries (e.g. *either* Tom Hanks *or* Brad Pitt), we could do a separate run for each query term and take the *maximum* score.
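A minimal sketch of the maximum rule follows, assuming each run is available as a dictionary of scores. The helper name `disjunctive_scores` and the numbers are hypothetical.

```python
def disjunctive_scores(*runs):
    """Keep, for every movie, the best score it reaches in any single run."""
    combined = {}
    for run in runs:
        for movie, score in run.items():
            combined[movie] = max(score, combined.get(movie, float("-inf")))
    return combined


# made-up scores from two separate single-term runs
run_first_term = {"Movie A": 0.008, "Movie B": 0.001}
run_second_term = {"Movie A": 0.002, "Movie B": 0.007, "Movie C": 0.003}

either = disjunctive_scores(run_first_term, run_second_term)
for movie in sorted(either, key=either.get, reverse=True):
    print(f"{either[movie]:.6f} | {movie}")
```

Unlike the conjunctive queries in item 4, which place all terms in one teleport set, a movie only needs to score well for one of the terms to rank highly here.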