If pyvis or networkx is not installed in the working kernel, uncomment the lines starting with `# !conda` in the cell below and run the cell to install these packages.

In [3]:
# install pyvis + networkx if they are missing from the current kernel (may take some time); it is assumed that the working
# environment was built with Anaconda
import sys

# pyvis installation 
# !conda install --yes --prefix {sys.prefix} -c conda-forge pyvis
# networkx installation
# !conda install --yes --prefix {sys.prefix} -c anaconda networkx
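
Whether the installation lines are needed can be checked before uncommenting them. The cell below is a minimal sketch that uses only the standard library; the list name `required` is an illustrative choice and not part of the original notebook.

```python
import importlib.util

# Package names as they are imported in this notebook.
required = ["pyvis", "networkx"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If the list of missing packages is empty, the installation cell above can be left commented out.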

In [76]:
import pandas as pd
import numpy as np
import networkx as nx

from pyvis.network import Network
from collections import defaultdict

import os
import rich
import json
import itertools

# path to current directory
path = os.getcwd()

# Network analysis


This code builds a network composed of users and the movies they rated, based on the data used in the implemented recommenders. The analysis covers finding similar users based on the movies they rated, basic social network analysis, and detecting communities inside the network.

## Import data

Data are imported from a pickle file containing titles, users, sentiment values and ratings, and are reshaped into a dataframe with the attributes ["username", "title", "sentiment", "rating"]. Each row represents one user's rating of a given movie alongside the corresponding sentiment score.

In [7]:
# import data
raw_data = pd.read_pickle(os.path.join(path, "network_data", "rating_and_sentiment.pkl"))

# initiate data values
net_data = []

# fill list with values
for row in raw_data.index:
    # dictionaries of sentiment and rating values for each user
    dict_sen = raw_data.loc[row, "user_sentiment"] 
    dict_rat = raw_data.loc[row, "rating_final"]
    for key in dict_sen:
        # check for null values
        if dict_rat[key] == "Null":
            rat_val = 5
        else:
            rat_val = float(dict_rat[key])
        new_row = [key, raw_data.loc[row, "original_title"], dict_sen[key], rat_val]
        net_data.append(new_row)

# creating a dataframe 
net_data = pd.DataFrame(net_data, columns = ["username", "title", "sentiment", "rating"])

# show snapshot of dataframe
net_data.sample(5)

Unnamed: 0,username,title,sentiment,rating
28080,jfgibson73,Ted,5.480657,6.0
30149,m_madhu,The Deer Hunter,5.343544,6.0
17916,ayoreinf,Lucy,5.441581,7.0
11147,Navaf,Fury,5.412515,8.0
20469,bestfootie,Mystic River,5.556242,6.0
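
The reshaping step above turns one row per movie, holding per-user dictionaries, into one row per (user, movie) pair. The sketch below reproduces that logic on invented toy data in plain Python; `raw_rows`, `DEFAULT_RATING` and all values are assumptions made for illustration, and ratings are assumed to be stored as strings with "Null" marking a missing value.

```python
# Toy stand-in for two rows of raw_data; the values are invented for illustration.
raw_rows = [
    {
        "original_title": "Movie A",
        "user_sentiment": {"alice": 5.4, "bob": 5.6},
        "rating_final": {"alice": "8", "bob": "Null"},
    },
    {
        "original_title": "Movie B",
        "user_sentiment": {"alice": 5.1},
        "rating_final": {"alice": "6"},
    },
]

DEFAULT_RATING = 5.0  # same fallback the notebook uses for "Null" ratings

flat = []
for row in raw_rows:
    for user, sentiment in row["user_sentiment"].items():
        raw_rating = row["rating_final"][user]
        rating = DEFAULT_RATING if raw_rating == "Null" else float(raw_rating)
        flat.append((user, row["original_title"], sentiment, rating))

for record in flat:
    print(record)
```

A list of tuples like `flat` can be passed to `pd.DataFrame` with the same column names as in the cell above.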


## Network preparation

### Vertex set

In this part the vertex set of the network is prepared. The vertices of the analysed graph are users and movies, and each type carries several attributes.

#### Username

The nodes for this group will contain:

    - user name ["username"]
    - number of movies this user has reviewed ["revs_num"]
    
The suffix "_u" is appended to each username to avoid node collisions when a username and a movie title are identical.

In [8]:
# count values of each user's reviews and save outcome to dataframe
user_nodes = net_data.username.value_counts().to_frame(name = "revs_num")
user_nodes.reset_index(inplace = True)

# change column name that was affected by index reset
user_nodes = user_nodes.rename(columns = {"index" : "username"})

# add subscript
user_nodes["username"] = user_nodes["username"].apply(lambda x: x + "_u")
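
Why the suffix matters can be seen on a small example. In NetworkX, adding a node whose identifier already exists does not create a second node; it only updates the attributes of the existing one. The sketch below assumes a hypothetical user whose username equals a movie title; `G_plain` and `G_tagged` are illustrative names.

```python
import networkx as nx

# Hypothetical case: a user who picked the title of a movie as a username.
G_plain = nx.DiGraph()
G_plain.add_node("Gladiator", node_type="user")
G_plain.add_node("Gladiator", node_type="movie")  # updates the existing node

# With suffixes the two entities stay separate.
G_tagged = nx.DiGraph()
G_tagged.add_node("Gladiator_u", node_type="user")
G_tagged.add_node("Gladiator_m", node_type="movie")

print(G_plain.number_of_nodes(), G_plain.nodes["Gladiator"])
print(G_tagged.number_of_nodes())
```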

The resulting table of user nodes and their attributes, together with the total node count, is shown below.

In [9]:
rich.print(f"\n\n[bold]Total number of username nodes (number of users): {len(user_nodes.index)}")
user_nodes.head(5)

Unnamed: 0,username,revs_num
0,SnoopyStyle_u,198
1,anaconda-40658_u,134
2,jboothmillard_u,129
3,g-bodyl_u,106
4,kosmasp_u,104


#### Movie

Prepare movie nodes for network. The nodes for this group will contain:

    - movie title ["title"]
    - average sentiment score ["avg_sen"]
    - average rating ["avg_rat"]
    - number of reviews considered for this movie ["revs_num"]
    
The suffix "_m" is appended to each movie title to avoid node collisions when a username and a movie title are identical.

In [10]:
# create a Series containing the unique movie titles
movie_nodes = net_data["title"].drop_duplicates()
movie_nodes.reset_index(inplace = True, drop = True)

# calculate number of reviews for each movie and append it as a column to dataframe
movie_temp = net_data.title.value_counts().to_frame("revs_num")
movie_temp.reset_index(inplace = True)
movie_temp.rename(columns = {"index" : "title"}, inplace = True)
movie_nodes = pd.merge(movie_nodes, movie_temp, on = "title")

# calculate average sentiment and average rating and append them to dataframe
movie_temp = net_data[["title", "sentiment", "rating"]].groupby(["title"]).mean()
movie_temp.reset_index(inplace = True)
movie_temp.rename(columns = {"index" : "title"}, inplace = True)
movie_nodes = pd.merge(movie_nodes, movie_temp, on = "title")

# delete unused dataframe from memory
del movie_temp

# add subscript
movie_nodes["title"] = movie_nodes["title"].apply(lambda x: x + "_m")

# rename columns
movie_nodes = movie_nodes.rename(columns = {"sentiment" : "avg_sen", "rating" : "avg_rat"})
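
The same movie table can also be produced in a single pass with pandas named aggregation. This is an alternative sketch on invented toy data, not the method used above; `toy` and `movie_nodes_alt` are illustrative names.

```python
import pandas as pd

# Toy reviews; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "sentiment": [5.4, 5.6, 5.1],
    "rating": [8.0, 6.0, 7.0],
})

movie_nodes_alt = (
    toy.groupby("title")
       .agg(revs_num=("rating", "count"),      # number of non-missing ratings
            avg_sen=("sentiment", "mean"),
            avg_rat=("rating", "mean"))
       .reset_index()
)

# add the movie suffix, as in the cell above
movie_nodes_alt["title"] = movie_nodes_alt["title"] + "_m"
print(movie_nodes_alt)
```

Using "count" here relies on ratings never being missing, which holds in this notebook because "Null" ratings were replaced during import.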

The resulting table of movie nodes and their attributes, together with the total node count, is shown below.

In [11]:
rich.print(f"\n\n[bold]Total number of movie nodes (number of movies): {len(movie_nodes.index)}")
movie_nodes.head(5)

Unnamed: 0,title,revs_num,avg_sen,avg_rat
0,10 Cloverfield Lane_m,50,5.4083,7.5
1,10 Things I Hate About You_m,50,5.534746,7.62
2,12 Angry Men_m,50,5.397162,9.32
3,12 Years a Slave_m,50,5.370843,7.88
4,127 Hours_m,50,5.419699,7.52


### Edge set

The dataframe of directed edges (from node, to node) is derived from the reviews dataframe imported at the beginning. In addition, each row carries the sentiment value and the rating value of the corresponding review.

In [12]:
edges = net_data
edges = edges.rename(columns = {"username" : "from", "title" : "to"})
rich.print(f"\n\n[bold]Total number of edges: {len(edges.index)}")
# adding subscripts
edges["from"] = edges["from"].apply(lambda x: x + "_u")
edges["to"] = edges["to"].apply(lambda x: x + "_m")
# example of what the relations look like
edges.sample(5)

Unnamed: 0,from,to,sentiment,rating
4887,maverick-vishal_u,Braveheart_m,5.372468,7.0
14704,gsic_batou_u,In Bruges_m,5.611893,8.0
37684,Leofwine_draca_u,Tinker Tailor Soldier Spy_m,5.441586,2.0
11349,tccandler_u,Garden State_m,5.586896,10.0
6028,namashi_1_u,Cast Away_m,5.469384,10.0


## Create network graph with use of NetworkX

The graph is directed, since each review goes from a user to a specific movie. Two types of nodes are defined: user nodes and movie nodes. These vertices have different attributes. The size of a node is defined as the number of edges going from/to that node. Two attributes are common to both node types:

    - size of node
    - node_type -> specifies whether the node is a user node or a movie node
    
### Defining nodes (vertices) and edges

In [13]:
# initiate empty graph structure
G = nx.DiGraph()

# add user nodes to the network
for _, users in user_nodes.iterrows():
    G.add_node(users["username"], 
               size = users["revs_num"], 
               node_type = "user")
    
# add movie nodes to the network
for _, movie in movie_nodes.iterrows():
    G.add_node(movie["title"], 
               size = movie["revs_num"],
               sentiment = movie["avg_sen"], 
               rating = movie["avg_rat"],  
               node_type = "movie")
    
# add edges to the network
for _, edge in edges.iterrows():
    # sanity check whether such nodes exist in graph
    if (G.has_node(edge["from"]) and G.has_node(edge["to"])):
        # add the edge to the network
        G.add_edge(edge["from"], edge["to"])

### The network

In [14]:
rich.print(f"[bold]The network is a directed graph containing the following number of vertices and edges:")
rich.print(G)

# iterate through nodes in network and count nodes number of each type to ensure all nodes were defined correctly
u_nodes = 0
m_nodes = 0
for node in G.nodes():
    if G.nodes[node]["node_type"] == "user":
        u_nodes += 1
    else:
        m_nodes += 1

rich.print(f"Number of user nodes: {u_nodes}")
rich.print(f"Number of movie nodes: {m_nodes}")
rich.print(f"[bold]User node example: ", G.nodes["SnoopyStyle_u"])
rich.print(f"[bold]Movie node example: ", G.nodes["Gladiator_m"])

The node "size" attribute should be equal to the in-degree (number of edges that terminate at vertex $x_1$) for every movie node $x_1$, and to the out-degree (number of edges that originate from vertex $x_2$) for every user node $x_2$.

In [15]:
# check whether all nodes have assigned correct size numbers
i = 0
for node in G.nodes():
    if G.nodes[node]["node_type"] == "user":
        # out degree value for user node
        degree = G.out_degree(node)
    else:
        # in degree value for movie node
        degree = G.in_degree(node)
    # check for correct size number
    if degree != G.nodes[node]["size"]:
        print("Error: incorrect size value in node ", node)

### Network findings

Several metrics were derived from the created network: the top 10 most frequently rated movies, the top 10 users with the most ratings written, the 10 least rated movies and 10 users with the lowest number of rated movies. The same could be done with pandas, but for network analysis purposes the values are derived directly from the network.

In [16]:
# create dictionary with nodes and their in/out degrees
out_stat = {key: value for key, value in sorted(G.out_degree(), key = lambda item: item[1], reverse=True)}
in_stat = {key: value for key, value in sorted(G.in_degree(), key = lambda item: item[1], reverse=True)}

top_users = []
top_movies = []

# make sorted list of users
for node in out_stat:
    if G.nodes[node]["node_type"] == "user":
        top_users.append(node)

# make sorted list of movies
for node in in_stat:
    if G.nodes[node]["node_type"] == "movie":
        top_movies.append(node)
        
# print top tens

# users with highest number of ratings
rich.print(f"[bold]Users with highest amount of reviewed movies:")
for i, user in enumerate(top_users[:10]):
    # variable number of text space to make neat layout
    space = " "*(25-len(user))
    print(f"{i+1}. {user[:-2]}", space, f"-> {out_stat[user]} review")

# users with lowest number of ratings
rich.print(f"[bold]Users with lowest amount of reviewed movies:")
for user in top_users[-10:]:
    # variable number of text space to make neat layout
    space = " "*(25-len(user))
    print(f"{user[:-2]}", space, f"-> {out_stat[user]} reviews")

# movies with highest number of ratings
rich.print(f"[bold]Movies with highest number of reviews:")
for i, movie in enumerate(top_movies[:10]):
    # variable number of text space to make neat layout
    space = " "*(50-len(movie))
    print(f"{i+1}. {movie[:-2]}", space, f"-> {in_stat[movie]} reviews")

# movies with lowest number of ratings
rich.print(f"[bold]Movies with lowest number of reviews:")
for i, movie in enumerate(reversed(top_movies[-10:])):
    # variable number of text space to make neat layout
    space = " "*(50-len(movie))
    print(f"{i+1}. {movie[:-2]}", space, f"-> {in_stat[movie]} reviews")

1. SnoopyStyle              -> 198 review
2. anaconda-40658           -> 134 review
3. jboothmillard            -> 129 review
4. g-bodyl                  -> 106 review
5. kosmasp                  -> 104 review
6. Smells_Like_Cheese       -> 103 review
7. Quinoa1984               -> 103 review
8. michaelRokeefe           -> 103 review
9. lee_eisenberg            -> 101 review
10. claudio_carvalho         -> 98 review


Luuk-2                   -> 1 reviews
boyznick                 -> 1 reviews
Asterios                 -> 1 reviews
noahhaye                 -> 1 reviews
eyal philippsborn        -> 1 reviews
E-Z-Rider                -> 1 reviews
Junker-2                 -> 1 reviews
adamsmith-51004          -> 1 reviews
rsabnis1                 -> 1 reviews
armycrazy                -> 1 reviews


1. 10 Cloverfield Lane                               -> 50 reviews
2. 10 Things I Hate About You                        -> 50 reviews
3. 12 Angry Men                                      -> 50 reviews
4. 12 Years a Slave                                  -> 50 reviews
5. 127 Hours                                         -> 50 reviews
6. 1408                                              -> 50 reviews
7. 1941                                              -> 50 reviews
8. 2001: A Space Odyssey                             -> 50 reviews
9. 2012                                              -> 50 reviews
10. 21                                                -> 50 reviews


1. Little Boy                                        -> 30 reviews
2. Saints and Soldiers                               -> 31 reviews
3. In the Land of Blood and Honey                    -> 31 reviews
4. Step Up 2: The Streets                            -> 32 reviews
5. Peaceful Warrior                                  -> 33 reviews
6. Airlift                                           -> 33 reviews
7. Hardball                                          -> 37 reviews
8. Rang De Basanti                                   -> 38 reviews
9. Bloody Sunday                                     -> 39 reviews
10. He Got Game                                       -> 41 reviews


### Network visualisation

Since the designed network is very large, visualising it with standard Python libraries and the available computational resources may be nearly impossible. Therefore, for imaging purposes, a truncated sample network is created with a much smaller number of nodes and edges. The network is visualised with the pyvis tool, so some naming rules are changed to make the visualisation clearer. This subnetwork is not treated as the developed network; it is built solely for graph drawing.

In [164]:
# graph properties

# number of user nodes
un = 200                   
# number of movie nodes
mn = 200
# size scaling factor
ssf = 20

In [165]:
# create truncated network with un user nodes and mn movie nodes as a directed graph structure
GT = nx.DiGraph()

# add user nodes to the network
for _, users in user_nodes.head(un).iterrows():
    GT.add_node(users["username"],
               group = "user")
    
# add movie nodes to the network
for _, movie in movie_nodes.sample(mn).iterrows():
    GT.add_node(movie["title"],
                sentiment = movie["avg_sen"], 
                rating = movie["avg_rat"],  
                group = "movie")
    
# add edges to the network
for _, edge in edges.iterrows():
    # sanity check whether such nodes exist in graph
    if (GT.has_node(edge["from"]) and GT.has_node(edge["to"])):
        # add the edge to the network
        GT.add_edge(edge["from"], edge["to"])
        
# assign size to each node based on in/out degree size
for node in GT.nodes():
    size = GT.in_degree(node) + GT.out_degree(node)
    GT.nodes[node]["size"] = size * ssf                   # size scaled by factor ssf to make nodes more visible

# print statistics
rich.print(GT)

# count number of nodes in each group
u_nodes = 0
m_nodes = 0
for node in GT.nodes():
    if GT.nodes[node]["group"] == "user":
        u_nodes += 1
    else:
        m_nodes += 1

rich.print(f"Number of user nodes: {u_nodes}")
rich.print(f"Number of movie nodes: {m_nodes}")

Network graph generation may take some time. Furthermore, the following message may appear:

<font color='red'>Local cdn resources have problems on chrome/safari when used in jupyter notebook.</font>
    
The above message can be ignored. Even though it appears, the graph visualisation is displayed after a while. When pyvis is embedded in a Jupyter notebook, some of the interesting features of the package cannot be used. When the figure is rendered from the generated html file, passing the kwargs [*select\_menu = True, filter\_menu = True*] to the Network constructor lets the user inspect selected nodes and the connections between them. In the second part of the code an .html file is created in the current working directory. Opening the file locally from the file explorer displays it in the default browser, where all features of pyvis may be used.

<b>CONCLUSION NOTE:</b>

The 3 cells below do not need to be run; they can be skipped, continuing directly with the "Communities" part. Instead, the .html files contained in Network file can be opened in a browser without executing the code. Executing the first cell prints, after a short while, the same network as in the generated .html file, but inside the Jupyter console.

In [166]:
rich.print(f"[bold]Nodes and their colours legend:")
rich.print(f"[blue]BLUE [black]nodes are users\n[yellow]YELLOW [black]nodes are movies")

In [167]:
net = Network(height = 900, width = 900, notebook = True)
net.toggle_hide_edges_on_drag(True)
net.barnes_hut()
net.from_nx(GT)
net.show("jupyter_net.html")

Local cdn resources have problems on chrome/safari when used in jupyter-notebook. 


In [168]:
# saving network with filters to html file
net2 = Network(height = 900, width = 900, select_menu = True, filter_menu = True)
net2.barnes_hut()
net2.from_nx(GT)
net2.show_buttons(filter_ = ["physics"])
net2.show("localhost_net.html")

### Communities

Communities are calculated with the Girvan-Newman algorithm (this may take some time).

Since the network contains both user and movie nodes, the initial community division is computed over all nodes; for the user community investigation a filter is then applied that leaves only users inside the communities. Furthermore, the communities are calculated on the truncated network to make them easier to read. The original network contains too many nodes, and community estimation on it took a tremendous amount of time.

In [180]:
def get_usr(community: tuple) -> tuple:
    """
    Remove all non-user members (i.e. movies) from each community and return the cleaned communities.
    Communities that contain no users are dropped. The input is left unchanged.
    """
    # keep only members without the movie suffix
    users_only = ([member for member in comm if not member.endswith("_m")] for comm in community)
    # drop communities that became empty
    return tuple(comm for comm in users_only if comm)

# calculate communities; next() below takes only the first split produced by the Girvan-Newman generator
c_gn = nx.algorithms.community.centrality.girvan_newman(GT)

best_community_gn = tuple(sorted(c) for c in next(c_gn))
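
`girvan_newman` returns a generator that yields one partition per level of the hierarchy, each with one more community than the previous one. One common way to pick a level is to keep the partition with the highest modularity. The sketch below shows this on a small undirected toy graph rather than on GT; `G_toy` and the cap `max_levels` are assumptions made for illustration.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Toy undirected graph: two triangles joined by a single bridge edge.
G_toy = nx.Graph()
G_toy.add_edges_from([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

best_partition, best_q = None, float("-inf")
# Inspect only the first few levels; max_levels is an arbitrary cap for this sketch.
max_levels = 4
for partition in itertools.islice(girvan_newman(G_toy), max_levels):
    q = modularity(G_toy, partition)
    if q > best_q:
        best_partition, best_q = partition, q

print(sorted(sorted(c) for c in best_partition), round(best_q, 3))
```

On a graph of the size of GT every additional level requires recomputing edge betweenness, so the number of inspected levels is a trade-off between runtime and partition quality.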

In [181]:
rich.print(f"[bold][red]Number of communities within the best distribution: {len(best_community_gn)}")
rich.print(f"[bold][red]Best distributed communities:")
for i in best_community_gn:
    print(f"{i}, \n\n")

['10 Cloverfield Lane_m', '10 Things I Hate About You_m', '12 Angry Men_m', '1408_m', '1941_m', '2001: A Space Odyssey_m', '30 Days of Night_m', '3xHCCH_u', '50 First Dates_m', '851222_u', 'A_Different_Drummer_u', 'Aaron1375_u', 'About Time_m', 'Alien: Resurrection_m', 'Aliens_m', 'Amadeus_m', 'American Gangster_m', 'American Hustle_m', 'American Pie_m', 'American Psycho_m', 'Anonymous_Maxine_u', 'Antz_m', 'Apocalypse Now_m', 'Atonement_m', 'BA_Harrison_u', 'Bad Teacher_m', 'Beetlejuice_m', 'Before Sunset_m', 'Begin Again_m', 'Black Hawk Down_m', 'Boba_Fett1138_u', 'Bolt_m', 'Bored_Dragon_u', 'Bridge of Spies_m', 'Buddy-51_u', 'Butch Cassidy and the Sundance Kid_m', 'Captain America: The Winter Soldier_m', 'Cars_m', 'Casino Royale_m', 'Casino_m', 'Catch Me If You Can_m', 'Charlie and the Chocolate Factory_m', 'Chris_Docker_u', 'CinematicInceptions_u', 'Clash of the Titans_m', 'Cold Mountain_m', 'Collateral_m', 'Concussion_m', 'Control_m', 'Crash_m', 'DICK STEEL_u', 'DKosty123_u', 'Dall

In [182]:
# extracting users communities
com_gn = get_usr(best_community_gn)
rich.print(f"[bold][red]Number of communities within the best distribution: {len(com_gn)}")
rich.print(f"[bold][red]Best distributed communities:")
for i in com_gn:
    print(f"{i}\n\n")

['3xHCCH_u', '851222_u', 'A_Different_Drummer_u', 'Aaron1375_u', 'Anonymous_Maxine_u', 'BA_Harrison_u', 'Boba_Fett1138_u', 'Bored_Dragon_u', 'Buddy-51_u', 'Chris_Docker_u', 'CinematicInceptions_u', 'DICK STEEL_u', 'DKosty123_u', 'DarkVulcan29_u', 'Desertman84_u', 'Dr_Coulardeau_u', 'EUyeshima_u', 'EijnarAmadeus_u', 'ElMaruecan82_u', 'Electrified_Voltage_u', 'Elswet_u', 'Enchorde_u', 'FeastMode_u', 'FlashCallahan_u', 'Floated2_u', 'Fluke_Skywalker_u', 'GiraffeDoor_u', 'Hellmant_u', 'Hitchcoc_u', 'Horst_In_Translation_u', 'JamesHitchcock_u', 'KineticSeoul_u', 'Lejink_u', 'Leofwine_draca_u', 'LeonLouisRicci_u', 'Luigi Di Pilla_u', 'MLDinTN_u', 'MR_Heraclius_u', 'MartinHafer_u', 'Matt_Layden_u', 'MaxBorg89_u', 'MovieAddict2016_u', 'Movie_Muse_Reviews_u', 'Mr-Fusion_u', 'Muhammad_Rafeeq_u', 'OllieSuave-007_u', 'OriginalMovieBuff21_u', 'PWNYCNY_u', 'Pjtaylor-96-138044_u', 'Prismark10_u', 'Quinoa1984_u', 'Robert_duder_u', 'Ryan_MYeah_u', 'Sandcooler_u', 'Screen_Blitz_u', 'Seraphion_u', 'Sirus

Because the network is truncated and contains 2 types of nodes, all of which are considered while computing communities, the result is often one huge community and many singleton communities. The distribution would likely look much better if the whole dataset of reviews were included, but computation time would increase tremendously for all parts of this project.

## Similar users and frequent movie pairs

In this part, similar users (based on Jaccard similarity) and movies frequently reviewed by the same users are investigated. First, all singletons in both the user set and the movie set are hashed, i.e. mapped to integer ids, to make all computations significantly faster. Generating the movie_baskets dictionary may take some time.

In [20]:
# hash users - hash by assigning each user a numerical value
users = dict()
i = 0
for user in net_data.username.unique():
    users[user] = i
    i += 1

# hash movies
movies = dict()
i = 0
for title in net_data.title.unique():
    movies[title] = i
    i += 1
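
The encoding above can be paired with an inverse lookup, which makes decoding an id back to a name a constant-time list access. The sketch below uses an invented list of usernames; `usernames`, `user_to_id` and `id_to_user` are illustrative names.

```python
# Toy list of usernames in order of appearance (invented for illustration).
usernames = ["alice", "bob", "alice", "carol", "bob"]

# dict.fromkeys keeps the first occurrence of each name, preserving order.
unique_users = list(dict.fromkeys(usernames))

# forward mapping: name -> integer id
user_to_id = {name: idx for idx, name in enumerate(unique_users)}
# inverse mapping: the list index is the integer id
id_to_user = unique_users

print(user_to_id)
print(id_to_user[2])
```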

The movie basket dictionary, which maps each hashed user to the list of (also hashed) movies they rated, was prepared and exported earlier, since building it in code takes a significant amount of time. The following code was used to export this data:
```python
export = defaultdict(list)
for user in net_data.username.unique():
    user_hash = users[user]
    for i in net_data.index:
        if net_data.loc[i, 'username'] == user:
            movie_hash = movies[net_data.loc[i, "title"]]
            export[user_hash].append(movie_hash)

# save as json file
json.dump(export, open("network_data/movie_basket.json", 'w'))
```
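
The export above scans the whole dataframe once per user. A single pass over the rows gives the same kind of structure much faster. The sketch below works on invented (user id, movie id) pairs in plain Python; `encoded_rows` and `baskets` are illustrative names. It also shows why the keys of `movie_baskets` are strings after loading: JSON object keys are always strings.

```python
import json
from collections import defaultdict

# Toy (user_id, movie_id) pairs standing in for the encoded rows of net_data.
encoded_rows = [(0, 10), (1, 10), (0, 11), (2, 12), (0, 12)]

# single pass: append each movie to the basket of its user
baskets = defaultdict(list)
for user_id, movie_id in encoded_rows:
    baskets[user_id].append(movie_id)

# JSON object keys are always strings, so integer keys come back as str.
restored = json.loads(json.dumps(baskets))
print(restored)
```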

In [21]:
# import prepared data
movie_baskets = json.load(open("network_data/movie_basket.json"))

### Similar users

A Jaccard similarity function is implemented and then all similarities between users are calculated. Since some users have many more reviews than others, the Jaccard similarity could be modified to favour users who share many movies over, for example, users who have 2 out of 2 rated movies in common (in which case the Jaccard similarity would be 1). Since our data was selected in a specific manner, users are compared only if at least one of them has at least 5 reviews. Calculation of the similarity matrix may take some time, since there are more than $0.5*10000^2$ entries to calculate.

In [22]:
def Jaccard_similarity(user_1: str, user_2: str) -> float:
    """
    Jaccard similarity between 2 users, identified by their keys in movie_baskets. It does not apply the 5-movie
    threshold described above, since that is handled in the code that calculates all similarities.
    """
    # common movies for both users (intersection of set)
    common = len(set(movie_baskets[user_1]).intersection(movie_baskets[user_2]))
    # all movies both users reviewed (union of set)
    union = len(set(movie_baskets[user_1]).union(movie_baskets[user_2]))
    return common/union    
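
The effect described above, where two users with very few reviews reach a perfect score, can be checked on invented baskets. The sketch below is a standalone version of the measure; `jaccard`, `comparable` and the toy baskets are illustrative names and values, not part of the notebook.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty baskets
    return len(set_a & set_b) / len(union)


MIN_REVIEWS = 5  # same threshold value as rev_tr in the notebook


def comparable(a, b, min_reviews=MIN_REVIEWS):
    """A pair is compared only if at least one basket reaches the threshold."""
    return len(a) >= min_reviews or len(b) >= min_reviews


# Invented baskets of movie ids.
casual_1, casual_2 = [1, 2], [1, 2]
heavy_1 = list(range(0, 100))
heavy_2 = list(range(10, 110))

print(jaccard(casual_1, casual_2), comparable(casual_1, casual_2))
print(jaccard(heavy_1, heavy_2), comparable(heavy_1, heavy_2))
```

The two casual users share 2 of 2 movies and score 1, while the two heavy users share 90 movies and score 90/110; the threshold excludes the first pair from the comparison.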

In [23]:
# create matrix storing all similarities between users
similarities = np.zeros([len(users), len(users)])
# iterate through users
iter_list = list(movie_baskets.keys())
# reviews threshold 
rev_tr = 5

# filling the matrix with Jaccard similarities but only if described condition is met (5 movies)
for user_1 in iter_list:
    # Jaccard similarity is symmetric, so each pair is computed only once (upper triangle of the matrix)
    buffer = int(user_1) + 1
    for user_2 in iter_list[buffer:]:
        if len(movie_baskets[user_1]) >= rev_tr or len(movie_baskets[user_2]) >= rev_tr:
            similarities[int(user_1), int(user_2)] = Jaccard_similarity(user_1, user_2) 
        else:
            similarities[int(user_1), int(user_2)] = 0 

The top 10 pairs of similar users can then be extracted by searching for the highest Jaccard similarities and decoding the user hashes back to usernames.

In [49]:
# define function for returning row and col index from row-major coordinate
def retr_coor(val: int, dim: int) -> tuple:
    """
    This function takes a flat (row-major) index into a numpy array with dim columns and converts it to the
    corresponding row index and column index.
    """
    # integer division gives the row, the remainder gives the column
    row_coor, col_coor = divmod(int(val), dim)
    return row_coor, col_coor
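
NumPy offers the same conversion as a built-in, `np.unravel_index`, which also handles whole arrays of flat indices at once. The sketch below applies it to a small invented similarity matrix; `sim` and `top_k` are illustrative names.

```python
import numpy as np

# Invented upper-triangular similarity matrix for 3 users.
sim = np.array([[0.0, 0.2, 0.9],
                [0.0, 0.0, 0.5],
                [0.0, 0.0, 0.0]])

top_k = 2
# flat indices of the top_k largest entries, largest first
flat_idx = np.argsort(sim.ravel())[-top_k:][::-1]
# convert all flat indices to (row, column) pairs in one call
rows, cols = np.unravel_index(flat_idx, sim.shape)

for r, c in zip(rows, cols):
    print(int(r), int(c), sim[r, c])
```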

In [50]:
# array of top 10 similar pairs indexes
indices = np.flip(np.argsort(similarities.ravel())[-10:])

# arrays used to retrieve key from dictionary and decode hashes
keys = list(users.keys())
vals = list(users.values())

rich.print(f"[bold]TOP 10 SIMILAR USERS PAIRS")
# print top pairs along with jaccard similarities
for i, idx in enumerate(indices):
    user_1, user_2 = retr_coor(idx, len(users))
    # decode user hashing
    user1 = vals.index(user_1)
    user2 = vals.index(user_2)
    # print results
    rich.print(f"{i+1}. [red][bold]user", keys[user1], f"[red][bold] with user", keys[user2], 
              f"[bold][red]     Jaccard similarity  ->", similarities[user_1][user_2])

### Frequent movies pairs

Frequent pairs of movies that were reviewed by the same user are investigated. The analysis is similar to market basket analysis, except that here a basket contains the movies reviewed by a particular user. No support threshold is applied to individual movies (the threshold is set to 0), since the dataset was chosen so that almost all movies have the same number of reviews considered (the number of baskets they appear in).

In [51]:
# specify pair movies count matrix
freq_mov = np.zeros([len(movies), len(movies)])

# count how many times certain movies were reviewed by the same user
for basket in iter_list:
    curr_basket = movie_baskets[str(basket)]
    # there must be at least 2 movies reviewed by single user
    if len(curr_basket) >= 2:
        # iterate through all items except last
        for buffer, movie_1 in enumerate(curr_basket[:-1]):
            for movie_2 in curr_basket[(buffer+1):]:
                freq_mov[movie_1][movie_2] += 1            
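
An equivalent way to count co-reviewed pairs without a dense matrix is `itertools.combinations` together with a `Counter`. Sorting each basket first gives every unordered pair a single canonical key, so the count does not depend on the order in which movies appear in a basket. The sketch below runs on invented baskets; `toy_baskets` and `pair_counts` are illustrative names.

```python
from collections import Counter
from itertools import combinations

# Invented baskets: user key -> list of movie ids.
toy_baskets = {"0": [3, 1, 2], "1": [2, 1], "2": [4]}

pair_counts = Counter()
for basket in toy_baskets.values():
    # sorted(set(...)) gives each unordered pair one canonical key (a < b)
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Baskets with fewer than 2 movies produce no combinations, so no explicit length check is needed.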

The support threshold for movie pairs is set to a very small value of s = 0.0008, since our dataset contains more than 11000 baskets and even single movies can appear in at most 50 of them. Frequent movie pairs that meet the threshold constraint are printed out to the console.
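
To see what such a small support value means in absolute terms, it can be converted into a minimum number of shared baskets. The sketch below assumes a round figure of 11000 baskets (the text only states "more than 11000") and uses the value of s that is set in the code cell; `n_baskets` and `min_count` are illustrative names.

```python
import math

n_baskets = 11000   # assumed round figure; the text only says "more than 11000"
s = 0.0008          # support value set in the notebook's code cell

threshold = s * n_baskets
# The notebook keeps pairs whose count is strictly greater than the threshold.
min_count = math.floor(threshold) + 1

print(threshold, min_count)
```

Under this assumption a pair of movies has to be reviewed together by at least 9 users to count as frequent.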

In [68]:
# setting support
s = 0.0008

# only frequent pairs with specific support
freq_mov_supp = np.where(freq_mov > s * len(movie_baskets), freq_mov, 0)

# arrays used to retrieve key from dictionary and decode hashes
keys = list(movies.keys())
vals = list(movies.values())

# sorted starting from highest support
indices = np.flip(np.argsort(freq_mov_supp.ravel()))

#print out results
rich.print(f"[bold]FREQUENT MOVIE PAIRS")
# print pairs in order of decreasing count, stopping at the first pair below the support threshold
i = 0
mov_1, mov_2 = retr_coor(indices[i], len(movies))
while freq_mov_supp[mov_1][mov_2] != 0:
    # unhash
    movie1 = vals.index(mov_1)
    movie2 = vals.index(mov_2)
    print(f"[{keys[movie1]}; {keys[movie2]}]")
    # new pair of movies
    i += 1
    mov_1, mov_2 = retr_coor(indices[i], len(movies))

[Cinderella; The Jungle Book]
[Draft Day; Million Dollar Arm]
[Million Dollar Arm; Trouble with the Curve]
[McFarland, USA; Million Dollar Arm]
[Glory Road; The Greatest Game Ever Played]
[Glory Road; Hustle & Flow]
[Ice Age: Continental Drift; Madagascar 3: Europe's Most Wanted]
[Kung Fu Panda 2; Rio]
[Grudge Match; Million Dollar Arm]
[Pitch Perfect 2; Pitch Perfect]
[If I Stay; Woman in Gold]
[Here Comes the Boom; Rise of the Guardians]
[Major League; We Are Marshall]
[Milk; The Duchess]
[Just Go with It; Prince of Persia: The Sands of Time]
[Caddyshack; Major League]
[1941; Good Morning, Vietnam]
[The Hunger Games: Mockingjay - Part 1; Wreck-It Ralph]
[Here Comes the Boom; Million Dollar Arm]
[Friends with Benefits; No Strings Attached]
[Mulan; The Jungle Book]
[Mr. Holland's Opus; Shine]
[If I Stay; Million Dollar Arm]
[Romeo + Juliet; Shine]
[Just Go with It; No Strings Attached]
[Hotel Transylvania; The Croods]
[Catch-22; The Hustler]
[Ice Age; Ice Age: Dawn of the Dinosaurs]
[H