In [1]:
import pandas as pd
import numpy as np
import networkx as nx

from pyvis.network import Network

import os
import rich

# path to current directory
path = os.getcwd()

## Import data

The data are loaded from a previously prepared dataframe containing usernames, movie titles and production years.

In [2]:
net_data = pd.read_csv(os.path.join(path, "net_data.csv"))

# removing duplicates if any
net_data.drop_duplicates(inplace = True)
net_data.reset_index(drop = True, inplace = True)

net_data.sample(5)

Unnamed: 0,username,title,year
547599,toqtaqiya2,The Monuments Men,2014
506754,AdrianValOlonan,The Hunger Games: Catching Fire,2013
172255,mooniebutt,"Girl, Interrupted",1999
372929,Sylviastel,Sherlock Holmes,2009
476385,Fitzbob,The Equalizer,2014


## Network preparation

### Vertex set

In this part the vertex set of the network is prepared. The vertices of the analysed graph are users and movies.

#### Username

Prepare the username nodes for the network. Nodes in this group contain:

    - user name ["username"]
    - number of movies this user has reviewed ["revs_num"]
    
The suffix "_u" is appended to each username to avoid node collisions when a username and a movie title are identical.
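The reason for the suffix can be illustrated with a minimal sketch. NetworkX identifies nodes by their name, so adding a node whose name already exists does not create a second node, it only updates the attributes of the existing one. The name "Batman" and the graphs `G_plain` and `G_suffixed` are hypothetical and used only for illustration.

```python
import networkx as nx

# Hypothetical toy case: "Batman" is both a username and a movie title.
G_plain = nx.DiGraph()
G_plain.add_node("Batman", node_type="user")
# same name again: no new node is created, the attribute is overwritten
G_plain.add_node("Batman", node_type="movie")

# with suffixes the two entities stay separate
G_suffixed = nx.DiGraph()
G_suffixed.add_node("Batman" + "_u", node_type="user")
G_suffixed.add_node("Batman" + "_m", node_type="movie")

print(G_plain.number_of_nodes())      # 1
print(G_suffixed.number_of_nodes())   # 2
```

Without the suffix the user node would silently be replaced by the movie node, and reviews written by that user would become self-referencing edges.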

In [3]:
# count values of each user's reviews and save outcome to dataframe
user_nodes = net_data.username.value_counts().to_frame(name = "revs_num")
user_nodes.reset_index(inplace = True)

# name the former index column "username"
# (pandas < 2.0 calls it "index"; newer versions already name it "username", so the rename is a no-op there)
user_nodes = user_nodes.rename(columns = {"index" : "username"})

# append the "_u" suffix
user_nodes["username"] = user_nodes["username"] + "_u"

The resulting table of nodes and their attributes, together with the total number of nodes, is shown below.

In [4]:
rich.print(f"\n\n[bold]Total number of username nodes: {len(user_nodes.index)}")
user_nodes.head(5)

Unnamed: 0,username,revs_num
0,SnoopyStyle_u,815
1,jboothmillard_u,717
2,TxMike_u,642
3,anaconda-40658_u,589
4,bob the moo_u,578


#### Movie

Prepare the movie nodes for the network. Nodes in this group contain:

    - movie title ["title"]
    - year of production ["year"]
    - movie average rating ["rating"]
    - number of reviews ["num_revs"]
    
The suffix "_m" is appended to each movie title to avoid node collisions when a username and a movie title are identical.
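As a rough sketch of how such a movie table can be derived, the review counts may also be computed with a single `groupby`, which does not depend on how `reset_index` names the former index column. The dataframe `toy` and the result `toy_movies` are hypothetical stand-ins, not the notebook's data. Note that this variant counts reviews per (title, year) pair, whereas the notebook counts per title only.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews dataframe
toy = pd.DataFrame({
    "username": ["ann", "bob", "ann", "cid"],
    "title": ["Alien", "Alien", "Heat", "Alien"],
    "year": [1979, 1979, 1995, 1979],
})

# one row per (title, year) pair with the number of reviews it received
toy_movies = (
    toy.groupby(["title", "year"])
       .size()
       .reset_index(name="num_revs")
)

# append the movie suffix to the title
toy_movies["title"] = toy_movies["title"] + "_m"

print(toy_movies)
```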

In [5]:
# Create dataframe containing title and year
movie_nodes = net_data[["title", "year"]].drop_duplicates()
movie_nodes.reset_index(inplace = True, drop = True)
# rating is initialised to 0 as a placeholder
movie_nodes["rating"] = 0

# calculate number of reviews for each movie and append it as a column to dataframe
movie_revs = net_data.title.value_counts().to_frame("num_revs")
movie_revs.reset_index(inplace = True)
movie_revs.rename(columns = {"index" : "title"}, inplace = True)
movie_nodes = pd.merge(movie_nodes, movie_revs, on = "title")

# delete unused dataframe from memory
del movie_revs

# append the "_m" suffix
movie_nodes["title"] = movie_nodes["title"] + "_m"

The resulting table of nodes and their attributes, together with the total number of nodes, is shown below.

In [6]:
rich.print(f"\n\n[bold]Total number of movie nodes: {len(movie_nodes.index)}")
movie_nodes.head(5)

Unnamed: 0,title,year,rating,num_revs
0,10 Cloverfield Lane_m,2016,0,751
1,10 Things I Hate About You_m,1999,0,606
2,12 Angry Men_m,1957,0,1501
3,12 Years a Slave_m,2013,0,848
4,127 Hours_m,2010,0,481


### Edge set

The dataframe containing edge directions (from node, to node) is defined based on the reviews dataframe imported at the beginning.

In [7]:
edges = net_data[["username", "title"]]
edges = edges.rename(columns = {"username" : "from", "title" : "to"})
rich.print(f"\n\n[bold]Total number of edges: {len(edges.index)}")
# append suffixes so that edge endpoints match the node names
edges["from"] = edges["from"] + "_u"
edges["to"] = edges["to"] + "_m"
# example of what the relations look like
edges.sample(5)

Unnamed: 0,from,to
36580,krisrox_u,American Psycho_m
479738,marvelmanny_u,The Expendables_m
221343,aaron-oneill3_u,Inside Out_m
204887,shannon5760_u,How to Lose a Guy in 10 Days_m
226881,joemhoward_u,Interstellar_m


## Create network graph with use of NetworkX

A directed graph structure was chosen, since each review goes from a user to a specific movie. Two types of nodes are defined: user nodes and movie nodes. These vertices have different attributes. The size of a node is defined as the number of edges going from or to that node. Two attributes are common to both node types:

    - size of node
    - node_type -> specifies whether the node is a user node or a movie node
    
### Defining nodes (vertices) and edges

In [8]:
# initiate empty graph structure
G = nx.DiGraph()

# add user nodes to the network
for _, users in user_nodes.iterrows():
    G.add_node(users["username"], 
               size = users["revs_num"], 
               node_type = "user")
    
# add movie nodes to the network
for _, movie in movie_nodes.iterrows():
    G.add_node(movie["title"], 
               size = movie["num_revs"],
               year = movie["year"], 
               rating = movie["rating"],  
               node_type = "movie")
    
# add edges to the network
for _, edge in edges.iterrows():
    # sanity check whether such nodes exist in graph
    if (G.has_node(edge["from"]) and G.has_node(edge["to"])):
        # add the edge to the network
        G.add_edge(edge["from"], edge["to"])
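
The row-by-row loops above are easy to follow but can be slow on several hundred thousand edges. A possible alternative, sketched here on hypothetical toy tables (`toy_edges`, `toy_users` and the graph name `H` are assumptions, not part of the notebook), builds the directed graph from the edge list in one call and attaches node attributes afterwards.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy tables mirroring the column names used above
toy_edges = pd.DataFrame({"from": ["ann_u", "bob_u", "ann_u"],
                          "to": ["Alien_m", "Alien_m", "Heat_m"]})
toy_users = pd.DataFrame({"username": ["ann_u", "bob_u"],
                          "revs_num": [2, 1]})

# build the directed graph from the edge list in one call
H = nx.from_pandas_edgelist(toy_edges, source="from", target="to",
                            create_using=nx.DiGraph)

# attach node attributes from a dict of dicts keyed by node name
user_attrs = {
    row.username: {"size": row.revs_num, "node_type": "user"}
    for row in toy_users.itertuples(index=False)
}
nx.set_node_attributes(H, user_attrs)

print(H.nodes["ann_u"])
```

Unlike the loops above, this approach only creates nodes that appear in at least one edge, so isolated nodes would have to be added separately.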

### The network

In [9]:
rich.print("[bold]The network is a directed graph with the following numbers of vertices and edges:")
rich.print(G)

# iterate through nodes in network and count nodes number of each type to ensure all nodes were defined correctly
u_nodes = 0
m_nodes = 0
for node in G.nodes():
    if G.nodes[node]["node_type"] == "user":
        u_nodes += 1
    else:
        m_nodes += 1

rich.print(f"Number of user nodes: {u_nodes}")
rich.print(f"Number of movie nodes: {m_nodes}")
rich.print(f"[bold]User node example: ", G.nodes["SnoopyStyle_u"])
rich.print(f"[bold]Movie node example: ", G.nodes["Gladiator_m"])

The node "size" attribute should equal the in-degree (the number of edges that terminate at vertex $x_1$) for every movie node and the out-degree (the number of edges that originate from vertex $x_2$) for every user node. Here $x_1$ and $x_2$ denote arbitrary nodes from the respective groups.
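
The two degree notions can be seen on a small directed graph. This is only an illustrative sketch: the graph `D` and its node names are hypothetical, with edges pointing from users to movies as in the network above.

```python
import networkx as nx

# Hypothetical toy review graph: edges point from user to movie
D = nx.DiGraph()
D.add_edges_from([("ann_u", "Alien_m"),
                  ("bob_u", "Alien_m"),
                  ("ann_u", "Heat_m")])

print(D.out_degree("ann_u"))    # 2, ann reviewed two movies
print(D.in_degree("Alien_m"))   # 2, Alien received two reviews
print(D.in_degree("ann_u"))     # 0, no edge ends at a user node
```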

In [10]:
# check whether all nodes have the correct size value assigned
for node in G.nodes():
    if G.nodes[node]["node_type"] == "user":
        # out degree value for user node
        degree = G.out_degree(node)
    else:
        # in degree value for movie node
        degree = G.in_degree(node)
    # check for correct size number
    if degree != G.nodes[node]["size"]:
        print("Error: incorrect size value in node ", node)

### Network visualisation

Since the constructed network is very large, visualising it with standard Python libraries and the available computational resources is nearly impossible. Therefore, for imaging purposes, a truncated sample network with a much smaller number of nodes and edges is created. The network is visualised with the pyvis tool, so some naming rules are changed to make the visualisation clearer. This subnetwork is not part of the developed network; it is made solely for graph drawing.
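
Another way to obtain a drawing sample, different from the approach taken in this notebook, is to take an induced subgraph of an existing graph. The sketch below uses a hypothetical toy graph `full` and a hand-picked node list `keep`; it does not rename attributes for pyvis, so it only illustrates the truncation step.

```python
import networkx as nx

# Hypothetical toy graph standing in for the full directed network
full = nx.DiGraph()
full.add_edges_from([("ann_u", "Alien_m"), ("bob_u", "Alien_m"),
                     ("ann_u", "Heat_m"), ("cid_u", "Heat_m")])

# nodes to keep in the drawing
keep = ["ann_u", "bob_u", "Alien_m"]

# induced subgraph: only edges with both endpoints in `keep` survive;
# to_undirected() returns a new undirected graph
sample = full.subgraph(keep).to_undirected()

print(sample.number_of_nodes(), sample.number_of_edges())
```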

In [16]:
# graph properties

# number of user nodes
un = 50                   
# number of movie nodes
mn = 100
# size scaling factor
ssf = 3

In [17]:
# create truncated network (un user nodes + mn movie nodes) as an undirected graph structure
GT = nx.Graph()

# add user nodes to the network
for _, users in user_nodes.head(un).iterrows():
    GT.add_node(users["username"], 
                title = users["username"][:-2],        # delete subscript from title
                size = users["revs_num"]/ssf,          # make node smaller
                group = "user")                        # assign a color group
    
# add movie nodes to the network
for _, movie in movie_nodes.sample(mn).iterrows():
    GT.add_node(movie["title"], 
                title = movie["title"][:-2],           # delete subscript from title 
                size = movie["num_revs"]/ssf,          # make node smaller
                year = movie["year"], 
                rating = movie["rating"],  
                group = "movie")                       # assign a color group
    
# add edges to the network
for _, edge in edges.iterrows():
    # sanity check whether such nodes exist in graph
    if (GT.has_node(edge["from"]) and GT.has_node(edge["to"])):
        # add the edge to the network
        GT.add_edge(edge["from"], edge["to"])
        
# print statistics
rich.print(GT)

# count number of nodes in each group
u_nodes = 0
m_nodes = 0
for node in GT.nodes():
    if GT.nodes[node]["group"] == "user":
        u_nodes += 1
    else:
        m_nodes += 1

rich.print(f"Number of user nodes: {u_nodes}")
rich.print(f"Number of movie nodes: {m_nodes}")

Network graph generation may take some time. Furthermore, the following message may appear:

<font color='red'>Local cdn resources have problems on chrome/safari when used in jupyter-notebook.</font>

This message can be ignored. Even when it appears, the graph visualisation is displayed after a while.

In [18]:
rich.print(f"[bold]Nodes and their colours legend:")
rich.print(f"[blue]BLUE [black]nodes are users\n[yellow]YELLOW [black]nodes are movies")
net = Network(height = 900, width = 900, notebook = True)
net.toggle_hide_edges_on_drag(True)
net.barnes_hut()
net.from_nx(GT)
net.show_buttons(filter_=['physics'])
net.show("network.html")

Local cdn resources have problems on chrome/safari when used in jupyter-notebook. 
