# H-Feature Analysis
This optional notebook walks through the steps of creating random walks using the H-path method. These walks can be used as a feature when learning HAS embeddings in the HAS_entity_embeddings notebook. We also analyze how different parameter choices affect the walks, and look at the embeddings that result from using this feature alone.

*What is this feature?* --> These random walks are intended to detect similarity due to homophily, i.e., entities with similar neighbors are similar.
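To make the idea of a random walk concrete, here is a minimal sketch of a plain uniform random walk on a toy adjacency list. It is only an illustration: the toy graph, the `uniform_random_walk` helper, and the uniform choice of the next neighbor are assumptions, not the logic of `gt_random_walks_from_nodes` in `h_path_walks.py`.

```python
import random

# Hypothetical toy graph: node -> list of neighbors (undirected, so each edge appears both ways).
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def uniform_random_walk(graph, start, walk_length, rng):
    """Return a walk of at most `walk_length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph.get(walk[-1], [])
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Two walks of five nodes from every node in the toy graph
walks = [uniform_random_walk(toy_graph, node, 5, rng) for node in toy_graph for _ in range(2)]
for walk in walks:
    print(" ".join(walk))
```

Nodes that share neighbors tend to appear in similar walk contexts, which is what the embedding model later picks up on.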

## Pre-requisite steps to run this notebook
1. You need to run the 1_candidate_label_creation notebook before this notebook.
2. gensim is a dependency. You can install it with `pip install --upgrade gensim`, or if you want to use Anaconda, `conda install -c conda-forge gensim`
3. graph-tool is a dependency. Instructions to install in various ways are here: https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions (e.g. to install in an existing conda environment, use `conda install -c conda-forge graph-tool`)

In [20]:
%load_ext autoreload
%autoreload 2
import os
import random
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
import graph_tool.all as gt
from kgtk.gt.gt_load import load_graph_from_kgtk
from kgtk.io.kgtkreader import KgtkReader
import pathlib
import matplotlib.pyplot as plt
from collections import Counter
from h_path_walks import gt_random_walks_from_nodes
import seaborn as sns
from tqdm import tqdm

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Parameters

**Embedding model parameters**   
*num_walks*: Number of random walks to start at each node with the H-feature walk method  
*walk_length*: Length of random walk started at each node  
*representation_size*: Number of latent dimensions to learn for each node  
*window_size*: Window size of skipgram model  
*workers*: Number of parallel processes  

**File/Directory parameters**  
*item_file*: File path for the file that contains entity to entity relationships (e.g. wikibase-item).  
*label_file*: File path for the file that contains wikidata labels.  
*work_dir*: same work_dir that you specified in the label creation notebook. Files created by this notebook will also be saved here.  
*store_dir*: Path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.
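
As a rough guide to how the walk parameters affect the size of the walk corpus, the sketch below multiplies them out. It assumes `num_walks` walks are started at every node and that every walk reaches `walk_length` nodes, so the token count is an upper bound. The helper name `walk_corpus_size` and the 1,000-node graph are hypothetical.

```python
def walk_corpus_size(num_nodes, num_walks, walk_length):
    """Upper bound on the number of walks and tokens, assuming every walk reaches full length."""
    total_walks = num_nodes * num_walks
    max_tokens = total_walks * walk_length
    return total_walks, max_tokens

# Example with a made-up graph of 1,000 nodes, 10 walks per node, 10 nodes per walk
total_walks, max_tokens = walk_corpus_size(num_nodes=1000, num_walks=10, walk_length=10)
print(total_walks, max_tokens)  # 10000 100000
```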

In [7]:
# Embedding model params
num_walks = 10
walk_length = 10
representation_size = 64
window_size = 5
workers = 32

# File/Directory params
item_file = "./data/wikidata_humans/claims.wikibase-item.tsv.gz"
label_file = "./data/wikidata_humans/labels.en.tsv.gz"
work_dir = "./output/wikidata_humans_v3"
store_dir = "./output/wikidata_humans_v3/temp"

### Process parameters and set up variables / file names

In [16]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
    
# Create directories
os.makedirs(work_dir, exist_ok=True)
output_dir = "{}/H_walks_analysis".format(work_dir)
os.makedirs(output_dir, exist_ok=True)
    
# Number of nodes to compute walks from at once. Larger = faster, but if too big, we may run out of stack space.
# Not making this a user-specifiable param for now.
batch_size = 50000 #TODO move this inside h_path_walks.py

# Names of files we will create
directed_walks_file = "{}/h_walks_directed.txt".format(output_dir)
undirected_walks_file = "{}/h_walks_undirected.txt".format(output_dir)

# Setting up environment variables 
os.environ['ITEM_FILE'] = item_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['LABEL_CREATION'] = "{}/label_creation".format(work_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

### Helpers

In [4]:
def plot_walk_length_dist(walks):
    print("Number of walks: {}".format(len(walks)))
    walk_lengths = [len(arr) for arr in walks]
    print("Number of walks of each length:")
    counts_str = ", ".join(["{} : {}".format(key,value) for key,value in sorted(Counter(walk_lengths).items())])
    print(counts_str)
    fig, ax = plt.subplots()
    ax.hist(walk_lengths,bins=np.arange(12)-.5)
    ax.set_ylabel('Number of walks')
    ax.set_xlabel('Walk length')
    plt.show()
    
def plot_distinct_nodes_in_walks(walks):
    print("Number of walks: {}".format(len(walks)))
    count_distinct_nodes = [len(set(arr)) for arr in walks]
    print("Number of walks by number of unique nodes visited:")
    counts_str = ", ".join(["{} : {}".format(key,value) for key,value in sorted(Counter(count_distinct_nodes).items())])
    print(counts_str)
    fig, ax = plt.subplots()
    ax.hist(count_distinct_nodes,bins=np.arange(12)-.5)
    ax.set_ylabel('Number of walks')
    ax.set_xlabel('Number of unique nodes visited')
    plt.show()

### 1. Directed graph representation

Load the graph from the item-to-item file.

In [5]:
kr = KgtkReader.open(pathlib.Path(item_file))
g = load_graph_from_kgtk(kr, directed=True, hashed=True)

In [6]:
print("This graph has {} vertices and {} edges".format(len(g.get_vertices()), len(g.get_edges())))

This graph has 944403 vertices and 2402344 edges


Perform walks

In [7]:
open(directed_walks_file, 'w').close()
count_walk_lens = np.zeros(walk_length+1, dtype=int)
vertices = g.get_vertices()
num_batches = int(np.ceil(len(vertices) / batch_size))
print("num_batches: {}".format(num_batches))
for batch_num in tqdm(range(num_batches)):
    start_nodes = vertices[batch_num*batch_size : (batch_num+1)*batch_size]
    h_walks_directed = gt_random_walks_from_nodes(g, start_nodes, walk_length, num_walks)
    
    # keep track of stats
    for walk in h_walks_directed:
        count_walk_lens[len(walk)] += 1
        
    # Explicitly cast list of lists to ndarray with dtype=object to avoid ragged nested sequences message
    h_walks_directed = np.array(h_walks_directed, dtype=object)
    with open(directed_walks_file, "a") as f:
        np.savetxt(f, h_walks_directed, fmt="%s")

num_batches: 19




done with batch: 0
done with batch: 1
done with batch: 2
done with batch: 3
done with batch: 4
done with batch: 5
done with batch: 6
done with batch: 7
done with batch: 8
done with batch: 9
done with batch: 10
done with batch: 11
done with batch: 12
done with batch: 13
done with batch: 14
done with batch: 15
done with batch: 16
done with batch: 17
done with batch: 18
CPU times: user 13min 20s, sys: 14 s, total: 13min 34s
Wall time: 13min 35s
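
The batching in the loop above splits the vertex array into consecutive slices of at most `batch_size` elements. A minimal stand-alone sketch of that slicing, using a hypothetical `iter_batches` helper and a plain Python list in place of graph-tool vertices:

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    num_batches = math.ceil(len(items) / batch_size)
    for batch_num in range(num_batches):
        yield items[batch_num * batch_size : (batch_num + 1) * batch_size]

batches = list(iter_batches(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last batch is simply shorter when the number of items is not a multiple of `batch_size`.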


Let's look at how many walks we have and how long they are (max walk length is 10).

In [15]:
print("number of walks: {}".format(sum(count_walk_lens)))
counts_str = ", ".join(["{}: {}".format(i, count_walk_lens[i]) for i in range(len(count_walk_lens))])
print("Number of walks of each length:\n{}".format(counts_str))

number of walks: 9444030.0
Number of walks of each length:
0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 9444030.0


### 2. Undirected representation

Load the graph from the item-to-item file.

In [5]:
kr = KgtkReader.open(pathlib.Path(item_file))
g = load_graph_from_kgtk(kr, directed=False, hashed=True)

In [6]:
print("This graph has {} vertices and {} edges".format(len(g.get_vertices()), len(g.get_edges())))

This graph has 13185117 vertices and 68061483 edges


Perform walks

In [21]:
%%time
open(undirected_walks_file, 'w').close()
count_walk_lens = np.zeros(walk_length+1, dtype=int)
count_unique_nodes_in_walk = np.zeros(walk_length+1, dtype=int)
vertices = g.get_vertices()
num_batches = int(np.ceil(len(vertices) / batch_size))
print("num_batches: {}".format(num_batches))
with open(undirected_walks_file, "a") as f:
    for batch_num in tqdm(range(num_batches)):
        start_nodes = vertices[batch_num*batch_size : (batch_num+1)*batch_size]
        h_walks_undirected = gt_random_walks_from_nodes(g, start_nodes, walk_length, num_walks)
        # keep track of stats
        for walk in h_walks_undirected:
            count_walk_lens[len(walk)] += 1
            count_unique_nodes_in_walk[len(set(walk))] += 1
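        # Explicitly cast to an object ndarray (as in the directed case) to avoid the ragged nested sequences message
        h_walks_undirected = np.array(h_walks_undirected, dtype=object)
        np.savetxt(f, h_walks_undirected, fmt="%s")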

  0%|          | 0/264 [00:00<?, ?it/s]

num_batches: 264


  1%|          | 2/264 [01:07<2:27:32, 33.79s/it]


KeyboardInterrupt: 

In [67]:
print("number of walks: {}".format(sum(count_walk_lens)))
counts_str = ", ".join(["{}: {}".format(i, count_walk_lens[i]) for i in range(len(count_walk_lens))])
print("Number of walks of each length:\n{}".format(counts_str))

number of walks: 0
Number of walks of each length:
0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0


Since these walks are on an undirected graph, it might be informative to look at how many unique vertices we visit in each walk.

In [17]:
counts_str = ", ".join(["{}: {}".format(i, count_unique_nodes_in_walk[i]) for i in range(len(count_unique_nodes_in_walk))])
print("Number of walks by number of unique nodes visited:\n{}".format(counts_str))

Number of walks by number of unique nodes visited:
0: 0.0, 1: 0.0, 2: 2299.0, 3: 8598.0, 4: 30437.0, 5: 141243.0, 6: 2893997.0, 7: 2080116.0, 8: 1888192.0, 9: 1616571.0, 10: 782577.0
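
To illustrate why this count matters: on an undirected graph a walk can step back along the edge it just used, so a walk of ten nodes may cover far fewer distinct entities. A small sketch with made-up walks (the walks and node names are illustrative only):

```python
from collections import Counter

# Made-up walks of four nodes each over toy node ids
example_walks = [
    ["A", "B", "A", "B"],  # bounces between two nodes
    ["A", "B", "C", "B"],  # revisits one node
    ["A", "B", "C", "D"],  # never revisits a node
]

# Tally walks by the number of distinct nodes they contain
unique_counts = Counter(len(set(walk)) for walk in example_walks)
for num_unique, num_walks_with_count in sorted(unique_counts.items()):
    print("{} unique nodes: {} walk(s)".format(num_unique, num_walks_with_count))
```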


### 3. Let's see what embeddings we learn if we only use this feature
Use Skip-Gram model to learn representations for the entities  
**We'll use the undirected representation's h-path walks.**  

In [18]:
%%time
# Note: gensim >= 4.0 names this argument `vector_size`; on gensim 3.x use `size` instead
model = Word2Vec(corpus_file=undirected_walks_file, vector_size=representation_size,
                 window=window_size, min_count=0, sg=1, hs=1, workers=workers)
model.wv.save("{}/entity_embeddings.kv".format(output_dir))

CPU times: user 8h 22min 18s, sys: 2min 20s, total: 8h 24min 39s
Wall time: 21min 44s
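
For intuition about what the skip-gram model is trained on, the sketch below lists the (center, context) pairs that a fixed window produces from a single walk. This is a simplified illustration: real implementations may add details such as randomly shrinking the window, which are omitted here, and `skipgram_pairs` is a hypothetical helper rather than part of gensim.

```python
def skipgram_pairs(walk, window_size):
    """Return (center, context) pairs for every node and its neighbors within the window."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs

pairs = skipgram_pairs(["a", "b", "c", "d"], window_size=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

A larger `window_size` pairs each node with nodes further along the walk, so more distant neighbors influence its embedding.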


### Evaluate the embeddings
We want similar entities to have more similar embeddings. For the purpose of profiling entities of a desired type, we are specifically interested in similar entities *within a type* having more similar embeddings. We'll investigate using cosine similarity.
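
As a reminder of what is being measured, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch with made-up 3-dimensional vectors (the real embeddings have `representation_size` dimensions, and `cosine_similarity` here is an illustrative helper, not the gensim method):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]), 2))  # 1.0
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 2))  # 0.0
print(round(cosine_similarity([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]), 2))  # 0.71
```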

Let's compare entities of different types.

In [20]:
print("Vladimir Putin and Russia: {:.2f}".format(model.wv.similarity('Q7747','Q159')))
print("Vladimir Putin and Ireland: {:.2f}".format(model.wv.similarity('Q7747','Q27')))
print("Russia and Ireland: {:.2f}".format(model.wv.similarity('Q159','Q27')))

Vladimir Putin and Russia: 0.70
Vladimir Putin and Ireland: 0.17
Russia and Ireland: 0.47


Now let's compare various entities within the type we are interested in profiling.

In [24]:
model.wv.most_similar('Q7747')

[('Q65145163', 0.9593958258628845),
 ('Q1027574', 0.9569349884986877),
 ('Q27496593', 0.953919529914856),
 ('Q20013143', 0.9514901638031006),
 ('Q27512950', 0.9514316320419312),
 ('Q4459933', 0.9511592388153076),
 ('Q27494237', 0.9500447511672974),
 ('Q24714557', 0.9491683840751648),
 ('Q65090755', 0.9490238428115845),
 ('Q4985986', 0.9481089115142822)]

In [28]:
!wb u Q65145163

OSError: [Errno 12] Cannot allocate memory

In [23]:
print("Vladimir Putin and : {:.2f}".format(model.wv.similarity('Q7747','Q28858481')))

KeyError: "word 'Q28858481' not in vocabulary"

Finally, we can look at how similar the beers are to each other (since we have very few of them).

In [None]:
beer_vecs = model.wv['Q12877510', 'Q61976614', 'Q93552342', 'Q93557205', 'Q93559285', 'Q93558270', 'Q93560567', 'Q97412285']
similarity_mat = [model.wv.cosine_similarities(beer, beer_vecs) for beer in beer_vecs]
mask = np.zeros_like(similarity_mat)
mask[np.triu_indices_from(mask)] = True
labels = ['Macedonian Thrace Brewery', 'Rastrum', 'Vergina Lager', 'Vergina Red', 'Vergina Weiss', 'Vergina Porfyra', 'Vergina Black', 'Vergina Alcohol Free']
# Could mask to only show the lower triangle, but I think this is actually easier to read without the mask
fig, ax = plt.subplots(figsize=(9,7))
sns.set(font_scale=1.5)
sns.heatmap(similarity_mat, ax=ax, xticklabels=labels, yticklabels=labels, annot=True)
plt.xticks(rotation=30, horizontalalignment='right')
plt.title("Cosine similarity of beer embeddings")
plt.show()

The beer called 'Rastrum' seems to be an outlier. Searching for it online gives few results related to beer. Let's look at the similarities without this beer so we can more easily see how the others compare.

In [None]:
beer_vecs = model.wv['Q12877510', 'Q93552342', 'Q93557205', 'Q93559285', 'Q93558270', 'Q93560567', 'Q97412285']
similarity_mat = [model.wv.cosine_similarities(beer, beer_vecs) for beer in beer_vecs]
mask = np.zeros_like(similarity_mat)
mask[np.triu_indices_from(mask)] = True
labels = ['Macedonian Thrace Brewery', 'Vergina Lager', 'Vergina Red', 'Vergina Weiss', 'Vergina Porfyra', 'Vergina Black', 'Vergina Alcohol Free']
# Could mask to only show the lower triangle, but I think this is actually easier to read without the mask
fig, ax = plt.subplots(figsize=(9,7))
sns.set(font_scale=1.5)
sns.heatmap(similarity_mat, ax=ax, xticklabels=labels, yticklabels=labels, annot=True)
plt.xticks(rotation=30, horizontalalignment='right')
plt.title("Cosine similarity of beer embeddings")
plt.show()