# H-Feature Analysis
This optional notebook walks through the steps of creating random walks with the H-path method. The walks can be used as a feature when learning HAS-embeddings in the HAS_entity_embeddings notebook. We analyze how different parameter choices affect the walks, and we look at the embeddings learned when this feature is used on its own.

*What is this feature?* These random walks are intended to detect similarity due to homophily, i.e. entities with similar neighbors are considered similar.
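
To make the homophily idea concrete, here is a minimal sketch of a plain uniform random walk on a toy graph. It is not the H-path method: that logic lives in `h_path_walks.gt_random_walks_from_nodes` and is not shown in this notebook. The function `uniform_random_walk` and the toy graph are hypothetical and only illustrate why entities that share neighbors tend to appear in similar walk contexts.

```python
import random


def uniform_random_walk(adjacency, start, walk_length, rng):
    """Return a walk of up to `walk_length` nodes, choosing each next node uniformly."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk


# Hypothetical toy graph: A and B share the neighbors X and Y; C and Z form a separate component.
toy_adjacency = {
    "A": ["X", "Y"],
    "B": ["X", "Y"],
    "X": ["A", "B"],
    "Y": ["A", "B"],
    "C": ["Z"],
    "Z": ["C"],
}

rng = random.Random(0)
toy_walks = [
    uniform_random_walk(toy_adjacency, node, 6, rng)
    for node in toy_adjacency
    for _ in range(3)  # 3 walks per start node
]
for walk in toy_walks[:3]:
    print(" ".join(walk))
```

Walks that start at A or B move through the same neighborhood (X and Y), so a skip-gram model trained on them would see A and B in similar contexts, while C and Z never co-occur with them.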

## Pre-requisite steps to run this notebook
1. You need to run the 1_candidate_label_creation notebook before this notebook.
2. gensim is a dependency. You can install it with `pip install --upgrade gensim`, or if you want to use Anaconda, `conda install -c conda-forge gensim`
3. graph-tool is a dependency. Instructions to install in various ways are here: https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions (e.g. to install in an existing conda environment, use `conda install -c conda-forge graph-tool`)

In [1]:
%load_ext autoreload
%autoreload 2
import os
import random
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
import graph_tool.all as gt
from kgtk.gt.gt_load import load_graph_from_kgtk
from kgtk.io.kgtkreader import KgtkReader
import pathlib
import matplotlib.pyplot as plt
from collections import Counter
from h_path_walks import gt_random_walks_from_nodes
import seaborn as sns
from tqdm import tqdm



## parameters

**Embedding model parameters**   
*num_walks*: Number of random walks to start at each node with the H-feature walk method  
*walk_length*: Length of random walk started at each node  
*representation_size*: Number of latent dimensions to learn for each node  
*window_size*: Window size of skipgram model  
*workers*: Number of parallel processes  

**File/Directory parameters**  
*item_file*: File path for the file that contains entity to entity relationships (e.g. wikibase-item).  
*label_file*: File path for the file that contains wikidata labels.  
*work_dir*: same work_dir that you specified in the label creation notebook. Files created by this notebook will also be saved here.  
*store_dir*: Path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

In [2]:
# Embedding model params
num_walks = 3
walk_length = 6
representation_size = 64
window_size = 4
workers = 12

# File/Directory params
item_file = "./data/wikidata-20210215-dwd/claims.wikibase-item.tsv.gz"
label_file = "./data/wikidata-20210215-dwd/labels.en.tsv.gz"
work_dir = "./output/wikidata-20210215-dwd"
store_dir = "./output/wikidata-20210215-dwd/temp-h"

### Process parameters and set up variables / file names

In [3]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
    
# Create directories
if not os.path.exists(work_dir):
    os.makedirs(work_dir)
output_dir = "{}/H_walks_analysis".format(work_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
if not os.path.exists(store_dir):
    os.makedirs(store_dir)
    
# Number of nodes to compute walks from at once. Larger = faster, but if too big, we may run out of stack space.
# Not making this a user-specifiable param for now.
batch_size = 50000 #TODO move this inside h_path_walks.py

# Names of files we will create
directed_walks_file = "{}/h_walks_directed_3x6.txt".format(output_dir)
undirected_walks_file = "{}/h_walks_undirected_3x6.txt".format(output_dir)
# undirected_walks_file = "{}/1M_walks.txt".format(output_dir)

# Setting up environment variables 
os.environ['ITEM_FILE'] = item_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['LABEL_CREATION'] = "{}/label_creation".format(work_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
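
The batching controlled by `batch_size` can be sketched as follows. This is a minimal illustration, assuming the same slicing scheme as the walk cells further down; `batch_bounds` is a hypothetical helper that is not part of `h_path_walks`, and the vertex counts are the ones printed later in this notebook.

```python
import math


def batch_bounds(num_items, batch_size):
    """Return (start, end) index pairs that cover range(num_items) in order."""
    num_batches = math.ceil(num_items / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, num_items))
        for i in range(num_batches)
    ]


# Vertex counts as printed for the directed and undirected graphs in this notebook
directed_batches = batch_bounds(944_403, 50_000)
undirected_batches = batch_bounds(42_575_933, 50_000)

print(len(directed_batches))    # 19
print(len(undirected_batches))  # 852
print(directed_batches[-1])     # (900000, 944403)
```

The batch counts agree with the `num_batches` values printed by the walk cells.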

### Helpers

In [4]:
def plot_walk_length_dist(walks):
    print("Number of walks: {}".format(len(walks)))
    walk_lengths = [len(arr) for arr in walks]
    print("Number of walks of each length:")
    counts_str = ", ".join(["{} : {}".format(key,value) for key,value in sorted(Counter(walk_lengths).items())])
    print(counts_str)
    fig, ax = plt.subplots()
    ax.hist(walk_lengths,bins=np.arange(12)-.5)
    ax.set_ylabel('Number of walks')
    ax.set_xlabel('Walk length')
    plt.show()
    
def plot_distinct_nodes_in_walks(walks):
    print("Number of walks: {}".format(len(walks)))
    count_distinct_nodes = [len(set(arr)) for arr in walks]
    print("Number of walks by number of unique nodes visited:")
    counts_str = ", ".join(["{} : {}".format(key,value) for key,value in sorted(Counter(count_distinct_nodes).items())])
    print(counts_str)
    fig, ax = plt.subplots()
    ax.hist(count_distinct_nodes,bins=np.arange(12)-.5)
    ax.set_ylabel('Number of walks')
    ax.set_xlabel('Number of unique nodes visited')
    plt.show()

### 1. Directed graph representation

Load the graph from the item-to-item file.

In [5]:
kr = KgtkReader.open(pathlib.Path(item_file))
g = load_graph_from_kgtk(kr, directed=True, hashed=True)

In [6]:
print("This graph has {} vertices and {} edges".format(len(g.get_vertices()), len(g.get_edges())))

This graph has 944403 vertices and 2402344 edges


Perform the walks.

In [7]:
open(directed_walks_file, 'w').close()
count_walk_lens = np.zeros(walk_length+1, dtype=int)
vertices = g.get_vertices()
num_batches = int(np.ceil(len(vertices) / batch_size))
print("num_batches: {}".format(num_batches))
for batch_num in tqdm(range(num_batches)):
    start_nodes = vertices[batch_num*batch_size : (batch_num+1)*batch_size]
    h_walks_directed = gt_random_walks_from_nodes(g, start_nodes, walk_length, num_walks)
    
    # keep track of stats
    for walk in h_walks_directed:
        count_walk_lens[len(walk)] += 1
        
    # Explicitly cast list of lists to ndarray with dtype=object to avoid ragged nested sequences message
    h_walks_directed = np.array(h_walks_directed, dtype=object)
    with open(directed_walks_file, "a") as f:
        np.savetxt(f, h_walks_directed, fmt="%s")

num_batches: 19




done with batch: 0
done with batch: 1
done with batch: 2
done with batch: 3
done with batch: 4
done with batch: 5
done with batch: 6
done with batch: 7
done with batch: 8
done with batch: 9
done with batch: 10
done with batch: 11
done with batch: 12
done with batch: 13
done with batch: 14
done with batch: 15
done with batch: 16
done with batch: 17
done with batch: 18
CPU times: user 13min 20s, sys: 14 s, total: 13min 34s
Wall time: 13min 35s


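Let's look at how many walks we have and how long they are. The output below appears to come from a run with 10 walks of length 10 per node (944,403 vertices x 10 walks = 9,444,030 walks), so the maximum walk length shown is 10 rather than the `walk_length = 6` set in the parameters cell.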

In [15]:
print("number of walks: {}".format(sum(count_walk_lens)))
counts_str = ", ".join(["{}: {}".format(i, count_walk_lens[i]) for i in range(len(count_walk_lens))])
print("Number of walks of each length:\n{}".format(counts_str))

number of walks: 9444030.0
Number of walks of each length:
0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 9444030.0


### 2. Undirected representation

Load the graph from the item-to-item file.

In [5]:
kr = KgtkReader.open(pathlib.Path(item_file))
g = load_graph_from_kgtk(kr, directed=False, hashed=True)

In [6]:
print("This graph has {} vertices and {} edges".format(len(g.get_vertices()), len(g.get_edges())))

This graph has 42575933 vertices and 182246240 edges


Perform the walks.

In [7]:
%%time
open(undirected_walks_file, 'w').close()
count_walk_lens = np.zeros(walk_length+1, dtype=int)
count_unique_nodes_in_walk = np.zeros(walk_length+1, dtype=int)
vertices = g.get_vertices()
num_batches = int(np.ceil(len(vertices) / batch_size))
print("num_batches: {}".format(num_batches))
with open(undirected_walks_file, "a") as f:
    for batch_num in tqdm(range(num_batches)):
        start_nodes = vertices[batch_num*batch_size : (batch_num+1)*batch_size]
        h_walks_undirected = gt_random_walks_from_nodes(g, start_nodes, walk_length, num_walks)
        # keep track of stats
        for walk in h_walks_undirected:
            count_walk_lens[len(walk)] += 1
            count_unique_nodes_in_walk[len(set(walk))] += 1
        np.savetxt(f, h_walks_undirected, fmt="%s")

  0%|          | 0/852 [00:00<?, ?it/s]

num_batches: 852


100%|██████████| 852/852 [1:53:52<00:00,  8.02s/it]  


CPU times: user 1h 44min 25s, sys: 9min 31s, total: 1h 53min 57s
Wall time: 1h 53min 54s


In [8]:
print("number of walks: {}".format(sum(count_walk_lens)))
counts_str = ", ".join(["{}: {}".format(i, count_walk_lens[i]) for i in range(len(count_walk_lens))])
print("Number of walks of each length:\n{}".format(counts_str))

number of walks: 127727799
Number of walks of each length:
0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 127727799
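
As a sanity check, the number of walks should equal the number of vertices times `num_walks`. The short calculation below uses the vertex count printed for the undirected graph and the 3x6 settings from the parameters cell. The token count assumes every walk has exactly `walk_length` nodes, which is what the output above shows.

```python
num_vertices = 42_575_933  # vertex count printed for the undirected graph
num_walks = 3              # walks started at each node
walk_length = 6            # nodes per walk

expected_walks = num_vertices * num_walks
expected_tokens = expected_walks * walk_length  # total entity occurrences in the walks file

print(expected_walks)   # 127727799
print(expected_tokens)  # 766366794
```

The expected number of walks matches the 127,727,799 walks counted above.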


Since these walks are on an undirected graph, a walk can step back to a node it has already visited, so it is informative to look at how many unique vertices each walk visits.

In [9]:
counts_str = ", ".join(["{}: {}".format(i, count_unique_nodes_in_walk[i]) for i in range(len(count_unique_nodes_in_walk))])
print("Number of walks by number of unique nodes visited:\n{}".format(counts_str))

Number of walks by number of unique nodes visited:
0: 0, 1: 422, 2: 125154, 3: 213616, 4: 540475, 5: 1491401, 6: 37087106, 7: 45449188, 8: 104035468, 9: 142833307, 10: 93983193
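
The unique-node statistic can be illustrated on a few hand-made walks. The walks below are hypothetical toy data, not output from the H-path method. They show how backtracking on an undirected graph reduces the number of distinct nodes in a walk.

```python
from collections import Counter

# Hypothetical walks of length 6 with different amounts of backtracking
toy_walks = [
    ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],  # bounces between two nodes: 2 unique
    ["Q1", "Q2", "Q3", "Q2", "Q3", "Q4"],  # some backtracking: 4 unique
    ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"],  # no revisits: 6 unique
]

# Histogram: number of unique nodes -> number of walks
unique_counts = Counter(len(set(walk)) for walk in toy_walks)

# Fraction of walks that revisit at least one node
revisit_fraction = sum(1 for walk in toy_walks if len(set(walk)) < len(walk)) / len(toy_walks)

print(sorted(unique_counts.items()))  # [(2, 1), (4, 1), (6, 1)]
print(round(revisit_fraction, 2))     # 0.67
```

A walk with few unique nodes contributes little new context to the skip-gram model, which is one reason this distribution is worth inspecting.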


### 3. Let's see what embeddings we learn if we only use this feature
Use a Skip-Gram model to learn representations for the entities.  
**We'll use the undirected representation's h-path walks.**  
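
Gensim's `corpus_file` argument expects a file in LineSentence format: one sentence per line, with tokens separated by whitespace. Each walk is therefore one line of space-separated Qnodes. The sketch below writes and re-reads walks in that format with plain Python; `write_walks` and `read_walks` are hypothetical helpers used only for illustration (the notebook itself writes the file with `np.savetxt`).

```python
import os
import tempfile


def write_walks(walks, path):
    """Write one walk per line, nodes separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for walk in walks:
            f.write(" ".join(str(node) for node in walk) + "\n")


def read_walks(path):
    """Read walks back as lists of tokens, one list per line."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]


toy_walks = [["Q1", "Q2", "Q3"], ["Q4", "Q5"]]  # hypothetical walks
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_walks.txt")
    write_walks(toy_walks, toy_path)
    round_trip = read_walks(toy_path)

print(round_trip)  # [['Q1', 'Q2', 'Q3'], ['Q4', 'Q5']]
```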

In [150]:
undirected_walks_file_1 = '/data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1.txt'
undirected_walks_file_1_small = '/data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1_small.txt'
undirected_walks_file_1_small2 = '/data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1_small2.txt'

In [76]:
!wc -l /data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1.txt

42575933 /data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1.txt


In [114]:
!wc -l /data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1_small.txt

100000 /data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1_small.txt


In [166]:
!head -50000 /data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1_small.txt \
>> /data/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/H_walks_analysis/h_walks_undirected_1_small2.txt

First, try to empirically find a good value for `min_count`.

In [9]:
word_counts = {}
with open(undirected_walks_file) as f:
    for line in tqdm(f):
        words = line.split()
        for word in words:
            if word not in word_counts:
                word_counts[word] = 0
            word_counts[word] += 1

127727799it [07:16, 292452.43it/s]


In [10]:
rare_words = {}
for word, count in tqdm(word_counts.items()):
    if count not in rare_words:
        rare_words[count] = []
    rare_words[count].append(word)

100%|██████████| 42575933/42575933 [00:15<00:00, 2766071.97it/s]


In [11]:
len(word_counts)

42575933

In [33]:
sum([len(words) for count, words in rare_words.items() if count <= 6])

8951099

In [37]:
len(rare_words[50])

1050204

In [43]:
random.choice(rare_words[7])

'Q18196617'

In [45]:
word_counts["Q10598940"]

4
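
The cumulative counts reported in the tables that follow (number of entities with at most a given number of occurrences) can be computed from `word_counts` along these lines. This is a sketch on hypothetical toy counts; `entities_at_or_below` is an illustrative helper, not a function from this notebook.

```python
from collections import Counter


def entities_at_or_below(word_counts, thresholds):
    """Map each threshold t to the number of entities with at most t occurrences."""
    count_hist = Counter(word_counts.values())  # occurrences -> number of entities
    return {
        t: sum(n for occurrences, n in count_hist.items() if occurrences <= t)
        for t in thresholds
    }


# Hypothetical toy counts (entity -> number of occurrences in the walks file)
toy_word_counts = {"Q1": 1, "Q2": 3, "Q3": 3, "Q4": 7, "Q5": 8, "Q6": 50}
cumulative = entities_at_or_below(toy_word_counts, [3, 7, 8])

print(cumulative)  # {3: 3, 7: 4, 8: 5}
```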

In the original file with 10 walks per entity:  
42.6M entities  

| Number of occurrences | Number of entities with at most this many occurrences | Example entities with this number of occurrences |
| :- | :- | :- |
| <=50 | 22,912,461 | Neotriozella pyrifolii (species of insect), Q13785750 (no english label), 𒅅 (unicode character), Sungai Lingit (disambig page), Thereva brunnea (species of insect) |
| <=40 | 12,238,525 | |
| <=30 | 4,192,879 | |
| <=20 | 683,155 | |
| <=10 | 164 | |

Number of occurrences of various entities:

| entity | Qnode | frequency |
| :- | :- | :- |
| Political career of Vladimir Putin | Q17052997 | 35 |
| Putin's presidential campaign | Q45023984 | 22 |
| Putin disamb. page | Q4384355 | 73 |
| Putin's father | Q19300851 | 223 |
| Medvedev (russian pol) | Q23530 | 884 |
| Medvedev (rus. biologist) | Q2096791 | 211 |
| Medvedev (rus. librettist (opera)) | Q27230041 | 120 |
| Yevgeniy Ditrikh (russian politician) | Q53803011 | 75 |
| Pedro Szekely | Q100104271 | 100 |
| Memento (movie) | Q190525 | 348 |
| Saint Abanoub | Q2562070 | 69 |
| Saint Barbara | Q192816 | 6109 |
| Legend of Zelda Twilight Princess | Q735613 | 284 |
| Gone Girl (book) | Q5581570 | 117 |


50 looks like a reasonable cutoff.

Testing run-time...

In the 1-walk-per-entity file, the number of entities with at most the given number of occurrences:  
<=5: 25M  
<=4: 19M  
<=3: 12M  
<=2: 6M  
<=1: 1.8M  

In the trimmed 1-walk-per-entity file that has 100,000 walks + 50,000 duplicates:  
550k entities  
1 occurrence: 266k entities  
*Runtime with min_count=0*: 1 min for setup and 19s per epoch  
*Runtime with min_count=2*: <1 min for setup and 13s per epoch  

Now finding an appropriate `min_count` for 3 walks from each entity with walk length 6:  

| Number of occurrences | Number of entities with at most this many occurrences | Example entities with this number of occurrences |
| :- | :- | :- |
| <=10 | 27.6M | |
| <=9 | 23.6M | |
| <=8 | 18.8M | Category:Fish described in 1884, Category:Amur Shipbuilding Plant, Q21709829 (no label), Alexander Wedl (German ice hockey player) |
| <=7 | 13.8M | Atanas Babata (bulgarian revolutionary) |
| <=6 | 9M | |

Number of occurrences of various entities:

| entity | Qnode | frequency |
| :- | :- | :- |
| Political career of Vladimir Putin | Q17052997 | 7 |
| Putin's presidential campaign | Q45023984 | 5 |
| Putin disamb. page | Q4384355 | 11 |
| Putin's father | Q19300851 | 36 |
| Medvedev (russian pol) | Q23530 | 109 |
| Medvedev (rus. biologist) | Q2096791 | 27 |
| Medvedev (rus. librettist (opera)) | Q27230041 | 16 |
| Yevgeniy Ditrikh (russian politician) | Q53803011 | 11 |
| Pedro Szekely | Q100104271 | 11 |
| Memento (movie) | Q190525 | 43 |
| Saint Abanoub | Q2562070 | 10 |
| Saint Barbara | Q192816 | 1000 |
| Legend of Zelda Twilight Princess | Q735613 | 34 |
| Gone Girl (book) | Q5581570 | 21 |

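Since the margins are smaller here, I'll play it a bit safer and go with `min_count=8`. Word2Vec ignores entities that occur fewer than `min_count` times, so this drops the 13.8M entities with 7 or fewer occurrences.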

In [4]:
%%time
from datetime import datetime
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    '''Callback to log information about training'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        self.epoch_start = datetime.now()
        print("Epoch #{} start -- {}".format(self.epoch, str(self.epoch_start)))

    def on_epoch_end(self, model):
        epoch_end = datetime.now()
        time_elapsed = str(epoch_end - self.epoch_start)
        print("Epoch #{} end -- {} -- elapsed time: {}".format(self.epoch, str(epoch_end), time_elapsed))
        self.epoch += 1

print("Starting Word2Vec process @ {}".format(datetime.now()))
        
epoch_logger = EpochLogger()
model = Word2Vec(corpus_file=undirected_walks_file, vector_size=representation_size,
                 window=window_size, min_count=8, sg=1, hs=1, workers=workers, callbacks=[epoch_logger])

print("Now saving model...")
model.save("{}/h_embeddings_3x6,min_count=8.model".format(output_dir))

print("Now saving keyed vectors...")
model.wv.save("{}/h_embeddings_3x6,min_count=8.kv".format(output_dir))

Starting Word2Vec process @ 2021-05-07 01:15:55.082173
Epoch #0 start -- 2021-05-07 01:53:59.064302
Epoch #0 end -- 2021-05-07 03:35:43.243992 -- elapsed time: 1:41:44.179690
Epoch #1 start -- 2021-05-07 03:35:43.244273
Epoch #1 end -- 2021-05-07 05:18:51.101256 -- elapsed time: 1:43:07.856983
Epoch #2 start -- 2021-05-07 05:18:51.101523
Epoch #2 end -- 2021-05-07 07:01:54.190864 -- elapsed time: 1:43:03.089341
Epoch #3 start -- 2021-05-07 07:01:54.191152
Epoch #3 end -- 2021-05-07 08:44:34.370598 -- elapsed time: 1:42:40.179446
Epoch #4 start -- 2021-05-07 08:44:34.370849
Epoch #4 end -- 2021-05-07 10:25:27.396399 -- elapsed time: 1:40:53.025550
Now saving model...
Now saving keyed vectors...
CPU times: user 3d 20h 43min 40s, sys: 2h 35min 52s, total: 3d 23h 19min 32s
Wall time: 9h 21min 13s


### Evaluate the embeddings
We want similar entities to have more similar embeddings. For the purpose of profiling entities of a desired type, we are specifically interested in similar entities *within a type* having more similar embeddings. We'll investigate using cosine similarity.
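
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. It ranges from -1 (opposite direction) to 1 (same direction). The sketch below is a plain-Python version for illustration; the cells in this section use gensim's `wv.similarity` instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])      # opposite vectors

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Because the score depends only on direction, two embeddings with very different magnitudes can still be maximally similar.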

Let's compare entities of different types.

In [4]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load("{}/h_embeddings_3x6,min_count=8.kv".format(output_dir))

In [5]:
print("Vladimir Putin and Russia: {:.2f}".format(wv.similarity('Q7747','Q159')))
print("Vladimir Putin and Ireland: {:.2f}".format(wv.similarity('Q7747','Q27')))
print("Russia and Ireland: {:.2f}".format(wv.similarity('Q159','Q27')))

Vladimir Putin and Russia: 0.64
Vladimir Putin and Ireland: 0.11
Russia and Ireland: 0.19


Now let's compare various entities within the type we are interested in profiling

In [None]:
wv.most_similar('Q7747')

In [23]:
print("Vladimir Putin and : {:.2f}".format(wv.similarity('Q7747','Q28858481')))

KeyError: "word 'Q28858481' not in vocabulary"
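
The `KeyError` means the entity has no vector in the saved model. A likely cause is that it occurred fewer than `min_count` times in the walks, or that it is not in the item graph at all; the notebook does not say which. A lookup can be guarded as sketched below. `safe_similarity` is a hypothetical helper, and `ToyVectors` is a stand-in used so the example runs without a trained model. The sketch assumes the real object is a gensim `KeyedVectors`, which supports `key in wv` and `wv.similarity(a, b)`.

```python
def safe_similarity(vectors, key_a, key_b):
    """Return (similarity, missing_keys); similarity is None if any key is missing."""
    missing = [key for key in (key_a, key_b) if key not in vectors]
    if missing:
        return None, missing
    return float(vectors.similarity(key_a, key_b)), []


class ToyVectors:
    """Hypothetical stand-in exposing the two KeyedVectors features used above."""

    def __init__(self, vocabulary, fixed_similarity):
        self._vocabulary = set(vocabulary)
        self._fixed_similarity = fixed_similarity

    def __contains__(self, key):
        return key in self._vocabulary

    def similarity(self, key_a, key_b):
        return self._fixed_similarity


toy_wv = ToyVectors({"Q7747", "Q159"}, fixed_similarity=0.64)
found = safe_similarity(toy_wv, "Q7747", "Q159")
not_found = safe_similarity(toy_wv, "Q7747", "Q28858481")

print(found)      # (0.64, [])
print(not_found)  # (None, ['Q28858481'])
```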

Finally, we can look at how similar the beers are to each other (since we have very few of them).

In [None]:
# Use the keyed vectors loaded above (`wv`); `model` only exists if the training cell was run in this session
beer_vecs = wv['Q12877510', 'Q61976614', 'Q93552342', 'Q93557205', 'Q93559285', 'Q93558270', 'Q93560567', 'Q97412285']
similarity_mat = [wv.cosine_similarities(beer, beer_vecs) for beer in beer_vecs]
mask = np.zeros_like(similarity_mat)
mask[np.triu_indices_from(mask)] = True
labels = ['Macedonian Thrace Brewery', 'Rastrum', 'Vergina Lager', 'Vergina Red', 'Vergina Weiss', 'Vergina Porfyra', 'Vergina Black', 'Vergina Alcohol Free']
# Could mask to only show the lower triangle, but I think this is actually easier to read without the mask
fig, ax = plt.subplots(figsize=(9,7))
sns.set(font_scale=1.5)
sns.heatmap(similarity_mat, ax=ax, xticklabels=labels, yticklabels=labels, annot=True)
plt.xticks(rotation=30, horizontalalignment='right')
plt.title("Cosine similarity of beer embeddings")
plt.show()

The beer called 'Rastrum' seems to be an outlier. Searching for it online gives few results related to beer. Let's look at the similarities without this beer so we can more easily see how the others compare.

In [None]:
# Use the keyed vectors loaded above (`wv`) rather than `model.wv`
beer_vecs = wv['Q12877510', 'Q93552342', 'Q93557205', 'Q93559285', 'Q93558270', 'Q93560567', 'Q97412285']
similarity_mat = [wv.cosine_similarities(beer, beer_vecs) for beer in beer_vecs]
mask = np.zeros_like(similarity_mat)
mask[np.triu_indices_from(mask)] = True
labels = ['Macedonian Thrace Brewery', 'Vergina Lager', 'Vergina Red', 'Vergina Weiss', 'Vergina Porfyra', 'Vergina Black', 'Vergina Alcohol Free']
# Could mask to only show the lower triangle, but I think this is actually easier to read without the mask
fig, ax = plt.subplots(figsize=(9,7))
sns.set(font_scale=1.5)
sns.heatmap(similarity_mat, ax=ax, xticklabels=labels, yticklabels=labels, annot=True)
plt.xticks(rotation=30, horizontalalignment='right')
plt.title("Cosine similarity of beer embeddings")
plt.show()
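
The observation that 'Rastrum' looks like an outlier was made by eye from the heatmap. One way to make that kind of check more systematic is to compute each entity's mean similarity to all the others and flag the lowest. The sketch below does this on a hypothetical toy matrix; the values are made up for illustration and are not the beer similarities from this notebook.

```python
def mean_similarity_to_others(similarity_matrix):
    """For each row, average the similarities to every other entity (skip the diagonal)."""
    n = len(similarity_matrix)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(similarity_matrix)
    ]


# Hypothetical symmetric similarity matrix: D is dissimilar to A, B and C
toy_labels = ["A", "B", "C", "D"]
toy_similarity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

means = mean_similarity_to_others(toy_similarity)
outlier = toy_labels[means.index(min(means))]

print(outlier)  # D
```

With only a handful of entities this is no substitute for looking at the heatmap, but it scales better when the type being profiled has many members.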