This notebook accompanies the blog post https://engineering.taboola.com/think-your-data-different.

The page above no longer exists. For reference, see https://www.freecodecamp.org/news/how-to-think-about-your-data-in-a-different-way-b84306fc2e1d/

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/Engineering/Curriculum/8th Semester/Internship/descriptive_evaluation_project/node2vec/references

/content/drive/.shortcut-targets-by-id/17Gn89Edqfyxljr8tO09VdcQWGcUCa_Ua/descriptive_evaluation_project/node2vec/references


In [None]:
import pandas as pd
import numpy as np
import itertools
from sklearn.cluster import KMeans
import pprint

## 1. Prepare input for node2vec
We'll use a CSV file in which each row represents a single recommendable item. Each row contains a comma-separated list of the named entities that appear in the item's title.

In [None]:
named_entities_df = pd.read_csv('named_entities.csv')
named_entities_df.head()

Unnamed: 0,named_entities
0,"basketball,Kobe Bryant"
1,"basketball,Lebron James"


First, we'll have to tokenize the named entities, since `node2vec` expects integers.

In [None]:
# Maps each named entity (string) to a consecutive integer id.
tokenizer = {}
named_entities_df['named_entities'] = named_entities_df['named_entities'].apply(
    lambda named_entities: [tokenizer.setdefault(named_entity, len(tokenizer))
                            for named_entity in named_entities.split(',')])
named_entities_df.head()

Unnamed: 0,named_entities
0,"[0, 1]"
1,"[0, 2]"
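The tokenization step can be tried in isolation. The following is a minimal sketch on two hand-written rows; `build_tokenizer` and the demo variable names are hypothetical helpers, not part of the original notebook.

```python
def build_tokenizer(rows):
    """Map each distinct entity name to a consecutive integer id."""
    tokenizer = {}
    tokenized_rows = []
    for row in rows:
        # setdefault returns the existing id, or stores and returns the next free one.
        tokenized_rows.append([tokenizer.setdefault(name, len(tokenizer))
                               for name in row.split(',')])
    return tokenizer, tokenized_rows


demo_rows = ["basketball,Kobe Bryant", "basketball,Lebron James"]
demo_tokenizer, demo_tokenized = build_tokenizer(demo_rows)
print(demo_tokenized)   # [[0, 1], [0, 2]]
print(demo_tokenizer)   # {'basketball': 0, 'Kobe Bryant': 1, 'Lebron James': 2}
```

Ids are assigned in order of first appearance, so the mapping depends on the row order of the input file.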


In [None]:
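# In Python 3, dict.items() returns a view that cannot be sliced, so take the first five with islice.
pprint.pprint(dict(itertools.islice(tokenizer.items(), 5)))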

In order to construct the graph on which we'll run node2vec, we first need to understand which named entities appear together.

In [None]:
pairs_df = named_entities_df['named_entities'].apply(lambda named_entities: list(itertools.combinations(named_entities, 2)))
pairs_df = pairs_df[pairs_df.apply(len) > 0]
pairs_df = pd.DataFrame(np.concatenate(pairs_df.values), columns=['named_entity_1', 'named_entity_2'])
pairs_df.head()

Unnamed: 0,named_entity_1,named_entity_2
0,0,1
1,0,2
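One detail worth checking is pair order. `itertools.combinations` keeps the order in which tokens appear in each row, so `(0, 1)` and `(1, 0)` would later be counted as different pairs. The sketch below assumes co-occurrence is meant to be symmetric and stores each pair in sorted order; `count_cooccurrences` is a hypothetical helper, not code from the original notebook.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(tokenized_rows):
    """Count how often each unordered pair of tokens shares a row."""
    counts = Counter()
    for tokens in tokenized_rows:
        # Deduplicate and sort so every unordered pair has one canonical form.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


demo_token_rows = [[0, 1], [0, 2], [1, 0]]
demo_pair_counts = count_cooccurrences(demo_token_rows)
print(demo_pair_counts)
```

Here the rows `[0, 1]` and `[1, 0]` both contribute to the single pair `(0, 1)`.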


Now we can construct the graph. The weight of an edge connecting two named entities will be the number of times these named entities appear together in our dataset.

In [None]:
NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD = 25

# Count how many times each pair appears, then keep only pairs above the threshold.
edges_df = pairs_df.groupby(['named_entity_1', 'named_entity_2']).size().reset_index(name='weight')
edges_df = edges_df[edges_df['weight'] > NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD]
edges_df[['named_entity_1', 'named_entity_2', 'weight']].to_csv('edges.csv', header=False, index=False, sep=' ')
edges_df.head()

Unnamed: 0,named_entity_1,named_entity_2,weight
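The output above lists no edges, which suggests that no pair passed the threshold on this dataset. An empty edge list is one plausible reason for a failing `node2vec` run, so it may help to check for it before continuing. The sketch below is an assumption-laden illustration: `filter_edges` and `to_edge_list` are hypothetical helpers, and the counts are made up for the demo.

```python
def filter_edges(pair_counts, threshold):
    """Keep pairs seen more than `threshold` times; fail early if none remain."""
    edges = [(a, b, n) for (a, b), n in sorted(pair_counts.items()) if n > threshold]
    if not edges:
        raise ValueError(
            f'no pair co-occurs more than {threshold} times; lower the threshold')
    return edges


def to_edge_list(edges):
    """Render edges as space-separated 'node1 node2 weight' lines."""
    return ''.join(f'{a} {b} {n}\n' for a, b, n in edges)


demo_counts = {(0, 1): 30, (0, 2): 3}   # made-up co-occurrence counts
demo_edges = filter_edges(demo_counts, threshold=25)
print(to_edge_list(demo_edges), end='')

try:
    filter_edges(demo_counts, threshold=100)
except ValueError as error:
    print('skipping node2vec:', error)
```

Failing early keeps the error message close to its cause instead of surfacing later inside the embedding step.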


Next, we'll run `node2vec`, which writes the resulting embeddings to a file called `emb`.  
We'll use the open-source implementation developed by [Stanford](https://github.com/snap-stanford/snap/tree/master/examples/node2vec).

In [None]:
!python node2vec/src/main.py --input edges.csv --output emb --weighted

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
[]
[]
Traceback (most recent call last):
  File "node2vec/src/main.py", line 106, in <module>
    main(args)
  File "node2vec/src/main.py", line 102, in main
    learn_embeddings(walks)
  File "node2vec/src/main.py", line 89, in learn_embeddings
    model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=1, sg=1, workers=args.workers, iter=args.iter)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/word2vec.py", line 767, in __init__
    fast_version=FAST_VERSION)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/word2vec.py", line 892, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/usr/local/lib/python3.7/dist-packages/ge

## 2. Read embedding and run KMeans clustering

In [None]:
emb_df = pd.read_csv('emb', sep=' ', skiprows=[0], header=None)
emb_df.set_index(0, inplace=True)
emb_df.index.name = 'named_entity'
emb_df.head()

FileNotFoundError: ignored
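The `skiprows=[0]` argument above suggests that `emb` starts with a header line. The sketch below assumes the file is in the word2vec text format: a first line with the number of nodes and the number of dimensions, followed by one line per node holding its id and vector. `parse_embeddings` and the sample text are hypothetical and only illustrate that layout.

```python
def parse_embeddings(text):
    """Parse word2vec-style text into {node_id: [float, ...]}."""
    lines = text.strip().splitlines()
    # Header line: "<number of nodes> <number of dimensions>".
    num_nodes, dimensions = (int(value) for value in lines[0].split())
    embeddings = {}
    for line in lines[1:]:
        parts = line.split()
        embeddings[int(parts[0])] = [float(value) for value in parts[1:]]
    # Sanity checks against the header.
    if len(embeddings) != num_nodes:
        raise ValueError('node count does not match the header')
    if any(len(vector) != dimensions for vector in embeddings.values()):
        raise ValueError('vector length does not match the header')
    return embeddings


sample_emb_text = "2 3\n0 0.1 0.2 0.3\n1 -0.4 0.5 0.6\n"   # made-up values
demo_embeddings = parse_embeddings(sample_emb_text)
print(demo_embeddings[1])
```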

Each row holds the embedding vector of one named entity, and each column is one dimension of the embedding space.  
We'll now cluster the embeddings using a simple clustering algorithm such as k-means.

In [None]:
NUM_CLUSTERS = 10

kmeans = KMeans(n_clusters=NUM_CLUSTERS)
# fit_predict fits the model and returns the cluster label of every row in one call.
labels = kmeans.fit_predict(emb_df)
emb_df['cluster'] = labels
clusters_df = emb_df.reset_index()[['named_entity','cluster']]
clusters_df.head()
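To make the clustering step concrete, here is a minimal sketch of Lloyd's algorithm, the basic iteration behind k-means. It is not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical function that naively uses the first `k` points as initial centroids, and the 2-D points are made up.

```python
import math


def lloyd_kmeans(points, k, iterations=20):
    """Basic k-means: alternate between assigning points and moving centroids."""
    centroids = [list(point) for point in points[:k]]  # naive, deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(point, centroids[c]))
                  for point in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


demo_points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1), (4.9, 5.1)]
demo_labels, demo_centroids = lloyd_kmeans(demo_points, k=2)
print(demo_labels)   # [0, 0, 1, 1, 0, 1]
```

On real embeddings the result depends on the initial centroids, which is one reason library implementations use more careful initialization.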

## 3. Prepare input for Gephi

[Gephi](https://gephi.org) is a nice visualization tool for graph data.  
We'll write our data in a format that Gephi recognizes.

In [None]:
id_to_named_entity = {named_entity_id: named_entity
                      for named_entity, named_entity_id in tokenizer.items()}

with open('clusters.gdf', 'w') as f:
    # Node section: one line per named entity with its cluster and readable label.
    f.write('nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR\n')
    for _, row in clusters_df.iterrows():
        f.write('{},{},{}\n'.format(row['named_entity'], row['cluster'], id_to_named_entity[row['named_entity']]))
    # Edge section: one line per co-occurrence edge with its weight.
    f.write('edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE\n')
    for _, row in edges_df.iterrows():
        f.write('{},{},{}\n'.format(row['named_entity_1'], row['named_entity_2'], row['weight']))

Finally, we can open `clusters.gdf` using Gephi in order to inspect the clusters.
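Before opening the file in Gephi, it can be useful to inspect the text that will be written. The sketch below builds the same two-section layout in memory; `build_gdf` is a hypothetical helper and the nodes and edges are made-up values. It rejects labels that contain a comma, because an unquoted comma would shift the columns of that line.

```python
def build_gdf(nodes, edges):
    """nodes: (node_id, cluster_id, label) tuples; edges: (node1, node2, weight) tuples."""
    lines = ['nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR']
    for node_id, cluster_id, label in nodes:
        if ',' in label:
            raise ValueError(f'label contains a comma: {label!r}')
        lines.append(f'{node_id},{cluster_id},{label}')
    lines.append('edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE')
    for node1, node2, weight in edges:
        lines.append(f'{node1},{node2},{weight}')
    return '\n'.join(lines) + '\n'


demo_nodes = [(0, 3, 'basketball'), (1, 3, 'Kobe Bryant')]   # made-up clusters
demo_gdf_edges = [(0, 1, 30)]                                # made-up weight
demo_gdf = build_gdf(demo_nodes, demo_gdf_edges)
print(demo_gdf, end='')
```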