<a href="https://colab.research.google.com/github/kamilwyszynski/graph_dataset_creation/blob/main/creating_graph_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a custom graph dataset
<br>

*This notebook demonstrates how to create a custom PyTorch Geometric graph dataset.*
<br>



<br>

 **Contents**



1.   Getting data in the correct shape
2.   Creating a Dataset class




# Installing PyTorch Geometric

In [18]:
!pip install --upgrade pip
!pip install torch-geometric
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
!pip install torch_sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html

Collecting pip
  Downloading https://files.pythonhosted.org/packages/de/47/58b9f3e6f611dfd17fb8bd9ed3e6f93b7ee662fb85bdfee3565e8979ddf7/pip-21.0-py3-none-any.whl (1.5MB)
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-21.0
Collecting torch-geometric
  Downloading torch_geometric-1.6.3.tar.gz (186 kB)
Collecting rdflib
  Downloading rdflib-5.0.0-py3-none-any.whl (231 kB)
Collecting ase
  Downloading ase-3.21.1-py3-none-any.whl (2.2 MB)
Collecting isodate
  Downloading isodate-0.6.0-py2.py3-none-any.whl (45 kB)
Building wheels for collected packages: torch-g

# Modules

In [19]:
# word embeddings
import gensim.downloader as gensim_api
# graph dataset
import torch
from torch_geometric.data import InMemoryDataset, Data
# pairwise combinations
import itertools

# Data Preparation

In [3]:
nlp = gensim_api.load("glove-wiki-gigaword-300")



In [4]:
nlp

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f0c398c3668>

In [5]:
nlp.most_similar(['whale'])

[('whales', 0.78084397315979),
 ('humpback', 0.6860450506210327),
 ('shark', 0.6499471664428711),
 ('dolphin', 0.6322990655899048),
 ('minke', 0.5734179019927979),
 ('whaling', 0.5654325485229492),
 ('orca', 0.5450523495674133),
 ('fish', 0.5387887954711914),
 ('tuna', 0.494992733001709),
 ('hunts', 0.49030470848083496)]

In [6]:
w1, w2 = 'whale', 'fish'

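print(nlp.similar_by_word(w1))   # nearest neighbours of w1
print(nlp.distance(w1, w2))      # cosine distance, i.e. 1 - similarity
print(nlp.n_similarity(w1, w2))  # expects lists of words; a bare string is iterated character by character
print(nlp.similarity(w1, w2))    # cosine similarity between the two words; use this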

[('whales', 0.78084397315979), ('humpback', 0.6860450506210327), ('shark', 0.6499471664428711), ('dolphin', 0.6322990655899048), ('minke', 0.5734179019927979), ('whaling', 0.5654325485229492), ('orca', 0.5450523495674133), ('fish', 0.5387887954711914), ('tuna', 0.494992733001709), ('hunts', 0.49030470848083496)]
0.46121108531951904
0.7872955
0.5387889
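
The numbers above fit together: `distance` and `similarity` sum to one (0.4612 + 0.5388), because `distance` is one minus the cosine similarity. The relationship can be sketched in plain Python as follows. The three-dimensional vectors are made up for illustration and are not real GloVe embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (assumption: real GloVe vectors have 300 dimensions)
whale = [0.9, 0.1, 0.3]
fish = [0.7, 0.4, 0.1]

similarity = cosine_similarity(whale, fish)
distance = 1.0 - similarity  # cosine distance

print(similarity, distance)
```

Either value could serve as an edge feature; the notebook uses the similarity.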


### Selecting Datapoints

In [7]:
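def unpickle(file):
    """Load a pickled CIFAR file; with encoding='bytes' the keys come back as bytes."""
    import pickle
    with open(file, 'rb') as fo:
        # avoid naming the result `dict`, which would shadow the built-in
        data = pickle.load(fo, encoding='bytes')
    return data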

In [8]:
# read cifar100 metadata to retrieve the words
cifar_path = r'/content/drive/MyDrive/Colab Notebooks/Word2Vec/cifar-100-python/meta'
cifar_dict = unpickle(cifar_path)
cifar_dict.keys()

dict_keys([b'fine_label_names', b'coarse_label_names'])

In [9]:
# keep only the words that exist in the vocabulary of the nlp object

def is_in_nlp(word):
    # the membership test already returns a bool
    return word in nlp.vocab


In [10]:
words = [i.decode("utf-8") for i in cifar_dict[b'fine_label_names'] if is_in_nlp(i.decode("utf-8"))]
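
As a rough illustration of what this filter does, here is a sketch with a hypothetical four-label list and a tiny stand-in vocabulary (neither is the real CIFAR-100 metadata or the real GloVe vocabulary). Compound labels written with underscores are the kind of entry that is likely to be missing from a word-level vocabulary and therefore dropped.

```python
# Hypothetical stand-ins for cifar_dict[b'fine_label_names'] and the model vocabulary
raw_labels = [b"apple", b"aquarium_fish", b"whale", b"sweet_pepper"]
vocabulary = {"apple", "whale", "fish", "pepper"}

# Labels are stored as bytes, so decode them before the lookup
decoded = [label.decode("utf-8") for label in raw_labels]

kept = [word for word in decoded if word in vocabulary]
dropped = [word for word in decoded if word not in vocabulary]

print(kept)     # ['apple', 'whale']
print(dropped)  # ['aquarium_fish', 'sweet_pepper']
```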

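### Node feature matrix with shape [ num_nodes, num_node_features ]

<br>

In this case, we're creating a graph of words and the similarities between them.
The words carry no features of their own, so the feature matrix is an identity matrix: each node gets a one-hot vector, which gives `x` the shape [ num_nodes, num_nodes ].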

In [11]:
x = torch.eye(len(words))
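
A quick sketch of what `torch.eye` produces, using a hypothetical four-node graph: row `i` is a one-hot vector that identifies node `i`, so the number of node features equals the number of nodes.

```python
import torch

num_nodes = 4  # hypothetical small graph, for illustration only
x = torch.eye(num_nodes)

print(x.shape)  # torch.Size([4, 4])
print(x[2])     # tensor([0., 0., 1., 0.])
```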

### Graph connectivity 

The word graph should be fully connected.
I didn't find a documented way to generate full connectivity automatically, so I build the edge list myself.

<br>

The connectivity array should have the shape [ 2, num_edges ], with one column per edge holding the indices of the two nodes it connects.

In [45]:
torch.tensor([[1,2,3],[3,4,5]]).shape

torch.Size([2, 3])

In [101]:
edge_index = torch.tensor(list(itertools.combinations(range(len(words)), 2)))

# transpose [num_edges, 2] -> [2, num_edges]; torch.reshape would scramble the node pairs
edge_index = edge_index.t().contiguous()
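
The [ 2, num_edges ] layout can be checked on a hypothetical four-node graph. Transposing the list of pairs keeps each pair in one column, whereas `torch.reshape` only re-chunks the flattened values and so mixes nodes from different pairs. The last step is an optional extension rather than something this notebook does: PyTorch Geometric stores an undirected edge as two directed ones, so the reversed pairs can be appended if both directions are wanted.

```python
import itertools
import torch

num_nodes = 4  # hypothetical small graph, for illustration only
pairs = torch.tensor(list(itertools.combinations(range(num_nodes), 2)))  # shape [6, 2]

# Transpose: column k is still the k-th pair
transposed = pairs.t().contiguous()
print(transposed)
# tensor([[0, 0, 0, 1, 1, 2],
#         [1, 2, 3, 2, 3, 3]])

# Reshape: same shape, but the columns are no longer the original pairs
reshaped = torch.reshape(pairs, (2, -1))
print(reshaped)
# tensor([[0, 1, 0, 2, 0, 3],
#         [1, 2, 1, 3, 2, 3]])

# Optional: add the reversed edges so every connection exists in both directions
edge_index = torch.cat([transposed, transposed.flip(0)], dim=1)
print(edge_index.shape)  # torch.Size([2, 12])
```

If the reversed edges are added, the edge features have to be duplicated in the same order so that each column of `edge_index` still lines up with its row of `edge_attr`.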

###  Edge feature matrix

This is a tensor that contains the features of an edge.
In this case, each edge will hold one value - the similarity between the two words it's connecting.

<br>

It should be of size [ num_edges, num_edge_features ]

In [109]:
word_similarities = [nlp.similarity(w1,w2) for w1,w2 in list(itertools.combinations(words, 2))]

edge_attr = torch.tensor(word_similarities)
edge_attr

tensor([0.1858, 0.2127, 0.1040,  ..., 0.1771, 0.1391, 0.0273])
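
The tensor above is one-dimensional, with shape [ num_edges ]. To match the [ num_edges, num_edge_features ] layout described earlier, a feature axis can be added with `unsqueeze`. The sketch below uses three hypothetical words and made-up similarity scores in place of the embedding model.

```python
import itertools
import torch

# Hypothetical words and made-up scores standing in for nlp.similarity
words = ["whale", "fish", "tuna"]
toy_similarity = {
    ("whale", "fish"): 0.54,
    ("whale", "tuna"): 0.49,
    ("fish", "tuna"): 0.60,
}

# Same ordering as itertools.combinations over the node indices,
# so row k describes the edge in column k of edge_index
pairs = list(itertools.combinations(words, 2))
edge_attr = torch.tensor([toy_similarity[p] for p in pairs], dtype=torch.float)
print(edge_attr.shape)  # torch.Size([3])

# Add the feature axis: one feature per edge
edge_attr = edge_attr.unsqueeze(1)
print(edge_attr.shape)  # torch.Size([3, 1])
```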

# PyTorch Geometric Dataset Implementation

Here, I repeat the steps above inside the dataset class itself to make it reusable.

<br>

The `Data` object is built from the same components constructed above: `x`, `edge_index` and `edge_attr`.
`torch_geometric.data.InMemoryDataset.collate()` is then called on a list containing that single `Data` object; it returns the collated data together with the slices that `InMemoryDataset` uses to index individual graphs.

In [28]:
class CIFARGraph(InMemoryDataset):
    r"""Network of almost 100 words from the CIFAR100 dataset and their
        similarity to each other.

    Args:
        word2vec (object): A trained word-embedding model.
            Instance of gensim.models.keyedvectors.Word2VecKeyedVectors class.
        word_list (list): A list of words needed in the dataset.
        transform (callable, optional): A function/transform that takes in an
            :obj:`torch_geometric.data.Data` object and returns a transformed
            version. The data object will be transformed before every access.
            (default: :obj:`None`)
    """
    def __init__(self, word2vec, word_list, transform=None):
        super(CIFARGraph, self).__init__('.', transform, None, None)

        # keep the model on the instance so is_in_nlp does not rely on a global
        self.word2vec = word2vec
        word_list = [i for i in word_list if self.is_in_nlp(i)]

        # one-hot node features, sized by the filtered word_list (not the global `words`)
        x = torch.eye(len(word_list), dtype=torch.float)

        # every unordered pair of nodes, transposed to shape [2, num_edges]
        node_combinations = torch.tensor(list(itertools.combinations(range(len(word_list)), 2)))
        edge_index = node_combinations.t().contiguous()

        # one similarity value per edge, shape [num_edges, 1]
        word_similarities = [float(word2vec.similarity(w1, w2)) for w1, w2 in itertools.combinations(word_list, 2)]
        edge_attr = torch.tensor(word_similarities, dtype=torch.float).unsqueeze(1)

        # y = torch.tensor(word_list)

        data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

        self.data, self.slices = self.collate([data])

    def __repr__(self):
        return '{}()'.format(self.__class__.__name__)

    def is_in_nlp(self, word):
        return word in self.word2vec.vocab


# Creating an instance of the dataset

In [None]:
word_list = [i.decode("utf-8") for i in cifar_dict[b'fine_label_names']]
word2vec = gensim_api.load("glove-wiki-gigaword-300")

In [29]:
cifar = CIFARGraph(word2vec, word_list)