
# Torch2vec - Text similarity on Steroids
---
This notebook is about a PyTorch implementation of Doc2Vec (distributed memory) with similarity measure.

I will be using ArXiV's papers abstracts to compute similarity between abstracts.

Now let's dive in...



### Activate GPU support
---



In [1]:
import torch
torch.cuda.is_available()

False

If successful, the output of the cell above should print `True`. Note that Google Colaboratory also offers [TPU](https://cloud.google.com/tpu/) support. These *Tensor Processing Units* are specifically designed for machine learning tasks and may outperform conventional GPUs. While support for TPUs in PyTorch is still pending, [tensorflow](https://www.tensorflow.org/) models may benefit from using TPUs (see [this tutorial](https://colab.research.google.com/notebooks/tpu.ipynb)).

### Useful commands

Within the notebook environment, you can not only execute Python code, but also bash commands by prepending a `!`. For example, you can install new Python packages via the package manager `pip`. Here, we just check the installed version of PyTorch:

In [None]:
!pip show torch

Another useful command is `!kill -9 -1`. It will reset all running kernels and free up memory (including GPU memory). Furthermore, there are a few commands to have a closer look on the hardware spcifications, i.e. to get information about the installed CPU and GPU:

In [None]:
!lscpu |grep 'Model name'

In [None]:
!nvidia-smi -L

In addition, you can check the available RAM and HDD memory:

In [None]:
!cat /proc/meminfo | grep 'MemAvailable'

In [None]:
!df -h / | awk '{print $4}'

Finally, one can execute the following command to get a live update on the GPU 

---

usage. This is useful to check how much of the GPU memory is in use to optimize 

---

the batchsize for training. Note that whenever the training routine in a notebook is still running, you need to execute this command in another Colaboratory notebook to get an instant response:

In [None]:
!nvidia-smi

### Mount Google Drive
Another important prerequisite for training our neural network is a place to save checkpoints of the trained model and to store obtained training data. Colaboratory provides convenient access to Google Drive via the `google.colab` Python module. The following command will mount your Google Drive contents to the folder path `/content/gdrive` on the Colaboratory instance. For authentication, you have to click the generated link and paste the authorization code into the input field:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

### Download the arxiv meta data with gsutil
We will need gsutil utility from google cloud sdk. Firstly, you need to authenticate yourself in Colab. Once you run the code below, it will ask you to follow a link to login and enter an access token that you receive upon successful login.


In [None]:
from google.colab import auth
auth.authenticate_user()

We would be using the gsutil command to upload and download files. So we first need to install the GCloud SDK.

In [None]:
!curl https://sdk.cloud.google.com | bash
!gcloud init

### Download the dataset

In [None]:
!gsutil cp -n gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json /content/gdrive/My\ Drive/arxiv-metadata-oai.json
!ls -l /content/gdrive/My\ Drive


### Reading the entire json metadata
This cell may take a minute to run considering the volume of data

In [None]:
import os
import tqdm
import json

input_file = "/content/gdrive/My Drive/arxiv-metadata-oai.json"

data  = []
with tqdm.tqdm(total=os.path.getsize(input_file)) as pbar:
     with open(input_file, 'r') as f:
          for line in f:
              pbar.update(len(line))
              data.append(json.loads(line))

I'm limiting my analysis to just 50,000 documents because of the compute limit.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

train = pd.DataFrame(train[:50000])

In [None]:
import shortuuid
import tqdm

In [None]:
for i in tqdm.tqdm(range(len(train))):
    train.loc[i,'id']=shortuuid.uuid()

100%|██████████| 197465/197465 [00:36<00:00, 5425.06it/s]


In [None]:
corpus = train['authors'].fillna('')+' '+train['title'].fillna('')+' '+train['summary'].fillna('')+' '+train['subjects'].fillna('')

In [None]:
corpus.index = train['id']
corpus.name = 'text'

In [None]:
corpus.to_csv('corpus.csv')

In [None]:
del corpus

In [None]:
import re
import tqdm
from time import time
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from numpy.random import choice
from torchtext.data import RawField,Field, TabularDataset
from spacy.lang.en import STOP_WORDS
import string
from collections import Counter


class DataPreparation():
    def __init__(self,corpus_path,vocab_size=None):
        data = pd.read_csv(corpus_path)
        self.corpus = data.iloc[:,1]
        self.document_ids = data.iloc[:,0].values
#         self.window_size = window_size
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.vocab_size = vocab_size if vocab_size else None
        
        
    def vocab_builder(self):
        tqdm.tqdm.pandas(desc='--- Tokenizing ---')
        self.corpus = self.corpus.progress_apply(self._tokenize_str)
        vocab = [word for sentence in self.corpus.values for word in sentence]
        word_counts = Counter(vocab)
        if not self.vocab_size:
            self.vocab_size = len(vocab)
        self.word_counts = word_counts.most_common()[:self.vocab_size]
        self.vocab = [word[0] for word in self.word_counts]+['[UNK]']
        self.vocab_size = len(self.vocab)
        self.word_id_mapper = {word:ids for ids,word in enumerate(self.vocab)}
        self.id_word_mapper = dict(zip(self.word_id_mapper.values(),self.word_id_mapper.keys()))
            
    
    def _tokenize_str(self,str_):
        stopwords = list(STOP_WORDS)+list((''.join(string.punctuation)).strip(''))+['-pron-','-PRON-']
        # keep only alphanumeric and punctations
        str_ = re.sub(r'[^A-Za-z0-9(),.!?\'`]', ' ', str_)
        # remove multiple whitespace characters
        str_ = re.sub(r'\s{2,}', ' ', str_)
        # punctations to tokens
        str_ = re.sub(r'\(', ' ( ', str_)
        str_ = re.sub(r'\)', ' ) ', str_)
        str_ = re.sub(r',', ' , ', str_)
        str_ = re.sub(r'\.', ' . ', str_)
        str_ = re.sub(r'!', ' ! ', str_)
        str_ = re.sub(r'\?', ' ? ', str_)
        # split contractions into multiple tokens
        str_ = re.sub(r'\'s', ' \'s', str_)
        str_ = re.sub(r'\'ve', ' \'ve', str_)
        str_ = re.sub(r'n\'t', ' n\'t', str_)
        str_ = re.sub(r'\'re', ' \'re', str_)
        str_ = re.sub(r'\'d', ' \'d', str_)
        str_ = re.sub(r'\'ll', ' \'ll', str_)
        # lower case

        return [word for word in str_.strip().lower().split() if word not in stopwords and len(word)>2]
    
    def get_data(self,window_size,num_noise_words):
        '''
        num_noise_words: number of words to be negative sampled
        '''
        self._padder(window_size)
        data = self._corpus_to_num()
        instances = self._instance_count(window_size)
        context = np.zeros((instances,window_size*2+1),dtype=np.int32)
        doc = np.zeros((instances,1),dtype=np.int32)
        k = 0 
        for doc_id, sentence  in (enumerate(tqdm.tqdm(data,desc='---- Creating Data ----'))):
            for i in range(window_size, len(sentence)-window_size):
                context[k] = sentence[i-window_size:i+window_size+1] # Get surrounding words
                doc[k] = doc_id
                k += 1
                
        target = context[:,window_size]
        context = np.delete(context,window_size,1)
        doc = doc.reshape(-1,)
        target_noise_ids = self._sample_noise_distribution(num_noise_words,window_size)
        target_noise_ids = np.insert(target_noise_ids,0,target,axis=1)
        
        
        context = torch.from_numpy(context).type(torch.LongTensor)
        doc = torch.from_numpy(doc).type(torch.LongTensor)
        target_noise_ids = torch.from_numpy(target_noise_ids).type(torch.LongTensor)
        
#         context = torch.from_numpy(context).type(torch.LongTensor).to(self.device)
#         doc = torch.from_numpy(doc).type(torch.LongTensor).to(self.device)
#         target_noise_ids = torch.from_numpy(target_noise_ids).type(torch.LongTensor).to(self.device)
        
        return doc,context,target_noise_ids
            
    def _padder(self,window_size):
        for i in range(len(self.corpus.values)):
            self.corpus.values[i] = ('[UNK] '*window_size).strip().split()+self.corpus.values[i]+('[UNK] '*window_size).strip().split()
            
    def _corpus_to_num(self):
        num_corpus = []
        unk_count = 0
        for sentence in self.corpus.values:
            sen = []
            for word in sentence:
                if word in self.word_id_mapper:
                    sen.append(self.word_id_mapper[word])
                else:
                    sen.append(self.word_id_mapper['[UNK]'])
                    unk_count+=1
            num_corpus.append(sen)
            
        self.word_counts+=[('[UNK]',unk_count)]
        return np.array(num_corpus)
    
    def _instance_count(self,window_size):
        instances = 0
        for i in self.corpus.values:
            instances+=len(i)-2*window_size   
        return instances
        
    def _sample_noise_distribution(self,num_noise_words,window_size):
        
        probs = np.zeros(self.vocab_size)

        for word, freq in self.word_counts:
            probs[self.word_id_mapper[word]] = freq

        probs = np.power(probs, 0.75)
        probs /= np.sum(probs)

        return choice(probs.shape[0],(self._instance_count(window_size),num_noise_words),p=probs).astype(np.int32)
    
    def __len__(self):
        return len(self.corpus)


In [None]:
data = DataPreparation('corpus.csv') #if going out of memory when using pytorch model then you can restrict model size by using vocab_size argument

In [None]:
data.vocab_builder()

  from pandas import Panel
--- Tokenizing ---: 100%|██████████| 197465/197465 [03:22<00:00, 973.69it/s] 


In [None]:
doc, context, target_noise_ids = data.get_data(window_size=3,num_noise_words=6)

---- Creating Data ----: 100%|██████████| 197465/197465 [01:14<00:00, 2667.17it/s]


In [None]:
len(doc)/1000

20673.196

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self,doc_ids,context, target_noise_ids):
        self.doc_ids = doc_ids
        self.context = context
        self.target_noise_ids = target_noise_ids
        
    def __len__(self):
        return len(self.doc_ids)
    
    def __getitem__(self,index):
        return self.doc_ids[index], self.context[index], self.target_noise_ids[index]

In [None]:
class NegativeSampling(nn.Module):
    
    
    def __init__(self):
        super(NegativeSampling, self).__init__()
        self._log_sigmoid = nn.LogSigmoid()

    def forward(self, scores):
        
        k = scores.size()[1] - 1
        return -torch.sum(
            self._log_sigmoid(scores[:, 0])
            + torch.sum(self._log_sigmoid(-scores[:, 1:]), dim=1) / k
        ) / scores.size()[0]

In [None]:
import torch
import torch.nn as nn


class DM(nn.Module):
    """Distributed Memory version of Paragraph Vectors.
    Parameters
    ----------
    vec_dim: int
        Dimensionality of vectors to be learned (for paragraphs and words).
    num_docs: int
        Number of documents in a dataset.
    num_words: int
        Number of distinct words in a daset (i.e. vocabulary size).
    """
    def __init__(self, vec_dim, num_docs, num_words):
        super(DM, self).__init__()
        # paragraph matrix
        self._D = nn.Parameter(
            torch.randn(num_docs, vec_dim), requires_grad=True)
        # word matrix
        self._W = nn.Parameter(
            torch.randn(num_words, vec_dim), requires_grad=True)
        # output layer parameters
        self._O = nn.Parameter(
            torch.FloatTensor(vec_dim, num_words).zero_(), requires_grad=True)

    def forward(self, context_ids, doc_ids, target_noise_ids):
        
        
        # combine a paragraph vector with word vectors of
        # input (context) words
        x = torch.add(
            self._D[doc_ids, :], torch.sum(self._W[context_ids, :], dim=1))

        # sparse computation of scores (unnormalized log probabilities)
        # for negative sampling
        return torch.bmm(
            x.unsqueeze(1),
            self._O[:, target_noise_ids].permute(1, 0, 2)).squeeze()

    def get_paragraph_vector(self):
        return self._D.data.tolist()
    
    def fit(self,doc_ids,context,target_noise_ids,epochs,batch_size,num_workers=1):
        
        opt=torch.optim.Adam(self.parameters(),lr=0.0001)
        cost_func = NegativeSampling()
        if torch.cuda.is_available():            
            cost_func.cuda()
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        dataset = Dataset(doc_ids, context, target_noise_ids)
        dataloader = torch.utils.data.DataLoader(dataset,batch_size=batch_size,num_workers=num_workers)
        loss = []
        for epoch in range(epochs):
            step = 0
            pbar = tqdm.tqdm(dataloader,desc='Epoch= {} ---- prev loss={}'.format(epoch+1,loss))
            loss=[]
            
            for doc_ids,context_ids,target_noise_ids in pbar:
                doc_ids = doc_ids.to(device)
                context_ids = context_ids.to(device)
                target_noise_ids = target_noise_ids.to(device)
                x = self.forward(
                        context_ids,
                        doc_ids,
                        target_noise_ids) 
                x = cost_func.forward(x)
                loss.append(x.item())
                self.zero_grad()
                x.backward()
                opt.step()
#                 if step%100==0:
#                     print('-',end='')
            loss = torch.mean(torch.FloatTensor(loss))
#             print('epoch - {} loss - {:.4f}'.format(epoch+1,loss))
        tqdm.tqdm.write('Final loss: {:.4f}'.format(loss))
        
    def save_model(self,ids,file_name):
        docvecs = self._D.data.cpu().numpy()
        if len(docvecs)!=len(ids):
            raise("Length of ids does'nt match")
            
            
        self.embeddings = np.concatenate([ids.reshape(-1,1),docvecs],axis=1)
        np.save(file_name,self.embeddings,fix_imports=False)
        
    def load_model(self,file_path):
        self.embeddings = np.load(file_path,allow_pickle=True,fix_imports=False)
        
    
    def similar_docs(self,docs,topk=10):
        topk=topk+1
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        if not isinstance(docs,np.ndarray):
            docs = np.array(docs)
        
        docids = self.embeddings[:,0]
        vecs = self.embeddings[:,1:]
        mask = np.isin(docids,docs)
        if not mask.any():
            raise('Not in vocab')
            
        given_docvecs = torch.FloatTensor(vecs[mask].tolist()).to(device)
        vecs = torch.FloatTensor(vecs.tolist()).to(device)
        similars = self._similarity(given_docvecs,vecs,topk)
        similar_docs = docids[similars.indices.tolist()[0]].tolist()
        probs = similars.values.tolist()[0]
        
        return similar_docs[1:], probs[1:]
        
    def _similarity(self,doc,embeddings,topk):
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        similarity = []
        
        cos=nn.CosineSimilarity(dim=0).to(device)
        for i in doc:
            inner = []
            for j in embeddings:
                inner.append(cos(i.view(-1,1),j.view(-1,1)).tolist())
            similarity.append(inner)
        similarity = torch.FloatTensor(similarity).view(1,-1).to(device)
        return torch.topk(similarity,topk)

In [None]:
data.vocab_size

771909

In [None]:
len(data)

197465

In [None]:
model = DM(vec_dim=100,num_docs=len(data),num_words=data.vocab_size).cuda()

In [None]:
num_workers=os.cpu_count()

In [None]:
num_workers

2

In [None]:
model.fit(doc,context,target_noise_ids,epochs=1,batch_size=3000,num_workers=num_workers) #epochs can be increased set to be 1 for testing purpose

Epoch= 1 ---- prev loss=[]: 100%|██████████| 6892/6892 [09:41<00:00, 11.85it/s]

Final loss: 1.1714





In [None]:
model.save_model(data.document_ids,'weights')

In [None]:
model.load_model('weights.npy')

In [None]:
np.load('weights.npy',allow_pickle=True).nbytes

159551720

In [None]:
model.similar_docs('E2HayXNpNnFfDd5U7LUX2o',topk=10)