# Final project - Hyperparameter optimization

## Installing packages (if running on colab)

**IMPORTANT**: Uncomment and run the cell below, restart the runtime (menu Runtime > Restart runtime, or Ctrl + M .), and then run the cell again. Otherwise you will get errors. You only need to repeat this if your Google Colab / Kaggle instance is restarted or lost.

In [1]:
'''
! pip install --upgrade scipy
! pip install --upgrade pandas
! pip install ipywidgets
! pip uninstall -y pykeen
! pip install git+https://github.com/pykeen/pykeen.git@v1.5.0
! python -c "import pykeen" || pip install git+https://github.com/pykeen/pykeen.git@v1.5.0
from pkg_resources import require
require('pykeen')
'''


After you install the packages above, you can run from this cell onwards (Ctrl + F10 when this cell is selected).

If SAVE_TO_DRIVE is set to True, the following cell enables storing the HPO results in your Google Drive account. This requires authentication and, as far as we know, only works on Colab instances.

In [2]:
SAVE_TO_DRIVE = False
if (SAVE_TO_DRIVE):
    from google.colab import drive
    drive.mount('/content/drive')


## Imports and parameters

In [3]:
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
import pandas as pd
import torch
import gc
from pykeen.hpo import hpo_pipeline

Training under the Local Closed World Assumption (LCWA) is computationally heavy compared to training under the Stochastic Local Closed World Assumption (sLCWA). The choice of training loop affects which loss function is ideal, as described in the article. By default this notebook sets SLCWA_ONLY = True; set it to False to allow the LCWA training loops for ideal optimization of the hyperparameters.

In [4]:
SLCWA_ONLY = True
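
To make the cost difference concrete, here is a minimal pure-Python sketch of how the two approaches form training examples from a toy graph. It is an illustration only, not PyKEEN's implementation: it corrupts tails only and filters out known triples, and the toy triples and the helpers `slcwa_examples` and `lcwa_examples` are invented for this example.

```python
import random

# Toy knowledge graph; entities and triples are invented for illustration.
triples = [("a", "likes", "b"), ("a", "likes", "c"), ("b", "knows", "c")]
entities = sorted({e for h, _, t in triples for e in (h, t)})


def slcwa_examples(triples, entities, num_negs_per_pos=1, seed=0):
    """Pair each positive triple with a few randomly corrupted tails."""
    rng = random.Random(seed)
    known = set(triples)
    examples = []
    for h, r, t in triples:
        examples.append(((h, r, t), 1))
        negs = 0
        while negs < num_negs_per_pos:
            candidate = (h, r, rng.choice(entities))
            if candidate not in known:  # keep only unobserved triples
                examples.append((candidate, 0))
                negs += 1
    return examples


def lcwa_examples(triples, entities):
    """Label every entity as a tail candidate for each (head, relation) pair."""
    tails = {}
    for h, r, t in triples:
        tails.setdefault((h, r), set()).add(t)
    return {
        pair: [1 if e in observed else 0 for e in entities]
        for pair, observed in tails.items()
    }


# sLCWA: 3 positives plus 1 sampled negative each, so 6 examples in total.
print(len(slcwa_examples(triples, entities)))
# LCWA: one label per entity for every (head, relation) pair.
print(lcwa_examples(triples, entities))
```

In the LCWA variant the number of scored candidates grows with the number of entities, which is where the extra cost comes from on datasets with tens of thousands of entities.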

Number of epochs to use for training.

In [5]:
N_EPOCHS = 100

TIME_PER_TRIAL - Time budget in seconds for hyperparameter optimization. The optimizer runs as many trials as fit in the budget and stops after finishing the trial that was running when time ran out. *Not used; we run with N_TRIALS instead.*

N_TRIALS - How many trials to run for each combination of dataset and model.

In [6]:
TIME_PER_TRIAL = 900
N_TRIALS = 10
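
The difference between the two stopping rules described above can be sketched in plain Python. This only illustrates the behaviour, and is not PyKEEN or Optuna code; `run_trials`, `fake_trial` and `FakeClock` are hypothetical names, and the 400-second trial duration is made up.

```python
import itertools
import time


def run_trials(trial_fn, n_trials=None, time_budget=None, clock=time.monotonic):
    """Run trials until a trial count or a time budget is reached.

    The budget is only checked between trials, so a trial that is
    already running when the budget expires is allowed to finish.
    """
    start = clock()
    results = []
    for i in itertools.count():
        if n_trials is not None and i >= n_trials:
            break
        if time_budget is not None and clock() - start >= time_budget:
            break
        results.append(trial_fn(i))
    return results


class FakeClock:
    """Deterministic clock so the example does not actually wait."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()


def fake_trial(i):
    clock.now += 400.0  # pretend every trial takes 400 seconds
    return i


# Budget of 900 s: the third trial starts at 800 s and finishes at 1200 s.
by_time = run_trials(fake_trial, time_budget=900.0, clock=clock)
# Fixed count: exactly ten trials, however long they take.
by_count = run_trials(fake_trial, n_trials=10, clock=clock)
print(len(by_time), len(by_count))
```

With a fixed trial count, every dataset and model combination gets the same number of trials regardless of how slow each one is, which makes the runs easier to compare.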

Always run on GPU

In [7]:
CPU_DEV = 'gpu'

In [8]:

# Use this value for BASE_DATA_URL if working with local data
BASE_DATA_URL = './data'

# Use this value for BASE_DATA_URL if working with data from the github repo
#BASE_DATA_URL = 'https://raw.githubusercontent.com/hvags/effective-octo-eureka/main/data'

## Retrieving data

The datasets were already pre-stratified into training, testing and validation sets. Using the methods described in the data manipulation notebook, they were split into symmetric and asymmetric subsets. The results were stored in a GitHub repository for easy access in later work.

To retrieve the data, we first populate a dictionary with the path of each set, relative to BASE_DATA_URL.

In [9]:
dataset_urls = dict()

## wn18rr datasets

In [10]:
dataset_urls['wn18rr-full'] = {
    'train': '/wn18rr/train_wn18rr.txt',
    'validate': '/wn18rr/valid_wn18rr.txt',
    'test': '/wn18rr/test_wn18rr.txt'
    }

In [11]:
dataset_urls['wn18rr-sym'] = {
    'train': '/wn18rr/sym_train_wn18rr.txt',
    'validate': '/wn18rr/sym_valid_wn18rr.txt',
    'test': '/wn18rr/sym_test_wn18rr.txt'
    }


In [12]:
dataset_urls['wn18rr-asym'] = {
    'train': '/wn18rr/asym_train_wn18rr.txt',
    'validate': '/wn18rr/asym_valid_wn18rr.txt',
    'test': '/wn18rr/asym_test_wn18rr.txt'
    }

## fb15k-237 datasets

In [13]:
dataset_urls['fb15k-237-full'] = {
    'train':  '/fb15k-237/train_fb15k-237.txt',
    'validate': '/fb15k-237/valid_fb15k-237.txt',
    'test': '/fb15k-237/test_fb15k-237.txt'
    }

In [14]:
dataset_urls['fb15k-237-sym'] = {
    'train': '/fb15k-237/sym_train_fb15k-237.txt',
    'validate': '/fb15k-237/sym_valid_fb15k-237.txt',
    'test': '/fb15k-237/sym_test_fb15k-237.txt'
    }

In [15]:
dataset_urls['fb15k-237-asym'] = {
    'train': '/fb15k-237/asym_train_fb15k-237.txt',
    'validate': '/fb15k-237/asym_valid_fb15k-237.txt',
    'test': '/fb15k-237/asym_test_fb15k-237.txt'
    }
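
Because the file names follow a regular pattern, the six entries above could also be generated in a loop. The sketch below is an equivalent alternative, assuming the naming scheme holds for every file; `build_dataset_urls` is a hypothetical helper, not part of the notebook.

```python
def build_dataset_urls(bases=("wn18rr", "fb15k-237")):
    """Generate relative paths for the full, sym and asym variants of each dataset."""
    prefixes = {"full": "", "sym": "sym_", "asym": "asym_"}
    splits = {"train": "train", "validate": "valid", "test": "test"}
    urls = {}
    for base in bases:
        for variant, prefix in prefixes.items():
            urls[f"{base}-{variant}"] = {
                split: f"/{base}/{prefix}{stem}_{base}.txt"
                for split, stem in splits.items()
            }
    return urls


generated = build_dataset_urls()
print(generated["wn18rr-sym"]["validate"])  # /wn18rr/sym_valid_wn18rr.txt
```

Spelling the entries out, as the cells above do, makes it easier to comment out a single dataset for a partial run.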

## Read data from files and create TriplesFactories

In [16]:
datasets = dict()

for key in dataset_urls.keys():
    print(f'Processing: {key}')
    
    datasets[key] = dict()
    
    df_train = pd.read_csv(BASE_DATA_URL + dataset_urls[key]['train'], header=None, sep='\t', names=['head', 'relation','tail'])
    df_validate = pd.read_csv(BASE_DATA_URL + dataset_urls[key]['validate'], header=None, sep='\t', names=['head', 'relation','tail'])
    df_test = pd.read_csv(BASE_DATA_URL + dataset_urls[key]['test'], header=None, sep='\t', names=['head', 'relation','tail'])
    
    datasets[key]['train'] = TriplesFactory.from_labeled_triples(df_train.astype('str').to_numpy())
    entity_mapping = datasets[key]['train'].entity_to_id
    relation_mapping = datasets[key]['train'].relation_to_id
    
    datasets[key]['validate'] = TriplesFactory.from_labeled_triples(df_validate.astype('str').to_numpy(),
                                                                    entity_to_id=entity_mapping,
                                                                    relation_to_id=relation_mapping
                                                                    )
    
    datasets[key]['test'] = TriplesFactory.from_labeled_triples(df_test.astype('str').to_numpy(),
                                                                    entity_to_id=entity_mapping,
                                                                    relation_to_id=relation_mapping
                                                                    )
    print('\n')

Processing: wn18rr-full


You're trying to map triples with 211 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 210 from 3034 triples were filtered out
You're trying to map triples with 212 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 210 from 3134 triples were filtered out




Processing: wn18rr-sym


You're trying to map triples with 60 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 48 from 1165 triples were filtered out
You're trying to map triples with 35 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 31 from 1172 triples were filtered out




Processing: wn18rr-asym


You're trying to map triples with 527 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 518 from 1869 triples were filtered out
You're trying to map triples with 555 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 546 from 1962 triples were filtered out




Processing: fb15k-237-full


You're trying to map triples with 9 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 9 from 17535 triples were filtered out
You're trying to map triples with 30 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 28 from 20466 triples were filtered out




Processing: fb15k-237-sym


You're trying to map triples with 192 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 110 from 524 triples were filtered out
You're trying to map triples with 174 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 98 from 610 triples were filtered out




Processing: fb15k-237-asym


You're trying to map triples with 11 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 11 from 17011 triples were filtered out
You're trying to map triples with 30 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 28 from 19856 triples were filtered out
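
The warnings above arise because the validation and test factories reuse the entity and relation mappings of the training set, so a triple that mentions an unseen label cannot be assigned IDs. A rough pure-Python sketch of that filtering step follows. It mirrors the behaviour reported in the log rather than PyKEEN's actual code, and the toy triples and `filter_unmappable` are invented for illustration.

```python
def filter_unmappable(triples, entity_to_id, relation_to_id):
    """Keep only triples whose head, relation and tail all have IDs."""
    kept = [
        (h, r, t)
        for h, r, t in triples
        if h in entity_to_id and t in entity_to_id and r in relation_to_id
    ]
    return kept, len(triples) - len(kept)


# Mappings are built from the training triples only.
train = [("a", "likes", "b"), ("b", "knows", "c")]
train_entities = sorted({x for h, _, t in train for x in (h, t)})
train_relations = sorted({r for _, r, _ in train})
entity_to_id = {e: i for i, e in enumerate(train_entities)}
relation_to_id = {r: i for i, r in enumerate(train_relations)}

# "d" never appears in training, so the second triple cannot be mapped.
test = [("a", "knows", "c"), ("a", "likes", "d")]
kept, dropped = filter_unmappable(test, entity_to_id, relation_to_id)
print(f"In total {dropped} from {len(test)} triples were filtered out")
```

The share of dropped triples differs a lot between subsets (for example 518 of 1869 for wn18rr-asym versus 9 of 17535 for fb15k-237-full), which is worth keeping in mind when comparing evaluation results across them.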






## Models

#### Defining models and their parameters
For a partial run, comment out the models that should not be included.

In [17]:
models = [
          #('TransE', dict(scoring_fct_norm=2)),
          #('TransH', dict()),
          #('TransD', dict()),
          #('TransR', dict()),
          ('RESCAL', dict()),
          #('ComplEx', dict()),
          #('RotatE', dict())
         ]



Define parameters that are specific to a combination of dataset and model.

In [20]:
class md_param:
    def __init__(self, loss, training_loop):
        self.loss = loss
        self.training_loop = training_loop        

        
if not SLCWA_ONLY:
    model_dataset_param = {('TransE',  'wn18rr'):     md_param('BCEWithLogitsLoss', 'lcwa'),
                           ('TransH',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                           ('TransD',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                           ('TransR',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),                       
                           ('RESCAL',  'wn18rr'):     md_param('CrossEntropyLoss', 'lcwa'),                                              
                           ('ComplEx', 'wn18rr'):     md_param('CrossEntropyLoss', 'lcwa'),
                           ('RotatE',  'wn18rr'):     md_param('BCEWithLogitsLoss', 'lcwa'),

                           ('TransE',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                           ('TransH',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                           ('TransD',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                           ('TransR',  'fb15k-237'):  md_param('CrossEntropyLoss', 'lcwa'),                                              
                           ('RESCAL',  'fb15k-237'):  md_param('CrossEntropyLoss', 'lcwa'),                        
                           ('ComplEx', 'fb15k-237'):  md_param('CrossEntropyLoss', 'lcwa'),
                           ('RotatE',  'fb15k-237'):  md_param('NSSALoss', 'lcwa'),
                          }

else:
    model_dataset_param = {('TransE',  'wn18rr'):     md_param('NSSALoss', 'slcwa'),
                           ('TransH',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                           ('TransD',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                           ('TransR',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),                       
                           ('RESCAL',  'wn18rr'):     md_param('NSSALoss', 'slcwa'),                                              
                           ('ComplEx', 'wn18rr'):     md_param('BCEWithLogitsLoss', 'slcwa'),
                           ('RotatE',  'wn18rr'):     md_param('NSSALoss', 'slcwa'),

                           ('TransE',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                           ('TransH',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                           ('TransD',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                           ('TransR',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),                                              
                           ('RESCAL',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),                        
                           ('ComplEx', 'fb15k-237'):  md_param('BCEWithLogitsLoss', 'slcwa'),
                           ('RotatE',  'fb15k-237'):  md_param('NSSALoss', 'slcwa'),
                      }
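
Note that the table is keyed by the base dataset name (`wn18rr`, `fb15k-237`), while the `datasets` dictionary uses names with a `-full`, `-sym` or `-asym` suffix. Below is a small sketch of how a suffixed name can be reduced to its base before the lookup. `base_name` is a hypothetical helper, and the two-entry `params` table is a cut-down stand-in for `model_dataset_param` using its sLCWA values for RESCAL.

```python
def base_name(dataset_name):
    """Drop the trailing '-full', '-sym' or '-asym' suffix."""
    return dataset_name.rsplit("-", 1)[0]


# Cut-down stand-in: (model, base dataset) -> (loss, training loop)
params = {
    ("RESCAL", "wn18rr"): ("NSSALoss", "slcwa"),
    ("RESCAL", "fb15k-237"): ("MarginRankingLoss", "slcwa"),
}

for name in ("wn18rr-full", "fb15k-237-asym"):
    loss, loop = params["RESCAL", base_name(name)]
    print(name, "->", base_name(name), loss, loop)
```

Splitting from the right matters here because `fb15k-237` itself contains a hyphen.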

In [21]:
def run_hpo_pipeline(dataset_name, dataset, model):
  model_name = model[0]
  model_params = model[1]
  # Strip the '-full' / '-sym' / '-asym' suffix to get the base dataset name
  dataset_base = dataset_name.rsplit('-', 1)[0]
  print(f"Dataset: {dataset_name}, Model: {model_name}")
  result = hpo_pipeline(  
    n_trials=N_TRIALS,  
    training=dataset['train'],
    testing=dataset['test'],
    validation=dataset['validate'],
    model=model_name,
    model_kwargs=model_params,
    # set the parameters specific to this combination of dataset and model
    loss=model_dataset_param[model_name, dataset_base].loss,
    training_loop=model_dataset_param[model_name, dataset_base].training_loop,
    training_kwargs=dict(num_epochs=N_EPOCHS,
                         use_tqdm_batch=False),
    training_kwargs_ranges=dict(
                            batch_size=dict(type=int, low=256, high=1024, q=256),                    
                            ),
    device=CPU_DEV,
    metric='MEAN_RECIPROCAL_RANK',
    direction='maximize',
    stopper='early',
    stopper_kwargs=dict(frequency=10, patience=2, relative_delta=0.01),
    model_kwargs_ranges=dict(
                        embedding_dim=dict(type=int, low=32, high=400, q=64)
                        )
    ) 

  return result
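
The two `*_kwargs_ranges` entries describe stepped integer ranges. Assuming `q` is the step size counted from `low` (which is consistent with the `embedding_dim` values 160 and 224 and the `batch_size` values 256 and 768 that appear in the trial output of this notebook), the candidate values can be enumerated as follows. `stepped_values` is a hypothetical helper, not a PyKEEN function.

```python
def stepped_values(low, high, q):
    """Integers low, low + q, low + 2q, ... that do not exceed high."""
    return list(range(low, high + 1, q))


print(stepped_values(256, 1024, 256))  # [256, 512, 768, 1024]
print(stepped_values(32, 400, 64))     # [32, 96, 160, 224, 288, 352]
```

Under this reading, the upper bound of 400 for `embedding_dim` is never sampled because it does not lie on the grid; the largest candidate is 352.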


In [None]:
for dataset_name, dataset in datasets.items():
  print(dataset_name)

  for model in models:
    gc.collect()
    torch.cuda.empty_cache()
    result = run_hpo_pipeline(dataset_name, dataset, model)
    result.save_to_directory(f'hpo_results/{dataset_name}/{model[0]}')
    if (SAVE_TO_DRIVE):
        result.save_to_directory(f'/content/drive/MyDrive/hpo_results/{dataset_name}/{model[0]}')


[I 2021-11-11 11:22:42,765] A new study created in memory with name: no-name-82b372bc-2dff-469d-8081-d9d8ec1105ef
No random seed is specified. Setting to 2839519528.


wn18rr-full
Dataset: wn18rr-full, Model: RESCAL


Training epochs on cuda:   0%|          | 0/100 [00:00<?, ?epoch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=64.
INFO:pykeen.evaluation.evaluator:Evaluation took 16.25s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.evaluation.evaluator:Evaluation took 16.07s seconds
INFO:pykeen.evaluation.evaluator:Evaluation took 16.04s seconds
INFO:pykeen.stoppers.early_stopping:Stopping early after 3 evaluations at epoch 30. The best result hits_at_k=0.00017705382436260624 occurred at epoch 10.
INFO:pykeen.training.training_loop:=> loading checkpoint 'C:\Users\hvags\AppData\Local\Temp\tmpqkd61l04'
INFO:pykeen.training.training_loop:=> loaded checkpoint 'C:\Users\hvags\AppData\Local\Temp\tmpqkd61l04' stopped after having finished epoch 10


Evaluating on cuda:   0%|          | 0.00/2.82k [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 16.11s seconds
[I 2021-11-11 11:27:54,287] Trial 0 finished with value: 0.00019207599000791857 and parameters: {'model.embedding_dim': 160, 'loss.margin': 18, 'loss.adversarial_temperature': 0.8599550830787281, 'regularizer.weight': 0.4999898154211292, 'optimizer.lr': 0.03126477281996058, 'negative_sampler.num_negs_per_pos': 1, 'training.batch_size': 768}. Best is trial 0 with value: 0.00019207599000791857.


Training epochs on cuda:   0%|          | 0/100 [00:00<?, ?epoch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=64.
INFO:pykeen.evaluation.evaluator:Evaluation took 16.04s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.evaluation.evaluator:Evaluation took 16.36s seconds
INFO:pykeen.evaluation.evaluator:Evaluation took 16.13s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 30.
INFO:pykeen.evaluation.evaluator:Evaluation took 16.24s seconds
INFO:pykeen.evaluation.evaluator:Evaluation took 16.25s seconds
INFO:pykeen.stoppers.early_stopping:Stopping early after 5 evaluations at epoch 50. The best result hits_at_k=0.00017705382436260624 occurred at epoch 30.
INFO:pykeen.training.training_loop:=> loading checkpoint 'C:\Users\hvags\AppData\Local\Temp\tmp2_dj4u10'
INFO:pykeen.training.training_loop:=> loaded checkpoint 'C:\Users\hvags\AppData\Loc

Evaluating on cuda:   0%|          | 0.00/2.82k [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 16.29s seconds
[I 2021-11-11 11:38:51,518] Trial 1 finished with value: 0.0001804700743885965 and parameters: {'model.embedding_dim': 160, 'loss.margin': 9, 'loss.adversarial_temperature': 0.9079065523901304, 'regularizer.weight': 0.996935347165849, 'optimizer.lr': 0.02051035090356538, 'negative_sampler.num_negs_per_pos': 2, 'training.batch_size': 768}. Best is trial 0 with value: 0.00019207599000791857.


Training epochs on cuda:   0%|          | 0/100 [00:00<?, ?epoch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=64.
INFO:pykeen.evaluation.evaluator:Evaluation took 25.00s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.evaluation.evaluator:Evaluation took 24.86s seconds
INFO:pykeen.evaluation.evaluator:Evaluation took 25.27s seconds
INFO:pykeen.stoppers.early_stopping:Stopping early after 3 evaluations at epoch 30. The best result hits_at_k=0.07790368271954674 occurred at epoch 10.
INFO:pykeen.training.training_loop:=> loading checkpoint 'C:\Users\hvags\AppData\Local\Temp\tmpn0ft9vm5'
INFO:pykeen.training.training_loop:=> loaded checkpoint 'C:\Users\hvags\AppData\Local\Temp\tmpn0ft9vm5' stopped after having finished epoch 10


Evaluating on cuda:   0%|          | 0.00/2.82k [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 24.95s seconds
[I 2021-11-11 11:52:21,740] Trial 2 finished with value: 0.04634231809420707 and parameters: {'model.embedding_dim': 224, 'loss.margin': 24, 'loss.adversarial_temperature': 0.6911642038233735, 'regularizer.weight': 0.13538893138472385, 'optimizer.lr': 0.06647293940143989, 'negative_sampler.num_negs_per_pos': 2, 'training.batch_size': 256}. Best is trial 2 with value: 0.04634231809420707.
INFO:pykeen.training.training_loop:Starting sub_batch_size search for training now...
INFO:pykeen.training.training_loop:Concluded search with sub_batch_size 512.


Training epochs on cuda:   0%|          | 0/100 [00:00<?, ?epoch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=128.
INFO:pykeen.evaluation.evaluator:Evaluation took 6.98s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.evaluation.evaluator:Evaluation took 7.06s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 20.
[W 2021-11-11 12:08:12,544] Trial 3 failed, because the value None could not be cast to float.
INFO:pykeen.training.training_loop:Starting sub_batch_size search for training now...
INFO:pykeen.training.training_loop:Concluded search with sub_batch_size 384.


Training epochs on cuda:   0%|          | 0/100 [00:00<?, ?epoch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=32.
INFO:pykeen.evaluation.evaluator:Evaluation took 37.79s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.evaluation.evaluator:Evaluation took 37.75s seconds
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 20.
[W 2021-11-11 12:35:58,557] Trial 4 failed, because the value None could not be cast to float.
INFO:pykeen.training.training_loop:Starting sub_batch_size search for training now...
INFO:pykeen.training.training_loop:Concluded search with sub_batch_size 32.


Training epochs on cuda:   0%|          | 0/100 [00:00<?, ?epoch/s]