# Final project - Hyperparameter optimization

## Installing packages (if running on colab)

**IMPORTANT**: Run the cell below, restart the runtime (menu Runtime > Restart runtime, or Ctrl + M .), and then run the cell again; otherwise you will get errors. You only need to repeat this if your Google Colab / Kaggle instance is restarted or lost.

In [None]:
! pip install --upgrade scipy
! pip install --upgrade pandas
! pip install ipywidgets
! pip uninstall -y pykeen
! pip install git+https://github.com/pykeen/pykeen.git@v1.5.0
! python -c "import pykeen" || pip install git+https://github.com/pykeen/pykeen.git@v1.5.0
from pkg_resources import require
require('pykeen')

After installing the packages above, you can run the notebook from this cell onwards (Ctrl + F10 when this cell is selected).

## Imports and parameters

In [1]:
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
import pandas as pd

Number of epochs to train for.

In [None]:
N_EPOCHS = 100

Time budget in seconds for each hyperparameter optimization run (one run per combination of dataset and model). The optimizer runs as many trials as possible and stops after finishing the trial that was running when the time ran out.

In [None]:
TIME_PER_TRIAL = 900
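
As a rough illustration of the stopping rule described above, the sketch below starts new trials until a time budget is used up and lets the trial that is in flight finish. `run_until_timeout`, `FakeClock` and `fake_trial` are hypothetical stand-ins, not PyKEEN or Optuna APIs; the clock is injected so that the example is deterministic.

```python
def run_until_timeout(run_trial, timeout, clock):
    """Start new trials while the time budget is not exhausted.

    A trial that has already started is always allowed to finish,
    so the total elapsed time can exceed `timeout`.
    """
    start = clock()
    completed = []
    while clock() - start < timeout:
        completed.append(run_trial(len(completed)))
    return completed


class FakeClock:
    """Hypothetical clock that only advances when a trial 'runs'."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()


def fake_trial(index):
    clock.now += 400.0  # pretend that every trial takes 400 seconds
    return index


finished = run_until_timeout(fake_trial, timeout=900, clock=clock)
print(finished, clock.now)
```

With a budget of 900 seconds and trials of 400 seconds each, the third trial starts at 800 seconds and is allowed to complete, so the run ends after 1200 seconds.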

Device to train on. Despite the variable name, this always selects the GPU.

In [None]:
CPU_DEV = 'gpu'

Regularizer weight bounds for hyperparameter optimization

In [None]:
REGULARIZER_LOW = 0.0001
REGULARIZER_HIGH = 10000
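
Because these bounds span eight orders of magnitude, a uniform draw would almost always land near the upper end. A common choice for such ranges is to sample on a logarithmic scale. The sketch below shows that idea in plain Python; `sample_log_uniform` is a hypothetical helper, and the cells in this notebook do not show how the two constants are actually passed to the optimizer.

```python
import math
import random

REGULARIZER_LOW = 0.0001
REGULARIZER_HIGH = 10000


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform in [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_log_uniform(REGULARIZER_LOW, REGULARIZER_HIGH, rng)
           for _ in range(1000)]

# On a log scale, 1.0 is the midpoint between 1e-4 and 1e4,
# so roughly half of the samples should fall below it.
below_one = sum(s < 1.0 for s in samples)
print(f"{below_one} of {len(samples)} samples are below 1.0")
```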

## Retrieving data

The datasets used were already pre-stratified into training, testing and validation sets. Using the methods described in the data manipulation notebook, they were split into symmetric and asymmetric subsets. The results were stored in a GitHub repository for easy access in later work.

To retrieve the data, we first populate a dictionary with the path of each set, relative to a base URL.

In [2]:
dataset_urls = dict()

# Use this value for BASE_DATA_URL if working with local data
#BASE_DATA_URL = './data'

# Use this value for BASE_DATA_URL if working with data from the github repo
BASE_DATA_URL = 'https://raw.githubusercontent.com/hvags/effective-octo-eureka/main/data'

In [3]:
# wn18rr datasets
dataset_urls['wn18rr-full'] = {
    'train': '/wn18rr/train_wn18rr.txt',
    'validate': '/wn18rr/valid_wn18rr.txt',
    'test': '/wn18rr/test_wn18rr.txt'
    }


In [4]:
dataset_urls['wn18rr-sym'] = {
    'train': '/wn18rr/sym_train_wn18rr.txt',
    'validate': '/wn18rr/sym_valid_wn18rr.txt',
    'test': '/wn18rr/sym_test_wn18rr.txt'
    }


In [5]:
dataset_urls['wn18rr-asym'] = {
    'train': '/wn18rr/asym_train_wn18rr.txt',
    'validate': '/wn18rr/asym_valid_wn18rr.txt',
    'test': '/wn18rr/asym_test_wn18rr.txt'
    }

In [6]:
# fb15k-237 datasets
dataset_urls['fb15k-237-full'] = {
    'train':  '/fb15k-237/train_fb15k-237.txt',
    'validate': '/fb15k-237/valid_fb15k-237.txt',
    'test': '/fb15k-237/test_fb15k-237.txt'
    }


In [7]:
dataset_urls['fb15k-237-sym'] = {
    'train': '/fb15k-237/sym_train_fb15k-237.txt',
    'validate': '/fb15k-237/sym_valid_fb15k-237.txt',
    'test': '/fb15k-237/sym_test_fb15k-237.txt'
    }

In [8]:
dataset_urls['fb15k-237-asym'] = {
    'train': '/fb15k-237/asym_train_fb15k-237.txt',
    'validate': '/fb15k-237/asym_valid_fb15k-237.txt',
    'test': '/fb15k-237/asym_test_fb15k-237.txt'
    }
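
The six dictionaries above all follow one naming pattern, `/<dataset>/<prefix><split>_<dataset>.txt`. As a sketch, the same structure could be generated programmatically and the full URLs printed for checking before anything is downloaded. `build_paths`, `SPLIT_FILE_WORDS` and `generated_urls` are hypothetical names introduced here; the pattern is inferred from the entries above.

```python
BASE_DATA_URL = 'https://raw.githubusercontent.com/hvags/effective-octo-eureka/main/data'

# Dictionary key used in the notebook -> word used in the file name
SPLIT_FILE_WORDS = {'train': 'train', 'validate': 'valid', 'test': 'test'}


def build_paths(dataset, prefix=''):
    """Return the relative paths of the three splits of one subset."""
    return {split: f'/{dataset}/{prefix}{word}_{dataset}.txt'
            for split, word in SPLIT_FILE_WORDS.items()}


generated_urls = {}
for dataset in ('wn18rr', 'fb15k-237'):
    for subset, prefix in (('full', ''), ('sym', 'sym_'), ('asym', 'asym_')):
        generated_urls[f'{dataset}-{subset}'] = build_paths(dataset, prefix)

# Print every full URL so that a wrong path is easy to spot
for key, paths in generated_urls.items():
    for split, path in paths.items():
        print(key, split, BASE_DATA_URL + path)
```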

We then read the data from the above URLs into a dataset dictionary.

In [9]:
datasets = dict()

for key in dataset_urls.keys():
    print(f'Processing: {key}')
    
    datasets[key] = dict()
    
    df_train = pd.read_csv(BASE_DATA_URL + dataset_urls[key]['train'], header=None, sep='\t', names=['head', 'relation','tail'])
    df_validate = pd.read_csv(BASE_DATA_URL + dataset_urls[key]['validate'], header=None, sep='\t', names=['head', 'relation','tail'])
    df_test = pd.read_csv(BASE_DATA_URL + dataset_urls[key]['test'], header=None, sep='\t', names=['head', 'relation','tail'])
    
    datasets[key]['train'] = TriplesFactory.from_labeled_triples(df_train.astype('str').to_numpy())
    entity_mapping = datasets[key]['train'].entity_to_id
    relation_mapping = datasets[key]['train'].relation_to_id
    
    datasets[key]['validate'] = TriplesFactory.from_labeled_triples(df_validate.astype('str').to_numpy(),
                                                                    entity_to_id=entity_mapping,
                                                                    relation_to_id=relation_mapping
                                                                    )
    
    datasets[key]['test'] = TriplesFactory.from_labeled_triples(df_test.astype('str').to_numpy(),
                                                                    entity_to_id=entity_mapping,
                                                                    relation_to_id=relation_mapping
                                                                    )
    print('\n')

Processing: wn18rr-full


You're trying to map triples with 211 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 210 from 3034 triples were filtered out
You're trying to map triples with 212 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 210 from 3134 triples were filtered out




Processing: wn18rr-sym


You're trying to map triples with 60 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 48 from 1165 triples were filtered out
You're trying to map triples with 35 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 31 from 1172 triples were filtered out




Processing: wn18rr-asym


You're trying to map triples with 527 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 518 from 1869 triples were filtered out
You're trying to map triples with 555 entities and 0 relations that are not in the training set. These triples will be excluded from the mapping.
In total 546 from 1962 triples were filtered out




Processing: fb15k-237-full


HTTPError: HTTP Error 404: Not Found
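
The "filtered out" messages above arise because the validation and test factories reuse the entity and relation mappings of the training set, so a triple that mentions an unseen entity or relation cannot be mapped. The sketch below reproduces that check on a toy example, with plain Python tuples standing in for the triples; `count_unmappable` is a hypothetical helper and not part of PyKEEN.

```python
def count_unmappable(train_triples, other_triples):
    """Count triples that use an entity or relation unseen in training."""
    known_entities = ({h for h, _, _ in train_triples}
                      | {t for _, _, t in train_triples})
    known_relations = {r for _, r, _ in train_triples}
    return sum(
        1 for h, r, t in other_triples
        if h not in known_entities
        or t not in known_entities
        or r not in known_relations
    )


train = [('a', 'likes', 'b'), ('b', 'likes', 'c')]
valid = [('a', 'likes', 'c'),   # all parts known from training
         ('a', 'likes', 'd'),   # tail 'd' never seen in training
         ('e', 'likes', 'b')]   # head 'e' never seen in training

print(count_unmappable(train, valid))  # prints 2
```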

## Models

#### Defining models and their parameters

In [None]:
models = [
          ('TransE', dict(scoring_fct_norm=2)),
#          ('TransH', dict()),
#          ('TransD', dict()),
#          ('TransR', dict()),
#          ('RESCAL', dict()),
#          ('ComplEx', dict()),
#          ('RotatE', dict())
         ]



Define parameters that are specific to a combination of dataset and model.

In [None]:
class md_param:
    def __init__(self, loss, training_loop):
        self.loss = loss
        self.training_loop = training_loop        


model_dataset_param = {('TransE',  'wn18rr'):     md_param('BCEWithLogitsLoss', 'lcwa'),
                       ('TransH',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                       ('TransD',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                       ('TransR',  'wn18rr'):     md_param('MarginRankingLoss', 'slcwa'),
                       ('RESCAL',  'wn18rr'):     md_param('CrossEntropyLoss', 'lcwa'),
                       ('ComplEx', 'wn18rr'):     md_param('CrossEntropyLoss', 'lcwa'),
                       ('RotatE',  'wn18rr'):     md_param('BCEWithLogitsLoss', 'lcwa'),

                       ('TransE',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                       ('TransH',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                       ('TransD',  'fb15k-237'):  md_param('MarginRankingLoss', 'slcwa'),
                       ('TransR',  'fb15k-237'):  md_param('CrossEntropyLoss', 'lcwa'),
                       ('RESCAL',  'fb15k-237'):  md_param('CrossEntropyLoss', 'lcwa'),                       
                       ('ComplEx', 'fb15k-237'):  md_param('CrossEntropyLoss', 'lcwa'),
                       ('RotatE',  'fb15k-237'):  md_param('NSSALoss', 'lcwa'),
                      }
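
The table is keyed by model name and by the base dataset name, that is, without the `-full`, `-sym` or `-asym` suffix. A minimal sketch of the lookup, using a `NamedTuple` as a hypothetical stand-in for `md_param` and only two of the entries above:

```python
from typing import NamedTuple


class MdParam(NamedTuple):
    """Hypothetical stand-in for the md_param class."""
    loss: str
    training_loop: str


params = {
    ('TransE', 'wn18rr'): MdParam('BCEWithLogitsLoss', 'lcwa'),
    ('TransE', 'fb15k-237'): MdParam('MarginRankingLoss', 'slcwa'),
}


def lookup(model_name, dataset_name):
    # Drop the last '-' separated part: 'fb15k-237-asym' -> 'fb15k-237'
    dataset_base = dataset_name.rsplit('-', 1)[0]
    return params[model_name, dataset_base]


print(lookup('TransE', 'fb15k-237-asym'))
print(lookup('TransE', 'wn18rr-full'))
```

Splitting from the right matters here because `fb15k-237` itself contains a hyphen.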

In [None]:
from pykeen.hpo import hpo_pipeline
    
def run_hpo_pipeline(dataset_name, dataset, model):
  model_name = model[0]
  model_params = model[1]
  # Strip the '-full' / '-sym' / '-asym' suffix to get the base dataset name
  dataset_base = dataset_name.rsplit('-', 1)[0]
  print(f"Dataset: {dataset_name}, Model: {model_name}")
  result = hpo_pipeline(  
    timeout=TIME_PER_TRIAL,
    training=dataset['train'],
    testing=dataset['test'],
    validation=dataset['validate'],
    model=model_name,
    model_kwargs=model_params,
    # set the parameters specific to this combination of dataset and model
    loss=model_dataset_param[model_name, dataset_base].loss,
    training_loop=model_dataset_param[model_name, dataset_base].training_loop,
    training_kwargs=dict(num_epochs=N_EPOCHS,
                         use_tqdm_batch=False),
    training_kwargs_ranges=dict(
                            batch_size=dict(type=int, low=256, high=1024, q=256),                    
                            ),
    device=CPU_DEV,
    metric='MEAN_RECIPROCAL_RANK',
    direction='maximize',
    stopper='early',
    stopper_kwargs=dict(frequency=10, patience=2, relative_delta=0.01),
    model_kwargs_ranges=dict(
                        embedding_dim=dict(type=int, low=32, high=400, q=64)
                        )
    ) 

  return result
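
The two `*_kwargs_ranges` arguments above define stepped integer search spaces. Assuming that `q` acts as a step size counted from `low` (an assumption about how the range is interpreted, not something stated in this notebook), the candidate values can be enumerated as follows. `candidate_values` is a hypothetical helper.

```python
def candidate_values(low, high, q):
    """List the integers low, low + q, low + 2q, ... that do not exceed high."""
    return list(range(low, high + 1, q))


batch_sizes = candidate_values(256, 1024, 256)
embedding_dims = candidate_values(32, 400, 64)

print('batch_size:', batch_sizes)
print('embedding_dim:', embedding_dims)
```

Under this assumption the upper bound of 400 for `embedding_dim` is never reached, because 400 is not `low` plus a whole multiple of `q`; the largest candidate is 352.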


In [None]:
import torch
import gc
gc.collect()


In [None]:
torch.cuda.empty_cache()

In [None]:
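# Keep the study of every run so the results can be tabulated afterwards
studies = dict()

for dataset_name, dataset in datasets.items():
  print(dataset_name)

  for model in models:
    # Free Python and GPU memory left over from the previous run
    gc.collect()
    torch.cuda.empty_cache()
    result = run_hpo_pipeline(dataset_name, dataset, model)
    result.save_to_directory(f'{dataset_name}/{model[0]}')
    studies[(dataset_name, model[0])] = result.study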


In [None]:
result.save_to_directory(f'{dataset_name}/{model[0]}')

In [None]:
def result_to_table(studies):
  # (key in the best trial's user attributes, label shown in the table)
  metrics = [('both.realistic.arithmetic_mean_rank', 'MR'),
            ('both.realistic.adjusted_arithmetic_mean_rank_index', 'AMRI'),
            ('both.realistic.inverse_geometric_mean_rank', 'IGMR'),
            ('both.realistic.hits_at_1', 'Hits@1'),
            ('both.realistic.hits_at_3', 'Hits@3'),
            ('both.realistic.hits_at_5', 'Hits@5')]

  model_labels = [x[0] for x in models]

  for dataset in datasets.keys():
    # Collect the rows first; DataFrame.append was removed in pandas 2.0
    rows = []

    for metric in metrics:
      row = [metric[1]]
      for model_label in model_labels:
        result = studies[(dataset, model_label)].best_trial.user_attrs[metric[0]]
        row += [round(result, 5)]
      rows.append(row)

    table = pd.DataFrame(rows, columns=['metric'] + model_labels)

    print(f'\n\nScores for the {dataset} dataset.\n')
    print(table.to_string(index=False))

def print_best_params(model_name, study):        
    print(f'\nbest hyperparameters found for {model_name}')                        
    print(pd.DataFrame.from_dict(study.best_params, orient='index').to_string())