# ICON demonstration

This notebook is a guided example of using ICON to enrich the Google Product Type Taxonomy.
Before running this notebook, make sure you have read the README.md of the ICON repository.

## Importing relevant packages

A complete list of dependencies is available in the [README](/README.md#dependencies).

In [None]:
import os
from typing import List, Union, Hashable
import torch
import pandas as pd
import numpy as np
from ellement.transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, AutoTokenizer
from utils import taxo_utils
from utils.taxo_utils import Taxonomy
from main.icon import ICON

## Reading data

The taxonomy dataset is loaded as a `utils.taxo_utils.Taxonomy` object. For details of the I/O format, please refer to the corresponding section in the [README](README.md#file-io-format).

In [None]:
taxo = taxo_utils.from_json('./data/raw/google.json')

## Loading the models

ICON requires three sub-models: `emb_model`, `gen_model` and `sub_model`.

**If you don't have these models**: The scripts in `/experiments/data_wrangling/` and the notebooks in `/experiments/model_training/` provide a pipeline for preparing the training data and fine-tuning pre-trained language models.

**Models for eBay**: Models fine-tuned on eBay data with the pipeline described above are available at RNO HDFS: `/user/jingcshi/ICON_models/`.

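Each of `emb_model`, `gen_model` and `sub_model` requires a tokenizer. In this notebook, every wrapper class loads its tokenizer from the same path as its model, so no separate tokenizer setup is needed. Note that the code refers to `emb_model` as `ret_model`.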

Notice that ICON uses its sub-models as callable functions and doesn't care how the models themselves are implemented. Therefore, we need to wrap these models in callable interfaces. This will be demonstrated in a [cell below](#wrapping-the-models-as-callables).
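To illustrate what "callable" means here, the following minimal sketch shows that a plain function and an instance of a class defining `__call__` can be used interchangeably by a caller. The names `join_labels`, `UpperJoiner` and `apply_model` are hypothetical and not part of ICON.

```python
from typing import Callable, List


def join_labels(labels: List[str]) -> str:
    """A plain function is callable."""
    return " / ".join(labels)


class UpperJoiner:
    """An instance of a class that defines __call__ is callable too."""

    def __init__(self, sep: str = " / ") -> None:
        self.sep = sep

    def __call__(self, labels: List[str]) -> str:
        return self.sep.join(labels).upper()


def apply_model(model: Callable[[List[str]], str], labels: List[str]) -> str:
    # The caller only relies on `model(labels)` working, not on its type.
    return model(labels)


function_result = apply_model(join_labels, ["shirts", "tops"])
instance_result = apply_model(UpperJoiner(), ["shirts", "tops"])
```

Note that a class object is itself callable, but calling it constructs a new instance rather than running the model, so it is the instance that should be handed over.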

In [None]:
ret_model_path = 'YOUR_MODEL_PATH'
gen_model_path = 'YOUR_MODEL_PATH'
sub_model_path = 'YOUR_MODEL_PATH'

## Wrapping the models as callables

Here we create a class with a `__call__` method for each sub-model, so that ICON can call its instances directly.

Each model has its expected inputs and outputs:

- `EMB_model`: Takes a single sentence or a list of sentences (strings). Returns a numpy array containing the embedding of each sentence.

- `GEN_model`: Takes a list of strings (the concept labels that the model should summarise). Returns a single string (the label for the union concept).

- `SUB_model`: Takes two lists of strings (the labels for `sub` and `sup` respectively). Returns a 1D array of prediction scores indicating how likely each concept in `sup` is to subsume the corresponding concept in `sub`.
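As an illustration of the `SUB_model` score, the sketch below turns two-class logits into a probability for the positive class with a softmax, one score per (`sub`, `sup`) pair. It assumes a binary classification head whose index 1 means "subsumes"; the helper name `positive_class_probs` and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def positive_class_probs(logits: Sequence[Sequence[float]]) -> List[float]:
    """Softmax over each [negative, positive] logit pair; keep the positive probability."""
    probs = []
    for neg, pos in logits:
        m = max(neg, pos)  # subtract the max for numerical stability
        e_neg, e_pos = math.exp(neg - m), math.exp(pos - m)
        probs.append(e_pos / (e_neg + e_pos))
    return probs


# One logit pair per (sub, sup) pair, in the same order as the inputs.
scores = positive_class_probs([[0.0, 0.0], [-2.0, 3.0], [4.0, -1.0]])
```

Equal logits give a score of 0.5, and the score moves towards 1 as the positive logit grows relative to the negative one.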

In [None]:
class EMB_model:

    def __init__(self, model_path, **kwargs) -> None:

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = AutoModel.from_pretrained(model_path, **kwargs).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)

    def __call__(self, sentence: Union[str, List[str]], batch_size: int=64, max_length: int=64, normalize: bool = True) -> np.ndarray:

        single_sentence = False
        if isinstance(sentence, str):
            sentence = [sentence]
            single_sentence = True
        
        embedding_list = []
        with torch.no_grad():
            total_batch = len(sentence) // batch_size + (1 if len(sentence) % batch_size > 0 else 0)
            for batch_id in range(total_batch):
                inputs = self.tokenizer(
                    sentence[batch_id*batch_size:(batch_id+1)*batch_size], 
                    padding=True, 
                    truncation=True, 
                    max_length=max_length, 
                    return_tensors="pt"
                )
                inputs = {k: v.to(self.device) for k, v in inputs.items()}
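                # Pool the hidden state at the last position. NOTE: with right-padded batches this position
                # can hold a padding token; SimCSE-style encoders usually pool the first ([CLS]) position,
                # i.e. [:, 0]. Adjust the index if your model expects that.
                outputs = self.model(**inputs, return_dict=True).last_hidden_state[:, -1]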
                embedding_list.append(outputs.cpu())
        embeddings = torch.cat(embedding_list, 0)
        if normalize:
            embeddings = embeddings / torch.norm(embeddings, p=2, dim=1, keepdim=True)
        if single_sentence:
            embeddings = embeddings[0]
        return embeddings.numpy()

class GEN_model:

    def __init__(self, model_path, **kwargs) -> None:
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path, **kwargs).to(device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.max_length = self.model.config.max_length

    def __call__(self, labels: List[str], prefix='summarize: ') -> str:
        # Build the prompt: the prefix followed by the labels joined with '[SEP]'
        corpus = prefix + '[SEP]'.join(labels)
        inputs = self.tokenizer(corpus,return_tensors='pt').to(device)['input_ids']
        outputs = self.model.generate(inputs,max_length=self.max_length)[0]
        decoded = self.tokenizer.decode(outputs.cpu().numpy(),skip_special_tokens=True)
        return decoded

class SUB_model:

    def __init__(self, model_path, **kwargs) -> None:
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path, **kwargs).to(device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path,model_max_length=128)

    def __call__(self, sub: Union[str, List[str]], sup: Union[str, List[str]], batch_size :int=256) -> np.ndarray:
        if isinstance(sub, str):
            sub, sup = [sub], [sup]
        if len(sub) <= batch_size:
            inputs = self.tokenizer(sub,sup,padding=True,return_tensors='pt').to(device)
            predictions = torch.softmax(self.model(**inputs).logits.detach().cpu(),1)[:,1].numpy()
        else:
            head = (sub[:batch_size], sup[:batch_size])
            tail = (sub[batch_size:],sup[batch_size:])
            # Recurse through this instance; calling SUB_model(...) here would construct a new model instead
            predictions = np.concatenate((self(head[0], head[1], batch_size=batch_size), self(tail[0], tail[1], batch_size=batch_size)))
        return predictions

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ret_model = EMB_model(ret_model_path)
gen_model = GEN_model(gen_model_path, max_length=64)
sub_model = SUB_model(sub_model_path)
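The batching arithmetic used by the wrappers can also be written as a small iterative helper. This is a sketch of an alternative, not code that ICON ships, and `iter_batches` is a hypothetical name. It yields the same number of slices as the `total_batch` computation in `EMB_model` and avoids the recursion used in `SUB_model`.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


labels = ["a", "b", "c", "d", "e"]
batches = list(iter_batches(labels, batch_size=2))
# Same count as: len(labels) // 2 + (1 if len(labels) % 2 > 0 else 0)
num_batches = len(batches)
```

Each batch could then be tokenized and scored in turn, with the per-batch results concatenated at the end.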

## Configuration

Almost there! Configure your run by specifying the data, models and settings. Check [here](/README.md#configurations) to see how to choose the right settings for your purpose. 

In the following example, we will run auto mode with 10 outer loops. We will also set `logging` to `True` to see a detailed log of ICON's actions and results.

In [None]:
kwargs = {'data': taxo,
        # Pass the model instances created above, not the wrapper classes
        'emb_model': ret_model,
        'gen_model': gen_model,
        'sub_model': sub_model,
        'restrict_combinations': False,
        'retrieve_size': 5,
        'logging': 1}

iconobj = ICON(**kwargs)
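The `retrieve_size` setting suggests a top-k lookup over concept embeddings. How ICON performs retrieval internally is not shown in this notebook, so the following is only an assumed illustration of the general idea: with L2-normalised vectors, cosine similarity reduces to a dot product, and the k highest-scoring candidates are kept. The helpers `normalize` and `top_k` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: Sequence[float], candidates: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k candidates most similar to the query."""
    q = normalize(query)
    # For unit vectors, cosine similarity is just the dot product.
    sims = [sum(a * b for a, b in zip(q, normalize(c))) for c in candidates]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]


toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.05], toy_embeddings, k=2)
```

Under this reading, a larger `retrieve_size` would consider more candidate concepts per query at a higher computational cost.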

## Running

Everything is now ready to run ICON. With the ICON object initialised from our configuration, simply call `run()`.

If you change your mind about the settings before running, you don't have to initialise the object again: calling `update_config` is enough.

The output of a run will be either a new taxonomy (as is the case here) or a list of ICON predictions. To save a taxonomy to a file, use the `to_json` method.

In [None]:
iconobj.update_config(threshold=0.8, logging=True, subgraph_strict=False) # Example of updating configurations
outputs = iconobj.run()