# ICON demonstration

This notebook is a guided example of using ICON to enrich the Google Product Type Taxonomy.
Before running this notebook, make sure that you have read the README.md of the ICON repository.

## Preparation

**Replace SimCSE script**: For the purpose of this demonstration, please temporarily replace the `tool.py` in your SimCSE directory with `/utils/replace_simcse/tool.py`. The reasons are explained [here](/README.md#replace-simcse-script).

In [None]:
! pip show simcse | grep -P "Location: .*$" # Locate your SimCSE package. 
# Copy the directory given by the above command's outputs, which will look like:
    # Location: SIMCSE_DIR
# Now uncomment the following line and replace SIMCSE_DIR with what you have copied
# ! cp utils/replace_simcse/tool.py SIMCSE_DIR/simcse/tool.py
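
If you prefer to stay in Python rather than parse the `pip show` output, the package directory can also be looked up with the standard library. The snippet below is a minimal sketch: `package_dir` is a hypothetical helper name, not part of ICON or SimCSE, and `json` is used only because it is always installed. For this notebook the file to replace would be `package_dir("simcse") / "tool.py"`.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the directory of an installed top-level package, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points at its __init__.py file.
    return Path(spec.origin).parent


# "json" stands in for "simcse" so that the example runs anywhere.
print(package_dir("json"))
```

Looking the path up this way does not import the package, so it works even before the replacement script is in place.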

## Importing relevant packages

A complete list of dependencies is available in the [README](/README.md#dependencies).

In [None]:
from typing import List, Union, Hashable
import torch
import pandas as pd
import numpy as np
from simcse import SimCSE
from transformers import BertForSequenceClassification, AutoModelForSeq2SeqLM, BertTokenizer, AutoTokenizer
from utils import taxo_utils
from utils.taxo_utils import Taxonomy
from main.icon import ICON

## Loading the models

ICON requires three sub-models: `ret_model`, `gen_model` and `sub_model`.

**If you don't have these models**: The notebooks in `/data_wrangling/` and `/model_training/` provide a pipeline for preparing the training data and fine-tuning pre-trained language models.

**Models for eBay**: Models fine-tuned on eBay data with the pipeline described above are available at RNO: `/user/jingcshi/ICON_models/`.

Our choices of `ret_model`, `gen_model` and `sub_model` each require a tokenizer. The tokenizer for `ret_model` is loaded automatically by the `SimCSE()` constructor.

Notice that ICON uses its sub-models as callable functions and doesn't care how the models themselves are implemented. Therefore, we need to wrap these models in callable interfaces. This will be demonstrated in a [cell below](#wrapping-the-models-as-callables).

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ret_model = SimCSE('/your/path/to/ret_model',device=device)
gen_model = AutoModelForSeq2SeqLM.from_pretrained('/your/path/to/gen_model').to(device)
gen_tokenizer = AutoTokenizer.from_pretrained('/your/path/to/gen_model')
sub_model = BertForSequenceClassification.from_pretrained('/your/path/to/sub_model').to(device)
sub_tokenizer = BertTokenizer.from_pretrained('/your/path/to/sub_model',model_max_length=128)

## Reading and preprocessing data

The taxonomy dataset will be loaded as a `utils.taxo_utils.Taxonomy` object.

We can get a tabular view of the dataset by converting its concepts into a pandas DataFrame. This DataFrame will also be used to track the index of each concept when the concepts are converted to a flat list (which will happen when SimCSE returns its results later). 

In [None]:
taxo = taxo_utils.from_json('./data/raw/google.json')
df = pd.DataFrame(taxo.nodes(data='label'),columns=['ID','Label']).drop(0).reset_index(drop=True)
# Map each concept's position in the flat label list to its ID in the taxonomy.
idx_dict = dict(enumerate(df['ID']))
df
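
To see why the index-to-ID dictionary is needed, here is a small dependency-free sketch of the same bookkeeping. The IDs, labels and hit positions are made up for illustration and do not come from the Google taxonomy file.

```python
# Toy (ID, label) pairs standing in for taxo.nodes(data='label').
nodes = [(0, "Root"), (10, "Dog Food"), (20, "Cat Food"), (30, "Office Chairs")]

# Skip the first entry (mirroring .drop(0) in the cell above) and renumber from zero.
concepts = nodes[1:]
idx_to_id = {i: node_id for i, (node_id, _) in enumerate(concepts)}
labels = [label for _, label in concepts]

# A retriever that works on the flat label list reports positions, not taxonomy IDs.
hit_positions = [2, 0]
hit_ids = [idx_to_id[i] for i in hit_positions]
print(hit_ids)  # [30, 10]
```

The retriever only ever sees `labels`, so the dictionary is the one place where positions are translated back into taxonomy IDs.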

## Wrapping the models as callables

Here we create a function for each sub-model so that ICON can directly call them.

Each function has its expected inputs and outputs:

- `RET_model`: Takes in a taxonomy, a query string (the text for which we want to find the most similar concepts), and an integer `k`, the number of concepts to retrieve. Returns a list of concept IDs in the taxonomy.

- `GEN_model`: Takes in a list of strings (concept labels which the model should summarise). Returns a single string (label for the union concept).

- `SUB_model`: Takes in two lists of strings (the labels for `sub` and `sup` respectively). Returns a 1D array of prediction scores, one per pair, estimating how likely each concept in `sup` is to subsume the corresponding concept in `sub`.
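
Because only these input and output shapes matter, the three contracts can be illustrated without any trained model. The stand-ins below are hypothetical and purely heuristic (word overlap instead of SimCSE, shared words instead of a seq2seq model, word containment instead of a BERT classifier); `TOY_LABELS`, `ret_stub`, `gen_stub` and `sub_stub` are names invented for this sketch. Note that `sub_stub` returns a plain list of floats to stay dependency-free, whereas the description above asks for a 1D array.

```python
from typing import Any, Dict, Hashable, List

# Hypothetical toy data (concept ID -> label); not taken from the Google taxonomy file.
TOY_LABELS: Dict[Hashable, str] = {10: "Dog Food", 20: "Cat Food", 30: "Office Chairs"}


def _words(text: str) -> set:
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def ret_stub(taxo: Any, query: str, k: int = 10) -> List[Hashable]:
    """Return the IDs of the k toy labels with the highest word overlap (Jaccard) with the query."""
    def overlap(item):
        label_words, query_words = _words(item[1]), _words(query)
        union = label_words | query_words
        return len(label_words & query_words) / len(union) if union else 0.0
    ranked = sorted(TOY_LABELS.items(), key=overlap, reverse=True)
    return [concept_id for concept_id, _ in ranked[:k]]


def gen_stub(labels: List[str]) -> str:
    """Return the words shared by every label, in the order they appear in the first label."""
    if not labels:
        return ""
    shared = set.intersection(*(_words(label) for label in labels))
    return " ".join(w for w in labels[0].split() if w.lower() in shared)


def sub_stub(sub: List[str], sup: List[str]) -> List[float]:
    """Score 1.0 where every word of the sup label also occurs in the sub label, else 0.0."""
    return [1.0 if _words(parent) <= _words(child) else 0.0 for child, parent in zip(sub, sup)]


print(ret_stub(None, "Dog Treats", k=2))
print(gen_stub(["Dog Food", "Cat Food"]))
print(sub_stub(["Dry Dog Food", "Dog Food"], ["Dog Food", "Office Chairs"]))
```

Stand-ins like these are only meant for checking that the plumbing works; their outputs say nothing about the quality of a real run.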

In [None]:
ret_model.build_index(list(df['Label']))
def RET_model(taxo: Taxonomy, query: str, k: int=10) -> List[Hashable]:
    topk = ret_model.search(query, top_k=k)
    return [idx_dict[i] for i,_,_ in topk]

def GEN_model(labels: List[str], prefix='summarize: ') -> str:
    # Join the labels with '; ' after the task prefix, e.g. 'summarize: A; B; C'.
    corpus = prefix + '; '.join(labels)
    inputs = gen_tokenizer(corpus,return_tensors='pt').to(device)['input_ids']
    outputs = gen_model.generate(inputs,max_length=64)[0]
    decoded = gen_tokenizer.decode(outputs.cpu().numpy(),skip_special_tokens=True)
    return decoded

def SUB_model(sub: Union[str, List[str]], sup: Union[str, List[str]], batch_size: int = 256) -> np.ndarray:
    if isinstance(sub, str):
        sub, sup = [sub], [sup]
    if len(sub) <= batch_size:
        inputs = sub_tokenizer(sub,sup,padding=True,return_tensors='pt').to(device)
        predictions = torch.softmax(sub_model(**inputs).logits.detach().cpu(),1)[:,1].numpy()
    else:
        head = (sub[:batch_size], sup[:batch_size])
        tail = (sub[batch_size:],sup[batch_size:])
        predictions = np.concatenate((SUB_model(head[0], head[1], batch_size=batch_size), SUB_model(tail[0], tail[1], batch_size=batch_size)))
    return predictions
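
`SUB_model` above splits long inputs recursively, which adds one level of recursion per batch. The following is a minimal, torch-free sketch of the same idea written as a loop, together with the softmax step that turns a pair of class logits into the probability of class 1. `positive_prob`, `score_in_batches` and `toy_logits` are hypothetical names, and the toy scorer is only a placeholder for the real classifier.

```python
import math
from typing import Callable, List, Sequence

Logits = Sequence[float]


def positive_prob(logits: Logits) -> float:
    """Softmax probability of class 1 for a (class 0, class 1) logit pair."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def score_in_batches(logit_fn: Callable[[List[str], List[str]], List[Logits]],
                     sub: List[str], sup: List[str], batch_size: int = 256) -> List[float]:
    """Call logit_fn on consecutive slices of at most batch_size pairs and collect the scores."""
    if len(sub) != len(sup):
        raise ValueError("sub and sup must have the same length")
    scores: List[float] = []
    for start in range(0, len(sub), batch_size):
        stop = start + batch_size
        batch_logits = logit_fn(sub[start:stop], sup[start:stop])
        scores.extend(positive_prob(pair) for pair in batch_logits)
    return scores


batch_sizes: List[int] = []


def toy_logits(sub_batch: List[str], sup_batch: List[str]) -> List[Logits]:
    """Placeholder scorer: favours class 1 when the sup label is a substring of the sub label."""
    batch_sizes.append(len(sub_batch))
    return [(0.0, 2.0) if parent.lower() in child.lower() else (2.0, 0.0)
            for child, parent in zip(sub_batch, sup_batch)]


subs = ["dog food", "cat food", "office chair", "dry dog food", "desk"]
sups = ["food", "food", "food", "dog food", "chair"]
scores = score_in_batches(toy_logits, subs, sups, batch_size=2)
print(batch_sizes)  # [2, 2, 1]
```

With a loop, the number of batches is no longer limited by Python's recursion depth, which may matter when scoring very large candidate lists.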

## Configuration

Almost there! Configure your run by specifying the data, models and settings. Check [here](/README.md#configurations) to see how to choose the right settings for your purpose. 

In the following example, we will run auto mode with 10 outer loops. We will also set `logging` to `True` to see a detailed log of ICON's actions and results.

In [None]:
kwargs = {'data': taxo,
        'ret_model': RET_model,
        'gen_model': GEN_model,
        'sub_model': SUB_model,
        'max_outer_loop': 10,
        'restrict_combinations': False,
        'retrieve_size': 5,
        'logging': True}

## Running

We have prepared everything to run ICON. Simply initialise an ICON object with our configuration and call `run()`. 

If you change your mind about the settings before running, you don't have to initialise again: calling `update_config` is enough.

The output of a run will be either a new taxonomy (as is the case here) or a list of ICON predictions. To save a taxonomy to a file, use the `to_json` method.

In [None]:
iconobj = ICON(**kwargs)
iconobj.update_config(threshold=0.9) # Example of updating configurations
outputs = iconobj.run()