# Location of Model Artifacts

### Google Cloud Storage

- **model for inference** (965 MB): `https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_22zkdqlr.pkl`
- **encoder (for fine-tuning with a classifier)** (965 MB): `https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_encoder_22zkdqlr.pth`
- **fastai.databunch** (27.1 GB): `https://storage.googleapis.com/issue_label_bot/model/lang_model/data_save.pkl`
- **checkpointed model** (2.29 GB): `https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/best_22zkdqlr.pth`
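
One way to fetch these files is sketched below. It uses only the Python standard library. The short names (`inference`, `encoder`, and so on) and the helper functions are hypothetical conveniences for this example, not part of the project; only the URLs come from the list above.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://storage.googleapis.com/issue_label_bot/model/lang_model"

# Keys are hypothetical short names; the relative paths come from the list above.
ARTIFACTS = {
    "inference": "models_22zkdqlr/trained_model_22zkdqlr.pkl",
    "encoder": "models_22zkdqlr/trained_model_encoder_22zkdqlr.pth",
    "databunch": "data_save.pkl",
    "checkpoint": "models_22zkdqlr/best_22zkdqlr.pth",
}


def artifact_url(name: str) -> str:
    """Return the full download URL for a named artifact."""
    return f"{BASE_URL}/{ARTIFACTS[name]}"


def download_artifact(name: str, dest_dir: str = ".") -> Path:
    """Download an artifact into dest_dir unless the file is already there."""
    dest = Path(dest_dir) / Path(ARTIFACTS[name]).name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(artifact_url(name), dest)
    return dest
```

Calling `download_artifact("inference", "some/local/dir")` would save `trained_model_22zkdqlr.pkl` into that directory. Given the file sizes, checking for an existing copy before downloading avoids repeating a large transfer.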

# Load Minimal Model For Inference

In [38]:
from inference import InferenceWrapper, pass_through
from IPython.display import display, Markdown
import pandas as pd
from torch.nn.utils.rnn import pad_sequence
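from torch import Tensor  # torch.cat is not imported: `cat` below is numpy's concatenate
from torch.cuda import empty_cache
from typing import List
from tqdm import tqdm
import numpy as np  # needed for np.allclose in the test at the end
from numpy import concatenate as cat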

#### Create an `InferenceWrapper` object

In [2]:
wrapper = InferenceWrapper(model_path='/ds/Issue-Embeddings/notebooks',
                           model_file_name='trained_model_22zkdqlr.pkl')

#### Download a test dataset

The test dataset consists of 2,000 GitHub issues in the following format:

In [11]:
testdf = pd.read_csv('https://storage.googleapis.com/issue_label_bot/language_model_data/000000000000.csv.gz').head(2000)

testdf.head(3)



| Unnamed: 0 | url | repo | title | title_length | body | body_length |
|---|---|---|---|---|---|---|
| 0 | https://github.com/egingric/2016-Racing-Game/i... | egingric/2016-Racing-Game | Got stuck near shortcut | 25 | After being blown up by the barrel, I got stuc... | 314 |
| 1 | https://github.com/Microsoft/nodejstools/issue... | Microsoft/nodejstools | Guidance for unit test execution - How to prop... | 95 | What is the appropriate way to set NODE_ENV fo... | 507 |
| 2 | https://github.com/raphapari/dummy/issues/3 | raphapari/dummy | Génération du catalogue | 25 | ## User story xxxlnbrk - En tant que : \*\*gest... | 480 |


# Perform Batch Inference

Why batch inference? When you need document embeddings for a large number of issues, batch inference on a GPU should be significantly faster than inference on a CPU.
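
As a rough illustration of what batching means here, the sketch below splits a list of documents into fixed-size chunks. It is a minimal example and not the implementation used by `wrapper.df_to_emb`; the helper `iter_batches` and the placeholder documents are hypothetical, and the batch size of 150 mirrors the default shown in the method signature.

```python
from math import ceil
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence, bs: int = 150) -> Iterator[List]:
    """Yield consecutive chunks of at most `bs` items."""
    for start in range(0, len(items), bs):
        yield list(items[start:start + bs])


# Placeholder documents standing in for 2,000 issues.
docs = [f"issue {i}" for i in range(2000)]
batches = list(iter_batches(docs, bs=150))

# 2,000 documents at 150 per batch: 13 full batches plus one of 50.
print(len(batches), len(batches[-1]))  # prints: 14 50
assert len(batches) == ceil(len(docs) / 150)
```

Each batch is then sent through the model in a single forward pass, which is where a GPU can amortize its overhead across many documents.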

#### Generate Embeddings From Pre-Trained Language Model

See help for `wrapper.df_to_emb`:

In [29]:
help(wrapper.df_to_emb)

Help on method df_to_emb in module inference:

df_to_emb(dataframe:pandas.core.frame.DataFrame, bs=150) -> numpy.ndarray method of inference.InferenceWrapper instance
    Retrieve document embeddings for a dataframe with the columns `title` and `body`.
    Uses batching for effiecient computation, which is useful when you have many documents
    to retrieve embeddings for. 
    
    Paramaters
    ----------
    dataframe: pandas.DataFrame
        Dataframe with columns `title` and `body`, which reprsent the Title and Body of a
        GitHub Issue. 
    bs: int
        batch size for doing inference.  Set this variable according to your available GPU memory.
        The default is set to 200, which was stable on a Nvida-Tesla V-100.
    
    Returns
    -------
    numpy.ndarray
        An array with of shape (number of dataframe rows, 2400)
        This numpy array represents the latent features of the GitHub issues.
    
    Example
    -------
    >>> import pandas as pd
    >>> wr

In [30]:
embeddings = wrapper.df_to_emb(testdf)

#### Benchmarking batch vs. one at a time

Time inference one issue at a time over the same 2,000 issues. Batch inference gives over a 2x speedup compared with this loop.

In [39]:
%%time
# prepare data
test_data = [wrapper.process_dict(x)['text'] for x in testdf.to_dict(orient='records')]

emb_single = []
for d in tqdm(test_data):
    emb_single.append(wrapper.get_pooled_features(d).detach().cpu().numpy())
    
emb_single_combined = cat(emb_single)

100%|██████████| 2000/2000 [03:45<00:00,  5.78it/s]

CPU times: user 3min 13s, sys: 44 s, total: 3min 57s
Wall time: 3min 56s





# Conclusion:

Naively batching examples does not speed things up much, because the gain from batching is largely offset by the extra computation spent on padding. To get a larger speedup we must use [pack_padded_sequence](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence), which by default expects the sequences to be sorted by length in descending order. Any corresponding labels must be reordered the same way.

We leave further optimization of batching as a future exercise. In the meantime, feel free to use either method.
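
The padding cost and the bookkeeping that sorting requires can be sketched in plain Python. This is an illustration under assumed toy data, not the project's code: the sequence lengths, labels, and helper names below are all hypothetical. It measures how much of each padded batch is padding, sorts by length in descending order while keeping labels aligned, and restores the original order afterwards.

```python
from typing import List, Sequence, Tuple


def padding_fraction(lengths: Sequence[int], bs: int) -> float:
    """Fraction of positions that are padding when every batch is padded
    to the length of its longest sequence."""
    padded_total = 0
    for start in range(0, len(lengths), bs):
        batch = lengths[start:start + bs]
        padded_total += max(batch) * len(batch)
    return 1 - sum(lengths) / padded_total


def sort_by_length(lengths: Sequence[int], labels: Sequence) -> Tuple[List[int], List, List[int]]:
    """Sort by length (descending) and apply the same permutation to labels."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [lengths[i] for i in order], [labels[i] for i in order], order


def restore_order(sorted_items: Sequence, order: Sequence[int]) -> List:
    """Undo the permutation so results line up with the original rows."""
    restored = [None] * len(order)
    for position, original_index in enumerate(order):
        restored[original_index] = sorted_items[position]
    return restored


# Toy data: a mix of short and long sequences (hypothetical token counts).
lengths = [5, 120, 8, 110, 6, 115]
labels = ["a", "b", "c", "d", "e", "f"]

sorted_lengths, sorted_labels, order = sort_by_length(lengths, labels)

naive = padding_fraction(lengths, bs=3)
bucketed = padding_fraction(sorted_lengths, bs=3)
print(f"padding without sorting: {naive:.0%}, with sorting: {bucketed:.0%}")
assert restore_order(sorted_labels, order) == labels
```

Grouping sequences of similar length reduces wasted computation even before packing is applied. The `restore_order` step matters for embeddings, since each output row has to map back to the dataframe row it came from.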



### Test

This section tests that the embeddings retrieved with the one-at-a-time approach are sufficiently close to the embeddings from the batching approach.

In [44]:
np.allclose(emb_single_combined, embeddings, atol=1e-6)

True
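
To make the tolerance concrete, here is a small sketch on synthetic arrays (random numbers, not real embeddings). It shows that `np.allclose` with `atol=1e-6` accepts differences at the level of floating-point noise and rejects a difference of `1e-3`. The array shape `(4, 2400)` only echoes the embedding width from the help text; the data itself is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 2400))

# Simulate tiny numerical noise and a genuinely different result.
tiny_noise = reference + 5e-7
large_noise = reference + 1e-3

# allclose checks |a - b| <= atol + rtol * |b| element-wise (rtol defaults to 1e-5).
close_with_tiny_noise = np.allclose(tiny_noise, reference, atol=1e-6)
close_with_large_noise = np.allclose(large_noise, reference, atol=1e-6)
max_abs_diff = np.abs(large_noise - reference).max()

print(close_with_tiny_noise, close_with_large_noise)  # prints: True False
```

When a check like this fails, reporting the maximum absolute difference, as `max_abs_diff` does here, is often more informative than the boolean alone.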