In [1]:
from datasets import load_dataset

stoic = load_dataset('jamescalam/stoic-corpus', split='train')
stoic

Using custom data configuration jamescalam--stoic-corpus-0a7f055fafc72e16
Reusing dataset json (C:\Users\James\.cache\huggingface\datasets\json\jamescalam--stoic-corpus-0a7f055fafc72e16\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Dataset({
    features: ['id', 'text', 'source'],
    num_rows: 1647
})

In [2]:
stoic[0]

{'id': '0',
 'text': 'From my grandfather Verus I learned good morals and the government of my temper.',
 'source': 'meditations'}

## Query Generation


We use a query generation model to generate *(query, doc)* pairs, where a doc is a single paragraph from the samples we produced above. For this, we use a T5 query generation model.

In [16]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/msmarco-t5-base-v1'
# also tried 'doc2query/all-t5-base-v1' but queries were super weird

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).cuda()

Downloading: 100%|██████████| 773k/773k [00:00<00:00, 2.01MB/s]
Downloading: 100%|██████████| 1.74k/1.74k [00:00<00:00, 3.60MB/s]
Downloading: 100%|██████████| 2.07k/2.07k [00:00<00:00, 4.28MB/s]
Downloading: 100%|██████████| 667/667 [00:00<00:00, 672kB/s]
Downloading: 100%|██████████| 945M/945M [01:10<00:00, 14.0MB/s]


In [17]:
for doc in stoic:
    if len(doc) > 100: break

# tokenize the passage
inputs = tokenizer(doc['text'], return_tensors='pt')
# generate three queries
outputs = model.generate(
    input_ids=inputs['input_ids'].cuda(),
    attention_mask=inputs['attention_mask'].cuda(),
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3
)

In [18]:
print("Paragraph:")
print(doc['text'])

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')

Paragraph:
And what is this Good? It is a clear and flawless mind, which rivals that of God, raised far above mortal concerns, and counting nothing of its own to be outside itself. You are a reasoning animal. What Good, then, lies within you? Perfect reason. Are you willing to develop this to its farthest limits – to its greatest degree of increase? Only consider yourself happy when all your joys are born of reason, and when – having marked all the objects which men clutch at, or pray for, or watch over – you find nothing which you will desire; mind, I do not say prefer. Here is a short rule by which to measure yourself, and by the test of which you may feel that you have reached perfection: "You will come to your own when you shall understand that those whom the world calls fortunate are really the most unfortunate of all." Farewell.

Generated Queries:
1: what good is within yourself?
2: what are things that lie within you
3: what is the good that lies within you


We're getting some interesting queries, my favorite being *"An eloquent poem to tell me what good is in us"*. Granted, this is not a poem, but it is eloquent literature, and speaks of the *good in us*.

In [19]:
from tqdm.auto import tqdm

batch_size = 64
num_queries = 3  # number of queries to generate for each passage
pairs = []
doc_batch = []

for doc in tqdm(stoic['text']):
    # remove tab + newline characters if present
    doc_batch.append(doc.replace('\t', ' ').replace('\n', ' '))
    
    # we encode in batches
    if len(doc_batch) == batch_size:
        # tokenize the passage
        inputs = tokenizer(
            doc_batch,
            truncation=True,
            padding=True,
            max_length=256,
            return_tensors='pt'
        )

        # generate three queries per doc/passage
        outputs = model.generate(
            input_ids=inputs['input_ids'].cuda(),
            attention_mask=inputs['attention_mask'].cuda(),
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries
        )

        # decode query to human readable text
        decoded_output = tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True
        )

        # loop through to pair query and passages
        for i, query in enumerate(decoded_output):
            query = query.replace('\t', ' ').replace('\n', ' ')  # remove newline + tabs
            passage_idx = int(i/num_queries)  # get index of passage to match query
            pairs.append([query, doc_batch[passage_idx]])
        
        doc_batch = []

100%|██████████| 1647/1647 [01:31<00:00, 18.06it/s]


## Negative Mining

The next step is *negative mining*. For this, we will use a sentence transformer retrieval model, it is the same model that we will be using as our base model for domain adaption later, the *[msmarco-bert-base-dot-v5](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5)*. Important factors that contributed to choosing this model include that it uses *dot product similarity* metric, and that it has been trained on MSMARCO, which consits of *(query, doc)* pairs similar to what we are using.

In [20]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-bert-base-dot-v5')
model.max_seq_length = 256
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

This step requires the retrieval of similar docs using a semantic search. For that I will use Pinecone because it's super fast, easy, and for this many vectors completely free. All we need is an [API key](https://app.pinecone.io).

In [21]:
import pinecone  # pip install pinecone-client

pinecone.init(
    api_key='<<YOUR_API_KEY>>',  # get api key from app.pinecone.io
    environment='us-west1-gcp'
)
# create a new negative mining index if does not already exist
if 'negative-mine' not in pinecone.list_indexes():
    pinecone.create_index(
        'negative-mine',
        dimension=model.get_sentence_embedding_dimension(),
        metric='dotproduct',  # important to use dot product similarity here
        pods=1
    )
# connect
index = pinecone.Index('negative-mine')

Now we encode the documents and store in the `negative-mine` index.

In [22]:
to_upsert = []
docs_seen = []
doc_batch = []
id_batch = []
batch_size = 64

for i, (query, doc) in enumerate(tqdm(pairs)):
    # do this to avoid doc duplication in Pinecone index
    if doc not in docs_seen:
        docs_seen.append(doc)
        doc_batch.append(doc)
        id_batch.append(str(i))
    # on reaching batch_size we encode and upsert
    if len(doc_batch) == batch_size:
        embeds = model.encode(doc_batch).tolist()
        # insert to index
        index.upsert(vectors=list(zip(id_batch, embeds)))
        # refresh batch
        doc_batch = []
        id_batch = []
    
# (optional) take a look at the index stats
index.describe_index_stats()

100%|██████████| 4800/4800 [00:17<00:00, 266.95it/s]


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1600}}}

Now we're ready to perform the negative mining step. We will return *10* of the most similar docs to each query, and select one at random to be our *negative doc*.

In [23]:
import random

batch_size = 64

for i in tqdm(range(0, len(pairs), batch_size)):
    # embed queries and query pinecone in batches to minimize network latency
    i_end = min(i+batch_size, len(pairs))
    queries = [pair[0] for pair in pairs[i:i_end]]
    pos_doc = [pair[1] for pair in pairs[i:i_end]]
    # create query embeddings
    query_embs = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
    # search for top_k most similar passages
    res = index.query(query_embs.tolist(), top_k=10)
    # iterate through queries and find negatives
    for query, pos_doc, query_res in zip(queries, pos_doc, res['results']):
        top_results = query_res['matches']
        # shuffle results so they are in random order
        random.shuffle(top_results)
        for hit in top_results:
            neg_doc = pairs[int(hit['id'])][1]
            # check that we're not just returning the positive doc
            if neg_doc != pos_doc:
                # if not we can add this to our (Q, P+) pair to make a (Q, P+, P- triplet)
                pairs[i].append(neg_doc)
                i += 1
                break

100%|██████████| 75/75 [01:17<00:00,  1.03s/it]


In [24]:
# delete the index when done
pinecone.delete_index('negative-mine')

The final data preparation step is the *pseudo-labeling* step. For this we need a cross encoder model.

In [25]:
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

We use the cross encoder to calculate the similarity of the positive and negative pairs, then take the score as the margin between both.

In [26]:
for i, (query, pos, neg) in enumerate(tqdm(pairs)):
    pos_score = model.predict((query, pos))
    neg_score = model.predict((query, neg))
    margin = pos_score - neg_score
    pairs[i].append(margin)

100%|██████████| 4800/4800 [01:50<00:00, 43.48it/s]


Let's view some of these pairs...

In [27]:
pairs[0], pairs[50], pairs[100]

(['what lessons did verus teach me',
  'From my grandfather Verus I learned good morals and the government of my temper.',
  'Now I will transfer my attention to the musician. You, sir, are teaching me how the treble and the bass are in accord with one another, and how, though the strings produce different notes, the result is a harmony; rather bring my soul into harmony with itself, and let not my purposes be out of tune. You are showing me what the doleful keys are; show me rather how, in the midst of adversity, I may keep from uttering a doleful note. The mathematician teaches me how to lay out the dimensions of my estates; but I should rather be taught how to lay out what is enough for a man to own. He teaches me to count, and adapts my fingers to avarice; but I should prefer him to teach me that there is no point in such calculations, and that one is none the happier for tiring out the book-keepers with his possessions – or rather, how useless property is to any man who would find

A greater difference in the margin score means there is a greater difference between the positive/negative docs.

In [28]:
len(pairs)

4800

## Fine-tuning the Model

We have 4.8K labeled triplets, meaning we're now ready to begin fine-tuning the model. We will use margin MSE loss to optimize the same model we used in the negative mining step earlier.

In [29]:
model = SentenceTransformer('msmarco-bert-base-dot-v5')
model.max_seq_length = 256
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

We must transform the data into a list of `InputExample` objects.

In [30]:
from sentence_transformers import InputExample

train = [
    InputExample(texts=[q, p, n], label=margin) for (q, p, n, margin) in pairs
]
len(train)

4800

Initialize our dataloader, and empty the GPU cache if needed.

In [33]:
import torch

torch.cuda.empty_cache()

batch_size = 16

loader = torch.utils.data.DataLoader(
    train, batch_size=batch_size, shuffle=True
)

Initialize the margin MSE loss function.

In [31]:
from sentence_transformers import losses

loss = losses.MarginMSELoss(model)

Then train, GPL performance tends to improve up to ~100K steps. We only have a small dataset of 4.8K samples, so if we process them through `10` epochs we have worked through 48K steps, almost half of that *very approximate* limit.

In [34]:
epochs = 10
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='stoic-search-bert',
    show_progress_bar=True
)


[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A
[A