# UKPLab / sentence-transformers
https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/query_generation

- 1_programming_query_generation.py - We generate queries for all paragraphs of a set of Wikipedia articles about programming languages and technologies.
- 2_programming_train_bi-encoder.py - We train a SentenceTransformer bi-encoder with these generated queries. This results in a model we can then use for semantic search (for the given Wikipedia articles).
- 3_programming_semantic_search.py - Shows how the trained model can be used for semantic search.

## 1_programming_query_generation.py

In [5]:
"""
In this example we train a semantic search model to search through Wikipedia
articles about programming languages & technologies.

We use the text paragraphs from the following Wikipedia articles:
Assembly language, C, C Sharp, C++, Go, Java, JavaScript, Keras, Laravel, MATLAB, Matplotlib, MongoDB, MySQL, Natural Language Toolkit, NumPy, pandas (software), Perl, PHP, PostgreSQL, Python, PyTorch, R, React, Rust, Scala, scikit-learn, SciPy, Swift, TensorFlow, Vue.js

The example consists of three scripts:
1_programming_query_generation.py - We generate queries for all paragraphs from these articles
2_programming_train_bi-encoder.py - We train a SentenceTransformer bi-encoder with these generated queries. This results in a model we can then use for semantic search (for the given Wikipedia articles).
3_programming_semantic_search.py - Shows how the trained model can be used for semantic search
"""
import json
import gzip
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import tqdm
import os
from sentence_transformers import util

paragraphs = set()

In [6]:
# We use the Wikipedia articles of certain programming languages
corpus_filepath = 'wiki-programmming-20210101.jsonl.gz'
if not os.path.exists(corpus_filepath):
    util.http_get('https://sbert.net/datasets/wiki-programmming-20210101.jsonl.gz', corpus_filepath)

with gzip.open(corpus_filepath, 'rt') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        for p in data['paragraphs']:
            if len(p) > 100:    # Only take paragraphs longer than 100 chars
                paragraphs.add(p)

paragraphs = list(paragraphs)
print("Paragraphs:", len(paragraphs))

Paragraphs: 1230


In [14]:
# Average paragraph length in characters
sum(len(p) for p in paragraphs) / len(paragraphs)


376.03333333333336

In [7]:
# Now we load the model that is able to generate queries given a paragraph.
# This model was trained on the MS MARCO dataset, a dataset with 500k
# queries from Bing and the respective relevant passage
tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model.eval()

#Select the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Parameters for generation
batch_size = 8 #Batch size
num_queries = 5 #Number of queries to generate for every paragraph
max_length_paragraph = 300 #Max length for paragraph
max_length_query = 64   #Max length for output query

In [8]:
# Now for every paragraph in our corpus, we generate the queries
with open('generated_queries.tsv', 'w') as fOut:
    # Only the first 10 batches are processed here to keep the run short.
    # To process the whole corpus, use: tqdm.trange(0, len(paragraphs), batch_size)
    for start_idx in tqdm.trange(0, 10*batch_size, batch_size):
        sub_paragraphs = paragraphs[start_idx:start_idx+batch_size]
        # Tokenize with the regular __call__ method; prepare_seq2seq_batch is deprecated
        inputs = tokenizer(sub_paragraphs, max_length=max_length_paragraph, truncation=True, padding=True, return_tensors='pt').to(device)
        outputs = model.generate(
            **inputs,
            max_length=max_length_query,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries)

        # generate() returns num_queries consecutive outputs per input paragraph
        for idx, out in enumerate(outputs):
            query = tokenizer.decode(out, skip_special_tokens=True)
            para = sub_paragraphs[idx // num_queries]
            fOut.write("{}\t{}\n".format(query.replace("\t", " ").strip(), para.replace("\t", " ").strip()))

  0%|          | 0/10 [00:28<?, ?it/s]


KeyboardInterrupt: 
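
The generation call above samples with `do_sample=True` and `top_p=0.95` (nucleus sampling): at each step the next token is drawn only from the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The following is a minimal sketch of that filtering step on a toy distribution. The token names and probabilities are made up for illustration, `top_p=0.9` is used only to keep the toy numbers simple, and this is not the Transformers implementation.

```python
import random

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample_top_p(probs, top_p=0.95, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (illustration only)
toy_probs = {"what": 0.5, "how": 0.25, "why": 0.125, "which": 0.0625, "zebra": 0.0625}
nucleus = top_p_filter(toy_probs, top_p=0.9)
sampled = sample_top_p(toy_probs, top_p=0.9)
```

With these toy numbers the least likely token is cut off and never sampled. A lower `top_p` gives more conservative queries, a higher one gives more diverse (and noisier) queries.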

## 2_programming_train_bi-encoder.py

In [4]:
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
import os

train_examples = []
with open('generated_queries.tsv') as fIn:
    for line in fIn:
        # Each line holds one generated query and its source paragraph, separated by a tab
        query, paragraph = line.strip().split('\t', maxsplit=1)
        train_examples.append(InputExample(texts=[query, paragraph]))

In [9]:
# For the MultipleNegativesRankingLoss, it is important
# that the batch does not contain duplicate entries, i.e.
# no two equal queries and no two equal paragraphs.
# To ensure this, we use a special data loader
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=64)

# Now we build a SentenceTransformer model from a pretrained DistilBERT transformer and a pooling layer
word_emb = models.Transformer('distilbert-base-uncased')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# MultipleNegativesRankingLoss requires input pairs (query, relevant_passage)
# and trains the model so that it is suitable for semantic search
train_loss = losses.MultipleNegativesRankingLoss(model)
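
To make the comments above concrete: for a batch of (query, relevant_passage) pairs, MultipleNegativesRankingLoss treats passage i as the positive for query i and every other passage in the batch as a negative. The sketch below shows the idea in plain Python as a cross-entropy over scaled cosine similarities. It is an illustration, not the library code; the scale of 20 and the use of cosine similarity are assumptions that mirror what we understand to be the library defaults, and the 2-dimensional embeddings are made up.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i and
    all other passages in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cos_sim(q, p) for p in passage_embs]
        # Numerically stable log-sum-exp over all passages in the batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (illustration only)
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # passage i is close to query i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # passage i is close to the other query

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
```

The loss is near zero when each query is closest to its own passage and large otherwise. It also shows why duplicates in a batch are harmful: a second copy of the correct passage would be scored as a negative for its own query.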

In [10]:
# Tune the model
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)

os.makedirs('output', exist_ok=True)
model.save('output/programming-model')

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]
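
The warm-up arithmetic in the training cell can be checked by hand. With 6 iterations per epoch (see the progress bars above) and 3 epochs there are 18 training steps, so 10% warm-up rounds down to a single step. The sketch below computes this and a learning-rate multiplier per step, assuming linear warm-up followed by linear decay, which we understand to be the default `WarmupLinear` scheduler of `model.fit`. The function name is hypothetical.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: rises linearly from 0 to 1 during warm-up,
    then decays linearly back to 0 at total_steps."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

steps_per_epoch = 6   # taken from the progress bars above
num_epochs = 3
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)

factors = [warmup_linear_factor(s, warmup_steps, total_steps) for s in range(total_steps)]
```

On a run this small the warm-up is effectively a single step; with the full corpus the same 10% rule yields a proportionally longer ramp.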

## 3_programming_semantic_search.py


In [2]:
from sentence_transformers import SentenceTransformer, util
import gzip
import json
import os

# Load the model we trained in 2_programming_train_bi-encoder.py
model = SentenceTransformer('output/programming-model')

# Load the corpus
docs = []
corpus_filepath = 'wiki-programmming-20210101.jsonl.gz'
if not os.path.exists(corpus_filepath):
    util.http_get('https://sbert.net/datasets/wiki-programmming-20210101.jsonl.gz', corpus_filepath)

with gzip.open(corpus_filepath, 'rt') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        title = data['title']
        for p in data['paragraphs']:
            if len(p) > 100:    # Only take paragraphs longer than 100 chars
                docs.append((title, p))

# Encoding the corpus is expensive, so it should be done once; the next cells cache the result on disk
paragraph_emb = model.encode([d[1] for d in docs], convert_to_tensor=True)

In [16]:
# Save paragraph embeddings to disk
import pickle
with open('output/programming-paragraph-embeddings.pkl', 'wb') as fOut:
    pickle.dump(paragraph_emb, fOut)

In [17]:
# Load paragraph embeddings from disk
with open('output/programming-paragraph-embeddings.pkl', 'rb') as fIn:
    _paragraph_emb = pickle.load(fIn)
_paragraph_emb.shape

torch.Size([1230, 768])
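
The save and load cells above can be folded into one "compute once, then reuse" helper. The following is a minimal sketch of that pattern; `load_or_compute` and `fake_encode` are hypothetical names, and `fake_encode` merely stands in for the expensive `model.encode(...)` call so the example runs without a model.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the cached object if cache_path exists, otherwise compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as fIn:
            return pickle.load(fIn)
    result = compute_fn()
    with open(cache_path, 'wb') as fOut:
        pickle.dump(result, fOut)
    return result

# Stand-in for the expensive encoding step; counts how often it is called
calls = []
def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'embeddings.pkl')
    first = load_or_compute(cache_path, fake_encode)    # computes and writes the cache
    second = load_or_compute(cache_path, fake_encode)   # reads the cache
```

In the notebook this would be called with the path `'output/programming-paragraph-embeddings.pkl'` and a function that wraps `model.encode`. The cache file has to be deleted whenever the model is retrained or the corpus changes, since stale embeddings would otherwise be reused.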

In [4]:

print("Available Wikipedia Articles:")
print(", ".join(sorted(list(set([d[0] for d in docs])))))

query = "What is Python?"
query_emb = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, paragraph_emb, top_k=3)[0]

for hit in hits:
    doc = docs[hit['corpus_id']]
    print("{:.2f}\t{}\t\t{}".format(hit['score'], doc[0], doc[1]))


Available Wikipedia Articles:
Assembly language, C (programming language), C Sharp (programming language), C++, Go (programming language), Java (programming language), JavaScript, Keras, Laravel, MATLAB, Matplotlib, MongoDB, MySQL, Natural Language Toolkit, NumPy, PHP, Pandas (software), Perl, PostgreSQL, PyTorch, Python (programming language), R (programming language), Rust (programming language), Scala (programming language), SciPy, Scikit-learn, Swift (programming language), TensorFlow, Vue.js
0.69	Python (programming language)		The major academic conference on Python is PyCon. There are also special Python mentoring programmes, such as Pyladies.
0.65	Python (programming language)		Users and admirers of Python, especially those considered knowledgeable or experienced, are often referred to as "Pythonistas".
0.64	Python (programming language)		Python's name is derived from the British comedy group Monty Python, whom Python creator Guido van Rossum enjoyed while developing the languag
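
Conceptually, the search step above scores the query embedding against every paragraph embedding and keeps the `top_k` best matches. The sketch below reproduces that idea in plain Python with cosine similarity, returning hits shaped like the ones used above (`corpus_id` and `score`). It is an illustration on made-up 2-dimensional vectors, not the `util.semantic_search` implementation, which works on tensors and processes the corpus in chunks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_search(query_emb, corpus_embs, top_k=3):
    """Score every corpus embedding against the query and return the best top_k hits."""
    scored = [{'corpus_id': i, 'score': cosine(query_emb, emb)}
              for i, emb in enumerate(corpus_embs)]
    scored.sort(key=lambda hit: hit['score'], reverse=True)
    return scored[:top_k]

# Toy corpus embeddings (illustration only)
corpus = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k_search([1.0, 0.1], corpus, top_k=2)
```

Because every paragraph is scored, the cost grows linearly with the corpus size. That is fine for the 1230 paragraphs here; much larger corpora usually call for an approximate nearest-neighbour index.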

## Jason

In [97]:
# Load the pickled DataFrame of sections
import pickle
with open('df_context.pkl', 'rb') as f:
    df = pickle.load(f)

# Extract context
contexts = df['context'].tolist()
contexts = [c.strip() for c in contexts]

In [98]:
# Get the lengths of the contexts (in characters)
for c in contexts:
    print(len(c))

1183
691
838
415
1333
3658
296
4139
7414
3579
2250
1411
7674
2930
3979
1654
3990
2898
1766
821
2872
2991
9969
1550
1107
2769
5361
3734
4172
1300
10140
1594
1726
3380
3675
2775
8407
7129
3008
3124
1600
1132
1362
4432
1160
1003
4024
3430
1890
2800
2837
6607
686
3319
910
1295
3525
4962
4951
1273
7660
5647
2782
1761
3273
7905
2611
7413
2934
6770
4003
2669
2168
7732
3972
3804
2478
2599
1009
5593
9314
1958
8179
5783
4476
4303
2172
8689
1527
9780
6872
5965
4348
728
1694
4134
2208
6387
440
3094
3988
2116
2249
923
2219
6504
1428
693
7155
3482
4645
1648
2424
3638
5383
1549
9507
2985
6425
5787
3372
866
8620
4291
4163
4392
7067
1380
3539
1929
1211
1104
1325
3185
3198
1301
842
3220
7826
7166
2006
2693
6771
6110
5854
7924
9636
3678
4940
9157
5021
5425
1424
3183
3936
3996
4548
1590
6876
3058


In [85]:
# Sections vary greatly in size, so we split them into smaller chunks.

# Split a section into chunks of consecutive sentences; a new chunk is started
# once the accumulated sentence length exceeds `size` characters (default 100)
def split_section(section, size=100):
    section = section.split('.')
    chunks = []
    chunk = []
    chunk_size = 0
    for sentence in section:
        chunk_size += len(sentence)
        if chunk_size > size:
            chunks.append(' '.join(chunk))
            chunk = []
            chunk_size = 0
        chunk.append(sentence)
    chunks.append(' '.join(chunk))
    return chunks

split_section(contexts[100], 200)[:5]

['',
 'Since logistic regression corresponds to a linear neural network with no activation function, it should be apparent the multi-class neural network allows us to extend logistic regression to the multi-class setting  One way to accomplish this is to simply replace f(x, w) in eq  (15 18) with a linear function; while this is certainly a valid way to proceed, there is one slightly annoying side-effect',
 ' Recall from section 15 3 1 that in the case where we applied neural networks to a two-class classification task, the neural network had a single output neuron  However, in the multi-class setting considered in section 15 3',
 '2, the neural network had as many outputs C as there was classes  This means that this approach to multi-class regression would not directly generalize the binary classification case',
 ' To get around this, it is customary to implement linear multi-class classification using the (modified) softmax with C −1 inputs which we encountered in eq  (5 31)  Specifi
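
The output above shows a few side effects of splitting on every '.': the first chunk is empty, references such as "15.3.1" are broken into "15 3 1", sentence-ending periods are lost, and chunks can exceed the requested size. The following is a hedged sketch of an alternative splitter that avoids these issues by splitting only at sentence-ending punctuation followed by whitespace. It is not the function used to produce the results in this notebook, and the sample text is made up.

```python
import re

def split_section_by_sentence(section, size=200):
    """Greedily pack whole sentences into chunks of at most `size` characters.
    A single sentence longer than `size` still becomes its own oversized chunk."""
    # Split after '.', '!' or '?' only when whitespace follows, so "15.3.1" stays intact
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', section) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        added_len = len(sentence) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + added_len > size:
            chunks.append(' '.join(current))
            current, current_len = [], 0
            added_len = len(sentence)
        current.append(sentence)
        current_len += added_len
    if current:
        chunks.append(' '.join(current))
    return chunks

# Made-up sample text (illustration only)
text = ("See section 15.3.1 for details. The network has C outputs. "
        "This is short. A final sentence that closes the example.")
chunks = split_section_by_sentence(text, size=60)
```

Swapping this in would change the number and content of the paragraphs, so the counts and generated queries shown in the following cells would no longer match.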

In [86]:
paragraphs = [split_section(s, 200) for s in contexts]

In [87]:
# Flatten list of lists
paragraphs = [item.strip() for sublist in paragraphs for item in sublist]
paragraphs[:5]

['How can we build intelligent machines? More than 65 years ago Alan Turing made this question the subject of his famous essay “Computing machinery and intelligence” [Turing, 1950]',
 'Alan Turing suggested that when we phrase the question in this manner, we unavoidably get bogged down in the definition of the word “intelligence”',
 'Instead, he proposed we should rather consider a different question: Can we construct a machine that can do the same things a human can do? This may ultimately be as hard to answer as the first question, but at least we don’t have to begin our efforts by defining intelligence  A second part of Turing’s essay discuss how we might build such a human-imitating machine',
 'Turing proposed that instead of writing a computer program that behaves like a human from scratch, we should build a machine which initially cannot do a great many things but which can learn from past experience',
 'For instance, if we wished to construct a machine which translate from 

In [88]:
len(paragraphs)

2260

In [95]:
# Parameters for generation
batch_size = 8 #Batch size
num_queries = 5 #Number of queries to generate for every paragraph
max_length_paragraph = 1000 #Max length for paragraph
max_length_query = 200   #Max length for output query

In [96]:
# Generate queries for a single batch of paragraphs, starting at index `start`
start = 500
with open('generated_queries_jason.tsv', 'w') as fOut:
    for start_idx in tqdm.trange(start, start+batch_size, batch_size):
        sub_paragraphs = paragraphs[start_idx:start_idx+batch_size]
        print(sub_paragraphs)
        # Tokenize with the regular __call__ method; prepare_seq2seq_batch is deprecated
        inputs = tokenizer(sub_paragraphs, max_length=max_length_paragraph, truncation=True, padding=True, return_tensors='pt').to(device)
        outputs = model.generate(
            **inputs,
            max_length=max_length_query,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries)

        # generate() returns num_queries consecutive outputs per input paragraph
        for idx, out in enumerate(outputs):
            query = tokenizer.decode(out, skip_special_tokens=True)
            para = sub_paragraphs[idx // num_queries]
            fOut.write("{}\t{}\n".format(query.replace("\t", " ").strip(), para.replace("\t", " ").strip()))

  0%|          | 0/1 [00:00<?, ?it/s]

['Obviously, how to build these models will be a subject we return to several times in later chapters, however in this chapter we will be concerned with introducing a few building blocks which we will use many times over when constructing more elaborate models', 'Consider a setting where we consider a single, binary variable b which can be either false, b = 0, or true, b = 1', 'The prototypical example of a binary event is a coin flip where b = 0 denote the event the coin landed tails and b = 1 the event the coin landed heads, but the setup applies to all simple5', '4 The Bernoulli, categorical and binomial distributions 81 classification problems with two outcomes, for instance we could denote the event a treatment cures a patient such that b = 0 if the patient is not cured and b = 1 if the patient is cured  The Bernoulli distribution is the assumption the probability that b = 0 or b = 1 depends on a number 0 ≤ θ ≤ 1 as: Bernoulli distribution: p(b|θ) = θ b (1 − θ) 1−b', 'It is worth 

100%|██████████| 1/1 [01:09<00:00, 69.11s/it]
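
The generation loop above recovers the source paragraph of each output by integer-dividing the output index by `num_queries`. This relies on `generate` returning the `num_return_sequences` outputs of each input consecutively and in input order. The sketch below isolates that bookkeeping with made-up strings; the helper name is hypothetical.

```python
def pair_queries_with_paragraphs(flat_queries, paragraphs, num_queries):
    """Map each generated query back to its source paragraph.

    Assumes flat_queries holds num_queries consecutive entries per paragraph,
    in the same order as the paragraphs."""
    if len(flat_queries) != len(paragraphs) * num_queries:
        raise ValueError("expected num_queries outputs per paragraph")
    return [(query, paragraphs[idx // num_queries])
            for idx, query in enumerate(flat_queries)]

# Made-up data (illustration only): 2 paragraphs, 3 queries each
demo_paragraphs = ["para A", "para B"]
flat_queries = ["a1", "a2", "a3", "b1", "b2", "b3"]
pairs = pair_queries_with_paragraphs(flat_queries, demo_paragraphs, num_queries=3)
```

The length check guards against silently mispairing queries if the batch size or number of returned sequences ever differs from what the loop assumes.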
