# Retrieve & Re-Rank Demo over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia, which is used because it is smaller than the full English Wikipedia and fits better in RAM.

For semantic search, we use `SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')` and retrieve
the top 100 passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (`CrossEncoder('amberoad/bert-multilingual-passage-reranking-msmarco', max_length=512)`) that
scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus on which the bi-encoder was not trained.
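
The two-stage idea described above can be sketched without any model at all. The snippet below is only an illustration: `retrieve_and_rerank`, `word_overlap`, and `jaccard` are hypothetical helpers that stand in for the bi-encoder and cross-encoder, and the toy passages are made up. A cheap scorer selects `top_k` candidates from the whole corpus, and a second scorer re-orders only those candidates.

```python
from typing import Callable, List, Tuple


def retrieve_and_rerank(
    query: str,
    passages: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Two-stage search: cheap retrieval over everything, costly re-ranking over top_k."""
    # Stage 1 (stand-in for the bi-encoder): score every passage, keep the best top_k.
    candidates = sorted(passages, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]
    # Stage 2 (stand-in for the cross-encoder): rescore only the candidates.
    rescored = [(expensive_score(query, p), p) for p in candidates]
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)


def word_overlap(query: str, passage: str) -> float:
    """Toy first-stage scorer: number of shared words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def jaccard(query: str, passage: str) -> float:
    """Toy second-stage scorer: shared words divided by all distinct words."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if (q | p) else 0.0


toy_passages = [
    "the capital of france is a large city with many museums and a long history",
    "paris is the capital of france",
    "bananas are yellow",
    "france is in europe",
]

results = retrieve_and_rerank("capital of france", toy_passages, word_overlap, jaccard, top_k=2)
for score, passage in results:
    print("{:.3f}\t{}".format(score, passage))
```

Both remaining passages tie in the first stage; the second scorer then moves the shorter, more focused passage to the top, which is the kind of reordering the re-ranker is meant to provide.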


In [38]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import pandas as pd
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
bi_encoder.max_seq_length = 512     # Truncate long passages to 512 tokens
top_k = 100                          # Number of passages we want to retrieve with the bi-encoder

#The bi-encoder will retrieve 100 documents. We use a cross-encoder, to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('amberoad/bert-multilingual-passage-reranking-msmarco', max_length=512)

# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

# wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

# if not os.path.exists(wikipedia_filepath):
#     util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

# passages = []
# with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
#     for line in fIn:
#         data = json.loads(line.strip())

#         #Add all paragraphs
#         #passages.extend(data['paragraphs'])

#         #Only add the first paragraph
#         passages.append(data['paragraphs'][0])

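# Instead of the Simple Wikipedia dump (commented out above), load the passages
# from a local pickled DataFrame and keep the first 200 texts
df = pd.read_pickle(os.path.join(os.getcwd(), '..', 'dataframes', 'final_dataframe.pkl'))
passages = list(df.text.values[:200])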

print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


Passages: 200


Batches: 100%|████████████████████████████████████████████████████████████████████████| 7/7 [01:41<00:00, 14.51s/it]


In [16]:
corpus_embeddings[0].shape

torch.Size([512])
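
Each passage is now a 512-dimensional vector, and retrieval reduces to comparing the query vector against every passage vector. The sketch below shows that comparison with cosine similarity in plain Python on made-up 3-dimensional vectors. `cosine` and `top_k_by_cosine` are hypothetical helpers written for this illustration, not functions of `sentence_transformers`; the returned dictionaries only mimic the `corpus_id` / `score` keys that the notebook reads from `util.semantic_search`.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_by_cosine(query_vec: List[float], corpus_vecs: List[List[float]], top_k: int) -> List[Dict]:
    """Score every corpus vector against the query and return the top_k best hits."""
    scored = [{'corpus_id': i, 'score': cosine(query_vec, vec)} for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda hit: hit['score'], reverse=True)
    return scored[:top_k]


# Made-up vectors; real embeddings in this notebook have 512 dimensions.
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_hits = top_k_by_cosine([1.0, 0.1, 0.0], toy_corpus, top_k=2)
for hit in toy_hits:
    print(hit['corpus_id'], round(hit['score'], 3))
```

This brute-force loop is fine for 200 passages; for large corpora a tensor-based or approximate nearest-neighbour search would be the usual choice.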

In [3]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)


100%|███████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 1886.87it/s]
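
To make the lexical baseline less of a black box, here is a minimal sketch of a textbook BM25 scoring function over pre-tokenized documents. It is an assumption-laden illustration: `bm25_scores_sketch` is a hypothetical helper, the `k1` and `b` values are common choices, and the IDF variant used here may differ in detail from what `rank_bm25` implements, so the numbers will not necessarily match `BM25Okapi`.

```python
import math
from collections import Counter
from typing import List


def bm25_scores_sketch(query_tokens: List[str], tokenized_docs: List[List[str]],
                       k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25-style score per document for the given query tokens."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs or 1.0

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a larger weight than frequent ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


toy_docs = [
    ["capital", "france", "paris"],
    ["banana", "yellow", "fruit"],
    ["france", "europe", "country"],
]
toy_scores = bm25_scores_sketch(["capital", "france"], toy_docs)
print(toy_scores)
```

Documents that share no token with the query score exactly zero, which is the main weakness of lexical search that the semantic retrieval step is meant to cover.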


In [44]:
# This function will search all wikipedia articles for passages that
# answer the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
    
#     print("Top-3 lexical search (BM25) hits")
#     for hit in bm25_hits[0:3]:
#         print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
#     question_embedding = question_embedding.cuda()
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    print(cross_inp[0])
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit (sorting happens further below)
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    print(hits)
    hits = sorted(hits, key=lambda x: x['cross-score'][1], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'][1], passages[hit['corpus_id']].replace("\n", " ")))
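
The re-ranker indexes each `cross-score` with `[1]`, which implies that this cross-encoder returns two logits per query/passage pair. A softmax turns such a pair into a probability that is easier to read than raw logits. The sketch below assumes, as the sorting code in this notebook does, that index 1 is the "relevant" class; `relevance_probability` is a hypothetical helper, and the two logit pairs are copied from the output of the toy 6G example in this notebook.

```python
import math
from typing import List


def relevance_probability(logits: List[float]) -> float:
    """Softmax over two logits; returns the probability of index 1 (assumed 'relevant')."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    return exps[1] / sum(exps)


# Logit pairs taken from the printed cross-encoder scores of the 6G example.
example_logits = [
    [-3.7687316, 3.5060594],   # '6G will release tomorrow'
    [5.148552, -4.2308745],    # 'IoT uses 5G technology'
]
probs = [relevance_probability(pair) for pair in example_logits]
for pair, prob in zip(example_logits, probs):
    print(pair, "{:.4f}".format(prob))
```

Sorting by this probability gives the same order as sorting by the difference between the two logits, which can differ slightly from sorting by the index-1 logit alone.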


In [55]:
query = 'When will be 6G technology released?'
passage_list = ['6G is yet to come, maybe next year', 'IoT uses 5G technology', '2G is outdated', '6G will release tomorrow']

cross_inp = [[query, passage_list[idx]] for idx in range(len(passage_list))]
cross_scores = cross_encoder.predict(cross_inp)

# Collect the cross-encoder scores together with the index of each passage
hits = []
for idx in range(len(cross_scores)):
    hits.append({'id': idx, 'cross-score': cross_scores[idx]})

# Sort by the second logit (index 1), highest first
hits = sorted(hits, key=lambda x: x['cross-score'][1], reverse=True)
print(hits)
for hit in hits[0:3]:
    # hit is a dict, so look up the passage through its stored index
    print("\t{:.3f}\t{}".format(hit['cross-score'][1], passage_list[hit['id']].replace("\n", " ")))

[{'id': 3, 'cross-score': array([-3.7687316,  3.5060594], dtype=float32)}, {'id': 0, 'cross-score': array([-3.220613,  3.073342], dtype=float32)}, {'id': 1, 'cross-score': array([ 5.148552 , -4.2308745], dtype=float32)}, {'id': 2, 'cross-score': array([ 5.586109, -4.738967], dtype=float32)}]


In [45]:
search(query = "What is the capital of the United States?")

Input question: What is the capital of the United States?
['What is the capital of the United States?', '“I t makes no sense to have a theoretically perfect listing regime if in practice users increasingly choose other venues,” says Lord Hill in the introduction to his report on how to shake up London’s stock market listings regime. It’s a fair, pragmatic point. London accounted for only 5% of IPOs, or flotations, globally between 2015 and 2020, which is a feeble performance if the post-Brexit ambition for the stock market is to rival New York, and not just deflect Amsterdam’s challenge. A few of London’s supposedly sacred governance principles were always likely to be sacrificed. At least Hill has tried to soften the process. The least objectionable proposal is the green light for dual classes of shares. Such “golden share” structures are a governance no-no but the US, by accepting them, has dealt itself an ace card to lay in front of footloose founders of technology firms. Hill propo