Instructions for creating a ColBERT index and training an FiD model for KILT-ELI5 can be found [here](https://github.com/primeqa/primeqa/blob/main/examples/lfqa/README.md).<br>  
The ColBERT index is built from the KILT-Wikipedia corpus, and the FiD reader is trained on KILT-ELI5.<br>
Running this code requires 300 GB of memory.
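
Because the memory requirement is large, it can help to check the machine before loading the index. The following is a minimal sketch, assuming the 300 GB figure refers to host RAM (not GPU memory) and that the notebook runs on a POSIX system where `os.sysconf` exposes the page counters. The helper names `total_ram_gb` and `memory_warning` are hypothetical and not part of PrimeQA.

```python
import os

REQUIRED_GB = 300  # figure stated above; assumed here to mean host RAM


def total_ram_gb():
    """Return total physical memory in GB (10^9 bytes), or None if unknown."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        num_pages = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these counters are not available on this platform.
        return None
    if page_size <= 0 or num_pages <= 0:
        return None
    return page_size * num_pages / 1e9


def memory_warning(required_gb=REQUIRED_GB):
    """Return a warning string if RAM looks insufficient, otherwise None."""
    total = total_ram_gb()
    if total is None:
        return "Could not determine total RAM; proceed with care."
    if total < required_gb:
        return f"Only {total:.0f} GB RAM detected; {required_gb} GB is required."
    return None


print(memory_warning() or "Memory check passed.")
```

The check only reads total installed memory, so it does not account for memory already in use by other processes.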

In [1]:
from primeqa.components.retriever.dense import ColBERTRetriever
from primeqa.components.reader.generative import GenerativeFiDReader
from primeqa.pipelines.qa_pipeline import QAPipeline
import json

index_root = "/dccstor/mabornea1/kilt-wikipedia-test/colbert_ir/kilt_wikipedia_eli5_dev_exp/indexes/"
index_name = "kilt_wikipedia_eli5_dev_indname"
collection = "/dccstor/mabornea1/kilt-wikipedia-test/passages/kilt_knowledgesource_eli5_dev.tsv"


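# Retriever: returns at most 3 passages per question from the ColBERT index.
colbert_retriever = ColBERTRetriever(
    index_root=index_root,
    index_name=index_name,
    collection=collection,
    max_num_documents=3,
)
colbert_retriever.load()

# Reader: FiD model that generates an answer from the retrieved passages.
fid_reader = GenerativeFiDReader()
fid_reader.load()

# Chain retriever and reader into a single question-answering pipeline.
lfqa_pipeline = QAPipeline(colbert_retriever, fid_reader)

questions = ["What causes the trail behind jets at high altitude?"]
answers = lfqa_pipeline.run(questions)
print(json.dumps(answers, indent=4))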

{"time":"2023-02-16 20:33:01,912", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss with AVX2 support."}
{"time":"2023-02-16 20:33:02,177", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss with AVX2 support."}
[Feb 16, 20:33:07] #> base_config.py from_path /dccstor/mabornea1/kilt-wikipedia-test/colbert_ir/kilt_wikipedia_eli5_dev_exp/indexes//kilt_wikipedia_eli5_dev_indname/metadata.json
[Feb 16, 20:33:07] #> base_config.py from_path args loaded! 
[Feb 16, 20:33:07] #> base_config.py from_path args replaced ! 
[Feb 16, 20:33:15] #>>>>> at ColBERT name (model type) : /dccstor/colbert-ir/franzm/experiments/oct2_7_12_1.5e-06/none/2022-10/09/15.21.39/checkpoints/colbert.dnn.batch_91287.model
[Feb 16, 20:33:15] #>>>>> at BaseColBERT name (model type) : /dccstor/colbert-ir/franzm/experiments/oct2_7_12_1.5e-06/none/2022-10/09/15.21.39/checkpoints/colbert.dnn.batch_91287.model
[Feb 16, 20:33:24] factory model type: xlm-roberta-large
[Feb 16, 20:33

14635it [00:00, 151857.01it/s]

[Feb 16, 20:34:37] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Feb 16, 20:34:37] #> Input: $ What causes the trail behind jets at high altitude?, 		 True, 		 None
[Feb 16, 20:34:37] #> Output IDs: torch.Size([32]), tensor([     0,   9748,   4865, 113660,     70, 141037,  50155,     55,    933,
            99,  11192,    144,  35810,     32,      2,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1])
[Feb 16, 20:34:37] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Feb 16, 20:34:37] #>>>> colbert query ==
[Feb 16, 20:34:37] #>>>>> input_ids: torch.Size([32]), tensor([     0,   9748,   4865, 113660,     70, 141037,  50155,     55,    933,
            99,  11192,    144,  35810,     32,      2,      1,      1,      1,
             1,      1,      1,      




[Feb 16, 20:34:38] #>>>> before linear query ==
[Feb 16, 20:34:38] #>>>>> Q: torch.Size([32, 1024]), tensor([[-0.9052, -0.0412,  1.6951,  ..., -2.0517, -0.3507,  0.3122],
        [-0.6221, -0.3006,  0.1237,  ...,  0.0440, -1.5914, -0.2277],
        [-0.4957, -0.5071, -0.0034,  ...,  0.1966, -1.3050, -0.4422],
        ...,
        [-1.5236, -0.1908,  0.4288,  ...,  0.2573, -1.6916, -0.0264],
        [-1.5236, -0.1908,  0.4288,  ...,  0.2573, -1.6916, -0.0264],
        [-1.5236, -0.1908,  0.4288,  ...,  0.2573, -1.6916, -0.0264]],
       device='cuda:0')
[Feb 16, 20:34:38] #>>>>> self.linear query : Parameter containing:
tensor([[-0.0301, -0.0307, -0.0115,  ..., -0.0231, -0.0023, -0.0216],
        [ 0.0053,  0.0023, -0.0308,  ...,  0.0108,  0.0011,  0.0201],
        [-0.0220,  0.0370,  0.0339,  ..., -0.0023, -0.0172,  0.0244],
        ...,
        [ 0.0222,  0.0115, -0.0246,  ...,  0.0389, -0.0034, -0.0165],
        [-0.0146,  0.0392,  0.0131,  ..., -0.0055,  0.0219, -0.0368],
        [ 

100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.05it/s]


Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BartFiDModelForDownstreamTasks.forward` and have been ignored: example_id. If example_id are not expected by `BartFiDModelForDownstreamTasks.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 1


{
    "0": {
        "answers": {
            "text": "The water vapor in the exhaust from the engine is mixed with the cold air, and condenses into ice crystals. \n\nThe water is then carried away by the wind, and the air is heated up.  The water vapor condenses and forms clouds."
        },
        "passages": [
            "Mitigation of aviation's environmental impact\n \"Aircraft flying at high altitude form condensation trails or contrails in the exhaust plume of their engines. While in the Troposphere these have very little climatic impact. However, jet aircraft cruising in the Stratosphere do create an impact from their contrails, although the extent of the damage to the environment is as yet unknown. Contrails can also trigger the formation of high-altitude Cirrus cloud thus creating a greater climatic effect. A 2015 study found that artificial cloudiness caused by contrail \"\"outbreaks\"\" reduce the difference between daytime and nighttime temperatures. The former are decre

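The printed result appears to be a dictionary keyed by question index, where each entry holds the generated answer under `answers` → `text` and a `passages` list of strings whose first line is the page title. A minimal sketch for pulling those fields out, assuming that shape holds for every entry (the helper name `summarize_result` is hypothetical and not part of PrimeQA, and the sample dictionary is a shortened stand-in for the real output):

```python
def summarize_result(result):
    """Collect (question_key, answer_text, passage_titles) from a pipeline result.

    Assumes the shape printed above: each entry has entry["answers"]["text"]
    and an optional "passages" list of strings whose first line is the title.
    """
    rows = []
    for key, entry in result.items():
        answer_text = entry["answers"]["text"].strip()
        # The title is taken as everything before the first newline.
        titles = [p.split("\n", 1)[0].strip() for p in entry.get("passages", [])]
        rows.append((key, answer_text, titles))
    return rows


# Shortened stand-in with the same shape as the output above.
example = {
    "0": {
        "answers": {"text": "The water vapor in the exhaust condenses into ice crystals."},
        "passages": [
            "Mitigation of aviation's environmental impact\n Aircraft flying at high altitude form contrails."
        ],
    }
}

for key, answer_text, titles in summarize_result(example):
    print(f"[{key}] {answer_text}")
    for title in titles:
        print(f"    source: {title}")
```

With the real pipeline output, the same helper would be called as `summarize_result(answers)`.
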
In [2]:
questions = ["Why do we have different tax brackets ? "]
answers = lfqa_pipeline.run(questions)
print(json.dumps(answers, indent=4))

100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 68.71it/s]


Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BartFiDModelForDownstreamTasks.forward` and have been ignored: example_id. If example_id are not expected by `BartFiDModelForDownstreamTasks.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 1


{
    "0": {
        "answers": {
            "text": "The idea is that the higher the income, the more you earn, the higher your tax burden. \n\nIf you make $100,000, you pay $100.00 in taxes. If you make a million, you only pay $50,000 in taxes, and you pay the same amount of taxes.  \nIf your income is $100k, you're paying $50k in taxes and you're only paying $30k in tax.  If you earn $100m, you are paying $60k in taxation.  You're paying the same tax rate as someone making $100mil.  So you're still paying the exact same amount, but you're also paying the full amount.  The higher the amount, the lower the tax burden, and the higher you are.  This is why you pay more tax on the top 1% of your income than the bottom 1%. \n \nThe reason why you have different tax brackets is because the higher income people are, the less they pay in taxes on their income.  They're paying more taxes on the lower income people, and they're paying less on the higher incomes.  That's why you see the higher