Instructions for creating a ColBERT index and training an FiD model for KILT-ELI5 can be found [here](https://github.com/primeqa/primeqa/blob/main/examples/lfqa/README.md).<br>
The ColBERT index is built from the KILT-Wikipedia corpus, and the FiD reader is trained on KILT-ELI5.<br>
This code requires 300 GB of memory.
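
Because of the memory requirement, it can be worth checking the machine before loading the index. The following is a minimal sketch, not part of PrimeQA: it assumes the 300 GB figure refers to host RAM (the text above does not say), and the helper names `total_ram_gb` and `has_enough_memory` are hypothetical. It relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os

# Figure quoted above; assumed here to mean host RAM rather than GPU memory.
REQUIRED_GB = 300


def total_ram_gb():
    """Return total physical memory in GB (Unix-like systems only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * num_pages / 1e9


def has_enough_memory(total_gb, required_gb=REQUIRED_GB):
    """Return True if the given amount of memory meets the requirement."""
    return total_gb >= required_gb


if __name__ == "__main__":
    available = total_ram_gb()
    if not has_enough_memory(available):
        print(f"Warning: {available:.0f} GB RAM found, {REQUIRED_GB} GB expected.")
```

Keeping the comparison in its own function makes the threshold easy to change if the requirement differs for another index.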

In [1]:
from primeqa.components.retriever.dense import ColBERTRetriever
from primeqa.components.reader.generative import GenerativeFiDReader
from primeqa.pipelines.qa_pipeline import QAPipeline
import json

# Locations of the pre-built ColBERT index and its passage collection;
# adjust these paths for your environment.
index_root = "/dccstor/mabornea1/kilt-wikipedia-test/colbert_ir/kilt_wikipedia_eli5_dev_exp/indexes/"
index_name = "kilt_wikipedia_eli5_dev_indname"
collection = "/dccstor/mabornea1/kilt-wikipedia-test/passages/kilt_knowledgesource_eli5_dev.tsv"

# Retriever: return at most 3 passages per question.
colbert_retriever = ColBERTRetriever(
    index_root=index_root,
    index_name=index_name,
    collection=collection,
    max_num_documents=3,
)
colbert_retriever.load()

# Reader: FiD generative reader with its default settings.
fid_reader = GenerativeFiDReader()
fid_reader.load()

# Chain the retriever and the reader into a single long-form QA pipeline.
lfqa_pipeline = QAPipeline(colbert_retriever, fid_reader)

questions = ["What causes the trail behind jets at high altitude?"]
answers = lfqa_pipeline.run(questions)
print(json.dumps(answers, indent=4))

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
{"time":"2023-02-03 12:11:58,301", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-02-03 12:11:58,777", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
[Feb 03, 12:12:10] #> base_config.py from_path /dccstor/mabornea1/kilt-wikipedia-test/colbert_ir/kilt_wikipedia_eli5_dev_exp/indexes//kilt_wikipedia_eli5_dev_indname/metadata.json
[Feb 03, 12:12:10] #> base_config.py from_path args loaded! 
[Feb 03, 12:12:10] #> base_config.py from_path args replaced ! 
[Feb 03, 12:12:29] #>>>>> at ColBERT name (model type) : /dccstor/colbert-ir/franzm/experiments/oct2_7_12_1.5e-06/none/2022-10/09/15.21.39/checkpoints/colbert.dnn.batch_91287.model
[Feb 03, 12:12:29] #>>>>> at BaseColBERT name (model type) : /dccstor/colbert-ir/franzm/experiments/oct2_7_12_1.5e-06/none/2022-10/09/15.21.39/checkpoints/colbert.dnn.batch_91287.model
[Feb 03, 12:12:41] factory model type: xlm-robe



[Feb 03, 12:13:40] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Feb 03, 12:14:06] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Feb 03, 12:14:33] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


14635it [00:00, 95800.42it/s]

[Feb 03, 12:15:08] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Feb 03, 12:15:08] #> Input: $ What causes the trail behind jets at high altitude?, 		 True, 		 None
[Feb 03, 12:15:08] #> Output IDs: torch.Size([32]), tensor([     0,   9748,   4865, 113660,     70, 141037,  50155,     55,    933,
            99,  11192,    144,  35810,     32,      2,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1])
[Feb 03, 12:15:08] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Feb 03, 12:15:08] #>>>> colbert query ==
[Feb 03, 12:15:08] #>>>>> input_ids: torch.Size([32]), tensor([     0,   9748,   4865, 113660,     70, 141037,  50155,     55,    933,
            99,  11192,    144,  35810,     32,      2,      1,      1,      1,
             1,      1,      1,      




[Feb 03, 12:15:09] #>>>> before linear query ==
[Feb 03, 12:15:09] #>>>>> Q: torch.Size([32, 1024]), tensor([[-0.9069, -0.0403,  1.6935,  ..., -2.0556, -0.3505,  0.3143],
        [-0.6235, -0.2982,  0.1217,  ...,  0.0441, -1.5926, -0.2264],
        [-0.4985, -0.5053, -0.0043,  ...,  0.1960, -1.3059, -0.4401],
        ...,
        [-1.5241, -0.1893,  0.4258,  ...,  0.2577, -1.6957, -0.0245],
        [-1.5241, -0.1893,  0.4258,  ...,  0.2577, -1.6957, -0.0245],
        [-1.5241, -0.1893,  0.4258,  ...,  0.2577, -1.6957, -0.0245]])
[Feb 03, 12:15:09] #>>>>> self.linear query : Parameter containing:
tensor([[-0.0301, -0.0307, -0.0115,  ..., -0.0231, -0.0023, -0.0216],
        [ 0.0053,  0.0023, -0.0308,  ...,  0.0108,  0.0011,  0.0201],
        [-0.0220,  0.0370,  0.0339,  ..., -0.0023, -0.0172,  0.0244],
        ...,
        [ 0.0222,  0.0115, -0.0246,  ...,  0.0389, -0.0034, -0.0165],
        [-0.0146,  0.0392,  0.0131,  ..., -0.0055,  0.0219, -0.0368],
        [ 0.0071,  0.0256, -0.0346

100%|██████████| 1/1 [00:00<00:00, 45.09it/s]


Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BartFiDModelForDownstreamTasks.forward` and have been ignored: example_id. If example_id are not expected by `BartFiDModelForDownstreamTasks.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 1


[
    {
        "example_id": "0",
        "text": "The water vapor in the exhaust from the engine is mixed with the cold air, and condenses into droplets or ice crystals. \n\nThe water is then carried away by the wind, and the air is heated up, and then cooled down.  The water vapor condenses back into droplet form, and is then blown away by wind.  \nThe air is then heated up again, and it condenses again, creating a cloud."
    }
]


In [2]:
questions = ["Why do we have different tax brackets ? "]
answers = lfqa_pipeline.run(questions)
print(json.dumps(answers, indent=4))

100%|██████████| 1/1 [00:00<00:00, 27.45it/s]


Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

The following columns in the test set don't have a corresponding argument in `BartFiDModelForDownstreamTasks.forward` and have been ignored: example_id. If example_id are not expected by `BartFiDModelForDownstreamTasks.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 1


[
    {
        "example_id": "0",
        "text": "The idea is that the higher the income, the more you earn, the higher your tax burden. \n\nIf you make $100,000, you pay $100.00 in taxes. If you make a million, you only pay $50,000 in taxes, and you pay the same amount of taxes on all of your income.  \nIf your income is $100k, you're paying $50k in taxes on $100K.  If you earn $100m, you are paying $100M in taxes and you're only paying $60K in taxes - you're still paying $40K in tax.  You're still only paying the same percentage of your money.  So you're not paying the full amount of your $100mil.  But you're also paying the exact same amount as someone who makes $100 million.  That's why you're in the same bracket. \n\n\nIf someone makes $200m, they pay $200M in tax, and they're paying the $50K tax."
    }
]
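
The printed results carry only an `example_id` and the generated `text`, so when several questions are run at once it helps to join each answer back to its question. The sketch below is one way to do that. It assumes `example_id` is the zero-based position of the question in the input list, which matches the single-question outputs above but is not confirmed for larger batches. The helper `pair_questions_with_answers` is hypothetical and not part of PrimeQA, and the sample data is a stand-in shaped like the printed output.

```python
def pair_questions_with_answers(questions, answers):
    """Attach each answer to its question.

    Assumes answer["example_id"] is the question's index in `questions`.
    """
    paired = []
    for answer in answers:
        idx = int(answer["example_id"])
        paired.append(
            {
                "question": questions[idx].strip(),
                "answer": answer["text"].strip(),
            }
        )
    return paired


if __name__ == "__main__":
    # Stand-in data with the same shape as the pipeline output shown above.
    sample_questions = ["Why do we have different tax brackets ? "]
    sample_answers = [{"example_id": "0", "text": "placeholder answer text"}]
    for item in pair_questions_with_answers(sample_questions, sample_answers):
        print(item["question"])
        print(item["answer"])
```

With the real pipeline, the call would be `pair_questions_with_answers(questions, answers)` using the variables from the cells above.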
