### Watson Discovery with PrimeQA Reader example
This notebook walks through a retriever-reader pipeline: Watson Discovery searches a document collection, a ColBERT reranker reorders the retrieved documents, and a PrimeQA Extractive Reader finds answer spans in them.

In [3]:
# Additional Dependencies - ibm_watson
! pip install ibm_watson



In [4]:
# Set these parameters to configure the Watson Discovery Search
endpoint="https://api.us-south.discovery.watson.cloud.ibm.com/instances/5c9b84e4-f2e1-4aff-854f-651771aaa464"
api_key="your-api-key"
project_id="8c50236a-d8db-4af4-b640-c40c2d3fe671"
collection_id="32fe88df-c5ce-4cd8-8b9b-5069c772b2ea"
index_name="32fe88df-c5ce-4cd8-8b9b-5069c772b2ea:passages"
max_num_documents=5
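
The cell above hard-codes `api_key` and the service identifiers. One way to keep credentials out of the notebook is to read them from environment variables. The sketch below is only an illustration: the helper name `load_wd_config` and the variable names (`WD_ENDPOINT`, `WD_API_KEY`, `WD_PROJECT_ID`, `WD_MAX_DOCS`) are assumptions, not part of the original notebook or an IBM convention.

```python
import os


def load_wd_config(env=None):
    """Read Watson Discovery settings from environment variables.

    The variable names used here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    config = {
        "endpoint": env.get("WD_ENDPOINT"),
        "api_key": env.get("WD_API_KEY"),
        "project_id": env.get("WD_PROJECT_ID"),
        # Fall back to the notebook's default of 5 documents
        "max_num_documents": int(env.get("WD_MAX_DOCS", "5")),
    }
    # Fail early with a clear message instead of a late authentication error
    missing = [key for key, value in config.items() if value is None]
    if missing:
        raise RuntimeError(f"Missing Watson Discovery settings: {missing}")
    return config
```

With the variables exported in the shell, `cfg = load_wd_config()` would supply the values that the later cells read from `endpoint`, `api_key`, and `project_id`.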

In [5]:
from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import (
    IAMAuthenticator,
    BearerTokenAuthenticator,
)
import pandas as pd
from IPython.display import display, HTML

In [6]:
# This example uses a collection of documents from the InsuranceLib corpus.
# Initialize the Watson Discovery connection and look up the collection id by name.
WDS = DiscoveryV2(version="2020-08-30", authenticator=IAMAuthenticator(apikey=api_key))
WDS.set_service_url(endpoint)
collections = WDS.list_collections(project_id=project_id).get_result()["collections"]
# Clear the configured value so a failed lookup is detected rather than silently reused
collection_id = None
for collection in collections:
    if collection['name'] == index_name:
        collection_id = collection['collection_id']
        break
if collection_id is None:
    raise RuntimeError(f"Index not found {index_name}")

print(f'collection_id: {collection_id}')


collection_id: 6c16e135-5086-950f-0000-017d2ea3a766


In [8]:
# Retrieve documents
question = "when can I drop collision on my auto policy ?"
hits = WDS.query(
        project_id=project_id,
        collection_ids=[collection_id],
        natural_language_query=question,
        count=max_num_documents).get_result()["results"]

print(f'Number of hits: {len(hits)}')

# Convert each Watson Discovery hit into the document/score structure used downstream
results = []
for i, hit in enumerate(hits):
    results.append({
        "document": {
            "rank": i,
            "document_id": hit.get("document_id"),
            "text": hit["text"][0],
            "title": hit.get("title"),
        },
        "score": hit['result_metadata']['confidence'],
    })

results_to_display = [result['document'] for result in results]
df = pd.DataFrame.from_records(results_to_display, columns=['rank','document_id','title','text'])
print('======================================================================')
print(f'QUERY: {question}')
display( HTML(df.to_html()) )

                

Number of hits: 5
QUERY: when can I drop collision on my auto policy ?


index,rank,document_id,title,text
0,0,384517da-ffef-4a1a-814b-59a16e7306cb,6048,"the single best thing you can do is shop around with other auto insurance companies . each company is unique in that they have their own appetite for what they like and do n't like . you may fit right into some company 's `` sweet spot '' . but you will never know which companies can beat your current rates . as far as lowering premium on an active policy , you have to be careful . normally when drivers lower their coverages or cut them out altogether , they tend to make cuts in the wrong areas . it 's important to understand the difference between a controlled risk and an uncontrolled risk . a controlled risk is one where by lowering or dropping coverage , you know the exact dollar amount of the extra risk you are taking on . this includes raising your deductibles on comprehensive and collision coverage -lrb- or dropping it altogether -rrb- , dropping additional coverage such as rental and roadside assistance/towing . stay away from lowering uncontrolled risks ! namely liability coverage and uninsured motorist bodily injury coverage . this is where people get burned ! keep both of these as high as you can afford them ."
1,1,234aff19-c002-48a9-baa2-4aea6fae6914,18434,i rarely recommend that any driver drop collision coverage . regardless of the premium there will be a sum recovered if the car is damaged in a collision and that might be extremely important to the customer . i do advise them that as the car loses value the amount they will recover drops as well and that for some older cars a minor collision would result in the car being declared a total .
2,2,d686b2a5-12c6-4787-b00c-c672782b8e35,12610,dropping collision on your car depends on how much you wish to self insure . as long as you do not have a loan you can drop collision anytime . however one of the more common times to consider dropping collision is if your vehicle is worth $ 3500.00 or less . the reason for this thinking is this is the amount uninsured motorist property damage would pick up in the event you are hit by someone with no insurance . of course your still on your own if you cause the accident or if your involved in a hit and run type situation where you do n't know who caused the accident . you may want to look at what you actually are paying for collision coverage and what the vehicle is worth in a total loss since this is all your going to get .
3,3,82c021bd-d613-4ce1-af4c-6e759a6a0f98,18084,"i understand it 's hard when you have a driver with , shall we say , a less than stellar driving record . most companies , however , will want your husband to be named on your auto insurance policy . over the years they 've realized that people drive the cars that they own whether they 're insured on a policy or not . it 's just the reality of the situation . however , that does n't mean that your rates have to be sky-high , as you put it . there are some pointers i can give you to help minimize the impact . your husband can get his own insurance separate from yours . if his record is bad enough that he has be placed with a non standard -lrb- high risk -rrb- company , you can still have a policy through a standard company . as long as he has insurance , he does n't have to be listed as a driver on your policy . whether you choose to have him get his own policy or list him on yours , list him on a car that does n't have physical damage coverage -lrb- comp and collision -rrb- . physical damage coverage on a non standard policy can cost big dollars , as much as 75 % of the total cost of the insurance . track his driving record . any violation will eventually drop off for rating purposes . most minor violations will drop off after two years , at fault accidents will usually drop off after three years , and major violations -lrb- dui etc. -rrb- will drop off after five years . as soon as he 's eligible to be insured through a standard carrier , list him . many people do n't realize this and continue to pay non standard rates even after they 're eligible . i 'll preface this last point by saying i do n't recommend it . some companies can exclude a driver . in so doing they 're not rated on the policy and they 're record does n't matter . but realize that there is no coverage if they get into an accident , even if they 're driving a car on the policy . once again , i do not recommend this . 
from time to time i see a driver that has a problem driving safely . they routinely get tickets and have accidents . some will even just stop carrying insurance because it 's so expensive but do n't seem to see the correlation between their driving and their insurance rates . if he will , get him into a safe driving program . if he wo n't , you 'll probably be dealing with this for a long time . good luck ."
4,4,6ce4dec0-b137-4547-b817-62aa0ce75cd6,13033,based on the question i would assume that you are talking about collision coverage for you auto . collision coverage is typically paired with other than collision coverage in your auto policy and both have their own deductible levels . collision coverage typicallyprovides for payment of damage to your vehicle when you are involved in an accident even if you are deemed at fault . please read your policy completely to understand the coverage provided and any exclusions that ther may be or contact your local agent to have them go over the policy with you .


In [10]:
# Rerank search results using ColBERTReranker
# Download model if needed
! wget https://huggingface.co/PrimeQA/DrDecr_XOR-TyDi_whitebox/resolve/main/DrDecr.dnn


--2023-03-22 08:17:35--  https://huggingface.co/PrimeQA/DrDecr_XOR-TyDi_whitebox/resolve/main/DrDecr.dnn
Resolving huggingface.co (huggingface.co)... 34.202.121.154, 34.203.133.210, 35.173.225.216, ...
Connecting to huggingface.co (huggingface.co)|34.202.121.154|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/d4/ef/d4ef44ce7d987b0ad737d45af61c195b32745b69da94de28f652bef09436ef7d/b9243c4014ae3fc2d779c6560900962d26262ec76137f76140c9f95154ca9522?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27DrDecr.dnn%3B+filename%3D%22DrDecr.dnn%22%3B&Expires=1679746656&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2Q0L2VmL2Q0ZWY0NGNlN2Q5ODdiMGFkNzM3ZDQ1YWY2MWMxOTViMzI3NDViNjlkYTk0ZGUyOGY2NTJiZWYwOTQzNmVmN2QvYjkyNDNjNDAxNGFlM2ZjMmQ3NzljNjU2MDkwMDk2MmQyNjI2MmVjNzYxMzdmNzYxNDBjOWY5NTE1NGNhOTUyMj9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7

In [18]:
# Run ColBERT Reranker
from primeqa.components.reranker.colbert_reranker import ColBERTReranker
model_name_or_path = "DrDecr.dnn"
max_reranked_documents = 2
reranker = ColBERTReranker(model=model_name_or_path)
reranker.load()

reranked_results = reranker.predict(queries=[question], documents=[results], max_num_documents=max_reranked_documents)

print(reranked_results)

reranked_results_to_display = [result['document'] for result in reranked_results[0]]
df = pd.DataFrame.from_records(reranked_results_to_display, columns=['rank','document_id','title','text'])
print('======================================================================')
print(f'QUERY: {question}')
display( HTML(df.to_html()) )


[Mar 22, 08:33:00] #>>>>> at ColBERT name (model type) : DrDecr.dnn
[Mar 22, 08:33:00] #>>>>> at BaseColBERT name (model type) : DrDecr.dnn
[Mar 22, 08:33:03] factory model type: xlm-roberta-base
[Mar 22, 08:33:15] get query model type: xlm-roberta-base
[Mar 22, 08:33:16] get doc model type: xlm-roberta-base
[Mar 22, 08:33:17] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Mar 22, 08:33:17] #> Input: $ when can I drop collision on my auto policy ?, 		 True, 		 None
[Mar 22, 08:33:17] #> Output IDs: torch.Size([32]), tensor([    0,  9748,  3229,   831,    87, 36069, 61770,  6889,    98,   759,
         1809, 44930,   705,     2,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1])
[Mar 22, 08:33:17] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Mar 22, 08:33:17] #>>



[Mar 22, 08:33:18] #>>>> before linear query ==
[Mar 22, 08:33:18] #>>>>> Q: torch.Size([32, 768]), tensor([[-0.1172,  0.0835,  0.1692,  ..., -0.3383, -0.4692,  0.3933],
        [-0.7007, -0.3939,  0.0785,  ...,  0.0186, -0.3977,  0.3549],
        [-0.8649, -0.5747,  0.0779,  ...,  0.1467, -0.5632,  0.3818],
        ...,
        [-0.3437,  0.1829, -0.0974,  ..., -0.3219, -0.7634,  0.9359],
        [-0.3437,  0.1829, -0.0974,  ..., -0.3219, -0.7634,  0.9359],
        [-0.3437,  0.1829, -0.0974,  ..., -0.3219, -0.7634,  0.9359]])
[Mar 22, 08:33:18] #>>>>> self.linear query : Parameter containing:
tensor([[-0.0286,  0.0017, -0.0202,  ..., -0.0262,  0.0210,  0.0006],
        [-0.0102,  0.0121, -0.0111,  ..., -0.0362, -0.0165, -0.0012],
        [-0.0047, -0.0172, -0.0054,  ..., -0.0069, -0.0194, -0.0193],
        ...,
        [-0.0286,  0.0231,  0.0004,  ...,  0.0373, -0.0045,  0.0125],
        [ 0.0051,  0.0023,  0.0212,  ..., -0.0254,  0.0034,  0.0206],
        [-0.0068,  0.0256, -0.0263,

index,rank,document_id,title,text
0,1,234aff19-c002-48a9-baa2-4aea6fae6914,18434,i rarely recommend that any driver drop collision coverage . regardless of the premium there will be a sum recovered if the car is damaged in a collision and that might be extremely important to the customer . i do advise them that as the car loses value the amount they will recover drops as well and that for some older cars a minor collision would result in the car being declared a total .
1,4,6ce4dec0-b137-4547-b817-62aa0ce75cd6,13033,based on the question i would assume that you are talking about collision coverage for you auto . collision coverage is typically paired with other than collision coverage in your auto policy and both have their own deductible levels . collision coverage typicallyprovides for payment of damage to your vehicle when you are involved in an accident even if you are deemed at fault . please read your policy completely to understand the coverage provided and any exclusions that ther may be or contact your local agent to have them go over the policy with you .
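
ColBERT-style rerankers score a passage by late interaction: each query token embedding is matched with its most similar passage token embedding, and those maxima are summed. The sketch below illustrates that scoring rule on toy 2-dimensional vectors in plain Python. It is not PrimeQA's implementation; `dot`, `maxsim_score`, and `rerank` are hypothetical helper names, and the vectors are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, docs, top_k):
    """Return the ids of the top_k documents by MaxSim score, best first."""
    scored = [(maxsim_score(query_vecs, vecs), doc_id) for doc_id, vecs in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy token embeddings (illustrative only)
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # has a strong match for both query tokens
    "b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
    "c": [[0.4, 0.4]],              # weak match for both
}
top = rerank(query, docs, top_k=2)
print(top)
```

As with `max_num_documents=max_reranked_documents` in the cell above, `top_k` simply truncates the reordered list.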


In [19]:
# import the PrimeQA reader
from primeqa.components.reader.extractive import ExtractiveReader
import json

In [20]:
# Instantiate Reader
reader = ExtractiveReader()
reader.load()

{"time":"2023-03-22 08:34:59,396", "name": "ExtractiveQAHead", "level": "INFO", "message": "Loading dropout value 0.1 from config attribute 'hidden_dropout_prob'"}
{"time":"2023-03-22 08:35:00,091", "name": "XLMRobertaModelForDownstreamTasks", "level": "INFO", "message": "Setting task head for first time to 'None'"}


In [36]:
# Predict answers using WD search results
contexts = [[result['document']['text'] for result in results]]
answers = reader.predict([question], contexts)

print(f"Question: {question}")
print("Answers using WD search results:")
print(json.dumps(answers,indent=4))

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

{"time":"2023-03-22 08:51:27,290", "name": "primeqa.mrc.trainers.mrc", "level": "INFO", "message": "The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaModelForDownstreamTasks.forward` and have been ignored: context_idx, example_id, example_idx, offset_mapping."}


***** Running Evaluation *****
  Num examples = 6
  Batch size = 8


100%|██████████| 1/1 [00:00<00:00, 29.70it/s]

Question: when can I drop collision on my auto policy ?
Answers using WD search results:
{
    "0": [
        {
            "example_id": "0",
            "passage_index": 1,
            "span_answer_text": "as the car loses value",
            "span_answer": {
                "start_position": 232,
                "end_position": 254
            },
            "span_answer_score": 9.767881229519844,
            "confidence_score": 0.40340543499795295
        },
        {
            "example_id": "0",
            "passage_index": 1,
            "span_answer_text": "if the car is damaged in a collision",
            "span_answer": {
                "start_position": 117,
                "end_position": 153
            },
            "span_answer_score": 9.520724594593048,
            "confidence_score": 0.31506704689959253
        },
        {
            "example_id": "0",
            "passage_index": 1,
            "span_answer_text": "regardless of the premium there will be a sum re




In [37]:
# Predict answers using reranked results
contexts = [[result['document']['text'] for result in reranked_results[0]]]
print(json.dumps(contexts, indent=2))
answers = reader.predict([question], contexts)

print(f"Question: {question}")
print("Answers using Reranked WD search results:")
print(json.dumps(answers,indent=4))

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


[
  [
    "i rarely recommend that any driver drop collision coverage . regardless of the premium there will be a sum recovered if the car is damaged in a collision and that might be extremely important to the customer . i do advise them that as the car loses value the amount they will recover drops as well and that for some older cars a minor collision would result in the car being declared a total .",
    "based on the question i would assume that you are talking about collision coverage for you auto . collision coverage is typically paired with other than collision coverage in your auto policy and both have their own deductible levels . collision coverage typicallyprovides for payment of damage to your vehicle when you are involved in an accident even if you are deemed at fault . please read your policy completely to understand the coverage provided and any exclusions that ther may be or contact your local agent to have them go over the policy with you ."
  ]
]


Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

{"time":"2023-03-22 08:51:49,806", "name": "primeqa.mrc.trainers.mrc", "level": "INFO", "message": "The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaModelForDownstreamTasks.forward` and have been ignored: context_idx, example_id, example_idx, offset_mapping."}


***** Running Evaluation *****
  Num examples = 2
  Batch size = 8


100%|██████████| 1/1 [00:00<00:00, 94.63it/s]

Question: when can I drop collision on my auto policy ?
Answers using Reranked WD search results:
{
    "0": [
        {
            "example_id": "0",
            "passage_index": 0,
            "span_answer_text": "as the car loses value",
            "span_answer": {
                "start_position": 232,
                "end_position": 254
            },
            "span_answer_score": 9.767880752682686,
            "confidence_score": 0.40340543499795295
        },
        {
            "example_id": "0",
            "passage_index": 0,
            "span_answer_text": "if the car is damaged in a collision",
            "span_answer": {
                "start_position": 117,
                "end_position": 153
            },
            "span_answer_score": 9.52072411775589,
            "confidence_score": 0.31506704689959253
        },
        {
            "example_id": "0",
            "passage_index": 0,
            "span_answer_text": "regardless of the premium there will be 



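
The reader returns a dictionary keyed by example id, where each entry is a list of candidate spans carrying a `confidence_score`, as in the outputs above. A small helper for keeping only the most confident span per question could look like the following sketch. `best_answers` and its `min_confidence` parameter are hypothetical, and the sample data reuses only the first two candidates shown in the reranked output.

```python
def best_answers(answers, min_confidence=0.0):
    """Pick the highest-confidence span for each example id.

    Examples whose candidates all fall below min_confidence are omitted.
    """
    best = {}
    for example_id, candidates in answers.items():
        kept = [c for c in candidates if c["confidence_score"] >= min_confidence]
        if kept:
            best[example_id] = max(kept, key=lambda c: c["confidence_score"])
    return best


# Sample shaped like the reader output above (first two candidates only)
sample = {
    "0": [
        {
            "passage_index": 0,
            "span_answer_text": "as the car loses value",
            "confidence_score": 0.40340543499795295,
        },
        {
            "passage_index": 0,
            "span_answer_text": "if the car is damaged in a collision",
            "confidence_score": 0.31506704689959253,
        },
    ]
}
top = best_answers(sample)
print(top["0"]["span_answer_text"])
```

Raising `min_confidence` is one way to suppress low-confidence spans instead of always reporting an answer.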