# Medical Few-shot OpenQA

## Set-up

### General set-up

In [None]:
!pip install -r requirements.txt

In [1]:
import collections
from contextlib import nullcontext
from collections import namedtuple
from datasets import load_dataset
import json
import numpy as np
import random
import re 
import string
import torch
from typing import List

Try to set all the seeds for reproducibility (won't extend to GPT-3):

In [2]:
seed = 1

np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

### Language model set-up

In [3]:
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
transformers.logging.set_verbosity_error()

### ColBERT set-up

Our retriever will be a ColBERT-based model ([Khattab and Zaharia 2020](https://arxiv.org/abs/2004.12832)). ColBERT is a powerful neural information retrieval (Neural IR) model that has proven extremely successful in retrieval applications and as a component in a variety of different systems for OpenQA and other knowledge-intensive tasks (e.g., [Khattab et al. 2021a](https://aclanthology.org/2021.tacl-1.55/); [Khattab et al. 2021b](https://proceedings.neurips.cc/paper/2021/hash/e8b1cbd05f6e6a358a81dee52493dd06-Abstract.html); [Santhanam, Khattab, et al. 2021](https://arxiv.org/abs/2112.01488)).

The following will clone the ColBERTv2 repository for use in this notebook:

In [8]:
!git clone -b cpu_inference https://github.com/stanford-futuredata/ColBERT.git

Cloning into 'ColBERT'...
remote: Enumerating objects: 764, done.
remote: Counting objects: 100% (379/379), done.
remote: Compressing objects: 100% (132/132), done.
remote: Total 764 (delta 272), reused 318 (delta 247), pack-reused 385
Receiving objects: 100% (764/764), 302.54 KiB | 1.11 MiB/s, done.
Resolving deltas: 100% (424/424), done.


In [4]:
import os
import sys
sys.path.insert(0, 'ColBERT/')

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Collection
from colbert.searcher import Searcher
from utility.utils.dpr import has_answer, DPR_normalize

## Language models

In few-shot OpenQA, the language model (LM) must read in a prompt and answer the question posed somewhere in the prompt. 

### Answerhood

In [5]:
def _find_generated_answer(tokens, newline="\n" ): 
    """Our LMs tend to insert initial newline characters before
    they begin generating text. This function ensures that we 
    properly capture the true first line as the answer while
    also ensuring that token probabilities are aligned."""        
    answer_token_indices = []
    char_seen = False            
    for i, tok in enumerate(tokens):
        # This is the main condition: a newline that isn't an initial
        # string of newlines:
        if tok == newline and char_seen:
            break
        # Keep the initial newlines for consistency:
        elif tok == newline and not char_seen:
            answer_token_indices.append(i)
        # Proper tokens:
        elif tok != newline:
            char_seen = True
            answer_token_indices.append(i)
    return answer_token_indices 
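
To make the behavior concrete, here is a small self-contained sketch. `first_answer_indices` is a hypothetical, condensed restatement of the function above (it is not meant to replace it), and the token list is made up:

```python
def first_answer_indices(tokens, newline="\n"):
    """Condensed restatement of `_find_generated_answer` for illustration."""
    indices = []
    char_seen = False
    for i, tok in enumerate(tokens):
        # Stop at the first newline that follows real content:
        if tok == newline and char_seen:
            break
        if tok != newline:
            char_seen = True
        # Leading newlines and proper tokens are both kept:
        indices.append(i)
    return indices


# Two leading newlines, a two-token answer, then a second line of text.
tokens = ["\n", "\n", "18", "85", "\n", "Q", ":"]
kept = first_answer_indices(tokens)
answer = "".join(tokens[i] for i in kept).strip()
print(kept, repr(answer))
```

The leading newlines stay in the index list so that the token probabilities remain aligned with the tokens; only the text after the first post-content newline is dropped.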

### Eleuther models from Hugging Face

In [6]:
# "gpt-neo-125M" "gpt-neo-1.3B" "gpt-neo-2.7B" "gpt-j-6B"
eleuther_model_name = "gpt-neo-125M"

eleuther_tokenizer = AutoTokenizer.from_pretrained(
    f"EleutherAI/{eleuther_model_name}", 
    padding_side="left", 
    padding='longest', 
    truncation='longest_first', max_length=2000)
eleuther_tokenizer.pad_token = eleuther_tokenizer.eos_token

eleuther_model = AutoModelForCausalLM.from_pretrained(
    f"EleutherAI/{eleuther_model_name}")

In [7]:
def run_eleuther(prompts, temperature=0.1, top_p=0.95, **generate_kwargs): 
    """
    Parameters
    ----------
    prompts : iterable of str
    temperature : float
        It seems best to set it low for this task!
    top_p : float
       
    For options for `generate_kwargs`, see:
    
    https://huggingface.co/docs/transformers/master/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate
    
    Options that are likely to be especially relevant include 
    `temperature`, `length_penalty`, and the parameters that
    determine the decoding strategy. With `num_return_sequences > 1`,
    the default parameters in this function do multinomial sampling.
    
    Returns
    -------
    list of dicts
    
    {"prompt": str,
     "generated_text": str, "generated_tokens": list of str, "generated_probs": list of float,
     "generated_answer": str, "generated_answer_tokens": list of str, "generated_answer_probs": list of float
    }
         
    """
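    # Keep the attention mask so that left-padding tokens are ignored
    # when the prompts in a batch have different lengths:
    inputs = eleuther_tokenizer(
        prompts, return_tensors="pt", padding=True)
    prompt_ids = inputs.input_ids
        
    with torch.inference_mode():
        # Automatic mixed precision if possible.
        with torch.cuda.amp.autocast() if torch.cuda.is_available() else nullcontext():
            model_output = eleuther_model.generate(
                prompt_ids,
                attention_mask=inputs.attention_mask,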
                temperature=temperature,
                do_sample=True,
                top_p=top_p,           
                max_new_tokens=16,
                num_return_sequences=1,                
                pad_token_id=eleuther_tokenizer.eos_token_id, 
                return_dict_in_generate=True,
                output_scores=True,
                **generate_kwargs)
        
    # Converting output scores using the helpful recipe here:
    # https://discuss.huggingface.co/t/generation-probabilities-how-to-compute-probabilities-of-output-scores-for-gpt2/3175
    gen_ids = model_output.sequences[:, prompt_ids.shape[-1] :]
    gen_probs = torch.stack(model_output.scores, dim=1).softmax(-1)
    gen_probs = torch.gather(gen_probs, 2, gen_ids[:, :, None]).squeeze(-1)
    
    # Generated texts, including the prompts:
    gen_texts = eleuther_tokenizer.batch_decode(
        model_output.sequences, skip_special_tokens=True)
    
    data = []     
    iterator = zip(prompts, gen_ids, gen_texts, gen_probs)    
    for prompt, gen_id, gen_text, gen_prob in iterator:       
        gen_tokens = eleuther_tokenizer.convert_ids_to_tokens(gen_id)
        generated_text = gen_text[len(prompt): ]
        gen_prob = [float(x) for x in gen_prob.numpy()] # float for JSON storage
        ans_indices = _find_generated_answer(gen_tokens, newline="Ċ")
        answer_tokens = [gen_tokens[i] for i in ans_indices]
        answer_probs = [gen_prob[i] for i in ans_indices]
        answer = "".join(answer_tokens).replace("Ġ", " ").replace("Ċ", "\n")                                       
        data.append({
            "prompt": prompt,
            "generated_text": generated_text,
            "generated_tokens": gen_tokens,
            "generated_probs": gen_prob,
            "generated_answer": answer,
            "generated_answer_probs": answer_probs,
            "generated_answer_tokens": answer_tokens})                        

    return data
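
The probability bookkeeping in `run_eleuther` (softmax over each step's scores, then picking out the entry for the token that was actually generated) can be hard to read in tensor form. The following is a plain-Python sketch of the same idea for a single sequence; the scores and token ids are made up, and `softmax` here is a local helper rather than a library call:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for 2 generation steps over a 4-token vocabulary.
step_scores = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 3.0, 0.2, 0.1],
]
# The token id that was generated at each step.
generated_ids = [0, 1]

# One probability per generated token, as in "generated_probs".
token_probs = [
    softmax(scores)[tok] for scores, tok in zip(step_scores, generated_ids)
]
print(token_probs)
```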

In [None]:
%%time
## test run

eleuther_ex = run_eleuther([    
    "What year was Stanford University founded?", 
    "In which year did Stanford first enroll students?"])

eleuther_ex

## Dataset Loading


### SQuAD

In [9]:
squad = load_dataset("squad")

Reusing dataset squad (/home/zhanj289/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)



The following utility just reads a SQuAD split in as a list of `SquadExample` instances:

In [10]:
SquadExample = namedtuple("SquadExample",  "id title context question answers")

In [11]:
def get_squad_split(squad, split="validation"):
    """
    Use `split='train'` for the train split.
    
    Returns
    -------
    list of SquadExample named tuples with attributes
    id, title, context, question, answers
    
    """    
    fields = squad[split].features
    data = zip(*[squad[split][field] for field in fields])
    return [SquadExample(eid, title, context, question, answers["text"]) 
            for eid, title, context, question, answers in data]

In [12]:
## Split Dev and Train

In [13]:
fields = squad['validation'].features
data = zip(*[squad['validation'][field] for field in fields])

In [14]:
squad_dev = get_squad_split(squad)

In [15]:
squad_dev[100]

SquadExample(id='56d602631c85041400946edb', title='Super_Bowl_50', context='CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game. The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively. It was the third-most watched U.S. broadcast ever.', question='Who were special guests for the Super Bowl halftime show?', answers=['Beyoncé and Bruno Mars', 'Beyoncé and Bruno Mars', 'Beyoncé and Bruno Mars'])

In [55]:
dev_exs = sorted(squad_dev, key=lambda x: hash(x.id))[: 200]
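
One caveat about the cell above: Python salts the built-in `hash` of strings per process unless `PYTHONHASHSEED` is fixed, so this ordering (and therefore the 200-example subset) can change between sessions. A minimal sketch of a stable alternative, assuming a content hash is acceptable as the sort key; `stable_key` and the toy ids are hypothetical:

```python
import hashlib


def stable_key(example_id):
    """Hex digest of the id; identical across Python processes."""
    return hashlib.md5(example_id.encode("utf-8")).hexdigest()


toy_ids = ["56d602631c85041400946edb", "a1", "b2", "c3"]

# Pseudo-random but reproducible ordering, then take a fixed-size subset.
subset = sorted(toy_ids, key=stable_key)[:2]
print(subset)
```

With the notebook's data, the same idea would read `sorted(squad_dev, key=lambda x: stable_key(x.id))[: 200]`.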

In [56]:
squad_train = get_squad_split(squad, "train")

In [25]:
squad['train']

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

### BioASQ

In [8]:
# with open('./data/bioasq/squad.json', 'r') as f:
#     squad_test = json.load(f)

In [16]:
with open('./data/bioasq/training10b.json', 'r') as f:
    bioasq_json = json.load(f)

In [17]:
# Keep factoid and list questions; skip summary and yes/no questions.

In [None]:
bioasq_json['questions'][0]['snippets']

In [226]:
# text_dict = {}

# for snip in bioasq_json['questions'][0]['snippets']:
#     if snip['beginSection'] == 'abstract':
#         for k in range(snip['offsetInBeginSection'], snip['offsetInEndSection']):
#             text_dict[k] = snip['text'][k- snip['offsetInBeginSection']]

In [227]:
# recon_text = ''
# for key in sorted(text_dict.keys()):
#     recon_text += text_dict[key]

In [228]:
# recon_text

"BACKGROUND: RET is the major gene associated to Hirschsprung disease (HSCR) with differential contributions of its rare and common, coding and noncodinIn the etiology of Hirschsprung disease various genes play a role; these are: RET, EDNRB, GDNF, EDN3 and SOX10, NTN3, ECE1, Mutations in these genes may result in dominant, recessive or multifactorial patterns of inheritance.Coding sequence mutaOn the basis of a skewed sex-ratio (M/F = 4/1) and a risk to relatives much higher than the incidence in the general population, HSCR has long been regarded as a sex-modified multifactorial disorderhermore, mutations in the RET gene are responsible for approximately half of the familial and some sThe majority of the identified genes are related to Mendelian syndromic forms of Hirschsprung's diseasee The non-Mendelian inheritance of sporadic non-syndromic Hirschsprung's disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model expression in malese HSCR p

In [None]:
bioasq_json['questions'][1]

In [19]:
### Construct dataset
count_factoid = 0
count_list = 0
count_summary = 0
count_yesno = 0

bioasq_list = []

for i, sample in enumerate(bioasq_json['questions']):

    if sample['type'] == 'summary':
        count_summary += 1
    if sample['type'] == 'yesno':
        count_yesno += 1

    if sample['type'] in ['factoid', 'list']:

        # Context: strip each snippet, concatenate the snippets with
        # spaces, and use the result as the passage.
        context = ''
        for snip in [ele['text'].strip() for ele in sample['snippets']]:
            context += snip + ' '

        context = context.replace('\n', ' ')

        # Limit the context length (in characters) to keep prompts short.
        context = context[:1024]

        # Question
        question = sample['body'].replace('\n', ' ')

        # Answer: factoid and list questions are stored differently.
        if sample['type'] == 'factoid':
            answer = sample['exact_answer']
            count_factoid += 1

        if sample['type'] == 'list':
            # Flatten the nested answer lists into a single list.
            answer = [x for y in sample['exact_answer'] for x in y]
            count_list += 1

        # Construct a SQuAD-style QA example.
        bioasq_list.append({
            'id': i,
            'context': context,
            'question': question,
            'answers': answer,
            'type': sample['type']
        })

print(f'we have {count_factoid} factoid questions, {count_list} list questions, {count_summary} summary questions, {count_yesno} yesno questions')

print(f'total is {count_factoid + count_list + count_summary + count_yesno}')

we have 1252 factoid questions, 816 list questions, 1018 summary questions, 1148 yesno questions
total is 4234


In [20]:
len(bioasq_list)

2068

In [21]:
from sklearn.model_selection import train_test_split
def get_bioasq_split(bioasq_list, random_state):
    """
    Split the examples into roughly 10% train, 10% dev, and 80% test.
    
    Returns
    -------
    three lists (train, dev, test) of BioasqExample named tuples
    with attributes id, context, question, answers
    
    """
    BioasqExample = namedtuple("BioasqExample",  "id context question answers")
    
    bioasq_data = [BioasqExample(ele['id'], ele['context'], ele['question'], ele['answers']) for ele in bioasq_list]
    
    # Hold out 90% of the data, then divide the held-out part
    # into dev and test:
    bioasq_train, held_out = train_test_split(bioasq_data, test_size=0.9, random_state=random_state)

    bioasq_dev, bioasq_test = train_test_split(held_out, test_size=0.8888, random_state=random_state)
    
    return bioasq_train, bioasq_dev, bioasq_test

In [22]:
bioasq_train, bioasq_dev, bioasq_test = get_bioasq_split(bioasq_list, random_state=40)

In [23]:
## split dev and test



In [24]:
print(f"{len(bioasq_train)}, {len(bioasq_dev)}, {len(bioasq_test)} ")

206, 207, 1655 
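
The split sizes above follow from the two `test_size` values. The arithmetic can be checked by hand with the sketch below, which assumes that scikit-learn rounds a fractional `test_size` up to a whole number of examples and gives the remainder to the other side; the variable names are ours:

```python
import math

n_total = 2068  # len(bioasq_list)

# First split: 90% is held out, the remainder is train.
n_held_out = math.ceil(0.9 * n_total)
n_train = n_total - n_held_out

# Second split of the held-out part: 88.88% is test, the remainder is dev.
n_test = math.ceil(0.8888 * n_held_out)
n_dev = n_held_out - n_test

print(n_train, n_dev, n_test)
```

In other words, the two-stage split gives approximately a 10/10/80 train/dev/test partition.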


In [25]:
## pick 20 just for sanity check
dev_exs = bioasq_dev[:20]

In [26]:
dev_exs[0]

BioasqExample(id=829, context='Nearly one half of all cases of acquired resistance to epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors (TKIs) for non-small-cell lung cancer (NSCLC) are due to the T790M mutation in EGFR exon 20. Two types of epidermal growth factor receptor (EGFR) mutations in exon 19 and exon 21 (ex19del and L858R) are prevalent in lung cancer patients and sensitive to targeted EGFR inhibition. A resistance mutation in exon 20 (T790M) has been found to accompany drug treatment when patients relapse. Acquired EGFR C797S mutation mediates resistance to AZD9291 in non-small cell lung cancer harboring EGFR T790M. However, resistance to the EGFR TKIs develops mostly secondary to T790M mutation in exon 20. The T790M mutation in EGFR accounts for approximately half of all lung cancer cases with acquired resistance to the current clinical EGFR tyrosine kinase inhibitors. In nonsmall cell lung cancer (NSCLC), the threonine(790)-methionine(790) (T790M) point mu

## Evaluation

Our evaluation protocols are the standard ones for SQuAD and related tasks: exact match of the answer (EM) and token-level F1.

We say further that the predicted answer is the first line of generated text after the prompt.

The following evaluation code is taken from the [apple/ml-qrecc](https://github.com/apple/ml-qrecc/blob/main/utils/evaluate_qa.py) repository. It performs very basic string normalization before doing the core comparisons.

In [27]:
def normalize_answer(s: str) -> str:
    """Lower text and remove punctuation, articles and extra whitespace."""

    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def get_tokens(s: str) -> List[str]:
    """Normalize string and split string into tokens."""
    if not s:
        return []
    return normalize_answer(s).split()


def compute_exact(a_gold: str, a_pred: str) -> int:
    """Compute the Exact Match score."""
    return int(normalize_answer(a_gold) == normalize_answer(a_pred))


def compute_f1_from_tokens(gold_toks: List[str], pred_toks: List[str]) -> float:
    """Compute the F1 score from tokenized gold answer and prediction."""
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())

    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)

    if num_same == 0:
        return 0

    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def compute_f1(a_gold: str, a_pred: str) -> float:
    """Compute the F1 score."""
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    return compute_f1_from_tokens(gold_toks, pred_toks)
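
As a worked example of token-level F1, the sketch below scores one made-up prediction against one made-up gold answer. `toy_tokens` is a hypothetical, simplified stand-in for `get_tokens` so that the block runs on its own:

```python
import collections
import re
import string

PUNCT = set(string.punctuation)


def toy_tokens(s):
    """Lowercase, drop punctuation and articles, split on whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return s.split()


gold = toy_tokens("The T790M mutation")
pred = toy_tokens("T790M mutation in EGFR.")

# Multiset intersection counts the tokens shared by both answers.
overlap = collections.Counter(gold) & collections.Counter(pred)
num_same = sum(overlap.values())

precision = num_same / len(pred)   # 2 of 4 predicted tokens are correct
recall = num_same / len(gold)      # 2 of 2 gold tokens are recovered
f1 = 2 * precision * recall / (precision + recall)
print(gold, pred, round(f1, 3))
```

The prediction contains the full gold answer, so recall is perfect, but the extra tokens halve the precision. Exact match for this pair would be 0.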

The following is our general evaluation function. We will make extensive use of it to evaluate different systems:

In [28]:
def evaluate(examples, prompts, gens):
    """Generic evaluation function.
    
    Parameters
    ----------
    examples: iterable of `SquadExample` or `BioasqExample` instances
    prompts: list of str
    gens: list of dicts as returned by `run_eleuther`; the value of
        "generated_answer" is evaluated as the predicted answer
    
    Returns
    -------
    dict with keys "em_per", "macro_f1", "examples", where
    each "examples" value is a dict
    
    """        
    results = []
    for ex, prompt, gen in zip(examples, prompts, gens):
        answers = ex.answers
        pred = gen['generated_answer']
        # The result is the highest EM from the available answer strings:
        em = max([compute_exact(ans, pred) for ans in answers])
        f1 = max([compute_f1(ans, pred) for ans in answers])
        gen.update({
            "id": ex.id, 
            "question": ex.question, 
            "prediction": pred, 
            "answers": answers, 
            "em": em,
            "f1": f1
        })
        results.append(gen)
    data = {}        
    data["macro_f1"] = np.mean([d['f1'] for d in results])
    data["em_per"] = sum([d['em'] for d in results]) / len(results)
    data["examples"] = results
    return data

Here is a highly simplified example to help make the logic behind `evaluate` clearer:    

In [29]:
ex = namedtuple("SquadExample",  "id title context question answers")

examples = [
    ex("0", "CS224u", 
       "The course to take is NLU!", 
       "What is the course to take?", 
       ["NLU", "CS224u"])]

prompts = ["Dear model, Please answer this question!\n\nQ: What is the course to take?\n\nA:"]

gens = [{"generated_answer": "NLU", "generated_text": "NLU\nWho am I?"}]

evaluate(examples, prompts, gens)

{'macro_f1': 1.0,
 'em_per': 1.0,
 'examples': [{'generated_answer': 'NLU',
   'generated_text': 'NLU\nWho am I?',
   'id': '0',
   'question': 'What is the course to take?',
   'prediction': 'NLU',
   'answers': ['NLU', 'CS224u'],
   'em': 1,
   'f1': 1.0}]}

The bake-off uses `macro_f1` as the primary metric.

## Open QA with no context

We now have all the pieces we need to begin building few-shot OpenQA systems. Our first system is the simplest and most naive: we simply feed the question text in as the prompt and hope that the model provides an answer as the first line of its generated text.

In [30]:
def evaluate_no_context(examples, gen_func=run_eleuther, batch_size=20):
    prompts = [] 
    gens = []
    for i in range(0, len(examples), batch_size):
        ps = [ex.question for ex in examples[i: i+batch_size]]
        gs = gen_func(ps)        
        prompts += ps
        gens += gs    
    return evaluate(examples, prompts, gens)    
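
Because `gen_func` is a parameter, the batching loop can be dry-run without loading a language model. The sketch below uses a hypothetical `stub_gen` that returns dicts with the one key `evaluate` reads from each generation (`"generated_answer"`); the questions are made up:

```python
def stub_gen(prompts):
    """Hypothetical stand-in for `run_eleuther`: always answers 'unknown'."""
    return [{"prompt": p, "generated_answer": "unknown"} for p in prompts]


questions = [f"question {n}" for n in range(45)]
batch_size = 20

gens = []
batch_sizes = []
for i in range(0, len(questions), batch_size):
    batch = questions[i: i + batch_size]
    batch_sizes.append(len(batch))
    gens += stub_gen(batch)

print(batch_sizes, len(gens))
```

The last batch is simply shorter than the others, and the generations stay in the same order as the input questions, which is what `evaluate` relies on when it zips examples, prompts, and generations together.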

In [31]:
%%time
nocontext_results = evaluate_no_context(bioasq_test)

print(nocontext_results['macro_f1'])

0.0312800712469176
CPU times: user 20min 16s, sys: 50.8 s, total: 21min 7s
Wall time: 2min 39s


## Few-shot QA

The above formulation is not especially fair to our model, since it doesn't convey anything about the intended structure of the prompt. We want the model to give us an answer to the input question, but we didn't specify that goal unambiguously. Perhaps we were looking for commentary on the question, or a count of the number of tokens it contains, or a passage containing the question string, or something else entirely.

In few-shot QA, we construct a prompt that is intended to convey our intentions more clearly. The first part of the prompt gives some examples of what we want, and the final part provides the set-up for our actual question. In the current formulation, we assume access to the gold passage. For example, if our example of interest is

```
Title: CS224u

Background: The course to take is NLU!

Q: What is the course to take?
```

with gold answer ```NLU```, then we would create a prompt with, say, 2 additional examples preceding this, to yield a full prompt like this:

```
Title: Pragmatics

Background: Pragmatics is the study of language use.

Q: What is pragmatics?

A: The study of language use

Title: Bert

Background: Bert is a Muppet who lives with Ernie.

Q: Who is Bert?

A: Bert is a Muppet

Title: CS224u

Background: The course to take is NLU!

Q: What is the course to take?

A:
```
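This is essentially the formulation used in the GPT-3 paper for SQuAD, where the context examples are drawn randomly from the SQuAD train set. We adopt the same protocol here, except that the examples are drawn from `bioasq_train` and the `Title` line is omitted, since BioASQ examples have no title. (You might revisit this in the context of your original system.)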

In [32]:
def build_few_shot_qa_prompt(ex, bioasq_train, n_context=2, joiner="\n\n"):
    segs = []
    train_exs = random.sample(bioasq_train, k=n_context)    
    for t in train_exs:
        segs += [
            # f"Title: {t.title}",
            f"Background: {t.context}",
            f"Q: {t.question}",
            f"A: {t.answers[0]}"
        ]
    segs += [
        # f"Title: {ex.title}",
        f"Background: {ex.context}",
        f"Q: {ex.question}",
        f"A:"
    ]
    return joiner.join(segs)                

Here's the sort of output we get with `n_context=1`:

In [34]:
print(build_few_shot_qa_prompt(dev_exs[2], bioasq_train, n_context=1))

Background: Over the past decade, MM therapy is significantly improved by the introduction of novel therapeutics such as immunomodulatory agents (thalidomide, lenalidomide, and pomalidomide), proteasome inhibitors (bortezomib, carfilzomib, and ixazomib), monoclonal antibodies (daratumumab and elotuzumab), histone deacetylase (HDAC) inhibitors (Panobinostat). Due to the largely incurable nature of multiple myeloma, the development of newer agents is ongoing and includes new oral PIs (ixazomib), immunotherapies (e.g., CD38- or SLAMF7-targeted antibodies), and small molecules. Ixazomib (MLN9708-MLN2238), the second-generation proteasome inhibitor, selectivity and potency were similar to that of bortezomib, is currently being investigated in phase I studies. In the last few weeks, the FDA approved three new therapies for multiple myeloma: ixazomib, the first oral proteasome inhibitor; and daratumumab and elotuzumab, two monoclonal antibodies that target CD38 and SLAMF7, respectively. Ixazo
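
Since the real prompt is long, here is a short self-contained sketch of the same layout on toy data. `ToyExample` and `toy_prompt` are hypothetical names; unlike `build_few_shot_qa_prompt`, the demonstrations are passed in explicitly rather than sampled, so the output is deterministic:

```python
from collections import namedtuple

ToyExample = namedtuple("ToyExample", "id context question answers")


def toy_prompt(ex, demos, joiner="\n\n"):
    """Same segment layout as `build_few_shot_qa_prompt`, without sampling."""
    segs = []
    for t in demos:
        segs += [
            f"Background: {t.context}",
            f"Q: {t.question}",
            f"A: {t.answers[0]}",
        ]
    # The target example ends with a bare "A:" for the LM to complete.
    segs += [f"Background: {ex.context}", f"Q: {ex.question}", "A:"]
    return joiner.join(segs)


demo = ToyExample(
    0, "Pragmatics is the study of language use.",
    "What is pragmatics?", ["The study of language use"])
target = ToyExample(
    1, "The course to take is NLU!",
    "What is the course to take?", ["NLU"])

prompt = toy_prompt(target, [demo])
print(prompt)
```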

In [35]:
def evaluate_few_shot_qa(examples, bioasq_train, gen_func=run_eleuther, batch_size=20, n_context=2):
    prompts = []
    gens = []
    for i in range(0, len(examples), batch_size):
        batch = examples[i: i+batch_size]
        ps = [build_few_shot_qa_prompt(ex, bioasq_train, n_context=n_context) for ex in batch]        
        gs = gen_func(ps)       
        prompts += ps
        gens += gs
    return evaluate(examples, prompts, gens)

In [None]:
%%time
few_shot_qa_results = evaluate_few_shot_qa(bioasq_test, bioasq_train, n_context=1)

print(few_shot_qa_results['macro_f1'])

## ColBERT

It's now just a short step to our core task, few-shot OpenQA. We just need to give up our beloved gold passage and instead try to retrieve the right passage or passages from a corpus. 

The first step is instantiating the ColBERT retriever and loading in an index. Our ColBERT retriever was initially trained on MS MARCO, and we have pre-indexed a collection of 100K documents that we know to be well-aligned with SQuAD and with the dataset used for the bake-off assessment. (See [the original system question](#Your-original-system-[3-points]) for tips on creating your own index.)

In [38]:
index_home = os.path.join("experiments", "notebook", "indexes")

### ColBERT parameters

In [39]:
if not os.path.exists(os.path.join("data", "openqa", "colbertv2.0.tar.gz")):
    !mkdir -p data/openqa
    # ColBERTv2 checkpoint trained on MS MARCO Passage Ranking (388MB compressed)
    !wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz -P data/openqa/
    !tar -xvzf data/openqa/colbertv2.0.tar.gz -C data/openqa/

If something went wrong with the above, you can just download the file https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz, unarchive it, and move the resulting `colbertv2.0` directory into the `data/openqa` directory.

### ColBERT index

In [33]:
# if not os.path.exists(os.path.join(index_home, "cs224u.collection.2bits.tgz")):
#     !wget https://web.stanford.edu/class/cs224u/data/cs224u.collection.2bits.tgz -P experiments/notebook/indexes
#     !tar -xvzf experiments/notebook/indexes/cs224u.collection.2bits.tgz -C experiments/notebook/indexes

Here we instead use an index that we built over the BioASQ passages:

In [40]:
index_home = './experiments/bioasq/indexes'

collection = os.path.join(index_home, "bioasq.all.2bits", "bioasq_passage.tsv")

collection = Collection(path=collection)

f'Loaded {len(collection):,} passages'

[Jun 03, 20:22:53] #> Loading collection...
0M 


'Loaded 2,068 passages'

In [41]:
index_name = "bioasq.all.2bits"

Now we create our `searcher`:

In [42]:
with Run().context(RunConfig(experiment='bioasq')):
    searcher = Searcher(index=index_name)

[Jun 03, 20:22:57] #> Loading collection...
0M 
[Jun 03, 20:23:14] #> Building the emb2pid mapping..
[Jun 03, 20:23:14] len(self.emb2pid) = 378124


In [43]:
len(searcher.collection)

2068

### Search

Now that the index is loaded, you can do searches over it. The index is limited, but retrieval is very solid!

In [44]:

query = "biomarker"

print(f"#> {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3) 

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t[{passage_rank}]\t{passage_score:.1f}\t {searcher.collection[passage_id]}")

#> biomarker

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . biomarker, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1, 16012, 10665,  2121,   102,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

	[1]	21.6	 The target genes SEC22B, RAB10, and FLT1 may be potential biomarkers of AD.
	[2]	21.5	 A total of 92 biomarkers were measured before a standardized meal as well as 30 and 120 minutes afterwards with the Proseek Multiplex CVD III kit PROSEEK Multiplex CVD and PROSEEK Multiplex INF A multiplex proximity extension assay allowed us to measure 157 cardiovascular disease (CVD) and inflammatory disease-related biomarkers in pa
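
The loop above uses `zip(*results)` to turn the search result into one row per passage. A minimal sketch with made-up values, assuming (as the unpacking above implies) that `searcher.search` returns three parallel sequences of passage ids, ranks, and scores:

```python
# Made-up result in the shape that the loop above unpacks:
# (passage ids, ranks, scores), all of equal length.
mock_results = ([17, 4, 250], [1, 2, 3], [21.6, 21.5, 20.9])

# Transpose the three parallel sequences into (id, rank, score) rows.
rows = list(zip(*mock_results))
for passage_id, passage_rank, passage_score in rows:
    print(f"[{passage_rank}] {passage_score:.1f} passage #{passage_id}")

# The id of the top-ranked passage is the first entry of the first sequence.
top_passage_id = mock_results[0][0]
```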

### Retrieval evaluation

For more rigorous evaluations of the retriever alone, we can use Success@`k`, defined relative to the dataset's passages and answers. We say that we have a "success" if a passage in the top `k` retrieved passages contains any of the answer substrings, and Success@`k` is the percentage of such success cases. This is very heuristic (perhaps the answer string happens to occur somewhere in a completely irrelevant passage), but it can still be good guidance.

In [45]:
def success_at_k(examples, k=20):
    scores = []
    for ex in examples: 
        # Use the caller's `k` rather than a fixed cutoff:
        scores.append(evaluate_retrieval_example(ex, k=k))
    return sum(scores) / len(scores)
        
    
def evaluate_retrieval_example(ex, k=20):    
    results = searcher.search(ex.question, k=k)
    for passage_id, passage_rank, passage_score in zip(*results):
        passage = searcher.collection[passage_id]
        score = has_answer([DPR_normalize(ans) for ans in ex.answers], passage)
        if score:
            return 1
    return 0

Here is Success@5 for the BioASQ dev split:

In [46]:
%%time
print(success_at_k(bioasq_dev, k=5))

0.7681159420289855
CPU times: user 33.9 s, sys: 1.17 s, total: 35 s
Wall time: 4.4 s
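
To illustrate how the metric depends on `k`, here is a toy version that runs without ColBERT. It swaps in plain case-insensitive substring matching for `has_answer` and `DPR_normalize`, and a dictionary of made-up ranked passages for the searcher, so it is a sketch of the idea and not the notebook's exact procedure:

```python
def toy_success_at_k(examples, retrieve, k):
    """Fraction of examples whose top-k passages contain a gold answer."""
    hits = 0
    for question, answers in examples:
        passages = retrieve(question)[:k]
        if any(ans.lower() in p.lower() for ans in answers for p in passages):
            hits += 1
    return hits / len(examples)


# Made-up ranked retrieval results, best passage first.
ranked = {
    "q1": ["about lungs", "T790M drives resistance", "unrelated"],
    "q2": ["nothing useful", "still nothing", "RET is the major gene"],
}
examples = [("q1", ["T790M"]), ("q2", ["RET"])]

at_1 = toy_success_at_k(examples, ranked.get, k=1)
at_2 = toy_success_at_k(examples, ranked.get, k=2)
at_3 = toy_success_at_k(examples, ranked.get, k=3)
print(at_1, at_2, at_3)
```

Success@`k` can only stay the same or increase as `k` grows, so scores are comparable only when they use the same `k`.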


## Zero-shot OpenQA with ColBERT retrieval

We're now in a position to define a system that does our full few-shot OpenQA task. To get this started, we first define a version that doesn't include any training examples in the prompt. So this is really zero-shot OpenQA. (The homework asks you to move to the true few-shot setting.)

In [47]:
def build_zero_shot_openqa_prompt(question, passage, joiner="\n\n"):
    ## since there is no title, passage itself is context
    context = passage
    
    segs = [
        # f"Title: {title}",
        f"Background: {context}",
        f"Q: {question}",
        "A:"
    ]
    return joiner.join(segs)    

In [48]:
def evaluate_zero_shot_openqa(examples, joiner="\n\n", gen_func=run_eleuther, batch_size=20):
    prompts = []
    gens = []
    for i in range(0, len(examples), batch_size):
        exs = examples[i: i+batch_size]
        results = [searcher.search(ex.question, k=1) for ex in exs]
        passages = [searcher.collection[r[0][0]] for r in results]
        
        ps = [build_zero_shot_openqa_prompt(ex.question, psg, joiner=joiner) 
              for ex, psg in zip(exs, passages)]
        gs = gen_func(ps)       
        prompts += ps
        gens += gs
    return evaluate(examples, prompts, gens)

In [50]:
%%time
zero_shot_openqa_results = evaluate_zero_shot_openqa(dev_exs)
print(zero_shot_openqa_results['macro_f1'])

0.0904595812684048
CPU times: user 40.4 s, sys: 9.62 s, total: 50 s
Wall time: 6.26 s


## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Few-shot OpenQA with no context [2 points]

In the section [Open QA with no context](#Open-QA-with-no-context) above, we simply prompted our LM with a question string and looked at what came back. This is arguably unfair to the LM, since we didn't convey anything about our intentions.

For a fairer assessment of what the LM alone can do, we should move to the few-shot setting by giving the model a few examples of what we have in mind. The idea here is to create prompts that look like this:

   ```   
   Q: What is pragmatics?

   A: The study of language use

   Q: Who is Bert?

   A: Bert is one of the Muppets.

   Q: When was Stanford University founded?
   
   A: 
   ```
   
This question asks you to write a function for creating such prompts, using SQuAD training examples, and a second function for evaluating this approach. The goal is to have a no context baseline for the other few-shot approaches we are considering.

__Task 1__: Complete the function `build_few_shot_no_context_prompt` so that it builds prompts like the above. You can use `test_build_few_shot_no_context_prompt` to check that your function is returning prompts in the desired format.

__Task 2__: Complete the function `evaluate_few_shot_no_context` so that you can evaluate this approach. You can use `test_evaluator` to check that your function is performing the desired kind of evaluation.

In [51]:
def build_few_shot_no_context_prompt(question, train_exs, joiner="\n\n"):
    """No context few-shot OpenQA prompts.

    Parameters
    ----------
    question : str   
    train_exs : iterable of SQuAD train examples. These can be 
        obtained via a random sample 
        from `squad_train` as defined above.
    joiner : str
        The character to use to join pieces of the prompt into 
        a single str.

    Returns
    -------
    str, the prompt

    """
    ##### YOUR CODE HERE
    segs = []
    # train_exs = random.sample(train_exs, k = len(train_exs))    
    for t in train_exs:
        segs += [
            f"Q: {t.question}",
            f"A: {t.answers[0]}"
        ]
    segs += [
        f"Q: {question}",
        f"A:"
    ]
    return joiner.join(segs)




In [52]:
def evaluate_few_shot_no_context(
        examples,
        squad_train,
        batch_size=20,
        n_context=2,
        joiner="\n\n",
        gen_func=run_eleuther):
    """Evaluate a few-shot OpenQA with no context approach 
    defined by `build_few_shot_no_context_prompt` and `gen_func`.

    Parameters
    ----------
    examples : iterable of SQuAD examples
        Presumably a subset of `squad_dev` as defined above.
    squad_train : iterable of SQuAD train examples
    batch_size : int
        Number of examples to send to `gen_func` at once.
    n_context : int
        Number of SQuAD train examples to sample (once per batch)
        as demonstrations in the prompts.
    joiner : str
        Used by `build_few_shot_no_context_prompt` to join segments
        of the prompt into a single str.
    gen_func : either `run_eleuther` or `run_gpt3`

    Returns
    -------
    dict as determined by `evaluate` above.

    """
    # A list of strings that you build and feed into `gen_func`.
    prompts = []

    # A list of dicts that you get from `gen_func`.
    gens = []

    # Iterate through the examples in batches:
    for i in range(0, len(examples), batch_size):
        # Sample some SQuAD training examples to use with
        # `build_few_shot_no_context_prompt` and `ex.question`,
        # run the resulting prompt through `gen_func`, and
        # add your prompts and results to `prompts` and `gens`.

        ##### YOUR CODE HERE

        # squad dev examples in batch
        batch = examples[i: i+batch_size]

        # build training to sample the Q/A pair from squad train
        train_exs = random.sample(squad_train, k=n_context)

        # build prompt using dev questions and Q/A pair from squad train
        ps = [build_few_shot_no_context_prompt(ex.question, train_exs, joiner=joiner) for ex in batch]  

        # feed that into run_eleuther
        gs = gen_func(ps) 

        # append the results
        prompts += ps
        gens += gs

    # Return value from a call to `evaluate`, with `examples`
    # as provided by the user and the `prompts` and `gens`
    # you built:
    return evaluate(examples, prompts, gens)

In [53]:
def test_evaluator(func):
    examples = [SquadExample(0, "T1", "Q1", "C1", ["A1"])]    
    squad_train = [SquadExample(0, "sT1", "sQ1", "sC1", ["sA1"])] 
    
    def gen_func(*prompts):
        return [{
            "generated_answer": "Constant output", 
            "generated_answer_tokens": ["Constant", "output"], 
            "generated_answer_probs": [0.1, 0.2]}]
    
    batch_size = 1    
    n_context = 1    
    joiner = "\n"
    result = func(
        examples, 
        squad_train, 
        batch_size=batch_size, 
        n_context=n_context, 
        joiner=joiner, 
        gen_func=gen_func)
    expected_keys = {'em_per', 'examples', 'macro_f1'}
    result_keys = set(result.keys())     
    if expected_keys != result_keys:
        print(f"Unexpected keys in result. "
              f"Expected: {expected_keys}; Got: {result_keys}")
        return
    expected_ex_keys = {
        'f1', 'id', 'em', 'generated_answer_tokens', 'generated_answer_probs',
        'prediction', 'generated_answer', 'question', 'answers'}
    result_ex_keys = set(result["examples"][0].keys())
    if expected_ex_keys != result_ex_keys:
        print(f"Unexpected keys in result['examples']. "
              f"Expected: {expected_ex_keys}; Got: {result_ex_keys}")
        return
    print("No errors detected in `evaluate_few_shot_open_qa`")  

In [41]:
test_evaluator(evaluate_few_shot_no_context)

No errors detected in `evaluate_few_shot_open_qa`


### Few-shot OpenQA [2 points]

In the section [Few-shot QA](#Few-shot-QA) above, we used SQuAD training examples to build prompts that we hope will help the model infer our intended semantics for the prompts themselves. When we moved to the open formulation of the problem, in [Open QA with ColBERT retrieval](#Open-QA-with-ColBERT-retrieval), we forced the model to deal with prompts that lack these context clues. This is a "zero-shot" formulation of the problem. The goal of this homework problem is to improve that system so that it truly supports few-shot OpenQA.

__Task 1__: Complete the function `build_few_shot_open_qa_prompt` so that it builds prompts from a question, a passage, and a sample of SQuAD training examples. You can use `test_build_few_shot_open_qa_prompt` to check that your function is returning prompts in the desired format.

__Task 2__: Complete the function `evaluate_few_shot_open_qa` so that you can evaluate this approach. You can use `test_evaluator` from above to check that your function is performing the desired kind of evaluation.

We will be checking only that the tests pass. We will not be evaluating the quality of the results you obtain using this code.
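
To make the difference from the no-context setting concrete, here is a small sketch of one possible prompt layout in which every demonstration, and the target question, is preceded by a passage. The function name `sketch_open_qa_prompt`, the `(context, question, answer)` tuple format, and the `Background:` label are assumptions made for this illustration; other labels may work equally well.

```python
def sketch_open_qa_prompt(question, passage, demos, joiner="\n\n"):
    """Build a prompt of (passage, question, answer) demonstrations
    followed by the retrieved passage and the open question.

    `demos` is a list of (context, question, answer) tuples, a
    hypothetical format chosen only for this sketch.
    """
    segs = []
    for context, q, a in demos:
        segs += [f"Background: {context}", f"Q: {q}", f"A: {a}"]
    # The target question gets a retrieved passage rather than a gold one.
    segs += [f"Background: {passage}", f"Q: {question}", "A:"]
    return joiner.join(segs)


demos = [(
    "Stanford University was founded in 1885.",
    "When was Stanford University founded?",
    "1885")]
prompt = sketch_open_qa_prompt(
    "Who is Bert?",
    "Bert is one of the Muppets on Sesame Street.",
    demos)
print(prompt)
```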

In [54]:
def build_few_shot_open_qa_prompt(question, passage, train_exs, joiner="\n\n"):
    """Few-shot OpenQA prompts.

    Parameters
    ----------
    question : str
    passage : str
        Presumably something retrieved via search.
    train_exs : iterable of SQuAD train examples
        These can be obtained via a random sample from 
        `squad_train` as defined above.
    joiner : str
        The character to use to join pieces of the prompt 
        into a single str.

    Returns
    -------
    str, the prompt

    """
    ##### YOUR CODE HERE
    passage_context = passage
    
    segs = []

    for t in train_exs:
        segs += [
            # f"Title: {t.title}",
            f"Background: {t.context}",
            f"Q: {t.question}",
            f"A: {t.answers[0]}"
        ]
    segs += [
            # f"Title: {passage_title}",
            f"Background: {passage_context}",
            f"Q: {question}",
            f"A:"
    ]
    return joiner.join(segs)


In [55]:
def evaluate_few_shot_open_qa(
        examples,
        squad_train,
        batch_size=20,
        n_context=2,
        joiner="\n\n",
        gen_func=run_eleuther):
    """Evaluate a few-shot OpenQA approach defined by 
    `build_few_shot_open_qa_prompt` and `gen_func`.

    Parameters
    ----------
    examples : iterable of SQuAD examples
        Presumably a subset of `squad_dev` as defined above.
    squad_train : iterable of SQuAD train examples
    batch_size : int
        Number of examples to send to `gen_func` at once.
    n_context : int
        Number of SQuAD train examples to sample (once per batch)
        as demonstrations in the prompts.
    joiner : str
        Used by `build_few_shot_open_qa_prompt` to join segments
        of the prompt into a single str.
    gen_func : either `run_eleuther` or `run_gpt3`

    Returns
    -------
    dict as determined by `evaluate` above.

    """
    # A list of strings that you build and feed into `gen_func`.
    prompts = []

    # A list of dicts that you get from `gen_func`.
    gens = []

    # Iterate through the examples in batches:
    for i in range(0, len(examples), batch_size):
        # Use the `searcher` defined above to get passages
        # using `ex.question` as the query, and use your
        # `build_few_shot_open_qa_prompt` to build prompts.

        ##### YOUR CODE HERE
        
        batch = examples[i: i+batch_size]

        # sample training from squad_train
        train_exs = random.sample(squad_train, k=n_context)

        ## get a passage for each example in the dev batch
        # get search results (passage index)
        results = [searcher.search(ex.question, k=1) for ex in batch]

        # from passage index to get the passage 'title | passage'
        passages = [searcher.collection[r[0][0]] for r in results]
 
        ps = []

        # for every question, combine it with its retrieved passage to build
        # a prompt, and collect all the prompts in a list
        for ex, psg in zip(batch, passages):
            ps.append(build_few_shot_open_qa_prompt(ex.question, psg, train_exs, joiner=joiner))  

        # feed prompt to gen_func
        gs = gen_func(ps)       

        # add the prompt to prompt list
        prompts += ps
        # add generated txt to gen list
        gens += gs


    # Return value from a call to `evaluate`, with `examples`
    # as provided by the user and the `prompts` and `gens`
    # you built:
    return evaluate(examples, prompts, gens)

### Answer scoring [2 points]

We have so far been assuming that the top-ranked passage retrieved by ColBERT should be used in the prompt and that the single answer returned by the LM is our prediction. It may be possible to improve on this by scoring answers using the ColBERT scores and the probabilities returned by the LM. This question asks you to explore a basic approach to such scoring. The core scoring function:

$$
\textbf{score}_{\text{prompt-func}}(\textrm{answer}, \textrm{passage}, \textrm{question}) = 
P(\textrm{passage} \mid \textrm{question}) \cdot 
P(\textrm{answer} \mid \text{prompt-func}(\textrm{question}, \textrm{passage}) ) 
$$

where we estimate the two conditional probabilities as follows:

* $P(\textrm{passage} \mid \textrm{question})$ is defined only for the top $k$ passages and is given by the softmax of the top $k$ scores returned by the retriever.

* $P(\textrm{answer} \mid \text{prompt-func}(\textrm{question}, \textrm{passage}))$ is simply the product of the per-token probabilities of the generated answer given the prompt determined by $\text{prompt-func}(\textrm{question}, \textrm{passage})$. These values can be extracted from the return values of both `run_eleuther` and `run_gpt3` using the key `"generated_answer_probs"`. (Your prompt function might of course have other arguments not represented here.)
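
The two estimates above can be combined as in the following worked sketch. All numbers are made up for illustration, and the helpers `softmax` and `sequence_prob` are hypothetical names that are not part of the notebook. With these numbers, the second passage ends up with the higher combined score even though the retriever ranks it lower, because the LM is more confident in the answer it generates from that passage.

```python
import math


def softmax(scores):
    """Softmax over a list of floats (max-subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_prob(token_probs):
    """Probability of an answer as the product of its token probabilities."""
    return math.prod(token_probs)


# Made-up retriever scores for the top k=2 passages:
retriever_scores = [2.0, 1.0]
passage_probs = softmax(retriever_scores)

# Made-up per-token probabilities for the answer generated from each passage:
answer_token_probs = [[0.5, 0.5], [0.9, 0.8]]

scores = [
    p * sequence_prob(tp)
    for p, tp in zip(passage_probs, answer_token_probs)]

# Index of the highest-scoring answer/passage pair:
best = max(range(len(scores)), key=scores.__getitem__)
print(passage_probs, scores, best)
```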

__Your task__: Implement this scoring function for an individual example. The two required pieces are `get_passages_with_scores` and `answer_scoring`. Starter code for each is below, and each has a unit test you can run to check your work.

(With this implemented, it is easy to create a new prediction function that uses the $\textrm{answer}$ from the highest-scoring $\textrm{answer}/\textrm{passage}$ pair as the prediction for input $\textrm{question}$. You are not required to implement such a prediction function, but you might do this as part of [your original system](#Your-original-system-[3-points]).)

In [56]:
def get_passages_with_scores(question, k=5):
    """Pseudo-probabilities from the retriever.

    Parameters
    ----------
    question : str
    k : int
        Number of passages to retrieve.

    Returns
    -------
    passages (list of str), passage_probs (np.array)

    """
    # Use the `searcher` to get `k` passages for `question`:
    ##### YOUR CODE HERE
    # `searcher.search` returns parallel sequences (passage ids, ranks,
    # scores); call it once and unpack rather than searching twice.
    passage_index, _, search_score = searcher.search(question, k=k)

    # Softmax normalize the scores and convert the list to
    # a NumPy array:
    ##### YOUR CODE HERE
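    # Subtracting the max before exponentiating keeps the softmax
    # numerically stable without changing the result.
    scores = np.array(search_score, dtype=np.float64)
    exp_score = np.exp(scores - scores.max())
    passage_probs = exp_score / exp_score.sum()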

    # Get the passages as a list of texts:
    ##### YOUR CODE HERE

    passages = [searcher.collection[idx] for idx in passage_index]

    return passages, passage_probs



In [57]:
def test_get_passages_with_scores(func):
    question = "What is linguistics?"        
    passages, passage_probs = func(question, k=2)    
    if len(passages) != len(passage_probs):
        print("`get_passages_with_scores` should return equal length "
              "lists of passages and passage probabilities.")
        return
    if len(passages) != 2:
        print(f"`get_passages_with_scores` should return `k` passages. Yours returns {len(passages)}")
        return
    if not all(isinstance(psg, str) for psg in passages):
        print("The first return argument should be a list of passage strings.")
        return
    if not all(isinstance(p, (float, np.float32, np.float64)) for p in passage_probs): 
        print("The second return argument should be a list of floats.")
        return 
    print("No errors detected in `get_passages_with_scores`")

In [58]:
test_get_passages_with_scores(get_passages_with_scores)

No errors detected in `get_passages_with_scores`


In [59]:
from types import GeneratorType
def answer_scoring(passages, passage_probs, prompts, gen_func=run_eleuther):
    """Implements our basic scoring strategy.

    Parameters
    ----------
    passages : list of str
    passage_probs : list of float
    prompts : list of str
    gen_func : either `run_eleuther` or `run_gpt3`

    Returns
    -------
    list of pairs (score, dict), sorted with the largest score first.
    `dict` should be the return value of `gen_func` for an example.

    """
    data = []
    for passage, passage_prob, prompt in zip(passages, passage_probs, prompts):
        # Run `gen_func` on [prompt] (crucially, the singleton list here),
        # and get the dictionary `gen` from the singleton list `gen_func`
        # returns, and then use the values to score `gen` according to our
        # scoring method.
        #
        # Be sure to use "generated_answer_probs" for the scores.
        ##### YOUR CODE HERE

        gen = gen_func([prompt])

        # print(gen)
        
        answer_score = np.prod(gen[0]['generated_answer_probs'])

        final_score = passage_prob*answer_score
        
        data.append((final_score, gen[0]))


    # Return `data`, sorted with the highest scoring `(score, gen)`
    # pair given first.
    ##### YOUR CODE HERE
    data.sort(key = lambda x: x[0], reverse=True)

    return data



In [60]:
def test_answer_scoring(func):
    passages = [
        "Pragmatics is the study of language use.", 
        "Phonology is the study of linguistic sound systems."]
    passage_probs = [0.75, 0.25]
    prompts = passages
    
    def gen_func(*prompts):
        return [{
            "generated_answer": "Constant output", 
            "generated_answer_tokens": ["Constant", "output"], 
            "generated_answer_probs": [0.1, 0.2]}]
    
    data = func(passages, passage_probs, prompts, gen_func=gen_func)
    
    if not all(len(x) == 2 for x in data):
        print("`answer_scoring` should return a list of pairs (score, gen)")
        return 
    if not isinstance(data[0][0], (float, np.float32, np.float64)):
        print("The first member of each pair in `data` should be a score (type `float`).")
        return    
    if not isinstance(data[0][1], dict):
        print("The second member of each pair in `data` should be a dict " 
              "created by running `gen_func` on a single example.")
        return    
    if data[0][0] != max([x for x, y in data]):
        print("`answer_scoring` should sort its data with the highest score first.")
        return 
    
    print("No errors detected in `answer_scoring`")

In [65]:
test_answer_scoring(answer_scoring)

No errors detected in `answer_scoring`


In [61]:
def answer_scoring_demo(question):
    """Example usage for answer_scoring. Here we extract the top-scoring
    results, which can then be used in an evaluation."""    
    passages, passage_probs = get_passages_with_scores(question)
    prompts = [build_zero_shot_openqa_prompt(question, psg) for psg in passages]
    # for p in prompts:
    #   print(p)
    data = answer_scoring(passages, passage_probs, prompts)
    # Top result:
    return data[0]

In [None]:
answer_scoring_demo("How long is Moby Dick?")

### Your original system [3 points]

This question asks you to design your own few-shot OpenQA system. All of the code above can be used and modified for this, and the requirement is just that you try something new that goes beyond what we've done so far. 

Terms for the bake-off:

* You can make free use of SQuAD and other publicly available data.
* The LM must be an autoregressive language model. No trained QA components can be used. Our list of preallowed models consists of those available via the OpenAI API whose names begin with "text" and the Eleuther models "gpt-neo-125M", "gpt-neo-1.3B", "gpt-neo-2.7B", and "gpt-j-6B". If you would like to use a model outside of this set, please check with the teaching team first.

Here are some ideas for the original system:

* We have so far sampled randomly from the SQuAD train set to create few-shot prompts. One might instead sample passages that have some connection to the target question.

* We have used actual SQuAD training examples to build contexts. These might be different in meaningful ways from the passages in our corpus. An alternative is to use the SQuAD question–answer pairs to retrieve passages that contain the answer and use the resulting question–answer–passage triple when building prompts.

* There are a lot of parameters to our LMs that we have so far ignored. Exploring different values might lead to better results. The `temperature` parameter is highly impactful for our task.

* We have distributed a fixed index of 100K passages. These cover SQuAD plus our bake-off data, but there might still be value in creating a different/expanded index. There is starter code for indexing data with ColBERT [here](https://github.com/stanford-futuredata/ColBERT/blob/new_api/docs/intro.ipynb).

* [Khattab et al. (2021a)](https://aclanthology.org/2021.tacl-1.55/) fine-tune the retriever through a handful of successive rounds, using weak supervision from the QA dataset. This is an ambitious direction that could quickly build to an original project, as the role of retriever training is under-explored so far in the context of few-shot OpenQA.

* In our "Answer scoring" question, we don't normalize scores by answer length. Such normalization might be fairer to long answers and so seems worth adding.

* Our "Answer scoring" question is inspired by the Retrieval Augmented Generation (RAG) model of [Lewis et al. 2020](https://arxiv.org/abs/2005.11401). Their model fully marginalizes over $k$ retrieved passages to create a proper model of $P(\textrm{answer} \mid \textrm{question})$. Implementing this requires having the probabilities for the prompts. For GPT-3, these can be obtained with `echo=True`, which will require changes to the output processing of `run_gpt3`. For the Eleuther models, one needs to do another call to the model forward function. Here is some starter code that could be used to begin modifying `run_eleuther`:

   ```
    prompt_logits = eleuther_model(prompt_ids).logits                
    prompt_probs = prompt_logits.softmax(-1)                                   
    prompt_probs = torch.gather(prompt_probs, 2, prompt_ids[:, :, None]).squeeze(-1)
    prompt_probs = [list(p.numpy()) for p in prompt_probs]
   ```
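
As a very rough approximation of the marginalization idea in the last bullet, one can sum the $P(\textrm{passage} \mid \textrm{question}) \cdot P(\textrm{answer} \mid \ldots)$ scores of identical answers generated from different passages. This is only a sketch under stated assumptions: the function name `marginalize`, the `(answer, passage_prob, answer_prob)` tuple format, and the numbers are all hypothetical, and unlike full RAG-style marginalization it sums only over passages that happened to generate a given answer rather than scoring every candidate answer under every passage.

```python
from collections import defaultdict


def marginalize(candidates):
    """Sum passage_prob * answer_prob over passages yielding the same answer.

    `candidates` is an iterable of (answer, passage_prob, answer_prob)
    tuples, a hypothetical format used only in this sketch. Returns a list
    of (normalized_answer, total_score) pairs, highest score first.
    """
    totals = defaultdict(float)
    for answer, passage_prob, answer_prob in candidates:
        # Light normalization so trivially different strings are merged.
        totals[answer.strip().lower()] += passage_prob * answer_prob
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up values: two passages agree on "1885", one says "1891".
candidates = [
    ("1885", 0.5, 0.30),
    ("1891", 0.3, 0.40),
    ("1885", 0.2, 0.35),
]
ranked = marginalize(candidates)
print(ranked)
```

In this toy case, "1885" wins because its scores from two passages are pooled, which is the behavior that distinguishes marginalizing from simply taking the single best answer/passage pair.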

__Original system instructions__:

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies. 

We also ask that you report the best macro F1 score your system got during development on `dev_exs` [as defined above](#SQuAD-dev-sample), just to help us understand how systems performed overall.

Please review the descriptions in the following comment and follow the instructions.

#### Baseline System (ColBERT straight output + Eleuther)

#### ColBERT + Answer Scoring + Eleuther

#### ColBERT Improvement One + Eleuther
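
For reference, one common way to normalize answer scores by length is the geometric mean of the per-token probabilities. The sketch below is only an illustration of that general technique, with the hypothetical name `length_normalized_prob` and made-up numbers; it is a different scheme from the relative-length weighting used in the code cell that follows.

```python
import math


def length_normalized_prob(token_probs):
    """Geometric mean of per-token probabilities.

    Returns 0.0 for an empty answer or if any token has probability 0.
    """
    if not token_probs or any(p <= 0 for p in token_probs):
        return 0.0
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


# Made-up per-token probabilities for a short and a long answer:
short_answer = [0.6]
long_answer = [0.8, 0.8, 0.8]

# The raw product penalizes the long answer for its length ...
raw_short = math.prod(short_answer)
raw_long = math.prod(long_answer)

# ... while the geometric mean compares per-token confidence.
norm_short = length_normalized_prob(short_answer)
norm_long = length_normalized_prob(long_answer)
print(raw_short, raw_long, norm_short, norm_long)
```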

In [None]:
######## This part is functional modules ############
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### enhanced squad training example searching

def train_tf_idf(bioasq_train):
    tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english', ngram_range=(1, 3))

    # collect the context of every training example
    train_context = [x.context for x in bioasq_train]

    # fit the vectorizer and vectorize the contexts in a single pass
    context_tfidf = tfidfvectorizer.fit_transform(train_context)

    return tfidfvectorizer, context_tfidf

def sample_bioasq_train(tfidfvectorizer, context_tfidf, question, n_context):
    '''
    Use tf-idf and cosine similarity to sample the bioasq examples most
    closely related to the question, for use in building the prompt.
    '''
    question_tfidf = tfidfvectorizer.transform([question])

    cosine_sim = cosine_similarity(context_tfidf, question_tfidf).flatten()

    related_index = cosine_sim.argsort()[-n_context:][::-1]

    train_exs = [bioasq_train[i] for i in related_index]

    return train_exs

### revised answer scoring by normalizing the score by length
from types import GeneratorType
## added temperature arg to allow change
def answer_scoring_normalized(passages, passage_probs, prompts, temperature, gen_func=run_eleuther):
    """Implements our basic scoring strategy, with each answer's score
    additionally weighted by its share of the total answer length.

    Parameters
    ----------
    passages : list of str
    passage_probs : list of float
    prompts : list of str
    temperature : float
        Sampling temperature passed through to `gen_func`.
    gen_func : either `run_eleuther` or `run_gpt3`

    Returns
    -------
    list of pairs (score, dict), sorted with the largest score first.
    `dict` should be the return value of `gen_func` for an example.

    """
    data = []
    length_sum = 0
    gen_list = []

    for passage, passage_prob, prompt in zip(passages, passage_probs, prompts):
        gen = gen_func([prompt], temperature = temperature)

        gen_list.append(gen)
        # calculate the total length of answers
        length_sum += len(gen[0]['generated_answer'].split(' '))

    for passage_prob, gen in zip(passage_probs, gen_list):

        answer_score = np.prod(gen[0]['generated_answer_probs'])

        length_of_answer = len(gen[0]['generated_answer'].split(' '))

        # give more weight to longer answers, since the product of their
        # per-token probabilities is at a disadvantage
        weight = length_of_answer/length_sum

        final_score = passage_prob*answer_score*weight

        data.append((final_score, gen[0]))

    data.sort(key = lambda x: x[0], reverse=True)

    return data


######## This part is system development ############

batch_size = 5
joiner = '\n\n'
n_context = 2

# temperatures = [0.01, 0.025, 0.05, 0.075]
temperatures = [0.025]
# bioasq_dev
# bioasq_train

for temperature in temperatures:
    prompts = []

    gens = []

    # use tf-idf to find related few-shot examples in bioasq to build the prompt
    # train tf-idf on all bioasq examples
    tfidfvectorizer, context_tfidf = train_tf_idf(bioasq_train)

    for i in range(0, len(bioasq_test), batch_size):
        # Use the `searcher` defined above to get passages
        # using `ex.question` as the query, and use your
        # `build_few_shot_open_qa_prompt` to build prompts.

        # get a batch from `bioasq_test` (used here in place of `dev_exs`)
        batch = bioasq_test[i: i+batch_size]

        # train_exs = random.sample(bioasq_train, k=n_context)

        ## score for answer-passage pair
        for ex in batch:

            # use tf-idf to sample training exs, instead of just randomly sampling bioasq training
            train_exs = sample_bioasq_train(tfidfvectorizer, context_tfidf, ex.question, n_context)

            passages, passage_probs = get_passages_with_scores(ex.question)

            # re-initiating prompt
            ps = []
            # iterate through each passage in the top k (5) passages
            for psg in passages:
                # build the prompt based on the question, that specific passage, and the training examples;
                # with k = 5 passages, ps will be ['prompt1', 'prompt2', 'prompt3', 'prompt4', 'prompt5']
                ps.append(build_few_shot_open_qa_prompt(ex.question, psg, train_exs, joiner=joiner)) 

            # calculate the answer score for each passage-answer pair
            # data = answer_scoring(passages,       # only related to question, same length as ps
            #                       passage_probs,  # only related to question, same length as ps
            #                       ps,             # k prompts
            #                       run_eleuther)

            data = answer_scoring_normalized(passages,       # only related to question, same length as ps
                                passage_probs,  # only related to question, same length as ps
                                ps,             # k prompts
                                temperature,
                                run_eleuther)

            # pick highest score answer-prompt pair (note: in)
            highest_gs = [data[0][1]]
            highest_ps = [data[0][1]['prompt']]

            # add the prompt to prompt list
            prompts += highest_ps

            # add generated txt to gen list
            gens += highest_gs
 
    eva = evaluate(bioasq_test, prompts, gens)
    print(f"temperature {temperature} get {eva['macro_f1']}")


# ######## This part is final function wrapper for bakeoff ############

# ## train tf-idf model first
# tfidfvectorizer, context_tfidf = train_tf_idf(bioasq_train)

# # make the system feed on one single question
# def run_original_system(question, tfidfvectorizer=tfidfvectorizer, context_tfidf=context_tfidf):
# '''
# Wrapper for producing the bakeoff dictionary

# args:
#   questions: single question
# '''
# joiner = '\n\n'
# n_context = 2

# temperature = 0.025 # obtained from hyperparameter searching

# gens = {}

# # score for answer-passage pair
# # use tf idf to sample training exs, instead of just random sampling bioasq training
# train_exs = sample_bioasq_train(tfidfvectorizer, context_tfidf, question, n_context)

# passages, passage_probs = get_passages_with_scores(question)

# # re-initiating prompt
# ps = []

# # iterate through each passage in the top k (5) passages
# for psg in passages:
#   # build the prompt based on question, that specific passge, and training examples
#   # say we have passage, then ps will be ['prompt1', 'prompt2', 'prompt3', 'prompt4', 'prompt5']
#   ps.append(build_few_shot_open_qa_prompt(question, psg, train_exs, joiner=joiner)) 

# #answer scoring
# data = answer_scoring_normalized(passages,       # only related to question, same length as ps
#                             passage_probs,  # only related to question, same length as ps
#                             ps,             # k prompts
#                             temperature,
#                             run_gpt3)


# # pick highest score answer-prompt pair (note: in)
# highest_gs = [data[0][1]]

# return highest_gs


# STOP COMMENT: Please do not remove this comment.

If the above fails, you can just download https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt and place it in `data/openqa`.

This file contains only questions. The starter code below will help you structure this. It writes a file "cs224u-openqa-bakeoff-entry.json" to the current directory. That file should be uploaded as-is. Please do not change its name.