<a href="https://colab.research.google.com/github/marcospiau/ia368-dd-dl4ir/blob/main/aula08-inpars/04_finetune_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hardware info

In [None]:
!echo *********
!echo lscpu is "$(lscpu)"
# !echo *********
!echo free -mh is "$(free -mh)"
# !echo *********
!echo TPU_NAME is $TPU_NAME

sample_data
lscpu is Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              2
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                        0
CPU MHz:                         2199.998
BogoMIPS:                        4399.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       32 KiB
L1i cache:                       32 KiB
L2 cache:                        256 KiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):             

# Gcloud authentication

In [None]:
import os
from google.colab import auth
import tensorflow_gcs_config

os.environ['USE_AUTH_EPHEM'] = '0'
auth.authenticate_user(clear_output=False)

In [None]:
!gcloud auth login

In [None]:
!gcloud auth application-default login

In [None]:
BUCKET = 'gs://aula08-inpars'
os.environ['BUCKET'] = BUCKET
!gsutil ls {BUCKET}
!gsutil ls $BUCKET

gs://aula08-inpars/castorini_baseline/
gs://aula08-inpars/finetune/
gs://aula08-inpars/models/
gs://aula08-inpars/synthetic-data-generation/
gs://aula08-inpars/teste_score_eval/
gs://aula08-inpars/castorini_baseline/
gs://aula08-inpars/finetune/
gs://aula08-inpars/models/
gs://aula08-inpars/synthetic-data-generation/
gs://aula08-inpars/teste_score_eval/


# Installs

Anserini, pyserini and anserini-tools:

In [None]:
%%capture
!wget -nc https://raw.githubusercontent.com/marcospiau/ia368-dd-dl4ir/main/scripts/install_anserini.sh && chmod +x install_anserini.sh && time ./install_anserini.sh

Pygaggle:

In [None]:
!git clone --recursive https://github.com/castorini/pygaggle.git

Cloning into 'pygaggle'...
remote: Enumerating objects: 1562, done.
remote: Counting objects: 100% (634/634), done.
remote: Compressing objects: 100% (230/230), done.
remote: Total 1562 (delta 528), reused 433 (delta 404), pack-reused 928
Receiving objects: 100% (1562/1562), 513.05 KiB | 6.93 MiB/s, done.
Resolving deltas: 100% (1001/1001), done.
Submodule 'tools' (https://github.com/castorini/anserini-tools.git) registered for path 'tools'
Cloning into '/content/pygaggle/tools'...
remote: Enumerating objects: 788, done.        
remote: Counting objects: 100% (545/545), done.        
remote: Compressing objects: 100% (467/467), done.        
remote: Total 788 (delta 101), reused 514 (delta 77), pack-reused 243        
Receiving objects: 100% (788/788), 119.60 MiB | 11.61 MiB/s, done.
Resolving deltas: 100% (185/185), done.
Submodule path 'tools': checked out '808f48711b5e172da6aec8b1855518c8ea65489f'


T5 (from PyPI) and the remaining dependencies:

In [None]:
%%capture
!pip install -q ftfy polars toolz cytoolz transformers datasets dm-tree huggingface_hub
!pip install -U t5[gcp,cache-tasks]==0.9.3
!pip install -U jaxlib
!sudo apt install -qq tree htop

# Imports

In [None]:
import os
import pandas as pd
import polars as pl
import ftfy
import datasets
import numpy as np

import functools

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 1000)

os.environ['POLARS_FMT_STR_LEN']='1000'

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'


In [None]:
from huggingface_hub import login
login()

In [None]:
# Default seed for every random operation in this notebook.
# seed=12 (row number for my name on Google Sheets)
DEFAULT_RANDOM_SEED = 12

## Copying data from GCP

We already preprocessed some files in previous notebooks, so we just copy them locally instead of reprocessing them.

Sync data from the GCS bucket to a local folder:

In [None]:
!mkdir -pv aula08-inpars
!gsutil -m rsync -r gs://aula08-inpars/ aula08-inpars/
!tree -lht aula08-inpars/

mkdir: created directory 'aula08-inpars'
Building synchronization state...
Starting synchronization...
Copying gs://aula08-inpars/castorini_baseline/trec-covid-corpus.tsv...
Copying gs://aula08-inpars/castorini_baseline/run.title_and_text_no_expansion.txt...
Copying gs://aula08-inpars/castorini_baseline/castorini-baseline-t5-input.txt...
Copying gs://aula08-inpars/castorini_baseline/castorini-baseline-monot5-base-predictions.txt-1100000...
Copying gs://aula08-inpars/castorini_baseline/castorini-baseline-t5-input_ids.txt...
Copying gs://aula08-inpars/castorini_baseline/trec-covid-queries_trec_format.txt...
Copying gs://aula08-inpars/models/doc2query_train/commands/command...
Copying gs://aula08-inpars/models/doc2query_train/checkpoint...
Copying gs://aula08-inpars/finetune/data/train_v1.tsv...
Copying gs://aula08-inpars/castorini_baseline/trec-covid-qrels_trec_format.txt...
Copying gs://aula08-inpars/models/doc2query_train/commands/command.1...
Copying gs://aula08-inpars/models/doc2quer

# Preparing data for finetuning

Getting data from the Hugging Face Hub:

In [None]:
ds = datasets.load_dataset('unicamp-dl/trec-covid-experiment')
ds




DatasetDict({
    example: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 3
    })
    example2: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 3
    })
    eduseiti_100_queries_expansion_20230501_01: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 463
    })
    leandro_carisio_01: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 1001
    })
    thales_1k_generated_queries_20230429: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 1000
    })
    manoel_1k_generated_queries_20230430: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 1000
    })
    manoel_2k_generated_queries_20230501: Dataset({
        features: ['query', 'positive_doc_id', 'negative_doc_ids'],
        num_rows: 2000
    })
    thiago_l

Row count by split:

In [None]:
{k: len(v) for k,v in ds.items()}

{'example': 3,
 'example2': 3,
 'eduseiti_100_queries_expansion_20230501_01': 463,
 'leandro_carisio_01': 1001,
 'thales_1k_generated_queries_20230429': 1000,
 'manoel_1k_generated_queries_20230430': 1000,
 'manoel_2k_generated_queries_20230501': 2000,
 'thiago_laitz_1k_queries': 1000,
 'mirelle_1k_generated_queries_20230501': 999,
 'hugo_padovani_query_generation': 979,
 'marcus_borela_1k_gptj6b_20230501': 1000,
 'juliatessler_1000_queries': 1000,
 'pedro_holanda_1k_generated_queries_20230502': 1088,
 'leonardo_avila_queries_v1': 996,
 'marcus_borela_1k_gptj6b_20230501_v2': 1000,
 'gustavo_1k_cohere': 1000,
 'marcospiau_1k_v1': 1000,
 'pedrogengo_queries_inparsv1': 1146,
 'ricardo_primi_1k': 999,
 'thiago_vieira_1k_queries': 1000}

My dataset:

In [None]:
fix_query_expr = pl.col('query').str.strip()

In [None]:
df_queries_me = pl.from_arrow(ds['marcospiau_1k_v1'].data.table).with_columns(pl.lit('marcospiau_1k_v1').alias('origin'))
df_queries_me = df_queries_me.with_columns(fix_query_expr)
df_queries_me.head(1), df_queries_me.shape

(shape: (1, 4)
 ┌──────────────────────────────┬─────────────────┬──────────────────────────────┬──────────────────┐
 │ query                        ┆ positive_doc_id ┆ negative_doc_ids             ┆ origin           │
 │ ---                          ┆ ---             ┆ ---                          ┆ ---              │
 │ str                          ┆ str             ┆ list[str]                    ┆ str              │
 ╞══════════════════════════════╪═════════════════╪══════════════════════════════╪══════════════════╡
 │ What are the ethical         ┆ ejrcujnx        ┆ ["qe0vxmox", "82gaerf4", …   ┆ marcospiau_1k_v1 │
 │ conflicts between public     ┆                 ┆ "f8h9hlks"]                  ┆                  │
 │ health-driven focus of       ┆                 ┆                              ┆                  │
 │ Covid-19 prevention and      ┆                 ┆                              ┆                  │
 │ containment measures versus  ┆                 ┆                

Check duplicates:

In [None]:
def check_duplicated_positive_doc_ids_by_origin(df_queries):
    """Counts positive_doc_ids to check duplication"""
    # count by origin and positive_doc_id
    counts = (
        df_queries.groupby('origin', 'positive_doc_id')
        .agg(pl.count().alias('positive_doc_ids_unique_count'))
        .sort('positive_doc_ids_unique_count')
    )
    # pivot
    pivot = counts.pivot(
        index='origin',
        columns='positive_doc_ids_unique_count',
        values='positive_doc_id',
        aggregate_function=pl.count()
        ).fill_null(0)
    return pivot

check_duplicated_positive_doc_ids_by_origin(df_queries_me)

origin,1
str,u32
"""marcospiau_1k_v1""",1000
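
The two-level counting behind this check can be mirrored in plain Python. This is only an illustrative sketch: `duplication_histogram` and `rows` are hypothetical names, and the toy data is made up rather than taken from the dataset.

```python
from collections import Counter


def duplication_histogram(rows):
    """Per origin, map 'times a positive_doc_id appears' to 'number of such ids'."""
    # First level: count occurrences of every (origin, positive_doc_id) pair.
    pair_counts = Counter((r["origin"], r["positive_doc_id"]) for r in rows)
    # Second level: per origin, count how many ids share each occurrence count.
    histogram = {}
    for (origin, _doc_id), n in pair_counts.items():
        histogram.setdefault(origin, Counter())[n] += 1
    return histogram


# Toy data (made up): document "a" appears twice in split "x".
rows = [
    {"origin": "x", "positive_doc_id": "a"},
    {"origin": "x", "positive_doc_id": "a"},
    {"origin": "x", "positive_doc_id": "b"},
    {"origin": "y", "positive_doc_id": "a"},
]
hist = duplication_histogram(rows)
print({origin: dict(counts) for origin, counts in hist.items()})
```

A split without repeated documents only has the key `1`, which corresponds to a pivot with a single `1` column.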


In [None]:
df_queries_others = pl.concat([
    pl.from_arrow(v.data.table).with_columns(pl.lit(k).alias('origin'))
    for k,v in ds.items()
    # if k in ['eduseiti_100_queries_expansion_20230501_01']
    if k not in [
        # example splits
        'example', 'example2',
        # my own splits
        'marcospiau_1k_v1',
        # this split didn't provide negative_doc_ids
        'thales_1k_generated_queries_20230429'
    ]
])
df_queries_others = df_queries_others.with_columns(fix_query_expr)
df_queries_others.shape, df_queries_others.head(1)

((16671, 4),
 shape: (1, 4)
 ┌──────────────────────────┬─────────────────┬──────────────────────────┬──────────────────────────┐
 │ query                    ┆ positive_doc_id ┆ negative_doc_ids         ┆ origin                   │
 │ ---                      ┆ ---             ┆ ---                      ┆ ---                      │
 │ str                      ┆ str             ┆ list[str]                ┆ str                      │
 ╞══════════════════════════╪═════════════════╪══════════════════════════╪══════════════════════════╡
 │ How can chatbots be      ┆ 70hskj1o        ┆ ["mt00852w", "x7ol32mz", ┆ eduseiti_100_queries_exp │
 │ designed to effectively  ┆                 ┆ … "e2g1iu39"]            ┆ ansion_20230501_01       │
 │ share up-to-date         ┆                 ┆                          ┆                          │
 │ information during a     ┆                 ┆                          ┆                          │
 │ pandemic?                ┆                 ┆       

Checking duplicates:

In [None]:
check_duplicated_positive_doc_ids_by_origin(df_queries_others)

origin,1,2,3,4,5,6,7,8,10
str,u32,u32,u32,u32,u32,u32,u32,u32,u32
"""leandro_carisio_01""",1001,0,0,0,0,0,0,0,0
"""marcus_borela_1k_gptj6b_20230501""",1000,0,0,0,0,0,0,0,0
"""leonardo_avila_queries_v1""",996,0,0,0,0,0,0,0,0
"""hugo_padovani_query_generation""",979,0,0,0,0,0,0,0,0
"""gustavo_1k_cohere""",1000,0,0,0,0,0,0,0,0
"""pedro_holanda_1k_generated_queries_20230502""",1088,0,0,0,0,0,0,0,0
"""manoel_2k_generated_queries_20230501""",2000,0,0,0,0,0,0,0,0
"""juliatessler_1000_queries""",990,5,0,0,0,0,0,0,0
"""mirelle_1k_generated_queries_20230501""",999,0,0,0,0,0,0,0,0
"""marcus_borela_1k_gptj6b_20230501_v2""",1000,0,0,0,0,0,0,0,0


We are combining our queries with the ones generated by our colleagues. When deduplicating, we will prioritize our queries.

In [None]:
positive_doc_ids_me = set(df_queries_me['positive_doc_id'])
df_queries = pl.concat([
    df_queries_me,
    df_queries_others.filter(
        ~pl.col('positive_doc_id').is_in(positive_doc_ids_me)
    )
])
print(df_queries.shape)
check_duplicated_positive_doc_ids_by_origin(df_queries)

(17541, 4)


origin,1,2,3,4,5,6,7,8,10
str,u32,u32,u32,u32,u32,u32,u32,u32,u32
"""thiago_vieira_1k_queries""",991,3,0,0,0,0,0,0,0
"""manoel_2k_generated_queries_20230501""",1980,0,0,0,0,0,0,0,0
"""marcus_borela_1k_gptj6b_20230501_v2""",993,0,0,0,0,0,0,0,0
"""mirelle_1k_generated_queries_20230501""",988,0,0,0,0,0,0,0,0
"""leonardo_avila_queries_v1""",988,0,0,0,0,0,0,0,0
"""pedrogengo_queries_inparsv1""",1131,0,0,0,0,0,0,0,0
"""ricardo_primi_1k""",996,0,0,0,0,0,0,0,0
"""leandro_carisio_01""",991,0,0,0,0,0,0,0,0
"""hugo_padovani_query_generation""",970,0,0,0,0,0,0,0,0
"""manoel_1k_generated_queries_20230430""",991,0,0,0,0,0,0,0,0


Removing duplicates:

In [None]:
df_queries = df_queries.unique(subset=['positive_doc_id'], keep='first')
check_duplicated_positive_doc_ids_by_origin(df_queries)

origin,1
str,u32
"""pedrogengo_queries_inparsv1""",1054
"""thiago_vieira_1k_queries""",906
"""ricardo_primi_1k""",906
"""pedro_holanda_1k_generated_queries_20230502""",1034
"""thiago_laitz_1k_queries""",973
"""manoel_2k_generated_queries_20230501""",1964
"""leandro_carisio_01""",991
"""mirelle_1k_generated_queries_20230501""",945
"""marcus_borela_1k_gptj6b_20230501_v2""",802
"""juliatessler_1000_queries""",943


Finally, we collapse the 'origin' column to a single value and check for duplicates across all sources again. No duplicates remain:

In [None]:
df_queries = df_queries.with_columns(pl.lit('all').alias('origin'))
check_duplicated_positive_doc_ids_by_origin(df_queries)

origin,1
str,u32
"""all""",16311


In [None]:
# removes: "Sorry, as an AI language model, I cannot provide a good query for the given document without any context or information about what the document is about. Please provide more details or information about the document so I can assist you better."	
bad_question_expr = pl.col('query').str.to_lowercase().str.starts_with('sorry')

bad_question_expr |= pl.col('query').str.to_lowercase().is_in([
    'what is the capital of france?',
    'what are the symptoms of covid-19?',
    'what is the focus of the study described in the document?',
    'what is the purpose of the study mentioned in the document?',
    'what are the benefits of meditation?',
    'please provide a document for this prompt.',
    'what is the aim of the study described in this document?',
    'what is the aim of the study described in the document?',
    'what is the focus of the article?',
    'what is the purpose of the study described in the document?',
    'please provide a document to generate a positive query.',
    'what is the capital city of france?',
    'this study ',
    'this document ',
    'this article']
)

# bad_question_expr |= pl.col('query').str.split(' ').arr.lengths().le(3)

In [None]:
df_queries = df_queries.filter(~bad_question_expr)
check_duplicated_positive_doc_ids_by_origin(df_queries)

origin,1
str,u32
"""all""",16299
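
The filter combines a prefix test with an exact, case-insensitive membership test. A minimal plain-Python sketch of that logic follows; `is_bad_query`, the shortened blocklist, and the third example query are assumptions made for illustration.

```python
def is_bad_query(query, blocklist):
    """True for refusals ('sorry...') or exact case-insensitive blocklist hits."""
    q = query.lower()
    return q.startswith("sorry") or q in blocklist


# Shortened blocklist, for illustration only.
blocklist = {
    "what is the capital of france?",
    "what is the focus of the article?",
}
queries = [
    "Sorry, as an AI language model, I cannot provide a good query.",
    "What is the capital of France?",
    "What is the focus of the article on SEIR models?",  # made-up example
]
kept = [q for q in queries if not is_bad_query(q, blocklist)]
print(kept)
```

Because membership is an exact match, a blocklist entry only removes queries identical to it after lowercasing. Catching longer queries that merely begin with an entry would need a prefix test like the one used for 'sorry'.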


## Removing "bad queries"

After creating an initial version of the training dataset, we realized that there were duplicate queries and some queries that would not benefit model training.

Therefore, we went back to the data preparation step to remove these queries.

Initially, we planned to remove these queries manually, but that approach is too time-consuming and does not scale. Instead, we use a few-shot prompt with examples of 'bad queries' to identify and remove them.

Top queries:

In [None]:
df_queries['query'].value_counts(sort=True).head()['query'].to_list()

['What are the clinical characteristics of death cases with COVID-19 and how can they be used to identify critically ill patients early and reduce their mortality?',
 'What is the role of oxidative stress and epidermal growth factor receptor (EGFR) in PM2.5-induced pro-inflammatory response in human bronchial epithelial cells?',
 'What are the unique challenges faced by cancer scientists engaged in basic research?',
 'What is the role of nectar-producing plants in the transmission of Asaia sp.?',
 'What is the correct selection and utilization of respiratory personal protective equipment in high-risk aerosol-generating procedures?']

Prompt used for experimentation (the model's explanations were used to "tune" the input prompt and examples):

In [None]:
"""
You are a human annotator with biomedical expertise. You received the following instruction:
#######Instruction#######
Generate a single question that could be helpful for a specialist who is searching for a very specific piece of information within a large corpus of scientific publications.

Given this instruction, could the following question be a good question genereated by you?
These are examples of bad questions and the reason they are not good questions
#######Bad questions and reasons#######
'What is the capital of France?': The question about the capital of France is not related to biomedical expertise or scientific publications, and therefore not useful for a specialist in this field.
'What are the symptoms of COVID-19?': It is a very general question that can be easily answered with a quick internet search.
'What is the focus of the study described in the document?': We cannot know for sure what the document is talking about, so it is not useful for a narrow search
'What is the purpose of the study mentioned in the document?': Because it doesn't mention the subject, it could be about any document in the collection
'What is the aim of the study described in this document?': is not specific enough and does not provide enough context to limit the search to a particular subject or study. It could potentially return any document in the collection.

#######Question#######
"Sorry, as an AI language model, I cannot provide a good query for the given document without any context or information about what the document is about. Please provide more details or information about the document so I can assist you better."	

Answer with True or False and explain your answer.
#######You response#######
"""

'\nYou are a human annotator with biomedical expertise. You received the following instruction:\n#######Instruction#######\nGenerate a single question that could be helpful for a specialist who is searching for a very specific piece of information within a large corpus of scientific publications.\n\nGiven this instruction, could the following question be a good question genereated by you?\nThese are examples of bad questions and the reason they are not good questions\n#######Bad questions and reasons#######\n\'What is the capital of France?\': The question about the capital of France is not related to biomedical expertise or scientific publications, and therefore not useful for a specialist in this field.\n\'What are the symptoms of COVID-19?\': It is a very general question that can be easily answered with a quick internet search.\n\'What is the focus of the study described in the document?\': We cannot know for sure what the document is talking about, so it is not useful for a narrow

Prompt used for a more direct response (True or False only):

In [None]:
# Sorry, as an AI language model, I cannot provide a good query for the given document without any context or information about what the document is about. Please provide more details or information about the document so I can assist you better.
"""\
You are a human annotator with biomedical expertise. You received the following instruction:
#######Instruction#######
Generate a single question that could be helpful for a specialist who is searching for a very specific piece of information within a large corpus of scientific publications.

Given this instruction, could the following question be a good question genereated by you?
These are examples of bad questions and the reason they are not good questions
#######Bad questions and reasons#######
'What is the capital of France?': The question about the capital of France is not related to biomedical expertise or scientific publications, and therefore not useful for a specialist in this field.
'What are the symptoms of COVID-19?': It is a very general question that can be easily answered with a quick internet search.
'What is the focus of the study described in the document?': We cannot know for sure what the document is talking about, so it is not useful for a narrow search
'What is the purpose of the study mentioned in the document?': Becauset it doesn't mention the subject, it could be about any document in the collection
'What is the aim of the study described in this document?': is not specific enough and does not provide enough context to limit the search to a particular subject or study. It could potentially return any document in the collection.

#######Question#######
"Sorry, as an AI language model, I cannot provide a good query for the given document without any context or information about what the document is about. Please provide more details or information about the document so I can assist you better."	

Is this a good question? Answer with True or False only.
Do not use any other words or give any explanation, stop after first word is generated.
Do not use punctuation.
#######You response#######
"""

'You are a human annotator with biomedical expertise. You received the following instruction:\n#######Instruction#######\nGenerate a single question that could be helpful for a specialist who is searching for a very specific piece of information within a large corpus of scientific publications.\n\nGiven this instruction, could the following question be a good question genereated by you?\nThese are examples of bad questions and the reason they are not good questions\n#######Bad questions and reasons#######\n\'What is the capital of France?\': The question about the capital of France is not related to biomedical expertise or scientific publications, and therefore not useful for a specialist in this field.\n\'What are the symptoms of COVID-19?\': It is a very general question that can be easily answered with a quick internet search.\n\'What is the focus of the study described in the document?\': We cannot know for sure what the document is talking about, so it is not useful for a narrow s
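
A prompt like this can be filled in per query and its one-word answer parsed into a boolean. The sketch below is hypothetical: the notebook does not show how the prompt was sent to a model, so no model call is included, the template is heavily abbreviated, and `PROMPT_TEMPLATE`, `build_prompt`, and `parse_verdict` are illustrative names.

```python
# Abbreviated stand-in for the full prompt; only the variable part is shown.
PROMPT_TEMPLATE = """\
#######Question#######
"{question}"

Is this a good question? Answer with True or False only.
#######You response#######
"""


def build_prompt(question, template=PROMPT_TEMPLATE):
    """Insert the candidate question into the prompt template."""
    return template.format(question=question)


def parse_verdict(response):
    """Map a model response to True/False, or None if it is neither."""
    words = response.strip().split()
    first = words[0].strip(".,!\"'").lower() if words else ""
    return {"true": True, "false": False}.get(first)


prompt = build_prompt("What is the capital of France?")
print(prompt)
print(parse_verdict("False"))
```

Returning `None` for anything other than True or False keeps unexpected responses visible instead of silently treating them as a verdict.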

To save time and keep things simple, we use only a single negative example per query.

In [None]:
df_corpus = pl.read_csv('aula08-inpars/castorini_baseline/trec-covid-corpus.tsv',
                        has_header=False, separator='\t',
                        new_columns=['id', 'text'])
df_corpus.shape, df_corpus.head(1)

((171332, 2),
 shape: (1, 2)
 ┌──────────┬───────────────────────────────────────────────────────────────────────────────────────┐
 │ id       ┆ text                                                                                  │
 │ ---      ┆ ---                                                                                   │
 │ str      ┆ str                                                                                   │
 ╞══════════╪═══════════════════════════════════════════════════════════════════════════════════════╡
 │ ug7v899j ┆ Clinical features of culture-proven Mycoplasma pneumoniae infections at King          │
 │          ┆ Abdulaziz University Hospital, Jeddah, Saudi Arabia. OBJECTIVE: This retrospective    │
 │          ┆ chart review describes the epidemiology and clinical features of 40 patients with     │
 │          ┆ culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University          │
 │          ┆ Hospital, Jeddah, Saudi Arabia. METHODS

In [None]:
def get_random_indexes(n, seed):
    """Get a random permutation of n elements with a seed"""
    return np.random.default_rng(seed=seed).permutation(range(n))

def prepare_training_data(df_queries, df_corpus, max_negative_examples=1, seed=DEFAULT_RANDOM_SEED):
    # df_queries = df_queries.lazy()
    # df_corpus = df_corpus.lazy()
    df_corpus = df_corpus.rename({'text': 'document'})
    
    def join_with_corpus(left):
        return left.join(df_corpus, on='id', how='inner')

    # get positive documents
    df_pos = df_queries.select(
        'query',
        pl.col('positive_doc_id').alias('id'),
        pl.lit('true').alias('target'))
    # explode negative_doc_ids and get negative documents
    df_neg = (
        df_queries
        .select(
            'query',
            pl.col('negative_doc_ids').arr.slice(0, max_negative_examples).alias('id'),
            pl.lit('false').alias('target'))
        .explode('id'))
    # add documents
    df_all = join_with_corpus(df_pos.vstack(df_neg))
    df_all = df_all.rename({'id': 'doc_id'})
    # format input
    # https://github.com/castorini/pygaggle/blob/master/pygaggle/data/create_msmarco_monot5_train.py
    format_expr = pl.format('Query: {} Document: {} Relevant:', 'query',
                            'document').alias('input')
    df_all = df_all.with_columns(format_expr)

    # shuffle (probably not really necessary)
    random_idx = get_random_indexes(len(df_all), seed=seed)
    df_all = df_all[random_idx]

    # remove line breaks from inputs: they break the T5 code!!
    df_all = df_all.with_columns(pl.col('input').str.replace_all('\n', ' '))
    return df_all


# df_queries was already filtered above, so filtering again here is a no-op
df_train = prepare_training_data(df_queries.filter(~bad_question_expr), df_corpus)
df_train

query,doc_id,target,document,input
str,str,str,str,str
"""What is the impact of non-coding RNAs such as microRNAs and small interfering RNAs on the replication and pathogenesis of RNA viruses, specifically in retroviruses like HIV?""","""259cspey""","""true""","""RNA Viruses: RNA Roles in Pathogenesis, Coreplication and Viral Load. The review intends to present and recapitulate the current knowledge on the roles and importance of regulatory RNAs, such as microRNAs and small interfering RNAs, RNA binding proteins and enzymes processing RNAs or activated by RNAs, in cells infected by RNA viruses. The review focuses on how non-coding RNAs are involved in RNA virus replication, pathogenesis and host response, especially in retroviruses HIV, with examples of the mechanisms of action, transcriptional regulation, and promotion of increased stability of their targets or their degradation.""","""Query: What is the impact of non-coding RNAs such as microRNAs and small interfering RNAs on the replication and pathogenesis of RNA viruses, specifically in retroviruses like HIV? Document: RNA Viruses: RNA Roles in Pathogenesis, Coreplication and Viral Load. The review intends to present and recapitulate the current knowledge on the roles and importance of regulatory RNAs, such as microRNAs and small interfering RNAs, RNA binding proteins and enzymes processing RNAs or activated by RNAs, in cells infected by RNA viruses. The review focuses on how non-coding RNAs are involved in RNA virus replication, pathogenesis and host response, especially in retroviruses HIV, with examples of the mechanisms of action, transcriptional regulation, and promotion of increased stability of their targets or their degradation. Relevant:"""
"""What is the SEIR model and how is it used to analyze the COVID-19 epidemic in Buenos Aires and neighbouring cities in Argentina?""","""60qmiwjm""","""false""","""Analysis of meteorological conditions and prediction of epidemic trend of 2019-nCoV infection in 2020. Objective: To investigate the meteorological condition for incidence and spread of 2019-nCoV infection, to predict the epidemiology of the infectious disease, and to provide a scientific basis for prevention and control measures against the new disease. Methods: The meteorological factors during the outbreak period of the novel coronavirus pneumonia in Wuhan in 2019 were collected and analyzed, and were confirmed with those of Severe Acute Respiratory Syndrome (SARS) in China in 2003. Data of patients infected with 2019-nCoV and SARS coronavirus were collected from WHO website and other public sources. Results: This study found that the suitable temperature range for 2019-nCoV coronavirus survival is (13-24 degree Celsius), among which 19 degree Celsius lasting about 60 days is conducive to the spread between the vector and humans; the humidity range is 50%-80%, of which about 75% hu…","""Query: What is the SEIR model and how is it used to analyze the COVID-19 epidemic in Buenos Aires and neighbouring cities in Argentina? Document: Analysis of meteorological conditions and prediction of epidemic trend of 2019-nCoV infection in 2020. Objective: To investigate the meteorological condition for incidence and spread of 2019-nCoV infection, to predict the epidemiology of the infectious disease, and to provide a scientific basis for prevention and control measures against the new disease. Methods: The meteorological factors during the outbreak period of the novel coronavirus pneumonia in Wuhan in 2019 were collected and analyzed, and were confirmed with those of Severe Acute Respiratory Syndrome (SARS) in China in 2003. 
Data of patients infected with 2019-nCoV and SARS coronavirus were collected from WHO website and other public sources. Results: This study found that the suitable temperature range for 2019-nCoV coronavirus survival is (13-24 degree Celsius), among which 19 d…"
"""What are the potential nanobiotechnology and microfluidic approaches for developing alternative therapeutic methods for treating sepsis in extracorporeal circuits?""","""493nholj""","""true""","""Multiscale Biofluidic and Nanobiotechnology Approaches for Treating Sepsis in Extracorporeal Circuits. Infectious diseases and their pandemics periodically attract public interests due to difficulty in treating the patients and the consequent high mortality. Sepsis caused by an imbalanced systemic inflammatory response to infection often leads to organ failure and death. The current therapeutic intervention mainly includes “the sepsis bundles,” antibiotics (antibacterial, antiviral, and antifungal), intravenous fluids for resuscitation, and surgery, which have significantly improved the clinical outcomes in past decades; however, the patients with fulminant sepsis are still in desperate need of alternative therapeutic approaches. One of the potential supportive therapies, extracorporeal blood treatment, has emerged and been developed for improving the current therapeutic efficacy. Here, I overview how the treatment of infectious diseases has been assisted with the extracorporeal adjuv…","""Query: What are the potential nanobiotechnology and microfluidic approaches for developing alternative therapeutic methods for treating sepsis in extracorporeal circuits? Document: Multiscale Biofluidic and Nanobiotechnology Approaches for Treating Sepsis in Extracorporeal Circuits. Infectious diseases and their pandemics periodically attract public interests due to difficulty in treating the patients and the consequent high mortality. Sepsis caused by an imbalanced systemic inflammatory response to infection often leads to organ failure and death. 
The current therapeutic intervention mainly includes “the sepsis bundles,” antibiotics (antibacterial, antiviral, and antifungal), intravenous fluids for resuscitation, and surgery, which have significantly improved the clinical outcomes in past decades; however, the patients with fulminant sepsis are still in desperate need of alternative therapeutic approaches. One of the potential supportive therapies, extracorporeal blood treatment, has…"
"""How has the COVID-19 pandemic affected the diagnosis and treatment of obsessive-compulsive disorder, and what strategies can be implemented to address these challenges?""","""lr1qrg5x""","""true""","""The impact of COVID‐19 in the diagnosis and treatment of obsessive‐compulsive disorder. Obsessive‐compulsive disorder (OCD) is characterized by unwanted and distressing thoughts, images or urges (obsessions) and repetitive behaviors or mental acts that aim to decrease the resulting distress or according to rigid rules (compulsions) (APA, 2013). Different studies suggest OCD to affect up to 3.1 % of the general population and to be associated with substantial disability and decreased quality of life (Fontenelle, Mendlowicz, & Versiani, 2006; Ruscio, Stein, Chiu, & Kessler, 2010). This article is protected by copyright. All rights reserved.""","""Query: How has the COVID-19 pandemic affected the diagnosis and treatment of obsessive-compulsive disorder, and what strategies can be implemented to address these challenges? Document: The impact of COVID‐19 in the diagnosis and treatment of obsessive‐compulsive disorder. Obsessive‐compulsive disorder (OCD) is characterized by unwanted and distressing thoughts, images or urges (obsessions) and repetitive behaviors or mental acts that aim to decrease the resulting distress or according to rigid rules (compulsions) (APA, 2013). Different studies suggest OCD to affect up to 3.1 % of the general population and to be associated with substantial disability and decreased quality of life (Fontenelle, Mendlowicz, & Versiani, 2006; Ruscio, Stein, Chiu, & Kessler, 2010). This article is protected by copyright. All rights reserved. Relevant:"""
"""What is the difference between the causative organisms of condyloma acuminatum and verruca vulgaris?""","""jn17ev8m""","""false""","""Cells and Viruses. Cells are the smallest structural component of all known living organisms capable of self-maintenance and reproduction. Although cells vary greatly in their appearance or size, their structure is basically similar. Even the plant and animal cells show a significant degree of similarity in their overall organization. There are two types of cells: eukaryotic and prokaryotic. The main difference between them is the method of genetic material storage: in eukaryotic cells — in an isolated nucleus, in prokaryotic cells — directly in the cytoplasm (there is no nucleus). Prokaryotic cells are usually independent (unicellular), while eukaryotic cells are often found in multicellular organisms.""","""Query: What is the difference between the causative organisms of condyloma acuminatum and verruca vulgaris? Document: Cells and Viruses. Cells are the smallest structural component of all known living organisms capable of self-maintenance and reproduction. Although cells vary greatly in their appearance or size, their structure is basically similar. Even the plant and animal cells show a significant degree of similarity in their overall organization. There are two types of cells: eukaryotic and prokaryotic. The main difference between them is the method of genetic material storage: in eukaryotic cells — in an isolated nucleus, in prokaryotic cells — directly in the cytoplasm (there is no nucleus). Prokaryotic cells are usually independent (unicellular), while eukaryotic cells are often found in multicellular organisms. Relevant:"""
"""What is the effect of capsaicin and its analogues on the nocifensive response of Caenorhabditis elegans to noxious heat?""","""oshmehqb""","""false""","""Exploring the purine core of 3′-C-ethynyladenosine (EAdo) in search of novel nucleoside therapeutics. A series of new nucleoside analogues based on a C-3 branched ethynyl sugar derivative as present in 3′-C-ethynylcytidine (ECyd) and -adenosine (EAdo), combined with modified purine bases was synthetized and evaluated against a broad array of viruses and tumour cell lines. The pronounced cytostatic activity of EAdo was confirmed. EAdo and its 2,6-diaminopurine analogue showed inhibitory activity against vaccinia virus (EC(50): 0.31 and 51 μM, respectively). Derivative 10 on the other hand was found active against varicella zoster virus (EC(50): 4.68 μM).""","""Query: What is the effect of capsaicin and its analogues on the nocifensive response of Caenorhabditis elegans to noxious heat? Document: Exploring the purine core of 3′-C-ethynyladenosine (EAdo) in search of novel nucleoside therapeutics. A series of new nucleoside analogues based on a C-3 branched ethynyl sugar derivative as present in 3′-C-ethynylcytidine (ECyd) and -adenosine (EAdo), combined with modified purine bases was synthetized and evaluated against a broad array of viruses and tumour cell lines. The pronounced cytostatic activity of EAdo was confirmed. EAdo and its 2,6-diaminopurine analogue showed inhibitory activity against vaccinia virus (EC(50): 0.31 and 51 μM, respectively). Derivative 10 on the other hand was found active against varicella zoster virus (EC(50): 4.68 μM). Relevant:"""
"""What is the ""push/pull"" point-of-dispensing (POD) vaccination model and how effective was it in improving healthcare worker influenza vaccination rates at Flushing Hospital Medical Center?""","""pbl7gqkc""","""false""","""MODELING COVID19 IN INDIA (MAR 3 - MAY 7, 2020): HOW FLAT IS FLAT, AND OTHER HARD FACTS. A time-series model was developed for Number of Total Infected Cases in India, using data from Mar 3 to May 7, 2020. Two models developed in the early phases were discarded when they lost statistical validity, The third, current, model is a 3rd-degree polynomial that has remained stable over the last 30 days (since Apr 8), with R2 > 0.998 consistently. This model is used to forecast Total Covid cases, after cautionary discussion of triggers that would invalidate the model. The purpose of all forecasts in the study is to provide a comparator to evaluate policy initiatives to control the pandemic: the forecasts are not objectives by themselves. Actual observations less than forecasts mean successful policy interventions. Figures of Doubling Time, Fatality Rate and Recovery Rate used by authorities are questioned. Elongation of doubling rates is inherent in the model, and worthy of mention only when …","""Query: What is the ""push/pull"" point-of-dispensing (POD) vaccination model and how effective was it in improving healthcare worker influenza vaccination rates at Flushing Hospital Medical Center? Document: MODELING COVID19 IN INDIA (MAR 3 - MAY 7, 2020): HOW FLAT IS FLAT, AND OTHER HARD FACTS. A time-series model was developed for Number of Total Infected Cases in India, using data from Mar 3 to May 7, 2020. Two models developed in the early phases were discarded when they lost statistical validity, The third, current, model is a 3rd-degree polynomial that has remained stable over the last 30 days (since Apr 8), with R2 > 0.998 consistently. 
This model is used to forecast Total Covid cases, after cautionary discussion of triggers that would invalidate the model. The purpose of all forecasts in the study is to provide a comparator to evaluate policy initiatives to control the pandemic: the forecasts are not objectives by themselves. Actual observations less than forecasts mean successf…"
"""what is the relationship between tonsil hypertrophy and epiglottitis in children?""","""bkmi8izx""","""true""","""Tonsillar hypertrophy and prolapse in a child – is epiglottitis a predisposing factor for sudden unexpected death?. BACKGROUND: Tonsillitis, with associated tonsillar hypertrophy, is a common disease of childhood, yet it is rarely associated with sudden death due to airway obstruction. Lethal complications involving the inflamed tonsils include haemorrhage, retropharyngeal abscess and disseminated sepsis. CASE PRESENTATION: We report on a case of sudden and unexpected death in an 8-year-old female who was diagnosed with and treated for tonsillitis. The child was diagnosed with acute tonsillitis 2 days prior to her collapse and was placed on a course of oral antibiotics. There were no signs of upper or lower airway obstruction. She was found to be unresponsive by her caregiver and gasping for air in her bed in the early hours of the second morning after the start of treatment. Autopsy showed massive and symmetrically enlarged palatine tonsils. The tonsils filled the pharynx almost comp…","""Query: what is the relationship between tonsil hypertrophy and epiglottitis in children? Document: Tonsillar hypertrophy and prolapse in a child – is epiglottitis a predisposing factor for sudden unexpected death?. BACKGROUND: Tonsillitis, with associated tonsillar hypertrophy, is a common disease of childhood, yet it is rarely associated with sudden death due to airway obstruction. Lethal complications involving the inflamed tonsils include haemorrhage, retropharyngeal abscess and disseminated sepsis. CASE PRESENTATION: We report on a case of sudden and unexpected death in an 8-year-old female who was diagnosed with and treated for tonsillitis. The child was diagnosed with acute tonsillitis 2 days prior to her collapse and was placed on a course of oral antibiotics. There were no signs of upper or lower airway obstruction. 
She was found to be unresponsive by her caregiver and gasping for air in her bed in the early hours of the second morning after the start of treatment. Autopsy sho…"
"""What is the relationship between drug-drug interactions and preventable adverse drug reactions among older adults, particularly in relation to analgesics and opioids?""","""jbk38wpj""","""true""","""Opioid therapies and cytochrome p450 interactions.. Adverse drug reactions are common and associated with substantial economic and human costs. Particularly among older adult populations, preventable adverse drug reactions are often caused by drug-drug interactions. All analgesics have side effect profiles and many have known drug-drug interactions. Opioids are recognized as a necessary option for managing moderate-to-severe pain, yet many opioid side effects can be enhanced by metabolic interactions within the liver, involving other drugs, diseases, or genetics.""","""Query: What is the relationship between drug-drug interactions and preventable adverse drug reactions among older adults, particularly in relation to analgesics and opioids? Document: Opioid therapies and cytochrome p450 interactions.. Adverse drug reactions are common and associated with substantial economic and human costs. Particularly among older adult populations, preventable adverse drug reactions are often caused by drug-drug interactions. All analgesics have side effect profiles and many have known drug-drug interactions. Opioids are recognized as a necessary option for managing moderate-to-severe pain, yet many opioid side effects can be enhanced by metabolic interactions within the liver, involving other drugs, diseases, or genetics. Relevant:"""
"""What measures can be taken to address the mental health challenges faced by refugees during the COVID-19 pandemic, particularly in terms of access to healthcare and support services?""","""ag0s2u1t""","""true""","""A CRISIS WITHIN THE CRISIS: THE MENTAL HEALTH SITUATION OF REFUGEES IN THE WORLD DURING THE 2019 CORONAVIRUS (2019-nCoV) OUTBREAK. Abstract Background 68.5 million people around the world have been forced to leave their houses. Refugees have mainly to face their adaption in a host country, which involves bureaucracy, different culture, poverty, and racism. The already fragile situation of refugees becomes worrying and challenged in the face of the new coronavirus (COVID-19) epidemic. Therefore, we aimed to describe the factors that can worsen the mental health of refugees. Method The studies were identified in well-known international journals found in three electronic databases: PubMed, Scopus, and Embase. The data were cross-checked with information from the main international newspapers. Results According to the literature, the difficulties faced by refugees with the COVID-19 pandemic are potentiated by the pandemic state. There are several risk factors common to coronavirus and ps…","""Query: What measures can be taken to address the mental health challenges faced by refugees during the COVID-19 pandemic, particularly in terms of access to healthcare and support services? Document: A CRISIS WITHIN THE CRISIS: THE MENTAL HEALTH SITUATION OF REFUGEES IN THE WORLD DURING THE 2019 CORONAVIRUS (2019-nCoV) OUTBREAK. Abstract Background 68.5 million people around the world have been forced to leave their houses. Refugees have mainly to face their adaption in a host country, which involves bureaucracy, different culture, poverty, and racism. The already fragile situation of refugees becomes worrying and challenged in the face of the new coronavirus (COVID-19) epidemic. 
Therefore, we aimed to describe the factors that can worsen the mental health of refugees. Method The studies were identified in well-known international journals found in three electronic databases: PubMed, Scopus, and Embase. The data were cross-checked with information from the main international newspaper…"


Examples by label:

In [None]:
df_train.select(pl.col('target').value_counts())

target
struct[2]
"{""false"",16299}"
"{""true"",16299}"


Deduplication, and the fact that some colleagues did not add negative doc_ids, could leave the dataset unbalanced; in the counts above, however, both labels have 16,299 examples.

We have duplicated queries:

In [None]:
def check_duplicated_queries(df_train):
    return (
        df_train.groupby('query')
        .agg(pl.count().alias('number_of_times_the_query_appears'))
        ['number_of_times_the_query_appears']
        .value_counts()
        .sort('number_of_times_the_query_appears'))
check_duplicated_queries(df_train)

number_of_times_the_query_appears,counts
u32,u32
2,16269
4,15


Most frequent queries:

In [None]:
df_train['query'].value_counts(sort=True).head(10)

query,counts
str,u32
"""What is the importance of continuing biologic therapy in patients with severe asthma during the COVID-19 pandemic?""",4
"""""What are the case-fatality risk estimates for COVID-19 based on lag time for fatality?""""",4
"""What are the unique challenges faced by cancer scientists engaged in basic research?""",4
"""What was the pattern of mortality in the Southern Cone outbreak?""",4
"""【Health protection guideline of hotels reconstructed as isolation places for close contacts during COVID-19 outbreak】. What is the purpose of this guideline?""",4
"""What is the role of oxidative stress and epidermal growth factor receptor (EGFR) in PM2.5-induced pro-inflammatory response in human bronchial epithelial cells?""",4
"""What is the diagnostic performance of the Luminex NxTAG CoV Extended Panel for SARS-CoV-2 detection in nasopharyngeal swab specimens?""",4
"""What is the correct selection and utilization of respiratory personal protective equipment in high-risk aerosol-generating procedures?""",4
"""What is the role of nectar-producing plants in the transmission of Asaia sp.?""",4
"""What are the two thought provoking comments related to understanding the problems with the coronavirus?""",4


Count by doc_ids:

In [None]:
df_train.groupby('doc_id').count()['count'].value_counts().sort('count')

count,counts
u32,u32
1,26289
2,2495
3,302
4,66
5,22
6,3
7,3


Writing the training TSV:

In [None]:
!ls -lht aula08-inpars

total 20K
drwxr-xr-x 2 root root 4.0K May  4 11:34 synthetic-data-generation
drwxr-xr-x 2 root root 4.0K May  4 11:34 castorini_baseline
drwxr-xr-x 2 root root 4.0K May  4 11:34 teste_score_eval
drwxr-xr-x 3 root root 4.0K May  4 11:33 finetune
drwxr-xr-x 3 root root 4.0K May  4 11:33 models


In [None]:
!mkdir -pv aula08-inpars/finetune/data
df_train.select('input', 'target').write_csv('aula08-inpars/finetune/data/train_v1.tsv', separator='\t', has_header=False)
!ls -lht aula08-inpars/finetune/data
!wc -l aula08-inpars/finetune/data/*.tsv
!head aula08-inpars/finetune/data/*.tsv

total 51M
-rw-r--r-- 1 root root 51M May  4 11:59 train_v1.tsv
32598 aula08-inpars/finetune/data/train_v1.tsv
Query: What is the impact of non-coding RNAs such as microRNAs and small interfering RNAs on the replication and pathogenesis of RNA viruses, specifically in retroviruses like HIV? Document: RNA Viruses: RNA Roles in Pathogenesis, Coreplication and Viral Load. The review intends to present and recapitulate the current knowledge on the roles and importance of regulatory RNAs, such as microRNAs and small interfering RNAs, RNA binding proteins and enzymes processing RNAs or activated by RNAs, in cells infected by RNA viruses. The review focuses on how non-coding RNAs are involved in RNA virus replication, pathogenesis and host response, especially in retroviruses HIV, with examples of the mechanisms of action, transcriptional regulation, and promotion of increased stability of their targets or their degradation. Relevant:	true
Query: What is the SEIR model and how is it used to an
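Each line of the training TSV follows the `Query: ... Document: ... Relevant:` pattern, followed by a tab and the label. The sketch below shows one way such a line could be built from a single example. The function name `to_t5_example` and the whitespace cleanup are assumptions for illustration; they are not part of the notebook's pipeline.

```python
def to_t5_example(query: str, document: str, relevant: bool) -> str:
    """Build one TSV line: input text, a tab, then the target label."""
    # Collapse tabs/newlines so the text cannot break the two-column TSV layout
    # (an assumption; the notebook does not show this cleanup step).
    query = " ".join(query.split())
    document = " ".join(document.split())
    target = "true" if relevant else "false"
    return f"Query: {query} Document: {document} Relevant:\t{target}"


line = to_t5_example("what is sepsis?", "Sepsis is a\nsystemic response.", True)
input_text, target = line.split("\t")
print(input_text)
print(target)
```

Splitting on the tab recovers exactly two columns, which is what a TSV reader with `input` and `target` columns expects.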

Sending data to GCP for training:

In [None]:
!gsutil -m rsync -r aula08-inpars/ gs://aula08-inpars/

Building synchronization state...
Starting synchronization...


In [None]:
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-base')

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


## Are we truncating data?

In [None]:
# with open(args.output_to_t5, 'w') as fout_t5:
#     for line_num, line in enumerate(tqdm(open(args.triples_train))):
#         query, positive_document, negative_document = line.strip().split('\t')
#         fout_t5.write(f'Query: {query} Document: {positive_document} Relevant:\ttrue\n')
#         fout_t5.write(f'Query: {query} Document: {negative_document} Relevant:\tfalse\n')

In [None]:
df2 = pl.read_csv('aula08-inpars/finetune/data/train_v1.tsv', separator='\t', has_header=False, new_columns=['input', 'target'])
df2 = df2.with_columns(df2['input'].apply(lambda x: tokenizer(x).input_ids).alias('input_tokens'))
df2.head()

Token indices sequence length is longer than the specified maximum sequence length for this model (662 > 512). Running this sequence through the model will result in indexing errors


input,target,input_tokens
str,bool,list[i64]
"""Query: What is the impact of non-coding RNAs such as microRNAs and small interfering RNAs on the replication and pathogenesis of RNA viruses, specifically in retroviruses like HIV? Document: RNA Viruses: RNA Roles in Pathogenesis, Coreplication and Viral Load. The review intends to present and recapitulate the current knowledge on the roles and importance of regulatory RNAs, such as microRNAs and small interfering RNAs, RNA binding proteins and enzymes processing RNAs or activated by RNAs, in cells infected by RNA viruses. The review focuses on how non-coding RNAs are involved in RNA virus replication, pathogenesis and host response, especially in retroviruses HIV, with examples of the mechanisms of action, transcriptional regulation, and promotion of increased stability of their targets or their degradation. Relevant:""",True,"[3, 27569, … 1]"
"""Query: What is the SEIR model and how is it used to analyze the COVID-19 epidemic in Buenos Aires and neighbouring cities in Argentina? Document: Analysis of meteorological conditions and prediction of epidemic trend of 2019-nCoV infection in 2020. Objective: To investigate the meteorological condition for incidence and spread of 2019-nCoV infection, to predict the epidemiology of the infectious disease, and to provide a scientific basis for prevention and control measures against the new disease. Methods: The meteorological factors during the outbreak period of the novel coronavirus pneumonia in Wuhan in 2019 were collected and analyzed, and were confirmed with those of Severe Acute Respiratory Syndrome (SARS) in China in 2003. Data of patients infected with 2019-nCoV and SARS coronavirus were collected from WHO website and other public sources. Results: This study found that the suitable temperature range for 2019-nCoV coronavirus survival is (13-24 degree Celsius), among which 19 d…",False,"[3, 27569, … 1]"
"""Query: What are the potential nanobiotechnology and microfluidic approaches for developing alternative therapeutic methods for treating sepsis in extracorporeal circuits? Document: Multiscale Biofluidic and Nanobiotechnology Approaches for Treating Sepsis in Extracorporeal Circuits. Infectious diseases and their pandemics periodically attract public interests due to difficulty in treating the patients and the consequent high mortality. Sepsis caused by an imbalanced systemic inflammatory response to infection often leads to organ failure and death. The current therapeutic intervention mainly includes “the sepsis bundles,” antibiotics (antibacterial, antiviral, and antifungal), intravenous fluids for resuscitation, and surgery, which have significantly improved the clinical outcomes in past decades; however, the patients with fulminant sepsis are still in desperate need of alternative therapeutic approaches. One of the potential supportive therapies, extracorporeal blood treatment, has…",True,"[3, 27569, … 1]"
"""Query: How has the COVID-19 pandemic affected the diagnosis and treatment of obsessive-compulsive disorder, and what strategies can be implemented to address these challenges? Document: The impact of COVID‐19 in the diagnosis and treatment of obsessive‐compulsive disorder. Obsessive‐compulsive disorder (OCD) is characterized by unwanted and distressing thoughts, images or urges (obsessions) and repetitive behaviors or mental acts that aim to decrease the resulting distress or according to rigid rules (compulsions) (APA, 2013). Different studies suggest OCD to affect up to 3.1 % of the general population and to be associated with substantial disability and decreased quality of life (Fontenelle, Mendlowicz, & Versiani, 2006; Ruscio, Stein, Chiu, & Kessler, 2010). This article is protected by copyright. All rights reserved. Relevant:""",True,"[3, 27569, … 1]"
"""Query: What is the difference between the causative organisms of condyloma acuminatum and verruca vulgaris? Document: Cells and Viruses. Cells are the smallest structural component of all known living organisms capable of self-maintenance and reproduction. Although cells vary greatly in their appearance or size, their structure is basically similar. Even the plant and animal cells show a significant degree of similarity in their overall organization. There are two types of cells: eukaryotic and prokaryotic. The main difference between them is the method of genetic material storage: in eukaryotic cells — in an isolated nucleus, in prokaryotic cells — directly in the cytoplasm (there is no nucleus). Prokaryotic cells are usually independent (unicellular), while eukaryotic cells are often found in multicellular organisms. Relevant:""",False,"[3, 27569, … 1]"


Fraction of inputs that would be truncated (more than 512 tokens):

In [None]:
df2.select(pl.col('input_tokens').arr.lengths().gt(512).mean())

input_tokens
f64
0.199245


In [None]:
df2['input_tokens'].arr.lengths().describe(percentiles=[.5, .25, .5, .75, .95, .99])

statistic,value
str,f64
"""count""",32598.0
"""null_count""",0.0
"""mean""",375.149304
"""std""",200.094974
"""min""",19.0
"""max""",8558.0
"""median""",371.0
"""50%""",371.0
"""25%""",256.0
"""75%""",483.0


We are truncating approximately 20% of our data (a fraction of 0.199 of the inputs exceeds 512 tokens). I will not address this issue because we won't have time for testing, and inputs will also be truncated at inference time.
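As a minimal sketch of the check behind this number, the fraction of truncated inputs can be computed from token counts in plain Python. The helper name `fraction_truncated` is hypothetical; the sample lengths are values that appear in this section's outputs (minimum, median, 75th percentile, the 662-token warning, maximum), and the last line cross-checks the fraction against the per-label counts of inputs longer than 512 tokens reported in this section.

```python
def fraction_truncated(token_lengths, max_length=512):
    """Fraction of inputs whose token count exceeds max_length."""
    too_long = sum(1 for n in token_lengths if n > max_length)
    return too_long / len(token_lengths)


# Toy sample built from lengths that appear in this section's outputs.
sample_lengths = [19, 371, 483, 662, 8558]
print(fraction_truncated(sample_lengths))

# Cross-check: truncated examples per label (3858 false, 2637 true) over 32598 rows.
print(round((3858 + 2637) / 32598, 6))
```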

If we were to remove truncated sentences:

In [None]:
df2.filter(pl.col('input_tokens').arr.lengths().gt(512))['target'].value_counts()

target,counts
bool,u32
False,3858
True,2637


# Model finetuning

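Final training step after 1 epoch (the initial checkpoint is at step 1,100,000, and one epoch adds 254 steps at 128 examples per batch):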

In [None]:
1100000 + 254

1100254

In [None]:
len(df_train)/128

254.671875

We are going to train for 1 epoch.
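The step arithmetic can be written out as a small sketch. It assumes that the 128 examples per batch come from dividing the `tokens_per_batch` value (65536) by the input sequence length (512), both of which are set in the finetuning script; the function name `total_train_steps` is hypothetical.

```python
def total_train_steps(num_examples, tokens_per_batch, max_input_tokens, init_step):
    """Final step number after one epoch, starting from an initial checkpoint."""
    # Assumption: each example occupies one full slot of max_input_tokens tokens.
    examples_per_batch = tokens_per_batch // max_input_tokens
    # Whole steps only; the last partial batch is not counted.
    steps_per_epoch = num_examples // examples_per_batch
    return init_step + steps_per_epoch


print(65536 // 512)   # examples per batch
print(32598 // 128)   # steps in one epoch
print(total_train_steps(32598, 65536, 512, 1100000))
```

The result matches the `TOTAL_TRAIN_STEPS=1100254` value used in the script.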

In [None]:
!echo $BUCKET




In [None]:
32598 / 128

254.671875

In [None]:
len(df_train)

32598

In [None]:
!gsutil cat gs://aula08-inpars/finetune/data/train_v1.tsv | wc -l 

32598


In [None]:
!gsutil ls gs://aula08-inpars/finetune/data/train_v1.tsv

gs://aula08-inpars/finetune/data/train_v1.tsv


In [None]:
!gsutil ls gs://castorini/monot5/experiments/base/model.ckpt-1100000

CommandException: One or more URLs matched no objects.


In [None]:
!gsutil ls $BUCKET

gs://aula08-inpars/castorini_baseline/
gs://aula08-inpars/finetune/
gs://aula08-inpars/synthetic-data-generation/
gs://aula08-inpars/teste_score_eval/


In [None]:
!gsutil ls gs://aula08-inpars/finetune/models/v1


gs://aula08-inpars/finetune/data/


In [None]:
!gsutil ls gs://aula08-inpars/finetune/models/finetune_v1

gs://aula08-inpars/finetune/data/


In [None]:
%%writefile finetune_v1.sh
set -u

# This env var should be already set
# BUCKET=gs://aula08-inpars

ZONE=us-central1-f
# TPU_NAME=local this is already set on colab
TPU_SIZE=v2-8
TFDS_DATA_DIR=${BUCKET}/tensorflow_datasets
CACHE_TASKS_DIR=${BUCKET}/cache_tasks

CASTORINI_MODEL_DIR=gs://castorini/monot5/experiments/base
CASTORINI_INIT_CHECKPOINT=gs://castorini/monot5/experiments/base/model.ckpt-1100000
MODEL_DIR=gs://aula08-inpars/finetune/models/finetune_v1

echo MODEL_DIR=$MODEL_DIR

LOG_FILE=train_log.log
echo $LOG_FILE
echo LOG_FILE log file is $LOG_FILE
TOTAL_TRAIN_STEPS=1100254

# save only last step
CKPT_INTERVAL=10000000


echo -e "*****pip requirements*****\n$(pip freeze)" > $LOG_FILE
(time python3 -m t5.models.mesh_transformer_main  \
--gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
--tpu_zone="${ZONE}" \
--tpu="${TPU_NAME}" \
--gin_param="utils.run.train_steps=${TOTAL_TRAIN_STEPS}" \
--gin_param="utils.run.save_checkpoints_steps = ${CKPT_INTERVAL}" \
--gin_param="init_checkpoint = '${CASTORINI_INIT_CHECKPOINT}'" \
--model_dir="${MODEL_DIR}" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \
--gin_param="utils.run.iterations_per_loop = 10" \
--gin_param="utils.run.keep_checkpoint_max = None" \
--gin_param="utils.run.batch_size = (\"tokens_per_batch\", 65536)" \
--gin_param="utils.run.sequence_length = {'inputs': 512, 'targets': 2}" \
--gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
--gin_param="tsv_dataset_fn.filename = 'gs://aula08-inpars/finetune/data/train_v1.tsv'" \
--gin_file="learning_rate_schedules/constant_0_001.gin" \
"$@" ) 2>&1 | tee -a $LOG_FILE

FINAL_LOG_FILE=${MODEL_DIR}/train.log
echo Final log file is $FINAL_LOG_FILE
gsutil cp $LOG_FILE $FINAL_LOG_FILE

Overwriting finetune_v1.sh


In [None]:
!chmod +x finetune_v1.sh && time ./finetune_v1.sh

MODEL_DIR=gs://aula08-inpars/finetune/models/finetune_v1
train_log.log
LOG_FILE log file is train_log.log
Instructions for updating:
non-resource variables are not supported in the long term
I0504 12:40:35.982220 140020320651072 resource_reader.py:50] system_path_file_exists:gs://t5-data/pretrained_models/base/operative_config.gin
E0504 12:40:35.983251 140020320651072 resource_reader.py:55] Path not found: gs://t5-data/pretrained_models/base/operative_config.gin
I0504 12:40:36.116784 140020320651072 config.py:2372] Skipping import of unknown module `t5.data.sentencepiece_vocabulary` (skip_unknown=True).
I0504 12:40:36.145574 140020320651072 resource_reader.py:50] system_path_file_exists:learning_rate_schedules/constant_0_001.gin
E0504 12:40:36.146072 140020320651072 resource_reader.py:55] Path not found: learning_rate_schedules/constant_0_001.gin
INFO:tensorflow:model_type=bitransformer
I0504 12:40:36.156799 140020320651072 utils.py:2912] model_type=bitransformer
INFO:tensorflow:mode=t

Tensorboard (too slow):

In [None]:
# %load_ext tensorboard
# !tensorboard --logdir=gs://aula08-inpars/finetune/models/finetune_v1 --load_fast=false

# Reranking using the finetuned model

Scoring with the recommended approach (using only the true and false logits) requires a fork of mesh-tensorflow and would require a new Colab instance. For now, we are going to use the log-likelihood (the score_eval mode of T5) to avoid creating a new Colab notebook.
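The two scoring options can be contrasted with a small sketch. It assumes that the recommended score is the log of a softmax restricted to the logits of the `true` and `false` tokens, and that the log-likelihood score is the sum of the log-probabilities of the target tokens. The function names and the example numbers are hypothetical and are not taken from T5 or mesh-tensorflow.

```python
import math


def restricted_true_false_score(true_logit, false_logit):
    """Log-probability of 'true' under a softmax over only the two label logits."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(true_logit - m) + math.exp(false_logit - m))
    return true_logit - log_z


def sequence_log_likelihood(token_log_probs):
    """Log-likelihood of a target sequence: sum of its per-token log-probabilities,
    each computed over the full vocabulary."""
    return sum(token_log_probs)


# Made-up numbers: equal logits give log(0.5) under the restricted softmax.
print(restricted_true_false_score(2.0, 2.0))
# Made-up per-token log-probabilities for a two-token target.
print(sequence_log_likelihood([-1.1, -0.1]))
```

Both scores are log-probabilities, so they are at most zero, and a higher value means the model considers the document more relevant. They are not numerically identical, because the log-likelihood normalizes over the whole vocabulary at every target position.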

In [None]:
%%writefile monot5-base-score-eval.sh
set -u
# export MODEL_NAME=base
# export MODEL_DIR=gs://castorini/monot5/experiments/${MODEL_NAME}
# export CHECKPOINT_STEP=1100000

ZONE=us-central1-f
TPU_SIZE=v2-8

INPUT_FILENAME=$1
OUTPUT_FILENAME=$2
TARGETS_FILENAME=$3
LOG_FILE=$4
MODEL_DIR=$5
CKPT_STEP=$6



time python3 -m t5.models.mesh_transformer_main \
--gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
--gin_file="infer.gin" \
--tpu="${TPU_NAME}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \
--gin_file="beam_search.gin" \
--gin_param="utils.run.sequence_length = {'inputs': 512, 'targets': 2}" \
--gin_param="Bitransformer.decode.max_decode_length = 2" \
--gin_param="utils.run.batch_size=('tokens_per_batch', 65536)" \
--gin_param="Bitransformer.decode.beam_size = 1" \
--gin_param="Bitransformer.decode.temperature = 0.0" \
--gin_param="Unitransformer.sample_autoregressive.sampling_keep_top_k = -1" \
--gin_file="score_from_file.gin" \
--gin_param="inputs_filename = '${INPUT_FILENAME}'" \
--gin_param="targets_filename = '${TARGETS_FILENAME}'" \
--gin_param="scores_filename = '${OUTPUT_FILENAME}'" \
--gin_param="infer_checkpoint_step = ${CKPT_STEP}" \
2>&1 | tee -a $LOG_FILE

Overwriting monot5-base-score-eval.sh


In [None]:
# INPUT_FILENAME=$1
# OUTPUT_FILENAME=$2
# TARGETS_FILENAME=$3
# LOG_FILE=$4
# MODEL_DIR=$5
# CKPT_STEP=$6


!chmod +x monot5-base-score-eval.sh && \
time ./monot5-base-score-eval.sh \
gs://aula08-inpars/castorini_baseline/castorini-baseline-t5-input.txt \
gs://aula08-inpars/preds_eval_score_finetune_v1/predictions.txt \
gs://aula08-inpars/teste_score_eval/true_50klines.txt \
teste_score_eval.log \
gs://aula08-inpars/finetune/models/finetune_v1/ \
1100254

Instructions for updating:
non-resource variables are not supported in the long term
I0504 13:11:04.667075 140131670230848 resource_reader.py:50] system_path_file_exists:gs://t5-data/pretrained_models/base/operative_config.gin
E0504 13:11:04.668255 140131670230848 resource_reader.py:55] Path not found: gs://t5-data/pretrained_models/base/operative_config.gin
I0504 13:11:04.813249 140131670230848 config.py:2372] Skipping import of unknown module `t5.data.sentencepiece_vocabulary` (skip_unknown=True).
I0504 13:11:04.843255 140131670230848 resource_reader.py:50] system_path_file_exists:infer.gin
E0504 13:11:04.843743 140131670230848 resource_reader.py:55] Path not found: infer.gin
I0504 13:11:04.845785 140131670230848 resource_reader.py:50] system_path_file_exists:beam_search.gin
E0504 13:11:04.846140 140131670230848 resource_reader.py:55] Path not found: beam_search.gin
I0504 13:11:04.848197 140131670230848 resource_reader.py:50] system_path_file_exists:score_from_file.gin
E0504 13:11:04

Generated files:

In [None]:
!gsutil ls gs://aula08-inpars/preds_eval_score_finetune_v1/

gs://aula08-inpars/preds_eval_score_finetune_v1/
gs://aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.lengths
gs://aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.scores
gs://aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.targets


Copying prediction files to local directory:

In [None]:
!gsutil -m rsync -r -x "(.*)/models/(.*)$" gs://aula08-inpars/ aula08-inpars/

!ls -lht aula08-inpars/preds_eval_score_finetune_v1
!wc -l aula08-inpars/preds_eval_score_finetune_v1/*
!head aula08-inpars/preds_eval_score_finetune_v1/*

Building synchronization state...
Starting synchronization...
total 792K
-rw-r--r-- 1 root root  98K May  4 13:13 predictions.txt.lengths
-rw-r--r-- 1 root root 245K May  4 13:13 predictions.txt.targets
-rw-r--r-- 1 root root 441K May  4 13:13 predictions.txt.scores
 50000 aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.lengths
 50000 aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.scores
 50000 aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.targets
150000 total
==> aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.lengths <==
2
2
2
2
2
2
2
2
2
2

==> aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.scores <==
-1.20705
-2.04487
-1.8179
-2.52126
-2.69493
-2.92988
-2.52126
-2.35021
-0.064624
-2.09946

==> aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.targets <==
true
true
true
true
true
true
true
true
true
true


Converting predictions to a trec-format run:

In [None]:
!python3 pygaggle/pygaggle/data/convert_run_from_t5_to_trec_format.py --help

usage: convert_run_from_t5_to_trec_format.py
       [-h]
       --predictions
       PREDICTIONS
       --query_run_ids
       QUERY_RUN_IDS
       --output
       OUTPUT

Convert T5
predictions
into a
TREC-
formatted
run.

options:
  -h, --help
    show this
    help
    message and
    exit
  --predictions PREDICTIONS
    T5
    predictions
    file.
  --query_run_ids QUERY_RUN_IDS
    File
    containing
    query doc
    id pairs
    paired with
    the T5's
    predictions
    file.
  --output OUTPUT
    run file in
    the TREC
    format.
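A TREC run file has one line per query-document pair in the form `query_id Q0 doc_id rank score tag`. The sketch below illustrates the conversion idea: group the scored pairs by query, sort by score in descending order, and write the rank together with `1 / rank` as the run score. It is a simplified stand-in written for illustration, not pygaggle's implementation; the function name `to_trec_run` and the document ids are hypothetical.

```python
from collections import defaultdict


def to_trec_run(pairs, scores, tag="T5"):
    """Turn (query_id, doc_id) pairs and their model scores into TREC run lines."""
    by_query = defaultdict(list)
    for (query_id, doc_id), score in zip(pairs, scores):
        by_query[query_id].append((doc_id, score))

    lines = []
    for query_id, docs in by_query.items():
        # Higher (less negative) log-likelihood means more relevant.
        ranked = sorted(docs, key=lambda item: item[1], reverse=True)
        for rank, (doc_id, _) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {1 / rank} {tag}")
    return lines


pairs = [("1", "doc_a"), ("1", "doc_b"), ("1", "doc_c")]
scores = [-2.04487, -0.064624, -1.20705]
for line in to_trec_run(pairs, scores):
    print(line)
```

Using `1 / rank` as the written score keeps the run file sorted consistently even though the model scores are negative log-likelihoods.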


In [None]:
!paste -d'\t' aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.targets aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.scores | head

true	-1.20705
true	-2.04487
true	-1.8179
true	-2.52126
true	-2.69493
true	-2.92988
true	-2.52126
true	-2.35021
true	-0.064624
true	-2.09946


In [None]:
!python3 pygaggle/pygaggle/data/convert_run_from_t5_to_trec_format.py \
    --predictions=<(paste -d'\t' aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.targets aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.scores) \
    --query_run_ids=aula08-inpars/castorini_baseline/castorini-baseline-t5-input_ids.txt \
    --output=aula08-inpars/preds_eval_score_finetune_v1/finetune_score_eval_run.txt


!ls -lht aula08-inpars/preds_eval_score_finetune_v1
!wc -l aula08-inpars/preds_eval_score_finetune_v1/*
!head aula08-inpars/preds_eval_score_finetune_v1/*

Done!
total 2.9M
-rw-r--r-- 1 root root 2.1M May  4 14:38 finetune_score_eval_run.txt
-rw-r--r-- 1 root root  98K May  4 13:13 predictions.txt.lengths
-rw-r--r-- 1 root root 245K May  4 13:13 predictions.txt.targets
-rw-r--r-- 1 root root 441K May  4 13:13 predictions.txt.scores
  50000 aula08-inpars/preds_eval_score_finetune_v1/finetune_score_eval_run.txt
  50000 aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.lengths
  50000 aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.scores
  50000 aula08-inpars/preds_eval_score_finetune_v1/predictions.txt.targets
 200000 total
==> aula08-inpars/preds_eval_score_finetune_v1/finetune_score_eval_run.txt <==
1 Q0 h3ovjqcn 1 1.0 T5
1 Q0 pu9l36j9 2 0.5 T5
1 Q0 n8e2n30b 3 0.3333333333333333 T5
1 Q0 o7d6m9k5 4 0.25 T5
1 Q0 iohvj16d 5 0.2 T5
1 Q0 utsr0zv7 6 0.16666666666666666 T5
1 Q0 4dtk1kyh 7 0.14285714285714285 T5
1 Q0 xuczplaf 8 0.125 T5
1 Q0 1mjaycee 9 0.1111111111111111 T5
1 Q0 sh7lrdou 10 0.1 T5

==> aula08-inpars/preds_eva
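
In the run file shown above, the score column equals 1/rank (1.0, 0.5, 0.333…). That suggests the converter groups documents by query, sorts them by model score, and writes the reciprocal rank as the run score. The sketch below reproduces that behaviour as inferred from the output only. It is not the pygaggle source: `build_trec_run` and the toy ids are hypothetical, and the sketch assumes every target is `true`, as in the sample above.

```python
import collections


def build_trec_run(query_doc_ids, scores, tag="T5"):
    """Group scored (query, doc) pairs by query and emit TREC run lines.

    Hypothetical helper: the written score is 1/rank, mirroring the
    pattern visible in finetune_score_eval_run.txt.
    """
    per_query = collections.defaultdict(list)
    for (query_id, doc_id), score in zip(query_doc_ids, scores):
        per_query[query_id].append((doc_id, score))

    lines = []
    for query_id, docs in per_query.items():
        # Highest model score first.
        docs.sort(key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, _) in enumerate(docs, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {1 / rank} {tag}")
    return lines


# Toy input (made-up ids and scores, for illustration only).
pairs = [("1", "docA"), ("1", "docB"), ("1", "docC"), ("2", "docD")]
toy_scores = [-2.0, -0.1, -1.0, -0.5]

run_lines = build_trec_run(pairs, toy_scores)
print("\n".join(run_lines))
```

Writing 1/rank instead of the raw score keeps the run strictly decreasing within each query, so the ordering evaluated downstream matches the rank column.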

# Evaluating results after finetuning

In [None]:
import functools
import shlex
import subprocess

TREC_EVAL_BIN_PATH = './tools/eval/trec_eval.9.0.4/trec_eval'
# This function was written using GitHub Copilot
def get_trec_eval_metrics(flags, qrels_path, results_path):
    """Runs trec_eval and returns the results as a dictionary.

    Args:
        flags (str): Flags to pass to trec_eval.
        qrels_path (str): Path to the qrels file.
        results_path (str): Path to the results file.

    Returns:
        Dict[str, str]: A dictionary mapping metric names to their values,
            kept as the strings printed by trec_eval.
    """
    output = subprocess.check_output([
        TREC_EVAL_BIN_PATH,
        qrels_path,
        results_path,
        *shlex.split(flags)
    ]).decode('utf-8')
    # Each output line holds three fields: metric name, query id, value.
    return {
        line.split()[0]: line.split()[2]
        for line in output.splitlines()
    }


eval_metrics_fn = functools.partial(
    get_trec_eval_metrics,
    '-c -m ndcg_cut.10 -m recall.1000 -m recip_rank.10',
    'aula08-inpars/castorini_baseline/trec-covid-qrels_trec_format.txt'
)

In [None]:
eval_metrics_fn('aula08-inpars/preds_eval_score_finetune_v1/finetune_score_eval_run.txt')

{'recip_rank': '0.8936', 'recall_1000': '0.3955', 'ndcg_cut_10': '0.7520'}

In [None]:
eval_metrics_fn('aula08-inpars/teste_score_eval/teste_score_eval_run.txt')

{'recip_rank': '0.8585', 'recall_1000': '0.3955', 'ndcg_cut_10': '0.7174'}
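
Because `get_trec_eval_metrics` returns the values as strings, comparing the two runs takes an explicit conversion to `float`. A small sketch using the two dictionaries printed above (`metric_deltas` is a hypothetical helper introduced only for this comparison):

```python
# Metric dictionaries copied from the two cell outputs above.
finetune_v1 = {'recip_rank': '0.8936', 'recall_1000': '0.3955', 'ndcg_cut_10': '0.7520'}
teste_score_eval = {'recip_rank': '0.8585', 'recall_1000': '0.3955', 'ndcg_cut_10': '0.7174'}


def metric_deltas(run_a, run_b, ndigits=4):
    """Return run_a minus run_b for every metric, rounded to ndigits."""
    return {
        name: round(float(run_a[name]) - float(run_b[name]), ndigits)
        for name in run_a
    }


deltas = metric_deltas(finetune_v1, teste_score_eval)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

The finetune_v1 run scores higher on `ndcg_cut_10` and `recip_rank`, while `recall_1000` is identical in both runs. That would be consistent with both runs reordering the same candidate documents, although the notebook does not state this.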