# Crisis Event Social Media Summarization with GPT-3 and Neural Reranking

This notebook contains the code used in the experiments reported in the paper **[Crisis Event Social Media Summarization with GPT-3 and Neural Reranking](#)**, accepted for publication at the **20th Annual Global Conference on Information Systems for Crisis Response and Management ([ISCRAM 2023](https://www.unomaha.edu/college-of-information-science-and-technology/iscram2023/index.php))**.

> **Abstract:** Managing emergency events, such as natural disasters, requires management teams to have an up-to-date view of what is happening throughout the event. In this paper, we demonstrate how a method using a state-of-the-art open-sourced search engine and a large language model can generate accurate and comprehensive summaries by retrieving information from social media and online news sources.  We evaluated our method on the TREC CrisisFACTS challenge dataset using automatic summarization metrics (e.g., Rouge-2 and BERTScore) and the manual evaluation performed by the challenge organizers. Our approach is the best in comprehensiveness despite presenting a high redundancy ratio in the generated summaries. In addition, since all pipeline components are few-shot, there is no need to collect training data, allowing us to deploy the system rapidly.

* **A machine with a GPU is strongly recommended.**
* **You must have an [OpenAI API](https://platform.openai.com/docs/api-reference/answers) key.** Search for `API_KEY` in this notebook and replace it with your key.


## Install required packages

In [None]:
!pip install pygaggle
!pip install git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)
!pip install pyserini
!pip install transformers --upgrade
!pip install openai

## Setting the device

Make sure you are using a GPU runtime. The notebook also runs on a CPU, but much more slowly.

In [None]:
!nvidia-smi

Fri Feb 24 10:51:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    25W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import torch

if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

## Download Dataset

[CrisisFACTS](https://crisisfacts.github.io/) provides multi-stream datasets from several disasters, covering Twitter, Reddit, Facebook, and online news sources. The datasets are supplemented with queries that define the information needs of disaster-response stakeholders (extracted from FEMA ICS209 forms). Participating systems should integrate these streams into temporally ordered lists of important facts, which can then be aggregated into summaries for disaster-response personnel.



In [None]:
credentials = {
    "institution": "", # University, Company or Public Agency Name
    "contactname": "", # Your Name
    "email": "", # A contact email address
    "institutiontype": "Research" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write this to a file so it can be read when needed
import json
import os

home_dir = os.path.expanduser('~')

!mkdir -p ~/.ir_datasets/auth/
with open(home_dir + '/.ir_datasets/auth/crisisfacts.json', 'w') as f:
    json.dump(credentials, f)

In [None]:
# Event numbers as a list
eventNoList = [
          "001", # Lilac Wildfire 2017
          "002", # Cranston Wildfire 2018
          "003", # Holy Wildfire 2018
          "004", # Hurricane Florence 2018
          "005", # 2018 Maryland Flood
          "006", # Saddleridge Wildfire 2019
          "007", # Hurricane Laura 2020
          "008" # Hurricane Sally 2020
]
eventNames = {
    "001":"Lilac Wildfire 2017",
    "002":"Cranston Wildfire 2018",
    "003":"Holy Wildfire 2018",
    "004":"Hurricane Florence 2018",
    "005":"2018 Maryland Flood",
    "006":"Saddleridge Wildfire 2019",
    "007":"Hurricane Laura 2020",
    "008":"Hurricane Sally 2020"
}

In [None]:
import requests

# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

  # We will download a file containing the day list for an event
  url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

  # Download the list and parse as JSON
  dayList = requests.get(url).json()


  return dayList

eventsMeta = {}

for eventNo in eventNoList: # for each event
    dailyInfo = getDaysForEventNo(eventNo) # get the list of days
    eventsMeta[eventNo]= dailyInfo


In [None]:
eventsMeta['003']

[{'eventID': 'CrisisFACTS-003',
  'requestID': 'CrisisFACTS-003-r5',
  'dateString': '2018-08-06',
  'startUnixTimestamp': 1533510000,
  'endUnixTimestamp': 1533596399},
 {'eventID': 'CrisisFACTS-003',
  'requestID': 'CrisisFACTS-003-r6',
  'dateString': '2018-08-07',
  'startUnixTimestamp': 1533596400,
  'endUnixTimestamp': 1533682799},
 {'eventID': 'CrisisFACTS-003',
  'requestID': 'CrisisFACTS-003-r7',
  'dateString': '2018-08-08',
  'startUnixTimestamp': 1533682800,
  'endUnixTimestamp': 1533769199},
 {'eventID': 'CrisisFACTS-003',
  'requestID': 'CrisisFACTS-003-r8',
  'dateString': '2018-08-09',
  'startUnixTimestamp': 1533769200,
  'endUnixTimestamp': 1533855599},
 {'eventID': 'CrisisFACTS-003',
  'requestID': 'CrisisFACTS-003-r9',
  'dateString': '2018-08-10',
  'startUnixTimestamp': 1533855600,
  'endUnixTimestamp': 1533941999},
 {'eventID': 'CrisisFACTS-003',
  'requestID': 'CrisisFACTS-003-r10',
  'dateString': '2018-08-12',
  'startUnixTimestamp': 1534028400,
  'endUnixTime
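Each request describes one day of the event as a pair of Unix timestamps. The following is a minimal sketch, using two records copied from the output above, that checks the window length and prints the start of each window in UTC. The helper name `window_seconds` is hypothetical and not part of the CrisisFACTS tooling.

```python
from datetime import datetime, timezone

# Two request records copied from the eventsMeta['003'] output above
sample_requests = [
    {"requestID": "CrisisFACTS-003-r5", "dateString": "2018-08-06",
     "startUnixTimestamp": 1533510000, "endUnixTimestamp": 1533596399},
    {"requestID": "CrisisFACTS-003-r6", "dateString": "2018-08-07",
     "startUnixTimestamp": 1533596400, "endUnixTimestamp": 1533682799},
]

def window_seconds(request):
    """Length of a request window in seconds (both endpoints inclusive)."""
    return request["endUnixTimestamp"] - request["startUnixTimestamp"] + 1

for r in sample_requests:
    start_utc = datetime.fromtimestamp(r["startUnixTimestamp"], tz=timezone.utc)
    print(r["requestID"], start_utc.isoformat(), window_seconds(r))
```

Both sample windows span exactly 24 hours and follow each other without a gap.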

Visualize documents with pandas

In [None]:
import ir_datasets
import pandas as pd
id_ = 'crisisfacts/{0}/{1}'.format("003", "2018-08-13")
print(id_)
dataset = ir_datasets.load(id_)

# queries = pd.DataFrame(dataset.queries_iter())
docs = pd.DataFrame(dataset.docs_iter())
docs.head()

[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/003/2018-08-13


[INFO] [finished] requesting access key [5.55s]
docs_iter: 195doc [00:06, 29.56doc/s]
[INFO] [finished] docs_iter: [00:06] [195doc] [29.56doc/s]
[INFO] [finished] building docstore [6.62s]


| | doc_id | event | text | source | source_type | unix_timestamp |
|---|---|---|---|---|---|---|
| 0 | CrisisFACTS-003-Facebook-7517-0 | CrisisFACTS-003 | | {'pageID': 228211847326695, 'postID': 11377160... | Facebook | 1534114807 |
| 1 | CrisisFACTS-003-Twitter-20739-0 | CrisisFACTS-003 | @MBELANOVA Well there more on the republican s... | {'created_at': 'Sun Aug 12 23:00:18 +0000 2018... | Twitter | 1534114818 |
| 2 | CrisisFACTS-003-Twitter-34418-0 | CrisisFACTS-003 | @DFantom_ @Dpoz10 @kayykayyyy_ @Marcel_LV @RhO... | {'created_at': 'Sun Aug 12 23:00:30 +0000 2018... | Twitter | 1534114830 |
| 3 | CrisisFACTS-003-Twitter-647-0 | CrisisFACTS-003 | “I got a heart that’s full of pain, fuck love ... | {'created_at': 'Sun Aug 12 23:01:04 +0000 2018... | Twitter | 1534114864 |
| 4 | CrisisFACTS-003-Twitter-7783-0 | CrisisFACTS-003 | Thanks for a wonderful Sunday @TigerWoods !!!!... | {'created_at': 'Sun Aug 12 23:01:16 +0000 2018... | Twitter | 1534114876 |


## Create searchable index with Pyserini

In [None]:
%mkdir -p jsonlfiles

In [None]:
import ir_datasets
import pandas as pd
c = 0

jsonl = open("jsonlfiles/data.jsonl","w")

# for eventId in eventsMeta:
for eventId in ["001"]:
  event_requests = eventsMeta[eventId] # avoid shadowing the `requests` module imported earlier
  for request in event_requests:
    id_ = 'crisisfacts/{0}/{1}'.format(eventId, request['dateString'])
    print(id_)
    dataset = ir_datasets.load(id_)
    
    # queries = pd.DataFrame(dataset.queries_iter())
    docs = pd.DataFrame(dataset.docs_iter())
    docs_dict = docs.to_dict(orient="records")
    
    for doc in docs_dict:
      doc['contents'] = doc['text']
      doc["dateString"] = request['dateString']
      doc['eventId'] = eventId
      doc["id"] = doc['doc_id']
      del doc['source']
      # del doc['id']
      jsonl.write(json.dumps(doc)+"\n")

# Close the file so every buffered line is flushed to disk before indexing
jsonl.close()

[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-07


[INFO] [finished] requesting access key [1.17s]
docs_iter: 7288doc [00:24, 293.54doc/s]
[INFO] [finished] docs_iter: [00:24] [7288doc] [293.52doc/s]
[INFO] [finished] building docstore [24.83s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-08


[INFO] [finished] requesting access key [2.54s]
docs_iter: 19231doc [01:06, 290.17doc/s]
[INFO] [finished] docs_iter: [01:06] [19231doc] [290.16doc/s]
[INFO] [finished] building docstore [01:06]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-09


[INFO] [finished] requesting access key [1.03s]
docs_iter: 5839doc [00:19, 299.06doc/s]
[INFO] [finished] docs_iter: [00:19] [5839doc] [299.04doc/s]
[INFO] [finished] building docstore [19.53s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-10


[INFO] [finished] requesting access key [761ms]
docs_iter: 4407doc [00:14, 314.56doc/s]
[INFO] [finished] docs_iter: [00:14] [4407doc] [314.54doc/s]
[INFO] [finished] building docstore [14.01s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-11


[INFO] [finished] requesting access key [677ms]
docs_iter: 3394doc [00:11, 299.00doc/s]
[INFO] [finished] docs_iter: [00:11] [3394doc] [298.98doc/s]
[INFO] [finished] building docstore [11.35s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-12


[INFO] [finished] requesting access key [755ms]
docs_iter: 2805doc [00:10, 256.10doc/s]
[INFO] [finished] docs_iter: [00:10] [2805doc] [256.08doc/s]
[INFO] [finished] building docstore [10.96s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-13


[INFO] [finished] requesting access key [611ms]
docs_iter: 2658doc [00:10, 251.73doc/s]
[INFO] [finished] docs_iter: [00:10] [2658doc] [251.71doc/s]
[INFO] [finished] building docstore [10.56s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-14


[INFO] [finished] requesting access key [611ms]
docs_iter: 2728doc [00:09, 287.38doc/s]
[INFO] [finished] docs_iter: [00:09] [2728doc] [287.35doc/s]
[INFO] [finished] building docstore [9.50s]
[INFO] [starting] building docstore
[INFO] [starting] requesting access key
docs_iter: 0doc [00:00, ?doc/s]

crisisfacts/001/2017-12-15


[INFO] [finished] requesting access key [590ms]
docs_iter: 2665doc [00:09, 294.13doc/s]
[INFO] [finished] docs_iter: [00:09] [2665doc] [294.10doc/s]
[INFO] [finished] building docstore [9.06s]
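The cell above writes one JSON object per line. Pyserini's `JsonCollection` reads the document identifier from the `id` field and the indexed text from the `contents` field, which is why the loop copies `doc_id` and `text` into those keys. The following is a minimal sketch of that record shape. The helper name `to_pyserini_record` is hypothetical, and the sample document (including its `doc_id` and text) is made up for illustration.

```python
import json

def to_pyserini_record(doc, event_id, date_string):
    """Hypothetical helper mirroring the loop above: build one JSONL record."""
    record = {k: v for k, v in doc.items() if k != "source"}  # drop the raw source payload
    record["id"] = doc["doc_id"]        # document identifier used by the index
    record["contents"] = doc["text"]    # text that gets indexed
    record["eventId"] = event_id
    record["dateString"] = date_string
    return record

# Illustrative document; every value here is made up
doc = {
    "doc_id": "example-doc-0",
    "event": "CrisisFACTS-001",
    "text": "Evacuation center opened at the local high school.",
    "source": {"raw": "payload"},
    "source_type": "Twitter",
    "unix_timestamp": 1512633600,
}

record = to_pyserini_record(doc, "001", "2017-12-07")
line = json.dumps(record)  # one line of data.jsonl
print(line)
```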


In [None]:
!python -m pyserini.index   -collection JsonCollection \
                            -generator DefaultLuceneDocumentGenerator \
                            -threads 1 \
                            -input jsonlfiles/ \
                            -index my_index/ \
                            -storeRaw

2023-02-24 10:55:10,807 INFO  [main] index.IndexCollection (IndexCollection.java:650) - Setting log level to INFO
2023-02-24 10:55:10,811 INFO  [main] index.IndexCollection (IndexCollection.java:653) - Starting indexer...
2023-02-24 10:55:10,811 INFO  [main] index.IndexCollection (IndexCollection.java:655) - DocumentCollection path: jsonlfiles/
2023-02-24 10:55:10,811 INFO  [main] index.IndexCollection (IndexCollection.java:656) - CollectionClass: JsonCollection
2023-02-24 10:55:10,814 INFO  [main] index.IndexCollection (IndexCollection.java:657) - Generator: DefaultLuceneDocumentGenerator
2023-02-24 10:55:10,815 INFO  [main] index.IndexCollection (IndexCollection.java:658) - Threads: 1
2023-02-24 10:55:10,815 INFO  [main] index.IndexCollection (IndexCollection.java:659) - Stemmer: porter
2023-02-24 10:55:10,815 INFO  [main] index.IndexCollection (IndexCollection.java:660) - Keep stopwords? false
2023-02-24 10:55:10,816 INFO  [main] index.IndexCollection (IndexCollection.java:661) - St

## Retrieving and Reading

### Load BM25 retriever

In [None]:
from pyserini.index import IndexReader

index_reader = IndexReader('my_index')
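The index reader is later used to score individual documents against each query with BM25. To give an intuition for what such a score measures, here is a simplified, self-contained BM25 sketch. It is not Lucene's exact implementation: it tokenizes on whitespace only, skips stemming and stopword removal, and the parameter values `k1=0.9` and `b=0.4` as well as the toy corpus are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=0.9, b=0.4):
    """Simplified BM25 over whitespace tokens (no stemming or stopword removal)."""
    tokenized = [d.lower().split() for d in corpus]
    avg_len = sum(len(d) for d in tokenized) / len(tokenized)
    doc_tokens = doc.lower().split()
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query.lower().split()):
        df = sum(1 for d in tokenized if term in d)  # document frequency of the term
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (len(tokenized) - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

# Made-up toy corpus
corpus = [
    "evacuation orders issued for the wildfire area",
    "thanks for a wonderful sunday",
    "roads closed near the wildfire",
]
scores = [bm25_score("wildfire evacuation", d, corpus) for d in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: -scores[i])
print(ranking)
```

The document matching both query terms ranks first, and the unrelated document scores zero.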

### Loading Reranker

In [None]:
from transformers import AutoTokenizer
from transformers import T5ForConditionalGeneration

import torch

model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-3b-msmarco-10k", torch_dtype=torch.float16) #t5 3b

# You can try a smaller model
# model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco-10k')

model.to(device)

For stability purposes, it is recommended to have accelerate installed when using this model in torch.float16, please install it with `pip install accelerate`


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-3b automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=4096, bias=False)
              (k): Linear(in_features=1024, out_features=4096, bias=False)
              (v): Linear(in_features=1024, out_features=4096, bias=False)
              (o): Linear(in_features=4096, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 32)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=1024, out_features=16384, bias=False)
              (wo): Linear(in_features=16384, out_features=1024, bias=False)
              

In [None]:
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5(model=model)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
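monoT5 casts relevance as a generation task: given a query and a document, the model is asked to produce "true" or "false", and, to our understanding, the reranker score is the log-probability of "true" after a softmax restricted to those two tokens. The sketch below reproduces only that last step in plain Python. The function name `monot5_style_score` and the logits are made up for illustration.

```python
import math

def monot5_style_score(logit_true, logit_false):
    """Log-probability of 'true' under a softmax over the two candidate tokens."""
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    log_norm = m + math.log(math.exp(logit_true - m) + math.exp(logit_false - m))
    return logit_true - log_norm

# Made-up logits for three candidate documents: (true, false)
candidate_logits = {"doc_a": (2.0, -1.0), "doc_b": (0.0, 0.0), "doc_c": (-1.5, 1.0)}
scores = {d: monot5_style_score(t, f) for d, (t, f) in candidate_logits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because the scores are log-probabilities, they are never positive, and exponentiating one gives a relevance probability between 0 and 1.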


### GPT-3 interface

In [None]:
import os
import openai

openai.api_key = "API_KEY"

def generate(prompt,max_tokens=250, temperature=0):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    return response["choices"][0]['text']
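Remote API calls can fail transiently, for example because of rate limits or timeouts. One possible way to make the pipeline more robust is a small retry wrapper with exponential backoff. The sketch below is an assumption, not part of the original experiments: `with_retries` is a hypothetical helper, and `flaky` is a stand-in for an API call so the example runs without a key.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Hypothetical helper: call fn, retrying with exponential backoff on any exception."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return wrapper

# Stand-in for an API call that fails twice before succeeding
calls = {"n": 0}

def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok: " + prompt

delays = []
safe_flaky = with_retries(flaky, attempts=3, sleep=delays.append)  # record delays instead of sleeping
result = safe_flaky("hello")
print(result, delays)
```

In this notebook the same wrapper could be applied as `generate = with_retries(generate)`.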

### Question Answering

In [None]:
def extract_info(query, documents):
    # prompt="For each example, use the documents to create an \"Answer\" and an \"Evidence\" to the \"Question\". Use \"Unanswerable\" when not enough information is provided in the documents\n\nExample 1:\n\n[Document 1]: Title: San Tropez (song). Content: \"San Tropez\" is the fourth track from the album Meddle by the band Pink Floyd. This song was one of several to be considered for the band's \"best of\" album, Echoes: The Best of Pink Floyd.\n\n[Document 2]: Title: French Riviera. Content: The French Riviera (known in French as the Côte d'Azur [kot daˈzyʁ]; Occitan: Còsta d'Azur [ˈkɔstɔ daˈzyɾ]; literal translation \"Azure Coast\") is the Mediterranean coastline of the southeast corner of France. There is no official boundary, but it is usually considered to extend from Cassis, Toulon or Saint-Tropez on the west to Menton at the France–Italy border in the east, where the Italian Riviera joins.\n\n[Document 3]: Title: Moon Jae. Content: Moon also promised transparency in his presidency, moving the presidential residence from the palatial and isolated Blue House to an existing government complex in downtown Seoul.\n\n[Document 4]: Title: Saint-Tropez. Content: Saint-Tropez (US: /ˌsæn troʊˈpeɪ/ SAN-troh-PAY, French: [sɛ̃ tʁɔpe]; Occitan: Sant-Tropetz , pronounced [san(t) tʀuˈpes]) is a town on the French Riviera, 68 kilometres (42 miles) west of Nice and 100 kilometres (62 miles) east of Marseille in the Var department of the Provence-Alpes-Côte d'Azur region of Occitania, Southern France.\n\nQuestion: Did Pink Floyd have a song about the French Riviera?\n\nEvidence: According to [Document 1], \"San Tropez\" is a song by Pink Floyd about the French Riviera. This is further supported by [Document 4], which states that Saint-Tropez is a town on the French Riviera. Therefore, the answer is yes\n\nAnswer: yes.\n\nExample 2:\n\n"
    # prompt="For each example, use the documents to create an \"Answer\" and an \"Evidence\" to the \"Question\". Use \"Unanswerable\" when not enough information is provided in the documents\n\nExample 1:\n\n[Document 1]: Title: San Tropez (song). Content: \"San Tropez\" is the fourth track from the album Meddle by the band Pink Floyd. This song was one of several to be considered for the band's \"best of\" album, Echoes: The Best of Pink Floyd.\n\n[Document 2]: Title: French Riviera. Content: The French Riviera (known in French as the Côte d'Azur [kot daˈzyʁ]; Occitan: Còsta d'Azur [ˈkɔstɔ daˈzyɾ]; literal translation \"Azure Coast\") is the Mediterranean coastline of the southeast corner of France. There is no official boundary, but it is usually considered to extend from Cassis, Toulon or Saint-Tropez on the west to Menton at the France–Italy border in the east, where the Italian Riviera joins.\n\n[Document 3]: Title: Moon Jae. Content: Moon also promised transparency in his presidency, moving the presidential residence from the palatial and isolated Blue House to an existing government complex in downtown Seoul.\n\n[Document 4]: Title: Saint-Tropez. Content: Saint-Tropez (US: /ˌsæn troʊˈpeɪ/ SAN-troh-PAY, French: [sɛ̃ tʁɔpe]; Occitan: Sant-Tropetz , pronounced [san(t) tʀuˈpes]) is a town on the French Riviera, 68 kilometres (42 miles) west of Nice and 100 kilometres (62 miles) east of Marseille in the Var department of the Provence-Alpes-Côte d'Azur region of Occitania, Southern France.\n\nQuestion: Did Pink Floyd have a song about the French Riviera?\n\nEvidence: \"San Tropez\" is a song by Pink Floyd about the French Riviera [Document 1]. Saint-Tropez is a town on the French Riviera [Document 4].\n\nAnswer: yes.\n\nExample 2:\n\n"
    prompt = """For each example, use the documents to create an "Answer" and an "Evidence" to the "Question". Use "Unanswerable" when not enough information is provided in the documents

Example 1:

[Document 1]: Giovanni Messe pursued a military career in 1901.

[Document 2]: Giovanni Messe became aide-de-camp to King Victor Emmanuel III, holding this post from 1923 to 1927.

[Document 3]: The head of a single regiment or demi-brigade would be called a 'mestre de camp' or, after the Revolution, a 'chef de brigade'.).

[Document 4]: The First World War was global war originating in Europe that lasted from 28 July 1914 to 11 November 1918

[Document 5]: After World WarII began in 1939, the terms became more standard, with British Empire historians, including Canadians, favouring "The First World War" and Americans "World War I".

Question: How long had the First World War been over when Messe was named aide-de-camp?

Evidence: Giovanni Messe became aide-de-camp in 1923 [Document 2]. The First World War ended in 1918 [Document 4].

Answer: 5 years.

Example 2:

"""

    for i, doc in enumerate(documents):
        prompt += "[Document {0}]: {1}\n\n".format(i+1, doc['text'])
    prompt += "Question: {0}?\n\nEvidence:".format(query)

    # print(prompt)
    res = generate(prompt,temperature=0)
    if "answer:" not in res.lower():
        # The completion stopped before giving an answer: ask GPT-3 to continue after "Answer:"
        prompt = prompt+res+"\n\nAnswer:"
        res = res+"\n\nAnswer:"+ generate(prompt)
    # Split only on the first "answer:" so a repeated marker cannot break the unpacking
    explanation, answer = res.lower().split("answer:", 1)
    
    return prompt, explanation, answer
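The evidence text produced by GPT-3 cites its sources inline as `[Document N]`. The following is a minimal sketch of how such text can be split into the citation-free fact and the zero-based positions of the cited documents, using the evidence from the few-shot example in the prompt. The helper name `split_evidence` is hypothetical, and its pattern is stricter than the generic bracket regex used in the execution cell.

```python
import re

# Case-insensitive because extract_info lower-cases the model output
CITATION = re.compile(r"\[document (\d+)\]", re.IGNORECASE)

def split_evidence(evidence):
    """Return (fact text without citations, sorted zero-based cited document positions)."""
    cited = sorted({int(n) - 1 for n in CITATION.findall(evidence)})
    text = CITATION.sub("", evidence)
    text = re.sub(r"\s+([.,;])", r"\1", text)    # remove the space left before punctuation
    text = re.sub(r"\s{2,}", " ", text).strip()  # collapse leftover double spaces
    return text, cited

evidence = ("Giovanni Messe became aide-de-camp in 1923 [Document 2]. "
            "The First World War ended in 1918 [Document 4].")
fact_text, cited = split_evidence(evidence)
print(fact_text)
print(cited)
```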


### Execution

In [None]:
import numpy as np
import re
import torch
import math
from tqdm import tqdm

attempts = 2
step = 5
regex = r"\[(.*?)\]"
docs_per_query = 100
logs = []
facts = []
detailed_facts = []

token_lens = []
count = 0
for eventId in ["001"]: # first event
# for eventId in eventsMeta:
    event_requests = eventsMeta[eventId] # avoid shadowing the `requests` module imported earlier
    eventName = eventNames[eventId]
    for request in event_requests[:1]:
        # print(request)
        # break
        all_items = []
        id_ = 'crisisfacts/{0}/{1}'.format(eventId, request['dateString'])
        dataset = ir_datasets.load(id_)
        
        queries = pd.DataFrame(dataset.queries_iter())
        docs = pd.DataFrame(dataset.docs_iter())

        docs["datetime"]=(pd.to_datetime(docs["unix_timestamp"],unit='s'))

        docs = docs.set_index("datetime")
        groups = [a for _,a in docs.resample('6h')]

        for group in groups[:1]:
            ids = group['doc_id'].to_list()
            for qid,question in tqdm(queries[['query_id','text']].values):
                count += 1
                # continue
                scores = np.zeros((len(ids)))
                for i,id_ in enumerate(ids):
                    scores[i] = index_reader.compute_query_document_score(id_, question)
        
                # `scores` is aligned with `group`, so select the top candidates from `group`
                results = group.iloc[np.argsort(-scores)[:200]].to_dict(orient="records")
        
                texts = [Text(item['text'], item,0) for item in results]

                title_question = "{0}: {1}".format(eventName, question)

                query = Query(title_question)
                reranked = reranker.rerank(query, texts)
                reranked.sort(key=lambda x: x.score, reverse=True)
                rerank_scores = [t.score for t in reranked]
                reranked = [t.metadata for t in reranked]

                documents = reranked[:10]

                # Query GPT-3 once per question; returns the prompt, the evidence text, and the answer
                prompt, explanation, answer = extract_info(title_question, documents)

                if answer.strip().lower() != "unanswerable": # add the fact only if GPT-3 was able to answer the question
                    txt = explanation.strip()
                    matches = re.findall(regex, txt, re.MULTILINE)
                    factText = " ".join(re.split(regex, txt)[::2]) # the fact text, without the bracketed citations
                    docs_idxs = [int(a)-1 for a in re.findall(r'\d+'," ".join(matches))]
                    # documents cited by GPT-3 (deduplicated, keeping only valid positions in `documents`)
                    docs_idxs = sorted(i for i in set(docs_idxs) if 0 <= i < len(documents))

                    used_docs = [documents[i]['doc_id'] for i in docs_idxs]
                    if len(used_docs) > 0:
                        # reranker scores are log-probabilities; exponentiate to obtain probabilities
                        used_docs_score = np.exp(rerank_scores)

                        fact_ = {
                            "requestID": request['requestID'],
                            "factText": factText,
                            # earliest timestamp among the cited documents
                            "unixTimestamp": int(min(documents[i]["unix_timestamp"] for i in docs_idxs)),
                            "importance": float(used_docs_score.mean()),
                            "sources": used_docs,
                            "streamID": None,
                            "informationNeeds": [qid]
                        }

                        facts.append(fact_)

                        # store the prompt and raw GPT-3 output in a separate copy so `facts` keeps only the submission fields
                        detailed_facts.append(dict(fact_, prompt=prompt, explanation=explanation, answer=answer))
                # else:
                    # break
                logs.append({
                    "question": question,
                    "prompt": prompt,
                    "explanation": explanation,
                    "answer": answer
                })
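The execution cell splits each day's documents into six-hour windows with pandas `resample('6h')` and processes one window at a time. The bucketing idea can be sketched in plain Python as follows. The helper name `window_start` is hypothetical. The first three timestamps come from the documents table shown earlier and the last one is made up. Unlike `resample`, this sketch only produces windows that contain at least one document.

```python
from collections import defaultdict

WINDOW = 6 * 3600  # six hours, in seconds

def window_start(unix_ts, window=WINDOW):
    """Start of the fixed-size window (aligned to the Unix epoch, in UTC) that contains unix_ts."""
    return unix_ts - unix_ts % window

# Three timestamps from the documents table, plus a made-up one seven hours later
timestamps = [1534114807, 1534114818, 1534114830, 1534114807 + 7 * 3600]

groups = defaultdict(list)
for ts in timestamps:
    groups[window_start(ts)].append(ts)

for start, members in sorted(groups.items()):
    print(start, len(members))
```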


### Save facts

In [None]:
# Use a context manager so the file is flushed and closed
with open("my_submission.json", "w") as f:
    json.dump(facts, f)
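Before saving, it may be worth checking that every fact carries the fields built in the execution cell. The sketch below is a simple sanity check under that assumption, not the official CrisisFACTS validator. `validate_fact` is a hypothetical helper, `REQUIRED_KEYS` is taken from the `fact_` dictionary above, and the example record is made up.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"requestID", "factText", "unixTimestamp", "importance",
                 "sources", "streamID", "informationNeeds"}

def validate_fact(fact):
    """Hypothetical check: return a list of problems found in one fact record."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - fact.keys())]
    if not isinstance(fact.get("unixTimestamp"), int):
        problems.append("unixTimestamp must be an int")
    if not fact.get("sources"):
        problems.append("sources must list at least one doc_id")
    return problems

# Illustrative record; every value here is made up
example = {
    "requestID": "CrisisFACTS-001-r3",
    "factText": "an evacuation center opened at the local high school.",
    "unixTimestamp": 1512633600,
    "importance": 0.42,
    "sources": ["example-doc-0"],
    "streamID": None,
    "informationNeeds": ["example-query-0"],
}

# Round-trip through a temporary file to confirm the record is JSON-serializable
path = os.path.join(tempfile.mkdtemp(), "my_submission.json")
with open(path, "w") as f:
    json.dump([example], f)
with open(path) as f:
    loaded = json.load(f)

print(validate_fact(example))
```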