<a href="https://colab.research.google.com/github/maggoatt/Grounded-Text-Summarization-of-Research-Papers/blob/main/Evidence_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Evidence Retrieval

**Installations and imports**

In [2]:
!pip -q install rank-bm25

In [8]:
import json
import re

from rank_bm25 import BM25Okapi

In [19]:
def tokenize(s: str) -> list[str]:
    """Lowercase the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", s.lower())
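
As a quick check of what this tokenizer keeps and drops, here is a minimal sketch. The sample string is made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import re

def tokenize(s: str):
    # Lowercase, then keep only runs of ASCII letters and digits
    return re.findall(r"[a-z0-9]+", s.lower())

# Hypothetical sample text, not taken from the paper JSON
sample = 'Object-focused "novelty" in Sec. 4'
tokens = tokenize(sample)
print(tokens)
```

Hyphenated words split into separate tokens and punctuation disappears, so `object-focused` in a query matches `object` and `focused` independently. Accented or other non-ASCII characters are also dropped by this pattern.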

In [21]:
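# Load one paper; the JSON is expected to have "corpusid", "title", and "sections"
with open("./data/249953535.json", "r", encoding="utf-8") as file:
    paper = json.load(file)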

## Preprocessing

In [22]:
section_info = []
raw_texts = []

for section in paper["sections"]:
    # Prepend the section title so its terms also count toward the BM25 match
    raw_text = f'{section["section_title"]} {section["text"]}'
    curr_section = {
        "corpusId": paper["corpusid"],
        "title": paper["title"],
        "section_title": section["section_title"],
        "text": section["text"]
    }
    raw_texts.append(raw_text)
    section_info.append(curr_section)

In [23]:
print(len(raw_texts))

29


In [24]:
tokenized_texts = [tokenize(raw_text) for raw_text in raw_texts]

## BM25

In [25]:
bm25 = BM25Okapi(tokenized_texts)
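
To make the scoring less of a black box, the following is a minimal pure-Python sketch of a textbook Okapi BM25 score. It is not the `rank_bm25` source: the helper name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the toy documents are all assumptions chosen for illustration, and the library may differ in details such as how it handles very common terms.

```python
import math
import re
from collections import Counter

def tokenize(s: str):
    return re.findall(r"[a-z0-9]+", s.lower())

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query (textbook BM25 sketch)."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: in how many documents each term appears
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))

    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption; other variants exist)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy corpus, only to show the behavior
toy_docs = [
    tokenize("Existing evaluations are too object-focused"),
    tokenize("We benchmark visual detectors on scenes"),
    tokenize("Reinforcement learning agents play video games"),
]
toy_scores = bm25_scores(tokenize("object-focused evaluations"), toy_docs)
print(toy_scores)
```

Only the first toy document shares terms with the query, so it is the only one with a positive score. Documents with no overlapping terms score exactly zero in this sketch.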

In [31]:
summary_sentence = """Existing evaluations are too object-focused, artificially creating "novelty" by repurposing images designed for object classification. """
summary_scores = bm25.get_scores(tokenize(summary_sentence))

In [32]:
top_k = 5
# Indices of the top_k highest-scoring sections, best first
top_idxs = summary_scores.argsort()[::-1][:top_k]
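
A plain top-k cut always returns `top_k` sections, even when some of them share no terms with the query. One possible refinement, sketched here with a hypothetical helper `top_k_nonzero` and made-up scores, is to drop candidates whose score is zero.

```python
def top_k_nonzero(scores, k=5):
    """Return up to k indices with the highest scores, skipping zero scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]

# Made-up scores for illustration, not output from the notebook
example_scores = [0.0, 2.5, 1.0, 0.0, 0.3]
selected = top_k_nonzero(example_scores, k=5)
print(selected)
```

The helper only needs `len()` and indexing, so it should also accept the array returned by `bm25.get_scores`, although that is untested here.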

In [33]:
for i in top_idxs:
    section = section_info[int(i)]
    print("Corpus Id", section["corpusId"])
    print("Paper Title: ", section["title"])
    print("Section Title: ", section["section_title"])
    print("Section Score: ", float(summary_scores[i]))
    print("Section snippet: ", section["text"][:250], "...")

Corpus Id 249953535
Paper Title:  NovelCraft: A Dataset for Novelty Detection and Discovery in Open Worlds
Section Title:  The contributions of this study are:
Section Score:  19.355517706229957
Section snippet:  1. A scene-focused dataset purpose-built for visual novelty detection. Existing evaluations are too object-focused, artificially creating "novelty" by repurposing images designed for object classification. In Sec. 4, we benchmark visual detectors on  ...
Corpus Id 249953535
Paper Title:  NovelCraft: A Dataset for Novelty Detection and Discovery in Open Worlds
Section Title:  Introduction
Section Score:  14.733107416521557
Section snippet:  Recent progress in computer vision (Krizhevsky et al., 2017) and vision-informed reinforcement learning (Mnih et al., 2015;Silver et al., 2018) is exciting but focused on tasks like classification or video game playing where the agent's goals are nar ...
Corpus Id 249953535
Paper Title:  NovelCraft: A Dataset for Novelty Detection and Discov