In [13]:
import os
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer
import faiss

from systematic_review import *

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
model = SentenceTransformer("allenai/scibert_scivocab_uncased")
directory = "../collection/examples/processed/"
papers = get_filenames_in_directory(directory)
token_size = 512

# Load and process documents as chunks with specified token size:
all_content = []
content_id = []
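for paper in papers:
    file_path = os.path.join(directory, paper)
    # Derive the document ID from the file name: drop the ".grobid" suffix
    # and turn underscores into slashes.
    doi = paper.partition(".grobid")[0].replace("_", "/")
    doc = XmlDocument(doi=doi)
    doc.load(file_path, token_size=token_size)

    for i, page in enumerate(doc.pages):
        # Skip chunks that contain nothing but a lone period.
        if page == ".":
            continue
        text = page.strip()
        # Remove a leading ". " fragment from the start of the chunk.
        if text.startswith(". "):
            text = text[2:]
        all_content.append(text)
        content_id.append(f"{doi}_{i}")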


No sentence-transformers model found with name allenai/scibert_scivocab_uncased. Creating a new one with mean pooling.
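
The warning above means the checkpoint is a plain transformer without a sentence-transformers configuration, so the library wraps it with a mean-pooling layer to turn token vectors into one vector per chunk. The sketch below illustrates what mean pooling computes, using toy numbers and a hypothetical helper `mean_pool`; it is not the library's implementation. Since the checkpoint was not fine-tuned for sentence similarity, the resulting distances may be less discriminative than those from a dedicated sentence-embedding model.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padded positions (mask == 0).

    Hypothetical helper for illustration only.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                summed[j] += value
    if count == 0:
        return summed  # only padding: return the zero vector
    return [s / count for s in summed]


# Toy input: three token vectors, the last one is padding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_tokens, toy_mask)
print(pooled)
```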


In [7]:
len(all_content)

391

In [8]:
embeddings = model.encode(all_content)
query = (
    "A pond is a small body of still water, usually less than 5 acres (2 hectares) in surface "
    "area and typically less than 6.6 feet (2 meters) deep. It can be natural or artificial and is "
    "often shallow enough for sunlight to reach the bottom, supporting plant and animal life throughout."
)

# Exact (brute-force) nearest-neighbour search; IndexFlatL2 reports squared L2 distances.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
distances, indices = index.search(model.encode([query]), k=25)
for i, idx in enumerate(indices[0]):
    print(f"Result {i+1}:")
    print(f"Content ID: {content_id[idx]}")
    print(f"Distance: {distances[0][i]}")
    print(f"Content: {all_content[idx]}\n")

Result 1:
Content ID: definitions2_12
Distance: 42.80820083618164
Content: ## Ditches
Ditches are man-made waterbodies although, particularly in flood plain and coastal grazing areas, their attributes (still or slow-flowing interconnected channels) are an analogue of the multi-thread and anatomising channels of the floodplains and coastal alluvial plains they often drain. This drainage is essential for maintaining productive agricultural areas around the globe: the International Commission on Irrigation and Drainage has estimated that, globally, 190 million ha of agricultural lands are drained artificially, with most of that land in the Americas (65 Mha), Asia (58 Mha) and Europe (47 Mha), about 12% of the estimated 1,500 Mha of arable land and permanent crops (Ausubel et al., 2012;Leslie et al., 2012).The length of ditch networks has not been assessed globally or regionally, although they are extensive in agriculturally intensive countries. In the heavily drained Netherlands, the tota
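
The distance printed above is a squared Euclidean distance between unnormalized embeddings, which is why it is large and hard to read as a similarity. As a rough sketch of what an exact flat L2 search computes, the pure-Python version below scores every vector against the query and keeps the `k` smallest values. The names `squared_l2` and `flat_l2_search` and the toy vectors are illustrative assumptions, not part of FAISS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Brute-force search: return (distances, indices) of the k nearest vectors."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]


# Toy corpus of three 2-d vectors, searched with the origin as the query.
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [0.0, 0.0], k=2)
print(toy_distances, toy_indices)
```

If cosine similarity is preferred, one common option is to L2-normalize the embeddings and the query before indexing, in which case the squared L2 distance equals 2 - 2 * cosine similarity and the ranking matches a cosine ranking.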

In [None]:
query_pairs = [(query, all_content[idx]) for idx in indices[0]]
query_pairs  # display the (query, passage) pairs

[('A pond is a small body of still water, usually less than 5 acres (2 hectares) in surface area and typically less than 6.6 feet (2 meters) deep. It can be natural or artificial and is often shallow enough for sunlight to reach the bottom, supporting plant and animal life throughout.',
  '## IntroductionPonds are small lentic ecosystems that permanently or temporarily hold water [1]. They are shallow and their size ranges from a few square metres to several hectares. They can be natural or man-made'),
 ('A pond is a small body of still water, usually less than 5 acres (2 hectares) in surface area and typically less than 6.6 feet (2 meters) deep. It can be natural or artificial and is often shallow enough for sunlight to reach the bottom, supporting plant and animal life throughout.',
  '. In the UK landscape, the majority of ditches are 1-3 m wide, with only a small proportion narrower or wider than this (Brown et al., 2006). A survey of the ditch network by Shore et al'),
 ('A pond i

In [10]:
cross_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')
query_pairs = [(query, all_content[idx]) for idx in indices[0]]
scores = cross_model.predict(query_pairs)


In [20]:
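# argsort is ascending, so reverse it to rank from the highest score down.
ranking = np.argsort(scores)[::-1]
for i, r in enumerate(ranking):
    # r is a position in the FAISS result list; map it back to the corpus index.
    idx = indices[0][r]
    print(f"Result {i+1}:")
    print(f"Content ID: {content_id[idx]}")
    print(f"Score: {scores[r]}")
    print(f"Content: {all_content[idx]}\n")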

Result 1:
Content ID: ponds2_1
Score: 0.5885521173477173
Content: ## Introduction
Ponds are small lentic ecosystems that permanently or temporarily hold water [1]. They are shallow and their size ranges from a few square metres to several hectares. They can be natural or man-made. Their number is much higher than that of large lakes, which constitute a small percentage of the total number of lakes [2]. Despite this, studies of lentic ecosystems have concentrated mainly on moderately large lakes [2].Ponds differ functionally from larger lakes [3], since their littoral structure and its productivity dominates the ecosystem [2]. Despite their small size they contain a significant part of aquatic biodiversity on the landscape scale [4,5]. Humans have created millions of ponds for multiple purposes [2], but today they serve as refugia for a variety of freshwater biota [5] and are, as such, an irreplaceable type of habitat [6][7][8].The biotic communities that develop in ponds depend on envi
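
The two-stage pattern used in this notebook (cheap vector retrieval, then a more expensive pairwise scorer over the shortlist) can be sketched without any model. In the example below, `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for the cross-encoder so the snippet runs on its own.

```python
def rerank(query, candidates, score_fn):
    """Score each (query, candidate) pair and return (position, score), best first."""
    scored = [(score_fn(query, text), pos) for pos, text in enumerate(candidates)]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(pos, score) for score, pos in scored]


def overlap_score(query, text):
    """Stand-in scorer: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Toy shortlist in retrieval order; the most relevant passage is second.
toy_candidates = ["ditches drain fields", "a pond is small", "lakes are large"]
toy_ranking = rerank("a small pond", toy_candidates, overlap_score)
print(toy_ranking)
```

The returned positions index into the shortlist, not the full corpus, so they still need to be mapped back through the retrieval indices, as the cell above does with `indices[0][r]`.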