# Introducing paperetl

paperetl is an ETL library for processing medical and scientific papers. paperetl transforms XML, CSV and PDF articles into a structured dataset, enabling downstream processing by machine learning applications.



# Install dependencies

Install `paperetl` and all dependencies. This step also downloads input data to process.

In [1]:
%%capture
!pip install git+https://github.com/neuml/paperetl

# Download NLTK data
!python -c "import nltk; nltk.download('punkt')"

# Download data
!mkdir -p paperetl
!wget -N https://github.com/neuml/paperetl/releases/download/v1.6.0/tests.tar.gz
!tar -xvzf tests.tar.gz

# Review data

Now let's take a look at the input data, which is a list of files in a directory.

In [2]:
!ls -l paperetl/file/data

total 1692
-rw-rw-r-- 1 1000 1000  95375 Nov  4  2020 0.xml
-rw-rw-r-- 1 1000 1000    353 Dec  5  2021 10.csv
-rw-rw-r-- 1 1000 1000 310066 Nov  4  2020 1.xml
-rw-rw-r-- 1 1000 1000 349016 Nov  4  2020 2.xml
-rw-rw-r-- 1 1000 1000 232888 Nov  4  2020 3.xml
-rw-rw-r-- 1 1000 1000 235276 Nov  4  2020 4.xml
-rw-rw-r-- 1 1000 1000  50414 Nov  4  2020 5.xml
-rw-rw-r-- 1 1000 1000  92683 Nov  4  2020 6.xml
-rw-rw-r-- 1 1000 1000 139379 Nov  4  2020 7.xml
-rw-rw-r-- 1 1000 1000  41640 Nov  4  2020 8.xml
-rw-rw-r-- 1 1000 1000  77557 Nov  4  2020 9.xml
-rw-r--r-- 1 1000 1000   5364 Dec  5  2021 arxiv.xml
-rw-r--r-- 1 1000 1000  70272 Oct  5  2021 pubmed.xml


# Process data

Next, we'll run the ETL process to load the files into a SQLite articles database.

In [3]:
!python -m paperetl.file paperetl/file/data paperetl/models

Processing: paperetl/file/data/0.xml
  soup = BeautifulSoup(stream, "lxml")
Processing: paperetl/file/data/1.xml
  soup = BeautifulSoup(stream, "lxml")
Processing: paperetl/file/data/10.csv
Processing: paperetl/file/data/2.xml
Processing: paperetl/file/data/3.xml
Processing: paperetl/file/data/4.xml
Processing: paperetl/file/data/5.xml
Processing: paperetl/file/data/6.xml
Processing: paperetl/file/data/7.xml
Processing: paperetl/file/data/8.xml
Processing: paperetl/file/data/9.xml
Processing: paperetl/file/data/arxiv.xml
  soup = BeautifulSoup(stream, "lxml")
Processing: paperetl/file/data/pubmed.xml
Total articles inserted: 0


In [4]:
!ls -l paperetl/models

total 936
-rw-r--r-- 1 root root 958464 Apr 26 19:35 articles.sqlite


This ETL process took the XML and CSV files, parsed the metadata/content and loaded it all into `articles.sqlite`.
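A quick way to confirm what the load produced is to count the rows in each table. The following is a minimal sketch using only the standard library; `table_counts` is a hypothetical helper, and the in-memory database with toy rows merely stands in for `paperetl/models/articles.sqlite`.

```python
import sqlite3


def table_counts(connection):
    """Return {table_name: row_count} for every user table in a SQLite database."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
    tables = [row[0] for row in cursor.fetchall()]

    counts = {}
    for table in tables:
        # Table names come from sqlite_master itself, so quoting them is sufficient here
        cursor.execute(f'SELECT COUNT(*) FROM "{table}"')
        counts[table] = cursor.fetchone()[0]
    return counts


# Toy in-memory database (assumption: stands in for the real articles.sqlite)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (Id TEXT PRIMARY KEY, Title TEXT)")
demo.execute("CREATE TABLE sections (Id INTEGER PRIMARY KEY, Article TEXT, Name TEXT, Text TEXT)")
demo.execute("INSERT INTO articles VALUES ('a1', 'Toy title')")
demo.executemany(
    "INSERT INTO sections VALUES (?, ?, ?, ?)",
    [(0, "a1", "TITLE", "Toy title"), (1, "a1", None, "First sentence.")],
)
print(table_counts(demo))
```

To run it against the real database, pass `sqlite3.connect("paperetl/models/articles.sqlite")` instead of `demo`.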

# Import the necessary modules

Sentence Transformers trains on `InputExample` objects. Each example holds a pair of texts and a label. For a semantic similarity task, the label is usually a float that indicates how similar the two texts are, so the training data is a list of sentence pairs with one score per pair.
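The shape of one labelled pair can be sketched without installing the library. `PairExample` and `make_pair` below are hypothetical stand-ins that mirror the `texts` and `label` fields of `InputExample`; the `[0, 1]` score range is an assumption for illustration, not a library requirement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairExample:
    """Hypothetical stand-in for sentence_transformers.InputExample."""
    texts: List[str]
    label: float


def make_pair(text_a, text_b, score):
    """Build one labelled pair, checking the score against the assumed [0, 1] range."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"similarity score must be in [0, 1], got {score}")
    return PairExample(texts=[text_a, text_b], label=float(score))


# Toy pairs: a related pair with a high score and an unrelated pair with a low one
pairs = [
    make_pair("A study of travel patterns", "Travel flows were analysed", 0.8),
    make_pair("A study of travel patterns", "Cardiovascular disease burden", 0.1),
]
for pair in pairs:
    print(pair.label, pair.texts)
```

With the library installed, the equivalent object is `InputExample(texts=[text_a, text_b], label=score)`.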

# Review parsed data

The two main tables in `articles.sqlite` are articles and sections.

- The articles table stores metadata (date, authors, publication, title...)
- The sections table stores the article text split into sections and sentences

Now let's take a look at what was loaded.

In [None]:
import sqlite3

import pandas as pd

from IPython.display import display, HTML

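def execute(sql):
  # Open the articles database, run the query and render the result as an HTML table
  db = sqlite3.connect("paperetl/models/articles.sqlite")
  try:
    cursor = db.cursor()
    cursor.execute(sql)

    df = pd.DataFrame([list(x) for x in cursor], columns=[c[0] for c in cursor.description])
  finally:
    # Always release the connection, even if the query fails
    db.close()

  display(HTML(df.to_html(index=False)))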

# Show articles
execute("SELECT * FROM articles LIMIT 5")

Next, let's check the schema of the articles table to see which columns are available.

In [6]:
# Connect to the SQLite database
db = sqlite3.connect("paperetl/models/articles.sqlite")

# Create a cursor object using the cursor() method
cursor = db.cursor()

# Get the schema of the articles table
cursor.execute("PRAGMA table_info(articles)")
schema = cursor.fetchall()

# Close the database connection
db.close()

# Print the schema
for column in schema:
    print(column)


(0, 'Id', 'TEXT', 0, None, 1)
(1, 'Source', 'TEXT', 0, None, 0)
(2, 'Published', 'DATETIME', 0, None, 0)
(3, 'Publication', 'TEXT', 0, None, 0)
(4, 'Authors', 'TEXT', 0, None, 0)
(5, 'Affiliations', 'TEXT', 0, None, 0)
(6, 'Affiliation', 'TEXT', 0, None, 0)
(7, 'Title', 'TEXT', 0, None, 0)
(8, 'Tags', 'TEXT', 0, None, 0)
(9, 'Reference', 'TEXT', 0, None, 0)
(10, 'Entry', 'DATETIME', 0, None, 0)
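Each tuple returned by `PRAGMA table_info` has six fields: column index (`cid`), `name`, declared `type`, a `notnull` flag, the default value (`dflt_value`) and a `pk` flag that is non-zero for primary key columns. The sketch below labels those fields; `describe_table` is a hypothetical helper, the toy table only mimics the first three columns of `articles`, and the table name is assumed to be trusted because it is interpolated directly into the statement.

```python
import sqlite3

# Field names of each row returned by PRAGMA table_info, in order
TABLE_INFO_FIELDS = ("cid", "name", "type", "notnull", "dflt_value", "pk")


def describe_table(connection, table):
    """Return one dict per column; `table` must be a trusted identifier."""
    cursor = connection.execute(f"PRAGMA table_info({table})")
    return [dict(zip(TABLE_INFO_FIELDS, row)) for row in cursor.fetchall()]


# Toy in-memory table (assumption: a cut-down version of the articles schema)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (Id TEXT PRIMARY KEY, Source TEXT, Published DATETIME)")

columns = describe_table(demo, "articles")
for column in columns:
    print(column["name"], column["type"], "primary key" if column["pk"] else "")
```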


In [7]:
# Connect to the SQLite database
db = sqlite3.connect("paperetl/models/articles.sqlite")

# Create a cursor object using the cursor() method
cursor = db.cursor()

# Get the schema of the sections table
cursor.execute("PRAGMA table_info(sections)")
sections_schema = cursor.fetchall()

# Close the database connection
db.close()

# Print the schema of sections table
for column in sections_schema:
    print(column)


(0, 'Id', 'INTEGER', 0, None, 1)
(1, 'Article', 'TEXT', 0, None, 0)
(2, 'Name', 'TEXT', 0, None, 0)
(3, 'Text', 'TEXT', 0, None, 0)


In [8]:
# Show sections
execute("SELECT * FROM sections LIMIT 5")

Id,Article,Name,Text
0,00398e4c637f5e5447e35e63669187f0239c0357,TITLE,Changing travel patterns in China during the early stages of the COVID-19 pandemic
1,00398e4c637f5e5447e35e63669187f0239c0357,,"T he COVID-19 pandemic was first identified in Wuhan, China, in late 2019, and came to prominence in January 2020, and quickly spread within the country."
2,00398e4c637f5e5447e35e63669187f0239c0357,,"January is also a major holiday period in China, and the 40-day period around Lunar New Year (LNY), or Chunyun, marks the largest annual human movement in the world, with major travel flows out of large cities 1 ."
3,00398e4c637f5e5447e35e63669187f0239c0357,,The purpose of this holiday travel is often to visit family members.
4,00398e4c637f5e5447e35e63669187f0239c0357,,"The temporary displacement from residential addresses as a result of this holiday travel could last one to two weeks, up to a month."
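Since each sections row holds one sentence and its `Article` value matches an `articles.Id`, an article's body can be rebuilt by concatenating its rows. This is a minimal sketch with toy data; `article_text` is a hypothetical helper, and it assumes that ascending `Id` reflects document order and that the `TITLE` row should be excluded.

```python
import sqlite3


def article_text(connection, article_id):
    """Join an article's sentence rows back into one string, ordered by Id."""
    rows = connection.execute(
        "SELECT Text FROM sections "
        "WHERE Article = ? AND (Name IS NULL OR Name != 'TITLE') "
        "ORDER BY Id",
        (article_id,),
    ).fetchall()
    return " ".join(text for (text,) in rows)


# Toy in-memory table with the same columns as the sections schema
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sections (Id INTEGER PRIMARY KEY, Article TEXT, Name TEXT, Text TEXT)")
demo.executemany(
    "INSERT INTO sections VALUES (?, ?, ?, ?)",
    [
        (0, "a1", "TITLE", "Toy title"),
        (1, "a1", None, "First sentence."),
        (2, "a1", None, "Second sentence."),
        (3, "a2", "TITLE", "Another title"),
    ],
)

body = article_text(demo, "a1")
print(body)
```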


In [33]:
# Show articles
execute("SELECT * FROM articles LIMIT 5")

Id,Source,Published,Publication,Authors,Affiliations,Affiliation,Title,Tags,Reference,Entry
00398e4c637f5e5447e35e63669187f0239c0357,0.xml,,,"Gibbs, Hamish; Liu, Yang; Pearson, Carl; Jarvis, Christopher; Grundy, Chris; Quilty, Billy; Diamond, Charlie; Cmmid, Lshtm; Eggo, Rosalind","Department of Infectious Disease Epidemiology, School of Hygiene and Tropical Medicine; Centre for Mathematical Modelling of Infectious Diseases, School of Hygiene and Tropical Medicine","Centre for Mathematical Modelling of Infectious Diseases, School of Hygiene and Tropical Medicine",Changing travel patterns in China during the early stages of the COVID-19 pandemic,PDF,https://doi.org/10.1038/s41467-020-18783-0,2024-04-26 00:00:00
1001,datasource2,,Test Journal2,Test Author2,,,Test Article2,,test url2,2021-04-01 00:00:00
1000,datasource,,Test Journal,Test Author,,,Test Article,,test url,2021-05-01 00:00:00
00c4c8c42473d25ebb38c4a8a14200c6900be2e9,1.xml,2020-04-26 00:00:00,Abouk and Heydari (2020),"Chernozhukov, Victor; Kasahara, Hiroyuki; Schrimpf, Paul; Chernozhukov, V; Kasahara, H; Schrimpf, P","Department of Economics and Center for Statistics and Data Science, MIT; School of Economics, UBC","School of Economics, UBC",1.xml,PDF,https://doi.org/10.1016/j.jeconom.2020.09.003,2024-04-26 00:00:00
b9f6e3d2dd7d18902ac3a538789d836793dd48b2,2.xml,,,"Hessami, Amirhossein; Shamshirian, Amir; Heydari, Keyvan; Pourali, Fatemeh; Alizadeh-Navaei, Reza; Moosazadeh, Mahmood; Abrotan, Saeed; Shojaie, Layla; Sedighi, Sogol; Shamshirian, Danial; Rezaei, Nima","School of Medicine, Student Research Committee, Mazandaran University of Medical Sciences; Systematic Review and Meta-Analysis Expert Group (SRMEG), Universal Scientific Education and Research Network (USERN); Network of Immunity in Infection, Malignancy and Autoimmunity (NIIMA), Universal Scientific Education and Research Network (USERN); Gastrointestinal Cancer Research Center, Non-Communicable Disease Institute, Mazandaran University of Medical Sciences; Department of Medical Laboratory Sciences, School of Allied Medical Science, Student Research Committee, Mazandaran University of Medical Sciences; Health Science Research Center, Addiction Institute, Mazandaran University of Medical Sciences; Department of Cardiology, Babol University of Medical Sciences; Research Center for Liver Diseases, Departments of Medicine, Keck School of Medicine, University of Southern California; Student Research Committee, Shiraz University of Medical Sciences; Chronic Respiratory Diseases Research Center, National Research Institute of Tuberculosis and Lung Diseases (NRITLD), Shahid Beheshti University of Medical Sciences; Research Center for Immunodeficiencies, Children's Medical Center, Tehran University of Medical Sciences; Department of Immunology, School of Medicine, Tehran University of Medical Sciences","Department of Immunology, School of Medicine, Tehran University of Medical Sciences",Cardiovascular diseases burden in COVID-19: Systematic review and meta-analysis,PDF,https://doi.org/10.1016/j.ajem.2020.10.022,2024-04-26 00:00:00


In [3]:
import sqlite3
from sentence_transformers import InputExample

# Connect to the SQLite database
db_path = 'paperetl/models/articles.sqlite'
db = sqlite3.connect(db_path)

# Create a cursor object using the cursor() method
cursor = db.cursor()

# Select titles from the articles table
cursor.execute("SELECT Id, Title FROM articles")

# Fetch all rows using fetchall() method
articles_data = cursor.fetchall()

# Dictionary to hold article titles and their first section of text
article_content = {}

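# For each article, fetch the first section of text.
# Sections include a TITLE row that repeats the title (see the sections output above),
# so skip it; otherwise the title would be paired with itself.
for article_id, title in articles_data:
    cursor.execute(
        "SELECT Text FROM sections WHERE Article = ? AND (Name IS NULL OR Name != 'TITLE') ORDER BY ROWID ASC LIMIT 1",
        (article_id,),
    )
    first_section_text = cursor.fetchone()

    if first_section_text:
        article_content[title] = first_section_text[0]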

# Close the database connection
db.close()

# List to hold InputExample objects for training
training_examples = []

# Create InputExample objects from titles and their first section of text
for title, text in article_content.items():
    # The label is set to 1.0 for all pairs, as we don't have actual similarity scores
    # In a real use-case, you would want to have labels reflecting the actual similarity
    training_examples.append(InputExample(texts=[title, text], label=1.0))

# Now you have a list of InputExample objects ready for training
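The loop above issues one query per article. A single query can return the same title and text pairs; the sketch below is one possible formulation, run here against toy in-memory tables. `FIRST_TEXT_SQL` and `title_text_pairs` are hypothetical names, and the query skips the `TITLE` row because, in the sections output shown earlier, that row repeats the title itself.

```python
import sqlite3

# For each article, pick the lowest-Id section row that is not the TITLE row
FIRST_TEXT_SQL = """
SELECT a.Id, a.Title, s.Text
FROM articles AS a
JOIN sections AS s ON s.Article = a.Id
WHERE s.Id = (
    SELECT MIN(s2.Id)
    FROM sections AS s2
    WHERE s2.Article = a.Id
      AND (s2.Name IS NULL OR s2.Name != 'TITLE')
)
ORDER BY a.Id
"""


def title_text_pairs(connection):
    """Return (title, first non-title text) for every article that has body text."""
    return [(title, text) for _, title, text in connection.execute(FIRST_TEXT_SQL)]


# Toy in-memory tables (assumption: cut-down versions of the real schema)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (Id TEXT PRIMARY KEY, Title TEXT)")
demo.execute("CREATE TABLE sections (Id INTEGER PRIMARY KEY, Article TEXT, Name TEXT, Text TEXT)")
demo.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("a1", "Toy title A"), ("a2", "Toy title B")],
)
demo.executemany(
    "INSERT INTO sections VALUES (?, ?, ?, ?)",
    [
        (0, "a1", "TITLE", "Toy title A"),
        (1, "a1", None, "First sentence of A."),
        (2, "a1", None, "Second sentence of A."),
        (3, "a2", "TITLE", "Toy title B"),  # a2 has no body text and is left out
    ],
)

pairs = title_text_pairs(demo)
print(pairs)
```

Each resulting tuple could then be wrapped as `InputExample(texts=[title, text], label=1.0)`, as in the cell above.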

**Model Training**

In [35]:
!pip install -U sentence_transformers


Collecting sentence_transformers
  Using cached sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Installing collected packages: sentence_transformers
  Attempting uninstall: sentence_transformers
    Found existing installation: sentence-transformers 2.2.2
    Uninstalling sentence-transformers-2.2.2:
      Successfully uninstalled sentence-transformers-2.2.2
Successfully installed sentence_transformers-2.7.0


Set up the Sentence Transformers Model

Step 1: Import Libraries and Initialize Model

In [4]:
from sentence_transformers import models, SentenceTransformer, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset

# Load the pre-trained model from Hugging Face
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
embeddings = models.Transformer(model_name)

# Pooling strategy aggregates word embeddings into a fixed-size sentence embedding
pooling = models.Pooling(embeddings.get_word_embedding_dimension())

# Combine transformer and pooling modules into a Sentence Transformer model
model = SentenceTransformer(modules=[embeddings, pooling])


Step 2: Prepare the Dataloader for training

In [5]:
from torch.utils.data import DataLoader

# Convert list of InputExamples to a SentencesDataset
train_dataset = SentencesDataset(examples=training_examples, model=model)

# DataLoader prepares the data for training
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)


In [6]:
# MultipleNegativesRankingLoss expects positive pairs only: for each pair, the second texts of the
# other pairs in the batch serve as negatives, and the label value is not used by this loss
train_loss = losses.MultipleNegativesRankingLoss(model=model)
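The idea behind an in-batch negatives loss can be sketched in plain Python. This is an illustration of the technique, not the library's implementation: `in_batch_ranking_loss` is a hypothetical function, the 2-dimensional vectors are toy embeddings, and cosine similarity with a scale of 20 is an assumption chosen to resemble common defaults.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        peak = max(scores)
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)


# Toy embeddings: "aligned" pairs each anchor with its true match, "swapped" does not
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = in_batch_ranking_loss(anchors, aligned)
swapped_loss = in_batch_ranking_loss(anchors, swapped)
print(aligned_loss, swapped_loss)
```

The loss is small when each anchor is most similar to its own partner and large when another item in the batch is a better match, which is why larger batches with distinct pairs tend to give a stronger training signal.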


Set up an Evaluator

In [10]:

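# valid_data is not defined in this notebook; it is assumed to be a list of
# (text1, text2, score) tuples prepared separately
# Convert the validation data into InputExample objects
valid_examples = [InputExample(texts=[text1, text2], label=score) for text1, text2, score in valid_data]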

# Create the evaluator using these new examples
sentences1 = [example.texts[0] for example in valid_examples]
sentences2 = [example.texts[1] for example in valid_examples]
scores = [example.label for example in valid_examples]

# Finally, create the EmbeddingSimilarityEvaluator
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
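An embedding similarity evaluator scores a model by correlating its predicted similarity for each pair with the gold score. The sketch below illustrates that calculation in plain Python, assuming cosine similarity and Spearman rank correlation; all function names and the toy embeddings are hypothetical, and this is not the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy embeddings for three pairs, with gold scores that decrease as the vectors diverge
emb1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb2 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
gold = [0.9, 0.5, 0.1]

similarities = [cosine(u, v) for u, v in zip(emb1, emb2)]
score = spearman(similarities, gold)
print(score)
```

Note that the correlation is undefined when every gold score is identical, so a validation set needs labels that actually vary for this kind of metric to be informative.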



Train the Model

In [12]:
# Define number of epochs and evaluation steps
epochs = 4
evaluation_steps = 1000  # Adjust these hyperparameters according to your dataset and hardware capabilities

# Train the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=epochs,
    evaluator=evaluator,
    evaluation_steps=evaluation_steps,
    output_path="output/pubmedbert-base-embeddings"
)


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]

In [13]:
# Run the evaluator on the model
model.evaluate(evaluator)


0.9999999999999999

In [17]:
# Print a few text pairs from the test set
# test_sentences1 and test_sentences2 are not defined in this notebook; they are assumed to be parallel lists of texts
for i in range(5):  # Adjust the range as needed
    print(f"Text 1: {test_sentences1[i]}")
    print(f"Text 2: {test_sentences2[i]}")

Text 1: Changing travel patterns in China during the early stages of the COVID-19 pandemic
Text 2: Changing travel patterns in China during the early stages of the COVID-19 pandemic
Text 1: Changing travel patterns in China during the early stages of the COVID-19 pandemic
Text 2: T he COVID-19 pandemic was first identified in Wuhan, China, in late 2019, and came to prominence in January 2020, and quickly spread within the country.
Text 1: Changing travel patterns in China during the early stages of the COVID-19 pandemic
Text 2: January is also a major holiday period in China, and the 40-day period around Lunar New Year (LNY), or Chunyun, marks the largest annual human movement in the world, with major travel flows out of large cities 1 .
Text 1: Changing travel patterns in China during the early stages of the COVID-19 pandemic
Text 2: The purpose of this holiday travel is often to visit family members.
Text 1: Changing travel patterns in China during the early stages of the COVID-19 pa