# Introducing paperetl

[paperetl](https://github.com/neuml/paperetl) is an ETL library for processing medical and scientific papers. paperetl transforms XML, CSV and PDF articles into a structured dataset, enabling downstream processing by machine learning applications.

This notebook gives a brief overview of paperetl.

# Install dependencies

Install `paperetl` and all dependencies. This step also downloads the NLTK `punkt` tokenizer data and the sample input data to process.

In [None]:
%%capture
!pip install git+https://github.com/neuml/paperetl
 
# Download NLTK data
!python -c "import nltk; nltk.download('punkt')"

# Download data
!mkdir -p paperetl
!wget -N https://github.com/neuml/paperetl/releases/download/v1.6.0/tests.tar.gz
!tar -xvzf tests.tar.gz

# Review data

Now let's take a look at the input data, which is a list of files in a directory.

In [None]:
!ls -l paperetl/file/data

total 1692
-rw-rw-r-- 1 1000 1000  95375 Nov  4  2020 0.xml
-rw-rw-r-- 1 1000 1000    353 Dec  5  2021 10.csv
-rw-rw-r-- 1 1000 1000 310066 Nov  4  2020 1.xml
-rw-rw-r-- 1 1000 1000 349016 Nov  4  2020 2.xml
-rw-rw-r-- 1 1000 1000 232888 Nov  4  2020 3.xml
-rw-rw-r-- 1 1000 1000 235276 Nov  4  2020 4.xml
-rw-rw-r-- 1 1000 1000  50414 Nov  4  2020 5.xml
-rw-rw-r-- 1 1000 1000  92683 Nov  4  2020 6.xml
-rw-rw-r-- 1 1000 1000 139379 Nov  4  2020 7.xml
-rw-rw-r-- 1 1000 1000  41640 Nov  4  2020 8.xml
-rw-rw-r-- 1 1000 1000  77557 Nov  4  2020 9.xml
-rw-r--r-- 1 1000 1000   5364 Dec  5  2021 arxiv.xml
-rw-r--r-- 1 1000 1000  70272 Oct  5  2021 pubmed.xml


In this example, we're only covering XML and CSV files. Processing PDF articles requires [installing GROBID](https://github.com/neuml/paperetl#additional-dependencies).

# Process data

Next, we'll run the ETL process to load the files into a SQLite articles database.

In [None]:
!python -m paperetl.file paperetl/file/data paperetl/models

Processing: paperetl/file/data/0.xml
Processing: paperetl/file/data/1.xml
Processing: paperetl/file/data/10.csv
Processing: paperetl/file/data/2.xml
Processing: paperetl/file/data/3.xml
Processing: paperetl/file/data/4.xml
Processing: paperetl/file/data/5.xml
Processing: paperetl/file/data/6.xml
Processing: paperetl/file/data/7.xml
Processing: paperetl/file/data/8.xml
Processing: paperetl/file/data/9.xml
Processing: paperetl/file/data/arxiv.xml
Processing: paperetl/file/data/pubmed.xml
Total articles inserted: 21


In [None]:
!ls -l paperetl/models

total 940
-rw-r--r-- 1 root root 962560 Jan 23 16:29 articles.sqlite


The ETL process parsed the metadata and content of each XML and CSV file and loaded the results into `articles.sqlite`. The 13 input files yielded 21 articles, so some files contain more than one article.
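As a quick sanity check on the load, the row count of each table can be read back with Python's built-in `sqlite3` module. The sketch below is illustrative and not part of paperetl: `table_counts` is a hypothetical helper, and the path in the commented usage line assumes the output location used above.

```python
import sqlite3

def table_counts(path):
    """Return {table name: row count} for every table in a SQLite file."""
    db = sqlite3.connect(path)
    try:
        # sqlite_master lists every object in the database; keep tables only
        names = [row[0] for row in db.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

        # Table names cannot be bound as parameters, so quote them as identifiers
        return {name: db.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in names}
    finally:
        db.close()

# Usage in this notebook (assumes the ETL step above has already run):
# table_counts("paperetl/models/articles.sqlite")
```

The helper opens and closes its own connection, so it can be called at any point without leaving the database file locked.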

# Review parsed data

The two main tables in `articles.sqlite` are `articles` and `sections`.

- The `articles` table stores metadata (date, authors, publication, title, ...)
- The `sections` table stores the article text split into sections and sentences

Now let's take a look at what was loaded. 

In [None]:
import sqlite3

import pandas as pd

from IPython.display import display, HTML

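def execute(sql):
  # Run the query against the articles database
  db = sqlite3.connect("paperetl/models/articles.sqlite")
  try:
    cursor = db.cursor()
    cursor.execute(sql)

    # Build a DataFrame, taking column names from the cursor description
    df = pd.DataFrame([list(x) for x in cursor], columns=[c[0] for c in cursor.description])
  finally:
    # Release the connection once the rows have been read
    db.close()

  # Render the results as an HTML table without the index column
  display(HTML(df.to_html(index=False)))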

# Show articles
execute("SELECT * FROM articles LIMIT 5")

Id,Source,Published,Publication,Authors,Affiliations,Affiliation,Title,Tags,Reference,Entry
00398e4c637f5e5447e35e63669187f0239c0357,0.xml,,,"Gibbs, Hamish; Liu, Yang; Pearson, Carl; Jarvis, Christopher; Grundy, Chris; Quilty, Billy; Diamond, Charlie; Cmmid, Lshtm; Eggo, Rosalind","Department of Infectious Disease Epidemiology, School of Hygiene and Tropical Medicine; Centre for Mathematical Modelling of Infectious Diseases, School of Hygiene and Tropical Medicine","Centre for Mathematical Modelling of Infectious Diseases, School of Hygiene and Tropical Medicine",Changing travel patterns in China during the early stages of the COVID-19 pandemic,PDF,https://doi.org/10.1038/s41467-020-18783-0,2023-01-23 00:00:00
1001,datasource2,,Test Journal2,Test Author2,,,Test Article2,,test url2,2021-04-01 00:00:00
1000,datasource,,Test Journal,Test Author,,,Test Article,,test url,2021-05-01 00:00:00
00c4c8c42473d25ebb38c4a8a14200c6900be2e9,1.xml,2020-01-23 00:00:00,Abouk and Heydari (2020),"Chernozhukov, Victor; Kasahara, Hiroyuki; Schrimpf, Paul; Chernozhukov, V; Kasahara, H; Schrimpf, P","Department of Economics and Center for Statistics and Data Science, MIT; School of Economics, UBC","School of Economics, UBC",1.xml,PDF,https://doi.org/10.1016/j.jeconom.2020.09.003,2023-01-23 00:00:00
3d2fb136bbd9bd95f86fc49bdcf5ad08ada6913b,3.xml,2021-01-23 00:00:00,Biosensors and Bioelectronics,"Yüce, Meral; Filiztekin, Elif; Gasia, Korin; Zkaya, Ö","SUNUM Nanotechnology Research and Application Centre, Sabanci University; Faculty of Engineering and Natural Sciences, Sabanci University","Faculty of Engineering and Natural Sciences, Sabanci University",COVID-19 diagnosis -A review of current methods,PDF,https://doi.org/10.1016/j.bios.2020.112752,2023-01-23 00:00:00


In [None]:
# Show sections
execute("SELECT * FROM sections LIMIT 5")

Id,Article,Name,Text
0,00398e4c637f5e5447e35e63669187f0239c0357,TITLE,Changing travel patterns in China during the early stages of the COVID-19 pandemic
1,00398e4c637f5e5447e35e63669187f0239c0357,,"T he COVID-19 pandemic was first identified in Wuhan, China, in late 2019, and came to prominence in January 2020, and quickly spread within the country."
2,00398e4c637f5e5447e35e63669187f0239c0357,,"January is also a major holiday period in China, and the 40-day period around Lunar New Year (LNY), or Chunyun, marks the largest annual human movement in the world, with major travel flows out of large cities 1 ."
3,00398e4c637f5e5447e35e63669187f0239c0357,,The purpose of this holiday travel is often to visit family members.
4,00398e4c637f5e5447e35e63669187f0239c0357,,"The temporary displacement from residential addresses as a result of this holiday travel could last one to two weeks, up to a month."


The results above show a sample of the metadata and content. 
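The `Article` values in the sections output match the `Id` values in the articles output, so the two tables can be joined. As a rough sketch, the following counts the stored section rows per article. The helper name `sentence_counts` is made up for illustration, and it assumes only the columns visible in the output above.

```python
import sqlite3

def sentence_counts(db, limit=5):
    """Return (title, section row count) pairs, largest count first."""
    sql = """
        SELECT a.Title, COUNT(s.Id) AS Total
        FROM articles a
        LEFT JOIN sections s ON s.Article = a.Id
        GROUP BY a.Id, a.Title
        ORDER BY Total DESC, a.Title
        LIMIT ?
    """
    # LEFT JOIN keeps articles that have no section rows (count of 0)
    return db.execute(sql, (limit,)).fetchall()

# Usage in this notebook (assumes the database built above):
# db = sqlite3.connect("paperetl/models/articles.sqlite")
# print(sentence_counts(db))
# db.close()
```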

# Wrapping up

This notebook gave a brief overview of paperetl. The processed data can back a simple query and display application, or feed machine learning models for more advanced use cases (see [paperai](https://github.com/neuml/paperai)).
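A query and display application of the kind mentioned here could start from a plain keyword search over the sections table. This is only a sketch: `search` is a hypothetical helper, it relies on the table and column names shown earlier in this notebook, and it uses SQLite's `LIKE` operator rather than anything provided by paperetl or paperai.

```python
import sqlite3

def search(db, term, limit=5):
    """Return (article title, sentence) pairs whose text contains term."""
    sql = """
        SELECT a.Title, s.Text
        FROM sections s
        JOIN articles a ON a.Id = s.Article
        WHERE s.Text LIKE ? ESCAPE '\\'
        ORDER BY s.Id
        LIMIT ?
    """
    # Escape LIKE wildcards so the term is matched literally.
    # Note: LIKE is case-insensitive for ASCII characters by default in SQLite.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return db.execute(sql, (f"%{escaped}%", limit)).fetchall()

# Usage in this notebook (assumes the database built above):
# db = sqlite3.connect("paperetl/models/articles.sqlite")
# for title, text in search(db, "holiday"):
#     print(title, "|", text)
# db.close()
```

Binding the term as a parameter keeps user input out of the SQL string. A substring match like this is a starting point only and does no ranking.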