# Text and Tables Extraction

This notebook shows how to use our pipeline to extract text and tables from arXiv papers that have LaTeX source code available.

In [None]:
from pathlib import Path
from axcell.helpers.paper_extractor import PaperExtractor

## Structure of Directories

We cache the artifacts produced by each successfully executed intermediate step of the extraction pipeline. The `root` argument of `PaperExtractor` is the path under which the following directory structure is created:

```
root
├── sources                       # e-print archives
├── unpacked_sources              # extracted latex sources (generated automatically)
├── htmls                         # converted html files (generated automatically)
└── papers                        # extracted text and tables (generated automatically)
```

In [None]:
ROOT_PATH = Path('data')

In our case there's a single e-print archive:

In [None]:
!tree {ROOT_PATH}

data
└── sources
    └── 1903
        └── 1903.11816v1

2 directories, 1 file


In [None]:
extract = PaperExtractor(ROOT_PATH)

To extract text and tables from a single paper just pass the path to the archive:

In [None]:
SOURCES_PATH = ROOT_PATH / 'sources'
extract(SOURCES_PATH / '1903' / '1903.11816v1')

'success'

The subdirectory structure under the `sources` directory is replicated in the other top-level directories.

In [None]:
!tree -L 4 {ROOT_PATH}

data
├── htmls
│   └── 1903
│       └── 1903.11816v1
│           └── index.html
├── papers
│   └── 1903
│       └── 1903.11816v1
│           ├── layout_01.csv
│           ├── layout_02.csv
│           ├── layout_03.csv
│           ├── layout_04.csv
│           ├── layout_05.csv
│           ├── metadata.json
│           ├── table_01.csv
│           ├── table_02.csv
│           ├── table_03.csv
│           ├── table_04.csv
│           ├── table_05.csv
│           └── text.json
├── sources
│   └── 1903
│       └── 1903.11816v1
└── unpacked_sources
    └── 1903
        └── 1903.11816v1
            ├── eso-pic.sty
            ├── iccv.sty
            ├── iccv_eso.sty
            ├── ieee.bst
            ├── images
            ├── submission_465.bbl
            └── submission_465.tex

12 dire
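
The mirroring shown in the tree above can be sketched with `pathlib` alone. This is only an illustration of the layout: `mirrored_paths` and `MIRRORED_DIRS` are hypothetical names, not part of axcell, and the sketch assumes every archive lives somewhere under `root/sources`.

```python
from pathlib import Path

# Top-level directories that mirror the layout of `sources` (taken from the tree above).
MIRRORED_DIRS = ("unpacked_sources", "htmls", "papers")


def mirrored_paths(root: Path, archive: Path) -> dict:
    """Map an archive under root/sources to its counterparts in the other directories.

    Hypothetical helper for illustration; it is not part of axcell.
    """
    relative = archive.relative_to(root / "sources")
    return {name: root / name / relative for name in MIRRORED_DIRS}


paths = mirrored_paths(Path("data"), Path("data/sources/1903/1903.11816v1"))
for name, path in paths.items():
    print(f"{name}: {path}")
```

`Path.relative_to` raises `ValueError` for a path outside `root/sources`, which makes a misplaced archive easy to spot.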

The extracted data is stored in the `papers` directory. We can read it using the `PaperCollection` class, a wrapper around a `list` of papers with additional convenience functions. Because the number of papers can be large, we recommend loading the dataset in parallel (by default, the number of processes equals the number of CPU cores) and storing it in a pickle file. Set `jobs=1` to disable multiprocessing.

In [None]:
from axcell.data.paper_collection import PaperCollection

PAPERS_PATH = ROOT_PATH / 'papers'
pc = PaperCollection.from_files(PAPERS_PATH)
# pc.to_pickle('mypapers.pkl')
# pc = PaperCollection.from_pickle('mypapers.pkl')
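
The "load once, then reuse the pickle" workflow suggested by the commented-out lines can be wrapped in a small helper. The sketch below is generic and uses the standard `pickle` module. `load_or_build` is a hypothetical name, and using it with a `PaperCollection` assumes that the object can be pickled with plain `pickle`; the collection's own `to_pickle` and `from_pickle` methods could be swapped in instead.

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(cache_path: Path, build: Callable[[], T]) -> T:
    """Return the pickled object at cache_path, building and saving it on a cache miss."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    obj = build()
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


# Possible use in this notebook (not executed here, requires axcell):
# pc = load_or_build(Path('mypapers.pkl'),
#                    lambda: PaperCollection.from_files(PAPERS_PATH))
```

The cache is never invalidated, so the pickle file has to be deleted by hand after new papers are extracted.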

In [None]:
paper = pc.get_by_id('1903.11816')
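
Note that the archive is named `1903.11816v1`, while the lookup uses `1903.11816`, without the version suffix. Assuming the collection is keyed by the versionless identifier, as this call suggests, a lookup key could be derived from an archive name as follows. `strip_version` is a hypothetical helper, not an axcell function.

```python
import re

# arXiv identifiers may carry a trailing version suffix such as "v1".
_VERSION_SUFFIX = re.compile(r"v\d+$")


def strip_version(arxiv_id: str) -> str:
    """Drop a trailing version suffix from an arXiv identifier (hypothetical helper)."""
    return _VERSION_SUFFIX.sub("", arxiv_id)


print(strip_version("1903.11816v1"))  # 1903.11816
```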

In [None]:
paper.text.title

'FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation'

In [None]:
paper.tables[4]

0,1,2,3
Rank,Team,Single Model,Final Score
1,CASIA_IVA_JD,✗,0.5547
2,WinterIsComing,✗,0.5544
-,PSPNet [38],ResNet-269,0.5538
-,EncNet [36],ResNet-101,0.5567
-,Ours,ResNet-101,0.5584


As *FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation* (Wu et al., 2019) is present in our **SegmentedTables** dataset, we can use `PaperCollection` to import annotations (table segmentation and results):

In [None]:
from axcell.helpers.datasets import read_tables_annotations

V1_URL = 'https://github.com/paperswithcode/axcell/releases/download/v1.0/'
SEGMENTED_TABLES_URL = V1_URL + 'segmented-tables.json.xz'

segmented_tables = read_tables_annotations(SEGMENTED_TABLES_URL)

pc = PaperCollection.from_files(PAPERS_PATH, annotations=segmented_tables.to_dict('records'))

In [None]:
paper = pc.get_by_id('1903.11816')
paper.tables[4]

0,1,2,3
Rank,Team,Single Model,Final Score
1,CASIA_IVA_JD,✗,0.5547
2,WinterIsComing,✗,0.5544
-,PSPNet [38],ResNet-269,0.5538
-,EncNet [36],ResNet-101,0.5567
-,Ours,ResNet-101,0.5584


In [None]:
pc.cells_gold_tags_legend()

0,1
Tag,description
model-best,the best performing model introduced in the paper
model-paper,model introduced in the paper
model-ensemble,ensemble of models introduced in the paper
model-competing,model from another paper used for comparison
dataset-task,Task
dataset,Dataset
dataset-sub,Subdataset
dataset-metric,Metric
model-params,"Params, f.e., number of layers or inference time"


In [None]:
paper.tables[4].sota_records

cell_ext_id,task,dataset,metric,format,model,raw_value
table_05.csv/5.3,Semantic Segmentation,ADE20K test,Test Score,,EncNet + JPU,0.5584
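
The `cell_ext_id` value `table_05.csv/5.3` appears to combine a table file name with a zero-based row and column: row 5, column 3 of the table displayed earlier holds `0.5584`, which matches `raw_value`. The sketch below makes that reading explicit. The `<file>/<row>.<col>` layout is an assumption inferred from this single record, and `parse_cell_ext_id` and `CellRef` are hypothetical names.

```python
from typing import NamedTuple


class CellRef(NamedTuple):
    table_file: str
    row: int
    col: int


def parse_cell_ext_id(cell_ext_id: str) -> CellRef:
    """Split an identifier like 'table_05.csv/5.3' into file name, row, and column.

    Assumes the '<file>/<row>.<col>' layout with zero-based indices.
    """
    table_file, _, position = cell_ext_id.rpartition("/")
    row, col = position.split(".")
    return CellRef(table_file, int(row), int(col))


# Rows of paper.tables[4] as displayed earlier in this notebook.
rows = [
    ["Rank", "Team", "Single Model", "Final Score"],
    ["1", "CASIA_IVA_JD", "✗", "0.5547"],
    ["2", "WinterIsComing", "✗", "0.5544"],
    ["-", "PSPNet [38]", "ResNet-269", "0.5538"],
    ["-", "EncNet [36]", "ResNet-101", "0.5567"],
    ["-", "Ours", "ResNet-101", "0.5584"],
]

ref = parse_cell_ext_id("table_05.csv/5.3")
print(ref.table_file, rows[ref.row][ref.col])  # table_05.csv 0.5584
```

The file name `table_05.csv` corresponds to `paper.tables[4]` because the list is zero-based while the file names start at `01`.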


## Parallel Extraction

Extracting a single paper can take from several seconds to a few minutes (the longest phase, converting LaTeX source into HTML, times out after 5 minutes), so we run the extraction in parallel to process multiple files.

In [None]:
%%time

from joblib import delayed, Parallel

# Access `extract` from the global context to avoid serialization
def extract_single(file):
    return extract(file)

files = sorted([path for path in SOURCES_PATH.glob('**/*') if path.is_file()])

statuses = Parallel(backend='multiprocessing', n_jobs=-1)(delayed(extract_single)(file) for file in files)

CPU times: user 100 ms, sys: 40.5 ms, total: 141 ms
Wall time: 30.1 s
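
After a parallel run it is useful to see how many papers were extracted and which ones need another look. A minimal sketch, assuming `statuses` is a list of strings aligned with `files`: the only status value shown in this notebook is `'success'`, so the failure status below is a made-up placeholder, and `summarize` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def summarize(files, statuses):
    """Count statuses and list the files whose status is not 'success'."""
    counts = Counter(statuses)
    failed = [file for file, status in zip(files, statuses) if status != "success"]
    return counts, failed


# Made-up inputs for illustration; 'some-error' is a placeholder status.
example_files = [
    Path("data/sources/1903/1903.11816v1"),
    Path("data/sources/1903/0000.00000v1"),
]
example_statuses = ["success", "some-error"]

counts, failed = summarize(example_files, example_statuses)
print(counts["success"], len(failed))  # 1 1
```

In the notebook itself the call would be `summarize(files, statuses)`.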
