# papeles package - file parsing examples


This notebook shows three examples of what you can do with raw PDF files you want to analyze. Using the data extracted by [neurips_crawler](https://github.com/glhuilli/neurips_crawler), the raw PDF files are processed, and two subtasks are presented: extracting the paper sentences (everything that is not references or header) and extracting the header (everything above the abstract).

Other applications, such as reference extraction, will be added in future versions.


## Process PDF files 

Using `papeles.corpus.neurips.load.load_folder`, you can process NeurIPS PDF files and extract their contents as a generator that can be used for further processing (e.g. the tasks presented in the following sections).


In [33]:
import os
from tqdm.notebook import tqdm

from papeles.corpus.neurips.load import load_folder

# output is the output folder from neurips_crawler: https://github.com/glhuilli/neurips_crawler
NEURIPS_DATA_OUTPUT = '/var/data/neurips_crawler/output'  


metadata = {}  # year -> NeurIPS object 
files_data = {}  # file_name -> text of file
for folder in tqdm(os.listdir(NEURIPS_DATA_OUTPUT)):
    year = folder.split('_')[-1]
    files_sentences, files_metadata = load_folder(os.path.join(NEURIPS_DATA_OUTPUT, folder))
    metadata[year] = files_metadata
    files_data.update(files_sentences)
print(f'files: {len(files_data)}')


files: 5327
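
The loading loop above takes the year from the last `_`-separated part of each folder name, so a stray entry in the output folder (a hidden file, a notes folder) would be treated as a year. Below is a minimal sketch of a stricter check, assuming folder names end in `_<four-digit year>`. The helper `year_from_folder` and the sample folder names are hypothetical and not part of papeles.

```python
import re

# Assumption: crawler output folders end with an underscore and a 4-digit year.
YEAR_SUFFIX = re.compile(r"_(\d{4})$")


def year_from_folder(folder_name):
    """Return the trailing 4-digit year as a string, or None if there is none."""
    match = YEAR_SUFFIX.search(folder_name)
    return match.group(1) if match else None


# Hypothetical directory listing, including entries that are not year folders.
sample_folders = ["neurips_2009", "neurips_2010", ".DS_Store", "notes"]
years = {name: year_from_folder(name) for name in sample_folders}
valid_folders = [name for name, year in years.items() if year is not None]
print(valid_folders)
```

In the loading loop, entries for which the helper returns `None` could simply be skipped before calling `load_folder`.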


In [37]:
metadata['2009']['3689-information-theoretic-lower-bounds-on-the-oracle-complexity-of-convex-optimization.pdf']

{'id': 'a9f07f6e-073c-5dca-9071-256ffc6dec5b',
 'title': 'Information-theoretic lower bounds on the oracle complexity of convex optimization',
 'pdf_name': '3689-information-theoretic-lower-bounds-on-the-oracle-complexity-of-convex-optimization.pdf',
 'abstract': 'Despite the large amount of literature on upper bounds on complexity of convex analysis, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding critical to understand fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss implications of these results to the understanding the inherent complexity of large-scale learning and estimation problems.',
 'authors': [{'id': 'alekh

In [38]:
list(files_data.keys())[0]

'3689-information-theoretic-lower-bounds-on-the-oracle-complexity-of-convex-optimization.pdf'
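
Note that `metadata` is keyed by year and then by PDF file name, while `files_data` is keyed by file name only. When working from `files_data`, a reverse lookup from file name to year can be handy. The sketch below builds one from toy data with the same shape; the function `year_by_file` and the toy entries are illustrative assumptions, not part of papeles.

```python
from collections import Counter

# Toy stand-in for `metadata`: year -> {pdf file name -> metadata dict}.
toy_metadata = {
    "2009": {"3689-some-paper.pdf": {"title": "Some paper"}},
    "2010": {
        "4000-another-paper.pdf": {"title": "Another paper"},
        "4001-third-paper.pdf": {"title": "Third paper"},
    },
}


def year_by_file(metadata):
    """Invert the year -> files mapping into a file name -> year lookup."""
    return {
        file_name: year
        for year, papers in metadata.items()
        for file_name in papers
    }


lookup = year_by_file(toy_metadata)
papers_per_year = Counter(lookup.values())
print(lookup["4001-third-paper.pdf"])
print(dict(papers_per_year))
```

The per-year counts also give a quick sanity check that every folder was loaded.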

## Extract "Sentences" of Papers

Each paper has different sections that may be relevant for different applications. Here, `paper.get_sentences` focuses on extracting the lines that belong to the main content of the paper (after the "abstract" and before the "references").



In [27]:
from papeles.utils import paper
from collections import Counter, defaultdict


NEURIPS_ANALYSIS_DATA = '/var/data/neurips_analysis/'


# Note that this takes ~2hrs to run. 
for paper_key, paper_sentences in tqdm(list(files_data.items())):
    with open(os.path.join(NEURIPS_ANALYSIS_DATA, f'files_sentences/{paper_key}_sentences.txt'), 'w') as f:
        for line in paper.get_sentences(paper.flatten(paper_sentences)):
            f.write(line + '\n')
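
To make the "after the abstract, before the references" idea concrete, here is a simplified sketch of that kind of filter. It is not the implementation of `paper.get_sentences`. It assumes the paper is a flat list of text lines in which "Abstract" and "References" appear as standalone headings, and it keeps everything after the abstract heading, including the abstract text itself. The function name `body_lines` and the sample lines are hypothetical.

```python
def body_lines(lines):
    """Keep non-empty lines after an 'Abstract' heading and before 'References'."""
    in_body = False
    kept = []
    for line in lines:
        text = line.strip()
        marker = text.lower()
        if marker == "abstract":
            in_body = True  # start keeping lines after this heading
            continue
        if marker == "references":
            break  # everything from here on is the reference list
        if in_body and text:
            kept.append(text)
    return kept


# Hypothetical flattened paper.
sample_paper = [
    "A Title",
    "Jane Doe",
    "Abstract",
    "We study X.",
    "1 Introduction",
    "Text here.",
    "",
    "References",
    "[1] Foo.",
]
print(body_lines(sample_paper))
```

Real PDFs are messier (headings merged with text, missing abstract headings), so a filter like this is only a starting point.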


## Extract "Header" of Papers 

To get the header of a paper, defined as everything before the abstract (i.e. title, authors, etc.), you can use the `paper.get_header` function as follows:

In [34]:

for paper_key, paper_sentences in tqdm(list(files_data.items())):
    with open(os.path.join(NEURIPS_ANALYSIS_DATA, f'files_headers/{paper_key}_headers.txt'), 'w') as f:
        for line in paper.get_header(paper.flatten(paper_sentences)):
            f.write(line + '\n')
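
As an illustration of "everything before the abstract", the sketch below collects lines until it reaches a standalone "Abstract" heading. It is a simplified stand-in under that assumption, not the code behind `paper.get_header`. The name `header_lines` and the sample lines are hypothetical.

```python
from itertools import takewhile


def header_lines(lines):
    """Return the non-empty lines that come before the 'Abstract' heading."""

    def before_abstract(line):
        return line.strip().lower() != "abstract"

    return [line.strip() for line in takewhile(before_abstract, lines) if line.strip()]


# Hypothetical flattened paper.
sample_paper = [
    "A Title",
    "",
    "Jane Doe",
    "Some University",
    "Abstract",
    "We study X.",
]
print(header_lines(sample_paper))
```

If no abstract heading is found, this sketch returns every non-empty line, so the caller may want to cap the header length.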




## Save paper metadata

In the particular case of NeurIPS, metadata is available for each paper, and it is recommended to store it separately from the other files. It includes information such as the year of publication, the authors (author_id and author_name), and the paper_key (the number at the beginning of the file name).

In [41]:
import json

from papeles.paper.neurips import get_key


# `metadata` (year -> {file name -> metadata dict}) was built in the loading cell above.
for year, paper_data in tqdm(metadata.items()):
    for paper_key, file_metadata in paper_data.items():
        file_metadata['year'] = int(year)
        file_metadata['paper_key'] = get_key(paper_key)
        with open(os.path.join(NEURIPS_ANALYSIS_DATA, f'files_metadata/{paper_key}_metadata.json'), 'w') as f:
            json.dump(file_metadata, f)
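
The cell above writes one JSON file per paper into `files_metadata/`, which must already exist. The sketch below shows the same write followed by a read-back, using a temporary directory so it runs anywhere. `leading_number` is a hypothetical stand-in for `get_key`, based only on the description of the key as the number at the beginning of the file name; the real function may return a different type. The record contents are toy values.

```python
import json
import os
import re
import tempfile


def leading_number(file_name):
    """Hypothetical stand-in for get_key: digits at the start of the file name."""
    match = re.match(r"\d+", file_name)
    return match.group(0) if match else None


pdf_name = "3689-some-paper.pdf"  # toy file name
record = {"title": "Some paper", "year": 2009, "paper_key": leading_number(pdf_name)}

with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, "files_metadata")
    os.makedirs(out_dir, exist_ok=True)  # create the output folder if missing
    path = os.path.join(out_dir, f"{pdf_name}_metadata.json")
    with open(path, "w") as f:
        json.dump(record, f)
    with open(path) as f:
        loaded = json.load(f)

print(loaded)
```

Calling `os.makedirs(..., exist_ok=True)` before the write loops avoids a `FileNotFoundError` on a fresh analysis folder.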


