# Papers with Code ML papers dataset

In [1]:
%load_ext autoreload
%autoreload 2
%cd ..

/home/ubuntu/paperswithcode/paper-extractor


In [2]:
from sota_extractor2.data.paper_collection import PaperCollection
from pathlib import Path

DATA_PATH = Path("data/arxiv")
PICKLE_PATH = Path("data/pc.pkl")

## Dataset
The dataset was created by parsing 75K arXiv papers related to machine learning. Due to parsing errors, the dataset contains texts and tables extracted from 56K papers. 
```
.
└── arxiv
    ├── papers
    │   ├── 0709
    │   │   ├── 0709.1667
    │   │   │   ├── text.json
    │   │   │   ├── metadata.json
    │   │   │   ├── table_01.csv
    │   │   │   ...
    │   │   ...
    │   ...
    └── structure-annotations.json
```

`text.json` files contain each paper's content organized into sections. `metadata.json` lists the tables found in a given paper together with their captions. Each `table_xx.csv` contains the data of a single table (nested tables are flattened).

We provide a simple API to load and access the dataset. Because of the large number of papers, it is recommended to load the dataset in parallel (by default, the number of processes equals the number of CPU cores) and to store it in a pickle file. Set `jobs=1` to disable multiprocessing. `PaperCollection` is a wrapper around a `list` of papers with additional convenience functions.
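For a quick look at the raw files without the API, the directory layout shown above can be walked with the standard library alone. The sketch below is only an illustration and not part of `sota_extractor2`: the helper names `iter_paper_dirs` and `count_tables` are hypothetical, and it assumes that every parsed paper directory contains a `text.json` file.

```python
from pathlib import Path


def iter_paper_dirs(root):
    """Yield every directory under `root` that contains a text.json file.

    Assumption: one text.json per parsed paper, as in the layout above.
    """
    for text_file in sorted(Path(root).rglob("text.json")):
        yield text_file.parent


def count_tables(paper_dir):
    """Count the table_xx.csv files stored next to text.json."""
    return sum(1 for _ in Path(paper_dir).glob("table_*.csv"))
```

For example, `sum(count_tables(d) for d in iter_paper_dirs("data/arxiv"))` would give the total number of extracted table files, provided the layout matches the assumption.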

In [3]:
%time pc = PaperCollection.from_files(DATA_PATH)
len(pc)

CPU times: user 4min 58s, sys: 12.4 s, total: 5min 11s
Wall time: 7min 28s


56696

In [4]:
pc.to_pickle(PICKLE_PATH)

In [5]:
#%time pc = PaperCollection.from_pickle(PICKLE_PATH)

CPU times: user 3min 11s, sys: 9.39 s, total: 3min 20s
Wall time: 3min 20s
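
Because loading from the pickle is faster than parsing the files, it can be convenient to parse only when no pickle exists yet. The following is a minimal sketch of such a caching pattern. `load_or_build` is a hypothetical helper, not part of the library; it takes the build, load, and save steps as callables so it does not depend on any particular API.

```python
from pathlib import Path


def load_or_build(pickle_path, build, load, save):
    """Return load(pickle_path) if the file exists; otherwise build, save, and return."""
    pickle_path = Path(pickle_path)
    if pickle_path.exists():
        return load(pickle_path)
    obj = build()
    save(obj, pickle_path)
    return obj


# Intended use with the calls shown in this notebook (not executed here):
# pc = load_or_build(
#     PICKLE_PATH,
#     build=lambda: PaperCollection.from_files(DATA_PATH),
#     load=PaperCollection.from_pickle,
#     save=lambda collection, path: collection.to_pickle(path),
# )
```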


The path is searched recursively for papers, so it is easy to specify a smaller dataset to experiment with. In that case, however, the path to the `structure-annotations.json` file needs to be specified explicitly.

In [6]:
#%time pc_small = PaperCollection.from_files(DATA_PATH / "papers" / "1602", annotations_path=DATA_PATH / "structure-annotations.json")
#len(pc_small)

CPU times: user 2.35 s, sys: 2.08 s, total: 4.43 s
Wall time: 8.62 s


555

## Tables
Each `Paper` contains `text` and `tables` fields. Tables can be displayed with color-coded labels.

In [7]:
paper = pc.get_by_id('1607.04315')
table = paper.tables[0]
table.display()

0,1,2,3,4
Model,d,|θ|M,Train,Test
Classifier with handcrafted features [12],-,-,99.7,78.2
LSTM encoders [12],300,3.0M,83.9,80.6
Dependency Tree CNN encoders [13],300,3.5M,83.3,82.1
SPINN-PI encoders [14],300,3.7M,89.2,83.2
NSE,300,3.4M,86.2,84.6
MMA-NSE,300,6.3M,87.1,84.8
LSTM attention [15],100,242K,85.4,82.3
LSTM word-by-word attention [15],100,252K,85.3,83.5
MMA-NSE attention,300,6.5M,86.9,85.4


In [8]:
PaperCollection.cells_gold_tags_legend()

0,1
Tag,description
model-best,model that has results that author most likely would like to have exposed
model-paper,"an example of a generic model, (like LSTM)"
model-competing,model from another paper used for comparison
dataset-task,Task
dataset,Dataset
dataset-sub,Subdataset
dataset-metric,Metric
model-params,"Params, f.e., number of layers or inference time"
table-meta,Cell describing other header cells


A table's data is stored in its `.df` attribute, a pandas `DataFrame`. Each cell contains its content (`value`), annotated tags (`gold_tags`), and references to other papers (`refs`). Most of the references were normalized across all papers.

In [9]:
table.df.iloc[4,0]

Cell(value='SPINN-PI encoders [14]', gold_tags='model-competing', refs=['xxref-23c141141f4f63c061d3cce14c71893959af5721'])
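
To work with the annotations in bulk, the cells can be grouped by tag. The sketch below uses a stand-in `Cell` namedtuple with the three fields shown in the output above and toy rows with made-up values, so that it runs without the dataset. `values_by_tag` is a hypothetical helper, and it assumes that `gold_tags` holds a single tag string that is empty for untagged cells.

```python
from collections import defaultdict, namedtuple

# Stand-in for the library's Cell, mirroring the fields shown in the output above.
Cell = namedtuple("Cell", ["value", "gold_tags", "refs"])


def values_by_tag(rows):
    """Group cell values by gold tag; cells with an empty tag are skipped."""
    grouped = defaultdict(list)
    for row in rows:
        for cell in row:
            if cell.gold_tags:
                grouped[cell.gold_tags].append(cell.value)
    return dict(grouped)


# Toy rows with made-up values and tags (not taken from the dataset).
toy_rows = [
    [Cell("Model", "", []), Cell("Test", "dataset-metric", [])],
    [Cell("Baseline [1]", "model-competing", []), Cell("80.0", "", [])],
    [Cell("Ours", "model-best", []), Cell("85.0", "", [])],
]
grouped = values_by_tag(toy_rows)
```

With a real table, the rows could be obtained from the `DataFrame`, for instance as `table.df.values.tolist()`, assuming the cells expose `value` and `gold_tags` as attributes.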

Additionally, each table has `gold_tags` describing the content of the table.

In [10]:
table.gold_tags

'sota'

## Text Content
Papers' content is represented using Elasticsearch document classes (which can easily be saved with `save()` to an existing Elasticsearch instance). Each `text` contains `title`, `abstract`, and `authors`. A paper's text is split into `fragments`.

In [11]:
paper.text.abstract

'Abstract We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders. NSE is equipped with a novel memory update rule and has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read , compose and write operations. NSE can also access 1 xxanchor-x1-2f1 multiple and shared memories. In this paper, we demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks: natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publically available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.'

In [12]:
paper.text.print_toc()

1 xxanchor-x1-10001 Introduction

2 xxanchor-x1-20002 Related Work

3 xxanchor-x1-30003 Proposed Approach

3.1 xxanchor-x1-40003.1 Read, Compose and Write

3.2 xxanchor-x1-50003.2 Shared and Multiple Memory Accesses

4 xxanchor-x1-60004 Experiments

4.1 xxanchor-x1-70004.1 Natural Language Inference

4.2 xxanchor-x1-80004.2 Answer Sentence Selection

4.3 xxanchor-x1-90004.3 Sentence Classification

4.4 xxanchor-x1-100004.4 Document Sentiment Analysis

4.5 xxanchor-x1-110004.5 Machine Translation

5.1 xxanchor-x1-130005.1 Memory Access and Compositionality

6 xxanchor-x1-140006 Conclusion

xxanchor-x1-150006 Acknowledgments

xxanchor-x1-160006 References

A xxanchor-x1-17000A Step-by-step visualization of memory states in NSE

In [13]:
paper.text.print_section("Machine Translation")

# 4.5 xxanchor-x1-110004.5 Machine Translation

Lastly, we conducted an experiment on neural machine translation (NMT). The NMT problem is mostly defined within the encoder-decoder framework [ xxref-4b9b7eed30feee37db3452b74503d0db9f163074 , xxref-0b544dfe355a5070b60986319a3f51fb45d1348e , xxref-39dba6f22d72853561a4ed684be265e179a39e4f ]. The encoder provides the semantic and syntactic information about the source sentences to the decoder and the decoder generates the target sentences by conditioning on this information and its partially produced translation. For an efficient encoding, the attention-based NTM was introduced [ xxref-071b16f25117fb6133480c6259227d54fc2a5ea0 ].

11000



For NTM, we implemented three different models. The first model is a baseline model and is similar to the one proposed in [ xxref-071b16f25117fb6133480c6259227d54fc2a5ea0 ] (RNNSearch). This model (LSTM-LSTM) has two LSTM for the encoder/decoder and has the soft attention neural net, which attends over the source sentence and constructs a focused encoding vector for each target word. The second model is an NSE-LSTM encoder-decoder which encodes the source sentence with NSE and generates the targets with the LSTM network by using the NSE output states and the attention network. The last model is an NSE-NSE setup, where the encoding part is the same as the NSE-LSTM while the decoder NSE now uses the output state and has an access to the encoder memory, i.e., the encoder and the decoder NSEs access a shared memory. The memory is encoded by the first NSEs and then read/written by the decoder NSEs. We used the English-German translation corpus from the IWSLT 2014 evaluation campaign [ xxref-c64d27b122d5b6ef0be135e63df05c3b24bd80c5 ]. The corpus consists of sentence-aligned translation of TED talks. The data was pre-processed and lowercased with the Moses toolkit. 9 xxanchor-x1-11001f9 We merged the dev2010 and dev2012 sets for development and the tst2010, tst2011 and tst2012 sets for test data 10 xxanchor-x1-11002f10 . Sentence pairs with length longer than 25 words were filtered out. This resulted in 110,439/4,998/4,793 pairs for train/dev/test sets. We kept the most frequent 25,000 words for the German dictionary. The English dictionary has 51,821 words. The 300-D Glove 840B vectors were used for embedding the words in the source sentence whereas a lookup embedding layer was used for the target German words. Note that the word embeddings are usually optimized along with the NMT models. However, for the evaluation purpose we in this experiment do not optimize the English word embeddings. Besides, we do not use a beam search to generate the target sentences.

11001



xxanchor-x1-110032 Figure 2: Word association or composition graphs produced by NSE memory access. The directed arcs connect the words that are composed via compose module. The source nodes are input words and the destination nodes (pointed by the arrows) correspond to the accessed memory slots. < S > denotes the beginning of sequence.

11002



The LSTM encoder/decoders have two layers with 300 units. The NSE read/write modules are two one-layer LSTM with the same number of units as the LSTM encoder/decoders. This ensures that the number of parameters of the models is roughly the equal. The models were trained to minimize word-level cross entropy loss and were regularized by 20% input dropouts and the 30% output dropouts. We set the batch size to 128, the initial learning rate to 1e-3 for LSTM-LSTM and 3e-4 for the other models and l 2 regularizer strength to 3e-5, and train each model for 40 epochs. We report BLEU score for each models. 11 xxanchor-x1-11004f11

11003



Table xxref-x1-100035 reports our results. The baseline LSTM-LSTM encoder-decoder (with attention) obtained 17.02 BLEU on the test set. The NSE-LSTM improved the baseline slightly. Given this very small improvement of the NSE-LSTM, it is unclear whether the NSE encoder is helpful in NMT. However, if we replace the LSTM decoder with another NSE and introduce the shared memory access to the encoder-decoder model (NSE-NSE), we improve the baseline result by almost 1.0 BLEU. The NSE-NSE model also yields an increasing BLEU score on dev set. The result demonstrates that the attention-based NMT systems can be improved by a shared-memory encoder-decoder model. In addition, memory-based NMT systems should perform well on translation of long sequences by preserving long term dependencies.

11004
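
The printed text contains `xxref-…` and `xxanchor-…` tokens. A minimal sketch for extracting or removing them with regular expressions follows. The token shapes are assumptions inferred from the output above rather than from a documented format: normalized paper references appear as `xxref-` followed by 40 hexadecimal characters, and anchors as `xxanchor-` followed by a run of non-space characters. The helper names are hypothetical.

```python
import re

# Assumed token shapes, inferred from the printed output (not a documented spec).
PAPER_REF = re.compile(r"xxref-[0-9a-f]{40}\b")
ANCHOR = re.compile(r"\s*xxanchor-\S+")


def paper_refs(text):
    """Return all tokens that look like normalized paper references, in order."""
    return PAPER_REF.findall(text)


def strip_anchors(text):
    """Remove anchor tokens together with the whitespace in front of them."""
    return ANCHOR.sub("", text).strip()
```

Under these assumptions, in-document references such as `xxref-x1-100035` are not matched by `paper_refs`, since their suffix is not a 40-character hexadecimal string.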



Fragments can also be accessed individually:

In [14]:
paper.text.fragments[1]

# 1 xxanchor-x1-10001 Introduction,
Recently several studies have explored ways of extending the neural networks with an external memory [ xxref-6eedf0a4fe861335f7f7664c14de7f71c00b7932 – xxref-950ebd31505dfc0733c391ad9b7a16571c46002e ]. Unlike LSTM, the short term memories and the training parameters of such a neural network are no longer coupled and can be adapted. In this paper we propose a novel class of memory augmented neural networks called Neural Semantic Encoders (NSE) for natural language understanding. NSE offers several desirable properties. NSE has a variable sized encoding memory which allows the model to access entire input sequence during the reading process; therefore efficiently delivering long-term dependencies over time. The encoding memory evolves over time and maintains the memory of the input sequence through read , compose and write operations. NSE sequentially processes the input and supports word compositionality inheriting both temporal and hierarchical natur