# Citation Needed? Automatically Detecting Sentences from Computing Research Papers that Need Citations
Author: Nicholas Vincent, Northwestern University

email: nickvincent@u.northwestern.edu | lab I work in: [www.psagroup.org](http://www.psagroup.org) | personal website: [www.nickmvincent.com](http://www.nickmvincent.com)

Prof. Doug Downey's EECS349 course, final project

## Abstract
The scientific paper is the primary artifact produced by scientists of every discipline. Within a scientific paper, citations are incredibly valuable: they help connect a paper to the surrounding literature, provide evidence for claims, and empower a single PDF (often with fewer than 10 pages) to “stand on the shoulders of giants” by referencing prior work. However, language is ambiguous, and it is not always trivial to decide whether a given sentence should include a citation. It may therefore be valuable for an author or reviewer to be able to quickly identify sentences that need one. We implement and test a variety of machine learning classifiers that attempt to solve this task.

In this work, we explore a variety of approaches to classification: (1) using bag-of-words vectorization alongside textual metadata with traditional classification techniques (Naive Bayes Classifier, Decision Trees, Support Vector Classifier, etc.) and (2) using character-level features and word embeddings with deep learning techniques (recurrent neural networks, long short-term memory). Using a dataset with 6022 examples (85% negative), we see that support vector machine approaches provide a good balance of performance and quick training (reaching X AUROC with very quick training). Finally, using a larger dataset with 32,228 examples, we find that after tuning hyperparameters we can train a classifier with 85% recall and 40% precision (accuracy is 89%). This performance should be adequate to make this classifier useful as a "machine assistant" for authors or reviewers of academic papers.

## Organization of this Notebook
The main content of this notebook is organized based on the project guidelines for EECS349. At the end of the notebook is an appendix that goes into greater detail about aspects of the project that are not specifically related to machine learning (e.g. the data cleaning process and various examples from our dataset).

The following sections are:
1. Methods: Dataset
2. Methods: Overview of Features and Classification Approaches
3. Results
4. Future Work

An [appendix](/appendix) is included in a separate page. While not part of the official project submission, the appendix includes a variety of minor details that may be interesting and helpful (or not) to readers.

## Methods: Dataset
#### Data Acquisition
We generated all training and test data by downloading computing papers as PDF files and converting them to text (detailed below). Each text file was processed so as to split the text into sentences, identify all the sentences with citations, and then strip out the evidence of the citations (e.g. bracketed numbers like "[1]") so the data could be fairly used for training. The full data pre-processing steps are explained in the Appendix.

We built two datasets for developing and testing, a small dataset and a larger dataset.

First we used a small dataset composed of X papers. This dataset included 6022 different sentences and 11% (X) had citations. The papers for this dataset came exclusively from the PSA Research Group (the author's research lab), so we use this data when referring to example sentences. We refer to this dataset as the "psa_research dataset" in code and file organization.

The second dataset was produced by downloading the full Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI 2018), made available through the [ACM Digital Library]() (CITE). From this set of over 600 papers, we used 100 papers, so that the full training and testing process could be performed on a single laptop.


## Methods: Overview of Features and Classification Approaches Used

### Initial Approach
For our first set of classification approaches, we followed a simple strategy to turn sentences into features. We vectorized each sentence using a term frequency-inverse document frequency (Tf-Idf) approach, and then computed additional "hand-crafted" features to account for some aspects of text that are not captured by word counting: number of characters in the sentence and boolean variables indicating whether each sentence has any digit characters, has a comma character, has a quote character, and has an uppercase character after the first character.

Tf-Idf details: For the large dataset, we limited the bag-of-words vocabulary size to the 10,000 most frequent words to avoid memory issues. For all datasets, we included stop-words (as removing them just lowered performance) and stripped accents.
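The feature construction described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper names `handcrafted_features` and `build_features` are hypothetical, the accent-stripping mode (`'unicode'`) is an assumption, and "quote character" is assumed here to mean a straight double quote.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def handcrafted_features(sentence):
    """Return the five hand-crafted features for one sentence (hypothetical helper)."""
    return [
        len(sentence),                                  # number of characters
        int(any(ch.isdigit() for ch in sentence)),      # has any digit character
        int(',' in sentence),                           # has a comma character
        int('"' in sentence),                           # has a quote character (assumed: straight double quote)
        int(any(ch.isupper() for ch in sentence[1:])),  # uppercase after the first character
    ]


def build_features(sentences, max_features=10000):
    """Stack Tf-Idf columns and hand-crafted columns into one sparse matrix."""
    vectorizer = TfidfVectorizer(
        max_features=max_features,  # keep only the most frequent terms
        strip_accents='unicode',    # assumed accent-stripping mode
        stop_words=None,            # stop-words are kept
    )
    tfidf = vectorizer.fit_transform(sentences)
    extra = csr_matrix(
        np.array([handcrafted_features(s) for s in sentences], dtype=float)
    )
    return hstack([tfidf, extra]).tocsr(), vectorizer


example_sentences = [
    'Prior work has shown that citations matter, especially in HCI.',
    'we ran 3 trials.',
]
X, fitted_vectorizer = build_features(example_sentences)
print(X.shape)
```

In a real experiment the vectorizer would be fit on the training split only and then applied to the test split with `transform`.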

Then, we performed a variety of runs with statistical-test feature selection to reduce the dimensionality of our feature space (sklearn's SelectKBest with ANOVA F-value, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif). We tried reducing the maximum number of features to 50, 100, 500, 1000, 2000, 3000, 4000, and 5000.
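As a rough illustration of this feature selection step, the sketch below applies `SelectKBest` with `f_classif` to synthetic data. The data, the helper name `select_top_k`, and the small values of `k` are all assumptions made for the example; the real runs used the sentence features and the larger `k` values listed above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the sentence feature matrix (assumption for illustration).
rng = np.random.RandomState(0)
n_samples, n_features = 200, 30
X = rng.rand(n_samples, n_features)
y = (rng.rand(n_samples) < 0.15).astype(int)  # imbalanced labels, roughly 15% positive
X[:, 0] += y                                  # make column 0 clearly informative


def select_top_k(X, y, k):
    """Keep the k features with the highest ANOVA F-value (hypothetical helper)."""
    k = min(k, X.shape[1])  # cannot keep more features than exist
    selector = SelectKBest(score_func=f_classif, k=k)
    X_reduced = selector.fit_transform(X, y)
    return X_reduced, selector


for k in (5, 10, 50):
    X_reduced, selector = select_top_k(X, y, k)
    print(k, X_reduced.shape)
```

As with the vectorizer, the selector should be fit on training data only, since `f_classif` uses the labels.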

Using these sets of features, we tested a variety of classifiers (using sklearn implementations, with default parameters unless otherwise noted).
1. Logistic Regression (using C values of 0.1, 1, and 10)
2. Linear Support Vector Classifier (using C values of 0.1, 1, and 10)
3. Decision Tree
4. Gaussian Naive Bayes
5. K-Nearest Neighbors (3 neighbors)
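The comparison over the classifiers listed above might look something like the following sketch. It uses synthetic dense data as a stand-in for the real features, and the helper names `candidate_classifiers` and `positive_scores` are hypothetical. Note that `GaussianNB` requires dense input, so a sparse Tf-Idf matrix would need to be converted first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def candidate_classifiers():
    """Build the classifiers listed above, with sklearn defaults otherwise."""
    models = {}
    for c in (0.1, 1, 10):
        models['logreg_C=%s' % c] = LogisticRegression(C=c)
        models['linear_svc_C=%s' % c] = LinearSVC(C=c)
    models['decision_tree'] = DecisionTreeClassifier(random_state=0)
    models['gaussian_nb'] = GaussianNB()
    models['knn_3'] = KNeighborsClassifier(n_neighbors=3)
    return models


def positive_scores(model, X):
    """Return a continuous score for the positive class, as AUROC needs."""
    if hasattr(model, 'decision_function'):
        return model.decision_function(X)
    return model.predict_proba(X)[:, 1]


# Synthetic, imbalanced stand-in data (assumption for illustration).
rng = np.random.RandomState(0)
X = rng.randn(400, 10)
y = (X[:, 0] + 0.5 * rng.randn(400) > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for name, model in candidate_classifiers().items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, positive_scores(model, X_test))

for name, auroc in sorted(results.items(), key=lambda item: -item[1]):
    print('%-18s AUROC = %.3f' % (name, auroc))
```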

### Deep Learning
We tested two different deep learning approaches for classification as well. Both were implemented using Tensorflow.

First, we tried to use a character-level RNN (features were character embeddings) trained for language modeling to perform classification. We added our labels as tokens at the end of each sentence (e.g. "This is a negative example.@@0 And this is a positive example.@@1") and had the model predict these tokens; see the code for more details. The code for this implementation was a modified fork of this MIT-licensed Github repo: https://github.com/crazydonkey200/tensorflow-char-rnn.
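To make the label-as-token idea concrete, here is a small sketch of how sentences could be turned into language-model training text. The function names are hypothetical and this is not the code from the fork mentioned above; it only illustrates the `@@0` / `@@1` format.

```python
LABEL_MARKER = '@@'


def to_language_model_text(sentences, labels):
    """Join sentences into one training string, appending '@@<label>' to each."""
    pieces = []
    for sentence, label in zip(sentences, labels):
        pieces.append('%s%s%d' % (sentence.strip(), LABEL_MARKER, int(label)))
    return ' '.join(pieces)


def prompt_for_prediction(sentence):
    """Text to feed a trained model; its next predicted character is read as the label."""
    return sentence.strip() + LABEL_MARKER


training_text = to_language_model_text(
    ['This is a negative example.', 'And this is a positive example.'],
    [0, 1],
)
print(training_text)
```

At prediction time, the model would be primed with `prompt_for_prediction(sentence)` and the probabilities it assigns to `0` versus `1` as the next character would serve as the classification score.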

Second, we trained a simple neural network using Tensorflow's DNNClassifier Estimator. For this classifier, we used the Wiki-word word embeddings, which are pre-trained on the Wikipedia corpus and available with Tensorflow. We tried a few network architectures with 1, 2, and 3 hidden layers. The code for this implementation was a modified version of this CC-3.0 tutorial in the Tensorflow docs: https://www.tensorflow.org/tutorials/text_classification_with_tf_hub.

### A note on software and hardware used
As mentioned above, all our deep learning implementations used Tensorflow and all other machine learning implementations used sklearn. While we used multiple machines (including Google Cloud Platform) throughout experiments, all our reported results were run from the same machine, a Dell XPS 13 64-bit Windows machine with Intel i7-6560U CPU @ 2.20 GHz and no GPU. This allowed us to directly compare the time taken by various methods.

## Results
### Metrics and other Considerations
For all our experiments, we decided to focus on performance metrics beyond accuracy, especially given the imbalanced class labels. Specifically, we chose to use area under the receiver operating characteristic curve (AUROC) when initially comparing models with the small dataset (as we were interested in having a model with strong discriminative power), and then to focus on the precision-recall curves for the large dataset to try to achieve a good precision-recall balance for users. The use case we considered was an academic who is writing a paper or reviewing a paper and wants to quickly use our tool to check for any sentences that might need citations. Under this model, the user likely wants to emphasize recall over precision: because most sentences do not have citations, if the model identifies all the sentences that do have citations, the user can quickly eliminate the false positives with their human judgment. Therefore, we decided to aim for a model that had the best precision when recall was over 80%.
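One way to operationalize "best precision when recall is over 80%" is to scan the precision-recall curve and pick the decision threshold with the highest precision among the points that meet the recall target. The sketch below does this with sklearn's `precision_recall_curve`; the helper name `best_threshold_at_recall` and the toy scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_threshold_at_recall(y_true, scores, min_recall=0.8):
    """Return (threshold, precision, recall) with the best precision at recall >= min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final curve point (precision=1, recall=0) has no threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    eligible = recall >= min_recall
    if not eligible.any():
        raise ValueError('no threshold reaches the requested recall')
    best = int(np.argmax(np.where(eligible, precision, -1.0)))
    return thresholds[best], precision[best], recall[best]


# Toy classifier scores (assumption for illustration): higher means "needs a citation".
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

threshold, precision_at, recall_at = best_threshold_at_recall(y_true, scores)
print(threshold, precision_at, recall_at)
```

Sentences whose score is at or above the returned threshold would then be flagged for the user to review.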

Given that academics are often very busy, come from a variety of different fields and disciplines, often like to run their own software, and often work with limited computing resources, we also wanted to emphasize models that train quickly, test quickly, and run on a laptop, so that users could realistically train their own personal model. That being said, we still did consider the performance of slower and costlier approaches (e.g. deep learning, nearest neighbors) in case performance was vastly improved.

### Comparing model performance for the `psa_research` dataset
Below, Figure 1 shows the AUROC for our experiments on the smaller `psa_research` dataset. We found that the SVM, logistic regression, and deep neural network classifiers all performed well (although the neural net took orders of magnitude more time to train). The best classifier was a linear SVM with C=0.1 and the best 2000 features selected. Based on these results, we eliminated the Naive Bayes, Decision Tree, and Nearest Neighbor classifiers.

### Precision-recall with the `chi_2018` dataset

We treated this 
We found that...



In [None]:
df = pd.read_csv()

## Future Work
While we focused on testing models that were small and quick to train, an interesting next step would be to try to build an extremely general model, incorporating text from across academic disciplines. This would require substantially more infrastructure and resources. It would be interesting to see how deep learning methods compare to other methods when using orders of magnitude more data (and, by extension, less similar data).

Additionally, if user tests suggest that the performance of our classifiers is unsatisfactory, it may be possible to tweak our implementation to improve overall performance, perhaps through additional hand-crafted features designed by academics familiar with a particular field's writing trends, or through a more comprehensive hyperparameter search.

## Appendix

In [1]:
import pandas as pd
df = pd.read_csv('all_labeled_sentences.csv', encoding='utf8')
df['has_citation'].value_counts()

0.0    211017
1.0     31422
Name: has_citation, dtype: int64

In [2]:
sample_neg = df[df.has_citation == 0].sample(50, random_state=0)
print('\n===\nExample sentences without citations:')

for i, row in sample_neg.iterrows():
    print('(no citation)', row['text'])


===
Example sentences without citations:
(no citation) how to join these lumber boards is essential to woodworking.
(no citation) Paper 309

Page 1

CHI 2018 Paper

CHI 2018, April 21­26, 2018, Montréal, QC, Canada

Guided by this prior work, we conducted a study to answer this overarching question: How does socioeconomic context shape caregivers' perceptions and use of current PA tracking tools?
(no citation) Our system was more accurate on fully incorrect SPSes (57% were "Accurate") than partially incorrect SPSes (36% were "Accurate").
(no citation) 15.00 DOI: https://doi.org/10.1145/3173574.3173656

INTRODUCTION Electronic textile technology enables people to create expressive, interactive, and functional textile artifacts for both playful and serious applications.
(no citation) In the long-answer category, recordings coded as Definition made up 7.63%.
(no citation) Clickstream and in-video dropout data are passively collected in that the data is natu

In [1]:
sample_pos = df[df.has_citation == 1].sample(50, random_state=0)
print('\n===\nExample sentences with citations:')
for i, row in sample_pos.iterrows():
    print('Raw text:', row['text'])
    print(row['has_citation'])
    print('Processed text:', row['processed_text'])


### Data Cleaning
This section provides a careful walk-through of our data cleaning process. In particular, it focuses on justifying each step and providing examples.

Broadly, there are 4 issues (in order of execution, not importance):
1. Tokenizing academic text
2. Finding and "disposing" of citations cleanly
3. Finding and removing reference sections
4. Dealing with artifacts of PDF conversion

#### Tokenizing text from academic PDF files
Academic text includes frequent use of the period character in ways that are not handled well by NLTK's default sentence tokenizer. Therefore, we simply replace common academic expressions with an equivalent (albeit grammatically incorrect) version without periods.


In [44]:
from nltk import tokenize

data = "This sentence is quite academic, i.e. it belongs in an academic paper (e.g. a conference paper). \
We show in Fig. 1 that our work is important, which supports the findings of Smith et al. among others."

pairs = {
    'Fig.': 'Fig',
    'e.g.': 'eg',
    'i.e.': 'ie',
    'et al.': 'et al',
}
for key, val in pairs.items():
    data = data.replace(key, val)
sentences = tokenize.sent_tokenize(data)
print(sentences)

['This sentence is quite academic, ie it belongs in an academic paper (eg a conference paper).', 'We show in Fig 1 that our work is important, which supports the findings of Smith et al among others.']


#### Finding and "disposing" of citations cleanly
To even begin generating labels for our machine learning task, we need to identify which sentences have citations. However, if we don't "dispose" of the citation markers in the training data (e.g. bracketed citations like [1] or [34,35]), our training data won't generalize at all to real data. Most obviously, a character-based model might just learn to label all sentences with a bracket character as having a citation. However, less obvious issues may occur: for example, we found that after stripping away the [1], we might be left with odd-looking text, such as a comma surrounded by whitespace.

`"...Smith et al. showed this [1], and therefore..." -> "Smith et al. showed this , and therefore..."`
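A minimal sketch of labeling a sentence, removing bracketed markers, and tidying the leftover whitespace is shown below. It is one possible cleanup and not necessarily the exact procedure used in this project: the function name `label_and_strip` is hypothetical, and the pattern assumes numeric bracketed markers such as [1], [34,35], or [3-5].

```python
import re

# Assumed marker style: bracketed numbers such as [1], [34,35], or [3-5].
CITATION_PATTERN = re.compile(r'\[\d+(?:\s*[,-]\s*\d+)*\]')


def label_and_strip(sentence):
    """Return (cleaned_text, has_citation) for one sentence."""
    has_citation = bool(CITATION_PATTERN.search(sentence))
    text = CITATION_PATTERN.sub('', sentence)
    text = re.sub(r'\s+([,.;:!?])', r'\1', text)  # drop whitespace left before punctuation
    text = re.sub(r'\s{2,}', ' ', text).strip()   # collapse doubled spaces
    return text, has_citation


examples = [
    'Smith et al. showed this [1], and therefore we agree.',
    'Prior work [34,35] supports this.',
    'No markers here.',
]
for example in examples:
    print(label_and_strip(example))
```

Computing the label before stripping matters: once the markers are gone, nothing in the cleaned text should reveal which sentences originally had them.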

#### Finding and removing reference sections