# Citation Needed? Automatically Detecting Sentences from Computing Research Papers that Need Citations
Author: Nicholas Vincent, Northwestern University

email: nickvincent@u.northwestern.edu | lab website: [www.psagroup.org](http://www.psagroup.org) | personal website: [www.nickmvincent.com](http://www.nickmvincent.com)

Prof. Doug Downey's EECS349 course, final project

## Abstract
The scientific paper is the primary artifact produced by scientists of every discipline. Within a scientific paper, citations are incredibly valuable: they connect a paper to the surrounding literature, provide evidence for claims, and empower a single PDF (often fewer than 10 pages) to “stand on the shoulders of giants” by referencing prior work. However, language is ambiguous, and deciding whether a given sentence should include a citation is not always trivial. It may therefore be valuable for an author or reviewer to be able to quickly identify sentences that need a citation. We implement and test a variety of machine learning classifiers that attempt to solve this task.

In this work, we explore two distinct approaches to classification: (1) using word-level features (i.e. bag-of-words) alongside textual metadata with traditional classification techniques (Naive Bayes, trees, logistic regression, etc.) and (2) using character-level features with deep learning techniques (recurrent neural networks, long short-term memory). We find that for a small dataset (every sentence from 20 computing research papers), logistic regression performs best.

## Organization of this Notebook
The main content of this notebook is organized according to the project guidelines for EECS349. An appendix at the end of the notebook covers aspects of the project that are not specifically related to machine learning (e.g. the data cleaning process and examples from our dataset).

The following sections are:
1. Dataset
2. Overview of Features and Classification Approaches Used
3. Results
4. Future Work
5. Appendix

## Dataset
#### Data Acquisition
We generated all training and test data by downloading computing papers as PDF files and converting them to text (detailed below). Each text file was processed to split the text into sentences, identify all sentences with citations, and then strip out the evidence of the citations (e.g. bracketed numbers like "[1]") so the data could be used fairly for training.

We built two datasets for development and testing. The first is a "small" dataset composed of X papers. This dataset included X different sentences, of which 11% (X) had citations. The papers for this dataset came exclusively from the PSA Research Group (the author's research lab), so this dataset is appropriate for including example sentences.

### Iterating on Data Cleaning


## Overview of Features and Classification Approaches Used

## Results
We found that...

## Future Work
Although...


## Appendix


In [42]:
import pandas as pd
df = pd.read_csv('all_labeled_sentences.csv', encoding='utf8')
df['has_citation'].value_counts()

0.0    5171
1.0     855
Name: has_citation, dtype: int64
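
As a rough illustration of the class imbalance these counts imply, the sketch below hard-codes the two counts from the output above. It computes the share of sentences with citations and the accuracy that a trivial always-negative classifier would reach. The variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the value_counts() output above
n_without = 5171
n_with = 855
total = n_without + n_with

# Share of sentences that carry a citation
positive_share = n_with / total
# Accuracy of a trivial classifier that always predicts "no citation"
majority_baseline = n_without / total

print(f'{positive_share:.1%} of sentences have citations')
print(f'always-negative accuracy: {majority_baseline:.1%}')
```

Because roughly 86% of these sentences have no citation, plain accuracy can look strong for a classifier that never predicts a citation, so per-class metrics such as precision and recall are worth checking as well.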

In [33]:
sample_neg = df[df.has_citation == 0].sample(50, random_state=0)
print('\n===\nExample sentences without citations:')

for i, row in sample_neg.iterrows():
    print('(no citation)', row['text'])


===
Example sentences without citations:
(no citation) These definitional differences were likely a major cause of the deviations between the outputs of each localness metric, indicating that choosing an incorrect definition of localness may be costly.
(no citation) Gravity Modeling

Intuition Spatial interaction models seek to explain the relationship between two locations (i and j) using the distance between them and their individual attributes.
(no citation) Interestingly, for state-scale single-location definitions of localness, our results suggest that Plurality exhibits near perfect performance.
(no citation) This might be accomplished by creating a hybrid vector representation that combines the content-based and navigation-based vectors using a deep learning.
(no citation) In other words, we predict that for each $100,000 increase in median house price in a city, there will be about 43.4 more Airbnb hosts per 100,000 citizens, and 3.8 fewer Couchsurfing hosts.
(no citatio

In [34]:
sample_pos = df[df.has_citation == 1].sample(50, random_state=0)
print('\n===\nExample sentences with citations:')
for i, row in sample_pos.iterrows():
    print('Raw text:', row['text'])
    print(row['has_citation'])
    print('Processed text:', row['processed_text'])


===
Example sentences with citations:
Raw text: Additionally, further operationalizations of the term "local" appear in various legal and other contexts (eg in the food industry [61]).
1.0
Processed text: Additionally, further operationalizations of the term "local" appear in various legal and other contexts (eg in the food industry).
Raw text: A major recent thrust of this work relates to gender dynamics, with several studies showing that both OSM and the English Wikipedia have more content about and for men than about and for women (eg, [32,38,45,55,63]).
1.0
Processed text: A major recent thrust of this work relates to gender dynamics, with several studies showing that both OSM and the English Wikipedia have more content about and for men than about and for women .
Raw text: Our framework is straightforward and has the benefit of being easily extensible to include additional alternative routing approaches and additional externality metrics not discussed in this paper (eg number of 

### Data Cleaning
This section provides a careful walk-through of our data cleaning process. In particular, it focuses on justifying each step and providing examples.

Broadly, there are 4 issues (in order of execution, not importance):
1. Tokenizing academic text
2. Finding and "disposing" of citations cleanly
3. Finding and removing reference sections
4. Dealing with artifacts of PDF conversion

#### Tokenizing text from academic PDF files
Academic text includes frequent use of the period character in ways that are not handled well by NLTK's default sentence tokenizer. Therefore, we simply replace common academic expressions with an equivalent (albeit grammatically incorrect) version without periods.


In [44]:
from nltk import tokenize

data = "This sentence is quite academic, i.e. it belongs in an academic paper (e.g. a conference paper). \
We show in Fig. 1 that our work is important, which supports the findings of Smith et al. among others."

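# Map common academic abbreviations to period-free equivalents
pairs = {
    'Fig.': 'Fig',
    'e.g.': 'eg',
    'i.e.': 'ie',
    'et al.': 'et al',
}
# Replace each abbreviation before tokenizing so its periods are not mistaken for sentence boundaries
for key, val in pairs.items():
    data = data.replace(key, val)
# Split the cleaned text into sentences with NLTK's default sentence tokenizer
sentences = tokenize.sent_tokenize(data)
print(sentences)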

['This sentence is quite academic, ie it belongs in an academic paper (eg a conference paper).', 'We show in Fig 1 that our work is important, which supports the findings of Smith et al among others.']
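
To illustrate why abbreviation periods are a problem for sentence splitting in general, here is a minimal sketch that uses a deliberately naive regex splitter as a stand-in. It is not NLTK's tokenizer and does not reproduce its behavior; `naive_split` and `strip_abbreviation_periods` are hypothetical helper names, and the abbreviation map simply mirrors the one in the cell above.

```python
import re

# Same abbreviation map as in the cell above
ABBREVIATIONS = {'Fig.': 'Fig', 'e.g.': 'eg', 'i.e.': 'ie', 'et al.': 'et al'}


def naive_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


def strip_abbreviation_periods(text):
    """Replace each abbreviation with its period-free equivalent."""
    for key, val in ABBREVIATIONS.items():
        text = text.replace(key, val)
    return text


text = "We show in Fig. 1 that this holds, e.g. for small inputs. Smith et al. agree."

# Without the replacement step, every abbreviation period becomes a boundary
print(len(naive_split(text)))  # 5
# After the replacement step, only the two real sentences remain
print(len(naive_split(strip_abbreviation_periods(text))))  # 2
```

The trade-off is the one noted above: the cleaned text is no longer strictly grammatical, but the remaining periods are much more reliable sentence boundaries.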


#### Finding and "disposing" of citations cleanly
Before we can generate labels for our machine learning task, we need to identify which sentences have citations. However, if we don't "dispose" of the citation markers in the training data (e.g. bracketed citations like [1] or [34,35]), our models won't generalize at all to real data. Most obviously, a character-based model might just learn to label every sentence containing a bracket character as having a citation. Less obvious issues may also occur: for example, we found that after stripping away the [1], we might be left with odd-looking text, such as a comma surrounded by whitespace.

`"...Smith et al. showed this [1], and therefore..." -> "...Smith et al. showed this , and therefore..."`
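
One possible way to handle both the labeling and the leftover whitespace is sketched below. This is an assumption about how such a step could look, not the cleaning code used in this project: the regular expressions and the `label_and_strip` helper are hypothetical, and only numeric bracketed markers such as [1] or [34,35] are covered.

```python
import re

# Bracketed numeric markers: [1], [34,35], [3-5] (assumed citation style)
CITATION_RE = re.compile(r'\[\d+(?:\s*[,\-]\s*\d+)*\]')


def label_and_strip(sentence):
    """Return (has_citation, cleaned_text) for a single sentence."""
    has_citation = bool(CITATION_RE.search(sentence))
    cleaned = CITATION_RE.sub('', sentence)
    # Drop whitespace left in front of punctuation, e.g. "this , and" -> "this, and"
    cleaned = re.sub(r'\s+([,.;:)])', r'\1', cleaned)
    # Collapse any remaining runs of whitespace
    cleaned = re.sub(r'\s{2,}', ' ', cleaned).strip()
    return has_citation, cleaned


print(label_and_strip("Smith et al. showed this [1], and therefore it holds."))
```

A sketch like this still leaves other traces behind, for example a parenthetical that contained nothing but citations, so the cleaned text is worth spot-checking against the raw text.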

#### Finding and removing reference sections