# Appendix

In [1]:
import pandas as pd
df = pd.read_csv('all_labeled_sentences.csv', encoding='utf8')
df['has_citation'].value_counts()

0.0    211017
1.0     31422
Name: has_citation, dtype: int64

In [2]:
sample_neg = df[df.has_citation == 0].sample(50, random_state=0)
print('\n===\nExample sentences without citations:')

for i, row in sample_neg.iterrows():
    print('(no citation)', row['text'])


===
Example sentences without citations:
(no citation) how to join these lumber boards is essential to woodworking.
(no citation) Paper 309

Page 1

CHI 2018 Paper

CHI 2018, April 21­26, 2018, Montréal, QC, Canada

Guided by this prior work, we conducted a study to answer this overarching question: How does socioeconomic context shape caregivers' perceptions and use of current PA tracking tools?
(no citation) Our system was more accurate on fully incorrect SPSes (57% were "Accurate") than partially incorrect SPSes (36% were "Accurate").
(no citation) 15.00 DOI: https://doi.org/10.1145/3173574.3173656

INTRODUCTION Electronic textile technology enables people to create expressive, interactive, and functional textile artifacts for both playful and serious applications.
(no citation) In the long-answer category, recordings coded as Definition made up 7.63%.
(no citation) Clickstream and in-video dropout data are passively collected in that the data is natu

In [3]:
sample_pos = df[df.has_citation == 1].sample(50, random_state=0)
print('\n===\nExample sentences with citations:')
for i, row in sample_pos.iterrows():
    print('Raw text:', row['text'])
    print(row['has_citation'])
    print('Processed text:', row['processed_text'])


===
Example sentences with citations:
Raw text: These are questions of experience, politics and human values [11,19,79].
1.0
Processed text: These are questions of experience, politics and human values.
Raw text: 1https://kodi.tv/

Distribution of Head and Gaze Angles We measured the 3D head pose by fitting a generic 3D face model to the detected facial landmarks, and transformed the onscreen gaze location to the 3D direction vector in the camera coordinate system as in [3].
1.0
Processed text: 1https://kodi.tv/

Distribution of Head and Gaze Angles We measured the 3D head pose by fitting a generic 3D face model to the detected facial landmarks, and transformed the onscreen gaze location to the 3D direction vector in the camera coordinate system as in.
Raw text: Block-Based Programming Interface Block-based programming interfaces are well represented in educational programming games and environments today--with notable examples being Scratch [66], Blockly [16], Reduct [4],

### Data Cleaning
This section provides a careful walk-through of our data cleaning process. In particular, it focuses on justifying each step and providing examples.

Broadly, there are 4 issues (in order of execution, not importance):
1. Tokenizing academic text
2. Finding and "disposing" of citations cleanly
3. Finding and removing reference sections
4. Dealing with artifacts of PDF conversion

#### Tokenizing text from academic PDF files
Academic text frequently uses the period character in ways that NLTK's default sentence tokenizer does not handle well, most often in abbreviations such as "Fig.", "e.g.", "i.e.", and "et al.". We therefore replace these common academic expressions with an equivalent (albeit grammatically incorrect) version without periods before tokenizing.


In [4]:
from nltk import tokenize

data = "This sentence is quite academic, i.e. it belongs in an academic paper (e.g. a conference paper). \
We show in Fig. 1 that our work is important, which supports the findings of Smith et al. among others."

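# Map each period-containing abbreviation to a period-free equivalent.
pairs = {
    'Fig.': 'Fig',
    'e.g.': 'eg',
    'i.e.': 'ie',
    'et al.': 'et al',
}
# Apply the replacements first so the tokenizer does not split on these periods.
for key, val in pairs.items():
    data = data.replace(key, val)
sentences = tokenize.sent_tokenize(data)
print(sentences)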

['This sentence is quite academic, ie it belongs in an academic paper (eg a conference paper).', 'We show in Fig 1 that our work is important, which supports the findings of Smith et al among others.']


#### Finding and "disposing" of citations cleanly
Before we can train a model for our machine learning task, we need labels that indicate which sentences have citations. However, if we don't "dispose" of the citation markers in the training data (e.g. bracketed citations like [1] or [34,35]), a model trained on that data won't generalize to real data. Most obviously, a character-based model might just learn to label every sentence containing a bracket character as having a citation. Less obvious issues may also occur: for example, we found that after stripping away the [1], we might be left with odd-looking text, such as a comma surrounded by whitespace.

`"...Smith et al. showed this [1], and therefore..." -> "...Smith et al. showed this , and therefore..."`
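
The following is a minimal sketch of one way to strip citation markers without leaving stray whitespace behind. It is not our exact implementation: the function name `strip_citations` and the `CITATION` pattern are hypothetical, and the pattern assumes numeric bracketed markers only (e.g. [1], [34,35], or [3-5]). It returns both the cleaned text and a flag, which play the same roles as the `processed_text` and `has_citation` columns shown earlier.

```python
import re

# Assumption: citations are numeric bracketed markers such as [1], [34,35], [3-5].
# The leading \s* also consumes the space before the marker, so removing
# "this [1]," yields "this," rather than "this ,".
CITATION = re.compile(r"\s*\[\d+(?:\s*[,-]\s*\d+)*\]")


def strip_citations(sentence):
    """Return (cleaned_text, has_citation) for a single sentence."""
    cleaned, n_removed = CITATION.subn("", sentence)
    # Safety net: drop any whitespace still stranded before punctuation.
    cleaned = re.sub(r"\s+([,.;:])", r"\1", cleaned)
    return cleaned.strip(), n_removed > 0


print(strip_citations("Smith et al. showed this [1], and therefore we agree."))
# ('Smith et al. showed this, and therefore we agree.', True)
```

Note that the second substitution is deliberately aggressive: it would also close up legitimate whitespace before punctuation (for example, before a decimal written as ".5"), so it may need to be narrowed for other corpora.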

#### Finding and removing reference sections