# Extracting text with Grobid


Technically, this isn't an OpenReview snippet, but here's some code to extract text from PDFs, many of which you may have downloaded from OpenReview.

First, you have to set up Grobid. I'm mostly following [these instructions](https://grobid.readthedocs.io/en/latest/Install-Grobid/):
```
wget https://github.com/kermitt2/grobid/archive/0.7.1.zip
unzip 0.7.1.zip
rm 0.7.1.zip
```

You will need Gradle to run Grobid. The following uses Homebrew; I hope it works for you, because I don't know of another way to do this :-(
```
brew install gradle
```

Create a directory for the PDFs you want to extract text from, e.g. `pdfs/`, and a directory where the XML files will be written, e.g. `xmls/`:

```
mkdir pdfs xmls
```

Copy your PDFs into `pdfs/`, then build Grobid:

```
cd grobid-0.7.1/
./gradlew clean install
```

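Once Grobid has converted the PDFs into TEI XML files in `xmls/` (for example with its batch mode or its REST service), the example code below extracts the section headers and text from each file.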

In [31]:
import glob

import xml.etree.ElementTree as ET

PREFIX = "{http://www.tei-c.org/ns/1.0}"
TEXT_ID = f"{PREFIX}text"
BODY_ID = f"{PREFIX}body"
DIV_ID = f"{PREFIX}div"
HEAD_ID = f"{PREFIX}head"
P_ID = f"{PREFIX}p"


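def get_docs(filename):
  """Return parallel lists of section titles and section texts from a TEI file."""
  section_titles = []
  section_texts = []
  root = ET.parse(filename).getroot()
  divs = root.find(TEXT_ID).find(BODY_ID).findall(DIV_ID)
  for div in divs:
    # A div may lack a header, so fall back to an empty title.
    header_node = div.find(HEAD_ID)
    section_titles.append(
      "".join(header_node.itertext()) if header_node is not None else "")
    # Join paragraphs with a space so words at paragraph boundaries don't fuse.
    paragraphs = [" ".join(p.itertext()) for p in div.findall(P_ID)]
    section_texts.append(" ".join(paragraphs))
  return section_titles, section_texts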


for filename in glob.glob("xmls/*"):
  sections, texts = get_docs(filename)
  for section, text in zip(sections, texts):
    print(section)
    print(text[:500] + "...")
    print()


INTRODUCTION
Interest in visualization has exploded in recent years. Driven in part by the emergence of cheap, ubiquitous data, visualizations are now a common medium for exploring and explaining data produced in the sciences, medicine, humanities, and even our day-to-day lives  [13] . Amongst the huge growth in the number and type of visual analysis tools created for researchers and scholars who want to make sense of complex data, a quickly emerging subclass of visualizations are infographics  [2, 3, 14] . ...

BACKGROUND
For purposes of comparing domains, we delineate visualization practitioners based on their primary skill sets, whether in design or programming. In this work, we study designers whose main expertise is in design. In contrast, we refer to people who primarily create visualizations and visualization creation tools programmatically as visualization programmers. While there are many practitioners who have expertise in both, we find interesting comparisons when individual
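
The double spaces around citations in the output above (e.g. `lives  [13] .`) come from joining the text fragments of each paragraph with a space. As a minimal sketch of one way to avoid that, the variant below concatenates the fragments directly and then collapses whitespace. `SAMPLE_TEI`, `normalize`, and `get_sections` are hypothetical names introduced here for illustration, and the XML is a hand-written stand-in for a Grobid output file, not real output.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment standing in for a Grobid output file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>INTRODUCTION</head><p>First <ref>[1]</ref> paragraph.</p><p>Second paragraph.</p></div>
      <div><p>A section without a header.</p></div>
    </body>
  </text>
</TEI>"""


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def get_sections(root):
    """Return a (title, text) pair for each div in the TEI body."""
    sections = []
    for div in root.findall("./tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        # Use an empty title when the div has no header.
        title = normalize("".join(head.itertext())) if head is not None else ""
        paragraphs = [
            normalize("".join(p.itertext())) for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append((title, " ".join(paragraphs)))
    return sections


sections = get_sections(ET.fromstring(SAMPLE_TEI))
for title, text in sections:
    print(repr(title), "->", text)
```

To run this on real files, you would pass `ET.parse(filename).getroot()` instead of the parsed sample string.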

# Extracting quotes with pdfalto

This requires you to install [pdfalto](https://github.com/kermitt2/pdfalto), which can be quite complicated. Also, I personally have not been able to get this to work in general: the fonts somehow differ between manuscripts at the same venue using the same template, so no single italics test works for all of them.

Anyway, if you have pdfalto installed in the current directory, run

```
cd pdfalto
./pdfalto /path/to/input.pdf temp.xml
```

then the code below will work (once you add the correct tests to select italic styles). I recommend using [an XML viewer](https://codebeautify.org/xmlviewer) to browse the XML file and figure out which styles apply to the content you want to extract.

In [9]:
import xml.etree.ElementTree as ET

root = ET.parse('pdfalto/temp.xml').getroot()

In [10]:
PREFIX = '{http://www.loc.gov/standards/alto/ns-v3#}'
STYLES = f'{PREFIX}Styles'
TEXTSTYLE = f'{PREFIX}TextStyle'
LAYOUT = f'{PREFIX}Layout'
PAGE = f'{PREFIX}Page'
TEXT_BLOCK = f'{PREFIX}TextBlock'
STRING = f'{PREFIX}String'

In [28]:
italics_styles = []
for style_child in root.findall(STYLES)[0].findall(TEXTSTYLE):
  # Uncomment to print all styles
  # print(style_child.attrib)
  if "i9" in style_child.attrib["FONTFAMILY"]: # You will need a different test for your document
    italics_styles.append(style_child.attrib["ID"])

{'ID': 'font0', 'FONTFAMILY': 'arial', 'FONTSIZE': '17.933', 'FONTTYPE': 'sans-serif', 'FONTWIDTH': 'proportional', 'FONTCOLOR': '000000', 'FONTSTYLE': 'bold'}
{'ID': 'font1', 'FONTFAMILY': 'nimbussanl', 'FONTSIZE': '11.955', 'FONTTYPE': 'sans-serif', 'FONTWIDTH': 'proportional', 'FONTCOLOR': '000000'}
{'ID': 'font2', 'FONTFAMILY': 'cmsy6', 'FONTSIZE': '5.978', 'FONTTYPE': 'sans-serif', 'FONTWIDTH': 'proportional', 'FONTCOLOR': '000000', 'FONTSTYLE': 'superscript'}
{'ID': 'font3', 'FONTFAMILY': 'arial', 'FONTSIZE': '9.963', 'FONTTYPE': 'sans-serif', 'FONTWIDTH': 'proportional', 'FONTCOLOR': '000000'}
{'ID': 'font4', 'FONTFAMILY': 'nimbussanl', 'FONTSIZE': '11.960', 'FONTTYPE': 'sans-serif', 'FONTWIDTH': 'proportional', 'FONTCOLOR': '000000'}
{'ID': 'font5', 'FONTFAMILY': 'nimbusromno9l', 'FONTSIZE': '11.955', 'FONTTYPE': 'sans-serif', 'FONTWIDTH': 'proportional', 'FONTCOLOR': '000000'}
{'ID': 'font6', 'FONTFAMILY': 'minionpro', 'FONTSIZE': '8.966', 'FONTTYPE': 'serif', 'FONTWIDTH': 'pr
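
Since the right test differs from document to document, one option is to check several hints at once. The sketch below is only an assumption about what such a test might look like: it flags a style as italic if its `FONTSTYLE` mentions italics or if its `FONTFAMILY` contains one of a few substrings. `SAMPLE_ALTO`, `ITALIC_HINTS`, and `find_italic_styles` are hypothetical names, and the font names in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Hypothetical ALTO fragment; the font names are made up for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Styles>
    <TextStyle ID="font0" FONTFAMILY="arial" FONTSIZE="17.933" FONTSTYLE="bold"/>
    <TextStyle ID="font1" FONTFAMILY="nimbusromno9l-reguital" FONTSIZE="11.955"/>
    <TextStyle ID="font2" FONTFAMILY="minionpro" FONTSIZE="8.966" FONTSTYLE="italics"/>
    <TextStyle ID="font3" FONTFAMILY="arial" FONTSIZE="9.963"/>
  </Styles>
</alto>"""

# Assumed substrings that hint at an italic face in a font family name.
ITALIC_HINTS = ("ital", "oblique")


def find_italic_styles(root, hints=ITALIC_HINTS):
    """Return the IDs of TextStyle elements that look italic."""
    italic_ids = []
    for style in root.findall("./alto:Styles/alto:TextStyle", ALTO_NS):
        family = style.get("FONTFAMILY", "").lower()
        font_style = style.get("FONTSTYLE", "").lower()
        # Match either an explicit italic style or a hint in the family name.
        if "italic" in font_style or any(hint in family for hint in hints):
            italic_ids.append(style.get("ID"))
    return italic_ids


italic_ids = find_italic_styles(ET.fromstring(SAMPLE_ALTO))
print(italic_ids)
```

Substring hints like these can produce false positives (any family name that happens to contain `ital`), so it is still worth printing all styles and checking the result by eye.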

In [26]:
for page in root.findall(LAYOUT)[0].findall(PAGE):
  for print_space in page:
    for text_block in print_space.findall(TEXT_BLOCK):
      for text_line in text_block:
        found_italics = False
        for string in text_line.findall(STRING):
          # STYLEREFS may list several space-separated style IDs.
          style_refs = string.get('STYLEREFS', '').split()
          if (any(ref in italics_styles for ref in style_refs)
              or string.get('FONTSTYLE') == 'italics'):
            print(string.attrib["CONTENT"], end=" ")
            found_italics = True
        # End the line only if it contained italic text.
        if found_italics:
          print("\n")

visualization; 

toolkits 

infographics 

in situ, 

main 

et al. 

et al 

D 

in situ 

etc. 

“a better 

idea of the behavior of each attribute.” 

“I 

spend most of my time with the data. That is the hard part 

you are teaching because the students like to jump very 

quickly into solutions. It is very hard to explain that most 

of your time spent creating a visualization is with data.” 

“There’s a default on the design side to go quickly to 

it looks and not necessarily ﬁnd the outlier.” 

“Having a knack for the data science part often 

separates the good designers from the great ones. Personally 

I believe that the data science part is the Achilles heel of the 

designer. You gain insight by working directly with the data. 

The best designers are the ones that will open up Excel and 

manipulate the data before they get to the graphics part.” 

“I am amazed at what people will sit 

through in terms of doing something manually with Illustra- 

or InDesign... not all o
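
The output above prints one italic fragment per text line, so each quote is split across several lines and mixed with other italic text such as "et al.". As a rough sketch of a possible post-processing step, the helper below joins consecutive fragments from an opening curly quote to the matching closing one and drops everything else. `merge_quotes` is a hypothetical helper, and the sample fragments are adapted from the output above; it assumes the fragments have been collected into a list rather than printed, and it does not repair words hyphenated across lines.

```python
def merge_quotes(fragments, open_quote="\u201c", close_quote="\u201d"):
    """Join consecutive fragments that together form one quoted passage."""
    quotes = []
    current = None  # fragments of the quote being built, or None outside a quote
    for fragment in fragments:
        fragment = fragment.strip()
        if current is None:
            if not fragment.startswith(open_quote):
                continue  # italic text outside a quote, e.g. "et al."
            current = []
        current.append(fragment)
        if fragment.endswith(close_quote):
            quotes.append(" ".join(current))
            current = None
    # A quote that never closes is dropped rather than guessed at.
    return quotes


# Sample fragments adapted from the italic lines printed above.
fragments = [
    "et al. ",
    "“a better ",
    "idea of the behavior of each attribute.” ",
    "etc. ",
    "“There’s a default on the design side to go quickly to ",
    "it looks and not necessarily find the outlier.” ",
]

quotes = merge_quotes(fragments)
for quote in quotes:
    print(quote)
```

Note that an opening quote with no closing quote would swallow all following fragments until the next closing quote, so the result still needs a manual check.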