# Extracting text with Grobid


Technically, this isn't an OpenReview snippet, but here's some code to grab text from PDFs, many of which you might have downloaded from OpenReview.

First, you have to set up Grobid. I'm mostly following [these instructions](https://grobid.readthedocs.io/en/latest/Install-Grobid/):
```
wget https://github.com/kermitt2/grobid/archive/0.7.1.zip
unzip 0.7.1.zip
rm 0.7.1.zip
```

You will need Gradle to build Grobid. I hope that the following works for you (it assumes Homebrew is installed), because I don't know other ways to do this :-(
```
brew install gradle
cd grobid-0.7.1/
./gradlew clean install
cd ../
```

Create a directory in which you will save the pdfs to be extracted, e.g. `pdfs/`, and a directory where the xml files will be dumped, e.g. `xmls/`:

```
mkdir pdfs xmls
```

Copy your PDFs into `pdfs/`, then run:

```
java -Xmx4G -jar grobid-0.7.1/grobid-core/build/libs/grobid-core-0.7.1-onejar.jar \
	-gH grobid-0.7.1/grobid-home \
	-dIn pdfs/ \
	-dOut xmls/ \
	-exe processFullText 
```
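
After the batch run, it can be worth checking that every PDF produced an output file. Below is a minimal sketch, assuming Grobid names each output after the input PDF's stem with a `.tei.xml` suffix. That naming is an assumption here, so check your `xmls/` directory and adjust `suffix` if needed. `find_unprocessed` is a hypothetical helper, not part of Grobid.

```python
from pathlib import Path


def find_unprocessed(pdf_dir, xml_dir, suffix=".tei.xml"):
  """Return names of PDFs in pdf_dir that have no matching file in xml_dir.

  Assumes outputs are named <pdf stem> + suffix; this is an assumption,
  so change suffix if your Grobid version names files differently.
  """
  missing = []
  for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
    if not (Path(xml_dir) / (pdf.stem + suffix)).exists():
      missing.append(pdf.name)
  return missing
```

Calling `find_unprocessed("pdfs", "xmls")` returns a list of file names you may want to rerun or inspect by hand.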

Below is example code for extracting the section headers and text from the XML files that Grobid wrote to `xmls/`.

In [None]:
import glob

import xml.etree.ElementTree as ET

PREFIX = "{http://www.tei-c.org/ns/1.0}"
TEXT_ID = f"{PREFIX}text"
BODY_ID = f"{PREFIX}body"
DIV_ID = f"{PREFIX}div"
HEAD_ID = f"{PREFIX}head"
P_ID = f"{PREFIX}p"


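def get_docs(filename):
  """Return parallel lists of section titles and texts from a Grobid TEI file."""
  section_titles = []
  section_texts = []
  body = ET.parse(filename).getroot().find(f"{TEXT_ID}/{BODY_ID}")
  if body is None:  # no <text>/<body>, e.g. not a Grobid TEI file
    return section_titles, section_texts
  for div in body.findall(DIV_ID):
    # Some divs have no <head>; unpacking the findall() result would raise there.
    header_node = div.find(HEAD_ID)
    section_titles.append(header_node.text if header_node is not None else "")
    # itertext() splits text at inline tags such as <ref>, so join the
    # fragments directly and put a space between paragraphs instead.
    paragraphs = ["".join(p.itertext()) for p in div.findall(P_ID)]
    section_texts.append(" ".join(paragraphs))
  return section_titles, section_texts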


for filename in glob.glob("xmls/*"):
  sections, texts = get_docs(filename)
  for section, text in zip(sections, texts):
    print(section)
    print(text[:500] + "...")
    print()
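
To see what structure this code expects, here is a minimal sketch that parses a hand-written TEI fragment. The fragment is made up for illustration and is much simpler than real Grobid output; it only mirrors the `text/body/div` nesting with `head` and `p` children. It uses the `namespaces` argument of `ElementTree` as an alternative to spelling out the prefix in every tag.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; real Grobid output has many more elements.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>We study <ref>[1]</ref> parsing.</p></div>
    <div><p>A section without a heading.</p></div>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

root = ET.fromstring(SAMPLE_TEI)
sections = []
for div in root.findall("tei:text/tei:body/tei:div", NS):
  head = div.find("tei:head", NS)  # None when the div has no <head>
  title = head.text if head is not None else ""
  # Join fragments inside a paragraph directly, and paragraphs with a space.
  text = " ".join("".join(p.itertext()) for p in div.findall("tei:p", NS))
  sections.append((title, text))

for title, text in sections:
  print(repr(title), "->", text)
```

The second `div` shows why a missing `head` needs handling: not every block that Grobid emits is guaranteed to carry a section title.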


# Extracting quotes with pdfalto

This requires you to install [pdfalto](https://github.com/kermitt2/pdfalto), which can be quite complicated. Also, I personally have not been able to get this to work: the fonts are somehow different between manuscripts at the same venue using the same template, so the italics test has to be adapted for each document.

Anyway, if you have pdfalto installed in the current directory, run

```
cd pdfalto
./pdfalto /path/to/input.pdf temp.xml
```

then the code below will work (once you add the correct tests to select italics styles). I recommend using [an XML viewer](https://codebeautify.org/xmlviewer) to browse the xml file and figure out which styles apply to the content you want to extract.

In [None]:
root = ET.parse('pdfalto/temp.xml').getroot()

In [None]:
PREFIX = '{http://www.loc.gov/standards/alto/ns-v3#}'
STYLES = f'{PREFIX}Styles'
TEXTSTYLE = f'{PREFIX}TextStyle'
LAYOUT = f'{PREFIX}Layout'
PAGE = f'{PREFIX}Page'
TEXT_BLOCK = f'{PREFIX}TextBlock'
STRING = f'{PREFIX}String'

In [None]:
italics_styles = []
for style_child in root.findall(STYLES)[0].findall(TEXTSTYLE):
  # Uncomment to print all styles
  # print(style_child.attrib)
  # You will need a different test for your document.
  # .get() avoids a KeyError if a style has no FONTFAMILY attribute.
  if "i9" in style_child.attrib.get("FONTFAMILY", ""):
    italics_styles.append(style_child.attrib["ID"])
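
One way to find the right test is to list every `TextStyle` next to the number of words that use it, since body text usually dominates and italic runs are rarer. Below is a minimal sketch on a made-up ALTO fragment: the font names and IDs are invented, and `summarize_styles` is a hypothetical helper. It assumes `STYLEREFS` may hold several space-separated IDs.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ALTO = "{http://www.loc.gov/standards/alto/ns-v3#}"

# Invented fragment for illustration only; real pdfalto output is far larger.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Styles>
    <TextStyle ID="font0" FONTFAMILY="nimbusromno9l-regu" FONTSIZE="10"/>
    <TextStyle ID="font1" FONTFAMILY="nimbusromno9l-reguital" FONTSIZE="10"/>
  </Styles>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="We" STYLEREFS="font0"/>
    <String CONTENT="agree" STYLEREFS="font0"/>
    <String CONTENT="novel" STYLEREFS="font1"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def summarize_styles(root):
  """Return (style ID, font family, word count) tuples, most used first."""
  counts = Counter()
  for string in root.iter(f"{ALTO}String"):
    # STYLEREFS is treated as a space-separated list of style IDs.
    counts.update(string.attrib.get("STYLEREFS", "").split())
  rows = []
  for style in root.iter(f"{ALTO}TextStyle"):
    style_id = style.attrib.get("ID", "")
    rows.append((style_id, style.attrib.get("FONTFAMILY", ""), counts[style_id]))
  return sorted(rows, key=lambda row: -row[2])


rows = summarize_styles(ET.fromstring(SAMPLE_ALTO))
for style_id, family, count in rows:
  print(style_id, family, count)
```

On a real file, pass the `root` parsed from `pdfalto/temp.xml` instead of the sample, then pick the font family substring that identifies the italic style.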

In [None]:
for page in root.findall(LAYOUT)[0].findall(PAGE):
  for print_space in page:
    for text_block in print_space.findall(TEXT_BLOCK):
      for text_line in text_block:
        italic_words = []
        for string in text_line.findall(STRING):
          # .get() avoids a KeyError when a String lacks STYLEREFS or FONTSTYLE;
          # STYLEREFS can list several space-separated style IDs.
          style_refs = string.attrib.get('STYLEREFS', '').split()
          if (any(ref in italics_styles for ref in style_refs)
              or string.attrib.get('FONTSTYLE') == 'italics'):
            italic_words.append(string.attrib["CONTENT"])
        if italic_words:
          print(" ".join(italic_words) + "\n")

### Extract an abstract

This works on the Grobid TEI files in `xmls/`, not on the pdfalto output.

In [None]:
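# PREFIX was redefined for ALTO above, so spell out the TEI namespace here.
TEI_PREFIX = "{http://www.tei-c.org/ns/1.0}"
HEADER_ID = f"{TEI_PREFIX}teiHeader"
PROFILE_ID = f"{TEI_PREFIX}profileDesc"
ABSTRACT_ID = f"{TEI_PREFIX}abstract"
TEI_P_ID = f"{TEI_PREFIX}p"

def get_abstract(filename):
    """Return the abstract text from a Grobid TEI file, or "" if there is none."""
    root = ET.parse(filename).getroot()
    abstract_node = root.find(f"{HEADER_ID}/{PROFILE_ID}/{ABSTRACT_ID}")
    if abstract_node is None:
        return ""
    # iter() finds <p> whether it sits in a <div> or directly under <abstract>.
    paragraphs = ["".join(p.itertext()) for p in abstract_node.iter(TEI_P_ID)]
    return " ".join(paragraphs)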