# LLOD demo corpus

This notebook uses Python and a number of common Unix tools to
- retrieve PDFs from the web,
- extract their text,
- perform syntactic analysis,
- convert to RDF, and
- download the results.

You can press `<CTRL>+F9` (or select `Kernel`|`Run all` from the menu) to run the full pipeline.

> *Note*: The notebook is designed to run in [Google Colaboratory](https://colab.research.google.com/) and uses its extensions for downloading output files. If you run it on a local [Jupyter Notebook](https://jupyter.org/) installation, comment out the download step (the last cell) and access the files directly on your local hard drive instead.

> *Note*: Because a number of prerequisites need to be installed and because of hardware limitations on Google Colab, running this notebook takes several minutes. The parser in particular is slow. Don't panic, just give it some slack ;)

In [1]:
# @title 0. Installing requirements (click "show code" for details)
!echo 'installing text extraction scripts (GhostScript)' 1>&2
!echo '================================================' 1>&2
!if apt-get install -y ghostscript >& tmp.log; \
 then echo done 1>&2; \
 else cat tmp.log 1>&2; \
 fi;
!rm tmp.log;
!echo 1>&2;

!echo 'installing parser (spaCy)' 1>&2
!echo '=========================' 1>&2
!if  pip install spacy_conll spacy_stanza >& tmp.log; \
 then echo done 1>&2;\
 else cat tmp.log 1>&2;\
 fi;\
 rm tmp.log
!echo 1>&2

!echo 'installing FINTAN' 1>&2
!echo '=================' 1>&2
!if [ ! -e /fintan ]; then mkdir /fintan; \
 cd /fintan;\
 wget -nc https://github.com/acoli-repo/fintan-backend/releases/download/fintan-backend-release-v1.0.0/fintan.jar;\
 wget -nc https://github.com/acoli-repo/fintan-backend/releases/download/fintan-backend-release-v1.0.0/run.sh;\
 chmod u+x /fintan/run.sh;\
 fi;
!if [ -x /fintan/run.sh ]; \
 then echo ok 1>&2;\
 else echo /fintan/run.sh not executable ! 1>&2;\
 fi;
!echo 1>&2

installing text extraction scripts (GhostScript)
done

installing parser (spaCy)
done

installing FINTAN
--2024-03-29 12:07:59--  https://github.com/acoli-repo/fintan-backend/releases/download/fintan-backend-release-v1.0.0/fintan.jar
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/292584141/2d8d3ce7-825c-44d5-9b77-b1d10bdaacea?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240329%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240329T120800Z&X-Amz-Expires=300&X-Amz-Signature=6526bb6039d992eef221606228a83fb989698f0e25731210f4133a671df87ba6&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=292584141&response-content-disposition=attachment%3B%20filename%3Dfintan.jar&response-content-type=application%2Foctet-stream [following]
--2024-03-29 12:08:00--  htt

In [2]:
# @title 1. Retrieve sample document(s) into `samples/pdf` folder (click "show code" for details).
!echo  building sample corpus 1>&2
!echo '======================' 1>&2

! if [ -e samples/ ]; then rm -rf samples/; fi;
!mkdir -p samples/pdf;
!url2files="https://aclanthology.org/2011.tal-3.10.pdf=sample1.pdf";\
 for url2file in $url2files; do \
  url=`echo $url2file | sed s/'=.*'//`;\
  file=`basename $url2file | sed s/'.*='//`;\
  echo $url '>' $file 1>&2;\
  wget -nc $url -O samples/pdf/$file >&/dev/null; \
 done;
!echo 1>&2

!echo  retrieved PDFs 1>&2
!echo '==============' 1>&2

!find samples/ | grep '\.pdf' 1>&2

building sample corpus
https://aclanthology.org/2011.tal-3.10.pdf > sample1.pdf

retrieved PDFs
samples/pdf/sample1.pdf


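> *Note*: To process your own files with the cell above, add URLs and (optionally) file names to the `url2files` variable, as whitespace-separated `URL=filename` entries. If the `=filename` part is omitted, the last path segment of the URL is used as the file name.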
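
The entry format can be illustrated in Python. This is only a sketch that mirrors what the two `sed` calls in the cell above do; the helper name `parse_entry` is hypothetical and not part of the notebook.

```python
def parse_entry(entry):
    """Split one 'URL=filename' entry into (url, filename).

    Mirrors the shell logic: the URL is everything before the first '=',
    the file name is whatever follows the last '=' in the final path
    segment (or the whole segment if no '=' is present).
    """
    url = entry.split("=", 1)[0]
    filename = entry.rsplit("/", 1)[-1].rsplit("=", 1)[-1]
    return url, filename


# Two entries: one with an explicit file name, one without.
url2files = (
    "https://aclanthology.org/2011.tal-3.10.pdf=sample1.pdf "
    "https://aclanthology.org/2011.tal-3.10.pdf"
)
for entry in url2files.split():
    print(parse_entry(entry))
```

Like the shell version, this assumes that the URL itself contains no `=` character (for example in a query string); such URLs would be cut off at the first `=`.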

In [None]:
# @title 2. Perform text extraction over all PDF files in `samples/pdf` (click "show code" for details).
!echo  perform text extraction 1>&2
!echo '=======================' 1>&2
!if [ ! -e samples/txt ]; then mkdir -p samples/txt; fi;
!for file in samples/pdf/*.pdf; do \
  tgt=samples/txt/`basename $file`.txt;\
  echo -n $file '>' $tgt' ..' 1>&2;\
  ps2ascii $file > $tgt;\
  if [ -s $tgt ]; \
  then echo '. ok' 1>&2;\
  else echo '. failed' 1>&2;\
  fi;\
 done;
!echo 1>&2

!echo  resulting text files 1>&2;
!echo '====================' 1>&2;
!find samples/txt | grep '\.txt' 1>&2


perform text extraction
samples/pdf/sample1.pdf > samples/txt/sample1.pdf.txt ..

In [None]:
# @title 3. Parse all text files using spaCy (click "show code" for details)
# code adapted from https://spacy.io/universe/project/spacy-conll

from spacy_conll import init_parser
import os,re

# Initialise English parser, already including the ConllFormatter as a pipeline component.
# Indicate that we want to get the CoNLL headers in the string output.
# `use_gpu` and `verbose` are specific to stanza. These keywords arguments are passed onto their Pipeline() initialisation
print("initializing parser")
print("===================")
nlp = init_parser("en",
                  "stanza",
                  parser_opts={"use_gpu": True, "verbose": False},
                  include_headers=True)
print("done\n")

txt_dir="./samples/txt"
out_dir="./samples/conll"
if not os.path.exists(out_dir):
  os.makedirs(out_dir)

print("parsing")
print("=======")

for file in os.listdir(txt_dir):
  file=os.path.join(txt_dir,file)
  tgt=os.path.join(out_dir,re.sub(r".*/","",file)+".conllu")
  print(f"{file} > {tgt}", end=" ..")
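  with open(file,"rt",errors="ignore") as infile:
    txt=infile.read()
    with open(tgt,"wt",errors="ignore") as output:
      # strip markup-like residue, then mark sentence and paragraph breaks with <br>
      txt=re.sub(r"<[^>]*>","",txt)
      txt=re.sub(r"([.!?])\s*\n",r"\1<br>",txt)
      txt=re.sub(r"\n([(\"]*[A-Z])",r"<br>\1", txt)
      txt=re.sub(r"\n\s*\n","<br>",txt)
      # rejoin words hyphenated across line breaks
      txt=re.sub(r"([a-z])[\-]\s*\n\s*",r"\1",txt)
      txt=" ".join(txt.split()).strip()
      txt="\n\n".join(txt.split("<br>"))
      # split after sentence-final punctuation followed by a capitalised word
      # (replacement strings must be raw: a plain "\1" is the control character \x01)
      txt=re.sub(r"([aeiouyAEIOUY][^\s]*[!?.][\")]*)\s+([\"(]*[A-Z])",r"\1\n\2",txt)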
      for sent in txt.split("\n"):
        sent=sent.strip()
        if len(sent)==0:
          output.write(sent)
        else:
          parsed=nlp(sent)
          output.write(parsed._.conll_str+"\n")
          output.flush()
    print(". ok")
print()

!echo  resulting CoNLL files 1>&2;
!echo '=====================' 1>&2;
!find samples/ | grep '\.conll' 1>&2



> *Note*: Since June 2022, the [spaCy parser](https://spacy.io/) has come pre-installed with Google Colaboratory. Should this change in the future, see https://spacy.io/ for installation instructions. Alternatively, spaCy can be replaced by any other UD parser.

> *Note*: At present, the parser is remarkably slow, but this is because the input isn't properly pre-processed. With an additional sentence splitting (and merging) step, it should be significantly faster. To be added.
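
A separate pre-processing step of the kind mentioned in the last note could look roughly like the following. This is a minimal sketch, not the notebook's own method: it replaces the substitution chain of the parsing cell with a single lookaround-based split, and the function name `normalize` is hypothetical. The heuristics (rejoin hyphenated line breaks, treat blank lines as paragraph breaks, split after `.`, `!` or `?` when a capital letter follows) are assumptions that will not hold for every text.

```python
import re


def normalize(txt):
    """Return a list of sentence-like strings from raw extracted text."""
    # Rejoin words that were hyphenated across a line break.
    txt = re.sub(r"([a-z])-\s*\n\s*", r"\1", txt)
    # Mark blank lines as paragraph breaks before collapsing whitespace.
    txt = re.sub(r"\n\s*\n", "<br>", txt)
    txt = " ".join(txt.split())
    sentences = []
    for para in txt.split("<br>"):
        # Split at whitespace that follows sentence-final punctuation
        # and precedes an (optionally quoted or bracketed) capital letter.
        parts = re.split(r"(?<=[.!?])\s+(?=[\"(]*[A-Z])", para)
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


raw = "This is an exam-\nple text. It has two\nsentences.\n\nNew paragraph here."
for sentence in normalize(raw):
    print(sentence)
```

Each returned string can then be passed to the parser on its own, so that no call has to process a whole page of running text.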

In [None]:
# @title 4. Generating RDF Data

!if [ ! -e samples/ttl ]; then mkdir samples/ttl; fi

!echo 'Configure FINTAN for CoNLL-RDF' 1>&2
!echo '==============================' 1>&2

import json
ttl_dir="samples/ttl"
for file in os.listdir(out_dir):
  tgt=os.path.join(ttl_dir,file+".ttl")
  conf=os.path.join(".","conf_"+file+".ttl.json")
  file=os.path.join(out_dir,file)
  print(f"{file} > {conf}",end=" ..")
  with open(conf,"wt") as output:
    json.dump(
      {"input" : file, "output" : tgt, "pipeline" : [
        { "class" : "CoNLLStreamExtractor",
          "baseURI" : file+"#", "columns" : ["ID", "WORD", "LEMMA", "UPOS", "POS", "FEAT", "HEAD", "EDGE", "DEPS", "MISC"] },
        { "class" : "CoNLLRDFFormatter",
          "modules" : [ {"mode":"RDF", "columns": ["ID", "WORD", "LEMMA", "UPOS", "POS", "FEAT", "HEAD", "EDGE", "DEPS", "MISC"]} ]
        } ] },
      output)
  print(". ok")
!echo 1>&2

!echo 'Convert to CoNLL-RDF' 1>&2
!echo '====================' 1>&2
!for conf in conf*ttl.json; do \
  echo $conf 1>&2;\
  /fintan/run.sh -c $conf 2>/dev/null;\
  rm $conf;\
 done;
!echo 1>&2

!echo 'Resulting RDF/TTL files' 1>&2
!echo '=======================' 1>&2
!find samples/ttl | grep '\.ttl' 1>&2
!echo 1>&2





> *Note*: FINTAN is not actually meant to convert block data. Rather, it is optimized for on-the-fly stream processing over large data quantities, and specifically, to enable data transformation and enrichment with SPARQL without loading the entire dataset into memory. In general, this will be significantly faster than SPARQL querying/updates over a database. Try the [command-line interface](https://github.com/acoli-repo/conll-rdf) to unleash its full potential. A much more effective implementation would be to feed spaCy output directly into FINTAN.
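
To illustrate the idea of stream processing (as opposed to loading a complete file), the sketch below reads CoNLL-U data one sentence at a time with a Python generator. It is not FINTAN's implementation and does not use its API; the names `iter_sentences` and `token_rows` are hypothetical, and the sample data contains only three columns for brevity.

```python
import io


def iter_sentences(lines):
    """Yield one sentence at a time as a list of lines.

    Sentences in CoNLL-U are separated by blank lines, so only the
    current sentence is ever held in memory.
    """
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def token_rows(block):
    """Split the token lines of a sentence into columns, skipping comments."""
    return [line.split("\t") for line in block if not line.startswith("#")]


# Small in-memory sample; a real run would pass an open file handle instead.
sample = io.StringIO(
    "# text = Hi there\n1\tHi\thi\n2\tthere\tthere\n\n"
    "# text = Bye\n1\tBye\tbye\n"
)
for sentence in iter_sentences(sample):
    print(len(token_rows(sentence)), "tokens")
```

Because the generator consumes any iterable of lines, it works equally well on a file handle or on parser output that is produced on the fly.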

> *Note*: In addition to parsing CoNLL data, CoNLL-RDF can also generate different CoNLL formats. In fact, it is designed to complement standard NLP pipelines (using CoNLL or similar exchange formats) with SPARQL rewriting rules. In the subsequent [FINTAN platform](https://github.com/acoli-repo/fintan), the range of input and output formats has been significantly extended.

In [None]:
# @title 5. Download the resulting annotations in a zip archive
from google.colab import files
import os
import shutil

shutil.make_archive("sample","zip",ttl_dir) # arguments: archive name without extension, archive format, directory to archive
files.download("sample.zip")
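
The download cell above depends on `google.colab`, which is not available in a local Jupyter installation. One possible way to make the step work in both environments is sketched below; the function name `deliver` is hypothetical, and the fallback simply reports where the archive was written.

```python
import os
import shutil

try:
    # Only importable when running on Google Colaboratory.
    from google.colab import files as colab_files
except ImportError:
    colab_files = None


def deliver(src_dir, name="sample"):
    """Zip src_dir and either trigger a Colab download or report the local path."""
    archive = os.path.abspath(shutil.make_archive(name, "zip", src_dir))
    if colab_files is not None:
        colab_files.download(archive)
    else:
        print(f"archive written to {archive}")
    return archive
```

With the directory layout used in this notebook, the call would be `deliver("samples/ttl")`.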