A parsed version of the 7-authors corpus from http://www.cic.ipn.mx/~sidorov/
BoothTarkington
CharlesDickens
FrederickMarryat
GeorgeDonald
GeorgeVaizey
LouisTracy
MarkTwain
logs
README

INTRO:
    Thanks to Dr. Grigori Sidorov et al. [1] for sharing their evaluation
    corpus of 7 authors.

    This repository provides a parsed version of that corpus. Parsing is
    computationally expensive, so the output is shared here to save others
    from repeating it.

    For inquiries email m [ta] khonji [tod] org.

    [1] http://www.cic.ipn.mx/~sidorov/


PARSER:
    stanford-parser-full-2014-10-31


JAVA:
    oracle-jre-bin-1.8.0.25


COMMANDS:
    'java' was executed with -mx14000m, which sets the maximum Java heap size
    to 14000 MB, because some books contain very long sentences. Below is the
    modified version of Stanford's lexparser.sh script that was used to parse
    the books.

    #!/usr/bin/env bash
    #
    # Runs the English PCFG parser on one or more files, printing trees only
    
    if [ ! $# -ge 1 ]; then
      echo Usage: `basename $0` 'file(s)'
      echo
      exit
    fi
    
    scriptdir=`dirname $0`
    
    java -mx14000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
     -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*
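
    The script above accepts one or more files per call. To parse a whole
    directory of books one at a time, a wrapper along the following lines
    could be used. This is a minimal sketch, not the procedure used to build
    this repository: the function name parse_corpus, the .txt extension, the
    .parsed and .log suffixes, and the default ./lexparser.sh path are all
    assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: parse every *.txt file under a directory, one output per book.
# The .txt extension, the .parsed/.log suffixes, and the default
# ./lexparser.sh path are assumptions, not names taken from this repository.

# The body runs in a subshell ( ... ), so its variables do not leak out.
parse_corpus() (
  corpus_dir=$1
  parser_cmd=${2:-./lexparser.sh}

  find "$corpus_dir" -type f -name '*.txt' | while IFS= read -r book; do
    # A non-empty .parsed file means the book is done; skip it on reruns.
    if [ -s "$book.parsed" ]; then
      continue
    fi
    echo "parsing $book" >&2
    # Write to a temporary name first so a failed run leaves no partial
    # output. stdin is redirected so the parser cannot consume the file list.
    if "$parser_cmd" "$book" > "$book.parsed.tmp" 2> "$book.log" < /dev/null; then
      mv "$book.parsed.tmp" "$book.parsed"
    else
      echo "failed: $book (see $book.log)" >&2
      rm -f "$book.parsed.tmp"
    fi
  done
)
```

    For example, "parse_corpus path/to/raw-corpus ./lexparser.sh" would leave
    a .parsed file next to each book. Parsing one book per call keeps a
    failure on one very long sentence from discarding the output of the other
    books, and the skip check lets an interrupted run resume where it stopped.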