A convenient command-line interface for applying spaCy and Trankit pipelines. It supports word tokenization and sentence segmentation, either for a given language or with automatic language detection, and can print spaCy parses directly as JSON.
$ pipx install git+https://github.com/mwestera/spacy-wrap

This will make three commands available:
tokenize
sentencize
spacyjson
$ echo "Here's just a short text. For you to parse." | tokenize --info --tree

Or, to process each line from a file separately (and this time using a transformer model, --trf):
$ cat some_dutch_sentences.txt | tokenize --info --trf --lang nl --lines --tree

$ cat texts_in_various_languages.txt | sentencize --trf --lines

Note: In this case, the language is detected separately for each input line.
Or output full sentence parses in json format:
$ cat texts_in_various_languages.txt | sentencize --lines --lang nl --json

Or entire spaCy docs as JSON:
$ cat texts_in_various_languages.txt | spacyjson --lines --lang nl --json > parses.jsonl
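
A file such as parses.jsonl can then be read back one document per line. The sketch below is only an illustration and makes an assumption this README does not confirm: that each line is a JSON object in the layout produced by spaCy's Doc.to_json(), with a "text" string plus "tokens" and "sents" lists carrying "start"/"end" character offsets. The helper names (iter_docs, token_texts, sentence_texts) and the sample line are hypothetical and are not part of spacy-wrap.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_docs(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def _spans(doc: Dict, key: str) -> List[str]:
    """Slice the document text by the character offsets stored under `key`."""
    text = doc["text"]
    return [text[item["start"]:item["end"]] for item in doc.get(key, [])]


def token_texts(doc: Dict) -> List[str]:
    """Token strings, assuming a Doc.to_json()-style "tokens" list."""
    return _spans(doc, "tokens")


def sentence_texts(doc: Dict) -> List[str]:
    """Sentence strings, assuming a Doc.to_json()-style "sents" list."""
    return _spans(doc, "sents")


# Hand-written sample line in the assumed layout (not real tool output).
sample = json.dumps({
    "text": "Hi there. Bye.",
    "sents": [{"start": 0, "end": 9}, {"start": 10, "end": 14}],
    "tokens": [
        {"id": 0, "start": 0, "end": 2},
        {"id": 1, "start": 3, "end": 8},
        {"id": 2, "start": 8, "end": 9},
        {"id": 3, "start": 10, "end": 13},
        {"id": 4, "start": 13, "end": 14},
    ],
})

for doc in iter_docs([sample]):
    print(token_texts(doc))
    print(sentence_texts(doc))
```

If the assumption about the output format holds, passing an open file handle for parses.jsonl to iter_docs in place of the sample list would yield one dictionary per input line.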