# The `py-processors` module

## Features

- Sentence segmentation

- Tokenization

- **Named Entity Recognition**

- **Syntactic dependency parsing**

- PoS tagging

- Lemmatization

- Additional tools for the bio domain

- Sentiment analysis

- Rule-based information extraction

- $\dots$

## More info
- http://py-processors.readthedocs.io/en/latest/
- https://github.com/myedibleenso/py-processors
- A bridge to [`processors`](https://github.com/clulab/processors) via [`processors-server`](https://github.com/myedibleenso/processors-server)

# Why not `NLTK`?

`NLTK` is a great educational tool, but it's missing a few things$\dots$

- No **Named Entity Recognition**
- No **syntactic dependency parsing**
- No rule-based information extraction
- No bio-specific stuff

## Getting started with `py-processors`

In [3]:
from processors import *
API = ProcessorsAPI(port=8886, keep_alive=True)
# NOTE: if you use keep_alive=True, you'll want to run
# API.stop_server() when you're done with everything

Using default
Connection with server established!
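
As the note in the cell above says, a server started with `keep_alive=True` keeps running until `stop_server()` is called. One way to make that cleanup hard to forget is a small context manager. This is only a sketch: `running_server` and `FakeAPI` are hypothetical names that are not part of `py-processors`, and the helper assumes nothing beyond the object having a `stop_server()` method.

```python
from contextlib import contextmanager


@contextmanager
def running_server(api):
    """Yield `api`, then call its stop_server() even if the body raises.

    Hypothetical helper, not part of py-processors.
    """
    try:
        yield api
    finally:
        api.stop_server()


class FakeAPI:
    """Stand-in used only to show the helper without a real server."""

    def __init__(self):
        self.stopped = False

    def stop_server(self):
        self.stopped = True


fake = FakeAPI()
with running_server(fake) as api:
    pass  # annotate text here

print(fake.stopped)
```

With the real API, the same pattern would presumably read `with running_server(ProcessorsAPI(port=8886, keep_alive=True)) as API: ...`.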


In [4]:
text = "Hello.  My name is Inigo Montoya.  You killed my father.  Prepare to die."
doc = API.fastnlp.annotate(text)

# What's produced by `.annotate`?

<p align="center">
    <img src="figs/annotate-example.png">
</p>

In [5]:
# How many sentences?
doc.size

4

In [6]:
sentence = doc.sentences[1]
print("WORDS:\t{}\n".format(sentence.words))
print("TAGS:\t{}\n".format(sentence.tags))
print("LEMMAS:\t{}\n".format(sentence.lemmas))
print("NAMED ENTITIES:\t{}\n".format(sentence.nes))
print("SYNTACTIC DEPENDENCIES:")
for edge in sentence.dependencies.edges:
    print(edge)

WORDS:	['My', 'name', 'is', 'Inigo', 'Montoya', '.']

TAGS:	['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.']

LEMMAS:	['my', 'name', 'be', 'Inigo', 'Montoya', '.']

NAMED ENTITIES:	defaultdict(<class 'list'>, {'PERSON': ['Inigo Montoya']})

SYNTACTIC DEPENDENCIES:
{'destination': 2, 'source': 4, 'relation': 'cop'}
{'destination': 3, 'source': 4, 'relation': 'nn'}
{'destination': 5, 'source': 4, 'relation': 'punct'}
{'destination': 1, 'source': 4, 'relation': 'nsubj'}
{'destination': 0, 'source': 1, 'relation': 'poss'}
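
Each edge points from a head (`source`) to a dependent (`destination`), so the root of the sentence is the one token that never shows up as a destination. The sketch below works on plain dicts copied from the output above; the real edge objects may not be dicts, and `find_roots` is a hypothetical helper, not a `py-processors` function.

```python
words = ['My', 'name', 'is', 'Inigo', 'Montoya', '.']

# Edges copied from the printed output, as plain dicts (an assumption).
edges = [
    {'destination': 2, 'source': 4, 'relation': 'cop'},
    {'destination': 3, 'source': 4, 'relation': 'nn'},
    {'destination': 5, 'source': 4, 'relation': 'punct'},
    {'destination': 1, 'source': 4, 'relation': 'nsubj'},
    {'destination': 0, 'source': 1, 'relation': 'poss'},
]


def find_roots(edges, n_tokens):
    """Return the indices of tokens that are never a dependent."""
    dependents = {e['destination'] for e in edges}
    return [i for i in range(n_tokens) if i not in dependents]


roots = find_roots(edges, len(words))
print([words[i] for i in roots])  # ['Montoya']
```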


# Easily generate syntactic features

In [7]:
# an edge is head_relation_dependent (outgoing_relation_incoming)
sentence.labeled_dependencies_using("words")
#sentence.labeled_dependencies_using??

['name_POSS_My',
 'Montoya_COP_is',
 'Montoya_NN_Inigo',
 'Montoya_PUNCT_.',
 'Montoya_NSUBJ_name']

In [8]:
sentence.unlabeled_dependencies_using("tags")
#sentence.unlabeled_dependencies_using??

['NN_PRP$', 'NNP_VBZ', 'NNP_NNP', 'NNP_.', 'NNP_NN']
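
To see what the two cells above are doing, the same feature strings can be rebuilt by hand from the edges and any per-token list. This is a sketch of the idea, not the library's implementation: `dependency_features` is a hypothetical function, the edges are copied as plain dicts from the earlier output, and the order of the results follows the edge list rather than the order the library returns.

```python
words = ['My', 'name', 'is', 'Inigo', 'Montoya', '.']
tags = ['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.']

# Edges copied from the printed output, as plain dicts (an assumption).
edges = [
    {'destination': 2, 'source': 4, 'relation': 'cop'},
    {'destination': 3, 'source': 4, 'relation': 'nn'},
    {'destination': 5, 'source': 4, 'relation': 'punct'},
    {'destination': 1, 'source': 4, 'relation': 'nsubj'},
    {'destination': 0, 'source': 1, 'relation': 'poss'},
]


def dependency_features(edges, tokens, labeled=True):
    """Build head_RELATION_dependent (or head_dependent) strings."""
    features = []
    for e in edges:
        head = tokens[e['source']]
        dependent = tokens[e['destination']]
        if labeled:
            parts = [head, e['relation'].upper(), dependent]
        else:
            parts = [head, dependent]
        features.append("_".join(parts))
    return features


labeled_words = dependency_features(edges, words, labeled=True)
unlabeled_tags = dependency_features(edges, tags, labeled=False)
print(labeled_words)
print(unlabeled_tags)
```

Swapping `words` for `tags` (or `lemmas`) is all it takes to back off to a coarser representation of the same tree.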

In [9]:
deps = sentence.dependencies
# what outgoing edges exist for token 1 (index starts at 0)?
print(deps.outgoing[1])

[(0, 'poss')]
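
A lookup like `deps.outgoing[1]` is just an adjacency map keyed by token index. The sketch below builds one from the edge list; `build_adjacency` is a hypothetical helper, the edges are plain dicts copied from the earlier output, and the `incoming` map is an assumption added by symmetry rather than something shown above.

```python
from collections import defaultdict

# Edges copied from the printed output, as plain dicts (an assumption).
edges = [
    {'destination': 2, 'source': 4, 'relation': 'cop'},
    {'destination': 3, 'source': 4, 'relation': 'nn'},
    {'destination': 5, 'source': 4, 'relation': 'punct'},
    {'destination': 1, 'source': 4, 'relation': 'nsubj'},
    {'destination': 0, 'source': 1, 'relation': 'poss'},
]


def build_adjacency(edges):
    """Map each token index to its (neighbor, relation) pairs."""
    outgoing = defaultdict(list)  # head -> [(dependent, relation), ...]
    incoming = defaultdict(list)  # dependent -> [(head, relation), ...]
    for e in edges:
        outgoing[e['source']].append((e['destination'], e['relation']))
        incoming[e['destination']].append((e['source'], e['relation']))
    return outgoing, incoming


outgoing, incoming = build_adjacency(edges)
print(outgoing[1])  # [(0, 'poss')]
print(incoming[1])  # [(4, 'nsubj')]
```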


# Serialize to `json`

In [10]:
json_string = doc.to_JSON()

# Load from `json`

In [11]:
import json  # needed for json.loads below
doc2 = Document.load_from_JSON(json.loads(json_string))
# customized equality!
doc == doc2

True
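
The "customized equality" above means two documents compare equal when their contents match, even though they are different objects. Here is a minimal sketch of how such a round trip can work, using a toy class. `MiniDoc` is hypothetical and far simpler than the real `Document`; only the method names `to_JSON` and `load_from_JSON` are borrowed from the cells above.

```python
import json


class MiniDoc:
    """Toy stand-in for a document: raw text plus tokenized sentences."""

    def __init__(self, text, sentences):
        self.text = text
        self.sentences = sentences  # list of lists of words

    def to_JSON(self):
        """Serialize to a JSON string with a stable key order."""
        payload = {"text": self.text, "sentences": self.sentences}
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def load_from_JSON(cls, json_dict):
        """Rebuild a MiniDoc from an already-parsed JSON dict."""
        return cls(json_dict["text"], json_dict["sentences"])

    def __eq__(self, other):
        # Equality by content, not identity.
        if not isinstance(other, MiniDoc):
            return NotImplemented
        return self.to_JSON() == other.to_JSON()


mini = MiniDoc("Prepare to die.", [["Prepare", "to", "die", "."]])
mini2 = MiniDoc.load_from_JSON(json.loads(mini.to_JSON()))
print(mini == mini2)  # True
print(mini is mini2)  # False
```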

# Let's look at some examples $\dots$