# spaCy

All-in-one package for performing basic and advanced natural language processing, with special optimization "quickstart" features for certain languages. See [spaCy Language Support](https://spacy.io/usage/models#languages) for details.

## Features

* **Tokenization**: Segmenting text into individual "tokens", that is, words, punctuation marks, numbers, etc.

* **Part-of-speech (POS) Tagging**: Assigning grammatical word types to tokens, like "verb" or "noun" (using [Universal POS Tags](https://universaldependencies.org/u/pos/) with `.pos_` and [Penn Part of Speech Tags](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html) with `.tag_`).

* **Dependency Parsing**: Assigning syntactic dependency labels, describing the relations between individual tokens, as in subject, object, dependent clause, etc.

* **Lemmatization**: Determining the base form, or *lemma*, of a word. The lemma of "went" is "go", and the lemma of "trees" is "tree".

* **Named Entity Recognition (NER)**: Labeling "real-world" objects with names, such as persons, companies or locations.

## Download and Load Resources

In [None]:
# Download language package.
%run -m spacy download en_core_web_sm

In [None]:
# Import and load resources.
import spacy
nlp = spacy.load("en_core_web_sm")

## Some Linguistic Basics

Calling `nlp()` on a text runs a [pipeline](https://spacy.io/usage/processing-pipelines) that first tokenizes the text, then runs a series of processors, which can be customized. The default English pipeline we loaded above includes a tagger, a lemmatizer, a parser and an entity recognizer.

In [None]:
# Sentence written by ChatGPT after I asked it to write a sentence with all (16) parts of speech. 
# It repeatedly failed.

sentence = "Wow! Oh no, I forgot to buy ten oranges and seven apples for the party tomorrow, but I promise I'll get them soon."

# The nlp() function initiates a pipeline.
doc = nlp(sentence)
for token in doc:
    print(token, token.pos_, token.tag_)
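To see which processors a pipeline contains, and how one can be added, here is a minimal sketch. It assumes a blank English pipeline created with `spacy.blank("en")` (tokenizer only, no download needed) instead of the `en_core_web_sm` model loaded above. The names `blank_nlp` and `demo_doc` and the sample text are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), not en_core_web_sm.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)  # no processors have been added yet

# Add a rule-based sentence splitter as the only processor.
blank_nlp.add_pipe("sentencizer")
print(blank_nlp.pipe_names)

demo_doc = blank_nlp("Hello world. This is spaCy.")
tokens = [token.text for token in demo_doc]
sentences = [sent.text for sent in demo_doc.sents]
print(tokens)
print(sentences)
```

The same `pipe_names` attribute can be checked on the loaded `nlp` object to list the processors of the full English pipeline.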

The `Doc.count_by` method counts tokens by a given attribute, but it returns integer IDs rather than readable labels, so it takes a bit of coaxing to reveal helpful results.

In [None]:
# Import only the attribute ID we need instead of everything in spacy.attrs.
from spacy.attrs import POS
pos_counts = doc.count_by(POS)
print(pos_counts)

In [None]:
for key, value in pos_counts.items():
    human_readable_tag = doc.vocab[key].text
    print(human_readable_tag, value)
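An alternative that avoids the ID lookup is to count the string attribute `token.pos_` directly with `collections.Counter`. The sketch below is an illustration, not spaCy's own counting method: it assumes a small hand-tagged `Doc` (the names `tagged_doc`, `words` and `pos_tags` are hypothetical) in place of the model's output, so it runs without a downloaded model.

```python
from collections import Counter

from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: a tiny hand-tagged Doc stands in for the model's output.
words = ["I", "bought", "ten", "oranges", "and", "seven", "apples"]
pos_tags = ["PRON", "VERB", "NUM", "NOUN", "CCONJ", "NUM", "NOUN"]
tagged_doc = Doc(Vocab(), words=words, pos=pos_tags)

# token.pos_ is already a string, so no ID lookup is needed.
readable_counts = Counter(token.pos_ for token in tagged_doc)
for tag, n in readable_counts.most_common():
    print(tag, n)
```

On the sentence processed above, `Counter(token.pos_ for token in doc)` would give the same kind of tally.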

In [None]:
for token in doc:
    print(token.text, "-->", token.lemma_)

Sentence diagrams

In [None]:
from spacy import displacy

sentence = "Our natural resources are developed by an earnest culture of the arts and peace."
doc2 = nlp(sentence)

# Render the dependency parse of the new sentence.
displacy.render(doc2, style="dep")

# Alternative: a compact diagram with custom colors and font.
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}
# displacy.render(doc2, style="dep", options=options)

## Longer Texts

Processing is lightning-fast on individual sentences, but longer texts take a few moments to analyze, and there is a size limit: by default, spaCy raises an error for texts longer than `nlp.max_length` (1,000,000 characters).

In [None]:
with open("data/KafkaMetamorphosis.txt") as f:
    text = nlp(f.read())
print(text[:150])
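One way to stay under a size limit is to split a long text at paragraph breaks before processing it. The sketch below is plain Python and only one possible approach: `split_into_chunks` is a hypothetical helper, not part of spaCy, and splitting on blank lines assumes the file separates paragraphs that way.

```python
def split_into_chunks(raw_text, max_chars=1_000_000):
    """Group paragraphs into chunks of at most max_chars characters.

    Hypothetical helper, not part of spaCy. A single paragraph longer
    than max_chars is kept whole, so it would still need special handling.
    """
    chunks, current = [], ""
    for paragraph in raw_text.split("\n\n"):
        # Try to extend the current chunk with the next paragraph.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Too long: close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_into_chunks(sample, max_chars=40)
print(chunks)
```

Each chunk can then be passed to `nlp.pipe(chunks)`, which yields one `Doc` per chunk. Because the cuts fall on paragraph breaks, no sentence is split in the middle.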

Sentence-level segmentation.

In [None]:
for sent in text.sents:
    print(sent)
    break

In [None]:
# Print the first ten sentences, numbered from 1.
for count, sent in enumerate(text.sents, start=1):
    print(count, sent.text.strip())
    if count >= 10:
        break

In [None]:
sentence_list = list(text.sents)

In [None]:
len(sentence_list)

In [None]:
sentence_list[387]

## Multi-Token Segments

spaCy doesn't place much emphasis on "bigrams" or "trigrams" as some other text analysis packages do. Instead, it offers "noun chunks", which are single- or multi-word noun phrases derived from the dependency parse.

In [None]:
for chunk in text.noun_chunks:
    print(" -- ".join([chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text]))
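If plain bigrams are still wanted, they can be built by hand from the token sequence. This is a minimal sketch, assuming a blank English pipeline (only the tokenizer is needed) and a made-up sample phrase; `blank_nlp`, `demo_doc` and `bigrams` are illustrative names.

```python
import spacy

# Assumption: a blank English pipeline is enough, since only the
# tokenizer is needed to pair up neighbouring tokens.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("one morning from troubled dreams")

# Pair every token with its right-hand neighbour.
bigrams = [(left.text, right.text) for left, right in zip(demo_doc, demo_doc[1:])]
print(bigrams)
```

Unlike noun chunks, these pairs ignore grammar entirely, so many of them will cross phrase boundaries.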

## Pattern Matching

String together sequences of tokens with a variety of characteristics to highlight interesting linguistic occurrences within the text.

* [available token attributes](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes): A list of all the ways tokens can be matched.
* [match tester](https://demos.explosion.ai/matcher): A user interface for testing out your matches.

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern1 = [
    [
     {"POS": "ADJ", "OP": "?"}, 
     {"POS": "ADJ"}, 
     {"POS": "NOUN"}
    ]
]

pattern2 = [
    [{"LENGTH": {">=": 16}}]
]

pattern3 = [
    [{"POS": "ADJ"}, {"LOWER": {"IN": ["legs", "thorax", "head", "abdomen", "back", "eyes", "mouth", "antennae"]}}]
]

pattern4 = [
    [{"ENT_TYPE": "PERSON"}, {"POS": "VERB"}]
]


matcher.add("Adj-Noun", pattern1)
# matcher.add("Long-Words", pattern2)
# matcher.add("Legs", pattern3)
# matcher.add("Person", pattern4)
matches = matcher(text)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = text[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
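Patterns that rely on `POS` or `ENT_TYPE` need the annotations produced by a trained pipeline, while purely lexical attributes such as `LOWER` and `LENGTH` work on the tokenizer's output alone. The sketch below illustrates the latter with a blank English pipeline and a made-up sentence. The pattern names and the variables `blank_nlp`, `demo_matcher`, `demo_doc` and `found` are assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline; LOWER and LENGTH are lexical
# attributes, so these patterns need no tagger, parser, or NER.
blank_nlp = spacy.blank("en")
demo_matcher = Matcher(blank_nlp.vocab)
demo_matcher.add("BodyPart", [[{"LOWER": {"IN": ["legs", "antennae"]}}]])
demo_matcher.add("LongWord", [[{"LENGTH": {">=": 8}}]])

demo_doc = blank_nlp("The insect waved its antennae and legs.")

# Collect (pattern name, matched text) pairs from both patterns.
found = set()
for match_id, start, end in demo_matcher(demo_doc):
    label = blank_nlp.vocab.strings[match_id]
    found.add((label, demo_doc[start:end].text))
print(sorted(found))
```

A single token can match more than one pattern, which is why "antennae" appears under both names.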


## Named Entity Recognition

In [None]:
for ent in text.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
from spacy import displacy
displacy.render(text[:500], style="ent")
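A tally of entity labels is often a useful summary of what a text contains. The sketch below shows the counting step only. As an assumption for illustration, the entities are set by hand on a blank pipeline so the example runs without a model; the sample sentence, the two labels, and the names `blank_nlp`, `demo_doc` and `label_counts` are made up.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

# Assumption: entities are set by hand on a blank pipeline so the
# example runs without a model; a trained pipeline fills doc.ents itself.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Gregor Samsa woke in Prague")
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),
    Span(demo_doc, 4, 5, label="GPE"),
]

# Count how many entities carry each label.
label_counts = Counter(ent.label_ for ent in demo_doc.ents)
for label, n in label_counts.most_common():
    print(label, n)
```

On the text loaded above, `Counter(ent.label_ for ent in text.ents)` gives the same kind of tally from the model's predictions.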