# spaCy Token class


In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
import regex as re
import spacy
import pandas as pd

# Language Model

In [3]:
nlp = spacy.load("en_core_web_lg")

# Token attributes

* [Attributes](https://spacy.io/api/token#attributes)

In [4]:
RE_SPACY_ATTR: re.Pattern = re.compile(
    pattern=r"^(DEP|ENT_|IS_|LEMMA|LENGTH|LIKE_|LOWER|MORPH|NORM|ORTH|POS|SENT_START|SPACY|SHAPE|TAG)"
)
# List the attribute IDs exported by spacy.attrs that relate to tokens.
[
    attr for attr in dir(spacy.attrs)
    if RE_SPACY_ATTR.match(attr)
]

['DEP',
 'ENT_ID',
 'ENT_IOB',
 'ENT_KB_ID',
 'ENT_TYPE',
 'IS_ALPHA',
 'IS_ASCII',
 'IS_BRACKET',
 'IS_CURRENCY',
 'IS_DIGIT',
 'IS_LEFT_PUNCT',
 'IS_LOWER',
 'IS_OOV_DEPRECATED',
 'IS_PUNCT',
 'IS_QUOTE',
 'IS_RIGHT_PUNCT',
 'IS_SPACE',
 'IS_STOP',
 'IS_TITLE',
 'IS_UPPER',
 'LEMMA',
 'LENGTH',
 'LIKE_EMAIL',
 'LIKE_NUM',
 'LIKE_URL',
 'LOWER',
 'MORPH',
 'NORM',
 'ORTH',
 'POS',
 'SENT_START',
 'SHAPE',
 'SPACY',
 'TAG']

## ORTH


```token.orth_``` is the same as ```token.text```. In a Matcher rule, the ```ORTH``` key defines a pattern that matches the exact token text.

* ["ORTH" vs "TEXT" in pattern matching](https://github.com/explosion/spacy-course/issues/4)

> The ```ORTH``` is a reference to **orthography** and the ```Token.orth``` (hash ID of the token) and ```Token.orth_``` (string value of the token and the same as ```Token.text```).
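The relationship between ```Token.orth```, ```Token.orth_``` and ```Token.text```, and the use of ```ORTH``` in a Matcher pattern, can be sketched as follows. This is a minimal illustration, not part of the original notebook: it assumes spaCy v3, and the names ```nlp_blank```, ```doc_orth``` and the rule name ```HELLO_WORLD``` are made up for the example. A blank English pipeline is used because only the tokenizer is needed.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough, since ORTH only depends on tokenization.
nlp_blank = spacy.blank("en")
doc_orth = nlp_blank("We start learning C with the famous 'Hello World!'.")

# orth is the hash ID of the token text; orth_ is the string itself.
hello = next(t for t in doc_orth if t.text == "Hello")
same_string = hello.orth_ == hello.text
same_hash = nlp_blank.vocab.strings[hello.orth] == hello.text

# ORTH in a pattern matches the exact, case-sensitive token text.
matcher = Matcher(nlp_blank.vocab)
matcher.add("HELLO_WORLD", [[{"ORTH": "Hello"}, {"ORTH": "World"}]])
matched_spans = [doc_orth[start:end].text for _, start, end in matcher(doc_orth)]

print(same_string, same_hash, matched_spans)
```

Because ```ORTH``` is case-sensitive, a pattern written with ```"hello"``` would not match here; ```LOWER``` is the usual choice for case-insensitive matching.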



## Morph

```MORPH``` refers to the token's [morphology](https://spacy.io/usage/linguistic-features#morphology), that is, its morphological features.

> Inflectional **morphology** is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that **a lemma (root form) is inflected (modified/combined) with one or more morphological features** to create a surface form.
>
> | CONTEXT                                  | SURFACE | LEMMA | POS  | MORPHOLOGICAL FEATURES             |
> |------------------------------------------|---------|-------|------|------------------------------------|
> | I was reading the paper                  | reading | read  | VERB | VerbForm=Ger                       |
> | I don’t watch the news, I read the paper | read    | read  | VERB | VerbForm=Fin, Mood=Ind, Tense=Pres |
> | I read the paper yesterday               | read    | read  | VERB | VerbForm=Fin, Mood=Ind, Tense=Past |

spaCy's morphological features appear to follow the **CoNLL-U Format**.

* [CoNLL-U Format](https://universaldependencies.org/format.html#conll-u-format)

> ```FEATS```: List of morphological features from the [universal feature inventory](https://universaldependencies.org/u/feat/index.html) or from a defined language-specific extension; underscore if not available.
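To make the ```FEATS``` layout concrete, here is a small sketch that parses such a string into a dictionary. It is plain Python and does not use spaCy. The helper name ```parse_feats``` is hypothetical; the sketch assumes the CoNLL-U conventions that features are ```Name=Value``` pairs separated by ```|```, that multiple values of one feature are separated by commas, and that ```_``` means no features.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Parse a CoNLL-U FEATS string into {feature name: [values]}."""
    # An underscore (or an empty string) means no features are available.
    if feats in ("", "_"):
        return {}
    parsed: Dict[str, List[str]] = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        # A feature may carry several comma-separated values.
        parsed[name] = values.split(",")
    return parsed


print(parse_feats("Aspect=Prog|Tense=Pres|VerbForm=Part"))
print(parse_feats("_"))
```

Each value is kept as a list so that single-valued and multi-valued features can be handled the same way.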

* [Morphological Annotation](https://universaldependencies.org/format.html#morphological-annotation) (MUST)
* [universal feature inventory](https://universaldependencies.org/u/feat/index.html)  (MUST)

> | Lexical features* | Inflectional features* |           |
> |-------------------|------------------------|-----------|
> |                   | Nominal*               | Verbal*   |
> | PronType          | Gender                 | VerbForm  |
> | NumType           | Animacy                | Mood      |
> | Poss              | NounClass              | Tense     |
> | Reflex            | Number                 | Aspect    |
> | Foreign           | Case                   | Voice     |
> | Abbr              | Definite               | Evident   |
> | Typo              | Degree                 | Polarity  |
> |                   |                        | Person    |
> |                   |                        | Polite    |
> |                   |                        | Clusivity |



* [PARTS OF SPEECH TAGGING AND DEPENDENCY PARSING USING SPACY | NLP | PART 3](https://ashutoshtripathi.com/2020/04/13/parts-of-speech-tagging-and-dependency-parsing-using-spacy-nlp) (MUST)

### VERB

* [VerbForm](https://universaldependencies.org/u/feat/VerbForm.html) (MUST)

> VerbForm is an inflectional feature of verbs and auxiliaries, however, it is also used as a lexical feature of some adjectives and adverbs.
> * Fin: finite verb (a form of a verb that (a) shows agreement with a subject and (b) is marked for tense)
> * Inf: infinitive
> * Part: participle
> * Ger: gerund **(Using VerbForm=Ger is discouraged and alternatives should be considered)**

* [Mood](https://universaldependencies.org/u/feat/Mood.html)

> Mood is a feature that expresses modality and subclassifies finite verb forms. It is an inflectional feature of auxiliaries and verbs.
> * Ind: indicative
> * Imp: imperative
> * Cnd: conditional

### PRON

* [PronType: pronominal type](https://universaldependencies.org/u/feat/PronType.html)

> * Prs: personal pronoun or determiner
> * Int: interrogative pronoun or determiner
> * Rel: relative pronoun or determiner
> * Dem: demonstrative pronoun or determiner
> * Neg: negative pronoun, determiner or adverb

---
# Example

In [5]:
text = """We start learning C with the famous 'Hello World!'.
These days, any language book will be starting 'Hello World'.
"""

In [6]:
def display_token_attributes(doc, include_punct=False):
    """Generate data frame for visualization of spaCy tokens."""
    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {
                'token': i,  
                'text': t.text, 
                'lemma_': t.lemma_,
                'is_stop': t.is_stop, 
                'is_alpha': t.is_alpha,
                'is_punct': t.is_punct,
                'is_space': t.is_space,
                'pos_': t.pos_, 
                'dep_': t.dep_, 
                'tag_': t.tag_,
                'ent_type_': t.ent_type_, 
                'ent_iob_': t.ent_iob_,
                'sent_start_': t.sent_start,
                'morph': t.morph
            }
            rows.append(row)
    
    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    return df

In [7]:
display_token_attributes(doc=nlp(text), include_punct=True)

Unnamed: 0,text,lemma_,is_stop,is_alpha,is_punct,is_space,pos_,dep_,tag_,ent_type_,ent_iob_,sent_start_,morph
0,We,we,True,True,False,False,PRON,nsubj,PRP,,O,False,"(Case=Nom, Number=Plur, Person=1, PronType=Prs)"
1,start,start,False,True,False,False,VERB,ROOT,VBP,,O,-1,"(Tense=Pres, VerbForm=Fin)"
2,learning,learn,False,True,False,False,VERB,xcomp,VBG,,O,-1,"(Aspect=Prog, Tense=Pres, VerbForm=Part)"
3,C,c,False,True,False,False,NOUN,dobj,NN,,O,-1,(Number=Sing)
4,with,with,True,True,False,False,ADP,prep,IN,,O,-1,()
5,the,the,True,True,False,False,DET,det,DT,,O,-1,"(Definite=Def, PronType=Art)"
6,famous,famous,False,True,False,False,ADJ,amod,JJ,,O,-1,(Degree=Pos)
7,',',False,False,True,False,PUNCT,punct,``,,O,-1,"(PunctSide=Ini, PunctType=Quot)"
8,Hello,hello,False,True,False,False,INTJ,intj,UH,EVENT,B,-1,()
9,World,World,False,True,False,False,PROPN,pobj,NNP,EVENT,I,-1,(Number=Sing)


## Morphology feature

The values of a single morphological feature can be retrieved with ```Token.morph.get(feature)```, which returns them as a list.

* [Morphology.get](https://spacy.io/api/morphology#get)

In [8]:
doc = nlp("I was reading the paper.")
token = doc[2]  # 'reading'
print(token.morph)
print(token.morph.get("VerbForm"))

Aspect=Prog|Tense=Pres|VerbForm=Part
['Part']


In [9]:
dir(token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le