# spaCy trained Transformer pipeline - Japanese

spaCy ships trained pipelines for many languages.
This notebook uses the Japanese transformer pipeline [`ja_core_news_trf`](https://spacy.io/models/ja#ja_core_news_trf).

The pipeline contains the following components: `transformer`, `morphologizer`, `parser`, `attribute_ruler`, `ner`.

In [23]:
import spacy
from spacy import displacy

# Load pipeline
nlp = spacy.load("ja_core_news_trf")
sentence = "株式会社オープンAIは東京都渋谷区に本社を構えています。"
# Calling nlp on a string runs the whole pipeline and returns a Doc
doc = nlp(sentence)


Loading the pipeline and calling the `nlp` object on a string runs the text through every loaded component in order.
Trained pipelines typically include a tagger, a lemmatizer, a parser and an entity recognizer. Each component returns the processed `Doc`, which is then passed on to the next component. See the [pipelines section](https://spacy.io/usage/spacy-101#pipelines) of the spaCy documentation for details.
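
The hand-off between components can be pictured with a toy pipeline in plain Python. This is only a conceptual sketch, not spaCy's implementation: `ToyDoc`, `char_splitter`, `char_counter` and `run_pipeline` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document object that collects annotations."""
    text: str
    annotations: dict = field(default_factory=dict)


def char_splitter(doc: ToyDoc) -> ToyDoc:
    # First component: store a naive character-level segmentation.
    doc.annotations["chars"] = list(doc.text)
    return doc


def char_counter(doc: ToyDoc) -> ToyDoc:
    # Second component: builds on what the previous component stored.
    doc.annotations["n_chars"] = len(doc.annotations["chars"])
    return doc


def run_pipeline(text: str, components) -> ToyDoc:
    doc = ToyDoc(text)
    for component in components:
        # Each component returns the doc, which is fed to the next one.
        doc = component(doc)
    return doc


toy = run_pipeline("本社を構える", [char_splitter, char_counter])
print(toy.annotations["n_chars"])
```

The order matters here in the same way it does in a real pipeline: `char_counter` would fail if it ran before `char_splitter` had stored its result.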

The resulting `doc` holds all annotations together with the original text.

Each `Token` in the `Doc` carries many [attributes](https://spacy.io/api/token#attributes), for example:

In [19]:
for token in doc:
    # See https://spacy.io/api/token for the meaning of each attribute
    print(token.text, f"{token.ent_iob_}-{token.ent_type_}", token.tag_, token.pos_, token.dep_, token.morph)

株式 B-ORG 名詞-普通名詞-一般 NOUN compound Reading=カブシキ
会社 I-ORG 名詞-普通名詞-一般 NOUN compound Reading=ガイシャ
オープン I-ORG 名詞-普通名詞-サ変形状詞可能 NOUN compound Reading=オープン
AI I-ORG 名詞-普通名詞-一般 NOUN nsubj Reading=エーアイ
は O- 助詞-係助詞 ADP case Reading=ハ
東京 B-GPE 名詞-固有名詞-地名-一般 PROPN compound Reading=トウキョウ
都 I-GPE 名詞-普通名詞-一般 NOUN nmod Reading=ト
渋谷 I-GPE 名詞-固有名詞-地名-一般 PROPN compound Reading=シブヤ
区 I-GPE 接尾辞-名詞的-一般 NOUN obl Reading=ク
に O- 助詞-格助詞 ADP case Reading=ニ
本社 O- 名詞-普通名詞-一般 NOUN obj Reading=ホンシャ
を O- 助詞-格助詞 ADP case Reading=ヲ
構え O- 動詞-一般 VERB ROOT Inflection=下一段-ア行;連用形-一般|Reading=カマエ
て O- 助詞-接続助詞 SCONJ mark Reading=テ
い O- 動詞-非自立可能 VERB fixed Inflection=上一段-ア行;連用形-一般|Reading=イ
ます O- 助動詞 AUX aux Inflection=助動詞-マス;終止形-一般|Reading=マス
。 O- 補助記号-句点 PUNCT punct Reading=。
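
The last column above packs the morphological features into a `Key=Value` string, with `|` between features. As a minimal sketch of how that string decomposes, the hypothetical helper `parse_morph` below splits it in plain Python. Inside spaCy, `token.morph.to_dict()` should give a comparable dictionary without any manual parsing.

```python
def parse_morph(morph: str) -> dict:
    """Split a 'Key=Value|Key=Value' feature string into a dict (illustrative helper)."""
    features = {}
    for pair in morph.split("|"):
        if not pair:
            continue  # an empty feature string yields an empty dict
        # Split on the first "=" only, so values may contain ";" or "-".
        key, _, value = pair.partition("=")
        features[key] = value
    return features


# Feature string copied from the output row for 構え.
feats = parse_morph("Inflection=下一段-ア行;連用形-一般|Reading=カマエ")
print(feats["Reading"])
print(feats["Inflection"])
```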


## Tokenization

spaCy pipelines first tokenize the input text, i.e. split the string into a sequence of tokens. The tokens can be listed with the code below:

In [20]:
for token in doc:
    print(token.text)

株式
会社
オープン
AI
は
東京
都
渋谷
区
に
本社
を
構え
て
い
ます
。
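
Because the sentence contains no whitespace, the tokens listed above should tile the original string exactly. The sketch below checks this in plain Python and derives each token's character offset; the token list is copied from the output above, and the offsets are expected to correspond to spaCy's `token.idx` for this sentence (an assumption that holds only when no whitespace separates tokens).

```python
sentence = "株式会社オープンAIは東京都渋谷区に本社を構えています。"
tokens = ["株式", "会社", "オープン", "AI", "は", "東京", "都", "渋谷", "区",
          "に", "本社", "を", "構え", "て", "い", "ます", "。"]

# Joining the tokens without separators reproduces the input.
reconstructed = "".join(tokens)
print(reconstructed == sentence)

# Running character offset of each token within the sentence.
offsets = []
position = 0
for tok in tokens:
    offsets.append((tok, position))
    position += len(tok)

for tok, start in offsets:
    print(start, tok)
```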


## Part of speech tagging

After tokenization, spaCy can assign a part-of-speech (POS) tag to each token, provided the tagging component is not disabled.
The tags are predicted by the trained (statistical) pipeline.

The spaCy documentation describes each field as follows:

- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple UPOS part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalization, punctuation, digits.
- is alpha: Is the token an alpha character?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?

In [22]:
print("TEXT	LEMMA	POS	TAG	DEP	SHAPE	ALPHA	STOP")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

TEXT	LEMMA	POS	TAG	DEP	SHAPE	ALPHA	STOP
株式 株式 NOUN 名詞-普通名詞-一般 compound xx True False
会社 会社 NOUN 名詞-普通名詞-一般 compound xx True False
オープン オープン NOUN 名詞-普通名詞-サ変形状詞可能 compound xxxx True False
AI AI NOUN 名詞-普通名詞-一般 nsubj XX True False
は は ADP 助詞-係助詞 case x True True
東京 東京 PROPN 名詞-固有名詞-地名-一般 compound xx True False
都 都 NOUN 名詞-普通名詞-一般 nmod x True False
渋谷 渋谷 PROPN 名詞-固有名詞-地名-一般 compound xx True False
区 区 NOUN 接尾辞-名詞的-一般 obl x True False
に に ADP 助詞-格助詞 case x True True
本社 本社 NOUN 名詞-普通名詞-一般 obj xx True False
を を ADP 助詞-格助詞 case x True True
構え 構える VERB 動詞-一般 ROOT xx True False
て て SCONJ 助詞-接続助詞 mark x True True
い いる VERB 動詞-非自立可能 fixed x True True
ます ます AUX 助動詞 aux xx True True
。 。 PUNCT 補助記号-句点 punct 。 False False
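
The SHAPE column can look surprising for Japanese text. The sketch below is a rough approximation of how such a shape could be derived, written from the observed output rather than taken from spaCy's source: letters map to `x` or `X`, digits to `d`, other characters stay as they are, and runs of the same symbol are capped at four. The function name `approx_shape` and the cap are assumptions for illustration.

```python
def approx_shape(text: str, max_run: int = 4) -> str:
    """Approximate a word shape; an illustrative sketch, not spaCy's own code."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            # Kanji and kana are alphabetic but have no case, so they map to "x".
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["株式", "オープン", "AI", "。"]:
    print(word, approx_shape(word))
```

This explains why 株式 is reported as `xx`: the characters count as alphabetic but not as uppercase.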


We can visualize the dependency relations between tokens with spaCy's `displacy`.

In [26]:
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}

displacy.render(doc, style="dep", options=options)

## NER

Named-entity recognition (NER) assigns predefined classes, e.g. organization, date, etc., to subsequences of the input sequence.

spaCy's trained models (statistical or transformer-based) come with basic NER based on the dataset they were trained on. The default NER may therefore not perform well in every situation and may need further fine-tuning.

Named entities are stored in the `ents` property of `doc`:

In [29]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

株式会社オープンAI 0 10 ORG
東京都渋谷区 11 17 GPE


Additionally, you can access the entity annotations on each token:

In [30]:
for token in doc:
    print(token.text, f"{token.ent_iob_}-{token.ent_type_}")

株式 B-ORG
会社 I-ORG
オープン I-ORG
AI I-ORG
は O-
東京 B-GPE
都 I-GPE
渋谷 I-GPE
区 I-GPE
に O-
本社 O-
を O-
構え O-
て O-
い O-
ます O-
。 O-
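
The per-token IOB tags and the entity spans are two views of the same annotation. As a minimal sketch, the hypothetical helper `iob_to_spans` below merges the tags printed above back into spans in plain Python. The tagged list is copied from that output, and the character offsets assume tokens are not separated by whitespace, which holds for this sentence.

```python
# (text, IOB tag, entity type) triples copied from the output above.
tagged = [
    ("株式", "B", "ORG"), ("会社", "I", "ORG"), ("オープン", "I", "ORG"),
    ("AI", "I", "ORG"), ("は", "O", ""), ("東京", "B", "GPE"),
    ("都", "I", "GPE"), ("渋谷", "I", "GPE"), ("区", "I", "GPE"),
    ("に", "O", ""), ("本社", "O", ""), ("を", "O", ""), ("構え", "O", ""),
    ("て", "O", ""), ("い", "O", ""), ("ます", "O", ""), ("。", "O", ""),
]


def iob_to_spans(tagged):
    """Merge IOB-tagged tokens into (text, start_char, end_char, label) tuples."""
    spans = []
    current = None  # entity being built, or None outside an entity
    position = 0    # character offset of the current token (no whitespace assumed)
    for text, iob, label in tagged:
        # "B" always opens an entity; a stray "I" with nothing open does too.
        starts_new = iob == "B" or (iob == "I" and current is None)
        if (iob == "O" or starts_new) and current is not None:
            # The open entity ends right before this token.
            spans.append((current["text"], current["start"], position, current["label"]))
            current = None
        if starts_new:
            current = {"text": text, "start": position, "label": label}
        elif iob == "I":
            current["text"] += text
        position += len(text)
    if current is not None:
        # Close an entity that runs to the end of the sentence.
        spans.append((current["text"], current["start"], position, current["label"]))
    return spans


for span in iob_to_spans(tagged):
    print(*span)
```

For this sentence the result lines up with the `doc.ents` output shown earlier.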


Visualize NER with displacy:

In [28]:
# The dep-style options (compact, bg, color, font) do not apply to the entity visualizer
displacy.render(doc, style="ent")