# Challenges of Named Entity Recognition in spaCy

This notebook describes the challenges of spaCy's built-in named entity recognition model and its sensitivity to text structure.

## Setup
### Installing the package

In [2]:
!pip install -q spacy==3.1.2

### Downloading the model

In [3]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.1.0/en_core_web_lg-3.1.0-py3-none-any.whl (777.1 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


### Imports

In [4]:
import spacy

### Loading the model

In [5]:
nlp = spacy.load("en_core_web_lg")

## Recognizing named entities

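Let's run named entity recognition with the spaCy model. It takes little code: pass the text to the model to create a `Doc` object, then access its `ents` property. This gives you each named entity together with its type.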

In [6]:
mytext = """SAN FRANCISCO — Shortly after Apple used a new tax law last year to bring back most of the $252 billion it had held abroad, the company said it would buy back $100 billion of its stock.

On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock.

“Our first priority is always looking after the business and making sure we continue to grow and invest,” Luca Maestri, Apple’s finance chief, said in an interview. “If there is excess cash, then obviously we want to return it to investors.”

Apple’s record buybacks should be welcome news to shareholders, as the stock price is likely to climb. But the buybacks could also expose the company to more criticism that the tax cuts it received have mostly benefited investors and executives.
"""
doc = nlp(mytext)
for ent in doc.ents:
    print(ent.text, "\t", ent.label_)

SAN FRANCISCO 	 GPE
Apple 	 ORG
last year 	 DATE
$252 billion 	 MONEY
$100 billion 	 MONEY
Tuesday 	 DATE
Apple 	 ORG
$75 billion 	 MONEY
first 	 ORDINAL
Luca Maestri 	 PERSON
Apple 	 ORG
Apple 	 ORG


Next, let's extract the sentences by accessing the `sents` property. A human reader would find six sentences in this text.

In [8]:
for sent in doc.sents:
    print(sent.text)
    print("***End of sent****")
print("Total sentences: ", len(list(doc.sents)))

SAN FRANCISCO — Shortly after Apple used a new tax law last year to bring back most of the $252 billion it had held abroad, the company said it would buy back $100 billion of its stock.
***End of sent****


On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock.
***End of sent****



***End of sent****
“Our first priority is always looking after the business and making sure we continue to grow and invest,” Luca Maestri, Apple’s finance chief, said in an interview.
***End of sent****
“If there is excess cash, then obviously we want to return it to investors.”
***End of sent****


Apple’s record buybacks should be welcome news to shareholders, as the stock price is likely to climb.
***End of sent****
But the buybacks could also expose the company to more criticism that the tax cuts it received have mostly benefited investors and executives.
***End of sent****


***End of sent****
Total sentences:  8


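Eight sentences were extracted instead of six. The newlines appear to be responsible: judging from the output, two of the eight spans contain nothing but newline characters, and some of the others begin with leading newlines.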
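Whitespace-only spans like the ones in the output above can be dropped before counting. The following is a minimal sketch, not a spaCy feature: `non_empty_sentences` is a hypothetical helper name, and it takes plain strings so that it can be tried without loading a model. The sample strings are made up for illustration.

```python
def non_empty_sentences(sentence_texts):
    """Return stripped sentence texts, dropping whitespace-only ones."""
    return [text.strip() for text in sentence_texts if text.strip()]


# Made-up strings that mimic the shape of the output above:
# some spans are only newlines, others start with newlines.
sample = ["First sentence.", "\n\n", "\n\nSecond sentence.", "\n"]

cleaned = non_empty_sentences(sample)
print(cleaned)
print("Total sentences: ", len(cleaned))
```

With the `doc` created earlier, the same helper could be called as `non_empty_sentences(sent.text for sent in doc.sents)`.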

So what happens when a line break falls in the middle of a named entity? Let's try it.

In [20]:
# Without a line break
doc = nlp('The United States Army is the land service branch of the United States Armed Forces.')
for ent in doc.ents:
    print(ent.text, "\t", ent.label_)

The United States Army 	 GPE
the United States Armed Forces 	 GPE


In [21]:
# With a line break
doc = nlp('The United States\nArmy is the land service branch of the United States Armed Forces.')
for ent in doc.ents:
    print(ent.text, "\t", ent.label_)

The United States 	 GPE
Army 	 ORG
the United States Armed Forces 	 GPE


The extraction result has changed.
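To make the difference explicit, the two results can be compared as lists of `(text, label)` pairs. The sketch below copies the pairs from the two outputs above instead of calling the model; `diff_entities` is a hypothetical helper name introduced for illustration.

```python
# (text, label) pairs copied from the two cell outputs above
without_break = [
    ("The United States Army", "GPE"),
    ("the United States Armed Forces", "GPE"),
]
with_break = [
    ("The United States", "GPE"),
    ("Army", "ORG"),
    ("the United States Armed Forces", "GPE"),
]


def diff_entities(before, after):
    """Return (only_in_before, only_in_after), preserving input order."""
    before_set, after_set = set(before), set(after)
    only_before = [ent for ent in before if ent not in after_set]
    only_after = [ent for ent in after if ent not in before_set]
    return only_before, only_after


lost, gained = diff_entities(without_break, with_break)
print("Only without the line break:", lost)
print("Only with the line break:   ", gained)
```

When working with live results, each list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.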

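In this example, the line break was inserted deliberately to change the recognition result. In practice, however, internal company text such as business documents and email is often wrapped mid-sentence to fit the screen. In such cases, simply applying a pretrained model is not enough; preprocessing such as sentence boundary detection is also needed.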
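One simple form of such preprocessing is to undo hard line wrapping before the text reaches the model. The following is a minimal sketch using only the standard library; `unwrap_hard_line_breaks` is a hypothetical helper name. It assumes that a blank line marks a paragraph boundary and that a single newline inside a paragraph is only visual wrapping. That assumption does not hold for text where single newlines carry meaning, such as lists or headings.

```python
import re


def unwrap_hard_line_breaks(text):
    """Join wrapped lines within a paragraph; keep blank lines as paragraph breaks."""
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse every whitespace run (including single
    # newlines) into one space.
    unwrapped = [re.sub(r"\s+", " ", paragraph).strip() for paragraph in paragraphs]
    # Drop paragraphs that were empty and rejoin with a blank line.
    return "\n\n".join(paragraph for paragraph in unwrapped if paragraph)


wrapped = "The United States\nArmy is the land service branch of the United States Armed Forces."
print(unwrap_hard_line_breaks(wrapped))
```

For this example the unwrapped string is identical to the input used in the cell without a line break, so passing it to `nlp` would give the same entities as that cell. Real documents will likely need more careful rules.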