# Keyphrase Extraction with textacy

This notebook extracts keyphrases using textacy, a library that supports a wide range of natural language processing tasks.

## Setup
### Installing the packages

In [1]:
!pip install textacy==0.11.0 spacy==3.1.2

Collecting textacy==0.11.0
  Downloading textacy-0.11.0-py3-none-any.whl (200 kB)
Collecting spacy==3.1.2
  Downloading spacy-3.1.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
Collecting jellyfish>=0.8.0
  Downloading jellyfish-0.8.8.tar.gz (134 kB)
Collecting cytoolz>=0.10.1
  Downloading cytoolz-0.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting pyphen>=0.10.0
  Downloading pyphen-0.11.0-py3-none-any.whl (2.0 MB)
Collecting catalogue<2.1.0,>=2.0.4
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting thinc<8.1.0,>=8.0.8
  Downloading thinc-8.0.10-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623 kB)

### Downloading the model

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Imports

In [3]:
import spacy
import textacy
from textacy import extract

### Uploading the data

First, upload the text from which keyphrases will be extracted. A `Data` folder sits at the same level as this notebook and contains `nlphistory.txt`; upload that file.

In [4]:
from google.colab import files
uploaded = files.upload()

Saving nlphistory.txt to nlphistory.txt


### Loading the data

Load the uploaded data. If you are not running on Colab, read the file from `Data/nlphistory.txt` instead.

In [5]:
with open("nlphistory.txt", encoding="utf-8") as f:
    mytext = f.read()
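
The choice between the uploaded copy and the repository copy can also be made by a small path lookup. The following is a minimal sketch and not part of the original notebook: the helper name `find_data_file` is hypothetical, and the two candidate paths are the ones mentioned in the text above.

```python
from pathlib import Path


def find_data_file(candidates):
    """Return the first candidate path that exists as a file (hypothetical helper)."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    raise FileNotFoundError(f"None of these files exist: {list(candidates)}")


# Uploaded copy first (Colab), then the copy under Data/ (local run).
CANDIDATES = ["nlphistory.txt", "Data/nlphistory.txt"]
```

With this helper, the text could be read with `find_data_file(CANDIDATES).read_text(encoding="utf-8")`.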

In [6]:
mytext

'The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\n\nThe Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.\n\nSome notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working 

### Creating the spaCy document

In [7]:
# Load the spaCy model
en = textacy.load_spacy_lang("en_core_web_sm")

# Convert the text into a spaCy Doc
doc = textacy.make_spacy_doc(mytext, lang=en)

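## Keyphrase extraction with TextRank

Use `extract.keyterms.textrank` to extract keyphrases. With `topn=5`, it returns the five highest-scoring phrases together with their scores.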

In [8]:
extract.keyterms.textrank(doc, topn=5)

[('successful natural language processing system', 0.024618603795057474),
 ('statistical machine translation system', 0.024521833176108804),
 ('natural language system', 0.020420428577977795),
 ('statistical natural language processing', 0.018475996624680684),
 ('natural language task', 0.01571262194574599)]
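
To give some intuition for where such scores come from, here is a minimal sketch of the general idea behind TextRank: build a graph linking words that occur near each other, then run PageRank on it. This is not textacy's implementation, which also handles candidate selection, normalization, and phrase assembly. The function name `textrank_words`, its parameters, and the toy token list (words taken from the phrases above) are assumptions made for illustration.

```python
from collections import defaultdict


def textrank_words(tokens, window=2, damping=0.85, iterations=50):
    """Score tokens with PageRank over an undirected co-occurrence graph."""
    # Link each token to the `window` tokens that follow it.
    neighbors = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != token:
                neighbors[token].add(other)
                neighbors[other].add(token)

    nodes = list(neighbors)
    if not nodes:
        return {}

    # Power iteration: each node shares its score equally among its neighbors.
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / len(nodes)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[node])
            for node in nodes
        }
    return scores


# Toy input (illustrative only): content words from the phrases above.
tokens = [
    "natural", "language", "processing", "system",
    "statistical", "machine", "translation", "system",
    "natural", "language", "system",
]
word_scores = textrank_words(tokens)
ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
```

Words that co-occur with many different words receive higher scores. A full keyphrase extractor would then combine word scores into scores for multi-word candidates.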

Next, compare the results of TextRank and SGRank.

In [9]:
kps_textrank = [kps for kps, _ in extract.keyterms.textrank(doc, normalize="lemma", topn=5)]
kps_sgrank = [kps for kps, _ in extract.keyterms.sgrank(doc, topn=5)]
print(f"Textrank output\t: {kps_textrank}")
print(f"SGRank output\t: {kps_sgrank}")

Textrank output	: ['successful natural language processing system', 'statistical machine translation system', 'natural language system', 'statistical natural language processing', 'natural language task']
SGRank output	: ['natural language processing system', 'statistical machine translation', 'early', 'research', 'late 1980']
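
One rough way to quantify how much the two outputs agree is Jaccard similarity, computed over whole phrases and over individual words. The sketch below is an illustrative addition: the helper `jaccard` is hypothetical, and the two lists are copied from the output above.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


kps_textrank = [
    "successful natural language processing system",
    "statistical machine translation system",
    "natural language system",
    "statistical natural language processing",
    "natural language task",
]
kps_sgrank = [
    "natural language processing system",
    "statistical machine translation",
    "early",
    "research",
    "late 1980",
]

# Agreement on exact phrases.
phrase_overlap = jaccard(kps_textrank, kps_sgrank)

# Agreement on the words that make up the phrases.
words_textrank = {word for phrase in kps_textrank for word in phrase.split()}
words_sgrank = {word for phrase in kps_sgrank for word in phrase.split()}
word_overlap = jaccard(words_textrank, words_sgrank)

print(f"phrase-level Jaccard: {phrase_overlap:.3f}")
print(f"word-level Jaccard:   {word_overlap:.3f}")
```

The two algorithms share no exact phrase here, yet more than half of the words overlap: both surface the same topics but cut phrase boundaries differently.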


To deal with duplicate keyphrases, textacy provides the `aggregate_term_variants` function. It groups variants of the same term into a single set, which yields keyphrases without duplicates.

In [10]:
terms = {term for term, _ in extract.keyterms.sgrank(doc)}
extract.utils.aggregate_term_variants(terms)

[{'natural language processing system'},
 {'statistical machine translation'},
 {'statistical model'},
 {'late 1980'},
 {'research'},
 {'example'},
 {'early'},
 {'ELIZA'},
 {'world'},
 {'real'}]
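
In the output above every set holds a single term, so no variants were found among the SGRank results. To illustrate what grouping variants means, here is a deliberately simplified sketch that merges terms differing only in case, hyphenation, or word order. It is not textacy's algorithm; the helper `group_variants` and the example terms are hypothetical.

```python
from collections import defaultdict


def group_variants(terms):
    """Group terms whose lowercased, hyphen-free, sorted tokens are identical."""
    groups = defaultdict(set)
    for term in terms:
        key = tuple(sorted(term.lower().replace("-", " ").split()))
        groups[key].add(term)
    # Largest groups first; ties broken alphabetically for a stable order.
    return sorted(groups.values(), key=lambda group: (-len(group), sorted(group)))


# Made-up input for illustration; not taken from the notebook's data.
example_terms = {
    "machine translation",
    "Machine Translation",
    "machine-translation",
    "natural language",
    "research",
}
variant_groups = group_variants(example_terms)
```

Each resulting set can then be reduced to one representative term, for example the most frequent or the longest variant.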

Noun chunks can also serve as keyphrase candidates. The drawbacks of this approach are that it produces a large number of phrases and offers no way to rank them.

In [11]:
list(extract.noun_chunks(doc))

[history,
 natural language processing,
 1950s,
 work,
 earlier periods,
 Alan Turing,
 article,
 "Intelligence,
 what,
 Turing test,
 criterion,
 intelligence,
 Georgetown experiment,
 fully automatic translation,
 more than sixty Russian sentences,
 English,
 authors,
 three or five years,
 machine translation,
 real progress,
 ALPAC report,
 ten-year-long research,
 expectations,
 machine translation,
 Little further research,
 machine translation,
 late 1980s,
 first statistical machine translation systems,
 notably successful natural language processing systems,
 1960s,
 SHRDLU,
 natural language system,
 restricted "blocks worlds,
 restricted vocabularies,
 ELIZA,
 simulation,
 Rogerian psychotherapist,
 Joseph Weizenbaum,
 almost no information,
 human thought,
 emotion,
 ELIZA,
 startlingly human-like interaction,
 "patient,
 very small knowledge base,
 ELIZA,
 generic response,
 example,
 My head,
 you,
 your head,
 1970s,
 many programmers,
 "conceptual ontologies,
 real-worl
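
Since noun chunks come without scores, one simple workaround is to rank them by how often they occur. The sketch below is an illustrative addition, not a textacy feature: the helper `rank_by_frequency` is hypothetical, and the input is a hand-copied subset of the chunks listed above.

```python
from collections import Counter


def rank_by_frequency(chunks, topn=3):
    """Return the `topn` most frequent chunk strings, ignoring case."""
    counts = Counter(chunk.lower() for chunk in chunks)
    return counts.most_common(topn)


# A subset of the noun chunks shown above, written out as plain strings.
sample_chunks = [
    "machine translation", "real progress", "ALPAC report",
    "machine translation", "Little further research", "machine translation",
    "ELIZA", "simulation", "ELIZA", "ELIZA",
]
top_chunks = rank_by_frequency(sample_chunks, topn=2)
```

With a spaCy `Doc`, the same helper could be fed `chunk.text for chunk in extract.noun_chunks(doc)`. Raw frequency is a crude signal, which is why graph-based methods such as TextRank and SGRank are usually preferred.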

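textacy offers many other information extraction features as well. Many of them rely on regular-expression patterns or heuristics to extract expressions such as acronyms and quotations. Beyond these, you can extract matches for patterns that include part-of-speech tags, or find sentences containing named entities, subject-verb-object tuples, and more. See the documentation below for details.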

- [textacy: NLP, before and after spaCy](https://textacy.readthedocs.io/en/latest/)
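
As a rough illustration of the regex-based heuristics mentioned above, the following sketch finds acronym candidates as runs of two or more capital letters. It uses only the standard library and is not how textacy extracts acronyms. The pattern, the helper `find_acronym_candidates`, and the sample text (adapted from `nlphistory.txt`) are assumptions.

```python
import re

# Two or more consecutive capital letters standing alone as a word.
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,}\b")


def find_acronym_candidates(text):
    """Return unique all-caps tokens in order of first appearance."""
    matches = ACRONYM_PATTERN.findall(text)
    return list(dict.fromkeys(matches))


# Sample adapted from the text used in this notebook.
sample_text = (
    "However, real progress was much slower, and after the ALPAC report in 1966, "
    "funding for machine translation was dramatically reduced. "
    "Some notably successful natural language processing systems developed "
    "in the 1960s were SHRDLU, a natural language system"
)
acronym_candidates = find_acronym_candidates(sample_text)
```

A heuristic this simple cannot tell acronyms from system names written in capitals, so its results are only candidates that need further filtering.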