# Preprocessing with spaCy and NLTK

In this notebook, we use [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/) to perform tokenization, stemming, lemmatization, and part-of-speech (POS) tagging.

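## Corpus

We will work with the following corpus as an example. `corpus` is a variant (with `&` in place of `and it`) that the preprocessing steps below transform; `corpus_original` is left untouched and used later for POS tagging.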

In [1]:
corpus_original = "Need to finalize the demo corpus which will be used for this notebook and it should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"
corpus = "Need to finalize the demo corpus which will be used for this notebook & should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"

## Basic preprocessing

We apply the following basic preprocessing steps:

- Lowercasing
- Removing digits
- Removing punctuation
- Removing extra whitespace

In [2]:
corpus = corpus.lower()
print(corpus)

need to finalize the demo corpus which will be used for this notebook & should be done soon !!. it should be done by the ending of this month. but will it? this notebook has been run 4 times !!


In [3]:
import re
corpus = re.sub(r'\d+','', corpus)
print(corpus)

need to finalize the demo corpus which will be used for this notebook & should be done soon !!. it should be done by the ending of this month. but will it? this notebook has been run  times !!


In [4]:
import string
corpus = corpus.translate(str.maketrans('', '', string.punctuation))
print(corpus)

need to finalize the demo corpus which will be used for this notebook  should be done soon  it should be done by the ending of this month but will it this notebook has been run  times 


In [5]:
corpus = ' '.join(corpus.split())
corpus

'need to finalize the demo corpus which will be used for this notebook should be done soon it should be done by the ending of this month but will it this notebook has been run times'
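
The four steps above can also be bundled into a single reusable function. The following is a minimal sketch: the name `basic_preprocess` is a hypothetical helper introduced here for illustration, and `string.punctuation` only covers ASCII punctuation.

```python
import re
import string


def basic_preprocess(text: str) -> str:
    """Lowercase, drop digits and ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    return ' '.join(text.split())  # collapse runs of whitespace and trim both ends


sample = "This notebook has been run 4 times !!"
print(basic_preprocess(sample))
# this notebook has been run times
```

The order matters slightly: whitespace is collapsed last because removing digits and punctuation leaves double spaces behind.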

## Preprocessing with spaCy and NLTK
### Installing packages

Install spaCy and download the English model.

In [6]:
!pip install spacy
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


### Tokenizing the text

Now let's split the text into words using NLTK and spaCy.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

stop_words_nltk = set(stopwords.words('english'))
tokenized_corpus_nltk = word_tokenize(corpus)
print('Tokenized corpus:', tokenized_corpus_nltk)

tokenized_corpus_without_stopwords = [token for token in tokenized_corpus_nltk if token not in stop_words_nltk]
print('Tokenized corpus without stopwords:', tokenized_corpus_without_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Tokenized corpus: ['need', 'to', 'finalize', 'the', 'demo', 'corpus', 'which', 'will', 'be', 'used', 'for', 'this', 'notebook', 'should', 'be', 'done', 'soon', 'it', 'should', 'be', 'done', 'by', 'the', 'ending', 'of', 'this', 'month', 'but', 'will', 'it', 'this', 'notebook', 'has', 'been', 'run', 'times']
Tokenized corpus without stopwords: ['need', 'finalize', 'demo', 'corpus', 'used', 'notebook', 'done', 'soon', 'done', 'ending', 'month', 'notebook', 'run', 'times']


In [9]:
import spacy
spacy_model = spacy.load('en_core_web_sm')

stopwords_spacy = spacy_model.Defaults.stop_words
# Tokenize with spaCy's own tokenizer instead of NLTK's word_tokenize
tokenized_corpus_spacy = [token.text for token in spacy_model(corpus)]
print('Tokenized Corpus:', tokenized_corpus_spacy)

tokens_without_sw = [word for word in tokenized_corpus_spacy if word not in stopwords_spacy]
print('Tokenized corpus without stopwords', tokens_without_sw)

Tokenized Corpus: ['need', 'to', 'finalize', 'the', 'demo', 'corpus', 'which', 'will', 'be', 'used', 'for', 'this', 'notebook', 'should', 'be', 'done', 'soon', 'it', 'should', 'be', 'done', 'by', 'the', 'ending', 'of', 'this', 'month', 'but', 'will', 'it', 'this', 'notebook', 'has', 'been', 'run', 'times']
Tokenized corpus without stopwords ['need', 'finalize', 'demo', 'corpus', 'notebook', 'soon', 'ending', 'month', 'notebook', 'run', 'times']


In [10]:
print('Difference between NLTK and spaCy output:\n',
      set(tokenized_corpus_without_stopwords) - set(tokens_without_sw))

Difference between NLTK and spaCy output:
 {'used', 'done'}
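
The set subtraction above is one-directional: it reports tokens that NLTK kept and spaCy removed, but not the reverse. A small sketch that checks both directions, using the two filtered token lists copied from the outputs above (the variable names here are illustrative only):

```python
# Filtered token lists copied from the outputs above
nltk_tokens = ['need', 'finalize', 'demo', 'corpus', 'used', 'notebook', 'done',
               'soon', 'done', 'ending', 'month', 'notebook', 'run', 'times']
spacy_tokens = ['need', 'finalize', 'demo', 'corpus', 'notebook', 'soon',
                'ending', 'month', 'notebook', 'run', 'times']

only_nltk = set(nltk_tokens) - set(spacy_tokens)   # kept by NLTK, removed by spaCy
only_spacy = set(spacy_tokens) - set(nltk_tokens)  # kept by spaCy, removed by NLTK
either = set(nltk_tokens) ^ set(spacy_tokens)      # symmetric difference

print(sorted(only_nltk))   # ['done', 'used']
print(sorted(only_spacy))  # []
print(sorted(either))      # ['done', 'used']
```

For this corpus, spaCy's stop-word list removes `used` and `done` while NLTK's keeps them, and no token survives only in the spaCy result.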


### Stemming

In [12]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()

print('Before Stemming:')
print(corpus)

print('After Stemming:')
for word in tokenized_corpus_nltk:
    print(stemmer.stem(word), end=' ')

Before Stemming:
need to finalize the demo corpus which will be used for this notebook should be done soon it should be done by the ending of this month but will it this notebook has been run times
After Stemming:
need to final the demo corpu which will be use for thi notebook should be done soon it should be done by the end of thi month but will it thi notebook ha been run time 
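
Stems are not guaranteed to be dictionary words, as `corpu`, `thi`, and `ha` in the output above show. A small sketch that pairs a few words with their Porter stems makes this easier to see. It assumes NLTK is installed, and the word list is picked from the corpus purely for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['finalize', 'corpus', 'this', 'has', 'ending', 'times']

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f'{word} -> {stem}')
# finalize -> final
# corpus -> corpu
# this -> thi
# has -> ha
# ending -> end
# times -> time
```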

### Lemmatization

In [13]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
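lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun by default (pos='n'),
# so verb forms such as 'used' and 'done' come back unchanged
for word in tokenized_corpus_nltk:
    print(lemmatizer.lemmatize(word), end=' ')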

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
need to finalize the demo corpus which will be used for this notebook should be done soon it should be done by the ending of this month but will it this notebook ha been run time 

### POS tagging

In [15]:
from pprint import pprint

nltk.download('averaged_perceptron_tagger')
print('POS Tagging using NLTK:')
pprint(nltk.pos_tag(word_tokenize(corpus_original)))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
POS Tagging using NLTK:
[('Need', 'NN'),
 ('to', 'TO'),
 ('finalize', 'VB'),
 ('the', 'DT'),
 ('demo', 'NN'),
 ('corpus', 'NN'),
 ('which', 'WDT'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('used', 'VBN'),
 ('for', 'IN'),
 ('this', 'DT'),
 ('notebook', 'NN'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ('should', 'MD'),
 ('be', 'VB'),
 ('done', 'VBN'),
 ('soon', 'RB'),
 ('!', '.'),
 ('!', '.'),
 ('.', '.'),
 ('It', 'PRP'),
 ('should', 'MD'),
 ('be', 'VB'),
 ('done', 'VBN'),
 ('by', 'IN'),
 ('the', 'DT'),
 ('ending', 'VBG'),
 ('of', 'IN'),
 ('this', 'DT'),
 ('month', 'NN'),
 ('.', '.'),
 ('But', 'CC'),
 ('will', 'MD'),
 ('it', 'PRP'),
 ('?', '.'),
 ('This', 'DT'),
 ('notebook', 'NN'),
 ('has', 'VBZ'),
 ('been', 'VBN'),
 ('run', 'VBN'),
 ('4', 'CD'),
 ('times', 'NNS'),
 ('!', '.'),
 ('!', '.')]


In [18]:
doc = spacy_model(corpus_original)

print('POS Tagging using spacy:')
for token in doc:
    print(f'{token}\t{token.pos_}')

POS Tagging using spacy:
Need	VERB
to	PART
finalize	VERB
the	DET
demo	NOUN
corpus	NOUN
which	DET
will	VERB
be	AUX
used	VERB
for	ADP
this	DET
notebook	NOUN
and	CCONJ
it	PRON
should	VERB
be	AUX
done	VERB
soon	ADV
!	PUNCT
!	PUNCT
.	PUNCT
It	PRON
should	VERB
be	AUX
done	VERB
by	ADP
the	DET
ending	NOUN
of	ADP
this	DET
month	NOUN
.	PUNCT
But	CCONJ
will	VERB
it	PRON
?	PUNCT
This	DET
notebook	NOUN
has	AUX
been	AUX
run	VERB
4	NUM
times	NOUN
!	PUNCT
!	PUNCT


Besides NLTK and spaCy, there are other libraries that can handle basic preprocessing.