In [13]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.15.3-py2.py3-none-any.whl (636 kB)
     |████████████████████████████████| 636 kB 5.4 MB/s eta 0:00:01
Installing collected packages: textblob
Successfully installed textblob-0.15.3


In [16]:
import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/fronteo/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

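## ① Common text normalization steps
    https://mp.weixin.qq.com/s/Gq_0Ks3XTUzZSFYjRrcfMQ
    ・Unify upper and lower case
    ・Convert numbers to words, or remove them
    ・Handle punctuation and other symbols
    ・Remove whitespace
    ・Expand abbreviations
    ・Handle special expressions (e.g. remove them)
    ・Text canonicalization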
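As a rough sketch of how several of these steps might be chained, the function below lowercases, drops digits, strips ASCII punctuation, and collapses whitespace using only the standard library. The name `normalize` and the order of the steps are assumptions made for illustration; a real pipeline would pick and order steps to suit the task.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Hypothetical helper: apply a few basic normalization steps in sequence."""
    text = text.lower()                       # unify case
    text = re.sub(r"\d+", "", text)           # remove digits
    text = text.translate(_PUNCT_TABLE)       # remove ASCII punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    return text

print(normalize("  The 5 biggest countries, by population in 2017!  "))
# the biggest countries by population in
```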

##### ① Unifying upper and lower case

In [6]:
input_str = 'The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil.'
input_str = input_str.lower()
print(input_str)
input_str = input_str.upper()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
THE 5 BIGGEST COUNTRIES BY POPULATION IN 2017 ARE CHINA, INDIA, UNITED STATES, INDONESIA, AND BRAZIL.
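`lower()` is enough for ASCII text like the sentence above. When the goal is case-insensitive comparison of text that may contain non-ASCII letters, Python also provides `str.casefold()`, which folds case more aggressively. A minimal sketch follows; the helper name `same_ignoring_case` is hypothetical.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings while ignoring case."""
    return a.casefold() == b.casefold()

print("Straße".lower())      # straße  (the ß is left unchanged)
print("Straße".casefold())   # strasse (the ß is folded to "ss")
print(same_ignoring_case("STRASSE", "Straße"))  # True
```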


##### ② Removing digits

In [7]:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+','',input_str)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
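The list of normalization steps also mentions converting numbers to words instead of deleting them. The sketch below spells out integers from 0 to 99 and leaves anything larger untouched. Both helper names (`number_to_words`, `spell_out_numbers`) are hypothetical and the range limit is an arbitrary simplification, not a library feature.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 (hypothetical helper)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

def spell_out_numbers(text: str) -> str:
    """Replace each run of digits below 100 with its English spelling."""
    def replace(match):
        n = int(match.group())
        # Numbers outside the supported range are kept as digits.
        return number_to_words(n) if n < 100 else match.group()
    return re.sub(r"\d+", replace, text)

print(spell_out_numbers("Box A contains 3 red and 5 white balls."))
# Box A contains three red and five white balls.
```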


##### ★★★★ ③ Handling punctuation and other symbols

In [17]:
import string
input_str = 'This &is [an] example? {of} string. with.? punctuation!!!!'
trans=str.maketrans({key: None for key in string.punctuation})
result=input_str.translate(trans)
print(result)

This is an example of string with punctuation
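Removing every character in `string.punctuation` also strips apostrophes and hyphens inside words, and it only covers ASCII punctuation. The sketch below adds a `keep` parameter so selected characters survive. The function name and parameter are assumptions for illustration.

```python
import string

def strip_punctuation(text: str, keep: str = "") -> str:
    """Remove ASCII punctuation, except the characters listed in `keep`."""
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop))

print(strip_punctuation("don't stop, state-of-the-art!"))
# dont stop stateoftheart
print(strip_punctuation("don't stop, state-of-the-art!", keep="'-"))
# don't stop state-of-the-art
```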


##### ④ Removing whitespace

In [18]:
input_str = ' \t a string example\t '
input_str = input_str.strip()
print(input_str)

a string example
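`strip()` only trims the two ends of the string. Earlier steps such as digit removal can leave runs of spaces in the middle (see "contains  red" in the output of ②). One possible way to collapse those as well is to split on whitespace and rejoin; the helper name `collapse_whitespace` is hypothetical.

```python
def collapse_whitespace(text: str) -> str:
    """Trim both ends and reduce every internal whitespace run to one space."""
    # str.split() with no argument splits on any whitespace run
    # and discards empty strings at the ends.
    return " ".join(text.split())

print(collapse_whitespace(" \t a string example\t "))
# a string example
print(collapse_whitespace("Box A contains  red and  white balls"))
# Box A contains red and white balls
```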


##### ④stopwords
    NLTK, scikit-learn, spaCy, and other libraries ship ready-made stop-word lists:
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    from spacy.lang.en.stop_words import STOP_WORDS

In [36]:
# English text (NLTK)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

input_str = 'NLTK is a leading platform for building Python programs to work with human language data.'
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
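Stop-word lists are typically lowercase, so a case-sensitive membership test lets capitalized stop words such as a sentence-initial "The" slip through. The sketch below compares the lowercased token instead. `STOP_WORDS` here is a tiny hand-written set used only for illustration, not the NLTK list, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative set (an assumption); real code would load a library list.
STOP_WORDS = {"the", "is", "a", "on", "for", "to", "with", "and", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "cat", "is", "on", "the", "mat", "."]
print([t for t in tokens if t not in STOP_WORDS])  # case-sensitive test
# ['The', 'cat', 'mat', '.']
print(remove_stop_words(tokens))                   # case-insensitive test
# ['cat', 'mat', '.']
```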


##### ⑤ Stemming
Stemming reduces a word to its stem (books → book, looked → look).
Two common algorithms:
Porter stemming (strips common morphological and inflectional endings from words)
Lancaster stemming
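To make "stripping common endings" concrete, here is a deliberately naive suffix stripper. It is not the Porter or Lancaster algorithm, both of which apply many more rules and conditions. The suffix list, the `min_stem` threshold, and the name `naive_stem` are all assumptions chosen for illustration.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str, min_stem: int = 3) -> str:
    """Toy stemmer: lowercase the word and strip one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least `min_stem` characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["books", "looked", "types", "is"]:
    print(w, "->", naive_stem(w))
# books -> book
# looked -> look
# types -> typ
# is -> is
```

The "types -> typ" result shows why real stemmers need more careful rules than blind suffix removal.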

In [4]:
# Extract word stems with NLTK

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
input_str = 'There are several types of stemming algorithms.'
tokens = word_tokenize(input_str)  # split the sentence into tokens
for word in tokens:
    print(stemmer.stem(word))

[nltk_data] Downloading package wordnet to /home/fronteo/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


there
are
sever
type
of
stem
algorithm
.


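##### ⑥ Lemmatization
    Lemmatization reduces the different inflected forms of a word to a single base form (the lemma). Unlike stemming, it does not simply strip endings; it looks each word up in a lexical resource (WordNet in the NLTK example below) to find the correct form.
    
    Commonly used lemmatization libraries include NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.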

In [11]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
input_str = 'been had done languages cities mice better'
tokens = word_tokenize(input_str)
for word in tokens:
    # pos='a' treats the word as an adjective; without pos, lemmatize() assumes a noun
    print("with pos='a': " + lemmatizer.lemmatize(word, pos='a'))
    print('without pos: ' + lemmatizer.lemmatize(word))

with pos='a': been
without pos: been
with pos='a': had
without pos: had
with pos='a': done
without pos: done
with pos='a': languages
without pos: language
with pos='a': cities
without pos: city
with pos='a': mice
without pos: mouse
with pos='a': good
without pos: better
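The output shows that the result depends on the `pos` argument: the verb forms "been", "had", and "done" are unchanged under both the adjective and the default noun setting. A common approach is to derive `pos` from a part-of-speech tag for each token. The sketch below maps Penn Treebank tags to the single-letter codes WordNet uses ('a', 'v', 'r', 'n'); the helper name `penn_to_wordnet_pos` is hypothetical.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

for tag in ["VBN", "JJR", "NNS", "RB"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
# VBN -> v
# JJR -> a
# NNS -> n
# RB -> r
```

The returned letter could then be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`, with the tag coming from any Penn Treebank tagger.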


##### ⑦ Part-of-speech tagging (POS)
    Many tools include a POS tagger, among them NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

In [17]:
from textblob import TextBlob

input_str = 'Parts of speech examples: an article, to write, interesting, easily, and, of'
result = TextBlob(input_str)
print(result.tags)

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
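The tags are plain `(word, tag)` tuples, so they can be post-processed with ordinary Python. The sketch below hard-codes the list printed above, so it runs without TextBlob, and shows two typical follow-ups: selecting nouns and grouping words by tag. The helper name `group_by_tag` is hypothetical.

```python
from collections import defaultdict

# Copied from the output above so the example runs without TextBlob.
tags = [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'),
        ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'),
        ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

def group_by_tag(tagged):
    """Collect the words that share each POS tag, in order of appearance."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tags if tag.startswith("NN")]
print(nouns)
# ['Parts', 'speech', 'examples', 'article']
print(group_by_tag(tags)["IN"])
# ['of', 'of']
```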


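##### ⑧ Chunking (shallow parsing)
    Chunking is a natural language process that identifies the constituents of a sentence (nouns, verbs, adjectives, etc.) and links them into higher-order units with discrete grammatical meaning (noun groups or phrases, verb groups, etc.). Commonly used chunking tools include NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), and FreeLing.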
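As a minimal illustration of the idea, the sketch below groups already-tagged tokens into noun-phrase chunks that match the pattern "optional determiner, any number of adjectives, one or more nouns". It is a toy written with the standard library, not the implementation of any of the tools listed; the pattern and the helper names (`tag_to_symbol`, `noun_phrase_chunks`) are assumptions for illustration.

```python
import re

def tag_to_symbol(tag: str) -> str:
    """Reduce a Penn Treebank tag to one letter so a regex can scan the sequence."""
    if tag == "DT":
        return "D"  # determiner
    if tag.startswith("JJ"):
        return "J"  # adjective
    if tag.startswith("NN"):
        return "N"  # noun
    return "O"      # anything else

# Noun phrase: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?J*N+")

def noun_phrase_chunks(tagged):
    """Return noun-phrase chunks as strings, given (word, tag) pairs."""
    symbols = "".join(tag_to_symbol(tag) for _, tag in tagged)
    # Each symbol corresponds to one token, so match spans index into `tagged`.
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(symbols)]

tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("a", "DT"), ("dog", "NN")]
print(noun_phrase_chunks(tagged))
# ['the quick brown fox', 'a dog']
```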