<a href="https://colab.research.google.com/github/kobemawu/www/blob/master/Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applying scikit-learn clustering to the NLTK document collection

## Introduction

### Import the required libraries and datasets

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import collections

### Download the NLTK resources used in this notebook


In [3]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("reuters")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Load the data

In [6]:
from nltk.corpus import reuters as corpus

!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora

Streaming output truncated to the last 5000 lines.
  inflating: /root/nltk_data/corpora/reuters/training/2231  
  inflating: /root/nltk_data/corpora/reuters/training/2232  
  inflating: /root/nltk_data/corpora/reuters/training/2234  
  inflating: /root/nltk_data/corpora/reuters/training/2236  
  inflating: /root/nltk_data/corpora/reuters/training/2237  
  inflating: /root/nltk_data/corpora/reuters/training/2238  
  inflating: /root/nltk_data/corpora/reuters/training/2239  
  inflating: /root/nltk_data/corpora/reuters/training/2240  
  inflating: /root/nltk_data/corpora/reuters/training/2244  
  inflating: /root/nltk_data/corpora/reuters/training/2246  
  inflating: /root/nltk_data/corpora/reuters/training/2247  
  inflating: /root/nltk_data/corpora/reuters/training/2249  
  inflating: /root/nltk_data/corpora/reuters/training/225  
  inflating: /root/nltk_data/corpora/reuters/training/2251  
  inflating: /root/nltk_data/corpora/reuters/training/2252  
  inflating: /root/nltk_data/corpora/reu

### Inspect the contents of the dataset. Depending on your environment, you may first need to run the following command.
"!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora"

In [7]:
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:300]):
    print(item, end=" ")
    if (n % 25) == 24:
        print(" ")

ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears  
among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . They  
told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And  
lead to curbs on American imports of their products . But some exporters said that while the conflict would hurt them in the long -  
run , in the short - term Tokyo ' s loss might be their gain . The U . S . Has said it will  
impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to  
stick to a pact not to sell semiconductors on world markets at below cost . Unofficial Japanese estimates put the impact of the tariffs at  
10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports 

### Total number of documents

In [8]:
len(corpus.fileids())

10788

### (Example) Training on only the first k documents

In [9]:
# k = 100
#docs=[corpus.words(fileid) for fileid in corpus.fileids()[:k]]

### Training on all documents

In [10]:
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:5])
print("num of docs:", len(docs))

[['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...], ['CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7', '-', ...], ['JAPAN', 'TO', 'REVISE', 'LONG', '-', 'TERM', ...], ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...], ['INDONESIA', 'SEES', 'CPO', 'PRICE', 'RISING', ...]]
num of docs: 10788


## Preprocessing

### Example: building a stop word list

### NLTK's stop word list

In [11]:
en_stop = nltk.corpus.stopwords.words('english')

### Example (advanced): try removing symbols and digits with a regular expression

In [12]:
en_stop = (
    ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<"]
    + ["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000"]
    + ["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price"]
    + en_stop
)
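
The heading above suggests removing symbols and digits with a regular expression instead of listing them by hand. The following is a minimal sketch of that idea. `is_wordlike` is a hypothetical helper that does not appear in the original notebook, and its rule (keep a token only if it contains at least one ASCII letter) is an assumption, not the author's method.

```python
import re

# Hypothetical helper (not part of the original notebook).
# Assumed rule: a token is kept only if it contains at least one ASCII letter.
_HAS_LETTER = re.compile(r"[a-z]", re.IGNORECASE)


def is_wordlike(token):
    """Return True when the token contains at least one ASCII letter."""
    return _HAS_LETTER.search(token) is not None


# Illustrative tokens in the style of the Reuters corpus
tokens = ["U", ".", "S", ".-", "JAPAN", "300", "mln", "dlrs", "1987", "--"]
filtered = [t for t in tokens if is_wordlike(t)]
print(filtered)
```

A check like this could run at the start of `preprocess_word`, which would make the hand-written symbol and digit entries in `en_stop` unnecessary. Note that mixed tokens such as `1st` are kept under this rule.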

### Define the preprocessing functions

In [13]:
from nltk.corpus import wordnet as wn  # needed for lemmatization with wn.morphy

def preprocess_word(word, stopwordset):

    # 1. lowercase the word, e.g. Python => python
    word = word.lower()

    # 2. remove "," and "."
    if word in [",", "."]:
        return None

    # 3. remove stop words, e.g. the => None
    if word in stopwordset:
        return None

    # 4. lemmatize, e.g. cooked => cook
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    elif lemma in stopwordset:  # the lemma itself may be a stop word
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]

### Print the preprocessing result

### Before preprocessing

In [14]:
print(docs[0][:25]) 

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears']


### After preprocessing

In [15]:
print(preprocess_documents(docs)[0][:25])

['asian', 'exporter', 'fear', 'damage', 'japan', 'rift', 'mounting', 'trade', 'friction', 'japan', 'raise', 'fear', 'among', 'many', 'asia', 'exporting', 'nation', 'row', 'could', 'inflict', 'far', 'reaching', 'economic', 'damage', 'businessmen']


## Clustering

### Vectorize the preprocessed documents above with TF-IDF
### Use the vectorizer (set the hyperparameters)

In [16]:
pre_docs=preprocess_documents(docs)
pre_docs=[" ".join(doc) for doc in pre_docs]
print(pre_docs[0])

vectorizer = TfidfVectorizer(max_features=200, token_pattern=r'(?u)\b\w+\b')



### Fit

In [17]:
tf_idf = vectorizer.fit_transform(pre_docs)

### K-means
### Configure k-means

In [18]:
num_clusters = 8
km = KMeans(n_clusters=num_clusters, random_state = 0)

### Fit

In [19]:
clusters = km.fit_predict(tf_idf)

### Output

In [20]:
for doc, cls in zip(pre_docs, clusters):
    print(cls,doc)

Streaming output truncated to the last 5000 lines.
2 first america bank inc fabk 1st qtr 27 15 900 avg shrs 642 403 322 245 loan 57 29 deposit 80 75 assets 75 37
0 lomak petroleum inc lomk loss loss loss 19 loss 348 loss 410 rev 22 acg shrs 13
7 newmont nem stake du pont dd decline newmont mining corp hold 250 376 du pont co outstanding end 970 141 outstanding hold earlier newmont annual report correct march 30 item show holding decrease
7 bankeast corp benh 1st qtr 35 29 732 131 assets 09 861 deposit 817 mlnvs 705 loan 704 553 note result restate reflect acquisition unite banks corp
2 ust corp utsb 1st qtr 41 31 568 656 461 674 assets deposit 46 19 note result restate reflect 100 stock dividend
2 central ill public service cip 1st qtr 28 29 978 916 rev 153 163 note full name central illinois public service co per result reflect payment prefer dividend
7 thera care inc acquire cushing thera care inc agree acquire cushing associate glendale calif exchange 480 thera care
7 interco complete la

0 scientific system services inc sssv 4th qtr profit four loss 49 profit 160 700 loss 867 100 rev 700 600 profit two loss 20 profit 84 400 loss 507 800 rev 20 22
4 skipper inc skip semi annual dividend semi annual div four four pay april 30 record april
4 ziegler co inc zegl set quarterly qtly div 13 13 prior pay april 17 record april
1 export inspection thou bushel soybean 18 345 wheat 470 corn 34 940 export inspection thou bushel soybean 18 345 wheat 470 corn 34 940
1 banking center tbcx make acquisition banking center sign letter intent acquire first railroad mortgage co first union bank augusta ga undisclosed terms first railroad loan production 100 servicing 435 loan end
5 san paolo di torino acquire californian bank italian state bank istituto bancario san paolo di torino fully san paolo holding co wilmington sign letter intent acquire valley national bank california san paolo statement subject obtain official authorization relevant body would merge valley national bank subsidia

## Hints

<p1>
scikit-learn's vectorizer and KMeans have many hyperparameters. Some of the vectorizer's hyperparameters also provide preprocessing features (e.g. stop_words).
    Changing the hyperparameter settings changes the final result. Visit the URLs below and try setting the hyperparameters yourself.<br>
    ・TF-IDF parameters<br>
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html<br>
    ・KMeans parameters<br>
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html<br>
</p1>


## Exercises
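<p1>
    Change the code in the Clustering section as instructed below and check how the results change.<br>
    (1) Vectorize the documents with another method covered in the lecture (e.g. bag-of-words).<br>
    (2) Try a method other than k-means (e.g. hierarchical clustering), or visualize the k-means result.<br>
</p1>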
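
One way to approach both exercises is sketched below: `CountVectorizer` builds a bag-of-words matrix, and `AgglomerativeClustering` performs hierarchical clustering on it. The documents are made up for illustration, and Ward linkage is an assumed choice rather than the lecture's prescribed setting.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents with two obvious topics (an assumption)
toy_docs = [
    "bank loan deposit assets",
    "bank deposit loan",
    "loan bank assets deposit",
    "wheat corn export soybean",
    "corn wheat soybean",
    "export wheat corn",
]

# (1) Bag-of-words: raw term counts instead of TF-IDF weights
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(toy_docs)

# (2) Hierarchical (agglomerative) clustering with Ward linkage.
# The sparse matrix is converted to a dense array before fitting.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
agg_labels = agg.fit_predict(bow.toarray())

for label, doc in zip(agg_labels, toy_docs):
    print(label, doc)
```

On the full corpus, a dense array of 10788 documents is only practical while the vocabulary stays small, for example by setting `max_features` on the vectorizer as the notebook does.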