<a href="https://colab.research.google.com/github/kobemawu/www/blob/master/Clustering_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering an NLTK Corpus with scikit-learn

## Preparation
First, let us import the necessary libraries.
* nltk
* sklearn


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import collections

Next, download the NLTK resources used in this tutorial: the stopword list, WordNet, the Reuters corpus, and the Punkt tokenizer.

In [None]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("reuters")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Load the corpus from the NLTK package and inspect its contents. In some cases, you may need to unzip the data first, for example:  
`!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora`

In [None]:
from nltk.corpus import reuters as corpus

# Print the first 300 tokens of the first document, 25 tokens per line.
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:300]):
    print(item, end=" ")
    if n % 25 == 24:
        print(" ")

ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears  
among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . They  
told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And  
lead to curbs on American imports of their products . But some exporters said that while the conflict would hurt them in the long -  
run , in the short - term Tokyo ' s loss might be their gain . The U . S . Has said it will  
impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to  
stick to a pact not to sell semiconductors on world markets at below cost . Unofficial Japanese estimates put the impact of the tariffs at  
10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports 

Check the total number of documents.

In [None]:
len(corpus.fileids())

10788

You can train the model on the first K documents or on all documents.

In [None]:
# First K documents
# K=1000
# docs=[corpus.words(fileid) for fileid in corpus.fileids()[:K]]

# All documents
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:5])
print("num of docs:", len(docs))

[['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...], ['CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7', '-', ...], ['JAPAN', 'TO', 'REVISE', 'LONG', '-', 'TERM', ...], ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...], ['INDONESIA', 'SEES', 'CPO', 'PRICE', 'RISING', ...]]
num of docs: 10788


## Data preprocessing
First, let us define the stopwords. We combine the English stopwords from the NLTK package with noise tokens (punctuation, numbers, and very frequent corpus-specific words) that may affect our result.  
(Optional) Try filtering out numbers and unwanted words with a regular expression.

In [None]:
# English stopwords defined by the NLTK package.
en_stop = nltk.corpus.stopwords.words('english')

# Noise tokens that might affect our result: punctuation, numbers,
# and very frequent corpus-specific words.
punctuation = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<"]
numbers = ["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000"]
frequent_words = ["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price"]

# A set makes the membership tests in the preprocessing step fast.
en_stop = set(punctuation + numbers + frequent_words + en_stop)
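
The optional regular-expression filter mentioned above could be sketched as follows. This is only an illustration, not part of the original notebook: the helper name `is_noise` and the sample tokens are hypothetical, and the rule "a token without any letter is noise" is an assumption you may want to adjust.

```python
import re

# Hypothetical rule: a token is noise if it contains no ASCII letter.
# This covers pure numbers ("1986", "300") and punctuation runs (".-", ">.").
HAS_LETTER = re.compile(r"[a-z]", re.IGNORECASE)

def is_noise(token):
    """Return True for tokens that contain no letter at all."""
    return HAS_LETTER.search(token) is None

sample = ["JAPAN", "300", ".-", "dlrs", "1987", ">.", "tariffs"]
kept = [t for t in sample if not is_noise(t)]
print(kept)  # ['JAPAN', 'dlrs', 'tariffs']
```

A filter like this could replace the hand-written lists of digits and punctuation marks, while the corpus-specific words would still need an explicit list.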

Next, let us define several preprocessing functions.

In [None]:
from nltk.corpus import wordnet as wn # used for lemmatization

def preprocess_word(word, stopwordset):
    """Return the normalized form of `word`, or None if it should be dropped."""

    # 1. Convert the word to lowercase (e.g., Python => python).
    word=word.lower()

    # 2. Drop "," and ".".
    if word in [",","."]:
        return None

    # 3. Drop stopwords (e.g., the => None).
    if word in stopwordset:
        return None

    # 4. Lemmatize (e.g., cooked => cook); keep the word if WordNet has no lemma.
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # The lemma itself may be in the stopword set.
    elif lemma in stopwordset:
        return None
    else:
        return lemma


def preprocess_document(document):
    """Preprocess every word of one document and drop the removed ones."""
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    return document

def preprocess_documents(documents):
    """Preprocess a list of documents."""
    return [preprocess_document(document) for document in documents]

Let us check out the preprocessing result.

In [None]:
# before
print(docs[0][:25]) 

# after
print(preprocess_documents(docs)[0][:25])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears']
['asian', 'exporter', 'fear', 'damage', 'japan', 'rift', 'mounting', 'trade', 'friction', 'japan', 'raise', 'fear', 'among', 'many', 'asia', 'exporting', 'nation', 'row', 'could', 'inflict', 'far', 'reaching', 'economic', 'damage', 'businessmen']


## Clustering
Vectorize the documents with tf-idf. We use the TfidfVectorizer provided by the sklearn package and set its hyperparameters (here, `max_features` and `token_pattern`).

In [None]:
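# Preprocess, then join the tokens of each document into one space-separated
# string, because TfidfVectorizer expects raw text and tokenizes it itself.
pre_docs=preprocess_documents(docs)
pre_docs=[" ".join(doc) for doc in pre_docs]
print(pre_docs[0])

# define the vectorizer: keep the 200 most frequent terms;
# the token pattern also accepts one-character tokens
vectorizer = TfidfVectorizer(max_features=200, token_pattern=r'(?u)\b\w+\b')


# fit
tf_idf = vectorizer.fit_transform(pre_docs)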

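
To make the vectorization step more concrete, here is a minimal pure-Python sketch of what tf-idf weighting computes. It assumes scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and ignores `max_features`; the toy documents and the helper `tfidf` are hypothetical and not part of the notebook.

```python
import math
import re
from collections import Counter

# Same token pattern as the vectorizer above: runs of word characters,
# including one-character tokens.
TOKEN = re.compile(r"(?u)\b\w+\b")

# Hypothetical toy documents; each is one string of space-separated words.
toy_docs = ["japan trade tariff", "japan export", "oil price oil"]
tokenized = [TOKEN.findall(d) for d in toy_docs]

n_docs = len(tokenized)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Return the L2-normalized tf-idf weights of one tokenized document."""
    counts = Counter(doc)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in tokenized]
print(vectors[0])

# The tokens must be separated by whitespace: without a separator the
# pattern sees the whole document as a single token.
print(TOKEN.findall("".join(["japan", "trade", "tariff"])))  # ['japantradetariff']
```

In the first toy document, "japan" receives a lower weight than "trade" because it also occurs in the second document.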
We use K-means to cluster our documents.

In [None]:
# K-means setting
num_clusters = 8
km = KMeans(n_clusters=num_clusters, random_state = 0)

# fit
clusters = km.fit_predict(tf_idf)
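
For intuition about what K-means does, the following is a bare-bones sketch of Lloyd's algorithm in pure Python. It is not how scikit-learn implements `KMeans` (which, among other things, uses k-means++ initialization by default); the function name `kmeans`, its parameters, and the toy points are all hypothetical.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm on lists of floats (illustrative only)."""
    rng = random.Random(seed)
    # Pick k distinct points as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centers = kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.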

Check out the clustering result.

In [None]:
for doc, cls in zip(preprocess_documents(docs)[0], clusters):
    print(cls,doc)

1 asian
1 exporter
1 fear
1 damage
7 japan
1 rift
1 mounting
2 trade
1 friction
7 japan
1 raise
1 fear
1 among
1 many
1 asia
1 exporting
1 nation
1 row
1 could
1 inflict
1 far
1 reaching
1 economic
1 damage
1 businessmen
1 official
1 tell
1 reuter
1 correspondent
1 asian
1 capital
1 move
7 japan
1 might
1 boost
1 protectionist
1 sentiment
1 lead
1 curb
1 american
5 import
1 product
1 exporter
1 conflict
1 would
1 hurt
1 long
1 run
1 short
1 term
1 tokyo
1 loss
1 might
1 gain
1 impose
1 300
4 tariff
5 import
1 japanese
1 electronics
1 good
1 april
1 17
1 retaliation
7 japan
1 allege
1 failure
1 stick
1 pact
1 sell
1 semiconductor
1 world
1 market
1 cost
1 unofficial
1 japanese
1 estimate
1 put
1 impact
4 tariff
1 spokesman
1 major
1 electronics
1 firm
1 would
1 virtually
1 halt
1 export
1 product
1 hit
1 new
1 tax
1 able
1 business
1 spokesman
1 leading
1 japanese
1 electronics
1 firm
1 matsushita
1 electric
1 industrial
1 co
1 ltd
1 mc
1 >.
4 tariff
1 remain
1 place
1 length
1 time
1 b

## Hints

Both the vectorizer and the K-means implementation in scikit-learn have many hyperparameters. The vectorizer also offers built-in preprocessing options through hyperparameters (e.g., stop_words). The clustering result changes with these settings, so try different values and compare the results; see the following URLs for details.   
* About TF-IDF   
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html   
* About K-means   
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
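
As a small illustration of how vectorizer hyperparameters change the vocabulary, the sketch below compares the default settings with `stop_words="english"` and `max_features=5`. The toy sentences and variable names are hypothetical; the parameters themselves are documented on the TF-IDF page linked above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw sentences (not preprocessed).
toy_docs = [
    "the trade deficit of japan widened",
    "the price of crude oil rose",
    "the bank raised the interest rate",
]

# Default settings: every token of two or more characters is kept.
default_vec = TfidfVectorizer()
default_vec.fit(toy_docs)

# Built-in English stopword list plus a cap on the vocabulary size.
tuned_vec = TfidfVectorizer(stop_words="english", max_features=5)
tuned_vec.fit(toy_docs)

print(sorted(default_vec.vocabulary_))
print(sorted(tuned_vec.vocabulary_))
```

With the built-in stopword list, words such as "the" and "of" no longer appear in the vocabulary, and `max_features` limits it to at most five terms.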


## Try it yourself
Modify the code above in the following ways and compare the results:
1. Try other vectorization methods (e.g., bag-of-words)
2. Try other clustering methods (e.g., hierarchical clustering) or visualize the result of K-means.
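
As a possible starting point for both exercises, the sketch below combines a bag-of-words representation (`CountVectorizer`) with hierarchical clustering (`AgglomerativeClustering`) on a tiny made-up corpus. The toy documents and variable names are hypothetical, and the settings are only one of many reasonable choices.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of already preprocessed documents.
toy_docs = [
    "japan trade tariff import",
    "japan export tariff trade",
    "oil crude barrel opec",
    "crude oil opec output",
]

# Bag-of-words: raw term counts instead of tf-idf weights.
bow = CountVectorizer().fit_transform(toy_docs)

# AgglomerativeClustering needs a dense array, hence .toarray().
hier = AgglomerativeClustering(n_clusters=2)
hier_labels = hier.fit_predict(bow.toarray())
print(hier_labels)
```

Hierarchical clustering on a dense matrix can become slow and memory-hungry on the full corpus, so it may be worth starting with the first K documents.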