# Text Mining I - Tokenization
## Data Preparation
* Open the corpus in the file "hindu.txt" from https://en.wikipedia.org/wiki/Hindu
* Other public resource: http://www.gutenberg.org/
* Never "push" protected text to GitHub or other publicly available platforms.

# Loading Wikipedia data

In [16]:
# !pip install wikipedia
import wikipedia 
import string
# cv = wikipedia.page("Taipei")
# text = cv.content
# print(cv.url)
# print("The length of Taipei page is ", len(text))
# print(text[:100])

# text = wikipedia.page("Rembrandt").content
# print(len(text))

text  = wikipedia.summary("Rembrandt", sentences = 10)
print(type(text))
print("The length of Rembrandt summary is ", len(text))

<class 'str'>
The length of Rembrandt summary is  2170


### One more example

In [17]:
text  = wikipedia.summary("Hindus", sentences = 10)
print(type(text))
print("The length of Hindus summary is ", len(text))

<class 'str'>
The length of Hindus summary is  2016


In [11]:
!wget -N https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
!ls

--2021-05-01 14:05:12--  https://raw.githubusercontent.com/P4CSS/PSS/master/data/hindu.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2747 (2.7K) [text/plain]
Saving to: ‘hindu.txt’


Last-modified header missing -- time-stamps turned off.
2021-05-01 14:05:12 (67.5 MB/s) - ‘hindu.txt’ saved [2747/2747]

hindu.txt  hindu.txt.1	hindu.txt.2  sample_data


In [12]:
with open("hindu.txt", encoding="utf-8") as fin:  # the corpus contains non-ASCII characters
    text = fin.read()

# Length of the corpus (in characters) 

In [18]:
print("The length of the corpus: %d" % len(text))

The length of the corpus: 2016


## Content

In [19]:
print(text)

Hindus (Hindustani: [ˈɦɪndu] (listen); ) are persons who regard themselves as culturally, ethnically, or religiously adhering to aspects of Hinduism. Historically, the term has also been used as a geographical, cultural, and later religious identifier for people living in the Indian subcontinent.The historical meaning of the term Hindu has evolved with time. Starting with the Persian and Greek references to the land of the Indus in the 1st millennium BCE through the texts of the medieval era, the term Hindu implied a geographic, ethnic or cultural identifier for people living in the Indian subcontinent around or beyond the Sindhu (Indus) River. By the 16th century CE, the term began to refer to residents of the subcontinent who were not Turkic or Muslims. In DN Jha’s essay “Looking for a Hindu identity”, he writes: “No Indians described themselves as Hindus before the fourteenth century” and “Hinduism was a creation of the colonial period and cannot lay claim to any great antiquity”. H

# Tokenization

## Method 1. by built-in `.split()`

In [20]:
sentence_a = "What’s in a name? That which we call a rose by any other name would smell as sweet."
print(sentence_a.split(" "))

sentence_b = "2020/04/07 00:08:00"
print(sentence_b.split("/"))

['What’s', 'in', 'a', 'name?', 'That', 'which', 'we', 'call', 'a', 'rose', 'by', 'any', 'other', 'name', 'would', 'smell', 'as', 'sweet.']
['2020', '04', '07 00:08:00']
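
One detail worth noting about `.split()`: calling it with an explicit `" "` separator and calling it with no argument behave differently when the text contains consecutive spaces, tabs, or newlines. A minimal sketch (the sample string is made up for illustration):

```python
# Made-up string with a double space, a tab, a newline, and a trailing space
sample = "a  rose\tby any\nother name "

# With an explicit separator, every single space is a cut point,
# so consecutive or trailing spaces produce empty strings.
by_space = sample.split(" ")

# With no argument, any run of whitespace counts as one separator,
# and leading/trailing whitespace is ignored.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

For corpora copied from web pages, where tabs and line breaks are common, the no-argument form is often the safer default.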


In [21]:
print("123".isalpha())
print("abc".isalpha())

False
True


In [22]:
# Capitalization, punctuation, and digits still need to be handled
print(len(text.split(" ")))
print(text.split(" "))

319
['Hindus', '(Hindustani:', '[ˈɦɪndu]', '(listen);', ')', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.', 'Historically,', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'River.', 'By', 'the', '16th', 'century', 'CE,', 

## Method 2. by nltk's function

In [24]:
# Using NLTK, the Natural Language Toolkit
# The punkt tokenizer must be downloaded before first use
# It is saved under C:/Users/user/nltk_data
import nltk
nltk.download('punkt')

# Numbers and punctuation marks are now split off as separate tokens
from nltk.tokenize import word_tokenize
print(len(word_tokenize(text)))
print(word_tokenize(text))

375
['Hindus', '(', 'Hindustani', ':', '[', 'ˈɦɪndu', ']', '(', 'listen', ')', ';', ')', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally', ',', 'ethnically', ',', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '.', 'Historically', ',', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical', ',', 'cultural', ',', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', '.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', ',', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', ',', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindh

[nltk_data] Downloading package punkt to C:\Users\user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


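## Method 3. Design manually as a Python function
Consider how text is represented in Python: a string is a sequence of characters. Iterate over each character with a for loop; when an `if` detects a space, store the characters read so far as one word.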

In [27]:
# How .split() works

tok = "" # empty string used as a temporary buffer
word_list = []

for ch in text[:30]:
    if ch == " ":
        word_list.append(tok)
        tok = ""
        print("Add to word_list: ", word_list)
    else:
        tok += ch
        print(tok)

H
Hi
Hin
Hind
Hindu
Hindus
Add to word_list:  ['Hindus']
(
(H
(Hi
(Hin
(Hind
(Hindu
(Hindus
(Hindust
(Hindusta
(Hindustan
(Hindustani
(Hindustani:
Add to word_list:  ['Hindus', '(Hindustani:']
[
[ˈ
[ˈɦ
[ˈɦɪ
[ˈɦɪn
[ˈɦɪnd
[ˈɦɪndu
[ˈɦɪndu]
Add to word_list:  ['Hindus', '(Hindustani:', '[ˈɦɪndu]']
(


In [29]:
import math
def myfun(x, y):
    return math.sqrt(x**2 + y**2), x, y # returns a tuple
print(myfun(3, 4))

(5.0, 3, 4)


In [62]:
# Wrap the for loop above into a function
def my_tokenizer(txt):
    tok = "" # empty string used as a temporary buffer
    word_list = []

    for ch in txt:
        if ch == " ":
            word_list.append(tok)
            tok = ""
            # print("Add to word_list: ", word_list)
        else:
            tok += ch
            # print(tok)
    return word_list
            
word_list = my_tokenizer(text)
print(len(word_list))

# The last word is missed, because a token is only appended when the following space is read

318


In [32]:
tok = ""
if tok:
    print("Yes")
else:
    print("No")

No


In [34]:
# A problematic implementation for word tokenization.

def tokenize(text):
    tokens = []
    tok = ""
    for ch in text:
        if ch == " ":
            if tok: # if tok is not empty
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok: # append the last word (needed when the text does not end with a space)
        tokens.append(tok)
    return tokens

print(len(tokenize(text)))
print(tokenize(text))

319
['Hindus', '(Hindustani:', '[ˈɦɪndu]', '(listen);', ')', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.', 'Historically,', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'River.', 'By', 'the', '16th', 'century', 'CE,', 

## How to compare if two lists are identical?
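
In Python, two lists compare equal with `==` when they have the same length and equal elements in the same order. A small sketch, assuming a space-only tokenizer like the `tokenize` function above (redefined here as `simple_tokenize` so the block runs on its own; the sample sentence is made up):

```python
def simple_tokenize(text):
    """Split on single spaces, skipping empty tokens."""
    tokens, tok = [], ""
    for ch in text:
        if ch == " ":
            if tok:
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok:
        tokens.append(tok)
    return tokens

sample = "the term  Hindu has evolved"  # note the double space

ours = simple_tokenize(sample)
builtin = sample.split(" ")

# == is True only if both lists hold equal elements in the same order
print(ours == builtin)         # split(" ") keeps an empty string here
print(ours == sample.split())  # no-argument split() also skips empty tokens

# Locate the first position where the two lists disagree
first_diff = next(
    (i for i, (a, b) in enumerate(zip(ours, builtin)) if a != b), None
)
print(first_diff, len(ours), len(builtin))
```

When the lists differ, comparing lengths and finding the first mismatching index is usually enough to see which tokenization rule caused the difference.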


## Counting

In [82]:
from collections import Counter
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
word_count = Counter(tokens)
print(word_count.most_common(20)) # list of (word, count) tuples

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

[('the', 27), (',', 17), ('and', 14), ('.', 10), ('of', 9), ('Hindu', 9), ('to', 7), ('a', 7), ('in', 7), ('as', 6), ('or', 6), ('(', 5), (')', 5), ('term', 5), ('Indian', 5), ('century', 5), ('“', 4), ('”', 4), ('it', 4), ('Hindus', 3)]
the	27
,	17
and	14
.	10
of	9
Hindu	9
to	7
a	7
in	7
as	6
or	6
(	5
)	5
term	5
Indian	5
century	5
“	4
”	4
it	4
Hindus	3


# Stopword and sign removal

## Method 1. Removal of Punctuation Marks

In [85]:
import string
print(string.punctuation)

# list comprehension
clean_tokens = [tok for tok in tokens if tok not in string.punctuation] # keep tok in clean_tokens only when the condition holds
# Equivalent for loop:
# clean_tokens = []
# for tok in tokens:
#     if tok not in string.punctuation:
#         clean_tokens.append(tok)

word_count = Counter(clean_tokens)

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
the	27
and	14
of	9
Hindu	9
to	7
a	7
in	7
as	6
or	6
term	5
Indian	5
century	5
“	4
”	4
it	4
Hindus	3
Hinduism	3
cultural	3
for	3
with	3


In [86]:
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok not in string.punctuation:
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'Historically', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'River', 'By', 'the', '16th', 'century', 'CE', 'the', 'term', 'began', 'to',
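
The counts above still contain the curly quotes “ and ”, because `string.punctuation` only lists ASCII punctuation. Note also that `tok not in string.punctuation` is a substring test against a string, not a membership test against a set of characters. One possible alternative, sketched here with the standard `unicodedata` module (the helper name `is_punctuation` and the sample tokens are illustrative), treats a token as punctuation when every character belongs to a Unicode punctuation category:

```python
import unicodedata

def is_punctuation(token):
    """True if the token is non-empty and every character is Unicode punctuation."""
    # Unicode general categories starting with "P" cover punctuation
    # (e.g. Po for ',', Pi/Pf for curly quotes, Ps/Pe for brackets).
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

sample_tokens = ["Hindu", ",", "“", "identity", "”", "(", "1st", "."]
clean = [tok for tok in sample_tokens if not is_punctuation(tok)]
print(clean)
```

Characters such as `$` or `+` are classified as symbols (categories starting with `S`) rather than punctuation, so this check is not a drop-in superset of `string.punctuation`; the two can be combined if both behaviours are wanted.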

## Method 2. Removing all tokens that contain characters other than letters. 

In [87]:
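# Every token containing a non-letter character is removed (isalpha() accepts any Unicode letter, not only English ones)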

def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok.isalpha():
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

word_count = Counter(remove_punctuation_marks(tokens))
for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'Historically', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'River', 'By', 'the', 'century', 'CE', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 
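
Because `str.isalpha()` is true only when every character is a letter, tokens such as `'1st'` and `'subcontinent.The'` are dropped whole, as the output above shows. Where keeping the letter runs inside such tokens matters, one option is to split on non-letter characters first. A rough sketch using the standard `re` module (the helper name `letter_runs` is made up here):

```python
import re

for tok in ["Hindu", "1st", "subcontinent.The", "16th"]:
    print(tok, tok.isalpha())

# [^\W\d_] matches a word character that is neither a digit nor "_",
# i.e. a letter; "+" collects maximal runs of letters.
LETTERS = re.compile(r"[^\W\d_]+")

def letter_runs(token):
    """Return the maximal runs of letters found inside a token."""
    return LETTERS.findall(token)

print(letter_runs("subcontinent.The"))
```

This recovers `subcontinent` and `The` from the fused token, but it also reduces `'1st'` to `'st'`, so whether it is an improvement depends on the corpus.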

## A shorter implementation with a Python list comprehension.

In [88]:
def remove_punctuation_marks(tokens):
    return [tok for tok in tokens if tok.isalpha()]

print(remove_punctuation_marks(tokens))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'Historically', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'River', 'By', 'the', 'century', 'CE', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 

## New counting results with the removal of punctuation and digits.

In [89]:
tokens = remove_punctuation_marks(tokens)
word_count = Counter(tokens)

for w, c in word_count.most_common(10):
    print("%s\t%d" % (w, c))
    

the	27
and	14
of	9
Hindu	9
to	7
a	7
in	7
as	6
or	6
term	5


## Stopword Removal

Load an English stopword list from NLTK.

In [90]:
# Always inspect what you are removing: the list is all lowercase, and it also covers possessives and affirmative/negative words

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = stopwords.words('english')
print(stopword_list)

clean_tokens = []
for tok in tokens:
    if tok not in string.punctuation:
        if tok.lower() not in stopword_list:
            clean_tokens.append(tok)

word_count = Counter(clean_tokens)
for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Remove stopwords from the tokens.

In [91]:
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean

print(remove_stopwords(remove_punctuation_marks(tokens)))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'persons', 'regard', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'Historically', 'term', 'also', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'living', 'Indian', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'River', 'By', 'century', 'CE', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'In', 'DN', 'Jha', 'essay', 'Looking', 'Hindu', 'identity', 'writes', 'No', 'Indians', 'described', 'Hindus', 'fourteenth', 'century', 'Hinduism', 'creation', 'colonial', 'period', 'lay', 'claim', 'great', 'antiquity', 'He', 'wrote', 'The', 'British', 'borrowed', 'word', 'Hindu

# Handle Capitalization in English

## Solution 1: Converting all characters to lowercase. 

In [92]:
def lowercase(tokens):
    tokens_lower = []
    for tok in tokens:
        tokens_lower.append(tok.lower())
    return tokens_lower

print(remove_stopwords(lowercase(remove_punctuation_marks(tokens))))


['hindus', 'hindustani', 'ˈɦɪndu', 'listen', 'persons', 'regard', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'hinduism', 'historically', 'term', 'also', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'living', 'indian', 'historical', 'meaning', 'term', 'hindu', 'evolved', 'time', 'starting', 'persian', 'greek', 'references', 'land', 'indus', 'millennium', 'bce', 'texts', 'medieval', 'era', 'term', 'hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'indian', 'subcontinent', 'around', 'beyond', 'sindhu', 'indus', 'river', 'century', 'ce', 'term', 'began', 'refer', 'residents', 'subcontinent', 'turkic', 'muslims', 'dn', 'jha', 'essay', 'looking', 'hindu', 'identity', 'writes', 'indians', 'described', 'hindus', 'fourteenth', 'century', 'hinduism', 'creation', 'colonial', 'period', 'lay', 'claim', 'great', 'antiquity', 'wrote', 'british', 'borrowed', 'word', 'hindu', 'india', 'gave', 'new', 'mea

## Solution 2: Maintain the capitalization.

In [93]:
def remove_stopwords(tokens):
    tokens_clean = [tok for tok in tokens if tok.lower() not in stopword_list]
    # tokens_clean = []
    # for tok in tokens:
    #     if tok.lower() not in stopword_list:
    #         tokens_clean.append(tok)
    return tokens_clean
print(remove_stopwords(remove_punctuation_marks(tokens)))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'persons', 'regard', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'Historically', 'term', 'also', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'living', 'Indian', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'River', 'century', 'CE', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'Muslims', 'DN', 'Jha', 'essay', 'Looking', 'Hindu', 'identity', 'writes', 'Indians', 'described', 'Hindus', 'fourteenth', 'century', 'Hinduism', 'creation', 'colonial', 'period', 'lay', 'claim', 'great', 'antiquity', 'wrote', 'British', 'borrowed', 'word', 'Hindu', 'India', 'gave', 'new', 'mea


## New counting results with the removal of stopwords.

In [94]:
word_count = Counter(remove_stopwords(remove_punctuation_marks(tokens)))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

Hindu	9
term	5
Indian	5
century	5
Hindus	3
Hinduism	3
cultural	3
identity	3
used	2
religious	2
identifier	2
people	2
living	2
historical	2
meaning	2
Indus	2
texts	2
medieval	2
era	2
subcontinent	2


## Lowercased results with the removal of stopwords.

In [95]:
word_count = Counter(remove_stopwords(lowercase(remove_punctuation_marks(tokens))))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

hindu	9
term	5
indian	5
century	5
hindus	3
hinduism	3
cultural	3
identity	3
used	2
religious	2
identifier	2
people	2
living	2
historical	2
meaning	2
indus	2
texts	2
medieval	2
era	2
subcontinent	2


# Stemming

Stemming with Snowball algorithm implemented by NLTK.

Reference: http://snowball.tartarus.org/texts/introduction.html

In [96]:
from collections import Counter
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
tokens = [tok for tok in tokens if tok.isalpha()]
tokens = [tok for tok in tokens if tok.lower() not in stopword_list]

from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

stemmed_tokens = []
for tok in tokens:
    stemmed_tokens.append(snowball_stemmer.stem(tok))
word_count = Counter(stemmed_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))


hindu	9
indian	6
term	5
centuri	5
cultur	4
hindus	3
religi	3
hinduism	3
histor	3
use	3
refer	3
ident	3
develop	3
ethnic	2
geograph	2
identifi	2
peopl	2
live	2
mean	2
indus	2
text	2
mediev	2
era	2
subcontin	2
ce	2
began	2
muslim	2
coloni	2
british	2
india	2
may	2
sens	2
dharma	2
hindustani	1
ˈɦɪndu	1
listen	1
person	1
regard	1
adher	1
aspect	1
also	1
later	1
evolv	1
time	1
start	1
persian	1
greek	1
land	1
millennium	1
bce	1


# Lemmatization

Perform lemmatization with WordNet, a lexical ontology, via NLTK. This is a simplified version that does not require part-of-speech information: it tries each part of speech in turn and returns the first lemma that differs from the token.

In [99]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# initialize the Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(token):
    # ADJ (a), ADJ_SAT (s), ADV (r), NOUN (n) or VERB (v)
    for p in ['v', 'n', 'a', 'r', 's']:
        l = wordnet_lemmatizer.lemmatize(token, pos=p)
        if l != token:
            return l
    return token

print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))

Dogs
dog
hit


[nltk_data] Downloading package wordnet to C:\Users\user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Show the differences between stemming and lemmatization.

In [100]:
# Lemmatization generally gives better results than stemming

for w in [
    'open', 'opens', 'opened', 'opening', 'unopened',
    'talk', 'talks', 'talked', 'talking',
    'decompose', 'decomposes', 'decomposed', 'decomposing',
    'do', 'does', 'did', 
    'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
    'lied', 'lies', 'lay', 'lain', 'lying', 
    'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly', 
    'install', 'installed', 'uninstall',
    'internalization', 'internationalization',
    'decontextualization', 'decontextualized', 'decentralization', 'decentralized']:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    # if s != l:
    print("%s\t%s\t%s" % (w, s, l))

open	open	open
opens	open	open
opened	open	open
opening	open	open
unopened	unopen	unopened
talk	talk	talk
talks	talk	talk
talked	talk	talk
talking	talk	talk
decompose	decompos	decompose
decomposes	decompos	decompose
decomposed	decompos	decompose
decomposing	decompos	decompose
do	do	do
does	doe	do
did	did	do
wrote	wrote	write
written	written	write
ran	ran	run
gave	gave	give
held	held	hold
went	went	go
gone	gone	go
lied	lie	lie
lies	lie	lie
lay	lay	lay
lain	lain	lie
lying	lie	lie
cats	cat	cat
people	peopl	people
feet	feet	foot
women	women	woman
smoothly	smooth	smoothly
firstly	first	firstly
secondly	second	secondly
install	instal	install
installed	instal	instal
uninstall	uninstal	uninstall
internalization	intern	internalization
internationalization	internation	internationalization
decontextualization	decontextu	decontextualization
decontextualized	decontextu	decontextualized
decentralization	decentr	decentralization
decentralized	decentr	decentralize


## New counting results with lemmatization. 

In [101]:
from collections import Counter
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
tokens = [tok for tok in tokens if tok.isalpha()]
tokens = [tok for tok in tokens if tok.lower() not in stopword_list]

lemmatized_tokens = []
for tok in tokens:
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))

Hindu	9
term	5
Indian	5
century	5
Hindus	3
Hinduism	3
use	3
cultural	3
identity	3
religious	2
identifier	2
people	2
live	2
historical	2
mean	2
Indus	2
text	2
medieval	2
era	2
subcontinent	2
CE	2
begin	2
refer	2
write	2
colonial	2
British	2
India	2
may	2
sense	2
develop	2
dharma	2
Hindustani	1
ˈɦɪndu	1
listen	1
person	1
regard	1
culturally	1
ethnically	1
religiously	1
adhere	1
aspect	1
Historically	1
also	1
geographical	1
late	1
evolve	1
time	1
Starting	1
Persian	1
Greek	1


# Applications: Generate data for WordCloud rendering.

https://www.jasondavies.com/wordcloud/

In [102]:
repeated_tokens = []
for w, c in word_count.most_common():
    for i in range(c):
        repeated_tokens.append(w) # repeat each word as many times as it occurs
print(" ".join(repeated_tokens)) # join repeated_tokens into one space-separated text

Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu Hindu term term term term term Indian Indian Indian Indian Indian century century century century century Hindus Hindus Hindus Hinduism Hinduism Hinduism use use use cultural cultural cultural identity identity identity religious religious identifier identifier people people live live historical historical mean mean Indus Indus text text medieval medieval era era subcontinent subcontinent CE CE begin begin refer refer write write colonial colonial British British India India may may sense sense develop develop dharma dharma Hindustani ˈɦɪndu listen person regard culturally ethnically religiously adhere aspect Historically also geographical late evolve time Starting Persian Greek reference land millennium BCE imply geographic ethnic around beyond Sindhu River resident Turkic Muslims DN Jha essay Looking Indians describe fourteenth creation period lay claim great antiquity borrow word give new significance reimported reify phenomenon call E
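
The nested loop above can also be written as a single comprehension. A compact sketch, using a small made-up `Counter` in place of the notebook's `word_count`:

```python
from collections import Counter

word_count = Counter({"Hindu": 3, "term": 2, "era": 1})  # made-up counts

# Repeat each word as many times as it occurs, most frequent first
repeated_tokens = [w for w, c in word_count.most_common() for _ in range(c)]
cloud_text = " ".join(repeated_tokens)
print(cloud_text)
```

The resulting string can be pasted into a word cloud generator that sizes words by how often they appear in the input.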