# Text Mining I - Tokenization
## Data Preparation
* The corpus is in the file "hindu.txt", taken from https://en.wikipedia.org/wiki/Hindu
* Another public resource: http://www.gutenberg.org/
* Never "push" protected text to GitHub or other publicly available platforms.

# Loading Wikipedia data

In [6]:
import wikipedia 
import string
# cv = wikipedia.page("Taipei")
# text = cv.content
# print(cv.url)
# print("The length of Taipei page is ", len(text))
# print(text[:100])

# text = wikipedia.page("Rembrandt").content
# print(len(text))

text  = wikipedia.summary("Rembrandt", sentences = 10)
type(text)
print("The length of Rembrandt summary is ", len(text))

The length of Rembrandt summary is  2129


### One more example

In [9]:
text  = wikipedia.summary("Hindus", sentences = 10)
type(text)
print("The length of Hindus summary is ", len(text))

The length of Hindus summary is  1914


In [3]:
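# Read the corpus as UTF-8; the text contains non-ASCII characters such as IPA symbols.
with open("data/hindu.txt", encoding="utf-8") as fin:
    text = fin.read()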

## Length of the corpus (in characters) 

In [10]:
print("The length of the corpus: %d" % len(text))

The length of the corpus: 1914


## Content

In [11]:
print(text)

Hindus (Hindustani: [ˈɦɪndu] (listen)) are persons who regard themselves as culturally, ethnically, or religiously adhering to aspects of Hinduism. Historically, the term has also been used as a geographical, cultural, and later religious identifier for people living in the Indian subcontinent.The historical meaning of the term Hindu has evolved with time. Starting with the Persian and Greek references to the land of the Indus in the 1st millennium BCE through the texts of the medieval era, the term Hindu implied a geographic, ethnic or cultural identifier for people living in the Indian subcontinent around or beyond the Sindhu (Indus) river. By the 16th century, the term began to refer to residents of the subcontinent who were not Turkic or Muslims.The historical development of Hindu self-identity within the local South Asian population, in a religious or cultural sense, is unclear. Competing theories state that Hindu identity developed in the British colonial era, or that it may have

## Tokenization

## Method 1. Using the built-in `.split()`

In [14]:
sentence_a = "What’s in a name? That which we call a rose by any other name would smell as sweet."
print(sentence_a.split(" "))

sentence_b = "2020/04/07 00:08:00"
print(sentence_b.split("/"))

['What’s', 'in', 'a', 'name?', 'That', 'which', 'we', 'call', 'a', 'rose', 'by', 'any', 'other', 'name', 'would', 'smell', 'as', 'sweet.']
['2020', '04', '07 00:08:00']
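
Note that `.split(" ")` and `.split()` are not the same. The sketch below uses a made-up string (`sample` is illustrative and not part of the corpus) to show the difference: splitting on a single space keeps empty strings wherever two spaces meet and leaves tabs and newlines inside tokens, while `.split()` with no argument splits on any run of whitespace.

```python
# Illustrative string with a double space, a tab, a newline, and a trailing space.
sample = "a  rose\tby any\nother name "

by_space = sample.split(" ")    # split at every single space character
by_whitespace = sample.split()  # split at runs of any whitespace

print(by_space)
print(by_whitespace)
```

For text copied from web pages, `.split()` is usually the safer default, because stray double spaces and line breaks do not produce empty or merged tokens.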


In [15]:
print("123".isalpha())
print("abc".isalpha())

False
True


In [43]:
print(len(text.split(" ")))
print(text.split(" "))


300
['Hindus', '(Hindustani:', '[ˈɦɪndu]', '(listen))', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.', 'Historically,', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.', 'By', 'the', '16th', 'century,', 'the', 'ter

## Method 2. Using NLTK's `word_tokenize()`

In [53]:
# import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize
print(len(word_tokenize(text)))
print(word_tokenize(text))


341
['Hindus', '(', 'Hindustani', ':', '[', 'ˈɦɪndu', ']', '(', 'listen', ')', ')', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally', ',', 'ethnically', ',', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '.', 'Historically', ',', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical', ',', 'cultural', ',', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', '.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', ',', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', ',', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '

## Method 3. Writing a tokenizer by hand

In [35]:
import math
def myfun(x, y):
    return math.sqrt(x**2 + y**2), x, y
print(myfun(3, 4))

(5.0, 3, 4)


In [42]:

def my_tokenizer(txt):
    tok = ""
    word_list = []

    for ch in txt:
        if ch == " ":
            word_list.append(tok)
            tok = ""
#             print("Word_list: ", word_list)
        else:
            tok += ch
#             print(tok)
    return word_list
            
word_list = my_tokenizer(text)
print(len(word_list))
        

299

The count is 299 rather than the 300 returned by `text.split(" ")` because the loop only appends a token when it meets a space, so the final token of the text is never added to `word_list`.


In [47]:
tok = " "
if tok:
    print("Yes")
else:
    print("No")

Yes


In [54]:
# A problematic implementation of word tokenization: it splits only on spaces,
# so punctuation stays attached to the words (e.g. 'culturally,').

def tokenize(text):
    tokens = []
    tok = ""
    for ch in text:
        if ch == " ":
            if tok:
                tokens.append(tok)
                tok = ""
        else:
            tok += ch
    if tok:
        tokens.append(tok)
    return tokens

print(len(tokenize(text)))
print(tokenize(text))


300
['Hindus', '(Hindustani:', '[ˈɦɪndu]', '(listen))', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally,', 'ethnically,', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism.', 'Historically,', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical,', 'cultural,', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent.The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time.', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era,', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic,', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', '(Indus)', 'river.', 'By', 'the', '16th', 'century,', 'the', 'ter
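
The space-only tokenizer above leaves punctuation glued to words. One possible refinement, which is not part of the original notebook, is a regular expression that treats each punctuation mark as its own token. The names `TOKEN_PATTERN` and `regex_tokenize` are hypothetical, and the result will not match `word_tokenize` exactly (for example, contractions are split differently).

```python
import re

# \w+     : a run of letters, digits, or underscores
# [^\w\s] : a single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Return words and individual punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "By the 16th century, the term began to refer to residents (of the subcontinent)."
sample_tokens = regex_tokenize(sample)
print(sample_tokens)
```

Because the pattern never matches whitespace, repeated spaces, tabs, and newlines are skipped without any extra check.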

## How to check whether two lists are identical?
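
In Python, `==` compares two lists element by element, so `text.split(" ") == tokenize(text)` is enough to test whether two tokenizers agree. When they do not, it helps to know where they first diverge. The helper below is a small sketch; `first_difference` and the sample string are illustrative names, not part of the notebook.

```python
def first_difference(list_a, list_b):
    """Return the first index where the lists differ, or None if they are identical."""
    for i, (a, b) in enumerate(zip(list_a, list_b)):
        if a != b:
            return i
    # All shared positions match; the lists differ only if one is longer.
    if len(list_a) != len(list_b):
        return min(len(list_a), len(list_b))
    return None

sample = "to be  or not to be"  # note the double space
tokens_split = sample.split(" ")
tokens_clean = [tok for tok in sample.split(" ") if tok]  # drop empty tokens

print(tokens_split == tokens_clean)
print(first_difference(tokens_split, tokens_clean))
```

Comparing only the lengths is not sufficient: two lists can have the same length and still contain different tokens.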


## Counting

In [21]:
from collections import Counter

tokens = word_tokenize(text)
word_count = Counter(tokens)

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

the	35
,	33
[	30
]	30
.	18
and	16
of	14
to	12
in	12
Hindu	11
or	6
term	6
century	6
Hindus	6
a	5
The	5
as	4
for	4
Indian	4
3	4


## Removal of Punctuation Marks

In [79]:
import string
print(string.punctuation)       

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [20]:
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok not in string.punctuation:
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

['Hindu', 'refers', 'to', 'any', 'person', 'who', 'regards', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', '1', '2', 'It', 'has', 'historically', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'indigenous', 'to', 'the', 'Indian', 'subcontinent', '3', '4', 'The', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', '1st', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', '5', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', '6', 'By', 'the', '16th', 'century', 'the', 'term', 'began', 'to', 'ref
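
One caveat about the cell above: `tok not in string.punctuation` is a substring test against one long string. It works for single-character tokens, but a multi-character token such as `...` is kept because that exact sequence does not occur in `string.punctuation`. A stricter variant, sketched here with hypothetical names (`PUNCT`, `is_punctuation`, `remove_punctuation_tokens`), checks every character of the token against a set.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(tok):
    """True if tok is non-empty and every character is ASCII punctuation."""
    return bool(tok) and all(ch in PUNCT for ch in tok)

def remove_punctuation_tokens(tokens):
    """Keep only the tokens that are not made up entirely of punctuation."""
    return [tok for tok in tokens if not is_punctuation(tok)]

sample = ["Hello", ",", "world", "...", "''", "(", "e.g."]
cleaned = remove_punctuation_tokens(sample)

print("..." in string.punctuation)  # substring test used in the original cell
print(cleaned)
```

Tokens that mix letters and punctuation, such as `e.g.`, are kept by both versions.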

## Method 2. Removing all tokens that contain characters other than letters. 

In [81]:
def remove_punctuation_marks(tokens):
    clean_tokens = []
    for tok in tokens:
        if tok.isalpha():
            clean_tokens.append(tok)
    return clean_tokens

print(remove_punctuation_marks(tokens))

['Rembrandt', 'Harmenszoon', 'van', 'Rijn', 'also', 'US', 'Dutch', 'ˈrɛmbrɑnt', 'ˈɦɑrmə', 'n', 'soːn', 'vɑn', 'ˈrɛin', 'listen', 'July', 'October', 'was', 'a', 'Dutch', 'draughtsman', 'painter', 'and', 'printmaker', 'An', 'innovative', 'and', 'prolific', 'master', 'in', 'three', 'media', 'he', 'is', 'generally', 'considered', 'one', 'of', 'the', 'greatest', 'visual', 'artists', 'in', 'the', 'history', 'of', 'art', 'and', 'the', 'most', 'important', 'in', 'Dutch', 'art', 'history', 'Unlike', 'most', 'Dutch', 'masters', 'of', 'the', 'century', 'Rembrandt', 'works', 'depict', 'a', 'wide', 'range', 'of', 'style', 'and', 'subject', 'matter', 'from', 'portraits', 'and', 'to', 'landscapes', 'genre', 'scenes', 'allegorical', 'and', 'historical', 'scenes', 'and', 'biblical', 'and', 'mythological', 'themes', 'as', 'well', 'as', 'animal', 'studies', 'His', 'contributions', 'to', 'art', 'came', 'in', 'a', 'period', 'of', 'great', 'wealth', 'and', 'cultural', 'achievement', 'that', 'historians', 'c

A shorter implementation with a Python list comprehension.

In [8]:
def remove_punctuation_marks(tokens):
    return [tok for tok in tokens if tok.isalpha()]

print(remove_punctuation_marks(tokens))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'are', 'persons', 'who', 'regard', 'themselves', 'as', 'culturally', 'ethnically', 'or', 'religiously', 'adhering', 'to', 'aspects', 'of', 'Hinduism', 'Historically', 'the', 'term', 'has', 'also', 'been', 'used', 'as', 'a', 'geographical', 'cultural', 'and', 'later', 'religious', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'historical', 'meaning', 'of', 'the', 'term', 'Hindu', 'has', 'evolved', 'with', 'time', 'Starting', 'with', 'the', 'Persian', 'and', 'Greek', 'references', 'to', 'the', 'land', 'of', 'the', 'Indus', 'in', 'the', 'millennium', 'BCE', 'through', 'the', 'texts', 'of', 'the', 'medieval', 'era', 'the', 'term', 'Hindu', 'implied', 'a', 'geographic', 'ethnic', 'or', 'cultural', 'identifier', 'for', 'people', 'living', 'in', 'the', 'Indian', 'subcontinent', 'around', 'or', 'beyond', 'the', 'Sindhu', 'Indus', 'river', 'By', 'the', 'century', 'the', 'term', 'began', 'to', 'refer', 'to', 'residents', 'of', 'the',
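
Filtering with `.isalpha()` is strict: a token is kept only if every character is a letter. In the output above, tokens such as `1st`, `16th`, and `subcontinent.The` are dropped entirely, not just stripped of their digits or punctuation. The sketch below contrasts this with a more permissive filter that keeps any token containing at least one letter; `has_letter` is a hypothetical helper, and which behavior is preferable depends on the task.

```python
samples = ["Hindu", "16th", "subcontinent.The", "self-identity", "ˈɦɪndu", "3"]

# str.isalpha() is True only if every character is a letter (it is Unicode-aware).
alpha_only = [tok for tok in samples if tok.isalpha()]

def has_letter(tok):
    """Hypothetical, more permissive test: at least one letter in the token."""
    return any(ch.isalpha() for ch in tok)

with_letters = [tok for tok in samples if has_letter(tok)]

print(alpha_only)
print(with_letters)
```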

New counting results after removing punctuation and digits.

In [9]:
tokens = remove_punctuation_marks(tokens)
word_count = Counter(tokens)

for w, c in word_count.most_common(10):
    print("%s\t%d" % (w, c))
    

the	29
and	12
of	10
in	10
to	9
Hindu	8
term	7
or	6
century	6
as	4


## Stopword Removal

Load an English stopword list from NLTK.

In [11]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = stopwords.words('english')
print(stopword_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jirlong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Remove stopwords from the tokens.

In [12]:
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean

print(remove_stopwords(tokens))

['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'persons', 'regard', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'Historically', 'term', 'also', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'living', 'Indian', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'By', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'may', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'me

## Handle Capitalization in English

### Solution 1: Converting all characters to lowercase. 

In [13]:
def lowercase(tokens):
    tokens_lower = []
    for tok in tokens:
        tokens_lower.append(tok.lower())
    return tokens_lower

print(remove_stopwords(lowercase(tokens)))


['hindus', 'hindustani', 'ˈɦɪndu', 'listen', 'persons', 'regard', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'hinduism', 'historically', 'term', 'also', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'living', 'indian', 'historical', 'meaning', 'term', 'hindu', 'evolved', 'time', 'starting', 'persian', 'greek', 'references', 'land', 'indus', 'millennium', 'bce', 'texts', 'medieval', 'era', 'term', 'hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'indian', 'subcontinent', 'around', 'beyond', 'sindhu', 'indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'turkic', 'historical', 'development', 'hindu', 'within', 'local', 'south', 'asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'competing', 'theories', 'state', 'hindu', 'identity', 'developed', 'british', 'colonial', 'era', 'may', 'developed', 'century', 'ce', 'islamic', 'invasion', 'medieval

### Solution 2: Maintain the capitalization.

In [14]:
def remove_stopwords(tokens):
    tokens_clean = []
    for tok in tokens:
        if tok.lower() not in stopword_list:
            tokens_clean.append(tok)
    return tokens_clean
print(remove_stopwords(tokens))


['Hindus', 'Hindustani', 'ˈɦɪndu', 'listen', 'persons', 'regard', 'culturally', 'ethnically', 'religiously', 'adhering', 'aspects', 'Hinduism', 'Historically', 'term', 'also', 'used', 'geographical', 'cultural', 'later', 'religious', 'identifier', 'people', 'living', 'Indian', 'historical', 'meaning', 'term', 'Hindu', 'evolved', 'time', 'Starting', 'Persian', 'Greek', 'references', 'land', 'Indus', 'millennium', 'BCE', 'texts', 'medieval', 'era', 'term', 'Hindu', 'implied', 'geographic', 'ethnic', 'cultural', 'identifier', 'people', 'living', 'Indian', 'subcontinent', 'around', 'beyond', 'Sindhu', 'Indus', 'river', 'century', 'term', 'began', 'refer', 'residents', 'subcontinent', 'Turkic', 'historical', 'development', 'Hindu', 'within', 'local', 'South', 'Asian', 'population', 'religious', 'cultural', 'sense', 'unclear', 'Competing', 'theories', 'state', 'Hindu', 'identity', 'developed', 'British', 'colonial', 'era', 'may', 'developed', 'century', 'CE', 'Islamic', 'invasion', 'medieval
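
The difference between the two versions is visible with `By`: it survives the first `remove_stopwords` because only the lowercase form `by` is in the stopword list, and it is removed once the token is lowercased before the lookup. The sketch below shows the same idea with a set, which makes each membership test a hash lookup instead of a scan through the list. `STOPWORDS` here is a tiny hand-written stand-in for illustration only; in the notebook one could use `set(stopword_list)` instead.

```python
# Hypothetical mini stopword set for illustration; the notebook uses NLTK's English list.
STOPWORDS = frozenset({"the", "by", "of", "to", "in", "a", "and", "or"})

def remove_stopwords_ci(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is a stopword; keep the casing of the rest."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

sample = ["By", "the", "16th", "century", "the", "term", "Hindu"]
filtered = remove_stopwords_ci(sample)
print(filtered)
```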

### New counting results with the removal of stopwords.

In [15]:
word_count = Counter(remove_stopwords(tokens))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

Hindu	8
term	7
century	6
Indian	4
Hindus	3
used	3
cultural	3
religious	3
texts	3
colonial	3
Hinduism	2
identifier	2
people	2
living	2
historical	2
Indus	2
medieval	2
era	2
subcontinent	2
began	2


### Lowercased results with the removal of stopwords.

In [16]:
word_count = Counter(remove_stopwords(lowercase(tokens)))

for w, c in word_count.most_common(20):
    print("%s\t%d" % (w, c))

hindu	8
term	7
century	6
indian	4
hindus	3
used	3
cultural	3
religious	3
texts	3
colonial	3
hinduism	2
identifier	2
people	2
living	2
historical	2
indus	2
medieval	2
era	2
subcontinent	2
began	2


## Stemming

Stemming with the Snowball algorithm as implemented in NLTK.

Reference: http://snowball.tartarus.org/texts/introduction.html

In [90]:
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

stemmed_tokens = []
for tok in remove_stopwords(tokens):
    stemmed_tokens.append(snowball_stemmer.stem(tok))
word_count = Counter(stemmed_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))


dutch	9
rembrandt	6
art	6
artist	5
master	3
histori	3
import	3
portrait	3
scene	3
mani	3
also	2
painter	2
printmak	2
innov	2
prolif	2
greatest	2
work	2
style	2
genr	2
studi	2
contribut	2
achiev	2
golden	2
age	2
paint	2
baroqu	2
new	2
like	2
year	2
etch	2
form	2
harmenszoon	1
van	1
rijn	1
us	1
ˈrɛmbrɑnt	1
ˈɦɑrmə	1
n	1
soːn	1
vɑn	1
ˈrɛin	1
listen	1
juli	1
octob	1
draughtsman	1
three	1
media	1
general	1
consid	1
one	1


## Lemmatization

Perform lemmatization with WordNet, a lexical ontology, via NLTK. This is a lazy version that does not require part-of-speech information: it tries each part of speech in turn and returns the first lemma that differs from the input token.

In [91]:
# import nltk
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(token):
    # ADJ (a), ADJ_SAT (s), ADV (r), NOUN (n) or VERB (v)
    for p in ['v', 'n', 'a', 'r', 's']:
        l = wordnet_lemmatizer.lemmatize(token, pos=p)
        if l != token:
            return l
    return token

print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))


Dogs
dog
hit

Note that `Dogs` comes back unchanged while `dogs` becomes `dog`: the lookup is case-sensitive, so capitalized tokens should be lowercased first when case does not matter.


Show the differences between stemming and lemmatization.

In [92]:
for w in [
    'open', 'opens', 'opened', 'opening', 'unopened',
    'talk', 'talks', 'talked', 'talking',
    'decompose', 'decomposes', 'decomposed', 'decomposing',
    'do', 'does', 'did', 
    'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
    'lied', 'lies', 'lay', 'lain', 'lying', 
    'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly', 
    'install', 'installed', 'uninstall',
    'internalization', 'internationalization',
    'decontextualization', 'decontextualized', 'decentralization', 'decentralized']:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    if s != l:
        print("%s\t%s\t%s" % (w, s, l))
    


unopened	unopen	unopened
decompose	decompos	decompose
decomposes	decompos	decompose
decomposed	decompos	decompose
decomposing	decompos	decompose
does	doe	do
did	did	do
wrote	wrote	write
written	written	write
ran	ran	run
gave	gave	give
held	held	hold
went	went	go
gone	gone	go
lain	lain	lie
people	peopl	people
feet	feet	foot
women	women	woman
smoothly	smooth	smoothly
firstly	first	firstly
secondly	second	secondly
install	instal	install
uninstall	uninstal	uninstall
internalization	intern	internalization
internationalization	internation	internationalization
decontextualization	decontextu	decontextualization
decontextualized	decontextu	decontextualized
decentralization	decentr	decentralization
decentralized	decentr	decentralize


New counting results with lemmatization. 

In [93]:
lemmatized_tokens = []
for tok in remove_stopwords(tokens):
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)

for w, c in word_count.most_common(50):
    print("%s\t%d" % (w, c))


Dutch	9
Rembrandt	6
art	6
artist	5
master	3
great	3
history	3
important	3
portrait	3
scene	3
many	3
also	2
painter	2
innovative	2
prolific	2
work	2
style	2
genre	2
study	2
contribution	2
Golden	2
Age	2
paint	2
Baroque	2
new	2
year	2
etch	2
form	2
Harmenszoon	1
van	1
Rijn	1
US	1
ˈrɛmbrɑnt	1
ˈɦɑrmə	1
n	1
soːn	1
vɑn	1
ˈrɛin	1
listen	1
July	1
October	1
draughtsman	1
printmaker	1
three	1
medium	1
generally	1
consider	1
one	1
visual	1
Unlike	1


# Applications: Generate data for WordCloud rendering.

https://www.jasondavies.com/wordcloud/

In [17]:
repeated_tokens = []
for w, c in word_count.most_common():
    for i in range(c):
        repeated_tokens.append(w)
print(" ".join(repeated_tokens))


hindu hindu hindu hindu hindu hindu hindu hindu term term term term term term term century century century century century century indian indian indian indian hindus hindus hindus used used used cultural cultural cultural religious religious religious texts texts texts colonial colonial colonial hinduism hinduism identifier identifier people people living living historical historical indus indus medieval medieval era era subcontinent subcontinent began began refer refer within within sense sense identity identity developed developed dharma dharma islam islam hindustani ˈɦɪndu listen persons regard culturally ethnically religiously adhering aspects historically also geographical later meaning evolved time starting persian greek references land millennium bce implied geographic ethnic around beyond sindhu river residents turkic development local south asian population unclear competing theories state british may ce islamic invasion wars appears dated sanskrit bengali poets vidyapati kabi
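
The nested loop above can also be written as a single generator expression. This is a compact sketch using a small made-up `Counter` so that it runs on its own; in the notebook, `word_count` would come from the earlier cells.

```python
from collections import Counter

# Small made-up counts for illustration.
word_count = Counter(["hindu", "term", "hindu", "century", "term", "hindu"])

# Repeat each word as many times as it occurs, most frequent first.
repeated = " ".join(w for w, c in word_count.most_common() for _ in range(c))
print(repeated)
```

`Counter.elements()` also repeats each element by its count, but it yields elements in the order they were first seen, not by frequency, so `most_common()` is used here to keep the same ordering as the original loop.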