# Word Vectorization & Language Tagger
For these tasks, we can utilize a variety of libraries, the most popular being `nltk`, `spacy`, and `gensim`. Each comes with support for unsupervised training on a set of data and for pretrained models.

## Gensim
One of the less popular packages, as it is somewhat overshadowed by `nltk` for language processing and `sklearn` for machine learning tools, `gensim` is still a very versatile and self-contained library when you wish to dabble in NLP.

In [2]:
# install the library here
!pip install gensim



### Using pretrained model from gensim
`gensim` provides a list of pretrained models through its `gensim.downloader` module (see https://github.com/RaRe-Technologies/gensim-data). These are mostly English models and datasets, but they suffice to show what a custom model is supposed to contain.

In [4]:
import gensim.downloader as api
# loading the smallest model, 6B uncased
model = api.load("glove-wiki-gigaword-50")
# see the most similar words to a random word
print(model.most_similar("flower"))
# see the word vector itself
model["flower"]

[('flowers', 0.8447046875953674), ('fruit', 0.7923509478569031), ('tree', 0.7542152404785156), ('fruits', 0.7288298606872559), ('garden', 0.7106596231460571), ('lavender', 0.7078107595443726), ('purple', 0.706515371799469), ('ornaments', 0.7038703560829163), ('roses', 0.7011839151382446), ('fragrant', 0.6990464329719543)]


array([ 0.075439 ,  1.2659   , -1.3179   ,  0.11341  ,  1.4513   ,
        0.17337  , -0.56265  , -1.0706   ,  0.54898  ,  0.30163  ,
       -0.11471  ,  0.38498  ,  0.9205   , -0.2491   ,  0.3308   ,
        0.060113 , -0.0068846,  0.086864 , -0.20535  , -0.86098  ,
        0.10007  , -0.75486  ,  0.48225  , -0.33253  , -0.23791  ,
        0.17345  ,  0.49777  ,  0.88761  ,  0.089471 , -0.56217  ,
        1.8535   , -0.0055493,  0.45845  ,  0.53943  ,  0.3247   ,
        0.43479  , -0.027253 ,  0.44744  , -0.27514  , -0.016152 ,
       -0.51024  , -0.10113  , -0.80985  , -0.31571  ,  1.5817   ,
        0.2105   , -0.1844   , -1.7266   ,  0.092685 , -0.55696  ],
      dtype=float32)
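The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the snippet below computes cosine similarity by hand on made-up 3-dimensional vectors; the `toy_vectors` values and the `cosine_similarity` helper are assumptions for illustration and have nothing to do with the actual GloVe weights.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "flower": [1.0, 2.0, 0.0],
    "flowers": [1.0, 2.0, 0.5],
    "engine": [-2.0, 0.5, 1.0],
}

# Rank every other word by its similarity to the query word.
query = toy_vectors["flower"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "flower"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Vectors pointing in nearly the same direction score close to 1, which is why `flowers` ranks first in this toy example.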

### Customized training of a gensim model
While interesting, a pretrained model is of limited use when we want one for our own data and language. `gensim` also provides tools for that task. In this example, we use data from 20 Newsgroups to make up a toy monolingual dataset. Note that we do not call `Phraser` in this example, although doing so is good practice for English data.

In [6]:
!wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
!tar -xvf 20news-18828.tar.gz

--2020-11-25 13:10:05--  http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
Resolving qwone.com (qwone.com)... 173.48.209.137
Connecting to qwone.com (qwone.com)|173.48.209.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14666916 (14M) [application/x-gzip]
Saving to: ‘20news-18828.tar.gz’


2020-11-25 13:10:14 (1.69 MB/s) - ‘20news-18828.tar.gz’ saved [14666916/14666916]

20news-18828/
20news-18828/alt.atheism/
20news-18828/alt.atheism/51203
20news-18828/alt.atheism/51277
20news-18828/alt.atheism/53192
...

In [25]:
# load all data
import os

all_data = []
for root, dirs, files in os.walk("./20news-18828"):
    for f in files:
        # latin-1 can decode any byte sequence, so no file fails to load
        with open(os.path.join(root, f), "r", encoding="latin-1") as txtfile:
            txt = txtfile.read()
        # skip the header block, which ends at the first blank line
        ignore_header = txt.find("\n\n")
        if ignore_header > 0:
            txt = txt[ignore_header:]
        all_data.extend(l.strip() for l in txt.split("\n"))
all_data[:10]

['',
 '',
 'In article <1993May12.205519.1480@alchemy.chem.utoronto.ca> golchowy@alchemy.chem.utoronto.ca (Gerald Olchowy) writes:',
 '>In article <C6x44y.3xD@cbfsb.cb.att.com> sadek@cbnewsg.cb.att.com (mohamed.s.sadek) writes:',
 '>>',
 '>>I like what Mr. Joseph Biden had to say yesterday 5/11/93 in the senate.',
 '>>Condemening the european lack of action and lack of support to us plans',
 '>>and calling that "moral rape".',
 '>>',
 '>']

In [26]:
# clean data
import re
# the hyphen is escaped so it is read as a literal "-" rather than a character range
remove_punct = r'[!"#$%&*+,\-./:;<=>?@^_`()|~]'
clean = [re.sub(pattern=remove_punct, repl='', string=l).split() for l in all_data if l != ""]
clean[:10]

[['In',
  'article',
  '1993May122055191480alchemychemutorontoca',
  'golchowyalchemychemutorontoca',
  'Gerald',
  'Olchowy',
  'writes'],
 ['In',
  'article',
  'C6x44y3xDcbfsbcbattcom',
  'sadekcbnewsgcbattcom',
  'mohamedssadek',
  'writes'],
 [],
 ['I',
  'like',
  'what',
  'Mr',
  'Joseph',
  'Biden',
  'had',
  'to',
  'say',
  'yesterday',
  '51193',
  'in',
  'the',
  'senate'],
 ['Condemening',
  'the',
  'european',
  'lack',
  'of',
  'action',
  'and',
  'lack',
  'of',
  'support',
  'to',
  'us',
  'plans'],
 ['and', 'calling', 'that', 'moral', 'rape'],
 [],
 [],
 ['It',
  'is',
  'easy',
  'for',
  'Sen',
  'Biden',
  'to',
  'say',
  'that',
  'when',
  'there',
  'are',
  'no',
  'US',
  'troops',
  'in'],
 ['Zepa', 'or', 'Srebinica', 'or', 'Sarejevo']]
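The cleaned output above still contains empty lists: lines made up only of quote markers such as `>>` pass the `l != ""` filter and become empty once the punctuation is stripped. One possible refinement, not part of the original notebook, is to drop such lines after cleaning. The `clean_lines` helper below is a hypothetical name used only for this sketch; it uses the same character set as the cleaning cell.

```python
import re

# Same character set as the cleaning cell (the hyphen is written escaped here).
remove_punct = r'[!"#$%&*+,\-./:;<=>?@^_`()|~]'

def clean_lines(lines):
    """Strip punctuation, tokenize on whitespace, and drop lines left empty."""
    cleaned = []
    for line in lines:
        tokens = re.sub(remove_punct, "", line).split()
        if tokens:  # keep only lines that still contain at least one token
            cleaned.append(tokens)
    return cleaned

sample = [
    ">>",
    ">>I like what Mr. Joseph Biden had to say yesterday 5/11/93 in the senate.",
    "",
    ">",
]
print(clean_lines(sample))
```

Empty sentences do not contribute any training context, so removing them mainly keeps the corpus tidy and easier to inspect.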

In [28]:
from gensim.models import Word2Vec
# Parameter names follow gensim >= 4.0; gensim 3.x called them `size` and `iter`.
model = Word2Vec(clean,
                 min_count=3,      # Ignore words that appear fewer times than this
                 vector_size=200,  # Dimensionality of word embeddings
                 workers=2,        # Number of worker threads (parallelisation)
                 window=5,         # Context window for words during training
                 epochs=30)        # Number of training epochs over the corpus
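To make the `min_count` and `window` arguments more concrete, here is a simplified picture of how a sentence is turned into (center, context) training pairs. This is a from-scratch sketch under stated assumptions, not gensim's implementation: the real trainer does more (for example sampling and a choice between CBOW and skip-gram), and `context_pairs` is a hypothetical helper.

```python
from collections import Counter

def context_pairs(sentences, window=2, min_count=1):
    """Yield (center, context) pairs from tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    for sent in sentences:
        # Words below min_count are removed first, so they serve
        # neither as centers nor as context.
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]

toy = [["the", "flower", "is", "red"], ["the", "tree", "is", "green"]]
pairs = list(context_pairs(toy, window=1, min_count=2))
print(pairs)
```

With `min_count=2` only `the` and `is` survive in this toy corpus, which shows how a high threshold on a small dataset can leave very little to learn from.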

In [30]:
# created model can be saved to disk
model.save("./20news-18828/toy-model.model")
# test run (kept as the last expression so the notebook displays the result)
model.wv.most_similar("europe")

[('Ganja', 0.4598428010940552),
 ('eccentricity', 0.4369807243347168),
 ('Morea', 0.43422484397888184),
 ('China', 0.41499266028404236),
 ('talkpoliticsmisc', 0.41395872831344604),
 ('NagornoKarabakh', 0.4132624864578247),
 ('Canda', 0.41209375858306885),
 ('Dagestan', 0.41166573762893677),
 ('aviation', 0.410491943359375),
 ('Norway', 0.41026896238327026)]

### [Assignment] Apply to Vietnamese data
We can access a Wikipedia dump of Vietnamese monolingual data at https://github.com/NTT123/viwik18. Apply the technique above to generate a Vietnamese model in gensim format.
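A Wikipedia dump may be too large to hold in memory as a list of token lists. As a hedged starting point, the sketch below defines a restartable iterable that reads a text file line by line. The `LineCorpus` class, the temporary stand-in file, and the one-sentence-per-line layout are assumptions for illustration; the actual layout of the dump should be checked first. Splitting on whitespace also yields syllables rather than full Vietnamese words, so a proper word segmenter may be preferable.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable yielding one whitespace-tokenized line at a time."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Reopening the file on every call lets a trainer make several passes.
        with open(self.path, "r", encoding=self.encoding) as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Tiny stand-in file; the real dump would be used instead.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Hà Nội là thủ đô của Việt Nam\n\n"
              "Tiếng Việt là ngôn ngữ chính thức\n")
    sample_path = tmp.name

corpus = LineCorpus(sample_path)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again works, unlike a plain generator
print(first_pass[0])
os.remove(sample_path)
```

An object like this could be passed to `Word2Vec` in place of the in-memory `clean` list, since training needs to iterate over the corpus more than once.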

In [None]:
# Try it here

## Spacy
A library more focused on tokenizing and tagging, `spacy` gets an honorable mention since it already has partial support for Vietnamese in the `vi_spacy` package: https://github.com/trungtv/vi_spacy

**Note: this section currently has errors. Come back later.**

In [None]:
# installation for newer version (bugged?)
#!pip install pyvi 
#!pip install spacy==2.1.4
!pip install https://github.com/trungtv/vi_spacy/raw/master/packages/vi_spacy_model-0.2.1/dist/vi_spacy_model-0.2.1.tar.gz
!python -m spacy link --force vi_spacy_model vi
!python -m spacy validate

In [None]:
# installation for older version
!pip install spacy==2.0.0
!pip install https://github.com/trungtv/vivi_spacy/raw/master/vi/vi_core_news_md-2.0.1/dist/vi_core_news_md-2.0.1.tar.gz
!python -m spacy validate

In [None]:
# loading
import spacy
nlp = spacy.load('vi')
#import vi_spacy_model
#nlp = vi_spacy_model.load()
doc = nlp('Cộng đồng xử lý ngôn ngữ tự nhiên')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

In [None]:
import vi_spacy_model
nlp = vi_spacy_model.load()
doc = nlp('Cộng đồng xử lý ngôn ngữ tự nhiên')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

In [63]:
#! python -m spacy link vi_spacy_model vi
!head ~/environment/lib/python3.5/site-packages/vi_spacy_model/meta.json

{
  "lang":"vi",
  "name":"spacy_model",
  "version":"0.2.1",
  "spacy_version":">=2.1.4",
  "description":"Vietnamese model for Spacy.IO",
  "author":"Viet-Trung TRAN",
  "email":"trungtv@soict.hust.edu.vn",
  "url":"https://github.com/trungtv/vivi_spacy",
  "license":"MIT",


## NLTK
`nltk` is the most widely used package when it comes to language processing, thanks to its accessible interface, ease of use, and the sheer number of supporting tools it offers. Similar to the two libraries above, it has built-in support for comparing words, shown here through the WordNet corpus.

In [None]:
!pip install nltk

In [67]:
import nltk
from nltk.corpus import wordnet

# might be necessary to download
nltk.download('wordnet')

# find synonyms and antonyms for the word
synonyms, antonyms = [], []
for syn in wordnet.synsets("beauty"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(synonyms, antonyms)

# find the similarity of two words with known tags
w1 = wordnet.synset('run.v.01')  # "v" marks the verb sense
w2 = wordnet.synset('sprint.v.01')
print(w1.wup_similarity(w2))

['beauty', 'smasher', 'stunner', 'knockout', 'beauty', 'ravisher', 'sweetheart', 'peach', 'lulu', 'looker', 'mantrap', 'dish', 'beauty', 'beaut'] ['ugliness']
0.8571428571428571
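The `wup_similarity` score is the Wu-Palmer measure: twice the depth of the deepest ancestor shared by the two senses, divided by the sum of their depths. The sketch below applies that formula to a hand-made taxonomy. The `parent` table is hypothetical and does not reproduce WordNet's actual hierarchy, and depth is counted here with the root at depth 1.

```python
# Hypothetical mini-taxonomy (child -> parent); not WordNet's actual structure.
parent = {
    "move": None,
    "travel": "move",
    "run": "travel",
    "sprint": "run",
    "walk": "travel",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node):
    """Number of nodes on the path from the root to node (root has depth 1)."""
    return len(path_to_root(node))

def wup_similarity(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node also above (or equal to) a
    # is the deepest shared ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("run", "sprint"))
print(wup_similarity("run", "walk"))
```

Two senses that sit close together deep in the hierarchy score near 1, while senses that only share a very general ancestor score much lower.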


### Hidden Markov tagger
The `nltk` package has a multitude of choices when it comes to implementing a POS tagger. In this example, we will build a simple Hidden Markov model for tagging, based on the built-in `treebank` corpus. Note that we only use a 3000-sentence subset, as this is only for demonstration purposes.

In [71]:
import nltk
from nltk.corpus import treebank

# might be necessary to download
nltk.download('treebank')

# Train data - pretagged
train_data = treebank.tagged_sents()[:3000]

from nltk.tag import hmm
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)

print(tagger.tag("Today is a good day .".split()))

[nltk_data] Downloading package treebank to /home/khoai23/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


[('Today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('day', 'NN'), ('.', '.')]
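For intuition about what a supervised HMM tagger does, the sketch below counts tag transitions and word emissions from tagged sentences and decodes a new sentence with the Viterbi algorithm. It is a from-scratch illustration, not NLTK's implementation: the add-one smoothing, the `<s>` start symbol, and the toy training data are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math

def train_hmm(tagged_sents):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    trans = defaultdict(Counter)   # previous tag -> next tag counts
    emit = defaultdict(Counter)    # tag -> word counts
    for sent in tagged_sents:
        prev = "<s>"               # assumed sentence-start symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the counts in counter."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for words as (word, tag) pairs."""
    tags = list(emit)
    vocab = {w for c in emit.values() for w in c}
    n_words = len(vocab) + 1       # one extra slot for unseen words
    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            score, prev = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)) + e, p)
                for p in tags
            )
            best[i][t] = (score, prev)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))

# Toy training data, made up for this sketch.
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
trans, emit = train_hmm(toy_train)
print(viterbi(["a", "cat", "barks"], trans, emit))
```

The sentence "a cat barks" never occurs in the toy training data, yet the learned transition pattern (determiner, noun, verb) is enough to tag it.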


### [Assignment] Adapt with Vietnamese data
Similar to the assignment above, you will have to clean the data and build the model on a Vietnamese treebank dataset. The data can be found at https://universaldependencies.org/
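Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. As a possible starting point, the sketch below converts such lines into the same list-of-(word, tag) structure that `treebank.tagged_sents()` returns. The `read_conllu` helper is a hypothetical name, and the sample sentences are hand-written, with illustrative tags that are not taken from a treebank.

```python
def read_conllu(lines):
    """Turn CoNLL-U lines into a list of [(form, upos), ...] sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append((form, upos))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in CoNLL-U layout; tags are illustrative only.
sample = (
    "# text = Tôi học\n"
    "1\tTôi\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\thọc\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tChào\t_\tINTJ\t_\t_\t0\troot\t_\t_\n"
)
print(read_conllu(sample.splitlines()))
```

For a real file, the lines could come from `open(path, encoding="utf-8")`, and the resulting list could be passed to `trainer.train_supervised` in the same way as the English data.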

You are not limited to the HMM; experiment with and evaluate other methods (e.g. CRF). See http://www.nltk.org/api/nltk.tag.html for the available choices.
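To compare taggers fairly, it helps to score them on sentences they were not trained on. The sketch below shows one simple way to do that, using a hold-out split and token-level accuracy. The helper names `tagging_accuracy` and `split_data` and the `always_noun` stand-in tagger are assumptions made for illustration.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        predicted = tag_fn(words)
        for (_, gold), (_, guess) in zip(sent, predicted):
            correct += gold == guess
            total += 1
    return correct / total if total else 0.0

def split_data(sents, train_fraction=0.9):
    """Hold out the tail of the data so the tagger is scored on unseen sentences."""
    cut = int(len(sents) * train_fraction)
    return sents[:cut], sents[cut:]

# Stand-in tagger (hypothetical): labels every word "NN".
def always_noun(words):
    return [(w, "NN") for w in words]

gold = [
    [("Today", "NN"), ("is", "VBZ"), ("a", "DT"), ("good", "JJ"), ("day", "NN")],
    [("day", "NN"), ("ends", "VBZ")],
]
print(tagging_accuracy(always_noun, gold))
```

Any function that maps a list of words to a list of (word, tag) pairs fits here, so an NLTK tagger's `tag` method could be passed as `tag_fn` together with the held-out part of the treebank.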

In [None]:
# Code your own here