# Mining Unstructured Data

In this lab, we will experiment with the word2vec model. First, we need to install two packages: nltk and [gensim](https://radimrehurek.com/gensim/), which provides the word2vec implementation used below.

In [7]:
# install packages
import sys

!conda install --yes --prefix {sys.prefix} -c anaconda nltk 
!conda install --yes --prefix {sys.prefix} -c conda-forge gensim

Solving environment: done


  current version: 4.4.9
  latest version: 4.6.7

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /Users/jnyu/anaconda3

  added / updated specs: 
    - nltk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2018.11.29         |           py36_0         146 KB  anaconda
    ca-certificates-2019.1.23  |                0         126 KB  anaconda
    openssl-1.0.2p             |       h1de35cc_0         3.4 MB  anaconda
    ------------------------------------------------------------
                                           Total:         3.7 MB

The following packages will be UPDATED:

    ca-certificates: 2019.1.23-0       --> 2019.1.23-0       anaconda
    certifi:         2018.11.29-py36_0 --> 2018.11.29-py36_0 anaconda
    openssl:         1.0.2p-h1de35cc_0 --> 1.0.2p-h1de35cc_0 anacon

In [37]:
import gzip
import logging
import warnings
import gensim
from gensim.models import Word2Vec

warnings.simplefilter(action='ignore', category=FutureWarning)

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


## word2vec on synthetic data

### Generate some synthetic data
We manually define a handful of tokenized sentences to use as a toy corpus.

In [38]:

# define training data
sentences = [
    ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    ['this', 'is', 'the', 'second', 'sentence'],
    ['yet', 'another', 'sentence'],
    ['one', 'more', 'sentence'],
    ['and', 'the', 'final', 'sentence']
]

### Train word2vec model

The gensim [word2vec](https://radimrehurek.com/gensim/models/word2vec.html) class has a few arguments you may wish to configure:

- size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). In gensim 4.0 and later this argument is named vector_size.
- window: (default 5) The maximum distance between a target word and the words around it.
- min_count: (default 5) The minimum count of words to consider when training the model; words that occur fewer times than this are ignored.
- workers: (default 3) The number of threads to use while training.
- sg: (default 0, i.e. CBOW) The training algorithm, either CBOW (0) or skip-gram (1).
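
To make the `window` argument concrete, the following is a minimal sketch of how (target, context) pairs could be collected from one sentence. `context_pairs` is a hypothetical helper written for illustration, not part of gensim, and it uses a fixed window, which is a simplification of what the library does during training.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs for every token within `window` positions."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


sentence = ['this', 'is', 'the', 'second', 'sentence']

# window=1: only the immediate neighbours count as context
pairs = list(context_pairs(sentence, window=1))
print(pairs)

# window=5: every other word in this short sentence is context
print(len(list(context_pairs(sentence, window=5))))
```

With a larger window, each word is paired with more distant neighbours, so the embeddings tend to capture broader topical relatedness rather than only immediate co-occurrence.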

In [39]:
# train model
model = Word2Vec(sentences, size=10, min_count=1)

# summarize the loaded model
print(model)

# save the model
model.save('model.bin')

2019-02-27 16:44:43,305 : INFO : collecting all words and their counts
2019-02-27 16:44:43,307 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-02-27 16:44:43,309 : INFO : collected 14 word types from a corpus of 22 raw words and 5 sentences
2019-02-27 16:44:43,310 : INFO : Loading a fresh vocabulary
2019-02-27 16:44:43,312 : INFO : min_count=1 retains 14 unique words (100% of original 14, drops 0)
2019-02-27 16:44:43,313 : INFO : min_count=1 leaves 22 word corpus (100% of original 22, drops 0)
2019-02-27 16:44:43,315 : INFO : deleting the raw counts dictionary of 14 items
2019-02-27 16:44:43,317 : INFO : sample=0.001 downsamples 14 most-common words
2019-02-27 16:44:43,319 : INFO : downsampling leaves estimated 2 word corpus (12.7% of prior 22)
2019-02-27 16:44:43,323 : INFO : estimated required memory for 14 words and 10 dimensions: 8120 bytes
2019-02-27 16:44:43,324 : INFO : resetting layer weights
2019-02-27 16:44:43,326 : INFO : training model with 3

Word2Vec(vocab=14, size=10, alpha=0.025)
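
The log above reports 14 word types collected from 22 raw words. As a rough check, and to show why `min_count=1` is needed on such a small corpus, here is a sketch that counts the tokens with the standard library. `surviving_vocab` is a hypothetical helper for illustration and only mimics the frequency cut-off, not the rest of gensim's vocabulary building.

```python
from collections import Counter

sentences = [
    ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    ['this', 'is', 'the', 'second', 'sentence'],
    ['yet', 'another', 'sentence'],
    ['one', 'more', 'sentence'],
    ['and', 'the', 'final', 'sentence']
]

# count every token across all sentences
counts = Counter(token for sentence in sentences for token in sentence)
print(len(counts), sum(counts.values()))  # word types, raw words


def surviving_vocab(counts, min_count):
    """Return the words that occur at least `min_count` times."""
    return sorted(word for word, n in counts.items() if n >= min_count)


print(surviving_vocab(counts, min_count=1))
print(surviving_vocab(counts, min_count=5))
```

With the default cut-off of 5, only 'sentence' would remain in the vocabulary, which is why the lab lowers `min_count` to 1 here.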


### Access the model

After the model is trained, the learned word vectors are accessible via the `wv` attribute. This is the actual word-vector model against which queries can be made.

For example, you can print the learned vocabulary of tokens (words) as follows:

In [40]:
# summarize vocabulary
words = list(model.wv.vocab)
print(words)

['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']


You can also access the embedding of a particular word.

In [41]:
# access vector for one word (go through `wv`; indexing the model directly is deprecated)
print(model.wv['sentence'])

[-0.01425273  0.03096979 -0.02607897 -0.03749254 -0.02945738 -0.00771068
 -0.01760446  0.01088096 -0.02845169 -0.02757283]


  


### Load a trained model

In [42]:
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)


# summarize vocabulary of the loaded model
new_words = list(new_model.wv.vocab)
print(new_words)
print(new_model.wv['sentence'])

2019-02-27 16:44:44,157 : INFO : loading Word2Vec object from model.bin
2019-02-27 16:44:44,159 : INFO : loading wv recursively from model.bin.wv.* with mmap=None
2019-02-27 16:44:44,161 : INFO : setting ignored attribute vectors_norm to None
2019-02-27 16:44:44,162 : INFO : loading vocabulary recursively from model.bin.vocabulary.* with mmap=None
2019-02-27 16:44:44,164 : INFO : loading trainables recursively from model.bin.trainables.* with mmap=None
2019-02-27 16:44:44,166 : INFO : setting ignored attribute cum_table to None
2019-02-27 16:44:44,169 : INFO : loaded model.bin


Word2Vec(vocab=14, size=10, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[-0.01425273  0.03096979 -0.02607897 -0.03749254 -0.02945738 -0.00771068
 -0.01760446  0.01088096 -0.02845169 -0.02757283]




-------

## Word2vec on user review data

In this task, we will apply word2vec to a real-world dataset, [OpinRank](http://kavita-ganesan.com/entity-ranking-data/), which contains full user reviews of cars and hotels.

### Read review data

In [45]:
def read_input(input_file):
    """Read a gzip-compressed file and yield one list of tokens per review."""

    print("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            if i % 10000 == 0:
                print("read {0} reviews".format(i))
            # pre-process each review and yield its list of words
            yield gensim.utils.simple_preprocess(line)
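
The same streaming pattern can be tried on a tiny file without downloading the dataset. The sketch below writes two made-up reviews to a temporary gzip file and reads them back one line at a time. `rough_tokens` is a hypothetical stand-in that only approximates `gensim.utils.simple_preprocess` (lowercasing and keeping alphabetic tokens of 2 to 15 characters); the file name and review texts are invented for illustration.

```python
import gzip
import os
import re
import tempfile


def rough_tokens(line):
    """Approximate tokenizer (an assumption, not gensim's implementation)."""
    text = line.decode('utf-8', errors='ignore').lower()
    return [t for t in re.findall(r'[a-z]+', text) if 2 <= len(t) <= 15]


def read_reviews(path):
    """Yield one token list per line of a gzip-compressed text file."""
    with gzip.open(path, 'rb') as f:
        for line in f:
            yield rough_tokens(line)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_reviews.txt.gz')
    # write two invented reviews, one per line
    with gzip.open(path, 'wb') as f:
        f.write(b"The room was clean and the staff were polite.\n")
        f.write(b"A dirty, smelly bathroom!\n")
    toy_documents = list(read_reviews(path))

print(toy_documents)
```

Because the reader is a generator, lines are decompressed and tokenized lazily; wrapping it in `list(...)` is what pulls the whole corpus into memory.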

In [46]:
documents = list(read_input('reviews_data.txt.gz'))
logging.info("Done reading data file")

reading file reviews_data.txt.gz...this may take a while
read 0 reviews
read 10000 reviews
read 20000 reviews
read 30000 reviews
read 40000 reviews
read 50000 reviews
read 60000 reviews
read 70000 reviews
read 80000 reviews
read 90000 reviews
read 100000 reviews
read 110000 reviews
read 120000 reviews
read 130000 reviews
read 140000 reviews
read 150000 reviews
read 160000 reviews
read 170000 reviews
read 180000 reviews
read 190000 reviews
read 200000 reviews
read 210000 reviews
read 220000 reviews
read 230000 reviews
read 240000 reviews
read 250000 reviews


2019-02-27 17:00:15,413 : INFO : Done reading data file


### Train the word2vec model

In [48]:
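# build vocabulary and train model
# (passing `documents` to the constructor already builds the vocabulary
# and trains for the default 5 epochs)
model = gensim.models.Word2Vec(
    documents,
    size=150,
    window=10,
    min_count=2,
    workers=10)
# train for 3 additional epochs on the same data
model.train(documents, total_examples=len(documents), epochs=3)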

2019-02-27 17:02:37,319 : INFO : collecting all words and their counts
2019-02-27 17:02:37,320 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-02-27 17:02:37,539 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2019-02-27 17:02:37,768 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2019-02-27 17:02:38,038 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2019-02-27 17:02:38,307 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2019-02-27 17:02:38,571 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2019-02-27 17:02:38,834 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76781 word types
2019-02-27 17:02:39,048 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83194 word types
2019-02-27 17:02:39,283 : INFO : PROG

2019-02-27 17:03:14,628 : INFO : EPOCH 2 - PROGRESS: at 7.17% examples, 1085144 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:15,632 : INFO : EPOCH 2 - PROGRESS: at 10.39% examples, 1093301 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:16,634 : INFO : EPOCH 2 - PROGRESS: at 13.54% examples, 1103319 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:17,637 : INFO : EPOCH 2 - PROGRESS: at 16.84% examples, 1112678 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:18,641 : INFO : EPOCH 2 - PROGRESS: at 19.81% examples, 1114361 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:19,643 : INFO : EPOCH 2 - PROGRESS: at 23.02% examples, 1115730 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:20,647 : INFO : EPOCH 2 - PROGRESS: at 26.50% examples, 1115946 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:03:21,655 : INFO : EPOCH 2 - PROGRESS: at 30.53% examples, 1116032 words/s, in_qsize 18, out_qsize 1
2019-02-27 17:03:22,667 : INFO : EPOCH 2 - PROGRESS: at 34.55% examples, 1116433 

2019-02-27 17:04:09,820 : INFO : EPOCH 4 - PROGRESS: at 10.57% examples, 1122936 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:10,835 : INFO : EPOCH 4 - PROGRESS: at 13.69% examples, 1119892 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:11,843 : INFO : EPOCH 4 - PROGRESS: at 16.95% examples, 1123218 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:12,850 : INFO : EPOCH 4 - PROGRESS: at 19.98% examples, 1126219 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:13,871 : INFO : EPOCH 4 - PROGRESS: at 23.18% examples, 1122928 words/s, in_qsize 17, out_qsize 2
2019-02-27 17:04:14,874 : INFO : EPOCH 4 - PROGRESS: at 26.20% examples, 1105597 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:15,881 : INFO : EPOCH 4 - PROGRESS: at 29.67% examples, 1087807 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:16,886 : INFO : EPOCH 4 - PROGRESS: at 33.38% examples, 1080967 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:04:17,892 : INFO : EPOCH 4 - PROGRESS: at 36.84% examples, 1074160

2019-02-27 17:05:04,741 : INFO : EPOCH - 5 : training on 41519355 raw words (30347365 effective words) took 27.2s, 1114434 effective words/s
2019-02-27 17:05:04,743 : INFO : training on a 207596775 raw words (151737874 effective words) took 139.2s, 1090179 effective words/s
2019-02-27 17:05:04,746 : INFO : training model with 10 workers on 70538 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2019-02-27 17:05:05,761 : INFO : EPOCH 1 - PROGRESS: at 3.36% examples, 1035836 words/s, in_qsize 17, out_qsize 2
2019-02-27 17:05:06,762 : INFO : EPOCH 1 - PROGRESS: at 7.13% examples, 1093665 words/s, in_qsize 18, out_qsize 1
2019-02-27 17:05:07,763 : INFO : EPOCH 1 - PROGRESS: at 10.55% examples, 1121821 words/s, in_qsize 18, out_qsize 1
2019-02-27 17:05:08,765 : INFO : EPOCH 1 - PROGRESS: at 13.63% examples, 1119332 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:05:09,770 : INFO : EPOCH 1 - PROGRESS: at 16.81% examples, 1116176 words/s, in_qsize 19, out_qsize 0


2019-02-27 17:05:59,266 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-02-27 17:05:59,281 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-02-27 17:05:59,286 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-02-27 17:05:59,294 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-02-27 17:05:59,295 : INFO : EPOCH - 2 : training on 41519355 raw words (30350932 effective words) took 27.4s, 1109014 effective words/s
2019-02-27 17:06:00,311 : INFO : EPOCH 3 - PROGRESS: at 3.54% examples, 1085425 words/s, in_qsize 18, out_qsize 1
2019-02-27 17:06:01,317 : INFO : EPOCH 3 - PROGRESS: at 7.28% examples, 1112167 words/s, in_qsize 18, out_qsize 1
2019-02-27 17:06:02,321 : INFO : EPOCH 3 - PROGRESS: at 10.53% examples, 1116491 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:06:03,337 : INFO : EPOCH 3 - PROGRESS: at 13.58% examples, 1107678 words/s, in_qsize 19, out_qsize 0
2019-02-27 17:06:04,343 : INFO : EPOC

(91049212, 124558065)
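
The tuple printed above appears to be the pair (effective words trained, raw words seen) summed over the 3 epochs of the `train` call. That reading is an interpretation based on the figures in the log, which reports 41519355 raw words per epoch; the short check below uses only numbers copied from the output.

```python
raw_words_per_epoch = 41519355                     # "training on 41519355 raw words" in the log
epochs = 3                                         # epochs passed to model.train
effective_total, raw_total = 91049212, 124558065   # tuple printed by the cell

# the second value matches 3 full passes over the raw corpus
print(raw_words_per_epoch * epochs == raw_total)

# share of raw words that were actually used for training
print(round(effective_total / raw_total, 3))
```

The effective count is lower than the raw count because words below `min_count` are dropped and very frequent words are downsampled (`sample=0.001` in the log).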

### Check the similarity of a word

In [2]:
# look up the words most similar to 'dirty'
w1 = "dirty"
model.wv.most_similar(positive=w1)


In [50]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)

[('courteous', 0.910372257232666),
 ('friendly', 0.8230658769607544),
 ('curteous', 0.818327009677887),
 ('cordial', 0.8152638077735901),
 ('freindly', 0.7962666749954224),
 ('curtious', 0.7900578379631042)]

In [51]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)

[('germany', 0.7337640523910522),
 ('canada', 0.68161940574646),
 ('england', 0.6617143750190735),
 ('hawaii', 0.6576465368270874),
 ('rome', 0.6501343250274658),
 ('mexico', 0.6385927200317383)]
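
Conceptually, `most_similar` ranks every other word in the vocabulary by the cosine similarity between its vector and the query vector. The sketch below reproduces that idea on a few made-up 3-dimensional vectors. The vectors and the `most_similar` helper here are assumptions for illustration only; they are not taken from the trained model, and the helper is not gensim's implementation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=2):
    """Return the `topn` (word, score) pairs closest to `word`."""
    query = vectors[word]
    scored = ((cosine(query, vec), other)
              for other, vec in vectors.items() if other != word)
    return [(other, score) for score, other in heapq.nlargest(topn, scored)]


# invented toy embeddings, not learned from data
toy_vectors = {
    'polite':    [0.9, 0.1, 0.0],
    'courteous': [0.8, 0.2, 0.1],
    'friendly':  [0.7, 0.4, 0.0],
    'dirty':     [0.0, 0.1, 0.9],
}

result = most_similar(toy_vectors, 'polite', topn=2)
print(result)
```

The real results above also show why misspellings such as 'curteous' and 'freindly' rank highly: they occur in the same contexts as their correctly spelled forms, so their vectors end up close together.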

### Check the similarity between two words

In [52]:
# similarity between two different words
model.wv.similarity(w1="dirty", w2="smelly")

0.7498892610339847

In [53]:
# similarity between two identical words
model.wv.similarity(w1="dirty", w2="dirty")

1.0000000000000002

In [54]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty", w2="clean")

0.24801825242792697
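
The scores returned by `similarity` are cosine similarities, which range from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). A value such as 1.0000000000000002 for identical words is a floating-point rounding effect rather than a meaningful difference from 1. The sketch below illustrates these properties with an arbitrary made-up vector; it is not connected to the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


v = [0.3, -0.2, 0.5]  # arbitrary example vector

# a vector compared with itself: 1 up to rounding error
print(math.isclose(cosine(v, v), 1.0))

# a vector compared with its negation: -1 up to rounding error
print(math.isclose(cosine(v, [-x for x in v]), -1.0))

# orthogonal vectors: no shared direction
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

When comparing scores in code, a tolerance-based check such as `math.isclose` is safer than testing for exact equality with 1.0.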

# End of lab 7