# Homework 2 - Kristiyan Dimitrov

## Overview
Tasks for preprocessing:
- tokenization
- lowercase
- removing non-alphanumeric characters and numbers
- removing stopwords

Preprocessing results:
- 1 txt file for each article

Finally, train and evaluate a word2vec model

# Notes
- Downloading the latest English Wikipedia XML dump (17.8 GB) took ~1 hour
- Processing it with the first bash command took ~2.5 hours. I kept only articles with more than 1000 characters to avoid stubs
- Overall, this yielded ~3.5 million Wikipedia articles

In terms of preprocessing:
- I used `sent_tokenize`, then `gensim.utils.simple_preprocess`, and finally filtered out stopwords
- I decided NOT to lemmatize the words, because it's important that plurals are preserved, e.g. 'dogs' is different from 'dog'
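
The stopword filtering step can be sketched with a set that is built once up front, so each membership test is a constant-time lookup rather than a scan over a freshly built list. This is only a sketch: `STOP_WORDS` is a tiny hypothetical stand-in for `set(stopwords.words('english'))`, and `remove_stopwords` is an illustrative name, not a function from this notebook.

```python
# Hypothetical stand-in; in the notebook this would be
# set(stopwords.words('english')), built once before looping over articles.
STOP_WORDS = {"the", "a", "of", "is", "and", "on"}


def remove_stopwords(tokenized_sentences, stop_words=STOP_WORDS):
    """Drop stopwords while keeping the list-of-lists sentence structure."""
    return [
        [word for word in sentence if word not in stop_words]
        for sentence in tokenized_sentences
    ]


sentences = [["the", "cat", "sat", "on", "the", "mat"], ["dogs", "and", "cats"]]
print(remove_stopwords(sentences))
```

Because the plural forms are kept, 'dogs' and 'cats' pass through unchanged.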


I processed 55,170 articles in ~6.5 hours. 29 failed to save, because their titles contain a "/", which is read as a path separator in the file name.
Note that this is only a small part of the ~3.5 million articles in total.
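
As a rough back-of-the-envelope estimate, and assuming the measured throughput stays constant, the figures above imply the following processing time for the full set of articles. The variable names are illustrative.

```python
articles_done = 55_170
hours_spent = 6.5
total_articles = 3_500_000

# Throughput observed so far, then extrapolated linearly to the full dump
articles_per_hour = articles_done / hours_spent
hours_for_all = total_articles / articles_per_hour

print(f"{articles_per_hour:,.0f} articles/hour")
print(f"{hours_for_all:,.0f} hours (~{hours_for_all / 24:.0f} days) for all articles")
```

That works out to roughly 8,500 articles per hour, or about 17 days for everything, which is why only a subset was used.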

### Extracting JSON file from raw XML data dump
To extract the articles from the raw Wikipedia dump, I used `gensim.scripts.segment_wiki`.  
I removed the `-i` parameter, because I don't want the interlink information (links between articles).  
I added `-m 1000`, which is the minimum number of characters per article; I think this is a reasonable threshold, and it ignores any stubs.  

In [None]:
# !python -m gensim.scripts.segment_wiki -m 1000 -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
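
Each line of the resulting `.json.gz` file is one JSON-encoded article. The sketch below writes a tiny made-up file in that layout and reads it back with the standard library only. The sample article and the file name `sample.json.gz` are invented for illustration, and the `section_titles` field is my assumption about the output layout (`title` and `section_texts` are the fields used later in this notebook).

```python
import gzip
import json
import os
import tempfile

# Invented sample article in the one-JSON-object-per-line layout
sample = {
    "title": "Example article",
    "section_titles": ["Introduction", "History"],
    "section_texts": ["First section text.", "Second section text."],
}

path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

titles = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:  # one article per line; the file is read line by line
        article = json.loads(line)
        titles.append(article["title"])
        for heading, text in zip(article["section_titles"], article["section_texts"]):
            print(f"{article['title']} / {heading}: {len(text)} characters")
```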

### Testing 
Functions based on this recommended tutorial: https://rare-technologies.com/word2vec-tutorial/

In [60]:
import os
from gensim import utils
import gensim, logging
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import json

# Preprocessing

In [228]:
test_iter = 1

# iterate over the plain text data we just created
with utils.open('enwiki-latest.json.gz', 'rb') as f:
    
    
    for line in f:
        
#         if test_iter % 10_000 == 0:
#             print(test_iter)
#         else:
#             test_iter += 1

        if test_iter  == 500_000:
            break
        else:
            if test_iter % 100 == 0:
                print(test_iter)
            test_iter += 1
        
        # decode each JSON line into a Python dictionary object
        article = json.loads(line)
        
        # If the given article has already been processed, skip it
        # (os.path.exists avoids re-listing the whole 'data' directory for every article)
        if os.path.exists(os.path.join('data', article['title']+'.txt')):
            continue
        
        article_sentences = list()
        
        for text in article['section_texts']:
            # Split text into sentences
            sentences = sent_tokenize(text)
            # Preprocess each sentence converting it into a list of lower case tokens, which are not too long or too short
            tokenized_sentences = [gensim.utils.simple_preprocess(sent) for sent in sentences]

            # Remove any tokens which are stop words while preserving the sentence structure of the article (i.e. list of lists)
            no_stops = list()
            for tokenized_sentence in tokenized_sentences:
                no_stops.append([word for word in tokenized_sentence if word not in stopwords.words('english')])
            
            # Add sentences from this article section to the list of all sentences for this article
            article_sentences.extend(no_stops)
            
        # Save all sentences in this article to a txt file with the name of the article
        try:
            with open(os.path.join('data', article['title']+'.txt'), 'w') as filehandle:
                for listitem in article_sentences:
                    filehandle.write(f'{" ".join(listitem)}\n')
        except OSError:  # e.g. a "/" in the title is read as a path separator
            print(f"FAILED TO SAVE {article['title']}")
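
One way to avoid the save failures for titles that contain a "/" is to replace the separator before building the path. This is a minimal sketch, not what this notebook did: `safe_filename` is a hypothetical helper, and replacing "/" with "_" could in principle make two different titles collide.

```python
import os


def safe_filename(title, replacement="_"):
    """Return a .txt file name for an article title with "/" replaced."""
    # "/" would otherwise be read as a directory boundary by open()
    return title.replace("/", replacement) + ".txt"


print(safe_filename("AC/DC"))
print(os.path.join("data", safe_filename("Input/output")))
```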


# Model Training

In [178]:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            if fname in ['.ipynb_checkpoints', '.DS_Store']:
                continue
#             print(fname)
            # Open each article file in a context manager so the handle gets closed
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
#                     yield bytes(line, 'utf-8').decode('utf-8', 'ignore').split()
                    yield line.split()
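
The reason for wrapping the files in a class with `__iter__`, rather than passing a plain generator, is that training goes over the corpus several times, and a generator is exhausted after one pass. The sketch below illustrates the difference on a temporary directory; `DirSentences` is an illustrative re-implementation and the two sample files are made up.

```python
import os
import tempfile


class DirSentences:
    """Restartable iterable: every call to __iter__ re-reads the directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()


# Made-up corpus: two tiny "articles", one sentence per line
corpus_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "cats chase mice\n"), ("b.txt", "dogs chase cats\n")]:
    with open(os.path.join(corpus_dir, name), "w") as f:
        f.write(text)

restartable = DirSentences(corpus_dir)
first_pass = list(restartable)
second_pass = list(restartable)  # a fresh pass over the same files

one_shot = iter(restartable)     # a single generator, by contrast
list(one_shot)                   # consumes it
leftover = list(one_shot)        # nothing is left for a second pass

print(len(first_pass), len(second_pass), len(leftover))
```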

Default model parameters are:
- size = 100 (dimensionality of the word vectors)
- window = 5
- sg = 0 (short for skip-gram; 1 for skip-gram, 0 for CBOW)
- hs = 0 (short for hierarchical softmax; if 1, hierarchical softmax is used; if 0, negative sampling is used as long as `negative` > 0)
- min_count = 5 (ignores all words with total frequency below this)
- workers = 3
- iter = 5 (number of training epochs over the data; gensim makes one additional initial pass to build the vocabulary)

### NOTE: Below you will see A LOT of logging messages; scroll down to the end for some qualitative model evaluation

In [180]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

sentences = MySentences('data') # a memory-friendly iterator

# model = gensim.models.Word2Vec(sentences)
model = gensim.models.Word2Vec(sentences, min_count=50, size=200, workers=5, sg=1, iter=10, window=5)

2020-10-07 23:28:19,050 : INFO : collecting all words and their counts
2020-10-07 23:28:19,237 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-10-07 23:28:19,379 : INFO : PROGRESS: at sentence #10000, processed 134288 words, keeping 21521 word types
2020-10-07 23:28:19,492 : INFO : PROGRESS: at sentence #20000, processed 259182 words, keeping 32420 word types
2020-10-07 23:28:19,604 : INFO : PROGRESS: at sentence #30000, processed 393145 words, keeping 42002 word types
2020-10-07 23:28:19,730 : INFO : PROGRESS: at sentence #40000, processed 521442 words, keeping 49736 word types
2020-10-07 23:28:19,895 : INFO : PROGRESS: at sentence #50000, processed 649906 words, keeping 57168 word types
2020-10-07 23:28:20,024 : INFO : PROGRESS: at sentence #60000, processed 777926 words, keeping 63911 word types
2020-10-07 23:28:20,132 : INFO : PROGRESS: at sentence #70000, processed 908921 words, keeping 69000 word types
2020-10-07 23:28:20,232 : INFO : PROGRESS: at 

In [222]:
# I can potentially save the model and keep training it later?
# model.save("word2vec_modified.model")
# model = gensim.models.Word2Vec.load("word2vec_default.model")
model = gensim.models.Word2Vec.load("word2vec_modified.model")

2020-10-08 00:25:11,400 : INFO : loading Word2Vec object from word2vec_modified.model
2020-10-08 00:25:11,497 : INFO : loading wv recursively from word2vec_modified.model.wv.* with mmap=None
2020-10-08 00:25:11,498 : INFO : loading vectors from word2vec_modified.model.wv.vectors.npy with mmap=None
2020-10-08 00:25:11,517 : INFO : setting ignored attribute vectors_norm to None
2020-10-08 00:25:11,517 : INFO : loading vocabulary recursively from word2vec_modified.model.vocabulary.* with mmap=None
2020-10-08 00:25:11,518 : INFO : loading trainables recursively from word2vec_modified.model.trainables.* with mmap=None
2020-10-08 00:25:11,519 : INFO : loading syn1neg from word2vec_modified.model.trainables.syn1neg.npy with mmap=None
2020-10-08 00:25:11,541 : INFO : setting ignored attribute cum_table to None
2020-10-08 00:25:11,541 : INFO : loaded word2vec_modified.model


# Using the model 

In [223]:
model.wv.most_similar(positive=['woman','queen'], negative=['king'], topn=1)

2020-10-08 00:25:14,691 : INFO : precomputing L2-norms of word weight vectors


[('girl', 0.5953096151351929)]
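
As I understand it, `most_similar` normalizes each input vector to unit length, averages the positive vectors together with the negated negative ones, and ranks the remaining vocabulary by cosine similarity to that average. The sketch below reproduces the idea in plain Python on hand-made 3-dimensional vectors (royalty, male, female). The vectors and the helper names are invented for illustration and have nothing to do with the trained model.

```python
import math

# Invented toy vectors: (royalty, male, female)
TOY_VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def toy_most_similar(positive, negative=(), topn=1, vectors=TOY_VECTORS):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x / len(weighted)
    query = unit(query)
    inputs = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs  # input words are excluded from the results
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = toy_most_similar(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy vectors the top result is 'queen'. Note that the query in the cell above combines the words differently (woman + queen - king), so it is not the classic king - man + woman analogy.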

In [224]:
model.wv.most_similar(positive=['green','red'], topn=1)

[('blue', 0.826113224029541)]

In [209]:
model.wv.most_similar(positive=['cat','cats'], negative=['dog'],topn=1)

[('sphynx', 0.6461185216903687)]

In [211]:
model.wv.most_similar(positive=['male','man'], negative=['female'],topn=1)

[('creature', 0.6031163930892944)]

In [212]:
model.wv.most_similar(positive=['continent', 'river'],topn=1)

[('estuary', 0.7099519968032837)]

In [213]:
model.wv.most_similar(positive=['two', 'three'],topn=1)

[('four', 0.9649326205253601)]

In [225]:
model.wv.most_similar(positive=['jobs'],topn=1)

[('employment', 0.7199805974960327)]

In [185]:
len(model.wv.vocab)

66165

In [186]:
model.wv.vectors.shape

(66165, 200)
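
The shape above gives a quick estimate of the memory taken by the word vectors, assuming they are stored as 32-bit floats (4 bytes each), which I believe is what gensim uses.

```python
vocab_size, vector_dim = 66_165, 200
bytes_per_float32 = 4  # assumption: vectors are stored as float32

n_values = vocab_size * vector_dim
n_bytes = n_values * bytes_per_float32

print(f"{n_values:,} values, {n_bytes / 2**20:.1f} MiB")
```

This counts the word vectors only; the trained model also keeps additional weight matrices.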

# Appendix