# Embeddings

## Word2Vec

The vector models we considered before (tf-idf, BOW) are conventionally called *count-based*. They "count" words and their neighbors and build word vectors from these counts.

Another class of models, more widespread today, is called *predictive* (or neural) models. The idea behind these models is to use neural network architectures that "predict" (rather than count) the neighbors of words. One of the most famous models of this type is **word2vec**. It is based on a neural network that predicts the probability of a word occurring in a given context. The tool was developed in 2013 by a group of Google researchers led by Tomas Mikolov (now at Facebook). Here are the two most important articles:

* [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
* [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

The vectors obtained using such models are called *distributed representations of words*, or **embeddings**.

### How to train?
For each word, the model learns two vectors: a word vector (taken from the matrix $w$) and a context vector (taken from the matrix $W$). Word2vec is in fact a generic name for two architectures: Skip-Gram and Continuous Bag-Of-Words (CBOW).

**CBOW** predicts the current word based on context around it.

**Skip-gram** does the reverse: it predicts the context given the current word.
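
To make the difference concrete, here is a minimal sketch of the training examples each architecture would see for one sentence. It is an illustration only, not gensim's actual implementation; the helper names `context_of`, `cbow_examples` and `skipgram_examples` and the fixed `window` value are assumptions made for this example.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the current word is the target."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each context word is a separate target."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_of(tokens, i, window)]

tokens = "the cat sat on the mat".split()
print(cbow_examples(tokens)[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_examples(tokens)[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Note that skip-gram produces several (word, context word) pairs per position, while CBOW produces one example per position.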

### How it works?
Word2vec takes a large text corpus as input and maps each word to a vector, giving the coordinates of the words as output. It first builds a vocabulary from the input text and then computes the vector representations of the words. The representation is based on contextual similarity: words that appear in the same contexts (and therefore, according to the distributional hypothesis, have similar meanings) will have close vectors after training. The similarity of two words is measured by the cosine similarity of their vectors.
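
As a small illustration of that similarity measure, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made up for this example and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy vectors, chosen by hand
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

# "cat" points in almost the same direction as "dog" and away from "car"
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```

The value ranges from -1 to 1 and depends only on the direction of the vectors, not on their length.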

Distributional vector models let you build semantic analogies and solve simple vector arithmetic tasks, for example:

* *king: male = queen: female*
$\Rightarrow$
* *king - man + woman = queen*

![w2v](https://cdn-images-1.medium.com/max/2600/1*sXNXYfAqfLUeiDXPCo130w.png)
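
The arithmetic behind such analogies can be sketched with hand-made vectors. In this toy example the first coordinate loosely stands for "royalty" and the second for "gender"; both the vectors and the helper `analogy` are assumptions for illustration, since real embeddings have hundreds of dimensions without such clean meanings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# hypothetical 2-D vectors: (royalty, gender)
toy_vectors = {
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "man": [0.0, 1.0], "woman": [0.0, -1.0],
    "prince": [0.8, 1.0], "princess": [0.8, -1.0],
}

print(analogy(toy_vectors, "king", "man", "woman"))  # queen
```

The input words are excluded from the candidates because the result of the arithmetic is usually still very close to one of them.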

### Problems

Word2Vec models cannot establish the type of semantic relationship between words: synonyms, antonyms, hyponyms and hypernyms will be equally close to the target word because they are usually used in similar contexts. Therefore, words that are close in the vector space are called *semantic associates*. This means that they are semantically related, but it is not clear how exactly.

### RusVectōrēs

You can find various pre-trained models for the Russian language on the [RusVectōrēs](https://rusvectores.org/ru/) site. The models are trained on various data, from Wikipedia to news and social media. RusVectōrēs also provides an interface to search for the words closest to a given one, calculate the semantic similarity of several words, and solve vector arithmetic tasks using the "semantic similarity calculator".

You can also find pre-trained models for other languages, for example [fastText](https://fasttext.cc/docs/en/english-vectors.html) and [GloVe](https://nlp.stanford.edu/projects/glove/) models (more on them later).

### Visualization
A good interactive visualization of English embeddings is available [here](https://projector.tensorflow.org/).

## Gensim

You can use the pretrained embedding model or train your own using the `gensim` library. Here is [gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html).

### How to use the pretrained model
Word2vec models come in different formats:

* .vec.gz — vector file
* .bin.gz — binary file

They can be loaded with the same class, `KeyedVectors`; you only need to change the `binary` flag of the `load_word2vec_format` method.

If the model is **not** stored in the word2vec format, use the `load` method instead. This may be useful for loading pretrained embeddings from *GloVe, fastText, BPE* and other models.

Let us download a model for the Russian language, trained on the Russian National Corpus (`НКРЯ`/`NCRL`) and dated 2015.

In [None]:
import re
import gensim
import logging
import nltk.data 
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models import word2vec
from nltk.tokenize import sent_tokenize, RegexpTokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!wget https://rusvectores.org/static/models/rusvectores2/ruscorpora_mystem_cbow_300_2_2015.bin.gz

--2023-03-03 15:06:01--  https://rusvectores.org/static/models/rusvectores2/ruscorpora_mystem_cbow_300_2_2015.bin.gz
Resolving rusvectores.org (rusvectores.org)... 172.104.228.108
Connecting to rusvectores.org (rusvectores.org)|172.104.228.108|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-03-03 15:06:02 ERROR 403: Forbidden.



In [None]:
model_path = 'ruscorpora_mystem_cbow_300_2_2015.bin.gz'

model_ru = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)

FileNotFoundError: ignored

Let's take several words as an example:

In [None]:
words = ['клавиатура_S', 'мышь_S', 'кошка_S']

Tags like `_S` specify the part of speech of the given word (`S` stands for `существительное`, i.e. noun). The downloaded model was trained on words that were lemmatized and annotated with part-of-speech (POS) tags. **NB!** Model names on `rusvectores` indicate which tagset they use (mystem, upos, etc.).

Let's ask the model to find 10 nearest neighbors and the cosine similarity for each word:

In [None]:
for word in words:
    # check that the model contains the word
    if word in model_ru:
        print(word)
        # the vector has 300 dimensions, so print only the first 10 elements
        print(model_ru[word][:10])
        # show the 10 nearest neighbours
        for neighbour, sim in model_ru.most_similar(positive=[word], topn=10):
            # neighbour + cosine similarity coefficient
            print(f'{neighbour} : {sim}')
        print('\n')
    else:
        # the word is out of vocabulary
        print(f'model does not contain the word {word}!')

Let us find the cosine similarity for a pair of words:

In [None]:
print(model_ru.similarity('человек_S', 'обезьяна_S'))

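Vector arithmetic lets us ask questions such as "what happens if we subtract Italy from pizza and add Siberia?". The cell below shows the mechanics on a simpler case: it subtracts the vector of `обезьяна_S` (monkey) from the vector of `человек_S` (human) and prints the word nearest to the result.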

* positive - vectors that we add
* negative - vectors that we subtract

In [None]:
print(model_ru.most_similar(positive=['человек_S'], negative=['обезьяна_S'])[0][0])

The `doesnt_match` method returns the word that fits the list least:

In [None]:
model_ru.doesnt_match('пицца_S пельмень_S хот-дог_S ананас_S'.split())

**Warm Up Exercise**

Find a homonymous word whose different meanings show up among its top 10 neighbours (use the `most_similar` method):

By analogy with Italy - pizza, Siberia - dumplings, find a similar set of words to check:

In [None]:
word = 'галера_S'
if word in model_ru:
    print(word)

In [None]:
print(model_ru.most_similar(positive=['двигатель_S', 'машина_S'], negative=['галера_S'])[0][0])

Give an example of three words w1, w2, w3 such that w1 and w2 are synonyms, w1 and w3 are antonyms, but similarity (w1, w2) is less than similarity (w1, w3).

In [None]:
w1 = 'мышь_S'
w2 = 'клавиатура_S'
w3 = 'кошка_S'
print(f"sim(w1, w2) = {model_ru.similarity(w1, w2)}")
print(f"sim(w1, w3) = {model_ru.similarity(w1, w3)}")

### Exercise

Write a function that takes a sentence as input and replaces a random `Noun` with its "associate" (the closest word from the word2vec model).

**NB:** you need a morphology analyzer like pymorphy for this (we briefly talked about it at the last seminar).

How to use pymorphy:

In [None]:
!pip install pymorphy2

In [None]:
from pymorphy2 import MorphAnalyzer

In [None]:
analyser = MorphAnalyzer()

In [None]:
# parse a word (in this case, two results are possible, so we get a list of two elements)
result = analyser.parse('слово')
result

In [None]:
# get POS of the result
result[0].tag.POS

In [None]:
# convert the result to the dative case
result[0].inflect(frozenset(['datv'])).word

Implement your function here (for simplicity, you may skip converting the word to the "proper" form and limit yourself to the nominative case):

In [None]:
import random

sentence = 'мама мыла раму'

def change_random_noun(sentence):
    tokens = sentence.split()
    # collect tokens whose most probable parse is a noun known to the model
    candidates = []
    for idx, token in enumerate(tokens):
        parsed = analyser.parse(token)[0]
        if parsed.tag.POS == 'NOUN' and parsed.normal_form + '_S' in model_ru:
            candidates.append((idx, parsed.normal_form))
    if not candidates:
        # nothing to replace; also avoids looping forever on noun-free input
        return sentence
    idx, lemma = random.choice(candidates)
    # take the closest neighbour that is itself tagged as a noun
    for neighbour, _ in model_ru.most_similar(positive=[lemma + '_S'], topn=10):
        if neighbour.endswith('_S'):
            print(f'the closest NOUN to {lemma} is {neighbour}')
            tokens[idx] = neighbour[:-2]
            break
    return ' '.join(tokens)

change_random_noun(sentence)

## How to train your ̶d̶r̶a̶g̶o̶n̶ model ?

We will use tagged and untagged movie reviews (dataset taken from Kaggle) as training data.

In [None]:
! wget https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/data/w2v/train/unlabeledTrainData.tsv

In [None]:
import pandas as pd

In [None]:
!ls -la | grep Data

In [None]:
data = pd.read_csv("unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

len(data)

In [None]:
data.head()

We need to perform some preprocessing: remove links, html markup and non-alphabetic characters from the data, and then bring everything to lowercase and tokenize. The output is an array of sentences, each sentence is an array of words. We use the tokenizer from the `nltk` library.

In [None]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
nltk.download('stopwords')

In [None]:
def review_to_wordlist(review, remove_stopwords=False ):
    # remove links
    review = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", " ", review)
    # get the text from the page
    review_text = BeautifulSoup(review, "lxml").get_text()
    # leave only words
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    # convert to lowercase and split into words using space character
    words = review_text.lower().split()
    if remove_stopwords: # remove stopwords
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # break the review into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    # apply the function to each sentence
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences

In [None]:
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = []  

print("Parsing sentences from training set...")
for review in data["review"]:
    sentences += review_to_sentences(review, tokenizer)

In [None]:
print(len(sentences))
print(sentences[1])

In [None]:
# we'll need it later

with open('clean_text.txt', 'w') as f:
    for s in sentences[:5000]:
        f.write(' '.join(s))
        f.write('\n')

Train and save the model.

Main parameters:
* sentences - the dataset, must be an iterable of tokenized sentences,
* size - output vector size (renamed to `vector_size` in gensim 4.x),
* window - the size of the context window,
* min_count - the minimum frequency of a word in the corpus (rarer words are dropped),
* sg - learning algorithm (0 - CBOW, 1 - Skip-gram),
* sample - threshold for downsampling high-frequency words,
* workers - the number of threads,
* alpha - learning rate,
* iter - number of iterations over the corpus (renamed to `epochs` in gensim 4.x),
* max_vocab_size - allows you to set a memory limit when creating a dictionary (i.e. if the limit is exceeded, then low-frequency words will be thrown out). For comparison: 10 million words = 1 GB of RAM.

**NB!** Please note that model training does not include preprocessing! This means that you will have to get rid of punctuation, bring words to lower case, lemmatize them, and add part-of-speech tags before training the model (if, of course, this is necessary for your task). In other words, the words will appear in the model in exactly the form they have in the source text.
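
The effect of `min_count`, and of skipping preprocessing, can be illustrated with a minimal sketch in plain Python. This is not how gensim builds its vocabulary internally; the helper `count_vocab` and the toy sentences are assumptions made for this example.

```python
from collections import Counter

def count_vocab(sentences, min_count=1):
    """Count words over tokenized sentences and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]

# words seen fewer than 2 times ("bad", "plot") are dropped
print(count_vocab(toy_sentences, min_count=2))  # {'good': 2, 'movie': 2}

# without lowercasing, "Movie" and "movie" are two different vocabulary entries
print(count_vocab([["Movie"], ["movie"]]))  # {'Movie': 1, 'movie': 1}
```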

In [None]:
print("Training model...")

%time model_en = word2vec.Word2Vec(sentences, workers=4, size=300, min_count=10, window=10, sample=1e-3)

Let's see how many words are in the model.

In [None]:
print(len(model_en.wv.vocab))

Let's try to evaluate the model manually by solving examples. A few are given below, try to come up with your own.

In [None]:
print(model_en.wv.most_similar(positive=["woman", "actor"], negative=["man"], topn=1))
print(model_en.wv.most_similar(positive=["dogs", "man"], negative=["dog"], topn=1))

print(model_en.wv.most_similar("usa", topn=3))

print(model_en.wv.doesnt_match("comedy thriller western novel".split()))

In [None]:
print(model_en.wv.most_similar("pizza", topn=3))

### How to fit an existing model to your data

When training a model "from scratch", the weights are initialized randomly. However, we can instead initialize the weights from a pre-trained model and continue training on new data, which amounts to fine-tuning it.

First, let's see the similarity of a pair of words in the existing model, in order to then compare the result with the retrained one.

In [None]:
model_en.wv.similarity('lion', 'rabbit')

We take the text "Alice in Wonderland" as additional data for training:

In [None]:
! wget https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/data/w2v/train/alice.txt

In [None]:
with open("alice.txt", 'r', encoding='utf-8') as f:
    text = f.read()

# remove end of lines and tokenize text to sents
text = re.sub('\n', ' ', text)
sents = sent_tokenize(text)

# remove punctuation and tokenize to tokens
punct = r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~„“«»†*—/\-‘’'
clean_sents = []
for sent in sents:
    s = [w.lower().strip(punct) for w in sent.split()]
    clean_sents.append(s)
    
print(clean_sents[:2])

We will save and then load the model to get the files necessary for fitting. All training parameters (vector size, minimum word frequency, etc.) are taken from the loaded model, i.e. you cannot set them again.

**NB!** You can only fit the full model, not `KeyedVectors` output. Therefore, you need to save the model in the appropriate format. More about the difference [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [None]:
model_path = "movie_reviews.model"

print("Saving model...")
model_en.save(model_path)

In [None]:
model = word2vec.Word2Vec.load(model_path)
model.build_vocab(clean_sents, update=True)
model.train(clean_sents, total_examples=model.corpus_count, epochs=5)

**Lion** and **rabbit** have become closer to each other!

In [None]:
model.wv.similarity('lion', 'rabbit')

You can normalize the vectors so that the model takes up less RAM; however, the model cannot be trained further after that. L2 normalization is used: each vector is scaled so that the squares of its elements add up to 1.
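
As a quick illustration of L2 normalization on a single vector, here is a plain-Python sketch. The helper name `l2_normalize` is an assumption for this example and not part of gensim.

```python
import math

def l2_normalize(vector):
    """Scale the vector so that the sum of its squared elements equals 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

v = l2_normalize([3.0, 4.0])
print(v)                                # [0.6, 0.8]
print(round(sum(x * x for x in v), 6))  # 1.0
```

For unit-length vectors, cosine similarity reduces to a plain dot product.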

In addition, we will save only the `KeyedVectors`, not the full model.

In [None]:
model.init_sims(replace=True)
model_path = "movies_alice.bin"

print("Saving model...")
# save the vectors of the fine-tuned model (not of the original model_en)
model.wv.save_word2vec_format(model_path, binary=True)

## Evaluation

How do you know which model is better? How do you find out whether a model has improved?

For this, there are special datasets for assessing the quality of distributional models. There are two main kinds: one measures the accuracy of solving analogy tasks (like the one about Siberia and dumplings), and the other is used to assess how well the model estimates semantic similarity.

### Word Similarity

Check whether the semantic similarity computed by the model agrees with human judgement.

| word 1 | word 2 | similarity |
| --- | --- | --- |
| cat | dog | 0.7 |
| cup | mug | 0.9 |

For each pair of words from a predefined dataset, we can calculate the cosine similarity of their vectors and obtain a list of similarity values. We also have a list of similarity values assigned by people. We can compare the two lists and see how close they are (for example, by calculating the rank correlation). This correlation tells you how well the model reproduces human similarity judgements.
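
The rank correlation mentioned above can be sketched in plain Python as Spearman's coefficient, i.e. the Pearson correlation of the ranks. The human and model scores below are invented for illustration, and the helper names are assumptions; in practice a library routine would normally be used instead.

```python
import math

def ranks(values):
    """Ranks starting from 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# hypothetical scores for four word pairs
human_scores = [0.7, 0.9, 0.1, 0.4]
model_scores = [0.55, 0.80, 0.30, 0.20]
print(round(spearman(human_scores, model_scores), 3))  # 0.8
```

Only the ordering of the pairs matters here, so the model is not penalized for using a different scale than the annotators.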

### Analogies

Another popular "intrinsic" evaluation is based on analogies. As discussed above, simple arithmetic operations on vectors can modify the meaning of a word. If we collect in advance a set of modifier words together with the words we expect to obtain after the modification, we can estimate how well the model works by counting how often it "hits" the expected word.

We can use semantic analogies as modifier words. For example, if we have some kind of country-capital relationship, then to evaluate the model we can use pairs like Russia-Moscow, Norway-Oslo, and so on. The dataset will look like this:

| word 1 | word 2 | relation |
| ------------ | ------------ | --------------- |
| Russia | Moscow | country - capital |
| Norway | Oslo | country - capital |

Given two pairs from the set, we form a triplet (Russia, Moscow, Norway) and want to get the word "Oslo", i.e. to find the word that has the same relationship with "Norway" as "Moscow" has with "Russia".
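
The counting of "hits" can be sketched as follows. The three-dimensional vectors are hand-made for illustration (with one deliberately poor vector so that not every question is answered correctly), and the helpers `solve` and `analogy_accuracy` are assumptions for this example rather than a library API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve(vectors, a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the question words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Share of questions (a, b, c, expected) for which the model returns `expected`."""
    hits = sum(solve(vectors, a, b, c) == expected for a, b, c, expected in questions)
    return hits / len(questions)

# hypothetical vectors: the third coordinate loosely marks "is a capital"
toy = {
    "russia": [1.0, 0.0, 0.0], "moscow": [1.0, 0.0, 1.0],
    "norway": [0.0, 1.0, 0.0], "oslo": [0.0, 1.0, 1.0],
    "france": [0.7, 0.7, 0.0], "paris": [0.7, 0.7, 1.0],
    "italy": [0.6, 0.8, 0.0], "rome": [0.9, 0.1, 0.9],  # deliberately poor vector
}
questions = [
    ("russia", "moscow", "norway", "oslo"),
    ("norway", "oslo", "france", "paris"),
    ("russia", "moscow", "italy", "rome"),
]
print(round(analogy_accuracy(toy, questions), 2))  # 0.67
```

The last question is missed because the vector for "rome" does not follow the same offset as the other capitals, so "paris" ends up closer to the target.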

Datasets for the Russian language can be downloaded from the models page at RusVectōrēs. Let's calculate the quality of our NCRL model on an analogy dataset (note that in gensim 4.x the `accuracy` method used here was removed in favour of `evaluate_word_analogies`):

In [None]:
! wget https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/data/w2v/evaluation/ru_analogy_tagged.txt

In [None]:
res = model_ru.accuracy('ru_analogy_tagged.txt')

In [None]:
print(res[4]['correct'][:100])

In [None]:
print(res[4]['incorrect'][:10])

## Visualization

The resulting model can be explored with a 2D visualization.

### t-SNE

**t-SNE** (*t-distributed Stochastic Neighbor Embedding*) is a technique for nonlinear dimensionality reduction and visualization of multidimensional variables. It was specially designed for high-dimensional data by L. van der Maaten and G. Hinton, [here is their article](http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf). t-SNE is an iterative algorithm based on calculating pairwise distances between all objects (which is why it is rather slow).

Let's show the 1000 most frequent words from the collection of movie reviews:

In [None]:
from nltk import FreqDist
from tqdm import tqdm_notebook as tqdm
from sklearn.manifold import TSNE

top_words = []


fd = FreqDist()
for s in tqdm(sentences):
    fd.update(s)

for w in fd.most_common(1000):
    top_words.append(w[0])
    
print(top_words[:50])

In [None]:
# look the vectors up through .wv, which holds the trained word vectors
top_words_vec = model.wv[top_words]

In [None]:
%%time
tsne = TSNE(n_components=2, random_state=0)
top_words_tsne = tsne.fit_transform(top_words_vec)

In [None]:
# !pip install bokeh

In [None]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE (eng model, top1000 words)")

source = ColumnDataSource(data=dict(x1=top_words_tsne[:,0],
                                    x2=top_words_tsne[:,1],
                                    names=top_words))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

To compute the t-SNE transformation faster (and sometimes with better results), you can first reduce the dimensionality of the original data using, for example, SVD, and then apply t-SNE:

In [None]:
from sklearn.decomposition import TruncatedSVD

svd_50 = TruncatedSVD(n_components=50)
top_words_vec_50 = svd_50.fit_transform(top_words_vec)
top_words_tsne2 = TSNE(n_components=2, random_state=0).fit_transform(top_words_vec_50)

In [None]:
output_notebook()

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE (eng model, top1000 words, +SVD)")

source = ColumnDataSource(data=dict(x1=top_words_tsne2[:,0],
                                    x2=top_words_tsne2[:,1],
                                    names=top_words))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

## FastText

FastText uses not only word embeddings but also character n-gram embeddings. Each word in the corpus is automatically represented as a set of character n-grams. Say, if we set n = 3, then the vector for the word "where" will be the sum of the vectors of the following trigrams: "<wh", "whe", "her", "ere", "re>" (where the symbols "<" and ">" denote the beginning and end of the word). Thanks to this, we can also obtain vectors for words that are not in the vocabulary and work effectively with texts containing errors and typos.
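
The n-gram decomposition is easy to reproduce in plain Python. The helper `char_ngrams` below is a sketch written for this example, not the fastText API; it only shows why a misspelled word still shares most of its n-grams with the correct one.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with '<' and '>' boundary symbols."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# a typo keeps most n-grams of the original word, so its vector stays close
shared = set(char_ngrams("actor")) & set(char_ngrams("acttor"))
print(sorted(shared))  # ['<ac', 'act', 'or>', 'tor']
```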

* [Article](https://aclweb.org/anthology/Q17-1010)
* [Website](https://fasttext.cc/)
* [Tutorial](https://fasttext.cc/docs/en/support.html)
* [Vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)
* [Vectors trained on Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html) (separate for 294 different languages)
* [Repository](https://github.com/facebookresearch/fasttext)

There is a `fasttext` library for Python (you can also work with ready-made models through `gensim`).

In [None]:
! git clone https://github.com/facebookresearch/fastText.git
! pip3 install fastText/.

In [None]:
import fasttext

# train your model
ft_model = fasttext.train_unsupervised('clean_text.txt', minn=3, maxn=4, dim=300)

In [None]:
ft_model.get_word_vector("movie")[:20]

In [None]:
ft_model.get_nearest_neighbors('acttor')

In [None]:
ft_model.get_analogies("man", "woman", "actor")

In [None]:
# problem with typos is now solved

ft_model.get_nearest_neighbors('actr')

In [None]:
# problem with out of vocabulary is solved too

ft_model.get_nearest_neighbors('moviegeek')

fastText can also train a supervised text classifier. Let's download two files, one with positive and one with negative texts:

In [None]:
!wget -O positive.csv https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv?dl=0

In [None]:
!wget -O negative.csv https://dl.dropboxusercontent.com/s/r6u59ljhhjdg6j0/negative.csv

In [None]:
positive = pd.read_csv('positive.csv', sep=';', usecols=[3], names=['text'])
positive['label'] = ['positive'] * len(positive)
negative = pd.read_csv('negative.csv', sep=';', usecols=[3], names=['text'])
negative['label'] = ['negative'] * len(negative)
df = pd.concat([positive, negative])
df.head()

In [None]:
len(df)

Let's do the standard preprocessing:

In [None]:
! pip install pymorphy2

In [None]:
import pymorphy2
from functools import lru_cache
from multiprocessing import Pool
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook as tqdm
import re

m = pymorphy2.MorphAnalyzer()

regex = re.compile(r"[А-Яа-яA-Za-z:=!)(_%/|]+")

def words_only(text, regex=regex):
    try:
        return regex.findall(text)
    except TypeError:
        # non-string input, e.g. a missing value
        return []

In [None]:
#@lru_cache(maxsize=128)
# if you are *not* working in colab, you can replace pymorphy with mystem and uncomment the first line about lru_cache
def lemmatize(text, pymorphy=m):
    try:
        return " ".join([pymorphy.parse(w)[0].normal_form for w in text])
    except:
        return " "    

In [None]:
def clean_text(text):
    return lemmatize(words_only(text))

In [None]:
with Pool(8) as p:
    lemmas = list(tqdm(p.imap(clean_text, df['text']), total=len(df)))

    
df['lemmas'] = lemmas
df.head()

Let's write the resulting data in the format used for training the classifier:

In [None]:
X = df.lemmas.tolist()
y = df.label.tolist()

X, y = np.array(X), np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)
print ("total train examples %s" % len(y_train))
print ("total test examples %s" % len(y_test))

In [None]:
with open('data.train.txt', 'w+') as outfile:
    for i in range(len(X_train)):
        outfile.write('__label__' + y_train[i] + ' '+ X_train[i] + '\n')
    

with open('test.txt', 'w+') as outfile:
    for i in range(len(X_test)):
        outfile.write('__label__' + y_test[i] + ' ' + X_test[i] + '\n')
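
For reference, each line of these files consists of a label with the `__label__` prefix followed by the text. The small helper below is a sketch of that format (the function name is an assumption for this example); it also collapses whitespace, since every example has to fit on a single line.

```python
def to_fasttext_line(label, text):
    """Format one training example: label prefix, then the text on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {single_line}"

print(to_fasttext_line("positive", "отличный фильм"))  # __label__positive отличный фильм
print(to_fasttext_line("negative", "скучный\nсюжет"))  # __label__negative скучный сюжет
```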

In [None]:
classifier = fasttext.train_supervised('data.train.txt')
result = classifier.test('test.txt')

print('P@1:', result[1])
print('R@1:', result[2])
print('Number of examples:', result[0])

## Optional Homework

1. We will work with (part of) the Lenta.ru news data from [here](https://www.kaggle.com/yutkin/corpus-of-russian-news-articles-from-lenta/)
2. Perform preprocessing of the text. Break the data into train and test for the classification task (we will use the topic field as the class label). While working on steps 3 and 5, take the following data as inputs for the classification:
    - only titles (title)
    - only news texts (text)
    - both
3. Train fastText to categorize texts by topic. Compare the quality for different data from step 2.
4. Train your w2v model (or take any suitable pre-trained model). Implement a function to compute a vector of text / title / text + title as the average of the vectors of the words it contains.
     - (Bonus) Modify the vector mean calculation function: weight the word vectors with the appropriate tf-idf weights.
5. Train the classification algorithm on the obtained average vectors. Compare the obtained quality with the fastText classifier.

In [None]:
!pip install corus
!wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

In [None]:
from corus import load_lenta

path = 'lenta-ru-news.csv.gz'
records = load_lenta(path)
data = [(record.title, record.topic, record.text, record.tags) for record in records]

In [None]:
lenta = pd.DataFrame(data, columns=['title','topic','text','tags'])
lenta = lenta[lenta['topic'].isin(['Экономика','Спорт','Культура','Наука и техника','Бизнес'])]

In [None]:
lenta.head()

In [None]:
len(lenta)

In [None]:
lenta.topic.value_counts()

In [None]:
from gensim.models.phrases import Phrases, Phraser
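# Phrases expects tokenized sentences, while `lemmas` holds space-joined strings
phrases = Phrases([lemma_str.split() for lemma_str in lemmas], min_count=3)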
phraser = Phraser(phrases)

Let's [download the created model from the Colaboratory workspace](https://stackoverflow.com/questions/48774285/how-to-download-file-created-in-colaboratory-workspace) and evaluate it with Parallax.


In [None]:
from google.colab import files

# load the full model saved earlier (model_path now points to the KeyedVectors file)
model = word2vec.Word2Vec.load("movie_reviews.model")
model.wv.save_word2vec_format('model_v1.txt', binary=False)
files.download('model_v1.txt')


In [None]:
model.build_vocab(clean_sents, update=True)
model.train(clean_sents, total_examples=model.corpus_count, epochs=5)
model.wv.save_word2vec_format('model_v2.txt', binary=False)
files.download('model_v2.txt') 



In [None]:
!ls -la
!head -n 3 model_v2.txt