## Create Google Word2Vec mean vectors for Naive Bayes and Logistic Regression


Word embeddings represent words as vectors in such a way that words with similar meanings have similar vectors, which lie close to each other in the vector space. The dimension of the vectors can vary, and so can the methods used to compute them. Word2Vec is one of the most successful word embedding techniques. It proposes two main algorithms for efficiently learning vector representations of words from a large corpus of text: CBOW (Continuous Bag of Words) and Skip-gram. Both algorithms are based on neural networks.
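The difference between the two algorithms lies in what is used as input and what is predicted: CBOW predicts the centre word from its surrounding context, while Skip-gram predicts each context word from the centre word. The following is only a minimal sketch of how the training examples could be built from a sentence. The helper `training_pairs`, the window size of 2 and the example sentence are illustrative assumptions, and the neural network itself is not shown.

```python
def training_pairs(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


tokens = "the quick brown fox jumps".split()

# CBOW: the context words are the input, the centre word is the label.
cbow_examples = [(context, target) for target, context in training_pairs(tokens)]

# Skip-gram: the centre word is the input, each context word is a separate label.
skipgram_examples = [
    (target, ctx) for target, context in training_pairs(tokens) for ctx in context
]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:2])  # [('the', 'quick'), ('the', 'brown')]
```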

One of the main problems that word embeddings solve, along with capturing semantics, is sparsity. Unlike BOW and TF-IDF, these techniques produce dense, low-dimensional vectors, a big step forward from a matrix that could have as many columns as there are words in the entire vocabulary.
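To make the sparsity point concrete, here is a small sketch that builds a bag-of-words matrix by hand for three made-up sentences and counts its zero entries. The sentences are an assumption for illustration. On such a toy corpus the vocabulary is tiny, but the number of BOW columns grows with every new word, whereas the embedding width stays fixed.

```python
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
]

# Bag-of-words: one column per distinct word in the whole corpus.
vocab = sorted({word for doc in docs for word in doc.split()})
bow = [[doc.split().count(word) for word in vocab] for doc in docs]

zeros = sum(row.count(0) for row in bow)
total = len(docs) * len(vocab)
print(f"BOW columns: {len(vocab)}, share of zero entries: {zeros / total:.0%}")

# A dense embedding keeps a fixed width regardless of vocabulary size.
EMBEDDING_DIM = 300  # width of the Google News vectors used in this notebook
print(f"Embedding columns: {EMBEDDING_DIM}")
```

Even with only three short documents, more than half of the BOW entries are zero, and the share rises quickly as the vocabulary grows.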

The Word2Vec embeddings used here are 300-dimensional vectors representing 3 million words and phrases, pre-trained on 100 billion words from the [Google News dataset](https://code.google.com/archive/p/word2vec/).

In order to use word embeddings with the Naive Bayes and Logistic Regression algorithms, each word in a document is looked up in the set of available embeddings, and the resulting word vectors are normalized and then averaged. The result is a 300-dimensional mean vector for each document.
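The normalize-then-average step can be sketched in plain Python. This is an illustration of the idea and not gensim's implementation: the 3-dimensional `toy_vectors`, the helpers `unit` and `mean_vector`, and the zero-vector fallback for documents with no known words are all assumptions made for the example.

```python
import math

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "good": [3.0, 4.0, 0.0],
    "movie": [0.0, 0.0, 2.0],
}


def unit(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_vector(words, vectors, dim=3):
    """Average the unit-normalized vectors of the words found in `vectors`."""
    known = [unit(vectors[w]) for w in words if w in vectors]
    if not known:
        return [0.0] * dim  # assumption: no known word gives a zero vector
    return [sum(column) / len(known) for column in zip(*known)]


# "a" is not in the toy dictionary, so it is skipped.
doc_vector = mean_vector("a good movie".split(), toy_vectors)
print(doc_vector)  # [0.3, 0.4, 0.5]
```

Normalizing each word vector before averaging means that every known word contributes equally to the document vector, regardless of the length of its original embedding.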

In [2]:
import warnings
warnings.filterwarnings("ignore")
import gensim
from gensim.models import Word2Vec
import gensim.downloader as gensim_api
import pandas as pd
import swifter
import pickle

In [4]:
FILE = "../data/data.parquet.gzip"
data = pd.read_parquet(FILE)

In [5]:
wv = gensim_api.load("word2vec-google-news-300")

In [7]:
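length = len(data)
vecs = []
for i, txt in enumerate(data["processed_docs"], start=1):
    words = txt.split()
    # Mean of the unit-normalized vectors of the words in this document
    vec = wv.get_mean_vector(words, pre_normalize=True)
    vecs.append(vec)
    # Progress indicator, overwritten in place on each iteration
    print(f'{i} of {length} - {i / length:.2%}', end='\r')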

save_as = "vecs_prenorm"
# Cache the vectors so the slow loop above does not have to be rerun
with open("tmp/" + save_as + ".pkl", "wb") as f:
    pickle.dump(vecs, f)

In [8]:
with open("tmp/" + "vecs_prenorm.pkl", "rb") as f:
    vecs = pickle.load(f)

In [9]:
data["google-news_w2v_mean_prenorm"] = vecs

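# Pass compression explicitly so the file content matches its .gzip extension
data[["id", "target", "google-news_w2v_mean_prenorm"]].to_parquet("../data/w2v-vectors.parquet.gzip", compression="gzip")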