# fine-tuning w2v on oposum

For each domain: build the vocabulary on the corpus (train/dev/test), load the pre-trained Google News embeddings, and save the intersected w2v model. Then train on the domain corpus and save the fine-tuned w2v model.

In [17]:
import gensim
import logging
from gensim.models import Word2Vec, KeyedVectors
import os
from time import time

%load_ext autoreload
%autoreload 2

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
%cd ../

c:\Project\group-1.3


In [4]:
from scripts.utils import *

[nltk_data] Downloading package words to C:\nltk_data...
[nltk_data]   Package words is already up-to-date!


# example setup

In [9]:
domain = 'bags_and_cases'
corpus_file = './processed/oposum/' + domain + '_corpus.pkl'
corpus = pickle_load(corpus_file)

In [11]:
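# gensim 3.x API: `size` is the vector dimensionality (renamed `vector_size` in gensim 4)
model = Word2Vec(size=300)
# drop tokens that occur fewer than 5 times in the corpus
model.build_vocab(corpus, min_count=5)
# number of sentences seen by build_vocab; reused later by train()
total_examples = model.corpus_count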
print(model.wv.vectors.shape)

(9040, 300)


vocab size (`model.wv.vectors.shape`) by corpus variant and `min_count`:
- (15429, 300), corpus_wostw_wotf1, min_count=1
- (9040, 300), wostw_wotf1, min_count=5
- (30430, 300), wostw, min_count=1
- (9040, 300)
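
The effect of `min_count` on the numbers above can be illustrated without gensim. The following is a minimal sketch, assuming a corpus is a list of tokenized sentences; `vocab_size` is a hypothetical helper (not a gensim function) and the toy corpus is made up for illustration.

```python
from collections import Counter


def vocab_size(corpus, min_count=1):
    """Count distinct tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return sum(1 for c in counts.values() if c >= min_count)


# Toy corpus (made up): a list of tokenized sentences.
toy_corpus = [
    ["the", "bag", "is", "sturdy"],
    ["the", "strap", "is", "weak"],
    ["the", "bag", "is", "light"],
]

print(vocab_size(toy_corpus, min_count=1))  # every distinct token is kept
print(vocab_size(toy_corpus, min_count=2))  # tokens seen only once are dropped
```

Raising `min_count` can only shrink the vocabulary, which matches the drop from 15429 to 9040 rows on the same corpus variant.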

In [None]:
model.intersect_word2vec_format('./wv/GoogleNews-vectors-negative300.bin.gz', binary=True, lockf=1.0)
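
Conceptually, the intersect step only overwrites vectors for words that are already in the model's vocabulary. Words missing from the pre-trained file keep their initial vectors, and no new words are added. With `lockf=1.0` the imported vectors stay trainable, whereas `0.0` would freeze them. The sketch below mimics that logic with plain dictionaries; `intersect` and the toy vectors are hypothetical and are not gensim's implementation.

```python
def intersect(model_vectors, pretrained_vectors, lockf=1.0):
    """Overwrite vectors of in-vocabulary words with pre-trained ones.

    Returns a per-word lock factor: `lockf` for imported words and
    1.0 (fully trainable) for words that were not imported.
    """
    lock = {word: 1.0 for word in model_vectors}
    for word, vec in pretrained_vectors.items():
        if word in model_vectors:  # words outside the vocab are ignored
            model_vectors[word] = list(vec)
            lock[word] = lockf
    return lock


# Toy data (made up): "bag" is shared, "zipper" is domain-only,
# "news" exists only in the pre-trained set.
model_vectors = {"bag": [0.1, 0.2], "zipper": [0.3, 0.4]}
pretrained = {"bag": [1.0, 2.0], "news": [5.0, 6.0]}

lock = intersect(model_vectors, pretrained, lockf=1.0)
print(model_vectors)
print(lock)
```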

In [None]:
model.wv.save_word2vec_format("./wv/oposum/" + domain + '_pretrained.bin', binary=True)

In [None]:
model.train(corpus, total_examples=total_examples, epochs=100)

In [None]:
model.wv.save_word2vec_format("./wv/oposum/" + domain + '_tuned.bin', binary=True)
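
Since both the intersected and the fine-tuned vectors are saved, one possible sanity check is to see which words moved most during training. This is a minimal sketch, assuming both sets of vectors are available as word-to-vector mappings; `cosine`, `most_shifted`, and the toy vectors are hypothetical and not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_shifted(pretrained, tuned, top_n=3):
    """Return shared words with the lowest before/after cosine similarity."""
    shared = set(pretrained) & set(tuned)
    scored = [(cosine(pretrained[w], tuned[w]), w) for w in shared]
    return [w for _, w in sorted(scored)[:top_n]]


# Toy vectors (made up): "strap" rotates by 90 degrees, "bag" moves slightly,
# "case" does not move at all.
pretrained = {"bag": [1.0, 0.0], "strap": [0.0, 1.0], "case": [1.0, 1.0]}
tuned = {"bag": [1.0, 0.1], "strap": [1.0, 0.0], "case": [1.0, 1.0]}

print(most_shifted(pretrained, tuned, top_n=1))
```

In the notebook, the two mappings could be filled from the `_pretrained.bin` and `_tuned.bin` files, for example by loading each with `KeyedVectors.load_word2vec_format(path, binary=True)`.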

# fine-tuning

In [18]:
all_domains = ['bags_and_cases', 'bluetooth', 'boots', 'keyboards', 'tv', 'vacuums']
pretrained_w2v_file = './wv/GoogleNews-vectors-negative300.bin.gz'
finetune_output_dir = './wv/oposum/'
eps = 100  # number of training epochs (not an epsilon)

In [19]:
os.makedirs(finetune_output_dir, exist_ok=True)

In [20]:
for domain in all_domains:
    t0 = time()
    print(f"for domain {domain}")
    corpus_file = './processed/oposum/' + domain + '_corpus.pkl'
    corpus = pickle_load(corpus_file)

    model = Word2Vec(size=300)
    # min_count=1 keeps every token (the example setup above used min_count=5)
    model.build_vocab(corpus, min_count=1)
    total_examples = model.corpus_count
    print(f"vocab size: {model.wv.vectors.shape[0]}")

    print("loading pre-trained vectors ...")
    model.intersect_word2vec_format(pretrained_w2v_file, binary=True, lockf=1.0)
    print("save intersected pre-trained word vectors ...")
    model.wv.save_word2vec_format(finetune_output_dir + domain + '_pretrained.bin', binary=True)
    print("start training ...")
    t1 = time()
    model.train(corpus, total_examples=total_examples, epochs=eps)
    print(f"training cost {time() - t1:.2f} seconds")
    print("save fine-tuned word vectors ...")
    model.wv.save_word2vec_format(finetune_output_dir + domain + '_tuned.bin', binary=True)
    print(f"finish fine-tuning on domain {domain} in {time() - t0:.2f} seconds!\n\n")


for domain bags_and_cases
vocab size: 30430
loading pre-trained vectors ...
save intersected pre-trained word vectors ...
start training ...
training cost 604.11 seconds
save fine-tuned word vectors ...
finish fine-tuning on domain bags_and_cases in 742.97 seconds!


for domain bluetooth
vocab size: 51248
loading pre-trained vectors ...
save intersected pre-trained word vectors ...
start training ...
training cost 1342.57 seconds
save fine-tuned word vectors ...
finish fine-tuning on domain bluetooth in 1529.46 seconds!


for domain boots
vocab size: 30345
loading pre-trained vectors ...
save intersected pre-trained word vectors ...
start training ...
training cost 781.91 seconds
save fine-tuned word vectors ...
finish fine-tuning on domain boots in 907.57 seconds!


for domain keyboards
vocab size: 34081
loading pre-trained vectors ...
save intersected pre-trained word vectors ...
start training ...
training cost 582.01 seconds
save fine-tuned word vectors ...
finish fine-tuning on do