# import2vec - Learning Embeddings for SW Libraries

(see paper https://arxiv.org/abs/1904.03990)

Please first select the **language** for which you want to train library embeddings. More configuration options are available in the config section below.

In [1]:
language = 'javascript'   # Supported languages: python, javascript, java

max_projects = 10000        # Maximum number of GitHub projects to download. Set to 0 to download all.
min_stars = 10            # Minimum number of GitHub stars a project should have to be considered.
max_size = 0              # Maximum project size (in kb). Set to 0 for unlimited.

## Library Vectors

Trains vectors for the libraries of the selected language. Imports are first mapped to the library they belong to: the top-level module for Python, the enclosing package for Java, and the module name as imported for JavaScript. The model is then trained on target/context pairs of libraries that are imported together.
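
As a rough illustration of the two steps described above, the sketch below maps raw import strings to libraries and then enumerates target/context pairs for one source file. The two mapping helpers mirror the ones defined later in this notebook; `library_pairs` and the sample imports are hypothetical and only serve the illustration.

```python
from itertools import permutations

def to_lib(imp):
    # Python: keep only the top-level module ('os.path' -> 'os')
    return imp.split('.')[0].split(':')[0]

def to_package(imp):
    # Java: drop the class name ('java.util.List' -> 'java.util')
    return '.'.join(imp.split('.')[:-1])

def library_pairs(imports, mapping):
    """Hypothetical helper: all ordered (target, context) pairs of distinct libraries."""
    libs = sorted(set(mapping(imp) for imp in imports))
    return list(permutations(libs, 2))

python_imports = ['os.path', 'numpy.linalg', 'numpy.random']   # made-up imports of one file
pairs = library_pairs(python_imports, to_lib)
print(pairs)   # [('numpy', 'os'), ('os', 'numpy')]
```

The actual training cells further down additionally sub-sample frequent libraries and add negative pairs, which this sketch leaves out.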

### 1. Setup Environment & Download Dataset from GitHub

First let's check your environment, install required libraries (if needed) and download a dataset for training.

⚠️ The dataset required for training Library Vectors will be downloaded through the GitHub API. You therefore first need to request an <font color='green'>API key</font> at https://github.com/settings/tokens and write it down in a file called <font color='green'>apikey.txt</font> in the <font color='green'>scripts</font> sub-directory. See https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line for more information on how to create an API key (personal access token). In step 7, you need to request the following scope: <font color='green'>repo > public_repo</font>.
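
Before starting the download, it can help to verify that the key file is in place. The following is a minimal sketch under the assumption that the file only needs to exist and contain a non-empty token; `read_api_key` is a hypothetical helper and not part of the import2vec scripts.

```python
from pathlib import Path

def read_api_key(path):
    """Hypothetical helper: return the token stored in `path`, or raise a clear error."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError('No API key file found at {}'.format(key_file))
    token = key_file.read_text(encoding='utf-8').strip()
    if not token:
        raise ValueError('{} is empty'.format(key_file))
    return token

# Assumed location relative to this notebook: the scripts sub-directory
# token = read_api_key('../scripts/apikey.txt')
```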

⚠️ Make sure your Jupyter environment is set up correctly. If you run into issues, please execute the following command and restart your Jupyter or JupyterLab environment:

```bash
     jupyter labextension install @jupyter-widgets/jupyterlab-manager
```

To get a realistic dataset, configure the command-line arguments below:

```bash
   run.sh <language> [<maxprojects>] [<minstars>] [<maxsize>] [<usecache>]
       
   where
     - <language>      Programming language. Supported languages: python, java, javascript, csharp, php, ruby
     - <maxprojects>   Maximum GitHub projects (default = 0: all)
     - <minstars>      Min GitHub stars (default = 2)
     - <maxsize>       Max project size (in kb) (default = 0)
     - <usecache>      Cache intermediate files/scripts (default = 0)
```

In [2]:
supported_languages = ['python', 'java', 'javascript']
if language not in supported_languages:
    raise ValueError('Unsupported language: ' + language)

In [None]:
! cd ../scripts; ./run.sh $language $max_projects $min_stars $max_size

In [None]:
! pip install keras json_lines scikit-learn gensim scipy matplotlib tensorflow bcolz bitarray ipywidgets tqdm six

In [11]:
import gc
import gzip
import collections
import math
import os
import sys
import zipfile
from pathlib import Path
import json
import string
from six import string_types

from keras.models import Model
from keras.layers import Input, Dense, Reshape, Embedding, dot, add, BatchNormalization
from keras.preprocessing import sequence

import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE as tsne
from gensim import matutils
from scipy.spatial.distance import cosine
from matplotlib import pyplot as plt

from tqdm import tqdm_notebook as tqdm

### 2. Configuration

You can change the below parameters to tune the vectors.

In [12]:
rootDir = '../datasets/' + language + '/'
config = {}

config['validation_ratio'] = 0.0         # number between 0 and 1 corresponding to the percentage of projects to hold out for validation.
config['true_negative_sampling'] = True  # if True, check that negative samples never co-occur in the entire dataset, by using a co-occurrence matrix.
config['negative_sampling_ratio'] = 1    # number of negative samples to generate for each positive sample.

# tuning
config['reduce_dict_size'] = True        # if True, libraries will first be filtered before calculating vectors

config['min_import_freq'] = 2            # minimum number of times a library must be imported for it to be kept in the vocabulary
config['max_import_freq'] = 1000000      # maximum number of times a library can be imported for it to be kept in the vocabulary
config['min_imports'] = 2                # a project is kept only if it has more than this many imports left after the vocabulary filtering above.

# Model parameters
config['window_size'] = 1000             # a large window size gives less importance to the sequence in which libraries are imported in source files.
config['vector_dim'] = 100               # the dimensionality of the trained library vectors.
config['epochs'] = 100                   # number of epochs to train.

rootDir

'../datasets/javascript/'

In [13]:
! mkdir -p $rootDir/models

### 3. Load data

In [14]:
directory = rootDir + 'processed/'

datasets = {}
for filename in filter(lambda x: x.endswith('.json.gz'),os.listdir(directory)):
    with gzip.open(directory + filename, 'rt') as f:
        #print("Loading " + filename + "...")
        new_dataset = json.loads(f.read())
        datasets = {**datasets, **new_dataset}
        sys.stdout.write('\r' + str(len(datasets)) + ' ' + language.upper() + ' files read')

7891 JAVASCRIPT files read

In [15]:
projects = [k for k,v in datasets.items()]
len(projects), projects[:5]

(7891,
 ['../datasets/javascript/dataset01/0123cf/auto-layout',
  '../datasets/javascript/dataset01/0123cf/together-draw',
  '../datasets/javascript/dataset01/06wj/pokemon',
  '../datasets/javascript/dataset01/07Gond/npm-global-rm-interactive',
  '../datasets/javascript/dataset01/0age/metamorphic'])

In [16]:
# utility functions for the supported languages

def to_lib(imp):      # converts imp to library part (i.e. ignore everything after .)
    return imp.split('.')[0].split(':')[0]

def to_package(imp):  # converts imp to package
    return '.'.join(imp.split('.')[:-1])

# for javascript (NPM)
accepted_first_chars = string.ascii_letters + string.digits + '@'

is_valid = lambda x: True

if language == 'python':
    mapping = to_lib
elif language == 'java':
    mapping = to_package
elif language == 'javascript':
    mapping = lambda x: x
    is_valid = lambda x: x[0] in accepted_first_chars
else:
    print("Unsupported language:", language)  

In [17]:
data = []   # list of (project, [project imports])

for project, files in tqdm(datasets.items()):
    data.append([project, list(set([mapping(imp) for file, imps in files.items() for imp in imps if is_valid(imp)]))])

In [18]:
# remove empty projects
data = list(filter(lambda row: len(row[1]) > 0, data))
len(data)

7592

In [19]:
# filter imps (remove duplicates and empty)
def remove_empty(imps):
    if '' in imps:
        imps.remove('')
    return imps

data = [[row[0], remove_empty(list(set(row[1])))] for row in tqdm(data)]
data[0]

['../datasets/javascript/dataset01/0123cf/auto-layout',
 ['html-webpack-plugin', 'path', 'clean-webpack-plugin']]

### 4. Create Training and Validation Set

In [20]:
np.random.shuffle(data)
split_point = math.ceil(len(data) * (1-config['validation_ratio']))
data_trn = data[:split_point]
data_val = data[split_point:]
len(data), len(data_trn), len(data_val)

(7592, 7592, 0)
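
The split arithmetic can be checked in isolation. The sketch below reuses the formula from the cell above; `split_sizes` is a hypothetical helper, and the second call uses an assumed `validation_ratio` of 0.1 purely for illustration.

```python
import math

def split_sizes(num_projects, validation_ratio):
    """Return (training size, validation size) using the same formula as the split above."""
    split_point = math.ceil(num_projects * (1 - validation_ratio))
    return split_point, num_projects - split_point

print(split_sizes(7592, 0.0))   # (7592, 0), matching the output above
print(split_sizes(7592, 0.1))   # (6833, 759)
```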

In [21]:
data = None # can't use data anymore, must use data_trn or data_val from this point onwards!
del data

#### 4.1 Save the Training and Validation Set to disk

In [22]:
import bcolz
  
def save_array(fname, arr):
    c=bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    return bcolz.open(fname)[:]

In [23]:
with open(rootDir + 'models/data_trn_dim{}.json'.format(config['vector_dim']), 'w', encoding='utf-8') as out:
    json.dump(data_trn, out)
if len(data_val) > 0:
    with open(rootDir + 'models/data_val_dim{}.json'.format(config['vector_dim']), 'w', encoding='utf-8') as out:
        json.dump(data_val, out)

#### 4.2 Load Validation and Training set

In [24]:
with open(rootDir + 'models/data_trn_dim{}.json'.format(config['vector_dim']), 'r', encoding='utf-8') as f:
    data_trn = json.load(f)
if config['validation_ratio'] > 0.0:
    with open(rootDir + 'models/data_val_dim{}.json'.format(config['vector_dim']), 'r', encoding='utf-8') as f:
        data_val = json.load(f)
else:
    data_val = []

#### 4.3 Build dictionary of imports

The dictionary maps each imported library to an index. The index will later be used as the position in the imports vector.

First, let's create a list of (class/project, import) pairs.

In [25]:
dataset = [[row[0], imp] for row in data_trn for imp in row[1]]
len(dataset)

132553

We then transform this list into a dictionary that maps each library name to its import count and a unique index.

In [26]:
idx = 0
lib_dict = {}
for row in tqdm(dataset):
    lib = row[1]
    if lib in lib_dict:
        lib_dict[lib]['count'] += 1
    else:
        lib_dict[lib] = {'count': 1, 'index': idx}
        idx += 1

This gives us a total vocabulary size (number of distinct imports) of:

In [27]:
vocab_size = len(lib_dict)
vocab_size

35447

out of a total of 132553 imports, i.e. the number of (project, import) pairs in `dataset` computed above.

#### 4.4 Reduce the size of the dictionary

To keep the dataset manageable for visualization, we can experiment with a few heuristics:
- remove libraries that aren't frequently imported: configured with `min_import_freq`
- remove libraries that are too frequently imported: configured with `max_import_freq`
- remove classes/projects that have too few imports (from the set of reduced imports): configured with `min_imports`

The reasoning is the following:
- if a library is imported only a few times, it cannot contribute to clusters of significant size.
- libraries that are imported too frequently are very generic and therefore not distinctive enough for clustering. Examples are logging libraries, or standard Java packages that almost everyone imports, such as `java.lang` and `java.util`.
- if a project has too few imports left, there is little association with other projects to learn from.

In [28]:
if config['reduce_dict_size']:
    lib_dict = {k:v for k,v in lib_dict.items() if v['count'] >= config['min_import_freq'] and v['count'] <= config['max_import_freq']}
    vocab_size = len(lib_dict)
    print(vocab_size)

9303


We need to re-index `lib_dict`, so that the indices fall within `0..vocab_size-1`.

In [29]:
lib_dict = {k: {"index": idx, "freq": lib_dict[k]["count"]} for idx, k in enumerate(lib_dict)}
reversed_lib_dict = dict(zip([v["index"] for v in lib_dict.values()], lib_dict.keys()))
list(lib_dict)[:5], list(reversed_lib_dict)[:5]

(['util', 'child_process', 'dotenv', 'request-promise', 'path'],
 [0, 1, 2, 3, 4])
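
The filtering and re-indexing steps can be tried on a toy vocabulary. This is a minimal sketch with made-up counts; `reduce_and_reindex` and the `toy_` names are hypothetical and chosen so that they do not overwrite the notebook's own variables.

```python
def reduce_and_reindex(counts, min_freq, max_freq):
    """Hypothetical helper: keep libraries with min_freq <= count <= max_freq and renumber them."""
    kept = {lib: c for lib, c in counts.items() if min_freq <= c <= max_freq}
    lib_index = {lib: {'index': idx, 'freq': c} for idx, (lib, c) in enumerate(kept.items())}
    reversed_index = {v['index']: lib for lib, v in lib_index.items()}
    return lib_index, reversed_index

toy_counts = {'left-pad': 1, 'path': 5, 'fs': 3}   # made-up import counts
toy_lib_dict, toy_reversed = reduce_and_reindex(toy_counts, min_freq=2, max_freq=1000000)
print(toy_lib_dict)    # {'path': {'index': 0, 'freq': 5}, 'fs': {'index': 1, 'freq': 3}}
print(toy_reversed)    # {0: 'path', 1: 'fs'}
```

After filtering, the indices are contiguous again, which is what the embedding layer and the co-occurrence bit array further down rely on.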

#### 4.5 Save the Library dictionary to disk

In [30]:
with open(rootDir + 'models/lib_dict_dim{}.json'.format(config['vector_dim']), 'w', encoding='utf-8') as out:
    json.dump(lib_dict, out)

with open(rootDir + 'models/reversed_lib_dict_dim{}.json'.format(config['vector_dim']), 'w', encoding='utf-8') as out:
    json.dump(reversed_lib_dict, out)

In [31]:
lib_names = list(lib_dict)
len(lib_names), lib_names[:5]

(9303, ['util', 'child_process', 'dotenv', 'request-promise', 'path'])

In [32]:
data_dict_reduced = {}
for row in tqdm(data_trn):
    if row[0] in data_dict_reduced:
        print("Ignoring duplicate project:", row[0])
    else:
        filtered_libs = [lib for lib in row[1] if lib in lib_dict]
        if len(filtered_libs) > config['min_imports']:
            lib_indices = [lib_dict[lib]["index"] for lib in filtered_libs]
            data_dict_reduced[row[0]] = { "libs": filtered_libs, "lib_ids": lib_indices }

In [33]:
len(data_dict_reduced)

6298

In [34]:
data_list = [v['lib_ids'] for _, v in data_dict_reduced.items()]
len(data_list)

6298

In [35]:
del data_dict_reduced

#### 4.6 Build co-occurrence matrix for True negative sampling

Record which pairs of libraries co-occur in at least one project.

Space-efficient storage of the co-occurrence matrix:

The matrix is symmetric, so only the upper triangle (without the diagonal) of the n x n matrix is stored, flattened row by row into a bit array.
Example for n = 5: co(0,4) + co(1,2) + co(3,2) + co(3,4) gives

    .0001
    ..100
    ...10
    ....1
    .....

The index in the flattened one-dimensional bit array that stores the co-occurrence of $(i, j)$ is given by $i \cdot n + j - \frac{(i+1) \cdot (i+2)}{2}$, where $i < j$.  
The size of the bit array is $\frac{(n-1) \cdot n}{2}$.
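
The index formula can be verified on the 5 x 5 example above. The sketch below is a stand-alone check; `flat_index` is a hypothetical name chosen so that it does not clash with the notebook's own function.

```python
from itertools import combinations

def flat_index(i, j, n):
    """Position of the pair (i, j), i != j, in the flattened upper triangle of an n x n matrix."""
    s, l = (i, j) if i < j else (j, i)
    return s * n + l - (s + 1) * (s + 2) // 2

n = 5
# Every pair i < j, enumerated row by row, lands on consecutive positions 0..9
indices = [flat_index(i, j, n) for i, j in combinations(range(n), 2)]
print(indices)            # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# The example above: co(0,4) + co(1,2) + co(3,2) + co(3,4)
bits = ['0'] * (n * (n - 1) // 2)
for i, j in [(0, 4), (1, 2), (3, 2), (3, 4)]:
    bits[flat_index(i, j, n)] = '1'
print(''.join(bits))      # 0001100101, i.e. the rows 0001 | 100 | 10 | 1
```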



In [36]:
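def co_idx(i, j, n):
    """Index of the unordered pair (i, j) in the flattened upper triangle of an n x n matrix."""
    if i == j:
        # the diagonal is not stored; returning -1 would silently address the last bit
        raise ValueError("indices should be different")
    (s, l) = (i, j) if i < j else (j, i)
    return s * n + l - (s+1)*(s+2)//2   # integer arithmetic: (s+1)*(s+2) is always even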

In [37]:
co_size = (vocab_size-1)*vocab_size // 2   # number of bits; the product is always even
print(co_size / 1024/1024/1024/8, 'GB')

0.005037087597884238 GB


In [38]:
from bitarray import bitarray

if config['true_negative_sampling']:
    co_occurrence = bitarray(int(co_size))
    co_occurrence.setall(False)
    for row in tqdm(data_list):
        for i in range(0, len(row)-1):
            for j in range(i+1, len(row)):
                co_occurrence[co_idx(row[i], row[j], vocab_size)] = True
    print(co_occurrence.count())

1147263


#### 4.7 Build file-level sequences of library imports

In [39]:
file_data = []

for project, files in tqdm(datasets.items()):
    for file, imps in files.items():
        file_imps = list(set([mapping(imp) for imp in imps if is_valid(imp)]))
        if len(file_imps) > 0:
            file_data.append([project + ':' + file, file_imps])

In [40]:
file_data[:5]

[['../datasets/javascript/dataset01/0123cf/auto-layout:webpack.config.js',
  ['html-webpack-plugin', 'path', 'clean-webpack-plugin']],
 ['../datasets/javascript/dataset01/0123cf/together-draw:server/index.js',
  ['http', 'socket.io', 'express']],
 ['../datasets/javascript/dataset01/0123cf/together-draw:src/index.js',
  ['fabric']],
 ['../datasets/javascript/dataset01/0123cf/together-draw:webpack.config.dev.js',
  ['html-webpack-plugin', 'path', 'webpack']],
 ['../datasets/javascript/dataset01/0123cf/together-draw:webpack.config.prod.js',
  ['html-webpack-plugin', 'path', 'webpack']]]

In [41]:
file_data_dict = {}
for row in tqdm(file_data):
    if row[0] in file_data_dict:
        print("Ignoring duplicate file:", row[0])
    else:
        filtered_libs = [lib for lib in row[1] if lib in lib_dict]
        if len(filtered_libs) > 0:
            lib_indices = [lib_dict[lib]["index"] for lib in filtered_libs]
            file_data_dict[row[0]] = { "libs": filtered_libs, "lib_ids": lib_indices }

In [42]:
del file_data

In [43]:
len(file_data_dict)

174607

In [44]:
file_data_list = [v['lib_ids'] for _, v in file_data_dict.items()]
len(file_data_list)

174607

In [45]:
del file_data_dict

#### 4.8 Generate Context-Target Library Pairs for each File-level Sequence

In [46]:
import random
import math

def make_sampling_table(lib_dict, sampling_factor=1e-4):
    total_count = 0
    for _,v in lib_dict.items():
        total_count += v["freq"]
    print("total count:", total_count)
    return [min(1, math.sqrt(sampling_factor/(v["freq"]/total_count))) for _,v in lib_dict.items()]
    
sampling_table = make_sampling_table(lib_dict)

total count: 106409


In [47]:
sampling_probs = np.array(list(zip(list(lib_dict),sampling_table)))
sorted_sampling_probs = sampling_probs[np.argsort(sampling_probs[:,1])]
len(sorted_sampling_probs), sorted_sampling_probs[:10]

(9303, array([['path', '0.05346886884457664'],
        ['fs', '0.06349938013082382'],
        ['react', '0.06716224589952868'],
        ['react-dom', '0.08914539063879517'],
        ['webpack', '0.10183932228443406'],
        ['express', '0.10274457850757096'],
        ['child_process', '0.10279558103168882'],
        ['prop-types', '0.11755573422573078'],
        ['util', '0.1179392776076502'],
        ['html-webpack-plugin', '0.11983400574909088']], dtype='<U66'))
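
The sampling table follows $p = \min(1, \sqrt{t / f})$, where $f$ is the relative frequency of a library and $t$ is the `sampling_factor`. The sketch below applies the same formula to made-up counts to show how strongly frequent libraries are down-weighted; `keep_probability` is a hypothetical helper.

```python
import math

def keep_probability(freq, total_count, sampling_factor=1e-4):
    """Probability with which a library is accepted as a context sample."""
    relative_freq = freq / total_count
    return min(1, math.sqrt(sampling_factor / relative_freq))

total = 10000                              # made-up total number of imports
print(keep_probability(1, total))          # rare library: always kept (probability 1)
print(keep_probability(100, total))        # 1% of all imports: probability of about 0.1
print(keep_probability(2500, total))       # 25% of all imports: probability of about 0.02
```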

In [48]:
def samples(arr, num, sampling_table):
    """
        Uses sub-sampling to return num samples (library indices) from the given array of 
        library indices using the given library frequency table.
        # Arguments
        arr: the array of possible indices to sample from
        num: the number of samples to return
        sampling_table: a list of size vocab_size with sampling rates for each library, being the 
            probability with which that library should be chosen.
    """
    res = []
    i = 0
    while len(res) < num:
        if sampling_table[arr[i]] >= random.random() and not arr[i] in res:
            res.append(arr[i])            
        i += 1
        if i >= len(arr):
            i = 0
    return res

samples(range(10), 5, sampling_table)

[7, 9, 3, 5, 8]

In [49]:
def randomgrams(sequences, vocabulary_size, num_positive_samples=8, negative_sampling_ratio=1, true_negative_sampling=False,
                sampling_table=sampling_table, shuffle=True):
    """
        Generates randomgram library pairs. This is similar to skipgrams, but we don't take
        a sliding window, since we are dealing with unordered data. We simply take up to
        num_positive_samples libraries out of the bag of libraries that are imported together.
        
        Takes a list of sequences (lists of library indices),
        returns target indices, context indices and labels (1s or 0s),
        where label = 1 if the context library belongs to the same sequence as the target library,
        and label = 0 if the context library is randomly sampled.
        # Arguments
        sequences: a list of library sequences (bags), each encoded as a list
            of library indices (integers). If using a `sampling_table`,
            library indices must match the positions in that table.
            Note that index 0 is treated as a non-library and is skipped as a target.
        vocabulary_size: int. maximum possible library index + 1
        num_positive_samples: int. maximum number of positive samples to generate per target library.
        negative_sampling_ratio: float. ratio of negative samples compared to the amount of positive samples.
        true_negative_sampling: if True, verify that a negative sample is truly negative by cross-checking the co-occurrence matrix.
        sampling_table: list of per-library sampling probabilities used for sub-sampling, or None to disable it.
        shuffle: whether to shuffle the couples before returning them.
        # Returns
        target_words, context_words, labels: three parallel lists, where
            `labels` are either 0 or 1.
    """
    target_words = []
    context_words = []
    labels = []
    conflicts = 0
    for sequence in tqdm(sequences):
        for i, li in enumerate(sequence):
            if not li:
                continue
            num = min(len(sequence)-1,num_positive_samples)
            if sampling_table:
                for lj in samples(sequence[:i] + sequence[i+1:], num, sampling_table):
                    target_words.append(li)
                    context_words.append(lj)
                    labels.append(1)
            else:
                for j in range(num):
                    if j != i:
                        lj = sequence[j]
                        if not lj:
                            continue
                        target_words.append(li)
                        context_words.append(lj)
                        labels.append(1)
            if negative_sampling_ratio != 0:
                num_neg = math.floor(num * negative_sampling_ratio)
                neg_samples = random.sample(range(vocabulary_size), num_neg)   # every index 0..vocabulary_size-1 is a candidate
                for neg_sample in neg_samples:
                    if true_negative_sampling:
                        # redraw while the candidate co-occurs with the target somewhere in the dataset
                        while li != neg_sample and co_occurrence[co_idx(li, neg_sample, vocabulary_size)]:
                            conflicts += 1
                            neg_sample = random.randint(0, vocabulary_size-1)
                    target_words.append(li)
                    context_words.append(neg_sample)
                    labels.append(0)
                
    if shuffle:
        print('shuffling...')
        seed = random.randint(0, 10**7)   # randint expects integer bounds; 10e6 is a float
        random.seed(seed)
        random.shuffle(target_words)
        random.seed(seed)
        random.shuffle(context_words)
        random.seed(seed)
        random.shuffle(labels)

    if true_negative_sampling:
        print("conflicts =", conflicts)
    return target_words, context_words, labels

In [50]:
lib_target, lib_context, labels = randomgrams(file_data_list, vocab_size, num_positive_samples=config['window_size'], negative_sampling_ratio=config['negative_sampling_ratio'], true_negative_sampling=config['true_negative_sampling'])
print("#couples =", len(lib_target))

shuffling...
conflicts = 1457327
#couples = 4129262
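
To make the idea of true negative sampling concrete, here is a deliberately simplified sketch. It is not the `randomgrams` function above: it skips sub-sampling, uses a plain Python set instead of the bit array, and draws exactly one negative per positive pair. `toy_pairs` and the toy data are hypothetical.

```python
import random

def toy_pairs(bags, vocabulary_size, seed=0):
    """Simplified sketch: one true negative per positive pair, no sub-sampling."""
    rng = random.Random(seed)
    # set of unordered pairs that co-occur in at least one bag
    co_occurs = {frozenset((a, b)) for bag in bags for a in bag for b in bag if a != b}
    targets, contexts, labels = [], [], []
    for bag in bags:
        for target in bag:
            # candidates that never co-occur with the target anywhere in the data
            candidates = [k for k in range(vocabulary_size)
                          if k != target and frozenset((target, k)) not in co_occurs]
            for context in bag:
                if context == target:
                    continue
                targets.append(target)
                contexts.append(context)
                labels.append(1)
                if candidates:
                    targets.append(target)
                    contexts.append(rng.choice(candidates))
                    labels.append(0)
    return targets, contexts, labels, co_occurs

toy_bags = [[0, 1, 2], [2, 3]]      # two files, libraries encoded as indices
toy_t, toy_c, toy_y, toy_co = toy_pairs(toy_bags, vocabulary_size=5)
print(len(toy_y), sum(toy_y))       # 16 8
```

Half of the generated couples are positive, which mirrors a `negative_sampling_ratio` of 1.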


In [51]:
del co_occurrence

In [52]:
lib_target = np.array(lib_target, dtype="int32")
lib_context = np.array(lib_context, dtype="int32")

print(lib_target[:20])
print(lib_context[:20])
print(labels[:20])

[5275  607 6699  637 8743   73  101 8777  254 4339   98  183 3141  676
  101  112   27 5225  101 7764]
[4885 6969  101  103 8714 1592  423 5072 5261 7683 1511 2848 3218 2354
 4725  117 3171 5279 1374 4574]
[0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0]


In [53]:
np.sum(labels), len(labels)

(2064631, 4129262)

In [54]:
labels = np.array(labels)

### 5. Train the Library Embedding

#### 5.1 Construct the Model

In [55]:
use_bias = True

In [56]:
# create some input variables
input_target = Input((1,))
input_context = Input((1,))

embedding = Embedding(vocab_size, config['vector_dim'], input_length=1, name='embedding')
target = embedding(input_target)
target = Reshape((config['vector_dim'], 1))(target)
context = embedding(input_context)
context = Reshape((config['vector_dim'], 1))(context)

# now perform the dot product operation to get a similarity measure
dot_product = dot([target, context], 1, normalize=True)
dot_product = Reshape((1,))(dot_product)

# add the sigmoid output layer
output = Dense(1, activation='sigmoid', use_bias=use_bias)(dot_product)

# create the primary training model
model = Model(inputs=[input_target, input_context], outputs=output)

model.compile(loss='binary_crossentropy', optimizer='rmsprop')
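
For a single (target, context) pair, the model above computes a cosine similarity (because of `normalize=True`) and passes it through a one-unit dense layer with a sigmoid. The sketch below spells out that computation in plain Python; the vectors, `weight` and `bias` are made-up values, whereas in the real model they are learned during training.

```python
import math

def predict_cooccurrence(target_vec, context_vec, weight, bias):
    """Sketch of the forward pass: sigmoid(weight * cosine(target, context) + bias)."""
    dot_product = sum(t * c for t, c in zip(target_vec, context_vec))
    norm_t = math.sqrt(sum(t * t for t in target_vec))
    norm_c = math.sqrt(sum(c * c for c in context_vec))
    cosine = dot_product / (norm_t * norm_c)   # Keras adds a small epsilon here; omitted in this sketch
    return 1.0 / (1.0 + math.exp(-(weight * cosine + bias)))

# Made-up 3-dimensional vectors and dense-layer parameters, for illustration only
p_same = predict_cooccurrence([1.0, 0.0, 0.0], [2.0, 0.0, 0.0], weight=4.0, bias=0.0)
p_opposite = predict_cooccurrence([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], weight=4.0, bias=0.0)
print(p_same)       # parallel vectors: above 0.5
print(p_opposite)   # opposite vectors: below 0.5
```

Because only the direction of the vectors enters the prediction, the length of an embedding vector carries no information in this setup.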

#### 5.2 Train the Model

In [57]:
batch_size = 2**15
print('batch size:', batch_size)
model.fit([lib_target, lib_context], labels, batch_size=batch_size, verbose=2, epochs=config['epochs'], validation_split=0.0)

batch size: 32768
Epoch 1/100
 - 9s - loss: 0.5272
Epoch 2/100
 - 8s - loss: 0.4325
Epoch 3/100
 - 8s - loss: 0.4080
Epoch 4/100
 - 9s - loss: 0.3883
Epoch 5/100
 - 8s - loss: 0.3706
Epoch 6/100
 - 8s - loss: 0.3542
Epoch 7/100
 - 8s - loss: 0.3387
Epoch 8/100
 - 8s - loss: 0.3242
Epoch 9/100
 - 8s - loss: 0.3105
Epoch 10/100
 - 8s - loss: 0.2975
Epoch 11/100
 - 8s - loss: 0.2853
Epoch 12/100
 - 8s - loss: 0.2736
Epoch 13/100
 - 8s - loss: 0.2626
Epoch 14/100
 - 8s - loss: 0.2522
Epoch 15/100
 - 8s - loss: 0.2423
Epoch 16/100
 - 9s - loss: 0.2329
Epoch 17/100
 - 8s - loss: 0.2241
Epoch 18/100
 - 8s - loss: 0.2156
Epoch 19/100
 - 8s - loss: 0.2076
Epoch 20/100
 - 8s - loss: 0.2001
Epoch 21/100
 - 9s - loss: 0.1929
Epoch 22/100
 - 8s - loss: 0.1861
Epoch 23/100
 - 8s - loss: 0.1796
Epoch 24/100
 - 8s - loss: 0.1734
Epoch 25/100
 - 8s - loss: 0.1676
Epoch 26/100
 - 8s - loss: 0.1620
Epoch 27/100
 - 8s - loss: 0.1567
Epoch 28/100
 - 8s - loss

<keras.callbacks.History at 0x7f13c942be10>

#### 5.3 Extract our Embedding Matrix & Save to Disk

In [58]:
x = embedding.get_weights()[0]
y = lib_names

Save the embedding to disk

In [59]:
file_prefix = rootDir + 'models/library_vectors_dim' + str(config['vector_dim'])
file_prefix

'../datasets/javascript/models/library_vectors_dim100'

In [60]:
save_array(file_prefix + '_x.vec', x)
save_array(file_prefix + '_y.vec', y)

In [61]:
x = load_array(file_prefix + '_x.vec')
y = load_array(file_prefix + '_y.vec')

Save HD vectors as JSON

In [62]:
vecHD = {y[i]: x[i].tolist() for i,_ in enumerate(x)}

In [63]:
hd_file = rootDir + 'models/vectorsHD_dim{}.json'.format(config['vector_dim'])
with open(hd_file, 'w') as f:
    json.dump(vecHD, f)

In [64]:
! gzip -f $hd_file

#### 5.4 Convert embedding to Word2Vec format

In [65]:
w2v_file = rootDir + 'models/w2v_dim{}.txt'.format(config['vector_dim'])
with open(w2v_file, 'w') as f:
    f.write('{} {}\n'.format(x.shape[0], x.shape[1]))
    for i in range(len(y)):
        f.write('{} {}\n'.format(y[i], ' '.join(str(v) for v in x[i].tolist())))

In [66]:
!gzip -f $w2v_file
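
The text format written above consists of a `count dim` header line followed by one `name v1 v2 ...` line per library. The sketch below writes and re-reads that layout in memory with toy vectors; `write_w2v` and `read_w2v` are hypothetical helpers, not part of the notebook.

```python
import io

def write_w2v(names, vectors, out):
    """Write a 'count dim' header, then one 'name v1 v2 ...' line per library."""
    out.write('{} {}\n'.format(len(names), len(vectors[0])))
    for name, vec in zip(names, vectors):
        out.write('{} {}\n'.format(name, ' '.join(str(v) for v in vec)))

def read_w2v(lines):
    """Parse the same layout back into a {name: vector} dictionary."""
    lines = iter(lines)
    count, dim = (int(v) for v in next(lines).split())
    vectors = {}
    for line in lines:
        name, *values = line.split()
        if len(values) != dim:
            raise ValueError('unexpected vector length for ' + name)
        vectors[name] = [float(v) for v in values]
    if len(vectors) != count:
        raise ValueError('header count does not match number of vectors')
    return vectors

buffer = io.StringIO()
write_w2v(['path', 'fs'], [[0.1, 0.2], [0.3, 0.4]], buffer)   # toy 2-dimensional vectors
toy_vectors = read_w2v(buffer.getvalue().splitlines())
print(toy_vectors)   # {'path': [0.1, 0.2], 'fs': [0.3, 0.4]}
```

With gensim installed, the gzipped file should also be loadable through `KeyedVectors.load_word2vec_format(..., binary=False)`; this has not been verified against the files produced here.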

### Done!

Your vectors are available in the directory `$rootDir/models`.

- The raw vectors are in the files `library_vectors_dim*_x.vec` (embedding) and `library_vectors_dim*_y.vec` (labels).
- JSON-formatted vectors are available in the file `vectorsHD_dim*.json.gz`.
- Word2Vec-formatted vectors are available in the file `w2v_dim*.txt.gz`.

In [67]:
! echo DIR = {rootDir}models
! ls $rootDir/models

DIR = ../datasets/javascript/models
data_trn_dim100.json	      library_vectors_dim100_y.vec   w2v_dim100.txt.gz
lib_dict_dim100.json	      reversed_lib_dict_dim100.json
library_vectors_dim100_x.vec  vectorsHD_dim100.json.gz


### What has the model learned?

- You can visualize your trained embeddings with the [Visualization.ipynb](Visualization.ipynb) notebook.
- You can run nearest neighbour searches with the [Evaluation.ipynb](Evaluation.ipynb) notebook.
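
As a quick sanity check without opening the other notebooks, nearest neighbours can be computed directly from a name-to-vector dictionary. This is a minimal sketch using cosine similarity on made-up vectors; `nearest_neighbours` is a hypothetical helper and may differ from what `Evaluation.ipynb` does.

```python
import math

def cosine_similarity(u, v):
    dot_product = sum(a * b for a, b in zip(u, v))
    return dot_product / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbours(vectors, name, k=3):
    """Return the k libraries whose vectors are most similar to that of `name`."""
    query = vectors[name]
    scored = [(other, cosine_similarity(query, vec)) for other, vec in vectors.items() if other != name]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 2-dimensional vectors; real ones could be read from vectorsHD_dim*.json.gz
# using gzip.open(..., 'rt') together with json.load
toy_vectors = {'react': [1.0, 0.1], 'react-dom': [0.9, 0.2], 'express': [0.1, 1.0]}
neighbours = nearest_neighbours(toy_vectors, 'react', k=2)
print([name for name, _ in neighbours])   # ['react-dom', 'express']
```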