# Word2Vec hyperparameters search 

**Objective**

This notebook trains several Word2Vec models, one for each combination in a grid of hyperparameter values (vector size, minimum word count and window size). 
The trained models are then saved under the specified Hadoop path for subsequent processing.

## Start Spark session

In [1]:
%%time

# start Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sample_app_optimize_word2vec").getOrCreate()
spark

CPU times: user 41.1 ms, sys: 22.1 ms, total: 63.2 ms
Wall time: 6.19 s


## Read data from Spark

### Transfer errors only

In [3]:
%%time 

import pyspark.sql.functions as F
# first day of the data window: 9th March 2020
day = "2020/03/09"

# FTS data paths: one glob per day, 9th to 12th March 2020 (the end of range is exclusive)
path_list = ['/project/monitoring/archive/fts/raw/complete/2020/03/{:0>2}/*'.format(i) for i in range(9,13)]

# load the data in the json file
all_transfers = spark.read.json(path_list)

# retrieve just data
all_transfers = all_transfers.select("data.*")

# filter errors only
errors = all_transfers.filter(all_transfers["t_final_transfer_state_flag"] == 0)

# add a unique (but not necessarily consecutive) row id
errors = errors.withColumn("msg_id", F.monotonically_increasing_id())

CPU times: user 188 ms, sys: 51.8 ms, total: 239 ms
Wall time: 2min 40s
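The path pattern used in the cell above expands to one glob per day. As a quick check, the following sketch rebuilds the same list in plain Python (no Spark needed) and prints it; only the `pattern` variable name is introduced here for readability. Note that `range(9, 13)` stops at 12, so the last directory read is `03/12`.

```python
# Same format string and range as in the cell above; only the printing is added.
pattern = '/project/monitoring/archive/fts/raw/complete/2020/03/{:0>2}/*'
path_list = [pattern.format(i) for i in range(9, 13)]

# '{:0>2}' right-aligns the day number in a field of width 2, padded with zeros
for p in path_list:
    print(p)
```

If the 13th of March should also be included, the range would need to end at 14.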


### ATLAS data only

In [4]:
errors_atlas = errors.filter(errors["vo"]=="atlas").select(
    "msg_id", "t__error_message", "src_hostname", "dst_hostname", "timestamp_tr_comp", "timestamp_tr_st")

# count the ATLAS errors (this triggers the Spark job)
n = errors_atlas.count()

In [5]:
print("Total number of errors: {}".format(n))

Total number of errors: 1795737


### Alternative: read pre-saved data 

In [2]:
# import pandas as pd
# pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)

# errors = spark.read.json("sample_app_test_train.json").select(
#     "msg_id", "t__error_message", "src_hostname", "dst_hostname", "timestamp_tr_comp")

# # visualize data
# # errors.toPandas().head(10)

## Word2Vec models

### Tokenization 

In [7]:
%%time

# keep only the ATLAS subset under the name `errors` and drop the extra reference
errors = errors_atlas
del errors_atlas

from language_models import tokenizer
err_tks = tokenizer(errors, err_col="t__error_message", id_col="msg_id")

# visualize tokenization
# err_tks.toPandas().head(4)

CPU times: user 107 ms, sys: 42.7 ms, total: 150 ms
Wall time: 316 ms
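The `tokenizer` function comes from the local `language_models` module, whose implementation is not shown in this notebook. To give an idea of what a tokenization step with stop-word removal can look like, here is a minimal plain-Python sketch. Everything in it is an assumption made for illustration: the `simple_tokenize` helper, the stop-word list and the example message are hypothetical and are not taken from `language_models`.

```python
import re

# Hypothetical stop-word list; the real tokenizer may use a different one.
STOP_WORDS = {"the", "to", "of", "a", "in", "is"}

def simple_tokenize(message):
    """Lowercase the message, split on whitespace and drop stop words."""
    tokens = re.split(r"\s+", message.lower().strip())
    return [t for t in tokens if t and t not in STOP_WORDS]

# Made-up error message, used only to show the transformation
example = "TRANSFER Failed to connect to the remote host"
print(simple_tokenize(example))
```

The actual module applies its transformation to a Spark DataFrame column rather than to a single Python string.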


### Set metadata and hyperparameters grids

In [None]:
import datetime 
import time

# metadata to track the training run
today = datetime.date.today() # day of the training
data_window = "9-13mar2020" # time period of the data analysed
w2v_log = "results/sample_app/w2v_models" # Hadoop base path

# hyperparameter grids
vs_list = [100, 150, 200, 250] # vector sizes
mc_list = [100, 500] # minimum word counts
ws_list = [5, 8] # window sizes
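Before launching the training it can help to check how many models the grid implies. The sketch below enumerates the combinations with `itertools.product`, which yields them in the same order as three nested loops over `vs_list`, `mc_list` and `ws_list`; the `grid` variable is introduced here only for illustration.

```python
from itertools import product

vs_list = [100, 150, 200, 250]  # vector sizes
mc_list = [100, 500]            # minimum word counts
ws_list = [5, 8]                # window sizes

# Cartesian product: the last list varies fastest, as in nested for loops
grid = list(product(vs_list, mc_list, ws_list))
print("Number of models to train: {}".format(len(grid)))

# show the first few combinations
for vs, mc, ws in grid[:4]:
    print("vector_size={}, min_count={}, window_size={}".format(vs, mc, ws))
```

With 4 x 2 x 2 = 16 combinations, the total training time scales linearly with the duration of a single run.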

### Grid search over specified hyperparameters' values

In [22]:
%%time

import language_models
import importlib
importlib.reload(language_models)

for vs in vs_list:
    for mc in mc_list:
        for ws in ws_list:
            start_time = time.time()
            start_time_string = datetime.datetime.fromtimestamp(start_time).strftime('%Y-%m-%d %H:%M:%S')

            print("Training for vector_size={}, window_size={} and min_count={}".format(vs, ws, mc))
            print("Started at: {}\n".format(start_time_string))

            # train word2vec
            w2v_model = language_models.train_w2v(err_tks, tks_col="stop_token_1", id_col="msg_id", out_col='message_vector',
                                                  vec_size=vs, min_count=mc, mode="overwrite", win_size=ws, n_cores=12,
                                                  save_path="{}/data_window_{}/train_date_{}".format(w2v_log, data_window, today))

            # compute the elapsed time once so that minutes and seconds are consistent
            minutes, seconds = divmod(int(time.time() - start_time), 60)
            print("\nTime elapsed: {} minutes and {} seconds.".format(minutes, seconds))
            print('--'*30)

Training for vector_size=100, window_size=5 and min_count=100
Started at: 2020-03-19 17:49:18


Time elapsed: 15 minutes and 2 seconds.
------------------------------------------------------------
Training for vector_size=100, window_size=8 and min_count=100
Started at: 2020-03-19 18:04:20


Time elapsed: 15 minutes and 29 seconds.
------------------------------------------------------------
Training for vector_size=100, window_size=5 and min_count=500
Started at: 2020-03-19 18:19:50


Time elapsed: 13 minutes and 41 seconds.
------------------------------------------------------------
Training for vector_size=100, window_size=8 and min_count=500
Started at: 2020-03-19 18:33:31


Time elapsed: 16 minutes and 2 seconds.
------------------------------------------------------------
Training for vector_size=150, window_size=5 and min_count=100
Started at: 2020-03-19 18:49:34


Time elapsed: 15 minutes and 55 seconds.
------------------------------------------------------------
Trainin