# Preprocessing and Word2Vec training

In this notebook we iterate over a growing group of days (day 1, days 1-2, days 1-3, ...), pre-processing all the error messages collected in each group (here from 2020/01/01 to 2020/11/05). To study the timing involved, we save the following information in a `csv` file:
- the number of grouped days;
- the total number of raw messages before cleaning (`bf_n`);
- the number of messages that are equal after cleaning (`af_n`);
- the time required for cleaning [s];
- the time required for tokenizing [s];
- the time required for training [s].

Moreover, at each iteration we train a `Word2Vec` model and save its vocabulary size as well.
The collected data are plotted in the [Timing Study](https://github.com/micolocco/ClusterLog/blob/micol_dev/training_study/timing_study.ipynb) notebook.
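
The training loop in this notebook appends rows to `training_study.csv` without a header, so column names have to be supplied when the file is read back. A minimal sketch using only the standard library follows; the column labels are illustrative names that follow the list above (they are not defined in the original code), and the two sample rows are made-up numbers.

```python
import csv
import io

# Hypothetical column labels, in the order the rows are written
COLUMNS = ["n_days", "bf_n", "af_n", "clean_time", "tok_time", "train_time", "len_voc"]


def read_timing_rows(handle):
    """Parse header-less timing rows into dictionaries with numeric values."""
    rows = []
    for record in csv.reader(handle):
        if not record:  # skip blank lines
            continue
        values = [float(x) for x in record]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


# Made-up sample content standing in for training_study.csv
sample = io.StringIO("1,1000,120,3.5,0.01,12.0,450\n2,2100,180,6.9,0.01,20.5,610\n")
rows = read_timing_rows(sample)

# Total processing time per grouped-day count
total_times = [r["clean_time"] + r["tok_time"] + r["train_time"] for r in rows]
print(total_times)
```

With the real file, `sample` would be replaced by `open('training_study.csv', newline='')`.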

To analyze as many messages as the computing power of the Analytix cluster allows, we pre-process the error messages in a distributed manner, using the Spark functions in [`dist_preproc.py`](https://github.com/micolocco/ClusterLog/blob/micol_dev/training_study/dist_preproc.py).



In [5]:
import dist_preproc  # local module with the Spark pre-processing functions
import pandas as pd
from time import time
from gensim.models import Word2Vec
import csv

In [6]:
start_date='2020/01/01'
end_date='2020/11/05'
hdir='hdfs:///project/monitoring/archive/fts/raw/complete'   

dd=pd.date_range(start=start_date,end=end_date)

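# path_list[k-1] holds the HDFS paths of the first k days (cumulative groups)
hpath=[]
path_list=[]
count=1

for day in dd:
    # %Y is the four-digit year ("%y%y" only yields 2020 by coincidence)
    hpath.append('%s/%s' %(hdir, day.strftime("%Y/%m/%d")))
    path_list.append(hpath[:count])
    count=count+1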
   

In [7]:
len(path_list)

310
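
The cumulative structure of `path_list` (one entry per grouped-day count, each entry containing all the days up to that point) can be illustrated on a three-day range. This is a minimal sketch that uses the standard `datetime` module instead of `pandas` to stay dependency-free; `cumulative_paths` is a hypothetical helper, not part of the original code.

```python
from datetime import date, timedelta


def cumulative_paths(base_dir, start, n_days):
    """Return a list whose k-th entry holds the paths of the first k days."""
    daily = [
        "%s/%s" % (base_dir, (start + timedelta(days=i)).strftime("%Y/%m/%d"))
        for i in range(n_days)
    ]
    return [daily[:k] for k in range(1, n_days + 1)]


hdir = "hdfs:///project/monitoring/archive/fts/raw/complete"
groups = cumulative_paths(hdir, date(2020, 1, 1), 3)
for g in groups:
    print(len(g), g[-1])

# Number of days between the two dates used in this notebook, both included
n_total = (date(2020, 11, 5) - date(2020, 1, 1)).days + 1
print(n_total)
```

The last line gives the same count as `len(path_list)` above.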

In [None]:
# `spark` is assumed to be the SparkSession provided by the notebook environment.
# Note: `size`, `iter` and `model.wv.vocab` are the gensim 3.x names
# (gensim 4 renamed them to `vector_size`, `epochs` and `model.wv.key_to_index`).
for count, iPath in enumerate(path_list, start=1):
    # cleaning
    start_time_clean = time()
    fts,bf_n,af_n=dist_preproc.preprocMex(spark,iPath,clean_short=True,stop_one=False).process()
    clean_time=time() - start_time_clean

    # tokenizing
    start_time_tok=time()
    tokenized = dist_preproc.MyCorpus(fts)
    tok_time=time() - start_time_tok

    # training
    start_time_train= time()
    model = Word2Vec(sentences=tokenized,compute_loss=False,size=300,window=7, min_count=1, workers=4, iter=30)
    train_time=time()-start_time_train
    len_voc=len(model.wv.vocab)
    print('finished day',count)

    # append one row per grouped-day count, then save the model
    with open('training_study.csv', mode='a',newline='') as tFile:
        file_writer = csv.writer(tFile)
        file_writer.writerow([count,bf_n,af_n,clean_time,tok_time,train_time,len_voc])
    model.save('models/groupedDay_{}.model'.format(count))
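
The start/stop timing pattern repeated for cleaning, tokenizing and training could be factored into a small helper. The following is only a sketch of one possible way to do it: `timed` is a hypothetical context manager, and the timed bodies are placeholders rather than the real pre-processing steps.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}
with timed("clean_time", timings):
    sleep(0.01)  # placeholder for the cleaning step
with timed("tok_time", timings):
    sum(range(1000))  # placeholder for the tokenizing step

print(sorted(timings))
```

`perf_counter` is used here because it is a monotonic clock intended for measuring intervals; the values in `timings` can then be written to the `csv` row in the same order as before.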
    

## Playing with Word2Vec

Let's try different `Word2Vec` hyperparameters `(window=10, min_count=2)`.

In [19]:
# sg=0 selects CBOW; min_count=2 drops the words that occur only once
model_10 = Word2Vec(sentences=tokenized, compute_loss=False, size=300, window=10, sg=0, min_count=2, workers=4, iter=30)

In [20]:
model_10.save('10model_NUCLEUS.model')

In [10]:
# word -> number of occurrences in the training corpus
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
# sort by decreasing count
w2cSorted=dict(sorted(w2c.items(), key=lambda x: x[1],reverse=True))
w2cSortedList = list(w2cSorted.keys())
# words that occur only once
w=[q  for q, v in w2cSorted.items() if v==1]

Model vocabulary and word occurrences.

In [None]:
w2cSorted
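
The effect of `min_count` on the vocabulary can be illustrated without a trained model, since `Word2Vec` ignores every word whose total frequency is lower than `min_count`. The sketch below counts words with `collections.Counter` on a made-up corpus; the sentences and the `vocabulary` helper are illustrative assumptions, not data or code from this study.

```python
from collections import Counter

# Made-up tokenized messages, only for illustration
corpus = [
    ["connection", "refused"],
    ["connection", "timed", "out"],
    ["certificate", "verification", "failed"],
    ["connection", "refused", "by", "host"],
]

# Word occurrences, sorted by decreasing count
counts = Counter(token for sentence in corpus for token in sentence)
by_count = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

# Words that occur only once
singletons = [word for word, n in by_count.items() if n == 1]


def vocabulary(counts, min_count):
    """Words that survive a given min_count threshold."""
    return {word for word, n in counts.items() if n >= min_count}


print(len(vocabulary(counts, 1)), len(vocabulary(counts, 2)))
```

On this toy corpus, raising `min_count` from 1 to 2 shrinks the vocabulary from 9 words to 2, because 7 of the words occur only once.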

To test the model, we can query it about similarity relationships. Do they make sense?

In [33]:
print(model.wv.most_similar(positive=['refused','allowed'],negative=['permission'], topn=20))

[('method', 0.7579617500305176), ('abgelehnt', 0.7114009261131287), ('http', 0.7084829211235046), ('verbindungsaufbau', 0.7003512978553772), ('explaining', 0.6966121196746826), ('libcurl', 0.6929736137390137), ('rehusada', 0.6869792342185974), ('execute', 0.6826119422912598), ('conexi', 0.6795994639396667), ('rifiutata', 0.666997492313385), ('connessione', 0.6524865627288818), ('rpc', 0.6473551988601685), ('sends', 0.6334272623062134), ('plain', 0.6270217895507812), ('someone', 0.6157051920890808), ('refusat', 0.6126352548599243), ('insecure', 0.6063003540039062), ('when', 0.6039327383041382), ('authentification', 0.6002188920974731), ('post', 0.59538334608078)]


In [68]:
print(model.wv.most_similar(positive=['http','certificate'],negative=['insecure'], topn=20))

[('trusted', 0.7887747287750244), ('verification', 0.6823903322219849), ('different', 0.6768122911453247), ('certificates', 0.6685938835144043), ('client', 0.6505008935928345), ('pass', 0.6376693844795227), ('pem', 0.6124240159988403), ('issuer', 0.6032818555831909), ('errors', 0.5995227098464966), ('found', 0.5983811616897583), ('issued', 0.595291256904602), ('following', 0.5928809642791748), ('hostname', 0.5804721117019653), ('untrusted', 0.5694263577461243), ('validation', 0.5694255828857422), ('key', 0.5607132911682129), ('env', 0.5524048805236816), ('exist', 0.5510029196739197), ('wrong', 0.5474144220352173), ('chain', 0.545615553855896)]


In [55]:
print(model.wv.most_similar(positive=['connection','expirat'],negative=['timed'], topn=20))

[('connexi', 0.8243509531021118), ('rifiutata', 0.8095705509185791), ('refusat', 0.8022040724754333), ('inabastable', 0.8009135127067566), ('unterbrochen', 0.7835126519203186), ('para', 0.7832696437835693), ('outbound', 0.7807804346084595), ('bergabe', 0.7787008285522461), ('connessione', 0.7749203443527222), ('perm', 0.7741066217422485), ('reiniciat', 0.7740315198898315), ('xarxa', 0.7722450494766235), ('conexi', 0.7698987722396851), ('verbindungsaufbau', 0.7698805332183838), ('daten', 0.7695108652114868), ('quebrado', 0.7691769599914551), ('before', 0.7685776352882385), ('conex', 0.7684025168418884), ('remota', 0.7665600776672363), ('esgotado', 0.7645314931869507)]


In [60]:
print(model.wv.most_similar(positive=['certificate','valid'],negative=['problematic'], topn=20))

[('credentials', 0.8254427909851074), ('certificates', 0.7949425578117371), ('find', 0.7736103534698486), ('any', 0.7592921853065491), ('trusted', 0.7477348446846008), ('locations', 0.7434793710708618), ('possible', 0.7334886789321899), ('verify', 0.729857325553894), ('specified', 0.7099422216415405), ('cant', 0.6834237575531006), ('crl', 0.6822726130485535), ('foun', 0.6693485379219055), ('could', 0.6601654887199402), ('found', 0.6425882577896118), ('get', 0.639359712600708), ('not', 0.6384429335594177), ('signing', 0.6346465349197388), ('search', 0.6338939666748047), ('checksum', 0.6333378553390503), ('order', 0.615533173084259)]


In [19]:
print(model.wv.most_similar(positive=['connection','rifiutata'],negative=['refused'], topn=20))

[('connessione', 0.8520025014877319), ('marker', 0.8440203666687012), ('sockettimeoutexception', 0.833538293838501), ('connecttimeoutexception', 0.8005949258804321), ('nohttpresponseexception', 0.7997249364852905), ('httphostconnectexception', 0.799676775932312), ('socketexception', 0.7989269495010376), ('during', 0.7966036796569824), ('perf', 0.7838877439498901), ('della', 0.771793782711029), ('expirat', 0.7688853740692139), ('gib', 0.7646292448043823), ('pushing', 0.7617639899253845), ('sslhandshakeexception', 0.757806122303009), ('before', 0.753616213798523), ('xarxa', 0.7500908374786377), ('eod', 0.747723400592804), ('inabastable', 0.7452101707458496), ('shut', 0.7447163462638855), ('respond', 0.7446961402893066)]
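
To make explicit what such a query computes, here is a dependency-free sketch of the usual analogy scheme: the unit-length vectors of the positive words are added, those of the negative words are subtracted, and the remaining words are ranked by cosine similarity to the result. As far as we understand, this mirrors what `most_similar` does in gensim, but it is not gensim's implementation, and the 2-D vectors are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to the normalized weighted mean of the
    unit-length input vectors (+1 for positive words, -1 for negative words)."""
    terms = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, weight in terms:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x / len(terms)
    query = unit(query)
    used = set(positive) | set(negative)  # input words are excluded from the ranking
    scores = [
        (w, sum(a * b for a, b in zip(unit(v), query)))
        for w, v in vectors.items()
        if w not in used
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]


# Made-up embeddings: first axis ~ language, second axis ~ "connection" vs "refusal"
toy = {
    "connection": [1.0, 1.0],
    "refused": [1.0, -1.0],
    "connessione": [-1.0, 1.0],
    "rifiutata": [-1.0, -1.0],
    "timeout": [1.0, 0.2],
}
result = most_similar(toy, positive=["connection", "rifiutata"], negative=["refused"])
print(result[0][0])
```

In this toy space the top answer is `connessione`, the word that relates to `rifiutata` the way `connection` relates to `refused`.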


In [83]:
print(model.wv.most_similar(positive=['connection','rifiutata'],negative=['refused'], topn=20))

[('protocols', 0.4475284218788147), ('tried.', 0.4403183162212372), ('fsync', 0.412409245967865), ('eod', 0.41186559200286865), ('incorrectlyperf', 0.4089449644088745), ('poller', 0.40332546830177307), ('connessione', 0.40249624848365784), ('hand', 0.3988437056541443), ('redirect', 0.3833411931991577), ('middle', 0.37699341773986816), ('eospps-ipa.cern.ch', 0.3738352656364441), ('req', 0.3718743920326233), ('connectionclosedexception', 0.3694261312484741), ('auth', 0.36754319071769714), ('ccsrm.ihep.ac.cn', 0.36702197790145874), ('aborted.', 0.35888367891311646), ('se-iep-grid.saske.sk', 0.3586326539516449), ('shake', 0.3567368686199188), ('khurana', 0.3511529862880707), ('indicated', 0.3497052788734436)]
