# Sample APP demonstration

**Details:**

- Python package: pyspark
- Language model: Word2Vec (all errors)
- Clustering algorithm: K-Means

**Mode:**

- Word2Vec: training
- K-Means: training (K optimized)

### Imports
Import the NLP adapter and set the pandas display options.

In [2]:
%load_ext autoreload
%autoreload 2

from opint_framework.apps.example_app.nlp.pyspark_based.pyspark_nlp_adapter import pysparkNLPAdapter
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Analysis/Modeling
Instantiate the adapter, pre-process the sample data, and train the Word2Vec and K-Means models.

## Instantiate NLPAdapter object

In [3]:
# set up sample data path
#data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/sample_data_5mar20.json"""
# data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/test_data_raw.json"""
data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/fts_20oct2020.json"""

# instantiate NLPAdapter object
pipeline = pysparkNLPAdapter(path_list=[data_path], vo="atlas", filter_T3=True,  # data
                             tks_col="stop_token_1",  # tokenization
                             w2v_model_path="results/w2v", w2v_mode="train", w2v_save_mode="overwrite",
                             emb_size=3, win_size=8, min_count=1, tks_vec="message_vector",  # word2vec
                             ft_col="features", kmeans_model_path="results/kmeans", kmeans_mode="train",
                             pred_mode="static", new_cluster_thresh=None, k_list=[2, 4, 6, 8],
                             distance="cosine", opt_initSteps=10, opt_tol=0.01, opt_maxIter=10,
                             log_path=None, n_cores=4,  # K_optim
                             tr_initSteps=30, tr_tol=0.001, tr_maxIter=30,  # train_kmeans
                             clust_col="prediction", wrdcld=True, timeplot=True)

## Pre-processing 

In [6]:
pipeline.pre_process()

In [7]:
print("dataset type:", type(pipeline.context['dataset']))
print("dataset columns:", pipeline.context['dataset'].columns)
print("dataset entries:", pipeline.context['dataset'].count())

print("\nHead:\n")
pipeline.context['dataset'].toPandas().set_index(pipeline.context["id_col"]).head()

dataset type: <class 'pyspark.sql.dataframe.DataFrame'>
dataset columns: ['tr_id', 't__error_message', 'src_hostname', 'src_rcsite', 'dst_hostname', 'dst_rcsite', 'tr_datetime_complete', 't_error_code', 't_failure_phase', 'tr_error_category', 'tr_error_scope']
dataset entries: 998

Head:



tr_id,t__error_message,src_hostname,src_rcsite,dst_hostname,dst_rcsite,tr_datetime_complete,t_error_code,t_failure_phase,tr_error_category,tr_error_scope
2020-10-20-2052__grid-dav.rzg.mpg.de__fal-pygrid-30.lancs.ac.uk__1804077215__b9a30da8-1315-11eb-84f1-fa163e564087,"TRANSFER ERROR: Copy failed with mode 3rd pull, with error: copy HTTP 403 : Permission refused",grid-dav.rzg.mpg.de,MPPMU,fal-pygrid-30.lancs.ac.uk,UKI-NORTHGRID-LANCS-HEP,2020-10-20T20:52:57UTC,1,TRANSFER,OPERATION_NOT_PERMITTED,TRANSFER
2020-10-20-2021__head01.aglt2.org__gridftp.nese.mghpcc.org__655616124__c0e9be0a-130f-11eb-ba8f-b49691292ed8,TRANSFER Operation timed out,head01.aglt2.org,AGLT2,gridftp.nese.mghpcc.org,BU_ATLAS_Tier2,2020-10-20T20:21:36UTC,110,TRANSFER,CONNECTION_TIMED_OUT,TRANSFER
2020-10-20-2053__atlaswebdav-kit.gridka.de__fal-pygrid-30.lancs.ac.uk__1804077469__bfb9b278-1315-11eb-942b-fa163e5a6d18,"TRANSFER ERROR: Copy failed with mode 3rd pull, with error: copy HTTP 403 : Permission refused",atlaswebdav-kit.gridka.de,FZK-LCG2,fal-pygrid-30.lancs.ac.uk,UKI-NORTHGRID-LANCS-HEP,2020-10-20T20:53:04UTC,1,TRANSFER,OPERATION_NOT_PERMITTED,TRANSFER
2020-10-20-2053__atlas-gridftp.bu.edu__t2se01.physics.ox.ac.uk__1804075477__bca8bbf3-4d64-5416-b075-8cc2333962fe,Operation timed out,atlas-gridftp.bu.edu,BU_ATLAS_Tier2,t2se01.physics.ox.ac.uk,UKI-SOUTHGRID-OX-HEP,2020-10-20T20:53:06UTC,110,TRANSFER_PREPARATION,CONNECTION_TIMED_OUT,SOURCE
2020-10-20-2053__atlas-gridftp.bu.edu__t2se01.physics.ox.ac.uk__1804074461__67cdd260-1315-11eb-891a-fa163e564087,Operation timed out,atlas-gridftp.bu.edu,BU_ATLAS_Tier2,t2se01.physics.ox.ac.uk,UKI-SOUTHGRID-OX-HEP,2020-10-20T20:53:08UTC,110,TRANSFER_PREPARATION,CONNECTION_TIMED_OUT,SOURCE


## Run 

In [8]:
# remove models saved by earlier runs (-f avoids an error if the directory does not exist yet)
!rm -rf results

In [9]:
kmeans_model = pipeline.run()


Training Word2Vec model
 ------------------------------------------------------------ 

Saving w2v model to: results/w2v/w2v_sample_app_example_VS=3_MC=1_WS=8

Selecting best number of clusters
 ------------------------------------------------------------ 

With K=2

Started at: 2021-02-01 19:07:21

Within Cluster Sum of Squared Errors = 143.839
Silhouette with cosine distance = 0.8048

Time elapsed: 0 minutes and 3 seconds.
------------------------------------------------------------
With K=4

Started at: 2021-02-01 19:07:21

Within Cluster Sum of Squared Errors = 62.7609
Silhouette with cosine distance = 0.6397

Time elapsed: 0 minutes and 3 seconds.
------------------------------------------------------------
With K=6

Started at: 2021-02-01 19:07:21

Within Cluster Sum of Squared Errors = 30.4018
Silhouette with cosine distance = 0.7323

Time elapsed: 0 minutes and 3 seconds.
------------------------------------------------------------
With K=8

Started at: 2021-02-01 19:07:21

Wi




Training K-Means model for best value of K
 ------------------------------------------------------------ 

With K=2

Started at: 2021-02-01 19:07:25

Within Cluster Sum of Squared Errors = 143.2753
Silhouette with cosine distance = 0.8048

Time elapsed: 0 minutes and 1 seconds.
------------------------------------------------------------
Saving K-Means model to: results/kmeans/kmeans_K=2
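
The log above is consistent with choosing the K that has the highest silhouette score: K=2 scores 0.8048 and is the value retrained at the end. A minimal sketch of that selection rule follows. The rule itself is an assumption inferred from the printed output, not taken from the adapter's source, and `select_best_k` is a hypothetical helper name. The K=8 entry is left out because its output is truncated in the log.

```python
# Silhouette scores (cosine distance) copied from the optimisation log above.
# K=8 is omitted because its output is truncated.
silhouette_by_k = {2: 0.8048, 4: 0.6397, 6: 0.7323}


def select_best_k(scores):
    """Return the K with the highest silhouette score (assumed selection rule)."""
    return max(scores, key=scores.get)


best_k = select_best_k(silhouette_by_k)
print("Best K among the logged candidates:", best_k)
```

Note that the Within Cluster Sum of Squared Errors keeps falling as K grows (143.8, 62.8, 30.4), so it could not be used on its own to pick K; the silhouette score does not have that monotonic behaviour.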


In [10]:
print("kmeans_model type:", type(kmeans_model))
print("kmeans_model keys:", kmeans_model.keys())

kmeans_model

kmeans_model type: <class 'dict'>
kmeans_model keys: dict_keys(['model', 'wsse', 'asw'])


{'model': KMeans_572b9d9c51f9,
 'wsse': 143.27531474070915,
 'asw': 0.8047605353531522}
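
The `asw` value matches the "Silhouette with cosine distance" printed in the training log, so the quality of the clustering is judged by the angle between message vectors rather than by their length. As a rough illustration, the sketch below computes the cosine similarity between three `message_vector` values taken from the predictions printed in the next cell. The helper `cosine_similarity` is written here for illustration and is not part of the adapter.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# message_vector values copied from the predictions table in this notebook
permission_refused = [-1.2231686619611888, 1.8497820496559143, -3.4540224442115197]
transfer_timed_out = [3.107237656911214, -2.602229247490565, -0.6326684951782227]
timed_out = [5.156564295291901, -4.043393298983574, 0.43869757652282715]

# The two time-out messages are assigned to the same cluster (prediction 1)
same_cluster = cosine_similarity(transfer_timed_out, timed_out)
# The permission error is assigned to the other cluster (prediction 0)
other_cluster = cosine_similarity(permission_refused, timed_out)

print("time-out vs time-out:", same_cluster)
print("permission vs time-out:", other_cluster)
```

The two time-out vectors differ in length but point in nearly the same direction, while the permission-refused vector points away from both, which agrees with the cluster assignments shown in the table.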

In [12]:
print("Optimal number of clusters:", kmeans_model["model"].summary.k)
print("\nRaw predictions dataset:")

kmeans_model["model"].summary.predictions.toPandas().set_index(pipeline.context["id_col"]).head()

Optimal number of clusters: 2

Raw predictions dataset:


tr_id,t__error_message,src_hostname,src_rcsite,dst_hostname,dst_rcsite,tr_datetime_complete,t_error_code,t_failure_phase,tr_error_category,tr_error_scope,corrected_message,tokens,tokens_cleaned,stop_token,stop_token_1,message_vector,features,prediction
2020-10-20-2052__grid-dav.rzg.mpg.de__fal-pygrid-30.lancs.ac.uk__1804077215__b9a30da8-1315-11eb-84f1-fa163e564087,"TRANSFER ERROR: Copy failed with mode 3rd pull, with error: copy HTTP 403 : Permission refused",grid-dav.rzg.mpg.de,MPPMU,fal-pygrid-30.lancs.ac.uk,UKI-NORTHGRID-LANCS-HEP,2020-10-20T20:52:57UTC,1,TRANSFER,OPERATION_NOT_PERMITTED,TRANSFER,"TRANSFER error: Copy failed with mode 3rd pull, with error: copy HTTP 403 : Permission refused","[transfer, error:, copy, failed, with, mode, 3rd, pull,, with, error:, copy, http, 403, :, permission, refused]","[transfer, error, copy, failed, with, mode, 3rd, pull, with, error, copy, http, 403, , permission, refused]","[transfer, error, copy, failed, mode, 3rd, pull, error, copy, http, 403, , permission, refused]","[transfer, error, copy, failed, mode, 3rd, pull, error, copy, http, 403, permission, refused]","[-1.2231686619611888, 1.8497820496559143, -3.4540224442115197]","[-1.2231686619611888, 1.8497820496559143, -3.4540224442115197]",0
2020-10-20-2021__head01.aglt2.org__gridftp.nese.mghpcc.org__655616124__c0e9be0a-130f-11eb-ba8f-b49691292ed8,TRANSFER Operation timed out,head01.aglt2.org,AGLT2,gridftp.nese.mghpcc.org,BU_ATLAS_Tier2,2020-10-20T20:21:36UTC,110,TRANSFER,CONNECTION_TIMED_OUT,TRANSFER,TRANSFER Operation timed out,"[transfer, operation, timed, out]","[transfer, operation, timed, out]","[transfer, operation, timed]","[transfer, operation, timed]","[3.107237656911214, -2.602229247490565, -0.6326684951782227]","[3.107237656911214, -2.602229247490565, -0.6326684951782227]",1
2020-10-20-2053__atlaswebdav-kit.gridka.de__fal-pygrid-30.lancs.ac.uk__1804077469__bfb9b278-1315-11eb-942b-fa163e5a6d18,"TRANSFER ERROR: Copy failed with mode 3rd pull, with error: copy HTTP 403 : Permission refused",atlaswebdav-kit.gridka.de,FZK-LCG2,fal-pygrid-30.lancs.ac.uk,UKI-NORTHGRID-LANCS-HEP,2020-10-20T20:53:04UTC,1,TRANSFER,OPERATION_NOT_PERMITTED,TRANSFER,"TRANSFER error: Copy failed with mode 3rd pull, with error: copy HTTP 403 : Permission refused","[transfer, error:, copy, failed, with, mode, 3rd, pull,, with, error:, copy, http, 403, :, permission, refused]","[transfer, error, copy, failed, with, mode, 3rd, pull, with, error, copy, http, 403, , permission, refused]","[transfer, error, copy, failed, mode, 3rd, pull, error, copy, http, 403, , permission, refused]","[transfer, error, copy, failed, mode, 3rd, pull, error, copy, http, 403, permission, refused]","[-1.2231686619611888, 1.8497820496559143, -3.4540224442115197]","[-1.2231686619611888, 1.8497820496559143, -3.4540224442115197]",0
2020-10-20-2053__atlas-gridftp.bu.edu__t2se01.physics.ox.ac.uk__1804075477__bca8bbf3-4d64-5416-b075-8cc2333962fe,Operation timed out,atlas-gridftp.bu.edu,BU_ATLAS_Tier2,t2se01.physics.ox.ac.uk,UKI-SOUTHGRID-OX-HEP,2020-10-20T20:53:06UTC,110,TRANSFER_PREPARATION,CONNECTION_TIMED_OUT,SOURCE,Operation timed out,"[operation, timed, out]","[operation, timed, out]","[operation, timed]","[operation, timed]","[5.156564295291901, -4.043393298983574, 0.43869757652282715]","[5.156564295291901, -4.043393298983574, 0.43869757652282715]",1
2020-10-20-2053__atlas-gridftp.bu.edu__t2se01.physics.ox.ac.uk__1804074461__67cdd260-1315-11eb-891a-fa163e564087,Operation timed out,atlas-gridftp.bu.edu,BU_ATLAS_Tier2,t2se01.physics.ox.ac.uk,UKI-SOUTHGRID-OX-HEP,2020-10-20T20:53:08UTC,110,TRANSFER_PREPARATION,CONNECTION_TIMED_OUT,SOURCE,Operation timed out,"[operation, timed, out]","[operation, timed, out]","[operation, timed]","[operation, timed]","[5.156564295291901, -4.043393298983574, 0.43869757652282715]","[5.156564295291901, -4.043393298983574, 0.43869757652282715]",1
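
The `tokens`, `tokens_cleaned`, `stop_token` and `stop_token_1` columns above show the message being lower-cased and split, stripped of punctuation, filtered for stop words, and finally cleared of empty tokens. The sketch below reproduces those four stages in plain Python for the two messages shown. It is an illustration only: the function names and the two-word stop list are assumptions chosen to match the rows above, and the adapter itself presumably relies on Spark transformers and a fuller stop-word list.

```python
import string

# Hypothetical stop-word list: only the words seen to be removed in the table above
STOP_WORDS = {"with", "out"}


def preprocess(message):
    """Return the intermediate token lists for one error message."""
    # 1. lower-case and split on whitespace ("tokens")
    tokens = message.lower().split()
    # 2. strip leading/trailing punctuation ("tokens_cleaned");
    #    a lone ":" becomes an empty string at this stage
    tokens_cleaned = [t.strip(string.punctuation) for t in tokens]
    # 3. drop stop words ("stop_token")
    stop_token = [t for t in tokens_cleaned if t not in STOP_WORDS]
    # 4. drop empty tokens ("stop_token_1", the column passed as tks_col)
    stop_token_1 = [t for t in stop_token if t]
    return {
        "tokens": tokens,
        "tokens_cleaned": tokens_cleaned,
        "stop_token": stop_token,
        "stop_token_1": stop_token_1,
    }


stages = preprocess(
    "TRANSFER ERROR: Copy failed with mode 3rd pull, "
    "with error: copy HTTP 403 : Permission refused"
)
for name, value in stages.items():
    print(name, value)
```

Only leading and trailing punctuation is stripped, so identifiers with inner punctuation such as `globus_xio` survive as single tokens.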


# Results
Summarise the clusters found by the trained K-Means model.

## Post-processing

In [15]:
summary_table = pipeline.post_process(kmeans_model)

In [16]:
summary_table

,prediction,cluster_size,unique_strings,unique_patterns,pattern,src_rcsite,dst_rcsite,n,n_perc,rank
0,1,609,15,15,operation timed out,BU_ATLAS_Tier2,GRIF,44,0.072250,1
1,1,609,15,15,operation timed out,BU_ATLAS_Tier2,UKI-NORTHGRID-MAN-HEP,40,0.065681,2
2,1,609,15,15,operation timed out,BU_ATLAS_Tier2,wuppertalprod,35,0.057471,3
3,1,609,15,15,operation timed out,BU_ATLAS_Tier2,BNL-ATLAS,26,0.042693,4
4,1,609,15,15,operation timed out,BU_ATLAS_Tier2,NDGF-T1,21,0.034483,5
...,...,...,...,...,...,...,...,...,...,...
437,0,389,141,73,transfer globus_xio system error in recv connection reset by peer globus_xio a system call failed connection reset by peer,ZA-WITS-CORE,UAM-LCG2,1,0.002571,57
438,0,389,141,73,transfer error copy failed with mode 3rd pull with error copy (neon) could not parse redirect destination url,SE-SNIC-T2,CERN-PROD,1,0.002571,57
439,0,389,141,73,transfer globus_xio unable to connect to \$ADDRESS globus_xio system error in connect connection refused globus_xio a system call failed connection refused,UAM-LCG2,SBU_Tier3,1,0.002571,57
440,0,389,141,73,transfer checksum mismatch src and dst checksum are different source 585057cc destination 58315445,GRIF,CYFRONET-LCG2,1,0.002571,57
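
The numbers in the summary table suggest that `n_perc` is the row count `n` divided by `cluster_size` (for example 44 / 609 ≈ 0.072250) and that `rank` orders rows by `n` within a cluster, with tied rows sharing the lowest rank. The sketch below reproduces both columns for the first five rows of cluster 1. This reading is inferred from the displayed values rather than from the adapter's code, and `add_share_and_rank` is a hypothetical helper.

```python
def add_share_and_rank(rows, cluster_size):
    """Add 'n_perc' (share of the cluster) and 'rank' (1 = most frequent) to each row.

    Tied rows share the lowest rank, so counts 5, 3, 3, 1 are ranked 1, 2, 2, 4.
    """
    counts = [row["n"] for row in rows]
    result = []
    for row in rows:
        # rank = 1 + number of rows with a strictly larger count
        rank = 1 + sum(1 for c in counts if c > row["n"])
        result.append({**row, "n_perc": round(row["n"] / cluster_size, 6), "rank": rank})
    return result


# (dst_rcsite, n) pairs copied from the first five rows of the summary table
cluster_1_rows = [
    {"dst_rcsite": "GRIF", "n": 44},
    {"dst_rcsite": "UKI-NORTHGRID-MAN-HEP", "n": 40},
    {"dst_rcsite": "wuppertalprod", "n": 35},
    {"dst_rcsite": "BNL-ATLAS", "n": 26},
    {"dst_rcsite": "NDGF-T1", "n": 21},
]

summary = add_share_and_rank(cluster_1_rows, cluster_size=609)
for row in summary:
    print(row)
```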


## All in one step 

In [17]:
# remove models saved by earlier runs (-f avoids an error if the directory does not exist yet)
!rm -rf results

# imports and settings
from opint_framework.apps.example_app.nlp.pyspark_based.pyspark_nlp_adapter import pysparkNLPAdapter
import pandas as pd
pd.set_option('display.max_colwidth', None)

# set up sample data path
#data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/sample_data_5mar20.json"""
# data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/test_data.json"""
# data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/test_data_raw.json"""
data_path = """/home/luca/PycharmProjects/opint-framework/opint_framework/apps/example_app/nlp/pyspark_based/sample_data/fts_20oct2020.json"""

# instantiate NLPAdapter object
pipeline = pysparkNLPAdapter(path_list=[data_path], vo=None,  # data
                             tks_col="stop_token_1",  # tokenization
                             w2v_model_path="results/w2v", w2v_mode="train", w2v_save_mode="overwrite",
                             emb_size=3, win_size=8, min_count=1, tks_vec="message_vector",  # word2vec
                             ft_col="features", kmeans_model_path="results/kmeans", kmeans_mode="train",
                             pred_mode="static", new_cluster_thresh=None, k_list=[2, 4],
                             distance="cosine", opt_initSteps=10, opt_tol=0.01, opt_maxIter=10,
                             log_path=None, n_cores=4,  # K_optim
                             tr_initSteps=30, tr_tol=0.001, tr_maxIter=30,  # train_kmeans
                             clust_col="prediction", wrdcld=True, timeplot=True)

# get results
summary_table = pipeline.execute()


NLP Adapter - PySpark_adapter: Pre Processing input data

NLP Adapter - PySpark_adapter: Executing vectorization, tokenization and clusterization

Training Word2Vec model
 ------------------------------------------------------------ 

Saving w2v model to: results/w2v/w2v_sample_app_example_VS=3_MC=1_WS=8

Selecting best number of clusters
 ------------------------------------------------------------ 

With K=4

Started at: 2021-02-01 19:12:23

Within Cluster Sum of Squared Errors = 55.2373
Silhouette with cosine distance = 0.8493

Time elapsed: 0 minutes and 1 seconds.
------------------------------------------------------------
With K=2

Started at: 2021-02-01 19:12:23

Within Cluster Sum of Squared Errors = 150.1635
Silhouette with cosine distance = 0.8048

Time elapsed: 0 minutes and 1 seconds.
------------------------------------------------------------

Training K-Means model for best value of K
 ------------------------------------------------------------ 





With K=4

Started at: 2021-02-01 19:12:25

Within Cluster Sum of Squared Errors = 54.4526
Silhouette with cosine distance = 0.8493

Time elapsed: 0 minutes and 1 seconds.
------------------------------------------------------------
Saving K-Means model to: results/kmeans/kmeans_K=4

NLP Adapter - PySpark_adapter: Post processing


In [18]:
summary_table

,prediction,cluster_size,unique_strings,unique_patterns,pattern,src_rcsite,dst_rcsite,n,n_perc,rank
0,1,562,12,12,operation timed out,BU_ATLAS_Tier2,GRIF,44,0.078292,1
1,1,562,12,12,operation timed out,BU_ATLAS_Tier2,UKI-NORTHGRID-MAN-HEP,40,0.071174,2
2,1,562,12,12,operation timed out,BU_ATLAS_Tier2,wuppertalprod,35,0.062278,3
3,1,562,12,12,operation timed out,BU_ATLAS_Tier2,BNL-ATLAS,26,0.046263,4
4,1,562,12,12,operation timed out,BU_ATLAS_Tier2,NDGF-T1,21,0.037367,5
...,...,...,...,...,...,...,...,...,...,...
438,3,30,3,3,srm-ifce err connection timed out err [se][releasefiles][etimedout] \$URL /srm/managerv2 user timeout over,INFN-T1,UKI-NORTHGRID-MAN-HEP,2,0.066667,4
439,3,30,3,3,srm-ifce err connection timed out err [se][releasefiles][etimedout] \$URL /srm/managerv2 user timeout over,INFN-T1,RRC-KI-T1,2,0.066667,4
440,3,30,3,3,result connection timeout during redirection on \$URL \$FILE_PATH after 1 attempts,TRIUMF-LCG2,UKI-NORTHGRID-MAN-HEP,1,0.033333,7
441,3,30,3,3,srm-ifce err input/output error err [se][releasefiles][srm_failure] \$URL /srm/managerv2 internal error invalid request type for token a735c8cf-f416-4375-8386-06188fe1dbba empty,INFN-T1,ifae,1,0.033333,7
