# NLP methods applied to FTS transfer error messages

**Objective:** 

 - extract FTS transfer error data
 - explore the data
 - apply NLP methods to error messages
 - possibly account for other features as well (e.g. source/destination sites, transfer protocol, ...)

### Spark Session 

In [2]:
%%time

# start Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("fts_data_extraction").getOrCreate()
#spark = SparkSession.builder.master("local[*]").appName("fts_data").getOrCreate()
spark

CPU times: user 32.1 ms, sys: 18.7 ms, total: 50.8 ms
Wall time: 7.61 s


In [3]:
print("Current (local) path:\n\n")
!pwd

Current (local) path:


/eos/home-l/lclissa/SWAN_projects/rucio-log-clustering/notebooks


In [4]:
print(spark.catalog.listTables())

[]


## Import data

**Note:** we consider the period 7/10 - 10/10 (7 to 10 October 2019) because an FTS instance is believed to have had an issue during those days, and we want our approach to be able to spot it.

In [5]:
%%time 

# FTS data path
path_list = ['/project/monitoring/archive/fts/raw/complete/2019/10/{:0>2}/*'.format(i) for i in range(7,11)]

# NOTE: range(7, 11) loads 7-10 October; narrow it to range(10, 11) to start with 10 October only

# load the data in the json file
all_transfers = spark.read.json(path_list)

CPU times: user 236 ms, sys: 55.1 ms, total: 291 ms
Wall time: 3min 37s
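
The path list relies on the `{:0>2}` format specifier to zero-pad the day of the month. A minimal sketch of what the list comprehension produces, using the same base path as the cell above and no Spark at all:

```python
# "{:0>2}" right-aligns the value and pads it with "0" up to width 2 (7 -> "07").
base = "/project/monitoring/archive/fts/raw/complete/2019/10/{:0>2}/*"

# range(7, 11) excludes its end point, so this covers days 7, 8, 9 and 10.
path_list = [base.format(day) for day in range(7, 11)]

for path in path_list:
    print(path)
```

Each entry is a glob pattern, so one day may expand to many files when Spark reads it.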


### Basic exploration 

Each record of the loaded DataFrame has two nested parts, one holding the actual transfer data and one holding the metadata. We extract only the data part, since the metadata is not relevant to our scope:

In [6]:
# retrieve just data 
all_transfers_data = all_transfers.select("data.*")

all_transfers_data.printSchema()

root
 |-- activity: string (nullable = true)
 |-- block_size: long (nullable = true)
 |-- buf_size: long (nullable = true)
 |-- channel_type: string (nullable = true)
 |-- chk_timeout: long (nullable = true)
 |-- dest_srm_v: string (nullable = true)
 |-- dst_hostname: string (nullable = true)
 |-- dst_se: string (nullable = true)
 |-- dst_site_name: string (nullable = true)
 |-- dst_url: string (nullable = true)
 |-- endpnt: string (nullable = true)
 |-- f_size: long (nullable = true)
 |-- file_id: string (nullable = true)
 |-- file_size: long (nullable = true)
 |-- final_destination: string (nullable = true)
 |-- ipv6: boolean (nullable = true)
 |-- is_recoverable: boolean (nullable = true)
 |-- job_id: string (nullable = true)
 |-- job_state: string (nullable = true)
 |-- latency: long (nullable = true)
 |-- log_link: string (nullable = true)
 |-- nstreams: long (nullable = true)
 |-- operation_time: long (nullable = true)
 |-- remote_access: boolean (nullable = true)
 |-- retry: lon
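
To make the effect of `select("data.*")` concrete, here is a plain-Python analogue on a list of dictionaries. The records, their values and the `metadata` key name are invented for illustration and are not taken from the FTS dataset:

```python
# Toy records mimicking the nested layout: a "data" part and a "metadata" part.
# All values (and the "metadata" key name) are assumptions made for this sketch.
records = [
    {"data": {"vo": "atlas", "nstreams": 1}, "metadata": {"producer": "fts"}},
    {"data": {"vo": "cms", "nstreams": 4}, "metadata": {"producer": "fts"}},
]

# Similar in spirit to select("data.*"): promote the fields nested under "data"
# to top-level columns and drop everything else.
flat = [dict(rec["data"]) for rec in records]

print(sorted(flat[0].keys()))
```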

In [7]:
%%time

n_trans = all_transfers_data.count()
n_vars = len(all_transfers_data.columns)

print("FTS transfer dataset shape:", n_trans, n_vars)

FTS transfer dataset shape: 17593901 68
CPU times: user 107 ms, sys: 41.8 ms, total: 149 ms
Wall time: 2min 9s


The dataset of all transfers attempted between 7/10 and 10/10 has ~17.6 million rows and 68 columns. Let us now have a closer look at the variables and their meaning:

In [16]:
for idx, el in enumerate(all_transfers_data.head(3)[0]):
    print(all_transfers_data.columns[idx], ":\n", el, "\n\n")

activity :
 Data Consolidation 


block_size :
 0 


buf_size :
 0 


channel_type :
 urlcopy 


chk_timeout :
 0 


dest_srm_v :
 2.2.0 


dst_hostname :
 uct2-dc1.uchicago.edu 


dst_se :
 srm://uct2-dc1.uchicago.edu 


dst_site_name :
  


dst_url :
 srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/atlasdatadisk/rucio/mc16_13TeV/ba/f3/DAOD_HIGG4D2.19353240._000089.pool.root.1 


endpnt :
 bnl 


f_size :
 4128472676 


file_id :
 287224235 


file_size :
 4128472676 


final_destination :
  


ipv6 :
 False 


is_recoverable :
 True 


job_id :
 7fe3424f-fa74-5db7-8d59-87be1d129883 


job_state :
 ACTIVE 


latency :
 None 


log_link :
 https://bnl:8449/fts3/ftsmon/#/job/7fe3424f-fa74-5db7-8d59-87be1d129883 


nstreams :
 1 


operation_time :
 None 


remote_access :
 True 


retry :
 0 


retry_max :
 0 


src_hostname :
 storage01.lcg.cscs.ch 


src_se :
 srm://storage01.lcg.cscs.ch 


src_site_name :
  


src_srm_v :
 2.2.0 


src_url :
 srm://storage01.lcg

### Variables' description

<div class="alert alert-block alert-danger">
<b>Reminder:</b> Check variable relevance to error detection with domain experts and then i) shorten the list (only relevant features with explanations), ii) add reference to documentation, iii) add legend with message/auxiliary variable colors. 
</div>


The following list contains a description of the variables' content:

**Generic information:**

  - "tr_id": "YEAR-MONTH-DAY-HOURMINUTE__sourcese__destse__file_id__job_id",
  - "endpnt": "FTS3 endpoint",
  - "src_srm_v": "Source SRM version, always 2.0 if srm is used",
  - "dest_srm_v": "Destination SRM version, always 2.0 if srm is used",
  - **<font color='green'>"vo"</font>:** "Virtual Organization",
  - "src_url": "Source URL",
  - "dst_url": "Destination URL",
  - **<font color='green'>"src_hostname"</font>:** "Source hostname",
  - **<font color='green'>"dst_hostname"</font>:** "Destination hostname",
  - "src_site_name": "", // Always empty
  - "dst_site_name": "", // Always empty
  - **<font color='green'>"t_channel"</font>:** "source_protocol://source_host__dest_protocol://dest_host",

**Time information:**
  - "timestamp_tr_st": 0, // Timestamp of the whole process start, in milliseconds
  - **<font color='green'>"timestamp_tr_comp"</font>:** 0, // Timestamp of the whole process completion, in milliseconds
  - "timestamp_chk_src_st": 0, // Timestamp when started the validation of the source checksum, in milliseconds
  - "timestamp_chk_src_ended": 0, // Timestamp when finished the validation of the source checksum, in milliseconds
  - "timestamp_checksum_dest_st": 0, // Timestamp when started the validation of the destination checksum, in milliseconds
  - "timestamp_checksum_dest_ended": 0, // Timestamp when finished the validation of the destination checksum, in milliseconds
  - "t_timeout": 0, // Timeout used for the transfer
  - "chk_timeout": 0, // Timeout used for the checksum operations

**Error information:**
  - **<font color='red'>"t_error_code"</font>:** 0, // Error code: an errno value (i.e. ENOENT)
  
  > corresponds to the errno value returned by the url-copy process (i.e. ENOENT)
  
  - **<font color='red'>"tr_error_scope"</font>:** "Error scope, empty if ok",
  
  > 3 possible values, SOURCE, TRANSFER and DESTINATION depending on where the error happens. SOURCE for instance is set if the source file is not there or the source checksum query fails.
  
  - **<font color='red'>"t_failure_phase"</font>:** "Error phase, empty if ok",
  
  > 3 possible values: TRANSFER_PREPARATION, TRANSFER, TRANSFER_FINALIZATION (they map more or less to the values of tr_error_scope)
  
  - **<font color='red'>"tr_error_category"</font>:** "Error category, empty if ok",
  
  > this is the string representation of the t_error_code as returned by the strerror_r function (https://linux.die.net/man/3/strerror_r), possible values are:
COMMUNICATION_ERROR_ON_SEND, FILE_EXIST, PERMISSION_DENIED, etc

  - **<font color='red'>"t_final_transfer_state"</font>:** "Ok|Error|Abort",

  > != "Ok" for errors

  - **<font color='red'>"t__error_message"</font>:** "Error message, empty if ok",
  
  > string error from the storage



**Transfer metrics:**
  - "tr_bt_transfered": 0, // How many bytes have been transferred
  - "nstreams": 0, // How many streams have been used
  - "buf_size": 0, // TCP buffer size used (for backwards compatibility)
  - "tcp_buf_size": 0, // TCP buffer size used
  - "block_size": 0, // Unused

  - "f_size": 0, // Filesize

  - "time_srm_prep_st": 0, // Timestamp of the start of the SRM GET operation, if any, in milliseconds
  - "time_srm_prep_end": 0, // Timestamp of the completion of the SRM GET operation, if any, in milliseconds
  - "time_srm_fin_st": 0, // Timestamp of the start of the SRM PUT operation, if any, in milliseconds
  - "time_srm_fin_end": 0, // Timestamp of the completion of the SRM PUT operation, if any, in milliseconds

  - "srm_space_token_src": "Source space token, if any",
  - "srm_space_token_dst": "Destination space token, if any",

  - "tr_timestamp_start": 0, // Timestamp of the start of *only the transfer part* (excluding preparation), in milliseconds
  - "tr_timestamp_complete": 0, // Timestamp of the completion of *only the transfer part* (excluding preparation), in milliseconds


  - "channel_type": "urlcopy", // Always
  - "user_dn": "User that submitted the job",
  - "file_metadata": "File metadata set by the user at submission",
  - "job_metadata": "Job metadata set by the user at submission"


  - "retry": 0, // When retries are enabled, which retry is this transfer
  - "retry_max": 0, // When retries are enabled, max number of retries for this transfer
  - "job_m_replica": false, // true if this transfer belongs to a multiple replica job
  - "job_state": "Job state, if known",
  - "is_recoverable": false, // true if FTS3 considers this transfer could be retried (depends on the error code)
  - "ipv6": false, // true if the transfer took place over IPv6
  - "transfer_type": "streamed|3rd pull|3rd push" // How the transfer was done

In [21]:
documented_vars = [ "tr_id" , "endpnt" , "src_srm_v" , "dest_srm_v" , "vo" , "src_url" , "dst_url" , "src_hostname" , "dst_hostname" , "src_site_name" , 
                    "dst_site_name" , "t_channel" , "timestamp_tr_st" , "timestamp_tr_comp" , "timestamp_chk_src_st" , "timestamp_chk_src_ended" , 
                    "timestamp_checksum_dest_st" , "timestamp_checksum_dest_ended" , "t_timeout" , "chk_timeout" , "t_error_code" , "tr_error_scope" , 
                    "t_failure_phase" , "tr_error_category" , "t_final_transfer_state" , "t__error_message" , "tr_bt_transfered" , "nstreams" , "buf_size" , 
                    "tcp_buf_size" , "block_size" , "f_size" , "time_srm_prep_st" , "time_srm_prep_end" , "time_srm_fin_st" , "time_srm_fin_end" , "srm_space_token_src" , 
                    "srm_space_token_dst" , "tr_timestamp_start" , "tr_timestamp_complete" , "channel_type" , "user_dn" , "file_metadata" , "job_metadata" , "retry" , 
                    "retry_max" , "job_m_replica" , "job_state" , "is_recoverable" , "ipv6" , "transfer_type" ]

all_vars = all_transfers_data.columns

non_documented_vars = list(set(all_vars) - set(documented_vars))

<div class="alert alert-block alert-danger">
<b>Alert:</b> Some of the variables are not documented!
</div>

In [22]:
non_documented_vars

['t_final_transfer_state_flag',
 'file_id',
 'file_size',
 'activity',
 'timestamp_checksum_src_diff',
 'final_destination',
 'srm_finalization_time',
 'job_id',
 'timestamp_checksum_dst_diff',
 'src_se',
 'throughput',
 'operation_time',
 'srm_preparation_time',
 'srm_overhead_percentage',
 'srm_overhead_time',
 'log_link',
 'remote_access',
 'user',
 'latency',
 'dst_se']

### Error extraction

Now we focus on transfer errors only and split the DataFrame into multiple tables, each grouping related pieces of information:

In [8]:
errors = all_transfers_data.filter(all_transfers_data["t_final_transfer_state_flag"] == 0)

errors.head(1)

[Row(activity='ASO', block_size=0, buf_size=0, channel_type='urlcopy', chk_timeout=0, dest_srm_v='', dst_hostname='ganymede.hep.kbfi.ee', dst_se='gsiftp://ganymede.hep.kbfi.ee', dst_site_name='', dst_url='gsiftp://ganymede.hep.kbfi.ee:2811//cms/store/user/kaehatah/2016v3_2019Sep30/DY1JetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/2016v3_2019Sep30_CHUNK1_DY1JetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8__RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3-v1/190930_211130/0000/tree_185.root', endpnt='fts3.cern.ch', f_size=179841642, file_id='2564707869', file_size=179841642, final_destination='', ipv6=False, is_recoverable=True, job_id='37564a72-e8cd-11e9-ad00-02163e018c08', job_state='UNKNOWN', latency=None, log_link='https://fts3.cern.ch:8449/fts3/ftsmon/#/job/37564a72-e8cd-11e9-ad00-02163e018c08', nstreams=1, operation_time=None, remote_access=True, retry=2, retry_max=3, src_hostname='storm.ifca.es', src_se='srm://storm.ifca.es', src_site_name='', src_srm_v='2.2.0

In [13]:
n_errs = errors.count()
n_errs

4716431

### **Messages**

In [10]:
err_mess = errors.select("t__error_message")

err_mess.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|t__error_message                                                                                                                                                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

#### Unique messages 

In [11]:
%%time
distinct_mess = err_mess.select("t__error_message").distinct()

CPU times: user 2.03 ms, sys: 1.78 ms, total: 3.81 ms
Wall time: 16.2 ms


In [14]:
%%time
n_unique_mess = distinct_mess.count()
n_unique_mess

CPU times: user 339 ms, sys: 76.2 ms, total: 415 ms
Wall time: 4min 58s


271795

<div class="alert alert-block alert-info">
<b>Alert:</b> Out of the ~4.7 million error messages, about 272k are unique
</div>
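
As a quick sanity check on the orders of magnitude, the counts printed by the cells above can be turned into ratios. This is only arithmetic on the reported numbers, not an additional query on the data:

```python
# Counts reported by the notebook cells above.
n_trans = 17593901       # all transfers in the selected period
n_errs = 4716431         # rows kept by the error filter
n_unique_mess = 271795   # distinct error messages

error_share = n_errs / n_trans * 100          # share of transfers that are errors
unique_share = n_unique_mess / n_errs * 100   # share of error messages that are distinct

print("errors / transfers: {:.1f}%".format(error_share))
print("unique / errors:    {:.1f}%".format(unique_share))
```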

Let us now explore which errors are the most common:

In [15]:
%who

SparkSession	 all_transfers	 all_transfers_data	 distinct_mess	 err_mess	 errors	 n_errs	 n_trans	 n_unique_mess	 
n_vars	 path_list	 spark	 


In [7]:
del all_transfers, all_transfers_data, errors, path_list

#### Frequency  

In [16]:
%%time

#n_errs = 1198958

error_freq = err_mess.groupBy("t__error_message").count()
error_freq = error_freq.orderBy(error_freq["count"].desc()).withColumn("percentage", error_freq["count"]/n_errs*100)
error_freq.show(50)

+--------------------+------+-------------------+
|    t__error_message| count|         percentage|
+--------------------+------+-------------------+
|TRANSFER  globus_...|570137| 12.088314235912707|
|TRANSFER  globus_...|324553|  6.881326155306841|
|Error on XrdCl::C...|224141|  4.752343456312623|
|DESTINATION SRM_P...|115402| 2.4468077662961676|
|TRANSFER  an end-...|113417| 2.4047208577842016|
|TRANSFER  Transfe...| 95104|  2.016439973361213|
|globus_ftp_client...| 81148| 1.7205382629365298|
|Destination file ...| 78000| 1.6537928785558402|
|TRANSFER  Operati...| 71783|  1.521977105145819|
|TRANSFER  globus_...| 69113| 1.4653665027644844|
|TRANSFER  globus_...| 60554| 1.2838945380521838|
|TRANSFER  ERROR: ...| 43011|  0.911939557686734|
|SOURCE CHECKSUM g...| 42556| 0.9022924325618248|
|TRANSFER  globus_...| 38749| 0.8215746186046187|
|TRANSFER  globus_...| 35904| 0.7612535834829345|
|TRANSFER  globus_...| 34574| 0.7330542946562771|
|TRANSFER  globus_...| 28598| 0.6063483171915374|
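
The `groupBy(...).count()` step followed by the percentage column can be mirrored on a small scale with `collections.Counter`. The messages below are invented for illustration; the real input is the `t__error_message` column:

```python
from collections import Counter

# Toy messages, made up for this sketch.
messages = [
    "TRANSFER timeout",
    "TRANSFER timeout",
    "SOURCE checksum mismatch",
    "TRANSFER timeout",
    "DESTINATION file exists",
]

counts = Counter(messages)
n_errs = len(messages)

# (message, count, percentage), sorted by decreasing count like orderBy(count.desc()).
error_freq = [(msg, n, n / n_errs * 100) for msg, n in counts.most_common()]

for msg, n, perc in error_freq:
    print("{:<28} {:>3} {:6.1f}".format(msg, n, perc))
```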


##### Top 20 errors 

In [17]:
error_freq.select("t__error_message").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|t__error_message                                                                                                                                                                                                                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [18]:
%%time

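from pyspark.sql import Window
from pyspark.sql import functions as F

# running total of "percentage" over messages ordered by decreasing count
# (range frame: messages that share the same count also share the same cum_perc)
windowval = (Window.orderBy(error_freq['count'].desc()).rangeBetween(Window.unboundedPreceding, 0))

# msg_id: monotonically_increasing_id() yields unique, increasing ids that are not guaranteed to be consecutive
error_freq = error_freq.withColumn('cum_perc', F.sum('percentage').over(windowval)).withColumn("msg_id", F.monotonically_increasing_id())
error_freq = error_freq.withColumnRenamed("t__error_message", "message").select("msg_id", "message", "count", "percentage", "cum_perc")

error_freq = error_freq.orderBy(error_freq['count'].desc())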


error_freq.show(50)

+------+--------------------+------+-------------------+------------------+
|msg_id|             message| count|         percentage|          cum_perc|
+------+--------------------+------+-------------------+------------------+
|     0|TRANSFER  globus_...|570137| 12.088314235912707|12.088314235912707|
|     1|TRANSFER  globus_...|324553|  6.881326155306841| 18.96964039121955|
|     2|Error on XrdCl::C...|224141|  4.752343456312623|23.721983847532172|
|     3|DESTINATION SRM_P...|115402| 2.4468077662961676| 26.16879161382834|
|     4|TRANSFER  an end-...|113417| 2.4047208577842016|28.573512471612542|
|     5|TRANSFER  Transfe...| 95104|  2.016439973361213|30.589952444973754|
|     6|globus_ftp_client...| 81148| 1.7205382629365298| 32.31049070791028|
|     7|Destination file ...| 78000| 1.6537928785558402|33.964283586466124|
|     8|TRANSFER  Operati...| 71783|  1.521977105145819|35.486260691611946|
|     9|TRANSFER  globus_...| 69113| 1.4653665027644844| 36.95162719437643|
|    10|TRAN
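
One subtlety of the window used for `cum_perc` is that `rangeBetween(Window.unboundedPreceding, 0)` defines a range frame on the ordering value, so messages with equal `count` are treated as peers and receive the same cumulative value. The following plain-Python sketch contrasts that behaviour with a row-by-row running sum, on invented counts:

```python
from itertools import accumulate, groupby

# Toy (message, count) pairs, already sorted by decreasing count; values are invented.
freq = [("A", 5), ("B", 2), ("C", 2), ("D", 1)]
total = sum(n for _, n in freq)

# Row-based running sum: every row adds only its own percentage.
rows_cum = list(accumulate(n / total * 100 for _, n in freq))

# Range-based running sum: rows with the same count form one tie group,
# and each member gets the cumulative value reached after the whole group.
range_cum = {}
running = 0.0
for count, group in groupby(freq, key=lambda row: row[1]):
    members = list(group)
    running += sum(n for _, n in members) / total * 100
    for msg, _ in members:
        range_cum[msg] = running

print(rows_cum)
print(range_cum)
```

Here "B" and "C" end up with the same range-based value although their row-based values differ. If a strictly row-by-row cumulative sum is preferred in Spark, `rowsBetween(Window.unboundedPreceding, Window.currentRow)` is the usual alternative, at the cost of an arbitrary order among ties.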

In [19]:
error_freq.filter(error_freq["msg_id"] > (n_unique_mess-50)).select("message").show(n=50, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|message                                                                                                                                                                                                                                                                                                                                                                                             

In [20]:
%who

F	 SparkSession	 Window	 distinct_mess	 err_mess	 error_freq	 error_freq1	 n_errs	 spark	 
windowval	 


In [20]:
error_freq_pd = error_freq.toPandas()

In [21]:
import numpy as np


len(np.unique(error_freq_pd.msg_id))

271795

In [24]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)  # -1 disables truncation in pandas < 1.0; newer versions use None instead

error_freq_pd[error_freq_pd.cum_perc>80].head(30)

Unnamed: 0,msg_id,message,count,percentage,cum_perc
2811,2811,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/mc/RunIISummer16NanoAODv5/BsToJpsiF2p1525_BMuonFilter_SoftQCDnonD_TuneCUEP8M1_13TeV-pythia8-evtgen/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/110000/76E1DE06-A2BF-554A-B158-35652DAD0810.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414
2812,2812,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/data/Run2018D/NoBPTX/NANOAOD/Nano1June2019_ver2-v1/30000/7821BF71-B8F6-D749-A382-3F92930A0B50.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414
2813,2813,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/mc/RunIIFall17NanoAODv5/ST_t-channel_top_4f_InclusiveDecays_TuneCP5up_PSweights_13TeV-powheg-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano1June2019_102X_mc2017_realistic_v7-v1/250000/EA3156D2-BD29-9B42-86E6-B164AA035078.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414
2814,2814,"srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://srmcms.pic.es:8443/srm/managerv2: CGSI-gSOAP running on fts437.cern.ch reports Error reading token data header: Connection closed",64,0.001357,80.059414
2815,2815,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/mc/RunIISummer16NanoAODv5/WprimeToENu_M-5400_TuneCUETP8M1_13TeV-pythia8/NANOAODSIM/PUMoriond17_Nano1June2019_102X_mcRun2_asymptotic_v7-v1/110000/116AA6F1-07C9-474F-8F24-5CD1EF6C3A8E.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414
2816,2816,"srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://stormfe1.pi.infn.it:8444/srm/managerv2: CGSI-gSOAP running on fts432.cern.ch reports Error reading token data header: Connection closed",64,0.001357,80.059414
2817,2817,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/data/Run2018D/NoBPTX/NANOAOD/Nano1June2019_ver2-v1/30000/BD1DAF06-1286-DA48-A029-9342AAEF5124.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414
2818,2818,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/mc/RunIIAutumn18NanoAODv5/ChargedHiggsToCB_M125_TuneCP5_PSweights_13TeV-madgraph-pythia8/NANOAODSIM/Nano1June2019_102X_upgrade2018_realistic_v19-v1/130000/E1EB417D-D801-3E4D-B8FD-1E95F13FE764.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414
2819,2819,"srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://srm.ihep.ac.cn:8443/srm/managerv2: CGSI-gSOAP running on lcgfts05.gridpp.rl.ac.uk reports could not open connection to srm.ihep.ac.cn:8443",64,0.001357,80.059414
2820,2820,DESTINATION SRM_PUTDONE Error on the surl srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/cms/store/test/rucio/cms//store/mc/RunIIAutumn18NanoAODv5/VBFHToWWToLNuQQ_M180_NNPDF31_TuneCP5_PSweights_13TeV_powheg_JHUGenV727_pythia8/NANOAODSIM/Nano1June2019_102X_upgrade2018_realistic_v19-v1/270000/4B692399-D55E-C04F-B633-BF0BF665AB81.root while putdone : [SE][PutDone][SRM_INVALID_PATH] This SURL does not exist in the original request,64,0.001357,80.059414


In [31]:
a = error_freq_pd[error_freq_pd.message.str.contains("DESTINATION SRM_PUTDONE", regex=False)]
a["percentage"].sum()

8.092135769610538

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)

for i in range(94000, 94010):
    print(error_freq_pd[error_freq_pd.msg_id==i].msg_id)

In [None]:
error_freq_pd[error_freq_pd.msg_id==94001]

In [63]:
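# messages lying beyond the first 80% of cumulative error share (the long tail)
top_80_perc = error_freq.filter("cum_perc > 80")
top_80_perc.agg({"msg_id":"min"}).show()

# span of msg_id values covered by that tail
top_80_perc.agg({"msg_id": "max"}).collect()[0][0] - top_80_perc.head(1)[0][0]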

+-----------+
|min(msg_id)|
+-----------+
|       1841|
+-----------+



92704

In [40]:
error_freq.filter(error_freq.msg_id == 1893).show()

+------+--------------------+-----+--------------------+----------------+
|msg_id|             message|count|          percentage|        cum_perc|
+------+--------------------+-----+--------------------+----------------+
|  1893|Protocol not supp...|   33|0.002752389991976366|80.0716121832487|
|  1893|srm-ifce err: Com...|   33|0.002752389991976366|80.0716121832487|
|  1893|SOURCE CHECKSUM M...|   33|0.002752389991976366|80.0716121832487|
|  1893|DESTINATION SRM_P...|   33|0.002752389991976366|80.0716121832487|
|  1893|SOURCE CHECKSUM M...|   33|0.002752389991976366|80.0716121832487|
|  1893|Protocol not supp...|   33|0.002752389991976366|80.0716121832487|
|  1893|DESTINATION SRM_P...|   33|0.002752389991976366|80.0716121832487|
|  1893|DESTINATION SRM_P...|   33|0.002752389991976366|80.0716121832487|
|  1893|SOURCE CHECKSUM M...|   33|0.002752389991976366|80.0716121832487|
|  1893|Protocol not supp...|   33|0.002752389991976366|80.0716121832487|
|  1893|SOURCE CHECKSUM M...|   33|0.0

In [11]:
!pip --version

pip 19.1.1 from /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/lib/python3.6/site-packages/pip (python 3.6)


# Vectorization

In [10]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

# lowercase the text and split it into tokens on whitespace
tokenizer = Tokenizer(inputCol="t__error_message", outputCol="tokens")

# remove stop (common, non-relevant) words
stop_remove = StopWordsRemover(inputCol="tokens", outputCol="stop_token")

# count frequency of each token in each text (bag of words model)
count_vec = CountVectorizer(inputCol="stop_token", outputCol="count_vec")

# compute tf-idf weights from the token counts
idf = IDF(inputCol="count_vec", outputCol="tf_idf")
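
To see what these four stages compute, here is a rough plain-Python approximation on three invented messages. It assumes whitespace splitting after lowercasing, a tiny stand-in stop-word list (Spark's default English list is much longer), and the smoothed IDF `log((m + 1) / (df + 1))` described in the Spark MLlib documentation; it is a sketch of the idea, not a re-implementation of the Spark classes:

```python
import math
import re
from collections import Counter

# Toy messages, invented for illustration.
docs = [
    "TRANSFER  globus error on the server",
    "TRANSFER  timeout on the server",
    "SOURCE checksum mismatch",
]
# Stand-in stop-word list (an assumption made for this sketch).
stop_words = {"on", "the"}

def tokenize(text):
    # Lowercase and split on single whitespace characters; two consecutive
    # spaces therefore produce an empty-string token.
    return re.split(r"\s", text.lower())

tokens = [[t for t in tokenize(d) if t not in stop_words] for d in docs]

# Document frequency: number of messages in which each token appears.
doc_freq = Counter(t for doc in tokens for t in set(doc))
n_docs = len(docs)

def tf_idf(doc):
    # Raw term counts weighted by the smoothed inverse document frequency.
    counts = Counter(doc)
    return {t: c * math.log((n_docs + 1) / (doc_freq[t] + 1)) for t, c in counts.items()}

vectors = [tf_idf(doc) for doc in tokens]

print(tokens[0])
print(vectors[0])
```

Tokens that occur in many messages (such as "transfer" here) get a lower weight than tokens specific to one message. The empty-string token produced by the double space is kept, since a stop-word list does not remove it.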

In [12]:
%%time
from pyspark.ml import Pipeline

data_prep_pipeline = Pipeline(stages = [tokenizer, stop_remove, count_vec, idf])

cleaner = data_prep_pipeline.fit(err_mess)

CPU times: user 332 ms, sys: 89.7 ms, total: 422 ms
Wall time: 3min 42s


In [13]:
%%time

clean_data = cleaner.transform(err_mess)

CPU times: user 48.4 ms, sys: 14.5 ms, total: 62.9 ms
Wall time: 231 ms


In [18]:
clean_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|    t__error_message|              tokens|          stop_token|           count_vec|              tf_idf|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|DESTINATION SRM_P...|[destination, srm...|[destination, srm...|(94563,[0,1,6,11,...|(94563,[0,1,6,11,...|
|TRANSFER  globus_...|[transfer, , glob...|[transfer, , glob...|(94563,[0,1,2,7,1...|(94563,[0,1,2,7,1...|
|Error on XrdCl::C...|[error, on, xrdcl...|[error, xrdcl::co...|(94563,[0,3,4,8,1...|(94563,[0,3,4,8,1...|
|TRANSFER  globus_...|[transfer, , glob...|[transfer, , glob...|(94563,[0,1,2,7,1...|(94563,[0,1,2,7,1...|
|DESTINATION SRM_P...|[destination, srm...|[destination, srm...|(94563,[0,1,6,11,...|(94563,[0,1,6,11,...|
|srm-ifce err: Com...|[srm-ifce, err:, ...|[srm-ifce, err:, ...|(94563,[0,7,10,23...|(94563,[0,7,10,23...|
|srm-ifce err: Com...|[srm-ifce, err:

In [19]:
clean_data.select("stop_token").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|stop_token                                                                                                                                                                                                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [16]:
clean_data.select("count_vec").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|count_vec                                                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(94563,[0,1,6,11,13,16,22,43,66,68,70,71,72,73,74],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                                                                     |
|(94563,[0,1,2,7,19,21,29,40,46,67,87,103,413],[1.0,1.0,1.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0])                                                                                  |
|(94563,[0,3,4,8,18,20,47,58,65,79],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.
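
Each `count_vec` entry is printed as a sparse vector in the form `(size, [indices], [values])`: the vocabulary size, the indices of the tokens present in the message, and their counts. A small sketch decoding the second row shown above with plain Python:

```python
# Second count_vec row from the output above, in (size, indices, values) layout.
size = 94563
indices = [0, 1, 2, 7, 19, 21, 29, 40, 46, 67, 87, 103, 413]
values = [1.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Map vocabulary index -> count; every index not listed is implicitly 0.
sparse = dict(zip(indices, values))

n_tokens = sum(values)          # total number of tokens counted in this message
density = len(indices) / size   # fraction of the vocabulary that is non-zero

print(n_tokens, density)
```

The very low density is why a sparse representation is used: storing all 94563 entries per message would be mostly zeros.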