# Cluster interpretability

In this notebook, we explore several techniques for extracting patterns from the clusters.
The goal: have an interpretable representation of the relevance of the feature in our profiles/clusters
- For understanding the clusters
- For better profile explanation for the users

A way to do it: consider profiles as documents and the task as NLP.

The first approach is to use TF-IDF. However, they have some limitations as shared values, but with different distribution over the profiles/clusters, have zero weight. This behavior may represent a problem since we have fixed low-cardinality categorical classes, e.g., scheduling class or priority

Another solution is to develop One-vs-all classifiers (white box techniques). This approach:
- Gives us also the opportunity to evaluate the profile attribution/assignment (via classification)
- Clear and algorithmically grounded
- Gives an “interpretation” available for the final user



## Imports

In [1]:
import pandas as pd

import dask.dataframe as dd
from dask.delayed import delayed

import random

import os
from collections import Counter
from collections import defaultdict

from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

import xgboost as xgb

import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.ticker import PercentFormatter

import numpy as np

import math

import seaborn as sns

## Load data

In [2]:
static_metrics = pd.read_csv('data/static_metrics_and_kmeans.csv')

In [3]:
static_metrics

Unnamed: 0,job ID,priority,different machines restriction,disk space request - Q1,disk space request - Q2,disk space request - Q3,disk space request - Q4,disk space request - Quartiles,memory request - Q1,memory request - Q2,...,CPU request - Quartiles,priority labels,user,logical job name,scheduling class,K-Means = 2,K-Means = 4,K-Means = 6,K-Means = 8,K-Means = 10
0,3418356,9,0,0,0,0,1,Q4,0,0,...,Q4,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,fGRnr2XEPDr3kQsPccU/k1LELeeQonkj6hDpTP7ALkg=,3,0,1,1,1,1
1,3418405,9,0,0,0,0,1,Q4,0,0,...,Q4,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,q6nwarTUw/Xct0ONQEdblvVhW8uWTquTp8C5la5YfRE=,3,0,1,1,1,1
2,6724949,9,0,0,0,0,1,Q4,0,0,...,Q4,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,4my5Elvc5RumesxoVeuFovkoS28KYA9C3pIBi2bY5Io=,3,0,1,1,1,1
3,28185708,1,0,1,0,0,0,Q1,0,1,...,Q3,"Free [0,1]",WVtO5qw3sNnP4MeiRUnqr07CekrYMU12Mc7GbsgnjhQ=,JGBCBdeRFciFaU6LrbO9Y5w1lBoZ1MyX5Pnx4m05HK8=,2,0,1,1,1,1
4,124371644,9,0,0,0,0,1,Q4,0,0,...,Q3,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,HcYZ4RNZRxmh/W+WuzNBOVk4sOCdDshVEB/McWxfyyk=,3,0,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64928,6486612269,1,0,0,1,0,0,Q2,0,0,...,Q3,"Free [0,1]",E+9U+J1Dicd5PJklbq2e5THQ29X6D8jmqQ0Zu53Kc+o=,QZm5VyyEiL9TpVhasvtcMxPxENJqlNX4Fn2LdEe+Ff0=,1,1,0,0,0,0
64929,6486612710,4,0,0,0,0,1,Q4,0,0,...,Q2,"Other [2,8]",HxdST/yDI1TlIkr0Povf9KaXGGG8x4iVXX6o/iSIghM=,YI5372ouHbbTv6b594D2bDWC283uv6QcuE41Mst3BFk=,0,1,3,5,5,4
64930,6486630408,4,0,0,1,0,0,Q2,1,0,...,Q3,"Other [2,8]",fJeARInTaIjFrdXGqxap6x2T3TpJB84y9zvFsoGGhjk=,ZoHIMIMjYenKtCPq0iy70XUxyF0JDf1HuW6yUzS4dBM=,0,1,0,0,0,0
64931,6486631154,8,0,0,1,0,0,Q2,0,1,...,Q2,"Other [2,8]",fJeARInTaIjFrdXGqxap6x2T3TpJB84y9zvFsoGGhjk=,cLRsAgA+ajg6giausco6dsV5PxWsqEJZDLdHWhd6v0E=,2,1,0,0,0,0


## TF-IDF

In [4]:
for c in static_metrics.columns:
    print(c)

job ID
priority
different machines restriction
disk space request - Q1
disk space request - Q2
disk space request - Q3
disk space request - Q4
disk space request - Quartiles
memory request - Q1
memory request - Q2
memory request - Q3
memory request - Q4
memory request - Quartiles
CPU request - Q1
CPU request - Q2
CPU request - Q3
CPU request - Q4
CPU request - Quartiles
priority labels
user
logical job name
scheduling class
K-Means = 2
K-Means = 4
K-Means = 6
K-Means = 8
K-Means = 10


In [5]:
tfidf_df = static_metrics[['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles', 'priority labels', 'user', 'logical job name', 'scheduling class']]

In [6]:
tfidf_df

Unnamed: 0,different machines restriction,disk space request - Quartiles,memory request - Quartiles,CPU request - Quartiles,priority labels,user,logical job name,scheduling class
0,0,Q4,Q4,Q4,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,fGRnr2XEPDr3kQsPccU/k1LELeeQonkj6hDpTP7ALkg=,3
1,0,Q4,Q4,Q4,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,q6nwarTUw/Xct0ONQEdblvVhW8uWTquTp8C5la5YfRE=,3
2,0,Q4,Q4,Q4,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,4my5Elvc5RumesxoVeuFovkoS28KYA9C3pIBi2bY5Io=,3
3,0,Q1,Q2,Q3,"Free [0,1]",WVtO5qw3sNnP4MeiRUnqr07CekrYMU12Mc7GbsgnjhQ=,JGBCBdeRFciFaU6LrbO9Y5w1lBoZ1MyX5Pnx4m05HK8=,2
4,0,Q4,Q4,Q3,Production [9],70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,HcYZ4RNZRxmh/W+WuzNBOVk4sOCdDshVEB/McWxfyyk=,3
...,...,...,...,...,...,...,...,...
64928,0,Q2,Q3,Q3,"Free [0,1]",E+9U+J1Dicd5PJklbq2e5THQ29X6D8jmqQ0Zu53Kc+o=,QZm5VyyEiL9TpVhasvtcMxPxENJqlNX4Fn2LdEe+Ff0=,1
64929,0,Q4,Q4,Q2,"Other [2,8]",HxdST/yDI1TlIkr0Povf9KaXGGG8x4iVXX6o/iSIghM=,YI5372ouHbbTv6b594D2bDWC283uv6QcuE41Mst3BFk=,0
64930,0,Q2,Q1,Q3,"Other [2,8]",fJeARInTaIjFrdXGqxap6x2T3TpJB84y9zvFsoGGhjk=,ZoHIMIMjYenKtCPq0iy70XUxyF0JDf1HuW6yUzS4dBM=,0
64931,0,Q2,Q2,Q2,"Other [2,8]",fJeARInTaIjFrdXGqxap6x2T3TpJB84y9zvFsoGGhjk=,cLRsAgA+ajg6giausco6dsV5PxWsqEJZDLdHWhd6v0E=,2


In [7]:
for lab in tfidf_df.columns:
    joined_vals = []
    for v in tfidf_df[lab].values:
        new_v = f"{lab} {v}"
        new_v = '_'.join(new_v.split(' '))
        joined_vals.append(new_v)
    tfidf_df[lab] = joined_vals

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tfidf_df[lab] = joined_vals


In [8]:
tfidf_df

Unnamed: 0,different machines restriction,disk space request - Quartiles,memory request - Quartiles,CPU request - Quartiles,priority labels,user,logical job name,scheduling class
0,different_machines_restriction_0,disk_space_request_-_Quartiles_Q4,memory_request_-_Quartiles_Q4,CPU_request_-_Quartiles_Q4,priority_labels_Production_[9],user_70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,logical_job_name_fGRnr2XEPDr3kQsPccU/k1LELeeQo...,scheduling_class_3
1,different_machines_restriction_0,disk_space_request_-_Quartiles_Q4,memory_request_-_Quartiles_Q4,CPU_request_-_Quartiles_Q4,priority_labels_Production_[9],user_70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,logical_job_name_q6nwarTUw/Xct0ONQEdblvVhW8uWT...,scheduling_class_3
2,different_machines_restriction_0,disk_space_request_-_Quartiles_Q4,memory_request_-_Quartiles_Q4,CPU_request_-_Quartiles_Q4,priority_labels_Production_[9],user_70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,logical_job_name_4my5Elvc5RumesxoVeuFovkoS28KY...,scheduling_class_3
3,different_machines_restriction_0,disk_space_request_-_Quartiles_Q1,memory_request_-_Quartiles_Q2,CPU_request_-_Quartiles_Q3,"priority_labels_Free_[0,1]",user_WVtO5qw3sNnP4MeiRUnqr07CekrYMU12Mc7GbsgnjhQ=,logical_job_name_JGBCBdeRFciFaU6LrbO9Y5w1lBoZ1...,scheduling_class_2
4,different_machines_restriction_0,disk_space_request_-_Quartiles_Q4,memory_request_-_Quartiles_Q4,CPU_request_-_Quartiles_Q3,priority_labels_Production_[9],user_70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,logical_job_name_HcYZ4RNZRxmh/W+WuzNBOVk4sOCdD...,scheduling_class_3
...,...,...,...,...,...,...,...,...
64928,different_machines_restriction_0,disk_space_request_-_Quartiles_Q2,memory_request_-_Quartiles_Q3,CPU_request_-_Quartiles_Q3,"priority_labels_Free_[0,1]",user_E+9U+J1Dicd5PJklbq2e5THQ29X6D8jmqQ0Zu53Kc+o=,logical_job_name_QZm5VyyEiL9TpVhasvtcMxPxENJql...,scheduling_class_1
64929,different_machines_restriction_0,disk_space_request_-_Quartiles_Q4,memory_request_-_Quartiles_Q4,CPU_request_-_Quartiles_Q2,"priority_labels_Other_[2,8]",user_HxdST/yDI1TlIkr0Povf9KaXGGG8x4iVXX6o/iSIghM=,logical_job_name_YI5372ouHbbTv6b594D2bDWC283uv...,scheduling_class_0
64930,different_machines_restriction_0,disk_space_request_-_Quartiles_Q2,memory_request_-_Quartiles_Q1,CPU_request_-_Quartiles_Q3,"priority_labels_Other_[2,8]",user_fJeARInTaIjFrdXGqxap6x2T3TpJB84y9zvFsoGGhjk=,logical_job_name_ZoHIMIMjYenKtCPq0iy70XUxyF0JD...,scheduling_class_0
64931,different_machines_restriction_0,disk_space_request_-_Quartiles_Q2,memory_request_-_Quartiles_Q2,CPU_request_-_Quartiles_Q2,"priority_labels_Other_[2,8]",user_fJeARInTaIjFrdXGqxap6x2T3TpJB84y9zvFsoGGhjk=,logical_job_name_cLRsAgA+ajg6giausco6dsV5PxWsq...,scheduling_class_2


In [9]:
k = 4
documents = []
for i in range(k):
    document_df = tfidf_df[static_metrics[f"K-Means = {k}"] == i]
    document = []
    for val in document_df.values:
        document = document + list(val)
    documents.append(document)

In [10]:
uniqueWords = set(documents[0])
for i in range(1, k):
    uniqueWords = uniqueWords.union(set(documents[i]))

In [11]:
numOfWords = []
for i in range(k):
    numOfWordsDoc = dict.fromkeys(uniqueWords, 0)
    for word in documents[i]:
        numOfWordsDoc[word] += 1
    numOfWords.append(numOfWordsDoc)

### Term Frequency (TF)
The number of times a word appears in a document, divided by the total number of words in the document. Every document has its own term frequency.

$\text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t^{'} \in d}f_{t^{'},d}}$

where $f_{t,d}$ is the raw count of a term in a document, i.e., the number of times that term $t$ occurs in document $d$.

In [12]:
def computeTF(wordDict, bagOfWords):
    tfDict =  {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict

In [13]:
tfs = []
for i in range(k):
    tfs.append(computeTF(numOfWords[i], documents[i]))
    

### Inverse Document Frequency (IDF)
The inverse document frequency is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

$\text{idf}(t,D) = log\frac{N}{|\{ f_{t^{'},d}: \text{ } t^{'} \in d  \}|}  $

with:
- $N$: total number of documents in the corpus $ N = |D|$
- $|\{ f_{t^{'},d}: \text{ } t^{'} \in d  \}|$: number of documents where the term $t$ appears (i.e., $\text{tf}(t,d) \neq 0$). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1 + |\{ f_{t^{'},d}: \text{ } t^{'} \in d  \}|$


In [14]:
def computeIDF(documents):
    N = len(documents)
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
            else:
                idfDict[word] = 1
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

In [15]:
idfs = computeIDF(numOfWords)

### TF-IDF

Lastly, the TF-IDF is simply the TF multiplied by IDF.

$\text{tfidf}(t, d, D) = \text{tf} \cdot \text{idf}(t, D)$

In [16]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [17]:
tfidfs = []
for i in range(k):
    tfidfs.append(computeTFIDF(tfs[i], idfs))

In [18]:
tfidf_df = pd.DataFrame(tfidfs)

In [19]:
tfidf_df

Unnamed: 0,logical_job_name_okBmyjLBOcWT2DoKYecrherG9WFOoOoYhJxPutBoIRo=,logical_job_name_cX4isFgXx5cs/PeUhwgN+itgwSwnZlvJkhqM+ChJ4Rw=,logical_job_name_MhI8rAkL3VZMiRXvLtFyN9Ti0+CRvSW8IaEjs1hE0tk=,logical_job_name_bfM1ml8zZPrkbWQc93RJqOkmBv9iSCGPW2CDpn1F+Yg=,logical_job_name_4pVEl/jd6YEK5LfOAO/wQ5MOPQIFSQfRLk9CHX06zVw=,logical_job_name_Fl90fBVvcu40Zccqczs/KgJQmg4XKC61wNplpNLd5lc=,logical_job_name_aCn6yqik5lfjEOxTPg9W2dEUuU/czXr8kRu6PLtJCHI=,logical_job_name_aG95SNbdO1JyJ1lpZCW9uwmOGOEj0quXY7BV9QNxcYo=,logical_job_name_9jUi5SI8LjopngXrzhPFcZ1nKWZr7ciN7CNQI4AbDDI=,logical_job_name_TzGRwzlGZnw63lr8Gz5qURuff10m6ARR2uFNZHfg9Rw=,...,logical_job_name_VaZVDI2uTC4DTLo4WkoBqheScNJ58rgWRSQgOCUpDYw=,logical_job_name_HQhK2nuf0siPMxcATv+IeNMLloUwrLk5mrOkFWT1bd8=,user_Qa3VK/X2IoSeou0RPVShM6BIDRan8y6otNNWefTP8W4=,logical_job_name_pii/8UAsurgZi5WLZlhnyHfBv5EOpzXk5S9sSGVndGs=,logical_job_name_RIktO+Co4LdhX9a994G2+EKo6Ke+sL19A1YLyWc+15U=,logical_job_name_ggiMwWSbB6TxLrtEgHCIveNLL4dV90sNadHzMnDeKXM=,logical_job_name_8pO7ruFieHGPcvPj1XA7h7ChGGUpwg/ysFwzzITrRT8=,logical_job_name_7wOxGIT+kvnMoQ6X+9YsLY2a5ii8BSeHWjvMzFPzYuY=,logical_job_name_QvYElg2QruC5NEih6aEzRC5sn6LMNddXw0hm7yM+a4U=,logical_job_name_vK2axMSmXLcApaQTTpcKC/vVswFlA1sY78rqgcoWCc0=
0,1e-05,8e-06,7e-06,1.3e-05,0.0,3e-06,2e-06,7e-06,1.3e-05,3e-06,...,1e-05,0.00028,0.0,7e-06,0.0,7e-06,1.2e-05,3e-06,3e-06,7e-06
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000501,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.5e-05,0.0,0.0,8e-06,0.0,8e-06,0.0,0.0,0.0,...,0.0,0.000285,0.0,0.0,8e-06,0.0,8e-06,0.0,0.0,8e-06


In [20]:
tfidf_df.sort_values(by=3, ascending=False, axis=1)

Unnamed: 0,user_njTE8BZMxQTFTmz+xeDNc6MGCjP2WhS6B4xK9+rTh8E=,logical_job_name_j25eTfDZ4FFHzd7p+VKe13tP4+iQbKkHk9VI0SBK/nE=,user_P/b25hVu6/7A0BJOLJFXi0VLXUOprxPTVtOKXlp/w/Y=,logical_job_name_9q2rn++rPMOVeDylVv/NchnIKQxmfBwzF5ZoKhSvgP0=,logical_job_name_PqaHcqiH62FESqU41XFHq+UbXe0VvgZXuw+kdGrOucs=,logical_job_name_oHQYViMUeNEojiW9p3u0Vdt8N4KBASt2eODnyJhwSSo=,logical_job_name_G/9E4AW9fSviXbmdFO5BBcjVd49zuI1AIU5gHQJLm+8=,logical_job_name_ZQ+bFefVT1UByX7mRuMBv2rx61PckQWrFvG4Ymz2lF8=,logical_job_name_cNpu9y/02mA/fshHIHLxtmdpCmusNhJoThVIDS3WLHg=,logical_job_name_AmKr63lD9MIGXiAacmzoj6kQMGqk2U0M89A2RkgB6uc=,...,logical_job_name_SJxXg3days6/YJeQ4RnTVXhtcyuMqd0LsZG8ktkZoqY=,logical_job_name_HEPTM21y4+G36CQGmnfFL0iadS4Bf0zTfCj9geVaMX8=,user_QNAoYqaQ6zr6XI7IBXnHOYYG3aBK0enRLfpJFF8dUxQ=,logical_job_name_6snJ8nmjqhCbEHChlhu5qCB14QkSlqmChGU8EbM7Pvw=,logical_job_name_vfqjdzh/3MFwUmB8Y5YRAprGMfx9JocF+GF6LaoT9oQ=,logical_job_name_8QVFPdIBLbHgY4AvuEiCD7OM8XDrPJPvKGwfQIWbbUw=,logical_job_name_QGsw2XBuW91V3iPYeXAzia+K651aDze2Vc1F0DhPENQ=,logical_job_name_7URdQ6jM7uzbkzhSdXRu49b7VTYIQtQTBydFxEwFfrw=,logical_job_name_j6pv8mtSfzspbtOiNjsIPYwEnqX8fhS6oL7QXhC83ME=,user_80t8cCbClG5TBydQy26SoOacM7sJ10KDnGkT6uuLYlA=
0,0.001862,0.001077,0.000308,0.000233,0.000199,0.000235,0.000907,0.000434,0.000159,0.000461,...,3e-06,3e-06,0.0,3e-06,3e-06,3e-06,7e-06,3e-06,3e-06,3e-06
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000501,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000836,0.001003,0.000234,0.000201,0.0001,0.000234,0.000401,0.000334,0.000134,0.000468,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.002029,0.00084,0.000526,0.000485,0.000441,0.000441,0.000441,0.000384,0.000374,0.000339,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
sorted_tfidfs = [] 
for i in range(k):
    sorted_tfidfs.append(dict(sorted(tfidfs[i].items(), key=lambda item: item[1], reverse=True)))

In [22]:
x = list(sorted_tfidfs[0].keys())[:15]

In [23]:

ranking_features = []
for i in range(k):
    positions = defaultdict(list)
    x = list(sorted_tfidfs[i].keys())[:15]
    for el in x:
        for i in range(k):
            positions[f"position cl{i}"].append(list(sorted_tfidfs[i].keys()).index(el) + 1)
    positions["values"] = x
    ranking_features.append(positions)


In [24]:
pd.DataFrame(ranking_features[0]).set_index("values")

Unnamed: 0_level_0,position cl0,position cl1,position cl2,position cl3
values,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
user_njTE8BZMxQTFTmz+xeDNc6MGCjP2WhS6B4xK9+rTh8E=,1,10055,4,1
logical_job_name_j25eTfDZ4FFHzd7p+VKe13tP4+iQbKkHk9VI0SBK/nE=,2,8266,2,2
logical_job_name_G/9E4AW9fSviXbmdFO5BBcjVd49zuI1AIU5gHQJLm+8=,3,1939,11,5
user_xQAGEBrubfzj6dt8N0gpTwfMk8daCreg5lnboUEhkf4=,4,3385,37,18
logical_job_name_AmKr63lD9MIGXiAacmzoj6kQMGqk2U0M89A2RkgB6uc=,5,11355,10,10
logical_job_name_ZQ+bFefVT1UByX7mRuMBv2rx61PckQWrFvG4Ymz2lF8=,6,4857,12,8
logical_job_name_zhgh8SYo7Dk/+QbhaeiOgF2ZnRWVnZJF0UVhVLkIubE=,7,7526,39,33
logical_job_name_C9Tdi+5fGw0kYGshyBA+FaX+NpQalBnfSkMUErZeOKQ=,8,2821,316,25
logical_job_name_odFg4veougZiUp92g2JDYdCrjAXU5irhWd9BGFpHOFg=,9,7050,345,40
logical_job_name_a9Rw9CFVtXZI7HSqJlgHjukiNSlw+gGyUPQkUzqhKoc=,10,5322,13,22


In [25]:
pd.DataFrame(ranking_features[1]).set_index("values")

Unnamed: 0_level_0,position cl0,position cl1,position cl2,position cl3
values,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
logical_job_name_IVwT/XMeSzdXI5pftYrE74vZoJNz33avbdNDEVtH7B4=,4460,1,1156,4236
user_3PF0/VxH7vMzg6nj3l4Xmhbo2+5HSDbzCKKpZ9OQ1bk=,11316,2,8882,9724
user_70s3v5qRyCO/1PCdI6fVXnrW8FU/w+5CKRSa72xgcIo=,10228,3,1573,4520
user_CwGnjPqF6z6mEbr5zJVXafRsbETt6gvdwvOX2l+GpsI=,49,4,2632,64
user_qzYKr1BqweZKofd4U2tWF0VxZEbJhjG/LunlAwkI1kM=,2016,5,7868,9006
user_rNyxTd1B3RnDJBIofzoVksjnmaJEN+hW5W+yRHo2xsM=,2188,6,10732,679
user_VKmwSJS9DAS+J/beYyBoF1sTcJi7Z8Qm0VYlh18em8w=,1138,7,2071,4885
user_q6EjQj6yxdwUvvufuDKvfotZ1LoylbpNu4NguS48Lfo=,896,8,3824,562
user_62ZtsQfj7aFYgb0o35g5K/cRMfvHqUOGOoFeGqTsqOs=,927,9,6480,8021
user_QbRiaKYQ23mJU4y8Sp1lFaO3r3wA3+z429go6UqIP/w=,10978,10,6662,8150


In [26]:
pd.DataFrame(ranking_features[2]).set_index("values")

Unnamed: 0_level_0,position cl0,position cl1,position cl2,position cl3
values,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
logical_job_name_Y+cfHMX6f6uUQjHk6eRo9YEJfat+HtYJMC6R3J2WPAA=,32,2242,1,47
logical_job_name_j25eTfDZ4FFHzd7p+VKe13tP4+iQbKkHk9VI0SBK/nE=,2,8266,2,2
logical_job_name_BzFcNI09HrdUe6J+CXY3m8gvubxz/uzQrT39FXzi7jo=,2159,10227,3,10688
user_njTE8BZMxQTFTmz+xeDNc6MGCjP2WhS6B4xK9+rTh8E=,1,10055,4,1
logical_job_name_nmu4veR9MGh7EkZhxxlmxGeQV3/waSOe1Tuo9lu5LfY=,299,5283,5,168
user_N5I2+pqV0s9j+1XwV8fwIxAb8p47z4I3dJ4o2I/RLxQ=,21,2185,6,31
logical_job_name_MW8tcTvQrf5pq7+4iGzi8Y64+fAPCFnOjEhFtpAIolU=,149,2035,7,70
logical_job_name_esro7fIKVdZ4YL4WoB810gpe503ZqsdvWSdMRBkujYk=,735,3976,8,6375
logical_job_name_mMaUdJMueeyxeb4MU10gMnJoRSxbu5qoXbs6hJoCsZ4=,5893,4064,9,6436
logical_job_name_AmKr63lD9MIGXiAacmzoj6kQMGqk2U0M89A2RkgB6uc=,5,11355,10,10


In [27]:
pd.DataFrame(ranking_features[3]).set_index("values")

Unnamed: 0_level_0,position cl0,position cl1,position cl2,position cl3
values,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
user_njTE8BZMxQTFTmz+xeDNc6MGCjP2WhS6B4xK9+rTh8E=,1,10055,4,1
logical_job_name_j25eTfDZ4FFHzd7p+VKe13tP4+iQbKkHk9VI0SBK/nE=,2,8266,2,2
user_P/b25hVu6/7A0BJOLJFXi0VLXUOprxPTVtOKXlp/w/Y=,11,3108,33,3
logical_job_name_9q2rn++rPMOVeDylVv/NchnIKQxmfBwzF5ZoKhSvgP0=,17,8101,40,4
logical_job_name_G/9E4AW9fSviXbmdFO5BBcjVd49zuI1AIU5gHQJLm+8=,3,1939,11,5
logical_job_name_PqaHcqiH62FESqU41XFHq+UbXe0VvgZXuw+kdGrOucs=,20,4792,338,6
logical_job_name_oHQYViMUeNEojiW9p3u0Vdt8N4KBASt2eODnyJhwSSo=,16,5090,34,7
logical_job_name_ZQ+bFefVT1UByX7mRuMBv2rx61PckQWrFvG4Ymz2lF8=,6,4857,12,8
logical_job_name_cNpu9y/02mA/fshHIHLxtmdpCmusNhJoThVIDS3WLHg=,29,4180,317,9
logical_job_name_AmKr63lD9MIGXiAacmzoj6kQMGqk2U0M89A2RkgB6uc=,5,11355,10,10


## "Supervised" explanation

In [28]:
oneHotDf = static_metrics[['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles', 'priority labels', 'user', 'logical job name', 'scheduling class']]

In [29]:
oneHotDf = pd.get_dummies(oneHotDf)

In [30]:
oneHotDf

Unnamed: 0,different machines restriction,scheduling class,disk space request - Quartiles_Q1,disk space request - Quartiles_Q2,disk space request - Quartiles_Q4,memory request - Quartiles_Q1,memory request - Quartiles_Q2,memory request - Quartiles_Q3,memory request - Quartiles_Q4,CPU request - Quartiles_Q1,...,logical job name_zx5jRgl3hgIiT8WOmfR+8bFfzezatJc9va5NQ/JLdlQ=,logical job name_zx9Ss5I6Uzzm3ZkTuJAon4jwIC/xrym2WcwUgFJFKqM=,logical job name_zxTiflFnl0GXvcXPG9Rzx1frVQK1yiEzPgFJjB1g/OU=,logical job name_zyyXeGZlaEeQP5RYw4ZB4E00xFtZLmjaAazSs4h2iRE=,logical job name_zz/mc7VG8WeWxgTXWyD9Jsc9EvlIsbYwbITVb7fKG34=,logical job name_zz4cFAgW4BaV9hGxq+9duJHJ7D4+vcJrzLB9j9+Yzm0=,logical job name_zzAZV4ZWjP8euy5y1ooNkdVj1M+PY/d4XcDtPfDfpQI=,logical job name_zzJSSVHLLM0MT4S0br104OHNMTAp1hsw4yasstN40BQ=,logical job name_zzfU/0NoMaKdmmJ+9aEMEW4L9b6sr/N03hKHZV5xsH0=,logical job name_zzyXXqYH1aYRpGl3JxyTk8XIApgQMz6NXs8Tpb56rbo=
0,0,3,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,3,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,3,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,2,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,3,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64928,0,1,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
64929,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
64930,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
64931,0,2,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Set the scoring metri

In [33]:
scoring = {'acc': 'accuracy',
           'f1_macro': 'f1_macro',
           'f1_weight': 'f1_weighted'}

### Random Forest

#### K-Means 4 - cl.0

In [31]:
clf0 = RandomForestClassifier(random_state=1, n_jobs=-1)
clf0.fit(oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values)

RandomForestClassifier(n_jobs=-1, random_state=1)

In [None]:
results = cross_validate(clf0, oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values, scoring=scoring, cv=kfold)

In [32]:
kfold=10
resultsClf0 = cross_val_score(clf0, oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values, cv=kfold)

Process LokyProcess-67:
Process LokyProcess-9:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/neuralnets/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/neuralnets/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/neuralnets/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 446, in _process_worker
    del call_item
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/anaconda3/envs/neuralnets/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/neuralnets/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/neuralnets/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 446, in _process_worker
    del call_item
KeyboardInterrupt

KeyboardInterrupt: 

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (resultsClf0.mean()*100, resultsClf0.std()*100))

In [None]:
recallClf0 = cross_val_score(clf0, oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values, cv=kfold, scoring='recall')

In [None]:
print("Recall: %.2f%% (%.2f%%)" % (recallClf0.mean() * 100, recallClf0.std()*100))

In [None]:
precisionClf0 = cross_val_score(clf0, oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values, cv=kfold, scoring='precision')

In [None]:
print("Precision: %.2f%% (%.2f%%)" % (precisionClf0.mean() * 100, precisionClf0.std()*100))

In [None]:
fscoreClf0 = cross_val_score(clf0, oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values, cv=kfold, scoring='f1_macro')

In [None]:
print("F-score macro: %.2f%% (%.2f%%)" % (fscoreClf0.mean() * 100, fscoreClf0.std()*100))

In [None]:
fscoreWeightClf0 = cross_val_score(clf0, oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values, cv=kfold, scoring='f1_weighted')

In [None]:
print("F-score weighted: %.2f%% (%.2f%%)" % (fscoreWeightClf0.mean() * 100, fscoreWeightClf0.std()*100))

In [None]:
# Index sort the most important features
sorted_feature_weight_idxes = np.argsort(clf0.feature_importances_)[::-1] # Reverse sort

In [None]:
# Get the most important features names and weights
most_important_features = np.take_along_axis(
    np.array(oneHotDf.columns.tolist()), 
    sorted_feature_weight_idxes, axis=0)
most_important_weights = np.take_along_axis(
    np.array(clf0.feature_importances_), 
    sorted_feature_weight_idxes, axis=0)

In [None]:
# Show
res = pd.DataFrame((zip(most_important_features, most_important_weights)), columns = ["Feature", "Relevance"])

res.head(10)

#### K-Means 4 - cl.1

In [None]:
clf1 = RandomForestClassifier(random_state=1, n_jobs=-1)
clf1.fit(oneHotDf, (static_metrics['K-Means = 4'] == 1).astype(int).values)

In [None]:
resultsClf1 = cross_val_score(clf1, oneHotDf, (static_metrics['K-Means = 4'] == 1).astype(int).values, cv=kfold)

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (resultsClf1.mean()*100, resultsClf1.std()*100))

In [None]:
fmacroClf1 = cross_val_score(clf1, oneHotDf, (static_metrics['K-Means = 4'] == 1).astype(int).values, cv=kfold, scoring='f1_macro')

In [None]:
print("F-score macro: %.2f%% (%.2f%%)" % (fmacroClf1.mean()*100, fmacroClf1.std()*100))

In [None]:
fweightClf1 = cross_val_score(clf1, oneHotDf, (static_metrics['K-Means = 4'] == 1).astype(int).values, cv=kfold, scoring='f1_weighted')

In [None]:
print("F-score weighted: %.2f%% (%.2f%%)" % (fweightClf1.mean()*100, fweightClf1.std()*100))

In [None]:
# Index sort the most important features
sorted_feature_weight_idxes = np.argsort(clf1.feature_importances_)[::-1] # Reverse sort

In [None]:
# Get the most important features names and weights
most_important_features = np.take_along_axis(
    np.array(oneHotDf.columns.tolist()), 
    sorted_feature_weight_idxes, axis=0)
most_important_weights = np.take_along_axis(
    np.array(clf1.feature_importances_), 
    sorted_feature_weight_idxes, axis=0)

In [None]:
# Show
res = pd.DataFrame((zip(most_important_features, most_important_weights)), columns = ["Feature", "Relevance"])

res.head(10)

#### K-Means 4 - cl.2

In [None]:
clf2 = RandomForestClassifier(random_state=1, n_jobs=-1)
clf2.fit(oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values)

In [None]:
resultsClf2 = cross_val_score(clf2, oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values, cv=kfold)

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (resultsClf2.mean()*100, resultsClf2.std()*100))

In [None]:
fmacroClf2 = cross_val_score(clf1, oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values, cv=kfold, scoring='f1_macro')

In [None]:
print("F-score macro: %.2f%% (%.2f%%)" % (fmacroClf2.mean()*100, fmacroClf2.std()*100))

In [None]:
fweightClf2 = cross_val_score(clf2, oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values, cv=kfold, scoring='f1_weighted')

In [None]:
print("F-score weighted: %.2f%% (%.2f%%)" % (fweightClf2.mean()*100, fweightClf2.std()*100))

In [None]:
# Index sort the most important features
sorted_feature_weight_idxes = np.argsort(clf2.feature_importances_)[::-1] # Reverse sort

In [None]:
# Get the most important features names and weights
most_important_features = np.take_along_axis(
    np.array(oneHotDf.columns.tolist()), 
    sorted_feature_weight_idxes, axis=0)
most_important_weights = np.take_along_axis(
    np.array(clf2.feature_importances_), 
    sorted_feature_weight_idxes, axis=0)

In [None]:
# Show
res = pd.DataFrame((zip(most_important_features, most_important_weights)), columns = ["Feature", "Relevance"])

res.head(10)

#### K-Means 4 - cl.3

In [None]:
clf3 = RandomForestClassifier(random_state=1, n_jobs=-1)
clf3.fit(oneHotDf, (static_metrics['K-Means = 4'] == 3).astype(int).values)

In [None]:
resultsClf3 = cross_val_score(clf3, oneHotDf, (static_metrics['K-Means = 4'] == 3).astype(int).values, cv=kfold)

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (resultsClf3.mean()*100, resultsClf3.std()*100))

In [None]:
fmacroClf3 = cross_val_score(clf1, oneHotDf, (static_metrics['K-Means = 4'] == 3).astype(int).values, cv=kfold, scoring='f1_macro')

In [None]:
print("F-score macro: %.2f%% (%.2f%%)" % (fmacroClf3.mean()*100, fmacroClf3.std()*100))

In [None]:
fweightClf3 = cross_val_score(clf3, oneHotDf, (static_metrics['K-Means = 4'] == 3).astype(int).values, cv=kfold, scoring='f1_weighted')

In [None]:
print("F-score weighted: %.2f%% (%.2f%%)" % (fweightClf3.mean()*100, fweightClf3.std()*100))

In [None]:
# Index sort the most important features
sorted_feature_weight_idxes = np.argsort(clf3.feature_importances_)[::-1] # Reverse sort

In [None]:
# Get the most important features names and weights
most_important_features = np.take_along_axis(
    np.array(oneHotDf.columns.tolist()), 
    sorted_feature_weight_idxes, axis=0)
most_important_weights = np.take_along_axis(
    np.array(clf3.feature_importances_), 
    sorted_feature_weight_idxes, axis=0)

In [None]:
# Show
res = pd.DataFrame((zip(most_important_features, most_important_weights)), columns = ["Feature", "Relevance"])

res.head(10)

### XGBoost

In [None]:
model = xgb.XGBClassifier(tree_method='gpu_hist')

In [None]:
oneHotDf.columns = [col.replace('[', '').replace(']','').replace(',',' ').replace(' ', '_') for col in oneHotDf.columns]

#### K-Means 4 - cl.0

In [None]:
model.fit(oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values)

In [None]:
model.get_booster().get_score(importance_type='gain')

In [None]:
xgb.to_graphviz(model)

In [None]:
scoring = {'acc': 'accuracy',
           'f1_macro': 'f1_macro',
           'f1_weight': 'f1_weighted'}

In [None]:
kfold = KFold(n_splits=10)
model0 = xgb.XGBClassifier(tree_method='gpu_hist', max_depth=6, min_child_weight=1, subsample=0.8879, eta=0.099)
results0 = cross_validate(model0, oneHotDf, (static_metrics['K-Means = 4'] == 0).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results0['test_acc'].mean()*100, results0['test_acc'].std()*100) 

In [None]:
print(results0['test_f1_macro'].mean()*100, results0['test_f1_macro'].std()*100) 

In [None]:
print(results0['test_f1_weight'].mean()*100, results0['test_f1_weight'].std()*100) 

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

#### K-Means 4 -cl.1

In [None]:
kfold = KFold(n_splits=10)
model1 = xgb.XGBClassifier(tree_method='gpu_hist')
results1 = cross_validate(model, oneHotDf, (static_metrics['K-Means = 4'] == 1).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results1['test_acc'].mean()*100, results1['test_acc'].std()*100) 

In [None]:
print(results1['test_f1_macro'].mean()*100, results1['test_f1_macro'].std()*100) 

In [None]:
print(results1['test_f1_weight'].mean()*100, results1['test_f1_weight'].std()*100) 

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

In [None]:
model1.fit(oneHotDf, (static_metrics['K-Means = 4'] == 1).astype(int).values)

In [None]:
xgb.to_graphviz(model1)

In [None]:
image = xgb.to_graphviz(model2)

#Set a different dpi (work only if format == 'png')
image.graph_attr = {'dpi':'400'}

image.render('', format = format)

#### K-Means 4 - cl.2

In [None]:
kfold = KFold(n_splits=10)
model2 = xgb.XGBClassifier(tree_method='gpu_hist')
results2 = cross_validate(model2, oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results2['test_acc'].mean()*100, results2['test_acc'].std()*100) 

In [None]:
print(results2['test_f1_macro'].mean()*100, results2['test_f1_macro'].std()*100) 

In [None]:
print(results2['test_f1_weight'].mean()*100, results2['test_f1_weight'].std()*100) 

In [None]:
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

In [None]:
model2.fit(oneHotDf, (static_metrics['K-Means = 4'] == 2).astype(int).values)

In [None]:
xgb.to_graphviz(model2)

In [None]:
xgb.to_graphviz(model)

Interpretation of the graph: https://stackoverflow.com/questions/40926340/what-does-the-value-of-leaf-in-the-following-xgboost-model-tree-diagram-means

#### K-Means 4 - cl.3

In [None]:
kfold = KFold(n_splits=10)
model3 = xgb.XGBClassifier(tree_method='gpu_hist')
results3 = cross_validate(model3, oneHotDf, (static_metrics['K-Means = 4'] == 3).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results3['test_acc'].mean()*100, results3['test_acc'].std()*100) 

In [None]:
print(results3['test_f1_macro'].mean()*100, results3['test_f1_macro'].std()*100) 

In [None]:
print(results3['test_f1_weight'].mean()*100, results3['test_f1_weight'].std()*100) 

In [None]:
model3.fit(oneHotDf, (static_metrics['K-Means = 4'] == 3).astype(int).values)

In [None]:
xgb.to_graphviz(model3)

## "Supervised" explanation - reduced metadata

In [None]:
cat_attribs=['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles']
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')

In [None]:
oneHotDfReduced = static_metrics[['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles', 'priority','scheduling class']]

In [None]:
encoder = full_pipeline.fit(oneHotDfReduced)
list_of_feature_names = ['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles', 'priority','scheduling class']
x = encoder.get_feature_names_out(list_of_feature_names)
oneHotDfReduced = encoder.transform(oneHotDfReduced)

In [None]:
oneHotDfReduced = pd.DataFrame(oneHotDfReduced, columns=x)

In [None]:
oneHotDfReduced

### XGBoost

#### K-Means 4 - cl. 0

In [None]:
model0Red = xgb.XGBClassifier(tree_method='gpu_hist')

In [None]:
model0Red.fit(oneHotDfReduced, (static_metrics['K-Means = 4'] == 0).astype(int).values)

In [None]:
model0Red.get_booster().get_score(importance_type='gain')

In [None]:
xgb.to_graphviz(model0Red)

In [None]:
sklearn.__version__

In [None]:
scoring = {'acc': 'accuracy',
           'f1_macro': 'f1_macro',
           'f1_weight': 'f1_weighted'}

In [None]:
kfold = KFold(n_splits=10)
#model0Red = xgb.XGBClassifier(tree_method='gpu_hist', max_depth=6, min_child_weight=1, subsample=0.8879, eta=0.099)
results0Red = cross_validate(model0Red, oneHotDfReduced, (static_metrics['K-Means = 4'] == 0).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results0Red['test_acc'].mean()*100, results0Red['test_acc'].std()*100) 

In [None]:
print(results0Red['test_f1_macro'].mean()*100, results0Red['test_f1_macro'].std()*100) 

In [None]:
print(results0Red['test_f1_weight'].mean()*100, results0Red['test_f1_weight'].std()*100) 

#### K-Means 4 -cl.1

In [None]:
kfold = KFold(n_splits=10)
model1Red = xgb.XGBClassifier(tree_method='gpu_hist')
results1Red = cross_validate(model1Red, oneHotDfReduced, (static_metrics['K-Means = 4'] == 1).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results1Red['test_acc'].mean()*100, results1Red['test_acc'].std()*100) 

In [None]:
print(results1Red['test_f1_macro'].mean()*100, results1Red['test_f1_macro'].std()*100) 

In [None]:
print(results1Red['test_f1_weight'].mean()*100, results1Red['test_f1_weight'].std()*100) 

In [None]:
model1Red.fit(oneHotDfReduced, (static_metrics['K-Means = 4'] == 1).astype(int).values)

In [None]:
xgb.to_graphviz(model1Red)

#### K-Means 4 - cl.2

In [None]:
kfold = KFold(n_splits=10)
model2Red = xgb.XGBClassifier(tree_method='gpu_hist')
results2Red = cross_validate(model2Red, oneHotDfReduced, (static_metrics['K-Means = 4'] == 2).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results2Red['test_acc'].mean()*100, results2Red['test_acc'].std()*100) 

In [None]:
print(results2Red['test_f1_macro'].mean()*100, results2Red['test_f1_macro'].std()*100) 

In [None]:
print(results2Red['test_f1_weight'].mean()*100, results2Red['test_f1_weight'].std()*100) 

In [None]:
model2Red.fit(oneHotDfReduced, (static_metrics['K-Means = 4'] == 2).astype(int).values)

In [None]:
xgb.to_graphviz(model2Red)

Interpretation of the graph: https://stackoverflow.com/questions/40926340/what-does-the-value-of-leaf-in-the-following-xgboost-model-tree-diagram-means

#### K-Means 4 - cl.3

In [None]:
kfold = KFold(n_splits=10,)
model3Red = xgb.XGBClassifier(tree_method='gpu_hist')
results3Red = cross_validate(model3Red, oneHotDfReduced, (static_metrics['K-Means = 4'] == 3).astype(int).values, scoring=scoring, cv=kfold)

In [None]:
print(results3Red['test_acc'].mean()*100, results3Red['test_acc'].std()*100) 

In [None]:
print(results3Red['test_f1_macro'].mean()*100, results3Red['test_f1_macro'].std()*100) 

In [None]:
print(results3Red['test_f1_weight'].mean()*100, results3Red['test_f1_weight'].std()*100) 

In [None]:
model3Red.fit(oneHotDfReduced, (static_metrics['K-Means = 4'] == 3).astype(int).values)

In [None]:
xgb.to_graphviz(model3Red)

## Test set

In [None]:
readings_task_usage_df = pd.read_csv("data/sample_jobs_summary.csv", header=[0,1], index_col=[0])

In [None]:
data_path = "/data/cloud_data/Google-clusterdata-2011-2/clusterdata-2011-2/"

In [None]:
#job_events_files = [
#    os.path.join(data_path, 'job_events/part-00' + str(v).zfill(3) + '-of-00500.csv.gz')
#    for v in range(500)]
#cols_job_events = df_schema[df_schema['file pattern'] == 'job_events/part-?????-of-?????.csv.gz'].content.values
#
#dfs_job_events = [delayed(pd.read_csv)(fn, header=None, index_col=False, names=cols_job_events, delimiter=',') for fn
#                   in
#                   job_events_files]

In [None]:
readings_job_events_df = dd.from_delayed(dfs_job_events)
jobs_metadata = readings_job_events_df[readings_job_events_df["job ID"].isin(readings_task_usage_df.index)].groupby(["job ID", "user", "logical job name"])["scheduling class"].mean().compute().reset_index()


In [None]:
jobs_metadata

In [None]:
jobs_metadata["scheduling class"].value_counts()

In [None]:
cols_task_events = df_schema[df_schema['file pattern'] == 'task_events/part-?????-of-?????.csv.gz'].content.values
task_events_files = [os.path.join(data_path, 'task_events','part-00'+ str(v).zfill(3)+'-of-00500.csv.gz')
                        for v in range(0, 500)]

dfs_task_events = [delayed(pd.read_csv)(fn, header=None, index_col=False, names=cols_task_events, delimiter=',') for fn in
           task_events_files]
readings_task_events_df = dd.from_delayed(dfs_task_events)
readings_task_events_df = readings_task_events_df[readings_task_events_df['job ID'].isin(readings_task_usage_df.index)].groupby(["job ID"])["priority", "CPU request", "memory request", "disk space request", "different machines restriction"].mean().compute().reset_index()

In [None]:
readings_task_events_df 

**TODO**: Do one-hot encoding
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
- https://towardsdatascience.com/what-is-one-hot-encoding-and-how-to-use-pandas-get-dummies-function-922eb9bd4970


In [None]:
def extract_quartiles(df: pd.DataFrame, column: str):
    q25_val = df[column].quantile(0.25)
    median_val = df[column].median()
    q75_val = df[column].quantile(0.75)
    max_val = df[column].max()
    df[column+" - Q1"] = 0
    df[column+" - Q2"] = 0
    df[column+" - Q3"] = 0
    df[column+" - Q4"] = 0
    df.loc[df[column] <=q25_val, f"{column} - Q1"] = 1
    df.loc[(q25_val < df[column]) & (df[column] <=median_val) , f"{column} - Q2"] = 1
    df.loc[(median_val < df[column]) & (df[column] <=q75_val) , f"{column} - Q3"] = 1
    df.loc[(q75_val < df[column]) & (df[column] <=max_val) , f"{column} - Q4"] = 1
    df.loc[df[column] <=q25_val, f"{column} - Quartiles"] = 'Q1'
    df.loc[(q25_val < df[column]) & (df[column] <=median_val) , f"{column} - Quartiles"] = 'Q2'
    df.loc[(median_val < df[column]) & (df[column] <=q75_val) , f"{column} - Quartiles"] = 'Q3'
    df.loc[(q75_val < df[column]) & (df[column] <=max_val) , f"{column} - Quartiles"] = 'Q4'


In [None]:
def extract_priority(df: pd.DataFrame, column='priority'):
    df.loc[(0 <= df[column]) & (df[column] <= 1), 'priority labels'] = 'Free [0,1]'
    df.loc[(2 <= df[column]) & (df[column] <= 8) , 'priority labels'] = 'Other [2,8]'
    df.loc[(df[column] == 9) , 'priority labels'] = 'Production [9]'
    df.loc[(df[column] == 10), 'priority labels'] = 'Monitoring [10]'
    df.loc[(df[column] == 11), 'priority labels'] = 'Infrastructure [11]'

In [None]:
extract_quartiles(readings_task_events_df, "disk space request")

In [None]:
extract_quartiles(readings_task_events_df, "memory request")

In [None]:
extract_quartiles(readings_task_events_df, "CPU request")

In [None]:
extract_priority(readings_task_events_df)

In [None]:
readings_task_events_df

In [None]:
readings_task_events_df_final = readings_task_events_df.drop(columns=['CPU request', 'memory request', 'disk space request'])

In [None]:
static_metrics_test = readings_task_events_df_final.set_index("job ID").join(jobs_metadata.set_index('job ID'))

In [None]:
extract_priority(static_metrics_test)

In [None]:
static_metrics_test

In [None]:
static_metrics_test['priority'] = static_metrics_test['priority'].astype(int)
static_metrics_test['different machines restriction'] = static_metrics_test['different machines restriction'].astype(int)
static_metrics_test['scheduling class'] = static_metrics_test['scheduling class'].astype(int)

In [None]:
static_metrics_test.to_csv('data/static_metrics_test.csv')

In [None]:
static_metrics_test = pd.read_csv('data/static_metrics_test.csv')

In [None]:
oneHotDfTest = static_metrics[['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles', 'priority labels', 'user', 'logical job name', 'scheduling class']]

In [None]:
oneHotDfTest = pd.get_dummies(oneHotDfTest)

In [None]:
oneHotDfTest.columns = [col.replace('[', '').replace(']','').replace(',',' ').replace(' ', '_') for col in oneHotDfTest.columns]

In [None]:
oneHotDfTestReduced = static_metrics_test[['different machines restriction', 'disk space request - Quartiles', 'memory request - Quartiles', 'CPU request - Quartiles', 'priority', 'scheduling class']]

In [None]:
oneHotDfTestReduced = encoder.transform(oneHotDfTestReduced)

In [None]:
oneHotDfTestReduced.columns = [col.replace('[', '').replace(']','').replace(',',' ').replace(' ', '_') for col in oneHotDfTestReduced.columns]

In [None]:
resTest0 = model0Red.predict_proba(oneHotDfTestReduced)

In [None]:
resTest1 = model1Red.predict_proba(oneHotDfTestReduced)

In [None]:
resTest2 = model2Red.predict_proba(oneHotDfTestReduced)

In [None]:
resTest3 = model3Red.predict_proba(oneHotDfTestReduced)

In [None]:
resTestDict = dict()

In [None]:
resTestDict['cl 0'] = [x[1] for x in resTest0]
resTestDict['cl 1'] = [x[1] for x in resTest1]
resTestDict['cl 2'] = [x[1] for x in resTest2]
resTestDict['cl 3'] = [x[1] for x in resTest3]

In [None]:
resTestDf = pd.DataFrame(resTestDict)

In [None]:
resTestDf

In [None]:
resTestDf['final label'] = resTestDf.idxmax(axis=1)

In [None]:
resTestDf['final label'].value_counts()

In [None]:
resTestDf['max proba'] = resTestDf.max(axis=1)

In [None]:
resTestDf

In [None]:
resTestDf['final label'].value_counts()

In [None]:


ecdf_summary = sns.ecdfplot(x='max proba', data=resTestDf, complementary=False) 
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.xlabel("max probability")
plt.title("CDF prediction probability")

## Watermark

In [None]:
%load_ext watermark
%watermark
%watermark --iversions