Authors: José Raúl Romero (jrromero@uco.es), Aurora Ramírez (aurora.ramirez@uma.es), Francisco Javier Alcaide (f52almef@uco.es)

**Notebook for the tag prediction problem on the UML dataset**

- This notebook contains the data preprocessing and dataset splitting for the tag prediction problem.

- The installation and usage of LionForest are documented in the notebook "Modelset_Multilabel_LionForest.ipynb"
- We select instances that have two or more labels; instances with a single label are omitted.
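
The selection rule in the last bullet can be sketched on toy data as follows. The label names and index values are made up for illustration; the notebook applies the same idea to the one-hot label matrix it builds later.

```python
import pandas as pd

# Toy one-hot label matrix: rows are instances, columns are tags (illustrative values only).
labels = pd.DataFrame(
    {"teaching": [1, 1, 0, 0], "login": [1, 0, 0, 1], "health": [0, 0, 1, 1]},
    index=[10, 11, 12, 13],
)

# Keep only instances whose row sum (number of active tags) is at least 2.
multi = labels[labels.sum(axis=1) >= 2]
print(multi.index.tolist())  # [10, 13]
```

Filtering on the row sum keeps the original index, which is what later allows features and labels to be re-aligned by index.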

# Installation:

Set the variable "MODELSET_HOME" to the path of the folder that contains the ModelSet files; the notebook loads the dataset from that location.

In [None]:
MODELSET_HOME="/content/drive/MyDrive/modelset"
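
A minimal sketch, assuming the notebook may also be run outside Colab: the path is read from an environment variable (an assumption, not something the notebook does) and checked before loading. The helper name `modelset_home_exists` is hypothetical.

```python
import os

# Fall back to the Colab path used in this notebook when the variable is not set.
MODELSET_HOME = os.environ.get("MODELSET_HOME", "/content/drive/MyDrive/modelset")

def modelset_home_exists(path):
    """Return True when the given ModelSet folder exists on disk."""
    return os.path.isdir(path)

if not modelset_home_exists(MODELSET_HOME):
    print(f"ModelSet folder not found: {MODELSET_HOME}")
```

Failing early here gives a clearer message than an error raised later inside the dataset loader.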

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install modelset-py

Collecting modelset-py
  Downloading modelset_py-0.2.1-py3-none-any.whl (10 kB)
Collecting gensim==4.2.0 (from modelset-py)
  Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
Collecting wget (from modelset-py)
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=4efbd57781f375a4f834f9f8fbbe0b232bfdb3b4acd988c14523bc70b6e9040a
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget, gensim, modelset-py
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.2
    Uninstalling gensim-4

In [None]:
import sys
import pandas as pd
import numpy as np
import os
import modelset.dataset as ds
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.inspection import permutation_importance

# Load and Preprocess:

In this section, we will perform data loading, cleaning, preprocessing, and an initial data analysis.

In [None]:
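dataset = ds.load(MODELSET_HOME, modeltype='uml', selected_analysis=['stats'])  # load the dataset
# Note: this calls the name-mangled private method Dataset.__to_df(), so it may break with other modelset-py versions
modelset_df = dataset._Dataset__to_df()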

In [None]:
modelset_df

Unnamed: 0,id,category,tags,language,type_Generalization,type_Class,type_Interaction,type_Relationship,type_Package,type_Actor,...,type_UseCase,diagram_usecase,elements,type_Component,type_Enumeration,type_Association,type_Activity,diagram_comp,diagram_interaction,diagram_sm
0,repo-genmymodel-uml/data/_WJKFoOBcEeeAyLDAJ12_...,computer-ui,,english,0,5,0,6,1,1,...,0,1.0,123,0,0,0,1,,,
1,repo-genmymodel-uml/data/_grOBAOs7EeiJfugOH9Y5...,computer-videogames,videgame,english,0,8,0,3,1,0,...,0,,56,0,1,0,0,,,
2,repo-genmymodel-uml/data/_3e5Z4BBDEeqa8dopbpYH...,unknown,,rusian,0,33,0,106,1,17,...,47,17.0,726,0,0,0,0,,,
3,repo-genmymodel-uml/data/_zRSRMDEsEemjcq-iJCnV...,unknown,,unknown,0,3,0,3,1,0,...,0,,32,0,0,0,0,,,
4,repo-genmymodel-uml/data/_1vnlQNqPEeiJYbNjsZ3w...,dummy,,english,0,0,0,5,1,2,...,3,2.0,53,1,0,0,1,1.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5115,repo-genmymodel-uml/data/f60ea665-be9c-4b7d-b0...,shopping,,english,0,8,0,20,1,3,...,4,3.0,270,0,2,0,1,,,
5116,repo-genmymodel-uml/data/_XUNZYJuFEeexEbmG8xrw...,shopping,,english,0,0,0,1,1,0,...,0,,112,0,0,0,1,,,
5117,repo-genmymodel-uml/data/1fd45148-722f-4b60-93...,realstate,real-state,english,0,13,0,15,1,0,...,0,,202,0,0,0,0,,,
5118,repo-genmymodel-uml/data/bc00e7fa-4d5b-4f96-8c...,computer-videogames,card-game,english,0,6,0,8,1,0,...,0,,84,0,2,0,0,,,


In [None]:
duplicates = modelset_df.duplicated(subset='id', keep=False)
inst_dup = modelset_df[duplicates]
inst_dup

Unnamed: 0,id,category,tags,language,type_Generalization,type_Class,type_Interaction,type_Relationship,type_Package,type_Actor,...,type_UseCase,diagram_usecase,elements,type_Component,type_Enumeration,type_Association,type_Activity,diagram_comp,diagram_interaction,diagram_sm


In [None]:
# Delete columns that are not useful
cols_to_drop = [
    'elements', 'type_Interaction', 'type_Generalization', 'type_Association',
    'diagram_ad', 'diagram_cd', 'diagram_usecase', 'diagram_comp',
    'diagram_interaction', 'diagram_sm', 'category', 'language', 'id',
]
modelset_df = modelset_df.drop(columns=cols_to_drop)

In [None]:
modelset_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5120 entries, 0 to 5119
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   tags               1943 non-null   object
 1   type_Class         5120 non-null   int64 
 2   type_Relationship  5120 non-null   int64 
 3   type_Package       5120 non-null   int64 
 4   type_Actor         5120 non-null   int64 
 5   type_DataType      5120 non-null   int64 
 6   type_Operation     5120 non-null   int64 
 7   type_Transition    5120 non-null   int64 
 8   type_State         5120 non-null   int64 
 9   type_Property      5120 non-null   int64 
 10  type_UseCase       5120 non-null   int64 
 11  type_Component     5120 non-null   int64 
 12  type_Enumeration   5120 non-null   int64 
 13  type_Activity      5120 non-null   int64 
dtypes: int64(13), object(1)
memory usage: 600.0+ KB


In [None]:
modelset_df.head()

Unnamed: 0,tags,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,type_Component,type_Enumeration,type_Activity
0,,5,6,1,1,11,21,0,0,22,0,0,0,1
1,videgame,8,3,1,0,7,6,0,0,20,0,0,1,0
2,,33,106,1,17,1,0,0,0,203,47,0,0,0
3,,3,3,1,0,2,0,0,0,15,0,0,0,0
4,,0,5,1,2,0,0,0,0,4,3,1,0,1


In [None]:
modelset_df['tags'].value_counts()

generic                      249
cpu|cache|"graphics card"    210
hierarchy                    180
shopping-cart                136
teaching                     129
                            ... 
login|teaching                 1
yelp                           1
competition                    1
database|database              1
betting                        1
Name: tags, Length: 89, dtype: int64

In [None]:
modelset_df_filt = modelset_df.dropna() # Delete NaN cases
modelset_df_filt

Unnamed: 0,tags,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,type_Component,type_Enumeration,type_Activity
1,videgame,8,3,1,0,7,6,0,0,20,0,0,1,0
6,courses|teaching,9,7,1,0,1,0,0,0,45,0,0,0,0
9,api|api,2,1,1,0,17,13,0,0,0,0,0,0,0
17,storage,5,16,1,1,2,31,0,0,42,11,0,1,0
19,videogame,29,31,11,0,4,113,0,0,92,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5109,client-complaints,18,38,1,3,0,36,0,0,45,11,0,0,1
5110,generic,21,10,1,0,0,10,0,0,16,0,0,0,0
5111,poker-game,11,13,1,0,4,52,0,0,35,0,0,0,0
5117,real-state,13,15,1,0,6,36,0,0,53,0,0,0,0


In [None]:
modelset_df_filt.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1943 entries, 1 to 5118
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   tags               1943 non-null   object
 1   type_Class         1943 non-null   int64 
 2   type_Relationship  1943 non-null   int64 
 3   type_Package       1943 non-null   int64 
 4   type_Actor         1943 non-null   int64 
 5   type_DataType      1943 non-null   int64 
 6   type_Operation     1943 non-null   int64 
 7   type_Transition    1943 non-null   int64 
 8   type_State         1943 non-null   int64 
 9   type_Property      1943 non-null   int64 
 10  type_UseCase       1943 non-null   int64 
 11  type_Component     1943 non-null   int64 
 12  type_Enumeration   1943 non-null   int64 
 13  type_Activity      1943 non-null   int64 
dtypes: int64(13), object(1)
memory usage: 227.7+ KB


In [None]:
modelset_df_filt.describe()

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,type_Component,type_Enumeration,type_Activity
count,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0,1943.0
mean,9.33299,14.855378,1.206897,1.0772,0.934637,13.406588,0.244467,0.181678,25.745754,4.378281,0.215646,0.13124,0.265054
std,8.601713,10.761298,1.045521,2.034767,3.433396,19.630978,2.756219,2.069949,24.321825,7.490088,0.580601,0.608697,0.771031
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0
50%,7.0,12.0,1.0,0.0,0.0,10.0,0.0,0.0,20.0,0.0,0.0,0.0,0.0
75%,18.0,21.0,1.0,2.0,0.0,21.0,0.0,0.0,34.0,9.0,0.0,0.0,0.0
max,75.0,108.0,12.0,18.0,52.0,370.0,46.0,36.0,247.0,79.0,7.0,8.0,13.0


In [None]:
modelset_df_filt = modelset_df_filt.reset_index(drop=True) # Reset index

In [None]:
modelset_df_filt

Unnamed: 0,tags,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,type_Component,type_Enumeration,type_Activity
0,videgame,8,3,1,0,7,6,0,0,20,0,0,1,0
1,courses|teaching,9,7,1,0,1,0,0,0,45,0,0,0,0
2,api|api,2,1,1,0,17,13,0,0,0,0,0,0,0
3,storage,5,16,1,1,2,31,0,0,42,11,0,1,0
4,videogame,29,31,11,0,4,113,0,0,92,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,client-complaints,18,38,1,3,0,36,0,0,45,11,0,0,1
1939,generic,21,10,1,0,0,10,0,0,16,0,0,0,0
1940,poker-game,11,13,1,0,4,52,0,0,35,0,0,0,0
1941,real-state,13,15,1,0,6,36,0,0,53,0,0,0,0


In [None]:
df_labels = modelset_df_filt['tags'].str.get_dummies('|') # get tags
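
A small sketch of what `str.get_dummies('|')` does with the kinds of tag strings seen above. The three sample strings are taken from the outputs in this notebook; stripping the double quotes is a hypothetical cleaning step that the notebook itself does not perform.

```python
import pandas as pd

tags = pd.Series(['courses|teaching', 'api|api', 'cpu|cache|"graphics card"'])

# One indicator column per distinct tag; a repeated tag such as 'api|api' still yields 1, not 2.
dummies = tags.str.get_dummies('|')
print(dummies['api'].tolist())  # [0, 1, 0]

# Hypothetical cleaning step: drop literal double quotes before splitting,
# so that '"graphics card"' becomes the column 'graphics card'.
clean = tags.str.replace('"', '', regex=False).str.get_dummies('|')
print('graphics card' in clean.columns)  # True
```

Because the indicators are 0/1, the row sums computed afterwards count distinct tags per instance rather than tag occurrences.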

In [None]:
df_labels

Unnamed: 0,"""graphics card""","""management system""",-state,admission,agenda,airport,algorithm,answers,api,application,...,teaching-evaluation,ticketing,train,trains,vehicles,videgame,videogame,visitor,web,yelp
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1939,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1940,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1941,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
row_sum = df_labels.sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
1    1572
3     277
2      62
4      32
dtype: int64


In [None]:
df_labels.sum().sort_values(ascending=False).head(20)

generic                249
"graphics card"        210
cache                  210
cpu                    210
hierarchy              180
teaching               163
shopping-cart          136
libraries              128
hospital               118
employee-management    109
management              88
card-game               74
visitor                 58
json                    58
serialization           58
blackjack-game          40
health                  32
"management system"     31
book                    31
library                 31
dtype: int64

In [None]:
# Search for identical tags
labels_corr = df_labels.corr()
duplicates = []
for i in range(len(labels_corr.columns)):
    for j in range(i):
        if abs(labels_corr.iloc[i, j]) == 1.0:
            duplicates.append((labels_corr.columns[i], labels_corr.columns[j]))

duplicates

[('book', '"management system"'),
 ('cache', '"graphics card"'),
 ('cpu', '"graphics card"'),
 ('cpu', 'cache'),
 ('library', '"management system"'),
 ('library', 'book'),
 ('loan', '"management system"'),
 ('loan', 'book'),
 ('loan', 'library'),
 ('questions', 'answers'),
 ('serialization', 'json'),
 ('teaching-evaluation', 'survey'),
 ('visitor', 'json'),
 ('visitor', 'serialization'),
 ('web', 'http')]
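
The correlation test above relies on exact floating-point equality and, because of `abs`, would also flag perfectly complementary columns. As an alternative sketch (not the method used in this notebook), identical label columns can be grouped by comparing their values directly. The helper name `identical_columns` and the toy data are assumptions for illustration.

```python
import pandas as pd

def identical_columns(df):
    """Group column names whose values are exactly equal in every row."""
    groups = {}
    for col in df.columns:
        key = tuple(df[col].tolist())  # the column's values act as the grouping key
        groups.setdefault(key, []).append(col)
    # Only groups with more than one member contain duplicated columns.
    return [cols for cols in groups.values() if len(cols) > 1]

toy = pd.DataFrame({
    "cpu":     [1, 0, 1, 0],
    "cache":   [1, 0, 1, 0],
    "web":     [0, 1, 0, 0],
    "http":    [0, 1, 0, 0],
    "generic": [0, 0, 0, 1],
})
groups = identical_columns(toy)
print(groups)  # [['cpu', 'cache'], ['web', 'http']]
```

Returning whole groups rather than pairs makes it easier to keep one representative per group and drop the rest.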

In [None]:
# Delete duplicates: keep one tag from each group of identical columns
dup_labels = [
    '"management system"', 'book', 'loan', '"graphics card"', 'cache',
    'answers', 'json', 'visitor', 'http', 'teaching-evaluation',
]
df_labels = df_labels.drop(columns=dup_labels)

In [None]:
labels_corr2 = df_labels.corr()
duplicates_2 = []
for i in range(len(labels_corr2.columns)):
    for j in range(i):
        if abs(labels_corr2.iloc[i, j]) == 1.0:
            duplicates_2.append((labels_corr2.columns[i], labels_corr2.columns[j]))

duplicates_2

[]

In [None]:
df_labels

Unnamed: 0,-state,admission,agenda,airport,algorithm,api,application,appointment,banking,betting,...,survey,teaching,ticketing,train,trains,vehicles,videgame,videogame,web,yelp
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1939,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1940,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1941,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_labels.sum().sort_values(ascending=False).head(20)

generic                249
cpu                    210
hierarchy              180
teaching               163
shopping-cart          136
libraries              128
hospital               118
employee-management    109
management              88
card-game               74
serialization           58
blackjack-game          40
health                  32
library                 31
restaurant              29
api                     29
client-complaints       26
real-state              25
student-management      22
poker-game              20
dtype: int64

In [None]:
modelset_df_filt = modelset_df_filt.drop(['tags'], axis=1)
modelset_df_filt

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,type_Component,type_Enumeration,type_Activity
0,8,3,1,0,7,6,0,0,20,0,0,1,0
1,9,7,1,0,1,0,0,0,45,0,0,0,0
2,2,1,1,0,17,13,0,0,0,0,0,0,0
3,5,16,1,1,2,31,0,0,42,11,0,1,0
4,29,31,11,0,4,113,0,0,92,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,18,38,1,3,0,36,0,0,45,11,0,0,1
1939,21,10,1,0,0,10,0,0,16,0,0,0,0
1940,11,13,1,0,4,52,0,0,35,0,0,0,0
1941,13,15,1,0,6,36,0,0,53,0,0,0,0


In [None]:
atts = list(modelset_df_filt.columns)
atts

['type_Class',
 'type_Relationship',
 'type_Package',
 'type_Actor',
 'type_DataType',
 'type_Operation',
 'type_Transition',
 'type_State',
 'type_Property',
 'type_UseCase',
 'type_Component',
 'type_Enumeration',
 'type_Activity']

In [None]:
df = pd.concat([modelset_df_filt, df_labels], axis=1)
df

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,survey,teaching,ticketing,train,trains,vehicles,videgame,videogame,web,yelp
0,8,3,1,0,7,6,0,0,20,0,...,0,0,0,0,0,0,1,0,0,0
1,9,7,1,0,1,0,0,0,45,0,...,0,1,0,0,0,0,0,0,0,0
2,2,1,1,0,17,13,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,16,1,1,2,31,0,0,42,11,...,0,0,0,0,0,0,0,0,0,0
4,29,31,11,0,4,113,0,0,92,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,18,38,1,3,0,36,0,0,45,11,...,0,0,0,0,0,0,0,0,0,0
1939,21,10,1,0,0,10,0,0,16,0,...,0,0,0,0,0,0,0,0,0,0
1940,11,13,1,0,4,52,0,0,35,0,...,0,0,0,0,0,0,0,0,0,0
1941,13,15,1,0,6,36,0,0,53,0,...,0,0,0,0,0,0,0,0,0,0


# Performance by No. Tags:

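To choose how many tags to keep, we train a one-vs-rest random forest on the N most frequent tags, for N from 10 to 80 in steps of 10, and compare macro F1-score, accuracy, and Hamming loss on a 30% hold-out split.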

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss

X_filt = modelset_df_filt[atts]
label_order = df_labels.sum().sort_values(ascending=False).index

for num_labels in range(10, 81, 10):

    labels_select = label_order[:num_labels]
    y_filt = df_labels[labels_select]
    X_train, X_test, y_train, y_test = train_test_split(X_filt, y_filt, test_size=0.3, random_state=42)

    model = OneVsRestClassifier(RandomForestClassifier(random_state=42))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    hamming_loss_value = hamming_loss(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='macro')
    print(f"No. Labels: {num_labels}, F1-score: {f1:.4f}, Accuracy: {accuracy:.4f}, Hamming Loss: {hamming_loss_value:.4f}")

No. Labels: 10, F1-score: 0.7492, Accuracy: 0.7890, Hamming Loss: 0.0230
No. Labels: 20, F1-score: 0.7863, Accuracy: 0.7530, Hamming Loss: 0.0139
No. Labels: 30, F1-score: 0.7022, Accuracy: 0.7221, Hamming Loss: 0.0103
No. Labels: 40, F1-score: 0.6016, Accuracy: 0.7136, Hamming Loss: 0.0081
No. Labels: 50, F1-score: 0.5013, Accuracy: 0.7136, Hamming Loss: 0.0067
No. Labels: 60, F1-score: 0.4178, Accuracy: 0.7101, Hamming Loss: 0.0056
No. Labels: 70, F1-score: 0.3581, Accuracy: 0.7067, Hamming Loss: 0.0049
No. Labels: 80, F1-score: 0.3133, Accuracy: 0.7050, Hamming Loss: 0.0043
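
When reading these numbers, note that scikit-learn's `accuracy_score` on multilabel indicator targets is subset accuracy (the whole label vector must match), while Hamming loss is averaged over individual label cells. One plausible reason the Hamming loss keeps shrinking as labels are added is that rare labels contribute many cells that are 0 and predicted 0. The pure-Python sketch below illustrates that dilution on made-up data; the function names are hypothetical and it is not a replacement for the scikit-learn metrics.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of instances whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming(y_true, y_pred):
    """Fraction of individual label cells that are predicted wrongly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

# Toy data: 4 instances, 2 labels; exactly one label cell is mispredicted.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [1, 0], [0, 0]]

# Add 8 rare labels that are always 0 and always predicted 0.
pad = [0] * 8
y_true_wide = [row + pad for row in y_true]
y_pred_wide = [row + pad for row in y_pred]

print(subset_accuracy(y_true, y_pred), hamming(y_true, y_pred))            # 0.75 0.125
print(subset_accuracy(y_true_wide, y_pred_wide), hamming(y_true_wide, y_pred_wide))  # 0.75 0.025
```

The same single mistake gives a five times smaller Hamming loss once the extra all-zero labels are included, which is why macro F1-score is the more informative signal when comparing different numbers of labels.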


# Obtain final Dataset:

In [None]:
max_labels = 50

frec_label = df_labels.sum()
toplabels_frec = frec_label.nlargest(max_labels).index
df_labels_select = df_labels[toplabels_frec]

df_labels_select

Unnamed: 0,generic,cpu,hierarchy,teaching,shopping-cart,libraries,hospital,employee-management,management,card-game,...,procedure,shipment,course-management,admission,cars,clinic,donation,malware,online-teaching,services
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1938,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1939,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1940,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1941,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
row_sum = df_labels_select.sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
2    1864
4      46
0      27
6       6
dtype: int64


In [None]:
# Keep only instances with two or more of the selected labels.
# Filtering on a computed mask avoids adding a temporary 'sum' column to a slice of df_labels.
label_counts = df_labels_select.sum(axis=1)
df_labels_final = df_labels_select[label_counts >= 2]

In [None]:
df_labels_final

Unnamed: 0,generic,cpu,hierarchy,teaching,shopping-cart,libraries,hospital,employee-management,management,card-game,...,procedure,shipment,course-management,admission,cars,clinic,donation,malware,online-teaching,services
21,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
123,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
155,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
167,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
257,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
278,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
401,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
455,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
458,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
515,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_labels_final.sum().sort_values(ascending=False).head(20)

teaching              32
student-management    22
health                 7
internet               6
simulation             6
login                  5
appointment            5
registration           4
checkouts              4
api                    4
course-management      3
services               2
online-teaching        2
events                 2
admission              2
donation               2
procedure              1
protocol               1
game                   0
malware                0
dtype: int64

In [None]:
# Drop labels that occur in at most one of the remaining instances
sum_col = df_labels_final.sum(axis=0)
rare_cols = sum_col[sum_col <= 1].index
df_labels_final_filt = df_labels_final.drop(columns=rare_cols)
df_labels_final_filt

Unnamed: 0,teaching,health,api,student-management,login,internet,simulation,registration,appointment,checkouts,events,course-management,admission,donation,online-teaching,services
21,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
123,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
155,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
167,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
257,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
278,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
401,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
455,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
458,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
515,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Re-apply the two-label minimum after removing rare labels
label_counts = df_labels_final_filt.sum(axis=1)
df_labels_final_filt = df_labels_final_filt[label_counts >= 2]

In [None]:
df_labels_final_filt.sum().sort_values(ascending=False).head(20)

teaching              32
student-management    22
health                 7
internet               6
simulation             6
login                  5
appointment            5
api                    4
registration           4
checkouts              4
course-management      3
events                 2
admission              2
donation               2
online-teaching        2
services               2
dtype: int64

In [None]:
row_sum = df_labels_final_filt.sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
2    43
3     8
dtype: int64


In [None]:
print(df_labels_final_filt.shape)
print(modelset_df_filt.shape)

(51, 17)
(1943, 13)


In [None]:
lbls = list(df_labels_final_filt.columns)
lbls

['teaching',
 'health',
 'api',
 'student-management',
 'login',
 'internet',
 'simulation',
 'registration',
 'appointment',
 'events',
 'checkouts',
 'course-management',
 'donation',
 'services',
 'online-teaching',
 'students',
 'admission']

In [None]:
modelset_df_filt2 = modelset_df_filt.loc[df_labels_final_filt.index]

In [None]:
modelset_df_final_uml = pd.concat([modelset_df_filt2, df_labels_final_filt], axis=1)
modelset_df_final_uml

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,registration,appointment,events,checkouts,course-management,donation,services,online-teaching,students,admission
21,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
123,0,15,1,6,0,0,0,0,24,13,...,0,0,0,0,0,0,0,0,0,0
155,0,15,1,6,0,0,0,0,24,13,...,0,0,0,0,0,0,0,0,0,0
167,0,24,1,3,0,0,0,0,34,14,...,0,0,0,0,0,0,0,0,1,0
257,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
278,7,13,1,0,0,13,0,0,29,0,...,0,0,0,0,0,0,0,0,0,0
401,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
455,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
458,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
515,0,15,1,6,0,0,0,0,24,13,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Count instances (rows) that ended up with none of the selected labels
rows_with_zeros = modelset_df_final_uml[(modelset_df_final_uml[lbls] == 0).all(axis=1)]
num_cases_with_all_zeros = len(rows_with_zeros)
print("No. rows with all-zero labels:", num_cases_with_all_zeros)

No. rows with all-zero labels: 0


In [None]:
row_sum = modelset_df_final_uml[lbls].sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
2    43
3     8
dtype: int64


In [None]:
# test:
rows_with_sum_3 = modelset_df_final_uml[lbls].sum(axis=1) == 3
cases_sum_3 = modelset_df_final_uml[rows_with_sum_3]
print("Cases with count 1s = 3:")
cases_sum_3

Cases with count 1s = 3:


Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,registration,appointment,events,checkouts,course-management,donation,services,online-teaching,students,admission
167,0,24,1,3,0,0,0,0,34,14,...,0,0,0,0,0,0,0,0,1,0
800,0,15,1,6,0,0,0,0,24,13,...,1,0,0,0,0,0,0,0,0,0
1298,0,15,1,6,0,0,0,0,24,13,...,1,0,0,0,0,0,0,0,0,0
1449,5,19,1,3,0,15,0,0,43,9,...,0,0,0,0,0,0,0,1,0,0
1587,0,15,1,6,0,0,0,0,24,13,...,1,0,0,0,0,0,0,0,0,0
1589,0,24,1,3,0,0,0,0,34,14,...,0,0,0,0,0,0,0,0,0,1
1774,0,24,1,3,0,0,0,0,34,14,...,0,0,0,0,0,0,0,0,1,1
1811,4,19,1,3,0,14,0,0,42,9,...,0,0,0,0,0,0,0,1,0,0


In [None]:
modelset_df_final_uml.to_csv('modelset_df_final_uml_filt.csv', index=True)

# Analysis after data splitting in R:

In [None]:
df_train_r = pd.read_csv("/content/drive/MyDrive/train_uml_filt.csv", index_col=0)
df_test_r = pd.read_csv("/content/drive/MyDrive/test_uml_filt.csv", index_col=0)

In [None]:
df_train_r

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,events,checkouts,course.management,donation,services,online.teaching,students,admission,.labelcount,.SCUMBLE
1593,10.0,10.0,1.0,1.0,2.0,13.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
458,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.2
1116,7.0,13.0,1.0,0.0,0.0,13.0,0.0,0.0,29.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.011545
1514,0.0,15.0,1.0,6.0,0.0,0.0,0.0,0.0,24.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.011545
1589,0.0,24.0,1.0,3.0,0.0,0.0,0.0,0.0,34.0,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.437558
1168,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
1886,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.010257
1492,0.0,17.0,1.0,2.0,0.0,0.0,0.0,0.0,32.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.011545
1258,7.0,13.0,1.0,0.0,0.0,13.0,0.0,0.0,29.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.011545
1441,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0


In [None]:
df_test_r

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,events,checkouts,course.management,donation,services,online.teaching,students,admission,.labelcount,.SCUMBLE
257,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.133975
21,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.320131
1751,10.0,10.0,1.0,1.0,2.0,13.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
1774,0.0,24.0,1.0,3.0,0.0,0.0,0.0,0.0,34.0,14.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,0.357826
1930,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.020204
803,0.0,15.0,1.0,6.0,0.0,0.0,0.0,0.0,24.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.028758
801,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
1397,10.0,10.0,1.0,1.0,2.0,13.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
1761,0.0,15.0,1.0,6.0,0.0,0.0,0.0,0.0,24.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.028758
851,0.0,24.0,1.0,4.0,0.0,0.0,0.0,0.0,30.0,16.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.320131


In [None]:
df_train = modelset_df_final_uml.loc[df_train_r.index]
df_test = modelset_df_final_uml.loc[df_test_r.index]
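
Since the rows are re-selected by the index values exported from R, a quick consistency check on those index values can catch a leaky or incomplete split. The sketch below is a hedged illustration: `check_split` is a hypothetical helper and the ids are a few index values picked from the tables above, not the full split.

```python
def check_split(all_ids, train_ids, test_ids):
    """Report ids shared by both splits, ids unknown to the dataset, and ids left unused."""
    train, test, everything = set(train_ids), set(test_ids), set(all_ids)
    return {
        "overlap": sorted(train & test),                 # should be empty: no leakage
        "unknown": sorted((train | test) - everything),  # ids that are not in the dataset
        "unused": sorted(everything - train - test),     # dataset ids in neither split
    }

# Toy example with a handful of index values.
report = check_split(
    all_ids=[21, 257, 458, 1589, 1774],
    train_ids=[458, 1589],
    test_ids=[257, 21, 1774],
)
print(report)  # {'overlap': [], 'unknown': [], 'unused': []}
```

In this notebook the call would presumably be `check_split(modelset_df_final_uml.index, df_train_r.index, df_test_r.index)`, with all three lists expected to be empty.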

In [None]:
df_train.to_csv('train_uml_filt.csv', index=True)
df_test.to_csv('test_uml_filt.csv', index=True)

In [None]:
df_train

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,registration,appointment,events,checkouts,course-management,donation,services,online-teaching,students,admission
1593,10,10,1,1,2,13,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
458,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1116,7,13,1,0,0,13,0,0,29,0,...,0,0,0,0,0,0,0,0,0,0
1514,0,15,1,6,0,0,0,0,24,13,...,0,0,0,0,0,0,0,0,0,0
1589,0,24,1,3,0,0,0,0,34,14,...,0,0,0,0,0,0,0,0,0,1
1168,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1886,0,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1492,0,17,1,2,0,0,0,0,32,10,...,0,0,0,0,0,0,0,0,0,0
1258,7,13,1,0,0,13,0,0,29,0,...,0,0,0,0,0,0,0,0,0,0
1441,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_test

Unnamed: 0,type_Class,type_Relationship,type_Package,type_Actor,type_DataType,type_Operation,type_Transition,type_State,type_Property,type_UseCase,...,registration,appointment,events,checkouts,course-management,donation,services,online-teaching,students,admission
257,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
21,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1751,10,10,1,1,2,13,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1774,0,24,1,3,0,0,0,0,34,14,...,0,0,0,0,0,0,0,0,1,1
1930,0,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
803,0,15,1,6,0,0,0,0,24,13,...,0,0,0,0,0,0,0,0,0,0
801,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1397,10,10,1,1,2,13,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1761,0,15,1,6,0,0,0,0,24,13,...,0,0,0,0,0,0,0,0,0,0
851,0,24,1,4,0,0,0,0,30,16,...,0,0,0,0,0,0,0,0,0,0


In [None]:
for lbl in lbls:
    # .sum() counts the 1s and, unlike value_counts()[1], does not fail when a label has no positives in a split
    n_train = df_train[lbl].sum()
    n_test = df_test[lbl].sum()
    print(lbl)
    print("Train label count: ", n_train, " / ", n_train/df_train.shape[0])
    print("Test label count: ", n_test, " / ", n_test/df_test.shape[0])
    print("---------------------------------------------------")

teaching
Train label count:  19  /  0.6333333333333333
Test label count:  13  /  0.6190476190476191
---------------------------------------------------
health
Train label count:  4  /  0.13333333333333333
Test label count:  3  /  0.14285714285714285
---------------------------------------------------
api
Train label count:  2  /  0.06666666666666667
Test label count:  2  /  0.09523809523809523
---------------------------------------------------
student-management
Train label count:  14  /  0.4666666666666667
Test label count:  8  /  0.38095238095238093
---------------------------------------------------
login
Train label count:  3  /  0.1
Test label count:  2  /  0.09523809523809523
---------------------------------------------------
internet
Train label count:  4  /  0.13333333333333333
Test label count:  2  /  0.09523809523809523
---------------------------------------------------
simulation
Train label count:  4  /  0.13333333333333333
Test label count:  2  /  0.09523809523809523
--