Authors: José Raúl Romero (jrromero@uco.es), Aurora Ramírez (aurora.ramirez@uma.es), Francisco Javier Alcaide (f52almef@uco.es)

**Notebook for the tag prediction problem on the ECORE dataset**

- This notebook contains the data preprocessing and dataset splitting for the tag prediction problem.

- The installation and usage of LionForest are documented in the notebook "Modelset_Multilabel_LionForest.ipynb".
- We keep only instances that have 2 or more labels; instances with fewer than two of the selected labels are omitted.

# Installation:

Define the path to the folder that contains the ModelSet files before running anything else. This notebook stores it in the variable "MODELSET_HOME".

In [None]:
MODELSET_HOME="/content/drive/MyDrive/modelset"
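A wrong path only surfaces later, when the dataset is loaded. The following is a minimal sketch of an early check, assuming the folder is reachable from the running environment (for the path above, that the drive is already mounted). `check_modelset_home` is a hypothetical helper, not part of modelset-py, and it only tests that the directory exists, not what it contains.

```python
import os


def check_modelset_home(path):
    """Return True if `path` is an existing directory (hypothetical helper)."""
    return os.path.isdir(path)


MODELSET_HOME = "/content/drive/MyDrive/modelset"
if not check_modelset_home(MODELSET_HOME):
    # Report the problem early instead of failing inside the loader
    print(f"MODELSET_HOME does not point to a directory: {MODELSET_HOME}")
```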

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install modelset-py

Collecting modelset-py
  Downloading modelset_py-0.2.1-py3-none-any.whl (10 kB)
Collecting gensim==4.2.0 (from modelset-py)
  Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.0/24.0 MB 37.5 MB/s eta 0:00:00
Collecting wget (from modelset-py)
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=0911c05b1330be8d131a253f8ea6f541c887ae22a7db3932b1e9935df7a0e89e
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget, gensim, modelset-py
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.2
    Uninstalling gensim-4

In [None]:
import sys
import pandas as pd
import numpy as np
import dalex as dx
import os
import modelset.dataset as ds
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.inspection import permutation_importance

# Load and Preprocess:

In this section, we will perform data loading, cleaning, preprocessing, and an initial data analysis.

In [None]:
dataset = ds.load(MODELSET_HOME, modeltype='ecore', selected_analysis=['stats'])  # load the dataset
modelset_df = dataset._Dataset__to_df()  # name-mangled call to the private __to_df method

In [None]:
modelset_df

Unnamed: 0,id,category,tags,language,references,elements,classes,attributes,packages,enum,datatypes
0,repo-ecore-all/data/mde-optimiser/comma-18-map...,arguments,,english,12,42,5,6,1,0,0
2,repo-ecore-all/data/AmerPecuj/MBSE/dk.dtu.comp...,petrinet,behaviour,english,7,27,6,2,1,0,0
3,repo-ecore-all/data/nlohmann/service-technolog...,petrinet,behaviour,english,13,92,15,16,1,2,0
4,repo-ecore-all/data/damenac/puzzle/examples/em...,education,domainmodel,english,4,37,4,12,1,0,0
5,repo-ecore-all/data/ModelWriter/AlloyInEcore/S...,dummy,,english,0,71,6,19,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...
5470,repo-ecore-all/data/BlackBeltTechnology/emfbui...,company,,english,2,22,5,4,2,0,0
5471,repo-ecore-all/data/fmdkdd/monoge/emftuto/test...,dummy,,english,3,9,2,0,1,0,0
5472,repo-ecore-all/data/gssi/Edelta_bad_smells/mod...,dummy,,english,3,17,6,1,1,0,0
5473,repo-ecore-all/data/mathiasnh/TDT4250-Assignme...,education,university|domainmodel,english,24,101,11,12,1,2,0


In [None]:
duplicates = modelset_df.duplicated(subset='id', keep=False)
inst_dup = modelset_df[duplicates]
inst_dup

Unnamed: 0,id,category,tags,language,references,elements,classes,attributes,packages,enum,datatypes


In [None]:
modelset_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5466 entries, 0 to 5474
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          5466 non-null   object
 1   category    5466 non-null   object
 2   tags        3824 non-null   object
 3   language    5466 non-null   object
 4   references  5466 non-null   int64 
 5   elements    5466 non-null   int64 
 6   classes     5466 non-null   int64 
 7   attributes  5466 non-null   int64 
 8   packages    5466 non-null   int64 
 9   enum        5466 non-null   int64 
 10  datatypes   5466 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 512.4+ KB


In [None]:
modelset_df.head()

Unnamed: 0,id,category,tags,language,references,elements,classes,attributes,packages,enum,datatypes
0,repo-ecore-all/data/mde-optimiser/comma-18-map...,arguments,,english,12,42,5,6,1,0,0
2,repo-ecore-all/data/AmerPecuj/MBSE/dk.dtu.comp...,petrinet,behaviour,english,7,27,6,2,1,0,0
3,repo-ecore-all/data/nlohmann/service-technolog...,petrinet,behaviour,english,13,92,15,16,1,2,0
4,repo-ecore-all/data/damenac/puzzle/examples/em...,education,domainmodel,english,4,37,4,12,1,0,0
5,repo-ecore-all/data/ModelWriter/AlloyInEcore/S...,dummy,,english,0,71,6,19,2,2,2


In [None]:
modelset_df['tags'].value_counts()

domainmodel                                              658
behaviour                                                618
class|workflow|component|statemachine|interaction|uml    156
ddl                                                      116
classes                                                  105
                                                        ... 
expressions|smtlib|modelfinder                             1
modelbased|statemachine|behaviour                          1
graph|datastructure                                        1
database                                                   1
bpel                                                       1
Name: tags, Length: 537, dtype: int64

In [None]:
modelset_df_filt = modelset_df.dropna() # drop rows with missing values (only 'tags' contains NaNs)
modelset_df_filt

Unnamed: 0,id,category,tags,language,references,elements,classes,attributes,packages,enum,datatypes
2,repo-ecore-all/data/AmerPecuj/MBSE/dk.dtu.comp...,petrinet,behaviour,english,7,27,6,2,1,0,0
3,repo-ecore-all/data/nlohmann/service-technolog...,petrinet,behaviour,english,13,92,15,16,1,2,0
4,repo-ecore-all/data/damenac/puzzle/examples/em...,education,domainmodel,english,4,37,4,12,1,0,0
6,repo-ecore-all/data/francoispfister/diagraph/o...,statemachine,behaviour,english,7,87,9,13,1,0,0
8,repo-ecore-all/data/gssi/metamodelsdataset-ECM...,petrinet,behaviour,english,3,17,4,2,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
5467,repo-ecore-all/data/atlanmod/modisco/modisco-m...,gpl,imperative|java,english,164,732,126,40,1,6,0
5468,repo-ecore-all/data/Barros-Lucas/DSL_State_Int...,statemachine,behaviour,english,3,22,5,4,1,0,0
5469,repo-ecore-all/data/luciuscode/test/projectStr...,library,domainmodel,english,4,34,6,3,1,1,0
5473,repo-ecore-all/data/mathiasnh/TDT4250-Assignme...,education,university|domainmodel,english,24,101,11,12,1,2,0


In [None]:
modelset_df_filt.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3824 entries, 2 to 5474
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          3824 non-null   object
 1   category    3824 non-null   object
 2   tags        3824 non-null   object
 3   language    3824 non-null   object
 4   references  3824 non-null   int64 
 5   elements    3824 non-null   int64 
 6   classes     3824 non-null   int64 
 7   attributes  3824 non-null   int64 
 8   packages    3824 non-null   int64 
 9   enum        3824 non-null   int64 
 10  datatypes   3824 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 358.5+ KB


In [None]:
modelset_df_filt.describe()

Unnamed: 0,references,elements,classes,attributes,packages,enum,datatypes
count,3824.0,3824.0,3824.0,3824.0,3824.0,3824.0,3824.0
mean,30.835513,229.30387,30.030073,17.96705,1.5591,1.271705,1.599634
std,54.162942,406.712845,46.664557,35.702676,2.735561,3.004256,6.26125
min,0.0,2.0,0.0,0.0,1.0,0.0,0.0
25%,5.0,33.0,5.0,3.0,1.0,0.0,0.0
50%,11.0,77.0,11.0,7.0,1.0,0.0,0.0
75%,31.0,260.0,31.0,19.0,1.0,1.0,0.0
max,1128.0,4397.0,391.0,777.0,46.0,45.0,58.0


In [None]:
# Reset index
modelset_df_filt = modelset_df_filt.reset_index(drop=True)

# Delete columns that are not useful
modelset_df_filt = modelset_df_filt.drop(columns=['elements', 'category', 'language', 'id'])

In [None]:
modelset_df_filt

Unnamed: 0,tags,references,classes,attributes,packages,enum,datatypes
0,behaviour,7,6,2,1,0,0
1,behaviour,13,15,16,1,2,0
2,domainmodel,4,4,12,1,0,0
3,behaviour,7,9,13,1,0,0
4,behaviour,3,4,2,1,0,0
...,...,...,...,...,...,...,...
3819,imperative|java,164,126,40,1,6,0
3820,behaviour,3,5,4,1,0,0
3821,domainmodel,4,6,3,1,1,0
3822,university|domainmodel,24,11,12,1,2,0


In [None]:
df_labels = modelset_df_filt['tags'].str.get_dummies('|') # get tags
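To make the encoding step concrete, here is a plain-Python sketch of what splitting the `tags` strings on `|` and one-hot encoding them produces. It is an illustration only, not the pandas implementation; `one_hot_tags` is a hypothetical helper, and the columns are sorted here for readability.

```python
def one_hot_tags(tag_strings, sep="|"):
    """One-hot encode separator-joined tag strings.

    Returns (columns, rows): the sorted list of distinct tags and one
    0/1 list per input string, in the same column order.
    """
    tag_sets = [set(s.split(sep)) for s in tag_strings]
    columns = sorted(set().union(*tag_sets))
    rows = [[1 if c in tags else 0 for c in columns] for tags in tag_sets]
    return columns, rows


# Tag strings taken from the table printed earlier in this notebook
columns, rows = one_hot_tags(["behaviour", "imperative|java", "university|domainmodel"])
print(columns)
for row in rows:
    print(row)
```

Each instance becomes one row with a 1 in every column whose tag it carries, which is why the full label matrix has one column per distinct tag.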

In [None]:
df_labels = df_labels.rename(columns={'classes':'Classes'}) # rename the 'classes' tag so it does not clash with the 'classes' feature column
df_labels

Unnamed: 0,"""age of mythology""","""argument markup language""",acceleo,accesscontrol,accounting,actions,activities,actors,actuators,ada,...,windturbines,wireframes,workflow,xbase,xdsl,xml,xpath,xpdl,xtend,zest
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3819,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3820,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3821,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3822,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_labels.sum().sort_values(ascending=False).head(20)

domainmodel      840
behaviour        817
expressions      324
imperative       253
uml              208
statemachine     204
class            198
workflow         182
component        158
interaction      157
datastructure    154
Classes          144
programming      142
ddl              133
java             125
graph            123
ocl              110
tree              90
university        64
modelling         60
dtype: int64

In [None]:
df_labels.sum().sort_values(ascending=False).tail(20)

ecore                  1
formatexchange         1
runner                 1
implementationmodel    1
haxe                   1
idl                    1
ide                    1
html                   1
reactive               1
redblack               1
henshin                1
hairdresses            1
futsal                 1
hairdessers            1
graphql                1
geo                    1
reveng                 1
fuzzy                  1
robots                 1
"age of mythology"     1
dtype: int64

In [None]:
row_sum = df_labels.sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
1    2423
2     862
3     243
6     158
4     101
5      36
7       1
dtype: int64


In [None]:
modelset_df_filt = modelset_df_filt.drop(['tags'], axis=1)
modelset_df_filt

Unnamed: 0,references,classes,attributes,packages,enum,datatypes
0,7,6,2,1,0,0
1,13,15,16,1,2,0
2,4,4,12,1,0,0
3,7,9,13,1,0,0
4,3,4,2,1,0,0
...,...,...,...,...,...,...
3819,164,126,40,1,6,0
3820,3,5,4,1,0,0
3821,4,6,3,1,1,0
3822,24,11,12,1,2,0


In [None]:
atts = list(modelset_df_filt.columns)
atts

['references', 'classes', 'attributes', 'packages', 'enum', 'datatypes']

In [None]:
df = pd.concat([modelset_df_filt, df_labels], axis=1)
df

Unnamed: 0,references,classes,attributes,packages,enum,datatypes,"""age of mythology""","""argument markup language""",acceleo,accesscontrol,...,windturbines,wireframes,workflow,xbase,xdsl,xml,xpath,xpdl,xtend,zest
0,7,6,2,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,13,15,16,1,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,4,12,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,7,9,13,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3,4,2,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3819,164,126,40,1,6,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3820,3,5,4,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3821,4,6,3,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3822,24,11,12,1,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Performance by No. Tags:

We evaluate a one-vs-rest random forest on the 10, 20, ..., 80 most frequent tags to find the best number of tags/labels to consider.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss

X_filt = modelset_df_filt[atts]
label_order = df_labels.sum().sort_values(ascending=False).index

for num_labels in range(10, 81, 10):

    labels_select = label_order[:num_labels]
    y_filt = df_labels[labels_select]
    X_train, X_test, y_train, y_test = train_test_split(X_filt, y_filt, test_size=0.3, random_state=42)

    model = OneVsRestClassifier(RandomForestClassifier(random_state=42))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    hamming_loss_value = hamming_loss(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='macro')
    print(f"No. Labels: {num_labels}, F1-score: {f1:.4f}, Accuracy: {accuracy:.4f}, Hamming Loss: {hamming_loss_value:.4f}")

No. Labels: 10, F1-score: 0.8525, Accuracy: 0.7317, Hamming Loss: 0.0351
No. Labels: 20, F1-score: 0.7306, Accuracy: 0.6629, Hamming Loss: 0.0262
No. Labels: 30, F1-score: 0.7139, Accuracy: 0.6420, Hamming Loss: 0.0193
No. Labels: 40, F1-score: 0.6964, Accuracy: 0.6220, Hamming Loss: 0.0154
No. Labels: 50, F1-score: 0.6491, Accuracy: 0.6141, Hamming Loss: 0.0129
No. Labels: 60, F1-score: 0.6265, Accuracy: 0.6045, Hamming Loss: 0.0111
No. Labels: 70, F1-score: 0.6310, Accuracy: 0.5976, Hamming Loss: 0.0097
No. Labels: 80, F1-score: 0.6084, Accuracy: 0.5932, Hamming Loss: 0.0087



---

❗❗ **The largest drop in F1-score and accuracy occurs between the 10 and the 20 most frequent tags, whereas Hamming loss keeps decreasing as labels are added. We therefore take 10 tags as the best choice.**

---
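The three metrics react differently to extra labels. On multilabel targets, scikit-learn's `accuracy_score` is subset accuracy (the whole label vector must match), `hamming_loss` is the fraction of wrong label slots, and macro F1 is the unweighted mean of the per-label F1 scores. The sketch below recomputes them by hand on invented toy data (not the ModelSet results) to show how adding rare labels can lower Hamming loss while the other two get worse. All function names are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of instances whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (0 for a label without true positives)."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Toy data: 4 instances, 2 frequent labels, two slot errors
y_true = [[1, 0], [0, 1], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 1], [1, 0], [1, 0]]

# Two rare labels are appended: one is never predicted, one is predicted correctly
rare_true = [[0, 1], [0, 0], [0, 0], [1, 0]]
rare_pred = [[0, 1], [0, 0], [0, 0], [0, 0]]
y_true4 = [t + r for t, r in zip(y_true, rare_true)]
y_pred4 = [p + r for p, r in zip(y_pred, rare_pred)]

for name, yt, yp in [("2 labels", y_true, y_pred), ("4 labels", y_true4, y_pred4)]:
    print(name, subset_accuracy(yt, yp), hamming(yt, yp), round(macro_f1(yt, yp), 4))
```

In this toy case Hamming loss falls from 0.25 to 0.1875, while subset accuracy falls from 0.5 to 0.25 and macro F1 from about 0.76 to about 0.63. A falling Hamming loss on its own is therefore not evidence that more labels are easier to predict.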



# Obtain final Dataset:

In [None]:
max_labels = 12  # start from the 12 most frequent tags; the filtering below leaves 10 of them

frec_label = df_labels.sum()
toplabels_frec = frec_label.nlargest(max_labels).index
df_labels_final = df_labels[toplabels_frec]

df_labels_final

Unnamed: 0,domainmodel,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,datastructure,Classes
0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
3819,0,0,0,1,0,0,0,0,0,0,0,0
3820,0,1,0,0,0,0,0,0,0,0,0,0
3821,1,0,0,0,0,0,0,0,0,0,0,0
3822,1,0,0,0,0,0,0,0,0,0,0,0


In [None]:
row_sum = df_labels_final.sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
1    2207
0    1232
2     191
6     156
3      38
dtype: int64


In [None]:
# Select instances with 2 or more labels
# (count on a separate Series so that df_labels_final keeps only label columns)
label_counts = df_labels_final.sum(axis=1)

df_labels_final_filt = df_labels_final[label_counts >= 2]

In [None]:
df_labels_final_filt

Unnamed: 0,domainmodel,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,datastructure,Classes
11,0,0,1,0,0,1,0,0,0,0,0,0
13,0,0,0,0,1,1,1,1,1,1,0,0
19,0,0,1,1,0,0,0,0,0,0,0,0
22,0,0,0,0,1,1,1,1,1,1,0,0
33,0,0,0,0,1,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
3761,0,0,1,1,0,0,0,0,0,0,0,0
3775,0,1,1,0,0,0,0,0,0,0,0,0
3797,0,0,1,1,0,0,0,0,0,0,0,0
3801,0,0,1,1,0,0,0,0,0,0,0,0


In [None]:
df_labels_final_filt.sum().sort_values(ascending=False).head(15)

class            192
statemachine     190
uml              176
workflow         175
interaction      157
component        156
expressions      146
imperative       120
behaviour         91
Classes           29
domainmodel        0
datastructure      0
dtype: int64

In [None]:
row_sum = df_labels_final_filt.sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
2    191
6    156
3     38
dtype: int64


In [None]:
sum_col = df_labels_final_filt.sum(axis=0)
col_zero = sum_col[sum_col == 0].index
df_labels_final_filt = df_labels_final_filt.drop(columns=col_zero)
df_labels_final_filt

Unnamed: 0,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes
11,0,1,0,0,1,0,0,0,0,0
13,0,0,0,1,1,1,1,1,1,0
19,0,1,1,0,0,0,0,0,0,0
22,0,0,0,1,1,1,1,1,1,0
33,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
3761,0,1,1,0,0,0,0,0,0,0
3775,1,1,0,0,0,0,0,0,0,0
3797,0,1,1,0,0,0,0,0,0,0
3801,0,1,1,0,0,0,0,0,0,0
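The two filtering steps above (keep instances with at least two labels, then drop label columns that no remaining instance uses) can be sketched in plain Python on invented toy rows. `keep_multilabel` and `drop_empty_columns` are hypothetical helpers. The toy data mimics the situation where a frequent label such as `domainmodel` disappears because it never co-occurs with another selected label.

```python
def keep_multilabel(rows, min_labels=2):
    """Keep (original_index, row) pairs whose 0/1 row has >= min_labels ones."""
    return [(i, r) for i, r in enumerate(rows) if sum(r) >= min_labels]


def drop_empty_columns(columns, rows):
    """Remove label columns that are 0 in every remaining row."""
    keep = [j for j in range(len(columns)) if any(r[j] for r in rows)]
    return [columns[j] for j in keep], [[r[j] for j in keep] for r in rows]


columns = ["domainmodel", "behaviour", "expressions", "imperative"]
rows = [
    [0, 1, 0, 0],  # one label: dropped
    [1, 0, 0, 0],  # one label: dropped
    [0, 0, 1, 1],  # two labels: kept
    [0, 1, 1, 0],  # two labels: kept
    [0, 0, 0, 0],  # none of the selected labels: dropped
]

kept = keep_multilabel(rows)
kept_index = [i for i, _ in kept]
new_columns, new_rows = drop_empty_columns(columns, [r for _, r in kept])
print(kept_index, new_columns, new_rows)
```

The original positions of the kept rows are preserved, in the same way the filtered DataFrame keeps its original index values.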


In [None]:
lbls = list(df_labels_final_filt.columns)
lbls

['behaviour',
 'expressions',
 'imperative',
 'uml',
 'statemachine',
 'class',
 'workflow',
 'component',
 'interaction',
 'Classes']

In [None]:
modelset_df_filt2 = modelset_df_filt.loc[df_labels_final_filt.index]

In [None]:
modelset_df_final = pd.concat([modelset_df_filt2, df_labels_final_filt], axis=1)
modelset_df_final

Unnamed: 0,references,classes,attributes,packages,enum,datatypes,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes
11,146,96,39,1,0,0,0,1,0,0,1,0,0,0,0,0
13,1,197,1,1,0,0,0,0,0,1,1,1,1,1,1,0
19,94,77,16,1,1,0,0,1,1,0,0,0,0,0,0,0
22,3,18,0,1,0,0,0,0,0,1,1,1,1,1,1,0
33,168,120,61,23,5,0,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3761,64,58,11,1,1,0,0,1,1,0,0,0,0,0,0,0
3775,26,35,8,1,4,0,1,1,0,0,0,0,0,0,0,0
3797,75,59,15,1,0,0,0,1,1,0,0,0,0,0,0,0
3801,19,29,3,1,0,0,0,1,1,0,0,0,0,0,0,0


In [None]:
# Check that no instance is left without any of the selected labels
rows_with_zeros = modelset_df_final[(modelset_df_final[lbls] == 0).all(axis=1)]
num_cases_with_all_zeros = len(rows_with_zeros)
print("No. rows with all-zero labels:", num_cases_with_all_zeros)

No. rows with all-zero labels: 0


In [None]:
modelset_df_final

Unnamed: 0,references,classes,attributes,packages,enum,datatypes,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes
11,146,96,39,1,0,0,0,1,0,0,1,0,0,0,0,0
13,1,197,1,1,0,0,0,0,0,1,1,1,1,1,1,0
19,94,77,16,1,1,0,0,1,1,0,0,0,0,0,0,0
22,3,18,0,1,0,0,0,0,0,1,1,1,1,1,1,0
33,168,120,61,23,5,0,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3761,64,58,11,1,1,0,0,1,1,0,0,0,0,0,0,0
3775,26,35,8,1,4,0,1,1,0,0,0,0,0,0,0,0
3797,75,59,15,1,0,0,0,1,1,0,0,0,0,0,0,0
3801,19,29,3,1,0,0,0,1,1,0,0,0,0,0,0,0


In [None]:
row_sum = modelset_df_final[lbls].sum(axis=1)
counts = row_sum.value_counts()
print("Count 1s for rows:")
print(counts)

Count 1s for rows:
2    191
6    156
3     38
dtype: int64


In [None]:
# Sanity check: inspect the instances that carry exactly 6 labels
rows_with_sum_6 = modelset_df_final[lbls].sum(axis=1) == 6
cases_sum_6 = modelset_df_final[rows_with_sum_6]
print("Cases with count 1s = 6:")
cases_sum_6

Cases with count 1s = 6:


Unnamed: 0,references,classes,attributes,packages,enum,datatypes,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes
13,1,197,1,1,0,0,0,0,0,1,1,1,1,1,1,0
22,3,18,0,1,0,0,0,0,0,1,1,1,1,1,1,0
40,0,33,2,1,0,0,0,0,0,1,1,1,1,1,1,0
62,41,227,20,1,3,0,0,0,0,1,1,1,1,1,1,0
97,38,227,15,1,3,0,0,0,0,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3364,3,36,0,1,0,0,0,0,0,1,1,1,1,1,1,0
3500,3,18,0,1,0,0,0,0,0,1,1,1,1,1,1,0
3563,1,33,0,1,0,0,0,0,0,1,1,1,1,1,1,0
3572,0,32,1,1,0,0,0,0,0,1,1,1,1,1,1,0


In [None]:
modelset_df_final.to_csv('modelset_df_final.csv', index=True)

# Analysis after data splitting in R:

In [None]:
df_train_r = pd.read_csv("/content/drive/MyDrive/train_ecore_filt.csv", index_col=0)
df_test_r = pd.read_csv("/content/drive/MyDrive/test_ecore_filt.csv", index_col=0)

In [None]:
df_test_r

Unnamed: 0,references,classes,attributes,packages,enum,datatypes,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes,.labelcount,.SCUMBLE
612,114,94,29,22,3,0,0,0,0,0,0,1,1,0,0,0,2,0.000661
201,90,78,14,1,0,0,0,1,0,0,1,0,0,0,0,0,2,0.011090
33,168,120,61,23,5,0,0,0,0,1,0,1,1,0,0,0,3,0.000583
1178,28,13,18,1,0,0,1,0,0,0,1,0,0,0,0,0,2,0.068877
1533,4,198,1,1,1,0,0,0,0,1,1,1,1,1,1,0,6,0.003878
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,75,59,15,1,0,0,0,1,1,0,0,0,0,0,0,0,2,0.006854
2067,1,227,0,1,0,0,0,0,0,1,1,1,1,1,1,0,6,0.003878
1095,22,12,2,1,0,0,0,0,0,0,0,1,0,0,0,1,2,0.375782
154,44,46,11,8,3,4,0,1,1,0,0,0,0,0,0,0,2,0.006854
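The `.labelcount` column is the number of active labels per instance, and `.SCUMBLE` appears to be the SCUMBLE score, which measures how strongly frequent and rare labels co-occur within one instance. The R code that wrote these files is not shown in this notebook, so the following is only a sketch of the usual definition: one minus the ratio of the geometric mean to the arithmetic mean of the imbalance ratios (largest label count divided by the label's count) of the instance's active labels. `scumble_instance` is a hypothetical helper, and the label counts are the test-set counts printed by the per-label loop at the end of this notebook.

```python
import math


def scumble_instance(active_counts, max_count):
    """SCUMBLE of one instance (sketch).

    active_counts: number of positive examples of each label that is active
                   in the instance.
    max_count:     number of positive examples of the most frequent label.
    """
    if not active_counts:
        return 0.0
    irlbl = [max_count / c for c in active_counts]  # imbalance ratio per label
    arithmetic = sum(irlbl) / len(irlbl)
    geometric = math.exp(sum(math.log(v) for v in irlbl) / len(irlbl))
    return 1.0 - geometric / arithmetic


# Test-set counts: class 57, workflow 53, expressions 43, statemachine 58.
# Assumption: 58 (statemachine) is the largest count among all ten labels.
s_612 = scumble_instance([57, 53], 58)  # instance 612: class + workflow
s_201 = scumble_instance([43, 58], 58)  # instance 201: expressions + statemachine
print(round(s_612, 6), round(s_201, 6))
```

With these counts the sketch gives about 0.000661 and 0.011090, the values listed for instances 612 and 201 in the table above, which suggests the scores were computed on the test partition alone. A score near 0 means the labels of an instance are similarly frequent; a higher score means a rare label appears together with a frequent one.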


In [None]:
df_train = modelset_df_final.loc[df_train_r.index]
df_test = modelset_df_final.loc[df_test_r.index]
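Selecting with `.loc` relies on the R files using the same index labels as `modelset_df_final`; a label that is missing from the DataFrame raises a `KeyError`. As a minimal sketch, a check like the one below could be run before the selection to confirm that the two partitions are disjoint and cover the dataset. `check_split` is a hypothetical helper, and the index values are toy values.

```python
def check_split(train_idx, test_idx, all_idx):
    """Summarise how train/test index labels relate to the full dataset index."""
    train, test, full = set(train_idx), set(test_idx), set(all_idx)
    return {
        "overlap": sorted(train & test),           # instances present in both parts
        "unknown": sorted((train | test) - full),  # labels missing from the dataset
        "unused": sorted(full - train - test),     # instances assigned to neither part
    }


report = check_split(train_idx=[11, 13, 19], test_idx=[22, 33], all_idx=[11, 13, 19, 22, 33])
print(report)
```

The function accepts any iterable of labels, so in this notebook it could be called as `check_split(df_train_r.index, df_test_r.index, modelset_df_final.index)`; three empty lists indicate a clean split.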

In [None]:
df_train

Unnamed: 0,references,classes,attributes,packages,enum,datatypes,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes
2650,17,15,11,1,0,0,1,0,0,0,1,0,0,0,0,0
897,4,198,1,1,1,0,0,0,0,1,1,1,1,1,1,0
1208,2,34,0,1,0,0,0,0,0,1,1,1,1,1,1,0
1103,33,36,16,1,0,0,0,0,0,0,0,1,0,0,0,1
1894,172,122,61,25,5,0,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2142,1,197,1,1,1,0,0,0,0,1,1,1,1,1,1,0
3025,1,198,2,1,1,0,0,0,0,1,1,1,1,1,1,0
483,3,36,0,1,0,0,0,0,0,1,1,1,1,1,1,0
1187,17,22,4,1,0,3,0,1,1,0,0,0,0,0,0,0


In [None]:
df_test

Unnamed: 0,references,classes,attributes,packages,enum,datatypes,behaviour,expressions,imperative,uml,statemachine,class,workflow,component,interaction,Classes
612,114,94,29,22,3,0,0,0,0,0,0,1,1,0,0,0
201,90,78,14,1,0,0,0,1,0,0,1,0,0,0,0,0
33,168,120,61,23,5,0,0,0,0,1,0,1,1,0,0,0
1178,28,13,18,1,0,0,1,0,0,0,1,0,0,0,0,0
1533,4,198,1,1,1,0,0,0,0,1,1,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,75,59,15,1,0,0,0,1,1,0,0,0,0,0,0,0
2067,1,227,0,1,0,0,0,0,0,1,1,1,1,1,1,0
1095,22,12,2,1,0,0,0,0,0,0,0,1,0,0,0,1
154,44,46,11,8,3,4,0,1,1,0,0,0,0,0,0,0


In [None]:
df_test.loc[13]

references        1
classes         197
attributes        1
packages          1
enum              0
datatypes         0
behaviour         0
expressions       0
imperative        0
uml               1
statemachine      1
class             1
workflow          1
component         1
interaction       1
Classes           0
Name: 13, dtype: int64

In [None]:
for lbl in lbls:
    # Each label column is 0/1, so its sum is the number of positive instances
    n_train, n_test = df_train[lbl].sum(), df_test[lbl].sum()
    print(lbl)
    print("Train label count: ", n_train, " / ", n_train / df_train.shape[0])
    print("Test label count: ", n_test, " / ", n_test / df_test.shape[0])
    print("---------------------------------------------------")

behaviour
Train label count:  64  /  0.23703703703703705
Test label count:  27  /  0.23478260869565218
---------------------------------------------------
expressions
Train label count:  103  /  0.3814814814814815
Test label count:  43  /  0.3739130434782609
---------------------------------------------------
imperative
Train label count:  86  /  0.31851851851851853
Test label count:  34  /  0.2956521739130435
---------------------------------------------------
uml
Train label count:  123  /  0.45555555555555555
Test label count:  53  /  0.4608695652173913
---------------------------------------------------
statemachine
Train label count:  132  /  0.4888888888888889
Test label count:  58  /  0.5043478260869565
---------------------------------------------------
class
Train label count:  135  /  0.5
Test label count:  57  /  0.4956521739130435
---------------------------------------------------
workflow
Train label count:  122  /  0.45185185185185184
Test label count:  53  /  0.46086956