# ModelSet - Basic tutorial

ModelSet is a dataset of software models originally intended to support the application of machine learning techniques to modelling tasks.

In this tutorial we will explain how to load the dataset and extract basic features to perform a classification task: inferring the category of a model.

## Installation

First of all, you need to download and install ModelSet. 

 1. Download the package containing the raw models and the associated databases, available at http://modelset.github.io/download/current.
 2. Unzip the package.
 3. Make sure that the `MODELSET_HOME` variable (see below) points to the location where you unzipped the package.
 4. Install the Python library using pip.
    * Alternatively, if you have downloaded the source code of the library from http://github.com/modelset/modelset-py,
      use `sys.path.append("/path/to/modelset-py/src")` as a shortcut to load it dynamically.
    



In [None]:
# Change this path to match your local installation
MODELSET_HOME="/home/jesus/projects/mde-ml/modelset/modelset-dataset"
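
Rather than hard-coding the path, the location could also be read from an environment variable. The following is a minimal sketch, assuming you export a variable named `MODELSET_HOME` in your shell; the helper `resolve_modelset_home` and the placeholder path are hypothetical and not part of the ModelSet library.

```python
import os

def resolve_modelset_home(default):
    """Return the dataset location, preferring the MODELSET_HOME environment variable.

    Hypothetical helper for illustration; not part of the ModelSet library.
    """
    path = os.environ.get("MODELSET_HOME", default)
    # Expand "~" and normalise the path so that later joins behave predictably
    return os.path.abspath(os.path.expanduser(path))

# Placeholder default: replace it with the folder where you unzipped the package
modelset_home = resolve_modelset_home("/path/to/modelset-dataset")
print(modelset_home)
```

With this approach the notebook itself does not need to be edited when the dataset is moved to another folder.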

## Loading the dataset

The ModelSet library offers a convenient interface to dump the contents of the underlying database into a dataframe. The output dataframe provides the following features for each model:

 * The identifier of the model
 * The category of the model (manually labelled). Reflects the domain of the model.
 * Associated tags (zero or more manually labelled) which provide additional insights about the type of model.
 * The language of the model (typically English)
 * Basic stats. In the case of Ecore, number of elements, references, classes, attributes, packages, enumerations and datatypes


In [None]:
import sys
import pandas as pd
import os

import modelset.dataset as ds

In [None]:
dataset = ds.load(MODELSET_HOME, modeltype = 'ecore', selected_analysis = ['stats'])

In [None]:
modelset_df = dataset.to_normalized_df()

In [None]:
modelset_df

In [None]:
# Keep only the models whose category is defined
modelset_df = modelset_df[~modelset_df['category'].isna()]

## Splitting the dataset

In [None]:
from sklearn.model_selection import train_test_split

# Both are pandas Series: the model identifiers and their category labels
ids     = modelset_df['id']
labels  = modelset_df['category']

train_X, test_X, train_y, test_y = train_test_split(ids, labels, test_size=0.2, random_state=42)
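
If some categories contain only a few models, a plain random split may leave them under-represented in the test set. The following is a hedged sketch of a stratified split on toy data; the `toy_df` frame merely stands in for `modelset_df`, and its identifiers and category names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for modelset_df: 10 models in two categories (illustrative values)
toy_df = pd.DataFrame({
    "id": [f"model_{i}" for i in range(10)],
    "category": ["statemachine"] * 5 + ["library"] * 5,
})

# stratify keeps the category proportions the same in the train and test splits
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_df["id"], toy_df["category"],
    test_size=0.2, random_state=42, stratify=toy_df["category"],
)
print(toy_test_y.value_counts())
```

Note that stratification requires every category to contain at least two models, so very rare categories would have to be filtered out or merged beforehand.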

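## Selecting features

Each model has an associated text file, whose path is obtained with `dataset.txt_file`. We apply TF-IDF to these files to turn every model into a numeric feature vector.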



In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer

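# Collect the path of the text file associated with each model in both splits
train_filenames = [dataset.txt_file(model_id) for model_id in train_X]
test_filenames  = [dataset.txt_file(model_id) for model_id in test_X]

# input='filename' makes the vectorizer read each file itself;
# min_df=2 ignores words that appear in fewer than two files
vectorizer = TfidfVectorizer(input='filename', min_df=2)
# Learn the vocabulary on the training files only, then reuse it for the test files
X = vectorizer.fit_transform(train_filenames)
T = vectorizer.transform(test_filenames)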

In [None]:
# The output of the TF-IDF vectorization is a large matrix with len(train_X) rows and as many columns as words in the vocabulary
print(X.shape)
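
To see how `min_df` shapes the resulting matrix, here is a small sketch on an in-memory corpus. The three strings are invented and only stand in for the text files of three models; unlike the cell above, the vectorizer receives the text directly instead of file names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny in-memory "documents" standing in for the text files of three models
toy_docs = [
    "state transition event",
    "state machine transition",
    "book author library",
]

# min_df=2 drops every word that occurs in fewer than two documents
toy_vectorizer = TfidfVectorizer(min_df=2)
toy_X = toy_vectorizer.fit_transform(toy_docs)

print(sorted(toy_vectorizer.vocabulary_))  # words that survived the min_df filter
print(toy_X.shape)                         # (number of documents, vocabulary size)
```

Only `state` and `transition` occur in two documents, so the vocabulary has two entries and the third document is represented by an all-zero row.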

## Training

In [None]:
from sklearn.neural_network import MLPClassifier

# A single hidden layer with 64 units; the trailing comma makes (64,) a tuple
clf = MLPClassifier(solver='adam', learning_rate_init=0.01, hidden_layer_sizes=(64,), random_state=1)
clf.fit(X, train_y)

## Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

First, we evaluate the classifier on the training set.

In [None]:
predict_train = clf.predict(X)
# print(confusion_matrix(train_y, predict_train))
train_report = classification_report(train_y, predict_train, output_dict = True)
print("Training accuracy: ", train_report['accuracy'])

Then, we evaluate the classifier on the test set.

In [None]:
predict_test = clf.predict(T)
test_report = classification_report(test_y, predict_test, output_dict = True)
print("Test accuracy: ", test_report['accuracy'])
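
Accuracy alone can hide poor results on small categories. The sketch below uses made-up labels (not ModelSet results) to show how a confusion matrix and the macro-averaged F1 score can be read; the same calls could be applied to `test_y` and `predict_test`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up ground truth and predictions for three hypothetical categories
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = ["a", "a", "b", "b", "b", "c"]
toy_labels = ["a", "b", "c"]

# Rows are the true categories, columns the predicted ones, in toy_labels order
cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(cm)

# Macro averaging gives every category the same weight, regardless of its size
macro_f1 = f1_score(toy_true, toy_pred, labels=toy_labels, average="macro")
print("Macro F1: ", macro_f1)
```

Off-diagonal entries of the matrix show which categories are confused with each other; here one model of category `a` is predicted as `b`.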