# ModelSet - Basic tutorial

ModelSet is a dataset of software models originally intended to help in the application of machine learning techniques to solve modelling tasks.

In this tutorial we will explain how to load the dataset and extract basic features to perform a classification task: inferring the category of a model.

## Installation

First of all, you need to download and install ModelSet. 

 1. Download the package containing the raw models and the associated databases, available at http://modelset.github.io/download/current.
 2. Unzip the package.
 3. Make sure that the `MODELSET_HOME` variable (see below) points to the location where you unzipped the package.
 4. Install the Python library using pip:
    * `pip install modelset-py`
    * If you have downloaded the source code of the library from http://github.com/modelset/modelset-py,
      then use `sys.path.append("/path/to/modelset-py/src")` as a shortcut to load it dynamically.
    



In [None]:
# Change this path to match your local installation
MODELSET_HOME="/path/to/modelset"
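
A wrong path is an easy mistake to make at this point, so it can be worth checking it before loading anything. The sketch below is only a convenience: `is_existing_directory` is a hypothetical helper, not part of the ModelSet library, and it only verifies that the directory exists, not that it contains a valid ModelSet package.

```python
import os

def is_existing_directory(path):
    """Hypothetical helper: True if `path` is an existing directory."""
    return os.path.isdir(path)

# Placeholder path, as in the cell above
MODELSET_HOME = "/path/to/modelset"

if not is_existing_directory(MODELSET_HOME):
    print(f"MODELSET_HOME does not point to a directory: {MODELSET_HOME}")
```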

## Loading the dataset

The ModelSet library offers a convenient interface to dump the contents of the underlying database into a dataframe. In particular, there are several features available in the output dataframe:

 * The identifier of the model
 * The category of the model (manually labelled). Reflects the domain of the model.
 * Associated tags (zero or more manually labelled) which provide additional insights about the type of model.
 * The language of the model (typically English)
 * Basic stats. In the case of Ecore, number of elements, references, classes, attributes, packages, enumerations and datatypes


In [None]:
import sys
import pandas as pd
import os

#sys.path.append("/path/to/modelset/modelset-py/src")
import modelset.dataset as ds

In [None]:
dataset = ds.load(MODELSET_HOME, modeltype = 'ecore', selected_analysis = ['stats'])
# If you don't need the stats, you can speed up loading with: ds.load(MODELSET_HOME, modeltype = 'ecore')

Convert the dataset into a Pandas dataframe. There are two methods:
 * `to_df()` converts the complete dataset.
 * `to_normalized_df()` only keeps categories with a minimum number of examples (7 by default) and models written in English, and removes the special categories (`dummy` and `unknown`).
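
To make the effect of the normalization concrete, here is a rough pandas sketch of that kind of filtering. It is not the library's implementation: the `normalize` function, the `language` column name, the order of the filters and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

def normalize(df, min_per_category=7, languages=("english",),
              remove_categories=("dummy", "unknown")):
    """Illustrative filter: keep selected languages, drop special
    categories, then drop categories with too few examples."""
    kept = df[df["language"].isin(languages)]
    kept = kept[~kept["category"].isin(remove_categories)]
    counts = kept["category"].value_counts()
    frequent = counts[counts >= min_per_category].index
    return kept[kept["category"].isin(frequent)]

# Toy data (made up): 'library' is too rare, 'dummy' is a special category
toy_df = pd.DataFrame({
    "id": ["m1", "m2", "m3", "m4", "m5", "m6", "m7"],
    "category": ["statemachine", "statemachine", "statemachine",
                 "library", "dummy", "dummy", "statemachine"],
    "language": ["english", "english", "english",
                 "english", "english", "english", "french"],
})

toy_normalized = normalize(toy_df, min_per_category=2)
print(toy_normalized)
```

Only the three English `statemachine` models survive in this toy case.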

In [None]:
modelset_df = dataset.to_normalized_df()
# You can configure the elements of the dataframe:
# modelset_df = dataset.to_normalized_df(min_ocurrences_per_category = 7, languages = ['english'], remove_categories = ['dummy', 'unknown'])

In [None]:
modelset_df
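
Before training, it can help to look at how the models are distributed over categories. A minimal sketch on a toy dataframe that only mimics the `id` and `category` columns used later in this tutorial (the identifiers and categories below are made up):

```python
import pandas as pd

# Toy stand-in for modelset_df, with made-up ids and categories
toy_df = pd.DataFrame({
    "id": ["m1", "m2", "m3", "m4", "m5"],
    "category": ["statemachine", "library", "statemachine", "petrinet", "library"],
})

# Number of models per category, most frequent first
category_counts = toy_df["category"].value_counts()
print(category_counts)
```

On the real dataframe, the same call would be `modelset_df['category'].value_counts()`.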

## Splitting the dataset

To train our model we are interested in the `category` attribute, which will be our target variable (the label that we want to predict). As input data we use the model identifiers, because they let us look up the corresponding textual representation of each model (see below).

We need to split our dataset into a training set and a test set, so that we can later evaluate the accuracy of our model.

In [None]:
from sklearn.model_selection import train_test_split

# Each of these is a pandas Series (a single column of the dataframe)
ids     = modelset_df['id']
labels  = modelset_df['category']

train_X, test_X, train_y, test_y = train_test_split(ids, labels, test_size=0.2, random_state=42)
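
When some categories are small, a purely random split can leave them under-represented in the test set. One option, not used in this tutorial, is the `stratify` parameter of `train_test_split`, which keeps the class proportions roughly equal in both parts. A minimal sketch on made-up identifiers and labels:

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 made-up model ids, two balanced categories
toy_ids = [f"model_{i}" for i in range(10)]
toy_labels = ["statemachine"] * 5 + ["library"] * 5

toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_ids, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(len(toy_train_X), len(toy_test_X))
print(sorted(toy_test_y))
```

With two test examples and two equally sized categories, the stratified test set contains one example of each.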

## Selecting features

A neural network takes a numerical vector as input, so we need a way to encode a model into a vector. A simple way is to use a TF-IDF encoding. Essentially, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) measures the relevance of a word by weighing the number of times that the word appears in a document against the number of documents in which the word appears: words that are frequent in one document but rare across the collection get the highest scores.
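
To see this weighting at work, the sketch below runs scikit-learn's `TfidfVectorizer` with its default settings on three made-up documents. The word `state` appears in every document, so it receives a lower inverse document frequency than `machine`, which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "documents"; 'state' occurs in all of them
toy_docs = [
    "state machine transition",
    "state class attribute",
    "state package",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Map each word to its inverse document frequency (IDF) weight
idf_of = {word: toy_vectorizer.idf_[col]
          for word, col in toy_vectorizer.vocabulary_.items()}

print("idf(state)   =", idf_of["state"])
print("idf(machine) =", idf_of["machine"])
print("matrix shape =", toy_matrix.shape)
```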

To apply TF-IDF, the first thing that we need to do is to extract a textual representation of each model. We use the `txt_file` method to obtain the path to the text file associated with a given model. This is a feature provided by ModelSet: for each model there is already a `.txt` file which contains its 1-grams (i.e., the values of the string attributes).

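Then, we can easily compute the TF-IDF using scikit-learn. The `X` and `T` matrices contain one row per model (training and test models, respectively) and as many columns as there are words in the vocabulary learned from the training models. With `min_df = 2`, words that appear in fewer than two training files are left out of the vocabulary.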
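
The following sketch mimics this step on temporary text files, so it can be run without ModelSet. The file contents and the `write_docs` helper are made up for illustration; only `TfidfVectorizer(input='filename', min_df=2)` matches the configuration used in this tutorial.

```python
import os
import tempfile
from sklearn.feature_extraction.text import TfidfVectorizer

def write_docs(folder, prefix, docs):
    """Hypothetical helper: write each text to its own file, return the paths."""
    paths = []
    for i, text in enumerate(docs):
        path = os.path.join(folder, f"{prefix}_{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    toy_train = write_docs(tmp, "train", ["state machine transition",
                                          "state machine",
                                          "class attribute"])
    toy_test = write_docs(tmp, "test", ["state diagram"])

    toy_vectorizer = TfidfVectorizer(input="filename", min_df=2)
    X_toy = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
    T_toy = toy_vectorizer.transform(toy_test)       # reuses that vocabulary

print(sorted(toy_vectorizer.vocabulary_))
print(X_toy.shape, T_toy.shape)
```

Only `state` and `machine` appear in at least two training files, so both matrices have two columns. The word `diagram` was never seen during fitting and is ignored in the test matrix.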

In [None]:
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer

# Use model_id rather than id to avoid shadowing the Python builtin
train_filenames = [ dataset.txt_file(model_id) for model_id in train_X ]
test_filenames  = [ dataset.txt_file(model_id) for model_id in test_X ]

vectorizer = TfidfVectorizer(input='filename', min_df = 2)
X = vectorizer.fit_transform(train_filenames)
T = vectorizer.transform(test_filenames)

In [None]:
# The output of the TF-IDF vectorization is a large matrix with len(train_X) rows and as many columns as words in the vocabulary
X.shape

## Training

We use a neural network with one hidden layer as our model. This is straightforward with scikit-learn.

In [None]:
from sklearn.neural_network import MLPClassifier

#input_layer = X.shape[1]
# hidden_layer_sizes expects a tuple: (64,) means one hidden layer with 64 units
clf = MLPClassifier(solver='adam', learning_rate_init=0.01, hidden_layer_sizes=(64,), random_state=1)
clf.fit(X, train_y)
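
To classify a new piece of text, it has to go through the same vectorizer that was fitted on the training data before reaching the classifier. One way to keep the two steps together, not used in this tutorial, is a scikit-learn pipeline. The sketch below trains such a pipeline on a handful of made-up texts and labels; the small hidden layer and the `max_iter` value are arbitrary choices for the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Made-up texts and categories, standing in for the model text files
toy_texts = [
    "state transition event trigger",
    "state machine region transition",
    "book author library member",
    "library loan book copy",
]
toy_labels = ["statemachine", "statemachine", "library", "library"]

# Vectorizer and classifier chained in a single estimator
toy_pipeline = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=1),
)
toy_pipeline.fit(toy_texts, toy_labels)

# Raw text goes in, a predicted category comes out
toy_prediction = toy_pipeline.predict(["state transition event"])
print(toy_prediction)
```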

## Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

First, we evaluate the results obtained in the training set. In particular, we focus on the accuracy (the fraction of correctly classified examples).

In [None]:
predict_train = clf.predict(X)
# print(confusion_matrix(train_y, predict_train))
train_report = classification_report(train_y, predict_train, output_dict = True)
print("Training accuracy: ", train_report['accuracy'])
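
As a sanity check of what the `accuracy` entry of the report means, the sketch below computes it by hand on made-up labels and compares it with scikit-learn's values.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true and predicted labels: 4 out of 5 are correct
toy_true = ["a", "a", "b", "b", "c"]
toy_pred = ["a", "b", "b", "b", "c"]

# Fraction of correctly classified examples, computed by hand
manual_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

toy_report = classification_report(toy_true, toy_pred, output_dict=True)

print(manual_accuracy)
print(accuracy_score(toy_true, toy_pred))
print(toy_report["accuracy"])
```

All three values are 0.8 for this toy example.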

Then, we evaluate the classifier on the test set. As can be seen, the results are good, so in principle we can assume that the model works well and can be used in practice.

In [None]:
predict_test = clf.predict(T)
test_report = classification_report(test_y, predict_test, output_dict = True)
print("Test accuracy: ", test_report['accuracy'])
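
Accuracy is a single number and does not show which categories get confused with each other. The `confusion_matrix` function imported earlier gives that breakdown. A minimal sketch on made-up labels, where rows are true categories and columns are predicted ones:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels
toy_true = ["a", "a", "b", "b", "c"]
toy_pred = ["a", "b", "b", "b", "c"]
toy_categories = ["a", "b", "c"]

# Entry [i][j] counts examples of category i predicted as category j
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_categories)
print(toy_cm)
```

Here the only off-diagonal entry is in the first row, second column: one example of category `a` was predicted as `b`. On the tutorial's data, the equivalent call would be `confusion_matrix(test_y, predict_test)`.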