<a href="https://colab.research.google.com/github/p82maavd/MIML/blob/main/src/miml/tutorial/data_miml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install the library in the current environment
!pip install mimllearning

Collecting mimllearning
  Downloading mimllearning-1.0.11-py3-none-any.whl (966 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 966.1/966.1 kB 10.5 MB/s eta 0:00:00
Installing collected packages: mimllearning
Successfully installed mimllearning-1.0.11


# Load and show dataset

A dataset can be loaded in three ways:
*   From the current path
*   From a given path
*   From the library

Once loaded, the dataset can be shown in two modes:

*   Table mode
*   CSV mode






In [2]:
from miml.data import load_dataset

# Different ways to load a dataset
# dataset = load_dataset("miml_birds.csv")
# dataset = load_dataset("C:/Users/Damián/Downloads/miml_birds.arff")
# Load dataset from library
dataset = load_dataset("toy.arff", from_library=True)
print("Show dataset in table mode")
print("--------------------------")
dataset.show_dataset(mode="table")
print("")
print("Show dataset in csv mode")
print("------------------------")
dataset.show_dataset(mode="csv")

Show dataset in table mode
--------------------------
+--------+------+------+------+----------+----------+----------+----------+
|  bag1  |  f1  |  f2  |  f3  |  label1  |  label2  |  label3  |  label4  |
|   0    |  42  | -198 | -109 |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
|   1    | 41.9 | -191 | -142 |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
|   2    |  35  | 14.2 | 6.33 |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
+--------+-------+------+------+----------+----------+----------+----------+
|  bag2  |  f1   |  f2  |  f3  |  label1  |  label2  |  label3  |  label4  |
|   0    | 11.25 | -98  |  10  |    0     |    1     |    1     |    0     |
+--------+-------+------+------+----------+----------+----------+----------+
|   1    |  31   | 40.5 | 7.85

# Metrics of dataset
Print the multi-label and multi-instance metrics of the dataset.

In [3]:
# Shows dataset metrics
dataset.describe()

-----MULTILABEL-----
Cardinality:  2.0
Density:  0.5
Distinct:  0.125

-----MULTIINSTANCE-----
Nº of bags:  2
Total instances:  5
Average Instances per bag:  2.5
Min Instances per bag:  2
Max Instances per bag:  3
Features per bag:  3
Labels per bag:  4
Attributes per bag:  7

Distribution of bags:
	Bags with  2  instances:  1
	Bags with  3  instances:  1
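
The figures above can be reproduced by hand from the toy dataset. The sketch below is not the library's implementation; it uses plain Python, with the label vectors and bag sizes copied from the table output. The definition of "Distinct" (distinct label combinations divided by 2 to the number of labels) is an assumption that happens to match the reported 0.125.

```python
# Bag-level label vectors and bag sizes copied from the toy dataset output.
bag_labels = {
    "bag1": (1, 0, 0, 1),
    "bag2": (0, 1, 1, 0),
}
bag_sizes = {"bag1": 3, "bag2": 2}

n_bags = len(bag_labels)
n_labels = len(next(iter(bag_labels.values())))

# Cardinality: average number of active labels per bag.
cardinality = sum(sum(v) for v in bag_labels.values()) / n_bags
# Density: cardinality normalized by the number of labels.
density = cardinality / n_labels
# Distinct (assumed definition): distinct label combinations / 2**n_labels.
distinct = len(set(bag_labels.values())) / 2 ** n_labels

# Multi-instance counts.
total_instances = sum(bag_sizes.values())
avg_instances = total_instances / n_bags

print(cardinality, density, distinct)
print(total_instances, avg_instances, min(bag_sizes.values()), max(bag_sizes.values()))
```

With two bags of 3 and 2 instances, each carrying 2 of the 4 labels, this gives the same cardinality (2.0), density (0.5) and average instances per bag (2.5) as `describe()`.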


# Manage MIMLDataset, Bag and Instance class objects
The following cells show methods for managing the data class objects. The 'start' and 'end' parameters of show_dataset select the range of bags to print; here, start=0 and end=1 print only the first bag.

In [4]:
import numpy

# Add a new attribute to the dataset and modify one of its attributes
dataset.add_attribute(name="new_feature", position=2, values=numpy.random.rand(dataset.get_number_instances()))
dataset.set_attribute(bag=0, index_instance=0, attribute=2, value=3.13)
dataset.show_dataset(start=0, end=1)

# Delete the attribute added before
dataset.delete_attribute(position=2)
dataset.show_dataset(start=0, end=1)

+--------+------+------+---------------+------+----------+----------+----------+----------+
|  bag1  |  f1  |  f2  |  new_feature  |  f3  |  label1  |  label2  |  label3  |  label4  |
|   0    |  42  | -198 |     3.13      | -109 |    1     |    0     |    0     |    1     |
+--------+------+------+---------------+------+----------+----------+----------+----------+
|   1    | 41.9 | -191 |   0.995497    | -142 |    1     |    0     |    0     |    1     |
+--------+------+------+---------------+------+----------+----------+----------+----------+
|   2    |  35  | 14.2 |   0.854059    | 6.33 |    1     |    0     |    0     |    1     |
+--------+------+------+---------------+------+----------+----------+----------+----------+
+--------+------+------+------+----------+----------+----------+----------+
|  bag1  |  f1  |  f2  |  f3  |  label1  |  label2  |  label3  |  label4  |
|   0    |  42  | -198 | -109 |    1     |    0     |    0     |    1     |
+--------+------+------+------+-----
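
To make the effect of the `position` argument concrete, here is a minimal sketch in plain Python of inserting and deleting a column in a table of instances. The helpers `insert_column` and `delete_column` are hypothetical names for illustration only and say nothing about how the library stores its data; the rows and new values are copied from the output above.

```python
# Instances of bag1 from the toy dataset: f1, f2, f3, label1..label4.
rows = [
    [42, -198, -109, 1, 0, 0, 1],
    [41.9, -191, -142, 1, 0, 0, 1],
    [35, 14.2, 6.33, 1, 0, 0, 1],
]


def insert_column(rows, position, values):
    """Return new rows with values[i] inserted at `position` in row i."""
    if len(values) != len(rows):
        raise ValueError("one value per instance is required")
    return [row[:position] + [value] + row[position:]
            for row, value in zip(rows, values)]


def delete_column(rows, position):
    """Return new rows without the column at `position`."""
    return [row[:position] + row[position + 1:] for row in rows]


new_values = [3.13, 0.995497, 0.854059]  # values shown in the output above
with_feature = insert_column(rows, 2, new_values)
restored = delete_column(with_feature, 2)

print(with_feature[0])
print(restored == rows)
```

Inserting at position 2 places the new column between f2 and f3, and deleting the same position restores the original table.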

The next cell shows how to add instances and bags to a MIMLDataset object.

In [5]:
from miml.data import Instance, Bag

# Creation and modification of an instance
values = [38, 62, 5.09, 1, 0, 0, 1]
instance = Instance(values)
instance.set_attribute(attribute=2, value=74)
instance.show_instance()
print("")

# Add an instance to the dataset
dataset.add_instance(bag=0, instance=instance)


# Create a bag and add it to the dataset
bag=Bag("bag3")
bag.add_instance(instance)
dataset.add_bag(bag)
dataset.show_dataset()


+----+----+----+---+---+---+---+
| 38 | 62 | 74 | 1 | 0 | 0 | 1 |
+----+----+----+---+---+---+---+

+--------+------+------+------+----------+----------+----------+----------+
|  bag1  |  f1  |  f2  |  f3  |  label1  |  label2  |  label3  |  label4  |
|   0    |  42  | -198 | -109 |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
|   1    | 41.9 | -191 | -142 |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
|   2    |  35  | 14.2 | 6.33 |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
|   3    |  38  |  62  |  74  |    1     |    0     |    0     |    1     |
+--------+------+------+------+----------+----------+----------+----------+
+--------+-------+------+------+----------+----------+----------+----------+
|  bag2  |  f1   |  f2  |  f3  |  label1  |  label2  |  label3 

Iterate over all bags and instances of the dataset.

In [6]:
# Show all bags in the dataset
for bag_index in range(dataset.get_number_bags()):

    # Retrieve a bag by its index
    bag = dataset.get_bag(bag_index)
    print("Bag:", bag.key)
    print("\tNumInstances:", bag.get_number_instances())
    print("\tNumFeatures:", bag.get_number_features())
    print("\tNumLabels:", bag.get_number_labels())
    print("\tNumAttributes:", bag.get_number_attributes())

    # Show all instances in the bag
    for instance_index in range(bag.get_number_instances()):
        # Retrieve an instance by bag key and instance index
        instance = dataset.get_instance(bag.key, instance_index)
        print("\t\tInstance:", instance_index, "NumAttributes:", instance.get_number_attributes())
        # Show every attribute (features followed by labels) of the instance
        for attribute_index in range(instance.get_number_attributes()):
            print("\t\t\tAttribute", attribute_index, ":", instance.get_attribute(attribute=attribute_index))

Bag: bag1
	NumInstances: 4
	NumFeatures: 3
	NumLabels: 4
	NumAttributes: 7
		Instance: 0 NumAttributes: 7
			Attribute 0 : 42.0
			Attribute 1 : -198.0
			Attribute 2 : -109.0
			Attribute 3 : 1.0
			Attribute 4 : 0.0
			Attribute 5 : 0.0
			Attribute 6 : 1.0
		Instance: 1 NumAttributes: 7
			Attribute 0 : 41.9
			Attribute 1 : -191.0
			Attribute 2 : -142.0
			Attribute 3 : 1.0
			Attribute 4 : 0.0
			Attribute 5 : 0.0
			Attribute 6 : 1.0
		Instance: 2 NumAttributes: 7
			Attribute 0 : 35.0
			Attribute 1 : 14.2
			Attribute 2 : 6.33
			Attribute 3 : 1.0
			Attribute 4 : 0.0
			Attribute 5 : 0.0
			Attribute 6 : 1.0
		Instance: 3 NumAttributes: 7
			Attribute 0 : 38.0
			Attribute 1 : 62.0
			Attribute 2 : 74.0
			Attribute 3 : 1.0
			Attribute 4 : 0.0
			Attribute 5 : 0.0
			Attribute 6 : 1.0
Bag: bag2
	NumInstances: 2
	NumFeatures: 3
	NumLabels: 4
	NumAttributes: 7
		Instance: 0 NumAttributes: 7
			Attribute 0 : 11.25
			Attribute 1 : -98.0
			Attribute 2 : 10.0
			Attribute 3 : 0.

# Split Datasets

Examples of the two partitioning methods available in the library.

In [7]:
dataset = load_dataset("miml_birds.arff", from_library=True)

## Train-Test

In [8]:
dataset_train, dataset_test = dataset.split_dataset(train_percentage=0.8, seed=5)

print("Nº of bags: ", dataset_train.get_number_bags(), dataset_train.data.keys())
print("Nº of bags: ", dataset_test.get_number_bags(), dataset_test.data.keys())

Nº of bags:  206 dict_keys(['326', '472', '526', '554', '489', '20', '561', '297', '516', '631', '429', '359', '588', '253', '96', '419', '422', '424', '425', '428', '431', '432', '435', '436', '438', '444', '446', '452', '454', '459', '461', '465', '470', '471', '475', '480', '481', '482', '487', '488', '490', '491', '496', '497', '504', '505', '506', '508', '509', '512', '513', '524', '528', '529', '532', '533', '534', '537', '538', '539', '546', '551', '552', '553', '555', '556', '562', '568', '570', '571', '576', '577', '578', '579', '582', '584', '585', '589', '591', '592', '595', '596', '598', '604', '610', '611', '613', '614', '619', '620', '621', '625', '630', '632', '634', '635', '640', '641', '642', '2', '4', '5', '12', '13', '18', '19', '24', '27', '28', '30', '31', '32', '33', '38', '39', '40', '43', '44', '45', '47', '49', '50', '51', '55', '64', '69', '70', '71', '72', '73', '82', '87', '88', '90', '97', '100', '102', '103', '110', '113', '114', '116', '117', '125', '129'
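
The split works at the level of bags: every instance of a bag ends up on the same side. A minimal sketch of that idea, assuming a plain seeded shuffle, is shown below. The function `split_bag_keys` and the bag keys are hypothetical, and the library may partition differently (for example, by stratifying on labels), so the resulting partition is not expected to match the output above.

```python
import random


def split_bag_keys(keys, train_percentage, seed):
    """Shuffle bag keys reproducibly and cut them into train and test lists."""
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percentage)
    return shuffled[:cut], shuffled[cut:]


bag_keys = [str(i) for i in range(10)]  # hypothetical bag keys
train_keys, test_keys = split_bag_keys(bag_keys, train_percentage=0.8, seed=5)

print(len(train_keys), len(test_keys))
```

Because the random generator is seeded, calling the function again with the same seed yields the same partition.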

## Cross Validation K-Fold


In [9]:
datasets_train, datasets_test = dataset.split_dataset_cv(folds=3, seed=5)

for dataset_train, dataset_test in zip(datasets_train, datasets_test):
    print("Nº of bags: ", dataset_train.get_number_bags(), dataset_train.data.keys())
    print("Nº of bags: ", dataset_test.get_number_bags(), dataset_test.data.keys())
    print("")

Nº of bags:  177 dict_keys(['326', '472', '526', '554', '489', '20', '561', '297', '516', '631', '429', '359', '588', '253', '96', '635', '640', '641', '642', '2', '4', '5', '12', '13', '18', '19', '24', '27', '28', '30', '31', '32', '33', '38', '39', '40', '43', '44', '45', '47', '49', '50', '51', '55', '64', '69', '70', '71', '72', '73', '82', '87', '88', '90', '97', '100', '102', '103', '110', '113', '114', '116', '117', '125', '129', '130', '132', '138', '140', '144', '150', '151', '154', '160', '161', '165', '167', '168', '169', '175', '176', '181', '182', '183', '186', '187', '190', '191', '192', '193', '194', '196', '198', '199', '200', '201', '203', '204', '205', '206', '208', '209', '210', '213', '214', '223', '227', '228', '229', '230', '231', '237', '238', '244', '245', '246', '248', '254', '255', '256', '258', '260', '261', '265', '268', '272', '273', '274', '275', '279', '281', '285', '287', '289', '290', '295', '301', '306', '307', '308', '311', '317', '318', '323', '325'
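
In k-fold cross validation, each bag is used for testing in exactly one fold and for training in the others. The following sketch illustrates this with a seeded shuffle over hypothetical bag keys. `k_fold_bag_keys` is an illustrative helper, not the library's method, and the library may assign bags to folds differently.

```python
import random


def k_fold_bag_keys(keys, folds, seed):
    """Return a list of (train_keys, test_keys) pairs, one per fold."""
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every `folds`-th key starting at offset i.
    test_folds = [shuffled[i::folds] for i in range(folds)]
    splits = []
    for i, test in enumerate(test_folds):
        # Training keys are all keys from the remaining folds.
        train = [k for j, fold in enumerate(test_folds) if j != i for k in fold]
        splits.append((train, test))
    return splits


bag_keys = [str(i) for i in range(10)]  # hypothetical bag keys
splits = k_fold_bag_keys(bag_keys, folds=3, seed=5)

for train_keys, test_keys in splits:
    print(len(train_keys), len(test_keys))
```

With 10 bags and 3 folds, the test folds hold 4, 3 and 3 bags, and together they cover every bag once.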

# Save modified datasets

Example of how a dataset can be saved in ARFF and CSV formats.

In [10]:
dataset_train.save_arff("dataset_train.arff")
dataset_test.save_csv("dataset_test.csv")

# Check that the files were created
!ls -l

# Preview the first lines of the saved CSV file
!head -n 10 dataset_test.csv

total 672
-rw-r--r-- 1 root root 243042 Jun 10 02:22 dataset_test.csv
-rw-r--r-- 1 root root 435909 Jun 10 02:22 dataset_train.arff
drwxr-xr-x 1 root root   4096 Jun  6 14:21 sample_data
19
id,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,BRCR,PAWR,PSFL,RBNU,DEJU,OSFL,HETH,CBCH,VATH,HEWA,SWTH,HAFL,WETA,BHGB,GCKI,WAVI,MGWA,STJA,CONI
203,0.99215,0.996107,0.528552,0.552838,0.028449,0.057087,0.002997,-0.002518,0.002796,0.006582,0.592157,0.846645,36.9232,24.590559,47.0,254.0,207.0,314.0,23984.0,1464.0,89.36358,0.368996,0.083177,0.055451,0.048028,0.054557,0.0652,0.082372,0.067078,0.055809,0.052053,0.053484,0.061086,0.072981,0.05402,0.048386,0.057061,0.089259,0,0,1,0,1,0,1,1,0,1,1,0,0,0,0,0,0,0,0
203,0.991347,0.994875,0.534398,0.496746,0.022715,0.050038,-0.001589,0.002177,0.001059,0.005607,0.658824,0.412245,32.137417,19.035416,50.0,202.0,152.0,246.0,20296.0,1413.0,98.372536,0.54279,0.074853,0.06428
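
Judging from the preview above, the saved CSV appears to start with a line holding the number of labels (19 here), followed by a header row and one row per instance, where the first column is the bag id and the last columns are the labels. The sketch below parses that layout with the standard library only. The layout is inferred from this output rather than taken from the library's documentation, `read_miml_csv` is a hypothetical helper, and the sample text is a small illustration built from the toy dataset shown earlier.

```python
import csv
import io
from collections import defaultdict

# Small sample in the assumed layout (values taken from the toy dataset).
sample = """4
id,f1,f2,f3,label1,label2,label3,label4
bag1,42,-198,-109,1,0,0,1
bag1,41.9,-191,-142,1,0,0,1
bag1,35,14.2,6.33,1,0,0,1
bag2,11.25,-98,10,0,1,1,0
"""


def read_miml_csv(text):
    """Group instance rows by bag id, assuming the layout described above."""
    lines = io.StringIO(text)
    n_labels = int(next(lines).strip())  # first line: number of labels
    reader = csv.reader(lines)
    header = next(reader)                # second line: column names
    bags = defaultdict(list)
    labels = {}
    for row in reader:
        if not row:
            continue
        key = row[0]
        values = [float(v) for v in row[1:]]
        bags[key].append(values[:-n_labels])                  # feature values
        labels[key] = [int(v) for v in values[-n_labels:]]    # bag-level labels
    return header, dict(bags), labels


header, bags, labels = read_miml_csv(sample)
print({key: len(instances) for key, instances in bags.items()})
print(labels)
```

Grouping rows by the id column recovers the bag structure, since consecutive rows with the same id (such as 203 above) belong to the same bag.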