# SMuDGE tutorial : the sample types 

In this tutorial, we will learn how to generate a multiview dataset presenting :

* redundancy, 
* complementarity and
* mutual error. 

## Definitions

In this tutorial, we will denote a sample as 

* **Redundant** if all the views have enough information to classify it correctly without collaboration, 
* **Complementary** if only some of the views have enough information to classify it correctly without collaboration. Such samples are useful to assess a classifier's ability to extract the relevant information among the views.
* Part of the **Mutual Error** if none of the views has enough information to classify it correctly without collaboration. A multiview classifier able to classify these samples can gather information from several features in different views and combine it.
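
The three definitions above can be read as a simple rule on per-view correctness. Here is a minimal sketch, assuming we already have a boolean matrix telling whether each view, used alone, classifies each sample correctly. The function name `sample_type` and its input are hypothetical and are not part of the generator's API.

```python
import numpy as np

def sample_type(view_is_correct):
    """Label each sample from a (n_samples, n_views) boolean matrix.

    view_is_correct[i, j] is True when view j alone classifies sample i
    correctly (hypothetical input, not produced by the generator).
    """
    view_is_correct = np.asarray(view_is_correct, dtype=bool)
    n_correct = view_is_correct.sum(axis=1)   # views that succeed, per sample
    n_views = view_is_correct.shape[1]
    # Default: only some of the views succeed.
    labels = np.full(len(n_correct), "complementary", dtype=object)
    labels[n_correct == n_views] = "redundant"   # every view succeeds
    labels[n_correct == 0] = "mutual_error"      # no view succeeds
    return labels

toy = [[True, True, True, True],
       [True, False, False, True],
       [False, False, False, False]]
print(list(sample_type(toy)))
```

On this toy matrix, the first sample is labelled redundant, the second complementary and the third mutual error.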




## Hands on experience : initialization 

We will initialize the arguments as earlier : 

In [1]:
from multiview_generator.multiple_sub_problems import MultiViewSubProblemsGenerator
from tabulate import tabulate
import numpy as np
import os

random_state = np.random.RandomState(42)
name = "tuto"
n_views = 4
n_classes = 3
error_matrix = [
   [0.4, 0.4, 0.4, 0.4],
   [0.55, 0.4, 0.4, 0.4],
   [0.4, 0.5, 0.52, 0.55]
]
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333,]

To control the three previously introduced characteristics, we have to provide three floats : 

In [2]:
complementarity = 0.3
redundancy = 0.2
mutual_error = 0.1

Now we can generate the dataset with the given configuration. 

In [3]:
generator = MultiViewSubProblemsGenerator(name=name, n_views=n_views, 
                                          n_classes=n_classes, 
                                          n_samples=n_samples, 
                                          n_features=n_features, 
                                          class_weights=class_weights, 
                                          error_matrix=error_matrix, 
                                          random_state=random_state, 
                                          redundancy=redundancy, 
                                          complementarity=complementarity, 
                                          mutual_error=mutual_error)

view_data, y = generator.generate_multi_view_dataset()

[array([399, 399, 399, 399]), array([299, 399, 399, 399]), array([399, 333, 319, 299])]
400.0
0 0 399 228
1 0 399 239
2 0 399 230
3 0 399 233
0 1 299 237
1 1 399 236
2 1 399 226
3 1 399 231
0 2 399 229
1 2 333 234
2 2 319 233
3 2 299 234
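
The first printed line appears consistent with the number of samples each view can classify correctly for each class, that is, the class size multiplied by one minus the error rate, rounded down. This reading (rows of `error_matrix` as classes, columns as views) is inferred from the printed numbers only, not from the generator's documentation. The sketch below reproduces the arithmetic:

```python
n_samples = 2000
class_weights = [0.333, 0.333, 0.333]
error_matrix = [
    [0.4, 0.4, 0.4, 0.4],
    [0.55, 0.4, 0.4, 0.4],
    [0.4, 0.5, 0.52, 0.55],
]

expected = []
for weight, class_errors in zip(class_weights, error_matrix):
    # Assumed class size: 2000 * 0.333, i.e. 666 samples per class.
    n_class = round(n_samples * weight)
    # Assumed count of samples the view gets right: floor(n_class * (1 - error)).
    expected.append([int(n_class * (1 - error)) for error in class_errors])

print(expected)
# [[399, 399, 399, 399], [299, 399, 399, 399], [399, 333, 319, 299]]
```

With 666 samples per class, the dataset holds 3 × 666 = 1998 samples, which matches the view shape reported in the SuMMIT log further down.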


Here, the generator distinguishes four types of samples: the three previously introduced, and the ones that were used to fill the rest of the dataset. 

## Dataset analysis using [SuMMIT](https://gitlab.lis-lab.fr/baptiste.bauvin/summit)

In order to differentiate them, we use `generator.example_ids`. This attribute holds an array with the ids of all the generated samples, each id characterizing the sample's type :

In [4]:
generator.example_ids[:10]

['Complementary_193_1',
 'redundancy_56_2',
 'Complementary_64_0',
 'redundancy_26_1',
 'Complementary_141_2',
 'example_5',
 'redundancy_54_1',
 'Complementary_157_1',
 'example_8',
 'example_9']

Here, we printed the first 10 ids, and we have : 

* the redundant samples tagged `redundancy_`,
* the mutual error ones tagged `mutual_error_` (none of them appears among these first 10 ids),
* the complementary ones tagged `Complementary_` and
* the filling ones tagged `example_`. 
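
To check how many samples of each type were generated, the ids can be counted by tag. This is a minimal sketch: the helper `count_sample_types` is hypothetical, and it is run here on the ten ids printed above rather than on the full `generator.example_ids`.

```python
from collections import Counter

# Tags as they appear in the ids printed above (note the capital "C").
TAGS = ("redundancy_", "mutual_error_", "Complementary_", "example_")

def count_sample_types(example_ids):
    """Count ids per tag; ids with an unknown prefix are counted as 'other'."""
    counts = Counter()
    for example_id in example_ids:
        for tag in TAGS:
            if example_id.startswith(tag):
                counts[tag.rstrip("_")] += 1
                break
        else:
            # No known tag matched this id.
            counts["other"] += 1
    return counts

first_ids = ['Complementary_193_1', 'redundancy_56_2', 'Complementary_64_0',
             'redundancy_26_1', 'Complementary_141_2', 'example_5',
             'redundancy_54_1', 'Complementary_157_1', 'example_8', 'example_9']
counts = count_sample_types(first_ids)
print(dict(counts))
# {'Complementary': 4, 'redundancy': 3, 'example': 3}
```

Passing `generator.example_ids` instead of `first_ids` should give the counts for the whole dataset.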

To visualize these properties, we will use SuMMIT with decision trees on each view. 

In [5]:
from multiview_platform.execute import execute  

generator.to_hdf5_mc('supplementary_material')
execute(config_path=os.path.join('supplementary_material','config_summit.yml'))


Start:	 Initializing monoview classifiers arguments
Done:	 Initializing monoview classifiers arguments
Start:	 Initializing monoview classifiers arguments
Done:	 Initializing monoview classifiers arguments
Start:	 Executing all the needed benchmarks
Start:	 Benchmark initialization
Done:	 Benchmark initialization
Start:	 monoview benchmark
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_1 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : decision_tree
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.014620825997553766[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for gener

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_2 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : adaboost
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.02157829000498168[s]
Start:	 Getting results

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_2 with

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_3 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : random_forest
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.04144220901071094[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_3 with random_forest.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_3	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- RandomForest with n_estimators : 10, max_dep

Done:	 Saving results
Done:	 monoview benchmark
Start:	 multiview arguments initialization
Done:	 multiview arguments initialization
Start:	 multiview benchmark
Done:	 multiview benchmark
Start:	 Analyzing all results
Start:	 Score graph generation for accuracy_score*
Done:	 Score graph generation for accuracy_score*
Start:	 Label analysis figure generation
locator: <matplotlib.ticker.FixedLocator object at 0x7f28541aedd8>
Using fixed locator on colorbar
Setting pcolormesh
Done:	 Label analysis figures generation
Done:	 Analyzing all results
Start:	 Benchmark initialization
Done:	 Benchmark initialization
Start:	 monoview benchmark
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_1 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : decision_tree
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:

Start:	 Saving preds
Classification on tuto for generated_view_2 with decision_tree.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_2	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- DecisionTree with max_depth : 3, criterion : gini, splitter : best, random_state : <mtrand.RandomState object at 0x7f2859ef0090>
	- Executed on 1 core(s) 


	For Accuracy score using {}, (higher is better) : 
		- Score on train : 0.5805805805805806
		- Score on test : 0.5225225225225225

Test set confusion matrix : 

╒═════════╤═══════════╤═══════════╤═══════════╕
│         │   label_1 │   label_2 │   label_3 │
╞═════════╪═══════════╪═══════════╪═══════════╡
│ label_1 │       166 │        81 │        86 │
├─────────┼───────────┼───────────┼───────────┤
│ label_2 │        70 │       202 │        61 │
├─────────┼───────────┼───────────┼───────────┤
│ label_3 │    

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_3 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : adaboost
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.019016871985513717[s]
Start:	 Getting results

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_3 wit

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_4 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : random_forest
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.03309051800169982[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_4 with random_forest.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_4	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- RandomForest with n_estimators : 10, max_dep

Start:	 Saving preds
Classification on tuto for generated_view_1 with random_forest.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_1	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- RandomForest with n_estimators : 10, max_depth : None, criterion : gini, random_state : <mtrand.RandomState object at 0x7f2859ef0240>
	- Executed on 1 core(s) 


	For Accuracy score using {}, (higher is better) : 
		- Score on train : 0.9669669669669669
		- Score on test : 0.43043043043043044

Test set confusion matrix : 

╒═════════╤═══════════╤═══════════╤═══════════╕
│         │   label_1 │   label_2 │   label_3 │
╞═════════╪═══════════╪═══════════╪═══════════╡
│ label_1 │       167 │       105 │        61 │
├─────────┼───────────┼───────────┼───────────┤
│ label_2 │       117 │       127 │        89 │
├─────────┼───────────┼───────────┼───────────┤
│ label_3

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_3 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : decision_tree
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.012263860990060493[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_3 with decision_tree.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_3	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- DecisionTree with max_depth : 3, criterion 

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_4 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : adaboost
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.017790345998946577[s]
Start:	 Getting results

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_4 wit

Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.03680643701227382[s]
Start:	 Getting results

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_1 with adaboost.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_1	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- Adaboost with n_estimators : 50, base_estimator : DecisionTreeClassifier
	- Executed on 1 core(s) 


	For Accuracy score using {}, (higher is better) : 
		- Score on train : 1.0
		- Score on test : 0.38538538538538536

Test set confusion matrix : 

╒═════════╤═══════════╤═══════════╤═══════════╕
│         │   label_1 │   label_2 │   label_3 │
╞═════════╪═══════════╪

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_2 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : random_forest
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.08894029300427064[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_2 with random_forest.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_2	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- RandomForest with n_estimators : 10, max_dep

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_4 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : decision_tree
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.01210012601222843[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_4 with decision_tree.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_4	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- DecisionTree with max_depth : 3, criterion :

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_1 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : adaboost
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.024542364990338683[s]
Start:	 Getting results

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.

Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_1 wit

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_2 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : random_forest
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.05181503898347728[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_2 with random_forest.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_2	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- RandomForest with n_estimators : 10, max_dep

Done:	 Saving results
Start:	 Loading data
Done:	 Loading data
Info:	 Classification - Database:tuto View:generated_view_4 train ratio:0.5, CrossValidation k-folds: 5, cores:1, algorithm : decision_tree
Start:	 Determine Train/Test split
Info:	 Shape X_train:(999, 3), Length of y_train:999
Info:	 Shape X_test:(999, 3), Length of y_test:999
Done:	 Determine Train/Test split
Start:	 Generate classifier args
Done:	 Generate classifier args
Start:	 Training
Done:	 Training
Start:	 Predicting
Done:	 Predicting
Info:	 Duration for training and predicting: 0.012050641991663724[s]
Start:	 Getting results
Done:	 Getting results
Start:	 Saving preds
Classification on tuto for generated_view_4 with decision_tree.

Database configuration : 
	- Database name : tuto
	- View name : generated_view_4	 View shape : (1998, 3)
	- Learning Rate : 0.5
	- Labels used : label_1, label_2, label_3
	- Number of cross validation folds : 5

Classifier configuration : 
	- DecisionTree with max_depth : 3, criterion 

Start:	 Score graph generation for accuracy_score*
Done:	 Score graph generation for accuracy_score*
Start:	 Label analysis figure generation
locator: <matplotlib.ticker.FixedLocator object at 0x7f28553bcf28>
Using fixed locator on colorbar
Setting pcolormesh
Done:	 Label analysis figures generation
Start:	 Score graph generation for accuracy_score*
Done:	 Score graph generation for accuracy_score*
Start:	 Label analysis figure generation
locator: <matplotlib.ticker.FixedLocator object at 0x7f28531a7400>
Using fixed locator on colorbar
Setting pcolormesh
Done:	 Label analysis figures generation
Start:	 Score graph generation for accuracy_score*
Done:	 Score graph generation for accuracy_score*
Start:	 Label analysis figure generation
locator: <matplotlib.ticker.FixedLocator object at 0x7f285329fef0>
Using fixed locator on colorbar
Setting pcolormesh
Done:	 Label analysis figures generation
Done:	 Analyzing all results
Start:	 Global label analysis figure generation
locator: <matplotlib

To extract the result, we need a small script that will fetch the right folder :

In [6]:
import os
from datetime import datetime
from IPython.display import display
from IPython.display import IFrame

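def fetch_latest_dir(experiment_directories, latest_date=datetime(1560,12,25,12,12)):
    # Return the name of the most recent experiment directory,
    # or None if no directory is more recent than latest_date.
    latest_experiment_dir = None
    for experiment_directory in experiment_directories:
        # The name is parsed as "<prefix>_<date fields>-<hour>_<minute>_..." :
        # date fields come before the "-", hour and minute right after it.
        experiment_time = experiment_directory.split("-")[0].split("_")[1:]
        experiment_time += experiment_directory.split('-')[1].split("_")[:2]
        dt = datetime(*map(int, experiment_time))
        if dt > latest_date:
            latest_date = dt
            latest_experiment_dir = experiment_directory
    return latest_experiment_dir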

experiment_directory = fetch_latest_dir(os.listdir(os.path.join('supplementary_material', 'tuto')))
error_fig_path = os.path.join('supplementary_material','tuto', experiment_directory, "error_analysis_2D.html")

IFrame(src=error_fig_path, width=900, height=500)


This graph represents the failures of each classifier on each sample: a black rectangle on row i, column j means that classifier j always failed to classify sample i. 
By [zooming in](link_to_gif), we can focus on a few samples and see that the sample types behave as defined: the mutual error ones are systematically misclassified by the decision trees, the redundant ones are well classified, and the complementary ones are correctly classified by only some of the views. 
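
The same observation can be made numerically by averaging failures per sample type. Below is a minimal sketch on toy data: the `failures` array (1 when a classifier failed on a sample) is a hypothetical stand-in for the figure's content, the helper name is an assumption, and the `mutual_error_0_0` id format is assumed by analogy with the other tags.

```python
import re
import numpy as np

def failure_rate_by_type(example_ids, failures):
    """Mean failure rate per id tag.

    failures is a hypothetical (n_samples, n_classifiers) array where 1 means
    the classifier failed on the sample, mirroring the figure's rows/columns.
    """
    failures = np.asarray(failures, dtype=float)
    # Strip the trailing numeric fields: "Complementary_193_1" -> "Complementary".
    tags = np.array([re.sub(r"(_\d+)+$", "", i) for i in example_ids])
    return {str(tag): float(failures[tags == tag].mean())
            for tag in np.unique(tags)}

ids = ["redundancy_0_0", "mutual_error_0_0",
       "Complementary_0_0", "Complementary_1_0"]
toy_failures = [[0, 0, 0, 0],   # redundant: no classifier fails
                [1, 1, 1, 1],   # mutual error: every classifier fails
                [1, 0, 0, 1],   # complementary: some classifiers fail
                [0, 1, 1, 1]]
rates = failure_rate_by_type(ids, toy_failures)
print(rates)
```

A rate of 0 for the redundant samples, 1 for the mutual error ones and an intermediate value for the complementary ones would match the behaviour described above.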
  

