# Multiview Dataset Generator Demo

Once you have [installed](link_ton_install) SMuDGE, you can run it from this notebook.

In [1]:
from multiview_generator.multiple_sub_problems import MultiViewSubProblemsGenerator
from tabulate import tabulate
import numpy as np

random_state = np.random.RandomState(42)

## Basic configuration

Suppose that you want to build a multiview dataset with 4 views and 3 classes:

In [2]:
name = "demo"
n_views = 4
n_classes = 3

To configure the dataset, you have to provide the error matrix, which gives the expected error of the Bayes classifier for Class i on View j as the value in row i, column j:

In [3]:
error_matrix = [
   [0.4, 0.4, 0.4, 0.4],
   [0.55, 0.4, 0.4, 0.4],
   [0.4, 0.5, 0.52, 0.55]
]
print(tabulate(error_matrix, tablefmt="grid"))

+------+-----+------+------+
| 0.4  | 0.4 | 0.4  | 0.4  |
+------+-----+------+------+
| 0.55 | 0.4 | 0.4  | 0.4  |
+------+-----+------+------+
| 0.4  | 0.5 | 0.52 | 0.55 |
+------+-----+------+------+
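
Before passing the matrix to the generator, it can help to check its layout. The following is a minimal sketch, not part of SMuDGE: it only verifies that the matrix has one row per class and one column per view, that each entry is a valid error rate, and shows how to read one entry.

```python
import numpy as np

n_views = 4
n_classes = 3
error_matrix = np.array([
    [0.4, 0.4, 0.4, 0.4],
    [0.55, 0.4, 0.4, 0.4],
    [0.4, 0.5, 0.52, 0.55],
])

# One row per class, one column per view.
assert error_matrix.shape == (n_classes, n_views)
# Each entry is an error rate, so it must lie in [0, 1].
assert np.all((error_matrix >= 0) & (error_matrix <= 1))

# Expected accuracy of the Bayes classifier for Class 2 on View 1
# (row index 1, column index 0, since NumPy indexing starts at 0).
expected_accuracy = 1 - error_matrix[1, 0]
print(round(expected_accuracy, 2))  # 0.45
```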


Once this has been defined, you can set all the other parameters of the dataset : 
* the number of samples, 
* the number of features of each view,
* the proportion of samples in each class.

In [4]:
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333,]
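
Note that these weights sum to 0.999 rather than 1. As a rough sketch, assuming that each class receives `n_samples` times its weight (an assumption about the allocation, not taken from SMuDGE's source code), the expected class sizes can be computed as follows:

```python
n_samples = 2000
class_weights = [0.333, 0.333, 0.333]

# Assumption: each class receives n_samples * weight samples.
# round() guards against floating-point noise in the product.
samples_per_class = [round(n_samples * weight) for weight in class_weights]
total_samples = sum(samples_per_class)

print(samples_per_class)  # [666, 666, 666]
print(total_samples)      # 1998
```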

## Generate the dataset

With the basic configuration done, we can generate the dataset :

In [5]:
generator = MultiViewSubProblemsGenerator(name=name, n_views=n_views, 
                                          n_classes=n_classes, 
                                          n_samples=n_samples, 
                                          n_features=n_features, 
                                          class_weights=class_weights, 
                                          error_matrix=error_matrix, 
                                          random_state=random_state)  

view_data, y = generator.generate_multi_view_dataset()

for view_index, view_datum in enumerate(view_data):
    print("View {} of shape {}".format(view_index+1, view_datum.shape))


[array([399, 399, 399, 399]), array([299, 399, 399, 399]), array([399, 333, 319, 299])]
400.0
View 1 of shape (1998, 3)
View 2 of shape (1998, 3)
View 3 of shape (1998, 3)
View 4 of shape (1998, 3)


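Here, we see that each view contains 1998 samples instead of the requested 2000: the class weights sum to 0.999, so each class receives 2000 × 0.333 = 666 samples and the three classes stay the same size.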
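
The first line printed by the generator is consistent with the number of samples that keep their correct label for each class and view, that is, 666 × (1 − error), rounded down. This reading is an assumption based on the arithmetic alone, not on SMuDGE's documentation. A minimal sketch that reproduces those numbers:

```python
import numpy as np

# Per-class sample count, as listed in the generated report.
samples_per_class = 666
error_matrix = np.array([
    [0.4, 0.4, 0.4, 0.4],
    [0.55, 0.4, 0.4, 0.4],
    [0.4, 0.5, 0.52, 0.55],
])

# Assumption: the number of correctly labelled samples is rounded down.
n_correct = np.floor(samples_per_class * (1 - error_matrix)).astype(int)
print(n_correct)
```

Each row of `n_correct` corresponds to a class and each column to a view.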

## Get a description of it

Now, if you wish to get information about the generated dataset, run : 

In [6]:
description = generator.gen_report(save=False)

This will generate a markdown report on the dataset. Here, we used `save=False` so the description is not saved in a file. 

To print it in this notebook, we use : 

In [7]:
from IPython.display import display, Markdown
display(Markdown(description))

# Generated dataset description

The dataset named `demo` has been generated by [SMuDGE](https://gitlab.lis-lab.fr/dev/multiview_generator) and is comprised of 

* 1998 examples, splitted in 
* 3 classes, described by 
* 4 views.

The input error matrix is 
 
|         |   View 1 |   View 2 |   View 3 |   View 4 |
|---------|----------|----------|----------|----------|
| Class 1 |     0.4  |      0.4 |     0.4  |     0.4  |
| Class 2 |     0.55 |      0.4 |     0.4  |     0.4  |
| Class 3 |     0.4  |      0.5 |     0.52 |     0.55 |

 The classes are balanced as : 

* Class 1 : 666 examples (33% of the dataset)
* Class 2 : 666 examples (33% of the dataset)
* Class 3 : 666 examples (33% of the dataset)

 The views have 

* 0.0% redundancy, 
* 0.0% mutual error and 
* 0.0% complementarity,

the remaining examples are randomly mis-labelled to fit the input error matrix.

## Views description

### View 1

This view is generated with [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), with the following configuration : 
```yaml
class_sep: 10.0
flip_y: 0
hypercube: true
n_clusters_per_class: 1
n_features: 3
n_informative: 3
n_redundant: 0
n_repeated: 0
scale: 1.0
shift: 0.0
shuffle: false
```

### View 2

This view is generated with [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), with the following configuration : 
```yaml
class_sep: 10.0
flip_y: 0
hypercube: true
n_clusters_per_class: 1
n_features: 3
n_informative: 3
n_redundant: 0
n_repeated: 0
scale: 1.0
shift: 0.0
shuffle: false
```

### View 3

This view is generated with [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), with the following configuration : 
```yaml
class_sep: 10.0
flip_y: 0
hypercube: true
n_clusters_per_class: 1
n_features: 3
n_informative: 3
n_redundant: 0
n_repeated: 0
scale: 1.0
shift: 0.0
shuffle: false
```

### View 4

This view is generated with [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), with the following configuration : 
```yaml
class_sep: 10.0
flip_y: 0
hypercube: true
n_clusters_per_class: 1
n_features: 3
n_informative: 3
n_redundant: 0
n_repeated: 0
scale: 1.0
shift: 0.0
shuffle: false
```

This report has been automatically generated on April 29, 2020 at 09:34:50

But if you just want to save it, you can use : 

In [8]:
%%capture

generator.gen_report(output_path="supplementary_material", save=True)

This will save the description in the `supplementary_material` directory given as `output_path`, in a file called `demo.md`, as the name of the dataset is "demo".

## Save it in an HDF5 file 

Moreover, it is possible to save the dataset in an HDF5 file compatible with [SuMMIT](https://gitlab.lis-lab.fr/baptiste.bauvin/summit/), with:
 

In [9]:
generator.to_hdf5_mc(saving_path='supplementary_material')

## Visualizing the dataset with [plotly](https://plotly.com/)

Here, we purposely used only 3 features per view, so the generated dataset can easily be plotted in 3D.

Let us plot each view : 

In [10]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.colors import DEFAULT_PLOTLY_COLORS

fig = make_subplots(rows=2, cols=2,
                    # Number the views from 1, as in the generated report
                    subplot_titles=["View {}".format(view_index + 1)
                                    for view_index in range(n_views)],
                    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}, ],
                               [{'type': 'scatter3d'},
                                {'type': 'scatter3d'}, ]])
row = 1
col = 1
show_legend = True
# Plot the data for each view and each label
for view_index in range(n_views):
    for lab_index in range(n_classes):
        concerned_examples = np.where(generator.y == lab_index)[0]
        fig.add_trace(
            go.Scatter3d(
                x=generator.view_data[view_index][concerned_examples, 0],
                y=generator.view_data[view_index][concerned_examples, 1],
                z=generator.view_data[view_index][concerned_examples, 2],
                text=[generator.example_ids[ind] for ind in concerned_examples],
                hoverinfo='text',
                legendgroup="Class {}".format(lab_index),
                mode='markers', marker=dict(size=1,
                                            color=DEFAULT_PLOTLY_COLORS[lab_index],
                                            opacity=0.8), 
                name="Class {}".format(lab_index), 
                showlegend=show_legend),
            row=row, col=col)
    show_legend = False
    col += 1
    if col == 3:
        col = 1
        row += 1
fig.show()

The figure shows the dataset with one 3D subplot per view. Clicking on a label in the legend hides the samples of the corresponding class.
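
The plotting loop tracks `row` and `col` by hand. As an alternative sketch (the name `n_cols` is introduced here for illustration), the 1-indexed grid position of each view can be derived directly from its index with `divmod`:

```python
n_views = 4
n_cols = 2  # hypothetical name for the number of subplot columns

# Plotly subplot rows and columns are 1-indexed.
positions = []
for view_index in range(n_views):
    row, col = divmod(view_index, n_cols)
    positions.append((row + 1, col + 1))

print(positions)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

This gives the same positions as the manual counters and generalizes to other grid widths.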

## Getting the outputted error matrix

As the views have been generated with [make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), 
the [DecisionTree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) is a good approximation of the Bayes classifier, so we use it to measure the output error matrix.

In order to estimate the test error in the dataset for each class with a Decision Tree, we use a [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) :

In [11]:
from sklearn.model_selection import StratifiedKFold

n_folds = 5

folds_generator = StratifiedKFold(n_folds, random_state=random_state,
                                 shuffle=True)
# Splitting the array containing the indices of the samples
folds = folds_generator.split(np.arange(generator.y.shape[0]), generator.y)

# Getting the lists of train and test sample indices for each fold.
folds = [[list(train), list(test)] for train, test in folds]
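
To illustrate why a stratified split is used here, the following sketch runs `StratifiedKFold` on toy labels (hypothetical data, unrelated to the generated dataset) and counts the classes in each test fold. With balanced classes, every test fold contains the same number of samples from each class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 3 balanced classes of 10 samples each.
y = np.repeat([0, 1, 2], 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Only the labels matter for the split, so a placeholder array is used as X.
test_class_counts = [
    np.bincount(y[test], minlength=3).tolist()
    for _, test in skf.split(np.zeros(len(y)), y)
]
print(test_class_counts)  # five folds, each with 2 samples per class
```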

Then, we create a Decision Tree of depth 3 (as each view has 3 features), and fit it on each view, for each fold. 
The output score is the cross-validation score averaged over the 5 folds. 

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

dt = DecisionTreeClassifier(max_depth=3)
confusion_mat = np.zeros((n_folds, n_views, n_classes, n_classes))
n_sample_per_class = np.zeros((n_views, n_classes, n_folds))

# For each view
for view_index in range(n_views):
    # For each fold 
    for fold_index, [train, test] in enumerate(folds):
        
        # Fit the decision tree on the training set
        dt.fit(generator.view_data[view_index][train, :], generator.y[train])
        
        # Predict on the testing set
        pred = dt.predict(generator.view_data[view_index][test, :])
        
        # Get the confusion matrix
        confusion_mat[fold_index, view_index, :, :] = confusion_matrix(generator.y[test], pred)
        for class_index in range(n_classes):
            n_sample_per_class[view_index, class_index, fold_index] = np.where(generator.y[test]==class_index)[0].shape[0]
confusion_mat = np.mean(confusion_mat, axis=0)
n_sample_per_class = np.mean(n_sample_per_class, axis=2)
output = np.zeros((n_classes, n_views))
# Get the per-class error from the confusion matrix
for class_index in range(n_classes):
    for view_index in range(n_views):
        output[class_index, view_index] = 1-confusion_mat[view_index, class_index, class_index]/n_sample_per_class[view_index, class_index]
        
print("Input error matrix : \n{}\n\nOutputted error matrix : \n{}\n\nDifference :\n{}".format(tabulate(error_matrix, tablefmt='grid'), tabulate(output, tablefmt='grid'), tabulate(error_matrix-output, tablefmt='grid')))

Input error matrix : 
+------+-----+------+------+
| 0.4  | 0.4 | 0.4  | 0.4  |
+------+-----+------+------+
| 0.55 | 0.4 | 0.4  | 0.4  |
+------+-----+------+------+
| 0.4  | 0.5 | 0.52 | 0.55 |
+------+-----+------+------+

Outputted error matrix : 
+----------+----------+----------+----------+
| 0.402402 | 0.40991  | 0.411411 | 0.436937 |
+----------+----------+----------+----------+
| 0.54955  | 0.403904 | 0.405405 | 0.43994  |
+----------+----------+----------+----------+
| 0.442943 | 0.504505 | 0.534535 | 0.542042 |
+----------+----------+----------+----------+

Difference :
+-------------+-------------+-------------+-------------+
| -0.0024024  | -0.00990991 | -0.0114114  | -0.0369369  |
+-------------+-------------+-------------+-------------+
|  0.00045045 | -0.0039039  | -0.00540541 | -0.0399399  |
+-------------+-------------+-------------+-------------+
| -0.0429429  | -0.0045045  | -0.0145345  |  0.00795796 |
+-------------+-------------+-------------+-------------+


Here, we can see that there is a slight difference between the input error matrix and the output one: the entries differ by at most about 0.043 in absolute value.
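
The per-class error used above is one minus the diagonal of the confusion matrix divided by the number of samples in each true class. A minimal sketch on a hypothetical confusion matrix (the values are invented for illustration) shows the computation in vectorized form:

```python
import numpy as np

# Hypothetical confusion matrix for one view:
# rows are true classes, columns are predicted classes.
confusion = np.array([
    [80, 10, 10],
    [20, 60, 20],
    [5, 15, 80],
])

# Row sums give the number of test samples in each true class.
samples_per_class = confusion.sum(axis=1)
# Per-class error = 1 - (correct predictions / samples in that class).
class_errors = 1 - np.diag(confusion) / samples_per_class
print(class_errors)
```

Because the row sums of the confusion matrix already equal the per-class sample counts, a separate count array is not strictly needed in this formulation.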

## Conclusion

In this demo, we used SMuDGE to generate a basic multiview dataset, and we performed a naive analysis on it. 
The next tutorial will be focused on introducing redundancy, mutual error and complementarity. 