## Data overview
You are given unlabeled data for which labels must be computed. The dataset contains fMRI scans of 20 subjects. Each subject has 16 (2x2x2x2) representations: two brain atlas partitions (Brainnetome and Schaefer200), times two smoothing strategies, times two segments of the scan sequence, times two sequences. The dataset therefore has shape `[20*16 objects, 10 timesteps from the scan sequence, 246 features in the larger atlas]`. Because the two atlases have different numbers of partitions, arrays from the smaller atlas are padded with `np.nan` so that the data shape is uniform.
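The layout described above can be explored with a small sketch. It uses a synthetic array rather than the real file, and it assumes that the smaller atlas fills the first 200 feature columns with padding at the end; both the constant `SMALLER_ATLAS_FEATURES` and the choice of which objects are padded are illustrative assumptions, not facts about the actual dataset.

```python
import numpy as np

# Synthetic stand-in for the real array (assumption: same layout as described above).
N_OBJECTS, N_TIMESTEPS, N_FEATURES = 320, 10, 246
SMALLER_ATLAS_FEATURES = 200  # assumption: the smaller atlas fills the first 200 columns

rng = np.random.default_rng(0)
data = rng.normal(size=(N_OBJECTS, N_TIMESTEPS, N_FEATURES))
# Pretend every second object comes from the smaller atlas and pad it with NaN.
data[::2, :, SMALLER_ATLAS_FEATURES:] = np.nan

# A feature column counts as padding for an object if it is NaN at every timestep.
padded_columns = np.isnan(data).all(axis=1)          # shape (320, 246)
valid_feature_count = (~padded_columns).sum(axis=1)  # shape (320,)

# Objects with fewer than 246 valid features are the padded ones.
is_smaller_atlas = valid_feature_count < N_FEATURES
print(data.shape, int(is_smaller_atlas.sum()))
```

On the real data, the same `np.isnan(...).all(axis=1)` check would show which objects are padded and how many features each one actually carries.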

*Public Test Data:*

- IHB dataset: 10 subjects

*Private Test Data:*

- IHB dataset: 10 subjects

A file with all $20 \times 16$ scan representations is available via this repository's Git LFS.

## Objective
The task aims to simulate a realistic research workflow:

- Data Collection Constraints: Collecting fMRI data is challenging and costly, resulting in small proprietary datasets.
- Dataset Variability: Open datasets inherently differ from proprietary data in aspects such as scanner type, geographic location of data collection, and average participant age. These differences are simulated through the choice of different brain states, time sequences, and atlas-based aggregations.

The primary goal is to develop a model that identifies a person by using fMRI as a "fingerprint" that stays consistent across these perturbations and aspects of the data.

## Performance Metric and Deliverables

*Evaluation Metric:* Adjusted Rand Score, rescaled so that 0.0 corresponds to random prediction or worse and 1.0 to perfect labeling: $\text{ari}(y, \hat{y}) = \frac{\text{ri}(y, \hat{y}) - 0.85}{0.15}$
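For local checks, the rescaling in the formula can be sketched in plain NumPy. This is one reading of the stated formula and not the official scoring code: `rand_index` and `rescaled_score` are hypothetical helper names, and clipping negative values to 0.0 is an assumption based on the phrase "random prediction and worse".

```python
import numpy as np


def rand_index(y_true, y_pred):
    """Fraction of object pairs on which the two labelings agree."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    same_true = y_true[:, None] == y_true[None, :]
    same_pred = y_pred[:, None] == y_pred[None, :]
    upper = np.triu_indices(len(y_true), k=1)  # each unordered pair once
    return float(np.mean(same_true[upper] == same_pred[upper]))


def rescaled_score(y_true, y_pred, baseline=0.85):
    """(ri - 0.85) / 0.15, clipped at 0.0 (the clipping is an assumption)."""
    ri = rand_index(y_true, y_pred)
    return max(0.0, (ri - baseline) / (1.0 - baseline))


# 20 subjects with 16 objects each, as in the dataset description.
truth = np.repeat(np.arange(20), 16)
relabeled = (truth + 7) % 20               # same grouping, different label ids
single_cluster = np.zeros_like(truth)      # everything in one cluster

perfect = rescaled_score(truth, relabeled)
degenerate = rescaled_score(truth, single_cluster)
```

The score depends only on which objects are grouped together, so a permutation of the label ids leaves it unchanged.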

*Required Deliverables:*
- `<name>.csv` submission file containing a column named `prediction`, in which objects belonging to the same class (the same subject's fMRI scans) share the same integer label.
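Before submitting, the file format can be checked with a short sketch like the one below. `validate_submission` is a hypothetical helper, and the expected row count of 320 (one row per object, in the same order as the data array) is an assumption that the task description does not state explicitly.

```python
import io

import numpy as np
import pandas as pd


def validate_submission(csv_source, expected_rows=320):
    """Check the column name, row count and integer dtype of a submission."""
    frame = pd.read_csv(csv_source)
    if "prediction" not in frame.columns:
        raise ValueError("missing 'prediction' column")
    if len(frame) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(frame)}")
    if not pd.api.types.is_integer_dtype(frame["prediction"]):
        raise ValueError("'prediction' must contain integer labels")
    return frame["prediction"].to_numpy()


# Round trip through an in-memory buffer instead of a real file.
labels = np.repeat(np.arange(20), 16)
buffer = io.StringIO()
pd.DataFrame({"prediction": labels}).to_csv(buffer, index=False)
buffer.seek(0)
checked = validate_submission(buffer)
```

Passing a file path instead of the buffer works the same way, since `pd.read_csv` accepts both.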

## Example

In [1]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import pandas as pd

**Use standard scaling to account for the different scan smoothing strategies, then k-means to compute cluster centers for the subsequent distance-based ranking**

In [7]:
data = np.load('../data/ts_cut/ihb.npy')  # shape (320, 10, 246)

common_brain_region_data = data[:, :, 0]  # first feature only, shape (320, 10)

model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=20)  # one cluster per subject
)

model.fit(common_brain_region_data)

**Use a greedy algorithm: for each cluster center in turn, take the 16 closest remaining objects, assign them to that cluster, and remove them from the pool**

In [3]:
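cluster_distances = model.transform(common_brain_region_data)  # distances to each center, shape (320, 20)

labeling = np.zeros(len(data), dtype=int)
leftover_indexes = np.arange(len(data))  # original indexes of objects not yet assigned
for i in range(20):
    distances_from_current_cluster_center = cluster_distances[:, i]
    if len(distances_from_current_cluster_center) > 16:
        # positions (within the leftover pool) of the 16 objects closest to center i
        top16 = np.argpartition(distances_from_current_cluster_center, 16)[:16]
        labeling[leftover_indexes[top16]] = i
        # remove the assigned objects so later clusters cannot take them
        cluster_distances = np.delete(cluster_distances, top16, axis=0)
        leftover_indexes = np.delete(leftover_indexes, top16)
    else:
        # 16 or fewer objects remain: give all of them to the current cluster
        labeling[leftover_indexes] = i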
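The greedy step above can be wrapped in a function and sanity-checked on synthetic distances. The sketch below is an illustration only: `greedy_balanced_labels` is a hypothetical helper name, and the random matrix stands in for the output of `model.transform`. It checks that every label ends up with exactly 16 objects.

```python
import numpy as np


def greedy_balanced_labels(distances, group_size):
    """Assign `group_size` nearest remaining objects to each cluster in turn."""
    n_objects, n_clusters = distances.shape
    labels = np.full(n_objects, -1, dtype=int)  # -1 marks "not yet assigned"
    remaining = np.arange(n_objects)            # original indexes still in the pool
    pool = distances.copy()
    for cluster in range(n_clusters):
        column = pool[:, cluster]
        if len(column) > group_size:
            # positions in the pool of the `group_size` smallest distances
            nearest = np.argpartition(column, group_size)[:group_size]
            labels[remaining[nearest]] = cluster
            pool = np.delete(pool, nearest, axis=0)
            remaining = np.delete(remaining, nearest)
        else:
            # the pool is down to one group or less: assign it and stop
            labels[remaining] = cluster
            break
    return labels


rng = np.random.default_rng(0)
synthetic_distances = rng.random((320, 20))  # stand-in for model.transform(...)
labels = greedy_balanced_labels(synthetic_distances, group_size=16)
counts = np.bincount(labels, minlength=20)
```

Because clusters are processed in index order, earlier clusters get first pick of the objects; the result therefore depends on the cluster ordering that k-means happens to produce.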

In [13]:
common_brain_region_data.shape

Out[13]:
(320, 10)

**Form the final submission**

In [4]:
pd.DataFrame({'prediction': labeling}).to_csv('../submission.csv', index=False)