# Initial Data Preprocessing

This notebook loads the raw data from the [MultiBench package](https://github.com/pliang279/MultiBench/tree/main), including datasets such as [MOSI](https://drive.google.com/drive/folders/1uEK737LXB9jAlf9kyqRs6B9N6cDncodq) and [MOSEI](https://drive.google.com/drive/folders/1A_hTmifi824gypelGobgl2M-5Rw9VWHv). The initial preprocessing simplifies and restructures the data so that it is compatible with the latent representation models. For each text segment, the leading all-zero (padding) time steps are dropped, the vision and audio views are averaged over the remaining time steps to give one vector per segment, and the text view is kept as a sequence of embeddings.

- **Process Overview**: read the `*_raw.pkl` files and transform them into `*_transformed.pkl` files, preparing the data for the next stages.
- **Note**: Text data is represented as GloVe embeddings; to convert these back into words, refer to the `preprocessin_transform_from_glove` file.
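
The trimming and averaging step can be illustrated on a single toy segment. This is a minimal sketch that mirrors the logic of `transform_data_raw` defined later in this notebook; the arrays `text` and `vision` are made up for illustration, and it assumes (as the notebook's code does) that padding appears as leading all-zero rows in the text view.

```python
import numpy as np

# Toy segment: 5 time steps; the first two text steps are zero padding.
text = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],  # an all-zero step after the content starts is kept
    [3.0, 2.0, 1.0],
])
vision = np.arange(10, dtype=float).reshape(5, 2)

# Absolute sum per time step: zero only for all-zero text steps
step_mass = np.abs(text).sum(axis=1)

# The cumulative sum stays at zero only across the leading padding
leading = np.where(np.cumsum(step_mass) == 0)[0]
first_meaningful = 0 if len(leading) == 0 else leading[-1] + 1

# Text keeps its remaining time steps; vision is averaged over them
text_trimmed = text[first_meaningful:]
vision_mean = vision[first_meaningful:].mean(axis=0)

print(first_meaningful, text_trimmed.shape, vision_mean)
```

Using the cumulative sum rather than a plain zero test means that only the leading run of zero rows is removed; zero rows that occur after the first non-zero step stay in the segment.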

## Packages & functions

In [None]:
import numpy as np
from tqdm import tqdm
import pickle

In [9]:
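def transform_data_raw(dataset_raw, funny=False):
    """Merge all splits of a raw dataset dict into one transformed dict.

    Leading all-zero text steps are treated as padding and dropped from every
    view; vision and audio are then averaged over time, text stays a sequence.
    Set funny=True when the labels have no middle axis to drop.
    """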
    view_text = []
    view_vision = []
    view_audio = []
    labels = []
    index = []
    data_split = []
    data_keys = list(dataset_raw.keys())

    # Each top-level key names a split; it is recorded in 'train_val_test'
    for data_set in data_keys:

        N = len(dataset_raw[data_set]['text'])
        
        # Absolute sum per time step of the text view; zero only for all-zero steps
        zeros_set = np.abs(dataset_raw[data_set]['text']).sum(axis=2)
        
        for i in tqdm(range(N)):
            zeros_set_i = zeros_set[i,:]

            # The cumulative sum stays at zero only across the leading padding steps
            zeros_index_i = np.where(np.cumsum(zeros_set_i) == 0)[0]

            if len(zeros_index_i) == 0:
                first_meaningful_i = 0
            else:
                first_meaningful_i = zeros_index_i[-1] + 1

            text_i = dataset_raw[data_set]['text'][i,first_meaningful_i:,:]

            vision_i = dataset_raw[data_set]['vision'][i,first_meaningful_i:,:].mean(axis=0)
            audio_i = dataset_raw[data_set]['audio'][i,first_meaningful_i:,:].mean(axis=0)

            view_text.append(text_i)
            view_vision.append(vision_i)
            view_audio.append(audio_i)

        labels.append(dataset_raw[data_set]['labels'])
        index.append(dataset_raw[data_set]['id'])

        data_split.append(np.repeat(data_set, N))

    if not funny:
        # Keep only index 0 along the middle label axis
        labels_dataset = np.concatenate(labels)[:, 0, :]
    else:
        labels_dataset = np.concatenate(labels)

    dataset = {
        'M0': np.vstack(view_vision),
        'M1': np.vstack(view_audio),
        'M2': view_text,
        'labels': labels_dataset,
        'index': np.concatenate(index),
        'train_val_test': np.concatenate(data_split)
    }

    return dataset
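
Before saving the returned dictionary, it can help to check that all of its entries describe the same samples. The sketch below is not part of the original pipeline: `check_transformed` is a hypothetical helper, and `toy` is a hand-made dictionary with the same keys that `transform_data_raw` returns. It also looks for NaN values, because a segment whose text view is entirely zero would leave an empty slice, and NumPy's mean of an empty slice is NaN.

```python
import numpy as np

def check_transformed(dataset):
    """Hypothetical helper: validate the layout of a transformed dataset dict."""
    n = len(dataset['M2'])  # text view: a list with one array per sample

    # Every other entry must have one row per sample
    for key in ('M0', 'M1', 'labels', 'index', 'train_val_test'):
        if len(dataset[key]) != n:
            raise ValueError(f"{key} has {len(dataset[key])} rows, expected {n}")

    # Averaged views must be (samples, features) and free of NaN
    for key in ('M0', 'M1'):
        if dataset[key].ndim != 2:
            raise ValueError(f"{key} should be 2-D, got {dataset[key].ndim}-D")
        if np.isnan(dataset[key]).any():
            raise ValueError(f"{key} contains NaN (empty segment after trimming?)")

    return n

# Hand-made example with the same keys as the transformed dataset
toy = {
    'M0': np.zeros((3, 4)),
    'M1': np.zeros((3, 2)),
    'M2': [np.ones((t, 5)) for t in (2, 6, 1)],
    'labels': np.zeros((3, 1)),
    'index': np.array(['a', 'b', 'c']),
    'train_val_test': np.array(['train', 'train', 'test']),
}
n_samples = check_transformed(toy)
```

In the notebook, the same call could be applied to the `dataset` returned for MOSEI or MOSI before it is pickled.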

## MOSEI

In [3]:
with open('./datasets_download/MOSEI/mosei_raw.pkl', "rb") as input_file:
    dataset_raw = pickle.load(input_file)

In [None]:
dataset = transform_data_raw(dataset_raw)

In [None]:
dataset['M0'].shape, dataset['M1'].shape, len(dataset['M2']), dataset['labels'].shape, dataset['index'].shape, dataset['train_val_test'].shape

In [8]:
with open('./datasets_download/MOSEI/MOSEI_transformed.pkl', 'wb') as f:
    pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)

## MOSI

In [23]:
with open('./datasets_download/MOSI/mosi_raw.pkl', "rb") as input_file:
    dataset_raw = pickle.load(input_file)

In [None]:
dataset = transform_data_raw(dataset_raw)

In [None]:
dataset['M0'].shape, dataset['M1'].shape, len(dataset['M2']), dataset['labels'].shape, dataset['index'].shape, dataset['train_val_test'].shape

In [16]:
with open('./datasets_download/MOSI/MOSI_transformed.pkl', 'wb') as f:
    pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)