# CMU-MOSEI Data

## Summary 

**Number of data points**
- 23,453

**Number of distinct speakers**
- 1,000

**Subset Modalities**
- Language
- Video 
- Audio

**Labels**
- Sentiment
- Emotion

**Total Number of Video Hours**
- 65 hours 53 minutes 36 seconds

**Dataset Statistics**

![](Images/CMU-MOSEI_dataset_statistics.png)

## Labels

### Sentiment 

| Scale | Description | 
| --- | --- | 
| -3 | highly negative | 
| -2 | negative | 
| -1 | weakly negative | 
| 0 | neutral | 
| +1 | weakly positive | 
| +2 | positive | 
| +3 | highly positive | 
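
The scale above can be turned into a small lookup. The sketch below assumes that a stored score may be fractional (for example, if several annotators' ratings were averaged) and that rounding to the nearest integer is an acceptable way to pick a description. Both are assumptions made for illustration, and `describe_sentiment` is a hypothetical helper, not part of the dataset tooling.

```python
import math

# Descriptions copied from the sentiment table above.
SENTIMENT_SCALE = {
    -3: "highly negative",
    -2: "negative",
    -1: "weakly negative",
    0: "neutral",
    1: "weakly positive",
    2: "positive",
    3: "highly positive",
}


def describe_sentiment(score: float) -> str:
    """Map a (possibly fractional) score to the nearest scale description."""
    # Assumption: out-of-range scores are clamped to [-3, 3].
    clamped = max(-3.0, min(3.0, float(score)))
    # Assumption: round half up to the nearest integer step of the scale.
    nearest = int(math.floor(clamped + 0.5))
    return SENTIMENT_SCALE[nearest]
```

For instance, `describe_sentiment(2.4)` returns `"positive"`.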

### Emotions 

{happiness, sadness, anger, fear, disgust, surprise}

Scale for the presence of emotion $x$:

| Scale | Description | 
| --- | --- | 
| 0 | no evidence of $x$ | 
| 1 | weakly $x$ | 
| 2 | $x$ | 
| 3 | highly $x$ | 
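
As a rough illustration of how the per-emotion scale reads in practice, the sketch below maps six intensities to the descriptions in the table. It assumes the intensities are given in the order the emotions are listed above; the order inside the actual label file may differ, so treat `EMOTIONS` and `describe_emotions` as hypothetical.

```python
# Assumption: this order matches the list above, not necessarily the label file.
EMOTIONS = ("happiness", "sadness", "anger", "fear", "disgust", "surprise")

# Description templates copied from the emotion scale table.
TEMPLATES = {1: "weakly {x}", 2: "{x}", 3: "highly {x}"}


def describe_emotions(intensities):
    """Return {emotion: description} for every emotion with a non-zero level."""
    if len(intensities) != len(EMOTIONS):
        raise ValueError("expected one intensity per emotion")
    described = {}
    for emotion, level in zip(EMOTIONS, intensities):
        if level not in (0, 1, 2, 3):
            raise ValueError(f"intensity for {emotion} must be 0, 1, 2 or 3")
        if level > 0:
            described[emotion] = TEMPLATES[level].format(x=emotion)
    return described
```

A level of 0 means "no evidence", so those emotions are simply left out of the result.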

## Extracted Features

### Language

| Feature | Extraction Method |
| --- | --- |
| word vectors | GloVe word embeddings | 

### Visual 

Frames are extracted from the full videos at 30 Hz.

| Feature | Extraction Method |
| --- | --- |
| bounding box of face | MTCNN face detection algorithm |
| facial action units | Facial Action Coding System (FACS) |
| six basic emotions | Emotient FACET |
| 68 facial landmarks | MultiComp OpenFace |
| 20 facial shape parameters | MultiComp OpenFace |
| facial HoG features | MultiComp OpenFace |
| head pose | MultiComp OpenFace |
| head orientation | MultiComp OpenFace |
| eye gaze | MultiComp OpenFace |
| face embeddings | DeepFace, FaceNet, SphereFace |

### Audio

| Feature | Extraction Method | 
| --- | --- |
| 12 Mel-frequency cepstral coefficients | COVAREP |
| pitch | COVAREP |
| voiced/unvoiced segmenting | COVAREP |
| glottal source parameters | COVAREP |
| peak slope parameters | COVAREP |
| maxima dispersion quotients | COVAREP |

## Alignment 

Words and audio are aligned at the phoneme level using the P2FA forced alignment model.
- The visual and audio modalities are then aligned to the words by interpolation.
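
The text does not say which interpolation scheme is used, so the following is only a minimal sketch of the idea: a one-dimensional feature track, sampled at known timestamps, is linearly interpolated at the midpoint of each word interval. Linear interpolation, midpoint sampling, and the helper names `interpolate_at` and `align_to_words` are all assumptions for illustration.

```python
from bisect import bisect_left


def interpolate_at(times, values, query):
    """Linearly interpolate a 1-D feature track at time `query` (seconds).

    `times` must be strictly increasing and the same length as `values`.
    """
    # Outside the sampled range, fall back to the nearest endpoint value.
    if query <= times[0]:
        return values[0]
    if query >= times[-1]:
        return values[-1]
    hi = bisect_left(times, query)  # first sample at or after `query`
    lo = hi - 1
    weight = (query - times[lo]) / (times[hi] - times[lo])
    return values[lo] + weight * (values[hi] - values[lo])


def align_to_words(times, values, word_intervals):
    """Return one feature value per word, sampled at the word's midpoint."""
    return [
        interpolate_at(times, values, (start + end) / 2.0)
        for start, end in word_intervals
    ]
```

With video frames at 30 Hz, `times` would be `[i / 30 for i in range(n_frames)]`, and `word_intervals` would hold the `(start, end)` times produced by the forced aligner.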

## Using the Data

### Import features

In [4]:
from mmsdk import mmdatasdk

audio_file = './data/cmumosei/CMU_MOSEI_COVAREP.csd'
visual_file = './data/cmumosei/CMU_MOSEI_VisualOpenFace2.csd'

features = [audio_file, visual_file]


data = mmdatasdk.mmdataset({'audio': audio_file, 'visual': visual_file})


  3%|▎         | 121/3836 [00:00<00:03, 1202.58 Computational Sequence Entries/s]
[2021-02-19 21:08:55.800] | Success | Computational sequence read from file ./data/cmumosei/CMU_MOSEI_COVAREP.csd ...
[2021-02-19 21:08:55.899] | Status  | Checking the integrity of the <COVAREP> computational sequence ...
[2021-02-19 21:08:55.899] | Status  | Checking the format of the data in <COVAREP> computational sequence ...
[2021-02-19 21:08:58.905] | Success | <COVAREP> computational sequence data in correct format.
[2021-02-19 21:08:58.905] | Status  | Checking the format of the metadata in <COVAREP> computational sequence ...
[2021-02-19 21:08:58.908] | Success | Computational sequence read from file ./data/cmumosei/CMU_MOSEI_VisualOpenFace2.csd ...
  2%|▏         | 76/3837 [00:00<00:04, 753.63 Computational Sequence Entries/s]
[2021-02-19 21:08:59.897] | Status  | Checking the integrity of the <OpenFace_2> 

### Align dataset to labels

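In [None]:
label_file = './data/cmumosei/CMU_MOSEI_Labels.csd'
# add the labels to the dataset already loaded in memory
data.add_computational_sequences({'label': label_file}, destination=None)
# align the audio and visual sequences to the labelled segments
data.align('label')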

### Create dataframe of items

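In [None]:
import numpy as np
import pandas as pd

rows = []
for segment in data['label'].keys():
    label = data['label'][segment]['features']
    audio = data['audio'][segment]['features']
    # the visual features were loaded under the 'visual' key, not 'video'
    visual = data['visual'][segment]['features']

    # replace NaN values with 0 and flatten each array to 1-D
    label = np.nan_to_num(label).flatten()
    audio = np.nan_to_num(audio).flatten()
    visual = np.nan_to_num(visual).flatten()

    rows.append({'segment': segment, 'label': label, 'audio': audio, 'visual': visual})

# one row per aligned segment; each feature cell holds a 1-D array
df = pd.DataFrame(rows)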
