# Identifying EMOTIC images for survey

EMOTIC was saved in a MATLAB structure. I don't remember MATLAB well, so I'll reshape everything for Python.

## Exploring dataset structure

In [1]:
from scipy.io import loadmat
import pandas as pd

In [2]:
mat = loadmat('../Annotations/Annotations/Annotations.mat')

In [3]:
# mat['train'][0,0]['filename']

In [4]:
# mat['train'][0,0]['folder']

In [5]:
# mat['train'][0,0]['image_size']

In [6]:
# mat['train'][0,0]['original_database']

In [7]:
# mat['train'][0,0]['person']

In [8]:
# mat['train'][0,0]['person']['gender']

In [9]:
# mat['train'][0,0]['person']['age']

In [10]:
# mat['train'][0,0]['person']['annotations_categories']
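
The commented-out lookups above index the loaded structure by field name. As a minimal sketch of how those names can be discovered, the snippet below builds a toy NumPy structured array (the `toy` array and its three fields are made up for illustration, not read from `Annotations.mat`) and lists its fields through `dtype.names`.

```python
import numpy as np

# Toy stand-in for one EMOTIC record; the real fields come from Annotations.mat.
record_dtype = np.dtype([('filename', 'O'), ('folder', 'O'), ('person', 'O')])
toy = np.empty((1, 1), dtype=record_dtype)
toy[0, 0] = ('img.jpg', 'mscoco/images', None)

# dtype.names lists every field that can be indexed by name.
print(toy.dtype.names)

# Index the single record first, then the field, as in the cells above.
print(toy[0, 0]['filename'])
```

The same `dtype.names` attribute should be available on `mat['train']`, since `loadmat` returns MATLAB structs as structured arrays by default.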

## Person sub-structure

In [11]:
# mat['test'][0]['person'][1]['annotations_categories']

In [12]:
# mat['train'][0]['person'][18]

In [13]:
# mat['val'][0]['person'][3]['combined_categories']

## Structured array to Pandas DF

In [14]:
# Train dataset 
df1 = pd.DataFrame.from_records(mat['train'][0])

In [15]:
# Test dataset 
# df2 = pd.DataFrame.from_records(mat['test'][0])

In [16]:
# Validation dataset
# df3 = pd.DataFrame.from_records(mat['val'][0])

In [17]:
print(f'Train dataframe shape: {df1.shape}')

Train dataframe shape: (17077, 5)
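
To illustrate what `from_records` does here, this is a small sketch with a hand-made structured array (`toy_records` and its two fields are assumptions for the example, not the real data). Each named field becomes a column and each record becomes a row.

```python
import numpy as np
import pandas as pd

# Hypothetical two-field structured array standing in for mat['train'][0].
toy_dtype = np.dtype([('filename', 'O'), ('folder', 'O')])
toy_records = np.array(
    [('a.jpg', 'mscoco/images'), ('b.jpg', 'framesdb/images')],
    dtype=toy_dtype,
)

# Field names become column names; one row per record.
toy_df = pd.DataFrame.from_records(toy_records)
print(toy_df.shape)
print(list(toy_df.columns))
```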


### Removing repeated annotations

Some images have two or more individuals annotated.
For the survey, I can only use images with a single annotated individual.
The 'person' field contains all the information the annotators provided. If 'categories' occurs more than once in it, two or more individuals were tagged.

In [18]:
df1['repeated_annotations'] = df1['person'].astype(str).str.count('categories')

In [19]:
#df2['repeated_annotations'] = df2['person'].astype(str).str.count('categories')

In [20]:
#df3['repeated_annotations'] = df3['person'].astype(str).str.count('categories')

I want to keep the rows where 'categories' occurs only once (a single person annotated).

In [21]:
mask_filter_repeats = df1['repeated_annotations'] < 2

In [22]:
#mask_filter_repeats_2 = df2['repeated_annotations'] < 2

In [23]:
#mask_filter_repeats_3 = df3['repeated_annotations'] < 2

Use boolean mask to keep images with one person labeled

In [24]:
# .copy() avoids SettingWithCopyWarning when new columns are added later
df1 = df1[mask_filter_repeats].copy()

In [25]:
#df2 = df2[mask_filter_repeats_2]

In [26]:
#df3 = df3[mask_filter_repeats_3]

In [27]:
print(f'Filtered Train dataset shape: {df1.shape}')

Filtered Train dataset shape: (12608, 6)
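
Counting a substring depends on how the nested array happens to print. A possible alternative, sketched below on toy data, is to count the entries of each 'person' cell directly. This assumes every cell holds an array with one entry per annotated person, which should be checked against the real file; `make_people`, `person_dtype` and `toy_df` are hypothetical names for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical per-person record; the real 'person' struct has more fields.
person_dtype = np.dtype([('gender', 'O'), ('age', 'O')])


def make_people(*people):
    """Build a (1, n) structured array with one entry per annotated person."""
    return np.array(list(people), dtype=person_dtype).reshape(1, -1)


# Store the arrays in an object array so each cell keeps its own shape.
cells = np.empty(2, dtype=object)
cells[0] = make_people(('Male', 'Adult'))
cells[1] = make_people(('Female', 'Kid'), ('Male', 'Adult'))

toy_df = pd.DataFrame({'filename': ['a.jpg', 'b.jpg']})
toy_df['person'] = cells

# .size is the number of entries, i.e. the number of annotated people.
toy_df['n_people'] = [cell.size for cell in toy_df['person']]
single = toy_df[toy_df['n_people'] == 1].copy()
print(single['filename'].tolist())
```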


## Count how many times each emotion word was used to label each image

EMOTIC was labeled in stages with 26 emotion labels. Each image could receive multiple labels.

The 'training' set was labeled by one person, meaning each image has at most one label for each emotion.

The 'testing' and 'validation' sets were labeled by multiple people, meaning the same label can appear several times for one image.

After exploring the data, I realized the 'testing' and 'validation' sets were labeled inconsistently, sometimes by one person and sometimes by several. This means I cannot rely on how many times an image was labeled with a given emotion. Hence, I'll stick with the 'training' set, as it has a single (trained) annotator.

In [28]:
# True if a given emotion was annotated for the image

emotions = ['Peace', 'Affection', 'Esteem', 'Anticipation', 'Engagement',
            'Confidence', 'Happiness', 'Pleasure', 'Excitement', 'Surprise',
            'Sympathy', 'Doubt/Confusion',  'Disconnection', 'Fatigue', 'Embarrassment',
            'Yearning', 'Disapproval', 'Aversion', 'Annoyance', 'Anger',
            'Sensitivity', 'Sadness', 'Disquietment', 'Fear', 'Pain', 'Suffering'
           ]
for e in emotions:
    df1[e] = df1['person'].astype(str).str.contains(e)

In [29]:
df1.head(2)

Unnamed: 0,filename,folder,image_size,original_database,person,repeated_annotations,Peace,Affection,Esteem,Anticipation,...,Disapproval,Aversion,Annoyance,Anger,Sensitivity,Sadness,Disquietment,Fear,Pain,Suffering
0,[COCO_val2014_000000562243.jpg],[mscoco/images],"[[[[[640]], [[640]]]]]","[[[['mscoco'], [[(array([[562243]], dtype=int3...","[[[[[ 86 58 564 628]], [[(array([[array(['Dis...",1,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,[COCO_train2014_000000288841.jpg],[mscoco/images],"[[[[[640]], [[480]]]]]","[[[['mscoco'], [[(array([[288841]], dtype=int3...","[[[[[485 149 605 473]], [[(array([[array(['Ant...",1,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


Counting (horizontal sum) how many emotions were detected for each image gives an indication of how ambiguous the emotion is.

For instance, an image labeled with 10 different emotions is probably ambiguous, whereas an image with a single emotion is probably less so.

In [30]:
df1['n_emotions'] = df1[emotions].astype(int).sum(axis=1)

In [31]:
df1['n_emotions'].head(3)

0    2
1    1
2    3
Name: n_emotions, dtype: int64
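
The two steps above (one boolean flag per emotion, then a row-wise sum) can be sketched on toy strings as follows. The `toy` frame and the shortened `toy_emotions` list are made up for illustration. `regex=False` is an addition here, so that each label is matched as a literal substring rather than as a pattern.

```python
import pandas as pd

# Toy 'person' strings imitating what astype(str) produces.
toy = pd.DataFrame({
    'person': [
        "['Anticipation']",
        "['Doubt/Confusion' 'Fear']",
        "['Peace' 'Fear' 'Pain']",
    ]
})
toy_emotions = ['Anticipation', 'Doubt/Confusion', 'Fear', 'Peace', 'Pain']

# One boolean column per emotion: True if the label appears in the string.
for name in toy_emotions:
    toy[name] = toy['person'].str.contains(name, regex=False)

# Row-wise sum of the boolean flags = number of distinct emotions per image.
toy['n_emotions'] = toy[toy_emotions].sum(axis=1)
print(toy['n_emotions'].tolist())  # [1, 2, 3]
```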

In [32]:
# emotions = ['Peace', 'Affection', 'Esteem', 'Anticipation', 'Engagement',
#             'Confidence', 'Happiness', 'Pleasure', 'Excitement', 'Surprise',
#             'Sympathy', 'Doubt/Confusion',  'Disconnection', 'Fatigue', 'Embarrassment',
#             'Yearning', 'Disapproval', 'Aversion', 'Annoyance', 'Anger',
#             'Sensitivity', 'Sadness', 'Disquietment', 'Fear', 'Pain', 'Suffering'
#            ]
# for e in emotions:
#     df2[e] = df2['person'].astype(str).str.count(e)

In [33]:
# emotions = ['Peace', 'Affection', 'Esteem', 'Anticipation', 'Engagement',
#             'Confidence', 'Happiness', 'Pleasure', 'Excitement', 'Surprise',
#             'Sympathy', 'Doubt/Confusion',  'Disconnection', 'Fatigue', 'Embarrassment',
#             'Yearning', 'Disapproval', 'Aversion', 'Annoyance', 'Anger',
#             'Sensitivity', 'Sadness', 'Disquietment', 'Fear', 'Pain', 'Suffering'
#            ]
# for e in emotions:
#     df3[e] = df3['person'].astype(str).str.count(e)

### Encode gender

The gender information (only Male/Female) is contained in the 'person' column, so I need to clone it and do some string manipulation.

In [34]:
df1['gender'] = df1['person']

In [35]:
df1.loc[df1['gender'].astype(str).str.contains('Male'), 'gender'] = 'Male'
df1.loc[df1['gender'].astype(str).str.contains('Female'), 'gender'] = 'Female'

In [36]:
df1['gender'].value_counts()

Male      8494
Female    4114
Name: gender, dtype: int64

### Encode Age

The age information (only Kid/Teenager/Adult) is contained in the 'person' column, so I need to clone it and do some string manipulation.

In [37]:
df1['age'] = df1['person']

In [38]:
df1.loc[df1['age'].astype(str).str.contains('Kid'), 'age'] = 'Kid'
df1.loc[df1['age'].astype(str).str.contains('Teenager'), 'age'] = 'Teenager'
df1.loc[df1['age'].astype(str).str.contains('Adult'), 'age'] = 'Adult'

In [39]:
df1['age'].value_counts()

Adult       9963
Kid         1354
Teenager    1291
Name: age, dtype: int64
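
The gender and age encodings above both clone 'person' and overwrite it label by label. A possible alternative, sketched here on toy strings (`toy_person` is made up, and the real strings are much longer), is to pull the label out in one step with `str.extract` and a quoted alternation. This assumes each row has exactly one person, as guaranteed by the earlier filter.

```python
import pandas as pd

# Toy strings imitating the printed 'person' field for single-person images.
toy_person = pd.Series([
    "[('Male', 'Adult')]",
    "[('Female', 'Kid')]",
    "[('Female', 'Teenager')]",
])

# The surrounding quotes keep 'Male' from matching inside another word.
toy_gender = toy_person.str.extract(r"'(Male|Female)'", expand=False)
toy_age = toy_person.str.extract(r"'(Kid|Teenager|Adult)'", expand=False)

print(toy_gender.tolist())
print(toy_age.tolist())
```

Rows without a match come back as missing values, which makes unexpected labels easy to spot with `isna()`.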

## Cleaning strings

In [40]:
df1['filename'] = df1['filename'].astype(str).str.replace(r"[\'\[\],]", '', regex=True)

In [41]:
df1['folder'] = df1['folder'].astype(str).str.replace(r"[\'\[\],]", '', regex=True)
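
As a quick check of what the character class removes, here is a sketch on two toy strings shaped like the stringified cells (the `toy` Series is only an illustration). The pattern must be treated as a regular expression, so `regex=True` is passed explicitly; recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Strings as they look after astype(str) on single-element arrays.
toy = pd.Series(["['COCO_val2014_000000562243.jpg']", "['mscoco/images']"])

# Strip quotes, square brackets and commas in one pass.
cleaned = toy.str.replace(r"[\'\[\],]", '', regex=True)
print(cleaned.tolist())
```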

In [42]:
df1.loc[df1['original_database'].astype(str).str.contains('mscoco'), 'original_database'] = 'mscoco'
df1.loc[df1['original_database'].astype(str).str.contains('framesdb'), 'original_database'] = 'framesdb'
df1.loc[df1['original_database'].astype(str).str.contains('emodb_small'), 'original_database'] = 'emodb_small'
df1.loc[df1['original_database'].astype(str).str.contains('ade20k'), 'original_database'] = 'ade20k'

In [43]:
df1['original_database'].astype(str).value_counts()

mscoco         9702
framesdb       2168
emodb_small     522
ade20k          216
Name: original_database, dtype: int64

In [44]:
df1.head(2)

Unnamed: 0,filename,folder,image_size,original_database,person,repeated_annotations,Peace,Affection,Esteem,Anticipation,...,Anger,Sensitivity,Sadness,Disquietment,Fear,Pain,Suffering,n_emotions,gender,age
0,COCO_val2014_000000562243.jpg,mscoco/images,"[[[[[640]], [[640]]]]]",mscoco,"[[[[[ 86 58 564 628]], [[(array([[array(['Dis...",1,False,False,False,False,...,False,False,False,False,False,False,False,2,Male,Adult
1,COCO_train2014_000000288841.jpg,mscoco/images,"[[[[[640]], [[480]]]]]",mscoco,"[[[[[485 149 605 473]], [[(array([[array(['Ant...",1,False,False,False,True,...,False,False,False,False,False,False,False,1,Male,Adult


## Subsetting categories


- Adult / Minor (Kid or Teenager)
- Male / Female
- Happiness / Aversion / Surprise / Anger / Sadness / Fear
- 2 x 2 x 6 = 24 cells

In [69]:
basic_emotions = ['Happiness', 'Surprise', 'Aversion', 'Anger', 'Sadness', 'Fear']

In [70]:
d = {}

In [71]:
# Each cell keeps images labeled with only that emotion (n_emotions < 2)
for be in basic_emotions:
    d[be + '_adult_female'] = df1.loc[df1[be] & (df1['n_emotions'] < 2) & (df1['gender'] == 'Female') & (df1['age'] == 'Adult'), 'filename'].tolist()

In [72]:
for be in basic_emotions:
    d[be + '_adult_male'] = df1.loc[df1[be] & (df1['n_emotions'] < 2) & (df1['gender'] != 'Female') & (df1['age'] == 'Adult'), 'filename'].tolist()

In [73]:
for be in basic_emotions:
    d[be + '_minor_female'] = df1.loc[df1[be] & (df1['n_emotions'] < 2) & (df1['gender'] == 'Female') & (df1['age'] != 'Adult'), 'filename'].tolist()

In [74]:
for be in basic_emotions:
    d[be + '_minor_male'] = df1.loc[df1[be] & (df1['n_emotions'] < 2) & (df1['gender'] != 'Female') & (df1['age'] != 'Adult'), 'filename'].tolist()

In [75]:
d.keys()

dict_keys(['Happiness_adult_female', 'Surprise_adult_female', 'Aversion_adult_female', 'Anger_adult_female', 'Sadness_adult_female', 'Fear_adult_female', 'Happiness_adult_male', 'Surprise_adult_male', 'Aversion_adult_male', 'Anger_adult_male', 'Sadness_adult_male', 'Fear_adult_male', 'Happiness_minor_female', 'Surprise_minor_female', 'Aversion_minor_female', 'Anger_minor_female', 'Sadness_minor_female', 'Fear_minor_female', 'Happiness_minor_male', 'Surprise_minor_male', 'Aversion_minor_male', 'Anger_minor_male', 'Sadness_minor_male', 'Fear_minor_male'])

In [76]:
len(d.keys())

24
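
The four loops above differ only in the gender and age conditions, so they could be collapsed into one loop over all combinations. The sketch below does this on a toy frame with two emotions (8 cells instead of 24). The names `toy`, `toy_basic`, `age_group` and `cells` are made up for the example. It matches gender by equality rather than `!= 'Female'`, which is equivalent only because gender is always Male or Female.

```python
from itertools import product

import pandas as pd

toy = pd.DataFrame({
    'filename': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'age': ['Adult', 'Kid', 'Teenager', 'Adult'],
    'Happiness': [True, False, True, True],
    'Fear': [False, True, False, True],
    'n_emotions': [1, 1, 1, 2],
})
toy_basic = ['Happiness', 'Fear']

# Collapse Kid/Teenager into a single 'minor' group.
toy['age_group'] = toy['age'].where(toy['age'] == 'Adult', 'Minor').str.lower()

cells = {}
for emotion, age_group, gender in product(toy_basic, ['adult', 'minor'], ['female', 'male']):
    mask = (
        toy[emotion]
        & (toy['n_emotions'] == 1)  # only images with a single emotion label
        & (toy['age_group'] == age_group)
        & (toy['gender'].str.lower() == gender)
    )
    cells[f'{emotion}_{age_group}_{gender}'] = toy.loc[mask, 'filename'].tolist()

print(len(cells))
```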

## Save to text file

In [90]:
# text file with the names of the subgroups
with open('categories_names.txt', 'w') as filehandle:
    for listitem in d.keys():
        filehandle.write('%s\n' % listitem)

In [88]:
# text file with the subgroup names prefixed with './'
with open('categories_names_fold.txt', 'w') as filehandle:
    for listitem in d.keys():
        filehandle.write('./%s\n' % listitem)

In [89]:
# text file with the names of the subgroups (second copy, '_mk' suffix)
with open('categories_names_mk.txt', 'w') as filehandle:
    for listitem in d.keys():
        filehandle.write('%s\n' % listitem)

In [83]:
# text files for filenames for each sub group
for k in d.keys():
    with open(k+'.txt', 'w') as filehandle:
        for listitem in d[k]:
            filehandle.write('%s\n' % listitem)
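
For reference, the per-subgroup export can also be sketched with `pathlib`, writing into a temporary directory so the example does not touch the working folder. `toy_groups` stands in for `d` and `out_dir` is a hypothetical output location.

```python
import tempfile
from pathlib import Path

# Toy stand-in for the dictionary of subgroup name -> list of filenames.
toy_groups = {
    'Happiness_adult_female': ['a.jpg', 'b.jpg'],
    'Fear_minor_male': [],
}

out_dir = Path(tempfile.mkdtemp())
for name, filenames in toy_groups.items():
    # One filename per line; an empty subgroup yields an empty file.
    (out_dir / f'{name}.txt').write_text(''.join(f'{fn}\n' for fn in filenames))

# Read one file back to confirm the round trip.
read_back = (out_dir / 'Happiness_adult_female.txt').read_text().splitlines()
print(read_back)
```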