# Train/val split 

Splitting the dataset into train, validation and test sets is not straightforward, because some spoken word pieces appear multiple times in my SWDance list. For the validation set, one would prefer motion + text pairs that the model has never seen during training.

In [25]:
import pandas as pd 
import numpy as np

index_df = pd.read_csv('dataset/SWDance2/index.csv')
index_df.head(1)

Unnamed: 0,file_idx,video_idx,start_frame,end_frame,start_time,end_time,new_name,fps,caption,no_frames
0,0,0,149,387,7.45,19.36,000000.npy,20,"[' Music, breathing of statues, perhaps silenc...",239


You could just take one video whose text you know appears only once in the dataset.

In [26]:
index_df[index_df['video_idx'] == 42].head(1)

Unnamed: 0,file_idx,video_idx,start_frame,end_frame,start_time,end_time,new_name,fps,caption,no_frames
2640,1320,42,18,154,0.89,7.68,001320.npy,20,"["" I cry when there is no end and I cry becaus...",137
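
To find candidate videos like this one programmatically, one option is to count in how many distinct videos each plain text occurs. The sketch below is only an illustration on invented toy rows. It assumes that `caption` holds a stringified list whose first element is the plain text (as the printed rows suggest), and the helper name `videos_with_unique_text` is hypothetical.

```python
import ast
import pandas as pd

# Invented toy rows that mimic the assumed layout of index.csv.
toy_df = pd.DataFrame({
    "video_idx": [0, 0, 1, 2],
    "caption": [
        "[' Hello there', ' Hello there joy', 'joy']",
        "[' Second line', ' Second line joy', 'joy']",
        "[' Hello there', ' Hello there fear', 'fear']",
        "[' Only here', ' Only here joy', 'joy']",
    ],
})

def videos_with_unique_text(df):
    """Return the video_idx values whose texts occur in no other video."""
    # Assumption: element 0 of the stringified list is the plain text.
    plain = df["caption"].map(lambda c: ast.literal_eval(c)[0].strip().lower())
    # For every row, the number of distinct videos that share its plain text.
    n_videos = df.groupby(plain)["video_idx"].transform("nunique")
    # A video is "shared" if at least one of its texts occurs elsewhere.
    shared = (n_videos > 1).groupby(df["video_idx"]).any()
    return shared.index[~shared].tolist()

print(videos_with_unique_text(toy_df))
```

This compares whole captions only. Pieces that were cut into clips at different points in different videos would not be detected this way, so a manual check remains useful.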


In [4]:
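# Get all filenames of this video.
# Note: str.strip('.npy') strips any of the characters '.', 'n', 'p', 'y' from both
# ends instead of removing the suffix. str.removesuffix (Python 3.9+) removes
# exactly the extension.
vid_42_idcs = [
    name.removesuffix('.npy')
    for name in index_df.loc[index_df['video_idx'] == 42, 'new_name']
]

print(len(vid_42_idcs), vid_42_idcs)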

18 ['001320', 'M001320', '001321', 'M001321', '001322', 'M001322', '001323', 'M001323', '001324', 'M001324', '001325', 'M001325', '001326', 'M001326', '001327', 'M001327', '001328', 'M001328']


I also 'manually' added some sentences to the validation set, chosen so that each augmented emotion is represented. An example of how to find such sentences is shown in the section 'Extract sentences per emotion'.

In [15]:
emo_list = [
    "when do we draw the line", 
    "dear you remember that I love you",
    "it all comes down to this", 
    "there's nobody's missing me", 
    "I know I am because I said",
    "We're used up and we're sad and drunk",
    "Maybe I'm just not good enough",
    "So how am I supposed to let go", 
    "So we grew up believing known whatever fall in love with us", 
    "But what if it isn't easy",
]

emo_idcs = []
for sentence in emo_list:
    # NOTE: make sure to get each occurrence of this sentence, and put it into the validation set!
    # regex=False matches the sentence literally; na=False treats missing captions as non-matches.
    mask = index_df['caption'].str.contains(sentence, case=False, regex=False, na=False)
    for idx in np.flatnonzero(mask):
        print(index_df['caption'][idx])
        # removesuffix (Python 3.9+) drops exactly the '.npy' extension
        emo_idcs.append(index_df['new_name'][idx].removesuffix('.npy'))

print(len(emo_idcs), emo_idcs)



[' When do we draw the line?', ' When do we draw the line? joy delight happiness passion excitement love', 'joy delight happiness passion excitement love']
[' When do we draw the line?', ' When do we draw the line? joy delight happiness passion excitement love', 'joy delight happiness passion excitement love']
[' When do we draw the line?', ' When do we draw the line? joy delight happiness passion excitement love', 'joy delight happiness passion excitement love']
[' When do we draw the line?', ' When do we draw the line? joy delight happiness passion excitement love', 'joy delight happiness passion excitement love']
[' But when do we draw the line?', ' But when do we draw the line? anger frustration outrage resentment dismay fear', 'anger frustration outrage resentment dismay fear']
[' But when do we draw the line?', ' But when do we draw the line? anger frustration outrage resentment dismay fear', 'anger frustration outrage resentment dismay fear']
[' When do we draw the line?', ' Whe

In [24]:
# validation set is the union of the index lists described above
# (sorted so that the saved file order is reproducible)
val_idcs = sorted(set(emo_idcs + vid_42_idcs))

# all other clips go to the training set
val_set = set(val_idcs)
train_idcs = []
for name in index_df['new_name']:
    name = name.removesuffix('.npy')  # Python 3.9+; drops exactly the extension
    if name not in val_set:
        train_idcs.append(name)

print(len(train_idcs), len(val_idcs))

# Check for overlap 
if not set(train_idcs).intersection(set(val_idcs)):
    print("No overlap!")
else:
    print("Overlap...")

4364 76
No overlap!
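
The check above only compares file names. Because the goal is a validation set whose text the model has never seen, it may also be worth checking for overlap at the text level. The following is a minimal sketch on invented toy rows, assuming that `caption` holds a stringified list whose first element is the plain text; `leaked_texts` is a hypothetical helper name.

```python
import ast
import pandas as pd

# Invented toy rows, only to make the sketch runnable.
toy_df = pd.DataFrame({
    "new_name": ["000000.npy", "M000000.npy", "000001.npy", "000002.npy"],
    "caption": [
        "[' A line', ' A line joy', 'joy']",
        "[' A line', ' A line joy', 'joy']",
        "[' A line', ' A line fear', 'fear']",
        "[' Another line', ' Another line joy', 'joy']",
    ],
})

def leaked_texts(df, val_names):
    """Return plain texts that occur both inside and outside the validation set."""
    # Assumption: element 0 of the stringified list is the plain text.
    plain = df["caption"].map(lambda c: ast.literal_eval(c)[0].strip().lower())
    stems = df["new_name"].str.replace(".npy", "", regex=False)
    in_val = stems.isin(val_names)
    return sorted(set(plain[in_val]) & set(plain[~in_val]))

# '000001' carries the same text as the validation clips but is left in training.
print(leaked_texts(toy_df, ["000000", "M000000"]))
# Moving it into the validation set removes the leak.
print(leaked_texts(toy_df, ["000000", "M000000", "000001"]))
```

An empty result only rules out identical captions. A sentence that is embedded in a longer training caption would still need a substring search like the one used for `emo_list`.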


In [None]:
# Save in correct format. 
with open('dataset/SWDance/train.txt', 'w') as f:
    f.write('\n'.join(train_idcs))
with open('dataset/SWDance/val.txt', 'w') as f:
    f.write('\n'.join(val_idcs))

Create a text file with all validation texts, so that you can easily sample from it. I used the 'text + augmented emotion' sentence for this.

In [31]:
sampling_methods = {
    "just text": 0,
    "text+emo": 1,
    "just emo": 2,
}
val_sampling = "text+emo"

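val_texts = []
for file in val_idcs:
    with open("dataset/SWDance/texts/" + file.strip('\n') + ".txt", "r") as txtfile:
        text = txtfile.readlines()
    # Skip files that contain this placeholder line (apparently an empty caption).
    if " # /X#0.0#0.0\n" in text:
        continue
    # Each line is '#'-separated; the caption is the first field.
    val_texts.append(text[sampling_methods[val_sampling]].split("#")[0])

val_texts = list(dict.fromkeys(val_texts))  # remove duplicates, keeping order
val_texts += [" "]  # also add a whitespace-only text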
print(len(val_texts))

with open("sample/val_texts2.txt", "w") as sample_file:
    for txt in val_texts:
        sample_file.write(txt+"\n")

27
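
Sampling from the resulting file could look like the sketch below. It is only an illustration: `sample_val_texts` is a hypothetical helper, and the demo writes a small temporary file instead of reading `sample/val_texts2.txt`. Lines are split on `"\n"` only, so that the whitespace-only entry added above survives as a valid text.

```python
import random
import tempfile
from pathlib import Path

def sample_val_texts(path, k, seed=None):
    """Draw k distinct texts from a file that stores one text per line."""
    texts = Path(path).read_text().split("\n")
    if texts and texts[-1] == "":
        texts.pop()  # drop the empty string after the final newline
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(texts, k)

# Demo on a temporary file with invented texts, including the " " entry.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "val_texts_demo.txt"
    demo_path.write_text("first text\nsecond text\n \n")
    picked = sample_val_texts(demo_path, k=2, seed=0)
    all_texts = sample_val_texts(demo_path, k=3, seed=0)

print(picked)
```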


## Extract sentences per emotion
Here is an example of how to get a few sentences per emotion. This might be useful for shaping the validation set.

In [29]:

import numpy as np
emotions = ['joy', 'surprise', 'anger', 'fear', 'sadness']
n_samples = 2  # samples per emotion

all_emo_idcs = []
all_emo_caps = []
for emo in emotions:
    # Literal substring match on the caption; missing captions count as non-matches.
    mask = index_df['caption'].str.contains(emo, regex=False, na=False)

    for idx in np.flatnonzero(mask)[:n_samples]:
        # removesuffix (Python 3.9+) drops exactly the '.npy' extension
        all_emo_idcs.append(index_df['new_name'][idx].removesuffix('.npy'))
        all_emo_caps.append(index_df['caption'][idx])

print(len(all_emo_idcs), all_emo_idcs)
for i, cap in enumerate(all_emo_caps):
    print(all_emo_idcs[i], cap)


10 ['000000', 'M000000', '000274', 'M000274', '000001', 'M000001', '000001', 'M000001', '000020', 'M000020']
000000 [' Music, breathing of statues, perhaps silence of paintings, you language where all language', ' Music, breathing of statues, perhaps silence of paintings, you language where all language joy delight happiness passion excitement love', 'joy delight happiness passion excitement love']
M000000 [' Music, breathing of statues, perhaps silence of paintings, you language where all language', ' Music, breathing of statues, perhaps silence of paintings, you language where all language joy delight happiness passion excitement love', 'joy delight happiness passion excitement love']
000274 [" Because they see your heart for they see your skin, but she's only ever always been amazing. He", " Because they see your heart for they see your skin, but she's only ever always been amazing. He surprise surprising surprised chance", 'surprise surprising surprised chance']
M000274 [" Because 
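
Note that `'000001'` and `'M000001'` appear twice in the list above. A likely reason is that the substring match is not exclusive: the anger keyword string printed earlier ('anger frustration outrage resentment dismay fear') also contains 'fear'. One possible way around this, sketched below on invented toy rows, is to match on the first word of the emotion string instead. The assumptions are that the third list element is the emotion keyword string and that its first word names the emotion. The fear keywords in the toy data and the helper names are hypothetical.

```python
import ast
import pandas as pd

# Invented toy rows; the 'fear scared' keyword string is made up for illustration.
toy_df = pd.DataFrame({
    "new_name": ["000000.npy", "000001.npy", "000002.npy"],
    "caption": [
        "[' A', ' A joy delight happiness passion excitement love', "
        "'joy delight happiness passion excitement love']",
        "[' B', ' B anger frustration outrage resentment dismay fear', "
        "'anger frustration outrage resentment dismay fear']",
        "[' C', ' C fear scared', 'fear scared']",
    ],
})

def primary_emotion(caption):
    """First word of the emotion keyword string, or '' if there is none."""
    words = ast.literal_eval(caption)[2].split()
    return words[0] if words else ""

def names_per_emotion(df, emotions, n_samples):
    """Map each emotion to at most n_samples file stems with that primary emotion."""
    primary = df["caption"].map(primary_emotion)
    stems = df["new_name"].str.replace(".npy", "", regex=False)
    return {emo: stems[primary == emo].head(n_samples).tolist() for emo in emotions}

# The plain substring match selects the anger row for 'fear' as well.
substring_hits = toy_df["caption"].str.contains("fear", regex=False).tolist()
result = names_per_emotion(toy_df, ["joy", "anger", "fear"], n_samples=2)
print(substring_hits, result)
```

Matching on the keyword string also avoids false hits when an emotion word happens to occur in the spoken text itself.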