# MTurk Dataset Collection

This notebook contains the pipeline for our MTurk dataset collection process. It includes initial EDA for our testing dataset, which consists of ~2.2k faces from the 10k dataset that have corresponding demographic attributes. Each row corresponds to one MTurk participant's responses to the attribute questions, and there are approximately 12 responses for each unique image. **add on as we go**

## Imports

In [74]:
import pandas as pd
from statistics import mode
import shutil
import os

## 2.2k face dataset EDA & Cleaning

In [75]:
attribute_df = pd.read_csv("demographic-others-labels.csv")
attribute_df

Unnamed: 0,Filename,Image #,Age,Attractive,Is this person famous?,Common?,How much emotion is in this face?,Emotion?,Eyes direction?,Face direction?,...,Friendly,Makeup?,Gender,Would you cast this person as the star of a movie?,Would this be a good profile picture?,Image quality,Race,Memorable,At what speed do you think this expression is happening?,How much teeth is showing?
0,Google_1_Danielle Goble_5_oval.jpg,1,3.0,5.0,0.0,2.0,2.0,0.0,1.0,1.0,...,4.0,0.0,1.0,2.0,2.0,5.0,6.0,5.0,1.0,0.0
1,Google_1_Danielle Goble_5_oval.jpg,1,2.0,3.0,0.0,2.0,3.0,1.0,1.0,4.0,...,4.0,0.0,1.0,1.0,2.0,5.0,1.0,4.0,3.0,0.0
2,Google_1_Danielle Goble_5_oval.jpg,1,3.0,3.0,0.0,4.0,1.0,6.0,1.0,5.0,...,5.0,0.0,1.0,1.0,2.0,5.0,5.0,5.0,5.0,0.0
3,Google_1_Danielle Goble_5_oval.jpg,1,3.0,4.0,0.0,2.0,2.0,0.0,1.0,4.0,...,3.0,0.0,1.0,1.0,1.0,3.0,1.0,4.0,3.0,0.0
4,Google_1_Danielle Goble_5_oval.jpg,1,3.0,2.0,1.0,3.0,4.0,1.0,1.0,1.0,...,3.0,0.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26658,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,5.0,3.0,1.0,1.0,1.0,...,4.0,1.0,0.0,0.0,2.0,4.0,1.0,3.0,4.0,1.0
26659,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,4.0,3.0,1.0,1.0,1.0,...,4.0,2.0,0.0,0.0,2.0,2.0,1.0,2.0,3.0,1.0
26660,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,2.0,0.0,4.0,3.0,1.0,1.0,1.0,...,3.0,0.0,0.0,0.0,1.0,3.0,1.0,2.0,3.0,1.0
26661,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,5.0,3.0,1.0,1.0,1.0,...,3.0,0.0,0.0,0.0,1.0,3.0,1.0,3.0,2.0,1.0


In [80]:
# Race breakdown across all individual responses in the 2.2k face dataset
attribute_df["Race"].value_counts()

Race
1.0    20984
2.0     2593
5.0     1262
3.0      664
6.0      501
4.0      371
0.0      276
Name: count, dtype: int64

In [81]:
# Compute the mode (most frequent response) of each attribute for every image file,
# taken over that image's ~12 participant responses (grouping by image file)
mode_per_filename = attribute_df.groupby("Filename").apply(lambda x: x.mode())

In [82]:
# x.mode() pads columns that have fewer tied modes with NaN; dropna() keeps only complete rows
mode_per_filename = mode_per_filename.dropna()

In [83]:
len(mode_per_filename)

2222
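
The per-image mode above relies on `DataFrame.mode()`, which returns one row per tied mode and pads shorter columns with NaN, so the `dropna()` step is what reduces each image to a single row. A minimal sketch of a more explicit alternative is below. It runs on a made-up toy frame, and the helper `first_mode` is a hypothetical name, not part of this notebook. It assumes every column has at least one non-missing response per image.

```python
import pandas as pd

# Toy stand-in for attribute_df: two images, three responses each (values are made up).
toy = pd.DataFrame({
    "Filename": ["a.jpg", "a.jpg", "a.jpg", "b.jpg", "b.jpg", "b.jpg"],
    "Race": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],
    "Age": [2.0, 3.0, 3.0, 1.0, 4.0, 4.0],
})

def first_mode(series):
    """Return the most frequent value; ties resolve to the smallest value."""
    return series.mode().iloc[0]

# One row per image, each column holding that image's modal response.
toy_modes = toy.groupby("Filename").agg(first_mode).reset_index()
print(toy_modes)
```

Because `Series.mode()` returns its values sorted, taking `.iloc[0]` makes the tie-breaking rule explicit instead of leaving it to the NaN padding.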

In [84]:
mode_per_filename["Race"].value_counts()

Race
1.0    1836
2.0     220
5.0      72
3.0      63
6.0      24
4.0       5
0.0       2
Name: count, dtype: int64

In [85]:
mode_per_filename["Race"].value_counts() / len(mode_per_filename)

Race
1.0    0.826283
2.0    0.099010
5.0    0.032403
3.0    0.028353
6.0    0.010801
4.0    0.002250
0.0    0.000900
Name: count, dtype: float64

Based on unique images, faces labeled White (Race = 1) account for 82.63% of the data.

Further, we are combining the South Asian (4) and East Asian (3) categories into "Asian (South or East)" since the sample size of South Asians is too small for proper analysis (5 images). Additionally, we are dropping the rows corresponding to "Other" (0) since they account for less than 0.1% of the dataset (2 of 2,222 images).

In [86]:
files_to_drop = mode_per_filename[mode_per_filename["Race"] == 0.0]["Filename"].values
print("files to drop: ", files_to_drop)

files to drop:  ['Google_1_Clarence Morehouse_8_oval.jpg'
 'Google_1_Steven Mahan_1_oval.jpg']


In [88]:
# Merge Race code 6.0 into code 4.0 to form the combined category.
# NOTE: the text above names codes 4 (South Asian) and 3 (East Asian); confirm against the codebook.
mode_per_filename.loc[mode_per_filename["Race"] == 6.0, "Race"] = 4.0
mode_per_filename.loc[mode_per_filename["Race"] == 4.0]  # inspection only; not displayed mid-cell

# Drop "Other" (Race == 0.0)
rows_to_drop = mode_per_filename[mode_per_filename["Race"] == 0.0].index
mode_per_filename = mode_per_filename.drop(rows_to_drop)

In [90]:
mode_per_filename["Race"].value_counts()

Race
1.0    1836
2.0     220
5.0      72
3.0      63
4.0      29
Name: count, dtype: int64
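
To make a recoding step like the one above easier to audit, the merge and the drop can be driven by an explicit mapping and checked against the counts before and after. This is only a sketch on made-up labels: the helper `recode_and_drop`, the mapping, and the codes are placeholders, not the study's codebook.

```python
import pandas as pd

def recode_and_drop(labels, merge_map, drop_codes):
    """Merge codes via merge_map, then remove entries whose code is in drop_codes."""
    recoded = labels.replace(merge_map)
    return recoded[~recoded.isin(drop_codes)]

# Toy labels; the codes and the mapping below are placeholders.
toy_labels = pd.Series([1.0, 4.0, 6.0, 6.0, 0.0, 2.0], name="Race")
cleaned = recode_and_drop(toy_labels, merge_map={6.0: 4.0}, drop_codes=[0.0])

# The merged category should hold the combined count, and only dropped codes should vanish.
before = toy_labels.value_counts()
after = cleaned.value_counts()
assert after[4.0] == before[4.0] + before[6.0]
assert len(cleaned) == len(toy_labels) - before[0.0]
print(after)
```

Keeping the mapping in one dictionary means the text and the code can be compared against a single source when the category definitions change.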

In [91]:
# Number of image files to remove (those whose modal Race is "Other")

len(files_to_drop)

2

In [92]:
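def image_file_names_2k(input_file, lines_to_remove):
    """Return the lines of input_file whose stripped text is not in lines_to_remove."""
    to_remove = set(lines_to_remove)  # set membership avoids rescanning the array per line
    files = []
    with open(input_file, 'r') as file:
        for line in file:
            if line.strip() not in to_remove:
                files.append(line)  # note: the trailing newline is kept
    return files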


In [93]:
image_files = image_file_names_2k("target-filenames.txt", files_to_drop)

In [94]:
len(image_files)

2222
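
The count above does not by itself show how many names were actually removed from the list. One way to check is to return the removed names alongside the kept ones. The sketch below uses a hypothetical helper, `filter_filenames`, and a throwaway file with made-up names; it is not the function used in this notebook.

```python
import os
import tempfile

def filter_filenames(list_path, names_to_remove):
    """Return (kept, removed) filename lists, comparing whitespace-stripped names."""
    to_remove = set(names_to_remove)
    kept, removed = [], []
    with open(list_path, "r") as handle:
        for line in handle:
            name = line.strip()
            if not name:
                continue  # skip blank lines
            (removed if name in to_remove else kept).append(name)
    return kept, removed

# Demo on a throwaway file with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-filenames.txt")
    with open(demo_path, "w") as handle:
        handle.write("a.jpg\nb.jpg\nc.jpg\n")
    kept, removed = filter_filenames(demo_path, ["b.jpg", "z.jpg"])

print(len(kept), removed)
```

If `removed` comes back shorter than the list of names to drop, some of those names never appeared in the file (here, `z.jpg`).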

In [95]:
mode_per_filename.columns

Index(['Filename', 'Image #', 'Age', 'Attractive', 'Is this person famous?',
       'Common?', 'How much emotion is in this face?', 'Emotion?',
       'Eyes direction?', 'Face direction?', 'Facial hair?', 'Catch question',
       'Friendly', 'Makeup?', 'Gender',
       'Would you cast this person as the star of a movie?',
       'Would this be a good profile picture?', 'Image quality', 'Race',
       'Memorable', 'At what speed do you think this expression is happening?',
       'How much teeth is showing?'],
      dtype='object')

In [98]:
# Keep only images whose modal face direction is straight ahead (code 1.0)

df_forward_facing_images = mode_per_filename.loc[mode_per_filename["Face direction?"] == 1.0].copy()
df_forward_facing_images["Race"].value_counts() / len(df_forward_facing_images)


Race
1.0    0.834802
2.0    0.099670
5.0    0.029185
3.0    0.024780
4.0    0.011564
Name: count, dtype: float64
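
The shares above are computed as counts divided by the number of rows. pandas can also produce them directly with `value_counts(normalize=True)`. A small sketch on made-up labels shows that the two agree when there are no missing values:

```python
import pandas as pd

# Hypothetical modal labels for five images (values are made up).
toy_race = pd.Series([1.0, 1.0, 1.0, 2.0, 5.0], name="Race")

shares = toy_race.value_counts(normalize=True)    # fractions that sum to 1
manual = toy_race.value_counts() / len(toy_race)  # counts divided by row count

print((shares * 100).round(2))
```

One difference to keep in mind: `value_counts` skips NaN by default, so with missing labels `normalize=True` divides by the number of non-missing rows, while dividing by `len(...)` uses all rows.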

In [97]:
df_forward_facing_images.columns

Index(['Filename', 'Image #', 'Age', 'Attractive', 'Is this person famous?',
       'Common?', 'How much emotion is in this face?', 'Emotion?',
       'Eyes direction?', 'Face direction?', 'Facial hair?', 'Catch question',
       'Friendly', 'Makeup?', 'Gender',
       'Would you cast this person as the star of a movie?',
       'Would this be a good profile picture?', 'Image quality', 'Race',
       'Memorable', 'At what speed do you think this expression is happening?',
       'How much teeth is showing?'],
      dtype='object')

In [106]:
import random

random.seed(23)  # fixed seed so the same images are drawn on every run

# df_forward_facing_images = df_forward_facing_images.sample(frac=1)

race_set = set(df_forward_facing_images["Race"].values)
image_lis = []

# For each race category, draw one image rated 3.0 on both
# Attractive and Memorable
for race in race_set:
    race_df = df_forward_facing_images.loc[df_forward_facing_images["Race"] == race]
    race_attractive_df = race_df.loc[race_df["Attractive"] == 3.0]
    race_attractive_memorability_df = race_attractive_df.loc[race_attractive_df["Memorable"] == 3.0]

    # random.randint is inclusive on both ends, hence the "- 1"
    random_row_index = random.randint(0, len(race_attractive_memorability_df) - 1)

    random_row = race_attractive_memorability_df.iloc[random_row_index]
    image_lis.append(random_row[["Filename", "Attractive", "Race", "Memorable"]])

image_lis


[Filename      Google_1_Jason Eaton_17_oval.jpg
 Attractive                                 3.0
 Race                                       1.0
 Memorable                                  3.0
 Name: (Google_1_Jason Eaton_17_oval.jpg, 0), dtype: object,
 Filename      Google_1_Annie Revell_17_oval.jpg
 Attractive                                  3.0
 Race                                        2.0
 Memorable                                   3.0
 Name: (Google_1_Annie Revell_17_oval.jpg, 0), dtype: object,
 Filename      Google_1_Anthony Adcock_13_oval.jpg
 Attractive                                    3.0
 Race                                          3.0
 Memorable                                     3.0
 Name: (Google_1_Anthony Adcock_13_oval.jpg, 0), dtype: object,
 Filename      Google_1_Scott Gupta_9_oval.jpg
 Attractive                                3.0
 Race                                      4.0
 Memorable                                 3.0
 Name: (Google_1_Scott Gupta_9_ov
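
The per-race draw above can also be written with pandas' grouped sampling. This is a sketch of an alternative on made-up data, not the method used in this notebook, and it will not reproduce the same picks as the `random.randint` loop because the random number generators differ.

```python
import pandas as pd

# Toy stand-in for df_forward_facing_images (values are made up).
toy = pd.DataFrame({
    "Filename": [f"img_{i}.jpg" for i in range(8)],
    "Race": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    "Attractive": [3.0, 3.0, 4.0, 3.0, 3.0, 3.0, 2.0, 3.0],
    "Memorable": [3.0, 3.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.0],
})

# Keep rows rated 3.0 on both attributes, then draw one row per Race group.
candidates = toy[(toy["Attractive"] == 3.0) & (toy["Memorable"] == 3.0)]
picked = candidates.groupby("Race").sample(n=1, random_state=23)

print(picked[["Filename", "Attractive", "Race", "Memorable"]])
```

A race category with no qualifying rows simply does not appear in `picked`, whereas the loop version would raise an error from `random.randint(0, -1)`.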

In [107]:
# Copy the sampled test images into their own folder

source_folder = 'Face Images'  # Folder with original 10K face dataset
target_folder = 'MTurk Test'  # Folder that receives the sampled test images
os.makedirs(target_folder, exist_ok=True)  # create the target folder if it does not exist

for image in image_lis:  # iterate through each sampled row (a pandas Series)
    image_file_name = image['Filename']
    source_path = os.path.join(source_folder, image_file_name)  # full path to the original file
    target_path = os.path.join(target_folder, image_file_name)  # full path to the copy's destination

    shutil.copyfile(source_path, target_path)  # copy the file contents to the target path


In [111]:
# Drop the test MTurk images from the forward-facing dataset

values_to_drop = [image['Filename'] for image in image_lis]
filtered_df = df_forward_facing_images[~df_forward_facing_images['Filename'].isin(values_to_drop)]
len(filtered_df)


1816
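
The length alone does not confirm how many rows were removed, so a quick consistency check after a split like this can be useful. The sketch below runs on made-up names. In this notebook the same checks would apply to `df_forward_facing_images`, `values_to_drop`, and `filtered_df`.

```python
import pandas as pd

# Toy stand-ins (names are made up): the full pool and the held-out test picks.
pool = pd.DataFrame({"Filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]})
test_names = ["b.jpg", "d.jpg"]

is_test = pool["Filename"].isin(test_names)
remaining = pool[~is_test]

# Every test name should be found in the pool, and the two parts should add up.
n_removed = int(is_test.sum())
assert n_removed == len(set(test_names))
assert len(remaining) + n_removed == len(pool)
assert not remaining["Filename"].isin(test_names).any()
print(len(pool), n_removed, len(remaining))
```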

In [112]:
# Copy the remaining (non-test) forward-facing MTurk images

source_folder = 'Face Images'  # Folder with original 10K face dataset
target_folder = 'MTurk Images Face Forward'  # Folder that receives the remaining forward-facing images
os.makedirs(target_folder, exist_ok=True)  # create the target folder if it does not exist

for index, row in filtered_df.iterrows():  # iterate through each row of the dataframe
    image_file_name = row['Filename']
    source_path = os.path.join(source_folder, image_file_name)  # full path to the original file
    target_path = os.path.join(target_folder, image_file_name)  # full path to the copy's destination

    shutil.copyfile(source_path, target_path)  # copy the file contents to the target path
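
Both copy loops follow the same pattern, so they could share one helper that also creates the destination folder and reports files it could not find. A minimal sketch follows; `copy_images` is a hypothetical name, and the demo uses a temporary directory with made-up files rather than the real image folders.

```python
import shutil
import tempfile
from pathlib import Path

def copy_images(filenames, source_folder, target_folder):
    """Copy each named file from source_folder to target_folder; return names not found."""
    source, target = Path(source_folder), Path(target_folder)
    target.mkdir(parents=True, exist_ok=True)  # create the destination if needed
    missing = []
    for name in filenames:
        src = source / name
        if src.is_file():
            shutil.copyfile(src, target / name)
        else:
            missing.append(name)
    return missing

# Demo in a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "src"
    src_dir.mkdir()
    (src_dir / "a.jpg").write_bytes(b"fake image bytes")
    missing = copy_images(["a.jpg", "b.jpg"], src_dir, Path(tmp) / "dst")
    copied = sorted(p.name for p in (Path(tmp) / "dst").iterdir())

print(copied, missing)
```

Returning the missing names instead of raising lets a long copy run finish and leaves a list to investigate afterwards.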