# MTurk Dataset Collection

This notebook contains the pipeline for our MTurk dataset collection process. It includes initial EDA for our testing dataset, which consists of ~2.2k faces from the 10k dataset that have corresponding demographic attributes. Each row corresponds to one MTurk participant's responses to the attribute questions, and there are approximately 12 responses for each unique image. **add on as we go**

### imports

In [1]:
import pandas as pd
from statistics import mode
import shutil
import os

## 2.2k face dataset EDA & Cleaning

In [2]:
attribute_df = pd.read_csv("demographic-others-labels.csv")
attribute_df

Unnamed: 0,Filename,Image #,Age,Attractive,Is this person famous?,Common?,How much emotion is in this face?,Emotion?,Eyes direction?,Face direction?,...,Friendly,Makeup?,Gender,Would you cast this person as the star of a movie?,Would this be a good profile picture?,Image quality,Race,Memorable,At what speed do you think this expression is happening?,How much teeth is showing?
0,Google_1_Danielle Goble_5_oval.jpg,1,3.0,5.0,0.0,2.0,2.0,0.0,1.0,1.0,...,4.0,0.0,1.0,2.0,2.0,5.0,6.0,5.0,1.0,0.0
1,Google_1_Danielle Goble_5_oval.jpg,1,2.0,3.0,0.0,2.0,3.0,1.0,1.0,4.0,...,4.0,0.0,1.0,1.0,2.0,5.0,1.0,4.0,3.0,0.0
2,Google_1_Danielle Goble_5_oval.jpg,1,3.0,3.0,0.0,4.0,1.0,6.0,1.0,5.0,...,5.0,0.0,1.0,1.0,2.0,5.0,5.0,5.0,5.0,0.0
3,Google_1_Danielle Goble_5_oval.jpg,1,3.0,4.0,0.0,2.0,2.0,0.0,1.0,4.0,...,3.0,0.0,1.0,1.0,1.0,3.0,1.0,4.0,3.0,0.0
4,Google_1_Danielle Goble_5_oval.jpg,1,3.0,2.0,1.0,3.0,4.0,1.0,1.0,1.0,...,3.0,0.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26658,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,5.0,3.0,1.0,1.0,1.0,...,4.0,1.0,0.0,0.0,2.0,4.0,1.0,3.0,4.0,1.0
26659,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,4.0,3.0,1.0,1.0,1.0,...,4.0,2.0,0.0,0.0,2.0,2.0,1.0,2.0,3.0,1.0
26660,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,2.0,0.0,4.0,3.0,1.0,1.0,1.0,...,3.0,0.0,0.0,0.0,1.0,3.0,1.0,2.0,3.0,1.0
26661,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,5.0,3.0,1.0,1.0,1.0,...,3.0,0.0,0.0,0.0,1.0,3.0,1.0,3.0,2.0,1.0


In [3]:
# we want to see the race breakdown in the 2.2k face dataset 
attribute_df["Race"].value_counts()

1.0    20984
2.0     2593
5.0     1262
3.0      664
6.0      501
4.0      371
0.0      276
Name: Race, dtype: int64

In [4]:
# compute the mode (the most frequent response) of each attribute for every image file
# across its ~12 participant responses (grouping by image file)
mode_per_filename = attribute_df.groupby("Filename").apply(lambda x: x.mode())

In [5]:
mode_per_filename = mode_per_filename.dropna()

In [6]:
len(mode_per_filename)

2222
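
The per-image mode can be illustrated on toy data. The sketch below is not the cell above: it aggregates a single column with `Series.mode()` and keeps the first value, which breaks ties toward the smallest code because `Series.mode()` returns its values sorted. The file names, values, and the helper name `first_mode` are hypothetical.

```python
import pandas as pd

# Hypothetical toy responses: three raters for a.jpg, four for b.jpg
toy = pd.DataFrame({
    "Filename": ["a.jpg"] * 3 + ["b.jpg"] * 4,
    "Race": [1.0, 1.0, 2.0, 5.0, 5.0, 3.0, 3.0],
})

def first_mode(responses: pd.Series) -> float:
    """Most frequent response; ties resolve to the smallest tied value."""
    return responses.mode().iloc[0]

# One value per image file
toy_mode = toy.groupby("Filename")["Race"].agg(first_mode)
print(toy_mode)
```

For `b.jpg` the codes 3.0 and 5.0 are tied, so the tie-breaking rule decides the label. Whether that rule is appropriate for the real data is a judgment call.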

In [7]:
mode_per_filename["Race"].value_counts()

1.0    1836
2.0     220
5.0      72
3.0      63
6.0      24
4.0       5
0.0       2
Name: Race, dtype: int64

In [8]:
mode_per_filename["Race"].value_counts() / len(mode_per_filename)

1.0    0.826283
2.0    0.099010
5.0    0.032403
3.0    0.028353
6.0    0.010801
4.0    0.002250
0.0    0.000900
Name: Race, dtype: float64

Based on unique images, faces labeled White (1) account for 82.63% of the data.

Further, we are combining the South Asian (4) and Middle Eastern (6) categories into "South Asian or Middle Eastern" since the sample size of South Asians (5 images) is too small for proper analysis. These two categories are often mistaken for one another, so we chose to combine Middle Eastern and South Asian rather than East Asian and South Asian. Additionally, we are dropping the rows corresponding to "Other" (0) since they account for less than 0.1% of the dataset (2 of 2,222 images).
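
The arithmetic behind these decisions can be checked directly from the per-image counts printed above. This is a small sketch that hard-codes those counts; the variable names are illustrative only.

```python
# Per-image race counts copied from the value_counts() output above
race_counts = {1.0: 1836, 2.0: 220, 5.0: 72, 3.0: 63, 6.0: 24, 4.0: 5, 0.0: 2}

total = sum(race_counts.values())
shares = {code: count / total for code, count in race_counts.items()}

# Size of the merged "South Asian or Middle Eastern" category (codes 4 and 6)
merged_count = race_counts[4.0] + race_counts[6.0]

# Number of images left after dropping "Other" (code 0)
remaining = total - race_counts[0.0]

print(f"total images: {total}")
print(f"share of code 1: {shares[1.0]:.2%}")
print(f"share of code 0: {shares[0.0]:.2%}")
print(f"merged codes 4 and 6: {merged_count}")
print(f"images after dropping code 0: {remaining}")
```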

In [9]:
files_to_drop = mode_per_filename[mode_per_filename["Race"] == 0.0]["Filename"].values
print("files to drop: ", files_to_drop)

files to drop:  ['Google_1_Clarence Morehouse_8_oval.jpg'
 'Google_1_Steven Mahan_1_oval.jpg']


In [10]:
# combining Middle Eastern (6) and South Asian (4) into "South Asian or Middle Eastern" (recoded as 4)
mode_per_filename.loc[mode_per_filename["Race"] == 6.0, "Race"] = 4.0
mode_per_filename.loc[mode_per_filename["Race"] == 4.0]

#Drop other
rows_to_drop = mode_per_filename[mode_per_filename["Race"] == 0.0].index
mode_per_filename = mode_per_filename.drop(rows_to_drop)

In [11]:
mode_per_filename["Race"].value_counts()

1.0    1836
2.0     220
5.0      72
3.0      63
4.0      29
Name: Race, dtype: int64

In [12]:
# number of image files flagged for removal ("Other" race)

len(files_to_drop )

2

In [13]:
mode_per_filename.columns

Index(['Filename', 'Image #', 'Age', 'Attractive', 'Is this person famous?',
       'Common?', 'How much emotion is in this face?', 'Emotion?',
       'Eyes direction?', 'Face direction?', 'Facial hair?', 'Catch question',
       'Friendly', 'Makeup?', 'Gender',
       'Would you cast this person as the star of a movie?',
       'Would this be a good profile picture?', 'Image quality', 'Race',
       'Memorable', 'At what speed do you think this expression is happening?',
       'How much teeth is showing?'],
      dtype='object')

In [14]:
#Extract images where the face is looking straight ahead (Face direction? == 1.0) to standardize pose across the dataset

# .copy() so that later changes do not affect mode_per_filename
df_forward_facing_images = mode_per_filename.loc[mode_per_filename["Face direction?"] == 1.0].copy()
df_forward_facing_images["Race"].value_counts() / len(df_forward_facing_images)


1.0    0.834802
2.0    0.099670
5.0    0.029185
3.0    0.024780
4.0    0.011564
Name: Race, dtype: float64

### **TALK MORE ABOUT EDA....**

### creating image folders and output DataFrames for 2k test images and race testing images for MTurk experiments

In [15]:
def get_aws_url(bucket_name: str, file_name: str) -> str:
    """
    Builds the public AWS S3 URL for the corresponding image (the images were already
    uploaded to the AWS buckets). Spaces in the file name are encoded as %20.
    Intended to be used with the pandas apply function.
    """
    url = f"https://{bucket_name}.s3.amazonaws.com/{file_name}"
    return url.replace(" ", "%20")
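
`get_aws_url` only escapes spaces. If file names could ever contain other reserved characters (for example `#` or `?`), a variant based on the standard library's `urllib.parse.quote` would escape those as well. The sketch below assumes the same virtual-hosted URL layout as above; the name `get_aws_url_quoted` is hypothetical and not part of the original notebook.

```python
from urllib.parse import quote

def get_aws_url_quoted(bucket_name: str, file_name: str) -> str:
    """Build the S3 URL and percent-encode the file name.

    quote() leaves letters, digits, '_', '.', '-', '~' and '/' untouched
    and percent-encodes everything else, including spaces.
    """
    return f"https://{bucket_name}.s3.amazonaws.com/{quote(file_name)}"

url = get_aws_url_quoted("race-testing-mturk-images", "Google_1_Jason Eaton_17_oval.jpg")
print(url)
```

For names that contain only spaces as special characters, this returns the same URL as `get_aws_url`.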

In [16]:
relevant_columns = ["Attractive", "Race", "Memorable", "Filename"]

### race testing dataset

In [17]:
import random

random.seed(23)

#df_forward_facing_images = df_forward_facing_images.sample(frac = 1)

race_set = set(df_forward_facing_images["Race"].values)
image_lis = []

for race in sorted(race_set):  # sorted so the iteration order is reproducible
    
    # candidate images for this race, with attractiveness and memorability both rated 3
    race_df = df_forward_facing_images.loc[(df_forward_facing_images["Race"] == race)]
    race_attractive_df = race_df.loc[race_df["Attractive"] == 3.0]
    race_attractive_memorability_df = race_attractive_df.loc[race_attractive_df["Memorable"] == 3.0]
    
    # pick one candidate row at random (raises ValueError if a race has no candidates)
    random_row_index = random.randint(0, len(race_attractive_memorability_df) - 1)

    random_row = race_attractive_memorability_df.iloc[random_row_index]
    image_lis.append(random_row[[ "Filename", "Attractive", "Race", "Memorable"]])

race_testing_df = pd.DataFrame(image_lis)
race_testing_df = race_testing_df.reset_index()
race_testing_df = race_testing_df[relevant_columns]

In [18]:
race_testing_df["AWSFile"] = race_testing_df["Filename"].apply(lambda file: get_aws_url("race-testing-mturk-images", file))

In [19]:
race_testing_df

Unnamed: 0,Attractive,Race,Memorable,Filename,AWSFile
0,3.0,1.0,3.0,Google_1_Jason Eaton_17_oval.jpg,https://race-testing-mturk-images.s3.amazonaws...
1,3.0,2.0,3.0,Google_1_Annie Revell_17_oval.jpg,https://race-testing-mturk-images.s3.amazonaws...
2,3.0,3.0,3.0,Google_1_Anthony Adcock_13_oval.jpg,https://race-testing-mturk-images.s3.amazonaws...
3,3.0,4.0,3.0,Google_1_Scott Gupta_9_oval.jpg,https://race-testing-mturk-images.s3.amazonaws...
4,3.0,5.0,3.0,Google_1_Della Settles_1_oval.jpg,https://race-testing-mturk-images.s3.amazonaws...
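
As an alternative to the explicit loop, the same kind of selection (one random image per race among those rated 3 on both attractiveness and memorability) could be sketched with `DataFrameGroupBy.sample`. This runs on hypothetical toy data and is not the code that produced the table above; because it uses a different random number generator, it would not pick the same images.

```python
import pandas as pd

# Hypothetical toy data standing in for df_forward_facing_images
toy = pd.DataFrame({
    "Filename": [f"img_{i}.jpg" for i in range(8)],
    "Race": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    "Attractive": [3.0, 3.0, 4.0, 3.0, 2.0, 3.0, 3.0, 3.0],
    "Memorable": [3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 3.0, 3.0],
})

# Keep only images with mid-scale attractiveness and memorability
candidates = toy[(toy["Attractive"] == 3.0) & (toy["Memorable"] == 3.0)]

# Draw one row per race; random_state makes the draw repeatable
picked = (
    candidates.groupby("Race")
    .sample(n=1, random_state=23)
    .reset_index(drop=True)
)
print(picked)
```

A race with no candidate rows simply does not appear in `picked`, so it is worth comparing the number of rows against the number of races.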


### writing output df to new csv

In [20]:
race_testing_df.to_csv(
"race_testing.csv", mode="w", index=False)

### creating image folder

In [21]:
#Extract images from dataframe

source_folder = '10k-Faces'  # Folder with original 10K face dataset
target_folder = 'MTurk-Race-Test'  # Folder that will hold the race testing images

# create the target directory for the images if it does not already exist
current_directory = os.getcwd()
final_directory = os.path.join(current_directory, target_folder)
os.makedirs(final_directory, exist_ok=True)

# for image in race_testing_df["Filename"].values
for image in image_lis: # iterate through each row of the dataframe
    image_file_name = image['Filename']
    source_path = os.path.join(source_folder, image_file_name) #uses image file name to create full path to the original file
    target_path = os.path.join(target_folder, image_file_name) #creates full path to the target location where the image file will be copied
    
    
    shutil.copyfile(source_path, target_path) #copies the content of the source file to the target file
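
The copy step can be wrapped in a small helper that reports missing source files instead of stopping at the first one. This is a sketch using `pathlib`; the function name `copy_images` and the choice to skip missing files are assumptions, not part of the original pipeline.

```python
import shutil
from pathlib import Path

def copy_images(file_names, source_folder, target_folder):
    """Copy each named file from source_folder to target_folder.

    Returns the list of file names that were not found in source_folder.
    """
    source_dir = Path(source_folder)
    target_dir = Path(target_folder)
    target_dir.mkdir(parents=True, exist_ok=True)  # create target if needed

    missing = []
    for name in file_names:
        source_path = source_dir / name
        if not source_path.is_file():
            missing.append(name)  # record and skip instead of raising
            continue
        shutil.copyfile(source_path, target_dir / name)
    return missing
```

It could be called as `copy_images(race_testing_df["Filename"], "10k-Faces", "MTurk-Race-Test")`; an empty return value means every image was copied.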

### 2.2k face dataset

In [22]:
#Drop the race testing MTurk images from the forward-facing dataset

file_name_list = [image['Filename'] for image in image_lis]

values_to_drop = file_name_list
# keep only the rows whose file name is not in the race testing set
images_2k_df = df_forward_facing_images[~df_forward_facing_images['Filename'].isin(values_to_drop)]

images_2k_df = images_2k_df[relevant_columns]
images_2k_df = images_2k_df.reset_index(drop = True)
images_2k_df


Unnamed: 0,Attractive,Race,Memorable,Filename
0,3.0,1.0,3.0,Aaron_Dollar_13_oval.jpg
1,3.0,1.0,2.0,Aaron_Mink_9_oval.jpg
2,3.0,2.0,3.0,Aaron_Turner_13_oval.jpg
3,3.0,1.0,3.0,Ada_Galbreath_19_oval.jpg
4,4.0,1.0,3.0,Ada_Riddick_15_oval.jpg
...,...,...,...,...
1806,3.0,2.0,4.0,Jeremy_Bowlin_15_oval.jpg
1807,3.0,1.0,2.0,Joann_Pickering_16_oval.jpg
1808,2.0,1.0,3.0,Jonathan_DArden_13_oval.jpg
1809,2.0,1.0,3.0,Josephine_Anderson_17_oval.jpg


In [23]:
images_2k_df["AWSFile"] = images_2k_df["Filename"].apply(lambda file: get_aws_url("mturk-2k-images", file))
images_2k_df

Unnamed: 0,Attractive,Race,Memorable,Filename,AWSFile
0,3.0,1.0,3.0,Aaron_Dollar_13_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Aaron...
1,3.0,1.0,2.0,Aaron_Mink_9_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Aaron...
2,3.0,2.0,3.0,Aaron_Turner_13_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Aaron...
3,3.0,1.0,3.0,Ada_Galbreath_19_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Ada_G...
4,4.0,1.0,3.0,Ada_Riddick_15_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Ada_R...
...,...,...,...,...,...
1806,3.0,2.0,4.0,Jeremy_Bowlin_15_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Jerem...
1807,3.0,1.0,2.0,Joann_Pickering_16_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Joann...
1808,2.0,1.0,3.0,Jonathan_DArden_13_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Jonat...
1809,2.0,1.0,3.0,Josephine_Anderson_17_oval.jpg,https://mturk-2k-images.s3.amazonaws.com/Josep...


### writing output df to new csv

In [24]:
images_2k_df.to_csv(
"2k-images.csv", mode="w", index=False)

### creating image folder

In [25]:
#Copy the image files for the remaining (non-test) MTurk images

source_folder = '10k-Faces'  # Folder with original 10K face dataset
target_folder = '2k-MTurk-Images'  # Folder that will hold the remaining forward-facing images

# create the target directory for the images if it does not already exist
current_directory = os.getcwd()
final_directory = os.path.join(current_directory, target_folder)
os.makedirs(final_directory, exist_ok=True)

for index, row in images_2k_df.iterrows(): # iterate through each row of the dataframe
    image_file_name = row['Filename']
    source_path = os.path.join(source_folder, image_file_name) #uses image file name to create full path to the original file
    target_path = os.path.join(target_folder, image_file_name) #creates full path to the target location where the image file will be copied
    
    shutil.copyfile(source_path, target_path) #copies the content of the source file to the target file