# MTurk Dataset Collection

This notebook contains the pipeline for our MTurk dataset collection process. It includes initial EDA for our testing dataset: ~2.2k faces from the 10k dataset that have corresponding demographic attributes. Each row corresponds to one MTurk participant's responses to the attribute questions, and there are approximately 12 responses for each unique image. **add on as we go**

### Imports

In [1]:
import pandas as pd
from statistics import mode

## 2.2k face dataset EDA & Cleaning

In [2]:
attribute_df = pd.read_csv("demographic-others-labels.csv")
attribute_df

Unnamed: 0,Filename,Image #,Age,Attractive,Is this person famous?,Common?,How much emotion is in this face?,Emotion?,Eyes direction?,Face direction?,...,Friendly,Makeup?,Gender,Would you cast this person as the star of a movie?,Would this be a good profile picture?,Image quality,Race,Memorable,At what speed do you think this expression is happening?,How much teeth is showing?
0,Google_1_Danielle Goble_5_oval.jpg,1,3.0,5.0,0.0,2.0,2.0,0.0,1.0,1.0,...,4.0,0.0,1.0,2.0,2.0,5.0,6.0,5.0,1.0,0.0
1,Google_1_Danielle Goble_5_oval.jpg,1,2.0,3.0,0.0,2.0,3.0,1.0,1.0,4.0,...,4.0,0.0,1.0,1.0,2.0,5.0,1.0,4.0,3.0,0.0
2,Google_1_Danielle Goble_5_oval.jpg,1,3.0,3.0,0.0,4.0,1.0,6.0,1.0,5.0,...,5.0,0.0,1.0,1.0,2.0,5.0,5.0,5.0,5.0,0.0
3,Google_1_Danielle Goble_5_oval.jpg,1,3.0,4.0,0.0,2.0,2.0,0.0,1.0,4.0,...,3.0,0.0,1.0,1.0,1.0,3.0,1.0,4.0,3.0,0.0
4,Google_1_Danielle Goble_5_oval.jpg,1,3.0,2.0,1.0,3.0,4.0,1.0,1.0,1.0,...,3.0,0.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26658,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,5.0,3.0,1.0,1.0,1.0,...,4.0,1.0,0.0,0.0,2.0,4.0,1.0,3.0,4.0,1.0
26659,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,4.0,3.0,1.0,1.0,1.0,...,4.0,2.0,0.0,0.0,2.0,2.0,1.0,2.0,3.0,1.0
26660,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,2.0,0.0,4.0,3.0,1.0,1.0,1.0,...,3.0,0.0,0.0,0.0,1.0,3.0,1.0,2.0,3.0,1.0
26661,Google_1_Eileen Burd_7_oval.jpg,2222,3.0,3.0,0.0,5.0,3.0,1.0,1.0,1.0,...,3.0,0.0,0.0,0.0,1.0,3.0,1.0,3.0,2.0,1.0


In [3]:
# race breakdown across all participant responses (one row per response) in the 2.2k face dataset
attribute_df["Race"].value_counts()

1.0    20984
2.0     2593
5.0     1262
3.0      664
6.0      501
4.0      371
0.0      276
Name: Race, dtype: int64

In [14]:
# compute the mode (most frequent response) of each attribute for each image file,
# taken over that image's ~12 participant responses (grouping by image file).
# note: DataFrame.mode() returns one row per tied value and pads shorter columns with NaN
mode_per_filename = attribute_df.groupby("Filename").apply(lambda x: x.mode())

In [17]:
mode_per_filename = mode_per_filename.dropna()

In [18]:
len(mode_per_filename)

2222
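
The `groupby(...).apply(lambda x: x.mode())` call followed by `dropna()` relies on `DataFrame.mode()` padding tied columns with NaN, so only the first mode row per image survives. As a minimal sketch of the same idea with an explicit tie-break, the cell below takes the smallest modal value per column. The `toy` frame and the `first_mode` helper are hypothetical names used only for illustration; they are not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy data: two images, four responses each.
toy = pd.DataFrame({
    "Filename": ["a.jpg"] * 4 + ["b.jpg"] * 4,
    "Race": [1.0, 1.0, 2.0, 2.0, 5.0, 5.0, 5.0, 1.0],
    "Age": [3.0, 3.0, 3.0, 2.0, 4.0, 4.0, 2.0, 2.0],
})

def first_mode(series):
    """Return the smallest modal value of a column, or NaN if it has no values."""
    modes = series.mode()  # Series.mode() returns the modes in sorted order
    return modes.iloc[0] if not modes.empty else float("nan")

# One row per image, one modal value per attribute column.
per_image = toy.groupby("Filename").agg(first_mode)
print(per_image)
```

Here `a.jpg` has a tie between Race codes 1.0 and 2.0, and the sketch resolves it to 1.0. Whether the smallest code is the right tie-break for this dataset is a judgment call that the original cells leave implicit.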

In [19]:
mode_per_filename["Race"].value_counts()

1.0    1836
2.0     220
5.0      72
3.0      63
6.0      24
4.0       5
0.0       2
Name: Race, dtype: int64

In [20]:
mode_per_filename["Race"].value_counts() / len(mode_per_filename)

1.0    0.826283
2.0    0.099010
5.0    0.032403
3.0    0.028353
6.0    0.010801
4.0    0.002250
0.0    0.000900
Name: Race, dtype: float64
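
Dividing the counts by `len(...)` gives the share of each race code among the unique images. A small sketch on hypothetical toy data shows that pandas can produce the same shares directly with `value_counts(normalize=True)`, assuming the column has no missing values (with NaNs present, the two approaches use different denominators).

```python
import pandas as pd

# Hypothetical toy column of per-image modal race codes.
race = pd.Series([1.0, 1.0, 1.0, 2.0], name="Race")

# Shares computed by pandas directly.
shares = race.value_counts(normalize=True)

# Shares computed manually, as in the cell above.
manual = race.value_counts() / len(race)

print(shares)
```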

Based on unique images, whites account for .... 

Further, we combine the South Asian (4) and East Asian (3) categories into "Asian (South or East)" because the South Asian sample is too small to analyze on its own. We also drop the rows labeled "Other" (0), which account for less than 0.1% of the unique images.

In [9]:
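# combine East Asian (3.0) and South Asian (4.0) into "Asian (South or East)" by recoding 4.0 as 3.0
mode_per_filename.loc[mode_per_filename["Race"] == 4.0, "Race"] = 3.0
# sanity check: no rows should still carry code 4.0
# (not displayed, since only the last expression in a cell is shown)
mode_per_filename.loc[mode_per_filename["Race"] == 4.0]

# drop the rows labeled "Other" (0.0)
rows_to_drop = mode_per_filename[mode_per_filename["Race"] == 0.0].index
mode_per_filename = mode_per_filename.drop(rows_to_drop)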

In [10]:
mode_per_filename["Race"].value_counts()

1.0    1836
2.0     222
5.0      83
3.0      70
6.0      31
Name: Race, dtype: int64
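
The merge-and-drop step can also be written without mutating the frame in place, which makes it easier to re-run cells in any order. The sketch below uses a hypothetical `toy` frame and assumes only the codes stated above (3 = East Asian, 4 = South Asian, 0 = Other); it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: one modal race code per image.
toy = pd.DataFrame({
    "Filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "Race": [1.0, 4.0, 3.0, 0.0, 2.0],
})

# Recode South Asian (4.0) as 3.0 so both groups share one category.
recoded = toy.assign(Race=toy["Race"].replace({4.0: 3.0}))

# Drop the rows labeled "Other" (0.0).
cleaned = recoded[recoded["Race"] != 0.0].reset_index(drop=True)

# Sanity check: neither the merged-away code nor "Other" remains.
assert not cleaned["Race"].isin([0.0, 4.0]).any()
print(cleaned["Race"].value_counts())
```

Because `toy` itself is never modified, re-running the cell gives the same result each time.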