# Clean Race Data

We scraped 6686 profiles from user-curated IMDb lists organized by race/ethnicity (see `IMDbRaceEthnicityLists.csv`). After removing duplicate profiles and balancing the categories, we retained 5201 individuals.

We cross-checked the IMDb ground-truth labels against ChatGPT (GPT-3.5) and, where applicable, Wikipedia.



**ChatGPT prompt:**

*Please help me determine the race/ethnicity of some famous individuals (actors, actresses, producers and directors). You may choose from the following classes: Asian, Black, White, and Hispanic/Latino, as defined by the United States Census categories.  Given names, please list the race as follows:*
*Name 1: Race 1*
*Name 2: Race 2 ...*

*Here is the list of names:*
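
As a minimal sketch, the prompt could be assembled and a reply in the requested `Name: Race` format parsed roughly as follows. The helper names `build_prompt` and `parse_response` are hypothetical and not part of the original pipeline, and no API call is shown.

```python
def build_prompt(header: str, names: list) -> str:
    """Append one name per line to the instruction header."""
    return header.rstrip() + "\n" + "\n".join(names)


def parse_response(text: str) -> dict:
    """Parse lines of the form 'Name: Race' into a {name: race} mapping."""
    labels = {}
    for line in text.splitlines():
        # Skip blank lines and anything without a 'Name: Race' separator.
        if ":" not in line:
            continue
        # Split on the first colon only, so the label text is kept whole.
        name, _, race = line.partition(":")
        labels[name.strip()] = race.strip()
    return labels


# Illustrative reply text (assumed format, taken from the prompt above).
reply = "Fred Astaire: White\n\nTia Carrere: Mixed (Asian/White)"
print(parse_response(reply))
```

Free-text answers such as `Mixed (Asian/White)` are kept verbatim here, so they would not match any of the four requested classes in a later comparison.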

In [28]:
import pandas as pd

root_dir = ".."
df = pd.read_csv(f"{root_dir}/data/RaceEthnicityGroundTruth.csv")

In [29]:
df["IMDb"].value_counts()

Black              1500
Asian              1338
White              1211
Hispanic/Latino    1152
Name: IMDb, dtype: int64

In [15]:
df.head()

Unnamed: 0,name,href,id,IMDb,GPT1,GPT2,Wikipedia,Agreement
0,Fred Astaire,/name/nm0000001,1,White,White,White,,White
1,Lauren Bacall,/name/nm0000002,2,White,White,White,,White
2,Ingrid Bergman,/name/nm0000006,6,White,White,White,,White
3,Humphrey Bogart,/name/nm0000007,7,White,White,White,,White
4,Marlon Brando,/name/nm0000008,8,White,White,White,,White
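
The `Agreement` column is already present in the CSV, and the notebook does not show how it was computed. One rule consistent with the rows displayed here is "all non-missing sources must share the same label". The sketch below implements that assumed rule on toy rows; `agreement_label` and `SOURCE_COLS` are hypothetical names.

```python
import pandas as pd

# Columns treated as label sources (assumption based on the displayed header).
SOURCE_COLS = ["IMDb", "GPT1", "GPT2", "Wikipedia"]


def agreement_label(row: pd.Series) -> str:
    """Return the shared label if all non-missing sources agree."""
    labels = row[SOURCE_COLS].dropna().unique()
    return labels[0] if len(labels) == 1 else "DISAGREEMENT"


# Two toy rows mirroring entries shown in this notebook.
toy = pd.DataFrame(
    {
        "name": ["Fred Astaire", "Rita Hayworth"],
        "IMDb": ["White", "Hispanic/Latino"],
        "GPT1": ["White", "Hispanic/Latino"],
        "GPT2": ["White", "White"],
        "Wikipedia": [None, "Hispanic/Latino"],
    }
)
toy["Agreement"] = toy.apply(agreement_label, axis=1)
print(toy[["name", "Agreement"]])
```

Under this rule a missing Wikipedia entry is ignored rather than counted as a disagreement.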


In [16]:
df["Agreement"].value_counts()

Black              1304
White              1207
Asian              1178
Hispanic/Latino     862
DISAGREEMENT        650
Name: Agreement, dtype: int64

In [17]:
df[df["Agreement"] == "DISAGREEMENT"]

Unnamed: 0,name,href,id,IMDb,GPT1,GPT2,Wikipedia,Agreement
20,Rita Hayworth,/name/nm0000028,28,Hispanic/Latino,Hispanic/Latino,White,Hispanic/Latino,DISAGREEMENT
54,Raquel Welch,/name/nm0000079,79,Hispanic/Latino,White,White,Hispanic/Latino,DISAGREEMENT
77,Tia Carrere,/name/nm0000119,119,Asian,Mixed (Asian/White),Asian,Asian,DISAGREEMENT
93,Cameron Diaz,/name/nm0000139,139,Hispanic/Latino,White,White,Hispanic/Latino,DISAGREEMENT
140,Keanu Reeves,/name/nm0000206,206,Asian,White,White,Asian,DISAGREEMENT
...,...,...,...,...,...,...,...,...
5182,Desiree Geraldine,/name/nm9514709,9514709,Black,Black,Hispanic/Latino,,DISAGREEMENT
5189,Conor Husting,/name/nm9716268,9716268,Hispanic/Latino,White,White,,DISAGREEMENT
5190,Saige Hooke,/name/nm9731856,9731856,Black,Black,White,,DISAGREEMENT
5192,Harper Grace Robinson,/name/nm9781688,9781688,Black,Black,White,,DISAGREEMENT


In [27]:
# Count the disagreement rows and compute their share of the full dataset

disagree_count = len(df[df["Agreement"] == "DISAGREEMENT"])
disagree_frac = disagree_count / len(df)
print(f"{disagree_count} of {len(df)} are disagreements. This is {disagree_frac*100:.3f}% of the dataset")

650 of 5201 are disagreements. This is 12.498% of the dataset


## Drop Disagreement in Ground Truth

We drop profiles whose ground-truth race/ethnicity differs between data sources. On closer inspection, most of these individuals are of mixed-race heritage, which makes it difficult to assign them to a single ground-truth category.
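
One way to take such a closer look, sketched here on three toy rows copied from the disagreement table, is to cross-tabulate the IMDb label against a ChatGPT label and to flag rows where ChatGPT explicitly answered "Mixed". This is an illustrative approach, not necessarily the inspection the authors performed.

```python
import pandas as pd

# Toy rows copied from the disagreement table in this notebook.
toy = pd.DataFrame(
    {
        "name": ["Rita Hayworth", "Tia Carrere", "Keanu Reeves"],
        "IMDb": ["Hispanic/Latino", "Asian", "Asian"],
        "GPT1": ["Hispanic/Latino", "Mixed (Asian/White)", "White"],
        "GPT2": ["White", "Asian", "White"],
    }
)

# How often each IMDb label is paired with each GPT2 label.
pairs = pd.crosstab(toy["IMDb"], toy["GPT2"])

# Rows where either ChatGPT answer contains the word "Mixed".
mixed = toy[["GPT1", "GPT2"]].apply(lambda col: col.str.contains("Mixed")).any(axis=1)

print(pairs)
print(toy.loc[mixed, "name"].tolist())
```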

In [9]:
df = df[df["Agreement"] != "DISAGREEMENT"]

df["IMDb"].value_counts()

Black              1304
White              1207
Asian              1178
Hispanic/Latino     862
Name: IMDb, dtype: int64

## Rebalancing Dataset

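Dropping the disagreements leaves Hispanic/Latino as the smallest category (862 profiles, versus 1178 to 1304 for the other categories). To rebalance, we supplement the filtered dataset with a list of famous Hispanic/Latino actors and actresses from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Hispanic_and_Latino_American_actors).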

In [10]:
wiki_hispa_df = pd.read_csv(f"{root_dir}/data/wiki-hispa-list.csv")

wiki_hispa_df.head()

Unnamed: 0,name,href,description
0,Fernando Michelena,/wiki/Vera_Michelena,"Fernando Michelena (1858–1921), Venezuelan bor..."
1,Bijou Fernandez,/wiki/Bijou_Fernandez,Bijou Fernandez (1877 – 1961) Broadway actress...
2,Teresa Luisa Michelena,/wiki/Donna_Barrell,Teresa Luisa Michelena (1889 – 1941) American ...
3,Leo Carrillo,/wiki/Leo_Carrillo,Leo Carrillo (1880-1961) Californio actor best...
4,Steve Clemente,/wiki/Steve_Clemente,Steve Clemente (1885 – 1950) Mexican-understan...
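
A minimal sketch of how a supplementary list like this might be folded into the filtered dataset is given below. It assumes exact name matching to avoid duplicates and assumes that membership in the Wikipedia list is taken as the label. The toy frames `base` and `wiki` are hypothetical stand-ins for `df` and `wiki_hispa_df`.

```python
import pandas as pd

# Toy stand-ins for the filtered dataset and the Wikipedia list.
base = pd.DataFrame(
    {
        "name": ["Fred Astaire", "Leo Carrillo"],
        "Agreement": ["White", "Hispanic/Latino"],
    }
)
wiki = pd.DataFrame({"name": ["Leo Carrillo", "Bijou Fernandez"]})

# Keep only Wikipedia names not already present (exact string match).
new_rows = wiki[~wiki["name"].isin(base["name"])].copy()

# Assumption: appearing on the Wikipedia list is used as the label.
new_rows["Agreement"] = "Hispanic/Latino"

combined = pd.concat([base, new_rows], ignore_index=True)
print(combined["Agreement"].value_counts())
```

Exact matching would miss spelling variants of the same person, so a real merge may need name normalization first.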
