# Biography Text Ablation Analysis

To understand why the LLM predicts race so effectively, we performed an ablation analysis: we replaced salient keywords with generic terms (e.g. "Ang Lee" is replaced with "NAME", "Venezuela" is replaced with "LOCATION") and observed how model performance changed. We focus on three categories: `names` (was the model trained on text about these people, so that it has background information from pre-training?), `location` (does the model have priors about which races tend to be concentrated in which locations?), and `ethnicity` (does it recognize explicit race-related information in the bio?).

Notebooks Used
1. BioAblationAnalysis --> Relabel hypothesized salient keywords using algorithmic techniques (e.g. NER)
2. BioAblationTesting --> Use BioRaceBERT to classify relabeled bios and assess performance
3. BioAblationBreakdown --> Determine how common name, location, ethnicity are in bios
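
The relabeling step in the first notebook can be pictured with a minimal sketch. It assumes the NER tagger reports entities as character spans with a tag; the `span` and `mask_entities` helpers and the sample sentence are hypothetical illustrations, not the code in `BiographyAblation`.

```python
from typing import List, Set, Tuple

# Hypothetical span format: (start_char, end_char, tag), as an NER tagger might report
Span = Tuple[int, int, str]

def span(text: str, surface: str, tag: str) -> Span:
    """Locate the first occurrence of `surface` in `text` and label it with `tag`."""
    start = text.index(surface)
    return (start, start + len(surface), tag)

def mask_entities(text: str, spans: List[Span], tags_to_mask: Set[str]) -> str:
    """Replace every span whose tag is in tags_to_mask with the tag itself."""
    # Work right-to-left so earlier character offsets stay valid after each replacement
    for start, end, tag in sorted(spans, reverse=True):
        if tag in tags_to_mask:
            text = text[:start] + tag + text[end:]
    return text

bio = "Born in 1954 in Pingtung, Taiwan, Ang Lee has become a director."
spans = [span(bio, "Pingtung", "GPE"), span(bio, "Taiwan", "GPE"), span(bio, "Ang Lee", "PERSON")]

location_ablated = mask_entities(bio, spans, {"GPE", "LOC"})
fully_ablated = mask_entities(bio, spans, {"GPE", "LOC", "PERSON"})
print(location_ablated)
print(fully_ablated)
```

Each ablation condition then corresponds to a different `tags_to_mask` set, mirroring the tag groups printed by `flair.entities`.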

In [9]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../script")
from BiographyAblation import FlairHelper
flair = FlairHelper()
import re

root_dir = ".." # jw10
df = pd.read_csv(f"{root_dir}/data/flair_bios.csv")
df = df.replace(np.nan, "", regex=True)

print("Relevant flair tags\n", flair.entities)

2023-09-13 02:03:22,338 SequenceTagger predicts: Dictionary with 75 tags: O, S-PERSON, B-PERSON, E-PERSON, I-PERSON, S-GPE, B-GPE, E-GPE, I-GPE, S-ORG, B-ORG, E-ORG, I-ORG, S-DATE, B-DATE, E-DATE, I-DATE, S-CARDINAL, B-CARDINAL, E-CARDINAL, I-CARDINAL, S-NORP, B-NORP, E-NORP, I-NORP, S-MONEY, B-MONEY, E-MONEY, I-MONEY, S-PERCENT, B-PERCENT, E-PERCENT, I-PERCENT, S-ORDINAL, B-ORDINAL, E-ORDINAL, I-ORDINAL, S-LOC, B-LOC, E-LOC, I-LOC, S-TIME, B-TIME, E-TIME, I-TIME, S-WORK_OF_ART, B-WORK_OF_ART, E-WORK_OF_ART, I-WORK_OF_ART, S-FAC
Relevant flair tags
 {'ethnicity': {'LANGUAGE', 'NORP'}, 'location': {'LOC', 'GPE'}, 'people': {'PERSON'}, 'ethnicity+location': {'GPE', 'LANGUAGE', 'LOC', 'NORP'}, 'ethnicity+people': {'LANGUAGE', 'PERSON', 'NORP'}, 'location+people': {'GPE', 'LOC', 'PERSON'}, 'ethnicity+location+people': {'LANGUAGE', 'GPE', 'LOC', 'PERSON', 'NORP'}}


In [10]:
def match_entity(text, key_set: set):
    '''
    Return 1 if any Flair tag in key_set (e.g. PERSON, NORP, GPE) appears in the relabeled bio text, otherwise 0
    '''
    # Build an alternation pattern such as "GPE|LOC"; sorting keeps the pattern deterministic
    keys = "|".join(sorted(key_set))
    return 1 if re.search(f"({keys})", text) else 0
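
One caveat with a plain substring search is that a tag such as `LOC` also matches inside longer uppercase words. The sketch below is a hypothetical stricter variant (`match_entity_strict` is not part of the notebook) that only accepts a tag when it stands alone as a token. It still cannot distinguish an inserted tag from a bio that happens to contain the literal uppercase word, so it is a partial safeguard only.

```python
import re

def match_entity_strict(text: str, key_set: set) -> int:
    """Return 1 if any tag in key_set occurs as a standalone token in text, otherwise 0."""
    # \b anchors require a word boundary on both sides of the tag
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in sorted(key_set)) + r")\b"
    return 1 if re.search(pattern, text) else 0

# A relabeled bio with real tags still matches
tagged = match_entity_strict("Born in 1954 in GPE , GPE", {"LOC", "GPE"})
# "BLOCKBUSTER" contains the substring "LOC" but is not a standalone tag
embedded = match_entity_strict("A BLOCKBUSTER director", {"LOC", "GPE"})
print(tagged, embedded)
```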

In [11]:
import pandas as pd
import numpy as np

root_dir = ".." # jw10
df = pd.read_csv(f"{root_dir}/data/flair_bios.csv")
df = df.replace(np.nan, "", regex=True)

In [12]:
df.head(3)

Unnamed: 0,name,href,race,role,image,bio,bio_preprocessed,flair_ethn_bio,flair_loc_bio,flair_ppl_bio,flair_ethn+ppl_bio,flair_ethn+loc_bio,flair_loc+ppl_bio,flair_ethn+loc+ppl_bio,flair_person_only_bio,flair_person+ethn_bio,flair_person+loc_bio,flair_person+ethn+loc_bio
0,Ang Lee,/name/nm0000487,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BODA2MT...,"Born in 1954 in Pingtung, Taiwan, Ang Lee has ...",bear pingtung taiwan ang lee today great conte...,"Born in 1954 in Pingtung , Taiwan , Ang Lee ha...","Born in 1954 in GPE , GPE , Ang Lee has become...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , Ang Lee has become...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in Pingtung, Taiwan, PERSON has b...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ..."
1,James Wan,/name/nm1490123,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMTY5Nz...,James Wan (born 26 February 1977) is an Austra...,james wan bear february australian film produc...,James Wan ( born 26 February 1977 ) is an NORP...,James Wan ( born 26 February 1977 ) is an Aust...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...,James Wan ( born 26 February 1977 ) is an NORP...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON (born 26 February 1977) is an Australia...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...
2,Jon M. Chu,/name/nm0160840,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BNDM0Nj...,Jon is an alumni of the USC School of Cinema-T...,jon alumnus usc school cinema television win p...,Jon is an alumni of the USC School of Cinema-T...,Jon is an alumni of the USC School of Cinema-T...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,Jon is an alumni of the USC School of Cinema-T...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...


In [13]:
# Replace the relabeled text in flair_ethn_bio, flair_loc_bio, flair_ppl_bio with a 1/0 match flag
keywords = ["ethnicity", "location", "people"]  # must follow the column order of the slice below
for i, relabeled_bio in enumerate(df.loc[:, "flair_ethn_bio": "flair_ppl_bio"].columns):
    print(i, relabeled_bio, keywords[i], flair.entities[keywords[i]])
    df[relabeled_bio] = df[relabeled_bio].apply(lambda x: match_entity(x, flair.entities[keywords[i]]))

0 flair_ethn_bio ethnicity {'LANGUAGE', 'NORP'}
1 flair_loc_bio location {'LOC', 'GPE'}
2 flair_ppl_bio people {'PERSON'}


In [14]:
df.head(3)

Unnamed: 0,name,href,race,role,image,bio,bio_preprocessed,flair_ethn_bio,flair_loc_bio,flair_ppl_bio,flair_ethn+ppl_bio,flair_ethn+loc_bio,flair_loc+ppl_bio,flair_ethn+loc+ppl_bio,flair_person_only_bio,flair_person+ethn_bio,flair_person+loc_bio,flair_person+ethn+loc_bio
0,Ang Lee,/name/nm0000487,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BODA2MT...,"Born in 1954 in Pingtung, Taiwan, Ang Lee has ...",bear pingtung taiwan ang lee today great conte...,1,1,1,"Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , Ang Lee has become...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in Pingtung, Taiwan, PERSON has b...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ..."
1,James Wan,/name/nm1490123,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMTY5Nz...,James Wan (born 26 February 1977) is an Austra...,james wan bear february australian film produc...,1,1,1,PERSON ( born 26 February 1977 ) is an NORP fi...,James Wan ( born 26 February 1977 ) is an NORP...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON (born 26 February 1977) is an Australia...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...
2,Jon M. Chu,/name/nm0160840,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BNDM0Nj...,Jon is an alumni of the USC School of Cinema-T...,jon alumnus usc school cinema television win p...,0,0,1,PERSON is an alumni of the USC School of Cinem...,Jon is an alumni of the USC School of Cinema-T...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...


In [15]:
# Replace flair_ethn+ppl_bio, flair_ethn+loc_bio, flair_loc+ppl_bio, flair_ethn+loc+ppl_bio with 1/0 flags
# A combined flag is 1 only when every component flag is 1 (element-wise AND on the 0/1 columns)

df["flair_ethn+ppl_bio"] = df["flair_ethn_bio"] & df["flair_ppl_bio"]
df["flair_ethn+loc_bio"] = df["flair_ethn_bio"] & df["flair_loc_bio"]
df["flair_loc+ppl_bio"] = df["flair_loc_bio"] & df["flair_ppl_bio"]
df["flair_ethn+loc+ppl_bio"] = df["flair_ethn_bio"] & df["flair_loc_bio"] & df["flair_ppl_bio"]

In [16]:
df.head()

Unnamed: 0,name,href,race,role,image,bio,bio_preprocessed,flair_ethn_bio,flair_loc_bio,flair_ppl_bio,flair_ethn+ppl_bio,flair_ethn+loc_bio,flair_loc+ppl_bio,flair_ethn+loc+ppl_bio,flair_person_only_bio,flair_person+ethn_bio,flair_person+loc_bio,flair_person+ethn+loc_bio
0,Ang Lee,/name/nm0000487,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BODA2MT...,"Born in 1954 in Pingtung, Taiwan, Ang Lee has ...",bear pingtung taiwan ang lee today great conte...,1,1,1,1,1,1,1,"Born in 1954 in Pingtung, Taiwan, PERSON has b...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ..."
1,James Wan,/name/nm1490123,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMTY5Nz...,James Wan (born 26 February 1977) is an Austra...,james wan bear february australian film produc...,1,1,1,1,1,1,1,PERSON (born 26 February 1977) is an Australia...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...
2,Jon M. Chu,/name/nm0160840,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BNDM0Nj...,Jon is an alumni of the USC School of Cinema-T...,jon alumnus usc school cinema television win p...,0,0,1,0,0,0,0,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...
3,Taika Waititi,/name/nm0169806,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMzk4MD...,"Taika Waititi, also known as Taika Cohen, hail...",taika waititi know taika cohen hail raukokore ...,1,1,1,1,1,1,1,"PERSON, also known as PERSON Cohen, hails from...","PERSON , also known as PERSON Cohen , hails fr...","PERSON , also known as PERSON Cohen , hails fr...","PERSON , also known as PERSON Cohen , hails fr..."
4,Karyn Kusama,/name/nm0476201,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMTUzMT...,"Karyn Kusama was born on March 21, 1968 in Bro...",karyn kusama bear march brooklyn new york usa ...,0,1,1,0,0,1,0,"PERSON was born on March 21, 1968 in Brooklyn,...","PERSON was born on March 21 , 1968 in Brooklyn...","PERSON was born on March 21 , 1968 in GPE , GP...","PERSON was born on March 21 , 1968 in GPE , GP..."


In [18]:
for i, relabeled_bio in enumerate(df.loc[:, "flair_ethn_bio": "flair_ethn+loc+ppl_bio"].columns):
    print(i, relabeled_bio, df[relabeled_bio].sum())

0 flair_ethn_bio 2177
1 flair_loc_bio 4154
2 flair_ppl_bio 5161
3 flair_ethn+ppl_bio 2170
4 flair_ethn+loc_bio 2007
5 flair_loc+ppl_bio 4132
6 flair_ethn+loc+ppl_bio 2000
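
To read these counts as "how common is each category", they can be divided by the number of bios. The sketch below hard-codes the three single-category counts printed above. The corpus size of 5201 is not printed directly in this notebook: it is taken as the sum of the eight disjoint category counts from the double-counting cell, which by construction equals `len(df)`. The helper name `share_of_bios` is hypothetical.

```python
def share_of_bios(count: int, n_bios: int) -> float:
    """Percentage of bios in which a category was detected."""
    return 100.0 * count / n_bios

# Assumption: 5201 = sum of the eight disjoint category counts reported in this notebook
n_bios = 5201

# Counts copied from the cell output above
tag_counts = {
    "flair_ethn_bio": 2177,
    "flair_loc_bio": 4154,
    "flair_ppl_bio": 5161,
}

shares = {col: round(share_of_bios(c, n_bios), 1) for col, c in tag_counts.items()}
for col, pct in shares.items():
    print(f"{col}: {pct}% of bios")
```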


In [19]:
# Adjust for double counting
ethn_loc_ppl = df["flair_ethn+loc+ppl_bio"].sum()
ethn_ppl = df["flair_ethn+ppl_bio"].sum() - ethn_loc_ppl
ethn_loc = df["flair_ethn+loc_bio"].sum() - ethn_loc_ppl
loc_ppl = df["flair_loc+ppl_bio"].sum() - ethn_loc_ppl
ethn_only = df["flair_ethn_bio"].sum() - ethn_ppl - ethn_loc - ethn_loc_ppl
loc_only = df["flair_loc_bio"].sum() - loc_ppl - ethn_loc - ethn_loc_ppl
ppl_only = df["flair_ppl_bio"].sum() - ethn_ppl - loc_ppl - ethn_loc_ppl
none = len(df) - ethn_loc_ppl - ethn_ppl - ethn_loc - loc_ppl - ethn_only - loc_only - ppl_only

labels = ("ethn_loc_ppl", "ethn_ppl" ,  "ethn_loc" , "loc_ppl" , "ethn_only", "ppl_only" , "loc_only" , "none")
counts = (ethn_loc_ppl, ethn_ppl ,  ethn_loc , loc_ppl , ethn_only, ppl_only , loc_only , none)

for label, count in zip(labels, counts):
    print(label, count)

ethn_loc_ppl 2000
ethn_ppl 170
ethn_loc 7
loc_ppl 2132
ethn_only 0
ppl_only 859
loc_only 15
none 18
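
The subtraction above is an inclusion-exclusion step. As a cross-check, the same disjoint categories can be counted directly by grouping bios on their three 0/1 flags. The sketch below uses made-up toy flags rather than the real data, and `disjoint_counts` is a hypothetical helper. With the real DataFrame, one would zip the `flair_ethn_bio`, `flair_loc_bio`, and `flair_ppl_bio` columns instead.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def disjoint_counts(flags: Iterable[Tuple[int, int, int]]) -> Dict[Tuple[int, int, int], int]:
    """Count bios per exact (ethn, loc, ppl) combination, so no bio is counted twice."""
    return dict(Counter(flags))

# Toy (ethn, loc, ppl) flags, one tuple per bio; values are made up for illustration
toy_flags = [
    (1, 1, 1),  # all three categories
    (1, 0, 1),  # ethnicity and person only
    (0, 1, 1),  # location and person only
    (0, 1, 1),
    (0, 0, 0),  # none
]

counts = disjoint_counts(toy_flags)
for combo, n in sorted(counts.items(), reverse=True):
    print(combo, n)

# Every bio lands in exactly one combination, so the counts add up to the corpus size
total = sum(counts.values())
```

Because the combinations are mutually exclusive, their sum equals the number of bios, which is the same property the `none` line above relies on.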


## Manually check bios in which Flair found none of the categories

In [20]:
test = df.loc[(df["flair_ethn_bio"] == 0) & (df["flair_loc_bio"] == 0) & (df["flair_ppl_bio"] == 0)]
print(len(test))

18


In [21]:
for i in range(len(test)):
    print(test.iloc[i]["name"], "|", test.iloc[i]["bio"])
    print()

Preet Bharara | 

Jasmeet 'Jusreign' Singh | Jasmeet 'Jusreign' Singh is known for Ultimate Expedition (2018).

Priyanka Wali | 

Abby Govindan | 

Pato Escala Pierart | Pato Escala Pierart is known for Bear Story (2014), Duerme Tranquila (2016) and Angry Birds - Bubble Trouble (2020).

Santa Sierra | Santa Sierra is known for Narcos (2015), Power Book III: Raising Kanan (2021) and Mayans M.C. (2018).

Lionel Barrymore | Famed actor, composer, artist, author and director. His talents extended to the authoring of the novel "Mr. Cartonwine: A Moral Tale" as well as his autobiography. In 1944, he joined ASCAP, and composed "Russian Dances", "Partita", "Ballet Viennois", "The Woodman and the Elves", "Behind the Horizon", "Fugue Fantasia", "In Memorium", "Hallowe'en", "Preludium & Fugue", "Elegie for Oboe, Orch.", "Farewell Symphony (1-act opera)", "Elegie (piano pieces)", "Rondo for Piano" and "Scherzo Grotesque".

Vivicca A. Whitsett | 2011 National Co-Chair of Screen Actors Guild EEO Com

In [22]:
test = df.loc[(df["flair_ethn_bio"] == 0) & (df["flair_loc_bio"] != 0) & (df["flair_ppl_bio"] == 0)]
print(len(test))

for i in range(len(test)):
    print(test.iloc[i]["name"], "|", test.iloc[i]["bio"])
    print()

15
King Hu | He was educated in art school in Beijing, left China for Hong Kong in 1949 and entered the film industry in 1951 in the art department. In the 1950s he began acting and in 1958 joined Shaw Brothers as an actor and writer, and later a director. In 1967 he left to start his own studio in Taiwan, returned to Hong Kong in 1970s, working in Hong Kong, Taiwan and China before his death.

Kim Ki-duk | He studied fine arts in Paris in 1990-1992. In 1993 he won the award for Best Screenplay from the Educational Institute of Screenwriting with "A Painter and A Criminal Condemned to Death". After two more screenplay awards, he made his directorial debut with Crocodile (1996) ("Crocodile"). Then he went on to direct Wild Animals (1997) ("Wild Animals"), Birdcage Inn (1998) ("Birdcage Inn"), The Isle (2000) ("The Isle") and the highly experimental Shilje sanghwang (2000) ("Real Fiction"), shot in just 200 minutes. In 1999, Suchwiin bulmyeong (2001) ("Address Unknown") was selected by t

Of the 18 bios in which Flair recognized none of the three categories, we determine that:

- 13 contain a name but no ethnicity or location
- 1 contains no name, ethnicity, or location information
- 4 are empty (IMDb did not autogenerate a bio)

Of the 15 bios in which Flair recognized only a location (and likely missed a name), we determine that:
- 5 contain names that Flair did not recognize (e.g. Carroll Parrot Blue)

Thus, we adjust the counts by moving these bios into the categories they actually belong to:
- ethn_loc_ppl 2000
- ethn_ppl 170
- ethn_loc 7
- loc_ppl 2132 + 5
- ethn_only 0
- ppl_only 859 + 13
- loc_only 15 - 5
- none 18 - 13

____
- ethn_loc_ppl 2000
- ethn_ppl 170
- ethn_loc 7
- loc_ppl 2137
- ethn_only 0
- ppl_only 872
- loc_only 10
- none 5
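
The manual adjustment only moves bios between categories, so the total should be unchanged. A minimal sketch of that bookkeeping follows, using the counts reported above; the `moves` list is just a compact restatement of the two corrections.

```python
# Disjoint counts as produced by Flair, copied from the output above
flair_counts = {
    "ethn_loc_ppl": 2000,
    "ethn_ppl": 170,
    "ethn_loc": 7,
    "loc_ppl": 2132,
    "ethn_only": 0,
    "ppl_only": 859,
    "loc_only": 15,
    "none": 18,
}

# (source category, destination category, number of bios) from the manual review
moves = [
    ("none", "ppl_only", 13),    # bios with a name that Flair tagged as nothing
    ("loc_only", "loc_ppl", 5),  # bios with a name that Flair missed
]

adjusted = dict(flair_counts)
for src, dst, n in moves:
    adjusted[src] -= n
    adjusted[dst] += n

# Moving bios between categories must not change the corpus size
assert sum(adjusted.values()) == sum(flair_counts.values())

for label, count in adjusted.items():
    print(label, count)
```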

In [29]:
# Note: this saves only the 2000 bios that contain ethnicity, location, and person tags
ethn_loc_ppl_tags = df[df["flair_ethn+loc+ppl_bio"] == 1].loc[:, "name":"href"]
print(len(ethn_loc_ppl_tags))
ethn_loc_ppl_tags.to_csv(f"{root_dir}/data/ablation/flair_ethn_loc_ppl_tags.csv", index=False)

2000
