# Biography Text Ablation Analysis

To understand why the LLM predicts race so effectively, we performed an ablation analysis: we replace salient keywords with generic tags (e.g. "Ang Lee" is replaced with "PERSON", "Venezuela" is replaced with "GPE") and observe how model performance changes. We focus on three kinds of keywords: `names` (did the model see these people during pre-training, and therefore carry background information about them?), `location` (does the model have priors about which races tend to be concentrated in which locations?), and `ethnicity` (does the model recognize explicit race-related information in the bio?).
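
As a rough illustration of the ablation conditions, the sketch below enumerates every combination of the three keyword types and the tags each one would mask. The tag sets (`NORP`, `GPE`, `PERSON`) are inferred from the outputs shown later in this notebook; the real `FlairHelper.entities` mapping may include more tags (for example `LOC`), so treat `BASE_TAGS` and `build_conditions` as hypothetical names.

```python
from itertools import combinations
from typing import Dict, Set

# Assumed mapping from keyword type to NER tag; not taken from FlairHelper itself.
BASE_TAGS: Dict[str, Set[str]] = {
    "ethnicity": {"NORP"},
    "location": {"GPE"},
    "people": {"PERSON"},
}

def build_conditions(base: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return one entry per non-empty combination of keyword types."""
    conditions: Dict[str, Set[str]] = {}
    names = list(base)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Key format mirrors the notebook, e.g. "ethnicity+location".
            conditions["+".join(combo)] = set().union(*(base[n] for n in combo))
    return conditions

conditions = build_conditions(BASE_TAGS)
for key, tags in conditions.items():
    print(key, sorted(tags))
```

Three keyword types give seven conditions in total: three single, three pairwise, and one with all three masked.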

## Relabel Named Entities with Flair

We use `Flair`, a well-known Named Entity Recognition (NER) library, to relabel named entities in the biography text. We relabel (1) names, (2) locations, and (3) ethnicities, as well as every combination of the three.
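
The relabeling step can be pictured as replacing each tagged character span with its tag. The sketch below is a simplified stand-in, not the `FlairHelper` implementation: the spans are written by hand where an NER tagger would normally supply them, the example sentence is made up, and `mask_entities` and `span_of` are hypothetical helpers. Unlike the Flair output shown later, it keeps the original spacing around punctuation.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity tag)

def span_of(text: str, phrase: str, tag: str) -> Span:
    """Locate the first occurrence of `phrase` and attach a tag to it."""
    start = text.index(phrase)
    return (start, start + len(phrase), tag)

def mask_entities(text: str, spans: Iterable[Span], tags_to_mask: Set[str]) -> str:
    """Replace every span whose tag is in `tags_to_mask` with the tag itself."""
    pieces: List[str] = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if tag not in tags_to_mask or start < cursor:
            continue  # leave other entity types (and overlapping spans) untouched
        pieces.append(text[cursor:start])
        pieces.append(tag)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

bio = "Born in Pingtung, Taiwan, Ang Lee is a filmmaker."
spans = [
    span_of(bio, "Pingtung", "GPE"),
    span_of(bio, "Taiwan", "GPE"),
    span_of(bio, "Ang Lee", "PERSON"),
]

print(mask_entities(bio, spans, {"GPE"}))
print(mask_entities(bio, spans, {"GPE", "PERSON"}))
```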

In [2]:
import pandas as pd
import sys
sys.path.append("../script")
from BiographyAblation import FlairHelper
from tqdm.notebook import tqdm
tqdm.pandas()

root_dir = ".."
df = pd.read_csv(f"{root_dir}/data/cleaned_final_sample_metadata.csv", header=0) # if needed to read in
df["bio"] = df["bio"].astype(str)
flair = FlairHelper()

2023-09-07 23:50:46,999 SequenceTagger predicts: Dictionary with 75 tags: O, S-PERSON, B-PERSON, E-PERSON, I-PERSON, S-GPE, B-GPE, E-GPE, I-GPE, S-ORG, B-ORG, E-ORG, I-ORG, S-DATE, B-DATE, E-DATE, I-DATE, S-CARDINAL, B-CARDINAL, E-CARDINAL, I-CARDINAL, S-NORP, B-NORP, E-NORP, I-NORP, S-MONEY, B-MONEY, E-MONEY, I-MONEY, S-PERCENT, B-PERCENT, E-PERCENT, I-PERCENT, S-ORDINAL, B-ORDINAL, E-ORDINAL, I-ORDINAL, S-LOC, B-LOC, E-LOC, I-LOC, S-TIME, B-TIME, E-TIME, I-TIME, S-WORK_OF_ART, B-WORK_OF_ART, E-WORK_OF_ART, I-WORK_OF_ART, S-FAC


In [9]:
df.head(1)

In [11]:
# ETHNICITY ENTITIES (each is replaced with its NER tag, e.g. NORP)
df["flair_ethn_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["ethnicity"]))

# LOCATION ENTITIES
df["flair_loc_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["location"]))

# PERSON ENTITIES
df["flair_ppl_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["people"]))

  0%|          | 0/5201 [00:00<?, ?it/s]

In [15]:
file = "flair_bios.csv"
df.to_csv(f"{root_dir}/data/{file}", index=False)

In [16]:
# ETHNICITY AND PERSON ENTITIES
df["flair_ethn+ppl_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["ethnicity+people"]))

# ETHNICITY AND LOCATION ENTITIES
df["flair_ethn+loc_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["ethnicity+location"]))

# LOCATION AND PERSON ENTITIES
df["flair_loc+ppl_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["location+people"]))

  0%|          | 0/5201 [00:00<?, ?it/s]

  0%|          | 0/5201 [00:00<?, ?it/s]

  0%|          | 0/5201 [00:00<?, ?it/s]

In [17]:
file = "flair_bios.csv"
df.to_csv(f"{root_dir}/data/{file}", index=False)

In [18]:
# ETHNICITY AND LOCATION AND PERSON ENTITIES
df["flair_ethn+loc+ppl_bio"] = df["bio"].progress_apply(lambda x: flair.label_specific_entities(x, flair.entities["ethnicity+location+people"]))

  0%|          | 0/5201 [00:00<?, ?it/s]

In [20]:
file = "flair_bios.csv"
df.to_csv(f"{root_dir}/data/{file}", index=False)

## Relabel person names with regular expressions
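
For this variant only the subject's own name is masked, while mentions of other people are kept. A minimal regex sketch of that idea follows. It is an assumption about how `label_name_as_keyword` might behave, based on the `df.head()` output further down (where "Taika Cohen" becomes "PERSON Cohen"); `mask_own_name` is a hypothetical function and the example sentences are illustrative.

```python
import re

def mask_own_name(text: str, name: str, keyword: str = "PERSON") -> str:
    """Replace the full name, and each of its parts, with `keyword`."""
    # Skip initials such as "M." so that single letters are not masked everywhere.
    parts = [p for p in name.split() if len(p.strip(".")) > 1]
    # Try the longest candidate first so the full name collapses to one keyword.
    candidates = sorted({name, *parts}, key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in candidates)
    # Lookarounds act as word boundaries that also work next to punctuation.
    pattern = rf"(?<!\w)(?:{alternation})(?!\w)"
    return re.sub(pattern, keyword, text)

print(mask_own_name("Taika Waititi, also known as Taika Cohen, is a filmmaker.", "Taika Waititi"))
print(mask_own_name("Jon is an alumnus of USC.", "Jon M. Chu"))
```

Because name parts are matched individually, a shared first name inside another person's name is masked as well, which is consistent with the "PERSON Cohen" artifact visible in the table below.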

In [1]:
import pandas as pd
import sys
sys.path.append("../script")
from BiographyAblation import FlairHelper
from tqdm.notebook import tqdm
tqdm.pandas()

root_dir = ".."
file = "flair_bios.csv"

df = pd.read_csv(f"{root_dir}/data/{file}")
df["bio"] = df["bio"].astype(str)
flair = FlairHelper()

2023-09-09 20:36:28.461319: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


2023-09-09 20:38:03,926 SequenceTagger predicts: Dictionary with 75 tags: O, S-PERSON, B-PERSON, E-PERSON, I-PERSON, S-GPE, B-GPE, E-GPE, I-GPE, S-ORG, B-ORG, E-ORG, I-ORG, S-DATE, B-DATE, E-DATE, I-DATE, S-CARDINAL, B-CARDINAL, E-CARDINAL, I-CARDINAL, S-NORP, B-NORP, E-NORP, I-NORP, S-MONEY, B-MONEY, E-MONEY, I-MONEY, S-PERCENT, B-PERCENT, E-PERCENT, I-PERCENT, S-ORDINAL, B-ORDINAL, E-ORDINAL, I-ORDINAL, S-LOC, B-LOC, E-LOC, I-LOC, S-TIME, B-TIME, E-TIME, I-TIME, S-WORK_OF_ART, B-WORK_OF_ART, E-WORK_OF_ART, I-WORK_OF_ART, S-FAC


In [4]:
# PERSON ITSELF ONLY (removes only person name but keeps mentions of others)
df["flair_person_only_bio"] = df.progress_apply(lambda x: flair.label_name_as_keyword(string=x["bio"], 
                                                                                      name=x["name"],
                                                                                      keyword="PERSON"
                                                                                      ), axis=1)

  0%|          | 0/5201 [00:00<?, ?it/s]

In [9]:
df["flair_ethn_bio"] = df["flair_ethn_bio"].astype(str)
df["flair_loc_bio"] = df["flair_loc_bio"].astype(str)
df["flair_ethn+loc_bio"] = df["flair_ethn+loc_bio"].astype(str)

# PERSON ITSELF + ETHNICITY (masks the person's own name in the ethnicity-masked bio)
df["flair_person+ethn_bio"] = df.progress_apply(lambda x: flair.label_name_as_keyword(string=x["flair_ethn_bio"], 
                                                                                      name=x["name"],
                                                                                      keyword="PERSON"
                                                                                      ), axis=1)

# PERSON ITSELF + LOCATION (masks the person's own name in the location-masked bio)
df["flair_person+loc_bio"] = df.progress_apply(lambda x: flair.label_name_as_keyword(string=x["flair_loc_bio"], 
                                                                                      name=x["name"],
                                                                                      keyword="PERSON"
                                                                                      ), axis=1)

# PERSON ITSELF + ETHNICITY + LOCATION (masks the person's own name in the ethnicity- and location-masked bio)
df["flair_person+ethn+loc_bio"] = df.progress_apply(lambda x: flair.label_name_as_keyword(string=x["flair_ethn+loc_bio"], 
                                                                                      name=x["name"],
                                                                                      keyword="PERSON"
                                                                                      ), axis=1)

  0%|          | 0/5201 [00:00<?, ?it/s]

  0%|          | 0/5201 [00:00<?, ?it/s]

  0%|          | 0/5201 [00:00<?, ?it/s]

In [10]:
file = "flair_bios.csv"
df.to_csv(f"{root_dir}/data/{file}", index=False)

In [11]:
df.head()

Unnamed: 0,name,href,race,role,image,bio,bio_preprocessed,flair_ethn_bio,flair_loc_bio,flair_ppl_bio,flair_ethn+ppl_bio,flair_ethn+loc_bio,flair_loc+ppl_bio,flair_ethn+loc+ppl_bio,flair_person_only_bio,flair_person+ethn_bio,flair_person+loc_bio,flair_person+ethn+loc_bio
0,Ang Lee,/name/nm0000487,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BODA2MT...,"Born in 1954 in Pingtung, Taiwan, Ang Lee has ...",bear pingtung taiwan ang lee today great conte...,"Born in 1954 in Pingtung , Taiwan , Ang Lee ha...","Born in 1954 in GPE , GPE , Ang Lee has become...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , Ang Lee has become...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in Pingtung, Taiwan, PERSON has b...","Born in 1954 in Pingtung , Taiwan , PERSON has...","Born in 1954 in GPE , GPE , PERSON has become ...","Born in 1954 in GPE , GPE , PERSON has become ..."
1,James Wan,/name/nm1490123,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMTY5Nz...,James Wan (born 26 February 1977) is an Austra...,james wan bear february australian film produc...,James Wan ( born 26 February 1977 ) is an NORP...,James Wan ( born 26 February 1977 ) is an Aust...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...,James Wan ( born 26 February 1977 ) is an NORP...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON (born 26 February 1977) is an Australia...,PERSON ( born 26 February 1977 ) is an NORP fi...,PERSON ( born 26 February 1977 ) is an Austral...,PERSON ( born 26 February 1977 ) is an NORP fi...
2,Jon M. Chu,/name/nm0160840,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BNDM0Nj...,Jon is an alumni of the USC School of Cinema-T...,jon alumnus usc school cinema television win p...,Jon is an alumni of the USC School of Cinema-T...,Jon is an alumni of the USC School of Cinema-T...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,Jon is an alumni of the USC School of Cinema-T...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...,PERSON is an alumni of the USC School of Cinem...
3,Taika Waititi,/name/nm0169806,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMzk4MD...,"Taika Waititi, also known as Taika Cohen, hail...",taika waititi know taika cohen hail raukokore ...,"Taika Waititi , also known as Taika Cohen , ha...","Taika Waititi , also known as Taika Cohen , ha...","PERSON , also known as PERSON , hails from the...","PERSON , also known as PERSON , hails from the...","Taika Waititi , also known as Taika Cohen , ha...","PERSON , also known as PERSON , hails from the...","PERSON , also known as PERSON , hails from the...","PERSON, also known as PERSON Cohen, hails from...","PERSON , also known as PERSON Cohen , hails fr...","PERSON , also known as PERSON Cohen , hails fr...","PERSON , also known as PERSON Cohen , hails fr..."
4,Karyn Kusama,/name/nm0476201,Asian,Filmmaker,https://m.media-amazon.com/images/M/MV5BMTUzMT...,"Karyn Kusama was born on March 21, 1968 in Bro...",karyn kusama bear march brooklyn new york usa ...,"Karyn Kusama was born on March 21 , 1968 in Br...","Karyn Kusama was born on March 21 , 1968 in GP...","PERSON was born on March 21 , 1968 in Brooklyn...","PERSON was born on March 21 , 1968 in Brooklyn...","Karyn Kusama was born on March 21 , 1968 in GP...","PERSON was born on March 21 , 1968 in GPE , GP...","PERSON was born on March 21 , 1968 in GPE , GP...","PERSON was born on March 21, 1968 in Brooklyn,...","PERSON was born on March 21 , 1968 in Brooklyn...","PERSON was born on March 21 , 1968 in GPE , GP...","PERSON was born on March 21 , 1968 in GPE , GP..."


We use `flair_bios.csv` to make predictions in `BioAblationTesting.ipynb` (hosted on Colab with our BioRaceBERT models). Once testing is complete, we continue the analysis here with the resulting file, `flair_bios_probs.csv`.

# Ablation Analysis Results

In [44]:
import pandas as pd
import numpy as np

root_dir = ".."
file = "flair_bios_probs.csv"

df = pd.read_csv(f"{root_dir}/data/ablation/{file}")
df.head()

# transform str to numpy arr
for column in df.loc[:, "flair_ethn_bio_probs":"flair_person+ethn+loc_bio_probs"].columns:
    print(column)
    df[column] = df[column].apply(lambda x: np.fromstring(x.replace('\n','').replace('[','')
                                                .replace(']','').replace('  ',' '), sep=' '))

flair_ethn_bio_probs
flair_ppl_bio_probs
flair_loc_bio_probs
flair_ethn+loc_bio_probs
flair_loc+ppl_bio_probs
flair_ethn+ppl_bio_probs
flair_ethn+loc+ppl_bio_probs
flair_person_only_bio_probs
flair_person+ethn_bio_probs
flair_person+loc_bio_probs
flair_person+ethn+loc_bio_probs


In [45]:
# transform probs to preds
for column in df.loc[:, "flair_ethn_bio_probs":"flair_person+ethn+loc_bio_probs"].columns:
    pred_column = column[:-5]+"pred"
    print(pred_column)
    df[pred_column] = df[column].apply(np.argmax)

flair_ethn_bio_pred
flair_ppl_bio_pred
flair_loc_bio_pred
flair_ethn+loc_bio_pred
flair_loc+ppl_bio_pred
flair_ethn+ppl_bio_pred
flair_ethn+loc+ppl_bio_pred
flair_person_only_bio_pred
flair_person+ethn_bio_pred
flair_person+loc_bio_pred
flair_person+ethn+loc_bio_pred
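
To make the two conversion steps concrete, here is a dependency-free sketch of the same idea on a single value: parse a stored probability string into floats, then take the index of the largest one as the predicted class. The exact string layout in the CSV is an assumption; the numbers are the `flair_ppl_bio_probs` values shown for the first row in the table below. It uses `str.split` and `max` where the notebook uses `np.fromstring` and `np.argmax`.

```python
from typing import List, Sequence

def parse_probs(text: str) -> List[float]:
    """Turn a string like '[0.1 0.9\n 0.0 0.0]' into a list of floats."""
    cleaned = text.replace("[", " ").replace("]", " ").replace(",", " ")
    return [float(token) for token in cleaned.split()]

def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

# Assumed serialization: space-separated values, possibly wrapped across lines.
raw = "[0.15194972 0.3393335\n 0.11186691 0.3968498 ]"
probs = parse_probs(raw)
pred = argmax(probs)
print(probs, pred)  # pred is 3, the class with probability 0.3968498
```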


In [46]:
df.head()

Unnamed: 0,val_index,name,href,race,race_cat,bio,flair_ethn_bio_probs,flair_ppl_bio_probs,flair_loc_bio_probs,flair_ethn+loc_bio_probs,...,flair_ppl_bio_pred,flair_loc_bio_pred,flair_ethn+loc_bio_pred,flair_loc+ppl_bio_pred,flair_ethn+ppl_bio_pred,flair_ethn+loc+ppl_bio_pred,flair_person_only_bio_pred,flair_person+ethn_bio_pred,flair_person+loc_bio_pred,flair_person+ethn+loc_bio_pred
0,4,Karyn Kusama,/name/nm0476201,Asian,0,"Karyn Kusama was born on March 21, 1968 in Bro...","[0.89769673, 0.05522941, 0.02654695, 0.02052698]","[0.15194972, 0.3393335, 0.11186691, 0.3968498]","[0.90568954, 0.05156705, 0.02243588, 0.02030754]","[0.90568954, 0.05156705, 0.02243588, 0.02030754]",...,3,0,0,3,3,3,3,3,3,3
1,12,Michael Aki,/name/nm0406866,Asian,0,"Michael Aki is known for Strangers (2012), Sun...","[0.18161274, 0.7488212, 0.05243589, 0.01713014]","[0.04335849, 0.899591, 0.03256469, 0.02448585]","[0.18161274, 0.7488212, 0.05243589, 0.01713014]","[0.18161274, 0.7488212, 0.05243589, 0.01713014]",...,1,1,1,1,1,1,1,1,1,1
2,13,Tze Chun,/name/nm2309735,Asian,0,"Tze Chun was born in 1980 in Chicago, Illinois...","[0.99025285, 0.00470042, 0.00267916, 0.00236759]","[0.12482573, 0.34341195, 0.08811145, 0.44365084]","[0.99141055, 0.00423765, 0.00232041, 0.00203141]","[0.99141055, 0.00423765, 0.00232041, 0.00203141]",...,3,0,0,1,3,1,3,3,3,3
3,23,Jon Moritsugu,/name/nm0605748,Asian,0,"Jon Moritsugu was born in Honolulu, Hawaii and...","[0.9827559, 0.00791049, 0.00380127, 0.00553237]","[0.45687556, 0.15597484, 0.0341593, 0.3529903]","[0.96962756, 0.01498768, 0.00535408, 0.0100307]","[0.96962756, 0.01498768, 0.00535408, 0.0100307]",...,0,0,0,3,0,3,3,3,3,3
4,39,Arvin Chen,/name/nm1789897,Asian,0,"Arvin Chen was born on November 26, 1978 in Bo...","[0.99072254, 0.00457601, 0.00260002, 0.00210143]","[0.4317788, 0.34700966, 0.11402743, 0.10718403]","[0.991614, 0.00414406, 0.00234378, 0.00189816]","[0.991614, 0.00414406, 0.00234378, 0.00189816]",...,0,0,0,1,0,1,0,0,0,0


In [47]:
from sklearn.metrics import classification_report

report = ""
# build one classification report per ablation condition
for column in df.loc[:, "flair_ethn_bio_pred":"flair_person+ethn+loc_bio_pred"].columns:
    report += f"{column}\n"
    report += classification_report(df["race_cat"], df[column])
    report += "\n"
    
with open(f"{root_dir}/data/full_report.txt", "w") as f:  # avoid shadowing the `file` name used above
    f.write(report)
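
For readers unfamiliar with the columns in these reports, the sketch below computes per-class precision, recall, F1, and support by hand on a toy example. It is only meant to show what the numbers mean; the labels are made up and `per_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
from typing import Dict, Sequence

def per_class_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, Dict[str, float]]:
    """Precision, recall, F1 and support for each class label."""
    metrics: Dict[int, Dict[str, float]] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # how often the model chose this class
        actual = sum(t == label for t in y_true)     # support: true members of this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}
    return metrics

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
for label, row in per_class_metrics(y_true, y_pred).items():
    print(label, {k: round(v, 2) for k, v in row.items()})
```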

In the `BioAblationBreakdown` notebook, we found that 2000 bios in our dataset contain name, ethnicity, and location tags. To see how removing salient terms affects the model, without diluting the differences with bios that lack these tags, we repeat the analysis on those 2000 bios only.

In [76]:
import pandas as pd
import numpy as np

root_dir = ".."

df = pd.read_csv(f"{root_dir}/data/ablation/flair_bios_probs.csv")
df2 = pd.read_csv(f"{root_dir}/data/BioRaceBERT-final.csv")
df2 = df2[["href", "pred", "pred_cat"]]
df = df.merge(df2, how="left", on=["href"])

subset = pd.read_csv(f"{root_dir}/data/ablation/flair_ethn_loc_ppl_tags.csv")
subset = subset.merge(df, how="left", on=["name", "href"])

# transform str to numpy arr
for column in subset.loc[:, "flair_ethn_bio_probs":"flair_person+ethn+loc_bio_probs"].columns:
    subset[column] = subset[column].apply(lambda x: np.fromstring(x.replace('\n','').replace('[','')
                                                .replace(']','').replace('  ',' '), sep=' '))

print(f"{len(subset)} bios contain ethnicity, location, and name tags")

2000 bios contain ethnicity, location, and name tags


In [77]:
# transform probs to preds
for column in subset.loc[:, "flair_ethn_bio_probs":"flair_person+ethn+loc_bio_probs"].columns:
    pred_column = column[:-5]+"pred"
    print(pred_column)
    subset[pred_column] = subset[column].apply(np.argmax)
    
subset.head()

flair_ethn_bio_pred
flair_ppl_bio_pred
flair_loc_bio_pred
flair_ethn+loc_bio_pred
flair_loc+ppl_bio_pred
flair_ethn+ppl_bio_pred
flair_ethn+loc+ppl_bio_pred
flair_person_only_bio_pred
flair_person+ethn_bio_pred
flair_person+loc_bio_pred
flair_person+ethn+loc_bio_pred


Unnamed: 0,name,href,val_index,race,race_cat,bio,flair_ethn_bio_probs,flair_ppl_bio_probs,flair_loc_bio_probs,flair_ethn+loc_bio_probs,...,flair_ppl_bio_pred,flair_loc_bio_pred,flair_ethn+loc_bio_pred,flair_loc+ppl_bio_pred,flair_ethn+ppl_bio_pred,flair_ethn+loc+ppl_bio_pred,flair_person_only_bio_pred,flair_person+ethn_bio_pred,flair_person+loc_bio_pred,flair_person+ethn+loc_bio_pred
0,Ang Lee,/name/nm0000487,0,Asian,0,"Born in 1954 in Pingtung, Taiwan, Ang Lee has ...","[0.99847263, 0.00066385587, 0.00051732763, 0.0...","[0.99842072, 0.00068193884, 0.00055345672, 0.0...","[0.99819928, 0.00081236725, 0.00061019597, 0.0...","[0.9978981, 0.0010070485, 0.00066028308, 0.000...",...,0,0,0,0,0,0,0,0,0,0
1,James Wan,/name/nm1490123,1,Asian,0,James Wan (born 26 February 1977) is an Austra...,"[0.99908316, 0.00035026207, 0.00029325642, 0.0...","[0.99937797, 0.00023431388, 0.0002744417, 0.00...","[0.99940217, 0.00022724959, 0.00025402274, 0.0...","[0.99910283, 0.00035023357, 0.00028576891, 0.0...",...,0,0,0,0,3,3,0,3,0,3
2,Taika Waititi,/name/nm0169806,3,Asian,0,"Taika Waititi, also known as Taika Cohen, hail...","[0.99753428, 0.00071190548, 0.00029314714, 0.0...","[0.9690823, 0.0099647, 0.00503781, 0.01591518]","[0.99798477, 0.00054067979, 0.00042303358, 0.0...","[0.99894899, 0.00043529656, 0.00015622516, 0.0...",...,0,0,0,0,3,0,0,3,0,3
3,Destin Daniel Cretton,/name/nm2308774,5,Asian,0,Destin Daniel Cretton is an American filmmaker...,"[0.99937469, 0.00031833138, 7.7739089e-05, 0.0...","[0.99985337, 6.7081281e-05, 2.2032673e-05, 5.7...","[0.99982005, 8.1814796e-05, 2.5762196e-05, 7.2...","[0.13964729, 0.8090083, 0.00675368, 0.0445907]",...,0,0,1,0,0,1,0,0,0,0
4,Cary Joji Fukunaga,/name/nm1560977,6,Asian,0,Cary Joji Fukunaga is a Japanese-American film...,"[0.99862421, 0.00054787938, 0.00066704396, 0.0...","[0.9992494, 0.00027673534, 0.00037278712, 0.00...","[0.99916244, 0.00029407119, 0.00040605458, 0.0...","[0.99865985, 0.00063580065, 0.00054617401, 0.0...",...,0,0,0,0,1,1,0,3,0,3


In [79]:
from sklearn.metrics import classification_report

report = ""
# build one classification report per condition (baseline `pred_cat` first, then each ablation)
for column in subset.loc[:, "pred_cat":"flair_person+ethn+loc_bio_pred"].columns:
    print(column)
    report += f"{column}\n"
    report += classification_report(subset["race_cat"], subset[column])
    report += "\n"
    print(classification_report(subset["race_cat"], subset[column]))

pred_cat
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       460
           1       0.97      0.93      0.95       329
           2       0.98      0.98      0.98       425
           3       0.97      0.99      0.98       786

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000

flair_ethn_bio_pred
              precision    recall  f1-score   support

           0       0.97      0.92      0.94       460
           1       0.88      0.86      0.87       329
           2       0.95      0.93      0.94       425
           3       0.93      0.98      0.96       786

    accuracy                           0.94      2000
   macro avg       0.93      0.92      0.93      2000
weighted avg       0.94      0.94      0.93      2000

flair_ppl_bio_pred
              precision    recall  f1-score   support

           0       0.94    

In [81]:
print(report)
with open(f"{root_dir}/data/subset_report.txt", "w") as f:
    f.write(report)

### Performance remains very high even after masking these entities. Why?

We save the 2000 bios to a file so we can investigate which other attributes the model relies on. See `BioAblationTesting` for the results of that inspection (we plan to use LIME for this).
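
Before inspecting individual bios, it can help to quantify how much each masking condition changes the predictions. The sketch below compares each condition against the unmasked baseline using two simple measures: the drop in accuracy and the fraction of predictions that flip. The labels and predictions are toy values, and `ablation_effect` is a hypothetical helper rather than something defined in this project.

```python
from typing import Dict, Sequence

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def ablation_effect(
    y_true: Sequence[int],
    baseline: Sequence[int],
    ablated: Dict[str, Sequence[int]],
) -> Dict[str, Dict[str, float]]:
    """Accuracy drop and prediction flip rate of each condition vs. the baseline."""
    base_acc = accuracy(y_true, baseline)
    effects: Dict[str, Dict[str, float]] = {}
    for condition, preds in ablated.items():
        flips = sum(b != p for b, p in zip(baseline, preds))
        effects[condition] = {
            "accuracy_drop": base_acc - accuracy(y_true, preds),
            "flip_rate": flips / len(preds),
        }
    return effects

# Toy data for illustration only.
y_true = [0, 0, 1, 2, 3, 3]
baseline = [0, 0, 1, 2, 3, 3]
ablated = {
    "ethnicity": [0, 3, 1, 2, 3, 3],
    "ethnicity+location+people": [3, 3, 1, 2, 3, 1],
}
effects = ablation_effect(y_true, baseline, ablated)
for condition, row in effects.items():
    print(condition, {k: round(v, 3) for k, v in row.items()})
```

A low flip rate under a given condition suggests the masked entities were not what the model relied on for most bios.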

In [92]:
df = pd.read_csv(f"{root_dir}/data/flair_bios.csv")
subset = subset.loc[:,"name": "href"]
subset = subset.merge(df, how="left", on=["name", "href"])
subset.to_csv(f"{root_dir}/data/flair_bios_subset.csv")
print(len(subset))

2000
