# Classify Test Set 

In this notebook, we will use the test set `../data/test_sample_metadata.csv` to predict race using the DistilBERT model.

## [Part 1](#1)
1. [Load model and set up reusable functions](#setup)
2. [Conduct manual tests on sample dataset](#manual-test)
    1. I tested a sample of biographies with important entities (person name, location, nationality/ethnicity/race) masked and evaluated accuracy
    
## [Part 2](#2)
3. [Named Entity Recognition (Algorithmic method)](#ner-alg)
    1. [Transform Named Entities in Sentence to Category Labels (tests different versions)](#transform)
    
## [Part 3](#3)
4. [Remove individual categories in Sentence to Category Labels](#individual)

# Part One <a href name="1"></a>

In [10]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../script")
import text_preprocessing
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification
import tensorflow as tf
from sklearn.metrics import classification_report

main_dir = ".."



## Load model and set up reusable functions <a href name="setup"></a>

In [4]:
# load model
loaded_tokenizer = DistilBertTokenizerFast.from_pretrained(f"{main_dir}/model/distilbert")
loaded_model = TFDistilBertForSequenceClassification.from_pretrained(f"{main_dir}/model/distilbert")

# set up reusable functions
def predict_one(raw_text):
    '''
    Params:
        raw_text: unprocessed biography text 
    Returns:
        prediction_value: one of {0, 1, 2, 3} which maps to {Asian, Black, Hispanic, White}
    '''
    text = ' '.join(text_preprocessing.preprocess(raw_text, lemmatization=True))
    predict_input = loaded_tokenizer.encode(text,
                                  truncation=True,
                                  padding=True,
                                  return_tensors="tf")

    output = loaded_model(predict_input)[0]
    prediction_value = tf.argmax(output, axis=1).numpy()[0]
    return prediction_value

def predict_race_for(file_path, should_eval=True):
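    '''
    Predicts a race label for every biography in a CSV file using `predict_one`
    Params:
        file_path: path to a CSV with columns name, href, mini_bio, label
        should_eval: if True, print a classification report
    Returns:
        pred_df: DataFrame of name, href, text, label, and predicted label
    '''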
    test_df = pd.read_csv(file_path)
    test_df = test_df.replace(np.nan, "", regex=True)
    test_predictions = [predict_one(text) for text in test_df["mini_bio"]]
    pred_df = pd.DataFrame({
                        "name": test_df["name"],
                        "href": test_df["href"],
                        "text": test_df["mini_bio"],
                        "label": test_df["label"],
                        "pred": test_predictions
                                                })
    if should_eval:
        print(classification_report(pred_df["label"], pred_df["pred"]))
    return pred_df

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at ../model/distilbert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
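
`predict_one` returns a bare class index. As a minimal sketch, the index can be turned into a readable label with the mapping stated in the docstring, and a softmax over the logits gives a rough confidence. The names `ID2LABEL`, `softmax`, and `label_name` are hypothetical helpers (not part of the notebook's scripts), and the logits are made-up values for illustration.

```python
import math

# Mapping taken from the predict_one docstring:
# {0, 1, 2, 3} maps to {Asian, Black, Hispanic, White}
ID2LABEL = {0: "Asian", 1: "Black", 2: "Hispanic", 3: "White"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def label_name(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Made-up logits, for illustration only (not model output)
label, prob = label_name([0.1, 2.3, -0.4, 0.9])
print(label, round(prob, 3))
```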


## Conduct manual tests on sample dataset <a href name="manual-test"></a>

In [5]:
# set up sample
df = pd.read_csv(f"{main_dir}/data/test_sample_metadata.csv")
df = df.sample(10, random_state=42) # take a small sample and explore results # TODO once fixing all???
df.to_csv(f"{main_dir}/data/test_sample_manual.csv", index=False)

### No Name

In [6]:
pred_df = predict_race_for(file_path=f"{main_dir}/data/test_sample_manual_no_name.csv")
pred_df

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         2
           2       1.00      1.00      1.00         3
           3       1.00      1.00      1.00         4

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10



Unnamed: 0,name,href,text,label,pred
0,Hortensia Santoveña,/name/nm0764326,"PERSON was born on October 17, 1912 in Tlalpuj...",2,2
1,Rosalind Russell,/name/nm0751426,"The middle of seven children, she was named, n...",3,3
2,Patrick Warburton,/name/nm0911320,"PERSON is known to many for the role of ""Puddy...",3,3
3,Alexander Etseyatse,/name/nm3411562,"PERSON is known for Otis (2018), Mikel's Faith...",1,1
4,Patricia Kazadi,/name/nm2131865,PERSON's acting career started when she was on...,1,1
5,Sharon Stone,/name/nm0000232,"PERSON was born and raised in Meadville, a sma...",3,3
6,Nat Wolff,/name/nm1822659,"PERSON is an American actor, musician, and sin...",3,3
7,Shawn Mendes,/name/nm6658398,"PERSON was born on August 8, 1998 in Toronto, ...",2,2
8,Ellen Wong,/name/nm2771798,"PERSON was born in Scarborough, Ontario, Canad...",0,0
9,Ricky Martin,/name/nm0005193,"Born and raised in Puerto Rico, PERSON initiat...",2,2


### No Info About Target Person

In [7]:
pred_df = predict_race_for(file_path=f"{main_dir}/data/test_sample_manual_no_personal_info.csv")
pred_df

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       0.67      1.00      0.80         2
           2       1.00      1.00      1.00         3
           3       1.00      0.75      0.86         4

    accuracy                           0.90        10
   macro avg       0.92      0.94      0.91        10
weighted avg       0.93      0.90      0.90        10



Unnamed: 0,name,href,text,label,pred
0,Hortensia Santoveña,/name/nm0764326,"PERSON was born on October 17, 1912 in LOCATIO...",2,2
1,Rosalind Russell,/name/nm0751426,"The middle of seven children, she was named, n...",3,3
2,Patrick Warburton,/name/nm0911320,"PERSON is known to many for the role of ""Puddy...",3,3
3,Alexander Etseyatse,/name/nm3411562,"PERSON is known for Otis (2018), Mikel's Faith...",1,1
4,Patricia Kazadi,/name/nm2131865,PERSON's acting career started when she was on...,1,1
5,Sharon Stone,/name/nm0000232,"PERSON was born and raised in LOCATION, a smal...",3,3
6,Nat Wolff,/name/nm1822659,"PERSON is an NATIONALITY/ETHNICITY/RACE actor,...",3,1
7,Shawn Mendes,/name/nm6658398,"PERSON was born on August 8, 1998 in LOCATION,...",2,2
8,Ellen Wong,/name/nm2771798,PERSON was born in LOCATION. Her first role wa...,0,0
9,Ricky Martin,/name/nm0005193,"Born and raised in LOCATION, PERSON initiated ...",2,2
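
To see where the per-class numbers in the report come from, here is a small sketch that recomputes precision and recall by hand from the `label` and `pred` columns of this table. The helper `precision_recall` is a hypothetical name; the notebook itself relies on `classification_report`.

```python
# "label" and "pred" columns copied from the table above
labels = [2, 3, 3, 1, 1, 3, 3, 2, 0, 2]
preds = [2, 3, 3, 1, 1, 3, 1, 2, 0, 2]

def precision_recall(labels, preds, cls):
    """Precision and recall for one class, computed from raw counts."""
    true_pos = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    predicted = sum(1 for p in preds if p == cls)  # how often the model said `cls`
    actual = sum(1 for y in labels if y == cls)    # how often `cls` was correct
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

for cls in sorted(set(labels)):
    p, r = precision_recall(labels, preds, cls)
    print(cls, round(p, 2), round(r, 2))
```

The single error (Nat Wolff, label 3 predicted as 1) lowers precision for class 1 and recall for class 3, which matches the report.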


### No Info About Any Named Entities

In [8]:
pred_df = predict_race_for(file_path=f"{main_dir}/data/test_sample_manual_no_info.csv")
pred_df

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.40      1.00      0.57         2
           2       1.00      0.33      0.50         3
           3       0.75      0.75      0.75         4

    accuracy                           0.60        10
   macro avg       0.54      0.52      0.46        10
weighted avg       0.68      0.60      0.56        10





Unnamed: 0,name,href,text,label,pred
0,Hortensia Santoveña,/name/nm0764326,"PERSON was born on October 17, 1912 in LOCATIO...",2,1
1,Rosalind Russell,/name/nm0751426,"The middle of seven children, she was named, n...",3,3
2,Patrick Warburton,/name/nm0911320,PERSON is known to many for the role of “”PERS...,3,3
3,Alexander Etseyatse,/name/nm3411562,"PERSON is known for Otis (2018), Mikel's Faith...",1,1
4,Patricia Kazadi,/name/nm2131865,PERSON's acting career started when she was on...,1,1
5,Sharon Stone,/name/nm0000232,"PERSON was born and raised in LOCATION, a smal...",3,3
6,Nat Wolff,/name/nm1822659,"PERSON is an NATIONALITY/ETHNICITY/RACE actor,...",3,1
7,Shawn Mendes,/name/nm6658398,"PERSON was born on August 8, 1998 in LOCATION,...",2,1
8,Ellen Wong,/name/nm2771798,PERSON was born in LOCATION. Her first role wa...,0,3
9,Ricky Martin,/name/nm0005193,"Born and raised in LOCATION, PERSON initiated ...",2,2


### No Info About Any Named Entities (labeled Entity)

In [9]:
pred_df = predict_race_for(file_path=f"{main_dir}/data/test_sample_manual_ENTITY.csv")
pred_df

              precision    recall  f1-score   support

           0       0.20      1.00      0.33         1
           1       0.50      0.50      0.50         2
           2       1.00      0.67      0.80         3
           3       1.00      0.25      0.40         4

    accuracy                           0.50        10
   macro avg       0.68      0.60      0.51        10
weighted avg       0.82      0.50      0.53        10



Unnamed: 0,name,href,text,label,pred
0,Hortensia Santoveña,/name/nm0764326,"ENTITY was born on October 17, 1912 in ENTITY....",2,2
1,Rosalind Russell,/name/nm0751426,"The middle of seven children, she was named, n...",3,3
2,Patrick Warburton,/name/nm0911320,ENTITY is known to many for the role of “”ENTI...,3,0
3,Alexander Etseyatse,/name/nm3411562,"ENTITY is known for Otis (2018), Mikel's Faith...",1,1
4,Patricia Kazadi,/name/nm2131865,ENTITY's acting career started when she was on...,1,0
5,Sharon Stone,/name/nm0000232,"ENTITY was born and raised in ENTITY, a small ...",3,1
6,Nat Wolff,/name/nm1822659,"ENTITY is an ENTITY actor, musician, and singe...",3,0
7,Shawn Mendes,/name/nm6658398,"ENTITY was born on August 8, 1998 in ENTITY, t...",2,0
8,Ellen Wong,/name/nm2771798,ENTITY was born in ENTITY. Her first role was ...,0,0
9,Ricky Martin,/name/nm0005193,"Born and raised in ENTITY, ENTITY initiated hi...",2,2
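
The four manual tests can be compared side by side. The sketch below recomputes accuracy from the `label` and `pred` columns shown in the tables of Part One; the variant names and the `accuracy` helper are illustrative choices, not identifiers from the notebook's scripts.

```python
# True labels, identical across the four manual test files
labels = [2, 3, 3, 1, 1, 3, 3, 2, 0, 2]

# "pred" columns copied from the four tables above
variants = {
    "no name": [2, 3, 3, 1, 1, 3, 3, 2, 0, 2],
    "no personal info": [2, 3, 3, 1, 1, 3, 1, 2, 0, 2],
    "no named entities": [1, 3, 3, 1, 1, 3, 1, 1, 3, 2],
    "ENTITY placeholder": [2, 3, 0, 1, 0, 1, 0, 0, 0, 2],
}

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

accuracies = {name: accuracy(labels, preds) for name, preds in variants.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```

On this 10-row sample, accuracy drops as more entity information is masked, although a sample this small can only suggest a trend.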


# Part Two <a href name="2"></a>

## Named Entity Recognition (Algorithmic method) <a href name="ner-alg"></a>

In [4]:
from flair.data import Sentence
from flair.models import SequenceTagger

# flair example 

# load tagger
tagger = SequenceTagger.load("flair/ner-english")

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)


2023-07-15 20:30:44,086 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
Sentence[5]: "George Washington went to Washington" → ["George Washington"/PER, "Washington"/LOC]
The following NER tags are found:
Span[0:2]: "George Washington" → PER (0.9985)
Span[4:5]: "Washington" → LOC (0.9706)
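
The tag dictionary in the log uses `S-`, `B-`, `I-`, and `E-` prefixes (single, begin, inside, end). As a rough illustration of how such per-token tags relate to the spans printed above, here is a sketch of a decoder. It is not flair's implementation, `decode_bioes` is a hypothetical name, and the per-token tags are an assumption chosen to be consistent with the two spans in the output.

```python
def decode_bioes(tags):
    """Turn per-token BIOES tags into (start, end, label) spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if "-" not in tag:  # "O" and special tags carry no entity
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":  # single-token entity
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":  # entity begins here
            start = i
        elif prefix == "E" and start is not None:  # entity ends here
            spans.append((start, i + 1, label))
            start = None
        # "I" continues the current entity; nothing to record yet
    return spans

tokens = ["George", "Washington", "went", "to", "Washington"]
# Assumed tagging, consistent with the PER and LOC spans printed above
tags = ["B-PER", "E-PER", "O", "O", "S-LOC"]
spans = decode_bioes(tags)
for start, end, label in spans:
    print(tokens[start:end], label)
```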


In [26]:
# set up sample
df = pd.read_csv(f"{main_dir}/data/test_sample_metadata.csv")
df = df.sample(10, random_state=42) # take a small sample and explore results # TODO once fixing all???
df.to_csv(f"{main_dir}/data/test_sample_manual.csv", index=False)

test = df.iloc[0]["mini_bio"]
# make example sentence
sentence = Sentence(test)

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

Sentence[66]: "Hortensia Santoveña was born on October 17, 1912 in Tlalpujahua, Michoacán, Mexico. She was an actress, known for Two Mules for Sister Sara (1970), Talpa (1956) and Eugenia Grandet (1953). She was previously married to Pascual García Peña. She died on July 11, 1986 in Mexico, D.F., Mexico." → ["Hortensia Santoveña"/PER, "Tlalpujahua"/LOC, "Michoacán"/LOC, "Mexico"/LOC, "Two Mules for Sister Sara"/MISC, "Talpa"/MISC, "Eugenia Grandet"/PER, "Pascual García Peña"/PER, "Mexico"/LOC, "D.F."/LOC, "Mexico"/LOC]


In [111]:
for label in sentence.get_labels():
    print(label)

Span[0:2]: "Hortensia Santoveña" → PER (0.9905)
Span[10:11]: "Tlalpujahua" → LOC (0.9996)
Span[12:13]: "Michoacán" → LOC (0.9914)
Span[14:15]: "Mexico" → LOC (0.9996)
Span[23:28]: "Two Mules for Sister Sara" → MISC (0.8566)
Span[32:33]: "Talpa" → MISC (0.8108)
Span[37:39]: "Eugenia Grandet" → PER (0.6856)
Span[48:51]: "Pascual García Peña" → PER (0.9987)
Span[60:61]: "Mexico" → LOC (0.9719)
Span[62:63]: "D.F." → LOC (0.7862)
Span[64:65]: "Mexico" → LOC (0.9998)


In [64]:
print((test.split(" ")))
for label in sentence.get_labels()[:1]:
    loc = str(label)
    print(loc.split(": ")[0])
    entity = label.value
    print(sentence[0:2])


['Hortensia', 'Santoveña', 'was', 'born', 'on', 'October', '17,', '1912', 'in', 'Tlalpujahua,', 'Michoacán,', 'Mexico.', 'She', 'was', 'an', 'actress,', 'known', 'for', 'Two', 'Mules', 'for', 'Sister', 'Sara', '(1970),', 'Talpa', '(1956)', 'and', 'Eugenia', 'Grandet', '(1953).', 'She', 'was', 'previously', 'married', 'to', 'Pascual', 'García', 'Peña.', 'She', 'died', 'on', 'July', '11,', '1986', 'in', 'Mexico,', 'D.F.,', 'Mexico.']
Span[0:2]
Span[0:2]: "Hortensia Santoveña" → PER (0.9905)
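
Splitting the printed label on `": "` works for the output shown here, but it fails silently if the text changes shape. A slightly more defensive sketch uses a regular expression and raises on anything unexpected. `parse_span` and `SPAN_PATTERN` are hypothetical names, and the approach still assumes the `Span[start:end]` prefix seen in the output above.

```python
import re

# Matches the "Span[start:end]" prefix of a printed label
SPAN_PATTERN = re.compile(r"^Span\[(\d+):(\d+)\]")

def parse_span(text):
    """Extract (start, end) token positions from a printed label string."""
    match = SPAN_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected label format: {text!r}")
    return int(match.group(1)), int(match.group(2))

example = 'Span[0:2]: "Hortensia Santoveña" → PER (0.9905)'
print(parse_span(example))
```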


## Transform All Named Entities in Sentence to Category Labels <a href name="transform"></a>

In [122]:
# version that does not collapse array of text into entity label 

print(sentence)

final = ""
string = sentence.to_tokenized_string().split(" ")
for label in sentence.get_labels():
    arr = str(label).split(": ")[0][5:-1]
    start, end = arr.split(":")
    start, end = int(start), int(end)
    entity = label.value
    print(string[start:end])
    diff = end-start
    for i in range(diff):
        string[start+i] = entity
    print(string)
    final = " ".join(string)
    
print(final)

Sentence[66]: "Hortensia Santoveña was born on October 17, 1912 in Tlalpujahua, Michoacán, Mexico. She was an actress, known for Two Mules for Sister Sara (1970), Talpa (1956) and Eugenia Grandet (1953). She was previously married to Pascual García Peña. She died on July 11, 1986 in Mexico, D.F., Mexico." → ["Hortensia Santoveña"/PER, "Tlalpujahua"/LOC, "Michoacán"/LOC, "Mexico"/LOC, "Two Mules for Sister Sara"/MISC, "Talpa"/MISC, "Eugenia Grandet"/PER, "Pascual García Peña"/PER, "Mexico"/LOC, "D.F."/LOC, "Mexico"/LOC]
['Hortensia', 'Santoveña']
['PER', 'PER', 'was', 'born', 'on', 'October', '17', ',', '1912', 'in', 'Tlalpujahua', ',', 'Michoacán', ',', 'Mexico', '.', 'She', 'was', 'an', 'actress', ',', 'known', 'for', 'Two', 'Mules', 'for', 'Sister', 'Sara', '(', '1970', ')', ',', 'Talpa', '(', '1956', ')', 'and', 'Eugenia', 'Grandet', '(', '1953', ')', '.', 'She', 'was', 'previously', 'married', 'to', 'Pascual', 'García', 'Peña', '.', 'She', 'died', 'on', 'July', '11', ',', '1986', '

In [121]:
# V2.1 version that collapses array of text into entity label 
print(sentence)

final = ""
string = sentence.to_tokenized_string().split(" ")
for label in reversed(sentence.get_labels()):
    print(label)
    arr = str(label).split(": ")[0][5:-1]
    start, end = arr.split(":")
    start, end = int(start), int(end)
    entity = label.value
    del string[start:end]
    string.insert(start,entity)
    print(string)
    final = " ".join(string)
    
print("\n",final)

Sentence[66]: "Hortensia Santoveña was born on October 17, 1912 in Tlalpujahua, Michoacán, Mexico. She was an actress, known for Two Mules for Sister Sara (1970), Talpa (1956) and Eugenia Grandet (1953). She was previously married to Pascual García Peña. She died on July 11, 1986 in Mexico, D.F., Mexico." → ["Hortensia Santoveña"/PER, "Tlalpujahua"/LOC, "Michoacán"/LOC, "Mexico"/LOC, "Two Mules for Sister Sara"/MISC, "Talpa"/MISC, "Eugenia Grandet"/PER, "Pascual García Peña"/PER, "Mexico"/LOC, "D.F."/LOC, "Mexico"/LOC]
Span[64:65]: "Mexico" → LOC (0.9998)
['Hortensia', 'Santoveña', 'was', 'born', 'on', 'October', '17', ',', '1912', 'in', 'Tlalpujahua', ',', 'Michoacán', ',', 'Mexico', '.', 'She', 'was', 'an', 'actress', ',', 'known', 'for', 'Two', 'Mules', 'for', 'Sister', 'Sara', '(', '1970', ')', ',', 'Talpa', '(', '1956', ')', 'and', 'Eugenia', 'Grandet', '(', '1953', ')', '.', 'She', 'was', 'previously', 'married', 'to', 'Pascual', 'García', 'Peña', '.', 'She', 'died', 'on', 'July'
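
V2.1 iterates over the labels in reverse because collapsing a multi-token span shortens the token list, which shifts every position after it. The sketch below shows the difference on the earlier example sentence using plain `(start, end, label)` tuples in place of flair labels; `collapse` is a hypothetical helper.

```python
def collapse(tokens, spans, reverse=True):
    """Replace each (start, end, label) span with a single label token."""
    result = list(tokens)  # copy so the input list is left untouched
    for start, end, label in sorted(spans, reverse=reverse):
        result[start:end] = [label]
    return result

tokens = ["George", "Washington", "went", "to", "Washington"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]

last_to_first = collapse(tokens, spans, reverse=True)
first_to_last = collapse(tokens, spans, reverse=False)
print(last_to_first)
print(first_to_last)
```

Working from the last span to the first leaves earlier positions valid. Working from the first span forward, position 4 no longer points at the second "Washington" once the first span has been collapsed.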

In [135]:
# V2.2: SIMPLIFIED version that collapses array of text into entity label 
print(sentence)

def get_span_position(label):
    '''
    "Span[0:2]: "Hortensia Santoveña" → PER (0.9905)" --> 0, 2
    "Span[64:65]: "Mexico" → LOC (0.9998)" --> 64, 65
    '''
    span = str(label).split(": ")[0][5:-1]
    start, end = span.split(":")
    start, end = int(start), int(end)

    return start, end

clean_string = sentence.to_tokenized_string().split(" ")
for label in reversed(sentence.get_labels()):
    entity = label.value
    start, end = get_span_position(label)
    del clean_string[start:end]
    clean_string.insert(start,entity)
    
print("\n",clean_string)
print(" ".join(clean_string))

Sentence[66]: "Hortensia Santoveña was born on October 17, 1912 in Tlalpujahua, Michoacán, Mexico. She was an actress, known for Two Mules for Sister Sara (1970), Talpa (1956) and Eugenia Grandet (1953). She was previously married to Pascual García Peña. She died on July 11, 1986 in Mexico, D.F., Mexico." → ["Hortensia Santoveña"/PER, "Tlalpujahua"/LOC, "Michoacán"/LOC, "Mexico"/LOC, "Two Mules for Sister Sara"/MISC, "Talpa"/MISC, "Eugenia Grandet"/PER, "Pascual García Peña"/PER, "Mexico"/LOC, "D.F."/LOC, "Mexico"/LOC]

 ['PER', 'was', 'born', 'on', 'October', '17', ',', '1912', 'in', 'LOC', ',', 'LOC', ',', 'LOC', '.', 'She', 'was', 'an', 'actress', ',', 'known', 'for', 'MISC', '(', '1970', ')', ',', 'MISC', '(', '1956', ')', 'and', 'PER', '(', '1953', ')', '.', 'She', 'was', 'previously', 'married', 'to', 'PER', '.', 'She', 'died', 'on', 'July', '11', ',', '1986', 'in', 'LOC', ',', 'LOC', ',', 'LOC', '.']
PER was born on October 17 , 1912 in LOC , LOC , LOC . She was an actress , kno

In [142]:
# V2.3: SIMPLIFIED version that collapses array of text into entity label 
print(sentence)

def get_span_position(label):
    '''
    "Span[0:2]: "Hortensia Santoveña" → PER (0.9905)" --> 0, 2
    "Span[64:65]: "Mexico" → LOC (0.9998)" --> 64, 65
    '''
    span = str(label).split(": ")[0][5:-1]
    start, end = span.split(":")
    start, end = int(start), int(end)

    return start, end

def substitute(clean_string, label):
    '''
    Replaces each named entity with its corresponding label

    ['Hortensia', 'Santoveña'] at position [0:2] is replaced with "PER"
    ['Two', 'Mules', 'for', 'Sister', 'Sara'] at position [23:28] is replaced with "MISC"
    '''
    entity = label.value
    start, end = get_span_position(label)
    del clean_string[start:end]
    clean_string.insert(start,entity)
    return clean_string

clean_string = sentence.to_tokenized_string().split(" ")
test = [substitute(clean_string, label) for label in reversed(sentence.get_labels())][-1]
print(" ".join(test))

Sentence[66]: "Hortensia Santoveña was born on October 17, 1912 in Tlalpujahua, Michoacán, Mexico. She was an actress, known for Two Mules for Sister Sara (1970), Talpa (1956) and Eugenia Grandet (1953). She was previously married to Pascual García Peña. She died on July 11, 1986 in Mexico, D.F., Mexico." → ["Hortensia Santoveña"/PER, "Tlalpujahua"/LOC, "Michoacán"/LOC, "Mexico"/LOC, "Two Mules for Sister Sara"/MISC, "Talpa"/MISC, "Eugenia Grandet"/PER, "Pascual García Peña"/PER, "Mexico"/LOC, "D.F."/LOC, "Mexico"/LOC]
PER was born on October 17 , 1912 in LOC , LOC , LOC . She was an actress , known for MISC ( 1970 ) , MISC ( 1956 ) and PER ( 1953 ) . She was previously married to PER . She died on July 11 , 1986 in LOC , LOC , LOC .


In [153]:
##### We choose the approach from last version (V2.3).
def clean(string):
    '''
    Loops through bio text in reverse order, substituting named entities for labels
    '''
    # initialize flair 'Sentence'
    sentence = Sentence(string)

    # predict NER tags
    tagger.predict(sentence)
    # print(sentence)

    # clean string: replace spans from last to first so earlier token indices stay valid
    tokens = sentence.to_tokenized_string().split(" ")
    for label in reversed(sentence.get_labels()):
        tokens = substitute(tokens, label)
    return " ".join(tokens)  # tokenized text is returned unchanged if no entities were found

def substitute(tokens, label):
    '''
    Replaces each named entity with its corresponding label

    ['Hortensia', 'Santoveña'] at position [0:2] is replaced with "PER"
    ['Two', 'Mules', 'for', 'Sister', 'Sara'] at position [23:28] is replaced with "MISC"
    '''
    entity = label.value
    start, end = get_span_position(label)
    del tokens[start:end]
    tokens.insert(start,entity)
    return tokens

def get_span_position(label):
    '''
    "Span[0:2]: "Hortensia Santoveña" → PER (0.9905)" --> 0, 2
    "Span[64:65]: "Mexico" → LOC (0.9998)" --> 64, 65
    '''
    span = str(label).split(": ")[0][5:-1]
    start, end = span.split(":")
    start, end = int(start), int(end)

    return start, end

# make example sentence
test = df.iloc[4]["mini_bio"]
print(test)

# clean sentence
clean(test)

Patricia Kazadi's acting career started when she was only 2 years old - that is when she booked her first role in a music video. Just a couple of months later she starred in a children's theatre, where she worked for 10 years on various plays, both as a dancer and an actress. After that it was time to transition onto the small screen. She has landed her first big break, a role as a regular in a new series on national television. It was a start to a successful acting career. Since then, Patricia has been a leading lady in 15 TV series and 7 feature films, the most recent of which, "Bodo", is based on a true story from the 1920's and 30's era in American and Polish cinematography. This year she starred in a holiday mega hit movie - "Love beyond all" - for which she co-wrote and recorded the theme song of the same title.In 2018 Patricia Kazadi went back to the theatre after an 18 years absence, garnering top reviews from the most prestigious critics. She also starred in the polish version

'PER \'s acting career started when she was only 2 years old - that is when she booked her first role in a music video . Just a couple of months later she starred in a children \'s theatre , where she worked for 10 years on various plays , both as a dancer and an actress . After that it was time to transition onto the small screen . She has landed her first big break , a role as a regular in a new series on national television . It was a start to a successful acting career . Since then , PER has been a leading lady in 15 TV series and 7 feature films , the most recent of which , " MISC " , is based on a true story from the 1920 \'s and 30 \'s era in MISC and MISC cinematography . This year she starred in a holiday mega hit movie - " MISC " - for which she co-wrote and recorded the theme song of the same title.In 2018 PER went back to the theatre after an 18 years absence , garnering top reviews from the most prestigious critics . She also starred in the polish version of the PER movie 
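
Because the cleaned text is rebuilt from tokens, it gains spaces around punctuation (for example `PER 's` and ` .`). One possible alternative, sketched below, replaces entities by character offset so the original spacing is preserved. This assumes the tagger can supply character offsets for each span; the offsets here are written by hand, and `replace_by_offsets` is a hypothetical helper rather than the approach used in this notebook.

```python
def replace_by_offsets(text, spans):
    """Replace (start_char, end_char, label) spans, working from the end of the text."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text

text = "George Washington went to Washington."
# Hand-written character offsets (end exclusive), for illustration only
spans = [(0, 17, "PER"), (26, 36, "LOC")]
cleaned = replace_by_offsets(text, spans)
print(cleaned)
```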

I apply the V2.3 approach to the biographies in `named-entity-remover.py` and test the resulting accuracy in `named-entity-tester.py`.

# Part Three <a href name="3"></a>

## Remove individual categories in Sentence to Category Labels <a href name="individual"></a>

In [12]:
from flair.data import Sentence
from flair.models import SequenceTagger
import pandas as pd

tagger = SequenceTagger.load("flair/ner-english-ontonotes-fast")
main_dir = ".."

# set up sample
df = pd.read_csv(f"{main_dir}/data/test_sample_metadata.csv")
df = df.sample(10, random_state=3) # take a small sample and explore results # TODO once fixing all???

test = df.iloc[0]["mini_bio"]
# make example sentence
sentence = Sentence(test)

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

2023-07-18 22:57:02,967 SequenceTagger predicts: Dictionary with 75 tags: O, S-PERSON, B-PERSON, E-PERSON, I-PERSON, S-GPE, B-GPE, E-GPE, I-GPE, S-ORG, B-ORG, E-ORG, I-ORG, S-DATE, B-DATE, E-DATE, I-DATE, S-CARDINAL, B-CARDINAL, E-CARDINAL, I-CARDINAL, S-NORP, B-NORP, E-NORP, I-NORP, S-MONEY, B-MONEY, E-MONEY, I-MONEY, S-PERCENT, B-PERCENT, E-PERCENT, I-PERCENT, S-ORDINAL, B-ORDINAL, E-ORDINAL, I-ORDINAL, S-LOC, B-LOC, E-LOC, I-LOC, S-TIME, B-TIME, E-TIME, I-TIME, S-WORK_OF_ART, B-WORK_OF_ART, E-WORK_OF_ART, I-WORK_OF_ART, S-FAC
Sentence[2026]: "Natalie Wood was an American actress of Russian and Ukrainian descent. She started her career as a child actress and eventually transitioned into teenage roles, young adult roles, and middle-aged roles. She drowned off Catalina Island on November 29, 1981 at age 43.Wood was born July 20, 1938 in San Francisco to Russian immigrant parents: housewife Maria Gurdin (née Zudilova), known by multiple aliases including Mary, Marie and Musia, and secon

In [13]:
# we don't change these functions
def substitute_named_entities(tokens, label):
    '''
    Replaces each named entity with its corresponding label

    ['Hortensia', 'Santoveña'] at position [0:2] is replaced with "PER"
    ['Two', 'Mules', 'for', 'Sister', 'Sara'] at position [23:28] is replaced with "MISC"
    '''
    entity = label.value
    start, end = get_span_position(label)
    del tokens[start:end]
    tokens.insert(start,entity)
    return tokens

def get_span_position(label):
    '''
    "Span[0:2]: "Hortensia Santoveña" → PER (0.9905)" --> 0, 2
    "Span[64:65]: "Mexico" → LOC (0.9998)" --> 64, 65
    '''
    span = str(label).split(": ")[0][5:-1]
    start, end = span.split(":")
    start, end = int(start), int(end)

    return start, end


In [19]:
##### We choose the approach from last version (V2.3).
def clean_specific_entities(string, entities=frozenset()):
    '''
    Replaces only the named entities whose category is in `entities`
    by looping through the bio text in reverse order
    Params:
        string: biography text
        entities: set of category labels to replace, e.g. {'PERSON'}
    Returns:
        the tokenized text with the selected entities replaced by their labels
    '''
    # initialize flair 'Sentence'
    sentence = Sentence(string)
    # predict NER tags
    tagger.predict(sentence)
    # convert flair Sentence to a list of tokens
    tokens = sentence.to_tokenized_string().split(" ")

    # replace from last span to first so earlier token indices stay valid
    for label in reversed(sentence.get_labels()):
        if label.value in entities:
            tokens = substitute_named_entities(tokens, label)
            print(tokens)

    # if no label matched, the tokenized text is returned unchanged
    return " ".join(tokens)


# make example sentence
test = df.iloc[4]["mini_bio"]
print(test)

# clean sentence
entities = {'PERSON'}
string = clean_specific_entities(test, entities)
print(string)

Amelia Marshall was born on April 2, 1958 in Albany, Georgia, USA. She is an actress, known for All My Children (1970), Passions (1999) and According to Spencer (2001). She was previously married to Kent Schaffer and Daryl Waters.
['Amelia', 'Marshall', 'was', 'born', 'on', 'April', '2', ',', '1958', 'in', 'Albany', ',', 'Georgia', ',', 'USA', '.', 'She', 'is', 'an', 'actress', ',', 'known', 'for', 'All', 'My', 'Children', '(', '1970', ')', ',', 'Passions', '(', '1999', ')', 'and', 'According', 'to', 'Spencer', '(', '2001', ')', '.', 'She', 'was', 'previously', 'married', 'to', 'Kent', 'Schaffer', 'and', 'PERSON', '.']
['Amelia', 'Marshall', 'was', 'born', 'on', 'April', '2', ',', '1958', 'in', 'Albany', ',', 'Georgia', ',', 'USA', '.', 'She', 'is', 'an', 'actress', ',', 'known', 'for', 'All', 'My', 'Children', '(', '1970', ')', ',', 'Passions', '(', '1999', ')', 'and', 'According', 'to', 'Spencer', '(', '2001', ')', '.', 'She', 'was', 'previously', 'married', 'to', 'PERSON', 'and', 'P
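
The filtering step can be tried without loading a tagger. The sketch below applies the same idea to plain `(start, end, label)` tuples: only spans whose label is in the requested set are collapsed. `replace_selected` is a hypothetical helper, and the `GPE` label for "Albany" is an assumed tagging used only for illustration.

```python
def replace_selected(tokens, spans, entities):
    """Collapse only the spans whose label is in `entities`, last span first."""
    result = list(tokens)
    for start, end, label in sorted(spans, reverse=True):
        if label in entities:
            result[start:end] = [label]
    return " ".join(result)

tokens = ["Amelia", "Marshall", "was", "born", "in", "Albany", "."]
# Assumed spans, for illustration only
spans = [(0, 2, "PERSON"), (5, 6, "GPE")]

only_person = replace_selected(tokens, spans, {"PERSON"})
nothing = replace_selected(tokens, spans, set())
print(only_person)
print(nothing)
```

With an empty set the text comes back unchanged, which is the behavior to expect when none of the requested categories occur in a biography.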


