# Extract Titles for LCCN Retrieval
This notebook extracts the titles enclosed in double brackets (`[[ ]]`) from selected columns of the dataset and writes the deduplicated list to `titles_needed.csv` for later LCCN retrieval.

## Import Required Libraries
Import the necessary libraries for data manipulation and text processing.

In [None]:
# Import Required Libraries
import pandas as pd
import re

## Define Helper Functions
Define a helper function that extracts the text between double brackets (`[[ ]]`) from a given column.

In [None]:
# Define a function to extract text between double brackets ([[ ]])
def extract_titles_from_column(column_data):
    """
    Extracts text between double brackets ([[ ]]) from a pandas Series.

    Args:
        column_data (pd.Series): The column data to process.

    Returns:
        list: A list of extracted titles.
    """
    titles = []
    for entry in column_data.dropna():
        if isinstance(entry, str):
            matches = re.findall(r'\[\[(.*?)\]\]', entry)
            titles.extend(matches)
    return titles
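
To make the behaviour of the pattern concrete, here is a minimal sketch that applies the same regular expression to a few sample strings. The sample values are hypothetical and not taken from the dataset; in particular, the piped form `[[Paris|the French capital]]` is an assumption about how some cells might look.

```python
import re

# Same pattern as in extract_titles_from_column: non-greedy match between [[ and ]].
pattern = re.compile(r'\[\[(.*?)\]\]')

# Hypothetical sample cell values.
samples = [
    'Born in [[Boston]] and raised in [[New York City]]',
    '[[Paris|the French capital]]',   # assumed piped form: returned unsplit
    'no brackets here',               # no match, yields an empty list
    '[[Line\nbreak]]',                # "." does not match a newline, so no match
]

results = [pattern.findall(text) for text in samples]
for text, found in zip(samples, results):
    print(repr(text), '->', found)
```

Two things follow from this: a piped value comes back as one string, so it would need to be split on `|` separately if only the first part is wanted, and a bracketed title that spans a line break is skipped unless the pattern is compiled with `re.DOTALL`.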

## Extract Titles from Relevant Columns
Load the dataset, then iterate through the relevant columns (the ones identified in `process_place_columns`, `process_date_columns`, and `process_occupation_column`) and extract titles from each with the helper function. Columns that are not present in the dataset are skipped.

In [None]:
# Load the dataset
file_path = './LOD_Person_Prep_Script_Data.xlsx'  # Update with the actual file path
df_persons = pd.read_excel(file_path)

# Define the relevant columns for title extraction
relevant_columns = [
    'Place of Birth (P19)', 'Place of Death', 'Place of Residence',
    'Occupation', 'Birth Date', 'Death Date', 'Marriage Date'
]

# Extract titles from each relevant column
all_titles = []
for column in relevant_columns:
    if column in df_persons.columns:
        titles = extract_titles_from_column(df_persons[column])
        all_titles.extend(titles)
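
Because the loop skips absent columns without any message, a misspelled column name would go unnoticed. The following sketch shows one way to list which requested columns are missing before extracting. The small DataFrame `demo_df` and the list `wanted` are hypothetical stand-ins for `df_persons` and `relevant_columns`.

```python
import pandas as pd

# Hypothetical stand-in for df_persons; the real data comes from the Excel file.
demo_df = pd.DataFrame({
    'Place of Death': ['[[Vienna]]', None],
    'Occupation': ['[[Composer]]; [[Pianist]]', '[[Painter]]'],
})

# Hypothetical stand-in for relevant_columns.
wanted = ['Place of Death', 'Occupation', 'Marriage Date']

# Split the requested names into those the DataFrame has and those it lacks.
present = [col for col in wanted if col in demo_df.columns]
missing = [col for col in wanted if col not in demo_df.columns]

print('present:', present)  # present: ['Place of Death', 'Occupation']
print('missing:', missing)  # missing: ['Marriage Date']
```

Printing `missing` once before the extraction loop is usually enough to catch a header that differs from the expected spelling.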

## Combine and Deduplicate Titles
Remove duplicates from the combined list of extracted titles and sort the result so that the output is the same on every run.

In [None]:
# Deduplicate the titles; sorting makes the row order reproducible
unique_titles = sorted(set(all_titles))
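
Deduplication by exact string match treats `Boston` and ` Boston ` as different titles. If the source cells contain such whitespace variants (an assumption, not something confirmed for this dataset), a light normalization pass before deduplicating could look like the sketch below. The helper `normalize_title` and the list `raw_titles` are hypothetical.

```python
def normalize_title(title):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return ' '.join(title.split())


# Hypothetical extracted titles with whitespace variants and an empty match.
raw_titles = ['Boston', ' Boston ', 'New  York City', 'New York City', '']

# Normalize, drop empty strings, then deduplicate and sort for stable output.
cleaned = sorted({normalize_title(t) for t in raw_titles} - {''})
print(cleaned)  # ['Boston', 'New York City']
```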

## Export Titles to CSV
Save the deduplicated list of titles to a CSV file named `titles_needed.csv`.

In [None]:
# Export the deduplicated titles to a CSV file
output_file = 'titles_needed.csv'
pd.DataFrame({'Title': unique_titles}).to_csv(output_file, index=False, encoding='utf-8-sig')

print(f"Titles have been successfully exported to {output_file}.")
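
The `utf-8-sig` encoding writes a UTF-8 byte order mark at the start of the file, which helps spreadsheet programs such as Excel recognise the file as UTF-8 when it contains non-ASCII titles. The sketch below illustrates this with the standard `csv` module instead of pandas to keep it small; the file name `titles_demo.csv` and the sample titles are hypothetical.

```python
import csv
import os
import tempfile

titles = ['Boston', 'Zürich']  # hypothetical sample titles

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'titles_demo.csv')
    with open(path, 'w', newline='', encoding='utf-8-sig') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Title'])
        writer.writerows([t] for t in titles)

    # The raw bytes start with the UTF-8 byte order mark.
    with open(path, 'rb') as handle:
        has_bom = handle.read(3) == b'\xef\xbb\xbf'

    # Reading with utf-8-sig strips the mark again.
    with open(path, newline='', encoding='utf-8-sig') as handle:
        rows = list(csv.reader(handle))

print(has_bom)  # True
print(rows)     # [['Title'], ['Boston'], ['Zürich']]
```

Any later step that reads `titles_needed.csv` should therefore also open it with `utf-8-sig`; with plain `utf-8`, the mark would end up as an invisible character in front of the `Title` header.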