## Setup

This notebook builds on the output of the `parse_resources.ipynb` notebook. That step pulls data from ArchivesSpace, builds a dataframe, and writes it to a CSV file. This notebook starts from that CSV file, but it could be adapted fairly easily to take the dataframe directly as input.
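
A minimal sketch of that adaptation, assuming a hypothetical helper named `load_resources` (not part of the original notebooks) that accepts either a dataframe or a CSV path. The sample row and its `umich-bhl-0000` ID are made up for illustration.

```python
import io
import pandas as pd

def load_resources(source):
    """Return a DataFrame from an existing DataFrame or from a CSV path/buffer.

    Hypothetical helper for illustration; the original notebook reads the CSV directly.
    """
    if isinstance(source, pd.DataFrame):
        # Copy so later changes do not alter the caller's dataframe
        return source.copy()
    return pd.read_csv(source, encoding='utf-8')

# In-memory stand-in for the CSV file written by the parsing step (made-up row)
csv_text = "ead_id,titleproper,bioghist\numich-bhl-0000,Example Papers,Some text.\n"

from_csv = load_resources(io.StringIO(csv_text))  # CSV input
from_df = load_resources(from_csv)                # dataframe input
```

One thing to watch for if the dataframe is passed directly: the matching code in this notebook skips empty cells by checking for floats, which is how pandas represents missing values after a CSV round trip. A dataframe that has not been through CSV may hold `None` or other types in those cells, so that check might need adjusting.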

In [1]:
import pandas as pd
import os
import re

# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)

_Note:_ The following functions and code are based on work by Ella Li, who created an initial version of this project that parsed EAD data from XML files. The process here is similar but keeps using the data pulled from the ArchivesSpace API, which exports JSON rather than XML.

## Provide Terms

In [2]:
# read in the txt file term list
term_list_file = os.path.join('term-lists', 'terms-LGBTQ.txt')
# term_list_file = 'terms-nativeAmerican.txt'
# term_list_file = 'terms-philippines.txt'

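# Skip blank lines: an empty term would build the pattern r'\b\b', which matches almost any text
with open(term_list_file, 'r') as f:
    terms = [line.strip() for line in f if line.strip()]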

print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')

Read term list from term-lists\terms-LGBTQ.txt and recorded 8 terms of interest.
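
Because each term is later wrapped in `\b...\b` and searched case-insensitively, stray blank lines or duplicate entries in a term file can distort the results. The sketch below illustrates this with a hypothetical `clean_terms` helper and made-up input lines; neither is part of the original notebook.

```python
import re

def build_pattern(term):
    # Same construction the matching code uses: escaped term between word boundaries
    return re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)

def clean_terms(lines):
    """Strip whitespace, drop blank lines, and drop case-insensitive duplicates, keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        term = line.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return cleaned

# Made-up lines, as they might come from iterating over a term-list file
raw_lines = ['queer\n', '\n', 'GLBT\n', 'Queer\n', '  transsexual  \n']
sample_terms = clean_terms(raw_lines)

# An empty term matches at the first word boundary of any text containing a word
empty_term_matches = build_pattern('').search('any text') is not None
```

Dropping case-insensitive duplicates is safe here only because the search itself ignores case; if matching were ever made case-sensitive, `queer` and `Queer` would need to stay separate.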


## Match Terms

In [None]:
def match_terms(row, terms, columns):
    results = []
    for term in terms:
        for col in columns:
            if not isinstance(row[col], float):
                # Split the column into paragraphs; str() covers non-string values
                # such as integers and leaves strings unchanged
                paragraphs = str(row[col]).split('\n')
                # loop through each paragraph
                for paragraph in paragraphs:
                    # check if the term is in the current paragraph
                    if re.search(r'\b' + re.escape(term) + r'\b', paragraph, re.IGNORECASE):
                        # Split paragraph into sentences
                        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
                        # Find the sentence containing the term
                        matched_sentence = next((sentence for sentence in sentences if re.search(r'\b' + re.escape(term) + r'\b', sentence, re.IGNORECASE)), paragraph)
                        results.append({
                            'Term': term,
                            'Occurrence (ead_ID)': row['ead_id'],
                            'Tag': col, 
                            'Collection': row.get('titleproper', None),
                            'Context': matched_sentence  # Returning only the matched sentence
                        })
                        
    return results

def match_and_visualize(df, name):
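    # Match every row against the global `terms` list, searching all columns.
    # Naming the columns keeps sort_values from failing when nothing matches.
    results_df = pd.DataFrame(
        [result for _, row in df.iterrows() for result in match_terms(row, terms, df.columns)],
        columns=['Term', 'Occurrence (ead_ID)', 'Tag', 'Collection', 'Context'],
    )

    # Sort results by 'Term'
    sorted_results_df = results_df.sort_values(by='Term', ascending=True)

    # Report which term list these results belong to
    print("Matched results for", name)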

    # Export to CSV without the index
    sorted_results_df.to_csv(os.path.join('data', 'matched_results-' + name + '.csv'), index=False)
    return sorted_results_df 
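
To see how the two regular expressions in `match_terms` behave, the following standalone sketch applies them to a made-up paragraph. The `find_context` helper is hypothetical and only mirrors the matching and sentence-selection steps; it is not a replacement for the functions above.

```python
import re

# Sentence splitter copied from match_terms: split on whitespace that follows
# '.' or '?', except after patterns like "e.g." or short titles like "Dr."
SENTENCE_SPLIT = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def find_context(term, paragraph):
    """Return the first sentence containing `term` as a whole word, or None if absent."""
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    if not pattern.search(paragraph):
        return None
    sentences = SENTENCE_SPLIT.split(paragraph)
    # Fall back to the whole paragraph if no single sentence contains the term
    return next((s for s in sentences if pattern.search(s)), paragraph)

# Made-up text for illustration
paragraph = ("Dr. Smith founded the group in 1990. "
             "The group supported GLBT students. It closed in 2001.")

context = find_context('glbt', paragraph)     # case-insensitive whole-word match
no_match = find_context('gay', 'Gaylord Papers')  # 'gay' inside 'Gaylord' is not a whole word
```

Two behaviors are worth keeping in mind when reading the results. Only the first matching sentence in each paragraph is kept, so a paragraph that mentions a term several times still produces one row. The splitter also does not break on `!`, so sentences ending that way stay attached to the next one.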


In [5]:

eads_df = pd.read_csv(os.path.join('data', 'data-allIDs.csv'), encoding='utf-8')
# eads_df = pd.read_csv('data-nativeAmerican.csv', encoding='utf-8')
# eads_df = pd.read_csv('data-philippines.csv', encoding='utf-8')

match_and_visualize(eads_df, 'lgbtq')
# match_and_visualize(eads_df, 'nativeAmerican')
# match_and_visualize(eads_df, 'philippines')

Matched results for lgbtq


| | Term | Occurrence (ead_ID) | Tag | Collection | Context |
|---|---|---|---|---|---|
| 256 | GLBT | umich-bhl-2014034 | subjects | Social Justice in Michigan Web Archive, 2010-2014 | African American men; African American youth; ... |
| 33 | GLBT | umich-bhl-04105 | bioghist | Beth Bashert Papers, 1988-2010 | In 1996, she participated in the planning for ... |
| 300 | GLBT | umich-bhl-0092 | bioghist | Triangle Foundation Records | Staff monitor and respond to media coverage of... |
| 299 | GLBT | umich-bhl-0092 | bioghist | Triangle Foundation Records | The Foundation takes every opportunity to edu... |
| 298 | GLBT | umich-bhl-0092 | bioghist | Triangle Foundation Records | The Triangle Foundation is a also founding me... |
| ... | ... | ... | ... | ... | ... |
| 367 | queer | umich-bhl-2024013 | bioghist | the Ross Chambers papers. | Aside from his work on queer studies, Chambers... |
| 368 | queer | umich-bhl-2024013 | subjects | the Ross Chambers papers. | AIDS death and dying.; AIDS (Disease); Languag... |
| 85 | transsexual | umich-bhl-0398 | bioghist | Charles L. Duty Papers, 1997-2000 | On February 12, 1997, Tri-Pride, a group of ga... |
| 86 | transsexual | umich-bhl-0398 | bioghist | Charles L. Duty Papers, 1997-2000 | Its members described themselves as, "a commun... |
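
For a quick overview of output like this, one option is to tally matches per term and per field. The sketch below uses a few `Term` and `Tag` pairs copied from the rows displayed above; it is a small sample of the visible output rather than the full result set, and the tallying step itself is an illustrative addition, not part of the original notebook.

```python
from collections import Counter

# A handful of (Term, Tag) pairs taken from the displayed rows; not the full results
sample_rows = [
    {'Term': 'GLBT', 'Tag': 'subjects'},
    {'Term': 'GLBT', 'Tag': 'bioghist'},
    {'Term': 'GLBT', 'Tag': 'bioghist'},
    {'Term': 'queer', 'Tag': 'bioghist'},
    {'Term': 'queer', 'Tag': 'subjects'},
    {'Term': 'transsexual', 'Tag': 'bioghist'},
]

# Count matches per term, and per (term, field) combination
term_counts = Counter(row['Term'] for row in sample_rows)
term_tag_counts = Counter((row['Term'], row['Tag']) for row in sample_rows)

for term, count in term_counts.most_common():
    print(f'{term}: {count}')
```

The same tally could be taken on the returned dataframe with `groupby(['Term', 'Tag']).size()`, which may be more convenient inside the notebook.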
