# KWIC Report

This notebook builds a keyword-in-context (KWIC) report from the results of the `parse_resources.ipynb` notebook.
That notebook pulls data from ArchivesSpace and writes the resulting
dataframe to a CSV file.
This notebook starts from that CSV file, but it could
easily be changed to take the dataframe directly
as an input.

## Credit

Developed by Ella Li, June 2024. Original code at https://github.com/jiaqili0803/ReConnect-ReCollect_Automation/tree/main/NEW%20-%20Term%20report

## Setup

If continuing or adapting this code, you have likely already imported these libraries.
We will use `pandas` for data processing and `re` for text matching. Because the data is gathered in the `parse_resources.ipynb` notebook, no additional parsing is needed.

In [1]:
import pandas as pd
import re



## Functions

Two functions handle the text processing. Both rely on data previously
parsed from EAD data in the `parse_resources.ipynb` notebook
and saved as `results.csv`.

`match_terms()` searches one row of the previously collected EAD data.
Unlike the similar function in the visualization notebook (`match_visualize.ipynb`),
this function records each matched term together with its textual context:
the term itself,
the file ID (corresponding to the repository's unique ID for the EAD),
the field in which the term was found,
the title of the finding aid (generally a direct match to the collection name),
and the "context", which is derived from the immediately surrounding text.
To find the context, the function splits each field into paragraphs at line breaks,
looks for a whole-word, case-insensitive match in each paragraph,
and returns the first sentence containing the term
(or the whole paragraph if no single sentence matches).

`match_and_visualize()` takes a dataframe of EAD data,
applies `match_terms()` to every row and column,
sorts the matches by term,
and outputs a CSV file with a unique name based on a repository identifier
provided by the user as an input to the function.
Despite its name, the function produces no visualization.
It also reads the term list from the global variable `terms`,
which must be defined before the function is called.

In [2]:
def match_terms(row, terms, columns):
    results = []
    for term in terms:
        for col in columns:
            # Skip missing values (pandas reads empty cells as float NaN)
            if not isinstance(row[col], float):
                # Split the column into paragraphs.
                # Note: the str() conversion was added to catch a datatype error.
                paragraphs = str(row[col]).split('\n')
                # Loop through each paragraph
                for paragraph in paragraphs:
                    # Check if the term is in the current paragraph
                    if re.search(r'\b' + re.escape(term) + r'\b', paragraph, re.IGNORECASE):
                        # Split paragraph into sentences
                        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
                        # Find the sentence containing the term
                        matched_sentence = next((sentence for sentence in sentences if re.search(r'\b' + re.escape(term) + r'\b', sentence, re.IGNORECASE)), paragraph)
                        results.append({
                            'Term': term,
                            'Occurrence (ead_ID)': row['ead_id'],
                            'Field': col, 
                            'Collection': row.get('titleproper', None),
                            'Context': matched_sentence  # Returning only the matched sentence
                        })
    return results
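
The matching step can be tried on a single paragraph outside the dataframe loop. The following is a minimal sketch that reuses the two regular expressions from `match_terms()`; the helper name `find_context` and the sample paragraph are made up for illustration and are not part of the original notebook.

```python
import re

def find_context(paragraph, term):
    """Return the first sentence of `paragraph` containing `term`, or None."""
    # Whole-word, case-insensitive pattern, as in match_terms()
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    if not pattern.search(paragraph):
        return None
    # Same sentence splitter as match_terms(): break after '.' or '?',
    # but not after abbreviations such as "e.g." or titles such as "Mr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
    # Fall back to the whole paragraph if no single sentence matches
    return next((s for s in sentences if pattern.search(s)), paragraph)

# Made-up sample text
paragraph = ("Mr. Gates taught botany in Manila. "
             "His letters describe colonial administration. "
             "Photographs are included.")

print(find_context(paragraph, 'Colonial'))  # His letters describe colonial administration.
print(find_context(paragraph, 'letter'))    # None ("letters" is not a whole-word match)
```

Because of the `\b` word boundaries, a term matches only as a whole word, so singular and plural forms need separate entries in the term list.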

In [3]:
def match_and_visualize(df, name):
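    # Match results for every row and column.
    # Note: `terms` is the global list loaded in the "Provide Terms" section.
    results_df = pd.DataFrame([result for index, row in df.iterrows() for result in match_terms(row, terms, df.columns)])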
    
    # Sort results by 'Term'
    sorted_results_df = results_df.sort_values(by='Term', ascending=True)
    
    # Show matched results
    print(f'Matched results for {name}')

    # Export to CSV without the index
    sorted_results_df.to_csv('matched_results_incontext_' + name + '.csv', index=False)
    return sorted_results_df 
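
As written, `match_and_visualize()` searches every column in the dataframe, identifier columns included, and an empty match list would leave no `Term` column to sort on. The sketch below shows one possible variant that takes the terms and the searchable columns as arguments and tolerates an empty result. The function name `kwic_report`, the reduced set of output columns, and the sample rows are assumptions made for illustration, not part of the original code.

```python
import re
import pandas as pd

REPORT_COLUMNS = ['Term', 'Occurrence (ead_ID)', 'Field']

def kwic_report(df, terms, columns):
    """Hypothetical variant: terms and searchable columns are passed in explicitly."""
    hits = []
    for _, row in df.iterrows():
        for col in columns:
            if pd.isna(row[col]):
                continue  # skip empty cells
            for term in terms:
                if re.search(r'\b' + re.escape(term) + r'\b', str(row[col]), re.IGNORECASE):
                    hits.append({'Term': term,
                                 'Occurrence (ead_ID)': row['ead_id'],
                                 'Field': col})
    # Passing columns= keeps the 'Term' column even when hits is empty
    return pd.DataFrame(hits, columns=REPORT_COLUMNS).sort_values(by='Term')

# Made-up sample rows, not taken from results.csv
sample = pd.DataFrame({
    'ead_id': ['demo-001', 'demo-002'],
    'abstract': ['Records of a colonial office.', None],
    'scopecontent': ['Letters and diaries.', 'Colonial-era photographs.'],
})

report = kwic_report(sample, ['colonial'], ['abstract', 'scopecontent'])
print(report)
```

Passing the term list explicitly removes the dependency on a global variable, at the cost of a different call signature from the one used later in this notebook.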

## Provide Terms

Terms of interest are supplied in a plain text file, with
each term of interest on its own line. 

In [4]:
# read in the txt file term list
term_list_file = 'terms_all.txt'

with open(term_list_file, 'r') as f:
    terms = [line.strip() for line in f]

In [5]:
print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')

Read term list from terms_all.txt and recorded 104 terms of interest.
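
The loading code above keeps every line of the file, including blank ones. An empty term would produce the pattern `\b\b`, which matches at any word boundary, so it may be worth filtering the list. The sketch below is one way to do that; the helper name `load_terms` and the sample input are hypothetical, and whether `terms_all.txt` actually contains blank or repeated lines is not known from this notebook.

```python
import io

def load_terms(lines):
    """Return stripped, non-empty, de-duplicated terms in their original order."""
    seen = set()
    terms = []
    for line in lines:
        term = line.strip()
        key = term.lower()  # matching is case-insensitive, so compare that way too
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms

# Made-up file contents: a blank line, a case-variant duplicate, trailing whitespace
sample_file = io.StringIO("Colonial\n\ncolonial\nTypes \n")
print(load_terms(sample_file))  # ['Colonial', 'Types']
```

The function accepts any iterable of lines, so an open file handle works the same way as the `StringIO` object used here. Filtering may change the number of terms reported.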


## Provide EAD Data

Create a dataframe using the `results.csv` from the `parse_resources.ipynb` notebook.

In [6]:
eads_df = pd.read_csv('results.csv', encoding='utf-8')

In [7]:
eads_df = eads_df.rename(columns={'eadid':'ead_id'})

In [8]:
eads_df.head()

Unnamed: 0,resource_id,ead_id,titleproper,abstract,language,scopecontent,bioghist,subjects,subjects_source,genreforms,genreforms_source,geognames,geognames_source,persnames,persnames_source,corpnames,corpnames_source,famnames,famnames_source
0,3011,umich-bhl-00138,the Ralph M. Hodnett papers,,The finding aid is written in English,This collection consists of reminiscences (wri...,Ralph M. Hodnett was an officer in the U.S. Ar...,"Soldiers; World War, 1914-1918; Soldiers",lcsh; lcsh; lctgm,Diaries.; Photographs.,lcsh; gmgpc,Philippines,lcsh,"Hodnett, Ralph M.; Hodnett, Ralph M.",lcnaf; lcnaf,United States. Army.,lcnaf,Oram family.,lcnaf
1,267,umich-bhl-0052,Bentley Historical Library publications. 1935-...,The Bentley Historical Library (BHL) houses th...,The finding aid is written in English,The PUBLICATIONS (3.7 linear feet) are divided...,The origins of the Bentley Historical Library ...,,,Annual reports.; Newsletters.; Bibliographies....,aat; aat; aat; aat; aat; aat; aat; aat; aat,,,,,Bentley Historical Library.; Michigan Historic...,lcnaf; lcnaf; lcnaf,,
2,996,umich-bhl-0142,the Frank C. Gates papers,Frank C. Gates was a professor of botany at th...,The finding aid is written in <language encodi...,The Frank C. Gates papers are dated from 1871-...,"Frank Caleb Gates was born on September 12, 18...",Bird watching.; Botany; Forests and forestry; ...,lcsh; lcsh; lcsh; aat,Lantern slides.; Photographs.; Postcards.,aat; aat; aat,Philippines; Plants,lcsh; lctgm,"Gates, Frank C. (Frank Caleb), 1887-1955; Gate...",lcnaf; lcnaf; lcnaf,University of the Philippines.,lcnaf,,
3,2722,umich-bhl-03171,"Mike Wallace CBS 60 Minutes Papers, 1922-2007","Papers of Mike Wallace (1918-2012), broadcast ...",The finding aid is written in English,"The Mike Wallace CBS/ <title render=""italic"">6...",Mike Wallace was born Myron Leon Wallace on Ma...,60 minutes (Television program); Television br...,lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh...,Photographs.; Sound recordings.; Videotapes.,aat; aat; aat,,,"Wallace, Mike, 1918-2012; Wallace, Mike, 1918-...",lcnaf; lcnaf; lcnaf; lcnaf,CBS News.; CBS News.,lcnaf; lcnaf,,
4,1051,umich-bhl-0336,"Grant Kohn Goodman papers, 1943-1995",Grant K. Goodman was a student at the Universi...,The finding aid is written in English,The Grant K. Goodman collection documents the ...,Grant Kohn Goodman was born in 1924 in Clevela...,"World War, 1939-1945; World War, 1939-1945; Wo...",lcsh; lcsh; lcsh,Digital file formats.; Photographs.; Sound rec...,aat; aat; aat; aat,Tokyo.,lcsh,"Goodman, Grant Kohn, 1924-2014; Goodman, Grant...",lcnaf; lcnaf,United States. Army. Japanese Language School ...,lcnaf; lcnaf; lcnaf,,


## Run the KWIC Report

Use the functions to create the CSV with terms in context.
To do this, run the `match_and_visualize()` function with
the EAD dataframe and the name of the group that is being analyzed.
The function writes `matched_results_incontext_<name>.csv`
and returns the sorted dataframe.

To customize the report, provide a dataframe of your own
and change the `name` argument. In the example below,
we use `Bentley` because the EAD data was retrieved from
the Bentley Historical Library at the University of Michigan.

In [9]:
match_and_visualize(eads_df, 'Bentley')

Matched results for Bentley


Unnamed: 0,Term,Occurrence (ead_ID),Field,Collection,Context
140,Colonial,umich-bhl-8772,bioghist,"Luce Philippine Project interviews, 1975-1980",In 1977 the University of Michigan Center for ...
63,Colonial,umich-bhl-851733,bioghist,Harry Burns Hutchins papers,Mary Hutchins was a member of many organizatio...
145,Colonial,umich-bhl-8868,scopecontent,"Blanchard Family Papers, ca. 1835-ca. 2000",The Blanchard Family Papers will be of value t...
66,Colonial,umich-bhl-851764,abstract,"George A. Malcolm papers, 1896-1965","Correspondence, scrapbooks, printed reports, a..."
90,Colonial,umich-bhl-85419,scopecontent,"Owen A. Tomlinson papers, 1899-1920",Within the Photograph series will be found six...
...,...,...,...,...,...
43,Types,umich-bhl-2014136,bioghist,University Herbarium (University of Michigan) ...,The U-M Herbarium is also a leader in digitizi...
73,Types,umich-bhl-85193,scopecontent,Philip A. Hart Papers,Hart himself and his staff had discarded certa...
50,Types,umich-bhl-851285,scopecontent,Thomas Francis Papers,Types of records in these unprocessed subserie...
180,Types,umich-bhl-9840,scopecontent,"Charles W. Lane papers, 1935-1997",The researcher will be interested in the varie...
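
The dataframe returned by `match_and_visualize()` can be summarized further. As a minimal sketch, assuming the `Term` column name produced by `match_terms()` and using made-up rows rather than the real Bentley results, matches per term could be counted like this:

```python
import pandas as pd

# Made-up rows that mimic the shape of the report; not real results
sample_report = pd.DataFrame({
    'Term': ['Colonial', 'Colonial', 'Types'],
    'Occurrence (ead_ID)': ['demo-001', 'demo-002', 'demo-003'],
    'Field': ['bioghist', 'scopecontent', 'scopecontent'],
})

# Number of matched contexts per term, most frequent first
term_counts = sample_report['Term'].value_counts()
print(term_counts.to_dict())  # {'Colonial': 2, 'Types': 1}
```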
