# NLP with Splitting

## Docs (in progress)

This notebook splits one large input file into sections at each match of a separator regex (e.g., splitting obituary OCR output by source image), then runs every specified regular expression on each section. For each regex you can collect all matches in a section (findall), take only the first match (search), or take the first match and write each named capturing group to its own column (group-separated).
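
As a minimal sketch of the difference between the search and findall techniques, the snippet below runs one pattern both ways. The sample text and `year_re` are hypothetical and used only for illustration; they are not part of the notebook's regex lists.

```python
import re

# Hypothetical sample text and pattern, for illustration only
sample = "Born May 3, 1920. Died Jun 7, 1998."
year_re = re.compile(r"\d{4}")

# search: only the first match in the section
first = year_re.search(sample)
first_year = first.group() if first else ""

# findall-style: every match, joined into one semicolon-separated cell
all_years = "; ".join(m.group() for m in year_re.finditer(sample))

print(first_year)
print(all_years)
```

Search fits fields that appear once per entry, while findall fits fields that may repeat.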

# Imports

In [1]:
import os
import re
import csv
import sys
# import unicodedata
# from unidecode import unidecode

# Functions/Other Preliminary Code

## Read in the lines in the right format

In [2]:
def pre_process(fname):
    with open(fname, encoding='utf-8') as fin:
        lines = fin.read()
    return lines

## Split text by the image it came from

Todo: define the regex you want to use to split the file into different entries, and turn on/off the option to delete the last string in the list (depending on the format of the input, may always be empty)

In [3]:
def split_input(text):
    """Splits text into different strings, using a regex as a separator"""
    if not isinstance(text, str):
        raise TypeError(f'expected str, got {type(text).__name__}')
    
    # CHANGE THE REGEX STRING IN THE LINE BELOW IF YOU WANT TO SEPARATE THE TEXT DIFFERENTLY
    split_re = r'--- PAGE END ---'
    split_strings = re.split(split_re, text)
    
    # IF THE TEXT ENDS WITH A MATCH FOR SPLIT_RE, THE LAST STRING IN THE LIST WILL BE EMPTY; LEAVE THIS AS TRUE TO DELETE IT
    last_is_empty = True
    if last_is_empty and split_strings and not split_strings[-1].strip():
        del split_strings[-1] # removes the empty string at the end (only if it really is blank)
    return split_strings
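
To see why the last string can be empty, here is a small sketch with a hypothetical two-page input in which every page ends with the separator. `re.split` returns one extra, empty piece after the final separator.

```python
import re

# Hypothetical input: two pages, each followed by the separator
text = "page one\n--- PAGE END ---page two\n--- PAGE END ---"

parts = re.split(r"--- PAGE END ---", text)
raw_count = len(parts)  # includes the empty piece after the last separator

# Drop the last piece only when it is actually blank
if parts and not parts[-1].strip():
    parts = parts[:-1]

print(raw_count, parts)
```

If the input does not end with the separator, the last piece holds real text and should be kept.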


## Give file information

Todo: name the output file and give paths for the input and output folders

In [4]:
# GIVE A NAME FOR THE OUTPUT FILE HERE (not a path, just something recognizable)
output_name = "adjusted_NLP"

# Name of the csv file to write to
target = f"{output_name}.csv"

# PUT INPUT FOLDER HERE
INPUT_PATH = r'V:\FHSS-JoePriceResearch\papers\current\tree_growth\US\Skagit\skagit_obits\2_NLP\Adjusted_Input'

# PUT OUTPUT FOLDER HERE
OUTPUT_PATH = r'V:\FHSS-JoePriceResearch\papers\current\tree_growth\US\Skagit\skagit_obits\2_NLP\Adjusted_Output'

os.chdir(OUTPUT_PATH)

### (Picks a new file name if the target already exists, so earlier output is neither overwritten nor appended to)

In [5]:
# Update target file name so that we aren't appending to an existing file
    
if os.path.exists(target):
    i = 1
    name, ext = os.path.splitext(target)  # ext keeps its leading dot; works even if the name contains other dots
    while os.path.exists(f'{name}_{i}{ext}'):
        i += 1
    target = f'{name}_{i}{ext}'

# (took out code for the check file; don't know how to create one with this method)
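
The renaming idea can also be sketched as a reusable function. `next_free_name` is a hypothetical helper, not part of the notebook, and the temporary directory only stands in for the output folder.

```python
import os
import tempfile

def next_free_name(path):
    """Return path if unused, else the first of path_1, path_2, ... (before the extension) that is unused."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)  # ext includes the leading dot
    i = 1
    while os.path.exists(f"{root}_{i}{ext}"):
        i += 1
    return f"{root}_{i}{ext}"

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "adjusted_NLP.csv")
    open(first, "w").close()  # simulate output from an earlier run
    result = os.path.basename(next_free_name(first))

print(result)
```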

# Define Regex
Go to https://regex101.com/ to test your regex (select the Python flavor, since these patterns use `(?P<name>...)` named groups).

## Search regex
Will return only the first match in each entry

Todo: Define regular expressions for which you want to use the search technique, and add them to the list

In [6]:
# Define regex
#image_re = re.compile(r'(?<=-image:)(?P<imageName>[^\n]+)')  # not as efficient, but will only capture image name

# Add to list (so the program runs all of them on each entry), in the order you want the columns to be
search_regex_list = []

## Findall regex
Will return a semicolon-separated list of all matches in the entry

Todo: Define regular expressions for which you want to use the findall technique, and add them to the list

In [7]:
# Define regex
date_re = re.compile(r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)\.? (?P<day>\d{1,2})(?:, (?P<year>\d{4}))?')
name_re = re.compile(r'(?P<firstNames>(?:[\'\"]?[A-Z][\'\"A-Za-z]+ )+(?:[A-Z]\. )*)(?P<lastName>[A-Z][\'A-Za-z]+)')

# Add to list
findall_regex_list = [date_re, name_re]


## Group-separated regex
Uses search, but will output each named capturing group in its own column

Todo: Define regular expressions for which you want to have each named capturing group separated, and add them to the list

In [8]:
# Define regex
image_re = re.compile(r'-image:(?P<imageName>[^\n]+)')
death_date_re = re.compile(r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)\.? (?P<day>\d{1,2})(?:, (?P<year>\d{4}))')

# Add to list
separated_regex_list = [image_re, name_re, death_date_re]
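
A minimal sketch of how one group-separated regex turns into several columns. `person_re` is a simplified, hypothetical pattern (not the notebook's `name_re`), and `groups_as_cells` is a hypothetical helper.

```python
import re

# Simplified, hypothetical pattern with two named groups
person_re = re.compile(r"(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)")

def groups_as_cells(regex, text):
    """Return one cell per named group; empty cells when there is no match."""
    match = regex.search(text)
    if match:
        # default="" turns groups that did not participate into empty cells
        return list(match.groupdict(default="").values())
    return [""] * len(regex.groupindex)

hit = groups_as_cells(person_re, "In memory of Mary Olson, 1901-1988")
miss = groups_as_cells(person_re, "no names here")

print(hit, miss)
```

Returning the same number of cells on a miss keeps later columns aligned in the CSV.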


## Define column names (will be printed at top of file)
The regex matches are written in the following order:

1. Group-separated regex, from the beginning of the list to the end (one column per named capturing group)
2. Search regex, from the beginning of the list to the end
3. Findall regex, from the beginning of the list to the end

If you want column names printed at the top, list the titles here in that same order

In [9]:
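# One title per output column: image_re (1 group), name_re (2 groups), death_date_re (3 groups), then the two findall regex
column_names = ['Image', 'First_Names', 'Last_Names', 'Death_Month', 'Death_Day', 'Death_Year', 'Dates', 'Names_or_Places']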
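
Because each group-separated regex contributes one column per named group, the header can also be derived from the patterns instead of typed by hand. This is only a sketch under assumptions: the two patterns and the `findall_labels` list below are hypothetical stand-ins, and the resulting titles are the raw group names.

```python
import re

# Hypothetical stand-ins for the group-separated regex list
image_re = re.compile(r"-image:(?P<imageName>[^\n]+)")
year_re = re.compile(r"(?P<year>\d{4})")
separated = [image_re, year_re]

# One hand-written label per findall regex (assumed here to be a single one)
findall_labels = ["Dates"]

# groupindex maps each group name to its number, in pattern order
header = [name for regex in separated for name in regex.groupindex]
header += findall_labels

print(header)
```

Deriving the header this way keeps the titles in step with the regex lists when groups are added or removed.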

# Run Regex

## Run search regex

In [10]:
def search_regex(text):
    
    # list that holds the matches of all the search regex (ex. information like name)
    matches = []
    
    # iterate through all the search regex, and add its matches to the overall list
    for regex in search_regex_list:
        re_match = re.search(regex, text) # finds the first match
        if re_match:
            matches.append(re_match.group().replace('\n', ' ')) # if match exists, add the string of the match to the list without newlines
        else:
            matches.append("") # makes an empty cell instead of shifting the row over
            
    return matches    
    

## Run findall regex

In [11]:
def findall_regex(text):
    
    # list that holds the matches of all the findall regex (ex. all dates)
    findall_matches = []
        
    # iterate through all the findall regex, adding any matches to the list
    for regex in findall_regex_list:
        # finditer yields every non-overlapping match; join them with "; " (empty string if there are none)
        combined_string = "; ".join(match.group() for match in regex.finditer(text))
        findall_matches.append(combined_string.replace('\n', ' '))
            
            
    return findall_matches

## Run group-separated regex

In [12]:
def separated_regex(text):
    
    # list that holds the matches of all the group-separated regex (ex. information like names, split into given/surname).
    # Each entry is a list of named subgroups
    matches = []
    
    # iterate through all the group-separated regex, and add their matches to the overall list
    for regex in separated_regex_list:
        re_match = re.search(regex, text) # finds the first match
        if re_match:
            # dictionary of all named capturing groups; groups that did not participate become '' instead of None
            re_dict = re_match.groupdict(default='')
            # get the values out of the key pairs and take out newlines
            re_list = [value.replace('\n', ' ') for value in re_dict.values()]
            matches.append(re_list)
        # if it doesn't match, add the right number of empty cells to the row
        else:
            matches.append([''] * len(regex.groupindex)) # makes empty cells instead of shifting the row over
            
    return matches   

# Main Code Body

In [13]:
# Find matches within the split text strings, and write those matches to the target file (.csv)

os.chdir(OUTPUT_PATH)

with open(target, 'a', newline='', encoding='utf-8-sig') as fout:
    writer = csv.writer(fout)
    # PUT SECTION HEADERS HERE IF YOU WANT THEM
    writer.writerow(column_names)
    
    os.chdir(INPUT_PATH)
    
    for txt in [i for i in os.listdir() if i.endswith('.txt')]:
        os.chdir(INPUT_PATH)
    
        # Use the split_input() function to split the text document, so the OCR output for each source image is separate
        print(f'finding obituaries in {txt}...')
        split_strings = split_input(pre_process(txt))
        num = len(split_strings)
        print(f'found {num} obituaries. Writing to file {target}...')
    
        os.chdir(OUTPUT_PATH)
        
        # Run each regex on the OCR text from each image
        for string in split_strings:
            
            all_matches = [] # list of strings that will become each entry in a row
            
            # Run group-separated regex, and add each group to the list
            separated_matches = separated_regex(string)
            if separated_matches is not None:
                for match in separated_matches:
                    all_matches.extend(match)
            
            # Run search regex, add the list of matches it returns to the combined match list
            search_matches = search_regex(string)
            if search_matches is not None:
                #search_matches = search_matches if isinstance(search_matches, list) else [search_matches, "test"]
                all_matches.extend(search_matches)
            
            # Run findall regex, add its matches to the list
            findall_matches = findall_regex(string)
            if findall_matches is not None:
                all_matches.extend(findall_matches)
        
            # Write out the row
            writer.writerow(all_matches)

finding obituaries in 0-transcribed.txt...
found 50 obituaries. Writing to file adjusted_NLP.csv...
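
The row assembly in the main loop can be tried without touching any files. In this sketch an in-memory buffer stands in for the output file, and the match lists are hypothetical values of the shape the three helper functions return.

```python
import csv
import io

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer)
writer.writerow(["Image", "Dates"])

# Hypothetical results for one entry
separated_matches = [["img_001.jpg"]]     # one inner list per group-separated regex
search_matches = []                       # no search regex defined
findall_matches = ["May 3; Jun 7, 1998"]  # one joined string per findall regex

row = []
for groups in separated_matches:
    row.extend(groups)
row.extend(search_matches)
row.extend(findall_matches)
writer.writerow(row)

output = buffer.getvalue()
print(output)
```

The csv writer quotes the dates cell because it contains a comma, so the semicolon-separated list stays in one column.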
