# Multi-Line NLP

## Intro / When to Use

This notebook takes an alternate approach to NLP. It's useful when the page layout has separate columns of information (e.g., student name, home city, address) that are hard to separate with a regular expression once they're combined on one line.

Use an OCR option that outputs each column entry on its own line. Unlike in the regular notebook, you want the OCR text to look like this:

    John Doe
    New York City
    21 Main Street

OCR option 12 has worked well in the past, but another option might work better for a specific source.

Once you have OCR in that format, you can write a regex that uses newlines to keep the information separate. On one line, it would be difficult to separate the name and home city ("John Doe New York City" -- when does the name stop and the city start?), but the newline gives a definitive break between the two pieces of information.
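
As a minimal sketch of that idea, the snippet below pulls the three fields out of the sample entry above. The pattern and the group names (`name`, `city`, `address`) are illustrative assumptions, not the expression this notebook uses.

```python
import re

# Hypothetical OCR output: one column entry per line, as in the sample above.
sample = "John Doe\nNew York City\n21 Main Street\n"

# Illustrative pattern: each field is "everything up to the next newline",
# so the line break itself marks where one field ends and the next begins.
entry_re = re.compile(r"(?P<name>[^\n]+)\n(?P<city>[^\n]+)\n(?P<address>[^\n]+)")

match = entry_re.search(sample)
fields = match.groupdict()
print(fields)
```

A real source will usually need stricter character classes per field, but the newline-as-separator structure stays the same.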

A newline can be matched in the expression with '\n'. Another useful trick is putting '^' at the beginning of your expression, which matches at the start of a line; '$' matches at the end of a line. (If you're trying this outside this program, make sure to use the re.MULTILINE flag; otherwise they match only at the beginning and end of the entire string.)
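
A short illustration of the difference the flag makes, using the sample entry from above (the pattern here is only an example):

```python
import re

text = "John Doe\nNew York City\n21 Main Street"

# Without re.MULTILINE, '^' and '$' anchor to the whole string, so a
# "one full line" pattern finds nothing in a string that contains newlines.
whole_string = re.findall(r"^[A-Za-z0-9 ]+$", text)

# With re.MULTILINE, '^' and '$' anchor to the start and end of each line.
per_line = re.findall(r"^[A-Za-z0-9 ]+$", text, flags=re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['John Doe', 'New York City', '21 Main Street']
```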

regex101.com is a useful website for trying out regular expressions (personally, I like it more than Pythex).
 

## How to Use This Notebook

After you've written a multi-line regular expression, there are a few more things you have to change for this notebook to work (unfortunately...if anybody can improve it, go for it).

All the places to change are marked with comments in all caps. Feel free to make a copy of the notebook and adapt the code that's there to suit the source. Here's an overview, and it will probably make more sense once you see the code:

In [4]:
    1. Paste in your regular expression
    2. Change the name to your school (the name of the school's folder)
In [6]:
    3. Replace the column names in the code with the columns you want in the spreadsheet
    4. Make a variable for each column, and set it either to a named capturing group in your regex or to a hardcoded value (if it's the same for every row, like school or Chetty tier)
    5. Replace my variables with your variables, in the same order you put the columns in step 3
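
Steps 3 to 5 can be sketched in miniature as follows. The pattern, the sample text, and the school name are placeholders, and an in-memory buffer stands in for the CSV file so the snippet runs on its own.

```python
import csv
import io
import re

# Placeholder pattern and text; the group names are assumptions for illustration.
people_re = re.compile(
    r"^(?P<studentName>[A-Za-z .-]+)\n(?P<city>[A-Za-z ]+)$",
    flags=re.MULTILINE,
)
text = "John Doe\nNew York City\n"

buffer = io.StringIO()  # stands in for the output .csv file
writer = csv.writer(buffer, lineterminator="\n")

# Step 3: column names, in the order they should appear in the spreadsheet
writer.writerow(["Name", "City", "School"])

for match in people_re.finditer(text):
    # Step 4: one variable per column
    name = match.group("studentName")  # from a named capturing group
    city = match.group("city")         # from a named capturing group
    school = "Example University"      # hardcoded, same for every row
    # Step 5: variables in the same order as the column names
    writer.writerow([name, city, school])

rows = buffer.getvalue().splitlines()
print(rows)
```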
   

# Imports

In [1]:
import os
import re
import csv
import sys
# import unicodedata
# from unidecode import unidecode

### Reads the whole file as one string instead of splitting it into lines

In [2]:
def pre_process(fname):
    with open(fname, encoding='utf-8') as fin:
        lines = fin.read()
    return lines

# Function that makes lists of people

In [3]:
def collect(text):
    """Finds all regex matches in text<str>. Returns the match objects and the text between them."""
    if not isinstance(text, str):
        raise TypeError('text must be a str')

    # Lists for the match objects and for the unmatched text between them
    raw_matches = []
    non_matches = []
    
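    # Find all matches in the text and put them in a list.
    # people_re is compiled with re.MULTILINE, so no flag is passed here: the second
    # positional argument of a compiled pattern's finditer() is a start position, not flags.
    for match in people_re.finditer(text):
        raw_matches.append(match)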
    
    # Add everything that isn't in a match to non_matches (make substrings from the gaps in-between matches)
    prev_end_index = 0
    length = len(raw_matches)
    for i in range(length):
        currentMatch = raw_matches[i]
        
        # Get start and end values
        matchSpan = currentMatch.span()
        start = matchSpan[0]
        end = matchSpan[1]
        
        non_matches.append(text[prev_end_index:start])
        prev_end_index = end
    # Add the last bit (from the last match to the end of the string)
    non_matches.append(text[prev_end_index:])
    
    
    return raw_matches, non_matches
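
The gap-collecting logic can be tried in isolation with a toy pattern. This is a sketch under assumptions: `split_matches_and_gaps` and `word_re` are hypothetical names, not part of the notebook. It also shows why flags should not be passed to a compiled pattern's `finditer()`: the second positional argument is a start position.

```python
import re

def split_matches_and_gaps(pattern, text):
    """Return (matches, gaps), where gaps are the substrings between matches."""
    matches = list(pattern.finditer(text))
    gaps = []
    prev_end = 0
    for m in matches:
        gaps.append(text[prev_end:m.start()])  # text before this match
        prev_end = m.end()
    gaps.append(text[prev_end:])  # tail after the last match, to end of string
    return matches, gaps

word_re = re.compile(r"[a-z]+")
matches, gaps = split_matches_and_gaps(word_re, "12ab34cd56")
found = [m.group() for m in matches]

# Pitfall: on a compiled pattern, finditer(text, 5) starts searching at index 5,
# so anything before that position is silently skipped.
skipped = [m.group() for m in word_re.finditer("12ab34cd56", 5)]

print(found, gaps, skipped)
```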


# Define the regex and which school to search
Go to https://regex101.com/ to test your regex.

In [4]:
# PUT REGULAR EXPRESSION HERE, (1/5)
people_re = re.compile(r'(^(?P<studentName>[a-zA-Z -]+(, Jr\.)?)\n\n(?P<city>[a-zA-Z./ ]+)(, (?P<state>[a-zA-Z./ ]+))?)', flags=re.MULTILINE)

# PUT NAME OF SCHOOL HERE (aka the name of the folder), (2/5)
source = "YaleNew"

# Name of the csv file to write to
target = f"NLP_Output_{source}.csv"

# This is where the non-matched lines will be written
chk_file = 'check.csv'

os.chdir(r'..\output\University CSVs\{}'.format(source))

# Run this to make sure that rerunning the code creates a new file instead of appending

In [5]:
# Update target file name so that we aren't appending to an existing file
    
if os.path.exists(target):
    i = 1
    name, ext = target.rsplit('.', 1)  # split on the last dot only, in case the school name contains one
    while os.path.exists(f'{name}_{i}.{ext}'):
        i += 1
    target = f'{name}_{i}.{ext}'

if os.path.exists(chk_file):
    i = 1
    name, ext = chk_file.rsplit('.', 1)  # split on the last dot only
    while os.path.exists(f'{name}_{i}.{ext}'):
        i += 1
    chk_file = f'{name}_{i}.{ext}'
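
Since the same renaming logic is applied to both files, it could be factored into a helper. The sketch below is one possible version; `next_free_name` is a hypothetical name, and a temporary directory is used so the example does not touch real output files.

```python
import os
import tempfile

def next_free_name(path):
    """Return path if unused, else the first 'stem_N.ext' that does not exist."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)  # ext keeps its leading dot, e.g. '.csv'
    i = 1
    while os.path.exists(f"{stem}_{i}{ext}"):
        i += 1
    return f"{stem}_{i}{ext}"

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "NLP_Output_Example.csv")
    first = os.path.basename(next_free_name(target))   # nothing exists yet
    open(target, "w").close()
    second = os.path.basename(next_free_name(target))  # base name is taken
    open(os.path.join(tmp, "NLP_Output_Example_1.csv"), "w").close()
    third = os.path.basename(next_free_name(target))   # _1 is taken too

print(first, second, third)
```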

# Main function -- create rows and variables for each row, then output
If you interrupt this cell or the previous one and the next run gives an OS error, restart the kernel and then try again.

In [6]:
os.chdir(r'..\..\University Text Files\{}'.format(source))


for txt in [f for f in os.listdir() if f.endswith('.txt')]:
    # This is where the work happens. Uses the collect() function.
    print(f'finding names in {txt}...')
    
    result, check = collect(pre_process(txt))
    num = len(result)
    print(f'found {num} names. Writing to file {target}...')
    os.chdir(r'..\..\University CSVs\{}'.format(source))

    # Write matches to the target file (.csv)
    
    with open(target, 'a', newline='', encoding='utf-8-sig') as fout:
        writer = csv.writer(fout)
        # CHANGE ROWS HERE TO MATCH YOUR EXPRESSION AND WHATEVER STUFF YOU WANT TO ADD, (3/5)
        writer.writerow(['Name', 'City', 'State', 'Standing', 'Year', 'School', 'School_State', 'Chetty_Tier'])
        

        # Output each match in a row
        for match in result:
            # USE THIS SYNTAX TO MAKE VARIABLES FOR YOUR NAMED GROUPS/SET THE OUTPUT FOR OTHER COLUMNS, (4/5)
            # variable_name = match.group('namedGroupInExpression')
            name = match.group('studentName')
            city = match.group('city')
            state = match.group('state')
            # hardcode stuff in if you don't want to add the columns later
            standing = 'Senior'
            year = '1923'
            school = 'Yale University'
            school_state = 'Connecticut'
            chetty_tier = '1'
            
            # PUT VARIABLE NAMES IN ORDER HERE TO OUTPUT THEM IN A ROW (5/5)
            writer.writerow([name, city, state, standing, year, school, school_state, chetty_tier])
            

        # Total names found that year
        writer.writerow(['Names found:', num])

    # Write non-matches to the check file
    with open(chk_file, 'a', newline='', encoding='utf-8-sig') as fout:
        writer = csv.writer(fout)
        for i in check:
            try:
                for line in i.split('\n'):
                    writer.writerow([line])
            except UnicodeEncodeError as e:
                writer.writerow([e])

    os.chdir(r'..\..\University Text Files\{}'.format(source))

finding names in 1_Yale_1923_Seniors_1923.txt...
found 288 names. Writing to file NLP_Output_YaleNew_29.csv...
finding names in 2_Yale_1923_Juniors_1924.txt...
found 290 names. Writing to file NLP_Output_YaleNew_29.csv...
finding names in 3_Yale_1923_Sophomores_1925.txt...
found 450 names. Writing to file NLP_Output_YaleNew_29.csv...
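
The cells above move between folders with os.chdir, which is why an interrupted run can leave the kernel in an unexpected directory. One possible alternative, shown here only as a sketch, is to build full paths and never change directory. The helper name `list_text_files` and the folder names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def list_text_files(text_root, source):
    """Return the sorted .txt files in text_root/source without changing directory."""
    folder = Path(text_root) / source
    return sorted(p for p in folder.iterdir() if p.suffix == ".txt")

# Build a throwaway folder tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    text_root = Path(tmp) / "University Text Files"
    folder = text_root / "ExampleSchool"
    folder.mkdir(parents=True)
    (folder / "1_Example_1923.txt").write_text("John Doe\n", encoding="utf-8")
    (folder / "notes.md").write_text("ignore me", encoding="utf-8")

    names = [p.name for p in list_text_files(text_root, "ExampleSchool")]

print(names)  # ['1_Example_1923.txt']
```

Each returned path can be passed straight to open(), so the working directory stays the same however the run ends.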
