Martin T. WRDS Take Home Assignment \\
Input: TXT files & questions \\
Output: Answers to the questions (no specific format required) \\

Questions:
1. Who is the borrower
2. Who are the lenders, and their individual loan amount
3. Information on the signature page (company name, party type, signing person, title of signing person) \\

If any of the above details are missing from the loan contract, please output "N/A".

\\
Links for cleaned versions of loan contracts (for reference only):
https://www.sec.gov/Archives/edgar/data/89800/0001193125-16-542735-index.html
https://www.sec.gov/Archives/edgar/data/1618673/0001193125-19-326635-index.html
https://www.sec.gov/Archives/edgar/data/1262976/0001193125-14-017279-index.html

Links for unstructured .txt versions of loan contracts: \\
https://www.sec.gov/Archives/edgar/data/89800/000119312516542735/0001193125-16-542735.txt
https://www.sec.gov/Archives/edgar/data/1618673/000119312519326635/0001193125-19-326635.txt
https://www.sec.gov/Archives/edgar/data/1262976/000119312514017279/0001193125-14-017279.txt



In [1]:
# import libraries
import openai #analyze text with openai API
import re # regex for parsing text

# upload the text files manually into Colab
from google.colab import files
uploaded = files.upload()

# load API key (COPY AND PASTE API key below)
openai.api_key = "paste_your_openai_api_key_here"

Saving 0001193125-16-542735.txt to 0001193125-16-542735.txt
Saving 0001193125-19-326635.txt to 0001193125-19-326635.txt
Saving 0001193125-14-017279.txt to 0001193125-14-017279.txt


💡 The cell below processes the uploaded text files by:


1.   Retrieving the file names.
2.   Reading each text file into memory as a string.
3.   Storing the text in a dictionary:
 *   Keys = file name
 *   Values = file content (text)
4.   Printing the first line of each file to confirm that it was read correctly.



In [2]:
# retrieve file names from uploaded files
file_names = list(uploaded.keys())

# read the text file and return its content
def read_txt_file(file_name):
    """Reads a text file and returns its content as a string."""
    with open(file_name, "r", encoding="utf-8") as file:
        return file.read()

# load all loan contracts into a dictionary (file_name : file content)
documents = {file_name: read_txt_file(file_name) for file_name in file_names}

# print first line of each file to confirm that it is read in correctly
for file_name, content in documents.items():
    first_line = content.split("\n")[0] if content else "(Empty file)"
    print(f"Successfully loaded {file_name}: First line -> {first_line}")

Successfully loaded 0001193125-16-542735.txt: First line -> <SEC-DOCUMENT>0001193125-16-542735.txt : 20160415
Successfully loaded 0001193125-19-326635.txt: First line -> <SEC-DOCUMENT>0001193125-19-326635.txt : 20191231
Successfully loaded 0001193125-14-017279.txt: First line -> <SEC-DOCUMENT>0001193125-14-017279.txt : 20140122


💡 The cell below extracts the relevant sections from the loan agreements.
*   Loan contracts are large unstructured text files, and we need to extract key details about borrowers, lenders, and signatures.
*   After comparing the cleaned and unstructured versions, I found consistent patterns in where the relevant information appears:

 *    Information about the borrower and lenders appears shortly after the **second** occurrence of the phrase "Entry into a Material Definitive Agreement." The first occurrence is in the introductory text (not useful); the second is in the main body and contains the borrower and lender details.
 *   Information about the signature page follows the first (and only) occurrence of the phrase "to be signed on its behalf by".

 I used regex to locate these key phrases in each document. Once a phrase is found, the next few thousand characters (8,000 for the borrower/lender section, 4,000 for the signature section) are extracted and sent to OpenAI's API for analysis.





In [3]:
# extracts text starting at the SECOND occurrence of the key phrase for borrower/lender
def extract_text_after_second_occurrence(text, keyword, char_limit=8000):
    """Finds the SECOND occurrence of the phrase and returns up to char_limit characters starting at it (phrase included)."""

    # Find all occurrences of the keyword
    matches = [m.start() for m in re.finditer(re.escape(keyword), text)]

    # Ensure there are at least two occurrences and get the second occurrence
    if len(matches) >= 2:
        start = matches[1]
        end = start + char_limit
        # extracts text up to limit
        return text[start:end].strip()

    # returns N/A if valid info isn't found
    return "N/A"

# extracts text after the FIRST occurrence of the key phrase for signature
def extract_text_after_keyword(text, keyword, char_limit=4000):
    """Finds the FIRST occurrence of the phrase and extracts text after it."""
    match = re.search(re.escape(keyword), text)

    # extracts the text up to limit
    if match:
        start = match.end()
        end = start + char_limit
        return text[start:end].strip()

    # returns N/A if valid info isn't found
    return "N/A"

# key phrases where relevant info is found
borrower_lender_keyword = "Entry into a Material Definitive Agreement"
signature_keyword = "to be signed on its behalf by"

# extract relevant sections from each document
extracted_sections = {}
for file_name, content in documents.items():
    extracted_sections[file_name] = {
        "borrower_lender_section": extract_text_after_second_occurrence(content, borrower_lender_keyword, char_limit=8000),
        "signature_section": extract_text_after_keyword(content, signature_keyword, char_limit=4000)
    }


💡 The cell below sends the extracted text to OpenAI and formats the data to display.
*   Uses a structured prompt to extract the borrower name, lenders and loan amounts, and signature info
*   Prints each response in a consistent format for readability



In [4]:
# extracts structured data using OpenAI
def ask_openai(text, prompt):
    """Send extracted text to OpenAI to retrieve structured borrower/lender/signature details."""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI assistant that extracts structured data from loan contracts."},
            {"role": "user", "content": prompt + "\n\n" + text}
        ],
        # helps control randomness
        temperature=0.1
    )

    return response.choices[0].message.content.strip()

# defines prompt to ensure OpenAI extracts key financial details
# note: I used ChatGPT to help draft this prompt
prompt = """
You are analyzing a loan agreement document. Extract the following details:

1. Borrower Name (company or individual who took the loan)
2. Lenders and their individual loan amounts
3. Information from the signature page (Company Name, Signing Person, Title)

If any details are unclear, infer based on context. If unavailable, return 'N/A'.
Format the response neatly as:
- Borrower Name:
- Lenders and Loan Amounts:
- Signature Page Information (Company Name, Signing Person, Title):

Here is the extracted loan contract text:
"""
# use OpenAI to extract data
extracted_data = {}
for file_name, sections in extracted_sections.items():
    extracted_text = sections["borrower_lender_section"] + "\n\n" + sections["signature_section"]
    extracted_data[file_name] = ask_openai(extracted_text, prompt)  # Send to OpenAI

# displays extracted data
for file_name, extracted_text in extracted_data.items():
    print("=" * 50)
    print(f"File: {file_name}\n")
    print(extracted_text)
    print("=" * 50 + "\n")


File: 0001193125-16-542735.txt

- Borrower Name: The Sherwin-Williams Company

- Lenders and Loan Amounts: 
  1. Citibank, N.A. and Citigroup Global Markets Inc. - $7.3 billion (Bridge Credit Agreement)
  2. Citibank, N.A., Wells Fargo Bank, National Association, Morgan Stanley Senior Funding, Inc. and PNC Bank, National Association - $2.0 billion (Term Loan Credit Agreement)

- Signature Page Information (Company Name, Signing Person, Title):
  - Company Name: The Sherwin-Williams Company
  - Signing Person: Catherine M. Kilbane
  - Title: Senior Vice President, General Counsel and Secretary

File: 0001193125-19-326635.txt

- Borrower Name: PFGC, Inc. and Performance Food Group, Inc.
- Lenders and Loan Amounts: Wells Fargo Bank, National Association and other lenders, with an aggregate principal amount of $3.0 billion.
- Signature Page Information (Company Name, Signing Person, Title):
  - Company Name: PERFORMANCE FOOD GROUP COMPANY
  - Signing Person: A. Brent King
  - Title: Senior
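
Because the prompt asks for a fixed bulleted layout, a response could in principle be turned into a dictionary for later use. The sketch below is only an illustration and is not part of the notebook's pipeline: `parse_response` is a hypothetical helper, the sample text is an abbreviated copy of the first response above, and it assumes the model keeps top-level fields as unindented `- Field: value` lines with any detail lines indented beneath them.

```python
def parse_response(response_text):
    """Parse unindented '- Field: value' bullets into a dict of lists.

    Indented lines are attached to the most recent top-level field.
    Assumes the layout requested in the prompt; real responses may deviate.
    """
    parsed = {}
    current_key = None
    for raw_line in response_text.splitlines():
        if not raw_line.strip():
            continue  # skip blank lines
        if raw_line.startswith("- ") and ":" in raw_line:
            # top-level field: split on the first colon only
            key, _, value = raw_line[2:].partition(":")
            current_key = key.strip()
            parsed[current_key] = [value.strip()] if value.strip() else []
        elif current_key is not None:
            # indented detail line belonging to the current field
            parsed[current_key].append(raw_line.strip())
    return parsed


# abbreviated sample based on the first response printed above
sample = """- Borrower Name: The Sherwin-Williams Company

- Lenders and Loan Amounts:
  1. Citibank, N.A. and Citigroup Global Markets Inc. - $7.3 billion (Bridge Credit Agreement)

- Signature Page Information (Company Name, Signing Person, Title):
  - Company Name: The Sherwin-Williams Company
  - Signing Person: Catherine M. Kilbane
"""

for field, values in parse_response(sample).items():
    print(field, "->", values)
```

A truncated response (like the second one above) would still parse, but the cut-off value would be stored as-is, so completeness would need a separate check.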

# Summary
## **1️⃣ Overview**
This take-home assignment extracts key details from unstructured loan agreement text files:
*   Borrower Name
*   Lenders and their loan amounts
*   Signature Page Info (Company, Signing Person, Title)

## **2️⃣ Approach**
### **🔹 Text Extraction**
- I used regex to find specific key phrases in the document.
- Extracted borrower/lender details after the **second occurrence** of:
  > `"Entry into a Material Definitive Agreement"`
- Extracted signature page details after the **first occurrence** of:
  > `"to be signed on its behalf by"`

### **🔹 Structured Data Extraction**
- Passed extracted sections to **OpenAI's GPT-4 API** for processing.
- Used a structured prompt to ensure consistently formatted output.

### **🔹 Why This Approach?**
- I chose this method because I am more familiar with OpenAI’s API and its ability to analyze text.
- Similar results might be achievable with NLP libraries like NLTK, but I went with the approach I was more comfortable with. I'm interested in learning NLTK in the future, though!

## **3️⃣ Limitations**
🚧 While this method works well for the given loan agreements, I haven't tested it on a broader set of **8-K Forms**. Some challenges may include:
- **Text Formatting Variations** → Some documents may use slightly different phrasing.
- **Multi-line Keyword Issues** → The target phrases sometimes break across multiple lines, making them harder to find. I ran into this with the signature phrase and had to shorten it so that it would match.
- **Varying Section Lengths** → Signature sections are similar in length, but borrower/lender sections can vary a lot. Some documents contain much longer lender/amount sections, so the character limit may need to be increased if the extracted information appears incomplete.
- **Cost Considerations** → Frequent API calls can add up in cost over time. A cheaper model such as gpt-3.5-turbo might be a better choice.
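
The multi-line keyword and section-length limitations might be softened with a whitespace-tolerant search and a cutoff at the next 8-K item heading instead of a fixed character count. The sketch below is one possible approach, not what the notebook currently does. `flexible_pattern`, `find_nth`, `extract_until_next_item`, and `ITEM_HEADING` are hypothetical names, and the assumption that a section ends at a line beginning with "Item N.NN" has not been checked against the three filings.

```python
import re

def flexible_pattern(phrase):
    """Compile a regex in which every gap between words matches any
    run of whitespace, so a phrase split across lines is still found."""
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))

def find_nth(text, phrase, n=1):
    """Return the match object for the n-th (1-based) occurrence, or None."""
    for count, match in enumerate(flexible_pattern(phrase).finditer(text), start=1):
        if count == n:
            return match
    return None

# ASSUMPTION: a new 8-K section starts on a line like "Item 9.01 ..."
ITEM_HEADING = re.compile(r"^\s*Item\s+\d+\.\d+", flags=re.MULTILINE | re.IGNORECASE)

def extract_until_next_item(text, phrase, n=1, max_chars=20000):
    """Extract the text after the n-th occurrence of `phrase`, stopping at
    the next item heading or after max_chars, whichever comes first."""
    match = find_nth(text, phrase, n)
    if match is None:
        return "N/A"
    start = match.end()
    next_heading = ITEM_HEADING.search(text, start)
    end = next_heading.start() if next_heading else start + max_chars
    end = min(end, start + max_chars)
    return text[start:end].strip()

# made-up miniature document, for illustration only
sample = (
    "Item 1.01 Entry into a Material\nDefinitive Agreement.\nIntro text.\n"
    "Item 1.01 Entry into a  Material Definitive\n   Agreement.\n"
    "The Company borrowed $5 from Bank A.\n"
    "Item 9.01 Financial Statements and Exhibits.\n"
)
print(extract_until_next_item(sample, "Entry into a Material Definitive Agreement", n=2))
```

With this kind of matching, the full signature phrase could be searched for even when it wraps across lines, and `max_chars` acts only as a safety cap rather than the primary cutoff.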




