**Optimized Prompt Formatting:**
- Added clear instructions to the prompt.
- Specified that the assistant should only return the JSON response.
- Used consistent formatting and included a note emphasizing not to add extra text.

**Clarified the Three Questions:**
- Rephrased the questions to be more specific and understandable.

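**Combined Answer and Evidence into One Column:**
- Stored each answer together with its supporting evidence in a single column per question (for example, `Yes: <evidence>`).

**Ensured Progress is Saved:**
- Wrote the DataFrame to `prospectuses_data_processed.csv` periodically so that an interrupted run can resume from the saved file.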



In [14]:
from langchain_ollama import OllamaLLM
from tqdm import tqdm
import json
import os
import pandas as pd


llm = OllamaLLM(model="llama3.2")

# Resume from the processed file if it exists; otherwise start from the raw data
if os.path.exists('prospectuses_data_processed.csv'):
    df = pd.read_csv('prospectuses_data_processed.csv')
else:
    df = pd.read_csv('prospectuses_data.csv')
    # Filter out rows that have "failed parsing" in the Section ID column
    df = df[df['Section ID'] != "failed parsing"]

# Define the questions corresponding to each column
# questions = {
#     "Market Dynamics - a": "Exposure to cyclical products",
#     "Market Dynamics - b": "Impact of demographic and structural trends",
#     "Market Dynamics - c": "Seasonal industry volatility"
# }
questions = {
    "Market Dynamics - a": "Is the company exposed to risks associated with cyclical products?",
    "Market Dynamics - b": "Does the text mention risks related to demographic or structural trends affecting the market?",
    "Market Dynamics - c": "Does the text discuss risks due to seasonal volatility in the industry?"
}

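# Create any missing answer columns without overwriting answers loaded from a previous run
for column_name in questions.keys():
    if column_name not in df.columns:
        df[column_name] = ""  # Initialize new answer columns as empty strings
    # read_csv loads empty cells as NaN; normalize them to "" in a string-compatible dtype
    df[column_name] = df[column_name].astype(object).fillna("")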
    #df[f"{column_name} - Evidence"] = ""  # Initialize evidence columns as empty strings
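
The loading step above can be sketched as a small helper that makes the resume behavior explicit. This is a minimal sketch, not the notebook's own code: the function name `load_with_resume`, its parameters, and the toy CSV contents are assumptions made for illustration. It assumes the raw file has a `Section ID` column, as in the cell above.

```python
import os
import tempfile

import pandas as pd


def load_with_resume(raw_path, processed_path, answer_columns):
    """Load the checkpoint file if present, otherwise the filtered raw data."""
    if os.path.exists(processed_path):
        df = pd.read_csv(processed_path)
    else:
        df = pd.read_csv(raw_path)
        # Drop rows whose section could not be parsed; copy to get an independent frame
        df = df[df["Section ID"] != "failed parsing"].copy()
    for name in answer_columns:
        if name not in df.columns:
            df[name] = ""
        # read_csv returns NaN for empty cells; normalize them to ""
        df[name] = df[name].astype(object).fillna("")
    return df


# Toy demonstration with hypothetical data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.csv")
    processed = os.path.join(tmp, "processed.csv")
    pd.DataFrame({
        "Section ID": ["1", "failed parsing", "2"],
        "Subsubsection Text": ["alpha", "beta", "gamma"],
    }).to_csv(raw, index=False)

    # First run: no checkpoint yet, so the raw file is loaded and filtered
    first = load_with_resume(raw, processed, ["Market Dynamics - a"])
    first.at[first.index[0], "Market Dynamics - a"] = "No"
    first.to_csv(processed, index=False)

    # Second run: the checkpoint is loaded and the saved answer survives
    resumed = load_with_resume(raw, processed, ["Market Dynamics - a"])
    resumed_answers = resumed["Market Dynamics - a"].tolist()
```

The key design point is that answer columns are only created when missing, so answers written by an earlier run are never reset to empty strings.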

In [8]:
def analyze_prospectus_row_single_question(row, question):
    # System and user prompts
    system_prompt = "You are an expert in analyzing bond prospectuses and identifying specific risk factors."

    # Format the user prompt using the row's data
    prompt = f"""
{system_prompt}

Please answer the following question based on the given text. Provide a clear "Yes" or "No" answer. If "Yes", include the exact phrases or sentences from the text that support your answer.

Text:
Subsubsection Title: {row['Subsubsection Title']}
Subsubsection Text: {row['Subsubsection Text']}

Question:
{question}

Please provide your answer in the following JSON format:

{{
  "Answer": "Yes" or "No",
  "Evidence": "The exact phrases or sentences from the text if 'Yes'; otherwise, leave blank."
}}

Note: Only provide the JSON response without any additional text.
"""
    # Run the prompt through the model
    response = llm.invoke(input=prompt)

    # Parse the response
    try:
        # Extract the text between the first '{' and the last '}' in the response
        start_idx = response.find('{')
        end_idx = response.rfind('}') + 1
        json_str = response[start_idx:end_idx]
        result = json.loads(json_str)
        # Coerce to str so a null or non-string value does not raise on .strip()
        answer = str(result.get("Answer") or "").strip()
        evidence = str(result.get("Evidence") or "").strip()
    except json.JSONDecodeError:
        answer = "Parsing Error"
        evidence = ""
    
    # Combine answer and evidence
    if answer.lower() == "yes" and evidence:
        combined_answer = f"Yes: {evidence}"
    elif answer.lower() == "yes":
        combined_answer = "Yes"
    elif answer.lower() == "no":
        combined_answer = "No"
    else:
        combined_answer = "Parsing Error"

    # For debugging
    if combined_answer == "Parsing Error":
        print("Parsing Error encountered. Response was:")
        print(response)
    
    return combined_answer
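
The parsing and combining logic in the function above can be tested without calling the model. The following is a minimal sketch under stated assumptions: `extract_json_object` and `combine_answer` are hypothetical helper names, and scanning with `json.JSONDecoder.raw_decode` is a swapped-in alternative to the `find`/`rfind` slicing, chosen because it tolerates stray braces after the JSON object.

```python
import json


def extract_json_object(response):
    """Return the first JSON object found in `response`, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one
        start = response.find("{", start + 1)
    return None


def combine_answer(result):
    """Merge the answer and its evidence into one cell value."""
    if result is None:
        return "Parsing Error"
    answer = str(result.get("Answer") or "").strip().lower()
    evidence = str(result.get("Evidence") or "").strip()
    if answer == "yes":
        return f"Yes: {evidence}" if evidence else "Yes"
    if answer == "no":
        return "No"
    return "Parsing Error"


# Hypothetical model replies used only for illustration
chatty = 'Here is the JSON:\n{"Answer": "Yes", "Evidence": "Sales are seasonal."}\nDone {ok}'
with_null = '{"Answer": "No", "Evidence": null}'
refusal = "I cannot answer that."

combined_chatty = combine_answer(extract_json_object(chatty))
combined_null = combine_answer(extract_json_object(with_null))
combined_refusal = combine_answer(extract_json_object(refusal))
```

In the first reply the trailing `{ok}` would make the slice from the first `{` to the last `}` invalid JSON, while scanning from the first brace still recovers the object.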

In [9]:
# Loop over each row in the DataFrame, counting rows by position
for position, (index, row) in enumerate(tqdm(df.iterrows(), total=df.shape[0]), start=1):
    for column_name, question in questions.items():
        # Check if the answer column is already filled
        if pd.notnull(df.at[index, column_name]) and df.at[index, column_name] != "":
            # Skip this question for this row
            continue
        answer = analyze_prospectus_row_single_question(row, question)
        df.at[index, column_name] = answer

    # Save progress after 50 rows. The check uses the row position because index
    # labels have gaps after filtering and may never land on a multiple of 50.
    if position % 50 == 0:
        df.to_csv('prospectuses_data_processed.csv', index=False)
        break  # Trial run: stop after the first checkpoint; remove to process every row


# Save the final DataFrame
df.to_csv('prospectuses_data_processed.csv', index=False)

  0%|          | 0/74 [00:04<?, ?it/s]


KeyboardInterrupt: 
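
An interrupt like the one above can land in the middle of a `to_csv` call and leave a truncated checkpoint. One way to guard against that, sketched here under assumptions, is to write to a temporary file and rename it over the target. The names `save_checkpoint`, `fill_answers`, and `fake_analyze` are hypothetical, and the stub stands in for the real model call.

```python
import os
import tempfile

import pandas as pd


def save_checkpoint(df, path):
    """Write to a temporary file, then rename it over `path`."""
    tmp_path = f"{path}.tmp"
    df.to_csv(tmp_path, index=False)
    # os.replace overwrites the target; on POSIX the rename is atomic
    os.replace(tmp_path, path)


def fill_answers(df, questions, analyze, path, every=50):
    """Fill empty answer cells, checkpointing every `every` rows by position."""
    for position, (index, row) in enumerate(df.iterrows(), start=1):
        for column, question in questions.items():
            if df.at[index, column] == "":
                df.at[index, column] = analyze(row, question)
        if position % every == 0:
            save_checkpoint(df, path)
    save_checkpoint(df, path)  # final save
    return df


def fake_analyze(row, question):
    """Stub standing in for the LLM call; always answers "No"."""
    return "No"


# Toy frame with gaps in the index labels, as after filtering
demo = pd.DataFrame(
    {"Subsubsection Text": ["a", "b", "c"], "Q": ["", "Yes", ""]},
    index=[3, 7, 50],
)
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "processed.csv")
    fill_answers(demo, {"Q": "hypothetical question"}, fake_analyze, out_path, every=2)
    saved = pd.read_csv(out_path)
    leftover_tmp = os.path.exists(out_path + ".tmp")
```

Already answered cells (the `"Yes"` in the second row) are left untouched, which is what makes a rerun after an interrupt cheap.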

In [10]:
df

Unnamed: 0,Prospectus ID,Original Filename,Section ID,Section Title,Subsection ID,Subsection Title,Subsubsection ID,Subsubsection Title,Subsubsection Text,Market Dynamics - a,Market Dynamics - b,Market Dynamics - c,LLM Answer,Evidence Text,Parsing Error,Market Dynamics - a - Evidence,Market Dynamics - b - Evidence,Market Dynamics - c - Evidence
3,4,FR0014001YE4.pdf,1,RISK FACTORS,1.1,1. Risks related to the Issuer,1.1.1,,Risk factors relating to the Issuer and the Gr...,,,,,,,,,
4,4,FR0014001YE4.pdf,1,RISK FACTORS,1.2,2. Risks related to the Bonds 2.1 Risks relati...,1.2.1,2.1.1 The Bonds may be redeemed prior to maturity,The Issuer reserves the right to purchase Bond...,,,,,,,,,
5,4,FR0014001YE4.pdf,1,RISK FACTORS,1.2,2. Risks related to the Bonds 2.1 Risks relati...,1.2.2,2.1.2 Change of control put option,"In accordance with each Condition 4(d), upon t...",,,,,,,,,
6,4,FR0014001YE4.pdf,1,RISK FACTORS,1.2,2. Risks related to the Bonds 2.1 Risks relati...,1.2.3,2.1.3 Interest rate risks,As provided for in Condition 3 of the Terms an...,,,,,,,,,
7,4,FR0014001YE4.pdf,1,RISK FACTORS,1.3,2.2 Risks for the Bondholders as creditors of ...,1.3.1,2.2.1 French insolvency law,"As a société anonyme incorporated in France, F...",,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,6,XS2010028343.pdf,1,RISK FACTORS,1.7,Risks related to the market generally,1.7.2,Exchange rate risks and exchange controls.,The Issuer will repay principal of and pay int...,,,,,,,,,
73,6,XS2010028343.pdf,1,RISK FACTORS,1.7,Risks related to the market generally,1.7.3,Legal considerations may restrict certain inve...,The investment activities of certain investors...,,,,,,,,,
74,6,XS2010028343.pdf,1,RISK FACTORS,1.7,Risks related to the market generally,1.7.4,Interest rate risks.,Investment in the Securities involves the risk...,,,,,,,,,
75,6,XS2010028343.pdf,1,RISK FACTORS,1.7,Risks related to the market generally,1.7.5,The Interest rate reset may result in a declin...,As the Securities feature a fixed interest rat...,,,,,,,,,
