In [37]:
# !pip install -q openai datasets
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [38]:
# import json
import numpy as np
import pandas as pd
from openai import AzureOpenAI
from datasets import load_dataset
from sklearn.metrics import classification_report
# from google.colab import userdata
from tqdm import tqdm
import os

In [39]:
azure_api_key = os.getenv('azure_api_key')
azure_api_endpoint = os.getenv('azure_endpoint')
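
If either variable is missing from the `.env` file, `os.getenv` returns `None`, and the resulting error can be harder to trace than a direct check. A minimal sketch of an early check, where `require_env` is a hypothetical helper name and not part of the notebook:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file")
    return value

# Possible usage with the variable names read in the cell above:
# azure_api_key = require_env('azure_api_key')
# azure_api_endpoint = require_env('azure_endpoint')
```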

In [40]:
client = AzureOpenAI(
  azure_endpoint = azure_api_endpoint,
  api_key=azure_api_key,
  api_version="2024-02-01"
)

In [41]:
model_name = 'gpt-35-turbo' # deployment name

**Examples and Gold Examples**

A set of examples and gold examples for loan-approval classification is stored in two local pickle files under `synthetic_data/`. Let us load this data and look at a few samples. Note that the DataFrame names below keep an `amazon_reviews_` prefix, but they hold loan application records.

In [42]:
amazon_reviews_examples_df = pd.read_pickle("synthetic_data/loan_application_data_examples.pkl")
amazon_reviews_gold_examples_df = pd.read_pickle("synthetic_data/loan_application_data_gold_examples.pkl")

In [43]:
amazon_reviews_examples_df.shape, amazon_reviews_gold_examples_df.shape

((20, 8), (30, 8))

As the output above indicates, there are 20 examples and 30 gold examples. We will sample from the examples to build the few-shot prompt and evaluate the prompt on all 30 gold examples.
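
Because sampling is random, the few-shot prompt changes between runs, and a small sample can miss one of the two classes. A minimal sketch of a seeded, per-label draw in plain Python; `sample_per_label` and its parameters are illustrative names, and the rows are assumed to be dictionaries (for a DataFrame, `df.to_dict('records')` produces that shape):

```python
import random
from collections import defaultdict

def sample_per_label(rows, label_key, per_label, seed=0):
    """Draw up to `per_label` rows for each label value, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    picked = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take fewer rows when a class has less than `per_label` members
        picked.extend(rng.sample(group, min(per_label, len(group))))
    rng.shuffle(picked)  # avoid presenting all examples of one class first
    return picked
```

Fixing the seed makes the prompt, and therefore the evaluation, repeatable across runs.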

In [44]:
amazon_reviews_examples_df.sample(4)
#amazon_reviews_examples_df

Unnamed: 0,Application No.,Income,Credit Score,Family Members,Outstanding Debt,Loan Request,Employment Status,Loan Decision
11,12,3741.76,640,4,3853.91,17211.04,Unemployed,Not Approved
17,18,4563.75,737,2,1299.52,46329.88,Employed,Approved
0,1,2842.71,379,0,3134.6,11572.01,Employed,Not Approved
2,3,8248.1,532,5,2023.62,7019.05,Self-Employed,Not Approved


In [45]:
#amazon_reviews_gold_examples_df

**Assembling the prompt**

In [46]:
system_message = """
You are a loan approval application, to determine whether the loan should be approved or not. The business logic will follow a series of checks and calculations to ensure responsible lending based on risk factors like income, credit score, and family responsibilities. Approach as below:
Business Logic for Loan Approval:
1. Eligibility Verification (Initial Checks)
Before proceeding with risk assessment, we ensure that all necessary customer details are provided and meet basic eligibility criteria:
•	Income: Ensure that income is provided (either monthly or annually).
•	Credit Score: Ensure the credit score is a valid number (typically between 300 and 850).
•	Family Members: Ensure the number of family members is provided (must be a positive integer).
•	Employment Status: Ensure that employment status is provided (employed/self-employed/unemployed).
•	Loan Amount: Ensure the loan amount requested is specified.
•	Outstanding Debts: If provided, these should be accounted for in assessing the customer’s financial obligations.
If any of these details are missing or invalid, do not proceed further and the application should return "Not Approved" with an explanation like "Missing or invalid customer details."
2. Risk Assessment Score Calculation
Once the necessary details are verified, we calculate a Risk Score based on three main factors: Credit Score, Income Stability, and Family Responsibilities.
•	Credit Score (Weight: 50%):
o	Score Range:
	720+ (Excellent) = 5 points
	680–719 (Good) = 4 points
	640–679 (Fair) = 3 points
	600–639 (Poor) = 2 points
	Below 600 (Very Poor) = 1 point
o	The higher the credit score, the more favourable it is for approval.
•	Income Stability (Weight: 30%):
o	Income is evaluated based on how well it can cover the loan request, existing debt, and family needs. Use the Debt-to-Income Ratio (DTI), calculated as: DTI = (Total Monthly Debt Payments (incl. loan) / Monthly Income) × 100
	DTI < 35% (Low Risk) = 5 points
	DTI 35%–49% (Moderate Risk) = 3 points
	DTI ≥ 50% (High Risk) = 1 point
o	Lower DTI suggests a higher ability to repay the loan.
•	Family Responsibilities (Weight: 20%):
o	The number of dependents impacts financial obligations. More dependents may reduce disposable income.
	0–1 dependents = 5 points
	2–3 dependents = 3 points
	4+ dependents = 1 point
3. Decision Based on Risk Score
Combine the scores from each factor and calculate the total Risk Score. The maximum possible score is 5 (Credit Score) + 5 (Income) + 5 (Family) = 15 points.
•	Approval Threshold:
o	Approved: Risk Score ≥ 10
o	Not Approved: Risk Score < 10
4. Detailed Decision Output
If any of these input details are missing or invalid, do not proceed further and the application should return "Not Approved" with an explanation like "Missing or invalid customer details."
•	Approved: "Your loan has been approved based on your strong credit score, manageable debt-to-income ratio, and household size."
•	Not Approved: Specific reasons for rejection should be provided, such as:
o	"Your loan application was not approved due to a low credit score."
o	"Your debt-to-income ratio exceeds the acceptable threshold, indicating a high risk."
o	"Your family responsibilities and existing debts reduce your disposable income."
"""
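
The scoring rules in steps 2 and 3 of the system message are deterministic, so they can also be written as ordinary code, for example to sanity-check labels or model outputs. A minimal sketch under these assumptions: the DTI percentage is supplied by the caller (the dataset does not state monthly debt payments or a loan term), the "Family Members" field is treated as the number of dependents, and the total is the plain sum from step 3 (the percentage weights listed in step 2 are not applied). Function names are illustrative.

```python
def credit_points(credit_score: float) -> int:
    """Map a credit score to 1-5 points using the bands in the system message."""
    if credit_score >= 720:
        return 5
    if credit_score >= 680:
        return 4
    if credit_score >= 640:
        return 3
    if credit_score >= 600:
        return 2
    return 1

def dti_points(dti_percent: float) -> int:
    """Map a debt-to-income percentage to points (lower DTI scores higher)."""
    if dti_percent < 35:
        return 5
    if dti_percent < 50:
        return 3
    return 1

def family_points(dependents: int) -> int:
    """Map the number of dependents to points."""
    if dependents <= 1:
        return 5
    if dependents <= 3:
        return 3
    return 1

def loan_decision(credit_score: float, dti_percent: float, dependents: int):
    """Return (risk_score, decision) with the approval threshold of 10 points."""
    total = credit_points(credit_score) + dti_points(dti_percent) + family_points(dependents)
    return total, 'Approved' if total >= 10 else 'Not Approved'

# Credit score 737 (5 points), an assumed DTI of 30% (5 points), 2 dependents (3 points)
print(loan_decision(737, 30.0, 2))  # (13, 'Approved')
```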

In [47]:
few_shot_prompt = [{'role':'system', 'content': system_message}]

We need to iterate over the rows of the examples DataFrame to append these examples as `user` and `assistant` messages to the few-shot prompt. We achieve this using the `iterrows` method.

In [48]:
for index, row in amazon_reviews_examples_df.iterrows():
    # Use .iloc for positional access; plain row[0] on a labelled Series is deprecated in recent pandas
    print('Example input: \n Application No.: ' + str(row.iloc[0]) +' \n Income: '+ str(row.iloc[1])
          + ' \n Credit Score: '+ str(row.iloc[2]) +'  \n Family Members: '+ str(row.iloc[3])
          +' \n Outstanding Debt: '+ str(row.iloc[4]) +' \n Loan Request: '+ str(row.iloc[5])
          +' \n Employment Status: '+ str(row.iloc[6]))
    print('Loan Decision: '+ row.iloc[7])
    break  # show only the first example

Example input: 
 Application No.: 1 
 Income: 2842.71 
 Credit Score: 379  
 Family Members: 0 
 Outstanding Debt: 3134.6 
 Loan Request: 11572.01 
 Employment Status: Employed
Loan Decision: Not Approved



Notice that the label is already a string (`Approved` or `Not Approved`). Chat messages must have string content, so we still wrap the label in `str()` as a safeguard when assembling the few-shot prompt. Let us assemble a few-shot prompt with 8 examples.
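
The row-to-text conversion is needed both for the examples and later for the gold examples, so it can live in one function. A minimal sketch that looks fields up by column name instead of position; `format_application` and `to_messages` are hypothetical helpers, the column names are taken from the table shown earlier, and the text layout is slightly simplified compared with the notebook's string concatenation:

```python
FIELDS = ['Application No.', 'Income', 'Credit Score', 'Family Members',
          'Outstanding Debt', 'Loan Request', 'Employment Status']

def format_application(row) -> str:
    """Render one application (a dict or a pandas Series) as prompt text."""
    lines = [f'{name}: {row[name]}' for name in FIELDS]
    return 'Example input:\n' + '\n'.join(lines)

def to_messages(row) -> list:
    """Build the user/assistant message pair for one labelled example."""
    return [
        {'role': 'user', 'content': format_application(row)},
        {'role': 'assistant', 'content': str(row['Loan Decision'])},
    ]
```

Looking fields up by name keeps the prompt correct even if the column order of the DataFrame changes.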

In [49]:
for index, row in amazon_reviews_examples_df.sample(8).iterrows():
    # Use .iloc for positional access; plain row[0] on a labelled Series is deprecated in recent pandas
    example_inp = ('Example input: \n Application No.: ' + str(row.iloc[0]) +' \n Income: '+ str(row.iloc[1])
          + ' \n Credit Score: '+ str(row.iloc[2]) +'  \n Family Members: '+ str(row.iloc[3])
          +' \n Outstanding Debt: '+ str(row.iloc[4]) +' \n Loan Request: '+ str(row.iloc[5])
          +' \n Employment Status: '+ str(row.iloc[6]))
    example_app = row.iloc[7]

    few_shot_prompt.append(
        {
            'role': 'user',
            'content': example_inp
        }
    )

    few_shot_prompt.append(
        {
            'role': 'assistant',
            'content': str(example_app) # LLMs accept only string inputs
        }
    )


In [50]:
few_shot_prompt

[{'role': 'system',
  'content': '\nYou are a loan approval application, to determine whether the loan should be approved or not. The business logic will follow a series of checks and calculations to ensure responsible lending based on risk factors like income, credit score, and family responsibilities. Approach as below:\nBusiness Logic for Loan Approval:\n1. Eligibility Verification (Initial Checks)\nBefore proceeding with risk assessment, we ensure that all necessary customer details are provided and meet basic eligibility criteria:\n•\tIncome: Ensure that income is provided (either monthly or annually).\n•\tCredit Score: Ensure the credit score is a valid number (typically between 300 and 850).\n•\tFamily Members: Ensure the number of family members is provided (must be a positive integer).\n•\tEmployment Status: Ensure that employment status is provided (employed/self-employed/unemployed).\n•\tLoan Amount: Ensure the loan amount requested is specified.\n•\tOutstanding Debts: If pr

We now have 8 examples in the few-shot prompt, which is ready for use. Before we deploy this prompt, we need an estimate of its performance. This is where we use the gold examples to estimate accuracy.

## Evaluation

In [51]:
predictions, ground_truths = [], []

In [52]:
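# Passing total lets tqdm show a proper progress bar for the iterrows generator
for index, row in tqdm(amazon_reviews_gold_examples_df.iterrows(), total=len(amazon_reviews_gold_examples_df)):
    # Use .iloc for positional access; plain row[0] on a labelled Series is deprecated in recent pandas
    gold_inp = ('Example input: \n Application No.: ' + str(row.iloc[0]) +' \n Income: '+ str(row.iloc[1])
          + ' \n Credit Score: '+ str(row.iloc[2]) +'  \n Family Members: '+ str(row.iloc[3])
          +' \n Outstanding Debt: '+ str(row.iloc[4]) +' \n Loan Request: '+ str(row.iloc[5])
          +' \n Employment Status: '+ str(row.iloc[6]))
    gold_app = row.iloc[7]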

    user_input = [{'role':'user', 'content': gold_inp}]

    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=few_shot_prompt + user_input,
            temperature=0
        )

        predictions.append(response.choices[0].message.content) 
        ground_truths.append(gold_app)
    except Exception as e:
        print(e) # Log error and continue
        continue

30it [00:12,  2.34it/s]


In [53]:
predictions = np.array(predictions)
ground_truths = np.array(ground_truths)
(predictions == ground_truths).mean()

0.8333333333333334

The output above indicates that the few-shot prompt reaches an accuracy of about 83% on the gold examples. More fine-grained metrics (e.g., the F1 score) can also be used to estimate how well the prompt performs.
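
The accuracy above relies on exact string equality, so a response such as "Not Approved: low credit score" would count as wrong even though the decision matches. A minimal sketch of a normalization step, assuming responses contain the word "approved" somewhere in their text; `normalize_decision` is a hypothetical helper and not part of the notebook:

```python
def normalize_decision(text: str) -> str:
    """Reduce a free-text model response to one of the two labels."""
    lowered = text.strip().lower()
    # Check the negative phrase first, since it contains the positive one
    if 'not approved' in lowered:
        return 'Not Approved'
    if 'approved' in lowered:
        return 'Approved'
    return 'Unknown'

print(normalize_decision('Your loan application was not approved due to a low credit score.'))
# Not Approved
```

Responses mapped to `Unknown` are worth inspecting by hand rather than silently counting as errors.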

In [54]:
print(classification_report(ground_truths, predictions))

              precision    recall  f1-score   support

    Approved       0.50      0.40      0.44         5
Not Approved       0.88      0.92      0.90        25

    accuracy                           0.83        30
   macro avg       0.69      0.66      0.67        30
weighted avg       0.82      0.83      0.83        30
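
The per-class numbers in the report can be reproduced by hand, which makes the weak `Approved` row easier to interpret. The sketch below uses label lists reconstructed from the rounded report values (2 true positives, 2 false positives, 3 false negatives, 23 true negatives for `Approved`); they are not the actual model outputs, and `binary_metrics` is an illustrative name:

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Reconstructed counts: 5 truly Approved (2 found, 3 missed), 25 truly Not Approved (2 mislabelled)
y_true = ['Approved'] * 5 + ['Not Approved'] * 25
y_pred = ['Approved'] * 2 + ['Not Approved'] * 3 + ['Approved'] * 2 + ['Not Approved'] * 23

precision, recall, f1 = binary_metrics(y_true, y_pred, 'Approved')
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.5 0.4 0.44
```

With only 5 `Approved` cases among 30 gold examples, the overall accuracy is dominated by the `Not Approved` class, which is why the per-class view is informative here.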



>More examples do not necessarily imply better accuracy. Increasing the number of examples in the few-shot prompt beyond 16 is not known to yield better performance.
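
One way to check this on a given dataset is to evaluate the same gold set with prompts of different sizes. A minimal sketch, where `predict` is a hypothetical callable standing in for the chat-completion call (it receives the chosen examples and one gold input and returns a label); nothing here calls the API:

```python
import random

def accuracy_by_shot_count(examples, gold, predict, shot_counts, seed=0):
    """Return {k: accuracy} for few-shot prompts built from k sampled examples.

    `examples` and `gold` are lists of (input_text, label) pairs.
    """
    rng = random.Random(seed)
    results = {}
    for k in shot_counts:
        # Never request more shots than there are examples
        shots = rng.sample(examples, min(k, len(examples)))
        correct = sum(predict(shots, text) == label for text, label in gold)
        results[k] = correct / len(gold)
    return results
```

Comparing the resulting accuracies for, say, 4, 8, and 16 shots shows whether extra examples help on this task before paying for the longer prompt.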