# EXAMPLE.

This example shows how to use SMART to identify subgroups on which your supervised learning model is likely to fail.

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from openai import AzureOpenAI
from SMART import SMART
from openai_config import get_openai_config

# Step 1: Create a synthetic dataset with numeric features only
np.random.seed(0)

# Generate 1000 samples
n_samples = 1000
# Note: np.random.randint excludes the upper bound.
age = np.random.randint(18, 70, n_samples)  # Age between 18 and 69
income = np.random.randint(20000, 100000, n_samples)  # Income between 20,000 and 99,999
credit_score = np.random.randint(300, 850, n_samples)  # Credit scores between 300 and 849
loan_amount = np.random.randint(1000, 50000, n_samples)  # Loan amount between 1,000 and 49,999
# Binary target: 0 (no default) or 1 (default), drawn independently of the features
loan_default = np.random.choice([0, 1], n_samples)

# Create DataFrame
df = pd.DataFrame({
    'age': age,
    'income': income,
    'credit_score': credit_score,
    'loan_amount': loan_amount,
    'loan_default': loan_default
})

# Define features and target
X = df.drop(columns=['loan_default'])
y = df['loan_default']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Instantiate and train a logistic regression model
model = LogisticRegression(max_iter=1000)  # Raise max_iter to give the solver more iterations to converge
model.fit(X_train, y_train)

# Evaluate the model
accuracy = model.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.52
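
An accuracy of about 0.5 is expected here: `loan_default` is drawn at random, independently of the features, so no model can do better than chance on it. As a rough point of comparison, the sketch below computes a majority-class baseline on labels generated the same way. It is illustrative only; the generator, seed, and sizes are assumptions chosen to mirror the 800/200 split above and are not part of SMART.

```python
import numpy as np

# Assumption: a fresh generator and seed, not the exact random stream used above.
rng = np.random.default_rng(0)

# Labels drawn independently of any feature, as in the example above.
y_train = rng.choice([0, 1], size=800)
y_test = rng.choice([0, 1], size=200)

# Majority-class baseline: always predict the most frequent training label.
majority_label = int(np.bincount(y_train, minlength=2).argmax())
baseline_accuracy = float(np.mean(y_test == majority_label))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A model whose test accuracy sits close to this baseline has not learned anything from the features, which is the situation in this synthetic example.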


In [12]:
# Step 3: Initialize AzureOpenAI client (replace with your actual configuration)
config = get_openai_config()

llm = AzureOpenAI(
    api_version=config["api_version"],
    azure_endpoint=config["api_base"],
    api_key=config["api_key"])

# Step 4: Create SMART instance
subgroup_finder = SMART(llm=llm, config=config, verbose=True)

# Step 5: Fit SMART to the training data
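# Free-text description passed to the LLM. Note that it mentions education level,
# although the features are age, income, credit_score, and loan_amount.
context = "This dataset contains information about loan defaults based on age, income, and education level."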
context_target = "loan_default"

subgroup_finder.fit(X_train, context=context, context_target=context_target, evaluate_feasibility=False)

# Step 6: Display the identified subgroups
print("Identified Subgroups:")
print(subgroup_finder.subgroups)


----------INPUT TEXT --------------
Your task is to propose possible hypotheses as to which subgroups within the dataset might have worse predictive performance than on average because of societal bias in the dataset, insufficient data, other relationships, or others. The subgroups might be based on any of the provided characteristics, as well as on any combination of such characteristics. 
        
        Dataset information: This dataset contains information about loan defaults based on age, income, and education level.. loan_default
        
        The dataset contains 4 columns. The columns are age, income, credit_score, loan_amount. The values are dict_items([('age', {'min': 18, 'mean': 43.65375, 'max': 69}), ('income', {'min': 20131, 'mean': 60256.615, 'max': 99850}), ('credit_score', {'min': 300, 'mean': 564.70375, 'max': 849}), ('loan_amount', {'min': 1019, 'mean': 25195.7225, 'max': 49975})])
        
        Task: Create 5 hypotheses as to which subgroups within the dataset
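
The per-column min, mean, and max values that appear in the prompt above can be reproduced with plain pandas. The sketch below is one possible way to build such a summary. It is not taken from the SMART source, and the two-column frame and seed are stand-ins for the real training data.

```python
import numpy as np
import pandas as pd

# Assumption: a small stand-in for X_train with two of the example's columns.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 70, 1000),
    "income": rng.integers(20000, 100000, 1000),
})

# agg() returns a frame indexed by statistic; to_dict() then yields
# one {'min': ..., 'mean': ..., 'max': ...} dict per column.
summary = X.agg(["min", "mean", "max"]).to_dict()

for column, stats in summary.items():
    print(column, stats)
```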

In [13]:
print("Identified Subgroups:")
print(subgroup_finder.subgroups)

Identified Subgroups:
{0: 'income < 60256.615', 1: 'age < 43.65375', 2: 'credit_score < 564.70375', 3: 'loan_amount > 25195.7225', 4: 'age > 43.65375'}
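
Each subgroup comes back as a condition string over the feature columns, and strings of this shape are valid input for pandas `DataFrame.query`. The sketch below uses that to compare accuracy inside each subgroup with overall accuracy. It is a hand-rolled check under stated assumptions, not SMART's own evaluation: the data and the predictions are random placeholders, and in practice `y_pred` would come from `model.predict(X_test)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Placeholder test set with two of the example's columns.
X_test = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.integers(20000, 100000, n),
})
y_test = pd.Series(rng.choice([0, 1], n), index=X_test.index)
# Placeholder predictions; replace with model.predict(X_test) in real use.
y_pred = pd.Series(rng.choice([0, 1], n), index=X_test.index)

# Two of the condition strings shown in the output above.
subgroups = {0: "income < 60256.615", 1: "age < 43.65375"}

correct = y_pred == y_test
rows = []
for key, condition in subgroups.items():
    idx = X_test.query(condition).index  # rows that satisfy the condition
    accuracy = correct.loc[idx].mean() if len(idx) else float("nan")
    rows.append({"subgroup": condition, "n": len(idx), "accuracy": accuracy})

report = pd.DataFrame(rows)
print(f"Overall accuracy: {correct.mean():.2f}")
print(report)
```

A subgroup whose accuracy falls clearly below the overall figure, with enough rows to be meaningful, is a candidate failure mode worth inspecting.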


In [14]:
print("Hypotheses generated about each subgroup")
print(subgroup_finder.hypotheses)

Hypotheses generated about each subgroup
Hypothesis 1: The model may perform worse for individuals with lower income levels; Justification: There may be societal biases in the dataset where individuals with lower income levels are more likely to default on loans. This could lead to the model over-predicting defaults for this group, resulting in worse performance.

Hypothesis 2: The model may perform worse for younger individuals; Justification: Younger individuals may have less credit history and therefore less data available for the model to learn from. This could lead to the model under-performing for this group.

Hypothesis 3: The model may perform worse for individuals with lower credit scores; Justification: Individuals with lower credit scores may be more likely to default on loans due to societal biases in the dataset. This could lead to the model over-predicting defaults for this group, resulting in worse performance.

Hypothesis 4: The model may perform worse for individuals w
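
The hypotheses are printed in a regular `Hypothesis N: ...; Justification: ...` pattern, so they can be split into structured records when needed. The sketch below assumes the hypotheses are available as a single string in exactly that layout, which the output above suggests but does not guarantee. The sample text is abridged from the first two hypotheses, and the regular expression is an illustration, not part of SMART.

```python
import re

# Abridged copy of the first two hypotheses printed above.
text = (
    "Hypothesis 1: The model may perform worse for individuals with lower income levels; "
    "Justification: There may be societal biases in the dataset where individuals with "
    "lower income levels are more likely to default on loans.\n\n"
    "Hypothesis 2: The model may perform worse for younger individuals; "
    "Justification: Younger individuals may have less credit history and therefore less "
    "data available for the model to learn from."
)

# Capture the number, the hypothesis, and the justification; a justification
# ends where the next "Hypothesis N:" begins or at the end of the text.
pattern = re.compile(
    r"Hypothesis (\d+):\s*(.*?);\s*Justification:\s*(.*?)(?=\n\s*Hypothesis \d+:|\Z)",
    re.S,
)

records = [
    {"id": int(num), "hypothesis": hyp.strip(), "justification": just.strip()}
    for num, hyp, just in pattern.findall(text)
]

for record in records:
    print(record["id"], "|", record["hypothesis"])
```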

In [None]:
_ = subgroup_finder.generate_model_report(X_train, y_train, X_test, y_test, model)

----------INPUT TEXT --------------

        The following is the context: This dataset contains information about loan defaults based on age, income, and education level.. The following is the target context: loan_default. The following is a table summarizing the information about the results on the training dataset:
                                          Hypothesis  \
0  The model may perform worse for individuals wi...   
1  The model may perform worse for younger indivi...   
2  The model may perform worse for individuals wi...   
3  The model may perform worse for individuals wi...   
4  The model may perform worse for older individuals   

                                       Justification  \
0  There may be societal biases in the dataset wh...   
1  Younger individuals may have less credit histo...   
2  Individuals with lower credit scores may be mo...   
3  Higher loan amounts may be associated with a h...   
4  Older individuals may have a different pattern...   

      