## Data Preparation and Exploring Teacher Models

### To make this a multi-output task, create an additional continuous target 'max_loan' based on the following rules

- **'max_loan' defines the maximum eligible loan amount for rows with 'loan_status' == 1. Assign random values between 90,000 and 300,000 in steps of 5,000, adjusted by the following rules:**

1) The loan amount is generally lower for age < 30 because the perceived risk is higher; however, the relationship is not a direct inverse linear one.
2) For equally educated persons, the order of preference by working class, from high to low, is Federal-gov > State-gov > Local-gov > Private > Self-emp-not-inc > Without-pay > Never-worked.
3) education-num has a positive correlation with the maximum eligible loan amount.
4) Based on the occupation, make a good judgement about the perceived risk and the maximum eligible loan amount. For example, persons with Exec-managerial or Prof-specialty occupations are eligible for a higher loan amount than Sales/Adm-clerical. Sales/Adm-clerical are eligible for a higher loan than people with blue-collar jobs such as Machine-op-inspct, Farming-fishing, etc. There is also a direct relationship between occupation and education-num.
5) relationship, race, sex, etc. have no connection with the maximum eligible loan amount, so treat them as having no influence.
6) Net capital (capital-gain - capital-loss) generally has a positive correlation with the maximum eligible loan amount.
7) hours-per-week generally has no direct correlation with the maximum eligible loan amount. Working fewer than 35 hours generally suggests the job is not a full-time, high-paying one. At the same time, working more than 50 hours may indicate a blue-collar job, which is also not very high-paying. Consider this in conjunction with the occupation.
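The rules above combine as multiplicative adjustments on a random base amount, followed by rounding and clipping. The sketch below works through two single applicants using only the standard library. The education and capital formulas mirror the ones in `getDataset` further down; the helper names (`education_factor`, `capital_factor`, `max_loan_for`), the base amounts, and the age factors are illustrative assumptions, not part of the notebook.

```python
import math

def education_factor(education_num):
    # Same linear scaling as in getDataset: 0.7 + (education-num / 16) * 0.6
    return 0.7 + (education_num / 16) * 0.6

def capital_factor(net_capital):
    # Net capital effect, capped to the range [-0.2, +0.3] as in getDataset
    return 1.0 + min(max(net_capital / 100_000, -0.2), 0.3)

def max_loan_for(base_loan, factors, step=5_000, low=90_000, high=300_000):
    """Multiply the base amount by every factor, round to the nearest
    multiple of `step`, then clamp the result to [low, high]."""
    adjusted = base_loan * math.prod(factors.values())
    rounded = round(adjusted / step) * step
    return int(min(max(rounded, low), high))

# Hypothetical applicant 1: federal employee, executive, 13 years of education
executive = {
    "age": 0.95,                          # assumed draw for age 30-50
    "workclass": 1.2,                     # Federal-gov
    "education": education_factor(13),
    "occupation": 1.3,                    # Exec-managerial
    "capital": capital_factor(7_688),
    "hours": 1.0,                         # 35-50 hours per week
}

# Hypothetical applicant 2: under 30, private sector, machine operator
operator = {
    "age": 0.8,                           # assumed draw for age < 30
    "workclass": 1.0,                     # Private
    "education": education_factor(9),
    "occupation": 0.75,                   # Machine-op-inspct
    "capital": capital_factor(0),
    "hours": 1.0,
}

executive_loan = max_loan_for(150_000, executive)
operator_loan = max_loan_for(100_000, operator)
print(executive_loan, operator_loan)
```

The second applicant's adjusted amount falls below 90,000 and is pulled up to the floor by the final clamp, which is worth keeping in mind when reading the distribution of `max_loan`.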


In [None]:
import numpy as np
import pandas as pd
from fairlearn.datasets import fetch_adult
from sklearn.datasets import make_regression
from tabpfn import TabPFNRegressor

regressor_model = TabPFNRegressor()

In [None]:
def getDataset():
    data = fetch_adult(as_frame=True)
    df = data.data.copy()
    # Map the income label to a binary loan_status (1 = eligible) without mutating data.target in place
    df['loan_status'] = data.target.astype(str).map({"<=50K": 0, ">50K": 1})
    
    # Create max_loan column only for loan_status == 1
    df['max_loan'] = 0
    
    # Filter for eligible loan applicants (loan_status == 1)
    eligible_mask = df['loan_status'] == 1
    eligible_df = df[eligible_mask].copy()
    
    if len(eligible_df) > 0:
        # Set random seed for reproducibility
        np.random.seed(42)
        
        # Initialize base loan amounts (90k to 300k in 5k intervals)
        loan_amounts = np.arange(90000, 305000, 5000)
        base_loans = np.random.choice(loan_amounts, size=len(eligible_df))
        
        # Apply adjustments based on various factors
        loan_adjustments = np.ones(len(eligible_df))
        
        # 1. Age factor (age < 30 gets lower amounts, but not strictly linear)
        age_factor = np.where(eligible_df['age'] < 30, 
                             np.random.uniform(0.7, 0.9, size=len(eligible_df)),
                             np.where(eligible_df['age'] > 50,
                                     np.random.uniform(0.9, 1.1, size=len(eligible_df)),
                                     np.random.uniform(0.85, 1.05, size=len(eligible_df))))
        loan_adjustments *= age_factor
        
        # 2. Workclass factor (Federal-gov > State-gov > Local-gov > Private > Self-emp-not-inc > Without-pay > Never-worked)
        workclass_multipliers = {
            'Federal-gov': 1.2,
            'State-gov': 1.15,
            'Local-gov': 1.1,
            'Private': 1.0,
            'Self-emp-inc': 0.95,
            'Self-emp-not-inc': 0.85,
            'Without-pay': 0.6,
            'Never-worked': 0.5
        }
        # Look up each workclass; missing or unlisted values fall back to 0.8
        workclass_factor = (
            eligible_df['workclass'].astype(object)
            .map(workclass_multipliers)
            .fillna(0.8)
            .to_numpy(dtype=float)
        )
        loan_adjustments *= workclass_factor
        
        # 3. Education factor (positive correlation with education-num)
        # Scales from about 0.74 (education-num = 1) to 1.3 (education-num = 16)
        education_factor = 0.7 + (eligible_df['education-num'] / 16) * 0.6
        loan_adjustments *= education_factor
        
        # 4. Occupation factor (risk-based assessment)
        occupation_multipliers = {
            'Exec-managerial': 1.3,
            'Prof-specialty': 1.25,
            'Tech-support': 1.1,
            'Sales': 1.0,
            'Adm-clerical': 0.95,
            'Protective-serv': 0.9,
            'Craft-repair': 0.85,
            'Transport-moving': 0.8,
            'Machine-op-inspct': 0.75,
            'Other-service': 0.7,
            'Farming-fishing': 0.65,
            'Handlers-cleaners': 0.6,
            'Priv-house-serv': 0.55,
            'Armed-Forces': 1.05
        }
        # Look up each occupation; missing or unlisted values fall back to 0.8
        occupation_factor = (
            eligible_df['occupation'].astype(object)
            .map(occupation_multipliers)
            .fillna(0.8)
            .to_numpy(dtype=float)
        )
        loan_adjustments *= occupation_factor
        
        # 6. Capital gain/loss factor (net capital has positive correlation)
        net_capital = eligible_df['capital-gain'] - eligible_df['capital-loss']
        # Normalize capital gains impact (cap the effect to avoid extreme values)
        capital_factor = 1.0 + np.clip(net_capital / 100000, -0.2, 0.3)
        loan_adjustments *= capital_factor
        
        # 7. Hours per week factor (no penalty for 35-50 hours; lower outside that range)
        hours_factor = np.where(eligible_df['hours-per-week'] < 35, 0.85,
                               np.where(eligible_df['hours-per-week'] > 50, 0.9, 1.0))
        loan_adjustments *= hours_factor
        
        # Apply all adjustments to base loan amounts
        final_loans = base_loans * loan_adjustments
        
        # Round to nearest 5000 and ensure within bounds
        final_loans = np.round(final_loans / 5000) * 5000
        final_loans = np.clip(final_loans, 90000, 300000)
        
        # Assign the calculated loan amounts back to the main dataframe
        df.loc[eligible_mask, 'max_loan'] = final_loans.astype(int)
    
    return df

df = getDataset()
print(df.head())
print("\nLoan statistics for eligible applicants:")
print(df[df['loan_status'] == 1]['max_loan'].describe())
df.to_csv('../data/loan_data_with_max_loan.csv', index=False)

   age  workclass  fnlwgt     education  education-num      marital-status  \
0   25    Private  226802          11th              7       Never-married   
1   38    Private   89814       HS-grad              9  Married-civ-spouse   
2   28  Local-gov  336951    Assoc-acdm             12  Married-civ-spouse   
3   44    Private  160323  Some-college             10  Married-civ-spouse   
4   18        NaN  103497  Some-college             10       Never-married   

          occupation relationship   race     sex  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male             0             0   
1    Farming-fishing      Husband  White    Male             0             0   
2    Protective-serv      Husband  White    Male             0             0   
3  Machine-op-inspct      Husband  Black    Male          7688             0   
4                NaN    Own-child  White  Female             0             0   

   hours-per-week native-country loan_status  max_
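A quick sanity check on the generated target can catch rule violations before the data is used for training. The sketch below uses only the standard library and checks the three invariants stated above: ineligible rows carry 0, and eligible rows carry a multiple of 5,000 between 90,000 and 300,000. The function name `check_max_loan` and the sample rows are hypothetical; with the real DataFrame, the rows could be obtained with `df.to_dict('records')`.

```python
def check_max_loan(rows, step=5_000, low=90_000, high=300_000):
    """Return a list of (row_index, reason) pairs for rows that break the
    max_loan rules; an empty list means every row passed."""
    problems = []
    for i, row in enumerate(rows):
        status, loan = row["loan_status"], row["max_loan"]
        if status != 1:
            # Ineligible applicants must not receive a loan amount
            if loan != 0:
                problems.append((i, "ineligible row has a non-zero max_loan"))
            continue
        if not low <= loan <= high:
            problems.append((i, "max_loan outside the allowed range"))
        elif loan % step != 0:
            problems.append((i, "max_loan is not a multiple of the step"))
    return problems

# Hypothetical rows for illustration only
sample_rows = [
    {"loan_status": 0, "max_loan": 0},
    {"loan_status": 1, "max_loan": 125_000},
    {"loan_status": 1, "max_loan": 92_500},    # not a multiple of 5,000
    {"loan_status": 1, "max_loan": 310_000},   # above the upper bound
    {"loan_status": 0, "max_loan": 95_000},    # ineligible but non-zero
]

problems = check_max_loan(sample_rows)
for index, reason in problems:
    print(index, reason)
```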



### 🧠 About TabPFN Foundation Model:

A substantial foundation model, with over 11 million parameters in its regressor variant, trained on diverse structured data to learn general patterns that transfer to new tabular datasets for both regression and classification tasks.

- **Pre-trained** on millions of synthetic tabular datasets
- **Transformer-based** architecture optimized for tabular data
- **Foundation model** that can adapt to new tasks with minimal data
- **Ensemble approach** uses multiple models for robust predictions
- **Efficient** for small datasets (≤10K samples)

- **More details are available here:**
https://github.com/PriorLabs/TabPFN


- **A sneak peek at the teacher model**

In [None]:
# Create dummy data to fit the model (this will load the underlying PyTorch model)
print("Creating dummy data and fitting the model...")
X_dummy, y_dummy = make_regression(n_samples=100, n_features=5, random_state=42)

# Fit the model - this will load the underlying PyTorch model
regressor_model.fit(X_dummy, y_dummy)

print("Model fitted! Now analyzing the TabPFN model parameters...")

# Get the underlying PyTorch model
model = regressor_model.model_

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("\n" + "="*60)
print("TABPFN REGRESSOR MODEL ANALYSIS")
print("="*60)

print(f" Model Architecture: {model.__class__.__name__}")
print(f" Total Parameters: {total_params:,}")
print(f" Trainable Parameters: {trainable_params:,}")
print(f" Model Size: {(total_params * 4) / (1024**2):.2f} MB")

print(f"\n Ensemble Configuration:")
print(f"   Number of estimators: {regressor_model.n_estimators}")
print(f"   Total ensemble parameters: {total_params * regressor_model.n_estimators:,}")

# Show parameter distribution by layer type
layer_counts = {}
for name, param in model.named_parameters():
    layer_type = name.split('.')[0]
    layer_counts[layer_type] = layer_counts.get(layer_type, 0) + param.numel()

print(f"\n Parameter Distribution:")
for layer_type, count in sorted(layer_counts.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / total_params) * 100
    print(f"   {layer_type}: {count:,} ({percentage:.1f}%)")

Creating dummy data and fitting the model...
Model fitted! Now analyzing the TabPFN model parameters...

TABPFN REGRESSOR MODEL ANALYSIS
 Model Architecture: PerFeatureTransformer
 Total Parameters: 11,081,864
 Trainable Parameters: 11,081,864
 Model Size: 42.27 MB

 Ensemble Configuration:
   Number of estimators: 8
   Total ensemble parameters: 88,654,912

 Key Facts about TabPFN:
   • Foundation model trained on millions of synthetic datasets
   • Uses Per-Feature Transformer architecture
   • Optimized for small tabular datasets (≤10K samples)
   • Each model in ensemble has ~11M parameters
   • Default ensemble uses 8 models for robust predictions

 Parameter Distribution:
   transformer_encoder: 7,077,888 (63.9%)
   decoder_dict: 3,993,224 (36.0%)
   feature_positional_embedding_embeddings: 9,408 (0.1%)
   encoder: 768 (0.0%)
   y_encoder: 576 (0.0%)
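The reported figures can be cross-checked with plain arithmetic. The sketch below takes the per-component counts printed above and recomputes the total, the model size, and the percentage shares. As in the notebook's own size formula, it assumes 4 bytes (float32) per parameter; the variable names are illustrative.

```python
# Per-component parameter counts as printed in the output above
layer_counts = {
    "transformer_encoder": 7_077_888,
    "decoder_dict": 3_993_224,
    "feature_positional_embedding_embeddings": 9_408,
    "encoder": 768,
    "y_encoder": 576,
}

total_params = sum(layer_counts.values())

# Assumption: every parameter is stored as float32 (4 bytes)
size_mb = total_params * 4 / 1024**2

# Percentage share of each component, rounded to one decimal place
shares = {
    name: round(100 * count / total_params, 1)
    for name, count in layer_counts.items()
}

# Nominal figure the notebook reports as "total ensemble parameters"
n_estimators = 8
nominal_ensemble_params = total_params * n_estimators

print(f"total: {total_params:,}")
print(f"size: {size_mb:.2f} MB")
print(f"nominal ensemble total: {nominal_ensemble_params:,}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The component counts sum exactly to the reported total, so the distribution table accounts for every parameter in the model.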

In [None]:
# Utility function to get TabPFN parameter count
def get_tabpfn_parameter_count(model_type='regressor'):
    """
    Get the parameter count of a TabPFN model by fitting it on a small
    dummy dataset, which loads the underlying pretrained PyTorch model.
    
    Args:
        model_type (str): 'regressor' or 'classifier'
    
    Returns:
        dict: Dictionary containing parameter information
    """
    from tabpfn import TabPFNRegressor, TabPFNClassifier
    from sklearn.datasets import make_regression, make_classification
    
    # Choose the appropriate model and dummy data
    if model_type.lower() == 'regressor':
        model = TabPFNRegressor()
        X_dummy, y_dummy = make_regression(n_samples=50, n_features=5, random_state=42)
    else:
        model = TabPFNClassifier()
        X_dummy, y_dummy = make_classification(n_samples=50, n_features=5, n_classes=2, random_state=42)
    
    # Fit with minimal data to load the model
    model.fit(X_dummy, y_dummy)
    
    # Get parameter count
    pytorch_model = model.model_
    total_params = sum(p.numel() for p in pytorch_model.parameters())
    trainable_params = sum(p.numel() for p in pytorch_model.parameters() if p.requires_grad)
    
    return {
        'model_type': model_type,
        'architecture': pytorch_model.__class__.__name__,
        'total_parameters': total_params,
        'trainable_parameters': trainable_params,
        'model_size_mb': (total_params * 4) / (1024**2),
        'n_estimators': model.n_estimators,
        'ensemble_total_params': total_params * model.n_estimators
    }

# Example usage
print("Getting TabPFN parameter information...")
regressor_info = get_tabpfn_parameter_count('regressor')
classifier_info = get_tabpfn_parameter_count('classifier')

print("\n TABPFN PARAMETER SUMMARY")
print("="*50)
print(f"Regressor Parameters: {regressor_info['total_parameters']:,}")
print(f"Classifier Parameters: {classifier_info['total_parameters']:,}")
print(f"Architecture: {regressor_info['architecture']}")
print(f"Single Model Size: {regressor_info['model_size_mb']:.1f} MB")
print(f"Default Ensemble Size: {regressor_info['n_estimators']} models")
print(f"Total Ensemble Parameters: {regressor_info['ensemble_total_params']:,}")

print(f"\n Answer to your question:")
print(f"The TabPFN foundation model has {regressor_info['total_parameters']:,} parameters")
print(f"When used as an ensemble (default), it uses {regressor_info['ensemble_total_params']:,} total parameters")

Getting TabPFN parameter information...

📊 TABPFN PARAMETER SUMMARY
Regressor Parameters: 11,081,864
Classifier Parameters: 7,244,554
Architecture: PerFeatureTransformer
Single Model Size: 42.3 MB
Default Ensemble Size: 8 models
Total Ensemble Parameters: 88,654,912

💡 Answer to your question:
The TabPFN foundation model has 11,081,864 parameters
When used as an ensemble (default), it uses 88,654,912 total parameters


## 🔍 TabPFN Model Parameters - Summary


### 📊 Key Findings:

1. **TabPFNRegressor**: **11,081,864 parameters** (~11.1M)
2. **TabPFNClassifier**: **7,244,554 parameters** (~7.2M) 
3. **Architecture**: PerFeatureTransformer (Transformer-based)
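4. **Model Size**: ~42.3 MB for the regressor, assuming 4 bytes (float32) per parameter
5. **Ensemble**: 8 estimators by default; the notebook reports 8 × 11,081,864 = **88,654,912 total parameters**. This is a nominal figure: the TabPFN documentation describes the estimators as repeated forward passes over slightly varied inputs, so if they share one set of pretrained weights, the number of distinct parameters remains ~11.1M.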

