# README ----

Aave V2 Wallet Credit Scoring Model
This project develops a machine learning model to assign a credit score (0-1000) to Aave V2 wallets based on their historical transaction behavior. The score aims to reflect a wallet's reliability and responsible usage, with higher scores indicating better behavior and lower scores indicating riskier or potentially exploitative patterns.

Challenge Overview
The task was to analyze raw, transaction-level data from the Aave V2 protocol, specifically focusing on actions like deposit, borrow, repay, redeemUnderlying, and liquidationCall. Based on this data, a robust machine learning model was developed to assign a credit score to each unique wallet.

Data Source
The model is trained and operates on a sample of user-transactions provided as a JSON file (~87MB). This file contains transaction-level details including wallet, action, value, and timestamp.
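
As a rough illustration of the expected input shape, the sketch below writes two made-up records to a temporary file and loads them back. The field names (userWallet, txHash, action, actionData.amount, timestamp) follow what the notebook code reads; the wallet labels, amounts, and the file name sample-transactions.json are invented for this example.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Two made-up records that mimic the fields the notebook reads.
sample_records = [
    {"userWallet": "0xaaa", "txHash": "0x01", "action": "deposit",
     "actionData": {"amount": "2000000000"}, "timestamp": 1629178166},
    {"userWallet": "0xaaa", "txHash": "0x02", "action": "borrow",
     "actionData": {"amount": "500000000"}, "timestamp": 1629264566},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample-transactions.json"  # hypothetical file name
    sample_path.write_text(json.dumps(sample_records))
    with open(sample_path, "r") as f:
        records = json.load(f)

# json_normalize flattens the nested 'actionData' dict into dotted column names.
sample_df = pd.json_normalize(records)
sample_df["timestamp"] = pd.to_datetime(sample_df["timestamp"], unit="s")
sample_df["amount"] = pd.to_numeric(sample_df["actionData.amount"], errors="coerce").fillna(0)
print(sample_df[["userWallet", "action", "amount", "timestamp"]])
```

The notebook itself extracts the amount with an apply over actionData rather than json_normalize; both approaches yield one numeric value per transaction.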

Feature Engineering Strategy
From the raw transaction data, the following features were engineered to capture various aspects of a wallet's interaction with the Aave V2 protocol. These features are designed to proxy creditworthiness:

deposit_count / total_value_deposited: Number and total value of assets deposited. Indicates participation and capital provided.

borrow_count / total_value_borrowed: Number and total value of assets borrowed. Indicates reliance on protocol's lending.

repay_count / total_value_repaid: Number and total value of repaid loans. Crucial for creditworthiness.

redeem_count / total_value_redeemed: Number and total value of redeemed collateral. Frequent redemptions might suggest short-term engagement.

liquidation_count / total_value_liquidated: Number and total value of liquidations incurred by the wallet. A strong negative indicator, reflecting failure to maintain collateral.

account_age_days: Duration between the first and last transaction. Longer activity periods might imply stability.

borrow_repay_ratio: total_value_repaid / total_value_borrowed (despite its name, this is repaid divided by borrowed). A ratio near or above 1 indicates that borrowed funds were repaid in full, which is consistent good behavior. A ratio near 0 indicates largely unrepaid debt.

net_borrow_exposure: total_value_borrowed - total_value_repaid. Represents the outstanding borrowed amount. Higher values are riskier.

liquidation_rate: liquidation_count / total_transactions. Frequency of liquidations relative to total activity.
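
The three derived features above can be sketched on a toy table of per-wallet aggregates. The wallet labels and numbers are made up, and assigning a ratio of 0 to wallets that never borrowed is an assumption of this sketch.

```python
import pandas as pd

# Made-up per-wallet aggregates, standing in for the grouped transaction data.
agg = pd.DataFrame(
    {
        "total_transactions": [10, 4, 3],
        "liquidation_count": [0, 1, 0],
        "total_value_borrowed": [1000.0, 500.0, 0.0],
        "total_value_repaid": [1000.0, 100.0, 0.0],
    },
    index=["wallet_a", "wallet_b", "wallet_c"],
)

borrowed = agg["total_value_borrowed"]
# Assumption: wallets that never borrowed get a ratio of 0 instead of a division by zero.
agg["borrow_repay_ratio"] = (agg["total_value_repaid"] / borrowed.where(borrowed > 0)).fillna(0)
# Positive exposure means debt is still outstanding.
agg["net_borrow_exposure"] = borrowed - agg["total_value_repaid"]
agg["liquidation_rate"] = agg["liquidation_count"] / agg["total_transactions"]

print(agg[["borrow_repay_ratio", "net_borrow_exposure", "liquidation_rate"]])
```

Here wallet_b repaid only a fifth of what it borrowed and was liquidated once in four transactions, so all three derived features flag it as the riskiest of the three.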

Score Logic and Interpretation (0-1000)
Since explicit credit scores were not provided, a heuristic target score was defined to train the supervised learning model. This heuristic assigns a score between 0 and 1000 based on predefined rules that reflect 'good' vs. 'bad' DeFi behavior:

Higher Scores (e.g., 700-1000): Indicative of wallets that consistently repay their loans, have a low (ideally zero) number of liquidations, show substantial value repaid, and demonstrate sustained activity on the platform. These wallets represent reliable and responsible usage.

Mid Scores (e.g., 400-699): May represent wallets with mixed activity, occasional borrowing and repayment, but perhaps some unrepaid exposure or limited history.

Lower Scores (e.g., 0-399): Reflect wallets that have experienced liquidations, have a significant net_borrow_exposure, or exhibit patterns suggestive of risky, bot-like, or exploitative behavior (e.g., many small, rapid transactions that could be part of an exploit, although further analysis would be needed to confirm 'bot-like' specifically).

The Gradient Boosting Regressor then learns to map the engineered features to this heuristic score.
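
To make the arithmetic concrete, here is a scalar sketch of the heuristic for a single wallet. The point values (base 500, the reward and penalty caps) mirror the notebook's assign_heuristic_score function; the function name heuristic_score, its parameters, and the two example wallets are hypothetical.

```python
import math


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def heuristic_score(ratio, repaid, max_repaid, age_days, max_age_days,
                    liquidation_count, liquidation_rate, exposure, max_exposure):
    """Scalar version of the heuristic; the max_* values are dataset-wide maxima."""
    score = 500.0                                               # base score
    score += clamp(math.log1p(ratio) * 50, 0, 300)              # repayment ratio reward
    if max_repaid > 0:
        score += clamp(repaid / max_repaid * 100, 0, 100)       # value repaid reward
    if max_age_days > 0:
        score += clamp(age_days / max_age_days * 50, 0, 50)     # account age reward
    score -= clamp(liquidation_count * 100, 0, 500)             # liquidation penalty
    score -= clamp(liquidation_rate * 200, 0, 200)              # liquidation rate penalty
    if max_exposure > 0:
        score -= clamp(exposure / max_exposure * 150, 0, 150)   # unrepaid debt penalty
    return int(round(clamp(score, 0, 1000)))


# Two made-up wallets scored against the same made-up dataset maxima.
reliable = heuristic_score(ratio=1.0, repaid=1000, max_repaid=2000, age_days=100,
                           max_age_days=200, liquidation_count=0, liquidation_rate=0.0,
                           exposure=0, max_exposure=400)
liquidated = heuristic_score(ratio=0.2, repaid=100, max_repaid=2000, age_days=20,
                             max_age_days=200, liquidation_count=1, liquidation_rate=0.25,
                             exposure=400, max_exposure=400)
print(reliable, liquidated)
```

Because several terms are normalized by dataset-wide maxima, a wallet's heuristic score depends on the other wallets in the file, not only on its own history.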

Model Architecture
A Gradient Boosting Regressor from scikit-learn was chosen for this task.

Reasoning: Gradient Boosting models are powerful, robust to various data types, capable of capturing non-linear relationships, and provide insights into feature importance, which aids in explaining the score logic.

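Preprocessing: Features are standardized with StandardScaler (zero mean, unit variance). Gradient Boosting is tree-based and therefore largely insensitive to feature magnitude, so this step mainly keeps preprocessing identical between training and the one-step scoring script.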
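
A minimal sketch of this architecture is shown below, using synthetic data rather than the Aave dataset. Wrapping the scaler and regressor in a scikit-learn Pipeline is a choice made for this example (the notebook fits and saves them separately), and the target formula is invented so that the feature importances have a known expected ordering.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: NOT the Aave dataset.
rng = np.random.default_rng(42)
feature_names = ["borrow_repay_ratio", "liquidation_count", "account_age_days"]
X = rng.normal(size=(300, 3))
# Hypothetical target: driven mostly by column 0, weakly by column 1, not at all by column 2.
y = 500 + 100 * X[:, 0] - 50 * X[:, 1] + rng.normal(scale=5, size=300)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                        max_depth=3, random_state=42)),
])
pipeline.fit(X, y)

# Impurity-based importances of the fitted regressor; they sum to 1.
importances = pipeline.named_steps["model"].feature_importances_
for name, importance in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {importance:.3f}")
```

Keeping both steps in one Pipeline object means a single joblib file could hold the whole preprocessing and prediction chain, at the cost of diverging from the two-file layout the notebook uses.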

How to Set Up and Run the Code
Download the Data:
Download the user-transactions.json file from:
https://drive.google.com/file/d/1ISFbAXxadMrt7Zl96rmzzZmEKZnyW7FS/view?usp=sharing
Place this file in the root directory of your project.

Clone the Repository:

git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name

Create and Activate Virtual Environment:

python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

Install Dependencies:

pip install pandas numpy scikit-learn joblib jupyterlab

Run the Jupyter Notebook:

jupyter lab

Open the aave_credit_score.ipynb notebook and run all cells sequentially. This will:

Load the data.

Engineer features.

Define and assign heuristic scores.

Train the ML model and save model.joblib and scaler.joblib.

Demonstrate the "one-step" scoring process by loading the saved model and generating scores, saving them to wallet_credit_scores_generated.json.

Output:
The final wallet credit scores will be saved in wallet_credit_scores_generated.json in the same directory.
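
For orientation, the sketch below writes a small made-up score table with the same to_json call the notebook uses (orient='records', indent=4) and reads it back. The wallet labels and scores are invented, and a temporary directory is used so that no real output file is overwritten.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Made-up scores in the same two-column layout the notebook produces.
scores = pd.DataFrame({
    "wallet": ["0xaaa", "0xbbb", "0xccc"],
    "credit_score": [812, 455, 120],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "wallet_credit_scores_generated.json"
    scores.to_json(out_path, orient="records", indent=4)
    with open(out_path, "r") as f:
        records = json.load(f)  # a list of {"wallet": ..., "credit_score": ...} dicts

loaded = pd.DataFrame(records)
# Example query: the two lowest-scoring wallets.
lowest = loaded.sort_values("credit_score").head(2)
print(lowest)
```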

Extensibility and Future Improvements
Advanced Feature Engineering: Incorporate time-series analysis (e.g., moving averages of debt, repayment streaks), asset-specific features (volatility of collateral assets), and network graph analysis (interactions between wallets).

External Data Integration: Leverage external DeFi data sources (e.g., oracle prices, total value locked, token liquidity) for richer context.

Dynamic Target Definition: Replace the hand-written heuristic with a more sophisticated definition of "creditworthiness", for example an unsupervised or semi-supervised approach, or a fully supervised one if labeled data becomes available.

Explainable AI (XAI): Implement techniques like SHAP or LIME to provide more granular explanations for individual wallet scores, beyond just feature importance.

Temporal Validation: Evaluate the model's performance on future data to ensure its robustness over time.

Risk Categorization: Instead of just a score, categorize wallets into risk buckets (e.g., Low Risk, Medium Risk, High Risk).
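
As one possible starting point for the risk categorization idea, scores can be binned with pandas.cut. The thresholds below reuse the example bands from the score interpretation section (0-399, 400-699, 700-1000); the bucket labels and sample scores are assumptions made for this sketch.

```python
import pandas as pd

# Made-up scores covering the boundaries of each band.
credit_scores = pd.Series([0, 399, 400, 699, 700, 1000], name="credit_score")

# Bins are right-inclusive: (-1, 399], (399, 699], (699, 1000].
risk_bucket = pd.cut(
    credit_scores,
    bins=[-1, 399, 699, 1000],
    labels=["High Risk", "Medium Risk", "Low Risk"],
)

print(pd.DataFrame({"credit_score": credit_scores, "risk_bucket": risk_bucket}))
```

Starting the first bin at -1 keeps a score of exactly 0 inside the High Risk bucket, since pandas.cut excludes the left edge of each bin by default.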

# ALL NECESSARY LIBRARIES ----

import json
import pandas as pd                 # For data analysis and finding insights in the data
import numpy as np                  # For numerical computations
import re                          # For regular expressions, if needed for complex parsing
from sklearn.model_selection import train_test_split     # For splitting data into training (80%) and testing (20%) sets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler         #Feature Scaling
from sklearn.metrics import mean_squared_error, r2_score
import joblib                       # For saving/loading the trained model and scaler
# Suppress warnings for cleaner output in notebook
import warnings
warnings.filterwarnings('ignore')

print("All necessary libraries imported successfully.")




# DATA LOADING ----

def load_data(json_file_path):
    """
    Loads transaction data from a JSON file into a pandas DataFrame.
    Converts 'timestamp' to datetime objects.

    Args:
        json_file_path (str): The path to the user-transactions.json file.

    Returns:
        pd.DataFrame: DataFrame containing the transaction data
        (empty if the file could not be loaded).
    """
    print(f"Loading data from {json_file_path}...")
    try:
        with open(json_file_path, 'r') as f:
            data = json.load(f)
        df = pd.DataFrame(data)
        # Convert 'timestamp' from Unix timestamp (seconds) to datetime
        df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
        print(f"Data loaded successfully. Total records: {len(df)}")
        print("First 5 rows of data:")
        print(df.head())
        print("\nData Info:")
        df.info()
        return df
    except FileNotFoundError:
        print(f"Error: The file '{json_file_path}' was not found.")
        print("Please ensure you have downloaded 'user-transactions.json' and placed it in the same directory as this notebook.")
        return pd.DataFrame() # Return empty DataFrame on error
    except Exception as e:
        print(f"An error occurred while loading data: {e}")
        return pd.DataFrame()

# Execute Data Loading ---
transactions_df = load_data('user-transactions.json')

# Check if data loaded successfully before proceeding
if transactions_df.empty:
    raise SystemExit("Data loading failed. Exiting notebook execution.")




# FEATURE ENGINEERING ----

def engineer_features(df):
    """
    Engineers relevant features from the raw transaction DataFrame for credit scoring.

    Args:
        df (pd.DataFrame): The input DataFrame of Aave V2 transactions.

    Returns:
        tuple: A tuple containing:
            - pd.DataFrame: DataFrame of engineered features.
            - pd.Series: Series of wallet addresses corresponding to the features.
    """
    print("\nEngineering features from transaction data...")

    # Extract 'amount' from the 'actionData' dictionary of each transaction.
    # Rows whose 'actionData' is not a dict, or that lack 'amount', default to '0'
    # (a string) before the numeric conversion; unparseable values also become 0.
    df['value_numeric'] = df['actionData'].apply(
        lambda x: x.get('amount', '0') if isinstance(x, dict) else '0'
    )
    df['value_numeric'] = pd.to_numeric(df['value_numeric'], errors='coerce').fillna(0)
    print("Extracted 'amount' from 'actionData' and converted to numeric 'value_numeric'.")


    # Normalize action names to lowercase so matching does not depend on the
    # casing used in the source file (e.g. 'Deposit' vs. 'deposit').
    action = df['action'].astype(str).str.lower()

    def count_of(name):
        # Counts the rows of a wallet's group whose action equals `name`
        return lambda x: (action.loc[x.index] == name).sum()

    def value_of(name):
        # Sums 'value_numeric' over the rows of a wallet's group whose action equals `name`
        return lambda x: x[action.loc[x.index] == name].sum()

    # Group by wallet ('userWallet') to create aggregated features
    wallet_features = df.groupby('userWallet').agg(
        # Transaction counts, using 'txHash' as the transaction identifier
        total_transactions=('txHash', 'count'),
        deposit_count=('action', count_of('deposit')),
        borrow_count=('action', count_of('borrow')),
        repay_count=('action', count_of('repay')),
        redeem_count=('action', count_of('redeemunderlying')),
        liquidation_count=('action', count_of('liquidationcall')),

        # Sum of values for each action type
        total_value_deposited=('value_numeric', value_of('deposit')),
        total_value_borrowed=('value_numeric', value_of('borrow')),
        total_value_repaid=('value_numeric', value_of('repay')),
        total_value_redeemed=('value_numeric', value_of('redeemunderlying')),
        total_value_liquidated=('value_numeric', value_of('liquidationcall')),

        # Time-based Features: First and last transaction timestamps
        first_transaction_time=('timestamp', 'min'),
        last_transaction_time=('timestamp', 'max'),
    ).reset_index()

    # Calculate derived features
    # Account age in days: Difference between last and first transaction
    wallet_features['account_age_days'] = (wallet_features['last_transaction_time'] - wallet_features['first_transaction_time']).dt.days
    # Handle cases where account age might be 0 (single transaction) or NaN
    wallet_features['account_age_days'] = wallet_features['account_age_days'].fillna(0)


    # Repay-to-borrow ratio (stored as 'borrow_repay_ratio'): total repaid / total borrowed.
    # Higher is generally better for creditworthiness.
    # Wallets that never borrowed have no meaningful ratio. Dividing by NaN for them and
    # filling with 0 avoids both division by zero and the huge values an epsilon
    # denominator would produce when something was repaid but nothing borrowed.
    borrowed = wallet_features['total_value_borrowed']
    wallet_features['borrow_repay_ratio'] = (
        wallet_features['total_value_repaid'] / borrowed.where(borrowed > 0)
    ).fillna(0)


    # Net Borrow Exposure: Total borrowed minus total repaid. Positive means outstanding debt.
    wallet_features['net_borrow_exposure'] = wallet_features['total_value_borrowed'] - wallet_features['total_value_repaid']

    # Liquidation Rate: Frequency of liquidations relative to total activity
    wallet_features['liquidation_rate'] = wallet_features['liquidation_count'] / (wallet_features['total_transactions'] + 1e-9)
    wallet_features['liquidation_rate'] = wallet_features['liquidation_rate'].fillna(0) # Fill NaN if total_transactions is 0


    # Define the list of features that will be used by the ML model
    features_for_model = [
        'deposit_count', 'borrow_count', 'repay_count', 'redeem_count', 'liquidation_count',
        'total_value_deposited', 'total_value_borrowed', 'total_value_repaid',
        'total_value_redeemed', 'total_value_liquidated',
        'account_age_days', 'borrow_repay_ratio', 'net_borrow_exposure', 'liquidation_rate'
    ]

    print("Features engineered successfully. First 5 rows of features:")
    print(wallet_features[features_for_model].head())
    print(f"\nTotal wallets processed: {len(wallet_features)}")

    return wallet_features[features_for_model], wallet_features['userWallet'] # Returning 'userWallet' as the wallet identifier

# --- Execute Feature Engineering ---
features_df, wallets = engineer_features(transactions_df)

# Check if features were engineered successfully
if features_df.empty:
    raise SystemExit("Feature engineering failed. Exiting notebook execution.")




# HEURISTIC TARGET SCORE ----

def assign_heuristic_score(features_df):
    """
    Assigns a heuristic credit score (0-1000) to each wallet based on engineered features.
    This score serves as a proxy target variable for the supervised learning model.
    The logic here defines what 'good' and 'bad' behavior means in our context.

    Args:
        features_df (pd.DataFrame): DataFrame of engineered features.

    Returns:
        pd.Series: A Series of heuristic credit scores for each wallet.
    """
    print("\nAssigning heuristic credit scores (0-1000) for model training...")

    # Initialize all scores to a base value (e.g., 500)
    scores = pd.Series(np.full(len(features_df), 500), index=features_df.index)

    # --- Reward Positive Behaviors ---
    # 1. Reward good repayment behavior (higher borrow_repay_ratio)
    # Max 300 points for a very good ratio. Using log scale to dampen extreme values.
    scores += (np.log1p(features_df['borrow_repay_ratio']) * 50).clip(0, 300)

    # 2. Reward significant total value repaid (indicates active, responsible usage)
    # Normalized by max total repaid value to get a score between 0-100
    if features_df['total_value_repaid'].max() > 0:
        scores += (features_df['total_value_repaid'] / features_df['total_value_repaid'].max() * 100).fillna(0).clip(0, 100)

    # 3. Reward account age (longer active accounts might be more reliable)
    # Normalized by max account age to get a score between 0-50
    if features_df['account_age_days'].max() > 0:
        scores += (features_df['account_age_days'] / features_df['account_age_days'].max() * 50).fillna(0).clip(0, 50)


    # --- Penalize Negative Behaviors ---
    # 1. Penalize liquidations heavily (strongest negative signal)
    # Each liquidation reduces score significantly. Max reduction 500 points.
    scores -= (features_df['liquidation_count'] * 100).clip(0, 500)

    # 2. Penalize high liquidation rate
    # Higher rate means more frequent failures. Max reduction 200 points.
    scores -= (features_df['liquidation_rate'] * 200).clip(0, 200)

    # 3. Penalize significant net borrow exposure (unrepaid debt)
    # Normalized by max net exposure. Max reduction 150 points.
    if features_df['net_borrow_exposure'].max() > 0:
        scores -= (features_df['net_borrow_exposure'] / features_df['net_borrow_exposure'].max() * 150).fillna(0).clip(0, 150)


    # Clamp scores between 0 and 1000 to ensure they stay within the desired range
    scores = scores.clip(0, 1000).round(0).astype(int)

    print("Heuristic scores assigned. First 5 scores:")
    print(scores.head())
    print(f"Score distribution (min, max, mean): {scores.min()}, {scores.max()}, {scores.mean():.2f}")

    return scores

# --- Execute Heuristic Score Assignment ---
heuristic_scores = assign_heuristic_score(features_df)



# MODEL TRAINING AND SAVING ----

def train_and_save_model(features_df, target_scores, wallets):
    """
    Trains a Gradient Boosting Regressor model, evaluates it, and saves the
    trained model and the StandardScaler for future use.

    Args:
        features_df (pd.DataFrame): DataFrame of engineered features.
        target_scores (pd.Series): Series of heuristic credit scores (target variable).
        wallets (pd.Series): Series of wallet addresses.

    Returns:
        pd.DataFrame: DataFrame containing wallet addresses and their predicted credit scores.
    """
    print("\nStarting model training and evaluation...")

    # 1. Scale Features: Standardize features to have mean 0 and variance 1.
    # Tree-based models such as Gradient Boosting split on thresholds, so they are
    # largely insensitive to feature scale; the scaler is kept so that training and
    # the one-step scoring script share the same preprocessing.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features_df)
    print("Features scaled.")

    # 2. Split Data: Divide data into training and testing sets
    # 80% for training, 20% for testing to evaluate model performance.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, target_scores, test_size=0.2, random_state=42         # fixed random_state for reproducibility
    )
    print(f"Data split into training ({len(X_train)} samples) and testing ({len(X_test)} samples).")

    # 3. Initialize and Train Model: Gradient Boosting Regressor
    # n_estimators: number of boosting stages
    # learning_rate: shrinks the contribution of each tree
    # max_depth: maximum depth of each individual regression tree
    model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
    print("Training Gradient Boosting Regressor model...")
    model.fit(X_train, y_train)
    print("Model training complete.")

    # 4. Evaluate Model Performance on the test set
    # Note: the target is the heuristic score, so these metrics measure how well the
    # model reproduces the heuristic, not real-world credit outcomes.
    y_pred_test = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    r2 = r2_score(y_test, y_pred_test)
    print("Model Performance on Test Set:")
    print(f"  Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"  R-squared (R2) Score: {r2:.2f}")
    print("An R2 score closer to 1 indicates a better fit; RMSE is the typical prediction error in score points.")


    # 5. Predict Scores for ALL Wallets (using the full dataset)
    # Transform the entire feature set using the *fitted* scaler
    all_wallet_scores_predicted = model.predict(scaler.transform(features_df))

    # Clamp scores to the 0-1000 range and round to nearest integer
    final_scores_df = pd.DataFrame({
        'wallet': wallets,
        'credit_score': all_wallet_scores_predicted.clip(0, 1000).round(0).astype(int)
    })
    # 6. Save the trained model and scaler for the one-step script
    # This allows the scoring script to use the exact same transformation and prediction logic
    joblib.dump(scaler, 'scaler.joblib')
    joblib.dump(model, 'model.joblib')
    print("\nModel and Scaler saved as 'model.joblib' and 'scaler.joblib' in the current directory.")

    print("\nFirst 10 predicted credit scores for all wallets:")
    print(final_scores_df.head(10))
    print(f"Predicted score distribution (min, max, mean): {final_scores_df['credit_score'].min()}, {final_scores_df['credit_score'].max()}, {final_scores_df['credit_score'].mean():.2f}")

    return final_scores_df

# --- Execute Model Training and Saving ---
final_predicted_scores_df = train_and_save_model(features_df, heuristic_scores, wallets)




# ONE STEP SCORING SCRIPT ----

# This section simulates the 'one-step script' (score_wallets.py).
# It demonstrates how to load the saved model and scaler
# to predict scores for a new (or the same) dataset.

def generate_wallet_scores_from_file(json_file_path, model_path='model.joblib', scaler_path='scaler.joblib'):
    """
    Generates wallet credit scores from a new JSON transaction file using a pre-trained model.

    Args:
        json_file_path (str): Path to the input JSON file with transaction data.
        model_path (str): Path to the saved machine learning model (.joblib).
        scaler_path (str): Path to the saved StandardScaler (.joblib).

    Returns:
        pd.DataFrame: DataFrame containing wallet addresses and their predicted credit scores.
    """
    print(f"\n--- Running One-Step Scoring Script for '{json_file_path}' ---")

    # 1. Load Data
    transactions_df_new = load_data(json_file_path)
    if transactions_df_new.empty:
        print("Skipping score generation due to data loading error.")
        return pd.DataFrame()

    # 2. Engineer Features (using the same logic as training)
    features_df_new, wallets_new = engineer_features(transactions_df_new)
    if features_df_new.empty:
        print("Skipping score generation due to feature engineering error.")
        return pd.DataFrame()

    # 3. Load Pre-trained Model and Scaler
    print("Loading pre-trained model and scaler...")
    try:
        scaler = joblib.load(scaler_path)
        model = joblib.load(model_path)
        print("Model and Scaler loaded successfully.")
    except FileNotFoundError:
        print(f"Error: Model or Scaler files not found at '{model_path}' or '{scaler_path}'.")
        print("Please ensure you have run the 'Model Training and Saving' cell above to create these files.")
        return pd.DataFrame()
    except Exception as e:
        print(f"An error occurred while loading model/scaler: {e}")
        return pd.DataFrame()

    # 4. Scale New Features (using the loaded scaler)
    # It's crucial to use the *same* scaler fitted on the training data.
    X_scaled_new = scaler.transform(features_df_new)
    print("New features scaled using the pre-trained scaler.")

    # 5. Predict Scores
    print("Predicting scores for new wallets...")
    predicted_scores_new = model.predict(X_scaled_new)

    # 6. Clamp scores to 0-1000 range and format
    final_scores_df_new = pd.DataFrame({
        'wallet': wallets_new,
        'credit_score': predicted_scores_new.clip(0, 1000).round(0).astype(int)
    })

    # 7. Output Results
    output_filename = 'wallet_credit_scores_generated.json'
    final_scores_df_new.to_json(output_filename, orient='records', indent=4)
    print(f"Scores generated and saved to '{output_filename}' in the current directory.")
    print("\nFirst 10 generated credit scores:")
    print(final_scores_df_new.head(10))

    return final_scores_df_new

# --- Execute the simulated one-step script ---
# This will use the 'user-transactions.json' file again,
# but it demonstrates the process of loading the saved model.
generated_scores_from_script = generate_wallet_scores_from_file('user-transactions.json')


All necessary libraries imported successfully.
Loading data from user-transactions.json...
Data loaded successfully. Total records: 100000
First 5 rows of data:
                                    _id  \
0  {'$oid': '681d38fed63812d4655f571a'}   
1  {'$oid': '681aa70dd6df53021cc6f3c0'}   
2  {'$oid': '681d04c2d63812d4654c733e'}   
3  {'$oid': '681d133bd63812d46551b6ef'}   
4  {'$oid': '681899e4ba49fc91cf2f4454'}   

                                   userWallet  network protocol  \
0  0x00000000001accfa9cef68cf5371a23025b6d4b6  polygon  aave_v2   
1  0x000000000051d07a4fb3bd10121a343d85818da6  polygon  aave_v2   
2  0x000000000096026fb41fc39f9875d164bd82e2dc  polygon  aave_v2   
3  0x000000000096026fb41fc39f9875d164bd82e2dc  polygon  aave_v2   
4  0x0000000000e189dd664b9ab08a33c4839953852c  polygon  aave_v2   

                                              txHash  \
0  0x695c69acf608fbf5d38e48ca5535e118cc213a89e3d6...   
1  0xe6fc162c86b2928b0ba9b82bda672763665152b9de9d...   
2  0xe2d7