# Bankruptcy Prediction - Exploratory Data Analysis

This notebook performs exploratory data analysis on financial datasets for bankruptcy prediction.

## Objectives
- Load and understand the financial data
- Analyze distributions and correlations
- Identify key features for bankruptcy prediction
- Visualize financial health indicators

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import boto3
from io import StringIO

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print('Libraries imported successfully')

In [None]:
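# Load data from S3
s3_client = boto3.client('s3')  # not used by the pandas read below; kept for direct S3 access
bucket_name = 'bankruptcy-data-lake-prod'
file_key = 'processed/feature_engineered/data.csv'

# Read data (pandas resolves s3:// URLs through the optional s3fs package)
# df = pd.read_csv(f's3://{bucket_name}/{file_key}')

# For demonstration, create synthetic sample data.
# Every column is drawn independently, so no feature carries real signal about bankruptcy_status.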
np.random.seed(42)
n_samples = 1000

df = pd.DataFrame({
    'company_id': range(n_samples),
    'fiscal_year': np.random.choice([2020, 2021, 2022, 2023], n_samples),
    'total_assets': np.random.lognormal(15, 2, n_samples),
    'total_liabilities': np.random.lognormal(14.5, 2, n_samples),
    'revenue': np.random.lognormal(14, 1.5, n_samples),
    'net_income': np.random.normal(1e6, 5e5, n_samples),
    'current_ratio': np.random.gamma(2, 0.5, n_samples),
    'debt_to_equity': np.random.gamma(2, 0.3, n_samples),
    'roa': np.random.normal(0.05, 0.03, n_samples),
    'altman_z_score': np.random.normal(2.5, 1.2, n_samples),
    'bankruptcy_status': np.random.choice([0, 1], n_samples, p=[0.95, 0.05])
})

print(f'Data loaded: {len(df)} rows, {len(df.columns)} columns')
df.head()

In [None]:
# Basic statistics
print('Dataset Statistics:')
print(df.describe())

print('\nBankruptcy Distribution:')
print(df['bankruptcy_status'].value_counts())
print(f'\nBankruptcy Rate: {df["bankruptcy_status"].mean():.2%}')

In [None]:
# Visualize bankruptcy distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
df['bankruptcy_status'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Bankruptcy Status Distribution')
axes[0].set_xlabel('Status (0=Not Bankrupt, 1=Bankrupt)')
axes[0].set_ylabel('Count')

# Pie chart
df['bankruptcy_status'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
axes[1].set_title('Bankruptcy Status Proportion')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Financial ratios distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

ratio_titles = {
    'current_ratio': 'Current Ratio Distribution',
    'debt_to_equity': 'Debt to Equity Distribution',
    'roa': 'Return on Assets Distribution',
    'altman_z_score': 'Altman Z-Score Distribution',
}

# Normalize each class separately (common_norm=False) so the ~5% bankrupt
# class stays visible next to the much larger non-bankrupt class
for ax, (col, title) in zip(axes.flat, ratio_titles.items()):
    sns.histplot(data=df, x=col, hue='bankruptcy_status', ax=ax, kde=True,
                 stat='density', common_norm=False)
    ax.set_title(title)

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
numeric_cols = ['total_assets', 'total_liabilities', 'revenue', 'net_income',
                'current_ratio', 'debt_to_equity', 'roa', 'altman_z_score', 'bankruptcy_status']

correlation_matrix = df[numeric_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

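print('\nCorrelation with Bankruptcy Status:')
# Drop the target's trivial self-correlation of 1.0 before ranking the features
target_corr = correlation_matrix['bankruptcy_status'].drop('bankruptcy_status')
print(target_corr.sort_values(ascending=False))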
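
The Pearson correlation against a binary target can look small even when the two classes differ noticeably, because only about 5% of rows are bankrupt. A complementary check is the standardized mean difference (Cohen's d) between the two classes. The sketch below is illustrative only: the helper `standardized_mean_difference`, the `toy` frame, and the `noise_feature` column are hypothetical, and the toy data deliberately shifts the Z-score of bankrupt firms downward so that there is something to detect.

```python
import numpy as np
import pandas as pd


def standardized_mean_difference(df, feature, target='bankruptcy_status'):
    """Return (mean_bankrupt - mean_healthy) / pooled standard deviation."""
    bankrupt = df.loc[df[target] == 1, feature]
    healthy = df.loc[df[target] == 0, feature]
    pooled_var = (
        (len(bankrupt) - 1) * bankrupt.var(ddof=1)
        + (len(healthy) - 1) * healthy.var(ddof=1)
    ) / (len(bankrupt) + len(healthy) - 2)
    return (bankrupt.mean() - healthy.mean()) / np.sqrt(pooled_var)


# Hypothetical toy data: bankrupt firms are given lower Z-scores on purpose
rng = np.random.default_rng(0)
status = rng.choice([0, 1], size=2000, p=[0.95, 0.05])
toy = pd.DataFrame({
    'bankruptcy_status': status,
    'altman_z_score': rng.normal(2.5, 1.0, size=2000) - 1.5 * status,
    'noise_feature': rng.normal(0.0, 1.0, size=2000),
})

for col in ['altman_z_score', 'noise_feature']:
    print(f'{col}: d = {standardized_mean_difference(toy, col):.2f}')
```

A strongly negative d for `altman_z_score` and a value near zero for `noise_feature` is the pattern to look for. Applied to the notebook's randomly generated `df`, every feature should land near zero, since its columns are drawn independently of `bankruptcy_status`.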

In [None]:
# Box plots for key metrics
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

sns.boxplot(data=df, x='bankruptcy_status', y='altman_z_score', ax=axes[0])
axes[0].set_title('Altman Z-Score by Bankruptcy Status')

sns.boxplot(data=df, x='bankruptcy_status', y='current_ratio', ax=axes[1])
axes[1].set_title('Current Ratio by Bankruptcy Status')

sns.boxplot(data=df, x='bankruptcy_status', y='roa', ax=axes[2])
axes[2].set_title('ROA by Bankruptcy Status')

plt.tight_layout()
plt.show()

## Key Findings

1. **Class Imbalance**: The dataset shows significant imbalance with ~95% non-bankrupt companies
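2. **Altman Z-Score**: Expected to be a strong indicator of bankruptcy risk on real data (lower scores signal higher risk); the synthetic sample draws it independently of the label, so the plots above cannot show separation
3. **Financial Ratios**: Current ratio and debt-to-equity should be checked for class separation once the S3 data is loaded; none can appear in the randomly generated sample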
4. **Imbalance Handling**: Resampling techniques such as SMOTE, or class weighting, are needed so that models do not ignore the minority (bankrupt) class
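
To make the class-imbalance point concrete, here is a minimal sketch of random oversampling using only pandas. This is a simpler stand-in for SMOTE, not SMOTE itself: it duplicates existing minority rows, whereas SMOTE (available in the separate imbalanced-learn package) interpolates new synthetic points. The helper name `random_oversample` and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def random_oversample(df, target='bankruptcy_status', random_state=42):
    """Resample the minority class with replacement until both classes are the same size."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    n_extra = counts.max() - counts.min()
    extra_rows = df[df[target] == minority_label].sample(
        n=n_extra, replace=True, random_state=random_state
    )
    # Shuffle so the duplicated rows are not grouped at the end
    combined = pd.concat([df, extra_rows])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Hypothetical toy frame with 95 non-bankrupt and 5 bankrupt rows
toy = pd.DataFrame({
    'altman_z_score': np.linspace(0.5, 4.5, 100),
    'bankruptcy_status': [0] * 95 + [1] * 5,
})

balanced = random_oversample(toy)
print(balanced['bankruptcy_status'].value_counts())
```

Any resampling of this kind should be applied to the training split only, after the train/test split, so that duplicated rows do not leak into the evaluation data.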

## Next Steps

- Feature engineering and selection
- Model training with class balancing
- Hyperparameter tuning
- Model evaluation with focus on recall (minimize false negatives)
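
The focus on recall can be illustrated with a short sketch. It assumes a hypothetical label vector with the same 95/5 split as the sample data and a trivial baseline that always predicts "not bankrupt"; the helper `recall_and_accuracy` is illustrative, not part of the notebook.

```python
import numpy as np


def recall_and_accuracy(y_true, y_pred):
    """Compute recall for the positive (bankrupt) class and overall accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    false_neg = np.sum((y_true == 1) & (y_pred == 0))
    # Guard against division by zero when there are no positive labels
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    accuracy = np.mean(y_true == y_pred)
    return float(recall), float(accuracy)


# Hypothetical labels: 95 non-bankrupt firms and 5 bankrupt firms
y_true = np.array([0] * 95 + [1] * 5)

# Baseline that always predicts "not bankrupt"
always_healthy = np.zeros_like(y_true)
baseline_recall, baseline_accuracy = recall_and_accuracy(y_true, always_healthy)
print(f'recall={baseline_recall:.2f}, accuracy={baseline_accuracy:.2f}')
```

The baseline reaches 95% accuracy while catching none of the bankruptcies (recall of 0), which is why accuracy alone is a misleading metric for this problem.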