# Credit Risk Prediction Analysis

## 1. Introduction
This notebook presents a detailed analysis of the German Credit Data to build a model for predicting credit risk. We will explore the data, preprocess it, train multiple machine learning models, and evaluate their performance to identify the best approach for this classification task.

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
columns = [
    'checking_account_status', 'duration', 'credit_history', 'purpose', 'credit_amount', 
    'savings_account', 'present_employment', 'installment_rate', 'personal_status_sex', 
    'other_debtors', 'present_residence_since', 'property', 'age', 'other_installment_plans', 
    'housing', 'number_of_credits', 'job', 'people_liable', 'telephone', 'foreign_worker', 'risk'
]
df = pd.read_csv(url, sep=' ', header=None, names=columns)

# Convert target variable
df['risk'] = df['risk'].replace({1: 'Good', 2: 'Bad'})

print('Dataset Shape:', df.shape)
print('\nFirst 5 Rows:')
display(df.head())

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Distribution of the target variable
plt.figure(figsize=(8, 6))
sns.countplot(x='risk', data=df)
plt.title('Distribution of Credit Risk')
plt.show()

# Distribution of Credit Amount by Risk
plt.figure(figsize=(10, 6))
sns.boxplot(x='risk', y='credit_amount', data=df)
plt.title('Credit Amount Distribution by Risk')
plt.show()

# Credit History vs. Risk
plt.figure(figsize=(12, 7))
sns.countplot(x='credit_history', hue='risk', data=df)
plt.title('Credit History vs. Risk')
plt.xticks(rotation=45)
plt.show()
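
The count plot shows absolute counts, and the credit-history categories differ in size, so the share of bad-risk applicants within each category can be easier to compare. The following is a minimal sketch using `pd.crosstab`. It runs on a small made-up frame (the rows and the name `toy` are illustrative, not taken from the dataset) so that it executes on its own; in the notebook, `df['credit_history']` and `df['risk']` would be passed instead.

```python
import pandas as pd

# Illustrative stand-in for df[['credit_history', 'risk']]; the rows are made up.
toy = pd.DataFrame({
    'credit_history': ['A30', 'A30', 'A32', 'A32', 'A32', 'A34', 'A34', 'A34'],
    'risk':           ['Bad', 'Good', 'Good', 'Bad', 'Good', 'Good', 'Good', 'Good'],
})

# Row-normalised cross-tabulation: each row sums to 1, so the 'Bad' column
# is the share of bad-risk applicants within that credit-history category.
risk_share = pd.crosstab(toy['credit_history'], toy['risk'], normalize='index')
print(risk_share.round(2))
```

Sorting the result by the `Bad` column (`risk_share.sort_values('Bad', ascending=False)`) gives a quick ranking of categories by bad-risk share.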

## 4. Data Preprocessing and Modeling
We will now preprocess the data by encoding categorical variables and scaling numerical features. After that, we will train and evaluate Logistic Regression, Random Forest, and Gradient Boosting models.

In [None]:
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Prepare data for modeling: encode the target as 0 (Good) / 1 (Bad)
X = df.drop(columns='risk')
y = df['risk'].map({'Good': 0, 'Bad': 1})

# Split columns by dtype: category codes (object) vs. numeric features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=np.number).columns

# Scale numeric columns and one-hot encode categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

# Stratified split keeps the Good/Bad ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

# Fit each model in its own pipeline, with its own copy of the preprocessor
pipelines = {}
for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', clone(preprocessor)), ('classifier', model)])
    pipeline.fit(X_train, y_train)
    pipelines[name] = pipeline

    y_pred = pipeline.predict(X_test)
    y_proba = pipeline.predict_proba(X_test)[:, 1]  # predicted probability of 'Bad'
    print(f'--- {name} Report ---')
    print(classification_report(y_test, y_pred, target_names=['Good', 'Bad']))
    print(f'ROC AUC: {roc_auc_score(y_test, y_proba):.3f}\n')
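
Metrics from a single 20% hold-out split can shift noticeably with the random seed, so a stratified cross-validation comparison is one way to get a steadier estimate. The sketch below runs on synthetic stand-in data so that it executes on its own. The frame `X_demo`, the target rule, and the helper `make_pipeline_for` are all assumptions for illustration; in the notebook, the real `X` and `y` and the full column lists would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X and y (assumption: replace with the notebook's data).
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    'duration': rng.integers(4, 72, size=n),
    'credit_amount': rng.integers(250, 18000, size=n),
    'housing': rng.choice(['A151', 'A152', 'A153'], size=n),
})
# Made-up rule so the target relates to a feature: longer loans are riskier.
y_demo = (X_demo['duration'] + rng.normal(0, 15, size=n) > 45).astype(int)


def make_pipeline_for(classifier):
    """Build a fresh preprocessing + classifier pipeline (hypothetical helper)."""
    preprocessor = ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['duration', 'credit_amount']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['housing']),
    ])
    return Pipeline(steps=[('preprocessor', preprocessor), ('classifier', classifier)])


models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

# Stratified folds keep the class ratio roughly constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(make_pipeline_for(model), X_demo, y_demo, cv=cv, scoring='roc_auc')
    cv_auc[name] = scores
    print(f'{name}: ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f}')
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are refit on each training fold, which avoids leaking information from the validation fold.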

## 5. Conclusion
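The classification reports above allow the models to be compared on precision, recall, and F1-score for each class. Random Forest classifiers often give a good balance of precision and recall on tabular data such as this, but that should be confirmed against the reported metrics, with particular attention to recall on the 'Bad' class, since approving a bad-risk applicant is typically the costlier error. Further improvements could come from additional feature engineering and hyperparameter tuning. The fitted model's feature importances can also indicate which attributes drive the predicted credit risk.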