Credit Risk Analysis:

Introduction: In this project, I embarked on an analysis of a loan dataset with the goal of understanding its structure, identifying patterns, and building predictive models for loan status classification. The dataset, sourced from a CSV file named "loan.csv," provided an opportunity to delve into the world of finance and predictive modeling.

Data Loading and Initial Exploration: First, I loaded the dataset into a Pandas DataFrame named load_df. To get a glimpse of the data, I displayed the first few rows and checked for missing values. Upon initial exploration, I found that the dataset contained several columns that were irrelevant to my analysis. Therefore, I pruned these columns to keep only the ones necessary for my investigation.

Data Cleaning and Preprocessing: Cleaning the data was crucial to ensure its quality for analysis. I tackled missing values by either filling them appropriately or dropping them when necessary. Visualizations, such as heatmaps, helped me gain insights into the missingness patterns in the data. Additionally, I encoded categorical variables using label encoding to prepare them for analysis. Furthermore, I engineered a new feature, the debt-to-income ratio (DTI), which I believed could be informative for predicting loan status.

Exploratory Data Analysis (EDA): EDA allowed me to dive deeper into the dataset and uncover meaningful insights. I started by computing summary statistics for numerical variables and visualizing their distributions using histograms. For categorical variables, I created bar plots and count plots to understand their distributions. Moreover, I utilized box plots to identify relationships between numerical variables and loan status, providing valuable insights into potential predictors of loan outcomes.

Modeling: With the data prepared and explored, I proceeded to build predictive models for loan status classification. I trained Random Forest, Neural Network, and XGBoost Classifier models using the features extracted from the dataset. In my evaluation the XGBoost model performed best of the three, which I attribute to how well gradient-boosted trees cope with mixed tabular features and a multi-class target. Visualizations such as ROC-AUC curves and feature importance plots helped me assess model performance and identify influential features.

Conclusion: In conclusion, this project offered valuable insights into the loan dataset, enabling a deeper understanding of its characteristics and predictive modeling potential. The developed models show promise in accurately classifying loan statuses, though further optimizations and fine-tuning could enhance their performance. Overall, the project serves as a foundation for future analyses in the domain of finance and predictive modeling.

Recommendations: Based on my analysis, I recommend exploring further optimizations and fine-tuning of the predictive models to potentially improve performance. Additionally, investigating additional features or incorporating external data sources could enrich the analysis and provide deeper insights into loan outcomes.

Limitations: It's important to acknowledge the limitations of this analysis. For instance, the dataset may be subject to biases or limitations inherent in its collection process. Furthermore, the predictive models developed in this project may not capture all factors influencing loan outcomes, leaving room for further exploration and refinement.

Acknowledgments: I would like to express gratitude to the sources of the dataset, as well as the libraries and resources utilized in this project. Without their contributions, this analysis would not have been possible.

References: Any references to literature, documentation, or methodologies used in the project are duly acknowledged and appreciated. These references provided invaluable guidance and insights throughout the analysis process.

Tools and Techniques Used: 
Programming Languages: Python

Libraries: NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, xgboost

Data Analysis and Visualization: Exploratory Data Analysis (EDA), Data Cleaning, Data Preprocessing, Feature Engineering, Statistical Analysis, Visualization Techniques (Histograms, Box Plots, Heatmaps)

Modeling Techniques: Random Forest Classifier, Neural Network (MLPClassifier), XGBoost Classifier, Label Encoding, Evaluation Metrics (Accuracy, Confusion Matrix, Classification Report, ROC-AUC Curve), Feature Importance Analysis 

Machine Learning: Supervised Learning, Classification

Business Value: Risk Management: The predictive models developed in this project can assist financial institutions in assessing the risk associated with lending by accurately classifying loan statuses. This enables proactive risk management strategies to minimize losses. 

Customer Insights: Analysis of loan characteristics and borrower attributes provides valuable insights into customer behavior and preferences, aiding in the development of targeted marketing strategies and personalized financial products.

Operational Efficiency: Automation of loan status classification processes through machine learning models enhances operational efficiency, allowing financial institutions to streamline decision-making and allocate resources effectively.

Applicability: Financial Services Industry: The project's insights and predictive models are directly applicable to financial institutions such as banks, credit unions, and lending platforms for improving loan approval processes, managing risks, and enhancing customer experiences. 

Credit Scoring Systems: The developed models can be integrated into credit scoring systems to assess borrowers' creditworthiness and determine appropriate lending terms and interest rates. 

Fintech Startups: Fintech companies can leverage the project's methodologies and techniques to develop innovative solutions for peer-to-peer lending, microfinance, and alternative credit scoring.

In [None]:
#Importing the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown, Latex
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import f1_score
pd.set_option('display.max_rows', None)
import warnings
warnings.filterwarnings("ignore")

In [None]:
load_df = pd.read_csv("loan.csv", low_memory=False)

In [None]:
load_df.head(5)

In [None]:
load_df.info()

In [None]:
# There are many unnecessary columns and a lot of missing data, so I'd like to remove them.

In [None]:
#Calculating the missing data percentage
missing_percentage = load_df.isnull().mean() * 100
missing_percentage
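
The missing-data percentages can also be used to shortlist columns programmatically before choosing by hand. The following is a minimal sketch on a toy DataFrame; the helper name `columns_below_missing_threshold` and the 30% cutoff are hypothetical and only illustrate the idea.

```python
import pandas as pd

def columns_below_missing_threshold(df: pd.DataFrame, max_missing_pct: float = 5.0) -> list:
    """Return the columns whose share of missing values is at most max_missing_pct."""
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct <= max_missing_pct].index.tolist()

# Toy frame standing in for load_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],                  # 0% missing
    "desc": [None, None, None, "text"],                     # 75% missing
    "emp_length": ["1 year", None, "3 years", "2 years"],   # 25% missing
})

kept = columns_below_missing_threshold(toy_df, max_missing_pct=30.0)
print(kept)
```

A shortlist like this still needs a manual relevance check, since a column can be complete and still be useless for the analysis.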

In [None]:
#After careful consideration, I decided to keep just these columns as they're the most relevant and also have very little to no missing data
columns_to_keep = ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'emp_length', 'home_ownership',
                   'annual_inc', 'verification_status', 'loan_status', 'purpose']
# .copy() gives an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
loan_df = load_df[columns_to_keep].copy()

In [None]:
loan_df.head() #just checking the dataset to make sure it's all good

In [None]:
loan_df.info()

In [None]:
#I have now removed all the irrelevant columns and kept only the columns needed for the analysis.
#Checking the missing values in the final dataset
missing_perc = loan_df.isnull().mean() * 100
missing_perc

In [None]:
#Summarising emp_length (count, number of unique values, most frequent value)
loan_df["emp_length"].describe()

In [None]:
loan_df["emp_length"].value_counts()

In [None]:
#Replacing the null values in annual income with zero
loan_df.annual_inc = loan_df.annual_inc.fillna(0)
loan_df.isnull().sum()

In [None]:
missing_matrix = loan_df.isnull()

# Create a heatmap of the missingness matrix to check whether the missing data follows a pattern
plt.figure(figsize=(10, 6))
sns.heatmap(missing_matrix, cbar=False, cmap='viridis')
plt.title('Missingness Matrix')
plt.xlabel('Variables')
plt.ylabel('Observations')
plt.show()

In [None]:
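#Checking the correlation between the different numerical variables
#numeric_only=True skips the text columns, which would otherwise raise an error in recent pandas versions
correlation_matrix = loan_df.corr(numeric_only=True)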
print(correlation_matrix)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()

Here's a summary of the correlation matrix:

loan_amnt vs. int_rate: There is a positive correlation of approximately 0.145 between the loan amount and the interest rate. This indicates that as the loan amount increases, the interest rate tends to increase slightly.

loan_amnt vs. installment: There is a strong positive correlation of approximately 0.945 between the loan amount and the installment. This indicates that higher loan amounts are associated with higher installments, which is intuitive since larger loans would require larger periodic payments.

loan_amnt vs. annual_inc: There is a moderate positive correlation of approximately 0.333 between the loan amount and the annual income. This suggests that individuals with higher annual incomes may qualify for larger loan amounts.

int_rate vs. installment: There is a positive correlation of approximately 0.133 between the interest rate and the installment. This suggests that higher interest rates are associated with higher installment payments.

int_rate vs. annual_inc: There is a very weak negative correlation of approximately -0.073 between the interest rate and the annual income. Borrowers with higher annual incomes tend to receive marginally lower interest rates, but the relationship is too weak to treat income as a strong driver of the rate.

installment vs. annual_inc: There is a moderate positive correlation of approximately 0.326 between the installment and the annual income. This suggests that individuals with higher annual incomes may be able to afford higher installment payments.
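
The very strong loan_amnt vs. installment correlation is what I would expect if installments follow a standard fixed-rate amortization schedule, where the payment scales linearly with the principal. The sketch below assumes that schedule (the dataset itself does not state how installments are computed), and the function name `monthly_installment` is hypothetical.

```python
def monthly_installment(principal: float, annual_rate_pct: float, n_months: int) -> float:
    """Fixed monthly payment for a fully amortizing loan (assumed formula)."""
    monthly_rate = annual_rate_pct / 100 / 12
    if monthly_rate == 0:
        # With no interest the principal is simply spread evenly over the term
        return principal / n_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_months)

# For a fixed rate and term, doubling the principal doubles the payment
for amount in (5_000, 10_000, 20_000):
    print(amount, round(monthly_installment(amount, 12.0, 36), 2))
```

Under this assumption, the remaining scatter around a perfect correlation comes from loans differing in interest rate and term.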

In [None]:
#Deleting the rows where the column emp_length has missing values
loan_df.dropna(subset=['emp_length'], inplace=True)

In [None]:
loan_df.info()

In [None]:
# Understanding the Data:
# Review the summary statistics for numerical variables (loan_amnt, int_rate, installment, annual_inc) and categorical variables (term, grade, emp_length, home_ownership, verification_status, loan_status, purpose).
# Check for any unusual values or outliers in the data.
# Understand the distribution and range of values for each variable.

In [None]:
numerical_data_summary = loan_df[['loan_amnt', 'int_rate', 'installment', 'annual_inc']].describe()
print("Summary statistics for numerical variables:")
print(numerical_data_summary)

Loan Amount (loan_amnt):

The average loan amount is approximately $14,914, with a considerable standard deviation of around $8,450, indicating significant variability in loan amounts.
The range of loan amounts spans from $500 to $35,000, suggesting a diverse set of loan products or borrower needs.
The interquartile range (IQR) of $11,675 indicates that the middle 50% of loan amounts fall within a relatively wide range, emphasizing the diversity in loan sizes.

Interest Rate (int_rate):

The average interest rate is approximately 13.25%, with a standard deviation of about 4.39%, indicating variability in interest rates across loans.
Interest rates range from 5.32% to 28.99%, suggesting a broad spectrum of rates offered to borrowers.
The IQR of 6.21% reflects the variability in interest rates experienced by the majority of borrowers, with rates ranging from 9.99% to 16.20%.

Installment Amount (installment):

The average installment amount is approximately $441, with a standard deviation of about $245, indicating variability in installment payments.
Installment payments range from $15.67 to $1,445.46, showcasing a wide range of payment obligations for borrowers.
The IQR of $313.51 highlights the variability in installment amounts, with payments for the middle 50% of borrowers ranging from $263.93 to $577.44.

Annual Income (annual_inc):

The average annual income is approximately $76,353, with a standard deviation of around $65,643, indicating significant variability in income levels among borrowers.
Annual incomes range from $1,896 to $9,500,000, reflecting a wide disparity in income levels across borrowers.
The IQR of $43,000 underscores the variability in income levels among the majority of borrowers, with incomes ranging from $47,000 to $90,000.

Overall, these summary statistics reveal a diverse set of loan products and borrower profiles within the dataset. The wide ranges and significant standard deviations suggest that borrowers vary widely in their loan amounts, interest rates, installment payments, and annual incomes. Understanding these variations is crucial for assessing risk, making lending decisions, and tailoring financial products to meet the needs of different borrower segments. Further analysis, such as exploring relationships between variables or identifying outliers, could provide additional insights for informed decision-making.
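
A maximum annual income of $9,500,000 against an upper quartile of $90,000 suggests extreme outliers. One common way to flag them is the 1.5 × IQR rule; the sketch below applies it to a small made-up income series. The helper `iqr_outlier_mask` is hypothetical, and the rule is a convention rather than something this analysis depends on.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up incomes, with one extreme value similar to the dataset's maximum
incomes = pd.Series([47_000, 52_000, 60_000, 75_000, 90_000, 9_500_000])
mask = iqr_outlier_mask(incomes)
print(incomes[mask])
```

Flagged rows could then be inspected, capped, or log-transformed depending on what the model needs.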

Exploring Relationships:

I'll calculate pairwise correlations between numerical variables to understand any linear relationships.
I'll visualize relationships between pairs of numerical variables using scatter plots.
To explore relationships between categorical variables, I'll create cross-tabulations or pivot tables.

Analyzing Loan Status:

I'll examine the distribution of loan_status to understand the proportion of loans that are fully paid, charged off, or in other states.
I'll investigate factors like loan amount, interest rate, employment length, and purpose to understand their influence on loan status.

Feature Extraction:

I'll extract additional features from existing variables that might be informative for predicting loan status or other outcomes.
For example, I'll create derived features like debt-to-income ratio (DTI) using installment and annual_inc.

Visualizing Loan Performance:

I'll plot bar charts or pie charts to visualize the distribution of loan_status.
Using stacked bar charts, I'll show the distribution of loan_status within different categories of categorical variables like grade or verification_status.

Understanding Loan Characteristics:

I'll explore how loan amounts are distributed across different loan terms (e.g., 36 months vs. 60 months) and loan grades.
I'll investigate the relationship between interest rates and loan grades to understand how risk is priced.

Identifying Trends Over Time:

I'll analyze trends in loan issuance or loan performance over time by plotting time series plots or line charts.

In [None]:
# Calculating pairwise correlations
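# numeric_only=True restricts the calculation to numerical columns (text columns cannot be correlated)
correlation_matrix = loan_df.corr(numeric_only=True)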

# Print correlation matrix
print("Pairwise Correlation Matrix:")
print(correlation_matrix)

sns.heatmap(correlation_matrix)

In [None]:
# Plot scatter plots for pairs of numerical variables
# pairplot creates its own figure, so the panel size is set with `height` instead of plt.figure
sns.pairplot(loan_df[['loan_amnt', 'int_rate', 'installment', 'annual_inc']], markers='.', diag_kind='kde', plot_kws={'alpha': 0.7, 's': 80}, height=4)
plt.show()

In [None]:
loan_df.info()

In [None]:
from tabulate import tabulate

# Creating cross-tabulations for pairs of categorical variables
cross_tab = pd.crosstab(loan_df['term'], loan_df['grade'])

# Creating pivot table for exploring relationships between categorical variables
pivot_table = loan_df.pivot_table(index='home_ownership', columns='loan_status', aggfunc='size')

# Print cross-tabulation between Term and Grade
print("Cross-Tabulation between Term and Grade:")
print(tabulate(cross_tab, headers='keys', tablefmt='psql'))

# Print pivot table for Home Ownership and Loan Status
print("\nPivot Table for Home Ownership and Loan Status:")
print(tabulate(pivot_table, headers='keys', tablefmt='psql'))


In [None]:
pivot_table.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Loan Status by Home Ownership')
plt.xlabel('Home Ownership')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Loan Status', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
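
Raw counts make large groups dominate the stacked bars. To compare repayment outcomes across home-ownership groups of different sizes, the counts can be converted into within-group shares. The sketch below uses a made-up toy table with the same two column names; `normalize="index"` is a standard `pd.crosstab` option that makes each row sum to 1.

```python
import pandas as pd

# Toy data with the same column names as loan_df (values are invented)
toy = pd.DataFrame({
    "home_ownership": ["RENT", "RENT", "RENT", "RENT", "OWN", "OWN"],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
                    "Fully Paid", "Charged Off"],
})

# Each row shows the share of every loan status within one ownership group
row_share = pd.crosstab(toy["home_ownership"], toy["loan_status"], normalize="index")
print(row_share.round(2))
```

Plotting `row_share` with `kind='bar', stacked=True` would give bars of equal height that are directly comparable across groups.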


In [None]:
# Calculate the proportion of each loan status
loan_status_proportion = loan_df['loan_status'].value_counts(normalize=True) * 100
print("Proportion of Loan Status:")
print(loan_status_proportion)
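
If these proportions turn out to be heavily skewed towards one or two statuses, a classifier can reach high accuracy while ignoring the rare classes. One option is to compute balanced class weights with scikit-learn. This is a sketch on a made-up label array, not a step I ran in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up encoded labels: class 0 is common, classes 1 and 2 are rare
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
classes = np.unique(y_toy)

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight_map = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight_map)
```

A mapping like this can be passed as `class_weight` to `RandomForestClassifier`, or expanded to per-row values for the `sample_weight` argument of `fit`.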

In [None]:
# Calculate proportions for each loan status within each grade
normalized_df = loan_df.groupby(['grade', 'loan_status']).size().reset_index(name='count')
# transform keeps the original row index, so the result aligns with normalized_df
normalized_df['proportion'] = normalized_df['count'] / normalized_df.groupby('grade')['count'].transform('sum')

plt.figure(figsize=(14, 8))
sns.barplot(x='proportion', y='grade', hue='loan_status', data=normalized_df, palette='bright')
plt.title('Loan Status Distribution by Grade (Normalized)')
plt.xlabel('Proportion')
plt.ylabel('Grade')
plt.legend(title='Loan Status')
plt.show()

In [None]:
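# Calculate debt-to-income ratio (DTI): annualized installment divided by annual income
# Zero incomes (filled in earlier) are treated as missing here to avoid dividing by zero
loan_df['dti'] = (loan_df['installment'] * 12) / loan_df['annual_inc'].replace(0, np.nan)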

# Visualize the distribution of the new feature (DTI)
plt.figure(figsize=(8, 6))
sns.histplot(loan_df['dti'], bins=30, kde=True)
plt.title('Distribution of Debt-to-Income Ratio (DTI)')
plt.xlabel('Debt-to-Income Ratio (DTI)')
plt.ylabel('Count')
plt.show()


In [None]:

# Plot relationship between loan amount and loan status
plt.figure(figsize=(10, 6))
sns.boxplot(x='loan_status', y='loan_amnt', data=loan_df, palette='bright')
plt.title('Loan Amount by Loan Status')
plt.xlabel('Loan Status')
plt.xticks(rotation=45)
plt.ylabel('Loan Amount ($)')
plt.show()
# This plot shows the distribution of loan amounts for different loan statuses. It helps in understanding if there are significant differences in loan amounts between different loan statuses.

# Plot relationship between interest rate and loan status
plt.figure(figsize=(10, 6))
sns.boxplot(x='loan_status', y='int_rate', data=loan_df, palette='bright')
plt.title('Interest Rate by Loan Status')
plt.xlabel('Loan Status')
plt.xticks(rotation=45)
plt.ylabel('Interest Rate (%)')
plt.show()
# This plot displays how interest rates vary across different loan statuses. It aids in identifying if interest rates differ significantly based on loan status.

# Plotting boxplot for loan amounts across different loan terms
plt.figure(figsize=(10, 6))
sns.boxplot(x='term', y='loan_amnt', data=loan_df, palette='bright')
plt.title('Loan Amount Distribution Across Loan Terms')
plt.xlabel('Term')
plt.ylabel('Loan Amount ($)')
plt.show()
# This plot displays the distribution of loan amounts across different loan terms (e.g., 36 months vs. 60 months), providing insights into how loan amounts vary based on the loan term.

# Plotting boxplot for interest rates across different loan grades
plt.figure(figsize=(10, 6))
sns.boxplot(x='grade', y='int_rate', data=loan_df, palette='bright', order=sorted(loan_df['grade'].unique()))
plt.title('Interest Rates Across Loan Grades')
plt.xlabel('Grade')
plt.ylabel('Interest Rate (%)')
plt.show()
# This plot illustrates how interest rates vary across different loan grades. It helps in understanding how risk is priced based on the borrower's creditworthiness.


In [None]:
# Plot relationship between purpose and loan status
plt.figure(figsize=(16, 8))
sns.countplot(y='purpose', hue='loan_status', data=loan_df, palette='bright')
plt.title('Loan Status by Purpose')
plt.ylabel('Purpose')
plt.xlabel('Count')
plt.xticks(rotation=90)
plt.show()
# This plot shows the distribution of loan statuses across different loan purposes. It helps in understanding whether certain loan purposes are associated with higher or lower default rates.

In [None]:
# Plot relationship between employment length and loan status
plt.figure(figsize=(10, 6))
sns.countplot(y='emp_length', hue='loan_status', data=loan_df, palette='bright')
plt.title('Loan Status by Employment Length')
plt.ylabel('Employment Length')
plt.xticks(rotation=90)
plt.xlabel('Count')
plt.tight_layout()
plt.show()
# This plot illustrates the distribution of loan statuses across various employment lengths. It provides insights into how employment length relates to repayment outcomes.

In [None]:
loan_df.sample(5)

In [None]:
loan_df.info()

In [None]:
loan_df["grade"].value_counts()

In [None]:
loan_df.info()

In [None]:
categorical_cols = ['term', 'grade', 'emp_length', 'home_ownership', 'verification_status', 'purpose', 'loan_status']

In [None]:
# Encode categorical variables using label encoding.
# One encoder is kept per column so the original labels can be recovered later
# with label_encoders[col].inverse_transform(...)
label_encoders = {}
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    loan_df[col] = label_encoders[col].fit_transform(loan_df[col])

# Display the DataFrame with encoded categorical variables
print(loan_df.head())
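
Label encoding assigns codes in sorted string order, which does not match the natural order of an ordinal variable such as emp_length. The sketch below contrasts the two on a few labels. It assumes the labels look like "< 1 year", "1 year", and "10+ years" (I have not listed the exact values above), and `emp_length_to_years` is a hypothetical helper.

```python
import re
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def emp_length_to_years(label: str) -> int:
    """Map an employment-length label to a number of years (assumed label format)."""
    if label.strip().startswith("<"):
        return 0  # "< 1 year" is treated as zero full years
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError(f"Unrecognised employment length: {label!r}")
    return int(match.group())

toy = pd.Series(["< 1 year", "1 year", "2 years", "10+ years"])

# Ordinal mapping preserves the natural order of the categories
years = toy.map(emp_length_to_years)
print(years.tolist())

# Label encoding sorts the strings instead, so "10+ years" lands between "1 year" and "2 years"
encoder_classes = LabelEncoder().fit(toy).classes_.tolist()
print(encoder_classes)
```

Tree-based models can often work around an arbitrary code order, but the neural network treats the codes as magnitudes, so an ordinal mapping may be the safer choice there.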


In [None]:
loan_df.sample(5)

In [None]:
loan_df.loan_status.value_counts()

In [None]:
loan_df.info()

In [None]:
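# loan_status was already integer-encoded in the loop above, so this second pass
# leaves the codes unchanged and simply stores them in loan_status_encoded
label_encoder = LabelEncoder()
loan_df['loan_status_encoded'] = label_encoder.fit_transform(loan_df['loan_status'])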


In [None]:
loan_df.sample(10)

In [None]:
loan_df.info()

In [None]:
loan_df.drop('loan_status', axis=1, inplace=True)

In [None]:
loan_df.sample(5)

In [None]:
loan_df.info()

In [None]:
loan_df.isna().sum()

In [None]:
# Check for missing values (NaNs)
missing_values = loan_df.isnull().sum()
print("Missing values:\n", missing_values)

# Check for infinite values
infinite_values = np.isinf(loan_df).sum()
print("Infinite values:\n", infinite_values)


In [None]:
# Identify records with zero income
zero_income_indices = loan_df[loan_df['annual_inc'] == 0].index

# Set the debt-to-income ratio to NaN for records with zero income
loan_df.loc[zero_income_indices, 'dti'] = np.nan


In [None]:
# Check for infinite values
infinite_values = np.isinf(loan_df).sum()
print("Infinite values:\n", infinite_values)

In [None]:
# Replace NaN values with the mean of the column
# (assigning the result back avoids the deprecated chained in-place fillna)
mean_dti = loan_df['dti'].mean()
loan_df['dti'] = loan_df['dti'].fillna(mean_dti)
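
Because this mean is computed on the full dataset, information from rows that later land in the test set leaks into the imputation. A leakage-free alternative is to fit the imputer on the training rows only and reuse it on the test rows. The sketch below shows the pattern on made-up values with scikit-learn's `SimpleImputer`; it would sit after the train/test split rather than here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up DTI values with two missing entries
toy = pd.DataFrame({"dti": [0.10, 0.20, np.nan, 0.40, 0.30, np.nan, 0.15, 0.25]})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train[["dti"]])  # mean learned from training rows only
test_filled = imputer.transform(test[["dti"]])        # same mean reused for test rows

print("Imputation value learned from the training set:", imputer.statistics_[0])
```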


In [None]:
# Visualisations to understand the data better.
# Histograms for the numerical variables (loan amount, interest rate, installment,
# annual income) with 30 bins and a kernel density estimate (KDE) overlay.
# They show each variable's central tendency, spread, and outliers.


plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
sns.histplot(loan_df['loan_amnt'], bins=30, kde=True, alpha=0.7)
plt.title('Distribution of Loan Amount')

plt.subplot(2, 2, 2)
sns.histplot(loan_df['int_rate'], bins=30, kde=True, alpha=0.7)
plt.title('Distribution of Interest Rate')

plt.subplot(2, 2, 3)
sns.histplot(loan_df['installment'], bins=30, kde=True, alpha=0.7)
plt.title('Distribution of Installment')

plt.subplot(2, 2, 4)
sns.histplot(loan_df['annual_inc'], bins=30, kde=True, alpha=0.7)
plt.title('Distribution of Annual Income')

plt.tight_layout()
plt.show()


In [None]:
# Bar plots for the categorical variables (term, grade, employment length, home
# ownership, verification status, purpose) with rotated x-axis labels.
# They show how frequently each category occurs within a variable.


plt.figure(figsize=(20, 10))

plt.subplot(3, 2, 1)
sns.countplot(x='term', data=loan_df, palette='bright')
plt.title('Distribution of Term')
plt.xticks(rotation=45)

plt.subplot(3, 2, 2)
sns.countplot(x='grade', data=loan_df, palette='bright')
plt.title('Distribution of Grade')
plt.xticks(rotation=45)

plt.subplot(3, 2, 3)
sns.countplot(x='emp_length', data=loan_df, palette='bright')
plt.title('Distribution of Employment Length')
plt.xticks(rotation=45)

plt.subplot(3, 2, 4)
sns.countplot(x='home_ownership', data=loan_df, palette='bright')
plt.title('Distribution of Home Ownership')
plt.xticks(rotation=45)

plt.subplot(3, 2, 5)
sns.countplot(x='verification_status', data=loan_df, palette='bright')
plt.title('Distribution of Verification Status')
plt.xticks(rotation=45)

plt.subplot(3, 2, 6)
sns.countplot(x='purpose', data=loan_df, palette='bright')
plt.title('Distribution of Purpose')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


In [None]:
# Box plots of the numerical variables (loan amount, interest rate, installment,
# annual income) grouped by the encoded loan status.
# They show whether these variables differ noticeably between loan outcomes.

plt.figure(figsize=(14, 10))

plt.subplot(2, 2, 1)
sns.boxplot(x='loan_status_encoded', y='loan_amnt', data=loan_df, palette='bright')
plt.title('Loan Amount by Loan Status')
plt.xlabel('Loan Status')
plt.xticks(rotation=45)

plt.subplot(2, 2, 2)
sns.boxplot(x='loan_status_encoded', y='int_rate', data=loan_df, palette='bright')
plt.title('Interest Rate by Loan Status')
plt.xlabel('Loan Status')
plt.xticks(rotation=45)

plt.subplot(2, 2, 3)
sns.boxplot(x='loan_status_encoded', y='installment', data=loan_df, palette='bright')
plt.title('Installment by Loan Status')
plt.xlabel('Loan Status')
plt.xticks(rotation=45)

plt.subplot(2, 2, 4)
sns.boxplot(x='loan_status_encoded', y='annual_inc', data=loan_df, palette='bright')
plt.title('Annual Income by Loan Status')
plt.xlabel('Loan Status')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Split the data into features (X) and target variable (y)
X = loan_df.drop('loan_status_encoded', axis=1)
y = loan_df['loan_status_encoded']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
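
With a multi-class target whose classes differ a lot in size, a purely random split can leave rare statuses under-represented in the test set. Passing `stratify` keeps the class proportions the same in both parts. The sketch below demonstrates this on synthetic labels; note that stratification needs at least two rows in every class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced three-class target (160 / 30 / 10 rows)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 160 + [1] * 30 + [2] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in the training and test parts keep the original 16 : 3 : 1 ratio
print(np.bincount(y_tr), np.bincount(y_te))
```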

In [None]:
import xgboost as xgb
# The scikit-learn wrapper infers the number of classes from y_train, so it is not hard-coded here
xgb_model = xgb.XGBClassifier(objective='multi:softprob', random_state=42)
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("XGBoost Accuracy:", accuracy)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

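# Classification report (zero_division=0 scores classes that are never predicted as 0 instead of 1,
# so empty classes do not inflate the averages)
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))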

# Additional evaluation metrics or visualizations
# ROC-AUC curve, feature importance plot, etc.
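
Accuracy alone can be misleading when one loan status dominates. Balanced accuracy and macro-averaged F1 weight every class equally and expose a model that ignores rare classes. The sketch below uses made-up labels and a deliberately naive "always predict class 0" model to show the gap.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: eight loans in class 0, one each in classes 1 and 2
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
y_naive = np.zeros_like(y_true)  # a model that always predicts the majority class

acc = accuracy_score(y_true, y_naive)
bal_acc = balanced_accuracy_score(y_true, y_naive)      # mean of per-class recall
macro_f1 = f1_score(y_true, y_naive, average="macro", zero_division=0)

print("Accuracy:", acc)
print("Balanced accuracy:", round(bal_acc, 3))
print("Macro F1:", round(macro_f1, 3))
```

The same three calls can be applied to `y_test` and `y_pred` from the cell above to check how much of the reported accuracy comes from the majority class.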

In [None]:
# Evaluate Model Performance - ROC-AUC Curve
# Get the predicted probabilities for each class
y_probs = xgb_model.predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(len(xgb_model.classes_)):
    fpr[i], tpr[i], _ = roc_curve((y_test == xgb_model.classes_[i]).astype(int), y_probs[:, i])  # Convert boolean to int
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot ROC curve for each class
plt.figure(figsize=(10, 8))
for i in range(len(xgb_model.classes_)):
    plt.plot(fpr[i], tpr[i], label='ROC curve (class {}) (area = {:.2f})'.format(xgb_model.classes_[i], roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

# Evaluate Model Performance - Feature Importance Plot
# Get feature importances from the trained model
feature_importance = xgb_model.feature_importances_

# Create a dataframe with feature names and their importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importance})

# Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance Plot')
plt.show()
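
The built-in importances reflect how the trees used each feature during training. Permutation importance is a complementary check: it measures how much the held-out score drops when one feature's values are shuffled. The sketch below uses synthetic data and a `RandomForestClassifier` as a stand-in so that it runs on its own; the same `permutation_importance` call should also accept the fitted `xgb_model` with `X_test` and `y_test`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which only the first feature determines the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature several times on held-out data and record the mean score drop
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```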
