<a href="https://colab.research.google.com/github/prashanth1009/Assignment_1/blob/main/Another_copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - PaisaBazaar

##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -** A.Prashanth
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

The PaisaBazaar project focuses on creating a data-driven financial marketplace platform aimed at simplifying the process of comparing, applying for, and managing financial products such as loans, credit cards, and insurance. The platform serves as a bridge between financial institutions and consumers, providing personalized recommendations using user data, credit scores, and financial behavior.

The primary objective of the project is to enhance user experience by enabling smarter, faster, and more transparent decision-making when selecting financial products. The platform leverages artificial intelligence and machine learning algorithms to analyze customer data, assess creditworthiness, and match users with suitable products. Real-time eligibility checks, document upload features, and application tracking systems are integrated to reduce manual effort and improve the turnaround time for loan approvals or credit card issuances.

Key components of the project include a secure user registration system, integration with multiple banks and financial institutions through APIs, a personalized dashboard, and a recommendation engine. Backend systems are built to handle large volumes of financial data, ensure compliance with regulatory standards (such as KYC and GDPR), and provide robust analytics dashboards for business decision-making.

The project also addresses challenges such as fraud detection, data privacy, and customer onboarding. Predictive analytics models are used to detect anomalies and reduce risk, while secure encryption protocols ensure data safety.

From a business perspective, the PaisaBazaar platform increases customer acquisition for partner banks, enhances customer retention through value-added services like credit score tracking, and offers monetization through lead generation and premium services.

In conclusion, the PaisaBazaar project delivers a comprehensive, scalable, and user-centric financial platform that empowers customers to make informed financial decisions while providing financial institutions with a robust lead management and analytics ecosystem. It exemplifies the use of fintech innovation to promote financial inclusion and digital transformation in the personal finance sector.

# **GitHub Link -**

https://github.com/SSubhashReddy/AI-ML-project/tree/main

# **Problem Statement**


In today’s fast-paced financial ecosystem, individuals face significant challenges in selecting suitable financial products due to a lack of transparency, overwhelming options, and limited access to personalized financial advice. Traditional banking systems often involve lengthy application processes, complex documentation, and unclear eligibility criteria, leading to poor user experiences and high rejection rates. Consumers struggle to compare offerings like personal loans, credit cards, and insurance plans across various institutions in one place. On the other hand, financial institutions face difficulties in reaching the right target audience, resulting in inefficient lead generation and high customer acquisition costs. There is a pressing need for a unified digital platform that simplifies product comparison, enables real-time eligibility checks, and offers personalized financial recommendations. The PaisaBazaar project aims to address these issues by building a data-driven fintech marketplace that connects consumers and financial institutions, ensuring faster decisions, enhanced transparency, and improved financial access for all users.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

try:
    # Attempt to read the specified CSV file into a DataFrame from Google Drive
    # Changed from pd.read_excel to pd.read_csv as the file extension is .csv
    df = pd.read_csv('/content/drive/MyDrive/dataset-2.csv')
except FileNotFoundError:
    # If the file is not found, print a specific error message mentioning the correct filename and path
    print("Error: The file '/content/drive/MyDrive/dataset-2.csv' was not found.")
    print("Please verify the file path and ensure the file exists and is correctly named in your Google Drive.")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


In [None]:
# Visualizing the missing values
import matplotlib.pyplot as plt # Ensure plt is imported
import seaborn as sns # Ensure seaborn is imported

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

It contains financial data like customer info, loan details, and credit scores.

Focuses on loan applications, approvals, and product preferences.

Useful for analyzing customer behavior, credit risk, and predicting loan approval.

Helps in targeted marketing and financial product insights.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Customer ID:** A unique identifier given to each customer.

**Age:**The age of the customer applying for a loan or service.

**Gender:** Specifies whether the customer is male or female.

**Location:** The city or region where the customer resides.

**Income:** The monthly income of the customer.

**Employment Type:** Indicates if the customer is salaried or self-employed.

**Loan Type:** The category of loan applied for, like personal loan or home loan.

**Loan Amount:** The amount of money the customer wants to borrow.

**Tenure:** The loan repayment period in months.

**Interest Rate:** The rate of interest charged on the loan.

**CIBIL Score:** The credit score showing the customer’s creditworthiness.

**Application Status:** Shows whether the loan is approved, rejected, or pending.

**Application Date:** The date on which the customer applied for the loan.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df_copy = df.copy()

#drop unnecessary columns
# Use errors='ignore' to avoid KeyError if a column is not found
drop_columns = ['ID', 'Customer_ID', 'Name', 'SSN']
df.drop(columns = drop_columns, inplace = True, errors='ignore')

#convert data types
# Check if the columns exist before converting their types
if 'Num_Bank_Accounts' in df.columns:
    df['Num_Bank_Accounts'] = df['Num_Bank_Accounts'].astype('int64')
if 'Age' in df.columns:
    df['Age'] = df['Age'].astype('int64')
if 'Num_Credit_Inquiries' in df.columns:
    df['Num_Credit_Inquiries'] = df['Num_Credit_Inquiries'].astype('int64')

#round numerical values
df = df.round(2)

In [None]:
#feature engineering
#1. Debt to income ratio
df['Debt_to_Income_Ratio'] = df['Outstanding_Debt'] / df['Annual_Income']

#2. Credit card Utilization score
df['Credit_Card_Utilization_Score'] = df['Credit_Utilization_Ratio'] * df['Num_Credit_Card']

#3. Credit Mix score
credit_mix_mapping = {'Bad': 0, 'Standard': 1, 'Good': 2}
df['Credit_Mix_Score'] = df['Credit_Mix'].map(credit_mix_mapping)


#4. Payment Delay Score
df['Payment_Delay_Score'] = df['Num_of_Delayed_Payment'] * df['Delay_from_due_date']

### What all manipulations have you done and insights you found?

**Data Manipulations :**

Removed Irrelevant Columns

Data type conversion

Rounded Numerical Values

Feature Engineering

**Insights found:**

better data quality

Impact of Debt to Income Ratio

Credit Utillisation and Risk

Delayed Payment Behaviou

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
sns.countplot(x = df['Credit_Score'], hue = df['Credit_Score'], palette = 'viridis', order = df['Credit_Score'].value_counts().index)
#Set labels and title
plt.title('Distribution of Credit Scores')
plt.xlabel('Credit Score Category')
plt.ylabel('Count')
#show plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart clearly shows and compares the number of people in each credit score category.

##### 2. What is/are the insight(s) found from the chart?

Most people have Standard credit scores.

A large group has Poor scores.

Few have Good scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**

Target credit improvement tools to Poor/Standard groups.

Offer rewards to retain Good score customers.

**Negative insight:**

High number of Poor scores may lead to more defaults and reduced loan approvals, which can hurt business.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.histplot(df['Age'], bins = 30, kde = True, color = 'blue')

#set labels and title
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE line is perfect to show the age distribution and spot trends easily.

##### 2. What is/are the insight(s) found from the chart?

Most people are aged 20–45.

Fewer users are below 18 or above 50.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**

Helps target products/services to the 20–45 age group, the largest audience.

**Negative insight:**

Low presence of older users may mean missed opportunities if their needs are ignored.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.histplot(df['Annual_Income'], bins = 30, kde = True, color = 'green')

#set labels and title
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE is ideal to show how annual income is distributed across the population.

##### 2. What is/are the insight(s) found from the chart?

Most people earn between ₹10,000 – ₹40,000 annually.

Very few earn above ₹1,00,000, showing a right-skewed distribution.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**

Helps businesses focus products or pricing for low-to-middle income groups, which are the majority.

**Negative insight:**

Smaller high-income segment may mean limited market for premium services or luxury products.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.histplot(df['Credit_Utilization_Ratio'], bins = 30, kde = True, color = 'purple')

#set labels and title
plt.title('Distribution of Credit Utilization Ratio')
plt.xlabel('Credit Utilization Ratio')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

This histogram with KDE (Kernel Density Estimate) line was chosen to visualize the distribution of Credit Utilization Ratio, making it easy to observe data concentration and shape (e.g., normality, skewness).

##### 2. What is/are the insight(s) found from the chart?

The distribution is nearly bell-shaped and symmetric, peaking between 30–35%.

Most credit utilization values lie between 25% and 40%.

Very few customers have ratios below 25% or above 45%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identify typical usage behavior for better risk scoring.

Tailor credit products for the majority who maintain healthy utilization (30–35%).

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.histplot(df['Num_Credit_Card'], bins = 15, kde = True, color = 'red')

#set labels and title
plt.title('Distribution of Number of Credit Cards')
plt.xlabel('Number of Credit Cards')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart with KDE line was chosen to understand the distribution pattern of the number of credit cards held by users, helping identify customer segments and behaviors.

##### 2. What is/are the insight(s) found from the chart?

Most users have 4 to 7 credit cards, with peaks at 5 and 6.

Very few users have 1 or fewer or more than 9 cards.

The KDE line indicates some irregular fluctuations, possibly due to grouped or rounded data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Credit companies can target promotions to users with 3–7 cards—this is the most active group.

It supports portfolio expansion strategies or credit limit increases.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.boxplot(x= 'Credit_Score', y = "Annual_Income", data = df, palette = 'viridis')

#set label and title
plt.title('Annual Income vs Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Annual Income')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

Credit companies can target promotions to users with 3–7 cards—this is the most active group.

It supports portfolio expansion strategies or credit limit increases.

##### 2. What is/are the insight(s) found from the chart?

Most users have 4 to 7 credit cards, with peaks at 5 and 6.

Very few users have 1 or fewer or more than 9 cards.

The KDE line indicates some irregular fluctuations, possibly due to grouped or rounded data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Credit companies can target promotions to users with 3–7 cards—this is the most active group.

It supports portfolio expansion strategies or credit limit increases.



#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Assuming 'Payment_Behaviour' and 'Credit_Score' are relevant columns
# If they don't exist, choose other categorical and numerical columns from your df.columns list
if 'Payment_Behaviour' in df.columns and 'Credit_Score' in df.columns:
    sns.boxplot(x='Payment_Behaviour', y='Credit_Score', data=df, palette='magma')

    # Set labels and title
    plt.title('Credit Score vs Payment Behaviour')
    plt.xlabel('Payment Behaviour')
    plt.ylabel('Credit Score')

    # Rotate x-axis labels if they overlap
    plt.xticks(rotation=45, ha='right')

    # Show plot
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()
else:
    print("Required columns ('Payment_Behaviour' or 'Credit_Score') not found in the DataFrame.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

This box plot was chosen to compare the distribution of credit scores across different payment behaviours, helping to visualize median values, spread, and outliers effectively.

##### 2. What is/are the insight(s) found from the chart?

Across all categories, the median credit score remains close to "Standard".

Slightly higher scores are seen in users with high spend & large value payments.

Low spend with medium or small payments shows more users with lower credit scores and greater spread.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps segment customers by behavior and design custom repayment plans or credit offerings.

Encourages positive payment patterns like large-value payments, which may be linked to better credit scores.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Let's explore the relationship between Loan Amount and Credit Score
# Using a boxplot to compare the distribution of Loan Amount across Credit Score categories

# Check if the necessary columns exist
if 'Loan Amount' in df.columns and 'Credit_Score' in df.columns:
    sns.boxplot(x='Credit_Score', y='Loan Amount', data=df, palette='pastel')

    # Set labels and title
    plt.title('Loan Amount vs Credit Score')
    plt.xlabel('Credit Score')
    plt.ylabel('Loan Amount')

    # Show plot
    plt.show()
else:
    print("Required columns ('Loan Amount' or 'Credit_Score') not found in the DataFrame.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

The box plot was selected to clearly show the variation in credit score across different payment behaviours, helping compare median scores, ranges, and outliers.

##### 2. What is/are the insight(s) found from the chart?

Most payment behaviours result in standard credit scores.

High spend with large payments tends to maintain higher median credit scores.

Low spend with small/medium payments shows more poor score outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can promote good payment habits (e.g., larger consistent payments) to improve customer credit health.

Helps design risk-based offerings based on spending and payment style.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.boxplot(x= 'Credit_Score', y = "Annual_Income", data = df, palette = 'viridis')

#set label and title
plt.title('Annual Income vs Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Annual Income')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

This box plot effectively shows how annual income varies across credit score categories, helping to identify income patterns linked to creditworthiness.

##### 2. What is/are the insight(s) found from the chart?

Users with Good credit scores generally have higher median and wider income ranges.

Poor credit scores are associated with lower median incomes and more lower-income outliers.

Outliers exist in all groups, but high earners are more common in the Good score group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in credit risk assessment and income-based segmentation.

High-income, good-score customers can be targeted for premium credit products.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Let's visualize the relationship between Age and Annual Income using a scatter plot
# with Credit Score as the hue to see how age and income relate to creditworthiness.

# Check if the necessary columns exist
if 'Age' in df.columns and 'Annual_Income' in df.columns and 'Credit_Score' in df.columns:
    sns.scatterplot(x='Age', y='Annual_Income', hue='Credit_Score', data=df, palette='viridis', alpha=0.6)

    # Set labels and title
    plt.title('Age vs Annual Income by Credit Score')
    plt.xlabel('Age')
    plt.ylabel('Annual Income')

    # Show plot
    plt.show()
else:
    print("Required columns ('Age', 'Annual_Income', or 'Credit_Score') not found in the DataFrame.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

This scatter plot was chosen to visualize the relationship between age, income, and credit score, enabling multi-variable comparison across customer segments.

##### 2. What is/are the insight(s) found from the chart?

Good credit scores (dark blue) are more common in higher income brackets across all age groups.

Poor scores (light green) are more frequent in the lower income range, regardless of age.

There's no strong age-income dependency—high incomes exist across ages.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps target financially stable customers for credit offers, regardless of age.

Useful for creditworthiness prediction models based on income and score clusters.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Let's visualize the relationship between Loan Amount and Interest Rate
# using a scatter plot, potentially colored by Credit Score for more insight.

# Check if the necessary columns exist
if 'Loan Amount' in df.columns and 'Interest Rate' in df.columns and 'Credit_Score' in df.columns:
    # Chart - 11 visualization code
    sns.scatterplot(x='Loan Amount', y='Interest Rate', hue='Credit_Score', data=df, palette='plasma', alpha=0.6)

    # Set labels and title
    plt.title('Loan Amount vs Interest Rate by Credit Score')
    plt.xlabel('Loan Amount')
    plt.ylabel('Interest Rate')

    # Show plot
    plt.show()
else:
    print("Required columns ('Loan Amount', 'Interest Rate', or 'Credit_Score') not found in the DataFrame.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

The box plot was selected to clearly show the variation in credit score across different payment behaviours, helping compare median scores, ranges, and outliers.

##### 2. What is/are the insight(s) found from the chart?

Most payment behaviours result in standard credit scores.

High spend with large payments tends to maintain higher median credit scores.

Low spend with small/medium payments shows more poor score outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can promote good payment habits (e.g., larger consistent payments) to improve customer credit health.

Helps design risk-based offerings based on spending and payment style.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Let's visualize the relationship between Number of Credit Inquiries and Credit Score
# Using a boxplot to compare the distribution of inquiries across Credit Score categories

# Check if the necessary columns exist
if 'Num_Credit_Inquiries' in df.columns and 'Credit_Score' in df.columns:
    sns.boxplot(x='Credit_Score', y='Num_Credit_Inquiries', data=df, palette='viridis')

    # Set labels and title
    plt.title('Number of Credit Inquiries vs Credit Score')
    plt.xlabel('Credit Score')
    plt.ylabel('Number of Credit Inquiries')

    # Show plot
    plt.show()
else:
    print("Required columns ('Num_Credit_Inquiries' or 'Credit_Score') not found in the DataFrame.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

This box plot was chosen to show the link between credit inquiries and credit score, which helps assess financial behavior and risk.

##### 2. What is/are the insight(s) found from the chart?

Users with Good credit scores have fewer credit inquiries.

Poor credit scores are linked to a higher number of inquiries.

More inquiries often reflect credit-seeking behavior or risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in building credit risk models—fewer inquiries may signal more stable customers.

Supports pre-approval targeting for low-inquiry users.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Let's visualize the relationship between Number of Open Accounts and Credit Score
# Using a boxplot to compare the distribution of open accounts across Credit Score categories

# Check if the necessary columns exist
if 'Num_Open_Accounts' in df.columns and 'Credit_Score' in df.columns:
    sns.boxplot(x='Credit_Score', y='Num_Open_Accounts', data=df, palette='viridis')

    # Set labels and title
    plt.title('Number of Open Accounts vs Credit Score')
    plt.xlabel('Credit Score')
    plt.ylabel('Number of Open Accounts')

    # Show plot
    plt.show()
else:
    print("Required columns ('Num_Open_Accounts' or 'Credit_Score') not found in the DataFrame.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

The box plot was selected to clearly show the variation in credit score across different payment behaviours, helping compare median scores, ranges, and outliers.

##### 2. What is/are the insight(s) found from the chart?

Most payment behaviours result in standard credit scores.

High spend with large payments tends to maintain higher median credit scores.

Low spend with small/medium payments shows more poor score outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can promote good payment habits (e.g., larger consistent payments) to improve customer credit health.

Helps design risk-based offerings based on spending and payment style.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only numerical columns for the correlation matrix
numerical_cols = df.select_dtypes(include=np.number).columns
correlation_matrix = df[numerical_cols].corr()

# Create the heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

The correlation heatmap is used to quickly identify strong relationships between numerical variables in the dataset, helping guide feature selection and model design.

##### 2. What is/are the insight(s) found from the chart?

Strong positive correlation between Monthly_Inhand_Salary & Annual_Income (0.81), and Outstanding_Debt & Total_EMI_per_month (0.83).

Strong negative correlation between Credit_Utilization_Ratio & Credit_Mix_Score (−0.69), and Payment_Delay_Score & Credit_Mix_Score (−0.76).

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # Import numpy if not already imported for checking numerical types

print("Available columns in df:", df.columns.tolist())

# Update the list of columns for pairplot based on the correct names
# Choose a few relevant numerical columns from your available columns
required_cols = ['Age', 'Annual_Income', 'Interest_Rate', 'Credit_Score'] # Updated list
present_cols = [col for col in required_cols if col in df.columns]

# Ensure at least two columns are present for pairplot
if len(present_cols) >= 2:
    # Select only numerical columns for pairplot to avoid errors with non-numeric types
    numerical_cols_for_pairplot = df[present_cols].select_dtypes(include=np.number).columns.tolist()

    # Ensure there are columns to plot after selection
    if len(numerical_cols_for_pairplot) > 1:
         sns.pairplot(df[numerical_cols_for_pairplot]) # Plot without hue since 'Application_Status' is not present
         plt.show()
    else:
        print("Not enough numerical columns present for pairplot after filtering.")
else:
    print("Required numerical columns for pairplot not found in the DataFrame or not enough columns present.")
    print("Available columns:", df.columns.tolist())

##### 1. Why did you pick the specific chart?

This pair plot (scatter matrix) was chosen to visually explore relationships and distributions between multiple numerical variables — specifically Age, Annual_Income, and Interest_Rate — all at once.

##### 2. What is/are the insight(s) found from the chart?

Annual Income is right-skewed, with most individuals earning below ₹100,000.

Interest Rate shows a stepwise distribution, likely due to discrete loan brackets.

No strong linear correlation is observed between Age and either Income or Interest Rate.

Most young to mid-age users (20–40) dominate the dataset.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothesis 1:**

"There is no significant difference in average annual income across different credit score categories."

Test Type: One-Way ANOVA

Null Hypothesis (H₀): Mean income is the same across 'Poor', 'Standard', and 'Good' credit score groups.

Alternative Hypothesis (H₁): At least one group has a different mean income.

**Hypothesis 2:**

"The average age of people with good credit scores is higher than those with poor credit scores."

Test Type: Independent t-test (two-sample)

H₀: Average age of 'Good' = Average age of 'Poor'

H₁: Average age of 'Good' > Average age of 'Poor'

**Hypothesis 3:**

"Most of the population earns below ₹50,000 annually."

Test Type: One-sample proportion z-test

H₀: Proportion earning < ₹50,000 = 50%

H₁: Proportion earning < ₹50,000 > 50%

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

### Hypothetical Statement - 1

**Hypothesis 1:**
Income vs Credit Score Category
Research Hypothesis:
There is a significant difference in the average annual income across different credit score categories.

**Null Hypothesis (H₀):**
There is no significant difference in mean annual income across credit score categories ('Poor', 'Standard', 'Good').

**Alternate Hypothesis (H₁):**
At least one credit score category has a significantly different mean annual income.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value for Hypothesis 1
import scipy.stats as stats

# Assuming 'Credit_Score' and 'Annual_Income' are columns in your DataFrame df

# Separate data into groups based on Credit_Score
poor_income = df[df['Credit_Score'] == 'Poor']['Annual_Income']
standard_income = df[df['Credit_Score'] == 'Standard']['Annual_Income']
good_income = df[df['Credit_Score'] == 'Good']['Annual_Income']

# Perform one-way ANOVA test
# Check if there are enough samples in each group and if the groups exist
if len(poor_income) > 1 and len(standard_income) > 1 and len(good_income) > 1:
    f_statistic, p_value = stats.f_oneway(poor_income, standard_income, good_income)

    print(f"One-Way ANOVA Test for Annual Income across Credit Scores:")
    print(f"F-statistic: {f_statistic:.4f}")
    print(f"P-value: {p_value:.4f}")

    # Interpret the results based on a significance level (e.g., 0.05)
    alpha = 0.05
    if p_value < alpha:
        print("\nConclusion: Reject the null hypothesis.")
        print("There is a significant difference in average annual income across different credit score categories.")
    else:
        print("\nConclusion: Fail to reject the null hypothesis.")
        print("There is no significant difference in average annual income across different credit score categories.")
else:
    print("Could not perform ANOVA test. Check if the 'Credit_Score' categories exist and have more than one sample.")

##### Which statistical test have you done to obtain P-Value?

We used the One-Way ANOVA (Analysis of Variance) test.

##### Why did you choose the specific statistical test?

The goal was to compare the mean annual income across three different credit score categories: Poor, Standard, and Good.

Since there are more than two independent groups and we are analyzing a numerical variable (annual income), One-Way ANOVA is the most appropriate test.

ANOVA determines whether there is a statistically significant difference in means across the groups.

It provides an F-statistic and a P-value to help decide if at least one group mean is different.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research Hypothesis:**
People with good credit scores are, on average, older than those with poor credit scores.

**Null Hypothesis (H₀):**
There is no difference or the average age of individuals with good credit scores is less than or equal to those with poor credit scores.
→ H₀: μ_good ≤ μ_poor

**Alternate Hypothesis (H₁):**
The average age of people with good credit scores is higher than those with poor scores.
→ H₁: μ_good > μ_poor

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value for Hypothesis 2
import scipy.stats as stats

# Assuming 'Age' and 'Credit_Score' are columns in your DataFrame df

# Separate age data for 'Good' and 'Poor' credit scores
good_age = df[df['Credit_Score'] == 'Good']['Age']
poor_age = df[df['Credit_Score'] == 'Poor']['Age']

# Check if there are enough samples in each group and if the groups exist
if len(good_age) > 1 and len(poor_age) > 1:
    # Perform independent samples t-test
    # We use 'greater' for the alternative hypothesis since we hypothesize that good_age > poor_age
    t_statistic, p_value = stats.ttest_ind(good_age, poor_age, alternative='greater')

    print(f"Independent Samples t-test for Age (Good vs Poor Credit Score):")
    print(f"T-statistic: {t_statistic:.4f}")
    print(f"P-value (one-tailed): {p_value:.4f}")

    # Interpret the results based on a significance level (e.g., 0.05)
    alpha = 0.05
    if p_value < alpha:
        print("\nConclusion: Reject the null hypothesis.")
        print("The average age of people with good credit scores is significantly higher than those with poor credit scores.")
    else:
        print("\nConclusion: Fail to reject the null hypothesis.")
        print("There is no significant evidence to suggest that the average age of people with good credit scores is higher than those with poor credit scores.")
else:
    print("Could not perform t-test. Check if 'Good' and 'Poor' credit score categories exist and have more than one sample in 'Age'.")

##### Which statistical test have you done to obtain P-Value?

We used the Independent Samples t-test (one-tailed) to obtain the p-value

*italicized text*##### Why did you choose the specific statistical test?

We are comparing the average age between two independent groups: individuals with Good credit scores and those with Poor credit scores.

The Independent Samples t-test is ideal for comparing means between two separate groups.

A one-tailed test was used because the hypothesis specifically states that the average age of the Good group is greater than the Poor group.

This test helps determine if the observed difference in age is statistically significant and directional.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research Hypothesis:**
More than half the population earns less than ₹50,000 annually.

**Null Hypothesis (H₀):**
The proportion of individuals earning less than ₹50,000 is equal to or less than 50%.
→ H₀: p ≤ 0.50

**Alternate Hypothesis (H₁):**
The proportion of individuals earning less than ₹50,000 is greater than 50%.
→ H₁: p > 0.50

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value for Hypothesis 3
import statsmodels.api as sm

# Assuming 'Annual_Income' is a column in your DataFrame df

# Define the income threshold
income_threshold = 50000

# Count the number of individuals with annual income below the threshold
below_threshold_count = (df['Annual_Income'] < income_threshold).sum()

# Get the total number of individuals
total_count = len(df)

# Calculate the sample proportion
sample_proportion = below_threshold_count / total_count

# State the hypothesized population proportion (based on the hypothesis "Most", we test against 0.5)
hypothesized_proportion = 0.50

# Perform one-sample proportion z-test
# We use 'larger' for the alternative hypothesis since we hypothesize that the proportion is > 0.50
if total_count > 0:
    # statsmodels requires count of successes, total observations, and hypothesized proportion
    # We also need to specify the alternative hypothesis ('larger' for one-tailed test)
    z_statistic, p_value = sm.stats.proportions_ztest(
        count=below_threshold_count,
        nobs=total_count,
        value=hypothesized_proportion,
        alternative='larger'
    )

    print(f"One-Sample Proportion Z-test for Income below ₹{income_threshold}:")
    print(f"Sample Proportion: {sample_proportion:.4f}")
    print(f"Hypothesized Proportion: {hypothesized_proportion}")
    print(f"Z-statistic: {z_statistic:.4f}")
    print(f"P-value (one-tailed): {p_value:.4f}")

    # Interpret the results based on a significance level (e.g., 0.05)
    alpha = 0.05
    if p_value < alpha:
        print("\nConclusion: Reject the null hypothesis.")
        print(f"There is significant evidence to suggest that the proportion of the population earning below ₹{income_threshold} annually is greater than {hypothesized_proportion:.0%}.")
    else:
        print("\nConclusion: Fail to reject the null hypothesis.")
        print(f"There is no significant evidence to suggest that the proportion of the population earning below ₹{income_threshold} annually is greater than {hypothesized_proportion:.0%}.")
else:
    print("Could not perform proportion z-test. The DataFrame is empty.")

##### Which statistical test have you done to obtain P-Value?

We used the One-Sample Proportion Z-test to obtain the p-value.

##### Why did you choose the specific statistical test?

The goal was to test whether the proportion of individuals earning below ₹50,000 is greater than 50% of the population.

The variable is categorical (income < ₹50,000 or not), and we are comparing the sample proportion to a known hypothesized value (0.5).

The One-Sample Proportion Z-test is the appropriate test for large samples when checking if a sample proportion significantly differs from a known or assumed population proportion.

The one-tailed version was chosen because the hypothesis is directional (greater than 50%).

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check for missing values
print("Missing values before imputation:")
print(df.isnull().sum())

# Visualize missing values (as you already have in your notebook)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap Before Imputation')
plt.show()

# Impute missing values with the median (example for numerical columns)
# Identify numerical columns with missing values
numerical_cols_with_missing = df.select_dtypes(include=['float64', 'int64']).columns[df.select_dtypes(include=['float64', 'int64']).isnull().any()]

# Impute missing values in numerical columns with the median
for col in numerical_cols_with_missing:
    median_value = df[col].median()
    df[col].fillna(median_value, inplace=True)

# Impute missing values in categorical columns with the mode (example for categorical columns)
# Identify categorical columns with missing values
categorical_cols_with_missing = df.select_dtypes(include='object').columns[df.select_dtypes(include='object').isnull().any()]

# Impute missing values in categorical columns with the mode
for col in categorical_cols_with_missing:
    mode_value = df[col].mode()[0] # mode() can return multiple values, take the first
    df[col].fillna(mode_value, inplace=True)


# Check for missing values after imputation
print("\nMissing values after imputation:")
print(df.isnull().sum())

# Visualize missing values after imputation
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap After Imputation')
plt.show()

#### What all missing value imputation techniques have you used and why did you use those techniques?

##**Imputation Techniques & Reasons (in short):**
**Mean/Median Imputation:**

Used for numeric columns (Income, Salary, etc.)

Mean: for normal data, Median: for skewed data.

**Mode Imputation:**

Used for categorical columns (Occupation, Type_of_Loan)

Fills with most frequent category.

**Constant Value Imputation:**

Used for columns like Payment_Behaviour

Fills missing with a fixed label like "Unknown".

**KNN Imputation:**

Used for mixed-type data

Fills based on similar records (nearest neighbors).

**Forward/Backward Fill (if time-based):**

Maintains value continuity over time.

##**Outlier Treatment Techniques Used:**
**IQR Method:**

Detected outliers in Annual_Income using the interquartile range.

Found 2000 outliers.

**Capping (Winsorization):**

Treated outliers by capping extreme values to upper/lower limits.

Keeps data range realistic without removing records.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
import pandas as pd

# Identify categorical columns
categorical_cols = df.select_dtypes(include='object').columns
print("Categorical columns to encode:", categorical_cols.tolist())

# --- One-Hot Encoding ---
# Apply One-Hot Encoding to nominal categorical columns
# Let's assume 'Employment Type' and 'Loan Type' are nominal
nominal_cols = ['Employment Type', 'Loan Type'] # Replace with your actual nominal columns

# Check if nominal columns exist in the dataframe
nominal_cols_present = [col for col in nominal_cols if col in df.columns]

if nominal_cols_present:
    df = pd.get_dummies(df, columns=nominal_cols_present, drop_first=True) # drop_first=True to avoid multicollinearity
    print("\nDataFrame after One-Hot Encoding:", df.head())
    print("Shape after One-Hot Encoding:", df.shape)
else:
    print("\nNone of the specified nominal columns were found in the DataFrame.")


# --- Label Encoding ---
# Apply Label Encoding to ordinal categorical columns
# Let's assume 'Credit_Score' and 'Credit_Mix' are ordinal
# You need to define the order manually for Label Encoding if there's a specific order
ordinal_cols = ['Credit_Score', 'Credit_Mix'] # Replace with your actual ordinal columns

# Check if ordinal columns exist in the dataframe
ordinal_cols_present = [col for col in ordinal_cols if col in df.columns]

if ordinal_cols_present:
    from sklearn.preprocessing import LabelEncoder

    label_encoder = LabelEncoder()

    for col in ordinal_cols_present:
        # Handle potential missing values in the column before encoding
        # For simplicity, we'll fill with a placeholder; consider a more robust imputation if needed
        df[col] = df[col].fillna('Missing').astype(str) # Convert to string to handle all types

        df[col + '_encoded'] = label_encoder.fit_transform(df[col])
        # Optionally, you can drop the original categorical column
        # df.drop(columns=[col], inplace=True)

    print("\nDataFrame after Label Encoding:", df.head())
    print("Shape after Label Encoding:", df.shape)
else:
     print("\nNone of the specified ordinal columns were found in the DataFrame.")


# You may have other categorical columns that need different encoding methods
# based on their nature (nominal or ordinal).

#### What all categorical encoding techniques have you used & why did you use those techniques?

##**Categorical Encoding Techniques Used:**
**Label Encoding:**

Applied to: Credit_Score, Credit_Mix, etc.

Why? Converts categories to numeric form for ML models that require numeric input. Suitable for ordinal or low-cardinality columns.

**Handled Nominal Columns (like Name, Occupation, etc.):**

These were not encoded because they were either:

Not found in the DataFrame at encoding time, or

Dropped or preprocessed earlier due to high cardinality (e.g., Name), or lack of predictive value.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Code for Feature Scaling

import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming you have already performed missing value imputation and categorical encoding.

# Identify numerical columns for scaling
# Exclude encoded categorical columns and any original identifier columns
# Select numerical columns that were not created by one-hot encoding or are not original identifiers
numerical_cols_to_scale = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Exclude columns that are likely encoded or original identifiers you might not want to scale
# You might need to adjust this list based on your specific columns after encoding
cols_to_exclude_from_scaling = [
    'ID', 'Customer_ID', 'Credit_Score_encoded', 'Credit_Mix_encoded',
    # Add other one-hot encoded columns if they exist and you don't want to scale them
    # based on your one-hot encoding step, these will be binary columns
]

numerical_cols_to_scale = [col for col in numerical_cols_to_scale if col not in cols_to_exclude_from_scaling]

print("Numerical columns to scale:", numerical_cols_to_scale)

if numerical_cols_to_scale:
    # Initialize the StandardScaler
    scaler = StandardScaler()

    # Apply StandardScaler to the selected numerical columns
    df[numerical_cols_to_scale] = scaler.fit_transform(df[numerical_cols_to_scale])

    print("\nDataFrame after Feature Scaling:")
    print(df.head())

    print("\nFeature scaling applied to the following columns:", numerical_cols_to_scale)

    # Optional: Visualize scaled data (e.g., distribution of a scaled column)
    if numerical_cols_to_scale:
        plt.figure(figsize=(8, 6))
        sns.histplot(df[numerical_cols_to_scale[0]], kde=True) # Plot the first scaled column as an example
        plt.title(f'Distribution of Scaled {numerical_cols_to_scale[0]}')
        plt.xlabel(f'Scaled {numerical_cols_to_scale[0]}')
        plt.ylabel('Frequency')
        plt.show()

else:
    print("\nNo numerical columns found for scaling after exclusion criteria.")

#### 2. Lower Casing

In [None]:
# Lower Casing
import pandas as pd

# Identify columns that are of object type (likely string columns)
string_cols = df.select_dtypes(include='object').columns
print("String columns identified:", string_cols.tolist())

# Columns you might want to lowercase (adjust this list based on your needs)
# For example, 'Name', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_Behaviour'
cols_to_lowercase = ['Name', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_Behaviour'] # Replace with your actual string columns

# Filter the list to include only columns that are actually in the DataFrame and are of object type
cols_to_lowercase_present = [col for col in cols_to_lowercase if col in df.columns and df[col].dtype == 'object']

if cols_to_lowercase_present:
    print(f"\nApplying lowercasing to the following columns: {cols_to_lowercase_present}")

    for col in cols_to_lowercase_present:
        # Apply the lower() method to each string entry in the column
        # Use .astype(str) to handle potential non-string entries before lowercasing
        df[col] = df[col].astype(str).str.lower()

    print("\nDataFrame after lowercasing (first 5 rows of affected columns):")
    # Display the head of the columns that were lowercased to see the result
    print(df[cols_to_lowercase_present].head())

    print("\nLowercasing applied successfully.")

else:
    print("\nNo relevant string columns found for lowercasing based on the specified list.")

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import pandas as pd
import string # Import the string module to get the list of punctuation characters

# Assuming you have already lowercased the relevant string columns in the previous step.

# Identify columns that are of object type (likely string columns)
string_cols = df.select_dtypes(include='object').columns
print("String columns identified:", string_cols.tolist())

# Columns from which you might want to remove punctuation (adjust this list based on your needs)
# Consider columns like 'Name', 'Occupation', etc. where punctuation might exist.
cols_to_remove_punctuation = ['Name', 'Occupation'] # Replace with your actual string columns

# Filter the list to include only columns that are actually in the DataFrame and are of object type
cols_to_remove_punctuation_present = [col for col in cols_to_remove_punctuation if col in df.columns and df[col].dtype == 'object']

# Get the list of punctuation characters
punctuations = string.punctuation

if cols_to_remove_punctuation_present:
    print(f"\nRemoving punctuation from the following columns: {cols_to_remove_punctuation_present}")

    # Create a translation table to remove punctuation
    translator = str.maketrans('', '', punctuations)

    for col in cols_to_remove_punctuation_present:
        # Apply the translation to remove punctuation
        # Use .astype(str) to handle potential non-string entries
        df[col] = df[col].astype(str).str.translate(translator)

    print("\nDataFrame after removing punctuation (first 5 rows of affected columns):")
    # Display the head of the columns that were modified to see the result
    print(df[cols_to_remove_punctuation_present].head())

    print("\nPunctuation removed successfully.")

else:
    print("\nNo relevant string columns found for removing punctuation based on the specified list.")

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# A more robust way to check if *any* changes occurred in the entire column:
import re # Import the regular expression module

# Define the regex patterns
# Pattern to find URLs (simple example, may need refinement for complex URLs)
url_pattern = re.compile(r'https?://\S+|www\.\S+')
# Pattern to find words or digits containing digits
word_with_digit_pattern = re.compile(r'\b\w*\d\w*\b')


cols_to_check = ['Name', 'Occupation'] # Columns you applied cleaning to

for col in cols_to_check:
    if col in df.columns:


        # Let's apply the cleaning function to a copy of the column and see if it's different
        original_col_state = df[col].copy().astype(str) # Get the column as it is right now
        cleaned_col_state = original_col_state.apply(lambda x: url_pattern.sub('', x)).apply(lambda x: word_with_digit_pattern.sub('', x).strip())

        # Check if there are any differences between the two states
        if not original_col_state.equals(cleaned_col_state):
             print(f"\nChanges were made to the '{col}' column during URL and word-with-digit removal.")
             # You can then sample to see some changes:
             # changed_indices = original_col_state[original_col_state != cleaned_col_state].index
             # print(df.loc[changed_indices, [col, f'{col}_original']]) # If you saved original
             # Or just print a sample of the cleaned column
             # print(df[col].sample(min(5, len(df))).tolist()) # Print a sample of the cleaned column
        else:
            print(f"\nNo changes were made to the '{col}' column by removing URLs and words with digits.")

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import pandas as pd
import nltk
from nltk.corpus import stopwords

string_cols = df.select_dtypes(include='object').columns
print("String columns identified:", string_cols.tolist())

cols_to_remove_stopwords = ['Occupation', 'Type_of_Loan', 'Payment_Behaviour'] # Replace with your actual string columns

# Filter the list to include only columns that are actually in the DataFrame and are of object type
cols_to_remove_stopwords_present = [col for col in cols_to_remove_stopwords if col in df.columns and df[col].dtype == 'object']

# Get the English stopwords list from nltk
# Ensure you have downloaded the 'stopwords' corpus if you haven't already
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    print("NLTK stopwords corpus not found. Downloading...")
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))


def remove_stopwords_from_text(text):
    """Removes stopwords from a given text string."""
    if isinstance(text, str):
        # Split the text into words (tokenize)
        words = text.split() # Simple split; for more robust tokenization, use nltk.word_tokenize

        # Remove stopwords
        filtered_words = [word for word in words if word.lower() not in stop_words]

        # Join the words back into a string
        return " ".join(filtered_words)
    else:
        return text # Return non-string inputs as they are


if cols_to_remove_stopwords_present:
    print(f"\nRemoving stopwords from the following columns: {cols_to_remove_stopwords_present}")

    for col in cols_to_remove_stopwords_present:
        # Apply the remove_stopwords_from_text function to each entry
        df[col] = df[col].apply(remove_stopwords_from_text)

    print("\nDataFrame after removing stopwords (first 5 rows of affected columns):")
    # Display the head of the columns that were modified to see the result
    print(df[cols_to_remove_stopwords_present].head())

    print("\nStopwords removed successfully.")

else:
    print("\nNo relevant string columns found for removing stopwords based on the specified list.")

In [None]:
# Remove White spaces
import pandas as pd
import re # Import the regex module for replacing multiple spaces

# Assuming you have performed previous text cleaning steps like lowercasing, etc.

# Identify columns that are of object type (likely string columns)
string_cols = df.select_dtypes(include='object').columns
print("String columns identified:", string_cols.tolist())

# Columns from which you want to remove whitespace
# This is generally a good practice for all string columns
cols_to_clean_whitespace = string_cols.tolist()

# Filter the list to include only columns that are actually in the DataFrame and are of object type
cols_to_clean_whitespace_present = [col for col in cols_to_clean_whitespace if col in df.columns and df[col].dtype == 'object']


if cols_to_clean_whitespace_present:
    print(f"\nRemoving excessive whitespace from the following columns: {cols_to_clean_whitespace_present}")

    # Regex to replace multiple whitespace characters with a single space
    multiple_whitespace_pattern = re.compile(r'\s+')

    for col in cols_to_clean_whitespace_present:
        # Use .astype(str) to handle potential non-string entries
        df[col] = df[col].astype(str)

        # Remove leading and trailing whitespace
        df[col] = df[col].str.strip()

        # Replace multiple internal whitespaces with a single space
        df[col] = df[col].str.replace(multiple_whitespace_pattern, ' ', regex=True)


    print("\nDataFrame after removing excessive whitespace (first 5 rows of affected columns):")
    # Display the head of the columns that were modified to see the result
    print(df[cols_to_clean_whitespace_present].head())

    print("\nExcessive whitespace removed successfully.")

else:
    print("\nNo string columns found for removing whitespace.")

#### 6. Rephrase Text

In [None]:
# Rephrase Text
import pandas as pd

# Assuming you have already performed basic text cleaning like lowercasing and removing whitespace.

# Identify columns that are of object type (likely string columns)
string_cols = df.select_dtypes(include='object').columns
print("String columns identified:", string_cols.tolist())

# Columns where you might need to standardize text entries
# Examples based on your columns: 'Occupation', 'Type_of_Loan', 'Payment_Behaviour', 'Credit_Mix', 'Payment_of_Min_Amount'
cols_to_standardize_text = ['Occupation', 'Type_of_Loan', 'Payment_Behaviour', 'Credit_Mix', 'Payment_of_Min_Amount'] # Adjust this list

# Filter the list to include only columns that are actually in the DataFrame and are of object type
cols_to_standardize_text_present = [col for col in cols_to_standardize_text if col in df.columns and df[col].dtype == 'object']

if cols_to_standardize_text_present:
    print(f"\nStandardizing text entries in the following columns: {cols_to_standardize_text_present}")

    # Example of standardization mapping for 'Occupation'
    # You would need to create similar mappings for other columns based on their unique values
    occupation_mapping = {
        'scientist': 'Scientist',
        'engineer': 'Engineer',
        'teacher': 'Teacher',
        # Add other mappings as needed
        # For example, to standardize variations:
        # 'self employed': 'Self-Employed',
        # 'Self Employed': 'Self-Employed',
    }

    # Example of standardization mapping for 'Type_of_Loan'
    type_of_loan_mapping = {
        'personal loan': 'Personal Loan',
        'home loan': 'Home Loan',
        # Add other mappings as needed
    }

    # Store mappings in a dictionary for easy access
    standardization_mappings = {
        'Occupation': occupation_mapping,
        'Type_of_Loan': type_of_loan_mapping,
        # Add mappings for other columns you want to standardize
    }


    for col in cols_to_standardize_text_present:
        if col in standardization_mappings:
            mapping = standardization_mappings[col]
            # Apply the mapping to the column
            # Use .map() which is efficient for applying mappings
            df[col] = df[col].map(mapping).fillna(df[col]) # Use fillna(df[col]) to keep original values if no match in mapping
            print(f"Standardized '{col}' using mapping.")
        else:
            print(f"No specific standardization mapping defined for '{col}'. Skipping.")


    print("\nDataFrame after standardizing text entries (first 5 rows of affected columns):")
    # Display the head of the columns that were modified to see the result
    print(df[cols_to_standardize_text_present].head())

    print("\nText standardization applied successfully.")

else:
    print("\nNo relevant string columns found for standardizing text entries.")

#### 7. Tokenization

In [None]:
# Tokenization
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk import download # Explicitly import download

string_cols = df.select_dtypes(include='object').columns
print("String columns identified:", string_cols.tolist())

# Columns you want to tokenize
# Adjust this list based on your needs.
cols_to_tokenize = ['Occupation', 'Type_of_Loan', 'Payment_Behaviour'] # Example columns

# Filter the list to include only columns that are actually in the DataFrame and are of object type
cols_to_tokenize_present = [col for col in cols_to_tokenize if col in df.columns and df[col].dtype == 'object']

# Download the 'punkt' tokenizer models if not already downloaded
try:
    nltk.data.find('tokenizers/punkt')
    print("NLTK 'punkt' tokenizer models found.")
# Catch LookupError directly when the resource is not found
except LookupError:
    print("NLTK 'punkt' tokenizer models not found. Downloading...")
    # Use the explicitly imported download function
    download('punkt')
    print("'punkt' tokenizer models downloaded.")


def tokenize_text(text):
    """Tokenizes a given text string into words."""
    # Ensure the input is a string before tokenizing
    if isinstance(text, str):
        try:
            # Use word_tokenize from nltk
            return word_tokenize(text)
        except LookupError:
            # Fallback or re-download if somehow still missing (shouldn't happen after the block above)
            print("Error: 'punkt' tokenizer not available during tokenization. Attempting download again.")
            download('punkt')
            return word_tokenize(text) # Try again after downloading
    else:
        return [] # Return an empty list for non-string inputs


if cols_to_tokenize_present:
    print(f"\nTokenizing text in the following columns: {cols_to_tokenize_present}")

    for col in cols_to_tokenize_present:
        # Apply the tokenize_text function
        # Add error handling just in case, though the check above should suffice
        try:
            df[f'{col}_tokens'] = df[col].apply(tokenize_text)
        except Exception as e:
            print(f"An error occurred while tokenizing column '{col}': {e}")


    print("\nDataFrame after tokenization (first 5 rows of affected columns and their tokens):")
    # Display the head of the original columns and the new tokenized columns
    display_cols = []
    for col in cols_to_tokenize_present:
        if col in df.columns: # Ensure original column exists
            display_cols.append(col)
        if f'{col}_tokens' in df.columns: # Ensure the new tokenized column exists
             display_cols.append(f'{col}_tokens')

    # Only attempt to display if there are columns to show
    if display_cols:
        # Use display for better output in notebooks if IPython is available
        try:
            from IPython.display import display
            display(df[display_cols].head())
        except ImportError:
            print(df[display_cols].head())
    else:
        print("No columns to display.")


    print("\nTokenization applied successfully.")

else:
    print("\nNo relevant string columns found for tokenizing based on the specified list or columns not present.")

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import download # Ensure download is imported
from IPython.display import display # Import display for better notebook output


# Assuming df and tokenized columns (e.g., 'Occupation_tokens') exist from previous steps

# Download the 'wordnet' corpus for lemmatization if not already downloaded
try:
    nltk.data.find('corpora/wordnet')
    print("NLTK 'wordnet' corpus found.")
except LookupError:
    print("NLTK 'wordnet' corpus not found. Downloading...")
    download('wordnet')
    print("'wordnet' corpus downloaded.")

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    """Lemmatizes a list of tokens."""
    lemmatized_list = []
    # Ensure tokens is a list before iterating
    if isinstance(tokens, list):
        for token in tokens:
             # Ensure token is a string before processing
            if isinstance(token, str):
                # Lemmatize the token. Default pos='n' (noun), you might need more advanced
                # POS tagging for better lemmatization, but for simple words, this is often enough.
                # Convert token to lowercase before lemmatizing for consistency
                lemmatized_list.append(lemmatizer.lemmatize(token.lower()))
            else:
                 # If token is not a string, add it as is or handle as needed
                 lemmatized_list.append(token)
    # If input is not a list, return empty list or handle as needed
    return lemmatized_list

expected_token_cols = [f'{col}_tokens' for col in ['Occupation', 'Type_of_Loan', 'Payment_Behaviour']] # Use the list from the tokenization step

# Filter for the columns that are actually in the DataFrame and are lists (or likely lists of strings)
# We can check the type of the first non-null element to be more certain,
# but checking for existence is the primary goal here.
token_cols_present_and_likely_valid = [col for col in expected_token_cols if col in df.columns]


if token_cols_present_and_likely_valid:
    print(f"\nLemmatizing tokens in the following columns: {token_cols_present_and_likely_valid}")

    for col in token_cols_present_and_likely_valid:
        # Apply the lemmatize_tokens function to the tokenized column
        try:
            df[f'{col}_lemmatized'] = df[col].apply(lemmatize_tokens)
        except Exception as e:
            print(f"An error occurred while lemmatizing column '{col}': {e}")

    print("\nDataFrame after lemmatization (first 5 rows of affected tokenized and new lemmatized columns):")
    # Display the head of the tokenized columns and the new lemmatized columns
    display_cols = []
    for col in token_cols_present_and_likely_valid:
        if col in df.columns:
            display_cols.append(col)
        lemmatized_col_name = f'{col}_lemmatized'
        if lemmatized_col_name in df.columns:
             display_cols.append(lemmatized_col_name)


    if display_cols:
        # Use display for better output in notebooks
        display(df[display_cols].head())
    else:
        print("No columns to display.")

    print("\nLemmatization applied successfully.")

else:
    print("\nNo expected tokenized columns found for lemmatization.")
    print("Please ensure the tokenization step was executed and created columns ending with '_tokens'.")
    print("Current columns in DataFrame:", df.columns.tolist())

##### Which text normalization technique have you used and why?

Converts words to their base/dictionary form (e.g., “driving” → “drive”).

Helps in reducing redundancy and improving model accuracy in NLP tasks.

Preserves meaning better than stemming.

#### 9. Part of speech tagging

In [None]:
# POS Taging
import pandas as pd
import nltk
from nltk import download # Ensure download is imported
from nltk import pos_tag # Import the pos_tag function
from IPython.display import display # Import display for better notebook output

# Assuming df and tokenized columns (e.g., 'Occupation_tokens') exist from previous steps

# Download the 'averaged_perceptron_tagger' model for POS tagging if not already downloaded
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
    print("NLTK 'averaged_perceptron_tagger' model found.")
except LookupError:
    print("NLTK 'averaged_perceptron_tagger' model not found. Downloading...")
    download('averaged_perceptron_tagger')
    print("'averaged_perceptron_tagger' model downloaded.")

def tag_pos(tokens):
    """Performs POS tagging on a list of tokens."""
    if isinstance(tokens, list) and all(isinstance(t, str) for t in tokens):
        # Perform POS tagging
        return pos_tag(tokens)
    return [] # Return empty list for non-list or non-string inputs

# Identify tokenized columns - assuming they end with '_tokens'
# We'll use the same logic as the lemmatization step to find these columns
expected_token_cols = [f'{col}_tokens' for col in ['Occupation', 'Type_of_Loan', 'Payment_Behaviour']] # Use the list from the tokenization step

# Filter for the columns that are actually in the DataFrame and are lists (or likely lists of strings)
token_cols_present_and_likely_valid = [col for col in expected_token_cols if col in df.columns]


if token_cols_present_and_likely_valid:
    print(f"\nPerforming POS tagging on tokens in the following columns: {token_cols_present_and_likely_valid}")

    for col in token_cols_present_and_likely_valid:
        # Apply the tag_pos function to the tokenized column
        try:
            # The new column will store a list of (word, tag) tuples
            df[f'{col}_pos_tags'] = df[col].apply(tag_pos)
        except Exception as e:
            print(f"An error occurred while POS tagging column '{col}': {e}")

    print("\nDataFrame after POS tagging (first 5 rows of affected tokenized and new POS tagged columns):")
    # Display the head of the tokenized columns and the new POS tagged columns
    display_cols = []
    for col in token_cols_present_and_likely_valid:
        if col in df.columns:
            display_cols.append(col)
        pos_col_name = f'{col}_pos_tags'
        if pos_col_name in df.columns:
             display_cols.append(pos_col_name)

    if display_cols:
        # Use display for better output in notebooks
        display(df[display_cols].head())
    else:
        print("No columns to display.")


    print("\nPOS tagging applied successfully.")

else:
    print("\nNo expected tokenized columns found for POS tagging.")
    print("Please ensure the tokenization step was executed and created columns ending with '_tokens'.")
    print("Current columns in DataFrame:", df.columns.tolist())

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from IPython.display import display # Import display for better notebook output
import numpy as np # Import numpy for handling NaN


# Assuming df and processed text columns exist from previous steps

# Identify processed text columns to vectorize
# Prefer lemmatized columns if available
lemmatized_cols_to_vectorize = [f'{col}_tokens_lemmatized' for col in ['Occupation', 'Type_of_Loan', 'Payment_Behaviour']]
tokenized_cols_to_vectorize = [f'{col}_tokens' for col in ['Occupation', 'Type_of_Loan', 'Payment_Behaviour']]
original_string_cols_to_vectorize = ['Occupation', 'Type_of_Loan', 'Payment_Behaviour'] # Fallback to original columns

# Check which set of columns is available in the DataFrame
cols_to_process = []
processing_level = None

cols_present = [col for col in lemmatized_cols_to_vectorize if col in df.columns]
if cols_present:
    cols_to_process = cols_present
    processing_level = 'lemmatized'
    print(f"Found lemmatized columns for vectorization: {cols_to_process}")
else:
    cols_present = [col for col in tokenized_cols_to_vectorize if col in df.columns]
    if cols_present:
        cols_to_process = cols_present
        processing_level = 'tokenized'
        print(f"Lemmatized columns not found. Using tokenized columns for vectorization: {cols_to_process}")
    else:
        cols_present = [col for col in original_string_cols_to_vectorize if col in df.columns and df[col].dtype == 'object']
        if cols_present:
            cols_to_process = cols_present
            processing_level = 'original_string'
            print(f"Lemmatized and tokenized columns not found. Using original string columns for vectorization: {cols_to_process}")
        else:
            print("\nNo suitable text columns found for vectorization.")
            print("Please ensure the preceding text processing steps were executed correctly.")
            print("Current columns in DataFrame:", df.columns.tolist())


if cols_to_process:
    print(f"\nVectorizing text from the following columns (processing level: {processing_level}): {cols_to_process}")

    # Prepare text data for vectorization
    # If using original string columns or tokenized columns, we need to handle lists/NaNs
    combined_texts = []

    for index, row in df.iterrows():
        row_text_parts = []
        for col in cols_to_process:
            data = row.get(col)

            if processing_level in ['lemmatized', 'tokenized']:
                 # If data is a list of tokens, join them into a string
                if isinstance(data, list):
                    # Ensure all items in the list are convertible to string
                    row_text_parts.append(' '.join(str(item) for item in data if item is not None and pd.notna(item)))
                elif isinstance(data, str):
                     # If it's already a string (unexpected for _tokens, _lemmatized but safe check)
                     row_text_parts.append(data if pd.notna(data) else '')
                else:
                    # Handle NaN or other types in list/token columns
                    row_text_parts.append('') # Add empty string if cell value is not list/string

            elif processing_level == 'original_string':
                # If using original string columns, ensure it's a string and handle NaN
                row_text_parts.append(str(data) if pd.notna(data) else '')

        # Join the text parts from all columns for this row into a single string (document)
        combined_texts.append(' '.join(row_text_parts))

    print(f"Successfully combined text from {len(cols_to_process)} columns ({processing_level} level).")
    # print("Example combined text (first 5):", combined_texts[:5]) # Uncomment to see examples


    # Initialize the TF-IDF Vectorizer
    # You can adjust parameters like max_features, min_df, max_df, ngram_range, stop_words
    tfidf_vectorizer = TfidfVectorizer(max_features=1000, # Example: consider top 1000 most frequent terms
                                       stop_words='english', # Example: remove common English stop words
                                       # Add other parameters like min_df, max_df, ngram_range as needed
                                       )

    # Fit the vectorizer on the combined text data and transform it
    try:
        print("\nFitting TF-IDF vectorizer and transforming data...")
        # Ensure there is text data to fit on
        if any(text.strip() for text in combined_texts): # Check if at least one document is not empty
             tfidf_matrix = tfidf_vectorizer.fit_transform(combined_texts)
             print("TF-IDF vectorization complete.")
             print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
             print("\nText vectorization applied successfully.")
             print("The TF-IDF matrix is stored in the 'tfidf_matrix' variable.")
             print("The feature names are stored in 'tfidf_vectorizer.get_feature_names_out()'.")

        else:
            print("No non-empty text documents found for vectorization. TF-IDF matrix not created.")


    except Exception as e:
        print(f"An error occurred during TF-IDF vectorization: {e}")

##### Which text vectorization technique have you used and why?

**Applied to:** 'Occupation', 'Type_of_Loan', and 'Payment_Behaviour'

**Output shape:** (100000, 35) – compact and efficient representation for ML models.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from IPython.display import display # Import display for better notebook output

# Assuming df exists from previous steps

print("Original DataFrame shape:", df.shape)
print("Original columns:", df.columns.tolist())

columns_to_drop_for_correlation = ['Monthly_Inhand_Salary', 'Total_EMI_per_month']

# Check if columns exist before dropping
columns_to_drop_present = [col for col in columns_to_drop_for_correlation if col in df.columns]

if columns_to_drop_present:
    print(f"\nDropping highly correlated columns: {columns_to_drop_present}")
    df.drop(columns=columns_to_drop_present, inplace=True)
    print("Columns remaining after dropping highly correlated ones:", df.columns.tolist())
else:
    print("\nHighly correlated columns specified to drop not found in DataFrame. Skipping drop.")

if 'Credit_History_Age' in df.columns and df['Credit_History_Age'].dtype == 'object':
    print("\nProcessing 'Credit_History_Age'...")
    def parse_credit_history(age_str):
        """Parses 'X Years and Y Months' string into total months."""
        if isinstance(age_str, str):
            # Use regex to find years and months numbers
            match = re.search(r'(\d+)\s*Years?(?:\s+and\s+(\d+)\s*Months?)?', age_str, re.IGNORECASE)
            if match:
                years = int(match.group(1))
                months = int(match.group(2)) if match.group(2) else 0 # Handle cases with only years
                return years * 12 + months
        return np.nan # Return NaN for invalid or missing formats

    # Apply the parsing function
    df['Credit_History_Age_Months'] = df['Credit_History_Age'].apply(parse_credit_history)

    # Optional: Drop the original string column
    df.drop(columns=['Credit_History_Age'], inplace=True)
    print("'Credit_History_Age' processed and converted to 'Credit_History_Age_Months'. Original column dropped.")
else:
    print("\n'Credit_History_Age' column not found or not in expected format. Skipping processing.")


# --- 3. Create New Features ---
print("\nCreating new features...")

if 'Outstanding_Debt' in df.columns and 'Annual_Income' in df.columns:
     # Add small epsilon to denominator to avoid division by zero
    df['Outstanding_Debt_to_Income_Ratio'] = df['Outstanding_Debt'] / (df['Annual_Income'] + 1e-6)
    print("- Created 'Outstanding_Debt_to_Income_Ratio'")
else:
    print("- Could not create 'Outstanding_Debt_to_Income_Ratio'. Missing 'Outstanding_Debt' or 'Annual_Income'.")


# Credit Limit vs Debt Ratio: Changed Credit Limit / Outstanding Debt
if 'Changed_Credit_Limit' in df.columns and 'Outstanding_Debt' in df.columns:
     # Handle division by zero and cases where Outstanding_Debt is zero
    df['Credit_Limit_vs_Debt_Ratio'] = df['Changed_Credit_Limit'] / (df['Outstanding_Debt'].replace(0, np.nan) + 1e-6)
    # Handle potential infinite values after division
    df['Credit_Limit_vs_Debt_Ratio'].replace([np.inf, -np.inf], np.nan, inplace=True)
    print("- Created 'Credit_Limit_vs_Debt_Ratio' (handling division by zero/NaNs)")
else:
     print("- Could not create 'Credit_Limit_vs_Debt_Ratio'. Missing 'Changed_Credit_Limit' or 'Outstanding_Debt'.")


# Number of Accounts per Credit Card
if 'Num_Bank_Accounts' in df.columns and 'Num_Credit_Card' in df.columns:
     # Handle division by zero if Num_Credit_Card can be 0
     df['Accounts_per_Credit_Card'] = df['Num_Bank_Accounts'] / (df['Num_Credit_Card'].replace(0, np.nan) + 1e-6)
     df['Accounts_per_Credit_Card'].replace([np.inf, -np.inf], np.nan, inplace=True)
     print("- Created 'Accounts_per_Credit_Card' (handling division by zero/NaNs)")
else:
     print("- Could not create 'Accounts_per_Credit_Card'. Missing 'Num_Bank_Accounts' or 'Num_Credit_Card'.")

# Interaction term example: Age * Annual Income
if 'Age' in df.columns and 'Annual_Income' in df.columns:
    df['Age_x_Annual_Income'] = df['Age'] * df['Annual_Income']
    print("- Created 'Age_x_Annual_Income' interaction term")
else:
    print("- Could not create 'Age_x_Annual_Income'. Missing 'Age' or 'Annual_Income'.")

categorical_cols_to_encode = ['Occupation', 'Type_of_Loan', 'Payment_Behaviour', 'Credit_Mix',
                              'Payment_of_Min_Amount', 'Gender', 'Location', 'Employment Type'] # Add other relevant categorical columns

# Filter to include only columns present in df and are of object type
categorical_cols_present = [col for col in categorical_cols_to_encode if col in df.columns and df[col].dtype == 'object']

if categorical_cols_present:
    print(f"\nEncoding categorical columns using One-Hot Encoding: {categorical_cols_present}")

    # Initialize OneHotEncoder
    # handle_unknown='ignore' can be useful if new categories might appear in test data
    # sparse_output=False returns a dense NumPy array (set to True for sparse matrix if needed)
    one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    try:
        # Fit and transform the selected categorical columns
        # Need to handle potential NaNs in categorical columns - replace with a placeholder or mode
        for col in categorical_cols_present:
             if df[col].isnull().any():
                  print(f"Warning: Column '{col}' contains NaN values. Replacing with 'Missing' placeholder before encoding.")
                  df[col].fillna('Missing', inplace=True) # Replace NaN with a string placeholder


        encoded_data = one_hot_encoder.fit_transform(df[categorical_cols_present])

        # Create a DataFrame from the encoded data
        encoded_df = pd.DataFrame(encoded_data, columns=one_hot_encoder.get_feature_names_out(categorical_cols_present))

        # Reset index of original df and encoded_df before concatenating to ensure alignment
        # This is important if rows were dropped or order changed previously
        df.reset_index(drop=True, inplace=True)
        encoded_df.reset_index(drop=True, inplace=True)


        # Concatenate the original DataFrame with the encoded DataFrame
        df = pd.concat([df, encoded_df], axis=1)

        # Optional: Drop the original categorical columns
        df.drop(columns=categorical_cols_present, inplace=True)

        print(f"\nSuccessfully encoded {len(categorical_cols_present)} categorical columns.")
        print(f"New DataFrame shape after encoding: {df.shape}")
        # print("New columns added:", encoded_df.columns.tolist()[:10]) # Print first 10 new columns

    except Exception as e:
        print(f"An error occurred during One-Hot Encoding: {e}")
else:
    print("\nNo specified categorical columns found for encoding or already processed.")


print("\nFeature manipulation and creation complete.")
print("Final DataFrame shape:", df.shape)
print("Final columns (first 20):", df.columns.tolist()[:20]) # Print first 20 columns

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif # For classification tasks
# from sklearn.feature_selection import f_regression # For regression tasks
from sklearn.ensemble import RandomForestClassifier # Example for feature importance
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer # Import SimpleImputer
from IPython.display import display # Ensure display is imported

# Assuming df exists from previous steps and contains numerical and encoded categorical features
# Also assuming 'Credit_Score_encoded' is the numerical target variable

print("Starting feature selection...")
print("DataFrame shape before selection:", df.shape)


# --- Identify Target Variable and Features ---
# Assuming 'Credit_Score_encoded' is the target variable derived from 'Credit_Score'
target_column = 'Credit_Score_encoded'

if target_column not in df.columns:
    # It seems 'Credit_Score_encoded' was not created in the previous steps.
    # Let's check if 'Credit_Score' exists and if so, encode it.
    print(f"Target column '{target_column}' not found in DataFrame.")
    if 'Credit_Score' in df.columns:
        print("Found 'Credit_Score' column. Encoding it to create the target variable.")
        # Perform encoding for Credit_Score
        # Ensure Credit_Score is categorical or map it to numerical labels
        # If Credit_Score is 'Poor', 'Standard', 'Good', map to 0, 1, 2
        credit_score_mapping = {'Poor': 0, 'Standard': 1, 'Good': 2}
        # Check if the column has the expected categories before mapping
        if all(item in credit_score_mapping for item in df['Credit_Score'].unique()):
            df['Credit_Score_encoded'] = df['Credit_Score'].map(credit_score_mapping)
            target_column = 'Credit_Score_encoded'
            print(f"'{target_column}' created by encoding 'Credit_Score'.")
            # Now proceed with feature selection
            # Define columns to potentially drop from features (excluding the new target and the original)
            cols_to_drop_from_features = [
                'Credit_Score', # Drop the original text column
                'ID', 'Customer_ID', 'Name', 'SSN', 'Month', # Assume these were dropped in earlier steps
                'Location', 'Payment_Behaviour', 'Occupation', 'Type_of_Loan',
                'Credit_Mix', 'Payment_of_Min_Amount', 'Gender', 'Employment Type', # These were likely one-hot encoded and original dropped
                'Credit_History_Age' # Assume this was processed and potentially dropped
            ]

            # Filter the list to only include columns that *are* currently in df
            cols_to_drop_present = [col for col in cols_to_drop_from_features if col in df.columns]

            # Now drop these columns from the feature set X, ignoring errors for robustness
            X = df.drop(columns=[target_column] + cols_to_drop_present, errors='ignore')

            # Also drop any tokenized/lemmatized/pos tagged columns if they were not used for TF-IDF and are not needed as direct features
            cols_to_drop_nlp = [col for col in X.columns if '_tokens' in col or '_lemmatized' in col or '_pos_tags' in col]
            if cols_to_drop_nlp:
                print(f"Dropping NLP intermediate columns: {cols_to_drop_nlp}")
                X = X.drop(columns=cols_to_drop_nlp)


            y = df[target_column]

            print(f"\nTarget variable: '{target_column}'")
            print(f"Features shape: {X.shape}")
            print(f"Target shape: {y.shape}")
            print("\nFeatures available for selection:", X.columns.tolist())

            # Ensure all remaining columns in X are numeric for SelectKBest
            print("\nChecking feature data types...")
            numeric_cols_X = X.select_dtypes(include=np.number).columns.tolist()
            non_numeric_cols_X = X.select_dtypes(exclude=np.number).columns.tolist()

            if non_numeric_cols_X:
                print(f"Warning: Non-numeric columns found in features: {non_numeric_cols_X}")
                print("These columns will be excluded from univariate selection.")
                X_numeric = X[numeric_cols_X]
            else:
                X_numeric = X
                print("All features are numeric.")


            if X_numeric.shape[1] == 0:
                print("\nNo numeric features available for selection. Cannot proceed with SelectKBest.")
            else:
                # --- Univariate Feature Selection using SelectKBest ---
                # Select the top k features based on ANOVA F-value (suitable for numerical features vs categorical target)
                k = 'all' # Example: select top 20 features. Use 'all' to see scores for all.
                         # Adjust k based on your model needs and dataset size.

                print(f"\nPerforming univariate feature selection using SelectKBest (k={k})...")

                # Use f_classif for classification tasks (numerical features vs categorical target)
                # If your target was continuous, you would use f_regression
                selector = SelectKBest(score_func=f_classif, k=k)

                try:
                    # Handle potential infinite values in X_numeric before fitting
                    if np.isinf(X_numeric).any().any():
                        print("Warning: Infinite values found in numeric features. Replacing with NaN.")
                        X_numeric.replace([np.inf, -np.inf], np.nan, inplace=True)

                    # Handle potential NaNs before fitting SelectKBest
                    if X_numeric.isnull().any().any():
                        print("Warning: NaN values found in numeric features. Imputing with mean before selection.")
                        # Simple imputation - consider a more sophisticated method if needed
                        imputer = SimpleImputer(strategy='mean')
                        # Need to preserve column names
                        X_numeric_imputed = imputer.fit_transform(X_numeric)
                        X_numeric_imputed = pd.DataFrame(X_numeric_imputed, columns=X_numeric.columns, index=X_numeric.index)
                    else:
                        X_numeric_imputed = X_numeric

                    # Ensure X_numeric_imputed and y have aligned indices
                    X_numeric_imputed, y = X_numeric_imputed.align(y, join='inner', axis=0)


                    selector.fit(X_numeric_imputed, y)

                    # Get the scores and p-values for each feature
                    feature_scores = pd.DataFrame({
                        'Feature': X_numeric_imputed.columns,
                        'Score': selector.scores_,
                        'P-value': selector.pvalues_
                    })

                    # Sort features by score in descending order
                    feature_scores = feature_scores.sort_values(by='Score', ascending=False).reset_index(drop=True)

                    print("\nFeature Scores from SelectKBest (ANOVA F-value):")
                    # Display the top features
                    display(feature_scores.head(20)) # Display top 20 scores

                    # You can select features based on a threshold for score or p-value, or pick top k
                    # Example: Selecting top k features
                    if k != 'all':
                         selected_features_mask = selector.get_support() # Boolean mask of selected features
                         selected_features_names = X_numeric_imputed.columns[selected_features_mask].tolist()
                         print(f"\nSelected Top {k} Features based on SelectKBest:")
                         print(selected_features_names)

                         # Create the new feature set with selected columns
                         X_selected_kbest = X_numeric_imputed[selected_features_names]
                         print(f"Shape of selected feature set: {X_selected_kbest.shape}")

                         # Update X to the selected feature set
                         X = X_selected_kbest
                         print("\nUpdated feature set (X) to selected features from SelectKBest.")

                    else:
                         # If k='all', you inspect the scores and decide which features to keep
                         print("\nSince k='all', inspect the scores above to manually select features or define a threshold.")
                         print("You can filter 'feature_scores' DataFrame based on Score or P-value.")

                except Exception as e:
                     print(f"An error occurred during SelectKBest execution: {e}")
        else:
             print("Could not encode 'Credit_Score'. Check its unique values and data type.")

    else:
        # This block is reached if target_column was initially not found and 'Credit_Score' also not found.
        print("Cannot proceed with feature selection without a valid target column.")


print("\nFeature selection step complete.")
# The variable 'X' now holds the selected features for modeling (if successful).
# The variable 'y' holds the target variable (if successful).
print("Final features DataFrame shape:", X.shape if 'X' in locals() else "X not created due to errors")

##### What all feature selection methods have you used  and why?

**TF-IDF Feature Reduction:**

From text columns like 'Occupation', 'Type_of_Loan', etc.

Why? To reduce high-dimensional sparse matrix by selecting top features based on term importance.

##### Which all features you found important and why?

**Statistical Feature Selection (likely attempted):**

Could include methods like:

Chi-Square

ANOVA F-test

Mutual Information

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
from sklearn.impute import SimpleImputer # Make sure SimpleImputer is imported

if 'X_train' in locals() and isinstance(X_train, (pd.DataFrame, np.ndarray)) and not X_train.empty and \
   'X_test' in locals() and isinstance(X_test, (pd.DataFrame, np.ndarray)) and not X_test.empty:

    print("X_train and X_test are available. Proceeding with data transformation.")

    if isinstance(X_train, pd.DataFrame):
        numerical_cols_train = X_train.select_dtypes(include=np.number).columns.tolist()
        print(f"\nIdentified {len(numerical_cols_train)} numerical columns for scaling:")
        # print(numerical_cols_train) # Uncomment to see the list of columns
        X_train_numerical = X_train[numerical_cols_train]
        X_test_numerical = X_test[numerical_cols_train] # Select the same columns in test set
    else: # If X_train is a numpy array (e.g., from SMOTE output without converting back to DataFrame)
        print("\nX_train is a numpy array. Assuming all features are numerical.")
        # In this case, X_train and X_test are already just the numerical features (if non-numeric were handled earlier)
        # Or you might need to manually select numerical columns if some were kept before conversion to numpy
        # For simplicity, assuming all columns in the numpy array are features to be scaled
        X_train_numerical = X_train
        X_test_numerical = X_test
        numerical_cols_train = [f'feature_{i}' for i in range(X_train.shape[1])] # Dummy names for printing


    if X_train_numerical.shape[1] == 0:
        print("No numerical features found for scaling. Skipping scaling.")
        X_train_scaled = X_train_numerical # Keep original if empty
        X_test_scaled = X_test_numerical   # Keep original if empty
    else:
        # --- Handle Missing Values (if not already done) ---
        # Scaling functions are sensitive to NaNs/Infs. Impute if necessary.
        print("\nChecking for and imputing NaN/Infinite values in numerical features before scaling...")
        imputer = SimpleImputer(strategy='mean') # Or 'median', 'most_frequent'

        # Fit imputer only on the training data
        imputer.fit(X_train_numerical)

        # Transform both training and testing data
        X_train_imputed = imputer.transform(X_train_numerical)
        X_test_imputed = imputer.transform(X_test_numerical)

        # Convert back to DataFrame to keep column names (optional but recommended)
        if isinstance(X_train, pd.DataFrame): # Only convert back if original was DataFrame
             X_train_imputed = pd.DataFrame(X_train_imputed, columns=numerical_cols_train, index=X_train_numerical.index)
             X_test_imputed = pd.DataFrame(X_test_imputed, columns=numerical_cols_train, index=X_test_numerical.index)
             print("NaN/Infinite values imputed. Data converted back to DataFrame.")
        else:
             print("NaN/Infinite values imputed.")

        print("\nApplying MinMaxScaler...")
        scaler = MinMaxScaler()
        # Fit scaler only on the training data (imputed)
        scaler.fit(X_train_imputed)
        # Transform both training and testing data
        X_train_scaled = scaler.transform(X_train_imputed)
        X_test_scaled = scaler.transform(X_test_imputed)
        if isinstance(X_train, pd.DataFrame): # Only convert back if original was DataFrame
            X_train_scaled = pd.DataFrame(X_train_scaled, columns=numerical_cols_train, index=X_train_imputed.index)
            X_test_scaled = pd.DataFrame(X_test_scaled, columns=numerical_cols_train, index=X_test_imputed.index)
            print("Scaled data converted back to DataFrame.")
        else:
             print("Scaled data remains in numpy array format.")


        print(f"X_train_scaled shape: {X_train_scaled.shape}")
        print(f"X_test_scaled shape: {X_test_scaled.shape}")
    X_train_transformed = X_train_scaled
    X_test_transformed = X_test_scaled


    print("\nData transformation complete.")
    print(f"Final X_train_transformed shape: {X_train_transformed.shape}")
    print(f"Final X_test_transformed shape: {X_test_transformed.shape}")

else:
    print("X_train or X_test DataFrame/array not found or is empty. Cannot perform data transformation.")
    print("Please ensure the train-test split step ran successfully and resulted in non-empty splits.")

### 6. Data Scaling

In [None]:
# Scaling your data
# This section is very similar to the "Transform Your data" section if scaling is your primary transformation.
# It's good practice to combine transformation steps like imputation and scaling.
# The code below provides the scaling part, assuming imputation is handled either here or before.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer # Make sure SimpleImputer is imported

# Assume X_train, X_test are available from the train-test split step.
# Use the potentially resampled X_train if imbalance handling was applied.

print("Starting data scaling...")

# Check if X_train and X_test DataFrames/arrays exist and are not empty
if 'X_train' in locals() and isinstance(X_train, (pd.DataFrame, np.ndarray)) and not X_train.empty and \
   'X_test' in locals() and isinstance(X_test, (pd.DataFrame, np.ndarray)) and not X_test.empty:

    print("X_train and X_test are available for scaling.")

    if isinstance(X_train, pd.DataFrame):
        numerical_cols_train = X_train.select_dtypes(include=np.number).columns.tolist()
        print(f"\nIdentified {len(numerical_cols_train)} numerical columns for scaling.")
        # print(numerical_cols_train) # Uncomment to see the list of columns
        X_train_numerical = X_train[numerical_cols_train]
        X_test_numerical = X_test[numerical_cols_train] # Select the same columns in test set
    else: # If X_train is a numpy array (e.g., from SMOTE output without converting back to DataFrame)
        print("\nX_train is a numpy array. Assuming all features are numerical for scaling.")
        X_train_numerical = X_train
        X_test_numerical = X_test
        # Need dummy column names if you want to convert back to DataFrame later with names
        numerical_cols_train = [f'col_{i}' for i in range(X_train.shape[1])]


    if X_train_numerical.shape[1] == 0:
        print("No numerical features found for scaling. Skipping scaling.")
        # Assign original (empty) numerical data to scaled variables
        X_train_scaled_numerical = X_train_numerical
        X_test_scaled_numerical = X_test_numerical
    else:
        # --- Handle Missing Values (Crucial before Scaling) ---
        # Scalers cannot handle NaN or infinite values.
        print("\nChecking for and imputing NaN/Infinite values in numerical features before scaling...")
        imputer = SimpleImputer(strategy='mean') # Choose an appropriate imputation strategy

        # Fit imputer only on the training data
        imputer.fit(X_train_numerical)

        # Transform both training and testing data
        X_train_imputed_numerical = imputer.transform(X_train_numerical)
        X_test_imputed_numerical = imputer.transform(X_test_numerical)

        # Convert imputed arrays back to DataFrame to preserve column names (recommended)
        X_train_imputed_numerical = pd.DataFrame(X_train_imputed_numerical, columns=numerical_cols_train, index=X_train_numerical.index if isinstance(X_train_numerical, pd.DataFrame) else None)
        X_test_imputed_numerical = pd.DataFrame(X_test_imputed_numerical, columns=numerical_cols_train, index=X_test_numerical.index if isinstance(X_test_numerical, pd.DataFrame) else None)
        print("NaN/Infinite values imputed successfully.")

        print("Applying StandardScaler...")
        scaler = StandardScaler()
        # Fit the scaler only on the imputed training data
        scaler.fit(X_train_imputed_numerical)
        # Transform both training and testing imputed data
        X_train_scaled_numerical = scaler.transform(X_train_imputed_numerical)
        X_test_scaled_numerical = scaler.transform(X_test_imputed_numerical)
        X_train_scaled_numerical = pd.DataFrame(X_train_scaled_numerical, columns=numerical_cols_train, index=X_train_imputed_numerical.index)
        X_test_scaled_numerical = pd.DataFrame(X_test_scaled_numerical, columns=numerical_cols_train, index=X_test_imputed_numerical.index)
        print("Scaled numerical data converted back to DataFrame.")


        print(f"X_train_scaled_numerical shape: {X_train_scaled_numerical.shape}")
        print(f"X_test_scaled_numerical shape: {X_test_scaled_numerical.shape}")
    X_train_scaled = X_train_scaled_numerical
    X_test_scaled = X_test_scaled_numerical

    print("\nData scaling complete.")
    print(f"Final X_train_scaled shape: {X_train_scaled.shape}")
    print(f"Final X_test_scaled shape: {X_test_scaled.shape}")

else:
    print("X_train or X_test DataFrame/array not found or is empty. Cannot perform data scaling.")
    print("Please ensure the train-test split step ran successfully and resulted in non-empty splits.")

##### Which method have you used to scale you data and why?

*   **Method Used (by default in the provided code):** `StandardScaler`.
*   **Reason for choosing StandardScaler:** `StandardScaler` standardizes features by removing the mean and scaling to unit variance. This is a standard practice before feeding data to many machine learning algorithms (like SVMs, Logistic Regression, neural networks, PCA) that are sensitive to the scale of the features. It centers the data around zero and is less affected by outliers compared to MinMaxScaler.


### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

**High Dimensionality:** When dealing with a large number of features (high dimensionality), it can lead to several

 **problems:**
Increased Computational Cost: Many machine learning algorithms become computationally expensive with a large number of features.

**Overfitting:** High dimensionality can lead to overfitting, where the model learns the training data too well and performs poorly on unseen data.

**Curse of Dimensionality:** Data becomes sparse in high-dimensional space, making it difficult to find meaningful patterns.

**Redundancy:** Features might be correlated or redundant, providing little additional information.
Improved Visualization: Reducing data to 2 or 3 dimensions allows for easy visualization.

In [None]:
# DImensionality Reduction (If needed)
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif # Example using SelectKBest again
from sklearn.ensemble import RandomForestClassifier # Example for feature importance reduction
import matplotlib.pyplot as plt
import seaborn as sns

# Assume X_train_scaled and X_test_scaled are available from the scaling step.
# Use the scaled data as input for dimensionality reduction techniques.

print("Starting dimensionality reduction...")

# Check if scaled data exists and is not empty
if 'X_train_scaled' in locals() and isinstance(X_train_scaled, (pd.DataFrame, np.ndarray)) and not X_train_scaled.empty and \
   'X_test_scaled' in locals() and isinstance(X_test_scaled, (pd.DataFrame, np.ndarray)) and not X_test_scaled.empty:

    print("Scaled training and testing data are available. Proceeding with dimensionality reduction options.")
    print(f"Initial number of features: {X_train_scaled.shape[1]}")

    # --- Option 1: Principal Component Analysis (PCA) ---
    # PCA is useful for reducing dimensionality while retaining variance,
    # especially if features are correlated. Requires numerical data.

    print("\nConsidering PCA for dimensionality reduction...")

    # It's good practice to decide the number of components (n_components)
    # You can choose based on explained variance ratio.
    # Start by fitting PCA without specifying n_components to see explained variance
    pca_full = PCA()

    try:
        # Fit PCA on the scaled training data
        pca_full.fit(X_train_scaled)

        # Plot explained variance ratio to help decide number of components
        plt.figure(figsize=(10, 6))
        plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
        plt.xlabel('Number of Components')
        plt.ylabel('Explained Variance Ratio')
        plt.title('Explained Variance by Number of PCA Components')
        plt.grid(True)
        plt.show()

        # You can also inspect the explained variance ratios
        explained_variance_ratio = pca_full.explained_variance_ratio_
        print("Explained variance ratio by each component (first 10):")
        print(explained_variance_ratio[:10]) # Display first 10 ratios

        # Decide on the number of components (n_components)
        # Example: Choose components that explain 95% of the variance
        explained_variance_threshold = 0.95
        n_components_pca = np.argmax(np.cumsum(explained_variance_ratio) >= explained_variance_threshold) + 1 # +1 because index is 0-based
        print(f"\nChoosing {n_components_pca} components to explain >= {explained_variance_threshold:.0%} of variance.")

        # If the number of components is less than the original number of features, apply PCA
        if n_components_pca < X_train_scaled.shape[1]:
            print(f"Applying PCA with {n_components_pca} components...")
            pca = PCA(n_components=n_components_pca)

            # Fit PCA on the scaled training data and transform
            X_train_pca = pca.fit_transform(X_train_scaled)

            # Transform the scaled testing data using the same fitted PCA
            X_test_pca = pca.transform(X_test_scaled)
            pca_col_names = [f'PC{i+1}' for i in range(n_components_pca)]
            X_train_pca = pd.DataFrame(X_train_pca, columns=pca_col_names, index=X_train_scaled.index)
            X_test_pca = pd.DataFrame(X_test_pca, columns=pca_col_names, index=X_test_scaled.index)
            print(f"\nPCA applied successfully.")
            print(f"X_train_pca shape: {X_train_pca.shape}")
            print(f"X_test_pca shape: {X_test_pca.shape}")

            # Assign the reduced data to a new variable for downstream use
            X_train_reduced = X_train_pca
            X_test_reduced = X_test_pca
            print("\nPCA transformed data assigned to X_train_reduced and X_test_reduced.")

        else:
            print("\nChosen number of PCA components is not less than original features. Skipping PCA reduction.")
            # Assign original scaled data if no reduction happened
            X_train_reduced = X_train_scaled
            X_test_reduced = X_test_scaled
            print("X_train_reduced and X_test_reduced assigned to original scaled data.")
    except Exception as e:
        print(f"An error occurred during PCA execution: {e}")
        print("Skipping PCA and assigning original scaled data to reduced variables.")
        X_train_reduced = X_train_scaled
        X_test_reduced = X_test_scaled

    print("\nDimensionality reduction options considered.")
    # X_train_reduced and X_test_reduced now hold the data after dimensionality reduction (if applied),
    # otherwise they hold the original scaled data.
    print(f"Final reduced X_train_shape: {X_train_reduced.shape}")
    print(f"Final reduced X_test_shape: {X_test_reduced.shape}")


else:
    print("X_train_scaled or X_test_scaled DataFrame/array not found or is empty. Cannot perform dimensionality reduction.")
    print("Please ensure the scaling step ran successfully and resulted in non-empty scaled data.")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

None applied — due to missing or empty X_train_scaled / X_test_scaled data.

Reduces high-dimensional data (like TF-IDF vectors).

Removes multicollinearity.

Speeds up training while preserving most variance.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
import pandas as pd
from sklearn.model_selection import train_test_split

# Assuming X contains your features and y contains your target variable
# from the previous steps.

# Check if X and y DataFrames exist and are not empty
if 'X' in locals() and 'y' in locals() and not X.empty and not y.empty:

    print("Splitting data into training and testing sets...")

    # Define the splitting ratio
    # Common practice is 70/30, 80/20, or 75/25 for train/test split.
    # For classification tasks, especially with imbalanced data, stratify is important.
    split_ratio = 0.25  # 25% for testing, 75% for training

    try:
        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y,
            test_size=split_ratio,
            random_state=42, # Use a random state for reproducibility
            stratify=y      # Stratify the split based on the target variable
                            # This is crucial for classification to maintain class distribution
        )

        print(f"\nData split successfully with a test size of {split_ratio*100}%.")
        print(f"X_train shape: {X_train.shape}")
        print(f"X_test shape: {X_test.shape}")
        print(f"y_train shape: {y_train.shape}")
        print(f"y_test shape: {y_test.shape}")

        # Optional: Verify the class distribution in train and test sets
        print("\nClass distribution in original target:")
        print(y.value_counts(normalize=True))
        print("\nClass distribution in training set:")
        print(y_train.value_counts(normalize=True))
        print("\nClass distribution in testing set:")
        print(y_test.value_counts(normalize=True))

    except Exception as e:
        print(f"An error occurred during data splitting: {e}")
        print("Please ensure X and y are properly defined and aligned.")
else:
    print("X or y DataFrame not found or is empty. Cannot perform data splitting.")
    print("Please run the previous steps to create X and y.")

##### What data splitting ratio have you used and why?

80:20 is widely used to give enough data for training while reserving a portion for evaluation.

Ensures the model is trained well and tested fairly.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

The target variable Credit_Score likely has unequal class distribution.

For example, categories like 'Good', 'Standard', and 'Poor' may not have similar record counts.

This can cause models to favor the majority class, reducing accuracy for minority classes.

In [None]:
# Handling Imbalanced Dataset (If needed)
import pandas as pd
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
import matplotlib.pyplot as plt
import seaborn as sns

# Assume X_train and y_train are available from the train-test split step

print("Checking for target variable imbalance...")

# Check the class distribution in the training set
if 'y_train' in locals() and isinstance(y_train, (pd.Series, np.ndarray)):
    class_distribution = Counter(y_train)
    print("Original training data class distribution:")
    print(class_distribution)

    if len(class_distribution) > 1:
        largest_class_count = max(class_distribution.values())
        smallest_class_count = min(class_distribution.values())

        if smallest_class_count == 0:
            print("Warning: One or more classes have zero samples in the training set. This is a significant imbalance.")
            is_imbalanced = True # Definitely imbalanced if a class is empty
        else:
             imbalance_ratio = largest_class_count / smallest_class_count
             print(f"Imbalance ratio (Largest/Smallest): {imbalance_ratio:.2f}")

             # You can set a threshold based on your domain knowledge or common practice
             imbalance_threshold = 2.0 # Example: Consider imbalanced if ratio is 2:1 or more

             if imbalance_ratio > imbalance_threshold:
                 is_imbalanced = True
                 print(f"Dataset is considered imbalanced (ratio > {imbalance_threshold}).")
             else:
                 is_imbalanced = False
                 print(f"Dataset is considered balanced (ratio <= {imbalance_threshold}).")

             # Also consider the proportion of the smallest class
             total_samples = sum(class_distribution.values())
             smallest_class_proportion = smallest_class_count / total_samples
             min_proportion_threshold = 0.05 # Example: Consider imbalanced if smallest class < 5%

             if smallest_class_proportion < min_proportion_threshold and not is_imbalanced:
                 is_imbalanced = True
                 print(f"Dataset is considered imbalanced (smallest class proportion < {min_proportion_threshold:.1%}).")
             elif smallest_class_proportion >= min_proportion_threshold and not is_imbalanced:
                  print(f"Smallest class proportion ({smallest_class_proportion:.1%}) is above {min_proportion_threshold:.1%}.")

    elif len(class_distribution) == 1:
        print("Only one class present in the training data. Cannot handle imbalance.")
        is_imbalanced = False # Not imbalanced, but also not suitable for classification with multiple classes
    else:
        print("Training data target y_train is empty.")
        is_imbalanced = False # Cannot handle imbalance

    if is_imbalanced:
        print("\nHandling Imbalanced Dataset...")
        print("Choose an appropriate technique (SMOTE, Undersampling, SMOTETomek, etc.)")
        print("Applying SMOTE (Synthetic Minority Over-sampling Technique)...")

        non_numeric_cols_X_train = X_train.select_dtypes(exclude=np.number).columns.tolist()
        if non_numeric_cols_X_train:
            print(f"Warning: Found non-numeric columns in X_train: {non_numeric_cols_X_train}")
            print("SMOTE requires numerical input. Consider dropping or further processing these columns.")
            # For demonstration, let's assume we proceed with only numeric columns if necessary
            X_train_numeric = X_train.select_dtypes(include=np.number)
        else:
            X_train_numeric = X_train

        # Handle potential NaN/Infinite values in X_train_numeric before SMOTE
        if X_train_numeric.isnull().any().any() or np.isinf(X_train_numeric).any().any():
             print("Warning: NaN or infinite values found in X_train_numeric. Imputing with mean before applying SMOTE.")
             imputer = SimpleImputer(strategy='mean')
             X_train_imputed = imputer.fit_transform(X_train_numeric)
             X_train_resampled = pd.DataFrame(X_train_imputed, columns=X_train_numeric.columns)
        else:
             X_train_resampled = X_train_numeric


        # Apply SMOTE
        try:
            smote = SMOTE(random_state=42)
            X_train_resampled, y_train_resampled = smote.fit_resample(X_train_resampled, y_train)

            print("\nTraining data resampled using SMOTE.")
            print(f"Original training shape: {X_train.shape}")
            print(f"Resampled training shape: {X_train_resampled.shape}")

            print("\nResampled training data class distribution:")
            print(Counter(y_train_resampled))

            # Update X_train and y_train to the resampled versions
            X_train = X_train_resampled
            y_train = y_train_resampled
            print("\nX_train and y_train updated to resampled data.")

        except ValueError as e:
            print(f"Error applying SMOTE: {e}")
            print("SMOTE might fail if a class has too few samples (< k_neighbors).")
            print("Consider adjusting k_neighbors or using a different technique.")
        except Exception as e:
            print(f"An unexpected error occurred during SMOTE: {e}")
    else:
        print("\nDataset is considered balanced or has only one class. No imbalance handling applied.")

else:
    print("X_train or y_train not found or is empty. Cannot check or handle imbalance.")
    print("Please ensure the train-test split step ran successfully.")

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Generates synthetic samples for minority class.

Prevents overfitting (unlike random oversampling).

Balances class distribution for better model performance.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load sample dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model - 1 Implementation (Logistic Regression)
model = LogisticRegression(max_iter=200)

# Fit the Algorithm
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate evaluation metrics (use average='weighted' for multi-class)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Create bar chart
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
scores = [accuracy, precision, recall, f1]

plt.figure(figsize=(8, 5))
plt.bar(metrics, scores, color=['skyblue', 'orange', 'green', 'red'])
plt.ylim(0, 1.1)
plt.title('Evaluation Metric Scores')
plt.ylabel('Score')
for i, v in enumerate(scores):
    plt.text(i, v + 0.02, f"{v:.2f}", ha='center', fontweight='bold')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Import necessary libraries for Decision Tree and Hyperparameter Tuning
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined from previous code blocks
# Ensure that X_train and X_test are properly prepared (one-hot encoded and scaled)
# If you are running this cell independently, you need to re-run the data preparation steps from the previous cell.

print("Starting Decision Tree Model Implementation...")

# --- Hyperparameter Tuning using Randomized Search CV ---

# Define the parameter distribution for Decision Tree Classifier
# Use distributions appropriate for randomized search
param_dist = {
    'criterion': ['gini', 'entropy'], # Split quality measure
    'splitter': ['best', 'random'], # Split strategy
    'max_depth': [None, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], # Maximum depth of the tree
    'min_samples_split': [2, 5, 10, 20, 50], # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 5, 10, 20], # Minimum number of samples required to be at a leaf node
    'max_features': [None, 'auto', 'sqrt', 'log2'], # Number of features to consider when looking for the best split
    # 'class_weight': [None, 'balanced'] # Weights associated with classes
}

# Initialize the Decision Tree Classifier model
dt_clf = DecisionTreeClassifier(random_state=42)

random_search = RandomizedSearchCV(estimator=dt_clf, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring='accuracy',
                                   n_jobs=-1, random_state=42)

print("Starting Randomized Search for Hyperparameter Tuning (Decision Tree)...")

# Fit Randomized Search to the training data
random_search.fit(X_train, y_train)

print("Randomized Search finished.")

# Get the best parameters and best score found by Randomized Search
best_params = random_search.best_params_
best_score = random_search.best_score_

print(f"\nBest Parameters found by Randomized Search: {best_params}")
print(f"Best Cross-Validation Score (Accuracy): {best_score:.4f}")

# --- Fit the Algorithm with Best Parameters ---

# Initialize the Decision Tree Classifier model with the best parameters
best_dt_clf = DecisionTreeClassifier(**best_params, random_state=42)

print("\nTraining the Decision Tree model with best parameters...")

# Fit the model on the entire training data
best_dt_clf.fit(X_train, y_train)

print("Model training completed.")

# --- Predict on the model ---

# Predict on the test data using the best model
print("\nMaking predictions on the test data using the best Decision Tree model...")
y_pred = best_dt_clf.predict(X_test)
print("Predictions made.")

# --- Model Evaluation (using the best model) ---

print("\nDecision Tree Model Evaluation with Best Parameters:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# You can compare this to the Logistic Regression results

##### Which hyperparameter optimization technique have you used and why?

We used RandomizedSearchCV to tune the Decision Tree model. It efficiently explores a large set of hyperparameter combinations with fewer iterations, saving time while finding high-performing parameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the model achieved 100% accuracy. All evaluation metrics reached 1.00, and the confusion matrix shows perfect classification.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported to create DataFrames

# Let's use the classification_report to get per-class and average metrics
# For simplicity, we'll focus on the weighted average metrics from the classification report.

# Assuming y_test and y_pred from the best Decision Tree model are available
dt_report = classification_report(y_test, y_pred, output_dict=True)

# Extract weighted average metrics
dt_accuracy = dt_report['accuracy']
dt_precision = dt_report['weighted avg']['precision']
dt_recall = dt_report['weighted avg']['recall']
dt_f1 = dt_report['weighted avg']['f1-score']

# You would do the same for Logistic Regression or any other model you trained
# For demonstration, let's use hypothetical metrics for Logistic Regression
# Replace these with actual metrics from your Logistic Regression evaluation step
log_reg_accuracy = 0.85 # Example
log_reg_precision = 0.84 # Example
log_reg_recall = 0.85 # Example
log_reg_f1 = 0.84 # Example


# Create a DataFrame to hold the evaluation metrics
metrics_data = {
    'Model': ['Logistic Regression', 'Decision Tree'],
    'Accuracy': [log_reg_accuracy, dt_accuracy],
    'Precision (Weighted Avg)': [log_reg_precision, dt_precision],
    'Recall (Weighted Avg)': [log_reg_recall, dt_recall],
    'F1-Score (Weighted Avg)': [log_reg_f1, dt_f1]
}

metrics_df = pd.DataFrame(metrics_data)

print("\nEvaluation Metrics Summary:")
print(metrics_df)

# --- Visualization ---

# Melt the DataFrame to a long format for easier plotting with seaborn
metrics_melted = metrics_df.melt(id_vars='Model', var_name='Metric', value_name='Score')

# Create a bar plot
plt.figure(figsize=(12, 7))
sns.barplot(x='Metric', y='Score', hue='Model', data=metrics_melted, palette='viridis')

# Add titles and labels
plt.title('Comparison of Model Evaluation Metrics', fontsize=16)
plt.xlabel('Evaluation Metric', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.ylim(0, 1.0) # Scores are typically between 0 and 1

# Add value labels on top of bars
for container in plt.gca().containers:
    plt.gca().bar_label(container, fmt='%.3f')

plt.legend(title='Model')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show()

# --- You can also visualize specific metrics individually if needed ---

# Example: Bar plot for Accuracy only
plt.figure(figsize=(8, 5))
sns.barplot(x='Model', y='Accuracy', data=metrics_df, palette='viridis')
plt.title('Model Accuracy Comparison', fontsize=16)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.ylim(0, 1.0)
for container in plt.gca().containers:
    plt.gca().bar_label(container, fmt='%.3f')
plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Import necessary libraries for Gradient Boosting and Hyperparameter Tuning
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined from previous code blocks
# Ensure that X_train and X_test are properly prepared (one-hot encoded and scaled)
# If you are running this cell independently, you need to re-run the data preparation steps from the previous cell.

print("Starting Gradient Boosting Classifier Model Implementation...")

# --- Hyperparameter Tuning using Randomized Search CV ---
# Gradient Boosting has many parameters, so Randomized Search is often more efficient than Grid Search.

# Define the parameter distribution for Gradient Boosting Classifier
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500], # Number of boosting stages
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2], # Step size shrinkage
    'max_depth': [3, 4, 5, 6, 8], # Maximum depth of individual estimators
    'min_samples_split': [2, 5, 10, 20], # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 5, 10], # Minimum number of samples required to be at a leaf node
    'max_features': ['sqrt', 'log2', None], # Number of features to consider when looking for the best split
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0], # Fraction of samples used for fitting the individual base learners
    # 'criterion': ['friedman_mse', 'squared_error'], # Function to measure the quality of a split (often left default)
    # 'loss': ['log_loss', 'deviance', 'exponential'], # Loss function to be optimized
}

# Initialize the Gradient Boosting Classifier model
gb_clf = GradientBoostingClassifier(random_state=42)
random_search_gb = RandomizedSearchCV(estimator=gb_clf, param_distributions=param_dist,
                                      n_iter=30, cv=5, scoring='accuracy',
                                      n_jobs=-1, random_state=42, verbose=2) # verbose=2 shows progress

print("Starting Randomized Search for Hyperparameter Tuning (Gradient Boosting)...")

# Fit Randomized Search to the training data
random_search_gb.fit(X_train, y_train)

print("Randomized Search finished.")

# Get the best parameters and best score found by Randomized Search
best_params_gb = random_search_gb.best_params_
best_score_gb = random_search_gb.best_score_

print(f"\nBest Parameters found by Randomized Search (Gradient Boosting): {best_params_gb}")
print(f"Best Cross-Validation Score (Accuracy): {best_score_gb:.4f}")

# --- Fit the Algorithm with Best Parameters ---

# Initialize the Gradient Boosting Classifier model with the best parameters
best_gb_clf = GradientBoostingClassifier(**best_params_gb, random_state=42)

print("\nTraining the Gradient Boosting model with best parameters...")

# Fit the model on the entire training data
best_gb_clf.fit(X_train, y_train)

print("Model training completed.")

# --- Predict on the model ---

# Predict on the test data using the best model
print("\nMaking predictions on the test data using the best Gradient Boosting model...")
y_pred_gb = best_gb_clf.predict(X_test)
print("Predictions made.")

# --- Model Evaluation (using the best model) ---

print("\nGradient Boosting Model Evaluation with Best Parameters:")
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_gb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))

# You can add the metrics from this model to your metrics_data DataFrame for visualization
# in the 'Visualizing evaluation Metric Score chart' section.

##### Which hyperparameter optimization technique have you used and why?

We used RandomizedSearchCV for Gradient Boosting to efficiently explore a wide hyperparameter space with fewer iterations. It reduces computation time while still finding near-optimal parameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, tuning improved the cross-validation accuracy to 95% and test accuracy to 93.33%. The model now predicts most classes correctly with high precision and recall.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Accuracy ensures overall model correctness, reducing wrong decisions.

Precision avoids false positives—crucial for cost-saving and customer trust.

Recall ensures we don’t miss important cases—vital in risk or fraud detection.

F1-Score balances precision and recall—ensuring reliable, consistent outcomes.
Overall, the Gradient Boosting model helps improve decision accuracy, reducing operational errors and enhancing business confidence.

### ML Model - 3

In [None]:
# Import necessary libraries for RandomForestClassifier and Hyperparameter Tuning
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined from previous code blocks
# Ensure that X_train and X_test are properly prepared (one-hot encoded and scaled)
# If you are running this cell independently, you need to re-run the data preparation steps from the previous cell.

print("Starting Random Forest Classifier Model Implementation...")

# --- Hyperparameter Tuning using Randomized Search CV ---
# Define the parameter distribution for Random Forest Classifier
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], # Number of trees in the forest
    'criterion': ['gini', 'entropy'], # Split quality measure
    'max_depth': [None, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], # Maximum depth of the tree
    'min_samples_split': [2, 5, 10, 20, 50], # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 5, 10, 20], # Minimum number of samples required to be at a leaf node
    'max_features': ['sqrt', 'log2', None], # Number of features to consider when looking for the best split
    'bootstrap': [True, False], # Whether bootstrap samples are used when building trees
    # 'class_weight': [None, 'balanced', 'balanced_subsample'] # Weights associated with classes
}

# Initialize the Random Forest Classifier model
rf_clf = RandomForestClassifier(random_state=42)

# Initialize Randomized Search with cross-validation
random_search_rf = RandomizedSearchCV(estimator=rf_clf, param_distributions=param_dist,
                                      n_iter=50, cv=5, scoring='accuracy',
                                      n_jobs=-1, random_state=42, verbose=2) # verbose=2 shows progress

print("Starting Randomized Search for Hyperparameter Tuning (Random Forest)...")

# Fit Randomized Search to the training data
random_search_rf.fit(X_train, y_train)

print("Randomized Search finished.")

# Get the best parameters and best score found by Randomized Search
best_params_rf = random_search_rf.best_params_
best_score_rf = random_search_rf.best_score_

print(f"\nBest Parameters found by Randomized Search (Random Forest): {best_params_rf}")
print(f"Best Cross-Validation Score (Accuracy): {best_score_rf:.4f}")

# --- Fit the Algorithm with Best Parameters ---

# Initialize the Random Forest Classifier model with the best parameters
best_rf_clf = RandomForestClassifier(**best_params_rf, random_state=42)

print("\nTraining the Random Forest model with best parameters...")

# Fit the model on the entire training data
best_rf_clf.fit(X_train, y_train)

print("Model training completed.")

# --- Predict on the model ---

# Predict on the test data using the best model
print("\nMaking predictions on the test data using the best Random Forest model...")
y_pred_rf = best_rf_clf.predict(X_test)
print("Predictions made.")

# --- Model Evaluation (using the best model) ---

print("\nRandom Forest Model Evaluation with Best Parameters:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

# Remember to add the metrics from this model to your metrics_data DataFrame for visualization.

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Import necessary libraries for visualization and data manipulation
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported to create DataFrames
from sklearn.metrics import classification_report # Import classification_report

try:
    y_pred_lr = best_log_reg.predict(X_test)
except NameError:
    print("best_log_reg model not found. Please run the Logistic Regression cell.")
    y_pred_lr = None # Set to None if model doesn't exist

# Get metrics for each model using classification_report and accuracy_score

metrics_list = []

# Logistic Regression Metrics
if y_pred_lr is not None:
    lr_report = classification_report(y_test, y_pred_lr, output_dict=True)
    metrics_list.append({
        'Model': 'Logistic Regression',
        'Accuracy': lr_report['accuracy'],
        'Precision (Weighted Avg)': lr_report['weighted avg']['precision'],
        'Recall (Weighted Avg)': lr_report['weighted avg']['recall'],
        'F1-Score (Weighted Avg)': lr_report['weighted avg']['f1-score']
    })
    print("Collected metrics for Logistic Regression.")

# Decision Tree Metrics (using y_pred from the last Decision Tree cell)
# Assuming y_pred exists
try:
    dt_report = classification_report(y_test, y_pred, output_dict=True)
    metrics_list.append({
        'Model': 'Decision Tree',
        'Accuracy': dt_report['accuracy'],
        'Precision (Weighted Avg)': dt_report['weighted avg']['precision'],
        'Recall (Weighted Avg)': dt_report['weighted avg']['recall'],
        'F1-Score (Weighted Avg)': dt_report['weighted avg']['f1-score']
    })
    print("Collected metrics for Decision Tree.")
except NameError:
     print("y_pred (Decision Tree predictions) not found. Please run the Decision Tree cell.")


# Gradient Boosting Metrics (using y_pred_gb from the Gradient Boosting cell)
# Assuming y_pred_gb exists
try:
    gb_report = classification_report(y_test, y_pred_gb, output_dict=True)
    metrics_list.append({
        'Model': 'Gradient Boosting',
        'Accuracy': gb_report['accuracy'],
        'Precision (Weighted Avg)': gb_report['weighted avg']['precision'],
        'Recall (Weighted Avg)': gb_report['weighted avg']['recall'],
        'F1-Score (Weighted Avg)': gb_report['weighted avg']['f1-score']
    })
    print("Collected metrics for Gradient Boosting.")
except NameError:
     print("y_pred_gb (Gradient Boosting predictions) not found. Please run the Gradient Boosting cell.")


# Random Forest Metrics (using y_pred_rf from the Random Forest cell)
# Assuming y_pred_rf exists
try:
    rf_report = classification_report(y_test, y_pred_rf, output_dict=True)
    metrics_list.append({
        'Model': 'Random Forest',
        'Accuracy': rf_report['accuracy'],
        'Precision (Weighted Avg)': rf_report['weighted avg']['precision'],
        'Recall (Weighted Avg)': rf_report['weighted avg']['recall'],
        'F1-Score (Weighted Avg)': rf_report['weighted avg']['f1-score']
    })
    print("Collected metrics for Random Forest.")
except NameError:
    print("y_pred_rf (Random Forest predictions) not found. Please run the Random Forest cell.")


# Create a DataFrame from the list of metrics
if metrics_list:
    metrics_df = pd.DataFrame(metrics_list)

    print("\nEvaluation Metrics Summary:")
    print(metrics_df)

    # --- Visualization ---

    # Melt the DataFrame to a long format for easier plotting with seaborn
    metrics_melted = metrics_df.melt(id_vars='Model', var_name='Metric', value_name='Score')

    # Create a bar plot
    plt.figure(figsize=(14, 8))
    sns.barplot(x='Metric', y='Score', hue='Model', data=metrics_melted, palette='viridis')

    # Add titles and labels
    plt.title('Comparison of Model Evaluation Metrics', fontsize=18)
    plt.xlabel('Evaluation Metric', fontsize=14)
    plt.ylabel('Score', fontsize=14)
    plt.ylim(0, 1.1) # Set y-limit slightly above 1 for label spacing

    # Add value labels on top of bars
    ax = plt.gca()
    for container in ax.containers:
        ax.bar_label(container, fmt='%.3f', label_type='edge', padding=3) # Adjust padding for better spacing

    plt.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left') # Place legend outside the plot
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout(rect=[0, 0, 0.85, 1]) # Adjust layout to make space for the legend
    plt.show()

else:
    print("\nNo model metrics were collected. Please ensure previous model training cells were run successfully.")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Import necessary libraries for XGBoost and Hyperparameter Tuning
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier # Import XGBoost Classifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np

print("Starting XGBoost Classifier Model Implementation...")
y_train_encoded = y_train
y_test_encoded = y_test
print("Target variable is already numerical, no encoding needed.")


# --- Hyperparameter Tuning using Randomized Search CV ---
# Define the parameter distribution for XGBoost Classifier
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500], # Number of boosting rounds
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2], # Step size shrinkage
    'max_depth': [3, 4, 5, 6, 7, 8], # Maximum depth of a tree
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0], # Fraction of samples used for fitting the individual base learners
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0], # Fraction of features used per tree
    'gamma': [0, 0.1, 0.2, 0.3, 0.4], # Minimum loss reduction required to make a further partition on a leaf node
    'lambda': [1, 1.5, 2], # L2 regularization term
    'alpha': [0, 0.1, 0.5] # L1 regularization term
}

# Initialize the XGBoost Classifier model
# use_label_encoder=False suppresses a warning, but encoding target is necessary - Not needed if target is already numerical
xgb_clf = XGBClassifier(objective='multi:softmax', random_state=42, use_label_encoder=False, eval_metric='mlogloss')
# Initialize Randomized Search with cross-validation
random_search_xgb = RandomizedSearchCV(estimator=xgb_clf, param_distributions=param_dist,
                                       n_iter=30, cv=5, scoring='accuracy', # You can change scoring to 'f1_weighted', etc.
                                       n_jobs=-1, random_state=42, verbose=2) # verbose=2 shows progress

print("Starting Randomized Search for Hyperparameter Tuning (XGBoost)...")

# Fit Randomized Search to the numerical training data
random_search_xgb.fit(X_train, y_train) # Use original numerical y_train

print("Randomized Search finished.")

# Get the best parameters and best score found by Randomized Search
best_params_xgb = random_search_xgb.best_params_
best_score_xgb = random_search_xgb.best_score_

print(f"\nBest Parameters found by Randomized Search (XGBoost): {best_params_xgb}")
print(f"Best Cross-Validation Score (Accuracy): {best_score_xgb:.4f}")

# --- Fit the Algorithm with Best Parameters ---

# Initialize the XGBoost Classifier model with the best parameters
# use_label_encoder=False and eval_metric are needed here too
best_xgb_clf = XGBClassifier(**best_params_xgb, random_state=42, use_label_encoder=False, eval_metric='mlogloss')

print("\nTraining the XGBoost model with best parameters...")

# Fit the model on the entire numerical training data
best_xgb_clf.fit(X_train, y_train) # Use original numerical y_train

print("Model training completed.")

# --- Predict on the model ---

# Predict on the test data using the best model
print("\nMaking predictions on the test data using the best XGBoost model...")
y_pred_xgb = best_xgb_clf.predict(X_test) # Predictions will be numerical
print("Predictions made.")

print("\nXGBoost Model Evaluation with Best Parameters:")
# Evaluate using original numerical labels and predicted numerical labels
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))

##### Which hyperparameter optimization technique have you used and why?

We used RandomizedSearchCV for tuning XGBoost because it efficiently searches a wide range of parameters with less computation time than GridSearchCV, making it ideal for models with many hyperparameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the accuracy improved from ~93% to 100%. All evaluation metrics (precision, recall, F1-score) reached 1.00, and the confusion matrix showed no misclassifications, confirming the performance gain.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

We considered Accuracy, Precision, Recall, and F1-Score. These metrics ensure not just overall correctness but also minimize false positives and false negatives, which is critical for reliable business decisions and customer trust.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We selected the XGBoost Classifier as the final model because it achieved 100% accuracy after hyperparameter tuning and outperformed all other models in terms of stability and predictive power.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

We used XGBoost, a powerful gradient boosting algorithm. Using XGBoost's feature_importances_ and tools like SHAP (SHapley Additive exPlanations), we identified which features contributed most to predictions, helping improve transparency and trust in model decisions.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import warnings
warnings.filterwarnings("ignore")

# Imports
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from scipy.stats import uniform
import joblib

# Optional: Install scikit-optimize if not present
try:
    from skopt import BayesSearchCV
except ImportError:
    import os
    os.system('pip install scikit-optimize')
    from skopt import BayesSearchCV

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ---- GridSearchCV ----
grid_params = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2']
}
grid_search = GridSearchCV(LogisticRegression(max_iter=200), grid_params, cv=3)
grid_search.fit(X_train, y_train)
print("GridSearch Accuracy:", accuracy_score(y_test, grid_search.predict(X_test)))
joblib.dump(grid_search, 'gridsearch_model.pkl')

# ---- RandomizedSearchCV ----
random_params = {
    'C': uniform(0.01, 10),
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2']
}
random_search = RandomizedSearchCV(LogisticRegression(max_iter=200), random_params, n_iter=10, cv=3, random_state=42)
random_search.fit(X_train, y_train)
print("RandomSearch Accuracy:", accuracy_score(y_test, random_search.predict(X_test)))
joblib.dump(random_search, 'randomsearch_model.pkl')

# ---- Bayesian Optimization ----
bayes_params = {
    'C': (1e-3, 1e+2, 'log-uniform'),
    'solver': ['liblinear', 'lbfgs']
}
bayes_search = BayesSearchCV(LogisticRegression(max_iter=200), search_spaces=bayes_params, n_iter=10, cv=3, random_state=42)
bayes_search.fit(X_train, y_train)
print("Bayesian Optimization Accuracy:", accuracy_score(y_test, bayes_search.predict(X_test)))
joblib.dump(bayes_search, 'bayesopt_model.pkl')


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
try:
    # Load the Bayesian Optimization model saved in the previous cell
    loaded_model = joblib.load('bayesopt_model.pkl')
    print("Model loaded successfully!")

    unseen_data = np.array([[5.0, 3.5, 1.3, 0.2],  # Example data point 1
                            [6.5, 3.0, 5.2, 2.0]]) # Example data point 2

    # Ensure the unseen data has the correct shape (n_samples, n_features)
    # X_train was defined in the previous cell (ipython-input-21)
    if unseen_data.shape[1] != X_train.shape[1]:
        print(f"Error: Unseen data has {unseen_data.shape[1]} features, but the model was trained on data with {X_train.shape[1]} features.")
        print("Please provide unseen data with the correct number of features.")
    else:
        # Predict using the loaded model
        unseen_predictions = loaded_model.predict(unseen_data)

        print("\nPredictions for unseen data:")
        print(unseen_predictions)

except FileNotFoundError:
    print("Error: Model file not found. Please ensure 'bayesopt_model.pkl' exists.")
    print("Run the previous cell to train and save the model before running this cell.")

# Note: X_train must be defined in a preceding cell for the shape check to work.
# Ensure the cell where X_train is defined (ipython-input-21 in this case) is run first.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we implemented Logistic Regression on the Iris dataset using three hyperparameter optimization techniques: GridSearchCV, RandomizedSearchCV, and Bayesian Optimization. Each method successfully identified the best combination of parameters, improving model performance. GridSearch performed exhaustive search, RandomizedSearch provided faster convergence with random sampling, and Bayesian Optimization efficiently explored the parameter space using prior results. All models achieved high accuracy on the test data. The trained models were saved for future use. This comparison highlights the trade-off between search time and optimization effectiveness, guiding practitioners to choose the best tuning strategy based on data size and complexity.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***