                               Loan Risk Assessment Report                          

1. Introduction

This report presents the methodology and results of a loan risk assessment project. The goal was to predict whether a two-wheeler loan application should be accepted or rejected based on applicant data. We trained a machine learning model on historical, labeled applications.

Approach Taken

1. Loading Data

We were provided with three files for this analysis:

•	Assignment_Train.csv: Contains labeled data with features and the target variable "Application Status".

•	Assignment_Test.csv: Contains unlabeled data with the same features but without the target variable.

•	Assignment_FeatureDictionary.xlsx: Provides descriptions of the variables used in the datasets.

2. Data Preprocessing


We began by loading the training and test datasets and reviewing the feature dictionary for a better understanding of the variables.

2.1 Handling Missing Values

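Numerical Data: The `Cibil Score` column was first converted to a numeric type, with unparseable entries treated as missing. Missing values were then imputed with the mean of the respective column.
Categorical Data: Missing values were imputed with the mode (most frequent value) of the respective column.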
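
The imputation rules above can be sketched on a toy table. This is a minimal illustration, not the project code: the column names `score` and `city` and the helper `fit_fill_values` are hypothetical. It also makes one assumption that differs from the notebook cells later in this report, namely that the fill values are computed on the training data and reused for the test data.

```python
import numpy as np
import pandas as pd


def fit_fill_values(df: pd.DataFrame) -> dict:
    """Return one fill value per column: mean for numeric columns, mode otherwise."""
    fills = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fills[col] = df[col].mean()          # NaN values are skipped by default
        else:
            fills[col] = df[col].mode().iloc[0]  # most frequent non-missing value
    return fills


# Hypothetical toy data with one numeric and one categorical column
train = pd.DataFrame({"score": [700.0, np.nan, 800.0],
                      "city": ["Pune", None, "Pune"]})
test = pd.DataFrame({"score": [np.nan], "city": [None]})

fills = fit_fill_values(train)       # statistics come from the training data only
train_filled = train.fillna(fills)
test_filled = test.fillna(fills)     # the same values are reused for the test data

print(fills)
print(test_filled)
```

Reusing the training statistics keeps the two datasets on the same footing; filling each dataset with its own mean and mode is simpler but can shift values between the two.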

2.2 Encoding Categorical Variables

Categorical variables were converted to integer codes using Label Encoding. This step is required because scikit-learn's Random Forest implementation accepts only numeric input.
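
One way to keep the integer codes consistent between training and test data is to fit the encoder once on the training column and reuse it. The sketch below uses scikit-learn's `OrdinalEncoder` rather than the `LabelEncoder` used in this project, because it can map unseen categories to a sentinel value. The column name `employment` and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column; "student" appears only in the test data
train = pd.DataFrame({"employment": ["salaried", "self-employed", "salaried"]})
test = pd.DataFrame({"employment": ["self-employed", "student"]})

# Categories unseen during fitting are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = encoder.fit_transform(train[["employment"]])  # learn the mapping
test_codes = encoder.transform(test[["employment"]])        # reuse the mapping

print(encoder.categories_)
print(train_codes.ravel(), test_codes.ravel())
```

With this approach, a category receives the same code in both datasets, which a freshly fitted encoder per dataset does not guarantee.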

3. Feature Engineering

The test dataset was restricted to the feature columns used in training. Where the computed feature `APPLIED_AMOUNT_PER_ASSET` (the applied amount divided by the total asset cost) was missing from the test data, it was derived so that the columns matched the training dataset.
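
A minimal sketch of the computed feature follows. The column names `APPLIED AMOUNT` and `TOTAL ASSET COST` are taken from the notebook code in this report; the helper name `add_amount_per_asset`, the sample values, and the handling of a zero asset cost are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_amount_per_asset(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the ratio of applied amount to total asset cost."""
    out = df.copy()
    # Assumption: a zero asset cost is treated as missing to avoid dividing by zero
    cost = out["TOTAL ASSET COST"].replace(0, np.nan)
    out["APPLIED_AMOUNT_PER_ASSET"] = out["APPLIED AMOUNT"] / cost
    return out


# Hypothetical sample rows
sample = pd.DataFrame({
    "APPLIED AMOUNT": [60000, 45000, 30000],
    "TOTAL ASSET COST": [80000, 90000, 0],
})

result = add_amount_per_asset(sample)
print(result)
```

Rows with a zero asset cost end up with a missing ratio, which would then need the same imputation as any other missing numeric value.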

4. Training and Validation

The labeled dataset was split 80/20 into training and validation sets (`test_size=0.2`, `random_state=42`), which left 2,000 applications for validation. The model was trained on the training set and evaluated on the validation set.
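
The split can be illustrated on synthetic data. The sketch below adds `stratify=y`, which is an optional variation and not what the project code does; it keeps the class proportions the same in both subsets. The arrays are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 2 features, 30 samples of class 0 and 20 of class 1
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 30 + [1] * 20)

# 80/20 split; stratify=y (an assumption, not used in the project) preserves
# the 60/40 class ratio in both the training and the validation subset
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_val))
print(y_train.mean(), y_val.mean())
```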

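5. Model Training & Model Selection

We used a Random Forest Classifier (scikit-learn's `RandomForestClassifier` with default hyperparameters and `random_state=42`), an ensemble of decision trees that is well suited to classification tasks on tabular data.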


6. Model Evaluation

The model was assessed on the validation set using accuracy and the classification report metrics: precision, recall, and F1-score.
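
To show how these metrics relate to each other, the sketch below computes them by hand from true-positive, false-positive, and false-negative counts and cross-checks the result against scikit-learn. The labels are toy values chosen for the example, not project data, and class 1 is treated as the positive class by assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels (illustrative only); class 1 is treated as the positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Cross-check against scikit-learn
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
sk_accuracy = accuracy_score(y_true, y_pred)

print(accuracy, precision, recall, f1)
print(sk_accuracy, sk_precision, sk_recall, sk_f1)
```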


7. Conclusion

The Random Forest Classifier reached a validation accuracy of 0.887 on 2,000 held-out applications. Precision, recall, and F1-score were 0.91, 0.93, and 0.92 for class 0 and 0.85, 0.81, and 0.83 for class 1. On these metrics, the model is a viable tool for loan risk assessment.


Test Predictions

Predictions were made on the test dataset using the trained model. The results were saved in a CSV file named `predictions.csv`, which includes the unique identifiers (UIDs) and the corresponding predictions.




This code includes:

•	Data loading and preprocessing

•	Model training and evaluation

•	Test data preprocessing and prediction

•	Saving results to CSV

1. Data Loading

In [1]:
import pandas as pd

# Load the labeled training data, the unlabeled test data, and the feature dictionary
train_data = pd.read_csv('Assignment_Train.csv')
test_data = pd.read_csv('Assignment_Test.csv')
feature_dict = pd.read_excel('Assignment_FeatureDictionary.xlsx')


2. Data Preprocessing

2.1 Convert and Handle Missing Values

In [2]:
# Convert 'Cibil Score' to numeric and handle missing values
train_data['Cibil Score'] = pd.to_numeric(train_data['Cibil Score'], errors='coerce')

# Identify numerical and categorical columns
numerical_cols = train_data.select_dtypes(include=['number']).columns
categorical_cols = train_data.select_dtypes(include=['object']).columns

# Fill missing values
train_data[numerical_cols] = train_data[numerical_cols].fillna(train_data[numerical_cols].mean())
train_data[categorical_cols] = train_data[categorical_cols].fillna(train_data[categorical_cols].mode().iloc[0])


2.2 Encode Categorical Features

In [3]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical features
label_encoder = LabelEncoder()
for col in categorical_cols:
    if train_data[col].dtype == 'object':
        train_data[col] = label_encoder.fit_transform(train_data[col].astype(str))


3. Feature and Target Definition

In [4]:
# Define features and target variable
X = train_data.drop(['Application Status'], axis=1)
y = train_data['Application Status']


4. Split Data

In [5]:
from sklearn.model_selection import train_test_split

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


5. Model Selection and Training

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)


6. Model Evaluation

In [7]:
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_val)
print(f"Accuracy: {accuracy_score(y_val, y_pred)}")
print("Classification Report:")
print(classification_report(y_val, y_pred))


Accuracy: 0.887
Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.93      0.92      1327
           1       0.85      0.81      0.83       673

    accuracy                           0.89      2000
   macro avg       0.88      0.87      0.87      2000
weighted avg       0.89      0.89      0.89      2000



7. Preprocess and Predict on Test Data

In [8]:
# Keep the UIDs for the submission file before any feature column is modified
test_uids = test_data['UID'].copy()

# Keep only the test columns that were used as training features.
# .copy() avoids modifying a view of the original DataFrame.
test_data = test_data[[col for col in X.columns if col in test_data.columns]].copy()

# Handle missing values in the test data
for col in numerical_cols:
    if col in test_data.columns:
        test_data[col] = pd.to_numeric(test_data[col], errors='coerce')
        test_data[col] = test_data[col].fillna(test_data[col].mean())

for col in categorical_cols:
    if col in test_data.columns:
        test_data[col] = test_data[col].fillna(test_data[col].mode().iloc[0])

# Encode categorical features in the test data.
# Caveat: each encoder is fitted on the test column, so the integer codes match
# the training codes only if both datasets contain the same set of categories.
for col in categorical_cols:
    if col in test_data.columns and test_data[col].dtype == 'object':
        test_data[col] = LabelEncoder().fit_transform(test_data[col].astype(str))

# Compute the derived feature if it is not already present
if 'APPLIED_AMOUNT_PER_ASSET' not in test_data.columns:
    test_data['APPLIED_AMOUNT_PER_ASSET'] = test_data['APPLIED AMOUNT'] / test_data['TOTAL ASSET COST']

# Ensure that the feature columns match the training data, in the same order
test_data = test_data[X.columns]

# Make predictions on the test data
test_predictions = model.predict(test_data)

# Create a DataFrame for submission, using the original (unencoded) UIDs
submission = pd.DataFrame({'UID': test_uids, 'Prediction': test_predictions})

# Save predictions to CSV
submission.to_csv('predictions.csv', index=False)
