# Import the libraries
This code snippet imports libraries for data preprocessing:
- Pandas for data manipulation.
- 'LabelEncoder' for categorical label conversion.
- 'StandardScaler' for feature standardization.
- 'train_test_split' for data separation.


In [13]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset
This code snippet loads a dataset using Pandas:
- 'df' is a variable used to store the loaded data.
- 'pd.read_csv()' is a Pandas function used to read a CSV file.
- The file path "/content/drive/MyDrive/Datasets/loan_cleaned.csv" specifies the location of the CSV file being read.


In [14]:
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/Datasets/loan_cleaned.csv")



# df.shape

This code snippet retrieves the shape of the DataFrame 'df':
- The 'shape' attribute of a DataFrame returns a tuple representing its dimensions (rows, columns).
- Running this code provides the number of rows and columns present in the DataFrame.


In [15]:
df.shape

(39717, 41)

# df.head()

This code snippet displays the top rows of the DataFrame 'df':
- The 'head()' function of a DataFrame is used to view the first few rows.
- By default, it shows the first 5 rows, providing a quick overview of the data's structure and contents.

In [16]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,...,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,pub_rec_bankruptcies
0,5000,5000,4975.0,36 months,10.65,162.87,B,B2,,10,...,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-15,171.62,May-16,0.0
1,2500,2500,2500.0,60 months,15.27,59.83,C,C4,Ryder,1,...,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-13,119.66,Sep-13,0.0
2,2400,2400,2400.0,36 months,15.96,84.33,C,C5,,10,...,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-14,649.91,May-16,0.0
3,10000,10000,10000.0,36 months,13.49,339.31,C,C1,AIR RESOURCES BOARD,10,...,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-15,357.48,Apr-16,0.0
4,3000,3000,3000.0,60 months,12.69,67.79,B,B5,University Medical Group,1,...,3513.33,2475.94,1037.39,0.0,0.0,0.0,May-16,67.79,May-16,0.0
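
The 'Unnamed: 0' column in the output above usually appears when a CSV file was written together with its DataFrame index and then read back without declaring that column as the index. The sketch below reproduces this on a small made-up table (the values are illustrative and not read from the loan file) and shows two common ways to avoid the extra column: reading with 'index_col=0' or writing with 'index=False'.

```python
import io
import pandas as pd

# Small illustrative table; the values are made up for this example
toy = pd.DataFrame({'loan_amnt': [5000, 2500], 'term': ['36 months', '60 months']})

# Writing with the default index=True stores the index as an unnamed first column
with_index = toy.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())

# Option 1: declare the first column as the index when reading
reloaded_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)
print(reloaded_indexed.columns.tolist())

# Option 2: leave the index out when writing
without_index = toy.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(without_index))
print(reloaded_clean.columns.tolist())
```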


# df.dtypes

This code snippet displays the data types of columns in the DataFrame 'df':
- The 'dtypes' attribute of a DataFrame provides the data type of each column.
- Running this code shows the data type (e.g., integer, float, string) of each column in the DataFrame.


In [17]:
df.dtypes

loan_amnt                    int64
funded_amnt                  int64
funded_amnt_inv            float64
term                        object
int_rate                   float64
installment                float64
grade                       object
sub_grade                   object
emp_title                   object
emp_length                   int64
home_ownership              object
annual_inc                 float64
verification_status         object
issue_d                     object
loan_status                 object
purpose                     object
title                       object
zip_code                    object
addr_state                  object
dti                        float64
delinq_2yrs                  int64
earliest_cr_line            object
inq_last_6mths               int64
open_acc                     int64
pub_rec                      int64
revol_bal                    int64
revol_util                 float64
total_acc                    int64
out_prncp           
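
Columns listed as 'object' in the output above hold text and need conversion before modeling. As a minimal sketch on a made-up table, 'select_dtypes()' can separate numeric columns from the rest without reading the full 'dtypes' listing by eye.

```python
import pandas as pd

# Illustrative table; the values are made up for this example
toy = pd.DataFrame({
    'loan_amnt': [5000, 2500],
    'int_rate': [10.65, 15.27],
    'term': ['36 months', '60 months'],
    'grade': ['B', 'C'],
})

# Columns with integer or float data
numeric_columns = toy.select_dtypes(include='number').columns.tolist()

# Everything else (here: text columns)
non_numeric_columns = toy.select_dtypes(exclude='number').columns.tolist()

print("Numeric:", numeric_columns)
print("Non-numeric:", non_numeric_columns)
```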

# Select only the necessary columns
This code snippet filters the DataFrame to include only specific columns:

- 'selected_columns' is a list of column names that are deemed necessary.
- 'df[selected_columns]' creates a new DataFrame 'df_selected' containing only the selected columns.

The code focuses on essential columns like loan details, borrower information, and financial attributes for further analysis.


In [18]:
# Select only the necessary columns
selected_columns = ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'loan_status', 'purpose', 'dti', 'delinq_2yrs', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'recoveries', 'last_pymnt_d', 'last_pymnt_amnt', 'issue_d']
df_selected = df[selected_columns]

# Converting Date Columns to Datetime
In this section, the code converts the date columns of the 'df_selected' DataFrame from strings to datetime values:

- 'issue_d' and 'last_pymnt_d' are stored as text; 'last_pymnt_d', for example, holds values such as 'Jan-15'.
- 'pd.to_datetime()' parses each column with the format string '%b-%y', where '%b' is the abbreviated month name and '%y' is the two-digit year.

The purpose of this step is to give both columns a proper datetime type so that they can be sorted and compared as dates.
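
The next cell parses month-year strings such as 'Jan-15'. The sketch below shows how the format string '%b-%y' reads such values, using a few made-up strings and assuming English month abbreviations. The 'errors='coerce'' option is an optional addition, not used in the notebook, that turns unparseable entries into NaT instead of raising an error.

```python
import pandas as pd

# Month-year strings in the same style as the dataset's date columns
raw_dates = pd.Series(['Jan-15', 'Apr-13', 'May-16'])

# %b = abbreviated month name, %y = two-digit year; the day defaults to 1
parsed = pd.to_datetime(raw_dates, format='%b-%y')
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())

# Optional: coerce values that do not match the format to NaT
messy = pd.Series(['Jan-15', 'not a date'])
parsed_messy = pd.to_datetime(messy, format='%b-%y', errors='coerce')
print(parsed_messy.isna().tolist())
```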


In [19]:
# Convert 'issue_d' and 'last_pymnt_d' columns to datetime
df_selected['issue_d'] = pd.to_datetime(df_selected['issue_d'], format='%b-%y')
df_selected['last_pymnt_d'] = pd.to_datetime(df_selected['last_pymnt_d'], format='%b-%y')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_selected['issue_d'] = pd.to_datetime(df_selected['issue_d'], format='%b-%y')
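
The message above is pandas' SettingWithCopyWarning. 'df_selected' was created by selecting columns from 'df', so pandas cannot tell whether a later assignment is meant to affect 'df' as well. A common way to avoid the warning, sketched here on a made-up table, is to take an explicit copy with '.copy()' when selecting the columns.

```python
import pandas as pd

# Illustrative table; the values are made up for this example
toy = pd.DataFrame({
    'issue_d': ['Dec-11', 'Jan-12'],
    'loan_amnt': [5000, 2500],
    'extra': [1, 2],
})

# .copy() makes the selection an independent DataFrame
toy_selected = toy[['issue_d', 'loan_amnt']].copy()

# Assigning to the copy is unambiguous and leaves the original untouched
toy_selected['issue_d'] = pd.to_datetime(toy_selected['issue_d'], format='%b-%y')

print(toy_selected['issue_d'].dt.year.tolist())
print(toy['issue_d'].tolist())
```

Applied to the notebook, this would mean writing 'df_selected = df[selected_columns].copy()' in the column selection cell.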


# Preprocessing the 'term' Column
In this section, the code performs preprocessing on the 'term' column within the 'df_selected' DataFrame:

- The 'term' column contains loan terms in the format of 'XX months'.
- The 'apply()' function with a lambda expression is used to extract the numeric part of the term and convert it to an integer.
- This ensures that the 'term' column represents loan terms in months, making it suitable for analysis.
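
The sketch below compares the 'apply()' approach described above with a vectorized alternative based on 'str.extract()'. The sample strings are made up, and the regular expression assumes that each term contains exactly one run of digits.

```python
import pandas as pd

# Loan terms in the 'XX months' style, with and without leading spaces
terms = pd.Series([' 36 months', '60 months', ' 36 months'])

# Approach used in the notebook: split on whitespace and keep the first token
via_apply = terms.apply(lambda x: int(x.split()[0]))

# Vectorized alternative: pull out the digits with a regular expression
via_extract = terms.str.extract(r'(\d+)', expand=False).astype(int)

print(via_apply.tolist())
print(via_extract.tolist())
```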


In [20]:
# Preprocess 'term' column
df_selected['term'] = df_selected['term'].apply(lambda x: int(x.split()[0]))



# Converting Selected Attributes to Categorical
Here, the code aims to convert specific attributes within the 'df_selected' DataFrame into categorical data:

- 'attributes_to_convert' is a list containing the names of attributes that need to be treated as categorical variables.
- The DataFrame 'df_selected' is updated by applying the categorical conversion to the columns listed in 'attributes_to_convert'.

By converting these attributes to categorical, the data is better structured for analysis and modeling.
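
As a minimal sketch of what the conversion does, the example below converts one column of a made-up table to the 'category' type and inspects the distinct categories and the integer code pandas stores for each row.

```python
import pandas as pd

# Illustrative table; the values are made up for this example
toy = pd.DataFrame({'grade': ['B', 'C', 'A', 'B'], 'loan_amnt': [5000, 2500, 2400, 10000]})
toy['grade'] = toy['grade'].astype('category')

# Distinct values, stored once and sorted
print(toy['grade'].cat.categories.tolist())

# Integer code of each row, pointing into the categories above
print(toy['grade'].cat.codes.tolist())
```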


In [21]:
# List of attributes to convert to categorical
attributes_to_convert = ['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'loan_status', 'purpose']
df_selected[attributes_to_convert] = df_selected[attributes_to_convert].astype('category')



# Saving and Downloading Processed DataFrame
In this section, the code saves and downloads the processed DataFrame as a CSV file. The cell is disabled as written, so nothing is saved or downloaded until the code is enabled:

- 'df_selected.to_csv('df_selected.csv', index=False)': This line saves the 'df_selected' DataFrame to a CSV file named 'df_selected.csv'.
  - The parameter 'index=False' ensures that the DataFrame index is not saved as a separate column in the CSV file.

- 'from google.colab import files': This line imports the 'files' module from the 'google.colab' library, which is used for file operations in the Colab environment.

- 'files.download('df_selected.csv')': This line downloads the saved CSV file using the 'files.download()' function from the 'files' module.
  - It allows you to directly download the file from the Colab environment to your local machine.

This code snippet provides a way to save the processed DataFrame as a CSV file and then download it for further use.


In [22]:
# Disabled by default: uncomment the lines below to save and download the CSV in Colab
# df_selected.to_csv('df_selected.csv', index=False)  # index=False avoids saving the DataFrame index as a column
# from google.colab import files
# files.download('df_selected.csv')

# Applying Label Encoding to Selected Categorical Columns
In this section, the code applies label encoding to specific categorical columns within the 'df_selected' DataFrame:

- 'label_encoder' is an instance of the 'LabelEncoder' class from scikit-learn, used to convert categorical labels to numerical values.
- 'selected_categorical_columns' is a list containing the names of columns that require label encoding.
- A loop iterates through each column in 'selected_categorical_columns'.
- For each column, the 'fit_transform()' method of the label encoder is used to convert categorical values to numerical labels.

By performing label encoding, categorical data is transformed into a format suitable for machine learning algorithms.
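
Because 'fit_transform()' refits the encoder on every call, a single shared encoder only remembers the classes of the last column it processed. The sketch below, on a made-up table, keeps one encoder per column in a dictionary so that every mapping can be inverted later. The name 'encoders' is introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative table; the values are made up for this example
toy = pd.DataFrame({
    'grade': ['B', 'C', 'A', 'B'],
    'loan_status': ['Fully Paid', 'Charged Off', 'Fully Paid', 'Current'],
})

# One encoder per column, stored by column name (hypothetical helper dict)
encoders = {}
for col in ['grade', 'loan_status']:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Classes are sorted, so codes follow alphabetical order
print(encoders['loan_status'].classes_.tolist())
print(toy['loan_status'].tolist())

# Each column can be decoded with its own encoder
print(encoders['grade'].inverse_transform([0, 1, 2]).tolist())
```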


In [23]:
# Apply label encoding to selected categorical columns
label_encoder = LabelEncoder()
selected_categorical_columns = ['grade', 'sub_grade', 'emp_title', 'home_ownership', 'verification_status', 'purpose', 'loan_status']
for col in selected_categorical_columns:
    # The single encoder is refit on every column, so after the loop it holds only the 'loan_status' classes
    df_selected[col] = label_encoder.fit_transform(df_selected[col])



# Imputing Missing Values with Median
In this section, the code fills in missing values within the 'df_selected' DataFrame using the median:

- 'fillna()' is a Pandas method used to replace missing values with specified values.
- The argument 'df_selected.median(numeric_only=True)' calculates the median of each numeric column and serves as the replacement value; non-numeric columns, such as the datetime and categorical columns, are left unchanged.
- 'inplace=True' ensures that the changes are applied directly to 'df_selected' rather than to a new DataFrame.

By imputing missing values with the median, the data is prepared for analysis and modeling. Because the median is less sensitive to outliers than the mean, extreme values have little effect on the imputed values.


In [24]:
# Impute missing values in the numeric columns with each column's median
df_selected.fillna(df_selected.median(numeric_only=True), inplace=True)

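
The imputation cell computes its medians over all rows, including rows that later end up in the test set. As an alternative sketch, assuming the medians should come from the training rows only, scikit-learn's 'SimpleImputer' can be fitted on the training split and then applied to the test split. The tables and values below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative train and test tables with missing values
train = pd.DataFrame({
    'annual_inc': [40000.0, np.nan, 60000.0, 80000.0],
    'dti': [10.0, 20.0, np.nan, 30.0],
})
test = pd.DataFrame({'annual_inc': [np.nan], 'dti': [15.0]})

# Learn the medians from the training rows only
imputer = SimpleImputer(strategy='median')
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training medians for the test rows
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_.tolist())
print(test_filled['annual_inc'].tolist())
```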


# Initializing the StandardScaler
scaler = StandardScaler()
- 'StandardScaler' standardizes each feature by subtracting its mean and dividing by its standard deviation.

# Selecting Numerical Features for Scaling (Excluding Datetime Columns)
In this section, numerical features are selected for scaling from the 'df_selected' DataFrame:

- 'numerical_feature_columns' is a list containing the names of columns with numerical features that require scaling.
- 'X_train' and 'X_test' are data subsets created by splitting the data into training and testing sets.
- The 'drop()' function is used to remove the 'loan_status' column, as it is the target variable.

# Fitting and Transforming the Scaler on Training Numerical Data
In this part, the StandardScaler is used to standardize the training numerical data:

- 'scaler.fit_transform(X_train_numerical)' fits the scaler to the training data and scales it.

# Transforming the Test Numerical Data Using the Same Scaler
Here, the standardized scaler from the training data is applied to the test numerical data:

- 'scaler.transform(X_test_numerical)' transforms the test data using the previously fitted scaler.

By performing this scaling, the numerical features are standardized, which helps machine learning algorithms perform more effectively.
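
The sketch below illustrates the fit-on-train, transform-on-test pattern on a tiny made-up array. After scaling, each training column has mean 0 and standard deviation 1, while the test row is scaled with the training statistics and therefore need not have those properties. The variable names with a '_toy' suffix are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales; the values are made up
X_train_toy = np.array([[1000.0, 10.0], [2000.0, 12.0], [3000.0, 14.0]])
X_test_toy = np.array([[2000.0, 16.0]])

scaler_toy = StandardScaler()

# fit_transform learns mean and standard deviation from the training rows
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)

# transform reuses the training statistics for the test row
X_test_toy_scaled = scaler_toy.transform(X_test_toy)

print(scaler_toy.mean_)
print(X_train_toy_scaled.mean(axis=0), X_train_toy_scaled.std(axis=0))
print(X_test_toy_scaled)
```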


In [25]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Select numerical features for scaling (excluding datetime columns)
numerical_feature_columns = ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'purpose', 'dti', 'delinq_2yrs', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'recoveries', 'last_pymnt_amnt']
X_train, X_test, y_train, y_test = train_test_split(df_selected.drop('loan_status', axis=1), df_selected['loan_status'], test_size=0.2, random_state=42)
X_train_numerical = X_train[numerical_feature_columns]
X_test_numerical = X_test[numerical_feature_columns]

# Fit and transform the scaler on the training numerical data
X_train_scaled = scaler.fit_transform(X_train_numerical)

# Transform the test numerical data using the same scaler
X_test_scaled = scaler.transform(X_test_numerical)


# Initializing the KNN Model
knn_model = KNeighborsClassifier(n_neighbors=3)
- The code uses the 'KNeighborsClassifier' class from scikit-learn to create a k-nearest neighbors (KNN) classification model.
- The parameter 'n_neighbors' is set to 3, specifying the number of neighbors to consider. This value can be adjusted based on the desired model behavior.

# Training the Model on Scaled Training Data
knn_model.fit(X_train_scaled, y_train)
- This step trains the KNN model using the standardized training data ('X_train_scaled') and corresponding target labels ('y_train').

# Making Predictions on the Scaled Test Data
y_pred = knn_model.predict(X_test_scaled)
- The trained KNN model predicts the loan status labels for the scaled test data ('X_test_scaled').
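
The scaling and KNN steps can also be combined into a single scikit-learn pipeline, so that the scaler is always fitted on the training data only. The sketch below uses synthetic data from 'make_classification' rather than the loan dataset, and the name 'knn_pipeline' is introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the loan features
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling and classification in one estimator
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X_tr, y_tr)

# predict() scales the test rows with the training statistics, then classifies them
y_hat = knn_pipeline.predict(X_te)
print(y_hat[:10])
```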


In [26]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize the KNN model
knn_model = KNeighborsClassifier(n_neighbors=3)  # You can adjust the number of neighbors

# Train the model on the scaled training data
knn_model.fit(X_train_scaled, y_train)

# Predict on the scaled test data
y_pred = knn_model.predict(X_test_scaled)

# Calculating the Accuracy of the Model
accuracy = accuracy_score(y_test, y_pred)
- The 'accuracy_score' function is used to calculate the accuracy of the model's predictions compared to the true test labels ('y_test').
- The calculated accuracy is then printed to the console using the 'print()' function.

This code trains a KNN classification model, makes predictions on test data, and evaluates its accuracy in predicting loan statuses.

In [27]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9097432024169184
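
Accuracy alone can hide weak performance on rare classes. As a sketch with made-up labels (not the notebook's predictions), the example below shows a classifier that always predicts class 2: it reaches a reasonable accuracy while never identifying classes 0 or 1, which the confusion matrix makes visible.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels with one dominant class, and predictions that ignore the rare classes
y_true_toy = [2, 2, 2, 2, 0, 1]
y_pred_toy = [2, 2, 2, 2, 2, 2]

acc_toy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])

print("Accuracy:", acc_toy)
print(cm_toy)
```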


# Defining Selected Numerical Feature Columns
This section defines a list of selected numerical feature columns:

- 'selected_numerical_columns' is a list containing the names of columns that are considered numerical features.
- These columns hold attributes such as loan amount, interest rate, employment details, and financial metrics.

The purpose of defining this list is to create a standardized reference to the specific columns used for analysis and modeling.


In [28]:
# Define the selected numerical feature columns
selected_numerical_columns = ['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
                              'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'purpose', 'dti',
                              'delinq_2yrs', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp',
                              'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'recoveries', 'last_pymnt_amnt']


# Creating a Sample New Data Point
In this section, a sample new data point is created using a Pandas DataFrame:

- 'new_data_point' is a DataFrame that contains values for each selected numerical feature column.
- The values provided simulate a new loan application data point for analysis.

# Preprocessing and Transforming the New Data Point
Here, the new data point is preprocessed and transformed for prediction:

- 'new_data_numerical' is a subset of the 'new_data_point' DataFrame containing only the selected numerical feature columns.
- 'new_data_scaled' is obtained by transforming 'new_data_numerical' using the previously fitted scaler.

# Making Predictions Using the Trained KNN Model
The trained KNN model is used to make predictions for the new data point:

- 'new_predictions' stores the predicted loan status labels for the new data point.

# Printing the Predictions
The predictions are printed to the console using the 'print()' function.

This code snippet demonstrates the process of creating, preprocessing, and predicting using a new data point with the trained KNN model.


In [29]:
# Create a sample new data point
new_data_point = pd.DataFrame({
    'loan_amnt': [100000],
    'term': [36],
    'int_rate': [8.5],
    'installment': [315.67],
    'grade': [2],
    'sub_grade': [10],
    'emp_title': [12345],
    'emp_length': [2],
    'home_ownership': [1],
    'annual_inc': [5000],
    'verification_status': [0],
    'purpose': [2],
    'dti': [15.0],
    'delinq_2yrs': [0],
    'open_acc': [8],
    'pub_rec': [0],
    'revol_bal': [8000],
    'total_acc': [15],
    'out_prncp': [0],
    'total_pymnt': [1000],
    'total_rec_prncp': [900],
    'total_rec_int': [1000],
    'recoveries': [0],
    'last_pymnt_amnt': [315.67]
})

# Preprocess and transform the new data point
new_data_numerical = new_data_point[selected_numerical_columns]
new_data_scaled = scaler.transform(new_data_numerical)

# Make predictions using the trained KNN model
new_predictions = knn_model.predict(new_data_scaled)

# Print the predictions
print("Predictions for new data point:", new_predictions)


Predictions for new data point: [0]


In [30]:
# Create a sample new data point
new_data_point = pd.DataFrame({
    'loan_amnt': [10000],
    'term': [36],
    'int_rate': [8.5],
    'installment': [315.67],
    'grade': [2],
    'sub_grade': [10],
    'emp_title': [12345],
    'emp_length': [2],
    'home_ownership': [1],
    'annual_inc': [50000],
    'verification_status': [0],
    'purpose': [2],
    'dti': [15.0],
    'delinq_2yrs': [0],
    'open_acc': [8],
    'pub_rec': [0],
    'revol_bal': [8000],
    'total_acc': [15],
    'out_prncp': [0],
    'total_pymnt': [1000],
    'total_rec_prncp': [9000],
    'total_rec_int': [1000],
    'recoveries': [0],
    'last_pymnt_amnt': [315.67]
})

# Preprocess and transform the new data point
new_data_numerical = new_data_point[selected_numerical_columns]
new_data_scaled = scaler.transform(new_data_numerical)

# Make predictions using the trained KNN model
new_predictions = knn_model.predict(new_data_scaled)

# Print the predictions
print("Predictions for new data point:", new_predictions)


Predictions for new data point: [2]
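
The outputs [0] and [2] are the integer codes that 'LabelEncoder' assigned to 'loan_status'. The sketch below shows how such codes can be turned back into readable names with 'inverse_transform()', and how 'predict_proba()' reports the share of neighbors voting for each class. It uses made-up status names and a single made-up feature. In the notebook, 'label_encoder' was last fitted on 'loan_status', so 'label_encoder.inverse_transform(new_predictions)' should perform the same decoding there.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Made-up training data: one feature, two status names
statuses = ['Charged Off', 'Charged Off', 'Charged Off', 'Fully Paid', 'Fully Paid', 'Fully Paid']
X_toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

status_encoder = LabelEncoder()
y_toy = status_encoder.fit_transform(statuses)

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Predict the integer code, then decode it back to the original name
new_point = np.array([[4.9]])
predicted_code = knn_toy.predict(new_point)
predicted_label = status_encoder.inverse_transform(predicted_code)

# Share of the 3 nearest neighbors belonging to each class
class_shares = knn_toy.predict_proba(new_point)

print(predicted_code, predicted_label)
print(class_shares)
```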
