# **Data Collection Notebook**

## Objectives

+ Fetch data from [Kaggle](https://www.kaggle.com/) and save it as raw data.
+ Inspect and save the data under outputs/datasets/collection.
+ Gain a deeper understanding of the data using Pandas Profiling and correlation analysis to address Business Requirement 1:
    + The client wants to identify which variables have the strongest correlation with loan defaults.

## Inputs

+ Authentication token from Kaggle (JSON file).
+ Kaggle dataset: Loan Default Dataset.

## Outputs

+ Generate a dataset in the outputs folder (outputs/datasets/collection).
+ Generate code that answers Business Requirement 1.

---


## Change working directory

Change working directory from the current one to the parent folder.

In [None]:
import os
current_dir = os.getcwd() # get current directory
current_dir

To make the parent directory the current directory, we use `os.path.dirname()` to get the parent path and `os.chdir()` to change to it.

In [None]:
os.chdir(os.path.dirname(current_dir)) # change directory to parent directory
print("The directory you are in is:", os.getcwd()) # print current directory

Confirm the new current directory.

In [None]:
current_dir = os.getcwd() # get current directory
current_dir
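
Re-running the cell that changes directory moves up one more level each time. A minimal sketch of a guard against that is shown here; the helper name `move_to_project_root` and the folder name `jupyter_notebooks` are assumptions for illustration, not part of the original notebook.

```python
import os

def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move up one level only if the current folder is the notebooks folder.

    `notebook_folder` is a hypothetical name; adjust it to the real folder.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_folder:
        os.chdir(os.path.dirname(cwd))  # change to the parent directory
    return os.getcwd()

print("The directory you are in is:", move_to_project_root())
```

Calling the helper a second time leaves the working directory unchanged, because the current folder no longer matches `notebook_folder`.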

## Fetch the data from Kaggle

First, install the Kaggle package.

In [None]:
%pip install kaggle

+ An account must be registered on Kaggle to obtain an API Key in the format of a JSON file.

+ To authenticate with the Kaggle API, set the environment variable `KAGGLE_CONFIG_DIR` to the current working directory, where `kaggle.json` is stored. Then set the file's permissions to read/write for the owner only, using `chmod 600`, to restrict access and protect the credentials.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
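
The shell command above depends on a Unix-like shell. As a hedged alternative, the same permission change can be sketched in pure Python with `os.chmod`; the helper name `restrict_to_owner` is hypothetical.

```python
import os
import stat

def restrict_to_owner(path):
    """Set owner read/write only, the equivalent of `chmod 600`."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so they can be checked
    return stat.S_IMODE(os.stat(path).st_mode)

# Example usage, assuming kaggle.json is in the working directory:
# restrict_to_owner("kaggle.json")
```

On POSIX systems the returned value is `0o600`; on Windows, `os.chmod` only controls the read-only flag, so the result may differ.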

+ The dataset used in this project is [Loan Default Dataset](https://www.kaggle.com/datasets/yasserh/loan-default-dataset/data).

+ The dataset path is `yasserh/loan-default-dataset`.

+ Define the Kaggle dataset and destination folder and download it to the folder (inputs/datasets/raw).

In [None]:
KaggleDatasetPath = "yasserh/loan-default-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip archive and the `kaggle.json` file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
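
Where `unzip` is not available, the extraction step could be sketched with the standard library instead. The helper name `extract_and_remove_zips` is an assumption for illustration; it does not delete `kaggle.json`.

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)  # remove the archive once its contents are extracted
    return extracted

# Example usage with the destination folder defined earlier:
# extract_and_remove_zips("inputs/datasets/raw")
```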

---

## Load and Inspect the data

Using the pandas library, the dataset can be loaded as a dataframe so the data can be inspected.

In [None]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/Loan_Default.csv")
df.head()

A dataframe summary can be obtained.

In [None]:
df.info()

+ From the summary we can see that there are missing values in the dataframe, as the "Non-Null Count" column shows different values for different features.

+ We create and print a list with the columns that contain missing values.

In [None]:
columns_with_nan = df.columns[df.isnull().any()].to_list()
columns_with_nan
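
Beyond listing the affected columns, it can help to see how much is missing in each. The following is a minimal sketch; the helper name `missing_summary` and the toy dataframe are illustrative and not taken from the loan dataset.

```python
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

Applied to the notebook's data, the call would be `missing_summary(df)`.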

+ As the dataset contains an ID column, we must check for duplicates.

In [None]:
df[df.duplicated(subset=["ID"])]
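
An empty result from the cell above means no ID is repeated. As a small sketch, the same check can be reduced to a single number; the helper name `count_duplicate_ids` and the toy data are assumptions for illustration.

```python
import pandas as pd

def count_duplicate_ids(df, id_column="ID"):
    """Number of rows whose ID repeats an earlier row's ID."""
    return int(df.duplicated(subset=[id_column]).sum())

# Toy data for illustration only: ID 2 appears twice
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "Status": [0, 1, 1, 0]})
print(count_duplicate_ids(toy))
```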

+ The `Status` variable is already numerical, so no conversion is needed.

In [None]:
# will I need to convert any columns to a different data type?
# will I need to rename any columns?

## Save the data set

In [None]:
# create the outputs/datasets/collection folder if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/LoanDefault.csv", index=False)

# Data Exploration

We will use ydata-profiling (formerly pandas-profiling) to explore the dataset, identify missing values, analyze data types and distributions, and understand the business context of each variable.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

After profiling, we can draw some conclusions:
+ Almost half of the features have more than 85% of values concentrated in a single category. These may provide little predictive power and could be dropped after further evaluation.
+ Important features such as property_value, income, loan-to-value (LTV), term and age contain missing or incorrect values that will be addressed.
+ In the alerts tab, a warning is raised for 75% of zeros in the status column. This is expected since status is our target variable, where 0 indicates "not defaulted."
+ LTV shows extremely high skewness, suggesting that most values are concentrated at lower levels, while a few extreme outliers (likely errors) significantly distort the distribution.
+ Other alerts, including missing values, high uniqueness, or constant features, will be handled later through imputation or removal as needed.
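
The first conclusion, about features dominated by a single category, can be checked directly. This is a minimal sketch: the helper name `dominant_share` and the toy data are illustrative, and the 0.85 threshold mirrors the 85% figure mentioned above.

```python
import pandas as pd

def dominant_share(df, threshold=0.85):
    """Share of the most frequent value per column, keeping those above `threshold`."""
    shares = df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )
    return shares[shares > threshold].sort_values(ascending=False)

# Toy data for illustration only: "flag" is 90% one value, "value" is all unique
toy = pd.DataFrame({"flag": ["n"] * 9 + ["y"], "value": list(range(10))})
print(dominant_share(toy))
```

Applied to the notebook's data, `dominant_share(df)` would list the candidate columns to review before dropping anything.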

## Correlation Study

Running Pearson and Spearman correlation will help identify key predictors of default status before cleaning the data. Pearson detects linear relationships, while Spearman captures rank-based trends. This will guide feature selection and highlight potential redundancies or outliers.
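
To illustrate the difference between the two methods, here is a small sketch on made-up data where `y` increases with `x` but not linearly. The values are purely illustrative.

```python
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 5

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` match exactly, Spearman reports a perfect correlation, while Pearson comes out lower since the relationship is not a straight line.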

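Before conducting the correlation studies, the categorical data needs to be converted to numerical format. Missing values in the categorical columns will be imputed with the most frequent value in each column. By default, `CategoricalImputer` only selects columns of type object or categorical, so numerical columns keep their missing values; `DataFrame.corr()` excludes those pairwise.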

In [None]:
from feature_engine.imputation import CategoricalImputer

imputer = CategoricalImputer(imputation_method="frequent")
df_imputed = imputer.fit_transform(df)
df_imputed.isnull().sum()
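
If the goal is to fill every column, numerical ones included, with its most frequent value, a pandas-only sketch could look like the following. The helper name `impute_most_frequent` and the toy data are assumptions for illustration; this is not the notebook's own method.

```python
import pandas as pd

def impute_most_frequent(df):
    """Fill missing values in every column with that column's most frequent value."""
    modes = df.mode(dropna=True).iloc[0]  # first mode of each column
    return df.fillna(modes)

# Toy data for illustration only
toy = pd.DataFrame({
    "income": [10.0, None, 10.0, 20.0],
    "region": ["N", "S", None, "S"],
})
print(impute_most_frequent(toy))
```

When a column has several equally frequent values, `DataFrame.mode()` returns them sorted and this sketch takes the first one.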

In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=df_imputed.columns[df_imputed.dtypes == "object"].to_list(), drop_last=False)
df_encoded = encoder.fit_transform(df_imputed)
df_encoded.head()

+ The expression `df_encoded.corr(method=...)["Status"]` returns a pandas Series holding the correlation between each column and `Status`. We sort the values by their absolute magnitude, using `key=abs`, so that the correlations are ordered by strength regardless of sign. After sorting, the first entry is the correlation between `Status` and itself, which is always 1, so we exclude it by slicing with `[1:]` and keep the next 10 entries with `.head(10)`.

In [None]:
corr_pearson = df_encoded.corr(method="pearson")["Status"].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

In [None]:
corr_spearman = df_encoded.corr(method="spearman")["Status"].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
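
The two cells above differ only in the `method` argument, so they could be wrapped in one helper. The sketch below drops the target by label instead of relying on its position after sorting; the helper name `top_correlations` and the toy data are assumptions for illustration.

```python
import pandas as pd

def top_correlations(df, target="Status", method="pearson", n=10):
    """Return the n features most correlated with `target`, ranked by absolute value."""
    corr = df.corr(method=method)[target].drop(labels=target)
    return corr.sort_values(key=abs, ascending=False).head(n)

# Toy data for illustration only
toy = pd.DataFrame({
    "Status": [0, 0, 1, 1, 0, 1],
    "feat_a": [1, 2, 8, 9, 1, 7],
    "feat_b": [5, 3, 4, 5, 3, 4],
    "feat_c": [9, 8, 2, 1, 9, 3],
})
print(top_correlations(toy, n=2))
```

With the notebook's data, the calls would be `top_correlations(df_encoded, method="pearson")` and `top_correlations(df_encoded, method="spearman")`.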

The correlation studies, both Pearson and Spearman, did not reveal any strong relationships between the features and the target variable, with the exception of `credit_type_EQUI`. However, according to the profiling, `credit_type_EQUI` represents only about 10% of the total loan types, which limits its significance. The absence of obvious strong correlations does not imply that the model won't perform well; it indicates that further data engineering and feature selection are needed. As a result, we will not pursue the correlation analysis further and will instead explore other techniques, such as feature selection, for the model.

---

## Conclusion and Next Steps

+ The correlation study did not provide significant insights into which features strongly influence the default status.
+ The data will undergo further cleaning and imputation, followed by additional analysis to identify and select the most relevant features for model training.