# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/LoanDefaultData.csv


---

# Imports

In [None]:
import os
import pandas as pd
# for vs code
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import glob
# Ignore FutureWarnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

---

# Change working directory

We need to change the working directory from its current folder, where the notebook is stored, to its parent folder
* First we access the current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

* Then we want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")
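
One caveat with the cell above: it moves up one level every time it runs, so re-running it lands outside the project. Below is a minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (a hypothetical name, adjust it to the actual folder). `move_to_project_root` is an illustrative helper and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Move to the parent directory only while still inside the notebook folder.

    `notebook_folder` is an assumed folder name; change it to match the project.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Still inside the notebook folder: step up exactly once
        os.chdir(cwd.parent)
    # Otherwise the working directory is left untouched
    return Path.cwd()
```

Calling `move_to_project_root()` a second time is then a no-op, because the working directory is no longer the notebook folder.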

---

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.7.4.5

To authenticate with Kaggle and download data in this session, your **authentication token (JSON file)** from Kaggle needs to be stored in the main project directory.
* In case you don't have your token yet, please refer to the [Kaggle Documentation](https://www.kaggle.com/docs/api)


Once you have placed your `kaggle.json` file in the main working directory, run the cell below so that the token is recognized in the session.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
#! chmod 600 kaggle.json  # not necessary in VS Code
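
Before starting the download, it can help to confirm that the token file is actually where `KAGGLE_CONFIG_DIR` points. The sketch below assumes the usual `kaggle.json` layout with a `username` and a `key` entry. `kaggle_token_is_valid` is a hypothetical helper name and does not come from the Kaggle package.

```python
import json
import os


def kaggle_token_is_valid(config_dir: str) -> bool:
    """Return True if config_dir holds a kaggle.json with 'username' and 'key' entries."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, encoding="utf-8") as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file counts as invalid
        return False
    # Both entries must be present and non-empty
    return isinstance(token, dict) and all(token.get(k) for k in ("username", "key"))
```

For example, `kaggle_token_is_valid(os.getcwd())` should return `True` once the file is in place. This only checks the file's structure, not whether Kaggle accepts the credentials.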

This project uses the [Loan Default Prediction Dataset](https://www.kaggle.com/datasets/laotse/credit-risk-dataset).

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "laotse/credit-risk-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
# Find all zip files in the folder
zip_files = glob.glob(os.path.join(DestinationFolder, "*.zip"))

# Extract each zip file and then delete it
for zip_path in zip_files:
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_path)  # remove the zip after extracting

# Remove kaggle.json, if it exists, so the token is not left in the repository
kaggle_json = os.path.join(os.getcwd(), "kaggle.json")
if os.path.exists(kaggle_json):
    os.remove(kaggle_json)

print("All ZIP files extracted and deleted.")

---

# Load and Inspect Kaggle data

In [None]:
df = pd.read_csv("inputs/datasets/raw/credit_risk_dataset.csv")
print(df.shape)
df.head()

* The Dataset contains 32581 rows and 12 columns

### Data Types

In [None]:
df.info()

The dataset consists of 12 variables. All data types are correctly assigned:

- **7 Numerical variables** (`person_age`, `person_income`, `person_emp_length`, `loan_amnt`, `loan_int_rate`, `loan_percent_income`, `cb_person_cred_hist_length`) are stored as either `int64` or `float64`
- **4 Categorical variables** (`person_home_ownership`, `loan_intent`, `loan_grade`, `cb_person_default_on_file`) are stored as `object` type
- **Target variable** `loan_status` is numerical (`int64`), where `0` represents non-default and `1` represents default

This indicates that the dataset is **properly typed**, with no immediate data type conversions required before preprocessing.
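
The type check described above can also be automated, so that it fails loudly if the raw file ever changes. The following is a minimal sketch: the column groups come from the list above, while `find_type_problems` and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

NUMERICAL = ["person_age", "person_income", "person_emp_length", "loan_amnt",
             "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length"]
CATEGORICAL = ["person_home_ownership", "loan_intent", "loan_grade",
               "cb_person_default_on_file"]
TARGET = "loan_status"


def find_type_problems(df: pd.DataFrame) -> list:
    """Return a list of messages for columns whose dtype is not as expected."""
    problems = []
    for col in NUMERICAL + [TARGET]:
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col} should be numeric, found {df[col].dtype}")
    for col in CATEGORICAL:
        if is_numeric_dtype(df[col]):
            problems.append(f"{col} should be categorical, found {df[col].dtype}")
    return problems


# Two made-up rows, for illustration only
sample = pd.DataFrame({
    "person_age": [25, 40], "person_income": [50000, 82000],
    "person_emp_length": [2.0, 10.0], "loan_amnt": [5000, 12000],
    "loan_int_rate": [11.5, 7.9], "loan_percent_income": [0.10, 0.15],
    "cb_person_cred_hist_length": [3, 12],
    "person_home_ownership": ["RENT", "OWN"], "loan_intent": ["EDUCATION", "MEDICAL"],
    "loan_grade": ["B", "A"], "cb_person_default_on_file": ["N", "Y"],
    "loan_status": [0, 1],
})
print(find_type_problems(sample))
```

In the notebook, the same function could be called on `df` right after `df.info()`.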


### Missing Values

In [None]:
print("Number of missing values in each column:")

missing_count = df.isna().sum()
missing_percent = (df.isna().sum() / len(df)) * 100

missing_data = pd.DataFrame({
    'Missing Values': missing_count,
    'Percentage': missing_percent.round(2)
})

print(missing_data)

print("\nTotal number of missing values in the dataframe:", df.isna().sum().sum())

* Missing values are limited to two variables:
    - `person_emp_length` has **895 missing values** (2.75% of records)
    - `loan_int_rate` has **3,116 missing values** (9.56% of records)

* All other variables are complete, with no missing entries
* No column is missing more than 10% of its values, so imputation is preferable to dropping the affected rows or columns
* Appropriate imputation strategies (such as median or model-based imputation) should be applied to `person_emp_length` and `loan_int_rate` during preprocessing to preserve dataset integrity
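
As an illustration of the median option mentioned above, here is a minimal sketch on a made-up frame (the values are not from the dataset). `impute_median` is a hypothetical helper. In a real pipeline the medians would be learned on the training split only and then reused on the test split.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 5.0, 3.0],
    "loan_int_rate": [10.0, 12.0, np.nan, 14.0],
})


def impute_median(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with NaNs in `columns` replaced by each column's median."""
    out = df.copy()
    for col in columns:
        # Series.median() skips NaN values by default
        out[col] = out[col].fillna(out[col].median())
    return out


imputed = impute_median(toy, ["person_emp_length", "loan_int_rate"])
print(imputed)
```

The median is used here rather than the mean because it is less sensitive to the extreme values noted elsewhere in this notebook.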

### Summary Statistics

Numerical

In [None]:
df.describe().round(2).T

Categorical

In [None]:
df.describe(include='object').T

In [None]:

cat_cols = df.select_dtypes(include='object').columns

for col in cat_cols:
    print(f"{col}: {df[col].unique()}")

* Summary statistics were generated for all variables:

    - For **numerical features**, metrics such as `mean`, `std`, `min`, and `max` were reviewed to identify potential outliers or inconsistencies
    - For **categorical features**, counts and most frequent categories were inspected to understand variable diversity and dominant groups

* Additionally, we examined the distinct values within each categorical variable to ensure that all entries are reasonable and align with expected categories

* Overall, this provides a first look at both numerical and categorical distributions in the dataset.  
    While most variables fall within expected ranges, several numerical features (particularly `person_age`, `person_emp_length` and `person_income`) have unusually high maximum values. These potential outliers should be examined or treated during preprocessing to ensure model robustness and reliability.  
    The categorical features show no unexpected categories or apparent data entry errors, indicating that they are **clean and ready for encoding** in preprocessing.
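
One common way to turn the "unusually high maximum" observation into a concrete list of rows is the interquartile range (IQR) rule. The sketch below is an assumption about how that could be done, not the method used by this project. `iqr_outlier_mask` is a hypothetical helper and the ages are made up.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up ages, for illustration only
ages = pd.Series([22, 25, 30, 35, 130], name="person_age")
print(ages[iqr_outlier_mask(ages)])
```

For skewed variables such as income, the IQR rule tends to flag many legitimate values, so a domain threshold (for example a maximum plausible age) may be the better choice.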

### Duplicated Entries

In [None]:
duplicates = df.duplicated()
df[df.duplicated(keep=False)].sort_values(by=['person_age','person_income'])

In [None]:
num_duplicates = duplicates.sum()
percent_duplicates = (num_duplicates / len(df)) * 100
print(f"Number of duplicate rows: {num_duplicates} ({percent_duplicates:.2f}%)")

* To ensure data integrity, the dataset was checked for duplicated rows across all features.  
  A total of **165 duplicate rows** were identified, likely because this is an artificially created dataset.  
* As it is highly unlikely for two borrowers to have identical values for all features, these duplicates should be removed.  
  They represent about **0.5% of the dataset** (165 of 32,581 rows), so dropping them will not significantly reduce the data size.
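
The removal itself is not performed in this notebook. A minimal sketch of how it could look during cleaning, shown on a made-up frame:

```python
import pandas as pd

# Made-up rows, for illustration only: the first two are exact duplicates
toy = pd.DataFrame({
    "person_age": [23, 23, 31],
    "person_income": [40000, 40000, 65000],
    "loan_status": [0, 0, 1],
})

# Keep the first occurrence of each duplicated row and rebuild the index
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Removed {len(toy) - len(deduplicated)} duplicate row(s)")
```

With `keep="first"`, one copy of every duplicated row is retained, which matches the count reported by `df.duplicated().sum()` above.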


### Target Variable Exploration

The target variable **`loan_status`** indicates whether a borrower has defaulted on their loan (`1`) or not (`0`).  

The class distribution is examined to understand the balance between default and non-default cases.


In [None]:
print("Distribution of Loan Defaults:")

pd.DataFrame({
    'Count': df['loan_status'].value_counts(),
    'Percentage (%)': round(df['loan_status'].value_counts(normalize=True) * 100, 2)
})

In [None]:
plt.figure(figsize=(5,4))
sns.countplot(x='loan_status', data=df, hue='loan_status', palette='Set2')
plt.title('Target Variable Distribution: loan_status')
plt.ylabel('Count')
plt.legend(labels=['0 = No', '1 = Yes']) 
plt.show()


- The target variable shows a **highly imbalanced** distribution
- This is important because **imbalanced target classes** can bias models toward the majority class. Before training a model, the representation of the minority class should be increased, for example by oversampling, and this should be done on the training set only so that the test set keeps the original class distribution

At this stage, no transformation is applied yet, as the goal is to understand the target before data cleaning and modeling.
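
Although nothing is resampled in this notebook, the oversampling idea mentioned above could later look roughly like the sketch below. It uses plain random oversampling with pandas on a made-up training frame. This is an assumption for illustration: the notebook does not specify which oversampling technique will be used, and `random_oversample` is a hypothetical helper.

```python
import pandas as pd


def random_oversample(train: pd.DataFrame, target: str = "loan_status",
                      random_state: int = 0) -> pd.DataFrame:
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent."""
    counts = train[target].value_counts()
    majority_n = counts.max()
    parts = []
    for cls, n in counts.items():
        cls_rows = train[train[target] == cls]
        if n < majority_n:
            # Sample with replacement to fill the gap to the majority class
            extra = cls_rows.sample(n=majority_n - n, replace=True,
                                    random_state=random_state)
            cls_rows = pd.concat([cls_rows, extra])
        parts.append(cls_rows)
    # Shuffle so that the classes are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Made-up training data, for illustration only
toy_train = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
    "loan_status": [0, 0, 0, 0, 0, 0, 1, 1],
})
balanced = random_oversample(toy_train)
print(balanced["loan_status"].value_counts())
```

Oversampling should only ever be applied to the training split. Resampling before the train/test split would leak copies of the same rows into the test set.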

---

# Push files to Repo

### Collected dataset

In [None]:
file_path = 'outputs/datasets/collection'
variable_to_save = df
filename = "LoanDefaultData.csv"

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(file_path, exist_ok=True)

# Save the dataset as a CSV file for further use
variable_to_save.to_csv(os.path.join(file_path, filename), index=False)

---

# Conclusions and Next Steps

The dataset appears well-structured and mostly complete. Numerical variables are correctly typed, although a few features (`person_age`, `person_emp_length`, `person_income`) exhibit unusually high maximum values, suggesting potential outliers that should be addressed during preprocessing. Categorical variables are stored as objects, and all observed categories are reasonable and consistent with expectations. The class distribution shows that the target variable `loan_status` is imbalanced, which needs to be accounted for before modeling. The duplicate check across all features revealed **165 duplicates** (less than 1% of the data), likely due to the artificial nature of the dataset; these will be removed to maintain data integrity. Missing values exist only in `person_emp_length` and `loan_int_rate` and can be imputed during preprocessing.


Next Steps:
* Conduct exploratory data analysis (EDA): visualize univariate distributions and relationships between features and the target variable, to answer Business Requirement 1