# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/LoanDefaultData.csv


---

# Change working directory

We need to change the working directory from the folder where the notebook is stored to its parent folder.
* First we access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Then we want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"You set a new current directory: {current_dir}")

---

# Fetch data from Kaggle

Install the Kaggle package to fetch the data (uncomment the line below if the package is not installed yet).

In [None]:
#%pip install kaggle==1.7.4.5

To authenticate with Kaggle and download data in this session, your **authentication token (JSON file)** from Kaggle must be placed in the project's root directory, which is now the current working directory.
* If you don't have a token yet, please refer to the [Kaggle Documentation](https://www.kaggle.com/docs/api).


Once you have placed your `kaggle.json` file in the working directory, run the cell below so that the token is recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
#! chmod 600 kaggle.json
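
Before downloading, it can help to confirm that the token file is where the Kaggle client will look for it. The following is a minimal sketch, assuming the token is a JSON object with `username` and `key` entries; the helper name `token_is_valid` is hypothetical and not part of the Kaggle package. It only inspects the file locally and does not contact Kaggle.

```python
import json
import os


def token_is_valid(config_dir):
    """Return True if config_dir holds a kaggle.json with both credential fields.

    Hypothetical helper for illustration; it performs a local check only.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, encoding="utf-8") as fh:
            token = json.load(fh)
    except (json.JSONDecodeError, OSError):
        # Unreadable or malformed file: treat it as an invalid token.
        return False
    # Assumption: a Kaggle API token stores a "username" and a "key" entry.
    return isinstance(token, dict) and all(
        token.get(field) for field in ("username", "key")
    )


# Check the directory that KAGGLE_CONFIG_DIR points to in this notebook.
print(token_is_valid(os.getcwd()))
```

If the check prints `False`, the download cell further down is likely to fail with an authentication error, so it is worth fixing the token location first.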

This project uses the [Loan Default Prediction Dataset](https://www.kaggle.com/datasets/nikhil1e9/loan-default).

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "nikhil1e9/loan-default"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the `kaggle.json` file.

In [None]:
import zipfile
import glob

# Find all zip files in the folder
zip_files = glob.glob(os.path.join(DestinationFolder, "*.zip"))

# Extract each zip file and then delete it
for zip_path in zip_files:
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_path)  # remove the zip after extracting

# Optionally, remove kaggle.json if it exists
kaggle_json = os.path.join(os.getcwd(), "kaggle.json")
if os.path.exists(kaggle_json):
    os.remove(kaggle_json)

print("All ZIP files extracted and deleted.")

---

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Loan_default.csv")
print(df.shape)
df.head()

* The dataset contains 255,347 rows and 18 columns.

### Data Types

In [None]:
df.info()

* The dataset includes 10 numerical variables and 8 categorical ones. All data types are assigned appropriately:

    - **Numerical variables** (e.g., `Age`, `Income`, `LoanAmount`, `CreditScore`, `DTIRatio`, etc.) are stored as either `int64` or `float64`.  
    - **Categorical variables** (e.g., `Education`, `EmploymentType`, `MaritalStatus`, `LoanPurpose`, etc.) are stored as `object` type.  
    - **Target variable** `Default` is stored as `int64` but is categorical in meaning: 0 = non-default and 1 = default.

    This indicates that the dataset is **properly typed** and no immediate conversions are required before preprocessing.
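
The type counts above can also be derived programmatically instead of being read off `df.info()`. The following is a minimal sketch on a made-up toy frame; the helper name `dtype_summary` is hypothetical, and the toy values are not taken from the dataset.

```python
import pandas as pd


def dtype_summary(frame):
    """Count numeric and non-numeric columns in a DataFrame."""
    numeric_cols = frame.select_dtypes(include="number").columns
    non_numeric_cols = frame.select_dtypes(exclude="number").columns
    return {"numeric": len(numeric_cols), "non_numeric": len(non_numeric_cols)}


# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Age": [25, 40],
    "DTIRatio": [0.2, 0.5],
    "Education": ["PhD", "High School"],
    "Default": [0, 1],
})
print(dtype_summary(toy))
```

Note that an integer-coded target such as `Default` is counted as numeric here, which is why it still needs to be treated as categorical by hand.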

### Missing Values

In [None]:
print("Number of NA values in each column:")
print(df.isna().sum())

print("\nTotal number of NA values in the dataframe:", df.isna().sum().sum())

* There are no missing values in the dataset.

### Summary Statistics

In [None]:
df.describe().round(2).T

In [None]:
df.describe(include='object').T

* Using `df.describe()`, summary statistics were generated for all variables.

    - For **numerical features**, metrics such as `mean`, `std`, `min`, and `max` were reviewed to identify potential outliers or inconsistencies.
    - For **categorical features** (via `df.describe(include='object')`), counts and most frequent categories were inspected to understand variable diversity and dominant groups.

    Overall, this provides a comprehensive first look at both numerical and categorical distributions in the dataset.
    The summary statistics indicate that the numerical variables are within reasonable ranges and there are no obvious anomalies (e.g., negative ages or zero income values).  
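
The plausibility check described above can be made explicit rather than read off the `describe()` table. The following is a sketch under assumed thresholds (ages below 0 and incomes below 1 count as anomalies); the helper name `count_range_violations` is hypothetical and the toy values are made up.

```python
import pandas as pd


def count_range_violations(frame, lower_bounds):
    """Return, per column, how many rows fall below the given lower bound."""
    return {
        col: int((frame[col] < lower).sum())
        for col, lower in lower_bounds.items()
    }


# Assumed plausibility thresholds, chosen for illustration only.
lower_bounds = {"Age": 0, "Income": 1}

# Toy frame with one deliberately implausible row.
toy = pd.DataFrame({"Age": [30, -4, 52], "Income": [48000, 0, 61000]})
print(count_range_violations(toy, lower_bounds))
```

Running the same function on `df` should report zero violations for every column if the ranges are indeed reasonable.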

In [None]:
# Select all object columns
cat_cols = df.select_dtypes(include='object').columns

# Print unique values per column
for col in cat_cols:
    print(f"{col}: {df[col].unique()}")

### Duplicated Entries

In [None]:
df[df.duplicated(subset=['LoanID'])]

* To ensure data integrity, the `LoanID` variable was checked for duplicate entries.  
    A total of **0 duplicate IDs** were found, confirming that each loan record is unique.  

* As the variable `LoanID` is a unique identifier for each record, it does not contribute to the prediction and will therefore be **excluded during the data cleaning step** before model training.
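
The uniqueness check and the later removal of the identifier can be sketched as follows. This is an illustration on a made-up toy frame, not the cleaning step itself, which is left to a later notebook.

```python
import pandas as pd

# Toy frame for illustration only; the IDs and values are made up.
toy = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3"],
    "Age": [31, 45, 27],
    "Default": [0, 1, 0],
})

# Series.is_unique is True when no value is repeated.
ids_are_unique = toy["LoanID"].is_unique
n_duplicate_ids = int(toy["LoanID"].duplicated().sum())

# Drop the identifier once it has been confirmed to hold no duplicates.
features = toy.drop(columns=["LoanID"])

print(ids_are_unique, n_duplicate_ids, list(features.columns))
```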

### Target Variable Exploration

The target variable **`Default`** indicates whether a borrower has defaulted on their loan (`1`) or not (`0`).  

The class distribution is examined to understand the balance between default and non-default cases.


In [None]:
print("Distribution of Loan Defaults:")

pd.DataFrame({
    'Count': df['Default'].value_counts(),
    'Percentage (%)': round(df['Default'].value_counts(normalize=True) * 100, 2)
})

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(5,4))
sns.countplot(x='Default', data=df, hue='Default', palette='pastel')
plt.title('Target Variable Distribution: Default')
plt.ylabel('Count')
plt.legend(labels=['0 = No', '1 = Yes']) 
plt.show()



- The target variable shows a **highly imbalanced** distribution.  
- This matters because **imbalanced target classes** can bias models toward the majority class. We will therefore need to oversample the minority class in the training set before fitting a model.

At this stage, no transformation is applied yet, as the goal is to understand the target before data cleaning and modeling.
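
To make the planned oversampling concrete, the following sketch shows plain random oversampling with pandas. The notebook does not state which oversampling technique will be used, so this is only one possible option; the helper name `random_oversample` is hypothetical and the toy values are made up. Resampling should be applied to the training split only, so that duplicated rows do not leak into the test data.

```python
import pandas as pd


def random_oversample(frame, target, random_state=0):
    """Duplicate randomly chosen rows of each smaller class until all
    classes have as many rows as the largest one."""
    counts = frame[target].value_counts()
    largest = int(counts.max())
    parts = []
    for label, count in counts.items():
        group = frame[frame[target] == label]
        if count < largest:
            # Draw the missing rows with replacement from this class.
            extra = group.sample(
                n=largest - int(count), replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


# Toy training split for illustration only; the values are made up.
train = pd.DataFrame({
    "Income": [10, 20, 30, 40, 50],
    "Default": [0, 0, 0, 0, 1],
})
balanced = random_oversample(train, target="Default")
print(balanced["Default"].value_counts().to_dict())
```

Random oversampling only repeats existing rows, so it adds no new information; whether it or a synthetic method is the better fit is a decision for the modeling stage.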

---

# Push files to Repo

In [None]:
import os

file_path = "outputs/datasets/collection"
filename = "LoanDefaultData.csv"

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs(file_path, exist_ok=True)

# Save the dataset as a CSV file for further use
df.to_csv(os.path.join(file_path, filename), index=False)


---

# Conclusions and Next Steps

The dataset appears complete and well-structured: no missing values and no duplicate `LoanID` entries were detected. Numerical variables are correctly typed, and categorical variables are stored as objects. Summary statistics show values within reasonable ranges. The target variable `Default` is highly imbalanced, which must be addressed before modeling. The `LoanID` column is a unique identifier with no predictive value, so it will be removed in the data cleaning step.

Next Steps:
* Begin the data cleaning process: remove irrelevant columns (e.g., LoanID), handle any missing or inconsistent values if found, and encode categorical variables.
* Conduct exploratory data analysis (EDA): visualize relationships between features and the target variable.
* Prepare data for feature engineering and modeling: scaling numerical variables, encoding categorical features, and creating new derived features if needed.