# **Data Collection Notebook**

## Objectives

+ Fetch data from [Kaggle](https://www.kaggle.com/) and save it as raw data.
+ Inspect and save the data under outputs/datasets/collection.

## Inputs

+ Authentication token from Kaggle (JSON file).
+ Kaggle dataset: Loan Default Dataset.

## Outputs

+ Generate dataset: outputs/datasets/collection/LoanDefault.csv

---


## Change working directory

Change working directory from the current one to the parent folder.

In [None]:
import os
current_dir = os.getcwd() # get current directory
current_dir

To make the parent directory the current directory, use `os.path.dirname()` to get the parent path and `os.chdir()` to switch to it.

In [None]:
os.chdir(os.path.dirname(current_dir)) # change directory to parent directory
print("The directory you are in is:", os.getcwd()) # print current directory

Confirm the new current directory.

In [None]:
current_dir = os.getcwd() # get current directory
current_dir
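
Running these cells more than once can move the working directory further up than intended. A minimal sketch of a guard, assuming the project root can be recognised by a marker file; the helper `find_project_root` and the marker name `README.md` are hypothetical and not taken from this notebook:

```python
import os
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Walk upwards from `start` until a folder containing `marker` is found.

    `marker` is a hypothetical file name; use any file that only exists
    in the project root. Returns None if no such folder is found.
    """
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).exists():
            return folder
    return None


root = find_project_root(os.getcwd())
if root is not None:
    os.chdir(root)  # safe to re-run: always lands on the same folder
print("The directory you are in is:", os.getcwd())
```

Because the target folder is searched for rather than computed relative to the current one, the cell gives the same result however many times it runs.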

## Fetch the data from Kaggle

First, install the Kaggle package.

In [None]:
%pip install kaggle

+ An account must be registered on Kaggle to obtain an API Key in the format of a JSON file.

+ To authenticate with the Kaggle API, set the environment variable `KAGGLE_CONFIG_DIR` to the current working directory, where `kaggle.json` is stored. Then use `chmod 600` to limit the file's permissions to read/write for the owner only, which protects the sensitive credentials.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
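
The same permission change can be made without a shell command. A minimal sketch using only the standard library, assuming `kaggle.json` sits in the current working directory; the helper name `restrict_to_owner` is hypothetical:

```python
import os
import stat


def restrict_to_owner(path):
    """Set permissions to owner read/write only (the equivalent of chmod 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so they can be checked
    return stat.S_IMODE(os.stat(path).st_mode)


token_path = "kaggle.json"
if os.path.isfile(token_path):
    os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
    print(oct(restrict_to_owner(token_path)))
else:
    print(f"{token_path} not found in {os.getcwd()}")
```

Checking that the file exists first gives a clearer message than a failed `chmod` when the token has not been uploaded yet.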

+ The dataset used in this project is [Loan Default Dataset](https://www.kaggle.com/datasets/yasserh/loan-default-dataset/data).

+ The dataset path is `yasserh/loan-default-dataset`.

+ Define the Kaggle dataset path and the destination folder, then download the dataset to that folder (`inputs/datasets/raw`).

In [None]:
KaggleDatasetPath = "yasserh/loan-default-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
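
The `!` prefix only works inside IPython or Jupyter. A minimal sketch of the same call from plain Python, assuming the `kaggle` command-line tool is on the `PATH`; the helper names are hypothetical, and the `-d` and `-p` flags are the ones used in the cell above:

```python
import shutil
import subprocess


def build_download_command(dataset, destination):
    """Return the Kaggle CLI call used above as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset, destination):
    """Run the download and return the CLI output, or a message if the CLI is missing."""
    if shutil.which("kaggle") is None:
        return "kaggle CLI not found; install it with pip first"
    # check=False: report a failure instead of raising, e.g. when credentials are missing
    result = subprocess.run(
        build_download_command(dataset, destination),
        capture_output=True, text=True, check=False,
    )
    return result.stdout or result.stderr


# Show the command without running it
print(" ".join(build_download_command("yasserh/loan-default-dataset", "inputs/datasets/raw")))
```

Calling `download_dataset("yasserh/loan-default-dataset", "inputs/datasets/raw")` would then perform the download.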

Unzip the downloaded file, then delete the zip archive and the `kaggle.json` file.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/loan-default-dataset.zip
  inflating: inputs/datasets/raw/Loan_Default.csv  
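
On systems without `unzip` and `rm`, the extraction step can be done with the standard library. A minimal sketch, assuming the archive was downloaded to `inputs/datasets/raw`; the helper name `extract_and_remove_archives` is hypothetical:

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # only reached if extraction succeeded
    return extracted


print(extract_and_remove_archives("inputs/datasets/raw"))
```

This sketch leaves `kaggle.json` in place; removing the token is a separate step, for example with `Path("kaggle.json").unlink()`.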


---

## Load and Inspect the data

Using the pandas library, the dataset can be loaded as a dataframe so that the data can be inspected.

In [9]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/Loan_Default.csv")
df.head()

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
0,24890,2019,cf,Sex Not Available,nopre,type1,p1,l1,nopc,nob/c,...,EXP,758,CIB,25-34,to_inst,98.728814,south,direct,1,45.0
1,24891,2019,cf,Male,nopre,type2,p1,l1,nopc,b/c,...,EQUI,552,EXP,55-64,to_inst,,North,direct,1,
2,24892,2019,cf,Male,pre,type1,p1,l1,nopc,nob/c,...,EXP,834,CIB,35-44,to_inst,80.019685,south,direct,0,46.0
3,24893,2019,cf,Male,nopre,type1,p4,l1,nopc,nob/c,...,EXP,587,CIB,45-54,not_inst,69.3769,North,direct,0,42.0
4,24894,2019,cf,Joint,pre,type1,p1,l1,nopc,nob/c,...,CRIF,602,EXP,25-34,not_inst,91.886544,North,direct,0,39.0


`df.info()` gives a summary of the dataframe: column names, non-null counts and data types.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148670 entries, 0 to 148669
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   ID                         148670 non-null  int64  
 1   year                       148670 non-null  int64  
 2   loan_limit                 145326 non-null  object 
 3   Gender                     148670 non-null  object 
 4   approv_in_adv              147762 non-null  object 
 5   loan_type                  148670 non-null  object 
 6   loan_purpose               148536 non-null  object 
 7   Credit_Worthiness          148670 non-null  object 
 8   open_credit                148670 non-null  object 
 9   business_or_commercial     148670 non-null  object 
 10  loan_amount                148670 non-null  int64  
 11  rate_of_interest           112231 non-null  float64
 12  Interest_rate_spread       112031 non-null  float64
 13  Upfront_charges            10

+ From the summary, we can see that the dataframe contains missing values: the "Non-Null Count" column differs between features.

+ We create and print a list with the columns that contain missing values.

In [15]:
columns_with_nan = df.columns[df.isnull().any()].to_list()
columns_with_nan

['loan_limit',
 'approv_in_adv',
 'loan_purpose',
 'rate_of_interest',
 'Interest_rate_spread',
 'Upfront_charges',
 'term',
 'Neg_ammortization',
 'property_value',
 'income',
 'age',
 'submission_of_application',
 'LTV',
 'dtir1']
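
Knowing how much is missing in each column helps when planning the cleaning step. A minimal sketch, shown on toy data so it runs on its own; the helper name `missing_value_summary` is hypothetical, and in the notebook `df` would be passed instead of `toy`:

```python
import pandas as pd


def missing_value_summary(frame):
    """Count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only; these values are not from the loan dataset
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_value_summary(toy))
```

Columns without missing values are dropped from the summary, and the rest are sorted so the most affected columns appear first.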

+ As the dataset contains an ID column, we must check for duplicates.

In [17]:
df[df.duplicated(subset=["ID"])]

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
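
An empty table is easy to misread, so the duplicate check can also be reduced to a few numbers. A minimal sketch on toy data; the helper name `duplicate_report` is hypothetical, and in the notebook it would be called as `duplicate_report(df, "ID")`:

```python
import pandas as pd


def duplicate_report(frame, key):
    """Summarise repeated `key` values and fully repeated rows."""
    return {
        # rows whose key value already appeared in an earlier row
        "duplicate_keys": int(frame.duplicated(subset=[key]).sum()),
        # rows that repeat an earlier row in every column
        "duplicate_rows": int(frame.duplicated().sum()),
        "key_is_unique": bool(frame[key].is_unique),
    }


# Toy data for illustration only; these values are not from the loan dataset
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "value": [10, 20, 20, 30]})
print(duplicate_report(toy, "ID"))
```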


+ The `Status` variable is already numeric, so it does not need to be converted.

In [None]:
# will I need to convert any columns to a different data type?
# will I need to rename any columns?
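
One way to approach the questions above is to list the non-numeric columns with their number of distinct values. A minimal sketch on toy data; the helper name `describe_text_columns` is hypothetical, and the toy values only imitate a few columns of the dataset:

```python
import pandas as pd


def describe_text_columns(frame):
    """Number of distinct values for each non-numeric column."""
    text_columns = frame.select_dtypes(exclude="number")
    return text_columns.nunique().sort_values()


# Toy data for illustration only
toy = pd.DataFrame({
    "Status": [1, 0, 1],
    "Gender": ["Male", "Joint", "Male"],
    "Region": ["south", "south", "south"],
})
print(describe_text_columns(toy))
print("Status is numeric:", pd.api.types.is_numeric_dtype(toy["Status"]))
```

Text columns with only a handful of distinct values are typical candidates for categorical encoding in a later notebook.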

## Save the dataset

In [18]:
# create the outputs/datasets/collection folder; exist_ok avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

# save the dataframe without the index column
df.to_csv("outputs/datasets/collection/LoanDefault.csv", index=False)
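
After saving, it can be worth confirming that the file reads back as expected. A minimal sketch that writes to a temporary folder so it runs on its own; the helper name `save_and_verify` is hypothetical, and in the notebook it would be called with `df` and the output path:

```python
import os
import tempfile

import pandas as pd


def save_and_verify(frame, path):
    """Write `frame` to CSV and confirm it reloads with the same shape and columns."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == frame.shape and list(reloaded.columns) == list(frame.columns)


# Toy data for illustration only
toy = pd.DataFrame({"ID": [1, 2, 3], "Status": [1, 0, 0]})
with tempfile.TemporaryDirectory() as tmp:
    print(save_and_verify(toy, os.path.join(tmp, "collection", "toy.csv")))
```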

## Conclusion

+ The dataset was downloaded successfully.
+ The dataset was inspected: there are no duplicates in the ID column, but many columns have missing values, which will be handled in another notebook.
+ No data type changes are needed.
+ The dataset was saved in the outputs directory.