# **Data Collection Notebook**

## Objectives

+ Fetch data from [Kaggle](https://www.kaggle.com/) and save it as raw data.
+ Inspect and save the data under outputs/datasets/collection.
+ Gain a deeper understanding of the data using Pandas Profiling and correlation analysis to address Business Requirement 1:
    + The client wants to identify which variables have the strongest correlation with loan defaults.

## Inputs

+ Authentication token from Kaggle (JSON file).
+ Kaggle dataset: Loan Default Dataset.

## Outputs

+ Generate a dataset in the outputs folder.
+ Generate code that answers Business Requirement 1.

---


## Change working directory

Change working directory from the current one to the parent folder.

In [1]:
import os
current_dir = os.getcwd() # get current directory
current_dir

'/Users/maria/CodeInstitute/pp5/jupiter_notebooks'

To make the parent directory the current directory, we use `os.path.dirname()` to get the parent and `os.chdir()` to switch to it.

In [2]:
os.chdir(os.path.dirname(current_dir)) # change directory to parent directory
print("The directory you are in is:", os.getcwd()) # print current directory

The directory you are in is: /Users/maria/CodeInstitute/pp5


Confirm the new current directory.

In [3]:
current_dir = os.getcwd() # get current directory
current_dir

'/Users/maria/CodeInstitute/pp5'
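
Running the directory-change cell a second time would move one level further up. A minimal sketch of a guarded variant is shown below; the helper name `move_to_parent_if_in` is hypothetical, and the folder name passed to it is an assumption taken from the path printed above.

```python
import os


def move_to_parent_if_in(folder_name: str) -> str:
    """Move to the parent directory only if the current folder is `folder_name`.

    Hypothetical helper: it makes the step safe to re-run, because a second
    call finds a different folder name and leaves the directory unchanged.
    """
    current = os.getcwd()
    if os.path.basename(current) == folder_name:
        os.chdir(os.path.dirname(current))
    return os.getcwd()


# Example call, assuming the notebooks live in a folder with this name:
# move_to_parent_if_in("jupiter_notebooks")
```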

## Fetch the data from Kaggle

First, install the Kaggle package.

In [4]:
%pip install kaggle

Collecting kaggle
  Obtaining dependency information for kaggle from https://files.pythonhosted.org/packages/75/f6/cccedb8db42beac1c5d7cfa57d5b800b90ef91118833972af3f7b5e159c8/kaggle-1.7.4.2-py3-none-any.whl.metadata
  Downloading kaggle-1.7.4.2-py3-none-any.whl.metadata (16 kB)
Collecting protobuf (from kaggle)
  Obtaining dependency information for protobuf from https://files.pythonhosted.org/packages/8e/66/7f3b121f59097c93267e7f497f10e52ced7161b38295137a12a266b6c149/protobuf-6.30.2-cp39-abi3-macosx_10_9_universal2.whl.metadata
  Downloading protobuf-6.30.2-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)
Collecting python-slugify (from kaggle)
  Obtaining dependency information for python-slugify from https://files.pythonhosted.org/packages/a4/62/02da182e544a51a5c3ccf4b03ab79df279f9c60c5e82d5e8bec7ca26ac11/python_slugify-8.0.4-py2.py3-none-any.whl.metadata
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode (from kaggle)
  Obt

+ An account must be registered on Kaggle to obtain an API Key in the format of a JSON file.

+ To authenticate with the Kaggle API, set the environment variable `KAGGLE_CONFIG_DIR` to the current working directory, where `kaggle.json` is expected. Then use `chmod 600` to limit the file's permissions to read/write for the owner only, which protects the credentials it contains.

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: kaggle.json: No such file or directory
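
The `chmod` output above shows that `kaggle.json` was not in the working directory when the cell ran. The following is a minimal sketch of a check that could run before calling the Kaggle CLI. The function name `prepare_kaggle_token` is hypothetical, and it assumes the token file holds the `username` and `key` fields.

```python
import json
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Validate kaggle.json in `config_dir`, restrict it to the owner, and
    point KAGGLE_CONFIG_DIR at that folder. Hypothetical helper."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    with open(token_path, encoding="utf-8") as fh:
        token = json.load(fh)

    # Assumption: the token needs non-empty "username" and "key" entries.
    missing = [field for field in ("username", "key") if not token.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")

    # Same effect as `chmod 600`: read/write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` fails early with a readable message instead of a traceback from inside the Kaggle client.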


+ The dataset used in this project is [Loan Default Dataset](https://www.kaggle.com/datasets/yasserh/loan-default-dataset/data).

+ The dataset path used by the Kaggle API is `yasserh/loan-default-dataset`.

+ Define the Kaggle dataset and destination folder and download it to the folder (inputs/datasets/raw).

In [6]:
KaggleDatasetPath = "yasserh/loan-default-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/Users/maria/CodeInstitute/pp5/venv/bin/kaggle", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/maria/CodeInstitute/pp5/venv/lib/python3.11/site-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/maria/CodeInstitute/pp5/venv/lib/python3.11/site-packages/kaggle/api/kaggle_api_extended.py", line 1734, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/maria/CodeInstitute/pp5/venv/lib/python3.11/site-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'


Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

zsh:1: no matches found: inputs/datasets/raw/*.zip
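
The shell command above depends on the shell expanding `*.zip`, and it stops with an error when no archive is present. As a rough alternative, the same step can be sketched with Python's standard library. The helper name `extract_and_remove_zips` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list[str]:
    """Extract every .zip archive in `folder` into that folder, then delete
    the archive. Returns the names of the extracted members.

    Hypothetical helper: with no archives present it returns an empty list
    instead of failing.
    """
    extracted: list[str] = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive once its contents are out
    return extracted
```

An empty return value signals that the download step did not produce an archive, which is easier to act on than a shell glob error.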


---

## Load and Inspect the data

Using the pandas library, the dataset can be loaded as a dataframe so that the data can be inspected.

In [8]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/Loan_Default.csv")
df.head()

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
0,24890,2019,cf,Sex Not Available,nopre,type1,p1,l1,nopc,nob/c,...,EXP,758,CIB,25-34,to_inst,98.728814,south,direct,1,45.0
1,24891,2019,cf,Male,nopre,type2,p1,l1,nopc,b/c,...,EQUI,552,EXP,55-64,to_inst,,North,direct,1,
2,24892,2019,cf,Male,pre,type1,p1,l1,nopc,nob/c,...,EXP,834,CIB,35-44,to_inst,80.019685,south,direct,0,46.0
3,24893,2019,cf,Male,nopre,type1,p4,l1,nopc,nob/c,...,EXP,587,CIB,45-54,not_inst,69.3769,North,direct,0,42.0
4,24894,2019,cf,Joint,pre,type1,p1,l1,nopc,nob/c,...,CRIF,602,EXP,25-34,not_inst,91.886544,North,direct,0,39.0


A dataframe summary can be obtained.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148670 entries, 0 to 148669
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   ID                         148670 non-null  int64  
 1   year                       148670 non-null  int64  
 2   loan_limit                 145326 non-null  object 
 3   Gender                     148670 non-null  object 
 4   approv_in_adv              147762 non-null  object 
 5   loan_type                  148670 non-null  object 
 6   loan_purpose               148536 non-null  object 
 7   Credit_Worthiness          148670 non-null  object 
 8   open_credit                148670 non-null  object 
 9   business_or_commercial     148670 non-null  object 
 10  loan_amount                148670 non-null  int64  
 11  rate_of_interest           112231 non-null  float64
 12  Interest_rate_spread       112031 non-null  float64
 13  Upfront_charges            10

+ From the summary we can see that there are missing values in the dataframe, as the "Non-Null Count" column shows different values for different features.

+ We create and print a list with the columns that contain missing values.

In [10]:
columns_with_nan = df.columns[df.isnull().any()].to_list()
columns_with_nan

['loan_limit',
 'approv_in_adv',
 'loan_purpose',
 'rate_of_interest',
 'Interest_rate_spread',
 'Upfront_charges',
 'term',
 'Neg_ammortization',
 'property_value',
 'income',
 'age',
 'submission_of_application',
 'LTV',
 'dtir1']
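
Beyond listing the affected columns, it can help to see how much of each column is missing. The sketch below runs on a small made-up dataframe rather than the real dataset, and the helper name `missing_value_summary` is hypothetical.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column
    that has at least one missing value, largest count first."""
    missing_count = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
    })
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy data for illustration only; the values are not taken from the dataset.
toy = pd.DataFrame({
    "loan_limit": ["cf", None, "cf", "ncf"],
    "LTV": [98.7, None, 80.0, None],
    "Status": [1, 1, 0, 0],
})
print(missing_value_summary(toy))
```

On the real dataframe, `missing_value_summary(df)` would rank the columns in `columns_with_nan` by how many values they lack.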

+ As the dataset contains an ID column, we must check for duplicates.

In [11]:
df[df.duplicated(subset=["ID"])]

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
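
An empty result is easy to overlook in a notebook, so the duplicate check can also be expressed as a number. This is a small sketch on made-up data; the helper name `count_duplicate_ids` is hypothetical.

```python
import pandas as pd


def count_duplicate_ids(df: pd.DataFrame, id_column: str = "ID") -> int:
    """Count rows whose identifier repeats an earlier row's identifier."""
    return int(df.duplicated(subset=[id_column]).sum())


# Toy data for illustration only.
toy_ids = pd.DataFrame({"ID": [24890, 24891, 24891, 24893]})
print(count_duplicate_ids(toy_ids))  # one repeated ID in the toy data
```

A result of `0` for the real dataframe would match the empty table shown above.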


+ The `Status` variable is already numerical, so it does not need to be converted.

In [12]:
# will I need to convert any columns to a different data type?
# will I need to rename any columns?

## Save the data set

In [13]:
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except FileExistsError as e:
  print(e) # the folder already exists, so there is nothing to create

df.to_csv("outputs/datasets/collection/LoanDefault.csv", index=False) # save without the index column

[Errno 17] File exists: 'outputs/datasets/collection'
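
The message above appears because the folder was created on an earlier run. One possible alternative, sketched here with `pathlib`, creates the folder only when it is missing and stays silent otherwise. The helper name `ensure_output_dir` is hypothetical.

```python
from pathlib import Path


def ensure_output_dir(path: str) -> Path:
    """Create `path` (and any missing parent folders) if it does not exist
    yet, and return it as a Path. Safe to call on every run."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Example call, using the folder from this notebook:
# out_dir = ensure_output_dir("outputs/datasets/collection")
# df.to_csv(out_dir / "LoanDefault.csv", index=False)
```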


## Data Exploration

We will use pandas profiling (the `ydata_profiling` package) to explore the dataset, identify missing values, and analyze the data type and distribution of each variable.

In [14]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 34/34 [00:00<00:00, 51.75it/s]


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
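
Business Requirement 1 asks which variables correlate most strongly with loan defaults. The following is a rough sketch of one way to rank them, not the notebook's own method. It assumes one-hot encoding of the categorical columns and Pearson correlation against `Status`, it runs on made-up data, and the helper name `correlation_with_target` is hypothetical.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "Status",
                            method: str = "pearson") -> pd.Series:
    """Correlate every (one-hot encoded) column with `target` and return the
    coefficients ordered by absolute strength, strongest first."""
    # Categorical columns become 0/1 indicator columns; numeric ones pass through.
    encoded = pd.get_dummies(df, dtype=float)
    corr = encoded.corr(method=method)[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data for illustration only; the values are not results from the dataset.
toy_loans = pd.DataFrame({
    "Credit_Score": [758, 552, 834, 587, 602, 700],
    "Region": ["south", "North", "south", "North", "North", "south"],
    "Status": [1, 1, 0, 0, 0, 1],
})
print(correlation_with_target(toy_loans))
```

Ordering by absolute value keeps strong negative relationships near the top alongside strong positive ones. On the real data, identifier columns such as `ID` would need to be dropped first.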