# 1. Understand the dataset
## 1.1 Importing the CSV File
First, we imported the CSV file into a Python environment with the pandas library to explore the structure of the dataset. The following code loads the file and gives an initial overview of the data:

In [1]:
# Import library
import pandas as pd

In [None]:
# Load dataset
df = pd.read_csv("Dataset/loan_dataset.csv")

In [6]:
# Show the columns
df.columns

Index(['person_age', 'person_gender', 'person_education', 'person_income',
       'person_emp_exp', 'person_home_ownership', 'loan_amnt', 'loan_intent',
       'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length',
       'credit_score', 'previous_loan_defaults_on_file', 'loan_status'],
      dtype='object')

In [7]:
# See the shape
df.shape

(45000, 14)

In [4]:
# Count the records in each loan_status class
df["loan_status"].value_counts()

loan_status
0    35000
1    10000
Name: count, dtype: int64

In [9]:
# Preview the first five rows of the dataset
df.head()

| | person_age | person_gender | person_education | person_income | person_emp_exp | person_home_ownership | loan_amnt | loan_intent | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | credit_score | previous_loan_defaults_on_file | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | female | Master | 71948 | 0 | RENT | 35000 | PERSONAL | 16.02 | 0.49 | 3 | 561 | No | 1 |
| 1 | 21 | female | High School | 12282 | 0 | OWN | 1000 | EDUCATION | 11.14 | 0.08 | 2 | 504 | Yes | 0 |
| 2 | 25 | female | High School | 12438 | 3 | MORTGAGE | 5500 | MEDICAL | 12.87 | 0.44 | 3 | 635 | No | 1 |
| 3 | 23 | female | Bachelor | 79753 | 0 | RENT | 35000 | MEDICAL | 15.23 | 0.44 | 2 | 675 | No | 1 |
| 4 | 24 | male | Master | 66135 | 1 | RENT | 35000 | MEDICAL | 14.27 | 0.53 | 4 | 586 | No | 1 |


These results give us an initial view of the content and diversity of the data. The dataset contains 45,000 records spread across 14 columns. Each column represents a specific attribute, ranging from personal information (age, gender, education level) to loan characteristics and credit history.
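
The `loan_status` counts shown above (35,000 records with value 0 and 10,000 with value 1) can also be read as proportions. The following is a minimal sketch that rebuilds the column from those reported counts rather than loading the CSV; the variable names `loan_status` and `class_share` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Rebuild the target column from the counts reported in the notebook output
# (35,000 zeros and 10,000 ones) so the example runs without the CSV file.
loan_status = pd.Series([0] * 35000 + [1] * 10000, name="loan_status")

# normalize=True returns the share of each class instead of the raw count
class_share = loan_status.value_counts(normalize=True)
print(class_share.round(3))
```

On the real DataFrame, the equivalent call would be `df["loan_status"].value_counts(normalize=True)`.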

## 1.2 Checking for Null Values and Duplicate Data
The quality of the analyses depends on the completeness and uniqueness of the data. We therefore checked the dataset for null values and duplicates.

### Checking for Null Values
The following code counts the missing values in each column:

In [10]:
# Checking for null values
df.isnull().sum()

person_age                        0
person_gender                     0
person_education                  0
person_income                     0
person_emp_exp                    0
person_home_ownership             0
loan_amnt                         0
loan_intent                       0
loan_int_rate                     0
loan_percent_income               0
cb_person_cred_hist_length        0
credit_score                      0
previous_loan_defaults_on_file    0
loan_status                       0
dtype: int64

Conclusion: No column contains missing values, so no record is incomplete and the analysis is not biased by missing data.

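
To illustrate what the null check reports when values are missing, here is a small sketch on a toy frame. The rows are hypothetical (one age is deliberately left empty) and the names `toy`, `missing_count`, `missing_share`, and `has_missing` are assumptions made for this example.

```python
import pandas as pd

# Toy frame with one deliberately missing age (hypothetical data, not from the dataset)
toy = pd.DataFrame({
    "person_age": [22, 21, None, 23],
    "loan_intent": ["PERSONAL", "EDUCATION", "MEDICAL", "MEDICAL"],
})

missing_count = toy.isnull().sum()              # missing values per column
missing_share = toy.isnull().mean()             # fraction of rows missing per column
has_missing = bool(toy.isnull().values.any())   # True if any cell is missing

print(missing_count)
print(missing_share)
print(has_missing)
```

For the loan dataset, the all-zero output of `df.isnull().sum()` corresponds to `has_missing` being `False`.
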
### Duplicate Check
To check for fully duplicated rows, we ran:

In [11]:
# Identification of duplicates
df.duplicated().sum()

0

Conclusion: The dataset is free of duplicates, ensuring data quality and integrity for subsequent analyses.
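
Had the count been non-zero, the duplicated rows could be inspected and removed along the following lines. This is a sketch on hypothetical toy data, not a step the original notebook needed; `toy`, `all_copies`, and `deduplicated` are illustrative names.

```python
import pandas as pd

# Toy frame in which the third row repeats the first (hypothetical data)
toy = pd.DataFrame({
    "person_age": [22, 21, 22],
    "loan_amnt": [35000, 1000, 35000],
})

# duplicated() flags rows that repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every copy, which is useful for inspecting the duplicates
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(all_copies)
print(deduplicated)
```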

## 1.3 Data Transformation and Typing
To preserve data integrity when importing into SQL Server, we checked the data type of every column:

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45000 entries, 0 to 44999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   person_age                      45000 non-null  int64  
 1   person_gender                   45000 non-null  object 
 2   person_education                45000 non-null  object 
 3   person_income                   45000 non-null  int64  
 4   person_emp_exp                  45000 non-null  int64  
 5   person_home_ownership           45000 non-null  object 
 6   loan_amnt                       45000 non-null  int64  
 7   loan_intent                     45000 non-null  object 
 8   loan_int_rate                   45000 non-null  float64
 9   loan_percent_income             45000 non-null  float64
 10  cb_person_cred_hist_length      45000 non-null  int64  
 11  credit_score                    45000 non-null  int64  
 12  previous_loan_defaults_on_file  
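
One way to connect the pandas dtypes above to a SQL Server import is to derive a candidate column type from each dtype. The sketch below is only an illustration under stated assumptions: the helper `suggest_sql_types` is hypothetical, and the mapping (integers to `BIGINT`, floats to `FLOAT`, everything else to `NVARCHAR` sized to the longest value) is one possible choice, not the mapping used in the original project. The sample values are taken from the first three rows shown by `df.head()`.

```python
import pandas as pd


def suggest_sql_types(frame: pd.DataFrame) -> dict:
    """Suggest a SQL Server type per column (hypothetical helper, illustrative mapping)."""
    suggestions = {}
    for name, dtype in frame.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            suggestions[name] = "BIGINT"
        elif pd.api.types.is_float_dtype(dtype):
            suggestions[name] = "FLOAT"
        else:
            # Size text columns to the longest value observed in the data
            max_len = int(frame[name].astype(str).str.len().max())
            suggestions[name] = f"NVARCHAR({max_len})"
    return suggestions


# Values from the first three rows of the df.head() output
sample = pd.DataFrame({
    "person_age": [22, 21, 25],
    "loan_int_rate": [16.02, 11.14, 12.87],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

sql_types = suggest_sql_types(sample)
print(sql_types)
```

In practice, narrower types (for example `INT` or `DECIMAL` with a fixed scale) may suit the target schema better; the choice depends on the value ranges and on how the table will be queried.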

# 2. Conclusion
The first phase of exploration confirmed the quality of the HSBC UK dataset. The key checks are summarized below.
## 2.1 Integrity Validation
No incomplete records were identified during the null value check. This ensures that the analysis is performed on a complete dataset, without bias induced by missing data.
## 2.2 Type Check and Correction
The *df.info()* check confirmed that numeric columns are stored as int64 or float64 and categorical columns as object.
## 2.3 Duplicate Analysis
The expression *df.duplicated().sum()* returned 0, confirming that the dataset contains no duplicated rows. This keeps aggregations reliable and avoids double counting during analysis.