# Exploratory Data Analysis 
First, I explore the dataset to gain insight that will help define the next steps.

In [None]:
import pandas as pd

df = pd.read_csv("data/raw.csv", sep=" ")  # The file uses spaces to separate columns
df.head(5)

In [None]:
print(list(df.columns))

The CSV file has no header row, so pandas used the first data row as column names. To fix this, we pass the column names explicitly when reading the file.

In [None]:
columns_names = ["Status", "Duration", "History", "Purpose", "Amount", "Savings", "Employment", "Installment_Rate", 
                 "Personal_Status", "Other_Debtors", "Residence_Years", "Property", "Age", "Other_Installments", "Housing",
                 "Existing_Credits", "Job", "Num_People_Liable", "Telephone", "Foreign_Worker", "Target"]

df = pd.read_csv("data/raw.csv", sep=" ", header=None, names=columns_names)
print(list(df.columns))
df.head(5)
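
The effect of `header=None` can be seen on a tiny in-memory example. This is only a sketch: the two sample rows and the three column names are hypothetical stand-ins for the real file, chosen to mimic its space-separated layout.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same space-separated layout
# (values are illustrative only, not taken from data/raw.csv).
sample = "A11 6 1169\nA12 48 5951\n"

# Default behaviour: the first data row becomes the header and is lost as data.
with_default = pd.read_csv(io.StringIO(sample), sep=" ")

# header=None keeps every row as data; names= supplies the column labels.
with_names = pd.read_csv(
    io.StringIO(sample),
    sep=" ",
    header=None,
    names=["Status", "Duration", "Amount"],
)

print(list(with_default.columns), len(with_default))
print(list(with_names.columns), len(with_names))
```

Comparing the two row counts is a quick way to confirm that no observation was silently consumed as a header.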

In [None]:
print(df.isnull().sum())

In [None]:
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded = df_encoded.astype(int)
df_encoded.head(5)

In [None]:
print(list(df_encoded.columns))

In [None]:
target_column = "Target"
columns = [col for col in df_encoded.columns if col != target_column]  # All column names except the target
columns.append(target_column)

df_encoded = df_encoded[columns]
print(list(df_encoded.columns))

df_encoded.head(5)

Since the dataset is already quite clean, the pre-processing is done and we can move on to training. First, though, I would like to show some plots that make the data easier to understand.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# First subplot
sns.countplot(x="Target", data=df_encoded, ax=axes[0])
axes[0].set_title("Target Variable Distribution")

# Second subplot
corr_matrix = df_encoded.corr()
sns.heatmap(corr_matrix, cmap="coolwarm", ax=axes[1])
axes[1].set_title("Correlation Heatmap")

plt.tight_layout()
plt.show()

**From the plots above we can infer two main things:**
- First, the dataset is somewhat **unbalanced** towards good clients. We already have a way to handle this: the **cost matrix** that comes with the dataset, which changes the weight of each kind of error in the decision. It will be used later in training.
- Second, the correlation matrix shows that the correlation between almost every pair of variables is close to 0. Because we start with decision trees, this is not expected to be noticeable, although we could check it with the permutation importance of each feature.
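
Both observations can be sketched in code. The block below is a minimal illustration, not the training step itself. It assumes the file is the Statlog German Credit data, whose documentation gives a cost of 5 for classing a bad client as good and 1 for the opposite error, with `Target` coded as 1 (good) and 2 (bad). It uses synthetic data from `make_classification` in place of `df_encoded`, and the tree depth and `n_repeats` values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: cost matrix as documented for the Statlog German Credit data
# (rows = actual class, columns = predicted class; 1 = good, 2 = bad).
COST_MATRIX = np.array([[0, 1],
                        [5, 0]])

# Synthetic stand-in for df_encoded: roughly 70% "good" (1) and 30% "bad" (2).
X, y01 = make_classification(n_samples=1000, n_features=8, n_informative=3,
                             weights=[0.7, 0.3], random_state=0)
y = y01 + 1  # relabel 0/1 as 1/2 to match the assumed Target coding

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Weight each class by the cost of misclassifying one of its members.
class_weight = {1: int(COST_MATRIX[0, 1]), 2: int(COST_MATRIX[1, 0])}
tree = DecisionTreeClassifier(max_depth=4, class_weight=class_weight,
                              random_state=0)
tree.fit(X_train, y_train)


def total_cost(y_true, y_pred):
    """Sum the cost-matrix entry for every (actual, predicted) pair."""
    return int(COST_MATRIX[y_true - 1, y_pred - 1].sum())


cost = total_cost(y_test, tree.predict(X_test))
print("Total misclassification cost on the test split:", cost)

# Permutation importance: drop in score when one feature is shuffled.
result = permutation_importance(tree, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Passing the costs as `class_weight` is only one way to use a cost matrix. Evaluating models with `total_cost` instead of accuracy is another, and the two can be combined.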

In [None]:
df_encoded.to_csv("data/pre_processed.csv", index=False)  # index=False avoids saving the row index as an extra unnamed column