# Exploratory data analysis of Credit Risk data

### Context 

The dataset contains 1000 entries with 10 categorical/symbolic attributes. Each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes.

[www.kaggle.com/datasets/uciml/german-credit/data](https://www.kaggle.com/datasets/uciml/german-credit/data)

### Attributes

1. Age (numeric)
2. Sex (text: male, female)
3. Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
4. Housing (text: own, rent, or free)
5. Saving accounts (text: little, moderate, quite rich, rich)
6. Checking account (numeric, in DM - Deutsche Mark)
7. Credit amount (numeric, in DM)
8. Duration (numeric, in months)
9. Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
10. Risk (text: good, bad)

### Purposes

The following hypotheses need to be tested:

1. Each of the attributes individually affects credit risk
2. There is an interaction between various attributes and they jointly affect credit risk

### Initial data inspection

First, I load the data from the file, look at the dataset size, data types and several lines of the dataset, identify any obvious inconsistencies and draw conclusions for the next stage of data preparation:

In [1]:
# Importing the necessary libraries
import pandas as pd

In [2]:
# Loading data
df = pd.read_csv('german_credit_data.csv')

In [3]:
# Dataset size
num_rows, num_columns = df.shape
print(f"\nNumber of lines: {num_rows}")
print(f"Number of columns: {num_columns}")


Number of lines: 1000
Number of columns: 11


The number of rows is 1000, as stated, but there are 11 columns rather than the stated 10, so the extra column needs to be checked:

In [4]:
# View first 10 lines
print("First 10 lines of the dataset:\n")
print(df.head(10))

First 10 lines of the dataset:

   Unnamed: 0  Age     Sex  Job Housing Saving accounts Checking account  \
0           0   67    male    2     own             NaN           little   
1           1   22  female    2     own          little         moderate   
2           2   49    male    1     own          little              NaN   
3           3   45    male    2    free          little           little   
4           4   53    male    2    free          little           little   
5           5   35    male    1    free             NaN              NaN   
6           6   53    male    2     own      quite rich              NaN   
7           7   35    male    3    rent          little         moderate   
8           8   61    male    1     own            rich              NaN   
9           9   28    male    3     own          little         moderate   

   Credit amount  Duration              Purpose  Risk  
0           1169         6             radio/TV  good  
1           5951   

Obvious inconsistencies:

1. There is an extra column “Unnamed”, presumably the entry identifier; I need to check whether it is needed
2. The “Checking account” values are text, not numeric as declared; I need to find out where the error is
3. There are missing values in “Saving accounts” and “Checking account”; I need to find the percentage of missing data and decide how to handle it

Dealing with the column “Unnamed”:

In [5]:
# Viewing the number of unique values
unique_count = df.iloc[:, 0].nunique()
print(f"Number of unique values in column “Unnamed”: {unique_count}")

Number of unique values in column “Unnamed”: 1000


In [8]:
# Checking the minimum and maximum value
min_value = df.iloc[:, 0].min()
max_value = df.iloc[:, 0].max()

print(f"Minimum value: {min_value}")
print(f"Maximum value: {max_value}")

Minimum value: 0
Maximum value: 999


Most likely “Unnamed” is the entry identifier: it holds 1000 unique values running from 0 to 999, so it is not needed for the analysis.
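A minimal sketch of how this conclusion could be checked and applied. The `toy` frame is a made-up stand-in for `df`, and the rule "drop the column if it simply repeats the row position" is my assumption, not a step that was run above:

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Age": [67, 22, 49, 45],
    "Risk": ["good", "bad", "good", "good"],
})

# The first column is redundant if it only repeats the row position 0..n-1
is_row_index = toy["Unnamed: 0"].tolist() == list(range(len(toy)))
print(f"First column repeats the row position: {is_row_index}")

# Drop the column only when the check passes
if is_row_index:
    toy = toy.drop(columns="Unnamed: 0")

print(toy.columns.tolist())
```

On the real data the same check would be applied to `df`. Another option would be to pass `index_col=0` to `pd.read_csv` so the column becomes the index at load time.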

Dealing with the column “Checking account”:

In [11]:
# View unique values
unique_values = df['Checking account'].unique()
print(f"Values in Checking account: {unique_values}")

Values in Checking account: ['little' 'moderate' nan 'rich']


The data description is wrong here: “Checking account” is a categorical attribute with the levels little, moderate and rich, not a numeric amount in DM. This has to be taken into account in the rest of the study.

Dealing with empty values in “Saving accounts” and “Checking account”. I check the percentage of missing data, and I do this check for all attributes at once:

In [21]:
# Viewing missing data
for attribute in df.columns:
    null_percentage = df[attribute].isnull().mean() * 100
    print(f"Percentage of empty data in {attribute}: {null_percentage}%")

Percentage of empty data in Unnamed: 0: 0.0%
Percentage of empty data in Age: 0.0%
Percentage of empty data in Sex: 0.0%
Percentage of empty data in Job: 0.0%
Percentage of empty data in Housing: 0.0%
Percentage of empty data in Saving accounts: 18.3%
Percentage of empty data in Checking account: 39.4%
Percentage of empty data in Credit amount: 0.0%
Percentage of empty data in Duration: 0.0%
Percentage of empty data in Purpose: 0.0%
Percentage of empty data in Risk: 0.0%


With 39.4% of values missing, it can be assumed that a significant share of customers either have no checking account or their account information was not collected. Such a high share of missing values in “Checking account” can significantly affect the analysis results, so potential bias due to missing values should be taken into account when interpreting them.

The share of missing data in “Saving accounts” is lower, 18.3%, so savings account information is available for most customers.

In any case, I need to decide how to handle the missing data before the next stage.
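One way to probe the potential bias is to compare the risk split between rows with and without a “Checking account” value. The sketch below uses a made-up `toy` frame instead of `df`. The `"unknown"` label is a hypothetical placeholder that shows one possible treatment, not a decision made in this analysis:

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Checking account": ["little", None, "moderate", None, "little", None],
    "Risk": ["bad", "good", "good", "good", "bad", "good"],
})

# Label each row by whether the checking-account value is present
presence = (
    toy["Checking account"]
    .isna()
    .map({True: "missing", False: "present"})
    .rename("Checking account value")
)

# Share of each risk class within the "missing" and "present" groups
risk_by_presence = pd.crosstab(presence, toy["Risk"], normalize="index")
print(risk_by_presence)

# One possible treatment: keep the rows and mark the gap explicitly
filled = toy["Checking account"].fillna("unknown")
print(filled.value_counts())
```

If the risk split differs noticeably between the two groups, the missingness itself carries information, and keeping it as a separate level may be safer than imputing the most frequent value.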

There are no missing values in any other field. Next, I check for duplicates:

In [38]:
# Check for duplicates in all columns except the first
columns_to_check = df.columns[1:]
duplicates = df.duplicated(subset=columns_to_check)
print(f'Number of duplicates: {duplicates.sum()}')

Number of duplicates: 0


In [39]:
# Checking data types
print("\nData types in each column:")
print(df.dtypes)


Data types in each column:
Unnamed: 0           int64
Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object


The data types of “Sex”, “Housing”, “Saving accounts”, “Checking account”, “Purpose” and “Risk” need to be changed to "category".
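A minimal sketch of such a conversion, again on a made-up `toy` frame rather than `df`. Treating the “Saving accounts” levels as ordered from little to rich is my assumption about their meaning, since the data description does not state it:

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Saving accounts": ["little", None, "quite rich"],
    "Risk": ["good", "bad", "good"],
})

# Plain (unordered) categories for attributes without a natural ranking
for column in ["Sex", "Risk"]:
    toy[column] = toy[column].astype("category")

# Ordered categories; the ranking little < ... < rich is an assumption
savings_levels = pd.CategoricalDtype(
    categories=["little", "moderate", "quite rich", "rich"], ordered=True
)
toy["Saving accounts"] = toy["Saving accounts"].astype(savings_levels)

print(toy.dtypes)
```

Missing values stay missing after the conversion, so this step can be done before or after deciding how to fill them.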


So, the data cleaning stage needs to:

1. remove the first column
2. correct the data types
3. fill in the empty values