## 0) Import Libraries and the Dataset

Setup your python environment and download the credit-score data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [3]:
credit_card_df = pd.read_csv('Dataset/train_biased.csv')

## 1) Understanding the dataset



### 1.1) General overview of the dataset

Using the function `.head()`, we can extract from the dataset the first x rows (x=5 by default, but we can change this number). They are not useful for analytic purposes, but still enable us to visualize the dataset to get a general overview of the data. 


In [22]:
# Head of the DataFrame
print("Head of the DataFrame:")
print(credit_card_df.head())
print("\n")

# Identify Number of Columns and Rows
n_rows = len(credit_card_df)
n_columns = len(credit_card_df.columns)
print(f"The dataset contains {n_columns} columns")
print(f"The dataset contains {n_rows} rows")


Head of the DataFrame:
       ID Customer_ID     Month           Name    City         Street    Age  \
0     NaN   CUS_0xd40       NaN  Aaron Maashoh  Lonton  Oxford Street    NaN   
1  0x1603   CUS_0xd40  February  Aaron Maashoh  Lonton  Oxford Street   23.0   
2  0x1604   CUS_0xd40       NaN            NaN  Lonton  Oxford Street -500.0   
3  0x1605   CUS_0xd40     April  Aaron Maashoh  Lonton  Oxford Street   23.0   
4     NaN   CUS_0xd40       May  Aaron Maashoh  Lonton  Oxford Street   23.0   

           SSN Occupation Annual_Income  ...  Num_of_Delayed_Payment  \
0  821-00-0265    Manager      19114.12  ...                       7   
1  821-00-0265    Manager      19114.12  ...                     NaN   
2  821-00-0265    Manager      19114.12  ...                       7   
3  821-00-0265    Manager      19114.12  ...                       4   
4  821-00-0265    Manager      19114.12  ...                     NaN   

   Changed_Credit_Limit  Num_Credit_Inquiries  Credit_Mix Outst

We continued our first and superficial analysis of the dataset by retrieving the names of the columns and their data types. This is done using the function `.dtypes`, in order to understand the nature of the data we are dealing with and to check if there are any inconsistencies in the dataset (e.g. a column that should be numeric but is stored as a string). 
In addition, we can also check the number of missing values in each column using the function `.isnull().sum()`. This is useful to understand if we need to impute the dataset before proceeding with the analysis. 
Finally, we can also check the number of unique values in each column using the function `.nunique()`, which is relevant to discover the categorical variables in the dataset.

In [23]:
# Create a DataFrame with column data types and number of unique values
column_summary = pd.DataFrame({
    "Data Type": credit_card_df.dtypes,
    "Unique Values": credit_card_df.nunique(),
    "Missing Values": credit_card_df.isnull().sum(),
})

# Display the combined summary
print(column_summary)

                         Data Type  Unique Values  Missing Values
ID                          object          90058            9942
Customer_ID                 object          12500            9893
Month                       object              8            9938
Name                        object          10139           18887
City                        object              4            9851
Street                      object             31            9920
Age                        float64           1501           14488
SSN                         object          12501           10001
Occupation                  object             21            9944
Annual_Income               object          18438           10094
Monthly_Inhand_Salary      float64          13204           23595
Num_Bank_Accounts          float64            879            9833
Num_Credit_Card            float64           1132           10062
Interest_Rate              float64           1597            9849
Num_of_Loa

Many columns, such as **Annual Income**, are classified as categorical. Upon inspecting the dataset, we observed that several of these columns contain special characters like `"_ \@ $"`, which cause the values to be treated as objects instead of numbers. To address this, we replace these special characters with an empty string `''`. This process may introduce new missing values, but it is necessary for accurate analysis. A cell containing only special characters provides no meaningful information and can be reasonably treated as a missing value.

In [24]:
def simple_clean_numeric(df, cols):
    
    """
    Convert specified columns to numeric by removing any non-digit characters 
    (except for period and minus sign) and then converting to numeric.
    
    Parameters:
      df: pandas DataFrame
      cols: List of column names to clean and convert
      
    Returns:
      A new DataFrame with the specified columns cleaned and converted to numeric.
    """
    
    df_clean = df.copy()
    for col in cols:
        if col in df_clean.columns:
            # Convert to string and remove all characters except digits, period, and minus sign.
            cleaned = df_clean[col].astype(str).str.replace(r'[^0-9\.-]', '', regex=True)
            # Convert the cleaned string to numeric (float); unconvertible values become NaN.
            df_clean[col] = pd.to_numeric(cleaned, errors='coerce')
    return df_clean

# List of columns that should be numeric
should_be_numeric = [
    'Annual_Income', 
    'Num_of_Loan', 
    'Num_of_Delayed_Payment', 
    'Changed_Credit_Limit', 
    'Outstanding_Debt', 
    'Amount_invested_monthly'
]

# Clean and convert the data
credit_card_df = simple_clean_numeric(credit_card_df, should_be_numeric)

# Verify that the columns are now numeric
print(credit_card_df[should_be_numeric].dtypes)


Annual_Income              float64
Num_of_Loan                float64
Num_of_Delayed_Payment     float64
Changed_Credit_Limit       float64
Outstanding_Debt           float64
Amount_invested_monthly    float64
dtype: object


After converting the columns to numeric, we can correctly check the data types of the columns again using the function `.dtypes`. This allows us to verify that the columns are now in the appropriate format for analysis, as well as to split the dataset into categorical and numerical variables. 

In [25]:
##Selecting numerical features
numerical_data = credit_card_df.select_dtypes(include='number')
#append the features of numerical_data to list
numerical_features = numerical_data.columns.tolist()

#Selecting categoricalfeatures
categorical_data = credit_card_df.select_dtypes(include= 'object')
#append the features of categorical_data to list
categorical_features = categorical_data.columns.tolist()

print(f'There are {len(numerical_features)} numerical features:')
print(numerical_features)
print('\n')
print(f'There are {len(categorical_features)} categorical features:')
print(categorical_features)

There are 15 numerical features:
['Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries', 'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Amount_invested_monthly', 'Credit_Score']


There are 12 categorical features:
['ID', 'Customer_ID', 'Month', 'Name', 'City', 'Street', 'SSN', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Credit_History_Age', 'Payment_of_Min_Amount']


Now that we have all the numerical features in the correct format, we can proceed with a further analysis of the dataset. We can check the summary statistics of the dataset using the function `.describe()`, which provides useful information such as the mean, standard deviation, minimum and maximum values, and quartiles for each numerical column. This is useful to understand the distribution of the data and to identify any potential outliers.

In [26]:
credit_card_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,85512.0,110.226845,684.9076,-500.0,24.0,33.0,42.0,8698.0
Annual_Income,89906.0,176544.79559,1429013.0,7005.93,19453.05,37579.75,72760.04,24198060.0
Monthly_Inhand_Salary,76405.0,4185.789272,3178.561,303.645417,1625.02375,3086.683333,5940.3175,15204.63
Num_Bank_Accounts,90167.0,17.105172,117.7282,-1.0,3.0,6.0,7.0,1798.0
Num_Credit_Card,89938.0,22.44941,129.0372,0.0,4.0,5.0,7.0,1499.0
Interest_Rate,90151.0,72.343923,465.6848,1.0,8.0,13.0,20.0,5797.0
Num_of_Loan,89809.0,3.065116,63.19337,-100.0,1.0,3.0,5.0,1496.0
Delay_from_due_date,90012.0,21.081156,14.86271,-5.0,10.0,18.0,28.0,67.0
Num_of_Delayed_Payment,83782.0,31.041369,226.3703,-3.0,9.0,14.0,18.0,4388.0
Changed_Credit_Limit,88058.0,10.389097,6.782829,-6.49,5.34,9.41,14.86,36.97


We can instantly see that the dataset contains some strange values. As an example, the column **Age** has both negative (minimun value = -500) and very high values (maximum value = 8698). 
It is also possible to note that the target variable **Credit Score** score is more a categorical variable than a continuous one, as it has only 4 unique values. As a consequence, we will convert the target variable into a categorical variable.

In [27]:
# Convert target variable to object type
credit_card_df['Credit_Score'] = credit_card_df['Credit_Score'].astype('object')

# Display the summary statistics for categorical features
credit_card_df.describe(include="object").T

Unnamed: 0,count,unique,top,freq
ID,90058,90058,0x25fed,1
Customer_ID,90107,12500,CUS_0x8600,8
Month,90062,8,April,11319
Name,81113,10139,Langep,42
City,90149,4,Standhampton,47744
Street,90080,31,Quality Street,5343
SSN,89999,12501,#F%$D@*&8,5008
Occupation,90056,21,Journalist,17002
Type_of_Loan,79688,6260,Not Specified,1272
Credit_Mix,90085,4,Standard,32875


We firstly tried to plot the distribution of numerical variables without addressing outliers or noisy values. However, the presence of outliers and extreme values in the dataset made it difficult to visualize the distribution effectively. 
To improve the visualization, we decided to remove outliers from the dataset. However, because of the presence of an high number of missing values, we had to address them first.

## 2) Missing values imputation

The dataset contains numerous missing values, likely due to poor data collection or entry errors. It is also possible that an event occurred, leading to the loss of many values. However, we observe that there are only 12,500 unique `Customer_ID` values across 100,000 entries. This indicates that many customers are repeated throughout the dataset. Consequently, for missing values that should remain consistent for the same customer, we can impute them using the corresponding values from other rows associated with the same customer.

The first issue to fix is hence to address the missing values in the `Customer_ID` column. 

### 2.1) Imputation of missing values in the `Customer_ID` column

We firstly display the number of missing values in the `Customer_ID` column. 

In [28]:
missing_customerID = credit_card_df['Customer_ID'].isnull().sum()
print(f"Missing Customer_ID: {missing_customerID}")

Missing Customer_ID: 9893


By looking at the dataset we go the impression that each client has a monthly analysis of his credit cards on a range of months that goes from January to August. Consequently, we trid to prove the hypothesis by checking that the number of unique values in the `Customer_ID` column every 8 rows is constant and equal to 1.

Example of correct order:       
| Customer_ID | Month    |
|-------------|----------|
| Client 1    | January  |
| Client 1    | February |
| Client 1    | March    |
| Client 1    | April    |
| Client 1    | May      |
| Client 1    | June     |
| Client 1    | July     |
| Client 1    | August   |

In [29]:
# Check if every 8 rows, the Customer ID is unique
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Customer_ID'][i:i+7].nunique() > 1: # Check for 8 rows
        print(f"Different Customer ID found in rows {i} to {i+7}")
    # Check the number of misisng values
    if credit_card_df['Customer_ID'][i:i+7].isnull().sum() > 6:
        print(f"Missing Customer IDs found in rows {i} to {i+7}")
    else:
        counter += 1

print(f"Number of times Customer ID is the unique and not missing every 8 rows: {counter}")
print(f"Number of unique Customer IDs: {credit_card_df['Customer_ID'].nunique()}")

Number of times Customer ID is the unique and not missing every 8 rows: 12500
Number of unique Customer IDs: 12500


As we can see, the output of the previous cell shows that the column `Customer_ID` has 12,500 unique values, and each of them is repeated 8 times for the 8 months in which the customer was monitored.
This also implies that there is no customer with a missing `Customer_ID` value, as the number of unique values every 8 rows (which is confirmed to be 1) is equal to the number of unique values in the entire column. As a consequence, there are not wrong entries, but only missing values, that can be easily imputed by taking the unique value of the `Customer_ID` column for each group of 8 rows.

In [30]:
seen_customer_ids = set()

for i in range(0, len(credit_card_df)):
    customer_id = credit_card_df['Customer_ID'][i]
    if customer_id in seen_customer_ids or pd.isnull(customer_id):
        # If the customer ID is already seen or is null we skip this row
        continue
    else:
        # If the customer ID is not seen, we add it to the set
        seen_customer_ids.add(customer_id)
        # We use this customer ID to impute the missing values in the 8 rows of the group
        credit_card_df.loc[(i//8*8) : (i//8*8) + 7, 'Customer_ID'] = customer_id

# Check if there are still any missing values in the Customer_ID column
missing_customerID = credit_card_df['Customer_ID'].isnull().sum()
print(f"Missing Customer_ID after imputation: {missing_customerID}")

Missing Customer_ID after imputation: 0


After the imputation, we can check if the Customer_ID is unique and not missing every 8 rows. We can also check the number of unique Customer IDs in the DataFrame.

In [31]:
# Check that there is still only one unique Customer_ID in every 8 rows nad that there are no missing values
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Customer_ID'][i:i+7].nunique() > 1: # Check for 8 rows
        print(f"Different Customer ID found in rows {i} to {i+7}")
    else:
        counter += 1
print(f"Number of times Customer ID is the same in 8 rows: {counter}")

missing_customerID = credit_card_df['Customer_ID'].isnull().sum()
print(f"Missing Customer_ID: {missing_customerID}")

Number of times Customer ID is the same in 8 rows: 12500
Missing Customer_ID: 0


### 2.2) Filling the missing values in the `month` column

As we proved in the previous section, the dataset contains 8 months of data for each customer. Here we  fix the remaining missing values for `month` by simply putting the right sequence (from January to August) for each 8 rows (data stored for each customer).

In [32]:
# Define the expected month order.
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']

# Reset the index to ensure we're working on the DataFrame's natural order (if needed)
credit_card_df = credit_card_df.reset_index(drop=True)

# Create a new month column with a repeating cycle: each group of 8 rows gets January to August.
credit_card_df['Month'] = [month_order[i % 8] for i in range(len(credit_card_df))]

# Check Missing Values in the Month Column
missing_month = credit_card_df['Month'].isnull().sum()
print(f"Missing Month values: {missing_month}")

Missing Month values: 0


### 2.3) Filling the columns 'Name', 'City', 'Street' and 'Occupation'

These columns must be filled with the same value for each customer, and hence it is easy to impute them, especially because there are no missing values in the `Customer_ID` column and there are not customers for which one of these information is missing. Just to be sure, we can check the number of unique values in these columns every 8 rows, to confirm that they are all the same for each customer.

In [38]:
# Check that the data is unique every 8 rows
counter_name = 0
counter_city = 0
counter_street = 0
counter_occupation = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Name'][i:i+7].nunique() > 1:
        # print(f"Different Name found in rows {i} to {i+7}")
        counter_name += 1
    if credit_card_df['City'][i:i+7].nunique() > 1:
        # print(f"Different City found in rows {i} to {i+7}")
        counter_city += 1
    if credit_card_df['Street'][i:i+7].nunique() > 1:
        # print(f"Different Street found in rows {i} to {i+7}")
        counter_street += 1
    if credit_card_df['Occupation'][i:i+7].nunique() > 1:
        # print(f"Different Occupation found in rows {i} to {i+7}")
        counter_occupation += 1

print(f"Number of times Name is different for the same customer: {counter_name}")
print(f"Number of times City is different for the same customer: {counter_city}")
print(f"Number of times Street is different for the same customer: {counter_street}")
print(f"Number of times Occupation is different for the same customer: {counter_occupation}")

Number of times Name is different for the same customer: 0
Number of times City is different for the same customer: 6633
Number of times Street is different for the same customer: 0
Number of times Occupation is different for the same customer: 0


Even though the customer ID is unique every 8 rows and there are not issues with Name, Street and Occupation, meaning that there is a unique value for each customer, the `City` column presents a huge problem. Indeed, considering the output of the previous cell, we can see how more than half of the customers changed city during the 8 months of monitoring at least once.

As  a consequence, we decided to impute the `City` column with the most frequent value for each customer. This is a reasonable approach, as it is likely that the customer lived in that city for most of the time. We can also check how many customers changed city at least once during the 8 months of monitoring.

In [36]:
# Impute missing values in the City column using the mode
mode_values = credit_card_df.groupby('Customer_ID')["City"].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
credit_card_df['City'] = credit_card_df['City'].fillna(credit_card_df['Customer_ID'].map(mode_values))

# Check Missing Values in the City Column
missing_city = credit_card_df['City'].isnull().sum()
print(f"Missing City values: {missing_city}")

Missing City values: 0


In [37]:
# Impute missing values in the Name, Street, and Occupation columns
cols_invariant = ['Name', 'Street', 'Occupation']
for col in cols_invariant:
    credit_card_df[col] = credit_card_df.groupby('Customer_ID')[col].transform(lambda x: x.ffill().bfill())

# Check for any remaining missing values in the columns
missing_values = credit_card_df[cols_invariant].isnull().sum()
print(f"Missing values in invariant columns after filling: {missing_values}")

Missing values in invariant columns after filling: Name          0
Street        0
Occupation    0
dtype: int64



### 2.4) Fixing Age

With the age column, we identified two main issues:

- Inconsistent values across months: For some customers, the age values fluctuate unrealistically between months. For example, in January, a customer might have an age of 23, in February -500 (which is impossible), and in March, it reverts back to 23. We opted to change these impossible values to NaN, as they are not valid ages.
- Age increment due to birthdates between January and August: Many customers have their birthdays between January and August, causing their age to increase by one year during this period. For these cases, it is acceptable to have some rows with one age value (e.g., 54) and others with the incremented value (e.g., 55). 

To handle this, we substiute the missing values or the values that have a difference with the previous one of more than 1 with the value of the previous month. If the latter is not available, we can use the next month value.

In [39]:
# Set impossible values to NaN
credit_card_df.loc[(credit_card_df['Age'] > 90) | (credit_card_df['Age'] < 0), 'Age'] = np.nan

# Impute missing values in the Age column using backward and forward fill
credit_card_df['Age'] = credit_card_df.groupby('Customer_ID')['Age'].ffill().bfill()

# Check for any remaining missing values in the Age column
print(f"Missing Age values after filling: {credit_card_df['Age'].isnull().sum()}")


Missing Age values after filling: 0


To make sure that the filling was done correctly, we can check the number of unique values in the `Age` column every 8 rows. 
If the imputation was successful, we should see that the number of unique values is at most 2, as we have only two possible values for each customer: the age before the birthday and the age after the birthday. Also, in this case, we have to ensure that theages are sorted in ascending order, as it is not possible to have a customer with an age of 23 in January and 22 in February.

In [43]:
# Check that the data is unique or sorted every 8 rows
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df["Age"][i:i+7].nunique() > 2: # Check for 8 rows
        print(f"Different values found in rows {i} to {i+7}")
    elif credit_card_df["Age"][i:i+7].nunique() > 1:
        # Check if the values are sorted
        if credit_card_df["Age"][i:i+7].tolist() != sorted(credit_card_df["Age"][i:i+7].tolist()):
            print(f"Values not in the correct order between {i} and {i+7}")
            credit_card_df["Age"][i:i+7] = sorted(credit_card_df["Age"][i:i+7].tolist())
        else:
            counter += 1
    else:
        counter += 1

print(f"Number of times Age is correct in 8 rows: {counter}")


Number of times Age is correct in 8 rows: 12500


### 2.5) Filling Missing SSN Values
The Social Security Number, should uniquely identify a customer. Since a customer's SSN is expected to remain constant over time, any missing or invalid SSN in one row can be reliably filled using a valid SSN from another row belonging to the same customer. We decided to treat this column differently from the others, as it also contains some invalid values.
Firtsly, we defined a function to verify if the SSN matches the standard format (three digits, a dash, two digits, a dash, and four digits). Values that do not meet this pattern or that correspond to known placeholders are set as missing (NaN).
By grouping the data using the Customer_ID, we can apply forward fill and backward fill within each group. This ensures that if at least one valid SSN exists among the 8 rows for the same customer, it is propagated to fill any missing values.

This method leverages the consistency of SSN within each customer group and the sequential structure of the dataset.

In [130]:
# Convert special placeholders into NaN.
def is_valid_ssn(x):
    """
    Checks if x matches the typical SSN format: 3 digits, '-', 2 digits, '-', 4 digits.
    Example: 821-00-0265
    """
    pattern = r'^\d{3}-\d{2}-\d{4}$'
    return bool(re.match(pattern, str(x)))

# Apply the function to the SSN column and set invalid SSNs to NaN
credit_card_df.loc[~credit_card_df['SSN'].apply(is_valid_ssn), 'SSN'] = np.nan

# Impute missing values in the SSN column using the mode
mode_values = credit_card_df.groupby('Customer_ID')["SSN"].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
credit_card_df["SSN"] = credit_card_df['Customer_ID'].map(mode_values)

# Check for any remaining missing values in the SSN column
missing_ssn = credit_card_df['SSN'].isnull().sum()
print(f"Missing SSN values after filling: {missing_ssn}")

# Check that the data is unique every 8 rows
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['SSN'][i:i+7].nunique() > 1: # Check for 8 rows
        print(f"Different SSN found in rows {i} to {i+7}")
    else:
        counter += 1

print(f"Number of times SSN is the same in 8 rows: {counter}")
print(f"Number of unique SSNs: {credit_card_df['SSN'].nunique()}")

Missing SSN values after filling: 0
Number of times SSN is the same in 8 rows: 12500
Number of unique SSNs: 12500


### 1.2.6) Filling Missing Values in Annual Income, Monthly Inhand Salary, Num Bank Accounts, Num Credit Card, Interest Rate and Num of Loan
We suspect that some columns (for example, Annual_Income, Monthly_Inhand_Salary, Num_Bank_Accounts, Num_Credit_Card, Interest_Rate, and Num_of_Loan) should remain constant over the 8 monthly records for each customer. However, there is a possibility that for some columns (such as Monthly_Inhand_Salary) the values might change from month to month. 

To decide if we can impute missing values using the same method as before (forward fill and backward fill based on Customer_ID), we first need to verify that the values in these columns do not vary within the same customer. The code below groups the data by Customer_ID and computes the number of unique values for each candidate column obviously excluding the missing values. This allows us to calculate the percentage of customers that have more than one unique value per column.

In [131]:
# Define the candidate columns that might be invariant across months for each customer
candidate_cols = [
    "Annual_Income",
    "Monthly_Inhand_Salary",
    "Num_Bank_Accounts",
    "Num_Credit_Card",
    "Interest_Rate",
    "Num_of_Loan"
]

# Check if the values are the same for each customer across all months (8 rows)
flag = False
for col in candidate_cols:
    group_dif= 0
    group_large_dif = 0
    for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
        if credit_card_df[col][i:i+7].nunique() > 1: # Check for 8 rows
            # Check the difference between the maximum and minimum values
            # print(f"Column {col} is not invariant across months in rows {i} to {i+7}")
            flag = True
            group_dif += 1
            if credit_card_df[col][i:i+7].max() - credit_card_df[col][i:i+7].min() > credit_card_df[col][i:i+7].min() * 10:
                # print(f"Column {col} has a large difference in rows {i} to {i+7}")
                group_large_dif += 1     
    print(f"Column {col} has {group_dif} groups with differences and {group_large_dif} groups with large differences.")
    if flag:
        print(f"Column {col} is not invariant across months.")
    else:
        print(f"Column {col} is invariant across months.")




Column Annual_Income has 779 groups with differences and 771 groups with large differences.
Column Annual_Income is not invariant across months.
Column Monthly_Inhand_Salary has 616 groups with differences and 0 groups with large differences.
Column Monthly_Inhand_Salary is not invariant across months.
Column Num_Bank_Accounts has 1229 groups with differences and 973 groups with large differences.
Column Num_Bank_Accounts is not invariant across months.
Column Num_Credit_Card has 1909 groups with differences and 1634 groups with large differences.
Column Num_Credit_Card is not invariant across months.
Column Interest_Rate has 1508 groups with differences and 1484 groups with large differences.
Column Interest_Rate is not invariant across months.
Column Num_of_Loan has 3054 groups with differences and 3047 groups with large differences.
Column Num_of_Loan is not invariant across months.


We have seen that the columns `Num_Bank_Accounts`, `Num_Credit_Card`, `Interest_Rate`, and `Num_of_Loan` show different values for the same customer. However, we also noticed that the columns often show a difference between the highest and lowest value that seems to be too high to be considered as a normal variation. As a consequence, we decided to impute the missing values in these columns using the mode value of the column for each customer, after excluding the outliers.

We check if there is a customer that has (for more than 1 month) an annual income that is greater than 1 million.

In [132]:
# How many times there is a customer with an annual income greater than 1 million as the most frequent value in 8 rows
for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    most_frequent_value = credit_card_df["Annual_Income"][i:i+8].mode().iloc[0]
    if most_frequent_value > 1 * 10**6:
        print(f"Annual Income exceeds 1 million in rows {i} to {i+7}")


In [133]:
# Exclude impossible values from the dataset

# Replace all negative values in the specified columns with NaN
for col in candidate_cols:
    credit_card_df.loc[credit_card_df[col] < 0, col] = np.nan

# Replace Number of Bank Accounts and Number of Credit Cards with Nan if they are greater than 20
credit_card_df.loc[credit_card_df['Num_Bank_Accounts'] > 15, 'Num_Bank_Accounts'] = np.nan
credit_card_df.loc[credit_card_df['Num_Credit_Card'] > 15, 'Num_Credit_Card'] = np.nan

# Replace Interest Rate with NaN if it is greater than 100%
credit_card_df.loc[credit_card_df['Interest_Rate'] > 100, 'Interest_Rate'] = np.nan

# Replace Annual Income with NaN if it is greater than 1 million
credit_card_df.loc[credit_card_df['Annual_Income'] > 1 * 10**6, 'Annual_Income'] = np.nan

# Replace Monthly Inhand Salary with NaN if it is greater than 1 million
credit_card_df.loc[credit_card_df['Monthly_Inhand_Salary'] > 1 * 10**6, 'Monthly_Inhand_Salary'] = np.nan

After exluding impossible values, we can check again with the previous code if the columns are consistent across the customers.

In [134]:
# Check if the values are the same for each customer across all months (8 rows)
flag = False
for col in candidate_cols:
    group_dif= 0
    group_large_dif = 0
    for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
        if credit_card_df[col][i:i+7].nunique() > 1: # Check for 8 rows
            # print(f"Column {col} is not invariant across months in rows {i} to {i+7}")
            flag = True
            group_dif += 1
            # Check the difference between the maximum and minimum values
            if credit_card_df[col][i:i+7].max() - credit_card_df[col][i:i+7].min() > credit_card_df[col][i:i+7].min() * 10:
                # print(f"Column {col} has a large difference in rows {i} to {i+7}")
                group_large_dif += 1
        # Check if the values are all missing 
        if credit_card_df[col][i:i+7].isnull().all():
            print(f"Column {col} is all missing in rows {i} to {i+7}")
    print(f"Column {col} has {group_dif} groups with differences and {group_large_dif} groups with large differences.")
    if flag:
        print(f"Column {col} is not invariant across months.")
    else:
        print(f"Column {col} is invariant across months.")

Column Annual_Income has 31 groups with differences and 22 groups with large differences.
Column Annual_Income is not invariant across months.
Column Monthly_Inhand_Salary is all missing in rows 94720 to 94727
Column Monthly_Inhand_Salary has 616 groups with differences and 0 groups with large differences.
Column Monthly_Inhand_Salary is not invariant across months.
Column Num_Bank_Accounts has 255 groups with differences and 16 groups with large differences.
Column Num_Bank_Accounts is not invariant across months.
Column Num_Credit_Card has 263 groups with differences and 4 groups with large differences.
Column Num_Credit_Card is not invariant across months.
Column Interest_Rate has 17 groups with differences and 6 groups with large differences.
Column Interest_Rate is not invariant across months.
Column Num_of_Loan has 364 groups with differences and 355 groups with large differences.
Column Num_of_Loan is not invariant across months.


Since we have few customers that have a varying value for the same column (the maximum is 616 in the case of `Monthly Inhand Salary`), we can impute the missing values using the mode value of the column for each customer.

In [135]:
for col in candidate_cols:
    for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
        mode_values = credit_card_df[col][i:i+7].mode()
        if not mode_values.empty:
            credit_card_df.loc[i:i+7, col] = mode_values.iloc[0]
        else: # Use Annual Income divided by 12 as a fallback
            credit_card_df.loc[i:i+7, col] = credit_card_df['Annual_Income'][i:i+7].mode().iloc[0] / 12


# Check for any remaining missing values in the columns
missing_values = credit_card_df[candidate_cols].isnull().sum()
print(f"Missing values in candidate columns after filling: {missing_values}")

Missing values in candidate columns after filling: Annual_Income            0
Monthly_Inhand_Salary    0
Num_Bank_Accounts        0
Num_Credit_Card          0
Interest_Rate            0
Num_of_Loan              0
dtype: int64


We check if there is a customer that has (for more than 1 month) an annual income that is greater than 1 million.

### 1.2.7) Filling Missing Values in Type of Loan
The column `Type_of_Loan` contains categorical values that represent the type of loan a customer has. 

In [136]:
# Check if the values are the same for each customer across all months (8 rows)
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Type_of_Loan'][i:i+7].nunique() > 1:
        print(f"Different Type of Loan found in rows {i} to {i+7}")
    elif credit_card_df['Type_of_Loan'][i:i+7].isnull().all():
        print(f"Type of Loan is all missing in rows {i} to {i+7}")
        # credit_card_df['Type_of_Loan'][i:i+7] = "Not Specified"
    else:
        counter += 1
    
print(f"Number of times Type of Loan is the same in 8 rows: {counter}")

    

Type of Loan is all missing in rows 32 to 39
Type of Loan is all missing in rows 40 to 47
Type of Loan is all missing in rows 128 to 135
Type of Loan is all missing in rows 144 to 151
Type of Loan is all missing in rows 160 to 167
Type of Loan is all missing in rows 176 to 183
Type of Loan is all missing in rows 320 to 327
Type of Loan is all missing in rows 352 to 359
Type of Loan is all missing in rows 424 to 431
Type of Loan is all missing in rows 448 to 455
Type of Loan is all missing in rows 504 to 511
Type of Loan is all missing in rows 608 to 615
Type of Loan is all missing in rows 712 to 719
Type of Loan is all missing in rows 720 to 727
Type of Loan is all missing in rows 728 to 735
Type of Loan is all missing in rows 744 to 751
Type of Loan is all missing in rows 848 to 855
Type of Loan is all missing in rows 864 to 871
Type of Loan is all missing in rows 896 to 903
Type of Loan is all missing in rows 952 to 959
Type of Loan is all missing in rows 984 to 991
Type of Loan is a

In [137]:
# When the Type of Loan is missing, we can use the most frequent value in the group of 8 rows
for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    mode_values = credit_card_df['Type_of_Loan'][i:i+7].mode()
    # print(f"Mode values for Type of Loan in rows {i} to {i+7}: {mode_values}")
    if not mode_values.empty:
        credit_card_df.loc[i:i+7, 'Type_of_Loan'] = mode_values.iloc[0]
    else: # Use not specified as a fallback
        credit_card_df.loc[i:i+7, 'Type_of_Loan'] = "Not Specified"

### 1.2.8) Filling Missing Values in Outstanding Debt
The column `Outstanding_Debt` contains numerical values that represent the amount of debt a customer has.

In [138]:
# Check if the values are the same for each customer across all months (8 rows)
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Outstanding_Debt'][i:i+7].nunique() > 1:
        print(f"Different Outstanding Debt found in rows {i} to {i+7}")
    elif credit_card_df['Outstanding_Debt'][i:i+7].isnull().all():
        print(f"Outstanding Debt is all missing in rows {i} to {i+7}")
    else:
        counter += 1

print(f"Number of times Outstanding Debt is the same in 8 rows: {counter}")

Number of times Outstanding Debt is the same in 8 rows: 12500


In [139]:
# We can use the most frequent value in the group of 8 rows
for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    mode_values = credit_card_df['Outstanding_Debt'][i:i+7].mode()
    # print(f"Mode values for Outstanding Debt in rows {i} to {i+7}: {mode_values}")
    if not mode_values.empty:
        credit_card_df.loc[i:i+7, 'Outstanding_Debt'] = mode_values.iloc[0]

# Check for any remaining missing values in the columns
missing_values = credit_card_df['Outstanding_Debt'].isnull().sum()
print(f"Missing values in Outstanding Debt after filling: {missing_values}")

Missing values in Outstanding Debt after filling: 0


### 1.2.9) Filling Missing Values in Credit History
The column `Credit_History_Age` contains categorical values that represent the credit history of a customer.

In [140]:
# Transofrm the categorical value (x Years and x Months) into a single value in months
def convert_to_months(x):
    """
    Convert a string in the format 'x Years and y Months' to total months.
    """
    if pd.isnull(x):
        return np.nan
    match = re.match(r'(\d+)\s*Years?\s*and\s*(\d+)\s*Months?', x)
    if match:
        years, months = map(int, match.groups())
        return years * 12 + months
    else:
        return np.nan

# Apply the function to the 'Credit_History_Age' column
credit_card_df['Credit_History_Age'] = credit_card_df['Credit_History_Age'].apply(convert_to_months)



In [141]:
# Check if the values are sorted every 8 rows (exluding nans)
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Credit_History_Age'][i:i+7].dropna().tolist() != sorted(credit_card_df['Credit_History_Age'][i:i+7].dropna().tolist()):
            print(f"Values not in the correct order between {i} and {i+7}")
    # Check if the values are all missing
    if credit_card_df['Credit_History_Age'][i:i+7].isnull().all():
        print(f"Credit History Age is all missing in rows {i} to {i+7}")
    # Check if the values min and max differ by more than 8 months
    elif credit_card_df['Credit_History_Age'][i:i+7].max() - credit_card_df['Credit_History_Age'][i:i+7].min() > 7:
        print(f"Credit History Age has a large difference in rows {i} to {i+7}")
    else:
        counter += 1

print(f"Number of times Credit History Age is sequential in 8 rows: {counter}")

Number of times Credit History Age is sequential in 8 rows: 12500


In [142]:
seen_customer_ids = set()
for i in range(0, len(credit_card_df)):  # Iterate through the DataFrame in steps of 8
    customer_id = credit_card_df['Customer_ID'][i]
    credit_history_age = credit_card_df['Credit_History_Age'][i]
    if pd.isnull(credit_history_age):
        # If the credit history age is null, we skip this row
        continue
    elif customer_id in seen_customer_ids:
        # If the customer ID is already seen or is null we skip this row
        continue
    else:
        # If the customer ID is not seen, we add it to the set
        seen_customer_ids.add(customer_id)
        # Compute the range of value in the group of 8 rows
        start = i // 8 * 8
        end = start + 7
        start_month = credit_history_age - (i - start)
        end_month = credit_history_age + (end - i)
        credit_card_df.loc[start:end, 'Credit_History_Age'] = np.arange(start_month, end_month + 1)

# Check for any remaining missing values in the columns
missing_values = credit_card_df['Credit_History_Age'].isnull().sum()
print(f"Missing values in Credit History Age after filling: {missing_values}")

Missing values in Credit History Age after filling: 0


### 1.2.10) Filling Missing Values in `amount invested monthly` and `Credit_Utilization_Ratio`

The column `Amount_Invested_monthly` and `Credit_Utilization_Ratio` contains numerical values that represent the amount of money a customer invests each month. It is possible that this value changes from month to month and it is impossible to have a result that is surely correct. As a consequence, we decided to impute the missing values using the mean value of the column for each customer, after excluding the outliers.

In [143]:
# We iterate through the DataFrame in steps of 8
counter = 0
for i in range(0, len(credit_card_df), 8):
    # If the maxmum value is more than 100 times greater than the minimum value, we set that value to NaN
    if credit_card_df['Amount_invested_monthly'][i:i+7].max() - credit_card_df['Amount_invested_monthly'][i:i+7].mean() > credit_card_df['Amount_invested_monthly'][i:i+7].mean() * 3:
        # print(f"Credit History Age has a large difference in rows {i} to {i+7}") 
        counter += 1

print(f"Number of times Amount Invested Monthly has a large difference in 8 rows: {counter}")

# Remove values that are too high
credit_card_df.loc[credit_card_df['Amount_invested_monthly'] > 9000, 'Amount_invested_monthly'] = np.nan

# Do again the check for the values in the group of 8 rows
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Amount_invested_monthly'][i:i+7].max() - credit_card_df['Amount_invested_monthly'][i:i+7].mean() > credit_card_df['Amount_invested_monthly'][i:i+7].mean() * 3:
        # print(f"Credit History Age has a large difference in rows {i} to {i+7}") 
        counter += 1
print(f"Number of times Amount Invested Monthly has a large difference in 8 rows after processing: {counter}")

Number of times Amount Invested Monthly has a large difference in 8 rows: 2532
Number of times Amount Invested Monthly has a large difference in 8 rows after processing: 0


In [144]:
# Filling missing values in Amount Invested Monthly using the mean of the customer group

for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    mean_value = credit_card_df['Amount_invested_monthly'][i:i+7].mean()
    if not pd.isnull(mean_value):
        credit_card_df.loc[i:i+7, 'Amount_invested_monthly'].fillna(mean_value, inplace=True)
    else:
        # If the mean is NaN, we can use the overall mean as a fallback
        print("Mean value is NaN, using overall mean")
        overall_mean = credit_card_df['Amount_invested_monthly'].fillna(mean_value, inplace=True)
        credit_card_df.loc[i:i+7, 'Amount_invested_monthly'] = overall_mean

# Check for any remaining missing values in the columns
missing_values = credit_card_df['Amount_invested_monthly'].isnull().sum()
print(f"Missing values in Amount Invested Monthly after filling: {missing_values}")

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  credit_card_df.loc[i:i+7, 'Amount_invested_monthly'].fillna(mean_value, inplace=True)


Missing values in Amount Invested Monthly after filling: 0


In [174]:
# We iterate through the DataFrame in steps of 8
counter = 0
for i in range(0, len(credit_card_df), 8):
    # If the maxmum value is more than 100 times greater than the minimum value, we set that value to NaN
    if credit_card_df['Credit_Utilization_Ratio'][i:i+7].max() - credit_card_df['Credit_Utilization_Ratio'][i:i+7].mean() > credit_card_df['Credit_Utilization_Ratio'][i:i+7].mean() * 3:
        # print(f"Credit History Age has a large difference in rows {i} to {i+7}") 
        counter += 1

print(f"Number of times Amount Invested Monthly has a large difference in 8 rows: {counter}")

# Remove values that are too high
credit_card_df.loc[credit_card_df['Credit_Utilization_Ratio'] > 9000, 'Credit_Utilization_Ratio'] = np.nan

# Do again the check for the values in the group of 8 rows
counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Credit_Utilization_Ratio'][i:i+7].max() - credit_card_df['Credit_Utilization_Ratio'][i:i+7].mean() > credit_card_df['Credit_Utilization_Ratio'][i:i+7].mean() * 3:
        # print(f"Credit History Age has a large difference in rows {i} to {i+7}") 
        counter += 1
print(f"Number of times Amount Invested Monthly has a large difference in 8 rows after processing: {counter}")

Number of times Amount Invested Monthly has a large difference in 8 rows: 0
Number of times Amount Invested Monthly has a large difference in 8 rows after processing: 0


In [178]:
# Filling missing values in Credit Utilization Ratio using the mean of the customer group

for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    mean_value = credit_card_df['Credit_Utilization_Ratio'][i:i+7].mean()
    if not pd.isnull(mean_value):
        credit_card_df.loc[i:i+7, 'Credit_Utilization_Ratio'].fillna(mean_value, inplace=True)

    else:
        # If the mean is NaN, we can use the overall mean as a fallback
        print("Mean value is NaN, using overall mean")
        overall_mean = credit_card_df['Credit_Utilization_Ratio'].mean()
        credit_card_df.loc[i:i+7, 'Credit_Utilization_Ratio'].fillna(overall_mean, inplace=True)

# Check for any remaining missing values in the columns
missing_values = credit_card_df['Credit_Utilization_Ratio'].isnull().sum()
print(f"Missing values in Credit Utilization Ratio after filling: {missing_values}")

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  credit_card_df.loc[i:i+7, 'Credit_Utilization_Ratio'].fillna(mean_value, inplace=True)


Missing values in Credit Utilization Ratio after filling: 0


### 1.2.11) Filling Missing Values in `Payment_of_Min_Amount`


In [147]:
# Check the number of unique values for every customer in the group of 8 rows

counter = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Payment_of_Min_Amount'][i:i+7].nunique() > 2:
        print(f"All the categories are present in rows {i} to {i+7}")
    elif credit_card_df['Payment_of_Min_Amount'][i:i+7].isnull().all():
        print(f"Payment of Minimum Amount is all missing in rows {i} to {i+7}")
    else:
        counter += 1

In [149]:
# Convert all Nan values to "NM"

credit_card_df['Payment_of_Min_Amount'] = credit_card_df['Payment_of_Min_Amount'].fillna("NM")

### 1.2.12) Filling Missing Values in `Delay_from_due_date` and `Num_of_Delayed_Payment` and `Changed_Credit_Limit`
The columns `Delay_from_due_date`, `Num_of_Delayed_Payment`, and `Changed_Credit_Limit` contain numerical values that represent the delay from the due date, the number of delayed payments, and the change in credit limit for a customer.

In [161]:
# Check if there are more than 3 different values in the group of 8 rows
counter_dif_delay = 0
counter_dif_num = 0
counter_missing_delay = 0
counter_missing_num = 0
counter_dif_change = 0
counter_missing_change = 0
for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Delay_from_due_date'][i:i+7].nunique() > 4:
        counter_dif_delay += 1
    elif credit_card_df['Delay_from_due_date'][i:i+7].isnull().all():
        counter_missing_delay += 1
    if credit_card_df['Num_of_Delayed_Payment'][i:i+7].nunique() > 5:
        counter_dif_num += 1
    elif credit_card_df['Num_of_Delayed_Payment'][i:i+7].isnull().all():
        counter_missing_num += 1
    if credit_card_df['Changed_Credit_Limit'][i:i+7].nunique() > 2:
        counter_dif_change += 1
    elif credit_card_df['Changed_Credit_Limit'][i:i+7].isnull().all():
        counter_missing_change += 1

print(f"Number of times Delay from Due Date has a large difference in 8 rows: {counter_dif_delay}")
print(f"Number of times Delay from Due Date is all missing in 8 rows: {counter_missing_delay}")
print(f"Number of times Number of Delayed Payment has a large difference in 8 rows: {counter_dif_num}")
print(f"Number of times Number of Delayed Payment is all missing in 8 rows: {counter_missing_num}")
print(f"Number of times Changed Credit Limit has a large difference in 8 rows: {counter_dif_change}")
print(f"Number of times Changed Credit Limit is all missing in 8 rows: {counter_missing_change}")

Number of times Delay from Due Date has a large difference in 8 rows: 475
Number of times Delay from Due Date is all missing in 8 rows: 0
Number of times Number of Delayed Payment has a large difference in 8 rows: 54
Number of times Number of Delayed Payment is all missing in 8 rows: 0
Number of times Changed Credit Limit has a large difference in 8 rows: 3881
Number of times Changed Credit Limit is all missing in 8 rows: 0


In [163]:
# Filling missing values in Delay from Due Date and Number of Delayed Payment using the mode of the customer group

for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    mode_value_delay = credit_card_df['Delay_from_due_date'][i:i+7].mode()
    mode_value_num = credit_card_df['Num_of_Delayed_Payment'][i:i+7].mode()
    mode_value_change = credit_card_df['Changed_Credit_Limit'][i:i+7].mode()
    if not mode_value_delay.empty:
        credit_card_df.loc[i:i+7, 'Delay_from_due_date'] = credit_card_df['Delay_from_due_date'].fillna(mode_value_delay.iloc[0])
    if not mode_value_num.empty:
        credit_card_df.loc[i:i+7, 'Num_of_Delayed_Payment'] = credit_card_df['Num_of_Delayed_Payment'].fillna(mode_value_num.iloc[0])
    if not mode_value_change.empty:
        credit_card_df.loc[i:i+7, 'Changed_Credit_Limit'] = credit_card_df['Changed_Credit_Limit'].fillna(mode_value_change.iloc[0])

# Check for any remaining missing values in the columns
missing_values = credit_card_df[['Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit']].isnull().sum()
print(f"Missing values in Delay from Due Date after filling: {missing_values['Delay_from_due_date']}")
print(f"Missing values in Number of Delayed Payment after filling: {missing_values['Num_of_Delayed_Payment']}")
print(f"Missing values in Changed Credit Limit after filling: {missing_values['Changed_Credit_Limit']}")

Missing values in Delay from Due Date after filling: 0
Missing values in Number of Delayed Payment after filling: 0
Missing values in Changed Credit Limit after filling: 0


### 1.2.13) Filling Missing Values in `Credit_Mix` and `Num_Credit_Inquiries`

In [172]:
# Check if the values are the same for each customer across all months (8 rows)
counter_mix = 0
counter_inq = 0

# Put all the values "-" to NaN
credit_card_df['Credit_Mix'] = credit_card_df['Credit_Mix'].replace("_", np.nan)

for i in range(0, len(credit_card_df), 8): # Iterate through the DataFrame in steps of 8
    if credit_card_df['Credit_Mix'][i:i+7].nunique() > 1:
        print(f"Different Credit Mix found in rows {i} to {i+7}")
    elif credit_card_df['Credit_Mix'][i:i+7].isnull().all():
        print(f"Credit Mix is all missing in rows {i} to {i+7}")
    else:
        counter_mix += 1
    if credit_card_df['Num_Credit_Inquiries'][i:i+7].nunique() > 3:
        print(f"Different Number of Inquiries found in rows {i} to {i+7}")
    elif credit_card_df['Num_Credit_Inquiries'][i:i+7].isnull().all():
        print(f"Number of Inquiries is all missing in rows {i} to {i+7}")
    else:
        counter_inq += 1

print(f"Number of times Credit Mix is the same in 8 rows: {counter_mix}")
print(f"Number of times Number of Inquiries is the same in 8 rows: {counter_inq}")

Different Number of Inquiries found in rows 10088 to 10095
Different Number of Inquiries found in rows 13816 to 13823
Different Number of Inquiries found in rows 15128 to 15135
Different Number of Inquiries found in rows 15344 to 15351
Different Number of Inquiries found in rows 19928 to 19935
Different Number of Inquiries found in rows 23080 to 23087
Different Number of Inquiries found in rows 35616 to 35623
Different Number of Inquiries found in rows 42592 to 42599
Different Number of Inquiries found in rows 43064 to 43071
Different Number of Inquiries found in rows 45920 to 45927
Different Number of Inquiries found in rows 46048 to 46055
Different Number of Inquiries found in rows 56472 to 56479
Different Number of Inquiries found in rows 68984 to 68991
Different Number of Inquiries found in rows 81992 to 81999
Different Number of Inquiries found in rows 87208 to 87215
Different Number of Inquiries found in rows 91056 to 91063
Different Number of Inquiries found in rows 93256 to 932

In [None]:
# Impute the values in the Credit Mix column using the mode of the customer group
for i in range(0, len(credit_card_df), 8):  # Iterate through the DataFrame in steps of 8
    mode_values = credit_card_df['Credit_Mix'][i:i+7].mode()
    mode_values_inq = credit_card_df['Num_Credit_Inquiries'][i:i+7].mode()
    if not mode_values_inq.empty:
        credit_card_df.loc[i:i+7, 'Num_Credit_Inquiries'] = credit_card_df['Num_Credit_Inquiries'].fillna(mode_values_inq.iloc[0])
    else: # Use 0 as a fallback
        credit_card_df.loc[i:i+7, 'Num_Credit_Inquiries'] = credit_card_df['Num_Credit_Inquiries'].fillna(0)
    if not mode_values.empty:
        credit_card_df.loc[i:i+7, 'Credit_Mix'] = mode_values.iloc[0]
    else: # Use "Good" as a fallback
        credit_card_df.loc[i:i+7, 'Credit_Mix'] = "Standard"

In [4]:
# Save the resulting DataFrame to a new CSV file
credit_card_df.to_csv('Dataset/cleaned_credit_card_data.csv', index=False)