After identifying cases with poor quality during the **Exploratory Data Analysis (EDA)** step, the data cleaning stage addresses and rectifies these issues.  

**Data cleaning** plays a fundamental role in enhancing the overall **quality** and **reliability** of datasets, directly impacting the performance of **machine learning models** and **analytical outcomes**. A well-cleaned dataset serves as the bedrock for **robust** model training and **accurate predictions**. 
In this notebook, our primary focus is on executing the initial stage of data cleaning, including:
* Feature screening
* Handling values that fall outside the logical range of respective fields
* Addressing inconsistencies within the dataset  

This meticulous approach to data cleaning not only enhances the accuracy of our analyses but also contributes significantly to the **overall success** of data science projects.

### Read the dataset

In [1]:
import pandas as pd
df = pd.read_csv('/kaggle/input/bank-loan/Bankloan.txt')

To enhance clarity and facilitate streamlined data processing, we **separate the dataset** into two distinct dataframes: one designated for the **target variable** or response, and the other for the **input variables** or predictors. This segregation allows for a more organized and efficient approach in preparing the data for subsequent modeling and analysis.

In [2]:
target = df.iloc[:,-1]
inputs = df.iloc[:,0:-1]

Create two lists distinguishing **categorical** and **continuous** variables based on their respective indices.

In [3]:
columns = inputs.columns

# Choose categorical elements 
categorical_indices = [1]

# Use a list comprehension to select the elements at the specified indices
categorical_fields = [columns[i] for i in categorical_indices]

# Create a new list of columns excluding categorical_fields (continuous)
continuous_fields = [j for j in columns if j not in categorical_fields]

### Feature Screening 
Filter out these features:  

* **Features with a coefficient of variation less than 0.1 for continuous variables**  
Identifying and screening out **continuous variables** with low variability ensures that the selected features provide **meaningful information** for analysis and modeling.

* **Features where the mode category percentage is greater than 95% for categorical variables**  
This step removes **categorical variables** in which one category overwhelmingly dominates. Such near-constant fields carry little information, so dropping them streamlines the dataset and improves the interpretability of the resulting models.

* **Features with a percentage of unique categories exceeding 90% for categorical variables**  
Screening out **categorical variables** with a high percentage of unique categories contributes to simplifying the dataset and mitigating the risk of overfitting, ensuring a more robust and generalizable model.

In [4]:
# Define a minimum value for coefficient of variation
min_cv = 0.1

# Calculate the coefficient of variation for each column
cv_values = inputs[continuous_fields].std() / inputs[continuous_fields].mean()

# Identify columns whose CV is below the minimum (these will be screened out)
selected_columns = cv_values[cv_values < min_cv].index

# Keep the screened-out columns in a separate DataFrame for reference
filtered_con = inputs[selected_columns]

# Drop the low-variation columns and print the resulting DataFrame
inputs_con = inputs[continuous_fields].drop(selected_columns, axis=1)
print(inputs_con)

      age  employ  address  income  debtinc   creddebt   othdebt
0    41.0      17       12   176.0      9.3  11.359392  5.008608
1    27.0      10        6    31.0     17.3   1.362202  4.000798
2    40.0      15        7     NaN      5.5   0.856075  2.168925
3    41.0      15       14   120.0      2.9   2.658720  0.821280
4    24.0       2        0    28.0     17.3   1.787436  3.056564
..    ...     ...      ...     ...      ...        ...       ...
695  36.0       6       15    27.0      4.6   0.262062  0.979938
696  29.0       6        4    21.0     11.5   0.369495  2.045505
697  33.0      15        3    32.0      7.6   0.491264  1.940736
698  45.0      19       22    77.0      8.4   2.302608  4.165392
699  37.0      12       14     NaN     14.7   2.994684  3.473316

[700 rows x 7 columns]
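
To make the screening rule concrete, here is a minimal sketch on made-up data. The `toy` frame and its column names are hypothetical, not part of the bank loan dataset. It also divides by the absolute value of the mean, an assumption added here so that a column with a negative mean is not screened out by accident.

```python
import pandas as pd

# Toy data (hypothetical): 'stable' barely varies, 'spread' varies a lot
toy = pd.DataFrame({
    "stable": [100.0, 101.0, 99.0, 100.0],
    "spread": [10.0, 50.0, 90.0, 130.0],
})

min_cv = 0.1

# Coefficient of variation = sample standard deviation / |mean|
cv = toy.std() / toy.mean().abs()

# Columns below the threshold are considered uninformative
low_variation = cv[cv < min_cv].index.tolist()
kept = toy.drop(columns=low_variation)

print(cv.round(3))
print("Dropped:", low_variation)
```

Note that the coefficient of variation becomes unstable when a column's mean is close to zero, so the threshold is most meaningful for strictly positive fields such as age or income.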


In [5]:
import pandas as pd

# Define a threshold for the dominant category percentage
threshold = 95

# Calculate the percentage of the mode category for each column
mode_category = (inputs[categorical_fields].apply(lambda x: x.value_counts().max() / len(x)) * 100)

# Select columns where the mode category percentage is greater than the threshold
selected_categorical_columns = mode_category[mode_category > threshold].index

# Create a new DataFrame with only the selected columns
mode_filtered_inputs = inputs[selected_categorical_columns]

# Filter out selected columns and print the resulting DataFrame
inputs_cat = inputs[categorical_fields].drop(selected_categorical_columns, axis=1)
print(inputs_cat)

      ed
0    3.0
1    1.0
2    1.0
3    NaN
4    2.0
..   ...
695  2.0
696  2.0
697  1.0
698  1.0
699  1.0

[700 rows x 1 columns]


In [6]:
import pandas as pd

# Set a threshold for excluding columns 
threshold = 90

# Calculate the percentage of distinct categories among non-missing values.
# Use the columns still present in inputs_cat, since some may have been dropped above.
distinct_percentage = inputs_cat.apply(lambda x: x.nunique() / x.count()) * 100

# Select categorical columns based on distinct percentage threshold
selected_categorical_columns = distinct_percentage[distinct_percentage > threshold].index

# Create a new DataFrame with only the selected columns
distinct_filtered_inputs = inputs_cat[selected_categorical_columns]

# Filter out selected columns and print the resulting DataFrame
inputs_cat = inputs_cat.drop(selected_categorical_columns, axis=1)
print(inputs_cat)


      ed
0    3.0
1    1.0
2    1.0
3    NaN
4    2.0
..   ...
695  2.0
696  2.0
697  1.0
698  1.0
699  1.0

[700 rows x 1 columns]
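
Because `ed` passes both categorical checks, the cells above do not show a column actually being removed. The following sketch applies the same two rules to a hypothetical `toy` frame (all column names are invented for illustration). One difference from the cells above: `value_counts(normalize=True)` computes the mode share over non-missing values rather than over all rows.

```python
import pandas as pd

# Toy data (hypothetical) with one column per screening outcome
toy = pd.DataFrame({
    "dominant": ["a"] * 98 + ["b"] * 2,          # mode covers 98% of rows
    "id_like": [f"id{i}" for i in range(100)],   # every value is unique
    "balanced": ["x", "y", "z", "x"] * 25,       # passes both checks
})

# Share of the most frequent category, in percent (missing values excluded)
mode_pct = toy.apply(lambda s: s.value_counts(normalize=True).max() * 100)

# Share of distinct categories among non-missing values, in percent
unique_pct = toy.apply(lambda s: s.nunique() / s.count() * 100)

# Drop a column if it fails either rule
to_drop = mode_pct[mode_pct > 95].index.union(unique_pct[unique_pct > 90].index)
screened = toy.drop(columns=to_drop)

print("Dropped:", list(to_drop))
print("Kept:", list(screened.columns))
```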


Create a new dataframe that combines the continuous and categorical features retained by feature screening with the target variable.

In [7]:
filtered_df = pd.concat([inputs_con, inputs_cat, target], axis=1)

###  Handling Values Outside the Logical Range  
In data analysis, handling values that fall outside the logical range of their respective fields is a critical step in maintaining the **integrity** and **reliability** of the dataset. Values that deviate significantly from the expected range can distort analytical results and reduce the overall **quality of findings**. For continuous fields, replacing values beyond the logical range with missing values ensures that subsequent modeling or statistical techniques are based on a more representative and **trustworthy** dataset.

In [8]:
import pandas as pd

# Define ranges for each column
column_ranges = {
    'age': (18, 70),
    'employ': (0, 31),
    'address': (0, 80),
    'income': (0, 1000),
    'debtinc': (0, 100),
    'creddebt': (0, 30),
    'othdebt': (0, 30)
}

# Iterate through each column and replace values outside the defined range with NaN
for column, (min_val, max_val) in column_ranges.items():
    filtered_df[column] = filtered_df[column].apply(lambda x: x if min_val <= x <= max_val else None)

# Display the updated DataFrame
print(filtered_df)
filtered_df.describe()
filtered_df.info()

      age  employ  address  income  debtinc   creddebt   othdebt   ed default
0    41.0      17       12   176.0      9.3  11.359392  5.008608  3.0       1
1    27.0      10        6    31.0     17.3   1.362202  4.000798  1.0       0
2    40.0      15        7     NaN      5.5   0.856075  2.168925  1.0       0
3    41.0      15       14   120.0      2.9   2.658720  0.821280  NaN       0
4    24.0       2        0    28.0     17.3   1.787436  3.056564  2.0       1
..    ...     ...      ...     ...      ...        ...       ...  ...     ...
695  36.0       6       15    27.0      4.6   0.262062  0.979938  2.0       1
696  29.0       6        4    21.0     11.5   0.369495  2.045505  2.0       0
697  33.0      15        3    32.0      7.6   0.491264  1.940736  1.0       0
698  45.0      19       22    77.0      8.4   2.302608  4.165392  1.0       0
699  37.0      12       14     NaN     14.7   2.994684  3.473316  1.0       0

[700 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Ra
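
The same masking can also be sketched in vectorized form with `Series.between` and `Series.where`, which avoids a Python-level `apply`. The `toy` frame below is hypothetical and the ranges are copied from the cell above for two columns only. Counting the newly created missing values is an extra added here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) containing some out-of-range values
toy = pd.DataFrame({
    "age": [25.0, 17.0, 130.0, np.nan],
    "income": [40.0, -5.0, 999.0, 1200.0],
})

column_ranges = {"age": (18, 70), "income": (0, 1000)}

cleaned = toy.copy()
for column, (min_val, max_val) in column_ranges.items():
    # between() is inclusive on both ends by default; NaN compares as False
    in_range = cleaned[column].between(min_val, max_val)
    # where() keeps values where the mask is True and sets the rest to NaN
    cleaned[column] = cleaned[column].where(in_range)

# Number of values per column that were out of range (not already missing)
n_invalid = cleaned.isna().sum() - toy.isna().sum()
print(cleaned)
print(n_invalid)
```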

### Handling Inconsistent Data  
In data analysis, addressing inconsistent data is a basic task for ensuring the **reliability** of results. Inconsistent data in **categorical variables**, whether due to **data entry errors** or discrepancies in **data integration**, can introduce noise and **inaccuracies** into the dataset, potentially leading to misleading findings. For instance, one employee may enter customer addresses as "block 1/23", while another may use "block 1-23".  
By handling and rectifying these inconsistencies, analysts can foster a more cohesive and accurate representation of the underlying information. The impact of such attention to detail extends beyond cleaning the dataset; it directly influences the **credibility** of analysis reports. A meticulously curated dataset, free from inconsistencies in codes, lays the groundwork for **robust** statistical analyses and more informed decision-making.
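
As an illustration of the address example, the sketch below maps both spellings to one canonical form. The sample strings and the choice of `-` as the canonical separator are assumptions made for this example. Real address data would likely need more rules.

```python
import pandas as pd

# Hypothetical address entries written in inconsistent styles
addresses = pd.Series(["block 1/23", "block 1-23", "Block 1 / 23", " block 1-23 "])

normalized = (
    addresses.str.strip()                          # remove surrounding whitespace
    .str.lower()                                   # unify letter case
    .str.replace(r"\s*[/-]\s*", "-", regex=True)   # unify the separator
)

# A frequency table now shows a single category
print(normalized.value_counts())
```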

First detect inconsistent data in frequency tables.

In [9]:
import numpy as np

def frequency_table(variable):
    
    # Get unique elements and their counts
    unique_elements, counts = np.unique(variable.dropna(), return_counts=True)

    # Calculate percentages
    percentages = (counts / len(variable)) * 100

    # Pair each unique value with its count and percentage
    value_counts_and_percentages = zip(unique_elements, counts, percentages)

    # Print the value counts and percentages
    for i, j, k in value_counts_and_percentages:
        print(f"{i}: Count: {j}, Percentage: {k:.2f}%")
    return


frequency_table(filtered_df['default'])

'0': Count: 1, Percentage: 0.14%
0: Count: 515, Percentage: 73.57%
1: Count: 183, Percentage: 26.14%
:0: Count: 1, Percentage: 0.14%


After detecting inconsistent values in the frequency table, the logical next step is to replace them with the correct ones.

In [10]:
filtered_df['default'] = filtered_df['default'].replace([':0', "'0'"], '0')

Check the frequency table to ensure data consistency.

In [11]:
frequency_table(filtered_df['default'])

0: Count: 517, Percentage: 73.86%
1: Count: 183, Percentage: 26.14%
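
The mixed entries in the first frequency table suggest that `default` is stored as strings. If a numeric target is needed for modeling, one option is to convert the column after cleaning. The sketch below uses a hypothetical `labels` series rather than the real column and assumes every cleaned value is a valid integer string.

```python
import pandas as pd

# Hypothetical labels mimicking the inconsistent entries seen above
labels = pd.Series(["0", "1", ":0", "'0'", "1"])

# Map the malformed entries to the canonical string, as in the cell above
cleaned = labels.replace([":0", "'0'"], "0")

# errors="raise" makes any remaining malformed value fail loudly
as_int = pd.to_numeric(cleaned, errors="raise").astype("int64")

print(as_int.value_counts())
```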


In [12]:
filtered_df.to_csv('/kaggle/working/Bankloan_Cleanedv1.csv')