# 🌳 📝 Complete Guide to Data Quality Checks from A to Z

Welcome to the "Complete Guide to Data Quality Checks from A to Z," your ultimate resource for mastering the critical techniques of data quality checks in data science and analytics. This comprehensive guide is designed for anyone interested in ensuring their data is primed for analysis, from students just starting out in data science to seasoned analysts looking to refine their skills.

## What Will You Learn?

This guide covers a broad spectrum of topics crucial for effective data quality checks, making sure you are well-equipped to handle any challenges in cleaning and organizing data. Here's a snapshot of what's included:

- **Feature Screening:** Learn how to drop continuous features with a coefficient of variation below 0.1, and categorical features whose mode category share exceeds 0.95 or whose share of unique values exceeds 0.9.
- **Handling Out of Logical Range Data:** Techniques to identify and drop data that falls outside logical ranges.
- **Handling Inconsistent Data:** Methods for merging inconsistent data entries.
- **Outlier Detection:** Techniques for one-dimensional outlier detection using standard deviation and IQR, and multi-dimensional outlier detection.
- **Handling Missing Data:** Strategies for identifying, imputing, and dealing with missing data.

Prepare to dive deep into the world of data quality checks, enhancing your ability to clean, organize, and transform data into a powerful asset for any analysis or machine learning project. Let’s get started!


In [1]:
import numpy as np
import pandas as pd

# Load the dataset with the correct delimiter
file_path = '/kaggle/input/bank-loan/Bankloan.txt'
data = pd.read_csv(file_path, delimiter=',', skipinitialspace=True)

# Display the first few rows of the dataset to understand its structure
data.head(), data.columns, data.describe()


(    age   ed  employ  address  income  debtinc   creddebt   othdebt default
 0  41.0  3.0      17       12   176.0      9.3  11.359392  5.008608       1
 1  27.0  1.0      10        6    31.0     17.3   1.362202  4.000798       0
 2  40.0  1.0      15        7     NaN      5.5   0.856075  2.168925       0
 3  41.0  NaN      15       14   120.0      2.9   2.658720  0.821280       0
 4  24.0  2.0       2        0    28.0     17.3   1.787436  3.056564       1,
 Index(['age', 'ed', 'employ', 'address', 'income', 'debtinc', 'creddebt',
        'othdebt', 'default'],
       dtype='object'),
               age          ed      employ     address     income     debtinc  \
 count  681.000000  680.000000  700.000000  700.000000  663.00000  700.000000   
 mean    34.898678    1.717647    8.388571    8.268571   45.74359   10.260571   
 std      8.861849    0.925652    6.658039    6.821609   37.44108    6.827234   
 min     20.000000    1.000000    0.000000    0.000000   14.00000    0.400000   

## Dataset Overview

The dataset used in this guide is the "Bank Loan" dataset, which contains various features related to loan applicants and their financial status. This dataset is essential for performing data quality checks, as it provides a rich set of variables that can be analyzed and cleaned to ensure high-quality data for analysis.

### Columns Description

Below is a detailed description of each column in the dataset, along with their respective summary statistics:

- **age:** 
  - **Type:** Numerical
  - **Description:** The age of the loan applicant.
  - **Min Value:** 20
  - **Max Value:** 67
  - **Mean:** 34.9
  - **Median:** 34
  - **Skewness:** Slightly right-skewed

- **ed:** 
  - **Type:** Categorical (Numerical representation)
  - **Description:** The education level of the applicant. This might need to be converted into categorical data for analysis.
  - **Min Value:** 1
  - **Max Value:** 5
  - **Mode:** 2

- **employ:** 
  - **Type:** Numerical
  - **Description:** The number of years the applicant has been employed.
  - **Min Value:** 0
  - **Max Value:** 30
  - **Mean:** 8.4
  - **Median:** 7
  - **Skewness:** Right-skewed

- **address:** 
  - **Type:** Numerical
  - **Description:** The number of years the applicant has lived at their current address.
  - **Min Value:** 0
  - **Max Value:** 25
  - **Mean:** 8.3
  - **Median:** 4
  - **Skewness:** Right-skewed

- **income:** 
  - **Type:** Numerical
  - **Description:** The annual income of the applicant in thousands.
  - **Min Value:** 14.0
  - **Max Value:** 600.0
  - **Mean:** 45.7
  - **Median:** 35.0
  - **Skewness:** Highly right-skewed

- **debtinc:** 
  - **Type:** Numerical
  - **Description:** The debt-to-income ratio of the applicant, indicating the proportion of debt relative to income.
  - **Min Value:** 0.4
  - **Max Value:** 48.0
  - **Mean:** 10.3
  - **Median:** 7.5
  - **Skewness:** Right-skewed

- **creddebt:** 
  - **Type:** Numerical
  - **Description:** The amount of credit card debt the applicant has.
  - **Min Value:** 0.0
  - **Max Value:** 20.0
  - **Mean:** 3.5
  - **Median:** 2.2
  - **Skewness:** Right-skewed

- **othdebt:** 
  - **Type:** Numerical
  - **Description:** The amount of other types of debt the applicant has.
  - **Min Value:** 0.0
  - **Max Value:** 25.0
  - **Mean:** 4.8
  - **Median:** 3.2
  - **Skewness:** Right-skewed

- **default:** 
  - **Type:** Binary
  - **Description:** Indicates whether the applicant defaulted on the loan (0: No, 1: Yes).
  - **Value Counts:** 0: 517, 1: 183
  - **Proportion:** 0: 74%, 1: 26%
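
The summary figures listed above can be recomputed from the data rather than taken on trust. The sketch below is one possible way to do that; `summarize_numeric` is a hypothetical helper name, and the toy frame only reuses the first five `age` and `income` values shown in the `head()` output, so its numbers do not describe the full dataset.

```python
import pandas as pd

def summarize_numeric(df, columns):
    """Return min, max, mean, median, skewness and missing count per column."""
    # agg() with a list of names yields one row per statistic; transpose so
    # each variable becomes a row
    summary = df[columns].agg(['min', 'max', 'mean', 'median', 'skew']).T
    summary['n_missing'] = df[columns].isna().sum()
    return summary.round(2)

# Toy data: first five rows of 'age' and 'income' from the head() output
toy = pd.DataFrame({
    'age': [41.0, 27.0, 40.0, 41.0, 24.0],
    'income': [176.0, 31.0, None, 120.0, 28.0],
})

summary = summarize_numeric(toy, ['age', 'income'])
print(summary)
```

On the real data, the same call would be `summarize_numeric(data, continuous_vars)`, assuming those columns have a numeric dtype.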


## Separating Input Variables and Labels

To keep the later processing steps clear, we **separate the dataset** into two objects: a Series holding the **target variable** (the response) and a dataframe holding the **input variables** (the predictors). Quality checks can then be applied to the predictors without accidentally altering or dropping the target.

- **Input Variables:** These are the variables (or predictors) that we will use to perform analysis. In our dataset, the input features include age, education, employment duration, address duration, income, debt-to-income ratio, credit card debt, and other debts.

- **Label:** This is the target variable that we aim to analyze. In our dataset, the label is the 'default' column, which indicates whether the applicant defaulted on the loan.

In [2]:
# Separate the dataset into input variables and labels
input_vars = data.drop(columns=['default'])
label = data['default']

# Display the first few rows of both dataframes to confirm the separation
input_vars.head(), label.head()


(    age   ed  employ  address  income  debtinc   creddebt   othdebt
 0  41.0  3.0      17       12   176.0      9.3  11.359392  5.008608
 1  27.0  1.0      10        6    31.0     17.3   1.362202  4.000798
 2  40.0  1.0      15        7     NaN      5.5   0.856075  2.168925
 3  41.0  NaN      15       14   120.0      2.9   2.658720  0.821280
 4  24.0  2.0       2        0    28.0     17.3   1.787436  3.056564,
 0    1
 1    0
 2    0
 3    0
 4    1
 Name: default, dtype: object)

## Distinguishing Categorical and Continuous Variables

In our dataset, it is crucial to distinguish between **categorical** and **continuous** variables as they require different preprocessing techniques. 

Here are the lists of categorical and continuous variables in our dataset:

- **Categorical Variables:**
  - `ed` (Education Level)

- **Continuous Variables:**
  - `age` (Age of the Applicant)
  - `employ` (Years of Employment)
  - `address` (Years at Current Address)
  - `income` (Annual Income)
  - `debtinc` (Debt-to-Income Ratio)
  - `creddebt` (Credit Card Debt)
  - `othdebt` (Other Debt)


In [3]:
# List of categorical variables
categorical_vars = ['ed']

# List of continuous variables
continuous_vars = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']

# Display the lists
print("Categorical Variables:", categorical_vars)
print("Continuous Variables:", continuous_vars)


Categorical Variables: ['ed']
Continuous Variables: ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']
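
The lists above were written by hand. For wider datasets, a first guess can be made from the number of distinct values per column and then reviewed manually. The sketch below illustrates that heuristic; `split_by_cardinality` and its `max_categories` cutoff are assumptions for illustration, not part of the original workflow, and the toy values are made up.

```python
import pandas as pd

def split_by_cardinality(df, max_categories=10):
    """Guess categorical vs. continuous columns from the distinct-value count."""
    categorical, continuous = [], []
    for col in df.columns:
        # nunique() ignores missing values by default
        if df[col].nunique() <= max_categories:
            categorical.append(col)
        else:
            continuous.append(col)
    return categorical, continuous

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'ed': [1, 2, 3, 1, 2, 3],
    'income': [10.5, 20.1, 30.2, 40.3, 50.4, 60.5],
})

cat_cols, cont_cols = split_by_cardinality(toy, max_categories=3)
print("Categorical guess:", cat_cols)
print("Continuous guess:", cont_cols)
```

A cutoff like this is only a starting point: an integer column such as years of employment can land on either side depending on the threshold, so the result should always be checked against the meaning of each column.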


# Feature Screening

Filter out these variables:

- **Variables with a coefficient of variation less than 0.1 for continuous variables**  
  Identifying and screening out **continuous variables** with low variability ensures that the selected variables provide **meaningful information** for analysis and modeling.

- **Variables where the mode category percentage is greater than 95% for categorical variables**  
  This step removes **categorical variables** where one category overwhelmingly dominates. Such variables carry almost no information for distinguishing observations, so dropping them streamlines the dataset and makes the resulting models easier to interpret.

- **Variables with a percentage of unique categories exceeding 90% for categorical variables**  
  Screening out **categorical variables** with a high percentage of unique categories contributes to simplifying the dataset and mitigating the risk of overfitting, ensuring a more robust and generalizable model.


In [4]:
# Define the functions for feature screening

def drop_low_variability_continuous(df, continuous_vars, threshold=0.1):
    """
    Drop continuous variables with a coefficient of variation less than the threshold.
    
    Parameters:
    df (DataFrame): The input DataFrame.
    continuous_vars (list): List of continuous variable column names.
    threshold (float): The coefficient of variation threshold.
    
    Returns:
    df_filtered (DataFrame): The DataFrame after dropping low variability continuous variables.
    low_var_cols (list): List of dropped low variability continuous variables.
    """
    low_var_cols = [col for col in continuous_vars if (df[col].std() / df[col].mean()) < threshold]
    df_filtered = df.drop(columns=low_var_cols)
    return df_filtered, low_var_cols

def drop_high_mode_categorical(df, categorical_vars, threshold=0.95):
    """
    Drop categorical variables where the mode category percentage is greater than the threshold.
    
    Parameters:
    df (DataFrame): The input DataFrame.
    categorical_vars (list): List of categorical variable column names.
    threshold (float): The mode category percentage threshold.
    
    Returns:
    df_filtered (DataFrame): The DataFrame after dropping high mode categorical variables.
    high_mode_cols (list): List of dropped high mode categorical variables.
    """
    high_mode_cols = [col for col in categorical_vars if (df[col].value_counts().max() / len(df)) > threshold]
    df_filtered = df.drop(columns=high_mode_cols)
    return df_filtered, high_mode_cols

def drop_high_unique_categorical(df, categorical_vars, threshold=0.9):
    """
    Drop categorical variables with a percentage of unique categories exceeding the threshold.
    
    Parameters:
    df (DataFrame): The input DataFrame.
    categorical_vars (list): List of categorical variable column names.
    threshold (float): The unique category percentage threshold.
    
    Returns:
    df_filtered (DataFrame): The DataFrame after dropping high unique categorical variables.
    high_unique_cols (list): List of dropped high unique categorical variables.
    """
    # Skip columns already removed by an earlier filter to avoid a KeyError
    high_unique_cols = [col for col in categorical_vars if col in df.columns and (df[col].nunique() / len(df)) > threshold]
    df_filtered = df.drop(columns=high_unique_cols)
    return df_filtered, high_unique_cols

# Variable lists used for the screening checks in this cell.
# Note: unlike the lists defined earlier, 'employ' and 'address' are treated
# here as categorical for the mode and uniqueness checks.
continuous_vars = ['age', 'income', 'debtinc', 'creddebt', 'othdebt']
categorical_vars = ['ed', 'employ', 'address']

# Apply the filters to the input variables only, so the target is never screened
data_filtered, low_var_cols = drop_low_variability_continuous(input_vars, continuous_vars)
data_filtered, high_mode_cols = drop_high_mode_categorical(data_filtered, categorical_vars)
data_filtered, high_unique_cols = drop_high_unique_categorical(data_filtered, categorical_vars)

# Check which columns were dropped
dropped_columns = low_var_cols + high_mode_cols + high_unique_cols

# Add the target variable back to the filtered dataframe
filtered_df = data_filtered.copy()
filtered_df['default'] = label

# Display the results
print("Dropped columns:", dropped_columns)
print("Filtered DataFrame:")
print(filtered_df.head())


Dropped columns: []
Filtered DataFrame:
    age   ed  employ  address  income  debtinc   creddebt   othdebt default
0  41.0  3.0      17       12   176.0      9.3  11.359392  5.008608       1
1  27.0  1.0      10        6    31.0     17.3   1.362202  4.000798       0
2  40.0  1.0      15        7     NaN      5.5   0.856075  2.168925       0
3  41.0  NaN      15       14   120.0      2.9   2.658720  0.821280       0
4  24.0  2.0       2        0    28.0     17.3   1.787436  3.056564       1
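
An empty list of dropped columns says nothing about how close each variable came to a threshold. The sketch below prints the three screening statistics per column so the decision can be inspected. `screening_report` is a hypothetical helper, and the toy frame reuses only the first five `age` and `ed` values from the output above, so the numbers are illustrative.

```python
import pandas as pd

def screening_report(df, continuous_vars, categorical_vars):
    """Tabulate the statistics that the three screening rules compare to thresholds."""
    rows = []
    for col in continuous_vars:
        mean = df[col].mean()
        # Coefficient of variation = std / |mean|; undefined when the mean is 0
        cv = df[col].std() / abs(mean) if mean != 0 else float('nan')
        rows.append({'column': col, 'kind': 'continuous', 'cv': cv})
    for col in categorical_vars:
        counts = df[col].value_counts()  # missing values are excluded
        rows.append({
            'column': col,
            'kind': 'categorical',
            'mode_share': counts.max() / len(df),
            'unique_share': df[col].nunique() / len(df),
        })
    return pd.DataFrame(rows).set_index('column')

# Toy data: first five rows of 'age' and 'ed' from the output above
toy = pd.DataFrame({
    'age': [41.0, 27.0, 40.0, 41.0, 24.0],
    'ed': [3.0, 1.0, 1.0, None, 2.0],
})

report = screening_report(toy, ['age'], ['ed'])
print(report)
```

Note that both shares are divided by the total row count, matching the functions above, so missing values lower the reported shares.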


# Handling Out of Logical Range Data

In this section, we will identify and remove data that falls outside predefined logical ranges for specific columns. Ensuring that our data is within reasonable and logical ranges is a crucial step in data quality checks. This helps in removing potential data entry errors or anomalies that could negatively impact our analysis.

We will define logical ranges for each relevant column and then filter out rows that do not fall within these ranges.

### Logical Ranges

Here are the logical ranges we will use for our columns:

- **age**: 18 to 70
- **employ**: 0 to 31
- **address**: 0 to 80
- **income**: 0 to 1000
- **debtinc**: 0 to 100
- **creddebt**: 0 to 30
- **othdebt**: 0 to 30


In [5]:
# Define logical ranges for each column
column_ranges = {
    'age': (18, 70),
    'employ': (0, 31),
    'address': (0, 80),
    'income': (0, 1000),
    'debtinc': (0, 100),
    'creddebt': (0, 30),
    'othdebt': (0, 30)
}

def filter_logical_ranges(df, column_ranges):
    """
    Filter out rows where column values fall outside the defined logical ranges.
    
    Parameters:
    df (DataFrame): The input DataFrame.
    column_ranges (dict): A dictionary where keys are column names and values are tuples defining the logical range (min, max).
    
    Returns:
    df_filtered (DataFrame): The DataFrame after filtering out rows with out-of-range values.
    """
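    for column, (min_val, max_val) in column_ranges.items():
        # Series.between is inclusive on both ends. Missing values evaluate to
        # False, so rows with NaN in a range-checked column are dropped as well.
        df = df[df[column].between(min_val, max_val)]
    return df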

# Apply the filter to the DataFrame
filtered_df = filter_logical_ranges(filtered_df, column_ranges)

# Display the results
print("Data after filtering out-of-range values:")
print(filtered_df.head())


Data after filtering out-of-range values:
    age   ed  employ  address  income  debtinc   creddebt   othdebt default
0  41.0  3.0      17       12   176.0      9.3  11.359392  5.008608       1
1  27.0  1.0      10        6    31.0     17.3   1.362202  4.000798       0
3  41.0  NaN      15       14   120.0      2.9   2.658720  0.821280       0
4  24.0  2.0       2        0    28.0     17.3   1.787436  3.056564       1
5  41.0  2.0       5        5    25.0     10.2   0.392700  2.157300       0
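
Row 2 disappears from the output above because its `income` is missing, not because a value is out of range: a range comparison against NaN is false. Before dropping rows, it can help to count the two cases separately. The sketch below is one way to do that; `range_violation_report` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def range_violation_report(df, column_ranges):
    """Count missing and genuinely out-of-range values per range-checked column."""
    rows = []
    for column, (min_val, max_val) in column_ranges.items():
        values = df[column]
        missing = values.isna()
        # Present but outside the inclusive [min_val, max_val] interval
        out_of_range = ~missing & ~values.between(min_val, max_val)
        rows.append({
            'column': column,
            'n_missing': int(missing.sum()),
            'n_out_of_range': int(out_of_range.sum()),
        })
    return pd.DataFrame(rows).set_index('column')

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'age': [41.0, 17.0, 40.0, None],
    'income': [176.0, 31.0, None, 1200.0],
})

report = range_violation_report(toy, {'age': (18, 70), 'income': (0, 1000)})
print(report)
```

If the missing values should be imputed later rather than dropped here, the filter could keep rows where the value is missing and remove only the genuinely out-of-range ones.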


# Handling Inconsistent Data

Inconsistent data must be resolved before results can be trusted. In categorical variables, inconsistencies arise from data entry errors or from mismatches when sources are integrated, and they add noise that can lead to misleading findings. For instance, one employee may enter customer addresses as "block 1/23", while another may use "block 1-23".

Resolving these inconsistencies gives a more coherent and accurate picture of the underlying information and makes analysis reports more credible. A dataset with consistent codes is a sounder basis for statistical analysis and decision-making. A frequency table is a simple way to expose such inconsistencies, because it lists every distinct value together with its count and share.




In [6]:
def frequency_table(variable):
    """
    Generate a frequency table for a given variable.
    
    Parameters:
    variable (Series): The input pandas Series for which to generate the frequency table.
    
    Returns:
    None
    """
    # Drop NA values for accurate counts
    variable = variable.dropna()
    
    # Get unique elements and their counts
    unique_elements, counts = np.unique(variable, return_counts=True)
    
    # Calculate percentages
    percentages = (counts / len(variable)) * 100
    
    # Print the value counts and percentages in a formatted way
    print("Value Counts and Percentages:")
    for i in range(len(unique_elements)):
        print(f"{unique_elements[i]}: Count: {counts[i]}, Percentage: {percentages[i]:.2f}%")
    return

# Example usage with the 'default' column from the filtered DataFrame
frequency_table(filtered_df['default'])


Value Counts and Percentages:
'0': Count: 1, Percentage: 0.16%
0: Count: 475, Percentage: 73.76%
1: Count: 167, Percentage: 25.93%
:0: Count: 1, Percentage: 0.16%
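
The table shows four distinct values in a column that should hold only two. One way to merge the variants is an explicit mapping, sketched below. Treating `'0'` and `:0` as mistyped entries of class 0 is an assumption that should be confirmed against the data source; `LABEL_MAP` and `merge_label_variants` are hypothetical names, and the toy Series is made up.

```python
import pandas as pd

# Assumed mapping from raw entries to clean classes; the two stray variants
# are taken to be mistyped zeros
LABEL_MAP = {"0": 0, "1": 1, "'0'": 0, ":0": 0}

def merge_label_variants(series, mapping):
    """Map raw label strings to clean classes; unmapped entries become NaN."""
    as_text = series.astype(str).str.strip()
    return as_text.map(mapping)

# Toy labels reproducing the variants seen in the frequency table
toy_labels = pd.Series(["0", "1", "'0'", ":0", " 1"])

cleaned = merge_label_variants(toy_labels, LABEL_MAP)
print(cleaned.value_counts())
```

An explicit mapping is used here instead of stripping all non-digit characters because any unexpected value then surfaces as NaN for review rather than being silently coerced into a class.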


# Work in progress...