# 🌳 📝 Complete Guide to Data Quality from A to Z

Welcome to this comprehensive guide on data quality, designed to equip you with the knowledge and skills to ensure the integrity and reliability of your datasets. Whether you're a budding data scientist or a seasoned professional looking to refine your data quality management skills, this notebook is tailored for you!

## What Will You Learn?

In this guide, we will explore various methods to assess, clean, and maintain data quality, ensuring you have the tools to confidently tackle any data-driven challenge. Here's what we'll cover:

- **Feature Screening**: Learn how to identify and screen out features that do not contribute meaningful information to your analysis and modeling.
  - **Features with a Coefficient of Variation Less than 0.1 for Continuous Variables**: Retain only those continuous features with significant variability.
  
  - **Features where the Mode Category Percentage is Greater than 95% for Categorical Variables**: Streamline your dataset by focusing on dominant categorical features.
  
  - **Features with a Percentage of Unique Categories Exceeding 90% for Categorical Variables**: Simplify your dataset by removing overly unique categorical features.
  
  

- **Handling Out of Logical Range Data**: Address and correct values that fall outside logical ranges to maintain dataset integrity.

- **Handling Inconsistent Data**: Resolve inconsistencies in categorical data to enhance the reliability of your analysis.

- **Data Leakage**: Understand and prevent data leakage to ensure your machine learning models are robust and generalizable.

- **Outlier Detection**: Employ one-dimensional and multidimensional methods to identify and manage outliers in your data.

- **Handling Missing Data**: Learn various techniques for dealing with missing data, from simple imputation to advanced methods.

## Why This Guide?

- **Step-by-Step Tutorials**: Each section includes clear explanations followed by practical examples, ensuring you not only learn but also apply your knowledge.
- **Interactive Learning**: Engage with interactive code cells that allow you to see the effects of data quality methods in real-time.

### How to Use This Notebook

- **Run the Cells**: Follow along with the code examples by running the cells yourself. Modify the parameters to see how the results change.
- **Explore Further**: After completing the guided sections, try applying the methods to your own datasets to reinforce your learning.

Prepare to unlock the full potential of data quality management in data analysis. Let's dive in and transform data into reliable insights!


## Reading the Dataset

To begin our analysis, we'll start by loading the dataset. This dataset contains information about bank loans, including various features such as age, education level, employment duration, address duration, income, debt-to-income ratio, credit card debt, other debt, and loan default status.


In [1]:
import pandas as pd

# Load the dataset into a pandas DataFrame
file_path = '/kaggle/input/bank-loan/Bankloan.txt' 
dataset = pd.read_csv(file_path, delimiter=",")

dataset.head()


Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17,12,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10,6,31.0,17.3,1.362202,4.000798,0
2,40.0,1.0,15,7,,5.5,0.856075,2.168925,0
3,41.0,,15,14,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2,0,28.0,17.3,1.787436,3.056564,1


## Dataset Explanation

The dataset contains the following columns:

- **age**: The age of the applicant.
  - **Type**: Numerical
  - **Min**: 20
  - **Max**: 67
  - **Mean**: 35.95
  - **Median**: 34
  - **Standard Deviation**: 11.36
  - **Skewness**: 0.45 (slightly right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the age in years. This variable is important for understanding the demographic distribution of the applicants.


- **ed**: The education level of the applicant, represented as an integer.
  - **Type**: Categorical (Ordinal)
  - **Min**: 1
  - **Max**: 5
  - **Mode**: 1
  - **Missing Values**: 24 (4% of the dataset)
  - **Details**: Represents the education level, where higher values indicate higher levels of education. This variable helps in assessing the education background of applicants.


- **employ**: The number of years the applicant has been employed.
  - **Type**: Numerical
  - **Min**: 0
  - **Max**: 35
  - **Mean**: 9.8
  - **Median**: 8
  - **Standard Deviation**: 8.2
  - **Skewness**: 0.65 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the number of years in employment. This variable is crucial for understanding the work experience of the applicants.


- **address**: The number of years the applicant has lived at their current address.
  - **Type**: Numerical
  - **Min**: 0
  - **Max**: 25
  - **Mean**: 6.9
  - **Median**: 4
  - **Standard Deviation**: 7.2
  - **Skewness**: 0.95 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the number of years at the current address. This variable helps in understanding the stability of the applicants' living situation.


- **income**: The annual income of the applicant in thousands of dollars.
  - **Type**: Numerical
  - **Min**: 8.0
  - **Max**: 636.0
  - **Mean**: 70.55
  - **Median**: 40.0
  - **Standard Deviation**: 66.4
  - **Skewness**: 2.12 (highly right-skewed)
  - **Missing Values**: 38 (6.3% of the dataset)
  - **Details**: Represents the annual income in thousands. This variable is essential for assessing the financial status of the applicants.


- **debtinc**: The debt-to-income ratio of the applicant, expressed as a percentage.
  - **Type**: Numerical
  - **Min**: 0.0
  - **Max**: 69.9
  - **Mean**: 10.1
  - **Median**: 8.9
  - **Standard Deviation**: 8.7
  - **Skewness**: 1.4 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the debt-to-income ratio. This variable helps in understanding the financial burden on the applicants.


- **creddebt**: The amount of credit card debt the applicant has, in thousands of dollars.
  - **Type**: Numerical
  - **Min**: 0.0
  - **Max**: 11.36
  - **Mean**: 3.55
  - **Median**: 2.30
  - **Standard Deviation**: 3.41
  - **Skewness**: 0.75 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the credit card debt in thousands. This variable indicates the credit card liabilities of the applicants.


- **othdebt**: The amount of other debt the applicant has, in thousands of dollars.
  - **Type**: Numerical
  - **Min**: 0.0
  - **Max**: 11.0
  - **Mean**: 3.02
  - **Median**: 2.20
  - **Standard Deviation**: 2.24
  - **Skewness**: 0.95 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents other debts in thousands. This variable shows additional financial liabilities apart from credit card debt.
  

- **default**: The default status of the loan, where 1 indicates default and 0 indicates no default.
  - **Type**: Categorical (Binary)
  - **Unique Values**: [0, 1]
  - **Mode**: 0
  - **Missing Values**: 0
  - **Details**: Binary indicator of loan default status. This is the target variable for modeling and analysis.

If you want to learn how to perform detailed data profiling and obtain these insights, visit the [Complete Guide to Data Profiling A to Z](https://www.kaggle.com/code/matinmahmoudi/complete-guide-to-data-profiling-a-to-z).


# Feature Screening

Feature screening is a crucial step in the data quality process that involves identifying and removing features (variables) that do not contribute meaningful information to the analysis or modeling. By screening out such features, we can streamline the dataset, improve model performance, and enhance interpretability. In this section, we will cover three specific criteria for feature screening:

### Features with a Coefficient of Variation Less than 0.1 for Continuous Variables

The coefficient of variation (CV) is a measure of relative variability. It is calculated as the ratio of the standard deviation to the mean. Features with a CV less than 0.1 are considered to have low variability and may not provide significant information for analysis. We will identify and remove such features.

### Features where the Mode Category Percentage is Greater than 95% for Categorical Variables

Categorical variables where a single category overwhelmingly dominates (mode category percentage > 95%) may not be useful for analysis as they do not provide much variation. We will identify and remove these categorical features to streamline the dataset.

### Features with a Percentage of Unique Categories Exceeding 90% for Categorical Variables

Categorical variables with a high percentage of unique categories ( > 90%) can complicate the analysis and lead to overfitting in models. We will identify and remove these features to ensure a more robust and generalizable model.

By applying these screening criteria, we can ensure that the remaining features in the dataset provide meaningful and relevant information for subsequent analysis.


In [2]:
# Separate the dataset into input variables (predictors) and target variable (response)
label = dataset['default']
inputs = dataset.drop(columns=['default'])

categorical_columns = ['ed']  
numerical_columns = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']

# Calculate Coefficient of Variation for continuous variables
cv = inputs[numerical_columns].std() / inputs[numerical_columns].mean()

# Identify features with CV less than 0.1
low_cv_features = cv[cv < 0.1].index.tolist()
print("Features with Coefficient of Variation less than 0.1:", low_cv_features)

# Calculate Mode Category Percentage for categorical variables
mode_percentage = inputs[categorical_columns].apply(lambda x: x.value_counts(normalize=True).max() * 100)

# Identify features where the mode category percentage is greater than 95%
high_mode_features = mode_percentage[mode_percentage > 95].index.tolist()
print("Categorical features where mode category percentage is greater than 95%:", high_mode_features)

# Calculate Percentage of Unique Categories for categorical variables
unique_category_percentage = inputs[categorical_columns].nunique() / len(inputs) * 100

# Identify features with a percentage of unique categories exceeding 90%
high_unique_features = unique_category_percentage[unique_category_percentage > 90].index.tolist()
print("Categorical features with percentage of unique categories exceeding 90%:", high_unique_features)

# Combine all features to be removed
features_to_remove = set(low_cv_features + high_mode_features + high_unique_features)
print("Features to be removed:", features_to_remove)

# Remove the identified features from the inputs dataframe
cleaned_inputs = inputs.drop(columns=features_to_remove)

# Combine the cleaned inputs with the label
cleaned_dataset = pd.concat([cleaned_inputs, label], axis=1)

# Display the cleaned dataset
cleaned_dataset.head()


Features with Coefficient of Variation less than 0.1: []
Categorical features where mode category percentage is greater than 95%: []
Categorical features with percentage of unique categories exceeding 90%: []
Features to be removed: set()


Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17,12,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10,6,31.0,17.3,1.362202,4.000798,0
2,40.0,1.0,15,7,,5.5,0.856075,2.168925,0
3,41.0,,15,14,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2,0,28.0,17.3,1.787436,3.056564,1


# Handling Out of Logical Range Data

In data analysis, handling values that fall outside the logical range of respective fields is a critical step to maintain the integrity and reliability of the dataset. Values significantly deviating from the expected range can distort analytical results and impact the overall quality of findings. It is essential to define these ranges based on domain knowledge, business rules, and the specific context of the data.

### Defining Logical Ranges

To ensure the data is within logical boundaries, we define acceptable ranges for each column based on reasonable assumptions and domain knowledge. Here are the defined ranges for each column in our dataset:

- **age**: 18 to 70 years - This range assumes the typical age range of bank loan applicants.
- **employ**: 0 to 31 years - This range covers the typical employment duration for most individuals.
- **address**: 0 to 80 years - This range represents the duration someone might live at a given address.
- **income**: 0 to 1000 thousand dollars - This upper limit is set to include high-income individuals while excluding outliers.
- **debtinc**: 0 to 100 percent - This range covers the debt-to-income ratio, with 100% being the upper logical limit.
- **creddebt**: 0 to 30 thousand dollars - This range is set to include typical credit card debt amounts.
- **othdebt**: 0 to 30 thousand dollars - This range includes other types of debt and is set to exclude extreme outliers.

By adhering to these logical ranges, we can filter out anomalous data points that may otherwise skew our analysis and ensure a more accurate and reliable dataset.


In [3]:
# Define ranges for each column
column_ranges = {
    'age': (18, 70),
    'employ': (0, 31),
    'address': (0, 80),
    'income': (0, 1000),
    'debtinc': (0, 100),
    'creddebt': (0, 30),
    'othdebt': (0, 30)
}

# Apply the ranges to filter the dataframe using lambda
for column, (min_val, max_val) in column_ranges.items():
    cleaned_inputs = cleaned_inputs[cleaned_inputs[column].apply(lambda x: min_val <= x <= max_val)]

# Combine the cleaned inputs with the label
cleaned_dataset = pd.concat([cleaned_inputs, label], axis=1)

# Display the cleaned dataset
cleaned_dataset.head()


Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17.0,12.0,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10.0,6.0,31.0,17.3,1.362202,4.000798,0
3,41.0,,15.0,14.0,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2.0,0.0,28.0,17.3,1.787436,3.056564,1
5,41.0,2.0,5.0,5.0,25.0,10.2,0.3927,2.1573,0


# Handling Inconsistent Data

In the area of data analysis, addressing inconsistent data is a fundamental task to ensure the reliability of results. Inconsistent data in categorical variables, whether due to data entry errors or discrepancies in data integration, can introduce noise and inaccuracies into the dataset, potentially leading to misleading findings.

### Detecting and Correcting Inconsistent Data

To detect inconsistent data, we generate frequency tables for each categorical variable, including the label. This helps us identify categories that may have been entered incorrectly or inconsistently. Once detected, we correct these inconsistencies to ensure a cohesive and accurate dataset.

For example, the frequency table for the `default` column revealed inconsistencies such as `'0'` and `':0'`. We will correct these to ensure consistency.


In [4]:
# Generate frequency tables for each categorical variable
categorical_columns = ['ed', 'default'] 

# Display frequency tables
for column in categorical_columns:
    print(f"Frequency table for {column}:")
    print(cleaned_dataset[column].value_counts())
    print("\n")

# Correct inconsistencies in the 'default' column
cleaned_dataset['default'] = cleaned_dataset['default'].replace({"'0'": 0, ':0': 0})
cleaned_dataset['default'] = cleaned_dataset['default'].astype(int)

# Verify correction
print("Corrected Frequency table for 'default':")
print(cleaned_dataset['default'].value_counts())
print("\n")

# Separate the cleaned inputs and label
cleaned_inputs = cleaned_dataset.drop(columns=['default'])
label = cleaned_dataset['default']

# Display the cleaned dataset
cleaned_dataset = pd.concat([cleaned_inputs, label], axis=1)
cleaned_dataset.head()


Frequency table for ed:
ed
1.0    330
2.0    182
3.0     76
4.0     32
5.0      5
Name: count, dtype: int64


Frequency table for default:
default
0      515
1      183
'0'      1
:0       1
Name: count, dtype: int64


Corrected Frequency table for 'default':
default
0    517
1    183
Name: count, dtype: int64




Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17.0,12.0,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10.0,6.0,31.0,17.3,1.362202,4.000798,0
3,41.0,,15.0,14.0,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2.0,0.0,28.0,17.3,1.787436,3.056564,1
5,41.0,2.0,5.0,5.0,25.0,10.2,0.3927,2.1573,0


## Data Leakage

Data leakage poses a significant challenge in the area of machine learning and data analytics, emphasizing the critical importance of a well-considered evaluation design. Data leakage occurs when information from the test set unintentionally influences the training process, leading to over-optimistic model performance. To mitigate this risk, adopting a robust evaluation design becomes imperative.

### Understanding Data Leakage

Data leakage can manifest in various forms, such as:

1. **Train-Test Contamination**: When data from the test set influences the training set, leading to artificially high performance metrics.
2. **Temporal Leakage**: Occurs in time-series data when future information is used to predict past events.
3. **Feature Leakage**: When features that are highly correlated with the target variable are included in the training data, but would not be available in a real-world scenario.

### Preventing Data Leakage

To prevent data leakage, it is essential to:

1. **Clearly Separate Training and Test Data**: Ensure that the training data does not contain any information from the test set.
2. **Use Temporal Split for Time-Series Data**: When working with time-series data, use a temporal split to ensure that past data is used to predict future events.
3. **Remove Highly Correlated Features**: Identify and remove features that are highly correlated with the target variable and would not be available in a real-world scenario.

By adhering to these principles, we can guard against data leakage and contribute to the development of more reliable and generalizable machine learning models.


In [5]:
from sklearn.model_selection import train_test_split

# Separate the cleaned inputs and label into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cleaned_inputs, label, test_size=0.2, random_state=42)

# Further separate the continuous and categorical columns in the training and test sets
categorical_columns = ['ed']  # Add other categorical columns as needed
numerical_columns = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']

# Separate continuous and categorical columns in the training set
X_train_continuous = X_train[numerical_columns]
X_train_categorical = X_train[categorical_columns]

# Separate continuous and categorical columns in the test set
X_test_continuous = X_test[numerical_columns]
X_test_categorical = X_test[categorical_columns]

# Display the shapes of the training and test datasets to ensure correctness
print(f"Training data (continuous) shape: {X_train_continuous.shape}")
print(f"Training data (categorical) shape: {X_train_categorical.shape}")
print(f"Test data (continuous) shape: {X_test_continuous.shape}")
print(f"Test data (categorical) shape: {X_test_categorical.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test labels shape: {y_test.shape}")


Training data (continuous) shape: (560, 7)
Training data (categorical) shape: (560, 1)
Test data (continuous) shape: (140, 7)
Test data (categorical) shape: (140, 1)
Training labels shape: (560,)
Test labels shape: (140,)


# On work...