# 🌳 📝 Complete Guide to Data Quality A to Z

Welcome to this comprehensive guide on data quality, designed to equip you with the knowledge and skills to ensure the integrity and reliability of your datasets. Whether you're a budding data scientist or a seasoned professional looking to refine your data quality management skills, this notebook is tailored for you!

## What Will You Learn?

In this guide, we will explore various methods to assess, clean, and maintain data quality, ensuring you have the tools to confidently tackle any data-driven challenge. Here's what we'll cover:

- **Feature Screening**: Learn how to identify and screen out features that do not contribute meaningful information to your analysis and modeling.
  - **Features with a Coefficient of Variation Less than 0.1 for Continuous Variables**: Retain only those continuous features with significant variability.
  
  - **Features where the Mode Category Percentage is Greater than 95% for Categorical Variables**: Streamline your dataset by focusing on dominant categorical features.
  
  - **Features with a Percentage of Unique Categories Exceeding 90% for Categorical Variables**: Simplify your dataset by removing overly unique categorical features.
  
  

- **Handling Out of Logical Range Data**: Address and correct values that fall outside logical ranges to maintain dataset integrity.

- **Handling Inconsistent Data**: Resolve inconsistencies in categorical data to enhance the reliability of your analysis.

- **Data Leakage**: Understand and prevent data leakage to ensure your machine learning models are robust and generalizable.

- **Outlier Detection**: Employ one-dimensional and multidimensional methods to identify and manage outliers in your data.

- **Handling Missing Data**: Learn various techniques for dealing with missing data, from simple imputation to advanced methods.

## Why This Guide?

- **Step-by-Step Tutorials**: Each section includes clear explanations followed by practical examples, ensuring you not only learn but also apply your knowledge.
- **Interactive Learning**: Engage with interactive code cells that allow you to see the effects of data quality methods in real-time.

### How to Use This Notebook

- **Run the Cells**: Follow along with the code examples by running the cells yourself. Modify the parameters to see how the results change.
- **Explore Further**: After completing the guided sections, try applying the methods to your own datasets to reinforce your learning.

Prepare to unlock the full potential of data quality management in data analysis. Let's dive in and transform data into reliable insights!


## Reading the Dataset

To begin our analysis, we'll start by loading the dataset. This dataset contains information about bank loans, including various features such as age, education level, employment duration, address duration, income, debt-to-income ratio, credit card debt, other debt, and loan default status.


In [1]:
import pandas as pd

# Load the dataset into a pandas DataFrame
file_path = '/kaggle/input/loans-and-liability/LoanData_Raw_v1.0.csv' 
dataset = pd.read_csv(file_path, delimiter=",")

dataset.head(20)


Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17,12,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10,6,31.0,17.3,1.362202,4.000798,0
2,40.0,1.0,15,7,,5.5,0.856075,2.168925,0
3,41.0,,15,14,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2,0,28.0,17.3,1.787436,3.056564,1
5,41.0,2.0,5,5,25.0,10.2,0.3927,2.1573,0
6,39.0,1.0,20,9,,30.6,3.833874,16.668126,0
7,,1.0,12,11,38.0,3.6,0.128592,1.239408,0
8,24.0,1.0,3,4,19.0,24.4,1.358348,3.277652,1
9,36.0,1.0,0,13,25.0,19.7,2.7777,2.1473,0


## Dataset Explanation

The dataset contains the following columns:

- **age**: The age of the applicant.
  - **Type**: Numerical
  - **Min**: 20
  - **Max**: 67
  - **Mean**: 35.95
  - **Median**: 34
  - **Standard Deviation**: 11.36
  - **Skewness**: 0.45 (slightly right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the age in years. This variable is important for understanding the demographic distribution of the applicants.


- **ed**: The education level of the applicant, represented as an integer.
  - **Type**: Categorical (Ordinal)
  - **Min**: 1
  - **Max**: 5
  - **Mode**: 1
  - **Missing Values**: 24 (4% of the dataset)
  - **Details**: Represents the education level, where higher values indicate higher levels of education. This variable helps in assessing the education background of applicants.


- **employ**: The number of years the applicant has been employed.
  - **Type**: Numerical
  - **Min**: 0
  - **Max**: 35
  - **Mean**: 9.8
  - **Median**: 8
  - **Standard Deviation**: 8.2
  - **Skewness**: 0.65 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the number of years in employment. This variable is crucial for understanding the work experience of the applicants.


- **address**: The number of years the applicant has lived at their current address.
  - **Type**: Numerical
  - **Min**: 0
  - **Max**: 25
  - **Mean**: 6.9
  - **Median**: 4
  - **Standard Deviation**: 7.2
  - **Skewness**: 0.95 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the number of years at the current address. This variable helps in understanding the stability of the applicants' living situation.


- **income**: The annual income of the applicant in thousands of dollars.
  - **Type**: Numerical
  - **Min**: 8.0
  - **Max**: 636.0
  - **Mean**: 70.55
  - **Median**: 40.0
  - **Standard Deviation**: 66.4
  - **Skewness**: 2.12 (highly right-skewed)
  - **Missing Values**: 38 (6.3% of the dataset)
  - **Details**: Represents the annual income in thousands. This variable is essential for assessing the financial status of the applicants.


- **debtinc**: The debt-to-income ratio of the applicant, expressed as a percentage.
  - **Type**: Numerical
  - **Min**: 0.0
  - **Max**: 69.9
  - **Mean**: 10.1
  - **Median**: 8.9
  - **Standard Deviation**: 8.7
  - **Skewness**: 1.4 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the debt-to-income ratio. This variable helps in understanding the financial burden on the applicants.


- **creddebt**: The amount of credit card debt the applicant has, in thousands of dollars.
  - **Type**: Numerical
  - **Min**: 0.0
  - **Max**: 11.36
  - **Mean**: 3.55
  - **Median**: 2.30
  - **Standard Deviation**: 3.41
  - **Skewness**: 0.75 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents the credit card debt in thousands. This variable indicates the credit card liabilities of the applicants.


- **othdebt**: The amount of other debt the applicant has, in thousands of dollars.
  - **Type**: Numerical
  - **Min**: 0.0
  - **Max**: 11.0
  - **Mean**: 3.02
  - **Median**: 2.20
  - **Standard Deviation**: 2.24
  - **Skewness**: 0.95 (moderately right-skewed)
  - **Missing Values**: 0
  - **Details**: Represents other debts in thousands. This variable shows additional financial liabilities apart from credit card debt.
  

- **default**: The default status of the loan, where 1 indicates default and 0 indicates no default.
  - **Type**: Categorical (Binary)
  - **Unique Values**: [0, 1]
  - **Mode**: 0
  - **Missing Values**: 0
  - **Details**: Binary indicator of loan default status. This is the target variable for modeling and analysis.

If you want to learn how to perform detailed data profiling and obtain these insights, visit the [Complete Guide to Data Profiling A to Z](https://www.kaggle.com/code/matinmahmoudi/complete-guide-to-data-profiling-a-to-z).


# Feature Screening

Feature screening is a crucial step in the data quality process that involves identifying and removing features (variables) that do not contribute meaningful information to the analysis or modeling. By screening out such features, we can streamline the dataset, improve model performance, and enhance interpretability. In this section, we will cover three specific criteria for feature screening:

### Features with a Coefficient of Variation Less than 0.1 for Continuous Variables

The coefficient of variation (CV) is a measure of relative variability. It is calculated as the ratio of the standard deviation to the mean. Features with a CV less than 0.1 are considered to have low variability and may not provide significant information for analysis. We will identify and remove such features.

### Features where the Mode Category Percentage is Greater than 95% for Categorical Variables

Categorical variables where a single category overwhelmingly dominates (mode category percentage > 95%) may not be useful for analysis as they do not provide much variation. We will identify and remove these categorical features to streamline the dataset.

### Features with a Percentage of Unique Categories Exceeding 90% for Categorical Variables

Categorical variables with a high percentage of unique categories ( > 90%) can complicate the analysis and lead to overfitting in models. We will identify and remove these features to ensure a more robust and generalizable model.

By applying these screening criteria, we can ensure that the remaining features in the dataset provide meaningful and relevant information for subsequent analysis.


In [2]:
# Separate the dataset into input variables (predictors) and target variable (response)
label = dataset['default']
inputs = dataset.drop(columns=['default'])

categorical_columns = ['ed']  
numerical_columns = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']

# Calculate Coefficient of Variation for continuous variables
cv = inputs[numerical_columns].std() / inputs[numerical_columns].mean()

# Identify features with CV less than 0.1
low_cv_features = cv[cv < 0.1].index.tolist()
print("Features with Coefficient of Variation less than 0.1:", low_cv_features)

# Calculate Mode Category Percentage for categorical variables
mode_percentage = inputs[categorical_columns].apply(lambda x: x.value_counts(normalize=True).max() * 100)

# Identify features where the mode category percentage is greater than 95%
high_mode_features = mode_percentage[mode_percentage > 95].index.tolist()
print("Categorical features where mode category percentage is greater than 95%:", high_mode_features)

# Calculate Percentage of Unique Categories for categorical variables
unique_category_percentage = inputs[categorical_columns].nunique() / len(inputs) * 100

# Identify features with a percentage of unique categories exceeding 90%
high_unique_features = unique_category_percentage[unique_category_percentage > 90].index.tolist()
print("Categorical features with percentage of unique categories exceeding 90%:", high_unique_features)

# Combine all features to be removed
features_to_remove = set(low_cv_features + high_mode_features + high_unique_features)
print("Features to be removed:", features_to_remove)

# Remove the identified features from the inputs dataframe
cleaned_inputs = inputs.drop(columns=features_to_remove)

# Combine the cleaned inputs with the label
cleaned_dataset = pd.concat([cleaned_inputs, label], axis=1)

# Display the cleaned dataset
cleaned_dataset.head(20)


Features with Coefficient of Variation less than 0.1: []
Categorical features where mode category percentage is greater than 95%: []
Categorical features with percentage of unique categories exceeding 90%: []
Features to be removed: set()


Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17,12,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10,6,31.0,17.3,1.362202,4.000798,0
2,40.0,1.0,15,7,,5.5,0.856075,2.168925,0
3,41.0,,15,14,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2,0,28.0,17.3,1.787436,3.056564,1
5,41.0,2.0,5,5,25.0,10.2,0.3927,2.1573,0
6,39.0,1.0,20,9,,30.6,3.833874,16.668126,0
7,,1.0,12,11,38.0,3.6,0.128592,1.239408,0
8,24.0,1.0,3,4,19.0,24.4,1.358348,3.277652,1
9,36.0,1.0,0,13,25.0,19.7,2.7777,2.1473,0


# Handling Out of Logical Range Data

In data analysis, handling values that fall outside the logical range of respective fields is a critical step to maintain the integrity and reliability of the dataset. Values significantly deviating from the expected range can distort analytical results and impact the overall quality of findings. It is essential to define these ranges based on domain knowledge, business rules, and the specific context of the data.

### Defining Logical Ranges

To ensure the data is within logical boundaries, we define acceptable ranges for each column based on reasonable assumptions and domain knowledge. Here are the defined ranges for each column in our dataset:

- **age**: 18 to 70 years - This range assumes the typical age range of bank loan applicants.
- **employ**: 0 to 31 years - This range covers the typical employment duration for most individuals.
- **address**: 0 to 80 years - This range represents the duration someone might live at a given address.
- **income**: 0 to 1000 thousand dollars - This upper limit is set to include high-income individuals while excluding outliers.
- **debtinc**: 0 to 100 percent - This range covers the debt-to-income ratio, with 100% being the upper logical limit.
- **creddebt**: 0 to 30 thousand dollars - This range is set to include typical credit card debt amounts.
- **othdebt**: 0 to 30 thousand dollars - This range includes other types of debt and is set to exclude extreme outliers.

By adhering to these logical ranges, we can filter out anomalous data points that may otherwise skew our analysis and ensure a more accurate and reliable dataset.


In [3]:
# Define ranges for each column
column_ranges = {
    'age': (18, 70),
    'employ': (0, 31),
    'address': (0, 80),
    'income': (0, 1000),
    'debtinc': (0, 100),
    'creddebt': (0, 30),
    'othdebt': (0, 30)
}

# Apply the ranges to filter the dataframe using lambda
for column, (min_val, max_val) in column_ranges.items():
    cleaned_inputs = cleaned_inputs[cleaned_inputs[column].apply(lambda x: min_val <= x <= max_val)]

# Combine the cleaned inputs with the label
cleaned_dataset = pd.concat([cleaned_inputs, label], axis=1)

# Display the cleaned dataset
cleaned_dataset.head(20)


Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17.0,12.0,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10.0,6.0,31.0,17.3,1.362202,4.000798,0
3,41.0,,15.0,14.0,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2.0,0.0,28.0,17.3,1.787436,3.056564,1
5,41.0,2.0,5.0,5.0,25.0,10.2,0.3927,2.1573,0
8,24.0,1.0,3.0,4.0,19.0,24.4,1.358348,3.277652,1
9,36.0,1.0,0.0,13.0,25.0,19.7,2.7777,2.1473,0
10,27.0,1.0,0.0,1.0,16.0,1.7,0.182512,0.089488,0
11,25.0,1.0,4.0,0.0,23.0,5.2,0.252356,0.943644,0
12,52.0,1.0,24.0,14.0,64.0,10.0,3.9296,2.4704,0


# Handling Inconsistent Data

In the area of data analysis, addressing inconsistent data is a fundamental task to ensure the reliability of results. Inconsistent data in categorical variables, whether due to data entry errors or discrepancies in data integration, can introduce noise and inaccuracies into the dataset, potentially leading to misleading findings.

### Detecting and Correcting Inconsistent Data

To detect inconsistent data, we generate frequency tables for each categorical variable, including the label. This helps us identify categories that may have been entered incorrectly or inconsistently. Once detected, we correct these inconsistencies to ensure a cohesive and accurate dataset.

For example, the frequency table for the `default` column revealed inconsistencies such as `'0'` and `':0'`. We will correct these to ensure consistency.


In [4]:
# Generate frequency tables for each categorical variable
categorical_columns = ['ed', 'default']  # Include the label as well for this step

# Display frequency tables
for column in categorical_columns:
    print(f"Frequency table for {column}:")
    print(cleaned_dataset[column].value_counts())
    print("\n")

# Correct inconsistencies in the 'default' column
cleaned_dataset['default'] = cleaned_dataset['default'].replace({"'0'": 0, ':0': 0})
cleaned_dataset['default'] = cleaned_dataset['default'].astype(int)

# Verify correction
print("Corrected Frequency table for 'default':")
print(cleaned_dataset['default'].value_counts())
print("\n")


# Display the cleaned dataset
cleaned_dataset.head(20)


Frequency table for ed:
ed
1.0    330
2.0    182
3.0     76
4.0     32
5.0      5
Name: count, dtype: int64


Frequency table for default:
default
0      515
1      183
'0'      1
:0       1
Name: count, dtype: int64


Corrected Frequency table for 'default':
default
0    517
1    183
Name: count, dtype: int64




Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41.0,3.0,17.0,12.0,176.0,9.3,11.359392,5.008608,1
1,27.0,1.0,10.0,6.0,31.0,17.3,1.362202,4.000798,0
3,41.0,,15.0,14.0,120.0,2.9,2.65872,0.82128,0
4,24.0,2.0,2.0,0.0,28.0,17.3,1.787436,3.056564,1
5,41.0,2.0,5.0,5.0,25.0,10.2,0.3927,2.1573,0
8,24.0,1.0,3.0,4.0,19.0,24.4,1.358348,3.277652,1
9,36.0,1.0,0.0,13.0,25.0,19.7,2.7777,2.1473,0
10,27.0,1.0,0.0,1.0,16.0,1.7,0.182512,0.089488,0
11,25.0,1.0,4.0,0.0,23.0,5.2,0.252356,0.943644,0
12,52.0,1.0,24.0,14.0,64.0,10.0,3.9296,2.4704,0


# Data Leakage

Data leakage poses a significant challenge in the area of machine learning and data analytics, emphasizing the critical importance of a well-considered evaluation design. Data leakage occurs when information from the test set unintentionally influences the training process, leading to over-optimistic model performance. To mitigate this risk, adopting a robust evaluation design becomes imperative.

### Understanding Data Leakage

Data leakage can manifest in various forms, such as:

1. **Train-Test Contamination**: When data from the test set influences the training set, leading to artificially high performance metrics.
2. **Temporal Leakage**: Occurs in time-series data when future information is used to predict past events.
3. **Feature Leakage**: When features that are highly correlated with the target variable are included in the training data, but would not be available in a real-world scenario.

### Preventing Data Leakage

To prevent data leakage, it is essential to:

1. **Clearly Separate Training and Test Data**: Ensure that the training data does not contain any information from the test set. This separation allows for an unbiased evaluation of model performance on unseen data, mimicking real-world scenarios.
2. **Use Temporal Split for Time-Series Data**: When working with time-series data, use a temporal split to ensure that past data is used to predict future events.
3. **Remove Highly Correlated Features**: Identify and remove features that are highly correlated with the target variable and would not be available in a real-world scenario.

By adhering to these principles, we can guard against data leakage and contribute to the development of more reliable and generalizable machine learning models.

### Separating Training and Test Data

In this step, we will separate our dataset into training and test sets. This separation is crucial to ensure that the model is evaluated on data it has never seen before, providing an unbiased estimate of its performance. Additionally, we will further separate the continuous and categorical variables within each set to facilitate specific preprocessing steps.


In [5]:
from sklearn.model_selection import train_test_split

categorical_columns = ['ed']  
numerical_columns = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']

# Separate the cleaned inputs and label
label = cleaned_dataset['default']
cleaned_inputs = cleaned_dataset.drop(columns=['default'])

# Separate the cleaned inputs and label into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cleaned_inputs, label, test_size=0.3, random_state=42)

# Display the shapes of the training and test datasets to ensure correctness
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test labels shape: {y_test.shape}")

# Separate continuous and categorical data in the training set
X_train_continuous = X_train[numerical_columns]
X_train_categorical = X_train[categorical_columns]

# Separate continuous and categorical data in the test set
X_test_continuous = X_test[numerical_columns]
X_test_categorical = X_test[categorical_columns]

# Display the separated dataframes to ensure correctness
print("Training Continuous Data:")
print(X_train_continuous.head())
print("\nTraining Categorical Data:")
print(X_train_categorical.head())

print("Test Continuous Data:")
print(X_test_continuous.head())
print("\nTest Categorical Data:")
print(X_test_categorical.head())


Training data shape: (490, 8)
Test data shape: (210, 8)
Training labels shape: (490,)
Test labels shape: (210,)
Training Continuous Data:
      age  employ  address  income  debtinc  creddebt   othdebt
389  34.0    13.0      8.0    56.0      6.1  0.864248  2.551752
37    NaN     NaN      NaN     NaN      NaN       NaN       NaN
319  27.0     4.0      1.0    40.0      3.1  0.283960  0.956040
460  39.0     8.0      0.0    21.0      4.0  0.276360  0.563640
196  24.0     1.0      0.0    18.0      5.9  0.238950  0.823050

Training Categorical Data:
      ed
389  1.0
37   NaN
319  3.0
460  1.0
196  1.0
Test Continuous Data:
      age  employ  address  income  debtinc  creddebt   othdebt
175  26.0     6.0      0.0    22.0     10.3  0.720588  1.545412
545  43.0    10.0     24.0    37.0      8.5  0.676175  2.468825
435  24.0     1.0      2.0    42.0      5.7  0.837900  1.556100
171  31.0     4.0     10.0    28.0     11.3  0.291088  2.872912
350  41.0     8.0     21.0    43.0      0.7  0.085785 

# Outlier Detection and Handling

### What are Outliers?

An outlier is an observation that is unlike the other observations. They are rare, distinct, or do not fit in some way. We generally define outliers as samples that are exceptionally far from the mainstream of the data.

**Caution**: In data mining issues involving deviation detection tasks, the identification and management of outliers should be disregarded as part of the data cleaning process.

### Outlier Detection

Outlier detection is a crucial step in data analysis, employing various methods to identify and manage data points that significantly deviate from expected patterns. There are two main approaches to outlier detection:

## One-Dimensional Methods

One-dimensional methods can be applied only to continuous variables. We will explore two common methods:

1. **Standard Deviation Method**: This method uses the standard deviation from the mean to identify outliers. Values that fall outside of 3 standard deviations are typically considered outliers.
2. **Interquartile Range (IQR) Method**: This method uses the IQR to define limits on the sample values. Values below the 25th percentile minus 1.5 times the IQR or above the 75th percentile plus 1.5 times the IQR are considered outliers.

#### Why IQR Method is Better

The IQR method is preferred in many cases because it does not assume normality of the data distribution. The standard deviation method assumes the data follows a normal distribution, which may not always be the case. The IQR method is more robust as it uses percentiles and can be applied to non-Gaussian distributions.

### Choosing Thresholds and k Values

When detecting outliers, the choice of threshold for the standard deviation method and the value of k for the IQR method can vary based on the dataset and business requirements. For instance:

- **Standard Deviation Method**: While 3 standard deviations is a common threshold, smaller datasets might use 2 standard deviations to identify outliers, and larger datasets might use 4 standard deviations to account for greater variability.
- **IQR Method**: The common value for k is 1.5, but this can be adjusted to 2 for a stricter outlier detection or 3 to identify extreme outliers.

These adjustments depend on the specific context and goals of the analysis. It's essential to consider the characteristics of the data and the impact of outliers on the analysis and business outcomes.

### Outlier Handling

When dealing with detected outliers, two common approaches are employed:

1. **Remove Rows**: Exclude entire rows containing outliers from the dataset.
2. **Coerce to Bounds**: Modify outlier values to fall within an acceptable range, either by setting them to the lower or upper bound.

These methods offer flexibility in managing the impact of outliers on data analysis, allowing analysts to choose the most suitable strategy based on the specific requirements of their analysis.


In [6]:
import numpy as np

# Define a function to detect outliers using the Standard Deviation Method
def detect_outliers_std(df, columns, threshold=3):
    outliers_dict = {}
    for col in columns:
        mean = df[col].mean()
        std_dev = df[col].std()
        outliers = (df[col] > mean + threshold * std_dev) | (df[col] < mean - threshold * std_dev)
        outliers_dict[col] = df[outliers]
    return outliers_dict

# Define a function to detect outliers using the IQR Method
def detect_outliers_iqr(df, columns, k=2):
    outliers_dict = {}
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = (df[col] < (Q1 - k * IQR)) | (df[col] > (Q3 + k * IQR))
        outliers_dict[col] = df[outliers]
    return outliers_dict

# Detect outliers in the training set using the Standard Deviation Method
std_outliers_train = detect_outliers_std(X_train_continuous, numerical_columns, threshold=3)

# Detect outliers in the test set using the Standard Deviation Method
std_outliers_test = detect_outliers_std(X_test_continuous, numerical_columns, threshold=3)

# Detect outliers in the training set using the IQR Method
iqr_outliers_train = detect_outliers_iqr(X_train_continuous, numerical_columns, k=2)

# Detect outliers in the test set using the IQR Method
iqr_outliers_test = detect_outliers_iqr(X_test_continuous, numerical_columns, k=2)

# Display detected outliers from both methods
for col in numerical_columns:
    print(f"Standard Deviation Method Outliers in Training ({col}): {std_outliers_train[col].shape[0]} outliers")
    print(f"Standard Deviation Method Outliers in Test ({col}): {std_outliers_test[col].shape[0]} outliers")
    print(f"IQR Method Outliers in Training ({col}): {iqr_outliers_train[col].shape[0]} outliers")
    print(f"IQR Method Outliers in Test ({col}): {iqr_outliers_test[col].shape[0]} outliers")
    print("\n")

# Handling Outliers

# Remove Rows Example 
X_train_continuous_removed = X_train_continuous.copy()
for col in numerical_columns:
    X_train_continuous_removed = X_train_continuous_removed[~X_train_continuous_removed.index.isin(iqr_outliers_train[col].index)]

print("Training data shape after removing outliers using IQR method:")
print(X_train_continuous_removed.shape)

# Coerce to Bounds Example 
X_train_continuous_coerced = X_train_continuous.copy()
for col in numerical_columns:
    Q1 = X_train_continuous_coerced[col].quantile(0.25)
    Q3 = X_train_continuous_coerced[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 2 * IQR
    upper_bound = Q3 + 2 * IQR
    X_train_continuous_coerced[col] = np.where(X_train_continuous_coerced[col] < lower_bound, lower_bound, X_train_continuous_coerced[col])
    X_train_continuous_coerced[col] = np.where(X_train_continuous_coerced[col] > upper_bound, upper_bound, X_train_continuous_coerced[col])

print("Training data shape after coercing outliers using IQR method:")
print(X_train_continuous_coerced.shape)


Standard Deviation Method Outliers in Training (age): 0 outliers
Standard Deviation Method Outliers in Test (age): 0 outliers
IQR Method Outliers in Training (age): 0 outliers
IQR Method Outliers in Test (age): 0 outliers


Standard Deviation Method Outliers in Training (employ): 3 outliers
Standard Deviation Method Outliers in Test (employ): 3 outliers
IQR Method Outliers in Training (employ): 1 outliers
IQR Method Outliers in Test (employ): 2 outliers


Standard Deviation Method Outliers in Training (address): 3 outliers
Standard Deviation Method Outliers in Test (address): 0 outliers
IQR Method Outliers in Training (address): 3 outliers
IQR Method Outliers in Test (address): 0 outliers


Standard Deviation Method Outliers in Training (income): 10 outliers
Standard Deviation Method Outliers in Test (income): 4 outliers
IQR Method Outliers in Training (income): 19 outliers
IQR Method Outliers in Test (income): 8 outliers


Standard Deviation Method Outliers in Training (debtinc): 7 ou

### Multidimensional Method

Outlier detection can also be extended to multidimensional datasets where interactions between multiple variables can reveal anomalies that one-dimensional methods might miss. One effective multidimensional method is the Isolation Forest algorithm.

#### Isolation Forest Method

The Isolation Forest method is a powerful approach to outlier detection that leverages decision trees to isolate outliers. This method isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The fewer the splits required to isolate an observation, the more likely it is an outlier.

**Advantages of Isolation Forest**:
1. It is particularly effective in handling high-dimensional data.
2. It does not rely on assumptions about the data distribution.
3. It is efficient and scales well to large datasets.

### Applying Isolation Forest for Outlier Detection

We will use the Isolation Forest algorithm to detect outliers in both the training and test sets. This method will help us identify anomalies based on the relationships between multiple variables.

### Outlier Handling

After detecting outliers using the Isolation Forest method, we can choose to either remove the corresponding rows from the dataset or mark them for further analysis. For this notebook, we will demonstrate how to remove the rows containing outliers from both the training and test sets.


In [7]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

#  Data Preprocessing
# Before applying the Isolation Forest for outlier detection, we need to preprocess the data. This includes handling missing values, encoding categorical variables, and scaling numerical features. These steps are crucial because:
# 1. **Imputation**: Missing values can lead to errors in model training and predictions. Imputing missing values ensures that our dataset is complete.
# 2. **Encoding**: Machine learning algorithms require numerical inputs. Encoding categorical variables allows us to convert them into a numerical format that can be processed by the model.
# 3. **Scaling**: Standardizing numerical features by removing the mean and scaling to unit variance is important for algorithms like Isolation Forest, which are sensitive to the scale of input data.

# Define the numerical and categorical columns - to show here
numerical_columns = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']
categorical_columns = ['ed']

# Define the preprocessing pipeline for numerical features
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())  # Standardize numerical features
])

# Define the preprocessing pipeline for categorical features
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # Encode categorical features
])

# Combine numerical and categorical pipelines into a single preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_columns),
    ('cat', categorical_pipeline, categorical_columns)
])

# Apply the preprocessing pipeline to the combined data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Isolation Forest for Outlier Detection
# After preprocessing, we apply the Isolation Forest algorithm to detect and remove outliers from the dataset. Isolation Forest is an unsupervised learning algorithm that identifies anomalies in the data by isolating observations. Anomalies are more susceptible to isolation, thus making them easier to identify. We set the contamination parameter to 0.05, which means we expect 5% of the data to be outliers.

# Initialize the Isolation Forest for outlier detection
iso_forest = IsolationForest(contamination=0.05, random_state=42)

# Fit the Isolation Forest model on the preprocessed training data
iso_forest.fit(X_train_preprocessed)

# Predict outliers in the training set (output is -1 for outliers and 1 for inliers)
outliers_train = iso_forest.predict(X_train_preprocessed) == -1

# Predict outliers in the test set
outliers_test = iso_forest.predict(X_test_preprocessed) == -1

# Print the number of detected outliers in both sets
print(f"Isolation Forest detected {np.sum(outliers_train)} outliers in the training set.")
print(f"Isolation Forest detected {np.sum(outliers_test)} outliers in the test set.")

# Handling Outliers

# Remove outliers from the training set
X_train_cleaned = X_train[~outliers_train]
y_train_cleaned = y_train[~outliers_train]

# Remove outliers from the test set
X_test_cleaned = X_test[~outliers_test]
y_test_cleaned = y_test[~outliers_test]

# Print the shape of the cleaned training and test sets
print("Training data shape after removing outliers using Isolation Forest:")
print(X_train_cleaned.shape)
print(y_train_cleaned.shape)

print("Test data shape after removing outliers using Isolation Forest:")
print(X_test_cleaned.shape)
print(y_test_cleaned.shape)


Isolation Forest detected 25 outliers in the training set.
Isolation Forest detected 20 outliers in the test set.
Training data shape after removing outliers using Isolation Forest:
(465, 8)
(465,)
Test data shape after removing outliers using Isolation Forest:
(190, 8)
(190,)


# Handling Missing Data

Handling missing data is crucial for maintaining the integrity and quality of your dataset. Various methods can be applied depending on the extent and nature of the missing data.

### Missing Values: 

#### Initial Check

1. **Assess Missing Values**:
   - Generate a report on the percentage of records with missing data.
   - If the percentage of missing data is low, consider discarding those records.

#### Detailed Check

2. **Create a Detailed Report**:
   - For higher percentages of missing data, generate a detailed report based on the number of missing cells in each record.
   - Sort this report to identify records with a significant number of missing cells.
   - Discard records with a significant number of missing cells to maintain data quality and prevent misleading analyses.

#### Feature-Wise Check

3. **Analyze Missing Values for Each Feature**:
   - Generate a report on missing values for each feature.
   - Use simple imputation methods (mean, median, mode) for features with low missing data.
   - Consider advanced imputation methods (iterative, k-NN) for features with a higher percentage of missing data.

### Imputation Methods for Handling Missing Values

## One-Dimensional Methods

1. **Fixed Values Imputation**:
   - **Mean Imputation**: Replace missing values with the mean of the observed values. Suitable for continuous variables without extreme outliers.
   - **Median Imputation**: Replace missing values with the median of the observed values. Effective for skewed distributions and data with outliers.
   - **Mode Imputation**: Replace missing values with the most frequent value. Useful for categorical variables.
   - **Constant Value Imputation**: Replace missing values with a predefined constant value.

2. **Random Values Following Statistical Distribution**:
   - Replace missing values with random numbers drawn from a distribution (e.g., normal distribution based on mean and standard deviation of observed values). This preserves the variability of features.

## Multi-Dimensional Methods

1. **Iterative Imputation Methods**:
   - Impute missing values by iteratively updating estimates based on observed values and relationships with other variables. Suitable for datasets with complex dependencies between variables.

2. **k-NN (k-Nearest Neighbors) Model**:
   - Impute missing values based on the values of their k-nearest neighbors in the feature space. Useful for datasets where local patterns or clusters exist.

### Applying Imputation Methods

We will apply these methods to both the training and test sets to ensure consistency and avoid data leakage.


In [8]:
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Initial Check: Assess missing values and create a report
def missing_values_report(df):
    missing_data = df.isnull().sum()
    missing_percentage = (missing_data / len(df)) * 100
    missing_report = pd.DataFrame({'Missing Values': missing_data, 'Percentage': missing_percentage})
    return missing_report

# Detailed Check: Generate detailed report based on the number of missing cells in each record
def detailed_missing_report(df):
    missing_data = df.isnull().sum(axis=1)
    missing_report = pd.DataFrame({'Missing Cells': missing_data, 'Percentage': (missing_data / df.shape[1]) * 100})
    return missing_report.sort_values(by='Missing Cells', ascending=False)

# Generate missing values report for the training and test sets
train_missing_report = missing_values_report(X_train)
test_missing_report = missing_values_report(X_test)

print("Missing Values Report (Training Set):")
print(train_missing_report)
print("\nMissing Values Report (Test Set):")
print(test_missing_report)

# Initial Check: Discard records with missing data if the percentage is low
low_missing_threshold = 5
X_train_initial_check = X_train.dropna(thresh=X_train.shape[1] * (1 - low_missing_threshold / 100), axis=0)
X_test_initial_check = X_test.dropna(thresh=X_test.shape[1] * (1 - low_missing_threshold / 100), axis=0)

# Detailed Check: Create a detailed report and discard low quality records (100% missing)
train_detailed_report = detailed_missing_report(X_train_initial_check)
test_detailed_report = detailed_missing_report(X_test_initial_check)

# Remove records with 100% missing values
X_train_detailed_check = X_train_initial_check.loc[train_detailed_report[train_detailed_report['Percentage'] < 100].index]
X_test_detailed_check = X_test_initial_check.loc[test_detailed_report[test_detailed_report['Percentage'] < 100].index]

# Update labels to match the filtered data
y_train_filtered = y_train.loc[X_train_detailed_check.index]
y_test_filtered = y_test.loc[X_test_detailed_check.index]

# Feature-Wise Check: Apply simple imputation for low missing data features and advanced methods for higher missing data features
# Define iterative imputer
iterative_imputer = IterativeImputer(random_state=42)

# Apply iterative imputation to training and test sets (main datasets)
X_train_final = pd.DataFrame(iterative_imputer.fit_transform(X_train_detailed_check), columns=X_train.columns)
X_test_final = pd.DataFrame(iterative_imputer.transform(X_test_detailed_check), columns=X_test.columns)

# Display the shapes of the final datasets
print("Training data shape after imputation:")
print(X_train_final.shape)
print(y_train_filtered.shape)

print("Test data shape after imputation:")
print(X_test_final.shape)
print(y_test_filtered.shape)


Missing Values Report (Training Set):
          Missing Values  Percentage
age                   42    8.571429
ed                    56   11.428571
employ                42    8.571429
address               42    8.571429
income                42    8.571429
debtinc               42    8.571429
creddebt              42    8.571429
othdebt               42    8.571429

Missing Values Report (Test Set):
          Missing Values  Percentage
age                   14    6.666667
ed                    19    9.047619
employ                14    6.666667
address               14    6.666667
income                14    6.666667
debtinc               14    6.666667
creddebt              14    6.666667
othdebt               14    6.666667
Training data shape after imputation:
(434, 8)
(434,)
Test data shape after imputation:
(191, 8)
(191,)


## Thank You for Exploring This Notebook!

If you have any questions, suggestions, or just want to discuss any of the topics further, please don't hesitate to reach out to me or leave a comment. Your feedback is not only welcome but also invaluable! If you know any useful tips or techniques that were not covered in this notebook, please suggest them in the comments. This notebook will be updated regularly to include more helpful insights and methods!

Happy analyzing, and stay curious!
