In the **CRISP-DM** (Cross-Industry Standard Process for Data Mining) framework, the second phase, following **business understanding**, is **data understanding**, which centers on exploratory data analysis (EDA). This phase involves getting to know the dataset and identifying what must be done before modeling. **Descriptive statistics** play a central role in EDA, providing a **snapshot** of the data's **central tendency**, **dispersion**, and **distributional** characteristics. Central tendency metrics such as the mean, median, and mode describe the dataset's typical values, while measures of dispersion such as the standard deviation and range quantify variability. Distributional statistics such as skewness and kurtosis, along with box plots, reveal the data's shape and spread. EDA provides an **overview** of the data and its **characteristics**, supports **quality checks**, and uncovers **relationships** between features, which can then be examined through **hypothesis testing**. This phase lays the foundation for informed decisions in the later stages of the data mining process.
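As a minimal sketch of the descriptive statistics named above, the snippet below computes each of them with pandas. The `income` values are hypothetical and are not taken from the bank-loan data.

```python
import pandas as pd

# Hypothetical toy values (not from the bank-loan data), used only to
# illustrate the statistics named above.
income = pd.Series([20, 25, 25, 30, 40, 55, 120], name="income")

summary = {
    "mean": income.mean(),
    "median": income.median(),
    "mode": income.mode().iloc[0],         # most frequent value
    "std": income.std(),                   # sample standard deviation (ddof=1)
    "range": income.max() - income.min(),
    "skewness": income.skew(),             # > 0 indicates a longer right tail
    "kurtosis": income.kurt(),             # excess kurtosis (normal = 0)
}

for name, value in summary.items():
    print(f"{name:>8}: {float(value):.2f}")
```

The single large value pulls the mean well above the median, which is the kind of signal that skewness and a box plot would also surface.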

### Problem Definition  
This notebook uses a dataset named "bank-loan" from the domain of **credit scoring**. It contains **700 records** of **bank customers** who obtained loans, and it records each repayment outcome as a **default status** of 1 or 0. The objective is to develop a credit scoring system that supports loan approval decisions. The system draws on age, education, years with the current employer, years at the current address, income, debt-to-income ratio, credit card debt, and other debt reported at the time of the loan application. The aim is to build a **predictive model** that helps the financial institution make informed decisions on loan applications, improving risk management and the soundness of its lending practices.

In [1]:
import pandas as pd
df = pd.read_csv('/kaggle/input/bank-loan/Bankloan.txt')

The fields in the "bank-loan" dataset are described below:

**Age:** Age in years.

**Ed:** Education level: 1 = Did not complete high school, 2 = High school degree, 3 = Some college, 4 = College degree, 5 = Post-undergraduate degree

**Employ:** Years with current employer

**Address:** Years at current address

**Income:** Household income in thousands

**Debtinc:** Debt to income ratio (x100)

**Creddebt:** Credit card debt in thousands

**Othdebt:** Other debt in thousands

**Default:** The target variable, indicating whether the customer previously defaulted. It takes binary values, with 1 typically denoting a "bad" default status and 0 representing a "good" repayment history.

### EDA answers some critical questions about data  

These questions come down to diagnosing data quality issues (or general data characteristics) that may compromise machine learning models or their performance throughout the data science lifecycle.

* Do I have missing data? How does my missing data behave?
* Are all instances labelled? Are the labels trustworthy?
* Are these features consistent and informative?
* Do I have enough data to train my model?
* Is my data noisy?  
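
Several of these questions can be answered with a few pandas calls. The following is a sketch on a hypothetical toy frame: the column names mirror the bank-loan fields, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "age": [41, 27, np.nan, 35, 27],
    "income": [176.0, 31.0, 55.0, np.nan, 31.0],
    "default": ["1", "0", "0", None, "0"],
})

missing_per_column = toy.isna().sum()          # Do I have missing data?
unlabelled_rows = toy["default"].isna().sum()  # Are all instances labelled?
label_counts = toy["default"].value_counts()   # How are the labels distributed?
duplicate_rows = toy.duplicated().sum()        # Are records repeated?
n_rows, n_cols = toy.shape                     # How much data is there?

print(missing_per_column, label_counts, sep="\n\n")
print(f"unlabelled={unlabelled_rows}, duplicates={duplicate_rows}, rows={n_rows}")
```

Checks like these only flag candidates for review. Whether a label is trustworthy or a feature is informative still requires domain knowledge.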

Because standard machine learning algorithms assume certain properties of their input data (e.g., **representativeness**, **completeness**, **consistency**), these questions need to be addressed before building a model. The following examples illustrate representativeness, completeness, and consistency in a dataset for a data science project:

**Representativeness:**  

In a dataset for customer churn prediction, representativeness would mean that the dataset includes a diverse **sample** of customers, including various demographics, usage patterns, and regions. If the dataset predominantly represents only a specific group of customers, the model might not **generalize** well to the entire customer base.  

**Completeness:**
  
In a healthcare dataset tracking patient outcomes, completeness is crucial. If a significant percentage of patient records **lacks key information** such as diagnosis, treatment history, or follow-up data, it can hinder the ability to draw comprehensive insights or build accurate predictive models.  

**Consistency:**
  
Consider a dataset that records sales transactions, and there's a column for currency. Consistency would mean that the currency values are uniform throughout the dataset. If some entries use symbols like "$," while others use "USD," it introduces inconsistency. Ensuring a standardized format enhances the reliability of analyses.
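
The currency example can be sketched as follows. The `sales` frame and the `CURRENCY_ALIASES` mapping are hypothetical, introduced here only to illustrate one way of standardizing a column.

```python
import pandas as pd

# Hypothetical sales records with mixed currency notations.
sales = pd.DataFrame({
    "amount": [10.0, 25.5, 7.0, 12.0],
    "currency": ["$", "USD", " usd", "EUR"],
})

# Assumed mapping from raw notations to one standard code.
CURRENCY_ALIASES = {"$": "USD", "USD": "USD", "EUR": "EUR", "€": "EUR"}

# Trim whitespace and unify case before looking up the standard code.
cleaned = sales["currency"].str.strip().str.upper()
sales["currency"] = cleaned.map(CURRENCY_ALIASES)

# Any notation missing from the mapping becomes NaN and can be reviewed.
unmapped = sales["currency"].isna().sum()
print(sales["currency"].value_counts())
print("unmapped:", unmapped)
```

Mapping through an explicit dictionary, rather than replacing values one at a time, makes unknown notations visible as missing values instead of letting them pass silently.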

This is why thoroughly profiling your data is a required step in **every data science project**, and one that will save you a lot of **time**.

### What is Data Profiling, and why is it so important?
**Data Profiling** is precisely this step of diagnosing your data.

Think of it as performing a regular checkup on your data to see if everything is “good to go” for model building, or if there is something that needs to be handled properly first.

A standard way of profiling your data is through **Exploratory Data Analysis (EDA)**. This involves a deep exploration of the data, trying to understand its intricacies and properties as fully as possible, namely through:

**Dataset Overview and Characteristics:** Determining the number of features and observations, types of features, duplicate records;  

**Univariate Visualization and Feature Assessment:** Analyzing descriptive statistics such as distribution, range, variation, scale, common values, missing values, and possible outliers, and evaluating the need for re-indexing, reformatting, or removing values, as well as other operations such as data imputation or augmentation;  

**Multivariate Visualization and Correlation Assessment:** Investigating **patterns** and **relationships** between features and the behavior of missing values, and assessing the need for dimensionality reduction and feature selection.  
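
Before turning to an automated tool, the three levels above can be approximated by hand with plain pandas. This is a minimal sketch on a hypothetical toy frame, not the workflow used later in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for a loaded dataset.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [20.0, 35.0, 80.0, 95.0, 50.0, 28.0],
    "ed": [1, 2, 4, 3, 2, 1],
})

# Dataset overview: size, feature types, duplicate records.
print(toy.shape)
print(toy.dtypes)
print("duplicates:", toy.duplicated().sum())

# Univariate assessment: count, mean, std, min, quartiles, max per feature.
print(toy.describe())

# Multivariate assessment: pairwise Pearson correlations.
corr = toy.corr()
print(corr.round(2))
```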

* This is a thorough and crucial step in the data science lifecycle, but a **very time-consuming** one as well. It’s an intimidating process for anyone who is not a trained data scientist, and even for data scientists, it can be difficult to know exactly where to look.  

* Without systematic processes and tools, the success of EDA gets highly dependent on the **expertise** and experience of the person conducting the analysis.

* This is a major responsibility, especially because this is an **iterative process** that is **error-prone**, with requirements, thresholds, and constraints that keep changing as the model is deployed to production. 

* **Data Profiling does not end** after the model is built and deployed because real-world domains are not immutable, and neither is data quality.   

As the Data-Centric **AI paradigm** advocates, data profiling must become a **standard**, **systematic**, **iterative**, **continuous**, and **automated process** performed at each step of the data analysis pipeline.

### Pandas Profiling: Automated EDA  
Pandas Profiling has passed the milestone of 10K stars on GitHub, and the data science community has praised it as a leading open-source tool for data profiling. The package is now distributed as `ydata-profiling`, which is the name used in the cells below.   

  
Reference: https://medium.com/ydata-ai/auditing-data-quality-with-pandas-profiling-b1bf1919f856

In [2]:
!pip install ydata_profiling

Collecting numpy<1.24,>=1.16.0 (from ydata_profiling)
  Obtaining dependency information for numpy<1.24,>=1.16.0 from https://files.pythonhosted.org/packages/e4/f3/679b3a042a127de0d7c84874913c3e23bb84646eb3bc6ecab3f8c872edc9/numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Downloading numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conf

In [3]:
from ydata_profiling import ProfileReport

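# Generate a profile report, treating 'ed' and 'default' as categorical.
# The type_schema keys are lowercase on the assumption that they must match
# the DataFrame's column names (a later cell accesses df.default).
profile = ProfileReport(df, title="Bankloan data EDA", type_schema={"ed": "categorical", "default": "categorical"})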

# Save the report to an HTML file
profile.to_file("your_dataset_profile_report.html")


In [4]:
profile



### Needed Preparation Actions
The EDA results suggest that the following data cleaning and preparation steps are needed:
* The 'age' column contains out-of-range values, which should be converted to null.
* The distribution of 'age' suggests that values have been rounded, so binning it into categories may work better.
* In the 'ed' column, the values '4' and '5' should be merged, as the frequency of '5' is low.
* The 'default' column includes inconsistent codes, which should be corrected.
* The missing values in 'age', 'ed', and 'income' should be imputed.
* Outliers should be detected and treated.
* The 'employ', 'address', 'income', 'debtinc', 'creddebt', and 'othdebt' fields have skewed distributions and should be transformed or categorized.
* The target field is imbalanced, which should be taken into account during modeling.
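
A few of these actions can be sketched with pandas as follows. The toy rows are invented, and the specific choices (a valid age range of 18 to 100, stray quotes as the inconsistent target codes, median and mode imputation, and a `log1p` transform) are assumptions made for illustration rather than decisions taken from the profile report.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the issues listed above.
toy = pd.DataFrame({
    "age": [41.0, 27.0, 999.0, np.nan, 35.0],
    "ed": [1.0, 5.0, 4.0, np.nan, 2.0],
    "income": [176.0, 31.0, np.nan, 55.0, 28.0],
    "default": ["1", "0", "'0'", "0", "1"],
})

# Out-of-range ages become null (18-100 is an assumed valid range).
toy["age"] = toy["age"].where(toy["age"].between(18, 100))

# Merge the rare education level 5 into level 4.
toy["ed"] = toy["ed"].replace(5.0, 4.0)

# Normalise inconsistent target codes (assumed here to be stray quotes).
toy["default"] = toy["default"].str.strip("' ").astype(int)

# Impute: median for numeric fields, mode for the ordinal 'ed'.
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["income"] = toy["income"].fillna(toy["income"].median())
toy["ed"] = toy["ed"].fillna(toy["ed"].mode().iloc[0])

# Reduce right skew with log(1 + x).
toy["income_log"] = np.log1p(toy["income"])

print(toy)
```

The order matters in this sketch: invalid values are nulled before imputation so that they do not distort the median used to fill the gaps.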

### EDA with Consideration of Target Field 
You can split the dataset by the target field ('default') and compare the descriptive statistics and graphs for each class (0 or 1) to reveal the relationship between the input fields and the target.

In [5]:
from ydata_profiling import ProfileReport

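# Split by target class. The codes are compared as strings, as in the raw data.
# .copy() avoids pandas' SettingWithCopyWarning when the profiler modifies a subset.
df_default_0 = df[df["default"] == "0"].copy()
df_default_1 = df[df["default"] == "1"].copy()

# Generate a minimal profile report for each class.
# Schema keys are lowercase on the assumption that they must match the column names.
schema = {"ed": "categorical", "default": "categorical"}
profile_0 = ProfileReport(df_default_0, title="Bankloan EDA 0", minimal=True, type_schema=schema)
profile_1 = ProfileReport(df_default_1, title="Bankloan EDA 1", minimal=True, type_schema=schema)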

comparison_report = profile_0.compare(profile_1)
comparison_report.to_file("comparison.html")
comparison_report




In [6]:
comparison_report



For more details see https://docs.profiling.ydata.ai/latest/getting-started/concepts/
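
As a lighter-weight complement to the comparison report, the same per-class comparison can be sketched with a pandas `groupby`. The toy values below are invented, and the target is assumed to be already cleaned to integers 0 and 1.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "debtinc": [9.3, 17.3, 5.5, 2.9, 17.3, 10.2, 30.6, 3.6],
    "income": [176, 31, 55, 120, 28, 25, 67, 38],
    "default": [1, 0, 0, 0, 1, 0, 1, 0],
})

# Per-class summary statistics for each input field.
by_class = toy.groupby("default").agg(["mean", "median", "std"])
print(by_class.round(2))

# Difference in class means, a quick first signal of a relationship.
class_means = toy.groupby("default").mean()
print((class_means.loc[1] - class_means.loc[0]).round(2))
```

A large gap between the class means of a field suggests it may be related to the target, which can then be checked with a formal hypothesis test.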

### Other Awesome Data Profiling Tools

https://medium.com/towards-data-science/awesome-data-science-tools-to-master-in-2023-data-profiling-edition-29d29310f779
