## EDA Lab

> **Copyright Notice**    
> 
> This IPython notebook is part of the **Deep Dive into Data Science** training program at Nexteer Automotive.  
> It incorporates materials from **Coursera**'s **Deep Learning Specialization**, **TensorFlow: Advanced Techniques Specialization**, and **Mathematics for Machine Learning and Data Science** Specialization, licensed under the **Creative Commons Attribution-ShareAlike 2.0 (CC BY-SA 2.0)**, as well as other sources (including, but not limited to, enhancements developed with the assistance of generative AI tools).  
> All original content created for this program, and all adaptations of source materials, are the intellectual property of Nexteer Automotive and are licensed under the same **Creative Commons Attribution-ShareAlike 2.0 (CC BY-SA 2.0)** license.

Conduct Exploratory Data Analysis (EDA) on the provided dataset using the process outlined in the last section of this module. All methods from that section are listed below as a guideline. First decide which of them apply to this specific dataset, then proceed with the analysis.

**Reminder:** You may need to research how to implement some of these methods.

### Loan - Credit Risk & Population Stability Dataset

Loan - Credit Risk & Population Stability is part of the public LendingClub database. LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.  

Source: [Loan - Credit Risk & Population Stability - Kaggle](https://www.kaggle.com/datasets/beatafaron/loan-credit-risk-and-population-stability?select=loan_2014_18.csv)

**Download Dataset**: Please download `loan_2014_18.csv` from [here](https://www.kaggle.com/datasets/beatafaron/loan-credit-risk-and-population-stability?select=loan_2014_18.csv).

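```python
import pandas as pd

# Read the whole file in one pass so that each column gets a single inferred dtype.
df_raw = pd.read_csv("loan_2014_18.csv", low_memory=False)
# Drop the first column (presumably a row index saved with the export).
df_raw = df_raw.drop(df_raw.columns[0], axis=1)

selected_columns = [
    "loan_amnt",
    "term",
    "int_rate",
    "emp_length",
    "annual_inc",
    "dti",
    "fico_range_low",
    "fico_range_high",
    "home_ownership",
    "purpose",
    "addr_state",
    "inq_last_6mths",
    "open_acc",
    "revol_util",
    "total_acc",
    "loan_status"  # Target variable
]

# Copy the selection so later edits to `df` do not raise SettingWithCopyWarning.
df = df_raw[selected_columns].copy()
df.head()
```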

| | loan_amnt | term | int_rate | emp_length | annual_inc | dti | fico_range_low | fico_range_high | home_ownership | purpose | addr_state | inq_last_6mths | open_acc | revol_util | total_acc | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12000.0 | 36 months | 7.97% | 10+ years | 42000.0 | 27.74 | 715.0 | 719.0 | OWN | debt_consolidation | CA | 0.0 | 9.0 | 37% | 16.0 | Fully Paid |
| 1 | 32000.0 | 36 months | 11.99% | 10+ years | 155000.0 | 12.35 | 715.0 | 719.0 | MORTGAGE | credit_card | NJ | 1.0 | 20.0 | 34.1% | 42.0 | Current |
| 2 | 40000.0 | 60 months | 15.05% | 9 years | 120000.0 | 31.11 | 765.0 | 769.0 | MORTGAGE | debt_consolidation | TX | 0.0 | 12.0 | 20.7% | 26.0 | Current |
| 3 | 16000.0 | 36 months | 7.97% | 5 years | 79077.0 | 15.94 | 700.0 | 704.0 | RENT | debt_consolidation | VA | 0.0 | 12.0 | 57.7% | 20.0 | Current |
| 4 | 33000.0 | 36 months | 7.21% | < 1 year | 107000.0 | 19.06 | 785.0 | 789.0 | MORTGAGE | debt_consolidation | TX | 0.0 | 25.0 | 16.1% | 52.0 | Late (31-120 days) |


### Understand Data Distribution

*   **Initial Data Profiling:** Generate a first-pass summary: data types, unique values (cardinality), missing counts/percentages, basic descriptive statistics, memory usage, duplicate rows. *Start with manual inspection (`.head()`, `.info()`, `.describe()`).*  
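
A first-pass profile along the lines described above could be sketched as follows. This is a minimal example, not a prescribed solution: the `demo` frame is a tiny synthetic stand-in for `df` (its values are illustrative only), and the `profile` helper is a hypothetical name introduced here.

```python
import pandas as pd

# Tiny synthetic stand-in for `df`; values are illustrative only.
demo = pd.DataFrame({
    "loan_amnt": [12000.0, 32000.0, 40000.0, 16000.0, 16000.0],
    "term": ["36 months", "36 months", "60 months", "36 months", "36 months"],
    "int_rate": ["7.97%", "11.99%", "15.05%", "7.97%", "7.97%"],
    "emp_length": ["10+ years", "10+ years", None, "5 years", "5 years"],
    "loan_status": ["Fully Paid", "Current", "Current", "Current", "Current"],
})


def profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, cardinality, and missingness."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(dropna=True),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(2),
    })


summary = profile(demo)
n_duplicates = int(demo.duplicated().sum())           # fully identical rows
memory_bytes = int(demo.memory_usage(deep=True).sum())  # includes string contents

print(summary)
print(f"duplicate rows: {n_duplicates}, memory: {memory_bytes} bytes")
```

On the real data, the same helper can be called as `profile(df)`. Columns such as `int_rate` will likely show up with a string dtype because of the `%` suffix, which is worth noting before computing descriptive statistics.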

*   **Detailed Variable Characterization (Univariate Focus):**
    *   **Univariate Analysis:** Analyze key variables individually (prioritizing target and hypothesized important predictors).
        *   **Visualization:** Use appropriate plots (Histogram, Boxplot/Violin, Bar Chart, Density Plot). *Interpret plots in context.*
        *   **Descriptive Statistics:** Calculate central tendency, spread, shape. *Relate stats to visualizations.*
        * **Evaluate Distributions:** Rigorously assess the distribution of numerical variables.
            * **Q-Q Plot:** Visual check for normality against a theoretical distribution.
            * **Tests for Normality:**
                * **Anderson-Darling Test:** More sensitive to deviations in the tails of the distribution.
                * **Kolmogorov-Smirnov Test:** Compares the empirical distribution function to a theoretical distribution.
                * If $\mu$ and $\sigma$ are estimated from the sample, use the *Lilliefors Test* instead of KS, because the standard KS p-values assume a fully specified theoretical distribution.
            * **Choosing Between KS and AD**:
                * If you suspect deviations in the central part of the distribution, the KS test might be a good choice due to its simplicity and sensitivity in that region.
                * If you're concerned about outliers or heavy tails, the AD test is generally more powerful as it emphasizes discrepancies in these areas.
            *   **Tests for Homogeneity of Variance:** (If comparing groups) Levene's (robust), Bartlett's (requires normality).
            *   Analyze the distribution of *each* variable, understand its characteristics (e.g., skewed, multimodal, heavy-tailed) and its implications for analysis and modeling.
    *   **Target Variable Specific Analysis:** Thoroughly analyze target distribution (regression: skewness, range; classification: class balance). *Crucial for model choice, metrics, sampling.*
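
The distribution checks listed above might be run roughly as in the sketch below. It assumes SciPy is available and uses synthetic samples in place of real columns; `normality_report` is a hypothetical helper. The Q-Q information is computed without plotting, and the KS statistic is reported without a p-value because the parameters are estimated from the sample (the situation where the Lilliefors test applies).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a column such as `annual_inc`.
x = rng.lognormal(mean=11.0, sigma=0.6, size=2000)


def normality_report(sample: np.ndarray) -> dict:
    """Collect Q-Q, Anderson-Darling, and KS summaries for one sample."""
    # Q-Q data without drawing: r close to 1 means points lie near a straight line.
    _, (_, _, qq_r) = stats.probplot(sample, dist="norm")
    # Anderson-Darling: normality is rejected at 5% if statistic > critical value.
    ad = stats.anderson(sample, dist="norm")
    crit_5 = float(ad.critical_values[list(ad.significance_level).index(5.0)])
    # KS distance to N(0, 1) after standardizing with sample estimates.
    z = (sample - sample.mean()) / sample.std(ddof=1)
    ks = stats.kstest(z, "norm")
    return {
        "qq_r": float(qq_r),
        "ad_stat": float(ad.statistic),
        "ad_crit_5pct": crit_5,
        "ks_stat": float(ks.statistic),
        "skew": float(stats.skew(sample)),
    }


raw_report = normality_report(x)
log_report = normality_report(np.log(x))  # the log of a lognormal sample is normal

# Homogeneity of variance between two synthetic groups (Levene, median-centered).
group_a = rng.normal(loc=12.0, scale=3.0, size=500)
group_b = rng.normal(loc=16.0, scale=6.0, size=500)
levene_result = stats.levene(group_a, group_b)

print(raw_report)
print(log_report)
print(f"Levene p-value: {levene_result.pvalue:.3g}")
```

With samples as large as this dataset, formal tests tend to reject normality for even small deviations, so the Q-Q correlation and skewness are often more informative than the test decision alone.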

*   **Explore Data Structures (If Applicable):** Analyze inherent structures:
    *   **Time Series:** Trends, seasonality, cycles, autocorrelation (ACF/PACF), stationarity (e.g., ADF/KPSS tests).
    *   **Spatial Data:** Patterns, clustering, spatial autocorrelation (e.g., Moran's I).
    *   **Hierarchical/Nested Data:** Structure, dependencies, intra-class correlation (ICC).
    *   **Network/Graph Data:** Node degrees, centrality, communities.  

*   **Initial Bivariate Exploration:** Quick check of relationships between key predictors & target, and among key predictors (e.g., scatter plots, grouped boxplots) *to inform cleaning and refine hypotheses early.*
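
A quick bivariate check of the kind described above can be done with grouped summaries before any plotting. The sketch below uses a small synthetic frame; the status labels and numeric values are illustrative, and `int_rate` is assumed to have already been converted to a number.

```python
import pandas as pd

# Synthetic stand-in for a cleaned `df`; values are illustrative only.
demo = pd.DataFrame({
    "term": ["36 months"] * 4 + ["60 months"] * 4,
    "int_rate": [7.2, 8.0, 11.9, 9.5, 15.1, 13.3, 18.2, 16.0],
    "loan_status": [
        "Fully Paid", "Fully Paid", "Charged Off", "Fully Paid",
        "Charged Off", "Fully Paid", "Charged Off", "Charged Off",
    ],
})

# Numeric predictor vs. categorical target: compare location per class.
rate_by_status = demo.groupby("loan_status")["int_rate"].agg(["count", "median", "mean"])

# Categorical predictor vs. categorical target: row-normalized shares.
status_by_term = pd.crosstab(demo["term"], demo["loan_status"], normalize="index")

print(rate_by_status)
print(status_by_term)
```

The same two calls translate directly into grouped boxplots and stacked bar charts once the relationships worth visualizing have been identified.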

### Clean Data

* **Handle Missing Values:**
    * **Identify Missingness Mechanism:** Use formal tests like Little's MCAR test or visualize patterns of missingness (e.g., matrix plots, heatmaps of missingness) to understand if data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) [Read more [here](https://medium.com/analytics-vidhya/different-types-of-missing-data-59c87c046bf7)]. This informs the handling strategy.
        * Check that the test's assumptions hold: Little's MCAR test assumes multivariate normality.
        * For large datasets, assess missingness via
            * *Visualization:* (e.g., heatmaps, missingness matrices using libraries like `missingno` in Python) as a quick and intuitive alternative to computationally expensive methods like Little's MCAR test.
            * *Summary Statistics:* Comparing descriptive statistics (means, variances) of other variables between rows where a value is observed and rows where it is missing can reveal systematic differences.
            * *Creating Missingness Indicators:* Create binary indicator variables for each variable with missing data (e.g., `is_missing_variable_A`), then analyze the relationships between these indicators and other observed variables using techniques like logistic regression. Significant relationships suggest that the missingness is likely not MCAR (Missing Completely At Random).
            * *Focus on Robustness of Models:* Instead of exhaustively testing the missingness mechanism, focus on using imputation techniques or modeling approaches that are robust to different types of missingness.
    * **Strategize Imputation or Removal:** Based on the missingness mechanism, the percentage of missing data, and the problem context, choose appropriate strategies. 
        * If the proportion of missing values is very low, remove the affected rows; consider dropping a column only when most of its values are missing (use with caution).
        * Otherwise, use an appropriate imputation technique, e.g., simple imputation (mean, median, mode, constant) or more sophisticated methods (K-Nearest Neighbors imputation, model-based imputation like MICE - Multivariate Imputation by Chained Equations).
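
The missingness-indicator idea and a simple imputation baseline can be sketched as below. The data is synthetic, and the missingness in `dti` is deliberately constructed to depend on `annual_inc` so that the indicator has something to detect. Median imputation is only one of the options listed above, chosen here for brevity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
# Synthetic stand-in columns; distributions are illustrative only.
demo = pd.DataFrame({
    "annual_inc": rng.lognormal(mean=11.0, sigma=0.5, size=n),
    "dti": rng.normal(loc=18.0, scale=6.0, size=n),
})

# By construction, `dti` is more often missing for low incomes (a MAR-style pattern).
low_income = demo["annual_inc"] < demo["annual_inc"].median()
drop_prob = np.where(low_income, 0.30, 0.05)
demo.loc[rng.random(n) < drop_prob, "dti"] = np.nan

# Binary indicator: 1 where `dti` is missing.
demo["dti_missing"] = demo["dti"].isna().astype(int)

# If an observed variable differs between the two groups, MCAR is doubtful.
inc_by_missing = demo.groupby("dti_missing")["annual_inc"].median()

# Simple baseline: median imputation, keeping the indicator as a feature.
dti_median = demo["dti"].median()
demo["dti_imputed"] = demo["dti"].fillna(dti_median)

print(inc_by_missing)
print(f"share missing: {demo['dti_missing'].mean():.3f}")
```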
* **Handle Outliers:**
    * **Identify Outliers:** Use statistical methods like Z-score (for normally distributed data), IQR (robust to non-normality), or more advanced techniques tailored to the data distribution or problem (e.g., robust statistical methods, isolation forests, domain-specific rules).
    * **Analyze Impact and Strategize Treatment:** Assess the potential impact of identified outliers on descriptive statistics, distributions, and relationships. Decide on appropriate actions based on the problem context and the likely cause of the outliers (e.g., removal if clear errors, capping/winsorizing to limit extreme values, transformation to reduce their influence, or using models that are robust to outliers).
* **Address Inconsistent Data:** Identify and correct inconsistencies in data entry (e.g., variations in spelling), units (e.g., mixed metric and imperial), or formatting (e.g., date formats).
* **Handle Duplicates:** Identify and handle duplicate records based on defined criteria. Decide on an appropriate method to handle them (keep first/last, remove all, etc.).
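
For the outlier and formatting items above, a minimal sketch might look like this. The income values are illustrative (the last one is a made-up extreme), the 1.5 × IQR fence is the conventional default rather than a requirement, and the string formats are the ones visible in the `df.head()` output.

```python
import pandas as pd

# Illustrative incomes; the last value is a made-up extreme.
annual_inc = pd.Series(
    [42000.0, 55000.0, 61000.0, 79077.0, 107000.0, 120000.0, 155000.0, 9500000.0],
    name="annual_inc",
)

# IQR fences: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = annual_inc.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (annual_inc < lower) | (annual_inc > upper)

# One possible treatment: cap (winsorize) at the fences instead of dropping rows.
capped = annual_inc.clip(lower=lower, upper=upper)

# Formatting fixes: percent strings and "NN months" strings to numbers.
raw = pd.DataFrame({
    "int_rate": ["7.97%", "11.99%", " 15.05%"],
    "term": ["36 months", "36 months", "60 months"],
})
int_rate_num = pd.to_numeric(raw["int_rate"].str.strip().str.rstrip("%"), errors="coerce")
term_months = raw["term"].str.extract(r"(\d+)", expand=False).astype(int)

print(f"fences: [{lower:.0f}, {upper:.0f}], outliers: {int(is_outlier.sum())}")
print(int_rate_num.tolist(), term_months.tolist())
```

Using `errors="coerce"` turns unparseable entries into `NaN`, so any new missing values after conversion point to formats that still need attention.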

### Transform Data

* **Apply Distributional Transforms:** Based on the analysis in Step 1 (Understand Data Distribution), apply transformations to address skewness or other distributional issues if required by potential modeling techniques that assume specific data distributions (e.g., log transformation, square root, reciprocal, Box-Cox transformation).
* **Handle Categorical Variables:**
    * **Analyze Cardinality:** Assess the number of unique values in categorical features. High cardinality requires careful consideration.
    * **Choose Encoding Strategies:** Select appropriate encoding methods based on cardinality, whether the variable is nominal or ordinal, and the requirements of the potential modeling technique. Common methods include One-Hot Encoding, Ordinal Encoding, and Target Encoding (use with caution to avoid leakage). Consider strategies for high-cardinality features like grouping rare categories or using feature hashing.
* **Address Data Scaling:** Scale numerical features to a common range so that no feature dominates model training merely because of its units. This matters most for distance-based algorithms (e.g., K-Nearest Neighbors, SVMs) and for algorithms trained with gradient descent. Common options:
    * Linear scaling (Min-Max)
    * Log scaling
    * Z-score scaling
    * Clipping  

    Read more on choosing the appropriate methods [here](https://developers.google.com/machine-learning/data-prep/transform/normalization).  
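
The four scaling options listed above reduce to a few lines of NumPy. The sketch below uses five illustrative income values; the clipping bounds are hypothetical and would normally come from domain knowledge or percentiles.

```python
import numpy as np

# Illustrative values standing in for a numeric column such as `annual_inc`.
x = np.array([42000.0, 79077.0, 107000.0, 120000.0, 155000.0])

# Linear (min-max) scaling to [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: zero mean, unit standard deviation.
z_score = (x - x.mean()) / x.std(ddof=0)

# Log scaling: compresses a long right tail (log1p also handles zeros).
log_scaled = np.log1p(x)

# Clipping: cap extreme values at chosen bounds (hypothetical bounds here).
clipped = np.clip(x, 50000.0, 150000.0)

print(min_max.round(3))
print(z_score.round(3))
print(log_scaled.round(3))
print(clipped)
```

When a model is trained later, the scaling parameters (minimum, maximum, mean, standard deviation) should be estimated on the training split only and then reused on validation and test data.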

* **Handle Imbalanced Data (Recognition & Planning):** *Recognize* imbalance during analysis and *Plan* for addressing it during modeling (next phase) through techniques like resampling (oversampling, undersampling, SMOTE, etc.) or using appropriate loss functions/evaluation metrics. Read more on treating imbalanced datasets [here](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data).  
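
Recognizing imbalance can be as simple as inspecting class shares, and one planning aid is a set of inverse-frequency class weights. The sketch below uses a synthetic two-class target; the weight formula `n_samples / (n_classes * class_count)` is the same heuristic that scikit-learn applies for `class_weight="balanced"`.

```python
import pandas as pd

# Synthetic target with an 80/20 split; labels are illustrative.
y = pd.Series(["Fully Paid"] * 8 + ["Charged Off"] * 2, name="loan_status")

# Recognition: share of each class.
class_share = y.value_counts(normalize=True)

# Planning: inverse-frequency weights that a model could use later.
counts = y.value_counts()
class_weight = (len(y) / (len(counts) * counts)).to_dict()

print(class_share)
print(class_weight)
```

The real `loan_status` column has more than two values (for example `Current` and `Late (31-120 days)` appear in the preview), so deciding how to map them to a modeling target comes before measuring imbalance.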


**Reference (Address Data Scaling)**: [Machine Learning Crash Course - Numerical data: Normalization](https://developers.google.com/machine-learning/crash-course/numerical-data/normalization)
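
For the distributional transforms and encoding strategies in this section, a short sketch follows. The frame is a small illustrative sample (the largest income is made up to create skew), and the ordinal mapping for `emp_length`, including treating `10+ years` as 10, is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative sample; the 900000.0 income is a made-up extreme.
demo = pd.DataFrame({
    "annual_inc": [42000.0, 155000.0, 120000.0, 79077.0, 107000.0, 900000.0],
    "home_ownership": ["OWN", "MORTGAGE", "MORTGAGE", "RENT", "MORTGAGE", "RENT"],
    "emp_length": ["10+ years", "10+ years", "9 years", "5 years", "< 1 year", "5 years"],
})

# Distributional transform: log1p to reduce right skew.
skew_before = demo["annual_inc"].skew()
demo["annual_inc_log"] = np.log1p(demo["annual_inc"])
skew_after = demo["annual_inc_log"].skew()

# Nominal variable with low cardinality: one-hot encoding.
one_hot = pd.get_dummies(demo["home_ownership"], prefix="home", dtype=int)

# Ordinal variable: explicit order-preserving mapping (partial, for this sample only).
emp_order = {"< 1 year": 0, "5 years": 5, "9 years": 9, "10+ years": 10}
demo["emp_length_ord"] = demo["emp_length"].map(emp_order)

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
print(one_hot)
print(demo["emp_length_ord"].tolist())
```

A high-cardinality column such as `addr_state` would produce many one-hot columns, which is where grouping rare categories or target encoding becomes worth considering.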

### Gain Deeper Insights

* **Analyze Relationships:** Explore predictor-target and predictor-predictor relationships.
    * **Predictor-Target:**  Use visuals (scatter, grouped boxplots, bar plots). Assess statistical relationships:
        *   *Correlation Coefficients*: Pearson (linear, normal data) or Spearman (robust to non-normality/non-linear monotonic).
        *   *t-Tests*: Compare means for a continuous target across two groups (binary predictor).
        *   *ANOVA*: Compare means for a continuous target across >2 groups (categorical predictor); F-test assesses overall model usefulness in regression.
        *   *Chi-Square Test*: Assess association between two categorical variables. 
    * **Predictor-Predictor:** Correlation matrices (heatmaps), pairs plots.
    * **Assess Multicollinearity:** Evaluate harmful multicollinearity among predictor variables, especially for *linear* models, using Variance Inflation Factor (VIF). High VIF values (commonly >5 or 10) indicate a feature is highly predictable by others. Address if necessary (remove/combine features, use regularization).  
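
The relationship checks above might be sketched as follows, assuming SciPy is available. The data is synthetic: `fico_range_high` is generated as `fico_range_low` plus roughly 4 with a little noise (an exact offset would make the VIF infinite), and the contingency counts are made up. The `vif` helper is hypothetical and computes $1 / (1 - R^2)$ by regressing each column on the others with an intercept.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 500
fico_low = rng.normal(loc=700.0, scale=30.0, size=n)
demo = pd.DataFrame({
    "fico_range_low": fico_low,
    "fico_range_high": fico_low + 4.0 + rng.normal(scale=0.5, size=n),
    "dti": rng.normal(loc=18.0, scale=6.0, size=n),
})

# Spearman rank correlation between two numeric predictors.
rho, rho_p = stats.spearmanr(demo["fico_range_low"], demo["fico_range_high"])

# Chi-square test of association on a made-up 2x2 contingency table.
table = pd.DataFrame(
    [[80, 20], [50, 50]],
    index=["36 months", "60 months"],
    columns=["Fully Paid", "Charged Off"],
)
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)


def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = frame.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(frame.columns):
        target = X[:, j]
        design = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")


vif_values = vif(demo)
print(f"Spearman rho: {rho:.3f}, chi-square p: {chi2_p:.2e}, dof: {dof}")
print(vif_values.round(1))
```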
* **Feature Selection & Importance Analysis:** Inform feature choice by assessing potential predictive power.
    * **Use Dimensionality Reduction for Visualization and Insight:** Use PCA/t-SNE for insight into patterns/clusters. *Interpret components derived from methods like PCA if possible, as they represent combinations of original features contributing most to data variance.*  
    * **Analytical Importance Metrics:** Use metrics derived from statistical tests (p-values from t-tests, ANOVA, Chi-Square tests), correlation magnitudes, or simple univariate ranking methods (e.g., SelectKBest based on statistical scores).
    * **Model-Based Importance Analysis:** 
        * *Stepwise Model Selection*: Iteratively add or remove features based on statistical criteria like AIC or BIC, commonly used in regression.
        * *LASSO Regression*: In a regression context, LASSO (L1 regularization) can automatically shrink coefficients of less important features towards zero, effectively performing feature selection. Examining which coefficients remain non-zero provides insight into feature importance.
        * *Tree-Based Importance*: Metrics like Gini Index or Information Gain are useful for ranking features based on how well they split data in decision tree-based models.
        * *Ablation Analysis*: Often used in deep learning, conceptually involves removing or permuting a feature to see the impact on model performance, which can be applied analytically using a simple model to gauge importance.
        * *Recursive Feature Elimination (RFE)*: Techniques like RFE can also be used with a simple model as an analytical tool to rank features.
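
As one concrete instance of the model-based ideas above, the sketch below applies the permutation variant of ablation analysis with a plain least-squares model. Everything here is synthetic: three hypothetical features named `strong`, `weak`, and `noise`, with the target built so that their true influence is known. It illustrates the mechanics only and is not a substitute for LASSO, tree-based importance, or RFE.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
names = ["strong", "weak", "noise"]  # hypothetical feature names
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)


def fit_ols(features: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def r_squared(coef: np.ndarray, features: np.ndarray, target: np.ndarray) -> float:
    pred = np.column_stack([np.ones(len(target)), features]) @ coef
    return 1.0 - np.sum((target - pred) ** 2) / np.sum((target - target.mean()) ** 2)


# Fit on the first 300 rows, evaluate on the held-out 100.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]
coef = fit_ols(X_train, y_train)
baseline = r_squared(coef, X_test, y_test)

# Importance = average drop in held-out R^2 when one feature is shuffled.
importance = {}
for j, name in enumerate(names):
    drops = []
    for _ in range(20):
        X_perm = X_test.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r_squared(coef, X_perm, y_test))
    importance[name] = float(np.mean(drops))

ranking = sorted(importance, key=importance.get, reverse=True)
print(f"baseline R^2: {baseline:.3f}")
print(importance)
print(ranking)
```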
*   **Feature Engineering:** Create new, potentially more informative features from existing ones based on analysis.
    *   Run residual diagnostics (in a regression context) and address distribution-based issues by applying transformations (e.g., polynomial terms, log).
    *   Explore whether one predictor's relationship with the target changes based on another's value. Create interaction terms ($x \times y$) for the potential interactions you identify, especially for regression, and check them statistically (e.g., a preliminary regression with interaction terms).
    *   Aggregate information (counts, ratios, stats) based on groups or windows.
    *   Extract components from date/time features (trend, seasonality, cycles, etc.).
    *   Discretize or bin continuous variables.
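
A few of the feature-engineering ideas above are sketched below on the five preview rows shown earlier. The derived column names (`loan_to_income`, `fico_mid`, and so on) and the `dti` bin edges are hypothetical choices for illustration, not recommended definitions.

```python
import pandas as pd

# The five rows from the `df.head()` preview (numeric columns only, plus state).
demo = pd.DataFrame({
    "loan_amnt": [12000.0, 32000.0, 40000.0, 16000.0, 33000.0],
    "annual_inc": [42000.0, 155000.0, 120000.0, 79077.0, 107000.0],
    "dti": [27.74, 12.35, 31.11, 15.94, 19.06],
    "fico_range_low": [715.0, 715.0, 765.0, 700.0, 785.0],
    "fico_range_high": [719.0, 719.0, 769.0, 704.0, 789.0],
    "addr_state": ["CA", "NJ", "TX", "VA", "TX"],
})

# Ratio feature: loan size relative to income.
demo["loan_to_income"] = demo["loan_amnt"] / demo["annual_inc"]

# Combine two near-duplicate columns into one.
demo["fico_mid"] = demo[["fico_range_low", "fico_range_high"]].mean(axis=1)

# Interaction term between two predictors.
demo["dti_x_loan_to_income"] = demo["dti"] * demo["loan_to_income"]

# Binning a continuous variable (hypothetical edges).
demo["dti_bin"] = pd.cut(demo["dti"], bins=[0, 15, 25, 100], labels=["low", "medium", "high"])

# Group-level aggregate broadcast back to each row.
demo["state_mean_loan"] = demo.groupby("addr_state")["loan_amnt"].transform("mean")

print(demo[["loan_to_income", "fico_mid", "dti_bin", "state_mean_loan"]])
```

Group aggregates such as `state_mean_loan` should be computed on training data only when they feed a model, otherwise information from the evaluation rows leaks into the features.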