### Design Summary

This Python script addresses a common bottleneck in tabular machine learning workflows: **Exploratory Data Analysis (EDA)**, particularly for datasets with many features or high-cardinality categorical variables.

**Key problems it solves:**  
- Systematically analyzes numeric and categorical features for:  
  - Missing values and imputation strategy  
  - Outliers and distributional skew  
  - Feature scaling  
  - Cardinality and encoding decisions  
  - Multicollinearity between features  
- Reduces feature explosion caused by one-hot encoding of high-cardinality categorical variables  
- Ensures target-aware preprocessing decisions to prevent information leakage  

**What the script provides:**  
- Detailed, structured summary artifacts documenting each preprocessing recommendation  
- Transparency and reproducibility for all decisions, including:  
  - Imputation method  
  - Distributional transformations  
  - Scaling  
  - Encoding  
  - Feature removal  
- Guidance for constructing robust, interpretable data pipelines specifically for Linear Regression modeling  

**Outcome:**  
Users gain actionable, defensible preprocessing guidance from automated EDA that scales to large or complex datasets, helping them build efficient pipelines without manually inspecting every feature.

Note: The architecture, workflow, and functionality of this project are entirely my own. Generative AI was used for some coding tasks, like formatting and debugging. In an effort to modernize my approach and stay current with emerging technologies, I treat AI as a collaborative partner to enhance productivity and creativity, while all key decisions, project structure, and underlying logic remain under human control.

---

### Core Data Libraries


- **NumPy**: Supports numerical operations and reproducible random sampling for constructing example categorical features; used for analysis only.

- **Pandas**: Provides the DataFrame structure used to inspect missingness, cardinality, and data types that drive preprocessing recommendations.

#### Scikit-learn Components

- **California Housing loader**: Supplies a realistic regression dataset used solely for analysis and rule evaluation.

- **Train–test split utilities**: Included to conceptually enforce that recommendation-driving statistics come from training data only.

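- **KNN Imputer / Iterative Imputer**: Distance-based and model-based imputers used only to build temporary imputed views of numeric features with medium or high missingness. Their output informs the recommendations and is never written back to the dataset. The iterative imputer must be explicitly enabled because of its experimental status.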

#### System Utilities

- **OS interface**: Used to load external configuration files (e.g., ordinal feature lists) so domain knowledge informs recommendations without being hard-coded.

In [10]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
import os

### Dataset Loading and Target Definition

This section loads the base dataset and defines the regression target used for all preprocessing recommendations. The dataset is treated as read-only and is used solely for feature inspection and analysis.

The California Housing dataset is loaded directly into a Pandas DataFrame to preserve column names and data types. The target variable is explicitly identified as the median house value, ensuring a clear separation between predictors and the response variable throughout the recommendation logic.

In [11]:
# ---------------- Load dataset ----------------
data = fetch_california_housing(as_frame=True)
df = data.frame
target_col = "MedHouseVal"

### Sample Categorical Columns

The California Housing dataset does not contain any native categorical features; therefore, categorical columns were introduced solely for demonstration purposes. When working with datasets that do not include categorical variables, this section of the script may be safely removed. If no categorical features are present, the script will still execute without error and will explicitly annotate the output files to indicate that no categorical variables were detected.

In [12]:
# ---------------- Example categorical columns ----------------
np.random.seed(42)

def sample_categories(values, p):
    # dtype=object keeps np.nan as a true missing value; a plain list of strings plus
    # np.nan is coerced to a string array, which yields the literal text 'nan' instead
    return np.random.choice(np.array(values, dtype=object), size=len(df), p=p)

df['Neighborhood'] = sample_categories([f"N{i}" for i in range(1, 36)] + [np.nan], [0.027]*35 + [0.055])
df['HouseStyle'] = sample_categories(['Ranch','Split','Colonial',np.nan], [0.3,0.3,0.3,0.1])
df['Flag'] = sample_categories(['Yes','No'], [0.5,0.5])
df['Misc'] = sample_categories(['A','B','C','D','E','F','G','H', np.nan], [0.1]*8+[0.2])
df['Quality'] = sample_categories(['Low','Medium','High', np.nan], [0.3,0.5,0.15,0.05])

### Feature–Target Separation and Dataset Splitting

This section separates predictor features from the target variable and defines training and test datasets for use in preprocessing recommendation logic.

The target column is explicitly removed from the feature set to ensure it is never analyzed as an input feature. This enforces a clean conceptual boundary between predictors and the response variable when evaluating feature properties.

The dataset is then split into training and test subsets using a fixed random state. Although no preprocessing is applied, this split establishes the principle that any statistics or heuristics used to inform preprocessing recommendations should be derived from training data only, preventing implicit data leakage.
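
The train-only principle can be illustrated with a minimal sketch. The toy frames `train` and `test` below are hypothetical stand-ins, not part of the script: a fill statistic is computed on the training rows and then reused for the test rows, so the extreme test value never influences it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric feature with gaps in both subsets
train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})
test = pd.DataFrame({"x": [np.nan, 100.0]})

# The statistic is derived from training rows only
train_median = train["x"].median()

# The same training statistic is reused for the test rows
test_filled = test["x"].fillna(train_median)

print(train_median)
print(test_filled.tolist())
```

Had the median been computed on the combined data, the test value of 100 would have shifted it, which is the kind of implicit leakage the split is meant to prevent.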

In [13]:
# ---------------- Features / target split ----------------
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Preprocessing Configuration Thresholds

This section defines global thresholds that control how preprocessing recommendations are generated. These values parameterize decision logic rather than enforcing fixed rules, making the recommendation system transparent and tunable.

- **Rare category threshold**: Defines the minimum proportion a category must represent before being considered stable. Categories below this threshold may trigger consolidation, alternative encoding strategies, or drop recommendations.

- **Missingness thresholds**: Establish escalation levels for handling missing data. Low missingness supports simple imputation recommendations, medium missingness suggests more robust strategies, and high missingness may justify dropping a feature.

- **Dataset size threshold**: Differentiates recommendation behavior based on dataset scale, such as when one-hot encoding is feasible versus when dimensionality-aware strategies are preferred.

- **Dataset size**: Captures the total number of observations to support dataset-size–aware recommendation logic.

These thresholds centralize preprocessing heuristics, ensuring consistent and explainable recommendations across features.

In [14]:
# ---------------- Configuration ----------------
rare_threshold = 0.02
missing_low = 0.05
missing_medium = 0.30
missing_high = 0.50
small_dataset_threshold = 10_000
dataset_size = len(df)

### Ordinal Feature Configuration

This section loads the list of ordinal features from an external configuration file to guide preprocessing recommendations.

Ordinal features are defined outside the script to separate domain knowledge from preprocessing logic. This allows ordinal semantics to be updated without modifying code and prevents these features from being incorrectly treated as nominal categorical variables.

By identifying ordinal columns early, the script can exclude them from cardinality-based analyses and nominal encoding recommendations, ensuring that subsequent guidance respects the ordered nature of these features.

In [15]:
# ---------------- Ordinal columns ----------------
script_dir = os.path.dirname(os.path.abspath(__file__)) if "__file__" in globals() else os.getcwd()
ordinal_file = os.path.join(script_dir, "ordinal_columns.txt")
with open(ordinal_file, "r") as f:
    ordinal_columns = [line.strip() for line in f if line.strip()]
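
The loader above expects `ordinal_columns.txt` to hold one column name per line and ignores blank lines. As a hedged sketch, a more defensive variant could fall back to an empty list when the file is absent. The helper name `load_ordinal_columns`, the fallback behavior, and the column name `Condition` are assumptions for illustration, not part of the original script.

```python
import os
import tempfile

def load_ordinal_columns(path):
    """Return the column names listed in `path`, one per line.

    Hypothetical helper: returns an empty list when the file does not exist.
    """
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        # Strip surrounding whitespace and skip blank lines
        return [line.strip() for line in f if line.strip()]

# Usage with a throwaway configuration file
with tempfile.TemporaryDirectory() as tmp:
    cfg = os.path.join(tmp, "ordinal_columns.txt")
    with open(cfg, "w", encoding="utf-8") as f:
        f.write("Quality\n\n  Condition  \n")
    loaded = load_ordinal_columns(cfg)
    missing = load_ordinal_columns(os.path.join(tmp, "absent.txt"))

print(loaded)
print(missing)
```

With an empty list, every categorical feature would simply be treated as nominal.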

### Numeric Preprocessing Workflow and Summary Generation

This section performs a **comprehensive analysis of numeric features** to generate preprocessing recommendations and justifications, without actually transforming the dataset.

#### 1. Numeric Feature Scope

- Only numeric features are considered, with **ordinal columns excluded** to prevent misinterpretation.

- Imputer objects (KNNImputer and IterativeImputer) are instantiated to support recommendation logic; they are **not applied to the dataset** for final preprocessing.

#### 2. Missingness and Imputation Recommendations

- For each numeric feature, missing count and missing ratio are computed.

- **Decision rules**:

    - **No missing values** → no imputation
    - **Low missingness (≤5%)** → mean or median imputation (median when the mean and median differ by at least 10% of the median; otherwise mean)
    - **Medium missingness (>5% and ≤30%)** → KNN imputation
    - **High missingness (>30% and ≤50%)** → iterative imputation
    - **Very high missingness (>50%)** → iterative imputation + consider dropping

These calculations ensure that each feature’s missingness is addressed analytically, supporting transparent recommendations.
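
The decision rules above can be sketched as a small lookup function. The function name `recommend_numeric_imputer` is hypothetical; the default thresholds mirror the configuration values defined earlier.

```python
def recommend_numeric_imputer(missing_ratio, missing_low=0.05,
                              missing_medium=0.30, missing_high=0.50):
    """Map a missing ratio to (strategy, consider_drop). Illustrative sketch only."""
    if missing_ratio == 0:
        return "None", False
    if missing_ratio <= missing_low:
        return "Mean/Median", False
    if missing_ratio <= missing_medium:
        return "KNN", False
    if missing_ratio <= missing_high:
        return "Iterative", False
    # Above the highest threshold: still imputed for analysis, but flagged for dropping
    return "Iterative", True

for ratio in (0.0, 0.03, 0.20, 0.40, 0.65):
    print(f"{ratio:.0%}: {recommend_numeric_imputer(ratio)}")
```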

#### 3. Skew-Based Transformation

- **Relative skew** is calculated as:

  **relative skew = (mean − median) ÷ standard deviation**

- **Decision**: If |relative skew| > 0.5, a log transformation is recommended to reduce asymmetry; otherwise, no transformation is suggested.
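
As a worked illustration of the skew rule, the sketch below computes relative skew for two toy samples. The helper names `relative_skew` and `recommend_transform` and the sample values are assumptions made for this example.

```python
import numpy as np

def relative_skew(values):
    """(mean - median) / standard deviation, or 0 when the feature is constant."""
    values = np.asarray(values, dtype=float)
    std = values.std(ddof=1)  # sample standard deviation, as pandas uses by default
    if std == 0:
        return 0.0
    return (values.mean() - np.median(values)) / std

def recommend_transform(values, threshold=0.5):
    """Recommend 'log' when |relative skew| exceeds the threshold."""
    return "log" if abs(relative_skew(values)) > threshold else "none"

symmetric = [1, 2, 3, 4, 5]          # mean equals median
right_skewed = [1] * 7 + [10] * 3    # mean pulled above the median

print(recommend_transform(symmetric))
print(round(relative_skew(right_skewed), 3), recommend_transform(right_skewed))
```

A log transformation only applies to strictly positive values, so a recommendation of `"log"` should be read as a prompt to check the feature's range first.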

#### 4. Outlier Detection and Capping

- Outliers are identified using the **IQR method**:

    - Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1

    - Lower bound = Q1 − 1.5 × IQR, Upper bound = Q3 + 1.5 × IQR

- **Outlier factor** = (number of outlier rows ÷ total rows) × 100

- **Decision**: If outlier factor > 1%, **capping is recommended**; otherwise, no capping.
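
The IQR rule can be sketched on a toy sample as follows. The function name `outlier_factor` and the sample values are hypothetical; `np.percentile` uses linear interpolation by default, which matches the default of pandas' `quantile`.

```python
import numpy as np

def outlier_factor(values):
    """Percentage of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return round(n_outliers / len(values) * 100, 2)

# 98 regular values plus two extreme values (100 rows in total)
sample = list(range(1, 99)) + [1000, 2000]
factor = outlier_factor(sample)
cap = "yes" if factor > 1 else "no"

print(factor, cap)
```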

#### 5. Scaling Recommendation

- Scaling defaults to **StandardScaler** for every numeric feature; the feature's standard deviation is reported alongside the recommendation for context:

    - Most features → StandardScaler

    - Features named in a hard-coded example list (e.g., AveBedrms, which has a stable range) → optional scaling

#### 6. Justification Text

For each feature, a human-readable explanation is generated summarizing:

- Missingness and imputation choice

- Correlation with the target

- Skew and recommended transformation

- Outlier factor and capping recommendation

- Standard deviation and scaling advice

These justifications are combined into a single string per feature and stored alongside numeric metrics.

#### 7. Summary Table and Export

- All feature metrics, decisions, and justifications are consolidated into a **Pandas DataFrame**.

- The DataFrame is exported as numeric_EDA_summary.csv for documentation and review, providing a **complete, reproducible record of numeric preprocessing recommendations**.

In [16]:
# ---------------- Numeric preprocessing ----------------
numeric_cols = [c for c in X_train.select_dtypes(include=[np.number]).columns if c not in ordinal_columns]
knn_imputer = KNNImputer(n_neighbors=5)
iter_imputer = IterativeImputer(random_state=42)
imputed_features = {}
imputer_types = {}  # imputation strategy chosen per feature, reused in the summary loop
n_rows_train = len(X_train)
fill_values_dict = {}
numeric_results = []

for col in numeric_cols:
    series = X_train[[col]]
    non_null = series.dropna()
    missing_count = series[col].isna().sum()
    missing_ratio = missing_count / n_rows_train

    # Decide imputation
    consider_drop_flag = False
    if missing_count == 0:
        temp_series = series[col].copy()
        imputer_type = "None"
    elif missing_ratio <= missing_low:
        mean, median = non_null[col].mean(), non_null[col].median()
        fill_value = median if median != 0 and abs(mean - median)/abs(median) >= 0.1 else mean
        temp_series = series[col].fillna(fill_value)
        fill_values_dict[col] = fill_value
        imputer_type = "Median" if fill_value==median else "Mean"
    elif missing_ratio <= missing_medium:
        # fit_transform returns a 2-D array; flatten it before building the Series
        temp_series = pd.Series(knn_imputer.fit_transform(series).ravel(), index=series.index, name=col)
        imputer_type = "KNN"
    elif missing_ratio <= missing_high:
        temp_series = pd.Series(iter_imputer.fit_transform(series).ravel(), index=series.index, name=col)
        imputer_type = "Iterative"
    else:
        temp_series = pd.Series(iter_imputer.fit_transform(series).ravel(), index=series.index, name=col)
        imputer_type = "Iterative"
        consider_drop_flag = True

    imputed_features[col] = temp_series
    imputer_types[col] = imputer_type

X_train_imputed = pd.DataFrame(imputed_features)
feature_target_corr = X_train_imputed.corrwith(y_train)

# ----- Full numeric summary with outliers, skew-based transform, scaling, justification -----
for col in numeric_cols:
    temp_series = X_train_imputed[col]
    imputer_type = imputer_types[col]  # look up this feature's strategy, not the last one computed
    std = temp_series.std()
    rel_skew = 0 if std == 0 else (temp_series.mean()-temp_series.median())/std
    corr_with_target = feature_target_corr[col]
    missing_count = X_train[col].isna().sum()
    missing_ratio = missing_count / n_rows_train
    drop_reasons = ["excessive missingness"] if missing_ratio > missing_high else []
    recommended_drop = ", ".join(drop_reasons) if drop_reasons else ""
    drop_flag = "Yes" if recommended_drop else ""

    # --------- Skew-based transform ----------
    if abs(rel_skew) > 0.5:  # you can tweak this threshold
        transform = "log"
    else:
        transform = "none"

    # --------- Outlier factor & capping ----------
    Q1, Q3 = temp_series.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    outlier_upper = Q3 + 1.5*IQR
    outlier_lower = Q1 - 1.5*IQR
    outlier_count = ((temp_series < outlier_lower) | (temp_series > outlier_upper)).sum()
    outlier_factor = round(outlier_count / n_rows_train * 100, 2)  # percentage of rows
    cap_outliers = "yes" if outlier_factor > 1 else "no"  # can adjust threshold

    # --------- Scaling recommendation ----------
    if col in ["AveBedrms"]:  # example for optional scaling, customize as needed
        scaling = "optional"
    else:
        scaling = "StandardScaler"
    
    # --------- Justification text ----------
    justification_parts = []

    if missing_count == 0:
        justification_parts.append(f"The feature '{col}' has no missing values, so no imputation is required.")
    else:
        justification_parts.append(f"The feature '{col}' has {missing_ratio:.2%} missing values, imputed via {imputer_type}.")

    justification_parts.append(f"The feature '{col}' has a correlation of {round(corr_with_target,2)} with the target variable.")
    justification_parts.append(f"The distribution of '{col}' shows a skew of {round(rel_skew,2)}, and " + 
                               (f"a '{transform}' transformation is recommended to reduce skew." if transform!="none" else "no transformation is necessary."))
    justification_parts.append(f"The outlier factor for '{col}' is {round(outlier_factor,2)}, which indicates " + 
                               ("high" if outlier_factor>5 else "low") + " outlier severity. " +
                               ("Capping of outliers is recommended." if cap_outliers=="yes" else "No capping is required."))
    justification_parts.append(f"The standard deviation of '{col}' is {round(std,2)}, which means that scaling is {scaling.lower()}.")

    justification = " ".join(justification_parts)

    numeric_results.append({
        "Feature": col,
        "MissingRatio": round(missing_ratio,2),
        "Imputer": imputer_type,
        "CorrWithTarget": round(corr_with_target,2),
        "Skew": round(rel_skew,2),
        "Transform": transform,
        "OutlierFactor": round(outlier_factor,2),
        "Cap Outliers": cap_outliers,
        "Std": round(std,2),
        "Scaling": scaling,
        "RecommendedDrop": recommended_drop,
        "DropFlag": drop_flag,
        "Justification": justification
    })

# ----- Export to CSV -----
pd.DataFrame(numeric_results).to_csv("numeric_EDA_summary.csv", index=False)

### Categorical Preprocessing Workflow and Summary Generation  

This section evaluates **categorical features** to provide preprocessing recommendations, including missing value handling, encoding, rare category analysis, and drop guidance.

#### 1. Categorical Feature Scope
- All columns with `object` or `category` data types are included.  
- If no categorical columns are detected, a **placeholder entry** is created stating: *“No categorical columns were detected in the dataset.”*  
- This ensures the summary table is complete and prevents downstream errors.

#### 2. Missing Value Assessment and Imputation
- For each categorical feature:  
  - Missing count and missing ratio are calculated.  
  - A correlation between missingness and the target is computed to detect **informative missing values**.  
- **Imputation decision rules:**  
  - No missing values → no imputation  
  - Low missingness (≤5%) → impute via mode  
  - Medium missingness (>5% and ≤30%) → impute with mode unless missingness is informative (|correlation with target| ≥ 0.1), then mark as `"missing"`  
  - High missingness (>30% and ≤50%) → mark as `"missing"`  
  - Very high missingness (>50%) → recommend **consider dropping**  

- Imputation flags and drop flags are recorded for clarity.
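
The categorical rules above can be sketched as a single function. The name `recommend_categorical_imputation` and the parameter `informative_cutoff` are hypothetical; the defaults mirror the thresholds used in the script.

```python
def recommend_categorical_imputation(missing_ratio, corr_missing_target,
                                     missing_low=0.05, missing_medium=0.30,
                                     missing_high=0.50, informative_cutoff=0.1):
    """Return the imputation label for a categorical feature. Illustrative sketch only."""
    # Missingness counts as informative when it correlates with the target
    informative = abs(corr_missing_target) >= informative_cutoff
    if missing_ratio == 0:
        return "none"
    if missing_ratio <= missing_low:
        return "mode"
    if missing_ratio <= missing_medium:
        return "missing" if informative else "mode"
    if missing_ratio <= missing_high:
        return "missing"
    return "consider drop"

print(recommend_categorical_imputation(0.20, corr_missing_target=0.02))
print(recommend_categorical_imputation(0.20, corr_missing_target=-0.25))
```

The two calls differ only in how strongly missingness tracks the target, which is what switches the medium-missingness recommendation from mode imputation to an explicit `"missing"` category.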

#### 3. Cardinality and Rare Category Analysis
- **Cardinality classification:**  
  - Low (≤10 unique values)  
  - Medium (11–20 unique values)  
  - High (>20 unique values)  
- Rare categories are identified as those below a frequency threshold (`rare_threshold`).  
- The proportion of categories that are rare is labeled **few** (≤20%), **moderate** (≤30%), **many** (≤50%), or **dominantly rare** (>50%) to guide feature treatment and encoding decisions.
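
A minimal sketch of the cardinality and rare-category logic is shown below. The helper names `cardinality_label` and `rare_category_summary` and the toy series are assumptions for illustration.

```python
import pandas as pd

def cardinality_label(n_unique):
    """Classify the number of unique categories as low, medium, or high."""
    if n_unique <= 10:
        return "low"
    if n_unique <= 20:
        return "medium"
    return "high"

def rare_category_summary(series, rare_threshold=0.02):
    """Return the rare categories and a label for how prevalent they are."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < rare_threshold].index.tolist()
    share = len(rare) / len(freq) if len(freq) else 0.0
    if share <= 0.2:
        label = "few"
    elif share <= 0.3:
        label = "moderate"
    elif share <= 0.5:
        label = "many"
    else:
        label = "dominantly rare"
    return rare, label

# Hypothetical toy feature: category "C" covers only 1% of the rows
toy = pd.Series(["A"] * 60 + ["B"] * 39 + ["C"] * 1)
rare, label = rare_category_summary(toy)

print(cardinality_label(toy.nunique()), rare, label)
```

Note that the label reflects the share of *categories* that are rare (one of three here), not the share of rows they cover.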

#### 4. Encoding and High-Cardinality Decisions
- Encoding recommendations depend on feature type and cardinality:  
  - Ordinal → **ordinal encoding**  
  - Binary → no encoding required  
  - Low cardinality → **one-hot encoding**  
  - Medium cardinality → one-hot if dataset is small; otherwise, **target encoding**  
  - High cardinality (>20):  
    - If a label-encoded proxy correlation with the target is low (|correlation| < 0.05) → consider dropping  
    - Otherwise → **target encoding**  

- Drop decisions due to high cardinality are explicitly flagged when applicable.
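
The encoding rules above can be sketched as a decision function. The name `recommend_encoding` and its parameters are hypothetical; the cutoffs follow the values described in this section.

```python
def recommend_encoding(n_unique, is_ordinal=False, dataset_size=20_000,
                       small_dataset_threshold=10_000, proxy_corr=0.0,
                       corr_cutoff=0.05):
    """Return an encoding recommendation for one categorical feature. Sketch only."""
    if is_ordinal:
        return "ordinal"
    if n_unique == 2:
        return "binary / no encoding"
    if n_unique <= 10:
        return "one-hot"
    if n_unique <= 20:
        # Medium cardinality: one-hot is only kept for small datasets
        return "one-hot" if dataset_size < small_dataset_threshold else "target encoding"
    # High cardinality: drop when the proxy correlation with the target is weak
    return "consider drop" if abs(proxy_corr) < corr_cutoff else "target encoding"

print(recommend_encoding(4))
print(recommend_encoding(15, dataset_size=5_000))
print(recommend_encoding(36, proxy_corr=0.01))
```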

#### 5. Justification Text
- A human-readable justification is constructed for each feature summarizing:  
  - Missingness level and imputation strategy  
  - Informative missingness if present  
  - Cardinality classification  
  - Rare category prevalence  
  - Ordinal vs nominal nature  
  - Encoding recommendation or drop consideration  

This ensures **every recommendation is transparent** and tied to specific feature properties.

#### 6. Summary Table and Export
- All metrics, decisions, and justifications are consolidated into a **Pandas DataFrame**.  
- The summary is exported as `categorical_EDA_summary.csv` for documentation, reporting, and review.  

This provides a **complete, reproducible record of categorical preprocessing recommendations** without altering the original dataset.


In [17]:
# ---------------- Categorical preprocessing ----------------
cat_cols = X_train.select_dtypes(include=['object','category'])
categorical_results = []

if cat_cols.empty:
    categorical_results.append({
        "Col": "—",
        "Missing %": "",
        "Impute?": "",
        "Imputation Type": "",
        "Drop Due to Missing?": "",
        "Drop Due to High Cardinality?": "",
        "Cardinality": "",
        "Rare Category Level": "",
        "Encoding": "",
        "Justification": "No categorical columns were detected in the dataset."
    })

for col in cat_cols.columns:
    series = cat_cols[col]
    n_rows = len(series)
    missing_count = series.isna().sum()
    missing_ratio = missing_count / n_rows
    corr_with_missing = series.isna().astype(int).corr(y_train) if missing_count > 0 else 0
    informative_missing = abs(corr_with_missing) >= 0.1

    # ---------- Determine imputation type ----------
    if missing_count == 0:
        impute_type = "none"
    elif missing_ratio <= missing_low:
        impute_type = "mode"
    elif missing_low < missing_ratio <= missing_medium:
        impute_type = "missing" if informative_missing else "mode"
    elif missing_medium < missing_ratio <= missing_high:
        impute_type = "missing"
    else:
        impute_type = "consider drop"

    # ---------- Temporary imputed view ----------
    if impute_type == "mode":
        temp_series = series.fillna(series.mode(dropna=True)[0])
    else:
        temp_series = series.fillna("Missing")

    # ---------- Flags ----------
    drop_due_to_missing = "Yes" if impute_type == "consider drop" else "No"
    impute_flag = "n/a" if impute_type == "consider drop" else ("yes" if missing_count > 0 else "no")

    # ---------- Cardinality ----------
    n_unique = temp_series.nunique(dropna=True)
    if n_unique <= 10:
        cardinality = f"low ({n_unique})"
    elif 11 <= n_unique <= 20:
        cardinality = f"medium ({n_unique})"
    else:
        cardinality = f"high ({n_unique})"

    # ---------- Rare category analysis ----------
    value_counts = temp_series.value_counts(normalize=True)
    rare_categories = value_counts[value_counts < rare_threshold].index
    percent_rare = len(rare_categories) / n_unique if n_unique > 0 else 0
    rare_label = ("few" if percent_rare <= 0.2 else 
                  "moderate" if percent_rare <= 0.3 else 
                  "many" if percent_rare <= 0.5 else 
                  "dominantly rare")

    # ---------- Encoding / High cardinality drop decision ----------
    is_ordinal = col in ordinal_columns

    if is_ordinal:
        encoding = "ordinal"
        drop_due_to_high_cardinality = "No"
    elif n_unique == 2:
        encoding = "binary / no encoding"
        drop_due_to_high_cardinality = "No"
    elif n_unique <= 10:
        encoding = "one-hot"
        drop_due_to_high_cardinality = "No"
    elif 11 <= n_unique <= 20:
        if dataset_size < small_dataset_threshold:
            encoding = "one-hot"
        else:
            encoding = "target encoding"
        drop_due_to_high_cardinality = "No"
    else:  # n_unique > 20
        # Compute proxy correlation with target
        encoded_series = pd.factorize(temp_series)[0]
        # Reuse the training index so the codes align row-by-row with y_train
        corr_with_target = pd.Series(encoded_series, index=temp_series.index).corr(y_train)
        if abs(corr_with_target) < 0.05:
            encoding = "consider drop"
            drop_due_to_high_cardinality = "Yes"
        else:
            encoding = "target encoding"
            drop_due_to_high_cardinality = "No"

    # ---------- Justification ----------
    justification_parts = []

    if missing_count == 0:
        justification_parts.append("No missing values detected")
    elif impute_type == "consider drop":
        justification_parts.append(f"{missing_ratio:.0%} of values are missing; consider dropping")
    else:
        justification_parts.append(f"{missing_ratio:.0%} of values are missing")
        if informative_missing:
            justification_parts.append("missing values are related to the target")
        justification_parts.append(f"imputation via '{impute_type}' is appropriate")

    justification_parts.append(f"the column has {cardinality} cardinality")

    if len(rare_categories) > 0:
        justification_parts.append(f"{rare_label} rare categories were identified")

    if is_ordinal:
        justification_parts.append("categories have a natural order")
    else:
        justification_parts.append("categories are nominal")

    # High cardinality justification
    if n_unique > 20:
        if drop_due_to_high_cardinality == "Yes":
            justification_parts.append("high cardinality and low correlation with target; consider dropping")
        else:
            justification_parts.append("high cardinality; target encoding is recommended")

    if encoding != "consider drop":
        justification_parts.append(f"{encoding} is recommended to balance model stability and complexity")

    justification = "; ".join(justification_parts)

    # ---------- Store results ----------
    categorical_results.append({
        "Col": col,
        "Missing %": round(missing_ratio, 2),
        "Impute?": impute_flag,
        "Imputation Type": impute_type,
        "Drop Due to Missing?": drop_due_to_missing,
        "Drop Due to High Cardinality?": drop_due_to_high_cardinality,
        "Cardinality": cardinality,
        "Rare Category Level": rare_label,
        "Encoding": encoding,
        "Justification": justification
    })

# ---------- Export to CSV ----------
pd.DataFrame(categorical_results).to_csv("categorical_EDA_summary.csv", index=False)

### Multicollinearity Detection and Feature Drop Recommendations  

This section identifies **multicollinearity risks** across numeric and categorical features and produces explicit, target-aware drop recommendations. All findings are consolidated into a CSV audit file.

#### 1. Correlation Threshold Definition
- A **high-correlation threshold** is defined as:

  |corr(Xᵢ, Xⱼ)| > 0.80  

- Any feature pair exceeding this threshold is considered a multicollinearity risk and evaluated for potential feature removal.

#### 2. Categorical Imputation & Drop Context
- Lookup tables are created from prior categorical preprocessing results:
  - **Imputation Type** (e.g., mode, missing, consider drop)
  - **Drop Due to Missing?**
- These lookups allow multicollinearity decisions to incorporate **data quality context**, not just correlation strength.

#### 3. One-Hot Encoding of Categorical Features
- All categorical columns are one-hot encoded.
- No dummy is dropped at this stage (`drop_first=False`) so that:
  - Dummy-variable traps can be explicitly detected
  - Root-column–level decisions remain traceable

- A mapping is created from each one-hot column back to its **root categorical feature** to preserve interpretability.

#### 4. Feature Space Construction
- The analysis operates on a **combined feature matrix**:
  - Imputed numeric features
  - One-hot encoded categorical features

This ensures that all correlation checks are performed in the **actual modeling feature space**.

#### 5. Missing Value Warnings

##### Categorical
- Any categorical feature previously marked as:
  - `consider drop`, or
  - `Drop Due to Missing = Yes`
- Generates a **Missing Value Warning** entry.
- These warnings propagate to all derived one-hot features.

##### Numeric
- Numeric features flagged earlier for excessive missingness (>50%) are also recorded.
- These warnings explicitly note that values were temporarily imputed and should be reconsidered.

#### 6. Categorical–Categorical Correlations
- Pairwise absolute correlations are computed between all one-hot encoded categorical features.
- When correlation exceeds the threshold:
  - Each feature’s correlation with the target is computed.
  - The feature with **lower absolute target correlation** is recommended for removal.
- If either root feature has a missing-value warning:
  - That context is explicitly included in the recommendation.

**Decision rule:**  
Drop argmin(|corr(feature, target)|)
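
The decision rule can be illustrated on a small deterministic example. The function `suggest_drop` and the toy columns `x1`, `x2`, and `x3` are hypothetical and only demonstrate the "drop the less target-correlated feature" logic.

```python
import pandas as pd

def suggest_drop(features, target, col1, col2, corr_threshold=0.8):
    """Return the column to drop from a highly correlated pair, or None."""
    pair_corr = abs(features[col1].corr(features[col2]))
    if pair_corr <= corr_threshold:
        return None  # the pair is not a multicollinearity risk
    corr1 = abs(features[col1].corr(target))
    corr2 = abs(features[col2].corr(target))
    # Keep the feature that is more predictive of the target
    return col1 if corr1 < corr2 else col2

features = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, 2, 3, 4, 6, 5],     # near-copy of x1 with two values swapped
    "x3": [1, -1, 1, -1, 1, -1],  # weakly related to x1
})
target = pd.Series([2, 4, 6, 8, 10, 12])  # exactly 2 * x1

print(suggest_drop(features, target, "x1", "x2"))
print(suggest_drop(features, target, "x1", "x3"))
```

Here `x1` and `x2` are strongly correlated, and `x2` tracks the target less closely, so it is the one suggested for removal; the `x1`/`x3` pair falls below the threshold and yields no suggestion.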

#### 7. Numeric–Numeric Correlations
- Pairwise correlations are computed across all numeric features.
- For any pair exceeding the threshold:
  - Both features’ correlations with the target are evaluated.
  - The feature less predictive of the target is recommended for removal.

This preserves **predictive signal while reducing redundancy**.

#### 8. Categorical–Numeric Correlations
- Each one-hot encoded categorical feature is checked against each numeric feature.
- If correlation exceeds the threshold:
  - No automatic drop is applied due to mixed feature types.
  - If the categorical root feature has a missing-value warning, a drop suggestion is allowed.
  - Otherwise, the issue is documented without forcing removal.

This avoids unsafe drops caused by encoding artifacts.

#### 9. Dummy-Variable Trap Detection
- For each categorical root feature:
  - All associated one-hot columns are checked for perfect correlation:

  |corr(dummyᵢ, dummyⱼ)| ≈ 1.0  

- When detected:
  - The dummy with lower target correlation is recommended for removal.
  - If the root feature had a missing-value warning, this context is included.

This explicitly resolves **linear dependence introduced by full one-hot encoding**.
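
A short sketch shows what the pairwise check detects. The toy series below are assumptions for illustration. For a two-level feature, the two dummies are perfectly (negatively) correlated; for a feature with three or more levels, the dependence is spread across all dummies jointly, so individual pairs stay below 1 in absolute value.

```python
import pandas as pd

# Two-level feature: full one-hot encoding yields two mirror-image columns
flag = pd.Series(["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"])
flag_dummies = pd.get_dummies(flag, prefix="Flag", drop_first=False, dtype=float)
flag_corr = flag_dummies["Flag_No"].corr(flag_dummies["Flag_Yes"])

# Three-level feature: any single pair of dummies is only partially correlated
style = pd.Series(["Ranch", "Split", "Colonial"] * 2)
style_dummies = pd.get_dummies(style, prefix="HouseStyle", drop_first=False, dtype=float)
style_corr = style_dummies["HouseStyle_Ranch"].corr(style_dummies["HouseStyle_Split"])

print(round(flag_corr, 2))
print(round(style_corr, 2))
```

In practice this means a pairwise correlation check flags the trap for binary features, while multi-level features may still need one dummy dropped (for example via `drop_first=True`) when fitting a linear model with an intercept.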

#### 10. Output Artifact
- All findings are written to a single CSV file (`multicollinearity_EDA_Summary.csv`) with the following columns:
  - Section (context)
  - Feature pair
  - Correlation between features
  - Correlation with target
  - Drop recommendation and rationale

The result is a **fully auditable, target-aware multicollinearity report** suitable for documentation and downstream modeling decisions.


In [18]:
# ---------------- Multicollinearity Check ----------------

# Threshold for high correlation
corr_threshold = 0.8
recommendation_file_csv = "multicollinearity_EDA_Summary.csv"


# Prepare categorical imputation lookup
# -----------------------------
cat_impute_lookup = {row['Col']: row['Imputation Type'] for row in categorical_results}
cat_drop_lookup = {row['Col']: row['Drop Due to Missing?'] for row in categorical_results}

# One-hot encode categorical features
# -----------------------------
category_ohe = pd.get_dummies(X_train.select_dtypes(include=['object','category']), drop_first=False)
category_ohe_copy = category_ohe.copy()

# Map one-hot columns to root column
ohe_to_root = {ohe_col: root_col
               for root_col in X_train.select_dtypes(include=['object','category']).columns
               for ohe_col in category_ohe_copy.columns
               if ohe_col.startswith(root_col + "_")}

# Combine numeric + categorical features
# -----------------------------
X_train_combined = pd.concat([X_train_imputed, category_ohe_copy], axis=1)
numeric_cols = X_train_imputed.columns
categorical_cols = category_ohe_copy.columns

# Prepare CSV rows
# -----------------------------
csv_rows = []

if cat_cols.empty:
    csv_rows.append({
        "Section": "Info",
        "Feature 1": "",
        "Feature 2": "",
        "Feature 1 Corr with Target": "",
        "Feature 2 Corr with Target": "",
        "Correlation Between Features": "",
        "Drop Suggestion / Notes": "No categorical columns were detected in the dataset; categorical multicollinearity checks were skipped."
    })

# ---- Missing Value Warnings: Categorical ----
for row in categorical_results:
    if row['Imputation Type'] == "consider drop" or row['Drop Due to Missing?'] == "Yes":
        csv_rows.append({
            "Section": "Missing Value Warnings",
            "Feature 1": row['Col'],
            "Feature 2": "",
            "Feature 1 Corr with Target": "",
            "Feature 2 Corr with Target": "",
            "Correlation Between Features": "",
            "Drop Suggestion / Notes": f"Feature(s) associated with root column '{row['Col']}' have a Missing Value Warning; temporarily filled 'Missing'; consider dropping"
        })

# ---- Missing Value Warnings: Numeric ----
for row in numeric_results:
    if row['RecommendedDrop']:
        csv_rows.append({
            "Section": "Missing Value Warnings",
            "Feature 1": row['Feature'],
            "Feature 2": "",
            "Feature 1 Corr with Target": "",
            "Feature 2 Corr with Target": "",
            "Correlation Between Features": "",
            "Drop Suggestion / Notes": f"Feature '{row['Feature']}' has excessive missing values; temporarily imputed; consider dropping"
        })

# Categorical-Categorical correlations
# -----------------------------
cat_corr = X_train_combined[categorical_cols].corr().abs()
for i in range(len(categorical_cols)):
    for j in range(i+1, len(categorical_cols)):
        if cat_corr.iloc[i, j] > corr_threshold:
            col1, col2 = categorical_cols[i], categorical_cols[j]
            corr1 = X_train_combined[col1].corr(y_train)
            corr2 = X_train_combined[col2].corr(y_train)
            root1, root2 = ohe_to_root.get(col1, ""), ohe_to_root.get(col2, "")
            # Default suggestion: drop whichever column is less correlated with the target
            drop_col = col1 if abs(corr1) < abs(corr2) else col2

            # Determine notes; a Missing Value Warning on a root column takes precedence
            warn1 = cat_impute_lookup.get(root1) == "consider drop"
            warn2 = cat_impute_lookup.get(root2) == "consider drop"
            if warn1 and warn2:
                notes = f"Features associated with root columns '{root1}', '{root2}' have Missing Value Warnings; consider dropping. Suggest drop: {drop_col} (lower correlation with target)"
            elif warn1:
                notes = f"Feature associated with root column '{root1}' has Missing Value Warning; consider dropping. Suggest drop: {col1}"
            elif warn2:
                notes = f"Feature associated with root column '{root2}' has Missing Value Warning; consider dropping. Suggest drop: {col2}"
            else:
                notes = f"Suggest drop: {drop_col} (lower correlation with target)"

            csv_rows.append({
                "Section": "Categorical-Categorical Correlations",
                "Feature 1": col1,
                "Feature 2": col2,
                "Feature 1 Corr with Target": round(corr1,2),
                "Feature 2 Corr with Target": round(corr2,2),
                "Correlation Between Features": round(cat_corr.iloc[i,j],2),
                "Drop Suggestion / Notes": notes
            })

# Numeric-Numeric correlations
# -----------------------------
num_corr = X_train_combined[numeric_cols].corr().abs()
for i in range(len(numeric_cols)):
    for j in range(i+1, len(numeric_cols)):
        if num_corr.iloc[i,j] > corr_threshold:
            col1, col2 = numeric_cols[i], numeric_cols[j]
            corr1 = X_train_combined[col1].corr(y_train)
            corr2 = X_train_combined[col2].corr(y_train)
            drop_col = col1 if abs(corr1)<abs(corr2) else col2
            csv_rows.append({
                "Section": "Numeric-Numeric Correlations",
                "Feature 1": col1,
                "Feature 2": col2,
                "Feature 1 Corr with Target": round(corr1,2),
                "Feature 2 Corr with Target": round(corr2,2),
                "Correlation Between Features": round(num_corr.iloc[i,j],2),
                "Drop Suggestion / Notes": f"Suggest drop: {drop_col} (lower correlation with target)"
            })

# Categorical-Numeric correlations
# -----------------------------
for cat_col in categorical_cols:
    for num_col in numeric_cols:
        corr_value = X_train_combined[cat_col].corr(X_train_combined[num_col])
        if abs(corr_value) > corr_threshold:
            root = ohe_to_root.get(cat_col,"")
            if cat_impute_lookup.get(root)=="consider drop":
                # The drop is suggested because of the Missing Value Warning; no target-correlation comparison is made here
                notes = f"Feature associated with root column '{root}' has a Missing Value Warning; consider dropping. Suggest drop: {cat_col}"
            else:
                notes = "No drop suggested (mixed feature types, drop recommendation not applied)"
            csv_rows.append({
                "Section":"Categorical-Numeric Correlations",
                "Feature 1": cat_col,
                "Feature 2": num_col,
                "Feature 1 Corr with Target": round(X_train_combined[cat_col].corr(y_train),2),
                "Feature 2 Corr with Target": round(X_train_combined[num_col].corr(y_train),2),
                "Correlation Between Features": round(corr_value,2),
                "Drop Suggestion / Notes": notes
            })

# Dummy-variable trap
# -----------------------------
for root_col in X_train.select_dtypes(include=['object','category']).columns:
    ohe_cols = [c for c in category_ohe_copy.columns if c.startswith(root_col+"_")]
    for i in range(len(ohe_cols)):
        for j in range(i+1,len(ohe_cols)):
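            corr_value = X_train_combined[ohe_cols[i]].corr(X_train_combined[ohe_cols[j]])
            # Dummies from the same root are mutually exclusive, so a perfectly collinear pair
            # (e.g. both levels of a binary feature) has correlation -1, not +1; test |corr| == 1.
            # Note: a pairwise check only catches the two-level case, not redundancy across 3+ levels.
            if abs(abs(corr_value) - 1.0) < 1e-8: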
                corr1 = X_train_combined[ohe_cols[i]].corr(y_train)
                corr2 = X_train_combined[ohe_cols[j]].corr(y_train)
                drop_col = ohe_cols[i] if abs(corr1)<abs(corr2) else ohe_cols[j]
                if cat_impute_lookup.get(root_col)=="consider drop":
                    notes = f"Original feature ({root_col}) has a Missing Value Warning; consider dropping. Suggest drop: {drop_col} (lower correlation with target)"
                else:
                    notes = f"Suggest drop: {drop_col} (lower correlation with target)"
                csv_rows.append({
                    "Section":"Dummy-Variable Trap",
                    "Feature 1": ohe_cols[i],
                    "Feature 2": ohe_cols[j],
                    "Feature 1 Corr with Target": round(corr1,2),
                    "Feature 2 Corr with Target": round(corr2,2),
                    "Correlation Between Features": round(corr_value,2),
                    "Drop Suggestion / Notes": notes
                })

# Save recommendations to CSV
# -----------------------------
df_recommendations = pd.DataFrame(csv_rows)
df_recommendations.to_csv(recommendation_file_csv, index=False)