<h3>Setup + sample data</h3>

In [9]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# ---- Sample data ----
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Female", "Male", "Male"],
    "region": ["North", "South", "East", "West", "South"],
    "product_category": ["A", "B", "A", "C", "B"],
    "spend": [120.5, 99.9, 233.0, 150.75, 80.0]  # numeric column left as-is
})

print(df)


   id  gender region product_category   spend
0   1    Male  North                A  120.50
1   2  Female  South                B   99.90
2   3  Female   East                A  233.00
3   4    Male   West                C  150.75
4   5    Male  South                B   80.00


<h3>Identify categorical columns</h3>

In [10]:
categorical_cols = ["gender", "region", "product_category"]
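
On wider tables, listing the columns by hand gets tedious. A minimal sketch, assuming every non-numeric column should be treated as categorical (this will not hold for free-text or date columns), derives the list from the dtypes instead. The name `auto_categorical_cols` is introduced here only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Same sample frame as above, repeated so the snippet runs on its own.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Female", "Male", "Male"],
    "region": ["North", "South", "East", "West", "South"],
    "product_category": ["A", "B", "A", "C", "B"],
    "spend": [120.5, 99.9, 233.0, 150.75, 80.0],
})

# Assumption: any column that is not numeric is categorical.
auto_categorical_cols = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(auto_categorical_cols)
```

It is worth checking the result against the hand-written list, because numeric codes that are really categories (such as a zip code stored as an integer) would be missed by this rule.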


<h3>Fit & transform with OneHotEncoder</h3>
Note: In scikit-learn ≥1.2, pass sparse_output=False to get a dense array. Older versions use sparse=False instead; that parameter was deprecated in 1.2 and removed in 1.4.

In [11]:
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # or sparse=False if older sklearn
encoded_array = enc.fit_transform(df[categorical_cols])

# Get readable column names
encoded_cols = enc.get_feature_names_out(categorical_cols)

# Wrap back into a DataFrame
df_encoded = pd.DataFrame(encoded_array, columns=encoded_cols, index=df.index)

print(df_encoded.head())


   gender_Female  gender_Male  region_East  region_North  region_South  \
0            0.0          1.0          0.0           1.0           0.0   
1            1.0          0.0          0.0           0.0           1.0   
2            1.0          0.0          1.0           0.0           0.0   
3            0.0          1.0          0.0           0.0           0.0   
4            0.0          1.0          0.0           0.0           1.0   

   region_West  product_category_A  product_category_B  product_category_C  
0          0.0                 1.0                 0.0                 0.0  
1          0.0                 0.0                 1.0                 0.0  
2          0.0                 1.0                 0.0                 0.0  
3          1.0                 0.0                 0.0                 1.0  
4          0.0                 0.0                 1.0                 0.0  


<h3>Merge back with the original DataFrame</h3>

In [12]:
df_final = pd.concat([df.drop(columns=categorical_cols), df_encoded], axis=1)
print(df_final)


   id   spend  gender_Female  gender_Male  region_East  region_North  \
0   1  120.50            0.0          1.0          0.0           1.0   
1   2   99.90            1.0          0.0          0.0           0.0   
2   3  233.00            1.0          0.0          1.0           0.0   
3   4  150.75            0.0          1.0          0.0           0.0   
4   5   80.00            0.0          1.0          0.0           0.0   

   region_South  region_West  product_category_A  product_category_B  \
0           0.0          0.0                 1.0                 0.0   
1           1.0          0.0                 0.0                 1.0   
2           0.0          0.0                 1.0                 0.0   
3           0.0          1.0                 0.0                 0.0   
4           1.0          0.0                 0.0                 1.0   

   product_category_C  
0                 0.0  
1                 0.0  
2                 0.0  
3                 1.0  
4                 0.0  

<h3>(Optional) Drop one level per category to avoid multicollinearity</h3>
If you plan to use linear models or logistic regression and want to avoid the dummy variable trap, drop one level per category:

In [13]:
enc_drop = OneHotEncoder(handle_unknown="ignore", drop="first", sparse_output=False)
encoded_drop = enc_drop.fit_transform(df[categorical_cols])
encoded_cols_drop = enc_drop.get_feature_names_out(categorical_cols)
df_encoded_drop = pd.DataFrame(encoded_drop, columns=encoded_cols_drop, index=df.index)

df_final_drop = pd.concat([df.drop(columns=categorical_cols), df_encoded_drop], axis=1)
print(df_final_drop)


   id   spend  gender_Male  region_North  region_South  region_West  \
0   1  120.50          1.0           1.0           0.0          0.0   
1   2   99.90          0.0           0.0           1.0          0.0   
2   3  233.00          0.0           0.0           0.0          0.0   
3   4  150.75          1.0           0.0           0.0          1.0   
4   5   80.00          1.0           0.0           1.0          0.0   

   product_category_B  product_category_C  
0                 0.0                 0.0  
1                 1.0                 0.0  
2                 0.0                 0.0  
3                 0.0                 1.0  
4                 1.0                 0.0  
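
To see which level was dropped for each column, one option is to read the fitted encoder's `categories_` and `drop_idx_` attributes. The sketch below refits a fresh encoder on the same three categorical columns so it runs on its own; the name `reference_levels` is an illustrative choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Male"],
    "region": ["North", "South", "East", "West", "South"],
    "product_category": ["A", "B", "A", "C", "B"],
})
categorical_cols = ["gender", "region", "product_category"]

enc_drop = OneHotEncoder(drop="first")
enc_drop.fit(df[categorical_cols])

# categories_ holds the sorted levels per column; drop_idx_ holds the index
# of the dropped level within each of those arrays.
reference_levels = {
    col: str(cats[int(idx)])
    for col, cats, idx in zip(categorical_cols, enc_drop.categories_, enc_drop.drop_idx_)
}
print(reference_levels)
```

Because the levels are sorted before the first one is dropped, the reference level is the alphabetically first value, not the first one that appears in the data.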


<h3>Doing it the “pipeline way” (recommended for ML)</h3>

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

numeric_cols = ["spend"]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ],
    remainder="drop"  # or 'passthrough' to keep untouched columns
)

pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000))
])

# X = features; y = your target column (add one if you have it)
#pipe.fit(X, y)
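
The sample data has no target column, so the pipeline above is never fitted. The sketch below adds a hypothetical binary target (`churned`, invented purely for this demo) to show the full fit and predict cycle, including a new row whose region was not seen during training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Male"],
    "region": ["North", "South", "East", "West", "South"],
    "product_category": ["A", "B", "A", "C", "B"],
    "spend": [120.5, 99.9, 233.0, 150.75, 80.0],
})
categorical_cols = ["gender", "region", "product_category"]
numeric_cols = ["spend"]

# Hypothetical target: these labels are made up, not part of the original data.
y = pd.Series([1, 0, 1, 1, 0], name="churned")

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ],
    remainder="drop",
)
pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X = df[categorical_cols + numeric_cols]
pipe.fit(X, y)

# "Central" never appeared during fit; handle_unknown="ignore" keeps this from failing.
X_new = pd.DataFrame({
    "gender": ["Female"],
    "region": ["Central"],
    "product_category": ["B"],
    "spend": [110.0],
})
pred = pipe.predict(X_new)
proba = pipe.predict_proba(X_new)
print(pred, proba)
```

Keeping the encoder inside the pipeline means the categories are learned only from the training data, and the same encoding is reapplied at prediction time.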


<h3>Key options you should know</h3>
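<li>handle_unknown="ignore": categories not seen during fit are encoded as all zeros at transform time instead of raising an error.</li>
<li>drop="first": drops the first level (in sorted order) of each categorical variable to remove the exact linear dependence among its dummy columns. Combined with handle_unknown="ignore", an unseen category is also encoded as all zeros, which makes it indistinguishable from the dropped reference level.</li>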
<li>sparse_output=False: returns a dense NumPy array (easy to convert to DataFrame). Use the default sparse matrix for large, high-cardinality data to save memory.</li>
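
The effect of handle_unknown can be checked on a tiny example. This is a sketch, with "Central" as a made-up region that is absent from the training data. It uses the default sparse output and calls `.toarray()` for display, so it does not depend on the `sparse_output` / `sparse` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"region": ["North", "South", "East", "West"]})
new = pd.DataFrame({"region": ["South", "Central"]})  # "Central" was never seen in fit

# Tolerant encoder: unseen categories become a row of zeros.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
encoded_new = enc.transform(new).toarray()  # densify the sparse result for printing
print(enc.categories_[0])
print(encoded_new)

# Strict encoder (the default behaviour): unseen categories raise a ValueError.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True
print("strict encoder raised:", raised)
```

The second row of `encoded_new` is all zeros: the model receives no region signal for that row, which is usually preferable to a crash at inference time.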

<h3>Context: One-Hot Encoding of a Categorical Column</h3>

Suppose we have a column Color with 3 unique values: Red, Blue, and Green.
If we apply one-hot encoding, we get 3 binary columns:
| Color | Color_Red | Color_Blue | Color_Green |
|-------|-----------|------------|-------------|
| Red   | 1         | 0          | 0           |
| Blue  | 0         | 1          | 0           |
| Green | 0         | 0          | 1           |


Now comes the key problem:
If you know two of these columns, you can always figure out the third.

Let’s say you have:
Color_Red = 0
Color_Blue = 0
Then you must have:
Color_Green = 1

That’s perfectly predictable: the third column is fully determined by the other two and carries no new information.



Why this causes issues:
This is called perfect multicollinearity: one column is an exact linear combination of the others, here the constant (intercept) column and the other two dummy columns.

In math terms:
Color_Green = 1 - Color_Red - Color_Blue
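
The identity above can be checked numerically. The sketch below uses a small made-up sample of six rows (two per color) and shows that a design matrix with an intercept plus all three dummy columns has four columns but only rank 3.

```python
import numpy as np

# Six illustrative rows: Red, Blue, Green, Red, Blue, Green
red = np.array([1, 0, 0, 1, 0, 0], dtype=float)
blue = np.array([0, 1, 0, 0, 1, 0], dtype=float)
green = np.array([0, 0, 1, 0, 0, 1], dtype=float)
intercept = np.ones(6)

# The third dummy is fully determined by the other two.
assert np.array_equal(green, 1 - red - blue)

# Intercept + all three dummies: 4 columns, but one of them is redundant.
X_full = np.column_stack([intercept, red, blue, green])
rank_full = np.linalg.matrix_rank(X_full)
print("columns:", X_full.shape[1], "rank:", rank_full)
```

A rank lower than the number of columns is exactly what "perfect multicollinearity" means in matrix terms.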

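Perfect multicollinearity is a problem for models like linear regression and logistic regression when they are fitted with an intercept and no regularization: the design matrix loses full column rank, so the coefficients have no unique solution.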

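If perfect multicollinearity exists:
<li>Coefficients are not uniquely determined, so small changes in the data or the solver can produce very different values.</li>
<li>Solvers that invert a matrix (normal equations, Newton-Raphson) can fail or become numerically unstable.</li>
<li>Individual coefficients can no longer be interpreted reliably.</li>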

<h3>The Fix — Drop One Column</h3>
Instead of using 3 columns, we drop one:


| Color | Color_Red | Color_Blue | Note                    |
|-------|-----------|------------|-------------------------|
| Red   | 1         | 0          |                         |
| Blue  | 0         | 1          |                         |
| Green | 0         | 0          | ← implicitly "Green"    |


Now:
<li>There’s no redundancy.</li>
<li>The model still knows all categories (the dropped one is the reference).</li>
<li>This avoids the dummy variable trap.</li>
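
To illustrate what "the dropped one is the reference" means, here is a small least-squares sketch with a made-up response `y`. With Green dropped, the intercept should equal the mean response for Green, and each remaining coefficient should equal that color's difference from Green.

```python
import numpy as np

colors = ["Red", "Blue", "Green", "Red", "Blue", "Green"]
y = np.array([10.0, 20.0, 30.0, 12.0, 22.0, 32.0])  # invented values for illustration

red = np.array([c == "Red" for c in colors], dtype=float)
blue = np.array([c == "Blue" for c in colors], dtype=float)

# Intercept + two dummies (Green dropped): the matrix now has full column rank.
X = np.column_stack([np.ones(len(colors)), red, blue])
assert np.linalg.matrix_rank(X) == X.shape[1]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, beta_red, beta_blue = coef
print("Green mean (intercept):", intercept)
print("Red vs Green:", beta_red)
print("Blue vs Green:", beta_blue)
```

In this toy data the group means are 11 (Red), 21 (Blue) and 31 (Green), so the fit recovers an intercept of 31 with offsets of -20 and -10.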

<h3>Bottom Line</h3>
<li>Dummy Variable Trap = perfect multicollinearity from including all one-hot encoded columns alongside an intercept.</li>
<li>Fix = Drop one category per variable (for example, by passing drop='first' to OneHotEncoder).</li>

Linear regression requires this assumption: no independent variable may be an exact linear combination of the others.

<h3>Why It’s Also a Problem in Logistic Regression</h3>
Even though logistic regression uses a different loss function (log-loss, not squared error), it also:
<li>Computes gradients of the loss with respect to the feature weights</li>
<li>Solves for parameters using iterative optimization (like gradient descent or Newton-Raphson)</li>
<hr>
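If there's perfect multicollinearity and no regularization:
<li>The log-loss has infinitely many equally good minimizers, so the fitted coefficients are not unique</li>
<li>Newton-Raphson has to invert a singular Hessian, which makes the optimization numerically unstable</li>
<li>Gradient descent may still reduce the loss, but the coefficients it ends up with depend on the starting point and are meaningless individually</li>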
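
The non-uniqueness can be shown without fitting anything. In the sketch below (arbitrary, made-up weights), shifting the intercept up by a constant and every dummy weight down by the same constant leaves the predicted probabilities unchanged, because each row has exactly one dummy equal to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One row per color (Red, Blue, Green), all three dummy columns kept.
X = np.eye(3)

# Arbitrary parameters, chosen only for illustration.
b0 = 0.5
w = np.array([1.0, -2.0, 0.3])

# Shift: add c to the intercept, subtract c from every dummy weight.
c = 7.0
b0_shifted = b0 + c
w_shifted = w - c

p = sigmoid(b0 + X @ w)
p_shifted = sigmoid(b0_shifted + X @ w_shifted)
print(p)
print(p_shifted)
```

Two very different parameter vectors give identical predictions, so the data alone cannot tell the optimizer which one to prefer.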

<h3>Core of the Problem</h3>

<li>Including all one-hot columns makes one of them redundant (it contains no new information).</li>
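<li>The resulting design matrix is rank-deficient, so linear-algebra steps inside the model, such as matrix inversion, become unstable or fail.</li>
<li>Unregularized linear and logistic regression with an intercept require linearly independent feature columns; fully one-hot encoded data without a dropped column violates this requirement. Regularized models, such as scikit-learn's LogisticRegression with its default L2 penalty, still have a unique solution, but dropping a reference level keeps the coefficients easier to interpret.</li>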

<h3>Safe Fix:</h3>

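<p>Drop one category per categorical variable (using drop='first' in OneHotEncoder). Note that drop='if_binary' only drops a column for variables with exactly two categories, so variables with three or more levels keep all of their columns.</p>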

This doesn’t lose information — it just:
<li>Removes redundancy</li>
<li>Keeps math stable</li>
<li>Makes models more interpretable</li>