# One-Hot Encoding

`One-Hot Encoding` is a technique for converting **nominal variables** (categorical variables without any natural order) into a numerical 
format that machine learning models can use.  
It creates a new binary column for each unique category.

In each of these columns:
- 1 → the row belongs to that category  
- 0 → the row does not belong to that category  

---

## Why One-Hot Encoding is Needed

Many machine learning algorithms require numerical input. If categories are simply replaced by integers 
(for example Red = 0, Blue = 1, Green = 2), a model may incorrectly interpret those codes as having an order or magnitude.

One-Hot Encoding avoids this problem by:
- Removing any artificial ordering  
- Treating all categories equally  
- Making categorical data safe for linear and distance-based models  
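
To make the ordering problem concrete, here is a minimal sketch comparing distances under integer codes and under one-hot vectors. The color-to-integer mapping and the helper names `int_distance` and `one_hot_distance` are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical integer codes for three unordered colors
int_codes = {"Red": 0, "Blue": 1, "Green": 2}

# One-hot vectors for the same three colors
one_hot = {
    "Red": np.array([1, 0, 0]),
    "Blue": np.array([0, 1, 0]),
    "Green": np.array([0, 0, 1]),
}


def int_distance(a, b):
    """Absolute difference between the integer codes of two colors."""
    return abs(int_codes[a] - int_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two colors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))


# Integer codes make Green look twice as far from Red as Blue is
print(int_distance("Red", "Blue"), int_distance("Red", "Green"))

# One-hot vectors keep every pair of distinct colors equally far apart
print(one_hot_distance("Red", "Blue"), one_hot_distance("Red", "Green"))
```

A distance-based model such as k-nearest neighbours would treat the first pair of numbers as a real difference in similarity, even though the colors have no order.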

---

## Example

Suppose we have a feature called **Color**:

| Color | Encoded (Red) | Encoded (Blue) | Encoded (Green) |
|------|---------------|----------------|-----------------|
| Red  | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |

Each category becomes its own column.
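
The table above can be reproduced with `pandas.get_dummies`. This is a small sketch on a toy `Color` column, not the dataset used later in this notebook; the `Color_` prefix is simply what `get_dummies` produces when `prefix="Color"` is passed.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})

# Fix the category order so the columns appear as Red, Blue, Green;
# without this, pandas sorts the new columns alphabetically
colors["Color"] = pd.Categorical(colors["Color"], categories=["Red", "Blue", "Green"])

# One binary column per category, stored as integers rather than booleans
encoded = pd.get_dummies(colors["Color"], prefix="Color", dtype=int)

result = pd.concat([colors, encoded], axis=1)
print(result)
```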

---

## How One-Hot Encoding Works

For a categorical feature with \( k \) unique categories:
- One-Hot Encoding creates \( k \) new binary features  
- Exactly one feature is set to 1 for each data point  



In [7]:
%%capture
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install scikit-learn


In [8]:
import numpy as np
import pandas as pd


In [9]:
# Load the dataset (assumes cars.csv is in the working directory)
df = pd.read_csv('cars.csv')
df.sample(5)

| | manufacturer_name | model_name | transmission | color | odometer_value | year_produced | engine_fuel | engine_has_gas | engine_type | engine_capacity | ... | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | duration_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16300 | Volkswagen | Polo Sedan | mechanical | white | 59000 | 2017 | gasoline | False | gasoline | 1.6 | ... | True | False | False | False | False | True | True | False | False | 2 |
| 18998 | Volkswagen | Passat | mechanical | black | 267300 | 2003 | diesel | False | diesel | 1.9 | ... | True | False | False | False | True | False | False | True | True | 98 |
| 13551 | Renault | Logan | mechanical | blue | 124520 | 2010 | gasoline | False | gasoline | 1.6 | ... | True | False | False | False | True | False | False | False | False | 23 |
| 3698 | Opel | Vectra | mechanical | silver | 314000 | 2000 | gasoline | False | gasoline | 1.6 | ... | True | False | False | True | False | False | False | False | True | 56 |
| 17514 | Volkswagen | Passat | mechanical | white | 425432 | 1997 | diesel | False | diesel | 1.9 | ... | False | False | False | False | False | False | False | False | False | 59 |


In [10]:
df['manufacturer_name'].nunique()


55

In [14]:
# Separate the features from the target column
X = df.drop('duration_listed', axis=1)
y = df['duration_listed']
X

| | manufacturer_name | model_name | transmission | color | odometer_value | year_produced | engine_fuel | engine_has_gas | engine_type | engine_capacity | ... | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Subaru | Outback | automatic | silver | 190000 | 2010 | gasoline | False | gasoline | 2.5 | ... | False | True | True | True | False | True | False | True | True | True |
| 1 | Subaru | Outback | automatic | blue | 290000 | 2002 | gasoline | False | gasoline | 3.0 | ... | False | True | False | False | True | True | False | False | False | True |
| 2 | Subaru | Forester | automatic | red | 402000 | 2001 | gasoline | False | gasoline | 2.5 | ... | False | True | False | False | False | False | False | False | True | True |
| 3 | Subaru | Impreza | mechanical | blue | 10000 | 1999 | gasoline | False | gasoline | 3.0 | ... | True | False | False | False | False | False | False | False | False | False |
| 4 | Subaru | Legacy | automatic | black | 280000 | 2001 | gasoline | False | gasoline | 2.5 | ... | False | True | False | True | True | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38526 | Chrysler | 300 | automatic | silver | 290000 | 2000 | gasoline | False | gasoline | 3.5 | ... | False | True | False | False | True | True | False | False | True | True |
| 38527 | Chrysler | PT Cruiser | mechanical | blue | 321000 | 2004 | diesel | False | diesel | 2.2 | ... | False | True | False | False | True | True | False | False | True | True |
| 38528 | Chrysler | 300 | automatic | blue | 777957 | 2000 | gasoline | False | gasoline | 3.5 | ... | False | True | False | False | True | True | False | False | True | True |
| 38529 | Chrysler | PT Cruiser | mechanical | black | 20000 | 2001 | gasoline | False | gasoline | 2.0 | ... | False | True | False | False | False | False | False | False | False | True |


In [15]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [31]:
from sklearn.preprocessing import OneHotEncoder

# Categorical columns to encode
cat_cols = ['manufacturer_name', 'engine_type']

# drop='first' omits the first category of each feature (k - 1 columns per feature),
# which avoids perfectly collinear columns in linear models
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)

# Fit on the training set only, then reuse the learned categories on the test set
X_train_encoded = ohe.fit_transform(X_train[cat_cols])
X_test_encoded = ohe.transform(X_test[cat_cols])

# Wrap the arrays in DataFrames, keeping the original row index so concat aligns rows correctly
encoded_cols = ohe.get_feature_names_out(cat_cols)
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoded_cols, index=X_train.index)
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoded_cols, index=X_test.index)

# Replace the original categorical columns with their encoded versions
X_train_final = pd.concat([X_train.drop(cat_cols, axis=1), X_train_encoded_df], axis=1)
X_test_final = pd.concat([X_test.drop(cat_cols, axis=1), X_test_encoded_df], axis=1)

X_train_final


| | model_name | transmission | color | odometer_value | year_produced | engine_fuel | engine_has_gas | engine_capacity | body_type | has_warranty | ... | manufacturer_name_Toyota | manufacturer_name_Volkswagen | manufacturer_name_Volvo | manufacturer_name_ВАЗ | manufacturer_name_ГАЗ | manufacturer_name_ЗАЗ | manufacturer_name_Москвич | manufacturer_name_УАЗ | engine_type_electric | engine_type_gasoline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37935 | 307 | mechanical | silver | 320000 | 2003 | gasoline | False | 1.6 | universal | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 28568 | 525 | automatic | violet | 400000 | 1995 | diesel | False | 2.5 | universal | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24311 | Captiva | mechanical | grey | 79000 | 2013 | diesel | False | 2.2 | suv | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19393 | Grand Cherokee | automatic | other | 500000 | 2002 | diesel | False | 2.7 | suv | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2540 | Astra | mechanical | silver | 280000 | 2005 | diesel | False | 1.7 | universal | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6265 | Xantia | mechanical | silver | 300000 | 1998 | gasoline | False | 1.8 | hatchback | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 11284 | S-Max | mechanical | grey | 159000 | 2006 | gasoline | False | 2.5 | minivan | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 38158 | 300 | automatic | white | 72420 | 2006 | gasoline | False | 3.5 | limousine | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 860 | Sportage | automatic | orange | 77921 | 2014 | gasoline | False | 2.0 | suv | False | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


### One-Hot Encoding with Top Categories

In [38]:
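# Count manufacturers on the training set only, so the test set does not influence the grouping
counts = X_train['manufacturer_name'].value_counts()
threshold = 100

# Manufacturers with at most `threshold` training rows are treated as rare
rare_categories = counts[counts <= threshold].index

# Work on copies so the assignments below do not trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Collapse every rare manufacturer into a single 'uncommon' category
X_train['manufacturer_name'] = X_train['manufacturer_name'].replace(rare_categories, 'uncommon')
X_test['manufacturer_name'] = X_test['manufacturer_name'].replace(rare_categories, 'uncommon')

X_train['manufacturer_name'].sample(5)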

26078        Nissan
7832       uncommon
5709     Mitsubishi
32441         Skoda
9193           Fiat
Name: manufacturer_name, dtype: object