### **Dummy Variables and One Hot Encoding**

1. Introduction
2. Import Libraries
3. Load and Explore the Dataset
4. Handling Categorical Variables
    - Creating Dummy Variables
    - Avoiding the Dummy Variable Trap
5. Modeling with Dummy Variables
6. One Hot Encoding with Scikit-learn
7. Modeling with One Hot Encoded Data
8. Summary

### 1. Introduction

In this notebook, we will demonstrate how to handle categorical variables using dummy variables and one hot encoding. We'll use a car sales dataset (`car_prices_ohe.csv`), where the car's brand is a categorical variable and prices are in INR (Indian Rupees). We'll show how to prepare the data for machine learning models and explain each step.

### 2. Import Libraries

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

### 3. Load and Explore the Dataset

Let's load the car sales dataset from `car_prices_ohe.csv`. Each record contains the car's brand, engine size, and sale price in INR.

In [13]:
df = pd.read_csv('car_prices_ohe.csv')
df.head()

|   | brand | engine_size | price |
|---|---|---|---|
| 0 | Toyota | 1.8 | 1800000 |
| 1 | Ford | 2.0 | 1700000 |
| 2 | BMW | 3.0 | 3200000 |
| 3 | Toyota | 2.0 | 1850000 |
| 4 | Ford | 2.2 | 1720000 |


Let's check the unique brands and basic statistics.

In [14]:
print('Unique brands:', df['brand'].unique())

Unique brands: ['Toyota' 'Ford' 'BMW']


In [15]:
print(df.describe())

       engine_size         price
count    17.000000  1.700000e+01
mean      2.388235  2.233529e+06
std       0.591484  7.824157e+05
min       1.600000  1.600000e+06
25%       2.000000  1.740000e+06
50%       2.200000  1.800000e+06
75%       3.000000  3.200000e+06
max       3.400000  3.600000e+06


### 4. Handling Categorical Variables

#### 4.1 Creating Dummy Variables

Most machine learning models, including linear regression, require numerical input. We convert the `brand` column into dummy variables: one 0/1 column per brand, set to 1 when the row belongs to that brand.

In [16]:
df_dummies = pd.get_dummies(df, columns=['brand'], dtype=int)
df_dummies.head()

|   | engine_size | price | brand_BMW | brand_Ford | brand_Toyota |
|---|---|---|---|---|---|
| 0 | 1.8 | 1800000 | 0 | 0 | 1 |
| 1 | 2.0 | 1700000 | 0 | 1 | 0 |
| 2 | 3.0 | 3200000 | 1 | 0 | 0 |
| 3 | 2.0 | 1850000 | 0 | 0 | 1 |
| 4 | 2.2 | 1720000 | 0 | 1 | 0 |
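
To make the encoding concrete, the sketch below rebuilds the dummy columns by hand and compares them with `pd.get_dummies`. It uses only the five rows displayed by `df.head()` above, not the full dataset, and the names `sample`, `manual`, and `auto` are introduced here for illustration.

```python
import pandas as pd

# The five rows displayed by df.head() above (not the full dataset).
sample = pd.DataFrame({
    'brand': ['Toyota', 'Ford', 'BMW', 'Toyota', 'Ford'],
    'engine_size': [1.8, 2.0, 3.0, 2.0, 2.2],
    'price': [1800000, 1700000, 3200000, 1850000, 1720000],
})

# A dummy column is 1 where the row belongs to that category, else 0.
manual = pd.DataFrame({
    f'brand_{b}': (sample['brand'] == b).astype(int)
    for b in sorted(sample['brand'].unique())
})

# The same encoding produced by pandas.
auto = pd.get_dummies(sample['brand'], prefix='brand', dtype=int)

print(list(auto.columns))
print((manual.to_numpy() == auto.to_numpy()).all())
```

Each row has exactly one dummy set to 1, which is the property that leads to the dummy variable trap discussed next.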


#### 4.2 Avoiding the Dummy Variable Trap

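When the model includes an intercept, the three dummy columns always sum to 1, so any one of them is fully determined by the other two. This perfect multicollinearity is the dummy variable trap. To avoid it, we drop one dummy column; the dropped category becomes the baseline against which the remaining brand coefficients are measured. Here, we drop `brand_Toyota`.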

In [17]:
df_dummies = df_dummies.drop('brand_Toyota', axis=1)
df_dummies.head()

|   | engine_size | price | brand_BMW | brand_Ford |
|---|---|---|---|---|
| 0 | 1.8 | 1800000 | 0 | 0 |
| 1 | 2.0 | 1700000 | 0 | 1 |
| 2 | 3.0 | 3200000 | 1 | 0 |
| 3 | 2.0 | 1850000 | 0 | 0 |
| 4 | 2.2 | 1720000 | 0 | 1 |
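
The trap can also be checked numerically. The sketch below, again limited to the five rows shown by `df.head()`, compares the number of columns in the design matrix (features plus an intercept column) with its rank. The helper `design_rank` and the frame `sample` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head() above (not the full dataset).
sample = pd.DataFrame({
    'brand': ['Toyota', 'Ford', 'BMW', 'Toyota', 'Ford'],
    'engine_size': [1.8, 2.0, 3.0, 2.0, 2.2],
})

def design_rank(features: pd.DataFrame) -> int:
    """Rank of the design matrix [1, features], i.e. including an intercept column."""
    ones = np.ones((len(features), 1))
    matrix = np.hstack([ones, features.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(matrix))

full = pd.get_dummies(sample, columns=['brand'], dtype=int)
reduced = full.drop('brand_Toyota', axis=1)

# Number of columns (including the intercept) versus rank.
print('all dummies:', full.shape[1] + 1, 'columns, rank', design_rank(full))
print('one dropped:', reduced.shape[1] + 1, 'columns, rank', design_rank(reduced))

# drop_first=True drops the first category in sorted order (BMW here), not Toyota.
first_dropped = pd.get_dummies(sample, columns=['brand'], drop_first=True, dtype=int)
print(first_dropped.columns.tolist())
```

With all three dummies the rank is one less than the number of columns, so the coefficients are not uniquely determined; after dropping one dummy the matrix has full column rank. Which category is dropped changes the baseline and the individual coefficients, but not the predictions.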


### 5. Modeling with Dummy Variables

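Let's train a linear regression model to predict car price (in INR) using engine size and brand dummies. The dataset has only 17 rows, so we fit on all of it; the R² reported below is therefore a training score, not an estimate of performance on unseen data.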

In [18]:
X = df_dummies.drop('price', axis=1)
y = df_dummies['price']
model = LinearRegression()
model.fit(X, y)

In [19]:
print('Model coefficients:', model.coef_)
print('Model intercept:', model.intercept_)
print('R² score:', model.score(X, y))

Model coefficients: [ 470000. 1015500. -193000.]
Model intercept: 880499.9999999995
R² score: 0.9941453558134759


#### Predicting the price of a Ford with 2.2L engine:

In [20]:
# Columns must match the training frame: [engine_size, brand_BMW, brand_Ford] (brand_Toyota dropped)
new_car = pd.DataFrame([[2.2, 0, 1]], columns=X.columns)
pred = model.predict(new_car)
print(f'Predicted price for Ford 2.2L: ₹{pred[0]:.2f}')

Predicted price for Ford 2.2L: ₹1721500.00
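
The fitted model is simply a linear equation: price = intercept + 470000 × engine_size + 1015500 × brand_BMW − 193000 × brand_Ford. Each brand coefficient is the price difference relative to Toyota, the dropped baseline, at the same engine size. As a rough check, the sketch below evaluates that equation by hand using the values printed above (with the intercept rounded to 880500); `predict_price` is a hypothetical helper written for this illustration.

```python
# Values copied from the printed output above; the intercept is rounded to 880500.
INTERCEPT = 880500.0
COEF_ENGINE = 470000.0   # INR per additional litre of engine size
COEF_BMW = 1015500.0     # INR difference, BMW vs. Toyota (the baseline)
COEF_FORD = -193000.0    # INR difference, Ford vs. Toyota (the baseline)

def predict_price(engine_size: float, is_bmw: int, is_ford: int) -> float:
    """Evaluate the fitted linear equation by hand."""
    return (INTERCEPT
            + COEF_ENGINE * engine_size
            + COEF_BMW * is_bmw
            + COEF_FORD * is_ford)

print(f'Ford 2.2L:   ₹{predict_price(2.2, 0, 1):.2f}')
print(f'Toyota 2.2L: ₹{predict_price(2.2, 0, 0):.2f}')
```

The Ford result agrees with the value returned by `model.predict`, and the Toyota result is higher by exactly the magnitude of the Ford coefficient.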


### 6. One Hot Encoding with Scikit-learn

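We can also use `OneHotEncoder` from scikit-learn for the same purpose. Unlike `pd.get_dummies`, a fitted encoder remembers the categories it saw, so the same encoding can be applied to new data and used inside pipelines. Note that `drop='first'` drops the first category in sorted order, here `BMW`, so the baseline differs from Section 4.2, where we dropped `brand_Toyota`. `ColumnTransformer` also places the encoded columns before the passthrough column, so the output columns are `[brand_Ford, brand_Toyota, engine_size]`.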

In [21]:
# Prepare features and target
X2 = df[['brand', 'engine_size']]
y2 = df['price']

# OneHotEncoder with ColumnTransformer
ct = ColumnTransformer([
    ('brand_ohe', OneHotEncoder(drop='first'), ['brand'])
], remainder='passthrough')

X2_encoded = ct.fit_transform(X2)
X2_encoded[:5]

array([[0. , 1. , 1.8],
       [1. , 0. , 2. ],
       [0. , 0. , 3. ],
       [0. , 1. , 2. ],
       [1. , 0. , 2.2]])

### 7. Modeling with One Hot Encoded Data

Let's fit a model using the one-hot encoded features.

In [22]:
model2 = LinearRegression()
model2.fit(X2_encoded, y2)

In [23]:
print('Model coefficients:', model2.coef_)
print('Model intercept:', model2.intercept_)
print('R² score:', model2.score(X2_encoded, y2))

Model coefficients: [-1208500. -1015500.   470000.]
Model intercept: 1895999.9999999981
R² score: 0.9941453558134759
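
The R² is identical to Section 5 because both models describe the same fit and differ only in which brand serves as the baseline. As a rough check, the arithmetic below re-expresses the Section 5 coefficients (Toyota baseline) relative to BMW, using the rounded values printed above; the variable names are illustrative.

```python
# Section 5 model (Toyota is the baseline), values as printed, intercept rounded.
intercept_toyota = 880500.0
bmw_vs_toyota = 1015500.0
ford_vs_toyota = -193000.0

# Switching the baseline to BMW: the BMW offset moves into the intercept,
# and every other brand is measured relative to BMW instead of Toyota.
intercept_bmw = intercept_toyota + bmw_vs_toyota
ford_vs_bmw = ford_vs_toyota - bmw_vs_toyota
toyota_vs_bmw = 0.0 - bmw_vs_toyota

print(intercept_bmw)   # 1896000.0
print(ford_vs_bmw)     # -1208500.0
print(toyota_vs_bmw)   # -1015500.0
```

These agree with the intercept and the first two coefficients printed for `model2`, up to floating-point rounding. The engine-size coefficient (470000) does not depend on the baseline.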


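#### Predicting the price of a BMW with 3.0L engine:

In [24]:
# drop='first' removes BMW (first in sorted order), so the encoded columns are
# [brand_Ford, brand_Toyota, engine_size]; let the fitted transformer encode the row.
new_car = pd.DataFrame({'brand': ['BMW'], 'engine_size': [3.0]})
pred2 = model2.predict(ct.transform(new_car))
print(f'Predicted price for BMW 3.0L: ₹{pred2[0]:.2f}')

Predicted price for BMW 3.0L: ₹3306000.00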
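
Section 6 mentions that `OneHotEncoder` is useful in pipelines. A minimal sketch of that idea follows: the encoder and the regression are chained in a `Pipeline`, so new cars can be passed in as raw rows and the column order never has to be tracked by hand. It is fit only on the five rows shown by `df.head()`, so its numbers will differ from the models trained on all 17 rows; `sample`, `pipe`, and `new_car` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The five rows displayed by df.head() above (not the full dataset).
sample = pd.DataFrame({
    'brand': ['Toyota', 'Ford', 'BMW', 'Toyota', 'Ford'],
    'engine_size': [1.8, 2.0, 3.0, 2.0, 2.2],
    'price': [1800000, 1700000, 3200000, 1850000, 1720000],
})

preprocess = ColumnTransformer([
    ('brand_ohe', OneHotEncoder(drop='first'), ['brand'])
], remainder='passthrough')

# Encoding and regression are fitted and applied as a single estimator.
pipe = Pipeline([
    ('preprocess', preprocess),
    ('regressor', LinearRegression()),
])
pipe.fit(sample[['brand', 'engine_size']], sample['price'])

# New data is passed in its raw form; the pipeline encodes it consistently.
new_car = pd.DataFrame({'brand': ['BMW'], 'engine_size': [3.0]})
pred = pipe.predict(new_car)
print(f'Predicted price for BMW 3.0L (five-row sketch): ₹{pred[0]:.2f}')
```

The same pattern applies to the full `df`: fit the pipeline on `df[['brand', 'engine_size']]` and `df['price']`, then call `predict` on a raw DataFrame.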


### 8. Summary

* In this notebook, we learned how to handle categorical variables using dummy variables and one hot encoding. 

* We saw how to avoid the dummy variable trap and how to use these techniques in machine learning models. 

* This process is essential for preparing real-world data for regression and classification tasks.

---

#### **Exercise: Dummy Variables and One Hot Encoding**

#### Problem Statement

You are provided with a dataset `smartphone_prices.csv` containing information about different smartphone models, their brand, RAM size (in GB), and their price in INR. The brand is a categorical variable. Your tasks are:

1. Load and explore the dataset.
2. Visualize the relationship between RAM and price, and between brand and price.
3. Convert the `brand` column into dummy variables and avoid the dummy variable trap.
4. Train a linear regression model to predict price using RAM and brand dummies.
5. Evaluate the model's R² score.
6. Use scikit-learn's `OneHotEncoder` to encode the brand column and train another model.
7. Predict the price of a new smartphone: Brand = 'OnePlus', RAM = 8GB.
8. Briefly explain each step and the outcome.

---

* Download dataset: [smartphone_prices.csv](https://raw.githubusercontent.com/prakash-ukhalkar/ML/refs/heads/main/03_Dummy_Variables_and_One_Hot_Encoding_ML/Exercise_Dummy_Variables_OHE/smartphone_prices.csv)
* Solution: [Exercise - Dummy Variables and One Hot Encoding](https://github.com/prakash-ukhalkar/ML/blob/main/03_Dummy_Variables_and_One_Hot_Encoding_ML/Exercise_Dummy_Variables_OHE/03_Exercise_Dummy_Variables_OHE.ipynb)