# Feature Engineering : Encoding Categorical Values

## Description
The goal of this task is to process and enhance the dataset by handling categorical variables and engineering new features.
### Steps to be Performed
1️. **Identify categorical variables** that need encoding.  
2️. **Choose appropriate encoding techniques** (OneHot Encoding, Label Encoding).  
3️. **Create new features**.  
4️. **Analyze feature importance** before proceeding to modeling.

# Importing Libraries

In [59]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [60]:
data = pd.read_csv('cleaned-in-vehicle-coupon-recommendation.csv')

In [61]:
def unique_values(df):

  for col in df.columns:
    unique_values = df[col].unique()
    print(f"Unique values in column '{col}':")
    print(unique_values)
    print("-" * 50)

unique_values(data)

Unique values in column 'destination':
['no urgent place' 'home' 'work']
--------------------------------------------------
Unique values in column 'passanger':
['alone' 'friend(s)' 'kid(s)' 'partner']
--------------------------------------------------
Unique values in column 'weather':
['sunny' 'rainy' 'snowy']
--------------------------------------------------
Unique values in column 'temperature':
[55 80 30]
--------------------------------------------------
Unique values in column 'time':
['2pm' '10am' '6pm' '7am' '10pm']
--------------------------------------------------
Unique values in column 'coupon':
['restaurant(<20)' 'coffee house' 'carry out & take away' 'bar'
 'restaurant(20-50)']
--------------------------------------------------
Unique values in column 'expiration':
['1d' '2h']
--------------------------------------------------
Unique values in column 'gender':
['female' 'male']
--------------------------------------------------
Unique values in column 'age':
['21' '46' 

# Manual Encoding

In [70]:
# We are manually encoding the following columns as these columns have same classes and gives us more control over data
def manual_encode(df):

  column = ['Restaurant20To50','RestaurantLessThan20','CarryAway','CoffeeHouse','Bar']

  for col in column:

    df[col] = df[col].replace(
        {
            'never': 0,
            'less1': 1,
            '1~3': 2,
            '4~8': 3,
            'gt8': 4
        }
    )

manual_encode(data)

In [71]:
# We will Label Encode the following columns as they don't follow a order of hierarchy

encoder = LabelEncoder()
def label_encode(df):

  column = ["destination", "passanger", "weather", "time", "coupon", "expiration", "age", "maritalStatus", "education", "occupation"]

  for col in column:
    df[col] = encoder.fit_transform(df[col])

label_encode(data)

In [72]:
data['temperature'] = data['temperature'].replace({30: 0, 55: 1, 80: 2}) # Encoding Temperature column as told in previous notebook|

In [73]:
# Ordinal Encoding manually as they are in increasing order of importance
income_order = {'less than $12500': 0, '$12500 - $24999': 1, '$25000 - $37499': 2, '$37500 - $49999': 3,
                '$50000 - $62499': 4, '$62500 - $74999': 5, '$75000 - $87499': 6, '$87500 - $99999': 7, '$100000 or more': 8}

data['income'] = data['income'].replace(income_order)

In [74]:
data['gender'] = data['gender'].replace({'female': 0, 'male': 1}) # Binary Encoding Gender Column

In [75]:
unique_values(data) # All the categorical values have been converted to numerical values

Unique values in column 'destination':
[1 0 2]
--------------------------------------------------
Unique values in column 'passanger':
[0 1 2 3]
--------------------------------------------------
Unique values in column 'weather':
[2 0 1]
--------------------------------------------------
Unique values in column 'temperature':
[1 2 0]
--------------------------------------------------
Unique values in column 'time':
[2 0 3 4 1]
--------------------------------------------------
Unique values in column 'coupon':
[4 2 1 0 3]
--------------------------------------------------
Unique values in column 'expiration':
[0 1]
--------------------------------------------------
Unique values in column 'gender':
[0 1]
--------------------------------------------------
Unique values in column 'age':
[0 5 1 2 4 6 3 7]
--------------------------------------------------
Unique values in column 'maritalStatus':
[3 2 1 0 4]
--------------------------------------------------
Unique values in column 'has_c

# Analysing The performance of Label Encoding on Different Models

In [76]:
X = data.drop('Y', axis=1)
y = data['Y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [77]:
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.73949246629659


In [78]:
xgb = XGBClassifier(n_estimators=200, random_state=42)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7458366375892149


# Feature Engineering

In [79]:
data['is_morning'] = data['time'].apply(lambda x: 1 if x in [0, 1] else 0) # If Time is morning than 1 else 0

In [80]:
data['is_evening'] = data['time'].apply(lambda x: 1 if x in [3, 4] else 0) # If Time is evening than 1 else 0

In [81]:
data['is_urgent'] = data['destination'].apply(lambda x: 1 if x != 0 else 0) # If destination is not 0 than 1 else 0

In [82]:
data['is_food_coupon'] = data['coupon'].apply(lambda x: 1 if x in [0, 2, 4] else 0) # If coupon is food coupon than 1 else 0

In [83]:
data['low_income'] = data['income'].apply(lambda x: 1 if x < 3 else 0) # If low income than 1 else 0

In [84]:
data['high_income'] = data['income'].apply(lambda x: 1 if x > 6 else 0) # If high income than 1 else 0

In [85]:
unique_values(data) # New Features have been added

Unique values in column 'destination':
[1 0 2]
--------------------------------------------------
Unique values in column 'passanger':
[0 1 2 3]
--------------------------------------------------
Unique values in column 'weather':
[2 0 1]
--------------------------------------------------
Unique values in column 'temperature':
[1 2 0]
--------------------------------------------------
Unique values in column 'time':
[2 0 3 4 1]
--------------------------------------------------
Unique values in column 'coupon':
[4 2 1 0 3]
--------------------------------------------------
Unique values in column 'expiration':
[0 1]
--------------------------------------------------
Unique values in column 'gender':
[0 1]
--------------------------------------------------
Unique values in column 'age':
[0 5 1 2 4 6 3 7]
--------------------------------------------------
Unique values in column 'maritalStatus':
[3 2 1 0 4]
--------------------------------------------------
Unique values in column 'has_c

# Analysing New Feature Importance on Model

In [86]:
X = data.drop('Y', axis=1)
y = data['Y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [87]:
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7402854877081682


In [88]:
xgb = XGBClassifier(n_estimators=200, random_state=42)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7482157018239493


In [89]:
data = data.drop(['is_morning', 'is_evening', 'is_urgent', 'is_food_coupon', 'low_income', 'high_income'], axis=1) # Removing New Features

In [92]:
data.to_csv('encoded-in-vehicle-coupon-recommendation.csv', index=False)

# Final Analyses

1. First we Encoded all the categorical columns to the Numerical Values by different ways like Label Encoding, Manually Using Replace Method.
2. Then we calculated the accuracy of dataset on models like XGBoost, Random Forest
3. Then we created New Features and Calculated accuracy again
4. Since the accuracy remained same so new features are of no use we removed them again.
5. We Created the final dataset which contains the Encoded Numerical values in all the columns