# Feature Engineering

Feature engineering is the process of creating, transforming, or selecting features (input variables) to improve the performance of machine learning models.


🛠️ Raw data → Useful features → Better model accuracy

### Key Tasks in Feature Engineering

1. 🧹 Handling Missing Data
2. ⚖️ Handling Imbalanced Datasets
3. 🌱 SMOTE (Synthetic Minority Over-sampling Technique)
4. ⚠️ Handling Outliers
5. 🔤 Data Encoding

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('sales.csv')

In [None]:
df.head(5)

In [None]:
import seaborn as sns

In [None]:
df = sns.load_dataset("titanic")

In [None]:
df.head()

In [None]:
## check missing values
df.isnull().sum()

In [None]:
df.shape

In [None]:
df.dropna().shape

In [None]:
# column-wise deletion: drop every column that contains a missing value
df.dropna(axis=1)

## Imputing Missing Values

Imputation is the process of replacing missing (null/NaN) values in your dataset with substitute values, so you can use the data for analysis or machine learning without errors.



1. Mean value imputation

Mean imputation is a method where missing values in a numerical column are replaced with the mean (average) of the non-missing values of that column.

In [None]:
sns.histplot(df['age'])

In [None]:
df['Age_mean']=df['age'].fillna(df['age'].mean())

In [None]:
df[['Age_mean','age']]

Mean imputation works well when the data are roughly normally distributed, because the mean is then a good summary of a typical value.

2. Median value imputation

Median imputation replaces missing values in a numerical column with the median (middle value) of the non-missing entries.

Why use it?

The median is robust to outliers, so it is the better choice when the column contains extreme values or is skewed.
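To make the effect of outliers concrete, here is a minimal sketch on a small made-up series (the values are illustrative and not taken from the Titanic data). It fills the same missing entry once with the mean and once with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme outlier (300) and one missing value.
ages = pd.Series([22, 25, 27, 30, 300, np.nan])

mean_fill = ages.fillna(ages.mean())      # the mean is pulled up by the outlier
median_fill = ages.fillna(ages.median())  # the median is not affected by it

print("filled with mean:  ", mean_fill.iloc[-1])
print("filled with median:", median_fill.iloc[-1])
```

The mean-filled value lands far from any typical age, while the median-filled value stays among the ordinary observations.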





In [None]:
df['age_median']=df['age'].fillna(df['age'].median())

In [None]:
df[['age_median','age']]

3. Mode imputation technique

Mode Imputation is a data cleaning technique where missing values in a column are filled with the mode — the value that appears most frequently in that column.

Why use it? Mainly for categorical features (e.g., Gender, City, Department).

It is also used for discrete numerical data with repeating values.



In [None]:
df.head()

In [None]:
df[df["embarked"].isnull()]

In [None]:
df['embarked'].unique()

In [None]:
mode_value=df[df['embarked'].notna()]['embarked'].mode()[0]

1. df['embarked'].notna()
Returns a Boolean Series: True where 'embarked' is not null, and False where it is NaN.

2. df[df['embarked'].notna()]
Filters the DataFrame to include only rows where 'embarked' is not missing.

3. ['embarked']
Selects the 'embarked' column from the filtered DataFrame.

4. .mode()
Calculates the mode (most frequent value) of the 'embarked' column.

This returns a Series with the most frequent value(s).

5. [0]
Gets the first value of the mode Series. mode() can return several values when there is a tie; in that case [0] picks the first one in sorted order.
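The tie behaviour is easy to check on a small made-up column (the values below are hypothetical). The sketch also shows that mode() skips missing values by default, so the notna() filter is a safe but optional extra step.

```python
import pandas as pd

# Hypothetical column where 'S' and 'C' are tied for most frequent.
ports = pd.Series(['S', 'C', 'S', 'C', 'Q', None])

modes = ports.mode()    # missing values are ignored by default (dropna=True)
print(modes.tolist())   # tied modes come back in sorted order

first_mode = modes[0]   # [0] picks the first of the tied values
print(first_mode)
```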

In [None]:
df['embarked_mode']=df['embarked'].fillna(mode_value)

In [None]:
df[['embarked_mode','embarked']]

In [None]:
df['embarked_mode'].isnull().sum()

In [None]:
df['embarked'].isnull().sum()

## Handling Imbalanced Datasets

1. Upsampling
2. Downsampling

In [None]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

np.random.seed(123): Ensures the random numbers generated will be the same every time you run the code — great for reproducibility in experiments or models.

You are going to create 1,000 total samples.
class_0_ratio = 0.9 means 90% of the samples will belong to Class 0.

n_class_0 = int(1000 * 0.9) → n_class_0 = 900
So, you’ll have 900 samples for Class 0.

n_class_1 = 1000 - 900 → n_class_1 = 100
The remaining 100 samples will be for Class 1.

Class 0 → 900 samples
Class 1 → 100 samples



In [None]:
n_class_0,n_class_1

In [None]:
## CREATE MY DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [None]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [None]:
df.tail()

In [None]:
df['target'].value_counts()

Upsampling: increasing the number of samples in the minority class by duplicating existing rows or generating new synthetic data.

In [None]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

df_minority: All rows where target == 1 (minority class)

df_majority: All rows where target == 0 (majority class)

In [None]:
from sklearn.utils import resample
df_minority_upsampled=resample(df_minority,replace=True,
         n_samples=len(df_majority),
         random_state=42)

replace=True: Allows the same row to be picked more than once (i.e., with replacement).

n_samples=len(df_majority): You're making the number of samples in the minority class equal to the majority class.

This is upsampling the minority class to balance the dataset.
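For comparison, the same idea can be sketched with pandas alone. DataFrame.sample with replace=True draws rows with replacement, much like resample does. The toy frame and its column names are assumptions made for this illustration.

```python
import pandas as pd

# Tiny hypothetical imbalanced frame: 6 majority rows, 2 minority rows.
toy = pd.DataFrame({'feature': range(8), 'target': [0] * 6 + [1] * 2})
minority = toy[toy['target'] == 1]
majority = toy[toy['target'] == 0]

# Draw minority rows with replacement until they match the majority count.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced['target'].value_counts())
```

Every upsampled row is an exact copy of one of the original minority rows, which is the main difference from SMOTE.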

In [None]:
df_minority_upsampled.shape

In [None]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

Now the dataset is balanced: both classes have an equal number of samples.



In [None]:
df_upsampled['target'].value_counts()

## Downsampling


Reducing the number of samples in the majority class to match the minority class.

In [None]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [None]:
from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False,
         n_samples=len(df_minority),
         random_state=42)

In [None]:
df_majority_downsampled.shape

In [None]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [None]:
df_downsampled.target.value_counts()

## SMOTE


SMOTE does not simply copy existing rows. It creates new synthetic data points for the minority class (Class 1 here) by interpolating between existing ones.

Here's how:

1. Pick a Class 1 data point.
2. Look at its nearest neighbors (other similar Class 1 points).
3. Draw a line between the point and one of those neighbors.
4. Add a new point somewhere on that line.

🧬 This creates new, slightly different data, not duplicates.
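The steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation idea only, not the imbalanced-learn implementation. The sample points, the helper name synthetic_point, and the choice k=2 are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2D.
minority = np.array([[2.0, 2.1], [2.5, 1.8], [1.9, 2.6], [3.0, 3.1]])

def synthetic_point(points, idx, k=2):
    """Create one synthetic sample between points[idx] and one of its k nearest neighbors."""
    base = points[idx]
    dists = np.linalg.norm(points - base, axis=1)
    # Position 0 after sorting is the point itself (distance 0), so skip it.
    neighbors = np.argsort(dists)[1:k + 1]
    neighbor = points[rng.choice(neighbors)]
    gap = rng.random()  # random position along the segment, in [0, 1)
    return base + gap * (neighbor - base)

new_point = synthetic_point(minority, idx=0)
print(new_point)
```

Because gap lies between 0 and 1, the new point always falls on the segment joining the chosen point and its neighbor.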



In [None]:
from sklearn.datasets import make_classification

In [None]:
X,y=make_classification(n_samples=1000,n_redundant=0,n_features=2,n_clusters_per_class=1,
                    weights=[0.90],random_state=12)

n_samples=1000: Create 1000 total rows (data points).

n_features=2: We’ll have only 2 input features (f1 and f2).

n_redundant=0: No extra unnecessary (redundant) features.

n_clusters_per_class=1: One cluster per class for simplicity.

weights=[0.90]: 90% of samples are class 0, and 10% are class 1 — this creates imbalance.

random_state=12: Ensures you get the same result every time (reproducibility).



In [None]:
import pandas as pd
df1=pd.DataFrame(X,columns=['f1','f2'])  # use X (uppercase), as returned by make_classification
df2=pd.DataFrame(y,columns=['target'])
final_df=pd.concat([df1,df2],axis=1)
final_df.head()


We're converting the NumPy arrays X and y into DataFrames for easy handling.

Then, we combine features and target columns into a single DataFrame: final_df.



In [None]:
final_df['target'].value_counts()

This prints how many samples belong to class 0 and class 1; class 0 has many more.

In [None]:
import matplotlib.pyplot as plt
plt.scatter(final_df['f1'],final_df['f2'],c=final_df['target'])

A 2D scatter plot where points are colored based on the class (target) — class imbalance is visually clear.



In [None]:
from imblearn.over_sampling import SMOTE


In [None]:
## transform the dataset
oversample=SMOTE()
X,y=oversample.fit_resample(final_df[['f1','f2']],final_df['target'])

SMOTE Magic:
SMOTE = Synthetic Minority Over-sampling Technique.

It generates new (synthetic) data points for the minority class by interpolating between existing minority samples.

Result: Classes become balanced — the number of class 1 samples now equals class 0.



In [None]:
X.shape

In [None]:
y.shape

In [None]:
len(y[y==0])

In [None]:
len(y[y==1])

In [None]:
df1=pd.DataFrame(X,columns=['f1','f2'])
df2=pd.DataFrame(y,columns=['target'])
oversample_df=pd.concat([df1,df2],axis=1)

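Same as before: combine the resampled features and target into one clean DataFrame for modeling or visualization. Because pandas objects were passed to fit_resample, X and y already come back as a DataFrame and a Series.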



In [None]:
plt.scatter(oversample_df['f1'],oversample_df['f2'],c=oversample_df['target'])

The plot now shows a balanced distribution: there are visibly more class 1 points than before.



## Handling outliers

### 5 number summary and box plot

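The five-number summary consists of the minimum, Q1, median, Q3, and maximum. The IQR (interquartile range) is derived from it as Q3 - Q1.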

In [None]:
import numpy as np

In [None]:
lst_marks=[45,32,56,75,89,54,32,89,90,87,67,54,45,98,99,67,74]
minimum,Q1,median,Q3,maximum=np.quantile(lst_marks,[0,0.25,0.50,0.75,1.0])

In [None]:
minimum,Q1,median,Q3,maximum

In [None]:

IQR=Q3-Q1
print(IQR)

In [None]:
lower_fence=Q1-1.5*(IQR)
higher_fence=Q3+1.5*(IQR)

In [None]:
lower_fence

In [None]:
higher_fence

In [None]:
lst_marks=[45,32,56,75,89,54,32,89,90,87,67,54,45,98,99,67,74]

In [None]:
import seaborn as sns

In [None]:
sns.boxplot(lst_marks)

In [None]:
lst_marks=[-100,-200,45,32,56,75,89,54,32,89,90,87,67,54,45,98,99,67,74,150,170,180]

In [None]:
sns.boxplot(lst_marks)
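The fences computed earlier can also be used to flag the outliers numerically instead of reading them off the plot. A minimal sketch on the same list, assuming the usual 1.5 × IQR rule; the names is_outlier, outliers, and cleaned are introduced here for illustration.

```python
import numpy as np

marks = np.array([-100, -200, 45, 32, 56, 75, 89, 54, 32, 89, 90, 87,
                  67, 54, 45, 98, 99, 67, 74, 150, 170, 180])

q1, q3 = np.quantile(marks, [0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
higher_fence = q3 + 1.5 * iqr

# Values outside the fences are treated as outliers.
is_outlier = (marks < lower_fence) | (marks > higher_fence)
outliers = marks[is_outlier]
cleaned = marks[~is_outlier]

print("fences:", lower_fence, higher_fence)
print("outliers:", outliers)
```

Whether to drop, cap, or keep the flagged values depends on the problem; the sketch only identifies them.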

## Data encoding

Data encoding converts categorical data into numbers. Most ML algorithms require numeric input and cannot work with text directly, so we encode categories as numbers.



### Nominal/One hot encoding

One Hot Encoding transforms categorical variables into binary vectors.
Each unique category gets its own column, and we use 0 or 1 to indicate the presence of that category.

 Why Use It?
Categorical data like ["Red", "Green", "Blue"] can't be directly used by ML models.

One Hot Encoding avoids giving categories any implicit order (unlike label encoding).

| red | green | blue |
|-----|-------|------|
| 1   | 0     | 0    |
| 0   | 1     | 0    |
| 0   | 0     | 1    |
| 1   | 0     | 0    |
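As a quick check of this layout, the same table can be reproduced with pandas alone using pd.get_dummies. This is a sketch for illustration; the input list is an assumption chosen to match the four rows shown.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# dtype=int yields 0/1 values instead of booleans.
one_hot = pd.get_dummies(colors['color'], dtype=int)

# get_dummies orders columns alphabetically; reorder to match the table.
print(one_hot[['red', 'green', 'blue']])
```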

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [None]:
df=pd.DataFrame({'color':['red','blue','green','green','red','blue']})

In [None]:
df.head()

Create an instance of OneHotEncoder.

In [None]:
encoder=OneHotEncoder()

In [None]:
## perform fit and transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [None]:
import pandas as pd
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [None]:
encoder_df

In [None]:
## for new data (a DataFrame with the same column name avoids a feature-name warning)
encoder.transform(pd.DataFrame({'color':['blue']})).toarray()

In [None]:
pd.concat([df,encoder_df], axis=1)

In [None]:
import seaborn as sns
sns.load_dataset('tips')

### Label Encoding

Label Encoding is the process of converting categorical text data (like 'Red', 'Green') into numerical labels (like 0, 1, 2).

Each unique category is assigned an integer value.

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [None]:
# LabelEncoder expects a 1-D array, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])

In [None]:
lbl_encoder.transform(['red'])

In [None]:
lbl_encoder.transform(['blue'])

### Ordinal Encoding

Ordinal Encoding assigns integer values to categories based on their order or ranking.

Unlike label encoding (which is arbitrary), ordinal encoding assumes the categories follow a logical order.



 ["Small", "Large", "Medium", "Small"] → [0, 2, 1, 0]
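The mapping in this example assumes the order Small < Medium < Large. A minimal pandas sketch that reproduces it with an ordered categorical (an alternative to OrdinalEncoder, shown only for illustration):

```python
import pandas as pd

sizes = pd.Series(['Small', 'Large', 'Medium', 'Small'])

# The explicit category order defines the integer codes:
# Small=0, Medium=1, Large=2.
ordered = pd.Categorical(sizes, categories=['Small', 'Medium', 'Large'], ordered=True)
codes = ordered.codes

print(codes.tolist())
```

If the categories were left in their default alphabetical order, Large would get 0 and Small would get 2, which is why the order has to be stated explicitly.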

In [None]:
from sklearn.preprocessing import OrdinalEncoder
df=pd.DataFrame({'size':['small','medium','large','medium','small','large']})

In [None]:
df

In [None]:
## instance of ordinal encoder and then fit_transform
encoder=OrdinalEncoder(categories=[['small','medium','large']])

In [None]:
encoder.fit_transform(df[['size']])

In [None]:
encoder.transform([['small']])

### Target Guided Ordinal Encoding


Target Guided Ordinal Encoding is a categorical encoding technique in which:

Categories are replaced with ordinal numbers based on the average (or other statistic) of the target variable for each category.

This method uses the relationship between the categorical feature and the target variable to assign meaningful numerical values.

In [None]:
df=pd.DataFrame({
    'city':['newyork','london','paris','tokyo','newyork','paris'],
    'price':[200,150,300,250,180,320]
})

In [None]:
df

In [None]:
mean_price=df.groupby('city')['price'].mean().to_dict()

In [None]:
mean_price

In [None]:
df['city_encoded']=df['city'].map(mean_price)

In [None]:
df

In [None]:
df[['city','city_encoded']]
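The cells above replace each city with its mean price directly. If ordinal ranks are wanted instead, as the definition describes, one possible sketch is to sort the categories by their mean target and number them. The names city_means, city_rank, and the city_rank column are assumptions introduced here.

```python
import pandas as pd

prices = pd.DataFrame({
    'city': ['newyork', 'london', 'paris', 'tokyo', 'newyork', 'paris'],
    'price': [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted from lowest to highest.
city_means = prices.groupby('city')['price'].mean().sort_values()

# Assign ranks 0, 1, 2, ... in that order.
city_rank = {city: rank for rank, city in enumerate(city_means.index)}
prices['city_rank'] = prices['city'].map(city_rank)

print(city_rank)
print(prices)
```

In practice the means should be computed on the training data only, so that information from the target does not leak into validation or test rows.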