**Feature engineering**

Feature engineering is the process of creating, transforming, or selecting features (input variables) to improve the performance of machine learning models. It's one of the most critical steps in the machine learning pipeline. Here are the most important feature engineering techniques, grouped by type:



**Feature Creation**
Creating new features from raw data:

Interaction Features: Multiply or combine two or more features to create interaction terms.

Example: area = length × width

Polynomial Features: Add polynomial terms (e.g., square, cube) of numeric features.

Datetime Features: Extract elements like day, month, weekday, hour from timestamps.

Example: date → year, month, weekday

Text Features:

Word count, character count, average word length

Presence of keywords or use of NLP embeddings (TF-IDF, Word2Vec, BERT)
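
As a minimal sketch of the count-based text features listed above, the snippet below derives character count, word count, average word length, and a keyword flag with pandas string methods. The `texts` series and the output column names are illustrative assumptions, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical sample texts, for illustration only.
texts = pd.Series(["Excellent communicator", "Expert in Python and machine learning"])

words = texts.str.split()  # whitespace tokenization

features = pd.DataFrame({
    "char_count": texts.str.len(),
    "word_count": words.str.len(),
})
# Average word length = total characters in words / number of words.
features["avg_word_len"] = words.apply(lambda ws: sum(len(w) for w in ws) / len(ws))
# Keyword flag: case-insensitive substring match.
features["mentions_python"] = texts.str.contains("python", case=False)
```

Whitespace splitting is the simplest possible tokenizer; punctuation stays attached to words, which may or may not matter for a given task.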

Aggregation Features:

Use group-based statistics (mean, count, sum, etc.) within categories.

Example: Mean salary per department
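
The "mean salary per department" example can be sketched with `groupby(...).transform(...)`, which returns one value per original row so the result can be assigned straight back as a column. The `emp` frame and the new column names are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical employee table, for illustration only.
emp = pd.DataFrame({
    "department": ["Sales", "IT", "HR", "Sales", "IT"],
    "salary": [50000, 80000, 62000, 70000, 66000],
})

grouped = emp.groupby("department")["salary"]
# transform() broadcasts each group statistic back to the rows of that group.
emp["dept_mean_salary"] = grouped.transform("mean")
emp["dept_headcount"] = grouped.transform("count")
# Ratio feature: how a row compares with its own group average.
emp["salary_vs_dept_mean"] = emp["salary"] / emp["dept_mean_salary"]
```

When the aggregated column is related to the prediction target, the group statistics should be computed on training data only to avoid leakage.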



**Feature Transformation**

Changing the scale or distribution of features:

Normalization / Min-Max Scaling: Rescales features to a 0-1 range.

Standardization (Z-score): Transforms data to have zero mean and unit variance.

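Log/Box-Cox/Power Transformations: Reduce skew in heavy-tailed data. Log and Box-Cox require strictly positive values; the Yeo-Johnson power transform also handles zeros and negatives.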

Quantile Transformation: Maps feature values to a uniform or normal distribution.

Binning: Convert continuous variables into discrete bins (e.g., age groups).
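
A small sketch of three of the transformations above (min-max scaling, a log transform, and binning), written with plain pandas and NumPy. The `df_t` frame, the bin edges, and the bin labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes and ages, for illustration only.
df_t = pd.DataFrame({
    "income": [20000, 30000, 40000, 50000, 1000000],
    "age": [18, 25, 40, 65, 80],
})

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
inc = df_t["income"]
df_t["income_minmax"] = (inc - inc.min()) / (inc.max() - inc.min())

# log1p computes log(1 + x): it compresses the long right tail and tolerates zeros.
df_t["income_log"] = np.log1p(inc)

# Binning with right-closed intervals (0, 30], (30, 60], (60, 120].
df_t["age_group"] = pd.cut(
    df_t["age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"]
)
```

In a real pipeline the minimum, maximum, and bin edges would be learned from the training split and reused on new data.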

**Handling Missing Values**

Dealing with NaNs or nulls:

Imputation:

Mean, median, mode imputation

KNN or regression-based imputation

Missing Indicator: Add a binary flag column indicating whether a value was missing.
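
The missing-indicator idea can be sketched as follows: record which rows were missing first, then impute. The `df_m` frame and the `age_was_missing` column name are assumptions for this example.

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame({"age": [25, 45, 35, np.nan, 50]})

# Record missingness before imputing, otherwise the information is lost.
df_m["age_was_missing"] = df_m["age"].isna().astype(int)
# Median imputation is less sensitive to outliers than the mean.
df_m["age"] = df_m["age"].fillna(df_m["age"].median())
```

scikit-learn's `SimpleImputer` offers the same behaviour through its `add_indicator` option, and `KNNImputer` covers the KNN-based variant.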

**Encoding Categorical Variables**

Convert categories into numeric formats:

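Label Encoding: Assigns a unique integer to each category. The integers imply an order, so this suits ordinal categories and tree-based models better than linear models.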

One-Hot Encoding: Creates binary columns for each category.

Target Encoding: Replace each category with the mean of the target variable within that category. Compute the means on training data only to avoid target leakage.

Frequency/Count Encoding: Encode categories with their frequency/count.
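
Frequency and target encoding are not covered by the code cells further down, so here is a minimal pandas sketch. The `df_e` frame, the `churned` target, and the output column names are hypothetical. The target encoding shown is the naive version, computed on all rows for brevity.

```python
import pandas as pd

# Hypothetical data with a binary target, for illustration only.
df_e = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "churned": [1, 0, 1, 1, 0, 0],
})

# Frequency encoding: share of rows that belong to each category.
freq = df_e["city"].value_counts(normalize=True)
df_e["city_freq"] = df_e["city"].map(freq)

# Naive target encoding: per-category mean of the target.
# Computed on the full data here, which leaks the target into the feature.
target_mean = df_e.groupby("city")["churned"].mean()
df_e["city_target"] = df_e["city"].map(target_mean)
```

For actual modelling, the per-category means are usually computed out-of-fold and smoothed toward the global mean so that rare categories do not get extreme values.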

**Feature Selection**

Choosing the most important features:

Filter Methods: Use statistical tests (chi-square, ANOVA, correlation).

Wrapper Methods: Recursive Feature Elimination (RFE), Forward/Backward selection.

Embedded Methods: Feature importance from models (e.g., Lasso, Tree-based models).
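
As one possible illustration of a filter method, the sketch below scores features with the ANOVA F-test and keeps the top three using scikit-learn's `SelectKBest`. The synthetic dataset and the choice of `k=3` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Filter method: score each feature independently and keep the 3 best.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Boolean mask over the original columns showing which ones were kept.
kept = selector.get_support()
```

Wrapper and embedded methods follow the same fit/transform pattern in scikit-learn (for example `RFE` and `SelectFromModel`), but they need an estimator to drive the selection.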

**Dimensionality Reduction**

Reduce the feature space while preserving as much of the data's structure as possible (variance for PCA, local neighborhoods for t-SNE and UMAP):

Principal Component Analysis (PCA)

t-SNE / UMAP (for visualization)

Autoencoders (neural network-based)

**Domain-Specific Features**

Use knowledge of the problem domain:

Example: In finance, use technical indicators (moving averages, RSI) for stock prediction.

In health, create BMI from height and weight.
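
Both domain examples reduce to a line or two of pandas. The sketch below uses made-up measurements and prices; the column names and the 3-day window are assumptions.

```python
import pandas as pd

# Hypothetical patient measurements, for illustration only.
patients = pd.DataFrame({"height_cm": [170, 160, 180], "weight_kg": [65.0, 80.0, 72.0]})
# BMI = weight in kg divided by the square of height in metres.
patients["bmi"] = patients["weight_kg"] / (patients["height_cm"] / 100) ** 2

# Hypothetical closing prices.
prices = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0], name="close")
# 3-day simple moving average; the first two entries are NaN because the window is incomplete.
sma_3 = prices.rolling(window=3).mean()
```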

In [2]:
import pandas as pd
import numpy as np

data = {
    'age': [25, 45, 35, np.nan, 50],
    'salary': [50000, 80000, 62000, 70000, np.nan],
    'joined_date': pd.to_datetime(['2015-03-01', '2013-07-15', '2016-08-10', '2018-01-01', '2014-09-09']),
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'description': [
        "Team player and result-oriented",
        "Expert in Python and machine learning",
        "Hard-working and detail-focused",
        "Excellent communicator",
        "Python developer with data analysis experience"
    ]
}

df = pd.DataFrame(data)


In [3]:
df

| | age | salary | joined_date | department | description |
|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | 2015-03-01 | Sales | Team player and result-oriented |
| 1 | 45.0 | 80000.0 | 2013-07-15 | IT | Expert in Python and machine learning |
| 2 | 35.0 | 62000.0 | 2016-08-10 | HR | Hard-working and detail-focused |
| 3 | NaN | 70000.0 | 2018-01-01 | Sales | Excellent communicator |
| 4 | 50.0 | NaN | 2014-09-09 | IT | Python developer with data analysis experience |


In [4]:
# Impute missing values: mean for age, median for salary
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())

In [5]:
df

| | age | salary | joined_date | department | description |
|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | 2015-03-01 | Sales | Team player and result-oriented |
| 1 | 45.0 | 80000.0 | 2013-07-15 | IT | Expert in Python and machine learning |
| 2 | 35.0 | 62000.0 | 2016-08-10 | HR | Hard-working and detail-focused |
| 3 | 38.75 | 70000.0 | 2018-01-01 | Sales | Excellent communicator |
| 4 | 50.0 | 66000.0 | 2014-09-09 | IT | Python developer with data analysis experience |


In [6]:
df['year_joined'] = df['joined_date'].dt.year


In [7]:
df['month_joined'] = df['joined_date'].dt.month
# weekday: Monday=0 ... Sunday=6
df['weekday_joined'] = df['joined_date'].dt.weekday


In [8]:
df

| | age | salary | joined_date | department | description | year_joined | month_joined | weekday_joined |
|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | 2015-03-01 | Sales | Team player and result-oriented | 2015 | 3 | 6 |
| 1 | 45.0 | 80000.0 | 2013-07-15 | IT | Expert in Python and machine learning | 2013 | 7 | 0 |
| 2 | 35.0 | 62000.0 | 2016-08-10 | HR | Hard-working and detail-focused | 2016 | 8 | 2 |
| 3 | 38.75 | 70000.0 | 2018-01-01 | Sales | Excellent communicator | 2018 | 1 | 0 |
| 4 | 50.0 | 66000.0 | 2014-09-09 | IT | Python developer with data analysis experience | 2014 | 9 | 1 |


In [9]:
# Approximate tenure in whole years, measured against the hard-coded reference year 2025
df["experience"] = 2025 - df["year_joined"]
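
Subtracting calendar years ignores the month and day of joining. A slightly more precise variant, sketched here under the assumption of an explicit reference date, works from the elapsed days instead. The `joined` series, `reference_date`, and `tenure_years` are names introduced for this example.

```python
import pandas as pd

joined = pd.Series(pd.to_datetime(["2015-03-01", "2018-01-01"]))

# Assumed reference date; fixing it keeps results reproducible across runs.
reference_date = pd.Timestamp("2025-01-01")

# Elapsed days converted to years (365.25 days per year on average).
tenure_years = (reference_date - joined).dt.days / 365.25
```

Using a fixed date rather than the current date means rerunning the notebook later does not silently change the feature.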

In [10]:
df

| | age | salary | joined_date | department | description | year_joined | month_joined | weekday_joined | experience |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | 2015-03-01 | Sales | Team player and result-oriented | 2015 | 3 | 6 | 10 |
| 1 | 45.0 | 80000.0 | 2013-07-15 | IT | Expert in Python and machine learning | 2013 | 7 | 0 | 12 |
| 2 | 35.0 | 62000.0 | 2016-08-10 | HR | Hard-working and detail-focused | 2016 | 8 | 2 | 9 |
| 3 | 38.75 | 70000.0 | 2018-01-01 | Sales | Excellent communicator | 2018 | 1 | 0 | 7 |
| 4 | 50.0 | 66000.0 | 2014-09-09 | IT | Python developer with data analysis experience | 2014 | 9 | 1 | 11 |


In [11]:
# Label encoding: integer codes follow sorted label order (HR=0, IT=1, Sales=2)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['department_encoded'] = le.fit_transform(df['department'])

In [12]:
# One-hot encoding; drop_first=True drops the first category (HR), leaving department_IT and department_Sales

df = pd.get_dummies(df, columns=['department'], drop_first=True)

In [13]:
# Standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age_scaled', 'salary_scaled']] = scaler.fit_transform(df[['age', 'salary']])


In [14]:
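# Polynomial features (degree 2): age, salary, age^2, age salary, salary^2
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'salary']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['age', 'salary']))
# Caution: poly_df repeats 'age' and 'salary', so after this concat df holds two columns with each of those names
df = pd.concat([df, poly_df], axis=1)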


In [15]:
# Text feature engineering (TF-IDF); max_features=5 keeps the 5 most frequent terms across the corpus
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5)
tfidf_matrix = vectorizer.fit_transform(df['description'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

df = pd.concat([df, tfidf_df], axis=1)


In [16]:
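# Dimensionality reduction (PCA)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# Caution: df holds duplicate 'age' and 'salary' columns from the polynomial step, so this selects four columns.
# The inputs are unscaled, so the first component is dominated by salary. fillna(0) is a no-op after the imputation above.
pca_features = pca.fit_transform(df[['age', 'salary']].fillna(0))
df['pca_1'], df['pca_2'] = pca_features[:, 0], pca_features[:, 1]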
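
PCA is sensitive to feature scale, so a common variant standardizes the inputs first. The sketch below chains `StandardScaler` and `PCA` in a scikit-learn pipeline on a stand-in frame with the same values as the imputed `age` and `salary` columns; the names `X`, `pipeline`, `components`, and `explained` are assumptions for this example.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in frame with two features on very different scales.
X = pd.DataFrame({
    "age": [25.0, 45.0, 35.0, 38.75, 50.0],
    "salary": [50000.0, 80000.0, 62000.0, 70000.0, 66000.0],
})

# Standardize first so each feature contributes on a comparable scale.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
components = pipeline.fit_transform(X)

# Fraction of the total variance captured by each component.
explained = pipeline.named_steps["pca"].explained_variance_ratio_
```

With only two input features, two components retain all of the variance; the reduction becomes useful once `n_components` is smaller than the number of inputs.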


In [18]:
df.head()


| | age | salary | joined_date | description | year_joined | month_joined | weekday_joined | experience | department_encoded | department_IT | ... | age^2 | age salary | salary^2 | analysis | and | communicator | detail | python | pca_1 | pca_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | 2015-03-01 | Team player and result-oriented | 2015 | 3 | 6 | 10 | 2 | False | ... | 625.0 | 1250000.0 | 2500000000.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -22061.739638 | -4.720869 |
| 1 | 45.0 | 80000.0 | 2013-07-15 | Expert in Python and machine learning | 2013 | 7 | 0 | 12 | 1 | True | ... | 2025.0 | 3600000.0 | 6400000000.0 | 0.0 | 0.638711 | 0.0 | 0.0 | 0.769447 | 20364.676662 | -4.753072 |
| 2 | 35.0 | 62000.0 | 2016-08-10 | Hard-working and detail-focused | 2016 | 8 | 2 | 9 | 0 | False | ... | 1225.0 | 2170000.0 | 3844000000.0 | 0.0 | 0.556451 | 0.0 | 0.830881 | 0.0 | -5091.17123 | -1.905323 |
| 3 | 38.75 | 70000.0 | 2018-01-01 | Excellent communicator | 2018 | 1 | 0 | 7 | 2 | False | ... | 1501.5625 | 2712500.0 | 4900000000.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 6222.538289 | -4.153082 |
| 4 | 50.0 | 66000.0 | 2014-09-09 | Python developer with data analysis experience | 2014 | 9 | 1 | 11 | 1 | True | ... | 2500.0 | 3300000.0 | 4356000000.0 | 0.778283 | 0.0 | 0.0 | 0.0 | 0.627914 | 565.695918 | 15.532346 |
