**What is Feature Engineering?**
 
Feature engineering is the process of creating new features or modifying existing ones to improve the performance of machine learning models. It involves techniques like feature extraction, transformation, encoding, and scaling to make data more useful for predictions.

**Why Do We Need Feature Engineering?**

1. **Improves Model Performance** – Features that expose the underlying signal help models make more accurate predictions.

2. **Reduces Overfitting** – Helps eliminate noise and irrelevant data.

3. **Handles Missing Data** – Creates meaningful replacements for missing values.

4. **Enables Better Interpretability** – Makes features more understandable and useful.

5. **Reduces Dimensionality** – Removes redundant or uninformative features, making the model more efficient.
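
As a small illustration of the missing-data point above, the sketch below fills a gap with the column median and keeps an indicator column so the model can still see that the value was absent. The data, the `df_missing` frame, and the `AmountWasMissing` column name are all hypothetical, and median imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one transaction amount is missing
df_missing = pd.DataFrame({'TransactionAmount': [500.0, np.nan, 700.0, 1000.0]})

# Record which rows were missing before filling them
df_missing['AmountWasMissing'] = df_missing['TransactionAmount'].isna().astype(int)

# Replace missing values with the median of the observed values
median_amount = df_missing['TransactionAmount'].median()
df_missing['TransactionAmount'] = df_missing['TransactionAmount'].fillna(median_amount)
print(df_missing)
```

The median is less sensitive to extreme amounts than the mean, which is why it is used in this sketch.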

In [2]:
import pandas as pd

df = pd.DataFrame({'TransactionDate': pd.to_datetime(['2025-02-05 14:30:00', '2025-02-06 18:45:00'])})

# Extract date-related features
df['DayOfWeek'] = df['TransactionDate'].dt.dayofweek  # Monday=0 ... Sunday=6
df['Hour'] = df['TransactionDate'].dt.hour
df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)  # Saturday=5, Sunday=6
print(df)

      TransactionDate  DayOfWeek  Hour  IsWeekend
0 2025-02-05 14:30:00          2    14          0
1 2025-02-06 18:45:00          3    18          0


In [3]:
#Aggregated Features
df_transactions=pd.DataFrame({
    'UserID':[101,102,101,103,102],
    'TransactionAmount':[500,300,700,1000,400]
})

df_user_avg = df_transactions.groupby('UserID')['TransactionAmount'].mean().reset_index()
df_user_avg.rename(columns={'TransactionAmount': 'AvgTransactionAmount'},inplace=True)
print(df_user_avg)

   UserID  AvgTransactionAmount
0     101                 600.0
1     102                 350.0
2     103                1000.0
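
The aggregate above has one row per user. To use it as a feature, it usually has to sit next to each individual transaction. A minimal sketch of one way to do that with `groupby(...).transform`, which returns a value for every original row. The `AmountVsUserAvg` column is a hypothetical extra feature added for illustration.

```python
import pandas as pd

df_transactions = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102],
    'TransactionAmount': [500, 300, 700, 1000, 400],
})

# transform('mean') broadcasts each user's mean back onto that user's rows
df_transactions['AvgTransactionAmount'] = (
    df_transactions.groupby('UserID')['TransactionAmount'].transform('mean')
)

# Hypothetical derived feature: distance of each transaction from the user's average
df_transactions['AmountVsUserAvg'] = (
    df_transactions['TransactionAmount'] - df_transactions['AvgTransactionAmount']
)
print(df_transactions)
```

Merging `df_user_avg` back on `UserID` would give the same average column; `transform` simply avoids the separate merge step.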


In [10]:
# Encoding categorical variables
from sklearn.preprocessing import OneHotEncoder

df=pd.DataFrame({'ProductCategory':['Electronics','Clothing','Clothing','Grocery']})
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(df[['ProductCategory']])    

df_encoded = pd.DataFrame(encoded_features,columns=encoder.get_feature_names_out())
print(df_encoded)

   ProductCategory_Clothing  ProductCategory_Electronics  \
0                       0.0                          1.0   
1                       1.0                          0.0   
2                       1.0                          0.0   
3                       0.0                          0.0   

   ProductCategory_Grocery  
0                      0.0  
1                      0.0  
2                      0.0  
3                      1.0  


In [11]:
# Log transformation for skewed data
import numpy as np
df = pd.DataFrame({'TransactionAmount': [100, 200, 5000, 10000, 2000]})
# log1p computes log(1 + x), so a value of 0 stays defined
df['LogTransactionAmount'] = np.log1p(df['TransactionAmount'])
print(df)

   TransactionAmount  LogTransactionAmount
0                100              4.615121
1                200              5.303305
2               5000              8.517393
3              10000              9.210440
4               2000              7.601402
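
If a model is trained on the log-transformed amount, its outputs are on the log scale and need to be mapped back. A minimal sketch of the round trip, assuming `np.log1p` was used for the forward transform so that `np.expm1` is its inverse. The `df_log` and `recovered` names are illustrative.

```python
import numpy as np
import pandas as pd

df_log = pd.DataFrame({'TransactionAmount': [100, 200, 5000, 10000, 2000]})
df_log['LogTransactionAmount'] = np.log1p(df_log['TransactionAmount'])

# expm1 computes exp(x) - 1, which undoes log1p
recovered = np.expm1(df_log['LogTransactionAmount'])

# True if the original amounts are recovered up to floating-point error
print(np.allclose(recovered, df_log['TransactionAmount']))
```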


In [15]:
#Feature Scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Initialize the scalers
scaler = MinMaxScaler()
standard_scaler = StandardScaler()  # Initialize StandardScaler

# Apply MinMaxScaler (normalize the values to [0, 1])
df['NormalizedTransactionAmount'] = scaler.fit_transform(df[['TransactionAmount']])

# Apply StandardScaler (standardize the values to have mean=0 and std=1)
df['StandardizedTransactionAmount'] = standard_scaler.fit_transform(df[['TransactionAmount']])

# Print the updated dataframe
print(df)


   TransactionAmount  LogTransactionAmount  NormalizedTransactionAmount  \
0                100              4.615121                     0.000000   
1                200              5.303305                     0.010101   
2               5000              8.517393                     0.494949   
3              10000              9.210440                     1.000000   
4               2000              7.601402                     0.191919   

   StandardizedTransactionAmount  
0                      -0.903226  
1                      -0.876344  
2                       0.413978  
3                       1.758065  
4                      -0.392473  
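
The cell above fits each scaler on the full column. When data is split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse it on the test portion, so that test statistics do not leak into training. A minimal sketch under that assumption; the split below is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split of the transaction amounts
train_amounts = np.array([[100.0], [200.0], [5000.0], [10000.0]])
test_amounts = np.array([[2000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_amounts)  # learns mean and std from training data only
test_scaled = scaler.transform(test_amounts)        # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The same pattern applies to `MinMaxScaler`; in that case test values outside the training range fall outside [0, 1].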


| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
| 1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
| 2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
| 3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
| 4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |


| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10678 | Air Asia | 9/04/2019 | Kolkata | Banglore | CCU → BLR | 19:55 | 22:25 | 2h 30m | non-stop | No info | 4107 |
| 10679 | Air India | 27/04/2019 | Kolkata | Banglore | CCU → BLR | 20:45 | 23:20 | 2h 35m | non-stop | No info | 4145 |
| 10680 | Jet Airways | 27/04/2019 | Banglore | Delhi | BLR → DEL | 08:20 | 11:20 | 3h | non-stop | No info | 7229 |
| 10681 | Vistara | 01/03/2019 | Banglore | New Delhi | BLR → DEL | 11:30 | 14:10 | 2h 40m | non-stop | No info | 12648 |
| 10682 | Air India | 9/05/2019 | Delhi | Cochin | DEL → GOI → BOM → COK | 10:55 | 19:15 | 8h 20m | 2 stops | No info | 11753 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


| | Price |
|---|---|
| count | 10683.0 |
| mean | 9087.064121 |
| std | 4611.359167 |
| min | 1759.0 |
| 25% | 5277.0 |
| 50% | 8372.0 |
| 75% | 12373.0 |
| max | 79512.0 |
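
Most columns in the flight data are stored as `object` (text), so they need to be converted before a model can use them. The sketch below applies the same ideas as the earlier cells to three rows copied from the preview above. It is one possible approach, not the notebook's own method: the `flights` frame, the new column names, and the `stops_map` dictionary are hypothetical, and the map only covers the stop labels visible in the preview.

```python
import pandas as pd

# Three rows copied from the dataset preview
flights = pd.DataFrame({
    'Date_of_Journey': ['24/03/2019', '1/05/2019', '9/06/2019'],
    'Duration': ['2h 50m', '7h 25m', '19h'],
    'Total_Stops': ['non-stop', '2 stops', '2 stops'],
})

# Dates are day/month/year, so the format is given explicitly
journey = pd.to_datetime(flights['Date_of_Journey'], format='%d/%m/%Y')
flights['JourneyDay'] = journey.dt.day
flights['JourneyMonth'] = journey.dt.month

# Duration may omit the minutes part (e.g. '19h'); a missing part counts as 0
hours = pd.to_numeric(flights['Duration'].str.extract(r'(\d+)h')[0], errors='coerce').fillna(0).astype(int)
minutes = pd.to_numeric(flights['Duration'].str.extract(r'(\d+)m')[0], errors='coerce').fillna(0).astype(int)
flights['DurationMinutes'] = hours * 60 + minutes

# Ordinal encoding of stops; labels not listed here would become NaN
stops_map = {'non-stop': 0, '1 stop': 1, '2 stops': 2}
flights['StopsCount'] = flights['Total_Stops'].map(stops_map)
print(flights)
```

On the full dataset, `Total_Stops` and `Route` each have one missing value according to the `info()` output, so those rows would need to be handled before mapping.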
