# Introduction

- Feature engineering is the process of selecting, manipulating, and transforming raw data into features that a machine learning model can use.
- It applies to both supervised and unsupervised learning, and aims to simplify and speed up data transformations while improving model accuracy.

# Feature Engineering Techniques for Machine Learning

1. Handling Outliers
2. Handling imbalanced data
3. Handling Missing Values
4. Log Transform
5. Binning
6. One-hot encoding
7. Scaling
8. Normalization
9. Standardization
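
A few of the techniques listed above can be sketched on a toy column. This is a minimal illustration, not a recipe: the `toy` frame, the bin edges, and the bin labels are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
toy = pd.DataFrame({"Area": [2000, 2100, 2200, 2300, 2900]})

# Log transform: compresses large values (log1p also handles zeros safely).
toy["Area_log"] = np.log1p(toy["Area"])

# Binning: group a continuous column into labelled intervals.
# The bin edges and labels here are arbitrary choices.
toy["Area_bin"] = pd.cut(
    toy["Area"], bins=[0, 2100, 2500, 3000], labels=["small", "medium", "large"]
)

# Normalization (min-max scaling): rescale values to the range [0, 1].
area_min, area_max = toy["Area"].min(), toy["Area"].max()
toy["Area_minmax"] = (toy["Area"] - area_min) / (area_max - area_min)

# Standardization: subtract the mean, divide by the standard deviation.
toy["Area_std"] = (toy["Area"] - toy["Area"].mean()) / toy["Area"].std(ddof=0)

print(toy)
```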

# One-hot encoding

- It lets models that require numerical input use categorical variables.
- It can improve model performance by giving each category its own binary feature, so the model can learn a separate effect for each category.
- It avoids imposing a false ordering: encoding unordered categories (such as town names) as integers 0, 1, 2 suggests a ranking that does not exist. Categories with a genuine natural ordering (e.g. “small”, “medium”, “large”) are ordinal, and an integer encoding can suit them.
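
The ordering issue can be seen by comparing integer codes with one-hot columns. This is a minimal sketch on a hypothetical `towns` series; `dtype=int` is passed so the output is 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical nominal column: the towns have no inherent order.
towns = pd.Series(["Mumbai", "Pune", "Delhi", "Pune"], name="Town")

# Integer codes follow the sorted category order (Delhi=0, Mumbai=1, Pune=2),
# which a model may misread as Delhi < Mumbai < Pune.
codes = towns.astype("category").cat.codes

# One-hot encoding gives each town its own 0/1 column instead.
one_hot = pd.get_dummies(towns, dtype=int)

print(codes.tolist())
print(one_hot)
```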

- Import pandas to load and manipulate the data

In [2]:
import pandas as pd

- Read the data with pandas (the raw string keeps the backslashes in the Windows path from being treated as escape sequences)

In [3]:
df = pd.read_csv(r"F:\PGDDS\PGD_Data_Science\csv_data\HousePricePrediction.csv")

In [4]:
df

,Town,Area,Price
0,Mumbai,2000,5500000
1,Mumbai,2100,5530000
2,Mumbai,2200,5560000
3,Mumbai,2300,5590000
4,Mumbai,2400,5620000
5,Pune,2500,5650000
6,Pune,2600,5680000
7,Pune,2700,5710000
8,Pune,2800,5740000
9,Delhi,2900,5770000
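
Because the CSV path above is local to the author's machine, a stand-in frame can be built from the ten rows displayed. This is a sketch for following along only; the name `sample_df` is an assumption, and the real file has more rows than are shown here.

```python
import io
import pandas as pd

# The ten rows displayed above, typed in as CSV text.
csv_text = """Town,Area,Price
Mumbai,2000,5500000
Mumbai,2100,5530000
Mumbai,2200,5560000
Mumbai,2300,5590000
Mumbai,2400,5620000
Pune,2500,5650000
Pune,2600,5680000
Pune,2700,5710000
Pune,2800,5740000
Delhi,2900,5770000
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)
```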


- Summary of the dataset: column names, non-null counts, and dtypes

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Town    39 non-null     object
 1   Area    39 non-null     int64 
 2   Price   39 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 1.0+ KB


- Check the unique values in the "Town" column

In [6]:
df['Town'].unique()

array(['Mumbai', 'Pune', 'Delhi'], dtype=object)
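
Beyond listing the distinct values, it can help to count them, since each distinct value becomes one column after one-hot encoding. A small sketch on a stand-in `town` series built from the ten rows displayed earlier:

```python
import pandas as pd

# Stand-in for the "Town" column, using the ten rows displayed earlier.
town = pd.Series(["Mumbai"] * 5 + ["Pune"] * 4 + ["Delhi"], name="Town")

# Number of distinct towns, i.e. how many dummy columns to expect.
print(town.nunique())

# Rows per town, useful for spotting rare categories.
print(town.value_counts())
```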

# One-hot encoding the categorical column using get_dummies()

- get_dummies() creates one indicator (0/1) column for each unique value in the "Town" column

In [7]:
dummies = pd.get_dummies(df['Town'])

In [8]:
dummies

,Delhi,Mumbai,Pune
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,1,0,0
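
One version detail worth noting: in pandas 2.0 and later, `get_dummies()` returns boolean (True/False) columns by default, so the 0/1 output shown above may not appear as-is. Passing `dtype=int` is one way to get integer columns. A minimal sketch on a hypothetical series:

```python
import pandas as pd

# Hypothetical stand-in for the "Town" column.
town = pd.Series(["Mumbai", "Pune", "Delhi"], name="Town")

# dtype=int requests 0/1 integer columns regardless of the default dtype.
dummies_int = pd.get_dummies(town, dtype=int)
print(dummies_int)
```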


- Now we combine both dataframes column-wise (axis=1)

In [14]:
final = pd.concat([df, dummies], axis=1)

In [15]:
final

,Town,Area,Price,Delhi,Mumbai,Pune
0,Mumbai,2000,5500000,0,1,0
1,Mumbai,2100,5530000,0,1,0
2,Mumbai,2200,5560000,0,1,0
3,Mumbai,2300,5590000,0,1,0
4,Mumbai,2400,5620000,0,1,0
5,Pune,2500,5650000,0,0,1
6,Pune,2600,5680000,0,0,1
7,Pune,2700,5710000,0,0,1
8,Pune,2800,5740000,0,0,1
9,Delhi,2900,5770000,1,0,0
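
As an alternative to building the dummies separately and concatenating, `get_dummies()` can encode a column of a whole dataframe in one call through its `columns` parameter. Note the differences from the approach above: the original "Town" column is replaced rather than kept, and the new columns get a `Town_` prefix. The `sample` frame below is a hypothetical three-row stand-in.

```python
import pandas as pd

# Hypothetical three-row stand-in for the housing data.
sample = pd.DataFrame({
    "Town": ["Mumbai", "Pune", "Delhi"],
    "Area": [2000, 2500, 2900],
    "Price": [5500000, 5650000, 5770000],
})

# Encodes "Town" and drops the original column; new columns are
# prefixed with the source column name, e.g. "Town_Delhi".
encoded = pd.get_dummies(sample, columns=["Town"], dtype=int)
print(encoded.columns.tolist())
```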


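- Here we remove the columns we no longer need: the original "Town" column, and one dummy column ("Delhi"). One dummy is redundant, because a row with Mumbai = 0 and Pune = 0 must be Delhi; dropping it avoids the dummy variable trap.

In [16]:
final.drop(['Town', 'Delhi'], axis=1, inplace=True)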

In [18]:
final.head()

,Area,Price,Mumbai,Pune
0,2000,5500000,1,0
1,2100,5530000,1,0
2,2200,5560000,1,0
3,2300,5590000,1,0
4,2400,5620000,1,0
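
Dropping one dummy column by hand can also be done at encoding time with the `drop_first` parameter of `get_dummies()`. A minimal sketch on a hypothetical series; the column it drops is the first category in sorted order, which happens to be "Delhi" for these towns.

```python
import pandas as pd

# Hypothetical stand-in for the "Town" column.
town = pd.Series(["Mumbai", "Pune", "Delhi"], name="Town")

# drop_first=True omits the first category in sorted order ("Delhi" here),
# leaving k-1 indicator columns for k categories.
reduced = pd.get_dummies(town, drop_first=True, dtype=int)
print(reduced)
```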


- For training, we have to define the model's input and output
- X - input (features)
- y - output (target)

- Here we have three input features: Area, Mumbai, and Pune

In [19]:
X = final[['Area', 'Mumbai', 'Pune']]

In [20]:
X.head()

,Area,Mumbai,Pune
0,2000,1,0
1,2100,1,0
2,2200,1,0
3,2300,1,0
4,2400,1,0


In [22]:
y = final['Price']

In [24]:
y.head()

0    5500000
1    5530000
2    5560000
3    5590000
4    5620000
Name: Price, dtype: int64
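
With the input and output defined, a model can be fitted. The source does not name a model, so the choice of scikit-learn's `LinearRegression` here is an assumption, as are the names `data`, `X_demo`, `y_demo`, and `query`. The sketch uses only the ten rows displayed earlier, with "Delhi" as the omitted dummy.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Ten rows as displayed earlier, with "Delhi" as the omitted dummy column.
data = pd.DataFrame({
    "Area": [2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900],
    "Mumbai": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Pune": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    "Price": [5500000, 5530000, 5560000, 5590000, 5620000,
              5650000, 5680000, 5710000, 5740000, 5770000],
})

X_demo = data[["Area", "Mumbai", "Pune"]]
y_demo = data["Price"]

# LinearRegression is an assumed choice; the source does not name a model.
model = LinearRegression()
model.fit(X_demo, y_demo)

# Predict the price of a hypothetical property in Pune with Area 2450.
query = pd.DataFrame({"Area": [2450], "Mumbai": [0], "Pune": [1]})
print(model.predict(query))
```

A row for Delhi is expressed by setting both `Mumbai` and `Pune` to 0.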

- Check the column names of the final dataframe

In [26]:
final.columns

Index(['Area', 'Price', 'Mumbai', 'Pune'], dtype='object')

# Label encoding

In [27]:
from sklearn.preprocessing import LabelEncoder

In [28]:
encoder = LabelEncoder()

In [31]:
result = encoder.fit_transform(df['Town'])

In [34]:
result

array([1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0])
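
The integers in this array follow the sorted order of the labels, which can be confirmed from the encoder's `classes_` attribute. A minimal sketch on a hypothetical `towns` list is below. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` or one-hot encoding is the usual choice.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the "Town" column.
towns = ["Mumbai", "Pune", "Delhi", "Mumbai"]

encoder = LabelEncoder()
codes = encoder.fit_transform(towns)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts the codes back to the original labels.
print(encoder.inverse_transform(codes))
```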