#### Data encoding means converting categorical (non-numeric) data into a numeric format that machine learning algorithms can use, since most of them only work with numbers. It is an important feature engineering step because many datasets contain non-numeric features (such as country names or product types) that must be transformed into numbers.


- Label Encoding: Assigns an integer to each category.
- One-Hot Encoding: Creates one binary column per category.
- Binary Encoding: Combines label encoding with a binary (base-2) representation of the label.
- Ordinal Encoding: Assigns integers that follow the natural order of the categories.
- Target Encoding: Encodes each category based on the target variable.
- Frequency Encoding: Encodes each category by how often it occurs.

The choice of encoding method depends on the type of categorical data you have and on whether the categories have any inherent order or relationship.
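To make the differences between these methods concrete, here is a minimal pandas sketch. The `toy` DataFrame, its column names, and the `size_order` mapping are hypothetical and not part of this notebook. The target encoding shown is the naive per-category mean, which in practice is usually computed on training folds only to avoid leaking the target.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small", "large"],
    "city": ["Pune", "Delhi", "Pune", "Pune", "Mumbai"],
    "bought": [0, 1, 1, 0, 1],  # assumed binary target
})

# Label encoding: one integer per category (categories sorted alphabetically).
toy["city_label"] = toy["city"].astype("category").cat.codes

# Ordinal encoding: integers that follow an explicit, meaningful order.
size_order = {"small": 0, "medium": 1, "large": 2}
toy["size_ordinal"] = toy["size"].map(size_order)

# Frequency encoding: replace each category with its share of the rows.
toy["city_freq"] = toy["city"].map(toy["city"].value_counts(normalize=True))

# Target encoding (naive): replace each category with the mean target value.
toy["city_target"] = toy.groupby("city")["bought"].transform("mean")

# Binary encoding: write the label code in base 2, one column per bit.
n_bits = max(int(toy["city_label"].max()).bit_length(), 1)
for bit in range(n_bits):
    toy[f"city_bin_{bit}"] = (toy["city_label"] // 2**bit) % 2

print(toy)
```

One-hot encoding is left out of this sketch because the cells that follow cover it with `OneHotEncoder`.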

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
from sklearn.preprocessing import OneHotEncoder


In [3]:
## create a small example data set
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

In [4]:
encoder=OneHotEncoder()

In [5]:
encode=encoder.fit_transform(df[['color']]).toarray()

In [6]:
encode

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [7]:
encoder_df=pd.DataFrame(encode,columns=encoder.get_feature_names_out())

In [8]:
encoder_df

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0


In [9]:
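## encode a new row with the already-fitted encoder
## (assigned to a variable, since only the last expression in a cell is displayed)
new_encoded = encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()
## attach the one-hot columns to the original data
pd.concat([df,encoder_df],axis=1)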



   color  color_blue  color_green  color_red
0    red         0.0          0.0        1.0
1   blue         1.0          0.0        0.0
2  green         0.0          1.0        0.0
3  green         0.0          1.0        0.0
4    red         0.0          0.0        1.0


In [10]:
tip=sns.load_dataset('tips')

In [11]:
tips=tip.head()
tips

   total_bill   tip     sex  smoker  day    time  size
0       16.99  1.01  Female      No  Sun  Dinner     2
1       10.34  1.66    Male      No  Sun  Dinner     3
2       21.01  3.50    Male      No  Sun  Dinner     3
3       23.68  3.31    Male      No  Sun  Dinner     2
4       24.59  3.61  Female      No  Sun  Dinner     4


In [12]:
tips['sex']

0    Female
1      Male
2      Male
3      Male
4    Female
Name: sex, dtype: category
Categories (2, object): ['Male', 'Female']

In [13]:
df2=pd.DataFrame(
    {
        'name':tips['sex']
    }
)

In [14]:
encoder=OneHotEncoder()

In [15]:
encoder_df=encoder.fit_transform(df2[['name']]).toarray()

In [16]:
encoder_df

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.]])

In [17]:
encoder_df2=pd.DataFrame(encoder_df,columns=encoder.get_feature_names_out())

In [18]:
encoder_df2

   name_Female  name_Male
0          1.0        0.0
1          0.0        1.0
2          0.0        1.0
3          0.0        1.0
4          1.0        0.0
