In Python, there are several ways to encode a dataset during Exploratory Data Analysis (EDA).
Encoding is typically required for categorical variables, because many machine learning algorithms accept only numerical input.
Here are some common encoding techniques along with their implementations:

### Types:
####    -- One-Hot Encoding
####    -- Label Encoding
####    -- Ordinal Encoding
####    -- Binary Encoding
####    -- Frequency Encoding

### One-Hot Encoding:
One-hot encoding is used to convert categorical variables into a binary matrix where each category becomes a binary feature.
Implementation using pandas:

In [1]:
import pandas as pd
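# Three example categories
data = ["Farmer","Doctor","Engineer"]
# get_dummies creates one 0/1 column per category; dtype=int keeps the values as 0/1 rather than booleans
data1 = pd.get_dummies(data, dtype=int)
data1

Unnamed: 0,Doctor,Engineer,Farmer
0,0,0,1
1,1,0,0
2,0,1,0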
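
In practice the categories usually sit in a DataFrame column rather than a plain list. The following is a minimal sketch under that assumption; the column name `occupation` is hypothetical, and `drop_first=True` is shown only as one common option for removing the redundant column.

```python
import pandas as pd

# Hypothetical single-column DataFrame holding three occupations
df = pd.DataFrame({"occupation": ["Farmer", "Doctor", "Engineer"]})

# columns= selects which columns to encode; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=["occupation"], dtype=int)

# drop_first=True drops the first category of each encoded column
one_hot_reduced = pd.get_dummies(df, columns=["occupation"], dtype=int, drop_first=True)

print(one_hot)
print(one_hot_reduced)
```

With `columns=` given, each new column is named after the original column and the category, for example `occupation_Doctor`.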


### Label Encoding:

Label encoding assigns a unique integer to each category in a categorical variable.
Implementation using scikit-learn:

In [2]:
from sklearn.preprocessing import LabelEncoder
data = ["Farmer","Doctor","Engineer"]
le = LabelEncoder()
le.fit(data)
# classes_ holds the sorted unique labels learned by fit
print("classes_before_transform",le.classes_)
print("transform:",le.transform(["Farmer","Doctor","Engineer"]))
print("inverse_transform:",le.inverse_transform([0,0,1,2,1]))

classes_before_transform ['Doctor' 'Engineer' 'Farmer']
transform: [2 0 1]
inverse_transform: ['Doctor' 'Doctor' 'Engineer' 'Farmer' 'Engineer']


In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Numeric labels work too: the sorted unique values 1, 2, 6 become 0, 1, 2
le.fit([1, 2, 2, 6])
print("classes_before_transform:",le.classes_)
print("transform:",le.transform([1, 1, 2, 6]))
print("inverse_transform:",le.inverse_transform([0, 0, 1, 2]))

classes_before_transform: [1 2 6]
transform: [0 0 1 2]
inverse_transform: [1 1 2 6]


### NOTE:
OneHotEncoder
Performs a one-hot encoding of categorical features. This encoding is suitable for low to medium cardinality categorical variables, both in supervised and unsupervised settings.

TargetEncoder
Encodes categorical features using supervised signal in a classification or regression pipeline. This encoding is typically suitable for high cardinality categorical variables.

LabelEncoder
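Encodes target labels with values between 0 and n_classes-1. It is intended for the target variable (y) rather than for input features; for features, OrdinalEncoder or OneHotEncoder is the usual choice.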
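
As a rough illustration of the OneHotEncoder described in this note, the sketch below fits on a few made-up occupation values and then transforms a category that was not seen during fitting. The data and the choice of `handle_unknown="ignore"` are assumptions made for the example, not part of the original notebook.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one categorical feature, shape (n_samples, 1)
X_train = [["Farmer"], ["Doctor"], ["Engineer"]]

# handle_unknown="ignore" encodes categories not seen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# The default output is a sparse matrix; toarray() makes it dense for display
encoded = encoder.transform([["Doctor"], ["Pilot"]]).toarray()

print(encoder.categories_)
print(encoded)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same columns come out for training data and for data seen later.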

### Ordinal Encoding:

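Ordinal encoding is similar to label encoding but assigns integers based on the ordinal relationship of the categories.
Note that scikit-learn's OrdinalEncoder sorts the categories by default (see categories_ below), so a meaningful order has to be supplied explicitly.
Implementation using scikit-learn's OrdinalEncoder, followed by an explicit mapping with pandas: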

In [4]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

In [5]:
enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

In [6]:
enc.transform([['Female', 3], ['Male', 1]])

array([[0., 2.],
       [1., 0.]])

In [7]:
enc.inverse_transform([[1, 0], [0, 1]])

array([['Male', 1],
       ['Female', 2]], dtype=object)

In [8]:
enc.inverse_transform([[0, 2], [1, 0]])

array([['Female', 3],
       ['Male', 1]], dtype=object)

In [9]:
import pandas as pd

# Sample data
data = pd.DataFrame({'ordinal_column': ['low', 'medium', 'high', 'low', 'high']})

# Define ordinal mapping
ordinal_mapping = {'low': 0, 'medium': 1, 'high': 2}

# Apply ordinal encoding
data['ordinal_column'] = data['ordinal_column'].map(ordinal_mapping)

# Print the encoded data
print(data)

   ordinal_column
0               0
1               1
2               2
3               0
4               2
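
The same explicit order can be given to scikit-learn instead of a hand-written mapping. This is a minimal sketch, assuming the order low < medium < high used in the mapping above; it contrasts the default (sorted) codes with the codes obtained through the `categories` parameter.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a natural order: low < medium < high
X = [["low"], ["medium"], ["high"], ["low"], ["high"]]

# Without categories=, the encoder sorts alphabetically: high=0, low=1, medium=2
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(X).ravel()

# Passing the order explicitly gives low=0, medium=1, high=2
ordered_enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_enc.fit_transform(X).ravel()

print(default_codes)
print(ordered_codes)
```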


### Binary Encoding:

Binary encoding assigns each category an integer, writes that integer in binary, and stores each binary digit in its own column.
Implementation by hand with pandas, then using the category_encoders library:

In [10]:
import pandas as pd

# Sample data
data = pd.DataFrame({'categorical_column': ['A', 'B', 'C', 'A', 'C','D']})

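# Define a function for binary encoding
def binary_encode(df, column):
    # Number the categories 1, 2, 3, ... in order of first appearance
    codes = pd.factorize(df[column])[0] + 1
    # Number of binary digits needed for the largest code
    n_bits = int(codes.max()).bit_length()
    encoded = df.drop(columns=column)
    for i in range(n_bits):
        # Column 0 holds the most significant bit
        encoded[f'{column}_{i}'] = (codes >> (n_bits - 1 - i)) & 1
    return encoded

# Apply binary encoding
data_encoded = binary_encode(data, 'categorical_column')

# Print the encoded data
print(data_encoded)


   categorical_column_0  categorical_column_1  categorical_column_2
0                     0                     0                     1
1                     0                     1                     0
2                     0                     1                     1
3                     0                     0                     1
4                     0                     1                     1
5                     1                     0                     0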


In [11]:
#!pip install category_encoders
import pandas as pd
import category_encoders as ce

# Sample data
data = pd.DataFrame({'categorical_column': ['A', 'B', 'C', 'A', 'C']})

# Define and apply binary encoding
encoder = ce.BinaryEncoder(cols=['categorical_column'])
data_encoded = encoder.fit_transform(data)

# Print the encoded data
print(data_encoded)

   categorical_column_0  categorical_column_1
0                     0                     1
1                     1                     0
2                     1                     1
3                     0                     1
4                     1                     1


### Frequency Encoding:

Frequency encoding replaces each category with how often it occurs in the dataset, either as a raw count or, as here, as a proportion of all rows.
Implementation using pandas:

In [12]:
import pandas as pd

# Sample data; 'categorical_column' is the column to be frequency encoded
data = pd.DataFrame({'categorical_column': ['A', 'B', 'C', 'A', 'C','D']})
# Calculate the relative frequency of each category (normalize=True gives proportions, not counts)
frequency_map = data['categorical_column'].value_counts(normalize=True)

# Map the frequency values to the original column
data['categorical_column'] = data['categorical_column'].map(frequency_map)


In [13]:
data

Unnamed: 0,categorical_column
0,0.333333
1,0.166667
2,0.333333
3,0.333333
4,0.333333
5,0.166667
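
One practical detail is what happens when new data contains a category that was absent when the frequencies were computed. The sketch below assumes a hypothetical split into `train` and `new` frames; filling unseen categories with 0 is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical training data and new data; "E" never appears in training
train = pd.DataFrame({"categorical_column": ["A", "B", "C", "A", "C", "D"]})
new = pd.DataFrame({"categorical_column": ["A", "D", "E"]})

# Learn relative frequencies from the training data only
frequency_map = train["categorical_column"].value_counts(normalize=True)

# Unseen categories map to NaN; here they are filled with 0
new["categorical_column_freq"] = (
    new["categorical_column"].map(frequency_map).fillna(0.0)
)

print(new)
```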


## With a dataset

### Label Encoding:

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r"D:\Ineuron\Libraries for Manipulation and visualization\Dataset\data5\Visadataset.csv")

In [16]:
data

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,Certified
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,Certified
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,Certified
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,Certified
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,Certified


In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB


In [18]:
data.columns

Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
       'requires_job_training', 'no_of_employees', 'yr_of_estab',
       'region_of_employment', 'prevailing_wage', 'unit_of_wage',
       'full_time_position', 'case_status'],
      dtype='object')

In [19]:
data.shape

(25480, 12)

In [20]:
data.isnull().sum()

case_id                  0
continent                0
education_of_employee    0
has_job_experience       0
requires_job_training    0
no_of_employees          0
yr_of_estab              0
region_of_employment     0
prevailing_wage          0
unit_of_wage             0
full_time_position       0
case_status              0
dtype: int64

In [21]:
data_num = [nume for nume in data.columns if data[nume].dtype != "O"]

In [22]:
data[data_num]

Unnamed: 0,no_of_employees,yr_of_estab,prevailing_wage
0,14513,2007,592.2029
1,2412,2002,83425.6500
2,44444,2008,122996.8600
3,98,1897,83434.0300
4,1082,2005,149907.3900
...,...,...,...
25475,2601,2008,77092.5700
25476,3274,2006,279174.7900
25477,1121,1910,146298.8500
25478,1918,1887,86154.7700


In [23]:
data_col = [nume for nume in data.columns if data[nume].dtype == "O"]

In [24]:
data[data_col]

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,region_of_employment,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,West,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,Northeast,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,West,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,West,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,South,Year,Y,Certified
...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,Asia,Bachelor's,Y,Y,South,Year,Y,Certified
25476,EZYV25477,Asia,High School,Y,N,Northeast,Year,Y,Certified
25477,EZYV25478,Asia,Master's,Y,N,South,Year,N,Certified
25478,EZYV25479,Asia,Master's,Y,Y,West,Year,Y,Certified


In [25]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [26]:
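# Note: data_col is the list of categorical column NAMES, so this cell encodes the names themselves,
# not the values stored in those columns. To encode values, pass a column such as data['case_status'].
le.fit(data_col)
print("classes_before_transform",le.classes_)
print("transform:",le.transform(['case_id','case_status','continent','education_of_employee','full_time_position','has_job_experience','region_of_employment','requires_job_training','unit_of_wage']))
print("inverse_transform:",le.inverse_transform([0,0,1,2,1]))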

classes_before_transform ['case_id' 'case_status' 'continent' 'education_of_employee'
 'full_time_position' 'has_job_experience' 'region_of_employment'
 'requires_job_training' 'unit_of_wage']
transform: [0 1 2 3 4 5 6 7 8]
inverse_transform: ['case_id' 'case_id' 'case_status' 'continent' 'case_status']


In [27]:
data["case_status"]

0           Denied
1        Certified
2           Denied
3           Denied
4        Certified
           ...    
25475    Certified
25476    Certified
25477    Certified
25478    Certified
25479    Certified
Name: case_status, Length: 25480, dtype: object

In [28]:
le.fit_transform(data['case_status'])

array([1, 0, 1, ..., 0, 0, 0])

In [29]:
print("inverse_transform:",le.inverse_transform([0,0,1,0,1]))

inverse_transform: ['Certified' 'Certified' 'Denied' 'Certified' 'Denied']
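
To read an encoded array such as the one above, it helps to see which integer stands for which label. This is a small sketch on a made-up list standing in for the case_status column; the `mapping` dictionary is a hypothetical helper built from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the case_status column
case_status = ["Denied", "Certified", "Denied", "Denied", "Certified"]

le = LabelEncoder()
codes = le.fit_transform(case_status)

# classes_ is sorted, and each class's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(codes)
print(mapping)
```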


In [31]:
data['education_of_employee'].unique()

array(['High School', "Master's", "Bachelor's", 'Doctorate'], dtype=object)

### One-Hot Encoding:

In [32]:
data.columns

Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
       'requires_job_training', 'no_of_employees', 'yr_of_estab',
       'region_of_employment', 'prevailing_wage', 'unit_of_wage',
       'full_time_position', 'case_status'],
      dtype='object')

In [38]:
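import pandas as pd
# Encodes every object column; an identifier such as case_id yields one column per distinct ID
# (see the case_id_... columns in the output)
data1 = pd.get_dummies(data)
data1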

Unnamed: 0,no_of_employees,yr_of_estab,prevailing_wage,case_id_EZYV01,case_id_EZYV02,case_id_EZYV03,case_id_EZYV04,case_id_EZYV05,case_id_EZYV06,case_id_EZYV07,...,region_of_employment_South,region_of_employment_West,unit_of_wage_Hour,unit_of_wage_Month,unit_of_wage_Week,unit_of_wage_Year,full_time_position_N,full_time_position_Y,case_status_Certified,case_status_Denied
0,14513,2007,592.2029,1,0,0,0,0,0,0,...,0,1,1,0,0,0,0,1,0,1
1,2412,2002,83425.6500,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,1,0
2,44444,2008,122996.8600,0,0,1,0,0,0,0,...,0,1,0,0,0,1,0,1,0,1
3,98,1897,83434.0300,0,0,0,1,0,0,0,...,0,1,0,0,0,1,0,1,0,1
4,1082,2005,149907.3900,0,0,0,0,1,0,0,...,1,0,0,0,0,1,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25475,2601,2008,77092.5700,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,1,1,0
25476,3274,2006,279174.7900,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,1,0
25477,1121,1910,146298.8500,0,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,1,0
25478,1918,1887,86154.7700,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,1,0


In [39]:
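import pandas as pd
# Passing columns=data_col explicitly gives the same result here, since data_col lists every object column
data1 = pd.get_dummies(data,columns=data_col)
data1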

Unnamed: 0,no_of_employees,yr_of_estab,prevailing_wage,case_id_EZYV01,case_id_EZYV02,case_id_EZYV03,case_id_EZYV04,case_id_EZYV05,case_id_EZYV06,case_id_EZYV07,...,region_of_employment_South,region_of_employment_West,unit_of_wage_Hour,unit_of_wage_Month,unit_of_wage_Week,unit_of_wage_Year,full_time_position_N,full_time_position_Y,case_status_Certified,case_status_Denied
0,14513,2007,592.2029,1,0,0,0,0,0,0,...,0,1,1,0,0,0,0,1,0,1
1,2412,2002,83425.6500,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,1,0
2,44444,2008,122996.8600,0,0,1,0,0,0,0,...,0,1,0,0,0,1,0,1,0,1
3,98,1897,83434.0300,0,0,0,1,0,0,0,...,0,1,0,0,0,1,0,1,0,1
4,1082,2005,149907.3900,0,0,0,0,1,0,0,...,1,0,0,0,0,1,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25475,2601,2008,77092.5700,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,1,1,0
25476,3274,2006,279174.7900,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,1,0
25477,1121,1910,146298.8500,0,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,1,0
25478,1918,1887,86154.7700,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,1,0
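
Because an identifier column expands into one column per distinct value, it is usually worth checking cardinality before one-hot encoding. The sketch below uses a hypothetical four-row miniature of the visa data; treating a column as identifier-like when every value is distinct is a simple heuristic assumed for the example.

```python
import pandas as pd

# Hypothetical miniature of the visa data: an ID column plus two categorical columns
df = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03", "EZYV04"],
    "continent": ["Asia", "Asia", "Africa", "Asia"],
    "case_status": ["Denied", "Certified", "Denied", "Certified"],
})

# Number of distinct values per column
cardinality = df.nunique()

# Heuristic: a column with one distinct value per row behaves like an identifier
id_like = [col for col in df.columns if cardinality[col] == len(df)]

# Drop identifier-like columns, then one-hot encode what remains
encoded = pd.get_dummies(df.drop(columns=id_like), dtype=int)

print(cardinality)
print(encoded.columns.tolist())
```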


### Ordinal Encoding:

In [40]:
data.head()

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y,Certified


In [41]:
data[data_col].head()

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,region_of_employment,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,West,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,Northeast,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,West,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,West,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,South,Year,Y,Certified


In [42]:
from sklearn.preprocessing import OrdinalEncoder
enc1 = OrdinalEncoder()
# OrdinalEncoder expects 2D input, so select the column as a one-column DataFrame
data3 = data[['case_status']]
enc1.fit(data3)

In [43]:
enc1.categories_

[array(['Certified', 'Denied'], dtype=object)]

In [44]:
enc1.fit_transform(data3)

array([[1.],
       [0.],
       [1.],
       ...,
       [0.],
       [0.],
       [0.]])
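
case_status has no inherent order, so the codes above only reflect alphabetical sorting. A column such as education_of_employee is a more natural fit for ordinal encoding. The sketch below uses a made-up sample of the four education levels listed earlier, and the ranking in `order` is an assumption about how those levels should be ordered.

```python
import pandas as pd

# Hypothetical sample using the four education levels from the dataset
education = pd.Series(["High School", "Master's", "Bachelor's", "Doctorate", "Master's"])

# Assumed ranking from lowest to highest level
order = ["High School", "Bachelor's", "Master's", "Doctorate"]

# An ordered categorical stores the ranking; .codes exposes it as integers
education_cat = pd.Categorical(education, categories=order, ordered=True)
education_codes = pd.Series(education_cat.codes, index=education.index)

print(education_codes.tolist())
```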

### Frequency Encoding:

In [49]:
data_fre = pd.DataFrame(data['case_status'])


In [None]:
import pandas as pd

# 'data_fre' holds only the 'case_status' column, which is the column to be frequency encoded
# Calculate the relative frequency of each category in 'case_status'
frequency_map = data_fre['case_status'].value_counts(normalize=True)

# Map the frequency values to the original column
data_fre['case_status'] = data_fre['case_status'].map(frequency_map)
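
To frequency-encode several columns at once, the per-column steps can be wrapped in a small helper. This is a sketch only: `frequency_encode` is a hypothetical function name, and the sample frame is made up rather than taken from the visa dataset.

```python
import pandas as pd

def frequency_encode(df, columns):
    """Return a copy of df with each listed column replaced by its relative frequencies."""
    encoded = df.copy()
    for col in columns:
        # Series.map needs one column at a time, so loop over the columns
        encoded[col] = df[col].map(df[col].value_counts(normalize=True))
    return encoded

# Hypothetical sample with two categorical columns
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Africa", "Asia"],
    "case_status": ["Denied", "Certified", "Denied", "Denied"],
})

sample_encoded = frequency_encode(sample, ["continent", "case_status"])
print(sample_encoded)
```

The loop is needed because `map` works on a single Series; calling it on a multi-column selection is what made the original cell fail.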