In [13]:
# One-hot encoding using scikit-learn's OneHotEncoder

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [14]:
# Build a small example employee dataset
data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
        }
# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(data)
# Print the DataFrame
print(f"Employee data : \n{df}")

Employee data : 
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice


Extract categorical columns from the dataframe
Here we select the columns with the object dtype and treat them as the categorical columns.
df.select_dtypes(include=['object']): This selects the columns of df whose dtype is 'object'.
In pandas, 'object' typically holds Python strings or mixed values. Columns stored with the dedicated 'category' dtype are not matched by this selection.
.columns.tolist(): This extracts the names of the selected columns and converts them to a list.
categorical_columns: This variable now holds the list of categorical column names, here ['Gender', 'Remarks'].

In [15]:

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
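
The dtype-based selection above can be checked on a small variation of the example. This is a minimal sketch, not part of the original notebook: it assumes the Gender column is stored explicitly as object and that Remarks has been converted to the 'category' dtype, to show that include=['object'] alone would skip the latter.

```python
import pandas as pd

# Hypothetical variation of the employee data: Gender is forced to the
# object dtype, Remarks is stored with the dedicated 'category' dtype.
df = pd.DataFrame({
    "Employee id": [10, 20, 15, 25, 30],
    "Gender": pd.Series(["M", "F", "F", "M", "F"], dtype=object),
    "Remarks": pd.Series(["Good", "Nice", "Good", "Great", "Nice"], dtype="category"),
})

# Selecting only 'object' misses the 'category' column
only_object = df.select_dtypes(include=["object"]).columns.tolist()

# Listing both dtypes picks up every categorical column in this frame
object_or_category = df.select_dtypes(include=["object", "category"]).columns.tolist()

print(only_object)
print(object_or_category)
```

If a dataset mixes both dtypes, passing both names to include is one way to keep the rest of the workflow unchanged.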

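OneHotEncoder(sparse_output=False): This creates a OneHotEncoder object from scikit-learn.
sparse_output=False makes the encoder return a dense NumPy array instead of a SciPy sparse matrix, which is the default
(a sparse matrix uses less memory when there are many categories, but a dense array is easier to inspect and to pass to pandas).
Note that the parameter is called sparse_output from scikit-learn 1.2 onward; older releases used sparse instead.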

In [16]:
encoder = OneHotEncoder(sparse_output=False)
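
To make the sparse versus dense distinction concrete, the sketch below leaves the encoder at its default settings and converts the result by hand. The tiny array X is an illustrative assumption, not data from the notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input (assumed for this sketch)
X = np.array([["M"], ["F"], ["F"]])

# With default settings the encoder returns a SciPy sparse matrix
sparse_result = OneHotEncoder().fit_transform(X)
is_sparse = sparse.issparse(sparse_result)

# .toarray() converts it to the dense NumPy array that
# sparse_output=False would have produced directly
dense_result = sparse_result.toarray()

print(is_sparse)
print(dense_result)
```

Categories are sorted, so the first output column corresponds to 'F' and the second to 'M'.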

df[categorical_columns]: This selects the categorical columns from df using the list of names stored in categorical_columns.
encoder.fit_transform(...): This fits the encoder to the selected columns (learning the distinct categories in each one) and transforms them in a single call, returning the one-hot encoded values.
one_hot_encoded: This variable now holds the result as a NumPy array of floats, with one 0.0/1.0 column per category of each original column.

In [17]:
# Apply one-hot encoding to the categorical columns
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

pd.DataFrame(one_hot_encoded, ...): This creates a new DataFrame from the one-hot encoded NumPy array.
columns=encoder.get_feature_names_out(categorical_columns): This sets the column names of the new DataFrame to the feature names generated by the encoder. get_feature_names_out builds each name from the original column name and the category value, for example Gender_F or Remarks_Good.
one_hot_df: This is the DataFrame containing the one-hot encoded columns.

In [18]:
#Create a DataFrame with the one-hot encoded columns
#We use get_feature_names_out() to get the column names for the encoded data
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

pd.concat([df, one_hot_df], axis=1): This concatenates the original DataFrame df and the new DataFrame one_hot_df along the columns (side by side). pd.concat aligns rows by index label, which works here because both DataFrames use the default index 0 to 4.
df_encoded: This variable now holds the combined DataFrame, which includes both the original columns and the one-hot encoded columns.

In [19]:
# Concatenate the one-hot encoded dataframe with the original dataframe
df_encoded = pd.concat([df, one_hot_df], axis=1)

df_encoded.drop(categorical_columns, axis=1): This returns a copy of df_encoded without the original categorical columns.
axis=1 specifies that columns are being dropped (as opposed to rows).
df_encoded: The result is assigned back to this variable, so it now excludes the original categorical columns,
leaving only the one-hot encoded columns and any non-categorical columns from the original DataFrame.


In [20]:

# Drop the original categorical columns
df_encoded = df_encoded.drop(categorical_columns, axis=1)

# Display the resulting dataframe
print(f"Encoded Employee data : \n{df_encoded}")

Encoded Employee data : 
   Employee id  Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0           10       0.0       1.0           1.0            0.0           0.0
1           20       1.0       0.0           0.0            0.0           1.0
2           15       1.0       0.0           1.0            0.0           0.0
3           25       0.0       1.0           0.0            1.0           0.0
4           30       1.0       0.0           0.0            0.0           1.0
