<a href="https://colab.research.google.com/github/lokinegalur/ML/blob/main/Encoding_Categorical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler,OneHotEncoder,LabelEncoder
from sklearn.compose import ColumnTransformer

In [None]:
#importing the dataset
df=pd.read_csv('/content/Data.csv')

In [None]:
df.isnull().sum()

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64

In [None]:
#handling missing values: fill each column with its own mean
df['Age']=df['Age'].fillna(df['Age'].mean())
df['Salary']=df['Salary'].fillna(df['Salary'].mean())

In [None]:
x=np.array(df.iloc[:,:-1])
y=np.array(df.iloc[:,-1:])

In [None]:
#encoding categorical data
#one-hot encoding the Country column (index 0) of the independent variables
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
ct.fit_transform(x)

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
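
The imputation and one-hot steps above can also be combined in a single `ColumnTransformer`, so each numeric column is always filled with its own mean. This is only a minimal sketch: the `toy` frame and the names `ct_toy` and `encoded` are made up for illustration and are not part of `Data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same feature columns as Data.csv
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

ct_toy = ColumnTransformer(
    transformers=[
        # one-hot encode the nominal column
        ("encoder", OneHotEncoder(), ["Country"]),
        # fill each numeric column with the mean of that same column
        ("imputer", SimpleImputer(strategy="mean"), ["Age", "Salary"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Column order: France, Germany, Spain, Age, Salary
encoded = ct_toy.fit_transform(toy)
print(encoded)
```

Because the imputer is fitted per column, the mistake of filling one column with another column's mean cannot happen here.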

In [None]:
#encoding dependent variable
le=LabelEncoder()
le.fit(['Yes','No'])
#flatten the (n, 1) column to 1-D, the shape LabelEncoder expects
y=le.transform(y.ravel())


In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]




```
LabelEncoding your features is a bad practice
You should avoid using LabelEncoder to encode your input features! Don't believe me? Here's what scikit-learn's official documentation for LabelEncoder says:

This transformer should be used to encode target values, i.e. y, and not the input X.

That's why it's called LabelEncoding.

Why you shouldn't use LabelEncoder to encode features.
This encoder simply maps a feature's unique values to integers. For example, let's say we want to encode a feature called shirt color, which represents the color of the shirt someone's wearing. This feature has values ['red', 'green', 'blue', ...]. If you encode these into integers, i.e. [1, 2, 3, ...], you might confuse your model, because you have now given these values relationships that don't exist in the real world, e.g. red < green < blue or red + green = blue. This type of feature is called nominal and should preferably be one-hot encoded.

There are features, however, whose values you might want to map to integers. These are called ordinal. For example, the feature rating, which has values ['bad', 'good', 'excellent', ...]. By mapping these to integers you actually preserve the relationships these values hold in the real world, e.g. bad < good < excellent. There is a catch, however: to do this, you need to map each value to a specific integer (e.g. we can't map 'good' -> 1, 'bad' -> 2, 'excellent' -> 3, because that doesn't preserve the real-world relationship of these values). The computer doesn't know which number belongs to each value. LabelEncoder assigns integers in the sorted order of the values (alphabetical for strings), so here it would produce 'bad' -> 0, 'excellent' -> 1, 'good' -> 2, which is not the correct encoding.

How to properly encode ordinal features
A more reliable way of encoding ordinal variables is to choose the mapping manually. This requires more work and isn't as elegant as a one-liner that encodes all values, but it guarantees that the integers follow the real-world order. Let's see how we can do this in pandas.

custom_mapping = {'bad': 1, 'good': 2, 'excellent': 3}


df['rating'] = df['rating'].map(custom_mapping)
Obviously this needs to be done for each ordinal feature.

At this point I think it's clear that I strongly recommend against using LabelEncoder on features, but if you still want to do it, at least do it correctly.

If you still want to use LabelEncoding
While both answers by @ggordon and @Anan Srivastava will do what you want, they don't have much value in practice. The problem is that by not binding the fitted LabelEncoder to a variable, you lose the mapping from categories to numbers. If you want to predict on future data, you won't know which number to encode each category with.
```
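
To make the ordinal-encoding point concrete, the sketch below compares what `LabelEncoder` produces for a rating column with two explicit alternatives: the pandas `map` shown above and scikit-learn's `OrdinalEncoder` with a hand-written category order. The `ratings` frame is a made-up example, not data from this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature
ratings = pd.DataFrame({"rating": ["bad", "good", "excellent", "good"]})

# LabelEncoder sorts the unique values, so 'excellent' lands between 'bad' and 'good'
alphabetical = LabelEncoder().fit_transform(ratings["rating"])
print(alphabetical)  # [0 2 1 2]

# Manual mapping with pandas keeps the real-world order
custom_mapping = {"bad": 1, "good": 2, "excellent": 3}
mapped = ratings["rating"].map(custom_mapping)
print(mapped.tolist())  # [1, 2, 3, 2]

# OrdinalEncoder accepts the intended order explicitly (one list per column)
ordinal = OrdinalEncoder(categories=[["bad", "good", "excellent"]])
ordered = ordinal.fit_transform(ratings[["rating"]]).ravel()
print(ordered)  # [0. 1. 2. 1.]
```

Both explicit approaches need the order written out once per ordinal feature. The `OrdinalEncoder` variant has the side benefit that the fitted object can be reused on new data.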



In [2]:
#encoding multiple columns 
#if there are n categories in a column then it can be represented with (n-1) columns
from pandas import get_dummies

In [21]:
df=pd.read_csv('/content/Placement_Dataset.csv')

In [22]:
#dropping sl_no and salary column
df.drop(columns=['sl_no','salary'],inplace=True)

In [23]:
df.head()

,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
0,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed
1,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed
3,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed


In [24]:
df_features=get_dummies(df.iloc[:,:-1],drop_first=True)

In [25]:
df_features.head()

,ssc_p,hsc_p,degree_p,etest_p,mba_p,gender_M,ssc_b_Others,hsc_b_Others,hsc_s_Commerce,hsc_s_Science,degree_t_Others,degree_t_Sci&Tech,workex_Yes,specialisation_Mkt&HR
0,67.0,91.0,58.0,55.0,58.8,1,1,1,1,0,0,1,0,1
1,79.33,78.33,77.48,86.5,66.28,1,0,1,0,1,0,1,1,0
2,65.0,68.0,64.0,75.0,57.8,1,0,0,0,0,0,0,0,0
3,56.0,52.0,52.0,66.0,59.43,1,0,0,0,1,0,1,0,1
4,85.8,73.6,73.3,96.8,55.5,1,0,0,1,0,0,0,0,0
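
As a small illustration of the (n-1) columns idea, the sketch below encodes one three-category column with and without `drop_first=True`. The `toy` frame is invented for the example; only the column name `hsc_s` is borrowed from the placement dataset.

```python
import pandas as pd

# Hypothetical frame with one 3-category column and one numeric column
toy = pd.DataFrame({
    "hsc_s": ["Commerce", "Science", "Arts", "Science"],
    "score": [91.0, 78.33, 68.0, 52.0],
})

# n categories -> n dummy columns
full = pd.get_dummies(toy, columns=["hsc_s"], dtype=int)
print(list(full.columns))
# ['score', 'hsc_s_Arts', 'hsc_s_Commerce', 'hsc_s_Science']

# drop_first=True removes the first category ('Arts'), leaving n-1 columns
reduced = pd.get_dummies(toy, columns=["hsc_s"], drop_first=True, dtype=int)
print(list(reduced.columns))
# ['score', 'hsc_s_Commerce', 'hsc_s_Science']

# The dropped category is still recoverable: 'Arts' is the row where both dummies are 0
print(reduced.loc[2, ["hsc_s_Commerce", "hsc_s_Science"]].tolist())  # [0, 0]
```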


In [26]:
le_target=LabelEncoder()  #kept in a variable so le_target.classes_ can decode predictions later
df_target=le_target.fit_transform(df.iloc[:,-1])

In [27]:
print(df_target)

[1 1 1 0 1 0 0 1 1 0 1 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 0
 1 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1
 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1
 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0
 1 0 1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 0]
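
The integers printed above are only useful if the mapping back to the class names is kept. The sketch below shows one way to do that with a fitted `LabelEncoder` held in a variable. The `status` array and the `encoder` name are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values with the same two classes as the 'status' column
status = np.array(["Placed", "Placed", "Not Placed", "Placed"])

encoder = LabelEncoder()
codes = encoder.fit_transform(status)
print(codes)  # [1 1 0 1]

# classes_ holds the sorted class names; the index of each name is its integer code
print(list(encoder.classes_))  # ['Not Placed', 'Placed']

# The same fitted encoder turns integer predictions back into names
decoded = encoder.inverse_transform(np.array([1, 0]))
print(list(decoded))  # ['Placed', 'Not Placed']

# It also encodes new data consistently with the original fit
new_codes = encoder.transform(["Not Placed"])
print(new_codes)  # [0]
```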


In [28]:
print(df_features)

     ssc_p  hsc_p  ...  workex_Yes  specialisation_Mkt&HR
0    67.00  91.00  ...           0                      1
1    79.33  78.33  ...           1                      0
2    65.00  68.00  ...           0                      0
3    56.00  52.00  ...           0                      1
4    85.80  73.60  ...           0                      0
..     ...    ...  ...         ...                    ...
210  80.60  82.00  ...           0                      0
211  58.00  60.00  ...           0                      0
212  67.00  67.00  ...           1                      0
213  74.00  66.00  ...           0                      1
214  62.00  58.00  ...           0                      1

[215 rows x 14 columns]
