### Data Encoding

Data encoding converts categorical features into numerical values so that a model can handle them.<br>
##### **Types**

**1. Nominal or One Hot Encoding**<br>
**2. Label and Ordinal Encoding**<br>
**3. Target Guided Ordinal Encoding**


In [11]:
##### One Hot Encoding
# Each category of a categorical variable becomes its own feature (column), represented by 1 or 0.
## Not ideal when there are many categories: the result is a large sparse matrix, which might cause the model to overfit.

In [12]:
## One hot encoding example
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [13]:
df = pd.DataFrame({"colour":['red','green','green','yellow','black','red','green','black']})

In [14]:
df

Unnamed: 0,colour
0,red
1,green
2,green
3,yellow
4,black
5,red
6,green
7,black


In [15]:
## create instance of one hot encoder
encoder = OneHotEncoder()

In [16]:
encoder.fit_transform(df[['colour']])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8 stored elements and shape (8, 4)>

In [17]:
## The call above returns a sparse matrix; .toarray() converts it to a dense array

In [18]:
encoder.fit_transform(df[['colour']]).toarray()

array([[0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.]])

In [19]:
encoded_array  = encoder.fit_transform(df[['colour']]).toarray()
encoded_df = pd.DataFrame(encoded_array)

In [20]:
encoded_df

Unnamed: 0,0,1,2,3
0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0
6,0.0,1.0,0.0,0.0
7,1.0,0.0,0.0,0.0


In [21]:
encoded_array  = encoder.fit_transform(df[['colour']]).toarray()
encoded_df = pd.DataFrame(encoded_array,columns=encoder.get_feature_names_out())

In [22]:
encoded_df

Unnamed: 0,colour_black,colour_green,colour_red,colour_yellow
0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0
6,0.0,1.0,0.0,0.0
7,1.0,0.0,0.0,0.0


In [23]:
## Encode new data with the already fitted encoder

encoder.transform([['red']]).toarray()



array([[0., 0., 1., 0.]])
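
The fitted encoder above only knows the four colours it saw during fitting, and by default scikit-learn's `OneHotEncoder` raises an error for any other value. The following is a minimal sketch of one way to handle that, assuming the `handle_unknown="ignore"` option suits the use case. The `safe_encoder` name and the `'purple'` value are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"colour": ['red', 'green', 'green', 'yellow', 'black', 'red', 'green', 'black']})

# With handle_unknown="ignore", a category not seen during fit
# is encoded as a row of all zeros instead of raising an error.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(df[['colour']])

# 'red' was seen during fit; 'purple' (hypothetical) was not.
new_data = pd.DataFrame({"colour": ['red', 'purple']})
encoded_new = safe_encoder.transform(new_data).toarray()
print(encoded_new)
```

An all-zero row means "none of the known colours", so whether this is acceptable depends on how the downstream model should treat unseen values.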

In [24]:
pd.concat([df,encoded_df],axis=1)

Unnamed: 0,colour,colour_black,colour_green,colour_red,colour_yellow
0,red,0.0,0.0,1.0,0.0
1,green,0.0,1.0,0.0,0.0
2,green,0.0,1.0,0.0,0.0
3,yellow,0.0,0.0,0.0,1.0
4,black,1.0,0.0,0.0,0.0
5,red,0.0,0.0,1.0,0.0
6,green,0.0,1.0,0.0,0.0
7,black,1.0,0.0,0.0,0.0


## Label Encoding<br><br>
In this technique, we assign a unique numerical value to each available category.<br>
Example:<br>
[Red,Green,Black] can be Label encoded as,<br>
1. Red = 1
2. Green = 2
3. Black = 3

In [26]:
df.head()

Unnamed: 0,colour
0,red
1,green
2,green
3,yellow
4,black


In [27]:
from sklearn.preprocessing import LabelEncoder

In [28]:
label_encoder = LabelEncoder()

In [29]:
label_encoder.fit_transform(df['colour'])

array([2, 1, 1, 3, 0, 2, 1, 0])

In [30]:
label_encoded_df = pd.DataFrame(label_encoder.fit_transform(df['colour']),columns=["encoded_color"])

In [31]:
label_encoded_df

Unnamed: 0,encoded_color
0,2
1,1
2,1
3,3
4,0
5,2
6,1
7,0


In [32]:
pd.concat([df,label_encoded_df],axis=1)

Unnamed: 0,colour,encoded_color
0,red,2
1,green,1
2,green,1
3,yellow,3
4,black,0
5,red,2
6,green,1
7,black,0


## Disadvantage of label encoding<br>
In the example above, yellow is assigned 3 and black is assigned 0. Because these are all numerical values, the model may treat yellow as having a higher value or preference than black. We do not want the model to wrongly conclude that yellow is greater than black, because these categories have no ranking.
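
To make the concern concrete, here is a small sketch that inspects the integer codes `LabelEncoder` assigns to the colours used earlier. The `code_of` dictionary is a helper introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

colours = ['red', 'green', 'green', 'yellow', 'black', 'red', 'green', 'black']

label_encoder = LabelEncoder()
label_encoder.fit(colours)

# classes_ holds the categories in sorted (alphabetical) order;
# each category's position in classes_ is its integer code.
code_of = {str(c): i for i, c in enumerate(label_encoder.classes_)}
print(code_of)

# The codes compare as ordinary numbers, even though the colours
# themselves have no meaningful order.
print(code_of['yellow'] > code_of['black'])
```

The codes come purely from alphabetical sorting, so any ordering a model reads into them is accidental.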

## Ordinal Encoding

Use this when the categories have a natural ranking.<br>
Example<br>
Primary:0<br>
HighSchool:1<br>
Graduate:2<br>

In [36]:
## Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

In [37]:
df = pd.DataFrame({
    'size':['small','medium','large','medium','small','large']
})
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [38]:
ordinal_encoder = OrdinalEncoder(categories=[['small','medium','large']])  ## List the categories in rank order, lowest rank first

In [39]:
ordinal_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
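
The explicit `categories` list matters because, without it, `OrdinalEncoder` orders string categories alphabetically. The sketch below compares the two on the same data; the names `default_encoder` and `ranked_encoder` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small', 'large']})

# Without `categories`, the order is alphabetical: large, medium, small.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(df[['size']]).ravel()

# With an explicit list, the codes follow the intended ranking.
ranked_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ranked_codes = ranked_encoder.fit_transform(df[['size']]).ravel()

print(default_codes)
print(ranked_codes)

# inverse_transform maps a code back to its category label.
decoded = ranked_encoder.inverse_transform([[2.0]])
print(decoded)
```

With the default ordering, 'small' receives the largest code, which reverses the intended ranking.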

## Target Guided Ordinal Encoding

In [43]:
df = pd.DataFrame({
    'city':['New York','London','Paris', 'Tokyo', 'New York', 'Paris'],
    'price':[200,150,300,250,180,320]})

In [45]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [49]:
df.groupby('city')['price'].mean()

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64

In [51]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [53]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [55]:
df['city_encoded'] = df['city'].map(mean_price)

In [57]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


In [59]:
## In this technique, we simply replace each category with the mean, median, or mode of the target for that particular category.
## Here the mean and mode are computed from the target column.
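
The cells above map each city to its mean price directly. One common variant, sketched here as an assumption rather than as the method used above, converts those means into integer ranks so that the encoding is ordinal. The `city_rank` name is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]})

# Mean of the target (price) for each city.
mean_price = df.groupby('city')['price'].mean()

# Sort cities by mean price and assign ranks 0..n-1 (lowest mean first).
city_rank = {city: rank for rank, city in enumerate(mean_price.sort_values().index)}

df['city_rank'] = df['city'].map(city_rank)
print(city_rank)
print(df)
```

Because the encoding is derived from the target, it is usually safer to compute the mapping on the training data only and then apply it to validation or test data.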