## Label Encoding


In machine learning, we often deal with datasets that contain labels in one or more columns. These labels can be words or numbers. To keep the data human-readable, the training data is often labeled with words.

`Label Encoding refers to converting the labels into numeric form so that they are machine-readable`.


In [2]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:

# Importing the dataset
dataset = pd.read_csv('Data.csv')
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [5]:
# Separating the independent and dependent variables
X = dataset.iloc[:, :-1].copy()  # copy, so later assignments to X do not touch dataset
y = dataset.iloc[:, -1]          # last column: Purchased


### Encoding Nominal Categorical Data

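Most machine learning algorithms work with numbers, so let’s
convert these text labels to numbers.
**`Scikit-Learn provides a transformer for this task called LabelEncoder`**.
Note that scikit-learn documents LabelEncoder as a transformer for target labels (y); it is applied to the Country feature here only to illustrate the idea.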

In [6]:

# Encoding the Categorical Variables

from sklearn.preprocessing import LabelEncoder

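# Create the encoder and replace the Country column with integer codes
labelcoder = LabelEncoder()
X.iloc[:,0] = labelcoder.fit_transform(X.iloc[:,0])

# Encoding the dependent variable
# (refitting the same encoder replaces the Country mapping stored in it)
y = labelcoder.fit_transform(y)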

In [7]:
y # Yes = 1, No = 0

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
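
To check which integer stands for which label, the fitted encoder can be inspected. The following is a minimal sketch, assuming the same ten Purchased values as in the table above; the names `purchased`, `encoder`, and `mapping` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Same labels as the Purchased column above (retyped here for a self-contained example)
purchased = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(purchased)

# classes_ holds the original labels in sorted order; a label's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the classes are sorted alphabetically, "No" comes before "Yes", which is why No is encoded as 0 and Yes as 1.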

In [8]:
X # France = 0, Germany = 1, Spain = 2

Unnamed: 0,Country,Age,Salary
0,0,44.0,72000.0
1,2,27.0,48000.0
2,1,30.0,54000.0
3,2,38.0,61000.0
4,1,40.0,
5,0,35.0,58000.0
6,2,,52000.0
7,0,48.0,79000.0
8,1,50.0,83000.0
9,0,37.0,67000.0


In the encoding above, Country is a nominal variable, not an ordinal one, yet the integer codes suggest an order (France < Germany < Spain) that a model may treat as meaningful. So we have to create `dummy variables`.

To fix this issue, a common solution is to create one `binary attribute per category`:
one attribute equal to 1 when the country is France (and 0 otherwise), another
attribute equal to 1 when the country is Germany (and 0 otherwise), and so on.
This is called `one-hot encoding`, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).
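
The idea can be sketched by hand with NumPy before reaching for a library. This is only an illustration, assuming the integer codes shown in the table above (France = 0, Germany = 1, Spain = 2); `country_codes` and `one_hot` are names chosen for this example.

```python
import numpy as np

# Integer codes of the Country column (France = 0, Germany = 1, Spain = 2)
country_codes = np.array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0])
n_categories = 3

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the identity matrix with the codes builds one row per sample
one_hot = np.eye(n_categories, dtype=int)[country_codes]
print(one_hot)
```

Each row contains exactly one 1, in the column that matches the country's code.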

**Scikit-Learn provides a OneHotEncoder to convert integer categorical values into one-hot
vectors. Let’s encode the categories as one-hot vectors.**

In [12]:
from sklearn.preprocessing import OneHotEncoder

# Note that fit_transform() expects a 2D array, but the Country column is a 1D array, so we need to reshape it.
OneEncoder = OneHotEncoder()
Country_OneHot = OneEncoder.fit_transform(X.iloc[:,0].values.reshape(-1,1))

In [13]:
Country_OneHot

<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

Notice that the output is a `SciPy sparse matrix`, instead of a `NumPy array`. This is very useful when you
have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with
thousands of columns, and the matrix is full of zeros except for one 1 per row. Using up tons of memory
mostly to store zeros would be very wasteful, so instead `a sparse matrix only stores the location of the
nonzero elements`. You can use it mostly like a normal 2D array, but **if you really want to convert it to a
(dense) NumPy array, just call the toarray() method**:

In [15]:
Country_OneHot.toarray()

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
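
To see what a sparse matrix actually keeps in memory, here is a small sketch using SciPy's CSR format on the first three rows of the array above. The variable names are illustrative; the point is only that the zeros are never stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# First three rows of the one-hot matrix shown above
dense = np.array([[1., 0., 0.],
                  [0., 0., 1.],
                  [0., 1., 0.]])
sparse = csr_matrix(dense)

# CSR keeps three small arrays instead of the full grid
print(sparse.data)     # the nonzero values
print(sparse.indices)  # column index of each nonzero value
print(sparse.indptr)   # offsets marking where each row starts in the arrays above

# Converting back gives the original dense array
print(sparse.toarray())
```

For a one-hot matrix with thousands of columns, these arrays grow with the number of rows rather than with rows times columns.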

**We can apply both transformations (from text categories to integer categories, then from integer categories
to one-hot vectors) in one shot using the LabelBinarizer class:**

In [17]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
encoder.fit_transform(X.iloc[:,0])

array([[1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [1, 0, 0]])
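
As an alternative sketch, and assuming scikit-learn 0.20 or later, OneHotEncoder can also be fitted directly on the text column, so the intermediate integer step is not needed. The small `countries` DataFrame below is made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A few country names as raw text (illustrative data)
countries = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain", "Germany"]})

encoder = OneHotEncoder()
# Double brackets keep the input 2D, which fit_transform() expects
one_hot = encoder.fit_transform(countries[["Country"]]).toarray()

# categories_ lists the learned categories, in the same order as the output columns
print(encoder.categories_)
print(one_hot)
```

The `categories_` attribute makes it easy to tell which output column belongs to which country.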

### Creating Dummy Variables

When including dummy variables in a regression model, however, one should be careful of the dummy variable trap. This is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. With one dummy column per category, the columns always sum to 1, so any one of them is fully determined by the rest.

https://www.algosome.com/articles/dummy-variable-trap-regression.html

In [18]:
pd.get_dummies(X,columns=['Country'],drop_first=True) # Drop the first dummy column to avoid the dummy variable trap

Unnamed: 0,Age,Salary,Country_1,Country_2
0,44.0,72000.0,0,0
1,27.0,48000.0,0,1
2,30.0,54000.0,1,0
3,38.0,61000.0,0,1
4,40.0,,1,0
5,35.0,58000.0,0,0
6,,52000.0,0,1
7,48.0,79000.0,0,0
8,50.0,83000.0,1,0
9,37.0,67000.0,0,0
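
The dummy variable trap can be checked numerically. The sketch below, on a made-up `country` column, compares the rank of the design matrix (an intercept column plus the dummies) with and without `drop_first=True`; it is an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative nominal column with three categories
country = pd.Series(["France", "Spain", "Germany", "Spain", "Germany", "France"], name="Country")

full = pd.get_dummies(country).to_numpy(dtype=float)                      # 3 dummy columns
reduced = pd.get_dummies(country, drop_first=True).to_numpy(dtype=float)  # 2 dummy columns
intercept = np.ones((len(country), 1))

design_full = np.hstack([intercept, full])        # 4 columns
design_reduced = np.hstack([intercept, reduced])  # 3 columns

# The three dummies sum to the intercept column, so the full design is rank deficient
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)
print(design_full.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

With all three dummies the matrix has four columns but only rank 3; after dropping one dummy, the number of columns and the rank agree.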


### Encoding Ordinal Categorical Data

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.

For example:

- Student's grade in an exam (A, B, C or Fail).
- Educational level, with the categories: Elementary school,  High school, College graduate, PhD ranked from 1 to 4.

When the categorical variables are ordinal, the most straightforward approach is to replace the labels with ordinal numbers based on their ranks.
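
As a minimal sketch of this idea, the educational levels from the list above can be ranked from 1 to 4 with an ordered pandas categorical. The sample values in `education` are invented for the example.

```python
import pandas as pd

# Categories listed from lowest to highest rank
levels = ["Elementary school", "High school", "College graduate", "PhD"]

# Illustrative sample data
education = pd.Series(["High school", "PhD", "Elementary school", "College graduate", "High school"])

# An ordered categorical records the ranking explicitly
ordered = pd.Categorical(education, categories=levels, ordered=True)

# codes are 0-based positions in `levels`; add 1 to rank from 1 to 4
education_rank = pd.Series(ordered.codes + 1, index=education.index)
print(education_rank.tolist())
```

Listing the categories explicitly matters here: without it, the categories would be sorted alphabetically and the ranks would not reflect the intended order.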


In [31]:
# Create a variable with dates, and from that extract the weekday.
# Build a list of 20 consecutive dates going back from today,
# then turn it into a DataFrame.

import datetime

today_date = datetime.datetime.today()
date_list = [today_date - datetime.timedelta(x) for x in range(0,20) ]
df = pd.DataFrame(date_list,columns=['day'])
df

Unnamed: 0,day
0,2019-11-09 12:24:05.159542
1,2019-11-08 12:24:05.159542
2,2019-11-07 12:24:05.159542
3,2019-11-06 12:24:05.159542
4,2019-11-05 12:24:05.159542
5,2019-11-04 12:24:05.159542
6,2019-11-03 12:24:05.159542
7,2019-11-02 12:24:05.159542
8,2019-11-01 12:24:05.159542
9,2019-10-31 12:24:05.159542


In [32]:
# Extract the weekday name
# (dt.weekday_name was removed in pandas 1.0; dt.day_name() is its replacement)
df['Day of the Week'] = df['day'].dt.day_name()

In [33]:
df

Unnamed: 0,day,Day of the Week
0,2019-11-09 12:24:05.159542,Saturday
1,2019-11-08 12:24:05.159542,Friday
2,2019-11-07 12:24:05.159542,Thursday
3,2019-11-06 12:24:05.159542,Wednesday
4,2019-11-05 12:24:05.159542,Tuesday
5,2019-11-04 12:24:05.159542,Monday
6,2019-11-03 12:24:05.159542,Sunday
7,2019-11-02 12:24:05.159542,Saturday
8,2019-11-01 12:24:05.159542,Friday
9,2019-10-31 12:24:05.159542,Thursday


In [34]:
# Engineer categorical variable by ordinal number replacement

weekday_map = {'Monday':1,
               'Tuesday':2,
               'Wednesday':3,
               'Thursday':4,
               'Friday':5,
               'Saturday':6,
               'Sunday':7
}


In [35]:
df['day_ordinal'] = df['Day of the Week'].map(weekday_map)

In [36]:
df

Unnamed: 0,day,Day of the Week,day_ordinal
0,2019-11-09 12:24:05.159542,Saturday,6
1,2019-11-08 12:24:05.159542,Friday,5
2,2019-11-07 12:24:05.159542,Thursday,4
3,2019-11-06 12:24:05.159542,Wednesday,3
4,2019-11-05 12:24:05.159542,Tuesday,2
5,2019-11-04 12:24:05.159542,Monday,1
6,2019-11-03 12:24:05.159542,Sunday,7
7,2019-11-02 12:24:05.159542,Saturday,6
8,2019-11-01 12:24:05.159542,Friday,5
9,2019-10-31 12:24:05.159542,Thursday,4
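
The same numbering can also be obtained without a hand-written dictionary. The sketch below assumes a datetime column like the one above and uses `dt.dayofweek`, which counts Monday as 0 and Sunday as 6; the small `days` DataFrame is illustrative.

```python
import pandas as pd

# Three of the dates from the table above: a Saturday, a Monday and a Sunday
days = pd.DataFrame({"day": pd.to_datetime(["2019-11-09", "2019-11-04", "2019-11-03"])})

# dt.dayofweek numbers Monday as 0 and Sunday as 6, so add 1 to match weekday_map
days["day_ordinal"] = days["day"].dt.dayofweek + 1
print(days)
```

An explicit dictionary is still useful when the desired order differs from the calendar one, for example when the week should start on Sunday.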
