## CATEGORICAL FEATURES / ENCODING:

#### DEFINITION: 
Categorical encoding is the process of converting categories into numbers.

There are two kinds of categorical data:

 #### 1. Ordinal Data:
       a. The categories have an inherent order.
       b. An ordinal variable comprises a finite set of discrete values with a ranked ordering between them.
 #### 2. Nominal Data:
       a. The categories do not have an inherent order.
       b. A nominal variable comprises a finite set of discrete values with no relationship between them.
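
To make the ordinal/nominal distinction concrete, here is a minimal sketch using pandas categoricals. The toy DataFrame, its column names (`size`, `city`), and the chosen order `small < medium < large` are assumptions made up for illustration; they are not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal: small < medium < large
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],    # nominal: no ranking
})

# Ordinal feature: declare the order explicitly, then read off the integer codes.
size_order = ["small", "medium", "large"]
toy["size"] = pd.Categorical(toy["size"], categories=size_order, ordered=True)
toy["size_code"] = toy["size"].cat.codes

# Nominal feature: an unordered categorical; any integer codes would be arbitrary labels.
toy["city"] = pd.Categorical(toy["city"])

print(toy)
print(toy["size"].cat.ordered, toy["city"].cat.ordered)
```

The integer codes are meaningful only for the ordered column, which is why nominal features are usually one-hot encoded instead of being mapped to integers.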

### TECHNIQUES TO HANDLE CATEGORICAL FEATURES:
    1. ONE HOT ENCODING
            1a. One-hot-encoding with many categories in a feature.
    2. ORDINAL NUMBER ENCODING
    3. COUNT OR FREQUENCY ENCODING
    4. TARGET GUIDED ORDINAL ENCODING
    5. MEAN ENCODING
    6. PROBABILITY RATIO ENCODING

### 1. ONE HOT ENCODING

#### DEFINITION:
We use this encoding technique when the feature is nominal (its categories have no inherent order). In one-hot encoding, we create a new variable for each level of the categorical feature. Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 the presence of that category.

These newly created binary features are known as dummy variables. A feature with k levels produces k dummy variables, or k-1 when one level is dropped and treated as the baseline.
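
As a quick illustration of how the number of dummy variables follows from the number of levels, here is a small sketch. The toy `colour` column is an assumption invented for this example; `dtype=int` is passed so the result prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column with three levels.
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One dummy per level: 3 levels -> 3 columns.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first level in sorted order ("blue" here);
# that level is then represented by all remaining dummies being 0.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level loses no information, because the dropped category can be recovered from the row where every remaining dummy is 0.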


#### We apply One-Hot Encoding when:

    1. The categorical feature is not ordinal (for example, country names)
    2. The number of categories in the feature is small, so one-hot encoding does not create too many new columns

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv(r"C:\Users\Harsh Jain\Desktop\HARSH JAIN\3. EVERYTHING RELATED TO DATA\DATA SCIENCE\DATASETS\titanic train.csv", usecols=['Sex'])

In [4]:
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [5]:
### Convert the categorical feature into 0/1 dummy columns and drop the first one:

pd.get_dummies(df, drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [6]:
df = pd.read_csv(r"C:\Users\Harsh Jain\Desktop\HARSH JAIN\3. EVERYTHING RELATED TO DATA\DATA SCIENCE\DATASETS\titanic train.csv", usecols=['Embarked'])

In [7]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [8]:
df.dropna(inplace=True)

In [9]:
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [10]:
pd.get_dummies(df, drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
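
The cells above drop the rows where `Embarked` is missing before encoding. As a hedged alternative, `pd.get_dummies` also accepts `dummy_na=True`, which keeps those rows and adds an extra indicator column for the missing value. The sketch below uses a hypothetical toy frame that only mimics the `Embarked` values seen above; it does not read the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the Embarked values, including one missing entry.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", np.nan, "S"]})

# dummy_na=True appends one more indicator column that flags missing values,
# so no rows have to be dropped before encoding.
encoded = pd.get_dummies(toy, dummy_na=True, dtype=int)

print(encoded)
```

Whether to drop the rows or keep a missing-value indicator depends on how many values are missing and whether the missingness itself might be informative.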


### 1a. One-hot-encoding with many categories in a feature

In [11]:
df = pd.read_csv(r"C:\Users\Harsh Jain\Desktop\HARSH JAIN\3. EVERYTHING RELATED TO DATA\DATA SCIENCE\DATASETS\mercedes.csv", usecols=["X0","X1","X2","X3","X4","X5","X6"])

In [12]:
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [13]:
### DISPLAY THE NUMBER OF UNIQUE CATEGORIES IN EACH FEATURE:
for i in df.columns:
    print(len(df[i].unique()))

47
27
44
7
4
29
12


#### IMP NOTE:
This approach is taken from the KDD Cup Orange challenge: when a feature has many categories, we one-hot encode only its ten most frequent categories. Rows whose category falls outside the top ten get 0 in every indicator column.

In [29]:
## GET THE TOP TEN CATEGORIES OF X1 FEATURE:
lst_10 = df.X1.value_counts().sort_values(ascending=False).head(10).index
lst_10=list(lst_10)

In [30]:
lst_10

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

In [31]:
### perform one hot encoding only on these 10 categories:

for categories in lst_10:
    df[categories]=np.where(df['X1']==categories,1,0)

In [33]:
lst_10.append('X1')

In [34]:
df[lst_10].head()

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o,X1
0,0,0,0,0,1,0,0,0,0,0,v
1,0,0,0,0,0,0,0,0,0,0,t
2,0,0,0,0,0,0,0,0,0,0,w
3,0,0,0,0,0,0,0,0,0,0,t
4,0,0,0,0,1,0,0,0,0,0,v
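
The same top-k idea can be wrapped in a small helper so it is easy to repeat for other features. This is only a sketch: the function name `one_hot_top_k`, the `column_category` naming scheme, and the toy data are assumptions introduced here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(frame, column, k=10):
    """Return 0/1 indicator columns for the k most frequent categories of `column`."""
    # value_counts() is already sorted by descending frequency.
    top = frame[column].value_counts().head(k).index
    indicators = {
        f"{column}_{cat}": np.where(frame[column] == cat, 1, 0) for cat in top
    }
    return pd.DataFrame(indicators, index=frame.index)

# Hypothetical toy data: "v" appears 3 times, "t" twice, "w" and "b" once each.
toy = pd.DataFrame({"X1": ["v", "t", "v", "w", "v", "t", "b"]})
encoded = one_hot_top_k(toy, "X1", k=2)

print(pd.concat([toy, encoded], axis=1))
```

Prefixing each new column with the feature name avoids collisions when two features share a category label: with bare names, a column such as `a` created for one feature would be overwritten when the loop is repeated for another.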
