## **ENCODING NUMERICAL FEATURES**

Numerical features can span a wide range of values. In such cases, the values are grouped into categories, each covering a range of values.

### **Techniques used are**
* **Discretization**
* **Binarization**

#### **1.DISCRETIZATION**

Discretization is the process of transforming a continuous variable into a discrete variable by creating a set of contiguous intervals that span the range of the variable's values. Discretization is also called binning, where bin is an alternative name for interval.

Why use Discretization:
1. To handle outliers
2. To improve the value spread

Suppose we have a column named age:

Age: 23, 24, 33, 45, 33, 22, 53, 44

After discretization (binning), we count how many values fall into each bin:

|**Age**|20-30|30-40|40-50|50-60|
|--|--|--|--|--|
|**Frequency**|3|2|2|1|

The example above shows the discretization of age.
Here the exact value is dropped and each value is treated the same as its neighbours in the same bin, so an extreme value simply falls into the outermost bin. This is how discretization handles outliers.
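As a rough sketch, the age example can be reproduced with `pandas.cut`, which assigns each value to an interval so that the values per bin can be counted. The bin edges 20, 30, 40, 50, 60 come from the table; the variable names are only illustrative.

```python
import pandas as pd

ages = pd.Series([23, 24, 33, 45, 33, 22, 53, 44])
edges = [20, 30, 40, 50, 60]
labels = ["20-30", "30-40", "40-50", "50-60"]

# pd.cut uses right-closed intervals by default: (20, 30], (30, 40], ...
age_bins = pd.cut(ages, bins=edges, labels=labels)

# Count how many ages fall into each bin, keeping the bins in order
frequency = age_bins.value_counts(sort=False)
print(frequency)
```

The exact ages are no longer visible after this step; only the bin each age belongs to remains.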

### 1.1 **Types of Discretization**

* Unsupervised Binning
    - Equal width binning (uniform binning)
    - Equal frequency (quantile) binning
    - KMeans binning
* Supervised Binning
    - Decision Tree binning
* Custom Binning

#### Equal width binning
Suppose we have age data:
20, 21, 24, 31, 33, 44, 36, 75

Then we choose the number of bins, say bins = 10. The width of each interval is:

$\text{interval width} = \frac{max - min}{bins}$

For this data the width is $\frac{75 - 20}{10} = 5.5$. Every interval has the same width, which is why it is called equal width binning.
- The spread of the data does not change
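A minimal sketch of equal width binning on the age data above, using only NumPy. The width of each bin is (max - min) / bins, and the variable names here are assumptions made for illustration.

```python
import numpy as np

ages = np.array([20, 21, 24, 31, 33, 44, 36, 75])
n_bins = 10

# Every bin has the same width: (max - min) / number of bins
width = (ages.max() - ages.min()) / n_bins

# n_bins + 1 equally spaced edges from min to max
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)

# Comparing against the interior edges yields a bin index from 0 to n_bins - 1,
# so the maximum value lands in the last bin
bin_index = np.digitize(ages, edges[1:-1])
counts = np.bincount(bin_index, minlength=n_bins)

print(width)
print(edges)
print(counts)
```

Because the single large value 75 stretches the range, several of the ten bins stay empty, which illustrates that equal width binning keeps the original spread of the data.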

#### Equal Frequency binning
![image](images/efbin.png)

Suppose we need ten bins. Then we keep 10% of the total observations in each bin, so the bins may have unequal widths, as in the picture above.
- It is useful when the data has outliers
- It makes the spread uniform
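The idea can be sketched with `pandas.qcut`, which places the bin edges at quantiles. The data below is made up for illustration (19 ordinary values plus one outlier), and four bins are used instead of ten to keep the output short.

```python
import pandas as pd

# Hypothetical data: the values 1..19 plus one extreme outlier
values = pd.Series(list(range(1, 20)) + [1000])

# Four quantile bins, so each bin receives 25% of the observations
quartile_bins = pd.qcut(values, q=4)
counts = quartile_bins.value_counts(sort=False)

# The interval widths differ even though the counts are equal
widths = [interval.right - interval.left for interval in counts.index]

print(counts)
print(widths)
```

Each bin holds the same number of observations, and the outlier is absorbed into the last, much wider bin.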

#### KMeans binning

The values are grouped into clusters. Clusters are formed by distance, i.e. each value is assigned to its nearest centroid.
<br><p style='color:green'>
We assign the centroids randomly<br>          &darr;<br> Then we draw the boundaries midway between neighbouring centroids<br>          &darr;<br> Then each value is assigned to the group of its nearest centroid.<br></p>
This is the same idea as k-means clustering, applied to a single feature.
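A small sketch of KMeans binning with scikit-learn's `KBinsDiscretizer` and `strategy='kmeans'`. The toy values are an assumption chosen so that three groups are easy to see; they are not taken from the dataset used later.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical one-column data with three visible groups
values = np.array([1, 2, 3, 20, 21, 22, 50, 51, 52], dtype=float).reshape(-1, 1)

# Bin edges are derived from the centres of a one-dimensional k-means clustering
kmeans_binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
binned = kmeans_binner.fit_transform(values)

# bin_edges_ holds one array of edges per input column
edges = kmeans_binner.bin_edges_[0]

print(binned.ravel())
print(edges)
```

Values that lie close together receive the same bin index, regardless of how wide the gap to the next group is.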

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer

In [2]:
titanic=sns.load_dataset('titanic')

In [3]:
titanic=titanic[['age','fare','survived']]

In [4]:
titanic.head()

| |age|fare|survived|
|--|--|--|--|
|0|22.0|7.25|0|
|1|38.0|71.2833|1|
|2|26.0|7.925|1|
|3|35.0|53.1|1|
|4|35.0|8.05|0|


In [5]:
titanic.isnull().sum()

age         177
fare          0
survived      0
dtype: int64

In [6]:
titanic=titanic.dropna(subset=['age'])

In [7]:
titanic.isnull().sum()

age         0
fare        0
survived    0
dtype: int64

In [8]:
X=titanic[['age','fare']]
Y=titanic['survived']

In [9]:
x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=2,test_size=0.2)

In [10]:
# Baseline: decision tree trained on the raw (unbinned) age and fare
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)
accuracy_score(y_pred=y_pred,y_true=y_test)

0.6503496503496503

In [11]:
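# Equal frequency (quantile) binning with 10 bins for each column
kbin_age=KBinsDiscretizer(n_bins=10,encode='ordinal',strategy='quantile')
kbin_fare=KBinsDiscretizer(n_bins=10,encode='ordinal',strategy='quantile')
# Columns are selected by position: 0 is age, 1 is fare
trf=ColumnTransformer(transformers=[
    ('first',kbin_age,[0]),
    ('second',kbin_fare,[1])
])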


In [12]:
x_train_trf=trf.fit_transform(x_train)
x_test_trf=trf.transform(x_test)

In [13]:
trf.named_transformers_

{'first': KBinsDiscretizer(encode='ordinal', n_bins=10),
 'second': KBinsDiscretizer(encode='ordinal', n_bins=10)}

In [14]:
trf.named_transformers_['first'].bin_edges_[0]

array([ 0.42, 10.  , 18.  , 22.  , 25.  , 28.  , 31.  , 35.  , 41.  ,
       50.  , 80.  ])

In [15]:
output=pd.DataFrame({
    'age':x_train['age'],
    'age_trf':x_train_trf[:,0],
    'fare':x_train['fare'],
    'fare_trf':x_train_trf[:,1]
})

In [16]:
output.head()

| |age|age_trf|fare|fare_trf|
|--|--|--|--|--|
|3|35.0|7.0|53.1|8.0|
|541|9.0|0.0|31.275|7.0|
|679|36.0|7.0|512.3292|9.0|
|14|14.0|1.0|7.8542|1.0|
|238|19.0|2.0|10.5|3.0|


In [17]:
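# Attach readable interval labels built from the fitted bin edges.
# Note: pd.cut uses right-closed intervals (a, b] by default, while KBinsDiscretizer
# puts a value that equals an edge into the bin to its right, so the label of a value
# lying exactly on an edge (e.g. age 35.0) shows the neighbouring interval.
output['age_labels']=pd.cut(x=x_train['age'],
                            bins=trf.named_transformers_['first'].bin_edges_[0].tolist())
output['fare_labels']=pd.cut(x=x_train['fare'],
                             bins=trf.named_transformers_['second'].bin_edges_[0].tolist())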

In [18]:
output.head()

| |age|age_trf|fare|fare_trf|age_labels|fare_labels|
|--|--|--|--|--|--|--|
|3|35.0|7.0|53.1|8.0|(31.0, 35.0]|(46.9, 78.85]|
|541|9.0|0.0|31.275|7.0|(0.42, 10.0]|(29.0, 46.9]|
|679|36.0|7.0|512.3292|9.0|(35.0, 41.0]|(78.85, 512.329]|
|14|14.0|1.0|7.8542|1.0|(10.0, 18.0]|(7.75, 7.925]|
|238|19.0|2.0|10.5|3.0|(18.0, 22.0]|(9.225, 12.875]|


In [19]:
clf=DecisionTreeClassifier()
clf.fit(x_train_trf,y_train)
y_pred=clf.predict(x_test_trf)

In [20]:
accuracy_score(y_true=y_test,y_pred=y_pred)

0.7132867132867133

### BINARIZATION
Values up to a threshold are assigned 0 and values above the threshold are assigned 1, so the feature is converted to a binary one.
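A minimal sketch of binarization with scikit-learn's `Binarizer`, assuming a threshold of 0. The `parch` values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical counts of parents/children travelling with each passenger
parch = np.array([[0], [1], [2], [0], [5]], dtype=float)

# Values greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=0.0)
flags = binarizer.fit_transform(parch)

print(flags.ravel())
```

After the transform, the column only records whether the count was above the threshold, not how large it was.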

In [21]:
titanic=sns.load_dataset('titanic')

In [22]:
titanic=titanic[['age','fare','parch','survived']]

In [23]:
titanic.sample(6)

| |age|fare|parch|survived|
|--|--|--|--|--|
|692|NaN|56.4958|0|1|
|733|23.0|13.0|0|0|
|787|8.0|29.125|1|0|
|133|29.0|26.0|0|1|
|652|21.0|8.4333|0|0|
|672|70.0|10.5|0|0|


In [24]:
titanic=titanic.dropna(subset=['age'])

In [25]:
titanic.isnull().sum()

age         0
fare        0
parch       0
survived    0
dtype: int64

In [26]:
X=titanic[['age','fare','parch']]
Y=titanic['survived']
x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=30,test_size=0.3)

In [27]:
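# Binarization: parch > 0 becomes 1 and parch == 0 stays 0 (default threshold is 0);
# age and fare pass through unchanged
from sklearn.preprocessing import Binarizer
trf=ColumnTransformer(transformers=[
    ('bin',Binarizer(copy=False),['parch']),
],remainder='passthrough')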

In [28]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('prep', trf),
    ('clf', DecisionTreeClassifier())
])

cross_val_score(pipeline, X, Y, cv=10, scoring='accuracy').mean()

np.float64(0.6093114241001565)