# 1. Normalization
Feature scaling puts all features on a similar scale, so that no feature dominates only because of its units.
There are two common ways to normalize:


## Min-Max Scaling
A linear transformation that maps each feature to the range [0, 1]:
$$ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$



## Z-Score Normalization
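Rescales each feature to zero mean and unit standard deviation:
$$ z = \frac{x - \mu}{\sigma} $$
where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.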
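
As a quick check of both formulas, here is a minimal sketch on a made-up vector (the values are illustrative and are not taken from the data used later). Note that `np.std` computes the population standard deviation by default.

```python
import numpy as np

v = np.array([2.0, 4.0, 6.0, 8.0])  # made-up example values

# min-max: (X - X_min) / (X_max - X_min)
v_minmax = (v - v.min()) / (v.max() - v.min())

# z-score: (x - mean) / std, using np.std's default population std
v_z = (v - v.mean()) / v.std()

print(v_minmax)               # smallest value maps to 0, largest to 1
print(v_z.mean(), v_z.std())  # approximately 0 and 1
```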


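* Normalization usually speeds up gradient descent: when features share a similar scale, a single learning rate works well in every direction of the loss surface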
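
The effect on gradient descent can be illustrated with a small experiment. This is a sketch under stated assumptions, not a benchmark: the data is synthetic, the model is plain least squares, and `gd_final_mse` is a hypothetical helper that picks a step size from the largest eigenvalue of the Hessian so that both runs are stable.

```python
import numpy as np

def gd_final_mse(X, y, steps=200):
    """Run batch gradient descent on least squares and return the final MSE."""
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])        # add an intercept column
    H = Xb.T @ Xb / n                           # Hessian of the half-MSE loss
    lr = 1.0 / np.linalg.eigvalsh(H).max()      # step size that keeps GD stable
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (Xb @ w - y) / n
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

rng = np.random.default_rng(0)
# two synthetic features on very different scales
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.01 * X[:, 1] + 1.0

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score normalization

mse_raw = gd_final_mse(X, y)
mse_scaled = gd_final_mse(X_scaled, y)
print(mse_raw, mse_scaled)
```

With the same number of steps, the run on the scaled features should end with a much smaller error, because the unscaled problem forces a tiny step size that barely moves the weights of the small-scale feature and the intercept.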

In [15]:
# create some data to play with
import random
import numpy as np
data = np.asarray([[random.randint(0,10)+ random.random() * i * j for i in range(10)] for j in range(10)], dtype = int)
label = np.asarray([random.randint(0,1) for i in range(10)])

In [16]:
data

array([[ 6,  5,  8,  8,  2,  7,  6, 10,  4,  7],
       [ 7,  2,  3,  9,  4,  7, 12,  8,  7,  1],
       [ 1,  7,  9,  2, 11, 11, 12,  6,  7, 23],
       [ 5, 10, 11, 11, 11, 14, 22, 22, 15,  9],
       [ 8,  1, 12, 14, 10, 18, 27, 18, 22, 25],
       [ 3,  6, 13, 12, 10, 22, 22,  4, 31, 17],
       [ 3, 12, 16, 10, 16,  6, 30, 22, 34, 16],
       [ 3,  9,  1, 20, 13, 18, 40,  2, 32, 37],
       [ 7,  6, 18, 13,  4, 16, 54, 33, 62,  5],
       [10, 13, 13, 30, 25, 12, 26,  8, 76, 50]])

In [19]:
# create a new matrix with scaled data using min-max scaling
def min_max_scaling(data):
    # scale each column to [0, 1]; assumes no column is constant (max > min)
    min_max = lambda v: (v - v.min()) / (v.max() - v.min())
    return np.apply_along_axis(min_max, 0, data)

#print(min_max_scaling(data))
#print(data)

In [21]:
def z_score_scaling(data):
    # standardize each column; np.std uses the population std (ddof=0) by default
    z_score = lambda v: (v - np.mean(v)) / np.std(v)
    return np.apply_along_axis(z_score, 0, data)

#print(z_score_scaling(data))
#print(data)
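
Both scaling formulas divide by zero when a column is constant (its range and its standard deviation are both 0). The following is one possible guarded, vectorized variant; the names `safe_min_max_scaling` and `safe_z_score_scaling` are hypothetical, and mapping a constant column to all zeros is an assumption made here, not something the formulas prescribe.

```python
import numpy as np

def safe_min_max_scaling(data):
    """Column-wise min-max scaling; constant columns become all zeros."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid 0/0 for constant columns
    return (data - col_min) / col_range

def safe_z_score_scaling(data):
    """Column-wise z-score scaling; constant columns become all zeros."""
    data = np.asarray(data, dtype=float)
    col_std = data.std(axis=0)
    col_std[col_std == 0] = 1.0       # avoid 0/0 for constant columns
    return (data - data.mean(axis=0)) / col_std

sample = np.array([[1, 5, 10],
                   [2, 5, 20],
                   [3, 5, 30]])       # the middle column is constant
scaled_mm = safe_min_max_scaling(sample)
scaled_z = safe_z_score_scaling(sample)
```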

# 2. Encoding Categorical Features
1. Ordinal Encoding
2. One-hot Encoding
3. Binary Encoding


* Algorithms such as logistic regression take numeric input, so categorical features must be encoded first

In [33]:
# create categorical features
c1 = ["very cold", "cold", "normal", "hot", "very hot"]
c2 = ["cotton", "silk", "wool"]
c3 = ["orange", "apple", "banana", "lemon", "watermelon", "strawberry", "peach", "pear"]
data = np.asarray([[c1[random.randint(0,4)],c2[random.randint(0,2)],c3[random.randint(0,7)]] for j in range(10)])
print(data[:,0])


['cold' 'cold' 'normal' 'normal' 'very hot' 'very cold' 'normal' 'cold'
 'hot' 'cold']


## Ordinal Encoding 
Ordinal encoding is used when the categories have a natural order.

For example, the grades "High", "Mid", "Low".

The order cannot be inferred from the data, so the mapping from category to integer has to be written by hand.
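
A minimal sketch of such a hand-written mapping, using the grade example above. The dictionary `GRADE_ORDER` and the helper `ordinal_encode` are hypothetical names, and the integers chosen are an assumption; only their relative order matters.

```python
import numpy as np

# hand-written order: a larger number means a higher grade (assumed values)
GRADE_ORDER = {"Low": 0, "Mid": 1, "High": 2}

def ordinal_encode(col, order):
    """Replace each category by its rank; raises KeyError on an unknown label."""
    return np.array([order[value] for value in col])

grades = ["High", "Low", "Mid", "Low"]
encoded = ordinal_encode(grades, GRADE_ORDER)
print(encoded)
```

Looking labels up in a dictionary makes an unexpected or misspelled label fail loudly instead of being silently mapped to some category.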

## One-hot Encoding
One-hot encoding is used for categories that have no natural order.

Note that this method can take a lot of space when a feature has many categories. To deal with this:

1. use sparse vectors
2. use feature selection techniques to lower the dimension
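
One way to picture the sparse option: every one-hot row contains a single 1, so it is enough to store the column index of that 1 and rebuild the dense matrix only when it is needed. A minimal sketch using NumPy only; `one_hot_indices` and `to_dense` are hypothetical helper names, and the categories are ordered alphabetically because `np.unique` sorts them.

```python
import numpy as np

def one_hot_indices(col):
    """Return the sorted categories and, per row, the index of its category."""
    categories, codes = np.unique(col, return_inverse=True)
    return categories, codes

def to_dense(codes, n_categories):
    """Expand the compact index form into a dense one-hot matrix."""
    dense = np.zeros((len(codes), n_categories))
    dense[np.arange(len(codes)), codes] = 1.0
    return dense

fabrics = np.array(["wool", "silk", "cotton", "silk"])
categories, codes = one_hot_indices(fabrics)
dense = to_dense(codes, len(categories))
print(categories)
print(codes)
print(dense)
```

The index form needs one integer per row regardless of the number of categories, while the dense form needs one column per category.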

## Binary Encoding
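Binary encoding is like one-hot encoding but takes less space: k categories need about log2(k) binary digits instead of k columns.

Step 1: Assign each category an ID (e.g. 1, 2, 3, 4, ...)

Step 2: Write the ID as a binary number; each binary digit can then be used as one feature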
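
The two steps can be sketched as follows, with each binary digit placed in its own column. This is an illustration under assumptions: `binary_encode_columns` is a hypothetical helper, IDs start at 0 rather than 1, and they follow the alphabetical order produced by `np.unique`.

```python
import numpy as np

def binary_encode_columns(col):
    """Encode categories as rows of bits, most significant bit first."""
    # Step 1: assign each category an integer ID (0, 1, 2, ...)
    categories, ids = np.unique(col, return_inverse=True)
    # number of bits needed to write the largest ID (at least one)
    n_bits = max(1, (len(categories) - 1).bit_length())
    # Step 2: extract each bit of the ID with shifts and a mask
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (ids[:, None] >> shifts) & 1
    return categories, bits

fruits = np.array(["apple", "banana", "lemon", "apple", "pear"])
categories, bits = binary_encode_columns(fruits)
print(categories)
print(bits)
```

Four categories fit in two bit columns here, where one-hot encoding would need four.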

In [36]:
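# c1 uses ordinal encoding:
def ordinal(value):
    # map each temperature label to its rank, coldest = 1
    if value == "very cold":
        return 1
    elif value == "cold":
        return 2
    elif value == "normal":
        return 3
    elif value == "hot":
        return 4
    else:
        # any other label is treated as the hottest category
        return 5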
    
def ordinal_encoding(col):
    vfunc = np.vectorize(ordinal)
    return vfunc(col)

print(ordinal_encoding(data[:,0]))

[2 2 3 3 5 1 3 2 4 2]


In [38]:
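# c2 uses one-hot encoding
def one_hot_encoding(col):
    # map each distinct category to a column index
    # (set order is arbitrary, so the column order can differ between Python sessions)
    categories = list(set(col))
    index = {cat: i for i, cat in enumerate(categories)}
    d = np.zeros([len(col), len(categories)])
    for row, value in enumerate(col):
        d[row, index[value]] = 1
    return d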

print(data[:,1])
print(one_hot_encoding(data[:,1]))

['wool' 'wool' 'silk' 'silk' 'silk' 'silk' 'cotton' 'silk' 'silk' 'silk']
[[0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]


In [53]:
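# c3 uses binary encoding

# x is a non-negative integer
def convert_to_binary(x):
    # build a decimal integer whose digits are the bits of x, e.g. 4 -> 100
    ans = 0
    i = 1
    while x > 0:
        ans += (x % 2) * i
        i *= 10
        x //= 2  # integer division keeps x an int
    return ans

#convert_to_binary(4) == 100

def binary_encoding(col):
    # give each distinct category an ID and store its binary form
    # (set order is arbitrary, so the IDs can differ between Python sessions)
    _dict = {}
    for i, it in enumerate(set(col)):
        _dict[it] = convert_to_binary(i)

    f = lambda x: _dict[x]
    vfunc = np.vectorize(f)
    return vfunc(col)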

print(data[:,2])
print(binary_encoding(data[:,2]))

['strawberry' 'pear' 'orange' 'orange' 'banana' 'orange' 'orange'
 'watermelon' 'pear' 'banana']
[100  10   0   0  11   0   0   1  10  11]
