# Data Transformation

Data transformation converts the data into a form suitable for data mining. Common data
transformation methods include:

- Normalization
- Discretization

## Normalization


Some attributes span a wide range between their minimum and maximum values, e.g. 0.1 and 1000.
Normalization rescales attribute values to a common scale, such as the range [0, 1]. This is
particularly useful for algorithms such as **neural networks** and **k-nearest neighbors**.

Let `v` be the original value of attribute A and `v'` its normalized value. The two most common normalization methods are:

- Z-score normalization: `v' = (v - mean(A)) / std(A)`, which gives the attribute zero mean and unit standard deviation.
- Min-max normalization: `v' = (v - min(A)) / (max(A) - min(A))`, which maps the values to [0, 1]. To map them to an arbitrary range [`new_min_A`, `new_max_A`] instead, multiply this result by `(new_max_A - new_min_A)` and add `new_min_A`.
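The two formulas can be checked by hand on a small array. The following is a minimal sketch, not part of the original notebook: the helper names `z_score` and `min_max` and the sample values are made up for illustration, and the population standard deviation (NumPy's default) is assumed, which is also what `StandardScaler` uses.

```python
import numpy as np

def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the values fall in [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    unit = (values - values.min()) / (values.max() - values.min())
    return unit * (new_max - new_min) + new_min

v = np.array([0.1, 10.0, 250.0, 1000.0])  # made-up values with a wide range

z = z_score(v)              # mean close to 0, standard deviation close to 1
m = min_max(v)              # values in [0, 1]
m2 = min_max(v, -1.0, 1.0)  # values in [-1, 1]

print(z.mean(), z.std())
print(m.min(), m.max())
print(m2.min(), m2.max())
```

Note that `min_max` divides by `max - min`, so a constant attribute would need special handling; this sketch does not cover that case.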

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [4]:
df = pd.read_csv("/home/antonio/Dropbox/DataAnalysis/python/dati/all_age.csv")
dfTotal = df.Total
dfTotal.describe()

count    1.730000e+02
mean     2.302566e+05
std      4.220685e+05
min      2.396000e+03
25%      2.428000e+04
50%      7.579100e+04
75%      2.057630e+05
max      3.123510e+06
Name: Total, dtype: float64

In [16]:
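ss = StandardScaler()
mms = MinMaxScaler()
# scikit-learn scalers expect a 2-D input of shape (n_samples, n_features),
# so the Series is converted to a single-column DataFrame first.
X = dfTotal.to_frame()
dfTotalStand = ss.fit_transform(X)
dfTotalScaled = mms.fit_transform(X)
np.mean(dfTotalStand), np.std(dfTotalStand), np.max(dfTotalScaled)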



(1.0267958609250002e-17, 1.0, 1.0)

## Discretization

Discretization transforms a numeric attribute into a discrete one by creating a set of contiguous
intervals (or, equivalently, a set of split points) that spans the range of the attribute's values.
Discretization should significantly reduce the number of distinct values of the attribute, since a
large number of possible values can make model building slow or ineffective.

Depending on whether they use the class attribute, discretization algorithms are divided into
**unsupervised** and **supervised** methods.

**Unsupervised discretization** algorithms do not take class information into account and are very
simple. Two common ones are **equal-width discretization** and **equal-frequency discretization**.
Equal-width discretization computes the minimum and maximum of the attribute to be discretized and
divides that range into k intervals of equal width. Equal-frequency discretization partitions the
value range into k intervals so that each interval contains the same number of instances, or as
close to it as possible.

**Supervised discretization** algorithms take class information into account and are more
complicated. There are many ways to do supervised discretization. ChiMerge is a simple supervised
method based on the chi-square statistic. It sorts the values of the given attribute in ascending
order and initially constructs one interval for each distinct value. It then computes the chi-square
statistic for every pair of adjacent intervals and merges the pairs with the lowest chi-square
values. It stops when the chi-square values of all pairs of adjacent intervals exceed a specified
threshold.
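The chi-square merging procedure described above can be sketched in plain Python. This is an illustrative simplification rather than a reference implementation: the function names `chi_square` and `chi_merge`, the toy data, and the threshold are all assumptions, and the sketch merges one lowest-scoring pair per iteration.

```python
from collections import Counter

def chi_square(counts_a, counts_b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    total = total_a + total_b
    chi2 = 0.0
    for c in classes:
        col = counts_a[c] + counts_b[c]
        if col == 0:
            continue  # a class absent from both intervals contributes nothing
        for observed, row in ((counts_a[c], total_a), (counts_b[c], total_b)):
            expected = row * col / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def chi_merge(values, labels, threshold):
    """Return the lower bound of each interval left after merging."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value, holding its class counts.
    grouped = {}
    for v, y in zip(values, labels):
        grouped.setdefault(v, Counter())[y] += 1
    intervals = [[v, grouped[v]] for v in sorted(grouped)]

    while len(intervals) > 1:
        scores = [
            chi_square(intervals[i][1], intervals[i + 1][1], classes)
            for i in range(len(intervals) - 1)
        ]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            break  # every adjacent pair now differs enough: stop merging
        # Merge the most similar adjacent pair into a single interval.
        intervals[best][1] = intervals[best][1] + intervals[best + 1][1]
        del intervals[best + 1]
    return [lower for lower, _ in intervals]

# Toy data: small values belong to class "a", large values to class "b".
values = [1, 2, 3, 7, 8, 9]
labels = ["a", "a", "a", "b", "b", "b"]
cut_points = chi_merge(values, labels, threshold=2.706)  # illustrative threshold
print(cut_points)
```

On this toy data the intervals with identical class distributions are merged first, leaving one interval starting at 1 and one starting at 7.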

In Python, the unsupervised discretization methods are available in pandas as `pd.cut` (equal-width) and `pd.qcut` (equal-frequency):

In [19]:
pd.cut(dfTotal, bins = 2).value_counts()

(-725.114, 1562953]    169
(1562953, 3123510]       4
Name: Total, dtype: int64

In [21]:
pd.qcut(dfTotal,2).value_counts()

[2396, 75791]       87
(75791, 3123510]    86
Name: Total, dtype: int64

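`cut` bins by value, splitting the range into equal-width intervals, while `qcut` bins by the distribution of the input variable, choosing quantile-based edges so that each bin holds roughly the same number of observations.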
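To make the difference between the two binning strategies concrete, the bin edges can also be computed by hand with NumPy. This is a small sketch on made-up, skewed data, not the dataset used above. Note that `np.histogram` closes its bins on the left whereas `pd.cut` closes them on the right by default, so counts could differ for values that fall exactly on an edge.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # made-up, skewed sample
k = 2  # number of bins

# Equal-width: k intervals of identical length spanning [min, max].
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal-frequency: edges at the quantiles, so bins hold about the same count.
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))

width_counts, _ = np.histogram(values, bins=width_edges)
freq_counts, _ = np.histogram(values, bins=freq_edges)

print(width_edges, width_counts)
print(freq_edges, freq_counts)
```

With equal-width bins the single outlier pushes almost every value into the first bin, while the quantile-based edges split the sample evenly, which mirrors the `cut` and `qcut` results shown above.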