<h1 align="center">7.3 Data Transformation Part II

In [1]:
import pandas as pd
import numpy as np

<b>This Part Includes:
    
    * Discretization and Binning
    * Detecting and Filtering Outliers
    * Permutation and Random Sampling
    * Computing Indicator/Dummy Variables

<b>Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. The example below groups a list of ages into four bins with edges at 18, 25, 35, 60, and 100.

In [2]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [3]:
bins = [18, 25, 35, 60, 100]

In [4]:
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object.

    The output you see describes the bins computed by pandas.cut.
    You can treat it like an array of strings indicating the bin name;
    internally it contains a categories array specifying the distinct category
    names along with a labeling for the ages data in the codes attribute.

In [5]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [6]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [7]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

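     pd.value_counts(cats) returns the bin counts for the result of pandas.cut, ordered by count rather than by bin.
     In pandas 2.1 and later the top-level pd.value_counts is deprecated; pd.Series(cats).value_counts() gives the same counts.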
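
A minimal sketch, assuming you would rather read the counts in bin order than in order of frequency. It wraps the Categorical in a Series and sorts the result by its index; the names `binned` and `counts_in_bin_order` are illustrative and do not appear in the cells above.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Wrap the Categorical in a Series so the value_counts method is available.
binned = pd.Series(pd.cut(ages, bins))

# value_counts orders by frequency; sort_index restores the bin order.
counts_in_bin_order = binned.value_counts().sort_index()
print(counts_in_bin_order)
```

Sorting by index works here because the bins form an ordered set of categories, so the first row is always the lowest interval.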

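By default the intervals are closed on the right: a parenthesis marks the open side and a square bracket the closed side. Pass right=False to close them on the left instead. The example below also shifts the bin edges by one, so every age in this dataset lands in the same group as before.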

In [8]:
pd.cut(ages, [18, 26, 36, 61, 100],right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
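
To see what the closed side changes, the sketch below bins a single age that sits exactly on an edge, using the same edges both times. The variable names are illustrative.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_age = [25]  # a value that sits exactly on a bin edge

# Default (right=True): 25 belongs to the first bin, (18, 25].
right_closed = pd.cut(edge_age, bins)

# right=False: 25 now belongs to the second bin, [25, 35).
left_closed = pd.cut(edge_age, bins, right=False)

print(right_closed.codes[0], left_closed.codes[0])  # 0 1
```

Only values that coincide with an edge move between bins, which is why the choice matters most for integer data such as ages.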

You can also pass your own bin names by passing a list or array to the labels option:

In [9]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [10]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

In [11]:
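data = np.random.rand(20)
# An integer instead of explicit edges asks for that many equal-width bins over the range of the data.
pd.cut(data, 3, precision=3, labels=['1/3', '2/3', '3/3'])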

[2/3, 1/3, 2/3, 3/3, 3/3, ..., 3/3, 2/3, 1/3, 3/3, 3/3]
Length: 20
Categories (3, object): [1/3 < 2/3 < 3/3]
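
When cut receives an integer, it derives the bin edges from the minimum and maximum of the data. The sketch below uses retbins=True to return those edges so they can be inspected. The seed and the variable names are arbitrary choices for this illustration. The pandas documentation notes that the range is extended slightly (by 0.1%) so that the extreme values are included, which is why the first bin can be a touch wider than the others.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
sample = rng.random(20)

# retbins=True returns the computed edges alongside the binned values.
binned, edges = pd.cut(sample, 3, retbins=True)

# Three bins need four edges; the widths should be (nearly) identical.
widths = np.diff(edges)
print(edges)
print(widths)
```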

A closely related function, qcut, bins the data based on sample quantiles.

    Depending on the distribution of the data, using cut will not usually result in each bin having the same
    number of data points. Since qcut uses sample quantiles instead, by definition you will obtain
    roughly equal-size bins.

In [12]:
cats = pd.qcut(data, 4)
cats

[(0.257, 0.601], (0.012799999999999999, 0.257], (0.257, 0.601], (0.731, 0.983], (0.731, 0.983], ..., (0.601, 0.731], (0.257, 0.601], (0.012799999999999999, 0.257], (0.731, 0.983], (0.731, 0.983]]
Length: 20
Categories (4, interval[float64]): [(0.012799999999999999, 0.257] < (0.257, 0.601] < (0.601, 0.731] < (0.731, 0.983]]

In [13]:
cats = pd.qcut(ages, 4)
cats

[(19.999, 22.75], (19.999, 22.75], (22.75, 29.0], (22.75, 29.0], (19.999, 22.75], ..., (29.0, 38.0], (38.0, 61.0], (38.0, 61.0], (38.0, 61.0], (29.0, 38.0]]
Length: 12
Categories (4, interval[float64]): [(19.999, 22.75] < (22.75, 29.0] < (29.0, 38.0] < (38.0, 61.0]]

In [14]:
pd.value_counts(cats)

(38.0, 61.0]       3
(29.0, 38.0]       3
(22.75, 29.0]      3
(19.999, 22.75]    3
dtype: int64
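
The difference between the two functions is easiest to see on skewed data. The sketch below is an illustration under assumed inputs (an exponential sample with an arbitrary seed, not data from this notebook): cut produces equal-width bins with very uneven counts, while qcut produces uneven widths with equal counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
skewed = rng.exponential(size=1000)

# Equal-width bins: most points pile up in the first bin.
equal_width = pd.Series(pd.cut(skewed, 4)).value_counts().sort_index()

# Quantile bins: each bin holds a quarter of the points.
equal_count = pd.Series(pd.qcut(skewed, 4)).value_counts().sort_index()

print(equal_width)
print(equal_count)
```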

You can pass your own quantiles (numbers between 0 and 1, inclusive):

In [15]:
cats = pd.qcut(ages, [0, 0.1, 0.5, 0.9, 1.])

In [16]:
pd.value_counts(cats)

(29.0, 44.6]      4
(21.1, 29.0]      4
(44.6, 61.0]      2
(19.999, 21.1]    2
dtype: int64
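
The interval edges in this output are the sample quantiles of the ages. As a rough check, the sketch below computes them directly with numpy.quantile, assuming its default linear interpolation; the variable names are illustrative. The lowest edge appears as 19.999 rather than 20 in the pandas output, apparently so that the minimum value falls inside the first interval.

```python
import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
quantiles = [0, 0.1, 0.5, 0.9, 1.0]

# The quantile values that serve as bin edges.
edges = np.quantile(ages, quantiles)
print(edges)

# Counts per quantile bin, listed from the lowest bin to the highest.
counts = pd.Series(pd.qcut(ages, quantiles)).value_counts().sort_index()
print(counts)
```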

<b>Detecting and Filtering Outliers

In [17]:
data = pd.DataFrame(np.random.randn(1000, 3))
data.describe()

                 0            1            2
count  1000.000000  1000.000000  1000.000000
mean      0.002698     0.044184    -0.001567
std       1.004708     1.041357     1.004593
min      -4.058307    -3.028553    -3.217595
25%      -0.657540    -0.638667    -0.700270
50%       0.011738     0.031730    -0.029948
75%       0.695563     0.701066     0.686490
max       3.074714     3.713445     2.770659


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [18]:
col = data[2]

In [19]:
col[np.abs(col) > 3]

296   -3.217595
Name: 2, dtype: float64

To select all rows having a value greater than 3 or less than –3, you can use the any method on a Boolean DataFrame:

In [20]:
# Pass axis by keyword: recent pandas versions no longer accept it positionally.
data[(np.abs(data) > 3).any(axis=1)]

            0         1         2
32   0.121728  3.186055  2.471668
275  3.033218 -0.259795  0.256089
296  1.398065 -1.766047 -3.217595
382 -1.801886  3.559961  0.645694
399 -1.633081  3.562227 -1.793979
572 -1.235226 -3.028553  0.878305
770  0.180620 -3.010497  0.516598
774 -0.414712  3.713445  0.346374
807 -4.058307  0.301170 -0.074842
928  3.074714 -0.932968  1.417400


Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [21]:
data[np.abs(data) > 3] = np.sign(data) * 3

The statement np.sign(data) produces 1 and –1 values based on whether the values in data are positive or negative:

In [22]:
np.sign(data).head()

     0    1    2
0 -1.0  1.0 -1.0
1  1.0 -1.0  1.0
2  1.0  1.0 -1.0
3  1.0  1.0  1.0
4 -1.0  1.0 -1.0


In [23]:
data.describe()

                 0            1            2
count  1000.000000  1000.000000  1000.000000
mean      0.003648     0.042202    -0.001350
std       1.000649     1.034913     1.003919
min      -3.000000    -3.000000    -3.000000
25%      -0.657540    -0.638667    -0.700270
50%       0.011738     0.031730    -0.029948
75%       0.695563     0.701066     0.686490
max       3.000000     3.000000     2.770659


In [24]:
# After capping, the same rows now show values of exactly 3 or -3.
data[(np.abs(data) >= 3).any(axis=1)]

            0         1         2
32   0.121728  3.000000  2.471668
275  3.000000 -0.259795  0.256089
296  1.398065 -1.766047 -3.000000
382 -1.801886  3.000000  0.645694
399 -1.633081  3.000000 -1.793979
572 -1.235226 -3.000000  0.878305
770  0.180620 -3.000000  0.516598
774 -0.414712  3.000000  0.346374
807 -3.000000  0.301170 -0.074842
928  3.000000 -0.932968  1.417400
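
An alternative sketch, not the approach used in the cells above: DataFrame.clip bounds every value to a closed interval in one call. The block generates its own frame with an arbitrary seed so that it runs independently, and it compares the result against the np.sign assignment from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed
frame = pd.DataFrame(rng.standard_normal((1000, 3)))

# Approach from the text: overwrite out-of-range entries with +3 or -3.
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip limits every value to the interval [-3, 3].
clipped = frame.clip(lower=-3, upper=3)

print(capped.equals(clipped))
```

clip returns a new DataFrame and leaves the original untouched, which can be convenient when the uncapped values are still needed.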


<b>Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function.

    Calling permutation with the length of the axis you want to permute produces an array of integers
    indicating the new ordering.

In [25]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [26]:
df

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


In [27]:
sampler = np.random.permutation(5)

In [28]:
df.take(sampler)

    0   1   2   3
4  16  17  18  19
3  12  13  14  15
1   4   5   6   7
2   8   9  10  11
0   0   1   2   3
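
If the shuffle needs to be repeatable, one option is a seeded generator. The sketch below assumes NumPy's newer Generator interface (np.random.default_rng) in place of the global np.random.permutation used above, and also shows sample(frac=1) as another way to shuffle all rows. The seed value 42 and the names `frame`, `order`, `shuffled` and `also_shuffled` are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded generator gives the same permutation on every run.
rng = np.random.default_rng(42)
order = rng.permutation(len(frame))
shuffled = frame.iloc[order]

# sample(frac=1) draws every row once without replacement: also a shuffle.
also_shuffled = frame.sample(frac=1, random_state=42)

print(shuffled)
print(also_shuffled)
```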


To select a random subset without replacement, you can use the sample method on Series and DataFrame:

In [29]:
df.sample(n=3)

    0   1   2   3
4  16  17  18  19
1   4   5   6   7
3  12  13  14  15


To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:

In [30]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)

In [31]:
draws

2   -1
3    6
2   -1
2   -1
2   -1
0    5
4    4
1    7
2   -1
2   -1
dtype: int64
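
Because the draws are random, the output above will differ from run to run. A small sketch, assuming reproducibility is wanted: sample accepts a random_state argument, and the value 0 used here is arbitrary.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# The same random_state yields the same draws.
first = choices.sample(n=10, replace=True, random_state=0)
second = choices.sample(n=10, replace=True, random_state=0)

print(first.equals(second))   # True
print(first.index.is_unique)  # False: 10 draws from 5 values must repeat
```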

<b>Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix.

    If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with
    k columns containing all 1s and 0s.
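
To make the idea concrete, the indicator matrix can be built by hand. This is a sketch for illustration only, not how get_dummies is implemented: it compares every value against each of the k distinct levels using NumPy broadcasting. The names `keys`, `levels`, `indicator` and `manual` are hypothetical.

```python
import numpy as np
import pandas as pd

keys = np.array(['b', 'b', 'a', 'c', 'a', 'b'])
levels = np.unique(keys)  # the k distinct values, sorted: a, b, c

# One row per key, one column per level; 1 where the key equals the level.
indicator = (keys[:, None] == levels[None, :]).astype(int)
manual = pd.DataFrame(indicator, columns=levels)

print(manual)
```

Each row contains exactly one 1, marking the level that the original value belonged to.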

In [32]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [33]:
pd.get_dummies(df)

   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [34]:
dummies = pd.get_dummies(df['key'], prefix='keo')
df_with_dummy = df[['data1']].join(dummies)

In [35]:
df_with_dummy

   data1  keo_a  keo_b  keo_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0


A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut.

    We set the random seed with numpy.random.seed(2121) so that the random numbers, and therefore the example, are reproducible.

In [36]:
np.random.seed(2121)
values = np.random.rand(10)
values

array([0.2544823 , 0.98083066, 0.82143569, 0.90573965, 0.61934275,
       0.56108577, 0.28651289, 0.85731024, 0.31784322, 0.93673663])

In [37]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [38]:
pd.get_dummies(pd.cut(values, bins))

   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           1           0           0           0
1           0           0           0           0           1
2           0           0           0           0           1
3           0           0           0           0           1
4           0           0           0           1           0
5           0           0           1           0           0
6           0           1           0           0           0
7           0           0           0           0           1
8           0           1           0           0           0
9           0           0           0           0           1
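
The pieces from this section can be combined. The sketch below, under assumed names, gives the bins readable labels, adds a column prefix, and joins the indicators back to the original values. The labels `q1` to `q5` and the prefix `bin` are hypothetical choices, and dtype=int is passed so the indicators are 0/1 integers regardless of the pandas version.

```python
import numpy as np
import pandas as pd

np.random.seed(2121)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['q1', 'q2', 'q3', 'q4', 'q5']  # hypothetical bin names

# Discretize with named bins, then expand into indicator columns.
binned = pd.cut(values, bins, labels=labels)
dummies = pd.get_dummies(binned, prefix='bin', dtype=int)

# Keep the original values next to their indicators.
result = pd.DataFrame({'value': values}).join(dummies)
print(result)
```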
