# Data Transformation

So far in this chapter we’ve been concerned with rearranging data. Filtering, cleaning,
and other transformations are another class of important operations.

In [88]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series



## Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an
example:

In [2]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
    'k2': [1, 1, 2, 3, 3, 4, 4]})

In [3]:
data 

Unnamed: 0,k1,k2
0,one,1
1,one,1
2,one,2
3,two,3
4,two,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each
row is a duplicate or not:

In [4]:
data.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame containing only the rows where the duplicated array is False:

In [5]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
2,one,2
3,two,3
5,two,4


Both of these methods by default consider all of the columns; alternatively you can
specify any subset of them to detect duplicates. Suppose we had an additional column
of values and wanted to filter duplicates only based on the 'k1' column:

In [6]:
data['v1'] = range(7)

In [7]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
3,two,3,3


duplicated and drop_duplicates by default keep the first observed value combination.
Passing keep='last' will return the last one:

In [12]:
# keep='last' replaces the deprecated take_last=True argument
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
1,one,1,1
2,one,2,2
4,two,3,4
6,two,4,6
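
The keep argument also accepts False, which discards every row whose key combination occurs more than once. The following is a minimal sketch of the three settings, rebuilding the toy data from above; the variable names first, last, and unique_only are only illustrative:

```python
import pandas as pd

# Same toy data as above: k1/k2 pairs with several repeated rows.
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4],
                     'v1': range(7)})

# Default: keep the first row of each (k1, k2) combination.
first = data.drop_duplicates(['k1', 'k2'])
# Keep the last row of each combination instead.
last = data.drop_duplicates(['k1', 'k2'], keep='last')
# Drop every row whose combination is repeated at all.
unique_only = data.drop_duplicates(['k1', 'k2'], keep=False)

print(first.index.tolist())
print(last.index.tolist())
print(unique_only.index.tolist())
```

Only the ('one', 2) row survives keep=False, since it is the only combination that appears exactly once.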


## Transforming Data Using a Function or Mapping

For many data sets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame. Consider the following hypothetical data
collected about some kinds of meat:

In [13]:

data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
    'corned beef', 'Bacon', 'pastrami', 'honey ham',
    'nova lox'],
    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [14]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came
from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [15]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping,
but here we have a small problem in that some of the meats above are capitalized and
others are not. Thus, we also need to convert each value to lower case:

In [16]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

In [17]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:

In [18]:
data['food'].map(lambda x:meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other data
cleaning-related operations.
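
One difference between the two approaches shown above is how they treat values missing from the mapping. As a rough sketch (the 'tofu' entry and the 'unknown' fallback are made up for illustration), passing a dict yields NaN for unmatched values, while a lambda that indexes the dict directly would raise a KeyError unless it supplies a default:

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['bacon', 'Pastrami', 'tofu'])  # 'tofu' is a made-up unmapped value

# Dict mapping: values without a matching key become NaN instead of raising.
mapped = food.str.lower().map(meat_to_animal)

# Function mapping: dict.get supplies an explicit fallback for unknown foods.
mapped_default = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(mapped)
print(mapped_default)
```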

## Replacing Values

Filling in missing data with the fillna method can be thought of as a special case of
more general value replacement. While map, as you’ve seen above, can be used to modify
a subset of values in an object, replace provides a simpler and more flexible way to do
so. Let’s consider this Series:

In [19]:
data = Series([1., -999., 2., -999., -1000., 3.])

In [20]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:

In [21]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of the values to replace,
followed by the substitute value:

In [22]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [23]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [24]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
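
The same method exists on DataFrame, where a nested dict can restrict the replacement to particular columns. A small sketch, assuming a hypothetical frame in which -999 is a sentinel only in column 'a':

```python
import numpy as np
import pandas as pd

# Hypothetical frame: -999 means "missing" in column 'a' but is real data in 'b'.
frame = pd.DataFrame({'a': [1., -999., 3.], 'b': [-999., 5., 6.]})

# Outer keys are column names; each inner dict maps old values to new ones.
cleaned = frame.replace({'a': {-999: np.nan}})

print(cleaned)
```

As with the Series examples, replace returns a new object here and leaves frame unchanged.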

## Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects. The axes can also be modified
in place without creating a new data structure. Here’s a simple example:

In [25]:
data = DataFrame(np.arange(12).reshape((3, 4)),
    index=['Ohio', 'Colorado', 'New York'],
    columns=['one', 'two', 'three', 'four'])

Like a Series, the axis indexes have a map method:

In [26]:
data.index.map(str.upper)

array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)

You can assign to index, modifying the DataFrame in place:

In [27]:
data.index = data.index.map(str.upper)

In [28]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


If you want to create a transformed version of a data set without modifying the original,
a useful method is rename:

In [29]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values
for a subset of the axis labels:

In [30]:
data.rename(index={'OHIO': 'INDIANA'},
    columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


rename saves having to copy the DataFrame manually and assign to its index and columns
attributes. Should you wish to modify a data set in place, pass inplace=True:

In [31]:
# With inplace=True, rename modifies data directly; recent pandas versions return None
_ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [32]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11
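
Functions and dicts can be mixed in a single rename call, one per axis. A brief sketch, rebuilding the example frame with its original labels, that also checks the source object is left alone:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A function transforms every index label; a dict renames only the listed columns.
renamed = data.rename(index=str.upper, columns={'three': 'peekaboo'})

print(renamed.index.tolist())
print(renamed.columns.tolist())
print(data.index.tolist())  # the original labels are untouched
```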


## Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group them
into discrete age buckets:

In [33]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, you have to use cut, a function in pandas:

In [34]:
bins = [18, 25, 35, 60, 100]

In [35]:
cats = pd.cut(ages, bins)

In [36]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. You can treat it like an array
of strings indicating the bin name; internally it contains a categories index holding the
distinct bins along with a labeling for the ages data in the codes attribute:

In [39]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [41]:
pd.Series(cats).value_counts()

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Consistent with mathematical notation for intervals, a parenthesis means that the side
is open while the square bracket means it is closed (inclusive). Which side is closed can
be changed by passing right=False:

In [42]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, object): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

You can also pass your own bin names by passing a list or array to the labels option:

In [43]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [44]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
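
Because the default bins are open on the left, a value equal to the lowest edge falls outside every bin, as does anything beyond the last edge. The sketch below uses made-up ages on and past the edges to show this, along with the include_lowest option of cut:

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25, 26, 100, 101]  # made-up values sitting on or beyond the edges

# Default: intervals are (18, 25], (25, 35], ... so 18 itself is excluded.
default = pd.cut(edge_ages, bins)
# include_lowest=True closes the first interval on the left as well.
with_lowest = pd.cut(edge_ages, bins, include_lowest=True)

print(default.codes)      # a code of -1 marks a value outside every bin
print(with_lowest.codes)
```

Values that land outside all bins are reported as NaN in the Categorical and as -1 in its codes.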

If you pass cut an integer number of bins instead of explicit bin edges, it will compute
equal-length bins based on the minimum and maximum values in the data. Consider
the case of some uniformly distributed data chopped into fourths:

In [45]:
data = np.random.rand(20)

In [46]:
data

array([ 0.48285705,  0.55141999,  0.9450738 ,  0.36713918,  0.10996427,
        0.01770216,  0.16209992,  0.01053709,  0.03402729,  0.56094978,
        0.40432593,  0.59683134,  0.22527193,  0.94812215,  0.24892144,
        0.95271137,  0.6151931 ,  0.24894314,  0.30492163,  0.97893311])

In [47]:
pd.cut(data, 4, precision=2)

[(0.25, 0.49], (0.49, 0.74], (0.74, 0.98], (0.25, 0.49], (0.0096, 0.25], ..., (0.74, 0.98], (0.49, 0.74], (0.0096, 0.25], (0.25, 0.49], (0.74, 0.98]]
Length: 20
Categories (4, object): [(0.0096, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.98]]

A closely related function, qcut, bins the data based on sample quantiles. Depending
on the distribution of the data, using cut will not usually result in each bin having the
same number of data points. Since qcut uses sample quantiles instead, by definition
you will obtain roughly equal-size bins:

In [48]:
data = np.random.randn(1000)  # normally distributed

In [49]:
cats = pd.qcut(data, 4)  # cut into quartiles

In [50]:
cats

[[-3.0618, -0.721], (0.683, 2.687], [-3.0618, -0.721], [-3.0618, -0.721], (-0.721, -0.0391], ..., [-3.0618, -0.721], (0.683, 2.687], (-0.0391, 0.683], (-0.721, -0.0391], [-3.0618, -0.721]]
Length: 1000
Categories (4, object): [[-3.0618, -0.721] < (-0.721, -0.0391] < (-0.0391, 0.683] < (0.683, 2.687]]

In [51]:
pd.Series(cats).value_counts()

(0.683, 2.687]       250
(-0.0391, 0.683]     250
(-0.721, -0.0391]    250
[-3.0618, -0.721]    250
dtype: int64

As with cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [52]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.256, -0.0391], (-0.0391, 1.322], [-3.0618, -1.256], (-1.256, -0.0391], (-1.256, -0.0391], ..., (-1.256, -0.0391], (1.322, 2.687], (-0.0391, 1.322], (-1.256, -0.0391], (-1.256, -0.0391]]
Length: 1000
Categories (4, object): [[-3.0618, -1.256] < (-1.256, -0.0391] < (-0.0391, 1.322] < (1.322, 2.687]]

We’ll return to cut and qcut later in the chapter on aggregation and group operations,
as these discretization functions are especially useful for quantile and group analysis.
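
To make the contrast between cut and qcut concrete without relying on random draws, the following sketch bins a deterministic, right-skewed array (the squares of 1 through 100, chosen purely for illustration) both ways and counts the members of each bin:

```python
import numpy as np
import pandas as pd

values = np.arange(1, 101) ** 2  # deterministic and right-skewed: 1, 4, 9, ..., 10000

equal_width = pd.cut(values, 4)    # four bins of equal width
equal_count = pd.qcut(values, 4)   # four bins split at the sample quartiles

# codes holds each value's bin number, so bincount gives the bin sizes.
width_counts = np.bincount(equal_width.codes)
quantile_counts = np.bincount(equal_count.codes)

print(width_counts)
print(quantile_counts)
```

With this input the equal-width bins hold 50, 20, 16, and 14 values, while each quantile bin holds 25.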

## Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider
a DataFrame with some normally distributed data:

In [54]:
np.random.seed(12345)

In [55]:
data = DataFrame(np.random.randn(1000, 4))

In [62]:
data.head()

Unnamed: 0,0,1,2,3
0,-0.204708,0.478943,-0.519439,-0.55573
1,1.965781,1.393406,0.092908,0.281746
2,0.769023,1.246435,1.007189,-1.296221
3,0.274992,0.228913,1.352917,0.886429
4,-2.001637,-0.371843,1.669025,-0.43857


In [56]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.067684,0.067924,0.025598,-0.002298
std,0.998035,0.992106,1.006835,0.996794
min,-3.428254,-3.548824,-3.184377,-3.745356
25%,-0.77489,-0.591841,-0.641675,-0.644144
50%,-0.116401,0.101143,0.002073,-0.013611
75%,0.616366,0.780282,0.680391,0.654328
max,3.366626,2.653656,3.260383,3.927528


Suppose you wanted to find values in one of the columns exceeding three in magnitude:

In [57]:
col = data[3]

In [58]:
col[np.abs(col) > 3]

97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64

To select all rows having a value greater than 3 or less than -3, you can use the any
method on a boolean DataFrame:

In [59]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
5,-0.539741,0.476985,3.248944,-1.021228
97,-0.774363,0.552936,0.106061,3.927528
102,-0.655054,-0.56523,3.176873,0.959533
305,-2.315555,0.457246,-0.025907,-3.399312
324,0.050188,1.951312,3.260383,0.963301
400,0.146326,0.508391,-0.196713,-3.745356
499,-0.293333,-0.242459,-3.05699,1.918403
523,-3.428254,-0.296336,-0.439938,-0.867165
586,0.275144,1.179227,-3.184377,1.369891
808,-0.362528,-3.548824,1.553205,-2.186301


Values can just as easily be set based on these criteria. Here is code to cap values outside
the interval -3 to 3:

In [63]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [65]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.067623,0.068473,0.025153,-0.002081
std,0.995485,0.990253,1.003977,0.989736
min,-3.0,-3.0,-3.0,-3.0
25%,-0.77489,-0.591841,-0.641675,-0.644144
50%,-0.116401,0.101143,0.002073,-0.013611
75%,0.616366,0.780282,0.680391,0.654328
max,3.0,2.653656,3.0,3.0


The ufunc np.sign returns an array of 1 and -1 depending on the sign of the values (and 0
where a value is exactly zero).
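
An alternative worth knowing is the clip method, which bounds every value to a given interval in one call. This is a sketch for comparison rather than the approach used in the text; it regenerates the data with the same seed and checks that both routes agree:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

# Approach from the text: overwrite out-of-range values with +3 or -3.
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip bounds every value to the closed interval [-3, 3].
clipped = data.clip(lower=-3, upper=3)

print(np.allclose(capped, clipped))
```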

## Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using
the numpy.random.permutation function. Calling permutation with the length of the axis
you want to permute produces an array of integers indicating the new ordering:

In [66]:
df = DataFrame(np.arange(5 * 4).reshape(5, 4))

In [67]:
sampler = np.random.permutation(5)

In [68]:
sampler

array([1, 0, 2, 3, 4])

That array can then be used in iloc-based indexing or the take function:

In [69]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [70]:
df.take(sampler)

Unnamed: 0,0,1,2,3
1,4,5,6,7
0,0,1,2,3
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


To select a random subset without replacement, one way is to slice off the first k elements
of the array returned by permutation, where k is the desired subset size. There
are much more efficient sampling-without-replacement algorithms, but this is an easy
strategy that uses readily available tools:

In [72]:
df.take(np.random.permutation(len(df))[:3])

Unnamed: 0,0,1,2,3
1,4,5,6,7
3,12,13,14,15
4,16,17,18,19


To generate a sample with replacement, a fast way is to use np.random.randint to
draw random integers:

In [73]:
bag = np.array([5, 7, -1, 6, 4])

In [74]:
sampler = np.random.randint(0, len(bag), size=10)

In [75]:
sampler

array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1])

In [76]:
draws = bag.take(sampler)

In [77]:
draws

array([ 4,  4, -1, -1, -1,  5,  6,  5,  4,  7])
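
Recent pandas versions also provide a sample method on Series and DataFrame that covers both strategies above. A minimal sketch follows; the random_state values are arbitrary and only make the draws repeatable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# Without replacement: three distinct rows.
subset = df.sample(n=3, random_state=0)

# With replacement: rows may repeat, so n can exceed len(df).
draws = df.sample(n=10, replace=True, random_state=0)

print(subset)
print(draws.index.tolist())
```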

## Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications
is converting a categorical variable into a “dummy” or “indicator” matrix. If a
column in a DataFrame has k distinct values, you would derive a matrix or DataFrame
with k columns of 1s and 0s. pandas has a get_dummies function for
doing this, though devising one yourself is not difficult. Let’s return to an earlier example
DataFrame:

In [78]:
df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                'data1': range(6)})

In [79]:
df

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,b


In [80]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame,
which can then be merged with the other data. get_dummies has a prefix argument for
doing just this:

In [81]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [82]:
df_with_dummy = df[['data1']].join(dummies)

In [83]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
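
The prefix-and-join steps above can also be done in one call by handing get_dummies the whole DataFrame and naming the columns to encode. A sketch under the assumption that integer 0/1 output is wanted (recent pandas versions default to booleans, hence the dtype argument):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only 'key'; the remaining columns are carried over unchanged.
df_with_dummy = pd.get_dummies(df, columns=['key'], prefix='key', dtype=int)

print(df_with_dummy.columns.tolist())
print(df_with_dummy)
```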


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated.
Let’s return to the MovieLens 1M dataset from earlier in the book:

In [84]:
mnames = ['movie_id', 'title', 'genres']

In [94]:
# A multi-character separator such as '::' requires the Python parsing engine
movies = pd.read_table('movielens/movies.dat', sep='::', header=None,
                       names=mnames, engine='python')

In [95]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [96]:
movies[:10]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Adding indicator variables for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset (using a nice set.union trick):

In [97]:
genre_iter = (set(x.split('|')) for x in movies.genres)

In [100]:
genres = sorted(set.union(*genre_iter))

In [101]:
genres

['Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

Now, one way to construct the indicator DataFrame is to start with a DataFrame of all
zeros:

In [103]:
dummies = DataFrame(np.zeros((len(movies), len(genres))), 
                       columns=genres)

Now, iterate through each movie and set entries in each row of dummies to 1:

In [105]:
for i, gen in enumerate(movies.genres):
    dummies.loc[i, gen.split('|')] = 1

Then, as above, you can combine this with movies:

In [106]:
movies_windic = movies.join(dummies.add_prefix('Genre'))

In [108]:
movies_windic.head()

Unnamed: 0,movie_id,title,genres,GenreAction,GenreAdventure,GenreAnimation,GenreChildren's,GenreComedy,GenreCrime,GenreDocumentary,...,GenreFantasy,GenreFilm-Noir,GenreHorror,GenreMusical,GenreMystery,GenreRomance,GenreSci-Fi,GenreThriller,GenreWar,GenreWestern
0,1,Toy Story (1995),Animation|Children's|Comedy,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [112]:
movies_windic.loc[:0]

Unnamed: 0,movie_id,title,genres,GenreAction,GenreAdventure,GenreAnimation,GenreChildren's,GenreComedy,GenreCrime,GenreDocumentary,...,GenreFantasy,GenreFilm-Noir,GenreHorror,GenreMusical,GenreMystery,GenreRomance,GenreSci-Fi,GenreThriller,GenreWar,GenreWestern
0,1,Toy Story (1995),Animation|Children's|Comedy,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


NOTE: For much larger data, this method of constructing indicator variables
with multiple membership is not especially speedy. A lower-level function
leveraging the internals of the DataFrame could certainly be written.
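
For delimiter-separated strings like these, pandas also offers the string method str.get_dummies, which avoids the explicit Python loop. The sketch below applies it to three rows copied from the movies table above instead of the full file:

```python
import pandas as pd

# Three rows copied from the movies table above, just for illustration.
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split each string on '|' and build one indicator column per distinct genre.
dummies = movies['genres'].str.get_dummies('|')
movies_windic = movies.join(dummies.add_prefix('Genre'))

print(dummies.columns.tolist())
print(movies_windic.iloc[0])
```

Whether this is faster than the loop on the full dataset would need to be measured; it is offered here mainly as a more compact alternative.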

A useful recipe for statistical applications is to combine get_dummies with a discretization
function like cut:

In [113]:
values = np.random.rand(10)

In [114]:
values

array([ 0.75603383,  0.90830844,  0.96588737,  0.17373658,  0.87592824,
        0.75415641,  0.163486  ,  0.23784062,  0.85564381,  0.58743194])

In [115]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [116]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1]"
0,0,0,0,1,0
1,0,0,0,0,1
2,0,0,0,0,1
3,1,0,0,0,0
4,0,0,0,0,1
5,0,0,0,1,0
6,1,0,0,0,0
7,0,1,0,0,0
8,0,0,0,0,1
9,0,0,1,0,0
