## Data Cleaning and Preparation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm

## Handling Missing Data

For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected:

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python **None** value is also treated as NA in object arrays:

In [7]:
string_data[0] = None

In [8]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

* **dropna**: Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.

* **fillna**: Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.

* **isnull**: Return boolean values indicating which values are missing/NA.

* **notnull**: Negation of isnull.
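As a quick illustration of how the four methods in the list relate to one another, here is a minimal sketch on a small Series. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# A small Series with two missing entries (np.nan and None are both treated as NA)
s = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = s.isnull()      # True where a value is missing
present_mask = s.notnull()     # exact negation of isnull
dropped = s.dropna()           # keeps only index labels 0 and 2
filled = s.fillna(0.0)         # replaces each missing value with 0.0

print(missing_mask.tolist())   # [False, True, False, True]
print(dropped.index.tolist())  # [0, 2]
print(filled.tolist())         # [1.0, 0.0, 3.0, 0.0]
```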

### Filtering Out Missing Data

*dropna* returns the Series with only the non-null data and index values:

In [9]:
from numpy import nan as NA

In [10]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [11]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA or only those containing any NAs. **dropna** by default drops
any row containing a missing value:

In [27]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

In [14]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [15]:
cleaned = data.dropna()

In [16]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing **how='all'** will only drop rows that are all NA:

In [17]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass *axis=1*:

In [33]:
data[4] = NA

In [34]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [28]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [29]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [32]:
data.dropna(how='any')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can
indicate this with the **thresh** argument:

In [39]:
df = pd.DataFrame(np.random.randn(7, 3))

In [40]:
df.iloc[:4, 1] = NA

In [41]:
df.iloc[:2, 2] = NA

In [44]:
df

Unnamed: 0,0,1,2
0,-0.000684,,
1,-0.137928,,
2,1.969575,,-0.145899
3,0.838387,,1.469794
4,-0.813363,-0.221633,-0.581085
5,0.986271,-1.42381,0.745606
6,-0.351158,-1.469969,-0.268093


In [46]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.813363,-0.221633,-0.581085
5,0.986271,-1.42381,0.745606
6,-0.351158,-1.469969,-0.268093


In [47]:
df.dropna(axis=1)

Unnamed: 0,0
0,-0.000684
1,-0.137928
2,1.969575
3,0.838387
4,-0.813363
5,0.986271
6,-0.351158


In [48]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,1.969575,,-0.145899
3,0.838387,,1.469794
4,-0.813363,-0.221633,-0.581085
5,0.986271,-1.42381,0.745606
6,-0.351158,-1.469969,-0.268093
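Because the random data above makes the row counts hard to check by eye, the following sketch repeats the idea on a tiny hand-built frame. The name `frame` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Three columns; each row has a different number of non-NA values
frame = pd.DataFrame(
    [[1.0, np.nan, np.nan],   # 1 non-NA value
     [2.0, np.nan, 5.0],      # 2 non-NA values
     [3.0, 4.0, 6.0]],        # 3 non-NA values
)

# thresh=2 keeps rows with at least 2 non-NA values
kept = frame.dropna(thresh=2)
print(kept.index.tolist())  # [1, 2]

# The default (how='any') keeps only fully populated rows
complete = frame.dropna()
print(complete.index.tolist())  # [2]
```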


### Filling In Missing Data

Calling **fillna** with a constant replaces missing values with that value:

In [49]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.000684,0.0,0.0
1,-0.137928,0.0,0.0
2,1.969575,0.0,-0.145899
3,0.838387,0.0,1.469794
4,-0.813363,-0.221633,-0.581085
5,0.986271,-1.42381,0.745606
6,-0.351158,-1.469969,-0.268093


Calling fillna with a dict, you can use a different fill value for each column:

In [50]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-0.000684,0.5,0.0
1,-0.137928,0.5,0.0
2,1.969575,0.5,-0.145899
3,0.838387,0.5,1.469794
4,-0.813363,-0.221633,-0.581085
5,0.986271,-1.42381,0.745606
6,-0.351158,-1.469969,-0.268093


**fillna** returns a new object, but you can modify the existing object in-place:

In [51]:
_ = df.fillna(0, inplace=True)

In [52]:
df

Unnamed: 0,0,1,2
0,-0.000684,0.0,0.0
1,-0.137928,0.0,0.0
2,1.969575,0.0,-0.145899
3,0.838387,0.0,1.469794
4,-0.813363,-0.221633,-0.581085
5,0.986271,-1.42381,0.745606
6,-0.351158,-1.469969,-0.268093


The same interpolation methods available for reindexing can be used with **fillna**:

In [53]:
df = pd.DataFrame(np.random.randn(6, 3))

In [54]:
df.iloc[2:, 1] = NA

In [55]:
df.iloc[4:, 2] = NA

In [56]:
df

Unnamed: 0,0,1,2
0,2.136861,-0.728507,1.077434
1,-1.044978,2.995481,1.608769
2,0.207383,,0.882619
3,1.333759,,0.096557
4,0.504452,,
5,0.156064,,


In [57]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,2.136861,-0.728507,1.077434
1,-1.044978,2.995481,1.608769
2,0.207383,2.995481,0.882619
3,1.333759,2.995481,0.096557
4,0.504452,2.995481,0.096557
5,0.156064,2.995481,0.096557


In [59]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,2.136861,-0.728507,1.077434
1,-1.044978,2.995481,1.608769
2,0.207383,2.995481,0.882619
3,1.333759,2.995481,0.096557
4,0.504452,,0.096557
5,0.156064,,0.096557
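Recent pandas releases (2.1 and later) deprecate the `method` argument of fillna in favor of the dedicated `ffill` and `bfill` methods. The sketch below shows the `ffill` form, with and without `limit`, on a small invented frame. It is an illustration under that assumption, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [10.0, 20.0, np.nan, np.nan],
})

# Forward-fill every gap with the last valid observation in each column
filled_all = frame.ffill()

# Forward-fill at most one consecutive missing value per gap
filled_limited = frame.ffill(limit=1)

print(filled_all["a"].tolist())      # [1.0, 1.0, 1.0, 1.0]
print(filled_limited["a"].tolist())  # [1.0, 1.0, nan, nan]
```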


With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:

In [60]:
data = pd.Series([1., NA, 3.5, NA, 7])

In [61]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
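The same idea extends to a DataFrame: passing the Series of column means fills each column with its own mean. This is a small sketch with made-up column names `x` and `y`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 10.0, 20.0],
})

# frame.mean() skips NA by default and returns one mean per column;
# fillna aligns that Series on the column labels
filled = frame.fillna(frame.mean())

print(filled["x"].tolist())  # [1.0, 2.0, 3.0]
print(filled["y"].tolist())  # [15.0, 10.0, 20.0]
```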

## Data Transformation

### Removing Duplicates

The DataFrame method **duplicated** returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [62]:
data = pd.DataFrame(
    {'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [63]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [64]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

**drop_duplicates** returns a DataFrame where the duplicated array is *False*:

In [65]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider *all of the columns*; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [66]:
data['v1'] = range(7)

In [67]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [68]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. Passing **keep='last'** will return the last one:

In [69]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [70]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
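Besides 'first' and 'last', the `keep` argument also accepts `False`, which flags every member of a duplicated group. The frame below is invented for this sketch.

```python
import pandas as pd

frame = pd.DataFrame({
    "k1": ["one", "two", "one", "two"],
    "k2": [1, 1, 1, 2],
})

# keep=False marks every member of a duplicated group, not just the repeats
all_dupes = frame.duplicated(keep=False)
print(all_dupes.tolist())  # [True, False, True, False]

# Dropping with keep=False removes all rows that have a duplicate anywhere
unique_only = frame.drop_duplicates(keep=False)
print(unique_only.index.tolist())  # [1, 3]
```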


### Transforming Data Using a Function or Mapping

You may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [71]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [72]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [73]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and
others are not. Thus, we need to convert each value to lowercase using the _str.lower_ Series method:

In [75]:
lowercased = data['food'].str.lower()

In [76]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [77]:
lowercased.map(meat_to_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

In [78]:
data['animal'] = lowercased.map(meat_to_animal)

In [80]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:

In [275]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
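The dict form and the function form differ when a value has no entry in the mapping. The sketch below uses a shortened mapping and an extra food, 'tofu', that is deliberately left out of it. Both are assumptions made for illustration.

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
foods = pd.Series(["Bacon", "pastrami", "tofu"])  # 'tofu' is not in the mapping

# Dict-based map: values missing from the dict become NaN
mapped = foods.str.lower().map(meat_to_animal)
print(mapped.isnull().tolist())  # [False, False, True]

# Function-based map with dict indexing: a missing key raises KeyError
try:
    foods.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# dict.get avoids the exception and yields a missing value for unknown foods
safe = foods.map(lambda x: meat_to_animal.get(x.lower()))
print(safe.isnull().tolist())  # [False, False, True]
```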

**Using map is a convenient way to perform element-wise transformations.**

In [81]:
f = lambda x: meat_to_animal[x.lower()]

In [82]:
f('pastrami')

'cow'

### Replacing Values

Filling in missing data with the **fillna** method is a special case of more general value replacement. As you’ve already seen, **map** can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so. Let’s consider this Series:

In [83]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [84]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True):

In [85]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the substitute value:

In [86]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [87]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [88]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here’s a simple example:

In [91]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [92]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


Like a Series, the axis indexes have a map method:

In [93]:
transform = lambda x: x[:4].upper()

In [94]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [95]:
data.index = data.index.map(transform)

In [96]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original, a useful method is **rename**:

In [290]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, **rename** can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [97]:
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


**rename** saves you from the chore of copying the DataFrame manually and assigning to its index and columns attributes.

### Discretization and Binning

Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [99]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use **cut**, a function in pandas:

In [100]:
bins = [18, 25, 35, 60, 100]

In [103]:
cats = pd.cut(ages, bins)

In [104]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special *Categorical* object. The output you see describes the bins computed by *pandas.cut*. You can treat it like an array of strings indicating the bin name; internally it contains a **categories** array specifying the distinct category names along with a labeling for the ages data in the **codes** attribute:

In [105]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [106]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [107]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

Consistent with mathematical notation for intervals, a parenthesis means that the side
is _open_, while the square bracket means it is _closed_ (inclusive). You can change which side is closed by passing _right=False_:

In [108]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
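To see how the open and closed sides treat values that sit exactly on a bin edge, the sketch below bins a few boundary ages and inspects the integer codes (a code of -1 means the value fell outside every bin). The shortened `edges` list and the ages are chosen for illustration only.

```python
import pandas as pd

edges = [18, 25, 35]
ages = [18, 25, 26, 40]

# Default right=True: bins are (18, 25] and (25, 35]
right_closed = pd.cut(ages, edges)
print(right_closed.codes.tolist())  # [-1, 0, 1, -1]

# right=False: bins are [18, 25) and [25, 35)
left_closed = pd.cut(ages, edges, right=False)
print(left_closed.codes.tolist())   # [0, 1, 1, -1]

# include_lowest=True closes the first bin on the left as well
with_lowest = pd.cut(ages, edges, include_lowest=True)
print(with_lowest.codes.tolist())   # [0, 0, 1, -1]
```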

You can also pass your own bin names by passing a list or array to the **labels** option:

In [109]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [111]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

If you pass an integer number of bins to **cut** instead of explicit bin edges, it will compute *equal-length* bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into fourths:

In [112]:
data = np.random.rand(20)

In [113]:
pd.cut(data, 4, precision=2)

[(0.7, 0.93], (0.03, 0.25], (0.25, 0.48], (0.03, 0.25], (0.7, 0.93], ..., (0.48, 0.7], (0.48, 0.7], (0.03, 0.25], (0.25, 0.48], (0.7, 0.93]]
Length: 20
Categories (4, interval[float64]): [(0.03, 0.25] < (0.25, 0.48] < (0.48, 0.7] < (0.7, 0.93]]

A closely related function, **qcut**, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since **qcut** uses sample quantiles instead, by definition you will obtain roughly *equal-size* bins.

Quantile is the general term for percentiles, quartiles, medians, and the like. Asking qcut for five bins means asking for quintiles: the bin edges are chosen so that each bin holds the same number of records. For an array of 30 randomly drawn values called factors, that is 6 records per bin (the breakpoints depend on the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6

Conversely, cut on the same values produces equal-width bins with uneven counts:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

The same behavior shows up with 1,000 normally distributed values cut into quartiles:

In [121]:
data = np.random.randn(1000) # Normally distributed

In [122]:
#cats = pd.cut(data, 4)

In [123]:
cats = pd.qcut(data, 4)

In [124]:
cats

[(0.709, 3.203], (0.0833, 0.709], (0.0833, 0.709], (-3.082, -0.624], (0.709, 3.203], ..., (-0.624, 0.0833], (-3.082, -0.624], (-0.624, 0.0833], (0.0833, 0.709], (-0.624, 0.0833]]
Length: 1000
Categories (4, interval[float64]): [(-3.082, -0.624] < (-0.624, 0.0833] < (0.0833, 0.709] < (0.709, 3.203]]

In [125]:
pd.value_counts(cats)

(0.709, 3.203]      250
(0.0833, 0.709]     250
(-0.624, 0.0833]    250
(-3.082, -0.624]    250
dtype: int64
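The contrast between equal-width and equal-size bins can be checked side by side. This sketch uses NumPy's `default_rng` with an arbitrary seed so the run is repeatable. The seed and variable names are assumptions, not part of the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seed chosen arbitrarily
values = rng.standard_normal(1000)

# Equal-width bins: counts follow the shape of the distribution
width_counts = pd.Series(pd.cut(values, 4)).value_counts()

# Quantile-based bins: each bin holds the same number of points here
quantile_counts = pd.Series(pd.qcut(values, 4)).value_counts()

print(sorted(quantile_counts.tolist()))  # [250, 250, 250, 250]
print(width_counts.sum())                # 1000
```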

Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [126]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(1.315, 3.203], (0.0833, 1.315], (0.0833, 1.315], (-1.203, 0.0833], (0.0833, 1.315], ..., (-1.203, 0.0833], (-1.203, 0.0833], (-1.203, 0.0833], (0.0833, 1.315], (-1.203, 0.0833]]
Length: 1000
Categories (4, interval[float64]): [(-3.082, -1.203] < (-1.203, 0.0833] < (0.0833, 1.315] < (1.315, 3.203]]

### Detecting and Filtering Outliers

Consider a DataFrame with some normally distributed data:

In [127]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [128]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.046368,0.02064,-0.024727,-0.004983
std,1.043136,1.010167,0.969533,1.024755
min,-3.097962,-3.388268,-3.215733,-3.131419
25%,-0.670718,-0.678169,-0.704977,-0.67644
50%,0.063287,0.011452,0.014606,-0.014137
75%,0.765672,0.683636,0.587717,0.681133
max,3.985879,3.509243,3.140374,3.152124


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [130]:
col = data[2]

In [131]:
col[np.abs(col) > 3]

441   -3.176045
519   -3.215733
618    3.140374
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or –3, you can use the **any** method on a boolean DataFrame:

In [136]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
12,-1.074207,-0.036433,-1.440935,-3.131419
49,-3.097962,-1.729341,-0.123129,0.235654
54,0.147321,-3.388268,-0.757408,-1.187469
122,0.711750,3.509243,1.605087,-0.562102
136,3.985879,-0.678471,0.181373,0.314820
...,...,...,...,...
559,0.366415,0.006604,0.439382,3.152124
618,1.127329,0.256846,3.140374,0.062039
786,-0.498729,-3.147964,-0.607669,0.921124
807,-0.037667,-0.526306,-0.773637,-3.004452


In [137]:
pd.options.display.max_rows = 10

In [138]:
(np.abs(data) > 3)

Unnamed: 0,0,1,2,3
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
...,...,...,...,...
995,False,False,False,False
996,False,False,False,False
997,False,False,False,False
998,False,False,False,False


In [139]:
(np.abs(data) > 3).any()  # per column: does any value exceed 3 in absolute value?

0    True
1    True
2    True
3    True
dtype: bool

In [140]:
(np.abs(data) > 3).any(axis=1)  # per row: does any value exceed 3 in absolute value?

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [141]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [142]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.04548,0.020525,-0.024476,-0.004878
std,1.039578,1.00641,0.967839,1.023529
min,-3.0,-3.0,-3.0,-3.0
25%,-0.670718,-0.678169,-0.704977,-0.67644
50%,0.063287,0.011452,0.014606,-0.014137
75%,0.765672,0.683636,0.587717,0.681133
max,3.0,3.0,3.0,3.0


The statement np.sign(data) produces 1 and –1 values based on whether the values in *data* are positive or negative:

In [143]:
np.sign(data)

Unnamed: 0,0,1,2,3
0,-1.0,-1.0,1.0,-1.0
1,-1.0,-1.0,-1.0,-1.0
2,1.0,1.0,-1.0,-1.0
3,-1.0,-1.0,-1.0,1.0
4,1.0,1.0,-1.0,1.0
...,...,...,...,...
995,1.0,-1.0,1.0,-1.0
996,1.0,1.0,1.0,-1.0
997,-1.0,1.0,1.0,1.0
998,1.0,-1.0,1.0,1.0


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the *numpy.random.permutation* function. Calling **permutation** with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [144]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

In [145]:
sampler = np.random.permutation(5)

In [146]:
sampler

array([3, 2, 0, 4, 1])

That array can then be used in iloc-based indexing or the equivalent take function:

In [147]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [148]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
2,8,9,10,11
0,0,1,2,3
4,16,17,18,19
1,4,5,6,7
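The iloc-based form mentioned above gives the same result as take. A minimal check, using an arbitrarily seeded generator so that the order is repeatable:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

rng = np.random.default_rng(42)   # arbitrary seed
order = rng.permutation(len(frame))

by_take = frame.take(order)
by_iloc = frame.iloc[order]

print(by_take.equals(by_iloc))    # True
```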


To select a random subset without replacement, you can use the **sample** method on Series and DataFrame:

In [153]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
0,0,1,2,3
3,12,13,14,15
1,4,5,6,7


In [154]:
df.sample(n=3, replace=True)

Unnamed: 0,0,1,2,3
1,4,5,6,7
3,12,13,14,15
1,4,5,6,7


To generate a sample with replacement (to allow repeat choices), pass **replace=True** to sample:

In [155]:
choices = pd.Series([5, 7, -1, 6, 4])

In [156]:
draws = choices.sample(n=10, replace=True)

In [157]:
draws

0    5
1    7
2   -1
4    4
1    7
4    4
2   -1
1    7
2   -1
2   -1
dtype: int64
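Two practical details of sample are worth a short sketch: the `random_state` argument makes a draw repeatable, and asking for more rows than exist without replacement raises an error. The seed value here is arbitrary.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# The same random_state yields the same draw on every call
first = choices.sample(n=3, random_state=0)
second = choices.sample(n=3, random_state=0)
print(first.equals(second))  # True

# Without replacement, n cannot exceed the number of rows
try:
    choices.sample(n=10)
    too_many = False
except ValueError:
    too_many = True
print(too_many)  # True
```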

### Computing Indicator/Dummy Variables

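A common transformation for statistical modeling or machine learning is converting a categorical variable into a *dummy* or *indicator* matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a **get_dummies** function for doing this, though devising one yourself is not difficult. Let’s return to an earlier example DataFrame: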

In [162]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

In [163]:
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [164]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. **get_dummies** has a _prefix_ argument for doing this:

In [168]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [169]:
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [170]:
df_with_dummy = df[['data1']].join(dummies)

In [171]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
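In pandas 2.0 and later, get_dummies returns boolean columns by default, so the same calls may display True/False rather than the 1s and 0s shown above. Passing `dtype=int` requests integer indicators explicitly. A short sketch, reusing the example data under the invented name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                      "data1": range(6)})

# dtype=int requests 0/1 indicator columns regardless of the pandas default
dummies = pd.get_dummies(frame["key"], prefix="key", dtype=int)
with_dummies = frame[["data1"]].join(dummies)

print(with_dummies.columns.tolist())  # ['data1', 'key_a', 'key_b', 'key_c']
print(dummies["key_b"].tolist())      # [1, 1, 0, 0, 0, 1]
```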


A useful recipe for statistical applications is to combine **get_dummies** with a discretization function like **cut**:

In [172]:
values = np.random.rand(10)

In [173]:
values

array([0.02078163, 0.28387008, 0.83017586, 0.0913666 , 0.14290305,
       0.74803654, 0.28980622, 0.55440641, 0.49268062, 0.92691099])

In [174]:
# Note: seeding here affects only later random draws, not `values` above
np.random.seed(12345)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [176]:
pd.cut(values, bins)

[(0.0, 0.2], (0.2, 0.4], (0.8, 1.0], (0.0, 0.2], (0.0, 0.2], (0.6, 0.8], (0.2, 0.4], (0.4, 0.6], (0.4, 0.6], (0.8, 1.0]]
Categories (5, interval[float64]): [(0.0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1.0]]

In [178]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,0,0,1
3,1,0,0,0,0
4,1,0,0,0,0
5,0,0,0,1,0
6,0,1,0,0,0
7,0,0,1,0,0
8,0,0,1,0,0
9,0,0,0,0,1


## String Manipulation

Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

### String Object Methods

In [179]:
val = 'a,b, guido'

In [180]:
val.split(',')

['a', 'b', ' guido']

**split** is often combined with **strip** to trim whitespace (including line breaks):

In [181]:
pieces = [x.strip() for x in val.split(',')]

In [182]:
pieces

['a', 'b', 'guido']

In [183]:
first, second, third = pieces

In [184]:
first + '::' + second + '::' + third

'a::b::guido'

In [185]:
'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python’s **in** keyword is the best way to detect a substring, though **index** and **find** can also be used:

In [186]:
'guido' in val

True

In [190]:
val.index(',')

1

In [191]:
val.find(',')

1

In [188]:
val.find(':')

-1

In [189]:
val.index(':')

ValueError: substring not found

**count** returns the number of occurrences of a particular substring:

In [192]:
val.count(',')

2

**replace** will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:

In [193]:
val.replace(',', '::')

'a::b:: guido'

In [194]:
val.replace(',', '')

'ab guido'

![alt text](images/stringmethods.png "Python built-in string methods")

### Regular Expressions

Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a *regex*, is a string formed according to the regular expression language. Python’s built-in **re** module is responsible for applying regular expressions to strings.

The **re** module functions fall into three categories: pattern matching, substitution, and splitting.

Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is *\s+*:

In [195]:
import re

In [196]:
text = "foo bar\t baz \tqux"

In [197]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first *compiled*, and then its split method is called on the passed text. You can compile the regex yourself with *re.compile*, forming a reusable regex object:

In [198]:
regex = re.compile(r'\s+')

In [199]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the **findall** method:

In [200]:
regex.findall(text)

[' ', '\t ', ' \t']

**To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.**

**Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.**

**match** and **search** are closely related to findall. 

While **findall** returns *all* matches in a string, **search** returns _only the first match_.
**match** only matches at the beginning of the string.
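The difference between the three methods is easiest to see on a short string. The pattern and sample text below are invented for this sketch.

```python
import re

digits = re.compile(r"\d+")
sample = "foo 123 bar 456"

all_hits = digits.findall(sample)          # every non-overlapping match
first_hit = digits.search(sample).group()  # only the first match, anywhere
at_start = digits.match(sample)            # None: the string starts with 'foo'
leading = digits.match("789 baz").group()  # matches because digits come first

print(all_hits)   # ['123', '456']
print(first_hit)  # 123
print(at_start)   # None
print(leading)    # 789
```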

Let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [201]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

In [202]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [204]:
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using **findall** on the text produces a list of the email addresses:

In [214]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

**search** returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [215]:
m = regex.search(text)

In [216]:
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [217]:
m.start()

5

In [218]:
m.end()

20

In [219]:
text[m.start():m.end()]

'dave@google.com'

**regex.match** returns None, as it only will match if the pattern occurs at the start of the string:

In [220]:
print(regex.match(text))

None


**sub** will return a new string with occurrences of the pattern replaced by a new string:

In [212]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



In [213]:
re.sub(pattern, 'REDACTED', text, flags=re.IGNORECASE)

'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'

Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [221]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [222]:
regex = re.compile(pattern, flags=re.IGNORECASE)

A **match** object produced by this modified regex returns a tuple of the pattern components with its **groups** method:

In [223]:
m = regex.match('wesm@bright.net')

In [224]:
m.groups()

('wesm', 'bright', 'net')

**findall** returns a list of tuples when the pattern has groups:

In [225]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [226]:
for name, domain, suffix in regex.findall(text):
    print("Name: %s, Domain: %s, Suffix: %s" % (name, domain, suffix))

Name: dave, Domain: google, Suffix: com
Name: steve, Domain: gmail, Suffix: com
Name: rob, Domain: gmail, Suffix: com
Name: ryan, Domain: yahoo, Suffix: com


**sub** also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

In [227]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
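Groups can also be given names, which makes the components easier to refer to than numeric positions. The sketch below adds the names `username`, `domain`, and `suffix` to the email pattern. The names are chosen here for illustration.

```python
import re

pattern = (r"(?P<username>[A-Z0-9._%+-]+)"
           r"@(?P<domain>[A-Z0-9.-]+)"
           r"\.(?P<suffix>[A-Z]{2,4})")
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match("wesm@bright.net")
parts = m.groupdict()
print(parts)  # {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}

# Named groups can be referenced in substitutions with \g<name>
swapped = regex.sub(r"\g<domain>.\g<suffix> (\g<username>)", "wesm@bright.net")
print(swapped)  # bright.net (wesm)
```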



![alt text](images/regexs.png "Regular expression methods")

### Vectorized String Functions in pandas

In [228]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [229]:
data = pd.Series(data)

In [230]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [231]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a _lambda_ or other function) to each value using **data.map**, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s _str_ attribute; for example, we could check whether each email address has 'gmail' in it with **str.contains**:

In [232]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
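The failure that the str methods avoid can be reproduced directly. In this sketch a plain lambda passed to map hits the NaN entry and raises, while str.contains handles it. The `na=False` argument additionally replaces the missing result with False. The Series is a shortened copy of the example data.

```python
import numpy as np
import pandas as pd

emails = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmail.com", "Wes": np.nan},
    dtype=object,
)

# A plain function applied with map hits the float NaN and fails
try:
    emails.map(lambda x: "gmail" in x)
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# str.contains propagates NA by default; na=False fills it with False instead
has_gmail = emails.str.contains("gmail", na=False)
print(has_gmail.tolist())  # [False, True, False]
```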

Regular expressions can be used, too, along with any *re* options like IGNORECASE:

In [233]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [234]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval. Either use **str.get** or index into the **str** attribute:

In [235]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

In [236]:
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

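To access elements in embedded lists, pass an index to **str.get** or index into **str** directly. These calls apply to a Series whose elements are lists or tuples, such as the result of **str.findall**. Here *matches* holds booleans, because **str.match** reports whether each string matches the pattern, so the calls are left commented out:

In [237]:
# matches.str.get(1)

In [418]:
# matches.str[0]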
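One way to make element retrieval work is to start from str.findall, whose result holds a list of tuples per element. The sketch below is an assumption about what the retrieval step was meant to show, written on a shortened copy of the data. It also shows `str.extract`, which returns the captured groups as columns.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series(
    {"Dave": "dave@google.com", "Rob": "rob@gmail.com", "Wes": np.nan},
    dtype=object,
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# findall gives a list of tuples per element; .str[0] picks the first tuple
first_match = emails.str.findall(pattern, flags=re.IGNORECASE).str[0]

# str.get(1) then pulls the second item (the domain) out of each tuple
domains = first_match.str.get(1)
print(domains.tolist()[:2])  # ['google', 'gmail']

# str.extract returns the captured groups as DataFrame columns directly
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
print(parts.shape)  # (3, 3)
```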

You can similarly slice strings using this syntax:

In [238]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

In [239]:
data.str.len()

Dave     15.0
Steve    15.0
Rob      13.0
Wes       NaN
dtype: float64

![alt text](images/vectorizedstrings.png "Partial listing of vectorized string methods")