# CHAPTER 5
# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Many researchers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form. 

## 5.1 Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default. 

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value **NaN** (Not a Number) to represent missing data. We call this a *sentinel value* that can be easily detected:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull() 

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as **NA**, which stands for *not available*. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data. 
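As a minimal sketch of analyzing the missing data itself, the following counts NA values per column and flags rows where every numeric field is absent. The `survey` DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are illustrative only
survey = pd.DataFrame({
    'age': [34, np.nan, 29, np.nan],
    'income': [52000.0, 61000.0, np.nan, np.nan],
    'city': ['Oslo', np.nan, 'Lima', 'Pune'],
})

# Number of missing values in each column
na_per_column = survey.isnull().sum()

# Fraction of missing values in each column
na_fraction = survey.isnull().mean()

# Rows in which both numeric fields are missing
all_numeric_missing = survey[['age', 'income']].isnull().all(axis=1)
```

A column with an unusually high NA fraction, or a cluster of rows missing several fields at once, is often a hint that something went wrong during collection.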

The built-in Python None value is also treated as NA in object arrays:

In [4]:
string_data[0] = None
string_data.isnull() 

0     True
1    False
2     True
3    False
dtype: bool

There is work ongoing in the pandas project to improve the internal details of how missing data is handled, but the user API functions, like **pandas.isnull**, abstract away many of the annoying details. See Table 5.1 for a list of some functions related to missing data handling.

<br>
<center>Table 5.1: NA handling methods</center>
<img src="Table5.1.jpg">

### 5.1.1 Filtering Out Missing Data 

There are a few ways to filter out missing data. While you always have the option to do it by hand using **pandas.isnull** and boolean indexing, the **dropna** method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [6]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [7]:
data[data.notnull()] 

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. **dropna** by default drops any row containing a missing value:

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data 

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [9]:
cleaned = data.dropna()
cleaned

     0    1    2
0  1.0  6.5  3.0


Passing **how='all'** will only drop rows that are all NA:

In [10]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns in the same way, pass **axis=1**:

In [11]:
data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [12]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the **thresh** argument:

In [13]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

          0         1         2
0 -0.591295       NaN       NaN
1 -0.790926       NaN       NaN
2  0.067239       NaN -0.262800
3  0.225386       NaN -0.287232
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526


In [14]:
df.dropna() 

          0         1         2
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526


In [15]:
df.dropna(thresh=2)
# drops any row that has fewer than 2 non-NA values

          0         1         2
2  0.067239       NaN -0.262800
3  0.225386       NaN -0.287232
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526
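Because the example above uses random numbers, it can be hard to see exactly what **thresh** counts. Here is a small deterministic sketch (the `frame` data is made up) showing that `thresh=2` keeps rows with at least two non-NA values, which is the same as filtering on the per-row count by hand:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [2.0, 5.0, np.nan, np.nan],
    'c': [3.0, 6.0, np.nan, np.nan],
})

# Keep rows with at least two non-NA values
kept = frame.dropna(thresh=2)

# The same filter written by hand; count() tallies non-NA values per row
by_hand = frame[frame.count(axis=1) >= 2]
```

Rows 0 and 1 survive (three and two observations respectively), while rows 2 and 3 are dropped.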


### 5.1.2 Filling In Missing Data 

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the **fillna** method is the workhorse function to use. Calling **fillna** with a constant replaces missing values with that value:

In [16]:
df.fillna(0)

          0         1         2
0 -0.591295  0.000000  0.000000
1 -0.790926  0.000000  0.000000
2  0.067239  0.000000 -0.262800
3  0.225386  0.000000 -0.287232
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526


Calling **fillna** with a dict, you can use a different fill value for each column:

In [17]:
df

          0         1         2
0 -0.591295       NaN       NaN
1 -0.790926       NaN       NaN
2  0.067239       NaN -0.262800
3  0.225386       NaN -0.287232
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526


In [18]:
df.fillna({1: 0.5, 2: 0}) 

          0         1         2
0 -0.591295  0.500000  0.000000
1 -0.790926  0.500000  0.000000
2  0.067239  0.500000 -0.262800
3  0.225386  0.500000 -0.287232
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526


**fillna** returns a new object, but you can modify the existing object *in-place*:

In [19]:
df.fillna(0, inplace=True)
df

          0         1         2
0 -0.591295  0.000000  0.000000
1 -0.790926  0.000000  0.000000
2  0.067239  0.000000 -0.262800
3  0.225386  0.000000 -0.287232
4 -0.383468  0.671251 -0.537670
5 -0.307005 -1.880130 -1.832381
6  0.754017 -0.147614  0.279526


The same interpolation methods available for reindexing can be used with **fillna**:

In [20]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df 

          0         1         2
0  1.247679 -0.260224 -0.202078
1 -0.671989  1.671397  0.461024
2  1.676363       NaN  0.864718
3 -0.425799       NaN -1.688526
4  1.107944       NaN       NaN
5  0.705568       NaN       NaN


In [21]:
df.fillna(method='ffill')

          0         1         2
0  1.247679 -0.260224 -0.202078
1 -0.671989  1.671397  0.461024
2  1.676363  1.671397  0.864718
3 -0.425799  1.671397 -1.688526
4  1.107944  1.671397 -1.688526
5  0.705568  1.671397 -1.688526


In [22]:
df.fillna(method='ffill', limit=2)

          0         1         2
0  1.247679 -0.260224 -0.202078
1 -0.671989  1.671397  0.461024
2  1.676363  1.671397  0.864718
3 -0.425799  1.671397 -1.688526
4  1.107944       NaN -1.688526
5  0.705568       NaN -1.688526
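Recent pandas versions emit a deprecation warning for the `method` keyword of **fillna**. The dedicated **ffill** and **bfill** methods do the same job and accept the same `limit` argument. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan],
                      'y': [np.nan, 2.0, np.nan, 4.0]})

# Forward fill: propagate the last valid observation downward
filled = frame.ffill()

# Fill at most one consecutive NA after each valid value
limited = frame.ffill(limit=1)

# Backward fill: use the next valid observation instead
back = frame.bfill()
```

Note that forward filling cannot fill a leading NA (the first value of `y` stays missing), and backward filling cannot fill a trailing one.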


With **fillna** you can do lots of other things with a little creativity. For example, you might pass the *mean* or *median* value of a Series:

In [23]:
data = pd.Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [24]:
data.fillna(data.mean()) 

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
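The same idea extends to a DataFrame. As a sketch (the `frame` data and column names are illustrative), passing the Series of column means to **fillna** fills each column with its own mean, because the fill values are aligned on column labels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'height': [150.0, np.nan, 170.0],
                      'weight': [np.nan, 60.0, 80.0]})

# frame.mean() is a Series indexed by column name, so each column
# receives its own fill value
filled = frame.fillna(frame.mean())
```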

See Table 5.2 for a reference on **fillna**.

<br>
<center>Table 5.2:  fillna function arguments </center>
<img src="Table5.2.jpg">

Further reading on missing data in pandas: https://chrisalbon.com/python/data_wrangling/pandas_missing_data/

## 5.2 Data Transformation 

So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning, and other transformations are another class of important operations. 

### 5.2.1 Removing Duplicates 

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [None]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
data

The DataFrame method **duplicated** returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [None]:
data.duplicated() 

Relatedly, **drop_duplicates** returns a DataFrame containing only the rows for which the **duplicated** array is False:

In [None]:
data.drop_duplicates() 

Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [None]:
data['v1'] = range(7)
data

In [None]:
data.drop_duplicates(['k1']) 

**duplicated** and **drop_duplicates** by default keep the first observed value combination. Passing *keep='last'* will return the last one:

In [None]:
data.drop_duplicates(['k1', 'k2'], keep='last') 
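The duplicate-handling calls in this section can be checked with a short self-contained sketch. It reuses the example data from this section; the `keep=False` option, which flags every member of a duplicated group, is an addition not discussed in the text.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Only the final row repeats an earlier (k1, k2) combination
dup_mask = data.duplicated(['k1', 'k2'])

# keep='first' (the default) retains index 5; keep='last' retains index 6
first_kept = data.drop_duplicates(['k1', 'k2']).index.tolist()
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last').index.tolist()

# keep=False marks every row that belongs to a duplicated group
all_dups = data[data.duplicated(['k1', 'k2'], keep=False)].index.tolist()
```

The choice between `keep='first'` and `keep='last'` matters once other columns (here `v1`) differ between the duplicated rows.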

### 5.2.2 Transforming Data Using a Function or Mapping 

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of vegetables:

In [None]:
data = pd.DataFrame({'vege': ['cabbage', 'carrot', 'cabbage','Lettuce', 'potato', 'Cabbage', 
                             'lettuce', 'tomato', 'cucumber'],  'weight': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Suppose you wanted to add a column indicating the seller that each vegetable came from. Let’s write down a mapping of each distinct vegetable type to its seller:

In [None]:
vege_to_seller = {'cabbage': 'David',  'carrot': 'David',  'lettuce': 'Ahmad',  'potato': 'Ahmad', 
                  'tomato': 'David',  'cucumber': 'Tina'}

The **map** method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the vegetables are capitalized and others are not. Thus, we need to convert each value to lowercase using the **str.lower** Series method:

In [None]:
lowercased = data['vege'].str.lower()
lowercased

In [None]:
data['seller'] = lowercased.map(vege_to_seller)
data

We could also have passed a function that does all the work:

In [None]:
data['vege'].map(lambda x: vege_to_seller[x.lower()]) 

Using **map** is a convenient way to perform element-wise transformations and other data cleaning–related operations. 
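One difference between the two approaches is worth a sketch: with a dict, **map** turns values that have no key into NaN, whereas a function that indexes the dict directly would raise a KeyError. The shortened mapping, the extra `'radish'` entry, and the `'unknown'` default below are made up for illustration.

```python
import pandas as pd

vege_to_seller = {'cabbage': 'David', 'carrot': 'David', 'lettuce': 'Ahmad'}
vege = pd.Series(['Cabbage', 'carrot', 'Lettuce', 'radish'])  # 'radish' has no seller

# Mapping with a dict: values without a matching key become NaN
by_dict = vege.str.lower().map(vege_to_seller)

# Mapping with a function: dict.get supplies a default instead of the
# KeyError that vege_to_seller[x.lower()] would raise for 'radish'
by_func = vege.map(lambda x: vege_to_seller.get(x.lower(), 'unknown'))
```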

### 5.2.3 Replacing Values 

Filling in missing data with the **fillna** method is a special case of more general value replacement. As you’ve already seen, **map** can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so. Let’s consider this Series:

In [None]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True):

In [None]:
data.replace(-999, np.nan) 

If you want to replace multiple values at once, you instead pass a list and then the substitute value:

In [None]:
data.replace([-999, -1000], np.nan) 

To use a different replacement for each value, pass a list of substitutes:

In [None]:
data.replace([-999, -1000], [np.nan, 0]) 

The argument passed can also be a dict:

In [None]:
data.replace({-999: np.nan, -1000: 0})
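As a quick check of the **replace** variants described here, the following sketch reuses the example Series and confirms that the original object is left untouched:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# One substitute for several values
both_na = data.replace([-999, -1000], np.nan)

# A different substitute per value, given as a dict
mixed = data.replace({-999: np.nan, -1000: 0})

# replace returns a new Series; the original still holds its sentinels
n_sentinels_left = (data == -999).sum()
```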

The **data.replace** method is distinct from **data.str.replace**, which performs string substitution element-wise. We look at these string methods on Series later in the chapter.

### 5.2.4 Renaming Axis Indexes 

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here’s a simple example:

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)), index=['Ohio', 'Colorado', 'New York'], 
                    columns=['one', 'two', 'three', 'four']) 
data

Like a Series, the axis indexes have a **map** method:

In [None]:
transform = lambda x: x[:4].upper()
data.index.map(transform) 

You can assign to **index**, modifying the DataFrame in-place:

In [None]:
data.index = data.index.map(transform)
data

If you want to create a transformed version of a dataset without modifying the original, a useful method is **rename**:

In [None]:
data.rename(index=str.title, columns=str.upper) 

Notably, **rename** can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [None]:
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'}) 

**rename** saves you from the chore of copying the DataFrame manually and assigning to its **index** and **columns** attributes. Should you wish to modify a dataset in-place, pass *inplace=True*:

In [None]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
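A short sketch of **rename** on a fresh copy of the example data shows that a dict and a function can be mixed across axes, that labels missing from the dict are left alone, and that the original DataFrame is unchanged unless `inplace=True` is passed:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A dict relabels only the listed index entries; a function is applied
# to every column label
renamed = data.rename(index={'Ohio': 'Indiana'}, columns=str.upper)
```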

### 5.2.5 Discretization and Binning 

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [26]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32] 

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you can use **cut**, a function in pandas:

In [27]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special **Categorical** object. The output you see describes the bins computed by **pandas.cut**. You can treat it like an array of strings indicating the bin name; internally it contains a **categories** array specifying the distinct category names along with a labeling for the **ages** data in the **codes** attribute:

In [28]:
cats.codes 

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [29]:
cats.categories 

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [None]:
pd.value_counts(cats) 

Note that **pd.value_counts(cats)** gives the bin counts for the result of **pandas.cut**. 

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

In [None]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False) 

You can also pass your own bin names by passing a list or array to the **labels** option:

In [None]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names) 
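The binning options above can be compared in one self-contained sketch. It reuses the `ages`, `bins`, and `group_names` from this section and counts the members of each bin through a Series, which is one way to get the counts without the top-level `pd.value_counts` function:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Right-closed bins (the default): 25 falls in (18, 25]
labeled = pd.cut(ages, bins, labels=group_names)

# Left-closed bins with shifted edges give the same grouping for integer ages
left_closed = pd.cut(ages, [18, 26, 36, 61, 100], right=False,
                     labels=group_names)

# Count members per bin; sort=False skips sorting by count
counts = pd.Series(labeled).value_counts(sort=False)
```

For these ages the two binnings assign identical codes, since shifting each integer edge up by one and closing the left side describes the same groups.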

If you pass an integer number of bins to **cut** instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:

In [30]:
data = np.random.rand(20)
data

array([0.63457051, 0.73149253, 0.60915267, 0.11122163, 0.10640604,
       0.17748612, 0.62583029, 0.6118654 , 0.98467176, 0.86285701,
       0.5859921 , 0.31273611, 0.91055582, 0.69239441, 0.32955659,
       0.19306209, 0.77887661, 0.78067036, 0.13684622, 0.13832606])

In [31]:
data.ndim

1

In [32]:
pd.cut(data, 4, precision=2) 

[(0.55, 0.77], (0.55, 0.77], (0.55, 0.77], (0.11, 0.33], (0.11, 0.33], ..., (0.11, 0.33], (0.77, 0.98], (0.77, 0.98], (0.11, 0.33], (0.11, 0.33]]
Length: 20
Categories (4, interval[float64]): [(0.11, 0.33] < (0.33, 0.55] < (0.55, 0.77] < (0.77, 0.98]]

In [38]:
pd.cut(data, 4, precision=2).categories

IntervalIndex([(0.11, 0.33], (0.33, 0.55], (0.55, 0.77], (0.77, 0.98]],
              closed='right',
              dtype='interval[float64]')

In [37]:
pd.cut(data, 4, precision=2).codes

array([2, 2, 2, 0, 0, 0, 2, 2, 3, 3, 2, 0, 3, 2, 1, 0, 3, 3, 0, 0],
      dtype=int8)

The *precision=2* option limits the decimal precision to two digits. 

A closely related function, **qcut**, bins the data based on sample quantiles. Depending on the distribution of the data, using **cut** will not usually result in each bin having the same number of data points. Since **qcut** uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [39]:
data = np.random.randn(1000)  # Normally distributed
data

array([ 8.95656115e-01, -6.43254915e-01, -4.36890098e-01,  3.33642408e-01,
       -1.29278186e+00, -2.49689119e-01,  8.13193271e-01, -3.93717881e-01,
        5.34532209e-01,  4.22221451e-01, -6.36939202e-01, -1.52093672e+00,
        5.21215555e-01,  1.66849981e-01, -1.20244001e+00,  1.53232324e+00,
       -1.67072505e+00, -3.21110719e-01, -7.93003338e-01,  1.81420350e-01,
       -2.19626713e-01, -3.23729545e-01,  3.55135141e-01,  2.41128637e-01,
       -4.78980427e-01,  1.35657339e+00,  8.54556024e-01,  7.66264488e-01,
       -1.84095322e-01,  2.26290855e+00,  1.06760141e+00, -1.34414053e+00,
       -8.43844095e-01,  5.54300805e-01, -7.49051335e-01, -6.03127991e-01,
        2.61507037e-01, -9.89506581e-01, -8.32118079e-01,  2.10444364e-01,
       -5.85222584e-01, -1.79225850e+00, -3.16365863e+00,  7.50421457e-01,
       -3.86005244e-01, -6.62852835e-01,  5.77321695e-01, -5.76615716e-02,
       -1.29682377e-01,  1.18108230e+00, -2.99209068e-01,  5.22353415e-01,
        6.65453108e-01,  

In [40]:
cats = pd.qcut(data, 4)  # Cut into quartiles
cats

[(0.665, 2.971], (-0.647, 0.0133], (-0.647, 0.0133], (0.0133, 0.665], (-3.165, -0.647], ..., (0.0133, 0.665], (0.665, 2.971], (-0.647, 0.0133], (0.665, 2.971], (0.665, 2.971]]
Length: 1000
Categories (4, interval[float64]): [(-3.165, -0.647] < (-0.647, 0.0133] < (0.0133, 0.665] < (0.665, 2.971]]

In [41]:
pd.value_counts(cats) 

(0.665, 2.971]      250
(0.0133, 0.665]     250
(-0.647, 0.0133]    250
(-3.165, -0.647]    250
dtype: int64

Similar to **cut**, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [42]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]) 

[(0.0133, 1.329], (-1.294, 0.0133], (-1.294, 0.0133], (0.0133, 1.329], (-1.294, 0.0133], ..., (0.0133, 1.329], (0.0133, 1.329], (-1.294, 0.0133], (0.0133, 1.329], (1.329, 2.971]]
Length: 1000
Categories (4, interval[float64]): [(-3.165, -1.294] < (-1.294, 0.0133] < (0.0133, 1.329] < (1.329, 2.971]]
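To make the contrast between **cut** and **qcut** concrete, here is a sketch on a seeded random sample (the seed and the variable names are arbitrary choices for illustration). Equal-width bins produce uneven counts on normally distributed data, while quartile bins split the sample evenly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the sketch is repeatable
sample = rng.standard_normal(1000)

# cut: four equal-width bins, so counts follow the shape of the distribution
width_counts = pd.Series(pd.cut(sample, 4)).value_counts()

# qcut: four bins bounded by the sample quartiles, so counts are equal
quartile_counts = pd.Series(pd.qcut(sample, 4)).value_counts()
```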

## 5.3  String Manipulation 

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data. 


### 5.3.1 String Object Methods 

In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with **split**:

In [43]:
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

**split** is often combined with **strip** to trim whitespace (including line breaks):

In [44]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [45]:
first, second, third = pieces
first + '::' + second + '::' + third 

'a::b::guido'

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the **join** method on the string '::':

In [None]:
'::'.join(pieces) 

Other methods are concerned with locating substrings. Using Python’s **in** keyword is the best way to detect a substring, though **index** and **find** can also be used:

In [46]:
'guido' in val 

True

In [47]:
val.index(',') 

1

In [48]:
val.find(':') 

-1

Note the difference between **find** and **index** is that index raises an exception if the string isn’t found (versus returning –1):

In [49]:
val.index(':') 

ValueError: substring not found

Relatedly, **count** returns the number of occurrences of a particular substring:

In [50]:
val.count(',') 

2

**replace** will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:


In [51]:
val.replace(',', '::') 

'a::b::  guido'

In [52]:
val.replace(',', '') 

'ab  guido'

See Table 5.3 for a listing of some of Python’s string methods. Regular expressions can also be used with many of these operations, as you’ll see.

<br>
<center>Table 5.3: Python built-in string methods</center>
<img src="Table5.3.jpg">

### 5.3.2  Regular Expressions 

*Regular expressions* provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a *regex*, is a string formed according to the regular expression language. Python’s built-in **re** module is responsible for applying regular expressions to strings.

The **re** module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a *regex* describes a pattern to locate in the text, which can then be used for many purposes. Let’s look at a simple example:

Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The *regex* describing one or more whitespace characters is **\s+**:

In [53]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text) 

['foo', 'bar', 'baz', 'qux']

The pattern is written as a raw string (r'...') so that Python does not try to interpret the backslash as an escape character. When you call **re.split(r'\s+', text)**, the regular expression is first compiled, and then its **split** method is called on the passed text. You can compile the *regex* yourself with **re.compile**, forming a reusable *regex* object:

In [55]:
regex = re.compile(r'\s+')
regex.split(text) 

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the *regex*, you can use the **findall** method:

In [56]:
regex.findall(text) 

['    ', '\t ', '  \t']

**match** and **search** are closely related to **findall**. While **findall** returns all matches in a string, **search** returns only the first match. More rigidly, **match** only matches at the beginning of the string. As a less trivial example, let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [57]:
text = """Dave dave@google.com 
Steve steve@gmail.com 
Rob rob@gmail.com 
Ryan ryan@yahoo.com 
""" 
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive 
regex = re.compile(pattern, flags=re.IGNORECASE) 

Using **findall** on the text produces a list of the email addresses:

In [58]:
regex.findall(text) 

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

**search** returns a special match object for the first email address in the text. For the preceding *regex*, the match object can only tell us the start and end position of the pattern in the string:

In [59]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [60]:
text[m.start():m.end()] 

'dave@google.com'

**regex.match** returns None, as it only will match if the pattern occurs at the start of the string:

In [61]:
print(regex.match(text)) 

None


Relatedly, **sub** will return a new string with occurrences of the pattern replaced by a new string:

In [62]:
print(regex.sub('REDACTED', text)) 

Dave REDACTED 
Steve REDACTED 
Rob REDACTED 
Ryan REDACTED 



Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [63]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE) 

A match object produced by this modified *regex* returns a tuple of the pattern components with its **groups** method:

In [64]:
m = regex.match('wesm@bright.net')
m.groups() 

('wesm', 'bright', 'net')

**findall** returns a list of tuples when the pattern has groups:

In [65]:
regex.findall(text) 

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

**sub** also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

In [66]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)) 

Dave Username: dave, Domain: google, Suffix: com 
Steve Username: steve, Domain: gmail, Suffix: com 
Rob Username: rob, Domain: gmail, Suffix: com 
Ryan Username: ryan, Domain: yahoo, Suffix: com 



There is much more to regular expressions in Python, most of which is outside our scope. Table 5.4 provides a brief summary.

<br>
<center>Table 5.4: Regular expression methods</center>
<img src="Table5.4.jpg">

### 5.3.3 Vectorized String Functions in pandas 

Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

In [67]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [68]:
data.isnull() 

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a **lambda** or other function) to each value using **data.map**, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s **str** attribute; for example, we could check whether each email address has 'gmail' in it with **str.contains**:

In [69]:
data.str.contains('gmail') 

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
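Because the result above contains NaN, it cannot be used directly as a boolean mask. A minimal sketch of one way around this is the `na` argument of **str.contains**, which sets the result for missing entries:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False treats missing entries as non-matches, giving a pure boolean mask
is_gmail = data.str.contains('gmail', na=False)

# The mask can then be used for boolean indexing
gmail_users = data[is_gmail]
```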

Regular expressions can be used, too, along with any **re** options like **IGNORECASE**:

In [70]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [71]:
data.str.findall(pattern, flags=re.IGNORECASE) 

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval. Either use **str.get** or index into the **str** attribute. Note that **str.match** returns a boolean Series in current pandas versions, so the matched groups are taken here from the first element of each **str.findall** result:

In [76]:
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object

To access elements in the embedded tuples, we can pass an index to either of these functions:

In [74]:
matches.str.get(1) 

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object

In [75]:
matches.str[0] 

Dave      dave
Steve    steve
Rob        rob
Wes        NaN
dtype: object

You can similarly slice strings using this syntax:

In [None]:
data.str[:5] 
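When the goal is to pull the captured groups of a regular expression into separate columns, the **str.extract** method is another option. The following is a sketch using the email pattern from earlier in this section; the column names assigned at the end are illustrative only.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing input are all NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['username', 'domain', 'suffix']  # illustrative names
```

The result is a DataFrame indexed like the original Series, so it can be joined back to the source data without extra bookkeeping.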

See Table 5.5 for more pandas string methods.

<br>
<center>Table 5.5: Partial listing of vectorized string methods</center>
<img src="Table5.5.jpg">