
Corizo homework

In [351]:
import pandas as pd
import numpy as np
from skimage.io import imread
import matplotlib.pyplot as plt
from numpy import nan as NA

In [352]:
string_data = pd.Series(['mango','pinaple', np.nan,'avacado'])
string_data

0      mango
1    pinaple
2        NaN
3    avacado
dtype: object

In [353]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [354]:
string_data[0] = None
string_data

0       None
1    pinaple
2        NaN
3    avacado
dtype: object

In [355]:
string_data.isnull()


0     True
1    False
2     True
3    False
dtype: bool

Filtering Out Missing Data

In [356]:
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [357]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
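
For a Series, dropna gives the same result as boolean indexing with notnull. The following is a minimal sketch of that equivalence, reusing the values from the cell above; the names mask and filtered are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Boolean mask that is True wherever a value is present
mask = data.notnull()

# Selecting with the mask keeps the same rows that dropna() keeps
filtered = data[mask]
print(filtered.equals(data.dropna()))
```

Either form returns a new Series and leaves data unchanged.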

With DataFrame objects, you may want to drop rows or columns that are all NA, or only those containing any NAs. By default, dropna drops any row containing a missing value:

In [358]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [359]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' drops only the rows that are all NA:

In [360]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis=1:

In [361]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [362]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:

In [363]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,1.9801,,
1,-2.449962,,
2,-0.921879,,1.395295
3,0.125828,,-0.238444
4,-2.65441,0.732904,-2.388083
5,1.135724,-0.644844,-0.232129
6,0.0035,-0.915199,-0.485604


In [364]:
df.dropna()

Unnamed: 0,0,1,2
4,-2.65441,0.732904,-2.388083
5,1.135724,-0.644844,-0.232129
6,0.0035,-0.915199,-0.485604


In [365]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,-0.921879,,1.395295
3,0.125828,,-0.238444
4,-2.65441,0.732904,-2.388083
5,1.135724,-0.644844,-0.232129
6,0.0035,-0.915199,-0.485604
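
The thresh argument counts non-NA values per row. As a rough sketch of what that means, the same filter can be written by hand with notna and sum. The fixed values below are an assumption made for reproducibility (the cell above uses random data), and the names by_thresh and by_count are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed illustrative values instead of random ones (assumption for reproducibility)
df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                   1: [np.nan, np.nan, np.nan, 0.5],
                   2: [np.nan, np.nan, 1.5, 2.5]})

# thresh=2 keeps rows with at least two non-NA values
by_thresh = df.dropna(thresh=2)

# The same filter written out by counting non-NA values per row
non_na_counts = df.notna().sum(axis=1)
by_count = df[non_na_counts >= 2]

print(by_thresh.equals(by_count))
```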


Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [366]:
df.fillna(0)

Unnamed: 0,0,1,2
0,1.9801,0.0,0.0
1,-2.449962,0.0,0.0
2,-0.921879,0.0,1.395295
3,0.125828,0.0,-0.238444
4,-2.65441,0.732904,-2.388083
5,1.135724,-0.644844,-0.232129
6,0.0035,-0.915199,-0.485604


Calling fillna with a dict lets you use a different fill value for each column:

In [367]:
df.fillna({1:0.5,2:0})

Unnamed: 0,0,1,2
0,1.9801,0.5,0.0
1,-2.449962,0.5,0.0
2,-0.921879,0.5,1.395295
3,0.125828,0.5,-0.238444
4,-2.65441,0.732904,-2.388083
5,1.135724,-0.644844,-0.232129
6,0.0035,-0.915199,-0.485604


Note: fillna returns a new object by default, but you can modify the existing object in place by passing inplace=True:

In [368]:
_= df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,1.9801,0.0,0.0
1,-2.449962,0.0,0.0
2,-0.921879,0.0,1.395295
3,0.125828,0.0,-0.238444
4,-2.65441,0.732904,-2.388083
5,1.135724,-0.644844,-0.232129
6,0.0035,-0.915199,-0.485604


The same interpolation methods available for reindexing can be used with fillna:

In [369]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df


Unnamed: 0,0,1,2
0,0.909229,-1.142957,0.321532
1,1.585069,-0.141547,-0.795565
2,-0.525484,,1.410556
3,0.077096,,-0.071084
4,0.043646,,
5,0.820078,,


In [370]:
df.fillna(method="ffill")

Unnamed: 0,0,1,2
0,0.909229,-1.142957,0.321532
1,1.585069,-0.141547,-0.795565
2,-0.525484,-0.141547,1.410556
3,0.077096,-0.141547,-0.071084
4,0.043646,-0.141547,-0.071084
5,0.820078,-0.141547,-0.071084


In [371]:
df.fillna(method='ffill', limit =2)

Unnamed: 0,0,1,2
0,0.909229,-1.142957,0.321532
1,1.585069,-0.141547,-0.795565
2,-0.525484,-0.141547,1.410556
3,0.077096,-0.141547,-0.071084
4,0.043646,,-0.071084
5,0.820078,,-0.071084


With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:

In [372]:
data = pd.Series([1., NA, 3.5, NA,7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
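
The same idea extends to a DataFrame. As a sketch under the assumption that a per-column mean is an acceptable fill value, passing the Series returned by mean fills each column with its own mean; the frame and its column names 'a' and 'b' are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 4.0, 8.0]})

# frame.mean() is a Series indexed by column label; fillna aligns on those labels,
# so each column is filled with its own mean
filled = frame.fillna(frame.mean())
print(filled)
```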

**Data Transformation**
Removing Duplicates

In [373]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [374]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:

In [375]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:

In [376]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:

In [377]:
data.drop_duplicates(['k1', 'k2'], keep = 'last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
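
To see which rows each keep setting flags, the sketch below compares duplicated under the default, keep='last', and keep=False on the same k1/k2 data (without the v1 column). The variable names first, last, and both are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Default: the first occurrence is kept, so only the later copy (row 6) is flagged
first = data.duplicated()

# keep='last': the last occurrence is kept, so the earlier copy (row 5) is flagged
last = data.duplicated(keep='last')

# keep=False: every member of a duplicated group is flagged
both = data.duplicated(keep=False)

print(pd.DataFrame({'first': first, 'last': last, 'both': both}))
```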


## **Transforming Data Using a Function or Mapping**
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [378]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [379]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon',
}


In [380]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [381]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon



Using map is a convenient way to perform element-wise transformations and other data cleaning–related operations.
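
map also accepts a function, so the lowercasing and the lookup can be done in one step. This is a minimal sketch on a shortened version of the food data; the three-row frame and the name animals are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'nova lox'],
                     'ounces': [4, 6, 6]})

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# Lowercase each value inside the function, then look it up in the dict
animals = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(animals.tolist())
```

Note that a direct dict lookup raises KeyError for a food that is missing from the mapping, whereas passing the dict itself to map yields NaN for such values.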

## **Replacing Values**
Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let’s consider this Series:

In [382]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True):

In [383]:
data.replace(-999, NA)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of them followed by the substitute value:

In [384]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [385]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [386]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The data.replace method is distinct from data.str.replace, which performs string substitution element-wise.
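
The difference is easiest to see on string data. In this sketch (the Series s and its values are made up for illustration), replace matches whole values while str.replace substitutes inside each string:

```python
import pandas as pd

s = pd.Series(['cat', 'concat', 'dog'])

# Series.replace matches whole values: only the element equal to 'cat' changes
whole = s.replace('cat', 'bird')

# Series.str.replace substitutes inside each string: 'concat' is affected too
inside = s.str.replace('cat', 'bird', regex=False)

print(whole.tolist())
print(inside.tolist())
```
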
## **Renaming Axis Indexes**
Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure. Here’s a simple example:

In [387]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


Like a Series, the axis indexes have a map method:

In [388]:
transform = lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to index, modifying the DataFrame in place:

In [389]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

In [390]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [391]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})


Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


rename saves you from the chore of copying the DataFrame manually and assigning to its index and columns attributes. Should you wish to modify a dataset in-place, pass inplace=True:

In [392]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


## **Discretization and Binning**
Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [393]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas:

In [394]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [395]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

Note that pd.value_counts(cats) are the bin counts for the result of pandas.cut. Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

In [396]:
pd.cut(ages,[18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
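
A quick way to check which side is closed is to bin values that sit exactly on an edge. The sketch below uses only two edge values and a shortened bin list, both chosen for illustration, and inspects the integer codes of the resulting categorical (a code of -1 marks a value that falls in no bin).

```python
import pandas as pd

ages = [25, 35]

# right=True (the default): bins are (18, 25] and (25, 35],
# so 25 lands in the first bin and 35 in the second
closed_right = pd.cut(ages, [18, 25, 35])

# right=False: bins are [18, 25) and [25, 35),
# so 25 lands in the second bin and 35 falls outside every bin
closed_left = pd.cut(ages, [18, 25, 35], right=False)

print(list(closed_right.codes))
print(list(closed_left.codes))
```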

You can also pass your own bin names by passing a list or array to the labels option:

In [397]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:

In [398]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.27, 0.49], (0.038, 0.27], (0.038, 0.27], (0.49, 0.72], (0.038, 0.27], ..., (0.038, 0.27], (0.72, 0.95], (0.72, 0.95], (0.72, 0.95], (0.038, 0.27]]
Length: 20
Categories (4, interval[float64, right]): [(0.038, 0.27] < (0.27, 0.49] < (0.49, 0.72] <
                                           (0.72, 0.95]]

The precision=2 option limits the decimal precision to two digits.

A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [399]:
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles

cats

[(-2.957, -0.672], (-2.957, -0.672], (-0.672, 0.035], (0.035, 0.66], (-0.672, 0.035], ..., (-2.957, -0.672], (-0.672, 0.035], (-2.957, -0.672], (0.035, 0.66], (0.035, 0.66]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.957, -0.672] < (-0.672, 0.035] < (0.035, 0.66] <
                                           (0.66, 3.384]]

In [400]:
pd.value_counts(cats)

(-2.957, -0.672]    250
(-0.672, 0.035]     250
(0.035, 0.66]       250
(0.66, 3.384]       250
dtype: int64
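
To make the contrast with cut concrete, this sketch bins the same normally distributed sample both ways and compares the counts. The seed, the use of np.random.default_rng, and the variable names are assumptions for illustration; exact counts for cut depend on the sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
values = rng.standard_normal(1000)

# Equal-width bins: counts differ because normal data clusters near zero
width_counts = pd.Series(pd.cut(values, 4)).value_counts(sort=False)

# Quantile-based bins: each quartile holds roughly the same number of points
quantile_counts = pd.Series(pd.qcut(values, 4)).value_counts(sort=False)

print(width_counts.tolist())
print(quantile_counts.tolist())
```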

In [401]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.217, 0.035], (-2.957, -1.217], (-1.217, 0.035], (0.035, 1.302], (-1.217, 0.035], ..., (-1.217, 0.035], (-1.217, 0.035], (-1.217, 0.035], (0.035, 1.302], (0.035, 1.302]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.957, -1.217] < (-1.217, 0.035] < (0.035, 1.302] <
                                           (1.302, 3.384]]

## **Detecting and Filtering Outliers**
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [402]:
data = pd.DataFrame(np.random.randn(1000, 4))
data

Unnamed: 0,0,1,2,3
0,-0.260373,-0.592345,-1.830452,0.221563
1,0.264375,0.092516,0.881996,-0.669380
2,-1.443584,-1.029656,0.945038,-0.721456
3,-1.138380,1.765959,0.348759,0.495520
4,1.205638,-0.159268,0.710223,-1.017971
...,...,...,...,...
995,0.577878,0.843092,-2.074228,1.158913
996,1.457142,0.316787,0.534642,0.234682
997,1.007660,-1.045925,0.817396,-0.501123
998,0.613161,-0.214058,1.066708,-0.861148


In [403]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.018004,-0.041402,-0.008347,-0.00627
std,0.968497,1.03049,0.978804,0.997992
min,-3.675757,-3.313355,-3.417202,-2.929583
25%,-0.634343,-0.743031,-0.683797,-0.699204
50%,0.068339,-0.066309,0.008767,-0.06094
75%,0.664486,0.687075,0.696085,0.701341
max,2.664516,3.04892,3.202124,3.294393


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [404]:
col = data[2]
col[np.abs(col) > 3]

48    -3.417202
105    3.202124
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame:

In [405]:
data[(np.abs(data) > 3).any(axis=1)]


Unnamed: 0,0,1,2,3
38,-3.102162,-0.260064,-0.224476,-2.382982
48,-1.394008,-2.204325,-3.417202,0.479806
105,0.026561,-1.959987,3.202124,-0.599973
144,-0.618877,1.127841,0.747692,3.294393
347,-3.675757,-0.530284,0.755287,0.69865
535,1.926606,-3.313355,0.770462,0.585955
666,0.16023,3.04892,0.727615,-0.399965


Values can be set based on these criteria. Here is code to cap values outside the interval –3 to 3:

In [406]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.018782,-0.041138,-0.008132,-0.006565
std,0.965825,1.029395,0.976793,0.99706
min,-3.0,-3.0,-3.0,-2.929583
25%,-0.634343,-0.743031,-0.683797,-0.699204
50%,0.068339,-0.066309,0.008767,-0.06094
75%,0.664486,0.687075,0.696085,0.701341
max,2.664516,3.0,3.0,3.0


The statement np.sign(data) produces 1 and –1 values based on whether the values in data are positive or negative:

In [407]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,-1.0,-1.0,1.0
1,1.0,1.0,1.0,-1.0
2,-1.0,-1.0,1.0,-1.0
3,-1.0,1.0,1.0,1.0
4,1.0,-1.0,1.0,-1.0
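
As an alternative to the sign-based assignment, the clip method expresses the same capping directly. This is a sketch rather than the approach used above; the seed and the names capped and clipped are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)          # fixed seed, chosen arbitrarily
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Capping with the sign trick used above, applied to a copy
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# The same capping expressed with clip
clipped = data.clip(lower=-3, upper=3)

print(np.allclose(capped, clipped))
```

clip returns a new DataFrame, so the original data is left untouched unless you assign the result back.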


# **String Manipulation**
Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

## **String Object Methods**
In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with split:

In [408]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

In [409]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

The pieces can be concatenated with a two-colon delimiter using repeated addition, but a faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':

In [410]:
'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python’s in keyword is the best way to detect a substring, though index and find can also be used:

In [411]:
'guido' in val

True

In [412]:
val.index(',')

1

In [413]:
val.find(':')

-1

Note the difference between find and index is that index raises an exception if the string isn’t found (versus returning –1):

In [414]:
val.index(':')

ValueError: substring not found
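
In practice this means the two methods call for different handling. A minimal sketch, with position and found as illustrative names:

```python
val = 'a,b, guido'

# find signals "not found" with -1, so the result can be tested directly
position = val.find(':')

# index signals "not found" by raising ValueError, so it needs a try/except
try:
    val.index(':')
    found = True
except ValueError:
    found = False

print(position, found)
```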

Relatedly, count returns the number of occurrences of a particular substring:

In [415]:
val.count(',')

2

replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:

In [416]:
val.replace(',', '::')

'a::b:: guido'

In [417]:
val.replace(',', '')

'ab guido'

## **Regular Expressions**
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings; I’ll give a number of examples of its use here.

The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. Let’s look at a simple example:

Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:

In [419]:
import re

In [420]:
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. The r prefix makes the pattern a raw string, so the backslash in \s reaches the regex engine unchanged. You can compile the regex yourself with re.compile, forming a reusable regex object:

In [423]:
regex = re.compile(r'\s+')

In [424]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the findall method:

In [425]:
regex.findall(text)

[' ', '\t ', ' \t']

NOTE: Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.

match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string. As a less trivial example, let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [426]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using findall on the text produces a list of the email addresses:

In [427]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [428]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [429]:
text[m.start():m.end()]

'dave@google.com'
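
If you need the parts of each address rather than just its position, one option is to wrap segments of the pattern in parentheses so they become capture groups. The sketch below is an assumption about how that might look: it reuses the pattern above with three groups added, and the address 'wesm@bright.net' is a made-up example.

```python
import re

# Same pattern as above, with parentheses around the username, domain, and suffix
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesm@bright.net')   # hypothetical address for illustration

# groups() returns a tuple of the captured pieces
print(m.groups())
```

With groups in the pattern, findall returns a list of tuples instead of a list of whole matches.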

regex.match returns None, as it only will match if the pattern occurs at the start of the string:

In [430]:
print(regex.match(text))

None


Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

In [431]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED

