In [1]:
import numpy as np
import pandas as pd

# Section 11: Regex and Text Manipulation

Python and Pandas have a lot to offer when it comes to extracting information from text and manipulating text. In this section we will cover:
* a detailed overview of Python string methods
* the Pandas `.str` family of methods
* advanced splits and replacements in Pandas
* hands-on introduction to RegEx
  * character sets, anchors, metasequences, quantifiers and more!

We'll get some hands-on practice with these tools using a Boston marathon dataset.

## Our data: Boston Marathon Runners


For this section we'll be working with a dataset for Boston marathon participants.

https://andybek.com/pandas-marathon

In [2]:
boston_url = 'https://andybek.com/pandas-marathon'

In [3]:
boston = pd.read_csv(boston_url)

In [4]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


In [5]:
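boston.info  # note: without parentheses this displays the bound method's repr; boston.info() prints the column summary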

<bound method DataFrame.info of                     Name  Age M/F  ... Overall Gender  Years Ran
0       Kirui, Geoffrey    24   M  ...       1      1        NaN
1         Rupp, Galen      30   M  ...       2      2        NaN
2        Osako, Suguru     25   M  ...       3      3        NaN
3       Biwott, Shadrack   32   M  ...       4      4        NaN
4         Chebet, Wilson   31   M  ...       5      5       2015
..                   ...  ...  ..  ...     ...    ...        ...
995         Larosa, Mark   38   M  ...     996    940  2015:2016
996  Williamson, Jerry A   43   M  ...     997    941       2015
997      Mccue, Daniel T   40   M  ...     998    942        NaN
998         Larosa, John   35   M  ...     999    943        NaN
999       Sanchez, Sam R   35   M  ...    1000    944        NaN

[1000 rows x 10 columns]>

We have a dataset of 10 columns that include the name, age, and gender of each runner for the year 2017. Most of the fields are strings, including "Official Time", which is text-based. This gives us plenty of text data to play around with in this section.

## String Methods in Python

We'll start by playing around with pure text in Python. We will cover the following concepts:
* `len`
* `center`
* `startswith` and `endswith`
* the `in` operator
* list comprehension with strings

Link to common python string operations: https://docs.python.org/3/library/string.html

Let's begin with a text string.

In [6]:
s = "Welcome to the text manipulation section"

We can get the length of the string (number of characters)

In [7]:
len(s)

40

The `center()` method returns a longer string that has the current string at the center, padded on both sides with a character that we specify. The first argument is the length of the final string, and the second argument is the single fill character added to each side of the starting string to reach that length.

In [8]:
s.center(100, '*')

'******************************Welcome to the text manipulation section******************************'

Note that if your starting string is longer than the string you are attempting to build, you'll simply get the starting string back.

In [9]:
s.center(30, '*')

'Welcome to the text manipulation section'

We can also check whether the string starts or ends with a given character or characters using `startswith()` and `endswith()`. 
* Note that these methods are case-sensitive.

In [10]:
s.endswith('tion')

True

In [11]:
s.startswith("Wel")

True
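As a small illustration of the case-sensitivity note above, the sketch below reuses the same string `s`. The variable names (`lower_match`, `insensitive_match`, `any_prefix`) are chosen here for illustration only. It also shows that `startswith()` accepts a tuple of candidate prefixes.

```python
s = "Welcome to the text manipulation section"

# startswith()/endswith() compare characters exactly, so case matters
lower_match = s.startswith("wel")                 # 'w' does not match 'W'

# one common workaround: normalise the case before checking
insensitive_match = s.lower().startswith("wel")

# a tuple of prefixes matches if any one of them is a prefix of s
any_prefix = s.startswith(("Hello", "Wel"))

print(lower_match, insensitive_match, any_prefix)  # False True True
```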

To confirm that the string contains the given character or substring, Python does NOT have a dedicated "contains" or "includes" method. Instead, we check for inclusion using the `in` operator.

In [12]:
'text manipulation' in s

True

In [13]:
'texted' in s

False

When analyzing datasets that contain text, we don't usually operate on individual strings. Instead, we take an operation and apply it to the entire collection of strings.

One way to do this in Python is to apply text transforms within list comprehensions.

In [14]:
names = ['Alanah', 'Albion', 'Andrew', 'Brian']

Suppose we want to find the lengths of all of the strings in this list. We could do this with a list comprehension:

In [15]:
[len(name) for name in names]

[6, 6, 6, 5]

Similarly, we can call any function we want, including functions we define. For instance, we can check whether the names start with "A".

In [16]:
[name.startswith('A') for name in names]

[True, True, True, False]

This approach looks fine, but it's actually quite fragile. For example, if the list contains an invalid string or a missing value (which happens all the time in real-world data), Python will raise an error.

Example:

In [17]:
names = ['Alanah', 'Albion', 'Andrew', np.nan, 'Brian']  # np.nan: the np.NaN alias was removed in NumPy 2.0

In [18]:
## Results in TypeError: object of type 'float' has no len()
# [len(name) for name in names]

So we need some special logic to accommodate issues such as these. This is one aspect where Numpy and Pandas improve on the built-in Python capabilities. Pandas allows us to conduct large-scale text manipulation without having to worry about missing values.
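To make the "special logic" concrete, here is one possible way to guard the comprehension in plain Python. It is only a sketch: using `isinstance` as the check and `None` as the placeholder are choices made for this illustration, not something the dataset requires.

```python
import numpy as np

names = ['Alanah', 'Albion', 'Andrew', np.nan, 'Brian']

# Only call len() on real strings; anything else (such as the float NaN)
# is mapped to None instead of raising a TypeError.
lengths = [len(name) if isinstance(name, str) else None for name in names]

print(lengths)  # [6, 6, 6, None, 5]
```

Every text operation would need a similar guard, which is the boilerplate that the Pandas string methods take care of.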

## Vectorized String Operations in Pandas

Pandas offers an extensive toolset for vectorized string operations on large sequences of text values. Many of the methods we discussed still apply, but the way we access them is a bit different.

In [19]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Suppose we want to find the length of each runner's name. Let's do it in Python first.

In [20]:
len('Kirui, Geoffrey')

15

But if we get a hold of the entire name range as a Series and pass it to the `len()` function, we'll quickly see that it doesn't work. Instead, we simply get a single number indicating the number of values in the "Name" column.

In [21]:
len(boston.Name)

1000

To get a Series of name lengths, we can use the `.str` family of methods. `.str` is an accessor that exposes vectorized string operations in Pandas. We can use it, for example, to compute the length of each name in the "Name" column.

In [22]:
boston.Name.str.len()

0      17
1      14
2      15
3      16
4      14
       ..
995    12
996    19
997    15
998    12
999    14
Name: Name, Length: 1000, dtype: int64

The same goes for other functions.

In [23]:
boston.Name.str.startswith('A')

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: Name, Length: 1000, dtype: bool

For the most part, vectorized string methods in Pandas follow the same naming convention as the built-in Python string methods; we'll see some exceptions later. The main differences are that they operate on the entire sequence at once and that a missing value does not raise an error: it is simply passed through as missing in the result.
* https://docs.python.org/3/library/stdtypes.html#string-methods
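The following sketch revisits the list that broke the list comprehension earlier, this time as a Series. The small `names` Series is made up for illustration. The `na=False` argument to `startswith()` is an optional choice that tells Pandas to report `False` for the missing entry.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Alanah', 'Albion', 'Andrew', np.nan, 'Brian'])

# No TypeError here: the missing entry stays missing in the result
lengths = names.str.len()

# For boolean-returning methods, `na` controls what missing entries become
starts_with_a = names.str.startswith('A', na=False)

print(lengths)
print(starts_with_a.tolist())  # [True, True, True, False, False]
```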

## Case Operations

There is a family of methods that change the casing of text data.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.upper.html
* The page above also contains links to other string methods.

For these examples let's focus on the "City" column

In [24]:
boston.City

0           Keringet
1           Portland
2       Machida-City
3      Mammoth Lakes
4           Marakwet
           ...      
995    North Andover
996          Raleigh
997        Arlington
998          Danbury
999         Santa Fe
Name: City, Length: 1000, dtype: object

The casing we see here is known as "title case", where the first letter of each word is capitalized. In both Python and Pandas it is applied using the `.title()` method.

In [25]:
boston.City.str.title()

0           Keringet
1           Portland
2       Machida-City
3      Mammoth Lakes
4           Marakwet
           ...      
995    North Andover
996          Raleigh
997        Arlington
998          Danbury
999         Santa Fe
Name: City, Length: 1000, dtype: object

This series was already title-cased, so the result is not particularly interesting (there was no change). So let's try another case operation. 

We can convert everything to upper case using the `.upper()` method.

In [26]:
boston.City.str.upper()

0           KERINGET
1           PORTLAND
2       MACHIDA-CITY
3      MAMMOTH LAKES
4           MARAKWET
           ...      
995    NORTH ANDOVER
996          RALEIGH
997        ARLINGTON
998          DANBURY
999         SANTA FE
Name: City, Length: 1000, dtype: object

A few other case methods include
* `lower()`
* `swapcase()` - reverses the current casing; upper becomes lower and lower becomes upper (instructor hasn't really found a great use for this method)
* `capitalize()` - capitalize the first letter of the *string* only (NOT the first letter of every word). All other letters are lower case

In [27]:
boston.City.str.lower()

0           keringet
1           portland
2       machida-city
3      mammoth lakes
4           marakwet
           ...      
995    north andover
996          raleigh
997        arlington
998          danbury
999         santa fe
Name: City, Length: 1000, dtype: object

In [28]:
boston.City.str.swapcase()

0           kERINGET
1           pORTLAND
2       mACHIDA-cITY
3      mAMMOTH lAKES
4           mARAKWET
           ...      
995    nORTH aNDOVER
996          rALEIGH
997        aRLINGTON
998          dANBURY
999         sANTA fE
Name: City, Length: 1000, dtype: object

In [29]:
boston.City.str.capitalize()

0           Keringet
1           Portland
2       Machida-city
3      Mammoth lakes
4           Marakwet
           ...      
995    North andover
996          Raleigh
997        Arlington
998          Danbury
999         Santa fe
Name: City, Length: 1000, dtype: object
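The difference between `title()` and `capitalize()` is easiest to see on a single value. The sketch below uses one city from the data as a plain Python string; the variable names are illustrative.

```python
city = 'Machida-City'

# title(): capitalizes the first letter of every word; the hyphen acts
# as a word boundary, so "city" is capitalized too
title_cased = city.lower().title()

# capitalize(): only the first character of the whole string is upper case
capitalized = city.capitalize()

# swapcase(): flips the case of every letter
swapped = city.swapcase()

print(title_cased, capitalized, swapped)
# Machida-City Machida-city mACHIDA-cITY
```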

## Finding Characters and Words: `str.find()` and `str.rfind()`

We'll begin with a review of the Python `find()` and `rfind()` methods. Recall our simple string `s`:

In [30]:
s

'Welcome to the text manipulation section'

Suppose we want to identify the exact position of the first lower-case "x" character in this string. To do this, we call the `find()` method on the string we are searching in and pass it the character we are looking for.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.find.html

In [31]:
s.find('x')

17

We see that the first lower-case "x" is at index position 17 (the 18th character of the string, since Python is zero-indexed).

In [32]:
s[17]

'x'

We can search for any sequence of characters that we want. For instance, let's look for the full substring "text". What is returned is the position of the first character of that substring.

In [33]:
s.find('text')

15

If you ever provide a search string that does not exist in the queried string, the `find()` method will return -1.
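A short sketch of the "not found" case, reusing the string `s`. The search term `'pandas'` is picked only because it does not occur in `s`. It also shows two related points from standard Python: `str.index()` raises instead of returning -1, and the result of `find()` should be compared with -1 rather than used as a truth value.

```python
s = "Welcome to the text manipulation section"

# find() signals "not found" with -1
missing = s.find('pandas')

# index() is the stricter sibling: it raises ValueError instead
try:
    s.index('pandas')
    raised = False
except ValueError:
    raised = True

# Pitfall: a match at position 0 is falsy, while -1 is truthy,
# so test explicitly against -1
at_start = s.find('Wel')
found = at_start != -1

print(missing, raised, at_start, found)  # -1 True 0 True
```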

Returning now to Pandas, let's list the first few records to orient ourselves.

In [34]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Here we'll pick the "Name" column and explore how many of the top marathon runners have "Andy" in their names. This can be easily achieved by applying the `find()` method to the entire sequence of names. As usual, we will use the `.str` accessor and then apply the `find()` method.

In [35]:
boston.Name.str.find('Andy')

0     -1
1     -1
2     -1
3     -1
4     -1
      ..
995   -1
996   -1
997   -1
998   -1
999   -1
Name: Name, Length: 1000, dtype: int64

What is returned is a long sequence of integers indicating the position in each name at which the substring "Andy" begins, or -1 where it does not occur. Let's do a quick `value_counts()` analysis.

In [36]:
boston.Name.str.find('Andy').value_counts()

-1     998
 12      1
 8       1
Name: Name, dtype: int64

There are actually two instances of someone having "Andy" in their name. Seems underrepresented. How about a name like "James"?

In [37]:
boston.Name.str.find('James').value_counts()

-1     988
 10      3
 8       3
 9       2
 7       2
 12      1
 6       1
Name: Name, dtype: int64
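If the positions themselves are not needed, the -1 convention can be turned into a yes/no answer directly. The sketch below uses a tiny made-up Series (these names are hypothetical, not taken from the marathon data) and compares that approach with `str.contains()`, which Pandas provides for this kind of check.

```python
import pandas as pd

# Hypothetical names, for illustration only
names = pd.Series(['Murray, Andy', 'Smith, James R', 'Lee, Anna'])

# Position of the substring in each name, or -1 if absent
positions = names.str.find('Andy')

# Any value other than -1 means the substring was found
has_andy = positions != -1
count = int(has_andy.sum())

# The same question asked directly; regex=False treats 'Andy' as literal text
contains = names.str.contains('Andy', regex=False)

print(positions.tolist(), count, contains.tolist())
# [8, -1, -1] 1 [True, False, False]
```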

The `find()` method searches from left to right and reports the first match it encounters. If we search from the right instead, we may get a different position back. Let's illustrate this directionality with a new string.

In [38]:
p = 'pandas numpy numpy pandas'

Let's first try searching for "pandas"

In [39]:
p.find('pandas')

0

We get zero, indicating that the first "pandas" substring instance begins at index position 0, as we expected. What if we want to search from the right side and determine the position at which "pandas" appears closest to the right end? We do that using the `rfind()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.rfind.html

In [40]:
p.rfind('pandas')

19

Here the method indicates that the right-most "pandas" occurrence begins at position 19. Note that the position is still counted from the left; only the direction of the search changes. We can verify this with a slice.

In [41]:
p[19:]

'pandas'
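A compact comparison of the two directions on the same string `p`, followed by the vectorized equivalent. The second Series entry, `'numpy only'`, is made up for this sketch.

```python
import pandas as pd

p = 'pandas numpy numpy pandas'

# Both methods report positions counted from the left;
# they differ only in which occurrence they return.
first_numpy = p.find('numpy')    # left-most occurrence
last_numpy = p.rfind('numpy')    # right-most occurrence

print(first_numpy, last_numpy)   # 7 13

# The vectorized form goes through the .str accessor
rightmost = pd.Series([p, 'numpy only']).str.rfind('numpy')
print(rightmost.tolist())        # [13, 0]
```

When the substring occurs only once, as in `'numpy only'`, `find()` and `rfind()` return the same position.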

## Strips and Whitespace Methods

**Whitespace** refers to characters that represent vertical or horizontal space, such as space, tab, and newline characters. They are often not visible when a string is printed, but they do affect the spacing and positioning of the output.

In this lecture we'll cover the following methods:
* `isspace()`
* `lstrip()`
* `rstrip()`
* `strip()`

Descriptions of these methods can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html

To check whether a given character is a whitespace, we can use the Python method `isspace()`

In [42]:
' '.isspace()

True

In [43]:
'\n'.isspace()

True

Let's create some sample strings that contain whitespace to work with for the rest of the section.

In [44]:
left_spaced = '     this is a pandas course'

In [45]:
right_spaced = 'we cover plenty of Python too!      '

In [46]:
spaced = '    the name is: BOND \t JAMES BOND \n\n'

When printing the left_spaced string, the leading space will not be immediately obvious.

In [47]:
print(left_spaced)

     this is a pandas course


It's tough to see, but it's there and the whitespace contributes to the length of the overall string.

In [48]:
print(spaced)

    the name is: BOND 	 JAMES BOND 




Whitespace can be troublesome when working with text, and it often creeps in when gathering text from unstructured input such as forums and comments.

Luckily, Python and Pandas offer a number of very useful methods to strip whitespace from text. First up is `lstrip()`, which removes leading whitespace.

In [49]:
left_spaced.lstrip()

'this is a pandas course'

`rstrip()` does the exact same thing, but on the right-hand side.

In [50]:
right_spaced.rstrip()

'we cover plenty of Python too!'

The generic `strip()` method does the same thing but on both ends at the same time.

In [51]:
spaced.strip()

'the name is: BOND \t JAMES BOND'

Notice that the horizontal tab character "\t" is still there. The strip methods only trim the ends of a string, so whitespace in the middle is left untouched. We'll handle this later on when we combine the replacement methods with regular expressions.
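Until then, one way to deal with interior whitespace in plain Python is to split on whitespace and join the pieces back together. This is a different technique from the regex-based replacement mentioned above, shown here only as a sketch; note that it collapses every run of whitespace to a single space, which may or may not be what you want.

```python
spaced = '    the name is: BOND \t JAMES BOND \n\n'

# split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and drops leading and trailing whitespace
words = spaced.split()

# Re-join the words with a single space between them
collapsed = ' '.join(words)

print(repr(collapsed))  # 'the name is: BOND JAMES BOND'
```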

Moving on to Pandas, let's again look at our dataframe

In [52]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Looking at our "Name" column, we see that some names have leading or trailing whitespace, for example the first two names on the list.

In [53]:
boston.Name.iloc[0]

' Kirui, Geoffrey '

In [54]:
boston.Name.iloc[1]

'Rupp, Galen   '

How do we apply the vectorized `strip()` method to this? It's the same syntax that we're familiar with.

In [55]:
boston.Name.iloc[0:2].str.strip()

0    Kirui, Geoffrey
1        Rupp, Galen
Name: Name, dtype: object

This has stripped all leading and trailing white space from the first two names. Let's go ahead and apply this to our entire sequence of names, and then assign the result back to the "Name" column.

In [56]:
boston.Name = boston.Name.str.strip()

In [57]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


It's hard to tell whether that did anything, so let's verify that it worked by looking at the first name again.

In [58]:
boston.Name.iloc[0]

'Kirui, Geoffrey'

Sure enough, the leading and trailing whitespace is gone!
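Checking values one at a time does not scale, so a possible alternative is to compare string lengths before and after stripping. The sketch below uses a small stand-in Series: the first two values are the padded names shown above, and the third is assumed to be clean.

```python
import pandas as pd

# Stand-in for the unstripped "Name" column
names = pd.Series([' Kirui, Geoffrey ', 'Rupp, Galen   ', 'Osako, Suguru'])

before = names.str.len()
stripped = names.str.strip()
after = stripped.str.len()

# Rows whose length changed are the ones that carried extra whitespace
changed = int((before != after).sum())

print(before.tolist(), after.tolist(), changed)
# [17, 14, 13] [15, 11, 13] 2
```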