In [3]:
import numpy as np
import pandas as pd

# Section 11: Regex and Text Manipulation

Python and Pandas have a lot to offer for extracting information from text and manipulating text. In this section we will cover:
* a detailed overview of Python string methods
* the Pandas `.str` family of methods
* advanced splits and replacements in Pandas
* hands-on introduction to RegEx
  * character sets, anchors, metasequences, quantifiers and more!

We'll get some hands-on practice with these tools using a Boston marathon dataset.

## Our data: Boston Marathon Runners


For this section we'll be working with a dataset for Boston marathon participants.

https://andybek.com/pandas-marathon

In [6]:
boston_url = 'https://andybek.com/pandas-marathon'

In [8]:
boston = pd.read_csv(boston_url)

In [9]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


In [10]:
boston.info  # without parentheses this shows the bound method; call boston.info() for the column summary

<bound method DataFrame.info of                     Name  Age M/F  ... Overall Gender  Years Ran
0       Kirui, Geoffrey    24   M  ...       1      1        NaN
1         Rupp, Galen      30   M  ...       2      2        NaN
2        Osako, Suguru     25   M  ...       3      3        NaN
3       Biwott, Shadrack   32   M  ...       4      4        NaN
4         Chebet, Wilson   31   M  ...       5      5       2015
..                   ...  ...  ..  ...     ...    ...        ...
995         Larosa, Mark   38   M  ...     996    940  2015:2016
996  Williamson, Jerry A   43   M  ...     997    941       2015
997      Mccue, Daniel T   40   M  ...     998    942        NaN
998         Larosa, John   35   M  ...     999    943        NaN
999       Sanchez, Sam R   35   M  ...    1000    944        NaN

[1000 rows x 10 columns]>

We have a dataset of 10 columns that include the name, age, and gender of each runner for the year 2017. Most of the fields are strings, including the "Official Time", which is text-based. This gives us plenty of text data to play around with in this section.

## String Methods in Python

We'll start by playing around with pure text in Python. We will cover the following concepts:
* `len`
* `center`
* `startswith` and `endswith`
* the `in` operator
* list comprehension with strings

Link to common Python string operations: https://docs.python.org/3/library/string.html

Let's begin with a text string.

In [11]:
s = "Welcome to the text manipulation section"

We can get the length of the string (its number of characters) with `len()`.

In [12]:
len(s)

40

The `center()` method creates a longer string that has the current string at the center, padded on both sides with a character that we specify. The first argument is the length of the final string, and the second argument is the single fill character added to each side of the starting string to reach that length.

In [14]:
s.center(100, '*')

'******************************Welcome to the text manipulation section******************************'

Note that if your starting string is longer than the string you are attempting to build, you'll simply get the starting string back.

In [15]:
s.center(30, '*')

'Welcome to the text manipulation section'
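
As a minimal sketch of how `center()` behaves, the snippet below pads a short string to a fixed width and then asks for a width that is too small. The `title` value and the widths are made up for illustration. It also shows that the fill argument has to be exactly one character.

```python
title = "Boston"  # hypothetical example string, 6 characters long

padded = title.center(12, "-")     # total width 12, so 3 dashes on each side
unchanged = title.center(3, "-")   # requested width is smaller than len(title)

print(padded)      # ---Boston---
print(unchanged)   # Boston

# The fill argument must be a single character; anything longer raises TypeError.
try:
    title.center(12, "-*")
except TypeError as exc:
    print("TypeError:", exc)
```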

We can also check whether the string starts or ends with a given character or characters using `startswith()` and `endswith()`. 
* Note that these methods are case-sensitive.

In [16]:
s.endswith('tion')

True

In [17]:
s.startswith("Wel")

True
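
To make the case-sensitivity point concrete, here is a small sketch using the same string `s`. Lowercasing before the comparison is one common workaround rather than the only option, and the tuple form lets us test several candidate prefixes or suffixes in one call.

```python
s = "Welcome to the text manipulation section"

exact = s.startswith("wel")              # lowercase 'w' does not match 'W'
normalized = s.lower().startswith("wel") # normalize case before comparing

# A tuple of candidates returns True if any one of them matches.
any_suffix = s.endswith(("tion", "ing"))

print(exact)        # False
print(normalized)   # True
print(any_suffix)   # True
```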

To confirm that the string contains the given character or substring, Python does NOT have a dedicated "contains" or "includes" method. Instead, we check for inclusion using the `in` operator.

In [18]:
'text manipulation' in s

True

In [19]:
'texted' in s

False
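
The `in` operator is also case-sensitive. The sketch below, again using `s`, shows a case-insensitive check and uses `str.find()` for the situation where we need the position of a match rather than just a yes/no answer.

```python
s = "Welcome to the text manipulation section"

has_capitalized = "Text" in s          # case-sensitive, so this is False
has_any_case = "text" in s.lower()     # compare in lowercase instead

# find() returns the index of the first match, or -1 if there is none.
position = s.find("text")
missing = s.find("texted")

print(has_capitalized, has_any_case)   # False True
print(position, missing)               # 15 -1
```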

When analyzing datasets that contain text, we don't usually operate on individual strings. Instead, we take an operation and apply it to the entire collection of strings.

One way to do this in Python is to apply text transforms within list comprehensions.

In [20]:
names = ['Alanah', 'Albion', 'Andrew', 'Brian']

Suppose we want to find the lengths of all of the strings in this list. We could do this with a list comprehension.

In [22]:
[len(name) for name in names]

[6, 6, 6, 5]

Similarly, we can call any function we want, including functions we define. For instance, we can check whether the names start with "A".

In [23]:
[name.startswith('A') for name in names]

[True, True, True, False]
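
As an illustration of calling our own function inside a comprehension, the sketch below defines a hypothetical helper, `starts_with_letter`, that ignores case. The same helper can also be used in the comprehension's `if` clause to filter the list.

```python
def starts_with_letter(name, letter):
    """Return True if name starts with letter, ignoring case (hypothetical helper)."""
    return name.lower().startswith(letter.lower())

names = ['Alanah', 'Albion', 'Andrew', 'Brian']

# Apply the helper to every element.
flags = [starts_with_letter(name, 'a') for name in names]

# Use the helper as a filter condition instead.
a_names = [name for name in names if starts_with_letter(name, 'a')]

print(flags)     # [True, True, True, False]
print(a_names)   # ['Alanah', 'Albion', 'Andrew']
```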

This approach is okay-looking, but it's actually quite fragile. For example, if we have an invalid string or a missing value (which happens all the time in real-world data), Python will raise an error.

Example:

In [24]:
names = ['Alanah', 'Albion', 'Andrew', np.nan, 'Brian']  # np.nan: the np.NaN alias was removed in NumPy 2.0

In [26]:
## Results in TypeError: object of type 'float' has no len()
# [len(name) for name in names]

So we need some special logic to accommodate issues such as these. This is one aspect where Numpy and Pandas improve on the built-in Python capabilities. Pandas allows us to conduct large-scale text manipulation without having to worry about missing values.
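
For comparison, this is roughly what that special logic could look like in plain Python. It is only a sketch: it uses `float('nan')` as the missing value (the same kind of float that `np.nan` is) and chooses `None` as the placeholder for entries that are not strings.

```python
names = ['Alanah', 'Albion', 'Andrew', float('nan'), 'Brian']

# Only call len() on real strings; mark everything else as None.
lengths = [len(name) if isinstance(name, str) else None for name in names]

print(lengths)   # [6, 6, 6, None, 5]
```

Every transform needs the same guard, which is the repetition the vectorized Pandas methods remove.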

## Vectorized String Operations in Pandas

Pandas offers an extensive toolset for vectorized string operations on large sequences of text values. Many of the methods we discussed still apply, but the way we access them is a bit different.

In [27]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Suppose we want to find the length of each runner's name. Let's do it in Python first.

In [28]:
len('Kirui, Geoffrey')

15

But if we get a hold of the entire name range as a Series and pass it to the `len()` function, we'll quickly see that it doesn't work. Instead, we simply get a single number indicating the number of values in the "Name" column.

In [30]:
len(boston.Name)

1000

To get a Series of name lengths, we can use the `.str` family of methods. `.str` is an accessor, available on any Series of strings, that gives us access to vectorized string operations in Pandas.

In [31]:
boston.Name.str.len()

0      17
1      14
2      15
3      16
4      14
       ..
995    12
996    19
997    15
998    12
999    14
Name: Name, Length: 1000, dtype: int64

The same goes for other functions.

In [32]:
boston.Name.str.startswith('A')

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: Name, Length: 1000, dtype: bool
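
A Boolean Series like this one is typically used as a mask to select rows. The sketch below assumes a hypothetical five-row stand-in for the "Name" column, built from the names shown in `boston.head()`, so that it can run without downloading the dataset.

```python
import pandas as pd

# Hypothetical miniature of boston.Name, taken from the first five rows shown above.
names = pd.Series(
    ["Kirui, Geoffrey", "Rupp, Galen", "Osako, Suguru",
     "Biwott, Shadrack", "Chebet, Wilson"],
    name="Name",
)

mask = names.str.startswith("K")   # one True/False per row
k_names = names[mask]              # keep only the rows where the mask is True

print(k_names.tolist())   # ['Kirui, Geoffrey']
```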

For the most part, vectorized string methods in Pandas follow the same naming convention as Python's built-in string methods; we'll see some exceptions later. The difference is that they operate on the entire sequence at once and skip missing values, returning a missing value in the result instead of raising an error.
* https://docs.python.org/3/library/stdtypes.html#string-methods
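
To see how missing values are handled, here is a short sketch that reuses the list that broke the list comprehension earlier. The `na=False` argument to `str.startswith()` is optional; it is used here as one way to get a plain True/False answer for the missing entry.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Alanah', 'Albion', 'Andrew', np.nan, 'Brian'])

# The missing entry stays missing in the result instead of raising TypeError.
lengths = names.str.len()

# na=False treats the missing entry as "does not start with 'A'".
starts_a = names.str.startswith('A', na=False)

print(lengths)
print(starts_a.tolist())   # [True, True, True, False, False]
```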