In [94]:
import numpy as np
import pandas as pd

# Section 11: Regex and Text Manipulation

Python and Pandas have a lot to offer when it comes to extracting information from text and manipulating text. In this section we will cover:
* a detailed overview of Python string methods
* the Pandas `.str` family of methods
* advanced splits and replacements in Pandas
* hands-on introduction of RegEx
  * character sets, anchors, metasequences, quantifiers and more!

We'll get some hands-on practice with these tools using a Boston Marathon dataset.

## Our data: Boston Marathon Runners


For this section we'll be working with a dataset for Boston marathon participants.

https://andybek.com/pandas-marathon

In [95]:
boston_url = 'https://andybek.com/pandas-marathon'

In [96]:
boston = pd.read_csv(boston_url)

In [97]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


In [98]:
boston.info

<bound method DataFrame.info of                     Name  Age M/F  ... Overall Gender  Years Ran
0       Kirui, Geoffrey    24   M  ...       1      1        NaN
1         Rupp, Galen      30   M  ...       2      2        NaN
2        Osako, Suguru     25   M  ...       3      3        NaN
3       Biwott, Shadrack   32   M  ...       4      4        NaN
4         Chebet, Wilson   31   M  ...       5      5       2015
..                   ...  ...  ..  ...     ...    ...        ...
995         Larosa, Mark   38   M  ...     996    940  2015:2016
996  Williamson, Jerry A   43   M  ...     997    941       2015
997      Mccue, Daniel T   40   M  ...     998    942        NaN
998         Larosa, John   35   M  ...     999    943        NaN
999       Sanchez, Sam R   35   M  ...    1000    944        NaN

[1000 rows x 10 columns]>

We have a dataset of 10 columns that includes the name, age, and gender of each runner for the year 2017. Most of the fields are strings, including the "Official Time", which is text-based. This gives us plenty of text data to play around with in this section.

## String Methods in Python

We'll start by playing around with pure text in Python. We will cover the following concepts:
* `len`
* `center`
* `startswith` and `endswith`
* the `in` operator
* list comprehension with strings

Link to common python string operations: https://docs.python.org/3/library/string.html

Let's begin with a text string.

In [99]:
s = "Welcome to the text manipulation section"

We can get the length of the string (number of characters)

In [100]:
len(s)

40

The `center()` method creates a longer string with the current string at the center, padded on both sides with a character that we specify. The first argument is the length of the final string, and the second is the fill character (a single character) used to pad each side of the starting string.

In [101]:
s.center(100, '*')

'******************************Welcome to the text manipulation section******************************'

Note that if your starting string is longer than the string you are attempting to build, you'll simply get the starting string back.

In [102]:
s.center(30, '*')

'Welcome to the text manipulation section'

We can also check whether the string starts or ends with a given character or characters using `startswith()` and `endswith()`. 
* Note that these methods are case-sensitive.

In [103]:
s.endswith('tion')

True

In [104]:
s.startswith("Wel")

True

To confirm that the string contains the given character or substring, Python does NOT have a dedicated "contains" or "includes" method. Instead, we check for inclusion using the `in` operator.

In [105]:
'text manipulation' in s

True

In [106]:
'texted' in s

False

When analyzing datasets that contain text, we don't usually operate on individual strings. Instead, we take an operation and apply it to the entire collection of strings.

One way to do this in Python is to apply text transforms within list comprehensions.

In [107]:
names = ['Alanah', 'Albion', 'Andrew', 'Brian']

Suppose we want to find the lengths of all of the strings in this list. We could do this with a list comprehension.

In [108]:
[len(name) for name in names]

[6, 6, 6, 5]

Similarly, we can call any function we want, including functions we define. For instance, we can check whether the names start with "A".

In [109]:
[name.startswith('A') for name in names]

[True, True, True, False]

This approach is okay-looking, but it's actually quite fragile. For example, if we had an invalid string or a missing value (which happens all of the time in real-world data), Python would throw an error.

Example:

In [110]:
names = ['Alanah', 'Albion', 'Andrew', np.nan, 'Brian']

In [111]:
## Results in TypeError: object of type 'float' has no len()
# [len(name) for name in names]

So we need some special logic to accommodate issues such as these. This is one aspect where Numpy and Pandas improve on the built-in Python capabilities. Pandas allows us to conduct large-scale text manipulation without having to worry about missing values.
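As a rough sketch of what that special logic could look like in plain Python, each element can be guarded with an `isinstance` check. The list mirrors the sample above; emitting `None` for non-strings is an assumption made purely for illustration.

```python
import numpy as np

names = ['Alanah', 'Albion', 'Andrew', np.nan, 'Brian']

# Only call len() on real strings; use None as a placeholder for anything else
lengths = [len(name) if isinstance(name, str) else None for name in names]
print(lengths)  # [6, 6, 6, None, 5]
```

This works, but the guard has to be repeated for every operation, which is the tedium the vectorized Pandas methods remove.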

## Vectorized String Operations in Pandas

Pandas offers an extensive toolset for vectorized string operations on large sequences of text values. Many of the methods we discussed still apply, but the way we access them is a bit different.

In [112]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Suppose we want to find the length of each runner's name. Let's do it in Python first.

In [113]:
len('Kirui, Geoffrey')

15

But if we get a hold of the entire name range as a Series and pass it to the `len()` function, we'll quickly see that it doesn't work. Instead, we simply get a single number indicating the number of values in the "Name" column.

In [114]:
len(boston.Name)

1000

To get a series of name lengths, we can use the `.str` family of methods. `.str` is an accessor attribute that gives us vectorized string operations in Pandas. We can use it to, for example, calculate the length of each name in the "Name" column.

In [115]:
boston.Name.str.len()

0      17
1      14
2      15
3      16
4      14
       ..
995    12
996    19
997    15
998    12
999    14
Name: Name, Length: 1000, dtype: int64

The same goes for other functions.

In [116]:
boston.Name.str.startswith('A')

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: Name, Length: 1000, dtype: bool

For the most part, vectorized string methods in Pandas follow the same naming convention as the built-in Python string methods; we'll see some exceptions later. The main differences are that they operate on the entire sequence at once and that they skip over missing values (which typically come back as missing in the result) instead of raising an error.
* https://docs.python.org/3/library/stdtypes.html#string-methods
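To make the missing-value behavior concrete, here is a small sketch on a hypothetical Series (not the marathon data) that contains one missing entry. It only illustrates that `.str.len()` and `.str.upper()` run without an error and leave the gap in place.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a missing value in the middle
names = pd.Series(['Alanah', 'Albion', np.nan, 'Brian'])

lengths = names.str.len()   # no TypeError; the missing entry stays missing
upper = names.str.upper()   # same idea for a text-returning method

print(lengths)
print(upper)
```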

## Case Operations

There exists a family of methods that change the casing of text data.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.upper.html
* The page above also contains links to other string methods.

For these examples let's focus on the "City" column

In [117]:
boston.City

0           Keringet
1           Portland
2       Machida-City
3      Mammoth Lakes
4           Marakwet
           ...      
995    North Andover
996          Raleigh
997        Arlington
998          Danbury
999         Santa Fe
Name: City, Length: 1000, dtype: object

The casing we see here is known as "title case", where the first letter of each word is capitalized. In both Python and Pandas it is applied using the `.title()` method.

In [118]:
boston.City.str.title()

0           Keringet
1           Portland
2       Machida-City
3      Mammoth Lakes
4           Marakwet
           ...      
995    North Andover
996          Raleigh
997        Arlington
998          Danbury
999         Santa Fe
Name: City, Length: 1000, dtype: object

This series was already title-cased, so the result is not particularly interesting (there was no change). So let's try another case operation. 

We can convert everything to upper case using the `.upper()` method.

In [119]:
boston.City.str.upper()

0           KERINGET
1           PORTLAND
2       MACHIDA-CITY
3      MAMMOTH LAKES
4           MARAKWET
           ...      
995    NORTH ANDOVER
996          RALEIGH
997        ARLINGTON
998          DANBURY
999         SANTA FE
Name: City, Length: 1000, dtype: object

A few other case methods include
* `lower()`
* `swapcase()` - reverses the current casing; upper becomes lower and lower becomes upper (instructor hasn't really found a great use for this method)
* `capitalize()` - capitalize the first letter of the *string* only (NOT the first letter of every word). All other letters are lower case

In [120]:
boston.City.str.lower()

0           keringet
1           portland
2       machida-city
3      mammoth lakes
4           marakwet
           ...      
995    north andover
996          raleigh
997        arlington
998          danbury
999         santa fe
Name: City, Length: 1000, dtype: object

In [121]:
boston.City.str.swapcase()

0           kERINGET
1           pORTLAND
2       mACHIDA-cITY
3      mAMMOTH lAKES
4           mARAKWET
           ...      
995    nORTH aNDOVER
996          rALEIGH
997        aRLINGTON
998          dANBURY
999         sANTA fE
Name: City, Length: 1000, dtype: object

In [122]:
boston.City.str.capitalize()

0           Keringet
1           Portland
2       Machida-city
3      Mammoth lakes
4           Marakwet
           ...      
995    North andover
996          Raleigh
997        Arlington
998          Danbury
999         Santa fe
Name: City, Length: 1000, dtype: object

## Finding Characters and Words: `str.find()` and `str.rfind()`

We'll begin with a review of the Python `find()` and `rfind()` methods. Recall our simple string `s`.

In [123]:
s

'Welcome to the text manipulation section'

Suppose we want to identify the exact position of the first lower-case "x" character in this string. To do this, we call the `find()` method on the string we are searching in and pass it the character we are looking for.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.find.html

In [124]:
s.find('x')

17

We see that the first lower-case "x" is at index position 17 (the 18th letter of the string since Python is zero-indexed).

In [125]:
s[17]

'x'

We can search for any sequence of characters that we want. For instance, let's look for a full substring "text". What returns is the position of the first character in that substring

In [126]:
s.find('text')

15

If you ever provide a search string that does not exist in the queried string, the `find()` method will return -1.
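The -1 result is easy to check with the same string `s`. The sketch below also contrasts `find()` with the built-in `str.index()`, which raises a `ValueError` in the same situation; the search term "pandas" is just an arbitrary substring that does not occur in `s`.

```python
s = "Welcome to the text manipulation section"

missing = s.find('pandas')   # substring is not present in s
print(missing)               # -1

# str.index() is the stricter sibling: it raises instead of returning -1
try:
    s.index('pandas')
except ValueError as err:
    print('index() raised:', err)
```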

Returning now to Pandas, let's list the first few records to orient ourselves.

In [127]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Here we'll pick the "Name" column and explore how many of the top marathon runners have 'Andy' in their names. This can be easily achieved by applying the `find()` method to the entire sequence of names. As usual, we will use the `.str` accessor and then apply the `find()` method.

In [128]:
boston.Name.str.find('Andy')

0     -1
1     -1
2     -1
3     -1
4     -1
      ..
995   -1
996   -1
997   -1
998   -1
999   -1
Name: Name, Length: 1000, dtype: int64

What returns is a long sequence of integers, indicating the position in each name at which the substring "Andy" starts, or -1 where it does not occur. Let's do a quick `value_counts()` analysis.

In [129]:
boston.Name.str.find('Andy').value_counts()

-1     998
 12      1
 8       1
Name: Name, dtype: int64

There are actually two instances of someone having "Andy" in their name. Seems underrepresented. How about a name like "James?"

In [130]:
boston.Name.str.find('James').value_counts()

-1     988
 10      3
 8       3
 9       2
 7       2
 12      1
 6       1
Name: Name, dtype: int64
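The `value_counts()` output groups the matches by position, so the total has to be added up by hand (for "James" above, the non-negative positions account for 12 rows). One possible shortcut, sketched here on a small hypothetical Series instead of `boston.Name`, is to compare the positions to -1 and sum the resulting booleans.

```python
import pandas as pd

# Hypothetical stand-in for boston.Name
names = pd.Series(['Lloyd, James', 'Baek, James', 'Rupp, Galen', 'James, Andy'])

positions = names.str.find('James')
n_matches = (positions != -1).sum()   # each True counts as 1
print(positions.tolist())             # [7, 6, -1, 0]
print(n_matches)                      # 3
```

`names.str.contains('James', regex=False).sum()` should give the same count when only a yes/no answer is needed and the position itself is not of interest.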

The `find()` method performs a left-to-right search by default. If we start from the right instead, we'll get a different position integer returned. Let's illustrate this directionality with a new string.

In [131]:
p = 'pandas numpy numpy pandas'

Let's first try searching for "pandas"

In [132]:
p.find('pandas')

0

We get zero, indicating that the first "pandas" substring instance begins at the 0th indexed position, as we expected. What if we want to search from the right side and determine the position at which the rightmost "pandas" begins? We do that using the `rfind()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.rfind.html

In [133]:
p.rfind('pandas')

19

Here the method indicates that the last (rightmost) occurrence of "pandas" starts at position 19. Note that the position is still counted from the left; only the search direction changes. We can verify this with a slice.

In [134]:
p[19:]

'pandas'

## Strips and Whitespace Methods

**Whitespace** refers to characters that represent vertical or horizontal space, such as space, tab, and newline characters. They are oftentimes not visible when a string is printed, but they do impact the spacing and positioning of the output.

In this lecture we'll cover the following methods:
* `isspace()`
* `lstrip()`
* `rstrip()`
* `strip()`

Descriptions of these methods can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html

To check whether a given character is a whitespace, we can use the Python method `isspace()`

In [135]:
' '.isspace()

True

In [136]:
'\n'.isspace()

True

Let's create some sample strings that contain whitespace to work with for the rest of the section.

In [137]:
left_spaced = '     this is a pandas course'

In [138]:
right_spaced = 'we cover plenty of Python too!      '

In [139]:
spaced = '    the name is: BOND \t JAMES BOND \n\n'

When printing the left_spaced string, the leading space will not be immediately obvious.

In [140]:
print(left_spaced)

     this is a pandas course


It's tough to see, but it's there and the whitespace contributes to the length of the overall string.

In [141]:
print(spaced)

    the name is: BOND 	 JAMES BOND 




Whitespace can be troublesome when working with text. It often creeps in when gathering text from unstructured input, such as forums, comments, etc.

Luckily, Python and Pandas offer a number of very useful methods to strip whitespace from text. First up is `lstrip()`, which removes leading whitespace.

In [142]:
left_spaced.lstrip()

'this is a pandas course'

`rstrip()` does the exact same thing, but on the right-hand side.

In [143]:
right_spaced.rstrip()

'we cover plenty of Python too!'

The generic `strip()` method does the same thing but on both ends at the same time.

In [144]:
spaced.strip()

'the name is: BOND \t JAMES BOND'

Notice that the horizontal tab character "\t" is still there. The strip methods only remove whitespace from the ends of a string, never from the middle. We'll handle this later on, when we combine the replacement methods with regular expressions.
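In the meantime, one regex-free workaround is to split the string on whitespace and join the pieces back together with single spaces. This is only a sketch of an alternative, not the replacement-based approach described above, and it borrows `split()`, which is covered in the next lecture.

```python
spaced = '    the name is: BOND \t JAMES BOND \n\n'

# split() with no arguments breaks on any run of whitespace (spaces, tabs, newlines);
# joining with a single space rebuilds the sentence without the extras
collapsed = ' '.join(spaced.split())
print(repr(collapsed))  # 'the name is: BOND JAMES BOND'
```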

Moving on to Pandas, let's again look at our dataframe

In [145]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Looking at our "Name" column, we see that some names have leading or trailing whitespace, for example the first two names on the list.

In [146]:
boston.Name.iloc[0]

' Kirui, Geoffrey '

In [147]:
boston.Name.iloc[1]

'Rupp, Galen   '

How do we apply the vectorized `strip()` method to this? It's the same syntax that we're familiar with.

In [148]:
boston.Name.iloc[0:2].str.strip()

0    Kirui, Geoffrey
1        Rupp, Galen
Name: Name, dtype: object

This has stripped all leading and trailing white space from the first two names. Let's go ahead and apply this to our entire sequence of names, and then assign the result back to the "Name" column.

In [149]:
boston.Name = boston.Name.str.strip()

In [150]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


It's hard to tell whether that did anything, so let's verify that it worked by looking at the first name again.

In [151]:
boston.Name.iloc[0]

'Kirui, Geoffrey'

Sure enough, the leading and trailing whitespace is gone!

## String Splitting and Concatenation: `split()`, `get()`, and `cat()`

Splitting methods take a piece of text and break it down into smaller strings based on a break point that we specify.

Recall our string from a few lectures ago, which we will call `split()` on.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html

By default (when called without arguments), the method splits the string on **whitespace** and creates a list of strings consisting of the component "words" of the original string.
* Oftentimes this means splitting on single spaces, but any whitespace will be considered a split point by the method.

In [152]:
s

'Welcome to the text manipulation section'

In [153]:
s.split()

['Welcome', 'to', 'the', 'text', 'manipulation', 'section']

Consider the James Bond string from earlier. First we will attempt to split on any whitespace.

In [154]:
spaced

'    the name is: BOND \t JAMES BOND \n\n'

In [155]:
spaced.split()

['the', 'name', 'is:', 'BOND', 'JAMES', 'BOND']

What if we try to split specifically on a single space?

In [156]:
spaced.split(' ')

['', '', '', '', 'the', 'name', 'is:', 'BOND', '\t', 'JAMES', 'BOND', '\n\n']

It is also worth mentioning that we can split a string on anything we want. For instance, we can split our `s` string on "to".

In [157]:
s.split('to')

['Welcome ', ' the text manipulation section']

One last important note, as observed above, is that the string that is chosen as the split point is not included in the returned collection; it is always discarded.

Now let's bring this over to Pandas and our dataframe.

In [158]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


Suppose we need to introduce two new columns to our dataframe, one with runners' first names and another with runners' last names. Notice how every runner is identified by their last name, comma, space, first name. Thus, a good candidate for the split string is ", ".

Let's try it.

In [159]:
boston.Name.str.split(', ')

0          [Kirui, Geoffrey]
1              [Rupp, Galen]
2            [Osako, Suguru]
3         [Biwott, Shadrack]
4           [Chebet, Wilson]
               ...          
995           [Larosa, Mark]
996    [Williamson, Jerry A]
997        [Mccue, Daniel T]
998           [Larosa, John]
999         [Sanchez, Sam R]
Name: Name, Length: 1000, dtype: object

This returns a pandas series of Python lists - each runner's name was split on the ", " and returned a list of the two component names.

Now we need to get these first and last names in their own respective columns. How do we extract the first item for each of our records?

One way to do it is to chain on a special Pandas string method called `str.get()`, which is designed precisely for instances like this. Simply put, it extracts the element at the specified position from each list in the series.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get.html



In [160]:
boston.Name.str.split(', ').str.get(0)

0           Kirui
1            Rupp
2           Osako
3          Biwott
4          Chebet
          ...    
995        Larosa
996    Williamson
997         Mccue
998        Larosa
999       Sanchez
Name: Name, Length: 1000, dtype: object

That gave us the last names. We can do the same for the first names.

In [161]:
boston.Name.str.split(', ').str.get(1)

0      Geoffrey
1         Galen
2        Suguru
3      Shadrack
4        Wilson
         ...   
995        Mark
996     Jerry A
997    Daniel T
998        John
999       Sam R
Name: Name, Length: 1000, dtype: object

The only thing left to do now is to assign these names to their own columns in our dataframe.

In [162]:
boston['First Name'] = boston.Name.str.split(', ').str.get(1)

In [163]:
boston['Last Name'] = boston.Name.str.split(', ').str.get(0)

In [164]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran,First Name,Last Name
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,,Geoffrey,Kirui
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,,Galen,Rupp
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,,Suguru,Osako
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,,Shadrack,Biwott
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0,Wilson,Chebet


How would we do the opposite of this, and concatenate two strings from different columns together? 

Suppose we wanted to combine the "M/F" and "Age" columns into a single column. We could do this using the `str.cat()` method (short for concatenate). Because "Age" is numeric, we first convert it to strings with `astype(str)`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.cat.html

In [165]:
boston['M/F'].str.cat(boston.Age.astype(str), sep = '_')

0      M_24
1      M_30
2      M_25
3      M_32
4      M_31
       ... 
995    M_38
996    M_43
997    M_40
998    M_35
999    M_35
Name: M/F, Length: 1000, dtype: object

In [166]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran,First Name,Last Name
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,,Geoffrey,Kirui
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,,Galen,Rupp
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,,Suguru,Osako
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,,Shadrack,Biwott
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0,Wilson,Chebet
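As the unchanged `head()` output suggests, `str.cat()` returns a new Series and leaves the dataframe alone. To keep the result, it has to be assigned to a column. The sketch below uses a hypothetical three-row frame, and the column name `Sex_Age` is made up for this illustration.

```python
import pandas as pd

# Hypothetical miniature of the boston dataframe
df = pd.DataFrame({'M/F': ['M', 'F', 'M'], 'Age': [24, 30, 25]})

combined = df['M/F'].str.cat(df['Age'].astype(str), sep='_')
print(list(df.columns))        # still ['M/F', 'Age']; cat() returned a new Series

# 'Sex_Age' is an invented column name, used only for this example
df['Sex_Age'] = combined
print(df['Sex_Age'].tolist())  # ['M_24', 'F_30', 'M_25']
```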


## More Split Parameters

In this lecture we'll cover additional parameters in the `split()` method.

Let's start by dropping the "First Name" and "Last Name" columns that we created in the previous lecture, as we're about to discover a new way to create those columns.

In [190]:
boston.drop(labels = ['First Name', 'Last Name'], axis = 1, inplace = True)

In [191]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0


The first parameter we'll explore is `expand`. When set to True, the `split()` method returns a dataframe that has as many columns as the component strings were split into. Compare this to the behavior of `split()` without this parameter, where each name returned a list of substrings.

In [192]:
boston.Name.str.split(', ', expand = True)

Unnamed: 0,0,1
0,Kirui,Geoffrey
1,Rupp,Galen
2,Osako,Suguru
3,Biwott,Shadrack
4,Chebet,Wilson
...,...,...
995,Larosa,Mark
996,Williamson,Jerry A
997,Mccue,Daniel T
998,Larosa,John


What happens if we leave out the split pattern altogether?

In [193]:
boston.Name.str.split(expand = True)

Unnamed: 0,0,1,2,3,4
0,"Kirui,",Geoffrey,,,
1,"Rupp,",Galen,,,
2,"Osako,",Suguru,,,
3,"Biwott,",Shadrack,,,
4,"Chebet,",Wilson,,,
...,...,...,...,...,...
995,"Larosa,",Mark,,,
996,"Williamson,",Jerry,A,,
997,"Mccue,",Daniel,T,,
998,"Larosa,",John,,,


In this case we get a five-column dataframe, the reason being that some runners have names with more than two substrings, and at least one runner has a name with 5 substrings that are split by a whitespace. Can we identify these long-named folks?

One way we can do this is by running the `.count()` method across the columns (axis = 1), which returns the number of non-null values in each row. Anyone with 3 or more non-null values has a three-component or longer name.



In [194]:
boston.Name.str.split(expand = True).count(axis = 1)

0      2
1      2
2      2
3      2
4      2
      ..
995    2
996    3
997    3
998    2
999    3
Length: 1000, dtype: int64

How do we isolate people with, for instance, 5-component names? We can do this by applying a condition to the count.

In [195]:
boston.Name.str.split(expand = True).count(axis = 1) == 5

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

This returns a boolean mask, which we can then pass as a selector to the "Name" series.

In [196]:
boston.Name[boston.Name.str.split(expand = True).count(axis = 1) == 5]

203    Cifuentes Fetiva, Miguel Angel Sr.
467      Martinez Solano, Juan Manuel Jr.
678        Melendez, Carlos Manuel M. Sr.
733        Castano Gonzalez, Angel U. Sr.
Name: Name, dtype: object

Thus, here are the folks who have long, five-component names.

The `split()` method also has a parameter called `n`, which limits the number of splits that are performed. The result therefore contains at most `n + 1` substrings.

In [197]:
boston.Name.str.split(expand = True)

Unnamed: 0,0,1,2,3,4
0,"Kirui,",Geoffrey,,,
1,"Rupp,",Galen,,,
2,"Osako,",Suguru,,,
3,"Biwott,",Shadrack,,,
4,"Chebet,",Wilson,,,
...,...,...,...,...,...
995,"Larosa,",Mark,,,
996,"Williamson,",Jerry,A,,
997,"Mccue,",Daniel,T,,
998,"Larosa,",John,,,


By default there is no limit, so we get the maximum number of substrings. If we want to change that, we can set the `n` parameter. For instance, if we set it to 2, at most two splits are made and we get 3 columns.

In [198]:
boston.Name.str.split(expand = True, n = 2)

Unnamed: 0,0,1,2
0,"Kirui,",Geoffrey,
1,"Rupp,",Galen,
2,"Osako,",Suguru,
3,"Biwott,",Shadrack,
4,"Chebet,",Wilson,
...,...,...,...
995,"Larosa,",Mark,
996,"Williamson,",Jerry,A
997,"Mccue,",Daniel,T
998,"Larosa,",John,
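The effect of `n` is easiest to see on a long name. The sketch below uses a two-element hypothetical Series, with names copied from earlier outputs, and limits the whitespace split to a single split so that everything after the first break stays together.

```python
import pandas as pd

# Names copied from earlier outputs; the Series itself is a stand-in for boston.Name
names = pd.Series(['Melendez, Carlos Manuel M. Sr.', 'Rupp, Galen'])

# n=1 allows only one split, so each name yields at most two pieces
parts = names.str.split(n=1, expand=True)
print(parts)
```

The first column keeps the trailing comma ("Melendez,") because the split happens on whitespace, not on ", ".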


Let's return now to our ', ' split so that we get two-component names.

In [199]:
boston.Name.str.split(', ', expand = True)

Unnamed: 0,0,1
0,Kirui,Geoffrey
1,Rupp,Galen
2,Osako,Suguru
3,Biwott,Shadrack
4,Chebet,Wilson
...,...,...
995,Larosa,Mark
996,Williamson,Jerry A
997,Mccue,Daniel T
998,Larosa,John


Now, how do we incorporate our columns in this dataframe into our original dataframe? 

We could use `join()` or `concat()`, but we can also use a *direct assignment* approach. This is also called *setting with enlargement* in Pandas: Pandas checks whether the columns exist in the dataframe. If they do, their contents are overwritten; if they do not, the columns are created.

In [200]:
boston[['Last Name', 'First Name']] = boston.Name.str.split(', ', expand = True)

In [201]:
boston.head()

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran,Last Name,First Name
0,"Kirui, Geoffrey",24,M,Keringet,,KEN,2:09:37,1,1,,Kirui,Geoffrey
1,"Rupp, Galen",30,M,Portland,OR,USA,2:09:58,2,2,,Rupp,Galen
2,"Osako, Suguru",25,M,Machida-City,,JPN,2:10:28,3,3,,Osako,Suguru
3,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,2:12:08,4,4,,Biwott,Shadrack
4,"Chebet, Wilson",31,M,Marakwet,,KEN,2:12:35,5,5,2015.0,Chebet,Wilson
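Both branches of that rule, create when missing and overwrite when present, can be seen in a small sketch. The two-row frame is hypothetical, and the upper-casing step exists only to make the overwrite visible.

```python
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({'Name': ['Kirui, Geoffrey', 'Rupp, Galen']})

# The columns do not exist yet, so this assignment creates them
df[['Last Name', 'First Name']] = df['Name'].str.split(', ', expand=True)

# They exist now, so a second assignment overwrites their contents
df[['Last Name', 'First Name']] = df['Name'].str.upper().str.split(', ', expand=True)
print(df)
```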


## Skill Challenge #1

#### 1. How many runners in our dataset have "James" as a last name?

We've already done the hard work of adding "Last Name" as its own column in our dataframe. All we really need to do now is query it for the name "James", which we can do using the `loc[]` indexer.

In [210]:
boston.loc[boston['Last Name'] == "James"]

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran,Last Name,First Name


It looks like there are no runners with the last name of James. To do this more computationally, we can simply perform a `count()` on the result.

In [211]:
boston.loc[boston['Last Name'] == "James"].count()

Name             0
Age              0
M/F              0
City             0
State            0
Country          0
Official Time    0
Overall          0
Gender           0
Years Ran        0
Last Name        0
First Name       0
dtype: int64
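`count()` reports one number per column. When a single number is all that's needed, summing the boolean mask is a compact alternative. The sketch below runs on a hypothetical three-row frame, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the "Last Name" column
df = pd.DataFrame({'Last Name': ['Lloyd', 'Baek', 'Rupp']})

n_james = (df['Last Name'] == 'James').sum()   # each True counts as 1
n_rupp = (df['Last Name'] == 'Rupp').sum()
print(n_james, n_rupp)  # 0 1
```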

Do any runners have the First Name of James? Let's find out. 

In [212]:
boston.loc[boston['First Name'] == "James"]

Unnamed: 0,Name,Age,M/F,City,State,Country,Official Time,Overall,Gender,Years Ran,Last Name,First Name
243,"Lloyd, James",24,M,San Diego,CA,USA,2:42:38,244,220,2016.0,Lloyd,James
574,"Onigkeit, James",49,M,Rochester,MN,USA,2:49:48,575,537,2016.0,Onigkeit,James
650,"O'Sullivan, James",32,M,Arvada,CO,USA,2:51:15,651,611,2016.0,O'Sullivan,James
923,"Baek, James",23,M,Indianapolis,IN,USA,2:55:12,924,873,2016.0,Baek,James
976,"Blowers, James",45,M,Cary,NC,USA,2:55:57,977,922,,Blowers,James


Yes, it appears that five runners have the first name "James".

#### 2. Split all of the "City" names in the dataset by the hyphen character, and create a dataframe containing each split component of the split name. Assign this dataframe to the variable `city_parts`.

We'll accomplish this using the `str.split()` method, passing in a hyphen as the split string and setting `expand` to `True`.

In [184]:
boston.City.str.split('-', expand = True)

Unnamed: 0,0,1,2,3
0,Keringet,,,
1,Portland,,,
2,Machida,City,,
3,Mammoth Lakes,,,
4,Marakwet,,,
...,...,...,...,...
995,North Andover,,,
996,Raleigh,,,
997,Arlington,,,
998,Danbury,,,


Let's assign this to the variable as required by the prompt.

In [185]:
city_parts = boston.City.str.split('-', expand = True)

#### 3. Determine the number of cities in the `boston` dataframe that have more than 1 component, and identify those cities.

Let's start by querying our `city_parts` variable with a conditional, where we want to identify cities that have more than one component in their name.

In [186]:
city_parts.count(axis = 1) > 1

0      False
1      False
2       True
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

With this boolean mask in hand, we can now select the cities that have compound names (at least when separated by hyphens). We can do this selection either from `city_parts` or from `boston`. Both approaches are shown below.

In [187]:
city_parts[city_parts.count(axis = 1) > 1]

Unnamed: 0,0,1,2,3
2,Machida,City,,
35,Sao Paulo,Sp,,
188,Baie,St,Paul,
201,Houghton,Le,Spring,
371,Boulogne,Billancourt,,
420,Mont,Royal,,
585,Gif,Sur,Yvette,
615,Fossambault,Sur,Le,Lac
724,Wiesbaden,Breckenheim,,
727,Saint,Tite,,


In [189]:
boston[city_parts.count(axis = 1) > 1]['City']

2                    Machida-City
35                 Sao Paulo - Sp
188                  Baie-St-Paul
201            Houghton-Le-Spring
371          Boulogne-Billancourt
420                    Mont-Royal
585                Gif-Sur-Yvette
615        Fossambault-Sur-Le-Lac
724         Wiesbaden-Breckenheim
727                    Saint-Tite
794                   Marica - Rj
820    Sainte-Catherine-De-Hatley
830                    Pont-Rouge
Name: City, dtype: object
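The Series above lists 13 hyphenated cities. To get that number without counting rows by eye, the boolean mask can be summed. The sketch below shows the idea on a small hypothetical sample, with a few city names copied from the output above.

```python
import pandas as pd

# Hypothetical sample; some values are copied from the output above
cities = pd.Series(['Keringet', 'Machida-City', 'Baie-St-Paul', 'Portland'])

city_parts = cities.str.split('-', expand=True)
is_compound = city_parts.count(axis=1) > 1   # more than one non-null component

print(is_compound.sum())              # 2
print(cities[is_compound].tolist())   # ['Machida-City', 'Baie-St-Paul']
```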