## Manipulating Strings in Python

## Extracting different categories of character from a string

The `.isdigit()` method tests whether every character in a string is a digit; it returns `False` for an empty string. It is one of a family of string methods, including `.isalpha()`, `.isnumeric()`, `.isspace()`, and `.isalnum()`, that are useful for picking out different kinds of characters in a string. For a list of all string methods and more detail on each of the above, see the [Python 3 string method documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).
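
As a quick point of comparison, the sketch below runs several of these methods over a handful of strings. The sample strings and the helper name `classify` are illustrative choices, not part of the original notebook.

```python
def classify(s):
    """Return the result of several character-class tests for one string."""
    return {
        'isdigit': s.isdigit(),
        'isalpha': s.isalpha(),
        'isalnum': s.isalnum(),
        'isspace': s.isspace(),
        'isnumeric': s.isnumeric(),
    }

# Hypothetical sample strings chosen to show where the methods disagree.
samples = ['123', 'abc', 'abc123', '   ', '12.5', '']

for s in samples:
    print(repr(s), classify(s))
```

Two details are easy to miss: `'12.5'` fails `.isdigit()` because the decimal point is not a digit, and every one of these methods returns `False` for the empty string.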

In [1]:
import pandas as pd

sample_string = 'take 5'
print(sample_string.isdigit())

# Experiment on the sample string below using the other methods listed above.
# What does each do? When would they be useful? What happens when you try them
# on different slices of the sample string, like sample_string[:4]?

False


In [2]:
m = '123'
print(m.isdigit())

True


In [3]:
p = 'Hello world'
print(p.isalpha())

False


In [10]:
n = ' '
print(n.isspace())

True


In [4]:
p2 = 'Helloworld'
print(p2.isalpha())

True


In [7]:
a = '365days'
print(a[:2].isnumeric())

True


In [11]:
print(a.isalnum())

True


### Apply

When cleaning data, you will be dealing with data frames made up of variables containing strings, rather than individual strings directly. But `isdigit()` and the other methods described above are string methods, _not_ data frame methods. 

This is where the [Pandas `.apply()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) steps in. `.apply()` lets us specify a function and _apply_ that function to each element in the series we're working with.

**Apply** applies a function to each element of a series. (Called on a data frame, `.apply()` passes whole columns or rows to the function instead of individual elements.)

In [12]:
# Create a series of dirty, annoying values.
money = pd.Series([400, 111, '$20', 57, 'Lots'])

# Running `money.isdigit()` raises an error because .isdigit() is a string
# method, _not_ a series method. The line below produces the error shown.

print(money.isdigit())

AttributeError: 'Series' object has no attribute 'isdigit'

In [14]:
# Instead, let's define a new function that takes a string as an argument
# and returns True if the string is all digits, otherwise False.

def is_a_digit(x):
    # First make sure we're operating on a string, then use our string method.
    return str(x).isdigit()

# Now let's apply our custom function to each element in our series.
print(money.apply(is_a_digit))

0     True
1     True
2    False
3     True
4    False
dtype: bool
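
For this particular check, pandas also exposes string methods on a series through the `.str` accessor, so a custom function is not strictly required. The sketch below assumes the same `money` series as above and converts every value to a string first, because `.str` methods return missing values for elements that are not strings. The names `as_text`, `all_digits`, and `via_apply` are illustrative.

```python
import pandas as pd

money = pd.Series([400, 111, '$20', 57, 'Lots'])

# Convert every element to a string, then call the vectorized .str.isdigit().
as_text = money.astype(str)
all_digits = as_text.str.isdigit()
print(all_digits)

# Compare against the element-wise .apply() approach.
via_apply = money.apply(lambda x: str(x).isdigit())
print(all_digits.tolist() == via_apply.tolist())
```

`.apply()` remains the more general tool, since it accepts any function rather than only the methods the `.str` accessor provides.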


### Lambda functions

In the example above we defined a very small function that does basically the same thing as `.isdigit()` but that isn't tied to strings as a method. We had to stretch to find a good name for it because it reproduces functionality we already have. Frequently we'll want to define this new function on the fly rather than using `def` to create a named function like the above that sticks around. That's where lambda functions come in.

**Lambda functions** are small, temporary, unnamed (sometimes called "anonymous") functions created with the `lambda` keyword. Let's look at an example that's identical to the above except using a lambda function instead of a named function.

In [15]:
# Here's a lambda function that mirrors the is_a_digit function above.
# Read this print statement carefully and compare to the previous one.
print(money.apply(lambda x: str(x).isdigit()))

0     True
1     True
2    False
3     True
4    False
dtype: bool


The key here is this bit: `lambda x: str(x).isdigit()`. To break down the syntax, you start with the `lambda` keyword, then follow with the parameters for your function. Here we use just `x`, but we could have multiple parameters separated by commas, like `x, y`, if we wanted. Next comes a colon `:` followed by the expression that would normally be in our function body and be preceded by `return`: `str(x).isdigit()`. Here we omit the `return` keyword; a lambda function returns the value of its expression automatically.
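
To make the correspondence concrete, the sketch below writes the same check twice, once with `def` and once with `lambda`, and adds a two-parameter lambda. The names `is_a_digit_lambda`, `join_with_dash`, and `values` are made up for illustration; binding a lambda to a name is done here only so the two forms can be compared side by side.

```python
# Named function: explicit `return`.
def is_a_digit(x):
    return str(x).isdigit()

# Equivalent lambda: the expression after the colon is the return value.
is_a_digit_lambda = lambda x: str(x).isdigit()

values = [400, '$20', 'Lots']
print([is_a_digit(v) for v in values])
print([is_a_digit_lambda(v) for v in values])

# A lambda with two parameters, separated by a comma.
join_with_dash = lambda x, y: str(x) + '-' + str(y)
print(join_with_dash('take', 5))
```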

If you'd like more information on lambda functions in Python see the (very terse) [Python documentation](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions), or this [detailed tutorial](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/).

### Filter 

We can use the [built-in `filter()` function](https://docs.python.org/3/library/functions.html#filter) to extract parts of our strings. For example, we can use it to extract just the digit information. 

The `filter()` function expects two arguments. The first argument is a function that takes a single input and returns a boolean value, and the second argument is an iterable whose elements are each fed into that function and evaluated to a boolean. The `filter()` function returns an iterator that yields only the elements for which the function returned `True`.
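
The iterator behavior is worth seeing once on a plain string before moving to a series. This is a minimal sketch; the sample text and the helper name `keep_digits` are assumptions made for illustration.

```python
def keep_digits(text):
    """Return only the digit characters of text, joined back into a string."""
    return ''.join(filter(str.isdigit, text))

# Iterating over a string yields one character at a time.
digits_only = filter(str.isdigit, 'take 5 or 10')

first_pass = list(digits_only)
second_pass = list(digits_only)  # the iterator is already exhausted

print(first_pass)
print(second_pass)
print(keep_digits('$20'))
```

Because the iterator can be consumed only once, wrap the result in `list()` (or join it) when the values are needed more than once.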

In our example we can use `filter()` directly on our series to filter out all the elements that aren't nice and clean, or we can _apply_ filter to each element of our series to extract the numeric parts. Let's see both approaches and stick with our lambda functions.

In [18]:
# We're using list() on the result because filter() returns an iterator.

# Keep only the elements of money for which the lambda returns True.
print('Filtering the whole series:')
print(list(filter(lambda x: str(x).isdigit(), money)))

# Here filter() checks each character of the string, so we use ''.join() to
# stitch the surviving digits back together.
print('\nApplying filter() to each value in the series:')
print(money.apply(lambda x: ''.join(list(filter(str.isdigit, str(x))))))

Filtering the whole series:
[400, 111, 57]

Applying filter() to each value in the series:
0    400
1    111
2     20
3     57
4       
dtype: object


### Splitting strings apart

Sometimes we want to keep all the information in a string but divide it up into several pieces. The `.split()` string method takes the character or substring to split on and returns a list of the pieces of the string it's called on, with the separator itself removed. Conveniently, Pandas gives us its [own version of this built-in method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) with `Series.str.split()`, which you can use directly on series objects without needing `.apply()`.
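
For reference, here is how the built-in string method behaves on a single value. The sketch assumes one record in the same `name$email` layout used below; the variable names are illustrative.

```python
record = 'MollyMalone$molmal@gmail.com'

# Split once on '$', then split the email part on '@'.
name, email = record.split('$')
user, domain = email.split('@')
print(name, user, domain)

# An optional second argument limits the number of splits.
parts_limited = 'a$b$c'.split('$', 1)
print(parts_limited)

# With no arguments, .split() splits on runs of whitespace.
tokens = '  take   5  '.split()
print(tokens)
```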

In [20]:
# Create a series of dirty, annoying strings.
words = pd.Series([
    'MollyMalone$molmal@gmail.com',
    'JeffreyJones$jefjo@hotmail.com',
    'DeadParrot$fjords@gmail.com'
])

# Split on '$'. We'll use the Pandas split method.
word_split = words.str.split('$', expand=True)
names = word_split[0]
emails = word_split[1]
print(names, '\n')
print(emails)

0     MollyMalone
1    JeffreyJones
2      DeadParrot
Name: 0, dtype: object 

0     molmal@gmail.com
1    jefjo@hotmail.com
2     fjords@gmail.com
Name: 1, dtype: object


In [21]:
# Splitting on capital letters.
# Just because we can doesn't mean we should:
print(names.str.split('[A-Z]', expand=True))

  0       1      2
0      olly  alone
1    effrey   ones
2       ead  arrot


That example is obviously problematic: the capital letters used as separators are discarded. A better way to divide on capital letters, one that retains the separator character, is the `findall()` function from the `re` module. The regular expression `[A-Z][a-z]*` says to find each instance where an uppercase letter is followed by zero or more lowercase letters, and `re.findall()` returns those instances as items in a list.
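
The difference is easiest to see on a single string. The sketch below also wraps the lookup in a hypothetical helper, `first_and_last`, that returns `None` instead of raising an `IndexError` when fewer than two capitalized pieces are found; that guard is an addition for illustration, not part of the original notebook.

```python
import re

name = 'MollyMalone'

# Splitting on capitals throws the capitals away.
split_result = re.split(r'[A-Z]', name)

# findall() keeps each capital together with the lowercase run after it.
findall_result = re.findall(r'[A-Z][a-z]*', name)

print(split_result)
print(findall_result)

def first_and_last(text):
    """Return (first, last) from a CamelCase name, or None if it can't."""
    parts = re.findall(r'[A-Z][a-z]*', text)
    if len(parts) < 2:
        return None
    return parts[0], parts[1]

print(first_and_last('DeadParrot'))
print(first_and_last('Cher'))
```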

In [22]:
import re

# We expect the first name to follow the first capital letter.
firstname = names.apply(lambda x: re.findall('[A-Z][a-z]*', x)[0])

# We expect the last name to follow the second capital letter.
lastname = names.apply(lambda x: re.findall('[A-Z][a-z]*', x)[1])

print(firstname, '\n')
print(lastname)

0      Molly
1    Jeffrey
2       Dead
Name: 0, dtype: object 

0    Malone
1     Jones
2    Parrot
Name: 0, dtype: object


## Changing the content of strings

### Replace

We can replace specific characters or combinations of characters with a new string, or with nothing at all. Again, [Pandas gives us](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) the useful `Series.str.replace()` so we don't need to apply the built-in Python `.replace()` string method.

In [None]:
print(emails.str.replace('@', ' at '),'\n')

# regex=False treats '.com' as literal text rather than a regular expression.
print(emails.str.replace('.com', '', regex=False))
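
One detail worth checking with `Series.str.replace()` is whether the pattern is treated as literal text or as a regular expression, since the default for the `regex` parameter has changed across pandas releases. In a regular expression, `.` matches any character, so `.com` can match more than intended. The sketch below uses a made-up address chosen to expose the difference.

```python
import pandas as pd

# Hypothetical address in which 'com' also appears before the '@'.
addresses = pd.Series(['telecom@example.com'])

# Literal match: only the exact text '.com' is removed.
literal = addresses.str.replace('.com', '', regex=False)

# Regex match: '.' matches any character, so 'ecom' is removed as well.
as_regex = addresses.str.replace('.com', '', regex=True)

print(literal[0])
print(as_regex[0])
```

Passing `regex` explicitly keeps the behavior the same regardless of which pandas version runs the code.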

### Changing case

If capitalization varies inconsistently in a variable, it may be useful to eliminate the inconsistency by converting each string to a specific case:

In [23]:
print(names.str.lower(), '\n')
print(names.str.upper(), '\n')
print(names.str.capitalize())

0     mollymalone
1    jeffreyjones
2      deadparrot
Name: 0, dtype: object 

0     MOLLYMALONE
1    JEFFREYJONES
2      DEADPARROT
Name: 0, dtype: object 

0     Mollymalone
1    Jeffreyjones
2      Deadparrot
Name: 0, dtype: object
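
A typical reason to normalize case is that values differing only in capitalization are otherwise counted as distinct. This is a small sketch with a made-up `animals` series.

```python
import pandas as pd

# Hypothetical values that differ only in capitalization.
animals = pd.Series(['Duck', 'duck', 'DUCK', 'Goose', 'goose'])

# Before normalizing, every spelling counts as its own category.
print(animals.nunique())

# After lowercasing, the spellings collapse into two categories.
normalized = animals.str.lower()
print(normalized.nunique())
print(normalized.value_counts())
```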


### Stripping whitespace

Another common pain in dirty data is irregularly applied whitespace at the beginning or end of a string. The best way to deal with this is to remove leading and trailing whitespace with the `.strip()` string method. We can strip a single string:

In [25]:
spacy = '   What, on earth, is going on here?      '
print(spacy)
print(spacy.strip())

   What, on earth, is going on here?      
What, on earth, is going on here?
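
Python strings also have one-sided variants, `.lstrip()` and `.rstrip()`, and all three methods accept an optional argument naming the characters to remove instead of whitespace. The sample values below are illustrative.

```python
padded = '   duck  '

# repr() makes the remaining whitespace visible.
print(repr(padded.lstrip()))  # leading whitespace removed
print(repr(padded.rstrip()))  # trailing whitespace removed
print(repr(padded.strip()))   # both ends removed

# Passing characters strips those instead of whitespace.
price = '$20'
print(price.strip('$'))

# Whitespace inside the string is never touched.
print(repr(' on  earth '.strip()))
```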


In [34]:
# Series of strings with annoying whitespace.
words = pd.Series([' duck', 'duck ', ' goose ', 'goose'])

print(words)
print('\n')

stripped = words.str.strip()


print(stripped)

0       duck
1      duck 
2     goose 
3      goose
dtype: object


0     duck
1     duck
2    goose
3    goose
dtype: object


In the following Challenge, you'll get lots of opportunities to hone your string-wrangling skills.  If you'd like to practice or compose regular expressions, check out [RegExr](http://regexr.com/), a wonderful interactive website that lets you tinker with regular expressions and visually breaks down the pattern matching.