# <font color='#eb3483'> Data Wrangling - Strings </font>
Text data is going to pop up again and again in your analysis (you're reading textual data right now!). In this module we're going to explore some handy functions for working with strings in Python.

## <font color='#eb3483'> Basic String Operations </font>
Base Python (i.e. Python without any additional packages) already provides a lot of great functions for working with text data. Let's do a lightning-quick review of some handy ones.

### <font color='#eb3483'> Changing Cases:  `upper`, `lower`, `capitalize` </font>
We can use Python to easily change the case (i.e. uppercase vs lowercase) of strings.

In [1]:
title = "introduction to PYTHON"

print("We can convert a string to uppercase with upper()")
print(title.upper())

print("We can convert a string to lowercase with lower()")
print(title.lower())

print("We can convert the first letter to upper case with capitalize()")
print(title.capitalize())

We can convert a string to uppercase with upper()
INTRODUCTION TO PYTHON
We can convert a string to lowercase with lower()
introduction to python
We can convert the first letter to upper case with capitalize()
Introduction to python
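Two related built-in methods that this section does not cover are `title()` and `casefold()`. The snippet below is just a small sketch of how they behave (the variable names are illustrative only):

```python
title = "introduction to PYTHON"

# title() capitalizes the first letter of every word and lowercases the rest
title_cased = title.title()
print(title_cased)  # Introduction To Python

# casefold() is a more aggressive lower(), meant for case-insensitive comparisons
same = "PYTHON".casefold() == "python".casefold()
print(same)  # True
```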



### <font color='#eb3483'> Removing Characters:  `strip` </font>
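`strip` lets us remove characters from the beginning and end of a string. Called with no arguments it removes whitespace, which is super handy when your data is surrounded by unnecessary spaces; called with an argument it removes those characters instead.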

In [2]:
name_with_commas = ",Connor,"

print("We can use `strip()` to remove characters from the beginning and the end of a string")
print(name_with_commas.strip(","))

We can use `strip()` to remove characters from the beginning and the end of a string
Connor
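To make the whitespace case concrete, here is a minimal sketch using made-up strings. It also shows `lstrip` and `rstrip`, which only touch one side, and that the argument to `strip` is treated as a set of characters rather than a substring:

```python
padded = "   Connor \n"

# With no argument, strip() removes leading and trailing whitespace
stripped = padded.strip()
print(repr(stripped))         # 'Connor'

# lstrip() and rstrip() only clean one side
print(repr(padded.lstrip()))  # 'Connor \n'
print(repr(padded.rstrip()))  # '   Connor'

# The argument is a set of characters to remove, not a substring
messy = "!,Connor,!"
cleaned = messy.strip(",!")
print(cleaned)                # Connor
```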


### <font color='#eb3483'> Splitting Strings:  `split` </font>
We can also break long strings into smaller strings by using the split function.

In [3]:
# We can split a string in multiple strings by using `split()`

sentence = "I love yoga"
words = sentence.split(" ")
print(words)

['I', 'love', 'yoga']
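One detail worth knowing: `split(" ")` and `split()` are not quite the same thing. The sketch below uses a made-up sentence with extra spaces to show the difference, plus the optional `maxsplit` argument:

```python
spaced = "I  love   yoga"

# Splitting on a single space keeps empty strings between repeated spaces
on_space = spaced.split(" ")
print(on_space)       # ['I', '', 'love', '', '', 'yoga']

# With no argument, split() treats any run of whitespace as one separator
on_whitespace = spaced.split()
print(on_whitespace)  # ['I', 'love', 'yoga']

# maxsplit limits how many splits are made
first_and_rest = "a,b,c,d".split(",", maxsplit=1)
print(first_and_rest)  # ['a', 'b,c,d']
```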


### <font color='#eb3483'> Replacing Characters:  `replace` </font>
`replace` lets us replace parts of a string with a different string.

In [4]:
print("We use `replace()` to replace parts of a string for something else")
print(name_with_commas.replace("nor", "man"))

We use `replace()` to replace parts of a string for something else
,Conman,


### <font color='#eb3483'> Other Functions </font>
The above is hardly an exhaustive list of the built-in string functions in Python. Some other handy ones are:
1. `count`: Returns the number of times a substring occurs in a string.
2. `join`: Uses the string as a delimiter for concatenating a sequence of other strings.
3. `endswith` (`startswith`): Returns True if the string ends with a given suffix (starts with a given prefix).

Check out Chapter 7 in this great [data science reference book](https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf) for a more exhaustive list.
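Here is a quick sketch of `count`, `join`, `endswith` and `startswith` in action. The example strings are made up for illustration:

```python
sentence = "I love yoga and I love tea"

# count: how many times does the substring appear?
n_love = sentence.count("love")
print(n_love)  # 2

# join: the string we call join on becomes the delimiter
joined = "-".join(["I", "love", "yoga"])
print(joined)  # I-love-yoga

# endswith / startswith: simple True/False checks
path = "data/Emails.csv"
is_csv = path.endswith(".csv")
in_data_folder = path.startswith("data/")
print(is_csv, in_data_folder)  # True True
```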

We can also combine multiple functions together in a sequence (called "chaining").

In [5]:
print("We can chain methods together")
print(
      name_with_commas
      .strip(",")
      .replace("nor", "man")
      .upper())


We can chain methods together
CONMAN


### <font color='#eb3483'> Quick knowledge check! </font>
1. (Challenge) I typed up a super quick grocery list on my phone. Can you clean it (i.e. capitalize each word, remove the random ! marks caused by my chubby fingers, and print the list with each item separated by a pipe '|')? Your final result should look something like: "Avocado|Chocolate|Grapes"

In [6]:
groceries = "avoca!do Cho!!coLaTe Gr!apeS"

modified_list = groceries.lower().replace("!", "")
word_list = modified_list.split()

capitalized = [word.capitalize() for word in word_list]

final = "|".join(capitalized)

print(final)


Avocado|Chocolate|Grapes


## <font color='#eb3483'> Vectorized String Operations </font>
Python's built-in string functions are great for working with individual or small lists of strings. But what if we have an array or Series of strings (i.e. as a column in our dataframe)? Let's consider a small list of emails where we want to extract the name.

In [7]:
emails = ['trudeau@canada.gov', 'macron@france.gov', 'merkel@germany.gov']

Well, we could start by just applying our functions in a loop (we'll use a list comprehension, the nice compact way of writing a for loop that builds another list).

In [8]:
[email.split('@')[0] for email in emails]

['trudeau', 'macron', 'merkel']

Great, that seems simple enough! But what if we have missing values (like we often do in real data)?

In [9]:
real_emails = ['trudeau@canada.gov', None, 'macron@france.gov', 'merkel@germany.gov']
[email.split('@')[0] for email in real_emails]

AttributeError: 'NoneType' object has no attribute 'split'

Uh oh - now we have an error! Our string functions only work on strings, not on None objects. Luckily, we can turn to our trusty friend pandas for some elegant ways to not only deal with vectors of strings effortlessly, but also handle missing values with its trademark grace.
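For comparison, it helps to see what the base-Python fix would look like. One possible workaround (a sketch, not the only way) is to guard each element by hand inside the list comprehension:

```python
real_emails = ['trudeau@canada.gov', None, 'macron@france.gov', 'merkel@germany.gov']

# Only call split() on actual strings; pass missing values through unchanged
names = [email.split('@')[0] if email is not None else None
         for email in real_emails]
print(names)  # ['trudeau', None, 'macron', 'merkel']
```

This works, but the guard has to be repeated for every operation, which is exactly the bookkeeping pandas takes off our hands.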

### <font color='#eb3483'> Pandas and String Operations </font>
Pandas provides vectorized operations (i.e. functions that work quickly on whole arrays of data) for all of the string operations we've already covered. To apply vectorized string operations to a Series, you access the `str` attribute of your Series, which gives you a whole host of functions (e.g. `mySeries.str.capitalize()`).

Let's see it in action!

In [10]:
import pandas as pd
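As a quick warm-up before loading real data, here is a sketch that applies the `str` accessor to the toy email list from earlier (including the missing value). The variable names are just for illustration:

```python
import pandas as pd

# The same toy list as before, missing value included, wrapped in a Series
real_emails = pd.Series(
    ['trudeau@canada.gov', None, 'macron@france.gov', 'merkel@germany.gov']
)

# Split each email on '@' and keep the first piece; the missing entry stays missing
names = real_emails.str.split('@').str.get(0)
print(names)
```

No error this time: the missing entry is simply carried through as a missing value.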

We'll work through some examples with a pretty cool dataset - all of Hillary Clinton's emails! One of the awesome things about transparent governments is that there's a whole host of cool public data you can sift through.

In [11]:
# Let's load in our data (we're only going to look at a few fields, but feel free to play around with the whole data)
emails = pd.read_csv('data/Emails.csv')[['MetadataFrom', 'MetadataTo']]

In [12]:
# And take a peek at it
emails.head()

Unnamed: 0,MetadataFrom,MetadataTo
0,"Sullivan, Jacob J",H
1,,H
2,"Mills, Cheryl D",;H
3,"Mills, Cheryl D",H
4,H,"Abedin, Huma"


Notice that we have a lot of missing values, so our plain Python string functions won't quite cut it. Let's start by removing commas from the sender of each email.

In [13]:
# To use a vectorized string operation we first access the str attribute and then call the function
emails['SenderClean'] = emails['MetadataFrom'].str.replace(',','') #save our result to a new column
emails.head() #check out our new column!

Unnamed: 0,MetadataFrom,MetadataTo,SenderClean
0,"Sullivan, Jacob J",H,Sullivan Jacob J
1,,H,
2,"Mills, Cheryl D",;H,Mills Cheryl D
3,"Mills, Cheryl D",H,Mills Cheryl D
4,H,"Abedin, Huma",H


The syntax is similar for any of the vectorized string functions (e.g. we could have used `strip`, `capitalize`, `lower`, or `pad` instead of `replace`). Check out a full list of vectorized string operations [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html).
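To illustrate, here is a small sketch on a made-up Series (the values imitate the style of the dataset but are not taken from it) showing a few other `str` methods, including chaining two of them:

```python
import pandas as pd

# Hypothetical sample values, including one missing entry
senders = pd.Series(["  Sullivan, Jacob J ", None, "mills, cheryl d", "H"])

# Chain vectorized methods just like regular string methods
cleaned = senders.str.strip().str.lower()

# len() gives the number of characters in each string
lengths = senders.str.strip().str.len()

# contains() checks for a substring (regex=False means a literal match)
has_comma = senders.str.contains(",", regex=False)

print(cleaned)
print(lengths)
print(has_comma)
```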

Now let's try to extract the last name of the sender.

In [14]:
#Let's start by splitting the names into individual words (aka tokens)
emails['SenderClean'].str.split(' ')

0       [Sullivan, Jacob, J]
1                        NaN
2         [Mills, Cheryl, D]
3         [Mills, Cheryl, D]
4                        [H]
                ...         
7940     [Verma, Richard, R]
7941     [Verma, Richard, R]
7942     [Jiloty, Lauren, C]
7943              [PVerveer]
7944    [Sullivan, Jacob, J]
Name: SenderClean, Length: 7945, dtype: object

Hmm, this is a little tricky. We only want to keep the last name (i.e. the first token), but it's not obvious how to pull it out of each list. Luckily pandas comes to the rescue once again! We can use the `str.get` function to get a specific element of each list.

In [None]:
#Let's split based on a space and then get the 1st (0th) array element
emails['SenderClean'].str.split(' ').str.get(0)

Boo yah! Now we have a way to get all the last names. But what if we want the first name (i.e. the second token)? "H" has just one token - will it throw an error?

In [None]:
#Now we can get the second element (index 1)
emails['SenderClean'].str.split(' ').str.get(1)

Of course not! Pandas is amazing and just gave us a NaN instead of raising an error. Let's save our hard work as new columns.

In [None]:
emails['last'] = emails['SenderClean'].str.split(' ').str.get(0)
emails['first'] = emails['SenderClean'].str.split(' ').str.get(1)
emails.head()
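As an alternative to calling `str.get` once per token, `str.split` accepts `expand=True`, which spreads the tokens across columns in one go. Here is a sketch on a small made-up Series standing in for `SenderClean` (the column names are my own choice):

```python
import pandas as pd

# Hypothetical stand-in for the SenderClean column
sender_clean = pd.Series(["Sullivan Jacob J", None, "Mills Cheryl D", "H"])

# expand=True returns a DataFrame with one column per token
parts = sender_clean.str.split(" ", expand=True)
parts.columns = ["last", "first", "middle"]
print(parts)
```

Rows with fewer tokens (like "H") are padded with missing values rather than raising an error.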

### <font color='#eb3483'> Quick knowledge check! </font>
1. Looks like our data has a few random characters floating around (i.e. ;H instead of H). Can you clean up the MetadataTo column to remove semi-colons?

In [20]:
emails['MetadataTo'] = emails['MetadataTo'].str.replace(";","")

print(emails)

           MetadataFrom    MetadataTo       SenderClean
0     Sullivan, Jacob J             H  Sullivan Jacob J
1                   NaN             H               NaN
2       Mills, Cheryl D             H    Mills Cheryl D
3       Mills, Cheryl D             H    Mills Cheryl D
4                     H  Abedin, Huma                 H
...                 ...           ...               ...
7940   Verma, Richard R             H   Verma Richard R
7941   Verma, Richard R             H   Verma Richard R
7942   Jiloty, Lauren C             H   Jiloty Lauren C
7943           PVerveer             H          PVerveer
7944  Sullivan, Jacob J             H  Sullivan Jacob J

[7945 rows x 3 columns]


## <font color='#eb3483'> Regex </font>
You might have noticed that our dataset has a lot of "H"'s in it (short for Hillary). Why don't we try to replace them with the full name? We'll start by using our vectorized string operations. `replace` seems like a reasonable place to start.

In [None]:
emails['MetadataTo'].str.replace('H', 'Hillary')

Hmm, something's not quite right. We've replaced all of the "H"'s with Hillary, but now Huma Abedin is Hillaryuma Abedin. We need to be a bit smarter about how we replace our H's - we only want to replace standalone H's with Hillary. To do that, we're going to use an amazing pattern matching tool called **regex**. This notebook isn't going to cover regex (it could be its own notebook) - instead I'm going to direct you to some tutorials to learn regex like [this](https://regexone.com/), [this](https://ryanstutorials.net/regular-expressions-tutorial/), or [this](https://www.youtube.com/watch?v=UQQsYXa1EHs). Take some time to go through a tutorial then pop back to this notebook.

![kitty](https://memegenerator.net/img/instances/38645087.jpg)

Great, glad to have you back! Let's revisit our original problem - replacing "H" with Hillary when appropriate. We're going to do that by making sure we're only replacing cases where H is the whole string (i.e. the start of the string is before it, and the end of the string is after it). We can do that using the special values '\A' and '\Z'.

In [17]:
# The 'r' prefix makes a raw string, so backslashes like \A reach the regex engine untouched
# regex=True tells pandas to treat the pattern as a regex rather than a literal string
emails['MetadataTo'].str.replace(r'\A[H]\Z', 'Hillary', regex=True)


0            Hillary
1            Hillary
2                 ;H
3            Hillary
4       Abedin, Huma
            ...     
7940         Hillary
7941         Hillary
7942         Hillary
7943         Hillary
7944         Hillary
Name: MetadataTo, Length: 7945, dtype: object
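To see why the anchors matter, here is a small sketch using Python's built-in `re` module on three made-up recipient strings. It compares a plain replace, the `\A...\Z` pattern used above, and a word-boundary (`\b`) pattern as one possible alternative:

```python
import re

recipients = ["H", ";H", "Abedin, Huma"]

# Plain replace hits every H, including the one inside "Huma"
naive = [r.replace("H", "Hillary") for r in recipients]
print(naive)     # ['Hillary', ';Hillary', 'Abedin, Hillaryuma']

# \A and \Z only match when H is the entire string
anchored = [re.sub(r"\AH\Z", "Hillary", r) for r in recipients]
print(anchored)  # ['Hillary', ';H', 'Abedin, Huma']

# \b matches word boundaries, so a standalone H next to punctuation also matches
boundary = [re.sub(r"\bH\b", "Hillary", r) for r in recipients]
print(boundary)  # ['Hillary', ';Hillary', 'Abedin, Huma']
```

Which pattern is right depends on whether entries like ";H" should count as Hillary too.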

Sweet - looks a lot better now! There are a whole host of other vectorized functions that can use regex, including:

1. `match`: Checks whether each string matches a regex pattern (starting from the beginning of the string)
2. `count`: Counts occurrences of a pattern
3. `split`: Like the plain `split`, except we can feed in a pattern to split on

Check out our [handy data science manual](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html) for more regex functions. To practice and hone your regex skills try your hand at doing some [regex crosswords](https://regexcrossword.com/).
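Here is a rough sketch of `match`, `count` and `split` with regex patterns on a made-up Series. The patterns are my own illustrations, and passing `regex=True` to `str.split` assumes a reasonably recent pandas (1.4 or later):

```python
import pandas as pd

# Hypothetical sample values in the style of the MetadataFrom column
senders = pd.Series(["Sullivan, Jacob J", None, "Mills, Cheryl D", "H"])

# match: does the string start with "word, word"?
is_last_first = senders.str.match(r"\w+, \w+")

# count: how many capital letters are in each string?
n_capitals = senders.str.count(r"[A-Z]")

# split: break on a comma (plus optional spaces) or on runs of whitespace
tokens = senders.str.split(r",\s*|\s+", regex=True)

print(is_last_first)
print(n_capitals)
print(tokens)
```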

### <font color='#eb3483'> Quick knowledge check! </font>
1. Whoops, our dataset got Cheryl Mills's middle initial wrong! Can you use regex to replace the "D" in Cheryl D Mills with an "E"? We only want this change to apply to Cheryl Mills.

In [28]:
emails['SenderClean'] = emails['SenderClean'].str.replace(r'\bMills Cheryl D\b', 'Mills Cheryl E', regex=True)
emails['MetadataFrom'] = emails['MetadataFrom'].str.replace(r'\bMills, Cheryl D\b', 'Mills, Cheryl E', regex=True)

print(emails)

           MetadataFrom    MetadataTo       SenderClean
0     Sullivan, Jacob J             H  Sullivan Jacob J
1                   NaN             H               NaN
2       Mills, Cheryl E             H    Mills Cheryl E
3       Mills, Cheryl E             H    Mills Cheryl E
4                     H  Abedin, Huma                 H
...                 ...           ...               ...
7940   Verma, Richard R             H   Verma Richard R
7941   Verma, Richard R             H   Verma Richard R
7942   Jiloty, Lauren C             H   Jiloty Lauren C
7943           PVerveer             H          PVerveer
7944  Sullivan, Jacob J             H  Sullivan Jacob J

[7945 rows x 3 columns]


2. Take some time to do a few regex crosswords :)