# String Operations

## Introduction

As a data analyst, you will find yourself wrangling with text strings regularly. Categorical variables, documents, and other text-based data often come inconsistently structured. Because of this, it is helpful to know about different methods for transforming, cleaning, and extracting text. Python comes with several tools for performing string operations. In this lesson, we will learn about how to use these tools to work with strings.

## Python String Operations

Thus far in this program, you have seen a few examples here and there that involve string operations in the context of other topics we have covered. In this section, we will cover string operations more comprehensively so that you have a solid understanding of how to use them.

Recall from your Python prework that the + operator concatenates two strings together and that the * operator repeats a string a given number of times.

In [107]:
print('Hello' + 'World')

HelloWorld


In [110]:
print('Hello' * 8)

HelloHelloHelloHelloHelloHelloHelloHello


Recall that you can also join strings in a list together using a designated separator with the join method.

In [112]:
x = 'Happy'
y = 'Puppies'
z = [x,y]
print(z)

['Happy', 'Puppies']


In [120]:
' '.join(z)

'Happy Puppies'
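
One thing to keep in mind is that join expects every item in the list to be a string, so numbers need to be converted first. Below is a minimal sketch of that conversion; the list name counts and its values are made up for illustration.

```python
# Hypothetical data for illustration: a list of integers, not strings.
counts = [3, 7, 12]

# ', '.join(counts) would raise a TypeError because the items are not strings,
# so each item is converted with str() before joining.
joined = ', '.join(str(n) for n in counts)
print(joined)  # prints: 3, 7, 12
```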

We also covered how to get the length of strings and how to subset them via indexing.

In [121]:
word = 'automobile'

In [122]:
print(word[0]) 

a


In [123]:
print(word[5])

o


In [124]:
print(word[-1])

e


In [125]:
print(len(word))

10
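
Indexing also extends to slices, which return a range of characters rather than a single one. The sketch below reuses the word variable from above; the names first_four, last_six, and last_three are just illustrative.

```python
word = 'automobile'

# A slice runs from the start index up to, but not including, the stop index.
first_four = word[0:4]

# Leaving out the stop index means "through the end of the string".
last_six = word[4:]

# Negative indexes count back from the end, so this takes the final three characters.
last_three = word[-3:]

print(first_four)  # prints: auto
print(last_six)    # prints: mobile
print(last_three)  # prints: ile
```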


We can use the split method to turn strings into lists based on a separator that we designate. If no separator is given, the string is split on any run of whitespace.

In [130]:
a = 'They ate the mystery meat. It tasted like chicken.'

print(a.split())

['They', 'ate', 'the', 'mystery', 'meat.', 'It', 'tasted', 'like', 'chicken.']


In [127]:
print(a.split('.'))

['They ate the mystery meat', ' It tasted like chicken', '']


In [128]:
print(a.split('m'))

['They ate the ', 'ystery ', 'eat. It tasted like chicken.']
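
Combining split with the join method from earlier gives us a simple way to tidy up irregular spacing. This is only a sketch, and the messy string is a made-up example.

```python
# Hypothetical input with irregular spacing, including a tab and a newline.
messy = '  They   ate\tthe mystery \n meat.  '

# split() with no argument splits on any run of whitespace and ignores
# leading and trailing whitespace, so no empty strings end up in the list.
words = messy.split()

# Joining the pieces with a single space gives a normalized string.
clean = ' '.join(words)

print(words)  # prints: ['They', 'ate', 'the', 'mystery', 'meat.']
print(clean)  # prints: They ate the mystery meat.
```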


We can also use the boolean methods startswith and endswith, along with the in operator, to check whether strings start with, end with, or contain certain characters or other strings.

In [136]:
b = 'There is no business like show business.'

print(b.startswith('T')) 

True


In [133]:
print(b.startswith('There'))

True


In [134]:
print(b.startswith('there'))

False


In [137]:
print(b.endswith('.'))

True


In [139]:
print(b.endswith('business.'))

True


In [140]:
print(b.endswith('Business.'))

False


In [141]:
b = 'There is no business like show business.'

In [142]:
print('like' in b)

True


In [143]:
print('business' in b)

True


In [144]:
print('Business' in b)

False


Note from the examples above that these checks are case-sensitive. Speaking of case, Python provides us with several useful methods for changing the case of a string.



In [150]:
c = 'shE HaD a maRveLoUs aSsoRtmeNt of PUPPETS.'

print(c.lower())

she had a marvelous assortment of puppets.


In [151]:
print(c.upper())

SHE HAD A MARVELOUS ASSORTMENT OF PUPPETS.


In [152]:
print(c.capitalize())

She had a marvelous assortment of puppets.


In [153]:
print(c.title())

She Had A Marvelous Assortment Of Puppets.
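
These case methods pair well with the checks we saw earlier. If we want a comparison that ignores case, one common approach is to convert both strings to lower case first. A minimal sketch, where the variable name query is chosen for illustration:

```python
b = 'There is no business like show business.'
query = 'Business'  # hypothetical search term with different capitalization

# The plain check is case-sensitive, so the capital B prevents a match.
exact_match = query in b

# Lowercasing both sides makes the comparison ignore case.
normalized_match = query.lower() in b.lower()

print(exact_match)       # prints: False
print(normalized_match)  # prints: True
```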


We can also remove any whitespace from the beginning and end of a string using the strip method. To remove whitespace from just the beginning, we use lstrip. To remove it from just the end, we use rstrip.

In [154]:
# remove space at beginning and end
d = ' I have a tendency to leave trailing spaces. '

print(d.strip())

I have a tendency to leave trailing spaces.


In [155]:
# remove space at beginning 
print(d.lstrip())


I have a tendency to leave trailing spaces. 


In [156]:
# remove space at end 
print(d.rstrip())

 I have a tendency to leave trailing spaces.
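
Stripping and case conversion are often used together when cleaning categorical values that were entered inconsistently. The sketch below assumes a small made-up list named labels; it is one possible approach rather than the only way to do it.

```python
# Hypothetical category labels with stray spaces and inconsistent case.
labels = [' Dog', 'dog ', 'DOG', ' cat ']

# Strip the surrounding whitespace, then lowercase each label.
cleaned = [label.strip().lower() for label in labels]

# After cleaning, the four raw labels collapse into two distinct categories.
unique = sorted(set(cleaned))

print(cleaned)  # prints: ['dog', 'dog', 'dog', 'cat']
print(unique)   # prints: ['cat', 'dog']
```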


Another useful string operation, which we saw briefly in the data wrangling lessons, is the replace method, which replaces every occurrence of one substring with another.

In [157]:
e = 'I thought the movie was wonderful!'

In [158]:
print(e.replace('wonderful', 'horrible'))

I thought the movie was horrible!


In [159]:
print(e.replace('wonderful', 'just OK'))

I thought the movie was just OK!
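
By default, replace changes every occurrence of the substring. It also accepts an optional third argument that limits how many occurrences are replaced. A short sketch, reusing the sentence stored in b earlier:

```python
b = 'There is no business like show business.'

# With two arguments, every occurrence of 'business' is replaced.
all_replaced = b.replace('business', 'party')

# A third argument caps the number of replacements, counting from the left.
first_only = b.replace('business', 'party', 1)

print(all_replaced)  # prints: There is no party like show party.
print(first_only)    # prints: There is no party like show business.
```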


## Regular Expressions

Python's string operation methods can take us a long way, but we will inevitably encounter a situation where we need to rely on some additional tools called regular expressions. Regular expressions allow us to perform different types of pattern matching on text in order to arrive at the result we want.

In order to use regular expressions, we will import the re library.

In [160]:
import re

Some of the most useful methods in the re library are:



*   search: Scans a string and returns a match object for the first place the expression matches, or None if there is no match.
*   findall: Finds all non-overlapping matches of an expression in a string and returns them as a list.
*   split: Splits a string wherever the expression matches.
*   sub: Replaces every match of the expression with another string.


Regular expressions consist of sequences that represent certain types of characters that can appear in strings. We can use the findall method to return every occurrence of a series of characters in a string as follows:

In [163]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'
print(re.findall('Mr', text))

['Mr', 'Mr']
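
The search method behaves a little differently from findall: instead of a list, it returns a match object describing the first match, or None when nothing matches. A minimal sketch using the same text; the variable names match and missing are illustrative.

```python
import re

text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

# search stops at the first match and returns a match object.
match = re.search('Mr', text)
if match is not None:
    print(match.group())  # the matched text, prints: Mr
    print(match.start())  # index where the match begins, prints: 13

# When the pattern does not occur, search returns None.
missing = re.search('Mrs', text)
print(missing)  # prints: None
```

Because search can return None, it is safest to check the result before calling methods such as group on it.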


If we instead want to match any one of several individual characters, we can turn the characters in the pattern into a set by enclosing them in square brackets ([]). The pattern then matches each occurrence of any single character in the set.

In [168]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [170]:
print(re.findall('[yx]', text))

['y']


In [172]:
print(re.findall('[yr]', text))

['y', 'r', 'r', 'r', 'r', 'r']


In [175]:
print(re.findall('[Mr]', text))

['M', 'r', 'M', 'r', 'r', 'r', 'M', 'r']


Regular expressions also let us write ranges inside a set as shortcuts so that, for example, we don't have to type out every letter in the alphabet or every digit in order to match them. Below are some of the most useful regular expression sets.



*   [a-z]: Any lowercase letter between a and z.
*   [A-Z]: Any uppercase letter between A and Z.
*   [0-9]: Any numeric character between 0 and 9.




In [176]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [179]:
print(re.findall('[a-z]', text))

['y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'r', 'o', 'g', 'e', 'r', 's', 'o', 'r', 'r', 'o', 'n', 'e', 's', 'h', 'a', 's', 'd', 'o', 'g', 's']


Note that this set returned all lowercase letters and excluded the capital letters (M, R, and J), the number 5, the spaces, and all the punctuation marks. We can add the ^ character just inside the opening square bracket to return everything that doesn't match the set we have designated.

In [180]:
print(re.findall('[^a-z]', text))

['M', ' ', ',', ' ', 'M', '.', ' ', 'R', ' ', ' ', 'M', '.', ' ', 'J', ',', ' ', ' ', '5', ' ', '.']


In this case, it returned the capital letters, the number, and all punctuation and whitespace.

What if we wanted to extract both upper and lower case letters from our string? We can just add A-Z inside our square brackets.

In [181]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [182]:
print(re.findall('[a-zA-Z]', text))

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'M', 'r', 'R', 'o', 'g', 'e', 'r', 's', 'o', 'r', 'M', 'r', 'J', 'o', 'n', 'e', 's', 'h', 'a', 's', 'd', 'o', 'g', 's']


If we also want to extract spaces, we can add a space inside the square brackets.

In [183]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [184]:
print(re.findall('[a-z A-Z]', text))

['M', 'y', ' ', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', ' ', 'M', 'r', ' ', 'R', 'o', 'g', 'e', 'r', 's', ' ', 'o', 'r', ' ', 'M', 'r', ' ', 'J', 'o', 'n', 'e', 's', ' ', 'h', 'a', 's', ' ', ' ', 'd', 'o', 'g', 's']


Once we get to a point where we are adding multiple things to our regular expression, we will want to leverage additional shortcuts called character classes (also known as special sequences). Below are some of the most useful ones and what they match.

* \w: Any alphanumeric character (letters, digits, and the underscore).
* \W: Any character that is not alphanumeric or an underscore.
* \d: Any digit.
* \D: Any non-digit character.
* \s: Any whitespace character.
* \S: Any non-whitespace character.
* .: Any character except newline (\n).

Let's take a look at how some of these work.

In [198]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [199]:
print(re.findall(r'[\d]', text))

['5']


In [201]:
print(re.findall(r'[\w]', text))

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', 'M', 'r', 'R', 'o', 'g', 'e', 'r', 's', 'o', 'r', 'M', 'r', 'J', 'o', 'n', 'e', 's', 'h', 'a', 's', '5', 'd', 'o', 'g', 's']


In [202]:
# return non-whitespace chars 
print(re.findall(r'[\S]', text))

['M', 'y', 'n', 'e', 'i', 'g', 'h', 'b', 'o', 'r', ',', 'M', 'r', '.', 'R', 'o', 'g', 'e', 'r', 's', 'o', 'r', 'M', 'r', '.', 'J', 'o', 'n', 'e', 's', ',', 'h', 'a', 's', '5', 'd', 'o', 'g', 's', '.']
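
The remaining character classes work the same way. The sketch below tries \s, \W, and \D on the same text. It writes the patterns as raw strings (the r prefix), which stops Python from treating the backslash as its own escape character before the pattern reaches the re library. The variable names are illustrative.

```python
import re

text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

# \s matches each whitespace character; here those are the spaces between words.
whitespace = re.findall(r'\s', text)

# \W matches anything that is not a letter, digit, or underscore.
non_word = re.findall(r'\W', text)

# \D matches every character except digits.
non_digits = re.findall(r'\D', text)

print(len(whitespace))               # prints: 9
print(sorted(set(non_word)))         # prints: [' ', ',', '.']
print(len(text) - len(non_digits))   # number of digits left out, prints: 1
```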


We can use the re library's split method to split a string wherever a pattern matches, such as on commas or on any digit.

In [203]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [204]:
print(re.split(', ', text)) 

['My neighbor', 'Mr. Rogers or Mr. Jones', 'has 5 dogs.']


In [205]:
print(re.split('[0-9]', text))

['My neighbor, Mr. Rogers or Mr. Jones, has ', ' dogs.']
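
One advantage of re.split over the string split method is that a set lets us split on several different delimiters at once. A minimal sketch, assuming a made-up string named record whose fields are separated inconsistently:

```python
import re

# Hypothetical record that mixes commas, semicolons, and spaces as separators.
record = 'apples,oranges;pears grapes'

# The set [,; ] matches a single comma, semicolon, or space.
fields = re.split('[,; ]', record)

print(fields)  # prints: ['apples', 'oranges', 'pears', 'grapes']
```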


Let's also take a look at how we can use the sub method to replace the number of dogs our neighbor has.

In [211]:
text = 'My neighbor, Mr. Rogers or Mr. Jones, has 5 dogs.'

In [212]:
print(re.sub('[0-9]', '100', text))

My neighbor, Mr. Rogers or Mr. Jones, has 100 dogs.
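
This works here because the number has a single digit. The set [0-9] matches one digit at a time, so a multi-digit number would have each of its digits replaced separately. The sketch below shows the difference using the + quantifier, which is not covered in this lesson and means "one or more of the preceding item". The shortened sentence is a made-up example.

```python
import re

# Hypothetical sentence with a two-digit number.
text = 'My neighbor has 15 dogs.'

# [0-9] matches each digit on its own, so '1' and '5' are both replaced.
per_digit = re.sub('[0-9]', '100', text)

# [0-9]+ matches the whole run of digits as a single match.
whole_number = re.sub('[0-9]+', '100', text)

print(per_digit)     # prints: My neighbor has 100100 dogs.
print(whole_number)  # prints: My neighbor has 100 dogs.
```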


## Summary 

In this lesson, we learned how to manipulate strings with Python. We started by reviewing some of the Python string operations we had seen in previous lessons. Then we learned how to subset strings and split them based on designated characters. From there, we covered boolean methods and how to use them while operating on strings. We also looked at how to change cases of strings, how to strip white spaces, and how to replace substrings with other strings. Finally, we finished up the lesson learning about some of the most frequently used regular expressions and how to use them to match characters in strings.