# Strings

# Formatted strings

Suppose you want to print a number with just two decimal places of precision: not 12.95821, but 12.96.  This is a very common use case for formatted strings, which are also often easier to read and write than concatenating strings together.




f-strings (the f is for formatted) are the easiest way to get a formatted string.  f precedes the string, and an expression (like a variable name) can be placed between curly braces in the string.

In [37]:
my_cost = 12.95821
print(f'The total cost was {my_cost} dollars')

The total cost was 12.95821 dollars


To give two places after the decimal, add :.2f to the expression.  (The f selects the fixed-point "presentation type," which is the usual choice for floats.)

In [38]:
print(f'The total cost was {my_cost:.2f} dollars')

The total cost was 12.96 dollars


The older way of doing this, with the string method .format(), is still something you may see around.  It passes the values as arguments to the string's format() method, but still puts the formatting information in the curly braces.

In [None]:
print('The total cost was {:.2f} dollars'.format(my_cost))
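
As a quick check that these spellings agree, here is a small sketch that reuses my_cost from above.  The other variable names are made up for illustration, and the built-in format() function is included only as a third option you may run into.

```python
my_cost = 12.95821

# Three spellings of the same format specification (".2f")
with_fstring = f'The total cost was {my_cost:.2f} dollars'
with_method = 'The total cost was {:.2f} dollars'.format(my_cost)
with_builtin = 'The total cost was ' + format(my_cost, '.2f') + ' dollars'

print(with_fstring)
print(with_fstring == with_method == with_builtin)  # all three strings are identical
```

Which one to use is mostly a matter of readability; f-strings keep the value next to its format.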

If the decimal point is left out, the number instead gives the minimum width of the field: how many characters the value should take up, padded with spaces on the left if it is shorter.  This can be useful when writing tables of numbers with values that have different orders of magnitude.

We'll also change the 'f' (fixed-point) to 'd' (decimal integer) because these are integer values.  Keeping it 'f' would give us unnecessary places after the decimal.

In [39]:
for i in range(20):
  j = i*10
  print(f'{i:4d}{j:4d}')

   0   0
   1  10
   2  20
   3  30
   4  40
   5  50
   6  60
   7  70
   8  80
   9  90
  10 100
  11 110
  12 120
  13 130
  14 140
  15 150
  16 160
  17 170
  18 180
  19 190
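
Width and precision can also be combined, and text can be aligned too.  The sketch below uses made-up grocery data (the names items and rows are just for illustration): < left-aligns the name in 8 columns, and 8.2f right-aligns the price in 8 columns with 2 places after the decimal.

```python
items = [("milk", 3.5), ("eggs", 12.0), ("yogurt", 104.25)]  # made-up data

rows = []
for name, price in items:
    # <8 pads the name on the right; 8.2f pads the price on the left
    rows.append(f'{name:<8}{price:8.2f}')

print("\n".join(rows))
```

Every row comes out 16 characters wide, so the decimal points line up.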


# Exercise

Try out this method of formatting strings by printing a string of the form "The price was $[price to 2 decimal places], so I rated it [rating] stars."

In [None]:
price = 1000/3
rating = 4.5


In [None]:
print(f'The price was ${price:.2f}, so I rated it {rating} stars.')

# Other useful string functions

One of the most useful string functions is split(), which turns a string like "A,B,C,D" into a list of strings ['A','B','C','D'].  It takes the separator string as an argument; if no argument is given, it splits on runs of whitespace.


In [40]:
mylist = "milk,eggs,yogurt"
mylistlist = mylist.split(',')
print(mylistlist)

['milk', 'eggs', 'yogurt']


In [41]:
mylist = "milk eggs yogurt"
mylistlist = mylist.split(' ')
print(mylistlist)

['milk', 'eggs', 'yogurt']
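
One thing to watch for: splitting on a single space treats every space as a separator, so repeated spaces produce empty strings.  The sketch below (with a made-up messy string) compares that with calling split() with no argument.

```python
messy = "milk  eggs   yogurt"  # note the repeated spaces

by_single_space = messy.split(' ')   # every single space is a separator
by_any_whitespace = messy.split()    # runs of whitespace count as one separator

print(by_single_space)
print(by_any_whitespace)
```

The first list contains empty strings where spaces were adjacent; the second does not.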


splitlines() is a special case for splitting at line boundaries, such as the newline character.

In [42]:
mylist = "milk\neggs\nyogurt".splitlines()
print(mylist)

['milk', 'eggs', 'yogurt']
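
The difference from split('\n') shows up with files that use Windows-style line endings.  A small sketch, assuming a made-up string with \r\n endings:

```python
windows_text = "milk\r\neggs\r\nyogurt\r\n"  # made-up text with Windows line endings

with_split = windows_text.split('\n')     # leaves '\r' on each item, plus a trailing ''
with_splitlines = windows_text.splitlines()

print(with_split)
print(with_splitlines)
```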


A common thing that is 'wrong' with strings is too much invisible whitespace where you didn't expect any.  strip() gets rid of whitespace on either side of the string.

In [43]:
'     milk,eggs,yogurt     '.strip()

'milk,eggs,yogurt'
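
There are also one-sided versions, lstrip() and rstrip(), and strip() combines well with split().  The sketch below uses made-up strings to show both.

```python
padded = '     milk,eggs,yogurt     '

left_stripped = padded.lstrip()    # removes leading whitespace only
right_stripped = padded.rstrip()   # removes trailing whitespace only

# Stripping each item after a split cleans up spaces around the separators
raw_list = ' milk , eggs ,yogurt '
items = [item.strip() for item in raw_list.split(',')]

print(repr(left_stripped))
print(repr(right_stripped))
print(items)
```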


Sometimes lines of interest can be identified with particular starting substrings.  startswith() can easily pick these out.  (There's also an endswith(), probably less used.)

In [44]:
lines = "SERVANT: Sir, there are ten thousand--\nMACBETH: Geese, villain?"
linelist = lines.splitlines()
for line in linelist:
  if line.startswith("MACBETH"):
    print(line.split(": ")[1])

Geese, villain?


When looking for nonsense entries in files, isdigit() returns True only if every character in the string is a digit; isalpha() returns True only if all the characters are letters, lowercase or uppercase; isalnum() returns True if the string is all letters and digits.  All three return False for an empty string.  Checks like these can tell when something has gone very wrong.

In [45]:
mygroceries = 'milk,honey,eggs,!3#$@$'
[x.isalpha() for x in mygroceries.split(",")] 

[True, True, True, False]
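
These checks are strict about punctuation, which matters for numbers: a decimal point or a minus sign is not a digit.  The sketch below runs all three checks on a few made-up entries.

```python
entries = ['42', '4.2', '-7', 'abc', 'abc123', '']  # made-up sample values

checks = {}
for entry in entries:
    # Store (isdigit, isalpha, isalnum) for each entry
    checks[entry] = (entry.isdigit(), entry.isalpha(), entry.isalnum())
    print(repr(entry), checks[entry])
```

So isdigit() is a reasonable test for non-negative whole numbers, but not for floats or negative values.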

The simple "in" operator can let you check for substrings within strings.

In [46]:
print("foo" in "food")

True


# String comparison

Strings can be compared using <, <=, ==, and similar comparison operators.  The comparison follows alphabetical order to a point, but it's really comparing the characters' Unicode code points one by one (accessible with the ord() function).  Every uppercase ASCII letter has a smaller code point than every lowercase one, so uppercase registers as less than lowercase.  It's advisable to convert to lowercase with lower() before comparing strings in this way.

In [47]:
'bat' < 'cat'

True

In [48]:
'bat' < 'Cat'

False

In [49]:
f"b is {ord('b')}, c is {ord('c')}, C is {ord('C')}"

'b is 98, c is 99, C is 67'
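
Here is a sketch of the lower() advice in practice.  The word list is made up, and passing key=str.lower to sorted() is one possible way to apply the same idea when sorting.

```python
# Lowercasing both sides gives the alphabetical answer
caseless_less = 'bat'.lower() < 'Cat'.lower()

words = ['cat', 'Bat', 'apple', 'Dog']  # made-up sample words
default_order = sorted(words)                  # uppercase words sort first
caseless_order = sorted(words, key=str.lower)  # compares lowercased copies

print(caseless_less)
print(default_order)
print(caseless_order)
```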

# DataFrames and String operations

pandas DataFrames have great integration with the Python string methods.  You can call a string method and have it automatically applied to every item in a column.  You just need to access the .str attribute of the column (a Series), as demonstrated below.


In [52]:
import numpy as np
import pandas as pd

# Hotel ratings as text labels; note the stray whitespace around some entries
my_data = np.array([["Excellent", "   Okay   ", "   Okay"], ["Great    ", "   Good", "   Good"]])
df = pd.DataFrame(my_data, columns = ["Hilton", "Marriott", "Four Seasons"], index = ["Alice", "Bob"])
df

          Hilton Marriott Four Seasons
Alice  Excellent     Okay         Okay
Bob        Great     Good         Good


In [56]:
marriott = df["Marriott"] # Gets a Series from the DataFrame
print(marriott)
marriott.str.strip()

Alice       Okay   
Bob            Good
Name: Marriott, dtype: object


Alice    Okay
Bob      Good
Name: Marriott, dtype: object

Another use for this can be checking whether the string entries are all of an expected type - alphabetical or alphanumerical.

In [57]:
marriott.str.isalpha()  # Expect False because of the spaces

Alice    False
Bob      False
Name: Marriott, dtype: bool

The .str accessor also supports regular expression matching, which we'll explore just below.  Here str.match() tests whether each entry matches the pattern starting from the beginning of the string.

In [58]:
marriott.str.match(r"\s*Okay\s*")  # raw string keeps the backslashes intact

Alice     True
Bob      False
Name: Marriott, dtype: bool

# Regular expressions

Regular expressions search for patterns in your data.  They can be either very specific, such as looking for a particular zip code (02143); or they could be very broad, like looking for words.  Once a pattern has been found, you can get back the string that matched the pattern; or you could extract details of the string that matched the pattern, like the name next to the zip code.


Regular expressions generally are hand-designed instead of generated with machine learning.  Machine learning also looks for patterns, but outshines humans when we can't easily describe the pattern we're looking for.  Regular expressions tend to be most successful when we know what we're looking for.


The pattern we're looking for is described by a string; it can literally be exactly the string we're looking for.  We hand this pattern to the search() function from the "re" module, and it tells us whether the pattern occurs anywhere in the data.  The result is a "match object," and its group() method gives us the string it found.  (The result is None if no match was found, so calling group() on it would raise an error.)

In [59]:
import re

pattern = '02143'
longstring = '0132428190214300'  # The match sits just before the last two zeros
result = re.search(pattern, longstring)
print(result.group())

02143


In [60]:
longstring = '0132428190214300'
pattern2 = '999'
result2 = re.search(pattern2, longstring)
print(result2)

None
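
Because a failed search returns None, it's worth checking the result before calling group().  The sketch below wraps that check in a hypothetical helper, find_or_default (not part of the re module), and also shows span(), which reports where the match was found.

```python
import re

longstring = '0132428190214300'

def find_or_default(pattern, text, default='(no match)'):
    """Hypothetical helper: return the matched text, or a default if nothing matched."""
    match = re.search(pattern, text)
    if match is None:
        return default
    return match.group()

found = find_or_default('02143', longstring)
missing = find_or_default('999', longstring)
position = re.search('02143', longstring).span()  # (start, end) indices of the match

print(found, missing, position)
```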


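We can increase the power of the regular expression by using special sequences that represent whole categories of character, like \d for any digit, \s for any whitespace, and \w for any "word" character (a letter, digit, or underscore).  So we could look for an arbitrary zip code with \d\d\d\d\d.  Patterns containing backslashes are best written as raw strings (with an r prefix) so that Python passes the backslashes through to the regex engine unchanged.

In [65]:
pattern3 = r'\d\d\d\d\d'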

longstring = 'blah0132blah42819blah0214300'

result3 = re.search(pattern3, longstring)

print(result3.group())

42819
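
search() stops at the first match.  If you want every run of five digits, re.findall() returns all non-overlapping matches as a list.  The sketch below also uses the {5} repetition count, which is shorthand not covered in the text above, to show an equivalent pattern.

```python
import re

longstring = 'blah0132blah42819blah0214300'

first_match = re.search(r'\d\d\d\d\d', longstring).group()
all_matches = re.findall(r'\d\d\d\d\d', longstring)   # every non-overlapping match
same_with_count = re.findall(r'\d{5}', longstring)     # {5} means "exactly five times"

print(first_match)
print(all_matches)
print(same_with_count)
```

Note that '0132' is skipped because it has only four digits, and the trailing '00' is left over after '02143' is matched.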


A * means the preceding item can appear zero or more times, and a + means it can appear one or more times.  So looking for '3\d*3' will try to find a sequence of digits bookended by 3's, but the stuff between could be arbitrarily long.  Both are "greedy": they match as much as they can.

In [66]:
longstring = '0132428190214300'
pattern4 = r'3\d*3'
result4 = re.search(pattern4, longstring)
print(result4.group())

324281902143
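
Two details are easy to miss here: the difference between * and +, and the fact that both grab as much as they can.  The sketch below uses made-up digit strings; the *? form (a "lazy" star that matches as little as possible) is an extra not covered above.

```python
import re

# * allows zero digits between the 3's; + requires at least one
star_on_33 = re.search(r'3\d*3', '33')
plus_on_33 = re.search(r'3\d+3', '33')

digits = '931435368'  # made-up string containing three 3's
greedy = re.search(r'3\d*3', digits).group()   # runs to the last 3
lazy = re.search(r'3\d*?3', digits).group()    # stops at the next 3

print(star_on_33.group(), plus_on_33)
print(greedy, lazy)
```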


It's also possible to be uncertain about a particular character being there or not:  a '?' after a character indicates that the character could be there or not, and the expression still matches.

In [67]:
longstring = '0132428190214300'
pattern5 = r'2\d?2'
result5 = re.search(pattern5, longstring)
print(result5.group())


242
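
A typical use for ? is accepting spelling variants.  The sketch below is an illustration with a made-up word list: the pattern accepts both "color" and "colour" because the u is optional.

```python
import re

pattern = r'colou?r'  # the u may appear once or not at all
words = ['color', 'colour', 'colr']  # made-up test words

# True where the pattern is found, False otherwise
found = [re.search(pattern, word) is not None for word in words]
print(found)
```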


Something that can greatly increase the power of the regular expression is putting subsequences in parentheses, then using +, *, or ? on that "group."  In the case of star, the whole sequence now can appear zero or more times, and + and ? are similarly extended to apply to the whole subgroup.

In [68]:
pattern = "(foo)+"
longstring = "foofoofoobar."
result = re.search(pattern, longstring)
print(result.group())

foofoofoo


Sometimes you may want to accept one of a few different strings.  The | (or) operator allows the regular expression to accept this or that.

In [69]:
pattern = "(foo|bar)+"
longstring = "foofoofoobar."
result = re.search(pattern, longstring)
print(result.group())

foofoofoobar


A last interesting thing to do with a regex is capture information from subgroups (parentheses) in the regular expression.  This kind of thing is often done by machine learning now, because people can't write big enough expressions to make this work for arbitrary natural language, but you can do it yourself if you know what you're looking for.

In [70]:
longstring = "I ate 200 ham sandwiches last year"
pattern = r"(\d+)\sham"  # Grab the digits before the word ham
result = re.search(pattern, longstring)
print(result.group(1))  # Subgroup 1, the first () in the pattern

200
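
A pattern can have several groups, numbered left to right by their opening parentheses.  The sketch below extends the example with two more groups (the extended pattern is made up for illustration); group(0) is the whole match, and groups() returns all the subgroups at once.

```python
import re

longstring = "I ate 200 ham sandwiches last year"
pattern = r"(\d+)\s(\w+)\s(\w+)"  # a number, then two words

result = re.search(pattern, longstring)
whole_match = result.group(0)    # everything the pattern matched
all_groups = result.groups()     # tuple of subgroups 1, 2, 3
count = int(result.group(1))     # captured text is a string, so convert it

print(whole_match)
print(all_groups)
print(count + 1)
```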


# Exercise

Write a regular expression that looks for a price:  a dollar sign followed by at least one digit.  Try running the search on the provided string.  Note that a dollar sign has a specific meaning in regular expressions (end of string), so you'll need to escape it (precede it with a backslash, \).

In [None]:
import re

longstring = "We paid $100 for those shoes"
pattern = r'\$\d+'
result = re.search(pattern, longstring)
print(result.group())

For some more exercise with special regex characters and groups, try writing an expression that looks for two words separated by a space; print the second word.

In [None]:
pattern = r'(\w+)\s(\w+)'
result = re.search(pattern, longstring)
print(result.group(2))