# Strings

*Suddenly, the solution to the puzzle occurred to Cynthia.  The clue said she was supposed to go to "0ac0cafe".  But what if each 0 meant something like "some other characters here"? In other words, she was looking for (probably) a cafe where the other word(s) contained the characters "ac".*

*Cynthia returned to her DataFrame of businesses.  It was still a long list of cafes to go through.  Luckily, Cynthia knew how to use regular expressions!*

# Formatted strings

Suppose you want to print a number with just two decimal places of precision - so, not 12.95821, but just 12.96.  This is a very common use case for formatted strings, which are also often easier to read and write than concatenating strings.




**f-strings (the f is for formatted)** are the easiest way to get a formatted string.  f precedes the string, and an expression (like a variable name) can be placed between curly braces in the string.

In [None]:
my_cost = 12.95821
print(f'The total cost was {my_cost} dollars')  # {} inserts the value of the expression; it can be any expression, not just a variable (e.g. {my_cost/2})

The total cost was 12.95821 dollars


To give two places after the decimal, add :.2f to the expression.  (The f stands for the float "presentation type.")

In [None]:
print(f'The total cost was {my_cost:.2f} dollars')

The total cost was 12.96 dollars
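
The part after the colon can do more than set the number of decimal places. Here is a minimal sketch of a few other common format specs; `big_number` and `share` are made-up values for illustration, not part of the lecture data.

```python
my_cost = 12.95821
big_number = 1234567.891  # hypothetical value, not from the lecture
share = 0.256             # hypothetical fraction, not from the lecture

two_places = f'{my_cost:.2f}'        # two digits after the decimal point
padded = f'{my_cost:10.2f}'          # same, right-aligned in a field 10 characters wide
with_commas = f'{big_number:,.2f}'   # thousands separators
as_percent = f'{share:.1%}'          # multiply by 100 and append a percent sign

print(two_places)     # 12.96
print(f'[{padded}]')  # [     12.96]
print(with_commas)    # 1,234,567.89
print(as_percent)     # 25.6%
```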


# Other useful string functions

One of the most useful string functions is **split()**, which turns a string like "A,B,C,D" into a list of strings ['A','B','C','D'].  It takes the separator as an argument; with no argument, it splits on runs of whitespace.


In [None]:
groceries = "milk,eggs,yogurt"
grocerieslist = groceries.split(',')
print(grocerieslist)

['milk', 'eggs', 'yogurt']
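
A few details of split() are easy to trip over, so here is a small sketch of them. The spaced-out strings are invented for illustration.

```python
groceries = "milk,eggs,yogurt"

# The separator is matched as one whole string, not as a set of characters
spaced = "milk, eggs, yogurt".split(", ")

# With no argument, split() breaks on any run of whitespace (spaces, tabs, newlines)
words = "milk   eggs\tyogurt".split()

# An optional second argument limits how many splits are made
first_and_rest = groceries.split(",", 1)

print(spaced)          # ['milk', 'eggs', 'yogurt']
print(words)           # ['milk', 'eggs', 'yogurt']
print(first_and_rest)  # ['milk', 'eggs,yogurt']
```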


**.join()** does the opposite of split() - it glues strings in a list together with the given delimiter.  (Oddly, this is called as a method of the delimiter string.)

In [None]:
','.join(['milk', 'eggs', 'yogurt'])

'milk,eggs,yogurt'

A common thing that is 'wrong' with strings is too much invisible whitespace where you didn't expect any.  **strip()** gets rid of whitespace at both ends of the string (but not whitespace in the middle).

In [None]:
'     milk,eggs,yogurt     '.strip()

'milk,eggs,yogurt'
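
strip() has two one-sided relatives, and it can also remove characters other than whitespace. A quick sketch, using made-up example strings:

```python
padded = '     milk,eggs,yogurt     '

left_clean = padded.lstrip()        # removes leading whitespace only
right_clean = padded.rstrip()       # removes trailing whitespace only
no_dashes = '--milk--'.strip('-')   # the argument lists the characters to trim

print(repr(left_clean))   # 'milk,eggs,yogurt     '
print(repr(right_clean))  # '     milk,eggs,yogurt'
print(no_dashes)          # milk
```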


Sometimes lines of interest can be identified with particular starting substrings. **startswith()** can easily pick these out.  (There's also an **endswith()**, probably less used.)

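**splitlines()** is essentially a shortcut for split('\n'); it also recognizes other line endings such as '\r\n', and it does not leave an empty string after a trailing newline.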

In [None]:
lines = "SERVANT: Sir, there are ten thousand--\nMACBETH: Geese, villain?"
linelist = lines.splitlines()  # A shortcut for split('\n')
for line in linelist:
  if line.startswith("MACBETH"):
    print(line.split(": ")[1])  # split at ": " and take the second piece (index 1), which is the spoken text

Geese, villain?
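
splitlines() and split('\n') usually agree, but they differ at the edges. This sketch uses two invented strings to show where:

```python
text = "line one\nline two\n"   # note the trailing newline
windows_text = "a\r\nb"         # Windows-style line ending

with_split = text.split('\n')           # leaves an empty string at the end
with_splitlines = text.splitlines()     # no trailing empty string

crlf_split = windows_text.split('\n')       # the '\r' stays attached to 'a'
crlf_splitlines = windows_text.splitlines() # '\r\n' is treated as one line ending

print(with_split)       # ['line one', 'line two', '']
print(with_splitlines)  # ['line one', 'line two']
print(crlf_split)       # ['a\r', 'b']
print(crlf_splitlines)  # ['a', 'b']
```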


The simple **"in" operator lets you check for substrings within strings**.  **.replace()** can go a step further and **replace the matches** that are found.

In [None]:
print("foo" in "food")
print("foodfood".replace("foo", "ra"))

True
radrad
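
A few closely related tools are worth knowing alongside "in" and replace(). The sketch below is only illustrative and reuses the same toy strings:

```python
# "in" is case-sensitive
has_upper = "Foo" in "food"

# replace() takes an optional count that limits how many matches are replaced
first_only = "foodfood".replace("foo", "ra", 1)

# find() returns the index of the first match, or -1 if there is none
where = "food".find("od")
nowhere = "food".find("xyz")

# count() reports how many non-overlapping matches there are
how_many = "foodfood".count("foo")

print(has_upper)   # False
print(first_only)  # radfood
print(where)       # 2
print(nowhere)     # -1
print(how_many)    # 2
```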


# DataFrames and String operations

DataFrames have great integration with the Python string methods.  You can call a string function and have it automatically applied to every item in the same column.  You just need to access the .str attribute, as demonstrated below.


In [None]:
import numpy as np
import pandas as pd

# Hotel ratings on a 5-star scale
my_data = np.array([["Excellent", "   Okay   ", "   Okay"], ["Great    ", "   Good", "   Good"]])
df = pd.DataFrame(my_data, columns = ["Hilton", "Marriott", "Four Seasons"], index = ["Alice", "Bob"])
df

| | Hilton | Marriott | Four Seasons |
|---|---|---|---|
| Alice | Excellent | Okay | Okay |
| Bob | Great | Good | Good |


In [None]:
marriott = df['Marriott']
for s in marriott:
    print(s)
print('---')
for s in marriott.str.strip():  # .str applies the string method to every item in the column
    print(s)  # Look, no extra whitespace

   Okay   
   Good
---
Okay
Good


DataFrames are also well-integrated with regular expression matching, which we'll explore just below.

In [None]:
marriott.str.match(r"\s*Okay\s*")  # raw string (r"...") so Python leaves the backslashes alone
# Looks for 'Okay' at the start of each value, allowing any amount of whitespace before or after it

| | Marriott |
|---|---|
| Alice | True |
| Bob | False |


# Regular expressions

**Regular expressions search for patterns in your data**.  They can be either very specific, such as looking for a particular zip code (02143); or they could be very broad, like looking for words.  Once a pattern has been found, you can get back the string that matched the pattern; or you could extract details of the string that matched the pattern, like the name next to the zip code.


The pattern we're looking for is described by a string; it can literally be exactly the string we're looking for.  We hand this pattern to a method **search()** from the "re" module, and it can tell us whether the pattern is in the data.  The result is a "match object," but its **group()** method tells us the string it found.  (The result is None if no match was found.)

In [None]:
import re #regular expressions

pattern = '02143'
longstring = 'Somerville, MA 02143'
result = re.search(pattern, longstring)  # search() looks for the pattern anywhere in longstring
if result:  # a match object counts as True; None (no match) counts as False
    print(result.group())

02143


In [None]:
longstring = '0132428190214200'
pattern2 = '02143'
result2 = re.search(pattern2, longstring)
print(result2)

None
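
Besides group(), a match object can report where the match was found. The sketch below also shows why the `if result:` check matters: calling group() on None would raise an AttributeError. The pattern '99999' is just a stand-in for something that is not in the string.

```python
import re

longstring = 'Somerville, MA 02143'
result = re.search('02143', longstring)

start = result.start()  # index of the first matched character
end = result.end()      # index just past the last matched character
span = result.span()    # both together, as a tuple

missing = re.search('99999', longstring)  # hypothetical pattern with no match
found_text = missing.group() if missing else 'no match'

print(start, end, span)         # 15 20 (15, 20)
print(longstring[start:end])    # 02143
print(found_text)               # no match
```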


We can increase the power of the regular expression by using "escape sequences" that represent whole categories of symbol, like **\d for any digit**, **\s for any whitespace**, and **\w for any alphanumeric character** (or underscore).  So we could look for an arbitrary 5-digit zip code with \d\d\d\d\d.  Writing the pattern as a raw string (r'...') keeps Python from trying to interpret the backslashes itself.

In [None]:
pattern3 = r'\d\d\d\d\d'  # five digits in a row; the r prefix (raw string) keeps Python from interpreting the backslashes

longstring = 'Somerville, MA 02130'  # any five digits will match, not just one particular zip code
# Careful: in a longer run of digits such as 1234567, the pattern matches just the first five (12345)

result3 = re.search(pattern3, longstring)
if result3:
    print(result3.group())

02130
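
One way to avoid matching just the first five digits of a longer number is to require a word boundary (\b) on each side. The sketch below also uses the {5} shorthand, which means "exactly five of the preceding item"; neither \b nor {5} appears elsewhere in this lecture, and the 'Order 1234567' string is made up.

```python
import re

pattern_loose = r'\d{5}'       # same meaning as \d\d\d\d\d
pattern_exact = r'\b\d{5}\b'   # five digits that are not part of a longer run

loose_hit = re.search(pattern_loose, 'Order 1234567')
exact_hit = re.search(pattern_exact, 'Order 1234567')
zip_hit = re.search(pattern_exact, 'Somerville, MA 02130')

print(loose_hit.group())  # 12345
print(exact_hit)          # None
print(zip_hit.group())    # 02130
```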

A * matches zero or more repetitions of whatever comes just before it, and + matches one or more.  So looking for **'3\d*3'** will try to find a sequence of digits bookended by 3's, where the stretch between them can be arbitrarily long (zero or more digits between the first 3 and the last 3).

Important for the test: **+ means one or more, * means zero or more**

In [None]:
longstring = 'My phone number is 5555555'
pattern4 = r'phone number is \d+'  # \d+ means one or more digits
result4 = re.search(pattern4, longstring)
if result4:
    print(result4.group())

phone number is 5555555
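
To see the difference between * and + directly, here is a small sketch that runs both on the same made-up strings. It also shows that both are "greedy": they grab as many characters as they can while still letting the whole pattern match.

```python
import re

star = r'3\d*3'   # a 3, zero or more digits, then a 3
plus = r'3\d+3'   # a 3, one or more digits, then a 3

star_hit = re.search(star, 'code 33')   # zero digits in between is fine for *
plus_hit = re.search(plus, 'code 33')   # + needs at least one digit in between

greedy_hit = re.search(star, 'id 3131313 end')  # takes the longest possible run

print(star_hit.group())    # 33
print(plus_hit)            # None
print(greedy_hit.group())  # 3131313
```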


It's also possible to be uncertain about a particular character being there or not:  a **'?'** in the expression indicates that the character could be there or not, and the expression still matches.

In [None]:
longstring = 'Call me at 5555555'
pattern5 = r'\d\d\d-?\d\d\d\d'  # -? matches with or without the dash; ? allows at most one, so "--" will not match
result5 = re.search(pattern5, longstring)
if result5:
    print(result5.group())


5555555
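
The same pattern can be tried on a few variations to check what ? does and does not allow. The extra phone strings here are invented for illustration.

```python
import re

pattern5 = r'\d\d\d-?\d\d\d\d'

no_dash = re.search(pattern5, 'Call me at 5555555')
one_dash = re.search(pattern5, 'Call me at 555-5555')
two_dashes = re.search(pattern5, 'Call me at 555--5555')  # ? allows at most one dash

print(no_dash.group())   # 5555555
print(one_dash.group())  # 555-5555
print(two_dashes)        # None
```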


Something that can greatly increase the power of the regular expression is putting subsequences in parentheses, then using +, *, or ? on that "group."  In the case of star, the whole sequence now can appear zero or more times, and + and ? are similarly extended to apply to the whole subgroup.

In [None]:
longstring = "Call me at 1-800-555-5555."
pattern = r"(\d-)?(\d\d\d-)?\d\d\d-?\d\d\d\d"  # (optional leading digit and dash)(optional 3-digit area code and dash), 3 digits, optional dash, 4 digits
# The match still succeeds when the number is just the last 7 digits, with no country code, area code, or dashes
result = re.search(pattern, longstring)
if result:
    print(result.group())

1-800-555-5555


In [None]:
# Same regex now finds a phone number without area code
longstring2 = "Call me at 555-5555."
result = re.search(pattern, longstring2)
if result:
    print(result.group())

555-5555
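
The effect of the parentheses is easiest to see by comparing a quantifier applied to a group with the same quantifier applied to a single character. A minimal sketch with made-up strings:

```python
import re

# Without parentheses, + applies only to the character right before it
single = re.search(r'ha+', 'hahaha!').group()

# With parentheses, + applies to the whole group "ha"
grouped = re.search(r'(ha)+', 'hahaha!').group()

# A starred group can repeat any number of times, including zero
chunks = re.search(r'(\d\d\d-)*\d\d\d\d', 'Call 800-555-5555 now').group()
bare = re.search(r'(\d\d\d-)*\d\d\d\d', 'Extension 5555').group()

print(single)   # ha
print(grouped)  # hahaha
print(chunks)   # 800-555-5555
print(bare)     # 5555
```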


Sometimes you may want to accept one of a few different strings.  **The | (or)** operator allows the regular expression to **accept this or that**.

In [None]:
pattern = "Somerville, (MA|NJ)"
longstring = "Somerville, MA 02143"
result = re.search(pattern, longstring)
if result:
    print(result.group())

Somerville, MA
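
The parentheses around (MA|NJ) matter. Without them, | splits the entire pattern in two, so the regex means "Somerville, MA" or just "NJ". The sketch below shows the difference on an invented string:

```python
import re

grouped = r'Somerville, (MA|NJ)'   # Somerville, followed by MA or NJ
ungrouped = r'Somerville, MA|NJ'   # either "Somerville, MA" or "NJ" alone

other_town = 'Trenton, NJ'

grouped_hit = re.search(grouped, other_town)
ungrouped_hit = re.search(ungrouped, other_town)

print(grouped_hit)            # None
print(ungrouped_hit.group())  # NJ
```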


Sometimes you want to find multiple matches for a pattern in the target string, rather than just the first match.  re.findall() will look for all the matches and return them in a list.

In [None]:
longstring = "States with a Somerville:  AL, IN, ME, MA, NJ, OH, TN, TX"
pattern = "[A-Z][A-Z]"  # Two capital letters in a row; [A-Z] is a range and is case-sensitive, so it is not the same as [a-z]
result = re.findall(pattern, longstring)
print(result)

['AL', 'IN', 'ME', 'MA', 'NJ', 'OH', 'TN', 'TX']
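
Two details about findall() are worth a quick sketch: it returns an empty list (not None) when nothing matches, and case sensitivity can be switched off with the re.IGNORECASE flag. The strings here are made up.

```python
import re

pattern = r'[A-Z][A-Z]'

nothing = re.findall(pattern, 'no capitals here')   # no matches at all
strict = re.findall(pattern, 'al, IN, me')          # case-sensitive by default
relaxed = re.findall(pattern, 'al, IN, me', flags=re.IGNORECASE)

print(nothing)  # []
print(strict)   # ['IN']
print(relaxed)  # ['al', 'IN', 'me']
```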


A last interesting thing to do with a regex is capture information from groups (in parentheses) in the regular expression.  This requires the text to have a predictable structure - if it doesn't, you could try AI to extract the information instead.  But regexes are a little more predictable.

In [None]:
longstring = "The stock NVDA went down 4.54 points"
pattern = r"stock (\w+) went down (\d+\.\d+) points"  # \w+ is one or more word characters; \d+\.\d+ is digits, a literal decimal point (\.), then digits
# The parentheses mark the groups whose text we want to pull out afterwards

result = re.search(pattern, longstring)
if result:
    print(result.group(1))  # Subgroup 1, the first () in the pattern
    print(result.group(2))

NVDA
4.54
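
One thing to watch when capturing numbers: in a regex, a plain . matches any character, so a literal decimal point should be written as \. instead. The sketch below compares the two on a deliberately malformed string ('4x54' is invented), then shows groups(), which returns all captured groups at once.

```python
import re

loose = r'(\d+.\d+)'    # . matches ANY character
strict = r'(\d+\.\d+)'  # \. matches only a literal decimal point

loose_hit = re.search(loose, 'went down 4x54 points')
strict_hit = re.search(strict, 'went down 4x54 points')

result = re.search(r'stock (\w+) went down (\d+\.\d+) points',
                   'The stock NVDA went down 4.54 points')
ticker, drop_text = result.groups()  # all captured groups as a tuple
drop = float(drop_text)              # captured text is a string until converted

print(loose_hit.group(1))  # 4x54
print(strict_hit)          # None
print(ticker, drop)        # NVDA 4.54
```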


# Exercise (3 min)

Write a regular expression that looks for a price:  a dollar sign followed by at least one digit.  Try running the search on the provided string.  Note that a dollar sign has a specific meaning in regular expressions (end of string), so you'll need to escape it (precede it with \\).

In [None]:
import re

longstring = "We paid $100 for those shoes"
pattern = r'\$\d+'  # \$ is a literal dollar sign; there is no need to match surrounding words like "paid"
result = re.search(pattern, longstring)

if result:
    print(result.group())

$100
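
For comparison, here is a sketch of what happens when the dollar sign is left unescaped, plus two alternatives to the backslash: putting the character in square brackets, or letting re.escape() add the backslash. The cents pattern and the '$19.99' string are an assumed extension, not part of the exercise.

```python
import re

longstring = "We paid $100 for those shoes"

unescaped = re.search(r'$\d+', longstring)           # $ means end of string here
bracketed = re.search(r'[$]\d+', longstring)         # inside [], $ is a literal
escaped = re.search(re.escape('$') + r'\d+', longstring)  # re.escape adds the backslash

# Optional group for cents: a literal point followed by two digits
with_cents = re.search(r'\$\d+(\.\d\d)?', 'It cost $19.99 today')

print(unescaped)           # None
print(bracketed.group())   # $100
print(escaped.group())     # $100
print(with_cents.group())  # $19.99
```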


*Using regular expressions to find cafes matching her pattern, Cynthia found the Face à Face Café, a small quaint cafe with the shades drawn.  But when she opened the door, she found not a quaint café, but a security checkpoint that reminded her of the TSA.  A Black man wearing clip-on sunglasses, a U2 T-shirt, and jeans beckoned her through the currently inert metal detector.*

*He grinned.  "You passed the interview, Cynthia," he said, offering his hand.  "I'm Aubin.  Welcome to SAGE."*

# Image Credits

"I know regular expressions" - xkcd 208 (https://xkcd.com/208/)