# Python Strings and Regex Primer

Python strings, even by themselves, can be tricky.  Whenever I type a string into my program directly (as opposed to fetching it from data or asking the user for input), I am writing a *string literal*.

In [27]:
# string literal
str1 = 'Python string literals are really easy!'
print(str1)

Python string literals are really easy!


In [30]:
str2 = '...or are they?\nThere are special escape sequences like this newline' 
print(str2)

...or are they?
There are special escape sequences like this newline


Here is a list of the "escape sequences" that Python treats specially inside a string literal:

https://linuxconfig.org/list-of-python-escape-sequence-characters-with-examples

For example, if I want a literal backslash (not an escape sequence) then I have to escape it by using two backslashes:

In [32]:
str3 = 'I do not want a\\newline here'
print(str3)

I do not want a\newline here
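
A quick way to check what Python actually stored is to compare `len()` and `repr()` with the literal as typed. The sketch below uses two made-up variable names (`newline` and `backslash`) purely for illustration:

```python
# \n is ONE character (a newline); \\ is ONE character (a backslash).
newline = 'a\nb'      # a, newline, b
backslash = 'a\\nb'   # a, backslash, n, b

print(len(newline))     # 3
print(len(backslash))   # 4

# repr() shows the string the way you would have to type it as a literal
print(repr(newline))    # 'a\nb'
print(repr(backslash))  # 'a\\nb'
```

`print()` shows the characters themselves, while `repr()` shows the escaped form, which is why the two can look different for the same string.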


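Another interesting escape sequence, `\uXXXX`, lets you insert a Unicode character by its four-digit hexadecimal code point (use `\UXXXXXXXX`, with eight digits, for code points that need more than four):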

https://en.wikipedia.org/wiki/List_of_Unicode_characters

In [33]:
str4 = "That will cost you \u20A45!"
print(str4)

That will cost you ₤5!
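
As a small sketch of how the `\u` escape relates to the rest of Python's Unicode tooling, the same character can be produced from its code point with `chr()` or by its official Unicode name with `\N{...}`. The variable names here are only illustrative:

```python
import unicodedata

lira = "\u20A4"

print(lira == chr(0x20A4))       # True: chr() builds the same character
print(hex(ord(lira)))            # 0x20a4: ord() recovers the code point
print(unicodedata.name(lira))    # LIRA SIGN
print("\N{LIRA SIGN}" == lira)   # True: \N{...} looks a character up by name

# Code points above FFFF need the eight-digit \U form
snake = "\U0001F40D"
print(len(snake))                # 1: still a single character
```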


Dealing with strings that themselves contain quotation marks can be a real pain.  One option is to escape the quotation mark:

In [34]:
str5 = "My favorite punctuation is the \" mark"
print(str5)

My favorite punctuation is the " mark


Beware that Python doesn't ALWAYS interpret a `\` as starting an escape sequence.  If the character that comes after isn't in this list

https://linuxconfig.org/list-of-python-escape-sequence-characters-with-examples

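then Python leaves the backslash in the string as an ordinary character.  For example, `\S` is not an escape sequence that Python recognizes, so both the backslash and the `S` are kept.  Don't rely on this, though: Python 3.6 and later flag unrecognized escapes with a `DeprecationWarning`, and Python 3.12 upgraded that to a `SyntaxWarning`.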

In [41]:
print("There is nothing particularly special about \S to Python")

There is nothing particularly special about \S to Python
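
To see both the kept backslash and the warning, here is a minimal sketch that compiles the literal at run time with `eval()` so the warning can be captured. Using `eval()` is only a demonstration trick, and the warning category printed depends on your Python version:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The source text being evaluated is the four characters "\S"
    s = eval(r'"\S"')

print(len(s))        # 2: the backslash was kept, followed by S
print(s == "\\S")    # True: same as an explicitly escaped backslash
print(s == r"\S")    # True: same as the raw-string spelling

# Shows which warning (if any) your Python version raised
for w in caught:
    print(w.category.__name__, "-", w.message)
```

Writing `"\\S"` or `r"\S"` gives the same string without any warning.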


Getting back to quotation marks: another option is to delimit the string using a quotation mark that you aren't using in the string itself.  Here I switched to single quotes, which Python is totally happy with:

In [35]:
str6 = 'My favorite punctuation is the " mark'
print(str6)

My favorite punctuation is the " mark


What if you want a string that uses both `'` and `"` characters in it?  You can either escape them, or switch to Python's handy "triple-quote" syntax:

In [36]:
str7 = """I like both " and ' punctuation!"""
print(str7)

I like both " and ' punctuation!


In [37]:
# You can do the same with triple single quotes
str8 = '''I like both " and ' punctuation!'''
print(str8)

I like both " and ' punctuation!


Triple quotes are most often used because they allow "multiline" string literals without having to use ugly `\n` escape sequences:

In [38]:
str9 = """This is a
cool string"""
print(str9)

This is a
cool string


Don't make the mistake of trying to align a multiline string in your code, though.  It looks pretty, but the spaces you add for alignment become part of the string:

In [39]:
str10 = """This is a
           cool string"""
print(str10)

This is a
           cool string
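
If you do want the literal indented to match the surrounding code, one option is the standard library's `textwrap.dedent`, which strips the whitespace common to every line. This is a sketch; the name `aligned` is just for illustration:

```python
import textwrap

# The backslash right after the opening quotes keeps the string from
# starting with an empty line.
aligned = textwrap.dedent("""\
    This is a
    cool string""")

print(aligned)
print(repr(aligned))   # 'This is a\ncool string'
```

Note that `dedent` only removes indentation shared by all lines, so every line of the literal has to be indented by at least the same amount.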


Sometimes all of this character escaping that Python does can be annoying.  You can turn it off by prefixing the literal with `r`, the so-called "raw" string syntax:

In [40]:
str11 = r'In this raw string Python ignores the backslash \n'
print(str11)

In this raw string Python ignores the backslash \n
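
A place where this matters in practice is a Windows-style path. The path below is made up for illustration, and was picked because `\n` and `\t` are both real escape sequences:

```python
# Without the r prefix, \n becomes a newline and \t becomes a tab
path_plain = 'C:\new\table.txt'
# With the r prefix, every backslash is kept exactly as typed
path_raw = r'C:\new\table.txt'

print(len(path_plain))                   # 14
print(len(path_raw))                     # 16
print(path_raw == 'C:\\new\\table.txt')  # True
print(path_raw)                          # C:\new\table.txt
```

One limitation to be aware of: a raw string literal cannot end in a single backslash, so a trailing backslash has to be added separately, for example `r'C:\new' + '\\'`.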


## Regex

Regex brings its own set of complications to string literals.

The very first thing to know is that PYTHON gets to interpret your string literal FIRST.

In other words, regex doesn't see the pattern that you literally typed in.  It sees the string that PYTHON has messed with.

Regex is like a little language by itself, and comes with its own commands and special escape sequences:

https://www.debuggex.com/cheatsheet/regex/python

Problems arise when you are trying to tell regex to match an escape sequence, but that escape sequence also happens to mean something to Python.  Here's an example:

In [45]:
scary_data = r'hi\there'
print(scary_data)

hi\there


In [46]:
import re

# I want to match the backslash LITERALLY, so I escape it
pattern = '[A-Za-z]+\\[A-Za-z]+'
m = re.match(pattern, scary_data)
m

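This failed (`m` is `None`) because the regex never SAW your escaped backslash.  Python collapsed the `\\` into a single `\`, and the regex then read that lone backslash as escaping the `[` that follows it.  You can see what the regex received if you print your pattern: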

In [47]:
print(pattern)

[A-Za-z]+\[A-Za-z]+
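
To make this concrete, here is a small sketch of what that pattern really matches. Because the regex reads `\[` as a literal `[`, the rest of the second character class turns into plain literal text. The test string `'hi[A-Za-z]'` is invented for the demonstration:

```python
import re

pattern = '[A-Za-z]+\\[A-Za-z]+'

# Python turned the two typed backslashes into one character
print(pattern.count('\\'))                # 1

# No match against the data we cared about...
print(re.match(pattern, 'hi\\there'))     # None

# ...but it does match letters followed by the literal text "[A-Za-z]"
m = re.match(pattern, 'hi[A-Za-z]')
print(m.group(0))                         # hi[A-Za-z]
```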


What you need to do here is DOUBLE ESCAPE each backslash (once for Python, and once for regex):

In [48]:
pattern = '[A-Za-z]+\\\\[A-Za-z]+'
print(pattern)
m = re.match(pattern, scary_data)
m

[A-Za-z]+\\[A-Za-z]+


<_sre.SRE_Match object; span=(0, 8), match='hi\\there'>

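Now it matched just fine.  In order to avoid this double-escaping hell, it is usually best with regex to use RAW Python strings (i.e. tell Python to stop trying to interpret backslashes, so the string reaches the regex exactly as you typed it).  Note that the regex-level escape is still needed: the raw pattern below contains `\\`, which the regex reads as one literal backslash.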

In [49]:
pattern = r'[A-Za-z]+\\[A-Za-z]+'
print(pattern)
m = re.match(pattern, scary_data)
m

[A-Za-z]+\\[A-Za-z]+


<_sre.SRE_Match object; span=(0, 8), match='hi\\there'>
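
When the text you want to match literally comes from data rather than from your own typing, the standard library's `re.escape` can add the regex-level escaping for you. A minimal sketch, with `literal` and `combined` as illustrative names:

```python
import re

scary_data = r'hi\there'

# re.escape() backslash-escapes anything the regex would treat as special
literal = re.escape(scary_data)
print(literal)                                          # hi\\there
print(re.fullmatch(literal, scary_data) is not None)    # True

# It can also be mixed with hand-written raw-string pieces
combined = r'[A-Za-z]+' + re.escape('\\') + r'[A-Za-z]+'
print(combined == r'[A-Za-z]+\\[A-Za-z]+')              # True
print(re.match(combined, scary_data).group(0))          # hi\there
```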

Here is one example of how to do your Week 4 homework:

In [50]:
logentry = 'maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891'

In [54]:
pattern = r'(\S+)\s+(\S+)\s+(\S+)\s+\[(.*)\]\s+"(.*)"\s+(\d+)\s+(\d+)'
myregex = re.compile(pattern)

In [55]:
m = myregex.match(logentry)
m

<_sre.SRE_Match object; span=(0, 108), match='maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 ->

In [58]:
result_tuple = m.groups()
result_tuple

('maynard.isi.uconn.edu',
 '-',
 '-',
 '28/Jul/1995:13:32:22 -0400',
 'GET /images/shuttle-patch-logo.gif HTTP/1.0',
 '200',
 '891')

In [59]:
result_tuple[3]

'28/Jul/1995:13:32:22 -0400'
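
Indexing into a tuple by position gets hard to read, so one possible variation is to give each group a name with the `(?P<name>...)` syntax and pull the fields out as a dictionary. The group names below (`host`, `ident`, `user`, and so on) are my own labels, not anything required by the log format, and the `strptime` format string assumes English month abbreviations in the current locale:

```python
import re
from datetime import datetime

logentry = ('maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 -0400] '
            '"GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891')

# Same pattern as above, with a (hypothetical) name attached to each group
named_regex = re.compile(
    r'(?P<host>\S+)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+'
    r'\[(?P<timestamp>.*)\]\s+"(?P<request>.*)"\s+'
    r'(?P<status>\d+)\s+(?P<size>\d+)'
)

m = named_regex.match(logentry)
if m is None:
    raise ValueError('log entry did not match the pattern')

fields = m.groupdict()
# The regex only ever hands back strings, so convert the numeric fields
fields['status'] = int(fields['status'])
fields['size'] = int(fields['size'])
# %z parses the -0400 UTC offset
when = datetime.strptime(fields['timestamp'], '%d/%b/%Y:%H:%M:%S %z')

print(fields['host'])      # maynard.isi.uconn.edu
print(fields['status'])    # 200
print(when.year)           # 1995
```

Checking for `None` before calling `groupdict()` matters once you run this over a whole log file, since any line that does not fit the pattern would otherwise raise an `AttributeError`.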