# More Regular Expressions

If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec07_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [3]:
import re 

# Raw strings

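We want to use raw strings whenever the text contains backslashes, whether as formatting characters (such as `\n` or `\t`) or as part of regex metacharacters and character classes (such as `\b` or `\d`). The `r` prefix tells Python to leave the backslashes alone.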

## Common example: backslashes

For example, suppose we want to work with a Windows file path:

```'C:\Users\Real Python\main.py'```

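The backslash is an escape character in Python strings (as written above, `\U` would be read as the start of a Unicode escape and raise a `SyntaxError`). In a regular string we would have to escape each backslash by doubling it: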

```regex = 'C:\\Users\\Real Python\\main.py'```

or we could just make it a raw string!

```regex = r'C:\Users\Real Python\main.py' ```

In [17]:
# compare strings and raw strings

d1 = 'C:\\Users\\Real Python\\main.py'

d2 = r'C:\Users\Real Python\main.py'

d1==d2

True
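
Raw strings matter even more for regex patterns, because some regex sequences collide with Python string escapes. A minimal sketch, assuming a made-up sentence (`sample` is not from the lecture): in a regular string `'\b'` is the backspace character, while the raw string `r'\b'` reaches the regex engine as a word boundary.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "a cat scattered the catalog"

# Regular string: '\b' is a backspace character, so nothing matches
no_raw = re.findall('\bcat\b', sample)

# Raw string: r'\b' is passed through to the regex engine as a word boundary
with_raw = re.findall(r'\bcat\b', sample)

print(no_raw)    # []
print(with_raw)  # ['cat']
```

Only the standalone word matches with the raw pattern; the `cat` inside `scattered` and `catalog` is not surrounded by word boundaries.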

## Formatting characters

New line, tab: ``` \n \t```

In [18]:
# A regular string in Python
print('\tTab')

	Tab


In [19]:
# A raw string in Python
print(r'\tTab')

\tTab


In [20]:
print('Line1\nLine2')

Line1
Line2


In [21]:
print(r'Line1\nLine2')

Line1\nLine2
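
A raw string keeps the backslash and the letter as two separate characters, but the regex engine still understands the pair. A short sketch (the variable names are my own) showing the difference in length, and that the raw pattern `r'\n'` still matches a real newline:

```python
import re

two_lines = 'Line1\nLine2'   # contains one real newline character

print(len('\n'))    # 1: a single newline character
print(len(r'\n'))   # 2: a backslash followed by the letter n

# The regex engine interprets the two-character sequence \n as a newline
newline_matches = re.findall(r'\n', two_lines)
print(len(newline_matches))  # 1

# Splitting on the raw pattern separates the two lines
parts = re.split(r'\n', two_lines)
print(parts)        # ['Line1', 'Line2']
```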


# Set behavior

In [25]:
question = "What is the meaning of life? ^o^ "

In [9]:
# We have to escape special characters
re.findall(r'\?',question)

['?']

In [10]:
# We don't have to escape if it's inside of a set
re.findall(r'[li?]',question)

['i', 'i', 'l', 'i', '?']

In [24]:
# ^ at the start of a set negates it: match any character NOT listed in the set
re.findall(r'[^lift? ]',question)

['W',
 'h',
 'a',
 's',
 'h',
 'e',
 'm',
 'e',
 'a',
 'n',
 'n',
 'g',
 'o',
 'e',
 '^',
 'o',
 '^']

In [23]:
# escape a ^ inside a set with a backslash
re.findall(r'[\^l]',question)


['l', '^', '^']
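
The position of `^` inside a set decides its meaning. A small sketch, reusing the `question` string from above, that contrasts three cases: negation when `^` comes first, a literal `^` when it comes later, and an escaped `^`. The variable names are my own.

```python
import re

question = "What is the meaning of life? ^o^ "

# ^ first: negation, so this keeps everything except lowercase letters and spaces
negated = re.findall(r'[^a-z ]', question)

# ^ not first: it is an ordinary character, no backslash needed
literal = re.findall(r'[l^]', question)

# an escaped ^ is literal in any position
escaped = re.findall(r'[\^l]', question)

print(negated)             # ['W', '?', '^', '^']
print(literal)             # ['l', '^', '^']
print(literal == escaped)  # True
```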

## Quantifiers

In [None]:
string = "She sells seashells by the seashore."

In [None]:
# get patterns that start with s and end with s with exactly one character (any character) in between
re.findall(r's.s',string)

In [None]:
# get patterns that start with s and end with s with any 2 characters in between
re.findall(r's.{2}s',string)

In [None]:
# three characters in between
re.findall(r's.{3}s',string)

In [None]:
# 1 to 4 characters between s and s
re.findall(r's.{1,4}s',string)

In [None]:
# 1 to 4 characters between s and s (shifted to start just after the first s)
re.findall(r's.{1,4}s',string[string.find('s')+1:])
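
The shifted example changes the result because `re.findall` scans left to right and returns non-overlapping matches: once a match is found, the search resumes right after it. A minimal sketch using `re.finditer` to show where each match starts and ends (the variable names `spans` and `shifted` are my own):

```python
import re

string = "She sells seashells by the seashore."

# Each match object knows its text and its (start, end) position
spans = [(m.group(), m.span()) for m in re.finditer(r's.{1,4}s', string)]
print(spans)
# [('sells', (4, 9)), ('seas', (10, 14)), ('seas', (27, 31))]

# Starting one character after the first s lets a different match begin
shifted = string[string.find('s') + 1:]
shifted_matches = re.findall(r's.{1,4}s', shifted)
print(shifted_matches)
# ['s seas', 'seas']
```

In the unshifted string, `shells` is never reported because its first `s` was already used as the end of `seas`.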

## Lookahead and lookbehind

In [None]:
cats = "I love cats, but not catnaps! Cats are great! 🐱"

In [None]:
# positive lookahead; match cat or Cat only if it is followed by s
re.findall('[Cc]at(?=s)',cats)

In [None]:
# negative lookahead; match cat or Cat only if it is NOT followed by nap
re.findall('[Cc]at(?!nap)',cats)

In [None]:
# positive lookbehind; match cat or Cat only if it is preceded by "love "
re.findall('(?<=love )[Cc]at',cats)

In [None]:
# negative lookbehind; match cat or Cat only if it is NOT preceded by "love "
re.findall('(?<!love )[Cc]at',cats)
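
Lookarounds check the surrounding text without making it part of the match. A small sketch on a made-up string (`pets` is my own example, not from the lecture) comparing a plain pattern with its lookahead and lookbehind versions:

```python
import re

# Hypothetical sample text, used only for illustration
pets = "cats catnip Cats bobcat"

plain = re.findall(r'[Cc]ats', pets)            # the s is part of the match
ahead = re.findall(r'[Cc]at(?=s)', pets)        # the s is only checked
not_ahead = re.findall(r'[Cc]at(?!s)', pets)    # catnip and bobcat
behind = re.findall(r'(?<=bob)[Cc]at', pets)    # only the cat in bobcat
not_behind = re.findall(r'(?<!bob)[Cc]at', pets)

print(plain)       # ['cats', 'Cats']
print(ahead)       # ['cat', 'Cat']
print(not_ahead)   # ['cat', 'cat']
print(behind)      # ['cat']
print(not_behind)  # ['cat', 'cat', 'Cat']
```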

## Capture groups 

In [None]:
text = 'apple banana appleappleapple applee orange'

In [None]:
# + applies only to the single character immediately to its left (here, the final e)
re.findall(r'apple+',text)

In [None]:
# If we want to match a whole word 1 or more times, we group it
# note that findall then returns only what the group captured (a "capture group"), not the whole match
re.findall(r'(apple)+',text)

In [None]:
# To make it a non-capturing group, use ?:
re.findall(r'(?:apple)+',text)
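
When a pattern contains a capture group, `re.findall` returns what the group captured rather than the whole match, and a repeated group keeps only its last repetition. A sketch comparing the three patterns above on the same `text` (the result variable names are my own):

```python
import re

text = 'apple banana appleappleapple applee orange'

# + repeats only the final e, so each apple is a separate match
plus_on_e = re.findall(r'apple+', text)
print(plus_on_e)
# ['apple', 'apple', 'apple', 'apple', 'applee']

# capture group: three matches, each reported as the captured 'apple'
capturing = re.findall(r'(apple)+', text)
print(capturing)
# ['apple', 'apple', 'apple']

# non-capturing group: the same three matches, reported in full
non_capturing = re.findall(r'(?:apple)+', text)
print(non_capturing)
# ['apple', 'appleappleapple', 'apple']
```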

In [None]:
# Using grouping for collections of info
statement = 'Mary has 3 cats. Ben had 2 dogs. Maya has 14 chickens, and April has 1 alpaca.'

In [None]:
# get all the statements in the form "person has or had # pets"
re.findall(r'[A-Za-z]+\s[A-Za-z]+\s\d+\s[A-Za-z]+',statement)

In [None]:
# if I only care about the people, put the first part in a capture group
re.findall(r'([A-Za-z]+)\s[A-Za-z]+\s\d+\s[A-Za-z]+',statement)

In [None]:
# if I care about the people and the number of pets they have or had
re.findall(r'([A-Za-z]+)\s[A-Za-z]+\s(\d+)\s[A-Za-z]+',statement)

In [None]:
# if I care about the people and the number of pets and the type of pet
re.findall(r'([A-Za-z]+)\s[A-Za-z]+\s(\d+)\s([A-Za-z]+)',statement)
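
With several groups, `re.findall` returns one tuple per match, and every element is a string. A minimal sketch of turning those tuples into records with the count converted to an integer (the dictionary keys `name`, `count`, and `pet` are names I chose for illustration):

```python
import re

statement = 'Mary has 3 cats. Ben had 2 dogs. Maya has 14 chickens, and April has 1 alpaca.'
pattern = r'([A-Za-z]+)\s[A-Za-z]+\s(\d+)\s([A-Za-z]+)'

matches = re.findall(pattern, statement)
print(matches)
# [('Mary', '3', 'cats'), ('Ben', '2', 'dogs'), ('Maya', '14', 'chickens'), ('April', '1', 'alpaca')]

# Unpack each tuple and convert the number so it can be used in arithmetic
records = [{'name': n, 'count': int(c), 'pet': p} for n, c, p in matches]
total_pets = sum(r['count'] for r in records)
print(total_pets)  # 20
```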

You can refer back to what a capture group matched by using a backreference.

In [None]:
dates = "12/25/2025 01/01/2023 11/14/2022"
re.findall(r'(\d{2})/(\d{2})/(\d{4})',dates)

In [None]:
# Reference the capture groups to switch from MM/DD/YYYY to YYYY/MM/DD
re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3/\1/\2', dates)

In [None]:
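# You can name capture groups with (?P<name>...) to make them easier to reference. Use \g<name> when referencing them in the replacement
# Note that \g<name> only works in the replacement string of a substitution; inside a pattern, use (?P=name)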
re.sub(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})',r'\g<year>/\g<month>/\g<day>', dates)
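
The `\g<name>` syntax belongs to replacement strings, but the named groups themselves can be used elsewhere too. A sketch, assuming the same `dates` string: a match object exposes the names through `group('name')` and `groupdict()`, and inside a pattern the backreference syntax is `(?P=name)`. The `doubled` sample string is made up for illustration.

```python
import re

dates = "12/25/2025 01/01/2023 11/14/2022"
date_pattern = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})'

# Named groups on a match object
first = re.search(date_pattern, dates)
print(first.group('year'))  # 2025
print(first.groupdict())    # {'month': '12', 'day': '25', 'year': '2025'}

# Inside a pattern, (?P=word) refers back to the group named word
doubled = "this is is a test of the the pattern"  # hypothetical sample
repeated_words = re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', doubled)
print(repeated_words)       # ['is', 'the']
```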

In [None]:
# This type of technique is helpful for cleaning data (here: pad a single-digit day with a leading zero)
messy_dates = "12/25/2025 01/1/2023 11/14/2022"
re.sub(r'/(\d)/', r'/0\1/', messy_dates)
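
The substitution above pads a single-digit day. If the month can be a single digit as well, one option is to pass a function as the replacement, which `re.sub` calls with each match object. A sketch under that assumption; `pad_date` and the sample string `messier_dates` are hypothetical names for illustration.

```python
import re

def pad_date(match):
    """Rebuild one M/D/YYYY match with a two-digit month and day."""
    month, day, year = match.groups()
    return f"{int(month):02d}/{int(day):02d}/{year}"

# Hypothetical input where both month and day may be a single digit
messier_dates = "12/25/2025 1/1/2023 11/4/2022"

cleaned = re.sub(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b', pad_date, messier_dates)
print(cleaned)  # 12/25/2025 01/01/2023 11/04/2022
```

A function replacement is a design choice here: a plain replacement string cannot decide per match whether a leading zero is needed.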

Capture groups and backreferences can be used within a single regex.

In [None]:
string_with_nums = "123 432 543 578 443 444 757 577 222 974 199"

In [None]:
# find all occurrences of 3 digits in a row
re.findall(r'\d{3}',string_with_nums)

In [None]:
# find all occurrences of the same digit repeated 3 times in a row (findall returns only the captured digit)
re.findall(r'(\d)\1{2}',string_with_nums)

In [None]:
# to show the full repeats, wrap the whole pattern in an outer group and keep the first element of each tuple
matches = re.findall(r'((\d)\2{2})', string_with_nums)
[tup[0] for tup in matches]
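
An alternative to the nested group, sketched here with `re.finditer`: each match object returns the whole match through `group(0)`, so the single capture group from the earlier pattern is enough. The variable names `digits_only` and `repeats` are my own.

```python
import re

string_with_nums = "123 432 543 578 443 444 757 577 222 974 199"

# findall returns only the captured digit
digits_only = re.findall(r'(\d)\1{2}', string_with_nums)
print(digits_only)  # ['4', '2']

# group(0) gives the full three-digit repeat
repeats = [m.group(0) for m in re.finditer(r'(\d)\1{2}', string_with_nums)]
print(repeats)      # ['444', '222']
```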

## Activities

1. Extract all the emoticons such as `:)` or `:-)` etc. 

In [None]:
greeting = """
Hi! :D 
It is so nice to meet you! :-) 
I wish I could stay and chat :P but I have to go. :( 
Bye bye. D,:
"""

In [None]:
...

2. Extract the year, month, and day for each date in the list. The dates are in the form MM-DD-YYYY.

In [None]:
dates = ['01-31-2001','02-28-2002','03-30-2003','04-29-2004','05-28-2005','06-27-2006',
         '07-07-2007','08-08-2008','09-09-2009','10-10-2010','11-11-2011','12-12-2012']

year = ...
month = ...
day = ...

print(year)
print(month)
print(day)