# Regular Expressions

Import Python's regular expression library, `re`.

In [None]:
import re

| Regular Expression Pattern | Matches |
|:---:|:---:|
| `.` | Any character except a newline |
| `\w` | Word character (letter, digit, or underscore) |
| `\W` | NOT a word character |
| `\d` | Digit |
| `\D` | NOT a digit |
| `\s` | Whitespace |
| `\S` | NOT whitespace |
| `[abc]` | Any one of a, b, or c |
| `[^abc]` | Any one character that is NOT a, b, or c |
| `(abc)` | Capture group matching the exact sequence "abc" |
| `+` | 1 or more instances |
| `*` | 0 or more instances |
| `?` | 0 or 1 instance |
                   
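As a quick check of the table above, the sketch below applies a few of these patterns with `re.findall()`. The `sample` string is made up for illustration and is not part of the original notebook.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns in the table
sample = "Cat 42, bat_7!"

digits = re.findall(r"\d", sample)        # each digit on its own
numbers = re.findall(r"\d+", sample)      # runs of one or more digits
words = re.findall(r"\w+", sample)        # runs of word characters
cats_bats = re.findall(r"[CcBb]at", sample)  # "at" preceded by one of C, c, B, b

print(digits)     # ['4', '2', '7']
print(numbers)    # ['42', '7']
print(words)      # ['Cat', '42', 'bat_7']
print(cats_bats)  # ['Cat', 'bat']
```

Note that `\w` treats the underscore as a word character, which is why `bat_7` comes back as a single item.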

# Split Anything That's Not a Word

Let's try to split this string into individual words and remove all punctuation.

In [None]:
fav_animals = """Penguin
Butterflies
Buffalo
Koala
Birds
Elephant
Dogs
Dog
Raccoon 
can Balto count? :)
Rough collie!
Dinosaur"""

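`\W+` matches one or more consecutive non-word characters, so each run of spaces, newlines, and punctuation becomes a single split point.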

In [None]:
re.split(r'\W+', fav_animals)

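Compare this with a less robust approach, which removes only `!` and then splits on whitespace, leaving `count?` and `:)` in the result: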

In [None]:
fav_animals.replace("!", "").split()

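To make the difference concrete, the following sketch runs both approaches on a shorter string. The `sample` value is a hypothetical excerpt modeled on `fav_animals`. Because it ends in punctuation, `re.split()` would return a trailing empty string, so the sketch filters empty items out.

```python
import re

# Hypothetical excerpt modeled on fav_animals; note that it ends with "!"
sample = "Dog\ncan Balto count? :)\nRough collie!"

# Split on runs of non-word characters and drop any empty strings
regex_words = [word for word in re.split(r"\W+", sample) if word]

# Remove only "!" and split on whitespace
naive_words = sample.replace("!", "").split()

print(regex_words)  # ['Dog', 'can', 'Balto', 'count', 'Rough', 'collie']
print(naive_words)  # ['Dog', 'can', 'Balto', 'count?', ':)', 'Rough', 'collie']
```

The string-method version has to name every punctuation mark it wants removed, while `\W+` handles all of them at once.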
# Match 4 Digits in a Row

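`[0-9]{4}` matches exactly four digits in a row: `[0-9]` is any digit, and `{4}` repeats the preceding item four times.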

Try it out on RegExr: https://regexr.com/6jr7b

In [None]:
music_info = "Harry Styles was born in 1994. He is 29 years old. Mitski was born in 1990. She is 32. Mitski's album Laurel Hell came out in 2022."

To find the first match, use `re.search()`, which returns a match object (or `None` if nothing matches): https://docs.python.org/3/library/re.html#re.search

In [None]:
re.search("[0-9]{4}", music_info)

To access the matched text, call `.group()` on the match object.

In [None]:
re.search("[0-9]{4}", music_info).group()

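Calling `.group()` directly raises an `AttributeError` when the pattern is not found, because `re.search()` returns `None` in that case. A minimal sketch of a guard, using hypothetical example strings:

```python
import re

# Hypothetical inputs: one with a four-digit year, one without
with_year = "Mitski was born in 1990."
without_year = "No years here."

match = re.search(r"[0-9]{4}", with_year)
year = match.group() if match else None

no_match = re.search(r"[0-9]{4}", without_year)
missing_year = no_match.group() if no_match else None

print(year)          # 1990
print(missing_year)  # None
```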
To extract all matches as a list of strings, use `re.findall()`: https://docs.python.org/3/library/re.html#re.findall

In [None]:
re.findall("[0-9]{4}", music_info)

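One caveat: `[0-9]{4}` also matches the first four digits of a longer number. If that matters for your data, one option is to wrap the pattern in word boundaries (`\b`). The `text` below is a hypothetical example, not taken from `music_info`.

```python
import re

# Hypothetical text containing a six-digit number and a four-digit year
text = "Order 123456 shipped in 2022."

loose = re.findall(r"[0-9]{4}", text)       # also grabs part of 123456
strict = re.findall(r"\b[0-9]{4}\b", text)  # only standalone four-digit runs

print(loose)   # ['1234', '2022']
print(strict)  # ['2022']
```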
# Extract Last Name From English Catalogue of Books Entry

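Extract everything that comes before the opening parenthesis. The lookahead `(?=\()` requires a `(` to follow but leaves it out of the match, so the result here is "Andersen " (including the trailing space).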

In [None]:
ecb_entry = "Andersen (Hans)-Fairy tales. Ryl. 8vo. bds., 1s. net NELSON, Sep."  

In [None]:
re.search(r"(.*)(?=\()", ecb_entry).group(0)

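If you want the last name without the trailing space, one possible variation is to capture lazily and let `\s*` absorb any whitespace before the parenthesis. This is a sketch of an alternative, not the pattern used above; the variable names `with_space` and `last_name` are introduced here for illustration.

```python
import re

ecb_entry = "Andersen (Hans)-Fairy tales. Ryl. 8vo. bds., 1s. net NELSON, Sep."

# Lookahead pattern from above: the match keeps the space before "("
with_space = re.search(r"(.*)(?=\()", ecb_entry).group(0)

# Alternative: capture lazily from the start, then skip optional whitespace before "("
match = re.search(r"^(.*?)\s*\(", ecb_entry)
last_name = match.group(1) if match else None

print(repr(with_space))  # 'Andersen '
print(repr(last_name))   # 'Andersen'
```

Calling `.strip()` on the original result would work just as well for this entry.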
# Further Resources

- https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/03-Web-Scraping-Part2.html#regular-expressions
- https://programminghistorian.org/en/lessons/understanding-regular-expressions