# RegEx

In Python, regular expressions are handled primarily by the built-in `re` module. If you only need fast lookup of fixed keywords rather than full patterns, check out the `flashtext` package.

Commonly used methods in `re` include:

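- `match` - matches the pattern only at the *beginning* of the string
- `search` - scans the entire string and returns the first match
- `split` - splits the string wherever the pattern matches
- `findall` - returns all non-overlapping matches as a list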

These functions are called as `re.match(pattern, string)`, i.e. the pattern is the first argument and the string is the second.
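To make the difference between `match` and `search` concrete, here is a minimal sketch. The sample string `text` is made up for illustration.

```python
import re

text = "say hello"

# match only succeeds if the pattern matches at position 0
at_start = re.match(r"hello", text)

# search scans the whole string and returns the first match anywhere
anywhere = re.search(r"hello", text)

print(at_start)          # None
print(anywhere.group())  # hello
print(anywhere.span())   # (4, 9)
```

Both functions return a match object on success and `None` on failure, so it is worth checking the result before calling `.group()`.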

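Greedy matching (quantifiers match as much text as possible by default):

- `*` matches 0 or more of the preceding element
- `+` matches 1 or more of the preceding element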
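The difference between `*` and `+`, and what greedy matching means in practice, can be sketched as follows. The sample strings are invented for illustration, and the lazy `+?` form is an addition shown only for contrast.

```python
import re

# `*` allows zero repetitions of "b", `+` requires at least one
print(re.findall(r"ab*", "a ab abb"))  # ['a', 'ab', 'abb']
print(re.findall(r"ab+", "a ab abb"))  # ['ab', 'abb']

html = "<b>bold</b> and <i>italic</i>"

# Greedy: `.+` grabs as much as it can, so the match runs
# from the first "<" to the last ">"
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: adding `?` after the quantifier stops at the first ">"
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```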

In [2]:
import re

## Split the string

Use `split`. This is similar to R's `strsplit()` and `stringr::str_split()`.

In [3]:
re.split(" ", "5 10 15 20")

['5', '10', '15', '20']

In [4]:
re.split(r"\s+", "Something \n Next")

['Something', 'Next']

**Example** - Split text into sentences.

In [7]:
my_string = "Let's write RegEx! Won't that be fun? I sure think so."

# Inside a character class, | is a literal pipe, so it is not needed here
sentence_endings = r"[!?.]"

re.split(sentence_endings, my_string)

["Let's write RegEx", " Won't that be fun", ' I sure think so', '']
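The result above has a trailing empty string (the text ends with a delimiter) and leading spaces on the later sentences. One possible cleanup, sketched here with a list comprehension, is to strip each piece and drop the empty ones. The names `parts` and `sentences` are illustrative.

```python
import re

my_string = "Let's write RegEx! Won't that be fun? I sure think so."

# Split on sentence-ending punctuation
parts = re.split(r"[!?.]", my_string)

# Strip surrounding whitespace and discard empty pieces
sentences = [p.strip() for p in parts if p.strip()]

print(sentences)
# ["Let's write RegEx", "Won't that be fun", 'I sure think so']
```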

**Example** - Split a string on whitespace.

In [10]:
my_string = "Let's write RegEx! Won't that be fun? I sure think so."

re.split(r"\s+", my_string)

["Let's",
 'write',
 'RegEx!',
 "Won't",
 'that',
 'be',
 'fun?',
 'I',
 'sure',
 'think',
 'so.']

## Extract all strings that match a pattern

Use `findall`, which works like `stringr::str_extract_all()` in R: it returns every match, not just the first.

In [6]:
my_string = "Yay! This is a sentence."

# Extract only words
re.findall(r"\w+", my_string)

['Yay', 'This', 'is', 'a', 'sentence']

**Example** - Find all capitalized words

In [12]:
my_string = "Let's write RegEx! Won't that be fun? I sure think so."

caps = r"[A-Z]\w*"

re.findall(caps, my_string)

['Let', 'RegEx', 'Won', 'I']

**Example** - Count all of the words in Hamlet.

In [None]:
# from collections import Counter

# # Read the file inside a context manager so it is closed afterwards
# with open('hamlet.txt') as f:
#     words = re.findall(r'\w+', f.read().lower())

# # What are the 10 most common words?
# Counter(words).most_common(10)
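Since `hamlet.txt` may not be on hand, the same word-counting idea can be tried on an inline string. This is a minimal sketch; the sample sentence is made up and stands in for the file contents.

```python
import re
from collections import Counter

# Stand-in text; replace with the contents of a real file
text = "The cat saw the dog and the dog saw the cat."

# Lowercase first so "The" and "the" are counted together
words = re.findall(r"\w+", text.lower())

counts = Counter(words)

print(len(words))             # 11
print(counts.most_common(1))  # [('the', 4)]
```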