# regex cheatsheet

I use regular expressions often enough, but I still have to look up half the syntax each time. Maybe if I write my own cheatsheet and compile the examples I've come across, I'll actually memorize it.

## Syntax

Quantifiers
* `?`: preceding character or group X appears 0 or 1 times
* `*`: preceding character or group X appears 0 or more times
* `+`: preceding character or group X appears 1 or more times
* `{n}`: preceding character or group X appears exactly n times

Characters
* `.`: any character (except a newline by default)
* `\d`: digit (0-9)
* `\D`: not a digit
* `\s`: whitespace
* `\S`: not whitespace
* `\w`: letter, digit, or underscore
* `\W`: not a letter, digit, or underscore (punctuation, whitespace, etc.)

Other
* `X|Y`: X or Y
* `[^X]`: any single character except X
* `()`: grouping
* `[]`: one of these characters
* `a-z`: range from a to z inside `[]`, works for digits or letters
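
To see the character classes and the "Other" entries in action, here is a minimal sketch using Python's `re.findall`. The `sample` string and the variable names are made up for illustration.

```python
import re

sample = "Room 42_b, floor 3!"

# \d matches single digits
digits = re.findall(r"\d", sample)
# \w matches letters, digits, and underscore; + joins consecutive matches
words = re.findall(r"\w+", sample)
# [^...] negates a set: runs of characters that are neither letters nor whitespace
others = re.findall(r"[^a-zA-Z\s]+", sample)
# X|Y picks either alternative
either = re.findall(r"Room|floor", sample)

print(digits)  # ['4', '2', '3']
print(words)   # ['Room', '42_b', 'floor', '3']
print(others)  # ['42_', ',', '3!']
print(either)  # ['Room', 'floor']
```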

Specific to [Nutrimatic](https://nutrimatic.org/2024/)
* `"team mate"`: quoted text keeps this exact length and only breaks words at the given spaces
* `<mate>`: anagram of mate 
* `_`: alphanumeric 
* `A`: alphabetic  
* `#`: numeric 
* `V`: vowel 
* `C`: consonant 


### Simple Examples

* `c(at|oat)`: `cat` or `coat` 
* `[a-z]at`: 26 possible strings (`bat`, `cat`, `hat`, ...)
* `#CVCVCVC`: could be `5 minutes`
* `go*gle`: could be ggle, gogle, google, gooogle, etc.
* `(go)*gle`: could be gle, gogle, gogogle, gogogogle, etc.
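
A quick sanity check of the plain-regex examples above with `re.fullmatch` (the Nutrimatic-only `#CVCVCVC` is left out). This is just a sketch: the `checks` list and the non-matching candidates `cot` and `google` are my own additions.

```python
import re

# (pattern, candidate, expected) triples based on the list above
checks = [
    (r"c(at|oat)", "cat", True),
    (r"c(at|oat)", "coat", True),
    (r"c(at|oat)", "cot", False),
    (r"[a-z]at", "hat", True),
    (r"go*gle", "ggle", True),       # * allows zero o's
    (r"go*gle", "gooogle", True),
    (r"(go)*gle", "gogogle", True),  # the whole group repeats
    (r"(go)*gle", "google", False),  # "oo" is not a repeat of "go"
]

# fullmatch requires the pattern to cover the entire candidate
results = [bool(re.fullmatch(p, s)) for p, s, _ in checks]
for (pattern, candidate, _), ok in zip(checks, results):
    print(f"{pattern!r:14} {candidate!r:10} {ok}")
```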

## Python Examples



In [1]:
import re 
import urllib.parse
import urllib.request

### 1) Stripping a string


First, a simple example: stripping unwanted characters from a string.

In [2]:
string = '2#C a$_-T#^s*'

# strip all but alphanumeric characters and change to uppercase
newstring1 = re.sub(r'[^a-zA-Z0-9]','',string).upper()
print(newstring1)

# strip all but alphabetic characters and change to uppercase
newstring2 = re.sub(r'[^a-zA-Z]','',string).upper()
print(newstring2)

2CATS
CATS
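
It is tempting to swap `[^a-zA-Z0-9]` for the shorter `\W`, but the two are not the same: `\W` treats the underscore as a word character and leaves it in. A small sketch of the difference on the same string (the variable names are mine):

```python
import re

string = '2#C a$_-T#^s*'

# \W leaves the underscore in place, because _ counts as a word character
kept_underscore = re.sub(r'\W', '', string).upper()
# adding _ to the set removes it too, matching [^a-zA-Z0-9] for this ASCII input
no_underscore = re.sub(r'[\W_]', '', string).upper()

print(kept_underscore)  # 2CA_TS
print(no_underscore)    # 2CATS
```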


### 2) Parsing through a wordlist

Wordlists are taken from: 
* [List of English Words](https://github.com/dwyl/english-words/tree/master)
* [Collins Scrabble Dictionary](https://boardgames.stackexchange.com/questions/38366/latest-collins-scrabble-words-list-in-text-file)

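The `^` and `$` in the query anchor the pattern to the beginning and end of the word. Since `re.match` only matches at the start of a string, the leading `^` is optional in the searches below, but the trailing `$` is not.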
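
A small sketch of how the anchors interact with `re.match`, `re.search`, and `re.fullmatch`. The test words are chosen just for illustration.

```python
import re

# re.match only tries the start of the string, so ^ is implied
starts = bool(re.match(r"CAT", "CATALOG"))
# $ forces the pattern to run to the end of the word
whole = bool(re.match(r"CAT$", "CATALOG"))
# re.search scans the whole string, so the anchors do all the work
ends = bool(re.search(r"CAT$", "BOBCAT"))
not_start = bool(re.search(r"^CAT", "BOBCAT"))
# re.fullmatch behaves as if both anchors were present
full = bool(re.fullmatch(r"C(AT|OAT)", "COAT"))

print(starts, whole, ends, not_start, full)  # True False True False True
```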

In [3]:
def load_words_english():
    with open('data/wordlist/words_alpha.txt') as word_file:
        valid_words = set(word_file.read().split())
    valid_words = [item.upper() for item in valid_words]
    return valid_words

def load_words_scrabble():
    with open('data/wordlist/Collins Scrabble Words (2019).txt') as word_file:
        valid_words = set(word_file.read().split())
    return valid_words

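def regexWordSearch(query, wordlist, show=True):
    # IGNORECASE instead of query.upper(), which would turn \d into \D
    r = re.compile(query, re.IGNORECASE)
    # r.match anchors at the start of each word
    matches = list(filter(r.match, wordlist))
    if show:
        print(matches)
    return matches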


In [4]:

words_english = load_words_english() 
words_scrabble = load_words_scrabble()

ans = regexWordSearch('^c(at|oat)$', words_english)
ans = regexWordSearch("[a-z]{3}cat$", words_scrabble)


['COAT', 'CAT']
['BOBCAT', 'OCICAT', 'RAMCAT', 'MERCAT', 'MUDCAT', 'HEPCAT', 'MUSCAT', 'TIPCAT', 'FORCAT', 'TOMCAT']
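
To try the search without downloading the wordlist files, here is a self-contained sketch. `sample_words` is a made-up mini list and `search_words` is a hypothetical stand-in for the function above; it compiles with `re.IGNORECASE` so that lowercase queries match the uppercase words.

```python
import re

# hypothetical mini wordlist standing in for the files loaded above
sample_words = ["CAT", "COAT", "COATS", "BOBCAT", "TOMCAT", "CATALOG", "SCAT"]

def search_words(query, wordlist):
    """Return the words that the query matches from their first character."""
    pattern = re.compile(query, re.IGNORECASE)
    return [word for word in wordlist if pattern.match(word)]

exact = search_words(r"^c(at|oat)$", sample_words)
suffix = search_words(r"[a-z]{3}cat$", sample_words)
print(exact)   # ['CAT', 'COAT']
print(suffix)  # ['BOBCAT', 'TOMCAT']
```

Using a list here keeps the output order stable, unlike the `set` used when loading the real files.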


### 3) Querying Nutrimatic

This simply URL-encodes the query and fetches the results page with `urllib`.
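
For reference, this is what `urllib.parse.quote_plus` does to a query before it goes into the URL (a sketch that only builds the string and makes no network request; the variable names are mine):

```python
from urllib.parse import quote_plus

# quotes, angle brackets, and spaces are not URL-safe, so they get escaped
encoded_anagram = quote_plus('"<asympote_>"')
encoded_phrase = quote_plus('"team mate"')
url = 'https://nutrimatic.org/?q=' + encoded_anagram + '&go=Go'

print(encoded_anagram)  # %22%3Casympote_%3E%22
print(encoded_phrase)   # %22team+mate%22
print(url)
```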

In [5]:
# max number of solutions on a nutrimatic page is 100
tol_number = 10 
# weight tolerance - change to 0 to ignore constraint
tol_weight = 0


def get_raw_data(query):
    query = urllib.parse.quote_plus(query) # URL-encode the query
    url = 'https://nutrimatic.org/?q='+query+'&go=Go'
    text = urllib.request.urlopen(url).read()
    text1 = text.decode()
    return text1

def parse_raw_data(rawdata):
    # there's probably a package to do this, oh well
    posA = [m.start() for m in re.finditer('<span',rawdata)]
    posB = [m.start() for m in re.finditer('</span',rawdata)]

    # extract solution and weights
    solutions = []
    weights = []
    for n in range(0,min(len(posA),tol_number)):
        # fixed offsets from the start of each <span: weight at +23..+32, text from +36
        word = rawdata[posA[n]+36:posB[n]]
        size = float(rawdata[posA[n]+23:posA[n]+32])
        solutions.append(word)
        weights.append(size)
    return solutions,weights

def print_results(solutions, weights):
    for idx, sol in enumerate(solutions):
        if weights[idx] > tol_weight:
            print(sol+'\t'+str(weights[idx])) 
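
Since this is a regex cheatsheet, the fixed offsets could also be replaced by capture groups. The sketch below is an alternative, not what the code above does. It assumes each result looks like `<span style='font-size: 2.742922em'>asymptote</span>`, which is my guess from the offsets (+23 for the weight, +36 for the text); `sample_html`, `RESULT_RE`, and `parse_with_groups` are hypothetical names.

```python
import re

# hypothetical sample of the result markup, inferred from the offsets above
sample_html = (
    "<span style='font-size: 2.742922em'>asymptote</span><br>\n"
    "<span style='font-size: 2.404358em'>pentasomy</span><br>\n"
)

# group 1 captures the weight, group 2 the solution text
RESULT_RE = re.compile(r"<span style='font-size: ([\d.]+)em'>(.*?)</span>")

def parse_with_groups(rawdata, limit=10):
    """Return (solutions, weights) for the first `limit` result spans."""
    pairs = RESULT_RE.findall(rawdata)[:limit]
    solutions = [word for _, word in pairs]
    weights = [float(size) for size, _ in pairs]
    return solutions, weights

solutions, weights = parse_with_groups(sample_html)
print(solutions)  # ['asymptote', 'pentasomy']
print(weights)    # [2.742922, 2.404358]
```

If the real markup differs from the assumed sample, only `RESULT_RE` should need adjusting.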


Example query: an anagram of `asympote` plus one unknown character (`_`)

In [6]:
query = '"<asympote_>"'
rawdata = get_raw_data(query)
solutions,weights = parse_raw_data(rawdata)
print_results(solutions, weights)


asymptote	2.742922
pentasomy	2.404358
mastopexy	2.395467
tapecomys	2.193147
montsapey	2.108904
hypostema	1.979579
mesotrypa	1.979579
samythope	1.960517


Example from the Nutrimatic landing page: the vowels `a`, `e`, `i`, `o`, `u`, `y` in order, separated by any number of consonants

In [7]:
query = '"C*aC*eC*iC*oC*uC*yC*"'
rawdata = get_raw_data(query)
solutions,weights = parse_raw_data(rawdata)
print_results(solutions, weights)

facetiously	2.677221
abcdefghijklmnopqrstuvwxyz	2.27003
aeiouy	1.996981
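
The same query can be approximated in plain Python regex by swapping Nutrimatic's `C` for a consonant character class. This is a rough sketch: `CONSONANT` is my own stand-in (I'm assuming `y` counts as a consonant), and the test words other than the results above are my own picks.

```python
import re

# assumed Python stand-in for Nutrimatic's C (consonant) token
CONSONANT = "[b-df-hj-np-tv-z]"

# C*aC*eC*iC*oC*uC*yC* with every C swapped for the character class
query = "C*aC*eC*iC*oC*uC*yC*"
pattern = re.compile(query.replace("C", CONSONANT))

candidates = [
    "facetiously",
    "abstemiously",
    "abcdefghijklmnopqrstuvwxyz",
    "aeiouy",
    "education",
]
# fullmatch so the whole word has to fit the vowel order
in_order = {word: bool(pattern.fullmatch(word)) for word in candidates}
for word, ok in in_order.items():
    print(word, ok)
```

Only `education` fails here, since its vowels are not in `a, e, i, o, u, y` order.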
