# **Regex**

**Regular expressions** (regex) are a pattern-matching language.

Instead of listing every digit (0 1 2 3 4 5 6 7 8 9), you can write the character class [0-9] or the shorthand \d.
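As a quick check of this shorthand, here is a minimal sketch (the sample string is made up for illustration) that matches digits in three ways and compares the results. One caveat: on Python 3 strings, \d also matches non-ASCII decimal digits unless the `re.ASCII` flag is set, so it is only equivalent to [0-9] for ASCII text like this.

```python
import re

text = "Room 42, floor 7"  # made-up sample text

# Three ways to match a single digit
explicit = re.findall(r"0|1|2|3|4|5|6|7|8|9", text)  # spell out every digit
char_class = re.findall(r"[0-9]", text)              # character class
shorthand = re.findall(r"\d", text)                  # shorthand class

print(explicit)    # ['4', '2', '7']
print(char_class)  # ['4', '2', '7']
print(shorthand)   # ['4', '2', '7']
```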

Regex is a domain-specific language (DSL): powerful within its domain, but limited compared with a general-purpose language.

## **Revisiting tokenization**

This time, we'll use regex to build our own tokenizer.

In [0]:
import re

In [0]:
re_punc = re.compile(r"([\"'().,;:/_?!—\-])")  # punctuation marks to surround with spaces
re_apos = re.compile(r"n ' t ")     # "n ' t" left behind by re_punc, re-joined as n't
re_bpos = re.compile(r" ' s ")      # " ' s" left behind by re_punc, re-joined as 's
re_mult_space = re.compile(r"  *")  # one or more spaces, collapsed to a single space

In [0]:
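def simple_toks(sent):
    sent = re_punc.sub(r" \1 ", sent)    # put spaces around each punctuation mark
    sent = re_apos.sub(r" n't ", sent)   # re-join n't
    sent = re_bpos.sub(r" 's ", sent)    # re-join 's
    sent = re_mult_space.sub(' ', sent)  # collapse runs of spaces
    return sent.lower().split()          # lowercase, then split on whitespace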

In [6]:
text = "I don't know what to write here (this is just an example) "
text

"I don't know what to write here (this is just an example) "

In [7]:
text2 = re_punc.sub(r" \1 ",text)
text2

"I don ' t know what to write here  ( this is just an example )  "

In [8]:
text3 = re_apos.sub(r" n't ", text2)
text3

"I do n't know what to write here  ( this is just an example )  "

In [10]:
text4 = re_bpos.sub(r" 's ", text3)
text4

"I do n't know what to write here  ( this is just an example )  "

In [0]:
sentences = ['All this happened, more or less.',
             'The war parts, anyway, are pretty much true.',
             "One guy I knew really was shot for taking a teapot that wasn't his.",
             'Another guy I knew really did threaten to have his personal enemies killed by hired gunmen after the war.',
             'And so on.',
             "I've changed all their names."]

In [13]:
tokens = list(map(simple_toks, sentences))
tokens

[['all', 'this', 'happened', ',', 'more', 'or', 'less', '.'],
 ['the',
  'war',
  'parts',
  ',',
  'anyway',
  ',',
  'are',
  'pretty',
  'much',
  'true',
  '.'],
 ['one',
  'guy',
  'i',
  'knew',
  'really',
  'was',
  'shot',
  'for',
  'taking',
  'a',
  'teapot',
  'that',
  'was',
  "n't",
  'his',
  '.'],
 ['another',
  'guy',
  'i',
  'knew',
  'really',
  'did',
  'threaten',
  'to',
  'have',
  'his',
  'personal',
  'enemies',
  'killed',
  'by',
  'hired',
  'gunmen',
  'after',
  'the',
  'war',
  '.'],
 ['and', 'so', 'on', '.'],
 ['i', "'", 've', 'changed', 'all', 'their', 'names', '.']]
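Note that "I've" in the last sentence comes out as three tokens ('i', "'", 've'), because only n't and 's are re-joined. One possible extension, sketched below, generalises the 's rule to a few more English contractions. The names `re_contr` and `simple_toks_v2` are hypothetical, and the suffix list is an assumption rather than a complete one.

```python
import re

re_punc = re.compile(r"([\"'().,;:/_?!—\-])")  # punctuation to surround with spaces
re_apos = re.compile(r"n ' t ")                # n't
# Hypothetical generalisation of the 's rule: also re-join 've, 're, 'll, 'd, 'm
re_contr = re.compile(r" ' (s|ve|re|ll|d|m) ")
re_mult_space = re.compile(r"  *")             # runs of spaces

def simple_toks_v2(sent):
    sent = re_punc.sub(r" \1 ", sent)
    sent = re_apos.sub(r" n't ", sent)
    sent = re_contr.sub(r" '\1 ", sent)  # \1 is the captured suffix
    sent = re_mult_space.sub(" ", sent)
    return sent.lower().split()

print(simple_toks_v2("I've changed all their names."))
# ['i', "'ve", 'changed', 'all', 'their', 'names', '.']
```

A rule like this can misfire on single-quoted text that happens to start with one of these suffixes, so it is worth testing on your own data.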

Now that we have our tokens, it's time to convert them to integer ids.

In [0]:
import collections

In [0]:
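PAD = 0  # id reserved for the padding token
SOS = 1  # id reserved for the start-of-sentence token

def toks2ids(sentences):
    # count every token across all sentences
    voc_cnt = collections.Counter(t for sent in sentences for t in sent)
    # most frequent tokens get the lowest ids
    vocab = sorted(voc_cnt, key=voc_cnt.get, reverse=True)
    # reserve ids 0 and 1 for the special tokens
    vocab.insert(PAD, "<PAD>")
    vocab.insert(SOS, "<SOS>")
    w2id = {w: i for i, w in enumerate(vocab)}
    ids = [[w2id[t] for t in sent] for sent in sentences]
    return ids, vocab, w2id, voc_cnt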

In [0]:
ids, vocab, w2id, voc_cnt = toks2ids(tokens)

In [19]:
ids[0] #first sentence

[5, 13, 14, 3, 15, 16, 17, 2]

In [21]:
vocab[:10] # some words

['<PAD>', '<SOS>', '.', ',', 'i', 'all', 'the', 'war', 'guy', 'knew']

In [24]:
w2id

{"'": 47,
 ',': 3,
 '.': 2,
 '<PAD>': 0,
 '<SOS>': 1,
 'a': 28,
 'after': 43,
 'all': 5,
 'and': 44,
 'another': 32,
 'anyway': 19,
 'are': 20,
 'by': 40,
 'changed': 49,
 'did': 33,
 'enemies': 38,
 'for': 26,
 'gunmen': 42,
 'guy': 8,
 'happened': 14,
 'have': 36,
 'hired': 41,
 'his': 12,
 'i': 4,
 'killed': 39,
 'knew': 9,
 'less': 17,
 'more': 15,
 'much': 22,
 "n't": 31,
 'names': 51,
 'on': 46,
 'one': 24,
 'or': 16,
 'parts': 18,
 'personal': 37,
 'pretty': 21,
 'really': 10,
 'shot': 25,
 'so': 45,
 'taking': 27,
 'teapot': 29,
 'that': 30,
 'the': 6,
 'their': 50,
 'this': 13,
 'threaten': 34,
 'to': 35,
 'true': 23,
 've': 48,
 'war': 7,
 'was': 11}
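The notebook reserves ids for `<PAD>` and `<SOS>` but does not show them in use. A common use, assumed here rather than taken from the source, is to prepend the start-of-sentence id to each sequence and right-pad every sequence to the same length so they can be batched. The helper `pad_batch` below is a hypothetical sketch of that idea.

```python
PAD = 0  # padding id, as defined above
SOS = 1  # start-of-sentence id, as defined above

def pad_batch(ids, pad_id=PAD, sos_id=SOS):
    """Prepend sos_id to each sequence and right-pad to the longest one."""
    with_sos = [[sos_id] + seq for seq in ids]
    max_len = max(len(seq) for seq in with_sos)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in with_sos]

# ids of 'All this happened, more or less.' and 'And so on.' from the output above
batch = pad_batch([[5, 13, 14, 3, 15, 16, 17, 2], [44, 45, 46, 2]])
print(batch)
# [[1, 5, 13, 14, 3, 15, 16, 17, 2], [1, 44, 45, 46, 2, 0, 0, 0, 0]]
```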

**Regex** is mainly used to:
* Find/search
* Find and replace
* Clean
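The three uses listed above can be sketched with the `re` module as follows. The sample text and the `re_phone` pattern are made up for illustration.

```python
import re

text = "Call 555-1234 or   555-9876  today!!"  # made-up sample text
re_phone = re.compile(r"\d{3}-\d{4}")          # hypothetical phone-number pattern

# Find/search: first match, then all matches
first = re_phone.search(text)
print(first.group())           # 555-1234
all_found = re_phone.findall(text)
print(all_found)               # ['555-1234', '555-9876']

# Find and replace: mask every phone number
masked = re_phone.sub("<PHONE>", text)
print(masked)                  # Call <PHONE> or   <PHONE>  today!!

# Clean: collapse runs of spaces and repeated exclamation marks
clean = re.sub(r"!+", "!", re.sub(r" +", " ", masked))
print(clean)                   # Call <PHONE> or <PHONE> today!
```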


Don't forget that **str** methods can be useful too.

In [0]:
str.find?
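For fixed substrings, plain **str** methods are often enough and need no pattern at all. A minimal sketch, reusing the example sentence from earlier:

```python
text = "I don't know what to write here"

# str.find returns the index of the first occurrence, or -1 if absent
pos = text.find("know")
missing = text.find("regex")
print(pos, missing)   # 8 -1

# str.replace substitutes a literal substring
expanded = text.replace("don't", "do not")
print(expanded)       # I do not know what to write here

# the in operator and str.startswith are simple membership and prefix tests
print("write" in text, text.startswith("I "))  # True True
```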

Regex also works on emoji, because Python 3 patterns match Unicode characters:

In [26]:
message = "😒🎦 🤢🍕"

re_frown = re.compile(r"😒|🤢")
re_frown.sub(r"😊", message)

'😊🎦 😊🍕'