![](images/regex.jpg)

----
By the end of this session, you should be able to
-----

1.	Describe why computer programmers use regular expressions (regex) to pattern-match strings.
2.	Describe the performance of a pattern-matching program in terms of true/false positives and true/false negatives.
3.	Write a regex to find phone numbers and emails in text documents.

--- 
What is regex?
---

A regular expression (regex) is a pattern-matching language.

It is a domain-specific language (DSL): powerful, but limited in scope.

Learning it is like learning any new language (e.g., French or Chinese): piece by piece, with lots of practice.

What other DSLs do you already know?

- SQL  
- Markdown

----
Why regex?
---

Programmers are lazy. Regex uses only a few symbols to describe a text search.

![cartoon](http://imgs.xkcd.com/comics/regular_expressions.png)

---
Regex example
---

Spec:
> Write a regex to match "calendar" and its common misspellings: "calendar", "calandar", or "celender" 

In [2]:
reset -fs

In [3]:
# Let's explore how to do this

# Patterns to match
chunk = ["calendar", "calandar", "celender"]

# Patterns to not match
chink = ["foo", "cal", "calli", "calaaaandar"] 

# Interleave them; zip() stops at the shorter list, so append the leftover items
string = " ".join([item 
                      for pair in zip(chunk, chink) 
                      for item in pair] + chink[len(chunk):])

In [4]:
string

'calendar foo calandar cal celender calli calaaaandar'

In [5]:
# You could match it with a list of literals
literal1 = 'calendar'
literal2 = 'calandar'
literal3 = 'celender'

pattern = "|".join([literal1, literal2, literal3])

__THERE MUST BE A BETTER WAY__

In [6]:
import re

# Yes: you should test regexes. ALWAYS HAVE TESTS!
assert sorted(re.findall(pattern, string)) != sorted(chink)
assert sorted(re.findall(pattern, string)) == sorted(chunk) 

Let's write it in the regex language

In [11]:
# A little bit of meta-programming (strings all the way down!)
sub_pattern = '[ae]'
pattern2 = sub_pattern.join(["c","l","nd","r"])

# Does our test still pass?
assert sorted(re.findall(pattern2, string)) != sorted(chink)
assert sorted(re.findall(pattern2, string)) == sorted(chunk)
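To see what the joined pattern looks like, and how it differs from the alternation of literals, here is a small self-contained sketch. The extra word `celandar` is a made-up input for illustration; it is not part of the spec above.

```python
import re

# Rebuild both patterns from the cells above so this block runs on its own
literal_pattern = "|".join(["calendar", "calandar", "celender"])
class_pattern = "[ae]".join(["c", "l", "nd", "r"])

print(class_pattern)  # c[ae]l[ae]nd[ae]r

# Hypothetical extra input: a spelling that is not listed in the spec
text = "calendar calandar celender celandar"

literal_matches = re.findall(literal_pattern, text)
class_matches = re.findall(class_pattern, text)

print(literal_matches)  # the three listed spellings only
print(class_matches)    # also includes 'celandar'
```

The character-class version is shorter, but it is also more permissive: it accepts 2 × 2 × 2 = 8 spellings rather than 3. Whether the extra matches are acceptable depends on the spec.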

---
Regex Terms
----


- __target string__: The string that we search, that is, the string in which we want to find a match for our search pattern.


- __search expression__: The pattern we use to find what we want. Most commonly called the regular expression. 


- __literal__: A literal is any character that we use in a search expression to stand for itself. For example, to find `ind` in `windows`, the search expression `ind` is a literal string: each character is literally the character we want to find.

- __metacharacter__: A metacharacter is one or more special characters that have a special meaning and are NOT treated as literals in the search expression. For example, `.` matches any single character (in Python, any character except a newline by default).

- __escape sequence__: An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. 

In a regular expression, an escape sequence consists of the metacharacter `\` (backslash) placed in front of the metacharacter that we want to use as a literal. If we want to find `\\file` in the target string `c:\\file`, then we need the search expression `\\\\file`: each of the 2 literal `\` characters we want to search for is preceded by an escaping `\`.
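The terms above can be tried out with Python's `re` module. This is a minimal sketch; apart from `windows` and `c:\\file`, the target strings are made up for illustration. Raw strings (`r"..."`) are used so that Python itself does not interpret the backslashes before the regex engine sees them.

```python
import re

# Literal: every character in the search expression stands for itself
literal_hits = re.findall(r"ind", "windows")

# Metacharacter: "." matches any single character except a newline
dot_hits = re.findall(r"c.t", "cat cot c.t ct")

# Escape sequence: "\." matches only a literal period
escaped_hits = re.findall(r"c\.t", "cat cot c.t ct")

# Two literal backslashes in the target need four in the search expression
target = r"c:\\file"
backslash_hit = re.search(r"\\\\file", target)

# re.escape() builds the escaped search expression for you
same_hit = re.search(re.escape(r"\\file"), target)

print(literal_hits, dot_hits, escaped_hits)
print(backslash_hit.group(), same_hit.group())
```

Using `re.escape()` is a reasonable habit whenever the text to search for comes from somewhere else (user input, a file name) and should be treated purely as literals.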


Regex: Connection to statistical concepts
----

![](images/type_I_error.jpg)

__False positives__ (Type I): Matching strings that we should __not__ have
matched

__False negatives__ (Type II): __Not__ matching strings that we should have
matched

Reducing the error rate for a task often involves two antagonistic efforts:

1. Minimizing false positives
2. Minimizing false negatives

In a perfect world, you would be able to minimize both, but in reality you often have to trade one for the other.
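One way to make this trade-off concrete is to count both kinds of error for a candidate pattern against a labeled word list. The following is a rough sketch: the helper name `error_counts` and the strict and loose patterns are illustrative assumptions, and the word lists reuse the calendar example from earlier in the session.

```python
import re

def error_counts(pattern, should_match, should_not_match):
    """Return (false_positives, false_negatives) for whole-word matches."""
    regex = re.compile(pattern)
    # False positive: the pattern matched a word it should have rejected
    false_positives = [w for w in should_not_match if regex.fullmatch(w)]
    # False negative: the pattern rejected a word it should have matched
    false_negatives = [w for w in should_match if not regex.fullmatch(w)]
    return false_positives, false_negatives

should_match = ["calendar", "calandar", "celender"]
should_not_match = ["foo", "cal", "calli", "calaaaandar"]

# A strict pattern misses some wanted words (false negatives)
strict = error_counts(r"calendar", should_match, should_not_match)

# A loose pattern lets unwanted words through (false positives)
loose = error_counts(r"c[a-z]*", should_match, should_not_match)

# The character-class pattern makes neither error on these lists
balanced = error_counts(r"c[ae]l[ae]nd[ae]r", should_match, should_not_match)

print(strict, loose, balanced)
```

Tightening a pattern tends to move errors from the first list to the second, and loosening it does the opposite, which is why both counts are worth tracking together.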

Summary
----

- Regex is a "special" language for a special problem: pattern matching
- You'll make a lot of mistakes in regex 😩. 
    - False Positive: Matching something you should not have matched
    - False Negative: Missing something you should have matched

<br>
<br>
---