# Regular Expressions

Regular expressions describe _patterns_ which are used to find _matches_ in target strings.

In [1]:
import pandas as pd

hn = pd.read_csv("hacker_news.csv")
hn

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48
...,...,...,...,...,...,...,...
20094,12379592,How Purism Avoids Intels Active Management Tec...,https://puri.sm/philosophy/how-purism-avoids-i...,10,6,AdmiralAsshat,8/29/2016 2:22
20095,10339284,YC Application Translated and Broken Down,https://medium.com/@zreitano/the-yc-applicatio...,4,1,zreitano,10/6/2015 14:57
20096,10824382,Microkernels are slow and Elvis didn't do no d...,http://blog.darknedgy.net/technology/2016/01/0...,169,132,vezzy-fnord,1/2/2016 0:49
20097,10739875,How Product Hunt really works,https://medium.com/@benjiwheeler/how-product-h...,695,222,brw12,12/15/2015 19:32


## Sets

A set specifies two or more characters that can match in a single character's position.

Sets are defined with square brackets:

```
[msb]end - would match mend, send, bend
```
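
As a quick check of the set above, the following sketch tests a handful of words with the standard `re` module. The word list is made up for illustration, and `re.fullmatch` is used so that each whole word has to fit the pattern.

```python
import re

# Hypothetical sample words, chosen only to illustrate the set [msb].
words = ["mend", "send", "bend", "lend", "end"]

# [msb] matches exactly one character: 'm', 's' or 'b'.
set_matches = [w for w in words if re.fullmatch(r"[msb]end", w)]
print(set_matches)
```

"lend" is rejected because 'l' is not in the set, and "end" is rejected because the set must consume one character.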

In [2]:
import re

python_mentions = 0
pattern = "[Pp]ython"
for t in hn["title"]:
    if re.search(pattern, t):
        python_mentions += 1
python_mentions


160

In [3]:
# the vectorized pandas string method avoids the explicit loop
python_mentions = hn["title"].str.contains("[Pp]ython").sum()
python_mentions

160

In [4]:
# select titles mentioning ruby
ruby_titles = hn[hn["title"].str.contains("[Rr]uby")].loc[:,"title"]
ruby_titles

190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa.

## Quantifiers

- Used to specify repetition for the previous pattern, eg:
   - `a{3}` 'a' three times
   - `a{3,5}` 'a' three, four or five times
   - `a{3,}` 'a' three or more times
   - `a{,3}` 'a' three or fewer times (including zero)
- Special quantifiers
   - `a*` - _Zero or more_, 'a' zero or more times, same as `a{0,}`
   - `a+` - _One or more_, 'a' one or more times, same as `a{1,}`
   - `a?` - _Optional_, 'a' zero or one time, same as `a{0,1}`
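
One way to see what each quantifier accepts is to test it against runs of 'a' of increasing length. This is a small sketch using only the `re` module; the name `quantifier_lengths` is introduced here purely for illustration.

```python
import re

# Each pattern is tested against runs of 'a' of length 0 to 6.
# re.fullmatch requires the whole string to match, so the counts are exact.
patterns = [r"a{3}", r"a{3,5}", r"a{3,}", r"a{,3}", r"a*", r"a+", r"a?"]

quantifier_lengths = {}
for p in patterns:
    quantifier_lengths[p] = [n for n in range(7) if re.fullmatch(p, "a" * n)]
    print(f"{p:8} matches lengths {quantifier_lengths[p]}")
```

Note that `a{,3}`, `a*` and `a?` all accept the empty string, since zero repetitions are allowed.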

In [5]:
# find titles with the string email or e-mail in them...
email_bool = hn.loc[:, "title"].str.contains("e[-]*mail")
# True and False are treated as 1 and 0, so sum() gives total matches
email_count = email_bool.sum()
print(f"{email_count} matches found")
email_titles = hn[email_bool].loc[:, "title"]
email_titles

86 matches found


119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object

## Character Classes

- Allows matches on sets / ranges of characters, eg:
   - `[fud]` (set) matches f, u or d
   - `[a-e]` (range) matches a, b, c, d or e
   - `[0-3]` (range) matches 0, 1, 2 or 3
   - `[A-Z]` (range) matches any uppercase char
   - `[A-Za-z]` (set+range) matches any uppercase or lowercase letter

- Common abbreviated character classes
   - `\d` - Digit, `[0-9]`
   - `\w` - Word, `[A-Za-z0-9_]`, including underscore
   - `\s` - Whitespace, any space, tab or linebreak char
   - `.` - Dot, any char except newline
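
The classes listed above can be tried out on a short string with `re.findall`, which returns every non-overlapping match. The sample text below is made up for illustration.

```python
import re

# Hypothetical sample text, made up for illustration.
text = "Room_42 opens at 9 am"

digits = re.findall(r"\d", text)      # single digit characters
words = re.findall(r"\w+", text)      # runs of word characters (underscore included)
spaces = re.findall(r"\s", text)      # whitespace characters
a_to_e = re.findall(r"[a-e]", text)   # lowercase letters a through e
o_any = re.findall(r"o.", text)       # an 'o' followed by any one character

print(digits, words, spaces, a_to_e, o_any, sep="\n")
```

`Room_42` comes back as a single word because the underscore and the digits all count as word characters.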

In [6]:
# find strings with a single word in square brackets, eg [go]
# raw string (r"...") so the backslashes reach the regex engine unchanged
pattern = r"\[\w+\]"
tag_titles = hn[hn.loc[:,"title"].str.contains(pattern)].loc[:, "title"]
tag_titles

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 444, dtype: object

## Escape Sequences, Raw Strings

- Escape sequences such as `\t` (tab), `\b` (backspace), `\n` (newline) etc. can make writing regex messy
- Regex special chars such as `[` must be escaped with a backslash to be matched literally, eg `\[`
- In a normal string, a backslash meant for the regex engine must itself be doubled, eg `"\\n"` or `"\\["`
- **Raw strings** are preferable for creating regex patterns, as backslashes in them are left untouched
- Raw strings are denoted by an `r` prefix, eg `r"ABC\n123"`
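
The difference is easy to verify directly. The sketch below compares string lengths and shows that a raw-string pattern equals the same pattern written with doubled backslashes; the title is one that appears in the dataset output earlier in this notebook.

```python
import re

# In a normal string "\n" is ONE character (a newline);
# in a raw string r"\n" is TWO characters (a backslash and an 'n').
print(len("\n"), len(r"\n"))

# A raw string and a normal string with doubled backslashes are the same pattern.
same_pattern = r"\[\w+\]" == "\\[\\w+\\]"
print(same_pattern)

# Search a sample title for a bracketed tag using the raw-string pattern.
title = "Swift Reversing [pdf]"
tag_match = re.search(r"\[\w+\]", title)
print(tag_match.group())
```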

In [7]:
print("This is \nNOT a raw string")
print(r"This \nis \na \nRAW \nstring")

This is 
NOT a raw string
This \nis \na \nRAW \nstring


## Capture Groups

- Allow the part of the string that matched a section of the regex to be captured
- Capture groups are specified using parentheses, eg `(\[\w+\])`
- Where `.str.contains()` returns a bool on match, `.str.extract()` returns the text captured by the capture group
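
To show what a capture group isolates, here is a minimal sketch with the plain `re` module on a single title taken from the output above. `group(0)` is the whole match and `group(1)` is the first capture group; `Series.str.extract` applies the same idea to every element of a series.

```python
import re

title = "Using Pony for Fintech [video]"

# The parentheses sit INSIDE the escaped brackets,
# so only the word itself is captured.
m = re.search(r"\[(\w+)\]", title)

whole_match = m.group(0)  # everything the pattern matched
captured = m.group(1)     # only the part inside the parentheses
print(whole_match, captured)
```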

In [8]:
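# expand=False returns a Series (rather than a one-column DataFrame) for a single capture group
tag_freq = hn.loc[:,"title"].str.extract(r"\[(\w+)\]", expand=False).value_counts()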
tag_freq

pdf            276
video          111
2015             3
audio            3
2014             2
slides           2
beta             2
viz              1
German           1
Petition         1
NSFW             1
Map              1
Live             1
JavaScript       1
Infograph        1
HBR              1
Challenge        1
GOST             1
Excerpt          1
React            1
CSS              1
Beta             1
Benchmark        1
Australian       1
ANNOUNCE         1
5                1
2008             1
Python           1
SpaceX           1
SPA              1
gif              1
updated          1
transcript       1
survey           1
song             1
satire           1
repost           1
png              1
much             1
map              1
detainee         1
Skinnywhale      1
crash            1
comic            1
coffee           1
blank            1
ask              1
Videos           1
Ubuntu           1
USA              1
videos           1
1996             1
dtype: int64

## Negative Character Classes

- Used to match any character EXCEPT those in a character class
- Negative sets are denoted by a `^` straight after the opening bracket, eg:
   - `[^fud]` - Any char except 'f', 'u' or 'd'
   - `[^1-3Z\s]` - Any char except '1', '2', '3', 'Z' or a whitespace
- Common negative character classes:
   - `\D` - any char except digit chars
   - `\W` - any char except word chars
   - `\S` - any char except whitespace chars
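
A short sketch of the negative classes in action, again with `re.findall`. Both sample strings are made up for illustration.

```python
import re

# Hypothetical sample text, made up for illustration.
text = "Call 555-0199 now"

non_digit_runs = re.findall(r"\D+", text)   # runs of anything that is not a digit
non_space_runs = re.findall(r"\S+", text)   # runs of anything that is not whitespace
non_word_chars = re.findall(r"\W", text)    # single chars that are not word chars

# Negative set: any char except '1', '2', '3', 'Z' or whitespace.
neg_set = re.findall(r"[^1-3Z\s]", "Z1 a4")

print(non_digit_runs, non_space_runs, non_word_chars, neg_set, sep="\n")
```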

In [9]:
# Match 'Java' but not JavaScript
regex = r"[Jj]ava[^Ss]"
java_titles_bool = hn.loc[:, "title"].str.contains(regex)
java_titles = hn[java_titles_bool].loc[:, "title"]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1840                     Adopting RxJava on the Airbnb App
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
2910                 2016 JavaOne Intel Keynote  32mn Talk
3452     What are the Differences Between Java Platform...
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
5947                                        JavaFX is dead
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

## Word Boundary Anchor

An important thing to note about negative sets is that they must match one character, so they don't work at the end of a string.

For example, matching 'Java' but not 'JavaScript' with `r"[Jj]ava[^Ss]"` would not pick up 'I hate Java', as there is no char following the final "a" for `[^Ss]` to match.

A **word boundary anchor** is an alternative approach. It matches the boundary between a _word_ char and a _non-word_ char, or between a word char and the start/end of the string.

eg: `r"\bJava\b"`


In [10]:
str1 = "I hate Java"
print(re.search(r"[Jj]ava[^Ss]", str1))
print(re.search(r"\b[Jj]ava\b", str1))

None
<re.Match object; span=(7, 11), match='Java'>


In [12]:
# Improved match for Java and not Javascript
java_titles_bool = hn.loc[:,"title"].str.contains(r"\b[Jj]ava\b")
java_titles = hn[java_titles_bool].loc[:,"title"]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

## Beginning and End Anchors

In regex, **anchors** are generally used to match something that is not a character, such as a position in the string:

- `^abc` - Beginning, matches abc ONLY at the start of a string
- `abc$` - End, matches abc at the end of a string

Note `[^...]`  is a negative set and `^...` is a beginning anchor.
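
The two anchors can be compared on a small list of titles. In this sketch the first two titles are adapted from the dataset output shown earlier and the third is hypothetical, added so that one tag appears mid-title.

```python
import re

titles = [
    "[Beta] Speedtest.net",       # tag at the start
    "Swift Reversing [pdf]",      # tag at the end
    "Using [video] mid-title",    # hypothetical: tag in the middle
]

# ^ outside brackets anchors the match to the start of the string.
starts_with_tag = [t for t in titles if re.search(r"^\[\w+\]", t)]

# $ anchors the match to the end of the string.
ends_with_tag = [t for t in titles if re.search(r"\[\w+\]$", t)]

print(starts_with_tag)
print(ends_with_tag)
```

The mid-title tag is picked up by neither anchored pattern, although the unanchored `\[\w+\]` would match it.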


In [16]:
# How many times does any tag, eg [pdf], appear at the start of a title
beginning_count = hn.loc[:, "title"].str.contains(r"^\[\w+\]").sum()
print(beginning_count)

# How many times does any tag, eg [pdf], appear at the end of a title
ending_count = hn.loc[:, "title"].str.contains(r"\[\w+\]$").sum()
print(ending_count)

15
417


## Flags

Flags are used to indicate special considerations for a regex, such as ignoring case.

[Full list of flags](https://docs.python.org/3/library/re.html#re.A)

In [18]:
# Check for mention of the word email in titles, in any form, eg email, e-mails etc
# flag re.I is IGNORECASE
rgx = r"\be\s?-?mails?\b"
email_mentions = hn.loc[:,"title"].str.contains(rgx, flags=re.I).sum()
email_mentions

141
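
To see the effect of the flag in isolation, the pattern from the cell above can be run over a few strings with and without `re.I`. The sample strings are made up for illustration.

```python
import re

rgx = r"\be\s?-?mails?\b"

# Hypothetical sample strings, made up for illustration.
samples = ["Email", "E-Mails", "e mail", "emailing", "Gmail"]

# With re.I (IGNORECASE), upper- and lowercase letters match each other.
flag_matches = [s for s in samples if re.search(rgx, s, flags=re.I)]

# Without the flag, only a lowercase 'e' followed by lowercase 'mail' matches.
case_sensitive = [s for s in samples if re.search(rgx, s)]

print(flag_matches)
print(case_sensitive)
```

"emailing" is rejected either way because the trailing `\b` requires the word to end after "mail" or "mails".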


## Regular Expression Basics

### Syntax
---

#### REGULAR EXPRESSION MODULE

Importing the regular expression module:
```python
import re
```

Searching a string for a regex pattern:
```python
re.search(r"blue", "Rhythm and blues")
```

#### PANDAS REGEX METHODS

Return a boolean mask showing which values in a series contain a regex pattern:
```python
s.str.contains(pattern)
```

Extract a regex capture group from a series:
```python
s.str.extract(pattern_with_capture_group)
```

#### ESCAPING CHARACTERS

Treating special characters as ordinary text using backslashes:

```python
r"\[pdf\]"
```

### Concepts
- Regular expressions, often referred to as regex, are a set of syntax components used for matching sequences of characters in strings.
- A pattern is a regular expression that we've written. We say the regular expression has matched if the pattern is found in the string.
- Character classes allow us to match certain classes of characters.
- A set contains two or more characters that can match in a single character's position.
- Quantifiers specify how many times the preceding character (or group) must repeat.
- Capture groups allow us to specify one or more groups within our match that we can access separately.
- Negative character classes match every character except those in a given character class.
- An anchor matches something that isn't a character, as opposed to character classes which match specific characters.
- A word boundary matches the position between a word character and a non-word character, or between a word character and the start/end of a string.

- Common character classes: 

|Character Class|Pattern|Explanation|
|:-|:-|:-|
|Set|`[fud]`|Either f, u, or d|
|Range|`[a-e]`|Any of the characters a, b, c, d, or e|
|Range|`[0-3]`|Any of the characters 0, 1, 2, or 3|
|Range|`[A-Z]`|Any uppercase letter|
|Set + Range|`[A-Za-z]`|Any uppercase or lowercase letter|
|Digit|`\d`|Any digit character (equivalent to `[0-9]`)|
|Word|`\w`|Any letter, digit, or underscore (equivalent to `[A-Za-z0-9_]`)|
|Whitespace|`\s`|Any space, tab or linebreak character|
|Dot|`.`|Any character except newline|


- Common quantifiers: 

|Quantifier|Pattern|Explanation|
|:-|:-|:-|
|Zero or more|`a*`|The character a zero or more times|
|One or more|`a+`|The character a one or more times|
|Optional|`a?`|The character a zero or one times|
|Numeric|`a{3}`|The character a three times|
|Numeric|`a{3,5}`|The character a three, four, or five times|
|Numeric|`a{,3}`|The character a zero, one, two, or three times|
|Numeric|`a{8,}`|The character a eight or more times|

- Common negative character classes: 

|Character Class|Pattern|Explanation|
|:-|:-|:-|
|Negative Set|`[^fud]`|Any character except f, u, or d|
|Negative Set|`[^1-3Z\s]`|Any characters except 1, 2, 3, Z, or whitespace characters|
|Negative Digit|`\D`|Any character except digit characters|
|Negative Word|`\W`|Any character except word characters|
|Negative Whitespace|`\S`|Any character except whitespace characters|

- Common anchors: 

|Anchor|Pattern|Explanation|
|:-|:-|:-|
|Beginning|`^abc`|Matches abc only at the start of a string|
|End|`abc$`|Matches abc only at the end of a string|
|Word boundary|`s\b`|Matches s only when it's followed by a word boundary|
|Word boundary|`s\B`|Matches s only when it's not followed by a word boundary|
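
The last two rows of the anchor table are easiest to tell apart by looking at match positions. This sketch uses `re.finditer` on a made-up sentence and records the index of each matching 's'.

```python
import re

# Hypothetical sample text, made up for illustration.
text = "cats sit on sofas"

# s\b : an 's' that is followed by a word boundary (it ends a word).
end_s = [m.start() for m in re.finditer(r"s\b", text)]

# s\B : an 's' that is NOT followed by a word boundary (more word chars follow).
inner_s = [m.start() for m in re.finditer(r"s\B", text)]

print(end_s)
print(inner_s)
```

Every 's' in the text lands in exactly one of the two lists, since `\B` is the negation of `\b`.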

### Resources
- [re module](https://docs.python.org/3/library/re.html#module-re)
- [Regexr for building regular expressions](https://regexr.com/)