# Regular Expressions - Advanced

Extracting subsets of data based on regex


In [1]:
import pandas as pd
import re

hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

In [2]:
# Start looking at the flavours of SQL that are mentioned in article titles
sql_counts = titles.str.contains(r"sql", flags=re.I).sum()
sql_counts

108

In [3]:
# Extract rows that mention SQL, in preparation for a column holding the captured SQL string, i.e. the flavour of SQL
sql_bool = titles.str.contains(r"sql", flags=re.I)
# Subset the hn df with a copy containing only records that mention SQL in their title
hn_sql = hn[sql_bool].copy()
hn_sql


Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
9,11370829,Crate raises $4M seed round for its next-gen S...,http://techcrunch.com/2016/03/15/crate-raises-...,3,1,hitekker,3/27/2016 18:08
142,10957172,PostgreSQL: Linux VS Windows part 2,http://www.sqig.net/2016/01/postgresql-linux-v...,16,3,based2,1/23/2016 4:21
221,11544342,MemSQL (YC W11) Raises $36M Series C,http://blog.memsql.com/memsql-raises-series-c/,74,14,ericfrenkiel,4/21/2016 18:32
394,10620525,The History of SQL Injection,http://motherboard.vice.com/read/the-history-o...,38,9,kawera,11/24/2015 13:25
419,10301554,Pentesterlab Tutorial SQL injection to web ad...,https://pentesterlab.com/exercises/from_sqli_t...,2,1,pentestercrab,9/30/2015 3:32
...,...,...,...,...,...,...,...
19133,12041615,PostgreSQL: Linux VS Windows [Benchmark],http://www.sqig.net/2016/01/postgresql-linux-v...,2,3,insulanian,7/6/2016 7:01
19580,12252112,PostgreSQL Index Internals,https://www.pgcon.org/2016/schedule/events/934...,211,21,snaga,8/9/2016 2:09
19769,11953895,SQL Server on Linux in Preview,https://azure.microsoft.com/en-us/blog/microso...,15,2,rjdevereux,6/22/2016 13:55
19802,12223216,Uber's Move Away from PostgreSQL,http://rhaas.blogspot.com/2016/08/ubers-move-a...,119,15,ioltas,8/4/2016 3:36


In [4]:
# Create a column with the SQL flavour, lowercased to avoid duplicates
pattern = r"(\w+sql)"
flavours = hn_sql["title"].str.lower().str.extract(pattern, flags=re.I)
hn_sql["flavour"] = flavours
hn_sql

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,flavour
9,11370829,Crate raises $4M seed round for its next-gen S...,http://techcrunch.com/2016/03/15/crate-raises-...,3,1,hitekker,3/27/2016 18:08,
142,10957172,PostgreSQL: Linux VS Windows part 2,http://www.sqig.net/2016/01/postgresql-linux-v...,16,3,based2,1/23/2016 4:21,postgresql
221,11544342,MemSQL (YC W11) Raises $36M Series C,http://blog.memsql.com/memsql-raises-series-c/,74,14,ericfrenkiel,4/21/2016 18:32,memsql
394,10620525,The History of SQL Injection,http://motherboard.vice.com/read/the-history-o...,38,9,kawera,11/24/2015 13:25,
419,10301554,Pentesterlab Tutorial SQL injection to web ad...,https://pentesterlab.com/exercises/from_sqli_t...,2,1,pentestercrab,9/30/2015 3:32,
...,...,...,...,...,...,...,...,...
19133,12041615,PostgreSQL: Linux VS Windows [Benchmark],http://www.sqig.net/2016/01/postgresql-linux-v...,2,3,insulanian,7/6/2016 7:01,postgresql
19580,12252112,PostgreSQL Index Internals,https://www.pgcon.org/2016/schedule/events/934...,211,21,snaga,8/9/2016 2:09,postgresql
19769,11953895,SQL Server on Linux in Preview,https://azure.microsoft.com/en-us/blog/microso...,15,2,rjdevereux,6/22/2016 13:55,
19802,12223216,Uber's Move Away from PostgreSQL,http://rhaas.blogspot.com/2016/08/ubers-move-a...,119,15,ioltas,8/4/2016 3:36,postgresql


In [5]:
# SQL flavour counts
hn_sql["flavour"].value_counts()

postgresql    27
nosql         17
mysql         13
cloudsql       1
sparksql       1
memsql         1
Name: flavour, dtype: int64

In [6]:
# Create a pivot table with flavour as index and average number of comments
pt = pd.pivot_table(hn_sql, index="flavour", values="num_comments").rename(columns={"num_comments": "average_comments"})
pt


|flavour |average_comments |
|:-|:-|
|cloudsql |5.0 |
|memsql |14.0 |
|mysql |12.230769 |
|nosql |14.529412 |
|postgresql |25.962963 |
|sparksql |1.0 |


In [7]:
# Extract python version mentioned most often
pattern = r"([Pp]ython ?[\d.]+)"
hn["title"].str.extract(pattern).value_counts()

Python 3        10
Python 3.5       3
Python 3.6       2
Python 2         2
python 2         1
Python4          1
Python3.5        1
Python 8         1
Python 4         1
Python 3.5.0     1
Python 2.7       1
Python 1.5       1
dtype: int64

In [56]:
# Same, but just the version number
pattern = r"[Pp]ython ?([\d.]+)"
hn["title"].str.extract(pattern).value_counts()



3        10
3.5       4
2         3
4         2
3.6       2
8         1
3.5.0     1
2.7       1
1.5       1
dtype: int64
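
The two patterns differ only in where the parentheses sit. A minimal sketch with the standard `re` module shows the effect; the `titles` list is made up for illustration and is not taken from the dataset:

```python
import re

# Hypothetical titles, for illustration only
titles = ["Moving from Python 2 to Python3.5", "Why python 3.6 matters"]

whole = r"([Pp]ython ?[\d.]+)"    # group wraps the name and the version
version = r"[Pp]ython ?([\d.]+)"  # group wraps only the version

# With a single capture group, re.findall returns what the group captured
whole_matches = [m for t in titles for m in re.findall(whole, t)]
version_matches = [m for t in titles for m in re.findall(version, t)]

print(whole_matches)    # ['Python 2', 'Python3.5', 'python 3.6']
print(version_matches)  # ['2', '3.5', '3.6']
```

`Series.str.extract` keeps only the first match per title, whereas `re.findall` returns all of them. Also note that `[\d.]+` accepts a trailing full stop, so a title ending in `Python 3.` would yield `3.` as the version.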

In [78]:
# Count mentions of C, but not C++, C., or C#
# Inside a character class only a leading ^ negates, so the excluded characters are listed once
pattern = r"(\b[Cc]\b[^.+#])"
#pattern = r"(\b[Cc]\b) "
hn["title"].str.extract(pattern).value_counts()

C     52
C/     5
C,     3
c      2
C:     2
C-     2
c-     1
C?     1
C'     1
dtype: int64
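
Two details of this approach are worth checking by hand. Inside a character class, `^` negates only in first position and any later `^` is a literal caret, so `[^.+#]` is enough to exclude `.`, `+`, and `#`. The class must also consume one character, which is why the counts above include the trailing character and why a `C` at the very end of a title cannot match. A small sketch with the `re` module, using made-up sample strings:

```python
import re

pattern = r"(\b[Cc]\b[^.+#])"

# Hypothetical titles, for illustration only
samples = ["Learn C the hard way", "C++ templates", "Rewriting it in C"]

results = {}
for text in samples:
    m = re.search(pattern, text)
    # Store the captured text, or None when the pattern does not match
    results[text] = m.group(1) if m else None

for text, found in results.items():
    print(f"{text!r} -> {found!r}")
# 'Learn C the hard way' -> 'C '
# 'C++ templates' -> None
# 'Rewriting it in C' -> None
```

The last case is a genuine mention of C that the pattern misses, because no character follows it.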

## Lookarounds

- Define a character or sequence of characters that must or must not come before or after the regex match, without becoming part of the match.
- Lookahead `(?...)`, lookbehind `(?<...)`
- Positive `=`, negative `!`

Four types:

- **Positive lookahead**:
   - `zzz(?=abc)`
   - Matches `zzz` only when followed by `abc`
- **Negative lookahead**:
   - `zzz(?!abc)`
   - Matches `zzz` only when _NOT_ followed by `abc`
- **Positive lookbehind**:
   - `(?<=abc)zzz`
   - Matches `zzz` only when preceded by `abc`
- **Negative lookbehind**:
   - `(?<!abc)zzz`
   - Matches `zzz` only when _NOT_ preceded by `abc`
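
The four types can be compared on one string. This is a minimal sketch using the `re` module; the `text` value and the `starts` helper are made up for illustration:

```python
import re

# Three occurrences of 'zzz', at offsets 3, 7 and 14
text = "abczzz zzzabc zzz"

def starts(pattern):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

positive_lookahead = starts(r"zzz(?=abc)")
negative_lookahead = starts(r"zzz(?!abc)")
positive_lookbehind = starts(r"(?<=abc)zzz")
negative_lookbehind = starts(r"(?<!abc)zzz")

print(positive_lookahead)   # [7]
print(negative_lookahead)   # [3, 14]
print(positive_lookbehind)  # [3]
print(negative_lookbehind)  # [7, 14]

# The lookaround itself is not part of the matched text
matched = re.search(r"zzz(?=abc)", text).group()
print(matched)  # zzz
```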



In [97]:
# Improved regex to find mentions of the C programming language
# C not preceded by "Series " or a word character, and not followed by a word character, "+" or "."
pattern = r"(?<!Series )(?<!\w)[Cc](?!\w)(?![\+\.])"
c_mentions = hn[hn["title"].str.contains(pattern)]["title"]
c_mentions

365                       The new C standards are worth it
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
1326                      The C standard formalized in Coq
1365                           GNU C Library 2.23 released
                               ...                        
18543                 C-style for loops removed from Swift
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 102, dtype: object

## Backreferences

- Refer back to earlier capture groups, which are numbered from left to right
- `\1` refers to first capture group, `\2` second, etc
- Examples:
   - `(e)\1` matches `ee`
   - `(\w)\1` matches any repeated word char
   - `(abc)\1` matches `abcabc`
   - `(a)\1(b)\2` matches `aabb`
   - `(a)\1(b)\1` matches `aaba`
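
The examples above can be checked with `re.fullmatch`. The sketch below also shows that a backreference repeats the text that was captured, not the sub-pattern; the variable names are illustrative:

```python
import re

# (pattern, string) pairs taken from the list above
examples = [
    (r"(e)\1", "ee"),
    (r"(\w)\1", "oo"),
    (r"(abc)\1", "abcabc"),
    (r"(a)\1(b)\2", "aabb"),
    (r"(a)\1(b)\1", "aaba"),
]

all_match = all(re.fullmatch(p, s) is not None for p, s in examples)
print(all_match)  # True

# \1 must equal what group 1 captured ('a'), so 'ab' is rejected
# even though 'b' on its own would satisfy \w
different_chars = re.fullmatch(r"(\w)\1", "ab")
print(different_chars)  # None
```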


In [121]:
# Match repeated words in story titles
# One or more word chars after a word boundary, then a single whitespace char and the same word again
pattern = r"\b(\w+)\s\1\b"
hn[hn["title"].str.contains(pattern)]["title"]

3102                  Silicon Valley Has a Problem Problem
3176                Wire Wire: A West African Cyber Threat
3178                         Flexbox Cheatsheet Cheatsheet
4797                            The Mindset Mindset (2015)
7276     Valentine's Day Special: Bye Bye Tinder, Flirt...
10371    Mcdonalds copying cyriak  cows cows cows in th...
11575                                    Bang Bang Control
11901          Cordless Telephones: Bye Bye Privacy (1991)
12697          Solving the the Monty-Hall-Problem in Swift
15049    Bye Bye Webrtc2SIP: WebRTC with Asterisk and A...
15839          Intellij-Rust Rust Plugin for IntelliJ IDEA
Name: title, dtype: object

---
## Substituting RegEx Matches

- `re.sub()` equivalent in pandas is `Series.str.replace()`
- Useful for creating uniform string values
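
As a point of comparison, here is a minimal sketch of the same kind of substitution with `re.sub` on a single string. The sample sentence is made up, and the date is in the format of the `created_at` column:

```python
import re

pattern = r"\be[-\s]?mail"

# Hypothetical sentence, not taken from the dataset
text = "Send E-Mail or e mail"
uniform = re.sub(pattern, "email", text, flags=re.I)
print(uniform)  # Send email or email

# The replacement string can refer to capture groups with \1, \2, ...
reordered = re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", "8/4/2016")
print(reordered)  # 2016-8-4
```

In pandas the per-row equivalent is `Series.str.replace(pattern, "email", flags=re.I, regex=True)`. From pandas 2.0 onwards `regex` defaults to `False`, so it has to be passed explicitly for the pattern to be treated as a regular expression.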

In [136]:
# Standardize all variations of 'email'

email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

pattern = r"\be[-\s]?mail"
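# regex=True is needed from pandas 2.0, where str.replace treats the pattern as a literal string by default
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)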
email_uniform

0    email
1    email
2    email
3    email
4    email
5    email
6    email
7    email
8    email
dtype: object

In [137]:
# Standardize variations of email in the titles
pattern = r"\be[-\s]?mail"
email_titles = hn[hn["title"].str.contains(pattern, flags=re.I)]["title"]
email_titles.str.replace(pattern, "email", flags=re.I, regex=True)

119      Show HN: Send an email from your shell to your...
161      Computer Specialist Who Deleted Clinton emails...
174                                        email Apps Suck
261      emails Show Unqualified Clinton Foundation Don...
313          Disposable emails for safe spam free shopping
                               ...                        
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19395    I used HTML email when applying for jobs, here...
19446    Tell HN: Secure email provider Riseup will run...
19905    Gmail Will Soon Warn Users When emails Arrive ...
Name: title, Length: 143, dtype: object

In [143]:
# Extract domains from a series of URLs

test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'
])

# Commas inside a character class are literal characters, so they are left out of the set
pattern = r"https?:\/\/([a-zA-Z0-9.-]+)"
test_urls.str.extract(pattern, flags=re.I)

Unnamed: 0,0
0,www.amazon.com
1,www.interactivedynamicvideo.com
2,www.nytimes.com
3,evonomics.com
4,github.com
5,phys.org
6,iot.seeed.cc
7,www.bfilipek.com
8,beta.crowdfireapp.com
9,www.valid.ly
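
One thing to watch in the domain pattern is the character class. Items in a set are not separated by commas, so a class written as `[a-z,A-Z,0-9,.,-]` also accepts a literal comma. A minimal sketch of the difference, using a made-up malformed URL:

```python
import re

with_commas = r"https?://([a-z,A-Z,0-9,.,-]+)"
without_commas = r"https?://([a-zA-Z0-9.-]+)"

# Hypothetical malformed URL with a comma in the host part
url = "http://example.com,evil.org/path"

loose = re.search(with_commas, url).group(1)
strict = re.search(without_commas, url).group(1)

print(loose)   # example.com,evil.org
print(strict)  # example.com
```

On well-formed URLs both classes capture the same domain. Escaping the slashes as `\/` is harmless but not required in Python, since `/` has no special meaning in `re` patterns.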


In [150]:
# Same, from the dataframe
pattern = r"https?:\/\/([a-zA-Z0-9.-]+)"
top_5_domains = hn["url"].str.extract(pattern, flags=re.I).value_counts().iloc[:5]
top_5_domains

github.com             1008
medium.com              825
www.nytimes.com         525
www.theguardian.com     248
techcrunch.com          245
dtype: int64

In [156]:
# Extract URL parts

test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'
])

pattern = r"(https?)://([\w\.\-]+)/?(.*)"
test_urls.str.extract(pattern, flags=re.I)

Unnamed: 0,0,1,2
0,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
1,http,www.interactivedynamicvideo.com,
2,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
3,http,evonomics.com,advertising-cannot-maintain-internet-heres-sol...
4,HTTPS,github.com,keppel/pinn
5,Http,phys.org,news/2015-09-scale-solar-youve.html
6,https,iot.seeed.cc,
7,http,www.bfilipek.com,2016/04/custom-deleters-for-c-smart-pointers.html
8,http,beta.crowdfireapp.com,?beta=agnipath
9,https,www.valid.ly,?param


In [157]:
# ... from the df
pattern = r"(https?)://([\w\.\-]+)/?(.*)"
hn["url"].str.extract(pattern, flags=re.I)

Unnamed: 0,0,1,2
0,http,www.interactivedynamicvideo.com,
1,http,www.thewire.com,entertainment/2013/04/florida-djs-april-fools-...
2,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
3,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
4,http,arstechnica.com,business/2015/10/comcast-and-other-isps-boost-...
...,...,...,...
20094,https,puri.sm,philosophy/how-purism-avoids-intels-active-man...
20095,https,medium.com,@zreitano/the-yc-application-broken-down-and-t...
20096,http,blog.darknedgy.net,technology/2016/01/01/0/
20097,https,medium.com,@benjiwheeler/how-product-hunt-really-works-d8...


## Named Capture Groups

- Use the syntax `?P<name>` inside the parentheses, before the group's pattern
- E.g. `(?P<date>.+)\s(?P<time>.+)`
- Pandas uses the group names as column names for the extracted capture groups
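
Outside pandas, named groups are available on the match object by name. A minimal sketch with the `re` module, applied to a timestamp in the format of the `created_at` column:

```python
import re

pattern = r"(?P<date>.+)\s(?P<time>.+)"

m = re.search(pattern, "8/4/2016 3:36")

# groupdict() maps each group name to the text it captured
parts = m.groupdict()
print(parts)            # {'date': '8/4/2016', 'time': '3:36'}
print(m.group("time"))  # 3:36
```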

In [158]:
# As above, but with named capture groups
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
hn["url"].str.extract(pattern, flags=re.I)

Unnamed: 0,protocol,domain,path
0,http,www.interactivedynamicvideo.com,
1,http,www.thewire.com,entertainment/2013/04/florida-djs-april-fools-...
2,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
3,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
4,http,arstechnica.com,business/2015/10/comcast-and-other-isps-boost-...
...,...,...,...
20094,https,puri.sm,philosophy/how-purism-avoids-intels-active-man...
20095,https,medium.com,@zreitano/the-yc-application-broken-down-and-t...
20096,http,blog.darknedgy.net,technology/2016/01/01/0/
20097,https,medium.com,@benjiwheeler/how-product-hunt-really-works-d8...


## Advanced Regular Expressions:

### Syntax
---

#### CAPTURE GROUPS

Extracting text using a capture group:
```python
s.str.extract(pattern_with_capture_group)
```

Extracting text using multiple capture groups:
```python
s.str.extract(pattern_with_multiple_capture_groups)
```
#### SUBSTITUTION

Substituting a regex match:
```python
s.str.replace(pattern, replacement_text)
```

### Concepts

- Capture groups allow us to specify one or more groups within our match that we can access separately.

|Pattern |Explanation |
|:-|:-|
|`(yes)no`   | Matches `yesno`, capturing `yes` in a single capture group.|
|`(yes)(no)` | Matches `yesno`, capturing `yes` and `no` in two capture groups.|
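
The two rows of the table can be reproduced with the `re` module. In this minimal sketch, `groups()` returns the captured groups and `group(0)` the whole match:

```python
import re

one_group = re.search(r"(yes)no", "yesno")
two_groups = re.search(r"(yes)(no)", "yesno")

print(one_group.group(0))   # yesno
print(one_group.groups())   # ('yes',)
print(two_groups.groups())  # ('yes', 'no')
```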

- Backreferences allow us to repeat a capture group within our regex pattern by referring to it with an integer that gives its position among the groups, counted from left to right.

|Pattern |Explanation |
|:-|:-|
|`(yes)no\1`     | Matches `yesnoyes`|
|`(yes)(no)\2\1` | Matches `yesnonoyes`|

- Lookarounds let us define a positive or negative match before or after our string.

|Pattern |Explanation |
|:-|:-|
|`zzz(?=abc)`  | Matches `zzz` only when it is followed by `abc`|
|`zzz(?!abc)`  | Matches `zzz` only when it is _not_ followed by `abc`|
|`(?<=abc)zzz` | Matches `zzz` only when it is preceded by `abc`|
|`(?<!abc)zzz` | Matches `zzz` only when it is _not_ preceded by `abc`|


### Resources
- [re module](https://docs.python.org/3/library/re.html#module-re)
- [RegExr Regular Expression Builder](https://regexr.com/)