Regular expressions are a powerful way of building patterns to match text. As powerful as they are, regular expressions can be difficult to learn at first, and the syntax can look visually intimidating. As a result, many people end up disliking regular expressions and try to avoid them, opting instead to write more cumbersome code.

That said, learning (and loving!) regular expressions is a worthwhile investment:

* Once we understand how they work, complex operations with string data can be written a lot quicker, which will save time.
* Regular expressions are often faster to execute than their manual equivalents.
* Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives us a powerful tool that we can use wherever we work with data.

We'll be applying regular expressions while performing analysis on a dataset of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).

**Hacker News** is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted posts (known as "stories") are voted and commented upon, similar to **reddit**. Hacker News is extremely popular in technology and startup circles; stories that make it to the top of Hacker News' listings can get hundreds of thousands of visitors.

The dataset we will be working with is based on [this CSV of Hacker News stories from September 2015 to September 2016](https://www.kaggle.com/hacker-news/hacker-news-posts). The columns in the dataset are explained below:

* `id`: The unique identifier from Hacker News for the story
* `title`: The title of the story
* `url`: The URL that the story links to, if the story has a URL
* `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the story
* `author`: The username of the person who submitted the story
* `created_at`: The date and time at which the story was submitted

We have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
hn = pd.read_csv("hacker_news.csv")
print(hn.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20099 entries, 0 to 20098
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20099 non-null  int64 
 1   title         20099 non-null  object
 2   url           17659 non-null  object
 3   num_points    20099 non-null  int64 
 4   num_comments  20099 non-null  int64 
 5   author        20099 non-null  object
 6   created_at    20099 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
None


In [3]:
# hn.describe(include = "all")
hn.describe()

Unnamed: 0,id,num_points,num_comments
count,20099.0,20099.0,20099.0
mean,11317550.0,50.296632,24.803025
std,696453.1,107.110322,56.108639
min,10176910.0,1.0,1.0
25%,10701720.0,3.0,1.0
50%,11284520.0,9.0,3.0
75%,11926130.0,54.0,21.0
max,12578980.0,2553.0,1733.0


In [4]:
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


When working with regular expressions, we use the term **pattern** to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: the [re module](https://docs.python.org/3/library/re.html#module-re). This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the `re` module is the `re.search()` function, which takes two required arguments:

* The regex pattern
* The string we want to search for that pattern

`import re
m = re.search("and", "hand")
print(m)`

`<re.Match object; span=(1, 4), match='and'>`

The `re.search()` function will return a Match object if the pattern is found anywhere within the string. If the pattern is not found, `re.search()` returns `None`:

`m = re.search("and", "antidote")
print(m)
None`

We can use the fact that the boolean value of a match object is `True` while `None` is `False` to easily check whether our regex matches each string in a list. 
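As a minimal sketch of that check, the snippet below filters a short list of strings by the truthiness of the `re.search()` result. The list `words` is a made-up sample, not part of the dataset.

```python
import re

# Hypothetical sample strings, used only for illustration.
words = ["hand", "antidote", "candy", "bend"]

# re.search() returns a match object (truthy) or None (falsy),
# so its result can be used directly as a condition.
matches = [word for word in words if re.search(r"and", word)]
print(matches)  # ['hand', 'candy']
```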

The power of regular expressions comes when we use one of the `special character` sequences.

The first of these we'll use is called a `set`. A `set` allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match in square brackets:

`example: [msb]end`

The regular expression above will match the strings `mend`, `send`, and `bend`.
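A quick way to see the set in action is to test it against a few words. The words `tend` and `end` are extra, hypothetical cases added here to show what the set does not match.

```python
import re

pattern = r"[msb]end"

# The first three words start with a character from the set;
# the last two are hypothetical counterexamples.
candidates = ["mend", "send", "bend", "tend", "end"]

results = {word: bool(re.search(pattern, word)) for word in candidates}
print(results)
```

Only the first three words map to `True`, because the set must match exactly one of `m`, `s`, or `b` immediately before `end`.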

In [5]:
# how many times Python is mentioned in the title of stories in our Hacker News dataset.

import re

titles = hn["title"]


pattern = r'[Pp]ython'
python_counts = 0

for title in titles:
    if re.search(pattern, title):
        python_counts += 1
python_counts


160

The `Series.str.contains()` method can be used to test whether each string in a Series matches a particular regex pattern.

In [7]:
# replicate the above method
pattern = r'[Pp]ython'
python_counts = hn['title'].str.contains(pattern).sum() # True value counting as 1, and each False as 0
python_counts 


160

In [8]:
# extracting a series of the values from titles that contain Ruby or ruby.
pattern = "[Rr]uby"

ruby_find = hn["title"].str.contains(pattern)
ruby_titles = hn.loc[ruby_find,"title"]
ruby_titles.head()

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2022    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object

If we wanted to write a pattern that matches the numbers in text from 1000 to 2999, we could write the regular expression `[1-2][0-9][0-9][0-9]` or, more compactly, `[1-2][0-9]{3}`.

The `{3}` syntax is called a **quantifier**; in this case, it's a **numeric quantifier**.

Quantifiers specify how many of the **previous** character our pattern requires, which can help us when we want to match substrings of specific lengths:
 

 `a{3}` -> The character `a` three times

 `a{3,5}` ->The character `a` three, four or five times

 `a{,3}` ->The character `a` zero, one, two or three times

 `a{8,}` ->The character `a` eight or more times

In addition to numeric quantifiers, there are **single characters** in regex that specify some common quantifiers that we're likely to use. A summary of them is below.

`a*` -> equivalent to a{0,} zero or more

`a+` -> equivalent to a{1,} one or more

`a?` -> equivalent to a{0,1} zero or one (optional)
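The quantifiers above can be tried out with `re.fullmatch()`, which only succeeds when the entire string matches the pattern. This is a small sketch; the helper `matches_fully` and the test strings are illustrative assumptions, not part of the original analysis.

```python
import re

def matches_fully(pattern, text):
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches_fully(r"a{3}", "aaa"))       # True: exactly three
print(matches_fully(r"a{3,5}", "aaaaaa"))  # False: six is too many
print(matches_fully(r"a{,3}", ""))         # True: zero is allowed
print(matches_fully(r"a+", ""))            # False: at least one is required
print(matches_fully(r"colou?r", "color"))  # True: the u is optional

# A number from 1000 to 2999 found inside a longer string.
year = re.search(r"[1-2][0-9]{3}", "Released in 2016")
print(year.group())  # 2016
```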

In [9]:
# we're going to find how many titles in our dataset mention email or e-mail.

pattern = "e-?mail"

email_bool = hn['title'].str.contains(pattern)
email_count = email_bool.sum()
email_titles = hn["title"][email_bool] # select only the items from titles that matched the regular expression
email_titles.head()

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object

Some stories submitted to Hacker News include a topic tag in brackets, like `[pdf]`, `[video]` etc.

A critical part of identifying how many titles have tags is knowing how to match the characters between the brackets (like `pdf` and `video`) without knowing ahead of time what the different topic tags will be.

To match **unknown characters** using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

1. The **set notation** using brackets to match any of a number of characters.
2. The **range notation**, which we used to match ranges of digits (like `[0-9]`).

Let's look at a summary of syntax for some of the regex character classes:

* `set` -> [fud] either f, u or d
* `range` -> [a-e] any of the characters a, b, c, d or e
* `range` -> [0-3] any of the characters 0, 1, 2 or 3
* `range` -> [A-Z] any uppercase letter
* `set+range` -> [A-Za-z] any uppercase or lowercase letter

There are two new things we can observe from this table:

1. Ranges can be used for **letters** as well as **numbers**.
2. Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.

* `Digit` -> **`\d`** any digit character (equivalent to `[0-9]`)
* `Word` -> **`\w`** any digit, uppercase, lowercase or underscore character (equivalent to `[A-Za-z0-9_]`). **Does not** include any **special character**
* `Whitespace` -> **`\s`** any `space`, `tab` or `linebreak` character
* `Dot` -> **`.`** any character, including special characters, **except newline**
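To make these classes concrete, the sketch below runs each one over a single made-up string with `re.findall()`, which returns every non-overlapping match as a list. The string `text` is a hypothetical example.

```python
import re

# Hypothetical sample string: contains digits, an underscore,
# spaces, a tab, and punctuation.
text = "Room 42, floor_3\tB!"

print(re.findall(r"\d", text))   # ['4', '2', '3']
print(re.findall(r"\w+", text))  # ['Room', '42', 'floor_3', 'B']
print(re.findall(r"\s", text))   # [' ', ' ', '\t']
print(re.findall(r"2.", text))   # ['2,']  (the dot matched the comma)
```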

In [124]:
# We are going to find how many titles in our dataset have tags, i.e. anything in brackets [].
# In this case, we're just interested in single-word tags without special characters, like [pdf] or [video].

# The regular expression should match, in order:
# A single open bracket character.
# One or more word characters.
# A single close bracket character.

# Unescaped brackets define a set, so we use backslashes to escape both the opening and closing brackets, as in \[pdf\].
pattern = r"\[\w+\]"
# Adding a dot, i.e. r"\[\w.+\]", also allows spaces or punctuation inside the tag, like [PHP-DEV] or [XKCD Flowchart].
titles = hn['title']
tag_titles = titles[titles.str.contains(pattern)]
tag_count = tag_titles.shape[0]
tag_count



444

In [17]:
pattern = r"\[\w.+\]"
# pattern = r"\[\w+.+\]" # An alternative to the above
tag_titles = titles[titles.str.contains(pattern)]
tag_titles

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19975    [FOR GIT USERS] YoLog  Lightweight Wrapper to ...
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 469, dtype: object

We can use backslashes to escape the `[` and `]` bracket characters.
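The difference between escaped and unescaped brackets can be seen in a short sketch. Two of the titles below appear in the output above; the third is a made-up title without a tag.

```python
import re

sample_titles = [
    "Swift Reversing [pdf]",
    "Using Pony for Fintech [video]",
    "A title without a tag",  # hypothetical title for illustration
]

# Escaped: \[ and \] match literal bracket characters.
tag_pattern = r"\[\w+\]"
# Unescaped: [pdf] is a set that matches a single p, d, or f.
set_pattern = r"[pdf]"

found_tags = []
for title in sample_titles:
    tag = re.search(tag_pattern, title)
    found_tags.append(tag.group() if tag else None)
print(found_tags)  # ['[pdf]', '[video]', None]

# The set matches the first p, d, or f it finds: the "f" in "Swift".
print(re.search(set_pattern, "Swift Reversing [pdf]").group())  # f
```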

In Python, a backslash followed by certain characters represents an escape sequence — like the `\n` sequence — which represents a **new line**. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring `\b`:

`print('hello\b')
hell`

The escape sequence `\b` represents a **backspace** character, so the final letter of our string appears to have been removed from the output. The character sequence `\b` also has a special meaning in regular expressions, so we need a way to write these characters without triggering the escape sequence.

One way is to add an extra backslash before the `"b"`:

`print('hello\\b')
hello\b`

This can make regular expressions even more difficult to read and interpret, so instead we use raw strings, which we denote by prefixing our string with the `r` or `R`  character. Let's take a look at the code from above with a raw string:

`print(r'hello\b')
hello\b`


We strongly recommend using raw strings for every regex we write, rather than remembering which sequences are escape sequences and using raw strings selectively.
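The following sketch shows why this matters for regular expressions specifically. It compares the same pattern written as a normal string and as a raw string; the sample `text` is an assumption made for illustration.

```python
import re

# In a normal string, "\b" is a single backspace character.
print(len("\b"))   # 1
# In a raw string, r"\b" is two characters: a backslash and a "b".
print(len(r"\b"))  # 2

text = "a cat sat"  # hypothetical sample string

# The raw string reaches the regex engine as the regex token \b.
raw_result = re.search(r"\bcat\b", text)
# The normal string asks the regex engine to find literal backspace characters.
plain_result = re.search("\bcat\b", text)

print(raw_result)    # a match object for 'cat'
print(plain_result)  # None
```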

We were able to calculate that 444 of the 20,099 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use **capture groups**. **Capture groups** allow us to specify one or more groups within our match that we can access separately. 

We specify capture groups using parentheses `(` `)`.
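With the built-in `re` module, a capture group can be read from a match object using its `group()` method. This is a minimal sketch on a single title taken from the output above.

```python
import re

title = "Swift Reversing [pdf]"

# The parentheses capture only the word characters between the brackets.
match = re.search(r"\[(\w+)\]", title)

print(match.group(0))  # [pdf]  (the whole match)
print(match.group(1))  # pdf    (only the first capture group)
```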

In [57]:
pattern = r'\[(\w+)\]' # parentheses specify the capture group

titles = hn['title']
tag = titles.str.extract(pattern)
tag_freq = tag[0].value_counts()
tag_freq.head()

pdf      276
video    111
2015       3
audio      3
beta       2
Name: 0, dtype: int64

In [60]:
pattern = r'(\[\w+\])' # parentheses are moved outside the escaped brackets so the capture group includes the full tag

titles = hn['title']
tag = titles.str.extract(pattern)
tag_freq = tag[0].value_counts()
tag_freq.head()

[pdf]      276
[video]    111
[audio]      3
[2015]       3
[beta]       2
Name: 0, dtype: int64

In reality, regular expressions are often complex. When creating complex regular expressions, we often need to work iteratively so we can find `"bad"` instances that match our pattern and then exclude them.

In order to work faster as we build our regular expression, it can be helpful to create a function that returns the first few matching strings:

In [18]:
def first_10_matches(pattern):
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows us to build regular expressions and includes syntax highlighting, instant matches, and a regex syntax reference.

Character classes also have negative forms, which match any character except those in the class:

* **Negative Set** -> `[^fud]` any character except f, u or d
* **Negative Set** -> `[^1-3Z\s]` any character except 1, 2, 3, Z or a whitespace character
* **Negative Digit** -> `\D` any character except a digit character
* **Negative Word** -> `\W` any character except a word character
* **Negative Whitespace** -> `\S` any character except a whitespace character
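The negative classes can be explored the same way as the positive ones. The sketch below applies several of them to one made-up string using `re.findall()`.

```python
import re

text = "Item 7: ok!"  # hypothetical sample string

print(re.findall(r"[^a-z]", text))  # ['I', ' ', '7', ':', ' ', '!']
print(re.findall(r"\D+", text))     # ['Item ', ': ok!']
print(re.findall(r"\W", text))      # [' ', ':', ' ', '!']
print(re.findall(r"\S+", text))     # ['Item', '7:', 'ok!']
```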

In [61]:
# Let's use the negative set [^Ss] to exclude instances like JavaScript and Javascript

def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

In [125]:
pattern  = r'[Jj]ava[^Ss]' # regular expression that will match titles containing Java but not followed by the letter 'S' or 's'.

java_titles = first_10_matches(pattern)
java_titles

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
2367    Code that is valid in both PHP and Java, and p...
2493    Ask HN: I've been a java dev for a couple of y...
2751                Eventsourcing for Java 0.4.0 released
2910                2016 JavaOne Intel Keynote  32mn Talk
3452    What are the Differences Between Java Platform...
Name: title, dtype: object

While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side-effect of removing any titles where **Java** occurs at the end of the string, like this title:

**`Pippo  Web framework in Java`**

This is because the negative set `[^Ss]` must match one character, so instances at the end of a string do not match.

A different approach to take in cases like these is to use the **word boundary anchor**, specified using the syntax `\b`. A word boundary matches the position between a `word character` and a `non-word character`, or a `word character` and the `start/end of a string`. 

A word boundary also occurs where a word character is immediately followed by a special character. For example, the pattern `r"\bC\b"`, intended to find only `C`, matches the `C` in each of the following strings:

`C`, `C++`, `C.`, `C$`
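This behavior can be checked with a short sketch. The strings `CSS` and `ObjC` are hypothetical counterexamples added here, in which `C` sits next to another word character.

```python
import re

pattern = r"\bC\b"

# The first four strings have a boundary on both sides of "C";
# the last two do not, because "C" touches another word character.
samples = ["C", "C++", "C.", "C$", "CSS", "ObjC"]

results = {s: bool(re.search(pattern, s)) for s in samples}
print(results)
```

The first four strings map to `True` and the last two to `False`, so a word boundary alone does not distinguish `C` from `C++`.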

Let's look at how using a word boundary changes the match for a string similar to the title above:

`string = "Sometimes people confuse JavaScript with Java"`

`pattern_1 = r"Java[^S]"
m1 = re.search(pattern_1, string)
print(m1)`

`None`

The regular expression returns `None`, because there is no substring that contains `Java` followed by a character that isn't `S`.

Let's instead use word boundaries in our regular expression:

`pattern_2 = r"\bJava\b"
m2 = re.search(pattern_2, string)
print(m2)`

`<re.Match object; span=(41, 45), match='Java'>`

With the word boundary, our pattern matches the `Java` at the end of the string.

In [126]:
# Use the word boundary anchor in our regular expression to select the titles that mention Java.

pattern = r'\b[Jj]ava\b' # matches Java or java only when it appears as a whole word
# The match can occur anywhere within the title.

In [127]:
java_titles = first_10_matches(pattern)
java_titles

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1023                         Pippo  Web framework in Java
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
2367    Code that is valid in both PHP and Java, and p...
2493    Ask HN: I've been a java dev for a couple of y...
2751                Eventsourcing for Java 0.4.0 released
3228                              Comparing Rust and Java
3452    What are the Differences Between Java Platform...
Name: title, dtype: object

In [88]:
import re
pattern = r'\b[Jj]ava\b'

string = "Sometimes people confuse JavaScript with Java" # regular expressions to match pattern contained anywhere within text
print(re.search(pattern, string))


<re.Match object; span=(41, 45), match='Java'>


The **word boundary anchor** matches the space between a `word character` and a `non-word character`. More generally in regular expressions, an **anchor** matches something that isn't a character, as opposed to **character classes** which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **ending anchor**, which represent the start and the end of the string, respectively.

There are often scenarios where we want to specifically match a pattern at the start or end of a string.

| Anchor | Pattern | Explanation |
|---|---|---|
| Beginning | `^abc` | Matches `abc` at the start of the string |
| End | `abc$` | Matches `abc` at the end of the string |

Note that the `^` character is used both as a **beginning anchor** and to indicate a **negative set**, depending on whether or not the character preceding it is a `[`.
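Both anchors, and both roles of `^`, can be seen in a short sketch. The first title is made up for illustration; the pattern name `no_bracket_first` is also an assumption, chosen to show the anchor and the negative set side by side.

```python
import re

starts_with_tag = "[video] An introduction to regex"  # hypothetical title
ends_with_tag = "Swift Reversing [pdf]"

tag_at_start = r"^\[\w+\]"    # ^ outside a set: beginning anchor
tag_at_end = r"\[\w+\]$"      # $: ending anchor
no_bracket_first = r"^[^\[]"  # first ^ is an anchor, second ^ negates the set

for title in (starts_with_tag, ends_with_tag):
    print(
        title,
        bool(re.search(tag_at_start, title)),
        bool(re.search(tag_at_end, title)),
        bool(re.search(no_bracket_first, title)),
    )
```

For the first title the three checks give `True`, `False`, `False`; for the second they give `False`, `True`, `True`.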

In [99]:
# Count the number of times that a tag (e.g. [pdf] or [video]) occurs at the start of a title in titles

pattern_beginning = r'^\[\w+\]'

beginning_count = titles.str.contains(pattern_beginning).sum()
beginning_count

15

In [102]:
pattern_ending = r'\[\w+\]$'
ending_count = titles.str.contains(pattern_ending).sum()
ending_count

417

Up until now, we've been using sets like `[Pp]` to match different capitalizations in our regular expressions. This strategy works well when there is only **one** character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word `"email"`. Here is a list of the variations:

* email
* Email
* e Mail
* e mail
* E-mail
* e-mail
* eMail
* E-Mail
* EMAIL

We can use flags to specify that our regular expression should ignore case.

A [list of all available flags](https://docs.python.org/3/library/re.html#re.A) is in the documentation, but by far the most common and the most useful is the `re.IGNORECASE` flag, which is also available using the alias `re.I` for convenience.

When we use this flag, all uppercase letters will match their lowercase equivalents and vice versa. Let's look at an example without using the flag:

In [19]:
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])

email_tests.str.contains(r"email")

0     True
1    False
2    False
3    False
dtype: bool

In [20]:
# Now let's look at what happens when we use the flag:

import re

email_tests.str.contains(r"email",flags=re.I)

0    True
1    True
2    True
3    True
dtype: bool

In [128]:
import re
pattern = r'e-?\s?mail' # or  r'e[-\s]?mail'


email_mentions = titles[titles.str.contains(pattern, flags = re.I)].shape[0]
email_mentions

151

In [129]:
pattern = r'(e-?\s?mail)' # or  r'(e[-\s]?mail)'
titles.str.extract(pattern, flags = re.I)[0].unique()

array([nan, 'email', 'Email', 'e Mail', 'e mail', 'E-mail', 'e-mail',
       'eMail', 'E-Mail', 'EMAIL'], dtype=object)
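The effect of the flag on this pattern can also be checked with the `re` module alone. The sketch below runs the alternative pattern `r'e[-\s]?mail'` against the nine variations listed earlier, with and without `re.I`.

```python
import re

# The variations of "email" listed earlier.
variations = [
    "email", "Email", "e Mail", "e mail", "E-mail",
    "e-mail", "eMail", "E-Mail", "EMAIL",
]

pattern = r"e[-\s]?mail"

# Without the flag, only the all-lowercase forms match.
without_flag = [v for v in variations if re.search(pattern, v)]
# With re.I, letter case is ignored and every variation matches.
with_flag = [v for v in variations if re.search(pattern, v, flags=re.I)]

print(without_flag)    # ['email', 'e mail', 'e-mail']
print(len(with_flag))  # 9
```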