# 1.0 Regular Expression Basics

## 1.1 Introduction

In the previous lesson, we learned that regular expressions are a powerful way of building patterns to match text. In the first two missions of this Data Cleaning Advanced course, we're going to extend our knowledge about this extremely powerful tool that every data scientist should be familiar with.

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.


<center><img width="600" src="https://drive.google.com/uc?export=view&id=1jMWPd4CJTo60fWWRCU6Tkxw2ot-6KRB6"></center>


That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.


We could probably fill a whole Dataquest course with the intricacies of regular expressions, but instead we're going to give you a two-mission tour of the main components.

One thing to keep in mind before we start: **don't expect to remember all of the regular expression syntax**. The most important thing is to understand the core principles, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in these missions don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.

We'll be learning regular expressions while performing analysis on a dataset of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).

**Hacker News** is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented upon, similar to Reddit. **Hacker News is extremely popular in technology and startup circles**; stories that make it to the top of Hacker News' listings can get hundreds of thousands of visitors.

The dataset we will be working with is based off this CSV of [Hacker News stories from September 2015 to September 2016](https://www.kaggle.com/hacker-news/hacker-news-posts). The columns in the dataset are explained below:


- **id**: The unique identifier from Hacker News for the story
- **title**: The title of the story
- **url**: The URL that the story links to, if the story has a URL
- **num_points**: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the story
- **author**: The username of the person who submitted the story
- **created_at**: The date and time at which the story was submitted

For teaching purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

Let's start by reading our Hacker News dataset into a pandas dataframe.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Import the pandas library.
2. Read the **hacker_news.csv** file into a pandas dataframe. Assign the result to **hn**.


In [1]:
# put your code here

In [63]:
import pandas as pd
hn = pd.read_csv("hacker_news.csv")
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


## 1.2 The regular expression module

When working with regular expressions, we use the term **pattern** to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has **matched**.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string **"and"** within another string, the regex pattern for that is simply **and**:

<center><img width="500" src="https://drive.google.com/uc?export=view&id=1NRIn1qMY4KJ55kyyM9bLsv_xvxT1Qhqq"></center>

In the third example above, the pattern **and** does not match **Andrew** because regular expressions are case sensitive by default: even though **a** and **A** are the same letter, they are different characters.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: The [re module](https://docs.python.org/3/library/re.html#module-re). This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the **re** module is the [re.search()](https://docs.python.org/3/library/re.html#re.search) function, which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern

In [64]:
import re

m = re.search("and", "hand")
print(m)

<_sre.SRE_Match object; span=(1, 4), match='and'>


The **re.search()** function will return a [Match object](https://docs.python.org/3/library/re.html#match-objects) if the pattern is found anywhere within the string. If the pattern is not found, **re.search()** returns **None**:

In [65]:
m = re.search("and", "antidote")
print(m)

None
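The same behavior explains the earlier **Andrew** example. Here is a minimal sketch that converts the result of **re.search()** to a boolean for a few words; apart from **hand** and **Andrew**, the test words are made up for illustration:

```python
import re

# Hypothetical test words; "sandal" is not from the lesson.
words = ["hand", "Andrew", "sandal"]

# re.search() returns a Match object (truthy) or None (falsy),
# so bool() tells us whether the pattern was found.
results = {w: bool(re.search("and", w)) for w in words}
print(results)  # {'hand': True, 'Andrew': False, 'sandal': True}
```

**Andrew** fails because the pattern's lowercase **a** cannot match the uppercase **A**.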


We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is **True** while **None** is **False** to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

In [66]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")

Match
No Match
No Match


So far, we haven't done anything with regular expressions that we couldn't do using the **in** keyword. The power of regular expressions comes when we use one of the special character sequences.

The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match for in square brackets:

<center><img width="800" src="https://drive.google.com/uc?export=view&id=1X7htN9UeY9lkWz5X3sVSLahQWthens-7"></center>

The regular expression above will match the strings **mend**, **send**, and **bend**.
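As a quick check, and assuming the pictured pattern is **[msb]end**, the sketch below tests it against the three words above plus two made-up words that should not match:

```python
import re

pattern = r"[msb]end"

# "tend" and "end" are hypothetical extras added as non-matching cases.
candidates = ["mend", "send", "bend", "tend", "end"]

# Keep only the words in which the pattern is found.
matched = [w for w in candidates if re.search(pattern, w)]
print(matched)  # ['mend', 'send', 'bend']
```

The set stands for exactly one character, so **end** on its own does not match: there is no character in the set's position.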

Let's look at how we can add sets to match more of our example strings from earlier:

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1VPINjfNHUMV_oV3UeuKivacr3yDOPPaF"></center>

Let's take another look at the list of strings we used earlier:






In [67]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

If you look closely, you'll notice the first string contains the substring **Blue** with a capital letter, whereas the third string contains the substring **blue** in all lowercase. We can use the set **[Bb]** for the first character so that we can match both variations, and then use that to count how many times **Blue** or **blue** occur in the list:

In [68]:
blue_mentions = 0
pattern = "[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)

2


We're going to use this technique to find out how many times **Python** is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both **Python** with a capital **'P'** and **python** with a lowercase **'p'**.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


We have provided code to import the **re** module and extract a **list**, **titles**, containing all the titles from our dataset.

1. Initialize a variable **python_mentions** with the integer value **0**.
2. Create a string — **pattern** — containing a regular expression pattern that uses a set to match **Python** or **python**.
3. Use a loop to iterate over each item in the **titles** list, and for each item:
  - Use the **re.search()** function to check whether **pattern** matches the title.
  - If **re.search()** returns a match object, increment (add 1 to) the **python_mentions** variable.

In [69]:
import re

titles = hn["title"].tolist()

# put your code here

In [70]:
import re

titles = hn["title"].tolist()
python_mentions = 0
pattern = "[Pp]ython"

for t in titles:
    if re.search(pattern, t):
        python_mentions += 1

## 1.3 Counting matches with pandas methods

We've learned that we should avoid using loops in pandas, and that vectorized methods are often faster and require less code.

In the data cleaning lesson, we learned that the [Series.str.contains()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) method can be used to test whether a Series of strings match a particular regex pattern. Let's look at how we can replicate the example from the previous section using pandas.

We'll start by creating a pandas object containing our strings:

In [71]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object


Next, we'll create our regex pattern, and use **Series.str.contains()** to compare to each value in our series:

In [72]:
pattern = "[Bb]lue"

pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0    False
1     True
2     True
dtype: bool


The result is a boolean mask: a series of **True/False** values.

One of the neat things about boolean masks is that you can use the [Series.sum()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html) method to sum all the values in the boolean mask, with each **True** value counting as 1, and each **False** as 0. This means that we can easily count the number of values in the original series that matched our pattern:

In [73]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


If we wanted, we could use method chaining to do the whole operation on one line:

In [74]:
pattern_count = eg_series.str.contains(pattern).sum()
print(pattern_count)

2


Let's use this technique to replicate the analysis we did in the previous section.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


We have provided the regex pattern from the solution to the previous section.

- Assign the **title** column from the hn dataframe to the variable **titles**.
- Use **Series.str.contains()** and **Series.sum()** with the provided regex pattern to count how many Hacker News titles contain **Python** or **python**. Assign the result to **python_mentions**.



In [75]:
pattern = '[Pp]ython'

# put your code here

In [76]:
pattern = '[Pp]ython'
titles = hn['title']
python_mentions = titles.str.contains(pattern).sum()
python_mentions

160

## 1.4 Using regular expressions to select data

In the previous two sections, we used regular expressions to count how many titles contain **Python** or **python**. What if we wanted to view those titles?

In that case, we can use the boolean array returned by **Series.str.contains()** to select just those rows from our series. Let's look at that in action, starting by creating the boolean array.

In [77]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool


Then, we can use that boolean array to select just the matching rows:

In [78]:
py_titles = titles[py_titles_bool]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


We can also do it in a streamlined, single line of code:

In [79]:
py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


Let's use this technique to select all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use **Series.str.contains()** to create a series of the values from **titles** that contain **Ruby** or **ruby**. Assign the result to **ruby_titles**.

In [80]:
titles = hn['title']

# put your code here

In [81]:
titles = hn['title']
ruby_titles = titles[titles.str.contains(r"[Rr]uby")]
ruby_titles

190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa.

## 1.5 Quantifiers

In the data cleaning lesson, we learned that we could use braces ({}) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from **1000** to **2999** we could write the regular expression below:



<center><img width="800" src="https://drive.google.com/uc?export=view&id=1E8AxfVB26IRlafnPX7aEBnmLraMBGEPR"></center>


This type of regular expression syntax is called a **quantifier**. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both **e-mail** and **email**. To do this, we would specify that **-** should match either zero or one times.
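To make the 1000 to 2999 example concrete, here is a small sketch. It assumes the pattern is written along the lines of **[1-2][0-9]{3}** and uses **re.fullmatch()**, which only succeeds when the whole string matches; the test values are made up:

```python
import re

# Assumed pattern: one digit from 1-2, then exactly three digits from 0-9.
pattern = r"[1-2][0-9]{3}"

# Hypothetical test values.
tests = ["1000", "2999", "999", "3000", "1984"]

in_range = [t for t in tests if re.fullmatch(pattern, t)]
print(in_range)  # ['1000', '2999', '1984']
```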

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:


<center><img width="600" src="https://drive.google.com/uc?export=view&id=12Xjddsnk3vIPy1Jr0705-mnE86Dk2-rn"></center>

You might notice that the last two examples above omit either the lower or the upper bound, in the same way that we can omit the first or last index when slicing lists.
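The sketch below illustrates the four numeric quantifier forms by checking which run lengths of the letter **a** each one accepts. The helper function is hypothetical and exists only for this demonstration:

```python
import re

def accepted_lengths(quantifier, max_len=6):
    """Hypothetical helper: list the lengths n (0..max_len) for which
    a run of n 'a' characters fully matches 'a' + quantifier."""
    pattern = "a" + quantifier
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "a" * n)]

exactly_3 = accepted_lengths("{3}")
three_to_5 = accepted_lengths("{3,5}")
at_least_3 = accepted_lengths("{3,}")
at_most_3 = accepted_lengths("{,3}")

print(exactly_3)   # [3]
print(three_to_5)  # [3, 4, 5]
print(at_least_3)  # [3, 4, 5, 6]
print(at_most_3)   # [0, 1, 2, 3]
```

Note that **{,3}** also accepts zero repetitions, so it matches the empty string.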

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1lQxGeNXxFXW7aRnFIE0o-QQTFQPssBaa"></center>

On this screen, we're going to find how many titles in our dataset mention **email** or **e-mail**. To do this, we'll need to use **?**, the optional quantifier, to specify that the dash character **-** is optional in our regular expression.
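Before applying **?** to the dataset, here is a minimal sketch of how the three single-character quantifiers differ. The word list is made up, and **re.fullmatch()** is used so that the whole word has to match:

```python
import re

# Hypothetical spellings with zero, one, and two "u" characters.
words = ["color", "colour", "colouur"]

results = {}
for quantifier in ["?", "*", "+"]:
    # The quantifier applies only to the character just before it: "u".
    pattern = "colou" + quantifier + "r"
    results[quantifier] = [w for w in words if re.fullmatch(pattern, w)]

print(results["?"])  # ['color', 'colour']             zero or one
print(results["*"])  # ['color', 'colour', 'colouur']  zero or more
print(results["+"])  # ['colour', 'colouur']           one or more
```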

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Use a regular expression and **Series.str.contains()** to create a boolean mask that matches items from **titles** containing **email** or **e-mail**. Assign the result to **email_bool**.
2. Use **email_bool** to count the number of titles that matched the regular expression. Assign the result to **email_count**.
3. Use **email_bool** to select only the items from **titles** that matched the regular expression. Assign the result to **email_titles**.


In [82]:
# put your code here

In [83]:
# The `titles` variable was defined in an earlier cell
email_bool = titles.str.contains(r"e-?mail")
email_count = email_bool.sum()
email_titles = titles[email_bool]

## 1.6 Character classes

So far, we've learned how to perform simple matches with sets, and how to use quantifiers to specify when a character should repeat a certain number of times. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like **[pdf]**. Here are a few examples of story titles with these tags:

```
[video] Google Self-Driving SUV Sideswipes Bus
New Directions in Cryptography by Diffie and Hellman (1976) [pdf]
Wallace and Gromit  The Great Train Chase (1993) [video]
```

In this section, our task is going to be to find how many titles in our dataset have tags.

Our first inclination may be to create the regex **[pdf]**. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters **p**, **d**, or **f**.

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1-qA_Mqp_mlEIdiWWgYgzLwjY6on-uaNP"></center>

To match the substring **"[pdf]"**, we can use backslashes to escape both the open and closing brackets: **\[pdf\]**.
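The difference between the two patterns can be sketched with **re.search()** on one of the titles from our dataset. The **Match.group()** method used here returns the matched text:

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# Unescaped brackets define a set: match a single p, d, or f.
set_match = re.search(r"[pdf]", title)

# Escaped brackets are literal: match the exact text "[pdf]".
literal_match = re.search(r"\[pdf\]", title)

print(set_match.group())      # d  (the "d" in "indexing")
print(literal_match.group())  # [pdf]
```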

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1tpDE1r0WT1iXso--H95E525KHvRLJpWj"></center>

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like **pdf** and **video**) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

1. The set notation using brackets to match any of a number of characters.
2. The range notation, which we used to match ranges of digits (like **[0-9]**).

Let's look at a summary of syntax for some of the regex character classes:

<center><img width="600" src="https://drive.google.com/uc?export=view&id=11k144eQXo1YickVjFd9Kr1jJe3978shg"></center>

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.
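As an illustration of both points, the sketch below uses **re.findall()**, which returns a list of every non-overlapping match, on a made-up string:

```python
import re

# Hypothetical example text.
text = "Room 4B, Floor 2"

# A letter range: any single uppercase letter.
uppercase = re.findall(r"[A-Z]", text)

# Two ranges combined in one set: any digit or uppercase letter.
digits_or_uppercase = re.findall(r"[0-9A-Z]", text)

print(uppercase)            # ['R', 'B', 'F']
print(digits_or_uppercase)  # ['R', '4', 'B', 'F', '2']
```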

Just like with quantifiers, there are some other common character classes which we'll use a lot.


<center><img width="600" src="https://drive.google.com/uc?export=view&id=12bh9tK8QLmCCT11IviujJKI4J6SYIK7p"></center>

The one that we'll be using in order to match characters in tags is \w, which represents any digit, uppercase or lowercase letter, or underscore. Each character class represents a single character, so to match multiple characters (e.g. words like **video** and **pdf**), we'll need to combine them with **quantifiers**.

In order to match word characters between our brackets, we can combine the word character class (**\w**) with the 'one or more' quantifier (**+**), giving us a combined pattern of **\w+**.

This will match sequences like **pdf**, **video**, **Python**, and **2018** but won't match a sequence containing a space or punctuation character like **PHP-DEV** or **XKCD Flowchart**. If we wanted to match those tags as well, we could use **.+**; however, in this case, we're just interested in single-word tags without special characters.
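A short sketch comparing the two options on the example tags mentioned above, each wrapped in brackets:

```python
import re

tags = ["[pdf]", "[video]", "[2018]", "[PHP-DEV]", "[XKCD Flowchart]"]

# \w+ allows only word characters between the brackets.
word_only = [t for t in tags if re.search(r"\[\w+\]", t)]

# .+ allows one or more of any character between the brackets.
any_chars = [t for t in tags if re.search(r"\[.+\]", t)]

print(word_only)  # ['[pdf]', '[video]', '[2018]']
print(any_chars)  # all five tags
```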

Let's quickly recap the concepts we learned in this section:

- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. **\[** will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. \w will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.

We'll use these concepts to count the number of titles that contain a tag.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Write a regular expression, assigning it as a string to the variable **pattern**. The regular expression should match, in order:
  - A single open bracket character.
  - One or more word characters.
  - A single close bracket character.
2. Use the regular expression to select only items from **titles** that match. Assign the result to the variable **tag_titles**.
3. Count how many matching titles there are. Assign the result to **tag_count**.



In [84]:
# put your code here

In [85]:
pattern = r"\[\w+\]"
tag_titles = titles[titles.str.contains(pattern)]
tag_count = tag_titles.shape[0]

In [86]:
tag_count

444

## 1.7 Accessing the matching text with capture groups

In the previous section, we learned that we can use backslashes to escape the [ and ] characters. Backslashes are used to escape many other characters in regular expressions, as well as to denote some special character sequences (like character classes).

In Python, a backslash followed by certain characters represents an [escape sequence](https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences) — like the **\n** sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring **\b**:

In [87]:
print('hello\b world')

hello world


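The escape sequence \b represents a backspace character, so the string no longer contains a backslash followed by the letter b. Depending on where the output is displayed, the backspace either erases the preceding letter (a terminal would show **hell world**) or is simply invisible, as in the output above. The character sequence \b also has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.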

One way is to add an extra backslash before the "b":

In [88]:
print('hello\\b world')

hello\b world


This can make regular expressions even more difficult to read and interpret, so instead we use [raw strings](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals), which we denote by prefixing our string with the r character. Let's take a look at the code from above with a raw string:

In [89]:
print(r'hello\b world')

hello\b world


**We strongly recommend using raw strings for every regex you write**, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something that causes your regex to break.
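To see why this matters for regular expressions, the sketch below writes the same pattern with and without the r prefix. It uses the regex meaning of \b (covered later in this lesson) only as an example, and the sample text is made up:

```python
import re

regular = "\bJava\b"   # each \b becomes a single backspace character
raw = r"\bJava\b"      # each \b stays as a backslash followed by "b"

print(len(regular), len(raw))  # 6 8

# Hypothetical sample text.
text = "I like Java"

regular_match = re.search(regular, text)  # looks for literal backspaces
raw_match = re.search(raw, text)          # \b reaches the regex engine

print(regular_match)      # None
print(raw_match.group())  # Java
```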

In the previous section, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use **capture groups**. Capture groups allow us to specify one or more groups within our match that we can access separately. In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous section, and break down how each character in our regular expression works:

<left><img width="800" src="https://drive.google.com/uc?export=view&id=1tnvdhk-hr5hp8zpcrncpip65C4X1rBvO"></left>

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise:

In [90]:
tag_5 = tag_titles.head()
print(tag_5)

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object


We use the [Series.str.extract()](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.str.extract.html) method to extract the match within our parentheses:

In [91]:
pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

            0
66      [pdf]
100  [German]
159     [pdf]
162     [pdf]
195    [Beta]


We can move our parentheses inside the brackets to get just the text:



In [92]:
pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

          0
66      pdf
100  German
159     pdf
162     pdf
195    Beta


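Because **Series.str.extract()** returns a dataframe by default in recent versions of pandas (the column labeled **0** above), we select that column and then use **Series.value_counts()** to quickly get a frequency table of the tags:

In [93]:
tag_5_freq = tag_5_matches[0].value_counts()
print(tag_5_freq)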
print(tag_5_freq)

pdf       3
German    1
Beta      1
dtype: int64
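
An alternative is to ask **Series.str.extract()** for a series directly by passing **expand=False**, which works when the pattern has a single capture group. The sketch below assumes pandas is installed and uses made-up titles rather than the Hacker News data:

```python
import pandas as pd

# Hypothetical titles for illustration only.
sample_titles = pd.Series([
    "Paper A [pdf]",
    "Talk B [video]",
    "Paper C [pdf]",
    "No tag here",
])

# With one capture group and expand=False, the result is a Series.
# Titles without a match become NaN.
sample_tags = sample_titles.str.extract(r"\[(\w+)\]", expand=False)

# value_counts() ignores NaN by default.
sample_freq = sample_tags.value_counts()
print(sample_freq)
```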


Let's use this technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

We have provided a commented line of code with the pattern from the previous exercise.

1. Uncomment the line of code and add parentheses to create a capture group inside the brackets.
2. Use **Series.str.extract()** and **Series.value_counts()** with the modified regex pattern to produce a frequency table of all the tags in the **titles** series. Assign the frequency table to **tag_freq**.


In [94]:
# pattern = r"\[\w+\]"
# put your code here

In [95]:
pattern = r"\[(\w+)\]"
# expand=False returns a Series, so Series.value_counts() applies
tag_freq = titles.str.extract(pattern, expand=False).value_counts()

## 1.8 Negative character classes

In the previous sections, we wrote mostly simple regular expressions. In reality, regular expressions are often complex. When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.

In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings:

In [96]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows you to build regular expressions and includes syntax highlighting, instant matches, and regex syntax reference. For this screen, we'll use the **first_10_matches** function we just built to iteratively build a regular expression.

Earlier, we counted the titles that included Python — let's write a simple regular expression to match Java (another popular language), and use our function to look at the matches:

In [97]:
first_10_matches(r"[Jj]ava")

267      Show HN: Hire JavaScript - Top JavaScript Talent
436     Unikernel Power Comes to Java, Node.js, Go, an...
580     Python integration for the Duktape Javascript ...
811     Ask HN: Are there any projects or compilers wh...
1023                         Pippo  Web framework in Java
1046    If you write JavaScript tools or libraries, bu...
1093    Rollup.js: A next-generation JavaScript module...
1162                 V8 JavaScript Engine: V8 Release 5.4
1195                   Proposed JavaScript Standard Style
1314           Show HN: Design by Contract for JavaScript
Name: title, dtype: object

We can see that there are a number of matches that contain **Java** as part of the word **JavaScript**. We want to exclude these titles from matching so we get an accurate count.

One way to do this is by using **negative character classes**. Negative character classes match every character except those in a given character class. Let's look at a table of the common negative character classes:


<left><img width="600" src="https://drive.google.com/uc?export=view&id=1-iABsRHMQ1aR_fKpO_Q0MHKe0E0DUR7y"></left>

Let's use the negative set [^Ss] to exclude instances like JavaScript and Javascript:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Write a regular expression that will match titles containing Java.
  - You might like to use the **first_10_matches()** function or a site like [RegExr](https://regexr.com/) to build your regular expression.
  - The regex should match whether or not the first character is capitalized.
  - The regex shouldn't match where 'Java' is followed by the letter 'S' or 's'.
2. Select every row from titles that matches the regular expression. Assign the result to java_titles.


In [98]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

# put your code here

In [99]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern = r"[Jj]ava[^Ss]"
java_titles = titles[titles.str.contains(pattern)]

## 1.9 Word Boundaries

In the previous section, we used a negative set to find all of the mentions of "Java" in our dataset:

In [100]:
first_10_matches(r"[Jj]ava[^Ss]")

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
2367    Code that is valid in both PHP and Java, and p...
2493    Ask HN: I've been a java dev for a couple of y...
2751                Eventsourcing for Java 0.4.0 released
2910                2016 JavaOne Intel Keynote  32mn Talk
3452    What are the Differences Between Java Platform...
Name: title, dtype: object

While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side-effect of removing any titles where **Java** occurs at the end of the string, like this title:

```
Pippo  Web framework in Java
```

This is because the negative set **[^Ss]** must match one character. Instances at the end of a string aren't followed by any characters, so there is no match.

A different approach to take in cases like these is to use the **word boundary anchor**, specified using the syntax **\b**. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1n2zGGsbf01dqP1o2zl2UXdKZUx0w_jRF"></left>

Let's look at how using a word boundary changes the match from the string in the example above:




In [101]:
string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"

m1 = re.search(pattern_1, string)
print(m1)

None


The regular expression returns **None**, because there is no substring that contains Java followed by a character that isn't S.

Let's instead use word boundaries in our regular expression:

In [102]:
pattern_2 = r"\bJava\b"

m2 = re.search(pattern_2, string)
print(m2)

<_sre.SRE_Match object; span=(41, 45), match='Java'>


With the word boundary, our pattern matches the **Java** at the end of the string.
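As a quick side-by-side check, the sketch below runs both patterns against three short strings. The sample strings and the `results` variable are made up for illustration; only the two patterns come from the text above.

```python
import re

# Illustrative sample strings (not taken from the dataset).
samples = [
    "Web framework in Java",        # "Java" at the very end
    "Java EE and Microservices",    # "Java" followed by a space
    "Learning JavaScript",          # should never match
]

negative_set = re.compile(r"Java[^S]")
word_boundary = re.compile(r"\bJava\b")

# For each string, record whether each pattern finds a match.
results = [
    (bool(negative_set.search(s)), bool(word_boundary.search(s)))
    for s in samples
]

for s, (neg, wb) in zip(samples, results):
    print(f"{s!r}: negative set={neg}, word boundary={wb}")
```

Only the word boundary version finds the trailing "Java", and neither pattern matches "JavaScript".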

Let's use the word boundary anchor as part of our regular expression to select the titles that mention **Java**.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Write a regular expression that will match titles containing Java.
  - You might like to use the **first_10_matches()** function or a site like [RegExr](https://regexr.com/) to build your regular expression.
  - The regex should match whether or not the first character is capitalized.
  - The regex should match only where 'Java' is preceded and followed by a word boundary.
2. Select from **titles** only the items that match the regular expression. Assign the result to **java_titles**.

In [103]:
# put your code here

In [104]:
pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]

In [105]:
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

## 1.10 Matching at the start and end of strings

So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to specifically match a pattern at the start or end of a string.

In the previous section, we learned that the **word boundary anchor** matches the position between a word character and a non-word character. More generally in regular expressions, an **anchor** matches a position rather than a character, as opposed to character classes, which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **end anchor**, which represent the start and the end of the string.

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1E7LTSRdam8bZRrlqL2VQwO8sJgmCy8rI"></left>

Note that the ^ character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a [ or not.
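To make the two roles of `^` concrete, here is a small sketch using `re.findall()`. The sample strings are made up for illustration.

```python
import re

# Outside a set, ^ anchors the match to the start of the string.
anchored = re.findall(r"^R", "Red Rose")

# As the first character inside a set, ^ negates the set:
# [^R] matches any single character other than "R".
negated = re.findall(r"[^R]", "Red")

print(anchored)  # only the "R" at the start of the string
print(negated)   # every character except "R"
```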

Let's start with a few test cases that all contain the substring **Red** at different parts of the string, as well as a test function:

In [106]:
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])
print(test_cases)

0    Red Nose Day is a well-known fundraising event
1                          My favorite color is Red
2          My Red Car was purchased three years ago
dtype: object


If we want to match the word **Red** only if it occurs at the start of the string, we add the beginning anchor to the start of our regular expression:

In [107]:
test_cases.str.contains(r"^Red")

0     True
1    False
2    False
dtype: bool

If we want to match the word **Red** only if it occurs at the end of the string, we add the end anchor to the end of our regular expression:

In [108]:
test_cases.str.contains(r"Red$")

0    False
1     True
2    False
dtype: bool
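The same anchors work with the `re` module. As a minimal sketch (the sample strings are illustrative, and the combined pattern is an addition to the examples above), the code below also uses both anchors in one pattern, which only matches when the pattern spans the whole string:

```python
import re

cases = ["Red Nose Day", "My favorite color is Red", "Red"]

results = []
for text in cases:
    starts = bool(re.search(r"^Red", text))   # "Red" at the start
    ends = bool(re.search(r"Red$", text))     # "Red" at the end
    whole = bool(re.search(r"^Red$", text))   # the string is exactly "Red"
    results.append((starts, ends, whole))
    print(f"{text!r}: starts={starts}, ends={ends}, whole={whole}")
```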

Let's use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Count the number of times that a tag (e.g. **[pdf]** or **[video]**) occurs at the start of a title in **titles**. Assign the result to **beginning_count**.
2. Count the number of times that a tag (e.g. **[pdf]** or **[video]**) occurs at the end of a title in **titles**. Assign the result to **ending_count**.


In [109]:
# put your code here

In [110]:
pattern_beginning = r"^\[\w+\]"
beginning_count = titles.str.contains(pattern_beginning).sum()
print(beginning_count)

pattern_ending =  r"\[\w+\]$"
ending_count = titles.str.contains(pattern_ending).sum()
print(ending_count)

15
417


## 1.11 Challenge: using flags to modify regex patterns

Up until now, we've been using sets like **[Pp]** to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

```
email
Email
e Mail
e mail
E-mail
e-mail
eMail
E-Mail
EMAIL
emails
Emails
E-Mails
```

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.

Instead, we can use **flags** to specify that our regular expression should ignore case.

Both **re.search()** and the pandas regular expression methods accept an optional **flags** argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

A [list of all available flags](https://docs.python.org/3/library/re.html#re.A) is in the documentation, but by far the most common and the most useful is the [re.IGNORECASE flag](https://docs.python.org/3/library/re.html#re.I), which is also available using the alias **re.I** for convenience.

When you use this flag, all uppercase letters will match their lowercase equivalents and vice versa. Let's look at an example without using the flag:


In [111]:
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")

0     True
1    False
2    False
3    False
dtype: bool

Now let's look at what happens when we use the flag:



In [112]:
import re
email_tests.str.contains(r"email",flags=re.I)

0    True
1    True
2    True
3    True
dtype: bool

No matter what the capitalization is, our regular expression matches.
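Since **re.search()** accepts the same argument, here is a minimal sketch of the flag used with the `re` module directly. It also shows how more than one flag can be passed by combining them with the `|` operator. The second flag, `re.M` (multiline), is not covered in this lesson and is included only to illustrate the combination.

```python
import re

# The same case-insensitive check using the re module directly.
tests = ["email", "Email", "eMail", "EMAIL"]
matches = [bool(re.search(r"email", t, flags=re.I)) for t in tests]
print(matches)

# Flags can be combined with the | operator. re.M (MULTILINE) makes ^
# match at the start of every line rather than only the start of the string.
combined = re.findall(r"^red", "Red\nred", flags=re.I | re.M)
print(combined)
```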

We'll finish section 1 by writing a regular expression to count the number of times that email is mentioned in story titles. You'll need to use both the ignorecase flag and some of the other regex components you've already learned in this first section.

This subsection is a challenge section, so it's a little less guided than the exercises so far. As we mentioned at the start of this lesson, **regular expressions** can be very complex, and **unless you write them frequently, it's unlikely that you will remember all the syntax**.

With that in mind, we don't expect that you will immediately remember how to perform this task so don't get disheartened if this exercise takes you more attempts than the other exercises. If you get stuck, you might try one or more of the following:

- Scanning over the regex concepts we've taught in the previous lesson.
- Using the test cases that we'll provide.
- Using a web tool like [RegExr](https://regexr.com/) that lets you write a regex iteratively and see how it matches the test cases.

We've also provided a number of hints; however, we strongly recommend trying to complete the challenge without them first. The skills you build as you try to solve the puzzle will be extremely valuable for you as you continue on your journey to becoming a data expert!

To help you test the regular expression that you build, we have provided a variable that includes each of the different ways "email" is included in the data.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Write a regular expression that will match all variations of email included in the starter code. Write your regular expression in a way that will be compatible with the ignorecase flag.
  - As you build your regular expression, you might like to use **Series.str.contains()** like we did in the examples earlier in this screen.
2. Once your regular expression matches all the test cases, use it to count the number of mentions of email in titles in the dataset. Assign the result to **email_mentions**.

In [113]:
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL'])

# put your code here

In [114]:
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])
pattern = r"\be[\-\s]?mails?\b"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()

## 1.12 Next steps

In section 1, we learned the basics of using regular expressions to perform powerful text matching, including:

- Character classes to match certain groups of characters, including sets to match different capitalizations of programming languages.
- Quantifiers to match different quantities of characters, including matching different variations of "email."
- Negative character classes for matching anything except certain groups of characters.
- Word boundaries to match only specific instances of words.
- Positional anchors to match only at the start and end of strings.
- The ignorecase flag to make patterns case insensitive.

In the next section, we'll expand on our regular expression knowledge with some advanced regex concepts!

# 2.0 Advanced Regular Expressions

## 2.1 Introduction

In the previous section, we learned that regular expressions provide powerful ways to describe patterns in text that can help us clean and extract data. In this section, we're going to build on those foundational principles, and learn:

- Several new regex syntax components to allow us to express more complex criteria.
- How to combine regular expression patterns to extract and transform data.
- How to replace and clean data using regular expressions.


We'll continue to analyze and count mentions of different programming languages in the dataset, and then we'll finish by extracting the different components of the URLs submitted to **Hacker News**.

**As we mentioned in the previous section, you shouldn't expect to remember every single detail of regular expression syntax. The most important thing is to understand the core principles, what is possible, and where to look up the details**. This will mean you can quickly jog your memory whenever you need regular expressions.

We'll be building on the foundational concepts that we learned in the previous section. If you need to refresh any points of the syntax while you complete exercises in this section, we recommend using a regex syntax reference like [RegExr](https://regexr.com/) so you can practice looking up syntax as you need it.

Let's start by reading in the dataset using pandas and extracting the story titles from the **title** column:

In [115]:
import pandas as pd
import re

hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

In the story titles, we have two different capitalizations for the Python language: Python and python. In the previous lesson, we learned two techniques for handling cases like these. The first is to use a set to match either **P** or **p**:

In [116]:
pattern = r"[Pp]ython"
python_counts = titles.str.contains(pattern).sum()
print(python_counts)

160


The second option we learned is to use **re.I** — the ignorecase flag — to make our pattern case insensitive:



In [117]:
pattern = r"python"
python_counts = titles.str.contains(pattern, flags=re.I).sum()
print(python_counts)

160


The ignorecase flag is particularly useful when we have many different capitalizations for a word or phrase. In our dataset, the SQL language has three different capitalizations: **SQL**, **sql**, and **Sql**.

To use sets to capture all of these variations, we would need to use a set for each character:

In [118]:
pattern = r"[Ss][Qq][Ll]"
sql_counts = titles.str.contains(pattern).sum()
print(sql_counts)

108


Instead, let's use the ignorecase flag to write a case-insensitive version of this regular expression.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

We have already imported **pandas** and **re**, read the CSV and extracted the **title** column.

1. Create a case insensitive regex pattern that matches all case variations of the letters **SQL**.
2. Use that regex pattern and the ignorecase flag to count the number of mentions of SQL in **titles**. Assign the result to **sql_counts**.

In [119]:
import pandas as pd
import re

hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

# put your code here

In [120]:
import pandas as pd
import re

hn = pd.read_csv("hacker_news.csv")
titles = hn['title']
sql_pattern = r"SQL"
sql_counts = titles.str.contains(sql_pattern, flags=re.I).sum()
sql_counts

108

## 2.2 Capture groups

In the previous exercise, we counted the number of mentions of "SQL" in the titles of stories. As we learned in the previous mission, to extract those mentions, we need to do two things:

- Use the [Series.str.extract()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) method.
- Use a regex capture group.

We define a capture group by wrapping the part of our pattern we want to capture in parentheses. If we want to capture the whole pattern, we just wrap the whole pattern in a pair of parentheses:

<center><img width="800" src="https://drive.google.com/uc?export=view&id=11UR9TJQw1cy7_SP73oe2nTRXrYrGDdnC"></center>

Let's look at how we can use a capture group to create a frequency table of the different capitalizations of SQL in our dataset. We start by wrapping our regex pattern in parentheses:

In [121]:
pattern = r"(SQL)"

Next, we use **Series.str.extract()** to extract the different capitalizations:

In [122]:
sql_capitalizations = titles.str.extract(pattern, flags=re.I)

Lastly, we use the [Series.value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method to create a frequency table of those capitalizations:

In [123]:
sql_capitalizations_freq = sql_capitalizations.value_counts()
print(sql_capitalizations_freq)

SQL    101
Sql      4
sql      3
dtype: int64
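One practical caveat: in recent pandas versions, `Series.str.extract()` returns a DataFrame by default, even with a single capture group, so `value_counts()` may print differently from the output above. The following is a minimal sketch, assuming a recent pandas version and a made-up sample series, that passes `expand=False` to get a Series back:

```python
import re
import pandas as pd

# Made-up titles for illustration only.
sample = pd.Series(["MySQL tips", "Why sql?", "No match here", "SQL vs NoSQL"])

# expand=False returns a Series when the pattern has one capture group.
caps = sample.str.extract(r"(SQL)", flags=re.I, expand=False)

# Titles without a match become NaN, which value_counts() skips by default.
freq = caps.value_counts()
print(freq)
```

Only the first match in each title is extracted, so "SQL vs NoSQL" contributes a single "SQL".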


We can extend this analysis by looking at titles that have letters immediately before the "SQL," which is a convention often used to denote different variations or flavors of SQL:

In [124]:
pattern = r"(\w+SQL)"
sql_flavors = titles.str.extract(pattern, flags=re.I)
sql_flavors_freq = sql_flavors.value_counts()
print(sql_flavors_freq)

PostgreSQL    27
NoSQL         16
MySQL         12
nosql          1
mySql          1
SparkSQL       1
MemSQL         1
CloudSQL       1
dtype: int64


Notice how there is some duplication due to varied capitalization in this frequency table:

- NoSQL and nosql
- MySQL and mySql

In this exercise, we're going to extract the mentions of different SQL flavors into a new column and clean those duplicates by making them all lowercase. We'll then analyze the results to look at the average number of comments for each flavor.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


We have created a new dataframe, **hn_sql**, including only rows that mention a SQL flavor.

1. Create a new column called **flavor** in the **hn_sql** dataframe, containing extracted mentions of SQL flavors, defined as:

  - Any time 'SQL' is preceded by one or more word characters.
  - Ignoring all case variation.
2. Use the [Series.str.lower()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.lower.html#pandas.Series.str.lower) method to clean the values in the **flavor** column by converting them to lowercase. Assign the values back to the column in **hn_sql**.
3. Use the [DataFrame.pivot_table()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html) method to create a pivot table, **sql_pivot**.
  - The index of the pivot table should be the **flavor** column.
  - The values of the pivot table should be the mean of the **num_comments** column, aggregated by SQL flavor.

In [125]:
hn_sql = hn[hn['title'].str.contains(r"\w+SQL", flags=re.I)].copy()

In [127]:
hn_sql = hn[hn['title'].str.contains(r"\w+SQL", flags=re.I)].copy()
hn_sql["flavor"] = hn_sql["title"].str.extract(r"(\w+SQL)", flags=re.I, expand=False)
hn_sql["flavor"] = hn_sql["flavor"].str.lower()
sql_pivot = hn_sql.pivot_table(index="flavor",values="num_comments", aggfunc='mean')

In [128]:
sql_pivot

            num_comments
flavor
cloudsql        5.000000
memsql         14.000000
mysql          12.230769
nosql          14.529412
postgresql     25.962963
sparksql        1.000000


## 2.3 Using capture groups to extract data

So far we've used capture groups to extract all or most of the text in our regular expression pattern. Capture groups can also be useful to extract specific data from within our expression.

Let's look at a sample of Hacker News titles that mention Python:

```
Developing a computational pipeline using the asyncio module in Python 3
Python 3 on Google App Engine flexible environment now in beta
Python 3.6 proposal, PEP 525: Asynchronous Generators
How async/await works in Python 3.5.0
Ubuntu Drops Python 2.7 from the Default Install in 16.04
Show HN: First Release of Transcrypt Python3.5 to JavaScript Compiler
```

All of these examples have a number after the word "Python," which indicates a version number. Sometimes a space precedes the number, sometimes it doesn't. We can use the following regular expression to match these cases:

<left><img width="800" src="https://drive.google.com/uc?export=view&id=1wAK20uyzSqtAxJmK2jG9qUy9Mf993UDt"></left>

We can use capture groups to extract the version of Python that is mentioned most often in our dataset by wrapping parentheses around the part of our regular expression which captures the version number.

We'll use a capture group to capture the version number after the word "Python," and then build a frequency table of the different versions.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Write a regular expression pattern which will match **Python** or **python**, followed by a space, followed by one or more digit characters or periods.
  - The regular expression should contain a capture group for the digit and period characters (the Python versions)
2. Extract the Python versions from **title** using the regular expression pattern.
3. Use **Series.value_counts()** and the **dict()** function to create a dictionary frequency table of the extracted Python versions. Assign the result to **py_versions_freq**.

In [129]:
# put your code here

In [133]:
pattern = r"[Pp]ython ([\d.]+)"

py_versions = titles.str.extract(pattern)
py_versions_freq = dict(py_versions.value_counts())

In [132]:
py_versions_freq

{('1.5',): 1,
 ('2',): 3,
 ('2.7',): 1,
 ('3',): 10,
 ('3.5',): 3,
 ('3.5.0',): 1,
 ('3.6',): 2,
 ('4',): 1,
 ('8',): 1}
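The exercise pattern requires a space, so a title like `Python3.5` is skipped. A hedged variant, which is not part of the original solution, makes the space optional with `?`. The sketch below uses the `re` module and `collections.Counter` on some of the sample titles shown earlier in this section:

```python
import re
from collections import Counter

sample_titles = [
    "Developing a computational pipeline using the asyncio module in Python 3",
    "Python 3.6 proposal, PEP 525: Asynchronous Generators",
    "How async/await works in Python 3.5.0",
    "Ubuntu Drops Python 2.7 from the Default Install in 16.04",
    "Show HN: First Release of Transcrypt Python3.5 to JavaScript Compiler",
]

# " ?" matches zero or one space; the capture group holds the version.
pattern = re.compile(r"[Pp]ython ?([\d.]+)")

versions = []
for title in sample_titles:
    match = pattern.search(title)
    if match:
        versions.append(match.group(1))

# Build a plain dictionary frequency table of the captured versions.
freq = dict(Counter(versions))
print(freq)
```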

## 2.4 Counting mentions of the 'C' Language

So far, we've created regular expressions to clean and analyze the number of mentions of the Python, SQL, and Java languages. Next up: counting the mentions of the C language.

We can start with a simple regular expression and then iterate as we find and exclude incorrect matches. Let's start with a simple regex that matches the letter "c" with word boundary anchors on either side:

<left><img width="700" src="https://drive.google.com/uc?export=view&id=1b-I2L7LKbsx0Omd1xElPsHgOJq1EWn3N"></left>

We'll re-use the **first_10_matches()** function that we defined in section 1 to see the results we get from this regular expression:




In [136]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

first_10_matches(r"\b[Cc]\b")

13                 Custom Deleters for C++ Smart Pointers
220                        Lisp, C++: Sadness in my heart
221                  MemSQL (YC W11) Raises $36M Series C
353     VW C.E.O. Personally Apologized to President O...
365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
508     BDE 3.0 (Bloomberg's core C++ library): Open S...
521          Fuchsia: Micro kernel written in C by Google
549     How to Become a C.E.O.? The Quickest Path Is a...
1282    A lightweight C++ signals and slots implementa...
Name: title, dtype: object

Immediately, our results are reasonably relevant. However, we can quickly identify a few match types we want to prevent:

- Mentions of C++, a distinct language from C.
- Cases where the letter C is followed by a period, like in the substring C.E.O.

Let's use a negative set to prevent matches for the + character and the . character.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

We have provided a commented line of code containing the regular expression we used above.

1. Uncomment the line of code. Add a negative set to the end of the regular expression that excludes:
  - The period character .
  - The plus character +.
2. Use the **first_10_matches()** function to return the matches for the regular expression you built, assigning the result to **first_ten**.

In [137]:
# pattern = r"\b[Cc]\b"

In [138]:
pattern = r"\b[Cc]\b[^.+]"
first_ten = first_10_matches(pattern)
first_ten

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

## 2.5 Using lookarounds to control matches based on surrounding text

Let's look at the result of the previous exercise:

```
365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object
```

It looks like we're getting close. In our first 10 matches we have one irrelevant result, which is about "Series C," a term used to represent a particular type of startup fundraising.

Additionally, we've run into the same issue as we did in the previous mission: by using a negative set, we may have eliminated any instances where the last character of the title is "C." (The second-to-last line of output matches even though it ends with "C," because it also has a "C" earlier in the string.)

Neither of these problems can be solved with negative sets, which match exactly one character and therefore can't test for a whole word or for the end of the string. Instead, we'll need a new tool: **lookarounds**.

Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match. There are four types of lookarounds:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1JPl5PtjOBY_LrJZBkTrdhEssP2oUiyB7"></left>


These tips can help you remember the syntax for lookarounds:
  - Inside the parentheses, the first character of a lookaround is always ?.
  - If the lookaround is a lookbehind, the next character will be <, which you can think of as an arrow head pointing behind the match.
  - The next character indicates whether the lookaround is positive (=) or negative (!).
  
Let's create some test data that we'll use to illustrate how lookarounds work:

In [139]:
test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

We'll also create a function that will loop over our test cases and tell us whether our pattern matches. We'll use the **re** module rather than pandas since it tells us the exact text that matches, which will help us understand how the lookaround is working:

In [140]:
def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")

In each instance, we'll aim to match the substring Green depending on the characters that precede or follow it. Let's start by using a **positive lookahead** to include instances where the match is followed by the substring **_Blue**. We need to include the underscore character in the lookahead; otherwise, we will get zero matches:

In [141]:
run_test_cases(r"Green(?=_Blue)")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH


Notice how the matches themselves are purely the text **Green** and don't include the lookahead. Let's look at a **negative lookahead** to include instances where the match is not followed by the substring **_Red**:

In [142]:
run_test_cases(r"Green(?!_Red)")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
<_sre.SRE_Match object; span=(0, 5), match='Green'>


Next we'll use a **positive lookbehind** to include instances where the match is preceded by the substring **Red_**:

In [143]:
run_test_cases(r"(?<=Red_)Green")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH


And finally, using a **negative lookbehind** to include instances where the match isn't preceded by the substring **Yellow_**:

In [144]:
run_test_cases(r"(?<!Yellow_)Green")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
<_sre.SRE_Match object; span=(0, 5), match='Green'>


The contents of a lookaround can include any other regular expression component. For instance, here is an example where we match only cases that are followed by at least five characters:

In [145]:
run_test_cases(r"Green(?=.{5})")

<_sre.SRE_Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<_sre.SRE_Match object; span=(7, 12), match='Green'>
NO MATCH


The second and third test cases are followed by four characters, not five, and the last test case isn't followed by anything.

Not every regex engine supports all four lookarounds (notably, lookbehinds were only added to the JavaScript specification in ECMAScript 2018, so older JavaScript engines lack them). As an example, to get full support in the [RegExr](https://regexr.com/) tool, you'll need to set it to use the PCRE regex engine.
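Python's built-in `re` module has a restriction of its own: a lookbehind must match text of a fixed length. The sketch below (the second pattern is made up for illustration) shows that a quantifier such as `*` inside a lookbehind is rejected when the pattern is compiled:

```python
import re

# A fixed-width lookbehind compiles without problems.
re.compile(r"(?<=Red_)Green")

# A variable-width lookbehind (note the *) raises re.error in Python's re module.
try:
    re.compile(r"(?<=Red_*)Green")
    compiled = True
except re.error as exc:
    compiled = False
    print("Could not compile:", exc)
```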

In this exercise, we're going to use lookarounds to refine the regular expression we built in the last section to capture mentions of the "C" programming language. As a reminder, here is the last of the regular expressions we attempted to use with this exercise earlier, and the resultant titles that match:

In [146]:
first_10_matches(r"\b[Cc]\b[^.+]")

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

Let's now use lookarounds to exclude the matches we don't want. We want to:

  - Keep excluding matches that are followed by . or +, but still match cases where "C" falls at the end of the string.
  - Exclude matches that have the word 'Series' immediately preceding them.

This exercise is a little harder than those you've seen so far in this course — it's okay if it takes you a few attempts!


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Write a regular expression and assign it to **pattern**. The regular expression should:
  - Match instances of C or c where they are not preceded or followed by another word character.
  - From the match above:
    - Exclude instances where it is followed by a . or + character, without removing instances where the match occurs at the end of the string.
    - Exclude instances where the word 'Series' immediately precedes the match.
2. Count how many stories in **titles** match the regular expression. Assign the result to **c_mentions**.

In [148]:
# put your code here

In [149]:
pattern = r"(?<!Series\s)\b[Cc]\b(?![\+\.])"
c_mentions = titles.str.contains(pattern).sum()

In [150]:
c_mentions

102
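To see how each part of this pattern behaves, here is a small sketch that runs it against a few titles adapted from the outputs above (shortened for illustration). The set is written as `[+.]`, which is equivalent to `[\+\.]` because `+` and `.` need no escaping inside a set.

```python
import re

pattern = re.compile(r"(?<!Series\s)\b[Cc]\b(?![+.])")

samples = [
    "Fuchsia: Micro kernel written in C",           # "C" at the end: match
    "Custom Deleters for C++ Smart Pointers",       # followed by "+": no match
    "VW C.E.O. Personally Apologized",              # followed by ".": no match
    "Moz raises $10m Series C from Foundry Group",  # preceded by "Series ": no match
]

results = [bool(pattern.search(title)) for title in samples]

for title, matched in zip(samples, results):
    print(f"{matched!s:5}  {title}")
```

Because lookarounds don't consume characters, the "C" at the very end of the first title still matches, which the negative set version could not do.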

## 2.6 Backreferences: using capture groups in a regex pattern

Let's say we wanted to identify strings that had words with double letters, like the "ee" in "feed." Because we don't know ahead of time what letters might be repeated, we need a way to specify a capture group and then to repeat it. We can do this with **backreferences**.

Whenever we have one or more capture groups, we can refer to them using integers left to right as shown in this regex that matches the string **HelloGoodbye**:


<left><img width="600" src="https://drive.google.com/uc?export=view&id=1SbtB41c7S65brVySvkSlAcj-jjqbbGpU"></left>

Within a regular expression, we can use a backslash followed by that integer to refer to the group:


<left><img width="600" src="https://drive.google.com/uc?export=view&id=1BojFjw0jbok1nnN7d_kEaeSHl8xMMp-x"></left>
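
As a plain-text companion to the diagrams, here is a minimal sketch of numbered groups and backreferences. The pattern is an assumption reconstructed from the surrounding description (two groups followed by **\2** and **\1**), not copied from the figure.

```python
import re

# Assumed pattern: group 1 captures "Hello", group 2 captures "Goodbye";
# \2 and \1 then repeat whatever those groups captured, in that order.
pattern = r"(Hello)(Goodbye)\2\1"

m = re.fullmatch(pattern, "HelloGoodbyeGoodbyeHello")
print(m.group(1), m.group(2))

# Repeating the words in the other order does not satisfy \2\1.
no_match = re.fullmatch(pattern, "HelloGoodbyeHelloGoodbye")
print(no_match)
```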

The regular expression above will match the text **HelloGoodbyeGoodbyeHello**. Let's look at how we could write a regex to capture instances of the same two word characters in a row:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1y0LThLSjZVvUGyWtS3D6lGtJwUIi6L7i"></left>

Let's see this in action using Python:


In [151]:
test_cases = [
              "I'm going to read a book.",
              "Green is my favorite color.",
              "My name is Aaron.",
              "No doubles here.",
              "I have a pet eel."
             ]

for tc in test_cases:
    print(re.search(r"(\w)\1", tc))

<_sre.SRE_Match object; span=(21, 23), match='oo'>
<_sre.SRE_Match object; span=(2, 4), match='ee'>
None
None
<_sre.SRE_Match object; span=(13, 15), match='ee'>


Notice that there was no match for the word **Aaron**, despite it containing a double "a." This is because the uppercase and lowercase "a" are two different characters, so the backreference does not match.
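
If we did want **Aaron** to count, one option is the **re.I** flag, which also makes the backreference comparison case-insensitive. A short sketch:

```python
import re

text = "My name is Aaron."

# Default behavior: "A" and "a" are different characters, so \1 fails.
case_sensitive = re.search(r"(\w)\1", text)

# With re.I, the backreference is compared ignoring case as well.
case_insensitive = re.search(r"(\w)\1", text, flags=re.I)

print(case_sensitive)
print(case_insensitive.group())
```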

We can easily achieve the same thing using pandas:

In [152]:
test_cases = pd.Series(test_cases)
print(test_cases.str.contains(r"(\w)\1"))

0     True
1     True
2    False
3    False
4     True
dtype: bool




Let's use this technique to identify story titles that have repeated words.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Write a regular expression to match cases of repeated words:
  - We'll define a word as a series of one or more word characters that are preceded and followed by a boundary anchor.
  - We'll define repeated words as the same word repeated twice, separated by a single whitespace character.
2. Select only the items in **titles** that match the regular expression. Assign the result to **repeated_words**.

In [153]:
# put your code here

In [154]:
pattern = r"\b(\w+)\s\1\b"

repeated_words = titles[titles.str.contains(pattern)]
repeated_words



3102                  Silicon Valley Has a Problem Problem
3176                Wire Wire: A West African Cyber Threat
3178                         Flexbox Cheatsheet Cheatsheet
4797                            The Mindset Mindset (2015)
7276     Valentine's Day Special: Bye Bye Tinder, Flirt...
10371    Mcdonalds copying cyriak  cows cows cows in th...
11575                                    Bang Bang Control
11901          Cordless Telephones: Bye Bye Privacy (1991)
12697          Solving the the Monty-Hall-Problem in Swift
15049    Bye Bye Webrtc2SIP: WebRTC with Asterisk and A...
15839          Intellij-Rust Rust Plugin for IntelliJ IDEA
Name: title, dtype: object
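
The trailing **\b** in the pattern is easy to overlook, so here is a small sketch of what it prevents. The first title comes from the output above; the second is a hypothetical title made up for illustration.

```python
import re

with_boundary = r"\b(\w+)\s\1\b"
without_boundary = r"\b(\w+)\s\1"

real_repeat = "Solving the the Monty-Hall-Problem in Swift"
not_a_repeat = "Testing the theory"  # hypothetical title

# A genuine repeated word is found by the full pattern.
found = re.search(with_boundary, real_repeat)
print(found.group())

# "the" followed by "theory" is not a repeated word. The trailing \b
# rejects it, while the pattern without it matches "the the".
strict = re.search(with_boundary, not_a_repeat)
loose = re.search(without_boundary, not_a_repeat)
print(strict)
print(loose.group())
```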

## 2.7 Substituting regular expression matches

When we learned to work with basic string methods, we used the [str.replace()](https://docs.python.org/3/library/stdtypes.html#str.replace) method to replace simple substrings. We can achieve the same with regular expressions using the [re.sub()](https://docs.python.org/3/library/re.html#re.sub) function. The basic syntax for **re.sub()** is:

```python
re.sub(pattern, repl, string, flags=0)
```


The **repl** parameter is the text that you would like to substitute for the match. Let's look at a simple example where we replace all capital letters in a string with dashes:

In [156]:
string = "aBcDEfGHIj"

print(re.sub(r"[A-Z]", "-", string))

a-c--f---j
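
Two further features of **re.sub()** are worth knowing, although they go slightly beyond the basic syntax shown above: the replacement text can refer to capture groups with **\1**, and the optional **count** argument limits how many matches are replaced. A minimal sketch, reusing the repeated-word pattern from the previous screen:

```python
import re

title = "Solving the the Monty-Hall-Problem in Swift"

# \1 in the replacement stands for whatever group 1 captured,
# so a doubled word is replaced by a single copy of itself.
deduplicated = re.sub(r"\b(\w+)\s\1\b", r"\1", title)
print(deduplicated)

# count=2 replaces only the first two matches (0, the default, means all).
first_two = re.sub(r"[A-Z]", "-", "aBcDEfGHIj", count=2)
print(first_two)
```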


When working in pandas, we can use the [Series.str.replace()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) method, which uses nearly identical syntax:

```python
Series.str.replace(pat, repl, flags=0)
```

Earlier, we discovered that there were multiple different capitalizations for SQL in our dataset. Let's look at how we could make these uniform with the **Series.str.replace()** method and a regular expression:

In [157]:
sql_variations = pd.Series(["SQL", "Sql", "sql"])

# regex=True makes sure the pattern is treated as a regular expression
sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)
print(sql_uniform)

0    SQL
1    SQL
2    SQL
dtype: object
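
One caveat that depends on the installed pandas version: in recent releases (2.0 and later, to the best of our knowledge) the **regex** parameter of **Series.str.replace()** defaults to **False**, in which case the pattern is treated as literal text and flags cannot be used. Passing **regex=True** explicitly is a safe habit. A minimal sketch:

```python
import re
import pandas as pd

sql_variations = pd.Series(["SQL", "Sql", "sql"])

# regex=True states explicitly that "sql" is a regular expression,
# which is what allows flags=re.I to be applied.
sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)
print(sql_uniform.tolist())
```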


Let's use the same technique to make all the different variations of "email" in the dataset uniform.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


We have provided **email_variations**, a pandas Series containing all the variations of "email" in the dataset.

1. Use a regular expression to replace each of the matches in **email_variations** with **"email"** and assign the result to **email_uniform**.
  - You may need to iterate several times when writing your regular expression in order to match every item.
2. Use the same syntax to replace all mentions of email in **titles** with **"email"**. Assign the result to **titles_clean**.


In [158]:
email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

# put your code here

In [160]:
email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])
pattern = r"\be[-\s]?mail"
# regex=True makes sure the pattern is treated as a regular expression
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)
titles_clean = titles.str.replace(pattern, "email", flags=re.I, regex=True)

In [161]:
email_uniform

0    email
1    email
2    email
3    email
4    email
5    email
6    email
7    email
8    email
dtype: object

In [162]:
titles_clean

0                                Interactive Dynamic Video
1        Florida DJs May Face Felony for April Fools' W...
2             Technology ventures: From Idea to Enterprise
3        Note by Note: The Making of Steinway L1037 (2007)
4        Title II kills investment? Comcast and other I...
                               ...                        
20094    How Purism Avoids Intels Active Management Tec...
20095            YC Application Translated and Broken Down
20096    Microkernels are slow and Elvis didn't do no d...
20097                        How Product Hunt really works
20098    RoboBrowser: Your friendly neighborhood web sc...
Name: title, Length: 20099, dtype: object
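
Because the truncated output above doesn't show any changed titles, here is a small sketch of the same pattern applied with **re.sub()** to a few hypothetical titles (invented for illustration, not from the dataset).

```python
import re

pattern = r"\be[-\s]?mail"

# Hypothetical titles, made up for this sketch.
samples = [
    "Ask HN: Best E-Mail client?",
    "Show HN: Send emails from Gmail",
    "Why e mail is not dead",
]

cleaned = [re.sub(pattern, "email", s, flags=re.I) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The leading **\b** means only words that start with "e" are candidates, so "Gmail" is left alone, while "emails" keeps its trailing "s" because the pattern doesn't consume it.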

## 2.8 Extracting domains from URLs

Over the final three subsections in section 2, we'll extract components of URLs from our dataset. As a reminder, most stories on Hacker News contain a link to an external resource.

The task we will be performing first is extracting the different components of the URLs in order to analyze them. On this screen, we'll start by extracting just the domains. Below is a list of some of the URLs in the dataset, with the domains highlighted in color, so you can see the part of the string we want to capture.

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1E2ZscHaRwn1mzFbw1ijyAeF6DTEko0DV"></left>

The domain of each URL excludes the protocol (e.g. **https://**) and the page path (e.g. **\/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429**).

There are several ways that you could use regular expressions to extract the domain, but we suggest the following technique:

  - Using a series of characters that will match the protocol.
  - Inside a capture group, using a set that will match the character classes used in the domain.
  - Because all of the URLs either end with the domain, or continue with page path which starts with / (a character not found in any domains), we don't need to cater for this part of the URL in our regular expression.

Once you have extracted the domains, you will be building a frequency table so we can determine the most popular domains. There are over 7,000 unique domains in our dataset, so to make the frequency table easier to analyze, we'll look at only the top 5 domains.

We have provided some of the URLs from the dataset which will help you to iterate while you build your regular expression.



**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Write a regular expression to extract the domains from **test_urls** and assign the result to **test_urls_clean**. We suggest the following technique:
  - Using a series of characters that will match the protocol.
  - Inside a capture group, using a set that will match the character classes used in the domain.
  - Because all of the URLs either end with the domain, or continue with page path which starts with / (a character not found in any domains), we don't need to cater for this part of the URL in our regular expression.
2. Use the same regular expression to extract the domains from the **url** column of the **hn** dataframe. Assign the result to **domains**.
3. Use **Series.value_counts()** to build a frequency table of the domains in **domains**, limiting the frequency table to just the top 5. Assign the result to **top_domains**.



In [163]:
# put your code here

In [164]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly'
])

In [165]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'
])
pattern = r"https?://([\w\-\.]+)"

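# expand=False returns a Series rather than a one-column dataframe,
# so that Series.value_counts() can be used below
test_urls_clean = test_urls.str.extract(pattern, flags=re.I, expand=False)
domains = hn['url'].str.extract(pattern, flags=re.I, expand=False)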
top_domains = domains.value_counts().head(5)

In [166]:
top_domains

github.com             1008
medium.com              825
www.nytimes.com         525
www.theguardian.com     248
techcrunch.com          245
dtype: int64
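
To make it clearer why the pattern doesn't need to describe the page path, here is a sketch using only the standard library. The first three URLs come from **test_urls**; the fourth is a repeat added for illustration so the frequency count has something to show.

```python
import re
from collections import Counter

pattern = r"https?://([\w\-\.]+)"

urls = [
    "HTTPS://github.com/keppel/pinn",
    "https://www.valid.ly?param",
    "http://css-cursor.techstream.org",
    "https://github.com/keppel/pinn",  # repeat added for this sketch
]

# The set [\w\-\.] contains neither "/" nor "?", so the capture group
# stops at the first character that cannot belong to a domain.
domains = [re.search(pattern, u, flags=re.I).group(1) for u in urls]
print(domains)

# A tiny frequency table, analogous to value_counts().head(1)
top = Counter(domains).most_common(1)
print(top)
```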

## 2.9 Extracting URL parts using multiple capture groups

Having extracted just the domains from the URLs, in this section we'll extract each of the three component parts of the URLs:

- Protocol
- Domain
- Page path

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1Jn-2ZcplDfKvRQOQBgnQ9pMuHz8syPYw"></left>

In order to do this, we'll create a regular expression with multiple capture groups. Multiple capture groups in regular expressions are defined the same way as single capture groups — using pairs of parentheses.

Let's look at how this works using the first few values from the **created_at** column in our dataset:


In [167]:
created_at = hn['created_at'].head()
print(created_at)

0     8/4/2016 11:52
1    6/23/2016 22:20
2     6/17/2016 0:01
3     9/30/2015 4:12
4    10/31/2015 9:48
Name: created_at, dtype: object


We'll use capture groups to extract these dates and times into two columns:

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1-xaq69_P0idLB_yKbAFhiT8U6nc_NDSW"></left>

In order to do this we can write the following regular expression:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=16y0_UCwd9vVRk3CGIFo9cfF0SFv9GWDd"></left>

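Notice how we put a space character between the capture groups, which matches the space character in the original strings. In the code below, the same gap is written as **\s**, which matches any single whitespace character.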

Let's look at the result of using this regex pattern with **Series.str.extract()**:



In [168]:
pattern = r"(.+)\s(.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

            0      1
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


The result is a dataframe with each of our capture groups defining a column of data.
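
The same idea can be seen on a single string with the **re** module, where each capture group is available by its number. A minimal sketch using the first value of **created_at**:

```python
import re

pattern = r"(.+)\s(.+)"

m = re.search(pattern, "8/4/2016 11:52")

# groups() returns every capture group as a tuple, in left-to-right order.
parts = m.groups()
print(parts)

# Individual groups are numbered from 1.
print(m.group(1))
print(m.group(2))
```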

Now let's write a regular expression that will extract the URL components into individual columns of a dataframe.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Write a regular expression that extracts URL components using three capture groups:
  - The first capture group should include the protocol text, up to but not including ://.
  - The second group should contain the domain, from after :// up to but not including /.
  - The third group should contain the page path, from after / to the end of the string.
2. Use the regular expression pattern to extract the URL components from the **test_urls** series. Assign the results to **test_url_parts**.
3. Use the regular expression pattern to extract the URL components from the **url** column of the **hn** dataframe. Assign the results to **url_parts**.

In [None]:
# put your code here

In [169]:
pattern = r"(https?)://([\w\.\-]+)/?(.*)"

test_url_parts = test_urls.str.extract(pattern, flags=re.I)
url_parts = hn['url'].str.extract(pattern, flags=re.I)

In [170]:
url_parts

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | http | www.interactivedynamicvideo.com | |
| 1 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 2 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 3 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 4 | http | arstechnica.com | business/2015/10/comcast-and-other-isps-boost-... |
| ... | ... | ... | ... |
| 20094 | https | puri.sm | philosophy/how-purism-avoids-intels-active-man... |
| 20095 | https | medium.com | @zreitano/the-yc-application-broken-down-and-t... |
| 20096 | http | blog.darknedgy.net | technology/2016/01/01/0/ |
| 20097 | https | medium.com | @benjiwheeler/how-product-hunt-really-works-d8... |
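
To see how the three groups behave on individual URLs, including ones with no page path, here is a sketch that applies the solution pattern with the **re** module to three entries from **test_urls**.

```python
import re

pattern = r"(https?)://([\w\.\-]+)/?(.*)"

urls = [
    "HTTPS://github.com/keppel/pinn",
    "https://iot.seeed.cc",
    "http://www.interactivedynamicvideo.com/",
]

# flags=re.I lets "https?" match "HTTPS"; the captured text keeps
# its original capitalization.
parts = [re.search(pattern, u, flags=re.I).groups() for u in urls]
for p in parts:
    print(p)
```

When there is nothing after the domain (or only a trailing **/**), the third group still participates in the match and captures an empty string.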


## 2.10 Using named capture groups to extract data

In the previous exercise, we created a regular expression which extracted the components from the story URLs into a dataframe with three columns.

Our final task will be to name these columns, which we'll do using named capture groups. Let's look at the example from the previous screen where we used two capture groups to extract the date and time as two separate columns:

In [171]:
created_at = hn['created_at'].head()

pattern = r"(.+) (.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

            0      1
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


In order to name a capture group we use the syntax **?P\<name\>**, where **name** is the name of our capture group. This syntax goes after the opening parenthesis, but before the regex syntax that defines the capture group:

<left><img width="600" src="https://drive.google.com/uc?export=view&id=1FqFXJhhM0mqdsBdQxu7IyHaVtUqfxlKQ"></left>

Let's look at the result of this syntax using pandas:



In [172]:
pattern = r"(?P<date>.+) (?P<time>.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

         date   time
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


Each column has a name corresponding to the name of the capture group it represents.
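
Named groups work the same way outside pandas. As a minimal sketch with the **re** module, a match object exposes them by name and as a dictionary:

```python
import re

pattern = r"(?P<date>.+) (?P<time>.+)"

m = re.search(pattern, "6/23/2016 22:20")

# A named group can be looked up by its name...
date_part = m.group("date")
print(date_part)

# ...and groupdict() returns all named groups at once.
named = m.groupdict()
print(named)
```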

Let's finish this mission by adding names to our capture groups from the previous screen to create a dataframe with named columns.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


We have provided the regex pattern from the previous screen's solution.

1. Uncomment the regular expression pattern. Add names to each capture group:
  - The first capture group should be called **protocol**.
  - The second capture group should be called **domain**.
  - The third capture group should be called **path**.
2. Use the regular expression pattern to extract three named columns of **url** components from the **url** column of the **hn** dataframe. Assign the result to **url_parts**.

In [173]:
# put your code here
# pattern = r"(https?)://([\w\.\-]+)/?(.*)"

In [174]:
# pattern = r"(https?)://([\w\.\-]+)/?(.*)"
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
url_parts = hn['url'].str.extract(pattern, flags=re.I)
url_parts

| | protocol | domain | path |
|---|---|---|---|
| 0 | http | www.interactivedynamicvideo.com | |
| 1 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 2 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 3 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 4 | http | arstechnica.com | business/2015/10/comcast-and-other-isps-boost-... |
| ... | ... | ... | ... |
| 20094 | https | puri.sm | philosophy/how-purism-avoids-intels-active-man... |
| 20095 | https | medium.com | @zreitano/the-yc-application-broken-down-and-t... |
| 20096 | http | blog.darknedgy.net | technology/2016/01/01/0/ |
| 20097 | https | medium.com | @benjiwheeler/how-product-hunt-really-works-d8... |
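
The benefit of named columns is that later steps can refer to them by name. Here is a small self-contained sketch on three URLs from **test_urls**; the variable names **sample_urls** and **sample_parts** are made up for this example.

```python
import re
import pandas as pd

pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"

# A few URLs from test_urls, used here instead of the full hn dataframe.
sample_urls = pd.Series([
    "https://iot.seeed.cc",
    "HTTPS://github.com/keppel/pinn",
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0",
])
sample_parts = sample_urls.str.extract(pattern, flags=re.I)

# Column names come from the capture group names.
print(list(sample_parts.columns))
print(sample_parts["domain"].tolist())

# Columns can be used like any other, e.g. counting protocols after
# normalizing their capitalization.
protocol_counts = sample_parts["protocol"].str.lower().value_counts().to_dict()
print(protocol_counts)
```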


## 2.11 Next steps

In section 2, we learned advanced regular expression techniques to help us work with text data, including:

- How to use multiple capture groups to extract URL data.
- How to use lookarounds to customize matches based on the surrounding text.
- How to substitute a regular expression match to clean inconsistent data.
- How to use named capture groups to extract dataframes from a text column.

These techniques allow us to clean and analyze text data in an extremely powerful way, and will be one of the most useful tools in your data-cleaning "toolbelt" as you continue on your learning journey.

As we mentioned at the outset, unless you find yourself analyzing and cleaning text data with regular expressions regularly, it's unlikely that you'll remember every detail of regex syntax. What matters is to understand the core concepts and what is possible, and to know where and how to look up the rest.

With that in mind, don't be bothered if you don't feel like a regex guru right now - that's totally normal, and you'll start to feel better as you use this new data cleaning tool more and more over time.