# Advanced Regular Expressions

In [1]:
import re
import pandas as pd
import numpy as np
hn = pd.read_csv('data/hacker_news.csv')
titles = hn['title']

When we learned to work with basic string methods, we used the str.replace() method to replace simple substrings. We can achieve the same with regular expressions using the re.sub() function. The basic syntax for re.sub() is:

re.sub(pattern, repl, string, count=0, flags=0)
The repl parameter is the text that you would like to substitute for the match. Let's look at a simple example where we replace all capital letters in a string with dashes:
<pre>
string = "aBcDEfGHIj"

print(re.sub(r"[A-Z]", "-", string))
a-c--f---j
</pre>
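
One detail worth a quick sketch: in re.sub() the fourth positional parameter is count, not flags, so flags are safest passed by keyword. The sample string below is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample text, not from the Hacker News dataset
text = "Sql, SQL and sql"

# Passing flags by keyword avoids it being interpreted as count,
# which is the fourth positional parameter of re.sub()
uniform = re.sub(r"sql", "SQL", text, flags=re.I)
print(uniform)  # SQL, SQL and SQL
```
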
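When working in pandas, we can use the Series.str.replace() method, which uses nearly identical syntax. Since pandas 2.0 the pattern is treated as a literal string by default, so pass regex=True to have it interpreted as a regular expression:

Series.str.replace(pat, repl, flags=0, regex=True)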
Earlier, we discovered that there were multiple different capitalizations for SQL in our dataset. Let's look at how we could make these uniform with the Series.str.replace() method and a regular expression:
<pre>
sql_variations = pd.Series(["SQL", "Sql", "sql"])

sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)
print(sql_uniform)
0    SQL
1    SQL
2    SQL
dtype: object
</pre>
We have provided email_variations, a pandas Series containing all the variations of "email" in the dataset.

1. Use a regular expression to replace each of the matches in email_variations with "email" and assign the result to email_uniform.
    - You may need to iterate several times when writing your regular expression in order to match every item.
2. Use the same syntax to replace all mentions of email in titles with "email". Assign the result to titles_clean.

In [2]:
email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])
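# Match "e", at most one arbitrary character (such as a space or hyphen), then "mail"
pattern = r'e.{0,1}mail'
# regex=True is required in pandas 2.0+, where str.replace() defaults to literal matching
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)
titles_clean = titles.str.replace(pattern, 'email', flags=re.I, regex=True)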
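
As a quick sanity check, the sketch below uses only the standard library re module (so it runs without the dataset) to confirm that a pattern normalizes every variation. The stricter pattern and the "Hypermail" sample string are assumptions added for illustration. They show one possible trade-off: `.{0,1}` accepts any character between "e" and "mail", so it can also rewrite unrelated words.

```python
import re

email_variations = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
                    'e-mail', 'eMail', 'E-Mail', 'EMAIL']

loose_pattern = r"e.{0,1}mail"      # pattern used in the cell above
strict_pattern = r"e[\-\s]?mail"    # assumed alternative: only a hyphen or whitespace may separate

# Both patterns normalize every provided variation
loose_clean = [re.sub(loose_pattern, "email", v, flags=re.I) for v in email_variations]
strict_clean = [re.sub(strict_pattern, "email", v, flags=re.I) for v in email_variations]

# Hypothetical title fragment showing where the two patterns differ
sample = "Hypermail archive"
loose_sample = re.sub(loose_pattern, "email", sample, flags=re.I)
strict_sample = re.sub(strict_pattern, "email", sample, flags=re.I)

print(loose_sample)   # Hypemail archive
print(strict_sample)  # Hypermail archive
```
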

Over the final three screens in this mission, we'll extract components of URLs from our dataset. As a reminder, most stories on Hacker News contain a link to an external resource.

The task we will be performing first is extracting the different components of the URLs in order to analyze them. On this screen, we'll start by extracting just the domains. Below is a list of some of the URLs in the dataset, with the domains highlighted in color, so you can see the part of the string we want to capture.

<img src='images/url_examples_1.svg' />

The domain of each URL excludes the protocol (e.g. https://) and the page path (e.g. /Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429).

There are several ways that you could use regular expressions to extract the domain, but we suggest the following technique:

1. Using a series of characters that will match the protocol.
2. Inside a capture group, using a set that will match the character classes used in the domain.
3. Because all of the URLs either end with the domain, or continue with a page path that starts with / (a character not found in any domain), we don't need to cater for this part of the URL in our regular expression.

Once you have extracted the domains, you will build a frequency table to determine the most popular domains. There are over 7,000 unique domains in our dataset, so to make the frequency table easier to analyze, we'll look at only the top 20 domains.


1. Write a regular expression to extract the domains from test_urls and assign the result to test_urls_clean, following the technique described above.
2. Use the same regular expression to extract the domains from the url column of the hn dataframe. Assign the result to domains.
3. Use Series.value_counts() to build a frequency table of the domains in domains, limiting the frequency table to just the top 20. Assign the result to top_domains.

In [3]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param'
])


In [9]:
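# Skip the protocol, then capture the run of word characters and dots that forms the domain
pattern = r'https?://([\w\.]+)'
test_urls_clean = test_urls.str.extract(pattern, flags=re.I)
domains = hn['url'].str.extract(pattern, flags=re.I)
# str.extract() returns a dataframe, so select its only column before counting
top_domains = domains.iloc[:, 0].value_counts().head(20)
top_domains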

github.com                1008
medium.com                 825
www.nytimes.com            525
www.theguardian.com        248
techcrunch.com             245
www.youtube.com            213
www.bloomberg.com          193
arstechnica.com            191
www.washingtonpost.com     190
www.wsj.com                138
www.theatlantic.com        137
www.bbc.com                134
www.wired.com              114
www.theverge.com           112
www.bbc.co.uk              108
en.wikipedia.org           100
twitter.com                 93
qz.com                      85
motherboard.vice.com        82
www.newyorker.com           81
Name: 0, dtype: int64
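
One limitation of the set [\w\.] is that it does not include the hyphen, so a domain containing a hyphen would be cut off at that character. The sketch below illustrates this with the standard library re module. The URL and the with_hyphen pattern are hypothetical additions, and whether the dataset contains such domains is not checked here.

```python
import re

original = r"https?://([\w\.]+)"       # same character set as the cell above
with_hyphen = r"https?://([\w\-\.]+)"  # assumed variant that also allows hyphens

# Hypothetical URL, not taken from the dataset
url = "https://my-example-site.com/page"

truncated = re.search(original, url, flags=re.I).group(1)
full = re.search(with_hyphen, url, flags=re.I).group(1)

print(truncated)  # my
print(full)       # my-example-site.com
```
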

Having extracted just the domains from the URLs, on this final screen we'll extract each of the three component parts of the URLs:

- Protocol
- Domain
- Page path

<img src='images/url_examples_2.svg' />

In order to do this, we'll create a regular expression with multiple capture groups. Multiple capture groups in regular expressions are defined the same way as single capture groups: using pairs of parentheses.

Let's look at how this works using the first few values from the created_at column in our dataset:
<pre>
created_at = hn['created_at'].head()
print(created_at)
0     8/4/2016 11:52
1    1/26/2016 19:30
2    6/23/2016 22:20
3     6/17/2016 0:01
4     9/30/2015 4:12
Name: created_at, dtype: object
</pre>

We'll use capture groups to extract these dates and times into two columns:
<pre>
8/4/2016	11:52
1/26/2016	19:30
6/23/2016	22:20
6/17/2016	0:01
9/30/2015	4:12
</pre>
<img src='images/multiple_capture_groups.svg' />

In order to do this, we can write the following regular expression:

(.+)\s(.+)

Notice how we put a whitespace character (\s) between the capture groups, which matches the space character in the original strings.

Let's look at the result of using this regex pattern with Series.str.extract():
<pre>
pattern = r"(.+)\s(.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)
           0      1
0   8/4/2016  11:52
1  1/26/2016  19:30
2  6/23/2016  22:20
3  6/17/2016   0:01
4  9/30/2015   4:12
</pre>

The result is a dataframe with each of our capture groups defining a column of data.
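
To see what each capture group holds for a single string, here is a minimal sketch using the standard library re module rather than pandas. It reuses two of the created_at values shown above. Each match yields one tuple, with one element per capture group, which corresponds to one row of the dataframe.

```python
import re

# Two of the created_at values printed above
created_at = ["8/4/2016 11:52", "1/26/2016 19:30"]

pattern = r"(.+)\s(.+)"

# groups() returns a tuple with one element per capture group
rows = [re.search(pattern, value).groups() for value in created_at]
print(rows)  # [('8/4/2016', '11:52'), ('1/26/2016', '19:30')]
```
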

Now let's write a regular expression that will extract the URL components into individual columns of a dataframe.

In [None]:
# `test_urls` is available from the previous screen
pattern = r"(.+)://([\w\.]+)/?(.*)"
test_url_parts = test_urls.str.extract(pattern, flags=re.I)
url_parts = hn['url'].str.extract(pattern, flags=re.I)
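
The sketch below applies the same three-group pattern to a few of the test URLs with the standard library re module, which makes it easy to inspect edge cases one string at a time. Note how a URL with no path yields an empty third group, and how a query string that follows the domain directly ends up in the path group.

```python
import re

pattern = r"(.+)://([\w\.]+)/?(.*)"

# Three of the test URLs from the earlier cell
urls = [
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0",
    "https://iot.seeed.cc",
    "https://www.valid.ly?param",
]

# One (protocol, domain, path) tuple per URL
parts = [re.search(pattern, url, flags=re.I).groups() for url in urls]
for row in parts:
    print(row)
# ('http', 'www.nytimes.com', '2007/11/07/movies/07stein.html?_r=0')
# ('https', 'iot.seeed.cc', '')
# ('https', 'www.valid.ly', '?param')
```
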


We have provided the regex pattern from the previous screen's solution.

1. Uncomment the regular expression pattern. Add names to each capture group:
    - The first capture group should be called protocol.
    - The second capture group should be called domain.
    - The third capture group should be called path.
2. Use the regular expression pattern to extract three named columns of URL components from the url column of the hn dataframe. Assign the result to url_parts.
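
As a rough sketch of how named capture groups behave, the example below adds (?P<name>...) names to the pattern from the previous screen and inspects a single match with the standard library re module. With pandas, passing the same pattern to Series.str.extract() uses the group names as column labels. The single URL here is one of the test URLs.

```python
import re

# Same pattern as before, with a name attached to each capture group
pattern = r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)"

match = re.search(pattern, "HTTPS://github.com/keppel/pinn", flags=re.I)

# groupdict() maps each group name to the text it captured
parts = match.groupdict()
print(parts)  # {'protocol': 'HTTPS', 'domain': 'github.com', 'path': 'keppel/pinn'}
```
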
