## Introduction 

We'll be learning regular expressions while performing analysis on a dataset of submissions to popular technology site Hacker News.
<br>
<br>
The dataset we will be working with is based off this CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:
- id: The unique identifier from Hacker News for the story
- title: The title of the story
- url: The URL that the story links to, if the story has a URL
- num_points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the story
- author: The username of the person who submitted the story
- created_at: The date and time at which the story was submitted

First, we import the data:

In [1]:
import pandas as pd 

hn = pd.read_csv('hacker_news.csv')
hn.head(10)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48
5,10557283,Nuts and Bolts Business Advice,,3,4,shomberj,11/13/2015 0:45
6,12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55
7,11337617,"Shims, Jigs and Other Woodworking Concepts to ...",http://firstround.com/review/shims-jigs-and-ot...,34,7,zt,3/22/2016 16:18
8,10379326,That self-appendectomy,http://www.southpolestation.com/trivia/igy1/ap...,91,10,jimsojim,10/13/2015 9:30
9,11370829,Crate raises $4M seed round for its next-gen S...,http://techcrunch.com/2016/03/15/crate-raises-...,3,1,hitekker,3/27/2016 18:08


## The Regular Expression Module

When working with regular expressions, we use the term pattern to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.
<br>
<br>
We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: the re module. This module contains a number of different functions and classes for working with regular expressions.
<br>
<br>
One of the most useful functions from the re module is the re.search() function, which takes two required arguments:
- The regex pattern
- The string we want to search for that pattern

In [3]:
import re

m = re.search("and", "hand")
print(m)

<re.Match object; span=(1, 4), match='and'>


The re.search() function returns a Match object if the pattern is found anywhere within the string. If the pattern is not found, re.search() returns None:

In [4]:
m = re.search("and", "antidote")
print(m)

None


For now, we can use the fact that the boolean value of a match object is True while None is False to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

In [5]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")

Match
No Match
No Match


The first piece of regex syntax we'll learn is called a set. A set allows us to specify two or more characters that can match in a single character's position.
<br>
<br>
Example: Python/python

In [2]:
import re

titles = hn["title"].tolist()

pattern = "[Pp]ython"
python_mentions = 0 

for ele in titles:
    if re.search(pattern, ele):
        python_mentions += 1 

print(python_mentions)        

160
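As a minimal sketch of how a set behaves, the snippet below applies the same pattern to a few titles. The titles in `sample_titles` are hypothetical and made up for illustration; they are not taken from the dataset.

```python
import re

# Hypothetical titles, invented for this illustration
sample_titles = [
    "Why Python is slow",
    "Learning python the hard way",
    "PYTHON in all caps",
    "A story about snakes",
]

pattern = r"[Pp]ython"

# The set [Pp] matches exactly one character: either "P" or "p".
# The remaining letters "ython" must still match exactly as written.
matches = [t for t in sample_titles if re.search(pattern, t)]
print(len(matches))  # 2
```

The all-caps title is not counted because only the first letter is covered by the set.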


## Counting Matches with pandas Methods 

An alternative that avoids explicit loops is to use pandas' vectorized string methods. We learned that the **Series.str.contains() method** in pandas can be used to test whether each string in a Series matches a particular regex pattern.
<br>
<br>
We'll start by creating a pandas object containing our strings:

In [7]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object


Next, we'll create our regex pattern, and use Series.str.contains() to compare to each value in our series:

In [8]:
pattern = "[Bb]lue"

pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0    False
1     True
2     True
dtype: bool


One of the neat things about boolean masks is that you can use the Series.sum() method to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern:

In [9]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


We can apply the same approach to the Hacker News dataset:

In [10]:
pattern = '[Pp]ython'

titles = hn['title']
python_mentions = titles.str.contains(pattern).sum()
print(python_mentions)

160


On the previous two screens, we used regular expressions to count how many titles contain Python or python. What if we wanted to view those titles?
<br>
<br>
In that case, we can use the boolean array returned by Series.str.contains() to select just those rows from our series.

In [11]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
py_titles = titles[py_titles_bool]
print(py_titles.head())

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object


We can apply the same approach to another example: selecting all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

In [3]:
titles = hn['title']

ruby_titles = titles[titles.str.contains("[Rr]uby")]
print(ruby_titles.head())

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2022    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object


## Quantifiers 

We can use braces ({}) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from 1000 to 2999, we could write the regular expression below:

In [15]:
pattern = r"[1-2][0-9]{3}"

This type of regular expression syntax is called a quantifier. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths.
<br>
<br>
As an example, we might want to match both e-mail and email. To do this, we need to specify that the - character may appear either zero or one time.
<br>
<br>
To find how many titles in our dataset mention email or e-mail, we'll use ?, the optional quantifier, to specify that the dash character - is optional in our regular expression.

In [4]:
email_bool = titles.str.contains("e-?mail")
email_count = email_bool.sum()
email_titles = titles[email_bool]
print(email_titles.head())

119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object
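The two quantifiers above can be tried on small inputs with the re module. This is a sketch under stated assumptions: the string `text` and the list `email_forms` are hypothetical, and the `\b` word boundary anchors (covered later in this document) are added only to stop the pattern from matching inside longer numbers.

```python
import re

# Hypothetical sentence used only for this illustration
text = "Founded in 1999, relaunched in 2016, with 30000 users and ticket 999."

# [1-2] matches the first digit; [0-9]{3} requires exactly three more digits
years = re.findall(r"\b[1-2][0-9]{3}\b", text)
print(years)  # ['1999', '2016']

# ? makes the preceding character optional (zero or one occurrence)
email_forms = ["email", "e-mail", "e--mail", "gmail"]
optional_dash = [s for s in email_forms if re.search(r"e-?mail", s)]
print(optional_dash)  # ['email', 'e-mail']
```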


## Character Classes/Accessing Matching Text with Capture Groups 

Some stories submitted to Hacker News include a topic tag in brackets, like [pdf]. Here are a few examples of story titles with these tags:
- [video] Google Self-Driving SUV Sideswipes Bus
- New Directions in Cryptography by Diffie and Hellman (1976) [pdf]
- Wallace and Gromit  The Great Train Chase (1993) [video]
<br>
<br>
Our first inclination may be to create the regex [pdf]. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters p, d, or f.
<br>
<br>
To match the substring "[pdf]", we can use backslashes to escape both the open and closing brackets:

In [18]:
pattern = r"\[pdf\]"

There are also some common character classes that we'll use a lot:
- \d -- Any digit character
- \w -- Any digit, uppercase letter, lowercase letter, or underscore character
- \s -- Any space, tab, or linebreak character
- . -- any character except newline
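The classes in the list above can be explored with `re.findall()`, which returns every non-overlapping match. The string `sample` below is a hypothetical example chosen to contain digits, an underscore, a tab, and a newline.

```python
import re

# Hypothetical string containing a space, a tab (\t) and a newline (\n)
sample = "Room 42,\tfloor_3\nEnd."

digits = re.findall(r"\d", sample)      # every single digit
words = re.findall(r"\w+", sample)      # runs of word characters
spaces = re.findall(r"\s", sample)      # every whitespace character
any_chars = re.findall(r".", sample)    # everything except the newline

print(digits)          # ['4', '2', '3']
print(words)           # ['Room', '42', 'floor_3', 'End']
print(spaces)          # [' ', '\t', '\n']
print(len(any_chars))  # 20
```

The string has 21 characters, so the dot matching 20 of them confirms that only the newline is skipped.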

In order to match word characters between our brackets, we can combine the word character class (\w) with the 'one or more' quantifier (+), giving us a combined pattern of \w+.
<br>
<br>
This will match sequences like pdf, video, Python, and 2018 but won't match a sequence containing a space or punctuation character like PHP-DEV or XKCD Flowchart.

We'll use these concepts to count the number of titles that contain a tag. Our pattern needs to match:
- A single open bracket character.
- One or more word characters.
- A single close bracket character.
<br>
<br>
We'll then select the matching titles and count them:

In [14]:
pattern = "\[\w+\]"
tag_titles = titles[titles.str.contains(pattern)]
tag_count = tag_titles.shape[0]
tag_count

444

A backslash followed by certain characters represents an escape sequence — like the \n sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring \b:

In [15]:
print('hello\b world')

hello world


The escape sequence \b represents a backspace, so the final letter from our string is removed. The character sequence \b has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.

One way is to add an extra backslash before the "b":

In [16]:
print('hello\\b world')

hello\b world


Another way is to use a raw string, which we create by prefixing the string with the letter r:

In [17]:
print(r'hello\b world')

hello\b world
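The difference matters as soon as the string is used as a regex pattern. The sketch below compares a plain string with a raw string; the variable names `plain`, `raw`, and `text` are chosen for illustration only.

```python
import re

plain = "\bcat\b"   # each "\b" is converted to a single backspace character
raw = r"\bcat\b"    # the backslash and the "b" are kept as two characters

print(len(plain), len(raw))  # 5 7

text = "a cat sat"
plain_result = re.search(plain, text)  # looks for literal backspace characters
raw_result = re.search(raw, text)      # \b reaches the regex engine intact

print(plain_result)        # None
print(raw_result.span())   # (2, 5)
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python would otherwise convert.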


On the previous screen, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?
<br>
<br>
In order to do this, we'll need to use capture groups. Capture groups allow us to specify one or more groups within our match that we can access separately. In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.
<br>
<br>
We use the Series.str.extract() method to extract the match within our parentheses:

In [18]:
tag_5 = tag_titles.head()
pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

            0
66      [pdf]
100  [German]
159     [pdf]
162     [pdf]
195    [Beta]


We can move our parentheses inside the brackets to get just the text:

In [19]:
pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

          0
66      pdf
100  German
159     pdf
162     pdf
195    Beta


Let's use this technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

In [5]:
# pattern = r"\[\w+\]"
pattern = r"\[(\w+)\]"
tag_freq = titles.str.extract(pattern)
print(tag_freq.head())

     0
0  NaN
1  NaN
2  NaN
3  NaN
4  NaN
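The extraction above returns NaN for titles without a tag, so it still needs a counting step to become a frequency table. Here is a minimal sketch of that step using only the standard library. The first three titles are the tagged examples listed earlier in this section; the last two are hypothetical.

```python
import re
from collections import Counter

sample_titles = [
    "[video] Google Self-Driving SUV Sideswipes Bus",
    "New Directions in Cryptography by Diffie and Hellman (1976) [pdf]",
    "Wallace and Gromit  The Great Train Chase (1993) [video]",
    "A paper about compilers [pdf]",   # hypothetical
    "A title without any tag",          # hypothetical
]

pattern = r"\[(\w+)\]"

tags = []
for title in sample_titles:
    match = re.search(pattern, title)
    if match:
        # group(1) is the text captured by the parentheses, without the brackets
        tags.append(match.group(1))

tag_freq = Counter(tags)
print(dict(tag_freq))  # {'video': 2, 'pdf': 2}
```

In pandas, one likely equivalent is to pass `expand=False` to `Series.str.extract()` so that it returns a Series, and then call `Series.value_counts()` on the result.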


## Negative Character Classes 

When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.
<br>
<br>
In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings

Earlier, we counted the titles that included Python — let's write a simple regular expression to match Java (another popular language), and use our function to look at the matches:

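In [21]:
def first_10_matches(pattern):
    """Return the first 10 story titles that match the provided regular expression."""
    all_matches = titles[titles.str.contains(pattern)]
    return all_matches.head(10)

first_10_matches(r"[Jj]ava")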

We can see that there are a number of matches that contain Java as part of the word JavaScript. We want to exclude these titles from matching so we get an accurate count.
<br>
<br>
One way to do this is by using negative character classes. Negative character classes are character classes that match every character except a character class. Let's look at a table of the common negative character classes:
- Negative Set -- [^fud] -- Any character except f,u,d
- Negative Set -- [^1-3Z\s] -- Any character except 1,2,3,Z, or whitespace characters
- Negative Digit -- \D -- Any character except digit character
- Negative word -- \W -- Any character except word character
- Negative whitespace -- \S -- Any character except whitespace character

Let's use the negative set [^Ss] to exclude instances like JavaScript and Javascript:

In [6]:
def first_10_matches(pattern):
    """Return the first 10 story titles that match the provided regular expression."""
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern = r"[Jj]ava[^Ss]"
java_titles = titles[titles.str.contains(pattern)]
print(java_titles.head())

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
Name: title, dtype: object
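The negative classes listed in this section can be tried on a short string with `re.findall()`. The string `sample` is a hypothetical example.

```python
import re

# Hypothetical string used only for this illustration
sample = "Fig 12: A_b c"

non_digits = re.findall(r"\D+", sample)   # runs of anything but digits
non_word = re.findall(r"\W+", sample)     # runs of anything but word characters
non_space = re.findall(r"\S+", sample)    # runs of anything but whitespace
not_fud = re.findall(r"[^fud]", "fudge")  # single characters other than f, u, d

print(non_digits)  # ['Fig ', ': A_b c']
print(non_word)    # [' ', ': ', ' ']
print(non_space)   # ['Fig', '12:', 'A_b', 'c']
print(not_fud)     # ['g', 'e']
```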


## Word Boundaries 

A different approach to take in cases like these is to use the word boundary anchor, specified using the syntax \b. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. 

In [23]:
pattern = r"\b[Jj]ava\b"
java_titles = titles[titles.str.contains(pattern)]
print(java_titles.head())

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1023                         Pippo  Web framework in Java
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
Name: title, dtype: object
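The two approaches do not select the same titles, which the following sketch makes visible. The first title is taken from the output above; the other three are hypothetical.

```python
import re

sample_titles = [
    "Java EE and Microservices in 2016",  # from the output above
    "Modern JavaScript tooling",          # hypothetical
    "Web framework in Java",              # hypothetical
    "Adopting RxJava on Android",         # hypothetical
]

negative_set = r"[Jj]ava[^Ss]"
word_boundary = r"\b[Jj]ava\b"

neg_hits = [t for t in sample_titles if re.search(negative_set, t)]
wb_hits = [t for t in sample_titles if re.search(word_boundary, t)]

print(neg_hits)  # ['Java EE and Microservices in 2016', 'Adopting RxJava on Android']
print(wb_hits)   # ['Java EE and Microservices in 2016', 'Web framework in Java']
```

The negative set requires one more character after "Java", so it misses a title that ends with the word, and it still accepts "RxJava". The word boundary anchor consumes no characters, so it handles both cases.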


On the previous screen, we learned that the word boundary anchor matches the space between a word character and a non-word character. More generally in regular expressions, an anchor matches something that isn't a character, as opposed to character classes which match specific characters.
<br>
<br>
Other than the word boundary anchor, the other two most common anchors are the beginning anchor and the end anchor, which represent the start and the end of the string, respectively.
- Beginning -- ^abc -- Matches abc only at the start of the string 
- End -- abc$ -- Matches abc only at the end of string 

Let's start with a few test cases that all contain the substring Red at different parts of the string:

In [7]:
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])
print(test_cases)

0    Red Nose Day is a well-known fundraising event
1                          My favorite color is Red
2          My Red Car was purchased three years ago
dtype: object


## Matching at the Start and End of Strings 

If we want to match the word Red only if it occurs at the start of the string, we add the beginning anchor to the start of our regular expression:

In [25]:
test_cases.str.contains(r"^Red")

0     True
1    False
2    False
dtype: bool

If we want to match the word Red only if it occurs at the end of the string, we add the end anchor to the end of our regular expression:

In [26]:
test_cases.str.contains(r"Red$")

0    False
1     True
2    False
dtype: bool

Let's use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset:

In [8]:
pattern_beginning = r"^\[\w+\]"
beginning_count = titles.str.contains(pattern_beginning).sum()

pattern_ending =  r"\[\w+\]$"
ending_count = titles.str.contains(pattern_ending).sum()

print(beginning_count)
print(ending_count)

15
417
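To see why the two counts differ from the total number of tagged titles, the sketch below checks both anchored patterns against three titles. The first two are the tagged examples listed earlier; the third is hypothetical and has its tag in the middle.

```python
import re

sample_titles = [
    "[video] Google Self-Driving SUV Sideswipes Bus",
    "New Directions in Cryptography by Diffie and Hellman (1976) [pdf]",
    "A [draft] title with a tag in the middle",  # hypothetical
]

# ^ anchors the tag to the start of the string, $ to the end
starts_with_tag = [bool(re.search(r"^\[\w+\]", t)) for t in sample_titles]
ends_with_tag = [bool(re.search(r"\[\w+\]$", t)) for t in sample_titles]

print(starts_with_tag)  # [True, False, False]
print(ends_with_tag)    # [False, True, False]
```

A tag in the middle of a title is matched by neither anchored pattern.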


## Using Flags to Modify Regex Patterns 

Within the titles, there are many different formatting styles used to represent the word "email", e.g. email, Email, e Mail, e mail, E-mail, etc.
<br>
<br>
We can use flags to specify that our regular expression should ignore case.
<br>
<br>
Both re.search() and the pandas regular expression methods accept an optional flags argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.
<br>
<br>
A list of all available flags is in the documentation, but by far the most common and the most useful is the re.IGNORECASE flag, which is also available using the alias re.I for convenience.

In [1]:
import re
import pandas as pd

# Load the titles so that this cell runs on its own
hn = pd.read_csv('hacker_news.csv')
titles = hn['title']

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

pattern = r"\be[\-\s]?mails?\b"
email_mentions = titles.str.contains(pattern, flags=re.I).sum()
email_mentions

## Advanced Regular Expressions 

#### Ignore case  

The ignorecase flag is particularly useful when we have many different capitalizations for a word or phrase. In our dataset, the SQL language has three different capitalizations: SQL, sql, and Sql.

In [9]:
import pandas as pd
import re

hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

pattern = r"SQL"
sql_counts = titles.str.contains(pattern, flags=re.I).sum()
sql_counts

108
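The effect of the flag can be checked directly with the email variations and the email pattern from the flags section. This sketch uses the re module and plain lists; the names `with_flag` and `without_flag` are chosen for illustration.

```python
import re

email_tests = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
               'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
               'E-Mails']

pattern = r"\be[\-\s]?mails?\b"

# With re.I every variation matches; without it only the all-lowercase ones do
with_flag = [s for s in email_tests if re.search(pattern, s, flags=re.I)]
without_flag = [s for s in email_tests if re.search(pattern, s)]

print(len(with_flag))  # 12
print(without_flag)    # ['email', 'e mail', 'e-mail', 'emails']
```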

## Capture Groups 

We can extend this analysis by looking at titles that have letters immediately before the "SQL," which is a convention often used to denote different variations or flavors of SQL:

In [10]:
# Select the rows whose title has letters immediately before "SQL" (any capitalization)
hn_sql = hn[hn['title'].str.contains(r"\w+SQL", flags=re.I)].copy()

# Extract the flavor into a new column, then convert it to lowercase
# so that different capitalizations are grouped together
hn_sql['flavor'] = hn_sql['title'].str.extract(r"(\w+SQL)", flags=re.I, expand=False)
hn_sql['flavor'] = hn_sql['flavor'].str.lower()

# Create a pivot table
sql_pivot = hn_sql.pivot_table(index='flavor', values='num_comments', aggfunc='mean')
sql_pivot

flavor,num_comments
cloudsql,5.0
memsql,14.0
mysql,12.230769
nosql,14.529412
postgresql,25.962963
sparksql,1.0


## Using Capture Groups to Extract Data 

The Hacker News titles that mention Python often include a version number, e.g. Python 3.6 or Python 2.7.
<br>
<br>
So if we want to extract the version number, it can be as follows: 

In [11]:
pattern = r"[Pp]ython ([\d\.]+)"
# expand=False returns a Series, which has the value_counts() method
py_versions = titles.str.extract(pattern, expand=False)
py_versions_freq = dict(py_versions.value_counts())

So far, we've created regular expressions to clean and analyze the number of mentions of the Python, SQL, and Java languages. Next up: counting the mentions of the C language.
<br>
<br>
We can start with a simple regular expression and then iterate as we find and exclude incorrect matches. Let's start with a simple regex that matches the letter "c" with word boundary anchors on either side. 
<br>
<br>
At the same time, we want to prevent: 
- Mentions of C++, a distinct language from C
- Cases where the letter C is followed by a period, like in the substring C.E.O

In [12]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

first_ten = first_10_matches(r"\b[Cc]\b[^.+]")
print(first_ten.head())

365                 The new C standards are worth it
444      Moz raises $10m Series C from Foundry Group
521     Fuchsia: Micro kernel written in C by Google
1307       Show HN: Yupp, yet another C preprocessor
1326                The C standard formalized in Coq
Name: title, dtype: object


## Using Lookarounds to Control Matches Based on Surrounding Text

It looks like we're getting close. In our first 10 matches we have one irrelevant result, which is about "Series C," a term used to represent a particular type of startup fundraising.
<br>
<br>
Additionally, we've run into the same issue as we did in the previous mission — by using a negative set, we may have eliminated any instances where the last character of the title is "C" (the second last line of output matches in spite of the fact that it ends with "C," because it also has "C" earlier in the string).
<br>
<br>
We need a new tool called lookarounds. Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match. There are four types of lookarounds:
- Positive Lookahead --- zzz(?=abc) --- Matches zzz only when it is followed by abc 
- Negative Lookahead --- zzz(?!abc) --- Matches zzz only when it is not followed by abc 
- Positive Lookbehind --- (?<=abc)zzz --- Matches zzz only when it is preceded by abc
- Negative Lookbehind --- (?<!abc)zzz --- Matches zzz only when it is not preceded by abc 

We'll also create a function that will loop over our test cases and tell us whether our pattern matches. We'll use the re module rather than pandas since it tells us the exact text that matches, which will help us understand how the lookaround is working:

In [36]:
test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

#  Create a function that will loop over our test cases and tell us whether our pattern matches.
def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")

Let's start by using a positive lookahead to include instances where the match is followed by the substring _Blue.

In [37]:
run_test_cases(r"Green(?=_Blue)")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
NO MATCH


Next, let's use a negative lookahead to include instances where the match is not followed by the substring _Red:

In [38]:
run_test_cases(r"Green(?!_Red)")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
<re.Match object; span=(0, 5), match='Green'>


A positive lookbehind includes instances where the match is preceded by the substring Red_:

In [39]:
run_test_cases(r"(?<=Red_)Green")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH


A negative lookbehind includes instances where the match isn't preceded by the substring Yellow_:

In [41]:
run_test_cases(r"(?<!Yellow_)Green")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(0, 5), match='Green'>


The contents of a lookaround can include any other regular expression component. For instance, here is an example where we match only cases that are followed by at least five characters:

In [42]:
run_test_cases(r"Green(?=.{5})")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
NO MATCH


We can combine a negative lookbehind and a negative lookahead to exclude matches that are preceded by "Series " or followed by . or +, while still matching cases where "C" falls at the end of the string:

In [13]:
pattern = r'(?<!Series\s)\b[Cc]\b(?![\.\+])'
c_mentions = titles.str.contains(pattern).sum()
c_mentions

102
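The behaviour of this pattern on each problem case can be checked with a handful of titles. The first two titles come from the earlier output; the remaining three are hypothetical and were written to exercise the C++, C.E.O, and end-of-string cases.

```python
import re

pattern = r"(?<!Series\s)\b[Cc]\b(?![\.\+])"

sample_titles = [
    "The new C standards are worth it",             # from the earlier output
    "Moz raises $10m Series C from Foundry Group",  # from the earlier output
    "Why I still write C++",                        # hypothetical
    "Advice from a C.E.O",                          # hypothetical
    "A compiler written in C",                      # hypothetical
]

c_hits = [t for t in sample_titles if re.search(pattern, t)]
print(c_hits)
# ['The new C standards are worth it', 'A compiler written in C']
```

Because lookarounds do not consume characters, the title ending in "C" still matches: the negative lookahead succeeds when nothing follows.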

## Using Capture Groups in a RegEx Pattern 

Let's say we wanted to identify strings that had words with double letters, like the "ee" in "feed." Because we don't know ahead of time what letters might be repeated, we need a way to specify a capture group and then to repeat it. We can do this with backreferences.
<br>
<br>
Within a regular expression, we can refer to a capture group by its number using a backslash followed by that number: \1 refers to the first group, \2 to the second, and so on. For example, the regex below contains two capture groups and matches the text HelloGoodbye:

In [45]:
pattern = r"(Hello)(Goodbye)"

We can add backreferences to repeat the two groups in reverse order:

In [46]:
pattern = r"(Hello)(Goodbye)\2\1"

The regular expression above will match the text HelloGoodbyeGoodbyeHello.
<br>
<br>
Let's see this in action using Python:

In [47]:
test_cases = [
              "I'm going to read a book.",
              "Green is my favorite color.",
              "My name is Aaron.",
              "No doubles here.",
              "I have a pet eel."
             ]

for tc in test_cases:
    print(re.search(r"(\w)\1", tc))

<re.Match object; span=(21, 23), match='oo'>
<re.Match object; span=(2, 4), match='ee'>
None
None
<re.Match object; span=(13, 15), match='ee'>


Now we can write a regular expression to match cases of repeated words, and select only the items in titles that match the regular expression

In [48]:
pattern = r'\b(\w+)\s\1\b'
repeated_words = titles[titles.str.contains(pattern)]
print(repeated_words.head())

3102                 Silicon Valley Has a Problem Problem
3176               Wire Wire: A West African Cyber Threat
3178                        Flexbox Cheatsheet Cheatsheet
4797                           The Mindset Mindset (2015)
7276    Valentine's Day Special: Bye Bye Tinder, Flirt...
Name: title, dtype: object
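The role of each part of the repeated-word pattern is easier to see on individual strings. In this sketch, the first and third strings are based on titles from the output above, and the second is a hypothetical near miss.

```python
import re

pattern = r"\b(\w+)\s\1\b"

samples = [
    "Silicon Valley Has a Problem Problem",  # from the output above
    "The theme of the thesis",               # hypothetical, no repeated word
    "Bye Bye Tinder",                        # shortened from the output above
]

results = []
for s in samples:
    m = re.search(pattern, s)
    # group(1) holds the word that was captured and then repeated via \1
    results.append(m.group(1) if m else None)

print(results)  # ['Problem', None, 'Bye']
```

The second string does not match even though "the" is followed by "thesis": the trailing `\b` requires the repeated text to end at a word boundary.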




When we learned to work with basic string methods, we used the str.replace() method to replace simple substrings. We can achieve the same with regular expressions using the re.sub() function. The basic syntax for re.sub() is:

In [49]:
# Syntax only, not meant to be run as written:
# re.sub(pattern, repl, string, flags=0)

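When working in pandas, we can use the Series.str.replace() method, which uses similar syntax. In pandas 2.0 and later, the pattern is only treated as a regular expression when regex=True is passed:

In [14]:
# Syntax only, not meant to be run as written:
# Series.str.replace(pat, repl, flags=0, regex=True)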

## Substituting Regular Expression Matches 

We have provided email_variations, a pandas Series containing all the variations of "email" in the dataset.
<br>
<br>
We can use a regular expression to replace each of the matches in email_variations with "email" and assign the result to email_uniform.

In [51]:
email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

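pattern = r'[Ee][\s\-]?mail'
# regex=True is required in pandas 2.0+ for the pattern to be treated as a regex
email_uniform = email_variations.str.replace(pattern, 'email', flags=re.I, regex=True)
titles_clean = titles.str.replace(pattern, 'email', flags=re.I, regex=True)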

In [52]:
print(email_uniform.head())

0    email
1    email
2    email
3    email
4    email
dtype: object
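The same substitution can be sketched with `re.sub()` on a plain list. One detail to watch: the fourth positional argument of `re.sub()` is `count`, so the flag has to be passed by keyword. The sentence in `sentence` is hypothetical.

```python
import re

email_variations = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
                    'e-mail', 'eMail', 'E-Mail', 'EMAIL']

pattern = r"e[\s\-]?mail"

# flags must be given by keyword, because the fourth positional slot is count
email_uniform = [re.sub(pattern, "email", s, flags=re.I) for s in email_variations]
print(set(email_uniform))  # {'email'}

# re.sub() replaces every match in the string, not just the first one
sentence = "Send E-Mail or e mail today"
cleaned = re.sub(pattern, "email", sentence, flags=re.I)
print(cleaned)  # Send email or email today
```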


In [15]:
print(titles_clean.head())


## Extracting Domains from URLs 

We'll extract components of URLs from our dataset. As a reminder, most stories on Hacker News contain a link to an external resource.
<br>
<br>
The task we will be performing first is extracting the different components of the URLs in order to analyze them. On this screen, we'll start by extracting just the domains.
<br>
<br>
The domain of each URL excludes the protocol (e.g. https://) and the page path (e.g. /Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429).
<br>
<br>
We'll write a regular expression to extract the domains from test_urls and assign the result to test_urls_clean. The pattern works by:
- Using a series of characters that will match the protocol.
- Inside a capture group, using a set that will match the character classes used in the domain.
- Stopping there: because every URL either ends with the domain or continues with a character such as / or ? that is not found in any domain, we don't need to cater for the rest of the URL in our regular expression.


In [16]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'
])

pattern = r"https?://([\w\-\.]+)"

test_urls_clean = test_urls.str.extract(pattern, flags=re.I)
print(test_urls_clean.head())

                                 0
0                   www.amazon.com
1  www.interactivedynamicvideo.com
2                  www.nytimes.com
3                    evonomics.com
4                       github.com


## Extracting URL Parts Using Multiple Capture Groups 

We can extract the protocol, domain, and page path at the same time by using multiple capture groups:

In [55]:
pattern = r"(https?)://([\w\.\-]+)/?(.*)"

test_url_parts = test_urls.str.extract(pattern, flags=re.I)
url_parts = hn['url'].str.extract(pattern, flags=re.I) 

In [56]:
test_url_parts.head()

Unnamed: 0,0,1,2
0,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
1,http,www.interactivedynamicvideo.com,
2,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
3,http,evonomics.com,advertising-cannot-maintain-internet-heres-sol...
4,HTTPS,github.com,keppel/pinn


In [17]:
url_parts.head()


## Using Named Capture Groups to Extract Data 

Our final task will be to name these columns, which we'll do using named capture groups. Let's look at the example from the previous screen where we used two capture groups to extract the date and time as two separate columns. 

In [58]:
created_at = hn['created_at'].head()

pattern = r"(.+) (.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

            0      1
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


In order to name a capture group we use the syntax ?P<name>, where name is the name of our capture group. This syntax goes after the open parenthesis, but before the regex syntax that defines the capture group:

In [59]:
pattern = r"(?P<date>.+) (?P<time>.+)"
dates_times = created_at.str.extract(pattern)
print(dates_times)

         date   time
0    8/4/2016  11:52
1   6/23/2016  22:20
2   6/17/2016   0:01
3   9/30/2015   4:12
4  10/31/2015   9:48


We can add names to the capture groups from the previous screen to create a dataframe with named columns.

In [61]:
# pattern = r"(https?)://([\w\.\-]+)/?(.*)"
pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"
url_parts = hn['url'].str.extract(pattern, flags=re.I)
print(url_parts.head())

  protocol                           domain  \
0     http  www.interactivedynamicvideo.com   
1     http                  www.thewire.com   
2    https                   www.amazon.com   
3     http                  www.nytimes.com   
4     http                  arstechnica.com   

                                                path  
0                                                     
1  entertainment/2013/04/florida-djs-april-fools-...  
2  Technology-Ventures-Enterprise-Thomas-Byers/dp...  
3                2007/11/07/movies/07stein.html?_r=0  
4  business/2015/10/comcast-and-other-isps-boost-...  
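Named groups are also available on Match objects from the re module, where `Match.groupdict()` returns the captured parts keyed by name. The sketch below applies the named pattern to two of the URLs from test_urls.

```python
import re

pattern = r"(?P<protocol>https?)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)"

url = "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"
match = re.search(pattern, url, flags=re.I)

# groupdict() maps each group name to the text it captured
parts = match.groupdict()
print(parts["protocol"])  # http
print(parts["domain"])    # www.nytimes.com
print(parts["path"])      # 2007/11/07/movies/07stein.html?_r=0

# A URL with no page path yields an empty string for the path group
no_path = re.search(pattern, "https://iot.seeed.cc", flags=re.I).groupdict()
print(no_path)  # {'protocol': 'https', 'domain': 'iot.seeed.cc', 'path': ''}
```

A single group can also be read by name with `match.group("domain")`.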
