# Regular Expressions Basics
As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.

The dataset we will be working with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

- id: The unique identifier from Hacker News for the story
- title: The title of the story
- url: The URL that the story links to, if the story has a URL
- num_points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the story
- author: The username of the person who submitted the story
- created_at: The date and time at which the story was submitted

For teaching purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

## Instructions
1. Import the pandas library.
2. Read the hacker_news.csv file into a pandas dataframe. Assign the result to hn.
3. After you have completed the code exercise, use the variable inspector to familiarize yourself with the dataset.

```python
import pandas as pd
import numpy as np

hn = pd.read_csv('data/hacker_news.csv')
hn.info()
```
```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
id              20100 non-null int64
title           20100 non-null object
url             17660 non-null object
num_points      20100 non-null int64
num_comments    20100 non-null int64
author          20100 non-null object
created_at      20100 non-null object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
```


When working with regular expressions, we use the term pattern to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string "and" within another string, the regex pattern for that is simply `and`:

[Figure: basic matches]

In the third example above, the pattern `and` does not match Andrew because regular expressions are case sensitive by default: even though a and A are the same letter, they are treated as two different characters.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: the re module. This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the re module is the re.search() function, which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern

```python
import re

m = re.search("and", "hand")
print(m)
```
```text
<_sre.SRE_Match object; span=(1, 4), match='and'>
```
The re.search() function will return a Match object if the pattern is found anywhere within the string. If the pattern is not found, re.search() returns None:
```python
m = re.search("and", "antidote")
print(m)
```
```text
None
```
We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is True while None is False to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:
```python
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")
```
```text
Match
No Match
No Match
```
So far, we haven't done anything with regular expressions that we couldn't do using the in keyword. The power of regular expressions comes when we use one of the special character sequences.

The first of these we'll learn is called a set. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match for in square brackets:

For example, the regular expression `[msb]end` will match the strings mend, send, and bend.
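
As a quick check of the set syntax, the sketch below tries a set such as `[msb]end` against a handful of words. The word list is made up for illustration and is not part of the dataset.

```python
import re

# A set matches exactly one character out of those listed in the brackets.
pattern = r"[msb]end"

# Hypothetical sample words, chosen only to illustrate the behavior.
words = ["mend", "send", "bend", "lend", "end"]

# Keep the words in which the pattern is found.
matched = [w for w in words if re.search(pattern, w)]
print(matched)  # ['mend', 'send', 'bend']
```

Neither lend nor end matches, because the set must consume exactly one of m, s, or b immediately before end.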

Let's look at how we can add sets to match more of our example strings from earlier. Here is the list of strings again:
```python
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]
```
If you look closely, you'll notice the first string contains the substring Blue with a capital letter, whereas the third string contains the substring blue in all lowercase. We can use the set [Bb] for the first character so that we can match both variations, and then use that to count how many times Blue or blue occur in the list:
```python
blue_mentions = 0
pattern = "[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)
```
```text
2
```
We're going to use this technique to find out how many times Python is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both Python with a capital 'P' and python with a lowercase 'p'.

### Instructions

We have provided code to import the re module and extract a list, titles, containing all the titles from our dataset.

1. Initialize a variable python_mentions with the integer value 0.
2. Create a string, pattern, containing a regular expression pattern that uses a set to match Python or python.
3. Use a loop to iterate over each item in the titles list, and for each item:
    - Use the re.search() function to check whether pattern matches the title.
    - If re.search() returns a match object, increment (add 1 to) the python_mentions variable.

```python
import re

titles = hn["title"].tolist()
python_mentions = 0
pattern = '[Pp]ython'

for item in titles:
    if re.search(pattern, item):
        python_mentions += 1
```

### Instructions

We have provided the regex pattern from the solution to the previous screen.

1. Assign the title column from the hn dataframe to the variable titles.
2. Use Series.str.contains() and Series.sum() with the provided regex pattern to count how many Hacker News titles contain Python or python. Assign the result to python_mentions.

```python
titles = hn.title
python_mentions = titles.str.contains(pattern).sum()
```

On the previous two screens, we used regular expressions to count how many titles contain Python or python. What if we wanted to view those titles?

In that case, we can use the boolean array returned by Series.str.contains() to select just those rows from our series. Let's look at that in action, starting by creating the boolean array.
```python
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
print(py_titles_bool.head())
```
```text
0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool
```
Then, we can use that boolean array to select just the matching rows:
```python
py_titles = titles[py_titles_bool]
print(py_titles.head())
```
```text
103                          From Python to Lua: Why We Switched
104                    Ubuntu 16.04 LTS to Ship Without Python 2
145      Create a GUI Application Using Qt and Python in Minutes
197     How I Solved GCHQ's Xmas Card with Python and Pycosat...
437  Unikernel Power Comes to Java, Node.js, Go, and Python Apps
Name: title, dtype: object
```
We can also do it in a streamlined, single line of code:
```python
py_titles = titles[titles.str.contains("[Pp]ython")]
print(py_titles.head())
```
```text
103                          From Python to Lua: Why We Switched
104                    Ubuntu 16.04 LTS to Ship Without Python 2
145      Create a GUI Application Using Qt and Python in Minutes
197     How I Solved GCHQ's Xmas Card with Python and Pycosat...
437  Unikernel Power Comes to Java, Node.js, Go, and Python Apps
Name: title, dtype: object
```
Let's use this technique to select all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

### Instructions

1. Use Series.str.contains() to create a series of the values from titles that contain Ruby or ruby. Assign the result to ruby_titles.

```python
titles = hn['title']
ruby_titles = titles[titles.str.contains('[Rr]uby')]
ruby_titles.head()
```
```text
191                    Ruby on Google AppEngine Goes Beta
485          Related: Pure Ruby Relational Algebra Engine
1389    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1950    Rewriting a Ruby C Extension in Rust: How a Na...
2023    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object
```

In the data cleaning course, we learned that we could use braces ({}) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from 1000 to 2999 we could write the regular expression below:

![](images/quantifier_example.svg)

This type of regular expression syntax is called a quantifier. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both e-mail and email. To do this, we would specify that the - character should match either zero or one times.

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:

![](images/quantifiers_numeric.svg)

You might notice that the last two examples above omit the first or last number, leaving that side of the range open, in the same way that we can omit the first or last index when slicing lists.

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

![](images/quantifiers_other.svg)
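
The quantifiers described above can be tried out with the re module. The following is a minimal sketch, assuming a few made-up sample strings; it uses re.fullmatch() so that the whole string has to match the pattern, and the pattern `[1-2][0-9]{3}` is one way to express "numbers from 1000 to 2999".

```python
import re

# Hypothetical sample strings for illustration.
samples = ["email", "e-mail", "e--mail", "1999", "3000", "999"]

# ? makes the preceding character optional (zero or one times).
email_pattern = r"e-?mail"
email_hits = [s for s in samples if re.fullmatch(email_pattern, s)]

# {3} requires exactly three repetitions of the preceding set.
year_pattern = r"[1-2][0-9]{3}"
year_hits = [s for s in samples if re.fullmatch(year_pattern, s)]

# * allows zero or more repetitions, + requires at least one.
letters = ["ac", "abc", "abbc"]
star_hits = [s for s in letters if re.fullmatch(r"ab*c", s)]
plus_hits = [s for s in letters if re.fullmatch(r"ab+c", s)]

print(email_hits)  # ['email', 'e-mail']
print(year_hits)   # ['1999']
print(star_hits)   # ['ac', 'abc', 'abbc']
print(plus_hits)   # ['abc', 'abbc']
```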

On this screen, we're going to find how many titles in our dataset mention email or e-mail. To do this, we'll need to use ?, the optional quantifier, to specify that the dash character - is optional in our regular expression.

### Instructions

1. Use a regular expression and Series.str.contains() to create a boolean mask that matches items from titles containing email or e-mail. Assign the result to email_bool.
2. Use email_bool to count the number of titles that matched the regular expression. Assign the result to email_count.
3. Use email_bool to select only the items from titles that matched the regular expression. Assign the result to email_titles.

```python
email_bool = titles.str.contains('e-?mail')
email_count = email_bool.sum()
email_titles = titles[email_bool]
```

To match the substring "[pdf]", we can use backslashes to escape both the open and closing brackets: `\[pdf\]`.

<img src="images/escaped_character_syntax_breakdown.svg" />
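
To see why the escaping matters, the sketch below runs both the unescaped and the escaped pattern against one title taken from the dataset. Without the backslashes, the brackets define a set, so the pattern matches a single p, d, or f rather than the tag.

```python
import re

title = "File indexing and searching for Plan 9 [pdf]"

# Unescaped, [pdf] is a set: it matches a single p, d, or f.
unescaped = re.search(r"[pdf]", title)

# Escaped, \[pdf\] matches the literal five-character tag.
escaped = re.search(r"\[pdf\]", title)

print(unescaped.group())  # d  (the d in "indexing")
print(escaped.group())    # [pdf]
```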

The other critical part of our task of identifying how many titles have tags is knowing how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be.

To match unknown characters using regular expressions, we use character classes. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

- The set notation using brackets to match any of a number of characters.
- The range notation, which we used to match ranges of digits (like [0-9]).

Let's look at a summary of syntax for some of the regex character classes:

<img src="images/character_classes_v2_1.svg" />

There are two new things we can observe from this table:

- Ranges can be used for letters as well as numbers.
- Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.

<img src="images/character_classes_v2_2.svg" />

The one that we'll be using in order to match characters in tags is \w, which represents any digit, any uppercase or lowercase letter, or an underscore. Each character class represents a single character, so to match multiple characters (e.g. words like video and pdf), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (\w) with the 'one or more' quantifier (+), giving us a combined pattern of \w+.

This will match sequences like pdf, video, Python, and 2018 but won't match a sequence containing a space or punctuation character like PHP-DEV or XKCD Flowchart. If we wanted to match those tags as well, we could use .+; however, in this case, we're just interested in single-word tags without special characters.
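
The difference between the two patterns can be checked with a small sketch. The titles below are hypothetical and exist only to exercise the tags mentioned above.

```python
import re

# Hypothetical titles for illustration.
sample_titles = [
    "Some report [pdf]",
    "A conference talk [video]",
    "Mailing list thread [PHP-DEV]",
    "Comic of the day [XKCD Flowchart]",
]

word_tag = r"\[\w+\]"  # single-word tags only
any_tag = r"\[.+\]"    # any characters between the brackets

word_matches = [bool(re.search(word_tag, t)) for t in sample_titles]
any_matches = [bool(re.search(any_tag, t)) for t in sample_titles]

print(word_matches)  # [True, True, False, False]
print(any_matches)   # [True, True, True, True]
```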

Let's quickly recap the concepts we learned in this screen:

- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. `\[` will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. \w will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.

We'll use these concepts to count the number of titles that contain a tag.

### Instructions

1. Write a regular expression, assigning it as a string to the variable pattern. The regular expression should match, in order:
    - A single open bracket character.
    - One or more word characters.
    - A single close bracket character.
2. Use the regular expression to select only items from titles that match. Assign the result to the variable tag_titles.
3. Count how many matching titles there are. Assign the result to tag_count.

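```python
# The instructions ask for the regular expression to be stored in pattern.
pattern = r'\[\w+\]'
tag_titles = titles[titles.str.contains(pattern)]
tag_count = tag_titles.count()
tag_count
```
```text
444
```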

In the previous screen, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use capture groups. Capture groups allow us to specify one or more groups within our match that we can access separately. In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

<img src="images/tags_syntax_breakdown_v2.svg" />
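
Before moving to pandas, the effect of a capture group can be seen with the re module alone. This is a small sketch using one title from the dataset; Match.group(0) returns the whole match, while Match.group(1) returns only the text inside the first pair of parentheses.

```python
import re

title = "Munich Gunman Got Weapon from the Darknet [German]"

# The parentheses sit inside the escaped brackets, so the group
# captures the tag text without the brackets themselves.
m = re.search(r"\[(\w+)\]", title)

whole = m.group(0)  # the full match, brackets included
tag = m.group(1)    # only the captured group

print(whole)  # [German]
print(tag)    # German
```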

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise:
```python
tag_5 = tag_titles.head()
print(tag_5)
```
```text
67      Analysis of 114 propaganda sources from ISIS, Jabhat al-Nusra, al-Qaeda [pdf]
101                                Munich Gunman Got Weapon from the Darknet [German]
160                                      File indexing and searching for Plan 9 [pdf]
163    Attack on Kunduz Trauma Centre, Afghanistan  Initial MSF Internal Review [pdf]
196                                            [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object
```
We use the Series.str.extract() method to extract the match within our parentheses:
```python
pattern = r"(\[\w+\])"
# expand=False returns a Series rather than a one-column DataFrame.
tag_5_matches = tag_5.str.extract(pattern, expand=False)
print(tag_5_matches)
```
```text
67        [pdf]
101    [German]
160       [pdf]
163       [pdf]
196      [Beta]
Name: title, dtype: object
```
We can move our parentheses inside the brackets to get just the text:
```python
pattern = r"\[(\w+)\]"
# expand=False returns a Series rather than a one-column DataFrame.
tag_5_matches = tag_5.str.extract(pattern, expand=False)
print(tag_5_matches)
```
```text
67        pdf
101    German
160       pdf
163       pdf
196      Beta
Name: title, dtype: object
```
If we then use Series.value_counts() we can quickly get a frequency table of the tags:
```python
tag_5_freq = tag_5_matches.value_counts()
print(tag_5_freq)
```
```text
pdf       3
Beta      1
German    1
Name: title, dtype: int64
```
Let's use this technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

### Instructions

We have provided a commented line of code with the pattern from the previous exercise.

1. Uncomment the line of code and add parentheses to create a capture group inside the brackets.
2. Use Series.str.extract() and Series.value_counts() with the modified regex pattern to produce a frequency table of all the tags in the titles series. Assign the frequency table to tag_freq.

```python
pattern = r"\[(\w+)\]"
tag_freq = titles.str.extract(pattern, expand=False).value_counts()
tag_freq.head()
```
```text
pdf      276
video    111
audio      3
2015       3
beta       2
Name: title, dtype: int64
```

Negative character classes are character classes that match every character except those in a given character class. Let's look at a table of the common negative character classes:

<img src="images/negative_character_classes.svg" />
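
A few negative character classes can be compared side by side with re.findall(). The sketch below assumes a short made-up string and shows the standard negative classes \D, \W, and \S next to a negative set.

```python
import re

# Hypothetical text for illustration.
text = "Java 8, JavaScript"

non_digits = re.findall(r"\D+", text)   # runs of characters that are not digits
non_words = re.findall(r"\W+", text)    # runs of characters that are not word characters
non_spaces = re.findall(r"\S+", text)   # runs of characters that are not whitespace
not_s = re.findall(r"Java[^Ss]", text)  # Java followed by any character except S or s

print(non_digits)  # ['Java ', ', JavaScript']
print(non_words)   # [' ', ', ']
print(non_spaces)  # ['Java', '8,', 'JavaScript']
print(not_s)       # ['Java ']
```

Note that the negative set still has to consume one character, which is why the last result includes the trailing space.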

Let's use the negative set [^Ss] to exclude instances like JavaScript and Javascript:

### Instructions

1. Write a regular expression that will match titles containing Java.
    - You might like to use the first_10_matches() function or a site like RegExr to build your regular expression.
    - The regex should match whether or not the first character is capitalized.
    - The regex shouldn't match where 'Java' is followed by the letter 'S' or 's'.
2. Select every row from titles that match the regular expression. Assign the result to java_titles.

```python
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10


expr = r'[Jj]ava[^Ss]'
java_titles = titles[titles.str.contains(expr)]
```
```text
437      Unikernel Power Comes to Java, Node.js, Go, an...
812      Ask HN: Are there any projects or compilers wh...
1841                     Adopting RxJava on the Airbnb App
1973           Node.js vs. Java: Which Is Faster for APIs?
2094                     Java EE and Microservices in 2016
2368     Code that is valid in both PHP and Java, and p...
2494     Ask HN: I've been a java dev for a couple of y...
2752                 Eventsourcing for Java 0.4.0 released
2911                 2016 JavaOne Intel Keynote  32mn Talk
3453     What are the Differences Between Java Platform...
4274      Ask HN: Is Bloch's Effective Java Still Current?
4625     Oracle Discloses Critical Java Vulnerability i...
5462                        Lambdas (in Java 8) Screencast
5848     IntelliJ IDEA and the whole IntelliJ platform ...
5948                                        JavaFX is dead
6269             Oracle deprecating Java applets in Java 9
7437     Forget Guava: 5 Google Libraries Java Develope.
```

Let's look at how using a word boundary changes the match from the string in the example above:
```python
string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"

m1 = re.search(pattern_1, string)
print(m1)
```
```text
None
```
The regular expression returns None, because there is no substring that contains Java followed by a character that isn't S.

Let's instead use word boundaries in our regular expression:
```python
pattern_2 = r"\bJava\b"

m2 = re.search(pattern_2, string)
print(m2)
```
```text
<_sre.SRE_Match object; span=(41, 45), match='Java'>
```
With the word boundary, our pattern matches the Java at the end of the string.
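
The two approaches can also be compared on several titles at once. The sketch below reuses three titles that appear in the output earlier in this section and adds one hypothetical title that ends with Java.

```python
import re

sample_titles = [
    "Adopting RxJava on the Airbnb App",
    "JavaFX is dead",
    "Java EE and Microservices in 2016",
    "I still write Java",  # hypothetical title, not from the dataset
]

negative_set = r"[Jj]ava[^Ss]"
word_boundary = r"\b[Jj]ava\b"

negative_set_hits = [t for t in sample_titles if re.search(negative_set, t)]
boundary_hits = [t for t in sample_titles if re.search(word_boundary, t)]

# The negative set accepts RxJava and JavaFX but misses Java at the end of a string.
print(negative_set_hits)
# The word boundary rejects RxJava and JavaFX and accepts Java at the end of a string.
print(boundary_hits)
```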

Let's use the word boundary anchor as part of our regular expression to select the titles that mention Java.

### Instructions

1. Write a regular expression that will match titles containing Java.
    - You might like to use the first_10_matches() function or a site like RegExr to build your regular expression.
    - The regex should match whether or not the first character is capitalized.
    - The regex should match only where 'Java' is preceded and followed by a word boundary.
2. Select from titles only the items that match the regular expression. Assign the result to java_titles.

```python
exp = r'\b[Jj]ava\b'
# Keep every matching title; head() is only used to preview the result.
java_titles = titles[titles.str.contains(exp)]
java_titles.head()
```
```text
437     Unikernel Power Comes to Java, Node.js, Go, an...
812     Ask HN: Are there any projects or compilers wh...
1024                         Pippo  Web framework in Java
1973          Node.js vs. Java: Which Is Faster for APIs?
2094                    Java EE and Microservices in 2016
Name: title, dtype: object
```

On the previous screen, we learned that the word boundary anchor matches the position between a word character and a non-word character. More generally in regular expressions, an anchor matches something that isn't a character, as opposed to character classes which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the beginning anchor and the end anchor, which represent the start and the end of the string, respectively.

<img src="images/positional_anchors.svg" />

Note that the ^ character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a [ or not.
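
The two roles of ^ can be seen in a short sketch. The strings below are made up for illustration.

```python
import re

# ^ outside brackets: beginning anchor, so Red must start the string.
anchored = re.search(r"^Red", "Red Nose Day")
not_anchored = re.search(r"^Red", "My Red Car")

# ^ directly after [: negative set, so any character except R, followed by ed.
negative_set = re.search(r"[^R]ed", "Red bed")

# Both roles in one pattern: the string must start with a character that is not R.
combined = re.search(r"^[^R]", "Blue")

print(anchored.group())      # Red
print(not_anchored)          # None
print(negative_set.group())  # bed
print(combined.group())      # B
```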

Let's start with a few test cases that all contain the substring Red at different parts of the string:
```python
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])
print(test_cases)
```
```text
0    Red Nose Day is a well-known fundraising event
1                          My favorite color is Red
2          My Red Car was purchased three years ago
dtype: object
```
If we want to match the word Red only if it occurs at the start of the string, we add the beginning anchor to the start of our regular expression:
```python
test_cases.str.contains(r"^Red")
```
```text
0     True
1    False
2    False
dtype: bool
```
If we want to match the word Red only if it occurs at the end of the string, we add the end anchor to the end of our regular expression:
```python
test_cases.str.contains(r"Red$")
```
```text
0    False
1     True
2    False
dtype: bool
```
Let's use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset.


### Instructions

1. Count the number of times that a tag (e.g. [pdf] or [video]) occurs at the start of a title in titles. Assign the result to beginning_count.
2. Count the number of times that a tag (e.g. [pdf] or [video]) occurs at the end of a title in titles. Assign the result to ending_count.

```python
beginning_count = titles.str.contains(r'^\[\w+\]').sum()
ending_count = titles.str.contains(r'\[\w+\]$').sum()
```

### Instructions

1. Write a regular expression that will match all variations of email included in the starter code. Write your regular expression in a way that will be compatible with the ignorecase flag.
    - As you build your regular expression, you might like to use Series.str.contains() like we did in the examples earlier in this screen.
2. Once your regular expression matches all the test cases, use it to count the number of mentions of email in titles in the dataset. Assign the result to email_mentions.

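```python
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
                         'e-mail', 'eMail', 'E-Mail', 'EMAIL'])
pattern = r"e[\-\s]?mail"

# All nine test cases should match when case is ignored.
print(email_tests.str.contains(pattern, flags=re.I).sum())

# Count the titles in the dataset that mention email in any of these forms.
email_mentions = titles.str.contains(pattern, flags=re.I).sum()
```
```text
9
```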
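
The ignorecase flag is also available when using the re module directly. The sketch below assumes the same pattern as the solution above and shows the flag passed to re.search() and to re.compile(); re.I is a short alias for re.IGNORECASE.

```python
import re

pattern = r"e[\-\s]?mail"

# Without the flag, the lowercase pattern does not match an uppercase E.
case_sensitive = re.search(pattern, "E-Mail")

# With the flag, letter case is ignored.
case_insensitive = re.search(pattern, "E-Mail", flags=re.I)

# The flag can also be attached once to a compiled pattern and reused.
compiled = re.compile(pattern, flags=re.IGNORECASE)
compiled_hit = compiled.search("EMAIL")

print(case_sensitive)            # None
print(case_insensitive.group())  # E-Mail
print(compiled_hit.group())      # EMAIL
```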