Regular expressions are a powerful way of building patterns to match text. In the first two missions of this Data Cleaning Advanced course, we're going to extend our knowledge about this extremely powerful tool that every data scientist should be familiar with.

As powerful as regular expressions are, they can be difficult to learn at first and the syntax can look visually intimidating. As a result, a lot of students end up disliking regular expressions and try to avoid using them, instead opting to write more cumbersome code.

![reg_ex.png](attachment:reg_ex.png)

That said, learning (and loving!) regular expressions is a worthwhile investment:

- Once you understand how they work, complex operations with string data can be written a lot quicker, which will save you time.
- Regular expressions are often faster to execute than their manual equivalents.
- Regular expressions are supported in almost every modern programming language, as well as other places like command line utilities and databases. Understanding regular expressions gives you a powerful tool that you can use wherever you work with data.


We could probably fill a whole Dataquest course with the intricacies of regular expressions, but instead we're going to give you a two-mission tour of the main components.

One thing to keep in mind before we start: **don't expect to remember all of the regular expression syntax**. The most important thing is to **understand the core principles**, what is possible, and where to look up the details. This will mean you can quickly jog your memory whenever you need regular expressions.

With that in mind, don't be put off if some things in these missions don't stick in your memory. As long as you are able to write and understand regular expressions with the help of documentation and/or other reference guides, you have all the skills you need to excel.

We'll be learning regular expressions while performing analysis on a dataset of submissions to popular technology site Hacker News.

![hacker.png](attachment:hacker.png)

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted posts (known as "stories") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles; stories that make it to the top of Hacker News' listings can get hundreds of thousands of visitors.

The dataset we will be working with is based off this CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

- id: The unique identifier from Hacker News for the story
- title: The title of the story
- url: The URL that the story links to, if the story has a URL
- num_points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the story
- author: The username of the person who submitted the story
- created_at: The date and time at which the story was submitted

For teaching purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. You can download the modified dataset using the dataset preview tool.

Let's start by reading our Hacker News dataset into a pandas dataframe.

In [1]:
import pandas as pd

hn = pd.read_csv('hacker_news.csv')
hn.head(5)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


### The regular expression module

When working with regular expressions, we use the term pattern to describe a regular expression that we've written. If the pattern is found within the string we're searching, we say that it has matched.

As we previously learned, letters and numbers represent themselves in regular expressions. If we wanted to find the string "and" within another string, the regex pattern for that is simply `and`:

![and.png](attachment:and.png)

In the third example above, the pattern `and` does not match `Andrew` because, even though a and A are the same letter, they are different characters, and regular expressions are case-sensitive by default.

We previously used regular expressions with pandas, but Python also has a built-in module for regular expressions: The re module. This module contains a number of different functions and classes for working with regular expressions. One of the most useful functions from the re module is the re.search() function, which takes two required arguments:

- The regex pattern
- The string we want to search for that pattern

In [2]:
import re

m = re.search("and", "hand")
print(m)

<re.Match object; span=(1, 4), match='and'>


The re.search() function will return a Match object if the pattern is found anywhere within the string. If the pattern is not found, re.search() returns None:

In [3]:
m = re.search("and", "antidote")
print(m)

None
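
The same function also illustrates the case-sensitivity point made earlier. In this small sketch, the test strings are illustrative choices of ours rather than values from the dataset:

```python
import re

pattern = "and"

# Illustrative strings (assumed for this example).
for text in ["hand", "candy", "Andrew"]:
    if re.search(pattern, text):
        print(text, "-> match")
    else:
        # "Andrew" lands here: it contains "And", not "and".
        print(text, "-> no match")
```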


We'll learn more about match objects later. For now, we can use the fact that the boolean value of a match object is True while None is False to easily check whether our regex matches each string in a list. We'll create a list of three simple strings to use while learning these concepts:

In [4]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

pattern = "Blue"

for s in string_list:
    if re.search(pattern, s):
        print("Match")
    else:
        print("No Match")

Match
No Match
No Match


So far, we haven't done anything with regular expressions that we couldn't do using the in keyword. The power of regular expressions comes when we use one of the special character sequences.

The first of these we'll learn is called a **set**. A set allows us to specify two or more characters that can match in a single character's position.

We define a set by placing the characters we want to match for in square brackets:

![set.png](attachment:set.png)

The regular expression above will match the strings `mend`, `send`, and `bend`.
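
As a quick check of that behavior, here is a minimal sketch. The pattern `[msb]end` is our assumption of what the diagram shows, chosen because it is consistent with the three strings listed above, and the extra test words are made up:

```python
import re

# Assumed pattern: one character from the set m, s, b, followed by "end".
pattern = "[msb]end"

words = ["mend", "send", "bend", "lend", "end"]
matched = [w for w in words if re.search(pattern, w)]

# "lend" and "end" are excluded: neither has m, s, or b before "end".
print(matched)
```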

Let's look at how we can add sets to match more of our example strings from earlier:

![table_example.png](attachment:table_example.png)

Let's take another look at the list of strings we used earlier:

In [5]:
string_list = ["Julie's favorite color is Blue.",
               "Keli's favorite color is Green.",
               "Craig's favorite colors are blue and red."]

If you look closely, you'll notice the first string contains the substring Blue with a capital letter, whereas the third string contains the substring blue in all lowercase. We can use the set [Bb] for the first character so that we can match both variations, and then use that to count how many strings in the list contain Blue or blue:

In [6]:
blue_mentions = 0
pattern = "[Bb]lue"

for s in string_list:
    if re.search(pattern, s):
        blue_mentions += 1

print(blue_mentions)

2


We're going to use this technique to find out how many times Python is mentioned in the title of stories in our Hacker News dataset. We'll use a set to check for both Python with a capital 'P' and python with a lowercase 'p'.

In [8]:
import re

titles = hn["title"].tolist()

python_mentions = 0

pattern = "[Pp]ython"

for t in titles:
    if re.search(pattern, t):
        python_mentions += 1
        
python_mentions

160

### Counting matches with pandas methods

We've learned that we should avoid using loops in pandas, and that vectorized methods are often faster and require less code.

In the data cleaning course, we learned that the Series.str.contains() method can be used to test whether a Series of strings match a particular regex pattern. Let's look at how we can replicate the example from the previous screen using pandas.

We'll start by creating a pandas object containing our strings:

In [9]:
eg_list = ["Julie's favorite color is green.",
           "Keli's favorite color is Blue.",
           "Craig's favorite colors are blue and red."]

eg_series = pd.Series(eg_list)
print(eg_series)

0             Julie's favorite color is green.
1               Keli's favorite color is Blue.
2    Craig's favorite colors are blue and red.
dtype: object


Next, we'll create our regex pattern, and use Series.str.contains() to compare to each value in our series:

In [10]:
pattern = "[Bb]lue"

pattern_contained = eg_series.str.contains(pattern)
print(pattern_contained)

0    False
1     True
2     True
dtype: bool


One of the neat things about boolean masks is that you can use the Series.sum() method to sum all the values in the boolean mask, with each True value counting as 1, and each False as 0. This means that we can easily count the number of values in the original series that matched our pattern:

In [11]:
pattern_count = pattern_contained.sum()
print(pattern_count)

2


If we wanted, we could use method chaining to do the whole operation on one line:

In [12]:
pattern_count = eg_series.str.contains(pattern).sum()
print(pattern_count)

2


Let's use this technique to replicate the analysis we did in the previous screen.

In [14]:
pattern = '[Pp]ython'

python_mentions = hn['title'].str.contains(pattern).sum()
python_mentions

160

### Using regular expressions to select data

On the previous two screens, we used regular expressions to count how many titles contain Python or python. What if we wanted to view those titles?

In that case, we can use the boolean array returned by Series.str.contains() to select just those rows from our series. Let's look at that in action, starting by creating the boolean array.

In [15]:
titles = hn['title']

py_titles_bool = titles.str.contains("[Pp]ython")
py_titles_bool.head()

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool

Then, we can use that boolean array to select just the matching rows:

In [16]:
py_titles = titles[py_titles_bool]
py_titles.head()

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object

We can also do it in a streamlined, single line of code:

In [17]:
py_titles = titles[titles.str.contains("[Pp]ython")]
py_titles.head()

102                  From Python to Lua: Why We Switched
103            Ubuntu 16.04 LTS to Ship Without Python 2
144    Create a GUI Application Using Qt and Python i...
196    How I Solved GCHQ's Xmas Card with Python and ...
436    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object
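
One caveat worth knowing: when a Series contains missing values, `Series.str.contains()` can produce missing values in its result, and a mask containing missing values generally can't be used for selection. The sketch below uses a made-up Series (not the Hacker News data) and passes the `na=False` argument so that missing titles are treated as non-matches:

```python
import pandas as pd

# Hypothetical titles, including one missing value.
example_titles = pd.Series(["Learning Python", None, "python tips", "Ruby on Rails"])

# na=False fills missing results with False, giving a purely boolean mask.
mask = example_titles.str.contains("[Pp]ython", na=False)

selected = example_titles[mask]
print(selected)
```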

Let's use this technique to select all titles that mention the programming language Ruby, using a set to account for whether the word is capitalized or not.

In [18]:
titles = hn['title']

ruby_titles = titles[titles.str.contains('[Rr]uby')]

### Quantifiers

In the data cleaning course, we learned that we could use braces ({}) to specify that a character repeats in our regular expression. For instance, if we wanted to write a pattern that matches the numbers in text from 1000 to 2999 we could write the regular expression below:

![quant.png](attachment:quant.png)

This type of regular expression syntax is called a quantifier. Quantifiers specify how many of the previous character our pattern requires, which can help us when we want to match substrings of specific lengths. As an example, we might want to match both e-mail and email. To do this, we would specify that the - character should match either zero or one times.

The specific type of quantifier we saw above is called a numeric quantifier. Here are the different types of numeric quantifiers we can use:

![quant_table.png](attachment:quant_table.png)

You might notice that the last two examples above omit the first or last number, leaving that end of the range open, in the same way that we can omit the first or last indices when slicing lists.
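
To see a numeric quantifier in action, here is a minimal sketch for the "1000 to 2999" case mentioned earlier. The pattern `[1-2][0-9]{3}` is our assumption of one way to write it (the diagram may use a slightly different form), and the test strings are invented:

```python
import re

# Assumed pattern: a 1 or 2, followed by exactly three digits.
pattern = "[1-2][0-9]{3}"

results = {}
for text in ["Founded in 1998", "Released in 2016", "Room 42", "Route 3500"]:
    results[text] = bool(re.search(pattern, text))
    print(text, "->", results[text])
```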

In addition to numeric quantifiers, there are single characters in regex that specify some common quantifiers that you're likely to use. A summary of them is below.

![quant_table_2.png](attachment:quant_table_2.png)
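
The three single-character quantifiers in standard regex syntax are `?` (zero or one), `*` (zero or more), and `+` (one or more). The sketch below compares them on a few made-up strings, so we can see which variations each one accepts:

```python
import re

tests = ["email", "e-mail", "e--mail"]

matches = {}
for pattern in ["e-?mail", "e-*mail", "e-+mail"]:
    # Collect the test strings in which the pattern is found.
    matches[pattern] = [t for t in tests if re.search(pattern, t)]
    print(pattern, "->", matches[pattern])
```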

On this screen, we're going to find how many titles in our dataset mention email or e-mail. To do this, we'll need to use ?, the optional quantifier, to specify that the dash character - is optional in our regular expression.

In [21]:
# The `titles` variable is available from
# the previous screens

email_bool = titles.str.contains("e-?mail")

email_count = email_bool.sum()

email_titles = titles[email_bool]
email_titles

119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object

### Escape characters

So far, we've learned how to perform simple matches with sets, and how to use quantifiers to specify when a character should repeat a certain number of times. Let's continue by looking at a more complex example.

Some stories submitted to Hacker News include a topic tag in brackets, like [pdf]. Here are a few examples of story titles with these tags:

```
[video] Google Self-Driving SUV Sideswipes Bus
New Directions in Cryptography by Diffie and Hellman (1976) [pdf]
Wallace and Gromit  The Great Train Chase (1993) [video]
```

In this screen, our task is to find how many titles in our dataset have tags.

Our first inclination may be to create the regex [pdf]. Unfortunately, the brackets would be interpreted as a set, so our pattern would match the single characters p, d, or f.

![pdf_set.png](attachment:pdf_set.png)

To match the substring "[pdf]", we can use backslashes to **escape** both the open and closing brackets: \[pdf\].

![pdf_scape.png](attachment:pdf_scape.png)
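
We can confirm the difference with `re.search()`, using one of the example titles from above. This is a small sketch; note that the patterns are written with an `r` prefix (a raw string, covered later in this mission) so that Python leaves the backslashes alone:

```python
import re

title = "New Directions in Cryptography by Diffie and Hellman (1976) [pdf]"

# Unescaped: the brackets define a set, so this matches a single
# character (the first p, d, or f found in the title).
as_set = re.search(r"[pdf]", title)
print(as_set)

# Escaped: the brackets are literal, so this matches the whole tag.
as_literal = re.search(r"\[pdf\]", title)
print(as_literal)
```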

### Character classes

The other critical part of our task of identifying how many titles have tags is knowing **how to match the characters between the brackets (like pdf and video) without knowing ahead of time what the different topic tags will be**.

To match unknown characters using regular expressions, we use **character classes**. Character classes allow us to match certain groups of characters. We've actually seen two examples of character classes already:

- The set notation using brackets to match any of a number of characters.
- The range notation, which we used to match ranges of digits (like [0-9]).

Let's look at a summary of syntax for some of the regex character classes:

![classes.png](attachment:classes.png)

There are two new things we can observe from this table:

1. Ranges can be used for letters as well as numbers.
2. Sets and ranges can be combined.

Just like with quantifiers, there are some other common character classes which we'll use a lot.

![class2.png](attachment:class2.png)

The one that we'll be using to match characters in tags is `\w`, which represents any word character: a letter, a digit, or an underscore. Each character class represents a single character, so to match multiple characters (e.g. words like `video` and `pdf`), we'll need to combine them with quantifiers.

In order to match word characters between our brackets, we can combine the word character class (`\w`) with the 'one or more' quantifier (`+`), giving us a combined pattern of `\w+`.

This will match sequences like pdf, video, Python, and 2018 but **won't match a sequence containing a space or punctuation character** like PHP-DEV or XKCD Flowchart. If we wanted to match those tags as well, we could use .+; however, in this case, we're just interested in **single-word tags without special characters**.
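
Here is a short sketch of that behavior, wrapping `\w+` in escaped brackets and trying it on a handful of made-up tags (the pattern uses a raw string, which is explained later in this mission):

```python
import re

pattern = r"\[\w+\]"

tags = ["[pdf]", "[video]", "[2018]", "[my_tag]", "[PHP-DEV]", "[XKCD Flowchart]"]

results = {}
for tag in tags:
    # A hyphen or space between the brackets breaks the run of word characters.
    results[tag] = bool(re.search(pattern, tag))
    print(tag, "->", results[tag])
```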

Let's quickly recap the concepts we learned in this screen:

- We can use a backslash to escape characters that have special meaning in regular expressions (e.g. `\[` will match an open bracket character).
- Character classes let us match certain groups of characters (e.g. `\w` will match any word character).
- Character classes can be combined with quantifiers when we want to match different numbers of characters.

We'll use these concepts to count the number of titles that contain a tag.

In [22]:
pattern = '\[\w+\]'

tag_titles = titles[titles.str.contains(pattern)]

tag_titles.count()

444

### Accessing the matching text with capture groups

In Python, a backslash followed by certain characters represents an escape sequence — like the `\n` sequence — which we previously learned represents a new line. These escape sequences can result in unintended consequences for our regular expressions. Let's take a look at a string containing the substring `\b`:

In [23]:
print('hello\b world')

hello world


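The escape sequence \b represents a backspace character, so Python does not treat it as a backslash followed by the letter b. Depending on where the output is displayed, the backspace may erase the final letter of hello or may simply be invisible. The character sequence \b also has a special meaning in regular expressions (which we'll learn about later), so we need a way to write these characters without triggering the escape sequence.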

One way is to add an extra backslash before the "b":

In [24]:
print('hello\\b world')

hello\b world


This can make regular expressions even more difficult to read and interpret, so instead we use **raw strings**, which we denote by **prefixing our string with the `r` character**. Let's take a look at the code from above with a raw string:

In [25]:
print(r'hello\b world')

hello\b world


We strongly recommend using raw strings for every regex you write, rather than remembering which sequences are escape sequences and using raw strings selectively. That way, you'll never encounter a situation where you forget or overlook something that causes your regex to break.
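
To make the risk concrete, the sketch below passes the same characters to `re.search()` with and without the `r` prefix. The test string is made up, and the regex meaning of `\b` is covered later in this mission:

```python
import re

text = "I like Java"

# Regular string: Python converts "\b" to a single backspace character
# before the regex engine sees it, so the pattern can't match this text.
plain = re.search("\bJava\b", text)

# Raw string: the backslash and the b both reach the regex engine.
raw = re.search(r"\bJava\b", text)

print(len("\b"), len(r"\b"))
print(plain)
print(raw)
```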

In the previous screen, we were able to calculate that 444 of the 20,100 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags was, and how many of each are in the dataset?

In order to do this, we'll need to use **capture groups**. Capture groups allow us to specify one or more groups within our match that we can access separately. In this mission, we'll learn how to use one capture group per regular expression, but in the next mission we'll learn some more complex capture group patterns.

We specify capture groups using parentheses. Let's add opening and closing parentheses to the pattern we wrote in the previous screen, and break down how each character in our regular expression works:

![capture.png](attachment:capture.png)

We'll learn how to access capture groups in pandas by looking at just the first five matching titles from the previous exercise:

In [26]:
tag_5 = tag_titles.head()
print(tag_5)

66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object


We use the `Series.str.extract()` method to extract the match within our parentheses:

In [27]:
pattern = r"(\[\w+\])"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

            0
66      [pdf]
100  [German]
159     [pdf]
162     [pdf]
195    [Beta]


We can move our parentheses inside the brackets to get just the text:

In [28]:
pattern = r"\[(\w+)\]"
tag_5_matches = tag_5.str.extract(pattern)
print(tag_5_matches)

          0
66      pdf
100  German
159     pdf
162     pdf
195    Beta


If we then use the value_counts() method on the result, we can quickly get a frequency table of the tags:

In [29]:
tag_5_freq = tag_5_matches.value_counts()
print(tag_5_freq)

pdf       3
Beta      1
German    1
dtype: int64
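
Note that `Series.str.extract()` returns a dataframe by default. If we'd rather work with a series when there is a single capture group, we can pass `expand=False`. The sketch below uses a few made-up titles rather than the Hacker News data:

```python
import pandas as pd

# Hypothetical titles; the last one has no tag.
sample = pd.Series([
    "File indexing for Plan 9 [pdf]",
    "Munich news report [German]",
    "Analysis of propaganda sources [pdf]",
    "A title without a tag",
])

# expand=False returns a Series when the pattern has one capture group.
sample_tags = sample.str.extract(r"\[(\w+)\]", expand=False)

# Titles without a tag become missing values, which value_counts() skips.
sample_freq = sample_tags.value_counts()
print(sample_freq)
```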


Let's use this technique to extract all of the tags from the Hacker News titles and build a frequency table of those tags.

In [31]:
pattern = r"\[(\w+)\]"
tags = titles.str.extract(pattern)

tags.value_counts()

pdf            276
video          111
2015             3
audio            3
2014             2
beta             2
slides           2
1996             1
map              1
ask              1
blank            1
coffee           1
comic            1
crash            1
detainee         1
gif              1
png              1
much             1
Ubuntu           1
repost           1
satire           1
song             1
survey           1
transcript       1
updated          1
videos           1
Videos           1
USA              1
2008             1
SpaceX           1
5                1
ANNOUNCE         1
Australian       1
Benchmark        1
Beta             1
CSS              1
Challenge        1
Excerpt          1
GOST             1
German           1
HBR              1
Infograph        1
JavaScript       1
Live             1
Map              1
NSFW             1
Petition         1
Python           1
React            1
SPA              1
Skinnywhale      1
viz              1
dtype: int64

### Negative character classes

On the previous screens, we wrote mostly simple regular expressions. In reality, regular expressions are often complex. When creating complex regular expressions, you often need to work iteratively so you can find "bad" instances that match your pattern and then exclude them.

In order to work faster as you build your regular expression, it can be helpful to create a function that returns the first few matching strings:

In [32]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

Another useful approach is to use an online tool like [RegExr](https://regexr.com/) that allows you to build regular expressions and includes syntax highlighting, instant matches, and regex syntax reference. For this screen, we'll use the first_10_matches function we just built to iteratively build a regular expression.

Earlier, we counted the titles that included Python — let's write a simple regular expression to match Java (another popular language), and use our function to look at the matches:

In [33]:
first_10_matches(r"[Jj]ava")

267      Show HN: Hire JavaScript - Top JavaScript Talent
436     Unikernel Power Comes to Java, Node.js, Go, an...
580     Python integration for the Duktape Javascript ...
811     Ask HN: Are there any projects or compilers wh...
1023                         Pippo  Web framework in Java
1046    If you write JavaScript tools or libraries, bu...
1093    Rollup.js: A next-generation JavaScript module...
1162                 V8 JavaScript Engine: V8 Release 5.4
1195                   Proposed JavaScript Standard Style
1314           Show HN: Design by Contract for JavaScript
Name: title, dtype: object

We can see that there are a number of matches that contain Java as part of the word JavaScript. We want to exclude these titles from matching so we get an accurate count.

One way to do this is by using **negative character classes**. Negative character classes match every character except those in the specified character class. Let's look at a table of the common negative character classes:

![negchar.png](attachment:negchar.png)
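
As a rough illustration, the sketch below tries a negative set alongside `\D` (any non-digit) and `\W` (any non-word character), which are standard regex classes that we assume correspond to entries in the table. The test strings are made up:

```python
import re

checks = [
    (r"Java[^Ss]", "Java 8"),      # a space follows Java
    (r"Java[^Ss]", "JavaScript"),  # S follows Java
    (r"Java[^Ss]", "Java"),        # nothing follows Java
    (r"\D", "2016"),               # every character is a digit
    (r"\W", "Node.js"),            # the period is not a word character
]

results = [bool(re.search(pattern, text)) for pattern, text in checks]

for (pattern, text), found in zip(checks, results):
    print(pattern, repr(text), "->", found)
```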

Let's use the negative set [^Ss] to exclude instances like JavaScript and Javascript:

In [35]:
pattern = r"[Jj]ava[^sS]"

java_titles = titles[titles.str.contains(pattern)]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1840                     Adopting RxJava on the Airbnb App
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
2910                 2016 JavaOne Intel Keynote  32mn Talk
3452     What are the Differences Between Java Platform...
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
5947                                        JavaFX is dead
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

### Word Boundaries

While the negative set was effective in removing any bad matches that mention JavaScript, it also had the side-effect of removing any titles where Java occurs at the end of the string, like this title:

`Pippo  Web framework in Java`

This is because the negative set [^Ss] must match one character. Instances at the end of a string aren't followed by any characters, so there is no match.

A different approach to take in cases like these is to use the **word boundary anchor**, specified using the syntax \b. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string. The diagram below shows all the word boundaries in an example string:

![bound.png](attachment:bound.png)

Let's look at how using a word boundary changes the match from the string in the example above:

In [36]:
string = "Sometimes people confuse JavaScript with Java"
pattern_1 = r"Java[^S]"

m1 = re.search(pattern_1, string)
print(m1)

None


The regular expression returns None, because there is no substring that contains Java followed by a character that isn't S.

Let's instead use word boundaries in our regular expression:

In [37]:
pattern_2 = r"\bJava\b"

m2 = re.search(pattern_2, string)
print(m2)

<re.Match object; span=(41, 45), match='Java'>


With the word boundary, our pattern matches the Java at the end of the string.
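
It's worth noticing that the word boundary is stricter than the negative set on both sides of the word. In this sketch, which reuses a few titles shown in the outputs above, names like RxJava and JavaFX no longer match because Java is not a standalone word in them:

```python
import re

pattern = r"\bJava\b"

sample_titles = [
    "Pippo  Web framework in Java",
    "Adopting RxJava on the Airbnb App",
    "JavaFX is dead",
    "Proposed JavaScript Standard Style",
    "Lambdas (in Java 8) Screencast",
]

# True only where Java has a word boundary on both sides.
results = [bool(re.search(pattern, t)) for t in sample_titles]

for title, found in zip(sample_titles, results):
    print(found, "-", title)
```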

Let's use the word boundary anchor as part of our regular expression to select the titles that mention Java.

In [39]:
pattern = r"\b[Jj]ava\b"

java_titles = titles[titles.str.contains(pattern)]
java_titles

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

### Matching at the start and end of strings

So far, we've used regular expressions to match substrings contained anywhere within text. There are often scenarios where we want to specifically match a pattern at the start or end of strings.

On the previous screen, we learned that the **word boundary anchor** matches the position between a word character and a non-word character. More generally in regular expressions, an anchor matches a position rather than a character, as opposed to character classes, which match specific characters.

Other than the word boundary anchor, the other two most common anchors are the **beginning anchor** and the **end anchor**, which represent the start and the end of the string.

![anchors.png](attachment:anchors.png)

Note that the ^ character is used both as a beginning anchor and to indicate a negative set, depending on whether the character preceding it is a [ or not.
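
A small sketch of the two roles side by side, using made-up strings:

```python
import re

# Outside brackets, ^ anchors the match to the start of the string.
start_hit = re.search(r"^Red", "Red Nose Day")
start_miss = re.search(r"^Red", "My Red Car")

# As the first character inside brackets, ^ negates the set:
# [^R]ed means "any character except R, followed by ed".
negated_hit = re.search(r"[^R]ed", "bed")
negated_miss = re.search(r"[^R]ed", "Red")

print(bool(start_hit), bool(start_miss))
print(bool(negated_hit), bool(negated_miss))
```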

Let's start with a few test cases that all contain the substring Red at different parts of the string:

In [40]:
test_cases = pd.Series([
    "Red Nose Day is a well-known fundraising event",
    "My favorite color is Red",
    "My Red Car was purchased three years ago"
])

If we want to match the word Red only if it occurs at the start of the string, we add the beginning anchor to the start of our regular expression:

In [41]:
test_cases.str.contains(r"^Red")

0     True
1    False
2    False
dtype: bool

If we want to match the word Red only if it occurs at the end of the string, we add the end anchor to the end of our regular expression:

In [42]:
test_cases.str.contains(r"Red$")

0    False
1     True
2    False
dtype: bool

Let's use the beginning and end anchors to count how many titles have tags at the start versus the end of the story title in our Hacker News dataset.

In [44]:
pattern_beginning = r"^\[\w+\]"

beginning_count = titles.str.contains(pattern_beginning).sum()
beginning_count

15

In [43]:
pattern_ending = r"\[\w+\]$"

ending_count = titles.str.contains(pattern_ending).sum()
ending_count

417

## Challenge

### Using flags to modify regex patterns

Up until now, we've been using sets like [Pp] to match different capitalizations in our regular expressions. This strategy works well when there is only one character that has capitalization, but becomes cumbersome when we need to cater for multiple instances.

Within the titles, there are many different formatting styles used to represent the word "email." Here is a list of the variations:

```
email
Email
e Mail
e mail
E-mail
e-mail
eMail
E-Mail
EMAIL
emails
Emails
E-Mails
```

To write a regular expression for this, we would need to use a set for all five letters in email, which would make our regular expression very hard to read.

Instead, we can use **flags** to specify that our regular expression should ignore case.

Both re.search() and the pandas regular expression methods accept an optional flags argument. This argument accepts one or more flags, which are special variables in the re module that modify the behavior of the regex interpreter.

A list of all available flags is in the documentation, but by far the most common and the most useful is the `re.IGNORECASE` flag, which is also available using the alias `re.I` for convenience.

When you use this flag, all uppercase letters will match their lowercase equivalents and vice versa. Let's look at an example without using the flag:

In [45]:
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests.str.contains(r"email")

0     True
1    False
2    False
3    False
dtype: bool

Now let's look at what happens when we use the flag:

In [46]:
email_tests.str.contains(r"email",flags=re.I)

0    True
1    True
2    True
3    True
dtype: bool

No matter what the capitalization is, our regular expression matches.
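
The same flag works with `re.search()`. The sketch below repeats the comparison without pandas, and also shows that multiple flags can be combined with the `|` operator. The `re.MULTILINE` flag and the two-line test string are our additions for illustration only:

```python
import re

variants = ["email", "Email", "eMail", "EMAIL"]

# Without the flag, only the all-lowercase variant matches.
case_sensitive = [bool(re.search(r"email", v)) for v in variants]

# With the flag, every capitalization matches.
ignore_case = [bool(re.search(r"email", v, flags=re.IGNORECASE)) for v in variants]

print(case_sensitive)
print(ignore_case)

# Combining flags: re.MULTILINE lets ^ match at the start of each line.
combined = re.search(r"^email", "Subject\nEMAIL", flags=re.I | re.M)
print(combined)
```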

We'll finish this mission by writing a regular expression to count the number of times that email is mentioned in story titles. You'll need to use **both the ignorecase flag and some of the other regex components you've already learned in this mission.**

In [60]:
email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

pattern = r"\be[\s-]?mail[s]?\b"
email_men = email_tests.str.contains(pattern, flags = re.I)

print(email_tests[email_men])

0       email
1       Email
2      e Mail
3      e mail
4      E-mail
5      e-mail
6       eMail
7      E-Mail
8       EMAIL
9      emails
10     Emails
11    E-Mails
dtype: object


In [61]:
print(email_men.sum())

12


In [62]:
print(len(email_tests) == email_men.sum())

True


In [63]:
email_mentions = titles.str.contains(pattern, flags = re.I).sum()
email_mentions

141