
In [0]:
#import the python library 're' for regular expressions 

import re

Each character in a regular expression is either:

- a metacharacter, having a special meaning, or 
- a regular character that has a literal meaning.


# Metacharacters

- Python regex has 12 metacharacters: `\ ^ $ . | ? * + ( ) [ {`. The closing `]` and `}` only become special after their opening counterpart.
- To match one of them literally, escape it by placing a backslash (`\`) in front of it.
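
As a small sketch of the escaping rule above (the sample strings are made up for illustration), the cell below compares an unescaped dot with an escaped one. `re.escape` is the standard-library helper that adds the backslashes for a whole string.

```python
import re

price = "Total: $3.50 (approx.)"

# Unescaped, "." matches any character, so the pattern also matches "3x50".
loose = re.search(r"3.50", "3x50")

# Escaped, "\." matches only a literal dot, so "3x50" no longer matches.
strict = re.search(r"3\.50", "3x50")

# "$" is a metacharacter too, so it needs a backslash to match literally.
dollar = re.search(r"\$3\.50", price)

# re.escape() escapes every metacharacter in a string for you.
escaped = re.escape("$3.50")
auto = re.search(escaped, price)

print(loose is not None)  # True
print(strict is None)     # True
print(dollar.group())     # $3.50
print(auto.group())       # $3.50
```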




![alt text](https://www.optimizesmart.com/wp-content/uploads/2010/06/regex-cheatsheet-for-Google-Analytics1.jpg)



# Shorthands

Six regex tokens that consist of a backslash and a letter form shorthand character classes: `\d`, `\D`, `\w`, `\W`, `\s` and `\S`.

You can use these both inside and outside character classes. Each lowercase shorthand character has an associated uppercase shorthand character with the opposite meaning.

- `\d` matches a single digit.
- `\w` matches a single word character.
- `\s` matches any whitespace character.

A word character is a character that can occur as part of a word. That includes letters, digits, and the underscore.
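
The shorthands above can be tried out with a short sketch like the one below. The sample string is invented for illustration; the last line shows a shorthand used inside a character class.

```python
import re

sample = "Room 42, floor_3 \t(west wing)"

digits = re.findall(r"\d", sample)          # each single digit
non_digits = re.findall(r"\D+", "a1b22c")   # runs of non-digit characters
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)          # each whitespace character
in_class = re.findall(r"[\d_]+", sample)    # shorthand inside a character class

print(digits)      # ['4', '2', '3']
print(non_digits)  # ['a', 'b', 'c']
print(words)       # ['Room', '42', 'floor_3', 'west', 'wing']
print(in_class)    # ['42', '_3']
```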

In [0]:
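# caret: ^ anchors the match at the start of the string.
# No match here (returns None) because the string starts with a letter.
re.search(r"^\d", "number 123")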

In [0]:
re.search(r"^\d", "123")

In [0]:
# dollar: $ anchors the match at the end of the string

re.search(r"\d$", "number 123")

In [0]:
#dot

re.search(".", "number 123")

In [0]:
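# asterisk: zero or more repetitions.
# The result ends with an empty string because .* can also match zero characters at the end.
re.findall(".*", "number 123")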

In [0]:
#plus sign

re.findall(".+", "number 123")

In [0]:
#pipe symbol

re.search("cat|dog", "I have a pet dog")

In [0]:
# escape character: \( and \) match literal parentheses

s = "The trademark (®) symbol is a symbol to indicate that the preceding mark is a trademark."

re.search(r"\(.\)", s)

In [0]:
# curly brackets: {4} repeats the preceding token exactly four times

s = "the year was 1990"

# re.search(r"\d\d\d\d", s)  # equivalent, written out
re.search(r"\d{4}", s)


# Character Class

A character class matches a single character out of a list of possible characters. The notation uses square brackets to list the character options. 

e.g. `[CcKk]ate` is a regex with character class notation that covers the possible ways the name 'Kate' could be spelled.


**By default, regular expressions are case sensitive.**

e.g. `regex` matches regex but not Regex, REGEX, or ReGeX.
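
A minimal sketch of the case-sensitivity rule above. The `re.IGNORECASE` flag is a standard-library option that this notebook does not otherwise cover; it is shown here only for contrast.

```python
import re

text = "regex Regex REGEX ReGeX"

# Default behaviour: case sensitive, so only the exact-case spelling matches.
sensitive = re.findall(r"regex", text)

# With re.IGNORECASE, all four spellings match.
insensitive = re.findall(r"regex", text, flags=re.IGNORECASE)

print(sensitive)    # ['regex']
print(insensitive)  # ['regex', 'Regex', 'REGEX', 'ReGeX']
```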

* Inside a character class, only four characters have a special function: ` \, ^, -, and ]`

* All other characters are literals and simply add themselves to the character class. The regular expression `[$()*+.?{|]` matches any one of the nine characters between the square brackets.

* A caret `(^)` negates the character class if you place it immediately after the opening bracket. It makes the character class match any character that is not in the list.

* A hyphen `(-)` creates a range when it is placed between two characters, e.g. `[a-zA-Z]` includes all uppercase and lowercase letters, whereas `[a-h]` includes only the lowercase letters from a to h.

In [0]:
# character class example

s = "It's easy to misspell Calendar as calender or calander so let's check using regex for all possible spellings and replace"

re.sub("[Cc]al[ae]nd[ae]r", "calendar", s)

In [0]:
# example - hyphen

# find all four-digit years from 1700 to 1999 in a text using hyphen ranges

s = "US Senate Resolution 155 of 10 November 1997 states that the Declaration of Arbroath, the Scottish Declaration of Independence, was signed on 6 April 1320 and the American Declaration of Independence, 1776 was modelled on that inspirational document"

re.findall(r"\b1[7-9][0-9]{2}\b", s)


In [0]:
# example: every character except letters, whitespace, and commas, using caret

re.findall(r"[^a-zA-Z\s,]", s)

# Grouping and Capturing

Grouping is done with parentheses. They have the highest precedence of all regex operators. 

However, a pair of parentheses isn’t just a group; it’s a capturing group. **Captures** become useful when they cover only part of the regular expression. 


Problem: Create a regular expression that matches any date in yyyy-mm-dd format, and separately captures the year, month, and day.

In [0]:
#@title TRY IT HERE



In [0]:
#@title Solution

d = "2007-03-13"

x = re.search(r"(\d\d\d\d)-(\d\d)-(\d\d)", d)
x.group()

In [0]:
print(x.group(1), x.group(2), x.group(3))

In [0]:
#@title named capture groups

d = "2007-03-13"

x = re.search(r"\b(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)\b", d)

print(x.group('year'))
print(x.group('month'))
print(x.group('day'))
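
As a follow-up sketch to the capture-group cells above, a match object can also return all captures at once: `groups()` gives a tuple of the captured groups and `groupdict()` gives a dictionary of the named ones. The sample sentence is made up for illustration.

```python
import re

m = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    "Released on 2007-03-13.",
)

whole = m.group(0)       # the entire match
parts = m.groups()       # every capture group, in order
named = m.groupdict()    # named groups as a dictionary

print(whole)  # 2007-03-13
print(parts)  # ('2007', '03', '13')
print(named)  # {'year': '2007', 'month': '03', 'day': '13'}
```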

# LAZY QUANTIFIERS


> You can make any quantifier lazy by placing a question mark after it: `*?`, `+?`, `??`, and `{7,42}?` are all lazy quantifiers.

Lazy quantifiers backtrack too, but the other way around. A lazy quantifier repeats as few times as it has to, stores one backtracking position, and allows the regex to continue.


The quantifiers `*` and `*?` allow all the same regular expression matches. The only difference is the order in which the possible matches are tried. Starting from the same position, the greedy quantifier tries the longest possible repetition first, while the lazy quantifier tries the shortest first.

In [0]:
s = "<p> The very <em>first</em> task is to find the beginning of a paragraph. </p> <p> Then you have to find the end of the paragraph </p>"


re.search("<p>.*</p>", s).group()

In [0]:
re.search("<p>.*?</p>", s).group()

In [0]:
re.findall("great!*", "the music was great!!!")

In [0]:
re.findall("great!*?", "the music was great!!!")

In [0]:
re.findall("great!+", "the music was great!!!")

In [0]:
re.findall("great!+?", "the music was great!!!")
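
The greedy/lazy difference becomes more visible with `re.findall`, since the greedy pattern swallows both paragraphs in one match while the lazy pattern returns them separately. This is a small sketch using a shortened, made-up HTML string.

```python
import re

html = "<p>one</p> <p>two</p>"

# Greedy: .* runs to the last </p>, so there is a single long match.
greedy = re.findall(r"<p>.*</p>", html)

# Lazy: .*? stops at the first </p> it can, so each paragraph is its own match.
lazy = re.findall(r"<p>.*?</p>", html)

print(greedy)  # ['<p>one</p> <p>two</p>']
print(lazy)    # ['<p>one</p>', '<p>two</p>']
```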

# RE Functions:


1. **`re.search(pattern, string, flags)`** [documentation](https://docs.python.org/3/library/re.html#re.search)

> Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.


In [0]:
s = "bill list: HR 1234, S 1023, S 234, HR 4505, S 8098"

re.search(r"(HR|S) (\d{3,4})", s)

In [0]:
s = "bill list: HR 1234, S 1023, S 234, HR 4505, S 8098"

# re.match only matches at the start of the string, so this returns None
re.match(r"(HR|S) (\d{3,4})", s)

In [0]:
s = "bill list: HR 1234, S 1023, S 234, HR 4505, S 8098"

x = re.search(r"(HR|S) (\d{3,4})", s)

print(x.group(1))
print(x.group(2))
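
The cells above use both `re.search` and `re.match`. The sketch below contrasts them, and adds the standard-library `re.fullmatch` (not otherwise used in this notebook) for the case where the whole string must match. The shortened bill list is for illustration only.

```python
import re

pattern = r"(HR|S) (\d{3,4})"
s = "bill list: HR 1234, S 1023"

# search: finds the first match anywhere in the string.
found = re.search(pattern, s)

# match: only succeeds if the pattern matches at the very start.
at_start = re.match(pattern, s)
at_start_ok = re.match(pattern, "HR 1234, S 1023")

# fullmatch: only succeeds if the pattern matches the entire string.
whole = re.fullmatch(pattern, "HR 1234")
whole_fail = re.fullmatch(pattern, "HR 1234, S 1023")

print(found.group())        # HR 1234
print(at_start)             # None
print(at_start_ok.group())  # HR 1234
print(whole.group())        # HR 1234
print(whole_fail)           # None
```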

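2. **`re.findall(pattern, string, flags)`** [documentation](https://docs.python.org/3/library/re.html#re.findall)

> Return all non-overlapping matches of pattern in string, as a list of strings or tuples. If the pattern contains more than one capturing group, each list item is a tuple of the captured groups.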

In [0]:
s = "bill list: HR 1234, S 1023, S 234, HR 4505, S 8098"

re.findall(r"(HR|S) (\d{3,4})", s)
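
What `re.findall` returns depends on the capturing groups in the pattern. The sketch below compares three variants; the non-capturing group `(?:...)` and `re.finditer` are standard-library features introduced here only for comparison, and the bill list is shortened.

```python
import re

s = "bill list: HR 1234, S 1023, S 234"

# Two capturing groups: each result is a tuple of the captures.
tuples = re.findall(r"(HR|S) (\d{3,4})", s)

# Non-capturing group (?:...): each result is the whole matched string.
strings = re.findall(r"(?:HR|S) \d{3,4}", s)

# finditer yields match objects, so whole match and groups are both available.
both = [(m.group(0), m.group(2)) for m in re.finditer(r"(HR|S) (\d{3,4})", s)]

print(tuples)   # [('HR', '1234'), ('S', '1023'), ('S', '234')]
print(strings)  # ['HR 1234', 'S 1023', 'S 234']
print(both)     # [('HR 1234', '1234'), ('S 1023', '1023'), ('S 234', '234')]
```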


3. **`re.sub(pattern, repl, string)`** [documentation](https://docs.python.org/3/library/re.html#re.sub)

> Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.

In [0]:
s = "bill list: HR 1234, S 1023, S 234, HR 4505, S 8098"

re.sub(r'\s', "", s)

In [0]:
re.sub('4', "@", s)

In [0]:
re.sub(r'\d{4}', "-", s)
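
Two further `re.sub` features fit the examples above: the replacement can refer back to capture groups with `\1`, `\2`, and so on, and the optional `count` argument limits how many replacements are made. This is a sketch on a shortened bill list.

```python
import re

s = "bill list: HR 1234, S 1023, S 234"

# Backreferences: reuse the captured chamber and number in the replacement.
hyphenated = re.sub(r"(HR|S) (\d{3,4})", r"\1-\2", s)

# count=2: replace only the first two digits found.
masked = re.sub(r"\d", "#", s, count=2)

print(hyphenated)  # bill list: HR-1234, S-1023, S-234
print(masked)      # bill list: HR ##34, S 1023, S 234
```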

# EXERCISES:

In [0]:
#@title 1. Write a regex to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9)




In [0]:
#@title test block

print(bool(re.search(**insert regex**, "ABCDEFabcdef123450")))

print(bool(re.search(**insert regex**, "*&%@#!}{")))

False
True


In [0]:
#@title solution

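# search for any character outside the allowed set;
# False means the string contains only a-z, A-Z and 0-9
r'[^a-zA-Z0-9]'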
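
An alternative sketch for exercise 1 states the check positively: `re.fullmatch` succeeds only when every character in the string belongs to the allowed set. The helper name `only_alnum` is hypothetical and not part of the original exercise. Note that the truth values are the reverse of the test block above, because this version returns `True` for a valid string.

```python
import re

def only_alnum(text):
    """Return True if text consists solely of a-z, A-Z and 0-9 (hypothetical helper)."""
    return re.fullmatch(r"[a-zA-Z0-9]+", text) is not None

print(only_alnum("ABCDEFabcdef123450"))  # True
print(only_alnum("*&%@#!}{"))            # False
print(only_alnum("abc def"))             # False, the space is not allowed
```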





In [0]:
#@title 2. Write a regex for a 10-digit US phone number and extract the area code (the "state code") in a named capture group




In [0]:
#@title test block

print(bool(re.search(**insert regex**, "2027893452")))

print(bool(re.search(**insert regex**, "0902352234")))

print(bool(re.search(**insert regex**, "2930352234")))

True


In [0]:
#@title solution

r'(?P<statecode>[2-9]\d{2})[2-9][0-9]{6}'
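
To show how the named group in this solution can be read back, here is a short sketch. Compiling the pattern with `re.compile` and using `fullmatch` so that exactly ten digits are required are choices made for this example, not part of the original solution.

```python
import re

phone = re.compile(r"(?P<statecode>[2-9]\d{2})[2-9][0-9]{6}")

ok = phone.fullmatch("2027893452")
bad_first_digit = phone.fullmatch("0902352234")   # code starts with 0
bad_fourth_digit = phone.fullmatch("2930352234")  # fourth digit is 0

print(ok.group("statecode"))  # 202
print(bad_first_digit)        # None
print(bad_fourth_digit)       # None
```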

In [0]:
#@title 3. Write a regex to replace all parentheses in a text with whitespace




In [0]:
#@title test block

print(re.sub(**insert regex**, " ", "sdkn(huy67)"))

print(re.sub(**insert regex**, " ", "(x+y+z)= (x+y)+z"))



In [0]:
#@title solution

# inside a character class, parentheses are literals and need no escaping
r"[()]"
