## 192. Introduction
- Regular expressions are
    - also known as regex
    - a sequence of characters that defines a pattern, used to
        - search for a string within a string
        - validate a string to see if it follows a given pattern
- example
    - validate an email address entered by a user
    - validate a password entered by a user
- Python provides the ```re``` module for regular expressions, with functions such as ```search()```, ```findall()```, ```match()```, ```sub()```, and ```split()```
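
As a minimal sketch of the validation use case above, the snippet below checks whether a whole string follows a pattern with `re.fullmatch`. The helper name `looks_like_email` and the pattern itself are illustrative assumptions; the pattern is deliberately simplistic and is not a complete email validator.

```python
import re

# Hypothetical, deliberately simple pattern: word characters, an '@',
# word characters, a literal dot, and more word characters.
EMAIL_PATTERN = re.compile(r'\w+@\w+\.\w+')

def looks_like_email(text):
    """Return True if the whole string follows EMAIL_PATTERN."""
    # fullmatch() succeeds only when the pattern covers the entire string.
    return EMAIL_PATTERN.fullmatch(text) is not None

print(looks_like_email("user@example.com"))   # True
print(looks_like_email("not an email"))       # False
```

Real addresses can contain characters such as dots or hyphens before the `@`, so a production check would need a more permissive pattern or a dedicated library.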


## 193. Sequence Characters
- Regex syntax defines special sequences, called sequence characters here. Most of them match a single character of a given kind; ```\b```, ```\A```, and ```\Z``` match a position instead of a character
- all the sequence characters start with a backslash ```\```
- example
    - ```\d``` : any digit 0-9
    - ```\D``` : any non-digit
    - ```\s``` : any whitespace character (space, tab, newline, etc.)
    - ```\S``` : any non-whitespace character
    - ```\w``` : any word character: a-z, A-Z, 0-9, and the underscore ```_```
    - ```\W``` : any character that is not a word character
    - ```\b``` : word boundary, the empty position at the start or end of a word
    - ```\A``` : matches only at the start of the string
    - ```\Z``` : matches only at the end of the string
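
The sequence characters listed above can be tried out with a short sketch like the following. The sample text and variable names are made up for illustration.

```python
import re

text = "Room 42, floor_3"

digits = re.findall(r'\d', text)                       # every digit, one at a time
non_word = re.findall(r'\W', text)                     # neither letter, digit, nor '_'
spaces = re.findall(r'\s', text)                       # whitespace characters
word_chars = re.findall(r'\w', "a_1-")                 # '_' counts as a word character
at_start = re.search(r'\A\w\w\w\w', text).group()      # four word characters at the very start
at_end = re.search(r'\d\Z', text).group()              # one digit at the very end
whole_word = re.search(r'\bfloor_3\b', text).group()   # \b marks the edges of a word

print(digits)       # ['4', '2', '3']
print(non_word)     # [' ', ',', ' ']
print(spaces)       # [' ', ' ']
print(word_chars)   # ['a', '_', '1']
print(at_start, at_end, whole_word)   # Room 3 floor_3
```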

## 194. search()
- Create a regular expression and pass it to the ```search()``` function from the ```re``` module
- ```re.search(regex, string)```
    - takes a regex and scans the string for it
    - returns the first match as a ```re.Match``` object
    - if no match is found, it returns ```None```
    - you can invoke ```result.group()``` to get the matching substring from the result

In [1]:
# regularexpressions
# redemo.py
import re
string = "Take up one idea. One idea at a time"
result = re.search(r'o\w\w', string) # returns re.Match type
# print(type(result))
# print(result)
print(result.group())

one


In [2]:
import re
string = "Take up one idea. One idea at a time"
result = re.search(r'o\w', string)
# print(type(result))
# print(result)
print(result.group())

on


In [3]:
import re
string = "Take up One idea. One idea at a time" # Both O in One are uppercase
result = re.search(r'o\w', string) # returns None when no match is found
# print(type(result))
print(result)
# print(result.group())

None
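
Because `search()` returns `None` when nothing matches, calling `.group()` directly on the result can raise an `AttributeError`. A minimal sketch of a guard is shown below; the helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

def first_match(pattern, text, flags=0):
    """Return the first matching substring, or None when there is no match."""
    result = re.search(pattern, text, flags)
    if result is None:      # no match: result.group() would raise AttributeError
        return None
    return result.group()

string = "Take up One idea. One idea at a time"
print(first_match(r'o\w', string))                  # None, there is no lowercase 'o'
print(first_match(r'o\w', string, re.IGNORECASE))   # On
```

Passing `re.IGNORECASE` (short form `re.I`) makes the same pattern match the uppercase `O` as well.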


## 195. findall() and match()
- ```re.findall(regex, string)```
    - takes a regex and searches the string for that pattern from beginning to end
    - returns all non-overlapping substrings that match the pattern as a list
    - if no substring of the given string matches the pattern, it returns an empty list
- ```re.match(regex, string)```
    - takes a regex and looks for that pattern only at the start of the string
    - if it finds a match, it returns a ```re.Match``` object
    - if it does not find a match, it returns ```None```
- ```re.Match.group()```
    - called without arguments, it returns the whole matched substring; called with group numbers, it returns the corresponding subgroup(s)

In [4]:
import re
string = "Take up one idea. one idea at a time"

result = re.findall(r'o\w\w', string) # returns a list
# print(type(result))
print(result)

['one', 'one']


In [5]:
import re
string = "Take up one idea. One idea at a time" # O in One is uppercase

result = re.findall(r'o\w\w', string) # returns a list
# print(type(result))
print(result)

['one']


In [6]:
import re
string = "Take up One idea. One idea at a time" # Both O in One are uppercase

print(re.search(r'o\w\w', string)) # Now search() returns None
result = re.findall(r'o\w\w', string) # returns an empty list
# print(type(result))
print(result)

None
[]


In [7]:
import re
string = "Take up One idea. One idea at a time"

result = re.match(r'o\w\w', string) # returns None, pattern not found at start of string
print(result)

None


In [8]:
import re
string = "Take up One idea. One idea at a time"

result = re.match(r'O\w\w', string) # returns None, pattern not found at start of string
print(result)

None


In [9]:
import re
string = "Take up One idea. One idea at a Time"

result = re.match(r'T\w\w', string) # returns a re.Match object; match() looks only at the start of the string
# print(result)
print(result.group())

Tak
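
To illustrate what "subgroups" means for `group()`, here is a small sketch that adds parentheses to the pattern. The grouping chosen here is arbitrary and only serves the demonstration.

```python
import re

string = "Take up One idea. One idea at a Time"

m = re.match(r'(T\w)(\w\w)', string)   # two capturing groups, tried at the start only
print(m.group())     # Take   (the whole match)
print(m.group(1))    # Ta     (first group)
print(m.group(2))    # ke     (second group)
print(m.span())      # (0, 4) (start and end index of the whole match)

# When the pattern contains groups, findall() returns the groups
# (as tuples) instead of the whole matched substrings.
pairs = re.findall(r'(O)(\w\w)', string)
print(pairs)         # [('O', 'ne'), ('O', 'ne')]
```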


## 196. sub()
- ```re.sub(regex, replacementString, string)```
    - replaces every non-overlapping match of the regex in string with replacementString
    - returns a new string with the replacements applied; the original string is not modified

In [10]:
import re
string = "Take up One idea. One idea at a Time"

result = re.sub(r'One', 'Two', string)
# print(type(result))
print(result)

Take up Two idea. Two idea at a Time
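
Beyond a plain replacement string, `re.sub()` also accepts a `count` argument, backreferences such as `\1` in the replacement, and a function as the replacement. The following is a short sketch of those three options; the variable names are illustrative.

```python
import re

string = "Take up One idea. One idea at a Time"

# Replace only the first occurrence.
first_only = re.sub(r'One', 'Two', string, count=1)

# \1 and \2 in the replacement refer to the text captured by the groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', "One idea")

# A function receives each Match object and returns the replacement text.
shouted = re.sub(r'O\w+', lambda m: m.group().upper(), string)

print(first_only)   # Take up Two idea. One idea at a Time
print(swapped)      # idea One
print(shouted)      # Take up ONE idea. ONE idea at a Time
```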


## 197. split()
- ```re.split(regex, string)```
    - splits the string wherever the regex matches, using each match as a delimiter
    - returns a list of the pieces between the matches; the matched text itself is not included

In [11]:
import re
string = "Take 1 up One 23 idea. One idea 45 at a Time"

result = re.split(r'\d+', string)
# print(type(result))
print(result)

['Take ', ' up One ', ' idea. One idea ', ' at a Time']
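
The output above keeps the spaces that surrounded each number. A minimal sketch of three variations follows: absorbing the surrounding whitespace into the delimiter, limiting the number of splits with `maxsplit`, and keeping the delimiters by wrapping the pattern in a capturing group.

```python
import re

string = "Take 1 up One 23 idea. One idea 45 at a Time"

clean = re.split(r'\s*\d+\s*', string)            # spaces around each number are part of the delimiter
limited = re.split(r'\d+', string, maxsplit=1)    # split at the first number only
kept = re.split(r'(\d+)', "a1b22c")               # a capturing group keeps the delimiters in the result

print(clean)     # ['Take', 'up One', 'idea. One idea', 'at a Time']
print(limited)   # ['Take ', ' up One 23 idea. One idea 45 at a Time']
print(kept)      # ['a', '1', 'b', '22', 'c']
```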


## 198. Quantifiers
- A sequence character on its own matches a single character
- Quantifiers specify how many times the preceding regex may repeat, so a pattern can match a run of characters
- ```+``` : specifies one or more repetitions of the preceding regex
    - ```\d+``` means one or more digits
- ```*``` : specifies zero or more repetitions of the preceding regex
- ```?``` : specifies zero or one repetition of the preceding regex
- ```{m}``` : specifies exactly 'm' repetitions of the preceding regex
- ```{m,n}``` : specifies a minimum of 'm' and a maximum of 'n' repetitions of the preceding regex
    - if 'm' is omitted it defaults to zero, and if 'n' is omitted there is no upper limit
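
The default bounds of ```{m,n}``` can be checked with a small sketch. The sample text is made up so that each word has a different number of `a` characters after the `b`.

```python
import re

text = "b ba baa baaa"

exactly_two = re.findall(r'ba{2}', text)     # 'b' followed by exactly two 'a'
two_or_more = re.findall(r'ba{2,}', text)    # n omitted: no upper limit
up_to_two = re.findall(r'ba{,2}', text)      # m omitted: lower bound is zero
one_or_two = re.findall(r'ba{1,2}', text)    # between one and two 'a'

print(exactly_two)   # ['baa', 'baa']
print(two_or_more)   # ['baa', 'baaa']
print(up_to_two)     # ['b', 'ba', 'baa', 'baa']
print(one_or_two)    # ['ba', 'baa', 'baa']
```

Note that ```ba{2}``` also matches inside `baaa`, because the pattern only requires two `a` characters and does not look at what follows.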

## 199. Using Quantifiers
- use quantifiers for pattern matching

- ```+``` quantifier

In [12]:
import re
string = "Take up One idea. One idea at a Time"

# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w+', string) # '+' quantifier, 1 or more repetitions
print(result)

['One', 'One']


- ```*``` quantifier

In [13]:
import re
string = "Take up One idea. One idea at a Time"

# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w*', string) # '*' quantifier, 0 or more repetitions
print(result)

['One', 'One']


- ```?``` quantifier

In [14]:
import re
string = "Take up One idea. One idea at a Time"

# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w?', string) # '?' quantifier, 0 or 1 repetition
print(result)

['On', 'On']


- ```{m}``` quantifier

In [15]:
import re
string = "Take up One idea. One idea at a Time"

# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w{1}', string) # '{m}' quantifier, exactly 1 repetition
print(result)

['On', 'On']


In [16]:
# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w{2}', string) # '{m}' quantifier, exactly 2 repetitions
print(result)

['One', 'One']


In [17]:
# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w{3}', string) # '{m}' quantifier, exactly 3 repetitions
print(result)

[]


In [18]:
string = "Take up One idea. One idea at a Time Only"

# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w{3}', string) # '{m}' quantifier, exactly 3 repetitions
print(result)

['Only']


- ```{m,n}``` quantifier

In [19]:
import re
string = "Take up One idea. One idea at a Time"

# result = re.findall(r'O\w\w', string)
result = re.findall(r'O\w{1,2}', string) # '{m,n}' quantifier, min 1 and max 2 repetitions
print(result)

['One', 'One']


## 200. Matching Dates
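- Create a regular expression that finds every date in a string, where a date is written as one or two digits, one or two digits, and four digits, separated by hyphens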


In [20]:
import re
string = "Take 1 up 1-3-2019 One 23 idea. One idea 45 at a Time 12-11-2020"

result = re.findall(r'\d{1,2}-\d{1,2}-\d{4}', string)
print(result)

['1-3-2019', '12-11-2020']
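
The date pattern above can be extended with capturing groups to pull out the individual parts. The sketch below is one possible variation: the `\b` at both ends is an added assumption to keep the match from starting or stopping in the middle of a longer number, and the variable names are illustrative.

```python
import re

string = "Take 1 up 1-3-2019 One 23 idea. One idea 45 at a Time 12-11-2020"

# Same shape as the pattern above, with each part in its own group.
date_pattern = re.compile(r'\b(\d{1,2})-(\d{1,2})-(\d{4})\b')

parts = date_pattern.findall(string)                      # list of tuples of strings
as_numbers = [tuple(int(p) for p in t) for t in parts]    # convert each part to int

print(parts)        # [('1', '3', '2019'), ('12', '11', '2020')]
print(as_numbers)   # [(1, 3, 2019), (12, 11, 2020)]
```

The pattern only checks the shape of the text, so a string such as `99-99-2020` also matches; checking that the values form a real calendar date would need extra logic.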


## 201. Special Characters
- Along with sequence characters and quantifiers, we can use certain special characters in our regular expressions as well
- ```\``` : escape character; to match a special character literally, put a backslash in front of it, e.g. ```\.``` matches a dot and ```\\``` matches a backslash
- ```.``` : dot operator, matches any single character except a newline
- ```^``` : matches at the beginning of the string
- ```$``` : matches at the end of the string
- ```[...]``` : matches any one character from the set or range listed inside the brackets, e.g. ```[a-z]```
- ```[^...]``` : matches any one character that is not in the listed set or range
- ```(...)``` : groups the enclosed regular expression and captures the text it matches
- ```(R|S)``` : matches either regex R or regex S
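
Each special character in the list above can be tried on a small made-up string, as in this sketch. The sample text and variable names are illustrative.

```python
import re

text = "cat cot cut c.t 3.14"

any_char = re.findall(r'c.t', text)         # '.' matches any single character
literal_dot = re.findall(r'c\.t', text)     # '\.' matches only a real dot
in_set = re.findall(r'c[ao]t', text)        # middle character is 'a' or 'o'
not_in_set = re.findall(r'c[^ao]t', text)   # middle character is neither 'a' nor 'o'
either = re.findall(r'(cat|cut)', text)     # either alternative inside the group
at_end = re.search(r'\d+$', text).group()   # digits at the end of the string

print(any_char)      # ['cat', 'cot', 'cut', 'c.t']
print(literal_dot)   # ['c.t']
print(in_set)        # ['cat', 'cot']
print(not_in_set)    # ['cut', 'c.t']
print(either)        # ['cat', 'cut']
print(at_end)        # 14
```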

## 202. Using Special Characters
- use special characters in the regular expressions

- ```^``` special character

In [21]:
import re
string = "Take 1 up 1-3-2019 One 23 idea. One idea 45 at a Time 12-11-2020"

result = re.search(r'^O\w', string) # matches at beginning of string
print(result)

None


In [22]:
result = re.search(r'^T\w', string) # matches at beginning of string
# print(result)
print(result.group())

Ta


In [23]:
result = re.search(r'^T\w*', string) # matches at beginning of string
# print(result)
print(result.group())

Take
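
The caret has two different meanings depending on where it appears: outside brackets it anchors the match to the start of the string, and as the first character inside brackets it negates the set. A short sketch contrasting the two, together with `$`, is shown below; the variable names are illustrative.

```python
import re

string = "Take 1 up 1-3-2019 One 23 idea. One idea 45 at a Time 12-11-2020"

first_word = re.search(r'^\w+', string).group()      # '^' outside brackets: start of string
last_year = re.search(r'\d{4}$', string).group()     # '$': end of string
words = re.findall(r'[^\d\s.\-]+', string)           # '^' inside brackets: anything except
                                                     # digits, whitespace, '.', and '-'

print(first_word)   # Take
print(last_year)    # 2020
print(words)        # ['Take', 'up', 'One', 'idea', 'One', 'idea', 'at', 'a', 'Time']
```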


## 203. Web Scraping Demo
- Web Scraping
    - Process of collecting information from web pages
- Retrieve the title of a website using regular expressions
- ```urllib.request.urlopen()```
    - opens a URL and returns an ```http.client.HTTPResponse``` object that carries the status, the headers, and the body, which for a web page is its HTML
- ```response.read()```
    - reads the response body and returns it as bytes, which need to be converted to a string before they can be searched with a string pattern

In [29]:
# webscrappingdemo.py
import re, urllib.request
# sites = ["google.com", "bharaththippireddy.com"]
sites = ["google.com", "youtube.com"]
for s in sites:
    print("Searching : ", s)
    response = urllib.request.urlopen("https://"+s) # opens a URL
    print(type(response)) # 'http.client.HTTPResponse'
    text = response.read() # reads response
    print(type(text)) # 'bytes'
    title = re.findall(r"<title>.*?</title>", text.decode("utf-8", errors="replace"), re.I | re.S) # decode bytes first; non-greedy, case-insensitive matching
    print(title[0])
    print(title) # debug

Searching :  google.com
<class 'http.client.HTTPResponse'>
<class 'bytes'>
<title>Google</title>
['<title>Google</title>']
Searching :  youtube.com
<class 'http.client.HTTPResponse'>
<class 'bytes'>
<title>YouTube</title>
['<title>YouTube</title>']
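
The title extraction can also be tried without network access. The sketch below runs on a hard-coded sample page; the helper name `extract_title` and the sample HTML are assumptions made for illustration, and a capturing group is used so that only the text between the tags is returned.

```python
import re

def extract_title(html_bytes, encoding="utf-8"):
    """Return the text inside the first <title> element, or None if there is none."""
    html = html_bytes.decode(encoding, errors="replace")   # bytes -> str
    # '.*?' is non-greedy, so it stops at the first closing tag;
    # re.S lets '.' match newlines; re.I ignores case;
    # '[^>]*' allows attributes on the opening tag.
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.I | re.S)
    return match.group(1).strip() if match else None

sample = b"<html><head><TITLE>\n  Example Page\n</TITLE></head><body></body></html>"
print(extract_title(sample))             # Example Page
print(extract_title(b"<html></html>"))   # None
```

Regular expressions work for a simple, well-formed tag like this one, but they are fragile on arbitrary HTML; a dedicated HTML parser is usually the more robust choice for real pages.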
