# Python

---

# Regular Expressions

Regular expressions are a powerful tool for various kinds of string manipulation.
They are a domain-specific language (DSL) that is available as a library in most modern programming languages, not just Python.
They are useful for two main tasks:
- verifying that strings match a pattern (for instance, that a string has the format of an email address),
- performing substitutions in a string (such as changing all American spellings to British ones).


Regular expressions in Python can be accessed using the re module, which is part of the standard library.
After you've defined a regular expression, the re.match function can be used to determine whether it matches at the beginning of a string.
If it does, match returns an object representing the match; if not, it returns None.
To avoid confusion when working with regular expressions, we write patterns as raw strings, r"expression".
In a raw string, Python leaves backslashes as they are instead of treating them as escape characters, which makes regular expressions easier to write.
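
The difference between a normal and a raw string literal can be sketched as follows. The example assumes the \b sequence (a word boundary, covered later in this notebook); the variable names are only illustrative.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
normal = "\bcat\b"
# In a raw string literal, the backslash is kept, so the regex engine sees \b.
raw = r"\bcat\b"

print(len(normal))  # 5: backspace, c, a, t, backspace
print(len(raw))     # 7: \, b, c, a, t, \, b

normal_hit = re.search(normal, "the cat sat")
raw_hit = re.search(raw, "the cat sat")
print(normal_hit)       # None: the text contains no backspace characters
print(raw_hit.group())  # cat
```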

In [2]:
import re
word = input()

pattern = r"gl"

if re.match(pattern, word):
    print("Match")
else:
    print("No match")

Match


Other functions to match patterns are re.search and re.findall.

The function re.search finds a match of a pattern anywhere in the string.

The function re.findall returns a list of all substrings that match a pattern.

In [4]:
import re

pattern = r"spam"

if re.match(pattern, "eggspamsausagespam"):
    print("Match")
else:
    print("No match")

if re.search(pattern, "eggspamsausagespam"):
    print("Match")
else:
    print("No match")

print(re.findall(pattern, "eggspamsausagespam"))

No match
Match
['spam', 'spam']


In [7]:
import re

quote = "Always do your best. Your best is going to change from moment to moment; it will be different when you are healthy as opposed to sick. Under any circumstance, simply do your best, and you will avoid self-judgment, self-abuse and regret"

pattern = input()

len(re.findall(pattern,quote))



3

The re.search function returns a match object with several methods that give details about the match.
These methods include group, which returns the matched string; start and end, which return the start and end positions of the first match; and span, which returns those two positions as a tuple.

Example:

In [13]:
import re

pattern = r"pam"

match = re.search(pattern, "eggspamsausage")
if match:
    print(match.group())
    print(match.start())
    print(match.end())
    print(match.span())

pam
4
7
(4, 7)


### Search and replace

In [8]:
import re

# "text" is used instead of "str" so the built-in str type is not shadowed.
text = "My name is David. Hi David."
pattern = r"David"
newstr = re.sub(pattern, "Amy", text)
print(newstr)

My name is Amy. Hi Amy.


We need to create a number formatting system for a contacts database.
Create a program that takes a phone number as input and, if the number starts with "00", replaces that prefix with "+".
The number should be printed after formatting.

In [12]:
import re

number = input()
# Anchor the pattern with ^ so that only a leading "00" is replaced.
pattern = r"^00"

# re.sub returns the string unchanged when the pattern does not match.
newstr = re.sub(pattern, "+", number)

print(newstr)


+161
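
One thing to watch in this kind of task is that an unanchored pattern such as r"00" replaces every occurrence, not just a prefix. A minimal sketch of the difference, using a made-up input that also ends in "00":

```python
import re

number = "0016100"  # hypothetical input that also contains "00" at the end

# Unanchored: every "00" is replaced, including the trailing one.
unanchored = re.sub(r"00", "+", number)
# Anchored with ^: only a leading "00" is replaced.
anchored = re.sub(r"^00", "+", number)

print(unanchored)  # +161+
print(anchored)    # +16100
```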


### Metacharacters

Metacharacters are what make regular expressions more powerful than normal string methods.
They allow you to create regular expressions to represent concepts like "one or more repetitions of a vowel".

The existence of metacharacters poses a problem if you want to create a regular expression (or regex) that matches a literal metacharacter, such as "$". You can do this by escaping the metacharacter with a backslash in front of it.
However, backslashes also act as escape characters in normal Python strings, so without raw strings a single literal backslash in a pattern can require three or four backslashes in a row. Raw strings avoid this.
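
As a rough illustration of escaping, the sketch below matches literal "$" and "." characters and a literal backslash. The sample strings are made up, and re.escape is a standard-library helper that is not otherwise used in this notebook.

```python
import re

price = "Total: $5.00 (was $6.50)"

# "$" and "." are metacharacters, so they are escaped to match literally.
pattern = r"\$[0-9]+\.[0-9]+"
found = re.findall(pattern, price)
print(found)  # ['$5.00', '$6.50']

# re.escape builds the escaped form of an arbitrary literal string.
escaped = re.escape("$5.00")
escaped_hit = re.search(escaped, price)
print(escaped_hit.group())  # $5.00

# Matching one literal backslash: four backslashes in a normal string,
# two in a raw string. Both literals are the same two-character pattern.
print("\\\\" == r"\\")  # True
path = "C:\\temp"  # the actual text is C:\temp
backslashes = re.findall(r"\\", path)
print(len(backslashes))  # 1
```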


In [22]:
# . (Dot)

#The first metacharacter we will look at is . (dot).
#This matches any character, other than a new line.

import re

pattern = r"gr.y"

if re.match(pattern, "grey"):
    print("Match 1")

if re.match(pattern, "gray"):
    print("Match 2")

if re.match(pattern, "blue"):
    print("Match 3")

Match 1
Match 2


In [17]:
#Start and end
import re

pattern = r"^gr.y$"

if re.match(pattern, "grey"):
    print("Match 1")

if re.match(pattern, "gray"):
    print("Match 2")

if re.match(pattern, "stingray"):
    print("Match 3")

Match 1
Match 2


In [19]:
# Character classes []

import re

pattern = r"[aeiou]"

if re.search(pattern, "grey"):
    print("Match 1")

if re.search(pattern, "qwertyuiop"):
    print("Match 2")

if re.search(pattern, "rhythm myths"):
    print("Match 3")

Match 1
Match 2


In [18]:
# ^ Invert a character class 

import re

pattern = r"[^A-Z]"

if re.search(pattern, "this is all quiet"):
    print("Match 1")

if re.search(pattern, "AbCdEfG123"):
    print("Match 2")

if re.search(pattern, "THISISALLSHOUTING"):
    print("Match 3")

Match 1
Match 2


In [21]:
#All the products in an online shop have their own ID. Every ID consists of 4 symbols:
#The first symbol: an uppercase character
#The second symbol: an uppercase character
#The third symbol: a digit
#The fourth symbol: a digit

import re
Id = input()

if re.match(r"[A-Z][A-Z][0-9][0-9]$",Id):
    print("Searching")
else:
    print("Wrong format")

Searching


In [30]:
# *
# Means "zero or more repetitions" of the previous thing.
# The pattern below matches strings that start with "egg", followed by zero or more "spam"s.

import re

pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
    print("Match 1")

if re.match(pattern, "eggspamspamegg"):
    print("Match 2")

if re.match(pattern, "spam"):
    print("Match 3")

Match 1
Match 2


In [26]:
# + 
# Means "one or more repetitions", as opposed to "zero or more repetitions".

import re

pattern = r"g+"

if re.match(pattern, "g"):
    print("Match 1")

if re.match(pattern, "gggggggggggggg"):
    print("Match 2")

if re.match(pattern, "agbc"):
    print("Match 3")

Match 1
Match 2


In [27]:
# ?
# The metacharacter ? means "zero or one repetitions"

import re

pattern = r"ice(-)?cream"

if re.match(pattern, "ice-cream"):
    print("Match 1")

if re.match(pattern, "icecream"):
    print("Match 2")

if re.match(pattern, "sausages"):
    print("Match 3")

if re.match(pattern, "ice--cream"):
    print("Match 4")

Match 1
Match 2


Curly braces can be used to represent the number of repetitions between two numbers.
The regex {x,y} means "between x and y repetitions of something".
Hence {0,1} is the same thing as ?.
If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity.
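
The cases with a missing bound can be sketched like this. The helper name matches is hypothetical, and re.fullmatch (which succeeds only if the whole string matches) is used here for brevity even though the notebook otherwise relies on re.match.

```python
import re

def matches(pattern, text):
    # True only when the entire string matches the pattern.
    return re.fullmatch(pattern, text) is not None

print(matches(r"a{2,3}", "aa"))     # True
print(matches(r"a{2,3}", "aaaa"))   # False: four repetitions is too many
print(matches(r"a{,2}", ""))        # True: the lower bound defaults to zero
print(matches(r"a{2,}", "aaaaaa"))  # True: no upper bound
print(matches(r"a{2,}", "a"))       # False: fewer than two repetitions
print(matches(r"a{3}", "aaa"))      # True: a single number means exactly that many
```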

In [28]:
import re

pattern = r"9{1,3}$"

if re.match(pattern, "9"):
    print("Match 1")

if re.match(pattern, "999"):
    print("Match 2")

if re.match(pattern, "9999"):
    print("Match 3")

Match 1
Match 2


In [31]:
#Let's imagine we are creating our own authentication system.
#Create a program that takes a password as input and prints "Password created" if
#- it has at least one uppercase character
#- it has at least one number

import re
password = input()

if re.search( r"(?=.*[A-Z])(?=.*[0-9])", password):
    print("Password created")
else:
    print("Wrong format")
    


Password created
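
The pattern in the cell above relies on lookahead assertions, written (?=...), which check that something follows the current position without consuming any characters. As a sketch only, the same rule can also be written as two separate searches; both function names below are hypothetical, and the comparison assumes single-line input, since . does not match a newline.

```python
import re

def check_with_lookaheads(password):
    # Each (?=...) tests a condition from the same starting position
    # without consuming any characters.
    return re.search(r"(?=.*[A-Z])(?=.*[0-9])", password) is not None

def check_with_two_searches(password):
    # The same rule expressed as two independent searches.
    has_upper = re.search(r"[A-Z]", password) is not None
    has_digit = re.search(r"[0-9]", password) is not None
    return has_upper and has_digit

for candidate in ["Secret1", "secret1", "SECRET", "1A"]:
    print(candidate, check_with_lookaheads(candidate), check_with_two_searches(candidate))
```

For these four inputs both functions agree: only "Secret1" and "1A" pass.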


### Groups

A group can be created by surrounding part of a regular expression with parentheses.
This means that a group can be given as an argument to metacharacters such as * and ?.

Example:

In [1]:
import re

pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
    print("Match 1")

if re.match(pattern, "eggspamspamspamegg"):
    print("Match 2")

if re.match(pattern, "spam"):
    print("Match 3")

Match 1
Match 2


The content of groups in a match can be accessed using the group method.
A call of group(0) or group() returns the whole match.

A call of group(n), where n is greater than 0, returns the nth group from the left.
The method groups() returns a tuple of all groups, starting from group 1.

Example:

In [2]:
import re

pattern = r"a(bc)(de)(f(g)h)i"

match = re.match(pattern, "abcdefghijklmnop")
if match:
    print(match.group())
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.groups())

abcdefghi
abcdefghi
bc
de
('bc', 'de', 'fgh', 'g')


There are several kinds of special groups.
Two useful ones are named groups and non-capturing groups.
Named groups have the format (?P<name>...), where name is the name of the group, and ... is the content. They behave exactly the same as normal groups, except that they can be accessed by group(name) in addition to their number.
Non-capturing groups have the format (?:...). They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.

Example:

In [3]:
import re

pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, "abcdefghi")
if match:
    print(match.group("first"))
    print(match.groups())

abc
('abc', 'ghi')
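
Named groups become more useful when a pattern has several parts. The sketch below assumes a made-up YYYY-MM-DD date format; groupdict() is a match-object method not otherwise covered in this notebook, which returns the named groups as a dictionary.

```python
import re

# Hypothetical date format used only for illustration: YYYY-MM-DD.
pattern = r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})"

match = re.match(pattern, "2024-03-15")
if match:
    print(match.group("year"))  # 2024
    print(match.group(2))       # 03 (named groups are still numbered)
    print(match.groupdict())    # {'year': '2024', 'month': '03', 'day': '15'}
```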


In [4]:
# | (Or)

import re

pattern = r"gr(a|e)y"

match = re.match(pattern, "gray")
if match:
    print("Match 1")

match = re.match(pattern, "grey")
if match:
    print("Match 2")

match = re.match(pattern, "griy")
if match:
    print("Match 3")

Match 1
Match 2


### Special Sequences

There are various special sequences you can use in regular expressions. They are written as a backslash followed by another character.
One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17. This matches the same text that the group with that number matched.

Example:

In [7]:
import re

pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
    print ("Match 1")

match = re.match(pattern, "?! ?!")
if match:
    print ("Match 2")    

match = re.match(pattern, "abc cde")
if match:
    print ("Match 3")

Match 1
Match 2
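
A common use of a numbered backreference is spotting accidentally doubled words. The following is a minimal sketch with a made-up sentence; it also uses \b (a word boundary, described later in this notebook) so that the "is" inside "this" is not treated as a separate word.

```python
import re

text = "this is is a test of of the pattern"

# ([a-z]+) captures a word; \1 then requires the same text again after a space.
pattern = r"\b([a-z]+) \1\b"

doubled = re.findall(pattern, text)
print(doubled)  # ['is', 'of']

# In the replacement string, \1 refers to the captured word,
# so each doubled word collapses to a single copy.
cleaned = re.sub(pattern, r"\1", text)
print(cleaned)  # this is a test of the pattern
```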


More useful special sequences are \d, \s, and \w.
These match digits, whitespace, and word characters respectively.
In ASCII mode they are equivalent to [0-9], [ \t\n\r\f\v], and [a-zA-Z0-9_].
In Unicode mode, which is the default for string patterns in Python 3, they match certain other characters as well. For instance, \w matches letters with accents.
Versions of these special sequences with upper case letters - \D, \S, and \W - mean the opposite to the lower-case versions. For instance, \D matches anything that isn't a digit.

Example:

In [8]:
import re

pattern = r"(\D+\d)"

match = re.match(pattern, "Hi 999!")
if match:
    print("Match 1")

match = re.match(pattern, "1, 23, 456!")
if match:
    print("Match 2")

match = re.match(pattern, " ! $?")
if match:
    print("Match 3")

Match 1
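
The difference between Unicode and ASCII matching for \w can be sketched with the re.ASCII flag. The sample word is an arbitrary choice, and the accented letter is written as an escape so the cell does not depend on file encoding.

```python
import re

word = "caf\u00e9"  # "café", written with an escape for the accented e

# Default for str patterns in Python 3: Unicode matching,
# so the accented letter counts as a word character.
unicode_hits = re.findall(r"\w+", word)
print(unicode_hits == [word])  # True

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the match stops before it.
ascii_hits = re.findall(r"\w+", word, re.ASCII)
print(ascii_hits)  # ['caf']
```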


Additional special sequences are \A, \Z, and \b.
The sequences \A and \Z match the beginning and end of a string, respectively.
The sequence \b matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
The sequence \B matches the empty string anywhere else.

Example:

In [10]:
import re

pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
    print ("Match 1")

match = re.search(pattern, "We s>cat<tered?")
if match:
    print ("Match 2")

match = re.search(pattern, "We scattered.")
if match:
    print ("Match 3")

# The pattern \b(cat)\b matches the word "cat" surrounded by word boundaries.

Match 1
Match 2
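
The remaining sequences from this section, \B, \A, and \Z, can be sketched in the same way. The sample text is made up for illustration.

```python
import re

text = "cat concatenate cat"

# \bcat\b: "cat" as a whole word.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)  # ['cat', 'cat']

# \Bcat\B: "cat" strictly inside a longer word (here, inside "concatenate").
inside_words = re.findall(r"\Bcat\B", text)
print(inside_words)  # ['cat']

# \A and \Z anchor to the start and end of the whole string.
at_start = re.search(r"\Acat", text)
at_end = re.search(r"cat\Z", text)
print(at_start.span())  # (0, 3)
print(at_end.span())    # (16, 19)
```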


In [14]:
#Sample Input
# No #pressure, no #diamonds

#Sample Output
 #pressure
 #diamonds

import re
text = "No #pressure, no #diamonds"
#your code goes here
#use re.findall() with r"#\w+" as the regex

match = re.findall( r"#\w+", text)

for i in match:
    print(i)


#pressure
#diamonds


### Email Extraction



To demonstrate a sample usage of regular expressions, let's create a program to extract email addresses from a string.
Suppose we have a text that contains an email address:

text = "Please contact info@sololearn.com for assistance"

Our goal is to extract the substring "info@sololearn.com".
A basic email address consists of a word and may include dots or dashes. This is followed by the @ sign and the domain name (the name, a dot, and the domain name suffix).
This is the basis for building our regular expression.

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"

[\w\.-]+ matches one or more word characters, dots, or dashes.
The regex above says that the string should contain a word (with dots and dashes allowed), followed by the @ sign, then another similar word, then a dot and another word.

Our regex contains three groups:

1 - first part of the email address.

2 - domain name without the suffix.

3 - the domain suffix.

In [16]:
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
# "text" is used instead of "str" so the built-in str type is not shadowed.
text = "Please contact info@sololearn.com for assistance"

match = re.search(pattern, text)
if match:
    print(match.group())

info@sololearn.com
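
The three groups listed above can be inspected individually. This sketch reuses the same pattern and sample sentence; the variable name text is an arbitrary choice.

```python
import re

pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
text = "Please contact info@sololearn.com for assistance"

match = re.search(pattern, text)
if match:
    print(match.group(1))  # info
    print(match.group(2))  # sololearn
    print(match.group(3))  # .com
    print(match.groups())  # ('info', 'sololearn', '.com')
```

Group 2 is greedy and first takes "sololearn.com", then gives characters back until group 3 can match ".com".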


### Exercise

In [23]:
import re

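# Accepts exactly 8 characters, the first of which is 8, 1, or 9.
# Note that "." matches any character, not only digits.
pattern = r"^[819].......$"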
number = input()

match = re.search(pattern, number)
if match:
    print("Valid")
else:
    print("Invalid")

Valid
