Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
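As a rough illustration of those questions, the sketch below runs them against an invented sample string (the variable names and the text are assumptions, not part of the original notes). re.match only looks at the start of the string, re.search looks anywhere in it, and re.sub and re.split modify and split it.

```python
import re

text = "Contact: alice@example.com"  # hypothetical sample string

# "Does this string match the pattern (from its start)?"
starts_with_contact = re.match(r"Contact", text) is not None

# "Is there a match for the pattern anywhere in this string?"
found = re.search(r"\w+@\w+\.\w+", text)
address = found.group() if found else None

# Modify the string, then split it apart.
masked = re.sub(r"\w+@", "***@", text)
parts = re.split(r":\s*", text)

print(starts_with_contact, address, masked, parts)
```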

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.
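Compilation can also be done explicitly. A minimal sketch, assuming a pattern that is reused across several strings (the pattern and the sample lines are illustrative only): re.compile returns a pattern object whose methods mirror the module-level functions.

```python
import re

# Compile once, then reuse the pattern object.
digits = re.compile(r"\d+")

lines = ["room 12", "no numbers here", "floors 3 and 40"]  # made-up data
found = [digits.findall(line) for line in lines]

print(found)
```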



The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

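Identifiers (the class equivalences below hold for ASCII text; in Python 3, str patterns also match Unicode digits, whitespace and word characters unless re.ASCII is set):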

    \d > Matches any decimal digit; this is equivalent to the class [0-9].
    
    \D > Matches any non-digit character; this is equivalent to the class [^0-9].
    
    \s > Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
    
    \S > Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
    
    \w > Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
    
    \W > Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
    
    . > any character, except for a newline
    
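    \b > a word boundary: the zero-width position between a word character (\w) and a non-word character; it does not consume any whitespace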
    
    \. > a period
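A short sketch of a few of these identifiers in action, using made-up sample strings (everything named here is an assumption for illustration). Note that \b matches a position, not a character.

```python
import re

sample = "Room 42, floor_3.\tOpen 9-5"  # made-up sample text

digits = re.findall(r"\d", sample)       # single decimal digits
word_runs = re.findall(r"\w+", sample)   # runs of letters, digits and underscore
periods = re.findall(r"\.", sample)      # an escaped dot matches a literal period
any_chars = re.findall(r".", "a.b")      # an unescaped dot matches every character

# \b is zero-width: it matches at the edge of a word, so only whole words hit.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(digits, word_runs, periods, any_chars, whole_words)
```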
    
    

Modifiers:

    {1,3} > match 1 to 3 repetitions of the preceding item
    
    + > match 1 or more
    
    ? > match 0 or 1 
    
    * > match 0 or more
    
    $ > match the end of string/line
    
    ^ > match the beginning of the string/line
    
    | > either or
    
    [] > character class: match any one of the listed characters or ranges. Ex: [A-Za-z0-9]
    
    {x} > match exactly x repetitions of the preceding item
    
    ( > start of a capture group (the part of the match to extract)
    
    ) > end of a capture group
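The modifiers above can be tried one at a time. This is a small sketch with invented inputs, not an exhaustive reference; each line isolates one modifier.

```python
import re

optional = re.findall(r"colou?r", "color colour colr")    # ? : 0 or 1 'u'
zero_or_more = re.findall(r"ab*", "a ab abbb")            # * : 0 or more 'b'
one_or_more = re.findall(r"ab+", "a ab abbb")             # + : 1 or more 'b'
exactly_two = re.findall(r"\b\d{2}\b", "7 42 365")        # {2} : exactly two digits
one_to_three = re.findall(r"\d{1,3}", "7 42 365 2024")    # {1,3} : one to three digits
either = re.findall(r"cat|dog", "cat, dog, bird")         # | : either alternative
first = re.findall(r"^\w+", "first second")               # ^ : start of the string
last = re.findall(r"\w+$", "first second")                # $ : end of the string
char_class = re.findall(r"[A-Za-z0-9]+", "abc-123_X")     # [] : any listed character
captured = re.findall(r"id=(\d+)", "id=7 id=19")          # () : return only the group

print(one_to_three, captured)
```

Note how {1,3} splits 2024 into 202 and 4: the quantifier caps each match at three digits, it does not reject longer numbers.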

White Space Characters:

    \n > new line
    
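    \s > any whitespace character (space, tab, newline, etc.); use a literal space to match only a space
    
    \t > tab
    
    \e > escape in some other regex flavors; Python's re module does not support it, use \x1b instead
    
    \f > form feed
    
    \r > carriage return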
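A quick sketch of how these behave in Python, with an invented sample string. The escape character is written \x1b here, since Python's re module rejects \e in patterns.

```python
import re

text = "one two\tthree\nfour"  # made-up text with a space, a tab and a newline

tokens = re.split(r"\s+", text)          # \s covers every kind of whitespace
spaces_only = re.findall(r" ", text)     # a literal space matches spaces only
all_whitespace = re.findall(r"\s", text)
tabs = re.findall(r"\t", text)

# The escape character (0x1b), as found in terminal color codes.
escapes = re.findall(r"\x1b", "\x1b[0mplain")

print(tokens, all_whitespace)
```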

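DON'T FORGET! These characters have a special meaning in a pattern; put a backslash in front of them to match them literally: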

. + * ? [ ] $ ^ ( ) {} | \
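When the text to match comes from somewhere else, escaping by hand is error-prone. A minimal sketch using re.escape from the standard library (the sample strings are invented for illustration):

```python
import re

literal = "3.5+2 (approx)"           # contains ., + and ( ) metacharacters
haystack = "total: 3.5+2 (approx)"   # hypothetical text to search

# Used as-is, the metacharacters are interpreted and the literal text is not found.
unescaped_hit = re.search(literal, haystack)

# re.escape backslash-escapes the metacharacters so the text matches literally.
escaped_hit = re.search(re.escape(literal), haystack)

print(unescaped_hit, escaped_hit.group())
```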
    

In [41]:
import re

In [42]:
sampleString = '''Jessica is 15 years old, and Daniel is 27 years old. 
Edward is 97 years old, and his grandfather, Oscar, is 102.'''

In [43]:
ages = re.findall(r'\d{1,3}', sampleString) #finding digits, 1 to 3 in length

In [12]:
names = re.findall(r'[A-Z][a-z]*', sampleString) #finding the names (a capital letter followed by lowercase letters)

In [13]:
ages, names

(['15', '27', '97', '102'], ['Jessica', 'Daniel', 'Edward', 'Oscar'])

In [22]:
# pair each name with the age found at the same position
ageDict = dict(zip(names, ages))

In [23]:
ageDict

{'Daniel': '27', 'Edward': '97', 'Jessica': '15', 'Oscar': '102'}
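Pairing two separate lists by position only works while both lists stay aligned. One alternative, sketched here under the assumption that every age follows its name as "Name is N" or "Name, is N", captures both values with a single pattern (the pattern is an illustration, not part of the original notebook):

```python
import re

sample = '''Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 years old, and his grandfather, Oscar, is 102.'''

# Two capture groups: findall returns one (name, age) tuple per match.
pairs = re.findall(r'([A-Z][a-z]*),? is (\d{1,3})', sample)

# Build the dictionary directly, converting ages to integers.
age_by_name = {name: int(age) for name, age in pairs}

print(age_by_name)
```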

In [64]:
#extracting the numbers from the string

input_x = 'My 2 favorite numbers are 93 and 100'

op_int = re.findall('[0-9]+',input_x)

op_int

['2', '93', '100']

In [59]:
#finding uppercase vowels; [AEIOU] matches single characters, not whole words, and input_x contains none
op_word = re.findall('[AEIOU]', input_x) 

op_word

[]
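To match whole words that begin with a vowel, one possible pattern anchors the vowel at a word boundary and then consumes the rest of the word. This is a sketch; the pattern and the choice of re.IGNORECASE are assumptions added for illustration.

```python
import re

input_x = 'My 2 favorite numbers are 93 and 100'

# \b pins the vowel to the start of a word; \w* takes the rest of the word.
vowel_words = re.findall(r'\b[aeiou]\w*', input_x, flags=re.IGNORECASE)

print(vowel_words)
```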

In [67]:
# Greedy Matching

inp = 'From: Using the: character'

greedy_op = re.findall('^F.+:', inp)

greedy_op

['From: Using the:']

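We expected the match to stop at the first colon, but + and * are greedy: they extend the match as far as possible, so it runs to the last colon. Adding ? after the quantifier (+? or *?) makes it non-greedy, so it stops as early as possible.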

In [68]:
# Non Greedy Matching

non_greedy_op = re.findall('^F.+?:', inp)

non_greedy_op

['From:']
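The same difference shows up when matching tags. A small sketch with an invented HTML fragment, including a negated character class as another way to stop at the first closing bracket:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # made-up fragment

greedy = re.findall(r"<.+>", html)       # runs to the last '>'
lazy = re.findall(r"<.+?>", html)        # stops at the first '>'
negated = re.findall(r"<[^>]+>", html)   # cannot cross a '>' at all

print(greedy, lazy, negated)
```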

In [80]:
#extracting email address

email_adr = 'From test@gmail.com Sat Jan 5 10:24:32 2017'

email = re.findall(r'^From (\S+@\S+)', email_adr) # raw string so \S reaches the regex engine unchanged; only the part in ( ) is returned

email

['test@gmail.com']

In [81]:
email = re.findall(r'\S+@\S+', email_adr)

email

['test@gmail.com']

In [85]:
#extracting the host name

host = re.findall('@([^ ]*)', email_adr) # [^ ] means non-blank characters after @

host

['gmail.com']
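Both parts of the address can be pulled out in one pass with named groups. A minimal sketch, assuming the same sample line; the group names user and host are chosen here for illustration.

```python
import re

email_adr = 'From test@gmail.com Sat Jan 5 10:24:32 2017'

# (?P<name>...) labels a capture group so it can be read back by name.
match = re.search(r'(?P<user>[^@\s]+)@(?P<host>\S+)', email_adr)

user = match.group('user') if match else None
host = match.group('host') if match else None

print(user, host)
```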

In [88]:
#Escape Character

inp = 'We just received $10.00 for cookies'

op = re.findall(r'\$[0-9.]+', inp) #treating $ as a regular character here using \

op

['$10.00']

In [75]:
#pulling data from site and extracting using re
import urllib.request
import urllib.parse
import re

In [30]:
url = 'http://pythonprogramming.net'

In [33]:
values = {'s':'basics',
        'submit':'search'}

In [34]:
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)

respData = resp.read()

In [35]:
respData

b'<html>\n\t<head>\n\t\t\n\t\t<!-- \n\t\tpalette:\n\t\tdark blue: #003F72\n\t\tyellow: #FFD166\n\t\tsalmon: #EF476F\n\t\toffwhite: #e7d7d7\n\t\tLight Blue: #118AB2\n\t\tLight green: #7DDF64\n\t\t-->\n\n\t\t<meta name="viewport" content = "width=device-width, initial-scale=1.0">\n\t\t<title>Python Programming Tutorials</title>\n\n\t\t<meta name="description" content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free.">\n\n\t\t<link rel="shortcut icon" href="/static/favicon.ico">\n\t\t<link rel="stylesheet" href="/static/css/materialize.min.css">\n        <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">\n        <meta name="google-site-verification" content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" />\n        <link rel="stylesheet" type="text/css" href="/static/css/bootstrap.css">\n\n\t\t\n\t\t  <!-- Compiled and minified CSS -->\n\n\t\t<!-- Compiled and minified JavaScri

In [93]:
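# str(respData) gives the repr of the bytes (b'...'), so newlines and tabs show up as literal \n and \t in the matches;
# respData.decode('utf-8') would give the actual text instead
paragraphs = re.findall(r'<p>(.*?)</p>', str(respData))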

In [94]:
for eachP in paragraphs:
    print(eachP)

Learn how to use Python with Pandas, Matplotlib, and other modules to gather insights from and about your data.
Control hardware with Python programming and the Raspberry Pi.
How to develop websites with either the Flask or Django frameworks for Python.
Create your own games with Python\'s PyGame library, or check out the multi-platform Kivy.
Learn the basic and intermediate Python fundamentals.
Just getting started?
Not a problem, learn the basics of programming with Python 3 here!
Create software with a user interface using Tkinter, PyQt, or Kivy.
\n\t\t\t\t\t\t<a href="#" class="btn btn-flat white modal-close">Cancel</a> &nbsp;\n\t\t\t\t\t\t<a href="#" class="waves-effect waves-blue blue btn btn-flat modal-action modal-close">Login</a>\n\t\t\t\t\t
\n\t\t\t\t\t\t\t\t<a href="#" class="btn btn-flat white modal-close">Cancel</a> &nbsp;\n\t\t\t\t\t\t\t\t<button class="btn" type=submit value=Register>Sign Up</button>\n\t\t\t\t\t\t\t
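When the bytes are decoded into real text, newlines inside a paragraph become actual newline characters, which the dot does not match by default. A sketch of that case on a small invented byte string (no network access; the sample markup is an assumption), using re.DOTALL so that . also matches newlines:

```python
import re

raw = b"<p>First\nparagraph</p><p>Python's second</p>"  # made-up response body

html = raw.decode("utf-8")  # real text, with real newlines and quotes

# Without DOTALL the first paragraph is skipped, because '.' stops at the newline.
single_line_only = re.findall(r"<p>(.*?)</p>", html)

# With DOTALL, '.' also matches newlines, so both paragraphs are found.
paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)

for each_p in paragraphs:
    print(each_p)
```

For anything beyond quick extraction, a real HTML parser is usually the more robust choice; the regex here is only meant to show the flag.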


In [98]:
#splitting words using re

# \s+ rather than \s*: a pattern that can match the empty string also splits between characters on Python 3.7+
print(re.split(r'(\s+)', 'here are some words'))

['here', ' ', 'are', ' ', 'some', ' ', 'words']
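The parentheses in the split pattern are what keep the separators in the result. A short sketch of the variations, with invented inputs:

```python
import re

sentence = 'here are some words'

with_separators = re.split(r'(\s+)', sentence)    # capture group: separators are kept
words_only = re.split(r'\s+', sentence)           # no group: separators are dropped
first_split = re.split(r'\s+', sentence, maxsplit=1)

# Any pattern can act as the delimiter, e.g. a comma or semicolon plus optional spaces.
fields = re.split(r'[,;]\s*', 'a, b;c')

print(with_separators, words_only, first_split, fields)
```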
