#### Manipulating Text with Regular Expressions

Regular expressions, or regexes, are written in a condensed formatting language. A regex can be thought of as a pattern that is given to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation.

There are three main reasons why this is done:
- to check whether a pattern exists within some source data
- to get all instances of a complex pattern from some source data
- to clean your source data using a pattern, generally through string splitting

Regexes are not trivial, but they are a foundational technique for data cleaning in data science, and a solid understanding of them will help you manipulate text data quickly and efficiently.

In [1]:
#First, we'll import the re module, which is Python's standard library for regular expressions
import re

There are several main processing functions in re that you might use:
1. match() - checks for a match at the beginning of the string and returns a Match object, or None if there is no match.
2. search() - checks for a match anywhere in the string and returns a Match object, or None if there is no match.
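
As a minimal sketch of the difference between the two functions (the sample string is just an illustration): both return a Match object on success and None otherwise, which is why either can be used directly in an if statement.

```python
import re

sample = 'This is a good day.'  # illustrative sample text

# match() only succeeds if the pattern matches at the very start of the string
at_start = re.match('good', sample)

# search() scans the whole string and stops at the first location that matches
anywhere = re.search('good', sample)

print(at_start)   # None, because the string starts with 'This'
print(anywhere)   # a Match object describing where 'good' was found
```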

In [2]:
# Let's create some text for an example
text = 'This is a good day.'

# Now, let's see if it's a good day or not
if re.search('good', text): # the first parameter here is the pattern
    print('Wonderful')
else:
    print('Alas :(')

Wonderful


In [3]:
# The findall() and split() functions will parse the string for us and return chunks. Let's see an example
text = 'Amy works diligently. Amy gets good grades. Our student Amy is successful.'

# Let's split this on all instances of Amy
re.split('Amy', text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']

In addition to checking for conditionals, we can segment a string. The work that regex does here is called tokenizing, where the string is separated into substrings based on patterns.

In [4]:
# If we wanted to count how many times we have talked about Amy, we could use findall()
re.findall('Amy', text)

['Amy', 'Amy', 'Amy']

We've seen that:
- .search() - looks for some pattern and returns a Match object, or None if nothing is found
- .split() - uses a pattern to create a list of substrings
- .findall() - looks for a pattern and pulls out all occurrences

In [5]:
#The regex specification standard defines a markup language to describe patterns in text. Let's start with anchors.

#Anchors specify the start and/or end of the string that you are trying to match.

#The caret character ^ means start and the dollar sign character $ means end.
#If you put ^ before a string, it means that the text the regex processor retrieves must start with the string you specify.
#For ending, you put the $ character after the string; it means that the text the regex processor retrieves must end with the string you specify.

In [6]:
# Here's an example
text = 'Amy works diligently. Amy gets good grades. Our student Amy is successful.'

# Let's see if this begins with Amy
re.search('^Amy', text)

<re.Match object; span=(0, 3), match='Amy'>

Notice that re.search() actually returned a new object, called an re.Match object. An re.Match object always has a boolean value of True, since something was found, so you can always evaluate it in an if statement as we did earlier.

The rendering of the match object also tells you what was matched, in this case the word Amy, and the location of the match, as the span.
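
The pieces shown in that rendering can also be read off the Match object directly. The following is a small sketch using a shortened version of the sample text; it also shows the $ anchor, which the cell above does not exercise.

```python
import re

text = 'Amy works diligently. Amy gets good grades.'
m = re.search('^Amy', text)

if m:                          # a Match object is truthy, None is falsy
    print(m.group())           # the text that was matched
    print(m.span())            # (start, end) positions as a tuple
    print(m.start(), m.end())  # the same positions individually

# The dollar sign anchors the pattern at the end of the string instead
end_match = re.search(r'grades\.$', text)
print(end_match.group())
```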

#### Patterns and Character Classes

In [7]:
#Let's create a string of a single learner's grades over a semester in one course across all of their assignments.
grades = 'ACAAAABCBCBAA'

#If we want to answer the question "How many B's were in the grade list?" we would just use B
re.findall('B', grades)

['B', 'B', 'B']

If we wanted to count the number of A's or B's in the list, we can't use 'AB', since this matches every A followed immediately by a B. Instead, we put the characters A and B inside square brackets.

In [8]:
re.findall('[AB]', grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

In [9]:
#This is called the set operator. You can also include a range of characters, which are ordered alphanumerically. For instance, if we want to refer to all lowercase letters we could use [a-z].
#Let's build a simple regex to parse out all instances where this student received an A followed by a B or a C
re.findall('[A][B-C]', grades)

['AC', 'AB']

In [10]:
# Notice how the [AB] pattern describes a set of possible characters which could be either (A OR B), while the [A][B-C] pattern denotes two sets of characters which must be matched back to back.
# You can also write this pattern by using the pipe operator, which means OR
re.findall('AB|AC', grades)

['AC', 'AB']

In [11]:
# We can use the caret with the set operator to negate our results. For instance, if we want to parse out only the grades which were not A's
re.findall('[^A]', grades)

['C', 'B', 'C', 'B', 'C', 'B']

In [12]:
# Note this carefully - the caret was previously used as an anchor to match the beginning of a string, but inside the set operator the caret, and the other special characters we will be talking about, change their meaning. This can be quite confusing.
# Take a look at this
re.findall('^[^A]', grades)

[]

In [13]:
# It is an empty list because the regex says that we want to match any value at the beginning of the string which is not an A.
# Our string, though, starts with an A, so there is no match found.
# Remember, when you are using the set operator, you are doing character-based matching. So you are matching individual characters in an OR fashion

#### Quantifiers

Quantifiers specify the number of times a pattern must be matched. The most basic quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum number of times you want it matched, and n is the maximum number of times the item could be matched.

In [14]:
# Let's use these grades as an example. How many times has this student been on a back-to-back A's streak?
re.findall('A{2,10}', grades) # we'll use 2 as our min, but ten as our max.

['AAAA', 'AA']

In [15]:
# So we see that there were two streaks, one where the student had four A's, and one where they had only two A's

#We might try and do this using single values and just repeating the pattern
re.findall('A{1,1}A{1,1}', grades)

['AA', 'AA', 'AA']

The first pattern is looking for any run of two to ten A's in a row, so it sees four A's as a single streak. The second pattern is looking for exactly two A's back to back, so it sees two A's followed immediately by two more A's. We say that the regex processor begins at the start of the string and consumes characters which match the pattern as it goes.
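
To make the idea of consuming characters concrete, here is a small sketch that prints where each match sits in the grades string. Once characters are part of one match, they cannot be reused by the next one.

```python
import re

grades = 'ACAAAABCBCBAA'

# finditer() yields Match objects, so we can see the position of every match
streaks = [m.span() for m in re.finditer('AA', grades)]
print(streaks)

# Three A's in a row fit only once into the run of four: after 'AAA' is
# consumed, the single remaining A is not enough to start another match
triples = re.findall('AAA', grades)
print(triples)
```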

It is important to note that the regex quantifier syntax does not allow you to deviate from the {m,n} pattern.

In [16]:
# In particular, if you have an extra space in between the braces, the braces are treated as literal text and you'll get an empty result.
re.findall('A{2, 2}', grades)

[]

In [17]:
# As we have already seen, if we don't include a quantifier then the default is {1,1}
re.findall('AA', grades)

['AA', 'AA', 'AA']

In [18]:
# If you just have one number in the braces, it's considered to be both m and n
re.findall('A{2}', grades)

['AA', 'AA', 'AA']

In [19]:
# Using this, we could find a decreasing trend in a student's grades
re.findall('A{1,10}B{1,10}C{1,10}', grades)

['AAAABC']

That's a bit of a hack, because we included a maximum that was just arbitrarily large. There are three other quantifiers that are used as shorthand:
- an asterisk * to match 0 or more times,
- a question mark ? to match 0 or 1 times,
- or a plus sign + to match 1 or more times.
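
A quick sketch of the three shorthand quantifiers applied to the same grades string used above:

```python
import re

grades = 'ACAAAABCBCBAA'

# + means one or more, so no arbitrary upper bound is needed
decreasing = re.findall('A+B+C+', grades)

# ? means zero or one, so the B after each A is optional
a_optional_b = re.findall('AB?', grades)

# * means zero or more, so a C with no B after it still matches
c_then_bs = re.findall('CB*', grades)

print(decreasing)
print(a_optional_b)
print(c_then_bs)
```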

In [20]:
# Let's look at a more complex example, and load some data scraped from Wikipedia
with open(r"C:\Users\user\Desktop\ferpa.txt.txt") as file:
    # we'll read that into a variable called wiki
    wiki = file.read()
# Let's print that variable out to the screen
wiki

'Overview[edit]\nFERPA gives parents access to their child\'s education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student\'s consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.\n\nOther regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student\'s personally identifiable information without the student\'s consent.[2]\n\nExamples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student\'s grades o

In [21]:
#Scanning through this document, one of the things we notice is that the headers all have the text [edit] behind them, followed by a newline character.
#So, if we wanted to get a list of all of the headers in this article, we could do so using re.findall
re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [22]:
#That didn't quite work. It got all of the headers, but only the last word of each header, and it really was quite clunky.
#Let's iteratively improve this. First, we can use \w to match any letter, digit, or underscore.
re.findall(r"[\w]{1,100}\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

This is something new. \w is a metacharacter, and indicates a special pattern of any letter, digit, or underscore.
There are actually a number of different metacharacters listed in the documentation. For instance, \s matches any whitespace character.

In [23]:
# Next, there are three other quantifiers we can use which shorten up the curly brace syntax. We can use an asterisk * to match 0 or more times, so let's try that.
re.findall(r"[\w]*\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [24]:
# Now that we have shortened the regex, let's improve it a little bit. We can add in a space using the space character
re.findall(r"[\w ]*\[edit\]", wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

In [25]:
# This gets us the list of section titles in the Wikipedia page! You can create a list of titles by iterating through this and applying another regex
for title in re.findall(r"[\w ]*\[edit\]", wiki):
    # Now we will take that intermediate result and split on the square bracket [, just taking the first result
    print(re.split(r"[\[]", title)[0])

Overview
Access to public records
Student medical records


#### Groups

To this point, we have been talking about a regex as a single pattern which is matched. But you can actually match different patterns, called groups, at the same time, and then refer to the groups you want. To group patterns together, you use parentheses, which is actually pretty natural.

In [26]:
# Let's rewrite our findall using groups
re.findall(r"([\w ]*)(\[edit\])", wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

We see that the Python re module breaks out the results by group. We can also refer to groups by number using the match objects that are returned.

Thus far, we've seen that findall() returns strings, and search() and match() return individual Match objects. But what do we do if we want a list of Match objects?

In [27]:
# In this case, we use the function finditer()
for item in re.finditer("([\w ]*)(\[edit\])", wiki):
    print(item.groups())

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')


We see here that the groups() method returns a tuple of the groups.
We can get an individual group using group(number), where group(0) is the whole match, and each other number is the portion of the match we are interested in.

In [28]:
# In this case, we want group(1)
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.group(1))

Overview
Access to public records
Student medical records


One more feature of regex groups, which is rarely used but a good idea, is labeling or naming groups. In the previous example, we saw how we can use the position of the group. But giving groups a label and looking at the results as a dictionary is pretty useful. For that we use the syntax (?P<name>), where the parenthesis starts the group, the ?P indicates that this is an extension to basic regexes, and <name> is the dictionary key we want to use, wrapped in <>.

In [29]:
for item in re.finditer("(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

Overview
Access to public records
Student medical records


In [30]:
# We can print out the whole dictionary for the item too, and see that the [edit] string is still in there. 
# Here's the dictionary kept for the last match
print(item.groupdict())

{'title': 'Student medical records', 'edit_link': '[edit]'}


We have seen how we can match individual character patterns with [], how we can group matches together using (), and how we can use quantifiers such as *, ?, or {m,n} to describe patterns. Something we glossed over in the previous example was the \w, which stands for any word character.
There are a number of shorthands which are used with regexes for different kinds of characters, including:
- a . for any single character which is not a newline
- a \d for any digit
- and \s for any whitespace character, like spaces and tabs.

There are more, and a full list can be found in the Python documentation for regexes.
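
The shorthands listed above can be tried out on a short string. This is only a sketch; the sample line is made up for illustration.

```python
import re

line = 'Room 101\tis on floor 3.'   # illustrative sample text containing a tab

digits = re.findall(r'\d+', line)     # runs of digits
words = re.findall(r'\w+', line)      # runs of word characters
spaces = re.findall(r'\s', line)      # each individual whitespace character
first3 = re.findall(r'^.{3}', line)   # any three characters at the start

print(digits)
print(words)
print(spaces)
print(first3)
```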

#### Look-ahead and Look-behind

One more concept to be familiar with is called "look ahead" and "look behind" matching. In this case, the pattern being given to the regex engine is for text either before or after the text we are trying to isolate. 

For example, in our headers, we want to isolate the text which comes before the [edit] rendering, but we don't actually care about the [edit] text itself. Thus far, we have been throwing the [edit] away, but if we want to use it to match without capturing it, we can put it in a group and use look ahead with the ?= syntax.

In [31]:
for item in re.finditer(r"(?P<title>[\w ]+)(?=\[edit\])", wiki):
    # what this regex says is match two groups: the first is named title and contains one or more
    # spaces or word characters; the second is the text [edit], which must follow the title but
    # is not included in the output match objects
    print(item)

<re.Match object; span=(0, 8), match='Overview'>
<re.Match object; span=(2715, 2739), match='Access to public records'>
<re.Match object; span=(3692, 3715), match='Student medical records'>
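
Look-behind works the same way in the other direction, using the (?<=...) syntax. The sketch below uses a made-up sentence rather than the FERPA text: it pulls out numbers that come right after a dollar sign, without including the dollar sign in the match.

```python
import re

# Illustrative sample text; prices are preceded by a dollar sign
text = 'Coffee costs $4 and a sandwich costs $11, for 2 people.'

# (?<=\$) asserts that a dollar sign comes right before the digits,
# but the dollar sign itself is not part of the returned match
prices = re.findall(r'(?<=\$)\d+', text)
print(prices)
```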


#### Example: Wikipedia Data

In [63]:
# Let's look at some more Wikipedia data. Here's some data on universities in the US which are Buddhist-based
# The file is UTF-16 encoded (it starts with a byte order mark), so we pass the encoding explicitly
with open(r"C:\Users\user\Desktop\buddhist.txt.txt", encoding="utf-16") as file:
    # we'll read that into a variable called wiki
    wiki = file.read()
# Let's print that variable out to the screen
wiki

'Buddhist universities and colleges in the United States\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigationJump to search\n\n\n\nThis article needs additional citations for verification. Please 

We can see that each university follows a fairly similar pattern, with the name followed by a dash, then the words "located in", followed by the city and the state.

We will use this example to show the verbose mode of Python regexes. The verbose mode allows us to write multi-line regexes and increases readability. In this mode, we have to explicitly indicate all whitespace characters, either by prepending them with a \ or by using the \s special value. However, this means we can write our regex a bit more like code, and can even include comments with #.

In [65]:
pattern = r"""
(?P<title>.*)      #the university title
(-\ located\ in\ ) #an indicator of the location
(?P<city>\w*)      #city the university is in
(,\ )              #separator for the state
(?P<state>\w*)     #the state the city is located in"""

# Now when we call finditer(), we just pass the re.VERBOSE flag as the last parameter. This makes it much
# easier to understand large regexes!
for item in re.finditer(pattern,wiki,re.VERBOSE):
    # we can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())
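
Because the cell above depends on a local file, here is a self-contained sketch of verbose mode on two made-up lines in the same "name - located in city, state" shape. The institution names are hypothetical. The pattern is also a slight variation on the one above, and both changes are assumptions rather than part of the original: the title is non-greedy so it stops before the separator, and the city is allowed to contain spaces.

```python
import re

# Hypothetical sample lines, invented for illustration only
sample = (
    "Lotus Example College - located in Springfield, Oregon\n"
    "River Valley Institute - located in Lakeside, Ohio\n"
)

pattern = r"""
(?P<title>.*?)        # the university title, as short as possible
\ -\ located\ in\     # literal separator, with every space escaped
(?P<city>[\w ]+)      # city name, spaces allowed inside the set
,\                    # comma and escaped space before the state
(?P<state>\w+)        # the state the city is located in
"""

rows = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
for row in rows:
    print(row)
```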

#### Example: New York Times and Hashtags

In [72]:
# Here's another example from the New York Times which covers health tweets on news items.
# This data came from the UC Irvine Machine Learning Repository, which is a great source of different kinds of data
# Note the raw string (r"..."): without it, \U in the Windows path is read as a unicode escape and raises a SyntaxError
with open(r"C:\Users\user\Desktop\nytimeshealth.txt", "r") as file:
    # we'll read everything into a variable and take a look at it
    health = file.read()
health

We can see that there are tweets with fields separated by pipes |. Let's try and get a list of all of the hashtags that are included in this data. A hashtag begins with a pound sign (or hash mark) and continues until some whitespace is found.

So let's create a pattern. We want to include the hash sign first, then any number of alphanumeric characters. And we end when we see some whitespace.

In [None]:
pattern = r'#[\w\d]*(?=\s)'
# Notice that the ending is a look ahead. We're not actually interested in matching whitespace in the return value.
# Also notice that we used an asterisk * instead of the plus +, so a lone # followed by whitespace
# is matched as well; a + would require at least one word character or digit after the #.

# Let's search and display all of the hashtags
re.findall(pattern, health)

In [None]:
# We can see here that there were lots of Ebola-related tweets in this particular dataset
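
Since the dataset itself is not loaded in the cells above, here is a sketch of the same hashtag idea on two made-up, pipe-separated lines (the IDs, dates, and text are hypothetical). It also shows the practical difference between * and + after the hash sign.

```python
import re

# Hypothetical tweet lines in an "id|date|text" layout
sample = (
    "1001|Mon Dec 29|Health workers respond #ebola #health \n"
    "1002|Tue Dec 30|Item # 5 on the list #flu\n"
)

# With *, a bare '#' followed by whitespace also counts as a match
with_star = re.findall(r'#\w*(?=\s)', sample)

# With +, at least one word character must follow the '#'
with_plus = re.findall(r'#\w+(?=\s)', sample)

print(with_star)
print(with_plus)
```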

#### Python Regular Expressions (RegEx)

##### Why use Regular Expressions?
1. To identify the pattern to get date and time from a log file.
2. To verify real and fake e-mail addresses.
3. To verify phone numbers and find the country to which it belongs.
4. To find a particular string from student data.
5. It is supported, with minor syntax differences, by most programming languages, including Python, Java, Ruby, PHP, Swift, C#, Groovy, Scala, and JavaScript.
6. It is used for web scraping.

##### What is RegEx?

RegEx is a special text string for describing a search pattern.

##### Various operations to perform with RegEx.
1. Find a word in a string.
2. Generate an iterator.
3. Match one of any of several letters.
4. Match series of range of characters.
5. Replace string.
6. Match a single character.

In [33]:
import re

In [34]:
# To identify the pattern to get name and age
Nameage = '''
Janice is 22 and Theon is 33
Gabriel is 44 and Joey is 21
'''

ages = re.findall(r'\d{1,3}', Nameage)
names = re.findall(r'[A-Z][a-z]*', Nameage)

ageDict = {}

x = 0

for eachname in names:
    ageDict[eachname] = ages[x]
    x+=1
print(ageDict)

{'Janice': '22', 'Theon': '33', 'Gabriel': '44', 'Joey': '21'}
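
An alternative sketch for the same task uses two capture groups, so that each name is paired with its own age in a single pass instead of relying on two separate lists lining up. The variable names here are illustrative, and the ages are converted to integers.

```python
import re

name_age = '''
Janice is 22 and Theon is 33
Gabriel is 44 and Joey is 21
'''

# With two groups in the pattern, findall() returns (name, age) tuples
pairs = re.findall(r'([A-Z][a-z]*) is (\d{1,3})', name_age)

# Build the dictionary directly from the pairs
age_dict = {name: int(age) for name, age in pairs}
print(age_dict)
```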


##### Cursor operations of RegEx

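Both the string and the RegEx have their own cursor: the engine tracks a position in the pattern and a position in the string, and advances both as characters match.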

#### Various operations to perform with RegEx.

##### 1. Find a word in a string.

In [35]:
if re.search("inform", "we need to inform him with the latest information"):
    print("There is inform")

There is inform


In [36]:
allinform = re.findall("inform", "we need to inform him with the latest information")
for i in allinform:
    print(i)

inform
inform


##### 2. Generate an iterator.

In [39]:
# we want the starting and the ending index of the matching object
str = "we need to inform him with the latest information"

for i in re.finditer("inform", str):
    loctup = i.span()
    print(loctup)

(11, 17)
(38, 44)


##### 3. Match one of any of several letters.

In [40]:
# matching words with a particular pattern
str = "Sat, hat, mat, pat"
allstr = re.findall("[shmp]at", str)
for i in allstr:
    print(i)
#note that this will print only the words that start with a lowercase letter

hat
mat
pat


In [41]:
# To print out all words including the uppercase
str = "Sat, hat, mat, pat"
allstr = re.findall("[Shmp]at", str)
for i in allstr:
    print(i)

Sat
hat
mat
pat


##### 4. Match series of range of characters.

In [42]:
# To print all words whose first letter falls in the range h-m
str = "Sat, hat, mat, pat"
allstr = re.findall("[h-m]at", str)
for i in allstr:
    print(i)

hat
mat


In [43]:
# if we want all the strings to be printed
str = "Sat, hat, mat, pat"
allstr = re.findall("[h-z]at", str)
for i in allstr:
    print(i) # note that Sat is not printed because its S is uppercase

hat
mat
pat


In [44]:
# making all words in the string lowercase
str = "sat, hat, mat, pat"
allstr = re.findall("[h-z]at", str)
for i in allstr:
    print(i)

sat
hat
mat
pat


In [45]:
# using a caret symbol ^ inside the set to exclude all the words with first letters between h-m
# with the caret symbol, every match whose first character is outside the range h-m will be printed
str = "sat, hat, mat, pat"
allstr = re.findall("[^h-m]at", str)
for i in allstr:
    print(i)

sat
pat


##### 5. Replace string.

In [46]:
#To replace rat with the word food 
food = "hat rat mat pat"

# a compiled pattern object provides us with additional methods, one of which is sub() for substitution

regex = re.compile("[r]at")
food = regex.sub("food", food)
print(food)

hat food mat pat
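
For one-off replacements, the module-level re.sub() can be used without compiling first. The sketch below also shows the count parameter, which limits how many matches are replaced; the sample string is made up to contain two occurrences.

```python
import re

food = 'hat rat mat rat'   # illustrative sample with two occurrences of 'rat'

# Replace every match
all_replaced = re.sub('rat', 'food', food)

# Replace only the first match
first_only = re.sub('rat', 'food', food, count=1)

print(all_replaced)
print(first_only)
```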


##### Solving the backslash \\ problem

In [47]:
randstr = "here is \\drogba"
print(randstr)

here is \drogba


Note that the output printed has only one backslash \ instead of double.

In [48]:
# Making use of RegEx to solve this.
# In a raw string (r"..."), Python leaves backslashes untouched, so the regex \\ matches one literal backslash

randstr = "here is \\drogba"
print(re.search(r"\\drogba", randstr))

<re.Match object; span=(8, 15), match='\\drogba'>
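
To see why the raw string helps, this sketch writes the same pattern with and without it. Both versions find the same match.

```python
import re

randstr = 'here is \\drogba'   # the string itself holds ONE backslash

# Without a raw string, the literal backslash has to be written four times:
# Python turns '\\\\' into two backslashes, and the regex engine then
# reads those two as one literal backslash
plain = re.search('\\\\drogba', randstr)

# A raw string skips the first round of escaping
raw = re.search(r'\\drogba', randstr)

print(plain.span(), raw.span())
```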


In [49]:
# To deal with newlines
randstr = '''
Keep the blue flag 
flying high
Chelsea
'''
print(randstr)


Keep the blue flag 
flying high
Chelsea



In [50]:
# Replacing each newline with a space

regex = re.compile("\n")
randstr = regex.sub(" ", randstr)
print(randstr)

 Keep the blue flag  flying high Chelsea 


In [51]:
# Other escape sequences you may come across in strings
#\b: backspace (in a regex pattern, \b means a word boundary instead)
#\f: formfeed
#\r: carriage return
#\t: Tab
#\v: Vertical tab

##### 6. Match a single character.

In [52]:
#To count the digits: \d matches any single digit
randstr = "12345"
print("Matches:", len(re.findall(r"\d", randstr)))

Matches: 5


In [53]:
# making the d uppercase D will match anything but digits
randstr = "12345"
print("Matches:", len(re.findall("\D", randstr)))

Matches: 0


In [54]:
# To match exactly five digits in a row
randstr = "12345"
print("Matches:", len(re.findall(r"\d{5}", randstr)))

Matches: 1


In [55]:
# To match runs of digits within a certain range of lengths. Here we look for runs of 5 to 7 digits
num = "123 1234 12345 123456 1234567"
print("Matches:", len(re.findall(r"\d{5,7}", num)))

Matches: 3
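
One thing to watch for with counted quantifiers is that a pattern like \d{5} will also match inside a longer run of digits. The sketch below compares that with a version that uses word boundaries to keep only runs of exactly five digits.

```python
import re

num = '123 1234 12345 123456 1234567'

# Without boundaries, \d{5} also matches the first five digits of longer runs
loose = re.findall(r'\d{5}', num)

# \b on both sides keeps only runs that are exactly five digits long
exact = re.findall(r'\b\d{5}\b', num)

print(loose)
print(exact)
```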


#### Applications of Regular Expressions

#### Verifying phone numbers

In this example, a valid phone number has:
1. 3 starting digits and a '-' sign
2. 3 middle digits and a '-' sign
3. 4 digits at the end.

E.g. 444-122-1234

In [56]:
# \w is shorthand for [a-zA-Z0-9_]: it matches any letter, digit, or underscore
# \W is shorthand for [^a-zA-Z0-9_]: it matches any character that \w does not, like using ^ inside a set

phn = "412-555-1212"
if re.search(r"\w{3}-\w{3}-\w{4}", phn): # 3 word characters, a hyphen, 3 word characters, a hyphen, then 4 word characters
    print("it is a phone number")
    
# If we changed the phone number to 412-5551-1212, nothing would be printed, as it does not meet the stated specification

it is a phone number


In [57]:
#If we replace \w with \d, the pattern matches digits only and gives the same output for this string. A string that does not meet the specification prints nothing.
phn = "412-555-1212"
if re.search("\d{3}-\d{3}-\d{4}", phn):
    print("it is a phone number")

it is a phone number
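
Note that re.search() succeeds if the pattern appears anywhere in the string, so it accepts strings with extra characters around a valid number. For validation, a common approach is re.fullmatch(), which requires the whole string to match. The helper name below is hypothetical and only serves the sketch.

```python
import re

def is_phone_number(candidate):
    """Return True if candidate is exactly of the form ddd-ddd-dddd."""
    return re.fullmatch(r'\d{3}-\d{3}-\d{4}', candidate) is not None

valid = is_phone_number('412-555-1212')
too_long = is_phone_number('412-555-12123')

# search() still finds a phone-shaped piece inside the longer string
found_inside = re.search(r'\d{3}-\d{3}-\d{4}', '412-555-12123') is not None

print(valid, too_long, found_inside)
```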


In [58]:
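# \s is shorthand for [ \f\n\r\t\v]: it matches any whitespace character
# \S is shorthand for [^ \f\n\r\t\v]: it matches any character that is not whitespace

# To see if a full name is valid or not
# Note: no space inside the braces, and a lowercase \s for the space between the two names
if re.search(r"\w{2,20}\s\w{2,20}", "Saurabh Kulshrestha"):
    print("fullname is valid")

fullname is valid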

##### To verify an E-mail address

An e-mail address should include:
1. 1 to 20 lowercase and uppercase letters, numbers, plus ._%+-
2. An @ symbol
3. 2 to 20 lowercase and uppercase letters, numbers
4. A period
5. 2 to 3 lowercase and uppercase letters

In [59]:
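email = "sk@aol.com md@.com @seo.com dc@.com"
# No spaces inside the braces, and the period before the final letters is escaped so it matches a literal dot
print("EmailMatches:", len(re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}", email)))

EmailMatches: 1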


In [60]:
email = "sk@aol.com md@.com @seo.com dc@.com sk@aol.com"
# The same pattern, without spaces inside the braces and with the period escaped
print("EmailMatches:", len(re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}", email)))

EmailMatches: 2


#### Web Scraping

In [61]:
# Scraping phone numbers from a webpage using RegEx
import urllib.request
from re import findall

In [62]:
url = "http://www.summet.com/dmsi/html/codesamples/addresses.html"

response = urllib.request.urlopen(url)

html = response.read()

htmlStr = html.decode()

pdata = findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr) # 3 digits in (), a space, 3 digits, a hyphen, 4 digits

for item in pdata:
    print(item)

(257) 563-7401
(372) 587-2335
(786) 713-8616
(793) 151-6230
(492) 709-6392
(654) 393-5734
(404) 960-3807
(314) 244-6306
(947) 278-5929
(684) 579-1879
(389) 737-2852
(660) 663-4518
(608) 265-2215
(959) 119-8364
(468) 353-2641
(248) 675-4007
(939) 353-1107
(570) 873-7090
(302) 259-2375
(717) 450-4729
(453) 391-4650
(559) 104-5475
(387) 142-9434
(516) 745-4496
(326) 677-3419
(746) 679-2470
(455) 430-0989
(490) 936-4694
(985) 834-8285
(662) 661-1446
(802) 668-8240
(477) 768-9247
(791) 239-9057
(832) 109-0213
(837) 196-3274
(268) 442-2428
(850) 676-5117
(861) 546-5032
(176) 805-4108
(715) 912-6931
(993) 554-0563
(357) 616-5411
(121) 347-0086
(304) 506-6314
(425) 288-2332
(145) 987-4962
(187) 582-9707
(750) 558-3965
(492) 467-3131
(774) 914-2510
(888) 106-8550
(539) 567-3573
(693) 337-2849
(545) 604-9386
(221) 156-5026
(414) 876-0865
(932) 726-8645
(726) 710-9826
(622) 594-1662
(948) 600-8503
(605) 900-7508
(716) 977-5775
(368) 239-8275
(725) 342-0650
(711) 993-5187
(882) 399-5084
(287) 755-
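
The same extraction can be tried without a network connection. In the sketch below, html_str is a hypothetical stand-in for a downloaded page, with invented names and numbers. Only numbers written in the "(ddd) ddd-dddd" form are picked up.

```python
import re

# Hypothetical stand-in for the decoded HTML of a page
html_str = '''
<li>Ann Example, (555) 010-1234</li>
<li>Bob Sample, 555-010-9999</li>
<li>Cy Test, (555) 010-5678</li>
'''

# Compile once so the pattern can be reused on several pages
phone_pattern = re.compile(r'\(\d{3}\) \d{3}-\d{4}')
phones = phone_pattern.findall(html_str)

for phone in phones:
    print(phone)
```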

#### Python Regular Expressions (Regex) Tutorials

A regular expression is a sequence of characters that helps one identify strings that match a specific pattern.

##### Symbols for writing regular expressions
1. Asterisk * The preceding character is repeated zero or more times
2. Plus + The preceding character is repeated one or more times
3. {} The preceding character is repeated as many times as specified in the braces.
4. Period . Represents a single occurrence of any character except newline.
5. ? The preceding character is optional
6. ^ Specifies that the match must start at the beginning of the string.
7. $ Specifies that the match must occur at the end of the string
8. [] Matches any one of the characters within the brackets
9. [^..] Matches any one character except those in the brackets
10. \d Matches a digit
11. \w Matches an alphanumeric character or underscore.
12. \s Matches a whitespace character

In [73]:
import re

In [74]:
str = "Abcd 4 computer 765 Python 687"

pattern = 'computer' #extracting a specific word
match = re.findall(pattern, str)

print(match)

['computer']


In [78]:
#the code above didn't tell us much. We will now use some of those special characters
str = "Abcd 4 computer 765 Python 687"

pattern = r'[a-zA-Z]+'
match = re.findall(pattern, str)

print(match)

['Abcd', 'computer', 'Python']


In [79]:
#Let's remove the space in between Abcd and 4
str = "Abcd4 computer 765 Python 687"

pattern = r'[a-zA-Z]+'
match = re.findall(pattern, str)

print(match)

['Abcd', 'computer', 'Python']


In [80]:
#Let's remove all the spaces in between Abcd, 4 and computer
str = "Abcd4computer 765 Python 687"

pattern = r'[a-zA-Z]+' # the plus matches runs of one or more consecutive letters
match = re.findall(pattern, str)

print(match)

['Abcd', 'computer', 'Python']


In [82]:
#Let's do numbers
str = "Abcd 4 computer 765 Python 687"

pattern = r'[0-9]+'
match = re.findall(pattern, str)

print(match)

['4', '765', '687']


In [83]:
#Let's do both letters and numbers
str = "Abcd4 computer 765 Python 687"

pattern = r'[a-zA-Z0-9]+'
match = re.findall(pattern, str)

print(match)

['Abcd4', 'computer', '765', 'Python', '687']


In [84]:
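#Let's try another pattern: any one character followed by one or more non-space characters. Note that the result differs: the leading space is captured with each later word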
str = "Abcd 4 computer 765 Python 687"

pattern = r'.[^ ]+'
match = re.findall(pattern, str)

print(match)

['Abcd', ' 4', ' computer', ' 765', ' Python', ' 687']


In [85]:
# using a triple-quoted, multi-line string
str = '''
apple
banana
orange
peach
avocado
cherries
'''
pattern = r'.*s' # any run of characters other than a newline, ending in s
match = re.findall(pattern, str)
for m in match:
    print(m)

cherries


In [86]:
# using word boundaries 
str = '''
apple
banana
orange
peach
avocado
cherries
'''
pattern = r'\b[aeiou].+\b' # word boundaries: words that start with a vowel
match = re.findall(pattern, str)
for m in match:
    print(m)

apple
orange
avocado


In [87]:
# Let's run the code without the boundaries 
str = '''
apple
banana
orange
peach
avocado
cherries
'''
pattern = r'[aeiou].+' 
match = re.findall(pattern, str)
for m in match:
    print(m)

apple
anana
orange
each
avocado
erries


In [88]:
# Let's run the code with the boundaries on the right hand side
str = '''
apple
banana
orange
peach
avocado
cherries
'''
pattern = r'[aeiou].+\b' 
match = re.findall(pattern, str)
for m in match:
    print(m)

apple
anana
orange
each
avocado
erries


In [89]:
#Let's look at another example using emails (as if scraped from a webpage)
str = '''
dfshj@gmail.com
3ytgdg\.56
tigacharm56h@hotmail.com
hfg123h@aol
'''
pattern = r'[a-z]+[0-9]*[a-z]*@[a-z]+\.com' 
match = re.findall(pattern, str)

print(match)

['dfshj@gmail.com', 'tigacharm56h@hotmail.com']


In [92]:
#Let's use finditer() to get match objects, which also record where each match is
str = '''
dfshj@gmail.com
3ytgdg\.56
tigacharm56h@hotmail.com
hfg123h@aol
'''
pattern = r'[a-z]+[0-9]*[a-z]*@[a-z]+\.com' 
match = re.finditer(pattern,str)

for m in match:
    print(m)

<re.Match object; span=(1, 16), match='dfshj@gmail.com'>
<re.Match object; span=(28, 52), match='tigacharm56h@hotmail.com'>


In [93]:
#Let's check just the span 
str = '''
dfshj@gmail.com
3ytgdg\.56
tigacharm56h@hotmail.com
hfg123h@aol
'''
pattern = r'[a-z]+[0-9]*[a-z]*@[a-z]+\.com' 
match = re.finditer(pattern,str)

for m in match:
    print(m.span())

(1, 16)
(28, 52)


In [95]:
#Let's look at a new set of data
str = '''
Sam
car
2453
Alexa
John
90
'''

pattern = r'\b[A-Z][a-z]+\b' 
nstr = re.sub(pattern, "", str) # replace every match of the pattern with an empty string

print(nstr)



car
2453


90
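
The substitution above leaves empty lines behind where the capitalized words used to be. One possible way to remove the whole line, sketched here with the re.MULTILINE flag, is to anchor the pattern at the start of each line and include the trailing newline in the match.

```python
import re

data = '''
Sam
car
2453
Alexa
John
90
'''

# With re.MULTILINE, ^ matches at the start of every line, so the pattern
# consumes a whole capitalized line together with its trailing newline
cleaned = re.sub(r'^[A-Z][a-z]+\n', '', data, flags=re.MULTILINE)
print(cleaned)
```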

