#### Performing Queries with Regex in Python

The ‘re’ package provides several functions for querying an input string. The ones we will discuss are:
	re.match()
	re.search()
	re.findall()

Let's first take a look at the match() method.

The match() method finds a match only if the pattern occurs at the start of the string being searched.

For example, calling match() on the string ‘dog cat dog’ with the pattern ‘dog’ finds a match:


In [171]:
import re

sample_text = 'dog cat dog'
re.match(r'dog', sample_text)


<_sre.SRE_Match at 0x5fdb8b8>

In [230]:
sample_text = 'dog cat dog'

match = re.match(r'd.....', sample_text)
match.group(0)

'dog ca'

In [231]:
sample_text = 'dog cat dog'

match = re.match(r'dog....', sample_text)
match.group(0)

'dog cat'

In [232]:
sample_text = 'dog cat dog'

match = re.match(r'd.*', sample_text)
match.group(0)

'dog cat dog'

In [235]:
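sample_text = 'dog cat dog'
match = re.match(r'cat', sample_text)
# 'cat' is not at the start of the string, so match is None
# and calling match.group(0) would raise an AttributeError.
#match.group(0)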
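
Because re.match() returns None when the pattern does not occur at the start of the string, calling group() on the result without a check raises an AttributeError. The following is a minimal sketch of a guarded call; the helper name `first_match` is hypothetical and introduced only for illustration.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    # re.match returns None when the pattern does not match at position 0
    return m.group(0) if m is not None else None

sample_text = 'dog cat dog'
print(first_match(r'dog', sample_text))  # 'dog' is at the start, so this matches
print(first_match(r'cat', sample_text))  # 'cat' is not at the start, so this prints None
```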

The search() method is similar to match(), but it is not restricted to matches at the beginning of the string, so searching for ‘cat’ in our example string finds a match:

In [254]:
sample_text = 'dog cat dog'

match = re.search(r'cat.*', sample_text)

match.group(0)

'cat dog'
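
To see the two functions side by side, the short sketch below runs the same pattern through both and uses span() on the match object to show where search() found it. The variable names are illustrative only.

```python
import re

sample_text = 'dog cat dog'

at_start = re.match(r'cat', sample_text)   # only tries to match at position 0
anywhere = re.search(r'cat', sample_text)  # scans the whole string

print(at_start)           # None
print(anywhere.span())    # (4, 7)
print(anywhere.group(0))  # cat
```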

When we call findall(), we get back a list of all non-overlapping matches as plain strings, so there is no match object to unpack.
Calling findall() on our example string, we get:

In [237]:
re.findall(r'dog', 'dog cat dog')

['dog', 'dog']

In [238]:
re.findall(r'cat', 'dog cat dog')

['cat']
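
findall() returns only the matched text. When the positions of the matches are also needed, re.finditer() yields match objects instead. A small sketch on the same example string:

```python
import re

sample_text = 'dog cat dog'

# findall returns only the matched strings
words = re.findall(r'dog', sample_text)

# finditer yields match objects, which also carry the match positions
positions = [m.span() for m in re.finditer(r'dog', sample_text)]

print(words)      # ['dog', 'dog']
print(positions)  # [(0, 3), (8, 11)]
```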

#### Let's read the data from homicides.txt as a list

In [239]:
# Use a with-block so the file is closed once it has been read
with open("D:/insofe/CSE7306c22/20170128_Batch22_CSE7306c_Lab01_TamingText/homicides.txt", 'r') as f:
    data = [line.strip() for line in f]
type(data)

list

In [240]:
len(data)

1250

Display the first five lines of the list read from homicides.txt

In [255]:
data[0:5]

['39.311024, -76.674227, iconHomicideShooting, \'p2\', \'<dl><dt>Leon Nelson</dt><dd class="address">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>\'',
 '39.312641, -76.698948, iconHomicideShooting, \'p3\', \'<dl><dt>Eddie Golf</dt><dd class="address">4900 Challedon Road<br />Baltimore, MD 21207</dd><dd>black male, 26 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting</dd></dl>\'',
 '39.309781, -76.649882, iconHomicideBluntForce, \'p4\', \'<dl><dt>Nelsene Burnette</dt><dd class="address">2000 West North Ave<br />Baltimore, MD 21217</dd><dd>black female, 44 years old</dd><dd>Found on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: blunt force</dd></dl>\'',
 '39.363925, -76.598772, iconHomicideAsphyxiation, \'p5\', \'<dl><dt>Thomas MacKenney</dt><dd class="address">5900 Northwood Drive<br />Baltimore, 

In [242]:
r = re.compile("iconHomicideShooting")
# list() keeps len() working on Python 3, where filter() returns an iterator
newlist = list(filter(r.search, data))
len(newlist)

228

In [243]:
r = re.compile("iconHomicideShooting|icon_homicide_shooting")
newlist = list(filter(r.search, data))
len(newlist)

1003

In [244]:
r = re.compile("Cause: shooting")
newlist = list(filter(r.search, data))
len(newlist)

228

In [245]:
r = re.compile("Cause: [Ss]hooting")
newlist = list(filter(r.search, data))
len(newlist)

1003

In [246]:
r = re.compile("[Ss]hooting")
newlist = list(filter(r.search, data))
len(newlist)

1005

In [247]:
exp1 = re.compile("Cause: [Ss]hooting")
exp2 = re.compile("[Ss]hooting")

# Materialize both results as lists so they can be reused in the cells that follow
list1 = list(filter(exp1.search, data))
list2 = list(filter(exp2.search, data))

In [248]:
set(list1) - set(list2)

set()

In [250]:
set(list2) - set(list1)

{'39.28322500000, -76.63946800000, icon_homicide_bluntforce, \'p472\', \'<dl><dt><a href="http://essentials.baltimoresun.com/micro_sun/homicides/victim/472/lyle-dimeler">Lyle Dimeler</a></dt><dd class="address">400 S. Calhoun St.<br />Baltimore, MD 21223</dd><dd>Race: Unknown<br />Gender: male<br />Age: 53 years old</dd><dd>Found on November 18, 2008</dd><dd>Victim died at Maryland Shock Trauma Center</dd><dd>Cause: Blunt Force</dd><dd><a href="http://www.baltimoresun.com/news/local/bal-shootingdeath1118,0,3571249.story">Read the article</a></dd></dl>\'',
 '39.33743900000, -76.66316500000, icon_homicide_bluntforce, \'p914\', \'<dl><dt><a href="http://essentials.baltimoresun.com/micro_sun/homicides/victim/914/steven-harris">Steven Harris</a></dt><dd class="address">4200 Pimlico Road<br />Baltimore, MD 21215</dd><dd>Race: Black<br />Gender: male<br />Age: 38 years old</dd><dd>Found on July 29, 2010</dd><dd>Victim died at Scene</dd><dd>Cause: Blunt Force</dd><dd class="popup-note"><p>Harr
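
The set difference shows why the broader pattern counts more lines: in the first record listed, the cause is "Blunt Force", but the linked article URL contains the word "shooting". The sketch below reproduces this on a shortened fragment of that record; the shortening is mine, and the variable names are illustrative.

```python
import re

# Fragment of the first record in the set difference above (shortened for readability)
record = ('<dd>Cause: Blunt Force</dd><dd><a href="http://www.baltimoresun.com/'
          'news/local/bal-shootingdeath1118,0,3571249.story">Read the article</a></dd>')

broad = re.compile(r"[Ss]hooting")
specific = re.compile(r"Cause: [Ss]hooting")

print(broad.search(record).group(0))  # matches 'shooting' inside the URL
print(specific.search(record))        # None: the cause field says 'Blunt Force'
```

Anchoring the pattern on the "Cause: " prefix is what keeps such incidental occurrences out of the count.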

#### Now let's use a pattern to extract the string the business needs
<dd>[Ff]ound(.*)</dd>

In [251]:
for line in data:
    # [Ff] matches 'F' or 'f'; inside a character class '|' is a literal, so [F|f] would also match '|'
    regex = r'<dd>[Ff]ound(.*)</dd>'
    matches = re.search(regex, line)
    line = matches.group(1) + '\n'
    print(line)
    

 on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting

 on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting

 on January 2, 2007</dd><dd>Victim died at scene</dd><dd>Cause: blunt force

 on January 3, 2007</dd><dd>Victim died at scene</dd><dd>Cause: asphyxiation

 on January 5, 2007</dd><dd>Victim died at scene</dd><dd>Cause: blunt force

 on January 5, 2007</dd><dd>Victim died at JHH</dd><dd>Cause: shooting

 on January 5, 2007</dd><dd>Victim died at UMMC</dd><dd>Cause: shooting

 on January 7, 2007</dd><dd>Victim died at JHH</dd><dd>Cause: shooting

 on January 8, 2007</dd><dd>Victim died at Bayview</dd><dd>Cause: shooting

 on January 8, 2007</dd><dd>Victim died at JHH</dd><dd>Cause: shooting

 on January 9, 2007</dd><dd>Victim died at Sinai</dd><dd>Cause: shooting

 on January 9, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting

 on January 9, 2007</dd><dd>Victim died at scene</dd><dd>Cause: shooting

 on January 9, 2007</dd><d

AttributeError: 'NoneType' object has no attribute 'group'

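The previous pattern was too greedy: (.*) ran on to the last </dd> in each line, so it captured far more than the date.
We need to place the ? metacharacter after the quantifier, as in (.*?), to make the regex "lazy" so that it stops at the first </dd>.
The AttributeError at the end of the output is a separate problem: re.search() returned None for a line that does not match the pattern, and None has no group() method.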
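
The difference between the greedy and lazy forms can be seen on a single record. The sketch below uses a fragment of the first line of the data displayed earlier; the variable names are illustrative.

```python
import re

# Fragment of the first record in homicides.txt, as displayed earlier
line = ('<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd>'
        '<dd>Cause: shooting</dd></dl>')

greedy = re.search(r'<dd>[Ff]ound(.*)</dd>', line)
lazy = re.search(r'<dd>[Ff]ound on (.*?)</dd>', line)

print(greedy.group(1))  # runs on to the last </dd> in the line
print(lazy.group(1))    # stops at the first </dd>: January 1, 2007
```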


In [252]:
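incidentDate = []
for line in data:
    # [Ff] matches 'F' or 'f'; (.*?) is lazy and stops at the first </dd>
    regex = r'<dd>[Ff]ound on (.*?)</dd>'
    matches = re.search(regex, line)
    # matches is None for a line without a 'Found on' field, which raises the AttributeError below
    line = matches.group(1) + '\n'
    incidentDate.append(matches.group(1))
    print(line)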

January 1, 2007

January 2, 2007

January 2, 2007

January 3, 2007

January 5, 2007

January 5, 2007

January 5, 2007

January 7, 2007

January 8, 2007

January 8, 2007

January 9, 2007

January 9, 2007

January 9, 2007

January 9, 2007

January 9, 2007

January 13, 2007

January 15, 2007

January 18, 2007

January 20, 2007

January 20, 2007

January 22, 2007

January 23, 2007

January 23, 2007

January 24, 2007

January 27, 2007

January 27, 2007

January 29, 2007

January 31, 2007

February 1, 2007

February 2, 2007

February 8, 2007

February 10, 2007

February 11, 2007

February 11, 2007

February 17, 2007

February 18, 2007

February 19, 2007

February 19, 2007

February 19, 2007

February 21, 2007

February 21, 2007

February 23, 2007

February 24, 2007

February 26, 2007

February 27, 2007

March 3, 2007

March 4, 2007

March 5, 2007

March 6, 2007

March 9, 2007

March 11, 2007

March 11, 2007

March 12, 2007

March 13, 2007

March 13, 2007

March 13, 2007

March 13, 2007

Marc

AttributeError: 'NoneType' object has no attribute 'group'
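
The AttributeError means that re.search() returned None for at least one line, that is, a line in which the pattern does not occur. One way to handle this, sketched below, is to check for None and skip such lines while counting them. The function name `extract_dates` and the sample list are assumptions made for illustration; the third sample line is made up to stand in for a record without a "Found on" field.

```python
import re

# Compile once, outside the loop
FOUND_RE = re.compile(r'<dd>[Ff]ound on (.*?)</dd>')

def extract_dates(lines):
    """Collect the 'Found on' date from each line and count lines with no match."""
    dates, skipped = [], 0
    for line in lines:
        m = FOUND_RE.search(line)
        if m is None:
            skipped += 1  # no 'Found on' field in this line
            continue
        dates.append(m.group(1))
    return dates, skipped

# Two shortened fragments from the data plus one made-up line without a 'Found on' field
sample = [
    '<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>',
    '<dd>Found on January 2, 2007</dd><dd>Cause: shooting</dd>',
    '<dd>Cause: shooting</dd>',
]
dates, skipped = extract_dates(sample)
print(dates)    # ['January 1, 2007', 'January 2, 2007']
print(skipped)  # 1
```

Counting the skipped lines, rather than silently dropping them, makes it easy to inspect which records need a different pattern.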

In [253]:
incidentDate

['January 1, 2007',
 'January 2, 2007',
 'January 2, 2007',
 'January 3, 2007',
 'January 5, 2007',
 'January 5, 2007',
 'January 5, 2007',
 'January 7, 2007',
 'January 8, 2007',
 'January 8, 2007',
 'January 9, 2007',
 'January 9, 2007',
 'January 9, 2007',
 'January 9, 2007',
 'January 9, 2007',
 'January 13, 2007',
 'January 15, 2007',
 'January 18, 2007',
 'January 20, 2007',
 'January 20, 2007',
 'January 22, 2007',
 'January 23, 2007',
 'January 23, 2007',
 'January 24, 2007',
 'January 27, 2007',
 'January 27, 2007',
 'January 29, 2007',
 'January 31, 2007',
 'February 1, 2007',
 'February 2, 2007',
 'February 8, 2007',
 'February 10, 2007',
 'February 11, 2007',
 'February 11, 2007',
 'February 17, 2007',
 'February 18, 2007',
 'February 19, 2007',
 'February 19, 2007',
 'February 19, 2007',
 'February 21, 2007',
 'February 21, 2007',
 'February 23, 2007',
 'February 24, 2007',
 'February 26, 2007',
 'February 27, 2007',
 'March 3, 2007',
 'March 4, 2007',
 'March 5, 2007',
 '