# Using Regular Expressions to Match Patterns

Regular expressions are useful when you need to find strings that follow a particular pattern in a file containing a lot of data. In this exercise we will look at three scenarios and use a regular expression to match the pattern in each one.

To use regular expressions in Python, you first need to import the regular expression module (`re`). Then, for each scenario, you will need to simulate some data, build the necessary regular expression(s), and filter the data with them.

## Scenario 1: Times

This pattern will match times from noon up to midnight when they are reported in 24-hour or "military" format (e.g. 15:30). Two regular expressions are used. The first captures hours that start with a 1 followed by a digit from 2 to 9 (12:xx through 19:xx), and the second captures hours that start with a 2 (20:xx onward). The print statement then combines the two filtered lists so that you can see every value between noon and midnight. Note that `[2][0-9]` would also accept invalid hours such as 25:00, which the simulated data does not contain.

In [1]:
#Import packages needed
import re
#Simulate Data
times=['00:30','01:30','02:30','03:30','04:30','05:30','06:30','07:30','08:30','09:30','10:30','11:30','12:30','13:30','14:30','15:30','16:30','17:30','18:30','19:30','20:30','21:30','22:30','23:30']
#Build expressions (raw strings keep \d from being read as a string escape)
regex1=re.compile(r'[1][2-9]:\d{2}')
regex2=re.compile(r'[2][0-9]:\d{2}')
#Filter (list() is needed in Python 3, where filter returns an iterator)
print(list(filter(regex1.match,times))+list(filter(regex2.match,times)))

['12:30', '13:30', '14:30', '15:30', '16:30', '17:30', '18:30', '19:30', '20:30', '21:30', '22:30', '23:30']
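
The two expressions above can also be folded into a single pattern. The following is only a minimal sketch under a few assumptions, not the method used in the cell above: it assumes Python 3, uses alternation to cover both hour ranges, limits hours to 12-23 and minutes to 00-59, and uses `fullmatch` so that trailing characters are rejected. The name `afternoon` and the `samples` list are made up for illustration.

```python
import re

# Hours 12-19 or 20-23, then a colon, then minutes 00-59.
afternoon = re.compile(r'(?:1[2-9]|2[0-3]):[0-5]\d')

# Illustrative data, including a few values that should be rejected.
samples = ['00:30', '11:59', '12:30', '19:05', '23:59', '24:00', '13:75', '15:30pm']

# fullmatch succeeds only when the entire string fits the pattern.
hits = [t for t in samples if afternoon.fullmatch(t)]
print(hits)  # ['12:30', '19:05', '23:59']
```

Whether the stricter hour and minute ranges are worth the extra complexity depends on how clean the incoming data is.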


## Scenario 2: Genus Species Names

This pattern will match genus species names that are expressed in the format G. species (e.g. H. sapiens). This code will look for a capital letter (A-Z), followed by a period, then a space, and finally 2-25 lowercase letters. Because we use the `.match` method, the pattern is tested only at the beginning of each string, so a string passes the filter when it starts with the specified format.

In [4]:
#Simulate Data
names=['M. avium','Bubbles','T. cruzi','J. F. Kennedy','B. megaterium','Kei-ichi Uchiya','mumbo.jumbo','T. rex', 'S. pyogenes','h. sapiens']
#Build Expression
regex=re.compile(r'[A-Z]\.\s[a-z]{2,25}')
#Filter (list() is needed in Python 3, where filter returns an iterator)
print(list(filter(regex.match,names)))

['M. avium', 'T. cruzi', 'B. megaterium', 'T. rex', 'S. pyogenes']


As you can see, this expression works because it ignored items like 'Bubbles', 'J. F. Kennedy', etc., which did not fit the required format. 'Bubbles' was ignored because it didn't have a period and a space following the first letter. 'J. F. Kennedy' was ignored because it failed the second part of the expression, which requires lowercase letters after the first period and space. 'h. sapiens' was ignored because its first letter is not a capital.
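
Because `match` only checks the start of a string, it will also accept strings that begin with a valid name and then continue with something else. The sketch below contrasts `match` with `fullmatch` (available in Python 3.4 and later) on the same pattern. The `candidates` list and the variable names are made up for illustration.

```python
import re

species = re.compile(r'[A-Z]\.\s[a-z]{2,25}')

# Illustrative data: one clean name, two with trailing text, one non-match.
candidates = ['T. rex', 'T. rex99', 'J. F. Kennedy', 'M. avium complex']

# match: the pattern only has to fit the beginning of the string.
prefix_hits = [s for s in candidates if species.match(s)]
# fullmatch: the pattern has to fit the entire string.
whole_hits = [s for s in candidates if species.fullmatch(s)]

print(prefix_hits)  # ['T. rex', 'T. rex99', 'M. avium complex']
print(whole_hits)   # ['T. rex']
```

Which behavior is preferable depends on the data: `match` is more forgiving of trailing text, while `fullmatch` is stricter.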

## Scenario 3: Social Security Numbers

This pattern will match Social Security numbers in the proper format (e.g. 389-05-4771). This code will look for 3 digits followed by a dash, then 2 digits followed by a dash, and finally 4 digits.

In [5]:
#Simulate Data
data=['389-05-4771','123-45-6789','McDougal Littell','876-54-3210','111-22-3333','Goofy','888-77-6666','22-333-44']
#Build Expression
regex=re.compile(r'\d{3}-\d{2}-\d{4}')
#Filter (list() is needed in Python 3, where filter returns an iterator)
print(list(filter(regex.match,data)))

['389-05-4771', '123-45-6789', '876-54-3210', '111-22-3333', '888-77-6666']


This regular expression worked because it ignored data like 'Goofy' and '22-333-44'. Naturally, 'Goofy' was ignored because it didn't start with a digit, and '22-333-44' was ignored because it didn't start with 3 digits.
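
The scenarios above filter a list in which each item is one candidate string. When the numbers are instead buried in a longer piece of text, `findall` can pull them out. The following is a rough sketch, assuming a made-up `text` string and adding `\b` word boundaries so that longer digit runs are not partially matched.

```python
import re

# \b keeps the pattern from matching inside a longer run of digits.
ssn = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

# Illustrative text, not real records.
text = 'Records: 389-05-4771 (active), 22-333-44 (malformed), 1234-56-78901, 111-22-3333.'

# findall returns every non-overlapping match in the string.
found = ssn.findall(text)
print(found)  # ['389-05-4771', '111-22-3333']
```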