## Regular Expressions Worksheet 2

### Part 1 - Checking for Matches

Now that we've practiced writing regular expressions, we can begin to test our expressions and check for matches using functions from the `re` module.

In [1]:
import re

For this first example, we'll use the regular expression we created last class for representing US dollars and cents:

In [2]:
regex = r" \$([0-9]|[1-9][0-9]+)\.\d\d "

**STEP 1.** First, we need to 'compile' this into a regular expression pattern. Right now, it's just an odd-looking Python string. We do this with the `compile()` function from `re`:

In [3]:
compiled_regex = re.compile(regex)

In [4]:
type(compiled_regex)

re.Pattern

The variable, `compiled_regex`, now contains a regular expression pattern.
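A side note on how the pattern string is written: regex patterns are usually written as raw strings (`r"..."`) so that Python leaves the backslashes alone before `re` sees them. The sketch below illustrates the difference using `\b` (a word boundary in regex, which may not have come up in class yet); the sample sentence is made up for illustration.

```python
import re

# In a regular string, "\b" is a single backspace character.
# In a raw string, it stays as backslash + "b", which re reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))  # the two strings have different lengths

found_with_plain = re.compile(plain).search("a cat sat")
found_with_raw = re.compile(raw).search("a cat sat")

print(found_with_plain)  # no backspace characters in the text, so no match
print(found_with_raw)    # matches the word 'cat'
```

Escapes such as `\d` and `\$` happen to survive in a regular string, but recent Python versions warn about them, so the raw-string form is the safer habit.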

**STEP 2.** Now, let's use `.search()`, which is a method that works specifically on compiled regular expressions. `.search()` will look for the first match and return a 'match object' we can use later:

In [5]:
#The following code will use our compiled regex to see if there's a match in the string '$0.25':

match = compiled_regex.search('The parking meter costs $0.25 per 15 minutes.')
print('The match object looks like this:', match)

The match object looks like this: <re.Match object; span=(23, 30), match=' $0.25 '>


In [6]:
type(match)

re.Match
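The printed match object above shows two pieces of information, the span and the matched text. As a minimal sketch, both can be pulled out with the match object's `.span()` and `.group()` methods (the variable names `whole_match`, `start`, and `end` are just for illustration):

```python
import re

compiled_regex = re.compile(r" \$([0-9]|[1-9][0-9]+)\.\d\d ")
sentence = 'The parking meter costs $0.25 per 15 minutes.'
match = compiled_regex.search(sentence)

whole_match = match.group()   # the full matched text, same as match.group(0)
start, end = match.span()     # start and end positions of the match

print(repr(whole_match))
print(start, end)
print(repr(sentence[start:end]))  # slicing with the span gives the same text
```

Note that the matched text includes the spaces on either side, because the pattern itself begins and ends with a space.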

By the way, look at what happens to the match object when there isn't a match:

In [7]:
match = compiled_regex.search('The parking meter costs $0.254 per 15 minutes.')
print('The match object looks like this:', match)

The match object looks like this: None


When there's no match, a value of `None` is returned.

**STEP 3.** A match object always evaluates to `True` in a condition, while `None` evaluates to `False`. Therefore, you can use an `if` statement to see if there's a match.

In [8]:
if match:
    print('There\'s a match!')
else:
    print('There isn\'t a match.') #Remember, the previous code block found no match

There isn't a match.
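The `if` check matters because `None` has no match-object methods. The sketch below, which reuses the no-match string from earlier, shows what happens if you skip the check (the variable `matched_text` is just for illustration):

```python
import re

compiled_regex = re.compile(r" \$([0-9]|[1-9][0-9]+)\.\d\d ")
match = compiled_regex.search('The parking meter costs $0.254 per 15 minutes.')

try:
    matched_text = match.group()  # match is None here, so this raises an error
except AttributeError as err:
    matched_text = None
    print('Calling .group() without checking failed:', err)

# Checking first avoids the error entirely
if match:
    print('Matched text:', match.group())
else:
    print('No match, so there is nothing to extract.')
```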


**STEP 4.** Usually, you'll want not just to check if there's a match, but do something with the match. For that, you use the `.group()` method and you group whatever information you need with parentheses () in your initial regular expression string.

Here, I've re-written the regex to include 2 groups, 1 for the dollars and 1 for the cents:

In [11]:
regex2 = r" \$([0-9]|[1-9][0-9]+)\.(\d\d) "
compiled_regex2 = re.compile(regex2)
match2 = compiled_regex2.search('Your order total is $101.14 today.')

Now, we can use `.group(1)` to access the dollars and `.group(2)` to access the cents:

In [12]:
if match2:
    print(f'The dollars are {match2.group(1)} and the cents are {match2.group(2)}')
else:
    print('There was no match')

The dollars are 101 and the cents are 14
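Groups come back as strings, so they need converting before you do arithmetic with them. Here is a minimal sketch that uses `.groups()` to get all the groups at once as a tuple; the variable `total_cents` is hypothetical and only there to show the conversion:

```python
import re

compiled_regex2 = re.compile(r" \$([0-9]|[1-9][0-9]+)\.(\d\d) ")
match2 = compiled_regex2.search('Your order total is $101.14 today.')

if match2:
    dollars, cents = match2.groups()   # tuple of all capture groups, in order
    total_cents = int(dollars) * 100 + int(cents)
    print(f'Groups: {match2.groups()}')
    print(f'Total in cents: {total_cents}')
    print(f'Whole match: {match2.group(0)!r}')  # group 0 is the entire match
```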


### Part 2 - Looping to Find a Match

Now that you've got the basics, try writing a program that will loop through the following text to print out the hours and the minutes for each time listed. To help get you started, I've split the multi-line text into a list of strings separated at the line breaks.

In [13]:
text = '''
The time is 0:15 right now    # quarter past midnight
The time is 0:07 right now   # 7 past midnight
The time is 4:59 right now
And here are a few more...
The time is 12:07 right now   # 7 past noon
The time is 23:00 right now  # 11pm
The time is 10:13 right now
The time is 6:00 right now
The time is NOT 123:456 right now
'''
text_list = text.split('\n')

In [14]:
text_list

['',
 'The time is 0:15 right now    # quarter past midnight',
 'The time is 0:07 right now   # 7 past midnight',
 'The time is 4:59 right now',
 'And here are a few more...',
 'The time is 12:07 right now   # 7 past noon',
 'The time is 23:00 right now  # 11pm',
 'The time is 10:13 right now',
 'The time is 6:00 right now',
 'The time is NOT 123:456 right now',
 '']

To help get you started, the following code block contains the regular expression for the 24-hour time format that we created last class. You may need to modify it to match all of the times and to capture the hours and minutes successfully.

In [16]:
time_exp = r' (\d|1\d|2[0-3]):[0-5]\d '
compiled_time_exp = re.compile(time_exp)
match = compiled_time_exp.search(text_list[1])  # .search() needs a string to search; this is the first time line

In [None]:
# Need answer
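One possible answer, offered as a sketch rather than the only approach: add a second pair of parentheses around the minutes so that both parts are captured, then loop over the lines and skip any line where `.search()` returns `None`. The list `found_times` is a hypothetical name used here to collect the results.

```python
import re

text = '''
The time is 0:15 right now    # quarter past midnight
The time is 0:07 right now   # 7 past midnight
The time is 4:59 right now
And here are a few more...
The time is 12:07 right now   # 7 past noon
The time is 23:00 right now  # 11pm
The time is 10:13 right now
The time is 6:00 right now
The time is NOT 123:456 right now
'''
text_list = text.split('\n')

# Group 1 captures the hours, group 2 captures the minutes
compiled_time_exp = re.compile(r' (\d|1\d|2[0-3]):([0-5]\d) ')

found_times = []
for line in text_list:
    match = compiled_time_exp.search(line)
    if match:  # lines without a valid time return None and are skipped
        hours, minutes = match.group(1), match.group(2)
        found_times.append((hours, minutes))
        print(f'Hours: {hours}, minutes: {minutes}')
```

The spaces at each end of the pattern are what keep `123:456` from matching, since no valid hour sits between a space and the colon there.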

### Exercise 1

Here's another one for you to try. Create a regular expression and write a program that will list the dates and times for successful logins ONLY from the following logfile data.

Your program should print out something that looks like this (notice the different date format):
```
Login successful on 09/05/2020 at 11:24.
Login successful on 09/06/2020 at 09:17.
```

*Tip: Remember that you can use `\d` instead of `[0-9]`.*

In [17]:
logfile_data = [
    '2020-09-05T11:24 Login successful',
    '2020-09-05T12:15 Error: file not found',
    '2020-09-05T13:07 Logout successful',
    '2020-09-06T09:16 Login error: bad username',
    '2020-09-06T09:17 Login successful'
]

In [26]:
log_exp = r"(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d:\d\d) Login successful"
comp_log = re.compile(log_exp)

for transaction in logfile_data:
    match = comp_log.search(transaction)
    if match:
        # Groups: 1 = year, 2 = month, 3 = day, 4 = time
        print(f"Login successful on {match.group(2)}/{match.group(3)}/{match.group(1)} at {match.group(4)}.")

Login successful on 09/05/2020 at 11:24.
Login successful on 09/06/2020 at 09:17.


### Exercise 2

Revise your code from Exercise 1, this time to save the data in a dictionary called `logdata`. The top-level keys should be the dates and, for each date, the value should be a dictionary that maps each time to its outcome. Your final dictionary should look like this:

```
{'2020-09-05': {'11:24': 'Login successful',
  '12:15': 'Error: file not found',
  '13:07': 'Logout successful'},
 '2020-09-06': {'09:16': 'Login error: bad username',
  '09:17': 'Login successful'}}
```

In [32]:
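log2_exp = r"(\d\d\d\d-\d\d-\d\d)T(\d\d:\d\d) (.*)$"
comp_log2 = re.compile(log2_exp)

logdata = {}

for transaction in logfile_data:
    match = comp_log2.search(transaction)
    if match:
        date = match.group(1)     # the whole date, e.g. '2020-09-05'
        time = match.group(2)     # e.g. '11:24'
        outcome = match.group(3)  # everything after the timestamp

        # Only update the dictionary when the line actually matched
        if date not in logdata:
            logdata[date] = {time: outcome}
        else:
            logdata[date][time] = outcome

logdata

{'2020-09-05': {'11:24': 'Login successful',
  '12:15': 'Error: file not found',
  '13:07': 'Logout successful'},
 '2020-09-06': {'09:16': 'Login error: bad username',
  '09:17': 'Login successful'}}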

### Exercise 3

You are given text from three financial news articles. Write and compile a regular expression to extract the ticker symbol for the company highlighted in each article. (Here *ticker* means a symbol representing a company in the stock market.)

Write just one regex that works to extract the company from any of the three strings.

Inspiration from: https://holypython.com/advanced-python-exercises/project-regular-expressions-regex/

In [None]:
text1 = """This morning's inflation numbers tell a story: 
Prices are going up faster than they have in 40 years, cutting into profits. 
But Tyson Foods (TSN: 88.82) saw its profits soar by double digits last quarter.  """

text2 = """Joe Rogan's podcast will remain on Spotify (SPOT: 157.31) following 
controversies over COVID misinformation"""

text3 = """Alcoa Corporation (AA: 90.07) plans to announce its first quarter 2022 
financial results on Wednesday, April 20, 2022 after the close of trading on the 
New York Stock Exchange. """


In [None]:
#Write your code here

### Exercise 4

Now redo Exercise 3 so that your result contains nothing besides the ticker symbol for each company. If you have time, also return the prices. It's OK to return a list of tuples where the first element is the ticker symbol and the second is the price.

Source: https://holypython.com/advanced-python-exercises/project-regular-expressions-regex/

In [None]:
text1 = """This morning's inflation numbers tell a story: 
Prices are going up faster than they have in 40 years, cutting into profits. 
But Tyson Foods (TSN: 88.82) saw its profits soar by double digits last quarter.  """

text2 = """Joe Rogan's podcast will remain on Spotify (SPOT: 157.31) following 
controversies over COVID misinformation"""

text3 = """Alcoa Corporation (AA: 90.07) plans to announce its first quarter 2022 
financial results on Wednesday, April 20, 2022 after the close of trading on the 
New York Stock Exchange. """


In [None]:
#Write your code here

### `.findall()` method
The `.findall()` method will return all instances of a match in a list. For example, let's say you have the following line of text:

In [None]:
s = 'Future meetings of the Order of the Phoenix will be held on the following dates: 10/24/21, 11/10/21, and 12/5/21.'

If you want to pull out the meeting dates, you can create a regular expression to match the date format and then use `.findall()`:

In [None]:
date_ex = r'\d\d?/\d\d?/\d\d'
compiled_date = re.compile(date_ex)
date_list = compiled_date.findall(s)
print('The meeting dates are:', date_list)
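To see how `.findall()` differs from `.search()`, the sketch below runs both on the same string: `.search()` stops at the first match and returns a match object, while `.findall()` returns a plain list of strings (the variable names `first_date` and `all_dates` are just for illustration):

```python
import re

s = 'Future meetings of the Order of the Phoenix will be held on the following dates: 10/24/21, 11/10/21, and 12/5/21.'
compiled_date = re.compile(r'\d\d?/\d\d?/\d\d')

first_date = compiled_date.search(s).group()  # only the first match
all_dates = compiled_date.findall(s)          # every match, as a list of strings

print('First date only:', first_date)
print('All dates:', all_dates)
```

Because `.findall()` returns strings rather than match objects, there is no `.group()` or `.span()` to call on its results.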

### Exercise 5
Create a list of all HTML tags in the string below. An HTML tag is found inside <> characters; the first word is the tag name, and any remaining words are attributes. (A tag that starts with </ is a closing tag, and here there is one for every opening tag except `meta`. For this exercise it is OK to find only the tags that are not closing tags, and it's also OK to have duplicate tags.)

In [None]:
text4 = """<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <title>this is some text</title>
  </head>
  <body>
    
  </body>
</html>"""

In [None]:
#Write your code here

### `.findall()` with multiple capture groups

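If you use `.findall()` with two or more capture groups, you will get a list of tuples, where the first element in each tuple is group 1, the second is group 2, and so on. (With exactly one capture group, you get a list of strings containing just that group.)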

In [None]:
text5 = """This morning's inflation numbers tell a story:  Prices are going up faster than they have in 40 years, cutting into profits. 
But Tyson Foods (TSN: 88.82) saw its profits soar by double digits last quarter.  In other news, Joe Rogan's podcast will remain on Spotify 
(SPOT: 157.31) following controversies over COVID misinformation.  And finally, Alcoa Corporation (AA: 90.07) plans to announce its first 
quarter 2022 financial results on Wednesday, April 20, 2022 after the close of trading on the New York Stock Exchange."""


In [None]:
grouped_regex = r'([A-Z]+): (\d+\.\d\d)' # Step 1 (note the escaped dot, so it matches only a literal '.')
grouped_regex_c = re.compile(grouped_regex) #Step 2

match_all_list2 = grouped_regex_c.findall(text5)  #Steps 3 and 4 combined

match_all_list2
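What `.findall()` returns depends on how many capture groups the pattern has. The sketch below compares zero, one, and two groups on a shortened sample string (`sample` is made up from the ticker text above, and the three result names are hypothetical):

```python
import re

sample = 'Tyson Foods (TSN: 88.82) and Spotify (SPOT: 157.31)'

# No capture groups: a list of the full matches
no_groups = re.compile(r'[A-Z]+: \d+\.\d\d').findall(sample)

# One capture group: a list of strings holding just that group
one_group = re.compile(r'([A-Z]+): \d+\.\d\d').findall(sample)

# Two capture groups: a list of (group 1, group 2) tuples
two_groups = re.compile(r'([A-Z]+): (\d+\.\d\d)').findall(sample)

print(no_groups)
print(one_group)
print(two_groups)

# Tuples unpack neatly in a loop
for ticker, price in two_groups:
    print(f'{ticker} is trading at {price}')
```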

### Exercise 6

Write code to extract all the filenames and extensions from the text below, where the name of the file and the file extension (the part that comes after the .) are separated.  (Use 2 capture groups.)

In [None]:
text6 = """Download Class 28 Worksheet RegularExpressions2.ipynb and the file logins.txt.
  Move both to your course folder and open the worksheet in Jupyter Lab.  A reference that may help:  
  it's a regex cheatsheet where the topics that we'll cover in this course are highlighted with red boxes:  regex_cheatsheetClass27.pdf."""

In [None]:
#Write your code here