### Mystery pattern

What does the following regular expression match? Try to find out!
```
r"([A-Za-z0-9\.]+)@[A-Za-z0-9\.]*qmul\.ac\.uk"
```

When you have worked it out, use the ```.search()``` function to obtain a match object and print its ```.group(1)```. This is whatever matched the contents of the first (and only) pair of brackets () in the pattern. What is it?

In [2]:
import re

pattern=r"([A-Za-z0-9\.]+)@[A-Za-z0-9\.]*qmul\.ac\.uk"

# The mystery pattern corresponds to a QM email account:
text="Anonymous Student <ec241009@se24.qmul.ac.uk>"
mo=re.search(pattern, text)
# brackets () capture the username as a group
mo.group(1)

'ec241009'

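To get a feel for what the pattern accepts and rejects, it can help to run it on a few more strings. The sketch below wraps the search in a small helper; the name `username` and all the sample addresses are made up for illustration and are not part of the exercise.

```python
import re

pattern = r"([A-Za-z0-9\.]+)@[A-Za-z0-9\.]*qmul\.ac\.uk"

def username(text):
    """Return the part before the @ if text contains a match, else None."""
    mo = re.search(pattern, text)
    return mo.group(1) if mo else None

# Hypothetical sample strings
print(username("j.smith@qmul.ac.uk"))      # j.smith  (no subdomain needed)
print(username("ab123@eecs.qmul.ac.uk"))   # ab123    (subdomain allowed)
print(username("someone@example.com"))     # None     (wrong domain)
# The part before "qmul" may be any run of letters, digits and dots,
# so a look-alike domain is accepted as well:
print(username("user@fakeqmul.ac.uk"))     # user
```

The last case suggests the pattern is fine for an exercise but too permissive for real validation.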
### Call me back

Write a function ```parseNumber``` that takes a string with a phone number as an argument, and returns a dictionary with the country code, area code (starting with zero) and number as separate strings.
So for instance the call
```
parseNumber("+44 (207) 882 5555")
```
should return the dictionary
```
{
    "country": "44",
    "area": "0207",
    "number": "8825555" # remove the spaces
}

```
Use grouping as in the example above; search for "Grouping" in the official [tutorial](https://docs.python.org/3/howto/regex.html) for more information. You can stick with this format, or try to cater for a few variants - for instance ```"(0207) 8825555"``` might return ```None``` for the country, and the rest as above. 
Simply return ```None``` if the number cannot be parsed. You may need a combination of regular expressions, string methods and ```if``` statements to get the job done.

In [6]:
import re

def parseNumber(num):
    num = re.sub(r'\s', '', num)
    # num = num.replace(' ', '')  # either this or the line above
    # there are 5 groups in this regexp; fullmatch rejects trailing junk
    mo = re.fullmatch(r"(\+(\d+))?(\((\d+)\))?(\d+)", num)
    if mo is None:
        return None
    # Uncomment the two lines below to print all groups
    # for i in range(6):
    #     print(f"Group {i}: {mo.group(i)}")
    data = {
        'country': mo.group(2),
        'area': mo.group(4),
        'number': mo.group(5)
    }
    # Add 0 to area code if needed
    if data['area'] is not None and data['area'][0] != '0':
        data['area'] = '0' + data['area']
    return data

print(parseNumber("+44 (207) 882 5555"))
# print(parseNumber("(0207) 882 5555"))
# print(parseNumber("882 5555"))


{'country': '44', 'area': '0207', 'number': '8825555'}

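Counting nested brackets to find the right group number is error-prone. As an alternative sketch (not the solution above), named groups ```(?P<name>...)``` together with non-capturing groups ```(?:...)``` let the match object build the dictionary directly. The names `PHONE` and `parse_number_named` are hypothetical.

```python
import re

# (?:...) groups the optional parts without capturing them;
# (?P<name>...) captures only the digits we want to keep.
PHONE = re.compile(
    r"(?:\+(?P<country>\d+))?(?:\((?P<area>\d+)\))?(?P<number>\d+)"
)

def parse_number_named(num):
    # Remove all whitespace, then require the whole string to match
    mo = PHONE.fullmatch(re.sub(r"\s", "", num))
    if mo is None:
        return None
    data = mo.groupdict()  # keys: 'country', 'area', 'number'
    # Add the leading 0 to the area code if needed
    if data["area"] is not None and not data["area"].startswith("0"):
        data["area"] = "0" + data["area"]
    return data

print(parse_number_named("+44 (207) 882 5555"))
# {'country': '44', 'area': '0207', 'number': '8825555'}
print(parse_number_named("(0207) 8825555"))
# {'country': None, 'area': '0207', 'number': '8825555'}
print(parse_number_named("not a number"))
# None
```

Groups that did not take part in the match show up as ```None``` in ```groupdict()```, which is the behaviour the exercise asks for when the country code is missing.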

### Books to scrape

The few lines of code below download the HTML source of the home page of a simulated [bookstore](http://books.toscrape.com/) meant for web scraping practice. Write code that uses regular expressions to extract the title and price of each book listed on the page, and prints them out as comma-separated values:

```
A Light in the Attic,£51.77
Tipping the Velvet,£53.74
...
```
Hint: have a look at the web page first. Then print the ```html``` variable, and use the search function of your browser (generally CTRL-F) on the notebook to find occurrences of title words or prices in the source code of the page. Is there any regularity you can exploit? 

In [7]:
# This exercise uses the requests package - when running locally, if it is not
# available, try to install it by typing "pip install requests" in the terminal. 
# Anaconda users try "conda install requests" instead
from requests import get

url='http://books.toscrape.com/'
response=get(url)
# html code as a string
html=response.content.decode('utf-8')

In [9]:
# titles follow 'title=' and are between quotes. Note that the quotes are escaped
titles=re.findall(r"title=\"(.+)\"", html) # notice the group contents are listed
# prices are the decimal numbers following a £ sign
prices=re.findall(r"£\d+\.\d+", html) # no group here, entire match listed

# apart from the encoding of the apostrophe, this seems to work well
for title, price in zip(titles, prices):
    print (title+", "+price)

A Light in the Attic, £51.77
Tipping the Velvet, £53.74
Soumission, £50.10
Sharp Objects, £47.82
Sapiens: A Brief History of Humankind, £54.23
The Requiem Red, £22.65
The Dirty Little Secrets of Getting Your Dream Job, £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, £22.60
The Black Maria, £52.15
Starving Hearts (Triangular Trade Trilogy, #1), £13.99
Shakespeare&#39;s Sonnets, £20.66
Set Me Free, £17.46
Scott Pilgrim&#39;s Precious Little Life (Scott Pilgrim #1), £52.29
Rip it Up and Start Again, £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991, £57.25
Olio, £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849, £37.59
Libertarianism for Beginners, £51.33
It&#39;s Only the Himalayas, £45.17

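Two details in the output above could be tidied up: the apostrophe appears as the HTML entity ```&#39;```, and some titles contain commas, which makes plain comma-separated output ambiguous. A possible sketch using the standard library's ```html.unescape``` and ```csv``` modules is shown below. It runs on a short hand-written fragment that only imitates the page markup (an assumption, nothing is downloaded), and the function name `books_to_csv` is hypothetical. The title pattern uses ```[^"]+``` rather than ```.+``` so that a match cannot run past the closing quote.

```python
import csv
import html as htmllib  # aliased, since the notebook uses `html` for the page source
import io
import re

# Hand-written fragment imitating the page markup (assumption)
sample = '''
<h3><a href="a.html" title="Shakespeare&#39;s Sonnets">Shakespeare&#39;s ...</a></h3>
<p class="price_color">£20.66</p>
<h3><a href="b.html" title="Starving Hearts (Triangular Trade Trilogy, #1)">Starving ...</a></h3>
<p class="price_color">£13.99</p>
'''

def books_to_csv(source):
    """Return title,price rows as CSV text extracted from HTML source."""
    titles = re.findall(r'title="([^"]+)"', source)
    prices = re.findall(r"£\d+\.\d+", source)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for title, price in zip(titles, prices):
        # unescape turns &#39; back into an apostrophe;
        # csv.writer quotes any field that contains a comma
        writer.writerow([htmllib.unescape(title), price])
    return out.getvalue()

print(books_to_csv(sample))
# Shakespeare's Sonnets,£20.66
# "Starving Hearts (Triangular Trade Trilogy, #1)",£13.99
```

On the real page, the same function could be called on the ```html``` string downloaded above.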

### An interactive tutorial

There are a number of interactive tutorials and puzzles on regular expressions online. Try this one:
    
http://regexone.com/

**(C) 2014,2020 Fabrizio Smeraldi** ([f.smeraldi@qmul.ac.uk](mailto:f.smeraldi@qmul.ac.uk) - [web](http://www.eecs.qmul.ac.uk/~fabri/)), all rights reserved. In: "Computer Programming", School of Electronic Engineering and Computer Science, Queen Mary University of London.