## Regular Expressions

**Note**: This assignment is adapted from Chris Manning and Dan Jurafsky's Coursera NLP 2012 course.

In this assignment, your task is to help a poor Master's student at Stanford (yes, basically the definition of an oxymoron) who is trying to find someone to write a [recommendation letter](https://www.uwb.edu/careers/faculty-and-staff/referenceletters/reference-letter-template) to attach to their graduate school application. This Master's student of ours has decided to adopt a cold calling/mailing approach: their plan is to reach out to anyone whose email address or phone number they can get their hands on.

Armed with knowledge of regular expressions, your task is to help them find as many email addresses and phone numbers as you can.

### Task definition

In the simplest case, given a string such as

```
jurafsky@stanford.edu
```

your solution should return

```
jurafsky@stanford.edu
```


It will not always be this easy though, as your solution should also deal with more complex attempts at obfuscation of email addresses, such as

```
jurafsky(at)cs.stanford.edu

jurafsky at csli dot stanford dot edu
```

which should return

```
jurafsky@cs.stanford.edu

jurafsky@csli.stanford.edu
```
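One possible way to handle the `(at)` and `dot` spellings is sketched below. The pattern `OBFUSCATED_MAIL` and the helper `normalize_email` are hypothetical names made up for this illustration, not part of the starter code, and the pattern covers only the three spellings shown above.

```python
import re

# Hypothetical pattern: a user name, then "@", "(at)" or " at ", then one or
# more domain labels separated by "." or " dot ", ending in "edu".
OBFUSCATED_MAIL = re.compile(
    r'(\w+)\s*(?:@|\(at\)|\sat\s)\s*((?:\w+(?:\.|\sdot\s))+edu)\b',
    re.IGNORECASE,
)

def normalize_email(text):
    """Return the canonical form of every address found in `text`."""
    found = []
    for user, domain in OBFUSCATED_MAIL.findall(text):
        # Turn the spelled-out " dot " separators back into real dots.
        domain = re.sub(r'\s+dot\s+', '.', domain, flags=re.IGNORECASE)
        found.append('{}@{}'.format(user, domain))
    return found

print(normalize_email('jurafsky at csli dot stanford dot edu'))
# → ['jurafsky@csli.stanford.edu']
```

Be aware that a permissive pattern like this will also turn ordinary prose such as "look at stanford.edu" into an address, so expect to trade false negatives for false positives while tuning it.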

And you should even take care of 

```
<script type="text/javascript">obfuscate('stanford.edu','jurafsky')</script>
```

which should once again return

```
jurafsky@stanford.edu
```

(No, executing JavaScript is not required in this assignment).
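Since the `obfuscate(...)` call carries the domain and the user name as two quoted arguments, a regular expression with two groups is enough to recover the address. The following is a minimal sketch; `OBFUSCATE_CALL` and `emails_from_obfuscate` are hypothetical names, and the pattern assumes single-quoted arguments in exactly the order shown above.

```python
import re

# Hypothetical pattern: the first argument is the domain, the second the user.
OBFUSCATE_CALL = re.compile(r"obfuscate\('([^']+)',\s*'([^']+)'\)")

def emails_from_obfuscate(text):
    """Rebuild user@domain from every obfuscate('domain','user') call in `text`."""
    return ['{}@{}'.format(user, domain)
            for domain, user in OBFUSCATE_CALL.findall(text)]

line = "<script type=\"text/javascript\">obfuscate('stanford.edu','jurafsky')</script>"
print(emails_from_obfuscate(line))
# → ['jurafsky@stanford.edu']
```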

In the case of phone numbers, the same idea applies: all of the following examples

```
TEL +1-650-723-0293
Phone: (650) 723-0293
Tel (+1): 650-723-0293
<a href="contact.html">TEL</a> +1&thinsp;650&thinsp;723&thinsp;0293
```

should return the same canonical form:

```
650-723-0293
```

(You can assume that all mentioned numbers, as well as the Master's student, are located in North America, so the international prefix is not required.)
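As a rough sketch of how the four phone examples above could be reduced to the canonical form, the snippet below captures the three digit groups and joins them with hyphens. The names `SEP`, `PHONE` and `canonical_phones` are hypothetical, and the set of separators is an assumption based only on those four examples.

```python
import re

# Assumed separators between digit groups: hyphens, whitespace or "&thinsp;".
SEP = r'(?:[-\s]|&thinsp;)+'

# Three digits (optionally in parentheses), three digits, four digits.
# The lookarounds keep the match from starting or ending inside a longer number.
PHONE = re.compile(
    r'(?<!\d)\(?(\d{3})\)?' + SEP + r'(\d{3})' + SEP + r'(\d{4})(?!\d)'
)

def canonical_phones(text):
    """Return every phone number in `text` as ###-###-####."""
    return ['-'.join(groups) for groups in PHONE.findall(text)]

print(canonical_phones('Tel (+1): 650-723-0293'))
# → ['650-723-0293']
```

The leading `+1` is never captured because it is shorter than three digits, which is how the sketch drops the international prefix without handling it explicitly.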

*Note: you do not have to deal with any "next-level" obfuscation which would require outside information, such as images or strings like the following:*

```
"first name"@cs.stanford.edu
```


### Data

To let you test your regular expressions and related code, we prepared a dataset (its proper name would probably be the *development test set*), which you can find in the `data/dev/` directory. It contains various homepages of Stanford faculty, from which the emails and phone numbers should be extracted. The file `data/devGOLD` contains the gold-standard answers against which your output is evaluated.


### Process

We suggest you deal with the task of this assignment in the following way:

1. Edit the `process_file` function in the cell below and execute it (hit Ctrl+Enter or Cmd+Enter).
2. Your solution will be automatically evaluated on the development test set. You'll see three sections:

    **True Positives**: list of emails and phone numbers that were correctly matched
    
    **False Positives**: list of emails and phone numbers that your solution *matched* but which *should not be matched*
    
    **False Negatives**: list of emails and phone numbers that your solution *did not match* but which *should be matched*
    
   As the starter code will inevitably produce some **False Positives** and **False Negatives**, we suggest you then navigate to the `data/dev/` directory, find the appropriate file (its name is the first element of the tuple) and try to debug your regular expression there. This can be easily done in JupyterLab by pressing `Ctrl+F` and typing in your regular expression, starting and ending it with forward slashes, such as `/[a-z]+@[a-z]+\.edu/`. Pressing Enter afterwards will highlight the parts of the file that matched your regular expression.

3. Repeat 1. and 2. until the number of **False Positives** and **False Negatives** is as small as possible.
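The three sections reported in step 2 can be thought of as plain set operations between your output and the gold answers. The sketch below is only an illustration of that idea: `compare` is a hypothetical helper, and the actual `evaluate` function in `utils` may be implemented differently.

```python
def compare(guesses, gold):
    """Split guessed and gold tuples into the three reported sections."""
    guesses, gold = set(guesses), set(gold)
    return {
        'true_positives': guesses & gold,   # matched and expected
        'false_positives': guesses - gold,  # matched but not expected
        'false_negatives': gold - guesses,  # expected but not matched
    }

guesses = [('balaji', 'e', 'balaji@stanford.edu'),
           ('psyoung', 'e', 'young@stanford.edu')]
gold = [('balaji', 'e', 'balaji@stanford.edu'),
        ('dlwh', 'e', 'dlwh@stanford.edu')]

report = compare(guesses, gold)
print(report['false_positives'])
# → {('psyoung', 'e', 'young@stanford.edu')}
```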

### Notes

As this assignment uses the Python regular expressions library (`re`), it would be very wise to consult its [documentation](https://docs.python.org/3/library/re.html).
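Two behaviours of `re` are worth keeping in mind while reading the starter code: `re.findall` returns a list of tuples when the pattern has more than one group, and an unescaped `.` matches any character rather than a literal dot. The short example below uses the starter pattern on made-up input strings to show both.

```python
import re

STARTER_REGEX = r'(\w+)@(\w+).edu'

# Two groups in the pattern, so each match is a (user, domain) tuple.
simple = re.findall(STARTER_REGEX, 'jurafsky@stanford.edu')
print(simple)     # → [('jurafsky', 'stanford')]

# Only one domain label is allowed before ".edu", so subdomains are missed.
subdomain = re.findall(STARTER_REGEX, 'jurafsky@cs.stanford.edu')
print(subdomain)  # → []

# The unescaped "." accepts any character, here an "X".
loose = re.findall(STARTER_REGEX, 'x@stanfordXedu')
print(loose)      # → [('x', 'stanford')]
```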

In [1]:
import re
from utils import evaluate

In [3]:
MAIL_REGEX = r'(\w+)@(\w+).edu'

def process_file(name, file):
    """
    Process a file with name `name`, whose "handle" is provided as the `file` parameter.
    Returns a `list` of tuples in the following format:
    
    In case of an email:
    
        (name, 'e', 'someone@something')

    And in case of a phone number:
    
        (name, 'p', '###-###-####')
    """
    
    res = []
    
    for line in file:
        matches = re.findall(MAIL_REGEX, line)
        
        # if we found a match, let's add it
        for match in matches:
            if len(match) == 2:
                email = '{}@{}.edu'.format(match[0], match[1])
                res.append((name, 'e', email))
    return res

results = evaluate('./data/dev', './data/devGOLD', process_file)
results

True Positives (4): 
{('balaji', 'e', 'balaji@stanford.edu'),
 ('nass', 'e', 'nass@stanford.edu'),
 ('shoham', 'e', 'shoham@stanford.edu'),
 ('thm', 'e', 'pkrokel@stanford.edu')}
False Positives (1): 
{('psyoung', 'e', 'young@stanford.edu')}
False Negatives (113): 
{('ashishg', 'e', 'ashishg@stanford.edu'),
 ('ashishg', 'e', 'rozm@stanford.edu'),
 ('ashishg', 'p', '650-723-1614'),
 ('ashishg', 'p', '650-723-4173'),
 ('ashishg', 'p', '650-814-1478'),
 ('bgirod', 'p', '650-723-4539'),
 ('bgirod', 'p', '650-724-3648'),
 ('bgirod', 'p', '650-724-6354'),
 ('cheriton', 'e', 'cheriton@cs.stanford.edu'),
 ('cheriton', 'e', 'uma@cs.stanford.edu'),
 ('cheriton', 'p', '650-723-1131'),
 ('cheriton', 'p', '650-725-3726'),
 ('dabo', 'e', 'dabo@cs.stanford.edu'),
 ('dabo', 'p', '650-725-3897'),
 ('dabo', 'p', '650-725-4671'),
 ('dlwh', 'e', 'dlwh@stanford.edu'),
 ('engler', 'e', 'engler@lcs.mit.edu'),
 ('engler', 'e', 'engler@stanford.edu'),
 ('eroberts', 'e', 'eroberts@cs.stanford.edu'),
 ('eroberts