# Programming Assignment 1: Spamlord

This assignment is your chance to become a Dark Lord of spam email! Yes, you too
can build regexes to spread evil throughout the galaxy.

Our goal in this assignment is to use regular expressions
to extract phone numbers and email addresses from documents found on
the web. If you really were a malicious actor, you could then use these
extracted addresses to bombard unsuspecting victims with spam!

Of course, we would never do anything nefarious like that in 124. Instead,
our goal will be to gain some experience with regular expressions (regexes).

## What we're trying to do

To make things a little more concrete, here are some examples to illustrate
exactly what your implementation should be able to do if it's working correctly.

### Extracting Phone Numbers

Your program should be able to process text that looks like the examples shown
on the left, and extract the phone number on the right (note that we want
the result in a standardized form).

```
# Various formats
Phone:  (650) 723-0293 =>  650-723-0293
Tel (+1): 650-723-0293 =>  650-723-0293

# HTML Markup
<a href="contact.html">TEL</a> +1&thinsp;650&thinsp;723&thinsp;0293 => 650-723-0293
```
__NOTE:__ You can assume we are only working with North American phone numbers,
so all numbers will be of the form: [3 digit area code]-[3 digits]-[4 digits].

__NOTE:__ Your solution may also extract fax numbers, which look exactly like
phone numbers. This is fine, you're not expected to distinguish between
the two.
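
The normalization shown above can be sketched as follows. This is a minimal illustration, not a reference solution: the names `SEPARATOR`, `PHONE_PATTERN`, and `extract_phones` are hypothetical, and the pattern is only written to cover the three sample formats listed in this section.

```python
import re

# Hypothetical separator: any run of whitespace, hyphens, or the HTML
# entity "&thinsp;" (possibly empty).
SEPARATOR = r'(?:[\s-]|&thinsp;)*'

# Optional parentheses around the area code, then 3 digits and 4 digits.
# The lookarounds keep the match from starting or ending inside a longer
# run of digits.
PHONE_PATTERN = re.compile(
    r'(?<!\d)\(?(\d{3})\)?' + SEPARATOR + r'(\d{3})' + SEPARATOR + r'(\d{4})(?!\d)'
)

def extract_phones(text):
    """Return every phone number found in text in the form 'AAA-BBB-CCCC'."""
    return ['-'.join(groups) for groups in PHONE_PATTERN.findall(text)]

print(extract_phones('Phone:  (650) 723-0293'))   # → ['650-723-0293']
print(extract_phones('Tel (+1): 650-723-0293'))   # → ['650-723-0293']
```

Real pages will contain formats this sketch does not anticipate, so treat it as a starting point for experimentation.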

### Extracting Email Addresses

Similarly to the phone numbers, we are also interested in processing text
containing (possibly obfuscated) email addresses and returning the corresponding
email addresses in a standard form.

```
# Ordinary email addresses
manning@cs.stanford.edu => manning@cs.stanford.edu

# Obfuscated email addresses
manning(at)cs.stanford.edu => manning@cs.stanford.edu
manning at csli dot stanford dot edu => manning@csli.stanford.edu

# Email addresses hidden using HTML/JavaScript (this is a very tricky one)
<script type="text/javascript">obfuscate('stanford.edu','manning')</script> => manning@stanford.edu
```
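
As a rough sketch of how the obfuscated samples above could be mapped back to a standard form, the snippet below rewrites the `(at)` / `at` / `dot` spellings and rebuilds the address from the `obfuscate(...)` call. The helper names `deobfuscate` and `from_obfuscate_call` and the `OBFUSCATE_CALL` pattern are assumptions made for this illustration. Note that blindly rewriting every " at " and " dot " would mangle ordinary prose, so a real solution would need to be more selective.

```python
import re

def deobfuscate(text):
    """Rewrite '(at)' or ' at ' as '@' and ' dot ' as '.'."""
    text = re.sub(r'\s*\(at\)\s*|\s+at\s+', '@', text, flags=re.IGNORECASE)
    return re.sub(r'\s+dot\s+', '.', text, flags=re.IGNORECASE)

# Hypothetical pattern for a call of the form obfuscate('domain','user')
OBFUSCATE_CALL = re.compile(r"obfuscate\('([^']+)','([^']+)'\)")

def from_obfuscate_call(text):
    """Rebuild user@domain from each obfuscate('domain','user') call."""
    return ['%s@%s' % (user, domain)
            for domain, user in OBFUSCATE_CALL.findall(text)]

print(deobfuscate('manning(at)cs.stanford.edu'))
# → manning@cs.stanford.edu
print(deobfuscate('manning at csli dot stanford dot edu'))
# → manning@csli.stanford.edu
print(from_obfuscate_call("obfuscate('stanford.edu','manning')"))
# → ['manning@stanford.edu']
```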

### Cases you won't have to worry about

Although you should aim to make your regexes as powerful and general-purpose as
you possibly can, there are some cases that are difficult or impossible to
handle with regexes and which we don't expect you to be able to deal with.

These include:

* Anything involving images or other non-text ways of displaying emails or phone
numbers
* Examples that require parsing names into parts, like:

```
"first name"@cs.stanford.edu
```

* Particularly clever or difficult examples that contain little or none of
the actual email address. For example,

```
To send me email, try the simplest address that makes sense.
```

## Testing and Evaluation

Before you get started, here's some info about how your
implementation will be tested and evaluated.

### Some General Comments on Evaluation

In order to make your life easier on this (and future) homeworks, we will be
giving you some data to test your code on, which we call a
_development set_. Using a development set to test and evaluate your methods
is an extremely common approach in natural language processing and machine
learning. More generally, coming up with a robust set of test cases to
evaluate your work against is also an extremely important part of writing
good code.

In this case, our development set is a bunch of HTML documents (the personal
homepages of some Stanford CS professors) that we've scraped from the web and
downloaded for you. If you're not familiar with the details of HTML or its
syntax, that's fine. For the purposes of this assignment, all you need to know
is that the inputs are text files (with some formatting) that contain the
(possibly obfuscated) emails and phone numbers that we want to extract. You can
find all of these HTML documents in the ```data/dev``` folder.

In addition to the HTML documents, the ```data``` directory also contains
another file, ```data/devGOLD```. You can think of this file as the answer key
corresponding to the documents in ```data/dev```. It contains all the
correctly extracted phone numbers and emails from all the documents in
```data/dev```, in a particular format so your scripts (and our grading scripts)
can read them easily. The format is as follows:

### Format of Matches

Each line in the ```data/devGOLD``` file represents one extracted email address
or phone number in the form of a 3-tuple. Each tuple is represented as 3 strings
separated by vertical bars ("|"):

```
jurafsky|e|jurafsky@stanford.edu
```

The first string is the name of the file that
the match came from (with the ".html" extension removed).

The second string is an 'e' if the match was an email address, or a
'p' if it was a phone number.

The third string is the actual extracted email address or phone number itself,
in the following standard form:

```
  user@example.com
  650-555-1234
```

To sum up, the answers in the ```data/devGOLD``` file and the outputs
generated by your implementation should take the form of Python tuples
that look like:

```
  (filename, match type, match value)
```

__NOTE:__ You don't have to worry about case sensitivity in your values
(email addresses or phone numbers), since they will be normalized to lower case
before being compared with the answers.
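
To make the format concrete, here is a small sketch that turns one line of this form into a tuple. The helper name `parse_match_line` is hypothetical; it is not the provided `get_gold` function, whose implementation may differ.

```python
def parse_match_line(line):
    """Split 'filename|type|value' into a (filename, type, value) tuple."""
    filename, match_type, value = line.strip().split('|')
    # Lowercase the value, mirroring the normalization described above
    return (filename, match_type, value.lower())

print(parse_match_line('jurafsky|e|jurafsky@stanford.edu'))
# → ('jurafsky', 'e', 'jurafsky@stanford.edu')
```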

### Scoring

Your grade will consist of three parts, totalling 25 points.

The first, worth 16 points, scores how well your implementation does on the
development set. For these examples you're given the correct answers,
so you should aim to get 100% of them correct!

The second part of your grade, worth 8 points, will be based on how well your
regular expressions find emails and phone numbers in a different set of
examples, the test set. This test set is hidden and only the teaching staff
knows what is in it! Because you don't know exactly what trickery goes on in
this test set, you should be creative in thinking of different ways of writing
(and hiding) emails and phone numbers.

The third part is a brief section worth 1 point designed to get you thinking about ethical issues surrounding spam emails.

You're not expected to perform perfectly on the test set as you don't know
what is in it, or have the correct answers (just like in real life). As long as
you manage to achieve some reasonable performance (compared to a benchmark that
we provide), you'll get full points! The benchmark is set at 50 test errors or
fewer.

Normally, we would hide your test set performance so you can't tune your methods
to maximize test set performance (this is good experimental procedure).
However, in the interests of transparency and making your life easier, we'll
show you your test score and number of test errors on Gradescope so you can get
an idea of how close you are to the benchmark and full points.

You're free to submit as many times as you'd like on Gradescope until you hit
the benchmark. If you beat the benchmark, you can also submit your
best-performing solution to the Gradescope leaderboard to see how you stack up
against your classmates (although you won't get any extra points above 24).

Here are the equations we use to calculate the scores for the two parts, where
`e` is the total number of errors (false negatives and false positives) for
each part:

__Dev:__

```
  if e < 10 then score(e) = 16 - e
  else if e >= 10 then score(e) = 6
```

__Test:__

```
  if e <= 50      then score(e) = 8
  else if 50 < e  then score(e) = 8 - (e - 50) * 0.2
```
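
The two formulas above translate directly into Python. The function names `points_dev` and `points_test` are made up for this sketch; the actual grading script is not shown here and may be organized differently.

```python
def points_dev(e):
    """Dev score: lose one point per error, with a floor of 6 from 10 errors on."""
    return 16 - e if e < 10 else 6

def points_test(e):
    """Test score: full marks up to 50 errors, then 0.2 points off per extra error."""
    return 8 if e <= 50 else 8 - (e - 50) * 0.2

print(points_dev(0), points_dev(9), points_dev(10))   # → 16 7 6
print(points_test(50), points_test(60))               # → 8 6.0
```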

__Note:__ This sort of two-stage evaluation (a known development set and a
hidden test set) is a very commonly used approach in machine learning!
Evaluating on a development set where we have the "right" answers lets us
measure our performance precisely and improve our approach, while a test set
that is hidden from us until later allows us to see how we perform
"out in the wild", on examples that we might not have been able to tailor
our methods to.

### Checking your Dev Set Performance

We provide a function named `score()` for you to conveniently check your
performance on the development set as you work. It takes in a list of your
predicted matches (the output of the function you'll write) and a list of
correct/gold matches, read from the `data/devGOLD` file.

It compares the two and calculates how they overlap, printing out a
bunch of information in the following form:

```
  True Positives (4):
  set([('balaji', 'e', 'balaji@stanford.edu'),
       ('nass', 'e', 'nass@stanford.edu'),
       ('shoham', 'e', 'shoham@stanford.edu'),
       ('thm', 'e', 'pkrokel@stanford.edu')])
  False Positives (1):
  set([('psyoung', 'e', 'young@stanford.edu')])
  False Negatives (113):
  set([('ashishg', 'e', 'ashishg@stanford.edu'),
       ('ashishg', 'e', 'rozm@stanford.edu'),
       ('ashishg', 'p', '650-723-1614'),
       ('ashishg', 'p', '650-723-4173'),
       ('ashishg', 'p', '650-814-1478'),
  ...
```

The true positive section displays e-mails and phone numbers which are in
both your list of matches and the gold matches list. These are examples
that your regexes correctly found.

The false positive section displays matches which
your regular expressions extracted but which are not in the gold matches list.
These are incorrect and show where your method may have been too
broad/aggressive.

The false negative section displays e-mails and phone numbers which your code
did not match, but which do exist in the HTML files. These are the matches your
code missed.

Your goal, then, is to reduce the number of false positives and false negatives
to zero. At the bottom of the output you can see the total counts of true
positives, false positives, and false negatives.
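
The three categories are plain set operations. The sketch below uses a hypothetical `compare` helper and toy data borrowed from the sample output; the provided `score()` function may compute and print its results differently.

```python
def compare(guesses, gold):
    """Group matches into true positives, false positives, and false negatives."""
    guess_set, gold_set = set(guesses), set(gold)
    return {
        'tp': guess_set & gold_set,   # found and correct
        'fp': guess_set - gold_set,   # found but not in the gold list
        'fn': gold_set - guess_set,   # in the gold list but missed
    }

guesses = [('balaji', 'e', 'balaji@stanford.edu'),
           ('psyoung', 'e', 'young@stanford.edu')]
gold = [('balaji', 'e', 'balaji@stanford.edu'),
        ('ashishg', 'p', '650-723-1614')]

result = compare(guesses, gold)
print(len(result['tp']), len(result['fp']), len(result['fn']))   # → 1 1 1
```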

## Submitting your Solution

__IMPORTANT:__ Before you submit, make sure your code works in an environment
set up using Python 3.8 and the environment file and setup we provided
(and no other dependencies), as this is the environment our autograder
runs in.

Submit your assignment
via [Gradescope (www.gradescope.com)](http://www.gradescope.com).
You can create a submission with the files below, and it will run the
remaining grading with the starter code.
We expect the following files in your final submission:

```
pa1.ipynb (do not alter the filename)
```

If your solution depends on other files, please put those files in a folder
named `deps/` (this folder should be on the same level as `pa1.ipynb`)
and upload a zip file (any name is fine) containing this folder and
`pa1.ipynb` to submit on Gradescope.
Gradescope will then automatically unzip the folder so that your
submission contains:

```
deps/
pa1.ipynb
```

## Frequently Asked Questions


__Q:__ Where should I be writing my Python code?

__A:__ The only place you should need to write code in this assignment is
the function `process_file(filename, data_directory)`.

```
  def process_file(filename, data_directory):
```

It's been marked for you with a "TODO". This function takes
in the filename of a text file as a string and the path to the directory
where the file is located, also as a string. It returns a list of tuples
representing e-mails or phone numbers found in that file. See the section above
on "Format of Matches" for a detailed description of the
output tuple format.

__Q:__ What version of Python should I use? How should I set up my dependencies?

__A:__ In this class, we will be using Python 3.8.5. Please see the README file
in the GitHub repository (and available on Canvas) for instructions on
how to set up your environment.

__Q:__ What format should the phone numbers and e-mails have?

__A:__ The canonical forms we expect are:

```
  user@example.com
  650-555-1234
```

The case of the e-mails you find should not matter because the starter code
and the grading code will lowercase your matched e-mails before comparing
them against the gold set.

## Some Tips before you start:

* Although the phone number and email portions of the assignment are totally
independent and can be done in any order you'd like, we recommend starting with
phone numbers, as they are simpler and a good warm-up for the email regexes!
* If you need a refresher on Python or Jupyter notebooks (or you're new to
either of them), we strongly encourage taking a look at PA0 first.
Referring back to it as you work might help if you run into any issues with
Python syntax or idioms.

### Environment Check

Before we do anything else, let's quickly check that you're running the correct
version of Python and are in the right environment!

In [None]:
import os
assert os.environ['CONDA_DEFAULT_ENV'] == "cs124"

import sys
assert sys.version_info.major == 3 and sys.version_info.minor == 8

If the above cell complains, it means that you're using the wrong environment
or Python version!

If so, please exit this notebook, kill the notebook server with CTRL-C, and
try running

$ conda activate cs124

then restarting your notebook server with

$ jupyter notebook

If that doesn't work, you should go back and follow the installation
instructions in PA0.

## Getting Started!

Alright, let's extract some emails and phone numbers!

First we'll download some extra files that we need, if we don't already have them:

In [1]:
%%bash

if [[ ! -d "./data" ]]
then
    echo "Missing extra files (this probably means you're running on Google Colab). Downloading..."
    git clone https://github.com/cs124/pa1-spamlord.git
    cp -r ./pa1-spamlord/{data,deps,util.py} .
fi

Next, we'll need to import some modules that we'll need later.

__NOTE: Do NOT import and use any other packages outside of the Python standard
library. Although we provide NumPy, Scikit-Learn, and other packages in
the conda environment we set up for you, you will not be using them in PA1, only
in later assignments. Importing them in your solution will cause it to fail the
autograder.__

In [2]:
# For opening files and manipulating file paths
from io import open
import os

# For regular expressions
import re

# Helper functions we'll use later
from util import process_dir, get_gold, score


Now, let's take a look at what our data actually looks like. This should always
be one of the first things you do whenever you're solving a problem that
requires working with data.

All of the data in our development set is found in files in the data/dev
directory. Each one is an HTML file scraped from a CS faculty webpage.

To visualize them, you can take any of the files and open them in a browser of
your choice! For example, by right-clicking on `data/dev/dabo.html` and
clicking `open with -> Google Chrome`.

Feel free to change the filename to some of the other files in data/dev
(i.e. dabo to aiken, balaji, etc.) to take a look at some of the other
faculty pages. This is also one way to explore some of the different formats
that are used for email and phone numbers and start thinking of some ideas for
how to write your regular expressions.

You should find that, as expected, it looks exactly like a
faculty webpage (possibly minus some images which we didn't download along
with the HTML, but that's fine, as we're mostly interested in the text).

As is common for faculty pages, these pages have contact information like email
addresses and phone numbers. Our goal is to write regular expressions
that we can use to automatically match and extract these from the webpages.

Now we've seen what the files look like as webpages. However, for our
 purposes we're interested in the text contents, as that's what we'll
be matching with our regular expressions.

Let's try reading in the same file as a single giant text string:

In [3]:
# Open and read a file as a gigantic string
with open("data/dev/dabo.html", 'r', encoding='ISO-8859-1') as file:
    all_txt = file.read()
    print(all_txt)

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<base href="http://crypto.stanford.edu/~dabo/index.html">
<title>Dan Boneh</title>
</head>

<body background="background.gif" text="#000000" link="#00006F" vlink="#003F00"
alink="#7F0000">

<div align="left">
 


<table border="0" cellspacing="0" cellpadding="15">

<tr> 
   <td></td>



<td>
<a href="pubs.html" 
	onMouseOver="pubs.src='icons/pubsON.gif';"
	onMouseOut="pubs.src='icons/pubs.gif';">
	<img border="0" src="icons/pubs.gif" alt="Publications" name="pubs"></a> <BR>
<a href="#courses"
	onMouseOver="courses.src='icons/coursesON.gif';"
	onMouseOut="courses.src='icons/courses.gif';">
	<img border="0" src="icons/courses.gif" alt="Courses" name="courses"></a> <BR>
<a href="http://crypto.stanford.edu" 
	onMouseOver="students.src='icons/studentsON.gif';"
	onMouseOut="students.src='icons/students.gif';">
	<img border="0" src="icons/students.g

Okay, it's a bit long and hard to parse, but it seems reasonable!
There's a bunch of somewhat cluttered HTML markup,
but it's all in text form and if we search through it, it looks like all of the
text from the page (including the emails and phone numbers!) is somewhere in
there.

Now the interesting part: instead of just reading and printing the text,
let's try to find matches for a regular expression
that we provide, and extract those matches as emails.

In this case we'll use a super-simple pattern that just looks for 1 or more
alphanumeric characters or periods followed by an '@' followed by 1 or more
alphanumeric characters or periods, followed by '.edu'. This is just the usual
format of an email address.
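
Before wrapping this in a function, it can help to see what `re.findall` returns when a pattern has capture groups. The snippet below is a small sketch using a made-up sample string and a variant of the pattern just described, written as a raw string with the final period escaped.

```python
import re

# Two capture groups: the part before '@' and the part after it
pattern = r'([\w.]+)@([\w.]+\.edu)'

# Made-up sample text with one .edu address and one .com address
sample = 'Contact <a href="mailto:jane.doe@cs.example.edu">Jane</a> or bob@example.com'

matches = re.findall(pattern, sample)
print(matches)   # → [('jane.doe', 'cs.example.edu')]

# Each match is a tuple of the groups, so the address can be reassembled
emails = ['%s@%s' % m for m in matches]
print(emails)    # → ['jane.doe@cs.example.edu']
```

The `.com` address is skipped because the pattern insists on an `.edu` suffix.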

__WARNING:__ Note that we strip the ".html" extensions from our
filenames when outputting our matches! This is important, as if you
don't do this your output format won't match the required one. You can
see the format for yourself in the data/devGOLD file.

Let's wrap that all together in a function and give it a whirl:

In [4]:
def example_process_file(filename: str, data_directory: str):
    """
    An example function we wrote to scan through a single file and return
    all the strings matching a simple hardcoded pattern.

    filename is the name of an HTML file, and data_directory is the path
    to the directory containing it. Both are strings. The function returns
    a list of (filename, 'e', email) tuples.
    """

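    # This is the pattern we're trying to match
    # Note the use of two capture clauses, one before the '@' and one after the
    # '@'. The raw string (r'...') keeps the backslashes intact, and '\.'
    # matches a literal period before 'edu'.
    simple_pattern = r'([\w\.]+)@([\w\.]+\.edu)'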

    # The format of our matches requires stripping the ".html" extension
    # from our filenames
    filename_no_ext = filename.split('.')[0]

    with open(os.path.join(data_directory, filename),
              'r', encoding='ISO-8859-1') as file:

        all_txt = file.read()

        res = []
        # This function takes a regex pattern and a text string and returns
        # all matches in the string as a list
        # Each match in the list is a tuple of the capture groups in the
        # expression. So in this case, each element in matches will be a tuple
        # of form (stuff before '@', stuff after '@').
        matches = re.findall(simple_pattern, all_txt)
        for m in matches:
            # Put the email back together
            email = '%s@%s' % m
            res.append((filename_no_ext, 'e', email))

    return res

In [5]:
result = example_process_file('dabo.html', 'data/dev')
print(result)

[('dabo', 'e', 'dabo@cs.stanford.edu')]


Success! It looks like we got our first match! The output of our function in
this case is a list of matches, where each match is a tuple of
(file, 'e' or 'p' indicating email or phone number, extracted email or
phone number). The reason we want our output in this format is so that it works
with our automated evaluation later.

In this case we can see that we have just a single match from the
file data/dev/dabo (note that we only searched in this one file!) which is
an email and is the address 'dabo@cs.stanford.edu'.

Okay, so we've seen how we can process a single file. However, our dev
set consists of many such files. We need a function to loop over all of them,
process them, and return all the extracted addresses.

This function is provided for you in util.py! You will not need to modify it,
but we encourage you to take a look at how it is implemented.

Here's what happens when we run it:

In [6]:
all_results = process_dir('data/dev', example_process_file)

print(all_results)

[('nass', 'e', 'nass@stanford.edu'), ('nass', 'e', 'nass@stanford.edu'), ('nass', 'e', 'nass@stanford.edu'), ('nass', 'e', 'nass@stanford.edu'), ('engler', 'e', 'engler@lcs.mit.edu'), ('dabo', 'e', 'dabo@cs.stanford.edu'), ('kosecka', 'e', 'kosecka@cs.gmu.edu'), ('kosecka', 'e', 'kosecka@cs.gmu.edu'), ('zm', 'e', 'manna@cs.stanford.edu'), ('zm', 'e', 'manna@cs.stanford.edu'), ('thm', 'e', 'pkrokel@Stanford.edu'), ('thm', 'e', 'pkrokel@Stanford.edu'), ('zelenski', 'e', 'zelenski@cs.stanford.edu'), ('zelenski', 'e', 'zelenski@cs.stanford.edu'), ('hanrahan', 'e', 'hanrahan@cs.stanford.edu'), ('hanrahan', 'e', 'hanrahan@cs.stanford.edu'), ('kunle', 'e', 'kunle@ogun.stanford.edu'), ('kunle', 'e', 'kunle@ogun.stanford.edu'), ('kunle', 'e', 'darlene@csl.stanford.edu'), ('kunle', 'e', 'darlene@csl.stanford.edu'), ('psyoung', 'e', 'patrick.young@stanford.edu'), ('psyoung', 'e', 'patrick.young@stanford.edu'), ('balaji', 'e', 'balaji@stanford.edu'), ('widom', 'e', 'siroker@cs.stanford.edu'), ('wi

Looks like we got quite a few more matches, even with our very simple pattern.

You may have also noticed that our results have quite a few duplicates. If you
examine the corresponding files, you can see that this is happening because
the same email address (or phone number) appears more than once in the file.

Don't worry about this for now, we'll be careful to strip out duplicates later
when we're doing our evaluation.
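
If you want to look at de-duplicated results yourself in the meantime, converting the list to a set is enough, since each match is a hashable tuple. This is only a sketch with toy data; the evaluation code may handle duplicates in its own way.

```python
results = [('nass', 'e', 'nass@stanford.edu'),
           ('nass', 'e', 'nass@stanford.edu'),
           ('dabo', 'e', 'dabo@cs.stanford.edu')]

# set() drops repeated tuples; sorted() gives a stable order for reading
unique_results = sorted(set(results))
print(unique_results)
# → [('dabo', 'e', 'dabo@cs.stanford.edu'), ('nass', 'e', 'nass@stanford.edu')]
```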

Speaking of evaluation, before you go on to implement your own method let's
finish up by showing you the evaluation setup.

The evaluation process is straightforward: all that needs to be done is to load
the correct answers for the dev set from the provided file (data/devGOLD) and
compare them to the matches that were generated by your function.

We provide this helper function in util.py (you don't need to modify it, but
again we strongly encourage taking a peek at it to make sure that you understand
what it's doing).

Let's use it to read the gold (correct) matches from the provided file:

In [7]:
all_gold_matches = get_gold('data/devGOLD')
print(all_gold_matches)

[('ashishg', 'e', 'ashishg@stanford.edu'), ('ashishg', 'e', 'rozm@stanford.edu'), ('ashishg', 'p', '650-723-1614'), ('ashishg', 'p', '650-723-4173'), ('ashishg', 'p', '650-814-1478'), ('balaji', 'e', 'balaji@stanford.edu'), ('bgirod', 'p', '650-723-4539'), ('bgirod', 'p', '650-724-3648'), ('bgirod', 'p', '650-724-6354'), ('cheriton', 'e', 'cheriton@cs.stanford.edu'), ('cheriton', 'e', 'uma@cs.stanford.edu'), ('cheriton', 'p', '650-723-1131'), ('cheriton', 'p', '650-725-3726'), ('dabo', 'e', 'dabo@cs.stanford.edu'), ('dabo', 'p', '650-725-3897'), ('dabo', 'p', '650-725-4671'), ('dlwh', 'e', 'dlwh@stanford.edu'), ('engler', 'e', 'engler@lcs.mit.edu'), ('engler', 'e', 'engler@stanford.edu'), ('eroberts', 'e', 'eroberts@cs.stanford.edu'), ('eroberts', 'p', '650-723-3642'), ('eroberts', 'p', '650-723-6092'), ('fedkiw', 'e', 'fedkiw@cs.stanford.edu'), ('hager', 'e', 'hager@cs.jhu.edu'), ('hager', 'p', '410-516-5521'), ('hager', 'p', '410-516-5553'), ('hager', 'p', '410-516-8000'), ('hanrahan

As expected, these exactly match the output format that we showed earlier for
the file processing function. This will make it easy to compare our answers
to the gold answers.

We provide a helper function to do just that in util.py. Given that this
function shows you how your method will be scored, it's definitely worth
taking a look through.

Let's try evaluating our existing super-basic method using the above code:

In [8]:
guess_list = process_dir('data/dev', example_process_file)
gold_list = get_gold('data/devGOLD')
score(guess_list, gold_list)

Guesses (25): 
{('balaji', 'e', 'balaji@stanford.edu'),
 ('cheriton', 'e', 'cheriton@cs.stanford.edu'),
 ('dabo', 'e', 'dabo@cs.stanford.edu'),
 ('engler', 'e', 'engler@lcs.mit.edu'),
 ('eroberts', 'e', 'eroberts@cs.stanford.edu'),
 ('fedkiw', 'e', 'fedkiw@cs.stanford.edu'),
 ('hanrahan', 'e', 'hanrahan@cs.stanford.edu'),
 ('kosecka', 'e', 'kosecka@cs.gmu.edu'),
 ('kunle', 'e', 'darlene@csl.stanford.edu'),
 ('kunle', 'e', 'kunle@ogun.stanford.edu'),
 ('latombe', 'e', 'asandra@cs.stanford.edu'),
 ('latombe', 'e', 'latombe@cs.stanford.edu'),
 ('latombe', 'e', 'liliana@cs.stanford.edu'),
 ('manning', 'e', 'dbarros@cs.stanford.edu'),
 ('manning', 'e', 'manning@cs.stanford.edu'),
 ('nass', 'e', 'nass@stanford.edu'),
 ('nick', 'e', 'nick.parlante@cs.stanford.edu'),
 ('psyoung', 'e', 'patrick.young@stanford.edu'),
 ('rinard', 'e', 'rinard@lcs.mit.edu'),
 ('shoham', 'e', 'shoham@stanford.edu'),
 ('thm', 'e', 'pkrokel@stanford.edu'),
 ('widom', 'e', 'siroker@cs.stanford.edu'),
 ('widom', 'e', '

Looks reasonable! It appears that our basic method produced 25 matches,
while the gold set contains 117 matches.

There were 25 true positives (matches that we found that were in the gold set),
no false positives (matches that we found that were NOT in the gold set), and
92 false negatives (matches in the gold set that we did NOT find).

This seems like a pretty good start, but there are still 92 addresses that our
approach didn't manage to catch. Figuring out how to extract these addresses
without accidentally matching any non-address text is up to you!
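
One preprocessing idea, offered as a sketch rather than a required step: some pages hide characters behind HTML entities (for example `&#x40;` for `@`). The standard library's `html.unescape` decodes these before any regex runs. The sample string and the pattern below are made up for illustration.

```python
import html
import re

# Made-up sample where '@' is written as a hexadecimal character reference
sample = 'jane.doe&#x40;cs.example.edu'

decoded = html.unescape(sample)
print(decoded)   # → jane.doe@cs.example.edu

emails = re.findall(r'[\w.]+@[\w.]+\.edu', decoded)
print(emails)    # → ['jane.doe@cs.example.edu']
```

Decoding also changes other entities (for instance `&thinsp;` becomes the thin space character U+2009), so patterns written against the raw HTML may need adjusting if you go this route.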

You should write your implementation in the function below. Be sure to read the
documentation so that you know what sort of inputs/outputs will be expected.

In [121]:
# Building blocks for the email patterns. Each piece is its own capture group.
e_name = r'([\w\.]{3,})'
e_at = r'(@|at|&#x40;|where)'
e_loc = r'([\w\.(?:dot);\-]{3,})'
#e_loc = r'([\w\.-;]+)' # this one yields different results
e_dot = r'(\.|;|do?[tm]|\s)'
# Matches 'edu' with optional hyphens around the letters (e.g. 'e-d-u')
e_suf = r'([\-]*e[\-]*d[\-]*u[\-]*)'
# General pattern: name, '@' stand-in, domain, '.' stand-in, 'edu'
e_pat1 = r'{}\s*{}\s*{}\s*{}\s*{}'.format(e_name, e_at, e_loc, e_dot, e_suf)
# Variant that expects an opening parenthesis after the name
e_pat2 = r'{}\s*\(.*{}\s*{}\s*{}\s*{}'.format(e_name, e_at, e_loc, e_dot, e_suf)
e_pat2

'([\\w\\.]{3,})\\s*\\(.*(@|at|&#x40;|where)\\s*([\\w\\.(?:dot);\\-]{3,})\\s*(\\.|;|do?[tm]|\\s)\\s*([\\-]*e[\\-]*d[\\-]*u[\\-]*)'

In [115]:
p_pat1 = r'(\d{3})[^\da-zA-Z\(]{1,5}(\d{3})[^\da-zA-Z\(,]{1,5}(\d{4})'

In [113]:

re.findall(p_pat1, '684 985 4587', re.IGNORECASE)

[('684', '985', '4587')]

In [84]:
process_file('ullman.html', 'data/dev')

[('ullman', 'e', 'support@gradiance.edu'),
 ('ullman', 'e', 'ullman@cs.stanford.edu')]

In [102]:
# TODO: Implement your approach here!
def process_file(filename: str, data_directory: str):
    """
    This function takes in a filename, opens the corresponding file, and
    scans its contents against regex patterns. It returns a list of
    (filename, type, value) tuples where type is either an 'e' or a 'p'
    for e-mail or phone, and value is the formatted phone number or e-mail.
    The canonical formats are:
         (filename, 'p', '###-###-####')
         (filename, 'e', 'someone@something')
    If the numbers you submit are formatted differently they will not
    match the gold answers.

    NOTE: DO NOT CHANGE THIS INTERFACE, as it will be called directly by
    the submit script
    """
    # TODO: Replace/modify this with your implementation
    filename_no_ext = filename.split('.')[0]

    res = []
    # Open the file in a with-block so it is closed when we're done
    with open(os.path.join(data_directory, filename),
              'r', encoding='ISO-8859-1') as file:
        for line in file:
            # First try the general email pattern
            e_mat1 = re.findall(e_pat1, line, re.IGNORECASE)
            for m in e_mat1:
                # Groups: 0 = user name, 2 = domain, 4 = 'edu' suffix
                m0, m2, m4 = m[0], m[2], m[4]
                m2 = re.sub(r'-', r'', m2)          # drop hyphens
                m2 = re.sub(r';', r'.', m2)         # ';' stands in for '.'
                m4 = re.sub(r'[^a-zA-Z]', r'', m4)  # keep only the letters
                email = '%s@%s.%s' % (m0, m2, m4)
                res.append((filename_no_ext, 'e', email))
            # Fall back to the parenthesis variant if nothing matched
            if not e_mat1:
                e_mat2 = re.findall(e_pat2, line, re.IGNORECASE)
                for m in e_mat2:
                    m0, m2 = m[0], m[2]
                    email = '%s@%s.edu' % (m0, m2)
                    res.append((filename_no_ext, 'e', email))

            # Phone numbers: three groups of digits joined with hyphens
            p_mat1 = re.findall(p_pat1, line, re.IGNORECASE)
            for m in p_mat1:
                m0, m1, m2 = m[0], m[1], m[2]
                phone = '%s-%s-%s' % (m0, m1, m2)
                res.append((filename_no_ext, 'p', phone))

    return res

__REMINDER:__ You should strip the ".html" extension from your filenames
when returning your formatted matches.
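
One way to strip the extension is `os.path.splitext`, sketched below with a hypothetical `strip_extension` helper. For the files in `data/dev`, `filename.split('.')[0]` gives the same result, but the two differ when a filename contains more than one period.

```python
import os

def strip_extension(filename):
    """Return the filename without its final extension."""
    return os.path.splitext(filename)[0]

print(strip_extension('dabo.html'))          # → dabo
print(strip_extension('first.last.html'))    # → first.last
print('first.last.html'.split('.')[0])       # → first
```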

As you work on your methods, you can use the code below to evaluate your
performance:

In [122]:
guess_list = process_dir('data/dev', process_file)
gold_list = get_gold('data/devGOLD')
score(guess_list, gold_list)

Guesses (116): 
{('ashishg', 'e', 'ashishg@stanford.edu'),
 ('ashishg', 'e', 'rozm@stanford.edu'),
 ('ashishg', 'p', '650-723-1614'),
 ('ashishg', 'p', '650-723-4173'),
 ('ashishg', 'p', '650-814-1478'),
 ('balaji', 'e', 'balaji@stanford.edu'),
 ('bgirod', 'p', '650-723-4539'),
 ('bgirod', 'p', '650-724-3648'),
 ('bgirod', 'p', '650-724-6354'),
 ('cheriton', 'e', 'cheriton@cs.stanford.edu'),
 ('cheriton', 'e', 'uma@cs.stanford.edu'),
 ('cheriton', 'p', '650-723-1131'),
 ('cheriton', 'p', '650-725-3726'),
 ('dabo', 'e', 'dabo@cs.stanford.edu'),
 ('dabo', 'p', '650-725-3897'),
 ('dabo', 'p', '650-725-4671'),
 ('dlwh', 'e', 'dlwh@stanford.edu'),
 ('engler', 'e', 'engler@lcs.mit.edu'),
 ('engler', 'e', 'engler@stanford.edu'),
 ('eroberts', 'e', 'eroberts@cs.stanford.edu'),
 ('eroberts', 'p', '650-723-3642'),
 ('eroberts', 'p', '650-723-6092'),
 ('fedkiw', 'e', 'fedkiw@cs.stanford.edu'),
 ('fedkiw', 'e', 'www.m@h.ucla.edu'),
 ('hager', 'p', '410-516-5521'),
 ('hager', 'p', '410-516-5553'),


#### The Spamlord's Environmental Impact

Wait! Don’t send that spam email! Like most other tasks involving electricity, sending that little spam email actually releases carbon dioxide into the atmosphere. Each email sent requires not only the electricity you use on your personal device, but also the energy to store and transmit the message through data centers. According to carbon footprint expert Mike Berners-Lee’s 2010 book <a href="http://www.goodreads.com/book/show/7230015-how-bad-are-bananas">“How Bad are Bananas: The Carbon Footprint of Everything”</a>, the average spam email has a carbon footprint equivalent to 0.3 grams of carbon dioxide. Furthermore, a normal email has a footprint of 4 g of carbon dioxide, and an email with long attachments can have a carbon footprint of 50 g of carbon dioxide.

It is estimated that globally around <a href="https://talosintelligence.com/reputation_center/email_rep?cid=27273&industry=agency&offset=390">120 billion spam emails</a> are sent every day. Calculate the global carbon emission for a day’s worth of spam emails. Provide the answer in tons.

In [123]:
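def carbon_dioxide_emissions():
    # TODO: Replace/modify this so the method returns your solution
    # 120 billion (120 * 1e9) emails per day at 0.3 g of CO2 each,
    # converted from grams to metric tons (assuming 1 ton = 1e6 g)
    carbon_emissions = 120 * 1e9 * 0.3 / 1e6
    return carbon_emissions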

What kind of policies could the government enact to mitigate the quantity of carbon dioxide emitted as a result of spam emails? Who should be responsible for instigating these policies? Please provide a 3-6 sentence response. There is not one correct answer, so please feel free to be creative and share your ideas.

In [124]:
def government_response():
    # TODO: Place your response into the response string below
    response = "Hire more NLP experts to filter out the spams more effectively."
    return response

Once you're ready to submit, you can run the cell below to prepare and zip
up your solution:

In [125]:
%%bash

if [[ ! -f "./pa1.ipynb" ]]
then
    echo "WARNING: Did not find notebook in Jupyter working directory. This probably means you're running on Google Colab. You'll need to go to File->Download .ipynb to download your notebook and other files, then zip them locally. See the README for more information."
else
    echo "Found notebook file, creating submission zip..."
    zip -r submission.zip pa1.ipynb deps/
fi


Found notebook file, creating submission zip...
  adding: pa1.ipynb (deflated 81%)
  adding: deps/ (stored 0%)
  adding: deps/example_dep.txt (stored 0%)


If you're running on Google Colab, see the README for instructions on
how to submit.

__Best of luck!__

__Some reminders for submission:__
* If you have any extra files required for your implementation to work, make
sure they are in a `deps/` folder on the same level as `pa1.ipynb` and
include that folder in your submission zip file.
* Make sure you didn't accidentally change the name of your notebook file
(it should be `pa1.ipynb`), as that is required for the autograder to work.
* Go to Gradescope (gradescope.com), find the PA1 SpamLord assignment, and
upload your zip file (`submission.zip`) as your solution.
* Wait for the autograder to run (it should only take a minute or so) and check
that your submission was graded successfully! If the autograder fails, or you
get an unexpected score, it may be a sign that your zip file was incorrect.