## Tips
- To avoid unpleasant surprises, I suggest you _run all cells in their order of appearance_ (__Cell__ $\rightarrow$ __Run All__).


- If the changes you've made to your solution don't seem to be showing up, try running __Kernel__ $\rightarrow$ __Restart & Run All__ from the menu.


- Before submitting your assignment, make sure everything runs as expected. First, restart the kernel (from the menu, select __Kernel__ $\rightarrow$ __Restart__) and then **run all cells** (from the menu, select __Cell__ $\rightarrow$ __Run All__).

## Reminder

- Make sure you fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`, as well as your name, UA email, and collaborators below:



Several of the cells in this notebook are **read only** to ensure instructions aren't unintentionally altered.  

If you can't edit the cell, it is probably intentional.

In [1]:
NAME = "Kathleen Costa"
# University of Arizona email address
EMAIL = "kathleencosta@arizona.edu"
# Names of any collaborators.  Write N/A if none.
COLLABORATORS = "N/A"

## Scratchpad

You are welcome to create new cells (see the __Cell__ menu) to experiment and debug your solution.

In [2]:
%load_ext autoreload
%autoreload 2

# Mini Python tutorial

This course uses Python 3.11.

Below is a very basic (and incomplete) overview of the Python language... 

For those completely new to Python, [this section of the official documentation may be useful](https://docs.python.org/3.11/library/stdtypes.html#common-sequence-operations).
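
As a quick orientation to the sequence operations covered in that section of the documentation, here is a minimal sketch. The variable names are made up for illustration only.

```python
letters = "hello"       # a string is a sequence of characters
numbers = [1, 2, 3, 4]  # a list is a sequence of arbitrary objects

print(len(letters))   # 5
print(letters[0])     # h
print(numbers[-1])    # 4
print(numbers[1:3])   # [2, 3]
print(3 in numbers)   # True
print(numbers + [5])  # [1, 2, 3, 4, 5]
```

The same operations (length, indexing, slicing, membership, concatenation) work on strings, lists, and tuples.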

In [3]:
# This is a comment.  
# Any line starting with # will be interpreted as a comment

# this is a string assigned to a variable
greeting = "hello"

# If enclosed in triple quotes, strings can also be multiline:

"""
I'm a multiline
string.
"""

# let's use a for loop to print it letter by letter
for letter in greeting:
    print(letter)
    
# Did you notice the indentation there?  Whitespace matters in Python!

# here's a list of integers

numbers = [1, 2, 3, 4]

# let's add one to each number using a list comprehension
# and assign the result to a variable called res
# list comprehensions are used widely in Python (they're very Pythonic!)

res = [num + 1 for num in numbers]

# let's confirm that it worked
print(res)

# now let's try spicing things up using a conditional to filter out all values greater than or equal to 3...
print([num for num in res if not num >= 3])

# Python 3.6 introduced "f-strings" as a convenient way of formatting strings using templates
# For example ...
name = "Josuke"

print(f"{greeting}, {name}!")

# f-strings are f-ing convenient!


# let's look at defining functions in Python..

def greet(name):
    print(f"Howdy, {name}!")

# here's how we call it...

greet("partner")

# let's add a description of the function...

def greet(name):
    """
    Prints a greeting given some name.
    
    :param name: the name to be addressed in the greeting
    :type name: str
    
    """
    print(f"Howdy, {name}!")
    
# I encourage you to use docstrings!

# Python introduced support for optional type hints in v3.5.
# You can read more about this feature here: https://docs.python.org/3.7/library/typing.html
# let's give it a try...
def add_six(num: int) -> int:
    return num + 6

# this should print 13
print(add_six(7))

# Python also has "anonymous functions" (also known as "lambda" functions)
# take a look at the following code:

greet_alt = lambda name: print(f"Hi, {name}!")

greet_alt("Fred")

# lambda functions are often passed to other functions
# For example, they can be used to specify how a sequence should be sorted
# let's sort a list of pairs by their second element
pairs = [("bounce", 32), ("bighorn", 12), ("radical", 4), ("analysis", 7)]
# index -1 refers to the last item in a sequence, -2 to the second-to-last item, etc.
print(sorted(pairs, key=lambda pair: pair[-1]))

# we can sort it by the first element instead
# NOTE: python indexing is zero-based
print(sorted(pairs, key=lambda pair: pair[0]))

# You can learn more about other core data types and their methods here: 
# https://docs.python.org/3.7/library/stdtypes.html

# Because of its extensive standard library, Python is often described as coming with "batteries included".  
# Take a look at these "batteries": https://docs.python.org/3.7/library/

# You now know enough to complete this homework assignment (or at least where to look)

h
e
l
l
o
[2, 3, 4, 5]
[2]
hello, Josuke!
Howdy, partner!
13
Hi, Fred!
[('radical', 4), ('analysis', 7), ('bighorn', 12), ('bounce', 32)]
[('analysis', 7), ('bighorn', 12), ('bounce', 32), ('radical', 4)]


# Getting started

In this assignment, you'll be implementing ...

- **feature functions** to generate word and character _n_-grams
- a `Vocabulary` class for managing features

... and using them to generate **feature vectors** for words and documents.

Representing words and documents as vectors is a fundamental step in statistical approaches to prediction/classification and clustering.


In [4]:
from typing import List, Union, Iterable, Text, Tuple, Dict, Any
from collections import Counter
import re

# `Vocabulary` class

The `Vocabulary` class keeps track of our observed features and assigns each distinct feature a unique ID.  A feature's ID corresponds to its index (column) in feature vectors.  The first dimension (index 0) corresponds to the `<UNK>` feature, which represents unknown or unseen features.
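
To make the feature-to-ID idea concrete, here is a minimal sketch using plain dictionaries. The names `toy_features`, `toy_f2i`, and `toy_vector` are hypothetical and are not part of the assignment's `Vocabulary` API.

```python
# Hypothetical toy mapping (not the assignment's Vocabulary class):
# ID 0 is reserved for "<UNK>"; the remaining features are alphabetized.
toy_features = ["is_furry", "has_tail"]
toy_f2i = {"<UNK>": 0}
for idx, feature in enumerate(sorted(set(toy_features)), start=1):
    toy_f2i[feature] = idx

# Each ID doubles as a column index in a feature vector.
toy_vector = [0] * len(toy_f2i)
toy_vector[toy_f2i["has_tail"]] += 1

print(toy_f2i)     # {'<UNK>': 0, 'has_tail': 1, 'is_furry': 2}
print(toy_vector)  # [0, 1, 0]
```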

If you're new to defining classes in Python, please see the following video:

- https://youtu.be/9yDOI2UvBtU

When you're ready, implement the following methods:

- `id_for(self, feature: Feature) -> int`
- `feature_for(self, feature_id: int) -> Union[Feature, None]`
- `create_f2i(features: Iterable[Feature]) -> Dict[Feature, int]`
- `add_feature(self, feature: Feature) -> None`

As always, use each method's docstring and accompanying tests to guide your solution.
  
  
**NOTE**: methods decorated with [`@staticmethod`](https://docs.python.org/3.7/library/functions.html#staticmethod) do **not** take `self` as a parameter.  Treat them like regular functions.
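
The difference between an instance method and a static method can be sketched as follows. `Greeter` is a hypothetical class invented for this illustration and has nothing to do with the assignment's code.

```python
class Greeter:
    """Hypothetical class used only to contrast the two kinds of methods."""

    def __init__(self, name: str):
        self.name = name

    def greet(self) -> str:
        # instance method: receives the instance as `self`
        return Greeter.format_greeting(self.name)

    @staticmethod
    def format_greeting(name: str) -> str:
        # static method: no `self`; behaves like a plain function
        # that happens to live in the class namespace
        return f"Howdy, {name}!"

# a static method is callable on the class itself, without creating an instance
print(Greeter.format_greeting("partner"))  # Howdy, partner!
print(Greeter("Josuke").greet())           # Howdy, Josuke!
```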

In [5]:
Feature = Union[Text, Tuple[Any, ...]]

class Vocabulary:
    """
    Stateful vocabulary.
    Provides a mapping from feature to ID and a reverse mapping of ID to feature.
    """
    # symbol for unknown terms    
    UNKNOWN = "<UNK>"
    
    def __init__(self, features: Iterable[Feature]=[]):
        """
        :param features: a sequence of features
        :type features: a sequence of strings
        
        :Example:
        v = Vocabulary(["has_four_legs", "is_furry", "has_tail"])
        """
        self.f2i: Dict[Feature, int] = Vocabulary.create_f2i(features)
        self.i2f: Dict[int, Feature] = Vocabulary.create_i2f(self.f2i)
        
    def id_for(self, feature: Feature) -> int:
        """
        Looks up ID for feature using self.f2i.  
        If the feature is unknown, returns -1.
        
        :Example:
        
        v = Vocabulary(["b_feature", "a_feature"])
        assert v.id_for('<UNK>') == 0
        assert v.id_for('a_feature') == 1
        assert v.id_for('b_feature') == 2
        assert v.id_for('z_feature') == -1
        """
        # YOUR CODE HERE
        return self.f2i.get(feature, -1)
        
    def feature_for(self, feature_id: int) -> Union[Feature, None]:
        """
        Looks up term corresponding to feature_id.  
        If feature_id is unknown, returns None.
        
        :Example:
        
        v = Vocabulary(["b_feature", "a_feature"])
        assert v.feature_for(0) == v.UNKNOWN
        assert v.feature_for(1) == 'a_feature'
        assert v.feature_for(2) == 'b_feature'
        assert v.feature_for(19) is None
        """
        # YOUR CODE HERE
        return self.i2f.get(feature_id, None)
    
    @property
    def features(self) -> List[Feature]:
        """
        @property decorator allows attribute-like use of a method:
        
        :Example:
        
        v = Vocabulary(["has_four_legs", "is_furry", "has_tail"])
        assert v.features == ['<UNK>', 'has_four_legs', 'has_tail', 'is_furry']
        """
        return [self.i2f[i] for i in range(len(self.i2f))]
        
    @staticmethod
    def create_f2i(features: Iterable[Feature]) -> Dict[Feature, int]:
        """
        Takes a flat iterable of terms and returns a dictionary of term -> int.
        Assumes terms have already been normalized.

        Requirements:
        - First term in vocabulary (ID 0) is reserved for Vocabulary.UNKNOWN.
        - All keys following Vocabulary.UNKNOWN should be alphabetized (a -> z).
        """
        # YOUR CODE HERE
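        # de-duplicate, drop any explicit UNKNOWN (ID 0 is reserved for it), then alphabetize
        features = sorted(set(features) - {Vocabulary.UNKNOWN})
        return {Vocabulary.UNKNOWN: 0, **{feature: idx + 1 for idx, feature in enumerate(features)}}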
    
    @staticmethod
    def create_i2f(f2i: Dict[Feature, int]) -> Dict[int, Feature]:
        """
        Takes a dict of string -> integer and returns a reverse mapping of integer -> string.
        
        :Example:
        
        assert Vocabulary.create_i2f({"a_feature": 1, "b_feature": 2}) == {1: "a_feature", 2: "b_feature"}
        """
        return {i:f for (f, i) in f2i.items()}

    def add_feature(self, feature: Feature) -> None:
        """
        Takes a term and updates self.f2i 
        and self.i2f if the term is not already in the vocabulary.

        NOTE: add_feature only appends features.  It does not ensure the features are alphabetized after a new feature is added.
        
        :Example:
        
        v = Vocabulary(["first_feature"])
        assert v.features == ['<UNK>', 'first_feature']
        v.add_feature("second_feature")
        assert v.features == ['<UNK>', 'first_feature', 'second_feature']
        assert v.id_for("second_feature") == 2
        """
        # YOUR CODE HERE
        if feature not in self.f2i:
            next_id = len(self.f2i)
            self.f2i[feature] = next_id
            self.i2f[next_id] = feature
        
    def __add__(self, other):
        """
        Defines what should happen when two instances of Vocabulary are summed.
                
        :Example:
        
        v1 = Vocabulary(["first_feature"])
        v2 = Vocabulary(["second_feature"])
        v3 = v1 + v2
        assert len(v3) == 3
        """
        if isinstance(other, Text):
            extra = [other]
        elif isinstance(other, Vocabulary):
            extra = list(other.f2i)
        else:
            return self
        # dict views don't support `+`; build a list and skip the reserved UNKNOWN entry
        features = [f for f in list(self.f2i) + extra if f != Vocabulary.UNKNOWN]
        return Vocabulary(features)
    
    def __len__(self):
        """
        Defines what should happen when `len` is called on an instance of this class.
        
        :Example:
        
        v = Vocabulary(["first_feature"])
        assert len(v) == 2
        """
        return len(self.f2i)
    
    def __contains__(self, other):
        """
        Defines what should happen when `in` is used with an instance of this class.
        
        :Example:
        
        v = Vocabulary(["feature_1", "feature_2"])
        assert "feature_1" in v
        """
        return other in self.f2i


In [6]:
# requires you to implement `create_f2i`
v = Vocabulary()
assert len(v) == 1

In [7]:
# requires you to implement `create_f2i`
v = Vocabulary()
assert Vocabulary.UNKNOWN in v

In [8]:
# requires you to implement `create_f2i`
v = Vocabulary(["ends_with_ly"])
assert len(v) == 2

In [9]:
# requires you to implement `create_f2i`
v = Vocabulary(["ends_with_ly", "ends_with_ly"])
assert len(v) == 2

In [10]:
# requires you to implement `create_f2i`
v = Vocabulary()
assert v.features == [Vocabulary.UNKNOWN]

In [11]:
# requires you to implement `create_f2i`
v = Vocabulary()
assert v.i2f[0] == Vocabulary.UNKNOWN

In [12]:
v = Vocabulary()
assert v.id_for(Vocabulary.UNKNOWN) == 0

In [13]:
v = Vocabulary()
# this feature shouldn't exist
assert v.id_for("Xxx") == -1

In [14]:
v = Vocabulary()
assert v.feature_for(0) == Vocabulary.UNKNOWN

In [15]:
v = Vocabulary()
new_features = ["a$", "^h", "ho", "ol", "la", "la"]
ids = set([Vocabulary.UNKNOWN])

for feat in new_features:
    v.add_feature(feat)
    ids.add(v.id_for(feat))

assert len(ids) == len(v)
assert len(ids) == len(set(new_features)) + 1

# Feature functions

Next, we'll implement a couple of feature functions to generate token and character _n_-grams.

## `ngrams`

Generate _n_-grams for the provided tokens.

**HINTS**:
- You are expected to return a list of **tuples**.  You can turn a list into a tuple using `tuple(somelist)`.
- Return an empty list if no _n_-grams can be generated (e.g., trigrams for `["hello"]` when `use_start_end=False`).
- See https://parsertongue.org/tutorials/n-grams/
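
The sliding-window idea behind _n_-grams can be sketched with `zip`. This is only one possible approach, and `sliding_windows` is a hypothetical helper name, not the `ngrams` function you are asked to implement.

```python
from typing import List, Tuple

def sliding_windows(n: int, items: List[str]) -> List[Tuple[str, ...]]:
    """Hypothetical helper: all contiguous windows of size n, as tuples."""
    if n <= 0:
        return []
    # zip stops at the shortest slice, so inputs shorter than n yield []
    return list(zip(*(items[i:] for i in range(n))))

tokens = ["Good", "news", "everyone"]
# for bigrams, n - 1 = 1 padding symbol is added on each side
padded = ["<S>"] + tokens + ["</S>"]

print(sliding_windows(2, tokens))
# [('Good', 'news'), ('news', 'everyone')]
print(sliding_windows(2, padded))
# [('<S>', 'Good'), ('Good', 'news'), ('news', 'everyone'), ('everyone', '</S>')]
print(sliding_windows(3, ["Good"]))
# []
```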

In [16]:
def ngrams(
    # the size of the n-gram
    n: int, 
    tokens: List[Text], 
    use_start_end: bool = True,
    start_symbol: Text = "<S>",
    end_symbol: Text = "</S>"
) -> List[Tuple[Text, ...]]:
    """
    Generates a list of n-gram tuples for the provided sequence of tokens.
    """
    # YOUR CODE HERE
    if n <= 0:
        return []
    
    if use_start_end:
        tokens = [start_symbol] * (n - 1) + tokens + [end_symbol] * (n - 1)

    ngrams_list = []
    
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngrams_list.append(ngram)

    return ngrams_list

In [17]:
assert ngrams(n=1, tokens=["Good"], use_start_end=False) == [('Good',)]

In [18]:
assert ngrams(n=0, tokens=["Good"], use_start_end=False) == []

In [19]:
assert ngrams(n=2, tokens=["Good"], use_start_end=False) == []
assert ngrams(n=3, tokens=["Good"], use_start_end=False) == []

In [20]:
assert ngrams(n=2, tokens=["Good", "news"], use_start_end=False) == [('Good', 'news')]

In [21]:
assert ngrams(n=2, tokens=["Good"], use_start_end=True) == [('<S>', 'Good'), ('Good', '</S>')]

In [22]:
assert ngrams(n=3, tokens=["Good"], use_start_end=True) == [
    ('<S>', '<S>', 'Good'), 
    ('<S>', 'Good', '</S>'), 
    ('Good', '</S>', '</S>')
]

In [23]:
assert ngrams(n=3, tokens=["Good"], use_start_end=True, start_symbol="#", end_symbol="#") == [
    ('#', '#', 'Good'), 
    ('#', 'Good', '#'), 
    ('Good', '#', '#')
]

In [24]:
assert ngrams(n=3, tokens=["Good", "news", "everyone", "!"], use_start_end=True) == [
    ('<S>', '<S>', 'Good'),
     ('<S>', 'Good', 'news'),
     ('Good', 'news', 'everyone'),
     ('news', 'everyone', '!'),
     ('everyone', '!', '</S>'),
     ('!', '</S>', '</S>')
]

## `char_ngrams`

Generate character _n_-grams for the provided text.

**HINTS**:
- Think about how you can make use of the `ngrams` function you implemented
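
One way to see why token _n_-grams and character _n_-grams are closely related: a string is already a sequence of characters. The sketch below uses only built-ins, and its variable names are made up for illustration.

```python
text = "test"

# A string is a sequence of characters, so list() yields one token per character.
char_tokens = list(text)
print(char_tokens)  # ['t', 'e', 's', 't']

# Pad for bigrams (n - 1 = 1 symbol per side), then slide a window of size 2.
padded = ["^"] + char_tokens + ["$"]
bigrams = [tuple(padded[i:i + 2]) for i in range(len(padded) - 1)]
print(bigrams)
# [('^', 't'), ('t', 'e'), ('e', 's'), ('s', 't'), ('t', '$')]
```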

In [25]:
def char_ngrams(
    n: int, 
    text: Text,
    use_start_end: bool = True,
    start_symbol: Text   = "^",
    end_symbol: Text     = "$"
) -> List[Tuple[Text, ...]]:
    """
    Generates a list of n-gram tuples for the provided text.
    """
    # YOUR CODE HERE
    # a string is a sequence of characters, so we can reuse `ngrams`
    # (which already returns [] when n <= 0)
    return ngrams(
        n=n,
        tokens=list(text),
        use_start_end=use_start_end,
        start_symbol=start_symbol,
        end_symbol=end_symbol
    )

In [26]:
assert char_ngrams(n=-1, text="test") == []

In [27]:
assert char_ngrams(n=2, text="test") == [('^', 't'), ('t', 'e'), ('e', 's'), ('s', 't'), ('t', '$')]

In [28]:
assert char_ngrams(n=2, text="test", use_start_end=False) == [('t', 'e'), ('e', 's'), ('s', 't')]

In [29]:
assert char_ngrams(n=3, text="test") == [
    ('^', '^', 't'),
    ('^', 't', 'e'),
    ('t', 'e', 's'),
    ('e', 's', 't'),
    ('s', 't', '$'),
    ('t', '$', '$')
]

In [30]:
assert char_ngrams(n=3, text="test", use_start_end=False) == [('t', 'e', 's'), ('e', 's', 't')]

# Creating feature vectors

Now that we've created a `Vocabulary` and some feature functions, we're ready to generate feature vectors.

## `make_count_vector()`

Takes a sequence of features and a `Vocabulary` and produces a feature vector of counts.

**HINTS**:
- Count occurrences of each feature.  
- If a feature is not in the vocabulary, add its occurrences to the count for the `Vocabulary.UNKNOWN` feature.
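
The two hints above can be sketched with `collections.Counter` and a plain dictionary. `toy_f2i` is a hypothetical stand-in for a `Vocabulary` instance, and `has_digit` is an invented out-of-vocabulary feature.

```python
from collections import Counter

# Hypothetical toy mapping standing in for a Vocabulary instance.
toy_f2i = {"<UNK>": 0, "ends_with_ly": 1, "starts_with_c": 2}
observed = ["starts_with_c", "starts_with_c", "ends_with_ly", "has_digit"]

counts = Counter(observed)
vector = [0] * len(toy_f2i)
for feature, count in counts.items():
    # dict.get falls back to the <UNK> index for features outside the mapping
    vector[toy_f2i.get(feature, toy_f2i["<UNK>"])] += count

print(vector)  # [1, 1, 2]
```

Here the single occurrence of `has_digit` lands in column 0, the slot reserved for unknown features.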

In [31]:
def make_count_vector(datum_features: Iterable[Feature], vocab: Vocabulary) -> List[int]:
    """
    Converts a sequence of features to a count vector.
    
    Takes a datum (seq of features) and a Vocabulary instance.
    Returns a new vector where each feature value is mapped to the count of that feature in the data
    """
    # YOUR CODE HERE
    count_vector = [0] * len(vocab)  
    
    feature_counts = Counter(datum_features)
    
    for feature, count in feature_counts.items():
        if feature in vocab:
            index = vocab.id_for(feature) 
            count_vector[index] += count  
        else:
            unknown_index = vocab.id_for(Vocabulary.UNKNOWN)
            count_vector[unknown_index] += count  
    
    return count_vector

In [32]:
v         = Vocabulary(["ends_with_ly", "starts_with_c", "starts_with_g"])
doc_feats = ["ends_with_ly", "starts_with_c", "starts_with_c"]
vector    = make_count_vector(doc_feats, v)

assert len(vector) == len(v)

In [33]:
# ensure the doc has no unknown features

v         = Vocabulary(["ends_with_ly", "starts_with_c", "starts_with_g"])
doc_feats = ["ends_with_ly", "starts_with_c", "starts_with_c"]
vector    = make_count_vector(doc_feats, v)

f_idx     = v.id_for(Vocabulary.UNKNOWN)
assert vector[f_idx] == 0

In [34]:
# ensure occurrences of features are counted

v         = Vocabulary(["ends_with_ly", "starts_with_c", "starts_with_g"])
doc_feats = ["ends_with_ly", "starts_with_c", "starts_with_c"]
vector    = make_count_vector(doc_feats, v)

for idx, feat_value in enumerate(vector):
    print(f"{v.feature_for(idx)}\t{feat_value}")

f_idx     = v.id_for("starts_with_c")
assert vector[f_idx] == 2
assert vector == [0, 1, 2, 0]

<UNK>	0
ends_with_ly	1
starts_with_c	2
starts_with_g	0


In [35]:
# ensure occurrences of features are counted

last      = char_ngrams(n=2, text="last")
vest      = char_ngrams(n=2, text="vest")

v         = Vocabulary(last + vest)
doc_feats = char_ngrams(n=2, text="avast")
vector    = make_count_vector(doc_feats, v)

for idx, feat_value in enumerate(vector):
    print(f"{v.feature_for(idx)}\t{feat_value}")

assert vector == [3, 0, 0, 1, 0, 0, 1, 1, 0]

<UNK>	3
('^', 'l')	0
('^', 'v')	0
('a', 's')	1
('e', 's')	0
('l', 'a')	0
('s', 't')	1
('t', '$')	1
('v', 'e')	0


## `binarize_vector()`

Takes a feature vector and converts it to a binary representation (i.e., each feature's value becomes either 0 or 1).

**HINTS**:
- Use 1 to represent any feature with a value $> 0$
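
As a small illustration of that hint, a comparison such as `count > 0` produces a `bool`, which `int()` converts to 1 or 0. The variable names here are made up for the example.

```python
counts = [3, 0, 10, 1]

# A comparison yields a bool; int() converts True/False to 1/0.
binary = [int(count > 0) for count in counts]

print(binary)       # [1, 0, 1, 1]
print(sum(binary))  # 3 (the number of features that occur at least once)
```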

In [36]:
def binarize_vector(vector: List[int]) -> List[int]:
    """
    Takes a count vector and 
    returns a new vector where each feature value is mapped to either 0 or 1
    """
    # YOUR CODE HERE
    return [1 if count > 0 else 0 for count in vector]

In [37]:
assert binarize_vector([3, 0, 10]) == [1, 0, 1]

In [38]:
assert binarize_vector([1, 0, 0]) == [1, 0, 0]

# Bonus: task-specific feature functions and feature vectors

Representations for words and documents are often task-specific.  Implement 3 or more feature functions that will aid in carrying out a specific task.

## Option A: SPAM vs $\neg$SPAM
`bonus_docs` represents a toy SPAM classification dataset.  Write feature functions and use them to generate representations for each document in `bonus_docs`. 

## Option B: Pick your own docs and task
Create a set of documents related to a specific task, write feature functions, and use them to generate representations for each document in this set.


### Requirements

- Create at least 3 new feature functions.
- Generate feature vectors for 3 or more documents (for example, `bonus_docs`).  
- Describe your features.  How are they suited to the task you're modeling (e.g., distinguishing between SPAM and HAM, i.e., not SPAM)?
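
To give a sense of what a task-specific feature function might look like, here is a minimal sketch. `title_case_ratio` is a hypothetical feature invented for illustration. Whether it actually helps separate SPAM from HAM is an assumption you would need to check against your own documents.

```python
def title_case_ratio(doc: str) -> float:
    """Hypothetical feature: fraction of alphabetic words that are Title Cased."""
    words = [w for w in doc.split() if w.isalpha()]
    if not words:
        return 0.0
    return sum(1 for w in words if w.istitle()) / len(words)

print(title_case_ratio("This Is To Notify You"))      # 1.0
print(title_case_ratio("please send your syllabus"))  # 0.0
print(title_case_ratio("Please send your syllabus"))  # 0.25
```

A real-valued feature like this one can sit alongside binary features in the same vector.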

In [39]:
from dataclasses import dataclass

@dataclass
class Datum:
    doc: Text
    label: Text
        
dataset: List[Datum] = [
    Datum(
        # SPAM
        doc="""
        FROM : "MR.LAMIDO SANUSI" <elvislives478@aol.com>
        SUBJECT: Your kind Attention: Beneficiary, Call me at +2348080754902 for more information.
        BODY:
        My Name Is Mr. Lamido Sanusi. I Am The Governor Central Bank Of Nigeria.  This Is To Notify You That Your Over Due Inheritance Funds Has Been Gazzeted To Be Released To You Via The Foreign Remmitance Department Of Our Bank.

        Meanwhile, A Woman Came To My Office Few Days Ago With A Letter, Claiming To Be Your Representative And Sent By You.  If she is not your reprsentative or sent by you, kindly respond immediately reconfirming to me the following details to avoid any mistake.
        + Full name
        + Full residential contact address
        + Direct telephone number number
        + Age and current occupation
        + Copy of your identification if available.

        However, We Shall Proceed To Issue All Payments Details To The Said Mrs. Barbara Kleihans If We Do Not Hear From You Within The Next Three Working Days From Today. Await for your prompt response

        You.Regards,

        Mr. Lamido Sanusi
        """,
        label="SPAM"
    ),
    Datum(
        # SPAM
        doc="""
        FROM: saxquatch4life@aol.com
        SUBJECT: You're a Winner!
        BODY: 
        This President Zump. You've been pre-selected for early retirement. 
        Please send your social security number ASAP to claim prize.
        """,
        label="SPAM"
    ),
    Datum(
        # NOT SPAM
        doc="""
        FROM: ***REDACTED***@arizona.edu
        SUBJECT: [ling_dept_faculty] Response needed
        BODY: 
        Please send your syllabus to ***REDACTED*** by 4PM on Friday.
        """,
        label="NOT_SPAM"
    ),
    Datum(
        # NOT SPAM
        doc="""
        FROM: ***REDACTED***@arizona.edu
        SUBJECT: Deadline extension?
        BODY: 
        Dr. Hahn-Powell,

        I hope you are well.  Is there any way I can get an extension on the homework?  

        Respectfully,

            ***REDACTED***
        """,
        label="NOT_SPAM"
    ),
    Datum(
        # NOT SPAM
        doc="""
        FROM: drive-shares-noreply@google.com
        SUBJECT: Internship report - Invitation to edit
        BODY:     
            ***REDACTED*** has invited you to edit the following document:

            Internship report

            Open in Docs


        Google Docs: Create and edit documents online.
        Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
        You have received this email because medeiros@email.arizona.edu shared a document with you from Google Docs.
        """,
        label="NOT_SPAM"
    )
]
    
for datapoint in dataset:
    # we can access attributes of a dataclass just like any other class
    print(datapoint.label)

SPAM
SPAM
NOT_SPAM
NOT_SPAM
NOT_SPAM


If Option B, describe the task here.

In [40]:
# If Option B, define your docs and labels here using the Datum class.

In [41]:
#N/A

In [42]:
#N/A

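The three feature functions below each return a binary value (1 if the signal is present, 0 otherwise):

- `contains_exclamation`: whether the document contains an exclamation mark.
- `contains_phone_number`: whether the document contains a phone-number-like run of digits (optionally prefixed with `+`).
- `contains_spammy_keywords`: whether the document contains any of a small list of keywords (`winner`, `claim`, `urgent`, `free`, `call`, `respond`), ignoring case.

On this toy dataset, each SPAM document triggers at least one of these features, while the NOT_SPAM documents trigger none.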

In [43]:
def contains_exclamation(doc: str) -> int:
    return 1 if '!' in doc else 0

def contains_phone_number(doc: str) -> int:
    pattern = r'\+?\d[\d -]{7,}\d'
    return 1 if re.search(pattern, doc) else 0

def contains_spammy_keywords(doc: str) -> int:
    spam_keywords = ['winner', 'claim', 'urgent', 'free', 'call', 'respond']
    return 1 if any(keyword in doc.lower() for keyword in spam_keywords) else 0

def extract_features(datum: Datum) -> List[int]:
    return [
        contains_exclamation(datum.doc),
        contains_phone_number(datum.doc),
        contains_spammy_keywords(datum.doc),
    ]

feature_vectors = [extract_features(datapoint) for datapoint in dataset]

for datum, vector in zip(dataset, feature_vectors):
    print(f"Type of Mail: {datum.label}, Feature Vector: {vector}")

Type of Mail: SPAM, Feature Vector: [0, 1, 1]
Type of Mail: SPAM, Feature Vector: [1, 0, 1]
Type of Mail: NOT_SPAM, Feature Vector: [0, 0, 0]
Type of Mail: NOT_SPAM, Feature Vector: [0, 0, 0]
Type of Mail: NOT_SPAM, Feature Vector: [0, 0, 0]
