# Assignment

I often need "fake data" to show people how to do data manipulation tasks with regular expressions or pandas. The
problem is that sometimes the data I generate on the web is too messy, and I get bogged down showing students how to
clean all of the data when some of it isn't representative of what I want. In this assignment you get a chance to help
me generate realistic fake data, all with Llama 2!

For each question I will describe the data in natural language, and you must write a function which queries Llama 2 to
generate data that matches that format and adheres to the description I've written.


In [1]:
import os
import re
from llama_cpp import Llama
from llama_cpp.llama_types import *
from llama_cpp.llama_grammar import *

## Question 1

Generate for me a list of ten fictitious names, where the first name is a single word, and the last (family) name may be
(but doesn't have to be!) up to two words separated by a hyphen. Don't include titles, honorifics, or middle names. The
autograder will expect that you return a list[str] where each value in the list is a full name.
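
Before wiring up the model, it can help to have a quick local check for the name format described above. The sketch below is not the autograder's logic; `is_valid_name` and `all_names_valid` are hypothetical helpers, and the pattern assumes plain ASCII names with one leading capital per word (so a name like "McDonald" would be rejected).

```python
import re

# One capitalized first name, a space, then a family name that is either one
# word or two words joined by a single hyphen. ASCII letters only (assumption).
NAME_PATTERN = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+(?:-[A-Z][a-z]+)?")


def is_valid_name(name: str) -> bool:
    """Return True if `name` matches the Question 1 format."""
    return NAME_PATTERN.fullmatch(name) is not None


def all_names_valid(names: list[str]) -> bool:
    """Return True for a list of exactly ten well-formed names."""
    return len(names) == 10 and all(is_valid_name(n) for n in names)
```

Calling `all_names_valid(generate_names())` after a run gives a rough signal of whether titles, middle names, or extra hyphens slipped through.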


In [2]:
def generate_names() -> list[str]:
    results: list[str] = []
    name: list[str] = []

    # YOUR CODE HERE
    prompt = '''A list of ten names:
    James Conner
    Laurence Burden
    Thomas Frank-Wilson
    Rosanna Godfrey
    Aurora Burden
    Micheal Clark-Dill
    '''

    # Create Llama model
    model: Llama = Llama(model_path="/Users/laurenceburden/.cache/lm-studio/models/TheBloke/Chronomaid-Storytelling-13B-GGUF/chronomaid-storytelling-13b.Q2_K.gguf", verbose=False, n_ctx=2048)

    # Create a grammar of two- or three-word names, one name per line
    grammar= r'''
    root ::= (firstName " " lastName)+
    firstName ::= [A-Z] [A-Za-z]*
    hyphenName ::= "-" firstName
    lastName ::= firstName hyphenName? "\n"
    '''

    def stop_at_ten(input_ids, logits) -> bool:
        # Stop generating once ten complete names have been collected
        return len(results) >= 10

    for result in model.create_completion(
        prompt,
        grammar=LlamaGrammar.from_string(grammar=grammar),
        stream=True, 
        stopping_criteria=stop_at_ten, 
        max_tokens=1024
    ):
        char = result["choices"][0]["text"] 

        if char == "\n":
            fullName = "".join(name)
            results.append(fullName)
            name.clear()
        else:
            name.append(char)

    print('results: ', results)
    return results


In [3]:
# Invoke student code
from contextlib import redirect_stderr
import tempfile

with redirect_stderr( tempfile.TemporaryFile('wt') ) as error_catcher:
    results = generate_names()

# Verify the length
assert (
    len(results) == 10
), f"You did not return ten and only ten results, instead we got {len(results)}."


results:  ['Marie Curie', 'Anna Franklin', 'Isabella Newton', 'Nathaniel Hawthorne', 'Olivia Dunn', 'Gabriel Jackson', 'Rose Adams', 'Carter Mitchell', 'Sophie Williams', 'Molly Parker']
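
The streaming loop in `generate_names` compares each chunk to `"\n"`, which works only when every chunk is a single character. A slightly more defensive sketch, assuming a chunk may carry several characters or more than one newline, buffers the text and yields complete lines. `iter_lines` is a hypothetical helper, not part of `llama_cpp`.

```python
from typing import Iterable, Iterator


def iter_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete, non-empty lines from a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A chunk may hold zero, one, or several newlines.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield line.strip()
    # Emit whatever is left if the stream ends without a trailing newline.
    if buffer.strip():
        yield buffer.strip()
```

In the notebook this could be fed with a generator such as `(r["choices"][0]["text"] for r in model.create_completion(..., stream=True))`. Note that the final yield can return a partial line when generation is cut off by `max_tokens`, so the last entry may deserve an extra check.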


## Question 2

Generate for me a list of 5 things to do in your hometown (or mine if you prefer, Ann Arbor Michigan!). The key is that
these should all (a) start with a number and (b) be no more than three sentences long. So the following would be a good
item:

- 1\. Go to the Henry Ford Museum. The Henry Ford Museum has all sorts of wonderful exhibits for all ages. One
  particular highlight includes giant trains!

While the following would **not be good items** (the first item does not start with a number, the second item is not a
sentence as it doesn't end in punctuation, and the third item just goes on and on and on):

- A\. Go to the University of Michigan. The University of Michigan is a school with more than 50,000 students in Ann
  Arbor, MI. The University of Michigan is a public school.
- 2\. Visit the Detroit Eastern Market
- 3\. Visit Sleeping Bear Dunes. The dunes are located along the northwest coast of the Lower Peninsula of Michigan in
  Leelanau and Benzie counties near Traverse City. It covers a 35-mile-long stretch of Lake Michigan's eastern
  coastline, as well as North and South Manitou islands. This national park is known for its massive dunes, some of
  which are over 400 feet high. The area gets its name from the Native American legend of the Sleeping Bear. According
  to the story, a mother bear and her two cubs were trying to cross Lake Michigan from Wisconsin to escape a forest
  fire.
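
The two rules above, (a) start with a number and (b) no more than three sentences, can be approximated with a small checker. This is a rough sketch rather than the grading logic: `is_valid_recommendation` is a hypothetical helper, and it counts a sentence as any text ending in `.`, `!` or `?` followed by whitespace or the end of the string, so abbreviations such as "St." will be counted as sentence endings.

```python
import re

# Heuristic: sentence-ending punctuation followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def is_valid_recommendation(item: str) -> bool:
    """Check the 'number, then one to three sentences' shape of one item."""
    item = item.strip()
    # Rule (a): the item starts with a number, a period, and whitespace.
    match = re.match(r"\d+\.\s+", item)
    if match is None:
        return False
    body = item[match.end():]
    # The last sentence must end in punctuation.
    if not body or body[-1] not in ".!?":
        return False
    # Rule (b): between one and three sentences.
    return 1 <= len(SENTENCE_END.findall(body)) <= 3
```

Running `all(is_valid_recommendation(r) for r in results)` after `generate_trip_recommendations()` gives a quick signal of whether the grammar is doing its job.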


In [14]:
def generate_trip_recommendations() -> list[str]:
    results: list[str] = []
    activity_sentence: str = ""

    # YOUR CODE HERE
    prompt = "A list of five things to do in San Antonio Texas: "

    # Create Llama model
    model: Llama = Llama(model_path="/Users/laurenceburden/.cache/lm-studio/models/TheBloke/Chronomaid-Storytelling-13B-GGUF/chronomaid-storytelling-13b.Q2_K.gguf", 
                         verbose=False, n_ctx=2048, n_gpu_layers=-1)
    # Grammar for a number followed by one to three sentences
    grammar= r'''
    root ::= (number activity)+
    number ::= [0-9] ". "
    # A sentence is just alphanumeric Latin characters, plus punctuation and whitespace
    # No parentheticals are allowed in a sentence, but commas and hyphens are
    sentence ::= [A-Z] [A-Za-z0-9 ,-]* ("." | "!" | "?")
    # An activity is one to three sentences, no more.
    activity ::= sentence "\n" | sentence " " sentence "\n" | sentence " " sentence " " sentence "\n"
    '''

    def stop_at_five(input_ids, logits) -> bool:
        # Stop generating once five complete activities have been collected
        return len(results) >= 5

    for result in model.create_completion(
        prompt,
        grammar=LlamaGrammar.from_string(grammar=grammar),
        stream=True, 
        stopping_criteria=stop_at_five, 
        max_tokens=1024
    ):
        char = result["choices"][0]["text"] 

        if char == "\n":
            results.append(activity_sentence)
            activity_sentence = ""
        else:
            activity_sentence = activity_sentence + char

    return results

In [15]:
# Invoke student code
from contextlib import redirect_stderr
import re
import tempfile

with redirect_stderr( tempfile.TemporaryFile('wt') ) as error_catcher:
    results=generate_trip_recommendations()
    print(results)

# Verify length
assert (
    len(results) == 5
), f"You did not return five and only five results, instead we got {len(results)}."


['1. Visit The Alamo - a historic site and museum that preserves the memory of the 1836 battle for Texan independence from Mexico. There are guided tours available, exhibits, and educational programs. Address is 300 Alamo Plaza, San Antonio, Texas 78205.', '2. Explore River Walk - a picturesque waterfront promenade that winds through the heart of downtown San Antonio. It offers shopping, dining, and entertainment options as well as boat tours on the river. The walk runs from Downtown to the Mission Reach area.', '3. See the San Antonio Missions National Historical Park - a World Heritage Site that preserves four Spanish colonial mission churches built in the 18th century. These missions include Mission Concepcion, Mission San Juan Capistrano, Mission San Jose, and The Alamo. The park offers guided tours and self-guided options for exploring these historical sites.', '4. Experience Fiesta San Antonio - an annual festival held in April that celebrates the citys rich history and culture w

## Question 3

Generate for me US-based addresses. Each address has a person's name, which usually appears on the first line; an
optional company name, which often goes on the second line; a street address, which is a number followed by some
descriptive text; a city and state, where the state is a two-letter identifier that comes after the city name; and a zip
code, which is a five-digit number (stored as a string, since it could start with 0) followed by an optional hyphen and
four more digits.

To make it easy for you to conform to this set of requirements, I created a simple class from the following example --
my mailing address!

> Dr. Christopher Brooks
>
> > School of Information, University of Michigan
> >
> > 105 S. State St.
> >
> > Ann Arbor, MI
> >
> > 48109-1285

Your function should return exactly 5 of these entries!

And, if you've gotten this far in the course, why not send me a postcard and introduce yourself? Everyone loves getting
mail!

(Don't forget to add **United States** if sending mail internationally, even though that field is missing from the
assignment's `MailingAddress` class.)
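
Since the address format is line-oriented, one option is to let the model emit a plain text block and then parse it afterwards. The sketch below assumes one field group per line with no leading indentation, and assumes a business line never starts with a digit. `ADDRESS_PATTERN`, `parse_address`, and the dictionary keys are hypothetical names chosen for illustration; mapping them onto the `MailingAddress` fields is left to your solution.

```python
import re
from typing import Optional

ADDRESS_PATTERN = re.compile(
    r"(?P<name>[^\n]+)\n"
    r"(?:(?P<business>[^\n0-9][^\n]*)\n)?"       # optional line not starting with a digit
    r"(?P<number>\d+) (?P<street>[^\n]+)\n"      # street number, then street text
    r"(?P<city>[^\n,]+), (?P<state>[A-Z]{2})\n"  # city, two-letter state
    r"(?P<zip5>\d{5})(?:-(?P<plus4>\d{4}))?"     # five digits, optional -NNNN
)


def parse_address(block: str) -> Optional[dict]:
    """Parse one address block into a dict, or return None if it is malformed."""
    match = ADDRESS_PATTERN.fullmatch(block.strip())
    if match is None:
        return None
    fields = match.groupdict()
    zip5, plus4 = fields.pop("zip5"), fields.pop("plus4")
    fields["number"] = int(fields["number"])
    # Keep the zip code as a string so a leading 0 survives.
    fields["zip_code_short"] = zip5
    fields["zip_code_long"] = f"{zip5}-{plus4}" if plus4 else None
    return fields
```

Returning `None` for a malformed block makes it easy to discard bad generations and ask the model for another address until five valid ones have been collected.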


In [16]:
from dataclasses import dataclass
from typing import Optional

@dataclass
class MailingAddress:
    name: str  # Full name, e.g. Dr. Christopher Brooks
    business_name: Optional[str]  # Optional business name, e.g. School of Information, University of Michigan
    street_number: int  # Numeric address value, e.g. 105
    street_text: str  # Street information other than numeric address, e.g. S. State St.
    city: str  # City name, e.g. Ann Arbor
    state: str  # State name, only two letters, e.g. MI for Michigan
    zip_code_short: str  # The first five digits of the zip code, e.g. 48109, as a string value, since it could start with 0
    zip_code_long: Optional[str]  # The extended zip code (optional) which is the full zip code, e.g. 48109-1285

# YOUR CODE HERE
def generate_addresses() -> list[MailingAddress]:
    results: list[MailingAddress] = []
    tempAddress: MailingAddress
    item: str = ""
    items: list[str] = []
    char: str = ""
    count = 0

    prompt = '''Generate a list of physical street addresses:
    
    Frank Zappa
    Acme Inc.
    105 S. State St.
    San Antonio, TX
    78245-1455

    John Burden
    203 Ocean Way
    Detroit, MI
    48615-8776

    James Godfrey
    Gizmo Corp
    589 South Blvd.
    South Bend, IN
    39872-8322

    Lisa Smith
    390 Main St.
    Dallas, TX
    78231-0912
    '''

    grammar = r'''
    root ::= (name lastname busName? strNum strName cityName stateAbbr zip) "\n\n"
    name ::= [A-Z] [a-z]* " "
    cityName ::= [A-Z] [a-z]* ", "
    lastname ::= [A-Z] [A-Za-z]* "\n"
    busName ::= name "\n"
    strNameLimit ::= [A-Z] [a-z.]* " "
    num ::= [0-9]
    strNum ::= (num num num num?) " "
    strName ::= (strNameLimit strNameLimit?) "\n"
    stateAbbr ::= ("MI" | "TX" | "IN" | "WA" | "AL" | "MN") "\n"
    zip ::= (num num num num num "-" num num num num) "\n"
    '''

    def stop_at_six(input_ids, logits) -> bool:
        # Stop once the current address has collected six lines
        return len(items) >= 6

    # Load the model once and reuse it for all five addresses
    model: Llama = Llama(model_path="/Users/laurenceburden/.cache/lm-studio/models/TheBloke/Chronomaid-Storytelling-13B-GGUF/chronomaid-storytelling-13b.Q2_K.gguf", 
                         verbose=False, n_ctx=2048, n_gpu_layers=-1)

    for i in range(0, 5):
        # Reset the per-address parsing state before each generation
        items.clear()
        item = ""
        count = 0

        for result in  model.create_completion(
            prompt,
            grammar=LlamaGrammar.from_string(grammar=grammar),
            stream=True,
            stopping_criteria=stop_at_six,
            max_tokens=1024
        ):
            char = result["choices"][0]["text"] 

            if char == "\n":
                items.append(item)
                item = ""
                count = count + 1
                
                if count == 2:
                    #print('items: ', items)
                    count = 0
                    if len(items) >= 6:
                        # Address with a business line
                        name, business, street, city_state, zip_code = items[:5]
                    else:
                        # Address without a business line
                        name, street, city_state, zip_code = items[:4]
                        business = None
                    # Split off the leading number and keep the full street text
                    street_number, street_text = street.strip().split(maxsplit=1)
                    city, state = city_state.split(', ')
                    tempAddress = MailingAddress(
                        name.strip(), business.strip() if business else None,
                        int(street_number), street_text,
                        city.strip(), state.strip(),
                        zip_code.split('-')[0], zip_code.strip()
                    )

                    #print('tempAddress: ', tempAddress)
                    results.append(tempAddress)
                    items.clear()
                    char=""
            else:
                item = item + char
                count = 0
        result = None
    #print('Results: ', results)
    return results

In [18]:
# Invoke student code
from contextlib import redirect_stderr
import tempfile

with redirect_stderr( tempfile.TemporaryFile('wt') ) as error_catcher:
    results=generate_addresses()
    for r in results:
        print(r)

# Verify length
assert (
    len(results) == 5
), f"You did not return five and only five results, instead we got {len(results)}."


MailingAddress(name='Thisstreetaddressgeneratoronlygeneratestreetaddressesanddoesnotprovideanyadditionalservicesorinformationaboutthemapressourceofthisdataorthemacroenvironmentsurroundingtheaddresseslistedhereinaccordancewiththespecifiedcitiesandstatesasperthelistgivenaboveandsuchotherrestrictionsasmayberegulatedbyspeciallawsofstatetherestrictionsandexceptionsmayvarybyjurisdictionandtimethereforeusershouldrefertoallrelevantlawsandregulationsinconnectionwiththeiruseofthisdatamaterialsorinformationhereincontainedandshouldalsoconsultanyapplicableprofessionalsuchaslegalcounselorbusinessadvisorsforfurtherassistanceandspecificguidanceintheuseofthematerialsorinformationhereincontainedinasmuchastheyrelatetotheirparticularcircumstancesandneedsandsoshouldbeawarethatthisdocumentdoesnotconstitutelawyeradvicenorlegaladviceoffanykindnorshoulditbereferredtoassuchusershouldseekthelawsoftheirstateorbusinessjurisdictionforanymattersinconnectionwiththematerialsorinformationhereincontainedthatmayrequirele

AssertionError: You did not return five and only five results, instead we got 2.