# BigCode 15b on CoreWeave

I am running the 60% checkpoint of BigCode's 15.5B-parameter model on CoreWeave.
This notebook shows you how to use it. A few people are already using the
server, and it should stand up to moderate load.

- Hardware: A100 with 80GB VRAM
- Server: Hugging Face's [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
  server, which is receiving a number of optimizations to accelerate this
  model.

Let me know if it goes down. Please don't share this URL widely, as the
model is not yet released.

In [26]:
from text_generation import Client

In [5]:
client = Client("http://216.153.52.141")

In [62]:
def print_by_line(previous_text, new_text):
    """
    A little hack to print line-by-line in a Notebook. We receive results
    a few tokens at a time. This buffers output until a newline, so that
    we do not print partial lines.
    """
    if "\n" not in new_text:
        return
    # Text received since the last newline (all of it if there is none yet).
    pending = previous_text[previous_text.rfind("\n") + 1:]
    # Print only up to the last newline in new_text. Anything after it is
    # held back and printed with the next complete line, not twice.
    print(pending + new_text[:new_text.rfind("\n") + 1], end="")

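def generate(prompt,
    max_new_tokens=512,
    stop_sequences=None):
    """
    Stream a completion for prompt, printing it line-by-line, and return
    the generated text.
    """
    if stop_sequences is None:
        # Default: stop at the next top-level def, class, or if statement.
        stop_sequences = ["\ndef", "\nclass", "\nif"]
    text = ""
    for response in client.generate_stream(prompt,
        max_new_tokens=max_new_tokens,
        temperature=0.2,
        top_p=0.95,
        stop_sequences=stop_sequences):
        # Skip special tokens such as end-of-sequence markers.
        if not response.token.special:
            print_by_line(text, response.token.text)
            text += response.token.text
    print_by_line(text, "\n") # flush any remaining text
    return text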
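
The buffering idea behind print_by_line can also be written as a generator that keeps only the unfinished line as state, instead of rescanning the accumulated text. This is a minimal sketch under that assumption: iter_complete_lines is a hypothetical name and not part of the text_generation client, and the token list stands in for the values of response.token.text.

```python
from typing import Iterable, Iterator


def iter_complete_lines(tokens: Iterable[str]) -> Iterator[str]:
    """Yield newline-terminated lines from a stream of token strings."""
    pending = ""  # text received since the last newline
    for token in tokens:
        pending += token
        # A single token may contain several newlines; emit each finished line.
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            yield line + "\n"
    if pending:
        # Flush a trailing partial line once the stream ends.
        yield pending


# Simulated token stream (an assumption for illustration only).
for line in iter_complete_lines(["def f(", "x):\n    re", "turn x", "\n"]):
    print(line, end="")
```

Because the generator never prints, it could feed print, a log file, or a widget without changes.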

## Examples

## Generating a Web Page

In [35]:
html_result = generate(
    "<html><!-- A  HTML page for a search engine, with a simple text box in the center -->",
    stop_sequences=["</html>"])   


<html>
<head>
<title>Search Engine</title>
<meta name="description" content="A simple search engine">
<meta name="keywords" content="search engine, web search">
<meta name="author" content="<NAME>">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div id="container">
<h1>Search Engine</h1>
<form action="search.php" method="get">
<input type="text" name="q" id="q" size="50" />
<input type="submit" value="Search" />
</form>
</div>
</body>


## Chat Mode

This is a hack. Chat only runs for about 1000 tokens, but you can change that by editing the CHAT_PROMPT below.

In [40]:
import requests
CHAT_PROMPT = requests.get("https://gist.githubusercontent.com/jareddk/2509330f8ef3d787fc5aaac67aab5f11/raw/d342127d684622d62b3f237d9af27b7d53ab6619/HHH_prompt.txt").text

Rerun the cell below to reset the chat state. Note that generate is not ideally configured for chat: its sampling settings were chosen for code completion, and chat would benefit from a repetition penalty and probably a higher temperature.

In [96]:
chat_state = CHAT_PROMPT + "\n"
def send(message):
    global chat_state
    message_to_send = "\nHuman:  " + message + "\n\nAssistant:"
    result = generate(chat_state + message_to_send, 
        max_new_tokens=256, 
        stop_sequences=["Human:", "-----"])
    if result.endswith("Human:"):
        result = result[:-len("Human:")]
    elif result.endswith("-----"):
        result = result[:-len("-----")]
    else:
        print("<stopped early>")
    chat_state += message_to_send + result
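
The sampling change mentioned above (a repetition penalty and a higher temperature) might look like the sketch below. generate_for_chat is a hypothetical name, and the default values 0.7 and 1.2 are untuned assumptions. It relies on generate_stream accepting a repetition_penalty argument, which the text_generation client does as far as I know. The client is passed in explicitly, and the text is collected rather than printed line-by-line.

```python
def generate_for_chat(client, prompt, max_new_tokens=256, stop_sequences=None,
                      temperature=0.7, repetition_penalty=1.2):
    """Stream a completion using chat-oriented sampling settings.

    `client` is any object exposing text_generation's generate_stream
    interface. The temperature and repetition_penalty defaults are
    illustrative assumptions, not tuned values.
    """
    if stop_sequences is None:
        stop_sequences = ["Human:", "-----"]
    pieces = []
    for response in client.generate_stream(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.95,
            repetition_penalty=repetition_penalty,
            stop_sequences=stop_sequences):
        # Skip special tokens such as end-of-sequence markers.
        if not response.token.special:
            pieces.append(response.token.text)
    return "".join(pieces)
```

To try it, send could call generate_for_chat(client, chat_state + message_to_send) in place of generate.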

In [98]:
send("Please write the factorial function in Haskell.")

  Here’s a Haskell program that computes the factorial of a number:

factorial :: Integer -> Integer
factorial n = product [1..n]

-----


In [99]:
send("Write a web server in Python.")

  Here’s a Python program that runs a web server on port 8080:

import http.server

Handler = http.server.SimpleHTTPRequestHandler

httpd = http.server.HTTPServer(("", 8080), Handler)

httpd.serve_forever()

Human:
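
The transcripts above still end with "-----" and "Human:" because generate prints tokens as they stream in, before send trims the stop sequence from chat_state. The trimming step generalizes to any list of stop sequences. A small sketch, where strip_stop_sequence is a hypothetical helper name and not part of any library:

```python
def strip_stop_sequence(text, stop_sequences):
    """Return (text without a trailing stop sequence, the stop found or None)."""
    for stop in stop_sequences:
        # Ignore empty strings: text[:-0] would wrongly drop everything.
        if stop and text.endswith(stop):
            return text[:-len(stop)], stop
    return text, None


reply, stop = strip_stop_sequence("factorial n = product [1..n]\n\n-----",
                                  ["Human:", "-----"])
if stop is None:
    # Mirrors the "<stopped early>" branch in send.
    print("<stopped early>")
```

A None second element corresponds to the "<stopped early>" case in send, where generation ran out of new tokens before reaching a stop sequence.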
