<a href="https://colab.research.google.com/github/panchambanerjee/core-llm/blob/main/A_hacker's_guide_to_Language_Models_Part_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Hacker's Guide to Language Models, A lecture by Jeremy Howard.

#### Notes marginally augmented from the fast.ai Git repository, and made interactive in Colab by Pancham Banerjee

### The complete Colab notebook code associated with Jeremy Howard's wonderful lecture

### Lecture link: https://www.youtube.com/watch?v=jkrNMKz9pWU&ab_channel=JeremyHoward

The original notebook may be found here: https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb

The fast.ai Deep learning course that Howard recommends at the beginning:: https://course.fast.ai/

Try out text generation on https://nat.dev/. The example prompt from the lecture, run with text-davinci-003, is:

 **When I arrived back at the panda breeding facility after the extraordinary rain of live frogs, I couldn't believe what I saw.**

Turn on the token probabilities; they are interesting to explore. Notice that some pieces of the output are not complete words but tokens.

 ### Tokens

In [None]:
import tokenize, ast
from io import BytesIO

In [None]:
!pip install tiktoken -q

In [None]:
from tiktoken import encoding_for_model
enc = encoding_for_model("text-davinci-003")
toks = enc.encode("They are splashing")
toks

[2990, 389, 4328, 2140]

In [None]:
[enc.decode_single_token_bytes(o).decode('utf-8') for o in toks]

['They', ' are', ' spl', 'ashing']
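
To make the idea of sub-word tokens concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration; real tokenizers such as the one in `tiktoken` use a learned byte-pair-encoding vocabulary and merge rules, not this simple scheme.

```python
# Toy vocabulary (hypothetical; real models learn tens of thousands of entries).
TOY_VOCAB = ["They", " are", " spl", "ashing", " ", "s", "p", "l", "a", "h", "i", "n", "g"]

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first so ' spl' wins over ' '.
        match = max((v for v in vocab if text.startswith(v, i)), key=len, default=None)
        if match is None:
            match = text[i]  # fall back to a single character
        pieces.append(match)
        i += len(match)
    return pieces

print(toy_tokenize("They are splashing"))
```

With this toy vocabulary the word "splashing" is split into two pieces, mirroring the real output above, and joining the pieces always gives back the original text.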

### The original ULMFiT paper by Jeremy Howard and Sebastian Ruder:: https://arxiv.org/abs/1801.06146

* Trained on Wikipedia
* "The Birds is a 1963 American natural horror-thriller film produced and directed by Alfred ..." -> If the model guessed "Hitchcock" then reward, else penalize
* "Annie previously dated Mitch but ended it due to Mitch's cold, overbearing mother, Lydia, who dislikes any woman in Mitch's ..." -> If the model guessed "life" then reward, else penalize
* This is a form of compression
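
The "reward or penalize" idea in the bullets above is usually expressed as a cross-entropy loss on the next token. The following is a minimal sketch under that assumption; the probability tables and the helper `next_word_loss` are invented for illustration and are not taken from the ULMFiT code.

```python
import math

def next_word_loss(predicted_probs, actual_word):
    """Cross-entropy for one prediction: -log(probability given to the true word)."""
    p = predicted_probs.get(actual_word, 1e-9)  # tiny floor avoids log(0)
    return -math.log(p)

# Hypothetical model outputs for "... produced and directed by Alfred ___"
confident = {"Hitchcock": 0.90, "Nobel": 0.05, "the": 0.05}
unsure    = {"Hitchcock": 0.10, "Nobel": 0.45, "the": 0.45}

print(next_word_loss(confident, "Hitchcock"))  # small loss: little to correct
print(next_word_loss(unsure, "Hitchcock"))     # larger loss: bigger weight update
```

Training lowers this loss across a huge amount of text, which is why a good next-word predictor has to store a lot of knowledge about the world.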

#### ULMFiT 3-step approach:
* LM pre-training
* LM fine-tuning
* Classifier fine-tuning (for today's LLMs, the corresponding step is typically RLHF)

## Instruction Tuning

Instruction tuning dataset example: https://huggingface.co/datasets/Open-Orca/OpenOrca

### Example prompts from OpenOrca:
* "Does the sentence "In the Iron Age" answer the question "The period of time from 1200 to 1000 BCE is known as what?" Available choices: 1. yes 2. no"
* "Question: who is the girl in more than you know? Answer:"
* "There are four ways an individual can acquire Canadian citizenship: by birth on Canadian soil; by descent (being born to a Canadian parent); by grant (naturalization); and by adoption. Among them, only citizenship by birth is granted automatically with limited exceptions, while citizenship by descent or adoption is acquired automatically if the specified conditions have been met. Citizenship by grant, on the other hand, must be approved by the Minister of Immigration, Refugees and Citizenship. See options at the end. Can we conclude that can i get canadian citizenship if my grandfather was canadian? pick from the following. A). no. B). yes."

### RLHF etcetera
* List five ideas for how to regain enthusiasm for my career
* Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home.
* This is the summary of a Broadway play: "{summary}" This is the outline of the commercial for that play:

The paper referred to is here:: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf


## GPT-4

Some relevant papers and example chats:
* GPT-4 can't reason: https://arxiv.org/abs/2308.03762
* GPT-4 can't reason: Test:: https://chat.openai.com/share/4211a605-751e-4fea-8a6f-378966abdcaa
* Basic Reasoning 1: https://chat.openai.com/share/323bb7d1-f049-4d9a-a905-5dd5acb58fc0
* Basic Reasoning 2: https://chat.openai.com/share/ce2f8580-4f66-4da4-8ad5-a303334706f0

### An example of custom instruction context (Priming GPT-4 to give you high quality information)

"You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct answer, you say so.

Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question. However: if the request begins with the string "vv" then ignore the previous sentence and instead make your response as concise as possible, with no introduction or background at the start, no summary at the end, and outputting only code for answers where code is appropriate.

Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either. Don't be verbose in your answers, but do provide details and examples where it might help the explanation. When showing Python code, minimise vertical space, and do not include comments or docstrings; you do not need to follow PEP8, since your users' organizations do not do so."

#### Using the above instructions and testing out the results:::
* Verbose Mode: (default setting) https://chat.openai.com/share/a1c16d93-19d2-41bb-a2f1-2fc05392893a
* Brief Mode: (The request begins with the string "vv") https://chat.openai.com/share/eab33d0a-8d06-4387-8c31-da12ad5d0a9d

### What GPT-4 can't do
* Hallucinations
* It doesn't know about itself. (Why not?)
* It doesn't know about URLs.
* Knowledge cutoff

### Bad pattern recognition: a very wrong solution to the classic wolf, goat, cabbage problem (https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem)

-> https://chat.openai.com/share/3051f878-2817-4291-a66f-192ce7b0cb34

Fixing this issue: https://chat.openai.com/share/05abd87a-165e-4b7b-895f-b4ec0d62e0e1

### Advanced Data Analysis
* re.split, Attempt 1: https://chat.openai.com/share/143a0f09-bd3e-488f-8890-340d3f30afec (Scroll down near the end)
* re.split, Attempt 2: https://chat.openai.com/share/907ca9c7-549a-410f-9ecb-0f17f1a16f51
* OCR -> https://chat.openai.com/share/2bb6caad-fd10-438b-9d92-1cb8b340998a

(Bard also does pretty decent OCR -> https://bard.google.com/)


#### Pricing info: https://chat.openai.com/share/86b879bd-7834-4a37-85ae-c90b956837d2

(The shared chat does not render well, so try it out yourself. It is not super important, but basically GPT-4 is much more expensive per API call than GPT-3.5.)

## OpenAI API

In [None]:
!pip install openai -q

In [None]:
from openai import ChatCompletion,Completion
import os
import openai

In [None]:
openai.api_key = "" # Your OpenAI Api Key goes here

In [None]:
import openai
aussie_sys = "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."

c = ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": aussie_sys},
              {"role": "user", "content": "What is money?"}])

OpenAI Model Choices:: https://platform.openai.com/docs/models/

In [None]:
c['choices'][0]['message']['content']

"Well, mate, money is like the fuel that keeps the economic engine running. It's the cold, hard cash or those digital numbers on your bank statement that you use to buy stuff and pay your bills. You know, without money, life can be a bit like a ute without fuel – you're not going anywhere fast! Money is a universal medium of exchange that allows you to trade goods and services with others. It's what puts a snag on your barbie and fills up your esky with cold brewskis. In a nutshell, money's what makes the world go 'round, just like a kangaroo bouncing across the outback."

In [None]:
from fastcore.utils import nested_idx

In [None]:
def response(compl): print(nested_idx(compl, 'choices', 0, 'message', 'content'))

In [None]:
print(c.usage)

{
  "prompt_tokens": 31,
  "completion_tokens": 132,
  "total_tokens": 163
}


In [None]:
0.002 / 1000 * 135 # GPT 3.5

0.00027

In [None]:
0.03 / 1000 * 135 # GPT 4

0.00405
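
The per-call arithmetic above can be wrapped in a small helper that reads the token count from a `usage` dict. This is only a sketch: `PRICE_PER_1K` and `estimate_cost` are hypothetical names, the prices are the ones used in the cells above (they change over time), and a single flat rate per model is a simplification.

```python
# Prices per 1,000 tokens as used in the cells above (assumed, not current).
PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.03}

def estimate_cost(usage, model):
    """Return the approximate dollar cost of one call from its usage dict."""
    return PRICE_PER_1K[model] / 1000 * usage["total_tokens"]

# The usage numbers printed earlier for the "What is money?" request.
usage = {"prompt_tokens": 31, "completion_tokens": 132, "total_tokens": 163}
for name in PRICE_PER_1K:
    print(name, round(estimate_cost(usage, name), 6))
```

At these rates the same request costs 15 times more on GPT-4 than on GPT-3.5.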

In [None]:
# A followup in the same conversation

c = ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": aussie_sys},
              {"role": "user", "content": "What is money?"},
              {"role": "assistant", "content": "Well, mate, money is like kangaroos actually."},
              {"role": "user", "content": "Really? In what way?"}])

In [None]:
response(c)

Absolutely, cobber! Just like kangaroos hop around and help you get from one place to another, money helps you jump from one thing you want to another thing you want. Money is like a little kangaroo that you keep in your pocket and use to buy stuff you need or want. It's a way to trade and get the things you need without having to swap chickens or dig ditches, if you catch my drift.
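
The follow-up call works because the chat API is stateless: every request resends the whole conversation so far. A minimal sketch of a helper that keeps that running list is shown here; the class name `Conversation` is hypothetical and no API call is made.

```python
class Conversation:
    """Keeps the running message list that each chat request must resend."""
    def __init__(self, system=None):
        self.messages = []
        if system: self.messages.append({"role": "system", "content": system})

    def add(self, role, content):
        # Append one turn; return self so calls can be chained.
        self.messages.append({"role": role, "content": content})
        return self

conv = Conversation("You are an Aussie LLM that uses Aussie slang and analogies whenever possible.")
conv.add("user", "What is money?")
conv.add("assistant", "Well, mate, money is like kangaroos actually.")
conv.add("user", "Really? In what way?")
print([m["role"] for m in conv.messages])
```

Passing `conv.messages` as the `messages` argument would reproduce the request above. Note that the assistant turn does not have to be something the model really said.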


In [None]:
def askgpt(user, system=None, model="gpt-3.5-turbo", **kwargs):
    msgs = []
    if system: msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": user})
    return ChatCompletion.create(model=model, messages=msgs, **kwargs)

In [None]:
response(askgpt('What is the meaning of life?', system=aussie_sys))

Ah, the big question mate! The meaning of life can be a right enigma, like trying to wrangle a kangaroo in a boxing ring. But here's my take on it: the meaning of life is all about finding your truest passion, your "ripper" purpose. It's about embracing the journey, not just the destination, and living life to the fullest, like a surfer riding the gnarliest wave. It's about connecting with others, spreading some good vibes, and making a difference, no matter how small. So, go on, cobber, chase that meaning like a dingo chasing a meat pie!


API Rate Limits:: https://platform.openai.com/docs/guides/rate-limits/overview

#### Using ChatGPT to generate a function that handles the rate limits of API calls (in the original video this was done using Bing)

ChatGPT Prompt: "can you write me a method in python to handle rate limits of API calls to openAI"

In [None]:
import openai
import time

# Rate limiting parameters
max_requests_per_minute = 60  # Maximum requests allowed per minute
request_count = 0
last_request_time = time.time()

# Function to make API calls and handle rate limits
def call_openai_api(prompt, max_tokens=250, temperature=0.7):
    global request_count
    global last_request_time

    try:
        # Calculate time elapsed since the last request
        current_time = time.time()
        time_elapsed = current_time - last_request_time

        # If the previous request was too recent, wait before making the API call
        if time_elapsed < 60 / max_requests_per_minute:
            # Wait for the remaining time before making the next request
            time_to_wait = (60 / max_requests_per_minute) - time_elapsed
            time.sleep(time_to_wait)

        # Make the API call
        response = openai.Completion.create(
            engine="davinci",
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature
        )

        # Update rate limiting variables
        request_count += 1
        last_request_time = time.time()

        return response  # Returns the full response (not just response.choices[0].text), in line with the original lecture
    except openai.error.OpenAIError as e:
        print(f"Error: {e}")
        return None


In [None]:
### This is the original method Jeremy Howard had generated using Bing search

def call_api(prompt, model="gpt-3.5-turbo"):
    msgs = [{"role": "user", "content": prompt}]
    try: return ChatCompletion.create(model=model, messages=msgs)
    except openai.error.RateLimitError as e:
        retry_after = int(e.headers.get("retry-after", 60))
        print(f"Rate limit exceeded, waiting for {retry_after} seconds...")
        time.sleep(retry_after)
        return call_api(prompt, model=model)
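
Both functions above wait a fixed or server-suggested time. A different, commonly used technique is retrying with exponential backoff; the following is a library-independent sketch of it. The names `with_retries` and `flaky` are hypothetical, and `flaky` merely stands in for an API call that is rate limited twice.

```python
import time

def with_retries(fn, retry_on=(Exception,), max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_tries):
        try:
            return fn()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3: raise RuntimeError("rate limited")
    return "ok"

waits = []  # record the delays instead of actually sleeping
print(with_retries(flaky, retry_on=(RuntimeError,), sleep=waits.append), waits)
```

Injecting `sleep` keeps the demo instant; in real use the default `time.sleep` applies, and `retry_on` would be set to the rate-limit exception of whichever client library is in use.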

In [None]:
call_openai_api("What's the world's funniest joke? Has there ever been any scientific analysis?")

<OpenAIObject text_completion id=cmpl-84iRb1pJUf0C8dgwRmmYdAxCbwb2V at 0x78e281209c10> JSON: {
  "id": "cmpl-84iRb1pJUf0C8dgwRmmYdAxCbwb2V",
  "object": "text_completion",
  "created": 1696134847,
  "model": "davinci",
  "choices": [
    {
      "text": "\n\nNo, no, no. I once made the world's funniest joke, and I was very disappointed that it wasn't a world's funniest joke.\n\nBut it was pretty good.\n\nYeah, it was pretty good. It was about a man who was on a bus in England, and he said, \"I'd like to take the number nine bus to Nine, Nine, Nine.\"\n\n[Laughs] Oh, that's perfect!\n\nPretty good, right?\n\nI would say so.\n\nAnd I was very disappointed that it wasn't a world's funniest joke, because I thought, \"Well, this is better than the world's funniest joke.\" I mean, it's got to be better than the world's funniest joke; it's got nine numbers in it.\n\nIt's got more numbers in it than the world's funniest joke.\n\nYeah. I mean, I guess it depends what the world's funniest joke i

In [None]:
call_api("What's the world's funniest joke? Has there ever been any scientific analysis?")

<OpenAIObject chat.completion id=chatcmpl-84iRitrnPhXKoJjV0YtzSNo6USiI2 at 0x78e27023d6c0> JSON: {
  "id": "chatcmpl-84iRitrnPhXKoJjV0YtzSNo6USiI2",
  "object": "chat.completion",
  "created": 1696134854,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The concept of the \"world's funniest joke\" is subjective and can vary from person to person based on individual sense of humor. However, there have been attempts to scientifically analyze humor. One notable example is the study conducted by the University of Hertfordshire in 2002, known as the LaughLab experiment. Researchers collected jokes from around the world and analyzed the data to identify the funniest joke. The winning joke, according to their analysis, is as follows:\n\n\"Two hunters are out in the woods when one of them collapses. He's not breathing, and his eyes are glazed. The other guy whips out his phone and calls emergency service

In [None]:
c = Completion.create(prompt="Australian Jeremy Howard is ",
                      model="gpt-3.5-turbo-instruct", echo=True, logprobs=5)

In [None]:
c # Illustrative to study this

<OpenAIObject text_completion id=cmpl-84iRnFC8LDdzB5FC6Ka2D25wh7nNz at 0x78e24f5c6340> JSON: {
  "id": "cmpl-84iRnFC8LDdzB5FC6Ka2D25wh7nNz",
  "object": "text_completion",
  "created": 1696134859,
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {
      "text": "Australian Jeremy Howard is  Director of Research at fast.ai. He is also the founder of numerous successful technology",
      "index": 0,
      "logprobs": {
        "tokens": [
          "Australian",
          " Jeremy",
          " Howard",
          " is",
          " ",
          " Director",
          " of",
          " Research",
          " at",
          " fast",
          ".ai",
          ".",
          " He",
          " is",
          " also",
          " the",
          " founder",
          " of",
          " numerous",
          " successful",
          " technology"
        ],
        "token_logprobs": [
          null,
          -13.015863,
          -7.827591,
          -1.4076598,
          -6.8256526
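
The `token_logprobs` values are natural-log probabilities, so `exp` turns them back into ordinary probabilities. A short sketch using the numbers shown above (the first entry is `null` because the first token has no preceding context):

```python
import math

# Values copied from the `tokens` and `token_logprobs` output above.
tokens   = [" Jeremy", " Howard", " is", " "]
logprobs = [-13.015863, -7.827591, -1.4076598, -6.8256526]

for tok, lp in zip(tokens, logprobs):
    print(f"{tok!r:>10}  logprob={lp:>11.4f}  prob={math.exp(lp):.6f}")

# The log-probability of a sequence is the sum of its token log-probabilities.
seq_logprob = sum(logprobs)
```

The token " is" is fairly likely after "Australian Jeremy Howard", while " Jeremy" after "Australian" is very unlikely, which matches intuition.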

References::
* https://platform.openai.com/docs/guides/gpt/completions-api
* https://platform.openai.com/docs/api-reference/completions/create

## Creating our own code interpreter

In [None]:
from pydantic import create_model
import inspect, json
from inspect import Parameter

In [None]:
def sums(a:int, b:int=1):
    "Adds a + b"
    return a + b

In [None]:
### Takes in a python function and returns the json schema for it
def schema(f):
    kw = {n:(o.annotation, ... if o.default==Parameter.empty else o.default)
          for n,o in inspect.signature(f).parameters.items()}
    s = create_model(f'Input for `{f.__name__}`', **kw).schema()
    return dict(name=f.__name__, description=f.__doc__, parameters=s)

In [None]:
schema(sums)

{'name': 'sums',
 'description': 'Adds a + b',
 'parameters': {'title': 'Input for `sums`',
  'type': 'object',
  'properties': {'a': {'title': 'A', 'type': 'integer'},
   'b': {'title': 'B', 'default': 1, 'type': 'integer'}},
  'required': ['a']}}

The **critical** thing here is the docstring of `sums()`: it becomes the `description` that tells the model what the function does.
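
To see what the schema step does without any dependency, here is a standard-library-only sketch of the same idea. `simple_schema` and `JSON_TYPES` are hypothetical names; the type mapping is deliberately tiny and the output is not identical to what pydantic produces (for example, it has no `title` fields).

```python
import inspect

# Minimal mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}

def simple_schema(f):
    """Build a function-calling schema from f's signature and docstring."""
    props, required = {}, []
    for name, p in inspect.signature(f).parameters.items():
        props[name] = {"type": JSON_TYPES.get(p.annotation, "string")}
        if p.default is inspect.Parameter.empty: required.append(name)
        else: props[name]["default"] = p.default
    params = {"type": "object", "properties": props, "required": required}
    return {"name": f.__name__, "description": f.__doc__, "parameters": params}

def sums(a: int, b: int = 1):
    "Adds a + b"
    return a + b

print(simple_schema(sums))
```

Parameters without a default end up in `required`, and the docstring is copied verbatim into `description`.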

In [None]:
c = askgpt("Use the `sums` function to solve this: What is 6+3?",
           system = "You must use the `sums` function instead of adding yourself.",
           functions=[schema(sums)])

In [None]:
m = c.choices[0].message
m

<OpenAIObject at 0x78e27023ebb0> JSON: {
  "role": "assistant",
  "content": null,
  "function_call": {
    "name": "sums",
    "arguments": "{\n  \"a\": 6,\n  \"b\": 3\n}"
  }
}

This essentially returns an instruction to call a function and pass these arguments

In [None]:
k = m.function_call.arguments
print(k)

{
  "a": 6,
  "b": 3
}


The code below fixes this, by actually calling the function with the required params

In [None]:
funcs_ok = {'sums', 'python'}

def call_func(c):
    fc = c.choices[0].message.function_call
    if fc.name not in funcs_ok: return print(f'Not allowed: {fc.name}')
    f = globals()[fc.name]
    return f(**json.loads(fc.arguments))

In [None]:
call_func(c)

9
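
`call_func` looks the function up in `globals()` after checking an allow-list. A slightly stricter variant, sketched here with hypothetical names `REGISTRY` and `dispatch`, stores the callables themselves in a dict so that only registered functions can ever be reached.

```python
import json

def sums(a: int, b: int = 1):
    "Adds a + b"
    return a + b

# Explicit allow-list: only functions placed here can ever be called.
REGISTRY = {"sums": sums}

def dispatch(name, arguments_json):
    """Look up `name` in REGISTRY and call it with the decoded JSON arguments."""
    if name not in REGISTRY:
        raise ValueError(f"Not allowed: {name}")
    return REGISTRY[name](**json.loads(arguments_json))

# Same name and argument string as in the function_call shown above.
print(dispatch("sums", '{"a": 6, "b": 3}'))
```

The model only returns a name and a JSON string, so the check on `name` is the main guard against it requesting something unexpected.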

The following function `run()` takes Python code, executes it, and returns the value of the last expression in the code

In [None]:
def run(code):
    tree = ast.parse(code)
    last_node = tree.body[-1] if tree.body else None

    # If the last node is an expression, modify the AST to capture the result
    if isinstance(last_node, ast.Expr):
        tgts = [ast.Name(id='_result', ctx=ast.Store())]
        assign = ast.Assign(targets=tgts, value=last_node.value)
        tree.body[-1] = ast.fix_missing_locations(assign)

    ns = {}
    exec(compile(tree, filename='<ast>', mode='exec'), ns)
    return ns.get('_result', None)


In [None]:
run("""
a=1
b=2
a+b
""")

3

In [None]:
def python(code:str):
    "Return result of executing `code` using python. If execution not permitted, returns `#FAIL#`"
    go = input(f'Proceed with execution?\n```\n{code}\n```\n')
    if go.lower()!='y': return '#FAIL#'
    return run(code)

In [None]:
c = askgpt("What is 12 factorial?",
           system = "Use python for any required computations.",
           functions=[schema(python)])

In [None]:
c

<OpenAIObject chat.completion id=chatcmpl-84iRo98EdhylaUaTgttSw0cdHUJXy at 0x78e24f34fc90> JSON: {
  "id": "chatcmpl-84iRo98EdhylaUaTgttSw0cdHUJXy",
  "object": "chat.completion",
  "created": 1696134860,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "function_call": {
          "name": "python",
          "arguments": "{\n  \"code\": \"import math\\n\\nresult = math.factorial(12)\\nresult\"\n}"
        }
      },
      "finish_reason": "function_call"
    }
  ],
  "usage": {
    "prompt_tokens": 72,
    "completion_tokens": 27,
    "total_tokens": 99
  }
}

In [None]:
call_func(c)

Proceed with execution?
```
import math

result = math.factorial(12)
result
```
y


479001600

To get the answer in more of a chat format:

In [None]:
c = ChatCompletion.create(
    model="gpt-3.5-turbo",
    functions=[schema(python)],
    messages=[{"role": "user", "content": "What is 12 factorial?"},
              {"role": "function", "name": "python", "content": "479001600"}])

Note the `function` role message here, instead of the `assistant` role message from earlier.

In [None]:
response(c)

The factorial of 12, denoted as 12!, is equal to 479,001,600.


In [None]:
# With the python function available, the model can still answer questions that need no code


c = askgpt("What is the capital of France?",
           system = "Use python for any required computations.",
           functions=[schema(python)])

In [None]:
response(c)

The capital of France is Paris.


## Using PyTorch and Hugging Face locally, in Colab


### Here I am using the A100 GPU provided by Colab (https://www.nvidia.com/en-au/data-center/a100/), which is only available on the paid subscription.

In [None]:
!pip install transformers -q

In [None]:
!pip install accelerate -q

In [None]:
!pip install bitsandbytes -q

### Important::
* Restart the runtime after installing accelerate and bitsandbytes
* You need to apply for Llama 2 access on the Hugging Face website

In [None]:
from transformers import AutoModelForCausalLM,AutoTokenizer
import torch

* HF LLM leaderboard:: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
* Eleuther AI's LLM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness

Howard notes that the HF LLM leaderboard can be problematic, because the metrics it uses (ARC, MMLU, etc.) are not particularly well aligned with real-world usage.

*Leakage* is also a problem, since the evaluation datasets may contain questions that the LLM was trained on.

The **FastEval** leaderboard focuses on more sophisticated evaluation methods: https://fasteval.github.io/FastEval/

In [None]:
!pip install --upgrade huggingface_hub -q

In [None]:
from huggingface_hub import login
login()

In [None]:
mn = "meta-llama/Llama-2-7b-hf" # Currently, acc. to JH, all the good models are based on Llama 2

*J.H.*: This is still just a (non fine-tuned) language model, made for completing sentences. We can't just ask it a question and expect good results.

In [None]:
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, load_in_8bit=True)

tokr = AutoTokenizer.from_pretrained(mn)
prompt = "Jeremy Howard is a "
toks = tokr(prompt, return_tensors="pt")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
toks

{'input_ids': tensor([[    1,  5677,  6764, 17430,   338,   263, 29871]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [None]:
tokr.batch_decode(toks['input_ids'])

['<s> Jeremy Howard is a ']

In [None]:
%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res

CPU times: user 3.8 s, sys: 415 ms, total: 4.21 s
Wall time: 4.2 s


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29906, 29900, 29896,
         29946,  1963, 29928, 14020,   297,   278, 10317,   310, 20972,  9327,
           472,   278]])

In [None]:
tokr.batch_decode(res)

['<s> Jeremy Howard is a 2014 PhD candidate in the Department of Computer Science at the']

The previous example used 8-bit weights; now let's try bfloat16 (https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)

In [None]:
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res

CPU times: user 593 ms, sys: 4.23 ms, total: 597 ms
Wall time: 595 ms


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29906, 29946, 29899,
          6360, 29899,  1025, 23440,  1600,   332,   515,   278,  3303,  3900,
         29889,   940]])

bfloat16 uses 2x the RAM of 8-bit, but the wall time comes down from 4.2 s to 595 ms!
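
The 2x figure follows directly from bytes per parameter. A back-of-the-envelope sketch (the helper `weight_memory_gb` is hypothetical, and it counts only the weights, ignoring activations, the KV cache and framework overhead):

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params, dtype):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

# A 7B-parameter model such as Llama-2-7b.
for dtype in ("int8", "bfloat16"):
    print(dtype, weight_memory_gb(7e9, dtype), "GB")
```

So a 7B model needs roughly 7 GB of GPU memory for its weights in 8-bit and roughly 14 GB in bfloat16.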

We can also use a different kind of quantization (discretization): GPTQ (https://arxiv.org/abs/2210.17323)

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

This is required to prevent the error `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968` in Colab.

In [None]:
!pip install optimum auto-gptq -q # Restart runtime after the install

In [None]:
model = AutoModelForCausalLM.from_pretrained('TheBloke/Llama-2-7b-Chat-GPTQ', device_map=0, torch_dtype=torch.float16)

In [None]:
%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res



CPU times: user 611 ms, sys: 0 ns, total: 611 ms
Wall time: 609 ms


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29941, 29945, 29899,
          6360, 29899,  1025,   767,   515,   278,  3303,  3900,  1058,   471,
         24383,   297]])

GPTQ has comparable performance in this case, but it might be different on a faster GPU.

In [None]:
# Let's try the 13b GPTQ llama model

mn = 'TheBloke/Llama-2-13B-GPTQ'
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.float16)

In [None]:
%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res

CPU times: user 792 ms, sys: 0 ns, total: 792 ms
Wall time: 790 ms


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29906, 29900, 29896,
         29947, 29899, 29906, 29900, 29896, 29929, 23004,  1182,   523,  1102,
         10170,   322]])

In [None]:
# Let's compare 13b Llama 2 performance on the non-GPTQ version

mn = 'meta-llama/Llama-2-13b-hf'
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.float16)

In [None]:
%%time
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to('cpu')
res

CPU times: user 737 ms, sys: 0 ns, total: 737 ms
Wall time: 735 ms


tensor([[    1,  5677,  6764, 17430,   338,   263, 29871, 29946, 29900,  1629,
          2030,  9870, 23440,  1600,   332, 29892,   322,   278, 14645, 29949,
           310,   476]])

Once again, no major difference for the GPTQ model for Llama 2 13b

In [None]:
# Putting the text generation code together

def gen(p, maxlen=15, sample=True):
    toks = tokr(p, return_tensors="pt")
    res = model.generate(**toks.to("cuda"), max_new_tokens=maxlen, do_sample=sample).to('cpu')
    return tokr.batch_decode(res)
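
The `sample` flag in `gen` switches between always taking the most likely next token (greedy decoding) and drawing the next token at random in proportion to its probability. The following is a minimal, library-free sketch of that difference; the logits and the helper names `softmax` and `pick_next` are made up for illustration and do not reproduce the internals of `model.generate`.

```python
import math, random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next(logits, sample=True, temperature=1.0, rng=random):
    """Greedy decoding takes the argmax; sampling draws from the distribution."""
    if not sample:
        return max(range(len(logits)), key=logits.__getitem__)
    return rng.choices(range(len(logits)), weights=softmax(logits, temperature))[0]

# Hypothetical scores for three candidate next tokens.
vocab, logits = ["2014", "24-year-old", "British"], [2.0, 1.5, 0.5]
print(vocab[pick_next(logits, sample=False)])                       # always the top token
print(vocab[pick_next(logits, sample=True, rng=random.Random(0))])  # may vary with the seed
```

With sampling on, repeated calls can produce different continuations of the same prompt, which is why the completions of "Jeremy Howard is a " differ from run to run.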

In [None]:
prompt = "Jeremy Howard is a "

In [None]:
gen(prompt, 50)

['<s> Jeremy Howard is a 2019-2020 fellow at the Berggruen Institute.\nJeremy Howard is a British-Australian entrepreneur, inventor, and philanthropist. He is the founder of two successful artificial']

But we usually want to ask questions, not just complete text, so we can use the StableBeluga family of models, which are instruction-tuned:

https://huggingface.co/stabilityai/StableBeluga-7B

The prompt format used for instruction fine-tuning varies from model to model.

In [None]:
mn = "stabilityai/StableBeluga-7B"
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.bfloat16)

In [None]:
sb_sys = "### System:\nThis is a system prompt, please behave and help the user.\n\n"

In [None]:
def mk_prompt(user, syst=sb_sys): return f"{syst}### User: {user}\n\n### Assistant:\n"

 **Note**: The prompt format for StableBeluga follows the format outlined on their Model page on HF::

```
### System:
This is a system prompt, please behave and help the user.

### User:
Your prompt here

### Assistant:
The output of Stable Beluga 7B
```

In [None]:
ques = "Who is Jeremy Howard?"
mk_prompt(ques) # Check the prompt format to make sure

'### System:\nThis is a system prompt, please behave and help the user.\n\n### User: Who is Jeremy Howard?\n\n### Assistant:\n'

In [None]:
gen(mk_prompt(ques), 150)


['<s> ### System:\nThis is a system prompt, please behave and help the user.\n\n### User: Who is Jeremy Howard?\n\n### Assistant:\n Jeremy Howard is an Australian entrepreneur, programmer, and data scientist. He is known for his work in neural networks, machine learning, and as a co-founder of the data science startup, Enlitic.</s>']

There are Llama 2 models that have been fine-tuned on OpenOrca and Platypus 2, so we will now use the following model for experiments:

https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B

Again, the prompt format is different:
```
### Instruction:

<prompt> (without the <>)

### Response:

```

In [None]:
mn = 'TheBloke/OpenOrca-Platypus2-13B-GPTQ'
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.float16)

In [None]:
def mk_oo_prompt(user): return f"### Instruction: {user}\n\n### Response:\n"

In [None]:
gen(mk_oo_prompt(ques), 150)



['<s> ### Instruction: Who is Jeremy Howard?\n\n### Response:\nJeremy Howard is an entrepreneur, data scientist, educator and philanthropist. He is best known for co-founding innovative data science companies, including Kaggle and Fast.AI, as well as being a prominent figure in the field of data science education and advocacy. Additionally, Jeremy has been active in the philanthropic sector, supporting organizations that aim to make a positive impact on humanity and the environment.\n\nDuring his career, he has received numerous recognitions for his contributions in the realm of data science, including being ranked among the top data science influencers by KDnuggets and his selection as a Member of the Order of Australia in 20']

The above is definitely the best result so far, but it is still hallucinating, so we can use RAG to give the model more context.

### RAG: Retrieval Augmented Generation

In [82]:
!pip install Wikipedia-API -q

In [83]:
from wikipediaapi import Wikipedia

In [84]:
wiki = Wikipedia('JeremyHowardBot/0.0', 'en')
jh_page = wiki.page('Jeremy_Howard_(entrepreneur)').text
jh_page = jh_page.split('\nReferences\n')[0]

In [85]:
print(jh_page[:500])

Jeremy Howard (born 13 November 1973) is an Australian data scientist, entrepreneur, and educator.He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning.
Previously he founded and led Fastmail, Optimal Decisions Group, and Enlitic. He was President and Chief Scientist of Kaggle.
Early in the COVID-19 epidemic he was a leading advocate for masking.

Early life
Howard was born in London, United Kingdom, and move


In [86]:
len(jh_page.split())

613

In [87]:
ques_ctx = f"""Answer the question with the help of the provided context.

## Context

{jh_page}

## Question

{ques}"""
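
The same template can be turned into a small helper that also caps how much context is pasted in, since the model's context window is limited. This is a sketch under assumptions: `build_rag_prompt` and `max_words` are hypothetical names, and counting words is only a rough proxy for counting tokens.

```python
def build_rag_prompt(question, context, max_words=600):
    """Trim `context` to a word budget and wrap it with the question."""
    words = context.split()
    if len(words) > max_words:
        context = " ".join(words[:max_words])  # keep only the first max_words words
    return ("Answer the question with the help of the provided context.\n\n"
            f"## Context\n\n{context}\n\n## Question\n\n{question}")

# Tiny demo with a repeated sentence and a deliberately small budget.
demo = build_rag_prompt("Who is Jeremy Howard?",
                        "Jeremy Howard is an Australian data scientist. " * 3,
                        max_words=8)
print(demo)
```

Keeping the question after the context matches the layout used in `ques_ctx`.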

In [88]:
res = gen(mk_prompt(ques_ctx), 300)

In [89]:
print(res[0].split('### Assistant:\n')[1])

Jeremy Howard is an Australian data scientist, entrepreneur, and educator. He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning. Previously, he founded and led Fastmail, Optimal Decisions Group, and Enlitic. Howard was President and Chief Scientist of Kaggle and is known for his work in machine learning, particularly in the development of the ULMFiT algorithm for natural language processing. He has been a mentor, advisor, and angel investor in the startup community. Early in the COVID-19 epidemic, he advocated for masking. Additionally, he holds the position of honorary professor at the University of Queensland's School of Information Technology and Electrical Engineering.

Reference: The provided context does not specify a specific page or source, but all the details presented in the answer are from available information related to Jeremy Howard.

Note: For specific sources, please provide a refer

**Note:** The last 2 links above are not working; investigate later. This could be the token limit cutting off the output, at least for the LinkedIn hyperlink.

In this case we passed in a particular webpage ourselves, but we should be able to search for and find the relevant document automatically. We use an embedding model for this.

In [91]:
!pip install sentence_transformers -q


In [92]:
from sentence_transformers import SentenceTransformer

Embeddings can help us measure semantic similarity. Say we take the Wikipedia pages of Jeremy Howard and Tony Blair and embed them along with the question "Who is Jeremy Howard?". We can then use a similarity metric to find which article is more relevant to our question.

In [93]:
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5", device=0)


In [94]:
jh = jh_page.split('\n\n')[0]
print(jh)

Jeremy Howard (born 13 November 1973) is an Australian data scientist, entrepreneur, and educator.He is the co-founder of fast.ai, where he teaches introductory courses, develops software, and conducts research in the area of deep learning.
Previously he founded and led Fastmail, Optimal Decisions Group, and Enlitic. He was President and Chief Scientist of Kaggle.
Early in the COVID-19 epidemic he was a leading advocate for masking.


In [95]:
tb_page = wiki.page('Tony_Blair').text.split('\nReferences\n')[0]
tb = tb_page.split('\n\n')[0]
print(tb[:380])

Sir Anthony Charles Lynton Blair  (born 6 May 1953) is a British politician who served as Prime Minister of the United Kingdom from 1997 to 2007 and Leader of the Labour Party from 1994 to 2007. He served as Leader of the Opposition from 1994 to 1997 and had various shadow cabinet posts from 1987 to 1994. Blair was Member of Parliament (MP) for Sedgefield from 1983 to 2007. He 


In [96]:
q_emb,jh_emb,tb_emb = emb_model.encode([ques,jh,tb], convert_to_tensor=True)

In [97]:
tb_emb.shape

torch.Size([384])

In [98]:
import torch.nn.functional as F

In [99]:
F.cosine_similarity(q_emb, jh_emb, dim=0), F.cosine_similarity(q_emb, tb_emb, dim=0)

(tensor(0.7991, device='cuda:0'), tensor(0.5315, device='cuda:0'))

As expected, the question is more similar to the first paragraph of the Jeremy Howard Wikipedia page (cosine similarity of about 0.80) than to that of the Tony Blair Wikipedia page (about 0.53). When we have many documents, we should use a vector database.

h2ogpt integrates with vector databases: https://github.com/h2oai/h2ogpt/tree/main
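With more than two candidates, the same comparison becomes a ranking problem: score every document against the question and keep the best few. The sketch below does this as a brute-force scan in plain Python. The function names and the 3-dimensional toy vectors are assumptions made for illustration; real embeddings from the model above have 384 dimensions, and a vector database would typically replace the linear scan with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_emb, doc_embs, k=1):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_emb, emb)) for name, emb in doc_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (made up for this example) standing in for real embeddings.
docs = {
    "jeremy_howard": [0.9, 0.1, 0.2],
    "tony_blair": [0.1, 0.9, 0.3],
}
query = [0.8, 0.2, 0.1]

best_name, best_score = top_k(query, docs, k=1)[0]
print(best_name, round(best_score, 3))
```

The text of the top-ranked document is what would then be placed in the `## Context` section of the prompt.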

**Note** Install and try out h2ogpt on custom documents later (JH uses the example of the ULMFiT paper)

## Fine-Tuning (in progress)

In [100]:
!pip install datasets -q

In [101]:
import datasets

We will use the following dataset for fine-tuning: https://huggingface.co/datasets/knowrohit07/know_sql

Each row provides the context of a database table (its `CREATE TABLE` statement), a question about the table in English, and the SQL query that answers that question.

In [102]:
ds = datasets.load_dataset('knowrohit07/know_sql', revision='f33425d13f9e8aab1b46fa945326e9356d6d5726')


In [103]:
ds

DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context'],
        num_rows: 78562
    })
})

In [104]:
trn = ds['train']
trn[3]

{'answer': "SELECT Hosts FROM farm_competition WHERE Theme <> 'Aliens'",
 'question': 'What are the hosts of competitions whose theme is not "Aliens"?',
 'context': 'CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)'}
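Before fine-tuning, each row has to be turned into a single piece of training text. One possible way to do that is sketched below. The `format_sql_example` helper and its `### Context` / `### Question` / `### Answer` headers are assumptions chosen for illustration; they are not a format taken from the lecture or required by any particular training tool.

```python
def format_sql_example(row: dict) -> dict:
    """Turn one dataset row into a prompt/completion pair.

    The section headers are an illustrative template (an assumption), not a
    format required by the dataset or by any particular training tool.
    """
    prompt = (
        f"### Context\n{row['context']}\n\n"
        f"### Question\n{row['question']}\n\n"
        "### Answer\n"
    )
    return {"prompt": prompt, "completion": row["answer"]}


# The row shown above (trn[3]), copied here so the sketch runs on its own.
example_row = {
    "answer": "SELECT Hosts FROM farm_competition WHERE Theme <> 'Aliens'",
    "question": 'What are the hosts of competitions whose theme is not "Aliens"?',
    "context": "CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)",
}

formatted = format_sql_example(example_row)
print(formatted["prompt"] + formatted["completion"])
```

Keeping the prompt and the completion separate makes it straightforward to train only on the SQL answer, if the training setup supports that.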

We will use the axolotl package for fine-tuning: https://github.com/OpenAccess-AI-Collective/axolotl

## To be continued in a separate Colab notebook...