# Labeling using pretrained models

## OpenAI GPT-4

### Introduction

OpenAI provides pretrained models that can be used for a variety of tasks. We can use these models to label data without providing any examples (zero-shot classification).

#### Why GPT-4?

One might ask, "Why not use GPT-3?" The reason is that GPT-4 has a larger context window, which we need in order to feed entire webpages to the model. GPT-3 may still be enough for smaller text inputs such as a meta title or meta description. GPT-4o-mini is another option: it is cheaper while still having a large context window (128k tokens, the same as GPT-4o). You can learn more about the different models [here](https://platform.openai.com/docs/models).

#### Challenges

The main challenge is to get the model to output the labels in the desired format. We will leverage [function calling](https://platform.openai.com/docs/guides/function-calling) to ensure [structured output](https://platform.openai.com/docs/guides/structured-outputs) from the model.

#### Downsides

- Latency. Every label requires an API call to OpenAI's servers, which is relatively slow.
- Cost. OpenAI's API is not free and the cost can add up quickly. We can reduce usage by caching results, stripping the text down to the relevant parts, etc.
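
The two cost optimizations mentioned above (caching and trimming the input) could look roughly like the sketch below. It is not part of the original notebook: `trim_text`, `cached_label`, the `.label_cache` directory, and the character budget are hypothetical names and values chosen for illustration, and `label_fn` stands for any function that maps a URL to a dictionary of labels.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical cache location; adjust to your project layout.
CACHE_DIR = Path(".label_cache")

# Assumed budget. Characters are only a rough proxy for tokens.
MAX_CHARS = 400_000


def trim_text(text: str, max_chars: int = MAX_CHARS) -> str:
    """Collapse runs of whitespace and cut the text to at most max_chars characters."""
    collapsed = " ".join(text.split())
    return collapsed[:max_chars]


def cached_label(url: str, label_fn, cache_dir: Path = CACHE_DIR) -> dict:
    """Return the cached labels for url, calling label_fn only on a cache miss."""
    # Hash the URL so it can be used safely as a file name
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{digest}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    result = label_fn(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this in place, `cached_label(url, label)` would only hit the API the first time a given URL is seen, and `trim_text` could be applied to the sanitized HTML before it is sent to the model.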

### Install libraries

In [None]:
%pip install openai python-dotenv requests html_sanitizer

Note: you may need to restart the kernel to use updated packages.


### Import libraries

In [None]:
import os
import requests
import json
from openai import OpenAI
from dotenv import load_dotenv
from html_sanitizer import Sanitizer

### Configuring client

In [None]:
# Load the API key from the .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_ORG_ID = os.getenv("OPENAI_ORG_ID")

# Create a client
client = OpenAI(api_key=OPENAI_API_KEY, organization=OPENAI_ORG_ID)

### Functions

In [None]:
def Bank(guess: bool):
    return guess

def Pay(guess: bool):
    return guess

def Crypto(guess: bool):
    return guess

# "function definitions" for the OpenAI API. This helps the model understand what the functions are supposed to do.
tools = [
    {
        "type": "function",
        "function": {
            "name": "Bank",
            "description": "Determines if the website is a bank.",
            "parameters": {
                "type": "object",
                "properties": {
                    "guess": {
                        "type": "boolean",
                        "description": "The guess of whether the website is a bank.",
                    },
                },
                "required": ["guess"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "Pay",
            "description": "Determines if the page is a payment page or asks for payment information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "guess": {
                        "type": "boolean",
                        "description": "The guess of whether the page is a payment page or asks for payment information.",
                    },
                },
                "required": ["guess"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "Crypto",
            "description": "Determines if the page is part of a cryptocurrency website.",
            "parameters": {
                "type": "object",
                "properties": {
                    "guess": {
                        "type": "boolean",
                        "description": "The guess of whether the website is a cryptocurrency website.",
                    },
                },
                "required": ["guess"],
                "additionalProperties": False,
            },
        },
    },
]
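
The three definitions above share the same shape and differ only in their name and descriptions. As an optional alternative, a small helper can build them from a list of specs. This is a sketch, not the notebook's own code: `make_tool`, `LABEL_SPECS`, and `generated_tools` are hypothetical names introduced here.

```python
def make_tool(name: str, description: str, guess_description: str) -> dict:
    """Build one function definition with a single required boolean 'guess' parameter."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    "guess": {
                        "type": "boolean",
                        "description": guess_description,
                    },
                },
                "required": ["guess"],
                "additionalProperties": False,
            },
        },
    }


# (name, function description, description of the 'guess' parameter)
LABEL_SPECS = [
    (
        "Bank",
        "Determines if the website is a bank.",
        "The guess of whether the website is a bank.",
    ),
    (
        "Pay",
        "Determines if the page is a payment page or asks for payment information.",
        "The guess of whether the page is a payment page or asks for payment information.",
    ),
    (
        "Crypto",
        "Determines if the page is part of a cryptocurrency website.",
        "The guess of whether the website is a cryptocurrency website.",
    ),
]

generated_tools = [make_tool(*spec) for spec in LABEL_SPECS]
```

Adding a new label then only requires one more entry in `LABEL_SPECS`, which keeps the definitions from drifting apart.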

### Labeling

In [None]:
def label(url: str):
    labels = {
        "Bank": None,
        "Pay": None,
        "Crypto": None,
    }

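    # Make an HTTP request to the URL and get the HTML content
    http_response = requests.get(url, timeout=30)
    http_response.raise_for_status()  # Fail early on HTTP errors instead of labeling an error page
    html = http_response.text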

    # Sanitize the HTML to ensure it is safe to pass to the model. It also helps to reduce the size of the content.
    sanitizer = Sanitizer()
    sanitized_html = sanitizer.sanitize(html)

    # Prompt the model with the HTML content.
    messages = [
        {
            "role": "system",
            "content": "You are a website security tool. You have been asked to determine if the following websites are banks, payment processors, or cryptocurrency websites.",
        },
        {
            "role": "user",
            "content": f"This is the HTML of the website: {sanitized_html}",
        },
    ]

    # Make a request to the OpenAI API
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        stream=False,  # Return all messages at once
        parallel_tool_calls=True,  # Functions can be called in parallel - no need to be executed sequentially
        tool_choice="required",  # Forces the assistant to use the tools
        temperature=0.0,  # No randomness/creativity in the responses
    )

    for choice in response.choices:
        if choice.message.tool_calls:
            for tool_call in choice.message.tool_calls:
                # Arguments are passed as a JSON string
                arguments = json.loads(tool_call.function.arguments)
                # Assign the guess to the corresponding label. We name the labels as the function names. Convert the guess (boolean) to an integer.
                labels[tool_call.function.name] = int(arguments["guess"])

    return labels
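
The parsing loop at the end of `label` can be tried without spending API credits. The sketch below pulls it into a hypothetical `parse_labels` helper and feeds it a hand-built stand-in for the response. The `SimpleNamespace` objects only imitate the attributes the loop reads (`choices`, `message.tool_calls`, `function.name`, `function.arguments`); they are not real OpenAI response objects.

```python
import json
from types import SimpleNamespace


def parse_labels(response, label_names=("Bank", "Pay", "Crypto")) -> dict:
    """Collect the boolean guesses from the tool calls into a {label: 0/1/None} dict."""
    labels = {name: None for name in label_names}
    for choice in response.choices:
        for tool_call in choice.message.tool_calls or []:
            name = tool_call.function.name
            if name not in labels:
                continue  # Ignore calls to function names we did not define
            # Arguments are passed as a JSON string
            arguments = json.loads(tool_call.function.arguments)
            labels[name] = int(arguments["guess"])
    return labels


def fake_tool_call(name: str, guess: bool) -> SimpleNamespace:
    """Build a stand-in tool call exposing only .function.name and .function.arguments."""
    return SimpleNamespace(
        function=SimpleNamespace(name=name, arguments=json.dumps({"guess": guess}))
    )


# A fake response in which the model called Bank and Crypto, but not Pay
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                tool_calls=[fake_tool_call("Bank", True), fake_tool_call("Crypto", False)]
            )
        )
    ]
)

print(parse_labels(fake_response))  # {'Bank': 1, 'Pay': None, 'Crypto': 0}
```

A label that stays `None` means the model did not call the corresponding function, which is worth distinguishing from an explicit `0`.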

In [None]:
url = "https://www.google.com" # Not a bank, not payment, not crypto

label(url)

{'Bank': 0, 'Pay': 0, 'Crypto': 0}

In [None]:
url = "https://societegenerale.ci/fr/" # Bank, not payment, not crypto

label(url)

{'Bank': 1, 'Pay': 0, 'Crypto': 0}

In [None]:
url = "https://nuxt.lemonsqueezy.com/checkout"  # Not a bank, payment, not crypto

label(url)

{'Bank': 0, 'Pay': 1, 'Crypto': 0}

In [None]:
url = "https://www.binance.com/fr"  # Not a bank, not payment, crypto

label(url)

{'Bank': 0, 'Pay': 0, 'Crypto': 1}