<a href="https://colab.research.google.com/github/miaomiaozhang20/ec970_spring2024/blob/main/exploring_llm_pricing_cost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experimentation setup

In [None]:
def calculate_cost_per_token(cost_per_million_tokens):
    cost_per_token = cost_per_million_tokens / 1_000_000
    return cost_per_token
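
As a quick sanity check on the conversion above, here is a minimal worked example of pricing a single request. The helper `request_cost` and the token counts are hypothetical, introduced only for illustration. The prices ($0.50 input / $1.50 output per million tokens) match one of the entries in the cost maps used later in this notebook.

```python
def calculate_cost_per_token(cost_per_million_tokens: float) -> float:
    """Convert a price quoted per million tokens into a per-token price."""
    return cost_per_million_tokens / 1_000_000


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Hypothetical helper: dollar cost of one request, given its token counts."""
    input_cost = input_tokens * calculate_cost_per_token(input_price_per_million)
    output_cost = output_tokens * calculate_cost_per_token(output_price_per_million)
    return input_cost + output_cost


# Assumed request size: 1,200 prompt tokens and 500 completion tokens.
# 1,200 * $0.0000005 + 500 * $0.0000015 = $0.0006 + $0.00075
example_cost = request_cost(1_200, 500, 0.50, 1.50)
print(f"${example_cost:.5f}")  # prints $0.00135
```

Input and output tokens are priced separately because most providers charge more for generated tokens than for prompt tokens.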

In [None]:
message = """
In your own words, explain strengths and weaknesses of Nvidia's business model, given the following context:

```
NVIDIA pioneered accelerated computing to help solve the most challenging computational problems. NVIDIA is now a full-stack computing infrastructure
company with data-center-scale offerings that are reshaping industry.
Our full-stack includes the foundational CUDA programming model that runs on all NVIDIA GPUs, as well as hundreds of domain-specific software libraries,
software development kits, or SDKs, and Application Programming Interfaces, or APIs. This deep and broad software stack accelerates the performance and
eases the deployment of NVIDIA accelerated computing for computationally intensive workloads such as artificial intelligence, or AI, model training and
inference, data analytics, scientific computing, and 3D graphics, with vertical-specific optimizations to address industries ranging from healthcare and telecom to
automotive and manufacturing.
Our data-center-scale offerings are comprised of compute and networking solutions that can scale to tens of thousands of GPU-accelerated servers
interconnected to function as a single giant computer; this type of data center architecture and scale is needed for the development and deployment of modern
AI applications.
The GPU was initially used to simulate human imagination, enabling the virtual worlds of video games and films. Today, it also simulates human intelligence,
enabling a deeper understanding of the physical world. Its parallel processing capabilities, supported by thousands of computing cores, are essential for deep
learning algorithms. This form of AI, in which software writes itself by learning from large amounts of data, can serve as the brain of computers, robots and self
driving cars that can perceive and understand the world. GPU-powered AI solutions are being developed by thousands of enterprises to deliver services and
products that would have been immensely difficult or even impossible with traditional coding. Examples include generative AI, which can create new content
such as text, code, images, audio, video, and molecule structures, and recommendation systems, which can recommend highly relevant content such as
products, services, media or ads using deep neural networks trained on vast datasets that capture the user preferences.
NVIDIA has a platform strategy, bringing together hardware, systems, software, algorithms, libraries, and services to create unique value for the markets we
serve. While the computing requirements of these end markets are diverse, we address them with a unified underlying architecture leveraging our GPUs and
networking and software stacks. The programmable nature of our architecture allows us to support several multi-billion-dollar end markets with the same
underlying technology by using a variety of software stacks developed either internally or by third-party developers and partners. The large and growing number
of developers and installed base across our platforms strengthens our ecosystem and increases the value of our platform to our customers.
Innovation is at our core. We have invested over $45.3 billion in research and development since our inception, yielding inventions that are essential to modern
computing. Our invention of the GPU in 1999 sparked the growth of the PC gaming market and redefined computer graphics. With our introduction of the CUDA
programming model in 2006, we opened the parallel processing capabilities of our GPU to a broad range of compute-intensive applications, paving the way for
the emergence of modern AI. In 2012, the AlexNet neural network, trained on NVIDIA GPUs, won the ImageNet computer image recognition competition,
marking the “Big Bang” moment of AI. We introduced our first Tensor Core GPU in 2017, built from the ground-up for the new era of AI, and our first autonomous
driving system-on-chips, or SoC, in 2018. Our acquisition of Mellanox in 2020 expanded our innovation canvas to include networking and led to the introduction
of a new processor class – the data processing unit, or DPU. Over the past 5 years, we have built full software stacks that run on top of our GPUs and CUDA to
bring AI to the world’s largest industries, including NVIDIA DRIVE stack for autonomous driving, Clara for healthcare, and Omniverse for industrial digitalization;
and introduced the NVIDIA AI Enterprise software – essentially an operating system for enterprise AI applications. In 2023, we introduced our first data center
CPU, Grace, built for giant-scale AI and high-performance computing. With a strong engineering culture, we drive fast, yet harmonized, product and technology
innovations in all dimensions of computing including silicon, systems, networking, software and algorithms. More than half of our engineers work on software.
The world’s leading cloud service providers, or CSPs, and consumer internet companies use our data center-scale accelerated computing platforms to enable,
accelerate or enrich the services they deliver to billions of end users, including AI solutions and assistants, search, recommendations, social networking, online
shopping, live video, and translation.
Enterprises and startups across a broad range of industries use our accelerated computing platforms to build new generative AI-enabled products and services,
or to dramatically accelerate and reduce the costs of their workloads and workflows. The enterprise software industry uses them for new AI assistants and
chatbots; the transportation industry for autonomous driving; the healthcare industry for accelerated and computer-aided drug discovery; and the financial
services industry for customer support and fraud detection.

Researchers and developers use our computing solutions to accelerate a wide range of important applications, from simulating molecular dynamics to climate
forecasting. With support for more than 3,500 applications, NVIDIA computing enables some of the most promising areas of discovery, from climate prediction to
materials science and from wind tunnel simulation to genomics. Including GPUs and networking, NVIDIA powers over 75% of the supercomputers on the global
TOP500 list, including 24 of the top 30 systems on the Green500 list.
Gamers choose NVIDIA GPUs to enjoy immersive, increasingly cinematic virtual worlds. In addition to serving the growing number of gamers, the market for PC
GPUs is expanding because of the burgeoning population of live streamers, broadcasters, artists, and creators. With the advent of generative AI, we expect a
broader set of PC users to choose NVIDIA GPUs for running generative AI applications locally on their PC, which is critical for privacy, latency, and cost-sensitive
AI applications.
Professional artists, architects and designers use NVIDIA partner products accelerated with our GPUs and software platform for a range of creative and design
use cases, such as creating visual effects in movies or designing buildings and products. In addition, generative AI is expanding the market for our workstation
class GPUs, as more enterprise customers develop and deploy AI applications with their data on-premises.
Headquartered in Santa Clara, California, NVIDIA was incorporated in California in April 1993 and reincorporated in Delaware in April 1998.
```
"""

# Groq

In [None]:
!pip install -U -q groq

In [None]:
import getpass
import os

# Set your Groq API key
os.environ["GROQ_API_KEY"] = getpass.getpass()

In [None]:
import os
import time

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def call_groq(query: str, model_name: str):
  # Send the prompt passed in as `query` (not the global `message`) to the given Groq model
  return client.chat.completions.create(
      messages=[
          {
              "role": "user",
              "content": query,
          }
      ],
      max_tokens=1000,
      temperature=0.0,
      model=model_name,
  )

In [None]:
# Define cost map
model_costs = {
    "mixtral-8x7b-32768": {
        "input_tokens": 0.27,
        "output_tokens": 0.27,
    },
}

In [None]:
for model_name, costs in model_costs.items():
    num_iterations = 10
    total_time = 0
    total_output_tokens = 0
    total_cost = 0

    for i in range(num_iterations):
        # Call the model
        start = time.time()
        response = call_groq(message, model_name=model_name)
        end = time.time()

        # Calculate costs
        input_cost = calculate_cost_per_token(costs["input_tokens"])
        output_cost = calculate_cost_per_token(costs["output_tokens"])
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        iteration_cost = (input_tokens * input_cost) + (output_tokens * output_cost)

        # Calculate time
        iteration_time = end - start

        # Calculate totals
        total_time += iteration_time
        total_output_tokens += output_tokens
        total_cost += iteration_cost

        # Prevent rate limiting
        time.sleep(1)

    avg_tokens_per_second = total_output_tokens / total_time
    avg_cost = total_cost / num_iterations

    print(f"Model: {model_name}")
    print(f"Average tokens per second: {avg_tokens_per_second:.2f}")
    print(f"Average total cost: ${avg_cost:.5f}")
    print()

# Anthropic

In [None]:
!pip install -U -q anthropic

In [None]:
os.environ["ANTHROPIC_API_KEY"] = getpass.getpass()

## Define code to call model

In [None]:
from anthropic import Anthropic

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def call_anthropic(query: str, model_name: str):
  # Returns the full Message object (not a str) so the caller can read `usage`
  return client.messages.create(
      temperature=0.0,
      max_tokens=1000,
      messages=[
          {
              "role": "user",
              "content": query,
          },
      ],
      model=model_name,
  )

## Calculate cost and speed

In [None]:
# Define cost map
model_costs = {
    "claude-3-haiku-20240307": {
        "input_tokens": 0.25,
        "output_tokens": 1.25,
    },
    "claude-3-sonnet-20240229": {
        "input_tokens": 3.00,
        "output_tokens": 15.00,
    },
    "claude-3-opus-20240229": {
        "input_tokens": 15.00,
        "output_tokens": 75.00,
    },
}

In [None]:
for model_name, costs in model_costs.items():
    num_iterations = 10
    total_time = 0
    total_output_tokens = 0
    total_cost = 0

    for i in range(num_iterations):
        # Call the model
        start = time.time()
        response = call_anthropic(message, model_name=model_name)
        end = time.time()

        # Calculate costs
        input_cost = calculate_cost_per_token(costs["input_tokens"])
        output_cost = calculate_cost_per_token(costs["output_tokens"])
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        iteration_cost = (input_tokens * input_cost) + (output_tokens * output_cost)

        # Calculate time
        iteration_time = end - start

        # Calculate totals
        total_time += iteration_time
        total_output_tokens += output_tokens
        total_cost += iteration_cost

        # Prevent rate limiting
        time.sleep(1)

    avg_tokens_per_second = total_output_tokens / total_time
    avg_cost = total_cost / num_iterations

    print(f"Model: {model_name}")
    print(f"Average tokens per second: {avg_tokens_per_second:.2f}")
    print(f"Average total cost: ${avg_cost:.5f}")
    print()

# Cohere

In [None]:
!pip install -U -q cohere

In [None]:
# Set your Cohere API key
os.environ["COHERE_API_KEY"] = getpass.getpass()

In [None]:
import cohere

# Get your cohere API key on: www.cohere.com
co = cohere.Client(os.environ["COHERE_API_KEY"])

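def call_cohere(query: str, model_name: str):
  # Pass the model explicitly; otherwise every entry in the cost map
  # would be benchmarked against the API's default model
  return co.chat(
      message=query,
      model=model_name,
      max_tokens=1000,
      temperature=0.0,
  )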

In [None]:
# Define cost map
model_costs = {
    "command-light": {
        "input_tokens": 0.30,
        "output_tokens": 0.60,
    },
    "command-r": {
        "input_tokens": 0.50,
        "output_tokens": 1.50,
    },
}

In [None]:
for model_name, costs in model_costs.items():
    num_iterations = 10
    total_time = 0
    total_output_tokens = 0
    total_cost = 0

    for i in range(num_iterations):
        # Call the model
        start = time.time()
        response = call_cohere(message, model_name=model_name)
        end = time.time()

        # Calculate costs
        input_cost = calculate_cost_per_token(costs["input_tokens"])
        output_cost = calculate_cost_per_token(costs["output_tokens"])
        input_tokens = response.meta["billed_units"]["input_tokens"]
        output_tokens = response.meta["billed_units"]["output_tokens"]
        iteration_cost = (input_tokens * input_cost) + (output_tokens * output_cost)

        # Calculate time
        iteration_time = end - start

        # Calculate totals
        total_time += iteration_time
        total_output_tokens += output_tokens
        total_cost += iteration_cost

        # Prevent rate limiting
        time.sleep(1)

    avg_tokens_per_second = total_output_tokens / total_time
    avg_cost = total_cost / num_iterations

    print(f"Model: {model_name}")
    print(f"Average tokens per second: {avg_tokens_per_second:.2f}")
    print(f"Average total cost: ${avg_cost:.5f}")
    print()

# Mistral

In [None]:
!pip install -U -q mistralai

In [None]:
# Set your Mistral API key
os.environ["MISTRAL_API_KEY"] = getpass.getpass()

In [None]:
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key=os.environ["MISTRAL_API_KEY"])

def call_mistral(query: str, model_name: str):
  return client.chat(
      model=model_name,
      max_tokens=1000,
      temperature=0.0,
      messages=[
        ChatMessage(role="user", content=query)
      ]
  )

In [None]:
# Define cost map
model_costs = {
    "mistral-small-2312": {
        "input_tokens": 0.70,
        "output_tokens": 0.70,
    },
    "mistral-large-2402": {
        "input_tokens": 8.00,
        "output_tokens": 24.00,
    },
}

In [None]:
for model_name, costs in model_costs.items():
    num_iterations = 10
    total_time = 0
    total_output_tokens = 0
    total_cost = 0

    for i in range(num_iterations):
        # Call the model
        start = time.time()
        response = call_mistral(message, model_name=model_name)
        end = time.time()

        # Calculate costs
        input_cost = calculate_cost_per_token(costs["input_tokens"])
        output_cost = calculate_cost_per_token(costs["output_tokens"])
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        iteration_cost = (input_tokens * input_cost) + (output_tokens * output_cost)

        # Calculate time
        iteration_time = end - start

        # Calculate totals
        total_time += iteration_time
        total_output_tokens += output_tokens
        total_cost += iteration_cost

        # Prevent rate limiting
        time.sleep(1)

    avg_tokens_per_second = total_output_tokens / total_time
    avg_cost = total_cost / num_iterations

    print(f"Model: {model_name}")
    print(f"Average tokens per second: {avg_tokens_per_second:.2f}")
    print(f"Average total cost: ${avg_cost:.5f}")
    print()

# OpenAI

In [None]:
!pip install -U -q openai

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [None]:
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def call_openai(query: str, model_name: str):
  return client.chat.completions.create(
      model=model_name,
      temperature=0,
      max_tokens=1000,
      messages=[
          {"role": "user", "content": query},
      ]
  )

In [None]:
# Define cost map
model_costs = {
    "gpt-3.5-turbo-0125": {
        "input_tokens": 0.50,
        "output_tokens": 1.50,
    },
    "gpt-4-0125-preview": {
        "input_tokens": 10.00,
        "output_tokens": 30.00,
    },
}

In [None]:
for model_name, costs in model_costs.items():
    num_iterations = 10
    total_time = 0
    total_output_tokens = 0
    total_cost = 0

    for i in range(num_iterations):
        # Call the model
        start = time.time()
        response = call_openai(message, model_name=model_name)
        end = time.time()

        # Calculate costs
        input_cost = calculate_cost_per_token(costs["input_tokens"])
        output_cost = calculate_cost_per_token(costs["output_tokens"])
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        iteration_cost = (input_tokens * input_cost) + (output_tokens * output_cost)

        # Calculate time
        iteration_time = end - start

        # Calculate totals
        total_time += iteration_time
        total_output_tokens += output_tokens
        total_cost += iteration_cost

        # Prevent rate limiting
        time.sleep(1)

    avg_tokens_per_second = total_output_tokens / total_time
    avg_cost = total_cost / num_iterations

    print(f"Model: {model_name}")
    print(f"Average tokens per second: {avg_tokens_per_second:.2f}")
    print(f"Average total cost: ${avg_cost:.5f}")
    print()
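
The measurement loop above is repeated almost verbatim for every provider; only the call function and the names of the usage fields differ. One possible way to factor it out is sketched below. The names `benchmark_model`, `call_fn`, `get_usage`, and `pause_seconds` are hypothetical and not part of any provider SDK, and the sketch uses `time.perf_counter()` in place of `time.time()` for interval timing. A stand-in `fake_call` replaces the real API so the block runs without keys.

```python
import time
from typing import Callable, Dict, Tuple


def benchmark_model(
    call_fn: Callable[[], object],
    get_usage: Callable[[object], Tuple[int, int]],
    costs: Dict[str, float],
    num_iterations: int = 10,
    pause_seconds: float = 1.0,
) -> Tuple[float, float]:
    """Return (output tokens per second, average cost per request in dollars).

    call_fn   -- zero-argument function that performs one model call
    get_usage -- extracts (input_tokens, output_tokens) from the response
    costs     -- prices per million tokens, keyed like the cost maps above
    """
    input_price = costs["input_tokens"] / 1_000_000
    output_price = costs["output_tokens"] / 1_000_000
    total_time = 0.0
    total_output_tokens = 0
    total_cost = 0.0

    for _ in range(num_iterations):
        # Time only the call itself, not the pause between calls
        start = time.perf_counter()
        response = call_fn()
        total_time += time.perf_counter() - start

        input_tokens, output_tokens = get_usage(response)
        total_output_tokens += output_tokens
        total_cost += input_tokens * input_price + output_tokens * output_price

        # Pause between calls to stay under rate limits
        time.sleep(pause_seconds)

    return total_output_tokens / total_time, total_cost / num_iterations


def fake_call() -> dict:
    """Stand-in for a real API call: fixed latency and fixed token counts."""
    time.sleep(0.01)
    return {"input": 100, "output": 50}


tokens_per_second, avg_cost = benchmark_model(
    fake_call,
    lambda r: (r["input"], r["output"]),
    {"input_tokens": 0.50, "output_tokens": 1.50},
    num_iterations=3,
    pause_seconds=0.0,
)
print(f"Average tokens per second: {tokens_per_second:.2f}")
print(f"Average total cost: ${avg_cost:.6f}")
```

With a real client, the same helper could be called as `benchmark_model(lambda: call_openai(message, model_name), lambda r: (r.usage.prompt_tokens, r.usage.completion_tokens), costs)`, swapping in the usage field names each provider returns.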