Python to C++ Code Converter
============================

Converts Python code to C++ for performance:
- Solution with a Frontier Model
- Solution with an Open-Source model

Download the [results spreadsheet here](../content/Python-to-CPP-Results.csv).

Performance of C++ solution with identical results:
- Claude 3.5 Sonnet is the winner, followed by GPT-4o, followed by CodeQwen
- Note: Qwen has 7B parameters; its closed-source cousins have more than 1T
- For everyday problems, Qwen is more than capable of converting Python to Optimized C++ code

Simple coding problem result:
- Both Frontier models and Qwen were able to write optimized C++ code for a simple program to estimate pi.
- GPT-4o needed additional instructions to include libraries so C++ code would run
- Qwen 1.5 still provided explanation text despite explicitly being told not to in the prompt
- GPT-4o reduced time by 98.5% or 67x faster (186ms).
- Claude 3.5 reduced time by 98.4% or 62x faster (204ms).
- Qwen 1.5 reduced time by 98.5% or 67x faster (188ms).

Harder coding problem result:
- Both Frontier models were able to write optimized C++ code for a harder program to calculate the maximum subarray sum.
- All models figured out the intent of the function and rewrote the code using [Kadane's Algorithm](https://en.wikipedia.org/wiki/Maximum_subarray_problem), which solves it in a single loop instead of a nested loop.
- GPT-4o's code ran with a warning related to an overflow issue and produced the wrong result.
- Qwen 1.5's explanation text had to be removed before the code would run, and the code produced the wrong result because it changed the random number generation function.
- GPT-4o reduced time by 99.137% or 116x faster (415.427ms).
- Claude 3.5 reduced time by 99.9984% or 64,579x faster (0.745ms).
- Qwen 1.5 reduced time by 99.9987% or 77,069x faster (0.624ms).
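
The "reduced time by X%" and "Nx faster" figures above are two views of the same ratio. The sketch below shows the conversion; the function names are illustrative and not part of the notebook.

```python
def speedup_factor(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


def percent_reduction(speedup: float) -> float:
    """Percentage of the baseline run time saved at a given speedup factor."""
    return 100.0 * (1.0 - 1.0 / speedup)


# Speedup factors quoted in the results above
for factor in (62, 67, 116):
    print(f"{factor}x faster -> {percent_reduction(factor):.1f}% less time")
```

Because the reported factors are rounded, the percentages derived this way can differ from the reported ones in the last digit.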

To do:
- Add o1 (mini & preview), o3-mini, Gemini (2.0 Flash Exp & Thinking, Pro) to the Closed Source mix
- Add CodeGemma, CodeLlama (Phind 34b v1, v2 & Python v1, 70b Python & Instruct), StarCoder 2 15b & WizardCoder 15b v1

# Approach

The approach is to test models that rank well on the code-related leaderboards, using a prompt similar to the one below, and to compare the results on 2 coding problems: Python code for a simple problem (a loop that gradually approaches pi) and a harder problem ([calculate maximum subarray sum](https://en.wikipedia.org/wiki/Maximum_subarray_problem)).

## Prompt
> Please reimplement this Python code in C++ with the fastest possible implementation for `specify architecture or type of CPU and process`. Only respond with the C++ code. Do not explain your implementation. The only requirement is that the C++ code prints the same result and runs fast.

## Leaderboards

Best in Agentic Coding ([SWE Bench](https://arxiv.org/html/2410.06992v2)) on [Vellum LLM Leaderboard](https://www.vellum.ai/llm-leaderboard):
1. 73.3% - Claude 3.7 Sonnet (R)
2. 63.8% - Gemini 2.5 Pro
3. 62.3% - Claude 3.7 Sonnet
4. 61.0% - OpenAI o3-mini
5. 55.0% - GPT-4.1
6. 51.8% - Gemini 2.0 Flash
7. 49.2% - DeepSeek-R1
8. 49.0% - Claude 3.5 Sonnet
9. 48.9% - OpenAI o1
10. 40.6% - Claude 3.5 Haiku

...

15. 18.8% - Qwen2.5-VL-32B
16. 10.2% - Gemma 3 27b

[Coding Leaderboard on Scale](https://scale.com/leaderboard/coding) (Deprecated as of March 2025):
1. 1237 - o1-mini
2. 1137 - o3-mini
3. 1132 - GPT-4o (November 2024)
4. 1123 - o1-preview
5. 1111 - Gemini 2.0 Flash Experimental (December 2024)
6. 1109 - Gemini 2.0 Pro (December 2024)
7. 1108 - Gemini 2.0 Flash Thinking (January 2025)
8. 1100 - DeepSeek R1
9. 1083 - o1 (December 2024)
10. 1079 - Claude 3.5 Sonnet (June 2024)
11. 1045 - GPT-4o (August 2024)
12. 1036 - GPT-4o (May 2024)
13. 1034 - GPT-4 Turbo Preview
14. 1029 - Mistral Large 2
15. 1022 - Llama 3.1 405B Instruct
16. 1007 - Gemini 1.5 Pro (August 27, 2024)
17. 994 - Gemini 1.5 Pro (May 2024)
18. 992 - GPT-4 (November 2024)
19. 985 - Deepseek V3
20. 984 - Llama 3.2 90B Vision Instruct
21. 959 - Claude 3 Opus
22. 943 - Gemini 1.5 Flash
23. 891 - Gemini 1.5 Pro (April 2024)
24. 879 - Claude 3 Sonnet
25. 871 - Llama 3 70B Instruct
26. 811 - Mistral Large
27. 685 - Gemini 1.0 Pro
28. 598 - CodeLlama 34B Instruct

[BigCode Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard) to compare the performance of open-source models in code generation tasks.

The list below includes all models, not just base models, that were not externally tested and that have good Python and C++ scores as well as a good win rate, sorted by best C++ score:
1. 67.85 cpp / 87.20 python - CodeQwen1.5-7B-Chat
2. 59.59 cpp / 71.95 python - Phind-CodeLlama-34B-v2
3. 57.81 cpp / 65.85 python - Phind-CodeLlama-34B-v1
4. 55.34 cpp / 70.22 python - Phind-CodeLlama-34B-Python-v1
5. 49.69 cpp / 55.49 python - CodeLlama-70b-Python
6. 49.69 cpp / 52.44 python - CodeLlama-70b
7. 48.45 cpp / 75.60 python - CodeLlama-70b-Instruct
8. 48.35 cpp / 50.79 python - CodeQwen1.5-7B
9. 47.20 cpp / 70.73 python - WizardCoder-Python-34B-V1.0
10. 42.86 cpp / 62.19 python - WizardCoder-Python-13B-V1.0
11. 42.60 cpp / 52.74 python - CodeGemma-7B-it
12. 41.53 cpp / 50.79 python - CodeLlama-34b-Instruct
13. 41.44 cpp / 44.15 python - StarCoder2-15B
14. 41.42 cpp / 45.11 python - CodeLlama-34b
15. 40.34 cpp / 40.13 python - CodeGemma-7B
16. 39.09 cpp / 53.29 python - CodeLlama-34b-Python
17. 38.95 cpp / 58.12 python - WizardCoder-15B-V1.0
18. 36.36 cpp / 50.60 python - CodeLlama-13b-Instruct
19. 36.21 cpp / 42.89 python - CodeLlama-13b-Python
20. 35.81 cpp / 35.07 python - CodeLlama-13b
21. 33.63 cpp / 34.09 python - StarCoder2-7B
22. 29.03 cpp / 45.65 python - CodeLlama-7b-Instruct

# Dependencies

In [None]:
# imports

import os
import io
import sys
import json
import requests
from dotenv import load_dotenv
from openai import OpenAI
import google.generativeai
import anthropic
from IPython.display import Markdown, display, update_display
import gradio as gr
import subprocess

# Setup

In [None]:
# environment

load_dotenv(override=True)
# Default to an empty string: assigning None to os.environ raises a TypeError
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', '')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', '')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', '')

if os.environ['OPENAI_API_KEY']:
    print(f"OpenAI API Key exists and begins {os.environ['OPENAI_API_KEY'][:8]}")
else:
    print("OpenAI API Key not set")
    
if os.environ['ANTHROPIC_API_KEY']:
    print(f"Anthropic API Key exists and begins {os.environ['ANTHROPIC_API_KEY'][:7]}")
else:
    print("Anthropic API Key not set")

if os.environ['HF_TOKEN']:
    print(f"HuggingFace Token exists and begins {os.environ['HF_TOKEN'][:3]}")
else:
    print("HuggingFace Token not set")

In [None]:
# initialize

openai = OpenAI()
claude = anthropic.Anthropic()
OPENAI_MODEL = "gpt-4o"
CLAUDE_MODEL = "claude-3-5-sonnet-20240620"

# Functions and Prompts

## System prompt

Replace the hardware details below with the architecture the code will run on, such as AMD64 or a specific CPU and GPU.

In [None]:
system_message = "You are an assistant that reimplements Python code in high performance C++ for an AMD Ryzen 7 5800X 8-Core CPU with 3.80 GHz and Radeon RX 6800 XT GPU with 16GB GDDR6. "
system_message += "Respond only with C++ code; use comments sparingly and do not provide any explanation other than occasional comments. "
system_message += "The C++ response needs to produce an identical output in the fastest possible time. Keep implementations of random number generators identical so that results match exactly."

## User prompt

Note: hints were added to the prompt so that GPT-4o included the packages needed for the code to run, but its code still had an overflow error on the harder coding problem.

In [None]:
def user_prompt_for(python):
    user_prompt = "Rewrite this Python code in C++ with the fastest possible implementation that produces identical output in the least time. "
    user_prompt += "Respond only with C++ code; do not explain your work other than a few comments. "
    user_prompt += "Pay attention to number types to ensure no int overflows. Remember to #include all necessary C++ packages such as iomanip.\n\n"
    user_prompt += python
    return user_prompt

In [None]:
def messages_for(python):
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt_for(python)}
    ]

## Functions to generate code and output it

In [None]:
# write to a file called optimized.cpp

def write_output(cpp):
    code = cpp.replace("```cpp","").replace("```","")
    with open("optimized.cpp", "w") as f:
        f.write(code)
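
Since some models return explanation text around the code despite the prompt, a stricter alternative to removing the fence markers is to keep only the body of the first fenced block. This is a minimal sketch, not what the notebook does; `extract_cpp` is a hypothetical helper, and it falls back to the full reply when no fence is found.

```python
import re

FENCE = "`" * 3  # three backticks


def extract_cpp(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if there is none."""
    pattern = FENCE + r"(?:cpp|c\+\+)?[ \t]*\n(.*?)" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    return match.group(1) if match else reply


reply = (
    "Here is the code:\n"
    + FENCE + "cpp\n"
    + "int main() { return 0; }\n"
    + FENCE + "\n"
    + "This works because...\n"
)
print(extract_cpp(reply))
```

The result could then be written to `optimized.cpp` in place of the fence-stripped reply.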

In [None]:
def optimize_gpt(python):    
    stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)
    reply = ""
    for chunk in stream:
        fragment = chunk.choices[0].delta.content or ""
        reply += fragment
        print(fragment, end='', flush=True)
    write_output(reply)

In [None]:
def optimize_claude(python):
    result = claude.messages.stream(
        model=CLAUDE_MODEL,
        max_tokens=2000,
        system=system_message,
        messages=[{"role": "user", "content": user_prompt_for(python)}],
    )
    reply = ""
    with result as stream:
        for text in stream.text_stream:
            reply += text
            print(text, end="", flush=True)
    write_output(reply)

# Coding Problems

## Calculate Pi (easy) Python Code

Loop that gradually approaches pi.

In [None]:
pi = """
import time

def calculate(iterations, param1, param2):
    result = 1.0
    for i in range(1, iterations+1):
        j = i * param1 - param2
        result -= (1/j)
        j = i * param1 + param2
        result += (1/j)
    return result

start_time = time.time()
result = calculate(100_000_000, 4, 1) * 4
end_time = time.time()

print(f"Result: {result:.12f}")
print(f"Execution Time: {(end_time - start_time):.6f} seconds")
"""
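
With `param1=4` and `param2=1`, the loop above sums the series 1 - 1/3 + 1/5 - 1/7 + ..., which converges slowly to pi/4. The sketch below reruns the same function with far fewer iterations (an arbitrary choice for illustration) to show the approximation error.

```python
import math


def calculate(iterations, param1, param2):
    result = 1.0
    for i in range(1, iterations + 1):
        j = i * param1 - param2   # denominators 3, 7, 11, ...
        result -= (1 / j)
        j = i * param1 + param2   # denominators 5, 9, 13, ...
        result += (1 / j)
    return result


approx = calculate(10_000, 4, 1) * 4
print(f"approx = {approx:.6f}, error = {abs(approx - math.pi):.2e}")
```

The slow convergence is why the benchmark needs 100 million iterations, which makes it a convenient pure-arithmetic workload for timing.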

### Execute Python code to calculate pi

Note down the execution time of the Python script for baseline comparison.

**Important: don't share this notebook publicly, since it can be used to execute arbitrary code on your local machine. Always check that the code being executed isn't harmful before running it.**

In [None]:
exec(pi)

### Optimize code using GPT-4o

In [None]:
optimize_gpt(pi)

### Run C++ code generated by GPT-4o

Requires `clang++` to be installed on the machine. To run on Linux or Mac, change the compile and run commands to:

    !clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp
    !./optimized

Change `-march` to match your machine's architecture. To list the valid options for a specific target system, run:

In [None]:
# !clang++ -mcpu=help

In [None]:
!clang++ -O3 -std=c++17 -march=x86-64 -o optimized.exe optimized.cpp
!optimized.exe
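
The compile and run commands differ by platform, so one option is to build them in Python. The sketch below is an assumption-laden illustration: `compile_and_run_commands` is a hypothetical helper, and it only chooses between the two `-march` values that appear in this notebook.

```python
import platform


def compile_and_run_commands(source="optimized.cpp"):
    """Build clang++ compile and run commands for the current machine (a sketch)."""
    machine = platform.machine().lower()
    # Assumption: the two -march values used in this notebook are sufficient.
    march = "armv8.3-a" if machine in ("arm64", "aarch64") else "x86-64"
    if platform.system() == "Windows":
        binary, run_cmd = "optimized.exe", ["optimized.exe"]
    else:
        binary, run_cmd = "optimized", ["./optimized"]
    compile_cmd = ["clang++", "-O3", "-std=c++17", f"-march={march}", "-o", binary, source]
    return compile_cmd, run_cmd


compile_cmd, run_cmd = compile_and_run_commands()
print(" ".join(compile_cmd))
print(" ".join(run_cmd))
```

Both lists can be passed to `subprocess.run` as they are.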

### Optimize code using Claude 3.5 Sonnet

In [None]:
optimize_claude(pi)

### Run C++ code generated by Claude 3.5 Sonnet

In [None]:
!clang++ -O3 -std=c++17 -march=x86-64 -o optimized.exe optimized.cpp
!optimized.exe

## Calculate Maximum Subarray Sum (hard)

[Calculate maximum subarray sum problem](https://en.wikipedia.org/wiki/Maximum_subarray_problem): given an array containing a large number of random positive and negative numbers, find the largest sum of any possible subarray of consecutive numbers.

Uses a custom function to create a large number of pseudo-random numbers based on the [Linear Congruential Generator (LCG)](https://en.wikipedia.org/wiki/Linear_congruential_generator) algorithm instead of the coding language's built-in random number library.

Note: the comment at the top of the code tells the model to be careful to support large number sizes.
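
Number sizes matter here because Python integers never overflow, so the generator reduces modulo `2**32` explicitly; a C++ port has to reproduce that wraparound, which an unsigned 32-bit type does implicitly. The sketch below emulates the unsigned 32-bit behaviour with a bit mask and checks that both versions agree. `lcg_uint32` is an illustrative name, not part of the notebook.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    value = seed
    while True:
        value = (a * value + c) % m
        yield value


def lcg_uint32(seed, a=1664525, c=1013904223):
    """Same recurrence, emulating unsigned 32-bit wraparound with a mask."""
    value = seed
    while True:
        value = (a * value + c) & 0xFFFFFFFF
        yield value


reference, masked = lcg(42), lcg_uint32(42)
first_values = [next(reference) for _ in range(5)]
assert first_values == [next(masked) for _ in range(5)]

# Map raw values into [min_val, max_val] the same way the benchmark code does
min_val, max_val = -10, 10
print([v % (max_val - min_val + 1) + min_val for v in first_values])
```

A port that uses a signed type, or a wider type without the reduction, would diverge from this sequence once an intermediate product exceeds 32 bits.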

In [None]:
python_hard = """# Be careful to support large number sizes

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    value = seed
    while True:
        value = (a * value + c) % m
        yield value
        
def max_subarray_sum(n, seed, min_val, max_val):
    lcg_gen = lcg(seed)
    random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]
    max_sum = float('-inf')
    for i in range(n):
        current_sum = 0
        for j in range(i, n):
            current_sum += random_numbers[j]
            if current_sum > max_sum:
                max_sum = current_sum
    return max_sum

def total_max_subarray_sum(n, initial_seed, min_val, max_val):
    total_sum = 0
    lcg_gen = lcg(initial_seed)
    for _ in range(20):
        seed = next(lcg_gen)
        total_sum += max_subarray_sum(n, seed, min_val, max_val)
    return total_sum

# Parameters
n = 10000         # Number of random numbers
initial_seed = 42 # Initial seed for the LCG
min_val = -10     # Minimum value of random numbers
max_val = 10      # Maximum value of random numbers

# Timing the function
import time
start_time = time.time()
result = total_max_subarray_sum(n, initial_seed, min_val, max_val)
end_time = time.time()

print("Total Maximum Subarray Sum (20 runs):", result)
print("Execution Time: {:.6f} seconds".format(end_time - start_time))
"""
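
The nested loops in `max_subarray_sum` above take quadratic time. Kadane's algorithm computes the same value in one pass. The sketch below compares the two on a small hand-made list; it illustrates the technique and is not output from any of the models.

```python
def brute_force_max_subarray_sum(numbers):
    """Quadratic approach, mirroring the nested loops in the benchmark code."""
    max_sum = float("-inf")
    for i in range(len(numbers)):
        current_sum = 0
        for j in range(i, len(numbers)):
            current_sum += numbers[j]
            max_sum = max(max_sum, current_sum)
    return max_sum


def kadane_max_subarray_sum(numbers):
    """Single pass: track the best sum of a subarray ending at each position."""
    best = current = numbers[0]
    for x in numbers[1:]:
        current = max(x, current + x)  # extend the running subarray or restart at x
        best = max(best, current)
    return best


sample = [-2, 1, -3, 4, -1, 2, 1, -5, 4]
print(brute_force_max_subarray_sum(sample), kadane_max_subarray_sum(sample))
```

Both functions consider only non-empty subarrays, so they also agree when every number is negative.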

### Execute Python code to calculate max subarray sum

Note down execution time of python script for baseline comparison.

In [None]:
exec(python_hard)

### Optimize code using GPT-4o

In [None]:
optimize_gpt(python_hard)

### Run C++ code generated by GPT-4o

In [None]:
!clang++ -O3 -std=c++17 -march=x86-64 -o optimized.exe optimized.cpp
!optimized.exe

### Optimize code using Claude 3.5 Sonnet

In [None]:
optimize_claude(python_hard)

### Run C++ code generated by Claude 3.5 Sonnet

In [None]:
!clang++ -O3 -std=c++17 -march=x86-64 -o optimized.exe optimized.cpp
!optimized.exe

# UI for Code Conversion

## Functions to Stream Generated Code for the UI

In [None]:
def stream_gpt(python):    
    stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)
    reply = ""
    for chunk in stream:
        fragment = chunk.choices[0].delta.content or ""
        reply += fragment
        yield reply.replace('```cpp\n','').replace('```','')

In [None]:
def stream_claude(python):
    result = claude.messages.stream(
        model=CLAUDE_MODEL,
        max_tokens=2000,
        system=system_message,
        messages=[{"role": "user", "content": user_prompt_for(python)}],
    )
    reply = ""
    with result as stream:
        for text in stream.text_stream:
            reply += text
            yield reply.replace('```cpp\n','').replace('```','')

In [None]:
def optimize(python, model):
    if model=="GPT":
        result = stream_gpt(python)
    elif model=="Claude":
        result = stream_claude(python)
    else:
        raise ValueError("Unknown model")
    for stream_so_far in result:
        yield stream_so_far        

In [None]:
with gr.Blocks() as ui:
    with gr.Row():
        python = gr.Textbox(label="Python code:", lines=10, value=python_hard)
        cpp = gr.Textbox(label="C++ code:", lines=10)
    with gr.Row():
        model = gr.Dropdown(["GPT", "Claude"], label="Select model", value="GPT")
        convert = gr.Button("Convert code")

    convert.click(optimize, inputs=[python, model], outputs=[cpp])

ui.launch(inbrowser=True)

In [None]:
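def execute_python(code):
    output = io.StringIO()
    old_stdout = sys.stdout  # in a notebook this is not sys.__stdout__
    try:
        sys.stdout = output
        # Run in one fresh namespace so functions defined by the code can call each other
        exec(code, {})
    finally:
        sys.stdout = old_stdout
    return output.getvalue()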

In [None]:
def execute_cpp(code):
    write_output(code)
    compiler_cmd = ["clang++", "-O3", "-std=c++17", "-march=x86-64", "-o", "optimized.exe", "optimized.cpp"]
    try:
        compile_result = subprocess.run(compiler_cmd, check=True, text=True, capture_output=True)
        run_cmd = ["optimized.exe"]
        run_result = subprocess.run(run_cmd, check=True, text=True, capture_output=True)
        return run_result.stdout
    except subprocess.CalledProcessError as e:
        return f"An error occurred:\n{e.stderr}"
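
Since the goal is C++ that prints the same result as the Python code, the two outputs can be compared automatically once the timing line is ignored. This is a minimal sketch: `result_lines` and `same_result` are hypothetical helpers, and the sample strings are made up for illustration.

```python
def result_lines(output: str) -> list[str]:
    """Keep the lines that should match exactly, dropping the timing line."""
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("Execution Time")
    ]


def same_result(python_output: str, cpp_output: str) -> bool:
    """True when both programs printed the same non-timing lines."""
    return result_lines(python_output) == result_lines(cpp_output)


# Made-up outputs in the format printed by the sample programs
python_output = "Result: 3.14\nExecution Time: 13.2 seconds\n"
cpp_output = "Result: 3.14\nExecution Time: 0.2 seconds\n"
print(same_result(python_output, cpp_output))
```

The filter relies on both sample programs printing a line that starts with `Execution Time`.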

In [None]:
def select_sample_code(sample_code):
    if sample_code=="pi":
        return pi
    elif sample_code=="python_hard":
        return python_hard
    else:
        return "Type your Python program here"

## UI for Code Converter

Note: this UI includes CodeQwen, so first run the code in the section below so that its functions are available, and ensure the model is deployed to AWS, Azure or GCP and running before selecting that option.

In [None]:
css = """
.python {background-color: #306998;}
.cpp {background-color: #050;}
"""

In [None]:
with gr.Blocks(css=css) as ui:
    gr.Markdown("## Convert code from Python to C++")
    with gr.Row():
        python = gr.Textbox(label="Python code:", value=pi, lines=10)
        cpp = gr.Textbox(label="C++ code:", lines=10)
    with gr.Row():
        code = gr.Dropdown(["pi", "python_hard"], label="Select code", value="pi")
        sample = gr.Button("Populate sample code")
        model = gr.Dropdown(["GPT", "Claude", "CodeQwen"], label="Select model", value="GPT")
        convert = gr.Button("Convert code")
    with gr.Row():
        python_run = gr.Button("Run Python")
        cpp_run = gr.Button("Run C++")
    with gr.Row():
        python_out = gr.TextArea(label="Python result:", elem_classes=["python"])
        cpp_out = gr.TextArea(label="C++ result:", elem_classes=["cpp"])

    sample.click(select_sample_code, inputs=[code], outputs=[python])
    convert.click(optimize, inputs=[python, model], outputs=[cpp])
    python_run.click(execute_python, inputs=[python], outputs=[python_out])
    cpp_run.click(execute_cpp, inputs=[cpp], outputs=[cpp_out])

ui.launch(inbrowser=True)

# Open-Source Models Running in the Cloud

Use [Inference Endpoints](https://endpoints.huggingface.co/) to have HuggingFace run models for you, giving you an endpoint that you can call remotely from your code. **Remember to stop it!** This is a good alternative to Google Colab when you want to run the code locally but your machine is not powerful enough.

Steps:
1. Select the model to run from an endpoint, e.g. [CodeQwen1.5-7B-Chat](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat)
2. From the left toolbar select Deploy->HF Inference Endpoint
3. Select the cloud provider: Amazon Web Services, Microsoft Azure or Google Cloud Platform
4. Select runtime & region, e.g. CodeQwen1.5-7B-Chat needs at least a GPU box with 24GB of RAM such as `Nvidia L4`
5. Create the endpoint (requires a credit card on file with HuggingFace)

Note: it can take 5-10 minutes to initialize the endpoint when creating or unpausing it.

**Remember to stop it! Almost $1 per hour**

## Dependencies

In [None]:
from huggingface_hub import login, InferenceClient
from transformers import AutoTokenizer

## Setup

Try running `login(hf_token, add_to_git_credential=True)`; if there is an issue picking up the token, run `login()` instead and paste the token into the textbox displayed.

In [None]:
hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)
# login()

Replace Endpoint URL below

In [None]:
code_qwen = "Qwen/CodeQwen1.5-7B-Chat"
# code_gemma = "google/codegemma-7b-it"
CODE_QWEN_URL = "https://w1zm0p9xt7jg0c0n.us-east4.gcp.endpoints.huggingface.cloud"
# CODE_GEMMA_URL = "https://my-endpoint.region.cloud-provider.endpoints.huggingface.cloud"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(code_qwen)
messages = messages_for(pi)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [None]:
print(text)

## Optimize code using CodeQwen 1.5 Chat

In [None]:
client = InferenceClient(CODE_QWEN_URL, token=hf_token)
stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)
for r in stream:
    print(r.token.text, end = "")

### Function to Stream Qwen code generated

In [None]:
def stream_code_qwen(python):
    tokenizer = AutoTokenizer.from_pretrained(code_qwen)
    messages = messages_for(python)
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    client = InferenceClient(CODE_QWEN_URL, token=hf_token)
    stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)
    result = ""
    for r in stream:
        result += r.token.text
        yield result    

In [None]:
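# stream_code_qwen is a generator: consume it and print the final accumulated reply
print(list(stream_code_qwen(python_hard))[-1])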

In [None]:
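# stream_code_qwen is a generator: consume it and print the final accumulated reply
print(list(stream_code_qwen(pi))[-1])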

## Update function UI calls to support Qwen

In [None]:
def optimize(python, model):
    if model=="GPT":
        result = stream_gpt(python)
    elif model=="Claude":
        result = stream_claude(python)
    elif model=="CodeQwen":
        result = stream_code_qwen(python)
    else:
        raise ValueError("Unknown model")
    for stream_so_far in result:
        yield stream_so_far    