# Open-Source vs Commercial Model Comparison: LLaMA-2 and GPT-4 Classification Analysis

---

**Name:**  Miguel Angel Gonzalez Mandujano  

**Institution:** UABC 

**Date:** 10/24/25

---

## Project Overview

This project focuses on comparing open-source and commercial models through hands-on installation and testing. You will:

1. **Install Ollama and LLaMA-2** for local model deployment
2. **Design a classification task** to test both models systematically
3. **Compare LLaMA-2 and GPT-4** on the same task
4. **Analyze differences** in outputs and model capabilities



## A. Setup and Installation

### A.1 Ollama Installation
*Document your installation process*

**Installation Steps:**
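1. Create a new conda environment: `conda create -n mi_env_ollama python=3.11`
2. Activate it: `conda activate mi_env_ollama`
3. Install the Ollama Python package: `pip install ollama`
4. Install Ollama itself: `curl -fsSL https://ollama.com/install.sh | sh`
5. Start the server: `ollama serve`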

**LLaMA-2 Model Download:**
*Which LLaMA-2 model did you install? (e.g., llama2:7b, llama2:13b)*
`ollama pull llama2`, which resolved to `llama2:7b`

**Installation Challenges (if any):**
None

**Verification:**
*How did you verify the installation was successful?*

`ollama list` returned:

| NAME | ID | SIZE | MODIFIED |
|------|----|------|----------|
| llama2:latest | 78e26419b446 | 3.8 GB | 22 seconds ago |

Then `ollama run llama2` opened the interactive prompt: `>>> Send a message (/? for help)`
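
The same check can be scripted. The sketch below assumes Ollama is listening on its default local address and queries the `/api/tags` endpoint, which lists installed models; the helper names `parse_model_names` and `list_local_models` are hypothetical and not part of the original setup.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from a /api/tags style response body."""
    return [entry["name"] for entry in payload.get("models", [])]


def list_local_models(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> list[str]:
    """Return installed model names, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_model_names(json.load(response))
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, connection refused, or an unexpected response body.
        return []
```

Calling `list_local_models()` while `ollama serve` is running should return a list that includes `llama2:latest`; an empty list suggests the server is down or no model has been pulled.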

### A.2 GPT-4 Access Setup
*Document how you set up access to GPT-4 or alternative commercial model*

**Model Access Method:**
Web interface

**Model Version:**
Gemini 2.5 Flash

**Setup Notes:** Nothing

## B. Classification Task Design

### B.1 Task Selection

**Chosen Classification Task:**
*Examples: sentiment analysis, topic classification, spam detection, intent classification*
Classification of Reasoning Capability and Structured Output Quality for Computational Physics Problem

**Why This Task:**
*Explain why this classification task is good for comparing the models*
This task is well suited to comparing open-source models (LLaMA-2) against commercial models (Gemini 2.5 Flash/GPT-4) because it evaluates three complex, non-trivial skills simultaneously:

* Scientific Reasoning: The models must correctly understand the physics problem (e.g., fluid dynamics, numerical methods) and select the correct formula or algorithm.
* Strict Format Adherence (LaTeX): The model must generate a solution that compiles in $\LaTeX$. This tests precision beyond natural language.
* Code Generation (Python): The model must produce functional Python code to implement the numerical method. This reveals differences in code efficiency and bug-proneness between the models.

Failure to produce compilable $\LaTeX$ or functional code is the "misclassification" in this task, highlighting the core trade-off between the models.

**Classification Categories:**
*List the categories/classes for your task*
* Problem Complexity:
    *  Low (Elemental Formula/Simple Math)
    *  Medium (Standard Algorithm/Basic ODEs)
    *  High (Advanced Numerical Method/Complex PDE)

* Output Quality (Success Metric):

    * Acceptable (Correct Formula, Python Code is Functional)
    * Excellent (Correct Formula in Perfect LaTeX Syntax, Optimized Python Code, Well-Commented)
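
One way to apply the Output Quality categories consistently is to record a few manual checks per response and map them to a label. This is only a sketch: the field names are hypothetical, and the `"Fail"` label is an assumption added to represent the "misclassification" outcome described above, since the category list itself only names Acceptable and Excellent.

```python
from dataclasses import dataclass


@dataclass
class OutputChecks:
    """Manual checks recorded for one model response (hypothetical field names)."""
    formula_correct: bool
    latex_valid: bool
    code_functional: bool
    code_documented: bool


def grade_output(checks: OutputChecks) -> str:
    """Map the checks to a quality label; 'Fail' is an assumed third label."""
    if not (checks.formula_correct and checks.code_functional):
        return "Fail"
    if checks.latex_valid and checks.code_documented:
        return "Excellent"
    return "Acceptable"
```

For example, a response with a correct formula and working code but no comments would be graded `"Acceptable"` under this rubric.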

      
### B.2 Test Cases
The input prompt for each test case instructs the model to provide the formula in $\LaTeX$ and the solution code in Python.

**Test Case 1:**
- **Input:** "Generate the $\LaTeX$ formula for Newton's Second Law ($\mathbf{F} = m\mathbf{a}$) and write a Python function to calculate the force $F$ given the mass $m$ and acceleration $a$."
- **Expected Category:** Low / Excellent
- **Why This Tests the Models:** Baseline Test: This is a simple physics concept and code. Both models should score Excellent, providing a foundational check for correct $\LaTeX$ and Python syntax.

**Test Case 2:**
- **Input:** "Write the equation for Newton's Law of Cooling in $\LaTeX$. Then, generate a Python script that uses the Euler method to simulate the cooling process over 100 timesteps."
- **Expected Category:** Medium / Acceptable
- **Why This Tests the Models:** Basic Numerical Method: This requires implementing a simple numerical algorithm (Euler) in Python, which is a common task in computational physics. It tests the model's basic programming and integration logic.

**Test Case 3:**
- **Input:** "Generate the $\LaTeX$ formula for the Kinetic Energy of a particle. Crucially, ensure the $\LaTeX$ output uses \mathrm{m} for mass to distinguish it from a variable."
- **Expected Category:** Low / Excellent
- **Why This Tests the Models:** Strict Constraint Adherence: It tests the models' ability to follow a subtle but specific formatting constraint (\mathrm{m} in $\LaTeX$ for unit/variable distinction). Lower-parameter models (LLaMA-2 7B) often fail this.

**Test Case 4:**
- **Input:** "Write a Python program to calculate the definite integral of $f(x) = x^3 + 2x$ from $x=0$ to $x=4$ using the Trapezoidal Rule."
- **Expected Category:** High / Excellent
- **Why This Tests the Models:** Advanced Numerical Method: The Trapezoidal Rule is more complex than Euler's method and requires robust loop logic and accurate formula implementation in Python. This is a critical test of reasoning capability.

**Test Case 5:**
- **Input:** "Explain the Discrete Fourier Transform (DFT) and provide the mathematical formula in $\LaTeX$. Then, write a Python function that implements the DFT from scratch (no numpy or scipy)."
- **Expected Category:** High / Acceptable
- **Why This Tests the Models:** Abstract Concept and Code from Scratch: Tests the comprehension of a complex concept (DFT) and the ability to generate a functional mathematical implementation without relying on standard scientific libraries.


## C. Model Comparison Experiments


In [None]:
import requests
import json
import os
import time  # Used to time each request

# --- 1. OLLAMA-ONLY CONFIGURATION ---
# The Ollama server must be running (e.g. start it with: ollama serve)
OLLAMA_API_BASE = 'http://localhost:11434/api/generate'
LLAMA_MODEL = "llama2"  # Change to another installed tag if needed, e.g. llama2:13b

# LLaMA-2 Data
llama_results = {}

# Prompt prepended to every test case
SYSTEM_PROMPT = (
    "You are an expert in Computational Physics and scientific documentation. "
    "Your task is to solve the problem provided and structure the complete output "
    "in two clear sections: first, the necessary mathematical formula in a single $\\LaTeX$ block, "
    "and second, the complete solution code in a Python Markdown block. "
    "Do not include any extra explanatory text, only the two blocks."
)

# Test cases
test_cases = [
    {
        "id": 1, 
        "prompt": "Generate the $\\LaTeX$ formula for **Newton's Second Law** (F=ma) and write a Python function to calculate the force $F$ given the mass $m$ and acceleration $a$."
    },
    {
        "id": 2, 
        "prompt": "Write the equation for **Newton's Law of Cooling** in $\\LaTeX$. Then, generate a Python script that uses the **Euler method** to simulate the cooling process over 100 timesteps."
    },
    {
        "id": 3, 
        "prompt": "Generate the $\\LaTeX$ formula for the **Kinetic Energy** of a particle. **Crucially, ensure the $\\LaTeX$ output uses `\\mathrm{m}` for mass to distinguish it from a variable.**"
    },
    {
        "id": 4, 
        "prompt": "Write a Python program to calculate the **definite integral** of $f(x) = x^3 + 2x$ from $x=0$ to $x=4$ using the **Trapezoidal Rule**."
    },
    {
        "id": 5, 
        "prompt": "Explain the **Discrete Fourier Transform (DFT)** and provide the mathematical formula in $\\LaTeX$. Then, write a Python function that implements the DFT from scratch (no `numpy` or `scipy`)."
    },
]

# --- 2. LLaMA-2 INFERENCE FUNCTION ---

def run_llama2(prompt: str) -> str:
    """Run inference with LLaMA-2 through the local Ollama API."""
    try:
        data = {
            'model': LLAMA_MODEL,
            'prompt': SYSTEM_PROMPT + "\n\n" + prompt,
            'stream': False,
            'options': {
                'temperature': 0.1, 
            },
        }
        # Note: the Ollama /api/generate endpoint expects a POST request
        response = requests.post(
            OLLAMA_API_BASE,
            json=data
        )
        response.raise_for_status()

        # With 'stream': False, the full model output is in the 'response' field
        return response.json()['response']
    except requests.exceptions.ConnectionError:
        return f"ERROR: Could not connect to Ollama server at {OLLAMA_API_BASE}. Please ensure 'ollama serve' is running in your terminal."
    except Exception as e:
        return f"ERROR running LLaMA-2: {e}"


# --- 3. RUNNING THE EXPERIMENT ---

print(f"Starting LLaMA-2 inference using model: {LLAMA_MODEL}...")

for case in test_cases:
    print(f"--- Running Test Case {case['id']} ---")
    
    start_time = time.time()
    # Run LLaMA-2
    llama_output = run_llama2(case['prompt'])
    end_time = time.time()
    
    # Store the result and the elapsed time (optional, but useful)
    llama_results[case['id']] = {
        "input": case['prompt'],
        "llama2_output": llama_output,
        "time_s": round(end_time - start_time, 2)
    }
    
    # Build the snippet in a temporary variable first: before Python 3.12,
    # a backslash such as '\n' is not allowed inside an f-string expression
    # and raises a SyntaxError.
    snippet = llama_output[:100].replace('\n', ' ')
    print(f"Input: {case['prompt'][:50]}...")
    print(f"LLaMA-2 Output Snippet (First 100 chars): {snippet}...")
    print(f"Time: {llama_results[case['id']]['time_s']} seconds")
    print("-" * 30)

print("\n✅ LLaMA-2 experiment finished. Results stored in 'llama_results' dictionary.")
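
Because each run takes 20 to 60 seconds per prompt, it may be worth persisting `llama_results` so the outputs can be reviewed later without rerunning inference. A minimal sketch follows, assuming the dictionary has the shape built in the loop above; `save_results`, `load_results`, and the file name are hypothetical.

```python
import json
from pathlib import Path


def save_results(results: dict, path: str) -> None:
    """Write the results dictionary to a UTF-8 JSON file."""
    Path(path).write_text(
        json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def load_results(path: str) -> dict:
    """Read results back; JSON object keys are strings, so restore integer case ids."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {int(case_id): entry for case_id, entry in raw.items()}
```

A call such as `save_results(llama_results, "llama_results.json")` after the loop would write one entry per test case, and `load_results("llama_results.json")` would restore the same dictionary in a later session.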



In [16]:
# You can inspect a single result, for example Case 1:
print(llama_results[1]['llama2_output'])


**Newton's Second Law (F=ma)**
```
\begin{align}
F = ma
\end{align}
```

**Python Function to Calculate Force**
```python
def calculate_force(m, a):
    return m * a
```


### C.1 Experimental Results
*Document the outputs from both models for each test case*

**Test Case 1:**
- **Input:** "Generate the $\\LaTeX$ formula for **Newton's Second Law** (F=ma) and write a Python function to calculate the force $F$ given the mass $m$ and acceleration $a$."

- **LLaMA-2 Output:** 22.67 s

**Newton's Second Law (F=ma)**
```
\begin{align}
F = ma
\end{align}
```

**Python Function to Calculate Force**
```python
def calculate_force(m, a):
    return m * a
```

- **GEMINI Output:**
$$F = ma$$

def calculate_force(mass: float, acceleration: float) -> float:
    """
    Calculates the force (F) using Newton's Second Law: F = ma.

    Args:
        mass: The mass (m) of the object (e.g., in kg).
        acceleration: The acceleration (a) of the object (e.g., in m/s^2).

    Returns:
        The calculated force (F) (e.g., in N).
    """
    force = mass * acceleration
    return force



- **Accuracy:** LLaMA-2: ✓ Clean output | Gemini: ✓ Correct, but does not follow the requested output format

**Test Case 2:**
- **Input:** "Write the equation for **Newton's Law of Cooling** in $\\LaTeX$. Then, generate a Python script that uses the **Euler method** to simulate the cooling process over 100 timesteps."

- **LLaMA-2 Output:** 62.78 s

\section{$\LaTeX$ Equation: Newton's Law of Cooling$}

$$ \frac{\partial T}{\partial t} = -k \frac{\partial^2 T}{\partial x^2} $$

where $T$ is the temperature of the body, $k$ is the thermal conductivity of the surrounding medium, and $x$ is the position within the body.

\section{Python Script: Euler Method Simulation of Newton's Law of Cooling}
```markdown
# Import necessary libraries
import numpy as np

# Define the initial temperature distribution
T0 = np.array([100]) * 10 # Initialize temperature at $t=0$

# Define the thermal conductivity of the surrounding medium
k = 5 # Value for illustration purposes only

# Define the time step and total simulation time
dt = 0.01
t_end = 100 # Total simulation time in seconds

# Initialize the time array
t = np.arange(0, t_end, dt)

# Simulate the cooling process using the Euler method
for i in range(len(t)):
    T = T0 - k * np.gradient(T, t[i]) ** 2
    print(f"Temperature at time step {t[i]:.2f}: {T}")
```

- **GEMINI Output:**
$$\frac{dT}{dt} = -k (T - T_a)$$

import numpy as np
import matplotlib.pyplot as plt

def euler_method_cooling(T0: float, Ta: float, k: float, total_time: float, num_steps: int):
    """
    Simulates Newton's Law of Cooling using the Euler method.

    The differential equation is dT/dt = -k(T - Ta).

    Args:
        T0: Initial temperature of the object (°C).
        Ta: Ambient (surrounding) temperature (°C).
        k: Cooling constant (rate).
        total_time: Total duration of the simulation (units of time).
        num_steps: The number of discrete time steps to use in the simulation.

    Returns:
        A tuple containing two lists: (time_points, temperatures).
    """
    # Calculate the time step (Delta t)
    dt = total_time / num_steps

    # Initialize arrays for time and temperature
    time_points = np.linspace(0, total_time, num_steps + 1)
    temperatures = np.zeros(num_steps + 1)

    # Set initial condition
    temperatures[0] = T0

    # Perform the Euler simulation
    for i in range(num_steps):
        T_current = temperatures[i]
        
        # Calculate the rate of change (f(T) = -k(T - Ta))
        dTdt = -k * (T_current - Ta)
        
        # Euler step: T_next = T_current + (dT/dt) * dt
        temperatures[i+1] = T_current + dTdt * dt

    return time_points, temperatures

# --- Simulation Parameters ---
INITIAL_TEMP = 100.0  # T0, initial temperature (°C)
AMBIENT_TEMP = 20.0   # Ta, ambient temperature (°C)
COOLING_CONSTANT = 0.15 # k, cooling rate constant
TOTAL_TIME = 50.0     # Total duration of the simulation
NUM_STEPS = 100       # Number of discrete steps for the Euler method

# --- Run Simulation ---
time, temp = euler_method_cooling(INITIAL_TEMP, AMBIENT_TEMP, COOLING_CONSTANT, TOTAL_TIME, NUM_STEPS)

# --- Plot Results ---
plt.figure(figsize=(10, 6))
plt.plot(time, temp, label=f'Euler Method (Steps: {NUM_STEPS})', color='royalblue')
plt.axhline(AMBIENT_TEMP, color='gray', linestyle='--', label='Ambient Temperature ($T_a$)')
plt.title("Simulation of Newton's Law of Cooling (Euler Method)")
plt.xlabel("Time")
plt.ylabel("Temperature (°C)")
plt.grid(True, linestyle=':', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

# Print final temperature for verification
print(f"Initial Temperature: {INITIAL_TEMP}°C")
print(f"Ambient Temperature: {AMBIENT_TEMP}°C")
print(f"Final Temperature after {TOTAL_TIME} units of time: {temp[-1]:.2f}°C")


- **Accuracy:** LLaMA-2: ✗ Wrong equation | Gemini: ✓ Good overall, but the code is not in a Markdown block
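
A quick way to judge any Euler implementation of Newton's Law of Cooling is to compare it with the closed-form solution $T(t) = T_a + (T_0 - T_a)e^{-kt}$. The sketch below is an independent check in pure Python, not either model's code; the function names are hypothetical and the parameter values simply mirror the ones in the Gemini output above.

```python
import math


def euler_cooling(t0: float, t_ambient: float, k: float, dt: float, steps: int) -> list[float]:
    """Integrate dT/dt = -k (T - T_a) with the explicit Euler method."""
    temps = [t0]
    for _ in range(steps):
        current = temps[-1]
        temps.append(current + dt * (-k * (current - t_ambient)))
    return temps


def exact_cooling(t0: float, t_ambient: float, k: float, t: float) -> float:
    """Closed-form solution T(t) = T_a + (T_0 - T_a) * exp(-k t)."""
    return t_ambient + (t0 - t_ambient) * math.exp(-k * t)


# Illustrative parameters: 100 steps of dt = 0.5 cover a total time of 50.
temps = euler_cooling(t0=100.0, t_ambient=20.0, k=0.15, dt=0.5, steps=100)
exact = exact_cooling(100.0, 20.0, 0.15, t=50.0)
error = abs(temps[-1] - exact)
print(f"Euler: {temps[-1]:.4f}  exact: {exact:.4f}  |error|: {error:.4f}")
```

With these parameters the Euler result should stay within a few hundredths of a degree of the exact value. An implementation built on the wrong equation, such as the diffusion-style PDE in the LLaMA-2 output, would not pass this comparison.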

**Test Case 3:**
- **Input:** "Generate the $\\LaTeX$ formula for the **Kinetic Energy** of a particle. **Crucially, ensure the $\\LaTeX$ output uses `\mathrm{m}` for mass to distinguish it from a variable.**"

- **LLaMA-2 Output:** 27.9 s

\section{$\LaTeX$ Formula}

$$\begin{aligned}
\frac{1}{2} \mathbf{m} \cdot \mathbf{v} &= \kineticenergy \\
\end{aligned}$$

where $\mathbf{m}$ is the mass of the particle and $\mathbf{v}$ is its velocity.

\section{Python Solution}

```markdown
kinetic_energy = 0.5 * mass * velocity
```

Note: In Python, you can use the `math` module to include mathematical expressions in Markdown code blocks.
- **GEMINI Output:**
$$E_{\mathrm{k}} = \frac{1}{2} \mathrm{m} v^2$$

def calculate_kinetic_energy(mass: float, velocity: float) -> float:
    """
    Calculates the kinetic energy (Ek) of a particle using the formula:
    Ek = 0.5 * m * v^2

    Args:
        mass: The mass (m) of the particle (e.g., in kg).
        velocity: The velocity (v) of the particle (e.g., in m/s).

    Returns:
        The calculated kinetic energy (Ek) (e.g., in Joules).
    """
    if mass < 0:
        # Kinetic energy is undefined for negative mass in classical physics
        raise ValueError("Mass must be non-negative.")

    kinetic_energy = 0.5 * mass * (velocity ** 2)
    return kinetic_energy

# Example Usage:
m = 10.0  # kg
v = 5.0   # m/s

try:
    energy = calculate_kinetic_energy(m, v)
    print(f"Mass (m): {m} kg")
    print(f"Velocity (v): {v} m/s")
    print(f"Kinetic Energy (Ek): {energy:.2f} Joules")
except ValueError as e:
    print(f"Error: {e}")

- **Accuracy:** LLaMA-2: ✗ Mass typeset as a vector and no Python block | Gemini: ✓ Good answers, but no Markdown code block
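
The formatting constraint in this test case can also be checked mechanically. The sketch below uses two hypothetical helpers: one looks for the literal `\mathrm{m}` token, the other for a squared velocity. It is a rough string-level check under those assumptions, not a LaTeX parser.

```python
import re


def uses_upright_mass(latex: str) -> bool:
    """True if the mass symbol appears with the upright mathrm command."""
    return r"\mathrm{m}" in latex


def has_squared_velocity(latex: str) -> bool:
    """True if a velocity symbol is followed by an exponent of 2 (v^2 or v^{2})."""
    return re.search(r"v\}?\s*\^\s*\{?2", latex) is not None


# Formulas copied from the two outputs above.
llama_formula = r"\frac{1}{2} \mathbf{m} \cdot \mathbf{v} &= \kineticenergy"
gemini_formula = r"E_{\mathrm{k}} = \frac{1}{2} \mathrm{m} v^2"

for name, formula in [("LLaMA-2", llama_formula), ("Gemini", gemini_formula)]:
    print(name, uses_upright_mass(formula), has_squared_velocity(formula))
```

Applied to the outputs above, the LLaMA-2 formula fails both checks (it uses `\mathbf{m}` and omits the square on the velocity), while the Gemini formula passes both.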

**Test Case 4:**
- **Input:** "Write a Python program to calculate the **definite integral** of $f(x) = x^3 + 2x$ from $x=0$ to $x=4$ using the **Trapezoidal Rule**."

- **LLaMA-2 Output:** 57.91 s

Mathematical Formula:
```latex
\int_{0}^{4} (x^{3} + 2x) dx = \frac{1}{5} \left( x^{4} - 4x^{3} + 8x^{2} - 12x + 8 \right)
```

Solution Code:
```python
import math

def trapezoidal_rule(f, a, b):
    n = (b-a)/(2*math.pi)
    integral = 0
    for i in range(n+1):
        x = a + (i*math.pi*(b-a))/n
        integral += f(x)*math.sin(x)
    return integral

def main():
    result = trapezoidal_rule(lambda x: x**3 + 2*x, 0, 4)
    print(result)

main()
```
Note: The `trapezoidal_rule` function takes in the function to integrate (`f`), the lower limit of integration (`a`), and the upper limit of integration (`b`). It uses the Trapezoidal Rule to approximate the definite integral. The `lambda` function is used to define the function `f` inline.

- **GEMINI Output:**
I see you're looking for a computational approach to integration! I will generate the required $\LaTeX$ formula for the Trapezoidal Rule, followed by the complete Python script to calculate the definite integral.$$\int_{a}^{b} f(x) \, dx \approx \frac{h}{2} \left[ f(x_0) + 2 \sum_{i=1}^{n-1} f(x_i) + f(x_n) \right]$$

import numpy as np

def function_to_integrate(x: float) -> float:
    """
    The function to be integrated: f(x) = x^3 + 2x.
    """
    return x**3 + 2 * x

def trapezoidal_rule(f, a: float, b: float, n: int) -> float:
    """
    Calculates the definite integral of a function f(x) from a to b
    using the Composite Trapezoidal Rule.

    Args:
        f: The function to integrate (e.g., function_to_integrate).
        a: The lower limit of integration.
        b: The upper limit of integration.
        n: The number of subintervals (trapezoids). Higher n gives better accuracy.

    Returns:
        The approximate value of the definite integral.
    """
    if n <= 0:
        raise ValueError("The number of subintervals (n) must be positive.")

    # Calculate the width of each subinterval (h)
    h = (b - a) / n

    # Generate the x-points (endpoints of the subintervals)
    # x = [x0, x1, x2, ..., xn]
    x_points = np.linspace(a, b, n + 1)

    # Calculate the function values at these points
    y_points = f(x_points)

    # Apply the Trapezoidal Rule formula:
    # Integral approx (h/2) * [y0 + 2*(y1 + y2 + ... + y_{n-1}) + yn]
    
    # 1. Sum of the first (y0) and last (yn) terms
    sum_endpoints = y_points[0] + y_points[-1]

    # 2. Sum of all intermediate terms (multiplied by 2)
    # Slicing [1:-1] gets elements from y1 up to y_{n-1}
    sum_intermediates = 2 * np.sum(y_points[1:-1])

    # 3. Final calculation
    integral_approx = (h / 2) * (sum_endpoints + sum_intermediates)

    return integral_approx

# --- Problem Parameters ---
A = 0.0          # Lower limit (a)
B = 4.0          # Upper limit (b)
N_STEPS = 1000   # Number of subintervals (n) for good accuracy

# --- Calculation and Output ---
integral_result = trapezoidal_rule(function_to_integrate, A, B, N_STEPS)

# Analytic solution for verification:
# Integral of (x^3 + 2x) is (x^4/4 + x^2)
# F(4) - F(0) = ((4^4/4) + 4^2) - ((0^4/4) + 0^2)
#             = (64 + 16) - 0 = 80.0
analytic_solution = 80.0

print(f"--- Definite Integral Calculation ---")
print(f"Function: f(x) = x^3 + 2x")
print(f"Integration Limits: a={A}, b={B}")
print(f"Number of Trapezoids (n): {N_STEPS}")
print("-" * 35)
print(f"Approximate Integral Result: {integral_result:.8f}")
print(f"Analytical Solution: {analytic_solution:.8f}")
print(f"Absolute Error: {abs(integral_result - analytic_solution):.10e}")



- **Accuracy:** LLaMA-2: ✗ Wrong code and altered formula | Gemini: ✓ Good overall, but no Markdown code block
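
For reference, the composite Trapezoidal Rule needs only a few lines of pure Python, and the result can be compared with the analytic value of 80. The sketch below is an independent baseline for judging both outputs, not a reproduction of either model's code. Note that the LLaMA-2 version would not even run: it passes a float to `range`, which raises a `TypeError`.

```python
def trapezoidal_rule(f, a: float, b: float, n: int) -> float:
    """Composite trapezoidal rule with n equal subintervals (pure Python)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    h = (b - a) / n
    # Interior points x_1 .. x_{n-1} carry full weight; the endpoints carry half.
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * (f(a) + f(b)) + interior)


def f(x: float) -> float:
    """Integrand from Test Case 4."""
    return x**3 + 2 * x


approx = trapezoidal_rule(f, 0.0, 4.0, n=1000)
exact = 4**4 / 4 + 4**2  # antiderivative x^4/4 + x^2 evaluated at 4 (it vanishes at 0)
print(f"approx = {approx:.6f}, exact = {exact:.1f}, |error| = {abs(approx - exact):.2e}")
```

With `n=1000` the approximation should agree with 80 to roughly four decimal places, and the error shrinks quadratically as `n` grows.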

**Test Case 5:**
- **Input:** "Explain the **Discrete Fourier Transform (DFT)** and provide the mathematical formula in $\\LaTeX$. Then, write a Python function that implements the DFT from scratch (no `numpy` or `scipy`)."

- **LLaMA-2 Output:** 47.73 s

Mathematical Formula:
```latex
\documentclass{article}
\begin{document}
\begin{align}
\text{Discrete Fourier Transform (DFT)} &= \sum_{k=0}^{N-1} x[k]e^{-i2\pi k/N} \tag{1} \label{eq:dft}
\end{align}
\end{document}
```

Python Function:
```python
def dft(x):
    """
    Discrete Fourier Transform (DFT) implementation from scratch.
    
    Parameters:
        x (array-like) - Input signal to be transformed.
    
    Returns:
        X (array-like) - Transformed signal.
    """
    N = len(x)
    X = np.zeros(N)
    for k in range(N):
        X[k] = np.exp(-2j * np.pi * k * np.arange(N) / N)
    return X
```
- **GEMINI Output:**
$$\begin{align}X_k &= \sum_{n=0}^{N-1} x_n \cdot e^{-i 2\pi k n / N} \\ &= \sum_{n=0}^{N-1} x_n \left[ \cos\left(\frac{2\pi k n}{N}\right) - i \sin\left(\frac{2\pi k n}{N}\right) \right]\end{align}$$

import cmath

def discrete_fourier_transform(x):
    """
    Computes the Discrete Fourier Transform (DFT) of a sequence x.

    The DFT transforms a sequence of N complex numbers, x_n, into a sequence 
    of N complex numbers, X_k, which represent the frequency components.

    Args:
        x (list of numbers): The input sequence (time-domain signal). 
                             Can contain integers or floats, which are 
                             treated as the real part of complex numbers.

    Returns:
        list of complex numbers: The output sequence (frequency-domain signal), X_k.
    """
    N = len(x)
    X = []

    # Iterate through each frequency component k (from 0 to N-1)
    for k in range(N):
        sum_of_exponentials = 0.0
        # Iterate through each time-domain sample n (from 0 to N-1)
        for n in range(N):
            # Calculate the argument of the exponential: -i * 2 * pi * k * n / N
            exponent_arg = -2j * cmath.pi * k * n / N
            # Calculate the exponential factor (twiddle factor)
            twiddle_factor = cmath.exp(exponent_arg)
            # Add the product to the sum
            sum_of_exponentials += x[n] * twiddle_factor
        
        X.append(sum_of_exponentials)
    
    return X

# --- Example Usage ---
if __name__ == '__main__':
    # A simple test signal (e.g., two cycles of a sine wave sampled at 8 points)
    # The expected DFT for a pure sine wave is peaks at k=1 and k=7 (N-1)
    # N=8
    input_signal = [0.0, 0.707, 1.0, 0.707, 0.0, -0.707, -1.0, -0.707]
    
    # Calculate the DFT
    dft_result = discrete_fourier_transform(input_signal)
    
    # Print results (optional, for verification)
    print(f"Input Signal (x_n, N={len(input_signal)}):")
    print([f"{val:.3f}" for val in input_signal])
    print("\nDFT Output (X_k, magnitude and phase):")
    for k, val in enumerate(dft_result):
        # Calculate magnitude and phase for better inspection
        magnitude = abs(val)
        phase = cmath.phase(val)
        print(f"  k={k}: Magnitude={magnitude:.3f}, Phase={phase:.3f} radians, Value={val}")
        
    # Check if the result is close to zero for non-signal frequencies (k=0, 2, 3, 4, 5, 6)
    # and has energy at the signal frequencies (k=1, k=N-1=7)
- **Accuracy:** LLaMA-2: ✗ Wrong formula with incomplete indices | Gemini: ✓ Clean output, but no Markdown code block
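
A candidate DFT function can be sanity-checked against inputs whose transforms are known exactly: a unit impulse has a flat spectrum, and a constant signal has energy only at $k = 0$. The sketch below assumes the candidate takes a list and returns a list of complex numbers; `reference_dft`, `close`, and `passes_basic_dft_checks` are hypothetical helper names.

```python
import cmath


def reference_dft(x: list[float]) -> list[complex]:
    """Direct O(N^2) DFT: X_k = sum_n x_n * exp(-2*pi*i*k*n/N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))
        for k in range(n_samples)
    ]


def close(a: list[complex], b: list[complex], tol: float = 1e-9) -> bool:
    """Element-wise comparison within an absolute tolerance."""
    return len(a) == len(b) and all(abs(p - q) <= tol for p, q in zip(a, b))


def passes_basic_dft_checks(dft) -> bool:
    """Check a candidate DFT against two inputs whose transforms are known exactly."""
    impulse_ok = close(dft([1, 0, 0, 0]), [1, 1, 1, 1])   # impulse -> flat spectrum
    constant_ok = close(dft([1, 1, 1, 1]), [4, 0, 0, 0])  # constant -> only k = 0
    return impulse_ok and constant_ok
```

The LLaMA-2 function above would fail before reaching these checks, since it calls `np` without importing it and never multiplies by the input samples.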



## D. Analysis and Insights

### D.1 Performance Comparison

**Overall Accuracy:**
- **LLaMA-2:** 1/5 correct. Most answers were wrong, but it stuck to the requested format.
- **Gemini:** 5/5 correct, but it did not produce the requested format.
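
The scores can be tallied directly from the per-case verdicts. This short sketch copies the ✓/✗ marks from the Accuracy lines in Section C.1; the variable and function names are hypothetical.

```python
# Verdicts copied from the Accuracy lines in Section C.1 (True = correct).
verdicts = {
    "LLaMA-2": [True, False, False, False, False],
    "Gemini": [True, True, True, True, True],
}


def accuracy(results: list[bool]) -> float:
    """Fraction of test cases marked correct."""
    return sum(results) / len(results)


for model, results in verdicts.items():
    print(f"{model}: {sum(results)}/{len(results)} correct ({accuracy(results):.0%})")
```

This prints 1/5 (20%) for LLaMA-2 and 5/5 (100%) for Gemini.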

**Which model performed better overall?**
Gemini performed far better overall, reaching 100% accuracy (5/5) in scientific reasoning and algorithmic implementation. LLaMA-2 failed four of the five cases, including the three algorithmic tasks (Euler, Trapezoidal, DFT), which makes it unreliable for intermediate to complex Computational Physics problems.

**Key Differences in Outputs:**
1. Scientific Reliability: Gemini consistently produced correct and professional-grade solutions (e.g., verifiable Trapezoidal Rule code), while LLaMA-2 frequently provided scientifically incorrect formulas or failed algorithms (e.g., confusing the Law of Cooling with the Diffusion Equation).

2. Format Adherence (The Paradox): LLaMA-2 was highly obedient to the Markdown structure (```python...```), making its output cleaner for automated parsing. Gemini, despite its high intelligence, repeatedly failed to use the Markdown delimiters, making its code less suitable for direct copy-pasting into an automated pipeline.

3. Code Quality & Completeness: Gemini's Python code included robust features like type hinting (: float), error handling (ValueError), and verifiability (analytical solution checks), demonstrating a superior understanding of production-ready code.
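
Format adherence can be measured rather than judged by eye. The sketch below extracts fenced code blocks carrying a given language tag, which is the step an automated pipeline would rely on. The regular expression and the helper name `extract_code_blocks` are assumptions for illustration, and the fence string is built programmatically so the example itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically

# Opening fence with an optional language tag, the body, then a closing fence.
FENCED_BLOCK = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(text: str, language: str = "python") -> list[str]:
    """Return the bodies of all fenced blocks tagged with the given language."""
    return [
        body.strip()
        for tag, body in FENCED_BLOCK.findall(text)
        if tag.lower() == language
    ]


fenced = "Intro\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE + "\n"
unfenced = "def f():\n    return 1\n"
print(len(extract_code_blocks(fenced)), len(extract_code_blocks(unfenced)))
```

A strict check like this would also flag the LLaMA-2 outputs for Test Cases 2 and 3, which wrap their code in a fence tagged `markdown` rather than `python`.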

### D.2 Model Characteristics

**LLaMA-2 Strengths:**
- Format Obedience: Very reliable at strictly adhering to the requested Markdown structure and separators, which is key for simple automation scripts.
- Zero API Cost: Inference runs locally, incurring zero per-token charges.
- Data Privacy: All data processing occurs on-premise, ensuring maximum privacy and compliance for sensitive data.

**LLaMA-2 Weaknesses:**
- Reliability: Highly unreliable for complex tasks requiring advanced mathematical reasoning (e.g., numerical integration, ODEs).

- Performance: Significantly slower per prompt (roughly 20 to 60 seconds on CPU in these runs) than a cloud API model.

- Hallucination/Errors: Prone to mixing concepts (Case 2) or failing to follow constraints (Cases 3 and 5).
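
The latency figures can be summarized from the response times reported in Section C.1. This is a minimal sketch; the list name is hypothetical and the values are the five LLaMA-2 timings recorded above.

```python
from statistics import mean

# Response times in seconds, as reported for Test Cases 1-5 in Section C.1.
llama2_times_s = [22.67, 62.78, 27.9, 57.91, 47.73]

print(f"min  = {min(llama2_times_s):.2f} s")
print(f"max  = {max(llama2_times_s):.2f} s")
print(f"mean = {mean(llama2_times_s):.2f} s")
```

The mean comes out at about 43.8 seconds per prompt, with the two numerical-method cases (2 and 4) taking the longest.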

**GEMINI Strengths:**
- Scientific Accuracy: Excellent reasoning capability, solving all complex computational physics problems correctly.

- Constraint Adherence (Subtle): Able to follow abstract instructions and negative constraints (e.g., generating DFT "from scratch" without numpy).

- Code Quality: Generates production-grade code (type hints, docstrings, verification logic).

**GEMINI Weaknesses:**
- Format Disobedience: Failed systematically to enclose Python code in the requested Markdown delimiters (```python...```), making the output messy for automated parsing.

- API Cost: Using the commercial API incurs per-token costs (avoided here via manual web testing).

- Lack of Control: The user has no control over the model version, optimization, or the underlying infrastructure.

### D.3 Open-Source vs Commercial Trade-offs

**Advantages of LLaMA-2 (Open-Source):**
- There are zero costs per token or per API call. After the initial investment in hardware (GPUs), the operational cost is limited to power and maintenance, making long-term costs more predictable.
- The weights of the model are available, allowing for deep customization and fine-tuning on proprietary data. This creates a domain-specific model optimized for internal terminology.
- The model can be run in environments with no internet connectivity (e.g., restricted networks, air-gapped systems), enabling operations even when cloud access is impossible.
- All data remains on the organization's premises (on-premise), ensuring compliance with strict data protection regulations (GDPR, HIPAA) and eliminating the risk of data leakage to third-party cloud providers.

**Advantages of GEMINI (Commercial):**
- Requires zero infrastructure setup or maintenance for inference. Developers only need an API key to access state-of-the-art models instantly, simplifying the MLOps pipeline.
- The cloud provider handles all scaling. An organization can handle one request per minute or a million, scaling seamlessly without managing GPU clusters.
- Provides significantly higher accuracy and reasoning for complex, high-stakes tasks (as shown by its 5/5 score on scientific problems), leading to lower error rates in production.
- Commercial models are often trained on the latest, largest datasets, giving them a broader and more current knowledge base than older open-source models like LLaMA-2.

**Practical Considerations:**

*When would you choose LLaMA-2 over Gemini?*
- For simple automation tasks (e.g., formatting data, classifying brief text, or generating simple boilerplate code) where the output structure is more important than complex reasoning.

- When the data being processed is so sensitive (patient records, confidential financial data) that it cannot legally or ethically leave the local server, justifying the sacrifice of accuracy.


*When would you choose GEMINI over LLaMA-2?*

- For research and educational tasks in Computational Physics where algorithmic correctness and robust scientific reasoning are non-negotiable (as demonstrated in Cases 2, 4, and 5).

- For applications requiring high-quality, documented code that is ready for initial development.

### D.4 Installation and Deployment Experience

**What was challenging about setting up Ollama/LLaMA-2?**
Installation itself was very easy. The challenges were accepting the slow CPU performance (20-60 seconds per prompt) and keeping the Ollama server running at all times, which adds operational overhead compared to a cloud API.

**What was easy about using GEMINI?**

Accessing Gemini was instantaneous and required zero infrastructure setup. You simply open a browser, paste the prompt, and get an immediate, highly intelligent answer, highlighting the value of convenience offered by commercial providers.

**Overall, which approach do you prefer and why?**
Although LLaMA-2 is cheaper and cleaner for formatting, its failure rate (4/5) in the core task of scientific problem-solving is unacceptable for an engineering context. The superior quality and correctness of Gemini's code and formulas save significant time in debugging and verification, far outweighing the minor inconvenience of cleaning up the Markdown formatting.

**What surprised you most about this comparison?**
The most surprising finding was the inverse relationship between reasoning ability and format compliance. Gemini, the vastly superior model in reasoning, consistently failed to follow the simple instruction of using Python Markdown blocks, whereas the far less capable LLaMA-2 adhered to the structure much more closely. This highlights a critical challenge for developers building automation pipelines: the most capable models may require the most effort in output parsing and cleanup.