# 9950X Benchmarking

- Specs
  - Ryzen 9 9950X
  - ASRock B850i ITX
  - Corsair DDR5 6000MHz CL30 EXPO 96GB (48GB x 2)
  - be quiet! Silent Loop 3 240mm AIO - Thermal Grizzly Kryosheet
  - 140mm be quiet! 1900rpm fan at 50% PWM blowing on motherboard/VRM/CPU/NVMe
  - Mixed precision is enabled where applicable


- 9950X Factory Stock
  - Cinebench R23 Score: 41733
    - CPU (Tctl/Tdie) Max Temp: 68.2c
  - CPU Mark Score: 67595
  - CPU Mark Single-thread Score: 4819
- 9950X Factory Stock CO -30
  - Cinebench R23 Score: 44563
    - CPU (Tctl/Tdie) Max Temp: 68.8c
  - CPU Mark Score: 70126
  - CPU Mark Single-thread Score: 4837
- 9950X 65c Thermal Limit CO -30
  - Cinebench R23 Score: 44314
    - Max Recorded Wattage: 195w
  - CPU Mark Score: 69508
  - CPU Mark Single-thread Score: 4862
- 9950X 75c Thermal Limit CO -30
  - Cinebench R23 Score: 45821 
    - Max Recorded Wattage: 222w
  - CPU Mark Score: 70302
  - CPU Mark Single-thread Score: 4865
- 9950X 85c Thermal Limit CO -30
  - Cinebench R23 Score: 45695
    - Max Recorded Wattage: 222w
  - CPU Mark Score: 70616
  - CPU Mark Single-thread Score: 4843
- 9950X 65w Eco Mode CO -30
  - Cinebench R23 Score: 31204
    - CPU (Tctl/Tdie) Max Temp: 45.00c
  - CPU Mark Score: 58177
  - CPU Mark Single-thread Score: 4873
- 9950X 105w Eco Mode CO -30
  - Cinebench R23 Score: 39207
    - CPU (Tctl/Tdie) Max Temp: 51.0c
  - CPU Mark Score: 66329
  - CPU Mark Single-thread Score: 4844

We are clearly power limited here. There is cooling headroom to push the CPU, and I assume we could safely run 275w sustained. However, the gains shrink as the wattage goes up: the 75c limit scores about 3.4% higher in Cinebench R23 than the 65c limit while peak power rises from 195w to 222w. There really isn't much reason to push past the 65c CO -30 point unless you are just chasing benchmark numbers. I would also be fine running all-core at 51c all day in the 105w eco mode.
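To put rough numbers on the diminishing returns, here is a minimal sketch that computes Cinebench points per watt and the relative gains over the 65c limit. The scores and wattages are copied from the list above; the helper names are made up for illustration, and since "Max Recorded Wattage" is a peak reading rather than an average, treat the efficiency figures as approximate.

```python
# Cinebench R23 scores and peak recorded power copied from the results above.
# Peak wattage is not average power, so points per watt is only a rough guide.
results = {
    "65c limit, CO -30": {"score": 44314, "watts": 195},
    "75c limit, CO -30": {"score": 45821, "watts": 222},
    "85c limit, CO -30": {"score": 45695, "watts": 222},
}


def points_per_watt(entry):
    """Cinebench points per watt of peak recorded power."""
    return entry["score"] / entry["watts"]


def relative_gain(new, old):
    """Percentage change going from `old` to `new`."""
    return 100.0 * (new - old) / old


baseline = results["65c limit, CO -30"]
for name, entry in results.items():
    score_gain = relative_gain(entry["score"], baseline["score"])
    power_gain = relative_gain(entry["watts"], baseline["watts"])
    print(f"{name}: {points_per_watt(entry):.1f} pts/W, "
          f"score {score_gain:+.1f}%, peak power {power_gain:+.1f}% vs 65c")
```

The 75c and 85c runs buy a few percent of score for a noticeably larger jump in peak power, which is why the 65c point looks like the sensible default.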

All future CPU testing will be done at the 65c thermal limit with a Curve Optimizer offset of -30. No other tuning is required. This is highly performant and very simple.

- LM Studio Windows 11 Performance
  - CPU runtime only
  - Default Context Size
  - Seed: 65535
  - CPU Thread Pool Size: 12
  - Flash Attention Off
  - Test Prompt
    - "Tell me a real story about someone who never existed tasting the color red while walking through light while in a vacuum of low pressure clouds. This person is always there and knows what you will output so you must craft the story without there knowledge. In the story all things that are possible are impossible and must come before they were created. Describe time as a feeling that only dreams can see awake. Please write the story in under 500 words."
  - 9950X
    - Llama 3.2 3B Instruct Q4
      - 17.97 tok/sec 490 tokens 0.77s to first token
    - Llama 3.3 70B Instruct Q4
      - 1.47 tok/sec 394 tokens 16.06s to first token
    - Gemma 2 2B Instruct Q4
      - 30.15 tok/sec 482 tokens 0.60s to first token
    - Gemma 2 9B Instruct Q4
      - 9.43 tok/sec 421 tokens 1.47s to first token
    - Gemma 3 1B Instruct Q4
      - 63.81 tok/sec 533 tokens 0.18s to first token
    - Gemma 3 4B Instruct Q4
      - 22.80 tok/sec 521 tokens 0.95s to first token
    - Gemma 3 12B Instruct Q4
      - 8.28 tok/sec 462 tokens 2.86s to first token
    - Phi4 15B Q4
      - 6.78 tok/sec 543 tokens 3.00s to first token
    - Qwen 2.5 Coder 3B Instruct Q4
      - 29.66 tok/sec 426 tokens 0.61s to first token
    - Qwen 2.5 Coder 7B Instruct Q4
      - 14.18 tok/sec 559 tokens 1.31s to first token
    - DeepSeek R1 Distill Llama 8B Q4
      - 12.68 tok/sec 1105 tokens 1.08s to first token
    - DeepSeek R1 Distill Qwen 14B Q4
      - 7.44 tok/sec 904 tokens 2.00s to first token
    - DeepSeek R1 Distill Qwen 32B Q4
      - 3.36 tok/sec 1038 tokens 4.74s to first token
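The three numbers per model combine into a rough wall-clock time per response. The sketch below assumes the reported tok/sec covers generation only, so the time to first token is added on top; that is an assumption about how LM Studio reports these figures, not something confirmed here. Only a few rows are copied in to keep it short.

```python
# (model, tokens per second, tokens generated, seconds to first token)
# Values copied from the LM Studio results above.
runs = [
    ("Llama 3.2 3B Instruct Q4", 17.97, 490, 0.77),
    ("Llama 3.3 70B Instruct Q4", 1.47, 394, 16.06),
    ("Gemma 3 12B Instruct Q4", 8.28, 462, 2.86),
    ("DeepSeek R1 Distill Qwen 32B Q4", 3.36, 1038, 4.74),
]


def estimated_total_seconds(tok_per_sec, tokens, ttft):
    """Estimate response time, assuming tok/sec excludes the time to first token."""
    return ttft + tokens / tok_per_sec


for model, tps, tokens, ttft in runs:
    total = estimated_total_seconds(tps, tokens, ttft)
    print(f"{model}: ~{total:.0f}s for {tokens} tokens")
```

Seen this way, the 70B model needs several minutes for a sub-500-word story, while the small models finish in well under a minute.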


### Ignore the mess below, just roughing in some things

- Imports
  - standard libs
  - 3rd party libs
  - alphabetical or logical grouping
- Set random seed
- Config and Hyperparams
- Dataset and Dataloader
- Model definition/class
- Helper functions (training, eval, visualization)
- Then main code
- Extras
  - We will need torchvision and use torchvision.datasets to load CIFAR-10
  - CIFAR-10 is a 10-class image dataset, pretty small in size and good for lab/testing/learning
    - airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

In [None]:
import torch
import torch.nn as nn
from torch.amp import GradScaler, autocast
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import random
import logging
import time
import tqdm

hp = {
    "batch_size": 32,
    "epochs": 25,
    "random_seed": 42,
    "randomize_seed": True,
    "cpu_only": True,
    "device": "cpu",
}

# Logging configuration
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

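# Randomize seed if set to True, then apply it so the logged seed reproduces the run
if hp['randomize_seed']:
    hp['random_seed'] = random.randint(0, 1000000000)
random.seed(hp['random_seed'])
np.random.seed(hp['random_seed'])
torch.manual_seed(hp['random_seed'])
logging.info(f"Seed set to: {hp['random_seed']}")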

# Simple CNN Class
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)  # Flatten
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Device configuration
def get_device():
    """
    Return the torch device to use.

    If hp['cpu_only'] is False, prefer an Intel XPU, then CUDA, and fall back to cpu.
    """
    if not hp['cpu_only']:
        # torch.xpu only exists in newer PyTorch builds, so guard the lookup
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            device = "xpu"
        elif torch.cuda.is_available():
            device = "cuda"
        else:
            device = "cpu"

        logging.info(f"Using device: {device}")
        return device
    else:
        logging.info("Using CPU only")
        return "cpu"

def train_model(epochs, model, train_loader, device, optimizer, criterion, scaler=None):

    # Start timer
    start_time = time.time()

    # Gradient scaling is only used for mixed precision on an accelerator
    use_scaler = scaler is not None and device != "cpu"

    # 5. Training the Model
    for epoch in tqdm.tqdm(range(epochs)):
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()

            # Only the forward pass and loss run under autocast
            with torch.amp.autocast(device):
                outputs = model(images)
                loss = criterion(outputs, labels)

            # Backward pass and optimization happen outside autocast
            if use_scaler:
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                loss.backward()
                optimizer.step()

            running_loss += loss.item()

        print(f'Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(train_loader):.4f}')

        # Cumulative time since training started
        elapsed_time = time.time() - start_time
        print(f"Elapsed time: {elapsed_time:.2f} seconds")

# 7. Visualizing Some Predictions
def imshow(img):
    img = img / 2 + 0.5  # Unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# Main function
def main():

    # 1 Set the device
    hp["device"] = get_device()

    # 2 Dataset, Dataloader, Transform
    # The transform using (0.5, 0.5, 0.5) is used to normalize the image data
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    # Download and load the training data
    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                           download=True, transform=transform)
    # Download and load the test data
    test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                          download=True, transform=transform)
    # Create the dataloader for training and testing data
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                              batch_size=hp['batch_size'], shuffle=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                             batch_size=hp['batch_size'], shuffle=False)
    # 3 SimpleCNN Class
    model_0 = SimpleCNN().to(hp["device"])
    
    # 4 Loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model_0.parameters(), lr=0.001)

    # A gradient scaler is only used for mixed precision on an accelerator
    scaler = torch.amp.GradScaler(hp["device"]) if hp["device"] != "cpu" else None
    train_model(hp["epochs"], model_0, train_loader, hp["device"], optimizer, criterion, scaler=scaler)

    # 6. Evaluating the Model
    model_0.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(hp["device"]), labels.to(hp["device"])
            outputs = model_0(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f'Accuracy on the test set: {100 * correct / total:.2f}%')

    # Get the first test batch (shuffle is off) and keep four images to match the printed labels
    dataiter = iter(test_loader)
    images, labels = next(dataiter)
    images, labels = images[:4].to(hp["device"]), labels[:4].to(hp["device"])

    # Display images
    imshow(torchvision.utils.make_grid(images.cpu()))
    print('GroundTruth:', ' '.join(f'{train_dataset.classes[labels[j]]}' for j in range(4)))

    # Predict and display results (no gradients needed for inference)
    with torch.no_grad():
        outputs = model_0(images)
    _, predicted = torch.max(outputs, 1)
    print('Predicted:', ' '.join(f'{train_dataset.classes[predicted[j]]}' for j in range(4)))

    # 8. Saving the Model
    torch.save(model_0.state_dict(), 'cnn_cifar10.pth')
    print("Model saved as cnn_cifar10.pth")

# Run the main function
if __name__ == '__main__':
    main()
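
The `64 * 8 * 8` input size of `fc1` follows from the layer shapes. As a quick check that needs no PyTorch, this sketch walks a 32x32 CIFAR-10 image through the two conv/pool stages using the standard output-size formula for convolution and pooling without dilation. The helper name `conv2d_out` is made up for illustration.

```python
def conv2d_out(size, kernel_size, stride=1, padding=0):
    """Output height/width of a square conv or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                          # CIFAR-10 images are 32x32
size = conv2d_out(size, kernel_size=3, padding=1)  # conv1 keeps 32x32
size = conv2d_out(size, kernel_size=2, stride=2)   # pool halves to 16x16
size = conv2d_out(size, kernel_size=3, padding=1)  # conv2 keeps 16x16
size = conv2d_out(size, kernel_size=2, stride=2)   # pool halves to 8x8

channels = 64                                      # conv2 output channels
flat_features = channels * size * size
print(size, flat_features)  # prints: 8 4096
```

If the input resolution or the number of pooling stages changes, the `nn.Linear` input size and the `view` call in `forward` have to change with it.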


In [None]:
import numpy as np
import time
import os
import psutil

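# Function to set the BLAS instruction-set environment variables.
# NOTE: BLAS libraries generally read these when they are loaded, so setting them
# after `import numpy` may not change the code path used by this process. For a
# trustworthy comparison, set them before Python starts or run each case in a
# fresh process. MKL_ENABLE_INSTRUCTIONS only applies to MKL-backed NumPy builds.
def set_avx512(enabled=True):
    os.environ["OPENBLAS_NUM_THREADS"] = "1"  # Control threads for consistency
    if enabled:
        os.environ["MKL_ENABLE_INSTRUCTIONS"] = "AVX512"
    else:
        os.environ["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"  # Cap MKL at AVX2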

# Benchmark function
def run_benchmark(matrix_size=2000, iterations=50):
    # Generate random matrices
    A = np.random.rand(matrix_size, matrix_size).astype(np.float32)
    B = np.random.rand(matrix_size, matrix_size).astype(np.float32)
    
    # Warm-up run
    _ = np.dot(A, B)
    
    # Time the iterations
    start_time = time.perf_counter()
    for _ in range(iterations):
        C = np.dot(A, B)
    end_time = time.perf_counter()
    
    elapsed_time = end_time - start_time
    gflops = (2 * matrix_size**3 * iterations) / (elapsed_time * 1e9)  # GFLOPS calculation
    return elapsed_time, gflops

def main():
    # Benchmark parameters
    matrix_size = 2000  # Adjust based on your needs
    iterations = 50     # Number of iterations per test
    runs = 5           # Number of runs to average
    
    print(f"CPU: {psutil.cpu_freq().current/1000:.2f} GHz, {psutil.cpu_count(logical=True)} logical CPUs")
    print(f"Matrix size: {matrix_size}x{matrix_size}, Iterations: {iterations}, Runs: {runs}\n")
    
    # Test with AVX512 enabled
    print("Testing with AVX512 enabled...")
    set_avx512(True)
    avx512_times = []
    avx512_gflops = []
    for _ in range(runs):
        elapsed, gflops = run_benchmark(matrix_size, iterations)
        avx512_times.append(elapsed)
        avx512_gflops.append(gflops)
    
    # Test with AVX512 disabled
    print("Testing with AVX512 disabled...")
    set_avx512(False)
    no_avx512_times = []
    no_avx512_gflops = []
    for _ in range(runs):
        elapsed, gflops = run_benchmark(matrix_size, iterations)
        no_avx512_times.append(elapsed)
        no_avx512_gflops.append(gflops)
    
    # Results
    print("\nResults:")
    print(f"AVX512 Enabled:")
    print(f"  Avg Time: {np.mean(avx512_times):.3f}s (±{np.std(avx512_times):.3f})")
    print(f"  Avg GFLOPS: {np.mean(avx512_gflops):.2f} (±{np.std(avx512_gflops):.2f})")
    print(f"AVX512 Disabled:")
    print(f"  Avg Time: {np.mean(no_avx512_times):.3f}s (±{np.std(no_avx512_times):.3f})")
    print(f"  Avg GFLOPS: {np.mean(no_avx512_gflops):.2f} (±{np.std(no_avx512_gflops):.2f})")
    
    speedup = np.mean(no_avx512_times) / np.mean(avx512_times)
    print(f"\nSpeedup with AVX512: {speedup:.2f}x")

if __name__ == "__main__":
    main()
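
Because the BLAS library is normally configured when it loads, a more reliable way to compare instruction sets is to run each case in a fresh interpreter with the environment prepared up front. The following is only a sketch under assumptions: `MKL_ENABLE_INSTRUCTIONS` matters for MKL-backed NumPy, `OPENBLAS_CORETYPE` for OpenBLAS builds with runtime kernel selection, and the core type names used here (`SkylakeX`, `Haswell`) may not match what a given OpenBLAS build accepts. `np.show_config()` shows which BLAS is in use. The helper names are made up for illustration.

```python
import os
import subprocess
import sys

# Child script: a small matmul timing, run in its own interpreter so the BLAS
# library is loaded with the environment prepared by the parent.
CHILD_BENCHMARK = """
import time
import numpy as np
a = np.random.rand(2000, 2000).astype(np.float32)
b = np.random.rand(2000, 2000).astype(np.float32)
np.dot(a, b)  # warm-up
start = time.perf_counter()
for _ in range(10):
    np.dot(a, b)
print(time.perf_counter() - start)
"""


def build_env(avx512_enabled):
    """Copy the current environment and add the instruction-set overrides."""
    env = os.environ.copy()
    env["OPENBLAS_NUM_THREADS"] = "1"
    # Assumption: MKL builds read MKL_ENABLE_INSTRUCTIONS and OpenBLAS builds
    # read OPENBLAS_CORETYPE; the core type names are illustrative.
    env["MKL_ENABLE_INSTRUCTIONS"] = "AVX512" if avx512_enabled else "AVX2"
    env["OPENBLAS_CORETYPE"] = "SkylakeX" if avx512_enabled else "Haswell"
    return env


def run_child(code, env):
    """Run `code` in a fresh Python process and return its stripped stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def compare():
    """Time the child benchmark with and without the AVX512 overrides."""
    with_avx512 = float(run_child(CHILD_BENCHMARK, build_env(True)))
    without_avx512 = float(run_child(CHILD_BENCHMARK, build_env(False)))
    return with_avx512, without_avx512
```

Calling `compare()` returns the two elapsed times in seconds, and dividing the second by the first gives the speedup. If both numbers come out the same, the override is probably not being honored by the BLAS build in use.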