
More logging for gpu management #2174

Merged: 1 commit, Jan 24, 2024


dhiltgen (Collaborator)

Fix an ordering glitch between dlerror() and dlclose(), and add more logging to help root-cause some crashes users are hitting. This also renames the function pointers after the underlying library functions (for example nvmlInit_v2 and rsmi_init) instead of simplified aliases, for readability.
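Why the ordering matters: dlerror() reports the most recent dl* failure, and on glibc a successful dlclose() clears that pending error, so reading it after closing the handle can yield NULL instead of the real reason. The sketch below is a minimal illustration under that assumption, not the ollama source: sym_entry and wire_library are hypothetical names, and the "dlsym: <name>" lines only mimic the debug output shown in the examples. Each table entry is named after the exported symbol, and the error text is captured before the handle is closed.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup-table entry: the name is the real exported symbol
   (e.g. "nvmlInit_v2"), and slot receives the resolved address. */
typedef struct {
  const char *name;
  void **slot;
} sym_entry;

/* Opens path (NULL means the running program), resolves every symbol and
   returns the handle. On failure returns NULL and writes the reason to err. */
static void *wire_library(const char *path, sym_entry *syms, size_t n,
                          char *err, size_t errlen) {
  const char *label = path ? path : "(self)";
  fprintf(stderr, "wiring library functions in %s\n", label);
  void *handle = dlopen(path, RTLD_LAZY);
  if (handle == NULL) {
    snprintf(err, errlen, "dlopen %s failed: %s", label, dlerror());
    return NULL;
  }
  for (size_t i = 0; i < n; i++) {
    fprintf(stderr, "dlsym: %s\n", syms[i].name);
    dlerror(); /* clear any stale error before the lookup */
    *syms[i].slot = dlsym(handle, syms[i].name);
    if (*syms[i].slot == NULL) {
      /* Read dlerror() BEFORE dlclose(): a successful dlclose() may
         discard the pending message, leaving dlerror() returning NULL. */
      const char *msg = dlerror();
      snprintf(err, errlen, "symbol lookup for %s failed: %s", syms[i].name,
               msg ? msg : "unknown error");
      dlclose(handle);
      return NULL;
    }
  }
  return handle;
}
```

With glibc older than 2.34, link with -ldl; newer glibc provides these functions in libc itself.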

dhiltgen (Collaborator, Author)

Example output on CUDA with OLLAMA_DEBUG=1

time=2024-01-24T09:49:16.516-08:00 level=INFO source=/go/src/github.com/jmorganca/ollama/gpu/gpu.go:258 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08]"
wiring nvidia management library functions in /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
dlsym: nvmlInit_v2
dlsym: nvmlShutdown
dlsym: nvmlDeviceGetHandleByIndex
dlsym: nvmlDeviceGetMemoryInfo
dlsym: nvmlDeviceGetCount_v2
dlsym: nvmlDeviceGetCudaComputeCapability
dlsym: nvmlSystemGetDriverVersion
dlsym: nvmlDeviceGetName
dlsym: nvmlDeviceGetSerial
dlsym: nvmlDeviceGetVbiosVersion
dlsym: nvmlDeviceGetBoardPartNumber
dlsym: nvmlDeviceGetBrand
CUDA driver version: 545.23.08
time=2024-01-24T09:49:16.538-08:00 level=INFO source=/go/src/github.com/jmorganca/ollama/gpu/gpu.go:98 msg="Nvidia GPU detected"
[0] CUDA device name: NVIDIA GeForce GTX 1650 with Max-Q Design
[0] CUDA part number:
nvmlDeviceGetSerial failed: 3
[0] CUDA vbios version: 90.17.31.00.26
[0] CUDA brand: 5
[0] CUDA totalMem 4294967296
[0] CUDA usedMem 3736010752
time=2024-01-24T09:49:16.544-08:00 level=INFO source=/go/src/github.com/jmorganca/ollama/gpu/gpu.go:139 msg="CUDA Compute Capability detected: 7.5"
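The CUDA output prints raw NVML enum values. In nvml.h, return code 3 in "nvmlDeviceGetSerial failed: 3" is NVML_ERROR_NOT_SUPPORTED, and "CUDA brand: 5" is NVML_BRAND_GEFORCE, which matches the GeForce GTX 1650 reported above. A lookup helper like the hypothetical one below could make such lines self-explanatory; it covers only a handful of values, so treat it as a sketch and check the installed nvml.h for the complete enums.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: symbolic names for a subset of nvmlReturn_t. */
static const char *nvml_return_name(int rc) {
  switch (rc) {
  case 0: return "NVML_SUCCESS";
  case 1: return "NVML_ERROR_UNINITIALIZED";
  case 2: return "NVML_ERROR_INVALID_ARGUMENT";
  case 3: return "NVML_ERROR_NOT_SUPPORTED";
  case 4: return "NVML_ERROR_NO_PERMISSION";
  case 999: return "NVML_ERROR_UNKNOWN";
  default: return "unrecognized nvmlReturn_t";
  }
}

/* Hypothetical helper: symbolic names for a subset of nvmlBrandType_t. */
static const char *nvml_brand_name(int brand) {
  switch (brand) {
  case 0: return "NVML_BRAND_UNKNOWN";
  case 1: return "NVML_BRAND_QUADRO";
  case 2: return "NVML_BRAND_TESLA";
  case 3: return "NVML_BRAND_NVS";
  case 4: return "NVML_BRAND_GRID";
  case 5: return "NVML_BRAND_GEFORCE";
  case 6: return "NVML_BRAND_TITAN";
  default: return "unrecognized nvmlBrandType_t";
  }
}

/* Log a failed call with the symbolic name next to the raw code. */
void log_nvml_failure(const char *fn, int rc) {
  fprintf(stderr, "%s failed: %d (%s)\n", fn, rc, nvml_return_name(rc));
}
```

Keeping the raw number in the message preserves compatibility with existing bug reports while adding the readable name.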

Example output on ROCm

time=2024-01-24T17:59:08.349Z level=INFO source=/go/src/github.com/jmorganca/ollama/gpu/gpu.go:258 msg="Discovered GPU libraries: [/opt/rocm/lib/librocm_smi64.so.6.0.60000 /opt/rocm-6.0.0/lib/librocm_smi64.so.6.0.60000]"
wiring rocm management library functions in /opt/rocm/lib/librocm_smi64.so.6.0.60000
dlsym: rsmi_init
dlsym: rsmi_shut_down
dlsym: rsmi_dev_memory_total_get
dlsym: rsmi_dev_memory_usage_get
dlsym: rsmi_version_get
dlsym: rsmi_num_monitor_devices
dlsym: rsmi_dev_id_get
dlsym: rsmi_dev_name_get
dlsym: rsmi_dev_brand_get
dlsym: rsmi_dev_vendor_name_get
dlsym: rsmi_dev_vram_vendor_get
dlsym: rsmi_dev_serial_number_get
dlsym: rsmi_dev_subsystem_name_get
dlsym: rsmi_dev_vbios_version_get
time=2024-01-24T17:59:08.350Z level=INFO source=/go/src/github.com/jmorganca/ollama/gpu/gpu.go:108 msg="Radeon GPU detected"
discovered 1 ROCm GPU Devices
[0] ROCm device name: Navi 31 [Radeon RX 7900 XT/7900 XTX]
[0] ROCm brand: Navi 31 [Radeon RX 7900 XT/7900 XTX]
[0] ROCm vendor: Advanced Micro Devices, Inc. [AMD/ATI]
[0] ROCm VRAM vendor: samsung
[0] ROCm S/N: 43cfeecf3446fbf7
[0] ROCm subsystem name: NITRO+ RX 7900 XTX Vapor-X
[0] ROCm vbios version: 113-4E4710U-T4Y
[0] ROCm totalMem 25753026560
[0] ROCm usedMem 27852800
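Both backends log total and used VRAM in bytes. Assuming free memory is simply the difference (this sketch does not claim that is how ollama computes it internally), the GTX 1650 has 4294967296 - 3736010752 = 558956544 bytes free (about 533 MiB), and the RX 7900 XTX has 25753026560 - 27852800 = 25725173760 bytes free (about 24533 MiB). The helper names below are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: free VRAM from the logged totals, clamped at zero
   in case a driver ever reports used > total. */
static uint64_t free_bytes(uint64_t total, uint64_t used) {
  return used > total ? 0 : total - used;
}

/* Convert a byte count to MiB (1 MiB = 1048576 bytes). */
static double to_mib(uint64_t bytes) {
  return (double)bytes / (1024.0 * 1024.0);
}

/* Print a line in the same style as the totalMem/usedMem log lines. */
void print_free_mem(int idx, const char *backend, uint64_t total, uint64_t used) {
  uint64_t f = free_bytes(total, used);
  printf("[%d] %s freeMem %llu (%.1f MiB)\n", idx, backend,
         (unsigned long long)f, to_mib(f));
}
```

For example, print_free_mem(0, "CUDA", 4294967296ULL, 3736010752ULL) prints "[0] CUDA freeMem 558956544 (533.1 MiB)". Unsigned 64-bit arithmetic keeps the 24 GiB ROCm figures exact, and the clamp avoids wrap-around if the two readings are taken at slightly different moments.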

jmorganca (Member) left a comment:

LGTM

dhiltgen merged commit a170888 into ollama:main on Jan 24, 2024
10 checks passed
dhiltgen deleted the rocm_real_gpus branch on January 24, 2024 19:09