
Use a BLAS lib & CLBlast for CPU & GPU speedups. [Enhancement] #7

Closed
trholding opened this issue Jul 23, 2023 · 4 comments

Comments

@trholding
Contributor

@karpathy Would you consider using a BLAS library to speed up compute so that larger models become usable?
If so, please also add an option to compile with CLBlast (a drop-in BLAS-compatible library) so that compute can be offloaded to the GPU via OpenCL.

https://www.netlib.org/blas/#_reference_blas_version_3_11_0
https://www.openblas.net/
https://github.com/CNugteren/CLBlast

@trholding trholding changed the title from "Use a BLAS lib and CLBlast for CPU and GPU speedups. [Enhancement]" to "Use a BLAS lib & CLBlast for CPU & GPU speedups. [Enhancement]" Jul 23, 2023
@karpathy
Owner

Yeah... I definitely want to keep the simplicity of the repo. I'll take a look.

@hchenphd

If CPU (BLAS) & GPU (CLBlast) speedups were applied, I would be very interested in benchmarking the program with different quantized models on different edge devices, such as the RK3588, the Nvidia Jetson Orin Nano, and even Android phones (e.g. Qualcomm Snapdragon 7/8).
As karpathy said, "in this repo we focus on more narrow applications", and I believe deploying the program on edge devices with CPU & GPU speedups could have significant value for many interesting and specific scenarios.

@MostWrong

MostWrong commented Jul 24, 2023

So I tried using CBLAS:

```c
#include <cblas.h>
#include <math.h>

// a += b (single-precision AXPY with alpha = 1)
void accum(float *a, float *b, int size) {
    cblas_saxpy(size, 1.0f, b, 1, a, 1);
}

// RMS norm: o = weight * (x / rms(x))
void rmsnorm(float* o, float* x, float* weight, int size) {
    float ss = cblas_sdot(size, x, 1, x, 1); // sum of squares
    ss /= size;
    ss += 1e-5f;
    ss = 1.0f / sqrtf(ss);

    for (int j = 0; j < size; j++) {
        o[j] = weight[j] * (ss * x[j]);
    }
}

// xout = W @ x, where W is (d, n) stored row-major
void matmul(float* xout, float* x, float* w, int n, int d) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}
```

(Note: the BLAS increment arguments are ints, not floats, and `sqrtf` needs `<math.h>`.)

Gives a decent speedup

trholding referenced this issue in trholding/llama2.c Jul 31, 2023
Added BLAS support:

+ Openblas
+ CLBlast (GPU)

CLBlast is considerably slower; needs investigation.

Added APE binary prompt support

Usage:

Ape run:
$   run.com

Baremetal Boot:
$  qemu-system-x86_64 -serial stdio -hda run.com
(input is broken on baremetal)

Updated Makefile

Usage:
make runopenblas
make runclblast
@trholding
Contributor Author

Available in a separate fork. Thanks for the discussion. Closing.


4 participants