llama-2-70b-hf-inference

How to run Llama 2 70B inference with Hugging Face, parallelized across multiple GPUs.

First set up the accelerate config (for example by running accelerate config). Then run inference.py, which downloads the model into a given directory and loads it with load_checkpoint_and_dispatch from the accelerate library.
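The download-then-dispatch flow described above can be sketched roughly as follows. This is a minimal sketch and not the contents of inference.py: the helper names build_max_memory and load_sharded_model are hypothetical, the ignore_patterns filter assumes the model repository ships safetensors shards alongside .bin shards, and LlamaDecoderLayer is assumed to be the block that must stay on a single device. The heavy imports sit inside the loader so the small memory helper can be used on its own.

```python
def build_max_memory(num_gpus, gib_per_gpu, cpu_gib=0):
    """Build the max_memory mapping that accelerate uses to plan placement.

    Keys are GPU indices (plus an optional "cpu" entry); values are size
    strings such as "70GiB".
    """
    max_memory = {gpu: f"{gib_per_gpu}GiB" for gpu in range(num_gpus)}
    if cpu_gib > 0:
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


def load_sharded_model(model_id, local_dir, max_memory):
    """Download a checkpoint and spread its layers across the available GPUs.

    Hypothetical helper: the argument names are illustrative only.
    """
    # Imported here so build_max_memory works without the GPU stack installed.
    import torch
    from accelerate import init_empty_weights, load_checkpoint_and_dispatch
    from huggingface_hub import snapshot_download
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    # Assumption: skip the .bin shards and their index so the folder holds
    # a single *.index.json file for accelerate to pick up.
    checkpoint_dir = snapshot_download(
        repo_id=model_id,
        local_dir=local_dir,
        ignore_patterns=["*.bin", "*.bin.index.json"],
    )

    # Build the model skeleton without allocating memory for the weights.
    config = AutoConfig.from_pretrained(checkpoint_dir)
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)

    # Load the shards and place each layer on a device within max_memory.
    model = load_checkpoint_and_dispatch(
        model,
        checkpoint=checkpoint_dir,
        device_map="auto",
        max_memory=max_memory,
        no_split_module_classes=["LlamaDecoderLayer"],  # assumed layer class
        dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    return model, tokenizer
```

Under these assumptions, a call such as load_sharded_model("meta-llama/Llama-2-70b-hf", "./models/llama-2-70b-hf", build_max_memory(4, 70)) would return a model whose layers are split over four GPUs. The Llama 2 weights are gated, so the download is expected to need a Hugging Face account that has been granted access.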

I want to expand this into a codebase for:

OpenAI inference class
Llama 70b hf inference (accelerate, or just normal)
Llama inference
Mistral inference
All other inference with chat template that works
Generic evaluation, maybe.
Getting folder file structure
SSH commands
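As an illustration of the "getting folder file structure" item in the list above, a helper along these lines could work. It is only a sketch using the standard library; the name folder_tree and the indentation style are assumptions, not something taken from the repository.

```python
from pathlib import Path


def folder_tree(root, prefix=""):
    """Return the lines of an indented listing of everything under root."""
    root = Path(root)
    lines = []
    # Directories come first, then files; each group is sorted by name.
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        if entry.is_dir():
            lines.append(f"{prefix}{entry.name}/")
            # Recurse with one more level of indentation.
            lines.extend(folder_tree(entry, prefix + "    "))
        else:
            lines.append(f"{prefix}{entry.name}")
    return lines
```

Printing "\n".join(folder_tree(".")) gives a plain-text outline of the current directory that can be pasted into notes or a prompt.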

Files in this repository:

README.md
download_models.py
generation.py
generation.txt
inference.py

notrichardren/code-repository
