Cloud-edge collaborative inference for LLM based on KubeEdge-Ianvs #96

Open
hsj576 opened this issue May 8, 2024 · 3 comments

hsj576 (Member) commented May 8, 2024

What would you like to be added/modified:
This issue aims to build a cloud-edge collaborative inference framework for LLMs on KubeEdge-Ianvs, namely, to help cloud-edge LLM developers improve inference accuracy while preserving strong privacy and fast inference speed. This issue includes:

  1. A benchmark of basic LLM tasks (e.g., user question answering, code generation, or text translation) implemented in KubeEdge-Ianvs.
  2. An example of LLM cloud-edge collaborative inference implemented in KubeEdge-Ianvs (see the routing sketch after this list).
  3. (Advanced) Cloud-edge collaborative inference algorithms for LLMs, such as speculative decoding.
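
For intuition only, the sketch below shows one possible shape of such a collaboration strategy: a simple router that answers a query with a small edge model when that model is confident, and escalates to a cloud model otherwise. The class and function names and the confidence threshold are hypothetical illustrations, not Ianvs APIs.

```python
# Hypothetical sketch of confidence-based query routing between an edge LLM
# and a cloud LLM. None of these names come from Ianvs; they only illustrate
# the collaboration pattern described in item 2 above.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EdgeResult:
    text: str
    confidence: float  # e.g., mean token probability reported by the edge model


def collaborative_infer(
    query: str,
    edge_infer: Callable[[str], EdgeResult],  # small local model, e.g. TinyLlama-1.1B
    cloud_infer: Callable[[str], str],        # large model behind a cloud API
    confidence_threshold: float = 0.85,       # illustrative value, needs tuning
) -> str:
    """Answer on the edge when the small model is confident; otherwise escalate.

    Easy queries stay on the edge (low latency, data never leaves the device);
    only hard queries are forwarded to the cloud for higher accuracy.
    """
    edge_result = edge_infer(query)
    if edge_result.confidence >= confidence_threshold:
        return edge_result.text   # fast, private, edge-only path
    return cloud_infer(query)     # accuracy-critical path via the cloud
```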

Why is this needed:
At present, LLMs at the 10-billion to 100-billion parameter scale, led by Llama2-70B and Qwen-72B, can only be deployed in the cloud, where there is sufficient computing power to provide inference services. For users at edge terminals, however, cloud LLM services suffer from slow inference speed and long response latency, and uploading private edge data to the cloud for processing risks privacy disclosure. At the same time, the inference accuracy of LLMs that can be deployed in edge environments (such as TinyLlama-1.1B) is much lower than that of cloud LLMs. Using a cloud LLM or an edge LLM alone therefore cannot simultaneously provide privacy protection, real-time inference, and high inference accuracy. We need to combine the high inference accuracy of cloud LLMs with the strong privacy and fast inference of edge LLMs through a cloud-edge collaboration strategy, so as to better meet the needs of edge users.

Recommended Skills:
KubeEdge-Ianvs, Python, Pytorch, LLMs

Useful links:
Introduction to Ianvs
Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

MooreZheng (Contributor) commented:

If anyone has questions regarding this issue, please feel free to leave a message here. We would also appreciate it if new members could introduce themselves to the community.

IcyFeather233 commented:

Hi! To complete this issue, does that mean I need to have the corresponding GPU resources to run large models for project debugging?

hsj576 (Member, Author) commented May 24, 2024

> Hi! To complete this issue, does that mean I need to have the corresponding GPU resources to run large models for project debugging?

Yes, the student taking this OSPP project needs access to at least one consumer-grade GPU (e.g., an RTX 2080 or 3090). However, since this project mainly focuses on LLM inference, it does not require that much computing power. For the edge LLM, if your available computing resources are limited, you can choose a small-scale model such as TinyLlama-1.1B or Qwen1.5-0.5B; these models can run inference even on a personal laptop. For the cloud LLM, if your computing resources are not sufficient to deploy a model at the 10-billion or 100-billion parameter scale, you can use GPT-4, Claude 3, Kimi, GLM-4, or other commercial LLMs with open APIs.
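
In case it helps with setting up a debugging environment, here is a minimal sketch of that setup, assuming the Hugging Face transformers library for the edge model and the openai Python client for a commercial cloud API; the model names, generation parameters, and API choice are placeholders to swap for whatever you actually use.

```python
# Minimal environment sketch: a ~1B-parameter edge model that runs on a laptop
# or a single consumer GPU, plus a commercial cloud LLM reached through its API.
# Model names and API usage here are examples, not part of this project.
from transformers import pipeline  # pip install transformers torch
from openai import OpenAI          # pip install openai; expects OPENAI_API_KEY

# Edge side: small chat model, deployable without large GPU resources.
edge_llm = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto",
)

# Cloud side: large model consumed as a service instead of self-deploying
# a 10B+/100B-parameter model.
cloud_client = OpenAI()


def edge_answer(prompt: str) -> str:
    out = edge_llm(prompt, max_new_tokens=128, do_sample=False)
    return out[0]["generated_text"]


def cloud_answer(prompt: str) -> str:
    resp = cloud_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(edge_answer("What is cloud-edge collaborative inference?"))
```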
