Commit

updated Quantized LLM Models

prabha-git committed Apr 20, 2024
1 parent 960f2fb commit 90fd848
Showing 15 changed files with 121 additions and 30 deletions.
20 changes: 20 additions & 0 deletions docs/writing/posts/Karpathy's - let's build GPT from scratch.md
@@ -0,0 +1,20 @@
---
draft: true
date: 2024-03-19
slug: lets-build-gpt-from-scratch
tags:
- llm
authors:
- Prabha
---
!!!note "Self Note"
This note is for me to understand the concepts


!!!note "Learning Resource"
Karpathy's tutorial on YouTube: [Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2794s)


ChatGPT is a probabilistic system.

LLMs are built on the Transformer neural network architecture.
Binary file removed docs/writing/posts/Quantized LLM Models.png
Binary file not shown.
12 changes: 12 additions & 0 deletions docs/writing/posts/SentencePiece Model Files.md
@@ -0,0 +1,12 @@
---
draft: true
date: 2024-03-12
slug:
tags:
- llm
authors:
- Prabha
---

- `.spm` is the file extension associated with SentencePiece model files.
- SentencePiece is a library and tool for unsupervised text tokenization and detokenization (see the sketch below).
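
A minimal usage sketch, assuming the `sentencepiece` Python package is installed and using a hypothetical `tokenizer.spm` file:

```python
import sentencepiece as spm

# Load a trained SentencePiece model from an .spm file
# (the file name here is a placeholder).
sp = spm.SentencePieceProcessor(model_file="tokenizer.spm")

text = "SentencePiece splits text into subword pieces."

# Tokenize into subword strings and integer ids, then detokenize.
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)
restored = sp.decode(ids)

print(pieces)
print(ids)
print(restored)
```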
17 changes: 17 additions & 0 deletions docs/writing/posts/Tracking Looker's Download.md
@@ -0,0 +1,17 @@
---
draft: true
date: 2024-03-06
slug: looker-download-tracking
tags:
- "#looker"
authors:
- Prabha
---
# Question: Is there a way to track the user downloads in Looker?

Response from Looker Support below:
So, there is no direct way to track download activities. This is currently a Feature Request - [https://portal.feedback.us.pendo.io/app/#/case/190687](https://portal.feedback.us.pendo.io/app/#/case/190687)

However, we have a workaround using System Activity Event Attribute explore, please check the link - [https://madhive.cloud.looker.com/explore/system__activity/event_attribute?fields=event.created_time,user.name,event_attribute.event_id,event_attribute.name,event_attribute.value&f[event.category]=query&f[event.created_date]=30+days&f[event.name]=export%5E_query&sorts=event.created_time&limit=5000&query_timezone=America%2FNew_York&vis=%7B%7D&filter_config=%7B%22event.category%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22query%22%7D%2C%7B%7D%5D%2C%22id%22%3A0%2C%22error%22%3Afalse%7D%5D%2C%22event.created_date%22%3A%5B%7B%22type%22%3A%22past%22%2C%22values%22%3A%5B%7B%22constant%22%3A%2230%22%2C%22unit%22%3A%22day%22%7D%2C%7B%7D%5D%2C%22id%22%3A2%2C%22error%22%3Afalse%7D%5D%2C%22event.name%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22export_query%22%7D%2C%7B%7D%5D%2C%22id%22%3A4%2C%22error%22%3Afalse%7D%5D%7D&dynamic_fields=%5B%5D&origin=share-expanded](https://madhive.cloud.looker.com/explore/system__activity/event_attribute?fields=event.created_time,user.name,event_attribute.event_id,event_attribute.name,event_attribute.value&f[event.category]=query&f[event.created_date]=30+days&f[event.name]=export%5E_query&sorts=event.created_time&limit=5000&query_timezone=America%2FNew_York&vis=%7B%7D&filter_config=%7B%22event.category%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22query%22%7D%2C%7B%7D%5D%2C%22id%22%3A0%2C%22error%22%3Afalse%7D%5D%2C%22event.created_date%22%3A%5B%7B%22type%22%3A%22past%22%2C%22values%22%3A%5B%7B%22constant%22%3A%2230%22%2C%22unit%22%3A%22day%22%7D%2C%7B%7D%5D%2C%22id%22%3A2%2C%22error%22%3Afalse%7D%5D%2C%22event.name%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22export_query%22%7D%2C%7B%7D%5D%2C%22id%22%3A4%2C%22error%22%3Afalse%7D%5D%7D&dynamic_fields=%5B%5D&origin=share-expanded)

You can also refer to- [https://www.googlecloudcommunity.com/gc/Technical-Tips-Tricks/Track-Downloads-in-System-Activity-workaround/ta-p/586869](https://www.googlecloudcommunity.com/gc/Technical-Tips-Tricks/Track-Downloads-in-System-Activity-workaround/ta-p/586869)
9 changes: 9 additions & 0 deletions docs/writing/posts/gpus.md
@@ -0,0 +1,9 @@




| GPU | Cost | Speed (General Performance) | Memory | Energy Consumption | Training Suitability | Inference Suitability | Other Factors |
| --------------- | ---- | -------------------------------------- | ------------- | ------------------ | -------------------- | --------------------- | -------------------------------------------------------------------------------- |
| **NVIDIA T4** | $$ | Good for inference | 16 GB GDDR6 | Low (70W) | Limited | High | Efficient for edge computing and power-sensitive environments |
| **NVIDIA V100** | $$$ | Excellent for training & inference | 16-32 GB HBM2 | Moderate (250W+) | High | High | Well-suited for AI model training, HPC |
| **NVIDIA A100** | $$$$ | Superior for both training & inference | 40 GB HBM2e | High (400W) | Very High | Very High | Supports Multi-Instance GPU (MIG) for versatile workloads, Sparsity acceleration |
7 binary files not shown.
25 changes: 25 additions & 0 deletions docs/writing/posts/my-q&a.md
@@ -0,0 +1,25 @@
---
draft: true
date: 2024-03-06
slug: my-q_and_a
tags:
- llm
authors:
- Prabha
---
!!!note "Self Note"
This note is for me to understand the concepts
# My Q&A

## Why is the fine-tuned LLM model faster?




### Why does the DSPy compiled program score less than the uncompiled program?

[Colab Notebook](https://colab.research.google.com/drive/1wwyCGgKizNZo48IzfKa9m4uMp2SNRBSF#scrollTo=IyjklZsKCxF-)

- Found that chain of thought doesn't give the expected output (it produces a blank output), so the metric calculation is not meaningful.

- So make sure the DSPy LM is producing the expected output before trusting any metric (see the sketch below).
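
A minimal sanity check, assuming DSPy with an OpenAI backend; the model name and the `question -> answer` signature here are illustrative:

```python
import dspy

# Configure the language model backend (model name is illustrative).
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=250)
dspy.settings.configure(lm=lm)

# A chain-of-thought module with a simple question -> answer signature.
cot = dspy.ChainOfThought("question -> answer")

# Inspect the raw prediction before computing any metric:
# if `answer` comes back blank, the metric comparison is meaningless.
pred = cot(question="What is 2 + 2?")
print(repr(pred.answer))
```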
13 changes: 13 additions & 0 deletions docs/writing/posts/quantization.md
@@ -0,0 +1,13 @@

- LLM models are getting bigger; quantization helps reduce model size with little to no quality degradation.
- ![[Pasted image 20240415112817.png]]

- Some state-of-the-art methods to reduce model size are:
    - Pruning
        - Removing layers that do not contribute to model decisions
    - Knowledge Distillation
        - Train a smaller student model using the larger teacher model
        - The challenge is that you still need to fit the original large model on your machine
    - Quantization
        - In a neural network you can quantize weights and activations
        - The idea is to represent model weights with lower precision (achieved by converting to a different dtype), as sketched below
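
A minimal sketch of that idea in PyTorch, using a random tensor as a stand-in for a layer's weights:

```python
import torch

# Stand-in for a layer's FP32 weight matrix.
weights_fp32 = torch.randn(1024, 1024, dtype=torch.float32)

# The same weights in half precision: half the memory, some loss of precision.
weights_fp16 = weights_fp32.to(torch.float16)

print(weights_fp32.element_size() * weights_fp32.nelement())  # 4,194,304 bytes
print(weights_fp16.element_size() * weights_fp16.nelement())  # 2,097,152 bytes
```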
55 changes: 25 additions & 30 deletions docs/writing/posts/quantized-llm-models.md
@@ -8,53 +8,48 @@ authors:
- Prabha
---

# Quantized LLM Model
# Quantized LLM Models

LLM models are large with billions of parameter,
Large Language Models (LLMs) are known for their vast number of parameters, often reaching billions. For example, open-source models like Llama2 come in sizes of 7B, 13B, and 70B parameters, while Google's Gemma has 2B parameters. Although OpenAI's GPT-4 architecture is not publicly shared, it is speculated to have more than a trillion parameters, with 8 models working together in a mixture of experts approach.

for example, Open source Models like llama2 comes in the sizes 7B, 13B and 70B
## Understanding Parameters

Google's Gemma is 2B
A parameter is a model weight learned during the training phase. The number of parameters can be a rough indicator of a model's capability and complexity. These parameters are used in huge matrix multiplications across each layer until an output is produced.

Even though OpenAI's GPT4 is not opensource or it's architecture is publically shared. It is speculatted to have more than Trillion parameters with 8 models working together (mixture of experts)
## The Problem with LLMs

## What is a parameter?
As LLMs have billions of parameters, loading all the parameters into memory and performing massive matrix multiplications becomes a challenge. Let's consider the math behind this:

Parameter is a model weight it learned during the training phase , number of parameter can be a rough indicator of model capability and complexity.
For a 70B parameter model (like the Llama2-70B model), the default size in which these parameters are stored is 32 bits (4 bytes). To load this model, you would need:

these parameters will be used in a huge matrix multiplication on each layers until it produces an output.
70B parameters * 4 bytes = 280 GB (about 260 GiB) of memory

This highlights the significant memory requirements for running LLMs.
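
A small helper makes the arithmetic explicit (the 70B figure is the same example as above):

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

# 70B parameters stored as 32-bit floats (4 bytes each).
print(f"{weight_memory_gib(70e9, 4):.2f} GiB")  # ~260.77 GiB
# The same weights in 16-bit precision (2 bytes each).
print(f"{weight_memory_gib(70e9, 2):.2f} GiB")  # ~130.39 GiB
```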

# What is the problem LLMs
## Quantization as a Solution

As the model is huge with billions of parameter, you need to load all the paramters in it's memory and perform huge matrix multiplication.
Quantization is a technique used to reduce the size of the model by decreasing the precision of parameters and storing them in less memory. For example, representing 32-bit floating-point (FP32) parameters in a 16-bit floating-point (FP16) datatype.

Lets do the math
In practice, this loss of precision does not significantly degrade the output quality of LLMs but offers substantial performance improvements in terms of efficiency. By quantizing the model, the memory footprint can be reduced, making it more feasible to run LLMs on resource-constrained systems.

lets say 70B parameter model (like llama2-70b model). Default size in which these parameters are stores is 32bit (4 bytes)
Quantization allows for a trade-off between model size and performance, enabling the deployment of LLMs in a wider range of applications and devices. It is an essential technique for making LLMs more accessible and efficient while maintaining their impressive capabilities.

to load this model , you need 70B * 4 bytes = 260.77GB of memory

Now you see the problem, correct.

| | Gemma FP 32 bit precision | Gemma FP16 bit precision |
| --------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| | | Planck |
| # of Parameters | >> model.num_parameters()<br>2,506,172,416<br> | >> model.num_parameters()<br>2,506,172,416 |
| Memory Size based on # Parameters | >> model.num_parameters() * 4 / (1024**3)<br>9.336219787597656 GB | >> model.num_parameters() * 2 / (1024**3)<br>4.668109893798828 GB |
| Memory Footprint | >>model.get_memory_footprint() / (1024**3)<br>9.398719787597656 GB | >>model.get_memory_footprint() / (1024**3)<br>4.730609893798828 GB |
| Average Inference time | 10.36 seconds | 7.48 seconds |
| Distribution of weights | ![[Pasted image 20240420104243.png]] | ![[Pasted image 20240420104320.png]] |
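
Values like those in the table can be reproduced with something like the sketch below, assuming the Hugging Face `transformers` library and the `google/gemma-2b` checkpoint; exact numbers may vary by library version:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the same checkpoint in full precision and half precision.
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float32
)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16
)

for name, model in [("FP32", model_fp32), ("FP16", model_fp16)]:
    params = model.num_parameters()
    footprint_gib = model.get_memory_footprint() / (1024 ** 3)
    print(f"{name}: {params:,} parameters, {footprint_gib:.2f} GiB footprint")
```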

What is Quantization?

Quantization is a method to reduce the size of the model, by reducing the precision of parameters and store them in less memory , example representing fp32 in fp16 datatype.

Practically speaking this loss of precision wouldn't significantly reduce the LLM output quality but will have significant performance in terms of efficiency.


| | Gemma FP 32 bit precision | Gemma FP16 bit precision |
| --------------------------------- | ------------------------------------------------------------------- | ------------------------------------------- |
| | | Planck |
| # of Parameters | >> model.num_parameters()<br>2,506,172,416<br> | >> model.num_parameters()<br>2,506,172,416 |
| Memory Size based on # Parameters | >> model.num_parameters() * 4 / (1024**3)<br>9.336219787597656 GB | >> model.num_parameters() * 4 / (1024**3) |
| Memory Footprint | >>model.get_memory_footprint() / (1024**3)<br>9.398719787597656 GB | >>model.get_memory_footprint() / (1024**3) |
| | | |
| | | |
| | | |
It is ~28% faster with ~50% less memory.

How about accuracy?

I ran both models on the same prompts and computed a similarity score between their outputs using OpenAI's `text-embedding-3-large` embeddings (see the sketch below).
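
A minimal sketch of that comparison, assuming the `openai` Python client and `numpy`; the two output strings are placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder outputs from the FP32 and FP16 models.
output_fp32 = "...answer produced by the FP32 model..."
output_fp16 = "...answer produced by the FP16 model..."

# Embed both outputs in one request, then compare with cosine similarity.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=[output_fp32, output_fp16],
)
a = np.array(resp.data[0].embedding)
b = np.array(resp.data[1].embedding)
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.4f}")
```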

