Commit

updated Quantized LLM Models

prabha-git committed Apr 20, 2024
1 parent 960f2fb commit 90fd848
Showing 15 changed files with 121 additions and 30 deletions.
20 changes: 20 additions & 0 deletions docs/writing/posts/Karpathy's - let's build GPT from scratch.md
@@ -0,0 +1,20 @@
---
draft: true
date: 2024-03-19
slug: lets-build-gpt-from-scratch
tags:
- llm
authors:
- Prabha
---
!!!note "Self Note"
This note is for me to understand the concepts


!!!note "Learning Resource"
Karpathy's tutorial on YouTube: [Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2794s)


ChatGPT is a probabilistic system.

LLMs are built on the Transformer neural network architecture.
Binary file removed docs/writing/posts/Quantized LLM Models.png
Binary file not shown.
12 changes: 12 additions & 0 deletions docs/writing/posts/SentencePiece Model Files.md
@@ -0,0 +1,12 @@
---
draft: true
date: 2024-03-12
slug:
tags:
- llm
authors:
- Prabha
---

- `.spm` is the file extension associated with SentencePiece model files.
- SentencePiece is a library and tool for unsupervised text tokenization and detokenization (see the sketch below).
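
A minimal usage sketch, assuming the `sentencepiece` Python package is installed and using a hypothetical `tokenizer.spm` file:

```python
import sentencepiece as spm

# Load a trained SentencePiece model from an .spm file
# (the file name here is a placeholder).
sp = spm.SentencePieceProcessor(model_file="tokenizer.spm")

text = "SentencePiece splits text into subword pieces."

# Tokenize into subword strings and integer ids, then detokenize.
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)
restored = sp.decode(ids)

print(pieces)
print(ids)
print(restored)
```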
17 changes: 17 additions & 0 deletions docs/writing/posts/Tracking Looker's Download.md
@@ -0,0 +1,17 @@
---
draft: true
date: 2024-03-06
slug: looker-download-tracking
tags:
- "#looker"
authors:
- Prabha
---
# Question: Is there a way to track the user downloads in Looker?

Response from Looker Support below:
So, there is no direct way to track download activities. This is currently a Feature Request - [https://portal.feedback.us.pendo.io/app/#/case/190687](https://portal.feedback.us.pendo.io/app/#/case/190687)

However, we have a workaround using System Activity Event Attribute explore, please check the link - [https://madhive.cloud.looker.com/explore/system__activity/event_attribute?fields=event.created_time,user.name,event_attribute.event_id,event_attribute.name,event_attribute.value&f[event.category]=query&f[event.created_date]=30+days&f[event.name]=export%5E_query&sorts=event.created_time&limit=5000&query_timezone=America%2FNew_York&vis=%7B%7D&filter_config=%7B%22event.category%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22query%22%7D%2C%7B%7D%5D%2C%22id%22%3A0%2C%22error%22%3Afalse%7D%5D%2C%22event.created_date%22%3A%5B%7B%22type%22%3A%22past%22%2C%22values%22%3A%5B%7B%22constant%22%3A%2230%22%2C%22unit%22%3A%22day%22%7D%2C%7B%7D%5D%2C%22id%22%3A2%2C%22error%22%3Afalse%7D%5D%2C%22event.name%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22export_query%22%7D%2C%7B%7D%5D%2C%22id%22%3A4%2C%22error%22%3Afalse%7D%5D%7D&dynamic_fields=%5B%5D&origin=share-expanded](https://madhive.cloud.looker.com/explore/system__activity/event_attribute?fields=event.created_time,user.name,event_attribute.event_id,event_attribute.name,event_attribute.value&f[event.category]=query&f[event.created_date]=30+days&f[event.name]=export%5E_query&sorts=event.created_time&limit=5000&query_timezone=America%2FNew_York&vis=%7B%7D&filter_config=%7B%22event.category%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22query%22%7D%2C%7B%7D%5D%2C%22id%22%3A0%2C%22error%22%3Afalse%7D%5D%2C%22event.created_date%22%3A%5B%7B%22type%22%3A%22past%22%2C%22values%22%3A%5B%7B%22constant%22%3A%2230%22%2C%22unit%22%3A%22day%22%7D%2C%7B%7D%5D%2C%22id%22%3A2%2C%22error%22%3Afalse%7D%5D%2C%22event.name%22%3A%5B%7B%22type%22%3A%22%3D%22%2C%22values%22%3A%5B%7B%22constant%22%3A%22export_query%22%7D%2C%7B%7D%5D%2C%22id%22%3A4%2C%22error%22%3Afalse%7D%5D%7D&dynamic_fields=%5B%5D&origin=share-expanded)

You can also refer to- [https://www.googlecloudcommunity.com/gc/Technical-Tips-Tricks/Track-Downloads-in-System-Activity-workaround/ta-p/586869](https://www.googlecloudcommunity.com/gc/Technical-Tips-Tricks/Track-Downloads-in-System-Activity-workaround/ta-p/586869)
9 changes: 9 additions & 0 deletions docs/writing/posts/gpus.md
@@ -0,0 +1,9 @@




| GPU | Cost | Speed (General Performance) | Memory | Energy Consumption | Training Suitability | Inference Suitability | Other Factors |
| --------------- | ---- | -------------------------------------- | ------------- | ------------------ | -------------------- | --------------------- | -------------------------------------------------------------------------------- |
| **NVIDIA T4** | $$ | Good for inference | 16 GB GDDR6 | Low (70W) | Limited | High | Efficient for edge computing and power-sensitive environments |
| **NVIDIA V100** | $$$ | Excellent for training & inference | 16-32 GB HBM2 | Moderate (250W+) | High | High | Well-suited for AI model training, HPC |
| **NVIDIA A100** | $$$$ | Superior for both training & inference | 40 GB HBM2e | High (400W) | Very High | Very High | Supports Multi-Instance GPU (MIG) for versatile workloads, Sparsity acceleration |
7 binary files not shown.
25 changes: 25 additions & 0 deletions docs/writing/posts/my-q&a.md
@@ -0,0 +1,25 @@
---
draft: true
date: 2024-03-06
slug: my-q_and_a
tags:
- llm
authors:
- Prabha
---
!!!note "Self Note"
This note is for me to understand the concepts
# My Q&A

## Why is the fine-tuned LLM model faster?




### Why does the DSPy compiled program score less than the uncompiled program?

[Colab Notebook](https://colab.research.google.com/drive/1wwyCGgKizNZo48IzfKa9m4uMp2SNRBSF#scrollTo=IyjklZsKCxF-)

- Found that chain of thought doesn't give the expected output (it produces a blank output), so the metric calculation is not meaningful.

- So make sure the DSPy LM is producing the expected output before trusting any metric (see the sketch below).
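
A minimal sanity check, assuming DSPy with an OpenAI backend; the model name and the `question -> answer` signature here are illustrative:

```python
import dspy

# Configure the language model backend (model name is illustrative).
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=250)
dspy.settings.configure(lm=lm)

# A chain-of-thought module with a simple question -> answer signature.
cot = dspy.ChainOfThought("question -> answer")

# Inspect the raw prediction before computing any metric:
# if `answer` comes back blank, the metric comparison is meaningless.
pred = cot(question="What is 2 + 2?")
print(repr(pred.answer))
```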
13 changes: 13 additions & 0 deletions docs/writing/posts/quantization.md
@@ -0,0 +1,13 @@

- LLM models are getting bigger; quantization helps reduce model size with little to no quality degradation.
- ![[Pasted image 20240415112817.png]]

- Some state-of-the-art methods to reduce model size are:
    - Pruning
        - Removing layers that do not contribute to model decisions
    - Knowledge Distillation
        - Train a smaller student model using the larger teacher model
        - The challenge is that you still need to fit the original large model on your machine
    - Quantization
        - In a neural network you can quantize weights and activations
        - The idea is to represent model weights with lower precision (achieved by converting to a different dtype), as sketched below
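
A minimal sketch of that idea in PyTorch, using a random tensor as a stand-in for a layer's weights:

```python
import torch

# Stand-in for a layer's FP32 weight matrix.
weights_fp32 = torch.randn(1024, 1024, dtype=torch.float32)

# The same weights in half precision: half the memory, some loss of precision.
weights_fp16 = weights_fp32.to(torch.float16)

print(weights_fp32.element_size() * weights_fp32.nelement())  # 4,194,304 bytes
print(weights_fp16.element_size() * weights_fp16.nelement())  # 2,097,152 bytes
```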
55 changes: 25 additions & 30 deletions docs/writing/posts/quantized-llm-models.md
@@ -8,53 +8,48 @@ authors:
- Prabha
---

# Quantized LLM Model
# Quantized LLM Models

LLM models are large with billions of parameter,
Large Language Models (LLMs) are known for their vast number of parameters, often reaching billions. For example, open-source models like Llama2 come in sizes of 7B, 13B, and 70B parameters, while Google's Gemma has 2B parameters. Although OpenAI's GPT-4 architecture is not publicly shared, it is speculated to have more than a trillion parameters, with 8 models working together in a mixture of experts approach.

for example, Open source Models like llama2 comes in the sizes 7B, 13B and 70B
## Understanding Parameters

Google's Gemma is 2B
A parameter is a model weight learned during the training phase. The number of parameters can be a rough indicator of a model's capability and complexity. These parameters are used in huge matrix multiplications across each layer until an output is produced.

Even though OpenAI's GPT4 is not opensource or it's architecture is publically shared. It is speculatted to have more than Trillion parameters with 8 models working together (mixture of experts)
## The Problem with LLMs

## What is a parameter?
As LLMs have billions of parameters, loading all the parameters into memory and performing massive matrix multiplications becomes a challenge. Let's consider the math behind this:

Parameter is a model weight it learned during the training phase , number of parameter can be a rough indicator of model capability and complexity.
For a 70B parameter model (like the Llama2-70B model), the default size in which these parameters are stored is 32 bits (4 bytes). To load this model, you would need:

these parameters will be used in a huge matrix multiplication on each layers until it produces an output.
70B parameters * 4 bytes = 280 GB (about 260 GiB) of memory

This highlights the significant memory requirements for running LLMs.
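
A small helper makes the arithmetic explicit (the 70B figure is the same example as above):

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

# 70B parameters stored as 32-bit floats (4 bytes each).
print(f"{weight_memory_gib(70e9, 4):.2f} GiB")  # ~260.77 GiB
# The same weights in 16-bit precision (2 bytes each).
print(f"{weight_memory_gib(70e9, 2):.2f} GiB")  # ~130.39 GiB
```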

# What is the problem LLMs
## Quantization as a Solution

As the model is huge with billions of parameter, you need to load all the paramters in it's memory and perform huge matrix multiplication.
Quantization is a technique used to reduce the size of the model by decreasing the precision of parameters and storing them in less memory. For example, representing 32-bit floating-point (FP32) parameters in a 16-bit floating-point (FP16) datatype.

Lets do the math
In practice, this loss of precision does not significantly degrade the output quality of LLMs but offers substantial performance improvements in terms of efficiency. By quantizing the model, the memory footprint can be reduced, making it more feasible to run LLMs on resource-constrained systems.

lets say 70B parameter model (like llama2-70b model). Default size in which these parameters are stores is 32bit (4 bytes)
Quantization allows for a trade-off between model size and performance, enabling the deployment of LLMs in a wider range of applications and devices. It is an essential technique for making LLMs more accessible and efficient while maintaining their impressive capabilities.

to load this model , you need 70B * 4 bytes = 260.77GB of memory

Now you see the problem, correct.

| | Gemma FP 32 bit precision | Gemma FP16 bit precision |
| --------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| | | Planck |
| # of Parameters | >> model.num_parameters()<br>2,506,172,416<br> | >> model.num_parameters()<br>2,506,172,416 |
| Memory Size based on # Parameters | >> model.num_parameters() * 4 / (1024**3)<br>9.336219787597656 GB | >> model.num_parameters() * 2 / (1024**3)<br>4.668109893798828 GB |
| Memory Footprint | >>model.get_memory_footprint() / (1024**3)<br>9.398719787597656 GB | >>model.get_memory_footprint() / (1024**3)<br>4.730609893798828 GB |
| Average Inference time | 10.36 seconds | 7.48 seconds |
| Distribution of weights | ![[Pasted image 20240420104243.png]] | ![[Pasted image 20240420104320.png]] |
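
Values like those in the table can be reproduced with something like the sketch below, assuming the Hugging Face `transformers` library and the `google/gemma-2b` checkpoint; exact numbers may vary by library version:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the same checkpoint in full precision and half precision.
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float32
)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16
)

for name, model in [("FP32", model_fp32), ("FP16", model_fp16)]:
    params = model.num_parameters()
    footprint_gib = model.get_memory_footprint() / (1024 ** 3)
    print(f"{name}: {params:,} parameters, {footprint_gib:.2f} GiB footprint")
```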

What is Quantization?

Quantization is a method to reduce the size of the model, by reducing the precision of parameters and store them in less memory , example representing fp32 in fp16 datatype.

Practically speaking this loss of precision wouldn't significantly reduce the LLM output quality but will have significant performance in terms of efficiency.


| | Gemma FP 32 bit precision | Gemma FP16 bit precision |
| --------------------------------- | ------------------------------------------------------------------- | ------------------------------------------- |
| | | Planck |
| # of Parameters | >> model.num_parameters()<br>2,506,172,416<br> | >> model.num_parameters()<br>2,506,172,416 |
| Memory Size based on # Parameters | >> model.num_parameters() * 4 / (1024**3)<br>9.336219787597656 GB | >> model.num_parameters() * 4 / (1024**3) |
| Memory Footprint | >>model.get_memory_footprint() / (1024**3)<br>9.398719787597656 GB | >>model.get_memory_footprint() / (1024**3) |
| | | |
| | | |
| | | |
It is ~28% faster with ~50% less memory.

How about accuracy?

I ran both models on the same prompts and computed a similarity score between their outputs using OpenAI's `text-embedding-3-large` embeddings (see the sketch below).
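
A minimal sketch of that comparison, assuming the `openai` Python client and `numpy`; the two output strings are placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder outputs from the FP32 and FP16 models.
output_fp32 = "...answer produced by the FP32 model..."
output_fp16 = "...answer produced by the FP16 model..."

# Embed both outputs in one request, then compare with cosine similarity.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=[output_fp32, output_fp16],
)
a = np.array(resp.data[0].embedding)
b = np.array(resp.data[1].embedding)
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.4f}")
```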

