# Using the open-source 'GPT4All' model locally

* [1. Background](#background)
* [2. Downloading and Converting the Model](#convert)
* [3. Using the Local Model](#using)

<hr>
<a class="anchor" id="background">
    
## 1. Background
    
</a>

Two limitations in particular can restrict ongoing research on Large Language Models (LLMs).

First, access to the weights and architecture of trained models from the GPT family is usually restricted, and even with access, running them requires significant resources. For example, although Facebook has released its LLaMA model weights under a non-commercial license, running even the smallest variant (7 billion parameters) at full precision is impractical on a typical local PC because of the memory it needs.

Second, the available APIs for pre-trained LLMs are usually not free to build on.

Alternative open-source models (like `GPT4All`, which is trained on top of Facebook’s LLaMA model) aim to overcome these obstacles and make LLMs more accessible to everyone. They can be loaded on a local PC and queried through prompts using only the computer's CPU. The authors of GPT4All incorporated several tricks for efficient fine-tuning and inference. This approach does sacrifice some quality, but the trade-off is between having no access at all and having access to a slightly underpowered model.

<hr>
<a class="anchor" id="convert">
    
## 2. Downloading and Converting the Model
    
</a>

In [1]:
import requests
from pathlib import Path
from tqdm import tqdm

In [3]:
local_path = './models/gpt4all-lora-quantized-ggml.bin'
Path(local_path).parent.mkdir(parents=True, exist_ok=True)

In [4]:
# Download the model from the URL.
# This might take a while since the file is about 4 GB.

url = 'https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized-ggml.bin'

# Send a GET request to the URL to download the file
response = requests.get(url, stream=True)
response.raise_for_status()  # fail early rather than saving an error page to disk

# Open the file in binary mode and write the contents of the response in chunks
with open(local_path, 'wb') as f:
    for chunk in tqdm(response.iter_content(chunk_size=8192)):
        if chunk:
            f.write(chunk)

514266it [12:01, 713.10it/s] 
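
A multi-gigabyte download can be cut short without raising an error, so it may be worth checking the file before converting it. The sketch below is one way to do that with only the standard library: it compares the size on disk with an expected byte count and computes a SHA-256 digest in chunks. The helper names `file_sha256` and `looks_complete` are hypothetical, and no reference checksum is given in this notebook, so the expected values would have to come from a source you trust.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_complete(path, expected_bytes):
    """True if the file exists and has exactly the expected size in bytes."""
    p = Path(path)
    return p.is_file() and p.stat().st_size == expected_bytes
```

If the server sends a `Content-Length` header (not every server does), `looks_complete(local_path, int(response.headers["Content-Length"]))` would be one way to use the size check with the download cell above.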


#### Transform the downloaded file to the latest format

- Start by downloading the code in the llama.cpp repository, or simply clone it using the following command
- Pass the downloaded model file to the `convert.py` script and run it with a Python interpreter

```shell
git clone https://github.com/ggerganov/llama.cpp.git
git -C llama.cpp checkout 2b26469
python3 llama.cpp/convert.py ./models/gpt4all-lora-quantized-ggml.bin
```
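
To see which container format a model file is in, before or after conversion, one option is to read its first four bytes. The sketch below assumes the magic numbers used in llama.cpp sources from around this time (`ggml`, `ggmf`, `ggjt`, stored as a little-endian 32-bit integer); check them against the commit you actually use. `detect_ggml_format` is a hypothetical helper, not part of llama.cpp.

```python
import struct

# Assumed magic numbers of llama.cpp model files from this period,
# stored as a little-endian uint32 at the very start of the file.
GGML_MAGICS = {
    0x67676D6C: "ggml",
    0x67676D66: "ggmf",
    0x67676A74: "ggjt",
}


def detect_ggml_format(path):
    """Return the format name for a model file, or None if unrecognized."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return None  # file too short to contain a magic number
    (magic,) = struct.unpack("<I", header)
    return GGML_MAGICS.get(magic)
```

For a successfully converted file, `detect_ggml_format("./models/ggml-model-q4_0.bin")` would be expected to agree with the `format = ggjt` line that the loader prints when the model is opened.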

Running `convert.py` creates a new file named `ggml-model-q4_0.bin` in the same directory as the original model. It holds the pre-trained model weights at **4-bit precision in the GGML format**. Representing each weight with fewer bits reduces memory usage and allows faster inference.
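
A rough back-of-the-envelope calculation shows why the bit width matters. The sketch below estimates the size of the weights alone for a 7-billion-parameter model at several precisions; the function name `weight_memory_gib` is hypothetical, and the estimate ignores activations, the context cache, and file-format overhead.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


n_params = 7e9  # parameter count quoted in the Background section

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

This gives roughly 26 GiB, 13 GiB, and 3.3 GiB respectively. Real quantized files are somewhat larger than the 4-bit figure because block-quantized formats such as Q4_0 also store scale factors alongside the 4-bit values.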

<hr>
<a class="anchor" id="using">
    
## 3. Using the Local Model
    
</a>

In [None]:
# LangChain's GPT4All wrapper needs Python bindings to load the converted weights.
# Older LangChain releases use the pyllamacpp module, newer ones the gpt4all package.
# !pip install pyllamacpp
# !pip install gpt4all

In [4]:
from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [5]:
# Defining the prompt
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
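
To see the text the model actually receives, the template can be filled in by hand. The sketch below uses plain `str.format` instead of LangChain, on the assumption that `PromptTemplate` with its default f-string-style format performs the same substitution for a simple template like this one. `render_prompt` is a hypothetical helper.

```python
template = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(question):
    """Substitute the question into the template text."""
    return template.format(question=question)


print(render_prompt("What is the reason for the earthquakes?"))
```

The prompt ends with "Let's think step by step.", so the model continues from that phrase, which nudges it toward spelling out its reasoning before giving an answer.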

In [6]:
callback_manager = AsyncCallbackManager([StreamingStdOutCallbackHandler()])
llm = GPT4All(model="./models/ggml-model-q4_0.bin", 
              callback_manager=callback_manager, 
              verbose=True)
llm_chain = LLMChain(prompt=prompt, llm=llm)

Found model file at  ./models/ggml-model-q4_0.bin


objc[72777]: Class GGMLMetalClass is implemented in both /Users/iryna/Documents/projects/langchain_snippets/langenv/lib/python3.10/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x169300208) and /Users/iryna/Documents/projects/langchain_snippets/langenv/lib/python3.10/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x16954c208). One of the two will be used. Which one is undefined.
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_loa

In [11]:
question = "What is the reason for the earthquakes?"
llm_chain.run(question)

 Earthquake occurs due to movement of tectonic plates, which are huge slabs that make up the outer layer of our planet. These plates move slowly over time and can cause sudden movements when they collide or slide past each other. The reason for this is not fully understood but it could be related to changes in temperature or pressure deep within the earth's crust, which causes stress on the tectonic plates.

" Earthquake occurs due to movement of tectonic plates, which are huge slabs that make up the outer layer of our planet. These plates move slowly over time and can cause sudden movements when they collide or slide past each other. The reason for this is not fully understood but it could be related to changes in temperature or pressure deep within the earth's crust, which causes stress on the tectonic plates."