# LlamaIndex - Private Setup

Using GPT4ALL and HuggingFace embeddings, we leverage the [MongoDB Loader from LlamaHub](https://llamahub.ai/l/mongo) to load documents from a MongoDB database. With llama-index, these MongoDB documents are ingested and vectorized so that questions about them can be answered.

Inspired by the recent popularity of [PrivateGPT](https://github.com/imartinez/privateGPT), this notebook walks you through a llama-index setup that uses entirely local models. We use GPT4ALL and HuggingFace embeddings, which should run decently well on CPU alone. If you have more resources, links further down show how to set up any LLM from HuggingFace and run it on GPU.

This notebook is also inspired by the [LlamaIndex - Local Model Demo.ipynb](https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing).

## Dependencies Setup

### Download gpt4all model

In [3]:
!wget https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin

--2023-05-26 13:45:41--  https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin
Resolving gpt4all.io (gpt4all.io)... 104.26.0.159, 104.26.1.159, 172.67.71.169, ...
Connecting to gpt4all.io (gpt4all.io)|104.26.0.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3785248281 (3.5G)
Saving to: ‘ggml-gpt4all-j-v1.3-groovy.bin.2’


2023-05-26 13:46:48 (54.6 MB/s) - ‘ggml-gpt4all-j-v1.3-groovy.bin.2’ saved [3785248281/3785248281]



### Download extra packages

In [4]:
%pip install pygpt4all llama-index sentence_transformers accelerate pymongo langchain

Note: you may need to restart the kernel to use updated packages.


## Documents setup

For this demo, we query the `movies` collection of the `sample_mflix` database, which is part of the sample dataset available in MongoDB Atlas. For performance reasons, we created a new collection, "movies_short", with just 500 documents. We then cloned an existing document into a new one titled "The Paolo Picello movie" and filled it with fictitious information. Querying for this document lets us check whether the system actually retrieves knowledge from the MongoDB documents.
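
The source does not show how the fictitious document was created, so the following is only a minimal sketch of one way to clone a document. The helper name `clone_with_overrides`, the sample dictionary, and the placeholder plot text are all hypothetical; the commented pymongo calls assume a `db` handle to `sample_mflix`.

```python
import copy


def clone_with_overrides(doc: dict, overrides: dict) -> dict:
    """Return a deep copy of `doc` without its `_id`, updated with `overrides`."""
    clone = copy.deepcopy(doc)
    # Drop the primary key so that MongoDB assigns a fresh one on insert.
    clone.pop("_id", None)
    clone.update(overrides)
    return clone


# Hypothetical stand-in for a document read from the movies collection.
existing = {
    "_id": 1,
    "title": "The Cheat",
    "plot": "A stockbroker's wife embezzles charity funds.",
    "fullplot": "A longer description of the same story.",
}

fictitious = clone_with_overrides(
    existing,
    {
        "title": "The Paolo Picello movie",
        "plot": "<fictitious plot goes here>",
        "fullplot": "<fictitious full plot goes here>",
    },
)
print(fictitious["title"])

# With pymongo (assumption: `db` is a handle to the sample_mflix database):
#   db.movies_short.insert_many(db.movies.find().limit(500))
#   db.movies_short.insert_one(fictitious)
```

Removing `_id` before inserting avoids a duplicate-key error, since the clone would otherwise carry the same primary key as the document it was copied from.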

In [5]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import GPT4All
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index import (
    GPTVectorStoreIndex, 
    LangchainEmbedding, 
    LLMPredictor, 
    ServiceContext, 
    StorageContext, 
    download_loader,
    PromptHelper
)



In [6]:
from llama_index import SimpleMongoReader

# Replace username, password, and cluster address with your own Atlas details
host = "mongodb+srv://username:password@cluster0.4m8aa.mongodb.net"
port = 27017
db_name = "sample_mflix"
collection_name = "movies_short"
# Fields whose values make up the text of each loaded document
field_names = ["title", "plot", "fullplot"]
# query_dict is passed into db.collection.find(); an empty dict matches every document
query_dict = {}
reader = SimpleMongoReader(host, port)
documents = reader.load_data(db_name, collection_name, field_names, query_dict=query_dict)
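
The connection string above contains placeholder credentials. As a hedged sketch, one way to avoid hardcoding real ones is to read them from environment variables and percent-encode them. The function `build_mongo_uri` and the variable names `MONGO_USER` and `MONGO_PASSWORD` are hypothetical, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, cluster_host: str) -> str:
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{cluster_host}"


# MONGO_USER and MONGO_PASSWORD are assumed variable names; the fallbacks are placeholders.
host = build_mongo_uri(
    os.environ.get("MONGO_USER", "username"),
    os.environ.get("MONGO_PASSWORD", "password"),
    "cluster0.4m8aa.mongodb.net",
)
```

Percent-encoding matters when the password contains characters such as `@`, `:`, or `/`, which would otherwise be parsed as part of the URI structure.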

In [7]:
# Print a document to test. Each document holds the text of the selected fields (title, plot, fullplot) of one MongoDB document
documents[10]

Document(text="The CheatA venal, spoiled stockbroker's wife impulsively embezzles $10,000 from the charity she chairs and desperately turns to a Burmese ivory trader to replace the stolen money.Edith Hardy uses charity funds for Wall Street investments in hopes of buying some new gowns. She loses all the money and borrows from wealthy oriental Tori. When her husband gives her the amount she borrowed, Tori won't take it back, branding her shoulder with a Japanese sign of his ownership. She shoots him. Her husband takes the blame. In court Edith reveals all to an angry mob.", doc_id='cc800d65-0a1b-4beb-9d52-e53ee0d5f93e', embedding=None, doc_hash='1f2c28537625838ed7f8d0c6ddf2de3af7e5428452994586962390dcb230b981', extra_info=None)
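
In the output above, the title runs straight into the plot ("The CheatA venal, ..."), which suggests that the field values are concatenated without a separator. The sketch below only illustrates that behavior; `join_fields` is a hypothetical helper, not the reader's actual code, and the sample values are shortened stand-ins.

```python
def join_fields(item: dict, field_names: list[str], separator: str = "") -> str:
    """Concatenate the values of `field_names` from `item` into one text string."""
    missing = [name for name in field_names if name not in item]
    if missing:
        raise KeyError(f"fields missing from document: {missing}")
    return separator.join(str(item[name]) for name in field_names)


# Shortened, hypothetical stand-in for a movie document.
sample = {"title": "The Cheat", "plot": "A venal wife embezzles charity funds."}

print(join_fields(sample, ["title", "plot"]))
# -> The CheatA venal wife embezzles charity funds.

# A separator keeps field boundaries visible in the text that gets embedded.
print(join_fields(sample, ["title", "plot"], separator="\n\n"))
```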

## CPU Llama Index
The GPT4ALL setup follows the instructions from [langchain](https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4all.html).

Then, the model is wrapped in the LLMPredictor class from llama-index. 

Keep in mind that this setup runs on CPU. If you have access to a GPU, you could also run any LLM from HuggingFace for improved speed and performance. More details on HuggingFace LLMs, along with example notebooks, are available [here](https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-huggingface-llm).

Lastly, the embeddings are downloaded and run locally using huggingface. These will automatically run on GPU if you have CUDA installed, otherwise they will run on CPU.

In [8]:
%pip install gpt4all

Note: you may need to restart the kernel to use updated packages.


In [9]:
local_llm_path = './ggml-gpt4all-j-v1.3-groovy.bin'
llm = GPT4All(model=local_llm_path, backend='gptj', streaming=True, n_ctx=512)
llm_predictor = LLMPredictor(llm=llm)

Found model file.
gptj_model_load: loading model from './ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285


In [11]:
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

In [12]:
prompt_helper = PromptHelper(max_input_size=512, num_output=256, max_chunk_overlap=-1000)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    prompt_helper=prompt_helper,
    node_parser=SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=300, chunk_overlap=20))
)
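
To illustrate what `chunk_size=300` and `chunk_overlap=20` mean, here is a simplified sliding-window splitter. It is a sketch under the assumption that tokens are plain list items; the real `TokenTextSplitter` works on tokenizer output and handles separators, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# 700 dummy tokens split with the same settings as the node parser above.
tokens = [str(i) for i in range(700)]
chunks = split_with_overlap(tokens, chunk_size=300, chunk_overlap=20)
print([len(chunk) for chunk in chunks])
# -> [300, 300, 140]
```

Each window starts 280 tokens after the previous one, so the last 20 tokens of a chunk reappear at the start of the next. This overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.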

### Create the Index

This step breaks each document into nodes and creates an embedding vector for each node using our `embed_model`. It may take several minutes when running on CPU!

In [13]:
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

In [16]:
index.storage_context.persist(persist_dir="./storage")

#### (Optional) Load the Index if already saved

In [14]:
from llama_index import load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)
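
Since building the index is slow on CPU, a common pattern is to rebuild it only when no persisted copy exists. The helper below is a library-independent sketch: `load_or_build` is a hypothetical name, and the commented usage assumes the `documents`, `service_context`, and imports defined earlier in this notebook.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str, load: Callable[[str], T], build: Callable[[], T]) -> T:
    """Call `load` if `persist_dir` exists and is non-empty, otherwise call `build`."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build()


# Usage with the objects from this notebook (not executed here):
# index = load_or_build(
#     "./storage",
#     load=lambda d: load_index_from_storage(
#         StorageContext.from_defaults(persist_dir=d), service_context=service_context
#     ),
#     build=lambda: GPTVectorStoreIndex.from_documents(
#         documents, service_context=service_context
#     ),
# )
```

The helper does not persist a freshly built index, so `index.storage_context.persist(persist_dir="./storage")` still has to be called after a build.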

### Try Asking a question

Because CPU inference is slow, we set `similarity_top_k=1`, so that only the single most similar node is retrieved and passed to the LLM. Larger values give the model more context but make responses considerably slower.

In [15]:
query_engine = index.as_query_engine(streaming=True, similarity_top_k=1, service_context=service_context)
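
Conceptually, `similarity_top_k` controls how many of the best-matching nodes are retrieved for a query. The toy example below sketches that idea with cosine similarity over made-up two-dimensional vectors. The function names and vectors are hypothetical, and the actual vector store implementation in llama-index may work differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], node_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the `k` nodes most similar to `query_vec`, best first."""
    ranked = sorted(
        node_vecs,
        key=lambda node_id: cosine_similarity(query_vec, node_vecs[node_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings; real ones from all-mpnet-base-v2 have far more dimensions.
nodes = {"node_a": [1.0, 0.0], "node_b": [0.6, 0.8], "node_c": [0.0, 1.0]}
query = [0.9, 0.1]

print(top_k(query, nodes, k=1))  # -> ['node_a']
print(top_k(query, nodes, k=2))  # -> ['node_a', 'node_b']
```

With `k=1`, only the text of the best match would be placed in the prompt, which keeps the prompt short and the CPU-bound generation step as fast as possible.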

In [21]:
response_stream = query_engine.query("What is the name of the movie that talks about a computer engineer trying to build a demo of how you can leverage AI tools to answer questions around data stored in MongoDB?")
response_stream.print_response_stream()

The name of the movie is "PaoLo Picello".



Interestingly, the system was able to pull my name out of its corpus. The answer is not the exact title we stored in the MongoDB document ("The Paolo Picello movie"), but it is still quite an impressive result.