# Main Jupyter Notebook for the CS205 Final Project

## In this project we will explore a use case of Large Language Models
### We will start with how to use an LLM through HuggingFace and explain some of the basic concepts behind an LLM. Once we have a good understanding of how to use an LLM to generate text, we will explore Retrieval Augmented Generation (RAG).

#### For this project we used the Llama 2 7-billion-parameter model with OpenAI's text-embedding-ada-002 embedding model. Llama 2 7B was served locally by Ollama. We used Llama Index and LangChain to interact with the LLM.

#### The data store is a directory named 'data' at the root of the project directory. Create this directory and place the source documents in it before running the indexing and query cells.

### Let's understand how to use an LLM using HuggingFace

In [1]:
import torch

#Import the model and tokenizer classes from HuggingFace Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer



In [3]:
#Fetch Meta's OPT LLM with 1.3 billion parameters. This is quite a small model compared to state-of-the-art models such as GPT-4V.

model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')

##### Open Pre-trained Transformer (OPT) is a collection of decoder-only transformers developed by Meta.

In [10]:
input_text = 'I like CS205 Artificial Intelligence course, because'
tok_input = tokenizer(input_text, return_tensors='pt', add_special_tokens=True, truncation=True) #Create tokens from the given input and return PyTorch tensors

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [23]:
torch.manual_seed(123)

generated_output = model.generate(**tok_input, 
                                  max_new_tokens=200, 
                                  return_dict_in_generate=True, 
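                                  do_sample=True) #Sampling is enabled, so the output varies between runs unless the seed is fixed (as above).
                                                  #With do_sample=True the next token is drawn from the model's distribution; pass top_k to restrict sampling to the k most likely tokens
                                                  #Set do_sample=False for greedy decoding, which gives the same output each time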
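
The difference between greedy decoding and sampling can be illustrated without a model. The sketch below is a toy illustration only: the vocabulary, the scores, and the helper names `greedy` and `top_k_sample` are made up for this example and are not part of the Transformers API. It shows that greedy selection always returns the same token, while top-k sampling draws from the k best candidates and is reproducible only when the random seed is fixed.

```python
import math
import random

# Hypothetical next-token scores (logits) for a tiny made-up vocabulary.
vocab = ["fun", "hard", "useful", "long", "boring"]
logits = [2.0, 1.5, 1.0, 0.2, -1.0]

def greedy(logits):
    """Return the index of the highest-scoring token (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng):
    """Keep the k best tokens, weight them by exp(logit), and draw one."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [math.exp(logits[i]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(123)  # fixed seed, analogous to torch.manual_seed(123)
greedy_tokens = [vocab[greedy(logits)] for _ in range(5)]
sampled_tokens = [vocab[top_k_sample(logits, k=3, rng=rng)] for _ in range(5)]

print(greedy_tokens)   # the same token five times
print(sampled_tokens)  # a mix drawn from the three best tokens
```

Re-creating `random.Random(123)` and sampling again reproduces `sampled_tokens`, which mirrors why the notebook fixes the seed before calling `model.generate`.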


In [24]:
decoded_output = tokenizer.batch_decode(generated_output.sequences, skip_special_tokens=True)[0]
print(decoded_output)

I like CS205 Artificial Intelligence course, because the textbook's not that bad (well maybe if you already know AI).  It's not too hard.  Also CS205 is kind of a hard class because of its topics
Thanks for the reply.  The one thing I am having trouble with is the topic of the assignment. Some other students seem to be talking about a similar topic.  It seems like they just need to put two different things together for the same result.   Do u think its too wordy for me? Or is it just not my thing?
I just looked at the word count for an example, and it looks reasonable.
Do you think you learned anything new in the text?
It really helps to understand how other programs work, how to think about programming, or how to do programming in different languages.  It will really set you up for success.  When you are trying to make a game, you would have to have some kind of software-programming background, and even if


#### We saw text generation in the previous subsection. Now let's explore text summarization, a critical use case of LLMs

#### What is an embedding?
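
An embedding maps a piece of text to a vector of numbers so that texts with similar meaning end up close together. The sketch below uses hand-made 3-dimensional vectors, which are purely hypothetical (a real embedding model returns vectors with many more dimensions), and compares them with cosine similarity, a common closeness measure for embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
toy_embeddings = {
    "rational agent": [0.9, 0.1, 0.0],
    "utility-based agent": [0.8, 0.3, 0.1],
    "chocolate cake": [0.0, 0.2, 0.9],
}

query = toy_embeddings["rational agent"]
sim_agent = cosine_similarity(query, toy_embeddings["utility-based agent"])
sim_cake = cosine_similarity(query, toy_embeddings["chocolate cake"])

print(f"agent vs agent: {sim_agent:.3f}")
print(f"agent vs cake:  {sim_cake:.3f}")
```

The related phrase scores much higher than the unrelated one, which is the property a vector database relies on when it looks up relevant text.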

#### What is RAG?

#### Retrieval Augmented Generation is a technique in generative AI for extending the knowledge of an LLM. The LLM's parameters are fixed after training and do not reflect newer information, so a specialized knowledge base (which can be private) is created for the LLM to access. This is non-parametric memory, i.e., the information is not stored in the learned parameters of the LLM.
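
The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in this notebook: the chunks are made-up sentences, the helper names `retrieve` and `build_prompt` are hypothetical, and word overlap stands in for the embedding similarity that a vector database would compute. The final prompt is what would be sent to the LLM.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def retrieve(query, chunks, top_k=1):
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge base, invented for illustration only.
chunks = [
    "A rational agent chooses the action expected to maximize its performance measure.",
    "A vector database stores embedding vectors for later lookup.",
    "A locally served model runs on the user's own machine.",
]

query = "What does a rational agent maximize?"
top_chunks = retrieve(query, chunks)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

Because the knowledge lives in `chunks` rather than in model weights, it can be updated or kept private without retraining the LLM.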

#### Implementing RAG with Llama Index using ChromaDB as the vector database. Ollama is used to serve Llama 2 locally

In [25]:
import openai

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.embeddings import OpenAIEmbedding

from langchain.llms import Ollama

import chromadb

In [26]:
from dotenv import dotenv_values

In [29]:
api_key = dotenv_values('../.env')["OPENAI_API_KEY"]
openai.api_key = api_key

In [30]:
#Set the LLM and the embedding model

llm = Ollama(model="llama2")
#embed_model = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2") #Local Llama 2 embedding model
embed_model = OpenAIEmbedding() #Using OpenAI's text-embedding-ada-002

In [31]:
COLLECTION = "aiprof"
PATH = '../chroma'

In [32]:
# create client and a new collection
db = chromadb.PersistentClient(path=PATH)
chroma_collection = db.get_or_create_collection(COLLECTION)

In [33]:
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
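
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set near the top of the notebook before the tokenizer is first used; choosing `"false"` here is a judgment call that trades tokenizer speed for a quiet, fork-safe run.

```python
import os

# Set this before the tokenizer is first used so the setting is already
# in place when the process forks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```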


In [34]:
# load documents
documents = SimpleDirectoryReader("../data").load_data()

In [None]:
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)

In [35]:
# load from disk
db2 = chromadb.PersistentClient(path=PATH)
chroma_collection = db2.get_or_create_collection(COLLECTION)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

index2 = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)

In [36]:
query_engine = index2.as_query_engine()

In [37]:
resp = query_engine.query("What is The rational agent approach?")
print(resp.response)

The rational agent approach in artificial intelligence is a method of designing intelligent agents that act rationally based on their perceived environment and available actions. This means that the agent takes into account its current state and the expected outcomes of different actions to make decisions that maximize its performance measure. The rational agent approach is based on the idea that an ideal rational agent should always act in a way that maximizes its performance measure, given the evidence provided by its perceived environment and whatever built-in knowledge it has.

In more detail, the rational agent approach involves defining an ideal rational agent as one that, for each possible percept sequence, chooses the action that is expected to maximize its performance measure. This definition assumes that the agent has a complete perceptual history (percept sequence), knows everything it needs to know about the environment, can perform certain actions, and is not limited by it

In [39]:
resp = query_engine.query("Generate 2 concise questions about rational agents")

In [40]:
resp.response.split('\n')

['1. What is the difference between a performance measure and a utility function in the context of rational agents?',
 '2. How would you design an agent architecture for a domain that involves crossing a busy road, considering the possible policies (logical, goal-based, or utility-based)?']