# LangChain with Open Source LLM and Open Source Embeddings & LangSmith

In the following notebook we will dive into the world of Open Source models hosted on Hugging Face's [inference endpoints](https://ui.endpoints.huggingface.co/).

The notebook will be broken into the following parts:

- 🤝 Breakout Room #1:
  1. Set-up Hugging Face Inference Endpoints
  2. Install required libraries
  3. Set Environment Variables
  4. Testing our Hugging Face Inference Endpoint
  5. Creating LangChain components powered by the endpoints
  6. Retrieving data from Arxiv
  7. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Set-up LangSmith
  2. Creating a LangSmith dataset
  3. Creating a custom evaluator
  4. Initializing our evaluator config
  5. Evaluating our RAG pipeline

# 🤝 Breakout Room #1

## Task 1: Set-up Hugging Face Inference Endpoints

Please follow the instructions provided [here](https://github.com/AI-Maker-Space/AI-Engineering/tree/main/Week%205/Thursday) to set up your Hugging Face inference endpoints for both your LLM and your embedding model.

## Task 2: Install required libraries

Now we've got to get our required libraries!

We'll start with our `langchain` and `huggingface` dependencies.



In [1]:
!pip install langchain langchain-core langchain-community langchain_openai huggingface-hub requests -q -U


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Now we can grab some miscellaneous dependencies that will help us power our RAG pipeline!

In [2]:
!pip install arxiv pymupdf faiss-cpu -q -U


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Task 3: Set Environment Variables

We'll need to set our `HF_TOKEN` so that we can send requests to our protected API endpoint.

We'll also set up our OpenAI API key, which we'll use later for evaluation.



In [3]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HuggingFace Write Token: ")

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Task 4: Testing our Hugging Face Inference Endpoint

Let's submit a sample request to the Hugging Face Inference endpoint!

In [5]:
model_api_gateway = "https://o2evew6lpzlbv8n6.us-east-1.aws.endpoints.huggingface.cloud" # << YOUR ENDPOINT URL HERE

> NOTE: If you're running into issues finding your API URL you can find it at [this](https://ui.endpoints.huggingface.co/) link.

Here's an example:

![image](https://i.imgur.com/XyZhOv8.png)

In [6]:
import requests

max_new_tokens = 256
top_p = 0.9
temperature = 0.1

prompt = "Hello! How are you?"

json_body = {
    "inputs" : prompt,
    "parameters" : {
        "max_new_tokens" : max_new_tokens,
        "top_p" : top_p,
        "temperature" : temperature
    }
}

headers = {
  "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
  "Content-Type": "application/json"
}

response = requests.post(model_api_gateway, json=json_body, headers=headers)
print(response.json())

[{'generated_text': "Hello! How are you? I'm doing well, thanks for asking! *smiles* It's great to see you here! *nods* Is there anything you'd like to chat about? I'm all ears! *winks*"}]
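
The request body and the response handling above can be factored into two small helpers. This is only a sketch: the function names `build_generation_payload` and `extract_generated_text` are hypothetical, and the parser assumes the response has the same shape as the output printed above (a list containing one dict with a `generated_text` key). Anything else, such as an error payload, raises a `ValueError`.

```python
from typing import Any


def build_generation_payload(
    prompt: str,
    max_new_tokens: int = 256,
    top_p: float = 0.9,
    temperature: float = 0.1,
) -> dict[str, Any]:
    """Build a JSON body in the same shape as the request above."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


def extract_generated_text(response_json: Any) -> str:
    """Return the generated text from a response shaped like the output above."""
    if (
        isinstance(response_json, list)
        and response_json
        and isinstance(response_json[0], dict)
        and "generated_text" in response_json[0]
    ):
        return response_json[0]["generated_text"]
    # Assumption: any other shape means the endpoint did not return a completion.
    raise ValueError(f"Unexpected response from endpoint: {response_json!r}")


# Hand-written sample response, used here instead of a live network call.
sample_response = [{"generated_text": "Hello! How are you? I'm doing well."}]
payload = build_generation_payload("Hello! How are you?")
generated = extract_generated_text(sample_response)
print(generated)
```

In the notebook, `payload` would be passed as `json=payload` to `requests.post`, and `response.json()` would be passed to `extract_generated_text`.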


## Task 5: Creating LangChain components powered by the endpoints

We're going to wrap our endpoints in LangChain components so that, thanks to LCEL, we can use them just like any other LCEL component!

### HuggingFaceEndpoint for LLM

We can use the `HuggingFaceEndpoint` found [here](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_endpoint.py) to power our chain - let's look at how we would implement it.

In [7]:
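# Import from langchain_community (installed above); the langchain.llms path is deprecated.
from langchain_community.llms import HuggingFaceEndpoint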

endpoint_url = (
    model_api_gateway
)

hf_llm = HuggingFaceEndpoint(
    endpoint_url=endpoint_url,
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
    task="text-generation"
)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/acrobat/.cache/huggingface/token
Login successful




Now we can use our endpoint like we would any other LLM!

In [8]:
hf_llm.invoke("Hello, how are you?")

" Do you have a moment to chat?\n\nPlease tell me how you are feeling today, or if you have a specific question or topic you would like to discuss. I'm here to listen and help in any way I can.\n\nIs there something on your mind that you would like to talk about? Perhaps you are feeling stressed or overwhelmed and would like some advice on how to manage your emotions. Or maybe you are just looking for a friendly ear to listen to your thoughts and feelings.\n\nWhatever the reason, I am here to support and help you in any way I can. So please, feel free to share your thoughts and feelings with me. I'm here to listen and help in any way I can."

### HuggingFaceInferenceAPIEmbeddings

Now we can leverage the `HuggingFaceInferenceAPIEmbeddings` module in LangChain to connect to our Hugging Face Inference Endpoint hosted embedding model.

In [12]:
embedding_api_gateway = "https://y6gfkpwotdaoaz4f.us-east-1.aws.endpoints.huggingface.cloud" # << Embedding Endpoint API URL

In [13]:
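# Import from langchain_community (installed above); the langchain.embeddings path is deprecated.
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings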

embeddings_model = HuggingFaceInferenceAPIEmbeddings(api_key=os.environ["HF_TOKEN"], api_url=embedding_api_gateway)

In [14]:
embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings")[:10]

[-0.026424622,
 0.035885748,
 0.0094334055,
 0.011660095,
 0.0065645785,
 0.008227667,
 -0.036902077,
 -0.03631076,
 -0.024853928,
 -0.005395797]

#### ❓ Question #1

What is the embedding dimension of your selected embeddings model?

#### Answer #1:
768. (The model has roughly 109M parameters, but that is its size, not its embedding dimension.)
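
One way to check the embedding dimension directly is to embed a short string and take the length of the returned vector. The sketch below uses a hypothetical stand-in, `fake_embed_query`, so that it runs without an endpoint. In the notebook you would pass `embeddings_model.embed_query` instead.

```python
from typing import Callable, Sequence


def embedding_dimension(embed_query: Callable[[str], Sequence[float]]) -> int:
    """Embed a probe string and report the length of the resulting vector."""
    return len(embed_query("dimension probe"))


def fake_embed_query(text: str) -> list[float]:
    """Hypothetical stand-in for the endpoint; returns a 768-dimensional vector."""
    return [0.0] * 768


dimension = embedding_dimension(fake_embed_query)
print(dimension)
```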

## Task 6: Retrieving data from Arxiv

We'll leverage the `ArxivLoader` to load some papers about the "QLoRA" topic, and then split them into more manageable chunks!

In [15]:
from langchain.document_loaders import ArxivLoader

docs = ArxivLoader(query="QLoRA", load_max_docs=5).load()

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 250,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(docs)

In [17]:
len(split_chunks)

1305
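
To build some intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-width chunker. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks and newlines, so its chunks are often shorter than `chunk_size`. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(
    text: str, chunk_size: int = 250, chunk_overlap: int = 0
) -> list[str]:
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample_text = "x" * 600
chunks = fixed_size_chunks(sample_text)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```

With `chunk_overlap = 0`, as in the cell above, no text is repeated between chunks, which keeps the chunk count down but can cut a sentence in half at a boundary.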

Just as we would with OpenAI's embeddings model, we can instantiate our `FAISS` vector store with our documents and our `HuggingFaceInferenceAPIEmbeddings` model!

We'll need to take a few extra steps, though, due to a few limitations of the endpoint/FAISS.

We'll start by embedding our documents in batches of `32`.

> NOTE: This process might take a while depending on the compute you assigned your embedding endpoint!

In [18]:
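embeddings = []
# Step through every chunk in batches of 32. The stop value is len(split_chunks),
# not len(split_chunks) - 1, so a trailing single-chunk batch is never skipped.
for i in range(0, len(split_chunks), 32):
  embeddings.append(embeddings_model.embed_documents([document.page_content for document in split_chunks[i:i+32]]))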

In [19]:
embeddings = [item for sub_list in embeddings for item in sub_list]
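
The batching pattern can also be written as a pair of reusable helpers. This is a sketch under stated assumptions: `batched`, `embed_in_batches`, and `fake_embed_documents` are hypothetical names, and the fake embedder stands in for `embeddings_model.embed_documents` so the example runs offline. Using `extend` produces a flat list directly, and iterating up to `len(items)` means a final partial batch is always included.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_documents: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed texts batch by batch and return one flat list of vectors."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_documents(list(batch)))
    return vectors


def fake_embed_documents(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for the endpoint: one tiny vector per text."""
    return [[float(len(text))] for text in texts]


texts = [f"chunk {i}" for i in range(65)]
batch_sizes = [len(batch) for batch in batched(texts, 32)]
vectors = embed_in_batches(texts, fake_embed_documents, batch_size=32)
print(batch_sizes, len(vectors))
```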

#### ❓ Question #2

Why do we have to limit our batches when sending to the Hugging Face endpoints?

#### Answer #2:
- Limited resources: HF endpoints have limited resources, and sending large batches can overwhelm the system.
- Memory: a larger batch needs more memory, and we are using the smallest GPU.
- Token limit: the container is set to Max Tokens (per Batch) = 16384.
- Latency: sending large batches can increase the latency of the API calls.
- Cost: HF charges for API usage, so large batches can increase the cost of using the service.


#### Checking the shape of each embedding

The first run did not work because the embeddings were inconsistent: embeddings 668 to 671 had shape `(768,)`, embeddings 672 to 678 had shape `()`, and embeddings 679 and 680 had shape `(768,)` again.

I tried `extend` instead of `append` when collecting the embeddings, but that produced nothing. The issue was with my machine: Todd and Chris were correct that it was not set up correctly. It was set up with an Intel CPU instead of a GPU.

In [21]:
import numpy as np
for i, embedding in enumerate(embeddings):
    if i <10:
        print(f"Shape of embedding {i}: {np.array(embedding).shape}") # now we are rolling, previously there were mismatched embeddings.

Shape of embedding 0: (768,)
Shape of embedding 1: (768,)
Shape of embedding 2: (768,)
Shape of embedding 3: (768,)
Shape of embedding 4: (768,)
Shape of embedding 5: (768,)
Shape of embedding 6: (768,)
Shape of embedding 7: (768,)
Shape of embedding 8: (768,)
Shape of embedding 9: (768,)
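
Rather than eyeballing the first ten shapes, we can scan the whole list for entries that are not vectors of the expected length. This is a minimal sketch with a hypothetical helper name, `find_malformed_embeddings`. It assumes that a bad entry is anything other than a list or tuple of length 768, which matches the `()` shapes seen in the failed run.

```python
from typing import Any, Sequence


def find_malformed_embeddings(
    embeddings: Sequence[Any], expected_dim: int = 768
) -> list[int]:
    """Return the indices of entries that are not vectors of expected_dim."""
    bad_indices = []
    for index, vector in enumerate(embeddings):
        is_vector = isinstance(vector, (list, tuple)) and len(vector) == expected_dim
        if not is_vector:
            bad_indices.append(index)
    return bad_indices


# Hand-made sample: two good vectors, a string, a None, and a short vector.
sample = [[0.1] * 768, "error", [0.2] * 768, None, [0.3] * 10]
bad_indices = find_malformed_embeddings(sample)
print(bad_indices)
```

An empty result means every chunk has a usable embedding, which is worth confirming before zipping the texts and vectors together for FAISS.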


Now we can create the text/embedding pairs which we'll use to set up our FAISS VectorStore!

In [22]:
from langchain.vectorstores import FAISS

text_embedding_pairs = list(zip([document.page_content for document in split_chunks], embeddings))

faiss_vectorstore = FAISS.from_embeddings(text_embedding_pairs, embeddings_model)

Next, we set up FAISS as a retriever.

In [23]:
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k" : 5})
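
Conceptually, a retriever with `k = 5` embeds the query and returns the five stored texts whose vectors are closest to it. The brute-force sketch below illustrates that idea on toy two-dimensional vectors. It is a stand-in, not FAISS itself: the helper names are hypothetical, and the choice of Euclidean distance is an assumption about the default index used here.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vector: Sequence[float],
    text_embedding_pairs: Sequence[tuple[str, Sequence[float]]],
    k: int = 5,
) -> list[str]:
    """Return the k texts whose embeddings are nearest to query_vector."""
    ranked = sorted(
        text_embedding_pairs,
        key=lambda pair: euclidean_distance(query_vector, pair[1]),
    )
    return [text for text, _ in ranked[:k]]


pairs = [
    ("far", [10.0, 10.0]),
    ("near", [1.0, 0.9]),
    ("nearest", [1.0, 1.0]),
    ("mid", [3.0, 3.0]),
]
nearest = top_k([1.0, 1.0], pairs, k=2)
print(nearest)
```

FAISS does the same job with optimized index structures, which matters once the store holds far more than the 1305 chunks we have here.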

Let's test it out!

In [25]:
# faiss_retriever.get_relevant_documents("What optimizer does QLoRA use?")  # deprecated
faiss_retriever.invoke("What optimizer does QLoRA use?")

[Document(page_content='We have discussed how QLoRA works and how it can significantly reduce the required memory for\nfinetuning models. The main question now is whether QLoRA can perform as well as full-model'),
 Document(page_content='of QDyLoRA through several instruct-fine-tuning\nTable 3: Comparing the performance of DyLoRA, QLoRA and QDyLoRA across different evaluation ranks. all'),
 Document(page_content='QLoRA delivers convincing accuracy improvements across\nthe LLaMA and LLaMA2 families, even with 2-4 bit-widths,\naccompanied by a minimal 0.45% increase in time con-\nsumption. Remarkably versatile, IR-QLoRA seamlessly'),
 Document(page_content='performance degradation. Our method, QLORA, uses a novel high-precision technique to quantize\na pretrained model to 4-bit, then adds a small set of learnable Low-rank Adapter weights [28]\n∗Equal contribution.'),
 Document(page_content='Moreover, QLoRA is trained\non a pre-defined rank and, therefore, cannot\nbe reconfigured for its 

### Prompt Template

Now that we have our LLM and our Retriever set up, let's connect them with our Prompt Template!

In [26]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT_TEMPLATE)

#### ❓ Question #3

Does the ordering of the prompt matter?



#### Answer #3:
Yes, it does matter. The `{context}` and `{question}` placeholders must be positioned correctly within the template so that each section of the prompt receives the right information.
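
To see exactly what the model receives, we can fill the same template with plain `str.format`. This is only an illustration of the placeholder substitution, and the sample context and question are made up for the example. In the chain, `ChatPromptTemplate` performs the equivalent substitution and wraps the result in a chat message.

```python
RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""

# Illustrative values only; in the chain these come from the retriever and the user.
filled_prompt = RAG_PROMPT_TEMPLATE.format(
    context="QLoRA quantizes a pretrained model to 4-bit.",
    question="What does QLoRA quantize?",
)
print(filled_prompt)
```

The instruction comes first, then the context, then the question, so the model reads the supporting material before it reaches the thing it has to answer.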

## Task 7: Creating a simple RAG pipeline with LangChain v0.1.0

All that's left to do is set up a RAG chain - and away we go!

In [27]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

retrieval_augmented_qa_chain = (
    {
        "context": itemgetter("question") | faiss_retriever,
        "question": itemgetter("question"),
    }
    | rag_prompt
    | hf_llm
    | StrOutputParser()
)
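
If the pipe syntax feels opaque, the data flow of the chain can be sketched in plain Python. Everything here is a hypothetical stand-in (`fake_retriever`, `fake_prompt`, `fake_llm`, `run_chain`), written only to show the order of the steps. It is not how LCEL is implemented.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for faiss_retriever: returns one placeholder passage."""
    return [f"(retrieved passage for: {question})"]


def fake_prompt(values: dict) -> str:
    """Stand-in for rag_prompt: fills context and question into text."""
    context = "\n".join(values["context"])
    return f"Context:\n{context}\n\nQuestion:\n{values['question']}\n"


def fake_llm(prompt: str) -> str:
    """Stand-in for hf_llm: returns a canned completion."""
    return f"ANSWER based on {len(prompt)} prompt characters"


def run_chain(inputs: dict) -> str:
    # Step 1: the leading dict maps the single input to both prompt variables.
    prompt_values = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the prompt template turns those values into text.
    prompt_text = fake_prompt(prompt_values)
    # Step 3: the LLM completes the prompt.
    completion = fake_llm(prompt_text)
    # Step 4: the output parser returns a plain string.
    return str(completion)


answer = run_chain({"question": "What is QLoRA all about?"})
print(answer)
```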

Let's test it out!

In [28]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLORA all about?"})

'\nAnswer:\nQLORA is a method for fine-tuning high-quality language models (LLMs) much more widely and easily accessible. It has the potential for future work via QLORA tuning on specialized open-source data, which produces models that can compete with the very best commercial models that exist today.'

# 🤝 Breakout Room #2

## Task 1: Set-up LangSmith

We'll be moving through this notebook to explain what visibility tools can do to help us!

Technically, all we need to do is set-up the next cell's environment variables!

In [55]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE1 - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')
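
Since tracing is driven entirely by environment variables, a quick check that they are all set can save some confusion when no traces appear. This is a small sketch: the helper name `missing_langsmith_vars` is hypothetical, and the list simply mirrors the four variables set in the cell above.

```python
import os
from typing import Mapping

# The four variables set in the cell above.
LANGSMITH_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
)


def missing_langsmith_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of any variables that are unset or empty."""
    return [name for name in LANGSMITH_VARS if not env.get(name)]


# Example with a hand-built mapping instead of the real environment.
example_env = {"LANGCHAIN_TRACING_V2": "true", "LANGCHAIN_PROJECT": "AIE1 - demo"}
missing = missing_langsmith_vars(example_env)
print(missing)
```

Calling `missing_langsmith_vars()` with no arguments checks the real environment of the notebook kernel.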

Let's see what happens on the LangSmith project when we run this chain now!

In [56]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA all about?"})

'\nAnswer:\nQLoRA is a method for finetuning large language models (LLMs) that significantly reduces the required memory for finetuning. It works by using a combination of quantization and fine-tuning to make the finetuning process much more widely and easily accessible. QLoRA has the potential to significantly improve the performance of LLMs on a variety of tasks, including text classification, sentiment analysis, and language translation. Additionally, QLoRA can be used to tune models on specialized open-source data, which can produce models that can compete with the very best commercial models available today. Overall, QLoRA is a powerful tool for improving the performance of LLMs, and it has the potential to significantly enhance the capabilities of LLMs in a variety of applications.'

We get *all of this information* for "free":

![image](https://i.imgur.com/8Wcpmcj.png)

> NOTE: We'll walk through this diagram in detail in class.

#### 🏗️ Activity #1:

Please describe the trace of the previous request and answer these questions:

1. How many tokens did the request use?
2. How long did the `HuggingFaceEndpoint` take to complete?

#### Activity 1:
1. 368 tokens - 268 prompt, 100 completion
2. 7.98 secs
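
The numbers from the trace can be turned into a rough throughput figure. This is simple arithmetic on the values reported above. Note that the latency includes prompt processing and any network time, so the result understates the raw generation speed.

```python
prompt_tokens = 268
completion_tokens = 100
latency_seconds = 7.98

total_tokens = prompt_tokens + completion_tokens
tokens_per_second = completion_tokens / latency_seconds

print(f"Total tokens: {total_tokens}")
print(f"Approximate completion throughput: {tokens_per_second:.1f} tokens/s")
```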

## Task 2: Creating a LangSmith dataset

Now that we've got LangSmith set-up - let's explore how we can create a dataset!

First, we'll create a list of questions!

In [57]:
from langsmith import Client

questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

Now we can create our dataset through the LangSmith `Client()`.

In [59]:
client = Client()
dataset_name = "QLoRA RAG Dataset v3"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    dataset_id=dataset.id
)

After this step you should be able to navigate to the following dataset in the LangSmith web UI.

![image](/Users/acrobat/Documents/GitHub/AI-Engineering-Cohort-2/Week%205/Day%202/QLoRA_RAG_Datasetv2.png)

## Task 3: Creating a custom evaluator

Now that we have a dataset - we can start thinking about evaluation.

We're going to make a `StringEvaluator` to measure "dopeness".

> NOTE: While this is a fun toy example - this can be extended to practically any use-case!

In [60]:
import re
from typing import Any, Optional
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.evaluation import StringEvaluator

class DopenessEvaluator(StringEvaluator):
    """An LLM-based dopeness evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model="gpt-4", temperature=0)

        template = """On a scale from 0 to 100, how dope (cool, awesome, lit) is the following response to the input:
        --------
        INPUT: {input}
        --------
        OUTPUT: {prediction}
        --------
        Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = PromptTemplate.from_template(template) | llm

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "scored_dopeness"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain.invoke(
            {"input": input, "prediction": prediction}, kwargs
        )
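        # The prompt asks for the score alone on the final line, so split off
        # the last line rather than the first.
        reasoning, _, score_line = evaluator_result.content.strip().rpartition("\n")
        match = re.search(r"\d+", score_line)
        # Guard against a missing number instead of calling .group() on None.
        score = float(match.group(0)) / 100.0 if match else None
        return {"score": score, "reasoning": reasoning.strip()}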
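
The score-parsing step is the most fragile part of an LLM-based evaluator, so it helps to exercise it on its own. The sketch below uses a hypothetical helper, `parse_score`, and a hand-written sample of what the grader might return. It assumes the grader follows the prompt and repeats the score alone on the last line, and it returns `None` for the score when no number is found.

```python
import re
from typing import Optional


def parse_score(evaluator_text: str) -> tuple[str, Optional[float]]:
    """Split grader output into (reasoning, score in [0, 1] or None)."""
    text = evaluator_text.strip()
    # Everything before the final newline is reasoning; the last line is the score.
    reasoning, _, last_line = text.rpartition("\n")
    match = re.search(r"\d+", last_line)
    if match is None:
        return text, None
    return reasoning.strip(), float(match.group(0)) / 100.0


# Hand-written sample output; note the digit "2" inside the reasoning.
sample_output = (
    "Step 1: the answer is short.\n"
    "Step 2: it has 2 good points.\n"
    "Score: 70\n"
    "70"
)
reasoning, score = parse_score(sample_output)
no_score = parse_score("I cannot rate this.")
print(score, no_score)
```

Reading the number from the last line avoids picking up stray digits in the reasoning, such as the "2" in the sample above.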

## Task 4: Initializing our evaluator config

Now we can initialize our `RunEvalConfig` which we can use to evaluate our chain against our dataset.

> NOTE: Check out the [documentation](https://docs.smith.langchain.com/evaluation/faq/custom-evaluators) for adding additional custom evaluators.

In [61]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[DopenessEvaluator()],
    evaluators=[
        "criteria",
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                # Adjacent string literals are concatenated, so the first one
                # needs a trailing space.
                "AI": "Does the response feel AI generated? "
                "Respond Y if it does, and N if it doesn't."
            }
        ),
    ],
)

## Task 5: Evaluating our RAG pipeline

All that's left to do now is evaluate our pipeline!

In [62]:
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=retrieval_augmented_qa_chain,
    evaluation=eval_config,
    verbose=True,
    #project_name="HF RAG Pipeline - Evaluation - v1",
    project_name="AIE1-6c6c5efe",
    project_metadata={"version": "1.0.0"},
)

View the evaluation results for project 'AIE1-6c6c5efe' at:
https://smith.langchain.com/o/40e202cd-c204-5263-a0a7-163defaa4ed5/datasets/945687fc-43e5-41ba-bf30-7badb8bc6652/compare?selectedSessions=6e5dafb4-3808-4758-bf49-65906496b041

View all tests for Dataset QLoRA RAG Dataset v3 at:
https://smith.langchain.com/o/40e202cd-c204-5263-a0a7-163defaa4ed5/datasets/945687fc-43e5-41ba-bf30-7badb8bc6652
[------------------------------------------------->] 6/6

{'project_name': 'AIE1-6c6c5efe',
 'results': {'80dc1eec-585c-4819-ab60-948d39f787d5': {'input': {'question': 'What optimizer is used in QLoRA?'},
   'feedback': [EvaluationResult(key='helpfulness', score=0, value='N', comment='The criterion for this task is "helpfulness". The submission should provide a helpful, insightful, and appropriate response to the input.\n\nLooking at the input, the question is asking about a specific detail related to QLoRA, specifically the optimizer used in it.\n\nThe submitted answer is "I don\'t know." This response does not provide any helpful or insightful information in response to the question. It does not provide the information asked for in the input.\n\nTherefore, the submission does not meet the criterion of being helpful.\n\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('873e607d-bd05-4183-b2c6-38ec765dc09c'))}, feedback_config=None, source_run_id=None, target_run_id=None),
    EvaluationResult(key='harmfulness', score=0, valu

![image](/Users/acrobat/Documents/GitHub/AI-Engineering-Cohort-2/Week%205/Day%202/evaluators.png)

![image](/Users/acrobat/Documents/GitHub/AI-Engineering-Cohort-2/Week%205/Day%202/evaluator_runs.png)