<a href="https://colab.research.google.com/github/nathalierocelle/LLM-EndToEnd/blob/main/AI%20Makerspace%20-%20Langchain%3A%20Build%20ChatGPT%20for%20your%20data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### The Basics of LangChain

In this notebook we'll explore exactly what LangChain is doing - and implement a straightforward example that lets us ask questions of any document we want!

First things first, let's get our dependencies all set!

In [None]:
!pip install openai langchain -q

### OpenAI API Key

You'll need to have an OpenAI API key for this next part - see [this](https://www.onmsft.com/how-to/how-to-get-an-openai-api-key/) if you haven't already set one up!

In [None]:
import os
import openai

openai.api_key = ""
os.environ["OPENAI_API_KEY"] = openai.api_key

#### Helper Functions (run this cell)

In [None]:
from IPython.display import display, Markdown

def disp_markdown(text: str) -> None:
  display(Markdown(text))

### Our First LangChain ChatModel



---


<div class="warn">Note: Information on OpenAI's <a href=https://openai.com/pricing>pricing</a> and <a href=https://openai.com/policies/usage-policies>usage policies.</a></div>



---



Now that we're set up with OpenAI's API - we can begin making our first ChatModel!

There are a few important things to consider when we're using LangChain's ChatModel, which are outlined [here](https://python.langchain.com/en/latest/modules/models/chat.html).

Let's begin by initializing the model with OpenAI's `gpt-3.5-turbo` (ChatGPT) model.

We're not going to be leveraging the [streaming](https://python.langchain.com/en/latest/modules/models/chat/examples/streaming.html) capabilities in this Notebook - just the basics to get us started!

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

chat_model = ChatOpenAI(model_name="gpt-3.5-turbo")

If we look at the [Chat completions](https://platform.openai.com/docs/guides/chat) documentation for OpenAI's chat models - we'll see that there are a few specific fields we'll need to concern ourselves with:

`role`
- This refers to one of three "roles" that interact with the model in specific ways.
- The `system` role is an optional role that can be used to guide the model toward a specific task. Examples of `system` messages might be:
  - You are an expert in Python, please answer questions as though we were in a peer coding session.
  - You are the world's leading expert in stamps.

  These messages help us "prime" the model to be more aligned with our desired task!

- The `user` role represents, well, the user!
- The `assistant` role lets us act in the place of the model's outputs. We can (and will) leverage this for some few-shot prompt engineering!
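To make the three roles a little more concrete, here's a rough sketch of a short conversation in the plain `role`/`content` dictionary shape that the Chat completions guide describes. The `make_message` helper and the example wording are hypothetical and only for illustration; they aren't part of OpenAI's or LangChain's API, and no request is sent.

```python
# Illustrative only: plain dictionaries in the role/content shape, no API call is made.
VALID_ROLES = {"system", "user", "assistant"}

def make_message(role: str, content: str) -> dict:
    """Build one chat message, rejecting roles outside the three described above."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return {"role": role, "content": content}

# The assistant turn is written by us, which is how a few-shot example gets "primed"
messages = [
    make_message("system", "You are the world's leading expert in stamps."),
    make_message("user", "Should I store my stamps in a damp basement?"),
    make_message("assistant", "Goodness, no! Keep them somewhere dry."),
    make_message("user", "What about a sunny windowsill?"),
]

print([m["role"] for m in messages])
```

The last entry is the new question we actually want answered; everything before it is context for the model.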

Each of these roles has a class in LangChain to make it nice and easy for us to use!

Let's look at an example.

In [None]:
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

# The SystemMessage is associated with the system role
system_message = SystemMessage(content="You are a food critic.")

# The HumanMessage is associated with the user role
user_message = HumanMessage(content="Do you think Kraft Dinner constitutes fine dining?")

# The AIMessage is associated with the assistant role
assistant_message = AIMessage(content="Egads! No, it most certainly does not!")

Now that we have those messages set up, let's send them to `gpt-3.5-turbo` with a new user message and see how it does!

It's easy enough to do this - the ChatOpenAI model accepts a list of messages!

In [None]:
second_user_message = HumanMessage(content="What about Red Lobster, surely that is fine dining!")

# create the list of prompts
list_of_prompts = [
    system_message,
    user_message,
    assistant_message,
    second_user_message
]

# we can just call our chat_model on the list of prompts!
chat_model(list_of_prompts)

AIMessage(content="Ah, Red Lobster. While it may offer a casual dining experience with a seafood focus, I wouldn't classify it as fine dining. Fine dining typically involves a higher level of culinary craftsmanship, attention to detail, and a more refined atmosphere. Red Lobster is known for its casual atmosphere, family-friendly vibe, and value-oriented menu. It can be a fun place to enjoy some seafood favorites, but it doesn't quite reach the level of fine dining.", additional_kwargs={}, example=False)

Great! That's in line with what we expected to see!

### PromptTemplates

Next up, we'll discuss prompt templates. These let us reuse a prompt's structure and fill in only the parts that change, so we don't have to redo work we've already completed!

In [None]:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate
)

# we can signify variables we want access to by wrapping them in {}
system_prompt_template = "You are an expert in {SUBJECT}, and you're currently feeling {MOOD}"
system_prompt_template = SystemMessagePromptTemplate.from_template(system_prompt_template)

user_prompt_template = "{CONTENT}"
user_prompt_template = HumanMessagePromptTemplate.from_template(user_prompt_template)

# put them together into a ChatPromptTemplate
chat_prompt = ChatPromptTemplate.from_messages([system_prompt_template, user_prompt_template])

Now that we have our `chat_prompt` set up with the templates - let's see how we can easily format them with our content!

NOTE: `disp_markdown` is just a helper function to display the formatted markdown response.

In [None]:
# note the method `to_messages()`, that's what converts our formatted prompt into a list of messages
formatted_chat_prompt = chat_prompt.format_prompt(SUBJECT="sparkling waters", MOOD="joyful", CONTENT="Hi, what are the finest sparkling waters?").to_messages()

disp_markdown(chat_model(formatted_chat_prompt).content)

Hello! As an expert in sparkling waters, I can assure you that there are plenty of wonderful options to choose from. Here are some of the finest sparkling waters that are loved by many:

1. Perrier: This classic French sparkling water is known for its elegant and crisp taste. It has fine, delicate bubbles and a refreshing flavor that makes it a favorite among sparkling water enthusiasts.

2. San Pellegrino: Originating from Italy, San Pellegrino is famous for its naturally carbonated water. It has a slightly higher mineral content, which gives it a distinctive taste and a pleasant effervescence.

3. Topo Chico: Hailing from Mexico, Topo Chico has gained popularity for its unique mineral composition and refreshing bubbles. It has a crisp, clean taste that is perfect for enjoying on its own or as a mixer in cocktails.

4. Gerolsteiner: This German sparkling water is recognized for its high mineral content, which lends it a slightly tangy taste. It has a robust carbonation and is often praised for its naturally occurring minerals.

5. LaCroix: LaCroix has become incredibly popular in recent years, known for its wide range of flavors and zero-calorie options. It offers a refreshing and light taste, making it a great choice for those who enjoy a hint of flavor with their sparkling water.

Remember, taste preferences can vary from person to person, so it's best to try different brands and flavors to find the ones that suit you best. Cheers to joyful sparkling water sipping!
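If the `{}` placeholders feel familiar, that's because they behave much like Python's built-in `str.format`. Here's a minimal sketch of that analogy using plain strings; the `fill` helper is a hypothetical name, and this is not how LangChain implements its templates internally.

```python
# Plain-Python analogy for placeholder substitution; not LangChain's implementation.
system_template = "You are an expert in {SUBJECT}, and you're currently feeling {MOOD}"
user_template = "{CONTENT}"

def fill(template: str, **variables: str) -> str:
    """Substitute named {PLACEHOLDER} fields, just like str.format."""
    return template.format(**variables)

# Each template only needs the variables it names
system_text = fill(system_template, SUBJECT="sparkling waters", MOOD="joyful")
user_text = fill(user_template, CONTENT="Hi, what are the finest sparkling waters?")

print(system_text)
print(user_text)
```

The same two templates can be filled again with a different subject and mood, which is the whole point of keeping the structure separate from the values.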

### Putting the Chain in LangChain

In essence, a chain is exactly what it sounds like - it helps us chain actions together.

Let's take a look at an example.

In [None]:
from langchain.chains import LLMChain

chain = LLMChain(llm=chat_model, prompt=chat_prompt)

disp_markdown(chain.run(SUBJECT="sparkling water", MOOD="angry", CONTENT="Is Bubly a good sparkling water?"))

Bubly? Are you kidding me? That stuff is a disgrace to the world of sparkling water. It's nothing more than a cheap imitation, trying to ride the coattails of true sparkling water brands. The flavors are weak, the carbonation is lackluster, and don't even get me started on the aftertaste. It's like drinking watered-down disappointment. If you want a good sparkling water, look elsewhere. Bubly is a disappointment in a can.
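Conceptually, the chain above does two things in sequence: it fills in the prompt, then hands the result to the model. Here's a small sketch of that idea in plain Python. `SimpleChain` and `fake_model` are hypothetical stand-ins invented for this illustration; the real `LLMChain` does considerably more.

```python
from typing import Callable

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: reports the prompt length instead of answering."""
    return f"(model saw {len(prompt)} characters)"

class SimpleChain:
    """Hypothetical two-step chain: format the prompt, then call the model."""

    def __init__(self, model: Callable[[str], str], template: str) -> None:
        self.model = model
        self.template = template

    def run(self, **variables: str) -> str:
        prompt = self.template.format(**variables)  # step 1: fill the template
        return self.model(prompt)                   # step 2: send it to the model

simple_chain = SimpleChain(fake_model, "You are an expert in {SUBJECT}. {CONTENT}")
print(simple_chain.run(SUBJECT="sparkling water", CONTENT="Is Bubly a good sparkling water?"))
```

Because the model is just a callable here, swapping in a different one doesn't change the chain itself.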

### Incorporate A Local Document

Now that we've got our first chain running, let's talk about how we can leverage our own document!

First off, we'll need a document!

For this example, we'll be using Lewis Carroll's Alice in Wonderland - though you can swap in any document you like, as long as it's in a text file.

In [None]:
!wget http://homepage.cs.uiowa.edu/~sriram/30/fall03/project1/alice.txt -O "Alice_1.txt"

--2023-07-05 16:07:07--  http://homepage.cs.uiowa.edu/~sriram/30/fall03/project1/alice.txt
Resolving homepage.cs.uiowa.edu (homepage.cs.uiowa.edu)... 128.255.96.133, 2620:0:e50:6810::80ff:6085
Connecting to homepage.cs.uiowa.edu (homepage.cs.uiowa.edu)|128.255.96.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148544 (145K) [text/plain]
Saving to: ‘Alice_1.txt’


2023-07-05 16:07:08 (492 KB/s) - ‘Alice_1.txt’ saved [148544/148544]



In [None]:
with open("Alice_1.txt") as f:
    alice_in_wonderland = f.read()

Next we'll want to split our text into appropriately sized chunks.

We're going to be using the [CharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/character_text_splitter.html) from LangChain today.

The size of these chunks will depend heavily on a number of factors relating to which LLM you're using, what the max context size is, and more.

You can also choose to have the chunks overlap to avoid potentially missing any important information between chunks. As we're dealing with a novel - there's not a critical need to include overlap.

We can also pass in the separator - this is what we'll try and separate the documents on. Be careful to understand your documents so you can be sure you use a valid separator!

For now, we'll go with 1000 characters.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator = "\n")
texts = text_splitter.split_text(alice_in_wonderland)

In [None]:
len(texts)

152
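To get a feel for how `chunk_size` and `separator` interact, here's a simplified sketch of greedy splitting: break the text on the separator, then pack the pieces back together until the next one would push a chunk over the limit. `greedy_split` is a hypothetical function written for this illustration; it ignores overlap entirely and shouldn't be read as the actual `CharacterTextSplitter` code.

```python
def greedy_split(text: str, chunk_size: int, separator: str = "\n") -> list[str]:
    """Split on `separator`, then pack pieces into chunks of at most `chunk_size` characters.

    A single piece longer than `chunk_size` is kept whole rather than cut mid-piece.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate      # the piece still fits (or starts a new chunk)
        else:
            chunks.append(current)   # close the full chunk and start a fresh one
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "one\ntwo\nthree\nfour"
print(greedy_split(sample, chunk_size=8))
```

Notice that chunks only ever break at the separator, which is why picking a separator that actually appears in your document matters.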

Now that we've split our document into more manageably sized chunks, we'll need to embed those chunks!

For more information on embedding - please check out [this](https://platform.openai.com/docs/guides/embeddings) resource from OpenAI.

In order to do this, we'll first need to select a method to embed - for this example we'll be using OpenAI's embedding - but you're free to use whatever you'd like.

You just need to ensure you use the same embedding model for both your documents and your queries, as vectors produced by different models aren't comparable with one another.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = openai.api_key

embeddings = OpenAIEmbeddings()

Now that we've set up how we want to embed our document - we'll need to embed it.

For this week we'll be glossing over the technical details of this process - as we'll get more into them next week.

Just know that we're converting our text into an easily queryable format!
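As a rough intuition for what "queryable" means here, the sketch below turns text into vectors and ranks chunks by how closely they point in the same direction as the query. Everything in it is an assumption made for illustration: `toy_embed` is a word-count stand-in for a real embedding model, the vocabulary and sentences are made up, and cosine similarity is just one common way to compare vectors (the metric a particular vector store uses may differ).

```python
import math
from collections import Counter

def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vocabulary = ["rabbit", "late", "queen", "tea"]
toy_chunks = ["the rabbit was late", "the queen played croquet", "tea with the hatter"]
toy_query = "why is the rabbit late"

# The query and the chunks must go through the same embedding function
query_vector = toy_embed(toy_query, vocabulary)
scores = [cosine_similarity(query_vector, toy_embed(chunk, vocabulary)) for chunk in toy_chunks]
best_chunk = toy_chunks[scores.index(max(scores))]
print(best_chunk)
```

A real embedding model captures meaning rather than exact word matches, but the retrieval step follows the same pattern: embed the query, compare, and return the closest chunks.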

We're going to leverage ChromaDB for this example, so we'll want to install that dependency.

In [None]:
!pip install chromadb tiktoken -q


In [None]:
from langchain.vectorstores import Chroma

docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()

Now that we have our documents embedded we're free to query them with natural language! Let's see this in action!

In [None]:
query = "What is the Rabbit late for?"
docs = docsearch.get_relevant_documents(query)

In [None]:
docs[2]

Document(page_content="the same, shedding gallons of tears, until there was a large pool\nall round her, about four inches deep and reaching half down the\nhall.\n  After a time she heard a little pattering of feet in the\ndistance, and she hastily dried her eyes to see what was coming.\nIt was the White Rabbit returning, splendidly dressed, with a\npair of white kid gloves in one hand and a large fan in the\nother:  he came trotting along in a great hurry, muttering to\nhimself as he came, `Oh! the Duchess, the Duchess! Oh! won't she\nbe savage if I've kept her waiting!'  Alice felt so desperate\nthat she was ready to ask help of any one; so, when the Rabbit\ncame near her, she began, in a low, timid voice, `If you please,\nsir--'  The Rabbit started violently, dropped the white kid\ngloves and the fan, and skurried away into the darkness as hard\nas he could go.\n  Alice took up the fan and gloves, and, as the hall was very\nhot, she kept fanning herself all the time she went on talk

Finally, we're able to combine what we've done so far into a chain!

We're going to leverage the `load_qa_chain` to quickly integrate our queryable documents with an LLM.

There are four major methods of building this chain; they can be found [here](https://docs.langchain.com/docs/components/chains/index_related_chains)!

For this example we'll be using the `stuff` chain type.

In [None]:
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(llm=chat_model, chain_type="stuff")
query = "What was the rabbit late for?"
docs = docsearch.get_relevant_documents(query)
chain.run(input_documents=docs, question=query)

'The rabbit was late for something, but it is not specified what he was late for in the given context.'
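The `stuff` chain type is the simplest of the bunch: it "stuffs" every retrieved chunk into a single prompt alongside the question. Here's a minimal sketch of that idea. `build_stuff_prompt`, the instruction wording, and the placeholder chunks are all hypothetical; the prompt LangChain actually sends will be worded differently.

```python
def build_stuff_prompt(documents: list[str], question: str) -> str:
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder strings standing in for the chunks a retriever would return
retrieved_chunks = ["<text of chunk 1>", "<text of chunk 2>"]
print(build_stuff_prompt(retrieved_chunks, "What was the rabbit late for?"))
```

Since everything lands in one prompt, this approach only works while the combined chunks fit within the model's context window, which ties back to the chunk size chosen earlier.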

Now that we have this set up, we'll want to package it into an app and pass it to a Hugging Face Space!

You can find instructions on how to do that in the [Hugging Face Space](https://huggingface.co/spaces/ml-maker-space/AliceInWonderLandChainlit)!