# LangChain 101

LangChain is a framework for developing applications powered by language models.

- GitHub: https://github.com/hwchase17/langchain
- Docs: https://python.langchain.com/en/latest/index.html

## 1. LLMs

LangChain provides a generic interface for all LLMs. See all LLM providers: https://python.langchain.com/en/latest/modules/models/llms/integrations.html

### OpenAI


In [None]:
import os
from langchain.llms import OpenAI

def init_llm_openai():
    # os.environ["OPENAI_API_KEY"] ="YOUR_OPENAI_TOKEN"

    llm = OpenAI(temperature=0.9)  # model_name="text-davinci-003"
    return llm
   
llm = init_llm_openai()

text = "What would be a good AWS new service name that allow customers to chat with their own data"
print(llm(text))

### SageMaker Endpoint

In [None]:
import json
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

def init_llm_sm_endpoint():

    endpoint_name = 'jumpstart-dft-meta-textgeneration-llama-2-13b-f'
    aws_region='us-east-1'
    parameters = {"max_new_tokens": 700, "temperature": 0.1}


    class ContentHandler(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"

        # def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        #     input_str = json.dumps({"inputs": prompt, **model_kwargs})
        #     return input_str.encode("utf-8")
        def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
            input_str = json.dumps({"inputs" : [[{"role" : "system",
            "content" : "You are a kind robot."},
            {"role" : "user", "content" : prompt}]],
            "parameters" : {**model_kwargs}})
            return input_str.encode('utf-8')

        def transform_output(self, output: bytes) -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            return response_json[0]["generation"]["content"]

    content_handler = ContentHandler()

    sm_llm = SagemakerEndpoint(
        endpoint_name=endpoint_name,
        region_name=aws_region,
        model_kwargs=parameters,
        content_handler=content_handler,
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    )
    return sm_llm

llm = init_llm_sm_endpoint()
text = "What would be a good AWS new service name that allows customers to chat with their own data? Keep the answer short. Give me just one answer in your response."
llm(text)

## 2. Prompt Templates

LangChain facilitates prompt management and optimization.

Normally when you use an LLM in an application, you are not sending user input directly to the LLM. Instead, you need to take the user input and construct a prompt, and only then send that to the LLM.
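As a minimal sketch of that idea, independent of LangChain, the snippet below fills user input into a fixed template with plain `str.format`. The names `TEMPLATE`, `build_prompt`, and `prompt_text` are hypothetical and only illustrate the pattern that `PromptTemplate` packages up.

```python
# Minimal stand-in for a prompt template: fixed instructions plus one named slot.
TEMPLATE = (
    "Question: What would be a good AWS new service name that {feature}?\n\n"
    "Answer: "
)


def build_prompt(feature: str) -> str:
    """Fill the user-supplied text into the fixed template."""
    return TEMPLATE.format(feature=feature)


user_input = "allows customers to chat with their own data"
prompt_text = build_prompt(user_input)
print(prompt_text)
```

Only `prompt_text` would be sent to the model; the raw `user_input` never goes to the LLM on its own.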

In [None]:
prompt = """Question: What would be a good AWS new service name that allow customers to chat with their own data?

Give a marketing slogan for the service.

Let's think step by step.

Answer: """
print(llm(prompt))

In [None]:
from langchain import PromptTemplate

template = """Question: What would be a good AWS new service name that {feature}.

Let's think step by step. Just answer the designed name

Answer: """

prompt = PromptTemplate(template=template, input_variables=["feature"])

In [None]:
input = prompt.format(feature="allow customers to chat with their own data")
print(f"Prompt = \n {input}")

print(llm(input))

## 3. Chains

Chains combine LLMs and prompts into multi-step workflows.
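The core idea of a sequential chain can be sketched without any model at all: each step formats a prompt, calls the model, and hands its output to the next step. Here `fake_llm`, `make_step`, and `run_sequence` are hypothetical helpers, and `fake_llm` simply wraps the last prompt line in angle brackets so the data flow is visible.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: wraps the last prompt line."""
    return f"<{prompt.splitlines()[-1]}>"


def make_step(template: str) -> Callable[[str], str]:
    """Bind a template to the fake model, giving one single-input step."""
    def step(value: str) -> str:
        return fake_llm(template.format(input=value))
    return step


def run_sequence(steps: List[Callable[[str], str]], first_input: str) -> str:
    """Feed each step's output to the next step, as a sequential chain does."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


name_step = make_step("Suggest a service name for:\n{input}")
slogan_step = make_step("Write a slogan for:\n{input}")
result = run_sequence([name_step, slogan_step], "chat with data")
print(result)
```

The nested brackets in the result show that the second step received the first step's output rather than the original input.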

In [None]:
from langchain import PromptTemplate

template = """Question: What would be a good AWS new service name that {feature}.

Let's think step by step. Just answer the designed name

Answer: """

prompt = PromptTemplate(template=template, input_variables=["feature"])

In [None]:
from langchain import LLMChain
name_creation_chain = LLMChain(prompt=prompt, llm=llm)

feature = "Chat with data on AWS"
print(name_creation_chain.run(feature))

In [None]:
# This is an LLMChain to write a marketing slogan.

template = """You are a marketing agency. Given an AWS service name, it is your job to write a marketing slogan for that service.
Let's think step by step. 

Service: {name}
Slogan: This is a marketing slogan for the above service:
Explanation: This is the explanation of the slogan and why you think it is good:"""
prompt_template = PromptTemplate(input_variables=["name"], template=template)
slogan_chain = LLMChain(llm=llm, prompt=prompt_template)

slogan_chain.run("AWS Data Chatter")

In [None]:
# This is the overall chain where we run these two chains in sequence.
from langchain.chains import SimpleSequentialChain
overall_chain = SimpleSequentialChain(chains=[name_creation_chain, slogan_chain], verbose=True)

new_launch = overall_chain.run("Chat with your data on AWS")

## 4. Agents and Tools

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done.


When used correctly agents can be extremely powerful. In order to load agents, you should understand the following concepts:

- Tool: A function that performs a specific duty. This can be things like: Google Search, Database lookup, Python REPL, other chains.
- LLM: The language model powering the agent.
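- Agent: The agent type to use, which determines how the LLM is prompted to choose actions (for example, `zero-shot-react-description`).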

Tools: https://python.langchain.com/en/latest/modules/agents/tools.html

Agent Types: https://python.langchain.com/en/latest/modules/agents/agents/agent_types.html
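The Action / Observation loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions: `scripted_policy` is a hypothetical stand-in for the LLM's decision, and the two tools (`lookup`, `power`) are made-up helpers, not LangChain tools.

```python
from typing import Callable, Dict, List, Tuple


def lookup(query: str) -> str:
    """Toy lookup tool backed by a tiny hard-coded table."""
    return {"S3 launch year": "2006"}.get(query, "unknown")


def power(expr: str) -> str:
    """Toy math tool: evaluates 'base^exponent' and rounds to 2 decimals."""
    base, exponent = expr.split("^")
    return str(round(float(base) ** float(exponent), 2))


TOOLS: Dict[str, Callable[[str], str]] = {"lookup": lookup, "power": power}


def scripted_policy(history: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Hypothetical stand-in for the LLM: picks the next (action, input)."""
    if not history:
        return "lookup", "S3 launch year"
    if len(history) == 1:
        year = history[-1][2]  # observation from the previous step
        return "power", f"{year}^0.43"
    return "finish", history[-1][2]


def run_agent(max_steps: int = 5) -> str:
    """Decide on an action, run the tool, record the observation, repeat."""
    history: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, action_input = scripted_policy(history)
        if action == "finish":
            return action_input
        observation = TOOLS[action](action_input)
        history.append((action, action_input, observation))
    return "stopped: step limit reached"


answer = run_agent()
print(answer)
```

In a real agent the policy is the LLM itself, reading the tool descriptions and the history so far; the step limit guards against loops that never finish.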

In [None]:
from langchain.agents import load_tools
from langchain.agents import initialize_agent

In [None]:
%pip install -qqq wikipedia

In [None]:
tools = load_tools(["wikipedia", "llm-math"], llm=llm)

In [None]:
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

In [None]:
agent.run("In what year was Amazon S3 released? What is this year raised to the 0.43 power?")

# Demo - Chat with your data on AWS

In [None]:
%pip -qqq install -r requirements.txt


In [None]:
from dotenv import load_dotenv

load_dotenv()

## [Demo 1]: Read data on S3 and Talk

### 1. RAG
- Load Document (PDF, docx, text) from S3 
- Store into Vector store 
- QnA using LLM with RetrievalQA chain provided by LangChain

Ref: https://python.langchain.com/docs/use_cases/question_answering/
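To make the retrieval step concrete, here is a rough sketch that uses a toy word-count "embedding" and cosine similarity in place of a real embedding model and vector store. The sample chunks are made up for illustration, and `embed`, `cosine`, and `retrieve` are hypothetical helpers rather than LangChain APIs.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a real system uses a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up sample chunks standing in for document splits loaded from S3.
sample_chunks = [
    "A data lake stores raw data in object storage",
    "Purpose-built databases serve specific workloads",
    "Governance controls who can access data",
]
sample_question = "where is raw data stored"
context = retrieve(sample_question, sample_chunks, k=1)
rag_prompt = f"Context:\n{context[0]}\n\nQuestion: {sample_question}\nAnswer:"
print(rag_prompt)
```

The retrieved chunk is pasted into the prompt as context, which is broadly what a "stuff"-style RetrievalQA chain does before calling the LLM.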

In [None]:


# Read a PDF document from S3 using S3FileLoader
from langchain.document_loaders import S3FileLoader
loader = S3FileLoader("BUCKET", "FILE")
all_splits = loader.load_and_split()
print(f"Original: Number of document splits = {len(all_splits)}")


# Embedding and Store into Vector Store
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

## TODO: use AOS as vector store
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

service = 'es'  # 'es' for an OpenSearch Service domain; use 'aoss' for OpenSearch Serverless
region = 'us-east-1'
# credentials = boto3.Session(aws_access_key_id='xxxxxx',aws_secret_access_key='xxxxx').get_credentials()
# awsauth = AWS4Auth('xxxxx', 'xxxxxx', region,service, session_token=credentials.token)

vectorstore = OpenSearchVectorSearch.from_documents(
    all_splits,
    OpenAIEmbeddings(),
    opensearch_url="https://xxxxxx.us-east-1.es.amazonaws.com",
    http_auth=("admin", "password"),
    timeout = 300,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name="test-index-using-aoss",
    engine="faiss"
)




In [None]:
# question = "Why need a Modern Data architecture? summarize in 100 words"
# question = "What are the pillars of a Modern Data architecture? summarize in 100 words"
question = "Explain Modern Data Architecture to a 5-year-old kid. Summarize in 100 words"
docs = vectorstore.similarity_search(question, k=10)
print(f"Vector search: Number of document related to the question = {len(docs)}")

# QnA the content using RetrievalQA chain provided by Langchain
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever())
output = qa_chain({"query": question})
print(output['result'])

### 2. Read CSV from S3 into Pandas. Chat with LLM using Pandas DataFrame Agent provided by LangChain

In [None]:
# Read CSV from S3 into Pandas

import pandas as pd
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
from langchain.agents.agent_types import AgentType

df = pd.read_csv("s3://BUCKET/titanic.csv") # install s3fs

agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)

agent.run("Give the statistic on survived distribution by gender")

## [Demo 2]: Chat with your data on AWS by Text-to-SQL

LangChain provides SQLDatabaseChain and SQL agents for Text-to-SQL query generation and execution.

Ref: https://python.langchain.com/docs/use_cases/sql

These work well for popular data sources supported by SQLAlchemy (such as MySQL, PostgreSQL, Oracle SQL, Databricks, SQLite).

However, the **pre-defined prompts inside the tools are restrictive and may not be optimized for some SQL engines (e.g. Amazon Athena)**.
Therefore, I customize the prompt template used by the LangChain SQLDatabaseChain for better results.

Instead of using a pre-built agent, here I:
1. Use the `create_sql_query_chain` module to perform Text-to-SQL with the LLM.
2. Execute the SQL by pushing it down to Athena using Pandas.
3. Display the DataFrame in the UI instead of summarizing it into a human-readable response (because some customers may just want to use it for data extraction without writing SQL).
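The three steps above can be sketched end to end with the standard library, using an in-memory SQLite database as a stand-in for Athena. Everything here is an assumption for illustration: `fake_text_to_sql` returns a fixed query instead of calling an LLM, the `sales` table is made up, and the read-only check is a simple heuristic rather than a complete safeguard.

```python
import re
import sqlite3
from typing import List, Tuple

# Only statements that start with SELECT or WITH are allowed through.
READ_ONLY = re.compile(r"^\s*(SELECT|WITH)\b", re.IGNORECASE)


def fake_text_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM step; returns a fixed query."""
    return (
        "SELECT artist, SUM(total) AS revenue FROM sales "
        "GROUP BY artist ORDER BY revenue DESC"
    )


def run_read_only(conn: sqlite3.Connection, sql: str) -> Tuple[List[str], List[tuple]]:
    """Execute the statement only if it looks like a read-only query."""
    if not READ_ONLY.match(sql):
        raise ValueError("refusing to run a non-SELECT statement")
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    return columns, cursor.fetchall()


# Made-up sample data standing in for a table in the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (artist TEXT, total REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 10.0), ("B", 25.0), ("A", 5.0)],
)

columns, rows = run_read_only(conn, fake_text_to_sql("revenue by artist"))
print(columns)
print(rows)
```

Returning the column names and raw rows, rather than a prose summary, mirrors step 3: the caller can render them as a table in whatever UI it uses.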

In [None]:
#define athena connection engine
from  sqlalchemy import create_engine
from langchain.sql_database import SQLDatabase

region = 'us-east-1'
glue_database_name='chinook'
glue_databucket_name='aws-athena-query-results-xxxxxxxx-us-east-1'

connathena=f"athena.{region}.amazonaws.com" 
portathena='443' #Update, if port is different
schemaathena=glue_database_name #from cfn params
s3stagingathena=f's3://{glue_databucket_name}/athenaresults/'#from cfn params
wkgrpathena='primary'#Update, if workgroup is different

##  Create the athena connection string
connection_string = f"awsathena+rest://@{connathena}:{portathena}/{schemaathena}?s3_staging_dir={s3stagingathena}&work_group={wkgrpathena}"
##  Create the athena  SQLAlchemy engine
engine_athena = create_engine(connection_string, echo=False)

db = SQLDatabase(engine_athena, sample_rows_in_table_info=0, custom_table_info={})

In [None]:
# Customize prompt for Athena
## - use Presto SQL syntax
## - Make sure to select only the columns that are in GROUP BY.
## - If you use a string indicating a date, add date before the string. For example, date '2012-01-01'.
## - Rename the columns to the best of answering the question.
## - If you think the question is not related to any tables in the database, just reply 'Sorry, it seems not related to the data'.

from langchain.prompts.prompt import PromptTemplate

PROMPT_SUFFIX = """Only use the following tables:
{table_info}

Question: {input}"""

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct Presto query to run, then look at the results of the query and return the answer. Unless the user specifies in his question a specific number of examples he wishes to obtain, always limit your query to at most {top_k} results. You can order the results by a relevant column to return the most interesting examples in the database.

Never query for all the columns from a specific table; only ask for the few relevant columns given the question.

Pay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.

Make sure to select only the columns that are in GROUP BY. If the group column is an aggregated column, make sure the same aggregation is used in GROUP BY.

If you use a string indicating a date, add date before the string. For example, date '2012-01-01'. Other than this, avoid using date and time functions and operators that may not be supported in Presto queries.

Rename the columns to the best of answering the question.

Review the answer and improve before giving the answer.

If you think the question is not related to any tables in the database, just reply 'Sorry, it seems not related to the data'.

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

"""


CUSTOM_PROMPT = PromptTemplate(
    input_variables=["input", "table_info",  "top_k"],
    template=_DEFAULT_TEMPLATE + PROMPT_SUFFIX
)

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_sql_query_chain

llm = ChatOpenAI(temperature=0)
chain = create_sql_query_chain(llm, db, prompt=CUSTOM_PROMPT)

In [None]:
import re
import pandas as pd
from sqlalchemy import text

# question = "Find top 5 best selling albums. Add corresponding artist, total revenue, total duration in minutes as columns"
question = 'monthly sales volume and revenue trend over time'
sql_response = chain.invoke({"question":question})

print(sql_response)

sql_keywords_regex = r'^(SELECT|WITH|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER)'
match = re.search(sql_keywords_regex, sql_response.strip(), re.IGNORECASE)
if match:
    with engine_athena.connect() as conn:
        df = pd.read_sql_query(text(sql_response), con=conn)
        print(df.to_markdown())

In [None]:
# I want a better UI
! streamlit run st-demo-gen-sql.py > /dev/null 2>&1

# Demo: Visualize data

In [None]:
from langchain import LLMChain
from langchain.agents import (AgentExecutor, Tool, ZeroShotAgent,
                              initialize_agent, load_tools)
from langchain_experimental.sql import SQLDatabaseChain

In [None]:
db_chain = SQLDatabaseChain.from_llm(
    llm=llm,
    db=db,
    verbose=True,  # Show its work
    return_direct=True,  # Return the results without sending back to the LLM
    prompt=CUSTOM_PROMPT
)

In [None]:
# Add python_repl to our list of tools
tools = load_tools(["python_repl"])

# Define our chinook_data tool

# Set a description to help the LLM know when and how to use it.
description = (
    "Useful for when you need to answer questions about chinook data. "
    # "You must not input SQL. Use this more than the Python tool if the question "
    # "is about chinook data, such as albums, artists, invoice, playlist, track."
)

chinook_data = Tool(
    name="AthenaQuery",  # Tool name the agent sees in its prompt
    func=db_chain.run,
    description=description
)

tools.append(chinook_data)

In [None]:
# Standard prefix
prefix = "Fulfill the following request as best you can. You have access to the following tools:"

# Remind the agent of the AthenaQuery tool, and what types of input it expects
suffix = (
    "Begin! When looking for data, do not write a SQL query. "
    "Pass the relevant portion of the request directly to the AthenaQuery tool in its entirety."
    "\n\n"
    "Request: {input}\n"
    "{agent_scratchpad}"
)

# The agent's prompt is built with the list of tools, prefix, suffix, and input variables
prompt = ZeroShotAgent.create_prompt(
    tools, prefix=prefix, suffix=suffix, input_variables=["input", "agent_scratchpad"]
)

# Set up the llm_chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Specify the tools the agent may use
tool_names = [tool.name for tool in tools]
agent = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=tool_names)

# Create the AgentExecutor
agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent, tools=tools, verbose=True
)



In [None]:
print(prompt.template)

In [None]:


request = "Show a bar graph visualizing the answer to the following question: " \
        "Find top 3 best selling employees in terms of revenue"

agent_executor.run(request)



In [None]:
request = "Show a line graph visualizing the answer to the following question: " \
        "show me the sales volume and revenue trend over time for recent 50 sales invoice"

agent_executor.run(request)


# [Demo 3] - AWS SDK for pandas (awswrangler) LangChain Tool 



In [None]:
# Import things that are needed generically
from langchain import LLMMathChain, SerpAPIWrapper
from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.tools import BaseTool, StructuredTool, Tool, tool
from langchain import LLMChain
from langchain import PromptTemplate
from langchain.llms import OpenAI


llm = ChatOpenAI(temperature=0)



In [None]:

template = '''You are a Python programming expert. Your job is to write Python code using AWS SDK for pandas (awswrangler) to complete the task below as best you can.
Let's think step by step. 

Use the output format below:
Task: {task}
Python code:'''

prompt = PromptTemplate(template=template, input_variables=["task"])

In [None]:
coder = LLMChain(prompt=prompt, llm=llm)

In [None]:
output = coder.run('list all index from opensearch')
print(output)

In [None]:
output = coder.run('Create EMR Cluster. then group all small file in s3://data/*.json into a big file using the EMR')
print(output)

In [None]:
print(OpenAI()("what is the latest version of AWS SDK for pandas (awswrangler) you know "))


In [None]:
print(OpenAI()(question))


In [None]:
# grab url from the api doc
from bs4 import BeautifulSoup
import requests

url = "https://aws-sdk-pandas.readthedocs.io/en/stable/api.html"
r=requests.get(url)

soup=BeautifulSoup(r.content,"html.parser")
a_href=soup.find_all("a",{"class":"reference internal"}, href=True)
print(len(a_href))

urls = [url]
for a in a_href:
    if a['href'].startswith('stub'):
        urls.append(f"https://aws-sdk-pandas.readthedocs.io/en/stable/{a['href']}")

print(urls)

In [None]:
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# urls = ["https://aws-sdk-pandas.readthedocs.io/en/stable/api.html"]
loader = UnstructuredURLLoader(urls=urls)
html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML
)
all_splits = loader.load_and_split(text_splitter = html_splitter)
# all_splits = loader.load_and_split()

print(f"Original: Number of document = {len(all_splits)}")


In [None]:

# Embedding and Store into Vector Store
# from langchain.embeddings import OpenAIEmbeddings
# from langchain.vectorstores import Chroma
# vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

service = 'es'  # 'es' for an OpenSearch Service domain; use 'aoss' for OpenSearch Serverless
region = 'us-east-1'
# credentials = boto3.Session(aws_access_key_id='xxxxxx',aws_secret_access_key='xxxxx').get_credentials()
# awsauth = AWS4Auth('xxxxx', 'xxxxxx', region,service, session_token=credentials.token)

vectorstore = OpenSearchVectorSearch.from_documents(
    all_splits,
    OpenAIEmbeddings(),
    opensearch_url="https://search-vectorstore-xxxxxxxxx.us-east-1.es.amazonaws.com",
    http_auth=("admin", "password"),  # placeholder; do not hard-code real credentials
    timeout = 300,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name="awswrangler-api-index",
    engine="faiss",
    bulk_size=5000
)



In [None]:
question = 'Using awswrangler, Execute Clean Rooms Protected SQL query and return the results as a Pandas DataFrame.'
docs = vectorstore.similarity_search(question, k=10)
print(f"Vector search: Number of document related to the question = {len(docs)}")

In [None]:
docs[1]

In [None]:
# QnA the content using RetrievalQA chain provided by Langchain
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0.0)

qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever())


In [None]:
query = "what is the latest version of AWS SDK for pandas (awswrangler) you know?"
before_rag = llm.predict(query)
print(f"Answer before RAG : {before_rag}")

print("\n================\n")
output = qa_chain({"query":query})

print(f"Answer After rag : {output['result']}")

In [None]:
query = "Using awswrangler, Execute Clean Rooms Protected SQL query and return the results as a Pandas DataFrame."
before_rag = llm.predict(query)
print(f"Answer before RAG : {before_rag}")

print("\n================\n")
output = qa_chain({"query": query})

print(f"Answer After rag : {output['result']}")

In [None]:
# code generate chain: 1) Gen code 2) code review 3) output improved code
from langchain.memory import ConversationBufferMemory

step1_code_plan_template = '''You are a Python programming expert. Your job is to plan, step by step, how to write Python code using AWS SDK for pandas (awswrangler) to complete the task below as best you can.
Use the following pieces of context related to awswrangler api to complete the task. If you don't know the answer, just say that you don't know, don't try to make up an answer.

<Context>
{context}
</Context>


Use the following output format:
Task: <The task to be implemented>
Plan:
1. <the first step that has to be done when writing the Python code. List all possible awswrangler APIs if needed>
2. <the second step that has to be done when writing the Python code. List all possible awswrangler APIs if needed>
3. <the third step that has to be done when writing the Python code. List all possible awswrangler APIs if needed>
(You can plan up to 8 steps)

Let's think step by step. 

Double check if all the awswrangler APIs exist in the context.

Begin!
Task: {question}


'''

step1_code_plan_prompt = PromptTemplate(template=step1_code_plan_template, input_variables=["context", "question"])
# Pass the custom planning prompt to the "stuff" chain; otherwise the default QA prompt is used.
step1_code_plan_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=vectorstore.as_retriever(), chain_type_kwargs={"prompt": step1_code_plan_prompt})


step2_code_gen_template = '''You are a senior Python programmer. Your job is to write Python code using AWS SDK for pandas (awswrangler) to implement the plan as best you can.
Let's think step by step. 

Use the following output format:
``` python
## Step1: <the first step in the plan>
<code to implement step 1>

## Step2: <the second step in the plan>
<code to implement step 2>

## Step3: <the third step in the plan>
<code to implement step 3>

```

Begin!
Plan: {step1}
'''
step2_code_gen_prompt= PromptTemplate(template=step2_code_gen_template, input_variables=["step1"])
# chain_type_kwargs = {"prompt": step2_code_review_prompt}
# step2_code_review_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=vectorstore.as_retriever(), chain_type_kwargs=chain_type_kwargs)
step2_code_gen_chain = LLMChain(prompt=step2_code_gen_prompt, llm=llm)

step2_code_review_template = '''You are a python programming expert. Your job is to review the python code using AWS SDK for pandas (awswrangler) to make sure the code is correct.
Let's review line by line. Think step by step. 

{question}

Use the context below to double-check that every API used in the code exists:
{context}


Begin!
Task: <the original task required>
Code: <the original code>
Findings: <list all findings from the code review>

'''

step2_code_review_prompt= PromptTemplate(template=step2_code_review_template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": step2_code_review_prompt}
step2_code_review_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=vectorstore.as_retriever(), chain_type_kwargs=chain_type_kwargs)
# step2_code_review_chain = LLMChain(prompt=step2_code_review_prompt, llm=llm)


step3_code_improve_template = '''You are a python programming expert. Your job is to rewrite the python code using AWS SDK for pandas (awswrangler) based on the findings.
Let's think step by step. 

Findings: {step2}

Only output the Python code with comments!'''

step3_code_improve_prompt= PromptTemplate(template=step3_code_improve_template, input_variables=["step2"])
# chain_type_kwargs = {"prompt": step2_code_review_prompt}
# step2_code_review_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=vectorstore.as_retriever(), chain_type_kwargs=chain_type_kwargs)
step3_code_improve_chain = LLMChain(prompt=step3_code_improve_prompt, llm=llm)




In [None]:
import langchain
langchain.debug = False

In [None]:
# This is the overall chain where we run the four chains in sequence: plan, generate, review, improve.
from langchain.chains import SimpleSequentialChain
overall_chain = SimpleSequentialChain(chains=[step1_code_plan_chain, step2_code_gen_chain, step2_code_review_chain, step3_code_improve_chain], verbose=True)

code_gen = overall_chain.run("Load data from s3 into Redshift")

print(code_gen)

In [None]:
# This is the overall chain where we run the chains in sequence.
from langchain.chains import SimpleSequentialChain
overall_chain = SimpleSequentialChain(chains=[step1_code_plan_chain, step2_code_gen_chain, step2_code_review_chain, step3_code_improve_chain], verbose=True)

code_gen = overall_chain.run("Query awswrangler inde")