<a href="https://colab.research.google.com/github/osaeed-ds/vector-hello/blob/main/FieldLearning_DevFramework_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Field Learning - LangChain**


## **Introduction**

In this exercise we use three LangChain modules:
- Prompts (Prompt Templates)
- Chains
- Retrieval (Vector Store of Astra DB)



## **Prerequisites Setup**


* Follow [these steps](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html) to create a new vector search enabled database in Astra.
* Generate a new ["Database Administrator" token](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html).
* Download the secure connect bundle for the database you just created (you can do this from the "Connect" tab of your database).
* You will also need the secret for the LLM provider of your choice:
  - For OpenAI, you will need an [OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key). This requires an OpenAI account with billing enabled.
  - For Vertex AI, you will need a config file.
  - For more details, see [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org.


In [1]:
# install required dependencies
! pip install datasets google-cloud-aiplatform openai pandas tiktoken langchain cassio

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.34.0-py2.py3-none-any.whl (3.1 MB)
Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting langchain
  Downloading langchain-0.0.308-py3-none-any.whl (1.8 MB)

You may be asked to "Restart the Runtime" at this time, as some dependencies
have been upgraded. **Please do restart the runtime now** for a smoother execution from this point onward.

## **Astra DB Setup**
The following steps ask for the keyspace of the vector search enabled Astra DB that you want to use for this example, as well as the Astra DB token that you generated as part of the prerequisites. You will also need to upload the [Secure Connect Bundle](https://awesome-astra.github.io/docs/pages/astra/download-scb/#c-procedure).
Lastly, we create three helper functions for working with Astra DB over a secure connection: `getCQLSession`, `getCQLKeyspace`, and `getTableCount`.

In [2]:
# Input your database keyspace name:
ASTRA_DB_KEYSPACE = input('Your Astra DB Keyspace name: ')

Your Astra DB Keyspace name: vector_preview


In [3]:
from getpass import getpass
# Input your Astra DB token string, the one starting with "AstraCS:..."
ASTRA_DB_TOKEN_BASED_PASSWORD = getpass('Your Astra DB Token: ')

Your Astra DB Token: ··········


In [4]:
# Upload your Secure Connect Bundle zipfile:
import os
from google.colab import files


print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )

Please upload your Secure Connect Bundle


Saving secure-connect-osaeed-vector.zip to secure-connect-osaeed-vector.zip


In [5]:
# colab-specific override of helper functions
from cassandra.cluster import (
    Cluster,
)
from cassandra.auth import PlainTextAuthProvider

# The "username" is the literal string 'token' for this connection mode:
ASTRA_DB_TOKEN_BASED_USERNAME = 'token'


def getCQLSession(mode='astra_db'):
    if mode == 'astra_db':
        cluster = Cluster(
            cloud={
                "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
            },
            auth_provider=PlainTextAuthProvider(
                ASTRA_DB_TOKEN_BASED_USERNAME,
                ASTRA_DB_TOKEN_BASED_PASSWORD,
            ),
        )
        astraSession = cluster.connect()
        return astraSession
    else:
        raise ValueError('Unsupported CQL Session mode')

def getCQLKeyspace(mode='astra_db'):
    if mode == 'astra_db':
        return ASTRA_DB_KEYSPACE
    else:
        raise ValueError('Unsupported CQL Session mode')

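from cassandra.query import SimpleStatement

def getTableCount(table_name='langchain_constitution'):
    # Count the rows of an Astra DB table. This relies on the module-level
    # `session` and `keyspace` created in the "Load into Astra DB" section,
    # so call it only after that cell has run.
    query = SimpleStatement(f"SELECT COUNT(*) FROM {keyspace}.{table_name};")

    # execute the query
    results = session.execute(query)
    return results.one().count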


## **LLM Provider**

In the cell below you can choose between **GCP VertexAI** or **OpenAI** for your LLM services.
(See [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for more details).

Make sure you set the `llmProvider` variable and supply the corresponding access secrets in the following cell.

In [6]:
# Set your secret(s) for LLM access:
llmProvider = 'OpenAI'  # 'GCP_VertexAI'


In [7]:
if llmProvider == 'OpenAI':
    apiSecret = getpass(f'Your secret for LLM provider "{llmProvider}": ')
    os.environ['OPENAI_API_KEY'] = apiSecret
elif llmProvider == 'GCP_VertexAI':
    # we need a json file
    print(f'Please upload your Service Account JSON for the LLM provider "{llmProvider}":')
    from google.colab import files
    uploaded = files.upload()
    if uploaded:
        vertexAIJsonFileTitle = list(uploaded.keys())[0]
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(os.getcwd(), vertexAIJsonFileTitle)
    else:
        raise ValueError(
            'No file uploaded. Please re-run the cell.'
        )
else:
    raise ValueError('Unknown/unsupported LLM Provider')

Your secret for LLM provider "OpenAI": ··········


## **Module 1 - Prompts**
Prompts give an LLM the context it needs to craft an appropriate response. A prompt can consist of input from the user, additional data retrieved from an external data source, and guidance from the system.

Prompts can also be parameterized with a template. This allows us to build the prompt from user input as well as other structured data. In the example below, we build two prompts from the same template: one for a helpful chatbot named Jane, and one for a sarcastic chatbot named Bob. The user asks each bot how it is doing and what its name is. We also supply the prompt with a personality-appropriate answer to the first question.

In [8]:
from langchain.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", "You are a {personality} AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "{first_response}"),
    ("human", "{user_input}"),
])

messages_sarcastic = template.format_messages(
    personality='sarcastic',
    name="Bob",
    first_response="Too many queries from dumb people",
    user_input="What is your name?"
)

messages_helpful = template.format_messages(
    personality='helpful',
    name="Jane",
    first_response="I'm doing well, thanks!",
    user_input="What is your name?"
)


from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()
sarcastic = llm(messages_sarcastic)
print("Sarcastic Message: " , sarcastic)
helpful = llm(messages_helpful)
print("Helpful Message: ", helpful)

Sarcastic Message:  content="Oh, I see you're one of those who ask obvious questions. My name is Bob. Now, what can I do for you?"
Helpful Message:  content='My name is Jane. How can I assist you today?'
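
To make the substitution step concrete without calling an LLM, here is a minimal plain-Python sketch of what filling a chat template amounts to. It is not LangChain's implementation: `fill_chat_template` is a hypothetical helper, introduced only for illustration, that applies `str.format` to each (role, text) pair of the same template used above.

```python
# Hypothetical stand-in for template formatting, for illustration only.
TEMPLATE = [
    ("system", "You are a {personality} AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "{first_response}"),
    ("human", "{user_input}"),
]


def fill_chat_template(template, **values):
    """Return (role, text) pairs with every {placeholder} substituted."""
    return [(role, text.format(**values)) for role, text in template]


messages = fill_chat_template(
    TEMPLATE,
    personality="helpful",
    name="Jane",
    first_response="I'm doing well, thanks!",
    user_input="What is your name?",
)

# Print the filled-in conversation, one message per line.
for role, text in messages:
    print(f"{role}: {text}")
```

The same template yields the sarcastic variant by passing different values, which is the point of parameterizing the prompt.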


## **Module 2 - Chains**

Chains allow you to string together a sequence of calls. We continue with the template example from above, where separate statements filled in the inputs and called the LLM. With a chain, both steps are handled in a single call. We construct the prompts slightly differently here and then pass them to the chain.

In [23]:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.prompts import HumanMessagePromptTemplate
from langchain.prompts import SystemMessagePromptTemplate
from langchain.prompts import AIMessagePromptTemplate

human_message_prompt2 = HumanMessagePromptTemplate(
        prompt=PromptTemplate(
            template="{user_input}",
            input_variables=["user_input"]
        )
    )

system_message_prompt = SystemMessagePromptTemplate(
        prompt=PromptTemplate(
            template="You are a {personality} AI bot. Your name is {name}.",
            input_variables=["personality", "name"]
        )
    )

ai_message_prompt = AIMessagePromptTemplate(
        prompt=PromptTemplate(
            template="{first_response}",
            input_variables=["first_response"]
        )
    )

human_message_prompt1 = HumanMessagePromptTemplate(
        prompt=PromptTemplate(
            template="Hello how are you doing? {user_input2}",
            input_variables=["user_input2"]
        )
    )



In [29]:
chat_prompt_template = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt1, ai_message_prompt, human_message_prompt2])
chain = LLMChain(llm=llm, prompt=chat_prompt_template)
print(chain.run({
    'personality' :"sarcastic",
    'name' : "Bob",
    'first_response' : "Too many queries from dumb people",
    'user_input' : "What is your name?",
    'user_input2' : " "

    }))


print(chain.run({
    'personality' :"helpful",
    'name' : "Jane",
    'first_response' : "I'm doing well, thanks!",
    'user_input' : "What is your name?",
    'user_input2' : " "

    }))


Oh, sorry, I forgot to introduce myself. I'm Bob, the sarcastic AI bot. How can I assist you today?
My name is Jane. How can I assist you today?
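
Conceptually, the chain above folds "fill in the prompt" and "call the model" into one `run` call. The sketch below shows that idea in plain Python. It is a simplified analogue, not how `LLMChain` is implemented: `SimpleChain` and `fake_llm` are hypothetical names, and `fake_llm` just echoes the prompt so the example runs without an API key.

```python
from typing import Callable, Dict


class SimpleChain:
    """Hypothetical, simplified analogue of a prompt-plus-model chain."""

    def __init__(self, llm: Callable[[str], str], template: str):
        self.llm = llm
        self.template = template

    def run(self, inputs: Dict[str, str]) -> str:
        prompt = self.template.format(**inputs)  # step 1: build the prompt
        return self.llm(prompt)                  # step 2: call the model


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: it only reports the prompt it received.
    return f"[model received] {prompt}"


chain_sketch = SimpleChain(
    llm=fake_llm,
    template="You are a {personality} AI bot named {name}. {user_input}",
)

print(chain_sketch.run({
    "personality": "helpful",
    "name": "Jane",
    "user_input": "What is your name?",
}))
```

Swapping `fake_llm` for a function that calls a real model would leave the calling code unchanged, which is the convenience a chain provides.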


## **Module 3 - Retrieval (Astra as a Vector DB)**
We are going to use the LangChain retrieval module with Astra DB as the vector store. (This example was meant to build upon the prompt and chain examples above, but I ran out of time.)

### **Dataset Setup**
We will use the US Constitution as our dataset.

In [44]:
!curl https://www.govinfo.gov/content/pkg/CDOC-110hdoc50/html/CDOC-110hdoc50.htm > constitution.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  291k    0  291k    0     0   485k      0 --:--:-- --:--:-- --:--:--  486k


In [88]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

loader = TextLoader("constitution.txt")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
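
For intuition about what the splitting step produces, here is a rough plain-Python sketch of separator-based chunking with no overlap. It is an approximation under stated assumptions, not LangChain's code: `split_text` is a hypothetical helper, it assumes pieces are delimited by blank lines, and it keeps a piece whole if that piece alone exceeds `chunk_size`.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole rather than cut (an assumption made for this sketch)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate      # the piece still fits, keep packing
        else:
            chunks.append(current)   # close the current chunk
            current = piece          # and start a new one with this piece
    if current:
        chunks.append(current)
    return chunks


# Three 40-character paragraphs with a 100-character budget:
sample = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
chunks = split_text(sample, chunk_size=100)
print([len(c) for c in chunks])
```

The first two paragraphs fit together in one chunk (40 + 2 + 40 characters), while the third starts a new chunk.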





### **Load into Astra DB**

_**NOTE:** this uses Cassandra's "Vector Similarity Search" capability.
Make sure you are connecting to a vector-enabled database for this demo._

In [89]:
from cassandra.query import SimpleStatement
from langchain.vectorstores.cassandra import Cassandra

# creation of the DB connection
cqlMode = 'astra_db'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(136517019521648) 8dce574d-98e5-4b4d-bdfe-2477340a5d09-us-east1.db.astra.datastax.com:29042:fc9aceac-af5d-4952-b0ee-0c341ee23681> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


In [90]:
embeddings_function = OpenAIEmbeddings()

# use LangChain's Cassandra integration to load the chunked documents into Astra
docsearch = Cassandra.from_documents(
    documents=docs,
    embedding=embeddings_function,
    session=session,
    keyspace=keyspace,
    table_name='langchain_constitution',
)

### **Retrieve / Query Data from Astra via Vector Search**

In [91]:
query = "What is the role of the Vice President?"
docs = docsearch.similarity_search(query, k=2)
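
The idea behind `similarity_search` is that the query is embedded into a vector and the stored chunks whose vectors lie closest to it are returned. The toy sketch below illustrates that ranking with hand-written 3-dimensional vectors. It assumes cosine similarity as the metric and does an exhaustive scan, whereas the database performs its own vector search on the server side; `cosine_similarity`, `top_k`, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=2):
    """rows is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vec, row[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings"; real ones come from the embedding model and are much longer.
toy_rows = [
    ("chunk about the Vice President", [0.9, 0.1, 0.0]),
    ("chunk about taxation", [0.0, 0.2, 0.9]),
    ("chunk about the Senate", [0.6, 0.6, 0.1]),
]

print(top_k([1.0, 0.0, 0.0], toy_rows, k=2))
```

Here `k=2` plays the same role as in the query above: it caps how many of the best-matching chunks are returned.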

In [92]:
print(docs[0].page_content)

The Electors shall meet in their respective states, and 
vote by ballot for President and Vice-President, one of whom, 
at least, shall not be an inhabitant of the same state with 
themselves; they shall name in their ballots the person voted 
for as President, and in distinct ballots the person voted for 
as Vice-President, and they shall make distinct lists of all 
persons voted for as President, and of all persons voted for as 
Vice-President, and of the number of votes for each, which 
lists they shall sign and certify, and transmit sealed to the 
seat of the government of the United States, directed to the 
President of the Senate;--The President of the Senate shall, in 
the presence of the Senate and House of Representatives, open 
all the certificates and the votes shall then be counted;--The 
person having the greatest number of votes for President, shall 
be the President, if such number be a majority of the whole 
number of Electors appointed; and if no person have such 
majo