<a href="https://colab.research.google.com/github/parthdasawant/LLM-Pinecone-OpenAI/blob/main/LLM_Pinecone_%26_OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Setup

In [61]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [62]:
!pip install -r /content/drive/MyDrive/LLM/requirements.txt -q

In [83]:
!cp /content/drive/MyDrive/LLM/env /content/

In [84]:
import os

old_name = r"/content/env"
new_name = r"/content/.env"
os.rename(old_name, new_name)

In [65]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

print(os.environ["PINECONE_ENV"])

gcp-starter


#### Getting Started

In [66]:
from langchain.llms import OpenAI
llm = OpenAI(model_name = "text-davinci-003", temperature=0.7)
print(llm)

OpenAI
Params: {'model_name': 'text-davinci-003', 'temperature': 0.7, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}}


In [67]:
output = llm('exlpain quatum mechanics in one sentence')
print(output)



Quantum mechanics is the branch of physics that describes the behavior of matter and energy at the subatomic scale.


In [68]:
print(llm.get_num_tokens('exlpain quatum mechanics in one sentence'))

9


In [69]:
output = llm.generate(['... is the capital of India.',
                       'What is the formula of the area of a circle?',
                       'Who has won all ICC trophies in him captaincy?'])
print(output)

generations=[[Generation(text='\n\nNew Delhi is the capital of India.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nArea of a circle = π x (radius)^2 \n\nor \n\nArea of a circle = π x (diameter/2)^2', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nMS Dhoni has won all ICC trophies in his captaincy, including the 2007 ICC World T20, the 2011 ICC ODI World Cup, and the 2013 ICC Champions Trophy.', generation_info={'finish_reason': 'stop', 'logprobs': None})]] llm_output={'token_usage': {'total_tokens': 115, 'completion_tokens': 86, 'prompt_tokens': 29}, 'model_name': 'text-davinci-003'} run=[RunInfo(run_id=UUID('b5fa1f16-438c-4f82-8295-49ac6e63f74d')), RunInfo(run_id=UUID('bb74a9bd-cd13-49e2-809e-81d496aa1279')), RunInfo(run_id=UUID('09580f4b-c177-4f82-aef2-60a8fb2b1b60'))]


In [70]:
print(output.generations[2][0].text)



MS Dhoni has won all ICC trophies in his captaincy, including the 2007 ICC World T20, the 2011 ICC ODI World Cup, and the 2013 ICC Champions Trophy.


In [71]:
output = llm.generate(['Write an original tagline for a burger restaurant'] * 3)

In [72]:
for o in output.generations:
  print(o[0].text)



"Sink your teeth into our delicious burgers - the taste is guaranteed to satisfy!"


"Sink your teeth into the tastiest burgers around!"


"Tastiest Burgers on the Block - Our Burgers are a Real Treat!"


#### Chat Models: GPT-3.5 Turbo

In [73]:
from langchain.schema import(
    AIMessage,
    HumanMessage,
    SystemMessage
)
from langchain.chat_models import ChatOpenAI

In [74]:
chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature = 0.5, max_tokens=1024)
messages = [
    SystemMessage(content='You are a physicist and respond only in Marathi.'),
    HumanMessage(content='explain quantum mechanics in one sentence')
]
output = chat(messages)

In [75]:
print(output)

content='क्वांटम मेकॅनिक्स हा विज्ञान आहे, ज्यामुळे अणुंच्या व्यवहाराच्या नियमांची व्याख्या करते आणि अणुंच्या स्थिती, गती आणि अवस्थेच्या आधारे वस्त्र, चालना, आणि शक्तीच्या वर्णन करते.'


#### Prompt Templates

In [76]:
from langchain import PromptTemplate

In [77]:
template = '''You are an experienced virologist.
Write a few sentences about the following {virus} in {language}.'''
prompt = PromptTemplate(
    input_variables=['virus', 'language'],  # one list entry per placeholder
    template=template
)
print(prompt)

input_variables=['language', 'virus'] template='You are an experienced virologist.\nWrite a few sentences about the following {virus} in {language}.'


In [78]:
from langchain.llms import OpenAI
llm = OpenAI(model_name='text-davinci-003', temperature=0.7)
output = llm(prompt.format(virus='ebola', language='Marathi'))
print(output)



एबोला एक गंभीर स्वास्थ्यसंवेदना आहे जो एक आम आणि दुरुस्त व्हायरस आहे. यामुळे ते आपल्याला गंभीर मोठे आणि आता आलेले व्हायरसचे संक्रमण करते. त्यामुळे आपल्याला अने
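
As a rough illustration of what `prompt.format(...)` does in the cell above: the placeholders are filled by string substitution before anything is sent to the model. The sketch below uses Python's built-in `str.format` as a stand-in and is not LangChain's implementation; `fill_template` is a hypothetical helper introduced only for this example.

```python
# Stand-in for PromptTemplate.format using only built-in string formatting.
template = (
    "You are an experienced virologist.\n"
    "Write a few sentences about the following {virus} in {language}."
)

def fill_template(template: str, **values: str) -> str:
    """Substitute named placeholders; raises KeyError if one is missing."""
    return template.format(**values)

filled = fill_template(template, virus="ebola", language="Marathi")
print(filled)
```

Leaving out one of the values (for example `language`) raises a `KeyError`, which is the kind of mistake that declaring the input variables explicitly is meant to surface early.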


#### Simple Chains

In [79]:
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.5)

template = '''You are an experienced virologist.
Write a few sentences about the following {virus} in {language}.'''

prompt = PromptTemplate(
    input_variables=['virus', 'language'],  # one list entry per placeholder
    template=template
)

chain = LLMChain(llm=llm, prompt=prompt)
output = chain.run({'virus': 'HSV', 'language': 'Marathi'})

In [80]:
print(output)

HSV (Herpes Simplex Virus) हे एक व्हायरस आहे ज्यामुळे हर्पीज रोग होतो. ह्या व्हायरसच्या दोन प्रकारांमध्ये पहिला प्रकार HSV-1 आणि दुसरा प्रकार HSV-2 आहे. ह्या व्हायरसचे प्रसार संपर्काच्या माध्यमाने होते, जसे की संभोग, आजाराच्या आणि आपातकालीन ज्या दिवशी आपण व्हायरससह संपर्क साधला आहे. ह्या व्हायरसच्या लक्षणे ज्वर, नाकाचा वाढ, खोकल्याचा उगवणे, खोकल्याचा दुखणे, त्वचेवर दाग आणि दुखणे आणि त्वचेवर छाले असे असू शकतात.


#### Sequential Chains

In [81]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

# First chain: generate a Python function for the given concept
llm1 = OpenAI(model_name='text-davinci-003', temperature=0.7, max_tokens=1024)
prompt1 = PromptTemplate(
    input_variables=['concept'],
    template='''You are an experienced scientist and Python programmer.
    Write a function that implements the concept of {concept}'''
)
chain1 = LLMChain(llm=llm1, prompt=prompt1)

# Second chain: describe the function produced by the first chain
llm2 = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=1.2)
prompt2 = PromptTemplate(
    input_variables=['function'],
    template='Given the Python function {function}, describe it in as much detail as possible.'
)
chain2 = LLMChain(llm=llm2, prompt=prompt2)

overall_chain = SimpleSequentialChain(chains=[chain1, chain2], verbose=True)
output = overall_chain.run('Linear Regression')




> Entering new SimpleSequentialChain chain...
.

def linear_regression(x_values, y_values):
    # Calculate the number of values
    n = len(x_values)
 
    # Calculate the sum of the x-values
    sum_x = sum(x_values)
 
    # Calculate the sum of the y-values
    sum_y = sum(y_values)
 
    # Calculate the sum of the x-squared values
    sum_x_squared = sum([x**2 for x in x_values])
 
    # Calculate the sum of the x*y values
    sum_xy = sum([x * y for (x, y) in zip(x_values, y_values)])
 
    # Calculate the linear regression slope
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x**2)
 
    # Calculate the linear regression intercept
    intercept = (sum_y - slope * sum_x) / n
 
    return slope, intercept[0m
[33;1m[1;3mThe given python function `linear_regression` calculates the slope and intercept for a linear regression model using a list of x-values and y-values.

Here is a detailed description of the function:

1. The function takes
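
The chain above passes the text produced by the first step straight into the second step. A minimal sketch of that data flow in plain Python, assuming each step is just a function from string to string; `run_sequential`, `write_function`, and `describe_function` are hypothetical stand-ins and make no API calls.

```python
from typing import Callable, List

def run_sequential(steps: List[Callable[[str], str]], first_input: str) -> str:
    """Feed each step's output to the next step and return the last output."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

# Hypothetical stand-ins for the two LLM calls (no network access needed).
def write_function(concept: str) -> str:
    return f"def {concept.lower().replace(' ', '_')}(x, y): ..."

def describe_function(source: str) -> str:
    return f"Description of: {source}"

result = run_sequential([write_function, describe_function], "Linear Regression")
print(result)
```

Because each step takes exactly one string and returns one string, the steps can be reordered or extended without changing `run_sequential`, which mirrors the single-input, single-output restriction of a simple sequential chain.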

#### Diving into Pinecone

In [85]:
import pinecone
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY'),
    environment=os.environ.get('PINECONE_ENV')
)



In [86]:
pinecone.info.version()

VersionResponse(server='2.0.11', client='2.2.4')

#### Pinecone Indexes

In [87]:
pinecone.list_indexes()

[]

In [88]:
index_name = 'langchain-pinecone'
if index_name not in pinecone.list_indexes():
  print(f'Creating index {index_name}....')
  pinecone.create_index(index_name, dimension=1536, metric='cosine', pods=1, pod_type='p1.x2')
  # To delete the index later: pinecone.delete_index(index_name)
  print('Done')
else:
  print(f'Index {index_name} already exists')

Creating index langchain-pinecone....
Done


In [89]:
pinecone.describe_index(index_name)

IndexDescription(name='langchain-pinecone', metric='cosine', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='starter', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

In [90]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [92]:
import random
# Five random 1536-dimensional vectors, one per ID below
vectors = [[random.random() for _ in range(1536)] for _ in range(5)]

ids = list('abcde')  # five IDs: 'a' through 'e'

In [103]:
index.upsert(vectors=zip(ids, vectors))

{'upserted_count': 5}

In [95]:
index.upsert(vectors=[('c', [0.3] * 1536)])

{'upserted_count': 1}

In [96]:
index.fetch(ids=['c','d'])

{'namespace': '',
 'vectors': {'c': {'id': 'c',
                   'metadata': {},
                   'values': [0.3,
                              0.3,
                              0.3,
                              ...

In [104]:
vector = [random.random() for _ in range(1536)]
# The older form, queries=[[...], ...] (a batch of query vectors), is deprecated; query with a single vector instead.

In [105]:
index = pinecone.Index(index_name)
index.query(
    vector=vector,  # the older queries= argument is deprecated
    top_k=3,
    include_values=False
)

{'matches': [{'id': 'a', 'score': 0.762876272, 'values': []},
             {'id': 'd', 'score': 0.750744045, 'values': []},
             {'id': 'b', 'score': 0.75040257, 'values': []}],
 'namespace': ''}
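
The index was created with `metric='cosine'`, so each `score` is a cosine similarity between the query vector and a stored vector. The sketch below recomputes that kind of ranking locally in pure Python. It is an illustration under stated assumptions, not Pinecone's implementation; `cosine_similarity` and `top_k` are hypothetical helpers, and the vectors are freshly generated rather than the ones stored in the index.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, stored, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

random.seed(0)  # fixed seed so the sketch is repeatable
dim = 1536
stored = {vec_id: [random.random() for _ in range(dim)] for vec_id in "abcde"}
query = [random.random() for _ in range(dim)]

matches = top_k(query, stored, k=3)
for vec_id, score in matches:
    print(vec_id, round(score, 3))
```

For vectors whose components are drawn uniformly from [0, 1), the expected product of two independent components is 1/4 and the expected square of one component is 1/3, so the cosine similarity concentrates around 3/4. That is why random test vectors like these all score close to 0.75 and the ranking among them carries little meaning.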

#### Splitting and Embedding Text using LangChain

In [106]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('/content/drive/MyDrive/LLM/speech.txt') as f:
  churchill_speech = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap= 20,
    length_function=len
)

In [110]:
chunks = text_splitter.create_documents([churchill_speech])

print(chunks[0].page_content)

print(f'Now we have {len(chunks)}')

We Shall Fight on the Beaches, 1940
Now we have 271


In [112]:
import tiktoken

def print_embedding_cost(texts):
  # Count tokens with the tokenizer used by text-embedding-ada-002
  enc = tiktoken.encoding_for_model('text-embedding-ada-002')
  total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
  print(f'Total Tokens: {total_tokens}')
  # 0.0004 USD per 1,000 tokens is the rate hard-coded in this notebook
  print(f'Embedding Cost in USD: {total_tokens/1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Total Tokens: 5410
Embedding Cost in USD: 0.002164
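
The cost figure is simple arithmetic on the token count. The sketch below isolates that step so it can be checked without a tokenizer. It assumes the rate of 0.0004 USD per 1,000 tokens that this notebook hard-codes, which may not match current pricing; `embedding_cost_usd` is a hypothetical helper.

```python
def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float = 0.0004) -> float:
    """Estimated cost: tokens are billed per 1,000 at the given (assumed) rate."""
    return total_tokens / 1000 * usd_per_1k_tokens

# Token count reported above for the chunked speech
cost = embedding_cost_usd(5410)
print(f"Embedding Cost in USD: {cost:.6f}")
```

Passing a different `usd_per_1k_tokens` lets the same estimate be rerun when the price changes.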


In [119]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [117]:
vector = embeddings.embed_query(chunks[0].page_content)
vector

[-0.04068162725339234,
 -0.040732287476915374,
 -0.00902417781228435,
 -0.006202934867852838,
 0.007497983630620358,
 ...

#### Inserting the Embedding into a Pinecone Index

In [118]:
indexes = pinecone.list_indexes()
for i in indexes:
  print('Deleting all indexes...', end='')
  pinecone.delete_index(i)
  print('Done')

index_name = 'churchill-speech'
if index_name not in pinecone.list_indexes():
  print(f'Creating index {index_name}...')
  pinecone.create_index(index_name, dimension=1536, metric='cosine')
  print('Done')

Deleting all indexes...Done
Creating index churchill-speech...
Done


In [126]:
import pinecone
from langchain.vectorstores import Pinecone

vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)

#### Asking Questions (Similarity Search)

In [128]:
query = 'Where should we fight?'
result = vector_store.similarity_search(query)
print(result)

[Document(page_content='We Shall Fight on the Beaches, 1940'), Document(page_content='We Shall Fight on the Beaches, 1940'), Document(page_content='on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the'), Document(page_content='on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the')]


In [129]:
for r in result:
  print(r.page_content)
  print('-' * 50)

We Shall Fight on the Beaches, 1940
--------------------------------------------------
We Shall Fight on the Beaches, 1940
--------------------------------------------------
on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the
--------------------------------------------------
on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the
--------------------------------------------------


In [130]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)

retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})

chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)

In [133]:
query = 'Where should we fight?'
answer = chain.run(query)
print(answer)


We should fight on the beaches, on the landing grounds, and in the fields.
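
With `chain_type='stuff'`, the retrieved chunks are all placed into a single prompt together with the question. The sketch below shows that assembly step with plain strings. The prompt wording is an assumption made for illustration and is not LangChain's actual template; `build_stuff_prompt` is a hypothetical helper.

```python
from typing import List

def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Join every retrieved chunk into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Two of the chunks returned by the similarity search above
retrieved = [
    "We Shall Fight on the Beaches, 1940",
    "on the beaches, we shall fight on the landing grounds, "
    "we shall fight in the fields and in the",
]
prompt_text = build_stuff_prompt("Where should we fight?", retrieved)
print(prompt_text)
```

Since every chunk lands in one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep `k` small.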
