# LangChain

LangChain is a framework for building AI-powered applications on top of large language models (LLMs). It acts as an interface between the models and other tools and applications, and lets us compose complex interactions between LLMs, APIs, and different data sources.

## LLM Wrappers

Load API keys

In [99]:
import os
from dotenv import load_dotenv
load_dotenv()

True

Read env keys

In [None]:
openai_api_key = os.getenv("OPENAI_API_KEY")
hugging_face_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
langchain_token = os.getenv("LANGCHAIN_API_KEY")

pinecone_api_key = os.getenv("PINECONE_API_KEY")
pinecone_env = os.getenv("PINECONE_ENV")
pinecone_index_host = os.getenv("PINECONE_INDEX_HOST")


### LLM Wrappers

In [3]:
from langchain_huggingface import HuggingFaceEndpoint

repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Initialize the HuggingFaceEndpoint
llm = HuggingFaceEndpoint(repo_id=repo_id,
                          max_length=128,
                          temperature=0.3,
                          huggingfacehub_api_token=hugging_face_token
                          )

# Use the llm instance as needed
response = llm.invoke("Who won the FIFA World Cup in the year 1994? ")
print(response)

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


1994 FIFA World Cup was won by Brazil. Brazil defeated Italy in the final match, which was held on July 17, 1994, at the Rose Bowl in Pasadena, California, United States. The team was led by Romário, who was the top scorer of the tournament with five goals. Brazil won their fourth FIFA World Cup title with this victory.


#### Chat models

In [4]:
from langchain_core.messages import (
    HumanMessage,
    SystemMessage,
)
from langchain_huggingface import ChatHuggingFace

messages = [
    SystemMessage(content="You are a sports expert"),
    HumanMessage(content="Who won the world cup in 1994?")
]

chat_model = ChatHuggingFace(llm=llm)



In [5]:
chat_model.model_id

'mistralai/Mistral-7B-Instruct-v0.2'

In [6]:
chat_model._to_chat_prompt(messages)

'<s> [INST] You are a sports expert\n\nWho won the world cup in 1994? [/INST]'
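The prompt string above can be reproduced by hand for this single-turn case. The sketch below is not the chat template that `ChatHuggingFace` applies; it is a simplified illustration that only matches the output shown here, and `to_mistral_prompt` is a hypothetical helper name.

```python
def to_mistral_prompt(messages):
    """Flatten (role, content) pairs into one Mistral-style instruction prompt.

    Simplified: handles a single turn only and merges the system text
    into the user instruction, matching the output shown above.
    """
    body = "\n\n".join(content for _role, content in messages)
    return f"<s> [INST] {body} [/INST]"


example = [
    ("system", "You are a sports expert"),
    ("human", "Who won the world cup in 1994?"),
]
print(repr(to_mistral_prompt(example)))
```

The point is that a chat model wrapper still sends plain text to the underlying model; the message objects are only a structured way to build that text.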

In [7]:
res = chat_model.invoke(messages)
print(res.content)

 I'd be happy to help answer your sports-related question! The 1994 FIFA World Cup was won by the Brazilian national soccer team. Brazil defeated Italy in a thrilling penalty shootout after the match ended in a 0-0 draw following extra time. This was Brazil's fourth World Cup title, making them the most successful team in the history of this prestigious tournament. Giving Brazilian fans one more reason to celebrate and wave their yellow flags high!


#### Prompt Templates

In [17]:
from langchain_core.prompts import PromptTemplate

template = """
    You are an expert European football commentator.
    Explain who won the World Cup in {world_cup_year} and why they were successful.
"""

prompt = PromptTemplate(
    input_variables=["world_cup_year"],
    template=template,
)

prompt

PromptTemplate(input_variables=['world_cup_year'], template='\n    You are an expert European football commentator.\n    Explain who won the World Cup in {world_cup_year} and why they were successful.\n')

In [20]:
llm.invoke(prompt.format(world_cup_year="2002"))

"\nI'd be happy to help you with that!\n\nThe 2002 FIFA World Cup was won by the Brazilian national football team. This was their fifth World Cup title, making them the most successful team in the history of the tournament.\n\nBrazil's success in 2002 can be attributed to several factors. First and foremost, they had a talented and experienced squad, with many players at the peak of their careers. The team was led by Ronaldo, who had just come off a brilliant season with Real Madrid, scoring 47 goals in all competitions. Other key players included Rivaldo, Ronaldinho, Kaká, and Roberto Carlos.\n\nBrazil's attacking football was a joy to watch, with their players combining intricate passing, quick movements, and individual skill to create numerous goal-scoring opportunities. They scored 18 goals in seven matches, with Ronaldo and Rivaldo both netting six goals each.\n\nDefensively, Brazil were solid, with Roberto Carlos and Cafu providing excellent cover at full-back, and the central de
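Formatting a prompt template amounts to substituting named placeholders into a string before the text is sent to the model. A minimal plain-Python sketch of that substitution, without LangChain, is shown below. The `fill` helper is hypothetical and only adds an explicit check for missing variables.

```python
import string


def fill(template, **variables):
    """Substitute {placeholders} in template, failing loudly on missing ones."""
    # string.Formatter().parse yields (literal, field_name, spec, conversion).
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = needed - set(variables)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**variables)


template = "Explain who won the World Cup in {world_cup_year} and why they were successful."
print(fill(template, world_cup_year="2002"))
```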

#### Chains

A chain combines a language model and a prompt template into a single interface that takes an input from the user and returns an answer from the LLM. It works like a composite function where the inner function is the prompt template and the outer function is the language model.

We can also build composite chains, where one chain outputs an answer and a second chain takes that output as its input.
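The composite-function view can be sketched in plain Python. This is only an illustration of the idea, not LangChain's implementation: `make_chain`, `sequential`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM call.

```python
def make_chain(prompt_template, model):
    """Return a function that formats the prompt, then calls the model."""
    def run(user_input):
        return model(prompt_template.format(input=user_input))
    return run


def sequential(*chains):
    """Feed each chain's output into the next one, in order."""
    def run(user_input):
        value = user_input
        for chain in chains:
            value = chain(value)
        return value
    return run


def fake_model(prompt):
    # Stand-in for an LLM call: it just echoes the prompt it received.
    return f"<answer to: {prompt}>"


first = make_chain("Who won the World Cup in {input}?", fake_model)
second = make_chain("Who was the best player, given: {input}", fake_model)

print(sequential(first, second)("2010"))
```

Because the second chain only sees the text produced by the first, any detail the first answer omits is lost to the second prompt.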

In [19]:
from langchain.chains import LLMChain

chain = LLMChain(llm=llm, prompt=prompt)

# Run the chain, specifying only the input variable
print(chain.run("2002"))




I'd be happy to help you with that!

The 2002 FIFA World Cup was won by the Brazilian national football team. This was their fifth World Cup title, making them the most successful team in the history of the tournament.

Brazil's success in 2002 can be attributed to several factors. First and foremost, they had a talented and experienced squad, with many players at the peak of their careers. The team was led by Ronaldo, who had just come off a brilliant season with Real Madrid, scoring 47 goals in all competitions. Other key players included Rivaldo, Ronaldinho, Kaká, and Roberto Carlos.

Brazil's attacking football was a joy to watch, with their players combining intricate passing, quick movements, and individual skill to create numerous goal-scoring opportunities. They scored 18 goals in seven matches, with Ronaldo and Rivaldo both netting six goals each.

Defensively, Brazil were solid, with Roberto Carlos and Cafu providing excellent cover at full-back, and the central defensive pa

In [30]:
second_prompt = PromptTemplate(
    input_variables = ["year"],
    template = "Who was the best player of that world cup in {year} of that team that won the world cup",
)

chain_two = LLMChain(llm=llm, prompt=second_prompt)

In [31]:
from langchain.chains import SimpleSequentialChain

overall_chain = SimpleSequentialChain(chains=[chain, chain_two], verbose=True)

# Run the chain specifying only the input variable of the first chain
explanation = overall_chain.run("2010")
print(explanation)



> Entering new SimpleSequentialChain chain...
I'd be happy to help you with that!

The 2010 FIFA World Cup was won by Spain. This was a historic achievement for the Spanish national team, as they were the first European team to win the World Cup on South American soil.

Spain's success in the tournament can be attributed to several factors. First and foremost, they had a well-organized and disciplined team, with a clear game plan and a strong defensive structure. They played a possession-based style of football, keeping the ball for long periods of time and controlling the tempo of the game.

Another key factor was the quality of their players. Spain had a talented and experienced squad, with stars like Xavi, Andres Iniesta, Sergio Busquets, and Iker Casillas leading the way. These players were all in their prime and had a great deal of international experience, having won the European Championship in 2008.

Spain also benefited from some good fortune in the tou

#### Embeddings and vector stores

Split the text into chunks

In [32]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 0,
)

texts = text_splitter.create_documents([explanation])

In [33]:
texts

[Document(page_content=', it is difficult to single out just one player as the best, as they all played crucial roles in the'),
 Document(page_content="team's success. However, some players stood out for their exceptional performances throughout the"),
 Document(page_content='tournament.'),
 Document(page_content='Xavi Hernandez was a key figure in midfield, controlling the tempo of the game and setting the'),
 Document(page_content="team's attacks in motion. He completed an incredible 113 passes in the final against the"),
 Document(page_content='Netherlands, more than any other player in a single World Cup match.'),
 Document(page_content='Andres Iniesta was another standout player, scoring the winning goal in the final and providing'),
 Document(page_content='numerous creative moments throughout the tournament. He was named the best player of the tournament'),
 Document(page_content='by FIFA, and it was a well-deserved recognition of his exceptional talent and impact on the game.'),

In [35]:
texts[0].page_content

', it is difficult to single out just one player as the best, as they all played crucial roles in the'
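To make the effect of `chunk_size` concrete, here is a deliberately simplified splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries several separators in turn; this sketch only splits on whitespace, and `split_on_spaces` is a hypothetical helper.

```python
def split_on_spaces(text, chunk_size=100):
    """Greedily pack whole words into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # Accept the word if it fits, or if it is the first word of a chunk
        # (a single over-long word gets a chunk of its own).
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Xavi Hernandez was a key figure in midfield, controlling the tempo "
    "of the game and setting the team's attacks in motion."
)
for chunk in split_on_spaces(sample, chunk_size=50):
    print(len(chunk), repr(chunk))
```

Small chunks like these keep each embedding focused on one idea, at the cost of cutting sentences in the middle, as the output above shows.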

Turn the text into Embeddings

In [56]:
#from langchain.embeddings import OpenAIEmbeddings
#embeddings = OpenAIEmbeddings(model="ada", openai_api_key=openai_api_key)

from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


Embed a document

In [71]:
# Document to be embedded
text = texts[0].page_content

# Embed the document
query_result = embeddings.embed_query(text)

# Print the embedding
print(query_result)
print(len(query_result))

[-0.004571183118969202, -0.0062856292352080345, 0.01014440692961216, 0.07735513150691986, 0.003391179023310542, -0.05500480532646179, 0.013262740336358547, -0.050202302634716034, -0.04726168513298035, 0.03233026713132858, 0.018941501155495644, 0.04555084556341171, 0.043422114104032516, -0.07671452313661575, 0.028825916349887848, -0.030693737789988518, 0.018923476338386536, 0.0123733039945364, -0.025069521740078926, 0.023064352571964264, -0.010610031895339489, -0.054961055517196655, 0.03724659979343414, 0.0045930324122309685, 0.025212250649929047, 0.0496058352291584, 0.018157940357923508, -0.01492040790617466, -0.007096467074006796, -0.010721901431679726, 0.02423938177525997, -0.02732083387672901, -0.07894320785999298, -0.06646625697612762, 1.273520297218056e-06, -0.033433716744184494, -0.028400326147675514, -0.05011608079075813, 0.02330242097377777, -0.007194665726274252, -0.043996673077344894, 0.10483585298061371, -0.016050133854150772, -0.022306136786937714, 0.0010073486482724547, 0.

Connect to Pinecone and create an index if it does not exist

In [107]:
import pinecone
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=pinecone_api_key)

index_name = "langchain-quickstart"

# Create the index only if it does not already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # Must match the embedding size (all-mpnet-base-v2 produces 768-dimensional vectors)
        metric="euclidean",  # Distance metric; lower scores mean closer vectors
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1",
        ),
    )


Grab the index from Pinecone

In [108]:
# Reuse the authenticated client and target the index by its host URL
index = pc.Index(host=pinecone_index_host)

Upload the embeddings to the Pinecone vector store

In [103]:
data_to_upsert = [
    {
        "id": "1",  # Replace with a unique ID for your document
        "values": query_result
    }
]

# Upsert the embedding into the Pinecone index
index.upsert(vectors=data_to_upsert)
print("Embedding successfully upserted to Pinecone index.")

{'upserted_count': 1}

Check how many documents we have

In [109]:
len(texts)

18

Generate a unique ID for each document, embed it, and upload the resulting dictionary objects to Pinecone

In [110]:
import uuid  # Standard-library module for generating unique IDs

# Function to generate unique IDs
def generate_unique_id():
    return str(uuid.uuid4())  # Generate and return a random UUID as a string

# Construct data_to_upsert
data_to_upsert = []

for text in texts:
    unique_id = generate_unique_id()
    data_to_upsert.append({
        "id": unique_id,
        "values": embeddings.embed_query(text.page_content) 
    })

# Upsert the embeddings into the Pinecone index
index.upsert(vectors=data_to_upsert)

print(f"{len(data_to_upsert)} documents successfully upserted to Pinecone index.")

18 documents successfully upserted to Pinecone index.


In [111]:
print(data_to_upsert)

[{'id': '1836e9c1-d74a-435f-b018-fb80895d8bb8', 'values': [-0.004571183118969202, -0.0062856292352080345, 0.01014440692961216, 0.07735513150691986, 0.003391179023310542, -0.05500480532646179, 0.013262740336358547, -0.050202302634716034, -0.04726168513298035, 0.03233026713132858, 0.018941501155495644, 0.04555084556341171, 0.043422114104032516, -0.07671452313661575, 0.028825916349887848, -0.030693737789988518, 0.018923476338386536, 0.0123733039945364, -0.025069521740078926, 0.023064352571964264, -0.010610031895339489, -0.054961055517196655, 0.03724659979343414, 0.0045930324122309685, 0.025212250649929047, 0.0496058352291584, 0.018157940357923508, -0.01492040790617466, -0.007096467074006796, -0.010721901431679726, 0.02423938177525997, -0.02732083387672901, -0.07894320785999298, -0.06646625697612762, 1.273520297218056e-06, -0.033433716744184494, -0.028400326147675514, -0.05011608079075813, 0.02330242097377777, -0.007194665726274252, -0.043996673077344894, 0.10483585298061371, -0.0160501338

Query the Pinecone vector store to find the documents most similar to a given question

In [137]:

# Number of closest matches to return
top_k = 3

# Question
question = "How many passes completed Xavi Hernandez?"

# Question embedding vector
query_vector = embeddings.embed_query(question)

results = index.query(
    vector=query_vector,
    top_k=top_k,
    include_values=True
)

# Iterate over matches and print document IDs and scores
print("Top {} similar documents:".format(len(results.matches)))
for idx, match in enumerate(results.matches):
    doc_id = match['id']
    score = match['score']
    print(f"{idx + 1}. Document ID: {doc_id}, Similarity Score: {score}")


Top 3 similar documents:
1. Document ID: a3e2c8f5-487d-4f9a-a964-59fe022946ba, Similarity Score: 0.877077222
2. Document ID: a5cfa680-47b0-4774-901e-20815c527ed8, Similarity Score: 0.944328427
3. Document ID: d81b83ad-a630-40bf-b70c-16b858bf5e24, Similarity Score: 1.22589588
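Because the index was created with `metric="euclidean"`, the score is a distance rather than a similarity: lower values mean closer vectors, which is why the list above is sorted in ascending order. The sketch below ranks a few made-up two-dimensional vectors the same way. It uses the squared Euclidean distance, which is my understanding of what Pinecone reports for this metric; treat that detail as an assumption to verify against the Pinecone documentation.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy data, invented for illustration only.
query = [1.0, 0.0]
docs = {
    "near": [0.9, 0.1],
    "middle": [0.0, 1.0],
    "far": [-1.0, 0.0],
}

# Smallest distance first: the best match is at the top.
ranked = sorted(docs, key=lambda name: squared_euclidean(query, docs[name]))
for name in ranked:
    print(name, round(squared_euclidean(query, docs[name]), 4))
```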


Make sure to assign a unique ID to each document before storing it, so that you can map each embedding back to the document's original content.
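One way to keep that link is to store the chunk text alongside the vector and to keep a local lookup from ID to text. The sketch below builds such records without calling Pinecone. `build_records` and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder for `embeddings.embed_query`, not a real embedding. Pinecone records accept an optional `metadata` dictionary, and a query can return it when called with `include_metadata=True`.

```python
import uuid


def build_records(chunks, embed):
    """Pair every chunk with a fresh UUID, its vector, and its text as metadata."""
    records = []
    for chunk in chunks:
        records.append({
            "id": str(uuid.uuid4()),
            "values": embed(chunk),
            "metadata": {"text": chunk},
        })
    return records


def fake_embed(text):
    # Placeholder for a real embedding model: two crude numeric features.
    return [float(len(text)), float(text.count(" "))]


chunks = [
    "Xavi Hernandez was a key figure in midfield.",
    "Andres Iniesta was another standout player.",
]
records = build_records(chunks, fake_embed)

# Local lookup so a match ID from a query can be turned back into text.
text_by_id = {record["id"]: record["metadata"]["text"] for record in records}
print(text_by_id[records[0]["id"]])
```

A list shaped like `records` could then be passed to `index.upsert(vectors=records)` in place of the ID-and-values dictionaries used earlier.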