# Creating a RAG pipeline with Groq API's Llama 3 70B as the base model and HuggingFace's sentence-transformers as the embedding model

In [1]:
!pip install langchain sentence-transformers langchain-community chromadb langchain-groq langchain-core python-dotenv pypdf beautifulsoup4


In [1]:
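import os
from dotenv import load_dotenv

load_dotenv()  # copy the entries of a local .env file into os.environ

# os.environ only accepts strings, so assigning a missing key (None) would
# raise a TypeError. Warn about missing keys instead of reassigning them.
for key in ("GROQ_API_KEY", "HUGGINGFACE_API_KEY"):
    if os.getenv(key) is None:
        print(f"Warning: {key} is not set")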
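
A missing key usually surfaces much later as an authentication error, so it can help to check the environment up front. The sketch below is a minimal, hypothetical helper (missing_keys is our own name, not part of any library) that reports which of the expected variables are unset or empty. The variable names are the ones this notebook reads; adjust them to whatever your .env file defines.

```python
import os

# Variable names read in this notebook (an assumption about your .env file).
REQUIRED_KEYS = ("GROQ_API_KEY", "HUGGINGFACE_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


absent = missing_keys()
if absent:
    print("Missing environment variables:", ", ".join(absent))
else:
    print("All API keys found.")
```

Passing a plain dictionary as env makes the helper easy to try out without touching the real environment.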

### Step 1 : Create and populate the Vector Database

Step 1.1 : Load the text file using LangChain's TextLoader

In [4]:
## Data ingestion for text files
from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
text_documents = loader.load()

Step 1.2 : (Optional) : If the data we need is on a website, we first have to scrape it. In Python this can be done with the Beautiful Soup module.

LangChain's WebBaseLoader handles both steps: it fetches the page, parses it with Beautiful Soup and returns the text as documents.

In [12]:
## Data ingestion using web based loader

from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_path="https://www.bbc.com/sport/football/articles/cyrrjv3e4zro",  # URL of the webpage
    # Parse only the elements whose class attribute matches one of these values
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        class_=("zephr_join_beta", "ssrcss-181c4hk-SectionWrapper eustbbg0",
                "ssrcss-4rxmy3-PageStack e1mcntqj2", "ssrcss-l6cntj-ContentStack e1mcntqj0",
                "ssrcss-irv5dn-Zone e1mcntqj4",
                "ssrcss-1ocoo3l-Wrap e42f8511", "ssrcss-1y7k614-FooterStack e1mcntqj1")
        )
    )
)

web_load = loader.load()
print(web_load)

[Document(page_content='BBC HomepageSkip to contentAccessibility HelpYour accountHomeNewsSportBusinessInnovationCultureTravelMore menuMore menuSearch BBCHomeNewsSportBusinessInnovationCultureTravelEarthVideoLiveClose menuBBC SportMenuHomeFootballCricketFormula 1Rugby UTennisGolfAthleticsCyclingMoreA-Z SportsAmerican FootballAthleticsBasketballBoxingCricketCyclingDartsDisability SportFootballFormula 1Gaelic GamesGolfGymnasticsHorse RacingMixed Martial ArtsMotorsportNetballOlympic SportsRugby LeagueRugby UnionSnookerSwimmingTennisWinter SportsFull Sports A-ZMore from SportEnglandScotlandWalesNorthern IrelandNews FeedsHelp & FAQsFootballScores & FixturesTablesUEFA Euro 2024GossipTransfersTop ScorersWomenEuropeanAll TeamsLeagues & CupsMessi farewell? Vinicius Jr for Ballon d\'Or? Copa America set to startImage source, Getty ImagesImage caption, Lionel Messi is third top scorer in international men\'s football ever with 106 goalsEmlyn BegleyBBC Sport journalistPublished19 June 2024Updated 2

Step 1.3 : (Optional) : If the data we want is in a PDF, we can load the file using PyPDFLoader, which returns one document per page.

In [7]:
## Data ingestion using PDFs

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("jose's_plan.pdf") # Path to the PDF file
docs = loader.load()
docs

[Document(page_content='José\'s Plan\nAmalendu P\nTottenham Hotspur hasn\'t won a trophy in the last 13 years, since \nthey won the Carling cup in 2008 beating Chelsea at Wembley. \nFor Spurs Chairman Daniel Levy, the most logical move was to \nbring in a special manager. Someone who has plenty of experience\nwinning trophies and making history. \nJosé Mourinho was appointed Manager and Head Coach of \nTottenham Hotspur on 20th November 2019, following the \nsacking of ex-coach Mauricio Pochettinho. In his press conference\na week later, he was asked, "Do you think, not winning the final in\nThe Champions League pay a toll in the squad of Mauricio \nPochettinho ?". José replied, "I don\'t know, because I\'ve never lost\na champions league final." This shows José\'s mentality and that he\nwas under no pressure. Spurs were 14th when he was appointed. \nThey ended up finishing 6th in the 2019-2020 season, nabbing a \nEuropa League spot from Wolverhampton Wanderers only on goal\ndifference

### Step 2 : Chunk the data into smaller segments

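RecursiveCharacterTextSplitter splits the documents into chunks of at most chunk_size characters. Consecutive chunks share up to chunk_overlap characters, so that context cut at a chunk boundary is not lost. Here we use 300 and 100 characters respectively.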

In [8]:
## Dividing the text into chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 300, chunk_overlap = 100)
final_docs = text_splitter.split_documents(docs)
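
To build some intuition for what chunk_size and chunk_overlap do, here is a simplified sketch of overlapping chunks. It is not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraphs, lines and spaces, while this hypothetical sliding_window_chunks function just moves a fixed-size window over the characters.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=100):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, which is the role the 100-character overlap plays in the cell above.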

### Step 3 : Ready the embedder from HuggingFace

In [9]:
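## Vector embeddings and Vector Store
from langchain_community.embeddings import HuggingFaceEmbeddings

# Default model: sentence-transformers/all-mpnet-base-v2, run locally.
# Despite its name, ollama_emb holds HuggingFace embeddings, not Ollama ones.
ollama_emb = HuggingFaceEmbeddings()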



In [8]:
r1 = ollama_emb.embed_documents(
    [
        "Alpha is the first letter of Greek alphabet",
        "Beta is the second letter of Greek alphabet",
    ]
)
r1  # To check if it works

[[-0.013434885069727898,
  -0.03690732270479202,
  -0.022709505632519722,
  0.027076220139861107,
  -0.009050464257597923,
  -0.0067413742654025555,
  -0.0031774165108799934,
  0.0017906143330037594,
  -0.0029444140382111073,
  -0.009304695762693882,
  0.05261484906077385,
  -0.0536082461476326,
  0.03472718596458435,
  0.061702292412519455,
  -0.008199661038815975,
  -0.007371018175035715,
  -0.007594190537929535,
  0.0021142098121345043,
  0.033031001687049866,
  -0.0045167021453380585,
  -0.06418143212795258,
  0.0034098150208592415,
  -0.011867438443005085,
  0.0009274245821870863,
  0.0799378901720047,
  -0.007561047095805407,
  0.013167117722332478,
  -0.025330819189548492,
  -0.017058778554201126,
  -0.010063151828944683,
  -0.0484318733215332,
  0.0036084160674363375,
  0.00022258306853473186,
  0.02261287160217762,
  1.4194664572642068e-06,
  -0.020168449729681015,
  0.0179514829069376,
  -0.004376732744276524,
  -0.05995223671197891,
  -0.02620953693985939,
  0.04921158403158
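
A long list of floats is hard to judge by eye. One common sanity check is to compare two embeddings with cosine similarity: sentences with related meaning should score closer to 1. The sketch below is a plain-Python version written for illustration (cosine_similarity is our own helper, and the short vectors are toy stand-ins for real embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

With the variables from the cell above, cosine_similarity(r1[0], r1[1]) would compare the two test sentences.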

### Step 4 : Create the embeddings of the loaded data and insert them into the Vector DB

In [10]:
from langchain_community.vectorstores import Chroma
from chromadb.errors import InvalidDimensionException

# Embed every chunk and store the vectors in a Chroma collection
try:
    db = Chroma.from_documents(final_docs, ollama_emb)
except InvalidDimensionException:
    # A leftover collection was built with a different embedding size: drop it and retry
    Chroma().delete_collection()
    db = Chroma.from_documents(final_docs, ollama_emb)

Let's check that the database is ready by querying it. By default, similarity_search returns the four chunks closest to the query.

In [16]:
query = "What is the playstyle of Tottenham Hotspur ?"
result = db.similarity_search(query)
result      ## Similarity search for vector database

[Document(page_content="José Mourinho has kept Tottenham's conventional counter attack \nplaystyle and combined it with deep defending and quick breaks to\nmake a very balanced team. Spurs have looked very dangerous in \ntheir counter attacks this season. The Belgian Toby Alderweireld", metadata={'page': 1, 'source': "jose's_plan.pdf"}),
 Document(page_content="pressing when possession is lost. Spurs' faliure in scoring one goal\nand shutting down the game has lead them to become more \nadventurous, trying to get a second goal before 'Parking the bus'.\nThere can't be another striker enjoying this season as much as Son", metadata={'page': 2, 'source': "jose's_plan.pdf"}),
 Document(page_content="José's Plan\nAmalendu P\nTottenham Hotspur hasn't won a trophy in the last 13 years, since \nthey won the Carling cup in 2008 beating Chelsea at Wembley. \nFor Spurs Chairman Daniel Levy, the most logical move was to \nbring in a special manager. Someone who has plenty of experience", metadata=
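
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force search in plain Python. We assume a squared Euclidean distance here (treat the choice of metric as an assumption, since it depends on how the collection is configured), and the names squared_l2 and top_k as well as the 2-dimensional vectors are made up for illustration. A real vector store uses an index rather than scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=4):
    """Return the k (text, distance) pairs closest to `query_vec`, nearest first.

    `stored` is a list of (text, vector) pairs, a stand-in for a vector store.
    """
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical 2-dimensional "embeddings", purely for illustration.
store = [
    ("counter attack", [0.9, 0.1]),
    ("trophy history", [0.1, 0.9]),
    ("deep defending", [0.8, 0.3]),
]
for text, dist in top_k([1.0, 0.0], store, k=2):
    print(f"{dist:.2f}  {text}")
```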

### Step 5 : Prepare the LLM's API Endpoint

In [11]:
# Import and ready the LLM; here we use the Groq API
from langchain_groq import ChatGroq

# ChatGroq reads GROQ_API_KEY from the environment (loaded earlier by load_dotenv)
llm = ChatGroq(temperature=0, model="llama3-70b-8192")  # temperature=0 keeps answers as deterministic as possible

### Step 6 : (<b>MOST IMPORTANT PART</b>) : Declare the prompt template

In [17]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
    Answer the following question based only on the context provided.
    Think step by step, but give only the final, precise answer without explanation.
    <context>
    {context}
    </context>
    Question : {input}
    """
    )
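
The template has two placeholders: {context}, which the chain fills with the retrieved chunks, and {input}, which receives the user's question. As a rough illustration of that substitution, the sketch below uses plain Python string formatting as a stand-in for ChatPromptTemplate (fill_prompt is a hypothetical helper, and the template is shortened).

```python
TEMPLATE = """Answer the following question based only on the context provided.
<context>
{context}
</context>
Question : {input}"""


def fill_prompt(context, question):
    """Substitute the two placeholders to produce the text sent to the model."""
    return TEMPLATE.format(context=context, input=question)


print(fill_prompt("They ended up finishing 6th in the 2019-2020 season.",
                  "Where did Spurs finish?"))
```

Printing the filled prompt like this is a quick way to see exactly what the model is asked.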


### Step 7 : Create the chain between the Prompt, LLM and Retriever

In [18]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(llm,prompt)
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000018094626650>)
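
The "stuff" in create_stuff_documents_chain refers to stuffing all retrieved chunks into a single prompt. A minimal sketch of that idea is shown below. The Doc class is a simplified stand-in for a LangChain Document, stuff_documents is our own name, and the blank-line separator is an assumption about the default.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (text field only)."""
    page_content: str


def stuff_documents(docs, separator="\n\n"):
    """Join the text of every retrieved document into one context string."""
    return separator.join(doc.page_content for doc in docs)


retrieved = [
    Doc("Spurs were 14th when he was appointed."),
    Doc("They ended up finishing 6th in the 2019-2020 season."),
]
print(stuff_documents(retrieved))
```

Because every chunk ends up in one prompt, the number and size of retrieved chunks must fit within the model's context window.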

### Step 8 : Query the model

In [21]:
from langchain.chains import create_retrieval_chain

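input_question = input("Enter your question : ")

# The chain calls the retriever itself and fills {context} with the retrieved
# chunks, so only the question is passed in.
retrieval_chain = create_retrieval_chain(retriever, document_chain)
retrieval_chain.invoke({"input": input_question})['answer']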

'There is no answer to this question based on the provided context, as Ryan Reynolds is not mentioned at all in the text.'
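
When the model replies that the context holds no answer, as it does here, it helps to look at the chunks that were actually retrieved. The result of a retrieval chain is a dictionary that includes a context entry alongside the answer. The sketch below wraps the call in a small hypothetical helper, ask, and uses a stub chain (EchoChain, made up for this example) so that it runs without an API key.

```python
def ask(chain, question):
    """Run a retrieval chain on one question; return the answer and the retrieved chunks."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    result = chain.invoke({"input": question})
    return result["answer"], result.get("context", [])


class EchoChain:
    """Hypothetical stand-in for the real chain, returning a fixed result."""

    def invoke(self, inputs):
        return {"input": inputs["input"], "context": ["chunk A"], "answer": "stub answer"}


answer, sources = ask(EchoChain(), "  Who manages Spurs?  ")
print(answer, sources)
```

In the notebook, ask(retrieval_chain, input_question) would return the same answer as the cell above together with the documents it was based on.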