<a href="https://colab.research.google.com/github/mdhuzaifapatel/localGPT/blob/main/OpenSourceLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build RAG pipeline using Open Source Large Languages

In the notebook we will build a Chat with Website use cases using Zephyr 7B model

## Installation

In [1]:
!pip install langchain faiss-cpu sentence-transformers chromadb

Collecting langchain
  Downloading langchain-0.1.4-py3-none-any.whl (803 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m803.6/803.6 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.6/17.6 MB[0m [31m24.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sentence-transformers
  Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.8/132.8 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting chromadb
  Downloading chromadb-0.4.22-py3-none-any.whl (509 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m509.0/509.0 kB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
C

## Import RAG components required to build pipeline

In [2]:
from langchain.llms import HuggingFaceHub
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter,RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain.chains import RetrievalQA, LLMChain

## Setup HuggingFace Access Token

- Log in to [HuggingFace.co](https://huggingface.co/)
- Click on your profile icon at the top-right corner, then choose [“Settings.”](https://huggingface.co/settings/)
- In the left sidebar, navigate to [“Access Token”](https://huggingface.co/settings/tokens)
- Generate a new access token, assigning it the “write” role.


In [3]:
import os
from getpass import getpass

HF_TOKEN = getpass("HF Token:")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

HF Token:··········


## External data/document - ETL

In [5]:
import nest_asyncio

nest_asyncio.apply()

In [4]:
WEBSITE_URL = "https://mdhuzaifa.netlify.app/"

In [6]:
loader = WebBaseLoader(WEBSITE_URL)
loader.requests_per_second = 1
docs = loader.aload()

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  1.58it/s]


In [7]:
docs

[Document(page_content="Md Huzaifa {••}Md Huzaifa PatelHello, I'mMuhammad Huzaifa..._Download CVAbout meScroll DownMy IntroAbout MeExperienceFresherCompleted5+ ProjectsSupportOnline 24/7I am an aspiring developer, currently pursuing my B.E in Electronics & Communication (2023).  I love programming and developing websites. Interested in Full Stack Software Development and Web Development.   Especially active on HackerRank!Contact MeMy abilitiesMy ExperienceFrontendHTMLIntermediateJavaScriptIntermediateReactBasicCSSIntermediateGitHubIntermediateBackendPythonAdvancedC++IntermediateJavaBasicMySQLBasicMy ServicesWhat I OfferFrontend DevelopmentSee more Frontend DevelopmentWhat I can provide at the frontend?Responsive Website Development.I develop the interactive user interface.Mobile-first design.Backend DevelopmentSee more Backend DevelopmentWhat I can provide at the backend?Backend development using NodeJS, ExpressJS, MySQL, etc.Python & Flask Framework.Develop CMS (Content Management Sys

## Text Splitting - Chunking

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = text_splitter.split_documents(docs)

In [9]:
chunks[0]

Document(page_content="Md Huzaifa {••}Md Huzaifa PatelHello, I'mMuhammad Huzaifa..._Download CVAbout meScroll DownMy IntroAbout MeExperienceFresherCompleted5+ ProjectsSupportOnline 24/7I am an aspiring developer, currently pursuing my B.E in Electronics & Communication (2023).", metadata={'source': 'https://mdhuzaifa.netlify.app/', 'title': 'Md Huzaifa {••}', 'language': 'en'})

In [10]:
chunks[1]

Document(page_content='(2023).  I love programming and developing websites. Interested in Full Stack Software Development and Web Development.   Especially active on HackerRank!Contact MeMy abilitiesMy', metadata={'source': 'https://mdhuzaifa.netlify.app/', 'title': 'Md Huzaifa {••}', 'language': 'en'})

In [11]:
chunks[2]

Document(page_content='MeMy abilitiesMy ExperienceFrontendHTMLIntermediateJavaScriptIntermediateReactBasicCSSIntermediateGitHubIntermediateBackendPythonAdvancedC++IntermediateJavaBasicMySQLBasicMy ServicesWhat I OfferFrontend DevelopmentSee more Frontend DevelopmentWhat I can', metadata={'source': 'https://mdhuzaifa.netlify.app/', 'title': 'Md Huzaifa {••}', 'language': 'en'})

In [12]:
chunks[3]

Document(page_content='I can provide at the frontend?Responsive Website Development.I develop the interactive user interface.Mobile-first design.Backend DevelopmentSee more Backend DevelopmentWhat I can provide at the backend?Backend development using NodeJS, ExpressJS, MySQL,', metadata={'source': 'https://mdhuzaifa.netlify.app/', 'title': 'Md Huzaifa {••}', 'language': 'en'})

In [13]:
chunks[4]

Document(page_content='ExpressJS, MySQL, etc.Python & Flask Framework.Develop CMS (Content Management System) using Sanity, Strapi, etc.Payment gateway integration.Visual DesignerSee more Visual DesignerWhat I can do with my creative designing skills?Photoshop, Premeire Pro,', metadata={'source': 'https://mdhuzaifa.netlify.app/', 'title': 'Md Huzaifa {••}', 'language': 'en'})

## Embeddings

In [14]:
embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=HF_TOKEN, model_name="BAAI/bge-base-en-v1.5"
)

## Vector Store - FAISS or ChromaDB

In [15]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x7c4a34463a00>

In [None]:
query = "Where does Tarun work?"
search = vectorstore.similarity_search(query)

In [None]:
search[0].page_content

"Tarun Jain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\nAbout Me\nEvents\nAchievements\nProjects\nWork\nConnect\n\nResume\n\n\nAI with Tarun\n\n\n\n\n\n\n\n\n\n\n\n\nIt's Me...  Tarun Jain\n\n DevRel @AI Planet🥑 ||  Community Lead @Embedchain.ai🤗||  GDE in ML🚀 \nKnow more"

## Retriever

In [None]:
retriever = vectorstore.as_retriever(
    search_type="mmr", #similarity
    search_kwargs={'k': 4}
)

In [None]:
retriever.get_relevant_documents(query)

[Document(page_content="Tarun Jain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\nAbout Me\nEvents\nAchievements\nProjects\nWork\nConnect\n\nResume\n\n\nAI with Tarun\n\n\n\n\n\n\n\n\n\n\n\n\nIt's Me...  Tarun Jain\n\n DevRel @AI Planet🥑 ||  Community Lead @Embedchain.ai🤗||  GDE in ML🚀 \nKnow more", metadata={'language': 'pt-BR', 'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain'}),
 Document(page_content='MangaPi\n\n\n\n\n\n\n\n\n\n\nHyperspectral Image Compression Using MATLAB (Hardware)\n\n\n\n\n\n\n\n\n\n\nAppreciate my Work\nFeel free to check this resources...\n\n\n\n\n\n\nAI With Tarun:\nSubscribe...', metadata={'language': 'pt-BR', 'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain'}),
 Document(page_content='MediaPipe Tasks API Bootcamp\n\r\n                 I explained the need for Mediapipe and its applications. And later the participants implemented Mediapipe Tasks projects in the Audio, Image and Text domains.', metadata={'language': 'pt-

## Large Language Model - Open Source

In [None]:
llm = HuggingFaceHub(
    repo_id="huggingfaceh4/zephyr-7b-alpha",
    model_kwargs={"temperature": 0.5, "max_length": 64,"max_new_tokens":512}
)

## Prompt Template and User Input (Augment - Step 2)

In [None]:
query = "Name the projects Tarun has worked on?"

prompt = f"""
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

## RAG RetrievalQA chain

In [None]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever)

In [None]:
response = qa.run(prompt)

In [None]:
response

"\n\nIn addition to MangaPi, a hardware-based hyperspectral image compression project using MATLAB, Tarun has worked on a variety of other projects. One such project was a computer vision project that involved developing a face recognition system using OpenCV. Additionally, Tarun participated in an Object Tracking Bootcamp, where he worked on a vehicle tracking project. \n\nTarun's experience in these projects, as well as his participation in hackathons and competitions, demonstrates his passion for innovation and problem-solving in the fields of computer vision, machine learning, and artificial intelligence."

## Chain

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

In [None]:
template = """
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
rag_chain = (
    {"context": retriever,  "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
response = rag_chain.invoke("Name the projects Tarun has worked on?")

In [None]:
print(response)

I do not have access to specific information about any particular person unless it is publicly available. however, if you provide me with the name "tarun" and specify which tarun you are referring to, i can help you with that. please provide me with more context or details.
