# Working with LLMs and LangChain

## Dependencies

In [1]:
%pip install accelerate transformers[torch] torch sentencepiece chromadb langchain --user

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Must be run after the dependencies above have been installed
%pip install xformers sentence_transformers lark

Note: you may need to restart the kernel to use updated packages.


## Loading Model

In [3]:
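from transformers import pipeline
from langchain import HuggingFacePipeline

model = "bigscience/bloom-1b1"
text_gen_pipeline = pipeline(
    task="text-generation",
    model=model,
    model_kwargs={
        "device_map": "auto",
        "load_in_8bit": False,
        # temperature and top_p only affect generation when sampling is enabled
        "temperature": 0.1,
        "top_p": 1.0,
        "max_length": 1024,
    },
    # When both are set, max_new_tokens is expected to take precedence over max_length
    max_new_tokens=2048,
)

# Wrap the Hugging Face pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)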

## Testing Model

In [4]:
from langchain import PromptTemplate, LLMChain

# Adjacent string literals are concatenated, so the newline must be explicit
template = (
    "Question: {question}\n"
    "Answer: Let's think step by step."
)
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is electroencephalography?"
print(llm_chain.run(question))

 First, we need to know what is electroencephalography. Electroencephalography is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical activity of the brain. It is a type of electro-magnetic recording that is used to measure the electrical ac

## Accessing Embeddings Database

In [5]:
import chromadb
from chromadb.config import Settings
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vector_db = Chroma(
    collection_name="airflow_docs_stable",
    persist_directory="./db/",
    embedding_function=embeddings,
)
print(f"Documents: {vector_db._collection.count()}")

Documents: 1474


In [6]:
question = "Python Code to create a Dag Class"
docs = vector_db.similarity_search(question, k=3)
result_text = "\n\n".join([doc.page_content for doc in docs])
print(result_text)

dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is DAG: {dag_number}")           hello_world()       return dag       DAG construction¶

However, you should always use data_interval_start or data_interval_end if possible, since those names are semantically more correct and less prone to misunderstandings. Note that ds (the YYYY-MM-DD form of data_interval_start) refers to date *string*, not date *start* as may be confusing to some.  Tip For more information on logical date, see Data Interval and Running DAGs.    How to create DAGs dynamically?¶ Airflow looks in your DAGS_FOLDER for modules that contain DAG
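
For intuition, `similarity_search` ranks the stored chunks by how close their embedding is to the embedding of the question. Below is a minimal pure-Python sketch of that idea, not Chroma's implementation: the three-dimensional vectors and the `top_k` helper are made up for illustration, and cosine similarity is an assumption (the distance metric Chroma actually uses may differ).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical toy embeddings (all-MiniLM-L6-v2 vectors have 384 dimensions).
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

The two vectors pointing in nearly the same direction as the query are returned first, which is also why the top results of a plain similarity search can be near-duplicates of each other.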

## Experimenting with other retrieval methods

### Maximal Marginal Relevance (MMR)
Research Paper: https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf

In [7]:
question = "Python Code to create a Dag Class"
docs = vector_db.max_marginal_relevance_search(question, k=3, fetch_k=5)
result_text = "\n\n".join([doc.page_content for doc in docs])
print(result_text)

dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is DAG: {dag_number}")           hello_world()       return dag       DAG construction¶

However, you should always use data_interval_start or data_interval_end if possible, since those names are semantically more correct and less prone to misunderstandings. Note that ds (the YYYY-MM-DD form of data_interval_start) refers to date *string*, not date *start* as may be confusing to some.  Tip For more information on logical date, see Data Interval and Running DAGs.    How to create DAGs dynamically?¶ Airflow looks in your DAGS_FOLDER for modules that contain DAG
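
The paper linked above describes MMR as a greedy re-ranking: each step picks the candidate that is most relevant to the query while being least similar to what has already been picked. The sketch below illustrates that selection rule in plain Python under stated assumptions. It is not LangChain's code; the vectors are invented, and `mmr_select` and its `lambda_mult` weight are illustrative names (the weight plays the role of the paper's λ).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidate_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine_similarity(query_vec, candidate_vecs[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine_similarity(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# Hypothetical candidates, standing in for the fetch_k nearest chunks.
toy_candidates = [
    [1.0, 0.0, 0.0],   # closest to the query
    [0.95, 0.0, 0.1],  # near-duplicate of the first candidate
    [0.6, 0.8, 0.0],   # less relevant, but different
]
toy_query = [1.0, 0.2, 0.0]

picked = mmr_select(toy_query, toy_candidates, k=2)
relevance_only = mmr_select(toy_query, toy_candidates, k=2, lambda_mult=1.0)
print(picked, relevance_only)
```

With equal weighting the near-duplicate is skipped in favour of the more distinct candidate; with `lambda_mult=1.0` the rule reduces to plain relevance ranking. In the cell above, only five candidates are fetched and two of the printed chunks match the plain similarity search, so a larger `fetch_k` may be needed before MMR changes the result.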

### Metadata filtering

In [8]:
question = "Python Code to create a Dag Class"
docs = vector_db.similarity_search(question, k=3)
for doc in docs:
    print(doc.metadata)

{'source': 'https://airflow.apache.org/docs/apache-airflow/stable/faq.html', 'title': 'FAQ — Airflow Documentation', 'language': 'en'}
{'source': 'https://airflow.apache.org/docs/apache-airflow/stable/faq.html', 'title': 'FAQ — Airflow Documentation', 'language': 'en'}
{'source': 'https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html', 'title': 'Best Practices — Airflow Documentation', 'language': 'en'}
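
Each chunk carries `source`, `title` and `language` metadata, so results can be narrowed to a single page before or after ranking. The following is a minimal sketch of that filtering idea on plain dictionaries. The `filter_by_metadata` helper and the `page_content` placeholders are hypothetical; only the URLs are taken from the output above.

```python
FAQ_URL = "https://airflow.apache.org/docs/apache-airflow/stable/faq.html"
BEST_PRACTICES_URL = "https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html"

def filter_by_metadata(records, **required):
    """Keep the records whose metadata matches every key/value in `required`."""
    return [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in required.items())
    ]

# Records shaped like the retrieved documents; the text is a placeholder.
records = [
    {"page_content": "placeholder chunk 1", "metadata": {"source": FAQ_URL, "language": "en"}},
    {"page_content": "placeholder chunk 2", "metadata": {"source": FAQ_URL, "language": "en"}},
    {"page_content": "placeholder chunk 3", "metadata": {"source": BEST_PRACTICES_URL, "language": "en"}},
]

faq_only = filter_by_metadata(records, source=FAQ_URL)
print([record["page_content"] for record in faq_only])
```

LangChain's Chroma wrapper should accept a comparable `filter` dictionary on `similarity_search`, though that is worth checking against the installed version. The self-query retriever in the next cell tries to have the LLM write such a filter from the question itself.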


In [9]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The URL of the webpage the chunk is from",
        type="string",
    ),
    AttributeInfo(
        name="title",
        description="The title of the documentation entry",
        type="string",
    ),
    AttributeInfo(
        name="language",
        description="The language the text is written in",
        type="string",
    )
]

document_content_desc = "Documentation text"
retriever = SelfQueryRetriever.from_llm(
    llm,
    vector_db,
    document_content_desc,
    metadata_field_info,
    verbose=True
)

In [10]:
question = "Python Code to create a Dag Class"
docs = retriever.get_relevant_documents(question)



OutputParserException: Parsing text
```json
{
    "query": "documentation text",
    "filter": "and(or(eq(\"source\", \"http://pypi.python.org/pypi?%3Apypi%3A%3A%3A%3A%3A%3A%3A%3A [... "%3A" repeats until the output is cut off ...]
 raised following error:
Got invalid JSON object. Error: Expecting value: line 1 column 1 (char 0)
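
One plausible reading of the trace is that the model started a JSON block, then degenerated into repeating `%3A` and never closed it, so no valid JSON could be recovered. A small model like this one may simply not be able to follow the structured-query prompt. The sketch below shows one way to guard against that: try to parse a complete fenced JSON block and fall back to an unfiltered search when parsing fails. Both helper functions are hypothetical and not part of LangChain, and `"NO_FILTER"` is only a placeholder value.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the example contains no literal fence
# Matches only a complete fenced JSON block; an unterminated block does not match.
FENCED_JSON = re.compile(re.escape(FENCE) + r"json\s*(.*?)\s*" + re.escape(FENCE), re.DOTALL)

def parse_structured_query(llm_output):
    """Return the JSON object from a fenced block, or None if it is unusable."""
    match = FENCED_JSON.search(llm_output)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

def query_with_fallback(llm_output, question):
    """Use the structured query if it parses; otherwise search the raw question unfiltered."""
    parsed = parse_structured_query(llm_output)
    if parsed is None:
        return {"query": question, "filter": None}
    return {"query": parsed.get("query", question), "filter": parsed.get("filter")}

# A well-formed reply and a runaway reply resembling the one in the trace.
good = FENCE + 'json\n{"query": "create a DAG", "filter": "NO_FILTER"}\n' + FENCE
runaway = FENCE + 'json\n{"query": "documentation text", "filter": "and(or(eq(' + "%3A" * 50

question = "Python Code to create a Dag Class"
good_result = query_with_fallback(good, question)
runaway_result = query_with_fallback(runaway, question)
print(good_result)
print(runaway_result)
```

With the real retriever, the same idea would amount to catching the `OutputParserException` around `retriever.get_relevant_documents(question)` and calling `vector_db.similarity_search(question)` instead.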

### Contextual Compression

In [11]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [12]:
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_db.as_retriever()
)

In [13]:
question = "Python Code to create a Dag Class"
docs = compression_retriever.get_relevant_documents(question)
result_text = "\n\n".join([doc.page_content for doc in docs])
print(result_text)



>>>
dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is DAG: {dag_number}")           hello_world()       return dag       DAG construction¶
>>>
Extracted irrelevant parts:
>>>
dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is
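
Contextual compression is meant to shrink each retrieved chunk to the parts that bear on the question and to drop chunks with nothing relevant. The output above suggests the small model mostly echoes the chunk and the prompt delimiters instead. To make the intended behaviour concrete, here is a minimal sketch that uses word overlap as a crude stand-in for the LLM's judgement. The `compress` helper and the toy sentences are invented for illustration, and this is not how `LLMChainExtractor` works internally.

```python
import re

def tokenize(text):
    """Lowercased set of words, used for a crude overlap test."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def compress(documents, question, min_overlap=1):
    """Keep only the sentences of each document that share words with the question."""
    question_words = tokenize(question)
    compressed = []
    for doc in documents:
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        kept = [s for s in sentences if len(tokenize(s) & question_words) >= min_overlap]
        if kept:  # documents with no relevant sentence are dropped entirely
            compressed.append(" ".join(kept))
    return compressed

# Made-up toy chunks, standing in for the base retriever's results.
toy_docs = [
    "Airflow loads every DAG from the DAGS_FOLDER. The scheduler runs every few seconds.",
    "Pools limit execution parallelism.",
]

result = compress(toy_docs, "How do I create a DAG?")
print(result)
```

Only the sentence mentioning a DAG survives, and the second chunk is removed. An LLM-based extractor aims for the same effect with semantic rather than lexical matching, which depends on the model being able to follow the extraction prompt.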

#### Adding MMR

In [14]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_db.as_retriever(search_type="mmr")
)

In [15]:
question = "Python Code to create a Dag Class"
docs = compression_retriever.get_relevant_documents(question)
result_text = "\n\n".join([doc.page_content for doc in docs])
print(result_text)

>>>
dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is DAG: {dag_number}")           hello_world()       return dag       DAG construction¶
>>>
Extracted irrelevant parts:
>>>
dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is