# Working with LLMs and LangChain

## Dependencies

In [1]:
%pip install accelerate transformers[torch] torch sentencepiece chromadb langchain --user

Note: you may need to restart the kernel to use updated packages.


In [11]:
# need to be run after the above dependencies installation
%pip install xformers sentence_transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m652.8 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Collecting torchvision (from sentence_transformers)
  Downloading torchvision-0.15.2-cp311-cp311-manylinux2014_aarch64.whl (1.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Collecting nltk (from sentence_transformers)
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5

## Loading Model

In [8]:
from langchain import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id = "gpt2",
    task="text-generation",
    model_kwargs={"temperature": 0.1, "max_length": 512},
)

## Testing Model

In [9]:
from langchain import PromptTemplate, LLMChain

template = (
    "Question: {question}"
    "Answer: Let's think step by step."
)
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is electroencephalography?"
print(llm_chain.run(question))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 First, let's look at the brain. The brain is a large, complex, and highly connected organ. It is the brain's primary source of information.


## Accessing Embeddings Database

In [23]:
import chromadb
from chromadb.config import Settings
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vector_db = Chroma(
    collection_name="airflow_docs_stable",
    persist_directory="./db/",
    embedding_function=embeddings,
)
print(f"Documents: {vector_db._collection.count()}")

Documents: 1474


In [26]:
question = "Python Code to create a Dag Class"
docs = vector_db.similarity_search(question, k=3)
result_text = "\n\n".join([doc.page_content for doc in docs])
print(result_text)

dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is DAG: {dag_number}")           hello_world()       return dag       DAG construction¶

However, you should always use data_interval_start or data_interval_end if possible, since those names are semantically more correct and less prone to misunderstandings. Note that ds (the YYYY-MM-DD form of data_interval_start) refers to date *string*, not date *start* as may be confusing to some.  Tip For more information on logical date, see Data Interval and Running DAGs.    How to create DAGs dynamically?¶ Airflow looks in your DAGS_FOLDER for modules that contain DAG

## Experimenting with other retrival methods

### Max Marginal Relevance
Research Paper: https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf

In [27]:
question = "Python Code to create a Dag Class"
docs = vector_db.max_marginal_relevance_search(question, k=3, fetch_k=5)
result_text = "\n\n".join([doc.page_content for doc in docs])
print(result_text)

dag_loader.py¶  from airflow import DAG  from airflow.decorators import task   import pendulum    def create_dag(dag_id, schedule, dag_number, default_args):      dag = DAG(          dag_id,          schedule=schedule,          default_args=default_args,          pendulum.datetime(2021, 9, 13, tz="UTC"),      )       with dag:           @task()          def hello_world():              print("Hello World")              print(f"This is DAG: {dag_number}")           hello_world()       return dag       DAG construction¶

However, you should always use data_interval_start or data_interval_end if possible, since those names are semantically more correct and less prone to misunderstandings. Note that ds (the YYYY-MM-DD form of data_interval_start) refers to date *string*, not date *start* as may be confusing to some.  Tip For more information on logical date, see Data Interval and Running DAGs.    How to create DAGs dynamically?¶ Airflow looks in your DAGS_FOLDER for modules that contain DAG