# Experiment with LangChain and Flan-T5

In this notebook we experiment with LangChain and the Flan-T5 model, using the same data as in the previous notebooks.
The goal is to generate explanations with in-depth reasoning about particular topics.

## Local vector database

We need to store the document embeddings in a local database. We will use the Weaviate vector database for this.

In [None]:
%%bash
pip install tqdm beautifulsoup4

In [None]:
%%bash
pip install weaviate-client

In [3]:
import os

import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

input_directory = os.path.join("..", "data", "confluence_exports")
include_extensions = [".html"]

dataset_path = os.path.join("..", "datasets", "confluence_exports-inputs-augmented")


def get_files_to_process(root_path):
    for dirpath, _, filenames in os.walk(root_path):
        for filename in filenames:
            if any(filename.endswith(ext) for ext in include_extensions):
                yield os.path.join(dirpath, filename)


articles_df = pd.DataFrame(columns=["source_raw", "target", "file"])
fileList = list(get_files_to_process(input_directory))

for filePath in tqdm(fileList, desc="Processing files"):
    with open(filePath, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file.read(), "html.parser")
        main_header = soup.find("h1").text.strip()
        header_tags = ["h2", "h3", "h4", "h5", "h6"]
        headers_stack = []
        for header in soup.find_all(header_tags):
            header_level = int(header.name[1])

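            # The h1 is the page title, so an h2 sits at depth 0 of the stack.
            # Drop any earlier heading at the same or a deeper level, so that
            # sibling headings replace each other instead of nesting.
            while len(headers_stack) >= header_level - 1:
                headers_stack.pop()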

            headers_stack.append(header.text)

            target = ''
            current_element = header.next_element

            while current_element is not None and (
                    current_element.name is None or current_element.name not in header_tags):
                if current_element.name is None:
                    target = " ".join([target, current_element.getText().strip()])
                current_element = current_element.next_element

            source_raw = " : ".join([main_header] + headers_stack).replace(':', '>')
            articles_df = pd.concat(
                [articles_df, pd.DataFrame([[source_raw, target, filePath]], columns=["source_raw", "target", "file"])])


def has_content(row):
    return len(row["source_raw"].split()) > 2 and len(row["target"].split()) > 5


articles_df = articles_df.drop_duplicates(subset=["source_raw"])
articles_df = articles_df.drop_duplicates(subset=["target"])
articles_df = articles_df[articles_df.apply(has_content, axis=1)]

articles_df.reset_index(drop=True, inplace=True)

articles_df.sample(10)

Processing files:   0%|          | 0/266 [00:00<?, ?it/s]

| | source_raw | target | file |
|---|---|---|---|
| 551 | Hadoop > Hadoop 2.8.0 Release > Blocker/Critic... | TODOs before RC TODO item Status - 1/4/2017 C... | ../../data/confluence_exports/HADOOP/Hadoop-2.... |
| 1219 | Apache Tomcat > ClusteringOverview > General T... | Linux in General You'll find that Linux is a ... | ../../data/confluence_exports/TOMCAT/Clusterin... |
| 588 | Hadoop > How To Contribute > Dev Environment S... | Integrated Development Environment (IDE) You ... | ../../data/confluence_exports/HADOOP/How-To-Co... |
| 232 | TIKA > API Bindings for Tika > Kubernetes Char... | Ruby Tika-Client: Ruby Bindings for Tika Se... | ../../data/confluence_exports/TIKA/API-Binding... |
| 947 | Apache Tomcat > UsingDataSources > How do I us... | How do I use DataSources with Tomcat? When de... | ../../data/confluence_exports/TOMCAT/UsingData... |
| 399 | TIKA > VirtualMachine > Install software (this... | prep nsfpolardata scp -r <user>@nsfpolardata.... | ../../data/confluence_exports/TIKA/VirtualMach... |
| 328 | TIKA > ComparisonTikaAndPDFToText201811 > Lang... | Improvements to tika-eval We observed a han... | ../../data/confluence_exports/TIKA/ComparisonT... |
| 858 | Apache Tomcat > Community Review of DISA STIG ... | V-222964 This finding misses multiple TLS set... | ../../data/confluence_exports/TOMCAT/Community... |
| 140 | TIKA > Release Process for tika-helm > Apache ... | Creating Git Release Based on the above gener... | ../../data/confluence_exports/TIKA/Release-Pro... |
| 562 | Hadoop > GitHub Integration > Git setup > Clos... | Closing a PR without committing (for committe... | ../../data/confluence_exports/HADOOP/GitHub-In... |

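The breadcrumb in `source_raw` comes from a stack of headings. To make that logic easier to check in isolation, here is a minimal sketch without BeautifulSoup. The function name `build_breadcrumbs` and the `(level, text)` input format are assumptions made for illustration, and the sketch assumes that a heading should replace any earlier heading at the same or a deeper level.

```python
def build_breadcrumbs(page_title, headings):
    """Return one " > "-joined breadcrumb per heading.

    `headings` is a list of (level, text) pairs, e.g. (2, "Setup") for an <h2>.
    """
    stack = []
    breadcrumbs = []
    for level, text in headings:
        # The page title plays the role of h1, so an h2 belongs at depth 0.
        while len(stack) >= level - 1:
            stack.pop()
        stack.append(text)
        breadcrumbs.append(" > ".join([page_title] + stack))
    return breadcrumbs


sample = [(2, "Install"), (3, "Linux"), (3, "macOS"), (2, "Usage")]
for crumb in build_breadcrumbs("Guide", sample):
    print(crumb)
```

With this sample, "macOS" replaces "Linux" under "Install", and "Usage" starts a new top-level section.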

### Starting Weaviate Vector DB

To complete the following step, make sure that `docker` and its `compose` plugin are installed.
Then run:

```sh
docker compose up -d
```
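
The container can take a while to start, so it may help to wait until Weaviate reports that it is ready before creating the schema. The sketch below polls Weaviate's readiness endpoint using only the standard library. The helper names `weaviate_is_ready` and `wait_for_weaviate` are hypothetical, and the base URL is assumed to be the same one the client uses later in this notebook.

```python
import time
import urllib.error
import urllib.request


def weaviate_is_ready(base_url="http://127.0.0.1:8080", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_weaviate(base_url="http://127.0.0.1:8080", attempts=30, delay=1.0):
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    for _ in range(attempts):
        if weaviate_is_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_weaviate()` right after `docker compose up -d` returns `True` once the server accepts requests, or `False` if it never comes up within the given number of attempts.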

In [4]:

from tqdm.notebook import tqdm
import weaviate

schema = {
    "classes": [
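        {
            "class": "Document",
            "description": "A class called document",
            # NOTE: these settings target the "text2vec-huggingface" module, while the
            # class vectorizer below is "text2vec-transformers"; they likely only take
            # effect if the vectorizer is switched to "text2vec-huggingface".
            "moduleConfig": {
                "text2vec-huggingface": {
                    "model": "google/flan-t5-large",
                    "options": {
                        "waitForModel": True,
                        "useGPU": True,
                        "useCache": True
                    }
                }
            },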
            "properties": [
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "Title of the document",
                    "moduleConfig": {
                        "text2vec-transformers": {
                            "skip": True,
                            "vectorizePropertyName": True
                        }
                    },
                    "name": "title"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "Content that will be vectorized",
                    "moduleConfig": {
                        "text2vec-transformers": {
                            # Do not skip this property: its text is what gets embedded.
                            "skip": False,
                            "vectorizePropertyName": True
                        }
                    },
                    "name": "content"
                }
            ],
            "vectorizer": "text2vec-transformers"
        }
    ]
}

client = weaviate.Client(
    "http://127.0.0.1:8080",
    startup_period=30
)

client.schema.delete_all()
client.schema.create(schema)

# The batch context manager flushes any pending objects when the block exits.
with client.batch as batch:
    batch.batch_size = 100
    for _, article in tqdm(articles_df.iterrows(), total=len(articles_df), desc="Persisting articles to Weaviate"):
        data_obj = {
            "title": article["source_raw"],
            "content": article["target"]
        }
        # Queue the object on the batch opened above.
        batch.add_data_object(data_obj, "Document")


Persisting articles to Weaviate:   0%|          | 0/1235 [00:00<?, ?it/s]
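
After the import, it can be useful to confirm how many objects actually reached the database. The sketch below uses an aggregate query with a meta count. The helper names `extract_count` and `count_objects` are hypothetical, and the response shape shown in the example is an assumption about what the v3 Python client returns.

```python
def extract_count(response, class_name="Document"):
    """Pull the object count out of an Aggregate response dictionary."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


def count_objects(client, class_name="Document"):
    """Ask Weaviate how many objects of `class_name` are stored."""
    response = client.query.aggregate(class_name).with_meta_count().do()
    return extract_count(response, class_name)


# Example with a hand-written response of the assumed shape:
example_response = {"data": {"Aggregate": {"Document": [{"meta": {"count": 3}}]}}}
print(extract_count(example_response))
```

If everything was persisted, `count_objects(client)` should equal `len(articles_df)`.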

## Usage of LangChain with Flan-T5

We will use the Flan-T5 model through the Hugging Face `transformers` library and its `text2text-generation` pipeline.
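
As a rough sketch of how these pieces could fit together, the block below builds a prompt from retrieved passages and wraps a local `text2text-generation` pipeline as a LangChain LLM. It assumes the packages from the install cell below are available. The function names `build_prompt` and `load_flan_t5`, the prompt wording, and the `max_new_tokens` value are illustrative assumptions, not settings taken from this notebook.

```python
def build_prompt(question, contexts):
    """Combine retrieved passages and a question into one instruction prompt."""
    context_block = "\n\n".join(text.strip() for text in contexts)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


def load_flan_t5(model_name="google/flan-t5-large", max_new_tokens=256):
    """Create a LangChain LLM backed by a local text2text-generation pipeline."""
    # Imported lazily so the prompt helper works without the heavy dependencies.
    from transformers import pipeline
    from langchain.llms import HuggingFacePipeline

    generator = pipeline("text2text-generation", model=model_name, max_new_tokens=max_new_tokens)
    return HuggingFacePipeline(pipeline=generator)


prompt = build_prompt(
    "What is Apache Tika?",
    ["Apache Tika detects and extracts metadata and text from many file types."],
)
print(prompt)
```

Loading the model downloads its weights on first use, so `load_flan_t5()` is kept separate from the cheap prompt helper. The resulting LLM object can then be called with the prompt string.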


In [None]:
%%bash
pip install transformers langchain

In [13]:
import os
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Weaviate

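text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# Wrap the existing "Document" class; "content" is the property used as the page text.
vectorstore = Weaviate(client, "Document", "content")
query = "What is apache tika?"
# Retrieve the stored documents most similar to the query.
docs = vectorstore.similarity_search(query)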

docs

ValueError: Document prompt requires documents to have metadata variables: ['source']. Received document with missing metadata: ['source'].
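
This error is typically raised when a document prompt that references `{source}` (for example in a question-answering chain that cites sources) formats documents whose metadata lacks that key. One possible workaround, sketched below, is to fill in `source` from another metadata field before passing the documents on. The `Doc` class is a hypothetical stand-in for a LangChain document, and using `title` as the source is an assumption. To have `title` in the metadata at all, the vector store would presumably need to be created with `attributes=["title"]`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def ensure_source(docs, source_key="title", default="unknown"):
    """Copy `metadata[source_key]` into `metadata["source"]` where it is missing."""
    for doc in docs:
        if "source" not in doc.metadata:
            doc.metadata["source"] = doc.metadata.get(source_key, default)
    return docs


docs_example = [Doc("Tika text", {"title": "TIKA > Overview"}), Doc("No metadata")]
for doc in ensure_source(docs_example):
    print(doc.metadata["source"])
```

The helper only reads and writes the `metadata` dictionary, so it should work the same way on the documents returned by `similarity_search`.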