# Generating text with LangChain and Hugging Face

We will start by setting up a standard Hugging Face pipeline from our local Vicuna model. From there, it can be used as a normal LangChain LLM.

In [8]:
import pandas as pd
import re

from transformers import pipeline, LlamaForCausalLM
from accelerate import Accelerator
import torch

from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain import PromptTemplate
from langchain.output_parsers.regex_dict import RegexDictParser

In [2]:
model_location = '/home/jovyan/project-archive/vicuna-7b'

model = LlamaForCausalLM.from_pretrained(
        model_location,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map={'': Accelerator().local_process_index}
    )


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/jovyan/conda_envs/peft/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /home/jovyan/conda_envs/peft/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/jovyan/conda_envs/peft/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
pipe = pipeline(model=model,
                tokenizer=model_location,
                use_fast=False,
                task='text-generation',
                model_kwargs={'load_in_8bit': True},
                max_length=2048,
                temperature=0.6,
                top_p=0.95,
                repetition_penalty=1.1
               )

In [7]:
llm = HuggingFacePipeline(pipeline=pipe)

## Data

In [11]:
df = pd.read_csv('../data/subsections.csv')
subsections = df.clean_text

## LangChain

The questions are looking fairly good. Now let's see if we can first extract the automatically generated questions reliably. Then, we will work on generating answers to those questions with the same model using LangChain.

### Prompt

In [23]:
inference_template = (
    'In the field of education, there are three main types of comprehension questions: summary questions, recall questions, and inference questions. '
    'The following is a passage from a macroeconomics textbook. Please provide an inference comprehension question to assess the learner\'s understanding. '
    'An inference comprehension question will ask the learner to make an educated guess or draw a conclusion based on the information presented in a passage or text. '
    'The learner should be able to adequately answer the question in one or two sentences.\n\n'
    '{source}\n\n'
    'Inference Question:\n\n'
)

inference_prompt = PromptTemplate(
    input_variables=['source'],
    template=inference_template,
)

In [24]:
recall_template = (
    'In the field of education, there are three main types of comprehension questions: summary questions, recall questions, and inference questions. '
    'The following is a passage from a macroeconomics textbook. Please provide a recall comprehension question to assess the learner\'s understanding. '
    'A recall comprehension question will ask the learner to remember specific details or information from the passage. '
    'The learner should be able to adequately answer the question in one or two sentences.\n\n'
    '{source}\n\n'
    'Recall Question:\n\n'
)

recall_prompt = PromptTemplate(
    input_variables=['source'],
    template=recall_template,
)

In [25]:
summary_template = (
    'In the field of education, there are three main types of comprehension questions: summary questions, recall questions, and inference questions. '
    'The following is a passage from a macroeconomics textbook. Please provide a summary comprehension question to assess the learner\'s understanding. '
    'A summary comprehension question will ask the learner to provide a brief overview of the main points or ideas in the passage. '
    'The learner should be able to adequately answer the question in one or two sentences.\n\n'
    '{source}\n\n'
    'Summary Question:\n\n'
)

summary_prompt = PromptTemplate(
    input_variables=['source'],
    template=summary_template,
)
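
The three templates above differ only in the question type, its definition, and the final label, so they could be generated from one helper. The following is a minimal sketch, not part of the original notebook: `QUESTION_DEFINITIONS` and `build_template` are hypothetical names, and the wording is copied from the templates above. It uses plain Python strings so that it runs without LangChain installed.

```python
# Hypothetical helper: builds the three prompt templates from shared text.
QUESTION_DEFINITIONS = {
    'summary': 'A summary comprehension question will ask the learner to provide a brief overview of the main points or ideas in the passage. ',
    'recall': 'A recall comprehension question will ask the learner to remember specific details or information from the passage. ',
    'inference': 'An inference comprehension question will ask the learner to make an educated guess or draw a conclusion based on the information presented in a passage or text. ',
}


def build_template(kind):
    """Return the prompt template string for one question type."""
    article = 'an' if kind[0] in 'aeiou' else 'a'
    return (
        'In the field of education, there are three main types of comprehension questions: summary questions, recall questions, and inference questions. '
        f'The following is a passage from a macroeconomics textbook. Please provide {article} {kind} comprehension question to assess the learner\'s understanding. '
        f'{QUESTION_DEFINITIONS[kind]}'
        'The learner should be able to adequately answer the question in one or two sentences.\n\n'
        # This literal is not an f-string, so {source} stays a placeholder.
        '{source}\n\n'
        f'{kind.capitalize()} Question:\n\n'
    )


templates = {kind: build_template(kind) for kind in QUESTION_DEFINITIONS}
```

Each resulting string keeps the `{source}` placeholder, so it can be passed as `template=` to `PromptTemplate(input_variables=['source'], ...)` in the same way as the hand-written versions. Generating the text from one place also makes it harder for a copy-and-paste slip to leave the wrong question type in a template.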

In [26]:
inference_chain = LLMChain(llm=llm, prompt=inference_prompt)
recall_chain = LLMChain(llm=llm, prompt=recall_prompt)
summary_chain = LLMChain(llm=llm, prompt=summary_prompt)

In [29]:
recall_chain.run(subsections.sample().item())

"What is the difference between the median weekly earnings for a full-time worker over 25 with no higher than a bachelor's degree and a high school diploma in 2015 according to the Bureau of Labor Statistics (BLS)?"

In [35]:
from langchain.chains import LLMChain
from langchain.chains.base import Chain

class QuestionGenerationChain(Chain):
    summary_chain: LLMChain
    recall_chain: LLMChain
    inference_chain: LLMChain

    @property
    def input_keys(self):
        return ['source']

    @property
    def output_keys(self):
        return ['Summary Question', 'Recall Question', 'Inference Question']

    def _call(self, inputs):
        return {
            'Summary Question': self.summary_chain.run(inputs),
            'Recall Question': self.recall_chain.run(inputs),
            'Inference Question': self.inference_chain.run(inputs),
        }

question_generator = QuestionGenerationChain(summary_chain=summary_chain, recall_chain=recall_chain, inference_chain=inference_chain)

In [38]:
question_generator(subsections.sample().item())

{'source': 'In Chicago, Illinois, the highest recorded temperature was 105° in July 1995, while the lowest recorded temperature was 27° below zero in January 1958. Understanding why these extreme weather patterns occurred is certainly interesting. However, if you wanted to understand the typical weather pattern in Chicago, instead of focusing on one- time extremes, you would need to look at the entire pattern of data over time. A similar lesson applies to the study of macroeconomics. It is interesting to study extreme situations, like the 1930s Great Depression or the 2008-2009 Great Recession. If you want to understand the whole picture, however, you need to look at the long term using the neoclassical perspective.\nConsider the unemployment rate. The unemployment rate has fluctuated from as low as 3.5% in 1969 to as high as 9.7% in 1982 and 9.6% in 2009. Even as the U.S. unemployment rate rose during recessions and declined during expansions, it kept returning to the general neighbor

### Output Parser

LangChain really wants us to use a JSON or Pydantic parser. I highly doubt LLaMA-7B can reliably output structured responses. Let's try to build something with regex that fails gracefully.

In [198]:
class RegexParser(RegexDictParser):
    '''Overriding the parse method so that it does not escape regex patterns.
    I need to match the first question at the beginning of the string with the regex '^' special character'''
    def parse(self, text):
        result = {}
        for output_key, expected_format in self.output_key_to_format.items():
            specific_regex = self.regex_pattern.format(expected_format)
            matches = re.findall(specific_regex, text)
            if not matches:
                print(
                    f"No match found for output key: {output_key} with expected format ```{expected_format}``` on text ```{text.strip()}```"
                )
                result[output_key] = '' # we can add in a retry function to try again if the model fails. for now, we will just return an empty string.
            elif len(matches) > 1:
                raise ValueError(
                    f"Multiple matches found for output key: {output_key} with expected format ```{expected_format}``` on text ```{text.strip()}```"
                )
            elif (
                self.no_update_value is not None and matches[0] == self.no_update_value
            ):
                continue
            else:
                result[output_key] = matches[0]
        return result
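
The comment in `parse` mentions adding a retry when the model's output cannot be parsed. One possible shape for that is sketched below. It is an assumption about how a retry might look, not something the notebook implements: `generate_with_retry`, `fake_generate`, and `fake_parse` are hypothetical names, and the fake functions stand in for the model and the parser so that the sketch runs without a GPU.

```python
def generate_with_retry(generate, parse, source, required_keys, max_attempts=3):
    """Regenerate until every required key parses to a non-empty string.

    Returns the last successfully parsed dict, which may still contain
    empty values if all attempts fail.
    """
    parsed = {}
    for _ in range(max_attempts):
        try:
            parsed = parse(generate(source))
        except ValueError:
            # The parser raises ValueError on multiple matches;
            # treat that as a failed attempt and try again.
            continue
        if all(parsed.get(key) for key in required_keys):
            break
    return parsed


# Stand-ins for the real model and parser (illustration only).
canned_outputs = iter([
    'no usable questions here',
    'What is demand?\nQuestion 2: Why does price matter?',
])


def fake_generate(source):
    return next(canned_outputs)


def fake_parse(text):
    lines = text.split('\n')
    second = ''
    if len(lines) > 1 and lines[1].startswith('Question 2:'):
        second = lines[1][len('Question 2:'):].strip()
    return {'Question 1': lines[0], 'Question 2': second}


result = generate_with_retry(
    fake_generate, fake_parse, 'some passage', ['Question 1', 'Question 2']
)
```

With the real objects, `generate` would be something like a chain's `run` method and `parse` would be the parser's `parse` method. Because the pipeline samples with a non-zero temperature, a second attempt can produce a different output, which is what makes retrying worthwhile.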

In [199]:
output_key_to_format = {'Question 1': '^', # for the first question, we need to match the beginning of the string.
                        'Question 2': 'Question 2:'}

re_parser = RegexParser(
    regex_pattern=r'{}\s*(.*?)(?=\n|$)', # matches the key's pattern (which includes the colon where needed), skips any whitespace, then captures everything up to the next line break or the end of the string.
    output_key_to_format=output_key_to_format,
    no_update_value='N/A'
)
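
To see what this pattern captures, it can be tried on a hand-written string with only the standard `re` module. This is a sketch for illustration: `sample_output` is made up rather than real model output, and `extract` is a hypothetical helper that, unlike the parser above, returns an empty string for multiple matches instead of raising.

```python
import re

REGEX_PATTERN = r'{}\s*(.*?)(?=\n|$)'


def extract(text, key_pattern):
    """Return the single captured question, or '' if there is not exactly one match."""
    matches = re.findall(REGEX_PATTERN.format(key_pattern), text)
    return matches[0] if len(matches) == 1 else ''


# Made-up text shaped like the model output: the first question has no label,
# the second is prefixed with 'Question 2:'.
sample_output = (
    '\n\nWhat is the main idea of the passage?\n'
    'Question 2: Why do sunk costs not matter?'
)

first = extract(sample_output, '^')             # '^' anchors at the start of the string
second = extract(sample_output, 'Question 2:')
missing = extract('Only one question here?', 'Question 2:')
```

Because `^` is inserted unescaped and `re.MULTILINE` is not set, it matches only at the very start of the text, so the first key picks up the unlabeled first question. The stock `RegexDictParser` escapes the key, which is why `parse` is overridden.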

In [200]:
for sample in subsections.sample(15):
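    # NOTE: `chain` is not defined in the cells shown here; it is assumed to be an
    # LLMChain from an earlier cell whose prompt asks for two questions.
    output = chain.run(sample)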
    try:
        questions = re_parser.parse(output)
        print('Parsed Output:', questions)
    except ValueError as e:
        print('Failed Parse:', e)

Parsed Output: {'Question 1': 'What does the author mean by "demand" in the context of economics?', 'Question 2': 'According to the passage, what are the two key components that determine the shape of a demand curve?'}
Parsed Output: {'Question 1': 'Why do economists consider the ability to pay when measuring demand?', 'Question 2': 'How does the law of demand relate to the price and quantity demanded of a good or service?'}
Parsed Output: {'Question 1': 'What is the main argument presented in this passage?', 'Question 2': 'How might high-income countries influence low-income countries to adopt stronger environmental standards without resorting to protectionism?'}
Parsed Output: {'Question 1': 'What is the main idea of the passage?', 'Question 2': 'Why should sunk costs not affect the current decision according to the budget constraint framework?'}
Parsed Output: {'Question 1': 'What is the difference between the aggregate supply and aggregate demand model and the microeconomic analysi