# Generating text with LangChain and Huggingface

We will start by setting up a standartd huggingface pipeline from our local Vicuna model. From there, it can be used as a normal Langchain LLM.

In [12]:
from langchain.llms import HuggingFacePipeline
from transformers import pipeline, LlamaForCausalLM
from accelerate import Accelerator
import torch

In [None]:
model_location = '/home/jovyan/project-archive/vicuna-7b'

model = LlamaForCausalLM.from_pretrained(
        model_location,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map={'': Accelerator().local_process_index}
    )

In [50]:
pipe = pipeline(model=model,
                tokenizer=model_location,
                use_fast=False,
                task='text-generation',
                model_kwargs={'load_in_8bit': True},
                max_length=2048,
                temperature=0.6,
                top_p=0.95,
                repetition_penalty=1.1
               )

# Data

In [17]:
import pandas as pd
subsections = pd.read_csv('../data/subsections.csv')

## Testing Prompts

In [65]:
template = (
    'The following is a passage from a macroeconomics textbook. Please read the passage and write two inference-type comprehension questions to assess the learner\'s understanding. '
    'An infernece question will encourage the learner to go beyond the text in their response, but the learner should still be able to adequately answer the question in one or two sentences. '
    'Please read the following passage and generate two inference-type comprehension questions:\n\n{0}\n\nQuestion 1: '
)

In [56]:
sources = subsections['clean_text'][subsections['clean_text'].str.len() > 100] # check string length to filter out short subsections "learn with videos" and "link it up" etc.
prompt.format(sources.sample().item())

"The following is a passage from a macroeconomics textbook. Please read the passage and write two comprehension questions to assess the learner's understanding. A good comprehension question will encourage the learner to reflect on the reading. The learner should be able to adequately answer the question in one or two sentences. \n Overview\n  \n    By the end of this section, you will be able to:\n  \n  - Define absolute advantage, comparative advantage, and opportunity costs - Explain\n  the gains of trade created when a country specializes\n  \n  No nation was ever ruined by trade\n  \n  Benjamin Franklin (1706-1790)\nMany economists would express their attitudes toward international trade in an even more positive manner. The evidence that international trade confers overall benefits on economies is pretty strong. Trade has accompanied economic growth in the United States and around the world. Many national economies that have shown the most rapid growth in the last several decades—

In [68]:
print(pipe(prompt.format(sources.sample().item()))[0]['generated_text'])

The following is a passage from a macroeconomics textbook. Please read the passage and write two inference-type comprehension questions to assess the learner's understanding.An infernece question will encourage the learner to go beyond the text in their response, but the learner should still be able to adequately answer the question in one or two sentences.Please read the following passage and generate two inference-type comprehension questions:

The Aggregate Supply Curve and Potential GDP
Firms make decisions about what quantity to supply based on the profits they expect to earn. They determine profits, in turn, by the price of the outputs they sell and by the prices of the inputs, like labor or raw materials, that they need to buy.
  
    Aggregate supply (AS)  refers to the total quantity of output (i.e. real GDP) firms will produce and sell.
    The aggregate supply (AS) curve shows the total quantity of output (i.e. real GDP) that firms will produce and sell at each price level.


## Langchain

The questions are looking fairly good. Now let's see if we can first extract the automatically generated questions reliably. Then, we will work on generating answers to those questions with the same model using LangChain.

### Prompt

In [144]:
from langchain.chains import LLMChain
from langchain import PromptTemplate
from langchain.output_parsers.regex_dict import RegexDictParser

llm = HuggingFacePipeline(pipeline=pipe)

In [186]:
template = (
    'The following is a passage from a macroeconomics textbook. Please read the passage and write two inference-type comprehension questions to assess the learner\'s understanding. '
    'An infernece question will encourage the learner to go beyond the text in their response, but the learner should still be able to adequately answer the question in one or two sentences. '
    'Format your response in the following way:\n'
    'Question 1:\n'
    'The first question.\n'
    'Question 2:\n'
    'The second question.\n'
    'Please read the following passage and generate two inference-type comprehension questions:\n\n{source}\n\nQuestion 1: '
)

prompt = PromptTemplate(
    input_variables=['source'],
    template=template,
)

In [187]:
chain = LLMChain(llm=llm, prompt=prompt)

In [147]:
output = chain.run(sources.sample().item())



### Output Parser

Langchain reallys wants us to use a JSON or Pydantic parser. I highly doubt LLaMA-7B can reliably output structured responses. Let's try to build something with regex that fails gracefully.

Unlike good Python code, RegEx is never easily readable. The following expression searches for "Question 1:" and any trailing whitespace. It then lazily matches all characters, including newline characters (re.DOTALL), until it finds "Question 2:" OR the end of the string ($). It skips over any whitespace, then captures the rest of the string until it finds a newline.

In [155]:
import re
pattern = re.compile(r'Question 2:\s*(.*?)(?=\n|$)')
pattern = re.compile(r'^\s*(.*?)(?=\n|$)')
pattern.search(output)[1]

'What were economic conditions like before 1870?'

In [198]:
class RegexParser(RegexDictParser):
    '''Overriding the parse method so that it does not escape regex patterns.
    I need to match the first question at the beginning of the string with the regex '^' special character'''
    def parse(self, text):
        result = {}
        for output_key, expected_format in self.output_key_to_format.items():
            specific_regex = self.regex_pattern.format(expected_format)
            matches = re.findall(specific_regex, text)
            if not matches:
                print(
                    f"No match found for output key: {output_key} with expected format ```{expected_format}``` on text ```{text.strip()}```"
                )
                result[output_key] = '' # we can add in a retry function to try again if the model fails. for now, we will just return an empty string.
            elif len(matches) > 1:
                raise ValueError(
                    f"Multiple matches found for output key: {output_key} with expected format ```{expected_format}``` on text ```{text.strip()}```"
                )
            elif (
                self.no_update_value is not None and matches[0] == self.no_update_value
            ):
                continue
            else:
                result[output_key] = matches[0]
        return result

In [165]:
regex_pattern=r'{}\s*(.*?)(?=\n|$)'
re.findall(regex_pattern.format(re.escape('^')), output)
re.findall(regex_pattern.format(re.escape(r'^')), output)

[]

In [167]:
re.findall(r'^\s*(.*?)(?=\n|$)', output)

['What were economic conditions like before 1870?']

In [199]:
output_key_to_format = {'Question 1': '^', # for the first question, we need to match the beginning of the string.
                        'Question 2': 'Question 2:'}

re_parser = RegexParser(
    regex_pattern=r'{}\s*(.*?)(?=\n|$)', # searches for the key, a colon, any whitespace, and then matches on all the characters that follow until a linebreak or the end of string.
    output_key_to_format=output_key_to_format,
    no_update_value='N/A'
)

In [200]:
for sample in sources.sample(15):
    output = chain.run(sample)
    try:
        questions = re_parser.parse(output)
        print('Parsed Output:', questions)
    except ValueError as e:
        print('Failed Parse:', e)

Parsed Output: {'Question 1': 'What does the author mean by "demand" in the context of economics?', 'Question 2': 'According to the passage, what are the two key components that determine the shape of a demand curve?'}
Parsed Output: {'Question 1': 'Why do economists consider the ability to pay when measuring demand?', 'Question 2': 'How does the law of demand relate to the price and quantity demanded of a good or service?'}
Parsed Output: {'Question 1': 'What is the main argument presented in this passage?', 'Question 2': 'How might high-income countries influence low-income countries to adopt stronger environmental standards without resorting to protectionism?'}
Parsed Output: {'Question 1': 'What is the main idea of the passage?', 'Question 2': 'Why should sunk costs not affect the current decision according to the budget constraint framework?'}
Parsed Output: {'Question 1': 'What is the difference between the aggregate supply and aggregate demand model and the microeconomic analysi