## SynapseML Langchain Transformer 
### Used for Paper Reading and Organization

LangChain is a software development framework designed to simplify the creation of applications that use large language models (LLMs). Chains in LangChain go beyond a single LLM call: they are sequences of calls (each one to an LLM or to a different utility) that automate the execution of a series of calls and actions.
To make it easier to scale LangChain execution to large datasets, we integrated LangChain with the distributed machine learning library [SynapseML](https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of records with LangChain. This tutorial shows how to apply LangChain at distributed scale to process a batch of data. Specifically, it walks through paper summarization and organization: you provide a list of arXiv links, and the LangChain Transformer in SynapseML returns a table containing each paper's title, authors, summary, and some related papers.

## Prerequisites

The key prerequisites for this quickstart are a working Azure OpenAI resource and an Apache Spark cluster with SynapseML installed. We suggest creating a Synapse workspace, but Azure Databricks, HDInsight, Spark on Kubernetes, or even a Python environment with the `pyspark` package will also work. If you want to use the last component of the chain, an agent with web search capabilities, you also need a SerpApi key.

1. An Azure OpenAI resource – request access [here](https://customervoice.microsoft.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR7en2Ais5pxKtso_Pz4b1_xUOFA5Qk1UWDRBMjg0WFhPMkIzTzhKQ1dWNyQlQCN0PWcu) before [creating a resource](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource)
1. [Create a Synapse workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-create-workspace)
1. [Create a serverless Apache Spark pool](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-analyze-spark#create-a-serverless-apache-spark-pool)
1. Get a SerpAPIKey from [SerpApi](https://serpapi.com/).

## Import this guide as a notebook

The next step is to add this code to your Spark cluster. You can either create a notebook in your Spark platform and copy the code into it to run the demo, or download the notebook and import it into Synapse Analytics.

1. Import the notebook [into the Synapse Workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#create-a-notebook) or if using Databricks [into the Databricks Workspace](https://docs.microsoft.com/en-us/azure/databricks/notebooks/notebooks-manage#create-a-notebook)
1. Install SynapseML on your cluster. Please see the installation instructions for Synapse at the bottom of [the SynapseML website](https://microsoft.github.io/SynapseML/). Note that this requires pasting an additional cell at the top of the notebook you just imported
1. Connect your notebook to a cluster and follow along, editing and running the cells below.

In [0]:
%pip install langchain openai unstructured pdfminer.six 

In [0]:
import os, openai, langchain, uuid
from langchain.llms import AzureOpenAI, OpenAI
from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import TransformChain, LLMChain, SimpleSequentialChain
from langchain.document_loaders import OnlinePDFLoader
from langchain.prompts import PromptTemplate
import pyspark.sql.functions as f
from synapse.ml.cognitive.langchain import LangchainTransformer
from synapse.ml.core.platform import running_on_synapse, find_secret

### Fill in the service information and construct the LLM
Next, edit the cell in the notebook to point to your service. In particular, set the `model_name`, `deployment_name`, `openai_api_base`, and `openai_api_key` variables to match those of your OpenAI service.

Note: you need to get the SerpApi key from the [SerpApi dashboard](https://serpapi.com/dashboard).

In [0]:
os.environ["SERPAPI_API_KEY"] = "YOURSERPAPIKEY"
openai_api_key = find_secret("openai-api-key")
openai_api_base = "https://synapseml-openai.openai.azure.com/"
openai_api_version = "2022-12-01"
openai_api_type = "azure"

os.environ["OPENAI_API_TYPE"] = openai_api_type
os.environ["OPENAI_API_VERSION"] = openai_api_version
os.environ["OPENAI_API_BASE"] = openai_api_base
os.environ["OPENAI_API_KEY"] = openai_api_key
llm = AzureOpenAI(
    deployment_name="text-davinci-003",
    model_name="text-davinci-003",
    temperature=0.1,
    verbose=True,
)
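
The cell above writes a placeholder SerpApi key straight into the environment. As a minimal sketch, assuming you would rather set the keys outside the notebook, the helpers below read them from environment variables and fail early when a required one is missing. The names `require_env` and `web_search_enabled` are hypothetical and are not part of SynapseML or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def web_search_enabled() -> bool:
    """Web search is optional, so only report whether a real SerpApi key is present."""
    key = os.environ.get("SERPAPI_API_KEY", "")
    # Treat the tutorial's placeholder value as "no key configured".
    return bool(key) and key != "YOURSERPAPIKEY"
```

A check such as `web_search_enabled()` could then decide whether to build the chain with or without the web search agent.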

### Basic Usage of Langchain Transformer

#### Construction of Chain
This is a very simple chain that just copies the input column to the output column. Its only purpose is to demonstrate the basic usage of the LangChain Transformer.

In [0]:
copy_prompt = PromptTemplate(
    input_variables=["technology"],
    template="Copy the following word: {technology}",
)

chain = LLMChain(llm=llm, prompt=copy_prompt)
transformer = (
    LangchainTransformer()
    .setInputCol("technology")
    .setOutputCol("copied_technology")
    .setChain(chain)
    .setSubscriptionKey(openai_api_key)
    .setUrl(openai_api_base)
)

### Data Construction and Langchain Transformation

In [0]:
# construction of test dataframe
sentenceDataFrame = spark.createDataFrame(
    [(0, "docker"), (1, "spark"), (2, "python")], ["label", "technology"]
)
transformed_df = transformer.transform(sentenceDataFrame)
transformed_df.show(truncate=True)
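
Conceptually, the transformer runs the chain once per row, reading the input column and writing the result to the output column. The sketch below mimics that behavior in plain Python on a list of dictionaries. It is only an illustration under that assumption, not the SynapseML implementation, and `apply_chain_to_rows` and `fake_copy_chain` are hypothetical names standing in for the Spark transformation and the LLM call.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def apply_chain_to_rows(
    rows: List[Row],
    input_col: str,
    output_col: str,
    run_chain: Callable[[str], str],
) -> List[Row]:
    """Return new rows with output_col set to run_chain(row[input_col])."""
    return [{**row, output_col: run_chain(str(row[input_col]))} for row in rows]


def fake_copy_chain(text: str) -> str:
    """Stand-in for the copy chain; a real chain would call the LLM here."""
    return text


rows = [
    {"label": 0, "technology": "docker"},
    {"label": 1, "technology": "spark"},
    {"label": 2, "technology": "python"},
]
result = apply_chain_to_rows(rows, "technology", "copied_technology", fake_copy_chain)
```

The input rows are left unchanged and a new column is added, which matches how `transform` returns a new DataFrame rather than mutating the original.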

### Transformer Save and Load
LangChain Transformers can be saved and loaded as long as the chain is a simple chain that doesn't have memory.

In [0]:
temp_dir = "tmp"
if not os.path.exists(temp_dir):
    os.mkdir(temp_dir)
path = os.path.join(temp_dir, "langchainTransformer")
transformer.save(path)
loaded_transformer = LangchainTransformer.load(path)
# Use the reloaded transformer so the save/load round trip is actually exercised.
loaded_transformer_transformed_df = loaded_transformer.transform(sentenceDataFrame)
loaded_transformer_transformed_df.show(truncate=True)

## LangChain Transformer for the Paper Reading Use Case

### Sequential Chain Construction

Construct a sequential chain that extracts the paper content from an arXiv link, gets the paper title and author information, and condenses the paper content into a summary. After that, use a web search tool to find recent papers written by the first author.

The sequential chain is built from these steps:

**Transform Chain**: Extract Paper Content from arxiv Link **=>**

**LLMChain**: Summarize the Paper, extract paper title and authors **=>**

**Transform Chain**: Generate the prompt for the web search agent **=>**

**Agent with Web Search Tool**: Use Web Search to find the recent papers by the first author (this part is commented out as it needs the SerpAPIKey to run successfully)
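
The data flow through these steps can be sketched in plain Python: each step takes a single string and passes its single output to the next step. This is an illustration only, not the LangChain API. `run_sequential` and the four stand-in functions are hypothetical, and the stand-ins return fixed strings instead of downloading PDFs or calling an LLM.

```python
from functools import reduce
from typing import Callable, Sequence

Step = Callable[[str], str]


def run_sequential(steps: Sequence[Step], first_input: str) -> str:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda value, step: step(value), steps, first_input)


# Hypothetical stand-ins for the four stages described above.
def extract_content(arxiv_link: str) -> str:
    return f"<text of {arxiv_link}>"


def summarize(paper_content: str) -> str:
    return f"Title/Authors/Summary of {paper_content}"


def build_search_prompt(summary: str) -> str:
    return f"Find 3 recent papers by the first author in: {summary}"


def web_search(prompt: str) -> str:
    return f"[search results for: {prompt}]"


link = "https://arxiv.org/pdf/2107.13586.pdf"
# Two-step variant (no SerpApi key) and four-step variant (with web search).
without_search = run_sequential([extract_content, summarize], link)
with_search = run_sequential(
    [extract_content, summarize, build_search_prompt, web_search], link
)
```

Because every step has exactly one input and one output, dropping the last two steps still leaves a valid chain, which is why the web search part can be switched off independently.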

In [0]:
def paper_content_extraction(inputs: dict) -> dict:
    arxiv_link = inputs["arxiv_link"]
    loader = OnlinePDFLoader(arxiv_link)
    pages = loader.load_and_split()
    # Use at most the first two chunks; slicing avoids an IndexError on short papers.
    return {"paper_content": "".join(page.page_content for page in pages[:2])}


def prompt_generation(inputs: dict) -> dict:
    output = inputs["Output"]
    prompt = (
        "find the paper title, author, summary in the paper description below, output them. After that, Use websearch to find out 3 recent papers of the first author in the author section below (first author is the first name separated by comma) and list the paper titles in bullet points: <Paper Description Start>\n"
        + output
        + "<Paper Description End>."
    )
    return {"prompt": prompt}


paper_content_extraction_chain = TransformChain(
    input_variables=["arxiv_link"],
    output_variables=["paper_content"],
    transform=paper_content_extraction,
    verbose=False,
)

paper_summarizer_template = """You are a paper summarizer, given the paper content, it is your job to summarize the paper into a short summary, and extract authors and paper title from the paper content.
Here is the paper content:
{paper_content}
Output:
paper title, authors and summary.
"""
prompt = PromptTemplate(
    input_variables=["paper_content"], template=paper_summarizer_template
)
summarize_chain = LLMChain(llm=llm, prompt=prompt, verbose=False)

prompt_generation_chain = TransformChain(
    input_variables=["Output"],
    output_variables=["prompt"],
    transform=prompt_generation,
    verbose=False,
)

tools = load_tools(["serpapi"], llm=llm)
web_search_agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False
)

sequential_chain = SimpleSequentialChain(
    chains=[paper_content_extraction_chain, summarize_chain]
)

"""
Uncomment the following when you have SerpAPIKey to enable websearch.
"""
# sequential_chain = SimpleSequentialChain(chains=[
#   paper_content_extraction_chain, summarize_chain, prompt_generation_chain, web_search_agent
# ])

### Langchain Transformer for Paper Summaries 

Create a DataFrame containing arXiv links, then use the LangChain Transformer to generate a dataset containing the title, authors, and summary of each paper and, when web search is enabled, recent publications by the first author.

In [0]:
paperLinkDataFrame = spark.createDataFrame(
    [
        (0, "https://arxiv.org/pdf/2107.13586.pdf"),
        (1, "https://arxiv.org/pdf/2101.00190.pdf"),
        (2, "https://arxiv.org/pdf/2103.10385.pdf"),
        (3, "https://arxiv.org/pdf/2110.07602.pdf"),
    ],
    ["label", "arxiv_link"],
)
# construct the LangChain transformer using the paper summarizer chain defined above
langchainTransformer = (
    LangchainTransformer()
    .setInputCol("arxiv_link")
    .setOutputCol("paper_info")
    .setChain(sequential_chain)
    .setSubscriptionKey(openai_api_key)
    .setUrl(openai_api_base)
)


# extract paper information from arxiv links, the paper information needs to include:
# paper title, paper authors, brief paper summary, and recent papers published by the first author
transformed_df = langchainTransformer.transform(paperLinkDataFrame)
transformed_df.show(truncate=True)

"""
Uncomment the following when you have SerpAPIKey to enable websearch.
"""
# paper_info = f.split(transformed_df['paper_info'], "\n", 4)
# # brief post processing to generate multiple columns from a single column
# result = transformed_df.withColumn('paper_title', f.split(paper_info.getItem(0), ":", 2)[1]).withColumn('authors', f.split(paper_info.getItem(1), ":", 2)[1]).withColumn('paper_summary', f.split(paper_info.getItem(2), ":", 2)[1]).withColumn('recent_papers', f.split(paper_info.getItem(3), ":", 2)[1])
# result.select("arxiv_link", "paper_title", "authors", "paper_summary", "recent_papers").show(truncate=True)
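
The commented post-processing above assumes that the model returns one `Label: value` line per field, in a fixed order. As a minimal sketch of the same idea in plain Python, assuming that output format, the helper below splits a `paper_info` string into named fields and tolerates missing lines. `parse_paper_info` is a hypothetical name, the sample text is made up for illustration rather than real model output, and actual LLM output may not follow this layout.

```python
from typing import Dict

FIELDS = ("paper_title", "authors", "paper_summary", "recent_papers")


def parse_paper_info(paper_info: str) -> Dict[str, str]:
    """Split 'Label: value' lines into named fields; missing fields become ''."""
    lines = [line for line in paper_info.strip().split("\n") if line.strip()]
    # Keep any extra lines attached to the last field, similar to a split limit of 4.
    if len(lines) > len(FIELDS):
        head, tail = lines[: len(FIELDS) - 1], lines[len(FIELDS) - 1 :]
        lines = head + ["\n".join(tail)]
    parsed: Dict[str, str] = {}
    for i, field in enumerate(FIELDS):
        if i < len(lines):
            # Drop the label before the first colon; keep the whole line if none.
            _, sep, value = lines[i].partition(":")
            parsed[field] = value.strip() if sep else lines[i].strip()
        else:
            parsed[field] = ""
    return parsed


sample = (
    "Title: Example Title\n"
    "Authors: A. Author, B. Author\n"
    "Summary: A short summary.\n"
    "Recent papers: - Paper one\n"
    "- Paper two"
)
info = parse_paper_info(sample)
```

Unlike the Spark version, this variant skips blank lines and returns empty strings for missing fields, so a row with an unexpected layout yields empty values instead of an error.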