# Introduction to Automation with LangChain, Generative AI, and Python
**3.1: Other Parsers (CSV, JSON, Pandas, Datetime)**
* Instructor: [Jeff Heaton](https://youtube.com/@HeatonResearch), WUSTL Center for Analytics and Business Insight (CABI), [Washington University in St. Louis](https://olin.wustl.edu/faculty-and-research/research-centers/center-for-analytics-and-business-insight/index.php)
* For more information visit the [class website](https://github.com/jeffheaton/cabi_genai_automation).

In this section, we explore LangChain's output parsers, which convert the raw text returned by an LLM into structured Python values. We cover parsers for comma-separated lists, JSON structures, Pandas dataframes, and datetime objects, and demonstrate each one with a short example.

## Parse Comma Separated List Response

We begin with the CommaSeparatedListOutputParser, which takes LLM output formatted as a comma-separated list and returns it as a Python list.
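To make the parsing step concrete, here is a minimal standard-library sketch of the idea: split the reply on commas and trim each item. The function name `parse_comma_separated` and the sample reply are hypothetical, and this is a simplification for illustration, not the library's actual implementation.

```python
from typing import List


def parse_comma_separated(text: str) -> List[str]:
    """Split a comma-separated reply into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


reply = "New York, London, Paris"  # hypothetical model reply
print(parse_comma_separated(reply))
```

A parser like this only works when the model returns nothing but the list, which is why the format instructions are injected into the prompt.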

In [1]:
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_aws import ChatBedrock

output_parser = CommaSeparatedListOutputParser()

format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="List ten {subject}.\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions},
)

MODEL = 'mistral.mistral-7b-instruct-v0:2'
# Initialize bedrock, use built in role
model = ChatBedrock(
    model_id=MODEL,
    model_kwargs={"temperature": 0.1},
)

chain = prompt | model | output_parser

Extract a list of cities.

In [2]:
chain.invoke({"subject": "cities"})

['New York',
 'London',
 'Paris',
 'Tokyo',
 'Sydney',
 'Beijing',
 'Mumbai',
 'Istanbul',
 'Seoul',
 'Rio de Janeiro']

Extract a list of programming languages.

In [3]:
chain.invoke({"subject": "programming languages"})

['Here are ten programming languages',
 'listed as comma-separated values:\n\nJava',
 'Python',
 'C++',
 'JavaScript',
 'Ruby',
 'Swift',
 'PHP',
 'Kotlin',
 'Go',
 'R.\n\nThese languages are popular and widely used in various industries and applications.']
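In the output above, the model wrapped the list in an introductory sentence and a closing remark, so the first two and the last elements contain stray text. One possible workaround, sketched below, is to keep only the paragraph with the most commas before splitting. The helper `extract_list` is hypothetical, and the raw reply is reconstructed from the parsed output shown above; this is a heuristic and would fail if the surrounding prose contained more commas than the list itself.

```python
from typing import List


def extract_list(raw: str) -> List[str]:
    """Heuristic: pick the paragraph with the most commas and split it."""
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    best = max(paragraphs, key=lambda p: p.count(","))
    # Trim whitespace and a trailing period left over from sentence-style output.
    return [item.strip().rstrip(".") for item in best.split(",") if item.strip()]


# Reply reconstructed from the parsed output above (an assumption).
raw = (
    "Here are ten programming languages, listed as comma-separated values:\n\n"
    "Java, Python, C++, JavaScript, Ruby, Swift, PHP, Kotlin, Go, R.\n\n"
    "These languages are popular and widely used in various industries and applications."
)
print(extract_list(raw))
```

Tightening the prompt, for example by asking for the list only, is another option and may be more reliable than post-processing.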

## Parse JSON Response

We can also have the LLM format its output as JSON. In this example, we pass in a sentence and ask the model to detect its language and translate it into Spanish, French, and Chinese.

In [4]:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

# Define your desired data structure.
class Translate(BaseModel):
  detected: str = Field(description="the detected language of the input")
  spanish: str = Field(description="the input translated to Spanish")
  french: str = Field(description="the input translated to French")
  chinese: str = Field(description="the input translated to Chinese")

# And a query intended to prompt a language model to populate the data structure.
input_text = "What is your name?"

# Set up a parser + inject instructions into the prompt template.
parser = JsonOutputParser(pydantic_object=Translate)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{input}\n",
    input_variables=["input"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"input": input_text})

{'detected': 'English',
 'spanish': 'No se comprende tu consulta',
 'french': 'Je ne comprends pas ta requête',
 'chinese': '您的查询我理解不了'}
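The parser returns a plain dictionary, so it can be worth checking that every expected field is present and non-empty before using the result. The sketch below does this with the standard library; `EXPECTED_KEYS` and `missing_or_empty` are hypothetical names, and the `result` value is copied from the output above.

```python
from typing import Dict, List

# Field names taken from the Translate model defined above.
EXPECTED_KEYS = ("detected", "spanish", "french", "chinese")


def missing_or_empty(result: Dict[str, str]) -> List[str]:
    """Return the expected keys that are absent or blank in the parsed result."""
    return [key for key in EXPECTED_KEYS if not str(result.get(key, "")).strip()]


result = {
    "detected": "English",
    "spanish": "No se comprende tu consulta",
    "french": "Je ne comprends pas ta requête",
    "chinese": "您的查询我理解不了",
}
print(missing_or_empty(result))  # an empty list means all fields are filled
```

A structural check like this confirms shape only. The values above read as "I do not understand your query" rather than as translations of the input sentence, which suggests the model treated the sentence as a question to answer, so the content still needs a separate review.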

## Query Pandas Dataframe

LangChain can also query Pandas dataframes through the PandasDataFrameOutputParser. Given a dataframe, the parser supplies format instructions that let the LLM translate a natural-language query into an operation on that dataframe; the parser then runs the operation and returns the result. This enables interactive, natural-language exploration of data stored in a dataframe.

The following code loads the classic [iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) and displays its first five rows.

In [5]:
import pprint
from typing import Any, Dict

import pandas as pd
from langchain.output_parsers import PandasDataFrameOutputParser
from langchain_core.prompts import PromptTemplate

# Load the iris dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv", na_values=["NA", "?"]
)

print(df.head())

   sepal_l  sepal_w  petal_l  petal_w      species
0      5.1      3.5      1.4      0.2  Iris-setosa
1      4.9      3.0      1.4      0.2  Iris-setosa
2      4.7      3.2      1.3      0.2  Iris-setosa
3      4.6      3.1      1.5      0.2  Iris-setosa
4      5.0      3.6      1.4      0.2  Iris-setosa


Next, we pass the iris dataframe to a PandasDataFrameOutputParser.

In [6]:
#MODEL = 'anthropic.claude-3-sonnet-20240229-v1:0'
#llm = ChatBedrock(
#    model_id=MODEL,
#    model_kwargs={"temperature": 0.1},
#)

parser = PandasDataFrameOutputParser(dataframe=df)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser
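To our understanding, the format instructions ask the model to reply with a compact command such as `sum:petal_w` (an operation, a colon, then a column) rather than with prose, and the parser runs that command against the dataframe. The sketch below imitates the idea with the standard library only. The function `run_command`, the `OPERATIONS` table, and the reduced command syntax are all assumptions made for illustration, and `ROWS` holds just the five rows printed above.

```python
import statistics
from typing import Callable, Dict, List

# First five iris rows, copied from the df.head() output above.
ROWS = [
    {"sepal_l": 5.1, "sepal_w": 3.5, "petal_l": 1.4, "petal_w": 0.2},
    {"sepal_l": 4.9, "sepal_w": 3.0, "petal_l": 1.4, "petal_w": 0.2},
    {"sepal_l": 4.7, "sepal_w": 3.2, "petal_l": 1.3, "petal_w": 0.2},
    {"sepal_l": 4.6, "sepal_w": 3.1, "petal_l": 1.5, "petal_w": 0.2},
    {"sepal_l": 5.0, "sepal_w": 3.6, "petal_l": 1.4, "petal_w": 0.2},
]

# Hypothetical subset of supported operations.
OPERATIONS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "mean": statistics.mean,
    "min": min,
    "max": max,
}


def run_command(command: str, rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Run an 'operation:column' command against a list of row dictionaries."""
    operation, _, column = command.partition(":")
    operation, column = operation.strip(), column.strip()
    if operation not in OPERATIONS:
        raise ValueError(f"Unsupported operation: {operation!r}")
    if not rows or column not in rows[0]:
        raise ValueError(f"Unknown column: {column!r}")
    values = [row[column] for row in rows]
    return {operation: OPERATIONS[operation](values)}


print(run_command("max:petal_l", ROWS))
```

Restricting the model to a small command vocabulary keeps the LLM from executing arbitrary code while still allowing free-form questions.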

We query for the sum of one of the columns.

In [7]:
query = "Get the sum of petal_w column."
parser_output = chain.invoke({"query": query})
print(parser_output)

{'sum': 179.90000000000003}
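The trailing digits in this result are ordinary binary floating-point rounding error from adding many decimal values, not a parsing problem. A small sketch of tidying the value for display follows; the dictionary is copied from the output above, and the choice of two decimal places is an assumption.

```python
# Value copied from the parser output above.
parser_output = {"sum": 179.90000000000003}

# Round each numeric result to two decimal places for display.
rounded = {name: round(value, 2) for name, value in parser_output.items()}
print(rounded)
```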


## Datetime

LangChain includes the DatetimeOutputParser, which parses datetime values from LLM output. Its format instructions ask the model to reply with a datetime string in one fixed pattern, and the parser converts that string into a Python datetime object. This is useful for scheduling, data analysis, or any application that needs dates and times in a consistent form.

In [8]:
from langchain.output_parsers import DatetimeOutputParser
from langchain_core.prompts import PromptTemplate

output_parser = DatetimeOutputParser()
template = """Answer the users question:

{question}

{format_instructions}"""
prompt = PromptTemplate.from_template(
    template,
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

We can display the prompt that we will use to obtain dates.

In [9]:
print(prompt)

input_variables=['question'] partial_variables={'format_instructions': "Write a datetime string that matches the following pattern: '%Y-%m-%dT%H:%M:%S.%fZ'.\n\nExamples: 458-12-04T04:36:16.905956Z, 714-07-04T10:35:44.885803Z, 283-04-16T04:04:00.424015Z\n\nReturn ONLY this string, no other words!"} template='Answer the users question:\n\n{question}\n\n{format_instructions}'
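The pattern quoted in the format instructions uses the standard `strptime` format codes, so the conversion step can be illustrated with `datetime.strptime` alone. This sketch assumes the parser does something equivalent internally; the sample reply is hypothetical.

```python
from datetime import datetime

# Pattern quoted in the format instructions above.
PATTERN = "%Y-%m-%dT%H:%M:%S.%fZ"

reply = "2024-01-15T09:30:00.000000Z"  # hypothetical model reply
parsed = datetime.strptime(reply, PATTERN)
print(parsed)
```

Printing a datetime object drops the `T` and `Z` characters, which matches the style of the parsed results shown in this section.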


We create the chain that we will use to parse dates.

In [10]:
chain = prompt | model | output_parser

We will query for two dates, one real and the other fictional.

In [22]:
output = chain.invoke({"question": "When was the Python language introduced? Your response should just be a date string, otherwise I can't parse it."})
print(output)

1994-03-20 04:00:00


In [21]:
output = chain.invoke({"question": "What is the date of the war in the video game Fallout? Your response should just be a single date string, otherwise I can't parse it."})
print(output)

2277-10-23 14:35:12.012345
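Both replies above parse cleanly, but the parser checks only the format, not whether the date is factually right, and a reply that contains any extra words will fail to parse. A defensive wrapper along the following lines can turn that failure into a `None` result; `try_parse` is a hypothetical helper built on `datetime.strptime`, and the second sample reply is invented for illustration.

```python
from datetime import datetime
from typing import Optional

# Pattern quoted in the parser's format instructions.
PATTERN = "%Y-%m-%dT%H:%M:%S.%fZ"


def try_parse(reply: str) -> Optional[datetime]:
    """Return the parsed datetime, or None if the reply does not match PATTERN."""
    try:
        return datetime.strptime(reply.strip(), PATTERN)
    except ValueError:
        return None


print(try_parse("2277-10-23T14:35:12.012345Z"))
print(try_parse("Sometime in the late 1980s."))  # hypothetical malformed reply
```

When the result is `None`, the caller can retry the request or fall back to a default, and any date that matters should still be verified against a trusted source.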
