# Generative AI Application Development

## Overview

- Start the serverless warehouse.

- Run the initial setup.

- Explore some LangChain tools/wrappers: PythonREPL, DuckDuckGo, Google Serper and ArXiv.

- Create an agent that decides which tool to use to answer each question (as part of its output, it generates Python code that implements some tasks).

- Construct a second agent that answers questions about a pandas DataFrame built from a dataset of paper reviews.

## Setup

In [0]:
# %run ./setup_env/agent_design

In [0]:
import pprint

## Checking tools

- https://docs.langchain.com/oss/python/integrations/tools

### PythonREPL tool

In [0]:
from langchain_experimental.tools import PythonREPLTool

python_repl = PythonREPLTool()
python_repl.run('print("Hello!",  end="")')

Python REPL can execute arbitrary code. Use with caution.


'Hello!'
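The warning printed above is worth taking literally: a Python REPL tool runs whatever string it receives. As a rough mental model (an assumption about the general technique, not the actual `PythonREPLTool` implementation), such a tool can be sketched with `exec` and captured standard output. The name `run_python` is hypothetical.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Execute a code string and return whatever it printed (hypothetical helper)."""
    buffer = io.StringIO()
    try:
        # Capture everything the executed code writes to stdout.
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # fresh globals each call; this is NOT a sandbox
    except Exception as exc:
        # Return the error as text so a caller (e.g. an agent) can react to it.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


print(run_python('print("Hello!", end="")'))
print(run_python('1 / 0'))
```

Returning errors as plain text, instead of raising, is one way to let an agent read the failure and try a corrected command.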


### DuckDuckGo search tool

In [0]:
from langchain_community.tools import DuckDuckGoSearchRun

search_ddg = DuckDuckGoSearchRun()
print(search_ddg.invoke('Top movies of Grammy Awards 2025'))

In [0]:
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper

search_ddg_wrapper = DuckDuckGoSearchAPIWrapper()
print(search_ddg_wrapper.run('Top movies of Grammy Awards 2025'))

### Google Serper tool

- Requires creating and configuring a `SERPER_API_KEY` (read here from a Databricks secret scope).

In [0]:
import os
os.environ['SERPER_API_KEY'] = dbutils.secrets.get(
    scope='serper',
    key='SERPER_API_KEY'
)

In [0]:
%sh
echo $SERPER_API_KEY

[REDACTED]


In [0]:
from langchain_community.utilities import GoogleSerperAPIWrapper

search_serper = GoogleSerperAPIWrapper()
print(search_serper.run('Top movies of Grammy Awards 2025'))

### ArXiv tool

In [0]:
from langchain_community.utilities import ArxivAPIWrapper

In [0]:
arxiv_tool = ArxivAPIWrapper()
docs = arxiv_tool.run('1706.03762')
print(docs)

## Create the 1st agent

### Define the brain of the agent

In [0]:
from langchain_community.chat_models import ChatDatabricks

# define a LangChain chat model
endpoint_llm_1 = 'databricks-qwen3-next-80b-a3b-instruct'
llm_model_1 = ChatDatabricks(
    endpoint=endpoint_llm_1,
    max_tokens=400,  # token limit must be an integer, not a string
    verbose=True,
)




### Define the tools available to the agent

In [0]:
# ?PythonREPLTool

In [0]:
from langchain_community.utilities import ArxivAPIWrapper
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_experimental.tools import PythonREPLTool

from langchain.tools import tool
from langchain.agents import create_agent

arxiv_wrapper = ArxivAPIWrapper()
search_ddg_wrapper = DuckDuckGoSearchAPIWrapper()
python_repl = PythonREPLTool()

# With @tool, the docstring becomes the description the model sees when choosing a tool.
@tool
def arxiv_tool(paper_desc: str) -> str:
    '''Search for papers on arXiv.'''
    return arxiv_wrapper.run(paper_desc)

@tool
def search_tool(terms: str) -> str:
    '''Search the web for the given terms.'''
    return search_ddg_wrapper.run(terms)

@tool
def python_tool(code: str) -> str:
    '''A Python shell. Use this to execute Python commands. Input should be a valid Python command. If you want to see the output of a value, you should print it out with `print(...)`.'''
    return python_repl.run(code)


# set of tools
tools = [python_tool, arxiv_tool, search_tool]
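Each function above becomes a tool whose description is taken from its docstring, and that description is what the model reads when deciding which tool to call. The sketch below illustrates the idea with the standard library only; `simple_tool`, `REGISTRY`, `shout` and `describe_tools` are hypothetical names, not LangChain APIs.

```python
import inspect
from typing import Callable, Dict

# Maps a tool name to the function that implements it.
REGISTRY: Dict[str, Callable[[str], str]] = {}


def simple_tool(func: Callable[[str], str]) -> Callable[[str], str]:
    """Register a function under its own name (hypothetical stand-in for @tool)."""
    if not inspect.getdoc(func):
        raise ValueError(f"{func.__name__} needs a docstring to describe itself")
    REGISTRY[func.__name__] = func
    return func


@simple_tool
def shout(text: str) -> str:
    """Return the text in upper case."""
    return text.upper()


def describe_tools() -> str:
    """Build a 'name: description' listing that a prompt could embed."""
    return "\n".join(f"{name}: {inspect.getdoc(f)}" for name, f in REGISTRY.items())


print(describe_tools())
print(REGISTRY["shout"]("hello"))
```

Because the description is the model's only hint about a tool, a precise docstring probably matters more than the function name.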

### Define logic of the plan (ReAct)

In [0]:
from langchain_core.prompts import PromptTemplate

template = '''Answer the following questions as best you can. You have access to the following tools: [{tools}].

Use the following format:

Question: The input question that you must answer.
Thought: You should always think about what to do.
Action: The action to take, which should be one of [{tools}].
Action_Input: The input to the action.
Observation: The result of the action.
... (this Thought/Action/Action_Input/Observation sequence can repeat N times)
Thought: I now know the final answer.
Final Answer: The final answer to the original input question.

Question: {input}
Thought: {agent_scratchpad}
'''
prompt = PromptTemplate.from_template(template)
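To make the format above concrete, here is a sketch of how a single model reply in that format could be split into either a tool call or a final answer. The `parse_step` function and the sample replies are hypothetical; the agent built in this notebook relies on the framework's own handling, not on this code.

```python
import re
from typing import Dict

# "Action: <tool>" followed on the next line by "Action_Input: <input>".
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction_Input:\s*(.*)", re.DOTALL)


def parse_step(reply: str) -> Dict[str, str]:
    """Classify one reply as a final answer or as a tool call (hypothetical helper)."""
    if "Final Answer:" in reply:
        answer = reply.split("Final Answer:", 1)[1].strip()
        return {"type": "final", "answer": answer}
    match = ACTION_RE.search(reply)
    if match is None:
        raise ValueError("reply matches neither the Action nor the Final Answer format")
    return {
        "type": "action",
        "tool": match.group(1).strip(),
        "tool_input": match.group(2).strip(),
    }


step = parse_step("I should look this up.\nAction: arxiv_tool\nAction_Input: 1706.03762")
print(step)
done = parse_step("I now know the final answer.\nFinal Answer: Read the abstract first.")
print(done)
```

In a full loop, an `action` result would be dispatched to the named tool and its output appended as the next `Observation`, until a `final` result appears.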

In [0]:
from langgraph.prebuilt import create_react_agent

agent1 = create_react_agent(
    model=llm_model_1,
    tools=tools,
    # checkpointer=memory,
)

/home/spark-807bc36d-88b8-45a4-84eb-8e/.ipykernel/3333/command-5906691721714227-2037726547:3: LangGraphDeprecatedSinceV10: create_react_agent has been moved to `langchain.agents`. Please update your import to `from langchain.agents import create_agent`. Deprecated in LangGraph V1.0 to be removed in V2.0.
  agent1 = create_react_agent(


In [0]:
config = {
    'configurable': {'thread_id': '2026'}
}

human_input = '''What could be a nice paper to read about machine learning? Answer this question following the steps below.
Step 1: Decide which paper you would recommend.
Step 2: Show me the abstract of this paper.
Step 3: Collect data about the suggested paper using the search tool.
Step 4: Generate the Python code to make a *word cloud* graph, using the *wordcloud* library, with the words of the paper abstract of Step 3. If you do not find data at Step 3, use some dummy data to make the Python word cloud and then include a graph caption with the label "Using dummy data".
... (Avoid using ``` for Python code. Input should be sanitized by a convenient preprocessing that you suggest. If the input starts with "python", remove this word. The output must be Python code.)
Final step: Tell me a curiosity about the paper that you have chosen.
... (Don't forget to generate the Python code for the word cloud.)
'''
# named `inputs` to avoid shadowing the built-in `input`
inputs = {
    'messages': [('human', human_input)]
}

# AgentExecutor-style options (verbose, handle_parsing_error) are not LangGraph invoke parameters, so they are omitted
res1 = agent1.invoke(input=inputs, config=config)

In [0]:
print(res1)
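The prompt asks the model to avoid Markdown code fences and a leading "python" word, but models do not always comply. A defensive sketch that normalizes a generated snippet before it reaches a Python tool is shown below; `sanitize_code` is a hypothetical helper, not part of the notebook's agent.

```python
import re

# A Markdown code fence, built programmatically so this snippet contains no literal fence.
FENCE = "`" * 3


def sanitize_code(text: str) -> str:
    """Strip a surrounding fence and a stray leading 'python' word (hypothetical helper)."""
    cleaned = text.strip()
    # Remove an opening fence and its optional language tag (e.g. "python").
    cleaned = re.sub(r"^" + re.escape(FENCE) + r"[\w+-]*[ \t]*\n?", "", cleaned)
    # Remove a closing fence.
    cleaned = re.sub(r"\n?" + re.escape(FENCE) + r"$", "", cleaned)
    # Remove a bare leading "python" word, but not identifiers such as "python_var".
    cleaned = re.sub(r"^\s*python\b[ \t]*\n?", "", cleaned)
    return cleaned.strip()


print(sanitize_code(FENCE + "python\nprint(1)\n" + FENCE))
print(sanitize_code("python\nimport os"))
```

Applying such a step inside the tool function, before the code is executed, would keep the cleanup independent of how closely the model follows the prompt.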

<img src="./imgs/code_output.png">

## Construct another agent

### Load data

In [0]:
import json
import pandas as pd

# define constants
DATASET_PATH = '/Volumes/studies/databricks-dev/data_json/reviews.json'
EXPORT_TABLES_TO_DATABRICKS = False


# load json dataset
with open(DATASET_PATH, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

# flatten the nested JSON: one row per review
df = pd.json_normalize(
    dataset,
    record_path=['paper', 'review'],  # walk paper -> review; each review becomes a row
    meta=[['paper', 'id']],  # carry the parent paper's id along as column 'paper.id'
)

df.rename(columns={'paper.id': 'paper_id'}, inplace=True)

print(df.shape)
df.head()

| | confidence | evaluation | id | lan | orientation | remarks | text | timespan | paper_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 1 | es | 0 | | - El artículo aborda un problema contingente y... | 2010-07-05 | 1 |
| 1 | 4 | 1 | 2 | es | 1 | | El artículo presenta recomendaciones prácticas... | 2010-07-05 | 1 |
| 2 | 5 | 1 | 3 | es | 1 | | - El tema es muy interesante y puede ser de mu... | 2010-07-05 | 1 |
| 3 | 4 | 2 | 1 | es | 1 | | Se explica en forma ordenada y didáctica una e... | 2010-07-05 | 2 |
| 4 | 4 | 2 | 2 | es | 0 | | | 2010-07-05 | 2 |
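For readers unfamiliar with `record_path` and `meta`, the flattening above is roughly equivalent to a nested loop over papers and their reviews. The plain-Python sketch below runs on a tiny made-up sample; the nested layout (`paper`, then `review`) is inferred from the `json_normalize` call, and the sample values and the `flatten_reviews` name are invented for illustration.

```python
# Made-up sample with the layout assumed from the json_normalize call.
sample = {
    "paper": [
        {"id": 1, "review": [{"id": 1, "confidence": "4"}, {"id": 2, "confidence": "5"}]},
        {"id": 2, "review": [{"id": 1, "confidence": "3"}]},
    ]
}


def flatten_reviews(dataset: dict) -> list:
    """Return one flat dict per review, tagged with its parent paper's id."""
    rows = []
    for paper in dataset["paper"]:
        for review in paper["review"]:
            row = dict(review)  # copy so the source data is left untouched
            row["paper_id"] = paper["id"]
            rows.append(row)
    return rows


rows = flatten_reviews(sample)
print(len(rows))
print(rows[0])
```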


### Define the agent

In [0]:
from langchain_community.chat_models import ChatDatabricks
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

prefix = '''Input should be sanitized by a convenient preprocessing that you suggest. If the input starts with "python", remove this word. You must use the available pandas DataFrame to answer the input questions.
'''

# define a LangChain chat model
# (allow_dangerous_code is an option of the DataFrame agent, not of the chat model)
endpoint_llm_2 = 'databricks-llama-4-maverick'
llm_model_2 = ChatDatabricks(
    endpoint=endpoint_llm_2,
    max_tokens=400,  # token limit must be an integer, not a string
    verbose=True,
)


In [0]:
agent2 = create_pandas_dataframe_agent(
    llm=llm_model_2,
    df=df,
    prefix=prefix,
    allow_dangerous_code=True,
    verbose=True,
)

## Q/A session with the 2nd agent

In [0]:
df.head()

| | confidence | evaluation | id | lan | orientation | remarks | text | timespan | paper_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 1 | es | 0 | | - El artículo aborda un problema contingente y... | 2010-07-05 | 1 |
| 1 | 4 | 1 | 2 | es | 1 | | El artículo presenta recomendaciones prácticas... | 2010-07-05 | 1 |
| 2 | 5 | 1 | 3 | es | 1 | | - El tema es muy interesante y puede ser de mu... | 2010-07-05 | 1 |
| 3 | 4 | 2 | 1 | es | 1 | | Se explica en forma ordenada y didáctica una e... | 2010-07-05 | 2 |
| 4 | 4 | 2 | 2 | es | 0 | | | 2010-07-05 | 2 |


In [0]:
agent2.invoke('What was the most confidence paper reviews?')



> Entering new AgentExecutor chain...
Thought: To find the most confidence paper reviews, I need to look at the 'confidence' column in the DataFrame and determine the maximum value.

Action: python_repl_ast
Action Input: `df['confidence'].max()`
TypeError: '>=' not supported between instances of 'str' and 'float'
The error indicates that the 'confidence' column contains non-numeric data, which is preventing the max() function from working correctly. I need to check the data type of the 'confidence' column and clean it if necessary.

Action: python_repl_ast
Action Input: `df['confidence'].dtype`
object
The 'confidence' column is of object type, indicating it contains non-numeric data. I need to convert this column to a numeric type to perform the max() operation.

Action: python_repl_ast
Action Input: `df['confidence'] = pd.to_numeric(df['confidence'], errors='coerce')`
NameError: name

{'input': 'What was the most confidence paper reviews?',
 'output': 'The most confidence paper reviews had a confidence of 5.0, and there were 47 such reviews.'}
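The trace shows the agent tripping over a `confidence` column stored as text and then coercing it to numbers. The same idea can be sketched in plain Python as a rough analogue of `pd.to_numeric(..., errors='coerce')`; `coerce_number` and the sample values are hypothetical.

```python
import math


def coerce_number(value) -> float:
    """Convert a value to float, mapping anything unparseable to NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Made-up raw column: numeric strings mixed with an empty string and a missing value.
raw = ["4", "5", "", None, "3"]
numbers = [coerce_number(v) for v in raw]
valid = [n for n in numbers if not math.isnan(n)]

print(max(valid))             # highest parseable value
print(len(raw) - len(valid))  # how many entries were coerced to NaN
```

Cleaning numeric columns once, before creating the agent, might save the model from rediscovering the dtype problem on every question.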

In [0]:
agent2.invoke('Which languages are there in the paper reviews dataset?')



> Entering new AgentExecutor chain...
Thought: To determine the languages present in the paper reviews dataset, I need to examine the 'lan' column of the DataFrame. First, I should verify that the DataFrame is loaded and available for use.

Action: python_repl_ast
Action Input: `print(df.head())`
   confidence evaluation  id lan orientation remarks  \
0         4.0          1   1  es           0
1         4.0          1   2  es           1
2         5.0          1   3  es           1
3         4.0          2   1  es           1
4         4.0          2   2  es           0

                                                text    timespan paper_id
0  - El artículo aborda un problema contingente y...  2010-07-05        1
1  El artículo presenta recomendaciones prácticas...  2010-07-05        1
2  - El tema es muy interesante y puede ser de mu...  2010-07-05        1
3  Se explica en forma

{'input': 'Which languages are there in the paper reviews dataset?',
 'output': "The languages present in the paper reviews dataset are Spanish ('es') and English ('en')."}

In [0]:
agent2.invoke('Which period had the worst results in paper evaluations?')



> Entering new AgentExecutor chain...
Thought: To determine the period with the worst results in paper evaluations, we first need to understand the structure of the given DataFrame and identify the relevant columns. The DataFrame contains information about paper evaluations, including the 'timespan' (which could be interpreted as the period or date of evaluation) and 'evaluation' (which is the evaluation score). We need to analyze the 'evaluation' scores in relation to the 'timespan'.

Action: python_repl_ast
Action Input: `df.groupby('timespan')['evaluation'].mean().idxmin()`
TypeError: agg function failed [how->mean,dtype->object]
The error indicates that the 'evaluation' column is of object type, not numeric, so we cannot directly calculate the mean. We need to convert it to a numeric type first. Let's check the unique values in the 'evaluation' column to understand its content.

Action: python_repl_ast
Action Input: `df['evalua

{'input': 'Which period had the worst results in paper evaluations?',
 'output': '2014-07-05'}
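An answer like this is cheap to verify by hand, which is advisable since the agent's intermediate steps can fail or be truncated. The sketch below reproduces the logic of the agent's first attempt (group by `timespan`, average `evaluation`, take the minimum) in plain Python; the rows are made up and `worst_period` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows; only the two columns used by the question are included.
reviews = [
    {"timespan": "2010-07-05", "evaluation": 1},
    {"timespan": "2010-07-05", "evaluation": 2},
    {"timespan": "2011-07-05", "evaluation": -2},
    {"timespan": "2011-07-05", "evaluation": 0},
]


def worst_period(rows):
    """Return the period with the lowest mean evaluation, plus all period means."""
    by_period = defaultdict(list)
    for row in rows:
        by_period[row["timespan"]].append(row["evaluation"])
    averages = {period: mean(scores) for period, scores in by_period.items()}
    return min(averages, key=averages.get), averages


period, averages = worst_period(reviews)
print(period)
print(averages)
```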

In [0]:
agent2.invoke('Tell me a curiosity about any paper review of this dataset.')



> Entering new AgentExecutor chain...
Thought: To find a curiosity about any paper review in the dataset, I should first examine the available columns and data in the DataFrame. The `print(df.head())` output provides a snapshot of the data, including columns like `text`, `lan`, `confidence`, and `paper_id`. A curiosity could be related to the distribution of languages or the average confidence level in the reviews.

Action: python_repl_ast
Action Input: `df.describe()`
       confidence  evaluation          id
count  403.000000  405.000000  405.000000
mean     3.573201    0.182716    1.824691
std      0.844341    1.502868    0.821362
min      1.000000   -2.000000    1.000000
25%      3.000000   -1.000000    1.000000
50%      4.000000    0.000000    2.000000
75%      4.000000    2.000000    2.000000
max      5.000000    2.000000    4.000000
The `df.describe()` output provides summary statistics for the numerical columns in the DataF

{'input': 'Tell me a curiosity about any paper review of this dataset.',
 'output': 'One curiosity is that some papers, like `paper_id` 54, have very high average confidence levels (5.0) in their reviews. The review for this paper comments on its good presentation but suggests that it is too long.'}

<img src="./imgs/agent_output.png">