# Spark Connect in SparkAgent
This notebook shows how to use SparkAgent with Spark Connect. For the basics, see [Spark Dataframe Agent](spark.ipynb).

**NOTE: this agent calls the Python agent under the hood, which executes LLM-generated Python code. That code can be harmful, so use this agent with caution.**

In [None]:
# Run this from the Apache Spark root directory (tested here with spark-3.4.0-bin-hadoop3 and later).
# To launch Spark with support for Spark Connect sessions, run the start-connect-server.sh script.
!./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
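
The server can take a few seconds to come up, so it may help to confirm that something is accepting connections before creating the remote session. The following is a minimal sketch using only the standard library. The helper name `connect_server_is_listening` is hypothetical, and the default port 15002 is taken from the connection string used later in this notebook. A successful TCP connection only shows that the port is open, not that the listener is a Spark Connect server.

```python
import socket


def connect_server_is_listening(
    host: str = "localhost", port: int = 15002, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `connect_server_is_listening()` returns `False` while the server is still starting or if it failed to launch, which is a cheaper check than waiting for the remote session builder to fail.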

In [None]:
from pyspark.sql import SparkSession

# Now that the Spark server is running, we can connect to it remotely using Spark Connect. We do this by 
# creating a remote Spark session on the client where our application runs. Before we can do that, we need 
# to make sure to stop the existing regular Spark session because it cannot coexist with the remote 
# Spark Connect session we are about to create.
SparkSession.builder.master("local[*]").getOrCreate().stop()

In [4]:
# The server we launched above listens on localhost:15002,
# so we can now create a remote Spark session on the client with the following command.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

In [32]:
csv_file_path = "titanic.csv"
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [35]:
from langchain.agents import create_spark_connect_dataframe_agent
from langchain.llms import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "...input your OpenAI API key here..."

agent = create_spark_connect_dataframe_agent(llm=OpenAI(temperature=0), df=df, verbose=True)
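
Instead of pasting the key into the notebook, it can be read from the environment, with an interactive prompt as a fallback. This is a minimal sketch, not part of LangChain: `resolve_api_key` is a hypothetical helper, and it assumes the key should end up in the `OPENAI_API_KEY` environment variable as in the cell above.

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "OPENAI_API_KEY", prompt: bool = True) -> str:
    """Return the key from the environment, optionally prompting if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key and prompt:
        # getpass hides the typed value, so the key is not echoed into the notebook.
        key = getpass(f"{env_var}: ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    # Store it back so libraries that read the environment variable can find it.
    os.environ[env_var] = key
    return key
```

Running `resolve_api_key()` before creating the agent keeps the key out of the saved notebook while leaving the rest of the cell unchanged.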

In [42]:
agent.run("""
who bought the most expensive ticket?
You can find all supported function types in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
""")



> Entering new AgentExecutor chain...

Thought: I need to find the ticket with the highest fare
Action: python_repl_ast
Action Input: df.agg({"Fare": "max"}).collect()
Observation: [Row(max(Fare)=512.3292)]
Thought: I need to find the ticket with the highest fare
Action: python_repl_ast
Action Input: df.filter(df.Fare == 512.3292).collect()
Observation: [Row(PassengerId=259, Survived=1, Pclass=1, Name='Ward, Miss. Anna', Sex='female', Age=35.0, SibSp=0, Parch=0, Ticket='PC 17755', Fare=512.3292, Cabin=None, Embarked='C'), Row(PassengerId=680, Survived=1, Pclass=1, Name='Cardeza, Mr. Thomas Drake Martinez', Sex='male', Age=36.0, SibSp=0, Parch=1, Ticket='PC 17755', Fare=512.3292, Cabin='B51 B53 B55', Embarked='C'), Row(PassengerId=738, Survived=1, Pclass=1, Name='Lesurer, Mr. Gustave J', Sex='male', Age=35.0, SibSp=0, Parch=0, Ticket='PC 17755', Fare=512.3292, Cabin='B101', Embarked='C')]
Thought:

'The passengers with the most expensive ticket were Miss. Anna Ward, Mr. Thomas Drake Martinez Cardeza, and Mr. Gustave J Lesurer.'
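
The agent answered in two steps: it first aggregated the maximum fare, then filtered for every row with that fare. The same logic can be mirrored in plain Python as a sanity check. This is only an illustrative sketch. `top_fare_passengers` is a hypothetical helper, and the sample rows are copied from the table and the observation above, not read from Spark.

```python
def top_fare_passengers(rows):
    """Return every row whose Fare equals the maximum Fare (ties included)."""
    # Step 1: aggregate the maximum, skipping missing fares.
    fares = [r["Fare"] for r in rows if r["Fare"] is not None]
    if not fares:
        return []
    max_fare = max(fares)
    # Step 2: filter for all rows that match it.
    return [r for r in rows if r["Fare"] == max_fare]


# A few rows taken from the outputs shown in this notebook.
rows = [
    {"PassengerId": 1, "Fare": 7.25},
    {"PassengerId": 2, "Fare": 71.2833},
    {"PassengerId": 259, "Fare": 512.3292},
    {"PassengerId": 680, "Fare": 512.3292},
    {"PassengerId": 738, "Fare": 512.3292},
]

print([r["PassengerId"] for r in top_fare_passengers(rows)])
```

Filtering on the maximum keeps all three tied passengers, whereas sorting by fare and taking the first row would return only one of them.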