# ["Arabic" SDK for Apache Spark](https://github.com/pyspark-ai/pyspark-ai)

Parts of this notebook were presented during my BigData and GenAI presentation at the [Esri Saudi Arabia User Conference](https://www.esrisaudiarabia.com/en-sa/about/events/esrisaudi-uc2024/overview).

It showcased ChatGPT interacting with BigData through Apache Spark, with the questions asked in Arabic.

This project is based on the [English SDK for Apache Spark](https://github.com/pyspark-ai/pyspark-ai) and uses [HERE historical traffic data](https://www.esri.com/en-us/arcgis-marketplace/listing/products/566238f6f13c43db8ebbb9780c0f2a7a) for the Kingdom of Saudi Arabia.

If you want to run something similar to this notebook, you need access to your own private GPT service and a conda environment with the following packages installed:
- pyspark-ai[all]
- pyspark<3.5
- seaborn>0.13.0

IMHO, GPT-4 had the best responses in Spark SQL code generation to answer the questions in Arabic.

In [None]:
import os
import warnings

from langchain.chat_models import AzureChatOpenAI
from pyspark_ai import SparkAI

warnings.filterwarnings("ignore")

## Create LLM Instance.

Here (no pun intended), we are connecting to Esri's private Azure OpenAI ChatGPT instance and we are using a GPT-4-32K model.

In [None]:
llm = AzureChatOpenAI(
    base_url=os.environ["AZURE_ENDPOINT"],
    openai_api_key=os.environ["OPENAI_API_KEY"],
    openai_api_type="azure",
    openai_api_version="2023-07-01-preview",
    deployment_name="gpt-4-32k",
    temperature=0.0,
)
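The cell above reads `AZURE_ENDPOINT` and `OPENAI_API_KEY` from the environment and raises a bare `KeyError` if either is absent. A minimal sketch of a pre-flight check follows; the helper name `missing_env` is hypothetical and not part of pyspark-ai or LangChain.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = ["AZURE_ENDPOINT", "OPENAI_API_KEY"]


def missing_env(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env(REQUIRED_VARS)
if missing:
    print("Set these variables before creating the LLM:", ", ".join(missing))
```

Running this before constructing the LLM gives a readable message instead of a stack trace.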

## Activate partial functions for Spark DataFrame.

In [None]:
spark_ai = SparkAI(llm=llm)
spark_ai.activate()  
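After activation, every DataFrame exposes an `ai` attribute, which is what the `df.ai.transform(...)` calls later in this notebook rely on. The sketch below shows one way such an accessor could be attached to a class in plain Python, using a descriptor. It is an illustration of the general pattern only and should not be read as pyspark-ai's actual implementation; `FakeDataFrame`, `Accessor` and `activate` are hypothetical names.

```python
class FakeDataFrame:
    """Hypothetical stand-in for a Spark DataFrame."""

    def __init__(self, rows):
        self.rows = rows


class Accessor:
    """Descriptor that binds itself to the instance it is read from."""

    def __init__(self, df=None):
        self.df = df

    def __get__(self, instance, owner):
        # Accessed on the class itself: return the unbound descriptor.
        if instance is None:
            return self
        # Accessed on an instance: return an accessor bound to it.
        return Accessor(instance)

    def count(self):
        return len(self.df.rows)


def activate():
    """Attach the accessor to the class so all instances gain `.ai`."""
    FakeDataFrame.ai = Accessor()


activate()
n = FakeDataFrame([1, 2, 3]).ai.count()
print(n)
```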

## Create a Spark DataFrame from the HERE traffic CSV file.

In [None]:
csv_path = "HERE_20M.csv"  # Picked the first 20M records from the HERE dataset.

schema = ",".join(
    [
        "`LINK-DIR` string",
        "`DATE-TIME` timestamp",
        "`EPOCH-5MIN` integer",
        "`LENGTH` double",
        "`FREEFLOW` double",
        "`SPDLIMIT` double",
        "`COUNT` integer",
        "`MEAN` double",
        "`STDDEV` double",
        "`MIN` double",
        "`MAX` double",
        "`CONFIDENCE` double",
    ]
)

df = spark.read.csv(
    csv_path,
    header=True,
    schema=schema,
    mode="DROPMALFORMED",
).cache()
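The schema passed to `spark.read.csv` is a DDL-style string, and the column names are wrapped in backticks because they contain hyphens. As a sketch, the same string can be built from name/type pairs and inspected without starting Spark; the helper `to_ddl` is a hypothetical name introduced here for illustration.

```python
# Column name/type pairs copied from the schema in this notebook.
columns = [
    ("LINK-DIR", "string"),
    ("DATE-TIME", "timestamp"),
    ("EPOCH-5MIN", "integer"),
    ("LENGTH", "double"),
    ("FREEFLOW", "double"),
    ("SPDLIMIT", "double"),
    ("COUNT", "integer"),
    ("MEAN", "double"),
    ("STDDEV", "double"),
    ("MIN", "double"),
    ("MAX", "double"),
    ("CONFIDENCE", "double"),
]


def to_ddl(cols):
    """Backtick-quote each column name and join the fields with commas."""
    return ",".join(f"`{name}` {dtype}" for name, dtype in cols)


schema_ddl = to_ddl(columns)
print(schema_ddl)
```

Keeping the names and types as pairs makes it easier to spot a missing or mistyped column before reading 20M rows.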

### What is the record count? ما هو عدد السجلات؟

In [None]:
df.ai.transform("ما هو عدد السجلات؟").show(truncate=False)

### What are the top 5 busiest hours? ما هي أكثر 5 ساعات ازدحاما؟

In [None]:
df.ai.transform("ما هي أكثر 5 ساعات ازدحاما؟").show(truncate=False)

In [None]:
res = df.ai.transform(
    "Please show the count by hour of the day and make sure to order the output by the hour of the day"
)
res.show(truncate=False, vertical=False)
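Because the SQL is generated by the model, it helps to know what a correct answer should look like. The sketch below computes the same count by hour of the day in plain Python on a handful of made-up timestamps, assuming that "busiest" means the hours with the most records. The sample values and the helper `count_by_hour` are hypothetical and are not drawn from the HERE dataset.

```python
from collections import Counter
from datetime import datetime

# Made-up sample timestamps, for illustration only.
sample = [
    "2023-01-01 08:05:00",
    "2023-01-01 08:10:00",
    "2023-01-01 17:30:00",
    "2023-01-02 08:00:00",
    "2023-01-02 17:45:00",
    "2023-01-02 23:55:00",
]


def count_by_hour(timestamps):
    """Return (hour, record count) pairs ordered by hour of the day."""
    hours = (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps)
    return sorted(Counter(hours).items())


hourly = count_by_hour(sample)

# The top 5 busiest hours are the same pairs ordered by count, descending.
top5 = sorted(hourly, key=lambda pair: pair[1], reverse=True)[:5]
print(hourly)
print(top5)
```

Comparing a small hand-checked sample like this against the output of `df.ai.transform(...)` is one way to confirm the generated query groups and orders as intended.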

In [None]:
res.ai.plot()

### Plot chart of the top 5 busiest hours

In [None]:
df.ai.plot("رسم بياني لأكثر 5 ساعات ازدحامًا")