# LLM Code Converter - Apache Pig to PySpark 

Tools used: 
* LangChain: prompt templates, chains, and output parsers
* Ollama: Run LLM locally 
    * LLM models: llama3, mixtral, etc. (TODO: which models are )
* Autogen: LLMs work together to output correct code (code writer & code executor)
* LangGraph: Write a pipeline with sequence, conditional routing, and loop
* LangSmith: monitoring and tracing

---
Log: 
* 05-07-2024: 
  * AutoGen + Ollama: 
    * Ollama + Autogen doc: https://ollama.com/blog/openai-compatibility
    * jupyter code executor: https://microsoft.github.io/autogen/docs/topics/code-execution/jupyter-code-executor

In [None]:
# What is Nomic GPT4AllEmbeddings? 
# - a local embedding model; the embeddings are stored in a vector database 

# Vector databases are used with large language models (LLMs) to supply additional information that the models were not trained on. 
# - TODO: can we store previously successful sets of (pig_code, pyspark_code, sample_data) in a vector DB and use them for future code generation? 
# 
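
A minimal sketch of the TODO above: keep past (pig_code, pyspark_code, sample_data) triples and look up the closest earlier conversion for a new Pig script. The bag-of-words `embed` function and the `ConversionStore` class are hypothetical stand-ins; a real setup would use an embedding model and a vector database instead.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (token -> count). Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ConversionStore:
    """Hypothetical in-memory store of (pig_code, pyspark_code, sample_data) triples."""

    def __init__(self):
        self.records = []

    def add(self, pig_code, pyspark_code, sample_data):
        self.records.append((embed(pig_code), pig_code, pyspark_code, sample_data))

    def most_similar(self, pig_code, k=1):
        query = embed(pig_code)
        ranked = sorted(self.records, key=lambda r: cosine(query, r[0]), reverse=True)
        return [(pig, spark, data) for _, pig, spark, data in ranked[:k]]


store = ConversionStore()
store.add("A = LOAD 'sales.csv'; B = FILTER A BY amount > 200;",
          "df = spark.read.csv('sales.csv'); df = df.filter(df.amount > 200)",
          "amount\n100\n300")
store.add("A = LOAD 'users.csv'; B = GROUP A BY country;",
          "df = spark.read.csv('users.csv'); g = df.groupBy('country')",
          "country\nUS\nDE")

# The closest stored example could be added to the prompt as a few-shot reference.
best_pig, best_spark, _ = store.most_similar("X = FILTER Y BY amount > 500;")[0]
print(best_spark)
```

The retrieved pair would then be pasted into the conversion prompt as a worked example.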

---

## 0. Initial Setup 

* Generate a LangSmith API key.
* TODO: How to safely save and load API keys
* https://docs.smith.langchain.com/
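
One possible answer to the API key TODO, as a sketch: keep the key out of the notebook, read it from the environment (for example populated from a git-ignored `.env` file), and prompt for it only when it is missing. The helper name `load_api_key` and its parameters are assumptions for illustration.

```python
import os
from getpass import getpass


def load_api_key(name, env=None, prompt_fn=getpass):
    """Return the key from the environment, prompting once if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        # getpass hides the typed value, so the key never shows up in notebook output
        key = prompt_fn(f"Enter {name}: ")
        env[name] = key  # lives only in this process, never written to disk
    return key
```

Usage would be `load_api_key("LANGCHAIN_API_KEY")` after `load_dotenv()`.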

In [None]:
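import os 
from dotenv import load_dotenv

# Read variables from a local .env file into the process environment
load_dotenv()

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'

# os.environ only accepts strings, so fail early with a clear message if the key is missing
if not os.getenv('LANGCHAIN_API_KEY'):
    raise RuntimeError("LANGCHAIN_API_KEY is not set; add it to .env or export it first.")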

In [None]:
# configure 
run_local = "Yes"

# select llm model 
# local_llm = "mistral" # mistral: https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_crag_local.ipynb
# local_llm = "mixtral"  # mixtral: https://scalastic.io/en/mixtral-ollama-llamaindex-llm/
# local_llm = "llama3" # llama3: https://python.langchain.com/docs/integrations/chat/ollama/
local_llm = "codellama"

**IMPORTANT! Run the commands below in a terminal to start Ollama and download the LLM model.**  
`ollama serve`  
`ollama pull {model_name}`

TODO: Automate the above (e.g., add to Dockerfile).
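
Until this is moved into a Dockerfile, the two commands above could also be driven from Python. This is only a sketch: the helper names are hypothetical, and it assumes the `ollama` CLI is on PATH and that the server listens on its default address, `http://localhost:11434`.

```python
import shutil
import subprocess
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)


def ollama_is_running(url=OLLAMA_URL, timeout=2):
    """Return True if something answers with HTTP 200 at the Ollama address."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and timeouts are both OSError subclasses
        return False


def pull_command(model_name):
    """Build the `ollama pull` command for a model."""
    return ["ollama", "pull", model_name]


def ensure_model(model_name, wait_seconds=10):
    """Start `ollama serve` if needed, then download the model."""
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama is not installed or not on PATH")
    if not ollama_is_running():
        # Start the server in the background and wait until it answers
        subprocess.Popen(["ollama", "serve"],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        deadline = time.time() + wait_seconds
        while not ollama_is_running() and time.time() < deadline:
            time.sleep(0.5)
    subprocess.run(pull_command(model_name), check=True)
```

In this notebook the call would be `ensure_model(local_llm)`.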

---

## 1. RAG (Index?) - Upload Supporting Documents 
Not strictly needed for this project; a vector DB is added here as a placeholder.   
* reference: https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_crag.ipynb?ref=blog.langchain.dev

In [None]:
from langchain_community.document_loaders import WebBaseLoader # loads web pages as documents
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain_mistralai import MistralAIEmbeddings

# Load
url = "https://github.com/palantir/pyspark-style-guide"
loader = WebBaseLoader(url)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=100
)
all_splits = text_splitter.split_documents(docs)

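# Embed and index
if run_local == "Yes":
    embedding = GPT4AllEmbeddings()
else:
    # embedding = MistralAIEmbeddings(mistral_api_key=mistral_api_key)
    # Without this, `embedding` would be undefined when the index is built below
    raise NotImplementedError("Only local embeddings (run_local = 'Yes') are configured.")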

# Index
vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag-chroma",
    embedding=embedding,
)

In [None]:
# create a retriever
retriever = vectorstore.as_retriever()

---

## 2. LLMs

We build two sets of LLMs: 
1. PIG code --> create benchmark input data (if none is provided by user)
2. PIG code --> PySpark code

**references:** 
* https://python.langchain.com/docs/integrations/chat/ollama/
* JsonOutputParser: https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.json.JsonOutputParser.html
* OutputParser: https://medium.com/@larry_nguyen/langchain-101-lesson-3-output-parser-406591b094d7

### 2.1. Generate input data (CSV) given Pig code

If the user does not provide benchmark input data for testing the PIG code, we use an LLM to generate a Python script that creates sample data and saves it to the specified folder.

We achieve this in the following steps: 
1. Create an LLM prompt template that outputs a Python script designed to generate and save CSV data based on the given PIG code.
2. Execute the LLM to produce the Python script.
3. Parse the generated Python script.
4. Execute the Python script, which results in the CSV file being saved.
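
Steps 3 and 4 above could look roughly like the sketch below: take the parsed JSON, pull out the script under the `data_gen_code` key, and run it in a child process so that a crash in the generated code does not take the notebook down. The function name `run_generated_script` and the timeout are assumptions.

```python
import os
import subprocess
import sys
import tempfile


def run_generated_script(llm_output, key="data_gen_code", timeout=60):
    """Extract the script from the parsed LLM output and execute it.

    Returns (ok, error_text).
    """
    code = llm_output.get(key) if isinstance(llm_output, dict) else None
    if not isinstance(code, str) or not code.strip():
        return False, f"missing or empty '{key}' in LLM output"
    # Write the script to a temporary file so it runs outside the notebook process
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    try:
        result = subprocess.run([sys.executable, script_path],
                                capture_output=True, text=True, timeout=timeout)
        return result.returncode == 0, result.stderr
    except subprocess.TimeoutExpired:
        return False, f"script did not finish within {timeout} seconds"
    finally:
        os.remove(script_path)


# Stand-in for a real LLM response: a one-line script that writes a tiny CSV file
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "sample.csv")
fake_output = {"data_gen_code": f"open({csv_path!r}, 'w').write('id,amount\\n1,250\\n')"}
ok, err = run_generated_script(fake_output)
print(ok, os.path.exists(csv_path))
```

The returned error text is what a regeneration prompt would receive. Generated code is untrusted, so a container or other sandbox is worth considering before running it on anything but a development machine.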

In [None]:
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
# from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.output_parsers import JsonOutputParser

# we use locally hosted llm models 
llm = ChatOllama(model = local_llm, format = "json", temperature=0.2)


## Create two templates: 
# 1. pig code --> benchmark input data
prompt_data_gen = PromptTemplate(
    template="""
    You are an expert data scientist fluent in PIG and Python coding languages.
    Generate Python code that does the following: 
    1. Generate 20 or more lines of CSV data that can be used to test the PIG code. 
       Ensure column names are consistent with the names in the PIG code. 
    2. Save this CSV data to the directory provided. 
        
    Here is the PIG code: \n\n {pig_code} \n\n
    Here is the directory to save the CSV file: \n\n {sample_input_path} \n\n

    Give a string of Python code with correct indentation that can be run to create and save the CSV file to the correct path. 
    Provide this as a JSON with a single key 'data_gen_code' and no preamble or explanation.""",
    input_variables=["pig_code", "sample_input_path"],
)
sample_input_code_generator = prompt_data_gen | llm | JsonOutputParser()

_**The code below is a work in progress.**_

In [None]:
# If there was an error in the outputted Python script for generating CSV file, 
#  add that error back to LLM and re-generate an updated Python script. 
prompt_data_regen = PromptTemplate(
    template="""
    You are an expert data scientist fluent in PIG and Python coding languages.
    Generate Python code that does the following: 
    * Debug and share updated Python code to generate 100 or more lines of CSV data that can be used to test the PIG code. 
    * Use the error message and the code that resulted in the error as a reference to fix the Python code. 
        
    Here is the PIG code: \n\n {pig_code} \n\n
    Here is the Python code with error: \n\n {pycode_error} \n\n
    Here is the Python code error message: \n\n {pycode_error_message} \n\n
    Here is the directory to save the CSV file: \n\n {sample_input_path} \n\n

    Give a string of Python code with correct indentation that can be run to create and save a CSV file with more than 100 records to the correct path. 
    Provide this as a JSON with a single key 'data_gen_code' and no preamble or explanation.""",
    input_variables=["pig_code", "pycode_error", "pycode_error_message", "sample_input_path"],
)
fix_sample_input_code_generator = prompt_data_regen | llm | JsonOutputParser()


# 2. pig code to pyspark code 
prompt_pig2pyspark = PromptTemplate(
    template="""
    You are an expert data scientist fluent in PIG and PySpark coding languages.
    Generate PySpark code that does the following: 
    * Implements the same logic and methods as the provided PIG code. 
    * When run against sample input data, outputs a result identical to that of the PIG code. 
        
    Here is the PIG code: \n\n {pig_code} \n\n

    Give a string of PySpark code with correct indentation. 
    Provide this as a JSON with a single key 'pyspark_code' and no preamble or explanation.""",
    input_variables=["pig_code"],
)
pig_to_pyspark_converter = prompt_pig2pyspark | llm | JsonOutputParser()

prompt_pig2pyspark_regen = PromptTemplate(
    template="""
    You are an expert data scientist fluent in PIG and PySpark coding languages.
    Generate PySpark code that does the following: 
    * Implements the same logic and methods as the provided PIG code. 
    * Uses the PySpark code that returned an error message as the starting point. 
    * Uses the PySpark code error message to fix the PySpark code. 
    * When run against sample input data, outputs a result identical to that of the PIG code. 
        
    Here is the PIG code: \n\n {pig_code} \n\n
    Here is the PySpark code with error: \n\n {pycode_error} \n\n
    Here is the PySpark code error message: \n\n {pycode_error_message} \n\n

    Give a string of PySpark code with correct indentation. 
    Provide this as a JSON with a single key 'pyspark_code' and no preamble or explanation.""",
    input_variables=["pig_code", "pycode_error", "pycode_error_message"],
)
# The fix chain uses the regeneration prompt, which carries the failing code and its error message
fix_pig_to_pyspark_converter = prompt_pig2pyspark_regen | llm | JsonOutputParser()

In [None]:
from datetime import datetime

## test with sample PIG code 
pig_script_dir = './scripts/pig1.pig'

with open(pig_script_dir, 'r') as file:
    sample_pig_code = file.read()

print('*'*88)
print("Pig Code\n")
print(sample_pig_code)
print('*'*88)

data_output_dir = './data'

datagen_code = sample_input_code_generator.invoke({"pig_code": sample_pig_code, 
                                                   "sample_input_path": data_output_dir})
print('*'*88)
print("Python Code to generate sample data:\n")

print(datagen_code['data_gen_code'])
print('*'*88)

**TODO:** 
* Add unit tests to ensure the generated data is useful.
* The number of generated records is incorrect. 
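
A first unit-test style check for the TODOs above might verify the column names and the row count of the generated CSV. This is a sketch: `check_sample_csv` is a hypothetical helper, the default of 20 rows follows the data generation prompt, and the column names in the usage lines are only examples.

```python
import csv
import io


def check_sample_csv(csv_text, expected_columns, min_rows=20):
    """Return a list of problems found in the generated CSV (empty list means it passed)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in expected_columns if c not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    rows = list(reader)
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(rows)}")
    return problems


good = "depStore,amount\n" + "\n".join(f"store{i % 3},{100 + i * 10}" for i in range(25))
bad = "depStore\nA\n"
print(check_sample_csv(good, ["depStore", "amount"]))  # prints []
print(check_sample_csv(bad, ["depStore", "amount"]))
```

The list of problems can be passed back to the LLM as the error message when regenerating the data script.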

---

## 1. Install and Load Dependencies

In [None]:
# this will be handled in requirements.txt

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_core.prompts.prompt import PromptTemplate
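import re  # the standard-library module is enough for the patterns used below
import subprocess
import os
import pandas as pd
import glob
import pyspark  # not aliased to `spark`, which is used for the SparkSession later on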

---

## Load LLM 

In [None]:
class LanguageModelService:
    def __init__(self, model_id="mistralai/Mistral-7B-Instruct-v0.2", # LLM pre-trained model 
                 device=0, # Use 1st GPU 
                 max_new_tokens=1000 # output token count 
                ):
        """
        Initializes the model with the specified parameters.
        """
        self.hf = HuggingFacePipeline.from_model_id(
            model_id=model_id,
            task="text-generation",
            device=device,
            pipeline_kwargs={"max_new_tokens": max_new_tokens},
        )
        
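    def query(self, input_text):
        """
        Sends a custom query to the model and returns the output.
        """
        # invoke() is the current LangChain entry point; calling the LLM object directly is deprecated
        result = self.hf.invoke(input_text)
        return result.text if hasattr(result, 'text') else result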

# Usage
%time  lm_service = LanguageModelService()  # Initialize once

---

## 2. Load PIG Code (optional: Upload sample data)

---

## 3. Create sample data 

Note: batch GPU inference is supported: https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/

### 3.1. Create sample test data using LLM (if none provided by user)

In [None]:
%%time 

# Assumes `pig_code` holds the text of the Pig script (see section 2).
query_create_sample_data = f"""
    Given the following PIG code, generate sample CSV data that will produce consistent 
    results when processed by this PIG code. The PIG code is intended to perform operations 
    such as filtering, grouping, and aggregation. Ensure that the sample data is diverse enough 
    to test all parts of the code effectively.\n\n
    PIG Code:\n{pig_code}\n\n
    Write Python code that saves the sample CSV file to ./data/sample1.csv with a header row. Enclose the code between ```. 
    Make sure there is only one code chunk (```).
"""

response = lm_service.query(query_create_sample_data)
print("Response from model:\n", response)

In [None]:
def normalize_indentation(code):
    lines = code.split('\n')
    # Find the first non-empty line to determine the base indentation level
    base_indent = None
    for line in lines:
        stripped_line = line.lstrip()
        if stripped_line:
            base_indent = len(line) - len(stripped_line)
            break

    if base_indent is None:
        return code  # Return original code if it's all empty lines or no base indent found

    # Normalize each line by removing the base indentation
    normalized_lines = []
    for line in lines:
        stripped_line = line.lstrip()
        if len(line) > base_indent:
            normalized_lines.append(line[base_indent:])
        else:
            normalized_lines.append(stripped_line)

    return '\n'.join(normalized_lines)

def parse_python_code_from_text(text):
    """Return the first fenced code block in `text`, or None if there is none."""
    normalized_text = normalize_indentation(text)
    
    # Extract code between ``` (optionally tagged ```python) and the closing ```.
    # The prompts ask for a plain ``` block, so the language tag cannot be required.
    pattern = r'```(?:python)?\s*(.*?)\s*```'
    match = re.search(pattern, normalized_text, re.DOTALL)
    
    if match:
        code_to_execute = match.group(1)
        print(code_to_execute)
        return code_to_execute

    print("No Python code block found.")
    return None

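def run_pyspark_code(code):
    """
    Executes PySpark code, returns either error message or result DataFrame.
    Assumes PySpark session and context are already set.
    """
    # exec() called inside a function does not write new variables to the module globals,
    # so run the code in a copy of the globals and read the result from that copy.
    namespace = dict(globals())
    try:
        exec(code, namespace)
        return None, namespace.get('sales_summary')  # Assuming 'sales_summary' is the result DataFrame
    except Exception as e:
        return str(e), None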


In [None]:
### 3.2. The LLM may not output working test data. Repeat until correct data is produced
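
The "repeat until correct" idea can be sketched as a bounded retry loop that feeds the last error back into the next generation attempt. `generate_until_valid` and its two callables are hypothetical; in this notebook `generate` would wrap the LLM call and `validate` would run and check the generated script.

```python
def generate_until_valid(generate, validate, max_attempts=3):
    """Call generate(feedback) until validate(candidate) reports no error.

    generate: takes the previous error message (None on the first try), returns a candidate.
    validate: returns an error message string, or None when the candidate is fine.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = validate(candidate)
        if feedback is None:
            return candidate, attempt
    raise RuntimeError(f"no valid result after {max_attempts} attempts; last error: {feedback}")


# Stand-ins for the LLM: the first answer is empty, the second one is usable
answers = iter(["", "id,amount\n1,250\n"])
result, attempts_used = generate_until_valid(
    generate=lambda feedback: next(answers),
    validate=lambda text: None if text.strip() else "empty output",
)
print(attempts_used)  # prints 2
```

Capping the number of attempts keeps a model that never produces valid data from looping forever.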

### 3.3. Save sample data and output 

#### 3.3.1. Save Sample Data

In [None]:
code = parse_python_code_from_text(response)
run_pyspark_code(code)

#### 3.3.2. Execute PIG code and save output 

In [None]:
def run_pig_script(script_path, data_path):
    # Set up the environment variable to point to the directory containing the data file
    os.environ['PIG_DATA_PATH'] = data_path

    # Execute the Pig script using subprocess, assuming Pig is installed and configured to run in local mode
    result = subprocess.run(['pig', '-x', 'local', '-f', script_path], capture_output=True, text=True)
    
    # Check the results of the Pig script execution
    if result.returncode != 0:
        print("Error occurred during Pig script execution:")
        print(result.stderr)
    else:
        print("Pig script executed successfully. Output:")
        print(result.stdout)

# Define the path to the Pig script and the directory containing the data file
pig_script = './scripts/pig1.pig'
csv_data = './data/'

# Execute the Pig script
run_pig_script(pig_script, csv_data)

In [None]:

def get_latest_file(directory, pattern):
    """
    Returns the path to the latest file in the given directory that matches the pattern.

    Args:
    directory (str): The directory to search in.
    pattern (str): The file name pattern to match.

    Returns:
    str: The path to the latest file matching the pattern or None if no file matches.
    """
    # Create the full path pattern
    search_pattern = os.path.join(directory, pattern)
    
    # List all files matching the pattern
    files = glob.glob(search_pattern)
    
    if not files:
        return None

    # Find the latest file based on last modification time
    latest_file = max(files, key=os.path.getmtime)
    
    return latest_file

# Example usage
directory = './output/high_value_transactions'
file_pattern = 'part-*'
latest_file = get_latest_file(directory, file_pattern)
print(f"The latest file is: {latest_file}")


In [None]:
# Path to the output file created by Pig
output_file = latest_file  # Adjust path as needed

# Read the CSV file into a DataFrame
df = pd.read_csv(output_file, names=['depStore', 'total_sales'])
df.head()

---

In [None]:
# A.1. LLM transcriber PIG2PySpark
# Prompt: 
# You are an experienced Software Engineer and Machine Learning Engineer fluent in PIG and PySpark coding languages. 
# Rewrite the following PIG code into PySpark so that they can perform identical tasks and output identical results given same input data. 
# PIG Code: {pig_code}
# A.2. Code Parser: Parse and save PySpark code 

# B. Sample Data Builder 
# C. PIG Interpreter 
# D. PySpark Interpreter 

# If sample data NOT available (run B.)
  # Run PIG code and generate output 
  # 1. Generate PySpark code from PIG (run A.1) and parse/save the PySpark Code (run A.2)
# 2. Run against sample data in a separate module.
# 3. Check 

---

## Building LLM Layer - Pig2PySpark 
* https://medium.com/@yash9439/unleashing-the-power-of-falcon-code-a-comparative-analysis-of-implementation-approaches-803048ce65dc
* ref: https://medium.com/@ajay_khanna/leveraging-llama2-0-for-question-answering-on-your-own-data-using-cpu-aa6f75868d2d
* ref: https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476
* https://wellsr.com/python/fine-tuning-llama2-for-question-answering-tasks/
* https://www.kaggle.com/code/gpreda/rag-using-llama-2-langchain-and-chromadb

In [None]:
prompt_pig2pyspark = f"""
You are an experienced software and machine learning engineer fluent in both PIG and PySpark. 
Re-write the following PIG code into PySpark code. 
Ensure the PySpark code is logically identical and outputs identical results to the provided PIG code. 
Make sure to only share PySpark in a single code block (inside ```). 
PIG code: {pig_code}
"""

%time response = lm_service.query(prompt_pig2pyspark)
print("Response from model:\n", response)


In [None]:
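# The response is free text: extract the code block first, then unpack (error, DataFrame)
code = parse_python_code_from_text(response)
error_message, result_df = run_pyspark_code(code)
print(error_message)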

In [None]:
def build_prompt_fix_error(pyspark_code, error_message, input_data):
    """
    Generate prompt to fix error in PySpark code.
    Returns a string with the prompt to fix errors.
    """
    return f"""There was an error in your PySpark code: {error_message}. Please fix and re-share the full PySpark code with relevant updates inside a single code cell.
    
    Below is the PySpark code that returned an error: 
    {pyspark_code}. 

    Below is the input data: 
    {input_data}
    """

In [None]:
prompt_fix_pyspark_code = build_prompt_fix_error(response, error_message, df)

response = lm_service.query(prompt_fix_pyspark_code)
print("Response from model:\n", response)

In [None]:
print(response)

In [None]:
code = parse_python_code_from_text(response) # this did not run 
run_pyspark_code("""from pyspark.sql import SparkSession, functions as F

# Load the data from a CSV file
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()
transactions = spark.read.option("header", "true").option("inferSchema", "true").csv("data/sample1.csv")

# Filter transactions to include only those where the amount is greater than 200
high_value_transactions = transactions.filter(F.col("amount") > 200)

# Group the transactions by store
grouped_by_store = high_value_transactions.groupBy("depStore")

# Calculate total and average sales per depStore
sales_summary = grouped_by_store.agg(F.sum("amount").alias("total_sales"), F.avg("amount").alias("average_sales"))

# Store the summary in a CSV file
sales_summary.write.option("header", "true").csv("output/sales_summary", mode="overwrite")

# Optional: Just for demonstration, store filtered data to another directory
high_value_transactions.write.option("header", "true").csv("output/high_value_transactions", mode="overwrite")""")

In [None]:
execute_python_code_from_text

In [None]:
# ===============================================================================================================
# parse Python code inside the code cell (TODO: How to ensure code is consistently inside the triple back ticks?) 
# How to loop the code so that it runs until correct code is written? 
# How does Langchain come into play? 
# When debugging: 1) provide loaded data head 2) code 3) error message or output if run was successful --> output updated code
# ===============================================================================================================
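
On the first question above (code not always arriving inside triple backticks), one option is a tolerant extractor that accepts tagged and untagged fences and returns `None` otherwise, so the caller can re-prompt. This is a sketch; `extract_code` is a hypothetical helper, and the fence is built from single backticks only to keep this example readable.

```python
import re
import textwrap

FENCE = "`" * 3  # three backticks, built this way to avoid a literal fence in this example

# Opening fence, optional language tag, newline, body (non-greedy), closing fence
_BLOCK = re.compile(FENCE + r"[ \t]*([A-Za-z0-9_+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text, language="python"):
    """Return the first block tagged `language`, else the first block, else None."""
    blocks = _BLOCK.findall(text)
    if not blocks:
        return None  # caller can ask the model again
    tagged = [body for tag, body in blocks if tag.lower() == language]
    body = tagged[0] if tagged else blocks[0][1]
    return textwrap.dedent(body).strip()


reply = "Sure, here it is:\n" + FENCE + "python\nx = 1\n" + FENCE + "\nDone."
print(extract_code(reply))  # prints x = 1
```

Asking the model for JSON with a single code key, as the LangChain chains earlier in this notebook do, avoids fence parsing altogether.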

---
---
---

In [None]:
# # Load the model and tokenizer from the cache for use
# tokenizer = AutoTokenizer.from_pretrained('./model_cache/Meta-Llama-3-8B-Instruct')
# model = AutoModelForCausalLM.from_pretrained('./model_cache/Meta-Llama-3-8B-Instruct')

# # Setup the pipeline with local model and tokenizer
# text_generation = pipeline(
#     "text-generation",
#     model=model,
#     tokenizer=tokenizer,
#     device=0  # Assuming using the first GPU
# )

# # Generate text
# generated_text = text_generation("Sample prompt text goes here", max_length=50)
# print(generated_text)

In [None]:
# question = f"Re-write the following PIG code into PySpark code. Following is the PIG code: \n {pig_script_code}"
# template = f"""
# You are an intelligent software engineer and machine learning engineer. Re-write the following PIG code into PySpark code. Make sure to only share PySpark code so it's easy to copy and paste. 
# PIG code: {question}
# --------------------------------------------------------------------
# PySpark code:"""

# sequences = pipeline(
#     template,
#     max_length=5000,
#     do_sample=True,
#     top_k=10,
#     num_return_sequences=1,
#     eos_token_id=tokenizer.eos_token_id,
# )

# for seq in sequences:
#     print(f"Result: {seq['generated_text']}")

---
## Test PySpark Code

In [None]:
data.head()

In [None]:
def build_prompt_starter(): 
    """
    Generate prompt to start transcribing PIG to PySpark.
    """

def build_prompt_fix_error(): 
    """
    Generate prompt to fix error in PySpark code.
    """

def build_prompt_fix_output(): 
    """
    Generate prompt to fix code output mismatch (between result from PIG and PySpark). 
    """
    

In [None]:
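import os
from pyspark.sql import SparkSession


def run_pyspark_code(code_dir, code_name, sample_data_dir):
    """
    Run the saved PySpark script `code_name` from `code_dir` against the sample data.
    Returns (error_message, result_df); exactly one of the two is None.
    Assumption: the script reads its input location from the variable `sample_data_dir`
    and leaves its final DataFrame in a variable named `result_df`.
    """
    # start (or reuse) the PySpark session
    spark = SparkSession.builder.appName("pig2pyspark-test").getOrCreate()
    namespace = {"spark": spark, "sample_data_dir": sample_data_dir}
    try:
        # run code against test data
        with open(os.path.join(code_dir, code_name)) as f:
            exec(f.read(), namespace)
    except Exception as e:
        return str(e), None
    return None, namespace.get("result_df")


def check_results(pig_result_df, pyspark_result_df):
    # return True if the results are identical (same rows, row order ignored)
    return (pig_result_df.exceptAll(pyspark_result_df).count() == 0
            and pyspark_result_df.exceptAll(pig_result_df).count() == 0)


# loop until both the PySpark code runs fine and the output data from PIG and PySpark are the same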
    




In [None]:
from pyspark.sql import SparkSession
from tabulate import tabulate

def build_prompt_starter(pig_code):
    """
    Generate prompt to start transcribing PIG to PySpark.
    Returns a string with the starter prompt.
    """
    return f"""
    I need to convert the following Apache PIG script into Apache PySpark code. 
    The PySpark code should perform the same tasks and produce identical outputs as the PIG code. 
    Please ensure that the PySpark code uses DataFrame operations wherever possible and include comments explaining any complex parts or transformations.
    
    PIG Code: 
    {pig_code}
    """






def build_prompt_fix_output(pig_output, pyspark_output):
    """
    Generate prompt to fix code output mismatch between results from PIG and PySpark.
    Returns a string with the prompt for fixing output mismatches.
    """
    return f"""
    The outputs between PIG and PySpark do not match. Please adjust the PySpark code so that the output from PySpark code is identical to that from PIG code. 
    Below are the two outputs with mismatch:

    Pig Output (ground truth): 
    {pig_output}

    
    PySpark Output: 
    {pyspark_output}
    """

def check_results(pig_result_df, pyspark_result_df):
    """
    Compare PIG and PySpark DataFrames to check if the results are identical.
    Row order is ignored; exceptAll (unlike subtract) keeps duplicate rows in the comparison.
    """
    return (pig_result_df.exceptAll(pyspark_result_df).count() == 0
            and pyspark_result_df.exceptAll(pig_result_df).count() == 0)


# Set up PySpark
spark = SparkSession.builder.appName("Sales Summary").getOrCreate()

# Initial PySpark code generation using an LLM (not shown here)
pyspark_code = """
# Assume pyspark_code is filled with the initially generated code
"""
pig_result_df = spark.createDataFrame(...)  # Assume this is setup elsewhere

# Run and refine PySpark code. The loop is bounded so that a model that never
# converges cannot run forever.
max_attempts = 5
for attempt in range(max_attempts):
    error_message, pyspark_result_df = run_pyspark_code(pyspark_code)
    if not error_message and pyspark_result_df is None:
        error_message = "The code ran but did not define the expected result DataFrame."
    if not error_message and check_results(pig_result_df, pyspark_result_df):
        # Results are now fine
        print("PySpark code executed successfully and results match PIG.")
        break
    if error_message:
        # `df` is the data loaded earlier with pd.read_csv, as in the earlier call
        prompt = build_prompt_fix_error(pyspark_code, error_message, df)
    else:
        # Render both results as text tables so they can be embedded in the prompt
        pig_output = tabulate(pig_result_df.toPandas(), headers='keys', tablefmt='psql', showindex="never")
        pyspark_output = tabulate(pyspark_result_df.toPandas(), headers='keys', tablefmt='psql', showindex="never")
        prompt = build_prompt_fix_output(pig_output, pyspark_output)
    # Here you would update `pyspark_code` based on the LLM's output (not shown here)
else:
    print(f"Stopped after {max_attempts} attempts without matching results.")
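
When the two results are only available as files (Pig part files and a PySpark CSV output), the comparison can also be done without Spark. The sketch below treats each output as a multiset of rows, so order is ignored but duplicates count. It assumes comma-delimited text, no header in the Pig output and a header in the PySpark output; the function names are hypothetical.

```python
import csv
import io
from collections import Counter


def rows_as_multiset(csv_text, has_header=False, float_digits=6):
    """Parse CSV text into a Counter of row tuples."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        rows = rows[1:]

    def normalize(value):
        # Treat 300 and 300.0 as equal; leave non-numeric values as stripped text
        try:
            return round(float(value), float_digits)
        except ValueError:
            return value.strip()

    return Counter(tuple(normalize(v) for v in row) for row in rows if row)


def diff_outputs(pig_text, pyspark_text, pyspark_has_header=True):
    """Return the rows that appear in only one of the two outputs."""
    pig = rows_as_multiset(pig_text)
    spark_rows = rows_as_multiset(pyspark_text, has_header=pyspark_has_header)
    return {"only_in_pig": pig - spark_rows, "only_in_pyspark": spark_rows - pig}


pig_text = "A,500\nB,300.0\n"
pyspark_text = "depStore,total_sales\nB,300\nA,500.0\n"
diff = diff_outputs(pig_text, pyspark_text)
print(not diff["only_in_pig"] and not diff["only_in_pyspark"])  # prints True
```

The rows left in either counter are a compact way to show the model exactly where the outputs disagree.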


In [None]:
# subprocess to test out sample codes: 
# link: https://www.google.com/search?q=Python+application+which+run+PIG+code

# langchain 

# langsmith 

# streamlit 

# airflow 

# 