
CrossPL: Systematic Evaluation of Large Language Models for Cross-Programming-Language Interoperating Code Generation

CrossPL is the first benchmark for systematically assessing LLM performance on cross-programming-language (CPL) code generation. It covers two primary interoperation modes and 2,534 tasks in total: 1,982 Inter-Process Communication (IPC) tasks spanning six languages and 522 Python–C Foreign Function Interface (FFI) tasks.




Why CrossPL? (Motivation)

Modern software systems are inherently multi-language—over 80% of real-world projects use two or more programming languages to combine complementary strengths (e.g., Python for productivity, C/C++ for performance).

Existing LLM benchmarks focus on:

  • Single-language code generation
  • Cross-language code translation

They do not evaluate whether models can generate interoperating code that enables real cross-language collaboration.

In practice, cross-language systems rely on two core mechanisms:

  • IPC (Inter-Process Communication): protocol compliance, serialization, synchronization, and correct state transitions
  • FFI (Foreign Function Interface): function signatures, type conversion, and memory management

These scenarios require correctness beyond syntax—errors can cause deadlocks, crashes, or undefined behavior.
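To make the IPC failure mode concrete, here is a minimal sketch of a length-prefixed TCP exchange. Both endpoints are Python purely to keep the example self-contained; in a CPL setting the two sides would be written in different languages, which is exactly when an agreed framing protocol matters — if either side miscounts the 4-byte prefix, the reader blocks forever.

```python
import socket
import threading

def echo_server(sock: socket.socket) -> None:
    """Accept one connection and echo a single length-prefixed message."""
    conn, _ = sock.accept()
    with conn:
        size = int.from_bytes(conn.recv(4), "big")   # protocol: 4-byte big-endian length
        payload = conn.recv(size)                    # then exactly `size` payload bytes
        conn.sendall(size.to_bytes(4, "big") + payload)

# Bind to an ephemeral port so the example is self-contained.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as client:
    msg = b"hello"
    client.sendall(len(msg).to_bytes(4, "big") + msg)  # honor the length-prefix protocol
    size = int.from_bytes(client.recv(4), "big")
    reply = client.recv(size)                          # mismatched framing would hang here
assert reply == msg
```

A deadlock caused by a framing mismatch produces no error message at all, which is why CrossPL validates state transitions rather than relying on crashes to signal failure.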

CrossPL addresses this gap by systematically evaluating LLMs’ ability to generate correct and executable cross-language interoperating code across IPC and FFI settings.

ipc demo

Figure 1: Examples of CPL interoperating (IPC and FFI).


Our Contributions

1. CrossPL Benchmark

  • We introduce CrossPL, the first benchmark specifically designed to evaluate LLMs’ ability to generate cross-programming-language (CPL) interoperating code involving both IPC and FFI.
  • The benchmark contains 2,534 tasks in total:
    • IPC subset: 1,982 tasks spanning six programming languages
    • FFI subset: 522 Python–C interoperability tasks

2. Automated Benchmark Construction Methodology

We propose a unified and automated construction framework that combines FSM-based IPC interface characterization with LLM-driven workflows.

  • FSM-based IPC modeling

    • Designed 156 finite state machines (FSMs) based on official CPL interface specifications.
    • Formally characterize IPC interaction patterns.
    • Enable automatic detection and extraction of IPC snippets from real-world GitHub repositories.
    • Serve as structured evaluators for protocol compliance and state-transition coverage.
  • Two LLM-based construction pipelines

    • CrossPL-IPC pipeline:
      FSM-guided snippet identification → LLM-based judgement → Code extraction → FSM-based validation → Instruction generation → Human check → Performance evaluation.
    • CrossPL-FFI pipeline:
      Focused Python–C task construction with controlled compilation environments and assertion-based testing for functional correctness.
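The FSM-based characterization can be pictured as a small transition table over API calls. The states and call names below are illustrative stand-ins for a TCP client pattern, not one of the paper's actual 156 FSMs:

```python
# Illustrative FSM for a TCP client interaction pattern.
# Keys are (current state, triggering API call) -> next state.
TCP_CLIENT_FSM = {
    ("INIT",      "socket"):  "CREATED",
    ("CREATED",   "connect"): "CONNECTED",
    ("CONNECTED", "send"):    "CONNECTED",
    ("CONNECTED", "recv"):    "CONNECTED",
    ("CONNECTED", "close"):   "CLOSED",
}

def accepts(api_calls, fsm=TCP_CLIENT_FSM, start="INIT", final="CLOSED"):
    """Return True iff a call sequence follows the protocol FSM end to end."""
    state = start
    for call in api_calls:
        key = (state, call)
        if key not in fsm:       # illegal transition: sequence rejected
            return False
        state = fsm[key]
    return state == final        # must also end in an accepting state

ok  = accepts(["socket", "connect", "send", "recv", "close"])  # valid sequence
bad = accepts(["socket", "send", "close"])                     # send before connect
```

The same table serves both roles described above: detecting candidate snippets (does a file's call sequence reach an accepting state?) and evaluating generated code for protocol compliance.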

3. Large-scale Empirical Study

  • Evaluated 20 representative LLMs on CrossPL.
  • Systematically investigated whether current LLMs can accurately generate cross-language interoperating code.
  • Revealed substantial performance gaps compared to single-language code generation benchmarks.
  • Demonstrated that CPL interoperability remains a significantly underexplored and challenging capability for modern LLMs.

Benchmark Construction Workflow

framework

Figure 2: Framework for CPL Interoperating Code Analysis, Extraction, Generation and Evaluation.

CrossPL is constructed using two LLM-driven workflows: the CrossPL-IPC workflow and the CrossPL-FFI workflow.


CrossPL-IPC Construction Workflow

⚠️ Note: The following prompt templates for Judger, Function Extractor, and Class Extractor are exemplified using Java. Prompt templates for other programming languages can be found in the prompt_template directory of the project.

🤖 FSMs for detecting CPL interfaces in MPL repositories: the 156 FSMs are used to identify CPL interoperating instances across 19,169 GitHub multi-programming-language (MPL) repositories and to record their metadata.
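As a rough sketch of how such a repository scan might pre-filter files before FSM checking, the snippet below matches per-language interface keywords. The pattern table is illustrative only; the actual detector is driven by the 156 FSMs, not a keyword list:

```python
import re
import tempfile
from pathlib import Path

# Illustrative per-language signatures of IPC interface usage (assumed,
# simplified; the real pipeline characterizes interfaces with FSMs).
IPC_PATTERNS = {
    ".py":   [r"\bsocket\.socket\b", r"\bgrpc\.", r"\brequests\.(get|post)\b"],
    ".java": [r"\bnew\s+Socket\b", r"\bHttpURLConnection\b"],
    ".go":   [r"\bnet\.Dial\b", r"\bhttp\.ListenAndServe\b"],
}

def find_candidates(repo_root):
    """Yield (file, matched pattern) pairs worth passing to the FSM stage."""
    for path in Path(repo_root).rglob("*"):
        patterns = IPC_PATTERNS.get(path.suffix)
        if not patterns or not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        for pat in patterns:
            if re.search(pat, text):
                yield path, pat

# Demo on a throwaway "repository" containing one Python socket client.
root = tempfile.mkdtemp()
Path(root, "client.py").write_text("s = socket.socket()\ns.connect(('h', 80))\n")
hits = list(find_candidates(root))
```

Only files surviving this cheap filter would need the more expensive FSM-based call-sequence analysis.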

The following figure illustrates an example of FSM-modeled CPL interoperating.

FSM-modeled CPL Interoperability

Figure 3: An example of FSM-modeled CPL interoperating.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to cae.py, Analyzer.py, LangApiAnalyzer.py, Extraction_and_Benchmark_Construction.py, Algorithm 1 and Algorithm 2 in our paper.


🤖 Judger: Determine whether a given code file contains any CPL interaction code snippets. If such a snippet is found and corresponds to a function-level implementation, return "Function-level"; if it corresponds to a class-level implementation, return "Class-level"; if no CPL interaction code is present, return "null". The prompt template used by this LLM tool is as follows:

judger

⚠️ Note: Additional implementation details can be found in Extraction_and_Benchmark_Construction.py.
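Because the Judger's reply is free-form model text, its three-way output has to be normalized before use. The helper below is a hypothetical sketch (the function names and prompt wording are not from the repository) showing one defensive way to do that:

```python
def build_judger_prompt(code: str) -> str:
    """Assemble the three-way classification request (a simplified
    paraphrase of the Java template in prompt_template/, not verbatim)."""
    return (
        "Decide whether the following file contains cross-programming-language "
        "(CPL) interaction code. Answer exactly one of: "
        '"Function-level", "Class-level", or "null".\n\n' + code
    )

def parse_judger_response(raw: str):
    """Normalize a free-form model reply to one of the three labels."""
    text = raw.strip().strip('"').lower()
    if "function-level" in text:
        return "Function-level"
    if "class-level" in text:
        return "Class-level"
    return None  # treat anything else, including "null", as no CPL code

label = parse_judger_response('  "Function-level"  ')
```

Routing "null" and unparseable replies to the same outcome keeps a chatty model from silently injecting junk snippets into the extraction stage.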


🤖 Function Extractor: Used for extracting "function-level" CPL interaction code snippets. Additional implementation details can be found in Extraction_and_Benchmark_Construction.py. The prompt template used by this LLM tool is as follows:

Func

⚠️ Note: Additional implementation details can be found in Extraction_and_Benchmark_Construction.py.


🤖 Class Extractor: Used for extracting "Class-level" CPL interaction code snippets. Additional implementation details can be found in Extraction_and_Benchmark_Construction.py. The prompt template used by this LLM tool is as follows:

Class

⚠️ Note: Additional implementation details can be found in Extraction_and_Benchmark_Construction.py.


🤖 FSM-based validator: The correctness of the interaction snippets extracted by LLMs is verified using FSMs corresponding to the specific CPL techniques.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to cae.py, Evaluation.py, Analyzer.py, LangApiAnalyzer.py, Extraction_and_Benchmark_Construction.py, Algorithm 1 and Algorithm 2.


🤖 Instructor: If the verification is successful, the interaction snippet extracted by the LLM is passed to the "Instructor" to generate the corresponding instruction. Additional implementation details can be found in Extraction_and_Benchmark_Construction.py. The prompt template used by the Instructor is as follows:

instruction

⚠️ Note: Additional implementation details can be found in Extraction_and_Benchmark_Construction.py.


🔍 Evaluation: The correctness of the interaction snippets generated by LLMs is verified using FSMs corresponding to the specific CPL techniques.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to tmp_test\testexample.py, Analyzer.py, LangApiAnalyzer.py, Algorithm 1 and Algorithm 2 in our paper.


CrossPL-FFI Construction Workflow

Algorithm 3 in the paper illustrates the construction of CrossPL-FFI for Python–C external function calls. The underlying C code is sourced from the GNU Scientific Library (GSL), a widely used and self-contained library of mathematical and statistical functions. The workflow begins by compiling the GSL library into shared object (.so) files using Autotools and Make, establishing the runtime environment. C source files are then cleaned and processed with an initial FFI prompt and an error-revision prompt. Each candidate solution is executed in an environment where the precompiled .so files are available for FFI calls; successful executions are saved as benchmark entries, while failures are iteratively refined via the LLM (powered by Deepseek-V3). Additionally, key information from the canonical solution, including class names, function names, and parameter names, is incorporated into the "Instruction" field of the benchmark. Finally, these benchmark entries are provided as tasks to the LLMs under evaluation, and the models' outputs are combined with automatically generated assertion test cases to verify correctness, enabling systematic execution and testing. This approach yields a scalable, reproducible, and controlled benchmark for assessing LLMs' ability to generate correct Python–C FFI code. Figs. 4–7 provide the detailed prompt information.
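The shape of a Python–C FFI call against a precompiled .so — loading the library, declaring the C signature, then calling — can be sketched with ctypes. The C math library stands in for GSL here purely so the example runs without GSL installed; the library name and fallback path are assumptions for illustration:

```python
import ctypes
import ctypes.util
import math

# Locate and load a shared C library. The benchmark uses precompiled GSL
# .so files; libm is used here only so this sketch runs without GSL.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declaring the C signature is the step FFI tasks most often get wrong:
# without restype, ctypes assumes an int return and truncates the double.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

result = libm.cos(0.0)
assert math.isclose(result, 1.0)
```

Errors in exactly these steps — a missing symbol, a wrong argtypes list, or an undeclared restype — correspond to the symbol-resolution and calling failures reported in the Key Findings below.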


4

Figure 4: Prompt template for constructing CrossPL-FFI.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to FFI_Consruction.py, execute_solution.py and Algorithm 3 in our paper.


1

Figure 5: Prompt template with error information for constructing CrossPL-FFI.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to FFI_Consruction.py, execute_solution.py and Algorithm 3 in our paper.


2

Figure 6: Add class information to the Instruction.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to FFI_Consruction.py, Add_Info.py and Algorithm 3 in our paper.


3

Figure 7: Add class information to the Instruction.

⚠️ Note: A more comprehensive understanding of the implementation details can be obtained by referring to FFI_Consruction.py, Add_Info.py and Algorithm 3 in our paper.


Statistics of CrossPL

In constructing the CrossPL-IPC benchmark, we conducted a thorough search and review of official documentation related to IPC technologies in CPL projects, using keywords such as gRPC, Pipe, message queue, TCP, UDP, WebSocket, and HTTP across different programming languages. Fig.8(a) summarizes the distribution of CrossPL-IPC from different perspectives. Overall, it covers six programming languages and seven IPC technologies, comprising a total of 1,982 tasks. Among the programming languages, Java accounts for the highest proportion of IPC-related tasks with 615 instances (31.03%), whereas C++ accounts for the fewest, with 51 tasks (2.57%). Among IPC technologies, HTTP accounts for the highest proportion of tasks with 779 (39.30%), while UDP accounts for the lowest, with 92 tasks (4.64%).

Fig.8(b) further details the distribution of IPC technologies used by different programming languages within CrossPL-IPC. Due to the distinct characteristics of each language, the proportional use of IPC technologies varies considerably; for example, Java tasks are predominantly associated with TCP, while JavaScript tasks are mostly related to HTTP. Notably, some IPC techniques cannot be fully implemented within a single function and require implementation at the class level. Consequently, during IPC-related code extraction, we distinguished between class-level and function-level code. Overall, class-level code accounts for 59.99% of CrossPL-IPC, while function-level code constitutes 40.01%. This distribution reflects language-specific design patterns: for example, class-level implementations dominate in Java and PHP due to their object-oriented paradigms, while languages like Go and JavaScript tend to favor function-level or lightweight constructs. Such variations emphasize the necessity of handling both granularities in CrossPL-IPC to faithfully capture real-world IPC usage across different programming ecosystems.

Language Distribution

(a) Pie chart of CrossPL-IPC dataset from different view.

Task Distribution

(b) Distribution of different IPC technologies across different programming languages.


Figure 8: Distribution of CrossPL benchmark.


Fig.9 presents the distribution of code line counts in the canonical solutions across CrossPL. Fig.9(a) shows the distribution for CrossPL-IPC and Fig.9(b) the distribution for CrossPL-FFI. Specifically, the canonical solutions in CrossPL-IPC have a median of 51 lines, an average of 66.96 lines, and a maximum of 483 lines. In contrast, CrossPL-FFI shows a more compact distribution, with a median of 44 lines, an average of 47.1 lines, a maximum of 112 lines, and a minimum of 8 lines.

x

Figure 9: Distribution of Canonical solution Code Lines of CrossPL-IPC and CrossPL-FFI.


Fig.10 illustrates the distribution of instruction lengths, measured in characters, across CrossPL. Fig.10(a) shows the distribution for CrossPL-IPC and Fig.10(b) the distribution for CrossPL-FFI. For CrossPL-IPC, instructions have a median length of 1,204 characters, an average of 1,284.64 characters, and a minimum of 473 characters. In comparison, instructions in CrossPL-FFI are slightly longer, with a median of 1,277.5 characters, an average of 1,329.74 characters, a maximum of 2,421 characters, and a minimum of 973 characters.

y

Figure 10: Distribution of instruction length (characters) of CrossPL-IPC and CrossPL-FFI.

⚠️ Note: The benchmark is stored in the PolyBench/IPC_Bench and PolyBench/FFI_Bench directories.


Key Findings

  • LLMs generate IPC code with varying effectiveness, performing better with C++ and gRPC and worse with Go and low-level protocols like Pipe. Failures for models like GPT-4o often stem from protocol and data transmission setup, while Llama3-8b-instruct frequently fails earlier at library configuration.
  • LLMs struggle with FFI-based CPL code generation (GPT-4o: 19.54% Pass@1; Llama3-8b-instruct: <1%), with failures dominated by symbol resolution, runtime, calling, and memory errors.
  • Model characteristics such as think mode improve performance in reasoning-intensive FFI tasks but have limited or even negative impact on IPC tasks, which rely on well-structured communication patterns.
