# Scraping GitHub for PySpark Usage

This notebook provides a framework for cloning repos and searching them for the PySpark patterns defined in `./pyspark-rules.yml`. These patterns cover PySpark DataFrame expressions, PySpark UDF definitions, and usage of imported libraries inside function definitions.

## Using the GitHub API

In [1]:
from dotenv import load_dotenv
from github import Github, Auth
import os

load_dotenv()
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")

if not GITHUB_TOKEN:
    raise ValueError("Specify GITHUB_TOKEN in .env file.")

g = Github(auth=Auth.Token(GITHUB_TOKEN))

## Currently looking at most popular repos mentioning PySpark

Feel free to change the list of repos used for pattern searching. The most popular repos include many tutorials and style guides, which may not be representative of real PySpark workloads.
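
One way to reduce the share of tutorial-style repos is to filter the candidate list by keyword before cloning. The sketch below is only an illustration: the keyword list, the helper names, and the sample repos are hypothetical, and it assumes each repo is a dict with a `name` and an optional `description`.

```python
# Hypothetical keyword list; tune it for your own search.
TUTORIAL_KEYWORDS = ("tutorial", "example", "cheatsheet", "style-guide", "course", "learning")


def looks_like_tutorial(repo):
    """Return True if the repo name or description contains a tutorial-like keyword."""
    text = f"{repo.get('name', '')} {repo.get('description') or ''}".lower()
    return any(keyword in text for keyword in TUTORIAL_KEYWORDS)


def drop_tutorials(repos):
    """Keep only repos that do not look like tutorials or style guides."""
    return [repo for repo in repos if not looks_like_tutorial(repo)]


# Made-up sample data, not real search results.
sample = [
    {"name": "acme/pyspark-tutorial", "description": "A beginner walkthrough"},
    {"name": "acme/etl-jobs", "description": None},
    {"name": "acme/spark-utils", "description": "Helper functions for production pipelines"},
]
filtered = drop_tutorials(sample)
print([repo["name"] for repo in filtered])
```

A plain substring match is crude and will produce false positives, so treat it as a first pass rather than a reliable classifier.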

In [2]:
query = "pyspark in:name,description"
sort = "stars"
order = "desc"
limit = 100

repos = g.search_repositories(query=query, sort=sort, order=order)

# popular_repos = []
# for i, repo in enumerate(repos):
#     if i >= limit:
#         break

#     popular_repos.append({
#         "name": repo.full_name,
#         "url": repo.clone_url,
#         "stars": repo.stargazers_count,
#         "description": repo.description
#     })

# for repo in popular_repos:
#     print(repo)

In [3]:
popular_repos = map(
    lambda r: {"name": r.full_name, "url": r.clone_url}, repos
)

good_repo_count = 0
limit = 200

## Annotating Argument Types

Unfortunately, publicly available code does not always follow best practices, and not all of it is type-annotated. PySpark's `udf` takes a declared return type (defaulting to `StringType` when omitted), but the input types must still be determined without knowledge of any schema. We rely on Semgrep to extract as much information as possible and use an LLM to infer the remaining parameter types.
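
Before calling a model, it can help to check which parameters actually lack annotations. The following is a minimal sketch using the standard-library `ast` module; the helper name `unannotated_params` and the sample function are hypothetical and not part of this notebook's pipeline.

```python
import ast


def unannotated_params(func_source):
    """Return the names of parameters that have no type annotation."""
    tree = ast.parse(func_source)
    # Take the first function definition found in the source.
    func = next(
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    params = func.args.posonlyargs + func.args.args + func.args.kwonlyargs
    return [param.arg for param in params if param.annotation is None]


source = "def add_tax(price: float, rate, region='EU'):\n    return price * (1 + rate)\n"
missing = unannotated_params(source)
print(missing)
```

If the returned list is empty, the function is already fully annotated and the model call could be skipped.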

In [5]:
from openai import OpenAI
import json

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

def annotate_input_types(func_string, model="gpt-3.5-turbo"):
    client = OpenAI(api_key=OPENAI_API_KEY)
    prompt = '''
    Annotate the parameter types for the following Python function.
    {func_string}
    Format output as a JSON object where keys are parameter names and values are types.
    '''.strip().format(func_string=func_string)

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    # Extract response content and return as dictionary with error handling
    content = response.choices[0].message.content
    try:
        annotations = json.loads(content)
        return annotations
    except json.JSONDecodeError:
        print("Failed to parse JSON from model response.")
        return {}
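
Chat models may wrap a JSON reply in a Markdown code fence or surround it with prose, in which case `json.loads` on the raw content fails. A more tolerant parser could look like the sketch below; `parse_json_reply` is a hypothetical helper, not something the function above uses.

```python
import json
import re


def parse_json_reply(content):
    """Best-effort extraction of a JSON object from a model reply; returns {} on failure."""
    if not content:
        return {}
    # Take the span from the first "{" to the last "}", skipping any fence or prose.
    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
fenced_reply = f'{fence}json\n{{"x": "int", "y": "str"}}\n{fence}'
print(parse_json_reply(fenced_reply))
```

This still returns an empty dictionary for replies containing several separate JSON objects, since the span from the first to the last brace is then not valid JSON.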

## Using Semgrep to search for PySpark patterns

For each repo, the following cell will:
1. Clone repo into a temp directory
2. Convert any notebooks (`.ipynb`) into python files (`.py`) using `nbconvert`
3. Capture the output of `semgrep scan` in a JSON object using the rules specified in `./pyspark-rules.yml`
4. Process matches for DataFrame expressions and UDF definitions
5. Process matches for imported library usage in functions that are tagged as UDFs
6. Store processed result as JSON (see `./README.md` for schema)
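
Step 4 relies on each UDF rule packing its captured metavariables into the Semgrep message, separated by a delimiter. The sketch below isolates that parsing step. It mirrors the field order and the `$ALIAS` / `$...OUTPUT` placeholders used in the cell that follows, but the helper name and the sample message are hypothetical.

```python
DELIM = "%__%"


def parse_udf_message(message):
    """Split a delimited rule message into name, alias, args, body, and output."""
    name, alias, args, body, output = message.split(DELIM, 4)
    return {
        "name": name,
        # An unbound metavariable is left as its literal placeholder text.
        "alias": name if alias == "$ALIAS" else alias,
        "args": [arg.strip() for arg in args.split(",") if arg.strip()],
        "body": body,
        "output": "StringType()" if output == "$...OUTPUT" else output.split(","),
    }


# Made-up message in the assumed format.
message = DELIM.join(["to_upper", "$ALIAS", "s", "return s.upper()", "$...OUTPUT"])
print(parse_udf_message(message))
```

Limiting the split to four delimiters keeps any delimiter-like text in the last field intact.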

All results are stored in `./results/summary.jsonl`
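
Once the scan has run, the JSONL file can be read back one record per line. The sketch below counts UDFs and DataFrame expressions per repo. It assumes only the keys written by the cell below (`repo_name`, `files`, `udfs`, `df_exprs`); the `summarize` helper and the sample record are hypothetical.

```python
import json


def summarize(jsonl_lines):
    """Count UDFs and DataFrame expressions per repo from summary.jsonl lines."""
    totals = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        files = record.get("files", [])
        # A later record for the same repo overwrites an earlier one.
        totals[record["repo_name"]] = {
            "udfs": sum(len(f.get("udfs", [])) for f in files),
            "df_exprs": sum(len(f.get("df_exprs", [])) for f in files),
        }
    return totals


# Made-up record for illustration; with real data, pass an open file handle instead.
sample_lines = [
    json.dumps({
        "repo_name": "acme/etl-jobs",
        "clone_url": "https://example.invalid/acme/etl-jobs.git",
        "files": [{
            "path": "jobs/clean.py",
            "udfs": [{"name": "to_upper"}],
            "df_exprs": ["df.select('a')", "df.filter(df.a > 1)"],
        }],
    }),
    "",
]
totals = summarize(sample_lines)
print(totals)
```

Because the results file is opened in append mode, rerunning the notebook adds duplicate records for the same repo; keying the totals by `repo_name` keeps only the most recent one.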

In [6]:
import git
import json
from nbconvert import PythonExporter
import nbformat
import tempfile
import subprocess

DELIM = "%__%"

# Create results jsonl
os.makedirs("results", exist_ok=True)
summary_path = os.path.join("results", "summary.jsonl")

with open(summary_path, "a", encoding="utf-8") as summary_file:
    for repo in popular_repos:
        if good_repo_count >= limit:
            break

        repo_name = repo["name"]
        clone_url = repo["url"]

        with tempfile.TemporaryDirectory() as tmpdir:
            # clone repo into temporary directory
            repo_dir = os.path.join(tmpdir, "repo")
            try:
                git.Repo.clone_from(clone_url, repo_dir, depth=1)
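            except Exception as err:
                # skip repos that cannot be cloned, without swallowing KeyboardInterrupt
                print(f"Skipping {repo_name}: clone failed ({err})")
                continue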

            # find .ipynb files and convert into .py files
            for root, _, files in os.walk(repo_dir):
                for file in files:
                    if file.endswith(".ipynb"):
                        ipynb_path = os.path.join(root, file)
                        py_path = os.path.join(root, "CONVERTED"+file.replace(".ipynb", ".py"))

                        with open(ipynb_path, "r", encoding="utf-8") as ipynbf:
                            try:
                                nb_node = nbformat.read(ipynbf, as_version=4)
                                exporter = PythonExporter()

                                python_code, _ = exporter.from_notebook_node(nb_node)
                                with open(py_path, "w", encoding="utf-8") as pyf:
                                    pyf.write(python_code)
                            except Exception:
                                # skip notebooks that cannot be read or converted
                                continue


            # use semgrep to detect udf definitions
            semgrep_result = subprocess.run(
                ["semgrep", "scan", "--config", "pyspark-rules.yml", repo_dir, "--json"],
                capture_output=True,
                encoding="utf-8",  # avoids byte-decoding issues on Windows
                text=True,
                check=False
            )

            # parse pyspark dataframe expressions and track udf usage
            try:
                data = json.loads(semgrep_result.stdout)
                matches = data.get("results", [])
                print(f"Found {len(matches)} potential matches in {repo_name}\n")

                repo_results = {
                    "repo_name" : repo_name,
                    "clone_url" : clone_url,
                }
                
                file_dic = {}
                for match in matches:
                    if match["check_id"] == "library-usage":
                        continue # process dataframe expressions & udf definitions first

                    file_path = match["path"]
                    rel_path = os.path.relpath(file_path, repo_dir)
                    if rel_path not in file_dic:
                        file_dic[rel_path] = {
                            "udfs": {},
                            "df_exprs": []
                        }
                    
                    if match["check_id"] == "pyspark-udf-definition":
                        msg_fields = match["extra"]["message"].split(DELIM, 4)
                        func_name = msg_fields[0]
                        func_alias = msg_fields[1] if msg_fields[1] != "$ALIAS" else func_name
                        # strip whitespace so names match the keys returned by the LLM
                        func_args = [a.strip() for a in msg_fields[2].split(",") if a.strip()]
                        func_body = msg_fields[3]
                        func_output = msg_fields[4].split(",") if msg_fields[4] != "$...OUTPUT" else "StringType()"

                        # use LLM to annotate input types
                        func_string = f"def {func_name}({', '.join(func_args)}):\n{func_body}"
                        annotations = annotate_input_types(func_string)
                        func_args = [
                            arg if ":" in arg else f"{arg}: {annotations.get(arg, 'Any')}"
                            for arg in func_args
                        ]

                        file_dic[rel_path]["udfs"][func_name] = {
                            "alias": func_alias,
                            "body": func_body,
                            "args": func_args,
                            "output": func_output,
                            "calls": []
                        }

                    elif match["check_id"] == "pyspark-df-expression":
                        start_offset = match["start"]["offset"]
                        end_offset = match["end"]["offset"]
                        with open(file_path, "r", encoding="utf-8") as f:
                            content = f.read()
                            snippet = content[start_offset:end_offset]
                            file_dic[rel_path]["df_exprs"].append(snippet)
                
                # if any df expressions or udfs were found, count the repo as a good match
                if file_dic:
                    good_repo_count += 1

                # parse library usages now
                for match in matches:
                    if match["check_id"] != "library-usage":
                        continue # only process library calls once udf's have been tagged

                    msg_fields = match["extra"]["message"].split(DELIM, 2)
                    func_name = msg_fields[0]
                    library_name = msg_fields[1]
                    call_name = msg_fields[2]

                    file_path = match["path"]
                    rel_path = os.path.relpath(file_path, repo_dir)
                    if rel_path in file_dic:
                        if func_name in file_dic[rel_path]["udfs"]:
                            file_dic[rel_path]["udfs"][func_name]["calls"].append({
                                "library": library_name,
                                "call": call_name
                            })
                
                for file_data in file_dic.values():
                    file_data["udfs"] = [
                        {"name": k, **v} for k,v in file_data["udfs"].items()
                    ]

                repo_results["files"] = [{"path": k, **v} for k,v in file_dic.items()]

                # write to results jsonl
                summary_file.write(json.dumps(repo_results, ensure_ascii=False) + "\n")
                summary_file.flush()


            except json.JSONDecodeError:
                print("Semgrep output not valid JSON.")
                print(semgrep_result.stdout[:500])



Found 11 potential matches in AlexIoannides/pyspark-example-project

Found 325 potential matches in uber/petastorm

Found 32 potential matches in jadianes/spark-py-notebooks

Found 0 potential matches in ptyadana/SQL-Data-Analysis-and-Visualization-Projects

Found 1964 potential matches in hi-primus/optimus

Found 209 potential matches in spark-examples/pyspark-examples

Found 0 potential matches in mahmoudparsian/pyspark-tutorial

Found 7 potential matches in palantir/pyspark-style-guide

Found 24 potential matches in kavgan/nlp-in-practice

Found 159 potential matches in lensacom/sparkit-learn

Found 67 potential matches in pyspark-ai/pyspark-ai

Found 0 potential matches in lyhue1991/eat_pyspark_in_10_days

Found 0 potential matches in WeBankFinTech/Scriptis

Found 49 potential matches in MrPowers/chispa

Found 135 potential matches in mrpowers-io/quinn

Found 0 potential matches in kevinschaich/pyspark-cheatsheet

Found 33 potential matches in drabastomek/learningPySpark

Found 14 

## Type Annotation
Unfortunately, not all Python code is annotated, so we must find another way to determine the type signatures of the UDFs. For the scope of this project, an LLM is probably the most practical solution.

In [None]:
from dotenv import load_dotenv

load_dotenv()
