lc-sql-pyspark

This project is a lab for solving LeetCode SQL (Database) problems with PySpark. It provides a local environment for practicing SQL queries and DataFrame transformations against LeetCode-style tables and schemas.

Getting Started

Environment Setup

Java (OpenJDK):

PySpark requires a Java Development Kit (JDK) because Spark runs on the Java Virtual Machine (JVM), so OpenJDK needs to be installed globally:

  • Via Homebrew, run in the terminal:

    brew install openjdk@17
    brew link --force --overwrite openjdk@17

    ℹ️ This command downloads and installs OpenJDK 17, including the JDK and JVM, into Homebrew's managed directory (usually /opt/homebrew/opt/openjdk@17 on Apple Silicon or /usr/local/opt/openjdk@17 on Intel Macs)

    ⚠️ IMPORTANT: PySpark and Hadoop internally rely on some deprecated or restricted Java APIs. Java 21+ (the default openjdk in Homebrew) removes or restricts those APIs, causing runtime errors like UnsupportedOperationException: getSubject is not supported. Java 17 is the latest long-term support (LTS) version that remains fully compatible with PySpark 3.x and Hadoop. Homebrew installs openjdk@17 as keg-only, meaning it is not linked globally, to avoid conflicts with other Java versions; that is why brew link has to be run.

    • Alternatively, another version can be installed; to see what is available, check:

      brew search openjdk
    • Or download and install a vendor distribution: Temurin from Adoptium, Oracle OpenJDK from Oracle, Corretto from Amazon, or others.

  • After Java is installed, it needs to be made visible to the shell and to PySpark. To make Java accessible to PySpark and other programs, add to the shell configuration file:

    export JAVA_HOME="$(brew --prefix openjdk@17)"
    export PATH="$JAVA_HOME/bin:$PATH"
    

    ℹ️ This sets the JAVA_HOME environment variable to the Homebrew Java installation and puts its bin directory on the PATH

    • Restart the terminal or source the shell configuration file:

      source ~/<config-file>

      <config-file> is .bash_profile/.bashrc or .zprofile/.zshrc, depending on the default shell

    ⚠️ IMPORTANT: If the project is run within an IDE and the IDE is launched from the GUI (e.g., from the Applications folder or the Dock on macOS), aliases and environment variables defined in shell configuration files (like .zshrc or .bashrc) might not be initialized. GUI-launched applications do not load the full shell environment by default, so launch the IDE from the terminal instead (e.g., code . or code <project-directory>). A quick way to check what the current process actually sees is sketched below.
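
To confirm that Java is actually visible to the Python process (for example, inside a GUI-launched IDE), here is a minimal check that uses only the Python standard library:

    import os
    import shutil

    # JAVA_HOME should point at the OpenJDK prefix; None means this process
    # did not load the shell configuration
    print("JAVA_HOME:", os.environ.get("JAVA_HOME"))

    # `java` should resolve to $JAVA_HOME/bin/java once PATH is set
    print("java on PATH:", shutil.which("java"))

If either line prints None, re-run the export commands above or restart the terminal session before starting PySpark.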

Python dependencies:

Install Python packages (including PySpark) either globally or in a virtual environment:

  • Using a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  • Or install globally:

    pip install -r requirements.txt
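
Either way, a quick sanity check that PySpark is importable from the active interpreter (the printed version depends on what requirements.txt pins):

    import pyspark

    # Prints the installed PySpark version
    print(pyspark.__version__)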

Test The Setup

Once everything is set up, try running the following code to verify your PySpark installation:

    from pyspark.sql import SparkSession

    # Start a local SparkSession using all available CPU cores
    spark = SparkSession.builder.master("local[*]").appName("Solving-Leetcode-SQL-problems-with-PySpark").getOrCreate()

    # A single-column DataFrame with ids 0-4
    df = spark.range(5)
    df.show()

If you see a table with numbers 0 to 4, the environment is ready!
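
From there, a typical workflow is to recreate a problem's table with an explicit schema, register it as a temporary view, and query it with spark.sql. Below is a minimal sketch; the Employee table, its columns, and its rows are hypothetical and only illustrate the pattern:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.master("local[*]").appName("lc-sql-pyspark").getOrCreate()

    # Hypothetical LeetCode-style schema and sample rows
    schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("name", StringType(), True),
        StructField("salary", IntegerType(), True),
    ])
    rows = [(1, "Alice", 90000), (2, "Bob", 86000), (3, "Carol", 90000)]

    employee = spark.createDataFrame(rows, schema)
    employee.createOrReplaceTempView("Employee")

    # Solve with plain SQL, exactly as on LeetCode...
    spark.sql("SELECT MAX(salary) AS highest FROM Employee").show()

    # ...or with the equivalent DataFrame transformation
    employee.agg(F.max("salary").alias("highest")).show()

Registering each table as a temporary view keeps both options open: a SQL solution can be pasted almost verbatim, and the equivalent DataFrame version can be developed alongside it.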
