This project is a lab for solving LeetCode SQL problems using PySpark. It provides an environment to practice and experiment with SQL queries and data transformations in PySpark, simulating LeetCode-style problems.
PySpark requires Java (JDK) to run because it relies on the Java Virtual Machine (JVM). So, OpenJDK needs to be installed globally:
- Via Homebrew, run in the terminal:

      brew install openjdk@17
      brew link --force --overwrite openjdk@17

  ℹ️ This downloads and installs OpenJDK 17, including the JDK and JVM, into Homebrew's managed directory (usually /opt/homebrew/opt/openjdk@17 on Apple Silicon or /usr/local/opt/openjdk@17 on Intel Macs).

  ⚠️ IMPORTANT: PySpark and Hadoop internally rely on some deprecated or restricted Java APIs. Java 21+ (the default in Homebrew) removes or restricts those APIs, causing runtime errors like `UnsupportedOperationException: getSubject is not supported`. Java 17 is the latest long-term support (LTS) version that remains fully compatible with PySpark 3.x and Hadoop. Homebrew installs openjdk@17 as keg-only, i.e. it is not linked globally, to avoid conflicts with other Java versions; that is why `brew link` needs to be run.
- After Java is installed, it needs to be made visible to the shell and to PySpark. To make Java accessible to PySpark and other programs, add the following to the shell configuration file:
      export JAVA_HOME="$(brew --prefix openjdk@17)"
      export PATH="$JAVA_HOME/bin:$PATH"

  ℹ️ This sets the JAVA_HOME environment variable to point to where Java is installed.
- Restart the terminal or source the shell configuration file:

      source ~/<config-file>

  `<config-file>` is `.bash_profile`/`.bashrc` or `.zprofile`/`.zshrc`, depending on the default shell.
  ⚠️ IMPORTANT: If the project is run within an IDE and the IDE is launched from the GUI (e.g., from the Applications folder or Dock on macOS), aliases and environment variables defined in shell configuration files (like .zshrc or .bashrc) might not be initialized. This happens because GUI-launched applications do not load the full shell environment by default. Thus, the IDE should be launched from the Terminal (e.g. `code .` or `code <project-directory>`). A quick Python check for this situation is sketched right after these steps.
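To confirm that Java is actually visible to the Python process (which is exactly what goes missing when an IDE is launched from the GUI), a quick check like the following can be run before creating a Spark session. This is only a sketch: the fallback path is an example and has to match wherever Homebrew actually placed openjdk@17 on the machine.

```python
import os
import shutil
import subprocess

# Show what the Python process currently knows about Java.
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
print("java on PATH:", shutil.which("java"))

if os.environ.get("JAVA_HOME") is None and shutil.which("java") is None:
    # Fallback for GUI-launched IDEs that did not inherit the shell configuration.
    # NOTE: example path only -- adjust it to the actual Homebrew prefix on this machine.
    os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@17"
    os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# Print the Java version that PySpark will end up using.
if shutil.which("java"):
    subprocess.run(["java", "-version"], check=False)
else:
    print("java executable not found - revisit the JAVA_HOME/PATH setup above")
```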
Install Python packages (including PySpark) either globally or in a virtual environment:
- Using a virtual environment:

      python3 -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt
- Or install globally:

      pip install -r requirements.txt

  Both options install the packages listed in requirements.txt; an example of what that file might contain is sketched below.
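For reference, if requirements.txt does not exist yet, a minimal file for this lab could contain little more than PySpark itself. The version range below is an assumption (any PySpark 3.x release that works with Java 17 should do), not a pin taken from the project:

```
pyspark>=3.3,<4.0
```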
Once everything is set up, try running the following code to verify your PySpark installation:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Solving-Leetcode-SQL-problems-with-PySpark").getOrCreate()
df = spark.range(5)
df.show()
If you see a table with numbers 0 to 4, the environment is ready!
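As an illustration of how the lab is meant to be used, here is a sketch of solving a LeetCode-style SQL problem both with spark.sql and with the DataFrame API. The table, column names, and sample data are hypothetical, modelled on the classic "employees earning more than their managers" problem, and only stand in for whatever problem is being practiced:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("Solving-Leetcode-SQL-problems-with-PySpark").getOrCreate()

# Hypothetical input table; the schema and rows are made up for this sketch.
employee = spark.createDataFrame(
    [(1, "Joe", 70000, 3), (2, "Henry", 80000, 4), (3, "Sam", 60000, None), (4, "Max", 90000, None)],
    ["id", "name", "salary", "managerId"],
)

# Approach 1: register a temp view and solve the problem with plain SQL, as on LeetCode.
employee.createOrReplaceTempView("Employee")
spark.sql("""
    SELECT e.name AS Employee
    FROM Employee e
    JOIN Employee m ON e.managerId = m.id
    WHERE e.salary > m.salary
""").show()

# Approach 2: the same query expressed with the DataFrame API (a self-join via aliases).
e = employee.alias("e")
m = employee.alias("m")
(
    e.join(m, F.col("e.managerId") == F.col("m.id"))
     .where(F.col("e.salary") > F.col("m.salary"))
     .select(F.col("e.name").alias("Employee"))
     .show()
)
```

With this sample data, both approaches should print a single row containing Joe, since only Joe earns more than his manager.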