lc-sql-pyspark

This project is a lab for solving LeetCode SQL (Database) problems with PySpark. It provides a local environment for practicing SQL queries and DataFrame transformations against LeetCode-style tables and schemas.

Getting Started

Environment Setup

Java (OpenJDK):

PySpark requires a Java Development Kit (JDK) because Spark runs on the Java Virtual Machine (JVM), so OpenJDK needs to be installed globally:

  • Via Homebrew, run in the terminal:

    brew install openjdk@17
    brew link --force --overwrite openjdk@17

    ℹ️ This command downloads and installs OpenJDK 17, including the JDK and JVM, into Homebrew's managed directory (usually /opt/homebrew/opt/openjdk@17 on Apple Silicon or /usr/local/opt/openjdk@17 on Intel Macs)

    ⚠️ IMPORTANT: PySpark and Hadoop internally rely on some deprecated or restricted Java APIs. Java 21+ (the default openjdk in Homebrew) removes or restricts those APIs, causing runtime errors like UnsupportedOperationException: getSubject is not supported. Java 17 is the latest long-term support (LTS) version that remains fully compatible with PySpark 3.x and Hadoop. Homebrew installs openjdk@17 as keg-only, meaning it is not linked globally, to avoid conflicts with other Java versions; that is why brew link has to be run.

    • Alternatively, another version can be installed; to see what is available, check:

      brew search openjdk
    • Or download and install a vendor distribution: Temurin from Adoptium, Oracle OpenJDK from Oracle, Corretto from Amazon, or others.

  • After Java is installed, it needs to be made visible to the shell and to PySpark. To make Java accessible to PySpark and other programs, add to the shell configuration file:

    export JAVA_HOME="$(brew --prefix openjdk@17)"
    export PATH="$JAVA_HOME/bin:$PATH"
    

    ℹ️ This sets the JAVA_HOME environment variable to the Homebrew Java installation and puts its bin directory on the PATH

    • Restart the terminal or source the shell configuration file:

      source ~/<config-file>

      <config-file> is .bash_profile/.bashrc or .zprofile/.zshrc, depending on the default shell

    ⚠️ IMPORTANT: If the project is run within an IDE and the IDE is launched from the GUI (e.g., from the Applications folder or the Dock on macOS), aliases and environment variables defined in shell configuration files (like .zshrc or .bashrc) might not be initialized. GUI-launched applications do not load the full shell environment by default, so launch the IDE from the terminal instead (e.g., code . or code <project-directory>). A quick way to check what the current process actually sees is sketched below.
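
To confirm that Java is actually visible to the Python process (for example, inside a GUI-launched IDE), here is a minimal check that uses only the Python standard library:

    import os
    import shutil

    # JAVA_HOME should point at the OpenJDK prefix; None means this process
    # did not load the shell configuration
    print("JAVA_HOME:", os.environ.get("JAVA_HOME"))

    # `java` should resolve to $JAVA_HOME/bin/java once PATH is set
    print("java on PATH:", shutil.which("java"))

If either line prints None, re-run the export commands above or restart the terminal session before starting PySpark.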

Python dependencies:

Install Python packages (including PySpark) either globally or in a virtual environment:

  • Using a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  • Or install globally:

    pip install -r requirements.txt
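
Either way, a quick sanity check that PySpark is importable from the active interpreter (the printed version depends on what requirements.txt pins):

    import pyspark

    # Prints the installed PySpark version
    print(pyspark.__version__)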

Test The Setup

Once everything is set up, try running the following code to verify your PySpark installation:

    from pyspark.sql import SparkSession

    # Start a local SparkSession using all available CPU cores
    spark = SparkSession.builder.master("local[*]").appName("Solving-Leetcode-SQL-problems-with-PySpark").getOrCreate()

    # A single-column DataFrame with ids 0-4
    df = spark.range(5)
    df.show()

If you see a table with numbers 0 to 4, the environment is ready!
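
From there, a typical workflow is to recreate a problem's table with an explicit schema, register it as a temporary view, and query it with spark.sql. Below is a minimal sketch; the Employee table, its columns, and its rows are hypothetical and only illustrate the pattern:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.master("local[*]").appName("lc-sql-pyspark").getOrCreate()

    # Hypothetical LeetCode-style schema and sample rows
    schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("name", StringType(), True),
        StructField("salary", IntegerType(), True),
    ])
    rows = [(1, "Alice", 90000), (2, "Bob", 86000), (3, "Carol", 90000)]

    employee = spark.createDataFrame(rows, schema)
    employee.createOrReplaceTempView("Employee")

    # Solve with plain SQL, exactly as on LeetCode...
    spark.sql("SELECT MAX(salary) AS highest FROM Employee").show()

    # ...or with the equivalent DataFrame transformation
    employee.agg(F.max("salary").alias("highest")).show()

Registering each table as a temporary view keeps both options open: a SQL solution can be pasted almost verbatim, and the equivalent DataFrame version can be developed alongside it.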
