This notebook screens whether PySpark can run on a SageMaker notebook instance. It is **NOT** intended to screen a PySpark processing job. It is designed to run in one go, with no kernel restart partway through, so it runs only a short PySpark operation.

Steps:

- **Pre-requisite**: make sure to choose kernel `conda_python3`
- **Action**: click *Kernel* -> *Restart Kernel and Run All Cells...*
- **Expected outcome**: no exception is raised.

# Setup

Before you run the next cell, open `smconfig.py`, review and update the `s3_bucket` variable, then disable the `NotImplementedException` on the last line.
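
For orientation, here is a hypothetical sketch of what `smconfig.py` might look like after editing. The real file may differ. The bucket name is a placeholder, and the guard on the last line is written with Python's built-in `NotImplementedError` as a stand-in, since the exact exception the file raises is not shown here. The only things taken from this notebook are the variable name `s3_bucket` and the way it is used later (an `s3://` prefix, with `/screening/...` appended to it).

```python
# smconfig.py (hypothetical sketch; the real file may differ)

# Assumption: the value includes the "s3://" scheme and has no trailing slash,
# because the notebook later appends "/screening/..." to it.
s3_bucket = "s3://my-example-bucket"  # placeholder bucket name

# Assumption: the guard on the last line looks roughly like the line below.
# Commenting it out (or deleting it) is what "disable" means here.
# raise NotImplementedError("Please review smconfig.py, then disable this line.")
```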

In [None]:
%load_ext autoreload
%autoreload 2

import os
import pandas as pd
import sagemaker_pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import smconfig

# Configuration of this screening test
testfile = 'testfile.snappy.parquet'
s3_path = f'{smconfig.s3_bucket}/screening/pyspark-on-smnb/{testfile}'
# The Spark reads below use the s3a:// scheme. This assumes s3_path starts with 's3://'.
s3a_path = 's3a' + s3_path[2:]

# Propagate to env vars of the whole notebook, for usage by ! or %%.
%set_env S3_PATH=$s3_path
%set_env TESTFILE=$testfile
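
The `s3a` path in the cell above is built by string slicing, which silently produces a wrong path if `s3_bucket` does not start with `s3://`. A minimal sketch of a stricter conversion is shown below. The helper name `to_s3a` and the bucket name are hypothetical and not part of this notebook.

```python
def to_s3a(s3_uri: str) -> str:
    """Return the s3a:// form of an s3:// URI, or raise ValueError."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {s3_uri!r}")
    # Keep everything after the scheme unchanged.
    return "s3a://" + s3_uri[len(prefix):]


# Example with a placeholder bucket name.
print(to_s3a("s3://my-example-bucket/screening/pyspark-on-smnb/testfile.snappy.parquet"))
```

With this helper, a misconfigured `s3_bucket` fails early with a clear message instead of failing later inside a Spark read.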

# PySpark on this SageMaker notebook instance

In [None]:
# Since sometime in 2021, JAVA_HOME may point to the JDK 11 in JupyterSystemEnv.
# Unset it so that PySpark uses the JDK 1.8 on the system path instead.
if os.environ.get('JAVA_HOME', '') == '/home/ec2-user/anaconda3/envs/JupyterSystemEnv':
    del os.environ['JAVA_HOME']

# Prepare a small test data file and upload it to S3
df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]})
df.to_parquet(f'/tmp/{testfile}', compression='snappy')
!aws s3 cp /tmp/$TESTFILE $S3_PATH --storage-class ONEZONE_IA

# Set up a PySpark session with a local master.
# The sagemaker_pyspark jars are put on the driver classpath.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .master("local[*]").getOrCreate()
)

In [None]:
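# Read the test file from local disk and from S3.
# load() uses the default data source, which is Parquet unless configured otherwise.
display(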
    spark.read.load(f'/tmp/{testfile}'),
    spark.read.load(s3a_path),
    spark.read.parquet(s3a_path),
)

DataFrame[a: bigint, b: bigint]

DataFrame[a: bigint, b: bigint]

DataFrame[a: bigint, b: bigint]
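
The output above shows only the schema of each DataFrame, not the rows. If a stricter check is wanted, the rows read back could be compared against the values written by the test-data cell. The following is a minimal sketch of such a comparison. The helper `rows_match` is hypothetical, and a stand-in list replaces the Spark result so that the sketch runs without a Spark session.

```python
def rows_match(expected, actual):
    """Compare two collections of row dicts, ignoring row order."""
    def key(row):
        return sorted(row.items())
    return sorted(expected, key=key) == sorted(actual, key=key)


# The same values as the pandas DataFrame written by the test-data cell.
expected_rows = [{"a": a, "b": 10 * a} for a in range(1, 6)]

# In the notebook, the rows read back by Spark could be obtained with:
#     actual_rows = [row.asDict() for row in spark.read.parquet(s3a_path).collect()]
# Here a reversed copy stands in for them so the sketch runs without Spark.
actual_rows = list(reversed(expected_rows))

assert rows_match(expected_rows, actual_rows)
```

Comparing without regard to order is a deliberate choice, since Spark does not guarantee row order on read.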