# Local PySpark on SageMaker Studio

This notebook shows how to run PySpark in local mode within a SageMaker Studio notebook. This example uses the **Data Science - Python3** image and kernel, but the approach should work for any kernel in SageMaker Studio, including bring-your-own (BYO) custom images.

## Setup
Two things must be done to enable local PySpark within SageMaker Studio:
1. Make sure a Java installation is available. The easiest way to install a JDK and set the proper paths is to use conda.
2. Append the local container's hostname to `/etc/hosts` so that Spark can communicate properly.

In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y
!grep `hostname` /etc/hosts >/dev/null || echo 127.0.0.1 `hostname` >> /etc/hosts
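
The two prerequisites above can also be checked from plain Python, which makes them easier to inspect than the shell one-liner. The following is a minimal sketch and not part of the original notebook: the helper names `java_available` and `ensure_hosts_entry` are hypothetical, and it assumes the kernel has write access to the hosts file. Unlike `grep`, which matches substrings, this version compares whole whitespace-separated fields.

```python
import shutil
from pathlib import Path


def java_available() -> bool:
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def ensure_hosts_entry(hosts_path: str, hostname: str, ip: str = "127.0.0.1") -> bool:
    """Append `ip hostname` to the hosts file unless the hostname is already listed.

    Returns True if a line was appended, False if the entry already existed.
    """
    path = Path(hosts_path)
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        # Drop trailing comments, then compare whole fields after the IP column.
        fields = line.split("#", 1)[0].split()
        if hostname in fields[1:]:
            return False
    with path.open("a") as f:
        # Start on a fresh line if the file lacks a trailing newline.
        if text and not text.endswith("\n"):
            f.write("\n")
        f.write(f"{ip} {hostname}\n")
    return True
```

In a notebook this could be called as `ensure_hosts_entry("/etc/hosts", socket.gethostname())` after `import socket`, with `java_available()` serving as a quick sanity check once the conda install has finished.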

## Install PySpark

In [None]:
%pip install pyspark==3.2.0

## Use S3 Data within Local PySpark
* Specifying the `hadoop-aws` package in the Spark config lets us access S3 datasets using the `s3a://` URI scheme.
* Since we've already authenticated to SageMaker Studio, we can use the assumed SageMaker execution role for any S3 reads and writes by setting the credentials provider to `ContainerCredentialsProvider`.

In [None]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
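    # The "spark.hadoop." prefix passes the option through to the Hadoop (S3A) configuration
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )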
    .getOrCreate()
)

print(spark.version)
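
`ContainerCredentialsProvider` (AWS SDK for Java v1) fetches credentials from a container credentials endpoint whose location is given by the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` or `AWS_CONTAINER_CREDENTIALS_FULL_URI` environment variable. If S3 reads fail with an authentication error, a quick check like the sketch below may help confirm whether the kernel exposes one of them. The helper name `container_credentials_configured` is hypothetical, and whether a given Studio image sets these variables is an assumption to verify in your own environment.

```python
import os
from typing import Mapping, Optional

# Environment variables consulted by the AWS SDK v1 ContainerCredentialsProvider.
_CONTAINER_CREDENTIAL_VARS = (
    "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI",
    "AWS_CONTAINER_CREDENTIALS_FULL_URI",
)


def container_credentials_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if either container-credentials variable is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in _CONTAINER_CREDENTIAL_VARS)


print(container_credentials_configured())
```

A `False` result only means the variables are absent from the Python process; it does not by itself prove that Spark cannot obtain credentials some other way.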

In [None]:
csv_df = spark.read.csv(
    "s3a://nyc-tlc/csv_backup/fhvhv_tripdata_2019-02.csv", header=True
)
csv_df.show()
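
Object URIs copied from the S3 console or the AWS CLI use the `s3://` scheme, whereas the `hadoop-aws` connector used in this notebook is addressed through `s3a://`. A small helper can normalize paths before they reach `spark.read`. This is only a sketch, and `to_s3a` is a hypothetical name rather than part of PySpark.

```python
def to_s3a(uri: str) -> str:
    """Rewrite an s3://, s3n://, or s3a:// URI so that it uses the s3a:// scheme."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme.lower() not in {"s3", "s3n", "s3a"}:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return "s3a://" + rest


print(to_s3a("s3://nyc-tlc/csv_backup/fhvhv_tripdata_2019-02.csv"))
# prints: s3a://nyc-tlc/csv_backup/fhvhv_tripdata_2019-02.csv
```

The result can be passed directly to `spark.read.csv(..., header=True)` as in the cell above.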