# Read and write SAS files

SAS used to be one of the most popular frameworks for working with data. It provides a data format called `.sas7bdat`. By default, Spark does not provide a reader for SAS files, but a third-party package can read them.
You can find the project's GitHub page [here](https://github.com/saurfang/spark-sas7bdat).

You can find the documentation for Java and Scala in the project's official documentation.

> For R, follow this doc: https://github.com/bnosac/spark.sas7bdat

The package is released via [spark-packages](https://spark-packages.org/package/saurfang/spark-sas7bdat).

In this notebook, we only show the Python version.

In [1]:
from pyspark.sql import SparkSession,DataFrame

import os
from pyspark.sql.functions import col, lit, when

## 1. Read SAS files

Below, we show two examples:
1. Read a SAS file from the local file system
2. Read a SAS file from a remote S3 server

First, we need to create a Spark session with the required packages.



In [2]:
local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLValidation-check")\
        .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLValidation-check")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
        .getOrCreate()

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/onyxia/.ivy2/cache
The jars for the packages stored in: /home/onyxia/.ivy2/jars
saurfang#spark-sas7bdat added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9990780e-efd4-4523-b547-099afb30f1b2;1.0
	confs: [default]
	found saurfang#spark-sas7bdat;3.0.0-s_2.12 in spark-packages
	found com.epam#parso;2.0.11 in central
	found org.slf4j#slf4j-api;1.7.5 in central
	found org.apache.logging.log4j#log4j-api-scala_2.12;12.0 in central
	found org.scala-lang#scala-reflect;2.12.10 in central
	found org.apache.logging.log4j#log4j-api;2.13.2 in central
:: resolution report :: resolve 274ms :: artifacts dl 9ms
	:: modules in use:
	com.epam#parso;2.0.11 from central in [default]
	org.apache.logging.log4j#log4j-api;2.13.2 from central in [default]
	org.apache.logging.log4j#log4j-api-scala_2.12;12.0 from central in [default]
	org.scala-lang#scala-reflect;2.12.10 from central in [default]
	org.slf4j#slf4j-api;1.7.5 from central in [defau

### 1.1 Read SAS file from local file system

Now that we have the appropriate Spark session, we can read SAS files. First, we read data from the local file system.

In [16]:
file_path="/home/onyxia/work/PySparkCommonFunc/data/tiny.sas7bdat"

In [17]:
df = spark.read.format("com.github.saurfang.sas.spark").load(file_path, forceLowercaseNames=True, inferLong=True)

In [18]:
df.show()

+---+
|  i|
+---+
|0.0|
+---+



In [19]:
df.printSchema()

root
 |-- i: double (nullable = true)



### 1.2 Read SAS file from remote S3

In this example, we will read data from a remote S3 server.

You can check whether an S3 endpoint is configured in your environment with the command below.

In [20]:
! env | grep -i "aws_s3_endpoint"

AWS_S3_ENDPOINT=minio.lab.sspcloud.fr
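The same check can be done from Python, which can be convenient inside a notebook cell. The sketch below is only an illustration: `find_env_vars` is a hypothetical helper, not part of any library, and it matches on the variable name only (case-insensitively), whereas `grep -i` matches the whole `name=value` line.

```python
import os


def find_env_vars(pattern, env=None):
    """Return the environment variables whose name contains `pattern` (case-insensitive)."""
    env = os.environ if env is None else env
    needle = pattern.lower()
    return {name: value for name, value in env.items() if needle in name.lower()}


# An empty dict means no matching variable is set in this session.
print(find_env_vars("aws_s3_endpoint"))
```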


In [21]:
# the file path is <protocol>://<bucket_name>/<file_path>
remote_path="s3a://pengfei/diffusion/data_format/sas/private_school_survey/pss1718_pu.sas7bdat"
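As a quick sanity check, an S3 path can be split back into the three parts named in the comment above using only the standard library. This is a minimal sketch; `split_s3_path` is a hypothetical helper introduced for illustration.

```python
from urllib.parse import urlparse


def split_s3_path(path):
    """Split '<protocol>://<bucket_name>/<file_path>' into its three parts."""
    parsed = urlparse(path)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not a '<protocol>://<bucket_name>/<file_path>' path: {path}")
    # The leading slash belongs to the URL syntax, not to the object key.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


protocol, bucket, key = split_s3_path(
    "s3a://pengfei/diffusion/data_format/sas/private_school_survey/pss1718_pu.sas7bdat"
)
print(protocol, bucket, key)
```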

In [22]:
df1 = spark.read.format("com.github.saurfang.sas.spark").load(remote_path, forceLowercaseNames=True, inferLong=True)

In [23]:
df1.show()

+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-----------

In [24]:
df1.count()

                                                                                

22895

## 2. Create spark session with local jar files

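In the example above, we created the Spark session with the `spark.jars.packages` option (the equivalent of `--packages`), which downloads the package and its dependencies. This works only if the driver and executors have an internet connection. Otherwise, you need to use the `spark.jars` option (the equivalent of `--jars`) and specify the location of the jar files.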
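Since `spark.jars` takes a comma-separated list of jar paths, it can help to build that string from a Python list and fail early when a file is missing. The sketch below assumes the jars live on the local file system; `build_spark_jars` is a hypothetical helper, not a Spark API.

```python
import os


def build_spark_jars(jar_paths):
    """Join local jar paths into the comma-separated string expected by `spark.jars`."""
    missing = [p for p in jar_paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"jar file(s) not found: {missing}")
    return ",".join(jar_paths)


# Possible usage when building the session (the paths are placeholders):
#   .config("spark.jars", build_spark_jars(["/path/to/a.jar", "/path/to/b.jar"]))
```

Checking the paths up front gives a clear error at session creation instead of a missing-class error at read time.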

Let's start with a first attempt that provides only the `spark-sas7bdat` jar.


In [1]:
from pyspark.sql import SparkSession,DataFrame

import os

In [2]:
local = True
jarPath = "/home/onyxia/work/PySparkCommonFunc/lib/spark-sas7bdat-3.0.0-s_2.12.jar"

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLValidation-check")\
        .config("spark.jars",jarPath)\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLValidation-check")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
        .getOrCreate()

2023-04-11 13:23:24,550 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-04-11 13:23:25,778 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
remote_path="s3a://pengfei/diffusion/data_format/sas/private_school_survey/pss1718_pu.sas7bdat"

In [4]:
df3 = spark.read.format("com.github.saurfang.sas.spark").load(remote_path, forceLowercaseNames=True, inferLong=True)

This does not work: as you can see in the error message, Spark complains about missing classes. That's because the jar does not contain its transitive dependencies. In our case, `spark-sas7bdat-3.0.0-s_2.12.jar` requires the following
dependencies:
- com.epam:parso:2.0.11
- org.slf4j:slf4j-api:1.7.5
- org.apache.logging.log4j:log4j-api-scala_2.12:12.0
- org.scala-lang:scala-reflect:2.12.10
- org.apache.logging.log4j:log4j-api:2.13.2
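The dependencies above are Maven coordinates (`group:artifact:version`), and the Ivy log earlier shows them being resolved from Maven Central. In a standard Maven repository layout, each coordinate maps to a predictable jar URL. The sketch below illustrates that mapping; `maven_jar_url` is a hypothetical helper, and the base URL and the plain-jar packaging (no classifier) are assumptions.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"  # assumed repository base URL


def maven_jar_url(coordinate, repo=MAVEN_CENTRAL):
    """Map 'group:artifact:version' to the jar URL in a standard Maven repository layout."""
    group, artifact, version = coordinate.split(":")
    # Dots in the group id become directory separators.
    group_path = group.replace(".", "/")
    return f"{repo}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"


dependencies = [
    "com.epam:parso:2.0.11",
    "org.slf4j:slf4j-api:1.7.5",
    "org.apache.logging.log4j:log4j-api-scala_2.12:12.0",
    "org.scala-lang:scala-reflect:2.12.10",
    "org.apache.logging.log4j:log4j-api:2.13.2",
]
for dep in dependencies:
    print(maven_jar_url(dep))
```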

Below is an example that works. Notice that we add the parso jar. The rest jar files can be foun

In [None]:
local = True
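# spark.jars expects a comma-separated list; adjacent string literals avoid stray whitespace in the paths
jarPath = ("/home/onyxia/work/PySparkCommonFunc/lib/spark-sas7bdat-3.0.0-s_2.12.jar,"
           "/home/onyxia/work/PySparkCommonFunc/lib/parso-2.0.11.jar")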

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLValidation-check")\
        .config("spark.jars",jarPath)\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLValidation-check")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
        .getOrCreate()

In [5]:
df3.show()

2023-04-11 13:23:41,275 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-----------

## 3. Write SAS files

Because the SAS file format is under a commercial licence, there is no open-source package that can write SAS files.