# 3.8 Read and write Excel files

Spark has no built-in connector for reading Excel files directly, but a third-party connector exists:
- https://mvnrepository.com/artifact/com.crealytics/spark-excel

With PySpark, the simplest way is usually to read the Excel file with pandas and then convert the result to a Spark DataFrame.

In [4]:
from pyspark.sql import SparkSession, DataFrame
import pyspark.pandas as ps
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType
from pyspark.sql.functions import lit, col, when, concat, udf
import os



In [3]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .appName("ReadExcelFiles")\
                  .getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("ReadExcelFiles") \
                      .config("spark.kubernetes.container.image",os.environ['IMAGE_NAME']) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages','com.crealytics:spark-excel_2.12:3.1.2_0.17.1') \
                      .getOrCreate()

22/08/09 10:55:49 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.184.146 instead (on interface ens33)
22/08/09 10:55:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/08/09 10:55:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [5]:
file_path="../../../data/per.xls"

## 3.8.1 Use third party excel connector


In my local environment I use Spark 3.1.x, so I use the following jar:
```xml
<!-- https://mvnrepository.com/artifact/com.crealytics/spark-excel -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>3.1.2_0.17.1</version>
</dependency>

```
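
As a minimal sketch of how the connector could be used from PySpark, the helper below reads one sheet through the spark-excel data source. It assumes the session was created with the `spark.jars.packages` setting shown in the Kubernetes branch above, so that the jar is on the classpath. The function name `read_excel_with_connector` is hypothetical, and the sheet name `per` is taken from the file used later in this notebook.

```python
def read_excel_with_connector(spark, path, sheet="per"):
    """Read one sheet of an Excel file through the spark-excel data source.

    `spark` is an existing SparkSession that already has the
    com.crealytics:spark-excel jar available (assumption).
    """
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")                # first row holds the column names
        .option("inferSchema", "true")           # let the connector guess column types
        .option("dataAddress", f"'{sheet}'!A1")  # start at cell A1 of the given sheet
        .load(path)
    )
```

With a suitable session, `read_excel_with_connector(spark, file_path)` would return a Spark DataFrame directly, with no pandas step in between.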

## 3.8.2 Use standalone pandas
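
A minimal sketch of the standalone pandas route: read the sheet with `pandas.read_excel`, then pass the result to `spark.createDataFrame`. The helper names `clean_text_columns` and `excel_to_spark` are hypothetical. The cleaning step rests on the assumption that text columns mixing strings with empty cells (NaN) can trip up Spark's schema inference, so it stringifies real values and leaves missing cells as nulls.

```python
import pandas as pd


def clean_text_columns(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where object (text) columns hold only strings or nulls."""
    out = pdf.copy()
    for name in out.columns:
        if out[name].dtype == object:
            # Stringify real values, keep missing cells as nulls.
            out[name] = out[name].map(lambda v: None if pd.isna(v) else str(v))
    return out


def excel_to_spark(spark, path, sheet_name="per"):
    """Read one sheet with standalone pandas, then convert to a Spark DataFrame."""
    pdf = pd.read_excel(path, sheet_name=sheet_name)  # needs xlrd for .xls files
    return spark.createDataFrame(clean_text_columns(pdf))
```

Called as `excel_to_spark(spark, file_path)`, this would load the whole sheet on the driver first, so it only suits files that fit in driver memory.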

## 3.8.3 Use pandas on spark

**Note: With Spark 3.1.1, pandas on Spark has problems with types when converting a pandas DataFrame to a Spark DataFrame, so standalone pandas is recommended.**

Reading Excel files requires the `xlrd` dependency, which you need to install in your Python virtual environment.

```shell
pip install xlrd
poetry add xlrd
```

Note that `ps.read_excel` returns a pandas-on-Spark DataFrame, not a Spark DataFrame. If you need a Spark DataFrame, convert it explicitly, for example with `df.to_spark()`.

For more details on `read_excel`, see the official [documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html).

In [7]:
df=ps.read_excel(file_path, sheet_name='per', index_col=[0])

*** No CODEPAGE record, no encoding_override: will use 'iso-8859-1'



In [9]:
df.head()

| Unnamed: 0_level_0 | Assureur/Support | Avis sur 5 | Frais Vers. | Frais Gestion Fonds ? | Frais Gestion UC | Frais/rente | Fonds euros | Taux brut | Nombre SCPI | Nombre SCI | Nombre OPCI | Nombre ETF | Nombre UC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PER | Unnamed: 1_level_1 | Unnamed: 2_level_1 | Unnamed: 3_level_1 | Unnamed: 4_level_1 | Unnamed: 5_level_1 | Unnamed: 6_level_1 | Unnamed: 7_level_1 | Unnamed: 8_level_1 | Unnamed: 9_level_1 | Unnamed: 10_level_1 | Unnamed: 11_level_1 | Unnamed: 12_level_1 | Unnamed: 13_level_1 |
| ABEILLE RETRAITE PLURIELLE | ABEILLE RETRAITE PROFESSIONNELLE | | 5.0 | 1.0 | 1.0 | | ABEILLE EURO PERP | | 0 | 0 | 0 | 0 | 80 |
| AFER RETRAITE INDIVIDUELLE | ABEILLE | | 3.0 | 1.0 | 1.0 | 0.0 | ABEILLE RP SECURITE RETRAITE | | 0 | 0 | 0 | 0 | 80 |
| ALLIANZ PER HORIZON | ALLIANZ | | 4.8 | 0.85 | 0.85 | | ALLIANZ RETRAITE | | 0 | 0 | 0 | 0 | 92 |
| AMBITION RETRAITE INDIVIDUELLE | LA MONDIALE | | 3.9 | 0.7 | 0.7 | 0.0 | FONDS EUROS RETRAITE | | 0 | 0 | 0 | 0 | 0 |
| AMPLI-PER LIBERTE | AMPLI-MUTUELLE | | 0.0 | 0.5 | 0.4 | 0.0 | AMPLI PER EUROS | | 2 | 0 | 0 | 3 | 4 |


In [13]:
# Use the pandas-on-Spark DataFrame API to write CSV.
# The path is a directory that Spark fills with part files.
path="/tmp/spark"
df.to_csv(
    path=r'%s/excel_output' % path,
    index_col=["PER_name"])

Writing Excel files requires an engine (`xlsxwriter`), so you need to install it in your Python virtual environment:

```shell
pip install xlsxwriter
poetry add xlsxwriter
```

In [17]:
# Use the pandas-on-Spark DataFrame API to write an Excel file.
df.to_excel(r'%s/excel_output.xlsx' % path, sheet_name='PER',engine='xlsxwriter')