# 3.3 Read and write CSV files
The official doc can be found [here](https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Spark SQL can read a single CSV file or a directory of CSV files into a dataframe with `spark.read.csv()`, and write a dataframe out as CSV files with `df.write.csv()`.


In [1]:
from pyspark.sql import SparkSession
import os

In [2]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .appName("ReadWriteCSV").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("ReadWriteCSV") \
                      .config("spark.kubernetes.container.image",os.environ["IMAGE_NAME"]) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages','org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()

22/02/13 09:21:11 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.184.146 instead (on interface ens33)
22/02/13 09:21:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/02/13 09:21:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## 3.3.1 Read simple CSV file

Ideally, the CSV file contains a header row with the column names. With header=True, Spark uses that row as column names; with inferSchema=True, it scans the data to infer the type of each column. In the example below, we set both.

There are several ways to set these options: as arguments of the method csv(), or through the methods option() and options(). The three snippets below are equivalent.
```python
# as csv method argument
df=spark.read.csv(path=clean_csv_path,header=True,inferSchema=True)

# use option
df=spark.read\
    .option("header",True)\
    .option("inferSchema", True) \
    .csv(path=clean_csv_path)

# use options
df=spark.read\
    .options(header=True,inferSchema=True)\
    .csv(path=clean_csv_path)
```

## Configurable Options

As you can see, when reading CSV files we often need to specify options such as:
- header (bool) : e.g. header=True, the first line is used as column names
- inferSchema (bool) : e.g. inferSchema=True
- delimiter (String) : e.g. delimiter=',' (an alias of the option sep)
- encoding (String) : e.g. encoding='UTF-8'
- nullValue (String) : e.g. nullValue='1900-1-1', every field with the value '1900-1-1' will be read as null
- quote (String) : when a text column contains the character used as delimiter (e.g. a comma), the value must be wrapped in quote characters so that the delimiters inside the quotes are ignored. The default quote character is ", and with this option you can set any other character.

You can find the full option list in the [official documentation](https://spark.apache.org/docs/latest/sql-data-sources-csv.html).

In [28]:
clean_csv_path="data/csv/adult_nospace_header.csv"
raw_csv_path="data/csv/adult.csv"

In [73]:
df=spark.read\
    .options(header=True,inferSchema=True,delimiter=',',nullValue="?")\
    .csv(path=clean_csv_path)
df.show(5)

+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i

In [30]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



In [52]:
# save the schema of the dataframe as a json file
json_schema=df.schema.json()
schema_file_path="data/schema.json"
# the with block closes the file automatically
with open(schema_file_path,"w") as f:
    f.write(json_schema)

In [31]:
df.count()

32561

Notice that the column names and types are correct. If the CSV file has no header, Spark uses _c0, _c1, ... as default column names.

In [32]:
spark.read.csv(raw_csv_path).show(1)

+---+----------+------+----------+---+--------------+-------------+--------------+------+-----+-----+----+----+--------------+------+
|_c0|       _c1|   _c2|       _c3|_c4|           _c5|          _c6|           _c7|   _c8|  _c9| _c10|_c11|_c12|          _c13|  _c14|
+---+----------+------+----------+---+--------------+-------------+--------------+------+-----+-----+----+----+--------------+------+
| 39| State-gov| 77516| Bachelors| 13| Never-married| Adm-clerical| Not-in-family| White| Male| 2174|   0|  40| United-States| <=50K|
+---+----------+------+----------+---+--------------+-------------+--------------+------+-----+-----+----+----+--------------+------+
only showing top 1 row



In [33]:
spark.read.csv(raw_csv_path).printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)



As we did not set inferSchema to True, Spark reads every column as a string by default.

## 3.3.2 Read multiple CSV files

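To read multiple CSV files, pass a list of paths to csv(), for example
```python
df = spark.read.csv(["path1", "path2", "path3"])
```
A single string with several paths joined by a separator does not work here: the dataframe reader treats the whole string as one path, which is why the cell below fails with `Path does not exist`.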

In [34]:
multi_path="data/csv/adult_nospace_header.csv,data/csv/adult.csv"
print(multi_path)
spark.read.csv(multi_path).count()

data/csv/adult_nospace_header.csv;data/csv/adult.csv


AnalysisException: Path does not exist: file:/home/pliu/git/PySparkCommonFunc/notebooks/pysparkbasics/L03_ReadFromVariousDataSource/data/csv/adult_nospace_header.csv;data/csv/adult.csv

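We can also give the path of the parent directory of the CSV files, and Spark will read all the files in it. The number of partitions of the resulting dataframe depends on the number and size of the files. Note that Spark does not filter files by extension, so keep the directory flat and limited to CSV files: subdirectories or files in other formats can make the read fail or produce a wrong schema.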

In [38]:
parent_path="data/csv"
multi_df=spark.read.csv(path=parent_path,header=True,inferSchema=True)
multi_df.count()

65121

In [39]:
multi_df.printSchema()

root
 |-- 39: string (nullable = true)
 |--  State-gov: string (nullable = true)
 |--  77516: string (nullable = true)
 |--  Bachelors: string (nullable = true)
 |--  13: string (nullable = true)
 |--  Never-married: string (nullable = true)
 |--  Adm-clerical: string (nullable = true)
 |--  Not-in-family: string (nullable = true)
 |--  White: string (nullable = true)
 |--  Male: string (nullable = true)
 |--  2174: string (nullable = true)
 |--  0: string (nullable = true)
 |--  40: string (nullable = true)
 |--  United-States: string (nullable = true)
 |--  <=50K: string (nullable = true)



Note that Spark uses the schema inferred from the first CSV file it reads as the schema of the whole dataframe. Here the first file read has no header, so its first data row is taken as column names and the schema is completely wrong.

## 3.3.3 Read CSV files with explicit schema

Sometimes you will encounter CSV files without a header, so you need to specify the column names or give a complete schema.

The example below reads a CSV file with a given list of column names.

In [50]:
cols=["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","income"]
# unpack the list of column names instead of repeating every name
spark.read.csv(raw_csv_path,header=False).toDF(*cols).printSchema()

root
 |-- age: string (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: string (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: string (nullable = true)
 |-- capital-loss: string (nullable = true)
 |-- hours-per-week: string (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



Notice that all the columns are of type string. To get the column types right, you need to set the option inferSchema to True.


In [51]:
spark.read.csv(raw_csv_path,header=False,inferSchema=True).toDF(*cols).printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: double (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: double (nullable = true)
 |-- capital-loss: double (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



Notice that the inferred schema still has problems: integer columns are inferred as double. So the best way is to give the schema explicitly. You can specify a schema by building it with the StructType class, or by loading it from a generated schema.json file.

For more detail about Spark schemas, please visit this [notebook](../L02_DataFrame/S01_DataStructures/01_Spark_Schema.ipynb).

The cells below read the schema back from the JSON file saved earlier.

In [56]:
import json
from pyspark.sql.types import StructType

# load the saved schema and rebuild the StructType from its json form
# (the with block closes the file automatically)
with open(schema_file_path,'r') as f:
    json_schema=StructType.fromJson(json.load(f))

print(json_schema)

StructType(List(StructField(age,IntegerType,true),StructField(workclass,StringType,true),StructField(fnlwgt,IntegerType,true),StructField(education,StringType,true),StructField(education-num,IntegerType,true),StructField(marital-status,StringType,true),StructField(occupation,StringType,true),StructField(relationship,StringType,true),StructField(race,StringType,true),StructField(sex,StringType,true),StructField(capital-gain,IntegerType,true),StructField(capital-loss,IntegerType,true),StructField(hours-per-week,IntegerType,true),StructField(native-country,StringType,true),StructField(income,StringType,true)))


In [64]:
df_with_schema=spark.read.options(header=False,inferSchema=False,nullValue="?").schema(schema=json_schema).csv(raw_csv_path)

In [65]:
df_with_schema.printSchema()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)



In [66]:
df_with_schema.show(5)

+---+-----------------+------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
|age|        workclass|fnlwgt| education|education-num|     marital-status|        occupation|  relationship|  race|    sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+-----------------+------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
| 39|        State-gov|  null| Bachelors|         null|      Never-married|      Adm-clerical| Not-in-family| White|   Male|        null|        null|          null| United-States| <=50K|
| 50| Self-emp-not-inc|  null| Bachelors|         null| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|        null|        null|          null| United-States| <=50K|
| 38|          Private|  null|   HS-grad|         null|     

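Notice that with the explicit schema, the column types are integer and not double. However, most integer columns now show null: in this file every field except the first one starts with a space, and a value such as " 77516" cannot be parsed as an integer. These leading spaces are also the reason why inferSchema did not give the right types.
The string columns keep the leading space too. For example, the filter below does not find anything.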

In [67]:
from pyspark.sql.functions import col
df_with_schema.filter(col("workclass")=="State-gov").show(5)

+---+---------+------+---------+-------------+--------------+----------+------------+----+---+------------+------------+--------------+--------------+------+
|age|workclass|fnlwgt|education|education-num|marital-status|occupation|relationship|race|sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+---------+------+---------+-------------+--------------+----------+------------+----+---+------------+------------+--------------+--------------+------+
+---+---------+------+---------+-------------+--------------+----------+------------+----+---+------------+------------+--------------+--------------+------+



In [68]:
# if we add space, we will have results
df_with_schema.filter(col("workclass")==" State-gov").show(5)

+---+----------+------+-------------+-------------+-------------------+---------------+--------------+-------------------+-----+------------+------------+--------------+--------------+------+
|age| workclass|fnlwgt|    education|education-num|     marital-status|     occupation|  relationship|               race|  sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------+------+-------------+-------------+-------------------+---------------+--------------+-------------------+-----+------------+------------+--------------+--------------+------+
| 39| State-gov|  null|    Bachelors|         null|      Never-married|   Adm-clerical| Not-in-family|              White| Male|        null|        null|          null| United-States| <=50K|
| 30| State-gov|  null|    Bachelors|         null| Married-civ-spouse| Prof-specialty|       Husband| Asian-Pac-Islander| Male|        null|        null|          null|         India|  >50K|
| 22| State-gov|  null| Some-college|   

To remove the spaces, we can use the trim() function, as in the example below.

In [70]:
from pyspark.sql.functions import trim
df_with_schema.withColumn("new_workclass",trim(col("workclass"))).filter(col("new_workclass")=="State-gov").show(5)

+---+----------+------+-------------+-------------+-------------------+---------------+--------------+-------------------+-----+------------+------------+--------------+--------------+------+-------------+
|age| workclass|fnlwgt|    education|education-num|     marital-status|     occupation|  relationship|               race|  sex|capital-gain|capital-loss|hours-per-week|native-country|income|new_workclass|
+---+----------+------+-------------+-------------+-------------------+---------------+--------------+-------------------+-----+------------+------------+--------------+--------------+------+-------------+
| 39| State-gov|  null|    Bachelors|         null|      Never-married|   Adm-clerical| Not-in-family|              White| Male|        null|        null|          null| United-States| <=50K|    State-gov|
| 30| State-gov|  null|    Bachelors|         null| Married-civ-spouse| Prof-specialty|       Husband| Asian-Pac-Islander| Male|        null|        null|          null|       

As we used the option nullValue="?" when reading `df`, Spark converted all "?" to null values. In the example below, we count the null values in column `workclass`.

In [74]:
df.filter(col("workclass").isNull()).count()

1836

## 3.3.4 Write CSV files

The number of output CSV files depends on the number of partitions of the dataframe. The default option values are fine in most cases. The one thing you need to pay attention to is that header is false by default. Note that Spark can compress the CSV output, but we do not see much interest in it: we use CSV for human readability, and if we want to save space, we will choose another data format.


In [76]:
df.rdd.getNumPartitions()

1

In [77]:
output_path="/tmp/output_test"

In [78]:
# check the output csv files, you will find it without header
df.write.csv(output_path)

                                                                                

In [79]:
# now check the output csv file with below options.
df.write.mode("overwrite").options(header=True,delimiter=";").csv(output_path)