# 3.5 Read and write text files

The official documentation can be found [here](https://spark.apache.org/docs/latest/sql-data-sources-text.html). In PySpark, spark.read.text("file_name") reads a file or a directory of text files into a Spark DataFrame, and dataframe.write.text("path") writes a DataFrame to text files (in Java, read() and write() are method calls). When reading a text file, each line becomes a row with a single string column named “value” by default. The line separator can be changed, as shown in the examples below. The option() function customizes reading and writing, for example the line separator or the compression codec.

At the end of this tutorial, we also show how to read text files into an RDD. You can skip that part if you do not need it.



In [2]:
from pyspark.sql import SparkSession
import os

In [3]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .appName("ReadWriteText").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("ReadWriteText") \
                      .config("spark.kubernetes.container.image",os.environ["IMAGE_NAME"]) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages','org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()

22/02/15 15:47:13 WARN Utils: Your hostname, pliu-SATELLITE-P850 resolves to a loopback address: 127.0.1.1; using 172.22.0.33 instead (on interface wlp3s0)
22/02/15 15:47:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/02/15 15:47:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## 3.5.1 Read text file into Dataframe

As the example below shows, when reading a text file Spark splits records on "\n" by default (it also accepts "\r" and "\r\n"). As a result, each line becomes a row in a single column named “value”.

In [4]:
text_file_path="data/text/users1.txt"
df=spark.read.text(text_file_path)

In [5]:
df.show()

+------------+
|       value|
+------------+
|  alice F 32|
|    bob M 38|
|Charlie M 48|
+------------+



                                                                                

In [5]:
df.printSchema()

root
 |-- value: string (nullable = true)



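We can change the line separator with option('lineSep', 'value'). In the example below, we use a space as the line separator. Because "\n" is then no longer a separator, it stays inside the values, which is why some rows in the output span two lines.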

In [6]:
df_sep=spark.read.option('lineSep',' ').text(text_file_path)

In [7]:
df_sep.show()

+----------+
|     value|
+----------+
|     alice|
|         F|
|    32
bob|
|         M|
|38
Charlie|
|         M|
|        48|
+----------+
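
To see why the output above contains rows such as "32" followed by "bob", the splitting can be imitated in plain Python. This is only a rough emulation with str.split(), not Spark's actual reader. The file content is taken from the wholeTextFiles() output shown later in this section.

```python
# Content of users1.txt (no trailing newline).
content = "alice F 32\nbob M 38\nCharlie M 48"

# Default behaviour: a record ends at a newline.
default_records = content.split("\n")
print(default_records)
# ['alice F 32', 'bob M 38', 'Charlie M 48']

# With lineSep=' ', only a space ends a record, so "\n" stays inside the value.
space_records = content.split(" ")
print(space_records)
# ['alice', 'F', '32\nbob', 'M', '38\nCharlie', 'M', '48']
```

The seven elements of the second list correspond to the seven rows of df_sep.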



## 3.5.2 Read Multiple text files

Note that if you pass a folder as the input path, Spark reads all the text files in it.

In [11]:
folder_path="data/text"
! ls {folder_path}

users1.txt  users2.txt


In [8]:

df_multi=spark.read.text(folder_path)

In [9]:
df_multi.show()

+------------+
|       value|
+------------+
|  alice F 32|
|    bob M 38|
|Charlie M 48|
|   toto F 32|
|   titi M 38|
|   tata M 48|
+------------+



If you want the whole content of each file as a single row, use the option wholetext=True. In Scala, this is written as:

```scala
val df3 = spark.read.option("wholetext", true).text(path)
```

In Java, it is written as:

```java
Dataset<Row> df3 = spark.read().option("wholetext", "true").text(path);
```

In PySpark, wholetext is a keyword argument of text():

In [17]:
df_full=spark.read.text(folder_path,wholetext=True)
df_full.show()

+--------------------+
|               value|
+--------------------+
|alice F 32
bob M ...|
|toto F 32
titi M ...|
+--------------------+



## 3.5.3 Read text file into RDD

In the examples above, we read text files into a DataFrame. We can also read them into an RDD instead.

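The SparkContext provides two methods for this:
- textFile(): returns an RDD of strings, one element per line
- wholeTextFiles(): returns an RDD of (file path, file content) pairs, one element per file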

In [7]:
sc=spark.sparkContext
text_rdd=sc.textFile(text_file_path)

In [9]:
# collect() sends all partitions of the RDD to the driver and returns them as a list
list1=text_rdd.collect()
print(list1)

['alice F 32', 'bob M 38', 'Charlie M 48']


In [12]:
file_rdd=sc.wholeTextFiles(folder_path)
list2=file_rdd.collect()

In [13]:
print(list2)

[('file:/home/pliu/PycharmProjects/PySparkCommonFunc/notebooks/pysparkbasics/L03_ReadFromVariousDataSource/data/text/users1.txt', 'alice F 32\nbob M 38\nCharlie M 48'), ('file:/home/pliu/PycharmProjects/PySparkCommonFunc/notebooks/pysparkbasics/L03_ReadFromVariousDataSource/data/text/users2.txt', 'toto F 32\ntiti M 38\ntata M 48')]
