# Reading Text Files

Text files are very simple to load from and save to with Spark. When we load a single
text file as an RDD, each input line becomes an element in the RDD. We can also
load multiple whole text files at the same time into a pair RDD, with the key being the
name and the value being the contents of each file.

## Loading text files
Loading a single text file is as simple as calling the textFile() function on our
SparkContext with the path to the file, as you can see in Example 5-1. To control
the number of partitions, we can also specify minPartitions.
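As a minimal sketch of the two loading styles described above, the helpers below wrap `textFile()` (with `minPartitions`) and `wholeTextFiles()`. The names `load_text` and `load_whole_files`, the `logs/` directory, and the partition count of 4 are assumptions made for illustration. Both helpers take the SparkContext as an argument, so they assume an active context such as the `sc` created in the next cell.

```python
def load_text(sc, path, min_partitions=None):
    """Return an RDD with one element per input line of `path`."""
    # minPartitions is a suggested minimum; Spark may create more partitions.
    return sc.textFile(path, minPartitions=min_partitions)


def load_whole_files(sc, directory):
    """Return a pair RDD of (file path, full file contents)."""
    return sc.wholeTextFiles(directory)


# Usage, assuming `sc` is an active SparkContext:
# lines = load_text(sc, "appl.log", min_partitions=4)
# print(lines.getNumPartitions())
# files = load_whole_files(sc, "logs/")   # hypothetical directory
```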

In [None]:
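# Locate the local Spark installation so that pyspark can be imported
import findspark
findspark.init()
import pyspark
from pyspark.sql.session import SparkSession
# SparkContext is the entry point for RDDs
sc = pyspark.SparkContext(appName="RDDBasics")
# SparkSession wraps the context and is the entry point for DataFrames
spark = SparkSession(sc)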

In [45]:
#Example 5-1#
lines = sc.textFile("appl.log")

In [46]:
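lines.top(2)  # the two largest elements in descending (lexicographic) order; take(2) would return the first two lines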

['log_message',
 '[Wed Mar 10 11:45:51 2004] [info] [client 24.71.236.129] (104)Connection reset by peer: client stopped connection before send body completed']

# Doing a calculation

In [47]:
# Count the tokens that contain the search word
from operator import add
search_word = 'notice'
counts_rdd = lines.flatMap(lambda line: line.split(' ')) \
        .filter(lambda word: search_word in word) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(add)

In [48]:
counts_rdd.collect()

[('[notice]', 2)]
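To make each stage of the pipeline above concrete, here is a rough plain-Python equivalent that runs without Spark. The sample lines are invented for illustration and are not taken from `appl.log`, and `count_matching_tokens` is a hypothetical helper name. Note that the filter is a substring test, which is why the whole token `[notice]`, brackets included, is what gets counted.

```python
from collections import Counter


def count_matching_tokens(lines, search_word):
    """Plain-Python stand-in for flatMap -> filter -> map -> reduceByKey."""
    # flatMap: split every line and flatten the tokens into one stream
    tokens = (token for line in lines for token in line.split(' '))
    # filter: keep tokens that contain the search word as a substring
    matching = (token for token in tokens if search_word in token)
    # map + reduceByKey: count the occurrences of each remaining token
    return dict(Counter(matching))


# Invented sample lines, only loosely shaped like an Apache error log
sample_lines = [
    "[Wed Mar 10] [notice] worker started",
    "[Wed Mar 10] [error] file not found",
    "[Thu Mar 11] [notice] worker stopped",
]

result = count_matching_tokens(sample_lines, "notice")
print(sorted(result.items()))   # [('[notice]', 2)]
```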

## Saving text files
Outputting text files is also quite simple. The method saveAsTextFile(), demonstrated
in Example 5-5, takes a path and will output the contents of the RDD to that
file. The path is treated as a directory and Spark will output multiple files underneath
that directory. This allows Spark to write the output from multiple nodes. With this
method we don’t get to control which files end up with which segments of our data,
but there are other output formats that do allow this.

In [49]:
counts_rdd.saveAsTextFile("appl_counts.txt")
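Because `saveAsTextFile()` writes the string representation of each element, the pairs in `counts_rdd` end up on disk as lines such as `('[notice]', 2)`. The following is a minimal sketch of turning such a line back into a tuple with `ast.literal_eval`. The helper names `parse_saved_pair` and `load_counts` are hypothetical, and `load_counts` assumes that an active SparkContext is passed in.

```python
import ast


def parse_saved_pair(line):
    """Parse one line written by saveAsTextFile() back into a Python tuple."""
    return ast.literal_eval(line)


def load_counts(sc, path="appl_counts.txt"):
    """Read every part file under `path` and rebuild the (word, count) pairs."""
    return sc.textFile(path).map(parse_saved_pair)


# What one saved line looks like, and how it parses back:
saved_line = str(('[notice]', 2))
print(saved_line)                     # ('[notice]', 2)
print(parse_saved_pair(saved_line))   # ('[notice]', 2)
```

Running the save cell a second time fails because the output directory already exists, so remove the directory first or choose a new path.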

## Some Basic Formats
1. Parquet
1. CSV
1. JSON

## How to read from Spark data sources
Let's see how to read from a structured CSV.

In [50]:
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)

In [51]:
# Read the CSV file into a DataFrame (no options yet: no header, every column read as a string)
df = spark.read.csv("housing.csv")

### Now check the data schema

In [52]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)



* OK, but the column names are not very descriptive.
* How can we improve this? By telling Spark to use the header row (if one exists).

In [53]:
df = spark.read \
    .option("header", "true") \
    .csv("housing.csv")  

In [54]:
df.printSchema()

root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- housing_median_age: string (nullable = true)
 |-- total_rooms: string (nullable = true)
 |-- total_bedrooms: string (nullable = true)
 |-- population: string (nullable = true)
 |-- households: string (nullable = true)
 |-- median_income: string (nullable = true)
 |-- median_house_value: string (nullable = true)
 |-- ocean_proximity: string (nullable = true)



* Better, but there is still one caveat: all values are interpreted as strings, although most of them are actually numeric.
* How can we improve this? By either telling Spark which schema to use or asking it to infer the schema from the data.
* Note: asking Spark to infer the schema may have a performance impact, because Spark has to read the data an extra time to determine the types; the cost grows with the number of rows it has to scan.
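For the first option, telling Spark the schema up front, one possible sketch is shown below. It assumes the ten columns seen in the header above, with every column except `ocean_proximity` treated as a double. `HOUSING_COLUMNS`, `housing_ddl`, and `read_housing` are hypothetical names. `DataFrameReader.schema()` accepts either a `StructType` or a DDL-formatted string; the string form is used here to keep the example short.

```python
HOUSING_COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value", "ocean_proximity",
]


def housing_ddl(columns=HOUSING_COLUMNS, string_columns=("ocean_proximity",)):
    """Build a DDL schema string such as 'longitude DOUBLE, latitude DOUBLE, ...'."""
    parts = []
    for name in columns:
        # Assumption: everything is numeric except the listed string columns
        spark_type = "STRING" if name in string_columns else "DOUBLE"
        parts.append(f"{name} {spark_type}")
    return ", ".join(parts)


def read_housing(spark, path="housing.csv"):
    """Read the CSV with an explicit schema, so no inference pass is needed."""
    return (spark.read
            .option("header", "true")
            .schema(housing_ddl())
            .csv(path))


print(housing_ddl())

# Usage, assuming `spark` is an active SparkSession:
# df = read_housing(spark)
# df.printSchema()
```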

In [55]:
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("housing.csv")  

In [56]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



In [57]:
type(df)

pyspark.sql.dataframe.DataFrame

In [58]:
df.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value',
 'ocean_proximity']

In [59]:
df.head(3)

[Row(longitude=-122.23, latitude=37.88, housing_median_age=41.0, total_rooms=880.0, total_bedrooms=129.0, population=322.0, households=126.0, median_income=8.3252, median_house_value=452600.0, ocean_proximity='NEAR BAY'),
 Row(longitude=-122.22, latitude=37.86, housing_median_age=21.0, total_rooms=7099.0, total_bedrooms=1106.0, population=2401.0, households=1138.0, median_income=8.3014, median_house_value=358500.0, ocean_proximity='NEAR BAY'),
 Row(longitude=-122.24, latitude=37.85, housing_median_age=52.0, total_rooms=1467.0, total_bedrooms=190.0, population=496.0, households=177.0, median_income=7.2574, median_house_value=352100.0, ocean_proximity='NEAR BAY')]

In [60]:
df.show(3)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 3 rows



In [61]:
df.show(4)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
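+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+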

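### Parse modes (the mode option)
* permissive (the default): keep the line and set fields that cannot be parsed to null
* dropMalformed: ignore the whole malformed line
* failFast: raise an error and stop reading as soon as a malformed line is found

### Missing values
* nullValue: the string to interpret as null when reading, here NA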


In [62]:
# Read CSV into Data Frame
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("nullValue", "NA") \
    .option("mode", "dropMalformed") \
    .csv("housing.csv")

## Saving a DataFrame to CSV

If the DataFrame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local pandas DataFrame with the toPandas() method and then simply call to_csv():


Note: pandas must be installed before exporting: pip3 install pandas

In [63]:
pdf = df.toPandas()

In [64]:
type(pdf)

pandas.core.frame.DataFrame

In [65]:
pdf.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [66]:
df.toPandas().to_csv('housing_export.csv')
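One detail to keep in mind: `to_csv()` writes the pandas index as an extra, unnamed first column unless `index=False` is passed. The small sketch below shows the difference with an in-memory buffer and a two-row frame (values copied from the first rows shown above, restricted to two columns for brevity; the variable names are assumptions).

```python
import io

import pandas as pd

pdf_small = pd.DataFrame({"longitude": [-122.23, -122.22],
                          "latitude": [37.88, 37.86]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
pdf_small.to_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
pdf_small.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(repr(header_with_index))      # ',longitude,latitude'
print(repr(header_without_index))   # 'longitude,latitude'
```

When a file written with the index is read back by Spark with header=True, the unnamed column shows up as `_c0`.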

In [67]:
sc.version 

'3.3.1'

Otherwise you can use spark-csv:

Spark <= 1.3



In [68]:
# df.save('mycsv.csv', 'com.databricks.spark.csv')

Spark >= 1.4 and < 2.0

In [69]:
# df.write.format('com.databricks.spark.csv').save('mycsv.csv')

In Spark >= 2.0 you can use the csv data source directly:

In [74]:
df.write.csv('housing_direct_export2.csv')  # writes a directory of part files; no header row by default

# Exercise 1) - read the exported file housing_export.csv into Spark

In [70]:
df1 = spark.read.csv("housing_export.csv", header=True, inferSchema=True)

# Exercise 2) - read the file housing_direct_export2.csv into Spark

In [86]:
df2 = spark.read.csv("housing_direct_export2.csv", header=False, inferSchema=True)

In [87]:
df2.show()

+-------+-----+----+------+------+------+------+------+--------+--------+
|    _c0|  _c1| _c2|   _c3|   _c4|   _c5|   _c6|   _c7|     _c8|     _c9|
+-------+-----+----+------+------+------+------+------+--------+--------+
|-122.23|37.88|41.0| 880.0| 129.0| 322.0| 126.0|8.3252|452600.0|NEAR BAY|
|-122.22|37.86|21.0|7099.0|1106.0|2401.0|1138.0|8.3014|358500.0|NEAR BAY|
|-122.24|37.85|52.0|1467.0| 190.0| 496.0| 177.0|7.2574|352100.0|NEAR BAY|
|-122.25|37.85|52.0|1274.0| 235.0| 558.0| 219.0|5.6431|341300.0|NEAR BAY|
|-122.25|37.85|52.0|1627.0| 280.0| 565.0| 259.0|3.8462|342200.0|NEAR BAY|
|-122.25|37.85|52.0| 919.0| 213.0| 413.0| 193.0|4.0368|269700.0|NEAR BAY|
|-122.25|37.84|52.0|2535.0| 489.0|1094.0| 514.0|3.6591|299200.0|NEAR BAY|
|-122.25|37.84|52.0|3104.0| 687.0|1157.0| 647.0|  3.12|241400.0|NEAR BAY|
|-122.26|37.84|42.0|2555.0| 665.0|1206.0| 595.0|2.0804|226700.0|NEAR BAY|
|-122.25|37.84|52.0|3549.0| 707.0|1551.0| 714.0|3.6912|261100.0|NEAR BAY|
|-122.26|37.85|52.0|2202.0| 434.0| 910

### Writing a DataFrame to disk as CSV is similar to reading from CSV. If you want the result as a single file, you can use coalesce.

In [88]:
df.coalesce(1) \
    .write \
    .option("header","true") \
    .option("sep",",") \
    .mode("overwrite") \
    .csv("housing_direct_export3.csv")
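Even with `coalesce(1)`, the path passed to `csv()` is created as a directory; the data ends up in a single `part-*.csv` file inside it, next to marker files such as `_SUCCESS`. Spark can read the directory path directly, but if another tool needs the path of that one file, a small helper along these lines could locate it. `find_single_part_file` is a hypothetical name and not part of Spark.

```python
import glob
import os


def find_single_part_file(output_dir):
    """Return the only part-*.csv file that Spark wrote under `output_dir`."""
    matches = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir!r}, found {len(matches)}"
        )
    return matches[0]


# Usage, after the write above has run:
# print(find_single_part_file("housing_direct_export3.csv"))
```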


# Exercise 3) - read the file housing_direct_export3.csv into Spark

In [89]:
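# Note: this reads the pandas export housing_export.csv (hence the extra _c0 index column below), not housing_direct_export3.csv
df3 = spark.read.csv("housing_export.csv", header=True, inferSchema=True)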

In [91]:
df3.show()

23/03/21 12:16:36 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity
 Schema: _c0, longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity
Expected: _c0 but found: 
CSV file: file:///home/javiortig/uni/procesamiento_datos/semana_6/housing_export.csv
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|_c0|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  0|  -122.23|   37.88|              41.0|      880.0|         

---

# Writing a file as JSON

In [92]:
counts_rdd.collect()

[('[notice]', 2)]

In [93]:
import json
counts_rdd.map(lambda x: json.dumps(x)).collect()

['["[notice]", 2]']

In [94]:
import json
counts_rdd.map(lambda x: json.dumps(x)).saveAsTextFile('counts_rdd_json2.txt')


## Reading JSON into a key/value RDD

In [101]:
fichero = sc.textFile("counts_rdd_json2.txt")

In [102]:
import json
data = fichero.map(lambda x: json.loads(x))


In [103]:
data.collect()

[['[notice]', 2]]
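JSON has no tuple type, so each pair comes back from `json.loads()` as a two-element list, as the output above shows. If a true key/value RDD of tuples is needed, the lists can be converted back. Below is a minimal sketch of the round trip in plain Python; `to_pair` is a hypothetical helper name.

```python
import json


def to_pair(json_line):
    """Decode one JSON line and return it as a (key, value) tuple."""
    key, value = json.loads(json_line)
    return (key, value)


encoded = json.dumps(('[notice]', 2))
print(encoded)            # ["[notice]", 2]
print(to_pair(encoded))   # ('[notice]', 2)

# With Spark, assuming `fichero` is the RDD read above:
# pairs = fichero.map(to_pair)
```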

# SQL - Writing and Reading via Pandas

In [104]:
import sqlite3
import pandas as pd

In [105]:
conn = sqlite3.connect("database.db")

In [106]:
df.toPandas().to_sql('housing',conn,if_exists='replace', index=False)

20640

In [107]:
df.toPandas().to_sql('housing',conn,if_exists='append', index=False)

20640

In [108]:
df2 = pd.read_sql("select * from  housing;", conn)

In [41]:
len(df2)

41280
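The count of 41280 is exactly twice 20640: the first `to_sql()` call replaces the table, and the second appends the same rows again. The same effect can be reproduced on a tiny scale with an in-memory SQLite database. The three-row frame and the names `conn_demo` and `small` are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(":memory:")
small = pd.DataFrame({"longitude": [-122.23, -122.22, -122.24],
                      "latitude": [37.88, 37.86, 37.85]})

# 'replace' drops any existing table and writes the 3 rows
small.to_sql("housing", conn_demo, if_exists="replace", index=False)
# 'append' adds the same 3 rows again
small.to_sql("housing", conn_demo, if_exists="append", index=False)

row_count = len(pd.read_sql("select * from housing;", conn_demo))
print(row_count)   # 6
conn_demo.close()
```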

# Exercise 4) - read the file database.db (table: housing) into Spark

# Exercise 5 - due by 10:25 on April 9


Download ReadingWrittingFilesWithSpark_nosolution.ipynb and exercise_files.zip from the Blackboard section __Semana6-Reading Writting Files - Spark D__, import the notebook and the (unzipped) files from the zip into Spark (on Ubuntu), and do the following for every file in exercise_files.zip:

1. read the file using the corresponding Spark command
1. print the first 3 lines
1. print the total number of lines in the file
