In [1]:
%%HTML
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Quicksand:300,700" />
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Fira Code" />
<link rel="stylesheet" type="text/css" href="rise.css">

In [None]:
data_dir = '../data'
master = 'local[2]'

import os
import pyspark
import pyspark.sql.functions as sf

spark = (
    pyspark.sql.SparkSession.builder
    .master(master)
    .getOrCreate()
)
spark

# Reading data in Spark

![footer_logo_new](images/logo_new.png)

In this chapter, we'll cover the following topics:
+ How to read data using Spark
+ HDFS and Spark
+ Data compression

## Reading data into dataframes
Most input operations are under `spark.read`, for example:

- `spark.read.csv()`: CSV.
- `spark.read.json()`: JSON.
- `spark.read.parquet()`: Parquet.
- `spark.read.table()`: Hive table.

All file-based methods accept a single file, a wildcard, or one or more folders containing files.
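
As a minimal sketch of these path styles, the snippet below writes a few tiny CSV files to a temporary folder and builds a wildcard that matches them. The file names, the columns and the helper `read_csv_inputs` are made up for illustration; the helper only forwards its argument to `spark.read.csv()`.

```python
import csv
import glob
import os
import tempfile


def write_sample_csvs(folder, n_files=3):
    """Write a few tiny CSV files (illustrative data only) and return their paths."""
    paths = []
    for i in range(n_files):
        path = os.path.join(folder, f'part-{i}.csv')
        with open(path, 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(['id', 'value'])
            writer.writerow([i, i * 10])
        paths.append(path)
    return paths


def read_csv_inputs(spark, source):
    """Hypothetical helper: `source` may be a file, a wildcard, a folder, or a list of paths."""
    return spark.read.csv(source, header=True)


sample_dir = tempfile.mkdtemp()
sample_paths = write_sample_csvs(sample_dir)
wildcard = os.path.join(sample_dir, 'part-*.csv')

# Check that the wildcard matches exactly the files we just wrote.
print(sorted(glob.glob(wildcard)) == sorted(sample_paths))
```

With the session created at the top of this notebook, `read_csv_inputs(spark, wildcard)`, `read_csv_inputs(spark, sample_dir)` and `read_csv_inputs(spark, sample_paths)` should each return one dataframe that combines all three files.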

An example of `spark.read.csv()`:

In [None]:
chicago_path = os.path.join(data_dir, 'chicagoCensus.csv')
chicago = spark.read.csv(chicago_path, header=True)
chicago.printSchema()

An example of `spark.read.parquet()`:

In [None]:
airlines_path = os.path.join(data_dir, 'airlines.parquet/')  # Folder
airlines = spark.read.parquet(airlines_path)
airlines.printSchema()
# airlines.limit(5).toPandas()

###  `read()` and locality

Depending on how you run Spark, it will read files from different places:

* YARN client mode: `/data/` is the HDFS folder `/data/`.
    * Use `file:///data/` (three slashes) to load a local file.
* Local mode: `/data/` is the folder `/data` on your machine.
    * Use `hdfs://<namenode-host>/data/` to load an HDFS file; the part right after `hdfs://` names the NameNode, not a folder.
* Any mode:
    * Use `gs://bucket/` to load a file from Google Cloud Storage.
    * Or `s3a://bucket/` to load a file from Amazon's S3 storage.
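
To see why the number of slashes matters, here is a small sketch that uses Python's `urllib.parse.urlparse` as a stand-in for a generic URI parser. It is not what Spark calls internally, and `describe` is a hypothetical helper, but the split into scheme, authority (host or bucket) and path follows the same general URI rules.

```python
from urllib.parse import urlparse


def describe(path):
    """Split a path into (scheme, authority, path) using generic URI rules."""
    parts = urlparse(path)
    return parts.scheme, parts.netloc, parts.path


examples = [
    '/data/',                # no scheme: resolved against the default filesystem
    'file:///data/',         # empty authority, path is /data/
    'file://data/',          # "data" is parsed as the authority, not as a folder
    'hdfs:///data/',         # empty authority: relies on a configured default NameNode
    's3a://bucket/key.csv',  # the bucket takes the authority position
]

for example in examples:
    print(example, '->', describe(example))
```

Anything between `//` and the next `/` is read as a host or bucket name, which is why `file://data/` does not point at the local folder `/data/`.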

## HDFS
- Distributed filesystem:
    - NameNode (master)
        - Stores metadata on filesystem.
        - Controls file permissions.
        - Executes changes on filesystem.
    - DataNodes
        - Store actual data (blocks).
        - Execute read and write operations.

![](images/hdfs.png)

## Why HDFS?

Advantages:

* Distributed load - avoid high disk/network load on single system.
* Data locality - bring compute to the data.
* High reliability - replication across multiple systems.

Drawbacks:

* Inefficient with many small files.
* Security, latency, and ease of use can all be pain points.
* Combined data storage + compute = always running cluster.

## Other options: blob storage

Examples: Amazon S3, Google Cloud Storage.

Features:
* High durability, scalability, availability.
* Low storage costs + additional cost options.
* Strong support for security + auditing.

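Compared to HDFS, blob storage makes it easy to separate storage from compute. The price is that there is no data locality: data always travels over the network to the compute nodes.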

## Compression

Advantages:
* Helps reduce file sizes, especially for text-based file formats like CSV, JSON.
* Less network load + storage costs.

Drawbacks:
* Increased CPU overhead for compressing/decompressing files.
* Non-splittable compression formats cannot be read in blocks (requiring the entire file to be read).
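
The size reduction for text-based formats is easy to check with Python's standard library. The sketch below compresses some made-up, CSV-like text with gzip and bzip2 and prints the resulting sizes; the actual ratios depend heavily on the data, so treat the numbers as an illustration only.

```python
import bz2
import gzip

# Illustrative CSV-like text: a header plus many similar rows.
rows = ['id,city,value'] + [f'{i},Chicago,{i % 7}' for i in range(10_000)]
raw = '\n'.join(rows).encode('utf-8')

gzipped = gzip.compress(raw)
bzipped = bz2.compress(raw)

for name, blob in [('raw', raw), ('gzip', gzipped), ('bzip2', bzipped)]:
    print(f'{name:>5}: {len(blob):>8} bytes')

# Decompression restores the original bytes exactly (at the cost of CPU time).
restored_gzip = gzip.decompress(gzipped)
restored_bzip2 = bz2.decompress(bzipped)
print(restored_gzip == raw and restored_bzip2 == raw)
```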

## Splittability

Compressed files are splittable if file blocks can be read without decompressing the preceding blocks.

Examples:
* Splittable formats - bzip2, LZO (when indexed)
* Non-splittable formats - gzip, snappy

Note that splittability also depends on the file format. For example, Avro and Parquet compress data in blocks inside the file, so they remain splittable even with gzip or snappy compression.
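
The rules above can be written down as a rough lookup. The helper `looks_splittable` and its extension sets are hypothetical and only encode the lists in this section; real behaviour depends on the codecs and input formats available on your cluster (LZO, for instance, typically needs an index before it can be split).

```python
import os

# Assumptions taken from the lists in this section, not from Spark itself.
CONTAINER_FORMATS = {'.parquet', '.avro'}   # compress per block inside the file
SPLITTABLE_CODECS = {'.bz2', '.lzo'}
NON_SPLITTABLE_CODECS = {'.gz', '.snappy'}


def looks_splittable(filename):
    """Guess from the last file extension whether a file can be read in blocks."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in CONTAINER_FORMATS:
        return True   # the codec is applied inside the container format
    if extension in SPLITTABLE_CODECS:
        return True
    if extension in NON_SPLITTABLE_CODECS:
        return False  # one reader has to process the whole file
    return True       # treated as uncompressed text


for name in ['data.csv', 'data.csv.gz', 'data.csv.bz2', 'part-0000.snappy.parquet']:
    print(name, '->', looks_splittable(name))
```

Note how `part-0000.snappy.parquet` counts as splittable even though snappy on its own is not: only the outermost format decides.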

# Summary

In this chapter, we covered:
+ How you can read data from different formats using Spark
+ What kind of storage you can use with Spark
+ The trade-offs involved in compressing data to reduce storage