# Read data with Spark

This notebook is a refresher on the `DataFrameReader` topics, that is, how to read data with Spark.

Candidates are expected to know how to:

* Read data for the “core” data formats (CSV, JSON, JDBC, ORC, Parquet, text and tables)
* Configure options for specific formats
* Read data from non-core formats using `format()` and `load()`
* Specify a DDL-formatted schema
* Construct and specify a schema using the `StructType` classes

## Read data for the “core” data formats (CSV, JSON, JDBC, ORC, Parquet, text and tables)

To read data with Spark, you use the [DataFrameReader](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) object, which is exposed by the Spark session. In Databricks, where the session is already available as `spark`, you access the `DataFrameReader` by typing `spark.read`.

In [3]:
spark.read

If you run the cell above, it displays the `DataFrameReader` object, but nothing is read until you tell it which type of data you want to read and where that data lives.

The "core" data formats have dedicated methods built into `DataFrameReader`, each exposing options for handling data in that specific format. Let's read each type of data now.
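As a rough sketch of what "built in" means here: a core format such as CSV can be read either through its dedicated method or through the generic `format()` and `load()` pair. The helper name `read_csv_two_ways` is made up for illustration, and `spark` is assumed to be the session a Databricks notebook already provides.

```python
def read_csv_two_ways(spark, path):
    """Read the same CSV file through both DataFrameReader entry points.

    `read_csv_two_ways` is a hypothetical helper, not part of Spark.
    `spark` is assumed to be an existing SparkSession.
    """
    # Dedicated method: the format is part of the method name.
    via_method = spark.read.csv(path)

    # Generic form: name the format as a string, then load the path.
    via_generic = spark.read.format("csv").load(path)

    return via_method, via_generic
```

In a notebook this could be called as `read_csv_two_ways(spark, "/databricks-datasets/flights/departuredelays.csv")`; both results should describe the same data.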

### Reading CSV data

First we will read a CSV (comma-separated values) file: a file (or a set of files) with data **organized in rows** whose fields are separated by a **delimiter** (which, in my experience, is never a comma). It may or may not have a **header row**, and the values may or may not be wrapped in **enclosing quotes**.
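To make those three moving parts concrete without needing a cluster, here is a small sketch using Python's standard `csv` module on a made-up sample (the cities and notes are invented for illustration). In Spark's CSV reader the same knobs are exposed as the `sep`, `header` and `quote` options.

```python
import csv
import io

# A tiny, invented sample: semicolon-delimited, with a header row and
# quoted values. Note the semicolon *inside* the quoted second field.
sample = 'city;note\n"Sao Paulo";"big; busy"\n"Recife";"coastal"\n'

# delimiter ~ Spark's `sep`, quotechar ~ Spark's `quote`.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar='"')

# Treating the first row as column names is what Spark's `header` does.
header, *rows = list(reader)

print(header)  # ['city', 'note']
print(rows)    # [['Sao Paulo', 'big; busy'], ['Recife', 'coastal']]
```

The enclosing quotes are what keep `big; busy` together as a single field even though it contains the delimiter.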

In [6]:
import glob

glob.glob("/dbfs/databricks-datasets/*/*.csv")

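For these reads, we will use the Databricks sample datasets located in `/databricks-datasets` when available. There are plenty of CSV files for this exercise, so we will pick `/dbfs/databricks-datasets/flights/departuredelays.csv`. The `/dbfs` prefix is how local tools such as `glob` and `%sh` see DBFS; Spark itself takes the path without it.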

Let's check what this file looks like:

In [8]:
%sh head -5 /dbfs/databricks-datasets/flights/departuredelays.csv

It's a file delimited by commas (miracle!) with a header and no enclosing quotes. So, let's transform it into a Spark DataFrame with [read.csv](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv)!

In [10]:
delay_filepath = '/databricks-datasets/flights/departuredelays.csv'

delays = spark.read.csv(delay_filepath)

display(delays.limit(5))

_c0,_c1,_c2,_c3,_c4
date,delay,distance,origin,destination
01011245,6,602,ABE,ATL
01020600,-8,369,ABE,DTW
01021245,-2,602,ABE,ATL
01020605,-4,602,ABE,ATL


Nice, we have just created a `DataFrame` from a CSV file, but some things aren't quite right here:

* The header was treated as a data row during the read
* All fields are strings
* The `date` field looks really weird

To fix this, you need to pass extra parameters to `read.csv` that tell Spark about these details.

In [12]:
# Reading again, with extra params

delays_done_right = spark.read.csv(delay_filepath, inferSchema=True, header=True)

display(delays_done_right.limit(3))

date,delay,distance,origin,destination
1011245,6,602,ABE,ATL
1020600,-8,369,ABE,DTW
1021245,-2,602,ABE,ATL


Almost done: the column names now come from the header row thanks to `header=True`, and the data types were inferred thanks to `inferSchema=True`.
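Inference is convenient, but it costs an extra pass over the file and it guessed `date` as an integer, which is why the leading zero disappeared in the output above. One alternative is to hand Spark a DDL-formatted schema string. The sketch below is only that, a sketch: the column names come from the file's header row, the types are my assumption based on the sample rows, and `read_delays_with_schema` is a made-up helper name.

```python
# Column names come from the header row; the types are an assumption
# based on the sample rows. `date` is kept as a string so that the
# leading zero is preserved.
DELAYS_DDL = (
    "date STRING, delay INT, distance INT, "
    "origin STRING, destination STRING"
)


def read_delays_with_schema(spark, path, ddl=DELAYS_DDL):
    """Hypothetical helper: read the CSV with an explicit DDL schema.

    `spark` is assumed to be an existing SparkSession.
    """
    # header=True is still needed so the first line is not read as data.
    return spark.read.csv(path, header=True, schema=ddl)
```

In the notebook this would be called as `read_delays_with_schema(spark, delay_filepath)`.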

> The date is still weird. That's because the year was omitted from the data (and because `inferSchema` now reads it as an integer, which drops the leading zero), but this is a problem for another notebook.
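Just to show what the value seems to encode, here is a plain-Python sketch. It assumes the layout is month, day, hour, minute (`MMddHHmm`), which is my reading of the sample rows rather than something the file documents, and the year is a placeholder because the data carries none.

```python
from datetime import datetime

ASSUMED_YEAR = 2014  # placeholder: the file itself has no year


def parse_delay_date(raw, year=ASSUMED_YEAR):
    """Turn a value like 1011245 or '01011245' into a datetime.

    Hypothetical helper; assumes an MMddHHmm layout.
    """
    # An inferred integer loses its leading zero, so pad back to 8 digits.
    padded = str(raw).zfill(8)
    return datetime.strptime(f"{year}{padded}", "%Y%m%d%H%M")


print(parse_delay_date(1011245))  # 2014-01-01 12:45:00
```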

Time to check other formats.