# Read CSV File With Spark

## Overview

This section illustrates how to read a CSV file into Spark. Despite some disadvantages, CSV files are ubiquitous
in data science because of their simplicity.

## Read CSV file with Spark

In this example, we will read a CSV file into Spark and compute some basic statistics on the loaded data. Specifically, download
the DC_Taxi_2015 data from https://opendata.dc.gov/explore?categories=%2Fcategories%2Ftransportation&query=taxi&type=document%20link.
Note that the script below reads the file with ```|``` as the field delimiter.

```
"""Loads a CSV file into Spark.

This application is meant to be submitted to
Spark for execution.
"""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pathlib import Path
import sys


APP_NAME="LOAD_CSV_FILE_TO_SPARK"

if __name__ == '__main__':

    if len(sys.argv) != 2:
        print("Usage: filename <file>", file=sys.stderr)
        # stop here; without the file argument there is nothing to load
        sys.exit(1)

    # get a spark session
    spark = SparkSession.builder.appName(APP_NAME).getOrCreate()


    # the file to read is given on the command line
    filename = Path(sys.argv[1])
    
    print(f"Loading file {filename}")

    # read the file into a Spark DataFrame.
    # the schema is inferred, the first line is treated
    # as a header and the fields are separated by '|'
    csv_df = (spark.read.format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .option("delimiter", "|")
              .load(str(filename)))

    
    
    # how many rows the file has
    n_total_rows = csv_df.count()

    # let's see the top 10 rows of the DataFrame
    csv_df.show(n=10, truncate=False)
    
    print(f"Total number of rows {n_total_rows}")
    
    spark.stop()
    
```

Execute the script using the supplied bash script file. Make sure you set the ```<SPARK-PATH>``` and the ```<DATA-PATH>``` so that
they match your file system. Executing the script should produce output similar to the following:

```
23/10/07 11:11:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/10/07 11:11:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Loading file /home/alex/qi3/qi3_notes/mlops/data/taxi_2015_01.txt
23/10/07 11:11:48 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+--------+--------+------------+----------+--------------+---------------+---------------+----------+-----------+-----------+----------+-----------+---------+---------------+----------------+--------------+-------+--------+---------------------+----------------------+----------------+--------------------------+---------------------------+---------------------+-------+-----------------+----------------------+
|OBJECTID|TRIPTYPE|PROVIDERNAME|FAREAMOUNT|GRATUITYAMOUNT|SURCHARGEAMOUNT|EXTRAFAREAMOUNT|TOLLAMOUNT|TOTALAMOUNT|PAYMENTTYPE|ORIGINCITY|ORIGINSTATE|ORIGINZIP|DESTINATIONCITY|DESTINATIONSTATE|DESTINATIONZIP|MILEAGE|DURATION|ORIGIN_BLOCK_LATITUDE|ORIGIN_BLOCK_LONGITUDE|ORIGIN_BLOCKNAME|DESTINATION_BLOCK_LATITUDE|DESTINATION_BLOCK_LONGITUDE|DESTINATION_BLOCKNAME|AIRPORT|ORIGINDATETIME_TR|DESTINATIONDATETIME_TR|
+--------+--------+------------+----------+--------------+---------------+---------------+----------+-----------+-----------+----------+-----------+---------+---------------+----------------+--------------+-------+--------+---------------------+----------------------+----------------+--------------------------+---------------------------+---------------------+-------+-----------------+----------------------+
|1       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 05:00 |01/05/2015 05:00      |
|2       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 11:00 |01/05/2015 11:00      |
|3       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 09:00 |01/05/2015 09:00      |
|4       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 08:00 |01/05/2015 08:00      |
|5       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 04:00 |01/05/2015 04:00      |
|6       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |4          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 05:00 |01/05/2015 05:00      |
|7       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 06:00 |01/05/2015 06:00      |
|8       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |4          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 08:00 |01/05/2015 08:00      |
|9       |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |2          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 08:00 |01/05/2015 08:00      |
|10      |3       |Transco     |33.00     |0.0           |0.0            |0.0            |0.0       |33.0       |4          |NULL      |NULL       |0        |NULL           |NULL            |0             |NULL   |NULL    |NULL                 |NULL                  |NULL            |NULL                      |NULL                       |NULL                 |N      |01/05/2015 09:00 |01/05/2015 09:00      |
+--------+--------+------------+----------+--------------+---------------+---------------+----------+-----------+-----------+----------+-----------+---------+---------------+----------------+--------------+-------+--------+---------------------+----------------------+----------------+--------------------------+---------------------------+---------------------+-------+-----------------+----------------------+
only showing top 10 rows

Total number of rows 1307

```
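If the file marks missing values with the literal text ```NULL```, schema inference may fall back to a string type for a column that is otherwise numeric. The following is a minimal sketch of a reader that tells Spark to treat that text as a real null. The helper name ```read_pipe_delimited_csv``` and its ```null_token``` parameter are hypothetical, and the assumption that ```NULL``` is the marker used in this data set is mine; ```nullValue``` is an option of Spark's CSV reader.

```python
def read_pipe_delimited_csv(spark, path, null_token="NULL"):
    """Read a '|'-delimited file with a header line into a Spark DataFrame.

    spark      -- an existing SparkSession
    path       -- location of the file to read
    null_token -- text that marks a missing value in the file
                  (an assumption about the data, adjust as needed)
    """
    return (spark.read.format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("delimiter", "|")
            # treat the marker text as a null instead of a string
            .option("nullValue", null_token)
            .load(str(path)))
```

In the script above this would replace the reader call, for example ```csv_df = read_pipe_delimited_csv(spark, filename)```. Calling ```csv_df.printSchema()``` afterwards shows which type Spark settled on for each column.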

Notice that, for the run shown above, I have reduced the logging that Spark performs to the ```WARN``` level. The following
snippet extends the script and computes some basic statistics on the ```FAREAMOUNT``` column.

```
    print(f"Total number of rows {n_total_rows}")

    # let's calculate the sum and the mean of the FAREAMOUNT column.
    # rows without a usable fare do not contribute to the sum but are
    # still counted in n_total_rows, so this mean is only exact
    # when no fare is missing
    total_fareamount_sum = csv_df.select(F.sum("FAREAMOUNT")).collect()[0][0]
    print(f"Total fareamount sum {total_fareamount_sum}")
    print(f"Mean value is {total_fareamount_sum / n_total_rows}")

    # collect the values on the driver and do the statistics with numpy
    # (this needs `import numpy as np` next to the other imports).
    # this is not done in parallel
    fare_values = csv_df.select("FAREAMOUNT").collect()

    # collect() returns a Python list of Row objects
    print(type(fare_values))

    # drop the missing values (None or the string 'NULL')
    fare_values = [float(row['FAREAMOUNT']) for row in fare_values
                   if row['FAREAMOUNT'] not in (None, 'NULL')]
    print(fare_values[0:10])

    print(f"Mean value {np.mean(fare_values)}")
    print(f"Variance value {np.var(fare_values)}")

    # we don't need Spark anymore. We can't call Spark
    # functionality past this point
    spark.stop()
```

Notice how we select all the values in the relevant column. Using ```numpy``` means that we first need to clean the data.
Of course, Spark can compute the mean and variance for us. This is shown below.

```
...
csv_df.select(F.mean("FAREAMOUNT"), F.variance("FAREAMOUNT")).show()
...
```
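The numbers from the two approaches need not agree. Spark's aggregate functions skip nulls, so ```F.mean``` divides by the number of usable values, whereas dividing the sum by the total row count includes the rows with missing fares. In addition, ```F.variance``` is an alias for the sample variance (it divides by n - 1), while ```np.var``` computes the population variance by default (it divides by n). The sketch below illustrates both effects with the standard library only; the fare values are made up for illustration and do not come from the data set.

```python
import statistics

# hypothetical raw values, as they might come back from collect()
raw_fares = ["33.00", "NULL", "12.50", None, "7.25"]

# keep only the usable values and convert them to float
fares = [float(v) for v in raw_fares if v not in (None, "NULL")]

# sum divided by the number of ALL rows, missing ones included
mean_all_rows = sum(fares) / len(raw_fares)
# sum divided by the number of usable values only
mean_valid_rows = statistics.mean(fares)

# population variance: divides by n (the default behaviour of np.var)
population_variance = statistics.pvariance(fares)
# sample variance: divides by n - 1 (what Spark's F.variance computes)
sample_variance = statistics.variance(fares)

print(f"Mean over all rows      {mean_all_rows}")
print(f"Mean over valid rows    {mean_valid_rows}")
print(f"Population variance     {population_variance}")
print(f"Sample variance         {sample_variance}")
```

To make the ```numpy``` result comparable with Spark's, either call ```np.var(fare_values, ddof=1)``` or use ```F.var_pop``` on the Spark side.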

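## Summary

This section showed how to read a delimited text file into a Spark DataFrame with an inferred schema, how to display and count its rows,
and how to compute basic statistics on the ```FAREAMOUNT``` column, both on the driver with ```numpy``` and within Spark.
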
## References

1. Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, _Learning Spark: Lightning-Fast Data Analytics_, 2nd Edition, O'Reilly.