### READING SPARK DATAFRAMES FROM HDFS AND PERFORMING BASIC OPERATIONS

- The goal is simply to get acquainted with PySpark on Jupyter and warm up a bit
- In this workshop we will focus on inferring schemas for a CSV file to give our dataset some structure  


- **Prereqs**  
Make sure you restart the YARN ResourceManager before you start a Jupyter PySpark session   
(just in case some previous Spark sessions are still running).  


- CAUTION: be careful with this command. It will also kick your pair colleague out of any active sessions,
  so please coordinate with them to avoid that.  
  
  
  In a terminal, type:  
    `sudo su`  
    `cd`  
    `systemctl restart hadoop-yarn-resourcemanager`  


- **Explore the DataSets :**  


- Data source: 
  [Bureau of Transportation Statistics - Airline Traffic](https://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)  
  
- Take a couple of minutes to familiarize yourself with the dataset  

- The dataset files have been preloaded on your cluster for your convenience  

- Check the data files on HDFS :  
    `hdfs dfs -ls /user/root/data/BOTS/CSV`  
    `hdfs dfs -ls /user/root/data/BOTS_REF/CSV`  
    
    
<br>


- **Start Spark Application**   

This simply starts the Spark application and retrieves an Application ID.

In [None]:
spark

- **Define the HDFS Data Directory**

In [None]:
print('hello pyspark kernel')

In [None]:
BOTS_DIR_HDFS = "hdfs:///user/root/data/BOTS/CSV"

- **Load Dataframe from HDFS**<br>
- Load the whole directory

In [None]:
df = spark.read.format("csv").load(BOTS_DIR_HDFS)

- **Count DataFrame Records**

In [None]:
df.count()

- **Print Default DataFrame Schema**
- Explore the Schema 
- What can you say about it ? What are your thoughts ?
- Is this Schema ok ?

In [None]:
df.printSchema()

- **Load Dataframe : Add Header Option**<br>


In [None]:
df = spark.read.format("csv").option("header", "true").load(BOTS_DIR_HDFS)

- Did you notice any change in execution time compared to the load without the header option?  



- **Print DataFrame Schema**
- Explore the Schema. What changes do you notice ?
- How does this Schema compare to the previous one ?

In [None]:
df.printSchema()

- **Load Dataframe : Add inferSchema Option**<br>
- What changes do you notice regarding execution time ?
- What are your guesses about that ?

In [None]:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(BOTS_DIR_HDFS)

- **Print DataFrame Schema**<br>
- What can you say about this Schema ?
- Is it better than the previous one ?
- Is it perfect ?
- Why, in your opinion, does it take so much time?
- What is actually happening? How is Spark reading the data to infer the schema?

In [None]:
df.printSchema()
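
As a hint for the last two questions: with `inferSchema` enabled, Spark has to read the CSV data to decide on column types before the DataFrame is usable. The sketch below is a simplified pure-Python model of that idea, not Spark's actual implementation. The helper names, the three-type ordering, and the sample columns `Year` and `DepDelay` are illustrative assumptions.

```python
import csv
import io

# Simplified type ordering: each type can represent every type before it.
WIDENING = ["int", "double", "string"]


def infer_type(value):
    """Return the narrowest of 'int', 'double', 'string' that fits one raw value."""
    for name, cast in (("int", int), ("double", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "string"


def infer_schema(lines):
    """Scan every row and widen each column's type whenever a value requires it."""
    reader = csv.reader(lines)
    header = next(reader)
    types = ["int"] * len(header)  # start with the narrowest assumption
    for row in reader:  # the whole input is read before any type is final
        for i, value in enumerate(row):
            candidate = infer_type(value)
            if WIDENING.index(candidate) > WIDENING.index(types[i]):
                types[i] = candidate
    return dict(zip(header, types))


# Hypothetical three-column sample standing in for the real CSV files.
sample = io.StringIO(
    "Year,DepDelay,OriginStateName\n"
    "2019,5,Texas\n"
    "2019,-3.5,Ohio\n"
)
schema = infer_schema(sample)
print(schema)
```

Note that a single late value (`-3.5` here) changes the type of the whole column, which is why a type cannot be settled from the header or the first row alone.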

- **Let's try some simple transformations / actions**<br>

In [None]:
df_state = df.select("OriginStateName").distinct()

- How much time did that take ?
- What did Spark do at this point ?
- What is the nature of the distinct operation ?
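
To compare timings between cells (here and in the earlier load steps), one option is a small timing helper. This is a minimal sketch: `timed` is a hypothetical helper written for this workshop, not part of PySpark, and the workload inside the `with` block is only a stand-in so that the snippet runs anywhere.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload. In the notebook you would instead wrap a line such as
# df.select("OriginStateName").distinct() or df_state.show().
with timed("stand-in workload"):
    total = sum(range(1_000_000))
```

Wrapping the transformation and the action in separate `timed` blocks gives you two numbers to compare when answering the questions in this section.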

In [None]:
df_state.show()

- Why does the show operation take more time than the distinct ?
- What is the nature of the show operation ?

In [None]:
df_state.count()

#### Monitoring Jobs, Stages, & Tasks

The training markdown guide includes a monitoring guide.   
Follow the instructions there to track the jobs, stages, tasks, and DAGs generated by your operations 
on the Spark Web UI.

##### Conclusion : 

As you probably noticed,   
we made some progress by inferring the schema, but we still lack sound schema information.   
What are the benefits of having a more robust schema?  
Can you list some? From a business standpoint? From a technical standpoint?   


Later we will explore a few options to give our data a better structure.  

##### Bonus Exercise : 
*(if time permits)*

Use the reference data provided in the BOTS_REF data folder in HDFS.  
Try a few sensible joins and filters, play around, and display the resulting datasets.   
This will be a convenient warmup for tomorrow's Spark SQL analytics workshops.  
It is also a good occasion to build transformations and fire some actions, and to watch DAG creation and execution.   

Useful examples for data exploration, filtering, and joining:   
https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/  
https://sparkbyexamples.com/pyspark/pyspark-where-filter/  
https://sparkbyexamples.com/pyspark/pyspark-groupby-explained-with-example/  
https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/  