## Previewing airlines data
Let us preview the airlines data to understand more about it.

* As the data set has too many files to preview quickly, we will process only ten of them.
* HDFS path pattern: `/public/airlines_all/airlines/part-0000*`
* `spark.read.csv` returns an object of type Data Frame.
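
To see why the `part-0000*` pattern selects exactly ten files, here is a minimal sketch using Python's `fnmatch` module. It assumes the files follow the `part-NNNNN` naming convention; the list of twenty candidate names is made up for illustration, and HDFS glob rules are only approximated here (for a plain `*` they behave the same way).

```python
from fnmatch import fnmatchcase

# Hypothetical candidate names following the part-NNNNN convention.
candidates = [f"part-{i:05d}" for i in range(20)]

# Keep only the names that match the wildcard pattern.
matched = [name for name in candidates if fnmatchcase(name, "part-0000*")]

print(len(matched))             # 10
print(matched[0], matched[-1])  # part-00000 part-00009
```

Names such as `part-00010` are skipped because their first nine characters are `part-0001`, not `part-0000`.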

Let us start spark context for this Notebook so that we can execute the code provided.

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL through one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [2]:
airlines_schema = spark.read. \
    csv("/public/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema

In [3]:
airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part-0000*",
        header=True
       )

A Data Frame will have structure or schema.

* We can print the schema using `airlines.printSchema()`.
* We can preview the data using `airlines.show()`. By default it shows 20 records, and long column values might be truncated for readability.
* We can review the details of `show` using `help(airlines.show)`.
* We can pass a custom number of records along with `truncate=False` to see the complete values of all the records requested. This lets us preview all the columns for the desired number of records.
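
The truncation that `show` applies can be approximated in plain Python. This is only a sketch of the rule as I understand it (values longer than 20 characters are cut to 17 characters plus `...`), not Spark's own code; `truncate_cell` is a hypothetical helper name.

```python
def truncate_cell(value, truncate=True):
    """Approximate how show() shortens one cell value."""
    # truncate=True means a width of 20; False (or 0) disables truncation.
    width = 20 if truncate is True else int(truncate)
    text = str(value)
    if width > 0 and len(text) > width:
        # Very small widths leave no room for the "..." suffix.
        return text[:width] if width < 4 else text[:width - 3] + "..."
    return text


print(truncate_cell("a" * 25))                  # 17 characters followed by ...
print(truncate_cell("a" * 25, truncate=False))  # all 25 characters
print(truncate_cell("ATL"))                     # short values are unchanged
```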




In [None]:
airlines.show(100, truncate=False)

* We can get the number of records or rows in a Data Frame using `airlines.count()`.
* In a Databricks notebook, we can use `display` to preview the data with its visualization feature.
* We can perform all kinds of standard transformations on our data. To do so, we need good knowledge of the functions on Data Frames as well as the functions on columns.
* Let us also check whether our data has duplicates. If it does, we will remove them while reorganizing the data later.


In [6]:
%%sh

hdfs dfs -ls /public/airlines_all/airlines/part-0000*

-rw-r--r--   2 hdfs supergroup   67108879 2021-01-28 08:56 /public/airlines_all/airlines/part-00000
-rw-r--r--   2 hdfs supergroup   67108862 2021-01-28 09:34 /public/airlines_all/airlines/part-00001
-rw-r--r--   2 hdfs supergroup   67108930 2021-01-28 07:44 /public/airlines_all/airlines/part-00002
-rw-r--r--   2 hdfs supergroup   67108804 2021-01-28 10:44 /public/airlines_all/airlines/part-00003
-rw-r--r--   2 hdfs supergroup   67108908 2021-01-28 08:01 /public/airlines_all/airlines/part-00004
-rw-r--r--   2 hdfs supergroup   67108890 2021-01-28 10:51 /public/airlines_all/airlines/part-00005
-rw-r--r--   2 hdfs supergroup   67108825 2021-01-28 11:02 /public/airlines_all/airlines/part-00006
-rw-r--r--   2 hdfs supergroup   67108880 2021-01-28 09:12 /public/airlines_all/airlines/part-00007
-rw-r--r--   2 hdfs supergroup   67108832 2021-01-28 08:48 /public/airlines_all/airlines/part-00008
-rw-r--r--   2 hdfs supergroup   67108857 2021-01-28 08:53 /public/airlines_all/airlines/part-00009


In [7]:
airlines = spark.read. \
    schema(airlines_schema). \
    csv("/public/airlines_all/airlines/part-0000*",
        header=True
       )

In [8]:
airlines.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

In [9]:
airlines.show()

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|   12|    

In [10]:
airlines.show(100, truncate=False)

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|12   |20  

In [11]:
airlines.count()

6489231

In [12]:
airlines.distinct().count()

6489146
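
The two counts differ, so the data contains duplicate rows. A minimal sketch of the comparison as a reusable helper follows; `duplicate_summary` is a hypothetical name, and the function only assumes that its argument offers `count()` and `distinct()` the way a Data Frame does.

```python
def duplicate_summary(df):
    """Compare the total row count with the distinct row count."""
    total = df.count()
    unique = df.distinct().count()
    # surplus = extra copies that distinct() would drop,
    # not the number of different rows that are duplicated.
    return {"total": total, "distinct": unique, "surplus": total - unique}
```

With the counts shown above, `duplicate_summary(airlines)` would report a surplus of 6489231 - 6489146 = 85 rows.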