## Starting Spark Context

>**SparkSession** is a class in the pyspark.sql package.

>It is a wrapper on top of Spark Context.

>When a Spark application is launched using *spark-submit*, *spark-shell*, or *pyspark*, a web service called *Spark Context* is started.

>**Spark Context** maintains the context of all the jobs that are submitted until it is killed.

>We first need to create a SparkSession object. It can have any name, but by convention we use spark. Once it is created, several APIs are exposed, including read.

>While creating the SparkSession object, we should set at least the application name and the execution mode in which Spark Context should run.

>We can use appName to specify the name of the application and master to specify the execution mode. If master is not set in code, it is taken from the launch configuration (for example, the --master option of spark-submit).



In [1]:
from pyspark.sql import SparkSession

In [2]:
import getpass
username = getpass.getuser()

In [3]:
spark = SparkSession.\
    builder.\
    config('spark.ui.port', '0').\
    enableHiveSupport().\
    appName('Python - Data Processing - Overview').\
    getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
%load_ext sparksql_magic
%load_ext sql

## Overview of Spark read APIs

- Spark has a bunch of APIs to read data from files of different formats.

- All APIs are exposed under spark.read

  - **text** - to read single-column data from text files, or to read each whole text file as one record.
  - **csv** - to read text files with delimiters. The default delimiter is a comma, but we can use other delimiters as well.
  - **json** - to read data from JSON files.
  - **orc** - to read data from ORC files.
  - **parquet** - to read data from Parquet files.
  - We can also read data from other file formats by plugging in the required package and using ***spark.read.format***.

- We can also pass options based on the file formats.
  - **inferSchema** - to infer the data types of the columns based on the data.
  - **header** - to use the first line as the header that provides the column names, in case of delimited text files.
  - **schema** - to explicitly specify the schema.

- We can get help on APIs such as spark.read.csv using help(spark.read.csv).

Reading delimited data from text files.

In [5]:
spark.read.csv('/data/retail_db/orders',
        schema='''
            order_id INT, 
            order_date STRING, 
            order_customer_id INT, 
            order_status STRING
        '''
       ). \
    show(5)

[Stage 0:>                                                          (0 + 1) / 1]

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



                                                                                

In [6]:
#spark.read.json?

### Examples with airlines data

- We can use spark.read.text on one of the files to preview the data and determine the following:

    >Whether a header is present in the files.

    >The field delimiter that is being used.



In [7]:
airlines = spark.read.text("/data/airtrafficdata/airlines_all/airlines/part-00000")

In [8]:
type(airlines)

pyspark.sql.dataframe.DataFrame

In [9]:
#help(airlines.show)

In [10]:
airlines.show(truncate=False)

[Stage 1:>                                                          (0 + 1) / 1]

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Year,Month,Dayo

                                                                                

In [11]:
#help(spark.read.text)

### Inferring Schema

- We can pass a file name pattern to **spark.read.csv** and read all the data in files under ***/data/airtrafficdata/airlines_all/airlines***.
- We can use options such as **header** and **inferSchema** to assign names and data types.
- However, **inferSchema** goes through the entire data to assign the schema. We can use **samplingRatio** to process only a fraction of the data when inferring the schema.
- If the data in all the files has a similar structure, we can get the schema using one file and then apply it to the others.
- In our airlines data, the schema is consistent across all the files, so we can get the schema by going through one file and apply it to the entire dataset.


In [12]:
airlines_part_00000 = spark.read. \
    csv("/data/airtrafficdata/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       )

                                                                                

In [13]:
type(airlines_part_00000)

pyspark.sql.dataframe.DataFrame

In [14]:
airlines_part_00000.show(truncate=False)

[Stage 4:>                                                          (0 + 1) / 1]

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|10   |14  

                                                                                

In [15]:
airlines_part_00000.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

In [16]:
airlines_part_00000.schema

StructType([StructField('Year', IntegerType(), True), StructField('Month', IntegerType(), True), StructField('DayofMonth', IntegerType(), True), StructField('DayOfWeek', IntegerType(), True), StructField('DepTime', StringType(), True), StructField('CRSDepTime', IntegerType(), True), StructField('ArrTime', StringType(), True), StructField('CRSArrTime', IntegerType(), True), StructField('UniqueCarrier', StringType(), True), StructField('FlightNum', IntegerType(), True), StructField('TailNum', StringType(), True), StructField('ActualElapsedTime', StringType(), True), StructField('CRSElapsedTime', IntegerType(), True), StructField('AirTime', StringType(), True), StructField('ArrDelay', StringType(), True), StructField('DepDelay', StringType(), True), StructField('Origin', StringType(), True), StructField('Dest', StringType(), True), StructField('Distance', StringType(), True), StructField('TaxiIn', StringType(), True), StructField('TaxiOut', StringType(), True), StructField('Cancelled', In

In [17]:
type(airlines_part_00000.schema)

pyspark.sql.types.StructType

Get the schema using one file and apply it to the other files.

In [18]:
airlines_schema = spark.read. \
    csv("/data/airtrafficdata/airlines_all/airlines/part-00000",
        header=True,
        inferSchema=True
       ). \
    schema

                                                                                

In [19]:
type(airlines_schema)

pyspark.sql.types.StructType

In [20]:
airlines = spark.read.schema(airlines_schema). \
    csv("/data/airtrafficdata/airlines_all/airlines/part*",
        header=True
       )

In [21]:
#help(airlines)

In [22]:
airlines.show()

[Stage 7:>                                                          (0 + 1) / 1]

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|   10|    

                                                                                

In [23]:
airlines.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

In [24]:
airlines.count()

                                                                                

1290395

In [25]:
airlines.distinct().count()

                                                                                

1290323

### Overview of Data Frame APIs


- Row Level Transformations or Projection of Data can be done using **select**, **selectExpr**, **withColumn**, and **drop** on a Data Frame.
- We typically apply functions from ***pyspark.sql.functions*** on columns using **select** and **withColumn**.
- Filtering is typically done using either **filter** or **where** on a Data Frame.
- We can pass the condition to **filter** or **where** in either SQL Style or Programming Language Style.
- Global Aggregations can be performed directly on the Data Frame.
- By Key or Grouping Aggregations are typically performed using ***groupBy*** followed by aggregate functions in ***agg***.
- We can sort the data in a Data Frame using **sort** or **orderBy**.
- We can use Window Functions for some advanced Aggregations and Ranking.


Creating a Data Frame from the employees collection.

In [26]:
employees = [(1, "Scott", "Tiger", 1000.0, "united states"),
             (2, "Henry", "Ford", 1250.0, "India"),
             (3, "Nick", "Junior", 750.0, "united KINGDOM"),
             (4, "Bill", "Gomes", 1500.0, "AUSTRALIA")
            ]

In [27]:
type(employees)

list

In [28]:
type(employees[0])

tuple

In [29]:
employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, nationality STRING"""
                   )

In [30]:
employeesDF.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- nationality: string (nullable = true)



In [31]:
employeesDF.show()

[Stage 18:>                                                         (0 + 1) / 1]

+-----------+----------+---------+------+--------------+
|employee_id|first_name|last_name|salary|   nationality|
+-----------+----------+---------+------+--------------+
|          1|     Scott|    Tiger|1000.0| united states|
|          2|     Henry|     Ford|1250.0|         India|
|          3|      Nick|   Junior| 750.0|united KINGDOM|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|
+-----------+----------+---------+------+--------------+



                                                                                

Getting employee first name and last name.

In [32]:
employeesDF. \
    select("first_name", "last_name"). \
    show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Scott|    Tiger|
|     Henry|     Ford|
|      Nick|   Junior|
|      Bill|    Gomes|
+----------+---------+



Project all the fields except for Nationality

In [33]:
employeesDF. \
    drop("nationality"). \
    show()

+-----------+----------+---------+------+
|employee_id|first_name|last_name|salary|
+-----------+----------+---------+------+
|          1|     Scott|    Tiger|1000.0|
|          2|     Henry|     Ford|1250.0|
|          3|      Nick|   Junior| 750.0|
|          4|      Bill|    Gomes|1500.0|
+-----------+----------+---------+------+



In [34]:
from pyspark.sql.functions import concat, lit, lpad

In [35]:
employeesDF. \
    withColumn('full_name', concat('first_name', lit(' '), 'last_name')). \
    show()

+-----------+----------+---------+------+--------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|  full_name|
+-----------+----------+---------+------+--------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states|Scott Tiger|
|          2|     Henry|     Ford|1250.0|         India| Henry Ford|
|          3|      Nick|   Junior| 750.0|united KINGDOM|Nick Junior|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA| Bill Gomes|
+-----------+----------+---------+------+--------------+-----------+



### Overview of Functions

- While Data Frame APIs work on the Data Frame, at times we might want to apply functions on column values.
- Functions to process column values are available under pyspark.sql.functions. These are typically used in select or withColumn on top of Data Frame.
- There are approximately 300 pre-defined functions available for us.
- Some of the important functions can be broadly categorized into String Manipulation, Date Manipulation, Numeric Functions and Aggregate Functions.
    - #### String Manipulation Functions
        - Concatenating Strings - concat
        - Getting Length - length
        - Trimming Strings - trim, rtrim, ltrim
        - Padding Strings - lpad, rpad
        - Extracting Strings - split, substring
    - #### Date Manipulation Functions
        - Date Arithmetic - date_add, date_sub, datediff, add_months
        - Date Extraction - dayofmonth, month, year
        - Get beginning period - trunc, date_trunc
    - #### Numeric Functions
        - abs
        - greatest
    - #### Aggregate Functions
        - sum
        - min
        - max


In [36]:
#CONCAT
employeesDF. \
    withColumn("full_name", concat("first_name", lit(", "), "last_name")). \
    drop("first_name", "last_name"). \
    show()

+-----------+------+--------------+------------+
|employee_id|salary|   nationality|   full_name|
+-----------+------+--------------+------------+
|          1|1000.0| united states|Scott, Tiger|
|          2|1250.0|         India| Henry, Ford|
|          3| 750.0|united KINGDOM|Nick, Junior|
|          4|1500.0|     AUSTRALIA| Bill, Gomes|
+-----------+------+--------------+------------+



In [38]:
employeesDF. \
    selectExpr("employee_id",
               "concat(first_name, ', ', last_name) AS full_name",
               "salary",
               "nationality"
              ). \
    show()

+-----------+------------+------+--------------+
|employee_id|   full_name|salary|   nationality|
+-----------+------------+------+--------------+
|          1|Scott, Tiger|1000.0| united states|
|          2| Henry, Ford|1250.0|         India|
|          3|Nick, Junior| 750.0|united KINGDOM|
|          4| Bill, Gomes|1500.0|     AUSTRALIA|
+-----------+------------+------+--------------+



In [39]:
data = [("2019-01-23",1),("2019-06-24",2),("2019-09-20",3)]
df = spark.createDataFrame(data,schema = ["date","increment"])
df.show()

# Increment month of the date
df.selectExpr("date","increment", \
              "add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int)) as inc_date") \
    .show()

from pyspark.sql.functions import expr, col
df.select(col("date"),col("increment"), \
      expr("add_months(to_date(date,'yyyy-MM-dd'),cast(increment as int))").alias("inc_date")) \
    .show()

+----------+---------+
|      date|increment|
+----------+---------+
|2019-01-23|        1|
|2019-06-24|        2|
|2019-09-20|        3|
+----------+---------+

+----------+---------+----------+
|      date|increment|  inc_date|
+----------+---------+----------+
|2019-01-23|        1|2019-02-23|
|2019-06-24|        2|2019-08-24|
|2019-09-20|        3|2019-12-20|
+----------+---------+----------+

+----------+---------+----------+
|      date|increment|  inc_date|
+----------+---------+----------+
|2019-01-23|        1|2019-02-23|
|2019-06-24|        2|2019-08-24|
|2019-09-20|        3|2019-12-20|
+----------+---------+----------+



### Overview of Spark Write APIs

- All APIs are exposed under the **write** attribute of a Data Frame, for example orders.write.

    **text** - to write single column data to text files.

    **csv** - to write to text files with delimiters. Default is a comma, but we can use other delimiters as well.

    **json** - to write data to JSON files

    **orc** - to write data to ORC files

    **parquet** - to write data to Parquet files.

- We can also write data to other file formats by plugging in the required package and using **write.format**, for example avro.

- We can use options based on the file format in which we are writing the Data Frame.

    **compression** - Compression codec (gzip, snappy etc)

    **sep** - to specify delimiters while writing into text files using csv

- We can overwrite existing directories or append to them using **mode**.
- We can control the number of files by using **coalesce**. It has to be invoked on the Data Frame before invoking write.


Creating a copy of the orders data in Parquet file format with no compression. If the folder already exists, overwrite it.

In [40]:
orders = spark. \
    read. \
    csv('/data/retail_db/orders',
        schema='''
            order_id INT, 
            order_date STRING, 
            order_customer_id INT, 
            order_status STRING
        '''
       )

In [41]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [42]:
orders.show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

In [43]:
type(orders)

pyspark.sql.dataframe.DataFrame

In [44]:
orders. \
    write. \
    parquet('/workarea/orders', 
            mode='overwrite', 
            compression='none'
           )

                                                                                

In [45]:
orders. \
    write. \
    mode('overwrite'). \
    option('compression', 'none'). \
    parquet('/workarea/orders')

                                                                                

In [46]:
# Alternative approach - using format
orders. \
    write. \
    mode('overwrite'). \
    option('compression', 'none'). \
    format('parquet'). \
    save('/workarea/orders')

                                                                                

In [47]:
!hdfs dfs -ls /workarea/orders

Found 2 items
-rw-rw-rw-   2 hadoop supergroup          0 2023-04-08 14:30 /workarea/orders/_SUCCESS
-rw-rw-rw-   2 hadoop supergroup     495997 2023-04-08 14:30 /workarea/orders/part-00000-2d237614-9745-4610-8bd2-06e0f53270ac-c000.parquet


In [48]:
order_items = spark. \
    read. \
    json('/data/retail_db_json/order_items')

                                                                                

In [49]:
order_items.show(5)

+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_product_price|order_item_quantity|order_item_subtotal|
+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|            1|                  1|                  957|                  299.98|                  1|             299.98|
|            2|                  2|                 1073|                  199.99|                  1|             199.99|
|            3|                  2|                  502|                    50.0|                  5|              250.0|
|            4|                  2|                  403|                  129.99|                  1|             129.99|
|            5|                  4|                  897|                   24.99|                  2|              49.98|
+-------------+-

In [50]:
# Using format
order_items. \
    coalesce(1). \
    write. \
    mode('ignore'). \
    option('compression', 'gzip'). \
    option('sep', '|'). \
    format('csv'). \
    save('/workarea/order_items')

In [51]:
# Alternative approach - using keyword arguments
order_items. \
    coalesce(1). \
    write. \
    csv('/workarea/order_items',
        sep='|',
        mode='overwrite',
        compression='gzip'
       )

                                                                                

In [52]:
%%sh
hdfs dfs -ls /workarea/order_items

Found 2 items
-rw-rw-rw-   2 hadoop supergroup          0 2023-04-08 14:30 /workarea/order_items/_SUCCESS
-rw-rw-rw-   2 hadoop supergroup    1032820 2023-04-08 14:30 /workarea/order_items/part-00000-b3efdb44-ba64-4535-a312-a50e3f5030aa-c000.csv.gz


In [53]:
order_items.printSchema()

root
 |-- order_item_id: long (nullable = true)
 |-- order_item_order_id: long (nullable = true)
 |-- order_item_product_id: long (nullable = true)
 |-- order_item_product_price: double (nullable = true)
 |-- order_item_quantity: long (nullable = true)
 |-- order_item_subtotal: double (nullable = true)



In [54]:
order_items.count()

                                                                                

172198

## Exercise - Reorganizing airlines data

In [55]:
airlines = spark.read. \
    schema(airlines_schema). \
    csv('/data/airtrafficdata/airlines_all/airlines/*',
        header=True
       )

In [56]:
airlines.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

In [57]:
airlines.show(5)

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|   10|    

In [58]:
airlines.count()

                                                                                

1290395

In [59]:
airlines.distinct().count()

                                                                                

1290323

In [60]:
help(airlines.write.parquet)

Help on method parquet in module pyspark.sql.readwriter:

parquet(path: str, mode: Union[str, NoneType] = None, partitionBy: Union[str, List[str], NoneType] = None, compression: Union[str, NoneType] = None) -> None method of pyspark.sql.readwriter.DataFrameWriter instance
    Saves the content of the :class:`DataFrame` in Parquet format at the specified path.
    
    .. versionadded:: 1.4.0
    
    Parameters
    ----------
    path : str
        the path in any Hadoop supported file system
    mode : str, optional
        specifies the behavior of the save operation when data already exists.
    
        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorifexists`` (default case): Throw an exception if data already                 exists.
    partitionBy : str or list, optional
        names of partitioni

In [61]:
spark.conf.set("spark.sql.shuffle.partitions", "255")
airlines. \
    distinct(). \
    withColumn('flightmonth', concat('year', lpad('month', 2, '0'))). \
    repartition(255, 'flightmonth'). \
    write. \
    mode('overwrite'). \
    partitionBy('flightmonth'). \
    format('parquet'). \
    save('/workarea/airlines-part')

                                                                                

In [62]:
!hdfs dfs -ls /workarea/airlines-part

Found 4 items
-rw-rw-rw-   2 hadoop supergroup          0 2023-04-08 14:31 /workarea/airlines-part/_SUCCESS
drwxrwxrwx   - hadoop supergroup          0 2023-04-08 14:31 /workarea/airlines-part/flightmonth=198710
drwxrwxrwx   - hadoop supergroup          0 2023-04-08 14:31 /workarea/airlines-part/flightmonth=198711
drwxrwxrwx   - hadoop supergroup          0 2023-04-08 14:31 /workarea/airlines-part/flightmonth=198712


Let us create a DataFrame object named airlines by using spark.read.parquet("/workarea/airlines-part/flightmonth=198710").

In [63]:
airlines = spark.read.parquet("/workarea/airlines-part/flightmonth=198710")

In [64]:
airlines.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

In [65]:
airlines.show(10, truncate=False)

[Stage 64:>                                                         (0 + 1) / 1]

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|10   |27  

                                                                                

In [66]:
display(airlines)

DataFrame[Year: int, Month: int, DayofMonth: int, DayOfWeek: int, DepTime: string, CRSDepTime: int, ArrTime: string, CRSArrTime: int, UniqueCarrier: string, FlightNum: int, TailNum: string, ActualElapsedTime: string, CRSElapsedTime: int, AirTime: string, ArrDelay: string, DepDelay: string, Origin: string, Dest: string, Distance: string, TaxiIn: string, TaxiOut: string, Cancelled: int, CancellationCode: string, Diverted: int, CarrierDelay: string, WeatherDelay: string, NASDelay: string, SecurityDelay: string, LateAircraftDelay: string, IsArrDelayed: string, IsDepDelayed: string]

In [67]:
airlines.describe().show()

                                                                                

+-------+------+------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------+-----------------+-------+------------------+-----------------+-------+------------------+------------------+------+------+-----------------+------+-------+--------------------+----------------+--------------------+------------+------------+--------+-------------+-----------------+------------+------------+
|summary|  Year| Month|        DayofMonth|        DayOfWeek|           DepTime|        CRSDepTime|           ArrTime|        CRSArrTime|UniqueCarrier|        FlightNum|TailNum| ActualElapsedTime|   CRSElapsedTime|AirTime|          ArrDelay|          DepDelay|Origin|  Dest|         Distance|TaxiIn|TaxiOut|           Cancelled|CancellationCode|            Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+-------+------+------+------------------+-----------------+------------------+---

In [68]:
airlines.count() 

                                                                                

448558

Get number of unique origins

In [69]:
airlines. \
    select("Origin"). \
    distinct(). \
    count()

234

Get number of unique destinations

In [70]:
airlines. \
    select("Dest"). \
    distinct(). \
    count()

                                                                                

234

Get all unique carriers

In [71]:
airlines. \
    select('UniqueCarrier'). \
    distinct(). \
    show()



+-------------+
|UniqueCarrier|
+-------------+
|           UA|
|           DL|
|           US|
|           AA|
|           AS|
|           TW|
|           PS|
|           NW|
|           HP|
|           WN|
|           PI|
|       PA (1)|
|           CO|
|           EA|
+-------------+



                                                                                