 # Spark Data Operations (WIP)

This notebook showcases code snippets for frequently used data operations on Spark DataFrames.

It covers the following operations:
- DataFrames
  - Loading data
- Basic ops: shape, select columns, show data sample
- Data info
  - Schema
  - Summary
- Filtering data
- Adding new columns
- Joins: on keys with the same name and with different names
- Aggregations
- Sorting
- Pivots: single and multiple, and the difference
- Window functions
- EDA-related steps: fillna, etc.
- Writing SQL: a quick run-through of all of the above via SQL, with small examples
- Persist, lazy evaluation, actions, and the Spark UI to look at storage
- Partitioning: getting the number of partitions of existing DataFrames, repartition options
- Writing data: file types, disk partitioning while writing

***

<b>Spark 3.1.2</b> (with Python 3.8) has been used for this notebook.<br>
Refer to [spark documentation](https://spark.apache.org/docs/3.1.2/api/sql/index.html) for help with <b>data ops functions</b>.<br>
Refer to [this article](https://medium.com/analytics-vidhya/installing-and-using-pyspark-on-windows-machine-59c2d64af76e) to <b>install and use PySpark on Windows machine</b>.

***

<mark><b>Note</b></mark>: We are dealing with a sample dataset in this exercise and hence I have freely used `.show()`, `.count()`, `.collect()` actions on the dataframes. Please be careful with such steps on actual datasets.

### Building a spark session
To create a SparkSession, use the following builder pattern:
 
`spark = SparkSession\
    .builder\
    .master("local")\
    .appName("Word Count")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()`

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
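# stop any session that is already running so the configuration below takes effect
active_session = SparkSession.getActiveSession()
if active_session is not None:
    active_session.stop()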

In [3]:
spark = SparkSession\
    .builder\
    .appName("data_ops")\
    .config("spark.executor.memory", "2700m")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()

In [4]:
spark

## Dataframes

A DataFrame is a Dataset organized into named columns.<br>
It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.<br>DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

## Loading data from files

### csv files

Usually, the first line of a CSV file is the header, and we want Spark to infer the schema of the file.<br>
The command below suffices for most use cases.

`spark.read.csv(path, inferSchema=True, header=True)`

Important Parameters: [reference](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrameReader.csv.html)
1. `path` : str or list<br>
path to csv location. String, or list of strings, for input path(s), or RDD of Strings storing CSV rows. If this points to a directory, all files under that directory are read and appended.

2. `inferSchema` : str or bool, optional<br>
infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.<br>
Or you can specify the input schema using `schema` param

3. `schema` : pyspark.sql.types.StructType or str, optional<br>
an optional pyspark.sql.types.StructType for the input schema or a DDL-formatted string (For example col0 INT, col1 DOUBLE).

4. `sep` : str, optional<br>
sets a separator (one or more characters) for each field and value. If None is set, it uses the default value, `,`.

5. `header` : str or bool, optional<br>
uses the first line as names of columns. If None is set, it uses the default value, false.

In [5]:
df_sales = spark.read.csv('./data/clustering_sales.csv',inferSchema=True,header=True)

### Parquet files

Parquet is an open-source file format available to any project in the Hadoop ecosystem. Apache Parquet is a columnar storage format designed for efficient storage and fast reads, in contrast to row-based formats like CSV or TSV.

Parquet files can be 1/10th the size of the same data in CSV format, or smaller, and are therefore a preferred storage format.

`spark.read.parquet(path)`

Important Parameters: [reference](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrameReader.parquet.html)

1. `paths` : str
2. `mergeSchema` : str or bool, optional<br>
sets whether we should merge schemas collected from all Parquet part-files. This will override spark.sql.parquet.mergeSchema. The default value is specified in spark.sql.parquet.mergeSchema.

In [6]:
df_features = spark.read.parquet('./data/clustering_features/')

## Basic data ops

### Getting data shape

In [7]:
# column count
len(df_sales.columns)

8

In [8]:
# row count
df_sales.count()

10000

### Data sample

In [9]:
df_sales.show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|       1|            1|2020-01-01|        572|    550|  1|        20|              2|
|       2|            2|2020-01-01|        532|    630|  3|        11|              2|
|       3|            3|2020-01-01|        608|    450|  2|        18|              4|
|       4|            4|2020-01-01|        424|    110|  2|        10|              2|
|       5|            5|2020-01-01|        584|    250|  1|         8|              4|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



### Selecting columns

In [10]:
df_sales.columns

['order_id',
 'order_item_id',
 'tran_dt',
 'customer_id',
 'dollars',
 'qty',
 'product_id',
 'payment_type_id']

In [11]:
df_subset = df_sales.select('order_id','customer_id','dollars','product_id')

In [12]:
df_subset.show(5)

+--------+-----------+-------+----------+
|order_id|customer_id|dollars|product_id|
+--------+-----------+-------+----------+
|       1|        572|    550|        20|
|       2|        532|    630|        11|
|       3|        608|    450|        18|
|       4|        424|    110|        10|
|       5|        584|    250|         8|
+--------+-----------+-------+----------+
only showing top 5 rows



### Renaming columns

Let's rename 'dollars' column to 'sales': `df.withColumnRenamed('source_column','new_name')`

In [13]:
df_rename = df_sales.withColumnRenamed('dollars','sales')

In [14]:
df_rename.columns

['order_id',
 'order_item_id',
 'tran_dt',
 'customer_id',
 'sales',
 'qty',
 'product_id',
 'payment_type_id']

`.withColumnRenamed` will NOT throw an error when the source column is not found; the DataFrame is returned unchanged.

In [15]:
df_sales = df_sales.withColumnRenamed('unknown_source_column','new_name')

#### Bulk rename
There are two ways of bulk renaming columns:
1. Loop `.withColumnRenamed()` to rename the columns
2. use `.toDF()` to rename the columns

<b>Loop</b>

In [16]:
# DataFrames are immutable, so a second reference is enough for this task
df_rename = df_sales

In [17]:
# add _r to all column names
for col in df_rename.columns:
    df_rename = df_rename.withColumnRenamed(col, col+'_r')

In [18]:
df_rename.columns

['order_id_r',
 'order_item_id_r',
 'tran_dt_r',
 'customer_id_r',
 'dollars_r',
 'qty_r',
 'product_id_r',
 'payment_type_id_r']

<b>`.toDF()`</b>

This takes a list of new column names and replaces existing column names with the list. The replacement happens in the sequence in which columns are present in the original dataframe.

In [19]:
# DataFrames are immutable, so a second reference is enough for this task
df_rename = df_sales

In [20]:
# add _r to all column names
list_new_names = [x + '_r' for x in df_rename.columns]
df_rename = df_rename.toDF(*list_new_names)

In [21]:
df_rename.columns

['order_id_r',
 'order_item_id_r',
 'tran_dt_r',
 'customer_id_r',
 'dollars_r',
 'qty_r',
 'product_id_r',
 'payment_type_id_r']

## Data info

In [22]:
# schema information
df_sales.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- tran_dt: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- dollars: integer (nullable = true)
 |-- qty: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- payment_type_id: integer (nullable = true)



In [23]:
# data summary
df_sales.summary().show()

+-------+------------------+------------------+----------+-----------------+-----------------+------------------+-----------------+------------------+
|summary|          order_id|     order_item_id|   tran_dt|      customer_id|          dollars|               qty|       product_id|   payment_type_id|
+-------+------------------+------------------+----------+-----------------+-----------------+------------------+-----------------+------------------+
|  count|             10000|             10000|     10000|            10000|            10000|             10000|            10000|             10000|
|   mean|          4907.044|            5000.5|      null|         491.5132|          570.717|            2.1108|           11.494|            2.5942|
| stddev|2831.8233199116644|2886.8956799071675|      null|292.5619908286239|421.2652399038348|0.8827237847299767|6.344287903216555|0.8613945437811904|
|    min|                 1|                 1|2020-01-01|                1|               50|

In [24]:
# getting more detailed information from .summary()
summary_stats = (
    'count','mean','stddev','min','0.10%','1.00%','5.00%','10.00%','20.00%','25.00%','30.00%',
    '40.00%','50.00%','60.00%','70.00%','75.00%','80.00%','90.00%','95.00%','99.00%','99.90%','max')

In [25]:
df_sales.select('tran_dt','dollars','qty').summary(*summary_stats).show(30,False)

+-------+----------+-----------------+------------------+
|summary|tran_dt   |dollars          |qty               |
+-------+----------+-----------------+------------------+
|count  |10000     |10000            |10000             |
|mean   |null      |570.717          |2.1108            |
|stddev |null      |421.2652399038348|0.8827237847299767|
|min    |2020-01-01|50               |1                 |
|0.10%  |null      |50               |1                 |
|1.00%  |null      |50               |1                 |
|5.00%  |null      |80               |1                 |
|10.00% |null      |120              |1                 |
|20.00% |null      |210              |1                 |
|25.00% |null      |240              |1                 |
|30.00% |null      |270              |2                 |
|40.00% |null      |400              |2                 |
|50.00% |null      |450              |2                 |
|60.00% |null      |550              |2                 |
|70.00% |null 

The above summary can also be converted to a Pandas dataframe for better display in Jupyter and to save it as csv/excel.

In [26]:
df_summary = df_sales.summary(*summary_stats).toPandas()

In [27]:
df_summary.head()

| | summary | order_id | order_item_id | tran_dt | customer_id | dollars | qty | product_id | payment_type_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 10000.0 | 10000.0 | 10000 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| 1 | mean | 4907.044 | 5000.5 | | 491.5132 | 570.717 | 2.1108 | 11.494 | 2.5942 |
| 2 | stddev | 2831.8233199116644 | 2886.8956799071675 | | 292.5619908286239 | 421.2652399038348 | 0.8827237847299767 | 6.344287903216555 | 0.8613945437811904 |
| 3 | min | 1.0 | 1.0 | 2020-01-01 | 1.0 | 50.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.10% | 9.0 | 9.0 | | 2.0 | 50.0 | 1.0 | 1.0 | 1.0 |


In [28]:
df_summary.to_csv('./files/df_summary.csv')

## Filtering Data
1. Using `.where()`
2. Using `.filter()`

Both give the same results.

`.where()` is an alias for `.filter()`: `.filter()` is the standard Scala name for the function, and `.where()` is for people who prefer SQL-like syntax.

In [29]:
df_sales.filter("dollars>2000").show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|     390|          400|2020-01-14|        119|   2200|  4|        22|              2|
|     586|          601|2020-01-21|        659|   2200|  4|        22|              4|
|     775|          792|2020-01-27|        526|   2200|  4|        20|              4|
|     956|          976|2020-02-01|        798|   2200|  4|        22|              3|
|    1014|         1036|2020-02-03|        495|   2200|  4|        20|              2|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



In [30]:
df_sales.filter(df_sales['dollars']>2000).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|     390|          400|2020-01-14|        119|   2200|  4|        22|              2|
|     586|          601|2020-01-21|        659|   2200|  4|        22|              4|
|     775|          792|2020-01-27|        526|   2200|  4|        20|              4|
|     956|          976|2020-02-01|        798|   2200|  4|        22|              3|
|    1014|         1036|2020-02-03|        495|   2200|  4|        20|              2|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



In [31]:
# wrap each condition in parentheses: & binds tighter than comparison operators in Python
df_sales.filter((df_sales['dollars'] > 1500) & (df_sales['qty'] < 4)).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|      33|           33|2020-01-02|        941|   1650|  3|        20|              2|
|      58|           58|2020-01-03|        711|   1650|  3|        22|              2|
|     102|          102|2020-01-05|        475|   1650|  3|        22|              4|
|     151|          151|2020-01-06|        695|   1650|  3|        20|              3|
|     251|          257|2020-01-09|        764|   1650|  3|        20|              4|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



Use of `F.col()` as column identifier instead of dataframe object

In [32]:
# wrap each condition in parentheses: & binds tighter than comparison operators in Python
df_sales.filter((F.col('dollars') > 1500) & (F.col('qty') < 4)).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|      33|           33|2020-01-02|        941|   1650|  3|        20|              2|
|      58|           58|2020-01-03|        711|   1650|  3|        22|              2|
|     102|          102|2020-01-05|        475|   1650|  3|        22|              4|
|     151|          151|2020-01-06|        695|   1650|  3|        20|              3|
|     251|          257|2020-01-09|        764|   1650|  3|        20|              4|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



Same output from `.where()`

In [33]:
df_sales.where("dollars>2000").show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|     390|          400|2020-01-14|        119|   2200|  4|        22|              2|
|     586|          601|2020-01-21|        659|   2200|  4|        22|              4|
|     775|          792|2020-01-27|        526|   2200|  4|        20|              4|
|     956|          976|2020-02-01|        798|   2200|  4|        22|              3|
|    1014|         1036|2020-02-03|        495|   2200|  4|        20|              2|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



In [34]:
df_sales.where(df_sales['dollars']>2000).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|     390|          400|2020-01-14|        119|   2200|  4|        22|              2|
|     586|          601|2020-01-21|        659|   2200|  4|        22|              4|
|     775|          792|2020-01-27|        526|   2200|  4|        20|              4|
|     956|          976|2020-02-01|        798|   2200|  4|        22|              3|
|    1014|         1036|2020-02-03|        495|   2200|  4|        20|              2|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



In [35]:
df_sales.where(F.col('dollars')>2000).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
|     390|          400|2020-01-14|        119|   2200|  4|        22|              2|
|     586|          601|2020-01-21|        659|   2200|  4|        22|              4|
|     775|          792|2020-01-27|        526|   2200|  4|        20|              4|
|     956|          976|2020-02-01|        798|   2200|  4|        22|              3|
|    1014|         1036|2020-02-03|        495|   2200|  4|        20|              2|
+--------+-------------+----------+-----------+-------+---+----------+---------------+
only showing top 5 rows



## Add new columns
`df.withColumn(colName, col)`

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error.

In [36]:
df_sales = df_sales.withColumn('aur', F.col('dollars')/F.col('qty'))

In [37]:
df_sales.show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|  aur|
+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+
|       1|            1|2020-01-01|        572|    550|  1|        20|              2|550.0|
|       2|            2|2020-01-01|        532|    630|  3|        11|              2|210.0|
|       3|            3|2020-01-01|        608|    450|  2|        18|              4|225.0|
|       4|            4|2020-01-01|        424|    110|  2|        10|              2| 55.0|
|       5|            5|2020-01-01|        584|    250|  1|         8|              4|250.0|
+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+
only showing top 5 rows



In [38]:
# More complex logic for new columns: use F.when().otherwise() (the equivalent of SQL CASE WHEN)
df_sales = df_sales.withColumn('payment_2_dollars',F.when(F.col('payment_type_id')==2,F.col('dollars')).otherwise(0))

In [39]:
df_sales.show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+-----------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|  aur|payment_2_dollars|
+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+-----------------+
|       1|            1|2020-01-01|        572|    550|  1|        20|              2|550.0|              550|
|       2|            2|2020-01-01|        532|    630|  3|        11|              2|210.0|              630|
|       3|            3|2020-01-01|        608|    450|  2|        18|              4|225.0|                0|
|       4|            4|2020-01-01|        424|    110|  2|        10|              2| 55.0|              110|
|       5|            5|2020-01-01|        584|    250|  1|         8|              4|250.0|                0|
+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+-----------------+
only showing top 5 rows

## Joins

## GroupBy and Aggregate Functions

- GroupBy allows you to group rows together based on some column value. For example, you could group sales data by the day the sale occurred.
- Once you've performed the GroupBy operation, you can apply aggregate functions. An aggregate function combines multiple rows of data into a single output, such as the sum of the inputs.

## In built Functions

There are a variety of functions you can import from `pyspark.sql.functions`. Check out the documentation for the full list.

Commonly used functions:
- sum(): sums the values
- avg(): averages the values of the given column
- count(): counts the number of values in a column (excludes nulls)
- countDistinct(): counts the number of distinct values in a column (excludes nulls)
- stddev(): computes the standard deviation of the input values

We have imported the `pyspark.sql.functions` module as `F`.

## Sorting

- Sorting is one of the most common operations applied to DataFrames.
- It is done with the `.sort()` method (`.orderBy()` is an alias). The default order is ascending; call `.desc()` on a column to reverse it.


In [40]:
df_sales.sort(F.col('dollars')).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+----+-----------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id| aur|payment_2_dollars|
+--------+-------------+----------+-----------+-------+---+----------+---------------+----+-----------------+
|     544|          559|2020-01-20|        421|     50|  1|         5|              2|50.0|               50|
|    1053|         1076|2020-02-04|        262|     50|  1|         5|              3|50.0|                0|
|     777|          794|2020-01-27|        404|     50|  1|         5|              3|50.0|                0|
|      73|           73|2020-01-03|        198|     50|  1|         5|              3|50.0|                0|
|     857|          876|2020-01-29|         61|     50|  1|         5|              3|50.0|                0|
+--------+-------------+----------+-----------+-------+---+----------+---------------+----+-----------------+
only showing top 5 rows

In [41]:
df_sales.sort(F.col('dollars').desc()).show(5)

+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+-----------------+
|order_id|order_item_id|   tran_dt|customer_id|dollars|qty|product_id|payment_type_id|  aur|payment_2_dollars|
+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+-----------------+
|    1014|         1036|2020-02-03|        495|   2200|  4|        20|              2|550.0|             2200|
|    2292|         2336|2020-03-17|        506|   2200|  4|        22|              4|550.0|                0|
|    1337|         1367|2020-02-13|        955|   2200|  4|        22|              4|550.0|                0|
|     956|          976|2020-02-01|        798|   2200|  4|        22|              3|550.0|                0|
|    1464|         1496|2020-02-17|        557|   2200|  4|        22|              2|550.0|             2200|
+--------+-------------+----------+-----------+-------+---+----------+---------------+-----+-----------------+
only showing top 5 rows

## Pivots
A pivot turns the distinct values of one column into separate columns, aggregating another column for each of them.

## Window functions

## EDA related steps

## Using SQL