## PySpark Notes

[Sorting](https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/) and [counting](https://napsterinblue.github.io/notes/spark/sparksql/value_counts/) data




In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

In [2]:
spark = SparkSession.builder.appName("Spark Intro").getOrCreate()

In [3]:
df = spark.read.csv('./data/store.csv', header=True, inferSchema=True)

#### Which 5 orders resulted in the largest profits?

In [4]:
largest_profits = df.sort(df.Profit.desc())
largest_profits.select('Date', 'Customer Name', 'Profit').show(n=5)

+----------+-------------+---------+
|      Date|Customer Name|   Profit|
+----------+-------------+---------+
|2020-09-03| Tamara Chand| 8399.976|
|2020-12-12| Raymond Buch|6719.9808|
|2020-03-18| Hunter Lopez|5039.9856|
|2020-08-15|Adrian Barton|  4946.37|
|2020-12-01| Sanjit Chand|4630.4755|
+----------+-------------+---------+
only showing top 5 rows



#### Sort customers alphabetically (display the first 50)

In [5]:
df.select('Customer Name').distinct().sort('Customer Name').show(n=50)

+--------------------+
|       Customer Name|
+--------------------+
|       Aaron Bergman|
|       Aaron Hawkins|
|      Aaron Smayling|
|     Adam Bellavance|
|           Adam Hart|
|  Adam Shillingsburg|
|       Adrian Barton|
|         Adrian Hane|
|        Adrian Shami|
|         Aimee Bixby|
|         Alan Barnes|
|      Alan Dominguez|
|         Alan Haines|
|          Alan Hwang|
|   Alan Schoenberger|
|        Alan Shonely|
|Alejandro Ballentine|
|     Alejandro Grove|
|    Alejandro Savely|
| Aleksandra Gannaway|
|          Alex Avila|
|        Alex Grayson|
|        Alex Russell|
|      Alice McCarthy|
|        Allen Armold|
|      Allen Goldenen|
|    Allen Rosenblatt|
|       Alyssa Crouse|
|         Alyssa Tate|
|             Amy Cox|
|            Amy Hunt|
|        Andrew Allen|
|     Andrew Gjertsen|
|      Andrew Roberts|
|        Andy Gerbode|
|         Andy Reiter|
|          Andy Yotov|
|      Anemone Ratner|
|         Angele Hood|
|           Ann Blume|
|          

#### Sort by customer name in descending order, breaking ties by profit in ascending order

In [6]:
df.dtypes

[('Date', 'string'),
 ('Customer ID', 'string'),
 ('Customer Name', 'string'),
 ('Segment', 'string'),
 ('Country', 'string'),
 ('City', 'string'),
 ('State', 'string'),
 ('Postal Code', 'int'),
 ('Region', 'string'),
 ('Product ID', 'string'),
 ('Category', 'string'),
 ('Sub-Category', 'string'),
 ('Product Name', 'string'),
 ('Sales', 'string'),
 ('Quantity', 'string'),
 ('Discount', 'string'),
 ('Profit', 'double'),
 ('RATING', 'double')]

In [7]:
# orderBy() is an alias for sort(), so either works here
sorted_df = df.orderBy(F.col('Customer Name').desc(), F.col('Profit').asc())
sorted_df.select('Customer Name', 'Profit').show()

+------------------+----------+
|     Customer Name|    Profit|
+------------------+----------+
|Zuschuss Donatelli|    2.4824|
|Zuschuss Donatelli|     3.344|
|Zuschuss Donatelli|     4.995|
|Zuschuss Donatelli|     7.384|
|Zuschuss Donatelli|    16.011|
|Zuschuss Donatelli|   16.5888|
|Zuschuss Donatelli|   22.0472|
|Zuschuss Donatelli|   51.4975|
|Zuschuss Donatelli|  124.7808|
|  Zuschuss Carroll|-1850.9464|
|  Zuschuss Carroll|  -97.7394|
|  Zuschuss Carroll|   -55.256|
|  Zuschuss Carroll|  -50.6688|
|  Zuschuss Carroll|  -23.7822|
|  Zuschuss Carroll|  -20.1362|
|  Zuschuss Carroll|  -12.8961|
|  Zuschuss Carroll|   -5.2072|
|  Zuschuss Carroll|   -4.1136|
|  Zuschuss Carroll|   -3.8385|
|  Zuschuss Carroll|     -1.11|
+------------------+----------+
only showing top 20 rows



#### Count the occurrences of each region, sorted by count in descending order

In [8]:
# For each region keep a count
df.groupby('Region').count().orderBy('count', ascending=False).show()

+-------+-----+
| Region|count|
+-------+-----+
|   West| 3203|
|   East| 2848|
|Central| 2323|
|  South| 1620|
+-------+-----+



#### Create a value counts function

In [9]:
def value_counts(df, col, ascending=False):
    """Return the row count for each distinct value of `col`, sorted by count."""
    return df.groupby(col).count().orderBy('count', ascending=ascending)

In [10]:
value_counts(df, 'Region', False).show()

+-------+-----+
| Region|count|
+-------+-----+
|   West| 3203|
|   East| 2848|
|Central| 2323|
|  South| 1620|
+-------+-----+

