In [0]:
# We start by importing SparkContext and creating a context
# which we will use in class

# from pyspark import SparkContext
# sc = SparkContext()

### Core RDD Functions in Spark
* Beyond basic functions like `map`, `flatMap`, `filter`, and `reduce`, Spark offers a variety of other essential RDD methods (both transformations and actions).
* These functions enable easier and more efficient programming in the map-reduce paradigm.
* Some examples include:
    * `distinct`: Yields an RDD with the unique elements from the source RDD.
    * `union`: Combines two RDDs.
    * `intersection`: Identifies common elements between two RDDs.
    * `foreach`: An action that applies a function to each RDD element without returning any value.
      * Similar to `map`, but run only for its side effects; no results are returned.
    * `cartesian`: Generates the Cartesian product of this RDD with another RDD.
    * And many more...


### Core RDD Functions in Spark - Cont'd

* Spark also provides a range of actions for RDD, such as:
    * `count`: Gives the total number of RDD elements.
    * `sum`: Calculates the total sum of RDD elements (requires numeric data).
    * `mean`: Determines the average of RDD elements (requires numeric data).
    * `stats`: Returns comprehensive statistics, including count, mean, stdev, max, and min.


In [0]:
# Distinct example 
dataset_1 = sc.parallelize(["A", "B", "C", "A", "C", "D"])
dataset_1.distinct().collect()

Out[3]: ['B', 'C', 'A', 'D']

In [0]:
dir(dataset_1)

Out[4]: ['__add__',
 '__class__',
 '__class_getitem__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__orig_bases__',
 '__parameters__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_computeFractionForSampleSize',
 '_defaultReducePartitions',
 '_id',
 '_is_barrier',
 '_is_protocol',
 '_jrdd',
 '_jrdd_deserializer',
 '_memory_limit',
 '_pickled',
 '_reserialize',
 '_to_java_object_rdd',
 'aggregate',
 'aggregateByKey',
 'barrier',
 'cache',
 'cartesian',
 'checkpoint',
 'cleanShuffleDependencies',
 'coalesce',
 'cogroup',
 'collect',
 'collectAsMap',
 'collectWithJobGroup',
 'combineByKey',
 'context',
 'count',
 'countApprox',
 'countApproxDistinct',
 'countByKey',
 'countByValue',
 'ctx'

In [0]:
dataset_1.getNumPartitions()

Out[5]: 8

In [0]:
# Union example

dataset_1 = sc.parallelize(["A", "B", "C"])
dataset_2 = sc.parallelize(["D", "E", "F"])

both_datasets = dataset_1.union(dataset_2)

both_datasets.collect()

Out[6]: ['A', 'B', 'C', 'D', 'E', 'F']

In [0]:
# intersection example
dataset_1 = sc.parallelize(["A", "B", "C"])
dataset_2 = sc.parallelize(["N", "B", "C", "M"])
dataset_1.intersection(dataset_2).collect()

Out[7]: ['B', 'C']

In [0]:
dataset_1.collect()

Out[8]: ['A', 'B', 'C']

In [0]:
# foreach example: do you think this should work?
out = dataset_1.foreach(lambda x: print(x))
out


In [0]:
type(out)

Out[10]: NoneType

In [0]:
# foreach example.
# Do you think the following will work?
dataset_1.foreach(lambda x: print(x + "-suffix"))

In [0]:
modified_rdd = dataset_1.map(lambda x: x + "-suffix")    
modified_rdd.collect()

Out[13]: ['A-suffix', 'B-suffix', 'C-suffix']

In [0]:
# cartesian example
dataset_1 = sc.parallelize(["A", "B"])

dataset_2 = sc.parallelize(["D", "E", "F"])

dataset_1.cartesian(dataset_2).collect()

Out[14]: [('A', 'D'), ('A', 'E'), ('A', 'F'), ('B', 'D'), ('B', 'E'), ('B', 'F')]

In [0]:
# sum, mean examples

dataset_1 = sc.parallelize([1, 2, 3, 4, 5, 6])

(dataset_1.sum(), dataset_1.mean())

Out[15]: (21, 3.5)

In [0]:
# stats example

dataset_1.stats()

Out[16]: (count: 6, mean: 3.5, stdev: 1.707825127659933, max: 6.0, min: 1.0)
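
The `stdev` reported above is the population standard deviation (dividing by `n`), not the sample version. A quick plain-Python check with the standard library, not Spark code, reproduces the same numbers for the six values used in the example:

```python
import statistics

values = [1, 2, 3, 4, 5, 6]

# Plain-Python stand-in for the fields shown by stats(); not Spark's implementation.
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "stdev": statistics.pstdev(values),  # population stdev (divide by n)
    "max": float(max(values)),
    "min": float(min(values)),
}

# Sample stdev (divide by n - 1) is slightly larger for the same data.
sample_stdev = statistics.stdev(values)

print(summary)
print(sample_stdev)
```

In PySpark, the sample version is available on an RDD as `sampleStdev()`.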

### Essential RDD Functions for Tuples

* Remember, key-value pairs (tuples) are a core component in the Map Reduce paradigm.

  * e.g. `[("THE", 12), ("HI", 2), ("COURSE", 2), ("STUDENTS", 3), ... ]`
  * Spark offers a collection of methods tailored for these tuples, both as transformations and actions.

* Transformations on Tuples:
    * `sortByKey`: yields a new RDD ordered by its keys.
    * `reduceByKey`: applies a function `f` to combine values by key, producing a new RDD.
    * `groupByKey`: Creates a new RDD where values are assembled under their respective keys.
    * `join`: Generates a new RDD by pairing values with matching keys from two datasets.
        * Variants include: `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin`.

* Actions for Tuples:
  * General actions like count are adaptable to any data type.
  * To perform arithmetic actions on a tuple RDD, you can first transform it using map
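
To make the difference between `groupByKey` and `reduceByKey` concrete, here is a minimal plain-Python sketch of what each one computes. The helper names `group_by_key` and `reduce_by_key` are hypothetical, and this is not how Spark implements them: Spark's `reduceByKey` also combines values inside each partition before shuffling, which this single-machine model does not show.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect all values that share a key (hypothetical model of groupByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(pairs, f):
    """Fold the values of each key with f (hypothetical model of reduceByKey)."""
    return {key: reduce(f, values) for key, values in group_by_key(pairs).items()}


pairs = [("A", 12), ("A", 2), ("C", 2), ("B", 3), ("C", 3)]

grouped = group_by_key(pairs)
totals = reduce_by_key(pairs, lambda x, y: x + y)

print(grouped)
print(totals)
```

Because Spark may apply `f` in any order across partitions, `f` should be associative and commutative (addition is; subtraction is not).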

In [0]:
# sortByKey example

data_1  = sc.parallelize([("C", 12), ("D", 2), ("A", 2), ("B", 3)])
data_1.sortByKey().collect()


Out[17]: [('A', 2), ('B', 3), ('C', 12), ('D', 2)]

In [0]:
# reduceByKey example

data_1  = sc.parallelize([("A", 12), ("A", 2), ("C", 2), ("B", 3), ("C", 3)])
data_1.reduceByKey(lambda x,y: x+y).collect()


Out[18]: [('B', 3), ('C', 5), ('A', 14)]

In [0]:
# groupByKey example

data_1  = sc.parallelize([("A", 12), ("A", 2), ("C", 2), ("B", 3), ("C", 3), ("A", 5)])
grouped_data = data_1.groupByKey().collect()
grouped_data


Out[19]: [('B', <pyspark.resultiterable.ResultIterable at 0x7f8a74b93160>),
 ('C', <pyspark.resultiterable.ResultIterable at 0x7f8a74b93ee0>),
 ('A', <pyspark.resultiterable.ResultIterable at 0x7f8a7498f2e0>)]

In [0]:
for key, val in grouped_data:
    print(f"{key}\t{list(val)}")


B	[3]
C	[2, 3]
A	[12, 2, 5]


In [0]:
# join example

data_1  = sc.parallelize([("A", 1), ("A", 3), ("B", 4), ("C", 6), ("D", 11)          ])
data_2  = sc.parallelize([("A", 2),           ("B", 5), ("C", 7),           ("E", 11)])

data_1.join(data_2).collect()

Out[24]: [('B', (4, 5)), ('C', (6, 7)), ('A', (1, 2)), ('A', (3, 2))]

In [0]:
# leftOuterJoin example

data_1  = sc.parallelize([("A", 1), ("D", 3), ("B", 4), ("C", 6), ])
data_2  = sc.parallelize([("A", 2),           ("B", 5), ("C", 7), ])

data_1.leftOuterJoin(data_2).collect()

Out[30]: [('B', (4, 5)), ('D', (3, None)), ('C', (6, 7)), ('A', (1, 2))]

In [0]:
# rightOuterJoin example

data_1  = sc.parallelize([("A", 1), ("D", 3), ("B", 4), ("C", 6), ])
data_2  = sc.parallelize([("A", 2),           ("B", 5), ("C", 7), ])

data_1.rightOuterJoin(data_2).collect()

[('B', (4, 5)), ('C', (6, 7)), ('A', (1, 2))]
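
The third variant, `fullOuterJoin`, keeps keys from both sides. As a rough plain-Python model of its semantics (the helper `full_outer_join` is hypothetical and runs on ordinary lists, not RDDs), every key that appears on either side is emitted, with `None` standing in for the missing side:

```python
from collections import defaultdict


def full_outer_join(left, right):
    """Model of full-outer-join semantics on lists of (key, value) pairs."""
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for key, value in left:
        left_by_key[key].append(value)
    for key, value in right:
        right_by_key[key].append(value)

    result = []
    # Sorted only to make the printed order deterministic; Spark gives no such guarantee.
    for key in sorted(set(left_by_key) | set(right_by_key)):
        left_values = left_by_key.get(key, [None])
        right_values = right_by_key.get(key, [None])
        # Keys with several values on a side produce every pairing.
        for lv in left_values:
            for rv in right_values:
                result.append((key, (lv, rv)))
    return result


left = [("A", 1), ("D", 3), ("B", 4), ("C", 6)]
right = [("A", 2), ("B", 5), ("C", 7), ("E", 11)]

joined = full_outer_join(left, right)
print(joined)
```

Here `D` exists only on the left and `E` only on the right, so both survive with a `None` partner.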

In [0]:
# apply non tuple actions  by first using map
# extract the values

data_1  = sc.parallelize([("A", 1), ("A", 3), ("B", 4), ("C", 6)])
data_1.map(lambda x: x[1]).sum()

Out[36]: 14

### Conclusion
* An RDD (Resilient Distributed Dataset) is an immutable distributed collection of various data objects.
  * This can include lines, tuples, JSON objects, and more.

* RDDs are distributed across multiple nodes, enabling parallel operations through a low-level API.
  * Any transformation on an RDD yields a new RDD.
  * To explore the many methods available for RDDs, use Python's built-in `dir` function (e.g., `dir(dataset_1)`).

* Importantly, RDDs don't enforce a specific data structure.
  * For instance, you can have:
```python
sc.parallelize([("A", 1), {"First": "John", "Salary": 125_000}, ("B", 4), ("C", 6), ])
```
  * This is not a desired situation when working with structured data.
    * Cannot run queries on all instances (e.g., `select Salary from my_rdd`)

* Spark also offers a data structure that enforces data validation and structure.
   * This aids in optimizing data operations more effectively.
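
A small plain-Python illustration of why mixed records are awkward to query, using an ordinary list in place of an RDD: pulling out `Salary` requires checking the shape of every element first, because tuples have no such field.

```python
records = [
    ("A", 1),
    {"First": "John", "Salary": 125_000},
    ("B", 4),
    ("C", 6),
]

# A naive record["Salary"] would raise TypeError on the tuples,
# so each element has to be inspected before it can be "queried".
salaries = [
    record["Salary"]
    for record in records
    if isinstance(record, dict) and "Salary" in record
]

print(salaries)
```

With an enforced schema, every row is known to have the same columns, so this kind of per-record check is unnecessary.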


### Dive Into Spark DataFrame
* Spark DataFrames in PySpark are immutable, distributed collections of data organized into named columns.
  * They can be thought of as tables in a relational database, or as analogous to a pandas DataFrame.
  * They support querying via either SQL or Python-style syntax.

* Example using SQL:

```python
session.sql("SELECT * FROM users WHERE age < 21")
```

* Example using Python-style:

```python
users.filter(users.age < 21)
```

* Given their structured nature, DataFrames enable enhanced optimizations internally.
  * They can be created from various sources, including:
    * Structured data files (json, csv, etc.)
    * Parquet Files
    * External databases
    * Pre-existing RDDs
* While RDDs are seen as collections of objects, DataFrames can be understood as collections of rows (instances).
  * Similar to Pandas DataFrames

### DataFrame Operations in Spark

* Spark offers a variety of high-level functions tailored for DataFrames.
  * Rooted in the map-reduce model, these functions address common tasks efficiently.
  * While they serve as shortcuts, remember they're underpinned by core functions: `map`, `flatMap`, `filter`, and `reduce`.
* While SparkContext is essential for RDDs, DataFrames rely on SparkSession.
  * Tasks such as creating, registering, and executing SQL queries necessitate SparkSession.
* Multiple methods are available to instantiate a DataFrame:
  * From a CSV file: Each row is an object.
  * From a JSON file: Every record is an object.
    * Ensure every line contains a distinct, valid JSON object.
      * This format is used in assignment 1.
  * From a text file: Here too, each row becomes an object.
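
The "one valid JSON object per line" requirement can be illustrated with the standard library alone. This is a minimal sketch on a hypothetical two-record string (the records and the helper name `parse_json_lines` are made up for illustration), not what Spark's JSON reader does internally:

```python
import json

# Hypothetical sample: each line is a complete JSON object on its own.
json_lines = (
    '{"first_name": "Jane", "state": "Hawaii"}\n'
    '{"first_name": "John", "state": "Maine"}\n'
)


def parse_json_lines(text):
    """Parse text in which every non-empty line is one JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(json_lines)
print(records)

# A pretty-printed object spread over several lines breaks the convention:
# its first line on its own is not valid JSON.
try:
    json.loads('{"first_name": "Jane",')
    first_line_is_valid = True
except json.JSONDecodeError:
    first_line_is_valid = False
print(first_line_is_valid)
```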


### Understanding Schemas
* Schemas outline the data type structure of your fields.
* They play an important role in Spark's optimization processes.
* Schemas are essential in achieving the right in-memory compression.
  * Post-compression, the data size in memory might be more compact than its uncompressed counterpart on disk.

In [0]:
# from pyspark import SparkContext
# sc = SparkContext()
from pyspark.sql import SparkSession
session = SparkSession(sc)

In [0]:
text_df  = session.read.text('dbfs:/FileStore/pride_and_prejudice.txt')
print(text_df.count())
text_df.first()

14579
Out[60]: Row(value='The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen')

In [0]:
text_df

Out[61]: DataFrame[value: string]

In [0]:
%%time

csv_df = session.read.csv("dbfs:/FileStore/flight_info.csv", header=True)
print(csv_df.count())
csv_df.head(1)


450017
CPU times: user 15.1 ms, sys: 3.34 ms, total: 18.4 ms
Wall time: 10.2 s
Out[24]: [Row(_c0='0', DayOfWeek='2', UniqueCarrier='AA', FlightNum='494', Origin='CLT', Dest='PHX', CRSDepTime='1619', DepTime='1616.0', TaxiOut='17.0', WheelsOff='1633.0', WheelsOn='1837.0', TaxiIn='5.0', CRSArrTime='1856', ArrTime='1842.0', Cancelled='0.0', CancellationCode=None, Distance='1773.0', CarrierDelay=None, WeatherDelay=None, NASDelay=None, SecurityDelay=None, LateAircraftDelay=None)]

In [0]:
csv_df.schema

Out[25]: StructType([StructField('_c0', StringType(), True), StructField('DayOfWeek', StringType(), True), StructField('UniqueCarrier', StringType(), True), StructField('FlightNum', StringType(), True), StructField('Origin', StringType(), True), StructField('Dest', StringType(), True), StructField('CRSDepTime', StringType(), True), StructField('DepTime', StringType(), True), StructField('TaxiOut', StringType(), True), StructField('WheelsOff', StringType(), True), StructField('WheelsOn', StringType(), True), StructField('TaxiIn', StringType(), True), StructField('CRSArrTime', StringType(), True), StructField('ArrTime', StringType(), True), StructField('Cancelled', StringType(), True), StructField('CancellationCode', StringType(), True), StructField('Distance', StringType(), True), StructField('CarrierDelay', StringType(), True), StructField('WeatherDelay', StringType(), True), StructField('NASDelay', StringType(), True), StructField('SecurityDelay', StringType(), True), StructField('Lat

* In Spark's StructField, the third parameter indicates whether the field is nullable or not.
  * I.e., whether the field can be null or not.

```python
StructType(
	List(StructField(year,StringType,true),
		StructField(month,StringType,true),
		StructField(day,StringType,true),
		StructField(dep_time,StringType,true),
		StructField(dep_delay,StringType,true),
		StructField(arr_time,StringType,true),
		StructField(arr_delay,StringType,true),
		StructField(carrier,StringType,true),
		StructField(tailnum,StringType,true),
		StructField(flight,StringType,true),
		StructField(origin,StringType,true),
		StructField(dest,StringType,true),
		StructField(air_time,StringType,true),
		StructField(distance,StringType,true),
		StructField(hour,StringType,true),
		StructField(minute,StringType,true)
	)
)
```


In [0]:
%%time

csv_df = session.read.options(inferSchema = True).csv("dbfs:/FileStore/flight_info.csv", header=True)

print(csv_df.count())

csv_df.head(2)


450017
CPU times: user 14.6 ms, sys: 8.35 ms, total: 23 ms
Wall time: 14.2 s
Out[26]: [Row(_c0=0, DayOfWeek=2, UniqueCarrier='AA', FlightNum=494, Origin='CLT', Dest='PHX', CRSDepTime=1619, DepTime=1616.0, TaxiOut=17.0, WheelsOff=1633.0, WheelsOn=1837.0, TaxiIn=5.0, CRSArrTime=1856, ArrTime=1842.0, Cancelled=0.0, CancellationCode=None, Distance=1773.0, CarrierDelay=None, WeatherDelay=None, NASDelay=None, SecurityDelay=None, LateAircraftDelay=None),
 Row(_c0=1, DayOfWeek=3, UniqueCarrier='AA', FlightNum=494, Origin='CLT', Dest='PHX', CRSDepTime=1619, DepTime=1614.0, TaxiOut=13.0, WheelsOff=1627.0, WheelsOn=1815.0, TaxiIn=6.0, CRSArrTime=1856, ArrTime=1821.0, Cancelled=0.0, CancellationCode=None, Distance=1773.0, CarrierDelay=None, WeatherDelay=None, NASDelay=None, SecurityDelay=None, LateAircraftDelay=None)]

In [0]:
csv_df.schema

Out[27]: StructType([StructField('_c0', IntegerType(), True), StructField('DayOfWeek', IntegerType(), True), StructField('UniqueCarrier', StringType(), True), StructField('FlightNum', IntegerType(), True), StructField('Origin', StringType(), True), StructField('Dest', StringType(), True), StructField('CRSDepTime', IntegerType(), True), StructField('DepTime', DoubleType(), True), StructField('TaxiOut', DoubleType(), True), StructField('WheelsOff', DoubleType(), True), StructField('WheelsOn', DoubleType(), True), StructField('TaxiIn', DoubleType(), True), StructField('CRSArrTime', IntegerType(), True), StructField('ArrTime', DoubleType(), True), StructField('Cancelled', DoubleType(), True), StructField('CancellationCode', StringType(), True), StructField('Distance', DoubleType(), True), StructField('CarrierDelay', DoubleType(), True), StructField('WeatherDelay', DoubleType(), True), StructField('NASDelay', DoubleType(), True), StructField('SecurityDelay', DoubleType(), True), StructField

```
StructType(
	List(
		StructField(year,IntegerType,true),
		StructField(month,IntegerType,true),
		StructField(day,IntegerType,true),
		StructField(dep_time,IntegerType,true),
		StructField(dep_delay,IntegerType,true),
		StructField(arr_time,IntegerType,true),
		StructField(arr_delay,IntegerType,true),
		StructField(carrier,StringType,true),
		StructField(tailnum,StringType,true),
		StructField(flight,IntegerType,true),
		StructField(origin,StringType,true),
		StructField(dest,StringType,true),
		StructField(air_time,IntegerType,true),
		StructField(distance,IntegerType,true),
		StructField(hour,IntegerType,true),
		StructField(minute,IntegerType,true)
	)
)
```
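
With `inferSchema = True`, Spark has to read the data to decide each column's type, which is consistent with the longer wall time above. As a rough mental model (a simplified sketch, not Spark's actual inference rules), each column can be typed by trying progressively more general parsers on its raw string values. The function name `infer_column_type` is hypothetical.

```python
def infer_column_type(values):
    """Return 'integer', 'double', or 'string' for a list of raw CSV strings."""
    # Missing values carry no type information, so they are ignored.
    non_null = [v for v in values if v is not None and v != ""]
    if not non_null:
        return "string"

    # Try the narrowest type first and fall back to more general ones.
    for type_name, parser in (("integer", int), ("double", float)):
        try:
            for v in non_null:
                parser(v)
            return type_name
        except ValueError:
            continue
    return "string"


# Raw values as they appear in the untyped CSV output above.
print(infer_column_type(["2", "3"]))                 # DayOfWeek
print(infer_column_type(["1616.0", None, "1614.0"]))  # DepTime
print(infer_column_type(["AA", "AA"]))               # UniqueCarrier
```

Supplying an explicit schema avoids this extra inspection of the data.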

In [0]:
json_df  = session.read.json('dbfs:/FileStore/random_user_dicts.json')


In [0]:
json_df.show()

+----------+----------+--------------------+--------------+-----------+-----+
|first_name| last_name|            lat_long|         state|    user_id|  zip|
+----------+----------+--------------------+--------------+-----------+-----+
|  Benjamin|   Hawkins|{-88.1663, -70.6146}|       Vermont|028-73-9282|10447|
|     Flenn|   Chapman|   {24.893, 60.9322}|       Montana|852-16-2595|96244|
|     Stacy|     Owens| {-86.3024, 18.6558}|      Virginia|439-88-6137|46616|
|  Michelle|     Meyer|  {64.4687, 12.1853}|      Missouri|771-99-0447|19031|
|    Tamara|     Young|{-16.5984, -79.6928}| West Virginia|824-38-7655|46919|
| Nathaniel|      Wood|  {35.9258, 30.4532}|      Arkansas|146-06-1162|71250|
|      Dale|Cunningham|  {67.421, -24.9799}|         Maine|654-80-5980|67678|
|     Pedro|    Weaver| {37.2385, 155.5874}|North Carolina|163-01-4459|98164|
|   Stanley|     Olson|  {56.325, -74.2028}|       Indiana|939-30-2636|39194|
|   Willard|   Coleman|{-86.8209, 132.7801}|       Florida|572-0

In [0]:
print(json_df.count())


5000


In [0]:
json_df.printSchema()


root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- lat_long: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- state: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- zip: long (nullable = true)



In [0]:
json_df.select("first_name").collect()[0:10]

Out[48]: [Row(first_name='Benjamin'),
 Row(first_name='Flenn'),
 Row(first_name='Stacy'),
 Row(first_name='Michelle'),
 Row(first_name='Tamara'),
 Row(first_name='Nathaniel'),
 Row(first_name='Dale'),
 Row(first_name='Pedro'),
 Row(first_name='Stanley'),
 Row(first_name='Willard')]

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

json_struct = StructType([
    StructField("first_name", StringType(), nullable=False, metadata=None),
    StructField("last_name", StringType(),  nullable=False, metadata=None),
    StructField("lat_long", 
                StructType([
                    StructField("latitude", FloatType(), metadata=None, nullable=True),
                    StructField("longitude", FloatType(), metadata=None, nullable=True)
                ]), nullable=True, metadata=None),
    StructField("state", StringType(),  nullable=True, metadata=None),
    StructField("user_id", StringType(),  nullable=True, metadata=None),
    StructField("zip", StringType(),  nullable=True, metadata=None),    
])


In [0]:
import pyspark
dir(pyspark.sql.types)

Out[50]: ['Any',
 'ArrayType',
 'AtomicType',
 'BinaryType',
 'BooleanType',
 'ByteType',
 'Callable',
 'CharType',
 'ClassVar',
 'CloudPickleSerializer',
 'DataType',
 'DataTypeSingleton',
 'DateConverter',
 'DateType',
 'DatetimeConverter',
 'DatetimeNTZConverter',
 'DayTimeIntervalType',
 'DayTimeIntervalTypeConverter',
 'DecimalType',
 'Dict',
 'DoubleType',
 'FloatType',
 'FractionalType',
 'GatewayClient',
 'IntegerType',
 'IntegralType',
 'Iterable',
 'Iterator',
 'JavaClass',
 'JavaObject',
 'List',
 'LongType',
 'MapType',
 'NullType',
 'NumericType',
 'Optional',
 'Row',
 'ShortType',
 'StringType',
 'StructField',
 'StructType',
 'T',
 'TYPE_CHECKING',
 'TimestampNTZType',
 'TimestampType',
 'Tuple',
 'Type',
 'TypeVar',
 'U',
 'Union',
 'UserDefinedType',
 'VarcharType',
 '_FIXED_DECIMAL',
 '_INTERVAL_DAYTIME',
 '_LENGTH_CHAR',
 '_LENGTH_VARCHAR',
 '__all__',
 '__annotations__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package

In [0]:
import pprint
import json

json_df  = session.read.schema(json_struct).json('./data/random_user_dicts.json')
print(json_df.count())
new_json = json_df.schema.json()
pprint.pprint(json.loads(new_json))

5000
{'fields': [{'metadata': {},
             'name': 'first_name',
             'nullable': True,
             'type': 'string'},
            {'metadata': {},
             'name': 'last_name',
             'nullable': True,
             'type': 'string'},
            {'metadata': {},
             'name': 'lat_long',
             'nullable': True,
             'type': {'fields': [{'metadata': {},
                                  'name': 'latitude',
                                  'nullable': True,
                                  'type': 'float'},
                                 {'metadata': {},
                                  'name': 'longitude',
                                  'nullable': True,
                                  'type': 'float'}],
                      'type': 'struct'}},
            {'metadata': {},
             'name': 'state',
             'nullable': True,
             'type': 'string'},
            {'metadata': {},
             'name': 'user_id',
      

In [0]:
# Note the lat_long field. It has its own format

json_df.head(1)

[Row(first_name='Christopher', last_name='Morgan', lat_long=Row(latitude=6.444200038909912, longitude=-78.50630187988281), state='Nebraska', user_id='895-76-0473', zip='73093')]

### Insights into Data Formats

* For stable datasets, structured data formats such as tab or comma-delimited tables (CSV/TSV) are recommended.
  * However, data often evolves over time which may affect the structure of datasets.
  * Semi-structured data formats like JSON are flexible to accommodate varying fields across records.
```python
[ 
  {"user_id": "Jane1234", "employed": True, "salary": 95000}, 
  {"user_id": "John777", "employed": False}, 
  ...
]
```
* Traditional tabular data formats struggle with handling nested or hierarchical data structures.
  * Complex data types like lists or nested objects are often better represented in semi-structured formats:
```python
[{
    "user_id": "Jane1234", 
    "cars": ["Sedan", "Truck"],
    "children": {
        "Jonah": {"age": 8, "school": "Noelani Elementary"},
        "Mary": {"age": 12, "school": "Sacred Hearts"}
    } 
}, ...]
```

* Some considerations when using JSON:
  * Increased disk storage might be a concern due to field name duplication across records.
  * However, when loaded into memory, field names are typically not duplicated, saving space.

* JSON format may not be natively compatible with traditional SQL querying.
  * Temporary views or tables can be generated to facilitate SQL-like querying.
* Apache Spark excels at converting collections of JSON objects into tabular formats for querying.
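
As a plain-Python sketch of that conversion (an illustration of the idea only, not Spark's implementation), records with differing fields can be turned into rows over the union of all field names, with `None` filling the gaps:

```python
records = [
    {"user_id": "Jane1234", "employed": True, "salary": 95000},
    {"user_id": "John777", "employed": False},
]

# The table's columns are the union of all field names, sorted for a stable order.
columns = sorted({key for record in records for key in record})

# Each record becomes one row; fields missing from a record become None (null).
rows = [tuple(record.get(column) for column in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```

The field names are stored once as column headers rather than once per record, which is the in-memory saving mentioned above.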

  

### SQL with Spark DataFrames

* SQL capabilities are integral to Spark DataFrames.
    * Direct SQL support is exclusive to DataFrames.
* Familiar SQL queries are seamlessly compatible with DataFrames.
* To execute SQL on DataFrames:
    * First, register a temporary view.
    * Use the `.sql()` method for querying.
* Every query undergoes an optimization phase before execution.
    * This employs the Catalyst SQL optimizer.
        [Catalyst Optimizer Details](https://databricks.com/glossary/catalyst-optimizer)
* Bear in mind, DataFrame data is immutable.
    * To add new columns, generate a new DataFrame.


### Example of query optimization
![](https://www.dropbox.com/s/e756fxrsi36yvj4/unoptimized_optimized.png?dl=1)
    

In [0]:
json_df  = session.read.schema(json_struct).json('./data/random_user_dicts.json')
print(json_df.count())


5000


In [0]:
# createOrReplaceTempView avoids an error if this cell is re-run
json_df.createOrReplaceTempView("users")

session.sql("""
SELECT first_name, COUNT(*)
FROM users
GROUP BY first_name; 
""").show()

+----------+--------+
|first_name|count(1)|
+----------+--------+
|     Tyler|       6|
|  Samantha|      10|
|    Aubrey|       4|
|   Carolyn|      11|
|      Chad|       9|
|   Shannon|       8|
|     Shawn|       5|
|       Sue|      11|
|     Scott|       6|
|     Ruben|       9|
|     Flenn|       5|
|  Rosemary|       3|
|     Grace|       7|
|     Lucas|       8|
|     Keith|      12|
|    Gerald|      10|
|       Jar|       6|
|     Edwin|       6|
|     Soham|       5|
|  Savannah|       7|
+----------+--------+
only showing top 20 rows



In [0]:
session.sql("""
SELECT *
FROM users
WHERE first_name IN ("Evan", "Sarah", "John"); 
""").show()

+----------+---------+--------------------+-------------+-----------+-----+
|first_name|last_name|            lat_long|        state|    user_id|  zip|
+----------+---------+--------------------+-------------+-----------+-----+
|      Evan|     Beck|  {-17.7362, -3.022}|New Hampshire|505-92-9095|87501|
|      Evan|   Snyder| {56.7583, -64.3217}|    Minnesota|075-11-1233|74201|
|     Sarah|     Webb|   {2.5834, -89.618}| South Dakota|247-48-1845|86063|
|     Sarah|  Stanley|{-18.5369, -81.8778}|     Missouri|997-11-7309|68852|
|      John|  Nichols|{-10.576, -148.6093}|      Alabama|575-68-2404|17965|
|      John|     Reid|   {0.659, -70.5511}|  Mississippi|370-33-2662|57788|
|      John|  Freeman|{-69.3141, -12.2351}|        Maine|621-84-6581|13221|
|      John|    Price| {-7.5446, 109.4057}| North Dakota|323-62-2196|18403|
|      John|    Davis|{-39.9862, -72.4857}|        Maine|211-00-2584|56206|
|      Evan|   Torres|{-81.3332, 118.9237}|      Vermont|547-50-2905|64904|
|      Evan|

### DataFrame and Select Queries

* The same functionality is available using Python.
* Many additional functions, including analytics-specific ones, are available through the `pyspark.sql.functions` module:

    ```from pyspark.sql import functions as F```
    * These functions are used inside calls such as `select`, `filter`, and `agg`.
    * Use `agg` to run specific aggregations and `alias` to rename the resulting columns.

In [0]:
json_df.filter(F.length(json_df.first_name) < 4).show()

+----------+---------+--------------------+-------------+-----------+-----+
|first_name|last_name|            lat_long|        state|    user_id|  zip|
+----------+---------+--------------------+-------------+-----------+-----+
|       Joe|  Jackson|{-62.895, -143.5974}|      Florida|664-45-7303|23155|
|       Mia|   Chavez|  {13.112, 109.6125}|     Arkansas|309-47-7003|42409|
|       Jon|     Cole| {71.6651, 101.2318}| Rhode Island|217-81-4486|92912|
|       Mia|  Barrett| {-57.829, -63.2612}|New Hampshire|236-16-5120|31229|
|       Max|   Willis|   {31.288, 52.5112}|         Utah|858-00-9946|78049|
|       Kim|  Spencer|  {50.676, -70.7382}|        Idaho|715-58-2909|67867|
|       Roy| Mitchell|   {49.4706, 1.6422}|   Washington|421-56-4226|62647|
|       Eli|   Turner|{-42.0611, 144.1344}|     Delaware|744-31-7784|96253|
|       Ida|  Gilbert|   {1.7455, 81.5158}|    Louisiana|777-95-8206|65696|
|       Joe|      Fox| {-57.2874, 25.6431}|      Montana|754-89-1224|18754|
|       Lee|

In [0]:
json_df.groupby("last_name").count().show()

+---------+-----+
|last_name|count|
+---------+-----+
| Harrison|   14|
|   Porter|   17|
|    Scott|   23|
|Robertson|   20|
|   Wilson|   17|
|  Griffin|   15|
|    Lucas|   22|
|   Castro|   13|
|     Pena|   13|
|     Boyd|   25|
|    Jones|   18|
|   Graham|   18|
|  Herrera|   17|
| Crawford|   24|
|     Lowe|   11|
|  Sanchez|   13|
|Gutierrez|   15|
|    Garza|   18|
|    James|   15|
|     Soto|   13|
+---------+-----+
only showing top 20 rows



In [0]:
from pyspark.sql import functions as F

dir(F)

['Column',
 'DataFrame',
 'DataType',
 'PandasUDFType',
 'PythonEvalType',
 'SparkContext',
 'StringType',
 'UserDefinedFunction',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_create_column_from_literal',
 '_create_lambda',
 '_create_udf',
 '_get_get_jvm_function',
 '_get_lambda_parameters',
 '_invoke_binary_math_function',
 '_invoke_function',
 '_invoke_function_over_column',
 '_invoke_higher_order_function',
 '_options_to_str',
 '_test',
 '_to_java_column',
 '_to_seq',
 '_unresolved_named_lambda_variable',
 'abs',
 'acos',
 'acosh',
 'add_months',
 'aggregate',
 'approxCountDistinct',
 'approx_count_distinct',
 'array',
 'array_contains',
 'array_distinct',
 'array_except',
 'array_intersect',
 'array_join',
 'array_max',
 'array_min',
 'array_position',
 'array_remove',
 'array_repeat',
 'array_sort',
 'array_union',
 'arrays_overlap',
 'arrays_zip',
 'asc',
 'asc_nulls_first',
 'asc_nulls_last',
 'ascii',
 'asi

In [0]:
json_df.groupby("first_name").agg(F.count("first_name").alias('First Name Counts')).show()

+----------+-----------------+
|first_name|First Name Counts|
+----------+-----------------+
|     Tyler|                6|
|  Samantha|               10|
|    Aubrey|                4|
|   Carolyn|               11|
|      Chad|                9|
|   Shannon|                8|
|     Shawn|                5|
|       Sue|               11|
|     Scott|                6|
|     Ruben|                9|
|     Flenn|                5|
|  Rosemary|                3|
|     Grace|                7|
|     Lucas|                8|
|     Keith|               12|
|    Gerald|               10|
|       Jar|                6|
|     Edwin|                6|
|     Soham|                5|
|  Savannah|                7|
+----------+-----------------+
only showing top 20 rows



### More on Optimization

1. Caching Data:
   * Cache table contents or query outputs to speed up repeated data access.
   * Cached data, especially with Tungsten compression, can occupy less RAM than the same data occupies on disk.
   * Use lazy caching (`CACHE LAZY TABLE`) to cache data only when it is first needed, and `UNCACHE TABLE` to free up space for caching other DataFrames.

2. Adjusting DataFrame Partitioning:
   * Narrow operations (e.g., `filter`, or the per-partition partial counts computed for `COUNT`): each partition is processed independently, so no data shuffle is required.
   * Wide operations (e.g., `GROUP BY`): require data shuffling because they need data from multiple partitions.
   * Minimize data shuffling by tuning the number of partitions; excessive shuffling can lead to performance bottlenecks.

3. Additional Performance Tuning:
   * Explore further optimizations in Spark's [Performance Tuning documentation](https://spark.apache.org/docs/latest/sql-performance-tuning.html).
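
The narrow versus wide distinction can be modeled in plain Python by treating partitions as separate lists. This is a toy sketch under that assumption, not Spark code; the helper `target_partition` is a hypothetical stand-in for a hash partitioner.

```python
from collections import defaultdict

# Three input "partitions", each processed by a different worker in the model.
partitions = [
    [("A", 12), ("C", 2)],
    [("A", 2), ("B", 3)],
    [("C", 3), ("A", 5)],
]

# Narrow operation: filtering looks at one partition at a time,
# so the output keeps the same partition boundaries and nothing moves.
filtered = [[(k, v) for k, v in part if v > 2] for part in partitions]


def target_partition(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (hypothetical helper)."""
    return sum(key.encode("utf-8")) % num_partitions


# Wide operation: grouping by key needs all values of a key in one place,
# so records are sent ("shuffled") to the partition chosen by their key.
num_output_partitions = 2
shuffled = [defaultdict(list) for _ in range(num_output_partitions)]
for part in partitions:
    for key, value in part:
        shuffled[target_partition(key, num_output_partitions)][key].append(value)

grouped = [dict(part) for part in shuffled]

print(filtered)
print(grouped)
```

In the grouped result, the values of `A` came from all three input partitions, which is exactly the data movement a shuffle represents.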
