## Data Generation

- CoffeeCo is small, and fortunately for us it has only a few main stores.
- To begin, we'll build a DataFrame of the company's flagship stores and register it as a temporary SQL view called `stores`.

In [2]:
# Import required modules
from pyspark.sql import SparkSession
from datetime import datetime
from pyspark.sql import functions as f
from pyspark.sql import *
from pyspark.sql.types import *

# Initialize the SparkSession and add the local JARs to the driver classpath
spark = SparkSession.builder.master("local[1]") \
    .appName("DataGeneration") \
    .config("spark.driver.extraClassPath", "/home/jovyan/work/jars/*") \
    .getOrCreate()


# Define the schema for the store records
store_schema = StructType([
    StructField("name", StringType(), True),
    StructField("capacity", IntegerType(), True),
    StructField("opens", IntegerType(), True),
    StructField("closes", IntegerType(), True)
])

# Create a list of dictionaries representing the data
data = [
    {"name": "a", "capacity": 24, "opens": 8, "closes": 20},
    {"name": "b", "capacity": 36, "opens": 7, "closes": 21},
    {"name": "c", "capacity": 18, "opens": 5, "closes": 23}
]

# Create a PySpark DataFrame from the list of dictionaries and schema
stores_sdf = spark.createDataFrame(data, schema=store_schema)


stores_sdf.show()

+----+--------+-----+------+
|name|capacity|opens|closes|
+----+--------+-----+------+
|   a|      24|    8|    20|
|   b|      36|    7|    21|
|   c|      18|    5|    23|
+----+--------+-----+------+



### Create Temp View for SparkSQL

In [3]:
# Create a view for the DataFrame
stores_sdf.createOrReplaceTempView("stores")

### Query

In [4]:
# Query the view
spark.sql("SELECT * FROM stores").show()

+----+--------+-----+------+
|name|capacity|opens|closes|
+----+--------+-----+------+
|   a|      24|    8|    20|
|   b|      36|    7|    21|
|   c|      18|    5|    23|
+----+--------+-----+------+



## Selection

- Selection is arguably the most fundamental means of reducing the footprint of the data you are working with.
- The concept will be familiar to anyone with a working knowledge of SQL.
- In a nutshell, selection reduces the set of rows returned by a query by way of a condition.
- Say we want to find all the stores that are still open at or after a specific time of day.
- The query below uses simple selection to return only the rows that match the condition `closes >= 22`.

In [5]:
# Query the view
stores_con = spark.sql("SELECT * FROM stores WHERE closes >= 22")
stores_con.show()

+----+--------+-----+------+
|name|capacity|opens|closes|
+----+--------+-----+------+
|   c|      18|    5|    23|
+----+--------+-----+------+



## Filtering

- When we reference a column of a DataFrame in a condition, we have a few options for identifying the column.
- The Scala API offers four distinct ways to provide the target column. Two of them are the symbolic aliases `'` and `$`, which are implicit conversions made available by importing the implicit functions from the SparkSession.
- PySpark has no such aliases; a column is referenced through the DataFrame itself or through the `col` function:


- `df.where(df["closes"] >= 22)`
- `df.where(col("closes") >= 22)`

The `where` clause is interchangeable with the `filter` function of the DataFrame.

In [6]:
from pyspark.sql.functions import col

# Importing org.apache.spark.sql.functions._ is not necessary in PySpark
# Importing spark.implicits._ is not necessary if the code is run in a PySpark shell

# Filter the DataFrame using the col function
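# Avoid naming the result `filter`, which would shadow Python's built-in
filtered_sdf = stores_sdf.filter(col("closes") >= 22)

# Filter with where, referencing the column as a DataFrame attribute
where_sdf = stores_sdf.where(stores_sdf.closes >= 22)

# Show the results
filtered_sdf.show()
where_sdf.show()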

+----+--------+-----+------+
|name|capacity|opens|closes|
+----+--------+-----+------+
|   c|      18|    5|    23|
+----+--------+-----+------+

+----+--------+-----+------+
|name|capacity|opens|closes|
+----+--------+-----+------+
|   c|      18|    5|    23|
+----+--------+-----+------+



## Projection

**Projection is the process of reducing the total number of columns returned by a query.**

- Say we want to find all stores whose capacity is greater than 20.
- In this case, we can assume we don't need to worry about when a store opens or closes;
- we only want the `name` of each store.


- Find all stores with a capacity greater than 20.

In [7]:
# Query the view
stores_occ = spark.sql("SELECT name FROM stores WHERE capacity > 20")
stores_occ.show()

+----+
|name|
+----+
|   a|
|   b|
+----+



- The query above shows how to use projection and selection together.
- The projection dictates which columns the query returns: `SELECT name` directs Spark to return only the column labeled `name`.
- The selection, which is simply a filter or conditional predicate, dictates which rows meet the criteria to be returned: here, `WHERE capacity > 20`.

**Let’s see how we can build the same query using the DataFrame API directly**

In [8]:
from pyspark.sql.functions import col

# Select the rows with capacity > 20 first, then project down to the name column
stores_sdf.where(col("capacity") > 20) \
    .select("name") \
    .show()

+----+
|name|
+----+
|   a|
|   b|
+----+
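To make the division of labor between the two operations concrete, here is a conceptual sketch in plain Python rather than Spark. The helpers `select_rows` and `project` are hypothetical names chosen for illustration; they model what selection and projection do to a list of records and say nothing about how Spark executes a query.

```python
# The same three stores, as plain Python dictionaries
stores = [
    {"name": "a", "capacity": 24, "opens": 8, "closes": 20},
    {"name": "b", "capacity": 36, "opens": 7, "closes": 21},
    {"name": "c", "capacity": 18, "opens": 5, "closes": 23},
]


def select_rows(rows, predicate):
    """Selection: keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns of every row."""
    return [{column: row[column] for column in columns} for row in rows]


# Selection reduces the rows; every column is still present
large_stores = select_rows(stores, lambda row: row["capacity"] > 20)

# Projection reduces the columns of the rows that remain
result = project(large_stores, ["name"])
print(result)  # [{'name': 'a'}, {'name': 'b'}]
```

Selection runs first here because the predicate needs the `capacity` column, which the projection then discards.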

