# Learning Objectives
In this notebook, you will learn to use the Spark DataFrame API.

# Question List

Solve the following questions using the Spark DataFrame API.

### Join

1. easy - https://pgexercises.com/questions/joins/simplejoin.html
2. easy - https://pgexercises.com/questions/joins/simplejoin2.html
3. easy - https://pgexercises.com/questions/joins/self2.html 
4. medium - https://pgexercises.com/questions/joins/threejoin.html (three join)
5. medium - https://pgexercises.com/questions/joins/sub.html (subquery and join)

### Aggregation

1. easy - https://pgexercises.com/questions/aggregates/count3.html group by, order by
2. easy - https://pgexercises.com/questions/aggregates/fachours.html group by, order by
3. easy - https://pgexercises.com/questions/aggregates/fachoursbymonth.html group by with condition
4. easy - https://pgexercises.com/questions/aggregates/fachoursbymonth2.html group by multiple cols
5. easy - https://pgexercises.com/questions/aggregates/members1.html count distinct
6. medium - https://pgexercises.com/questions/aggregates/nbooking.html group by multiple cols, join

### String & Date

1. easy - https://pgexercises.com/questions/string/concat.html format string
2. easy - https://pgexercises.com/questions/string/case.html WHERE + string function
3. easy - https://pgexercises.com/questions/string/reg.html WHERE + string function
4. easy - https://pgexercises.com/questions/string/substr.html group by, substr
5. easy - https://pgexercises.com/questions/date/series.html generate ts
6. easy - https://pgexercises.com/questions/date/bookingspermonth.html extract month from ts

# Joins

### 1. Retrieve the start times of members' bookings
How can you produce a list of the start times for bookings by members named 'David Farrell'?

In [0]:
from pyspark.sql.functions import col

bookings_df = spark.table("bookings")
members_df = spark.table("members")

result_df = bookings_df.join(members_df, bookings_df.memid == members_df.memid)
result_df = result_df.filter((col("surname") == "Farrell") & (col("firstname") == "David")).select("starttime")

result_df.show()

+-------------------+
|          starttime|
+-------------------+
|2012-09-18 09:00:00|
|2012-09-18 17:30:00|
|2012-09-18 13:30:00|
|2012-09-18 20:00:00|
|2012-09-19 09:30:00|
|2012-09-19 15:00:00|
|2012-09-19 12:00:00|
|2012-09-20 15:30:00|
|2012-09-20 11:30:00|
|2012-09-20 14:00:00|
|2012-09-21 10:30:00|
|2012-09-21 14:00:00|
|2012-09-22 08:30:00|
|2012-09-22 17:00:00|
|2012-09-23 08:30:00|
|2012-09-23 17:30:00|
|2012-09-23 19:00:00|
|2012-09-24 08:00:00|
|2012-09-24 16:30:00|
|2012-09-24 12:30:00|
+-------------------+
only showing top 20 rows



### 2. Work out the start times of bookings for tennis courts
How can you produce a list of the start times for bookings for tennis courts, for the date '2012-09-21'? Return a list of start time and facility name pairings, ordered by the time.

In [0]:
from pyspark.sql.functions import to_date

facilities_df = spark.table("facilities")

result_df = facilities_df.join(bookings_df, bookings_df.facid == facilities_df.facid)
result_df = result_df.filter((col("name").isin("Tennis Court 1", "Tennis Court 2")) & (to_date(col("starttime")) == "2012-09-21")).select("starttime", "name").orderBy("starttime")

result_df.show()



+-------------------+--------------+
|          starttime|          name|
+-------------------+--------------+
|2012-09-21 08:00:00|Tennis Court 1|
|2012-09-21 08:00:00|Tennis Court 2|
|2012-09-21 09:30:00|Tennis Court 1|
|2012-09-21 10:00:00|Tennis Court 2|
|2012-09-21 11:30:00|Tennis Court 2|
|2012-09-21 12:00:00|Tennis Court 1|
|2012-09-21 13:30:00|Tennis Court 1|
|2012-09-21 14:00:00|Tennis Court 2|
|2012-09-21 15:30:00|Tennis Court 1|
|2012-09-21 16:00:00|Tennis Court 2|
|2012-09-21 17:00:00|Tennis Court 1|
|2012-09-21 18:00:00|Tennis Court 2|
+-------------------+--------------+



### 3. Produce a list of all members, along with their recommender
How can you output a list of all members, including the individual who recommended them (if any)? Ensure that results are ordered by (surname, firstname).

In [0]:
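# Inner self-join: only members who have a recommender are returned.
# Pass how="left" to join() to also keep members without one.
result_df = (members_df.alias("m1").join(members_df.alias("m2"), col("m1.recommendedby") == col("m2.memid")))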
result_df = result_df.select(
        col("m1.firstname"),
        col("m1.surname"),
        col("m2.firstname").alias("recommender_firstname"),
        col("m2.surname").alias("recommender_surname")
    ).orderBy(col("m1.surname"), col("m1.firstname"))

result_df.show()

+---------+---------+---------------------+-------------------+
|firstname|  surname|recommender_firstname|recommender_surname|
+---------+---------+---------------------+-------------------+
| Florence|    Bader|               Ponder|           Stibbons|
|     Anne|    Baker|               Ponder|           Stibbons|
|  Timothy|    Baker|               Jemima|            Farrell|
|      Tim|   Boothe|                  Tim|             Rownam|
|   Gerald|  Butters|               Darren|              Smith|
|     Joan|   Coplin|              Timothy|              Baker|
|    Erica|  Crumpet|                Tracy|              Smith|
|    Nancy|     Dare|               Janice|           Joplette|
|  Matthew|  Genting|               Gerald|            Butters|
|     John|     Hunt|            Millicent|            Purview|
|    David|    Jones|               Janice|           Joplette|
|  Douglas|    Jones|                David|              Jones|
|   Janice| Joplette|               Darr

### 4. Produce a list of all members who have used a tennis court
How can you produce a list of all members who have used a tennis court? Include in your output the name of the court, and the name of the member formatted as a single column. Ensure no duplicate data, and order by the member name followed by the facility name.

In [0]:
from pyspark.sql.functions import concat_ws

result_df = (bookings_df.join(facilities_df, col("bookings.facid") == col("facilities.facid")).join(members_df, col("bookings.memid") == col("members.memid")))
result_df = result_df.filter(col("name").isin("Tennis Court 1", "Tennis Court 2")).select(
        col("name").alias("facility_name"), 
        concat_ws(" ", col("firstname"), col("surname")).alias("member_name")
    ).distinct().orderBy("member_name", "facility_name")  # Order by member name, then facility name
result_df.show()

+--------------+--------------+
| facility_name|   member_name|
+--------------+--------------+
|Tennis Court 1|    Anne Baker|
|Tennis Court 2|    Anne Baker|
|Tennis Court 1|  Burton Tracy|
|Tennis Court 2|  Burton Tracy|
|Tennis Court 1|  Charles Owen|
|Tennis Court 2|  Charles Owen|
|Tennis Court 2|  Darren Smith|
|Tennis Court 1| David Farrell|
|Tennis Court 2| David Farrell|
|Tennis Court 1|   David Jones|
|Tennis Court 2|   David Jones|
|Tennis Court 1|  David Pinker|
|Tennis Court 1| Douglas Jones|
|Tennis Court 1| Erica Crumpet|
|Tennis Court 1|Florence Bader|
|Tennis Court 2|Florence Bader|
|Tennis Court 1|   GUEST GUEST|
|Tennis Court 2|   GUEST GUEST|
|Tennis Court 1|Gerald Butters|
|Tennis Court 2|Gerald Butters|
+--------------+--------------+
only showing top 20 rows



### 5. Produce a list of all members, along with their recommender, using no joins.
How can you output a list of all members, including the individual who recommended them (if any), without using any joins? Ensure that there are no duplicates in the list, and that each firstname + surname pairing is formatted as a column and ordered.

In [0]:
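# Note: the exercise asks for a solution without joins (a subquery in SQL).
# This cell uses an inner self-join instead, which also omits members with no recommender.
result_df = (members_df.alias("m1").join(members_df.alias("m2"), col("m1.recommendedby") == col("m2.memid")))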
result_df = result_df.select(
        concat_ws(" ", col("m1.firstname"), col("m1.surname")).alias("member"),
        concat_ws(" ", col("m2.firstname"), col("m2.surname")).alias("recommender")
    ).distinct().orderBy("member")
result_df.show()

+--------------------+-----------------+
|              member|      recommender|
+--------------------+-----------------+
|      Anna Mackenzie|     Darren Smith|
|          Anne Baker|  Ponder Stibbons|
|        Charles Owen|     Darren Smith|
|         David Jones|  Janice Joplette|
|        David Pinker|   Jemima Farrell|
|       Douglas Jones|      David Jones|
|       Erica Crumpet|      Tracy Smith|
|      Florence Bader|  Ponder Stibbons|
|      Gerald Butters|     Darren Smith|
|    Henrietta Rumney|  Matthew Genting|
|Henry Worthington...|      Tracy Smith|
|          Jack Smith|     Darren Smith|
|     Janice Joplette|     Darren Smith|
|         Joan Coplin|    Timothy Baker|
|           John Hunt|Millicent Purview|
|     Matthew Genting|   Gerald Butters|
|   Millicent Purview|      Tracy Smith|
|          Nancy Dare|  Janice Joplette|
|     Ponder Stibbons|     Burton Tracy|
|    Ramnaresh Sarwin|   Florence Bader|
+--------------------+-----------------+
only showing top

# Aggregation

### 1. Count the number of recommendations each member makes.
Produce a count of the number of recommendations each member has made. Order by member ID.

In [0]:
result_df = (
    members_df.filter(col("recommendedby").isNotNull())
    .groupBy("recommendedby")
    .count()
    .orderBy("recommendedby")
)

result_df.show()

+-------------+-----+
|recommendedby|count|
+-------------+-----+
|            1|    5|
|            2|    3|
|            3|    1|
|            4|    2|
|            5|    1|
|            6|    1|
|            9|    2|
|           11|    1|
|           13|    2|
|           15|    1|
|           16|    1|
|           20|    1|
|           30|    1|
+-------------+-----+



### 2. List the total slots booked per facility
Produce a list of the total number of slots booked per facility. For now, just produce an output table consisting of facility id and slots, sorted by facility id.

In [0]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import sum

result_df = (
    bookings_df.groupBy("facid")
    .agg(sum(col("slots")).alias("Total Slots"))
    .orderBy("facid") 
)

result_df.show()

+-----+-----------+
|facid|Total Slots|
+-----+-----------+
|    0|       1320|
|    1|       1278|
|    2|       1209|
|    3|        830|
|    4|       1404|
|    5|        228|
|    6|       1104|
|    7|        908|
|    8|        911|
+-----+-----------+



### 3. List the total slots booked per facility in a given month
Produce a list of the total number of slots booked per facility in the month of September 2012. Produce an output table consisting of facility id and slots, sorted by the number of slots.


In [0]:
filtered_df = bookings_df.filter(
    (col("starttime") >= "2012-09-01") & (col("starttime") < "2012-10-01")
)

result_df = (
    filtered_df.groupBy("facid")
    .agg(sum("slots").alias("total_slots"))  # Sum the slots
    .orderBy("total_slots")  # Order by total slots
)

result_df.show()

+-----+-----------+
|facid|total_slots|
+-----+-----------+
|    5|        122|
|    3|        422|
|    7|        426|
|    8|        471|
|    6|        540|
|    2|        570|
|    1|        588|
|    0|        591|
|    4|        648|
+-----+-----------+



### 4. List the total slots booked per facility per month
Produce a list of the total number of slots booked per facility per month in the year of 2012. Produce an output table consisting of facility id and slots, sorted by the id and month.



In [0]:
from pyspark.sql.functions import month

# Filter for bookings in the year 2012
filtered_df = bookings_df.filter(
    (col("starttime") >= "2012-01-01") & (col("starttime") < "2013-01-01")
)

result_df = (
    filtered_df.withColumn("month", month(col("starttime"))) 
    .groupBy("facid", "month")
    .agg(sum("slots").alias("total_slots"))
    .orderBy("facid", "month")
)

result_df.show()

+-----+-----+-----------+
|facid|month|total_slots|
+-----+-----+-----------+
|    0|    7|        270|
|    0|    8|        459|
|    0|    9|        591|
|    1|    7|        207|
|    1|    8|        483|
|    1|    9|        588|
|    2|    7|        180|
|    2|    8|        459|
|    2|    9|        570|
|    3|    7|        104|
|    3|    8|        304|
|    3|    9|        422|
|    4|    7|        264|
|    4|    8|        492|
|    4|    9|        648|
|    5|    7|         24|
|    5|    8|         82|
|    5|    9|        122|
|    6|    7|        164|
|    6|    8|        400|
+-----+-----+-----------+
only showing top 20 rows



### 5. Find the count of members who have made at least one booking
Find the total number of members (including guests) who have made at least one booking.





In [0]:
total_members = bookings_df.select("memid").distinct().count()

print(f"Total Members: {total_members}")

Total Members: 30


### 6. List each member's first booking after September 1st 2012
Produce a list of each member name, id, and their first booking after September 1st 2012. Order by member ID.


In [0]:
from pyspark.sql.functions import col, min

filtered_bookings_df = bookings_df.filter(col("starttime") >= "2012-09-01")

joined_df = members_df.alias("m").join(filtered_bookings_df.alias("b"), col("m.memid") == col("b.memid"))

result_df = (
    joined_df.groupBy(col("m.surname"), col("m.firstname"), col("m.memid"))
    .agg(min(col("b.starttime")).alias("first_booking"))
    .orderBy(col("m.memid"))
)

result_df.show()


+---------+---------+-----+-------------------+
|  surname|firstname|memid|      first_booking|
+---------+---------+-----+-------------------+
|    GUEST|    GUEST|    0|2012-09-01 08:00:00|
|    Smith|   Darren|    1|2012-09-01 09:00:00|
|    Smith|    Tracy|    2|2012-09-01 11:30:00|
|   Rownam|      Tim|    3|2012-09-01 16:00:00|
| Joplette|   Janice|    4|2012-09-01 15:00:00|
|  Butters|   Gerald|    5|2012-09-02 12:30:00|
|    Tracy|   Burton|    6|2012-09-01 15:00:00|
|     Dare|    Nancy|    7|2012-09-01 12:30:00|
|   Boothe|      Tim|    8|2012-09-01 08:30:00|
| Stibbons|   Ponder|    9|2012-09-01 11:00:00|
|     Owen|  Charles|   10|2012-09-01 11:00:00|
|    Jones|    David|   11|2012-09-01 09:30:00|
|    Baker|     Anne|   12|2012-09-01 14:30:00|
|  Farrell|   Jemima|   13|2012-09-01 09:30:00|
|    Smith|     Jack|   14|2012-09-01 11:00:00|
|    Bader| Florence|   15|2012-09-01 10:30:00|
|    Baker|  Timothy|   16|2012-09-01 15:00:00|
|   Pinker|    David|   17|2012-09-01 08

# String & Date

### 1. Format the names of members
Output the names of all members, formatted as 'Surname, Firstname'

In [0]:
result_df = members_df.select(concat_ws(", ", col("surname"), col("firstname")).alias("member_name"))
result_df.show()


+----------------+
|     member_name|
+----------------+
|    GUEST, GUEST|
|   Smith, Darren|
|    Smith, Tracy|
|     Rownam, Tim|
|Joplette, Janice|
| Butters, Gerald|
|   Tracy, Burton|
|     Dare, Nancy|
|     Boothe, Tim|
|Stibbons, Ponder|
|   Owen, Charles|
|    Jones, David|
|     Baker, Anne|
| Farrell, Jemima|
|     Smith, Jack|
| Bader, Florence|
|  Baker, Timothy|
|   Pinker, David|
|Genting, Matthew|
| Mackenzie, Anna|
+----------------+
only showing top 20 rows



### 2. Perform a case-insensitive search
Perform a case-insensitive search to find all facilities whose name begins with 'tennis'. Retrieve all columns.

In [0]:
from pyspark.sql.functions import lower
result_df = facilities_df.filter(lower(col("name")).startswith("tennis"))

result_df.show()

+-----+--------------+----------+---------+-------------+------------------+
|facid|          name|membercost|guestcost|initialoutlay|monthlymaintenance|
+-----+--------------+----------+---------+-------------+------------------+
|    0|Tennis Court 1|       5.0|     25.0|        10000|               200|
|    1|Tennis Court 2|       5.0|     25.0|         8000|               200|
+-----+--------------+----------+---------+-------------+------------------+



### 3. Find telephone numbers with parentheses
You've noticed that the club's member table has telephone numbers with very inconsistent formatting. You'd like to find all the telephone numbers that contain parentheses, returning the member ID and telephone number sorted by member ID.

In [0]:
result_df = members_df.filter(col("telephone").rlike(r"\(.*\)")) \
                      .select("memid", "telephone") \
                      .orderBy("memid")

result_df.show()

+-----+--------------+
|memid|     telephone|
+-----+--------------+
|    0|(000) 000-0000|
|    3|(844) 693-0723|
|    4|(833) 942-4710|
|    5|(844) 078-4130|
|    6|(822) 354-9973|
|    7|(833) 776-4001|
|    8|(811) 433-2547|
|    9|(833) 160-3900|
|   10|(855) 542-5251|
|   11|(844) 536-8036|
|   13|(855) 016-0163|
|   14|(822) 163-3254|
|   15|(833) 499-3527|
|   20|(811) 972-1377|
|   21|(822) 661-2898|
|   22|(822) 499-2232|
|   24|(822) 413-1470|
|   27|(822) 989-8876|
|   28|(855) 755-9876|
|   29|(855) 894-3758|
+-----+--------------+
only showing top 20 rows
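
The pattern `\(.*\)` requires an opening parenthesis followed, somewhere later, by a closing one. It can be sanity-checked locally with Python's `re` module. This is only a sketch: Spark evaluates `rlike` with Java regular expressions, and the assumption here is that this simple pattern behaves the same in both engines. The second and third sample values are invented for the illustration.

```python
import re

# Same pattern as the rlike() call above
pattern = re.compile(r"\(.*\)")

# Illustrative values; the second and third are made up for this sketch
samples = ["(844) 693-0723", "555-555-5555", "844) 693(0723"]

# re.search succeeds if the pattern matches anywhere in the string
matches = {phone: bool(pattern.search(phone)) for phone in samples}

print(matches)
```

The third value contains both characters but in the wrong order, so it does not match.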



### 4. Count the number of members whose surname starts with each letter of the alphabet
You'd like to produce a count of how many members you have whose surname starts with each letter of the alphabet. Sort by the letter, and don't worry about printing out a letter if the count is 0.

In [0]:
from pyspark.sql.functions import count, substring

result_df = (
    members_df.groupBy(substring(col("surname"), 1, 1).alias("first_letter"))
    .agg(count("*").alias("count"))
    .orderBy("first_letter")
)

result_df.show()

+------------+-----+
|first_letter|count|
+------------+-----+
|           B|    5|
|           C|    2|
|           D|    1|
|           F|    2|
|           G|    2|
|           H|    1|
|           J|    3|
|           M|    1|
|           O|    1|
|           P|    2|
|           R|    2|
|           S|    6|
|           T|    2|
|           W|    1|
+------------+-----+



### 5. Generate a list of all the dates in October 2012
Produce a list of all the dates in October 2012. They can be output as a timestamp (with time set to midnight) or a date.

In [0]:
from pyspark.sql.functions import col, explode, sequence, to_date

# One-row DataFrame holding the range bounds, expanded into one row per day
result_df = (
    spark.createDataFrame([("2012-10-01", "2012-10-31")], ["start", "end"])
    .select(explode(sequence(to_date(col("start")), to_date(col("end")))).alias("ts"))
)

result_df.show()

+----------+
|        ts|
+----------+
|2012-10-01|
|2012-10-02|
|2012-10-03|
|2012-10-04|
|2012-10-05|
|2012-10-06|
|2012-10-07|
|2012-10-08|
|2012-10-09|
|2012-10-10|
|2012-10-11|
|2012-10-12|
|2012-10-13|
|2012-10-14|
|2012-10-15|
|2012-10-16|
|2012-10-17|
|2012-10-18|
|2012-10-19|
|2012-10-20|
+----------+
only showing top 20 rows



### 6. Return a count of bookings for each month
Return a count of bookings for each month, sorted by month

In [0]:
from pyspark.sql.functions import count, trunc

df = (
    bookings_df.withColumn("month", trunc("starttime", "month"))
    .groupBy("month")
    .agg(count("*").alias("booking_count"))
    .orderBy("month")
)

df.show()

+----------+-------------+
|     month|booking_count|
+----------+-------------+
|2012-07-01|          658|
|2012-08-01|         1472|
|2012-09-01|         1913|
|2013-01-01|            1|
+----------+-------------+

