# Introduction

## Learning Objectives
In this notebook, you will learn to use the Spark DataFrame APIs.

## Question List

Solve the following questions using the Spark DataFrame APIs.

### Join

1. easy - https://pgexercises.com/questions/joins/simplejoin.html
2. easy - https://pgexercises.com/questions/joins/simplejoin2.html
3. easy - https://pgexercises.com/questions/joins/self2.html 
4. medium - https://pgexercises.com/questions/joins/threejoin.html (three join)
5. medium - https://pgexercises.com/questions/joins/sub.html (subquery and join)

### Aggregation

1. easy - https://pgexercises.com/questions/aggregates/count3.html group by, order by
2. easy - https://pgexercises.com/questions/aggregates/fachours.html group by, order by
3. easy - https://pgexercises.com/questions/aggregates/fachoursbymonth.html group by with condition
4. easy - https://pgexercises.com/questions/aggregates/fachoursbymonth2.html group by multiple columns
5. easy - https://pgexercises.com/questions/aggregates/members1.html count distinct
6. medium - https://pgexercises.com/questions/aggregates/nbooking.html group by multiple columns, join

### String & Date

1. easy - https://pgexercises.com/questions/string/concat.html format string
2. easy - https://pgexercises.com/questions/string/case.html WHERE + string function
3. easy - https://pgexercises.com/questions/string/reg.html WHERE + string function
4. easy - https://pgexercises.com/questions/string/substr.html group by, substr
5. easy - https://pgexercises.com/questions/date/series.html generate a series of timestamps
6. easy - https://pgexercises.com/questions/date/bookingspermonth.html extract month from timestamp

# Answers

## Joins

### Question 1

How can you produce a list of the start times for bookings by members named 'David Farrell'?

https://pgexercises.com/questions/joins/simplejoin.html

In [0]:
from pyspark.sql.functions import col 

df_book = spark.table('bookings')
df_mem = spark.table('members')
result = df_book.join(
  df_mem, df_book.memid == df_mem.memid, "inner"
)
result = result.filter(
  (col('surname') == "Farrell") & (col("firstname") == "David")
).select(col('starttime'))
result.show()

+-------------------+
|          starttime|
+-------------------+
|2012-09-18 09:00:00|
|2012-09-18 17:30:00|
|2012-09-18 13:30:00|
|2012-09-18 20:00:00|
|2012-09-19 09:30:00|
|2012-09-19 15:00:00|
|2012-09-19 12:00:00|
|2012-09-20 15:30:00|
|2012-09-20 11:30:00|
|2012-09-20 14:00:00|
|2012-09-21 10:30:00|
|2012-09-21 14:00:00|
|2012-09-22 08:30:00|
|2012-09-22 17:00:00|
|2012-09-23 08:30:00|
|2012-09-23 17:30:00|
|2012-09-23 19:00:00|
|2012-09-24 08:00:00|
|2012-09-24 16:30:00|
|2012-09-24 12:30:00|
+-------------------+
only showing top 20 rows



### Question 2

How can you produce a list of the start times for bookings for tennis courts, for the date '2012-09-21'? Return a list of start time and facility name pairings, ordered by the time.

https://pgexercises.com/questions/joins/simplejoin2.html

In [0]:
from pyspark.sql.functions import to_date

df_fac = spark.table("facilities")

result = df_book.join(
    df_fac, df_book.facid == df_fac.facid, "inner"
).filter(
    (to_date(col('starttime')) == "2012-09-21") 
    & 
    (col('name').contains("Tennis Court"))
).select(
    'starttime',
    'name'
).orderBy(
    'starttime'
)

result.show()


+-------------------+--------------+
|          starttime|          name|
+-------------------+--------------+
|2012-09-21 08:00:00|Tennis Court 1|
|2012-09-21 08:00:00|Tennis Court 2|
|2012-09-21 09:30:00|Tennis Court 1|
|2012-09-21 10:00:00|Tennis Court 2|
|2012-09-21 11:30:00|Tennis Court 2|
|2012-09-21 12:00:00|Tennis Court 1|
|2012-09-21 13:30:00|Tennis Court 1|
|2012-09-21 14:00:00|Tennis Court 2|
|2012-09-21 15:30:00|Tennis Court 1|
|2012-09-21 16:00:00|Tennis Court 2|
|2012-09-21 17:00:00|Tennis Court 1|
|2012-09-21 18:00:00|Tennis Court 2|
+-------------------+--------------+



### Question 3

How can you output a list of all members, including the individual who recommended them (if any)? Ensure that results are ordered by (surname, firstname).

https://pgexercises.com/questions/joins/self2.html

In [0]:
result = df_mem.alias('mem').join(
    df_mem.alias('rec'), col('mem.recommendedby') == col('rec.memid'), "left"
).select(
    col("mem.firstname").alias('memfname'),
    col('mem.surname').alias('memsname'),
    col('rec.firstname').alias('recfname'),
    col('rec.surname').alias('recsname')
).orderBy('memsname', 'memfname')

result.show()

+---------+---------+---------+--------+
| memfname| memsname| recfname|recsname|
+---------+---------+---------+--------+
| Florence|    Bader|   Ponder|Stibbons|
|     Anne|    Baker|   Ponder|Stibbons|
|  Timothy|    Baker|   Jemima| Farrell|
|      Tim|   Boothe|      Tim|  Rownam|
|   Gerald|  Butters|   Darren|   Smith|
|     Joan|   Coplin|  Timothy|   Baker|
|    Erica|  Crumpet|    Tracy|   Smith|
|    Nancy|     Dare|   Janice|Joplette|
|    David|  Farrell|     NULL|    NULL|
|   Jemima|  Farrell|     NULL|    NULL|
|    GUEST|    GUEST|     NULL|    NULL|
|  Matthew|  Genting|   Gerald| Butters|
|     John|     Hunt|Millicent| Purview|
|    David|    Jones|   Janice|Joplette|
|  Douglas|    Jones|    David|   Jones|
|   Janice| Joplette|   Darren|   Smith|
|     Anna|Mackenzie|   Darren|   Smith|
|  Charles|     Owen|   Darren|   Smith|
|    David|   Pinker|   Jemima| Farrell|
|Millicent|  Purview|    Tracy|   Smith|
+---------+---------+---------+--------+
only showing top 20 rows

### Question 4

How can you produce a list of all members who have used a tennis court? Include in your output the name of the court, and the name of the member formatted as a single column. Ensure no duplicate data, and order by the member name followed by the facility name.

https://pgexercises.com/questions/joins/threejoin.html

In [0]:
from pyspark.sql.functions import concat, col, lit

result = df_mem.join(
    df_book, df_mem.memid == df_book.memid, 'inner'
).join(
    df_fac, df_book.facid == df_fac.facid, 'inner'
).select(
    concat(
        df_mem.firstname, lit(' '), df_mem.surname
    ).alias('member'),
    df_fac.name.alias('facility')
).filter(
    col('facility').contains('Tennis Court')
).distinct().orderBy('member', 'facility')

result.show()

+--------------+--------------+
|        member|      facility|
+--------------+--------------+
|    Anne Baker|Tennis Court 1|
|    Anne Baker|Tennis Court 2|
|  Burton Tracy|Tennis Court 1|
|  Burton Tracy|Tennis Court 2|
|  Charles Owen|Tennis Court 1|
|  Charles Owen|Tennis Court 2|
|  Darren Smith|Tennis Court 2|
| David Farrell|Tennis Court 1|
| David Farrell|Tennis Court 2|
|   David Jones|Tennis Court 1|
|   David Jones|Tennis Court 2|
|  David Pinker|Tennis Court 1|
| Douglas Jones|Tennis Court 1|
| Erica Crumpet|Tennis Court 1|
|Florence Bader|Tennis Court 1|
|Florence Bader|Tennis Court 2|
|   GUEST GUEST|Tennis Court 1|
|   GUEST GUEST|Tennis Court 2|
|Gerald Butters|Tennis Court 1|
|Gerald Butters|Tennis Court 2|
+--------------+--------------+
only showing top 20 rows



### Question 5

How can you output a list of all members, including the individual who recommended them (if any), without using any joins? Ensure that there are no duplicates in the list, and that each firstname + surname pairing is formatted as a column and ordered.

https://pgexercises.com/questions/joins/sub.html

In [0]:
from pyspark.sql.functions import concat, col, lit, udf
from pyspark.sql.types import StringType

df_mem_broadcast = spark.sparkContext.broadcast(
    df_mem.collect()
)

def get_rec_name(recid):
    if recid is None:
        return None
    rec_name = [row for row in df_mem_broadcast.value if row['memid'] == recid]

    if rec_name:
        return f"{rec_name[0]['firstname']} {rec_name[0]['surname']}"
    else:
        return None
    
get_rec_name_udf = udf(
    get_rec_name,
    StringType()
)

result = df_mem.withColumn(
    'recommender',
    get_rec_name_udf(col('recommendedby'))
).select(
    concat(
        col('firstname'), lit(' '), col('surname')
    ).alias('member'),
    'recommender'
).orderBy('member')

result.show()

+--------------------+---------------+
|              member|    recommender|
+--------------------+---------------+
|      Anna Mackenzie|   Darren Smith|
|          Anne Baker|Ponder Stibbons|
|        Burton Tracy|           NULL|
|        Charles Owen|   Darren Smith|
|        Darren Smith|           NULL|
|        Darren Smith|           NULL|
|       David Farrell|           NULL|
|         David Jones|Janice Joplette|
|        David Pinker| Jemima Farrell|
|       Douglas Jones|    David Jones|
|       Erica Crumpet|    Tracy Smith|
|      Florence Bader|Ponder Stibbons|
|         GUEST GUEST|           NULL|
|      Gerald Butters|   Darren Smith|
|    Henrietta Rumney|Matthew Genting|
|Henry Worthington...|    Tracy Smith|
| Hyacinth Tupperware|           NULL|
|          Jack Smith|   Darren Smith|
|     Janice Joplette|   Darren Smith|
|      Jemima Farrell|           NULL|
+--------------------+---------------+
only showing top 20 rows



## Aggregations

### Question 1

Produce a count of the number of recommendations each member has made. Order by member ID.

https://pgexercises.com/questions/aggregates/count3.html

In [0]:
from pyspark.sql.functions import count, col

result = df_mem.groupBy('recommendedby').count().orderBy('recommendedby')
result = result.na.drop()

result.show()

+-------------+-----+
|recommendedby|count|
+-------------+-----+
|            1|    5|
|            2|    3|
|            3|    1|
|            4|    2|
|            5|    1|
|            6|    1|
|            9|    2|
|           11|    1|
|           13|    2|
|           15|    1|
|           16|    1|
|           20|    1|
|           30|    1|
+-------------+-----+



### Question 2

Produce a list of the total number of slots booked per facility. For now, just produce an output table consisting of facility id and slots, sorted by facility id.

https://pgexercises.com/questions/aggregates/fachours.html

In [0]:
from pyspark.sql.functions import sum

result = df_book.groupBy('facid').agg(sum('slots').alias("Total Slots")).orderBy('facid')

result.show()

+-----+-----------+
|facid|Total Slots|
+-----+-----------+
|    0|       1320|
|    1|       1278|
|    2|       1209|
|    3|        830|
|    4|       1404|
|    5|        228|
|    6|       1104|
|    7|        908|
|    8|        911|
+-----+-----------+



### Question 3

Produce a list of the total number of slots booked per facility in the month of September 2012. Produce an output table consisting of facility id and slots, sorted by the number of slots.

https://pgexercises.com/questions/aggregates/fachoursbymonth.html

In [0]:
from pyspark.sql.functions import col

result = df_book.filter(
    (col('starttime') >= "2012-09-01") & (col('starttime') < "2012-10-01")
).groupBy('facid').agg(sum('slots').alias("Total Slots")).orderBy('Total Slots')

result.show()

+-----+-----------+
|facid|Total Slots|
+-----+-----------+
|    5|        122|
|    3|        422|
|    7|        426|
|    8|        471|
|    6|        540|
|    2|        570|
|    1|        588|
|    0|        591|
|    4|        648|
+-----+-----------+



### Question 4

Produce a list of the total number of slots booked per facility per month in the year of 2012. Produce an output table consisting of facility id and slots, sorted by the id and month.

https://pgexercises.com/questions/aggregates/fachoursbymonth2.html

In [0]:
from pyspark.sql.functions import month, year

result = df_book.filter(
    year('starttime') == 2012
).groupBy('facid', month('starttime').alias('month')).agg(
    sum('slots').alias("Total Slots")
).orderBy('facid', 'month')

result.show()

+-----+-----+-----------+
|facid|month|Total Slots|
+-----+-----+-----------+
|    0|    7|        270|
|    0|    8|        459|
|    0|    9|        591|
|    1|    7|        207|
|    1|    8|        483|
|    1|    9|        588|
|    2|    7|        180|
|    2|    8|        459|
|    2|    9|        570|
|    3|    7|        104|
|    3|    8|        304|
|    3|    9|        422|
|    4|    7|        264|
|    4|    8|        492|
|    4|    9|        648|
|    5|    7|         24|
|    5|    8|         82|
|    5|    9|        122|
|    6|    7|        164|
|    6|    8|        400|
+-----+-----+-----------+
only showing top 20 rows



### Question 5

Find the total number of members (including guests) who have made at least one booking.

https://pgexercises.com/questions/aggregates/members1.html

In [0]:
result = df_book.select('memid').distinct().count()

print(result)

30


### Question 6

Produce a list of each member name, id, and their first booking after September 1st 2012. Order by member ID.

https://pgexercises.com/questions/aggregates/nbooking.html

In [0]:
from pyspark.sql.functions import min

result = df_mem.join(
    df_book, df_mem.memid == df_book.memid, 'inner'
).filter(
    col("starttime") >= "2012-09-01"
).select(
    'surname',
    'firstname',
    df_mem.memid.alias('memid'),
    "starttime"
).groupBy(
    'surname', 'firstname', 'memid'
).agg(
    min(col("starttime")).alias("starttime")
).orderBy('memid')
result.show()

+---------+---------+-----+-------------------+
|  surname|firstname|memid|          starttime|
+---------+---------+-----+-------------------+
|    GUEST|    GUEST|    0|2012-09-01 08:00:00|
|    Smith|   Darren|    1|2012-09-01 09:00:00|
|    Smith|    Tracy|    2|2012-09-01 11:30:00|
|   Rownam|      Tim|    3|2012-09-01 16:00:00|
| Joplette|   Janice|    4|2012-09-01 15:00:00|
|  Butters|   Gerald|    5|2012-09-02 12:30:00|
|    Tracy|   Burton|    6|2012-09-01 15:00:00|
|     Dare|    Nancy|    7|2012-09-01 12:30:00|
|   Boothe|      Tim|    8|2012-09-01 08:30:00|
| Stibbons|   Ponder|    9|2012-09-01 11:00:00|
|     Owen|  Charles|   10|2012-09-01 11:00:00|
|    Jones|    David|   11|2012-09-01 09:30:00|
|    Baker|     Anne|   12|2012-09-01 14:30:00|
|  Farrell|   Jemima|   13|2012-09-01 09:30:00|
|    Smith|     Jack|   14|2012-09-01 11:00:00|
|    Bader| Florence|   15|2012-09-01 10:30:00|
|    Baker|  Timothy|   16|2012-09-01 15:00:00|
|   Pinker|    David|   17|2012-09-01 08

## String and Date

### Question 1

Output all names formatted as 'Surname, Firstname'

https://pgexercises.com/questions/string/concat.html

In [0]:
from pyspark.sql.functions import concat, lit

result = df_mem.withColumn(
    'name', concat('surname', lit(', '), 'firstname')
).select('name').orderBy('name', ascending=False)

result.show()

+--------------------+
|                name|
+--------------------+
|Worthington-Smyth...|
|Tupperware, Hyacinth|
|       Tracy, Burton|
|    Stibbons, Ponder|
|        Smith, Tracy|
|         Smith, Jack|
|       Smith, Darren|
|       Smith, Darren|
|   Sarwin, Ramnaresh|
|   Rumney, Henrietta|
|         Rownam, Tim|
|  Purview, Millicent|
|       Pinker, David|
|       Owen, Charles|
|     Mackenzie, Anna|
|    Joplette, Janice|
|      Jones, Douglas|
|        Jones, David|
|          Hunt, John|
|    Genting, Matthew|
+--------------------+
only showing top 20 rows



### Question 2

Perform a case-insensitive search to find all facilities whose name begins with 'tennis'. Retrieve all columns.

https://pgexercises.com/questions/string/case.html

In [0]:
# startswith() is used here as a Column method, so only col and lower are imported
from pyspark.sql.functions import col, lower

result = df_fac.filter(
    lower(col('name')).startswith('tennis')
)

result.show()

+-----+--------------+----------+---------+-------------+------------------+
|facid|          name|membercost|guestcost|initialoutlay|monthlymaintenance|
+-----+--------------+----------+---------+-------------+------------------+
|    0|Tennis Court 1|         5|       25|        10000|               200|
|    1|Tennis Court 2|         5|       25|         8000|               200|
+-----+--------------+----------+---------+-------------+------------------+



### Question 3

You've noticed that the club's member table has telephone numbers with very inconsistent formatting. You'd like to find all the telephone numbers that contain parentheses, returning the member ID and telephone number sorted by member ID.

https://pgexercises.com/questions/string/reg.html

In [0]:
# contains() is used here as a Column method, so only col is imported
from pyspark.sql.functions import col

result = df_mem.filter(
    col('telephone').contains('(') | col('telephone').contains(')')
).select(
    'memid',
    'telephone'
).orderBy('memid')

result.show()

+-----+--------------+
|memid|     telephone|
+-----+--------------+
|    0|(000) 000-0000|
|    3|(844) 693-0723|
|    4|(833) 942-4710|
|    5|(844) 078-4130|
|    6|(822) 354-9973|
|    7|(833) 776-4001|
|    8|(811) 433-2547|
|    9|(833) 160-3900|
|   10|(855) 542-5251|
|   11|(844) 536-8036|
|   13|(855) 016-0163|
|   14|(822) 163-3254|
|   15|(833) 499-3527|
|   20|(811) 972-1377|
|   21|(822) 661-2898|
|   22|(822) 499-2232|
|   24|(822) 413-1470|
|   27|(822) 989-8876|
|   28|(855) 755-9876|
|   29|(855) 894-3758|
+-----+--------------+
only showing top 20 rows



### Question 4

You'd like to produce a count of how many members you have whose surname starts with each letter of the alphabet. Sort by the letter, and don't worry about printing out a letter if the count is 0.


https://pgexercises.com/questions/string/substr.html

In [0]:
from pyspark.sql.functions import substring, upper

result = df_mem.withColumn(
    'First Letter', upper(substring(col('surname'), 1, 1))
).groupBy('First Letter').agg(
    count('*').alias('count')
).filter(
    col('count') > 0
).orderBy('First Letter')

result.show()

+------------+-----+
|First Letter|count|
+------------+-----+
|           B|    5|
|           C|    2|
|           D|    1|
|           F|    2|
|           G|    2|
|           H|    1|
|           J|    3|
|           M|    1|
|           O|    1|
|           P|    2|
|           R|    2|
|           S|    6|
|           T|    2|
|           W|    1|
+------------+-----+



### Question 5

Produce a list of all the dates in October 2012. They can be output as a timestamp (with time set to midnight) or a date.


https://pgexercises.com/questions/date/series.html

In [0]:
# Needs further analysis

### Question 6

Return a count of bookings for each month, sorted by month

https://pgexercises.com/questions/date/bookingspermonth.html

In [0]:
from pyspark.sql.functions import date_trunc, count, to_date

result = df_book.groupBy(
    to_date(date_trunc('month', 'starttime')).alias('month')
).agg(
    count('*').alias('count')
).orderBy('month')

result.show()

+----------+-----+
|     month|count|
+----------+-----+
|2012-07-01|  658|
|2012-08-01| 1472|
|2012-09-01| 1913|
|2013-01-01|    1|
+----------+-----+

