# Introduction to Apache Spark / PySpark

As of August 2022, these examples are based on [Spark 3.3.0](https://spark.apache.org/docs/3.3.0/)  

Reference/API Links


*   [Apache Spark Quick Start](https://spark.apache.org/docs/3.3.0/quick-start.html)
*   [PySpark v3.3.0 API](https://spark.apache.org/docs/3.3.0/api/python/reference/index.html)
*   [RDD Programming Guide](https://spark.apache.org/docs/3.3.0/rdd-programming-guide.html)
*   [Spark SQL Programming Guide](https://spark.apache.org/docs/3.3.0/sql-programming-guide.html)









In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
openjdk-8-jdk-headless is already the newest version (8u342-b07-0ubuntu1~18.04).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.



# Imports / Starter Example




In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql import types as sparktypes
from pyspark.sql.functions import col, split
import pyspark.sql.functions as pyspark

sc = SparkContext() 
spark = SparkSession(sc)

In [None]:
# Create a Resilient Distributed Dataset (RDD) from a sequence of integers, then apply filter(), map() and reduce()

# Function to be used in the filter() transformation: keeps values of 20 or more
def filterSmall(x):
    return x >= 20

# Function to be used in the map() transformation
def mapSquare(x):
    return x*x

# Function to be used in the reduce() action
def reduceSum(x,y):
    return x+y
  
rdd = sc.parallelize(range(100))         ## create an RDD of 100 numbers from 0 to 99

#print(rdd.filter(filterSmall).collect())

out1 = rdd.filter(filterSmall).map(mapSquare)  ## perform filter and map transformations
out2 = out1.reduce(reduceSum)                  ## perform reduce operation

print(out1.collect())           ## print first output (squares of all numbers from 20 to 99)
print(out2)                 ## print second output (sum of all numbers from first output)


[400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]
325880
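
As a cross-check, the same filter/map/reduce pipeline can be sketched in plain Python with the built-in `filter` and `map` plus `functools.reduce`. This is only a local illustration under the assumption that the data fits in memory; Spark evaluates the transformations lazily and in parallel across partitions. The names `kept`, `squares` and `total` are chosen here for illustration.

```python
from functools import reduce

# Plain-Python equivalent of rdd.filter(...).map(...).reduce(...), for checking results locally
numbers = range(100)                            # 0 to 99, as in sc.parallelize(range(100))
kept = filter(lambda x: x >= 20, numbers)       # same predicate as filterSmall
squares = list(map(lambda x: x * x, kept))      # same function as mapSquare
total = reduce(lambda x, y: x + y, squares)     # same function as reduceSum

print(squares[:3], len(squares))
print(total)
```

The total should agree with the value printed by the RDD version above.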


In [None]:
# download a sample access log for use in demos below
!rm -f apache.access.log
!wget -q https://raw.githubusercontent.com/databricks/reference-apps/master/logs_analyzer/data/apache.access.log

# Apache HTTP Log Example - Resilient Distributed Dataset (RDD)

A SparkContext instance can be used to create RDDs from various data/files/resources (text files, CSV, Hadoop data files, etc.)

In [None]:
# find the top 10 clients, map/reduce style using RDD transformations
access_log_rdd = (sc.textFile("apache.access.log")
                  .map(lambda line: ( line.split(" ")[0], 1 ))  # field 0 = client address
                  .reduceByKey(lambda count1, count2: count1 + count2)
                  .sortBy(lambda t: -t[1])) 

print ("Total count of client hostnames:")
print(access_log_rdd.count())

print ("Top 10 client hostnames:")
print(access_log_rdd.take(10))


Total count of client hostnames:
169
Top 10 client hostnames:
[('64.242.88.10', 452), ('10.0.0.153', 188), ('cr020r01-3.sac.overture.com', 44), ('h24-71-236-129.ca.shawcable.net', 36), ('h24-70-69-74.ca.shawcable.net', 32), ('market-mail.panduit.com', 29), ('ts04-ip92.hevanet.com', 28), ('ip68-228-43-49.tc.ph.cox.net', 22), ('proxy0.haifa.ac.il', 19), ('207.195.59.160', 15)]
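
The field positions used with `split(" ")` throughout this notebook can be checked by splitting a single log line locally. The sketch below uses one line copied from the sample log; the variable names are illustrative only. Indexing the status and size from the end of the list is an assumption made here so that the lookup would still work if a request string contained extra spaces.

```python
# One line copied from apache.access.log
sample = '64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291'

fields = sample.split(" ")

host = fields[0]          # client hostname or IPv4 address
timestamp = fields[3]     # "[07/Mar/2004:16:10:02" (the UTC offset is field 4)
method = fields[5]        # '"GET' (still carries the opening quote)
url_path = fields[6]      # requested URL path
status = fields[-2]       # HTTP response code
size = fields[-1]         # bytes sent to the client

print(host, timestamp, url_path, status, size)
```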


# Apache HTTP Log Example - DataFrame

A DataFrame is equivalent to a relational table in Spark SQL and can be created from a variety of input formats (CSV, JSON, relational database, etc.) using the SparkSession.

In [None]:
access_log_df = spark.read.text("apache.access.log")

access_log_df.show(truncate=False)
access_log_df.printSchema()

+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846   |
|64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523                             |
|64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291                                                         |
|64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET 

In [None]:
access_log_df = spark.read.options(delimiter=" ").csv("apache.access.log")

access_log_df.show(truncate=False)
access_log_df.printSchema()

+------------+---+---+---------------------+------+-------------------------------------------------------------------------------------------------+---+-----+
|_c0         |_c1|_c2|_c3                  |_c4   |_c5                                                                                              |_c6|_c7  |
+------------+---+---+---------------------+------+-------------------------------------------------------------------------------------------------+---+-----+
|64.242.88.10|-  |-  |[07/Mar/2004:16:05:49|-0800]|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1   |401|12846|
|64.242.88.10|-  |-  |[07/Mar/2004:16:06:51|-0800]|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1                            |200|4523 |
|64.242.88.10|-  |-  |[07/Mar/2004:16:10:02|-0800]|GET /mailman/listinfo/hsdivision HTTP/1.1                                                        |200|6291 |
|64.242.88.10|-  |-  |[07/Mar/2004:16:11

In [None]:
access_log_df = spark.read.options(delimiter=" ").csv("apache.access.log")

named_df = access_log_df.select(col('_c0').alias('host'),
                                col('_c3').alias('timestamp'),
                                col('_c5').alias('path'),
                                col('_c6').cast('integer').alias('status'),
                                col('_c7').cast('integer').alias('content_size'))



named_df.show(truncate=False)
named_df.printSchema()

+------------+---------------------+-------------------------------------------------------------------------------------------------+------+------------+
|host        |timestamp            |path                                                                                             |status|content_size|
+------------+---------------------+-------------------------------------------------------------------------------------------------+------+------------+
|64.242.88.10|[07/Mar/2004:16:05:49|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1   |401   |12846       |
|64.242.88.10|[07/Mar/2004:16:06:51|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1                            |200   |4523        |
|64.242.88.10|[07/Mar/2004:16:10:02|GET /mailman/listinfo/hsdivision HTTP/1.1                                                        |200   |6291        |
|64.242.88.10|[07/Mar/2004:16:11:58|GET /twiki/bin/view/TWiki/WikiSynt

In [None]:
named_df.createOrReplaceTempView("log")
sql_df = spark.sql("SELECT * FROM log WHERE status = 404")

sql_df.show(truncate=False)
sql_df.printSchema()

+-----------------------------+---------------------+--------------------------------------------------------------------------+------+------------+
|host                         |timestamp            |path                                                                      |status|content_size|
+-----------------------------+---------------------+--------------------------------------------------------------------------+------+------------+
|h24-70-56-49.ca.shawcable.net|[07/Mar/2004:21:16:17|GET /twiki/view/Main/WebHome HTTP/1.1                                     |404   |300         |
|61.9.4.61                    |[08/Mar/2004:07:27:36|GET /_vti_bin/owssvr.dll?UL=1&ACT=4&BUILD=2614&STRMVER=4&CAPREQ=0 HTTP/1.0|404   |284         |
|61.9.4.61                    |[08/Mar/2004:07:27:37|GET /MSOffice/cltreq.asp?UL=1&ACT=4&BUILD=2614&STRMVER=4&CAPREQ=0 HTTP/1.0|404   |284         |
|1513.cps.virtua.com.br       |[11/Mar/2004:02:27:39|GET /pipermail/cipg/2003-november.txt HTTP/1.1       

## What About Datasets?

Added in Spark 1.6, a **Dataset** is a distributed collection of data that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python’s dynamic nature, many of its benefits are already available (for example, you can access a field of a row by name, as in `row.columnName`). The case for R is similar.

A **DataFrame** is a Dataset organized into named columns. ([source](https://spark.apache.org/docs/3.1.1/sql-programming-guide.html)) 

# Reporting Tasks (from Lab 2)


1. Most popular URL paths (top 15)
2. Request count for each HTTP response code, sorted by response code
3. Request count for each calendar month and year, sorted chronologically
4. Total bytes sent to the client with a specified hostname or IPv4 address (you may hard code an address)
5. Based on a given URL (hard coded), compute a request count for each client (hostname or IPv4) who accessed that URL, sorted by request count, highest to lowest


# (A) RDD Implementations

Perform reporting tasks 1-5 using RDD transformations

[RDD APIs PySpark v3.3.0](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.html#rdd-apis)

In [None]:
# RDD implementation
# (1) Most popular URL paths (top 15)

# count requests per URL path, map/reduce style using RDD transformations
access_log_rdd = (sc.textFile("apache.access.log")
                  .map(lambda line: ( line.split(" ")[6], 1 ))  # field 6 = requested URL path
                  .reduceByKey(lambda x, y: x + y)
                  .sortBy(lambda t: -t[1]))

print ("Total count of distinct URL paths:")
print(access_log_rdd.count())

print ("Top 15 URL paths:")
print(access_log_rdd.take(15))


Total count of distinct URL paths:
628
Top 15 URL paths:
[('/twiki/bin/view/Main/WebHome', 40), ('/twiki/pub/TWiki/TWikiLogos/twikiRobot46x50.gif', 32), ('/', 31), ('/favicon.ico', 28), ('/robots.txt', 27), ('/razor.html', 23), ('/twiki/bin/view/Main/SpamAssassinTaggingOnly', 18), ('/twiki/bin/view/Main/SpamAssassinAndPostFix', 17), ('/cgi-bin/mailgraph2.cgi', 16), ('/cgi-bin/mailgraph.cgi/mailgraph_0.png', 16),


In [None]:
# RDD implementation
# (2) Request count for each HTTP response code, sorted by response code

# count requests per status code, map/reduce style using RDD transformations
access_log_rdd = (sc.textFile("apache.access.log")
                  .map(lambda line: ( line.split(" ")[-2], 1 ))  # second-to-last field = HTTP status code
                  .reduceByKey(lambda x, y: x + y)
                  .sortBy(lambda t: t[0]))  # three-digit codes sort correctly as strings

print ("Request count per HTTP response code:")
print(access_log_rdd.collect())


In [None]:
# RDD implementation
# (3) Request count for each calendar month and year, sorted chronologically
from datetime import date

# Map Apache's three-letter month abbreviations to month numbers
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def getYearMonth(line: str):
    # field 3 looks like "[07/Mar/2004:16:05:49"
    str_time = line.split(" ")[3]
    parts = str_time.split("/")

    month = MONTHS[parts[1]]
    year = int(parts[2][0:4])

    # use the first day of the month so all requests in a month share one sortable key
    return date(year, month, 1)

access_log_rdd = (sc.textFile("apache.access.log")
                  .map(lambda line : (getYearMonth(line), 1))
                  .reduceByKey(lambda x, y: x + y)
                  .sortBy(lambda x: x[0])
                  .map(lambda data: (str(data[0].year) + '/' + str(data[0].month), data[1])))
                  
print(access_log_rdd.collect())


[('2004/3', 1406)]
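
An alternative to a hand-written month table is to let `datetime.strptime` parse the timestamp field. The sketch below is a plain-Python illustration, not part of the original lab. `year_month` is a hypothetical helper, and it assumes the default C/English locale, because `%b` matches locale-dependent month abbreviations. It returns a `(year, month)` tuple, which sorts chronologically.

```python
from datetime import datetime

def year_month(line: str):
    """Return (year, month) parsed from the timestamp field of one access-log line."""
    # field 3 looks like "[07/Mar/2004:16:05:49"; drop the leading "[" before parsing
    stamp = line.split(" ")[3].lstrip("[")
    parsed = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    return (parsed.year, parsed.month)

# One line copied from apache.access.log
sample = '64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846'
print(year_month(sample))
```

A function like this could be passed to `map()` in place of `getYearMonth`, with the final formatting step adjusted to read the tuple elements.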


In [None]:
# RDD implementation
# (4) Total bytes sent to the client with a specified hostname or IPv4 address (you may hard code an address)

access_log_rdd = (sc.textFile("apache.access.log")
                  .filter(lambda line : line.split(' ')[0] == '10.0.0.153')
                  .map(lambda line : (line.split(' ')[0], int(line.split(' ')[-1])))
                  .reduceByKey(lambda x, y: x + y))

print(access_log_rdd.collect())

[('10.0.0.153', 1200145)]
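
Note that `int(...)` raises a `ValueError` if the last field is not a number. Apache's common log format writes `-` in the size field when no bytes were sent, so a defensive conversion may be useful on other logs. The following is a minimal sketch, assuming such entries should count as 0 bytes; `parse_size` is a hypothetical helper, not part of the original lab.

```python
def parse_size(field: str) -> int:
    """Convert the size field of a log line to an int, treating non-numeric values such as "-" as 0."""
    return int(field) if field.isdigit() else 0

print(parse_size("6291"))
print(parse_size("-"))
```

In the pipeline above, `parse_size(line.split(' ')[-1])` could stand in for `int(line.split(' ')[-1])`.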


In [None]:
# RDD implementation
# (5) Based on a given URL (hard coded), compute a request count for each client (hostname or IPv4) who accessed that URL, sorted by request count, highest to lowest


def filter_url(line: str, target_url: str):
  # field 6 = requested URL path; keep only exact matches
  url = line.split(' ')[6]
  return url == target_url

access_log_rdd = (sc.textFile("apache.access.log")
                  .filter(lambda line: filter_url(line, '/'))
                  .map(lambda line: (line.split(' ')[0], 1))
                  .reduceByKey(lambda x, y: x + y)
                  .sortBy(lambda x: -1 * x[1]))

print(access_log_rdd.collect())


[('66-194-6-70.gen.twtelecom.net', 3), ('66-194-6-79.gen.twtelecom.net', 2), ('208-38-57-205.ip.cal.radiant.net', 2), ('pool-68-160-195-60.ny325.east.verizon.net', 1), ('219.95.17.51', 1), ('spot.nnacorp.com', 1), ('lhr003a.dhl.com', 1), ('ip68-228-43-49.tc.ph.cox.net', 1), ('proxy0.haifa.ac.il', 1), ('fw1.millardref.com', 1), ('l07v-1-17.d1.club-internet.fr', 1), ('c-24-20-163-223.client.comcast.net', 1), ('jacksonproject.cnc.bc.ca', 1), ('h24-70-69-74.ca.shawcable.net', 1), ('lj1212.inktomisearch.com', 1), ('67.131.107.5', 1), ('dsl-80-43-113-44.access.uk.tiscali.com', 1), ('212.92.37.62', 1), ('66.213.206.2', 1), ('market-mail.panduit.com', 1), ('a213-84-36-192.adsl.xs4all.nl', 1), ('ts04-ip92.hevanet.com', 1), ('207.195.59.160', 1), ('2-238.cnc.bc.ca', 1), ('3_343_lt_someone', 1), ('1-729.cnc.bc.ca', 1), ('66-194-6-71.gen.twtelecom.net', 1)]


# (B) DataFrame Implementations

Perform reporting tasks 1-5 using Spark's DataFrame API

[DataFrame API PySpark v.3.1.1](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame)

Please Note: Spark SQL **is not** permitted for these exercises. You must use the Spark DataFrame API. 

In [None]:
# DataFrame implementation
# (1) Most popular URL paths (top 15)


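# split "GET /path HTTP/1.1" on spaces and keep the URL path (element 1)
data_frame_query_1 = named_df.withColumn('path', split(named_df.path, ' ')[1])
# show() prints the table and returns None, so its result is not assigned
data_frame_query_1.groupBy('path').count().sort('count', ascending=False).limit(15).show()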


+--------------------+-----+
|                path|count|
+--------------------+-----+
|/twiki/bin/view/M...|   40|
|/twiki/pub/TWiki/...|   32|
|                   /|   31|
|        /favicon.ico|   28|
|         /robots.txt|   27|
|         /razor.html|   23|
|/twiki/bin/view/M...|   18|
|/twiki/bin/view/M...|   17|
|/cgi-bin/mailgrap...|   16|
|/cgi-bin/mailgrap...|   16|
|/cgi-bin/mailgrap...|   16|
|/cgi-bin/mailgrap...|   16|
|/cgi-bin/mailgrap...|   16|
|/cgi-bin/mailgrap...|   16|
|/cgi-bin/mailgrap...|   16|
+--------------------+-----+



In [None]:
# DataFrame implementation
# (2) Request count for each HTTP response code, sorted by response code

# request count per status code, highest code first
named_df.groupBy('status').count().sort('status', ascending=False).show()



+------+-----+
|status|count|
+------+-----+
|   404|    5|
|   401|  123|
|   302|    6|
|   200| 1272|
+------+-----+



In [None]:
# DataFrame implementation
# (3) Request count for each calendar month and year, sorted chronologically


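# characters 5-12 of "[07/Mar/2004:16:05:49" hold the month and year, e.g. "Mar/2004"
# no sort is applied: this log covers a single month, and "Mon/yyyy" strings would not sort chronologically
named_df.withColumn('month_year', pyspark.substring('timestamp', 5, 8)).groupBy('month_year').count().show()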


+----------+-----+
|month_year|count|
+----------+-----+
|  Mar/2004| 1406|
+----------+-----+



In [None]:
# DataFrame implementation
# (4) Total bytes sent to the client with a specified hostname or IPv4 address (you may hard code an address)

given_hostname = '64.242.88.10'

# sum content_size over all requests from the given host
named_df.filter(named_df.host == given_hostname).groupBy('host').sum('content_size').show()


+------------+-----------------+
|        host|sum(content_size)|
+------------+-----------------+
|64.242.88.10|          5745035|
+------------+-----------------+



In [None]:
# DataFrame implementation
# (5) Based on a given URL (hard coded), compute a request count for each client (hostname or IPv4) who accessed that URL, sorted by request count, highest to lowest

given_url = '/twiki/bin/view'
# keep only the URL path from "GET /path HTTP/1.1"
data_frame_query_5 = named_df.withColumn('path', split(named_df.path, ' ')[1])
# contains() is a substring match, so any path containing the given URL is counted
data_frame_query_5.filter(col('path').contains(given_url)).groupBy('host').count().sort('count', ascending=False).show()



+--------------------+-----+
|                host|count|
+--------------------+-----+
|        64.242.88.10|  139|
|cr020r01-3.sac.ov...|   25|
|pc3-registry-stoc...|   13|
|ts05-ip44.hevanet...|   13|
|       128.227.88.79|   11|
| prxint-sxb3.e-i.net|   11|
|        212.92.37.62|   11|
|      207.195.59.160|   11|
|mail.geovariances.fr|   10|
|      ogw.netinfo.bg|   10|
|market-mail.pandu...|    9|
|crawl24-public.al...|    8|
|p213.54.168.132.t...|    8|
|      195.246.13.119|    8|
|mth-fgw.ballarat....|    6|
|ip68-228-43-49.tc...|    6|
|c-24-20-163-223.c...|    5|
|user-0c8hdkf.cabl...|    4|
|208-38-57-205.ip....|    4|
|          10.0.0.153|    4|
+--------------------+-----+
only showing top 20 rows

