## Spark Data Cleaning Demo

There is a wide variety of ways to manipulate and change data in a Spark-y way. This notebook shows some of them in the context of data cleaning.

---

### Importing Modules & Starting Spark Session

---

In [2]:
from pyspark.sql import SparkSession
import pandas as pd
import matplotlib

In [3]:
sparkSesh = SparkSession \
    .builder \
    .getOrCreate()



---

### Loading Dataset

---

In [4]:
#Data Source: http://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
logs = sparkSesh.read.text("NASA_access_log_Jul95.gz")

---

### Inspecting Data

---

Overview: Spark has some nifty tools for data inspection, but it can sometimes be easier to convert the Spark df to a Pandas df and take advantage of Pandas' wider range of data inspection and analysis tools.

In [6]:
logs.printSchema()
logs.count()

root
 |-- value: string (nullable = true)



                                                                                

1891715

In [9]:
logs.show(5, truncate = False)

+-----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------+
|199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245                                 |
|unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985                      |
|199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085   |
|burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0               |
|199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179|
+-----------------------------------------------------------------------------------------------------------------------+

In [70]:
pd.set_option('max_colwidth',250)
#Note: It is almost ALWAYS best to filter data before additional processing. Here we do that by limiting in the Spark df
#before passing it to Pandas, rather than passing all the data to Pandas and then limiting there.
logs.limit(5).toPandas()
#logs.toPandas().head(5) #Inefficient version

Unnamed: 0,value
0,"199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] ""GET /history/apollo/ HTTP/1.0"" 200 6245"
1,"unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] ""GET /shuttle/countdown/ HTTP/1.0"" 200 3985"
2,"199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] ""GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0"" 200 4085"
3,"burger.letters.com - - [01/Jul/1995:00:00:11 -0400] ""GET /shuttle/countdown/liftoff.html HTTP/1.0"" 304 0"
4,"199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] ""GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0"" 200 4179"


Looking at these data, it appears that each row holds multiple values concatenated into one long string.
To analyse these data more efficiently, we need to split them into separate columns.

---

### Trial 1: using 'Split'

---

In [18]:
from pyspark.sql.functions import split, col

#Split allows us to split strings by a common delimiter. Here's what it looks like for a few rows if we use space:
logs \
    .limit(5) \
    .select(
        split(col('value')," ").alias("splitVals")
    ).show(truncate=False)

#Looks like some values were separated correctly, but the timestamp and the request were broken apart
#(spaces also occur inside those fields, so a space is not a reliable delimiter)


+----------------------------------------------------------------------------------------------------------------------------------+
|splitVals                                                                                                                         |
+----------------------------------------------------------------------------------------------------------------------------------+
|[199.72.81.55, -, -, [01/Jul/1995:00:00:01, -0400], "GET, /history/apollo/, HTTP/1.0", 200, 6245]                                 |
|[unicomp6.unicomp.net, -, -, [01/Jul/1995:00:00:06, -0400], "GET, /shuttle/countdown/, HTTP/1.0", 200, 3985]                      |
|[199.120.110.21, -, -, [01/Jul/1995:00:00:09, -0400], "GET, /shuttle/missions/sts-73/mission-sts-73.html, HTTP/1.0", 200, 4085]   |
|[burger.letters.com, -, -, [01/Jul/1995:00:00:11, -0400], "GET, /shuttle/countdown/liftoff.html, HTTP/1.0", 304, 0]               |
|[199.120.110.21, -, -, [01/Jul/1995:00:00:11, -0400], "GET, /shuttle
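
To make the problem concrete, here is a small plain-Python sketch (no Spark involved) that splits the first sample log line on spaces. It is only an illustration of why position-based splitting is awkward here: the timestamp and the request each contain spaces, so each one ends up spread over several tokens.

```python
# Plain-Python illustration (no Spark needed): split one sample log line on spaces.
line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

tokens = line.split(" ")

# The timestamp lands in tokens 3-4 and the request in tokens 5-7.
timestamp_parts = tokens[3:5]
request_parts = tokens[5:8]

print(len(tokens))
print(timestamp_parts)
print(request_parts)
```

Stitching these tokens back together by position would work for this line, but it relies on every line having exactly the same number of spaces.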

---

### Trial 2: Parse using custom UDF

---

In [65]:
#Let's build a UDF that can do custom parsing
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType,StringType
import re

#Note: in the UDFs below, we define 'MapType' as the return type, which allows us to return the data as a key-value map (much like a JSON object).
#However, all values in a MapType must share one type, so even though 'size' is really an integer, we return all values
#as strings.

#Note: We can turn a function into a UDF with the '@udf' decorator, or wrap it with 'udf(...)' and register it by name (for use in SQL) with 'register'
#Option 1 (define function, convert to UDF, then register):
def testFunc(logStr: str):
    return logStr

sparkSesh.udf.register(
    "logParser",
    udf(
        testFunc,
        MapType(StringType(),StringType())
    )
)

#Option 2 (add decorator '@udf' and return type, then write function):
@udf(returnType=MapType(StringType(),StringType()))
def parseLog(logStr: str):
    regex = r"^(?P<client>\S+?) \- \- \[(?P<datetime>[^\]]+)\] +\"(?P<request>.*) +(?P<endpoint>\S+) +(?P<protocol>\S+)\" (?P<status>[0-9]{3}) (?P<size>[0-9]+|\-)$"
    result = re.search(regex,logStr)

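    if result is None:
        #No match: return None so that 'parsed' is null for this row
        #(a tuple such as (logStr,0) does not fit the declared MapType)
        return None

    #A size of '-' is treated as 0; it stays a string to match MapType(StringType(),StringType())
    size = result.group('size')
    if size == '-':
        size = '0'

    return {
        'client':result.group('client'),
        'datetime':result.group('datetime'),
        'request':result.group('request'),
        'endpoint':result.group('endpoint'),
        'protocol':result.group('protocol'),
        'status':result.group('status'),
        'size':size
    }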

22/09/18 15:19:40 WARN SimpleFunctionRegistry: The function logparser replaced a previously registered function.
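
Before running a regex like this over the whole dataset, it can help to check it against a few lines in plain Python. The sketch below reuses the pattern from `parseLog` unchanged; the helper name `parse_line` and the two made-up log lines are assumptions added purely for illustration and are not part of the dataset.

```python
import re

# Same pattern as in parseLog above, compiled once for reuse.
LOG_PATTERN = re.compile(
    r"^(?P<client>\S+?) \- \- \[(?P<datetime>[^\]]+)\] +\"(?P<request>.*) +(?P<endpoint>\S+) +(?P<protocol>\S+)\" (?P<status>[0-9]{3}) (?P<size>[0-9]+|\-)$"
)

def parse_line(line):
    """Return the named groups as a dict of strings, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None

# Taken from the sample rows shown earlier in this notebook.
good = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
# Hypothetical lines, made up for this check: a '-' size, and a request without a protocol.
dash_size = 'example.host - - [01/Jul/1995:00:00:02 -0400] "GET /missing.html HTTP/1.0" 404 -'
no_protocol = 'example.host - - [01/Jul/1995:00:00:03 -0400] "GET /index.html" 200 1234'

parsed_good = parse_line(good)
parsed_dash = parse_line(dash_size)
parsed_bad = parse_line(no_protocol)

print(parsed_good)
print(parsed_dash)
print(parsed_bad)
```

The last line shows the main limitation of the pattern: a request that lacks one of the three expected parts does not match at all, so such rows need to be handled as unparsed.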


In [71]:
dfParsed = logs.withColumn('parsed',parseLog('value'))
dfParsed.select('parsed').limit(10).toPandas()

Unnamed: 0,parsed
0,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/history/apollo/', 'datetime': '01/Jul/1995:00:00:01 -0400', 'size': '6245', 'client': '199.72.81.55', 'status': '200'}"
1,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/countdown/', 'datetime': '01/Jul/1995:00:00:06 -0400', 'size': '3985', 'client': 'unicomp6.unicomp.net', 'status': '200'}"
2,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/missions/sts-73/mission-sts-73.html', 'datetime': '01/Jul/1995:00:00:09 -0400', 'size': '4085', 'client': '199.120.110.21', 'status': '200'}"
3,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/countdown/liftoff.html', 'datetime': '01/Jul/1995:00:00:11 -0400', 'size': '0', 'client': 'burger.letters.com', 'status': '304'}"
4,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'datetime': '01/Jul/1995:00:00:11 -0400', 'size': '4179', 'client': '199.120.110.21', 'status': '200'}"
5,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/images/NASA-logosmall.gif', 'datetime': '01/Jul/1995:00:00:12 -0400', 'size': '0', 'client': 'burger.letters.com', 'status': '304'}"
6,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/countdown/video/livevideo.gif', 'datetime': '01/Jul/1995:00:00:12 -0400', 'size': '0', 'client': 'burger.letters.com', 'status': '200'}"
7,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/countdown/countdown.html', 'datetime': '01/Jul/1995:00:00:12 -0400', 'size': '3985', 'client': '205.212.115.106', 'status': '200'}"
8,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/shuttle/countdown/', 'datetime': '01/Jul/1995:00:00:13 -0400', 'size': '3985', 'client': 'd104.aa.net', 'status': '200'}"
9,"{'request': 'GET', 'protocol': 'HTTP/1.0', 'endpoint': '/', 'datetime': '01/Jul/1995:00:00:13 -0400', 'size': '7074', 'client': '129.94.144.152', 'status': '200'}"


---

### Break out map into separate columns

---

In [73]:
#We can use 'selectExpr' to write SQL expressions as strings, e.g.:
dfParsed.selectExpr('parsed["client"] as client').limit(5).show()

+--------------------+
|              client|
+--------------------+
|        199.72.81.55|
|unicomp6.unicomp.net|
|      199.120.110.21|
|  burger.letters.com|
|      199.120.110.21|
+--------------------+



In [111]:
#If we get a little fancy with string interpolation and list comprehensions, we can create these select expressions
#dynamically and make a generic expression:

from pyspark.sql.functions import map_keys,col

#Get the keys of the nested Map structure from the first row
#(this assumes the first row was parsed successfully, i.e. 'parsed' is not null there):
values = dfParsed \
    .limit(1) \
    .select(
        map_keys(col('parsed'))
    ) \
    .first()[0]

expressions = [f"parsed['{key}'] as {key}" for key in values]

#Note: * operator is basically a spread operator, like in JS:
dfClean = dfParsed.selectExpr(*expressions)
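
To see what the list comprehension produces, here is the same string-building step in plain Python. The key list below is a hard-coded stand-in (an assumption for illustration) for the keys read from the 'parsed' map.

```python
# Hypothetical key list, standing in for the keys read from the 'parsed' map.
keys = ['client', 'datetime', 'size']

# One SQL expression string per key: look the key up in the map and alias it.
expressions = [f"parsed['{key}'] as {key}" for key in keys]

for expression in expressions:
    print(expression)
```

Each string is a small SQL expression, which is why this approach only works while every key is also a valid column alias.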

Note: The code above, particularly the dynamic construction of 'selectExpr' expressions, is a bit hacky.
I would prefer to find a way to parse this log data that is simpler and less prone to breaking.

**Current Version:**
1. Read in data
2. Break log string into pieces using regex
3. Build JSON from log string pieces
4. Store JSON as new column
5. Write Spark 'selectExpr' code to convert JSON to new columns
6. Convert incorrect value types from JSON (str -> int)

**Possibly Better Version:**
I wonder if we could do this better by pulling out individual values from the log string incrementally. I don't see the need to get all values from the log string at once if it causes us to STILL have to create new columns from the resulting JSON AND to convert incorrect types.

1. Read in data
2. Write generic function to apply regex and return matching result
3. Write multiple regex to pull out only individual values from log string
4. For each individual column needed, create column by running generic function with appropriate regex.

This version would require a regex to be evaluated for every column (a disadvantage), but it would also allow the user to pull in ONLY the needed columns instead of importing everything by default (an advantage).
I like version #2 in terms of maintainability, but it **would require testing to assess performance**.
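
Steps 2 to 4 of the second version could look roughly like the plain-Python sketch below. The names `FIELD_PATTERNS`, `extract` and `extract_size` and the individual patterns are assumptions made up for this illustration; they have not been run against the full dataset. In Spark itself, the built-in `regexp_extract` function from `pyspark.sql.functions` might take the place of `extract`, which would avoid a Python UDF, but that variant is not tested here.

```python
import re

# One small pattern per field; each has exactly one capture group.
# These patterns are illustrative assumptions, not taken from the notebook.
FIELD_PATTERNS = {
    "client": r"^(\S+)",
    "datetime": r"\[([^\]]+)\]",
    "status": r"\" (\d{3}) (?:\d+|-)$",
    "size": r" (\d+|-)$",
}

def extract(pattern, line, default=None):
    """Apply one regex to a log line and return its first capture group, or `default`."""
    match = re.search(pattern, line)
    return match.group(1) if match else default

def extract_size(line):
    """Return the size as an int ('-' counts as 0), or None if no size was found."""
    raw = extract(FIELD_PATTERNS["size"], line)
    if raw is None:
        return None
    return 0 if raw == "-" else int(raw)

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Pull out only the fields that are actually needed.
client = extract(FIELD_PATTERNS["client"], line)
status = extract(FIELD_PATTERNS["status"], line)
size = extract_size(line)

print(client, status, size)
```

Because each field is typed at the point of extraction, the separate str -> int conversion step from the current version is no longer needed in this design.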

In [114]:
dfClean.limit(5).show()

+-------+--------+--------------------+--------------------+----+--------------------+------+
|request|protocol|            endpoint|            datetime|size|              client|status|
+-------+--------+--------------------+--------------------+----+--------------------+------+
|    GET|HTTP/1.0|    /history/apollo/|01/Jul/1995:00:00...|6245|        199.72.81.55|   200|
|    GET|HTTP/1.0| /shuttle/countdown/|01/Jul/1995:00:00...|3985|unicomp6.unicomp.net|   200|
|    GET|HTTP/1.0|/shuttle/missions...|01/Jul/1995:00:00...|4085|      199.120.110.21|   200|
|    GET|HTTP/1.0|/shuttle/countdow...|01/Jul/1995:00:00...|   0|  burger.letters.com|   304|
|    GET|HTTP/1.0|/shuttle/missions...|01/Jul/1995:00:00...|4179|      199.120.110.21|   200|
+-------+--------+--------------------+--------------------+----+--------------------+------+



In [116]:
#Now we need to solve the issue we had in the UDF: where all data was returned as strings - since 'size' should be an INT value:
from pyspark.sql.functions import expr

dfClean = dfClean.withColumn('size_int',expr('cast(size as int)'))
dfClean.limit(5).show()

+-------+--------+--------------------+--------------------+----+--------------------+------+--------+
|request|protocol|            endpoint|            datetime|size|              client|status|size_int|
+-------+--------+--------------------+--------------------+----+--------------------+------+--------+
|    GET|HTTP/1.0|    /history/apollo/|01/Jul/1995:00:00...|6245|        199.72.81.55|   200|    6245|
|    GET|HTTP/1.0| /shuttle/countdown/|01/Jul/1995:00:00...|3985|unicomp6.unicomp.net|   200|    3985|
|    GET|HTTP/1.0|/shuttle/missions...|01/Jul/1995:00:00...|4085|      199.120.110.21|   200|    4085|
|    GET|HTTP/1.0|/shuttle/countdow...|01/Jul/1995:00:00...|   0|  burger.letters.com|   304|       0|
|    GET|HTTP/1.0|/shuttle/missions...|01/Jul/1995:00:00...|4179|      199.120.110.21|   200|    4179|
+-------+--------+--------------------+--------------------+----+--------------------+------+--------+



---

### Analyze

---

In [117]:
dfClean.createOrReplaceTempView("dfClean_sql")

In [122]:
#Most common clients
sparkSesh.sql('''
    SELECT client, COUNT(client) as count
    FROM dfClean_sql
    GROUP BY client
    ORDER BY count DESC
    LIMIT 10
''').show()

+--------------------+-----+
|              client|count|
+--------------------+-----+
|piweba3y.prodigy.com|17572|
|piweba4y.prodigy.com|11591|
|piweba1y.prodigy.com| 9868|
|  alyssa.prodigy.com| 7852|
| siltb10.orl.mmc.com| 7573|
|piweba2y.prodigy.com| 5922|
|  edams.ksc.nasa.gov| 5434|
|        163.206.89.4| 4906|
|         news.ti.com| 4863|
|disarray.demon.co.uk| 4353|
+--------------------+-----+



                                                                                

In [123]:
#Biggest responses (one row per request, so the same file can appear several times):
sparkSesh.sql('''
    SELECT endpoint, size_int
    FROM dfClean_sql
    ORDER BY size_int DESC
    LIMIT 10
''').show()

+--------------------+--------+
|            endpoint|size_int|
+--------------------+--------+
|/shuttle/countdow...| 6823936|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 3155499|
|/statistics/1995/...| 2973350|
|/statistics/1995/...| 2973350|
+--------------------+--------+



                                                                                

In [157]:
#Get number of requests per day

from pyspark.sql.functions import to_timestamp, col

#First, convert the 'datetime' string column into a proper timestamp column:
dfClean = dfClean.withColumn('timestamp',to_timestamp(col('datetime'),"dd/MMM/yyyy:HH:mm:ss Z"))
dfClean.createOrReplaceTempView("dfClean_sql")
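
As a cross-check of the format string, the same value can be parsed in plain Python. The sketch below uses `datetime.strptime` directives that mirror the Spark pattern `dd/MMM/yyyy:HH:mm:ss Z`; it is an illustration of what each part of the pattern stands for, not a statement about how Spark parses internally.

```python
from datetime import datetime, timedelta

# Sample value copied from the 'datetime' column shown earlier.
raw = "01/Jul/1995:00:00:01 -0400"

# dd -> %d, MMM -> %b, yyyy -> %Y, HH:mm:ss -> %H:%M:%S, Z -> %z
# Note: %b matches English month abbreviations only under an English/C locale.
parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

print(parsed.isoformat())
print(parsed.utcoffset() == timedelta(hours=-4))
```

One thing to keep in mind: as far as I understand, Spark stores timestamps as instants and evaluates expressions such as `EXTRACT(day ...)` in the session time zone, so the per-day counts may shift slightly depending on the `spark.sql.session.timeZone` setting.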

In [159]:
sparkSesh.sql('''
    SELECT EXTRACT(day from timestamp) AS day,
    COUNT(client) AS count
    FROM dfClean_sql
    GROUP BY day
    ORDER BY day ASC
''').show()

+----+------+
| day| count|
+----+------+
|null|     0|
|   1| 45999|
|   2| 58994|
|   3| 87604|
|   4| 74304|
|   5| 91426|
|   6| 97183|
|   7| 95597|
|   8| 43702|
|   9| 34741|
|  10| 66720|
|  11| 78467|
|  12| 88171|
|  13|138096|
|  14| 86034|
|  15| 49805|
|  16| 44324|
|  17| 74377|
|  18| 66324|
|  19| 72283|
+----+------+
only showing top 20 rows



                                                                                