
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Credit to DataBricks

In [1]:
import re
import datetime
import pyspark

from pyspark.sql import Row
from pyspark import SparkContext
import sys
import os
from operator import add

sc = pyspark.SparkContext('local[*]')



In [2]:
# Quick test of the regular expression library
# https://www.regextester.com/95830
# Look behind 'abc'
m = re.search('(?<=abc)def', 'abcdef')
m.group(0)

'def'

### **Part 1: Apache Web Server Log file format**
#### The log files that we use for this assignment are in the [Apache Common Log Format (CLF)](http://httpd.apache.org/docs/1.3/logs.html#common). The log file entries produced in CLF will look something like this:
`127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839`
 
#### Each part of this log entry is described below.
* `127.0.0.1`
#### This is the IP address (or host name, if available) of the client (remote host) which made the request to the server.
 
* `-`
#### The "hyphen" in the output indicates that the requested piece of information (user identity from remote machine) is not available.
 
* `-`
#### The "hyphen" in the output indicates that the requested piece of information (user identity from local logon) is not available.
 
* `[01/Aug/1995:00:00:01 -0400]`
#### The time that the server finished processing the request. The format is:
`[day/month/year:hour:minute:second timezone]`
  * day = 2 digits
  * month = 3 letters
  * year = 4 digits
  * hour = 2 digits
  * minute = 2 digits
  * second = 2 digits
  * zone = (\+ | \-) 4 digits
 
* `"GET /images/launch-logo.gif HTTP/1.0"`
#### This is the first line of the request string from the client. It consists of three components: the request method (e.g., `GET`, `POST`, etc.), the endpoint (a [Uniform Resource Identifier](http://en.wikipedia.org/wiki/Uniform_resource_identifier)), and the client protocol version.
 
* `200`
#### This is the status code that the server sends back to the client. This information is very valuable, because it reveals whether the request resulted in a successful response (codes beginning in 2), a redirection (codes beginning in 3), an error caused by the client (codes beginning in 4), or an error in the server (codes beginning in 5). The full list of possible status codes can be found in the HTTP specification ([RFC 2616](https://www.ietf.org/rfc/rfc2616.txt) section 10).
 
* `1839`
#### The last entry indicates the size of the object returned to the client, not including the response headers. If no content was returned to the client, this value will be "-" (or sometimes 0).
 
#### Note that log files contain information supplied directly by the client, without escaping. Therefore, it is possible for malicious clients to insert control-characters in the log files, *so care must be taken in dealing with raw logs.*
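
#### The status-code classes described above can be sketched as a small lookup on the leading digit. This is only an illustration: the function name `status_class` and the label strings are assumptions made here, not part of the original notebook.

```python
def status_class(code):
    """Map an HTTP status code to the broad class of its leading digit."""
    # Hypothetical labels chosen for this sketch.
    classes = {
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code.
    return classes.get(code // 100, 'other')


print(status_class(200))
print(status_class(404))
```

#### Any code outside the four classes listed above falls through to `'other'`.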
 


In [3]:
month_map = {'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,
    'Aug':8,  'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12}


def parse_apache_time(s):
    """ Convert Apache time format into a Python datetime object
    Args:
        s (str): date and time in Apache time format
    Returns:
        datetime: datetime object (ignore timezone for now)
    """
    return datetime.datetime(int(s[7:11]),
                             month_map[s[3:6]],
                             int(s[0:2]),  # Day
                             int(s[12:14]),
                             int(s[15:17]),
                             int(s[18:20]))
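
#### The function above ignores the timezone. If the offset matters, one possible alternative is `datetime.strptime` with the `%z` directive. The sketch below is not the notebook's method: the name `parse_apache_time_tz` is hypothetical, and `%b` is assumed to match English month abbreviations, which holds under the default C locale.

```python
import datetime


def parse_apache_time_tz(s):
    """Hypothetical variant of parse_apache_time that keeps the UTC offset."""
    # %b matches the abbreviated month name (locale dependent),
    # %z matches an offset such as -0400.
    return datetime.datetime.strptime(s, '%d/%b/%Y:%H:%M:%S %z')


# The input has no surrounding brackets, like the timestamp group of the regex.
ts = parse_apache_time_tz('01/Aug/1995:00:00:01 -0400')
print(ts.isoformat())  # 1995-08-01T00:00:01-04:00
```

#### The result is a timezone-aware datetime, so it cannot be compared directly with the naive datetimes returned by `parse_apache_time`.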

### Parsing Each Log Line
#### Using the CLF as defined above, we create a regular expression pattern to extract the nine fields of the log line using the Python regular expression search function. The function returns a pair consisting of a Row object and 1. If the log line fails to match the regular expression, the function returns a pair consisting of the log line string and 0. A '-' value in the content size field is cleaned up by substituting it with 0. The function converts the log line's date string into a Python datetime object using the given parse_apache_time function.

In [7]:
# Raw string, so that backslashes such as \S and \d reach the regex engine unchanged
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'

In [4]:
def parseApacheLogLine(logline):
    """ Parse a line in the Apache Common Log format
    Args:
        logline (str): a line of text in the Apache Common Log format
    Returns:
        tuple: either a Row containing the parts of the Apache Access Log and 1,
               or the original invalid log line and 0
    """
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    if match is None:
        return (logline, 0)
    size_field = match.group(9)
    # A '-' in the size field means no content was returned; treat it as 0 bytes
    size = 0 if size_field == '-' else int(size_field)
    return (Row(
        host          = match.group(1),
        client_identd = match.group(2),
        user_id       = match.group(3),
        date_time     = parse_apache_time(match.group(4)),
        method        = match.group(5),
        endpoint      = match.group(6),
        protocol      = match.group(7),
        response_code = int(match.group(8)),
        content_size  = int(size)
    ), 1)
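
#### To see what the nine groups look like without starting Spark, the pattern can be applied to the sample line from Part 1 using only the `re` module. This is a minimal sketch: the names `PATTERN`, `sample` and `fields` are introduced here for illustration.

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'

sample = ('127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] '
          '"GET /images/launch-logo.gif HTTP/1.0" 200 1839')

match = re.search(PATTERN, sample)
fields = match.groups()  # nine strings, in the order of the Row fields

for name, value in zip(
        ['host', 'client_identd', 'user_id', 'date_time', 'method',
         'endpoint', 'protocol', 'response_code', 'content_size'],
        fields):
    print('%-13s %s' % (name, value))

# A line that does not follow the format yields None instead of a match object.
print(re.search(PATTERN, 'not a log line'))
```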

### Configuration and Initial RDD Creation
#### We are ready to specify the input log file and create an RDD containing the parsed log file data. The log file has already been downloaded for you.

#### To create the primary RDD that we'll use in the rest of this assignment, we first load the text file using sc.textFile(logFile) to convert each line of the file into an element in an RDD.

#### Next, we use map(parseApacheLogLine) to apply the parse function to each element (that is, a line from the log file) in the RDD and turn each line into a pair consisting of a Row object and 1, or of the original line and 0 if parsing fails.

#### Finally, we cache the RDD in memory since we'll use it throughout this notebook.




In [5]:
def parseLogs( log_file ):
    """ Read and parse log file """
    parsed_logs = (sc
                   .textFile( log_file )
                   .map(parseApacheLogLine)
                   .cache())

    access_logs = (parsed_logs
                   .filter(lambda s: s[1] == 1)
                   .map(lambda s: s[0])
                   .cache())

    failed_logs = (parsed_logs
                   .filter(lambda s: s[1] == 0)
                   .map(lambda s: s[0]))
    # Count once and reuse the result; each count() call triggers a Spark job
    failed_logs_count = failed_logs.count()
    if failed_logs_count > 0:
        print('Number of invalid loglines: %d' % failed_logs_count)
        for line in failed_logs.take(20):
            print('Invalid logline: %s' % line)

    print('Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (parsed_logs.count(), access_logs.count(), failed_logs_count))
    return parsed_logs, access_logs, failed_logs


In [8]:
logFile = 'data/usask_access_sanity_check_log'
parsed_logs, access_logs, failed_logs = parseLogs(logFile)
#failed_logs.take(5)
#access_logs.take(10)
print(access_logs.count())

Read 16 lines, successfully parsed 16 lines, failed to parse 0 lines
16


In [9]:

# Count URLs in error (response code other than 200)

not200 = access_logs.filter(lambda log: log.response_code != 200)
print(not200.count())
endpointCountPairTuple = not200.map(lambda log: (log.endpoint, 1))

endpointSum = endpointCountPairTuple.reduceByKey(add)
#print(endpointSum.collect())
# Order by descending count; without a key, takeOrdered sorts the pairs by endpoint name
topTenErrURLs = endpointSum.takeOrdered(10, key=lambda pair: -pair[1])
print('Top Ten failed URLs: %s' % topTenErrURLs)

9
Top Ten failed URLs: [('/images/logo.gif', 9)]
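
#### The count-then-rank logic of this cell can be checked on a small in-memory list without Spark. The sketch below uses `collections.Counter` in place of `reduceByKey` and `takeOrdered`; the `requests` data is made up for illustration and does not come from the log file.

```python
from collections import Counter

# Hypothetical (endpoint, response_code) pairs standing in for parsed log rows.
requests = [
    ('/index.html', 200),
    ('/images/logo.gif', 404),
    ('/images/logo.gif', 404),
    ('/missing.html', 404),
    ('/missing.html', 404),
    ('/old-page', 302),
    ('/images/logo.gif', 304),
]

# Keep only non-200 responses, then count occurrences per endpoint.
error_counts = Counter(endpoint for endpoint, code in requests if code != 200)

# most_common(n) returns the n entries with the highest counts, largest first.
top_errors = error_counts.most_common(2)
print(top_errors)  # [('/images/logo.gif', 3), ('/missing.html', 2)]
```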


In [27]:
content_sizes = access_logs.map(lambda log: log.content_size).cache()

print(content_sizes.count())
avg_content_size = content_sizes.reduce(lambda a, b : a + b) / content_sizes.count()
print('Average Content size : %d ' % avg_content_size )
#print(type(content_sizes.min()))
#print('Minimum size is : %d' % content_sizes.min())
#print('Maximum size is : %d' % content_sizes.max())
print(content_sizes.stats())


99988
Average Content size : 7396 
(count: 99988, mean: 7396.535384246124, stdev: 275491.2270841608, max: 30193824.0, min: 0.0)
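
#### When reading the `stats()` output, note that the `stdev` it reports is the population standard deviation (PySpark exposes the sample version separately as `sampleStdev()`). The difference can be illustrated with the standard library on a handful of made-up sizes; the `sizes` list is an assumption for this sketch, not data from the log.

```python
import statistics

# Hypothetical content sizes in bytes.
sizes = [0, 1839, 2360, 4210]

mean_size = statistics.mean(sizes)
population_stdev = statistics.pstdev(sizes)  # divides by n
sample_stdev = statistics.stdev(sizes)       # divides by n - 1

print('count: %d, mean: %s, min: %d, max: %d'
      % (len(sizes), mean_size, min(sizes), max(sizes)))
print('population stdev: %s' % population_stdev)
print('sample stdev: %s' % sample_stdev)
```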


In [10]:
# Number of unique Hosts
hosts = access_logs.map(lambda log: log.host)
print(hosts.take(5))
uniqueHosts = hosts.distinct()
uniqueHostCount = uniqueHosts.count()
print( 'Unique hosts: %d' % uniqueHostCount)

['skul2.usask.ca', 'bell.usask.ca', '142.99.48.34', 'villi.usask.ca', 'chemeng03']
Unique hosts: 10


In [11]:
dayTimeStatsMonth = access_logs.map(lambda log: float(log.date_time.month))
dayTimeStatsDay = access_logs.map(lambda log: float(log.date_time.day))
dayTimeStatsHour = access_logs.map(lambda log: float(log.date_time.hour))
dayTimeStatsSecond = access_logs.map(lambda log: float(log.date_time.second))
print("Month Stats")
print(dayTimeStatsMonth.stats())
print("Day Stats")
print(dayTimeStatsDay.stats())
print("Hour Stats")
print(dayTimeStatsHour.stats())
print("Second Stats")
print(dayTimeStatsSecond.stats())

Month Stats
(count: 16, mean: 6.0, stdev: 0.0, max: 6.0, min: 6.0)
Day Stats
(count: 16, mean: 15.0, stdev: 0.0, max: 15.0, min: 15.0)
Hour Stats
(count: 16, mean: 13.8125, stdev: 0.3903123748999, max: 14.0, min: 13.0)
Second Stats
(count: 16, mean: 28.0, stdev: 16.725728683677733, max: 50.0, min: 3.0)


In [15]:
# Count the number of unique hosts for each day of the month

dayToHostPairTuple = access_logs.map(lambda log: ((log.date_time.day, log.host), 1))

#
#  shape : ((k1, k2), 1)

dayGroupedHosts = dayToHostPairTuple.reduceByKey(lambda a, b: a + b)
#  shape : ((k1, k2), total)

dayHostCount = dayGroupedHosts.map(lambda x: (x[0][0], 1))
#  shape : (k, v)

dailyHosts = dayHostCount.reduceByKey(lambda a, b: a + b).sortByKey().cache()
             
dailyHostsList = dailyHosts.takeOrdered(30)
print ('Unique hosts per day: %s' % dailyHostsList)




Unique hosts per day: [(15, 10)]
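
#### The two-stage reduction above first collapses repeated (day, host) pairs and then counts the distinct hosts that remain for each day. The same result can be sketched in plain Python with one set per day; the `visits` list below is hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical (day, host) pairs, one per request.
visits = [
    (15, 'bell.usask.ca'),
    (15, 'bell.usask.ca'),
    (15, 'villi.usask.ca'),
    (16, 'bell.usask.ca'),
]

# A set keeps each host once per day, like reducing on the (day, host) key.
hosts_by_day = defaultdict(set)
for day, host in visits:
    hosts_by_day[day].add(host)

# Count the distinct hosts for each day and sort by day.
unique_hosts_per_day = sorted((day, len(hosts)) for day, hosts in hosts_by_day.items())
print(unique_hosts_per_day)  # [(15, 2), (16, 1)]
```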


In [20]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
access_logs.toDF().show()

+-------------+------------+-------------------+--------------------+--------------------+------+--------+-------------+-------+
|client_identd|content_size|          date_time|            endpoint|                host|method|protocol|response_code|user_id|
+-------------+------------+-------------------+--------------------+--------------------+------+--------+-------------+-------+
|            -|        2360|1995-06-15 13:56:59|/~macphed/finite/...| geif15.insa-lyon.fr|   GET|        |          200|      -|
|            -|        2190|1995-06-15 13:57:08|/~macphed/finite/...| geif15.insa-lyon.fr|   GET|        |          200|      -|
|            -|        2146|1995-06-15 13:57:09|/~lowey/unprofess...|anonymous.chevron...|   GET|        |          200|      -|
|            -|        4210|1995-06-15 13:57:12|/~scottp/freewww....|  ds7001-1.gnofn.org|   GET|        |          200|      -|
|            -|       38469|1995-06-15 13:57:17|/~lowey/kev_dino.gif|anonymous.chevron...|   GET|