# HW 1: Web log data wrangling

Please also refer to the HW1 [README](https://github.com/berkeley-cs186/course/tree/master/hw1) for the full assignment details.

--------------------------------------------

## Introduction

### Jupyter Notebooks w/ iPython

Jupyter Notebook is a web-based interactive computing system that lets you mix code and rich text in one document. A notebook consists of a sequence of cells, which can be run using the "Play" button in the toolbar or by hitting Shift-Enter on the keyboard.

In HW1, you will primarily use code cells with iPython code. You can find a tour and pointers to more documentation in the `Help` menu above.


### The dataset

Let's take a look at the data. These web logs were produced by an Apache web server. Each line represents a request to the server that originally hosted an early viral video from 2002.

In [1]:
import os
DATA_DIR = os.environ['MASTERDIR'] + '/sp16/hw1/'

In [2]:
with open(DATA_DIR + "web_log_small.log") as log_file:
    sample_line = log_file.readline()

print sample_line

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"



This format is called "Combined Log Format", and you can find a description of each of the fields [here](https://httpd.apache.org/docs/1.3/logs.html#common).

Here's another way to view the first line of the dataset. We can run a shell command using the [`!` operator](https://ipython.org/ipython-doc/3/interactive/reference.html#system-shell-access) (a feature of iPython).

In [3]:
!head -1 {DATA_DIR}web_log_small.log

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"
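
To make the Combined Log Format more concrete, here is a minimal sketch that splits the sample line above into named fields with a regular expression. The pattern and the group names (`ip`, `timestamp`, `referer`, and so on) are illustrative assumptions, not part of the assignment, and the pattern does not handle escaped quotes inside quoted fields. The notebook targets Python 2; this sketch sticks to syntax that should also run on Python 3.

```python
import re

# Hypothetical pattern for one Combined Log Format line.
# Group names are labels chosen for this sketch only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# The sample line printed above, split over several literals for readability.
sample_line = (
    '62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] '
    '"GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" '
    '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"'
)

# groupdict() maps each named group to the text it matched.
fields = LOG_PATTERN.match(sample_line).groupdict()
```

For this homework only the first field (the IP address) and the bracketed timestamp are needed, which is why a plain `split(" ")` is also workable.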


-----------

## Your Assignment

Fill in the `process_logs` function below to complete the specification in the README. You can add any helper functions you need. You may use any of Python 2's standard libraries available on the instructional machines. You cannot use (and shouldn't need) any external libraries.

Remember, you need to ensure that your code will scale to datasets that are bigger than memory -- no matter how large or skewed the dataset or how much memory is on your test machine. Avoid keeping data structures of unbounded size in memory, since they **won't** scale, e.g.:

- having a list of every line in the dataset
- having a dictionary with a key for every IP address
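
As a rough illustration of the alternative, the sketch below reads its input one line at a time through a generator, so only the current line is held in memory. The helper name `iter_ips` and the fake three-line log are assumptions made up for this example; a real run would pass an open file object instead of `io.StringIO`.

```python
import io

def iter_ips(lines):
    """Lazily yield the IP address (first field) of each non-blank line."""
    for line in lines:
        if line.strip():
            # Split only once; the rest of the line is not needed here.
            yield line.split(" ", 1)[0]

# Stand-in for a log file; file objects are iterated the same way.
fake_log = io.StringIO(
    u"1.1.1.1 - - first\n"
    u"2.2.2.2 - - second\n"
    u"\n"
    u"1.1.1.1 - - third\n"
)

# Consuming the generator with sum() never builds a list of lines.
num_hits = sum(1 for _ in iter_ips(fake_log))
```

By contrast, calling `readlines()` or `list(...)` on the input would materialize every line at once, which is exactly the unbounded structure to avoid.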

Finally, to ensure proper grading, please make sure all of your log processing code (including `import` statements) is between the **BEGIN/END STUDENT CODE** cells. Do not modify or remove either of these cells.

### * BEGIN STUDENT CODE *

In [4]:
import apachetime
import time

def apache_ts_to_unixtime(ts):
    """
    @param ts - an Apache timestamp string, e.g. '[02/Jan/2003:02:06:41 -0700]'
    @returns int - a Unix timestamp in seconds
    """
    dt = apachetime.apachetime(ts)
    unixtime = time.mktime(dt.timetuple())
    return int(unixtime)

In [5]:
import csv

def process_logs(dataset_iter):
    """
    Processes the input stream, and outputs the CSV files described in the README.    
    This is the main entry point for your assignment.
    
    @param dataset_iter - an iterator of Apache log lines.
    """
    # FIX ME
    with open("hits.csv", "w+") as hits_file:        
        writer = csv.writer(hits_file, lineterminator = '\n')
        writer.writerow(["ip", "timestamp"])
        for i, line in enumerate(dataset_iter):            
            logSplitList = line.split(" ")
            ipAddress = logSplitList[0]
            unixTimeStamp = apache_ts_to_unixtime(logSplitList[3] + " " + logSplitList[4])
            writer.writerow([ipAddress, str(unixTimeStamp)])
    
    !sort -k1,1 -k2,2n -t',' hits.csv > intermediate.csv
    
    with open("intermediate.csv", "r") as intermediate_file:
        with open ("sessions.csv", "w+") as sessions_file:
            reader = csv.reader(intermediate_file)
            writer = csv.writer(sessions_file, lineterminator = '\n')
            writer.writerow(["ip", "session_length", "num_hits"])
            lastIp = ""
            lastTimeStamp = 0
            first_hit = 0
            num_hits = 1
            checkWriteLastLine = False
            for row in reader:
                currentIp = row[0]
                if currentIp == "ip":
                    continue
                currentTimeStamp = int(row[1])
                
                #same machine
                if currentIp == lastIp:
                    #same session (within 30 min of last hit)
                    if currentTimeStamp - lastTimeStamp <= 1800:
                        num_hits += 1
                        checkWriteLastLine = False
                    else:
                        #same machine but start of new session
                        writer.writerow([currentIp, str(lastTimeStamp - first_hit), str(num_hits)])
                        checkWriteLastLine = True
                        num_hits = 1
                        first_hit = currentTimeStamp
                #new machine
                else:
                    if lastIp == "" and lastTimeStamp == 0:
                        first_hit = currentTimeStamp
                        num_hits = 1
                    else:    
                        writer.writerow([lastIp, str(lastTimeStamp - first_hit), str(num_hits)])
                        checkWriteLastLine = True
                        num_hits = 1
                        first_hit = currentTimeStamp


                lastIp = currentIp
                lastTimeStamp = currentTimeStamp
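            # Always flush the final session: it is still pending even when the
            # last row started a new session. Skip only if no hits were read.
            if lastIp != "":
                writer.writerow([lastIp, str(lastTimeStamp - first_hit), str(num_hits)])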
    
    
    #sort sessions.csv based on session_length
    !sort -k2,2n -t',' sessions.csv > intermediate2.csv
    
    with open("intermediate2.csv", "r") as intermediate_file:
        with open("session_length_plot.csv", "wb") as session_length_plot_file:
            reader = csv.reader(intermediate_file)
            writer = csv.writer(session_length_plot_file, lineterminator = '\n')
            writer.writerow(["left", "right", "count"])
            counter = 0
            lowerBound = 0
            upperBound = 2
            for row in reader:
                if row[1] == "session_length":
                    continue
                session_length = int(row[1])
                if session_length >= lowerBound and session_length < upperBound:
                    #correct range
                    counter += 1
                else:
                    writer.writerow([str(lowerBound), str(upperBound), str(counter)])
                    counter = 1
                    while upperBound <= session_length:
                        lowerBound = upperBound
                        upperBound = lowerBound * 2
            writer.writerow([str(lowerBound), str(upperBound), str(counter)])

### * END STUDENT CODE *
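
The session-detection pass above can also be read as a small generator over hits that are already sorted by IP and then by timestamp. The following is a sketch under assumptions, not the reference solution: the name `sessionize` is made up, and the 1800-second inclusive cutoff simply mirrors the code above rather than anything confirmed by the README.

```python
def sessionize(sorted_hits, timeout=1800):
    """
    Yield (ip, session_length, num_hits) tuples.

    sorted_hits must be an iterable of (ip, unix_ts) pairs sorted by ip,
    then by timestamp. Only one session's state is kept in memory.
    """
    current_ip = None
    first_ts = last_ts = 0
    num_hits = 0
    for ip, ts in sorted_hits:
        if ip == current_ip and ts - last_ts <= timeout:
            # Same machine, close enough in time: extend the session.
            num_hits += 1
        else:
            # New machine or a gap longer than the timeout: emit the
            # previous session (if any) and start a new one.
            if current_ip is not None:
                yield current_ip, last_ts - first_ts, num_hits
            current_ip, first_ts, num_hits = ip, ts, 1
        last_ts = ts
    # The final session is still pending once the input is exhausted.
    if current_ip is not None:
        yield current_ip, last_ts - first_ts, num_hits

example_hits = [("1.1.1.1", 0), ("1.1.1.1", 100), ("1.1.1.1", 5000), ("2.2.2.2", 10)]
sessions = list(sessionize(example_hits))
```

In this example the third hit arrives 4900 seconds after the second, so it opens a second session for the same IP.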

------------------------


In [6]:
def process_logs_small():
    """
    Runs the process_logs function with the small dataset (186 MB).
    """        
    with open(DATA_DIR + "web_log_small.log") as log_file:
        process_logs(log_file)

In [7]:
%time process_logs_small()

left,right,count
0,2,76233
2,4,6496
4,8,5246
8,16,9087
16,32,13372
32,64,14651
64,128,16044
128,256,23592
256,512,15117
512,1024,11296
1024,2048,8913
2048,4096,2689
4096,8192,576
8192,16384,193
16384,32768,117
32768,65536,67
65536,131072,22
131072,262144,2
262144,524288,2
524288,1048576,1
1048576,2097152,1
CPU times: user 16.4 s, sys: 577 ms, total: 17 s
Wall time: 18.2 s
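
The histogram above uses bins that double in width: `[0, 2)`, `[2, 4)`, `[4, 8)`, and so on. As a small sketch, the bin for a single non-negative session length can be computed directly. The function name `power_of_two_bin` is hypothetical and is not used by the notebook code.

```python
def power_of_two_bin(session_length):
    """Return (left, right) such that left <= session_length < right."""
    if session_length < 2:
        # Lengths 0 and 1 share the first bin.
        return 0, 2
    # bit_length() gives the position of the highest set bit, so
    # 1 << bit_length() is the smallest power of two above the value.
    right = 1 << session_length.bit_length()
    return right >> 1, right

bins = [power_of_two_bin(n) for n in (0, 1, 2, 3, 4, 7, 8, 1800)]
```

A session at the 1800-second boundary, for instance, falls into the `[1024, 2048)` bin.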


In [8]:
import zipfile

def process_logs_large():
    """
    Runs the process_logs function on the full dataset.  The code below 
    performs a streaming unzip of the compressed dataset (158MB), which
    saves the 1.6GB of disk space needed to unzip this file onto disk.
    """
    with zipfile.ZipFile(DATA_DIR + "web_log_large.zip") as z:
        fname = z.filelist[0].filename
        f = z.open(fname)
        process_logs(f)
        f.close()
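
To see the streaming unzip in isolation, the sketch below builds a tiny zip archive in memory and then iterates over the lines of its first member without extracting it to disk. The archive contents, the member name `web_log.log`, and the helper `iter_zip_lines` are all made up for this example. It is written for Python 3, where `ZipFile.open` yields bytes that need decoding.

```python
import io
import zipfile

# Build a small archive in memory as a stand-in for web_log_large.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("web_log.log", "line one\nline two\nline three\n")

def iter_zip_lines(zip_source):
    """Yield decoded lines from the first member of a zip archive."""
    with zipfile.ZipFile(zip_source) as z:
        name = z.namelist()[0]
        # z.open() returns a file-like object that decompresses on demand.
        with z.open(name) as member:
            for raw in member:
                yield raw.decode("utf-8")

lines = list(iter_zip_lines(buf))
```

The `list(...)` call is only there to inspect the result; a streaming consumer such as `process_logs` would iterate over the generator directly.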

In [9]:
%time process_logs_large()

left,right,count
0,2,1194447
2,4,39348
4,8,51821
8,16,105254
16,32,139045
32,64,167412
64,128,190613
128,256,208332
256,512,155371
512,1024,116568
1024,2048,104063
2048,4096,33026
4096,8192,8630
8192,16384,2135
16384,32768,927
32768,65536,535
65536,131072,252
131072,262144,84
262144,524288,36
524288,1048576,14
1048576,2097152,3
2097152,4194304,3
4194304,8388608,1
CPU times: user 2min 31s, sys: 3.72 s, total: 2min 35s
Wall time: 2min 46s


---------------

## Testing

As mentioned in the README, we provide reference output only for the small dataset. `diff_against_reference()` produces a `.diff` file if there's a difference between your output and the reference output.

If you're unfamiliar with the format of `diff`'s output, you can read about it [here](https://en.wikipedia.org/wiki/Diff_utility#Usage).

There are other diff utilities which produce colored/side-by-side output, making it easier to see differences. If you're interested, try:

```
$ vimdiff hits.csv ~cs186/sp16/hw1/ref_output_small/hits.csv
OR
$ git diff hits.csv ~cs186/sp16/hw1/ref_output_small/hits.csv
```

In [10]:
import os

ref_output_dir = DATA_DIR + "ref_output_small/"

def _diff_helper(f, unordered=False):
    """
    @param f (str) - filename to diff with reference output
    @param unordered (bool) - whether the ordering of the lines matters
    """
    if not os.path.isfile(f):
        print "FAIL - {} does not exist.".format(f)
        return
    
    if unordered:
        tmp1 = !mktemp
        tmp1 = tmp1[0]
        !sort {f} > {tmp1}
        !sort {ref_output_dir + f} | diff {tmp1} - > {f}.diff
    else:
        !diff {f} {ref_output_dir + f} > {f}.diff
    
    success = _exit_code == 0
    if success:
        !rm {f}.diff
        print "PASS - {} matched reference output.".format(f)
    else:
        print "FAIL - {} did not match reference output. See {}.diff.".format(f, f)
        

def diff_against_reference():
    """
    Compares the output files in the current directory with the reference output.
    If there is a difference, writes a ".diff" file, e.g. hits.csv.diff.
    """ 
    _diff_helper("hits.csv")
    _diff_helper("sessions.csv", unordered=True)
    _diff_helper("session_length_plot.csv")
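
The `unordered=True` path sorts both files before diffing them, so two outputs with the same lines in a different order count as a match. A pure-Python sketch of the same idea, using the standard `difflib` module, might look like the following. The function name `unordered_diff` is an assumption, the output is in unified format rather than the default format of the `diff` command, and sorting in memory like this is only suitable for small inputs.

```python
import difflib

def unordered_diff(actual_lines, reference_lines):
    """Return unified-diff lines between the two inputs, ignoring line order."""
    return list(difflib.unified_diff(
        sorted(reference_lines),
        sorted(actual_lines),
        fromfile="reference",
        tofile="actual",
        lineterm="",
    ))

# Same rows, different order: no differences are reported.
same = unordered_diff(["b,0,1", "a,100,2"], ["a,100,2", "b,0,1"])

# One row differs: the diff is non-empty.
different = unordered_diff(["a,100,2", "b,0,1"], ["a,100,2", "b,0,2"])
```

An empty result plays the role of `diff` exiting with status 0 in the helper above.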

In [11]:
process_logs_small()
diff_against_reference()

left,right,count
0,2,76233
2,4,6496
4,8,5246
8,16,9087
16,32,13372
32,64,14651
64,128,16044
128,256,23592
256,512,15117
512,1024,11296
1024,2048,8913
2048,4096,2689
4096,8192,576
8192,16384,193
16384,32768,117
32768,65536,67
65536,131072,22
131072,262144,2
262144,524288,2
524288,1048576,1
1048576,2097152,1
PASS - hits.csv matched reference output.
PASS - sessions.csv matched reference output.
PASS - session_length_plot.csv matched reference output.



### Testing Memory Usage

For additional testing, we've included a script which:
 - (1) makes sure all of your log processing code is between the BEGIN/END STUDENT CODE CELLS above, so it will work with our autograder
 - (2) runs your code with a memory cap of 1MB. If you see a `MemoryError`, it's a sign your code is not doing appropriate streaming and/or divide-and-conquer!
 
Make sure to save your notebook (`File > Save and Checkpoint`) before running the next cell.

In [12]:
!bash test_memory_usage.sh

[NbConvertApp] Converting notebook hw1.ipynb to python
Running process_logs_large()
left,right,count
0,2,1194447
2,4,39348
4,8,51821
8,16,105254
16,32,139045
32,64,167412
64,128,190613
128,256,208332
256,512,155371
512,1024,116568
1024,2048,104063
2048,4096,33026
4096,8192,8630
8192,16384,2135
16384,32768,927
32768,65536,535
65536,131072,252
131072,262144,84
262144,524288,36
524288,1048576,14
1048576,2097152,3
2097152,4194304,3
4194304,8388608,1
Memory Test Done.
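
To build some intuition for why streaming matters under a memory cap, the sketch below compares the peak memory of loading every line into a list with that of iterating over the same lines lazily. This is not how `test_memory_usage.sh` works (its implementation is not shown here). The helper names and the synthetic log lines are assumptions, and `tracemalloc` requires Python 3.4 or later, so this will not run on the Python 2 kernel used by the notebook.

```python
import tracemalloc

def fake_log_lines(n):
    """Generate n synthetic log lines lazily."""
    for i in range(n):
        yield ('10.0.0.%d - - [02/Jan/2003:02:06:41 -0700] '
               '"GET / HTTP/1.0" 200 %d\n' % (i % 256, i))

def peak_bytes(func):
    """Run func() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def load_everything():
    # Materializes all lines at once: memory grows with the input size.
    return len(list(fake_log_lines(50000)))

def stream():
    # Holds one line at a time: memory stays roughly constant.
    return sum(1 for _ in fake_log_lines(50000))

list_peak = peak_bytes(load_everything)
stream_peak = peak_bytes(stream)
```

The exact numbers depend on the interpreter, but `stream_peak` should come out far smaller than `list_peak`.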
