# Assignment Dask

_**Connect to a Local Dask Cluster on this machine and run analytics**_

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio Notebook Instances with the `ml.t3.2xlarge` instance (8 vCPU + 32GiB).

---

This notebook creates a local Dask cluster, reads the `s3://bigdatateaching/quazyilx/quazyilx2.txt` and `s3://bigdatateaching/forensicswiki/2012_logs.txt` datasets, and runs analytics tasks on them.

**It is important to stick to the exact versions of the dependencies installed in this notebook. Any change to this notebook or to the CloudFormation scripts will most likely break things and lead to dependency hell.**

---

## Contents
1. [Tasks to be done in this assignment](#Tasks-to-be-done-in-this-assignment)
1. [Prepare the environment](#Prepare-the-environment)
1. [Connect to the Dask Cluster](#Connect-to-the-Dask-Cluster)
1. [[TASK 1] The quazyilx scientific instrument (5 points)](#[TASK-1]-The-quazyilx-scientific-instrument-(5-points))
1. [[TASK 2] Log file analysis (5 points)](#[TASK-2]-Log-file-analysis-(5-points))

_**While this notebook runs you might see messages such as `distributed.nanny - WARNING - Worker process still alive after 3.9999988555908206 seconds, killing`. These are OK, especially if the operation you asked Dask to perform did complete. See [Why did my worker die?](https://distributed.dask.org/en/stable/killed.html) for more.**_

---

## Tasks to be done in this assignment


There are 2 tasks to be done in this lab. Look for `TASK 1` and `TASK 2` in this notebook for the instructions for each task. You need to write code for each task and save the requested output to a file, as described in that task's instructions.

---

## Prepare the environment

Install the exact versions of the Python packages that work with the Dask cluster (based on the container used by the Dask cluster; see the CloudFormation templates).

In [154]:
!pip install dask[complete]==2022.2.0 s3fs==2022.7.1 pyarrow==9.0.0 dask-glm==0.2.0 cytoolz==0.12.0 dask-ml==2022.5.27

Collecting dask-glm==0.2.0
  Using cached dask_glm-0.2.0-py2.py3-none-any.whl (12 kB)
Collecting cytoolz==0.12.0
  Using cached cytoolz-0.12.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting dask-ml==2022.5.27
  Using cached dask_ml-2022.5.27-py3-none-any.whl (148 kB)
Collecting bokeh>=2.1.1
  Using cached bokeh-2.4.3-py3-none-any.whl (18.5 MB)
Collecting scikit-learn>=0.18
  Using cached scikit_learn-1.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.8 MB)
Collecting botocore<1.24.22,>=1.24.21
  Using cached botocore-1.24.21-py3-none-any.whl (8.6 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Installing collected packages: threadpoolctl, cytoolz, scikit-learn, botocore, bokeh, dask-glm, dask-ml
  Attempting uninstall: cytoolz
    Found existing installation: cytoolz 0.10.1
    Uninstalling cytoolz-0.10.1:
      Successfully uninstalled cytoolz-0.10.1
  Attempting uninstall: scikit-learn
   

Install `htop` to monitor CPU and memory utilization. We cannot connect to the web dashboard of the local cluster from this environment (there are ways to do so, such as [ngrok](https://ngrok.com/), but we will not use them in this class).

In [155]:
!apt-get update
!apt-get install -y htop

Get:1 http://security.debian.org/debian-security buster/updates InRelease [34.8 kB]
Get:2 http://deb.debian.org/debian buster InRelease [122 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [56.6 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [365 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7909 kB]
Get:6 http://deb.debian.org/debian buster-updates/main amd64 Packages [8788 B]
Fetched 8496 kB in 1s (5821 kB/s)                         
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  lsof strace
The following NEW packages will be installed:
  htop
0 upgraded, 1 newly installed, 0 to remove and 63 not upgraded.
Need to get 92.8 kB of archives.
After this operation, 230 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 htop amd64 2.2.0-1+b1 [92.8 kB]
Fetched 92.8 kB in 

In [4]:
import os
import dask
import s3fs
import time
import distributed
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dateutil.parser import parse
from dask.distributed import Client
from dask.distributed import LocalCluster
from dask.distributed import performance_report

---

## Connect to the Dask Cluster

With the right packages installed, create and connect to the Dask cluster. Any Dask operations we run after that are automatically executed on the _local_ cluster.

As created, the Dask cluster has one _scheduler task_ and as many _worker tasks_ as there are vCPUs on this instance.

In [156]:
# Create a local Dask cluster on this machine and connect a client to it
cluster = LocalCluster()
client = Client(cluster)

In [157]:
client.cluster

Tab(children=(HTML(value='<div class="jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-outpu…

The link to this cluster's dashboard is shown below. We will not be able to open it from this environment, but if you run this notebook on your laptop you will be able to access it.

In [158]:
cluster.dashboard_link

'http://127.0.0.1:8787/status'

Use the `get_logs` function to access the logs of this cluster. Each task also has its own web page.

In [159]:
cluster.get_logs()

---

## [TASK 1] The quazyilx scientific instrument (5 points)

For this problem, you will be working with data from the quazyilx instrument. The files you will use contain hypothetical measurements of a scientific instrument called a quazyilx that has been specially created for this class. Every few seconds the quazyilx makes four measurements: fnard, fnok, cark and gnuck. The output looks like this:

`YYYY-MM-DDTHH:MM:SSZ fnard:10 fnok:4 cark:2 gnuck:9`
(This time format is called ISO-8601 and it has the advantage that it is both unambiguous and that it sorts properly. The Z stands for Greenwich Mean Time or GMT, and is sometimes called Zulu Time because the NATO Phonetic Alphabet word for Z is Zulu.)
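The sorting property mentioned above can be checked with a short sketch. The timestamps and the `parse_utc` helper below are made up for illustration; they are not taken from the dataset.

```python
from datetime import datetime, timezone

# Hypothetical timestamps in the quazyilx format, deliberately out of order.
stamps = [
    "2015-12-10T08:40:10Z",
    "2000-01-01T00:00:17Z",
    "2015-02-03T23:59:59Z",
]


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Sorting the raw strings gives the same order as sorting the parsed datetimes.
sorted_as_text = sorted(stamps)
sorted_as_time = sorted(stamps, key=parse_utc)
print(sorted_as_text == sorted_as_time)
```

Because the fields run from most significant (year) to least significant (second) and are zero-padded, plain string comparison agrees with chronological comparison.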

When one of the measurements is not present, the result is displayed as negative 1 (i.e., -1).

The quazyilx has been malfunctioning, and occasionally generates output with a -1 for all four measurements, like this:

`2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1`

Your job is to find all of the times where all four measurements failed together, using Dask.
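To make the record format concrete, here is a minimal pure-Python sketch that parses a single line and tests the malfunction condition. The helper names `parse_line` and `all_failed` and the `good` record are hypothetical; this only illustrates the logic on one line and is not how the 18GB file should be processed.

```python
def parse_line(line):
    """Split one quazyilx record into a timestamp and a dict of integer readings."""
    timestamp, *fields = line.split()
    readings = {}
    for field in fields:
        name, value = field.split(":")
        readings[name] = int(value)
    return timestamp, readings


def all_failed(readings):
    """True when the record has readings and every one of them is -1."""
    return bool(readings) and all(value == -1 for value in readings.values())


good = "2015-12-10T08:40:05Z fnard:10 fnok:4 cark:-1 gnuck:9"   # made-up record
bad = "2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1"  # example from the text

for line in (good, bad):
    timestamp, readings = parse_line(line)
    print(timestamp, all_failed(readings))
```

Note that a single missing measurement (as in `good`) does not count as a malfunction; only records where all four values are -1 qualify.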

You will run a Dask job using the 18GB file `s3://bigdatateaching/quazyilx/quazyilx2.txt` as input. **First, copy the 18GB file from the bigdatateaching S3 bucket into your own S3 bucket**

**<u>Submission Requirements</u>**

1. A file called `dask-report-task1.html` which contains the dask performance report generated via the [performance_report function](https://distributed.dask.org/en/stable/diagnosing-performance.html). You should include the final Dask operation that you do in this performance report.

1. A file called `task1.csv` that contains all the lines containing `fnard:-1,fnok:-1,cark:-1,gnuck:-1`.

In [9]:
!aws s3 cp s3://bigdatateaching/quazyilx/quazyilx2.txt s3://anly502-fall-2022-yl1353/quazyilx/quazyilx2.txt

copy: s3://bigdatateaching/quazyilx/quazyilx2.txt to s3://anly502-fall-2022-yl1353/quazyilx/quazyilx2.txt


**<u>Hint 1</u>**: You can use the `sep` argument of `read_csv` to provide a delimiter that works for this file (as we can see, "," is not a delimiter for this file). Pass `dtype='object'`. This Stack Overflow link is helpful: https://stackoverflow.com/questions/34266263/reading-csv-with-separator-in-python-dask. Also, notice that there is no header in this file.

**<u>Hint 2</u>**: You will need to AND multiple conditions when filtering your dataframe. If you need help with that, a Google search will be useful.
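The two hints can be tried out on a tiny in-memory sample before touching the real file. The sketch below uses plain pandas and three made-up records in the documented format; the column names are an assumption chosen for illustration. `dask.dataframe.read_csv` accepts the same `sep`, `names` and `dtype` arguments, and boolean masks combine the same way there.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the documented record format (no header row).
sample = StringIO(
    "2015-12-10T08:40:00Z fnard:10 fnok:4 cark:2 gnuck:9\n"
    "2015-12-10T08:40:05Z fnard:-1 fnok:7 cark:-1 gnuck:3\n"
    "2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1\n"
)

columns = ["time", "fnard", "fnok", "cark", "gnuck"]
sample_df = pd.read_csv(sample, sep=" ", header=None, names=columns, dtype="object")

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (
    (sample_df["fnard"] == "fnard:-1")
    & (sample_df["fnok"] == "fnok:-1")
    & (sample_df["cark"] == "cark:-1")
    & (sample_df["gnuck"] == "gnuck:-1")
)
print(sample_df[mask])
```

Only the last record survives the filter, because it is the only one where every condition holds at once.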

In [36]:
import dask.dataframe as dd

# The file has no header row, so supply the column names explicitly
Header = ['Time', 'fnard', 'fnok', 'cark', 'gnuck']
df = dd.read_csv('s3://anly502-fall-2022-yl1353/quazyilx/quazyilx2.txt', names=Header,
                 sep=' ', dtype='object')
df.head()

| | Time | fnard | fnok | cark | gnuck |
|---|---|---|---|---|---|
| 2000-01-01 | 00:00:10 | fnard:17 | fnok:18 | cark:0 | gnuck:32 |
| 2000-01-01 | 00:00:17 | fnard:14 | fnok:6 | cark:-1 | gnuck:11 |
| 2000-01-01 | 00:00:27 | fnard:12 | fnok:11 | cark:18 | gnuck:30 |
| 2000-01-01 | 00:00:36 | fnard:9 | fnok:10 | cark:-1 | gnuck:34 |
| 2000-01-01 | 00:00:40 | fnard:1 | fnok:14 | cark:4 | gnuck:45 |


In [37]:
# Keep only the rows where all four measurements are -1 (conditions ANDed together)
all_failed = (
    (df['fnard'] == "fnard:-1")
    & (df['fnok'] == "fnok:-1")
    & (df['cark'] == "cark:-1")
    & (df['gnuck'] == "gnuck:-1")
)
df_n4 = df[all_failed]

df_n4.head()

| | Time | fnard | fnok | cark | gnuck |
|---|---|---|---|---|---|
| 2000-01-16 | 09:56:16 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |
| 2000-02-29 | 11:21:35 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |
| 2000-03-01 | 04:32:38 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |


In [38]:
df_n4

| npartitions=303 | Time | fnard | fnok | cark | gnuck |
|---|---|---|---|---|---|
| | object | object | object | object | object |
| | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |


In [41]:
# Writes one CSV file per partition; the * is replaced by the partition number
df_n4.to_csv('task1/*.csv')

In [46]:
import glob

# Collect the per-partition CSV files written in the previous step
task1_csv = glob.glob('task1/*.csv')
task1_csv[0]

'task1/204.csv'

In [47]:
# Read every partition file and concatenate them into a single DataFrame
# (pd.concat replaces the deprecated DataFrame.append)
df_append = pd.concat((pd.read_csv(file) for file in task1_csv), ignore_index=True)
df_append.head()

| | Unnamed: 0 | Time | fnard | fnok | cark | gnuck |
|---|---|---|---|---|---|---|
| 0 | 2043-06-05 | 23:57:57 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |
| 1 | 2043-06-08 | 12:49:59 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |
| 2 | 2043-07-22 | 17:12:59 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |
| 3 | 2008-12-13 | 06:58:03 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |
| 4 | 2009-01-06 | 15:46:55 | fnard:-1 | fnok:-1 | cark:-1 | gnuck:-1 |


In [52]:
df_append.to_csv('task1.csv', index=False)

In [55]:
from dask.distributed import performance_report

with performance_report(filename="dask-report-task1.html"):
    df_n4.compute()

---

## [TASK 2] Log file analysis (5 points)

The file `s3://bigdatateaching/forensicswiki/2012_logs.txt` is a year's worth of Apache logs for the forensicswiki website. Each line of the log file corresponds to a single HTTP GET command sent to the web server. The log file is in the Combined Log Format.
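As an illustration of the Combined Log Format, the sketch below parses one line with a regular expression and derives its `YYYY-MM` month. The sample line and the `LOG_PATTERN` name are made up for this example and are not taken from the real log; malformed lines in the actual file may not match this pattern.

```python
import re
from datetime import datetime

# Made-up request line in Combined Log Format (not taken from the real log).
line = (
    '192.0.2.1 - - [01/Jan/2012:00:35:03 -0800] '
    '"GET /wiki/Main_Page HTTP/1.1" 200 5120 '
    '"http://example.com/" "ExampleAgent/1.0"'
)

# host ident user [timestamp] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = LOG_PATTERN.match(line)
when = datetime.strptime(match.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
month = when.strftime("%Y-%m")
print(match.group("host"), match.group("status"), month)
```

The bracketed timestamp is the only field needed for the monthly hit count; the other groups are shown only to make the layout of the format visible.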

Start off by copying the file from bigdatateaching into your own S3 bucket! Use the lab materials to find the command to do this.

Your goal in this problem is to report the number of hits for each month. Your final job output should look like this:
```
2010-01,xxxxxx
2010-02,yyyyyy
...
...
```
Where xxxxxx and yyyyyy are replaced by the actual number of hits in each month. **First, copy the `s3://bigdatateaching/forensicswiki/2012_logs.txt` file from the bigdatateaching S3 bucket into your own S3 bucket**
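The requested output is a count per month, one `YYYY-MM,count` line each. A minimal sketch of that aggregation on three made-up timestamps follows; the `month_key` helper is hypothetical, and a real solution would perform the counting with Dask rather than in plain Python.

```python
from collections import Counter
from datetime import datetime

# Made-up bracketed timestamps standing in for fields pulled from log lines.
timestamps = [
    "01/Jan/2012:00:35:03 -0800",
    "15/Jan/2012:12:00:00 -0800",
    "02/Feb/2012:08:10:45 -0800",
]


def month_key(timestamp):
    """Convert '01/Jan/2012:00:35:03 -0800' to '2012-01'."""
    day_part = timestamp.split(":", 1)[0]  # '01/Jan/2012'
    return datetime.strptime(day_part, "%d/%b/%Y").strftime("%Y-%m")


hits = Counter(month_key(ts) for ts in timestamps)

# 'YYYY-MM' keys sort chronologically as plain strings.
report_lines = [f"{month},{count}" for month, count in sorted(hits.items())]
print("\n".join(report_lines))
```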

**<u>Submission Requirements</u>**

1. A file called `dask-report-task2.html` which contains the dask performance report generated via the [performance_report function](https://distributed.dask.org/en/stable/diagnosing-performance.html).  You should include the final Dask operation that you do in this performance report.

1. A file called `task2.csv` that contains the output in the format:
```
2010-01,xxxxxx
2010-02,yyyyyy
...
...
```

In [2]:
!aws s3 cp s3://bigdatateaching/forensicswiki/2012_logs.txt s3://anly502-fall-2022-yl1353/forensicswiki/2012_logs.txt

copy: s3://bigdatateaching/forensicswiki/2012_logs.txt to s3://anly502-fall-2022-yl1353/forensicswiki/2012_logs.txt


**<u>Hint 1</u>**: You can use the `sep` argument of `read_csv` to provide a delimiter that works for this file (as we can see, "," is not a delimiter for this file). Pass `dtype='object'`. This Stack Overflow link is helpful: https://stackoverflow.com/questions/34266263/reading-csv-with-separator-in-python-dask. Also, notice that there is no header in this file.

**<u>Hint 2</u>**: This file has formatting errors: not every line has the same number of columns, and you will need to handle that. See the documentation for the Pandas `read_csv` function at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html. You will also need to do something else to deal with these errors: `read_csv` supports two different engines, so try both to see which works. Again, **read the docs**.

**<u>Hint 3</u>**: The `Pandas apply` function https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html, the `datetime` module https://docs.python.org/3/library/datetime.html and the `parser` module https://dateutil.readthedocs.io/en/stable/parser.html are your friends, get familiar with them.
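To illustrate what Hint 2 means by lines with a different number of columns, the sketch below reads a tiny made-up table in which one row has an extra field. It assumes pandas 1.3 or later, where `read_csv` accepts the `on_bad_lines` parameter; the data and column names are invented, and which engine and options suit the real log file is left for you to determine.

```python
from io import StringIO

import pandas as pd

# Made-up space-separated data; the second row has one field too many.
raw = (
    "a 1 x\n"
    "b 2 y EXTRA\n"
    "c 3 z\n"
)

# Assumption: pandas >= 1.3, where the on_bad_lines parameter is available.
frame = pd.read_csv(
    StringIO(raw),
    sep=" ",
    header=None,
    names=["key", "number", "tag"],
    engine="python",
    on_bad_lines="skip",  # drop rows with too many fields instead of raising
)
print(frame)
```

Skipping rows silently discards data, so it is worth checking how many lines are dropped before relying on the resulting counts.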

In [81]:
df_task2 = dd.read_csv('s3://anly502-fall-2022-yl1353/forensicswiki/2012_logs.txt', sep='-0800', names = ['value'], 
                        engine='python', dtype='object')

df_task2.head()

| | value |
|---|---|
| 0 | 77.21.0.59 - - [01/Jan/2012:00:35:03 -0800] "G... |
| 1 | 77.21.0.59 - - [01/Jan/2012:00:35:04 -0800] "G... |
| 2 | 77.21.0.59 - - [01/Jan/2012:00:35:04 -0800] "G... |
| 3 | 77.21.0.59 - - [01/Jan/2012:00:35:05 -0800] "G... |
| 4 | 77.21.0.59 - - [01/Jan/2012:00:35:05 -0800] "G... |


In [6]:
import re
import sys
import time
import datetime
from collections import Counter
from io import StringIO

In [133]:
df_2 = df_task2['value'].str.findall(r'\[(.*?)\:').compute()
df_2.head()

0    [01/Jan/2012]
1    [01/Jan/2012]
2    [01/Jan/2012]
3    [01/Jan/2012]
4    [01/Jan/2012]
Name: value, dtype: object

In [135]:
df_2 = df_2.str[0]
df_2.head()

0    01/Jan/2012
1    01/Jan/2012
2    01/Jan/2012
3    01/Jan/2012
4    01/Jan/2012
Name: value, dtype: object

In [95]:
df_date = df_2.to_frame()
df_date.head()

| | value |
|---|---|
| 0 | 01/Jan/2012 |
| 1 | 01/Jan/2012 |
| 2 | 01/Jan/2012 |
| 3 | 01/Jan/2012 |
| 4 | 01/Jan/2012 |


In [139]:
clean_list = []
for i in df_2:
    date = str(i)[3:11]
    obj = datetime.datetime.strptime(date, '%b/%Y')
    value = datetime.datetime.strftime(obj, '%Y-%m')
    clean_list.append(value)
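The loop above converts one date at a time. As a possible alternative, the same conversion can be written as a vectorized pandas operation; the sketch below uses three made-up dates rather than the real column, and the variable names are chosen for illustration.

```python
import pandas as pd

# Made-up dates in the same 'DD/Mon/YYYY' form as the extracted values.
dates = pd.Series(["01/Jan/2012", "17/Jan/2012", "29/Feb/2012"], name="value")

# Parse the whole column at once, then format each date as 'YYYY-MM'.
months = pd.to_datetime(dates, format="%d/%b/%Y").dt.strftime("%Y-%m")

# sort_index() orders the counts by month rather than by frequency.
monthly_hits = months.value_counts().sort_index()
print(monthly_hits)
```

Giving `pd.to_datetime` an explicit `format` avoids per-value format inference, which tends to matter on columns with millions of rows.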

In [142]:
df_date = pd.DataFrame(clean_list, columns = ['time'])
df_date.head()

| | time |
|---|---|
| 0 | 2012-01 |
| 1 | 2012-01 |
| 2 | 2012-01 |
| 3 | 2012-01 |
| 4 | 2012-01 |


In [145]:
df_date.time.value_counts()

2012-01    1544100
2012-10    1498895
2012-08    1450426
2012-11    1397343
2012-12    1396198
2012-02    1325030
2012-06    1300250
2012-07    1287187
2012-09    1284945
2012-03    1274061
2012-05    1173380
2012-04    1016456
2013-01       1283
Name: time, dtype: int64

In [150]:
# Sort by month and omit the header so the file matches the requested YYYY-MM,count layout
results = df_date.time.value_counts().sort_index()
results.to_csv('task2.csv', header=False)

In [151]:
final_dask = dd.from_pandas(df_date, npartitions = 13)

In [160]:
with performance_report(filename="dask-report-task2.html"):
    final_dask.compute()