Preparing Our Network Data for Streaming
----------------------------------------

Recall from our introduction that we are going to use netflow data from Los Alamos National Laboratory (LANL). It's a huge dataset and comes to us in CSV format. To better simulate true streaming, we're going to convert it to JSON; to allow us to run it locally, we'll use only a small subset of the data.

If we wanted to replicate this exercise on a larger dataset, we could set up a larger Spark cluster on a cloud provider or in a datacenter. We could also add a Kafka / Zookeeper layer to our environment. Both steps would bring us closer to a realistic production environment.

For this part of the exercise, we'll:

1. Get a network file from LANL.
2. Open the first 500,000 rows (only about 2.5 minutes of data) in Pandas.
3. Write the partial dataset as a single CSV.
4. Split the CSV into multiple 10,000-row CSVs (50 total).
5. Convert the 50 CSVs into streaming-formatted JSON.
    
You're free to adjust the total number of rows, as well as the size of the JSON files, but we find that these values provide a good balance of data volume and timely processing.

You have two ways to work with the data here. Since the data directory should be mapped as a volume into your Docker container, you can either open a shell in the container with `docker exec -it pyspark1 /bin/bash` (if your container is named `pyspark1`), or use the mapped volume locally on your system.

Either way, `cd` into the directory where you want this data to live so that all our data is centrally located.

Then, get the data. This example grabs day 3. Use [this link to download the dataset](https://s3-us-gov-west-1.amazonaws.com/unified-host-network-dataset/2017/netflow/netflow_day-03.bz2), and then move it into your `big-data-student-resources/datasets/lanl` directory.

Once that finishes downloading, you can come back to this notebook and run the Python steps here.

We import Pandas, define a headers list, and then read the CSV.

Pandas is great for this step because it natively decompresses the bz2 format, can apply headers when there aren't any, and can easily limit the number of rows we import.

In [1]:
import pandas as pd

headers = ['time', 'duration', 'srcdevice', 'dstdevice', 'protocol',
           'srcport', 'dstport', 'srcpackets', 'dstpackets', 'srcbytes', 'dstbytes']

dfDay03 = pd.read_csv('./datasets/lanl/netflow_day-03.bz2', header=None, names=headers, nrows=500000)
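
To see those three Pandas features in isolation, here is a minimal sketch that does not need the real download. It writes three made-up rows to a small bz2 file and reads two of them back. The file name `netflow_sample.bz2`, the row values, and the name `dfSample` are assumptions for illustration only.

```python
import bz2
import os
import tempfile

import pandas as pd

# Column names for the headerless netflow file (same list as above).
headers = ['time', 'duration', 'srcdevice', 'dstdevice', 'protocol',
           'srcport', 'dstport', 'srcpackets', 'dstpackets', 'srcbytes', 'dstbytes']

# Three made-up rows in the same layout as the real data (no header line).
sample_rows = (
    "172800,0,Comp000001,Comp000002,17,Port00001,53,1,0,73,0\n"
    "172800,1,Comp000003,Comp000004,6,Port00002,443,5,4,600,900\n"
    "172801,0,Comp000001,Comp000002,17,Port00003,53,1,0,60,0\n"
)

# Write the rows to a bz2-compressed temporary file (hypothetical name).
sample_path = os.path.join(tempfile.mkdtemp(), 'netflow_sample.bz2')
with bz2.open(sample_path, 'wt') as f:
    f.write(sample_rows)

# Compression is inferred from the .bz2 extension, names= supplies the
# missing header, and nrows=2 stops reading after two rows.
dfSample = pd.read_csv(sample_path, header=None, names=headers, nrows=2)
print(dfSample.shape)
print(list(dfSample.columns))
```

Only two of the three rows are loaded, which is the same mechanism that limits the real import to 500,000 rows.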

We then check the basics of our imported dataframe for length and to make sure the data looks right.

In [2]:
len(dfDay03)

500000

In [3]:
dfDay03.tail()

| | time | duration | srcdevice | dstdevice | protocol | srcport | dstport | srcpackets | dstpackets | srcbytes | dstbytes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 499995 | 172957 | 0 | Comp208915 | Comp275646 | 17 | Port27607 | 53 | 1 | 0 | 73 | 0 |
| 499996 | 172957 | 0 | Comp208915 | Comp275646 | 17 | Port55333 | 53 | 1 | 0 | 60 | 0 |
| 499997 | 172957 | 0 | Comp208915 | Comp275646 | 17 | Port74721 | 53 | 1 | 0 | 65 | 0 |
| 499998 | 172957 | 0 | Comp208915 | Comp275646 | 17 | Port80500 | 53 | 1 | 0 | 63 | 0 |
| 499999 | 172957 | 0 | Comp208915 | Comp275646 | 17 | Port81522 | 53 | 1 | 0 | 63 | 0 |
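
The "about 2.5 minutes" figure can be sanity-checked from the `time` column. The sketch below assumes that `time` is measured in seconds. The helper name `time_span_seconds` and the starting value `172800` in the stand-in frame are hypothetical; `172957` is the last timestamp shown in the tail output.

```python
import pandas as pd


def time_span_seconds(df, column='time'):
    """Return the difference between the largest and smallest value in `column`."""
    return int(df[column].max() - df[column].min())


# Tiny stand-in frame: 172800 is an assumed first timestamp,
# 172957 is the last timestamp from the tail output.
dfExample = pd.DataFrame({'time': [172800, 172850, 172957]})

span = time_span_seconds(dfExample)
print(span, 'seconds, or about', round(span / 60, 1), 'minutes')
```

In the notebook itself you would call `time_span_seconds(dfDay03)` on the real frame instead of the stand-in.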


Since all we need from Pandas here is a smaller CSV, go ahead and write it now.

In [4]:
dfDay03.to_csv('./datasets/lanl/netflow_day-03_partial.csv', header=True, index=False)

Now we'll go back to the command line and process the CSV.

Note that these commands were developed using `zsh` on macOS. If you are using `bash` or another shell, the syntax will likely be slightly different.

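First, let's split the large CSV into smaller files. The CSV starts with a header row, so skip it with `tail` before splitting; otherwise the header counts as a line and you end up with an extra, one-line 51st file. From the terminal, in the directory that contains the CSV, type:
`tail -n +2 netflow_day-03_partial.csv | split -l 10000 - segment_`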

You'll now have 50 new files named `segment_aa` through `segment_bx`. It's time to convert them to JSON.

We'll do this using the `csvkit` utility. If you haven't installed it, you can do so via a simple `pip install csvkit`. Within `csvkit`, there is a command called `csvjson` that converts CSV files into JSON.

First, make a single headers file that will be used on each of the smaller splits:
`head -1 netflow_day-03_partial.csv > headers.csv`

Now, create a shell script in your preferred text editor with the following content (again, written in `zsh`):

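```
#!/bin/zsh
# Stop at the first command that fails.
set -e

fnumber=1

# Collect the segment files; the (N) qualifier gives an empty list if none match.
files=(segment_*(N))

for input in $files; do
    # Zero-pad the counter to two digits (01, 02, ...).
    printf -v j "%02d" $fnumber
    echo "working on time_$j.json"
    # Prepend the header row so csvjson can name the fields.
    cat headers.csv "$input" > "time_$j.csv"
    # --stream writes one JSON object per line instead of a single array.
    csvjson --stream "time_$j.csv" > "time_$j.json"
    # Remove the intermediate CSV and the processed segment.
    rm "time_$j.csv" "$input"
    fnumber=$((fnumber + 1))
done
```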

Save the file, make it executable from the command line (`chmod a+x <your_script>.sh`), and then run it from the directory that contains the `segment_` files and `headers.csv` (`./<your_script>.sh`).

All this script does is create a list of all the `segment_` files, then iterate over each one to create a JSON file from the CSV. When this is complete, your 50 `segment_` files will have been replaced by `time_xx.json` files, where `xx` is a value between `01` and `50`.
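
If you are not on `zsh`, or would rather stay in the notebook, the same split-and-convert idea can be sketched in Pandas. This is an alternative to the shell script above, not what the script itself does: it relies on `DataFrame.to_json` with `orient='records'` and `lines=True` to write one JSON object per line, and its output is not guaranteed to match `csvjson` byte for byte. The function name `write_json_segments` and the demo data are hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_json_segments(df, out_dir, rows_per_file=10000):
    """Write df as numbered newline-delimited JSON files and return their paths."""
    paths = []
    for n, start in enumerate(range(0, len(df), rows_per_file), start=1):
        path = os.path.join(out_dir, f'time_{n:02d}.json')
        # orient='records' with lines=True writes one JSON object per line.
        df.iloc[start:start + rows_per_file].to_json(path, orient='records', lines=True)
        paths.append(path)
    return paths


# Small demonstration: 25 made-up rows split into files of 10 rows each.
dfDemo = pd.DataFrame({'time': range(25), 'dstport': [53] * 25})
demo_dir = tempfile.mkdtemp()
demo_paths = write_json_segments(dfDemo, demo_dir, rows_per_file=10)
print([os.path.basename(p) for p in demo_paths])
```

For the real data you would call `write_json_segments(dfDay03, './datasets/lanl')` with the default of 10,000 rows per file, which yields `time_01.json` through `time_50.json` for a 500,000-row frame.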

At this point, move the original CSV, `headers.csv`, the downloaded bz2, and the shell script out of the directory so that only JSON files remain. This matters because Spark treats the files it finds in a streaming source directory as input, so stray non-JSON files would get in the way.
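
Before moving on, it can help to confirm that the directory really contains only well-formed JSON files. The following is a minimal sketch using only the standard library. The function name `check_stream_dir` and the demo files are hypothetical, and it assumes each JSON file holds one object per line, as `csvjson --stream` produces.

```python
import json
import os
import tempfile


def check_stream_dir(path):
    """Return (non_json_files, record_count) for a directory of JSON-lines files."""
    stray = []
    records = 0
    for name in sorted(os.listdir(path)):
        if not name.endswith('.json'):
            stray.append(name)
            continue
        with open(os.path.join(path, name)) as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed line
                    records += 1
    return stray, records


# Demonstration on a throwaway directory with one JSON file and one stray CSV.
check_dir = tempfile.mkdtemp()
with open(os.path.join(check_dir, 'time_01.json'), 'w') as f:
    f.write('{"time": 172800, "dstport": 53}\n{"time": 172801, "dstport": 53}\n')
with open(os.path.join(check_dir, 'headers.csv'), 'w') as f:
    f.write('time,dstport\n')

stray_files, record_count = check_stream_dir(check_dir)
print(stray_files, record_count)
```

An empty list of stray files and a record count equal to the number of rows you exported (500,000 with the defaults used here) suggests the directory is ready.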

Congratulations! You've successfully prepared our dataset for streaming. Time to dive into the fun part!