# Project 2: Tracking User Activity (Part I)

This Jupyter notebook documents, step by step, how the pipeline was built and how the Spark session was launched.

The document is organized in three chapters:
- [JSON file data structure](#json_file_data_structure)
- [Publish and consume messages with Kafka](#publish_and_consume_messages_with_kafka)
- [Launch Spark](#launch_spark)

<a id='json_file_data_structure'></a>
## 1. JSON file data structure

### 1.1 Get the data using curl

In [34]:
!curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 9096k  100 9096k    0     0  21.2M      0 --:--:-- --:--:-- --:--:-- 71.1M


### 1.2 Explore JSON file to understand data structure

The data is a JSON file with several levels of nesting. Each entry represents one assessment. The top level holds general data about the assessment, such as the exam_id, the exam_name, the user_id, whether it was a certification, and the start date/time. The 'sequences' key contains the 'questions' key and the 'counts' key. The 'counts' key holds summary metrics for the assessment, such as the number of correct, incorrect, incomplete, and unanswered questions. The 'questions' key lists each question of the exam, and each question has an 'options' key that goes one level further down, to the alternatives available for that question.
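As a minimal sketch of the nesting described above, the following walks a hand-made record that mimics the structure. The sample values, the key names inside `counts` (`correct`, `incorrect`), and the helper `summarize_attempt` are assumptions made for illustration; they are not taken from the real file.

```python
import json

# Hand-made record mimicking the nesting described above. The values and the
# key names inside "counts" are illustrative assumptions, not real data.
sample_record = json.loads("""
{
  "exam_name": "Example exam",
  "sequences": {
    "counts": {"correct": 1, "incorrect": 1},
    "questions": [
      {"user_correct": true, "user_incomplete": false,
       "options": [{"checked": true}, {"checked": false}]},
      {"user_correct": false, "user_incomplete": true,
       "options": [{"checked": false}]}
    ]
  }
}
""")


def summarize_attempt(record):
    """Return (number of questions, number of options, number of checked options)."""
    # .get() with defaults keeps the walk safe if a level is missing.
    questions = record.get("sequences", {}).get("questions", [])
    options = [opt for q in questions for opt in q.get("options", [])]
    checked = sum(1 for opt in options if opt.get("checked"))
    return len(questions), len(options), checked


print(summarize_attempt(sample_record))  # (2, 3, 1)
```

The same function could be applied to each element of the list returned by `json.load` on the downloaded file, since every entry follows this shape.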

In [35]:
!cat assessment-attempts-20180128-121051-nested.json | jq '.[0]'

{
  "keen_timestamp": "1516717442.735266",
  "max_attempts": "1.0",
  "started_at": "2018-01-23T14:23:19.082Z",
  "base_exam_id": "37f0a30a-7464-11e6-aa92-a8667f27e5dc",
  "user_exam_id": "6d4089e4-bde5-4a22-b65f-18bce9ab79c8",
  "sequences": {
    "questions": [
      {
        "user_incomplete": true,
        "user_correct": false,
        "options": [
          {
            "checked": true,
            "at": "2018-01-23T14:23:24.670Z",
            "id": 

<a id='publish_and_consume_messages_with_kafka'></a>
## 2. Publish and consume messages with Kafka

### 2.1 Spin up the cluster using docker-compose

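We will use a cluster with five containers to build our pipeline:
- kafka (the message broker, used for data ingestion)
- zookeeper (the coordination service that Kafka relies on)
- spark (for data transformation)
- cloudera/HDFS (for persisting data to distributed storage)
- mids (a Linux container with a bash shell, used here to run jq and kafkacat)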

In [41]:
!docker-compose up -d

Creating network "project-2-lbrossi_default" with the default driver
Creating project-2-lbrossi_mids_1 ... 
Creating project-2-lbrossi_cloudera_1 ... 
Creating project-2-lbrossi_zookeeper_1 ... 
Creating project-2-lbrossi_spark_1     ... done
Creating project-2-lbrossi_kafka_1     ... done

### 2.2 Check every container is up

In [42]:
!docker-compose ps

         Name                   Command           State           Ports         
--------------------------------------------------------------------------------
project-2-lbrossi_clou   cdh_startup_script.sh    Up      11000/tcp, 11443/tcp, 
dera_1                                                    19888/tcp, 50070/tcp, 
                                                          8020/tcp, 8088/tcp,   
                                                          8888/tcp, 9090/tcp    
project-2-lbrossi_kafk   /etc/confluent/docker/   Up      29092/tcp, 9092/tcp   
a_1                      run                                                    
project-2-lbrossi_mids   /bin/bash                Up      8888/tcp              
_1                                                                              
project-2-lbrossi_spar   docker-entrypoint.sh     Up      0.0.0.0:8888->8888/tcp
k_1                      bash                                                   
project-2-lbrossi_zook   /et

### 2.3 Create the *assessments* topic in Kafka

The *assessments* topic is the channel through which the messages from the JSON file are published to Kafka.

In [None]:
!docker-compose exec kafka \
  kafka-topics \
    --create \
    --topic assessments \
    --partitions 1 \
    --replication-factor 1 \
    --if-not-exists \
    --zookeeper zookeeper:32181

### 2.4 Check the topic has been properly created

In [26]:
!docker-compose exec kafka \
  kafka-topics \
    --describe \
    --topic assessments \
    --zookeeper zookeeper:32181

Topic: assessments	PartitionCount: 1	ReplicationFactor: 1	Configs: 
	Topic: assessments	Partition: 0	Leader: 1	Replicas: 1	Isr: 1


### 2.5 Publish the messages into Kafka

In [27]:
!docker-compose exec mids \
  bash -c "cat ~/project-2-lbrossi/assessment-attempts-20180128-121051-nested.json \
    | jq '.[]' -c \
    | kafkacat -P -b kafka:29092 -t assessments && echo 'Messages published'"

Messages published
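For readers less familiar with jq: `jq '.[]' -c` unpacks the top-level JSON array and prints each element as one compact line, so kafkacat publishes one message per assessment. A rough Python equivalent of that splitting step is sketched here. The function name `to_json_lines` and the sample input are assumptions for illustration, and jq's exact formatting may differ in small details.

```python
import json


def to_json_lines(json_array_text):
    """Split a JSON array into one compact JSON string per element."""
    records = json.loads(json_array_text)
    # separators=(",", ":") drops the spaces after commas and colons,
    # similar to jq's compact (-c) output.
    return [json.dumps(rec, separators=(",", ":")) for rec in records]


# Tiny stand-in for the assessments file (illustrative values only).
sample = '[{"exam_name": "A", "sequences": {"questions": []}}, {"exam_name": "B"}]'
for line in to_json_lines(sample):
    print(line)
```

Each printed line corresponds to what would become a single Kafka message in the `assessments` topic.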


<a id='launch_spark'></a>
## 3. Launch Spark

### 3.1 Launch a Spark session in a Jupyter notebook

In [29]:
!docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root --notebook-dir=/w205/' pyspark

[I 20:42:58.740 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 20:42:58.882 NotebookApp] Serving notebooks from local directory: /w205
[I 20:42:58.883 NotebookApp] 0 active kernels 
[I 20:42:58.884 NotebookApp] The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=65c2f9aaf5407f97012a1ffcd9fb387951eb44ac3f7e6ab7
[I 20:42:58.884 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 20:42:58.885 NotebookApp] 
    
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://0.0.0.0:8888/?token=65c2f9aaf5407f97012a1ffcd9fb387951eb44ac3f7e6ab7
^C
[I 20:45:00.506 NotebookApp] interrupted
Serving notebooks from local directory: /w205
0 active kernels 
The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=65c2f9aaf5407f97012a1ffcd9fb

### 3.2 Get the external IP to access Jupyter from the browser

In [30]:
from requests import get

# Ask a public service (api.ipify.org) for this machine's external IP address
ip = get('https://api.ipify.org').text
print('IP is:', ip)

IP is: 34.83.60.96
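
The Jupyter log prints a URL bound to `0.0.0.0`, which a remote browser cannot use as written; the external IP has to be substituted for it. A small sketch of that substitution follows, assuming port 8888 is reachable from your machine. `external_notebook_url` is a hypothetical helper, and `YOUR_TOKEN` stands in for the token printed in the log.

```python
from urllib.parse import urlsplit, urlunsplit


def external_notebook_url(logged_url, external_ip):
    """Replace the host in the URL printed by Jupyter with the external IP."""
    parts = urlsplit(logged_url)
    # Keep the port from the logged URL (8888 in this notebook), if any.
    netloc = f"{external_ip}:{parts.port}" if parts.port else external_ip
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(external_notebook_url("http://0.0.0.0:8888/?token=YOUR_TOKEN", "34.83.60.96"))
# http://34.83.60.96:8888/?token=YOUR_TOKEN
```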
