# Jeffry Zheng Project 2: Tracking User Activity

Initial setup: cloned the Project 2 repo, then created and switched to the assignment branch.
    
    /w205/project-2-jeffry-zheng

Pulling assessment data:
    
    curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp

# Spinning up the pipeline

Copy the YAML file into the project directory and edit it:
    
    cp ~/w205/course-content/08-Querying-Data/docker-compose.yml ~/w205/project-2-jeffry-zheng/

Edit the spark service to expose port 8888:

        expose:
          - "8888"
        ports:
          - "8888:8888"

The YAML file was edited to connect Jupyter Notebook to pyspark.
Its services are zookeeper, kafka, cloudera, spark, and mids.

Spinning up the cluster:
    
    docker-compose up -d
    docker-compose ps
    docker ps -a
    docker-compose logs -f kafka

Creating and checking topic "assessments":

    docker-compose exec kafka kafka-topics --create --topic assessments --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181
    docker-compose exec kafka kafka-topics --describe --topic assessments --zookeeper zookeeper:32181

Publish and consume messages with Kafka:

    docker-compose exec mids bash -c "cat /w205/project-2-jeffry-zheng/assessment-attempts-20180128-121051-nested.json | jq '.[]' -c | kafkacat -P -b kafka:29092 -t assessments"
    docker-compose exec mids bash -c "kafkacat -C -b kafka:29092 -t assessments -o beginning -e"
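
The `jq '.[]' -c` step splits the top-level JSON array into one compact JSON object per line, so each assessment is published as its own Kafka message. A rough plain-Python equivalent of that unrolling is sketched below. The function name `unroll_json_array` and the two-record sample are made up for illustration. The real file holds the full set of assessments.

```python
import json


def unroll_json_array(text):
    """Return one compact JSON string per element of a top-level JSON array."""
    records = json.loads(text)
    # separators=(",", ":") drops optional whitespace, similar to jq's -c flag
    return [json.dumps(record, separators=(",", ":")) for record in records]


# Hypothetical two-record sample standing in for the assessments file.
sample = '[{"keen_id": "a1", "exam_name": "Exam A"}, {"keen_id": "b2", "exam_name": "Exam B"}]'
lines = unroll_json_array(sample)
for line in lines:
    print(line)
```

Each printed line corresponds to one message that kafkacat would send to the topic.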

Linking the Spark container to the /w205 mount point:

    docker-compose exec spark bash
    ln -s /w205 w205
    exit

Starting a Jupyter Notebook with a pyspark kernel:

    docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' pyspark

# Transforming messages with Spark

In [18]:
import json
import pprint
from pyspark.sql import Row


### Pretty Print First Entry

In [15]:
p = pprint.PrettyPrinter(indent=1)


In [48]:
# read the raw JSON file; the with-block closes the file automatically
with open("assessment-attempts-20180128-121051-nested.json", "r") as f:
    json_data = json.load(f)
len(json_data)

3280

In [49]:
p.pprint(json_data[0])

{'base_exam_id': '37f0a30a-7464-11e6-aa92-a8667f27e5dc',
 'certification': 'false',
 'exam_name': 'Normal Forms and All That Jazz Master Class',
 'keen_created_at': '1516717442.735266',
 'keen_id': '5a6745820eb8ab00016be1f1',
 'keen_timestamp': '1516717442.735266',
 'max_attempts': '1.0',
 'sequences': {'attempt': 1,
               'counts': {'all_correct': False,
                          'correct': 2,
                          'incomplete': 1,
                          'incorrect': 1,
                          'submitted': 4,
                          'total': 4,
                          'unanswered': 0},
               'id': '5b28a462-7a3b-42e0-b508-09f3906d1703',
               'questions': [{'id': '7a2ed6d3-f492-49b3-b8aa-d080a8aad986',
                              'options': [{'at': '2018-01-23T14:23:24.670Z',
                                           'checked': True,
                                           'correct': True,
                                           'id': '

## Unrolling Nested JSON

In [73]:
#creating data frame by subscribing to the kafka topic
raw_assessments = spark.read.format("kafka").option("kafka.bootstrap.servers", "kafka:29092").option("subscribe","assessments").option("startingOffsets", "earliest").option("endingOffsets", "latest").load() 
raw_assessments.cache()

DataFrame[key: binary, value: binary, topic: string, partition: int, offset: bigint, timestamp: timestamp, timestampType: int]

In [83]:
#casting the binary Kafka value to a string column holding the raw json
assessments = raw_assessments.select(raw_assessments.value.cast('string'))

In [179]:
#writing to hdfs
assessments.write.parquet("/tmp/assessments")

In [84]:
extracted_assessments = assessments.rdd.map(lambda x: Row(**json.loads(x.value))).toDF()

In [167]:
#create a temporary table
extracted_assessments.registerTempTable('assessments')
extracted_assessments.printSchema()

root
 |-- base_exam_id: string (nullable = true)
 |-- certification: string (nullable = true)
 |-- exam_name: string (nullable = true)
 |-- keen_created_at: string (nullable = true)
 |-- keen_id: string (nullable = true)
 |-- keen_timestamp: string (nullable = true)
 |-- max_attempts: string (nullable = true)
 |-- sequences: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: map (containsNull = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: boolean (valueContainsNull = true)
 |-- started_at: string (nullable = true)
 |-- user_exam_id: string (nullable = true)
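
In the schema above, the top-level keys become flat columns while `sequences` stays a single nested column. The plain-Python sketch below shows the same split on one record, without Spark. The record is a trimmed copy of the first entry printed earlier. The variable names `flat_columns` and `nested_columns` are illustrative only.

```python
import json

# Trimmed copy of the first record's shape; most fields omitted for brevity.
message = json.dumps({
    "keen_id": "5a6745820eb8ab00016be1f1",
    "exam_name": "Normal Forms and All That Jazz Master Class",
    "sequences": {"attempt": 1, "counts": {"correct": 2, "total": 4}},
})

record = json.loads(message)

# Keys whose values are plain scalars map naturally onto flat columns.
flat_columns = sorted(k for k, v in record.items() if not isinstance(v, (dict, list)))
# Keys whose values are dicts or lists need explicit unrolling.
nested_columns = sorted(k for k, v in record.items() if isinstance(v, (dict, list)))

print(flat_columns)
print(nested_columns)
```

Fields inside `sequences` are easier to query once they are pulled out into their own columns by hand.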



In [188]:
spark.sql("select keen_id, certification, max_attempts, sequences.att from assessments limit 10").show()


+--------------------+-------------+------------+----+
|             keen_id|certification|max_attempts| att|
+--------------------+-------------+------------+----+
|5a6745820eb8ab000...|        false|         1.0|null|
|5a674541ab6b0a000...|        false|         1.0|null|
|5a67999d3ed3e3000...|        false|         1.0|null|
|5a6799694fc7c7000...|        false|         1.0|null|
|5a6791e824fccd000...|        false|         1.0|null|
|5a67a0b6852c2a000...|        false|         1.0|null|
|5a67b627cc80e6000...|        false|         1.0|null|
|5a67ac8cb0a5f4000...|        false|         1.0|null|
|5a67a9ba060087000...|        false|         1.0|null|
|5a67ac54411aed000...|        false|         1.0|null|
+--------------------+-------------+------------+----+



In [27]:
spark.sql("select keen_timestamp, sequences.questions[0].user_incomplete from assessments limit 10").show()


+------------------+-------------------------------------------------------+
|    keen_timestamp|sequences[questions] AS `questions`[0][user_incomplete]|
+------------------+-------------------------------------------------------+
| 1516717442.735266|                                                   true|
| 1516717377.639827|                                                  false|
| 1516738973.653394|                                                  false|
|1516738921.1137421|                                                  false|
| 1516737000.212122|                                                  false|
| 1516740790.309757|                                                  false|
|1516746279.3801291|                                                  false|
| 1516743820.305464|                                                  false|
|  1516743098.56811|                                                  false|
| 1516743764.813107|                                                  false|

## Business Question 1: What are the top 5 most popular exams taken?

Table of the top 5 exam names, ranked by the number of distinct base_exam_id values recorded under each name.

In [233]:
spark.sql("select count(distinct(base_exam_id)) as num_exams, exam_name from assessments group by exam_name order by num_exams desc limit 5").show()



+---------+--------------------+
|num_exams|           exam_name|
+---------+--------------------+
|        2|Being a Better In...|
|        2|          Great Bash|
|        2|Introduction to P...|
|        2|Architectural Con...|
|        1|Learning Spring P...|
+---------+--------------------+



In [230]:
pop_exams = spark.sql("select count(distinct(base_exam_id)) as num_exams, exam_name from assessments group by exam_name order by num_exams desc limit 5")

Row(num_exams=2, exam_name='Introduction to Python')

In [206]:
pop_exams.write.parquet("/tmp/pop_exams")
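
The query above counts distinct `base_exam_id` values per exam name. If popularity is instead taken to mean how many times an exam was attempted, one option is to count rows per `exam_name`. In Spark SQL that would be roughly `select exam_name, count(*) as num_attempts from assessments group by exam_name order by num_attempts desc limit 5`. The sketch below shows the same counting logic in plain Python. The sample records and the helper name `top_exams` are hypothetical.

```python
from collections import Counter

# Hypothetical records; only exam_name matters for this count.
attempts = [
    {"keen_id": "k1", "exam_name": "Exam A"},
    {"keen_id": "k2", "exam_name": "Exam B"},
    {"keen_id": "k3", "exam_name": "Exam A"},
    {"keen_id": "k4", "exam_name": "Exam A"},
    {"keen_id": "k5", "exam_name": "Exam B"},
    {"keen_id": "k6", "exam_name": "Exam C"},
]


def top_exams(records, n=5):
    """Return the n exam names with the most attempt rows, largest first."""
    counts = Counter(record["exam_name"] for record in records)
    return counts.most_common(n)


print(top_exams(attempts, n=2))
```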

### Nested multi-valued as a dictionary

In [249]:
def my_lambda_sequences(x):
    raw_dict = json.loads(x.value)
    my_dict = {"keen_id" : raw_dict["keen_id"], "sequences_id" : raw_dict["sequences"]["id"], "sequences_attempt" : raw_dict["sequences"]["attempt"]}
    return Row(**my_dict)


In [250]:
my_sequences = assessments.rdd.map(my_lambda_sequences).toDF()

In [251]:
my_sequences.registerTempTable('sequences')
my_sequences.printSchema()

root
 |-- keen_id: string (nullable = true)
 |-- sequences_attempt: long (nullable = true)
 |-- sequences_id: string (nullable = true)



In [256]:
spark.sql("select a.keen_id, a.keen_timestamp, s.sequences_id, s.sequences_attempt from assessments a join sequences s on a.keen_id = s.keen_id order by s.sequences_attempt limit 10").show()


+--------------------+------------------+--------------------+-----------------+
|             keen_id|    keen_timestamp|        sequences_id|sequences_attempt|
+--------------------+------------------+--------------------+-----------------+
|5a17a67efa1257000...|1511499390.3836269|8ac691f8-8c1a-403...|                1|
|5a26ee9cbf5ce1000...|1512500892.4166169|9bd87823-4508-4e0...|                1|
|5a29dcac74b662000...|1512692908.8423469|e7110aed-0d08-4cb...|                1|
|5a2fdab0eabeda000...|1513085616.2275269|cd800e92-afc3-447...|                1|
|5a30105020e9d4000...|1513099344.8624721|8ac691f8-8c1a-403...|                1|
|5a3a6fc3f0a100000...| 1513779139.354213|e7110aed-0d08-4cb...|                1|
|5a4e17fe08a892000...|1515067390.1336551|9abd5b51-6bd8-11e...|                1|
|5a4f3c69cc6444000...| 1515142249.858722|083844c5-772f-48d...|                1|
|5a51b21bd0480b000...| 1515303451.773272|e7110aed-0d08-4cb...|                1|
|5a575a85329e1a000...| 15156

In [265]:
sequence_info = spark.sql("select a.keen_id, a.keen_timestamp, s.sequences_id, s.sequences_attempt from assessments a join sequences s on a.keen_id = s.keen_id order by s.sequences_attempt limit 10")


In [266]:
sequence_info.write.parquet("/tmp/sequence_info")

## Business Question 2: Which sequences were attempted the most?

Table of the top 10 sequences, ranked by the sum of sequences_attempt over all rows sharing a sequences_id.

In [259]:
spark.sql("select sequences_id, sum(sequences_attempt) as num_attempts from sequences group by sequences_id order by num_attempts desc limit 10").show()


+--------------------+------------+
|        sequences_id|num_attempts|
+--------------------+------------+
|e7110aed-0d08-4cb...|         394|
|066b5326-e547-4da...|         162|
|cc7043a6-1511-4f5...|         158|
|8ac691f8-8c1a-403...|         154|
|9bd87823-4508-4e0...|         128|
|cec4308a-64dc-484...|         119|
|3585eaaa-512d-4ff...|         109|
|25ca21fe-4dbb-446...|          95|
|07177166-021f-449...|          85|
|cbc07487-e537-4b9...|          80|
+--------------------+------------+



We can see that the most-attempted sequence has an attempt total of 394.
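
The query groups rows by `sequences_id` and adds up their `sequences_attempt` values. The same group-and-sum is sketched below in plain Python. The rows and the helper name `attempts_per_sequence` are invented for illustration and are not taken from the data set.

```python
from collections import defaultdict

# Hypothetical (sequences_id, sequences_attempt) rows.
rows = [("seq-a", 1), ("seq-b", 1), ("seq-a", 1), ("seq-a", 2)]


def attempts_per_sequence(rows):
    """Sum the attempt values per sequence id, largest total first."""
    totals = defaultdict(int)
    for sequence_id, attempt in rows:
        totals[sequence_id] += attempt
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(attempts_per_sequence(rows))
```

Because the attempt values themselves are summed, a row with attempt 2 contributes 2 to the total. Counting rows instead would use `count(*)` in the query.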

### Nested multi-valued as a list

In [260]:
def my_lambda_questions(x):
    raw_dict = json.loads(x.value)
    my_list = []
    my_count = 0
    for l in raw_dict["sequences"]["questions"]:
        my_count += 1
        my_dict = {"keen_id" : raw_dict["keen_id"], "my_count" : my_count, "id" : l["id"]}
        my_list.append(Row(**my_dict))
    return my_list

In [261]:
my_questions = assessments.rdd.flatMap(my_lambda_questions).toDF()

In [262]:
my_questions.registerTempTable('questions')
my_questions.printSchema()

root
 |-- id: string (nullable = true)
 |-- keen_id: string (nullable = true)
 |-- my_count: long (nullable = true)



In [54]:
spark.sql("select q.keen_id, a.keen_timestamp, q.id from assessments a join questions q on a.keen_id = q.keen_id limit 10").show()


+--------------------+------------------+--------------------+
|             keen_id|    keen_timestamp|                  id|
+--------------------+------------------+--------------------+
|5a17a67efa1257000...|1511499390.3836269|803fc93f-7eb2-412...|
|5a17a67efa1257000...|1511499390.3836269|f3cb88cc-5b79-41b...|
|5a17a67efa1257000...|1511499390.3836269|32fe7d8d-6d89-4db...|
|5a17a67efa1257000...|1511499390.3836269|5c34cf19-8cfd-4f5...|
|5a26ee9cbf5ce1000...|1512500892.4166169|0603e6f4-c3f9-4c2...|
|5a26ee9cbf5ce1000...|1512500892.4166169|26a06b88-2758-45b...|
|5a26ee9cbf5ce1000...|1512500892.4166169|25b6effe-79b0-4c4...|
|5a26ee9cbf5ce1000...|1512500892.4166169|6de03a9b-2a78-46b...|
|5a26ee9cbf5ce1000...|1512500892.4166169|aaf39991-fa83-470...|
|5a26ee9cbf5ce1000...|1512500892.4166169|aab2e817-73dc-4ff...|
+--------------------+------------------+--------------------+



## Business Question 3: Which questions came up the most on all the exams?

Table of question ids ordered by my_count, the running position of each question within its assessment as set in my_lambda_questions.

In [264]:
spark.sql("select id, my_count from questions order by my_count desc limit 10").show()


+--------------------+--------+
|                  id|my_count|
+--------------------+--------+
|8d5372d4-a63b-40c...|      20|
|7d527603-ed07-4d1...|      19|
|bdc4d043-b161-426...|      18|
|2494e12b-070e-458...|      17|
|a994e9ca-208a-40a...|      16|
|ebc3d26e-ed70-4cd...|      15|
|ddac0ade-2320-48d...|      14|
|0bc8696a-b9b0-43b...|      13|
|c27494a4-fe24-4a1...|      12|
|be969a50-0474-409...|      11|
+--------------------+--------+



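We can see that the largest my_count is 20, so the longest assessment contained 20 questions. Because my_count is a position rather than a frequency, this table alone does not show how often a question appeared.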

In [267]:
question_info = spark.sql("select id, my_count from questions order by my_count desc limit 10")


In [268]:
question_info.write.parquet("/tmp/question_info")
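
In `my_lambda_questions`, `my_count` is incremented once per question within a single assessment, so it records a position. One way to count how often each question id occurs across assessments is sketched below in plain Python. The sample assessments and the helper name `question_frequency` are hypothetical. In Spark SQL the same idea would be roughly `select id, count(*) as times_seen from questions group by id order by times_seen desc`.

```python
from collections import Counter

# Hypothetical assessments with the same nesting as the real records.
sample_assessments = [
    {"keen_id": "k1", "sequences": {"questions": [{"id": "q1"}, {"id": "q2"}]}},
    {"keen_id": "k2", "sequences": {"questions": [{"id": "q1"}, {"id": "q3"}]}},
    {"keen_id": "k3", "sequences": {"questions": [{"id": "q1"}, {"id": "q2"}]}},
]


def question_frequency(assessments):
    """Count how many times each question id occurs across all assessments."""
    counts = Counter()
    for assessment in assessments:
        for question in assessment["sequences"]["questions"]:
            counts[question["id"]] += 1
    return counts


frequency = question_frequency(sample_assessments)
print(frequency.most_common(2))
```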

### Handling "holes" in JSON data

In [56]:
def my_lambda_correct_total(x):
    raw_dict = json.loads(x.value)
    my_list = []
    if "sequences" in raw_dict:  
        if "counts" in raw_dict["sequences"]:     
            if "correct" in raw_dict["sequences"]["counts"] and "total" in raw_dict["sequences"]["counts"]:         
                my_dict = {"correct": raw_dict["sequences"]["counts"]["correct"], 
                           "total": raw_dict["sequences"]["counts"]["total"]}
                my_list.append(Row(**my_dict))
    return my_list
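
The nested `if` checks above guard against records where `sequences`, `counts`, `correct`, or `total` is missing. A more compact variant of the same guard, using `dict.get` with empty-dict defaults, is sketched below on plain dicts. The function name `extract_correct_total` is hypothetical. The sketch assumes that `sequences` and `counts`, when present, are dicts.

```python
def extract_correct_total(raw_dict):
    """Return a one-item list with correct/total, or an empty list if either is missing."""
    # Missing levels fall back to an empty dict, so the lookups never raise KeyError.
    counts = raw_dict.get("sequences", {}).get("counts", {})
    if "correct" in counts and "total" in counts:
        return [{"correct": counts["correct"], "total": counts["total"]}]
    return []


complete = {"sequences": {"counts": {"correct": 2, "total": 4}}}
no_counts = {"sequences": {"id": "abc"}}
no_sequences = {"keen_id": "xyz"}

print(extract_correct_total(complete))
print(extract_correct_total(no_counts))
print(extract_correct_total(no_sequences))
```

Returning a list keeps the function usable with `flatMap`, where an empty list simply contributes no rows.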


In [57]:
my_correct_total = assessments.rdd.flatMap(my_lambda_correct_total).toDF()


In [58]:
my_correct_total.registerTempTable('ct')


In [59]:
spark.sql("select * from ct limit 10").show()


+-------+-----+
|correct|total|
+-------+-----+
|      2|    4|
|      1|    4|
|      3|    4|
|      2|    4|
|      3|    4|
|      5|    5|
|      1|    1|
|      5|    5|
|      4|    4|
|      0|    5|
+-------+-----+



In [60]:
spark.sql("select correct / total as score from ct limit 10").show()


+-----+
|score|
+-----+
|  0.5|
| 0.25|
| 0.75|
|  0.5|
| 0.75|
|  1.0|
|  1.0|
|  1.0|
|  1.0|
|  0.0|
+-----+



In [61]:
spark.sql("select avg(correct / total)*100 as avg_score from ct limit 10").show()


+-----------------+
|        avg_score|
+-----------------+
|62.65699745547047|
+-----------------+



In [62]:
spark.sql("select stddev(correct / total) as standard_deviation from ct limit 10").show()

+-------------------+
| standard_deviation|
+-------------------+
|0.31086692286170553|
+-------------------+
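
The average and standard deviation above are computed over every row of `ct`. As a small sanity check of the arithmetic, the sketch below recomputes both in plain Python on only the ten (correct, total) pairs shown in the earlier `select * from ct limit 10` output, so its results are not expected to match the full-table figures. It uses `statistics.stdev`, the sample standard deviation. We assume this matches what Spark SQL's `stddev` computes.

```python
import statistics

# The ten (correct, total) pairs displayed by the limit-10 query above.
pairs = [(2, 4), (1, 4), (3, 4), (2, 4), (3, 4), (5, 5), (1, 1), (5, 5), (4, 4), (0, 5)]

# Skip rows with total == 0 to avoid dividing by zero.
scores = [correct / total for correct, total in pairs if total]

avg_score = statistics.mean(scores) * 100  # percentage, like avg(correct / total)*100
sample_std = statistics.stdev(scores)      # sample standard deviation (n - 1 denominator)

print(avg_score)
print(sample_std)
```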



# Checking results in Hadoop

All directories:

    docker-compose exec cloudera hadoop fs -ls /tmp/

Individual directories:

    docker-compose exec cloudera hadoop fs -ls /tmp/assessments
    docker-compose exec cloudera hadoop fs -ls /tmp/pop_exams
    docker-compose exec cloudera hadoop fs -ls /tmp/sequence_info
    docker-compose exec cloudera hadoop fs -ls /tmp/question_info

# Shutting down the cluster

    docker-compose down
    docker-compose ps
    docker ps -a