# Understanding User Behavior in a Video Game
By: Napoleon Paxton

## Overview
This document describes how to track user events in a video game. It covers the generated data, the data pipeline, and example analytics on that data. It is meant as a template for engineers who want to add events or run analytics for Delta Quest, and as a guide for engineers who want to inspect the current pipeline and build on it. Engineers interested only in the commands used to create the pipeline and run the example analysis, without the surrounding context, can skip to Appendix 1. The walkthrough requires several command terminals, so Appendix 1 groups the commands by the terminal in which they are entered.

## Contents
1. Connecting Jupyter Notebook with PySpark
2. Create Kafka Topic and Flask Web Server
3. Watch Events with Kafkacat
4. Generate Events with Apache Bench
5. Filter Schema
6. View Purchases in Hadoop
7. Read Pyspark for queries in code cell
8. Query with Hive
9. Query with Presto
10. Streaming (filter swords)
11. Streaming (write swords)
12. Streaming (continuously feed the stream)
13. Check data populated in Hadoop
14. Appendix 1: List of Commands by Terminal Shells Opened
15. Appendix 2: Event generating file
16. Appendix 3: Redis storage

## 1. Connecting Jupyter Notebook with PySpark
My environment is a Unix shell on Google Cloud Platform (GCP). First, start all the containers defined in the docker-compose.yml file: <b>docker-compose up -d</b>. Next, in the same bash shell, create a symbolic link so that a Jupyter Notebook can work with PySpark for analysis. Enter the Spark container's bash shell with <b>docker-compose exec spark bash</b>, create the link with <b>ln -s /w205 w205</b>, and then leave the container by typing <b>exit</b>. Finally, still in the same shell, start a Jupyter Notebook with a PySpark kernel: <b>docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' pyspark</b>. Once the URL is generated, copy it into a text editor and replace the address octets (<b>0.0.0.0</b>) with the address of your computing environment.

## 2. Create Kafka Topic and Flask Web Server
Open another shell in your computing environment and run this command to create a Kafka topic: <b>docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181</b>. The topic is named events, so the command has succeeded when you see the message <b> Created topic events </b>. Next, start the game's web application using Flask: <b> docker-compose exec mids env FLASK_APP=/w205/full-stack2/game_api.py flask run --host 0.0.0.0 </b>. The server is up when you see the messages <b> * Serving Flask app "game_api" </b> and <b> * Running on http://0.0.0.0:5000/ </b>.
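
The contents of game_api.py are not reproduced in this document. As a rough sketch of the kind of payload such an application might publish to the events topic, the snippet below merges an event type with the request headers and serializes the result as JSON. The function names `build_event` and `encode_event` are hypothetical, and the header values are taken from the sample output shown in Section 7; the real application may structure its events differently.

```python
import json


def build_event(event_type, headers):
    """Merge an event type with the request headers into one flat dict."""
    event = {"event_type": event_type}
    event.update(headers)
    return event


def encode_event(event):
    """Serialize an event as UTF-8 JSON bytes, the form a Kafka producer would send."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


# Headers resembling what Apache Bench sends (assumed for illustration).
sample_headers = {
    "Host": "user1.comcast.com",
    "User-Agent": "ApacheBench/2.3",
    "Accept": "*/*",
}
payload = encode_event(build_event("purchase_sword", sample_headers))
print(payload)
```

Keeping the event flat (one key per header) is what would let a later Spark job read each header as its own column.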

## 3. Watch Events with Kafkacat
Next we want to open another shell to start kafkacat so we can watch the events on kafka: <b> docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning </b>. 

## 4. Generate Events with Apache Bench
Next we generate events using Apache bench. 

    * docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/
    * docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword
    * docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_staff
    * docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/
    * docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/purchase_a_sword
    * docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/purchase_a_staff
After running these commands, you will see GET requests logged by the Flask web application and the corresponding events appearing in kafkacat.
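
The six commands above follow one pattern: every host is paired with every endpoint. A small Python sketch can generate them, which may help when adding users or endpoints. The helper name `ab_command` and the `HOSTS` and `PATHS` lists are illustrative and not part of the repo.

```python
from itertools import product

HOSTS = ["user1.comcast.com", "user2.att.com"]
PATHS = ["", "purchase_a_sword", "purchase_a_staff"]


def ab_command(host, path, requests=10, base_url="http://localhost:5000/"):
    """Build one Apache Bench command line for a given host header and endpoint."""
    return (
        'docker-compose exec mids ab -n {n} -H "Host: {host}" {url}{path}'
        .format(n=requests, host=host, url=base_url, path=path)
    )


# Pair every host with every endpoint, in the same order as the list above.
commands = [ab_command(host, path) for host, path in product(HOSTS, PATHS)]
for command in commands:
    print(command)
```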

## 5. Filter Schema
Next, run the following Spark job to filter the events and write the purchases to HDFS: <b> docker-compose exec spark spark-submit /w205/full-stack2/filtered_writes.py</b>.
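
The filtering script itself is a Spark job and is not listed here. The core idea can be sketched in plain Python, assuming each Kafka message is a JSON object with an `event_type` field and that purchase events have types beginning with `purchase` (as `purchase_sword` does in the sample output). The `default` event type and the function names below are assumptions for illustration only.

```python
import json


def is_purchase(event):
    """Return True for events whose type marks a purchase."""
    return event.get("event_type", "").startswith("purchase")


def filter_purchases(raw_lines):
    """Parse JSON lines and keep only the purchase events."""
    events = (json.loads(line) for line in raw_lines)
    return [event for event in events if is_purchase(event)]


# Three hand-written messages standing in for the Kafka topic contents.
raw = [
    '{"event_type": "default", "Host": "user1.comcast.com"}',
    '{"event_type": "purchase_sword", "Host": "user1.comcast.com"}',
    '{"event_type": "purchase_sword", "Host": "user2.att.com"}',
]
purchases = filter_purchases(raw)
print(len(purchases))
```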

## 6. View Purchases in Hadoop
We can then see these purchases in hdfs using the following commands: <b> (1) docker-compose exec cloudera hadoop fs -ls /tmp/ </b> <b> (2) docker-compose exec cloudera hadoop fs -ls /tmp/purchases/</b>. 


## 7. Read Pyspark for queries in code cell


In [3]:

    # Load the purchases currently stored in Hadoop into a DataFrame
    df = spark.read.parquet('/tmp/purchases')

    # Show those purchases
    df.show()

    # Register the DataFrame as a temporary table named purchases
    # (createOrReplaceTempView replaces the deprecated registerTempTable)
    df.createOrReplaceTempView('purchases')

    # Select every purchase made from the host user1.comcast.com and show the result
    df_by_example2 = spark.sql("select * from purchases where host='user1.comcast.com'")
    df_by_example2.show()

    # Convert the result to a Pandas DataFrame and show summary statistics
    newdf = df_by_example2.toPandas()
    newdf.describe()

    # Store a SQL statement in a variable named query, then run it with Spark SQL
    query = "create external table purchase_events stored as parquet location '/tmp/purchase_events' as select * from purchases"
    spark.sql(query)

    +------+-----------------+---------------+--------------+--------------------+
    |Accept|             Host|     User-Agent|    event_type|           timestamp|
    +------+-----------------+---------------+--------------+--------------------+
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_sword|2021-08-06 03:11:...|
    |   */*|user1.comcast.com|ApacheBench/2.3|purchase_s

    DataFrame[]
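
To check the query logic without a running Spark session, the same SQL can be tried against an in-memory SQLite table. This is a stand-in, not Spark: the rows below are made up to resemble the output above, only two of the columns are modeled, and the per-host count is an extra example query that does not appear in the notebook.

```python
import sqlite3

# In-memory database standing in for the purchases table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("create table purchases (host text, event_type text)")
conn.executemany(
    "insert into purchases values (?, ?)",
    [
        ("user1.comcast.com", "purchase_sword"),
        ("user1.comcast.com", "purchase_sword"),
        ("user2.att.com", "purchase_sword"),
    ],
)

# Same filter as the notebook query on host.
user1_rows = conn.execute(
    "select * from purchases where host='user1.comcast.com'"
).fetchall()

# A follow-up aggregation: number of purchases per host.
per_host = dict(
    conn.execute("select host, count(*) from purchases group by host").fetchall()
)
print(user1_rows)
print(per_host)
```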

## 8. Query with Hive
Write a Hive table to HDFS using this command: <b> docker-compose exec spark spark-submit /w205/full-stack2/write_hive_table.py </b>.
Now we can check that it is listed in HDFS using these commands: <b> (1) docker-compose exec cloudera hadoop fs -ls /tmp/ </b> <b> (2) docker-compose exec cloudera hadoop fs -ls /tmp/purchases/ </b>.

## 9. Query with Presto
Now we can use Presto to query HDFS. After entering the command <b> docker-compose exec presto presto --server presto:8080 --catalog hive --schema default </b>, we are in the Presto shell and can run queries, each terminated with a semicolon. Some example queries are <b> show tables; </b>, <b> describe purchases; </b>, and <b> select * from purchases; </b>.

## 10. Streaming (filter swords)
We can also process events as a stream. The following job applies the sword-purchase filter in streaming mode: <b> docker-compose exec spark spark-submit /w205/full-stack2/filter_swords_stream.py </b>.
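
The streaming script is a Spark job and is not shown in this document. The difference from the batch filter is that events are handled one at a time as they arrive. A plain-Python generator gives a feel for that, under the assumption that messages are JSON lines and sword purchases carry the event type `purchase_sword`; the function name and sample messages are hypothetical.

```python
import json


def sword_purchases(lines):
    """Lazily yield sword-purchase events from an iterable of JSON lines."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed messages instead of stopping the stream
        if isinstance(event, dict) and event.get("event_type") == "purchase_sword":
            yield event


# An iterator standing in for messages arriving over time.
incoming = iter([
    '{"event_type": "purchase_sword", "Host": "user1.comcast.com"}',
    'not json',
    '{"event_type": "default", "Host": "user1.comcast.com"}',
    '{"event_type": "purchase_sword", "Host": "user2.att.com"}',
])
hosts = [event["Host"] for event in sword_purchases(incoming)]
print(hosts)
```

Because the generator pulls one line at a time, it never needs the full set of events in memory, which is the property that matters for an unbounded stream.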

## 11. Streaming (write swords)
We can also write a stream to parquet format using the following file: <b> docker-compose exec spark spark-submit /w205/full-stack2/write_swords_stream.py </b> 

## 12. Streaming (continuously feed the stream)
Next, here is an example of how we can continuously feed the stream: <b> while true; do docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword; sleep 5; done </b>. 
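
The same loop can be written in Python if a bounded run is preferred over an endless shell loop. This is only a sketch: `feed_stream` and its parameters are hypothetical names, and the `run` and `sleep` arguments exist so the loop can be tried without Docker by passing in stand-ins.

```python
import subprocess
import time

# The same Apache Bench invocation as in the shell loop, split into arguments.
AB_COMMAND = [
    "docker-compose", "exec", "mids", "ab", "-n", "10",
    "-H", "Host: user1.comcast.com",
    "http://localhost:5000/purchase_a_sword",
]


def feed_stream(rounds=None, delay=5.0, run=subprocess.call, sleep=time.sleep):
    """Send one Apache Bench burst per round; loop forever when rounds is None."""
    sent = 0
    while rounds is None or sent < rounds:
        run(AB_COMMAND)
        sent += 1
        sleep(delay)
    return sent
```

Calling `feed_stream()` with no arguments mirrors the shell loop, while `feed_stream(rounds=3)` stops after three bursts.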

## 13. Check Data Populated in Hadoop
To see what is populated in hadoop, run this command: <b> docker-compose exec cloudera hadoop fs -ls /tmp/sword_purchases </b>. Since we are now streaming, wait a few minutes and run this command again so you can see how it has changed over time. Once this is done, make sure to use the command <b> docker-compose down </b> to shut down your containers.




## 14. Appendix 1: Commands based on the Terminal Shells Opened
### Commands in Shell 1
1. Start up containers
    * docker-compose up -d
2. Enter Spark bash to create symbolic link for Jupyter Notebook
    * docker-compose exec spark bash
3. Create symbolic link
    * ln -s /w205 w205
4. Leave Spark bash
    * exit
5. Open up Spark based Jupyter Notebook
    * docker-compose exec spark env PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip 0.0.0.0 --allow-root' pyspark
    
### Commands in Shell 2    
1. Create Kafka topic
    * docker-compose exec kafka kafka-topics --create --topic events --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181 
2. Start Flask Web Application
    * docker-compose exec mids env FLASK_APP=/w205/full-stack2/game_api.py flask run --host 0.0.0.0

### Commands in Shell 3
1. Start Kafkacat for monitoring
    * docker-compose exec mids kafkacat -C -b kafka:29092 -t events -o beginning

### Commands in Shell 4
1. Generate Events with Apache Bench 
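    * docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/
    * docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword
    * docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_staff
    * docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/
    * docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/purchase_a_sword
    * docker-compose exec mids ab -n 10 -H "Host: user2.att.com" http://localhost:5000/purchase_a_staff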
2. Filter Schema
    * docker-compose exec spark spark-submit /w205/full-stack2/filtered_writes.py
3. View purchases in Hadoop
    * docker-compose exec cloudera hadoop fs -ls /tmp/
    * docker-compose exec cloudera hadoop fs -ls /tmp/purchases/
4. Query with Hive
    * docker-compose exec spark spark-submit /w205/full-stack2/write_hive_table.py
    * docker-compose exec cloudera hadoop fs -ls /tmp/
    * docker-compose exec cloudera hadoop fs -ls /tmp/purchases/
5. Query with Presto
    * docker-compose exec presto presto --server presto:8080 --catalog hive --schema default

### Commands in Shell 5
1. Filter Stream
    * docker-compose exec spark spark-submit /w205/full-stack2/filter_swords_stream.py

### Commands in Shell 6
1. Write HDFS Files in Streaming Mode
    * docker-compose exec spark spark-submit /w205/full-stack2/write_swords_stream.py

### Commands in Shell 7
1. Continuously Feed Stream
    * while true; do docker-compose exec mids ab -n 10 -H "Host: user1.comcast.com" http://localhost:5000/purchase_a_sword; sleep 5; done

### Commands in Shell 8
1. Check hadoop
    * docker-compose exec cloudera hadoop fs -ls /tmp/sword_purchases
2. Shutdown Containers
    * docker-compose down


## 15. Appendix 2: Event Generating File
Included in this repo is an event generating shell script file which when run will push events to Kafka. To run the file use the following command: <b> ./data_generation.txt </b>

## 16. Appendix 3: Redis Storage
To add Redis for storage, we create a new image. Redis lets you do things like track the state of users. We start with a Dockerfile containing the following lines:
    * FROM midsw205/base
    * MAINTAINER Your Name <youremail>
    * RUN pip install redis
We then build the image from the directory that contains this Dockerfile:
    * docker build -t midswredis .
Next we modify the docker-compose.yml file. We add the redis entry and then we change mids to use the redis image:
    * redis:
    *     image: redis:latest
    *     expose:
    *        - "6379"
    *     ports:
    *        - "6379:6379"
    * mids:
    *     image: midswredis
    *     stdin_open: true
    *     tty: true
    *     volumes:
    *       - /Your/Path/w205:/w205
    *     expose:
    *       - "5000"
    *     ports:
    *       - "5000:5000"
    *     extra_hosts:
    *       - "moby:127.0.0.1"
    
To use it, we enter the command <b> docker-compose exec mids bash </b>, which opens a shell in the mids container, now running the new image with the Redis client installed. Next we start Python by typing <b> ipython </b> and import the client with <b> import redis </b>. We then connect to the Redis service using <b> r = redis.Redis(host='redis', port=6379) </b>. For example, you can check the keys using <b> r.keys() </b>. If nothing has been stored yet, the list of keys will be empty. Next you can set keys using the command <b> r.mset({"Croatia": "Zagreb", "Bahamas": "Nassau"}) </b>. When you check the keys again using <b> r.keys() </b> you will see the keys Croatia and Bahamas; their values, Zagreb and Nassau, can be read back with <b> r.mget("Croatia", "Bahamas") </b>.
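
As one possible way to track user state, purchases could be counted per host in a Redis hash. The sketch below is an assumption about how that might look, not code from the repo: `record_purchase` and the `purchases:<host>` key scheme are hypothetical, and `InMemoryStore` is a small stand-in that mimics the `hincrby` and `hgetall` methods of the redis-py client so the example runs without a Redis server.

```python
class InMemoryStore:
    """Minimal stand-in exposing the two hash methods used below."""

    def __init__(self):
        self._hashes = {}

    def hincrby(self, name, key, amount=1):
        # Increment a counter stored under `key` inside the hash `name`.
        bucket = self._hashes.setdefault(name, {})
        bucket[key] = bucket.get(key, 0) + amount
        return bucket[key]

    def hgetall(self, name):
        # Return a copy of the whole hash, or an empty dict if it is missing.
        return dict(self._hashes.get(name, {}))


def record_purchase(store, host, item):
    """Increment the purchase counter for one item bought by one host."""
    return store.hincrby("purchases:" + host, item, 1)


store = InMemoryStore()
record_purchase(store, "user1.comcast.com", "sword")
record_purchase(store, "user1.comcast.com", "sword")
record_purchase(store, "user1.comcast.com", "staff")
state = store.hgetall("purchases:user1.comcast.com")
print(state)
```

Passing a `redis.Redis` connection in place of `InMemoryStore` should work the same way, with the difference that the real client returns keys and values as bytes unless it is created with `decode_responses=True`.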
