# CWL data analysis using DataFrames

In this homework you will extract, transform, and load (ETL) the Call of Duty World League (CWL) Championship data.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F  # will be used a LOT
from pyspark import Row  # Row will be used in some of the assertions

ss = (
    SparkSession.builder
    .master('spark://spark-master:7077')
    .appName('cwlanalysis')
    .getOrCreate()
)

Let's validate that you successfully uploaded all of the CWL data to HDFS:

In [2]:
from hdfs import InsecureClient

client = InsecureClient('http://namenode:50070', user='vagrant')
cwldirs = client.list('/Users/vagrant')

In [3]:
assert 'structured-2018-08-19-champs' in cwldirs

In [4]:
# let's clean up any junk parquet files that you might already have in HDFS
client.delete('/Users/vagrant/matches_df.parquet', recursive=True)
client.delete('/Users/vagrant/teammatches_df.parquet', recursive=True)
client.delete('/Users/vagrant/modes_df.parquet', recursive=True)
client.delete('/Users/vagrant/playermatches_df.parquet', recursive=True)
client.delete('/Users/vagrant/matchevents_df.parquet', recursive=True)

False

Let's read in the "champs" dataset (each json file = 1 match played = 1 row in the DataFrame):

In [None]:
df = ss.read.json('hdfs://namenode/Users/vagrant/structured-2018-08-19-champs/*.json')

In [None]:
assert df.count() == 296

Sort your DataFrame by the `id` column (in ascending order).

In [None]:
# YOUR CODE HERE
df = df.sort("id")

There are a few ways to check out what the DataFrame looks like.  The first is probably to just list out the columns:

In [None]:
df.columns

Sadly, this doesn't reveal much about *nested* structure.  It is probably better to list out the schema.  There are a few ways to do this.  The first is to use Python lingo:

In [None]:
df.dtypes

The second is to list out the schema in Scala lingo:

In [None]:
df.schema

Oh my, THAT^^ is ugly.  Fortunately, there is a "pretty print" version of this that will be much more useful:

In [None]:
df.printSchema()
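
To make the idea of a nested schema concrete without needing a cluster, here is a minimal plain-Python sketch that walks one nested record and lists its fields as an indented tree, loosely in the spirit of `printSchema()`. The `schema_lines` helper and the `sample_match` record are hypothetical and only for illustration; Spark's real output format and type names are different.

```python
def schema_lines(value, name, depth=0):
    """Return indented lines describing the nested structure of `value`."""
    pad = "  " * depth
    if isinstance(value, dict):
        # A dict plays the role of a struct: recurse into each field.
        lines = [f"{pad}{name}: struct"]
        for key in sorted(value):
            lines.extend(schema_lines(value[key], key, depth + 1))
        return lines
    if isinstance(value, list):
        # A list plays the role of an array: describe its first element only.
        lines = [f"{pad}{name}: array"]
        if value:
            lines.extend(schema_lines(value[0], "element", depth + 1))
        return lines
    # Anything else is treated as a leaf with its Python type name.
    return [f"{pad}{name}: {type(value).__name__}"]


# Hypothetical, heavily simplified match record (not the real CWL schema).
sample_match = {
    "id": "m1",
    "mode": "Hardpoint",
    "teams": [{"name": "UNILAD", "score": 250}],
}

for line in schema_lines(sample_match, "root"):
    print(line)
```

The point of the sketch is that a flat list of column names hides everything below the top level, while a tree view shows the fields inside `teams` as well.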

### `matches_df`

The columns `events`, `players`, `teams`, and `hp_hill_names` are arrays (lists). We will want to "explode" each of them into their own tables.  Later we can analyze using table joins (just like in SQL).
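
What "explode" means can be sketched in plain Python, with no Spark involved: every element of an array column becomes its own output row, and the other columns are repeated alongside it. The `explode_rows` helper and the sample data below are hypothetical. The sketch skips rows whose array is empty or missing, which mirrors how `F.explode` behaves.

```python
def explode_rows(rows, array_key, new_key):
    """Return one output row per element of each row's `array_key` list."""
    out = []
    for row in rows:
        # `or []` makes a missing or None array produce no output rows.
        for element in row.get(array_key) or []:
            out.append({"id": row["id"], "mode": row["mode"], new_key: element})
    return out


# Hypothetical, simplified matches (not the real CWL records).
sample_matches = [
    {"id": "m1", "mode": "Hardpoint",
     "teams": [{"name": "TEAM PRISMATIC"}, {"name": "UNILAD"}]},
    {"id": "m2", "mode": "Search & Destroy", "teams": []},
]

team_rows = explode_rows(sample_matches, "teams", "team")
for row in team_rows:
    print(row)
```

Here the first match yields two rows (one per team) and the second match, with an empty array, yields none.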

Let's first create a table called `matches_df` that omits these arrays.  Use the DataFrame `.drop()` method to drop these 4 columns:

In [None]:
# YOUR CODE HERE
matches_df = df.drop('events','players','teams','hp_hill_names')

In [None]:
assert matches_df.take(2) == \
[Row(duration_ms=522000, end_time_s=1534359399, hp_hill_rotations=9, id='0066bbc8-4e5f-5641-9224-c743c1b003dc', map='London Docks', mode='Hardpoint', platform='ps4', rounds=1, series_id='champs-pool-F-1', start_time_s=1534358877, title='ww2'),
 Row(duration_ms=569000, end_time_s=1534364215, hp_hill_rotations=None, id='006a2f3e-b942-564e-9515-3a2fbff1a817', map='USS Texas', mode='Search & Destroy', platform='ps4', rounds=8, series_id='champs-pool-H-1', start_time_s=1534363646, title='ww2')]

We have seen how to visualize a couple of rows using `.take(5)`/`.head(5)`, `.show(5)`, and, if the DataFrame is small enough, `.collect()`.

A better viewing experience is to use `.limit(5)` (which builds a DataFrame of only 5 rows in Spark) and then `.toPandas()` to convert it to a Pandas DataFrame for viewing on the driver:

In [None]:
matches_df.limit(5).toPandas().head()

We are running DANGEROUSLY low on memory right now.  Let's write this DataFrame out to HDFS and delete it.  We'll read it back in later when we need it.

In [None]:
matches_df.write.parquet('hdfs://namenode/Users/vagrant/matches_df.parquet')
del matches_df
ss.catalog.clearCache()

### `modes_df`

Let's see what game modes were being played in CWL in 2018.  Recall that Call of Duty is really a collection of many games, each inspired by games that kids play on the playground (e.g. King of the Hill or Capture the Flag).

Create a DataFrame named `modes_df` where each row is a distinct game mode.  Make sure they are sorted alphabetically.  Use Spark to do the sorting (hint:  how would you do this in SQL?):

In [None]:
# YOUR CODE HERE
modes_df = df.select('mode').distinct().sort('mode')

In [None]:
assert modes_df.collect() == \
[Row(mode='Capture The Flag'),
 Row(mode='Hardpoint'),
 Row(mode='Search & Destroy')]
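
For comparison with the SQL hint, here is a small sketch that runs the equivalent query against an in-memory SQLite database from the Python standard library. The `matches` table and its four rows are made up for illustration; only the shape of the query (`SELECT DISTINCT ... ORDER BY ...`) is the point.

```python
import sqlite3

# In-memory database with a hypothetical, tiny matches table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (id TEXT, mode TEXT)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [
        ("m1", "Hardpoint"),
        ("m2", "Search & Destroy"),
        ("m3", "Hardpoint"),
        ("m4", "Capture The Flag"),
    ],
)

# DISTINCT removes duplicate modes; ORDER BY sorts them alphabetically.
query = "SELECT DISTINCT mode FROM matches ORDER BY mode"
modes = [row[0] for row in conn.execute(query)]
conn.close()

print(modes)
```

In the DataFrame API, `select`, `distinct`, and `sort` play the roles of `SELECT`, `DISTINCT`, and `ORDER BY` respectively.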

We are running DANGEROUSLY low on memory right now.  Let's write this DataFrame out to HDFS and delete it.  We'll read it back in later when we need it.

In [None]:
modes_df.write.parquet('hdfs://namenode/Users/vagrant/modes_df.parquet')
del modes_df
ss.catalog.clearCache()

### `teammatches_df`

In the original DataFrame (`df` above) the `teams` column was really an array containing the statistics for the two teams that played against each other in that match.

We want to use the `explode` function to expand elements in the array to individual rows in a new table.  Your new DataFrame should be named `teammatches_df` and contain three columns:

- `id` (so that you can join back to other tables)
- `mode` (it is useful to store this redundantly in this table so that we can cut down on expensive joins later)
- `team` (contains the struct for a single team).

`team` will still be nested (i.e. it will contain fields like `team.name` and `team.is_victor`).

Make sure that your new DataFrame is sorted in ascending order by match `id`, and then by team name.

Hints:  we did `import pyspark.sql.functions as F` above for a reason.  Also, you can `.alias` a column.  This was shown in the lecture.
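
The reason for carrying `mode` redundantly can be sketched in plain Python. If the team rows hold only the match `id`, every per-mode question needs a lookup into the matches table (the analogue of a join); if `mode` is stored on each row, the same question is answered from the team rows alone. All names and data below are hypothetical.

```python
from collections import Counter

# Hypothetical matches table: match id -> mode.
mode_by_match = {"m1": "Hardpoint", "m2": "Search & Destroy"}

# Normalized team rows: only the match id is stored.
team_rows = [
    {"id": "m1", "team": {"name": "A", "is_victor": True}},
    {"id": "m1", "team": {"name": "B", "is_victor": False}},
    {"id": "m2", "team": {"name": "A", "is_victor": False}},
    {"id": "m2", "team": {"name": "C", "is_victor": True}},
]

# Denormalized team rows: mode is copied onto every row once, up front.
team_rows_with_mode = [dict(row, mode=mode_by_match[row["id"]]) for row in team_rows]

# Wins per mode, needing a lookup (a "join") for every row.
wins_via_lookup = Counter(
    mode_by_match[row["id"]] for row in team_rows if row["team"]["is_victor"]
)

# Wins per mode, read straight from the denormalized rows.
wins_direct = Counter(
    row["mode"] for row in team_rows_with_mode if row["team"]["is_victor"]
)

print(wins_via_lookup == wins_direct)
```

Both approaches give the same counts; the trade-off is a little extra storage per row in exchange for skipping the lookup each time.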

In [None]:
# YOUR CODE HERE
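# explode() does not guarantee row order, so sort explicitly by match id, then team name
teammatches_df = df.select('id', 'mode', F.explode('teams').alias('team')).orderBy('id', 'team.name')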

In [None]:
assert teammatches_df.take(2) == \
[Row(id='0066bbc8-4e5f-5641-9224-c743c1b003dc', mode='Hardpoint', team=Row(is_victor=False, name='TEAM PRISMATIC', round_scores=[14, 39, 2, 7, 2, 0, 0, 33, 0], score=97, side='home')),
 Row(id='0066bbc8-4e5f-5641-9224-c743c1b003dc', mode='Hardpoint', team=Row(is_victor=True, name='UNILAD', round_scores=[20, 5, 51, 26, 29, 37, 44, 9, 29], score=250, side='away'))]

Let's print the schema so that we can refer to it later:

In [None]:
teammatches_df.printSchema()

We are running DANGEROUSLY low on memory right now.  Let's write this DataFrame out to HDFS and delete it.  We'll read it back in later when we need it.

In [None]:
teammatches_df.write.parquet('hdfs://namenode/Users/vagrant/teammatches_df.parquet')
del teammatches_df
ss.catalog.clearCache()

### `playermatches_df`

We want to similarly explode the `players` column in the original DataFrame (`df`) into a new DataFrame that we'll call `playermatches_df`.

Each row will contain the statistics for a single player in a single match.

As we did for `teammatches_df`, let's have 3 columns:

- `id`
- `mode`
- `player`

where `player` is the exploded column.

Make sure that the new DataFrame is sorted by `id` (first) and then by the player's name.

In [None]:
# YOUR CODE HERE
playermatches_df = df.select('id','mode',F.explode('players').alias('player')).orderBy('id','player.name')


In [None]:
grab2rows = playermatches_df.take(2)
assert grab2rows[0].id == '0066bbc8-4e5f-5641-9224-c743c1b003dc'
assert grab2rows[1].id == '0066bbc8-4e5f-5641-9224-c743c1b003dc'
assert grab2rows[0]['player']['name'] == 'ALEX'
assert grab2rows[1]['player']['name'] == 'MALLS'

In [None]:
playermatches_df.printSchema()

We are running DANGEROUSLY low on memory right now.  Let's write this DataFrame out to HDFS and delete it.  We'll read it back in later when we need it.

In [None]:
playermatches_df.write.parquet('hdfs://namenode/Users/vagrant/playermatches_df.parquet')
del playermatches_df
ss.catalog.clearCache()

### `matchevents_df`

We want to similarly explode the `events` column in the original DataFrame (`df`) into a new DataFrame that we'll call `matchevents_df`.

Each row will contain the data for a single event in a single match.

As we did for `teammatches_df`, let's have 3 columns:

- `id`
- `mode`
- `event`

where `event` is the exploded column.

Make sure that the new DataFrame is sorted by `id` (first) and then by the time from the start of the match (hint:  `time_ms` measures this).  Note that this does not specify a unique ordering (since several events in a match might occur at the exact same time):

In [None]:
# YOUR CODE HERE
matchevents_df = df.select('id','mode',F.explode('events').alias('event')).orderBy('id','event.time_ms')


In [None]:
grab2rows = matchevents_df.take(2)
assert grab2rows[0].id == '0066bbc8-4e5f-5641-9224-c743c1b003dc'
assert grab2rows[1].id == '0066bbc8-4e5f-5641-9224-c743c1b003dc'
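
Why the assertions above check only the `id` can be illustrated with a plain-Python sketch. When two events share the same `id` and `time_ms`, the sort key does not decide which comes first, so the result depends on the order the rows happened to arrive in. Adding a tie-breaker to the key removes that dependence. The `type` field and the sample events are hypothetical and are not taken from the real event schema.

```python
# Hypothetical events; two of them share the same id and time_ms.
events_a = [
    {"id": "m1", "time_ms": 500, "type": "spawn"},
    {"id": "m1", "time_ms": 500, "type": "death"},
    {"id": "m1", "time_ms": 100, "type": "spawn"},
]
# The same events arriving in a different order.
events_b = list(reversed(events_a))


def by_time(event):
    """Sort key matching the homework: match id, then time_ms."""
    return (event["id"], event["time_ms"])


def by_time_and_type(event):
    """Same key plus a tie-breaker, so equal timestamps are ordered too."""
    return (event["id"], event["time_ms"], event["type"])


# With ties in the key, the two arrival orders give different results.
ambiguous = sorted(events_a, key=by_time) != sorted(events_b, key=by_time)

# With a tie-breaker, the result no longer depends on arrival order.
deterministic = (
    sorted(events_a, key=by_time_and_type) == sorted(events_b, key=by_time_and_type)
)

print(ambiguous, deterministic)
```

In a distributed sort the arrival order of rows is not something you control, which is why a test on the first few events would be fragile unless the sort key is unique.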

In [None]:
matchevents_df.printSchema()

We are running DANGEROUSLY low on memory right now.  Let's write this DataFrame out to HDFS and delete it.  We'll read it back in later when we need it.

In [None]:
matchevents_df.write.parquet('hdfs://namenode/Users/vagrant/matchevents_df.parquet')
del matchevents_df

In [None]:
# Let's clean up as much as possible
del df
ss.catalog.clearCache()