# Class 8 Notebook 2: Working with per-line JSON files

Class 8 (7 Dec 2016) of [BS1804-1617 Fundamentals of Database Technologies](https://imperialbusiness.school/category/bs1804-1617/) by [Piotr Migdal](http://p.migdal.pl/)

* [How do I copy a remote dataset from the internet to DBFS in my Spark cluster?](https://forums.databricks.com/questions/771/how-do-i-distribute-a-local-file-on-my-drivermaste.html)

We will work on the same city data, exported to per-line JSONs from PostgreSQL with:
```
COPY (SELECT row_to_json(t) FROM city AS t) TO '/Users/pmigdal/city.json';
```
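
To make the format concrete: a per-line JSON file holds one complete JSON object on each line, so every line can be parsed on its own. A minimal plain-Python sketch, using two made-up rows (the names and numbers are illustrative, not taken from the real export):

```python
import json

# Two made-up rows in the per-line format (illustrative values only)
sample = (
    '{"name": "Aland", "population": 100, "indepyear": null}\n'
    '{"name": "Borduria", "population": 250, "indepyear": 1918}\n'
)

# Each non-empty line is a self-contained JSON document
rows = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(rows[0]["indepyear"])   # JSON null is parsed as Python None
print(rows[1]["population"])
```

This is the same idea the notebook applies with Spark: read lines as text first, then parse each line separately.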

In [2]:
!wget -P /tmp https://s3.amazonaws.com/pmigdal/country.json

In [3]:
!wget -P /tmp https://s3.amazonaws.com/pmigdal/city.json

In [4]:
!wget -P /tmp https://s3.amazonaws.com/pmigdal/countrylanguage.json

In [5]:
# list the downloaded files (dbutils is Databricks-specific)
dbutils.fs.ls("file:/tmp/")

In [6]:
# we load it as a text file
rdd = sc.textFile("file:/tmp/country.json")

In [7]:
# it is just a string with JSON content
rdd.first()

In [8]:
# a Python library for parsing json
import json

In [9]:
# number of lines
rdd.count()

In [10]:
# parsing JSON takes time
# we want to load it only once
rdd_jsons = rdd.map(json.loads).cache()

In [11]:
# now, it is a Python dictionary
rdd_jsons.first()

In [12]:
# taking a column; look at the missing values
rdd_jsons.map(lambda x: x["indepyear"]).take(10)

In [13]:
# most popular independence years
# (tuple-unpacking lambdas such as `lambda (k, v): v` are a syntax error in Python 3)
rdd_jsons.map(lambda x: (x["indepyear"], 1)) \
  .reduceByKey(lambda x, y: x + y) \
  .top(5, key=lambda kv: kv[1])
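
For intuition, the count-and-rank cell above can be mirrored without Spark. A minimal sketch in plain Python, assuming a small hypothetical list `years` stands in for the `indepyear` column:

```python
from collections import Counter

# Hypothetical stand-in for the indepyear column; None marks a missing value
years = [1960, None, 1991, 1960, None, 1991, 1960, 1918]

# Counter plays the role of mapping to (year, 1) pairs and summing per key
counts = Counter(years)

# most_common(n) plays the role of taking the top n pairs ranked by count
top3 = counts.most_common(3)
print(top3)
```

Note that `None` is counted like any other key, so missing years can show up among the most frequent values unless they are filtered out first.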

In [14]:
# the two Asian countries with the earliest known independence year
rdd_jsons \
  .filter(lambda x: x["continent"] == "Asia") \
  .filter(lambda x: x["indepyear"] is not None) \
  .sortBy(lambda x: x["indepyear"]) \
  .take(2)
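
The `is not None` filter matters for the sort as well as for the result. A minimal plain-Python sketch with made-up rows, assuming Python 3, where `None` cannot be compared with an integer:

```python
# Made-up rows; names and years are illustrative only
rows = [
    {"name": "A", "continent": "Asia", "indepyear": 1947},
    {"name": "B", "continent": "Asia", "indepyear": None},
    {"name": "C", "continent": "Europe", "indepyear": 1830},
    {"name": "D", "continent": "Asia", "indepyear": -660},
]

asian = [r for r in rows if r["continent"] == "Asia"]

# Sorting while None is still present raises TypeError in Python 3
try:
    sorted(asian, key=lambda r: r["indepyear"])
    failed = False
except TypeError:
    failed = True

# Dropping missing years first makes the ordering well defined
oldest = sorted(
    (r for r in asian if r["indepyear"] is not None),
    key=lambda r: r["indepyear"],
)[:2]
print(failed, [r["name"] for r in oldest])
```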

## Exercises

* Get the 5 most populous countries.
* Get the number of countries per continent.
* Get the total population by continent.

In [16]:
# the 5 most populous countries
rdd_jsons \
  .sortBy(lambda x: x["population"], ascending=False) \
  .take(5)

In [17]:
# number of countries per continent
rdd_jsons \
  .map(lambda x: (x["continent"], 1)) \
  .reduceByKey(lambda x, y: x + y) \
  .collect()

In [18]:
# total population by continent
rdd_jsons \
  .map(lambda x: (x["continent"], x["population"])) \
  .reduceByKey(lambda x, y: x + y) \
  .collect()

In [19]:
# add(x, y) does the same as lambda x, y: x + y
from operator import add

In [20]:
# total population by continent again, this time with operator.add
rdd_jsons \
  .map(lambda x: (x["continent"], x["population"])) \
  .reduceByKey(add) \
  .collect()

In [21]:
# population of the most populous country in each continent
rdd_jsons \
  .map(lambda x: (x["continent"], x["population"])) \
  .reduceByKey(max) \
  .collect()
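
The last cells differ only in the function passed to `reduceByKey`: any function that combines two values into one will do. A minimal plain-Python sketch of that idea, with a hypothetical helper `reduce_by_key` and made-up numbers (this is not how Spark implements it, only what it computes):

```python
from operator import add

def reduce_by_key(pairs, func):
    """Combine values that share a key using a two-argument function (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged with func
        result[key] = func(result[key], value) if key in result else value
    return result

# Made-up (continent, population) pairs
pairs = [("Asia", 10), ("Europe", 4), ("Asia", 7), ("Europe", 9)]

totals = reduce_by_key(pairs, add)    # sum of values per key
largest = reduce_by_key(pairs, max)   # largest value per key
print(totals, largest)
```

Spark merges values in no guaranteed order across partitions, so the function should be associative and commutative; both `add` and `max` are.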