# Class 8 Notebook 2: Working with per-line JSON files

Class 8 (7 Dec 2016) of [BS1804-1617 Fundamentals of Database Technologies](https://imperialbusiness.school/category/bs1804-1617/) by [Piotr Migdal](http://p.migdal.pl/)

* [How do I copy a remote dataset from the internet to DBFS in my Spark cluster?](https://forums.databricks.com/questions/771/how-do-i-distribute-a-local-file-on-my-drivermaste.html)

We will work on the same city data, exported from PostgreSQL to per-line JSON files (one JSON object per line) with:
```sql
COPY (SELECT row_to_json(t) FROM city AS t) TO '/Users/pmigdal/city.json';
```
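To make the per-line format concrete, here is a minimal sketch of parsing such a file in plain Python, without Spark. The first record is trimmed from the `country.json` output shown further down; the second record (`XXX`, `Example`) is hypothetical and is there only to show how a JSON `null` is handled.

```python
import json

# Two shortened records in per-line form. The first is trimmed from the
# country.json output below; the second is a hypothetical record with a null.
sample = (
    '{"code":"AFG","name":"Afghanistan","indepyear":1919}\n'
    '{"code":"XXX","name":"Example","indepyear":null}\n'
)

# Each non-empty line is parsed independently of the others.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(records[0]["name"])       # Afghanistan
print(records[1]["indepyear"])  # None
```

Because every line is a complete JSON document, the file can be split by line and parsed in parallel, which is what makes it convenient for `sc.textFile` followed by `map(json.loads)`.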

In [1]:
!wget -P /tmp https://s3.amazonaws.com/pmigdal/country.json

--2016-12-11 20:37:27--  https://s3.amazonaws.com/pmigdal/country.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.81.11
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.81.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75715 (74K) [application/octet-stream]
Saving to: ‘/tmp/country.json’


2016-12-11 20:37:33 (187 KB/s) - ‘/tmp/country.json’ saved [75715/75715]



In [2]:
!wget -P /tmp https://s3.amazonaws.com/pmigdal/city.json

--2016-12-11 20:37:38--  https://s3.amazonaws.com/pmigdal/city.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.64.179
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.64.179|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 381063 (372K) [application/octet-stream]
Saving to: ‘/tmp/city.json’


2016-12-11 20:37:45 (170 KB/s) - ‘/tmp/city.json’ saved [381063/381063]



In [3]:
!wget -P /tmp https://s3.amazonaws.com/pmigdal/countrylanguage.json

--2016-12-11 20:37:48--  https://s3.amazonaws.com/pmigdal/countrylanguage.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.40.2
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.40.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 77725 (76K) [application/octet-stream]
Saving to: ‘/tmp/countrylanguage.json’


2016-12-11 20:37:54 (213 KB/s) - ‘/tmp/countrylanguage.json’ saved [77725/77725]



In [4]:
import pyspark
sc = pyspark.SparkContext('local[*]')
# list files in /tmp (dbutils exists only in Databricks notebooks; elsewhere this raises NameError)
dbutils.fs.ls("file:/tmp/")

NameError: name 'dbutils' is not defined
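Outside Databricks, `dbutils` is not defined, which explains the `NameError` above. A minimal local substitute, assuming the files sit on the driver's own filesystem, is the standard library's `os.listdir`. The helper name `list_local_files` is hypothetical.

```python
import os

def list_local_files(directory="/tmp", suffix=".json"):
    """Return the sorted names of files in `directory` that end with `suffix`."""
    return sorted(
        name for name in os.listdir(directory) if name.endswith(suffix)
    )

# Usage, e.g. after the wget cells above (result depends on the directory):
# list_local_files("/tmp")
```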

In [5]:
# we load it as a text file
rdd = sc.textFile("file:/tmp/country.json")

In [6]:
# it is just a string with JSON content
rdd.first()

'{"code":"AFG","name":"Afghanistan","continent":"Asia","region":"Southern and Central Asia","surfacearea":652090,"indepyear":1919,"population":22720000,"lifeexpectancy":45.9,"gnp":5976.00,"gnpold":null,"localname":"Afganistan/Afqanestan","governmentform":"Islamic Emirate","headofstate":"Mohammad Omar","capital":1,"code2":"AF"}'

In [7]:
# a Python library for parsing json
import json

In [8]:
# number of lines
rdd.count()

239

In [9]:
# parsing JSON takes time
# we want to load it only once
rdd_jsons = rdd.map(json.loads).cache()

In [10]:
# now, it is a Python dictionary
rdd_jsons.first()

{'capital': 1,
 'code': 'AFG',
 'code2': 'AF',
 'continent': 'Asia',
 'gnp': 5976.0,
 'gnpold': None,
 'governmentform': 'Islamic Emirate',
 'headofstate': 'Mohammad Omar',
 'indepyear': 1919,
 'lifeexpectancy': 45.9,
 'localname': 'Afganistan/Afqanestan',
 'name': 'Afghanistan',
 'population': 22720000,
 'region': 'Southern and Central Asia',
 'surfacearea': 652090}

In [11]:
# taking a column; look at the missing values
rdd_jsons.map(lambda x: x["indepyear"]).take(10)

[1919, 1581, None, 1912, 1962, None, 1278, 1975, None, 1981]

In [12]:
# most common independence years (None counts the countries with no recorded year)
rdd_jsons.map(lambda x: (x["indepyear"], 1)) \
  .reduceByKey(lambda x, y: x + y) \
  .top(5, key=lambda kv: kv[1])

[(None, 47), (1960, 18), (1991, 18), (1962, 7), (1975, 7)]
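The map / `reduceByKey` / `top` chain above is a counting pattern. As a rough plain-Python analogue (a sketch of the logic, not of how Spark executes it), `collections.Counter` does the same on a small list. The input is the ten values returned by `take(10)` earlier.

```python
from collections import Counter

# The first ten independence years shown by take(10) above.
years = [1919, 1581, None, 1912, 1962, None, 1278, 1975, None, 1981]

# Counter plays the role of mapping to (key, 1) and reducing by key with +.
counts = Counter(years)

# most_common(n) plays the role of top(n, key=lambda kv: kv[1]).
print(counts.most_common(1))  # [(None, 3)]

# Dropping missing values first leaves only the known years.
known = Counter(y for y in years if y is not None)
print(sum(known.values()))  # 7
```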

In [13]:
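# the two Asian countries with the earliest independence years
# (missing years are dropped before sorting)
rdd_jsons \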
  .filter(lambda x: x["continent"] == "Asia") \
  .filter(lambda x: x["indepyear"] is not None) \
  .sortBy(lambda x: x["indepyear"]) \
  .take(2)

[{'capital': 1891,
  'code': 'CHN',
  'code2': 'CN',
  'continent': 'Asia',
  'gnp': 982268.0,
  'gnpold': 917719.0,
  'governmentform': "People'sRepublic",
  'headofstate': 'Jiang Zemin',
  'indepyear': -1523,
  'lifeexpectancy': 71.4,
  'localname': 'Zhongquo',
  'name': 'China',
  'population': 1277558000,
  'region': 'Eastern Asia',
  'surfacearea': 9572900.0},
 {'capital': 1532,
  'code': 'JPN',
  'code2': 'JP',
  'continent': 'Asia',
  'gnp': 3787042.0,
  'gnpold': 4192638.0,
  'governmentform': 'Constitutional Monarchy',
  'headofstate': 'Akihito',
  'indepyear': -660,
  'lifeexpectancy': 80.7,
  'localname': 'Nihon/Nippon',
  'name': 'Japan',
  'population': 126714000,
  'region': 'Eastern Asia',
  'surfacearea': 377829}]
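The `is not None` filter in the cell above matters: in Python 3, `None` cannot be ordered against an integer, so sorting unfiltered records would most likely fail in Spark the same way it does in plain Python. A small sketch of this, using years from the outputs above plus one hypothetical record (`Example`) with a missing year:

```python
# Records trimmed to two fields; the years come from the outputs above,
# except "Example", a hypothetical record with a missing year.
countries = [
    {"name": "Afghanistan", "indepyear": 1919},
    {"name": "Japan", "indepyear": -660},
    {"name": "Example", "indepyear": None},
    {"name": "China", "indepyear": -1523},
]

# Sorting without the filter fails: Python 3 cannot order None against int.
try:
    sorted(countries, key=lambda x: x["indepyear"])
    sort_failed = False
except TypeError:
    sort_failed = True

# Filter first, then sort and keep the first two (like .take(2)).
with_year = [c for c in countries if c["indepyear"] is not None]
earliest = sorted(with_year, key=lambda x: x["indepyear"])[:2]
print([c["name"] for c in earliest])  # ['China', 'Japan']
```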

## Exercises

* Get the 5 most populous countries.
* Get the number of countries per continent.
* Get the total population by continent.

In [14]:
# exercise 1: the 5 most populous countries
rdd_jsons \
  .sortBy(lambda x: x["population"], ascending=False) \
  .take(5)

[{'capital': 1891,
  'code': 'CHN',
  'code2': 'CN',
  'continent': 'Asia',
  'gnp': 982268.0,
  'gnpold': 917719.0,
  'governmentform': "People'sRepublic",
  'headofstate': 'Jiang Zemin',
  'indepyear': -1523,
  'lifeexpectancy': 71.4,
  'localname': 'Zhongquo',
  'name': 'China',
  'population': 1277558000,
  'region': 'Eastern Asia',
  'surfacearea': 9572900.0},
 {'capital': 1109,
  'code': 'IND',
  'code2': 'IN',
  'continent': 'Asia',
  'gnp': 447114.0,
  'gnpold': 430572.0,
  'governmentform': 'Federal Republic',
  'headofstate': 'Kocheril Raman Narayanan',
  'indepyear': 1947,
  'lifeexpectancy': 62.5,
  'localname': 'Bharat/India',
  'name': 'India',
  'population': 1013662000,
  'region': 'Southern and Central Asia',
  'surfacearea': 3287260.0},
 {'capital': 3813,
  'code': 'USA',
  'code2': 'US',
  'continent': 'North America',
  'gnp': 8510700.0,
  'gnpold': 8110900.0,
  'governmentform': 'Federal Republic',
  'headofstate': 'George W. Bush',
  'indepyear': 1776,
  'lifeexpe

In [15]:
# exercise 2: number of countries per continent
rdd_jsons \
  .map(lambda x: (x["continent"], 1)) \
  .reduceByKey(lambda x, y: x + y) \
  .collect()

[('Asia', 51),
 ('Antarctica', 5),
 ('South America', 14),
 ('Europe', 46),
 ('Africa', 58),
 ('North America', 37),
 ('Oceania', 28)]

In [16]:
# exercise 3: total population by continent
rdd_jsons \
  .map(lambda x: (x["continent"], x["population"])) \
  .reduceByKey(lambda x, y: x + y) \
  .collect()

[('Asia', 3705025700),
 ('Antarctica', 0),
 ('South America', 345780000),
 ('Europe', 730074600),
 ('Africa', 784475000),
 ('North America', 482993000),
 ('Oceania', 30401150)]

In [17]:
# operator.add is the function form of +, so it can replace lambda x, y: x + y
from operator import add

In [18]:
# total population by continent again, with add instead of the lambda
rdd_jsons \
  .map(lambda x: (x["continent"], x["population"])) \
  .reduceByKey(add) \
  .collect()

[('Asia', 3705025700),
 ('Antarctica', 0),
 ('South America', 345780000),
 ('Europe', 730074600),
 ('Africa', 784475000),
 ('North America', 482993000),
 ('Oceania', 30401150)]

In [19]:
# other two-argument functions work too: max gives the largest country population per continent
rdd_jsons \
  .map(lambda x: (x["continent"], x["population"])) \
  .reduceByKey(max) \
  .collect()

[('Asia', 1277558000),
 ('Antarctica', 0),
 ('South America', 170115000),
 ('Europe', 146934000),
 ('Africa', 111506000),
 ('North America', 278357000),
 ('Oceania', 18886000)]
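The last few cells differ only in the function passed to `reduceByKey`. To show what that function does, here is a simplified single-machine sketch. The helper `reduce_by_key` is hypothetical and ignores partitioning; Spark itself combines values in no guaranteed order, which is why the function should be associative and commutative. The pairs are a handful of (continent, population) values taken from the outputs above.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Fold the values of each key together with a two-argument function."""
    result = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; later ones are combined.
        result[key] = func(result[key], value) if key in result else value
    return result

# A few (continent, population) pairs from the outputs above.
pairs = [
    ("Asia", 1277558000),   # China
    ("Asia", 1013662000),   # India
    ("Antarctica", 0),
    ("Asia", 126714000),    # Japan
    ("Asia", 22720000),     # Afghanistan
    ("Antarctica", 0),
]

totals = reduce_by_key(pairs, add)    # sum per key
largest = reduce_by_key(pairs, max)   # maximum per key
print(largest)  # {'Asia': 1277558000, 'Antarctica': 0}
```

Both `add` and `max` take two values and return one of the same kind, so they can be swapped freely without changing the rest of the pipeline.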