# End to End PySpark Clustering: Part I Using Colab for PySpark and Collecting Data

A notebook for collecting the data that will drive a planetary clustering project. The data comes from: https://api.le-systeme-solaire.net/en/ 

The notebook focuses on using the requests package to fetch data from the API with query parameters, and on storing the result in a DataFrame.

In [1]:
import requests

check = requests.get('https://api.le-systeme-solaire.net/rest')
check  # check API is available

<Response [200]>

The full request is longer than the check above, because the project needs only a small part of the dataset available through the API. According to the API documentation, a request can carry an exclude parameter that lists the fields to leave out of the response, so we pass it through the params argument of the request.
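
To see what an exclude parameter looks like once it is attached to the URL, here is a minimal sketch using only the standard library. The three-item field list is a shortened, illustrative stand-in for the longer list used in this notebook, and requests is expected to produce a similar percent-encoded query string when given a params dictionary.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Shortened field list for illustration; the notebook excludes many more fields.
excluded_fields = ["mass", "vol", "moons"]

base_url = "https://api.le-systeme-solaire.net/rest/bodies"
params = {"exclude": ",".join(excluded_fields)}

# urlencode percent-encodes the commas in the value (',' becomes '%2C').
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Round trip: decode the query string back into the original field list.
decoded = parse_qs(urlsplit(full_url).query)["exclude"][0].split(",")
print(decoded)
```

Building the value with ",".join keeps the field list easy to edit, which helps when the set of excluded fields changes during the project.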

In [2]:
url = 'https://api.le-systeme-solaire.net/rest/bodies'
params = {'exclude' :'mass,vol,moons,discoveredBy,discoveryDate,alternativeName,axialTilt,avgTemp,mainAnomaly,argPeriapsis,longAscNode,rel,aroundPlanet,sideralOrbit,sideralRotation,dimension,flattening,polarRadius'}

all_data = requests.get(url, params=params).json()
all_data.get('bodies')

[{'aphelion': 405500,
  'bodyType': 'Moon',
  'density': 3.344,
  'eccentricity': 0.0549,
  'englishName': 'Moon',
  'equaRadius': 1738.1,
  'escape': 2380.0,
  'gravity': 1.62,
  'id': 'lune',
  'inclination': 5.145,
  'isPlanet': False,
  'meanRadius': 33.0,
  'name': 'La Lune',
  'perihelion': 363300,
  'semimajorAxis': 384400},
 {'aphelion': 9518,
  'bodyType': 'Moon',
  'density': 1.9,
  'eccentricity': 0.0151,
  'englishName': 'Phobos',
  'equaRadius': 13.0,
  'escape': 11.39,
  'gravity': 0.0057,
  'id': 'phobos',
  'inclination': 1.075,
  'isPlanet': False,
  'meanRadius': 33.0,
  'name': 'Phobos',
  'perihelion': 9234,
  'semimajorAxis': 9376},
 {'aphelion': 23471,
  'bodyType': 'Moon',
  'density': 1.75,
  'eccentricity': 0.0002,
  'englishName': 'Deimos',
  'equaRadius': 7.8,
  'escape': 5.556,
  'gravity': 0.003,
  'id': 'deimos',
  'inclination': 1.075,
  'isPlanet': False,
  'meanRadius': 33.0,
  'name': 'Deïmos',
  'perihelion': 23456,
  'semimajorAxis': 23458},
 {'aphel
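
Each body in the response is a plain dictionary, so fields can also be trimmed on the client side after the request. The sketch below uses two records copied from the output above (reduced to a few keys) and a hypothetical helper named select_fields; it is an illustration, not part of the API.

```python
# Sample records copied from the API output above, trimmed to a few keys.
bodies = [
    {"id": "lune", "englishName": "Moon", "isPlanet": False,
     "density": 3.344, "gravity": 1.62, "escape": 2380.0},
    {"id": "phobos", "englishName": "Phobos", "isPlanet": False,
     "density": 1.9, "gravity": 0.0057, "escape": 11.39},
]

# Hypothetical helper: keep only the requested keys, using None for missing ones.
def select_fields(records, fields):
    return [{name: record.get(name) for name in fields} for record in records]

wanted = ["id", "englishName", "gravity"]
trimmed = select_fields(bodies, wanted)
print(trimmed[0])
```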

## Setting up a Spark environment in Google Colab

Parts of this code will need updating over time: as Spark and Hadoop versions change, the download links must change with them (e.g. 3.2.0 may become 3.2.1).
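
One way to make those updates less error-prone is to keep the version numbers in a single place and derive the archive name, download URL and install path from them. This is a minimal sketch: the helper names are hypothetical, and the URL pattern simply mirrors the download link used in this notebook.

```python
# Version values matching this notebook; update these when releases change.
SPARK_VERSION = "3.2.0"
HADOOP_VERSION = "3.2"

# Hypothetical helpers that derive every version-dependent string.
def spark_archive_name(spark_version, hadoop_version):
    return f"spark-{spark_version}-bin-hadoop{hadoop_version}"

def spark_download_url(spark_version, hadoop_version):
    name = spark_archive_name(spark_version, hadoop_version)
    return f"https://downloads.apache.org/spark/spark-{spark_version}/{name}.tgz"

archive = spark_archive_name(SPARK_VERSION, HADOOP_VERSION)
print(spark_download_url(SPARK_VERSION, HADOOP_VERSION))
print(f"/content/{archive}")  # the SPARK_HOME path used on Colab
```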

In [3]:
# download JDK and Spark/Hadoop
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz

--2022-01-09 10:45:36--  https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.95.219, 2a01:4f8:10a:201a::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300965906 (287M) [application/x-gzip]
Saving to: ‘spark-3.2.0-bin-hadoop3.2.tgz’


2022-01-09 10:45:48 (24.1 MB/s) - ‘spark-3.2.0-bin-hadoop3.2.tgz’ saved [300965906/300965906]



In [4]:
# unpack Spark and Hadoop
!tar xf spark-3.2.0-bin-hadoop3.2.tgz

Set the JAVA_HOME and SPARK_HOME environment variables so that Spark can run.

In [5]:
# set home paths 
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"
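
A wrong path here usually only shows up later as a confusing error, so it can help to confirm that both directories exist before going further. The sketch below assumes nothing beyond the standard library; missing_homes is a hypothetical helper name.

```python
import os

# Hypothetical helper: report which of the named environment variables are
# unset or point at a directory that does not exist.
def missing_homes(names=("JAVA_HOME", "SPARK_HOME"), environ=os.environ):
    problems = {}
    for name in names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            problems[name] = path
    return problems

print(missing_homes())  # an empty dict means both directories were found
```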

Use the findspark package to locate the Spark installation and make pyspark importable.

In [6]:
# install, import and initialise findspark
!pip install -q findspark
import findspark
findspark.init()

In [7]:
# build spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

In [9]:
# extract the list of bodies from all_data and write it to a JSON file
import json
data = all_data.get('bodies')

with open('data.json', 'w') as f:
    json.dump(data, f)
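
Spark's JSON reader documents JSON Lines (one self-contained object per line) as its default input layout. The following is a minimal sketch of writing records in that layout with the standard library; write_json_lines is a hypothetical helper, the two records are stand-ins for the API data, and a temporary directory is used so the example does not overwrite data.json.

```python
import json
import os
import tempfile

# Sample records standing in for all_data.get('bodies').
records = [
    {"id": "lune", "englishName": "Moon", "gravity": 1.62},
    {"id": "deimos", "englishName": "Deimos", "gravity": 0.003},
]

# Hypothetical helper: write one JSON object per line (JSON Lines).
def write_json_lines(rows, path):
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

path = os.path.join(tempfile.mkdtemp(), "data_lines.json")
write_json_lines(records, path)

# Read the file back to confirm each line parses on its own.
with open(path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]
print(reloaded == records)
```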

In [15]:
# set up schema 
from pyspark.sql.types import StructType, StructField, BooleanType, FloatType, StringType

schema = StructType([
  StructField("id", StringType(), False),
  StructField("englishName", StringType(), True),
  StructField("isPlanet", BooleanType(), True),
  StructField("density", FloatType(), True),
  StructField("gravity", FloatType()),
  StructField("escape", FloatType())
])
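
When a value does not fit the type declared in a schema, Spark's default permissive read mode can surface it as a null rather than raising an error, so a quick check on the raw records can save debugging time. The sketch below is plain Python and covers only a subset of the fields; the expected_types mapping and the type_mismatches helper are illustrative assumptions, not something Spark provides.

```python
# Expected Python types for some of the fields named in the schema.
# This mapping is an illustrative assumption, not part of Spark.
expected_types = {
    "id": str,
    "englishName": str,
    "isPlanet": bool,
    "gravity": (int, float),
    "escape": (int, float),
}

def type_mismatches(record, expected=expected_types):
    """Return the field names whose values are present but of the wrong type."""
    bad = []
    for name, types in expected.items():
        value = record.get(name)
        if value is None:
            continue  # missing values are skipped; this sketch only checks types
        # bool is a subclass of int, so reject it explicitly for numeric fields
        if isinstance(value, bool) and types is not bool:
            bad.append(name)
        elif not isinstance(value, types):
            bad.append(name)
    return bad

good = {"id": "lune", "englishName": "Moon", "isPlanet": False,
        "gravity": 1.62, "escape": 2380.0}
odd = {"id": "x", "englishName": "X", "isPlanet": "no", "gravity": "1.62"}
print(type_mismatches(good))
print(type_mismatches(odd))
```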


In [21]:
# read in data using schema and show first five rows
df = spark.read.json('data.json', schema)
df.show(5)

+----------+-----------+--------+-------+-------+------+
|        id|englishName|isPlanet|density|gravity|escape|
+----------+-----------+--------+-------+-------+------+
|      lune|       Moon|   false|  3.344|   1.62|2380.0|
|    phobos|     Phobos|   false|    1.9| 0.0057| 11.39|
|    deimos|     Deimos|   false|   1.75|  0.003| 5.556|
|        io|         Io|   false|   3.53|   1.79|   0.0|
|    europe|     Europa|   false|   3.01|   1.31|   0.0|
|  ganymede|   Ganymede|   false|   1.94|  1.428|   0.0|
|  callisto|   Callisto|   false|   1.83|  1.235|   0.0|
|  amalthee|   Amalthea|   false|    3.1|   0.02|   0.0|
|   himalia|    Himalia|   false|    1.0|  0.062|   0.0|
|     elara|      Elara|   false|    1.0|  0.031|   0.0|
|  pasiphae|   Pasiphae|   false|    1.0|  0.022|   0.0|
|    sinope|     Sinope|   false|    1.0|  0.014|   0.0|
|  lysithea|   Lysithea|   false|    1.0|  0.013|   0.0|
|     carme|      Carme|   false|    1.0|  0.017|   0.0|
|    ananke|     Ananke|   fals