# What is SparkContext?
Spark ships with an interactive Python shell in which PySpark is already installed. The PySpark shell automatically creates a SparkContext for you and exposes it as the sc variable. SparkContext is the entry point to Spark: it is the object through which a program connects to a Spark cluster. In the following examples, we retrieve the Spark version and the Python version used by the SparkContext.

In [1]:
# to retrieve SparkContext version
sc.version

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20200303033042-0000
KERNEL_ID = 5b7c44b3-a29d-41a9-9efc-50db7cc29a76


'2.3.3'

In [2]:
# to retrieve Python version of SparkContext
sc.pythonVer

'3.6'

In [5]:

import ibmos2spark
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# IBM Watson Python + Spark environment
# Replace each 'xxxx' with your own credentials.
credentials = {
    'endpoint': 'xxxx',
    'service_id': 'xxxx',
    'iam_service_endpoint': 'xxxx',
    'api_key': 'xxxx'
}

configuration_name = 'os_0a7faf2d576d4d0e985305b42aae4ce7_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_data_1 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('people.csv', 'sparktry-donotdelete-pr-adbnjs9sf1yuav'))
df_data_1.take(5)

[Row(_c0='0', person_id='100', name='Penelope Lewis', sex='female', date of birth='1990-08-31'),
 Row(_c0='1', person_id='101', name='David Anthony', sex='male', date of birth='1971-10-14'),
 Row(_c0='2', person_id='102', name='Ida Shipp', sex='female', date of birth='1962-05-24'),
 Row(_c0='3', person_id='103', name='Joanna Moore', sex='female', date of birth='2017-03-10'),
 Row(_c0='4', person_id='104', name='Lisandra Ortiz', sex='female', date of birth='2020-08-05')]

# Problem 1

### Create a Spark program that reads the airport data from airports.text, finds all the airports located in the United States, and outputs each airport's name and city.

In [6]:

# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'IAM_SERVICE_ID': 'XXXX',
    'IBM_API_KEY_ID': 'XXXX',
    'ENDPOINT': 'XXXX',
    'IBM_AUTH_ENDPOINT': 'XXXX',
    'BUCKET': 'XXXX',
    'FILE': 'airports.text'
}
airport_data = cos.url('airports.text', 'sparktry-donotdelete-pr-adbnjs9sf1yuav')

In [7]:
# load the airport_data dataset into an RDD named clusterRDD
clusterRDD = sc.textFile(airport_data)

In [8]:
# Show the first 5 rows
clusterRDD.take(5)

['1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.081689,145.391881,5282,10,"U","Pacific/Port_Moresby"',
 '2,"Madang","Madang","Papua New Guinea","MAG","AYMD",-5.207083,145.7887,20,10,"U","Pacific/Port_Moresby"',
 '3,"Mount Hagen","Mount Hagen","Papua New Guinea","HGU","AYMH",-5.826789,144.295861,5388,10,"U","Pacific/Port_Moresby"',
 '4,"Nadzab","Nadzab","Papua New Guinea","LAE","AYNZ",-6.569828,146.726242,239,10,"U","Pacific/Port_Moresby"',
 '5,"Port Moresby Jacksons Intl","Port Moresby","Papua New Guinea","POM","AYPY",-9.443383,147.22005,146,10,"U","Pacific/Port_Moresby"']

In [10]:
# Regex that splits a line on commas that are not inside double quotes
import re
class Utils:
    COMMA_DELIMITER = re.compile(r''',(?=(?:[^"]*"[^"]*")*[^"]*$)''')
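To see why the notebook uses a regex rather than a plain split, the sketch below compares the two on a single line. It runs in plain Python without Spark. The sample line is made up for illustration and is not taken from airports.text. The lookahead in the pattern only accepts a comma when an even number of double quotes follows it, which means the comma sits outside any quoted field.

```python
import re

# Same pattern as Utils.COMMA_DELIMITER in the notebook.
COMMA_DELIMITER = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

# Hypothetical line (not from the dataset) with a comma inside a quoted field.
sample = '9,"Smith Field, North","Springfield","United States"'

naive = sample.split(",")                      # also splits inside the quotes
quote_aware = COMMA_DELIMITER.split(sample)    # keeps the quoted field whole

print(len(naive))        # 5 pieces: the airport name is cut in two
print(quote_aware)       # 4 fields, quotes still attached
```

Note that the regex leaves the surrounding double quotes on each field, which is why the notebook compares the country against the quoted string "United States".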

In [11]:
# Split a line and return the airport name and city
def splitComma(line: str):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

In [12]:
# Apply the filter transformation: keep only the lines whose country field is "United States"
airportsInUSA = clusterRDD.filter(lambda line : Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

In [13]:
# Apply map to turn each line into "airport name, city"
airportsNameAndCityNames = airportsInUSA.map(splitComma)

In [15]:
airportsNameAndCityNames.take(10)

['"Putnam County Airport", "Greencastle"',
 '"Dowagiac Municipal Airport", "Dowagiac"',
 '"Cambridge Municipal Airport", "Cambridge"',
 '"Door County Cherryland Airport", "Sturgeon Bay"',
 '"Shoestring Aviation Airfield", "Stewartstown"',
 '"Eastern Oregon Regional Airport", "Pendleton"',
 '"Tyonek Airport", "Tyonek"',
 '"Riverton Regional", "Riverton WY"',
 '"Montrose Regional Airport", "Montrose CO"',
 '"Clow International Airport", "Bolingbrook"']
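The names in the output above still carry their double quotes, because the regex split returns each field exactly as it appears in the file. One possible alternative, sketched here under the assumption that every line is a well-formed CSV record, is to parse each line with Python's standard csv module, which removes the quotes. The helper names parse_airport_line and name_and_city are hypothetical and do not appear in the notebook.

```python
import csv

def parse_airport_line(line):
    """Return the fields of one CSV line with surrounding quotes removed."""
    return next(csv.reader([line]))

def name_and_city(line):
    """Format the airport name (field 1) and city (field 2) of one line."""
    fields = parse_airport_line(line)
    return "{}, {}".format(fields[1], fields[2])

# First row shown by clusterRDD.take(5) in the notebook.
sample = ('1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",'
          '-6.081689,145.391881,5282,10,"U","Pacific/Port_Moresby"')

print(parse_airport_line(sample)[3])
print(name_and_city(sample))
```

With this parser the country comparison would use the unquoted string, for example parse_airport_line(line)[3] == "United States", and name_and_city could be passed to map in place of splitComma.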

# Problem 2
### Create a Spark program that reads the airport data from in/airports.text and finds all the airports whose latitude is greater than 40.

In [26]:
airports_latitude = clusterRDD.filter(lambda line: float(Utils.COMMA_DELIMITER.split(line)[6]) > 40)

In [27]:
# Split a line and return the airport name and latitude
def splitCommaLat(line: str):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[6])

In [28]:
airportsNames = airports_latitude.map(splitCommaLat)

In [30]:
airportsNames.take(20)

['"Narsarsuaq", 61.160517',
 '"Nuuk", 64.190922',
 '"Sondre Stromfjord", 67.016969',
 '"Thule Air Base", 76.531203',
 '"Akureyri", 65.659994',
 '"Egilsstadir", 65.283333',
 '"Hornafjordur", 64.295556',
 '"Husavik", 65.952328',
 '"Isafjordur", 66.058056',
 '"Keflavik International Airport", 63.985',
 '"Patreksfjordur", 65.555833',
 '"Reykjavik", 64.13',
 '"Siglufjordur", 66.133333',
 '"Vestmannaeyjar", 63.424303',
 '"Sault Ste Marie", 46.485001',
 '"Winnipeg St Andrews", 50.056389',
 '"Shearwater", 44.639721',
 '"St Anthony", 51.391944',
 '"Tofino", 49.082222',
 '"Kugaaruk", 68.534444']

In [31]:
# Number of airports with latitude greater than 40
airportsNames.count()

3309
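The latitude filter in Problem 2 calls float() on the seventh field of every line, so a row with a missing or non-numeric latitude would raise an exception inside the Spark task. The run above completed without errors, so the following is only a defensive sketch for other inputs. The function name latitude_above and the first sample line are hypothetical.

```python
import re

# Same pattern as Utils.COMMA_DELIMITER in the notebook.
COMMA_DELIMITER = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def latitude_above(line, threshold=40.0):
    """Return True if the latitude field (index 6) parses and exceeds threshold.

    Lines that are too short or hold a non-numeric latitude return False
    instead of raising.
    """
    fields = COMMA_DELIMITER.split(line)
    try:
        return float(fields[6]) > threshold
    except (IndexError, ValueError):
        return False

# Hypothetical northern airport (made up for illustration).
north = '900,"Example North","Example City","Nowhere","XXA","XXXA",61.5,-45.0,100,0,"U","Etc/UTC"'
# First row shown by clusterRDD.take(5) in the notebook (latitude -6.081689).
south = ('1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",'
         '-6.081689,145.391881,5282,10,"U","Pacific/Port_Moresby"')
# Malformed rows: too few fields, and a non-numeric latitude.
short_row = '2,"Broken"'
bad_number = '3,"A","B","C","D","E",n/a,1.0,0,0,"U","Etc/UTC"'

print([latitude_above(row) for row in (north, south, short_row, bad_number)])
```

In the notebook this predicate could be passed directly to the RDD, as in clusterRDD.filter(latitude_above).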