<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


## Connecting to a Spark Cluster Using Skills Network Labs


Estimated time needed: **10** minutes


<p style='color: red'>The purpose of this lab is to show you how to connect to a Spark cluster on Skills Network Labs.</p>


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li><a href="#Datasets">Datasets</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-Required-Libraries">Installing Required Libraries</a></li>
            <li><a href="#Importing-Required-Libraries">Importing Required Libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#Examples">Examples</a>
        <ol>
            <li><a href="#Task-1---Create-a-spark-session">Task 1 - Create a spark session</a></li>
            <li><a href="#Task-2---Download-the-data-file">Task 2 - Download the data file</a></li>
            <li><a href="#Task-3---Load-the-data-in-a-csv-file-into-a-dataframe">Task 3 - Load the data in a csv file into a dataframe</a></li>
            <li><a href="#Task-4---Explore-the-data-set">Task 4 - Explore the data set</a></li>
            <li><a href="#Task-5---Stop-the-spark-session">Task 5 - Stop the spark session</a></li>
        </ol>
    </li>
    <li>
        <a href="#Exercises">Exercises</a>
        <ol>
            <li><a href="#Exercise-1---Create-a-Spark-Session">Exercise 1 - Create a Spark Session</a></li>
            <li><a href="#Exercise-2---Load-the-dataset-into-a-dataframe">Exercise 2 - Load the dataset into a dataframe</a></li>
            <li><a href="#Exercise-3---Explore-the-data">Exercise 3 - Explore the data</a></li>
            <li><a href="#Exercise-4---Print-the-top-5-rows-of-the-dataframe">Exercise 4 - Print the top 5 rows of the dataframe</a></li>
            <li><a href="#Exercise-5---Stop-the-spark-session">Exercise 5 - Stop the spark session</a></li>
        </ol>
    </li>
    <li><a href="#How-to-use-this-lab-notebook-offline">How to use this lab notebook offline</a></li>
</ol>















## Objectives

After completing this lab, you will be able to:

 - Use PySpark to connect to a Spark cluster.
 - Create a Spark session.
 - Read a CSV file into a DataFrame using the Spark session.
 - Stop the Spark session.
 - Adapt this notebook to run outside Skills Network Labs against your own Spark cluster.


## Datasets

In this lab, you will use the following datasets:

 - A modified version of the car mileage dataset. The original dataset is available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
 - A modified version of the diamonds dataset. The original dataset is available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
 


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html) for connecting to the Spark Cluster


### Installing Required Libraries

A Spark cluster is pre-installed in the Skills Network Labs environment. However, you need libraries such as `pyspark` and `findspark` to connect to this cluster.

If you download this notebook and run it on your laptop, you will NOT be able to connect to the Spark cluster running on SN Labs.


The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [1]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q
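
After the install cell finishes, it can help to confirm which versions were actually installed. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of PySpark or findspark.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two libraries this lab depends on.
for name in ("pyspark", "findspark"):
    print(name, installed_version(name) or "not installed")
```

If `pyspark` is reported as not installed, rerun the install cell before continuing.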

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [2]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

# import SparkSession
from pyspark.sql import SparkSession

# Examples


## Task 1 - Create a spark session


In [3]:
# Create a SparkSession.
# Here "Getting Started with Spark" is the application name.
# Ignore any warnings printed by the SparkSession command.

spark = SparkSession.builder.appName("Getting Started with Spark").getOrCreate()

24/01/26 01:42:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
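
The log output above mentions `setLogLevel`. The sketch below shows one way to change the log level and inspect the session you just created. It assumes `spark` is the session from Task 1; the helper name `session_summary` is hypothetical.

```python
def session_summary(spark, log_level="WARN"):
    """Set the log level and return basic facts about a Spark session.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    sc = spark.sparkContext
    sc.setLogLevel(log_level)  # for example "ERROR" to hide warnings
    return {
        "app_name": sc.appName,          # name passed to appName(...)
        "master": sc.master,             # master URL the session is attached to
        "spark_version": spark.version,  # version of the running Spark
    }
```

Calling `session_summary(spark, "ERROR")` in a new cell returns the application name, master URL, and Spark version as a dictionary.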


## Task 2 - Download the data file


In [4]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv


--2024-01-26 01:43:10--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13891 (14K) [text/csv]
Saving to: ‘mpg.csv’


2024-01-26 01:43:10 (48.0 MB/s) - ‘mpg.csv’ saved [13891/13891]
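
Rerunning the `wget` cell saves a second copy under a new name (for example `mpg.csv.1`). If you prefer to stay in Python and skip the download when the file already exists, a minimal sketch using only the standard library could look like this; the helper name `download_if_missing` is hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same URL as in the wget cell above.
MPG_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBM-BD0231EN-SkillsNetwork/datasets/mpg.csv"
)


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest
```

Calling `download_if_missing(MPG_URL, "mpg.csv")` leaves an existing `mpg.csv` untouched.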



## Task 3 - Load the data in a csv file into a dataframe


In [5]:
# Use the spark.read.csv function to load the data into a DataFrame.
# header=True indicates that the CSV file has a header row.
# inferSchema=True tells Spark to infer the data type of each column automatically.

# Load mpg dataset
mpg_data = spark.read.csv("mpg.csv", header=True, inferSchema=True)

                                                                                

## Task 4 - Explore the data set


Let's print the schema of the dataset:


In [6]:
mpg_data.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)
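
Schema inference requires an extra pass over the data. When the column types are already known, you can declare them instead. The sketch below builds a DDL schema string from the column names and types printed above and passes it to `spark.read.csv`. The helper names `to_ddl` and `load_mpg` are hypothetical, and the sketch assumes `spark` is an active session and `mpg.csv` is in the working directory.

```python
# Column names and types copied from the printSchema() output above.
MPG_COLUMNS = [
    ("MPG", "DOUBLE"),
    ("Cylinders", "INT"),
    ("Engine Disp", "DOUBLE"),
    ("Horsepower", "INT"),
    ("Weight", "INT"),
    ("Accelerate", "DOUBLE"),
    ("Year", "INT"),
    ("Origin", "STRING"),
]


def to_ddl(columns):
    """Build a DDL schema string from (name, type) pairs.

    Each name is wrapped in backticks so that names containing
    spaces, such as "Engine Disp", remain valid.
    """
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


def load_mpg(spark, path="mpg.csv"):
    """Read the CSV with an explicit schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=to_ddl(MPG_COLUMNS))
```

Calling `load_mpg(spark).printSchema()` should list the same column names and types as the inferred schema.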



Let's look at some sample rows from the dataset we loaded:


In [7]:
# Show the first 5 rows of the dataset
mpg_data.head(5)

[Row(MPG=15.0, Cylinders=8, Engine Disp=390.0, Horsepower=190, Weight=3850, Accelerate=8.5, Year=70, Origin='American'),
 Row(MPG=21.0, Cylinders=6, Engine Disp=199.0, Horsepower=90, Weight=2648, Accelerate=15.0, Year=70, Origin='American'),
 Row(MPG=18.0, Cylinders=6, Engine Disp=199.0, Horsepower=97, Weight=2774, Accelerate=15.5, Year=70, Origin='American'),
 Row(MPG=16.0, Cylinders=8, Engine Disp=304.0, Horsepower=150, Weight=3433, Accelerate=12.0, Year=70, Origin='American'),
 Row(MPG=14.0, Cylinders=8, Engine Disp=455.0, Horsepower=225, Weight=3086, Accelerate=10.0, Year=70, Origin='American')]
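
`head(5)` returns a Python list of `Row` objects, which is why the output above is hard to scan. As an alternative, `show()` prints the rows as a text table. The following sketch wraps it together with a row count; the helper name `preview` is hypothetical, and `df` is assumed to be a Spark DataFrame such as `mpg_data`.

```python
def preview(df, n=5):
    """Print the first n rows as a table; return (row_count, column_names)."""
    df.show(n, truncate=False)  # truncate=False keeps long values intact
    return df.count(), df.columns
```

For example, `preview(mpg_data)` prints five rows and returns the total number of rows along with the list of column names.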

## Task 5 - Stop the spark session


Once you have finished working with the data, stop the Spark session to release the resources it holds.


In [8]:
spark.stop()
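
If a cell raises an error before `spark.stop()` runs, the session stays open. One way to guarantee the stop call is a small context manager, sketched below. The helper name `stopping` is hypothetical and not part of PySpark; it works with any object that has a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and always call session.stop() on exit."""
    try:
        yield session
    finally:
        session.stop()  # runs even if the body raised an exception
```

Usage, assuming `SparkSession` has been imported as in the setup section: `with stopping(SparkSession.builder.appName("Getting Started with Spark").getOrCreate()) as spark:` followed by the indented code that uses `spark`.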

# Exercises


### Exercise 1 - Create a Spark Session


Create a Spark session with the application name "Diamond Data Analysis".


In [10]:
spark = SparkSession \
    .builder \
    .appName("Diamond Data Analysis") \
    .getOrCreate()

### Exercise 2 - Load the dataset into a dataframe


Download the data set from "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"


In [11]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv

--2024-01-26 01:54:35--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3192561 (3.0M) [text/csv]
Saving to: ‘diamonds.csv’


2024-01-26 01:54:36 (52.8 MB/s) - ‘diamonds.csv’ saved [3192561/3192561]



Load the diamonds dataset into a DataFrame named `diamond_data`.


In [12]:
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)

                                                                                

### Exercise 3 - Explore the data


Print the schema of the dataframe


In [13]:
diamond_data.printSchema()

root
 |-- s: integer (nullable = true)
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)



### Exercise 4 - Print the top 5 rows of the dataframe


In [14]:
diamond_data.head(5)

[Row(s=1, carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55.0, price=326, x=3.95, y=3.98, z=2.43),
 Row(s=2, carat=0.21, cut='Premium', color='E', clarity='SI1', depth=59.8, table=61.0, price=326, x=3.89, y=3.84, z=2.31),
 Row(s=3, carat=0.23, cut='Good', color='E', clarity='VS1', depth=56.9, table=65.0, price=327, x=4.05, y=4.07, z=2.31),
 Row(s=4, carat=0.29, cut='Premium', color='I', clarity='VS2', depth=62.4, table=58.0, price=334, x=4.2, y=4.23, z=2.63),
 Row(s=5, carat=0.31, cut='Good', color='J', clarity='SI2', depth=63.3, table=58.0, price=335, x=4.34, y=4.35, z=2.75)]

### Exercise 5 - Stop the spark session


In [15]:
spark.stop()

Congratulations! You have completed this lab.


### How to use this lab notebook offline


 - All the lab Jupyter notebooks are designed to work within the SN Labs environment.
 - If you download this notebook and run it on your local machine, it will NOT run.
 - This is primarily because SN Labs runs a Spark cluster instance, and all the Spark-based labs connect to it.
 - A notebook running on your own machine cannot connect to the Spark cluster instance that runs on SN Labs.
 - However, you can connect to your own Spark cluster instance by making the change described below.
 
 Replace the code in Task 1 - "Create a spark session" with the line given below, substituting the host name and port of your own Spark master.
 
 spark = SparkSession.builder.appName("YourAppName").master("spark://spark-master:port").getOrCreate()
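
The `spark://spark-master:port` placeholder stands for the address of a standalone Spark master. The sketch below shows one way to assemble that URL and connect to it. The helper names `master_url` and `connect` are hypothetical, and so is any host name you pass in. Port 7077 is the default for a standalone Spark master, so it is used here as the default value; check your own cluster's configuration.

```python
def master_url(host, port=7077):
    """Return a standalone master URL such as spark://my-host:7077."""
    if not host:
        raise ValueError("host must not be empty")
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"spark://{host}:{port}"


def connect(app_name, host, port=7077):
    """Create (or reuse) a session attached to your own Spark master."""
    # Imported here so that master_url() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(master_url(host, port))
        .getOrCreate()
    )
```

For example, `connect("YourAppName", "spark-master")` is equivalent to the line above with the port set to 7077.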


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/)



## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-07-11|1.1|Ramesh Sannareddy|Add the section 'How to use this lab notebook offline'|
|2023-04-25|0.1|Ramesh Sannareddy|Initial Version Created|


Copyright © 2023 IBM Corporation. All rights reserved.
