
In [18]:
!ls

sample_data		   spark-3.1.1-bin-hadoop3.2.tgz
spark-3.1.1-bin-hadoop3.2  spark-3.1.1-bin-hadoop3.2.tgz.1


In [19]:
!pwd

/content


In [20]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of config

In [21]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [22]:
!ls

sample_data		   spark-3.1.1-bin-hadoop3.2.tgz    spark-3.1.1-bin-hadoop3.2.tgz.2
spark-3.1.1-bin-hadoop3.2  spark-3.1.1-bin-hadoop3.2.tgz.1


**SparkContext**
* SparkContext is the original (legacy) entry point for low-level operations such as RDDs, accumulators, and broadcast variables.
* Only one SparkContext can be active in an application at a time; stop the existing one before creating another, or call SparkContext.getOrCreate() to reuse it.
* It gives access to the underlying Spark environment and lets you perform operations on it.
* It is needed when you want more control over how Spark executes your job.
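
The "only one active context" rule, and why getOrCreate() is the usual way to obtain a context, can be modelled without Spark. The sketch below is a toy in plain Python: ToyContext, get_or_create() and stop() are hypothetical names invented for illustration and are not part of PySpark.

```python
class ToyContext:
    """Toy stand-in for a one-per-application context (hypothetical, not PySpark)."""

    _active = None  # the single live instance, if any

    def __init__(self, app_name="demo"):
        # Refuse a second live context, mirroring the "only one active" rule.
        if ToyContext._active is not None:
            raise ValueError("a context is already running in this application")
        self.app_name = app_name
        ToyContext._active = self

    @classmethod
    def get_or_create(cls, app_name="demo"):
        # Reuse the live context when there is one; otherwise build it.
        if cls._active is None:
            cls(app_name)
        return cls._active

    def stop(self):
        # Stopping frees the slot so a new context can be created later.
        ToyContext._active = None


first = ToyContext.get_or_create()
second = ToyContext.get_or_create()
print(first is second)  # True: both names point at the same object

try:
    ToyContext()  # direct construction while a context is live
except ValueError as err:
    print("rejected:", err)
```

Calling the get-or-create helper repeatedly is safe, which is why notebook cells that may be re-run tend to prefer it over constructing the context directly.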





In [23]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
#spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
sc

**SparkSession**
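* SparkSession is used for high-level operations such as the DataFrame and SQL APIs.
* Several SparkSession objects can exist in one application (for example via spark.newSession()); they all share the same underlying SparkContext.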
* SparkSession is the entry point to Spark SQL.
* It offers a unified interface for interacting with Spark APIs.
* It supports built-in integration with various Spark modules.
* It's ideal for most data processing tasks because of its simplicity.





In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark

**Resilient Distributed Dataset (RDD)**
* It is the basic abstraction in Spark.
* An RDD represents an immutable, partitioned collection of elements that can be operated on in parallel.
* The data within an RDD is split into logical partitions, allowing distributed computation across multiple nodes in the cluster.
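
To make "immutable, partitioned, operated on in parallel" concrete, here is a minimal plain-Python sketch. It does not use Spark: map_partitions is a hypothetical helper, tuples stand in for immutable partitions, and a thread pool stands in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor


def map_partitions(partitions, fn):
    """Apply fn to every element, one worker task per partition.

    Returns brand-new partitions; the input is never modified.
    """
    def run(part):
        return tuple(fn(x) for x in part)

    # One worker per partition; pool.map keeps the partition order.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return tuple(pool.map(run, partitions))


# Four partitions of three elements each (tuples cannot be changed in place).
partitions = ((1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12))

squared = map_partitions(partitions, lambda x: x * x)
print(squared)     # ((1, 4, 9), (16, 25, 36), (49, 64, 81), (100, 121, 144))
print(partitions)  # the original data is unchanged
```

A transformation never edits the existing collection; it produces a new one, and each partition can be handled independently of the others.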


**1. Using sparkContext.parallelize() :**
The parallelize() function of SparkContext (sparkContext.parallelize()) creates an RDD by distributing an existing collection from the driver program across the cluster. Use it when the data is already in memory, for example after loading it from a file or a database; all of the data must be present in the driver program before the RDD is created.

NOTE: For production applications, RDDs are mostly created from external storage systems such as HDFS, S3, HBase, etc.
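
What parallelize() does to a driver-side collection can be pictured as cutting a list into contiguous slices, one per partition. The sketch below is plain Python, not Spark code: slice_collection and collect are hypothetical helpers, and the even-slicing rule is an assumption chosen for illustration, so real partition boundaries may differ.

```python
def slice_collection(data, num_slices):
    """Cut data into num_slices contiguous chunks of near-equal size."""
    n = len(data)
    return [
        data[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


def collect(partitions):
    """Gather every partition back into one list, in partition order."""
    return [x for part in partitions for x in part]


data = list(range(1, 13))          # same values as the notebook's example list
parts = slice_collection(data, 4)
print(parts)           # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(collect(parts))  # round-trips to the original list
```

In PySpark the number of partitions can be requested with the numSlices argument of parallelize(), and the actual layout can be inspected with rdd.getNumPartitions() and rdd.glom().collect().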


In [3]:
# Create RDD from parallelize
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = spark.sparkContext.parallelize(data)

In [8]:
#Displaying contents of RDD
rdd.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [9]:
df = spark.sparkContext.parallelize([(1, 2, 3, 'a b c'),
(4, 5, 6, 'd e f'),
(7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3','col4'])

In [10]:
#Displaying the contents of the DataFrame built from the RDD
df.show()

+----+----+----+-----+
|col1|col2|col3| col4|
+----+----+----+-----+
|   1|   2|   3|a b c|
|   4|   5|   6|d e f|
|   7|   8|   9|g h i|
+----+----+----+-----+



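**2. By using the createDataFrame() function**
createDataFrame() returns a DataFrame rather than an RDD; the underlying RDD of Row objects is available through the DataFrame's .rdd attribute.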


In [13]:
Employee = spark.createDataFrame([
                ('1', 'Joe', '70000', '1'),
                ('2', 'Henry', '80000', '2'),
                ('3', 'Sam', '60000', '2'),
                ('4', 'Max', '90000', '1')],
                ['Id', 'Name', 'Salary','DepartmentId']
)

In [14]:
Employee.show()

+---+-----+------+------------+
| Id| Name|Salary|DepartmentId|
+---+-----+------+------------+
|  1|  Joe| 70000|           1|
|  2|Henry| 80000|           2|
|  3|  Sam| 60000|           2|
|  4|  Max| 90000|           1|
+---+-----+------+------------+



**3. By using read and load functions**

In [18]:
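# Read a CSV file that has a header row and let Spark infer the column types.
# 'csv' is the built-in source name; 'com.databricks.spark.csv' is its legacy alias.
df_csv = (spark.read.format('csv')
          .options(header='true', inferSchema='true')
          .load("/content/Top_spotify_songs.csv"))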
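
With schema inference switched on, a column receives a numeric type only if its values can all be read as numbers. The sketch below is a simplified plain-Python model of that idea, not Spark's implementation: infer_field, infer_column and the three-step int / double / string ladder are assumptions for illustration, and Spark's real inference covers more types and has its own parsing rules.

```python
# Order of the simplified type ladder, from narrowest to widest.
RANK = {"null": 0, "int": 1, "double": 2, "string": 3}


def infer_field(text):
    """Narrowest type in the ladder that can hold one CSV field."""
    if text == "":
        return "null"
    for name, parse in (("int", int), ("double", float)):
        try:
            parse(text)
            return name
        except ValueError:
            pass
    return "string"


def infer_column(values):
    """Widest type required by any value in the column."""
    widest = "null"
    for value in values:
        candidate = infer_field(value)
        if RANK[candidate] > RANK[widest]:
            widest = candidate
    # A column with no usable values falls back to string in this model.
    return "string" if widest == "null" else widest


print(infer_column(["1", "2", "3"]))     # int
print(infer_column(["0.5", "1", ""]))    # double
print(infer_column(["12", "n/a", "7"]))  # string
```

A single value that does not parse as a number is enough to widen the whole column to string in this model, which is one possible reason a column that looks numeric can be reported as string by printSchema().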

In [19]:
df_csv.show(5)

+--------------------+------------------+--------------------+----------+--------------+---------------+-------+-------------+----------+-----------+-----------+--------------------+------------------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+
|          spotify_id|              name|             artists|daily_rank|daily_movement|weekly_movement|country|snapshot_date|popularity|is_explicit|duration_ms|          album_name|album_release_date|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|  tempo|time_signature|
+--------------------+------------------+--------------------+----------+--------------+---------------+-------+-------------+----------+-----------+-----------+--------------------+------------------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+--------------+
|2plbrEY59IikOBgBG...|  Die With A

In [20]:
df_csv.printSchema()

root
 |-- spotify_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- daily_rank: string (nullable = true)
 |-- daily_movement: string (nullable = true)
 |-- weekly_movement: string (nullable = true)
 |-- country: string (nullable = true)
 |-- snapshot_date: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- is_explicit: string (nullable = true)
 |-- duration_ms: string (nullable = true)
 |-- album_name: string (nullable = true)
 |-- album_release_date: string (nullable = true)
 |-- danceability: string (nullable = true)
 |-- energy: string (nullable = true)
 |-- key: string (nullable = true)
 |-- loudness: string (nullable = true)
 |-- mode: string (nullable = true)
 |-- speechiness: string (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = t