# What is PySpark?

#### PySpark is the Python API for Apache Spark, an open-source, distributed computing framework designed for big data processing and analytics.

#### It leverages the power of Spark to handle large datasets and perform complex operations in a distributed environment.

### Setting up the Spark environment

The starting point for any Spark application is creating a SparkSession. This session is the single entry point for reading data, creating DataFrames, and working with Spark’s functionality.

In [6]:
# pip install pyspark

In [10]:
import warnings
from pyspark.sql import SparkSession

In [11]:
warnings.filterwarnings("ignore")

In [12]:
spark = SparkSession.builder.appName("SparkBasics").getOrCreate()
spark

SparkSession: It encapsulates your connection to a Spark cluster.

builder: This is a factory for constructing your SparkSession.

appName: This gives your application a name that will show up in Spark’s UI.

getOrCreate(): It creates a new session if none exists or returns an existing one.

### Loading data into a DataFrame

spark.read: Initiates the DataFrameReader.

option("header", True): Tells Spark that the first line contains column names.

csv("path/to/data.csv"): Loads the CSV file at the given path; other formats have their own reader methods, such as json() and parquet().

show(n): Displays the first n rows, so show(5) prints five rows; the cell below uses show(2).

printSchema(): Outputs the DataFrame schema to help validate the data types.

In [14]:
sample_df = spark.read.option("header",True).csv("../sample_data/spotify.csv")

In [17]:
sample_df.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- duration_ms: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- acousticness: string (nullable = true)
 |-- danceability: string (nullable = true)
 |-- energy: string (nullable = true)
 |-- instrumentalness: string (nullable = true)
 |-- liveness: string (nullable = true)
 |-- loudness: string (nullable = true)
 |-- speechiness: string (nullable = true)
 |-- tempo: string (nullable = true)
 |-- valence: string (nullable = true)
 |-- mode: string (nullable = true)
 |-- key: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- explicit: string (nullable = true)



In [18]:
sample_df.show(2)

+--------------------+--------------------+--------------------+-----------+------------+----+------------+------------+------+----------------+--------+--------+-----------+-------+-------+----+---+----------+--------+
|                  id|                name|             artists|duration_ms|release_date|year|acousticness|danceability|energy|instrumentalness|liveness|loudness|speechiness|  tempo|valence|mode|key|popularity|explicit|
+--------------------+--------------------+--------------------+-----------+------------+----+------------+------------+------+----------------+--------+--------+-----------+-------+-------+----+---+----------+--------+
|6KbQ3uYMLKb5jDxLF...|Singende Bataillo...| ['Carl Woitschach']|     158648|        1928|1928|       0.995|       0.708| 0.195|           0.563|   0.151| -12.428|     0.0506|118.469|  0.779|   1| 10|         0|       0|
|6KuQTIu1KoTTkLXKr...|Fantasiestücke, O...|['Robert Schumann...|     282133|        1928|1928|       0.994|       0.379|