# Learning Apache Spark with PySpark
*An introduction to basic concepts for working with Apache Spark using Python.*

*This notebook implements these concepts in code, serving both as a learning resource and as a collection of reusable code snippets for future projects.*


## What is Apache Spark?

Apache Spark is a tool designed to process and analyze large amounts of data efficiently.
Instead of working on data row by row on a single machine, Spark is built to split work into smaller pieces and process them in parallel.

Although Spark is often used on clusters with many machines, it can also run locally on a single computer.
In this notebook, Spark is used in **local mode**, which makes it easier to experiment and learn, while still using the same APIs that would later scale to a cluster.

At a high level, Spark focuses on:
- processing data in parallel,
- keeping data in memory when possible for better performance,
- delaying execution until a result is actually needed.

These ideas make Spark especially useful in Big Data and Cloud Computing contexts, where datasets are too large to be handled efficiently by traditional single-machine tools.


## SparkSession: Entry point to Spark
A SparkSession is the main entry point to Apache Spark when using it from Python (PySpark).

In practical terms, the SparkSession:
- connects your Python code to the Spark engine (running on the JVM),
- manages configuration and resources,
- allows you to create DataFrames, read data, and run computations.

Without a SparkSession, Spark has no context in which to execute your code.

To start a Spark application, import `SparkSession` from PySpark and initialize a session:

In [None]:
# Import the SparkSession
from pyspark.sql import SparkSession

# Start the SparkSession (or reuse the one that is already active)
spark = (SparkSession
         .builder
         .appName("Test-1")
         .getOrCreate())

## DataFrames
A **Spark DataFrame** is Spark’s main way of working with structured data.  
It looks and feels similar to a Pandas DataFrame, but it is designed to work with much larger datasets.

Behind the scenes, a Spark DataFrame:
- is split into multiple partitions,
- is evaluated lazily (nothing runs until a result is needed),
- is optimized automatically by Spark before execution.

Because of this, DataFrames are what you’ll use most of the time when working with Spark in PySpark and Spark SQL.

### Defining Data

To create a DataFrame, two components are required:

- **Data**: the actual rows
- **Schema**: the structure of the DataFrame (column names and types)

These components can be declared in the script or read from data files.

**Declaring the data manually:**

In [None]:
# Declaring the different entities (rows) of the DataFrame
data = [[1, "Nombre 1", "Apellido 1"],
        [2, "Nombre 2", "Apellido 2"],
        [3, "Nombre 3", "Apellido 3"]]

**Creating the Schema:**
A schema can be defined in several ways; here are two simple ones.

In [None]:
# Create the schema for the DataFrame, using SQL DDL
schema = "`ID` INT, `First` STRING, `Second` STRING"

# Alternatively, provide only the column names as a Python list; Spark infers the types
columns = ["ID", "Fname", "Lname"]

### Creating Spark DataFrames

Spark DataFrames are created using the active SparkSession.
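The same data can produce different DataFrames depending on how the schema is provided: a DDL string fixes the column types explicitly, while a list of column names leaves Spark to infer the types from the data.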

In [None]:
# From the DDL schema string
my_first_df = spark.createDataFrame(data, schema)

# From the columns list
my_second_df = spark.createDataFrame(data, columns)

### Displaying the DataFrame

In [None]:
# Printing the DataFrames
my_first_df.show()
my_second_df.show()

# printSchema() prints the schema itself and returns None, so no print() is needed
my_first_df.printSchema()
my_second_df.printSchema()