<a href="https://colab.research.google.com/github/msfasha/307401-Big-Data/blob/main/lecture_notes/section_4_introduction_to_apache_spark/Introduction%20to%20Apache%20Spark%20Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Apache Spark using Python PySpark**
## 1. What is Apache Spark?
Apache Spark is an open-source, distributed computing system that processes large datasets quickly across clusters of computers. 

It’s widely used in data analytics, big data processing, and machine learning due to its speed, ease of use, and versatility.

Typical uses include running distributed SQL queries, building data pipelines, ingesting data into databases, training machine learning models, and processing graphs or data streams.

Key Features of Apache Spark:
- **Speed:** By keeping intermediate data in memory, Spark can run some workloads up to 100 times faster than disk-based frameworks such as Hadoop MapReduce.
- **Distributed Computing:** It divides large datasets across multiple machines, enabling parallel processing.
- **Ease of Use:** Spark provides APIs in Python, Scala, Java, and R, making it accessible to various developers.
- **Unified Data Processing Engine:** Spark has multiple components for different tasks, including Spark SQL, MLlib for machine learning, and Spark Streaming for real-time data.
  
## Spark Architecture and Components:
1. **Spark Core**
- RDDs (Resilient Distributed Datasets): The fundamental data structure in Spark, RDDs are immutable, fault-tolerant collections of objects distributed across nodes. They support operations like map, filter, and reduce.
- Transformations and Actions: Transformations (e.g., map, filter) create a new RDD, while actions (e.g., count, collect) execute computations and return results.
2. **Spark SQL**
- Spark SQL allows querying structured data using SQL or the DataFrame API. It’s optimized for performance and commonly used for analyzing large datasets.
3. **MLlib** (Machine Learning Library)
- MLlib is Spark’s library for scalable machine learning algorithms, including regression, classification, clustering, and collaborative filtering.
4. **Spark Streaming**
- Enables real-time data processing for applications needing continuous, live data feeds.

5. **GraphX** 
- A graph processing library in Spark for graph-parallel computations, useful for tasks like social network analysis.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/msfasha/307401-Big-Data/main/lecture_notes/images/spark_eco_system.png" alt="Spark Eco System" width="600"/>
</div>

## How Spark Works: Understanding Distributed Processing
- A driver program creates a `SparkContext` (or `SparkSession`) that plans the computation and sends tasks to executors running on worker nodes, which process the data in parallel.
- Cluster Managers (like Hadoop YARN, Apache Mesos, or Spark’s built-in Standalone Cluster Manager) coordinate the nodes.

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/msfasha/307401-Big-Data/main/lecture_notes/images/cluster-overview.png" alt="Spark Cluster" width="600"/>    
</div>

---

## 2. Getting Started with PySpark
### 2.1. Setting Up PySpark
You can run PySpark on a local machine or on a distributed cluster. For classroom purposes, we’ll focus on setting up PySpark locally with a Jupyter Notebook environment.
#### Installation Steps:
1. Install Java (required for Spark), for example the `default-jre` package on Debian/Ubuntu.
2. Install PySpark: `pip install pyspark`

### 2.2. Running Your First PySpark Program
You can check if PySpark is installed correctly by launching a Jupyter Notebook and running the code below to initialize Spark.


### Creating a spark session
SparkSession is the entry point to programming with Spark. It allows you to interact with Spark, load and process data, and manage resources in a cluster.

In [29]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Introduction to Spark").getOrCreate()

# Display Spark version
print("Spark version:", spark.version)

Spark version: 3.5.3


- SparkSession.builder: This starts the construction of a Spark session.
- .appName("Introduction to Spark"): This sets the name of your Spark application. It is used for identification in Spark's UI and logs.
- .getOrCreate(): This either retrieves the existing Spark session (if one already exists) or creates a new one if none exists.

This creates a Spark session object named `spark`, which you will use to interact with Spark. Printing the version number confirms that your environment is ready.


---

### Resilient Distributed Datasets (RDDs)
#### What is an RDD?
RDDs, or Resilient Distributed Datasets, are the foundational data structure in Spark. They represent an immutable distributed collection of objects that can be processed in parallel across the nodes in a Spark cluster.
Key properties of RDDs:
- Resilient: If a node fails, lost partitions are recomputed from the RDD's lineage (the recorded chain of transformations).
- Distributed: Data is spread across multiple nodes.
- Immutable: Once created, an RDD cannot be changed.
#### Creating RDDs
There are two main ways to create RDDs:
- Parallelizing a collection: Creating an RDD from an existing list or array.
- Reading from an external data source: Creating an RDD from a file or dataset (like a CSV file).

#### Examples:
1. Parallelize a Collection:<br>

In [30]:
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Collect the RDD data and print it
print(rdd.collect())

[1, 2, 3, 4, 5]


2.	Read from a Text File:

In [31]:
rdd = spark.sparkContext.textFile("path/to/file.txt")

---
## 3. Transformations and Actions
In Apache Spark, **Transformations** and **Actions** are the two main types of operations used to process and analyze data. Understanding the difference between them is crucial for mastering Spark.

Transformations are operations that define a new RDD from an existing one. They are not executed when they are called; Spark runs them only when a later action needs their result. Some examples of transformations include map, filter, and reduceByKey.

Actions trigger the computation of the transformations that precede them and return a result to the driver program (the local machine) or write it to storage. Some examples of actions include reduce, collect, first, and take.

### 3.1. Transformations
Transformations are **lazy operations** that define a set of instructions for manipulating data but do not execute them immediately. Instead, they create a new Resilient Distributed Dataset (RDD) or DataFrame, representing a logical plan for execution.

#### Key Characteristics
- **Lazy Evaluation**: Spark doesn’t execute transformations until an action triggers the computation.
- **Immutable Data**: Transformations create new RDDs or DataFrames rather than modifying existing ones.
- **Chaining**: Multiple transformations can be chained together to create complex workflows.
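
To make lazy evaluation concrete without a cluster, the sketch below uses plain Python's lazy `map` and `filter` as an analogy. It is not Spark code, and the `traced_square` helper and the `calls` list are names introduced here purely for illustration.

```python
calls = []  # records every time the function actually runs

def traced_square(x):
    calls.append(x)
    return x * x

data = [1, 2, 3, 4, 5]

# "Transformations": building the pipeline runs nothing yet
squared = map(traced_square, data)
evens = filter(lambda v: v % 2 == 0, squared)
calls_before_action = len(calls)

# "Action": materializing the result forces the whole chain to run
result = list(evens)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

No call to `traced_square` happens until `list(...)` asks for the results, which mirrors how an RDD transformation waits for an action.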

#### Common Transformations
| Transformation | Description                                  | Example |
|-----------------|----------------------------------------------|---------|
| `map()`         | Applies a function to each element in the dataset. | Transform each number to its square. |
| `filter()`      | Filters elements based on a condition.      | Keep only even numbers. |
| `flatMap()`     | Similar to `map()`, but can produce multiple output elements for each input element. | Split lines of text into words. |
| `distinct()`    | Removes duplicate elements.                 | Get unique values in a dataset. |
| `union()`       | Combines two datasets into one.             | Combine two RDDs. |
| `groupByKey()`  | Groups data by key (key-value RDD).         | Group all values by their keys. |
| `join()`        | Performs a join operation on two datasets.  | Join two RDDs/DataFrames. |

1. map(): Applies a function to each element in the RDD and returns a new RDD.

In [32]:
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Square each number
squared_rdd = rdd.map(lambda x: x**2)

print(squared_rdd.collect())  # Output: [1, 4, 9, 16, 25]

[1, 4, 9, 16, 25]


**Parallelization**:
   The `spark.sparkContext.parallelize(data)` function distributes the data across multiple cores or nodes in the cluster. Spark divides the dataset into partitions, and each partition can be processed independently.

**Transformations and Actions**:
   - The `map` operation (like `filter`, shown in the next example) is a **transformation**. It is lazy and only defines the computation to be performed.
   - The `collect` operation is an **action**. It triggers execution (of `map` in this case), gathers the results on the driver (the local machine), and returns them as a Python list.

**Scaling**:
   - If you run this code on a Spark cluster with multiple cores or nodes, Spark distributes the work of transformations such as `map` across the available executors.
   - Each core or executor will work on a subset of the data, speeding up the computation.
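
The idea of partitions being processed independently can be pictured with a small plain-Python sketch. This is only an analogy: the `split_into_partitions` and `square_partition` helpers are hypothetical, a thread pool stands in for executors, and Spark's real partitioning of a collection may slice the data differently.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(items, num_partitions):
    """Slice a list into roughly equal contiguous chunks."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

def square_partition(partition):
    return [x ** 2 for x in partition]

data = [1, 2, 3, 4, 5]
partitions = split_into_partitions(data, 2)

# Each partition is processed on its own; the pool stands in for executors
with ThreadPoolExecutor(max_workers=2) as pool:
    processed = list(pool.map(square_partition, partitions))

# Gathering the per-partition results plays the role of collect()
collected = [x for part in processed for x in part]
print(partitions, collected)
```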

2. filter(): Returns a new RDD containing only the elements that satisfy a given condition.

In [33]:
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())

[2, 4]


3. flatMap(): Similar to map(), but flattens the results.

`map()` applies a function to each element and returns exactly one output element per input element. `flatMap()` also applies a function to each element, but the function returns a collection (for example a list), and the items of that collection become separate elements of the new RDD. Nested records are therefore flattened into individual elements.

Both map() and flatMap() transformations are narrow, meaning they do not result in the shuffling of data in Spark.
- flatMap() is a one-to-many transformation: each input element can produce zero, one, or many output elements. map() always returns the same number of elements as its input.
- Because one input can yield several outputs, data carried alongside each output (such as a key) may be repeated in the result.
- flatMap() can flatten elements that are themselves lists or arrays, as well as other nested collections.

In [34]:
lines = spark.sparkContext.parallelize(["hello world", "how are you"])
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())

['hello', 'world', 'how', 'are', 'you']
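
The difference between the two transformations can also be seen in plain Python, with a list comprehension standing in for `map()` and `itertools.chain.from_iterable` standing in for the flattening step of `flatMap()`. This is an analogy on a local list, not Spark code; the extra empty line in the sample data is added here to show the zero-output case.

```python
from itertools import chain

lines = ["hello world", "how are you", ""]

# map-like: exactly one output per input (here, one list of words per line)
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened; the empty line contributes nothing
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```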


4. distinct(): Removes duplicate elements.

In [35]:
rdd_with_duplicates = spark.sparkContext.parallelize([1, 2, 2, 3, 4])
distinct_rdd = rdd_with_duplicates.distinct()
print(distinct_rdd.collect())

[1, 2, 3, 4]


5. union(): Combines two RDDs into one.

In [36]:
rdd1 = spark.sparkContext.parallelize([1, 2, 3])
rdd2 = spark.sparkContext.parallelize([4, 5, 6])
combined_rdd = rdd1.union(rdd2)
print(combined_rdd.collect())

[1, 2, 3, 4, 5, 6]


### 3.2. Actions
Actions are **eager operations** that trigger the execution of transformations. They perform computations and return a result to the driver program or write the output to an external storage.

#### Key Characteristics
- **Trigger Execution**: Actions force Spark to evaluate the transformations and perform the computation.
- **Return Results**: They either return a value to the driver or save the result to a file system.
- **Terminal**: An action returns a plain value (or writes output) rather than a new RDD, so it ends a chain of transformations.

#### **Common Actions**
| Action         | Description                                    | Example |
|----------------|------------------------------------------------|---------|
| `collect()`    | Returns all elements of the dataset as a list. | Collect results from an RDD. |
| `count()`      | Counts the number of elements in the dataset.  | Find the total number of rows. |
| `first()`      | Returns the first element of the dataset.      | Get the first line in a file. |
| `take(n)`      | Returns the first `n` elements of the dataset. | Get the first 5 rows. |
| `reduce()`     | Aggregates data using a specified function.    | Find the sum of all numbers. |
| `saveAsTextFile()` | Saves the dataset to a text file.          | Save results to HDFS or local storage. |
| `show()`       | Displays the first few rows of a DataFrame.    | Show data in tabular format. |

Some common actions include:
1.	collect(): Returns all elements of the RDD as a list (use sparingly with large datasets).

In [37]:
print("Collected elements:", rdd.collect())

Collected elements: [1, 2, 3, 4, 5]


2.	count(): Counts the number of elements in the RDD.

In [38]:
print("Count of elements:", rdd.count())

Count of elements: 5


3.	first(): Returns the first element in the RDD.

In [39]:
print("First element:", rdd.first())

First element: 1


4.	take(n): Returns the first n elements.

In [40]:
print("First three elements:", rdd.take(3))

First three elements: [1, 2, 3]


5.	reduce(): Aggregates the elements of the RDD using a specified function.

In [41]:
# Sum all elements
sum_of_elements = rdd.reduce(lambda x, y: x + y)
print("Sum of elements:", sum_of_elements)

Sum of elements: 15


6.	countByValue(): Returns a dictionary of each unique value and its count.

In [42]:
value_counts = rdd.countByValue()
print("Value counts:", value_counts)

Value counts: defaultdict(<class 'int'>, {1: 1, 2: 1, 3: 1, 4: 1, 5: 1})
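
One detail about `reduce()` is worth illustrating: Spark reduces each partition separately and then combines the partial results, so the function passed to it should be associative and commutative. The plain-Python sketch below imitates that two-level reduction; the partition layout and the `partitioned_reduce` helper are assumptions made for the example.

```python
from functools import reduce

partitions = [[1, 2], [3, 4, 5]]  # hypothetical split of the RDD's data

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def partitioned_reduce(func, parts):
    # Reduce each partition on its own, then reduce the partial results
    partials = [reduce(func, part) for part in parts]
    return reduce(func, partials)

flat = [x for part in partitions for x in part]

sum_partitioned = partitioned_reduce(add, partitions)
sum_sequential = reduce(add, flat)

diff_partitioned = partitioned_reduce(subtract, partitions)
diff_sequential = reduce(subtract, flat)

print(sum_partitioned, sum_sequential)    # the two sums agree
print(diff_partitioned, diff_sequential)  # the two differences do not
```

Addition gives the same answer however the data is partitioned, while subtraction does not, which is why a function like subtraction is a poor choice for `reduce()`.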


## Transformations vs. Actions
| Feature              | Transformations                        | Actions                              |
|----------------------|----------------------------------------|-------------------------------------|
| **Execution**         | Lazy: Build a logical execution plan. | Eager: Trigger computation.         |
| **Output**            | Produces a new RDD/DataFrame.         | Returns a value or writes to storage. |
| **Examples**          | `map()`, `filter()`, `flatMap()`      | `collect()`, `count()`, `show()`    |

### Why Lazy Evaluation Matters
Lazy evaluation allows Spark to optimize the execution plan:
1. **Minimizing Data Movement**: Spark analyzes the entire computation chain to reduce shuffling.
2. **Combining Operations**: Spark can merge multiple transformations into a single stage.


### Key Takeaway
- Use **transformations** to define how data should be manipulated.
- Use **actions** to trigger execution and extract results.
- Understanding their roles helps you write efficient and optimized Spark applications.

## Practical Example: Word Count with RDDs
Problem: Count the occurrences of each word in a text file.

In [43]:
# Read the text file
# (use forward slashes in the path; backslashes in a Python string are escape characters)
text_rdd = spark.sparkContext.textFile("datasets/social_media_comments/sentimentdataset.txt")

# Split each line into words and flatten
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))

# Map each word to a (word, 1) pair
word_pairs = words_rdd.map(lambda word: (word, 1))

# Reduce by key (word) to count occurrences
word_counts = word_pairs.reduceByKey(lambda x, y: x + y)

# Collect and display results
for word, count in word_counts.collect():
    print(f"{word}: {count}")


### Code Explanation
```python
# Read the text file
text_rdd = spark.sparkContext.textFile("datasets/social_media_comments/sentimentdataset.txt")
```
Defines an RDD whose lines are split into partitions distributed across worker nodes. Because `textFile` is lazy, the file is actually read only when an action runs; each worker then processes its assigned lines in parallel.

```python
# Split each line into words and flatten
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))
```
Splits each line into words within each worker. Since the data was already distributed in the previous step, flatMap simply applies the splitting operation on each worker's assigned lines independently.

```python
# Map each word to a (word, 1) pair
word_pairs = words_rdd.map(lambda word: (word, 1))
```
Converts each word to a `(word, 1)` pair on each worker. The `map` function sends this transformation to each worker, where it operates on its data independently.

```python
# Reduce by key (word) to count occurrences
word_counts = word_pairs.reduceByKey(lambda x, y: x + y)
```
Aggregates the counts by word. Initially, each worker performs a local aggregation for the words it holds. Then, a shuffle occurs, redistributing data so all occurrences of the same word go to the same worker for final aggregation.

```python
# Collect and display results
for word, count in word_counts.collect():
    print(f"{word}: {count}")
```
Collects the final counts from all workers to the driver program. Each worker sends its results to the driver, where the data is combined and displayed.
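
The three phases described for `reduceByKey` above (local aggregation, shuffle, final aggregation) can be traced by hand in plain Python. The sketch below is an illustration under assumptions: the two partitions are made up, and routing words by their length is a deterministic stand-in for the hash partitioner a real shuffle would use.

```python
from collections import Counter, defaultdict

# Hypothetical partitions of (word, 1) pairs held by two workers
partitions = [
    [("spark", 1), ("data", 1), ("spark", 1)],
    [("data", 1), ("spark", 1), ("rdd", 1)],
]
num_reducers = 2

# 1. Local aggregation: each worker combines its own pairs first
local_counts = []
for part in partitions:
    counts = Counter()
    for word, n in part:
        counts[word] += n
    local_counts.append(counts)

# 2. Shuffle: route each word to one reducer so equal keys meet
shuffled = defaultdict(list)
for counts in local_counts:
    for word, n in counts.items():
        shuffled[len(word) % num_reducers].append((word, n))

# 3. Final aggregation on each reducer
word_counts = {}
for pairs in shuffled.values():
    for word, n in pairs:
        word_counts[word] = word_counts.get(word, 0) + n

print(word_counts)
```

Because of the local aggregation in step 1, only one pair per word per partition crosses the shuffle, rather than one pair per occurrence.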

#### Sorting the Counts in Descending Order

In [None]:
# Read the text file
text_rdd = spark.sparkContext.textFile("datasets/social_media_comments/sentimentdataset.txt")

# Split each line into words and flatten
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))

# Map each word to a (word, 1) pair
word_pairs = words_rdd.map(lambda word: (word, 1))

# Reduce by key (word) to count occurrences
word_counts = word_pairs.reduceByKey(lambda x, y: x + y)

# Sort by count in descending order and take the top 10
top_10_words = word_counts.sortBy(lambda x: x[1], ascending=False).take(10)

# Display the top 10 results
for word, count in top_10_words:
    print(f"{word}: {count}")


: 2321
the: 808
of: 623
a: 621
in: 259
to: 133
and: 111
with: 107
for: 99
on: 91


This word count example demonstrates several core RDD concepts, including transformations (flatMap, map, reduceByKey, sortBy) and actions (collect, take).
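
The first line of the output above has an empty word with the highest count. A likely cause is that `split(" ")` returns empty strings wherever two spaces are adjacent or a line starts or ends with a space. The plain-Python comparison below, on a made-up line, shows the behavior and one possible remedy (`split()` with no argument).

```python
line = "big  data is big "  # note the double space and the trailing space

tokens_single_space = line.split(" ")  # keeps empty strings
tokens_whitespace = line.split()       # splits on any whitespace run and drops empties

print(tokens_single_space)
print(tokens_whitespace)
```

Using `line.split()` inside the `flatMap` lambda would therefore be one way to keep empty tokens out of the word counts.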


---

## 4. Introduction to Spark SQL: Using `apartment_prices.csv`


Spark SQL is a powerful module in Apache Spark for processing structured data. It enables SQL-like querying of data and integrates seamlessly with Spark’s core APIs.

Key capabilities:
- Query structured data using SQL.
- Work with various data formats (CSV, JSON, Parquet).
- Combine SQL with Spark’s DataFrame API for powerful analytics.



### Setting Up Spark SQL

#### Create a SparkSession
The `SparkSession` is the entry point for working with Spark SQL.

In [None]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("Apartment Prices Analysis").getOrCreate()

### Loading the Dataset
We'll load the provided `apartment_prices.csv` into a Spark DataFrame for analysis.

#### Load the Dataset

In [None]:
# Load CSV file into a DataFrame
df = spark.read.csv("datasets/apartment_prices.csv", header=True, inferSchema=True)

# Show the schema and a few rows of the dataset
df.printSchema()
df.show(5)

root
 |-- Square_Area: integer (nullable = true)
 |-- Num_Rooms: integer (nullable = true)
 |-- Age_of_Building: integer (nullable = true)
 |-- Floor_Level: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Price: double (nullable = true)

+-----------+---------+---------------+-----------+-----+-------+
|Square_Area|Num_Rooms|Age_of_Building|Floor_Level| City|  Price|
+-----------+---------+---------------+-----------+-----+-------+
|        162|        1|             15|         12|Amman|74900.0|
|        152|        5|              8|          8|Aqaba|79720.0|
|         74|        3|              2|          8|Irbid|43200.0|
|        166|        1|              3|         18|Irbid|69800.0|
|        131|        3|             14|         15|Aqaba|63160.0|
+-----------+---------+---------------+-----------+-----+-------+
only showing top 5 rows



### Registering the DataFrame as a SQL Table
To query the dataset using SQL, we register the DataFrame as a temporary view.

In [None]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("apartments")

### SQL Operations on the Dataset

#### a. Basic SELECT Query
Retrieve all apartments located in "Amman."

In [None]:
result = spark.sql("SELECT * FROM apartments WHERE City = 'Amman'")
result.show()

+-----------+---------+---------------+-----------+-----+--------+
|Square_Area|Num_Rooms|Age_of_Building|Floor_Level| City|   Price|
+-----------+---------+---------------+-----------+-----+--------+
|        162|        1|             15|         12|Amman| 74900.0|
|        134|        4|              4|          4|Amman| 80300.0|
|        163|        4|             18|         10|Amman| 85350.0|
|         97|        4|             19|         12|Amman| 56650.0|
|        117|        1|              4|         19|Amman| 72650.0|
|        108|        1|              1|          4|Amman| 56600.0|
|         74|        1|              5|          8|Amman| 41300.0|
|        110|        5|             11|         19|Amman| 82500.0|
|        110|        5|              5|         19|Amman| 88500.0|
|         80|        5|              6|          9|Amman| 64000.0|
|        132|        2|              9|          3|Amman| 63400.0|
|        191|        5|             12|         19|Amman|11795

#### b. Aggregations
Calculate the average price of apartments grouped by the number of rooms.


In [None]:
result = spark.sql("SELECT Num_Rooms, AVG(Price) AS avg_price FROM apartments GROUP BY Num_Rooms")
result.show()

+---------+------------------+
|Num_Rooms|         avg_price|
+---------+------------------+
|        1| 56153.36363636364|
|        3|61384.903846153844|
|        5| 76516.07843137255|
|        4|  66747.1264367816|
|        2| 57040.51546391752|
+---------+------------------+



#### c. Sorting Data
List the top 5 most expensive apartments.

In [None]:
result = spark.sql("SELECT * FROM apartments ORDER BY price DESC LIMIT 5")
result.show()

+-----------+---------+---------------+-----------+-----+--------+
|Square_Area|Num_Rooms|Age_of_Building|Floor_Level| City|   Price|
+-----------+---------+---------------+-----------+-----+--------+
|        199|        4|              2|         16|Amman|123550.0|
|        183|        5|              4|         19|Amman|122350.0|
|        191|        5|             12|         19|Amman|117950.0|
|        187|        4|              7|         19|Amman|116150.0|
|        160|        5|              1|         15|Amman|111000.0|
+-----------+---------+---------------+-----------+-----+--------+



#### d. Filtering and Conditions
Find apartments with more than 3 rooms and priced below 200,000.

In [None]:
result = spark.sql("""
    SELECT * 
    FROM apartments 
    WHERE Num_Rooms > 3 AND Price < 200000
""")
result.show()

+-----------+---------+---------------+-----------+-----+--------+
|Square_Area|Num_Rooms|Age_of_Building|Floor_Level| City|   Price|
+-----------+---------+---------------+-----------+-----+--------+
|        152|        5|              8|          8|Aqaba| 79720.0|
|         80|        4|             14|          7|Aqaba| 41800.0|
|        181|        4|             16|         16|Aqaba| 85160.0|
|        134|        4|              4|          4|Amman| 80300.0|
|        147|        5|              5|          6|Aqaba| 78920.0|
|        159|        4|             16|          9|Irbid| 60700.0|
|        163|        4|             18|         10|Amman| 85350.0|
|         61|        4|              7|         18|Aqaba| 52960.0|
|         97|        4|             19|         12|Amman| 56650.0|
|        189|        4|             16|         13|Aqaba| 85040.0|
|         80|        5|             19|         13|Aqaba| 47800.0|
|         81|        4|             18|         19|Aqaba| 5016

### Writing Query Results to a File
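Save the filtered data (apartments in "Amman") to CSV. Spark writes a directory of part files at the given path rather than a single file.
```python
# The city column in this dataset is named City
result = spark.sql("SELECT * FROM apartments WHERE City = 'Amman'")
result.write.csv("/mnt/data/amman_apartments.csv", header=True)
```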

### Using Built-in SQL Functions

#### a. String Manipulation
Convert all city names to uppercase.

In [None]:
result = spark.sql("SELECT UPPER(City) AS location_upper, Square_Area, Price FROM apartments")
result.show()

+--------------+-----------+-------+
|location_upper|Square_Area|  Price|
+--------------+-----------+-------+
|         AMMAN|        162|74900.0|
|         AQABA|        152|79720.0|
|         IRBID|         74|43200.0|
|         IRBID|        166|69800.0|
|         AQABA|        131|63160.0|
|         AQABA|         80|41800.0|
|         AQABA|        162|68320.0|
|         AQABA|        181|85160.0|
|         AMMAN|        134|80300.0|
|         AQABA|        147|78920.0|
|         IRBID|        176|51800.0|
|         IRBID|        159|60700.0|
|         AMMAN|        163|85350.0|
|         IRBID|        190|74000.0|
|         IRBID|        112|44600.0|
|         AQABA|         61|52960.0|
|         IRBID|        147|65100.0|
|         AMMAN|         97|56650.0|
|         AQABA|        189|85040.0|
|         AQABA|         80|47800.0|
+--------------+-----------+-------+
only showing top 20 rows



#### b. Numeric Functions
Calculate the price per square foot for each apartment.

In [None]:
result = spark.sql("SELECT City, Square_Area, Price, (Price / Square_Area) AS price_per_sqft FROM apartments")
result.show()

+-----+-----------+-------+------------------+
| City|Square_Area|  Price|    price_per_sqft|
+-----+-----------+-------+------------------+
|Amman|        162|74900.0|462.34567901234567|
|Aqaba|        152|79720.0| 524.4736842105264|
|Irbid|         74|43200.0| 583.7837837837837|
|Irbid|        166|69800.0|420.48192771084337|
|Aqaba|        131|63160.0| 482.1374045801527|
|Aqaba|         80|41800.0|             522.5|
|Aqaba|        162|68320.0| 421.7283950617284|
|Aqaba|        181|85160.0|470.49723756906076|
|Amman|        134|80300.0| 599.2537313432836|
|Aqaba|        147|78920.0| 536.8707482993198|
|Irbid|        176|51800.0| 294.3181818181818|
|Irbid|        159|60700.0|381.76100628930817|
|Amman|        163|85350.0| 523.6196319018405|
|Irbid|        190|74000.0| 389.4736842105263|
|Irbid|        112|44600.0| 398.2142857142857|
|Aqaba|         61|52960.0| 868.1967213114754|
|Irbid|        147|65100.0|442.85714285714283|
|Amman|         97|56650.0|  584.020618556701|
|Aqaba|      

#### c. Statistical Analysis
Find the minimum, maximum, and average apartment prices.

In [None]:
result = spark.sql("""
    SELECT 
        MIN(price) AS min_price, 
        MAX(price) AS max_price, 
        AVG(price) AS avg_price 
    FROM apartments
""")
result.show()

+---------+---------+---------+
|min_price|max_price|avg_price|
+---------+---------+---------+
|  15900.0| 123550.0| 63410.94|
+---------+---------+---------+



### End-to-End Example
1. Load the dataset.
2. Filter apartments with at least 2 rooms and priced below 150,000.
3. Group them by city and calculate the average price.
4. Save the results.

In [None]:
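# Step 1: The dataset was loaded and registered as the "apartments" view above

# Step 2: Filter data and expose the result as its own temporary view
filtered_data = spark.sql("""
    SELECT * 
    FROM apartments 
    WHERE Num_Rooms >= 2 AND Price < 150000
""")
filtered_data.createOrReplaceTempView("filtered_apartments")

# Step 3: Group and aggregate the filtered rows
aggregated_data = spark.sql("""
    SELECT City, AVG(Price) AS avg_price
    FROM filtered_apartments
    GROUP BY City
""")

# Step 4: Save results to a file (uncomment to write; Spark creates a directory at this path)
# aggregated_data.write.csv("/mnt/data/filtered_apartments.csv", header=True)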
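
The filter-then-aggregate logic can be checked on a small scale without Spark. The sketch below runs the same kind of query with Python's built-in `sqlite3` module as a stand-in engine, using only the five sample rows printed by `df.show(5)` earlier. SQLite is not Spark SQL, so this only illustrates the query logic.

```python
import sqlite3

# The five sample rows printed by df.show(5) earlier in this section
rows = [
    (162, 1, 15, 12, "Amman", 74900.0),
    (152, 5, 8, 8, "Aqaba", 79720.0),
    (74, 3, 2, 8, "Irbid", 43200.0),
    (166, 1, 3, 18, "Irbid", 69800.0),
    (131, 3, 14, 15, "Aqaba", 63160.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apartments (Square_Area INTEGER, Num_Rooms INTEGER, "
    "Age_of_Building INTEGER, Floor_Level INTEGER, City TEXT, Price REAL)"
)
conn.executemany("INSERT INTO apartments VALUES (?, ?, ?, ?, ?, ?)", rows)

# Same filter and aggregation as the Spark SQL query, run on SQLite
avg_by_city = dict(conn.execute("""
    SELECT City, AVG(Price) AS avg_price
    FROM apartments
    WHERE Num_Rooms >= 2 AND Price < 150000
    GROUP BY City
""").fetchall())
conn.close()

print(avg_by_city)
```

Only three of the five sample rows have at least 2 rooms, so Amman drops out of this small result.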

## 5. Machine Learning with Apache Spark: Predicting Apartment Prices

In this section, we'll explore the basics of machine learning in Apache Spark using the MLlib library. Specifically, we'll build a regression model to predict apartment prices based on features like square area, number of rooms, age of the building, and floor level. 

### Step 1: Setting Up the Spark Environment

First, we need to set up a `SparkSession`, which is the main entry point for using Spark's DataFrame and MLlib capabilities. The `SparkSession` allows us to create and manipulate DataFrames and to access Spark's machine learning library.

In [45]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("Apartment Price Prediction").getOrCreate()

24/11/15 21:02:14 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Step 2: Loading the Dataset

Next, we load the dataset containing apartment information and prices. Spark can read various file formats; here, we’re loading a CSV file with headers and inferring the data types for each column. Once loaded, we display the schema and some sample rows to understand the data structure.

In [47]:
# Load the dataset
data_path = "datasets/apartment_prices.csv"  # Adjust the path if needed
df = spark.read.csv(data_path, header=True, inferSchema=True)

# Show the schema and data
df.printSchema()
df.show()

root
 |-- Square_Area: integer (nullable = true)
 |-- Num_Rooms: integer (nullable = true)
 |-- Age_of_Building: integer (nullable = true)
 |-- Floor_Level: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Price: double (nullable = true)

+-----------+---------+---------------+-----------+-----+-------+
|Square_Area|Num_Rooms|Age_of_Building|Floor_Level| City|  Price|
+-----------+---------+---------------+-----------+-----+-------+
|        162|        1|             15|         12|Amman|74900.0|
|        152|        5|              8|          8|Aqaba|79720.0|
|         74|        3|              2|          8|Irbid|43200.0|
|        166|        1|              3|         18|Irbid|69800.0|
|        131|        3|             14|         15|Aqaba|63160.0|
|         80|        4|             14|          7|Aqaba|41800.0|
|        162|        2|             11|         11|Aqaba|68320.0|
|        181|        4|             16|         16|Aqaba|85160.0|
|        134|    

### Step 3: Data Preprocessing – Handling Categorical Data

In machine learning, we need to convert categorical data into numerical representations. Here, the `City` column is a categorical feature that we need to transform. We use `StringIndexer` to assign a numeric index to each unique city, and then we apply `OneHotEncoder` to convert these indices into a one-hot encoded vector. This helps the model process categorical data effectively.

In [48]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Convert the 'City' column to a numeric index
indexer = StringIndexer(inputCol="City", outputCol="CityIndex")
df = indexer.fit(df).transform(df)

# Convert the numeric index to one-hot encoding
encoder = OneHotEncoder(inputCol="CityIndex", outputCol="CityVec")
df = encoder.fit(df).transform(df)
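
To make the two transformations concrete, here is a minimal pure-Python sketch of what they compute conceptually. It is not Spark's implementation. The helper names `fit_string_index` and `one_hot` are hypothetical, and the sketch assumes the default behaviors (indices ordered by descending frequency with ties broken alphabetically, and the last category dropped). The city list contains only the eight complete sample rows shown earlier, so the indices learned from the full dataset may differ.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; the last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

# Cities from the eight complete sample rows printed in Step 2.
cities = ["Amman", "Aqaba", "Irbid", "Irbid", "Aqaba", "Aqaba", "Aqaba", "Aqaba"]

city_index = fit_string_index(cities)
for city in sorted(city_index, key=city_index.get):
    print(city, city_index[city], one_hot(city_index[city], len(city_index)))
```

Spark stores the encoded column as a sparse vector rather than a Python list, but the values it represents follow the same pattern.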

### Step 4: Feature Engineering – Assembling Features

Spark’s MLlib expects the features for each data point to be in a single vector column. We use the `VectorAssembler` to combine `Square_Area`, `Num_Rooms`, `Age_of_Building`, `Floor_Level`, and the one-hot encoded `CityVec` column into a single `features` column. We also rename the `Price` column to `label`, which is the default label column name used by MLlib estimators (a different name can be supplied through the `labelCol` parameter).

In [49]:
from pyspark.ml.feature import VectorAssembler

# Assemble features into a single vector
assembler = VectorAssembler(inputCols=["Square_Area", "Num_Rooms", "Age_of_Building", "Floor_Level", "CityVec"], outputCol="features")
df = assembler.transform(df)

# Select the final columns for modeling
df = df.select("features", df["Price"].alias("label"))
df.show()

+--------------------+-------+
|            features|  label|
+--------------------+-------+
|[162.0,1.0,15.0,1...|74900.0|
|[152.0,5.0,8.0,8....|79720.0|
|[74.0,3.0,2.0,8.0...|43200.0|
|[166.0,1.0,3.0,18...|69800.0|
|[131.0,3.0,14.0,1...|63160.0|
|[80.0,4.0,14.0,7....|41800.0|
|[162.0,2.0,11.0,1...|68320.0|
|[181.0,4.0,16.0,1...|85160.0|
|[134.0,4.0,4.0,4....|80300.0|
|[147.0,5.0,5.0,6....|78920.0|
|[176.0,2.0,14.0,3...|51800.0|
|[159.0,4.0,16.0,9...|60700.0|
|[163.0,4.0,18.0,1...|85350.0|
|[190.0,2.0,7.0,14...|74000.0|
|[112.0,2.0,10.0,1...|44600.0|
|[61.0,4.0,7.0,18....|52960.0|
|[147.0,2.0,1.0,12...|65100.0|
|[97.0,4.0,19.0,12...|56650.0|
|[189.0,4.0,16.0,1...|85040.0|
|[80.0,5.0,19.0,13...|47800.0|
+--------------------+-------+
only showing top 20 rows
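
As a rough illustration of the assembling step, the sketch below concatenates scalar columns and a vector column into one flat list, in input order. It is a pure-Python approximation, not the `VectorAssembler` implementation. The `assemble` helper is hypothetical, and the `CityVec` value in the example row is an assumed encoding chosen only for illustration.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of a row into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-valued columns are flattened into the output.
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

input_cols = ["Square_Area", "Num_Rooms", "Age_of_Building", "Floor_Level", "CityVec"]

# First sample row from Step 2; the CityVec value is an assumption.
row = {
    "Square_Area": 162,
    "Num_Rooms": 1,
    "Age_of_Building": 15,
    "Floor_Level": 12,
    "CityVec": [1.0, 0.0],
}

features = assemble(row, input_cols)
print(features)
```

Four scalar columns plus a two-element city vector give six features per row, which is why the trained model later reports six coefficients.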



### Step 5: Splitting the Dataset

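To evaluate our model, we need to split the data into training and test sets. A common choice is to use 80% of the data for training and 20% for testing. This allows us to train the model on one portion of the data and then test its performance on unseen data. Note that `randomSplit` assigns rows at random, so the resulting proportions are approximate rather than exact; passing a `seed` makes the split reproducible.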

In [50]:
# Split data into training and test sets
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
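
The following pure-Python sketch shows why a random split yields approximate proportions: each row is assigned independently by a random draw. It is a simplified model, not Spark's sampling code, and the `bernoulli_split` helper and the 1000-row stand-in dataset are hypothetical.

```python
import random

def bernoulli_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))  # stand-in for the rows of a DataFrame
train_rows, test_rows = bernoulli_split(rows)

# The sizes are close to 800 and 200 but are not guaranteed to match exactly.
print(len(train_rows), len(test_rows))
```

Calling the function again with the same seed returns the same split, which is the role the `seed` argument plays in `randomSplit`.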

### Step 6: Building and Training a Linear Regression Model

Now, we initialize and train a **Linear Regression** model. Linear regression is a supervised learning algorithm commonly used for predicting numerical values. Here, it will help us predict apartment prices based on the features provided.

In [51]:
from pyspark.ml.regression import LinearRegression

# Initialize Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="label")

# Train the model on the training data
lr_model = lr.fit(train_data)

# Print model coefficients and intercept
print(f"Coefficients: {lr_model.coefficients}")
print(f"Intercept: {lr_model.intercept}")

24/11/15 21:02:37 WARN Instrumentation: [03494bb4] regParam is zero, which might cause numerical instability and overfitting.
24/11/15 21:02:37 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS


Coefficients: [371.418044269336,4978.707796345987,-1014.4774362162049,1035.5111204683565,11925.43813852181,-8114.944798383919]
Intercept: -1623.4506090574075


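The output shows the coefficient (weight) learned for each feature, along with the intercept term. The coefficients appear in the same order as the `VectorAssembler` inputs: the first four belong to `Square_Area`, `Num_Rooms`, `Age_of_Building`, and `Floor_Level`, and the last two belong to the one-hot encoded city vector. Each coefficient estimates how much the predicted price changes when that feature increases by one unit while the others are held fixed. The `regParam is zero` warning simply notes that the model was trained without regularization.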
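
To see how these numbers produce a prediction, the sketch below applies the linear regression formula (the dot product of coefficients and features, plus the intercept) by hand. The coefficients and intercept are copied from the output above. The apartment is hypothetical, and its city encoding `(1.0, 0.0)` is an assumption, since the notebook does not show which city maps to which index.

```python
# Values copied from the training output above.
coefficients = [
    371.418044269336,      # Square_Area
    4978.707796345987,     # Num_Rooms
    -1014.4774362162049,   # Age_of_Building
    1035.5111204683565,    # Floor_Level
    11925.43813852181,     # CityVec[0]
    -8114.944798383919,    # CityVec[1]
]
intercept = -1623.4506090574075

def predict(features):
    """Linear regression prediction: intercept + sum(coefficient * feature)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical apartment: 100 m2, 3 rooms, 10 years old, 5th floor,
# in whichever city is encoded as (1.0, 0.0) (an assumption).
apartment = [100.0, 3.0, 10.0, 5.0, 1.0, 0.0]
prediction = predict(apartment)
print(f"Predicted price: {prediction:.2f}")
```

This is the same calculation that `lr_model.transform` performs for every row of the input DataFrame.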

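### Step 7: Making Predictions and Evaluating the Model

We now apply the trained model to the test set and measure its performance with two metrics, RMSE and R².

In [None]: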
from pyspark.ml.evaluation import RegressionEvaluator

# Make predictions on the test data
predictions = lr_model.transform(test_data)

# Show predictions
predictions.select("features", "label", "prediction").show()

# Evaluate model using RMSE
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Initialize RegressionEvaluator with R2 metric
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")

# Evaluate model using R2
r2 = evaluator.evaluate(predictions)
print(f"R-squared: {r2}")

+--------------------+-------+------------------+
|            features|  label|        prediction|
+--------------------+-------+------------------+
|[60.0,3.0,14.0,3....|22000.0|16386.659892134998|
|[61.0,3.0,11.0,2....|24300.0| 18765.99912458459|
|[61.0,5.0,15.0,18...|55450.0|61274.065836811176|
|[63.0,4.0,14.0,12...|46350.0| 51839.60484240993|
|[65.0,3.0,19.0,16...|35400.0|34747.952296873205|
|[67.0,2.0,14.0,8....|28120.0| 27300.37880640007|
|[68.0,1.0,13.0,14...|36600.0|41846.071351871564|
|[71.0,2.0,4.0,3.0...|30300.0| 25638.32494491376|
|[73.0,2.0,17.0,1....|15900.0|11121.932121705046|
|[74.0,1.0,4.0,3.0...|30640.0| 29888.81607975969|
|[74.0,1.0,5.0,8.0...|41300.0| 45977.33238440708|
|[74.0,2.0,13.0,10...|40300.0|44911.242931960136|
|[74.0,5.0,19.0,10...|49300.0|53760.501703700866|
|[76.0,2.0,11.0,4....|37200.0| 41469.96717012108|
|[78.0,2.0,9.0,2.0...|31080.0|30245.297751633643|
|[80.0,5.0,6.0,9.0...|64000.0|  68141.7055196592|
|[83.0,1.0,10.0,18...|50350.0| 54602.81880643364|


The predictions DataFrame shows the actual price (label) alongside the model’s predicted price (prediction). 

The RMSE (Root Mean Squared Error) measures the typical size of the prediction error on the test data, expressed in the same units as the label (here, the price). Lower values indicate better model performance.

The R-squared (R²) value represents the proportion of the variance in the target variable (e.g., price) that is explained by the model. An R² value closer to 1 indicates a better model fit, while a value closer to 0 indicates a poor fit.
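
Both metrics can be computed by hand from (label, prediction) pairs, as in this pure-Python sketch using the standard definitions. The five pairs are the first five rows of the predictions table above. The results therefore describe only those five rows and will not match the values reported by `RegressionEvaluator` for the full test set.

```python
import math

# (label, prediction) pairs copied from the first five rows shown above.
pairs = [
    (22000.0, 16386.659892134998),
    (24300.0, 18765.99912458459),
    (55450.0, 61274.065836811176),
    (46350.0, 51839.60484240993),
    (35400.0, 34747.952296873205),
]

def rmse(pairs):
    """Square root of the mean squared difference between label and prediction."""
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in pairs) / len(pairs))

def r_squared(pairs):
    """1 - (residual sum of squares / total sum of squares around the mean label)."""
    mean_label = sum(y for y, _ in pairs) / len(pairs)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in pairs)
    ss_tot = sum((y - mean_label) ** 2 for y, _ in pairs)
    return 1.0 - ss_res / ss_tot

print(f"RMSE on sample rows: {rmse(pairs):.2f}")
print(f"R-squared on sample rows: {r_squared(pairs):.4f}")
```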


### Summary

In this notebook, we demonstrated how to use Spark MLlib to build a regression model for predicting apartment prices. The workflow included data preprocessing, feature engineering, model training, and evaluation. This example highlights Spark's ability to handle machine learning tasks on large datasets in a distributed environment, making it an excellent tool for scalable data processing and analysis.

## Installing PySpark on a Local Windows Machine:
- Install Python 3.9.
- Add Python to the `PATH`. Optionally, create a virtual environment and add it to the `PATH` instead.
- Install PySpark: `pip install pyspark`.
- Install Java 11.
- Add a `JAVA_HOME` environment variable that points to the Java installation folder.
- Under System Variables, click New to add a new variable:
  - Variable name: `PYSPARK_PYTHON`
  - Variable value: `python`
- Repeat to add another variable:
  - Variable name: `PYSPARK_DRIVER_PYTHON`
  - Variable value: `python`
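
After completing these steps, a short script can confirm that the environment variables are visible to Python. This is a minimal sketch using only the standard library. The `missing_vars` helper is hypothetical, and the check only verifies that each variable is set, not that its value points to a valid installation.

```python
import os

# Variables named in the installation steps above.
REQUIRED_VARS = ("JAVA_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")

def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Open a new terminal before running the script, since a terminal that was already open may not see newly added variables.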