<a href="https://colab.research.google.com/github/gj-goncalvescaldas/Map-Reduce-and-Spark/blob/main/Spark_Setup_and_Basic_Data_Processing_in_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Description:**

This Jupyter notebook is a step-by-step guide to setting up Apache Spark in Google Colab and running basic data processing tasks with PySpark. It covers installing the dependencies, initializing Spark, creating DataFrames, processing a text file, and reading a CSV file. Each code block is followed by a short explanation of what it does.


In [8]:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# download spark (uncomment on the first run; change the version number if needed)
#!wget -q https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# extract the spark archive into the current folder (needs the .tgz downloaded above)
!tar xf spark-3.5.1-bin-hadoop3.tgz

# point JAVA_HOME and SPARK_HOME at the installed locations
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

# install findspark using pip
!pip install -q findspark

Explanation:

* Install Java: Installs the headless OpenJDK 8 package and discards the command's output.
* Download Spark: Fetches the Spark 3.5.1 archive. The line is commented out; uncomment it on the first run, because the next step needs the archive to be present.
* Extract Spark: Unpacks the downloaded `.tgz` archive into the current folder.
* Set Environment Variables: Sets `JAVA_HOME` and `SPARK_HOME` so that Spark can find the Java runtime and its own installation.
* Install findspark: Installs the findspark package, which locates the Spark installation from Python.
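A wrong path in either environment variable usually only shows up later as a confusing import or startup error. The sketch below checks both variables up front. `check_spark_env` is a hypothetical helper written for this example; it is not part of Spark or findspark.

```python
import os

def check_spark_env(required=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Return a dict mapping each problematic variable to a short reason."""
    env = os.environ if environ is None else environ
    problems = {}
    for name in required:
        path = env.get(name)
        if not path:
            problems[name] = "not set"
        elif not os.path.isdir(path):
            problems[name] = f"directory not found: {path}"
    return problems

# Report anything that looks wrong; prints nothing when both paths exist.
for name, reason in check_spark_env().items():
    print(f"{name}: {reason}")
```

Running this right after the setup cell and before `findspark.init()` should make a missing archive or a mistyped version number easier to spot.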

In [12]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([{"hello": "world"} for _ in range(1000)])
df.show(3, truncate=False)


+-----+
|hello|
+-----+
|world|
|world|
|world|
+-----+
only showing top 3 rows



Explanation:

* Initialize findspark: Adds the PySpark libraries found under `SPARK_HOME` to `sys.path` so that `pyspark` can be imported.
* Create SparkSession: Starts a local Spark session (or reuses an existing one) that uses all available cores.
* Create DataFrame: Creates a DataFrame with 1000 identical rows and a single column, `hello`, whose value is `world`.
* Show DataFrame: Displays the first 3 rows of the DataFrame without truncating the values.



In [32]:
sc = spark.sparkContext
file = sc.textFile("/DATA/el_quijote.txt")
words = file.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
for x in wordCounts.take(10):
  print(x)

('DE', 17)
('LA', 13)
('Miguel', 3)
('', 1)
('PRIMERA', 1)
('CAPÍTULO', 1)
('1:', 1)
('condición', 23)
('y', 8042)
('del', 1113)


Explanation:

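* Get SparkContext: Retrieves the SparkContext that belongs to the Spark session.
* Read Text File: Reads "/DATA/el_quijote.txt" as an RDD with one element per line.
* Split into Words: Splits each line on single spaces. Consecutive spaces and blank lines therefore produce empty strings, which is why `''` appears as a word in the output.
* Count Words: Maps each word to a (word, 1) pair and reduces by key to count the occurrences of each word.
* Show Word Counts: Prints 10 of the (word, count) pairs. `take(10)` returns the first 10 elements of the RDD, which are not sorted by frequency.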
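The empty-string entry in the output comes from the way `split(" ")` treats repeated spaces. The following plain-Python sketch shows the difference and models the same flatMap, map, reduceByKey pipeline without Spark. The sample sentences are illustrative and are not read from the file; `tokenize` and `count_words` are hypothetical helper names.

```python
from collections import Counter

def tokenize(line):
    """Split on any run of whitespace, so no empty tokens are produced."""
    return line.split()

# split(" ") keeps an empty string between two consecutive spaces; split() does not.
sample = "En un lugar  de la Mancha"  # note the double space
naive_tokens = sample.split(" ")
clean_tokens = tokenize(sample)

def count_words(lines):
    """Single-machine model of flatMap(tokenize) -> map((w, 1)) -> reduceByKey(+)."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

# The two most frequent words in a tiny illustrative corpus.
top = count_words(["de la Mancha", "de  la", "de"]).most_common(2)

print(naive_tokens)  # ['En', 'un', 'lugar', '', 'de', 'la', 'Mancha']
print(clean_tokens)
print(top)
```

With the RDD from the cell above, `file.flatMap(tokenize)` should apply the same splitting, and `wordCounts.takeOrdered(10, key=lambda kv: -kv[1])` is one way to get the ten most frequent words rather than an arbitrary ten.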



In [27]:
ficheroVentas = spark.read.csv("/DATA/ventas_zapatos.csv", header=True, sep=";")
ficheroVentas.show()

+----------------+---------------+----------+----------+----------+-------+--------------------+-------------+----------------+------+
|           Fecha|     id_cliente|    nombre|apellido_1|apellido_2|id_prod|         nombre_prod|material_prod|  categoria_prod|precio|
+----------------+---------------+----------+----------+----------+-------+--------------------+-------------+----------------+------+
| 20/04/2019 4:19|995052178892353|   Fabiola|    Méndez|     Ramos| 562972|   Pelusa mercenario|       Gamuza|      Zapatillas|    75|
| 10/02/2019 3:32|528848914440944|   Basileo|    Alonso|   Esteban| 949966| Reforma capitalista|         Goma|Zapatos de tacón|    50|
|08/05/2019 19:11| 53146869174343|    Míriam|  Martínez|   Esteban| 432964|    Zoca coxofemoral|       Gamuza|Zapato de vestir|    70|
|22/04/2019 15:24| 95327509355920|    Teresa|   Garrido|    Castro| 842352|Número Primo cham...|         Goma|         Botines|    50|
|02/01/2019 10:01|560935930327708|   Natalia|    Vargas

Explanation:

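* Read CSV File: Reads "/DATA/ventas_zapatos.csv" into a DataFrame, taking column names from the header row and splitting fields on semicolons. No schema is given or inferred, so every column, including `precio`, is read as a string.
* Show DataFrame: Displays the first 20 rows of the DataFrame (the default), truncating long values such as "Número Primo cham...".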
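Because the CSV reader returns strings by default, dates and prices need an explicit conversion before they can be used in arithmetic or comparisons. The sketch below illustrates this in plain Python with the `csv` module on two rows copied from the output shown above. The raw file layout is an assumption reconstructed from that table, and `parse_row` is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

# Two rows reconstructed from the displayed table (assumed raw layout).
SAMPLE = (
    "Fecha;id_cliente;nombre;apellido_1;apellido_2;id_prod;"
    "nombre_prod;material_prod;categoria_prod;precio\n"
    "20/04/2019 4:19;995052178892353;Fabiola;Méndez;Ramos;562972;"
    "Pelusa mercenario;Gamuza;Zapatillas;75\n"
    "10/02/2019 3:32;528848914440944;Basileo;Alonso;Esteban;949966;"
    "Reforma capitalista;Goma;Zapatos de tacón;50\n"
)

def parse_row(row):
    """Convert the string fields that carry a date and a number into typed values."""
    typed = dict(row)
    typed["Fecha"] = datetime.strptime(row["Fecha"].strip(), "%d/%m/%Y %H:%M")
    typed["precio"] = int(row["precio"])
    return typed

# Every field starts out as a string, just as in the Spark DataFrame.
raw_rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
typed_rows = [parse_row(r) for r in raw_rows]

print(type(raw_rows[0]["precio"]).__name__, type(typed_rows[0]["precio"]).__name__)
```

In Spark, passing `inferSchema=True` to `spark.read.csv` or casting individual columns would play the role that `parse_row` plays here.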

In [31]:
importePorMaterial = ficheroVentas.rdd.map(lambda row: (row["material_prod"], int(row["precio"]))).reduceByKey(lambda a, b: a + b)
importePorMaterial.toDF().show()


+----------+--------+
|        _1|      _2|
+----------+--------+
|    Gamuza|21056140|
|     Cuero|13324110|
|      Tela| 9974005|
|      Goma| 9629235|
|Sintéticos| 2175200|
+----------+--------+



Explanation:

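* Map and Reduce: Maps each row to a (material, price) pair, converting `precio` to an integer because the CSV columns were read as strings, and then reduces by key to sum the prices for each material.
* Convert and Show: Converts the result back to a DataFrame and displays it. No column names are passed to `toDF()`, so the columns get the default names `_1` and `_2`.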
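To make the reduce step concrete, here is a single-machine sketch of what `reduceByKey` computes, applied to the material and price of the four complete rows visible in the CSV output above. `reduce_by_key` is a hypothetical helper that only models the semantics; it is not how Spark implements the operation.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, one pair at a time."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

# (material_prod, precio) taken from the four complete rows shown earlier;
# prices are strings because that is how the CSV columns were read.
rows = [("Gamuza", "75"), ("Goma", "50"), ("Gamuza", "70"), ("Goma", "50")]

# Same two steps as the notebook cell: cast to int, then sum per material.
pairs = [(material, int(price)) for material, price in rows]
totals = reduce_by_key(pairs, lambda a, b: a + b)

print(totals)  # {'Gamuza': 145, 'Goma': 100}
```

Spark applies the function within each partition and then across partitions, so the function should be associative and commutative, as addition is. To get readable column names in the final DataFrame, column names can be passed to `toDF`, for example `importePorMaterial.toDF(["material", "importe"])`, where both names are illustrative.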