# Spark Shoe Sales Analysis
## Summary



1.   **Initialize Spark Session:** A Spark session is created for local processing.
2.   **Read CSV Data:** The shoe sales data from a CSV file is loaded into a DataFrame.
3.   **Create Temporary View:** The DataFrame is registered as a temporary view for SQL queries.
4.   **Convert Timestamp:** The "fecha" column is converted to a timestamp format.
5.   **Create DataMart for Materials:** A DataFrame with unique material IDs and names is created and registered as a view.
6.   **Create DataMart for Categories:** A DataFrame with unique category IDs and names is created and registered as a view.
7.   **Join Data for Products:** A comprehensive DataFrame is created by joining the sales data with the materials and categories DataFrames, resulting in a structured view of products with their corresponding IDs and details.
8.   **Aggregate Sales Data by Year and Month:** Sales data is aggregated by year, month, and product ID, and stored in a new DataFrame.
9.   **Save DataMarts to CSV:** The aggregated sales data and DataFrames for products, materials, and categories are saved as CSV files to Google Drive.
10.  **Reload and Verify Data:** The CSV files are read back into Spark DataFrames, registered as temporary views, and verified by displaying a few rows.
11.  **Further Aggregation and Analysis:** Additional SQL queries are executed to aggregate and analyze sales data by year, month, and material.
12.  **Create Clients and Sales by Client Views:** Distinct client information and aggregated sales by client are created and saved to CSV.
13.  **Reload and Verify Clients and Sales by Client Data:** The saved CSV files are read back into Spark DataFrames, registered as temporary views, and verified.
14.  **Aggregate Sales by Client:** SQL queries aggregate sales by client per year and month, and specifically for clients named "Gloria".

In [15]:
# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# download Spark (uncomment if the archive is not already in the current folder;
# change the version number if needed)
#!wget -q https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# unpack the Spark archive into the current folder
!tar xf spark-3.5.1-bin-hadoop3.tgz

# point the environment at the Java and Spark installations
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

# install findspark using pip
!pip install -q findspark

# let findspark locate Spark, then import PySpark
import findspark
findspark.init()
from pyspark.sql import SparkSession

# access Google Drive
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/Colab Notebooks')
os.listdir()


gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


['S07L02-Gustvao.ipynb',
 'Untitled0.ipynb',
 'S12L01.ipynb',
 'S12L02.ipynb',
 'Machine_Learning_Lab03_Action_Learning_Notebook',
 'S12L03.ipynb',
 'S12L04.ipynb',
 'Untitled1.ipynb',
 'Cópia de Spark Setup and Basic Data Processing in PySpark.ipynb',
 'Spark Setup and Basic Data Processing in PySpark.ipynb',
 'DATA',
 '2 - spark_shoe_sales_analysis.ipynb.ipynb',
 'spark-3.5.1-bin-hadoop3.tgz',
 'spark-3.5.1-bin-hadoop3']

In [16]:
spark = SparkSession.builder.appName("Análise de Ventas de Zapatos").master("local[*]").getOrCreate()

Explanation:

*   This cell initializes a Spark session named "Análise de Ventas de Zapatos".
*   master("local[*]") specifies that Spark should run locally using all available cores.



In [17]:
df = spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/ventas_zapatos.csv', header=True, inferSchema=True, sep=';')
df.printSchema()

root
 |-- Fecha: string (nullable = true)
 |-- id_cliente: string (nullable = true)
 |-- nombre: string (nullable = true)
 |-- apellido_1: string (nullable = true)
 |-- apellido_2: string (nullable = true)
 |-- id_prod: integer (nullable = true)
 |-- nombre_prod: string (nullable = true)
 |-- material_prod: string (nullable = true)
 |-- categoria_prod: string (nullable = true)
 |-- precio: integer (nullable = true)



Explanation:

*   A CSV file named ventas_zapatos.csv is read from Google Drive into a DataFrame df.
*   header=True specifies that the CSV file has a header row.
*   inferSchema=True infers the data types of the columns.
*   sep=';' specifies that the CSV uses semicolons as separators.
*   printSchema() displays the schema of the DataFrame.
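
The effect of the header and separator options can be checked without a Spark session. The sketch below uses Python's standard csv module, not the Spark reader, on a two-line sample whose data row is taken from the notebook output (only some columns are kept, which is an assumption made for brevity).

```python
import csv
import io

# Sample in the same layout as ventas_zapatos.csv: header row, ';' separator.
# The data row comes from the notebook output; several columns are omitted.
sample = (
    "Fecha;id_cliente;nombre;id_prod;precio\n"
    "20/04/2019 4:19;995052178892353;Fabiola;562972;75\n"
)

# DictReader treats the first line as the header (like header=True),
# and delimiter=';' plays the role of sep=';'.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))

# Unlike inferSchema=True, the csv module leaves every value as a string,
# so numeric columns must be converted explicitly.
precio = int(rows[0]["precio"])
print(rows[0]["nombre"], precio)
```

Note that Fecha is still a plain string at this point, which matches the schema printed above.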






In [18]:
df.createOrReplaceTempView("ventas")

Explanation:

*   This creates a temporary view called "ventas" from the DataFrame df, allowing SQL queries to be run against it.




In [19]:
spark.sql("SELECT * FROM VENTAS").show()

+----------------+---------------+----------+----------+----------+-------+--------------------+-------------+----------------+------+
|           Fecha|     id_cliente|    nombre|apellido_1|apellido_2|id_prod|         nombre_prod|material_prod|  categoria_prod|precio|
+----------------+---------------+----------+----------+----------+-------+--------------------+-------------+----------------+------+
| 20/04/2019 4:19|995052178892353|   Fabiola|    Méndez|     Ramos| 562972|   Pelusa mercenario|       Gamuza|      Zapatillas|    75|
| 10/02/2019 3:32|528848914440944|   Basileo|    Alonso|   Esteban| 949966| Reforma capitalista|         Goma|Zapatos de tacón|    50|
|08/05/2019 19:11| 53146869174343|    Míriam|  Martínez|   Esteban| 432964|    Zoca coxofemoral|       Gamuza|Zapato de vestir|    70|
|22/04/2019 15:24| 95327509355920|    Teresa|   Garrido|    Castro| 842352|Número Primo cham...|         Goma|         Botines|    50|
|02/01/2019 10:01|560935930327708|   Natalia|    Vargas

In [20]:
from pyspark.sql.functions import to_timestamp
df = df.withColumn("fecha", to_timestamp(df["fecha"], "dd/MM/yyyy H:mm"))
df.createOrReplaceTempView("ventas")
spark.sql("SELECT * FROM VENTAS LIMIT 10").show()
df.printSchema()

+-------------------+---------------+---------+----------+----------+-------+--------------------+-------------+----------------+------+
|              fecha|     id_cliente|   nombre|apellido_1|apellido_2|id_prod|         nombre_prod|material_prod|  categoria_prod|precio|
+-------------------+---------------+---------+----------+----------+-------+--------------------+-------------+----------------+------+
|2019-04-20 04:19:00|995052178892353|  Fabiola|    Méndez|     Ramos| 562972|   Pelusa mercenario|       Gamuza|      Zapatillas|    75|
|2019-02-10 03:32:00|528848914440944|  Basileo|    Alonso|   Esteban| 949966| Reforma capitalista|         Goma|Zapatos de tacón|    50|
|2019-05-08 19:11:00| 53146869174343|   Míriam|  Martínez|   Esteban| 432964|    Zoca coxofemoral|       Gamuza|Zapato de vestir|    70|
|2019-04-22 15:24:00| 95327509355920|   Teresa|   Garrido|    Castro| 842352|Número Primo cham...|         Goma|         Botines|    50|
|2019-01-02 10:01:00|560935930327708|  Na

Explanation:

*   The to_timestamp function is imported.
*   The "fecha" column is converted from a string to a timestamp using the pattern "dd/MM/yyyy H:mm", where H accepts hours written with one or two digits.
*   The temporary view "ventas" is recreated with the updated DataFrame.
*   A SQL query selects the first 10 rows from the "ventas" view and displays them.
*   printSchema() shows the updated schema of the DataFrame.
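
To see what the pattern does to individual values, the same parsing can be reproduced with Python's datetime module. This is only an analogue of the Spark call: the format string "%d/%m/%Y %H:%M" is the Python counterpart assumed here for "dd/MM/yyyy H:mm", and the two input strings are taken from the notebook output.

```python
from datetime import datetime

# Python counterpart of the Spark pattern "dd/MM/yyyy H:mm".
# %H accepts both "4" and "04", so single-digit hours parse correctly.
PY_FORMAT = "%d/%m/%Y %H:%M"

raw_values = ["20/04/2019 4:19", "08/05/2019 19:11"]
parsed = [datetime.strptime(value, PY_FORMAT) for value in raw_values]

for value, ts in zip(raw_values, parsed):
    print(f"{value!r} -> {ts.isoformat(sep=' ')}")
```

The day comes first in this data, so a month-first pattern would silently swap day and month for dates such as 10/02/2019.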





In [22]:
# Create a DataMart
materiales = spark.sql("SELECT DISTINCT material_prod FROM VENTAS")
materiales.createOrReplaceTempView("materiales")
spark.sql("SELECT row_number () over (order by material_prod) as id, material_prod as nombre from materiales").createOrReplaceTempView("materiales")
spark.sql("SELECT * FROM MATERIALES").show()

+---+----------+
| id|    nombre|
+---+----------+
|  1|     Cuero|
|  2|    Gamuza|
|  3|      Goma|
|  4|Sintéticos|
|  5|      Tela|
+---+----------+



Explanation:

*   A SQL query selects distinct material_prod values from the "ventas" view, creating a DataFrame materiales.
*   A temporary view "materiales" is created.
*   A new SQL query assigns a unique ID to each material_prod and renames it to nombre, updating the "materiales" view.
*   The contents of the "materiales" view are displayed.
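
The ID assignment can be sketched in plain Python to make the logic explicit: distinct values, sorted by name, numbered from 1. The list of raw values below is hypothetical (shuffled, with duplicates) but uses the five material names from the output above.

```python
# Hypothetical raw column values: shuffled and with duplicates,
# as they might appear across individual sales rows.
material_prod = ["Goma", "Gamuza", "Tela", "Cuero", "Gamuza", "Sintéticos", "Goma"]

# SELECT DISTINCT            -> set()
# ORDER BY material_prod     -> sorted()
# row_number() starting at 1 -> enumerate(..., start=1)
materiales = [
    {"id": i, "nombre": nombre}
    for i, nombre in enumerate(sorted(set(material_prod)), start=1)
]

for fila in materiales:
    print(fila["id"], fila["nombre"])
```

Because the IDs depend on alphabetical position, adding a new material later can shift the IDs of existing ones; this is worth keeping in mind if the datamart is ever regenerated.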


In [23]:
categoria = spark.sql("SELECT DISTINCT categoria_prod FROM VENTAS")
categoria.createOrReplaceTempView("categoria")
spark.sql("SELECT row_number () over (order by categoria_prod) as id, categoria_prod as nombre from categoria").createOrReplaceTempView("categoria")
spark.sql("SELECT * FROM CATEGORIA").show()

+---+----------------+
| id|          nombre|
+---+----------------+
|  1|         Botines|
|  2|      Zapatillas|
|  3|Zapato de vestir|
|  4| Zapatos de agua|
|  5|Zapatos de tacón|
+---+----------------+



Explanation:

*   A SQL query selects distinct categoria_prod values from the "ventas" view, creating a DataFrame categoria.
*   A temporary view "categoria" is created from categoria.
*   A new SQL query assigns a unique ID to each categoria_prod and renames it to nombre, updating the "categoria" view.
*   The contents of the "categoria" view are displayed.

In [24]:
productos = spark.sql ("""
    select distinct id_prod as id,  nombre_prod as nombre, m.id as id_material, c.id as id_categoria, precio
    from ventas
    join materiales m on m.nombre = ventas.material_prod
    join categoria c on c.nombre = ventas.categoria_prod
    order by id
""")
productos.createOrReplaceTempView("productos")
productos.show()

+------+--------------------+-----------+------------+------+
|    id|              nombre|id_material|id_categoria|precio|
+------+--------------------+-----------+------------+------+
|  2809| Ablación viñamarino|          5|           1|    40|
|  4077|      Tribu agonista|          5|           4|    45|
|  6189|    Bóiler ciudadana|          2|           5|    65|
|  9171|Cuatrillón partur...|          2|           2|    75|
| 10902|  Xacena tetrasílabo|          3|           1|    45|
| 29075|  Repentista chicoco|          1|           5|    85|
| 34388|     Argucia diestro|          2|           2|    65|
| 36922|    Criancia íntegro|          5|           5|    40|
| 39377|    Complice acetoso|          5|           5|    45|
| 52175|  Carameleo leguleyo|          1|           2|    90|
| 58759|     Horca dilatable|          2|           4|    65|
| 70231|  Bizantino parlante|          5|           5|    35|
| 71840|Materia Inorgánic...|          2|           4|    65|
| 75896|

Explanation:

*   A SQL query is run to select distinct product details from the "ventas" view.
*   The query joins the "ventas" view with the "materiales" and "categoria" views to get the material and category IDs.
*   The resulting DataFrame productos includes columns for product ID (id), product name (nombre), material ID (id_material), category ID (id_categoria), and price (precio).
*   This DataFrame is then registered as a temporary view called "productos".
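
The join logic can be tried on a tiny dataset with Python's built-in sqlite3 module, used here only as a stand-in for Spark SQL. The rows and IDs are taken from the notebook outputs; the duplicated sale is added as an assumption to show why DISTINCT is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ventas (id_prod INTEGER, nombre_prod TEXT,
                         material_prod TEXT, categoria_prod TEXT, precio INTEGER);
    CREATE TABLE materiales (id INTEGER, nombre TEXT);
    CREATE TABLE categoria (id INTEGER, nombre TEXT);

    -- Two sales of the same product plus one sale of another product.
    INSERT INTO ventas VALUES
        (562972, 'Pelusa mercenario', 'Gamuza', 'Zapatillas', 75),
        (562972, 'Pelusa mercenario', 'Gamuza', 'Zapatillas', 75),
        (949966, 'Reforma capitalista', 'Goma', 'Zapatos de tacón', 50);
    INSERT INTO materiales VALUES (2, 'Gamuza'), (3, 'Goma');
    INSERT INTO categoria VALUES (2, 'Zapatillas'), (5, 'Zapatos de tacón');
""")

# The names in ventas are replaced by the IDs of the lookup tables;
# DISTINCT collapses repeated sales into one row per product.
productos = conn.execute("""
    SELECT DISTINCT v.id_prod AS id, v.nombre_prod AS nombre,
           m.id AS id_material, c.id AS id_categoria, v.precio
    FROM ventas v
    JOIN materiales m ON m.nombre = v.material_prod
    JOIN categoria c ON c.nombre = v.categoria_prod
    ORDER BY v.id_prod
""").fetchall()

for fila in productos:
    print(fila)
```

If the same product were ever sold at two different prices, DISTINCT would keep both rows, so id would no longer be unique in productos.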

In [25]:
ventas_final = spark.sql("""
          SELECT date_part ('year', Fecha) as anyo, date_part('months', Fecha) as mes, id_prod, sum(precio) as importe
          from ventas
          group by anyo, mes, id_prod
          order by anyo, mes
          """)
ventas_final.createOrReplaceTempView("ventas_final")
ventas_final.show()

+----+---+-------+-------+
|anyo|mes|id_prod|importe|
+----+---+-------+-------+
|2018| 12| 953060|     30|
|2019|  1|  86091|  11180|
|2019|  1| 977219|   6840|
|2019|  1| 104454|   5460|
|2019|  1| 308114|  11025|
|2019|  1| 844797|  10650|
|2019|  1| 619589|  14915|
|2019|  1| 612297|  13410|
|2019|  1|   6189|  10465|
|2019|  1| 753623|   7875|
|2019|  1| 432964|  11200|
|2019|  1| 904052|   5560|
|2019|  1| 144628|   5635|
|2019|  1| 883449|  11700|
|2019|  1| 851349|   5460|
|2019|  1| 371293|   5775|
|2019|  1| 320893|   3650|
|2019|  1| 356608|  12750|
|2019|  1|  34388|   9620|
|2019|  1| 414511|   6880|
+----+---+-------+-------+
only showing top 20 rows



Explanation:

*   A SQL query is executed to aggregate sales data by year, month, and product ID.
*   The total precio is summed for each group, and the results are stored in ventas_final.
*   The ventas_final DataFrame is registered as a temporary view and displayed.
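
As a minimal sketch of the same grouping in plain Python (no Spark involved), the aggregation amounts to summing precio under a (year, month, product) key. The sale dates below are hypothetical; the product IDs and prices are those of products 6189 and 2809 in the productos output.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sales rows: (fecha, id_prod, precio).
ventas = [
    (datetime(2019, 1, 2, 10, 1), 6189, 65),
    (datetime(2019, 1, 15, 9, 30), 6189, 65),
    (datetime(2019, 2, 3, 18, 45), 6189, 65),
    (datetime(2019, 1, 20, 12, 0), 2809, 40),
]

# GROUP BY anyo, mes, id_prod with SUM(precio).
importe = defaultdict(int)
for fecha, id_prod, precio in ventas:
    importe[(fecha.year, fecha.month, id_prod)] += precio

# ORDER BY anyo, mes (ties are broken by id_prod here for a stable result).
ventas_final = sorted(
    (anyo, mes, id_prod, total)
    for (anyo, mes, id_prod), total in importe.items()
)

for fila in ventas_final:
    print(fila)
```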

In [26]:
spark.sql("SELECT date_part ('year', Fecha) as anyo, date_part('months', Fecha) as mes from ventas order by fecha").show()

+----+---+
|anyo|mes|
+----+---+
|2018| 12|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
|2019|  1|
+----+---+
only showing top 20 rows



Explanation:

*   A SQL query selects the year and month from the fecha column in the ventas view and orders the results by fecha.

In [27]:
productos_final = spark.sql ("select * from productos")
materiales_final = spark.sql ("select * from materiales")
categoria_final = spark.sql ("select * from categoria")


Explanation:

*   SQL queries select all records from the productos, materiales, and categoria views, creating DataFrames productos_final, materiales_final, and categoria_final.


In [28]:
ventas_final.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").option("sep", ";").mode("overwrite").save("/content/drive/My Drive/Colab Notebooks/DATA/datamart_ventas.csv")
productos_final.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").option("sep", ";").mode("overwrite").save("/content/drive/My Drive/Colab Notebooks/DATA/datamart_productos.csv")
materiales_final.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").option("sep", ";").mode("overwrite").save("/content/drive/My Drive/Colab Notebooks/DATA/datamart_materiales.csv")
categoria_final.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").option("sep", ";").mode("overwrite").save("/content/drive/My Drive/Colab Notebooks/DATA/datamart_categoria.csv")

Explanation:

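*   The ventas_final, productos_final, materiales_final, and categoria_final DataFrames are saved as CSV files to Google Drive.
*   repartition(1) collapses the data into one partition, so each save path becomes a directory (for example datamart_ventas.csv) that contains a single part file rather than several.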
*   CSV files are written with headers and semicolons as separators, overwriting any existing files.
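
Since each save path is a directory, tools other than Spark need the path of the part file inside it. The helper below is a hypothetical convenience function, not part of the notebook; it assumes the usual layout of one part-*.csv file plus a _SUCCESS marker and simulates that layout in a temporary folder.

```python
import glob
import os
import tempfile


def find_part_file(output_dir):
    """Return the path of the single part-*.csv file in a Spark output directory."""
    matches = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one part file, found {len(matches)}")
    return matches[0]


# Simulate the layout produced by .save(".../datamart_ventas.csv"):
# a directory holding one part file and a _SUCCESS marker.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "datamart_ventas.csv")
    os.makedirs(out_dir)
    part_name = "part-00000-example-c000.csv"  # real names contain a UUID
    with open(os.path.join(out_dir, part_name), "w", encoding="utf-8") as f:
        f.write("anyo;mes;id_prod;importe\n2019;1;6189;10465\n")
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()

    found = os.path.basename(find_part_file(out_dir))
    print(found)
```

Spark itself does not need this helper: spark.read.csv accepts the directory path directly, which is what the next cell relies on.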

In [29]:
#ventas
spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/datamart_ventas.csv', header=True, inferSchema=True, sep=';').createOrReplaceTempView("datamart_ventas")
spark.sql("SELECT * FROM datamart_ventas order by id_prod  limit 5 ").show()
#productos
spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/datamart_productos.csv', header=True, inferSchema=True, sep=';').createOrReplaceTempView("datamart_productos")
spark.sql("SELECT * FROM datamart_productos order by id limit 5 ").show()
#materiales
spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/datamart_materiales.csv', header=True, inferSchema=True, sep=';').createOrReplaceTempView("datamart_materiales")
spark.sql("SELECT * FROM datamart_materiales limit 5").show()
#categoria
spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/datamart_categoria.csv', header=True, inferSchema=True, sep=';').createOrReplaceTempView("datamart_categoria")
spark.sql("SELECT * FROM datamart_categoria limit 5").show()

+----+---+-------+-------+
|anyo|mes|id_prod|importe|
+----+---+-------+-------+
|2019|  4|   2809|   6400|
|2019|  8|   2809|   6480|
|2019|  5|   2809|   5920|
|2019|  1|   2809|   5520|
|2019|  6|   2809|   6000|
+----+---+-------+-------+

+-----+--------------------+-----------+------------+------+
|   id|              nombre|id_material|id_categoria|precio|
+-----+--------------------+-----------+------------+------+
| 2809| Ablación viñamarino|          5|           1|    40|
| 4077|      Tribu agonista|          5|           4|    45|
| 6189|    Bóiler ciudadana|          2|           5|    65|
| 9171|Cuatrillón partur...|          2|           2|    75|
|10902|  Xacena tetrasílabo|          3|           1|    45|
+-----+--------------------+-----------+------------+------+

+---+----------+
| id|    nombre|
+---+----------+
|  1|     Cuero|
|  2|    Gamuza|
|  3|      Goma|
|  4|Sintéticos|
|  5|      Tela|
+---+----------+

+---+----------------+
| id|          nombre|
+---+-

Explanation:

*   CSV files for datamart_ventas, datamart_productos, datamart_materiales, and datamart_categoria are read back into Spark DataFrames and registered as temporary views.
*   SQL queries display the first 5 rows of each view to verify the data.

In [30]:
# prompt: Find the month with the most sales

spark.sql("""
          SELECT anyo, mes, sum(importe) as importe
          from datamart_ventas
          group by anyo, mes
          order by importe desc
          """).show()


+----+---+-------+
|anyo|mes|importe|
+----+---+-------+
|2021|  1|1738965|
|2020|  7|1735585|
|2019| 10|1735560|
|2021|  8|1733365|
|2020|  8|1732230|
|2019|  7|1731160|
|2019|  5|1729250|
|2020|  1|1728950|
|2021|  7|1728310|
|2021|  5|1727315|
|2021|  3|1726540|
|2019|  1|1725765|
|2020|  5|1725690|
|2019| 12|1725650|
|2020| 10|1724070|
|2019|  8|1722155|
|2019|  3|1721030|
|2020|  3|1717350|
|2020| 12|1708645|
|2020|  6|1691180|
+----+---+-------+
only showing top 20 rows



Explanation:

*   A SQL query aggregates sales (importe) by year and month from the datamart_ventas view.
*   The results are ordered by importe in descending order and displayed.
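
Because the result is sorted in descending order, the best month is simply the first row; adding LIMIT 1 would return only that row. A small sketch with Python's sqlite3 module (a stand-in for Spark SQL) illustrates this, using three rows of ventas_final from the earlier output; the alias total is an assumption chosen to avoid reusing the column name importe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datamart_ventas "
    "(anyo INTEGER, mes INTEGER, id_prod INTEGER, importe INTEGER)"
)
conn.executemany(
    "INSERT INTO datamart_ventas VALUES (?, ?, ?, ?)",
    [
        (2018, 12, 953060, 30),
        (2019, 1, 86091, 11180),
        (2019, 1, 977219, 6840),
    ],
)

# Same aggregation as above, keeping only the top month.
top = conn.execute("""
    SELECT anyo, mes, SUM(importe) AS total
    FROM datamart_ventas
    GROUP BY anyo, mes
    ORDER BY total DESC
    LIMIT 1
""").fetchone()

print(top)
```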

In [31]:
spark.sql("""
          SELECT dv.anyo, dv.mes, dm.nombre as material, sum(dv.importe) as importe
          from datamart_ventas dv
          join datamart_productos dp on dp.id = dv.id_prod
          join datamart_materiales dm on dp.id_material = dm.id
          group by 1,2,3
          order by 1,2,3
          """).show()


+----+---+----------+-------+
|anyo|mes|  material|importe|
+----+---+----------+-------+
|2018| 12|Sintéticos|     30|
|2019|  1|     Cuero| 408565|
|2019|  1|    Gamuza| 643520|
|2019|  1|      Goma| 299225|
|2019|  1|Sintéticos|  65950|
|2019|  1|      Tela| 308505|
|2019|  2|     Cuero| 364520|
|2019|  2|    Gamuza| 584570|
|2019|  2|      Goma| 270560|
|2019|  2|Sintéticos|  57460|
|2019|  2|      Tela| 277155|
|2019|  3|     Cuero| 408970|
|2019|  3|    Gamuza| 644515|
|2019|  3|      Goma| 291860|
|2019|  3|Sintéticos|  68015|
|2019|  3|      Tela| 307670|
|2019|  4|     Cuero| 391600|
|2019|  4|    Gamuza| 628585|
|2019|  4|      Goma| 286035|
|2019|  4|Sintéticos|  63925|
+----+---+----------+-------+
only showing top 20 rows



Explanation:

*   A SQL query aggregates sales (importe) by year, month, and material from the datamart_ventas, datamart_productos, and datamart_materiales views.
*   The results are grouped and ordered by year, month, and material, and then displayed.

In [34]:
spark.sql("select * from ventas limit 10").show()

+-------------------+---------------+---------+----------+----------+-------+--------------------+-------------+----------------+------+
|              fecha|     id_cliente|   nombre|apellido_1|apellido_2|id_prod|         nombre_prod|material_prod|  categoria_prod|precio|
+-------------------+---------------+---------+----------+----------+-------+--------------------+-------------+----------------+------+
|2019-04-20 04:19:00|995052178892353|  Fabiola|    Méndez|     Ramos| 562972|   Pelusa mercenario|       Gamuza|      Zapatillas|    75|
|2019-02-10 03:32:00|528848914440944|  Basileo|    Alonso|   Esteban| 949966| Reforma capitalista|         Goma|Zapatos de tacón|    50|
|2019-05-08 19:11:00| 53146869174343|   Míriam|  Martínez|   Esteban| 432964|    Zoca coxofemoral|       Gamuza|Zapato de vestir|    70|
|2019-04-22 15:24:00| 95327509355920|   Teresa|   Garrido|    Castro| 842352|Número Primo cham...|         Goma|         Botines|    50|
|2019-01-02 10:01:00|560935930327708|  Na

In [37]:
clientes = spark.sql("""
          select distinct id_cliente, nombre, apellido_1, apellido_2
          from ventas
""")
clientes.createOrReplaceTempView("clientes")
clientes.show()

+---------------+---------+----------+----------+
|     id_cliente|   nombre|apellido_1|apellido_2|
+---------------+---------+----------+----------+
| 80428676143833|  Dolores|    Cortés| Hernández|
| 29193624811735|    Josep|     Calvo|    Crespo|
|124469852051349| Atanasio|     Calvo|    Cortés|
| 31741117385487| Fernando|      Díez|      Mora|
|604051495291342|  Leoncio|    Torres|      Díez|
|731814709223377|   Matías|     Durán|    Pastor|
| 40724899055494|     Marc|      Sáez|     Arias|
|163647970819336|Alejandro|   Velasco|    Prieto|
|922460409347428|     Tito|  Castillo|     Román|
|308528434681532|   Raquel|      Sáez|     Pérez|
|608704101743155|  Vicenta|   Pascual|     Núñez|
|421713320345231|  Soledad|   Carmona|      Soto|
|495319712613149|Adalberto|   Fuentes|    Santos|
|707278997709905|Magdalena|   Ramírez|   Velasco|
|571062507452369|    Pilar|    Ibáñez|      Sáez|
|273047001510411|     #N/D|   Herrero|      Cruz|
| 81360272849621| Abelardo|    Ibáñez|   Velasco|


Explanation:

*   A SQL query selects distinct id_cliente, nombre, apellido_1, and apellido_2 from the ventas view.
*   This DataFrame is registered as a temporary view named clientes.
*   The contents of the clientes DataFrame are displayed.

In [41]:
ventas_cliente = spark.sql("""
          select date_part('year', fecha) as anyo, date_part('month', fecha) as mes, id_cliente, sum(precio) as importe
          from ventas
          where id_cliente <> '#N/D'
          group by 1,2,3
          order by 1,2,3
""")
ventas_cliente.createOrReplaceTempView("ventas_cliente")
ventas_cliente.show()

+----+---+---------------+-------+
|anyo|mes|     id_cliente|importe|
+----+---+---------------+-------+
|2018| 12|731814709223377|     30|
|2019|  1|100434974821163|   1125|
|2019|  1|100609367711644|    925|
|2019|  1|101222940698139|   1080|
|2019|  1|101336235770802|   1135|
|2019|  1| 10225677955338|    680|
|2019|  1|102639691840435|   1005|
|2019|  1|103340889464728|    500|
|2019|  1|103574152138567|    495|
|2019|  1|104296523397751|   1040|
|2019|  1|104419325532001|    915|
|2019|  1|104881179041071|    670|
|2019|  1|104969260310593|    470|
|2019|  1|  1052752142170|   1150|
|2019|  1|105982744315977|   1140|
|2019|  1|106022545467425|    920|
|2019|  1|106168163816287|    790|
|2019|  1|106227527630542|    990|
|2019|  1|106342931776064|    600|
|2019|  1|106417570310974|    765|
+----+---+---------------+-------+
only showing top 20 rows



Explanation:

*   A SQL query aggregates sales (the sum of precio) by year, month, and id_cliente from the ventas view, excluding rows whose id_cliente is '#N/D'.
*   The resulting DataFrame is registered as a temporary view named ventas_cliente.
*   The results are grouped and ordered by year, month, and id_cliente, and then displayed.

In [43]:
clientes.repartition(1). write.format("com.databricks.spark.csv").option("header", "true").option("sep", ";").mode("overwrite").save("/content/drive/My Drive/Colab Notebooks/DATA/datamart_clientes.csv")
ventas_cliente.repartition(1). write.format("com.databricks.spark.csv").option("header", "true").option("sep", ";").mode("overwrite").save("/content/drive/My Drive/Colab Notebooks/DATA/datamart_ventas_cliente.csv")

#clientes
spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/datamart_clientes.csv', header=True, inferSchema=True, sep=';').createOrReplaceTempView("datamart_clientes")
spark.sql("SELECT * FROM datamart_clientes limit 5 ").show()
#ventas_cliente
spark.read.csv('/content/drive/My Drive/Colab Notebooks/DATA/datamart_ventas_cliente.csv', header=True, inferSchema=True, sep=';').createOrReplaceTempView("datamart_ventas_cliente")
spark.sql("SELECT * FROM datamart_ventas_cliente limit 5").show()


+---------------+--------+----------+----------+
|     id_cliente|  nombre|apellido_1|apellido_2|
+---------------+--------+----------+----------+
| 80428676143833| Dolores|    Cortés| Hernández|
| 29193624811735|   Josep|     Calvo|    Crespo|
|124469852051349|Atanasio|     Calvo|    Cortés|
| 31741117385487|Fernando|      Díez|      Mora|
|604051495291342| Leoncio|    Torres|      Díez|
+---------------+--------+----------+----------+

+----+---+---------------+-------+
|anyo|mes|     id_cliente|importe|
+----+---+---------------+-------+
|2018| 12|731814709223377|     30|
|2019|  1|100434974821163|   1125|
|2019|  1|100609367711644|    925|
|2019|  1|101222940698139|   1080|
|2019|  1|101336235770802|   1135|
+----+---+---------------+-------+



Explanation:

*   The clientes and ventas_cliente DataFrames are saved as CSV files to Google Drive with headers and semicolons as separators, overwriting any existing files.
*   The saved CSV files datamart_clientes.csv and datamart_ventas_cliente.csv are read back into Spark DataFrames and registered as temporary views.
*   SQL queries display the first 5 rows of each view to verify the data.

Total sales by client and month

In [44]:
spark.sql("""
          select vc.anyo, vc.mes, vc.id_cliente, c.nombre, c.apellido_1, c.apellido_2, sum(vc.importe) as importe
          from ventas_cliente vc
          join clientes c on c.id_cliente=vc.id_cliente
          group by 1,2,3,4,5,6
          order by 1,2,3
""").show()

+----+---+---------------+----------+----------+----------+-------+
|anyo|mes|     id_cliente|    nombre|apellido_1|apellido_2|importe|
+----+---+---------------+----------+----------+----------+-------+
|2018| 12|731814709223377|    Matías|     Durán|    Pastor|     30|
|2019|  1|100434974821163|    Zaqueo| Gutiérrez|   Álvarez|   1125|
|2019|  1|100609367711644|   Teodora|   Pascual|   Fuentes|    925|
|2019|  1|101222940698139|   Matilde|   Carmona|      Ruiz|   1080|
|2019|  1|101336235770802|  Ifigenia|   Vázquez| Domínguez|   1135|
|2019|  1| 10225677955338|  Josefina|    Méndez|    Vargas|    680|
|2019|  1|102639691840435|     Ester|   Sánchez|     Román|   1005|
|2019|  1|103340889464728|  Heraclio|       Rey|    Medina|    500|
|2019|  1|103574152138567|    Acacio|    Cortés|    Flores|    495|
|2019|  1|104296523397751|   Jacinto|    Ferrer|  Santiago|   1040|
|2019|  1|104419325532001| Fulgencio|      Sáez|      Vega|    915|
|2019|  1|104881179041071|      Olga|     Ortiz|

Explanation:

*   A SQL query aggregates sales (importe) by year, month, and client ID.
*   The query joins ventas_cliente with clientes to include client details.
*   The results are grouped and ordered by year, month, and client ID, displaying the aggregated sales for each client per year and month.

Total sales for clients named Gloria

In [51]:
spark.sql("""
    SELECT vc.id_cliente, c.nombre, c.apellido_1, c.apellido_2, SUM(vc.importe) AS importe
    FROM ventas_cliente vc
    JOIN clientes c ON c.id_cliente = vc.id_cliente
    WHERE c.nombre = 'Gloria'
    GROUP BY vc.id_cliente, c.nombre, c.apellido_1, c.apellido_2
    ORDER BY vc.id_cliente
""").show()


+---------------+------+----------+----------+-------+
|     id_cliente|nombre|apellido_1|apellido_2|importe|
+---------------+------+----------+----------+-------+
|112173292754148|Gloria|    Romero|    Pastor|  27660|
|151275648428666|Gloria|    Castro|     Román|  29490|
|392580240294815|Gloria|    Suárez|       Rey|  29420|
|543508035906917|Gloria|  Martínez|   Hidalgo|  26215|
|802750163317303|Gloria|       Rey| Fernández|  29010|
+---------------+------+----------+----------+-------+



Explanation:

*   A SQL query aggregates sales (importe) for clients named "Gloria".
*   The query joins ventas_cliente with clientes to include client details.
*   The results are grouped by client ID and client details, displaying the aggregated sales for each client named "Gloria".
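
The filter nombre = 'Gloria' is an exact, case-sensitive match. If the data might contain other spellings, comparing on a lowercased value is one option. The sketch below demonstrates the difference with Python's sqlite3 module as a stand-in for Spark SQL; the "GLORIA" row and its ID are hypothetical, the other two rows come from the notebook output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clientes (id_cliente TEXT, nombre TEXT)")
conn.executemany(
    "INSERT INTO clientes VALUES (?, ?)",
    [
        ("112173292754148", "Gloria"),   # from the notebook output
        ("900000000000001", "GLORIA"),   # hypothetical variant spelling
        ("80428676143833", "Dolores"),   # from the notebook output
    ],
)

# Exact match: only the row spelled exactly 'Gloria'.
exact = conn.execute(
    "SELECT id_cliente FROM clientes WHERE nombre = 'Gloria' ORDER BY id_cliente"
).fetchall()

# Case-insensitive match: lowercase the column before comparing.
relaxed = conn.execute(
    "SELECT id_cliente FROM clientes WHERE lower(nombre) = 'gloria' ORDER BY id_cliente"
).fetchall()

print(exact)
print(relaxed)
```

Spark SQL also provides a lower function, so the same WHERE clause could be tried against the clientes view.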