# Example 2.1: M&Ms

For these exercises I downloaded the file "mnm_dataset.csv" from the GitHub repository:
https://github.com/databricks/LearningSparkV2/tree/master/chapter2/py/src/data

### Python

    To run this part:
     Kernel -> Change kernel -> Python 3

Import the required libraries:

In [1]:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import count
from pyspark.sql import functions as F

Create a SparkSession:

In [2]:
spark = (SparkSession
        .builder
        .appName("PythonMnMCount")
        .getOrCreate())

Read the data into a Spark DataFrame; the file has a header row and its columns are comma-separated:

In [3]:
mnm ="C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/mnm_dataset.csv"

In [4]:
mnm_df = (spark.read.format("csv")
 .option("header", "true")
 .option("inferSchema", "true")
 .load(mnm))
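
As a rough illustration of what the two reader options do, the sketch below parses a few rows with Python's standard `csv` module rather than Spark. The sample rows are the five displayed by `show` in the next cell; the helper name `read_with_header` and the all-digits integer conversion are assumptions made for this example, and Spark's own schema inference is more general than this.

```python
import csv
import io

# First rows of mnm_dataset.csv, as displayed by mnm_df.show(n=5).
SAMPLE_CSV = """State,Color,Count
TX,Red,20
NV,Blue,66
CO,Blue,79
OR,Blue,71
WA,Yellow,93
"""

def read_with_header(text):
    """Parse CSV text, using the first line as column names (like header=true)
    and converting all-digit fields to int (a crude stand-in for inferSchema=true)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({name: int(value) if value.isdigit() else value
                     for name, value in record.items()})
    return rows

rows = read_with_header(SAMPLE_CSV)
print(rows[0])
```

Without the integer conversion every field would stay a string, which is what Spark does when `inferSchema` is left at its default.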


Show the first five rows:

In [5]:
mnm_df.show(n=5, truncate=False)

+-----+------+-----+
|State|Color |Count|
+-----+------+-----+
|TX   |Red   |20   |
|NV   |Blue  |66   |
|CO   |Blue  |79   |
|OR   |Blue  |71   |
|WA   |Yellow|93   |
+-----+------+-----+
only showing top 5 rows



Run a query that counts the rows for each state and color and sorts the groups by that total, in descending order.

In [6]:
count_mnm_df = (mnm_df
               .select("State", "Color", "Count")  # select the three columns
               .groupBy("State", "Color")  # group by State and Color
               .agg(count("Count").alias("Total"))  # count the rows in each group and name the column Total
               .orderBy("Total", ascending=False))  # sort by the number of rows per group, descending
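
Note that `count("Count")` counts the rows in each group; it does not add up the values in the `Count` column. A plain-Python sketch of the same grouping logic, without Spark, may make the distinction clearer. The rows below are made up for illustration (only the column names come from the dataset), and `group_totals` is a hypothetical helper, not part of PySpark.

```python
from collections import Counter

# Hypothetical (State, Color, Count) rows, for illustration only.
rows = [
    ("CA", "Yellow", 20),
    ("CA", "Yellow", 75),
    ("TX", "Red", 50),
    ("CA", "Yellow", 31),
    ("TX", "Red", 12),
    ("WA", "Green", 99),
]

def group_totals(rows):
    """Count rows per (State, Color) and sort by that count, largest first."""
    totals = Counter((state, color) for state, color, _ in rows)
    # Sort on the count (descending), mirroring orderBy("Total", ascending=False).
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

for (state, color), total in group_totals(rows):
    print(state, color, total)
```

Here the `CA`/`Yellow` group gets a total of 3 (three rows), not 126 (the sum of its `Count` values).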

Print the result:

In [7]:
count_mnm_df.show(n=60, truncate=False)
print("Total Rows = %d" % (count_mnm_df.count()))

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|NV   |Green |1698 |
|AZ   |Brown |1698 |
|WY   |Green |1695 |
|CO   |Blue  |1695 |
|NM   |Red   |1690 |
|AZ   |Orange|1689 |
|NM   |Yellow|1688 |
|NM   |Brown |1687 |
|UT   |Orange|1684 |
|NM   |Green |1682 |
|UT   |Red   |1680 |
|AZ   |Green |1676 |
|NV   |Yellow|1675 |
|NV   |Blue  |1673 |
|WA   |Red   |1671 |
|WY   |Red   |1670 |
|WA   |Brown |1669 |
|NM   |Orange|1665 |
|WY   |Blue  |1664 |
|WA   |Yellow|1663 |
|WA   |Orange|1658 |
|CA   |Orange|1657 |
|NV   |Brown |1657 |
|CA   |Red   |1656 |
|CO   |Brown |1656 |
|UT   |Blue  |1655 |
|AZ   |Yellow|1654 |
|TX   |Orange|1652 |
|AZ   |Red   |1648 |
|OR   |Blue  |1646 |
|UT   |Yellow|1645 |
|OR   |Red   |1645 |
|CO   |Orange|1642 |
|TX   |Brown 

This query is the same as the previous one, but it only includes the rows for CA.

In [8]:
ca_count_mnm_df = (mnm_df
                  .select("State", "Color", "Count")  # select the three columns
                  .where(mnm_df.State == "CA")  # keep only the rows for California
                  .groupBy("State", "Color")  # group by State and Color
                  .agg(count("Count").alias("Total"))  # count the rows in each group
                  .orderBy("Total", ascending=False))  # sort by the number of rows per group, descending

Print the result:

In [9]:
ca_count_mnm_df.show(n=10, truncate=False)
print("Total Rows = %d" % (ca_count_mnm_df.count()))

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|CA   |Green |1723 |
|CA   |Brown |1718 |
|CA   |Orange|1657 |
|CA   |Red   |1656 |
|CA   |Blue  |1603 |
+-----+------+-----+

Total Rows = 6


***Extra exercises***

Maximum count per color, first 5 rows

In [11]:
max_col= (mnm_df
          .select("Color","Count")
          .groupBy("Color")
          .agg({"Count":"max"})
          .orderBy("max(Count)", ascending=True))
max_col.show(n=5, truncate=False)


+------+----------+
|Color |max(Count)|
+------+----------+
|Orange|100       |
|Brown |100       |
|Red   |100       |
|Green |100       |
|Yellow|100       |
+------+----------+
only showing top 5 rows
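
Because the result is sorted in ascending order, `show(n=5)` displays the five colors with the smallest maxima. In this dataset every color shown peaks at 100, so the sort key alone does not determine their relative order. The plain-Python sketch below uses made-up rows whose maxima differ, to show how the sort direction changes which groups come first. `max_per_color` is a hypothetical helper, not a PySpark function.

```python
# Hypothetical (Color, Count) rows, for illustration only.
rows = [("Red", 40), ("Blue", 90), ("Red", 70), ("Green", 55), ("Blue", 10)]

def max_per_color(rows, ascending=True):
    """Return (color, max count) pairs sorted by the maximum."""
    maxima = {}
    for color, count in rows:
        # Keep the largest count seen so far for this color.
        maxima[color] = max(count, maxima.get(color, count))
    return sorted(maxima.items(), key=lambda item: item[1], reverse=not ascending)

print(max_per_color(rows, ascending=True))
print(max_per_color(rows, ascending=False))
```

To list the colors with the largest maxima first, the sort would need to be descending.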



Same as the previous exercise, grouped by State as well

In [10]:
max_col_Sta= (mnm_df
          .select("Color","Count", "State")
          .groupBy("State","Color")
          .agg({"Count":"max"})
          .orderBy("max(Count)", ascending=True))
max_col_Sta.show(n=5, truncate=False)

+-----+------+----------+
|State|Color |max(Count)|
+-----+------+----------+
|UT   |Blue  |100       |
|NM   |Green |100       |
|WA   |Red   |100       |
|WA   |Orange|100       |
|WY   |Green |100       |
+-----+------+----------+
only showing top 5 rows



Minimum count per color, first 5 rows

In [12]:
min_col= (mnm_df
        .select("Color","Count")
        .groupBy("Color")
        .agg({"Count":"min"})
        .orderBy("min(Count)", ascending=True))
min_col.show(n=5, truncate=False)

+------+----------+
|Color |min(Count)|
+------+----------+
|Orange|10        |
|Green |10        |
|Blue  |10        |
|Brown |10        |
|Yellow|10        |
+------+----------+
only showing top 5 rows



Same as the previous exercise, grouped by State as well


In [13]:
min_col_Sta= (mnm_df
              .select("Color","Count","State")
              .groupBy("State","Color")
              .agg({"Count":"min"})
              .orderBy("min(Count)", ascending=True))
min_col_Sta.show(n=5, truncate=False)

+-----+-----+----------+
|State|Color|min(Count)|
+-----+-----+----------+
|NV   |Red  |10        |
|NV   |Brown|10        |
|WA   |Red  |10        |
|CA   |Blue |10        |
|NM   |Green|10        |
+-----+-----+----------+
only showing top 5 rows



As in the exercise that selected only the "CA" rows, but here we also keep "NV", "CO", and "TX"

In [14]:
St_count_mnm_df = (mnm_df
                  .select("State", "Color", "Count") 
                  .where((mnm_df.State =="TX")|( mnm_df.State =="NV")|( mnm_df.State =="CA")|( mnm_df.State =="CO") ) 
                  .groupBy("State", "Color") 
                  .agg(count("Count").alias("Total")) 
                  .orderBy("Total", ascending=False))
St_count_mnm_df.show(n=10, truncate=False)
print("Total Rows = %d" % (St_count_mnm_df.count()))

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|NV   |Green |1698 |
+-----+------+-----+
only showing top 10 rows

Total Rows = 24
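
The chain of `==` comparisons joined with `|` keeps a row when its state matches any of the four codes, which amounts to a membership test. PySpark's `Column.isin` expresses that directly; the sketch below shows the equivalent logic in plain Python. The sample rows are drawn from the five rows displayed earlier, and the names `WANTED` and `keep_states` are assumptions for this example.

```python
# State codes used in the filter above.
WANTED = {"TX", "NV", "CA", "CO"}

# A few (State, Color, Count) rows from the dataset preview.
rows = [
    ("TX", "Red", 20),
    ("WA", "Yellow", 93),
    ("NV", "Blue", 66),
    ("OR", "Blue", 71),
    ("CO", "Blue", 79),
]

def keep_states(rows, wanted):
    """Keep only the rows whose state is in `wanted`."""
    return [row for row in rows if row[0] in wanted]

filtered = keep_states(rows, WANTED)
print(filtered)
```

A set keeps the condition short as the list of states grows, whereas each extra `|` clause makes the filter harder to read.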


Compute the minimum, maximum, mean, and count per state and color.

In [15]:
calculos_mnm= (mnm_df
              .select("Count","State","Color")
              .groupBy("State","Color")
              .agg(F.min("Count"),F.max("Count"), F.avg("Count"), F.count("Count"))
              .orderBy("State", "Color"))

 

calculos_mnm.show(n=20, truncate=False)

+-----+------+----------+----------+------------------+------------+
|State|Color |min(Count)|max(Count)|avg(Count)        |count(Count)|
+-----+------+----------+----------+------------------+------------+
|AZ   |Blue  |10        |100       |54.99449877750611 |1636        |
|AZ   |Brown |10        |100       |54.350412249705535|1698        |
|AZ   |Green |10        |100       |54.82219570405728 |1676        |
|AZ   |Orange|10        |100       |54.28300769686205 |1689        |
|AZ   |Red   |10        |100       |54.637135922330096|1648        |
|AZ   |Yellow|10        |100       |54.98548972188634 |1654        |
|CA   |Blue  |10        |100       |55.59762944479102 |1603        |
|CA   |Brown |10        |100       |55.740395809080326|1718        |
|CA   |Green |10        |100       |54.268717353453276|1723        |
|CA   |Orange|10        |100       |54.502715751357876|1657        |
|CA   |Red   |10        |100       |55.26992753623188 |1656        |
|CA   |Yellow|10        |100      
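
For readers who want to check what each aggregate means, here is a plain-Python sketch of the same per-group summary (minimum, maximum, mean, and row count). The input rows are made up and `summarize` is a hypothetical helper; Spark computes these aggregates in a distributed way, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (State, Color, Count) rows, for illustration only.
rows = [
    ("AZ", "Blue", 10),
    ("AZ", "Blue", 100),
    ("CA", "Red", 40),
    ("AZ", "Blue", 55),
    ("CA", "Red", 60),
]

def summarize(rows):
    """Return {(state, color): (min, max, mean, n)} with keys in sorted order."""
    groups = defaultdict(list)
    for state, color, count in rows:
        groups[(state, color)].append(count)
    # Sorting the keys mirrors orderBy("State", "Color").
    return {key: (min(vals), max(vals), mean(vals), len(vals))
            for key, vals in sorted(groups.items())}

for key, stats in summarize(rows).items():
    print(key, stats)
```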

### Scala

    To run this part:
     Kernel -> Change kernel -> spylon-kernel

Load the libraries:

In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

Intitializing Scala interpreter ...

Spark Web UI available at http://EM2021002836.bosonit.local:4040
SparkContext available as 'sc' (version = 3.1.1, master = local[*], app id = local-1622113869365)
SparkSession available as 'spark'


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._


Create the spark value that holds our SparkSession:

In [2]:
val spark = SparkSession
        .builder
        .appName("MnMCount")
        .getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4e014b50


Read our dataset:

In [3]:
// Path to our dataset
val mnmFile = "C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/mnm_dataset.csv"

mnmFile: String = C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/mnm_dataset.csv


In [4]:
// Read the dataset
val mnmDF = spark.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load(mnmFile)

mnmDF: org.apache.spark.sql.DataFrame = [State: string, Color: string ... 1 more field]


Run a query that counts the rows for each state and color and sorts the groups by that total, in descending order.

In [5]:
val countMnMDF = mnmDF
         .select("State", "Color", "Count")
         .groupBy("State", "Color")
         .agg(count("Count").alias("Total"))
         .orderBy(desc("Total"))

countMnMDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [State: string, Color: string ... 1 more field]


Show the result:

In [6]:
countMnMDF.show(60)

+-----+------+-----+
|State| Color|Total|
+-----+------+-----+
|   CA|Yellow| 1807|
|   WA| Green| 1779|
|   OR|Orange| 1743|
|   TX| Green| 1737|
|   TX|   Red| 1725|
|   CA| Green| 1723|
|   CO|Yellow| 1721|
|   CA| Brown| 1718|
|   CO| Green| 1713|
|   NV|Orange| 1712|
|   TX|Yellow| 1703|
|   NV| Green| 1698|
|   AZ| Brown| 1698|
|   CO|  Blue| 1695|
|   WY| Green| 1695|
|   NM|   Red| 1690|
|   AZ|Orange| 1689|
|   NM|Yellow| 1688|
|   NM| Brown| 1687|
|   UT|Orange| 1684|
|   NM| Green| 1682|
|   UT|   Red| 1680|
|   AZ| Green| 1676|
|   NV|Yellow| 1675|
|   NV|  Blue| 1673|
|   WA|   Red| 1671|
|   WY|   Red| 1670|
|   WA| Brown| 1669|
|   NM|Orange| 1665|
|   WY|  Blue| 1664|
|   WA|Yellow| 1663|
|   WA|Orange| 1658|
|   NV| Brown| 1657|
|   CA|Orange| 1657|
|   CA|   Red| 1656|
|   CO| Brown| 1656|
|   UT|  Blue| 1655|
|   AZ|Yellow| 1654|
|   TX|Orange| 1652|
|   AZ|   Red| 1648|
|   OR|  Blue| 1646|
|   OR|   Red| 1645|
|   UT|Yellow| 1645|
|   CO|Orange| 1642|
|   TX| Brown

In [7]:
println(s"Total Rows = ${countMnMDF.count()}")

Total Rows = 60


This query is the same as the previous one, but it only includes the rows for CA.

In [31]:
val caCountMnMDF = mnmDF
         .select("State", "Color", "Count")
         .where(col("State") === "CA")
         .groupBy("State", "Color")
         .agg(count("Count").alias("Total"))
         .orderBy(desc("Total"))

caCountMnMDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [State: string, Color: string ... 1 more field]


Show the result:

In [32]:
caCountMnMDF.show(10)

+-----+------+-----+
|State| Color|Total|
+-----+------+-----+
|   CA|Yellow| 1807|
|   CA| Green| 1723|
|   CA| Brown| 1718|
|   CA|Orange| 1657|
|   CA|   Red| 1656|
|   CA|  Blue| 1603|
+-----+------+-----+



In [10]:
println(s"Total Rows = ${caCountMnMDF.count()}")

Total Rows = 6


***Extra exercises***

Maximum by color and state, showing the first 5 rows

In [17]:
val max_col_Sta= mnmDF
          .select("Color","Count", "State")
          .groupBy("State","Color")
          .agg(max("Count"))
          .orderBy(asc("max(Count)"))
max_col_Sta.show(5)

+-----+------+----------+
|State| Color|max(Count)|
+-----+------+----------+
|   NV|   Red|       100|
|   WY| Green|       100|
|   UT|  Blue|       100|
|   WA|Orange|       100|
|   WA|   Red|       100|
+-----+------+----------+
only showing top 5 rows



max_col_Sta: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [State: string, Color: string ... 1 more field]


As in the exercise that selected only the "CA" rows, but here we also keep "NV", "CO", and "TX"

In [53]:
val St_count_mnm_df = mnmDF
                  .select("State", "Color", "Count")  
                  .where(col("State") === "TX" or col("State") === "NV" or col("State") === "CA" or col("State") ==="CO") 
                  .groupBy("State", "Color") 
                  .agg(count("Count").alias("Total")) 
                  .orderBy(asc("Total"))
St_count_mnm_df.show() 

+-----+------+-----+
|State| Color|Total|
+-----+------+-----+
|   CA|  Blue| 1603|
|   NV|   Red| 1610|
|   TX|  Blue| 1614|
|   CO|   Red| 1624|
|   TX| Brown| 1641|
|   CO|Orange| 1642|
|   TX|Orange| 1652|
|   CO| Brown| 1656|
|   CA|   Red| 1656|
|   NV| Brown| 1657|
|   CA|Orange| 1657|
|   NV|  Blue| 1673|
|   NV|Yellow| 1675|
|   CO|  Blue| 1695|
|   NV| Green| 1698|
|   TX|Yellow| 1703|
|   NV|Orange| 1712|
|   CO| Green| 1713|
|   CA| Brown| 1718|
|   CO|Yellow| 1721|
+-----+------+-----+
only showing top 20 rows



St_count_mnm_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [State: string, Color: string ... 1 more field]


Compute the minimum, maximum, mean, and count per state and color.

In [43]:
val calculos_mnm= mnmDF
              .select("Count","State","Color")
              .groupBy("State","Color")
              .agg(min("Count").alias("Minimo"), max("Count").alias("Maximo"), avg("Count").alias("Media"), count("Count").alias("Count"))
              .orderBy("State", "Color")
calculos_mnm.show()

+-----+------+------+------+------------------+-----+
|State| Color|Minimo|Maximo|             Media|Count|
+-----+------+------+------+------------------+-----+
|   AZ|  Blue|    10|   100| 54.99449877750611| 1636|
|   AZ| Brown|    10|   100|54.350412249705535| 1698|
|   AZ| Green|    10|   100| 54.82219570405728| 1676|
|   AZ|Orange|    10|   100| 54.28300769686205| 1689|
|   AZ|   Red|    10|   100|54.637135922330096| 1648|
|   AZ|Yellow|    10|   100| 54.98548972188634| 1654|
|   CA|  Blue|    10|   100| 55.59762944479102| 1603|
|   CA| Brown|    10|   100|55.740395809080326| 1718|
|   CA| Green|    10|   100|54.268717353453276| 1723|
|   CA|Orange|    10|   100|54.502715751357876| 1657|
|   CA|   Red|    10|   100| 55.26992753623188| 1656|
|   CA|Yellow|    10|   100|  55.8693967902601| 1807|
|   CO|  Blue|    10|   100| 55.11032448377581| 1695|
|   CO| Brown|    10|   100| 56.57729468599034| 1656|
|   CO| Green|    10|   100| 54.71336835960304| 1713|
|   CO|Orange|    10|   100|

calculos_mnm: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [State: string, Color: string ... 4 more fields]
