# SPARK BASICS - 05
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DataFrames I. First steps

### Defining the schema for the data model

In [0]:
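from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# StructField(name, dataType, nullable): every column here is declared non-nullable
schema_1 = StructType([
    StructField("nombre", StringType(), False),
    StructField("apellidos", StringType(), False),
    StructField("edad", IntegerType(), False),
    StructField("sexo", StringType(), False),
])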

### Loading data that matches the schema

In [0]:
data_1 = [["Michael", "Jordan", 65, "Masculino"],
          ["Kevin", "Garnett", 50, "Masculino"],
          ["Marie", "Curie", 48, "Femenino"],
          ]

### Creating the DataFrame

In [0]:
df_1 = spark.createDataFrame(data_1, schema_1)
df_1.show()

+-------+---------+----+---------+
| nombre|apellidos|edad|     sexo|
+-------+---------+----+---------+
|Michael|   Jordan|  65|Masculino|
|  Kevin|  Garnett|  50|Masculino|
|  Marie|    Curie|  48| Femenino|
+-------+---------+----+---------+



### Repeating the process step by step
- Create the schema (this time without *StructType*, using a DDL-formatted string instead)
- Load the data that matches the schema
- Create the DataFrame

In [0]:
schema_2 = "ciudad STRING, provincia STRING, comunidad_autonoma STRING, pais STRING"

In [0]:
data_2 = [["Madrid", "Madrid", "Madrid", "España"],
          ["Alcalá de Henares", "Madrid", "Madrid", "España"],
          ["Barcelona", "Barcelona", "Cataluña", "España"],
          ["Toledo", "Toledo", "Castilla la Mancha", "España"],
          ["Tarazona", "Zaragoza", "Aragón", "España"]
          ]

In [0]:
df_2 = spark.createDataFrame(data_2, schema_2)
df_2.show()

+-----------------+---------+------------------+------+
|           ciudad|provincia|comunidad_autonoma|  pais|
+-----------------+---------+------------------+------+
|           Madrid|   Madrid|            Madrid|España|
|Alcalá de Henares|   Madrid|            Madrid|España|
|        Barcelona|Barcelona|          Cataluña|España|
|           Toledo|   Toledo|Castilla la Mancha|España|
|         Tarazona| Zaragoza|            Aragón|España|
+-----------------+---------+------------------+------+



In [0]:
df_2.dtypes

Out[7]: [('ciudad', 'string'),
 ('provincia', 'string'),
 ('comunidad_autonoma', 'string'),
 ('pais', 'string')]

### Selecting and filtering data with DataFrames

In [0]:
df_2.select('ciudad', 'pais').show()

+-----------------+------+
|           ciudad|  pais|
+-----------------+------+
|           Madrid|España|
|Alcalá de Henares|España|
|        Barcelona|España|
|           Toledo|España|
|         Tarazona|España|
+-----------------+------+



In [0]:
from pyspark.sql.functions import col

df_2.select(col('ciudad'), col('pais')).show()

+-----------------+------+
|           ciudad|  pais|
+-----------------+------+
|           Madrid|España|
|Alcalá de Henares|España|
|        Barcelona|España|
|           Toledo|España|
|         Tarazona|España|
+-----------------+------+



### Combining DataFrames. JOIN

In [0]:
from pyspark.sql import Row
from pyspark.sql.functions import desc

# Built from tuples; toDF assigns the column names
df_j01 = spark.createDataFrame([(48, "Alberto"), (50, "Enrique")]).toDF("age", "name")
# Built from Row objects; column names come from the keyword arguments
df_j02 = spark.createDataFrame([Row(height=191, name="Alberto"), Row(height=188, name="Enrique")])

In [0]:
df_j03 = df_j01.join(df_j02, "name")
df_j03.show()

+-------+---+------+
|   name|age|height|
+-------+---+------+
|Alberto| 48|   191|
|Enrique| 50|   188|
+-------+---+------+



In [0]:
df_j03.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- height: long (nullable = true)



In [0]:
df_j03.sort(desc("height")).show()

+-------+---+------+
|   name|age|height|
+-------+---+------+
|Alberto| 48|   191|
|Enrique| 50|   188|
+-------+---+------+



In [0]:
df_j03.filter("height > 190").show()

+-------+---+------+
|   name|age|height|
+-------+---+------+
|Alberto| 48|   191|
+-------+---+------+



In [0]:
df_j03.orderBy("age", ascending=True).show()

+-------+---+------+
|   name|age|height|
+-------+---+------+
|Alberto| 48|   191|
|Enrique| 50|   188|
+-------+---+------+



In [0]:
df_j03.groupBy("name").agg({"height": "avg"}).sort("name").show()

+-------+-----------+
|   name|avg(height)|
+-------+-----------+
|Alberto|      191.0|
|Enrique|      188.0|
+-------+-----------+



In [0]:
df_j03.agg({"age": "avg"}).show()

+--------+
|avg(age)|
+--------+
|    49.0|
+--------+

