# Complex Types

Explore the built-in functions for working with collections and strings.

##### Objectives
1. Apply collection functions to process arrays
2. Union DataFrames together

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html" target="_blank">DataFrame</a>: **`union`**, **`unionByName`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html" target="_blank">Built-In Functions</a>:
  - Aggregate: **`collect_set`**
  - Collection: **`array_contains`**, **`element_at`**, **`explode`**
  - String: **`split`**


In [None]:
%pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=dd2dd60864c03876316f7a3ecdf3085f10ad493498e3b732e1b021131b16fd3f
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.master('local[*]').appName('complex-types').getOrCreate()
sc = SparkContext.getOrCreate()

In [None]:
data = [("A", [1, 2, 3]),
        ("B", [4, 5]),
        ("C", [6])]

columns = ['id', 'valores']

df = spark.createDataFrame(data, columns)

df.show()

+---+---------+
| id|  valores|
+---+---------+
|  A|[1, 2, 3]|
|  B|   [4, 5]|
|  C|      [6]|
+---+---------+



In [None]:
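# Import only the functions this notebook uses; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import array_contains, col, explode, lit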

df_exploded = df.select('id', explode('valores').alias('valor_exploded'))
df_exploded.show()

+---+--------------+
| id|valor_exploded|
+---+--------------+
|  A|             1|
|  A|             2|
|  A|             3|
|  B|             4|
|  B|             5|
|  C|             6|
+---+--------------+



### String Functions

| Method | Description |
| --- | --- |
| translate | Translate any character in the src by a character in replaceString |
| regexp_replace | Replace all substrings of the specified string value that match regexp with rep |
| regexp_extract | Extract a specific group matched by a Java regex, from the specified string column |
| ltrim | Removes the leading space characters from the specified string column |
| lower | Converts a string column to lowercase |
| split | Splits str around matches of the given pattern |

In [None]:
from pyspark.sql.functions import split

data_to_split = [(1, "correo1@gmail.com"),
        (2, 'correo2@hotmail.com'),
        (3, 'correo3@murciaeduca.es')]

df_mail = spark.createDataFrame(data_to_split, schema='id int, email string')
df_email_handle = df_mail.select(split(df_mail.email, '@', 0).alias('email_handle'))

df_email_handle.select(explode('email_handle')).show()

+--------------+
|           col|
+--------------+
|       correo1|
|     gmail.com|
|       correo2|
|   hotmail.com|
|       correo3|
|murciaeduca.es|
+--------------+



### Collection Functions

| Method | Description |
| --- | --- |
| array_contains | Returns null if the array is null, true if the array contains value, and false otherwise. |
| element_at | Returns element of array at given index. Array elements are numbered starting with **1**. |
| explode | Creates a new row for each element in the given array or map column. |
| collect_set | Returns a set of objects with duplicate elements eliminated. |

In [None]:
data_to_filter = [(1, ['correo1@gmail.com', 'correo2@gmail.com']),
        (2, ['correo2@hotmail.com', 'correo3@gmail.com']),
        (3, ['correo3@murciaeduca.es'])]

df = spark.createDataFrame(data_to_filter)
df.show()

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|[correo1@gmail.co...|
|  2|[correo2@hotmail....|
|  3|[correo3@murciaed...|
+---+--------------------+



In [None]:
filtered_df = (
    df.filter(array_contains(col('_2'), 'correo1@gmail.com'))
)

filtered_df.show()

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|[correo1@gmail.co...|
+---+--------------------+



## union and unionByName
<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html" target="_blank">**`union`**</a> method resolves columns by position, as in standard SQL, and is equivalent to UNION ALL. Use it only when both DataFrames have exactly the same schema, including the column order. In contrast, the DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html" target="_blank">**`unionByName`**</a> method resolves columns by name. Neither method removes duplicates; follow either one with **`distinct`** to get a deduplicating set union.

The check below verifies that the two DataFrames have matching schemas, in which case **`union`** is appropriate.


In [None]:
df.schema == filtered_df.schema

True

In [None]:
df_union = df.union(filtered_df)
df_union.show()

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|[correo1@gmail.co...|
|  2|[correo2@hotmail....|
|  3|[correo3@murciaed...|
|  1|[correo1@gmail.co...|
+---+--------------------+



### Miscellaneous Functions

| Method | Description |
| --- | --- |
| col / column | Returns a Column based on the given column name. |
| lit | Creates a Column of literal value |
| isnull | Return true if the column is null |
| rand | Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0) |

In [None]:
gmail_accounts = df_mail.filter(col('email').endswith('gmail.com'))
gmail_accounts.show()

+---+-----------------+
| id|            email|
+---+-----------------+
|  1|correo1@gmail.com|
+---+-----------------+



In [None]:
df_gmail_user = gmail_accounts.select('email', lit(True).alias('gmail user'))
df_gmail_user.show()

+-----------------+----------+
|            email|gmail user|
+-----------------+----------+
|correo1@gmail.com|      true|
+-----------------+----------+



### DataFrameNaFunctions
<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameNaFunctions.html#pyspark.sql.DataFrameNaFunctions" target="_blank">DataFrameNaFunctions</a> is a helper class of DataFrame with methods for handling null values. Obtain an instance of DataFrameNaFunctions by accessing the **`na`** attribute of a DataFrame.

In [None]:
df = (spark
    .read
    .option('inferSchema', True)
    .json('/content/sample_data/anscombe.json')
  )

df.show()

+------+----+-----+---------------+
|Series|   X|    Y|_corrupt_record|
+------+----+-----+---------------+
|  NULL|NULL| NULL|              [|
|     I|10.0| 8.04|           NULL|
|     I| 8.0| 6.95|           NULL|
|     I|13.0| 7.58|           NULL|
|     I| 9.0| 8.81|           NULL|
|     I|11.0| 8.33|           NULL|
|     I|14.0| 9.96|           NULL|
|     I| 6.0| 7.24|           NULL|
|     I| 4.0| 4.26|           NULL|
|     I|12.0|10.84|           NULL|
|     I| 7.0| 4.81|           NULL|
|     I| 5.0| 5.68|           NULL|
|    II|10.0| 9.14|           NULL|
|    II| 8.0| 8.14|           NULL|
|    II|13.0| 8.74|           NULL|
|    II| 9.0| 8.77|           NULL|
|    II|11.0| 9.26|           NULL|
|    II|14.0|  8.1|           NULL|
|    II| 6.0| 6.13|           NULL|
|    II| 4.0|  3.1|           NULL|
+------+----+-----+---------------+
only showing top 20 rows



In [None]:
print(df.count())
print(df.na.drop().count())

46
0


In [None]:
df.select('Series').na.drop().count()

44

In [None]:
df.select('Series').na.fill('NO SERIES').show()

+---------+
|   Series|
+---------+
|NO SERIES|
|        I|
|        I|
|        I|
|        I|
|        I|
|        I|
|        I|
|        I|
|        I|
|        I|
|        I|
|       II|
|       II|
|       II|
|       II|
|       II|
|       II|
|       II|
|       II|
+---------+
only showing top 20 rows



### Joining DataFrames
The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html?highlight=join#pyspark.sql.DataFrame.join" target="_blank">join</a> method joins two DataFrames based on a given join expression.

Several different types of joins are supported:

Inner join based on equal values of a shared column called "name" (i.e., an equi-join)<br/>
`df1.join(df2, "name")`

Inner join based on equal values of the shared columns called "name" and "age"<br/>
`df1.join(df2, ["name", "age"])`

Full outer join based on equal values of a shared column called "name"<br/>
`df1.join(df2, "name", "outer")`

Left outer join based on an explicit column expression<br/>
`df1.join(df2, df1["customer_name"] == df2["account_name"], "left_outer")`

In [None]:
# show() returns None, so keep the joined DataFrame and display it in a separate step
joined_df = gmail_accounts.join(other=df_gmail_user, on='email', how='inner')
joined_df.show()

+-----------------+---+----------+
|            email| id|gmail user|
+-----------------+---+----------+
|correo1@gmail.com|  1|      true|
+-----------------+---+----------+

