# DataFrame & Column
##### Objectives
1. Construct columns
1. Select columns
1. Add or replace columns
1. Select rows
1. Sort rows

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html" target="_blank">DataFrame</a>: **`select`**, **`selectExpr`**, **`drop`**, **`withColumn`**, **`withColumnRenamed`**, **`filter`**, **`distinct`**, **`limit`**, **`sort`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html" target="_blank">Column</a>: **`alias`**, **`isin`**, **`cast`**, **`isNotNull`**, **`desc`**, operators

In [None]:
%pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 4.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=8488b0017c610cee9f56c09e213b51b393873a4320d585c3a93790a39bc64555
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [None]:
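from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session that uses all available cores
spark = SparkSession.builder.master('local[*]').appName('dfcolumns').getOrCreate()
# The session already owns a SparkContext, so reuse it instead of creating one separately
sc = spark.sparkContext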

In [None]:
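# header=True takes column names from the first line; without inferSchema every column is read as a string
df = spark.read.csv('/content/sample_data/california_housing_test.csv', header=True)
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('cht')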

## Column Expressions

A <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html" target="_blank">Column</a> is a logical construction that will be computed from the data in a DataFrame using an expression.

Construct a new Column based on existing columns in a DataFrame:


In [None]:
from pyspark.sql.functions import col

print(df.median_house_value)
print(df['median_house_value'])
print(col('median_house_value'))

Column<'median_house_value'>
Column<'median_house_value'>
Column<'median_house_value'>


### Column Operators and Methods
| Method | Description |
| --- | --- |
| \*, + , <, >= | Math and comparison operators |
| ==, != | Equality and inequality tests (Scala operators are **`===`** and **`=!=`**) |
| alias | Gives the column an alias |
| cast, astype | Casts the column to a different data type |
| isNull, isNotNull | Is null, is not null (PySpark has no Column method for NaN; use the **`isnan`** function from `pyspark.sql.functions`) |
| asc, desc | Returns a sort expression based on ascending/descending order of the column |

In [None]:
col("median_house_value") + col("median_income")
col("total_bedrooms").desc()
(col("housing_median_age") * 100).cast("int")

Column<'CAST((housing_median_age * 100) AS INT)'>

In [None]:
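rev_df = (df
         .filter(col('total_rooms').isNotNull())
         .withColumn("income", (col("median_income") * 100).cast("int"))
         # housing_median_age was read as a string, so this sort is lexicographic ("9.000000" sorts above "10.000000")
         .sort(col("housing_median_age").desc())
        )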

In [None]:
rev_df.show()

+-----------+---------+------------------+------------+--------------+-----------+-----------+-------------+------------------+------+
|  longitude| latitude|housing_median_age| total_rooms|total_bedrooms| population| households|median_income|median_house_value|income|
+-----------+---------+------------------+------------+--------------+-----------+-----------+-------------+------------------+------+
|-118.630000|34.240000|          9.000000| 4759.000000|    924.000000|1884.000000| 915.000000|     4.833300|     277200.000000|   483|
|-116.240000|33.760000|          9.000000| 1961.000000|    595.000000| 966.000000| 275.000000|     3.812500|      96700.000000|   381|
|-122.510000|38.760000|          9.000000| 2589.000000|    482.000000|1050.000000| 374.000000|     4.043500|     132600.000000|   404|
|-117.190000|32.770000|          9.000000|  634.000000|    152.000000| 248.000000| 133.000000|     3.857100|     143800.000000|   385|
|-121.930000|38.010000|          9.000000| 2294.000000|

## DataFrame Transformation Methods
| Method | Description |
| --- | --- |
| **`select`** | Returns a new DataFrame by computing given expression for each element |
| **`drop`** | Returns a new DataFrame with a column dropped |
| **`withColumnRenamed`** | Returns a new DataFrame with a column renamed |
| **`withColumn`** | Returns a new DataFrame by adding a column or replacing the existing column that has the same name |
| **`filter`**, **`where`** | Filters rows using the given condition |
| **`sort`**, **`orderBy`** | Returns a new DataFrame sorted by the given expressions |
| **`dropDuplicates`**, **`distinct`** | Returns a new DataFrame with duplicate rows removed |
| **`limit`** | Returns a new DataFrame by taking the first n rows |
| **`groupBy`** | Groups the DataFrame using the specified columns, so we can run aggregation on them |

### Subset Columns

In [None]:
df.select('longitude', 'latitude').show()

In [None]:
from pyspark.sql.functions import col

df_rooms = df.select(
    col('total_rooms').alias('rooms'),
    col('total_bedrooms').alias('bedrooms')
)

df_rooms.show()

+-----------+-----------+
|      rooms|   bedrooms|
+-----------+-----------+
|3885.000000| 661.000000|
|1510.000000| 310.000000|
|3589.000000| 507.000000|
|  67.000000|  15.000000|
|1241.000000| 244.000000|
|1018.000000| 213.000000|
|1009.000000| 225.000000|
|2310.000000| 471.000000|
|3080.000000| 617.000000|
|2402.000000| 632.000000|
| 972.000000| 249.000000|
| 736.000000| 166.000000|
|1089.000000| 182.000000|
|3936.000000| 694.000000|
|2097.000000| 325.000000|
| 161.000000|  40.000000|
| 570.000000| 123.000000|
|3077.000000| 607.000000|
|1590.000000| 196.000000|
|8814.000000|1307.000000|
+-----------+-----------+
only showing top 20 rows



In [None]:
df_households = df.selectExpr('longitude', 'latitude', "total_rooms == 67 as rooms_67")

In [None]:
df_households.filter('rooms_67 == true').show()

+-----------+---------+--------+
|  longitude| latitude|rooms_67|
+-----------+---------+--------+
|-118.360000|33.820000|    true|
+-----------+---------+--------+



#### drop()
Returns a new DataFrame after dropping the given column, specified as a string or a Column object.

In [None]:
no_longitude_df = df.drop(col('longitude'))

In [None]:
no_longitude_df.show()

+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|
+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|37.370000|         27.000000|3885.000000|    661.000000|1537.000000| 606.000000|     6.608500|     344700.000000|
|34.260000|         43.000000|1510.000000|    310.000000| 809.000000| 277.000000|     3.599000|     176500.000000|
|33.780000|         27.000000|3589.000000|    507.000000|1484.000000| 495.000000|     5.793400|     270500.000000|
|33.820000|         28.000000|  67.000000|     15.000000|  49.000000|  11.000000|     6.135900|     330000.000000|
|36.330000|         19.000000|1241.000000|    244.000000| 850.000000| 237.000000|     2.937500|      81700.000000|
|36.510000|         37.000000|1018.000000|    213.000000| 663.000000| 204.000000

#### Add or Replace Columns

##### withColumn()

In [None]:
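# Casting to int drops the decimals (e.g. 33.78 becomes 33); it does not round to the nearest integer
df_latitude_rounded = df.withColumn('latitude_rounded', col('latitude').cast('int'))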

In [None]:
df_latitude_rounded.show()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+----------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|latitude_rounded|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+----------------+
|-122.050000|37.370000|         27.000000|3885.000000|    661.000000|1537.000000| 606.000000|     6.608500|     344700.000000|              37|
|-118.300000|34.260000|         43.000000|1510.000000|    310.000000| 809.000000| 277.000000|     3.599000|     176500.000000|              34|
|-117.810000|33.780000|         27.000000|3589.000000|    507.000000|1484.000000| 495.000000|     5.793400|     270500.000000|              33|
|-118.360000|33.820000|         28.000000|  67.000000|     15.000000|  49.000000|  11.000000|     6.135900|     330000.000000|          

In [None]:
df_latitude_rounded.printSchema()

root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- housing_median_age: string (nullable = true)
 |-- total_rooms: string (nullable = true)
 |-- total_bedrooms: string (nullable = true)
 |-- population: string (nullable = true)
 |-- households: string (nullable = true)
 |-- median_income: string (nullable = true)
 |-- median_house_value: string (nullable = true)
 |-- latitude_rounded: integer (nullable = true)



#### Subset Rows

Filters rows based on a column-level condition.

Alias: where

In [None]:
df_latitude_rounded.filter('latitude_rounded > 37').show()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+----------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|latitude_rounded|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+----------------+
|-121.430000|38.630000|         43.000000|1009.000000|    225.000000| 604.000000| 218.000000|     1.664100|      67000.000000|              38|
|-122.840000|38.400000|         15.000000|3080.000000|    617.000000|1446.000000| 599.000000|     3.669600|     194400.000000|              38|
|-121.200000|38.690000|         26.000000|3077.000000|    607.000000|1603.000000| 595.000000|     2.717400|     137500.000000|              38|
|-122.590000|38.010000|         35.000000|8814.000000|   1307.000000|3450.000000|1258.000000|     6.172400|     414300.000000|          

In [None]:
df_latitude_rounded.filter(col('total_rooms').isNotNull()).show()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+----------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|latitude_rounded|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+----------------+
|-122.050000|37.370000|         27.000000|3885.000000|    661.000000|1537.000000| 606.000000|     6.608500|     344700.000000|              37|
|-118.300000|34.260000|         43.000000|1510.000000|    310.000000| 809.000000| 277.000000|     3.599000|     176500.000000|              34|
|-117.810000|33.780000|         27.000000|3589.000000|    507.000000|1484.000000| 495.000000|     5.793400|     270500.000000|              33|
|-118.360000|33.820000|         28.000000|  67.000000|     15.000000|  49.000000|  11.000000|     6.135900|     330000.000000|          

#### dropDuplicates()

Returns a new DataFrame with duplicate rows removed.

Alias: distinct (when called without a column subset)

In [None]:
df_latitude_rounded.select('latitude_rounded').distinct().show()

+----------------+
|latitude_rounded|
+----------------+
|              34|
|              40|
|              41|
|              37|
|              35|
|              39|
|              38|
|              32|
|              33|
|              36|
+----------------+



#### limit()

Returns a DataFrame with only the first n rows.

In [None]:
limit_df = df_latitude_rounded.limit(3)  # keep the DataFrame; show() itself returns None
limit_df.show()

+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+----------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population|households|median_income|median_house_value|latitude_rounded|
+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+----------------+
|-122.050000|37.370000|         27.000000|3885.000000|    661.000000|1537.000000|606.000000|     6.608500|     344700.000000|              37|
|-118.300000|34.260000|         43.000000|1510.000000|    310.000000| 809.000000|277.000000|     3.599000|     176500.000000|              34|
|-117.810000|33.780000|         27.000000|3589.000000|    507.000000|1484.000000|495.000000|     5.793400|     270500.000000|              33|
+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+----------------+

#### sort()

Alias: orderBy

In [None]:
df_latitude_rounded.select('latitude_rounded').distinct().sort('latitude_rounded').show()

+----------------+
|latitude_rounded|
+----------------+
|              32|
|              33|
|              34|
|              35|
|              36|
|              37|
|              38|
|              39|
|              40|
|              41|
+----------------+



In [None]:
df_latitude_rounded.select('latitude_rounded').distinct().sort(col('latitude_rounded').desc()).show()

+----------------+
|latitude_rounded|
+----------------+
|              41|
|              40|
|              39|
|              38|
|              37|
|              36|
|              35|
|              34|
|              33|
|              32|
+----------------+

