sparklyr
===

* *30 min* | Last modified: June 22, 2019.

Spark SQL is an interface for processing structured data with SQL. It can also read data stored in Apache Hive. Spark SQL operates on DataFrames, which are distributed Datasets (built on top of RDDs) organized into named columns and are equivalent to tables in a relational database. The `sparklyr` package gives R access to these DataFrames through a dplyr interface.

**Testing the installation**.

In [1]:
%load_ext rpy2.ipython

In [2]:
%%R
library(sparklyr)
library(dplyr)
sc <- spark_connect(master='local', spark_home='/root/spark/spark-2.4.3-bin-hadoop2.7')
spark_version(sc)
src_tbls(sc)
spark_disconnect(sc)

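Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

  method      from
  print.bytes Rcpp

NULL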


## Installation

The `sparklyr` package is normally installed with:

In [3]:
## install.packages("sparklyr")

`sparklyr` can run on the local machine or connect to a Spark server. This tutorial works on the local machine in pseudo-distributed mode.

* The directory `/tmp/hive` must exist in HDFS with read and write permissions. If it does not exist, use:


     hdfs dfs -ls /tmp              ## list the contents of /tmp
     hdfs dfs -mkdir /tmp/hive      ## create the hive directory
     hdfs dfs -chmod 777 /tmp/hive  ## change the permissions on the directory

* `sparklyr` needs a local Spark installation to run in local mode. The related functions are the following:

In [4]:
%%R
## Spark versions available for installation
spark_available_versions()

  spark
1   1.6
2   2.0
3   2.1
4   2.2
5   2.3
6   2.4


The function `spark_install` installs a version of Spark for use with local connections. The version must be one of those listed above, for example:

    spark_install(version = '2.4')

It is uninstalled with:

    spark_uninstall(version = '2.4')


In [5]:
# spark_install(version = '2.4')

In [6]:
%%R
## Spark versions installed on the local machine
spark_installed_versions()

  spark hadoop                                   dir
1 2.4.3    2.7 /root/spark/spark-2.4.3-bin-hadoop2.7


The connection is made with the function `spark_connect`. Note that the `spark_home` parameter must match one of the directories listed in the cell above.

    spark_connect(master='local', spark_home='/root/spark/spark-2.4.3-bin-hadoop2.7')

In [7]:
%%R
## Spark installation directory
spark_install_dir()

[1] "/root/spark"
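
The requirement that `spark_home` match an installed directory can also be met programmatically. The following is a minimal sketch, assuming at least one Spark build is already installed; taking the first row is an arbitrary choice made for illustration, and the names `installed` and `home` are introduced here.

```r
library(sparklyr)

# Data frame with the columns spark, hadoop and dir,
# one row per Spark build installed locally.
installed <- spark_installed_versions()
stopifnot(nrow(installed) > 0)

# Use the directory of the first installed build as spark_home
# (assumption for this sketch: any installed build is acceptable).
home <- as.character(installed$dir[1])

sc <- spark_connect(master = 'local', spark_home = home)
print(spark_version(sc))
spark_disconnect(sc)
```

Reading the path from `spark_installed_versions()` avoids hard-coding a directory that may differ between machines.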


## Setup

In [8]:
%%R
##
## This function is used to run operating system commands
## and print their output.
##
systemp <- function(command) cat(system(command, intern = TRUE), sep = '\n')

In [9]:
%%R
library(sparklyr)
library(dplyr)
sc <- spark_connect(master='local', spark_home='/root/spark/spark-2.4.3-bin-hadoop2.7')
spark_version(sc)

[1] ‘2.4.3’


## Creating DataFrames

The following cells load DataFrames from different file formats.

### JSON format

A file in JSON format is created on the local machine.

In [10]:
%%writefile people.json
{"id": 1,  "firstname": "Vivian",   "surname": "Hamilton", "birthdate": "1971-07-08",  "color": "green",  "quantity": 1 }
{"id": 2,  "firstname": "Karen",    "surname": "Holcomb",  "birthdate": "1974-05-23",  "color": "green",  "quantity": 4 }
{"id": 3,  "firstname": "Cody",     "surname": "Garrett",  "birthdate": "1973-04-22",  "color": "orange", "quantity": 1 }
{"id": 4,  "firstname": "Roth",     "surname": "Fry",      "birthdate": "1975-01-29",  "color": "black",  "quantity": 1 }
{"id": 5,  "firstname": "Zoe",      "surname": "Conway",   "birthdate": "1974-07-03",  "color": "blue",   "quantity": 2 }
{"id": 6,  "firstname": "Gretchen", "surname": "Kinney",   "birthdate": "1974-10-18",  "color": "violet", "quantity": 1 }
{"id": 7,  "firstname": "Driscoll", "surname": "Klein",    "birthdate": "1970-10-05",  "color": "blue",   "quantity": 5 }
{"id": 8,  "firstname": "Karyn",    "surname": "Diaz",     "birthdate": "1969-02-24",  "color": "red",    "quantity": 1 }
{"id": 9,  "firstname": "Merritt",  "surname": "Guy",      "birthdate": "1974-10-17",  "color": "indigo", "quantity": 4 }
{"id": 10, "firstname": "Kylan",    "surname": "Sexton",   "birthdate": "1975-02-28",  "color": "black",  "quantity": 4 }
{"id": 11, "firstname": "Jordan",   "surname": "Estes",    "birthdate": "1969-12-07",  "color": "indigo", "quantity": 4 }
{"id": 12, "firstname": "Hope",     "surname": "Coffey",   "birthdate": "1973-12-24",  "color": "green",  "quantity": 5 }
{"id": 13, "firstname": "Vivian",   "surname": "Crane",    "birthdate": "1970-08-27",  "color": "gray",   "quantity": 5 }
{"id": 14, "firstname": "Clio",     "surname": "Noel",     "birthdate": "1972-12-12",  "color": "red",    "quantity": 5 }
{"id": 15, "firstname": "Hope",     "surname": "Silva",    "birthdate": "1970-07-01",  "color": "blue",   "quantity": 5 }
{"id": 16, "firstname": "Ayanna",   "surname": "Jarvis",   "birthdate": "1974-02-11",  "color": "orange", "quantity": 5 }
{"id": 17, "firstname": "Chanda",   "surname": "Boyer",    "birthdate": "1973-04-01",  "color": "green",  "quantity": 4 }
{"id": 18, "firstname": "Chadwick", "surname": "Knight",   "birthdate": "1973-04-29",  "color": "yellow", "quantity": 1 }

Writing people.json


In [11]:
## Copy the file to HDFS
!hdfs dfs -copyFromLocal people.json /tmp/people.json

In [12]:
%%R
##
## spark_read_json() loads the JSON file
## directly into a DataFrame.
##
df <- spark_read_json(sc,                 ## spark_connection
                      'people',           ## table name
                      '/tmp/people.json') ## file location
                                          ## in HDFS
df

# Source: spark<people> [?? x 6]
   birthdate  color  firstname    id quantity surname 
   <chr>      <chr>  <chr>     <dbl>    <dbl> <chr>   
 1 1971-07-08 green  Vivian        1        1 Hamilton
 2 1974-05-23 green  Karen         2        4 Holcomb 
 3 1973-04-22 orange Cody          3        1 Garrett 
 4 1975-01-29 black  Roth          4        1 Fry     
 5 1974-07-03 blue   Zoe           5        2 Conway  
 6 1974-10-18 violet Gretchen      6        1 Kinney  
 7 1970-10-05 blue   Driscoll      7        5 Klein   
 8 1969-02-24 red    Karyn         8        1 Diaz    
 9 1974-10-17 indigo Merritt       9        4 Guy     
10 1975-02-28 black  Kylan        10        4 Sexton  
# … with more rows
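
In the listing above the JSON reader returned `id` and `quantity` as doubles and `birthdate` as a string. A hedged sketch of one way to cast them inside Spark follows. It assumes the `df` produced by `spark_read_json()` in the previous cell is still available; `typed` is a name introduced here. `as.integer()` is translated to a SQL cast by dplyr, while `to_date()` is not an R function and is expected to be handed to Spark SQL as written.

```r
library(sparklyr)
library(dplyr)

# Cast the columns inside Spark; nothing is collected into R.
typed <- df %>%
  mutate(
    id        = as.integer(id),
    quantity  = as.integer(quantity),
    birthdate = to_date(birthdate)   # Spark SQL function, evaluated by Spark
  )

# Inspect the resulting column types.
sdf_schema(typed)
```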


In [13]:
%%R
##
## collect() retrieves all the rows into R as a local tibble
##
collect(df)

# A tibble: 18 x 6
   birthdate  color  firstname    id quantity surname 
   <chr>      <chr>  <chr>     <dbl>    <dbl> <chr>   
 1 1971-07-08 green  Vivian        1        1 Hamilton
 2 1974-05-23 green  Karen         2        4 Holcomb 
 3 1973-04-22 orange Cody          3        1 Garrett 
 4 1975-01-29 black  Roth          4        1 Fry     
 5 1974-07-03 blue   Zoe           5        2 Conway  
 6 1974-10-18 violet Gretchen      6        1 Kinney  
 7 1970-10-05 blue   Driscoll      7        5 Klein   
 8 1969-02-24 red    Karyn         8        1 Diaz    
 9 1974-10-17 indigo Merritt       9        4 Guy     
10 1975-02-28 black  Kylan        10        4 Sexton  
11 1969-12-07 indigo Jordan       11        4 Estes   
12 1973-12-24 green  Hope         12        5 Coffey  
13 1970-08-27 gray   Vivian       13        5 Crane   
14 1972-12-12 red    Clio         14        5 Noel    
15 1970-07-01 blue   Hope         15        5 Silva   
16 1974-02-11 orange Ayanna       16        5 

### CSV format

The following example shows how to load a text file as a DataFrame.

In [14]:
%%writefile people.csv
id,firstname,surname,birthdate,color,quantity
1,Vivian,Hamilton,1971-07-08,green,1
2,Karen,Holcomb,1974-05-23,green,4
3,Cody,Garrett,1973-04-22,orange,1
4,Roth,Fry,1975-01-29,black,1
5,Zoe,Conway,1974-07-03,blue,2
6,Gretchen,Kinney,1974-10-18,violet,1
7,Driscoll,Klein,1970-10-05,blue,5
8,Karyn,Diaz,1969-02-24,red,1
9,Merritt,Guy,1974-10-17,indigo,4
10,Kylan,Sexton,1975-02-28,black,4
11,Jordan,Estes,1969-12-07,indigo,4
12,Hope,Coffey,1973-12-24,green,5
13,Vivian,Crane,1970-08-27,gray,5
14,Clio,Noel,1972-12-12,red,5
15,Hope,Silva,1970-07-01,blue,5
16,Ayanna,Jarvis,1974-02-11,orange,5
17,Chanda,Boyer,1973-04-01,green,4
18,Chadwick,Knight,1973-04-29,yellow,1

Writing people.csv


In [15]:
## Copy the file to HDFS (any previous copy is removed first)
!hdfs dfs -rm /tmp/people.csv
!hdfs dfs -copyFromLocal people.csv /tmp/people.csv

rm: `/tmp/people.csv': No such file or directory


In [16]:
%%R
df <- spark_read_csv(sc,                 # spark_connection
                     'people',           # table name
                     '/tmp/people.csv')  # file location
                                         # in HDFS
df

# Source: spark<people> [?? x 6]
      id firstname surname  birthdate           color  quantity
   <int> <chr>     <chr>    <dttm>              <chr>     <int>
 1     1 Vivian    Hamilton 1971-07-08 00:00:00 green         1
 2     2 Karen     Holcomb  1974-05-23 00:00:00 green         4
 3     3 Cody      Garrett  1973-04-22 00:00:00 orange        1
 4     4 Roth      Fry      1975-01-29 00:00:00 black         1
 5     5 Zoe       Conway   1974-07-03 00:00:00 blue          2
 6     6 Gretchen  Kinney   1974-10-18 00:00:00 violet        1
 7     7 Driscoll  Klein    1970-10-05 00:00:00 blue          5
 8     8 Karyn     Diaz     1969-02-24 00:00:00 red           1
 9     9 Merritt   Guy      1974-10-17 00:00:00 indigo        4
10    10 Kylan     Sexton   1975-02-28 00:00:00 black         4
# … with more rows
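
Schema inference turned `birthdate` into a timestamp in the output above. When the column types are known in advance, they can be declared instead of inferred. This sketch assumes that the `columns` argument of `spark_read_csv()` accepts a named vector of type names (`"integer"`, `"character"`, `"date"`), as described in the sparklyr documentation, and that `sc` and `/tmp/people.csv` from the cells above are available. The table name `people_typed` is made up for the example, and behavior may differ between sparklyr versions.

```r
library(sparklyr)

# Declared column types, in the same order as the CSV header.
people_columns <- c(
  id        = "integer",
  firstname = "character",
  surname   = "character",
  birthdate = "date",
  color     = "character",
  quantity  = "integer"
)

df_typed <- spark_read_csv(sc,
                           'people_typed',      # hypothetical table name
                           '/tmp/people.csv',   # file location in HDFS
                           columns = people_columns,
                           infer_schema = FALSE)

# Check the types Spark actually assigned.
sdf_schema(df_typed)
```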


## DataFrame operations

In [17]:
%%R
df <- spark_read_csv(sc,                 # spark_connection
                     'people',           # table name
                     '/tmp/people.csv')  # file location
df

# Source: spark<people> [?? x 6]
      id firstname surname  birthdate           color  quantity
   <int> <chr>     <chr>    <dttm>              <chr>     <int>
 1     1 Vivian    Hamilton 1971-07-08 00:00:00 green         1
 2     2 Karen     Holcomb  1974-05-23 00:00:00 green         4
 3     3 Cody      Garrett  1973-04-22 00:00:00 orange        1
 4     4 Roth      Fry      1975-01-29 00:00:00 black         1
 5     5 Zoe       Conway   1974-07-03 00:00:00 blue          2
 6     6 Gretchen  Kinney   1974-10-18 00:00:00 violet        1
 7     7 Driscoll  Klein    1970-10-05 00:00:00 blue          5
 8     8 Karyn     Diaz     1969-02-24 00:00:00 red           1
 9     9 Merritt   Guy      1974-10-17 00:00:00 indigo        4
10    10 Kylan     Sexton   1975-02-28 00:00:00 black         4
# … with more rows


In [18]:
%%R
##
## Shows the schema (name and type of each column) as a nested list
##
sdf_schema(df)

$id
$id$name
[1] "id"

$id$type
[1] "IntegerType"


$firstname
$firstname$name
[1] "firstname"

$firstname$type
[1] "StringType"


$surname
$surname$name
[1] "surname"

$surname$type
[1] "StringType"


$birthdate
$birthdate$name
[1] "birthdate"

$birthdate$type
[1] "TimestampType"


$color
$color$name
[1] "color"

$color$type
[1] "StringType"


$quantity
$quantity$name
[1] "quantity"

$quantity$type
[1] "IntegerType"




In [19]:
%%R
##
## Selecting a single column
##
select(df, 'firstname')

# Source: spark<?> [?? x 1]
   firstname
   <chr>    
 1 Vivian   
 2 Karen    
 3 Cody     
 4 Roth     
 5 Zoe      
 6 Gretchen 
 7 Driscoll 
 8 Karyn    
 9 Merritt  
10 Kylan    
# … with more rows


In [20]:
%%R
##
## Selecting several columns
##
select(df, c('firstname', 'surname'))

# Source: spark<?> [?? x 2]
   firstname surname 
   <chr>     <chr>   
 1 Vivian    Hamilton
 2 Karen     Holcomb 
 3 Cody      Garrett 
 4 Roth      Fry     
 5 Zoe       Conway  
 6 Gretchen  Kinney  
 7 Driscoll  Klein   
 8 Karyn     Diaz    
 9 Merritt   Guy     
10 Kylan     Sexton  
# … with more rows
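
`select()` also accepts unquoted column names and the usual selection helpers, which is the more common dplyr style. A short sketch, assuming the `df` loaded from `people.csv` above; `name_cols` is a name introduced here.

```r
library(dplyr)

# Unquoted names: same columns as select(df, c('firstname', 'surname')).
select(df, firstname, surname)

# Drop columns with a minus sign.
select(df, -id, -birthdate)

# Select by name pattern with a helper; this matches firstname and surname.
name_cols <- select(df, ends_with('name'))
name_cols
```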


In [21]:
%%R
##
## Filtering rows with conditions
##
filter(df, color == 'blue')

# Source: spark<?> [?? x 6]
     id firstname surname birthdate           color quantity
  <int> <chr>     <chr>   <dttm>              <chr>    <int>
1     5 Zoe       Conway  1974-07-03 00:00:00 blue         2
2     7 Driscoll  Klein   1970-10-05 00:00:00 blue         5
3    15 Hope      Silva   1970-07-01 00:00:00 blue         5


In [22]:
%%R
##
## Queries
##

## The query is run directly in SQL
DBI::dbGetQuery(sc, 'SELECT * FROM people')


   id firstname  surname  birthdate  color quantity
1   1    Vivian Hamilton 1971-07-08  green        1
2   2     Karen  Holcomb 1974-05-23  green        4
3   3      Cody  Garrett 1973-04-22 orange        1
4   4      Roth      Fry 1975-01-29  black        1
5   5       Zoe   Conway 1974-07-03   blue        2
6   6  Gretchen   Kinney 1974-10-18 violet        1
7   7  Driscoll    Klein 1970-10-05   blue        5
8   8     Karyn     Diaz 1969-02-24    red        1
9   9   Merritt      Guy 1974-10-17 indigo        4
10 10     Kylan   Sexton 1975-02-28  black        4
11 11    Jordan    Estes 1969-12-07 indigo        4
12 12      Hope   Coffey 1973-12-24  green        5
13 13    Vivian    Crane 1970-08-27   gray        5
14 14      Clio     Noel 1972-12-12    red        5
15 15      Hope    Silva 1970-07-01   blue        5
16 16    Ayanna   Jarvis 1974-02-11 orange        5
17 17    Chanda    Boyer 1973-04-01  green        4
18 18  Chadwick   Knight 1973-04-29 yellow        1
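
`DBI::dbGetQuery()` returns an ordinary R data frame, so the full result is transferred to the R session. When the result should stay in Spark for further dplyr steps, `sdf_sql()` can be used instead. This is a sketch that assumes a sparklyr version providing `sdf_sql()` and the `people` table registered above; `green` is a name introduced here.

```r
library(sparklyr)
library(dplyr)

# Run the query in Spark and keep the result as a remote table.
green <- sdf_sql(
  sc,
  "SELECT firstname, surname, quantity FROM people WHERE color = 'green'"
)

# Further dplyr verbs are still executed by Spark.
green %>%
  filter(quantity >= 4) %>%
  arrange(firstname)
```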


In [23]:
%%R
arrange(df, desc(firstname))

# Source:     spark<?> [?? x 6]
# Ordered by: desc(firstname)
      id firstname surname  birthdate           color  quantity
   <int> <chr>     <chr>    <dttm>              <chr>     <int>
 1     5 Zoe       Conway   1974-07-03 00:00:00 blue          2
 2    13 Vivian    Crane    1970-08-27 00:00:00 gray          5
 3     1 Vivian    Hamilton 1971-07-08 00:00:00 green         1
 4     4 Roth      Fry      1975-01-29 00:00:00 black         1
 5     9 Merritt   Guy      1974-10-17 00:00:00 indigo        4
 6    10 Kylan     Sexton   1975-02-28 00:00:00 black         4
 7     8 Karyn     Diaz     1969-02-24 00:00:00 red           1
 8     2 Karen     Holcomb  1974-05-23 00:00:00 green         4
 9    11 Jordan    Estes    1969-12-07 00:00:00 indigo        4
10    12 Hope      Coffey   1973-12-24 00:00:00 green         5
# … with more rows


In [24]:
%%R
sdf_sort(df, c('quantity', 'firstname'))

# Source: spark<?> [?? x 6]
      id firstname surname  birthdate           color  quantity
   <int> <chr>     <chr>    <dttm>              <chr>     <int>
 1    18 Chadwick  Knight   1973-04-29 00:00:00 yellow        1
 2     3 Cody      Garrett  1973-04-22 00:00:00 orange        1
 3     6 Gretchen  Kinney   1974-10-18 00:00:00 violet        1
 4     8 Karyn     Diaz     1969-02-24 00:00:00 red           1
 5     4 Roth      Fry      1975-01-29 00:00:00 black         1
 6     1 Vivian    Hamilton 1971-07-08 00:00:00 green         1
 7     5 Zoe       Conway   1974-07-03 00:00:00 blue          2
 8    17 Chanda    Boyer    1973-04-01 00:00:00 green         4
 9    11 Jordan    Estes    1969-12-07 00:00:00 indigo        4
10     2 Karen     Holcomb  1974-05-23 00:00:00 green         4
# … with more rows


In [25]:
%%R
summarise(df, mean(quantity))

# Source: spark<?> [?? x 1]
  `mean(quantity)`
             <dbl>
1             3.22


In [26]:
%%R
mutate(df, prod=quantity*10)

# Source: spark<?> [?? x 7]
      id firstname surname  birthdate           color  quantity  prod
   <int> <chr>     <chr>    <dttm>              <chr>     <int> <dbl>
 1     1 Vivian    Hamilton 1971-07-08 00:00:00 green         1    10
 2     2 Karen     Holcomb  1974-05-23 00:00:00 green         4    40
 3     3 Cody      Garrett  1973-04-22 00:00:00 orange        1    10
 4     4 Roth      Fry      1975-01-29 00:00:00 black         1    10
 5     5 Zoe       Conway   1974-07-03 00:00:00 blue          2    20
 6     6 Gretchen  Kinney   1974-10-18 00:00:00 violet        1    10
 7     7 Driscoll  Klein    1970-10-05 00:00:00 blue          5    50
 8     8 Karyn     Diaz     1969-02-24 00:00:00 red           1    10
 9     9 Merritt   Guy      1974-10-17 00:00:00 indigo        4    40
10    10 Kylan     Sexton   1975-02-28 00:00:00 black         4    40
# … with more rows


In [27]:
%%R
result <- summarize(group_by(df, color), count=n(), mean_quantity=mean(quantity))
result

# Source: spark<?> [?? x 3]
  color  count mean_quantity
  <chr>  <dbl>         <dbl>
1 blue       3           4  
2 yellow     1           1  
3 green      4           3.5
4 orange     2           3  
5 black      2           2.5
6 indigo     2           4  
7 violet     1           1  
8 red        2           3  
9 gray       1           5  


In [28]:
%%R
cat(dbplyr::sql_render(result))

SELECT `color`, count(*) AS `count`, AVG(`quantity`) AS `mean_quantity`
FROM `people`
GROUP BY `color`

In [29]:
%%R
df %>% group_by(color) %>% summarize(count = n(), mean_quantity = mean(quantity))

# Source: spark<?> [?? x 3]
  color  count mean_quantity
  <chr>  <dbl>         <dbl>
1 blue       3           4  
2 yellow     1           1  
3 green      4           3.5
4 orange     2           3  
5 black      2           2.5
6 indigo     2           4  
7 violet     1           1  
8 red        2           3  
9 gray       1           5  


In [30]:
%%R
summarise(df, max(quantity))

# Source: spark<?> [?? x 1]
  `max(quantity)`
            <int>
1               5


In [31]:
!hdfs dfs -rm  -r -f /tmp/demo/

In [32]:
%%R
spark_write_csv(filter(df, color == 'blue'),
                path='/tmp/demo/',
                delimiter = ",",
                quote = '\"')

In [33]:
!hdfs dfs -ls /tmp/demo/

Found 2 items
-rw-r--r--   1 root supergroup          0 2019-07-30 21:44 /tmp/demo/_SUCCESS
-rw-r--r--   1 root supergroup        186 2019-07-30 21:44 /tmp/demo/part-00000-8186cc5d-f4c1-4edc-8c7f-d855c4dd7acc-c000.csv
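
Spark writes `part-*` files into the target directory rather than a single CSV file, as the listing above shows. The directory as a whole can be read back as a table. A sketch, assuming the `/tmp/demo/` directory written above still exists and using a hypothetical table name `blue_people`:

```r
library(sparklyr)

# Point the reader at the directory; Spark picks up the part files in it.
blue <- spark_read_csv(sc, 'blue_people', '/tmp/demo/')

# Three rows are expected, one per person whose color is 'blue'.
sdf_nrow(blue)
```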


## Examples

The following examples use the file `people.csv` created earlier in this tutorial.

In [34]:
%%R
df <- spark_read_csv(sc,                 # spark_connection
                     'people',           # table name
                     '/tmp/people.csv')  # file location

### Example 1

Select the people born in 1974 or later.

In [35]:
%%R
##
## Uses the dplyr filter() verb on the DataFrame
##
filter(df, birthdate >= '1974')

# Source: spark<?> [?? x 6]
     id firstname surname birthdate           color  quantity
  <int> <chr>     <chr>   <dttm>              <chr>     <int>
1     2 Karen     Holcomb 1974-05-23 00:00:00 green         4
2     4 Roth      Fry     1975-01-29 00:00:00 black         1
3     5 Zoe       Conway  1974-07-03 00:00:00 blue          2
4     6 Gretchen  Kinney  1974-10-18 00:00:00 violet        1
5     9 Merritt   Guy     1974-10-17 00:00:00 indigo        4
6    10 Kylan     Sexton  1975-02-28 00:00:00 black         4
7    16 Ayanna    Jarvis  1974-02-11 00:00:00 orange        5
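
The comparison above works because Spark coerces the timestamp column and the string `'1974'` to a common type; the exact coercion rule depends on the Spark version. A more explicit alternative extracts the year first. In this sketch `year()` is evaluated by Spark rather than by R, and `df` is assumed to be the table loaded from `people.csv` at the start of this section; `born_1974_on` is a name introduced here.

```r
library(dplyr)

# Keep people born in 1974 or later; year() runs inside Spark SQL.
born_1974_on <- filter(df, year(birthdate) >= 1974)
born_1974_on
```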


In [36]:
%%R
##
## The same selection as a SQL query over the `people` table
## registered by spark_read_csv()
##
DBI::dbGetQuery(sc, 'SELECT * FROM people WHERE YEAR(birthdate) >= 1974')

  id firstname surname  birthdate  color quantity
1  2     Karen Holcomb 1974-05-23  green        4
2  4      Roth     Fry 1975-01-29  black        1
3  5       Zoe  Conway 1974-07-03   blue        2
4  6  Gretchen  Kinney 1974-10-18 violet        1
5  9   Merritt     Guy 1974-10-17 indigo        4
6 10     Kylan  Sexton 1975-02-28  black        4
7 16    Ayanna  Jarvis 1974-02-11 orange        5


### Example 2

Get the list of unique colors.

In [37]:
%%R
##
## Uses the dplyr distinct() verb on the DataFrame
##
distinct(select(df, 'color'))

# Source: spark<?> [?? x 1]
  color 
  <chr> 
1 blue  
2 yellow
3 green 
4 orange
5 black 
6 indigo
7 violet
8 red   
9 gray  


In [38]:
%%R
##
## As a SQL query
##
DBI::dbGetQuery(sc, 'SELECT DISTINCT(color) FROM people')

   color
1   blue
2 yellow
3  green
4 orange
5  black
6 indigo
7 violet
8    red
9   gray


### Example 3

Sort the table by quantity and then by color.

In [39]:
%%R
##
## sdf_sort() sorts in ascending order by the listed columns,
## taken from left to right: here by color and then by quantity.
## Note that the SQL query in the next cell orders by quantity first.
##
sdf_sort(df, c('color', 'quantity'))

# Source: spark<?> [?? x 6]
      id firstname surname  birthdate           color quantity
   <int> <chr>     <chr>    <dttm>              <chr>    <int>
 1     4 Roth      Fry      1975-01-29 00:00:00 black        1
 2    10 Kylan     Sexton   1975-02-28 00:00:00 black        4
 3     5 Zoe       Conway   1974-07-03 00:00:00 blue         2
 4     7 Driscoll  Klein    1970-10-05 00:00:00 blue         5
 5    15 Hope      Silva    1970-07-01 00:00:00 blue         5
 6    13 Vivian    Crane    1970-08-27 00:00:00 gray         5
 7     1 Vivian    Hamilton 1971-07-08 00:00:00 green        1
 8     2 Karen     Holcomb  1974-05-23 00:00:00 green        4
 9    17 Chanda    Boyer    1973-04-01 00:00:00 green        4
10    12 Hope      Coffey   1973-12-24 00:00:00 green        5
# … with more rows
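
The statement asks for ordering by quantity first and then by color, which is also what the SQL query in the next cell does. A sketch of the matching dplyr call follows, assuming the `df` loaded from `people.csv` at the start of this section; `arrange()` sorts by its arguments from left to right, and `by_quantity` is a name introduced here.

```r
library(dplyr)

# Sort by quantity, breaking ties by color (both ascending).
by_quantity <- arrange(df, quantity, color)
by_quantity

# desc() reverses the direction for a single column.
arrange(df, desc(quantity), color)
```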


In [40]:
%%R
##
## As a SQL query
##
DBI::dbGetQuery(sc, 'SELECT * FROM people ORDER BY quantity, color')

   id firstname  surname  birthdate  color quantity
1   4      Roth      Fry 1975-01-29  black        1
2   1    Vivian Hamilton 1971-07-08  green        1
3   3      Cody  Garrett 1973-04-22 orange        1
4   8     Karyn     Diaz 1969-02-24    red        1
5   6  Gretchen   Kinney 1974-10-18 violet        1
6  18  Chadwick   Knight 1973-04-29 yellow        1
7   5       Zoe   Conway 1974-07-03   blue        2
8  10     Kylan   Sexton 1975-02-28  black        4
9   2     Karen  Holcomb 1974-05-23  green        4
10 17    Chanda    Boyer 1973-04-01  green        4
11  9   Merritt      Guy 1974-10-17 indigo        4
12 11    Jordan    Estes 1969-12-07 indigo        4
13  7  Driscoll    Klein 1970-10-05   blue        5
14 15      Hope    Silva 1970-07-01   blue        5
15 13    Vivian    Crane 1970-08-27   gray        5
16 12      Hope   Coffey 1973-12-24  green        5
17 16    Ayanna   Jarvis 1974-02-11 orange        5
18 14      Clio     Noel 1972-12-12    red        5


**Cleaning up the working directory**

In [41]:
!rm people.*
!hdfs dfs -rm  -r -f /tmp/demo/
!hdfs dfs -rm /tmp/people*

Deleted /tmp/demo
Deleted /tmp/people.csv
Deleted /tmp/people.json
