# Basic Structured Operations

First, let's create a Spark session:

import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4a5c6013
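
The cell that produced this output is not shown. A minimal sketch of how such a session might be created follows; the `appName` and `master` values are assumptions for a local run, not taken from the notebook.

```scala
import org.apache.spark.sql.SparkSession

// Build a session, or reuse the one a notebook or spark-shell already provides.
// "local[*]" and the application name are assumptions for a local run.
val spark: SparkSession = SparkSession.builder()
  .appName("basic-structured-operations")
  .master("local[*]")
  .getOrCreate()
```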


Then we have to locate the dataset's CSV file...

!ls ../../src/201508_station_data.csv

../../src/201508_station_data.csv



in order to load/read it:

+----------+--------------------+---------+-----------+---------+--------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|
+----------+--------------------+---------+-----------+---------+--------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|
|         6|    San Pedro Square|37.336721|-121.894074|       15|San Jose|    8/7/2013|
+----------+--------------------+---------+-----------+---------+--------+------------+
only showing top 5 rows



df: org.apache.spark.sql.DataFrame = [station_id: int, name: string ... 5 more fields]
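
The loading cell itself is missing from this export. The sketch below shows one way a CSV with a header row could be read so that the types match the ones printed above. To stay self-contained it writes two of the displayed rows to a temporary file; `csvPath` and `sampleDf` are hypothetical names, and in the notebook the path would be `../../src/201508_station_data.csv`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("read-csv-sketch").getOrCreate()

// Hypothetical stand-in file: two rows copied from the output above
val csvPath = Files.createTempFile("station_sample", ".csv")
Files.write(csvPath, Seq(
  "station_id,name,lat,long,dockcount,landmark,installation",
  "5,Adobe on Almaden,37.331415,-121.8932,19,San Jose,8/5/2013",
  "6,San Pedro Square,37.336721,-121.894074,15,San Jose,8/7/2013"
).mkString("\n").getBytes("UTF-8"))

val sampleDf = spark.read
  .option("header", "true")       // the first line holds the column names
  .option("inferSchema", "true")  // let Spark infer numeric types instead of reading everything as string
  .csv(csvPath.toString)

sampleDf.show(5)
```

Without `inferSchema`, every column would be read as a string, so `station_id` would not come back as `int`.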


List all the columns of our dataset (as a Scala array):

res1: Array[String] = Array(station_id, name, lat, long, dockcount, landmark, installation)


We can also get the number of columns:

res2: Int = 7


along with the number of rows:

res3: Long = 70


We can retrieve the first column name this way:

res4: String = station_id
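
The calls behind the four results above are not shown. They were probably close to the following sketch, where `sample` is a hypothetical three-column stand-in for `df`, built from rows displayed earlier.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("columns-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df: (station_id, dockcount, landmark) from the rows shown earlier
val sample = Seq((2, 27, "San Jose"), (3, 15, "San Jose"), (4, 11, "San Jose"))
  .toDF("station_id", "dockcount", "landmark")

val columnNames: Array[String] = sample.columns  // all column names
val nbCols: Int = sample.columns.length          // number of columns
val nbRows: Long = sample.count()                // number of rows (this triggers a Spark job)
val firstCol: String = sample.columns(0)         // name of the first column
```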


Now, let's add 5 to every station_id value:

+----------------+
|(station_id + 5)|
+----------------+
|               7|
|               8|
|               9|
|              10|
|              11|
+----------------+
only showing top 5 rows



The same thing can be done differently:

+----------------+
|(station_id + 5)|
+----------------+
|               7|
|               8|
|               9|
|              10|
|              11|
+----------------+
only showing top 5 rows
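
Two plausible ways to get the `(station_id + 5)` column above are a `Column` expression and a SQL expression string. This is a sketch on a hypothetical stand-in `sample`, not necessarily the exact calls used in the notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().master("local[*]").appName("add-five-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27), (3, 15), (4, 11)).toDF("station_id", "dockcount")

// Column arithmetic: col("station_id") + 5 builds a new Column
val viaColumn = sample.select(col("station_id") + 5)

// The same projection written as a SQL expression string
val viaExpr = sample.select(expr("station_id + 5"))

viaColumn.show()
viaExpr.show()
```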



Creating a row (record):

import org.apache.spark.sql.Row
myRow: org.apache.spark.sql.Row = [Hello,null,1,false]
res7: Any = Hello


Then display various elements of this row:

null
1
false


res9: String = Hello


res10: Int = 1


res11: Boolean = false
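
A sketch of how a `Row` like the one above can be built and read. Positional access returns `Any`, while the typed getters return a concrete type. The variable names other than `myRow` are assumptions.

```scala
import org.apache.spark.sql.Row

// A Row carries no schema of its own; values are accessed by position
val myRow = Row("Hello", null, 1, false)

val first: Any = myRow(0)                   // untyped access
val asString: String = myRow.getString(0)   // typed getters
val asInt: Int = myRow.getInt(2)
val asBool: Boolean = myRow.getBoolean(3)
val secondIsNull: Boolean = myRow.isNullAt(1)

println(myRow(1))
println(myRow(2))
println(myRow(3))
```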


Create a DataFrame from rows with a manual schema:

+----------+-----+----+
|someString|names|idNb|
+----------+-----+----+
|     Hello| null|   1|
|     What?| test|  10|
+----------+-----+----+



import org.apache.spark.sql.types._
myManualSchema: org.apache.spark.sql.types.StructType = StructType(StructField(someString,StringType,true), StructField(names,StringType,true), StructField(idNb,LongType,false))
myRows: Seq[org.apache.spark.sql.Row] = List([Hello,null,1], [What?,test,10])
myRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[41] at parallelize at <console>:50
myDF: org.apache.spark.sql.DataFrame = [someString: string, names: string ... 1 more field]
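
The variable names and the schema printed above suggest code along these lines. This is a reconstruction sketch; the session setup is an assumption.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("manual-schema-sketch").getOrCreate()

// Field name, type, and nullability, as in the printed StructType
val myManualSchema = StructType(Seq(
  StructField("someString", StringType, nullable = true),
  StructField("names", StringType, nullable = true),
  StructField("idNb", LongType, nullable = false)
))

// The values must match the declared types, hence the Long literals for idNb
val myRows = Seq(Row("Hello", null, 1L), Row("What?", "test", 10L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDF = spark.createDataFrame(myRDD, myManualSchema)

myDF.show()
```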


Or you can simply convert a Seq (roughly the equivalent of a Python list) to a DataFrame:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|Hello|   2|   1|
+-----+----+----+



myDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
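
A sketch of the `Seq` conversion. It relies on `spark.implicits._`, which provides `toDF` on local collections of tuples. The type of `col3` is not visible in the output above, so the `Long` used here is an assumption.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("seq-to-df-sketch").getOrCreate()
import spark.implicits._  // brings toDF into scope for Seq of tuples

// One tuple per row; the column names are given to toDF
val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")

myDF.show()
```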


Change a column name

+----------+--------------------+----------+-----------+---------+--------+------------+
|station_id|                name|LATTITUDE!|       long|dockcount|landmark|installation|
+----------+--------------------+----------+-----------+---------+--------+------------+
|         2|San Jose Diridon ...| 37.329732|-121.901782|       27|San Jose|    8/6/2013|
|         3|San Jose Civic Ce...| 37.330698|-121.888979|       15|San Jose|    8/5/2013|
+----------+--------------------+----------+-----------+---------+--------+------------+
only showing top 2 rows



+--------------------+
|                NAME|
+--------------------+
|San Jose Diridon ...|
|San Jose Civic Ce...|
+--------------------+
only showing top 2 rows



The same thing, the SQL way:

+-----------+
|  LONGITUDE|
+-----------+
|-121.901782|
|-121.888979|
|-121.894902|
|  -121.8932|
+-----------+
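
The three renaming outputs above could come from calls like the ones in this sketch. `sample` is a hypothetical stand-in for `df`, and `LATTITUDE!` is spelled as in the notebook output. The backticks around `long` are a precaution, since `long` is also a type name in Spark SQL.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df, using two rows whose names are shown in full above
val sample = Seq(
  (5, "Adobe on Almaden", 37.331415, -121.8932),
  (6, "San Pedro Square", 37.336721, -121.894074)
).toDF("station_id", "name", "lat", "long")

// Rename one column and keep all the others
val renamed = sample.withColumnRenamed("lat", "LATTITUDE!")

// Rename while selecting
val aliased = sample.select(col("name").alias("NAME"))

// SQL-style alias inside an expression string
val sqlStyle = sample.selectExpr("`long` AS LONGITUDE")
```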



import org.apache.spark.sql.functions._


//https://mungingdata.com/apache-spark/aggregations/

Duplicate landmark only if dockcount > 20:

+----------+--------------------+---------+-----------+---------+--------+------------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|landmark_bis|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|    San Jose|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|        null|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|        null|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|        null|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+
only showing top 4 rows



df_dup: org.apache.spark.sql.DataFrame = [station_id: int, name: string ... 6 more fields]


Add another condition:

+----------+--------------------+---------+-----------+---------+--------+------------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|landmark_ter|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|    San Jose|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|         bbb|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|        test|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|         bbb|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+
only showing top 4 rows



df_ter: org.apache.spark.sql.DataFrame = [station_id: int, name: string ... 6 more fields]
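
Both conditional columns above can be expressed with `when`. Without an `otherwise`, unmatched rows get `null`, which explains the `landmark_bis` output. In this sketch the `< 15` threshold for `"test"` is an assumption chosen to reproduce the displayed values; the notebook's actual condition is not shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local[*]").appName("when-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27, "San Jose"), (3, 15, "San Jose"), (4, 11, "San Jose"), (5, 19, "San Jose"))
  .toDF("station_id", "dockcount", "landmark")

// No otherwise(): rows that fail the condition get null
val dfDup = sample.withColumn("landmark_bis", when(col("dockcount") > 20, col("landmark")))

// Chained conditions are evaluated in order; otherwise() supplies the default
val dfTer = sample.withColumn("landmark_ter",
  when(col("dockcount") > 20, col("landmark"))
    .when(col("dockcount") < 15, "test")  // assumed threshold
    .otherwise("bbb"))

dfDup.show()
dfTer.show()
```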


Create a new column indicating whether two columns are equal:

+----------+--------------------+---------+-----------+---------+--------+------------+------------+----+
|station_id|                name|      lat|       long|dockcount|landmark|installation|landmark_bis|  EL|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+----+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|    San Jose|true|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|        null|null|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|        null|null|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|        null|null|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+----+
only showing top 4 rows



The same thing, the SQL way:

+----------+--------------------+---------+-----------+---------+--------+------------+------------+----+
|station_id|                name|      lat|       long|dockcount|landmark|installation|landmark_bis|  EL|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+----+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|    San Jose|true|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|        null|null|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|        null|null|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|        null|null|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+----+
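
The `null` values in `EL` come from SQL comparison semantics: comparing anything with `null` yields `null`, not `false`. The sketch below reproduces this with a `Column` comparison and with a SQL query, and adds the null-safe operator `<=>` for contrast. The view name `stations_bis` and the stand-in `sample` are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local[*]").appName("equality-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df_dup: landmark_bis is null unless dockcount > 20
val sample = Seq((2, 27, "San Jose"), (3, 15, "San Jose"), (4, 11, "San Jose"), (5, 19, "San Jose"))
  .toDF("station_id", "dockcount", "landmark")
  .withColumn("landmark_bis", when(col("dockcount") > 20, col("landmark")))

// Column API: === returns null when either side is null
val viaColumn = sample.withColumn("EL", col("landmark") === col("landmark_bis"))

// SQL: register a temporary view and compare in the SELECT list
sample.createOrReplaceTempView("stations_bis")
val viaSql = spark.sql("SELECT *, (landmark = landmark_bis) AS EL FROM stations_bis")

// Null-safe equality: returns false instead of null when only one side is null
val nullSafe = sample.withColumn("EL", col("landmark") <=> col("landmark_bis"))
```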



//https://hackersandslackers.com/transforming-pyspark-dataframes/

Average of a column's values:

+---------------+
|avg(station_id)|
+---------------+
|           43.0|
+---------------+



+---------------+
|avg(station_id)|
+---------------+
|           43.0|
+---------------+
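
The two identical outputs above suggest the average was computed in two ways. One plausible pair is `select` and `agg` with the `avg` function, sketched here on a hypothetical stand-in `sample`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder().master("local[*]").appName("avg-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27), (3, 15), (4, 11), (5, 19)).toDF("station_id", "dockcount")

// avg as a projection over the whole DataFrame
val viaSelect = sample.select(avg("station_id"))

// avg as an aggregation without grouping
val viaAgg = sample.agg(avg(col("station_id")))

viaSelect.show()
viaAgg.show()
```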



Number of distinct rows:

res29: Long = 70


Count distinct values of one column:

res30: Long = 5


+------------------------+
|count(DISTINCT landmark)|
+------------------------+
|                       5|
+------------------------+



Count distinct combinations of two columns:

+--------------------------------------+
|count(DISTINCT landmark, landmark_ter)|
+--------------------------------------+
|                                    13|
+--------------------------------------+



res33: Long = 13
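
The distinct counts above can be obtained either with `distinct().count()` or with the `countDistinct` aggregate. This sketch uses a hypothetical stand-in `sample` and pairs `landmark` with `dockcount`, since `landmark_ter` is not part of the stand-in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().master("local[*]").appName("distinct-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27, "San Jose"), (3, 15, "San Jose"), (61, 27, "San Francisco"), (77, 27, "San Francisco"))
  .toDF("station_id", "dockcount", "landmark")

val distinctRows: Long = sample.distinct().count()

// One column: via distinct() and via the aggregate function
val distinctLandmarks: Long = sample.select("landmark").distinct().count()
val distinctLandmarksAgg = sample.select(countDistinct("landmark"))

// Two columns: counts distinct (landmark, dockcount) pairs
val distinctPairsAgg = sample.select(countDistinct("landmark", "dockcount"))
val distinctPairs: Long = sample.select("landmark", "dockcount").distinct().count()
```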


Converting values to Spark literals:

+--------+---+
|landmark|one|
+--------+---+
|San Jose|  1|
|San Jose|  1|
|San Jose|  1|
+--------+---+
only showing top 3 rows



Adding a column with a literal value:

+----------+--------------------+---------+-----------+---------+--------+------------+---+
|station_id|                name|      lat|       long|dockcount|landmark|installation|one|
+----------+--------------------+---------+-----------+---------+--------+------------+---+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|  1|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|  1|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|  1|
+----------+--------------------+---------+-----------+---------+--------+------------+---+
only showing top 3 rows
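
`lit` wraps a Scala value in a `Column` so it can be used wherever a column expression is expected. The sketch below shows it in a `select` and in `withColumn`, on a hypothetical stand-in `sample`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().master("local[*]").appName("lit-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, "San Jose"), (3, "San Jose"), (4, "San Jose")).toDF("station_id", "landmark")

// A literal next to an existing column in a projection
val selected = sample.select(col("landmark"), lit(1).as("one"))

// The same literal appended to all existing columns
val added = sample.withColumn("one", lit(1))
```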



Adding a column from a condition:

+----------+--------------------+---------+-----------+---------+--------+------------+-------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|greater|
+----------+--------------------+---------+-----------+---------+--------+------------+-------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|   true|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|  false|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|  false|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|   true|
|         6|    San Pedro Square|37.336721|-121.894074|       15|San Jose|    8/7/2013|  false|
+----------+--------------------+---------+-----------+---------+--------+------------+-------+
only showing top 5 rows



+----------+--------------------+---------+-----------+---------+--------+------------+------------+-----+
|station_id|                name|      lat|       long|dockcount|landmark|installation|landmark_ter|equal|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+-----+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|    San Jose| true|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|         bbb|false|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|        test|false|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|         bbb|false|
|         6|    San Pedro Square|37.336721|-121.894074|       15|San Jose|    8/7/2013|         bbb|false|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+-----+
only showing top 5 rows
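
A boolean column is simply a comparison expression passed to `withColumn`. In this sketch the `> 15` threshold is an assumption that happens to reproduce the `greater` values above; the notebook's actual threshold is not shown. `sample` is a hypothetical stand-in that already contains `landmark_ter`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("condition-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df_ter, values copied from the output above
val sample = Seq((2, 27, "San Jose", "San Jose"), (3, 15, "San Jose", "bbb"), (4, 11, "San Jose", "test"))
  .toDF("station_id", "dockcount", "landmark", "landmark_ter")

// Comparison against a constant (assumed threshold)
val withGreater = sample.withColumn("greater", col("dockcount") > 15)

// Comparison between two columns
val withEqual = sample.withColumn("equal", col("landmark") === col("landmark_ter"))
```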



Duplicate a column:

+----------+--------------------+---------+-----------+---------+--------+------------+--------+
|station_id|                name|      lat|       long|dockcount|landmark|installation| newName|
+----------+--------------------+---------+-----------+---------+--------+------------+--------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|San Jose|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|San Jose|
+----------+--------------------+---------+-----------+---------+--------+------------+--------+
only showing top 2 rows



Drop columns

+----------+-----------+---------+--------+------------+
|station_id|       long|dockcount|landmark|installation|
+----------+-----------+---------+--------+------------+
|         2|-121.901782|       27|San Jose|    8/6/2013|
|         3|-121.888979|       15|San Jose|    8/5/2013|
+----------+-----------+---------+--------+------------+
only showing top 2 rows
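
Duplicating and dropping columns, as in the last two outputs, might look like the following sketch. `sample` is a hypothetical stand-in for `df`. DataFrames are immutable, so each call returns a new DataFrame and leaves `sample` unchanged.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("dup-drop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq(
  (5, "Adobe on Almaden", 37.331415, "San Jose"),
  (6, "San Pedro Square", 37.336721, "San Jose")
).toDF("station_id", "name", "lat", "landmark")

// Copy an existing column under a new name
val duplicated = sample.withColumn("newName", col("landmark"))

// Drop several columns at once
val dropped = sample.drop("name", "lat")
```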



Changing column type

root
 |-- station_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- long: double (nullable = true)
 |-- dockcount: integer (nullable = true)
 |-- landmark: string (nullable = true)
 |-- installation: string (nullable = true)
 |-- newDockCount: long (nullable = true)



Filtering rows (numerical condition)

+----------+--------------------+---------+-----------+---------+--------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|
+----------+--------------------+---------+-----------+---------+--------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|
|        11|         MLK Library|37.335885| -121.88566|       19|San Jose|    8/6/2013|
|        12|SJSU 4th at San C...|37.332808|-121.883891|       19|San Jose|    8/7/2013|
|        14|Arena Green / SAP...|37.332692|-121.900084|       19|San Jose|    8/5/2013|
+----------+--------------------+---------+-----------+---------+--------+------------+
only showing top 5 rows



+----------+--------------------+---------+-----------+---------+--------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|
+----------+--------------------+---------+-----------+---------+--------+------------+
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|
|        11|         MLK Library|37.335885| -121.88566|       19|San Jose|    8/6/2013|
|        12|SJSU 4th at San C...|37.332808|-121.883891|       19|San Jose|    8/7/2013|
+----------+--------------------+---------+-----------+---------+--------+------------+
only showing top 3 rows



The same thing, the SQL way:

+----------+--------------------+---------+-----------+---------+--------+------------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|landmark_bis|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|San Jose|    8/5/2013|        null|
|        11|         MLK Library|37.335885| -121.88566|       19|San Jose|    8/6/2013|        null|
|        12|SJSU 4th at San C...|37.332808|-121.883891|       19|San Jose|    8/7/2013|        null|
+----------+--------------------+---------+-----------+---------+--------+------------+------------+
only showing top 3 rows
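
Row filtering can be written with a `Column` predicate or with a SQL expression string; `filter` and `where` are interchangeable. The thresholds in this sketch (`> 15` and `= 19`) are assumptions that are merely consistent with the rows displayed above, and `sample` is a hypothetical stand-in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27), (3, 15), (5, 19), (11, 19)).toDF("station_id", "dockcount")

// Column predicates
val above15 = sample.filter(col("dockcount") > 15)
val exactly19 = sample.filter(col("dockcount") === 19)

// The same predicate as a SQL expression string
val exactly19Sql = sample.where("dockcount = 19")
```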



Filtering rows: two conditions on strings

+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|     San Jose|    8/6/2013|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|     San Jose|    8/6/2013|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|     San Jose|    8/5/2013|
|        11|         MLK Library|37.335885| -121.88566|       19|     San Jose|    8/6/2013|
|        12|SJSU 4th at San C...|37.332808|-121.883891|       19|     San Jose|    8/7/2013|
|        14|Arena Green / SAP...|37.332692|-121.900084|       19|     San Jose|    8/5/2013|
|        28|Mountain View Cal...|37.394358|-122.076713|       23|Mountain View|   8/15/2013|
|        29|San Antonio Caltr...| 37.40694|-122.106758|       23|Mount

Same thing the SQL Way

+----------+--------------------+---------+-----------+---------+-------------+------------+-------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation| landmark_bis|
+----------+--------------------+---------+-----------+---------+-------------+------------+-------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|     San Jose|    8/6/2013|     San Jose|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|     San Jose|    8/6/2013|         null|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|     San Jose|    8/5/2013|         null|
|        11|         MLK Library|37.335885| -121.88566|       19|     San Jose|    8/6/2013|         null|
|        12|SJSU 4th at San C...|37.332808|-121.883891|       19|     San Jose|    8/7/2013|         null|
|        14|Arena Green / SAP...|37.332692|-121.900084|       19|     San Jose|    8/5/2013|         null|
|        28|Mountain View Cal...|37.3

AND

+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|     San Jose|    8/6/2013|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|     San Jose|    8/6/2013|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|     San Jose|    8/5/2013|
|        11|         MLK Library|37.335885| -121.88566|       19|     San Jose|    8/6/2013|
|        12|SJSU 4th at San C...|37.332808|-121.883891|       19|     San Jose|    8/7/2013|
|        14|Arena Green / SAP...|37.332692|-121.900084|       19|     San Jose|    8/5/2013|
|        28|Mountain View Cal...|37.394358|-122.076713|       23|Mountain View|   8/15/2013|
|        29|San Antonio Caltr...| 37.40694|-122.106758|       23|Mount

+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|     San Jose|    8/6/2013|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|     San Jose|    8/5/2013|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|     San Jose|    8/6/2013|
|         5|    Adobe on Almaden|37.331415|  -121.8932|       19|     San Jose|    8/5/2013|
|         6|    San Pedro Square|37.336721|-121.894074|       15|     San Jose|    8/7/2013|
|         7|Paseo de San Antonio|37.333798|-121.886943|       15|     San Jose|    8/7/2013|
|         8| San Salvador at 1st|37.330165|-121.885831|       15|     San Jose|    8/5/2013|
|         9|           Japantown|37.348742|-121.894715|       15|     
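
The exact predicates behind the truncated outputs above cannot be recovered from this export. As a general sketch, two string conditions can be combined with `||` (OR) or `&&` (AND), and chaining `filter` calls is equivalent to AND. All conditions below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("two-conditions-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, "San Jose"), (28, "Mountain View"), (61, "San Francisco"))
  .toDF("station_id", "landmark")

// OR: keep rows matching either condition
val either = sample.filter(col("landmark") === "San Jose" || col("landmark") === "Mountain View")

// AND: keep rows matching both conditions
val both = sample.filter(col("landmark").startsWith("San") && col("landmark") =!= "San Francisco")

// Chained filters behave like AND
val chained = sample
  .filter(col("landmark").startsWith("San"))
  .filter(col("landmark") =!= "San Francisco")
```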

Sample

seed: Int = 5
withReplacement: Boolean = false
fraction: Double = 0.5
df_sample: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [station_id: int, name: string ... 5 more fields]


+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|     San Jose|    8/6/2013|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|     San Jose|    8/5/2013|
|         6|    San Pedro Square|37.336721|-121.894074|       15|     San Jose|    8/7/2013|
|         7|Paseo de San Antonio|37.333798|-121.886943|       15|     San Jose|    8/7/2013|
|         8| San Salvador at 1st|37.330165|-121.885831|       15|     San Jose|    8/5/2013|
|        10|  San Jose City Hall|37.337391|-121.886995|       15|     San Jose|    8/6/2013|
|        11|         MLK Library|37.335885| -121.88566|       19|     San Jose|    8/6/2013|
|        13|       St James Park|37.339301|-121.889937|       15|     

Random splits

13


seed: Int = 42
testAndTrain: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([station_id: int, name: string ... 5 more fields], [station_id: int, name: string ... 5 more fields])


57


res51: Double = 17.5
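
A sketch of sampling and random splitting with the parameter values printed above. The `Array(0.25, 0.75)` weights are an assumption inferred from `17.5` (a quarter of 70 rows), and `stations` is a hypothetical 70-row stand-in for `df`. Both operations are randomized, so the counts are approximate: a 0.25 weight gives an expected size of 17.5, not a guaranteed one.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sample-split-sketch").getOrCreate()

// Hypothetical stand-in with as many rows as the dataset
val stations = spark.range(70).toDF("station_id")

// Sampling: each row is kept with probability `fraction`
val seed = 5
val withReplacement = false
val fraction = 0.5
val dfSample = stations.sample(withReplacement, fraction, seed)

// Random split: weights are normalized and every row lands in exactly one split
val testAndTrain = stations.randomSplit(Array(0.25, 0.75), seed = 42)

println(testAndTrain(0).count())
println(testAndTrain(1).count())
println(stations.count() * 0.25)  // expected size of the first split
```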


// https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html

Sorting rows 

+----------+-----------------+---------+-----------+---------+-------------+------------+
|station_id|             name|      lat|       long|dockcount|     landmark|installation|
+----------+-----------------+---------+-----------+---------+-------------+------------+
|        61|  2nd at Townsend|37.780526|-122.390288|       27|San Francisco|   8/22/2013|
|        77|Market at Sansome|37.789625|-122.400811|       27|San Francisco|   8/25/2013|
+----------+-----------------+---------+-----------+---------+-------------+------------+
only showing top 2 rows



+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|     San Jose|    8/6/2013|
|        61|     2nd at Townsend|37.780526|-122.390288|       27|San Francisco|   8/22/2013|
|        67|      Market at 10th|37.776619|-122.417385|       27|San Francisco|   8/23/2013|
+----------+--------------------+---------+-----------+---------+-------------+------------+
only showing top 3 rows



+----------+-----------------+---------+-----------+---------+-------------+------------+
|station_id|             name|      lat|       long|dockcount|     landmark|installation|
+----------+-----------------+---------+-----------+---------+-------------+------------+
|        77|Market at Sansome|37.789625|-122.400811|       27|San Francisco|   8/25/2013|
|        67|   Market at 10th|37.776619|-122.417385|       27|San Francisco|   8/23/2013|
|        61|  2nd at Townsend|37.780526|-122.390288|       27|San Francisco|   8/22/2013|
+----------+-----------------+---------+-----------+---------+-------------+------------+
only showing top 3 rows



+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                name|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|     San Jose|    8/6/2013|
|        32|Castro Street and...|37.385956|-122.083678|       11|Mountain View|  12/31/2013|
|        35|University and Em...|37.444521|-122.163093|       11|    Palo Alto|   8/15/2013|
+----------+--------------------+---------+-----------+---------+-------------+------------+
only showing top 3 rows
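
The four outputs above look like sorts on `dockcount` with different directions and tie-breakers. The exact calls are not shown, so the following is a sketch of the usual forms on a hypothetical stand-in `sample`; `sort` and `orderBy` are aliases.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{asc, col, desc}

val spark = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27), (61, 27), (77, 27), (4, 11)).toDF("station_id", "dockcount")

// Descending on one column, ascending tie-breaker on another
val descThenAsc = sample.sort(desc("dockcount"), asc("station_id"))

// Descending on both columns, using the Column methods
val descBoth = sample.orderBy(col("dockcount").desc, col("station_id").desc)

// Ascending is the default direction
val ascending = sample.orderBy("dockcount")
```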



res57: Int = 1


rdd_repart: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[244] at repartition at <console>:40


res58: Int = 4


res59: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[244] at repartition at <console>:40
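
The last results show a partition count of 1, an RDD named `rdd_repart` created by `repartition`, and a partition count of 4. A sketch of what those cells may have done follows; `sample` is a hypothetical stand-in for `df`, and its initial partition count depends on the environment.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("repartition-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df
val sample = Seq((2, 27), (3, 15), (4, 11), (5, 19)).toDF("station_id", "dockcount")

// Number of partitions of the underlying RDD before repartitioning
val before: Int = sample.rdd.getNumPartitions

// repartition returns a new RDD with the requested number of partitions (this causes a shuffle)
val rdd_repart = sample.rdd.repartition(4)
val after: Int = rdd_repart.getNumPartitions
```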

