# Basic Operations

This lecture will cover some basic operations with Spark DataFrames.

We will play around with some airline flight data: the `asa/small` sample under `/databricks-datasets`.

In [0]:
display(dbutils.fs.ls("/databricks-datasets/asa/small"))

path,name,size,modificationTime
dbfs:/databricks-datasets/asa/small/small.csv,small.csv,20411051,1459311537000


In [0]:
df = spark.read.csv("dbfs:/databricks-datasets/asa/small/small.csv", inferSchema=True, header=True)

In [0]:
df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- TaxiIn: integer (nullable = true)
 |-- TaxiOut: integer (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- 

In [0]:
display(df.take(5))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,31,2,1415,1420,2058,2117,UA,48,N598UA,283,297,272,-19,-5,LIH,SFO,2447,4,7,0,,0,,,,,
2002,12,13,5,2237,2230,711,654,UA,48,N546UA,334,324,313,17,7,SFO,BOS,2704,6,15,0,,0,,,,,
2002,12,14,6,2227,2230,638,654,UA,48,N551UA,311,324,296,-16,-3,SFO,BOS,2704,3,12,0,,0,,,,,
2002,12,15,7,2226,2230,645,654,UA,48,N544UA,319,324,295,-9,-4,SFO,BOS,2704,4,20,0,,0,,,,,
2002,12,16,1,2224,2230,641,654,UA,48,N597UA,317,324,293,-13,-6,SFO,BOS,2704,5,19,0,,0,,,,,


## Filtering Data

A large part of working with DataFrames is the ability to quickly filter data based on conditions. Spark DataFrames are built on top of the Spark SQL platform, which means that if you already know SQL, you can quickly and easily grab that data using SQL commands, or using the DataFrame methods (which is what we focus on in this course).

In [0]:
display(df.filter("ActualElapsedTime<120"))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,1,7,948,950,1117,1126,UA,51,N545UA,89,96,69,-9,-2,BOS,IAD,413,3,17,0,,0,,,,,
2002,12,2,1,948,950,1119,1126,UA,51,N548UA,91,96,75,-7,-2,BOS,IAD,413,5,11,0,,0,,,,,
2002,12,3,2,950,950,1127,1126,UA,51,N550UA,97,96,68,1,0,BOS,IAD,413,4,25,0,,0,,,,,
2002,12,4,3,946,950,1125,1126,UA,51,N548UA,99,96,77,-1,-4,BOS,IAD,413,4,18,0,,0,,,,,
2002,12,5,4,957,950,1150,1126,UA,51,N549UA,113,96,81,24,7,BOS,IAD,413,18,14,0,,0,,,,,
2002,12,7,6,949,950,1116,1126,UA,51,N546UA,87,96,71,-10,-1,BOS,IAD,413,5,11,0,,0,,,,,
2002,12,8,7,948,950,1121,1126,UA,51,N598UA,93,96,75,-5,-2,BOS,IAD,413,4,14,0,,0,,,,,
2002,12,9,1,947,950,1121,1126,UA,51,N597UA,94,96,75,-5,-3,BOS,IAD,413,4,15,0,,0,,,,,
2002,12,10,2,947,950,1115,1126,UA,51,N549UA,88,96,74,-11,-3,BOS,IAD,413,3,11,0,,0,,,,,
2002,12,11,3,1000,950,1147,1126,UA,51,N546UA,107,96,84,21,10,BOS,IAD,413,8,15,0,,0,,,,,


In [0]:
df.filter("ActualElapsedTime<120").select("DepTime").show()

+-------+
|DepTime|
+-------+
|    948|
|    948|
|    950|
|    946|
|    957|
|    949|
|    948|
|    947|
|    947|
|   1000|
|    945|
|    944|
|    946|
|    947|
|    943|
|    947|
|    951|
|    958|
|    950|
|   1001|
+-------+
only showing top 20 rows



In [0]:
df[df.ActualElapsedTime<120].select("DepTime").show() # pandas-style syntax for the filter; equivalent to df.filter(df.ActualElapsedTime < 120)

+-------+
|DepTime|
+-------+
|    948|
|    948|
|    950|
|    946|
|    957|
|    949|
|    948|
|    947|
|    947|
|   1000|
|    945|
|    944|
|    946|
|    947|
|    943|
|    947|
|    951|
|    958|
|    950|
|   1001|
+-------+
only showing top 20 rows



In [0]:
df.filter("ActualElapsedTime<120").select(["DayOfWeek", "DayOfMonth"]).show()

+---------+----------+
|DayOfWeek|DayOfMonth|
+---------+----------+
|        7|         1|
|        1|         2|
|        2|         3|
|        3|         4|
|        4|         5|
|        6|         7|
|        7|         8|
|        1|         9|
|        2|        10|
|        3|        11|
|        4|        12|
|        5|        13|
|        6|        14|
|        7|        15|
|        1|        16|
|        2|        17|
|        3|        18|
|        4|        19|
|        5|        20|
|        6|        21|
+---------+----------+
only showing top 20 rows



Another way to do this is with ordinary Python comparison operators. They look very similar to SQL operators, but you must reference the entire column of the DataFrame, using the form df["column name"].
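
It may help to see why a comparison on a column can be passed to a filter at all: the comparison does not yield a single True/False, but a condition that is evaluated row by row later. The following is only a toy sketch in plain Python; `ToyColumn` and `toy_filter` are hypothetical names invented for this illustration and are not part of PySpark.

```python
class ToyColumn:
    """Hypothetical stand-in for a column reference (not PySpark's Column)."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, value):
        # Return a deferred per-row test instead of an immediate True/False.
        return lambda row: row[self.name] < value

    def __gt__(self, value):
        return lambda row: row[self.name] > value


def toy_filter(rows, condition):
    """Keep the rows for which the deferred condition holds."""
    return [row for row in rows if condition(row)]


rows = [
    {"TailNum": "N549UA", "ActualElapsedTime": 135},
    {"TailNum": "N598UA", "ActualElapsedTime": 283},
]

condition = ToyColumn("ActualElapsedTime") < 200  # nothing is compared yet
short_flights = toy_filter(rows, condition)       # the test runs per row here
print(short_flights)
```

In the real API the comparison produces a column expression that Spark evaluates itself, rather than a Python function, but the deferred evaluation is the same idea.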

Let's see some examples:

In [0]:
display(df.filter(df["ActualElapsedTime"] < 200))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,1,7,701,700,1016,1020,UA,50,N549UA,135,140,104,-4,1,LAX,DEN,862,6,25,0,,0,,,,,
2002,12,2,1,701,700,1016,1020,UA,50,N544UA,135,140,108,-4,1,LAX,DEN,862,6,21,0,,0,,,,,
2002,12,3,2,701,700,1028,1020,UA,50,N545UA,147,140,112,8,1,LAX,DEN,862,7,28,0,,0,,,,,
2002,12,4,3,658,700,1012,1020,UA,50,N550UA,134,140,109,-8,-2,LAX,DEN,862,5,20,0,,0,,,,,
2002,12,5,4,656,700,1024,1020,UA,50,N548UA,148,140,105,4,-4,LAX,DEN,862,21,22,0,,0,,,,,
2002,12,6,5,654,700,1012,1020,UA,50,N549UA,138,140,110,-8,-6,LAX,DEN,862,15,13,0,,0,,,,,
2002,12,7,6,654,700,1028,1020,UA,50,N550UA,154,140,106,8,-6,LAX,DEN,862,30,18,0,,0,,,,,
2002,12,8,7,654,700,1012,1020,UA,50,N551UA,138,140,109,-8,-6,LAX,DEN,862,5,24,0,,0,,,,,
2002,12,9,1,700,700,1026,1020,UA,50,N548UA,146,140,113,6,0,LAX,DEN,862,6,27,0,,0,,,,,
2002,12,10,2,656,700,1015,1020,UA,50,N598UA,139,140,116,-5,-4,LAX,DEN,862,6,17,0,,0,,,,,


In [0]:
df.filter(df["DayOfWeek"]<5 and df["DayOfMonth"]> 15).show() # See the error below

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-834199842432360> in <module>
----> 1 df.filter(df["DayOfWeek"]<5 and df["DayOfMonth"]> 15).show()

/databricks/spark/python/pyspark/sql/column.py in __nonzero__(self)
    914 
    915     def __nonzero__(self):
--> 916         raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
    917         

In [0]:
display(df.filter((df["DayOfWeek"] < 5) & (df["DayOfMonth"] > 15)))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,31,2,1415.0,1420,2058.0,2117,UA,48,N598UA,283.0,297,272.0,-19.0,-5.0,LIH,SFO,2447,4,7,0,,0,,,,,
2002,12,16,1,2224.0,2230,641.0,654,UA,48,N597UA,317.0,324,293.0,-13.0,-6.0,SFO,BOS,2704,5,19,0,,0,,,,,
2002,12,17,2,2226.0,2230,703.0,654,UA,48,N598UA,337.0,324,294.0,9.0,-4.0,SFO,BOS,2704,3,40,0,,0,,,,,
2002,12,18,3,2233.0,2230,652.0,654,UA,48,N544UA,319.0,324,286.0,-2.0,3.0,SFO,BOS,2704,6,27,0,,0,,,,,
2002,12,19,4,2254.0,2230,659.0,654,UA,48,N551UA,305.0,324,288.0,5.0,24.0,SFO,BOS,2704,6,11,0,,0,,,,,
2002,12,23,1,2228.0,2230,633.0,654,UA,48,N595UA,305.0,324,285.0,-21.0,-2.0,SFO,BOS,2704,5,15,0,,0,,,,,
2002,12,24,2,2225.0,2230,647.0,654,UA,48,N597UA,322.0,324,302.0,-7.0,-5.0,SFO,BOS,2704,6,14,0,,0,,,,,
2002,12,25,3,2239.0,2230,714.0,654,UA,48,N545UA,335.0,324,308.0,20.0,9.0,SFO,BOS,2704,8,19,0,,0,,,,,
2002,12,26,4,2231.0,2230,643.0,654,UA,48,N552UA,312.0,324,297.0,-11.0,1.0,SFO,BOS,2704,3,12,0,,0,,,,,
2002,12,30,1,2232.0,2230,645.0,654,UA,48,N597UA,313.0,324,284.0,-9.0,2.0,SFO,BOS,2704,6,23,0,,0,,,,,


In [0]:
df.filter(df["DayOfWeek"] < 5).select(["ActualElapsedTime", "TailNum", "AirTime"]).show()

+-----------------+-------+-------+
|ActualElapsedTime|TailNum|AirTime|
+-----------------+-------+-------+
|              283| N598UA|    272|
|              317| N597UA|    293|
|              337| N598UA|    294|
|              319| N544UA|    286|
|              305| N551UA|    288|
|              305| N595UA|    285|
|              322| N597UA|    302|
|              335| N545UA|    308|
|              312| N552UA|    297|
|              313| N597UA|    284|
|              313| N598UA|    295|
|              376| N675UA|    325|
|              342| N666UA|    319|
|              317| N675UA|    299|
|              373| N664UA|    319|
|              318| N665UA|    301|
|              320| N667UA|    302|
|              313| N665UA|    295|
|              346| N676UA|    321|
|              360| N670UA|    323|
+-----------------+-------+-------+
only showing top 20 rows



In [0]:
display(df.filter((df["DayOfWeek"] < 5) & (df["DayOfMonth"] > 20)).take(5))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,31,2,1415,1420,2058,2117,UA,48,N598UA,283,297,272,-19,-5,LIH,SFO,2447,4,7,0,,0,,,,,
2002,12,23,1,2228,2230,633,654,UA,48,N595UA,305,324,285,-21,-2,SFO,BOS,2704,5,15,0,,0,,,,,
2002,12,24,2,2225,2230,647,654,UA,48,N597UA,322,324,302,-7,-5,SFO,BOS,2704,6,14,0,,0,,,,,
2002,12,25,3,2239,2230,714,654,UA,48,N545UA,335,324,308,20,9,SFO,BOS,2704,8,19,0,,0,,,,,
2002,12,26,4,2231,2230,643,654,UA,48,N552UA,312,324,297,-11,1,SFO,BOS,2704,3,12,0,,0,,,,,


In [0]:
display(df.filter((df["DayOfWeek"] < 5) | (df["DayOfMonth"] > 20)).take(5))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,31,2,1415,1420,2058,2117,UA,48,N598UA,283,297,272,-19,-5,LIH,SFO,2447,4,7,0,,0,,,,,
2002,12,16,1,2224,2230,641,654,UA,48,N597UA,317,324,293,-13,-6,SFO,BOS,2704,5,19,0,,0,,,,,
2002,12,17,2,2226,2230,703,654,UA,48,N598UA,337,324,294,9,-4,SFO,BOS,2704,3,40,0,,0,,,,,
2002,12,18,3,2233,2230,652,654,UA,48,N544UA,319,324,286,-2,3,SFO,BOS,2704,6,27,0,,0,,,,,
2002,12,19,4,2254,2230,659,654,UA,48,N551UA,305,324,288,5,24,SFO,BOS,2704,6,11,0,,0,,,,,


In [0]:
display(df.filter((df["DayOfWeek"] < 5) & ~(df["DayOfMonth"] < 20)).take(5))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,31,2,1415,1420,2058,2117,UA,48,N598UA,283,297,272,-19,-5,LIH,SFO,2447,4,7,0,,0,,,,,
2002,12,23,1,2228,2230,633,654,UA,48,N595UA,305,324,285,-21,-2,SFO,BOS,2704,5,15,0,,0,,,,,
2002,12,24,2,2225,2230,647,654,UA,48,N597UA,322,324,302,-7,-5,SFO,BOS,2704,6,14,0,,0,,,,,
2002,12,25,3,2239,2230,714,654,UA,48,N545UA,335,324,308,20,9,SFO,BOS,2704,8,19,0,,0,,,,,
2002,12,26,4,2231,2230,643,654,UA,48,N552UA,312,324,297,-11,1,SFO,BOS,2704,3,12,0,,0,,,,,


In [0]:
display(df.filter(df["DayOfWeek"] == 2).take(2))

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,31,2,1415,1420,2058,2117,UA,48,N598UA,283,297,272,-19,-5,LIH,SFO,2447,4,7,0,,0,,,,,
2002,12,17,2,2226,2230,703,654,UA,48,N598UA,337,324,294,9,-4,SFO,BOS,2704,3,40,0,,0,,,,,


In [0]:
display(df.filter(df["DayOfMonth"] == 3).collect())

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2002,12,3,2,2213.0,2200,448.0,500,UA,50,N550UA,275.0,300,256.0,-12.0,13.0,KOA,LAX,2504,7,12,0,,0,,,,,
2002,12,3,2,701.0,700,1028.0,1020,UA,50,N545UA,147.0,140,112.0,8.0,1.0,LAX,DEN,862,7,28,0,,0,,,,,
2002,12,3,2,950.0,950,1127.0,1126,UA,51,N550UA,97.0,96,68.0,1.0,0.0,BOS,IAD,413,4,25,0,,0,,,,,
2002,12,3,2,1300.0,1235,1522.0,1503,UA,51,N550UA,322.0,328,305.0,19.0,25.0,IAD,LAX,2288,5,12,0,,0,,,,,
2002,12,3,2,1701.0,1640,2107.0,2012,UA,51,N550UA,366.0,332,338.0,55.0,21.0,LAX,KOA,2504,3,25,0,,0,,,,,
2002,12,3,2,1346.0,1345,2028.0,2102,UA,52,N211UA,282.0,317,257.0,-34.0,1.0,HNL,LAX,2556,9,16,0,,0,,,,,
2002,12,3,2,917.0,920,1307.0,1253,UA,53,N215UA,350.0,333,333.0,14.0,-3.0,LAX,HNL,2556,3,14,0,,0,,,,,
2002,12,3,2,2152.0,2155,438.0,502,UA,54,N210UA,286.0,307,266.0,-24.0,-3.0,HNL,LAX,2556,7,13,0,,0,,,,,
2002,12,3,2,1214.0,1130,1608.0,1509,UA,55,N210UA,354.0,339,341.0,59.0,44.0,LAX,HNL,2556,3,10,0,,0,,,,,
2002,12,3,2,2328.0,2330,618.0,647,UA,56,N672UA,290.0,317,268.0,-29.0,-2.0,HNL,LAX,2556,6,16,0,,0,,,,,


In [0]:
result = df.filter(df["DayOfWeek"] == 3).collect()

In [0]:
type(result[0])

Out[29]: pyspark.sql.types.Row

In [0]:
row = result[0]

In [0]:
row.asDict()

Out[36]: {'Year': 2002,
 'Month': 12,
 'DayofMonth': 18,
 'DayOfWeek': 3,
 'DepTime': '2233',
 'CRSDepTime': 2230,
 'ArrTime': '652',
 'CRSArrTime': 654,
 'UniqueCarrier': 'UA',
 'FlightNum': 48,
 'TailNum': 'N544UA',
 'ActualElapsedTime': '319',
 'CRSElapsedTime': 324,
 'AirTime': '286',
 'ArrDelay': '-2',
 'DepDelay': '3',
 'Origin': 'SFO',
 'Dest': 'BOS',
 'Distance': 2704,
 'TaxiIn': 6,
 'TaxiOut': 27,
 'Cancelled': 0,
 'CancellationCode': 'NA',
 'Diverted': 0,
 'CarrierDelay': 'NA',
 'WeatherDelay': 'NA',
 'NASDelay': 'NA',
 'SecurityDelay': 'NA',
 'LateAircraftDelay': 'NA'}

In [0]:
for item in result[0]:
    print(item)

2002
12
18
3
2233
2230
652
654
UA
48
N544UA
319
324
286
-2
3
SFO
BOS
2704
6
27
0
NA
0
NA
NA
NA
NA
NA


That is all for now. Great job!