<a href="https://colab.research.google.com/github/jcestevezc/Cloudera/blob/master/Spark/Introduction%20to%20PySpark/Spark%20Catalog%20and%20transformations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
 #! pip install pyspark
import pandas as pd

#Connecting to Clusters

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

In [2]:
# Verify SparkContext
#print(sc)

# Print Spark version
#print(sc.version)

# RRDs

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. 

# Dataframes

The Spark DataFrames are abstraction built on top of RDDs. The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs.

# Spark Context and Spark Sessions

You can think of the SparkContext as your connection to the cluster and the SparkSession as your interface with that connection.

In [3]:
# Import SparkSession from 
from pyspark.sql import SparkSession

# Create Session
spark = SparkSession.builder.getOrCreate()

# Print spark
print(spark)

<pyspark.sql.session.SparkSession object at 0x7f1ad440ecf8>


# Consulting the Spark Catalog

Once you've created a SparkSession, you can start poking around to see what data is in your cluster!

Your SparkSession has an attribute called catalog which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the .listTables() method, which returns the names of all the tables in your cluster as a list.

In [4]:
print(spark.catalog.listTables())

[]


## Creating a view in Spark Catalog

### Loading the data using a pandas dataframe

In [5]:
url = 'https://raw.githubusercontent.com/jcestevezc/Cloudera/master/Spark/Introduction%20to%20PySpark/flights_small.csv'
data = pd.read_csv(url)

data.head()

Unnamed: 0,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,tailnum,flight,origin,dest,air_time,distance,hour,minute
0,2014,12,8,658.0,-7.0,935.0,-5.0,VX,N846VA,1780,SEA,LAX,132.0,954,6.0,58.0
1,2014,1,22,1040.0,5.0,1505.0,5.0,AS,N559AS,851,SEA,HNL,360.0,2677,10.0,40.0
2,2014,3,9,1443.0,-2.0,1652.0,2.0,VX,N847VA,755,SEA,SFO,111.0,679,14.0,43.0
3,2014,4,9,1705.0,45.0,1839.0,34.0,WN,N360SW,344,PDX,SJC,83.0,569,17.0,5.0
4,2014,3,9,754.0,-1.0,1015.0,1.0,AS,N612AS,522,SEA,BUR,127.0,937,7.0,54.0


In [6]:
# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(data.astype(str))

# Examine the tables in the catalog
print(spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView('flights')

# Examine the tables in the catalog again
print(spark.catalog.listTables())

[]
[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


In [7]:
# Don't change this query
query = "FROM flights SELECT * LIMIT 10"

# Get the first 10 rows of flights
flights10 = spark.sql(query)

# Show the results
flights10.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|   658.0|     -7.0|   935.0|     -5.0|     VX| N846VA|  1780|   SEA| LAX|   132.0|     954| 6.0|  58.0|
|2014|    1| 22|  1040.0|      5.0|  1505.0|      5.0|     AS| N559AS|   851|   SEA| HNL|   360.0|    2677|10.0|  40.0|
|2014|    3|  9|  1443.0|     -2.0|  1652.0|      2.0|     VX| N847VA|   755|   SEA| SFO|   111.0|     679|14.0|  43.0|
|2014|    4|  9|  1705.0|     45.0|  1839.0|     34.0|     WN| N360SW|   344|   PDX| SJC|    83.0|     569|17.0|   5.0|
|2014|    3|  9|   754.0|     -1.0|  1015.0|      1.0|     AS| N612AS|   522|   SEA| BUR|   127.0|     937| 7.0|  54.0|
|2014|    1| 15|  1037.0|      7.0|  135

### Loading the data using a spark dataframe

In [8]:
# Loading the data
file_path = '/content/sample_data/airports.csv'

# Read in the airports data
airports = spark.read.csv(file_path,header=True)

# Show the data
airports.show(5)

+---+--------------------+----------+-----------+----+---+---+
|faa|                name|       lat|        lon| alt| tz|dst|
+---+--------------------+----------+-----------+----+---+---+
|04G|   Lansdowne Airport|41.1304722|-80.6195833|1044| -5|  A|
|06A|Moton Field Munic...|32.4605722|-85.6800278| 264| -5|  A|
|06C| Schaumburg Regional|41.9893408|-88.1012428| 801| -6|  A|
|06N|     Randall Airport| 41.431912|-74.3915611| 523| -5|  A|
|09J|Jekyll Island Air...|31.0744722|-81.4277778|  11| -4|  A|
+---+--------------------+----------+-----------+----+---+---+
only showing top 5 rows



## Data transformation

### Creating columns

In Spark you can do this using the .withColumn() method, which takes two arguments. First, a string with the name of your new column, and second the new column itself.

The new column must be an object of class Column. Creating one of these is as easy as extracting a column from your DataFrame using df.colName.

Updating a Spark DataFrame is somewhat different than working in pandas because the Spark DataFrame is immutable. This means that it can't be changed, and so columns can't be updated in place.

In [9]:
# Reading the DataFrame flights
flights = spark.table("flights")

In [10]:
# Showing the squema
flights.printSchema()

root
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)



In [11]:
# Show the head
flights.show(5)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|   658.0|     -7.0|   935.0|     -5.0|     VX| N846VA|  1780|   SEA| LAX|   132.0|     954| 6.0|  58.0|
|2014|    1| 22|  1040.0|      5.0|  1505.0|      5.0|     AS| N559AS|   851|   SEA| HNL|   360.0|    2677|10.0|  40.0|
|2014|    3|  9|  1443.0|     -2.0|  1652.0|      2.0|     VX| N847VA|   755|   SEA| SFO|   111.0|     679|14.0|  43.0|
|2014|    4|  9|  1705.0|     45.0|  1839.0|     34.0|     WN| N360SW|   344|   PDX| SJC|    83.0|     569|17.0|   5.0|
|2014|    3|  9|   754.0|     -1.0|  1015.0|      1.0|     AS| N612AS|   522|   SEA| BUR|   127.0|     937| 7.0|  54.0|
+----+-----+---+--------+---------+-----

Thus, all these methods return a new DataFrame. To overwrite the original DataFrame you must reassign the returned DataFrame using the method like so:

In [12]:
flights = flights.withColumn("duration_hrs",flights.air_time/60)
flights.show(5)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|      duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|2014|   12|  8|   658.0|     -7.0|   935.0|     -5.0|     VX| N846VA|  1780|   SEA| LAX|   132.0|     954| 6.0|  58.0|               2.2|
|2014|    1| 22|  1040.0|      5.0|  1505.0|      5.0|     AS| N559AS|   851|   SEA| HNL|   360.0|    2677|10.0|  40.0|               6.0|
|2014|    3|  9|  1443.0|     -2.0|  1652.0|      2.0|     VX| N847VA|   755|   SEA| SFO|   111.0|     679|14.0|  43.0|              1.85|
|2014|    4|  9|  1705.0|     45.0|  1839.0|     34.0|     WN| N360SW|   344|   PDX| SJC|    83.0|     569|17.0|   5.0|1.3833333333333333|
|2014|    3|  9|   754.0|  

### Filtering Data

Let's take a look at the .filter() method. As you might suspect, this is the Spark counterpart of SQL's WHERE clause. The .filter() method takes either an expression that would follow the WHERE clause of a SQL expression as a string, or a Spark Column of boolean (True/False) values.

For example, the following two expressions will produce the same output:

In [13]:
flights.filter("air_time > 120").show(5)
flights.filter(flights.air_time > 120).show(5)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|      duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|2014|   12|  8|   658.0|     -7.0|   935.0|     -5.0|     VX| N846VA|  1780|   SEA| LAX|   132.0|     954| 6.0|  58.0|               2.2|
|2014|    1| 22|  1040.0|      5.0|  1505.0|      5.0|     AS| N559AS|   851|   SEA| HNL|   360.0|    2677|10.0|  40.0|               6.0|
|2014|    3|  9|   754.0|     -1.0|  1015.0|      1.0|     AS| N612AS|   522|   SEA| BUR|   127.0|     937| 7.0|  54.0|2.1166666666666667|
|2014|    1| 15|  1037.0|      7.0|  1352.0|      2.0|     WN| N646SW|    48|   PDX| DEN|   121.0|     991|10.0|  37.0|2.0166666666666666|
|2014|    4| 19|  1236.0|  

For multple filter use:

In [14]:
# Define first filter
filterA = flights.origin == "SEA"

# Define second filter
filterB = flights.dest == "PDX"

# Filter the data, first by filterA then by filterB
flights.filter(filterA).filter(filterB).show(5)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+-------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|       duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+-------------------+
|2014|   10|  1|  1717.0|     -8.0|  1819.0|      4.0|     OO| N810SK|  4546|   SEA| PDX|    28.0|     129|17.0|  17.0| 0.4666666666666667|
|2014|    9| 26|  2339.0|    144.0|    29.0|    142.0|     OO| N822SK|  4612|   SEA| PDX|    29.0|     129|23.0|  39.0|0.48333333333333334|
|2014|    8| 18|  1728.0|     -2.0|  1822.0|      0.0|     OO| N586SW|  5440|   SEA| PDX|    41.0|     129|17.0|  28.0| 0.6833333333333333|
|2014|    2|  4|  2053.0|     -7.0|  2144.0|     -4.0|     OO| N223SW|  5433|   SEA| PDX|    29.0|     129|20.0|  53.0|0.48333333333333334|
|2014|    2|  9|  10

### Selecting

The Spark variant of SQL's SELECT is the .select() method. This method takes multiple arguments - one for each column you want to select. These arguments can either be the column name as a string (one for each column) or a column object (using the df.colName syntax). 

When you pass a column object, you can perform operations like addition or subtraction on the column to change the data contained in it, much like inside .withColumn().

The difference between .select() and .withColumn() methods is that .select() returns only the columns you specify, while .withColumn() returns all the columns of the DataFrame in addition to the one you defined. 

It's often a good idea to drop columns you don't need at the beginning of an operation so that you're not dragging around extra data as you're wrangling. In this case, you would use .select() and not .withColumn().

For example, the following two expressions will produce the same output:

In [15]:
flights.select("origin","dest","carrier").show(5)
flights.select(flights.origin, flights.dest, flights.carrier).show(5)

+------+----+-------+
|origin|dest|carrier|
+------+----+-------+
|   SEA| LAX|     VX|
|   SEA| HNL|     AS|
|   SEA| SFO|     VX|
|   PDX| SJC|     WN|
|   SEA| BUR|     AS|
+------+----+-------+
only showing top 5 rows

+------+----+-------+
|origin|dest|carrier|
+------+----+-------+
|   SEA| LAX|     VX|
|   SEA| HNL|     AS|
|   SEA| SFO|     VX|
|   PDX| SJC|     WN|
|   SEA| BUR|     AS|
+------+----+-------+
only showing top 5 rows



You can perform any column operation and the .select() method will return the transformed column. For example:

In [16]:
duration_hrs = flights.select(flights.air_time/60)
duration_hrs.show()

+------------------+
|   (air_time / 60)|
+------------------+
|               2.2|
|               6.0|
|              1.85|
|1.3833333333333333|
|2.1166666666666667|
|2.0166666666666666|
|               1.5|
|1.6333333333333333|
|              2.25|
|               3.3|
|2.1666666666666665|
| 2.566666666666667|
|2.1166666666666667|
|              3.05|
|              2.15|
|               1.5|
|1.2666666666666666|
|               3.6|
| 4.833333333333333|
|              1.85|
+------------------+
only showing top 20 rows



You can also use the .alias() method to rename a column you're selecting. So if you wanted to .select() the column duration_hrs (which isn't in your DataFrame) you could do

In [17]:
duration_hrs = flights.select((flights.air_time/60).alias("duration_hrs"))
duration_hrs2 = flights.selectExpr("air_time/60 as duration_hrs")
duration_hrs.show(5)
duration_hrs2.show(5)

+------------------+
|      duration_hrs|
+------------------+
|               2.2|
|               6.0|
|              1.85|
|1.3833333333333333|
|2.1166666666666667|
+------------------+
only showing top 5 rows

+------------------+
|      duration_hrs|
+------------------+
|               2.2|
|               6.0|
|              1.85|
|1.3833333333333333|
|2.1166666666666667|
+------------------+
only showing top 5 rows



In [18]:
# Define avg_speed
avg_speed = (flights.distance/(flights.air_time/60)).alias("avg_speed")
speed1 = flights.select("origin", "dest", "tailnum", avg_speed)
speed1.show(5)

# Equivalence avg_speed
speed2 = flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed")
speed2.show(5)

+------+----+-------+------------------+
|origin|dest|tailnum|         avg_speed|
+------+----+-------+------------------+
|   SEA| LAX| N846VA| 433.6363636363636|
|   SEA| HNL| N559AS| 446.1666666666667|
|   SEA| SFO| N847VA|367.02702702702703|
|   PDX| SJC| N360SW| 411.3253012048193|
|   SEA| BUR| N612AS| 442.6771653543307|
+------+----+-------+------------------+
only showing top 5 rows

+------+----+-------+------------------+
|origin|dest|tailnum|         avg_speed|
+------+----+-------+------------------+
|   SEA| LAX| N846VA| 433.6363636363636|
|   SEA| HNL| N559AS| 446.1666666666667|
|   SEA| SFO| N847VA|367.02702702702703|
|   PDX| SJC| N360SW| 411.3253012048193|
|   SEA| BUR| N612AS| 442.6771653543307|
+------+----+-------+------------------+
only showing top 5 rows



### Data type conversion

In [19]:
from pyspark.sql.types import StructField,IntegerType, StructType,StringType, DoubleType

flights = flights.withColumn("distance", flights["distance"].cast(IntegerType()))
flights = flights.withColumn("air_time", flights["air_time"].cast(IntegerType()))
flights = flights.withColumn("dep_delay", flights["air_time"].cast(DoubleType()))
flights.printSchema()

root
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: double (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- duration_hrs: double (nullable = true)



In [20]:
from pyspark.sql.types import StructField,IntegerType, StructType,StringType, DoubleType
schemaTypes=[StructField('year',IntegerType(),True),
       StructField('month',IntegerType(),True),
       StructField('day',IntegerType(),True),
       StructField('dep_time',StringType(),True),
       StructField('dep_delay',StringType(),True),
       StructField('arr_time',StringType(),True),
       StructField('carrier',StringType(),True),
       StructField('tailnum',StringType(),True),
       StructField('flight',StringType(),True),
       StructField('origin',StringType(),True),
       StructField('dest',StringType(),True),
       StructField('air_time',IntegerType(),True),
       StructField('distance',IntegerType(),True),
       StructField('hour',StringType(),True),
       StructField('minute',StringType(),True),
       StructField('duration_hrs',DoubleType(),True)
       ]
schema = StructType(fields=schemaTypes)
data = spark.read.csv('/content/sample_data/flights_small.csv', schema = schema)

data.printSchema()

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- duration_hrs: double (nullable = true)



### Aggregating

All of the common aggregation methods, like .min(), .max(), and .count() are GroupedData methods. These are created by calling the .groupBy() DataFrame method.

In [21]:
# Find the shortest flight from PDX in terms of distance
flights.filter(flights.origin == "PDX").groupBy().min("distance").show()

# Find the longest flight from SEA in terms of air time
flights.filter(flights.origin == "SEA").groupBy().max("air_time").show()

+-------------+
|min(distance)|
+-------------+
|          106|
+-------------+

+-------------+
|max(air_time)|
+-------------+
|          409|
+-------------+



In [22]:
# Average duration of Delta flights
flights.filter(flights.carrier == 'DL').filter(flights.origin == 'SEA').groupBy().avg('air_time').show()

# Total hours in the air
flights.withColumn("duration_hrs", flights.air_time/60).groupBy().sum("duration_hrs").show()

+------------------+
|     avg(air_time)|
+------------------+
|188.20689655172413|
+------------------+

+-----------------+
|sum(duration_hrs)|
+-----------------+
|25289.60000000004|
+-----------------+



In [23]:
# Group by tailnum
by_plane = flights.groupBy("tailnum")

# Number of flights each plane made
by_plane.count().show()

# Group by origin
by_origin = flights.groupBy("origin")

# Average duration of flights from PDX and SEA
by_origin.avg("air_time").show()

+-------+-----+
|tailnum|count|
+-------+-----+
| N442AS|   38|
| N102UW|    2|
| N36472|    4|
| N38451|    4|
| N73283|    4|
| N513UA|    2|
| N954WN|    5|
| N388DA|    3|
| N567AA|    1|
| N516UA|    2|
| N927DN|    1|
| N8322X|    1|
| N466SW|    1|
|  N6700|    1|
| N607AS|   45|
| N622SW|    4|
| N584AS|   31|
| N914WN|    4|
| N654AW|    2|
| N336NW|    1|
+-------+-----+
only showing top 20 rows

+------+------------------+
|origin|     avg(air_time)|
+------+------------------+
|   SEA| 160.4361496051259|
|   PDX|137.11543248288737|
+------+------------------+



In addition to the GroupedData methods you've already seen, there is also the .agg() method. This method lets you pass an aggregate column expression that uses any of the aggregate functions from the pyspark.sql.functions submodule.

This submodule contains many useful functions for computing things like standard deviations. All the aggregation functions in this submodule take the name of a column in a GroupedData table.

In [24]:
# Import pyspark.sql.functions as F
import pyspark.sql.functions as F

# Group by month and dest
by_month_dest = flights.groupBy("month","dest")

# Average departure delay by month and destination
by_month_dest.avg("dep_delay").show()

# Standard deviation of departure delay
by_month_dest.agg(F.stddev("dep_delay")).show()

+-----+----+------------------+
|month|dest|    avg(dep_delay)|
+-----+----+------------------+
|   11| TUS|140.33333333333334|
|   11| ANC|192.91176470588235|
|    1| BUR|             114.0|
|    1| PDX| 33.84615384615385|
|    6| SBA|            115.25|
|    5| LAX|122.89473684210526|
|   10| DTW|             217.4|
|    6| SIT|             123.0|
|   10| DFW| 193.8181818181818|
|    3| FAI|             198.8|
|   10| SEA|31.733333333333334|
|    2| TUS|132.66666666666666|
|   12| OGG| 340.8181818181818|
|    9| DFW|195.16666666666666|
|    5| EWR| 285.0833333333333|
|    3| RDM|              29.2|
|    8| DCA|             278.1|
|    7| ATL|247.54054054054055|
|    4| JFK|276.61538461538464|
|   10| SNA|             131.4|
+-----+----+------------------+
only showing top 20 rows

+-----+----+----------------------+
|month|dest|stddev_samp(dep_delay)|
+-----+----+----------------------+
|   11| TUS|    1.1547005383792515|
|   11| ANC|    17.462284287830716|
|    1| BUR|     7.8337999

### Renaming



In [25]:
# Rename the faa column
airports = airports.withColumnRenamed("faa", "dest")
airports.show(5)

+----+--------------------+----------+-----------+----+---+---+
|dest|                name|       lat|        lon| alt| tz|dst|
+----+--------------------+----------+-----------+----+---+---+
| 04G|   Lansdowne Airport|41.1304722|-80.6195833|1044| -5|  A|
| 06A|Moton Field Munic...|32.4605722|-85.6800278| 264| -5|  A|
| 06C| Schaumburg Regional|41.9893408|-88.1012428| 801| -6|  A|
| 06N|     Randall Airport| 41.431912|-74.3915611| 523| -5|  A|
| 09J|Jekyll Island Air...|31.0744722|-81.4277778|  11| -4|  A|
+----+--------------------+----------+-----------+----+---+---+
only showing top 5 rows



### Join

In [26]:
# Join the DataFrames
flights_with_airports = flights.join(airports,"dest",how="leftouter")

# Examine the new DataFrame
print(flights_with_airports.show(5))

+----+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+--------+--------+----+------+------------------+--------------------+---------+-----------+---+---+---+
|dest|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|air_time|distance|hour|minute|      duration_hrs|                name|      lat|        lon|alt| tz|dst|
+----+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+--------+--------+----+------+------------------+--------------------+---------+-----------+---+---+---+
| LAX|2014|   12|  8|   658.0|    132.0|   935.0|     -5.0|     VX| N846VA|  1780|   SEA|     132|     954| 6.0|  58.0|               2.2|    Los Angeles Intl|33.942536|-118.408075|126| -8|  A|
| HNL|2014|    1| 22|  1040.0|    360.0|  1505.0|      5.0|     AS| N559AS|   851|   SEA|     360|    2677|10.0|  40.0|               6.0|       Honolulu Intl|21.318681|-157.922428| 13|-10|  N|
| SFO|2014|    3|  9|  1443.0|