#### References
https://docs.azuredatabricks.net/user-guide/visualizations/index.html<br>

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame<br>

https://stackoverflow.com/questions/tagged/apache-spark+apache-spark-sql+python<br>
https://stackoverflow.com/questions/tagged/databricks+python

#### Run the data ingest notebook so its dataframe is available here

In [3]:
%run ./adb_3_ingest_to_df

### Dataframe API for data operations

In [5]:
group_by_column = "state"
count_column = "n"

# Count airports per state: group by state, count, rename the count column (the default name "count" is easily confused with the count row in describe() output), then sort by count descending
df_groupby = df_explicit\
  .groupBy(group_by_column)\
  .count()\
  .withColumnRenamed("count", count_column)\
  .sort(count_column, ascending=False)
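
To make the semantics of the groupBy / count / rename / sort chain concrete, here is a minimal plain-Python sketch of the same steps. It does not use Spark; the `rows` list and the `group_count_sorted` helper are hypothetical stand-ins for `df_explicit` and the dataframe chain, and the airport names are made up.

```python
from collections import Counter

# Hypothetical sample rows standing in for df_explicit (not real airport data).
rows = [
    {"name": "Airport A", "state": "CA"},
    {"name": "Airport B", "state": "TX"},
    {"name": "Airport C", "state": "CA"},
    {"name": "Airport D", "state": "AK"},
    {"name": "Airport E", "state": "TX"},
    {"name": "Airport F", "state": "CA"},
]


def group_count_sorted(rows, group_by_column, count_column):
    """Count rows per group value and return the groups sorted by count, largest first."""
    counts = Counter(row[group_by_column] for row in rows)
    # most_common() returns (value, count) pairs ordered by count descending.
    return [
        {group_by_column: value, count_column: count}
        for value, count in counts.most_common()
    ]


result = group_count_sorted(rows, "state", "n")
for record in result:
    print(record)
```

Unlike this sketch, Spark evaluates the chain lazily and in parallel across partitions; nothing runs until an action such as `show()` or `display()` is called.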

In [6]:
# Basic descriptive statistics

df_groupby.describe().show()
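
For numeric columns, `describe()` reports count, mean, stddev, min, and max. The sketch below computes the same five statistics in plain Python, assuming the standard deviation is the sample standard deviation. The input is only the ten `n` values listed in the `display(df_groupby)` output of this notebook, not the full dataframe, so the numbers will not match what `describe()` prints for the whole dataset. The `describe_numeric` helper is hypothetical.

```python
import statistics

# The ten counts visible in the display(df_groupby) output (not the full dataframe).
n_values = [32, 29, 22, 20, 14, 12, 11, 11, 8, 8]


def describe_numeric(values):
    """Return the five summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (assumption)
        "min": min(values),
        "max": max(values),
    }


summary = describe_numeric(n_values)
for statistic, value in summary.items():
    print(statistic, value)
```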

In [7]:
# Spark execution plan

df_groupby.explain()

In [8]:
display(df_groupby)

state,n
CA,32
TX,29
AK,22
FL,20
NY,14
MI,12
NC,11
CO,11
WI,8
GA,8


In [9]:
# We can also use dataframe select, passing it column names (or "*" for all columns); it returns a new dataframe that can be operated on with the dataframe API

In [10]:
df_select = df_explicit\
  .select("*")\
  .sort("name", ascending=True)

In [11]:
display(df_select)

### Spark SQL for data operations

In [13]:
# To use a dataframe in explicit SQL queries, register it as a temp view (scoped to the current Spark session)

df_explicit.createOrReplaceTempView("df_explicit")

In [14]:
# A Spark SQL SELECT query returns a new dataframe. This query is the SQL equivalent of the dataframe API groupBy/count/sort chain above; COUNT(*) counts every row in each group, as groupBy().count() does.

df_groupby_2 = spark.sql("SELECT state, COUNT(*) AS n FROM df_explicit GROUP BY state ORDER BY n DESC")
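
The GROUP BY query can be tried outside Spark to see how the count behaves. The sketch below swaps in Python's built-in `sqlite3` engine for Spark SQL and uses a hypothetical four-row `df_explicit` table with made-up airport names. It illustrates one standard SQL detail: `COUNT(state)` skips rows where `state` is NULL, while `COUNT(*)` counts every row, so the two forms agree only when `state` contains no NULLs.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_explicit (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO df_explicit VALUES (?, ?)",
    [
        ("Airport A", "CA"),
        ("Airport B", "CA"),
        ("Airport C", "TX"),
        ("Airport D", None),  # row with a missing state
    ],
)

# COUNT(state) ignores NULLs, so the NULL group gets a count of 0.
by_column = conn.execute(
    "SELECT state, COUNT(state) AS n FROM df_explicit GROUP BY state ORDER BY n DESC"
).fetchall()

# COUNT(*) counts every row in the group, including the row with a NULL state.
by_star = conn.execute(
    "SELECT state, COUNT(*) AS n FROM df_explicit GROUP BY state ORDER BY n DESC"
).fetchall()

print("COUNT(state):", by_column)
print("COUNT(*):    ", by_star)
conn.close()
```

If the source data has no missing states, both forms return the same counts.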

In [15]:
df_groupby_2.show(10)

In [16]:
display(df_groupby_2)