# Pandas API

Starting with version 3.2.0, Spark supports the [Pandas API](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html). With it, you can work with Spark DataFrames as if they were Pandas ones. Remember, though, that they never really are: the data stays distributed and Spark evaluates the operations lazily.

In [0]:
# ``ps`` is the conventional alias; ``pd`` is easily mistaken for plain Pandas
from pyspark import pandas as ps

ps.set_option("compute.default_index_type", "distributed")
data = ps.read_json("dbfs:///data.json_lines")

In [0]:
len(data)

Out[2]: 208743

# SQL API

In [0]:
# we can easily register a DataFrame as a table
data.spark.frame().createOrReplaceTempView("data")

In [0]:
# now we can write SQL queries just as against any database
# note that until we ask Spark to ``show`` the result,
# everything is a 'transformation', not an 'action',
# so it is like creating views
spark.sql("""
SELECT
    COUNT(*) AS cnt
FROM data
""").createOrReplaceTempView("count")

In [0]:
# here the action happens and we see the result set
spark.sql("SELECT * FROM count LIMIT 1").show()

+------+
|   cnt|
+------+
|208743|
+------+



# Spark DataFrames

In [0]:
# this API is rather different from both SQL and Pandas
# but provides most of their functionality
from pyspark.sql.functions import count_distinct
data_df = data.spark.frame()
# the alias names the output column: we count distinct genres here
data_df.select(
    count_distinct("genre").alias("number_of_genres")
).show()

+----------------+
|number_of_genres|
+----------------+
|             528|
+----------------+



# Do It Yourself

The [Spark manual](https://spark.apache.org/docs/latest/api/python/reference) is your best friend!

* count the number of albums, artists, countries, languages and genres
* find the ten longest and the ten shortest albums
* find the top ten countries by the number of albums available there
* build a table of the total duration of all albums released each year in every country
* find the artist working in the largest number of different genres
* find the ten most frequent title words in each of the ten most popular languages

When done with your favourite API, try doing the same with another one!