# Use Spark to explore data

Use the ▷ button to the left of the code cell to run just that cell, and review the results.

In [None]:
%%pyspark
df = spark.read.load('abfss://data@asadatalake9slnoqw.dfs.core.windows.net/spark/2019.csv', format='csv'
## If header exists uncomment line below
##, header=True
)
display(df.limit(10))

When the code has finished running, review the output beneath the cell in the notebook. It shows the first ten rows in the file you selected, with automatic column names in the form _c0, _c1, _c2, and so on.

Modify the code so that the spark.read.load function reads data from all of the CSV files in the folder, and the display function shows the first 100 rows. Your code should look like this (with asadatalakexxxxxxx matching the name of your data lake store):

In [None]:
df = spark.read.load('abfss://data@asadatalake9slnoqw.dfs.core.windows.net/spark/*.csv', format='csv')

display(df.limit(100))

The dataframe now includes data from all of the files, but the column names are not useful. Spark uses a “schema-on-read” approach to try to determine appropriate data types for the columns based on the data they contain, and if a header row is present in a text file it can be used to identify the column names (by specifying a header=True parameter in the load function). Alternatively, you can define an explicit schema for the dataframe.

In [None]:
df = spark.read.load('abfss://data@asadatalake9slnoqw.dfs.core.windows.net/spark/*.csv', format='csv', header=True)

display(df.limit(100))
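
With header=True alone, the CSV reader still loads every column as a string. The sketch below assumes the CSV reader's inferSchema option and uses a hypothetical helper name, load_with_inferred_types, to ask Spark to choose column types from the data. It is one possible approach, not the only one.

```python
def load_with_inferred_types(spark, path):
    """Read CSV files, taking names from the header row and inferring types.

    `spark` is the SparkSession that the notebook provides; `path` is a
    file or wildcard path such as the abfss:// URL used in the cells above.
    """
    # inferSchema=True makes Spark scan the data to choose column types,
    # which costs an extra pass over the files
    return spark.read.load(path, format='csv', header=True, inferSchema=True)
```

Calling load_with_inferred_types(spark, path) with the wildcard path from the cell above, followed by printSchema(), would show the types Spark picked.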

To display the dataframe’s schema:

In [None]:
df.printSchema()

Modify the code as follows (replacing asadatalakexxxxxxx), to define an explicit schema for the dataframe that includes the column names and data types.

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
    ])

df = spark.read.load('abfss://data@asadatalake9slnoqw.dfs.core.windows.net/spark/*.csv', format='csv', schema=orderSchema)
display(df.limit(100))

In [None]:
df.printSchema()
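
The schema argument of the load function can also be given as a DDL-formatted string instead of a StructType. The following is a minimal sketch of that alternative; the names ORDER_SCHEMA_DDL and load_orders are hypothetical and chosen only for illustration.

```python
# Column names and types copied from the StructType definition above
ORDER_SCHEMA_DDL = (
    "SalesOrderNumber STRING, "
    "SalesOrderLineNumber INT, "
    "OrderDate DATE, "
    "CustomerName STRING, "
    "Email STRING, "
    "Item STRING, "
    "Quantity INT, "
    "UnitPrice FLOAT, "
    "Tax FLOAT"
)


def load_orders(spark, path):
    """Read headerless CSV files using the DDL schema string."""
    # `spark` is the SparkSession provided by the notebook
    return spark.read.load(path, format='csv', schema=ORDER_SCHEMA_DDL)
```

The string form avoids the wildcard imports, at the cost of losing the per-field objects that a StructType gives you.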

# Analyze data in a dataframe

The dataframe object in Spark is similar to a Pandas dataframe in Python, and includes a wide range of functions that you can use to manipulate, filter, group, and otherwise analyze the data it contains.

### Filter a dataframe

- When you perform an operation on a dataframe, the result is a new dataframe (in this case, a new customers dataframe is created by selecting a specific subset of columns from the df dataframe).
- Dataframes provide functions such as count and distinct that can be used to summarize and filter the data they contain.
- The dataframe['Field1', 'Field2', ...] syntax is a shorthand way of defining a subset of columns. You can also use the select method, so the first line of the code below could be written as customers = df.select("CustomerName", "Email")

In [None]:
customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
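
To make the difference between count() and distinct().count() concrete, here is a plain-Python sketch over a few hypothetical rows. It is not Spark code; it only mirrors what the two calls report.

```python
# Hypothetical (CustomerName, Email) rows; one customer appears twice
rows = [
    ("Ana Price", "ana@example.com"),
    ("Ben Cole", "ben@example.com"),
    ("Ana Price", "ana@example.com"),
]

total_rows = len(rows)          # what customers.count() reports
distinct_rows = len(set(rows))  # what customers.distinct().count() reports

print(total_rows, distinct_rows)
```

A customer who placed several orders appears once per order line in the first number but only once in the second.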

Note that in the cell below you can “chain” multiple functions together so that the output of one function becomes the input for the next. In this case, the dataframe created by the select method is the source dataframe for the where method that is used to apply filtering criteria.

In [None]:
customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

### Aggregate and group data in a dataframe

The groupBy method groups the rows by Item, and the subsequent sum aggregate function is applied to all of the remaining numeric columns (in this case, Quantity).

In [None]:
productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)
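
Because sum() with no arguments applies to every numeric column, it can help to name the column being aggregated and give the result a readable name. The sketch below is one way to do that; the helper name total_quantity_by_item and the output column name TotalQuantity are hypothetical.

```python
def total_quantity_by_item(df):
    """Sum Quantity per Item and give the result column a readable name."""
    return (
        df.groupBy("Item")
          # The dictionary form produces a column named "sum(Quantity)"
          .agg({"Quantity": "sum"})
          .withColumnRenamed("sum(Quantity)", "TotalQuantity")
    )
```

For example, display(total_quantity_by_item(df)) would show one row per Item with its TotalQuantity.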

The select method includes a SQL year function to extract the year component of the OrderDate field, and then an alias method is used to assign a column name to the extracted year value. The data is then grouped by the derived Year column, the count of rows in each group is calculated, and finally the orderBy method is used to sort the resulting dataframe.

In [None]:
yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)

# Query data using Spark SQL

As you’ve seen, the native methods of the dataframe object enable you to query and analyze data quite effectively. However, many data analysts are more comfortable working with SQL syntax. Spark SQL is a SQL language API in Spark that you can use to run SQL statements, or even persist data in relational tables.

### Use Spark SQL in PySpark code

The default language in Azure Synapse Studio notebooks is PySpark, which is a Spark-based Python runtime. Within this runtime, you can use the spark.sql library to embed Spark SQL syntax within your Python code, and work with SQL constructs such as tables and views.

#### Run the cell and observe that:

- The code persists the data in the df dataframe as a temporary view named salesorders. Spark SQL supports the use of temporary views or persisted tables as sources for SQL queries.
- The spark.sql method is then used to run a SQL query against the salesorders view.
- The results of the query are stored in a dataframe.

In [None]:
df.createOrReplaceTempView("salesorders")

spark_df = spark.sql("SELECT * FROM salesorders")
display(spark_df)

### Run SQL code in a cell

While it’s useful to be able to embed SQL statements into a cell containing PySpark code, data analysts often just want to work directly in SQL.

#### Run the cell and review the results. Observe that:

- The %%sql line at the beginning of the cell (called a magic) indicates that the Spark SQL language runtime should be used to run the code in this cell instead of PySpark.
- The SQL code references the salesorders view that you created previously using PySpark.
- The output from the SQL query is automatically displayed as the result under the cell.

In [None]:
%%sql
SELECT YEAR(OrderDate) AS OrderYear,
    SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
FROM salesorders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;
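
The GrossRevenue expression adds the line total (UnitPrice * Quantity) and the Tax for each row before summing per year. As a worked example of that arithmetic, here is a plain-Python sketch over three hypothetical order lines; it is not Spark code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order lines: (OrderDate, Quantity, UnitPrice, Tax)
order_lines = [
    (date(2019, 7, 1), 2, 10.0, 1.5),
    (date(2019, 8, 15), 1, 5.0, 0.5),
    (date(2020, 1, 2), 3, 4.0, 1.0),
]

gross_revenue = defaultdict(float)
for order_date, quantity, unit_price, tax in order_lines:
    # Same expression as the SQL: (UnitPrice * Quantity) + Tax
    gross_revenue[order_date.year] += (unit_price * quantity) + tax

# Equivalent of ORDER BY OrderYear
for order_year in sorted(gross_revenue):
    print(order_year, gross_revenue[order_year])
```

Here 2019 totals 27.0 (21.5 + 5.5) and 2020 totals 13.0.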

# Visualize data with Spark

A picture is proverbially worth a thousand words, and a chart is often better than a thousand rows of data. While notebooks in Azure Synapse Analytics include a built-in chart view for data that is displayed from a dataframe or Spark SQL query, it is not designed for comprehensive charting. However, you can use Python graphics libraries like matplotlib and seaborn to create charts from data in dataframes.

## View results as a chart
### Run the code cell and in the results section beneath the cell, change the View option from Table to Chart.

Use the View options button at the top right of the chart to display the options pane for the chart. Then set the options as follows and select Apply:
- Chart type: Bar chart
- Key: Item
- Values: Quantity
- Series Group: leave blank
- Aggregation: Sum
- Stacked: Unselected

In [None]:
%%sql
SELECT * FROM salesorders

## Get started with matplotlib

### Run the code cell and observe that it returns a Spark dataframe containing the yearly revenue.

In [None]:
sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, \
                SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue \
            FROM salesorders \
            GROUP BY CAST(YEAR(OrderDate) AS CHAR(4)) \
            ORDER BY OrderYear"
df_spark = spark.sql(sqlQuery)
df_spark.show()

To visualize the data as a chart, we’ll start by using the matplotlib Python library. This library is the core plotting library on which many others are based, and provides a great deal of flexibility in creating charts.

### Run the cell and review the results, which consist of a column chart with the total gross revenue for each year. Note the following features of the code used to produce this chart:
- The matplotlib library requires a Pandas dataframe, so you need to convert the Spark dataframe returned by the Spark SQL query to this format.
- At the core of the matplotlib library is the pyplot object. This is the foundation for most plotting functionality.
- The default settings result in a usable chart, but there’s considerable scope to customize it.

In [None]:
from matplotlib import pyplot as plt

# matplotlib requires a Pandas dataframe, not a Spark one
df_sales = df_spark.toPandas()

# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])

# Display the plot
plt.show()
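
The toPandas method collects every row of the Spark dataframe into the driver's memory, which is fine for a small aggregate like this one but can fail on large results. A minimal sketch of a guard, assuming a hypothetical helper name to_pandas_capped and an arbitrary default cap of 1000 rows:

```python
def to_pandas_capped(spark_df, max_rows=1000):
    """Convert a Spark dataframe to Pandas, keeping at most max_rows rows."""
    # limit() runs in Spark, so only the capped rows are sent to the driver
    return spark_df.limit(max_rows).toPandas()
```

For example, df_sales = to_pandas_capped(df_spark) could stand in for the toPandas call above.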

### Now run the modified code and view the results. 
The chart now includes a little more information.

A plot is technically contained within a Figure. In the previous examples, the figure was created implicitly for you, but you can create it explicitly.

In [None]:
# Clear the plot area
plt.clf()

# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')

# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)

# Show the figure
plt.show()

### Run the code cell and view the results. The figure determines the shape and size of the plot.

A figure can contain multiple subplots, each on its own axis.

In [None]:
# Clear the plot area
plt.clf()

# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))

# Create a bar plot of revenue by year on the first axis
ax[0].bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
ax[0].set_title('Revenue by Year')

# Create a pie chart of yearly order counts on the second axis
# Note: df_sales holds one aggregated row per year, so each count here is 1
yearly_counts = df_sales['OrderYear'].value_counts()
ax[1].pie(yearly_counts)
ax[1].set_title('Orders per Year')
ax[1].legend(yearly_counts.keys().tolist())

# Add a title to the Figure
fig.suptitle('Sales Data')

# Show the figure
plt.show()
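
Because df_sales holds one aggregated row per year, value_counts() on OrderYear returns 1 for every year and the pie slices come out equal. The following is a hedged sketch of getting real per-year order counts from the salesorders view. It assumes that SalesOrderNumber identifies an order; the names ORDER_COUNT_QUERY, yearly_order_counts, and OrderCount are hypothetical.

```python
ORDER_COUNT_QUERY = """
SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear,
       COUNT(DISTINCT SalesOrderNumber) AS OrderCount
FROM salesorders
GROUP BY CAST(YEAR(OrderDate) AS CHAR(4))
ORDER BY OrderYear
"""


def yearly_order_counts(spark):
    """Return a Pandas dataframe with one OrderCount value per OrderYear."""
    # `spark` is the SparkSession provided by the notebook
    return spark.sql(ORDER_COUNT_QUERY).toPandas()
```

With counts = yearly_order_counts(spark), the pie could then be drawn as ax[1].pie(counts['OrderCount'], labels=counts['OrderYear']).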

# Use the seaborn library

While matplotlib enables you to create complex charts of multiple types, it can require some complex code to achieve the best results. For this reason, over the years, many new libraries have been built on the base of matplotlib to abstract its complexity and enhance its capabilities. One such library is seaborn.

### Run the code and observe that it displays a bar chart using the seaborn library.

In [None]:
import seaborn as sns

# Clear the plot area
plt.clf()

# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

### Run the code and note that seaborn enables you to set a consistent color theme for your plots.

In [None]:
# Clear the plot area
plt.clf()

# Set the visual theme for seaborn
sns.set_theme(style="whitegrid")

# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

### Run the code to view the yearly revenue as a line chart.

In [None]:
# Clear the plot area
plt.clf()

# Create a line chart
ax = sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

If you’ve finished exploring Apache Spark in Azure Synapse Analytics, stop the session. If you want to save the notebook, publish it.