# Tutorial
Spark's Python API (PySpark) lets us work with data in a way that closely resembles pandas, which should feel familiar to Python users. In this notebook you will learn the basic functionality of the API and visualize the data you will be working with for the project. Before running the code, make sure you have downloaded and unzipped the data so that `../data/StateAndCountyData.csv` exists relative to this notebook.

In [None]:
# import necessary libraries
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, dataframe
import plotly.express as px
# note: px.data.gapminder() returns a sample DataFrame, not GeoJSON; it is not used below
geojson = px.data.gapminder()
# create (or reuse) the SparkSession
spark = SparkSession \
    .builder \
    .appName("CS236") \
    .getOrCreate()

In [None]:
# Utility function that writes a query plan to a file.
# You will use it to understand how Spark processes your queries.
def write_explain(df: dataframe.DataFrame, output_path: str = "out.txt"):
    from contextlib import redirect_stdout
    with open(output_path, "w") as f:
        with redirect_stdout(f):
            df.explain(extended=True)

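Read a CSV file into a Spark DataFrame, then list its column names. The first cell below only times the read: variables assigned inside a `%%timeit` cell are not kept, so the second cell reads the file again to define `sdf`.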

In [None]:
%%timeit
sdf = spark.read.csv("../data/StateAndCountyData.csv", header=True)

In [None]:
sdf = spark.read.csv("../data/StateAndCountyData.csv", header=True)
sdf.columns

Show the first 20 rows of the Spark DataFrame (`show()` prints 20 rows by default)

In [None]:
sdf.show()

In [None]:
# register the DataFrame as a temporary view so it can be referenced by name in SQL
sdf.createOrReplaceTempView('state_county')
# run your SQL query as you would with any database
my_df = spark.sql(
'''
select 
  state
  , avg(value) as avg
from state_county
where variable_code = 'PCT_LACCESS_POP15' 
group by state
order by state
'''
)
my_df.show()
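
Since the PySpark API is meant to feel like pandas, it may help to see the same filter, group, and average written in pandas. This is only a sketch on a tiny made-up frame: the rows and the `OTHER_CODE` value are hypothetical, and only the column names are taken from the query above.

```python
import pandas as pd

# Hypothetical rows that mimic the (state, variable_code, value) layout.
pdf = pd.DataFrame(
    {
        "state": ["TX", "CA", "CA", "TX", "TX"],
        "variable_code": [
            "PCT_LACCESS_POP15",
            "PCT_LACCESS_POP15",
            "PCT_LACCESS_POP15",
            "OTHER_CODE",
            "PCT_LACCESS_POP15",
        ],
        "value": [30.0, 10.0, 20.0, 99.0, 50.0],
    }
)

result = (
    pdf[pdf["variable_code"] == "PCT_LACCESS_POP15"]  # where
    .groupby("state", as_index=False)["value"]        # group by state
    .mean()                                           # avg(value)
    .rename(columns={"value": "avg"})                 # as avg
    .sort_values("state")                             # order by state
    .reset_index(drop=True)
)
print(result)
```

Each chained call corresponds to one clause of the SQL query, which makes it easier to check what the Spark version should return on a small sample.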

In [None]:
# write the extended query plan to out.txt
write_explain(my_df)
# print the physical plan in the notebook
my_df.explain()

## Visualizing with Choropleths
We will use Plotly Express to visualize the data you will be working with. Besides the dataframe itself, the most important arguments are `locations` and `color`.
- `locations` - the name of the column that identifies which region each row belongs to (with `locationmode='USA-states'`, two-letter state abbreviations)
- `color` - the name of the column that contains the values to be displayed

In [None]:
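# Plotly Express works with pandas DataFrames, so convert the small aggregated result first
fig = px.choropleth(my_df.toPandas(),
                    locations='state',
                    color='avg',
                    color_continuous_scale='spectral_r',
                    locationmode='USA-states',
                    scope='usa')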
fig.update_geos(
    visible=True, 
    scope="usa",
)
fig.show()