# PySpark Tutorial - Introduction

Create the `SparkSession`, the entry point for DataFrame-based PySpark programs.  Most programs store it in a variable named `spark`.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkTutorial").getOrCreate()

The following code creates a DataFrame with sample inline data.

In [2]:
df = spark.createDataFrame([('Arthur', 42), ('Trillian', 37), ('Zaphod', 45)], ['name', 'age'])

The schema of a DataFrame can be viewed with `printSchema()`.  Here the schema was inferred from the Python data, so the integer ages are stored as `long`.

In [3]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



The contents of a DataFrame can be printed with `show()`.  Passing a number limits the output to that many rows; without it, `show()` prints up to 20 rows.

In [4]:
df.show(2)

+--------+---+
|    name|age|
+--------+---+
|  Arthur| 42|
|Trillian| 37|
+--------+---+
only showing top 2 rows



The following code creates a new DataFrame with the rows of `df` where the age is greater than 40.
> Note that this does NOT cause any processing to be done.  Spark uses *lazy evaluation* and only executes when an action needs a result, such as `show()`, `count()`, or writing the data out.  This can be thought of like a view in SQL.

In [5]:
over40 = df.filter(df.age>40)

The call to `show()` on the new DataFrame forces execution of the `filter()` and prints the matching rows.

In [6]:
over40.show()

+------+---+
|  name|age|
+------+---+
|Arthur| 42|
|Zaphod| 45|
+------+---+



Another useful action is `count()`, which returns the number of rows.  Note that this can be a very expensive operation, depending on the volume of data and the complexity of the DataFrame, because Spark has to execute the transformations behind it to produce the answer.

In [7]:
rows = over40.count()
print(rows)

2


Calling `stop()` on the Spark session when you are finished frees its resources.  Note that after doing this you will be unable to re-run any of the code above without first re-creating the `spark` variable.

In [8]:
spark.stop()