# Reading and Writing Data with Spark

This notebook contains the code from the previous screencast. The only difference is that the dataset is read from a local file instead of from a remote cluster. You can see the file by clicking on the "jupyter" icon and opening the folder titled "data".

Run the code cells to see how everything works.

First, let's import SparkConf and SparkSession.

In [1]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

Since we're using Spark locally, we already have both a SparkContext and a SparkSession running. We can update some of the parameters, such as our application's name. Let's just call it "Our first Python Spark SQL example".

In [2]:
spark = SparkSession \
    .builder \
    .appName("Our first Python Spark SQL example") \
    .getOrCreate()

Let's check whether the change went through.

In [3]:
spark.sparkContext.getConf().getAll()

[('spark.driver.host', 'ais-macbook-pro'),
 ('spark.app.name', 'Our first Python Spark SQL example'),
 ('spark.app.id', 'local-1612152123814'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.port', '59583'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true')]

In [4]:
spark

As you can see, the app name is exactly what we set.

Let's create our first DataFrame from a fairly small sample dataset. Throughout the course we'll work with a log file dataset that describes user interactions with a music streaming service. The records describe events such as logging in to the site, visiting a page, listening to the next song, or seeing an ad.

In [5]:
path = "../../data/log_data/2018/11/2018-11-01-events.json"
user_log = spark.read.json(path)

In [6]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: double (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [7]:
user_log.describe()

DataFrame[summary: string, artist: string, auth: string, firstName: string, gender: string, itemInSession: string, lastName: string, length: string, level: string, location: string, method: string, page: string, registration: string, sessionId: string, song: string, status: string, ts: string, userAgent: string, userId: string]

In [8]:
user_log.show(n=1)

+------+---------+---------+------+-------------+--------+------+-----+--------------------+------+----+-----------------+---------+----+------+-------------+--------------------+------+
|artist|     auth|firstName|gender|itemInSession|lastName|length|level|            location|method|page|     registration|sessionId|song|status|           ts|           userAgent|userId|
+------+---------+---------+------+-------------+--------+------+-----+--------------------+------+----+-----------------+---------+----+------+-------------+--------------------+------+
|  null|Logged In|   Walter|     M|            0|    Frye|  null| free|San Francisco-Oak...|   GET|Home|1.540919166796E12|       38|null|   200|1541105830796|"Mozilla/5.0 (Mac...|    39|
+------+---------+---------+------+-------------+--------+------+-----+--------------------+------+----+-----------------+---------+----+------+-------------+--------------------+------+
only showing top 1 row
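
The `ts` value in the row above looks like a Unix timestamp in milliseconds. The notebook does not state the unit, so treat that as an assumption; it is at least consistent with the date in the file name. A quick check in plain Python:

```python
from datetime import datetime, timedelta, timezone

# ts value copied from the first row shown above.
ts = 1541105830796

# Assumption: ts counts milliseconds since 1970-01-01 00:00:00 UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
event_time = epoch + timedelta(milliseconds=ts)

print(event_time.isoformat())
```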



In [9]:
user_log.take(5)

[Row(artist=None, auth='Logged In', firstName='Walter', gender='M', itemInSession=0, lastName='Frye', length=None, level='free', location='San Francisco-Oakland-Hayward, CA', method='GET', page='Home', registration=1540919166796.0, sessionId=38, song=None, status=200, ts=1541105830796, userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='39'),
 Row(artist=None, auth='Logged In', firstName='Kaylee', gender='F', itemInSession=0, lastName='Summers', length=None, level='free', location='Phoenix-Mesa-Scottsdale, AZ', method='GET', page='Home', registration=1540344794796.0, sessionId=139, song=None, status=200, ts=1541106106796, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"', userId='8'),
 Row(artist="Des'ree", auth='Logged In', firstName='Kaylee', gender='F', itemInSession=1, lastName='Summers', length=246.30812, level='free'

In [10]:
out_path = "../../data/process_data/2018-11-01-events.csv"

In [11]:
user_log.write.save(out_path, format="csv", header=True)

In [12]:
user_log_2 = spark.read.csv(out_path, header=True)

In [13]:
user_log_2.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: string (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: string (nullable = true)
 |-- sessionId: string (nullable = true)
 |-- song: string (nullable = true)
 |-- status: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [14]:
user_log_2.take(2)

[Row(artist=None, auth='Logged In', firstName='Walter', gender='M', itemInSession='0', lastName='Frye', length=None, level='free', location='San Francisco-Oakland-Hayward, CA', method='GET', page='Home', registration='1.540919166796E12', sessionId='38', song=None, status='200', ts='1541105830796', userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='39'),
 Row(artist=None, auth='Logged In', firstName='Kaylee', gender='F', itemInSession='0', lastName='Summers', length=None, level='free', location='Phoenix-Mesa-Scottsdale, AZ', method='GET', page='Home', registration='1.540344794796E12', sessionId='139', song=None, status='200', ts='1541106106796', userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"', userId='8')]

In [15]:
user_log_2.select("userID").show()

+------+
|userID|
+------+
|    39|
|     8|
|     8|
|     8|
|     8|
|     8|
|     8|
|     8|
|     8|
|     8|
|    10|
|    26|
|    26|
|    26|
|   101|
+------+



In [16]:
user_log_2.take(1)

[Row(artist=None, auth='Logged In', firstName='Walter', gender='M', itemInSession='0', lastName='Frye', length=None, level='free', location='San Francisco-Oakland-Hayward, CA', method='GET', page='Home', registration='1.540919166796E12', sessionId='38', song=None, status='200', ts='1541105830796', userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='39')]