# Object Storage and PySpark Programming

- Examples From Video Lecture 

In [1]:
import pyspark
from pyspark.sql import SparkSession

bucket = "d-object-spark"

spark = SparkSession.builder \
    .master("local") \
    .appName('jupyter-pyspark') \
        .config("hive.metastore.uris", "thrift://hive-metastore:9083") \
        .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.1.2,org.apache.spark:spark-avro_2.12:3.1.2")\
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
        .config("spark.hadoop.fs.s3a.access.key", "minio") \
        .config("spark.hadoop.fs.s3a.secret.key", "SU2orange!") \
        .config("spark.hadoop.fs.s3a.fast.upload", True) \
        .config("spark.hadoop.fs.s3a.path.style.access", True) \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR") # Keeps the noise down!!!



:: loading settings :: url = jar:file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dc97fd8b-a8ff-404e-81ab-f095a6e5367b;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.1.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.271 in central
	found org.apache.spark#spark-avro_2.12;3.1.2 in central
	found org.spark-project.spark#unused;1.0.0 in central
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.2/hadoop-aws-3.1.2.jar ...
	[SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.1.2!hadoop-aws.jar (50ms)
downloading https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar ...
	[SUCCESSFUL ] org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar (38ms)
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-ja

In [2]:
# Print context
print('Spark Context : ', spark.sparkContext)
print('Spark Version : ', spark.sparkContext.version)
print('Spark appName :', spark.sparkContext.appName)
print('Hadoop version: ', spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
print('Spark Configuration:')
for conf in spark.sparkContext._conf.getAll():
    print(f"\t{conf[0]} = {conf[1]}")

Spark Context :  <SparkContext master=local appName=jupyter-pyspark>
Spark Version :  3.1.2
Spark appName : jupyter-pyspark
Hadoop version:  3.2.0
Spark Configuration:
	spark.master = local
	spark.app.initial.jar.urls = spark://jupyter:36041/jars/com.amazonaws_aws-java-sdk-bundle-1.11.271.jar,spark://jupyter:36041/jars/org.apache.spark_spark-avro_2.12-3.1.2.jar,spark://jupyter:36041/jars/org.spark-project.spark_unused-1.0.0.jar,spark://jupyter:36041/jars/org.apache.hadoop_hadoop-aws-3.1.2.jar
	hive.metastore.uris = thrift://hive-metastore:9083
	spark.submit.pyFiles = /home/jovyan/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.1.2.jar,/home/jovyan/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.1.2.jar,/home/jovyan/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.271.jar,/home/jovyan/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar
	spark.app.name = jupyter-pyspark
	spark.hadoop.fs.s3a.path.style.access = True
	spark.app.id = local-1670516395164
	spark.serializer.objectStreamReset = 100
	sp

## Setup

- Put data in the right places!!!
- Run these cells to ensure you have the data for the examples

In [3]:
! pip install minio

Collecting minio
  Downloading minio-7.1.12-py3-none-any.whl (76 kB)
Installing collected packages: minio
Successfully installed minio-7.1.12


In [5]:
from minio import Minio

# Make the minio bucket if it does not already exist
client = Minio("minio:9000", "minio", "SU2orange!", secure=False)
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# open the example 
df = spark.read.csv("/home/jovyan/datasets/stocks/stocks.csv", inferSchema=True, header=True)

# Put the example in minio
df.write.mode("Overwrite").csv(f"s3a://{bucket}/stocks.csv",header=True)

# Put the example in HDFS
df.write.mode("Overwrite").csv(f"hdfs://namenode/user/root/{bucket}/stocks.csv",header=True)


                                                                                

## Minio Client

This section outlines commonly used commands from the Minio client, `mc`.

Run these commands from a terminal in your Jupyter notebook setup.

### Minio Alias Setup

```
# install the client

wget https://dl.min.io/client/mc/release/linux-amd64/mc && chmod +x mc && sudo mv -f mc /usr/local/bin

# view aliases

mc alias list

# create an alias to our server, which we will call "ms"

mc alias set ms http://minio:9000 minio SU2orange!

# delete an alias

mc alias rm ms

```

### Minio File and bucket commands

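These are similar to the `hadoop fs` commands. The examples use `play`, an alias that `mc` preconfigures for the public Minio playground server; substitute `ms` to target the server configured above.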

```
# make a bucket named testing
mc mb play/testing

# list buckets on the play alias
mc ls play

# copy files to the play/testing bucket

mc cp /datasets/customers/* play/testing
```


## Reading Data into the Spark Dataframe: Paths

Spark can read (and write) data from a variety of locations, just by including the proper path to the file.

- `file://` reads a file from the local file system. Not ideal for a clustered environment; use `SparkFiles` instead.
- `s3a://` reads from our object storage configuration.
- `hdfs://` reads from Hadoop's HDFS using the client.
- `webhdfs://` reads from Hadoop's HDFS using the web client.
- `https://` reads over the web; you must use `SparkFiles`. See the next section.


In [None]:
print("file://") # Not Ideal!
spark.read.text("file:///home/jovyan/datasets/stocks/stocks.csv").show(3)

print("s3a://")
spark.read.text(f"s3a://{bucket}/stocks.csv").show(3)

# Read from the same HDFS location the Setup cell wrote to
print("hdfs://")
spark.read.text(f"hdfs://namenode/user/root/{bucket}/stocks.csv").show(3)

print("webhdfs://")
spark.read.text(f"webhdfs://namenode:50070/user/root/{bucket}/stocks.csv").show(3)



## Reading Data : `SparkFiles`

Let's not forget that Spark is a distributed computing environment. Reading a local file, or a file off the web, directly does not let Spark take advantage of its distributed nature. Instead, use `SparkFiles`, which registers the file with the `sparkContext` of the `SparkSession`. This, in essence, makes the cluster aware of the file.

`spark.sparkContext.addFile(url)` downloads the file at `url` and adds it to a temporary location on the worker nodes in the cluster.

When you need the file, use `SparkFiles.get(filename)` to retrieve its path.

NOTES: 

- You add a file by path, but access the file by name. 
- You cannot add the same file name more than once


In [9]:
from pyspark import SparkFiles

# Add the file by URL, then look it up by file name
spark.sparkContext.addFile("https://raw.githubusercontent.com/mafudge/datasets/master/stocks/stocks.csv")
file_on_spark = SparkFiles.get("stocks.csv")

print("Temporary Location: ", file_on_spark)

print("https://")
spark.read.csv(file_on_spark, header=True).show(3)

Temporary Location:  /tmp/spark-b19a70b2-811c-4f2c-bb55-b3482dc8899a/userFiles-a844192a-11e5-46d3-abd0-342429617132/stocks.csv
https://
+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows



## Reading Data : Wildcards

You don't have to read a single file. Instead you can read an entire folder of files, or a wildcard match of files.


In [None]:
print("read just fall")
spark.read.text("file:///home/jovyan/datasets/grades/fall*.tsv").show()

# read all of them
print("read all files")
spark.read.text("file:///home/jovyan/datasets/grades").show()

## Reading Data: File Formats

Spark can read data in a variety of formats. Each format has configurable options.

- `csv` delimited (comma, tab, etc) file
- `text` generic text file, one row per line
- `json` JSON format 
- `parquet` Parquet format (common big-data format with schema included)
- `orc` Another common big-data format with schema.

Use the `option()` method on the reader to set a format's options.

More Information: https://spark.apache.org/docs/latest/sql-data-sources.html


In [12]:


# reading a schema-based file requires fewer options
print("read Parquet file")
spark.read \
    .parquet("file:///home/jovyan/datasets/stocks/stocks.parquet").show(3)


# JSON file format - there are many options for this file format
print("Read JSON file")
spark.read.option("multiline",True).json("/home/jovyan/datasets/json-samples/stocks.json").show(3)

# This is not comma-delimited
print("Read a pipe-separated file")
spark.read \
    .option("sep","|") \
    .option("header",False) \
    .option("inferSchema",True) \
    .csv("file:///home/jovyan/datasets/tweets/tweets.psv").show(3)


read Parquet file
+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

Read JSON file
+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

Read a pipe-separated file
+-------------------+--------------------+--------------------+--------+--------------------+
|                _c0|                 _c1|                 _c2|     _c3|                 _c4|
+-------------------+--------------------+--------------------+--------+--------------------+
|2845428583999282239|1.4337661612984276E9|Mon Jun 08 08:22:...|rovlight|Why so horrible d...|
|1658183905022391067|1.4298210344679017E9|Thu Apr 23 16:30:...|   sladd|Just placed an or...|
| 973476786498736360|1.4421079524352274E9|Sat Sep 12 21:32:...| rdeboat|Worst purchase ev...|
+-------------------+--------------------+--------------------+--------+----------------

In [10]:
# Handle headers
spark.read \
    .option("header",True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").show(3)

# Infer schema from the columns
spark.read \
    .option("header",True) \
    .option("inferSchema", True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").show(3)


spark.read \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv", header=True, inferSchema=True).show(3)


+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows



## Caching DataFrames

The `cache()` function will persist the `DataFrame` to temp storage on the spark cluster. This can be in-memory, on disk, or both depending on the cluster size and data set size.

This is especially useful when the data source is external to the Spark cluster (a remote database, for example) and the data will be retrieved and transformed multiple times.

`cache()` is itself lazily evaluated: the data is computed and stored the first time an action, such as `show()`, runs on the `DataFrame`, and later actions reuse the stored copy.

In [None]:
print("s3a://")
stocks = spark.read.option("header",True).option("inferSchema",True).csv(f"s3a://{bucket}/stocks.csv").cache()
stocks.show(3)


## DataFrame Schemas

Every Spark DataFrame has a schema, or collection of typed columns. The schema is stored in a `StructType`, and the columns are `StructField` objects, each consisting of a field name and a specific data type such as `StringType`.

- When you `spark.read` data from `csv` without further options, every column gets the most flexible type, `StringType`.
- When you include the `inferSchema` option, an extra pass is made over the data to infer the data type of each column.
- For formats that include a schema, like `parquet` or `orc`, the schema in the file is loaded.


In [13]:
print("Stocks: No Schema")
spark.read \
    .option("header",True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").printSchema()

# Infer schema from the columns
print("Stocks: Infer Schema")
spark.read \
    .option("header",True) \
    .option("inferSchema", True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").printSchema()


# Comma-delimited file with a header row
print("Customers...")
customers = spark.read \
    .option("sep",",") \
    .option("header",True) \
    .option("inferSchema",True) \
    .csv("file:///home/jovyan/datasets/customers/customers.csv")
    
customers.printSchema()
customers.show(5)

Stocks: No Schema
root
 |-- price: string (nullable = true)
 |-- symbol: string (nullable = true)

Stocks: Infer Schema
root
 |-- price: double (nullable = true)
 |-- symbol: string (nullable = true)

Customers...
root
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Last IP Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Total Orders: integer (nullable = true)
 |-- Total Purchased: integer (nullable = true)
 |-- Months Customer: integer (nullable = true)

+-----+------+--------------------+------+---------------+--------+-----+------------+---------------+---------------+
|First|  Last|               Email|Gender|Last IP Address|    City|State|Total Orders|Total Purchased|Months Customer|
+-----+------+--------------------+------+---------------+--------+-----+------------+---------------+---------------+
|   Al|Fresco|

## DataFrame Schemas: Nested Schema

Spark handles file formats with nested schemas, such as `json` very well. This means you can read from Document and Graph databases easily. 

- Embedded columns can be additional `StructType` columns or `ArrayType` for nested lists of values.
- Later we will introduce strategies for dealing with nested schemas like this one.

In [15]:
# JSON documents can contain nested structures
print("Places...")
places = spark.read \
    .json("file:///home/jovyan/datasets/json-samples/google-places.json")
    
places.printSchema()
places.toPandas()


Places...
root
 |-- business_status: string (nullable = true)
 |-- geometry: struct (nullable = true)
 |    |-- location: struct (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- lng: double (nullable = true)
 |    |-- viewport: struct (nullable = true)
 |    |    |-- northeast: struct (nullable = true)
 |    |    |    |-- lat: double (nullable = true)
 |    |    |    |-- lng: double (nullable = true)
 |    |    |-- southwest: struct (nullable = true)
 |    |    |    |-- lat: double (nullable = true)
 |    |    |    |-- lng: double (nullable = true)
 |-- icon: string (nullable = true)
 |-- icon_background_color: string (nullable = true)
 |-- icon_mask_base_uri: string (nullable = true)
 |-- name: string (nullable = true)
 |-- opening_hours: struct (nullable = true)
 |    |-- open_now: boolean (nullable = true)
 |-- photos: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- h

Unnamed: 0,business_status,geometry,icon,icon_background_color,icon_mask_base_uri,name,opening_hours,photos,place_id,plus_code,price_level,rating,reference,scope,types,user_ratings_total,vicinity
0,,"((43.0481221, -76.14742439999999), ((43.086101...",https://maps.gstatic.com/mapfiles/place_api/ic...,#7B9EB0,https://maps.gstatic.com/mapfiles/place_api/ic...,Syracuse,,"[(1080, [<a href=""https://maps.google.com/maps...",ChIJDZqXv5vz2YkRRZWt1-IM1QA,,,,ChIJDZqXv5vz2YkRRZWt1-IM1QA,GOOGLE,"[locality, political]",,Syracuse
1,OPERATIONAL,"((43.0476078, -76.1417642), ((43.0489864302915...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,Crowne Plaza Syracuse,"(True,)","[(2048, [<a href=""https://maps.google.com/maps...",ChIJXxPu66Tz2YkRrDM-ZxvDhEE,"(2VX5+27 Syracuse, NY, USA, 87M52VX5+27)",,4.1,ChIJXxPu66Tz2YkRrDM-ZxvDhEE,GOOGLE,"[lodging, point_of_interest, establishment]",1153.0,"701 East Genesee Street, Syracuse"
2,OPERATIONAL,"((43.0476157, -76.140986), ((43.0488486802915,...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,The Parkview Hotel,"(True,)","[(2584, [<a href=""https://maps.google.com/maps...",ChIJrWsN9KTz2YkRFwo4vpFDe8I,"(2VX5+2J Syracuse, NY, USA, 87M52VX5+2J)",,4.3,ChIJrWsN9KTz2YkRFwo4vpFDe8I,GOOGLE,"[lodging, point_of_interest, establishment]",350.0,"713 East Genesee Street, Syracuse"
3,OPERATIONAL,"((43.0472894, -76.15385049999999), ((43.048601...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,Jefferson Clinton Suites,"(True,)","[(4096, [<a href=""https://maps.google.com/maps...",ChIJa_hOyrjz2YkRPNS3MsHdMlA,"(2RWW+WF Syracuse, NY, USA, 87M52RWW+WF)",,4.4,ChIJa_hOyrjz2YkRPNS3MsHdMlA,GOOGLE,"[lodging, point_of_interest, establishment]",397.0,"416 South Clinton Street, Syracuse"
4,OPERATIONAL,"((43.0488846, -76.1561175), ((43.0501538302915...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,Courtyard by Marriott Syracuse Downtown at Arm...,"(True,)","[(1192, [<a href=""https://maps.google.com/maps...",ChIJGzEmOsfz2YkRmZJfIkMXpPo,"(2RXV+HH Syracuse, NY, USA, 87M52RXV+HH)",,4.1,ChIJGzEmOsfz2YkRmZJfIkMXpPo,GOOGLE,"[lodging, point_of_interest, establishment]",396.0,"300 West Fayette Street, Syracuse"
5,OPERATIONAL,"((43.05264399999999, -76.14681999999999), ((43...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,Quality Inn & Suites Downtown,"(True,)","[(3000, [<a href=""https://maps.google.com/maps...",ChIJQzTi87Dz2YkRzieumuYnlws,"(3V33+37 Syracuse, NY, USA, 87M53V33+37)",,3.7,ChIJQzTi87Dz2YkRzieumuYnlws,GOOGLE,"[lodging, point_of_interest, establishment]",385.0,"454 James Street, Syracuse"
6,OPERATIONAL,"((43.0391534, -76.1351158), ((43.0491796000000...",https://maps.gstatic.com/mapfiles/place_api/ic...,#7B9EB0,https://maps.gstatic.com/mapfiles/place_api/ic...,Syracuse University,"(False,)","[(2268, [<a href=""https://maps.google.com/maps...",ChIJVcwsup_z2YkRTQhRUgaJYF4,"(2VQ7+MX Syracuse, NY, USA, 87M52VQ7+MX)",,4.3,ChIJVcwsup_z2YkRTQhRUgaJYF4,GOOGLE,"[university, point_of_interest, establishment]",257.0,Syracuse
7,OPERATIONAL,"((43.0464172, -76.13539879999999), ((43.047831...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,"Collegian Hotel & Suites, Trademark Collection...","(True,)","[(3840, [<a href=""https://maps.google.com/maps...",ChIJbXjR46bz2YkRONbKfknxgnE,"(2VW7+HR Syracuse, NY, USA, 87M52VW7+HR)",,4.0,ChIJbXjR46bz2YkRONbKfknxgnE,GOOGLE,"[lodging, restaurant, food, point_of_interest,...",942.0,"1060 East Genesee Street, Syracuse"
8,OPERATIONAL,"((43.0526411, -76.15469379999999), ((43.053954...",https://maps.gstatic.com/mapfiles/place_api/ic...,#FF9E67,https://maps.gstatic.com/mapfiles/place_api/ic...,Dinosaur Bar-B-Que,"(True,)","[(2382, [<a href=""https://maps.google.com/maps...",ChIJkSmfAbjz2YkR5WIa4ilZjQU,"(3R3W+34 Syracuse, NY, USA, 87M53R3W+34)",2.0,4.6,ChIJkSmfAbjz2YkR5WIa4ilZjQU,GOOGLE,"[restaurant, food, point_of_interest, establis...",7561.0,"246 West Willow Street, Syracuse"
9,OPERATIONAL,"((43.04396249999999, -76.13607999999999), ((43...",https://maps.gstatic.com/mapfiles/place_api/ic...,#909CE1,https://maps.gstatic.com/mapfiles/place_api/ic...,"Hotel Skyler Syracuse, Tapestry Collection by ...","(True,)","[(3277, [<a href=""https://maps.google.com/maps...",ChIJuQVNaqHz2YkRdrkAL_quAeA,"(2VV7+HH Syracuse, NY, USA, 87M52VV7+HH)",,4.6,ChIJuQVNaqHz2YkRdrkAL_quAeA,GOOGLE,"[lodging, point_of_interest, establishment]",376.0,"601 South Crouse Avenue, Syracuse"


## Column Transformations

 - `withColumnRenamed()` – rename a column
 - `toDF()` – rename all columns
 - `withColumn()` – overwrite an existing column or derive a new one
 - `drop()` – remove a column
 - `select()` - column projections


### Setting Column Names

In [16]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t").csv("file:///home/jovyan/datasets/grades/*.tsv")

print("Default Columns Names... yuck")
grades.show(5)

print("Rename first two columns")
grades2 = grades.withColumnRenamed("_c0","Year").withColumnRenamed("_c1","Semester")
grades2.show(5)

print("Rename all the columns")
grades3 = grades.toDF("Year", "Semester", "Course", "Credits", "Grade")
grades3.show(5)




Default Columns Names... yuck
+----+----+------+---+---+
| _c0| _c1|   _c2|_c3|_c4|
+----+----+------+---+---+
|2016|Fall|IST346|  3|  A|
|2016|Fall|CHE111|  4| A-|
|2016|Fall|PSY120|  3| B+|
|2016|Fall|IST256|  3|  A|
|2016|Fall|ENG121|  3| B+|
+----+----+------+---+---+
only showing top 5 rows

Rename first two columns
+----+--------+------+---+---+
|Year|Semester|   _c2|_c3|_c4|
+----+--------+------+---+---+
|2016|    Fall|IST346|  3|  A|
|2016|    Fall|CHE111|  4| A-|
|2016|    Fall|PSY120|  3| B+|
|2016|    Fall|IST256|  3|  A|
|2016|    Fall|ENG121|  3| B+|
+----+--------+------+---+---+
only showing top 5 rows

Rename all the columns
+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|PSY120|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|ENG121|      3|   B+|
+----+--------+------+-------+-----+
only showing top 

### Derived Columns 

In [20]:
# deriving a column
from pyspark.sql.functions import lit
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

# Ways to reference a column (compare with R and pandas):
#   DataFrame indexing: grades["Year"]
#   Object attribute:   grades.Year
#   String / SQL name:  "Year"

grades2 = grades.withColumn("Next Year",grades.Year + 1) \
    .withColumn("YearString", grades['Year'].cast("String") ) \
    .withColumn("NullCol", lit(None) )
grades2.printSchema()
grades2.show()

# show() returns None, so assign the DataFrame first and display it separately
grades3 = grades2.drop(grades2.NullCol)
grades3.show()


root
 |-- Year: integer (nullable = true)
 |-- Semester: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Credits: integer (nullable = true)
 |-- Grade: string (nullable = true)
 |-- Next Year: integer (nullable = true)
 |-- YearString: string (nullable = true)
 |-- NullCol: null (nullable = true)

+----+--------+------+-------+-----+---------+----------+-------+
|Year|Semester|Course|Credits|Grade|Next Year|YearString|NullCol|
+----+--------+------+-------+-----+---------+----------+-------+
|2016|    Fall|IST346|      3|    A|     2017|      2016|   null|
|2016|    Fall|CHE111|      4|   A-|     2017|      2016|   null|
|2016|    Fall|PSY120|      3|   B+|     2017|      2016|   null|
|2016|    Fall|IST256|      3|    A|     2017|      2016|   null|
|2016|    Fall|ENG121|      3|   B+|     2017|      2016|   null|
|2015|    Fall|IST101|      1|    A|     2016|      2015|   null|
|2015|    Fall|IST195|      3|    A|     2016|      2015|   null|
|2015|    Fall|IST233

### Column Projections with `select`

In [21]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

# string references
grades.select("Course", "Grade").show(5)

# Object property references
grades.select(grades.Course, grades.Grade).show(5)

# Dataframe references
grades.select(grades["Course"], grades["Grade"]).show(5)


+------+-----+
|Course|Grade|
+------+-----+
|IST346|    A|
|CHE111|   A-|
|PSY120|   B+|
|IST256|    A|
|ENG121|   B+|
+------+-----+
only showing top 5 rows

+------+-----+
|Course|Grade|
+------+-----+
|IST346|    A|
|CHE111|   A-|
|PSY120|   B+|
|IST256|    A|
|ENG121|   B+|
+------+-----+
only showing top 5 rows

+------+-----+
|Course|Grade|
+------+-----+
|IST346|    A|
|CHE111|   A-|
|PSY120|   B+|
|IST256|    A|
|ENG121|   B+|
+------+-----+
only showing top 5 rows



In [23]:
grades.select("Year", grades["Course"], grades.Grade).show()

+----+------+-----+
|Year|Course|Grade|
+----+------+-----+
|2016|IST346|    A|
|2016|CHE111|   A-|
|2016|PSY120|   B+|
|2016|IST256|    A|
|2016|ENG121|   B+|
|2015|IST101|    A|
|2015|IST195|    A|
|2015|IST233|   B+|
|2015|SOC101|   A-|
|2015|MAT221|    C|
|2016|GEO110|   B+|
|2016|MAT222|    A|
|2016|SOC121|   C+|
|2016|BIO240|   B-|
|2017|IST462|    A|
|2017|MAT411|    C|
|2017|SOC422|   B-|
|2017|ENV201|   A-|
+----+------+-----+



## Row Transformations

- `where()` or `filter()` apply a row-based filter
- `distinct()` remove duplicates
- `sort()` or `orderBy()` sort by columns


### Where / Filter

In [25]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

print("A grades")
# string references
grades.where("Grade = 'A' or Grade='A-'").show()

# Object property references
grades.filter( (grades.Grade == "A") | (grades.Grade == "A-") ).show()

# Dataframe references
grades.filter( (grades["Grade"] == "A") | (grades["Grade"] == "A-") ).show()


A grades
+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|IST256|      3|    A|
|2015|    Fall|IST101|      1|    A|
|2015|    Fall|IST195|      3|    A|
|2015|    Fall|SOC101|      3|   A-|
|2016|  Spring|MAT222|      3|    A|
|2017|  Spring|IST462|      3|    A|
|2017|  Spring|ENV201|      3|   A-|
+----+--------+------+-------+-----+

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|IST256|      3|    A|
|2015|    Fall|IST101|      1|    A|
|2015|    Fall|IST195|      3|    A|
|2015|    Fall|SOC101|      3|   A-|
|2016|  Spring|MAT222|      3|    A|
|2017|  Spring|IST462|      3|    A|
|2017|  Spring|ENV201|      3|   A-|
+----+--------+------+-------+-----+

+----+--------+------+-----

In [41]:
grades \
    .where(grades.Semester == 'Fall') \
    .where(grades.Credits != 3)\
    .select("Course","Semester","Credits")\
    .sort("Credits").show()

+------+--------+-------+
|Course|Semester|Credits|
+------+--------+-------+
|IST101|    Fall|      1|
|CHE111|    Fall|      4|
+------+--------+-------+



### Distinct

In [35]:
terms = grades.select("Year","Semester")
print("Terms")
terms.show()
print("Distinct Terms")
dterms = terms.distinct()
dterms.show()

Terms
+----+--------+
|Year|Semester|
+----+--------+
|2016|    Fall|
|2016|    Fall|
|2016|    Fall|
|2016|    Fall|
|2016|    Fall|
|2015|    Fall|
|2015|    Fall|
|2015|    Fall|
|2015|    Fall|
|2015|    Fall|
|2016|  Spring|
|2016|  Spring|
|2016|  Spring|
|2016|  Spring|
|2017|  Spring|
|2017|  Spring|
|2017|  Spring|
|2017|  Spring|
+----+--------+

Distinct Terms
+----+--------+
|Year|Semester|
+----+--------+
|2016|    Fall|
|2017|  Spring|
|2015|    Fall|
|2016|  Spring|
+----+--------+



### Sort / orderBy

In [38]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

print("Sorting")
# string references
grades.sort("Year","Course").show()

# Object property references
grades.sort(grades.Year, grades.Course.desc() ).show()

# Dataframe references
grades.sort( grades["Year"], grades["Course"].desc()).show()


Sorting
+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2015|    Fall|IST101|      1|    A|
|2015|    Fall|IST195|      3|    A|
|2015|    Fall|IST233|      3|   B+|
|2015|    Fall|MAT221|      3|    C|
|2015|    Fall|SOC101|      3|   A-|
|2016|  Spring|BIO240|      3|   B-|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|ENG121|      3|   B+|
|2016|  Spring|GEO110|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|IST346|      3|    A|
|2016|  Spring|MAT222|      3|    A|
|2016|    Fall|PSY120|      3|   B+|
|2016|  Spring|SOC121|      3|   C+|
|2017|  Spring|ENV201|      3|   A-|
|2017|  Spring|IST462|      3|    A|
|2017|  Spring|MAT411|      3|    C|
|2017|  Spring|SOC422|      3|   B-|
+----+--------+------+-------+-----+

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2015|    Fall|SOC101|      3|   A-|
|2015|    Fall|MAT221|      3

## Aggregate Transformations

- `groupBy()` - perform a column grouping, similar to SQL `GROUP BY`; returns a `GroupedData`
- `agg()` - apply aggregate functions to the `GroupedData`; returns a `DataFrame`
- `alias()` - assign a name to a derived column
- Aggregate functions: `count(), avg(), max(), min(), sum()`

In [42]:
# Note: these imports shadow Python's built-in sum, max, and min for the rest of the notebook
from pyspark.sql.functions import col,sum,avg,max,min,count

grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

grades.show(5)

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|PSY120|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|ENG121|      3|   B+|
+----+--------+------+-------+-----+
only showing top 5 rows



In [47]:
totalcredits = grades.groupBy().agg( 
        sum("Credits").alias("TotalCredits"), 
        count("*").alias("CourseCount") 
)
totalcredits.show() 

termcredits = grades.groupBy("Year", "Semester").agg(
    count("*").alias("CourseCount"), sum("Credits").alias("TotalCredits")
).sort("Year", col("Semester").desc())
termcredits.show()

grades.groupby(grades.Year, grades['Semester']).agg(count("*")).show()

+------------+-----------+
|TotalCredits|CourseCount|
+------------+-----------+
|          53|         18|
+------------+-----------+

+----+--------+-----------+------------+
|Year|Semester|CourseCount|TotalCredits|
+----+--------+-----------+------------+
|2015|    Fall|          5|          13|
|2016|  Spring|          4|          12|
|2016|    Fall|          5|          16|
|2017|  Spring|          4|          12|
+----+--------+-----------+------------+

+----+--------+--------+
|Year|Semester|count(1)|
+----+--------+--------+
|2016|    Fall|       5|
|2017|  Spring|       4|
|2015|    Fall|       5|
|2016|  Spring|       4|
+----+--------+--------+



## Merge Transformations

- `join()` - merge data frames by matching columns, like a SQL join. Takes a join type string (the default is "inner"):
    - "inner"  - SQL-like inner join
    - "full" - SQL-like full outer join
    - "left" - SQL-like left join
    - "right" - SQL-like right join
    - "cross" - Cartesian product
- `union()` - merge two data frames by row, duplicates included; use `distinct()` to remove them.

### Joins

In [52]:
gradepoints = spark.read.option("inferSchema",True)\
    .csv("/home/jovyan/datasets/courses/grade-points.csv")\
    .toDF("letterGrade","gradePoint")
print("gradepoints")
gradepoints.show()
grades.show(5)

gradepoints
+-----------+----------+
|letterGrade|gradePoint|
+-----------+----------+
|          A|       4.0|
|         A-|     3.666|
|         B+|     3.333|
|          B|       3.0|
|         B-|     2.666|
|         C+|     2.333|
|          C|       2.0|
|         C-|     1.666|
|          D|       1.0|
|          F|       0.0|
+-----------+----------+

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|PSY120|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|ENG121|      3|   B+|
+----+--------+------+-------+-----+
only showing top 5 rows



In [57]:
print("Inner")
grades.join(gradepoints, grades.Grade == gradepoints.letterGrade, "inner").show()

print("Full")
grades.join(gradepoints, grades.Grade == gradepoints.letterGrade, "full").show()


Inner
+----+--------+------+-------+-----+-----------+----------+
|Year|Semester|Course|Credits|Grade|letterGrade|gradePoint|
+----+--------+------+-------+-----+-----------+----------+
|2016|    Fall|IST346|      3|    A|          A|       4.0|
|2016|    Fall|CHE111|      4|   A-|         A-|     3.666|
|2016|    Fall|PSY120|      3|   B+|         B+|     3.333|
|2016|    Fall|IST256|      3|    A|          A|       4.0|
|2016|    Fall|ENG121|      3|   B+|         B+|     3.333|
|2015|    Fall|IST101|      1|    A|          A|       4.0|
|2015|    Fall|IST195|      3|    A|          A|       4.0|
|2015|    Fall|IST233|      3|   B+|         B+|     3.333|
|2015|    Fall|SOC101|      3|   A-|         A-|     3.666|
|2015|    Fall|MAT221|      3|    C|          C|       2.0|
|2016|  Spring|GEO110|      3|   B+|         B+|     3.333|
|2016|  Spring|MAT222|      3|    A|          A|       4.0|
|2016|  Spring|SOC121|      3|   C+|         C+|     2.333|
|2016|  Spring|BIO240|      3|   B

In [58]:
joinedgrades = grades.join(gradepoints, grades.Grade == gradepoints.letterGrade, "full")
joinedgrades.filter("Grade is null").show()

+----+--------+------+-------+-----+-----------+----------+
|Year|Semester|Course|Credits|Grade|letterGrade|gradePoint|
+----+--------+------+-------+-----+-----------+----------+
|null|    null|  null|   null| null|          F|       0.0|
|null|    null|  null|   null| null|          B|       3.0|
|null|    null|  null|   null| null|          D|       1.0|
|null|    null|  null|   null| null|         C-|     1.666|
+----+--------+------+-------+-----+-----------+----------+



### Unions

In [59]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/*.tsv")\
    .toDF("Year", "Semester", "Course", "Credits", "Grade")
fallgrades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/fall*.tsv")\
    .toDF("Year", "Semester", "Course", "Credits", "Grade")
springgrades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/spring*.tsv")\
    .toDF("Year", "Semester", "Course", "Credits", "Grade")

fallgrades.show()
springgrades.show()

fallgrades.union(springgrades).show()

print("Double the courses!")
grades.union(grades).groupBy().count().show()

print("Filter out the duplicates")
grades.union(grades).distinct().groupby().count().show()

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|PSY120|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|ENG121|      3|   B+|
|2015|    Fall|IST101|      1|    A|
|2015|    Fall|IST195|      3|    A|
|2015|    Fall|IST233|      3|   B+|
|2015|    Fall|SOC101|      3|   A-|
|2015|    Fall|MAT221|      3|    C|
+----+--------+------+-------+-----+

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|  Spring|GEO110|      3|   B+|
|2016|  Spring|MAT222|      3|    A|
|2016|  Spring|SOC121|      3|   C+|
|2016|  Spring|BIO240|      3|   B-|
|2017|  Spring|IST462|      3|    A|
|2017|  Spring|MAT411|      3|    C|
|2017|  Spring|SOC422|      3|   B-|
|2017|  Spring|ENV201|      3|   A-|
+----+--------+------+-------+-----+

+----+--------+------+-------+-----+



+-----+
|count|
+-----+
|   18|
+-----+



                                                                                

In [66]:

fallgrades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/fall*.tsv")\
    .toDF("Year", "Semester", "Course", "Credits", "Grade")
springgrades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/spring*.tsv")\
    .toDF("Year", "Semester", "Course", "Credits", "Grade").drop("Grade")

fallgrades.printSchema()


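# union() matches columns by position, so both data frames need the same
# number of columns, in the same order, with compatible types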

springgrades = springgrades.withColumn("Grade", lit(None) )
springgrades.printSchema()

allgrades = fallgrades.union(springgrades)
allgrades.show()

root
 |-- Year: integer (nullable = true)
 |-- Semester: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Credits: integer (nullable = true)
 |-- Grade: string (nullable = true)

root
 |-- Year: integer (nullable = true)
 |-- Semester: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Credits: integer (nullable = true)
 |-- Grade: null (nullable = true)

+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|PSY120|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|ENG121|      3|   B+|
|2015|    Fall|IST101|      1|    A|
|2015|    Fall|IST195|      3|    A|
|2015|    Fall|IST233|      3|   B+|
|2015|    Fall|SOC101|      3|   A-|
|2015|    Fall|MAT221|      3|    C|
|2016|  Spring|GEO110|      3| null|
|2016|  Spring|MAT222|      3| null|
|2016|  Spring|SOC121|      3| null|
|2016|  Spring

## User-Defined Functions (UDFs)

- User-defined functions allow us to write custom transformations. The process:

1. Create a Python function and decorate it for Spark with `@udf(returnType=...)`.
2. Apply the function in `select()` or `withColumn()`.


In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

@udf(returnType=StringType())
def term(year, semester):
    return f"{year}-{semester}"


@udf(returnType=BooleanType())
def inMajor(course):
    # Null course values arrive as None, so guard before calling startswith()
    return course is not None and course.startswith("IST")


grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

grades.withColumn("Term", term( grades.Year, grades.Semester) ).show()

grades.select("Course", inMajor(grades.Course).alias("InMajor")).show()


## Nested Column Transformations

- Sometimes the schema is nested, with additional `StructType` or `ArrayType` fields.
- For a nested `StructType`, you can use dot notation (for example `geometry.location.lat`) to get to the nested columns.
- For a nested `ArrayType`, you can use the `explode()` function to flatten the nested data. When you explode an array, the parent values repeat for each value in the array.

In [73]:
from pyspark.sql.functions import explode
places = spark.read.json("file:///home/jovyan/datasets/json-samples/google-places.json", multiLine=True)
places.printSchema()
# places.show(5)

print("Two places")
places.select('name','geometry.location.lat',places.geometry.location.lng, places['types']).show(2)

print("Same two places, one row per type")
a = places.select('name','geometry.location.lat',places.geometry.location.lng, explode(places.types).alias("type") )
a.where("type = 'lodging'").show()

print("Let's get the photo attributions")
places.select('name', explode( places.photos ).alias("col") ) \
     .select("name", explode("col.html_attributions").alias("attributions") ) \
     .show(truncate=False)

root
 |-- business_status: string (nullable = true)
 |-- geometry: struct (nullable = true)
 |    |-- location: struct (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- lng: double (nullable = true)
 |    |-- viewport: struct (nullable = true)
 |    |    |-- northeast: struct (nullable = true)
 |    |    |    |-- lat: double (nullable = true)
 |    |    |    |-- lng: double (nullable = true)
 |    |    |-- southwest: struct (nullable = true)
 |    |    |    |-- lat: double (nullable = true)
 |    |    |    |-- lng: double (nullable = true)
 |-- icon: string (nullable = true)
 |-- icon_background_color: string (nullable = true)
 |-- icon_mask_base_uri: string (nullable = true)
 |-- name: string (nullable = true)
 |-- opening_hours: struct (nullable = true)
 |    |-- open_now: boolean (nullable = true)
 |-- photos: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- height: long (nullable = true)
 |    |    |-- html_attributi

## Explain

The `explain()` function prints the execution plan for a data frame's transformations.
This is useful for understanding how Spark turns the transformations into a DAG.
Note that the transformations are not processed in the order they are written; they are processed in the order chosen by Spark's optimizer.

Notice in this example that the last transformation filters the Year to 2016. In the physical plan, this filter is one of the first steps. (You read the plan from bottom to top.)



In [74]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/*.tsv")\
    .toDF("Year", "Semester", "Course", "Credits", "Grade")
termcredits = grades.groupBy("Year", "Semester").agg( \
    count("*").alias("CourseCount"), sum("Credits").alias("TotalCredits") \
    ).sort("Year",col("Semester").desc())
final = termcredits.filter("Year=2016")
final.explain()


== Physical Plan ==
*(2) HashAggregate(keys=[Year#4756, Semester#4757], functions=[count(1), sum(cast(Credits#4759 as bigint))])
+- Exchange hashpartitioning(Year#4756, Semester#4757, 200), ENSURE_REQUIREMENTS, [id=#2920]
   +- *(1) HashAggregate(keys=[Year#4756, Semester#4757], functions=[partial_count(1), partial_sum(cast(Credits#4759 as bigint))])
      +- *(1) Project [_c0#4746 AS Year#4756, _c1#4747 AS Semester#4757, _c3#4749 AS Credits#4759]
         +- *(1) Filter (isnotnull(_c0#4746) AND (_c0#4746 = 2016))
            +- FileScan csv [_c0#4746,_c1#4747,_c3#4749] Batched: false, DataFilters: [isnotnull(_c0#4746), (_c0#4746 = 2016)], Format: CSV, Location: InMemoryFileIndex[file:/home/jovyan/datasets/grades/fall2015.tsv, file:/home/jovyan/datasets/grad..., PartitionFilters: [], PushedFilters: [IsNotNull(_c0), EqualTo(_c0,2016)], ReadSchema: struct<_c0:int,_c1:string,_c3:int>




In [76]:
a = grades.filter("year = 2016")\
    .filter(grades.Semester == "Fall")\
    .sort("Course") \
    .select("Course", grades.Credits, grades["Grade"])

In [77]:
a.explain()

== Physical Plan ==
*(2) Sort [Course#4758 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(Course#4758 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#2941]
   +- *(1) Project [_c2#4748 AS Course#4758, _c3#4749 AS Credits#4759, _c4#4750 AS Grade#4760]
      +- *(1) Filter (((isnotnull(_c0#4746) AND isnotnull(_c1#4747)) AND (_c0#4746 = 2016)) AND (_c1#4747 = Fall))
         +- FileScan csv [_c0#4746,_c1#4747,_c2#4748,_c3#4749,_c4#4750] Batched: false, DataFilters: [isnotnull(_c0#4746), isnotnull(_c1#4747), (_c0#4746 = 2016), (_c1#4747 = Fall)], Format: CSV, Location: InMemoryFileIndex[file:/home/jovyan/datasets/grades/fall2015.tsv, file:/home/jovyan/datasets/grad..., PartitionFilters: [], PushedFilters: [IsNotNull(_c0), IsNotNull(_c1), EqualTo(_c0,2016), EqualTo(_c1,Fall)], ReadSchema: struct<_c0:int,_c1:string,_c2:string,_c3:int,_c4:string>




In [78]:
b = grades.sort("Course") \
    .filter(grades.Semester == "Fall")\
    .select("Course", grades.Credits, grades["Grade"])\
    .filter("year = 2016")

In [79]:
b.explain()

== Physical Plan ==
*(2) Sort [Course#4758 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(Course#4758 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#2962]
   +- *(1) Project [_c2#4748 AS Course#4758, _c3#4749 AS Credits#4759, _c4#4750 AS Grade#4760]
      +- *(1) Filter (((isnotnull(_c1#4747) AND isnotnull(_c0#4746)) AND (_c1#4747 = Fall)) AND (_c0#4746 = 2016))
         +- FileScan csv [_c0#4746,_c1#4747,_c2#4748,_c3#4749,_c4#4750] Batched: false, DataFilters: [isnotnull(_c1#4747), isnotnull(_c0#4746), (_c1#4747 = Fall), (_c0#4746 = 2016)], Format: CSV, Location: InMemoryFileIndex[file:/home/jovyan/datasets/grades/fall2015.tsv, file:/home/jovyan/datasets/grad..., PartitionFilters: [], PushedFilters: [IsNotNull(_c1), IsNotNull(_c0), EqualTo(_c1,Fall), EqualTo(_c0,2016)], ReadSchema: struct<_c0:int,_c1:string,_c2:string,_c3:int,_c4:string>




In [None]:
df

In [None]:
from pyspark.sql.types import DoubleType
#df.withColumn("price", df.price.cast(DoubleType())).printSchema().sort(df["price"]).toPandas()

df.sort(df.price.cast("Float").asc()).show()

In [81]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t")\
    .csv("file:///home/jovyan/datasets/grades/*.tsv")
grades.explain()

== Physical Plan ==
FileScan csv [_c0#4807,_c1#4808,_c2#4809,_c3#4810,_c4#4811] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/jovyan/datasets/grades/fall2015.tsv, file:/home/jovyan/datasets/grad..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:int,_c1:string,_c2:string,_c3:int,_c4:string>


