# Object Storage and PySpark Programming

- Examples From Video Lecture 

In [32]:
import pyspark
from pyspark.sql import SparkSession

bucket = "d-object-spark"

spark = SparkSession.builder \
    .master("local") \
    .appName('jupyter-pyspark') \
        .config("hive.metastore.uris", "thrift://hive-metastore:9083") \
        .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.1.2,org.apache.spark:spark-avro_2.12:3.1.2")\
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
        .config("spark.hadoop.fs.s3a.access.key", "minio") \
        .config("spark.hadoop.fs.s3a.secret.key", "SU2orange!") \
        .config("spark.hadoop.fs.s3a.fast.upload", True) \
        .config("spark.hadoop.fs.s3a.path.style.access", True) \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR") # Keeps the noise down!!!

In [4]:
# Print context
print('Spark Context : ', spark.sparkContext)
print('Spark Version : ', spark.sparkContext.version)
print('Spark appName :', spark.sparkContext.appName)
print('Hadoop version: ', spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
print('Spark Configuration:')
for conf in spark.sparkContext._conf.getAll():
    print(f"\t{conf[0]} = {conf[1]}")

Spark Context :  <SparkContext master=local appName=jupyter-pyspark>
Spark Version :  3.1.2
Spark appName : jupyter-pyspark
Hadoop version:  3.2.0
Spark Configuration:
	spark.master = local
	hive.metastore.uris = thrift://hive-metastore:9083
	spark.submit.pyFiles = /home/jovyan/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.1.2.jar,/home/jovyan/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.1.2.jar,/home/jovyan/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.271.jar,/home/jovyan/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar
	spark.app.name = jupyter-pyspark
	spark.hadoop.fs.s3a.path.style.access = True
	spark.serializer.objectStreamReset = 100
	spark.jars.packages = org.apache.hadoop:hadoop-aws:3.1.2,org.apache.spark:spark-avro_2.12:3.1.2
	spark.submit.deployMode = client
	spark.driver.port = 34137
	spark.hadoop.fs.s3a.fast.upload = True
	spark.app.initial.jar.urls = spark://jupyter:34137/jars/org.apache.hadoop_hadoop-aws-3.1.2.jar,spark://jupyter:34137/jars/org.apache.spark_spark-avr

## Setup

- Put data in the right places!!!
- Run these cells to ensure you have the data for the examples

In [2]:
! pip install minio

Collecting minio
  Downloading minio-7.1.5-py3-none-any.whl (75 kB)
Installing collected packages: minio
Successfully installed minio-7.1.5


In [4]:
from minio import Minio

# Make the minio bucket if it does not exist yet
client = Minio("minio:9000", "minio", "SU2orange!", secure=False)
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Open the example (requires `spark` and `bucket` from the first cell)
df = spark.read.csv("/home/jovyan/datasets/stocks/stocks.csv", inferSchema=True, header=True)

# Put the example in minio
df.write.mode("overwrite").csv(f"s3a://{bucket}/stocks.csv", header=True)

# Put the example in HDFS
df.write.mode("overwrite").csv(f"hdfs://namenode/user/root/{bucket}/stocks.csv", header=True)


NameError: name 'bucket' is not defined

## Minio Client

This section outlines commands from the minio client

These commands are run from the terminal in your jupyter notebook setup

### Minio Alias Setup

```
# install the client

wget https://dl.min.io/client/mc/release/linux-amd64/mc && chmod +x mc && sudo mv -f mc /usr/local/bin

# view aliases

mc alias list

# create alias to our server, which we will call "ms"

mc alias set ms http://minio:9000 minio SU2orange!

# to delete an alias

mc alias rm ms

```

### Minio File and bucket commands

These are similar to the `hadoop fs` commands. 

```
#make bucket testing 
mc mb play/testing

# list buckets on the play alias
mc ls play

# copy files to the play/testing bucket

mc cp /datasets/customers/* play/testing
```


## Reading Data into the Spark Dataframe: Paths

Spark can read (and write) data from a variety of locations; the scheme at the start of the path selects the storage system.

- `file://` reads a file off the local file system. Not ideal for a clustered environment. Use `SparkFiles`.
- `s3a://` reads from our object storage configuration
- `hdfs://` reads from Hadoop's HDFS using the client
- `webhdfs://` reads from Hadoop's HDFS using the web client
- `https://` reads over the web - must use `SparkFiles`. See next section.
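
The part of a path before `://` is its URI scheme, and for `s3a://` the "host" position holds the bucket name. A minimal sketch using only the Python standard library; `describe_path` and `SCHEMES` are hypothetical names introduced for illustration, and the paths mirror the ones used in this notebook.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: scheme -> where the data lives
SCHEMES = {
    "file": "local file system",
    "s3a": "object storage (MinIO in this setup)",
    "hdfs": "Hadoop HDFS client",
    "webhdfs": "Hadoop HDFS web client",
    "https": "web download (use SparkFiles)",
}

def describe_path(path):
    """Split a data path into (scheme, host-or-bucket, path on that system)."""
    parts = urlparse(path)
    return parts.scheme, parts.netloc, parts.path

bucket = "d-object-spark"
for p in [
    "file:///home/jovyan/datasets/stocks/stocks.csv",
    f"s3a://{bucket}/stocks.csv",
    f"hdfs://namenode/user/root/{bucket}/stocks.csv",
]:
    scheme, host, rest = describe_path(p)
    print(f"{scheme}: host/bucket={host!r} path={rest} -> {SCHEMES[scheme]}")
```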


In [11]:
print("file://") # Not Ideal!
spark.read.text("file:///home/jovyan/datasets/stocks/stocks.csv").show(3)

print("s3a://")
spark.read.text(f"s3a://{bucket}/stocks.csv").show(3)

print("hdfs://")
spark.read.text(f"hdfs://namenode/user/root/{bucket}/stocks/").show(3)

print("webhdfs://")
spark.read.text(f"webhdfs://namenode:50070/user/root/{bucket}/stocks/").show(3)



file://
+------------+
|       value|
+------------+
|price,symbol|
| 126.82,AAPL|
|3098.12,AMZN|
+------------+
only showing top 3 rows

s3a://
+------------+
|       value|
+------------+
|price,symbol|
| 126.82,AAPL|
|3098.12,AMZN|
+------------+
only showing top 3 rows

hdfs://


AnalysisException: Path does not exist: hdfs://namenode/user/root/d-object-spark/stocks

## Reading Data : `SparkFiles`

Let's not forget that Spark is a distributed computing environment. A file on the local disk or on the web is not automatically visible to every node, so reading it directly does not let Spark take advantage of its distributed nature. `SparkFiles` solves this by registering the file with the `sparkContext` of the `SparkSession`, which in essence makes the cluster aware of the file.

`spark.sparkContext.addFile(url)` downloads the file at `url` and adds it to a temporary location on the worker nodes in the cluster.

When you need the file, use `SparkFiles.get(filename)` to retrieve its path.

NOTES:

- You add a file by path, but access the file by name.
- You cannot add the same file name more than once.
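
To make the first note concrete: the name to pass to `SparkFiles.get()` is the last segment of the path or URL that was added. A small sketch with the standard library; `spark_file_name` is a hypothetical helper written for this illustration, not part of PySpark.

```python
import posixpath
from urllib.parse import urlparse

def spark_file_name(path_or_url):
    """Hypothetical helper: the file name under which an added file is looked up."""
    return posixpath.basename(urlparse(path_or_url).path)

url = "https://raw.githubusercontent.com/mafudge/datasets/master/stocks/stocks.csv"
name = spark_file_name(url)
print(name)  # stocks.csv

# With an active session this would be used as:
#   spark.sparkContext.addFile(url)
#   local_path = SparkFiles.get(name)
```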


In [12]:
from pyspark import SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/mafudge/datasets/master/stocks/stocks.csv")
file_on_spark = SparkFiles.get("stocks.csv")


print("Temporary Location: ", SparkFiles.get("stocks.csv"))

print("https://")
spark.read.text(SparkFiles.get("stocks.csv")).show(3)

Temporary Location:  /tmp/spark-08af238d-31a5-4822-9179-b3e718641d77/userFiles-29a1326d-89b9-431f-9502-499f9bfbf500/stocks.csv
https://
+------------+
|       value|
+------------+
|price,symbol|
| 126.82,AAPL|
|3098.12,AMZN|
+------------+
only showing top 3 rows



## Reading Data : Wildcards

You don't have to read a single file. Instead you can read an entire folder of files, or a wildcard match of files.
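
For simple patterns such as `fall*.tsv`, the wildcard behaves much like a shell glob. The sketch below uses Python's `fnmatch` only to preview which names a pattern would select; it is an approximation of the matching rules, not Spark's own implementation, and apart from `fall2015.tsv` the file names are assumed.

```python
from fnmatch import fnmatch

# Assumed file names; only fall2015.tsv is confirmed elsewhere in this notebook
files = ["fall2015.tsv", "fall2016.tsv", "spring2016.tsv", "spring2017.tsv"]

def matching(pattern, names):
    """Return the names that a simple wildcard pattern would select."""
    return [n for n in names if fnmatch(n, pattern)]

fall_only = matching("fall*.tsv", files)
everything = matching("*.tsv", files)
print(fall_only)
print(everything)
```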


In [13]:
print("read just fall")
spark.read.text("file:///home/jovyan/datasets/grades/fall*.tsv").show()

# read all of them
print("read all files")
spark.read.text("file:///home/jovyan/datasets/grades").show()

read just fall
+--------------------+
|               value|
+--------------------+
|2016	Fall	IST346	3	A|
|2016	Fall	CHE111	...|
|2016	Fall	PSY120	...|
|2016	Fall	IST256	3	A|
|2016	Fall	ENG121	...|
|2015	Fall	IST101	1	A|
|2015	Fall	IST195	3	A|
|2015	Fall	IST233	...|
|2015	Fall	SOC101	...|
|2015	Fall	MAT221	3	C|
+--------------------+

read all files
+--------------------+
|               value|
+--------------------+
|2016	Fall	IST346	3	A|
|2016	Fall	CHE111	...|
|2016	Fall	PSY120	...|
|2016	Fall	IST256	3	A|
|2016	Fall	ENG121	...|
|2015	Fall	IST101	1	A|
|2015	Fall	IST195	3	A|
|2015	Fall	IST233	...|
|2015	Fall	SOC101	...|
|2015	Fall	MAT221	3	C|
|2016	Spring	GEO11...|
|2016	Spring	MAT22...|
|2016	Spring	SOC12...|
|2016	Spring	BIO24...|
|2017	Spring	IST46...|
|2017	Spring	MAT41...|
|2017	Spring	SOC42...|
|2017	Spring	ENV20...|
+--------------------+



## Reading Data: File Formats

Spark can read data in a variety of formats. Each format has configurable options.

- `csv` delimited (comma, tab, etc) file
- `text` generic text file, one row per line
- `json` JSON format 
- `parquet` Parquet format (common big-data format with schema included)
- `orc` Another common big-data format with schema.

Each format has options to change behaviors of the file format. Use the `option()` method to set them.

More Information: https://spark.apache.org/docs/latest/sql-data-sources.html


In [14]:
# Handle headers
spark.read \
    .option("header",True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").show(3)

# Infer schema from the columns
spark.read \
    .option("header",True) \
    .option("inferSchema", True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").show(3)

# Reading a schema-based file has fewer options
print("read Parquet file")
spark.read \
    .parquet("file:///home/jovyan/datasets/stocks/stocks.parquet").show(3)


# JSON file format - there are many options for this file format
print("Read JSON file")
spark.read.option("multiline",True).json("/home/jovyan/datasets/json-samples/stocks.json").show(3)

# This is not comma-delimited
print("Read a pipe-separated file")
spark.read \
    .option("sep","|") \
    .option("header",False) \
    .option("inferSchema",True) \
    .csv("file:///home/jovyan/datasets/tweets/tweets.psv").show(3)


+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

read Parquet file
+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

Read JSON file
+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows

Read a pipe-separated file
+-------------------+--------------------+--------------------+--------+--------------------+
|                _c0|                 _c1|                 _c2|     _c3|                 _c4|
+-------------------+--------------------+--------------------+--------+--------------------+
|2845428583999282239|1.4337661612984276E9|Mon Jun 08 08:22:...|rovlight|Why so horr

## Caching DataFrames

The `cache()` function persists the `DataFrame` to temporary storage on the Spark cluster. This can be in memory, on disk, or both, depending on the cluster size and data set size.

This is especially useful when the data source is external to the Spark cluster (a remote database, for example) and the data will be retrieved and transformed multiple times.

`cache()` is itself lazy: it only marks the `DataFrame` for caching. The transformations before it run, and the result is stored, the first time an action such as `show()` or `count()` is executed.

In [15]:
print("s3a://")
stocks = spark.read.option("header",True).option("inferSchema",True).csv(f"s3a://{bucket}/stocks.csv").cache()
stocks.show(3)


s3a://
+-------+------+
|  price|symbol|
+-------+------+
| 126.82|  AAPL|
|3098.12|  AMZN|
| 251.11|    FB|
+-------+------+
only showing top 3 rows



## DataFrame Schemas

Every Spark DataFrame has a schema, or collection of typed columns. The schema is stored in a `StructType`, and the columns are `StructField`s, each consisting of the field name and a specific data type.

- When you `spark.read` data from a format without a schema, such as `csv`, every column gets the most flexible type, `StringType`.
- When you include the `inferSchema` option, an extra pass is made over the data to infer the data type for each column.
- For formats that include a schema, like `parquet` or `orc`, the schema in the file is loaded.


In [None]:
print("Stocks: No Schema")
spark.read \
    .option("header",True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").printSchema()

# Infer schema from the columns
print("Stocks: Infer Schema")
spark.read \
    .option("header",True) \
    .option("inferSchema", True) \
    .csv("file:///home/jovyan/datasets/stocks/stocks.csv").printSchema()


# Comma-delimited file with a header row
print("Customers...")
customers = spark.read \
    .option("sep",",") \
    .option("header",True) \
    .option("inferSchema",True) \
    .csv("file:///home/jovyan/datasets/customers/customers.csv")
    
customers.printSchema()
customers.show(5)

## DataFrame Schemas: Nested Schema

Spark handles file formats with nested schemas, such as `json` very well. This means you can read from Document and Graph databases easily. 

- Embedded columns can be additional `StructType` columns or `ArrayType` for nested lists of values.
- Later we will introduce strategies for dealing with nested schema like this one.

In [None]:
# JSON file with a nested schema
print("Places...")
places = spark.read \
    .json("file:///home/jovyan/datasets/json-samples/google-places.json")
    
places.printSchema()
places.show(5)


## Column Transformations

 - `withColumnRenamed()` – rename a column
 - `toDF()` – rename all columns
 - `withColumn()` – overwrite an existing column or derive a new column
 - `drop()` – remove a column
 - `select()` - column projections


### Setting Column Names

In [16]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t").csv("file:///home/jovyan/datasets/grades/*.tsv")

print("Default Columns Names... yuck")
grades.show(5)

print("Rename first two columns")
grades2 = grades.withColumnRenamed("_c0","Year").withColumnRenamed("_c1","Semester")
grades2.show(5)

print("Rename all the columns")
grades3 = grades.toDF("Year", "Semester", "Course", "Credits", "Grade")
grades3.show(5)




Default Columns Names... yuck
+----+----+------+---+---+
| _c0| _c1|   _c2|_c3|_c4|
+----+----+------+---+---+
|2016|Fall|IST346|  3|  A|
|2016|Fall|CHE111|  4| A-|
|2016|Fall|PSY120|  3| B+|
|2016|Fall|IST256|  3|  A|
|2016|Fall|ENG121|  3| B+|
+----+----+------+---+---+
only showing top 5 rows

Rename first two columns
+----+--------+------+---+---+
|Year|Semester|   _c2|_c3|_c4|
+----+--------+------+---+---+
|2016|    Fall|IST346|  3|  A|
|2016|    Fall|CHE111|  4| A-|
|2016|    Fall|PSY120|  3| B+|
|2016|    Fall|IST256|  3|  A|
|2016|    Fall|ENG121|  3| B+|
+----+--------+------+---+---+
only showing top 5 rows

Rename all the columns
+----+--------+------+-------+-----+
|Year|Semester|Course|Credits|Grade|
+----+--------+------+-------+-----+
|2016|    Fall|IST346|      3|    A|
|2016|    Fall|CHE111|      4|   A-|
|2016|    Fall|PSY120|      3|   B+|
|2016|    Fall|IST256|      3|    A|
|2016|    Fall|ENG121|      3|   B+|
+----+--------+------+-------+-----+
only showing top 

### Derived Columns 

In [None]:
# deriving a column
from pyspark.sql.functions import lit
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")


grades2 = grades.withColumn("Next Year",grades["Year"] + 1) \
    .withColumn("YearString", grades['Year'].cast("String") ) \
    .withColumn("NullCol", lit(None) )
grades2.printSchema()
grades2.show()

# drop() returns a new DataFrame; show() returns None, so keep them separate
grades3 = grades2.drop("NullCol")
grades3.show()


### Column Projections with `select`

In [17]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

# string references
grades.select("Course", "Grade").show(5)

# Object property references
grades.select(grades.Course, grades.Grade).show(5)

# Dataframe references
grades.select(grades["Course"], grades["Grade"]).show(5)


+------+-----+
|Course|Grade|
+------+-----+
|IST346|    A|
|CHE111|   A-|
|PSY120|   B+|
|IST256|    A|
|ENG121|   B+|
+------+-----+
only showing top 5 rows

+------+-----+
|Course|Grade|
+------+-----+
|IST346|    A|
|CHE111|   A-|
|PSY120|   B+|
|IST256|    A|
|ENG121|   B+|
+------+-----+
only showing top 5 rows

+------+-----+
|Course|Grade|
+------+-----+
|IST346|    A|
|CHE111|   A-|
|PSY120|   B+|
|IST256|    A|
|ENG121|   B+|
+------+-----+
only showing top 5 rows



## Row Transformations

- `where()` or `filter()` apply a row based filter
- `distinct()` remove duplicates
- `sort()` or `orderBy()` sort by columns


### Where / Filter

In [None]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

print("A grades")
# string references
grades.filter("Grade = 'A' or Grade='A-'").show()

# Object property references
grades.filter( (grades.Grade == "A") | (grades.Grade == "A-") ).show()

# Dataframe references
grades.filter( (grades["Grade"] == "A") | (grades["Grade"] == "A-") ).show()


### Distinct

In [None]:
terms = grades.select("Year","Semester")
print("Terms")
terms.show()
print("Distinct Terms")
dterms = terms.distinct()
dterms.show()

### Sort / orderBy

In [None]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

print("Sorting")
# string references
grades.sort("Year","Course").show()

# Object property references
grades.sort(grades.Year, grades.Course.desc() ).show()

# Dataframe references
grades.sort( grades["Year"], grades["Course"].desc()).show()


## Aggregate Transformations

- `groupBy()` - perform a column grouping, similar to SQL `GROUP BY`; returns a `GroupedData`
- `agg()` - allows the application of an aggregate function to the `GroupedData`, returns a `DataFrame`
- `alias()` - used to assign a name to a derived column
- Aggregate Functions `count(), avg(), max(), min(), sum()`

In [None]:
from pyspark.sql.functions import col,sum,avg,max,min,count

grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

totalcredits = grades.groupBy().agg( sum("Credits").alias("TotalCredits"), count("*").alias("CourseCount") )
totalcredits.show() 

termcredits = grades.groupBy("Year", "Semester").agg( \
    count("*").alias("CourseCount"), sum("Credits").alias("TotalCredits") \
    ).sort("Year",col("Semester").desc())
termcredits.show()

## Merge Transformations

- `join()` - merge two data frames by matching columns, like a SQL join. Takes a join type string (defaults to "inner"):
    - "inner"  - SQL-like inner join
    - "full" - SQL-like full outer join
    - "left" - SQL-like left join
    - "right" - SQL-like right join
    - "cross" - Cartesian product
- `union()` - merge two data frames by row, duplicates included; use `distinct()` to remove them.

### Joins

In [6]:
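# Assumes grade-points.csv has a header row that includes a Grade column
gradepoints = spark.read.option("header",True).option("inferSchema",True).csv("file:///home/jovyan/datasets/courses/grade-points.csv")

grades.join(gradepoints, grades.Grade == gradepoints.Grade, "inner").show()

# `courses` must already be loaded as a DataFrame with a Course column
grades.join(courses, grades.Course == courses.Course, "full").show()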


NameError: name 'spark' is not defined

### Unions

In [None]:
grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t").csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")
fallgrades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t").csv("file:///home/jovyan/datasets/grades/fall*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")
springgrades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t").csv("file:///home/jovyan/datasets/grades/spring*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

fallgrades.show()
springgrades.show()

fallgrades.union(springgrades).show()

print("Double the courses!")
grades.union(grades).groupBy().count().show()

print("Filter out the duplicates")
grades.union(grades).distinct().groupBy().count().show()

## User-Defined Functions (UDFs)

- User-defined functions allow us to write custom transformations. The process:

1. Create a Python function, decorated for Spark with `@udf(returnType=...)`
2. Apply the function in `select()` or `withColumn()`


In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, BooleanType

@udf(returnType=StringType())
def term(year, semester):
    return f"{year}-{semester}"


@udf(returnType=BooleanType())
def inMajor(course):
    return course.startswith("IST")


grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t") \
    .csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")

grades.withColumn("Term", term( grades.Year, grades.Semester) ).show()

grades.select("Course", inMajor(grades.Course).alias("InMajor")).show()


## Nested Column Transformations

- Sometimes the schema is nested with additional `StructType` or `ArrayType` fields.
- For nested `StructType` you can use the object property accessor to get to the nested columns.
- For nested `ArrayType` you can use the `explode()` function to flatten the nested data. When you explode an array, the parent values repeat for each value in the array.

In [None]:
from pyspark.sql.functions import explode
places = spark.read.json("file:///home/jovyan/datasets/json-samples/google-places.json", multiLine=True)
places.printSchema()
places.show(5)

print("Two places")
places.select('name','geometry.location.lat',places.geometry.location.lng, places['types']).show(2)

print("Same two places, one row per type")
places.select('name','geometry.location.lat',places.geometry.location.lng, explode(places.types).alias("type") ).show(5)

print("Let's see the photo attributions")
places.select('name', explode( places.photos ).alias("col") ) \
    .select("name", explode("col.html_attributions").alias("attributions") ) \
    .show(truncate=False)

## Explain

The `explain()` function displays the execution plan for the Spark transformations.
This is useful for understanding how the DAG processes the transformations.
Transformations are not necessarily processed in the order written; they are processed as optimized by Spark.

Notice in this example that the last transformation filters the Year to 2016. In the physical plan, this is one of the first transformations. (You read the transformation graph from bottom to top.)



In [None]:
from pyspark.sql.functions import col, sum, count  # needed if the aggregate cell above was not run

grades = spark.read.option("header",False).option("inferSchema", True).option("sep", "\t").csv("file:///home/jovyan/datasets/grades/*.tsv").toDF("Year", "Semester", "Course", "Credits", "Grade")
termcredits = grades.groupBy("Year", "Semester").agg( \
    count("*").alias("CourseCount"), sum("Credits").alias("TotalCredits") \
    ).sort("Year",col("Semester").desc())
final = termcredits.filter("Year=2016")
final.explain()


In [25]:
a = grades.filter("year = 2016")\
    .filter(grades.Semester == "Fall")\
    .sort("Course") \
    .select("Course", grades.Credits, grades["Grade"])

In [28]:
a.explain()

== Physical Plan ==
*(2) Sort [Course#489 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(Course#489 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#467]
   +- *(1) Project [_c2#479 AS Course#489, _c3#480 AS Credits#490, _c4#481 AS Grade#491]
      +- *(1) Filter (((isnotnull(_c0#477) AND isnotnull(_c1#478)) AND (_c0#477 = 2016)) AND (_c1#478 = Fall))
         +- FileScan csv [_c0#477,_c1#478,_c2#479,_c3#480,_c4#481] Batched: false, DataFilters: [isnotnull(_c0#477), isnotnull(_c1#478), (_c0#477 = 2016), (_c1#478 = Fall)], Format: CSV, Location: InMemoryFileIndex[file:/home/jovyan/datasets/grades/fall2015.tsv, file:/home/jovyan/datasets/grad..., PartitionFilters: [], PushedFilters: [IsNotNull(_c0), IsNotNull(_c1), EqualTo(_c0,2016), EqualTo(_c1,Fall)], ReadSchema: struct<_c0:int,_c1:string,_c2:string,_c3:int,_c4:string>




In [29]:
b = grades.sort("Course") \
    .filter(grades.Semester == "Fall")\
    .select("Course", grades.Credits, grades["Grade"])\
    .filter("year = 2016")

In [31]:
b.explain()

== Physical Plan ==
*(2) Sort [Course#489 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(Course#489 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#488]
   +- *(1) Project [_c2#479 AS Course#489, _c3#480 AS Credits#490, _c4#481 AS Grade#491]
      +- *(1) Filter (((isnotnull(_c1#478) AND isnotnull(_c0#477)) AND (_c1#478 = Fall)) AND (_c0#477 = 2016))
         +- FileScan csv [_c0#477,_c1#478,_c2#479,_c3#480,_c4#481] Batched: false, DataFilters: [isnotnull(_c1#478), isnotnull(_c0#477), (_c1#478 = Fall), (_c0#477 = 2016)], Format: CSV, Location: InMemoryFileIndex[file:/home/jovyan/datasets/grades/fall2015.tsv, file:/home/jovyan/datasets/grad..., PartitionFilters: [], PushedFilters: [IsNotNull(_c1), IsNotNull(_c0), EqualTo(_c1,Fall), EqualTo(_c0,2016)], ReadSchema: struct<_c0:int,_c1:string,_c2:string,_c3:int,_c4:string>




In [34]:
df

DataFrame[price: string, symbol: string]

In [50]:
from pyspark.sql.types import DoubleType
#df.withColumn("price", df.price.cast(DoubleType())).printSchema().sort(df["price"]).toPandas()

df.sort(df.price.cast("Float").asc()).show()

+-------+------+
|  price|symbol|
+-------+------+
|  45.11|  TWTR|
|   78.0|   NET|
| 126.82|  AAPL|
| 128.39|   IBM|
| 212.55|  MSFT|
| 251.11|    FB|
|  497.0|  NFLX|
|  823.8|  TSLA|
|1725.05|  GOOG|
|3098.12|  AMZN|
+-------+------+

