## Using `spark`, create a DataFrame named `exampleDF` by reading the table you uploaded:

In [1]:
exampleDF = spark.read.csv("pageviews_short.tsv", sep="\t", inferSchema=True, header=True)

In [3]:
exampleDF.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: string (nullable = true)



In [4]:
exampleDF.show(5)

+--------------------+-------+--------+
|           timestamp|   site|requests|
+--------------------+-------+--------+
|2015-04-02 23:48:...|desktop|    2251|
|2015-03-16 00:09:...| mobile|   1595s|
|2015-03-16 00:10:...| mobile|    1544|
|2015-03-16 00:19:...|desktop|    2460|
|2015-03-16 00:38:...|desktop|    2237|
+--------------------+-------+--------+
only showing top 5 rows



In [None]:
# import into mysql database first, then read from there.


## Partitions and tasks

In [4]:
exampleDF.rdd.getNumPartitions()

8

In [5]:
exampleDF.count()

3499999

In [13]:
sc.setLogLevel('ALL')

Check http://flash-deals-c9-ohliumliu.c9users.io:8082

## Transformation: `orderBy()`

In [6]:
# timestamp
exampleDF.orderBy("timestamp").show(10)

+--------------------+-------+--------+
|           timestamp|   site|requests|
+--------------------+-------+--------+
|2015-03-16 00:00:...|desktop|    2343|
|2015-03-16 00:00:...| mobile|    1628|
|2015-03-16 00:00:...|desktop|    2382|
|2015-03-16 00:00:...| mobile|    1636|
|2015-03-16 00:00:...|desktop|    2546|
|2015-03-16 00:00:...| mobile|    1619|
|2015-03-16 00:00:...| mobile|    1776|
|2015-03-16 00:00:...|desktop|    2402|
|2015-03-16 00:00:...|desktop|    2370|
|2015-03-16 00:00:...| mobile|    1716|
+--------------------+-------+--------+
only showing top 10 rows



In [7]:
exampleDF.orderBy("timestamp", exampleDF.site.desc()).show(10)

+--------------------+-------+--------+
|           timestamp|   site|requests|
+--------------------+-------+--------+
|2015-03-16 00:00:...| mobile|    1628|
|2015-03-16 00:00:...|desktop|    2343|
|2015-03-16 00:00:...| mobile|    1636|
|2015-03-16 00:00:...|desktop|    2382|
|2015-03-16 00:00:...| mobile|    1619|
|2015-03-16 00:00:...|desktop|    2546|
|2015-03-16 00:00:...| mobile|    1776|
|2015-03-16 00:00:...|desktop|    2402|
|2015-03-16 00:00:...| mobile|    1716|
|2015-03-16 00:00:...|desktop|    2370|
+--------------------+-------+--------+
only showing top 10 rows



In [17]:
exampleDF.orderBy("timestamp", exampleDF.site.desc())

DataFrame[timestamp: timestamp, site: string, requests: string]

## cache exampleDF

In [8]:
exampleDF.cache()

DataFrame[timestamp: timestamp, site: string, requests: string]

In [None]:
exampleDF.count()

3499999

In [None]:
exampleDF.count()

## mobile vs desktop

In [5]:
mobile_count = exampleDF.filter(exampleDF.site == 'mobile').count()

In [6]:
mobile_count

1749976

In [7]:
desktop_count = exampleDF.filter("site == 'desktop'").count()

In [8]:
desktop_count

1750023

_Although a filter is added, this job still consists of two stages, the same as a plain `count()`. Check the SQL tab in the Spark UI to confirm this._

## Shuffle partitions

In [9]:
sqlContext.getConf("spark.sql.shuffle.partitions")

u'200'

In [10]:
sqlContext.setConf("spark.sql.shuffle.partitions", 8)

In [11]:
sqlContext.getConf("spark.sql.shuffle.partitions")

u'8'

In [12]:
exampleDF.orderBy("timestamp", exampleDF.site.desc()).count()

3499999

In [13]:
sqlContext.setConf("spark.sql.shuffle.partitions", 4)

In [14]:
exampleDF.orderBy("timestamp", exampleDF.site.desc()).count()

3499999

## persistence

In [18]:
sqlContext.setConf("spark.sql.shuffle.partitions", 4)

In [15]:
exampleDFordered = exampleDF.orderBy("timestamp", exampleDF.site.desc()).cache()

In [16]:
exampleDFordered.count()

3499999

_Check the Storage tab. There are 4 partitions._

In [17]:
exampleDFordered.unpersist()

DataFrame[timestamp: timestamp, site: string, requests: string]

_Check the Storage tab. The DataFrame is no longer listed._

In [2]:
sqlContext.setConf("spark.sql.shuffle.partitions", 8)

In [3]:
exampleDFordered = exampleDF.orderBy("timestamp", exampleDF.site.desc()).cache()

In [4]:
exampleDFordered.count()

3499999

_Check the Storage tab. With `spark.sql.shuffle.partitions` set to 8, the cached DataFrame should now show 8 partitions._

In [12]:
exampleDFordered.createOrReplaceTempView("example_df_ordered")
sqlContext.cacheTable("example_df_ordered")

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:36109)
Traceback (most recent call last):
  File "/home/ubuntu/workspace/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 963, in start
    self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused


Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:36109)

In [11]:
example_df_or

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:36109)
Traceback (most recent call last):
  File "/home/ubuntu/workspace/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 963, in start
    self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused


Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:36109)

_Check the name of the in-memory table?_