# Spark Dataframes

Dataframes, built on top of RDDs, are column-organized tables of data in the style of Pandas or R.  This example, from [Databricks](https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html), looks at how they are used.

Spark Dataframes allow SQL-like querying and calculations.  This is more than a familiar interface: because queries are expressed declaratively, mature SQL query optimizers can be brought to bear to restructure the movement of data, bringing a powerful runtime approach to performance.

In [1]:
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
import findspark

In [5]:
findspark.init()
from pyspark import SparkContext, SQLContext
sc = SparkContext("local[4]")
sqlc = SQLContext(sc)

With the context set up, we create a dataframe holding a single column of integers:

In [8]:
from pyspark.sql.functions import rand, randn
df = sqlc.range(0, 100)
df.show(10)

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
only showing top 10 rows



In [15]:
df = df.select("id", 
                  rand(seed=10).alias("uniform"), 
                  randn(seed=27).alias("normal"))
df.show(10)

+---+-------------------+-------------------+
| id|            uniform|             normal|
+---+-------------------+-------------------+
|  0|0.41371264720975787| 0.5888539012978773|
|  1| 0.7311719281896606| 0.8645537008427937|
|  2| 0.9031701155118229| 1.2524569684217643|
|  3|0.09430205113458567| -2.573636861034734|
|  4|0.38340505276222947| 0.5469737451926588|
|  5| 0.5569246135523511|0.17431283601478723|
|  6| 0.4977441406613893|-0.7040284633147095|
|  7| 0.2076666106201438| 0.4637547571868822|
|  8| 0.9571919406508957|  0.920722532496133|
|  9| 0.7429395461204413|-1.4353459012380192|
+---+-------------------+-------------------+
only showing top 10 rows



In [16]:
df.describe().show()

+-------+------------------+--------------------+--------------------+
|summary|                id|             uniform|              normal|
+-------+------------------+--------------------+--------------------+
|  count|               100|                 100|                 100|
|   mean|              49.5|    0.49686601567822|-0.01216928004517...|
| stddev|29.011491975882016| 0.28826347846677686|  1.0617174284468838|
|    min|                 0|0.002510505496357...| -2.6620895295953004|
|    max|                99|  0.9958062482976284|   2.750429557170309|
+-------+------------------+--------------------+--------------------+
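To make the summary rows concrete, here is a small plain-Python sketch (no Spark involved) that recomputes them for the `id` column.  It suggests that the `stddev` row is the sample standard deviation, which divides by n - 1; that reading is inferred from the value shown above.

```python
import statistics

# Same values as the id column: the integers 0..99.
ids = list(range(100))

summary = {
    "count": len(ids),
    "mean": statistics.mean(ids),
    "stddev": statistics.stdev(ids),  # sample standard deviation (divides by n - 1)
    "min": min(ids),
    "max": max(ids),
}

# For comparison: the population standard deviation (divides by n).
population_stddev = statistics.pstdev(ids)

for name, value in summary.items():
    print(name, value)
print("population stddev", population_stddev)
```

The sample value agrees with the `stddev` row for `id` above, while the population value is slightly smaller.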



In [19]:
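# Import only what is needed; note that min and max shadow Python's built-ins.
# cos is used in a later cell.
from pyspark.sql.functions import cos, max, mean, min
df.select([mean('uniform'), min('uniform'), max('uniform')]).show(10)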

+----------------+--------------------+------------------+
|    avg(uniform)|        min(uniform)|      max(uniform)|
+----------------+--------------------+------------------+
|0.49686601567822|0.002510505496357...|0.9958062482976284|
+----------------+--------------------+------------------+



In [18]:
df.stat.cov('uniform', 'normal')

0.0025093879241165065
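`df.stat.cov` returns the sample covariance of the two columns.  The plain-Python sketch below spells out that formula; `sample_covariance` is a hypothetical helper written for this illustration, and the random values come from Python's own generator, so they will not reproduce the number above.

```python
import math
import random


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both columns must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two rows are needed")
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return math.fsum(products) / (n - 1)


# Stand-ins for the two columns: independent uniform and normal draws.
rng = random.Random(0)
uniform = [rng.random() for _ in range(100)]
normal = [rng.gauss(0.0, 1.0) for _ in range(100)]

print(sample_covariance(uniform, normal))       # close to zero for independent columns
print(sample_covariance([1, 2, 3], [2, 4, 6]))  # 2.0
```

Because the two columns are generated independently, a covariance near zero, as in the output above, is what one would expect.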

In [24]:
# Add a column holding cos(2*pi*normal), with pi approximated as 3.14159.
df = df.select('id', 'uniform', 'normal', cos(df.normal*2*3.14159).alias('cos_normal'))
df.show(10)

+---+-------------------+-------------------+--------------------+
| id|            uniform|             normal|          cos_normal|
+---+-------------------+-------------------+--------------------+
|  0|0.41371264720975787| 0.5888539012978773| -0.8481662251719577|
|  1| 0.7311719281896606| 0.8645537008427937|  0.6592023710354442|
|  2| 0.9031701155118229| 1.2524569684217643|-0.01543032849405...|
|  3|0.09430205113458567| -2.573636861034734|  -0.894868255256957|
|  4|0.38340505276222947| 0.5469737451926588| -0.9567608933620207|
|  5| 0.5569246135523511|0.17431283601478723|   0.457834069473377|
|  6| 0.4977441406613893|-0.7040284633147095| -0.2848514172214892|
|  7| 0.2076666106201438| 0.4637547571868822| -0.9741795801191937|
|  8| 0.9571919406508957|  0.920722532496133|  0.8784823756819166|
|  9| 0.7429395461204413|-1.4353459012380192| -0.9186125933094329|
+---+-------------------+-------------------+--------------------+
only showing top 10 rows
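The column expression is evaluated row by row.  As a quick check, the plain-Python sketch below redoes the calculation for the first row's `normal` value (copied from the table above) and compares the hard-coded 3.14159 with `math.pi`.

```python
import math

# First-row value of the normal column, copied from the table above.
normal_value = 0.5888539012978773

# Same arithmetic as the column expression, with the truncated constant.
approx = math.cos(normal_value * 2 * 3.14159)

# The same expression using the full-precision constant.
exact = math.cos(normal_value * 2 * math.pi)

print(approx)
print(exact)
print(math.fabs(approx - exact))
```

The truncated constant reproduces the `cos_normal` value shown for row 0, and switching to `math.pi` shifts the result only around the sixth decimal place.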



In [27]:
sc.stop()