### Spark and Cassandra

This tutorial shows how to work with Cassandra and Apache Spark together. Combining a distributed database (Cassandra) with a distributed processing framework (Spark) lets us work comfortably with data that is larger than what a single machine, even a powerful one, can handle.

#### Activate Python 3.5
The Cassandra Spark connector does not work with the newest version of Apache Spark, and this setup also does not work with Python 3.6, so as a workaround we create and activate a Python 3.5 environment:
```bash
conda create -n py35 python=3.5 anaconda
source activate py35
```

#### Initialize Spark and Cassandra
```bash
pyspark --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 --conf spark.cassandra.connection.host=127.0.0.1
```
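
The same settings can also be applied from a standalone Python script instead of the `pyspark` shell. The sketch below is one possible way to do this, assuming `pyspark` is installed in the active environment. The helper names `spark_settings` and `create_session` and the application name are illustrative and not part of the original setup. `spark.jars.packages` is the Spark property that corresponds to the `--packages` flag; it only takes effect if it is set before the Spark JVM starts.

```python
# Values copied from the pyspark command line above.
CONNECTOR_PACKAGE = "datastax:spark-cassandra-connector:2.0.1-s_2.11"
CASSANDRA_HOST = "127.0.0.1"


def spark_settings(package=CONNECTOR_PACKAGE, host=CASSANDRA_HOST):
    """Return the Spark properties matching the command-line flags (illustrative helper)."""
    return {
        "spark.jars.packages": package,           # same role as --packages
        "spark.cassandra.connection.host": host,  # same role as --conf ...host=...
    }


def create_session(app_name="spark-cassandra-tutorial"):
    """Build (or reuse) a SparkSession configured for Cassandra."""
    # Imported lazily so the settings helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_settings().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `spark = create_session()` at the top of a script would then give a session comparable to the `spark` object that the shell provides.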
References:
* https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
* https://www.youtube.com/watch?v=GjNXK1SGDLw
* https://www.youtube.com/watch?v=6x-917-BIrM&t=548s
* https://github.com/datastax/spark-cassandra-connector
* https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md
* https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
* https://spark-packages.org/package/datastax/spark-cassandra-connector
* https://spark-packages.org/package/TargetHolding/pyspark-cassandra
* https://github.com/TargetHolding/pyspark-cassandra
* https://stackoverflow.com/questions/34882097/cannot-connect-to-cassandra-from-spark
* https://github.com/anguenot/pyspark-cassandra#examples


### Create a DataFrame from Cassandra

In [1]:
df = spark.read.format("org.apache.spark.sql.cassandra").options(table="tb_drive", keyspace="mydb").load()

# Print the DataFrame schema (the structure of the distributed table)
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- acc: float (nullable = true)
 |-- image: binary (nullable = true)
 |-- wheel_angle: float (nullable = true)



### Get dataframe size

In [2]:
# Get the size of the dataframe
print('DataFrame size:',df.count())

DataFrame size: 1040


### Convert the DataFrame to pandas for nicer display

In [5]:
df.toPandas().head(5)
# Alternatively, fetch 3 rows as a list of Row objects
#df.take(3)

| | id | acc | image | wheel_angle |
|---|---|---|---|---|
| 0 | 7e1ea253-5672-11e7-9708-989096d72294 | 1.0 | [137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13,... | 0.138223 |
| 1 | 8da0443a-5672-11e7-9708-989096d72294 | 1.0 | [137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13,... | 0.235052 |
| 2 | 87a6c739-5672-11e7-9708-989096d72294 | 1.0 | [137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13,... | -0.080126 |
| 3 | 7cea6243-5672-11e7-9708-989096d72294 | 0.690148 | [137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13,... | 0.080126 |
| 4 | 8da04448-5672-11e7-9708-989096d72294 | 1.0 | [137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13,... | -0.099492 |
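
The `image` values shown above all begin with the bytes 137, 80, 78, 71, 13, 10, 26, 10, which is the standard eight-byte PNG file signature, so the column appears to hold PNG-encoded images. As a quick sanity check, a small helper can test a value for that signature. This is a sketch that was not part of the original notebook, and the name `looks_like_png` is hypothetical.

```python
# The eight-byte PNG file signature, i.e. b"\x89PNG\r\n\x1a\n".
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])


def looks_like_png(data):
    """Return True if the byte sequence starts with the PNG signature."""
    return bytes(data[:8]) == PNG_SIGNATURE


# The first twelve bytes of the image column exactly as displayed above.
sample = bytearray([137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13])
print(looks_like_png(sample))
```

The helper accepts `bytes`, `bytearray`, or a list of integers, so it can be applied directly to a cell taken from the pandas frame.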


### Query the DataFrame (using the Spark DataFrame API)

In [13]:
# Count the rows whose wheel angle is not zero
df.filter(df.wheel_angle != 0).count()

733

In [7]:
# How many times we did not accelerate fully...
df.filter(df.acc != 1).count()

512
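
Conditions can also be combined in a single `filter` call. The sketch below, which was not part of the original notebook, counts the rows where the car was steering and not accelerating fully at the same time; the function name is illustrative. Each comparison needs its own parentheses because `&` binds more tightly than `!=` in Python.

```python
def count_steering_without_full_throttle(df):
    """Count rows with a non-zero wheel angle and less than full acceleration."""
    # Use & (and), | (or), ~ (not) to combine column conditions, not Python's and/or/not.
    condition = (df.wheel_angle != 0) & (df.acc != 1)
    return df.filter(condition).count()
```

It would be called as `count_steering_without_full_throttle(df)` with the DataFrame loaded earlier.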

### Query the DataFrame (using the Spark SQL API)
We can register a DataFrame as a temporary table on the cluster, which lets us query it with the full SQL language.

In [14]:
# createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0
df.createOrReplaceTempView("autodrive")
df_filt = spark.sql("SELECT wheel_angle, acc FROM autodrive WHERE wheel_angle BETWEEN 0.1 AND 1.0")
print('Number of instances:',df_filt.count())
df_filt.toPandas().head()

Number of instances: 214


| | wheel_angle | acc |
|---|---|---|
| 0 | 0.138223 | 1.0 |
| 1 | 0.235052 | 1.0 |
| 2 | 0.206003 | 0.544904 |
| 3 | 0.109174 | 0.680465 |
| 4 | 0.196321 | 0.0 |
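
The same range query can be written with the DataFrame API, without registering a temporary table. This is a minimal sketch rather than part of the original notebook; the function name `select_angle_range` and its default bounds are illustrative. `Column.between` is inclusive on both ends, like SQL `BETWEEN`.

```python
def select_angle_range(df, low=0.1, high=1.0):
    """Return wheel_angle and acc for rows with low <= wheel_angle <= high."""
    in_range = df.wheel_angle.between(low, high)  # inclusive on both bounds
    return df.filter(in_range).select("wheel_angle", "acc")
```

`select_angle_range(df).count()` should then report the same number of rows as the SQL version above.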
