# DataStax Spark Cassandra Connector

## PySpark DataFrames

Cassandra can be used with Apache Spark by including the *data source* provided by DataStax. The best way to do this is to pull the connector from the Spark Packages site; the command to launch pyspark then looks like this:

```
./bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 --conf spark.cassandra.connection.host=127.0.0.1
```
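
For a standalone script rather than the interactive shell, the same two settings can probably be passed through the session builder. The sketch below is a minimal illustration, not the connector's prescribed setup: `connector_conf` and `build_session` are hypothetical helper names, and it assumes no Spark session is already running in the process, since `spark.jars.packages` is only read when the JVM starts.

```python
# Connector coordinates and host copied from the pyspark command above.
CONNECTOR_PACKAGE = "com.datastax.spark:spark-cassandra-connector_2.11:2.3.2"


def connector_conf(host="127.0.0.1"):
    """Spark properties equivalent to the --packages and --conf flags."""
    return {
        "spark.jars.packages": CONNECTOR_PACKAGE,
        "spark.cassandra.connection.host": host,
    }


def build_session(app_name, host="127.0.0.1"):
    """Create a SparkSession with the connector configured (hypothetical helper)."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in connector_conf(host).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With PySpark installed and a Cassandra node reachable, `spark = build_session("DataStax Spark CassandraConnector")` would stand in for the shell-provided session used in the cells below.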

## Loading a DataFrame in Python

### Example: Loading a Cassandra Table as a PySpark DataFrame

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("DataStax Spark CassandraConnector").getOrCreate()

In [3]:
home_activity_df = spark.read.format("org.apache.spark.sql.cassandra").options(table="activity", keyspace="home_security").load()

In [4]:
home_activity_df.show()

+---------+-------------------+---------+----------------+
|  home_id|           datetime|code_used|           event|
+---------+-------------------+---------+----------------+
|H01033638|2014-05-21 01:55:12|     2121|   alarm set off|
|H01033638|2014-05-21 01:33:43|     2121|alarm turned off|
|H01033638|2014-05-21 01:32:34|     2121|       alarm set|
|H01474777|2014-05-20 23:32:04|     5599|       alarm set|
|H02257222|2014-05-20 23:29:47|     1566|       alarm set|
+---------+-------------------+---------+----------------+
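
When several tables are read the same way, the load call above can be wrapped in a small function. This is only a sketch: `read_cassandra_table` is a hypothetical name, and `spark` is assumed to be a session that already has the connector on its classpath.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_cassandra_table(spark, table, keyspace):
    """Load one Cassandra table as a DataFrame using the connector's data source."""
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(table=table, keyspace=keyspace)
        .load()
    )
```

Calling `read_cassandra_table(spark, "activity", "home_security")` mirrors cell 3.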



## Manipulating a Cassandra Table's Data with DataFrames

In [5]:
home_activity_df.select("home_id", "datetime").show()

+---------+-------------------+
|  home_id|           datetime|
+---------+-------------------+
|H01033638|2014-05-21 01:55:12|
|H01033638|2014-05-21 01:33:43|
|H01033638|2014-05-21 01:32:34|
|H01474777|2014-05-20 23:32:04|
|H02257222|2014-05-20 23:29:47|
+---------+-------------------+



In [6]:
home_activity_df.where(home_activity_df["code_used"] == 2121).show()

+---------+-------------------+---------+----------------+
|  home_id|           datetime|code_used|           event|
+---------+-------------------+---------+----------------+
|H01033638|2014-05-21 01:55:12|     2121|   alarm set off|
|H01033638|2014-05-21 01:33:43|     2121|alarm turned off|
|H01033638|2014-05-21 01:32:34|     2121|       alarm set|
+---------+-------------------+---------+----------------+



In [7]:
home_activity_df.createOrReplaceTempView("tab_activity")

In [8]:
view_home_activity_df = spark.sql("SELECT * FROM tab_activity WHERE code_used = 2121")

In [9]:
view_home_activity_df.show()

+---------+-------------------+---------+----------------+
|  home_id|           datetime|code_used|           event|
+---------+-------------------+---------+----------------+
|H01033638|2014-05-21 01:55:12|     2121|   alarm set off|
|H01033638|2014-05-21 01:33:43|     2121|alarm turned off|
|H01033638|2014-05-21 01:32:34|     2121|       alarm set|
+---------+-------------------+---------+----------------+



In [10]:
view_home_activity_df = spark.sql("SELECT * FROM tab_activity WHERE event LIKE '%off%'")

In [11]:
view_home_activity_df.show()

+---------+-------------------+---------+----------------+
|  home_id|           datetime|code_used|           event|
+---------+-------------------+---------+----------------+
|H01033638|2014-05-21 01:55:12|     2121|   alarm set off|
|H01033638|2014-05-21 01:33:43|     2121|alarm turned off|
+---------+-------------------+---------+----------------+
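
The same `LIKE` filter can be written against the DataFrame directly, without registering a temporary view. A minimal sketch, assuming a DataFrame with an `event` column like the one above; `like_contains` and `events_containing` are hypothetical helper names, and the fragment is not escaped, so `%` or `_` inside it would still act as wildcards.

```python
def like_contains(fragment):
    """Build a SQL LIKE pattern that matches any value containing `fragment`."""
    return "%" + fragment + "%"


def events_containing(df, fragment):
    """Keep rows whose `event` column matches LIKE '%fragment%'."""
    return df.where(df["event"].like(like_contains(fragment)))
```

`events_containing(home_activity_df, "off").show()` should then select the same rows as the SQL query in cell 10.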



In [12]:
home_activity_df.show()

+---------+-------------------+---------+----------------+
|  home_id|           datetime|code_used|           event|
+---------+-------------------+---------+----------------+
|H01033638|2014-05-21 01:55:12|     2121|   alarm set off|
|H01033638|2014-05-21 01:33:43|     2121|alarm turned off|
|H01033638|2014-05-21 01:32:34|     2121|       alarm set|
|H01474777|2014-05-20 23:32:04|     5599|       alarm set|
|H02257222|2014-05-20 23:29:47|     1566|       alarm set|
+---------+-------------------+---------+----------------+



## Saving a DataFrame into a Cassandra Table

In [13]:
view_home_activity_df.write.format("org.apache.spark.sql.cassandra").mode("append").options(table="activity2", keyspace="home_security").save()

In [14]:
view_home_activity_df.show()

+---------+-------------------+---------+----------------+
|  home_id|           datetime|code_used|           event|
+---------+-------------------+---------+----------------+
|H01033638|2014-05-21 01:55:12|     2121|   alarm set off|
|H01033638|2014-05-21 01:33:43|     2121|alarm turned off|
+---------+-------------------+---------+----------------+
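
The write in cell 13 can be wrapped in the same way as a read. The sketch below is an illustration under assumptions: `save_to_cassandra` is a hypothetical name, and the target table (`activity2` here) is assumed to exist already in the keyspace with columns matching the DataFrame, since this call is not expected to create it.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def save_to_cassandra(df, table, keyspace, mode="append"):
    """Write a DataFrame to an existing Cassandra table through the connector."""
    (
        df.write.format(CASSANDRA_FORMAT)
        .mode(mode)
        .options(table=table, keyspace=keyspace)
        .save()
    )
```

`save_to_cassandra(view_home_activity_df, "activity2", "home_security")` repeats cell 13.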



In [15]:
view_home_activity_df.printSchema()

root
 |-- home_id: string (nullable = true)
 |-- datetime: timestamp (nullable = true)
 |-- code_used: string (nullable = true)
 |-- event: string (nullable = true)
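
The schema shows that `code_used` is stored as a string, while the earlier filters compare it with the number `2121` and rely on Spark converting the values implicitly. If numeric behaviour is wanted explicitly, the column can be cast first. A minimal sketch with a hypothetical helper name, assuming every code is a valid integer (values that are not would become null after the cast):

```python
def with_int_code(df, column="code_used"):
    """Return a copy of `df` with `column` cast from string to integer."""
    return df.withColumn(column, df[column].cast("int"))
```

After `with_int_code(view_home_activity_df).printSchema()`, `code_used` should be reported as `integer` rather than `string`.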

