## Tutorial Dataframe
In this tutorial we will learn how to load a CSV file into the cluster and run some analysis on it.

### Resilient Distributed Dataset
The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of data elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.

### DataFrames
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing large data sets even easier, a DataFrame lets developers impose a structure onto a distributed collection of data, which allows a higher level of abstraction; it also provides a domain-specific language API to manipulate your distributed data.

### When should I use DataFrames or Datasets?
* If you want rich semantics, high-level abstractions, and domain specific APIs, use DataFrame or Dataset.
* If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.
* If you want a higher degree of type safety at compile time, want typed JVM objects, want to take advantage of Catalyst optimization, and want to benefit from Tungsten’s efficient code generation, use Dataset.
* If you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
* If you are an R user, use DataFrames.
* If you are a Python user, use DataFrames and fall back to RDDs if you need more control.

#### References
* https://stackoverflow.com/questions/29936156/get-csv-to-spark-dataframe
* https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
* https://spark.apache.org/docs/latest/sql-programming-guide.html
* https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
* http://discourse.snowplowanalytics.com/t/running-sql-queries-on-dataframes-in-spark-sql-updated/119
* https://github.com/mahmoudparsian/data-algorithms-book
* https://www.datacamp.com/community/tutorials/apache-spark-python#gs.1vfxjmY
* https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial

In [1]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

### Read the Titanic dataset from a CSV file with pandas

In [2]:
pandas_df = pd.read_csv('../data/titanic/train.csv')
pandas_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Read the Titanic dataset on the cluster (distributed DataFrame)
With the DataFrame we can extract statistics from, and filter, data that is distributed across the cluster.

In [3]:
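# Create a SQL context from the existing SparkContext (sc)
sqlContext = SQLContext(sc)

# Create a DataFrame from the CSV file; think of it as a table built from the file.
# No header option is passed, so the columns get default names (_c0, _c1, ...)
# and the header line is read as an ordinary data row.
df = sqlContext.read.csv('../data/titanic/train.csv')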
df.show(5)

# Print structure
df.printSchema()

+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
|        _c0|     _c1|   _c2|                 _c3|   _c4|_c5|  _c6|  _c7|             _c8|    _c9| _c10|    _c11|
+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
|          1|       0|     3|Braund, Mr. Owen ...|  male| 22|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female| 38|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female| 26|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female| 35|    1|    0|          113803|   53.1| C123|       S|
+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+

### Query the data
Notice that we query the data using only the DataFrame API, without writing any SQL.

In [4]:
df_male = df.filter(df._c4 == 'male')
# Convert to pandas to display the first rows as a table
df_male.toPandas().head()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
2,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
3,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
4,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


### Doing SQL queries
Sometimes it is more convenient to express a computation in SQL and let the Spark SQL engine run it.

In [5]:
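# Register the DataFrame as a temporary view so it can be queried by name
# (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
df.createOrReplaceTempView("titanic")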
sqlContext.sql("SELECT COUNT(*) AS Total FROM titanic").toPandas().head()

Unnamed: 0,Total
0,892
