# Spark Dataframe Basics
* DataFrames act as powerful versions of tables with rows and columns that can easily handle large datasets. 
* Advantages of dataframes:
    * Simpler syntax
    * SQL can be directly used in the dataframe
    * Operations are automatically distributed across the cluster (DataFrames are built on top of RDDs)
* Spark DataFrames adopt and expand many concepts from pandas and R.
___

## Create a Dataframe

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Starting a Spark session (requires some time to process)
spark = SparkSession.builder.appName("Basics").getOrCreate()

* Dataset access: in practice, Spark usually connects to a large distributed data source such as HDFS. Here a tiny local JSON file is used instead.

In [4]:
# Example (tiny) dataset
# df = spark.read.format('json').load('PySparkDataSets/people.json')
df = spark.read.json('PySparkDataSets/people.json')

In [8]:
# Show data
df.show()

+----+--------------+
| age|          name|
+----+--------------+
|null|ChickenTonight|
|  30|         Kebab|
|  19|         Minji|
+----+--------------+



In [9]:
# Show datatype
df.dtypes

[('age', 'bigint'), ('name', 'string')]

In [5]:
# Display the datatype schema
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [5]:
# Show column names
df.columns

['age', 'name']

In [7]:
# Show statistical summary of the dataframe
df.describe().show()

+-------+------------------+--------------+
|summary|               age|          name|
+-------+------------------+--------------+
|  count|                 2|             3|
|   mean|              24.5|          null|
| stddev|7.7781745930520225|          null|
|    min|                19|ChickenTonight|
|    max|                30|         Minji|
+-------+------------------+--------------+



___
## Set Schema Type
* To set the schema explicitly instead of letting Spark infer it, a `StructField` must be specified for each column.

In [11]:
# Import only the types that are needed (avoids a wildcard import)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

Each `StructField` takes three arguments:
* `name`: string, name of the field.
* `dataType`: the `DataType` of the field.
* `nullable`: boolean, whether the field can be null (None) or not.

In [12]:
dataschema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]

In [13]:
finalstructure = StructType(fields=dataschema)

In [14]:
df = spark.read.json('PySparkDataSets/people.json', schema=finalstructure)
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



___
## Obtain data

In [14]:
# Show dataframe column
df['age']

Column<'age'>

In [15]:
# Show the column object type
type(df['age'])

pyspark.sql.column.Column

In [17]:
# Select a single column
df.select('age')

DataFrame[age: int]

In [18]:
# Show the dataframe type
type(df.select('age'))

pyspark.sql.dataframe.DataFrame

In [19]:
# Select a single column and display it
df.select('age').show()

+----+
| age|
+----+
|null|
|  30|
|  19|
+----+



In [20]:
# Check the first two rows of the dataframe
df.head(2)

[Row(age=None, name='ChickenTonight'), Row(age=30, name='Kebab')]

In [21]:
# Show the type of the first of the first two rows
type(df.head(2)[0])

pyspark.sql.types.Row

In [19]:
# Select multiple columns
df.select(['age','name'])

DataFrame[age: int, name: string]

In [22]:
# Display the selected columns
df.select(['age','name']).show()

+----+--------------+
| age|          name|
+----+--------------+
|null|ChickenTonight|
|  30|         Kebab|
|  19|         Minji|
+----+--------------+



___
## Create new columns

In [26]:
# Add a new column with a simple copy + arithmetic operation
df.withColumn('TripleAge',df['age']*3).show()

+----+--------------+---------+
| age|          name|TripleAge|
+----+--------------+---------+
|null|ChickenTonight|     null|
|  30|         Kebab|       90|
|  19|         Minji|       57|
+----+--------------+---------+



In [24]:
# withColumn returns a new dataframe; the original is unchanged
df.show()

+----+--------------+
| age|          name|
+----+--------------+
|null|ChickenTonight|
|  30|         Kebab|
|  19|         Minji|
+----+--------------+



In [25]:
# To rename columns
df.withColumnRenamed('age','A').show()

+----+--------------+
|   A|          name|
+----+--------------+
|null|ChickenTonight|
|  30|         Kebab|
|  19|         Minji|
+----+--------------+



In [28]:
# Add a new column by multiplying an existing column by a float
df.withColumn('Age x 2.5',df['age']*2.5).show()

+----+--------------+---------+
| age|          name|Age x 2.5|
+----+--------------+---------+
|null|ChickenTonight|     null|
|  30|         Kebab|     75.0|
|  19|         Minji|     47.5|
+----+--------------+---------+



In [29]:
# Add a new column with a simple copy + arithmetic operation
df.withColumn('Age + 1',df['age']+1).show()

+----+--------------+-------+
| age|          name|Age + 1|
+----+--------------+-------+
|null|ChickenTonight|   null|
|  30|         Kebab|     31|
|  19|         Minji|     20|
+----+--------------+-------+



In [30]:
# Add a new column by dividing an existing column (the result is a double)
df.withColumn('Half Age',df['age']/2).show()

+----+--------------+--------+
| age|          name|Half Age|
+----+--------------+--------+
|null|ChickenTonight|    null|
|  30|         Kebab|    15.0|
|  19|         Minji|     9.5|
+----+--------------+--------+



In [31]:
# Without show(), the result displays the new dataframe's column names and types
df.withColumn('half_age',df['age']/2)

DataFrame[age: int, name: string, half_age: double]

___
## Optional: Queries with SQL Syntax
A dataframe must first be registered as a temporary view before SQL queries can be run directly against it.

In [32]:
# Register dataframe as a SQL temporary view
df.createOrReplaceTempView("people")

In [33]:
# A SQL query returns a new dataframe
sql = spark.sql("SELECT * FROM people")
sql

DataFrame[age: int, name: string]

In [34]:
sql.show()

+----+--------------+
| age|          name|
+----+--------------+
|null|ChickenTonight|
|  30|         Kebab|
|  19|         Minji|
+----+--------------+



In [35]:
# A direct sql query
spark.sql("SELECT * FROM people WHERE age=30").show()

+---+-----+
|age| name|
+---+-----+
| 30|Kebab|
+---+-----+



___