# Spark DataFrames Basics

#### In Apache Spark, a DataFrame is a distributed collection of rows organized under named columns. In simple terms, it is like a table in a relational database or an Excel sheet with column headers. It shares several characteristics with an RDD:

1. Immutable: once created, a DataFrame or RDD cannot be changed. Applying a transformation returns a new DataFrame or RDD instead.
2. Lazily evaluated: transformations are only recorded, and nothing is executed until an action is performed.
3. Distributed: both RDDs and DataFrames are partitioned across the nodes of the cluster.


#### Advantages

1. DataFrames are designed for processing large collections of structured or semi-structured data.
2. Observations in a Spark DataFrame are organised under named columns, which lets Spark know the schema of the DataFrame. Spark uses this schema to optimize the execution plan of queries on it.
3. DataFrames can scale to very large datasets, up to petabytes of data.
4. DataFrames support a wide range of data formats and sources.
5. The DataFrame API is available in several languages: Python, R, Scala, and Java.


In [14]:
from pyspark import SparkContext

In [15]:
from pyspark.sql import SparkSession

In [27]:
spark1 = SparkSession.builder.appName('Basics').enableHiveSupport().getOrCreate()

#### SparkSession is the unified entry point to Spark. It is used much like SparkContext and essentially combines SQLContext and HiveContext: all the APIs available on those contexts are also available on the Spark session. Internally, a Spark session holds a SparkContext that performs the actual computation.

### DataFrame creation
A DataFrame in Apache Spark can be created in multiple ways:

1. By loading data from a file in a supported format, for example JSON or CSV.
2. By converting an existing RDD.
3. By programmatically specifying a schema.


#### Reading from a JSON file

In [20]:
df_json = spark1.read.json('people.json')
df_json

DataFrame[age: bigint, name: string]

In [21]:
df_json.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



#### From an RDD

In [38]:
from pyspark.sql import Row
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
sc = spark1.sparkContext
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df_rdd = spark1.createDataFrame(people)
df_rdd

DataFrame[age: bigint, name: string]

In [39]:
df_rdd.show()

+---+--------+
|age|    name|
+---+--------+
| 25|   Ankit|
| 22|Jalfaizy|
| 20| saurabh|
| 26|    Bala|
+---+--------+



#### From CSV

In [44]:
df_csv = spark1.read.csv("sales.csv")

In [45]:
df_csv

DataFrame[_c0: string, _c1: string, _c2: string]

In [46]:
df_csv.show()

+-------+-------+-----+
|    _c0|    _c1|  _c2|
+-------+-------+-----+
|Company| Person|Sales|
|   GOOG|    Sam|  200|
|   GOOG|Charlie|  120|
|   GOOG|  Frank|  340|
|   MSFT|   Tina|  600|
|   MSFT|    Amy|  124|
|   MSFT|Vanessa|  243|
|     FB|   Carl|  870|
|     FB|  Sarah|  350|
|   APPL|   John|  250|
|   APPL|  Linda|  130|
|   APPL|   Mike|  750|
|   APPL|  Chris|  350|
+-------+-------+-----+



### Defining your own schema

#### Import the data types and structure types needed to build the schema

In [53]:
from pyspark.sql.types import StructField, IntegerType, StringType, StructType

#### Define the schema by supplying a name, a data type, and a nullable flag for each field you will be importing

In [54]:
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]

#### Now create a StructType with these fields

In [55]:
final_struc = StructType(fields=data_schema)

In [57]:
df = spark1.read.json('people.json',schema=final_struc)
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [59]:
df.summary.show()

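AttributeError: 'function' object has no attribute 'show'

#### The call above fails because `summary` is a method and has to be called before `show()` can be used on its result: write `df.summary().show()`.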

### Miscellaneous Operations

In [47]:
df_csv.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)



In [49]:
df_json.columns

['age', 'name']

#### Different statistics

In [50]:
df_json.describe

<bound method DataFrame.describe of DataFrame[age: bigint, name: string]>

In [51]:
df_json.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+



In [52]:
df_json.summary().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    25%|                19|   null|
|    50%|                19|   null|
|    75%|                30|   null|
|    max|                30|Michael|
+-------+------------------+-------+

