```{figure} ../images/banner.png
---
align: center
name: banner
---
```

# Chapter 3: Data Sources

## Chapter Learning Objectives

- The data sources and file formats that Spark can read and write
- General methods for loading data into a Spark DataFrame
- General methods for saving a Spark DataFrame to a data source

## Chapter Outline

- [1. Various data sources & file formats](#1)
- [2. Loading & Saving data from various data sources](#2)
    - [2a. from text file](#3)
    - [2b. from CSV file](#4)
    - [2c. from JSON file](#5)
    - [2d. from Parquet file](#6)
    - [2e. from ORC file](#7)
    - [2f. from AVRO file](#8)
    - [2g. from whole Binary file](#9)


## Chapter Outline (Visual)
Click any image to go to that section.


```{figure} img/chapter3/datasources.png
---
align: center
---
```


In [209]:
import panel as pn
css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])

<div class="special_table"></div>

Click any image below | To return to this image gallery, click "Outline Gallery" under Contents in the top-right corner
- | - 
[![alt](img/chapter3/text.png)](#501)| [![alt](img/chapter3/csv.png)](#502)
[![alt](img/chapter3/json.png)](#503) | [![alt](img/chapter3/parquet.png)](#504)
[![alt](img/chapter3/orc.png)](#505) | [![alt](img/chapter3/avro.png)](#506)
[![alt](img/chapter3/binary.png)](#507) |



<a id='1'></a>

## 1. What is a Spark DataFrame?
A DataFrame represents a table of data with rows and columns. A simple analogy is a spreadsheet with named columns.

A Spark DataFrame is a distributed collection of data organized into named columns. It can be constructed from a wide array of sources, such as structured data files, Hive tables, external databases, existing RDDs, Python lists, or pandas DataFrames.


```{figure} img/chapter2/spark_dataframe.png
---
align: center
---
```

In [146]:
# Build a one-row DataFrame from a list of tuples; without a schema the columns are named _1, _2, ...
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
spark.createDataFrame([("John",180,True, 1.7, "1960-01-01", '{“home”: 123456789, “office”:234567567}'),]).show(1,False)        

+----+---+----+---+----------+---------------------------------------+
|_1  |_2 |_3  |_4 |_5        |_6                                     |
+----+---+----+---+----------+---------------------------------------+
|John|180|true|1.7|1960-01-01|{“home”: 123456789, “office”:234567567}|
+----+---+----+---+----------+---------------------------------------+



In [147]:
jsonStrings = ['{"name":"Yin","age":45,"smoker": true,"test":34, "address":{"city":"Columbus","state":"Ohio"},"favorite_colors": ["blue","green"] }',]
otherPeopleRDD = spark.sparkContext.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()  

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- age: long (nullable = true)
 |-- favorite_colors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- name: string (nullable = true)
 |-- smoker: boolean (nullable = true)
 |-- test: long (nullable = true)



In [148]:
from pyspark.sql.types import *
from pyspark.sql import functions as func
schema = StructType([
    StructField("name", StringType()),
    StructField("weight", LongType()),
    StructField("smoker", BooleanType()),
    StructField("height", DoubleType()),
    StructField("birthdate", StringType()),
    StructField("phone_nos", MapType(StringType(),LongType(),True),True),  
    StructField("favorite_colors", ArrayType(StringType(),True),True),  
    StructField("address", StructType([
        StructField("houseno", IntegerType(),True),
        StructField("street", StringType(),True),
        StructField("city", StringType(),True),
        StructField("zipcode", IntegerType(),True),
    ])) 
    
])

df = spark.createDataFrame((
    [["john",180,True,1.7,'1960-01-01',{'office': 123456789, 'home': 223456789},["blue","red"],(100,'street1','city1',12345)],
    ["tony",180,True,1.8,'1990-01-01',{'office': 223456789, 'home': 323456789},["green","purple"],(200,'street2','city2',22345)],
    ["mike",180,True,1.65,'1980-01-01',{'office': 323456789, 'home': 423456789},["yellow","orange"],(300,'street3','city3',32345)]]
),schema=schema)
df.show(3,False)

+----+------+------+------+----------+----------------------------------------+----------------+----------------------------+
|name|weight|smoker|height|birthdate |phone_nos                               |favorite_colors |address                     |
+----+------+------+------+----------+----------------------------------------+----------------+----------------------------+
|john|180   |true  |1.7   |1960-01-01|[office -> 123456789, home -> 223456789]|[blue, red]     |[100, street1, city1, 12345]|
|tony|180   |true  |1.8   |1990-01-01|[office -> 223456789, home -> 323456789]|[green, purple] |[200, street2, city2, 22345]|
|mike|180   |true  |1.65  |1980-01-01|[office -> 323456789, home -> 423456789]|[yellow, orange]|[300, street3, city3, 32345]|
+----+------+------+------+----------+----------------------------------------+----------------+----------------------------+



In [149]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- weight: long (nullable = true)
 |-- smoker: boolean (nullable = true)
 |-- height: double (nullable = true)
 |-- birthdate: string (nullable = true)
 |-- phone_nos: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- favorite_colors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- address: struct (nullable = true)
 |    |-- houseno: integer (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zipcode: integer (nullable = true)



In [150]:
#JSON FILE
#df.repartition(1).write.json("/Users/deepak/Documents/sparkbook/chapters/data/json")
spark.read.format("json").load("/Users/deepak/Documents/sparkbook/chapters/data/json").show(3,False)

+----------------------------+----------+----------------+------+----+----------------------+------+------+
|address                     |birthdate |favorite_colors |height|name|phone_nos             |smoker|weight|
+----------------------------+----------+----------------+------+----+----------------------+------+------+
|[city1, 100, street1, 12345]|1960-01-01|[blue, red]     |1.7   |john|[223456789, 123456789]|true  |180   |
|[city2, 200, street2, 22345]|1990-01-01|[green, purple] |1.8   |tony|[323456789, 223456789]|true  |180   |
|[city3, 300, street3, 32345]|1980-01-01|[yellow, orange]|1.65  |mike|[423456789, 323456789]|true  |180   |
+----------------------------+----------+----------------+------+----+----------------------+------+------+



In [151]:

#df.select(func.concat("name","weight","smoker","height","birthdate",func.to_json("phone_nos"),func.to_json("favorite_colors"),func.to_json("address")).alias("text")).repartition(1).write.format("text").option("header","true").save("/Users/deepak/Documents/sparkbook/chapters/data/people/part-00000-29f563bb-12e4-42c3-bae3-3e30505a16cc-c000.txt")

In [152]:
spark.read.format("text").load("/Users/deepak/Documents/sparkbook/chapters/data/people/part-00000-29f563bb-12e4-42c3-bae3-3e30505a16cc-c000.txt").show(3,False)

+--------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|john180true1.71960-01-01{"office":123456789,"home":223456789}["blue","red"]{"houseno":100,"street":"street1","city":"city1","zipcode":12345}      |
|tony180true1.81990-01-01{"office":223456789,"home":323456789}["green","purple"]{"houseno":200,"street":"street2","city":"city2","zipcode":22345}  |
|mike180true1.651980-01-01{"office":323456789,"home":423456789}["yellow","orange"]{"houseno":300,"street":"street3","city":"city3","zipcode":32345}|
+--------------------------------------------------------------------------------------------------------------------------------------------------+

In [153]:
spark.read.csv('/Users/deepak/Documents/sparkbook/chapters/data/people.csv', header=True).show(3,False)

+----+------+------+------+----------+-------------------------------------+-------------------+-----------------------------------------------------------------+
|name|weight|smoker|height|birthdate |phone_nos                            |favorite_colors    |address                                                          |
+----+------+------+------+----------+-------------------------------------+-------------------+-----------------------------------------------------------------+
|john|180   |true  |1.7   |1960-01-01|{"office":123456789,"home":223456789}|["blue","red"]     |{"houseno":100,"street":"street1","city":"city1","zipcode":12345}|
|tony|180   |true  |1.8   |1990-01-01|{"office":223456789,"home":323456789}|["green","purple"] |{"houseno":200,"street":"street2","city":"city2","zipcode":22345}|
|mike|180   |true  |1.65  |1980-01-01|{"office":323456789,"home":423456789}|["yellow","orange"]|{"houseno":300,"street":"street3","city":"city3","zipcode":32345}|
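+----+------+------+------+----------+-------------------------------------+-------------------+-----------------------------------------------------------------+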

In [154]:
#/Users/deepak/Documents/sparkbook/chapters
from pyspark.sql import functions as func
#df.select("name","weight","smoker","height","birthdate",func.to_json("phone_nos").alias("phone_nos"),func.to_json("favorite_colors").alias("favorite_colors"),func.to_json("address").alias("address")).repartition(1).write.csv('/Users/deepak/Documents/sparkbook/chapters/data/people.csv', header=True)#show(3,False)
#df.write.csv('/Users/deepak/Documents/sparkbook/chapters/data/people1.csv', header=True)

In [155]:
#df.repartition(1).write.parquet("/Users/deepak/Documents/sparkbook/chapters/data/parquetfile",mode='overwrite')

In [156]:
#spark.read.parquet("/Users/deepak/Documents/sparkbook/chapters/data/parquetfile").show(3,False)

In [157]:
#df.repartition(1).write.orc("/Users/deepak/Documents/sparkbook/chapters/data/orcfile",mode='overwrite')

In [158]:
#spark.read.orc("/Users/deepak/Documents/sparkbook/chapters/data/orcfile").show(3,False)

In [159]:
#df.repartition(1).write.json("/Users/deepak/Documents/sparkbook/chapters/data/jsonfile",mode='overwrite')

In [160]:
#df.repartition(1).write.format("avro").save("/Users/deepak/Documents/sparkbook/chapters/data/avrofile")

In [161]:
import pandas as pd
pd.options.display.width = 0
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.width', None)
pd.set_option('display.expand_frame_repr', False)
spark.read.format("binaryFile").load("/Users/deepak/Documents/sparkbook/images/banner.png").show()

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/Users/deepa...|2021-01-19 21:26:53| 63864|[89 50 4E 47 0D 0...|
+--------------------+-------------------+------+--------------------+



## Hive

<a id='501'></a>

<a id='502'></a>

<a id='503'></a>

<a id='504'></a>

<a id='505'></a>

<a id='506'></a>

<a id='507'></a>

In [162]:
warehouse_location = 'hive-warehouse'
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.1") \
    .enableHiveSupport() \
    .getOrCreate()

In [163]:
df.createOrReplaceTempView("people") 

In [164]:
#myDf.createOrReplaceTempView("mytempTable") 

In [165]:
spark.conf.set("hive.metastore.schema.verification","false")

In [166]:
#df.write.saveAsTable('people')

In [167]:
#spark.sql("create table people as select * from df")

In [168]:
#sqlContext.sql("create table mytable as select * from mytempTable");

In [169]:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType()),
    StructField("weight", LongType()),
    StructField("smoker", BooleanType()),
    StructField("height", DoubleType()),
    StructField("birthdate", StringType()),
    StructField("phone_nos", MapType(StringType(),LongType(),True),True),  
    StructField("favorite_colors", ArrayType(StringType(),True),True),  
    StructField("address", StructType([
        StructField("houseno", IntegerType(),True),
        StructField("street", StringType(),True),
        StructField("city", StringType(),True),
        StructField("zipcode", IntegerType(),True),
    ])) 
    
])
schema

StructType(List(StructField(name,StringType,true),StructField(weight,LongType,true),StructField(smoker,BooleanType,true),StructField(height,DoubleType,true),StructField(birthdate,StringType,true),StructField(phone_nos,MapType(StringType,LongType,true),true),StructField(favorite_colors,ArrayType(StringType,true),true),StructField(address,StructType(List(StructField(houseno,IntegerType,true),StructField(street,StringType,true),StructField(city,StringType,true),StructField(zipcode,IntegerType,true))),true)))



<a id='1001'></a>

<a id='2'></a>

## 2. Creating a Spark DataFrame

Let's first understand the syntax.

```{admonition} Syntax
<b>createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)</b>

<b>Parameters</b>:

data – an RDD, a list, or a pandas.DataFrame.

schema – a pyspark.sql.types.DataType, a datatype string, or a list of column names. The default is None, in which case column names and types are inferred from the data.

samplingRatio – the ratio of rows sampled when inferring the schema.

verifySchema – whether to verify the data types of every row against the schema.
```

<a id='33'></a>

## 2a. from RDD


```{figure} img/chapter2/rdd_dataframe.png
---
align: center
---
```

<b>What is an RDD?</b>

RDD stands for Resilient Distributed Dataset.

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. 

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. 

Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

<b>Creating an RDD:</b>

In [170]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
rdd_spark = spark.sparkContext.parallelize([('John', 'Seattle', 60, True, 1.7, '1960-01-01'),
 ('Tony', 'Cupertino', 30, False, 1.8, '1990-01-01'),
 ('Mike', 'New York', 40, True, 1.65, '1980-01-01')])


In [171]:
# collect() returns the RDD's elements to the driver as a Python list
print(rdd_spark.collect())

[('John', 'Seattle', 60, True, 1.7, '1960-01-01'), ('Tony', 'Cupertino', 30, False, 1.8, '1990-01-01'), ('Mike', 'New York', 40, True, 1.65, '1980-01-01')]


<b>Creating a Spark DataFrame:</b>

In [172]:
spark.createDataFrame(rdd_spark).show()

+----+---------+---+-----+----+----------+
|  _1|       _2| _3|   _4|  _5|        _6|
+----+---------+---+-----+----+----------+
|John|  Seattle| 60| true| 1.7|1960-01-01|
|Tony|Cupertino| 30|false| 1.8|1990-01-01|
|Mike| New York| 40| true|1.65|1980-01-01|
+----+---------+---+-----+----+----------+



<a id='4'></a>

## 2b. from List


```{figure} img/chapter2/list_dataframe.png
---
align: center
---
```

In [173]:
spark.createDataFrame([('John', 'Seattle', 60, True, 1.7, '1960-01-01'), 
('Tony', 'Cupertino', 30, False, 1.8, '1990-01-01'), 
('Mike', 'New York', 40, True, 1.65, '1980-01-01')]).show()

+----+---------+---+-----+----+----------+
|  _1|       _2| _3|   _4|  _5|        _6|
+----+---------+---+-----+----+----------+
|John|  Seattle| 60| true| 1.7|1960-01-01|
|Tony|Cupertino| 30|false| 1.8|1990-01-01|
|Mike| New York| 40| true|1.65|1980-01-01|
+----+---------+---+-----+----+----------+



<a id='12451'></a>

## 2c. from pandas dataframe


```{figure} img/chapter2/pd_spark.png
---
align: center
---
```

<b>Input: pandas dataframe</b>

<b>Creating pandas dataframe</b>

In [174]:
import pandas as pd
df_pd = pd.DataFrame([('John', 'Seattle', 60, True, 1.7, '1960-01-01'), 
('Tony', 'Cupertino', 30, False, 1.8, '1990-01-01'), 
('Mike', 'New York', 40, True, 1.65, '1980-01-01')])
df_pd

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | John | Seattle | 60 | True | 1.7 | 1960-01-01 |
| 1 | Tony | Cupertino | 30 | False | 1.8 | 1990-01-01 |
| 2 | Mike | New York | 40 | True | 1.65 | 1980-01-01 |


<b>Output: spark dataframe</b>

In [175]:
#spark.createDataFrame(df_pd).show()
