# 101 - Intro to Spark

Spark is a fast, easy-to-use, open-source cluster computing engine for data processing. It is a unified engine that packages support for SQL queries, streaming data, machine learning, and graph processing. Spark supports programming languages such as Python, Java, Scala, and R, and runs anywhere from a laptop to a cluster.

To understand what Spark does, imagine a single computer that is useful for watching movies, navigation, or running spreadsheet software. A single computer like this is not powerful enough to run computations over a massive amount of data. One option is to use a cluster: a group of machines that share resources and work together as a single computer. Spark manages and coordinates that group of machines to make this happen.

In the cluster, one computer acts as the *master*: it splits up the data and the processing tasks, sends them to the nodes, and aggregates the results. The other computers, called nodes, receive a processing task, run it, and return the result to the master.
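The split, send, and aggregate cycle described above can be sketched in plain Python. This is only a toy model on a single machine, not how Spark is implemented: threads stand in for nodes, and the names `split`, `node_task`, and `master` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_chunks):
    """Cut a list into contiguous, nearly even chunks (one per node)."""
    n = len(data)
    return [data[i * n // num_chunks:(i + 1) * n // num_chunks]
            for i in range(num_chunks)]


def node_task(chunk):
    """Work done by one node: here, sum the squares of its chunk."""
    return sum(x * x for x in chunk)


def master(data, num_nodes=4):
    """Split the data, hand one chunk to each node, aggregate the results."""
    chunks = split(data, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(node_task, chunks))
    return sum(partial_results)


total = master(list(range(1, 101)))
print(total)  # sum of the squares of 1..100 -> 338350
```

Each node only ever sees its own chunk, and the master only ever sees the partial results, which is the property that lets the same pattern scale out to many machines.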

### Downloading PySpark - the Python API for Spark

In [1]:
!pip install pyspark



### Creating Spark Context

```SparkContext```: Main entry point for Spark functionality.

In [2]:
from pyspark import SparkContext

# create a spark context
sc = SparkContext("local", "First App")
print(sc)
print(sc.version)

<SparkContext master=local appName=First App>
2.4.4


## Using DataFrames

A DataFrame is a collection of data organized into rows and named columns. Spark DataFrames are built on top of a lower-level object called the Resilient Distributed Dataset (RDD), which is the core data structure of Spark. A Spark DataFrame is easier to understand and better optimized for complicated operations than an RDD.

Using Spark DataFrames you are able to query data in your Spark cluster.

To start working with DataFrames, you need to create a ```SparkSession```.

In [3]:
from pyspark.sql import SparkSession

# create a spark session
spark_session = SparkSession.builder.getOrCreate()

print(spark_session)

<pyspark.sql.session.SparkSession object at 0x000001B7F7120348>


### To view all tables/views in your cluster

Create a DataFrame from a file with ```.read```, which supports many data sources, then register it as a temporary table and list the tables in the catalog.

In [4]:
# read a csv file
df = spark_session.read.csv("sample_data/101-intro-to-spark.csv")

# register in the catalog as a temporary view
# (registerTempTable is deprecated since Spark 2.0)
df.createOrReplaceTempView("address")

# list tables in catalog
spark_session.catalog.listTables()

[Table(name='address', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

### To query in your table

Use ```<your_spark_session>.sql(query)``` to run a SQL query against a registered table.

In [5]:
query = "SELECT * FROM address LIMIT 2"

address2 = spark_session.sql(query) 

address2.show()

+----+--------+-----------------+---------+---+------+
| _c0|     _c1|              _c2|      _c3|_c4|   _c5|
+----+--------+-----------------+---------+---+------+
|John|     Doe|120 jefferson st.|Riverside| NJ| 08075|
|Jack|McGinnis|     220 hobo Av.|    Phila| PA| 09119|
+----+--------+-----------------+---------+---+------+



### Spark DataFrame to Pandas

Convert a Spark DataFrame to a pandas DataFrame with ```.toPandas()```.

In [6]:
query = "SELECT * FROM address"

# run the query
address = spark_session.sql(query)

# convert to pandas dataframe
df_address = address.toPandas()

# rename the columns and show the first rows
columns = ["name", "surname", "street", "city", "state", "zipcode"]
df_address.columns = columns
df_address.head()

|   | name | surname | street | city | state | zipcode |
|---|------|---------|--------|------|-------|---------|
| 0 | John | Doe | 120 jefferson st. | Riverside | NJ | 8075 |
| 1 | Jack | McGinnis | 220 hobo Av. | Phila | PA | 9119 |
| 2 | "John ""Da Man""" | Repici | 120 Jefferson St. | Riverside | NJ | 8075 |
| 3 | Stephen | Tyler | "7452 Terrace ""At the Plaza"" road" | SomeTown | SD | 91234 |
| 4 |  | Blankman |  | SomeTown | SD | 298 |


### Pandas DataFrame to Spark DataFrame

In [7]:
import pandas as pd
import numpy as np

# create a pandas dataframe
pd_df = pd.DataFrame(np.random.random(10))

# create spark_df from pd_df
spark_df = spark_session.createDataFrame(pd_df)

# list tables in catalog
print("First: ", spark_session.catalog.listTables())

# Add spark_temp to the catalog
spark_df.createTempView("table_from_pandas")

# list tables in catalog
print("Second: ", spark_session.catalog.listTables())

First:  [Table(name='address', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
Second:  [Table(name='address', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='table_from_pandas', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


```.createTempView()``` registers the DataFrame as a table in the catalog.

```.createDataFrame()``` creates a Spark DataFrame from a pandas DataFrame.

## RDD vs. DataFrame

### When to use RDD?

    - When you have a specific reason that DataFrames cannot address
    - When you need fine-grained control over the physical distribution of data (custom partitioning of data)

A Resilient Distributed Dataset (RDD) is an immutable collection of Java, Scala, or Python objects.

    - An RDD can store whatever you want, in any format.
    - An RDD has no concept of a row.
    - Manipulations and interactions are defined by hand, which can mean a lot of manual work.
    - An RDD is operated on in parallel.

RDDs can be significantly slower in Python.

DataFrames are also immutable collections of data, organized into named columns. DataFrames help you to:

    - process large data sets
    - formalize the structure of the data
    - work with the data as you would with a table in a relational database
    

In [68]:
# sc is defined above

myCollection = "A ship in the harbor is safe, but that is not what a ship is for.\
A stitch in time saves nine.\
As you sow, so you shall reap.\
Be slow in choosing, but slower in changing.\
Curiosity killed the cat.\
Don't cast pearls before swine.\
Don't count your chickens before they hatch.\
Don't cross a bridge until you come to it.\
Don't judge a book by its cover.\
Don't put the cart before the horse.\
Early bird catches the worm.".split(".") # list of phrases

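To create an RDD, ```.parallelize()``` a collection (a list or an array of elements). It is also possible to create an RDD from a file with ```.textFile(<file_path>)```.

The second argument of ```.parallelize()``` is the number of partitions to split the collection into: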


In [76]:
phrases = sc.parallelize(myCollection, 2)
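
The `2` passed to ```.parallelize()``` asks for the collection to be split into two partitions. As a rough mental model only, the plain-Python sketch below cuts a list into contiguous, nearly even slices. The helper `split_into_partitions` and the `sample` list are hypothetical, and the exact way Spark assigns elements to partitions is an assumption here and may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into contiguous, nearly even slices (a simplified model)."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# hypothetical stand-in for myCollection: 12 short strings
sample = ["phrase %d" % i for i in range(12)]

partitions = split_into_partitions(sample, 2)
print([len(p) for p in partitions])  # -> [6, 6]
```

On the real RDD, ```phrases.getNumPartitions()``` reports the number of partitions and ```phrases.glom().collect()``` returns the elements grouped by partition, which is the way to check the actual layout.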

```.collect()``` is an action that brings all the elements of the RDD back to the driver.

In [70]:
phrases.collect()

['A ship in the harbor is safe, but that is not what a ship is for',
 'A stitch in time saves nine',
 'As you sow, so you shall reap',
 'Be slow in choosing, but slower in changing',
 'Curiosity killed the cat',
 "Don't cast pearls before swine",
 "Don't count your chickens before they hatch",
 "Don't cross a bridge until you come to it",
 "Don't judge a book by its cover",
 "Don't put the cart before the horse",
 'Early bird catches the worm',
 '']

In [78]:
# pair each sentence with a flag: does it contain ' a '?
match_str = " a "
phrases2 = phrases.map(lambda p: (p, match_str in p))

# keep only the records whose flag (second element) is True
phrases2.filter(lambda record: record[1]).take(5)

[('A ship in the harbor is safe, but that is not what a ship is for', True),
 ("Don't cross a bridge until you come to it", True),
 ("Don't judge a book by its cover", True)]
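
One detail of the filter step is worth spelling out: each record is a `(sentence, flag)` tuple, and a non-empty tuple is always truthy in Python, so a predicate has to test the flag itself rather than the whole record. The plain-Python sketch below involves no Spark, and its `sentences` list is an illustrative sample. It compares the two predicates.

```python
match_str = " a "
sentences = [
    "Don't judge a book by its cover",
    "Curiosity killed the cat",
]

# same shape as the RDD map step: (sentence, flag)
records = [(s, match_str in s) for s in sentences]

# a non-empty tuple is always truthy, so this keeps every record
kept_by_record = [r for r in records if r]

# testing the flag at index 1 keeps only the matches
kept_by_flag = [r for r in records if r[1]]

print(len(kept_by_record), len(kept_by_flag))  # -> 2 1
```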