# Why PySpark

### Introduction

In our usual understanding of databases, a database has two components: (1) storage and (2) processing.  That is, we both store our data in the database and perform operations on that data there.

As we'll see, with PySpark we'll only be concerned with processing our data.  The data itself can be saved to disk externally, on a service like S3.  This has vast implications for the way we treat and value data in an organization, as well as the operations we can perform on that data. 

### Databases: Storage and Processing 

Now normally, when we work with a database, we are working with two components: storage and processing.

<img src="./db_both.jpg" width="40%">

We store our data in tables, and those tables are saved in files on disk.  When we perform an operation, like selecting records from a table, the database loads the data into memory, performs any transformations (like rounding floats), and returns the result to the user.

Even when working with a distributed database like Redshift, where data is stored across compute nodes, each compute node only processes its own data.

<img src="./distributed_db.jpg" width="100%">

> So above, slice 1 both stores a partition of the movies and calculates the average of that subset of movies.  Slice 2 does the same for its subset of movies, and so on.  Then the leader node combines the sub-averages, weighting each by the number of movies in its slice, to calculate the average across all movies.
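To make the leader's job concrete, here is a minimal sketch in plain Python (not Redshift code). The ratings and the function names `slice_summary` and `leader_average` are made up for illustration. Each slice reports a sum and a count, since sub-averages alone are not enough when slices hold different numbers of movies.

```python
# Hypothetical ratings held by three slices (made-up numbers).
slices = [
    [8.0, 6.0],        # slice 1
    [7.0, 9.0, 5.0],   # slice 2
    [4.0],             # slice 3
]


def slice_summary(ratings):
    """Each slice processes only its own data and returns (sum, count)."""
    return sum(ratings), len(ratings)


def leader_average(summaries):
    """The leader combines the per-slice sums and counts into one average."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


summaries = [slice_summary(ratings) for ratings in slices]
overall = leader_average(summaries)
print(overall)
```

Here the sub-averages are 7.0, 7.0 and 4.0. Averaging those directly would give 6.0, while the true average over all six movies is 6.5, which is why the counts travel along with the sums.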

### PySpark stores data in memory

As we may have guessed, things operate differently in PySpark.  Take a look at the diagram below.

> <img src="./s3_to_movies.jpg" width="60%">

Instead of storing the data on disk inside the database, we can permanently store the data externally in something like an S3 bucket, and then read the data into Spark's memory when we need to perform a query on it. 

So in the diagram above, notice that inside of Spark, we do not need to store our data on disk.  Instead, our data can be loaded directly into memory.  

And because Spark is organized as a cluster, Spark can partition the data as it reads it, placing each partition in the memory of a different machine in the cluster.
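As a rough sketch of what this looks like in PySpark: the bucket path below is hypothetical, and reading from an `s3a://` path assumes the cluster has the Hadoop S3 connector and credentials configured. The helper name `load_movies` is ours, not part of Spark.

```python
def load_movies(path="s3a://example-bucket/movies.csv"):
    """Read a CSV from external storage into a Spark DataFrame.

    The default path is a hypothetical bucket used only for illustration.
    """
    # Imported here so the function can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("why-pyspark").getOrCreate()

    # Spark splits the input into partitions spread across the cluster.
    movies = spark.read.csv(path, header=True)
    print("partitions:", movies.rdd.getNumPartitions())
    return movies
```

On a configured cluster, `load_movies()` returns a DataFrame, and a call such as `movies.repartition(8)` would redistribute its rows across eight partitions.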

<img src="./spark_cluster_partition.jpg" width="80%">

### But there's still local storage

Even though Spark performs much of its computation in memory, and even allows for in-memory storage, Spark nodes still use local disks for data that does not fit into RAM and for the intermediate output of complex operations.

So let's update our diagram to more accurately reflect the hardware in a Spark cluster.

> <img src='./spark_cluster_disk.jpg' width="80%">

Still, even though worker nodes technically have disk storage, what distinguishes Spark is its reliance on in-memory storage and computation.

Ok, now let's see some of the benefits of this reliance on in-memory storage and computation.

### The Benefits of In Memory Storage 

So we saw that with Spark, we read in our data from an external data source, and then store and process this data in memory.  Let's see some of the benefits that we get with this kind of storage.

1. Cheaper storage

The first benefit is simply that it costs less money to store our data externally on something like S3 than to store it on disk inside of a database.  With our data stored on S3, our data is simply written to disk, and there is no associated software running along with it.  By contrast, when we store our data in a database like Redshift, for each node we have running, we are paying not just for the data storage but for a database with the capability to process that data.

With our data being cheaper to store, we can store data that has relatively low value, or unknown value.  And if a data scientist or data engineer can extract value from it later on, the data will be available.

2. Schema on Read

Not only is the data available, but with the data loaded into memory, it's easier to explore and process it.  For example, we can read data into a Spark DataFrame, which operates similarly to a pandas DataFrame.  How easy is it to go from a CSV file to a DataFrame?  Here it is in pandas:

```python
import pandas as pd

# Read the CSV straight from its URL; pandas infers the column types.
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'
df = pd.read_csv(url)
df[:2]
```

| Unnamed: 0 | year | imdb | title | test | clean_test | binary | budget | domgross | intgross | code | budget_2013\$ | domgross_2013\$ | intgross_2013\$ | period code | decade code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | tt1711425 | 21 & Over | notalk | notalk | FAIL | 13000000 | 25682380.0 | 42195766.0 | 2013FAIL | 13000000 | 25682380.0 | 42195766.0 | 1.0 | 1.0 |
| 1 | 2012 | tt1343727 | Dredd 3D | ok-disagree | ok | PASS | 45000000 | 13414714.0 | 40868994.0 | 2012PASS | 45658735 | 13611086.0 | 41467257.0 | 1.0 | 1.0 |


Pretty easy.  Contrast this with our operations with Redshift or Postgres.  There, we first needed to create our tables, and then transform our data to insert records into those tables.  With Spark, we can let the software determine the schema that fits the data.  This is called `schema on read`, as the schema is determined when we read in the data.

This goes hand in hand with extract-load-transform (ELT), where we first load our external data into Spark as it is, and only then transform it.
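To get a feel for what determining a schema on read involves, here is a simplified sketch in plain Python. It is not Spark's actual inference logic, and the helper `infer_type` is made up for illustration. It takes a few columns from the movies file above and, for each column, picks the narrowest type that every value fits.

```python
import csv
import io

# A few columns from the movies file above, as raw CSV text.
raw = """year,title,binary,domgross
2013,21 & Over,FAIL,25682380.0
2012,Dredd 3D,PASS,13414714.0
"""


def infer_type(values):
    """Pick the narrowest type (int, float, str) that fits every value."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate.__name__
        except ValueError:
            continue
    return "str"


rows = list(csv.DictReader(io.StringIO(raw)))
schema = {
    column: infer_type([row[column] for row in rows])
    for column in rows[0]
}
print(schema)
# {'year': 'int', 'title': 'str', 'binary': 'str', 'domgross': 'float'}
```

In PySpark, this kind of inference is requested with the `inferSchema=True` option on `spark.read.csv`. Without it, every CSV column is read as a string.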

3. Memory intensive computations

The third benefit of memory storage really comes from the perspective of data science.  Performing certain computations requires having a large amount of data accessible for computation, and having our entire dataset loaded in memory means the data will be accessible.

### Summary

In this lesson, we saw that with PySpark we store our data externally in something like an S3 bucket, and then load this data into memory when Spark needs it.  If we need additional memory, we add more nodes to our Spark cluster.

Because our data lives in inexpensive external storage rather than inside a database, we are able to keep data with low or unknown value.  And reading our data into memory means that we do not need to first create a schema and then load our data into tables.  In fact, Spark can often detect the appropriate datatype for each column for us, creating the `schema on read`.  Finally, for data scientists, Spark is useful for memory-intensive calculations. 