# Spark: In Memory 

### Starting with Postgres

<img src="./db_both.jpg" width="40%">

* When we `select`, Postgres reads from data it already stores on its own disk.

### By Contrast, Spark

> <img src="./data_s3.png" width="60%">

With Spark, the data lives in external storage such as S3, and we read it in only when we need to process it.

> <img src="./s3_to_movies.jpg" width="80%">

### Diving In

In [1]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session named "films" that runs on two threads.
spark = SparkSession.builder.appName("films").master("local[2]").getOrCreate()

In [3]:
spark

Finally, we'll read in data from S3 using pandas.

In [4]:
import pandas as pd

# pandas needs the s3fs package installed to read s3:// URLs.
url = "s3://jigsaw-labs-student/imdb_movies.csv"
df = pd.read_csv(url)

In [7]:
# Cast every column to str first, so each Spark column is a plain string column.
movies_df = spark.createDataFrame(df.astype(str))

In [8]:
movies_df.take(1)

[Row(title='Avatar', genre='Action', budget='237000000', runtime='162.0', year='2009', month='12', revenue='2787965087')]

> <img src="./s3_to_movies.jpg" width="60%">

### The Benefits of In Memory Storage 

1. Cheaper storage
    * It costs less to store data on S3 than in Postgres.
    * So we can afford to store data that has relatively low, or even unknown, value.

2. Extract Load Transform

In [38]:
import requests
url = "https://data.cityofnewyork.us/resource/biws-g3hs.json?pulocationid=186"
response = requests.get(url)

taxi_rides = response.json()

In [39]:
taxi_rides[:1]

[{'vendorid': '1',
  'tpep_pickup_datetime': '2017-01-09T11:32:27.000',
  'tpep_dropoff_datetime': '2017-01-09T11:36:01.000',
  'passenger_count': '1',
  'trip_distance': '0.90',
  'ratecodeid': '1',
  'store_and_fwd_flag': 'N',
  'pulocationid': '186',
  'dolocationid': '234',
  'payment_type': '1',
  'fare_amount': '5',
  'extra': '0',
  'mta_tax': '0.5',
  'tip_amount': '1.45',
  'tolls_amount': '0',
  'improvement_surcharge': '0.3',
  'total_amount': '7.25'}]

A. Postgres

Before we could query these records in Postgres, we would need to:

* Create a new table
* Coerce some of the data to match the column types
* Insert the records
B. Pyspark

```python
import pandas as pd

# Load the raw file as-is: no table definition or up-front type coercion.
df = pd.read_csv("s3://jigsaw-labs/imdb_movies.csv")
movies_df = spark.createDataFrame(df.astype(str))
movies_df.take(1)
```

3. Memory intensive computations

* Certain operations really do require having large amounts of data available in memory, so the ability to hold data in memory is valuable.

### But there's still some local storage

Even though Spark performs much of its computation in memory, and primarily uses in-memory storage, Spark nodes still use local disks for data that does not fit into RAM, and to store intermediate output in a complex operation.

> But this storage on disk is quite minor compared to most databases.

So let's update our diagram to more accurately reflect the hardware Spark uses.

> <img src='./pyspark-components.jpg' width="50%">

So even though Spark technically does use local disk storage, what distinguishes Spark is its reliance on in-memory storage and computation.

[Resources](https://runawayhorse001.github.io/LearningApacheSpark/)