# 2. Downloading Apache Spark and Getting Started

Installing `pyspark` into your venv also installs a local copy of Spark. You can then either work interactively in the `pyspark` shell (a Spark REPL) or run a script with `spark-submit`.

```sh
(venv)$ pyspark  # in a terminal: starts a shell on a local Spark, master `local[*]`
(venv)$ spark-version/bin/spark-submit <file>
```

Running Spark this way starts a temporary local Spark instance, which is generally meant for testing and development. Production jobs usually connect to an existing Spark cluster.

Spark operations fall into two categories:
- **transformations**: lazily evaluated; they only describe a new DataFrame, such as `select` and `filter`
- **actions**: eagerly evaluated; they trigger the actual computation, such as `show` and `collect`


| Transformations | Actions   |
| :---------------|:----------|
| `orderBy`       | `show`    |
| `groupBy`       | `take`    |
| `filter`        | `count`   |
| `select`        | `collect` |
| `join`          | `save`    |

Transformations can in turn be divided into two kinds:
- **narrow**: each output partition can be computed independently, using only the data in a single input partition: `filter`
- **wide**: cannot be computed independently; needs data from other partitions, which causes a shuffle: `orderBy`



In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("M&M Counter").getOrCreate()
spark

In [None]:
df = spark.read.format("csv").option("header", "true").load("data/mnm_dataset.csv")
df.show()

In [None]:
count_mm_df = (
    df.groupBy("State", "Color")
    .agg(count("Count").alias("Total"))
    .orderBy("Total", ascending=False)
)
spark_df = (
    count_mm_df.toPandas()
    .sort_values(["State", "Color", "Total"])
    .reset_index(drop=True)
)
spark_df

In [None]:
import pandas as pd

pd.set_option("display.max_colwidth", None)
data = pd.read_csv("data/mnm_dataset.csv")
pandas_df = (
    data.groupby(["State", "Color"])["Count"]
    .count()
    .reset_index()
    .rename(columns={"Count": "Total"})
    .sort_values(["State", "Color", "Total"])
).reset_index(drop=True)

pandas_df

In [None]:
# Reduce the element-wise comparison over rows, then over columns.
(pandas_df == spark_df).all().all()
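When checking that two pandas DataFrames match, it helps to know that Python's built-in `all` iterates over a DataFrame's column labels rather than its values. The sketch below uses two small made-up frames (`left` and `right` are illustrative names, not part of the notebook) that differ in one cell, and compares three ways of testing equality.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Two tiny, made-up frames that differ in one cell (illustrative data only).
left = pd.DataFrame({"State": ["CA", "TX"], "Total": [1, 2]})
right = pd.DataFrame({"State": ["CA", "TX"], "Total": [1, 3]})

# Built-in all() iterates over the column labels "State" and "Total",
# which are non-empty strings, so the mismatch goes unnoticed.
labels_only = all(left == right)

# Reducing with DataFrame.all() twice checks every cell.
cell_by_cell = bool((left == right).all().all())

# assert_frame_equal raises AssertionError on any mismatch.
try:
    assert_frame_equal(left, right)
    frames_match = True
except AssertionError:
    frames_match = False

print(labels_only, cell_by_cell, frames_match)
```

`assert_frame_equal` also compares dtypes and index by default, so it is the strictest of the three checks.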