# First Steps
## A Simple Program in PySpark

Data-driven applications, no matter how complex, all boil down to what we can
think of as three meta steps, which are easy to distinguish in a program:
1. We start by loading or reading the data we wish to work with.
2. We transform the data, either via a few simple instructions or a very complex
machine learning model.
3. We then export (or sink) the resulting data, either into a file or by summarizing our findings into a visualization.

> **Note**: 
> This assumes you have Spark and Java installed and set up.

To run Spark interactively with IPython as the shell, set these environment variables in the terminal before launching `pyspark` (Windows `cmd` syntax):
- `set PYSPARK_DRIVER_PYTHON=ipython`
- `set PYSPARK_DRIVER_PYTHON_OPTS=`

In [1]:
import findspark
findspark.init()

Launching the `pyspark` REPL from the terminal prints a banner like this:

```
>pyspark

Python 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]        
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).   
23/01/02 22:24:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.3
      /_/

Using Python version 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021 11:48:03)
Spark context Web UI available at http://ManojZephyrusG15:4040
Spark context available as 'sc' (master = local[*], app id = local-1672694655134).
SparkSession available as 'spark'.

In [1]:
```


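### SparkSession entry point

The REPL creates a `SparkSession` for you and exposes it as `spark`. In a script or notebook, you build it yourself: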



In [6]:
from pyspark.sql import SparkSession
spark = SparkSession\
            .builder\
            .appName("Analyzing the vocabulary of Pride and Prejudice.")\
            .getOrCreate()

In [12]:
# Older PySpark code refers to `sc` and `sqlContext`; these aliases make it easier to follow
sc = spark.sparkContext
sqlContext = spark
sc

A simple problem: “What are the most popular words used in the English language?”

Steps to carry out:
1. *Read*: Read the input data (we're assuming a plain text file).
2. *Token*: Split the text into individual words (tokens).
3. *Clean*: Remove punctuation and any tokens that aren't words, then lowercase each word.
4. *Count*: Count the frequency of each word in the text.
5. *Answer*: Return the 10 (or 20, 50, 100) most frequent words.

![A simple program](images/first_steps_simple_program.png)
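
Before turning to PySpark, the five steps can be sketched in plain Python. This only illustrates what each step means; it is not PySpark code. The function name `top_words`, the inline sample sentence (standing in for the text file), and the crude regex-based cleaning are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Tuple


def top_words(text: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent words in text (illustrative helper)."""
    # 2. Token: split the text on whitespace.
    tokens = text.split()
    # 3. Clean: lowercase each token and drop every character that is not a-z.
    cleaned = [re.sub(r"[^a-z]", "", token.lower()) for token in tokens]
    words = [word for word in cleaned if word]  # discard tokens left empty
    # 4. Count: tally how often each word occurs.
    counts = Counter(words)
    # 5. Answer: keep only the n most frequent words.
    return counts.most_common(n)


# 1. Read: an inline sample stands in for the contents of a plain text file.
sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)

print(top_words(sample, 3))
```

On a real file, step 1 would read the text from disk; the remaining steps stay the same.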


## Ingest and Explore

We need to choose how to store the ingested data. PySpark provides two main structures:
1. RDD (resilient distributed dataset)
2. Data frame

The **RDD** is like a distributed collection of objects (or rows). You manipulate it by passing regular Python functions that are applied to the items it contains.

The **data frame** is a stricter version of the RDD. Conceptually, you can think of it as a table, where each cell contains one value. The data frame relies heavily on the concept of columns: you operate on columns rather than on individual records, as you would with an RDD.