<center>
<img src="./images/00_main_arcada.png" style="width:1400px">
</center>

## Lecture 3: Big Data Engineering


## Instructor:
Anton Akusok <br/> 
email:  anton.akusok@arcada.fi<br/>
message: "Anton Akusok" @ Microsoft Teams

## Content 

1. Big Data Analytics technologies

2. Introduction to Spark (https://andfanilo.github.io/pyspark-interactive-lecture/#/)

3. Exercise: Install Spark and look at the data

4. Basics of PySpark
   (https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html)

5. PySpark and SQL
   (https://docs.databricks.com/en/getting-started/dataframes-python.html)

6. PySpark operations demo

7. Exercise/homework: Analyze shopping trends data

# 1. Tech around Big Data Analytics

![spark book](images/book.png)

https://www.amazon.de/-/en/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321 

## Basic idea

- large data is almost always stored in "bucket" / cloud object storage, like Amazon S3
- an interface to look at the data while processing it / interactive data visualization;
two options: a Jupyter / Databricks notebook, or a web-based SQL query editor
- a scheduled data processing pipeline:
either run a script on a schedule, e.g. Airflow runs Jupyter notebooks,
or define transformations that run as needed, e.g. the table graph in "dbt"
- the ability to write code and integrate with other services;
"easy visualization tools" will always have a few missing things that are deal breakers for industry

## Options
- open source tools like SQL, Spark, dbt
- paid special tools like Tableau
- cloud providers offering tools in a nice wrapping: AWS SageMaker, Redshift

## Surrounding tools
- NoSQL databases - data storage and queries, can use Spark for processing
- Graph databases
- Stream data processing (Kafka, Flink). Spark Streaming is in-between.

## Data in-memory: Dataframe
- Pandas dataframes and Spark dataframes are different, but each can be converted into the other.
- Attention: Pandas keeps data in the memory of one machine, while Spark distributes it across many machines. It is easy to run out of memory when converting a Spark dataframe to a Pandas dataframe.

## Data storage
- Distributed data storage (always used with large data). The data is split into many files. Each single file is read at average speed, but reading all files in parallel is amazingly fast.

  
- Columnar storage format! Can load a few columns without reading everything. Examples: Parquet, Databricks Delta
- JSON: most flexible, often the format data arrives in. Always compressed (gzip) and takes a very long time to read, easily 80-90% of the runtime just for reading data.

- JSON benefits: a natural choice when data arrives one row at a time; those rows are compressed and appended to a JSON file. Always comes as many smaller files that are read very fast (limited by CPU).
- JSON drawbacks: cannot quickly read a subset of the data; no defined schema; the schema will change over time, causing problems when reading old data. You can provide your own schema to selectively read only the necessary columns.

# 2. Introduction to Spark 

https://andfanilo.github.io/pyspark-interactive-lecture/#/

# 3. Let's run some Spark!

<center>
<img src="./images/00_hands-on.jpg" style="width:1000px">
</center>

# Installing Spark

`pip install pyspark`   
or  
`pip install "pyspark[sql]"`

https://spark.apache.org/docs/latest/api/python/getting_started/install.html

## (Py)Spark needs Java

There are many Java distributions, such as OpenJDK.

macOS: `brew install openjdk`

https://howtodoinjava.com/java/install-openjdk-on-linux-macos-windows/ 

You may need to set up the `JAVA_HOME` environment variable - ask AI how to do this
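
A quick way to check the Java setup from Python, as a sketch using only the standard library. The helper name `java_status` is made up for this example. It only reports what it finds and does not change any settings.

```python
import os
import shutil


def java_status():
    """Report where Java appears to be configured on this machine."""
    return {
        # None if the variable is not set
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        # None if no `java` executable is found on the PATH
        "java_on_path": shutil.which("java"),
    }


status = java_status()
for key, value in status.items():
    print(f"{key}: {value or 'not found'}")
```

If both values are missing, Java is probably not installed. If only `JAVA_HOME` is missing, PySpark may still start, since it can fall back to the `java` executable on the PATH.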

# SparkContext

* Main entry point of a Spark application
* SparkContext sets up everything and establishes a connection for writing Spark code
* Used to create RDDs, access Spark services and run Spark jobs.

In [None]:
from pyspark import SparkContext
# sc = SparkContext()
sc = SparkContext.getOrCreate()

## AI assistants!

Don't ask me - ask them. They are always there for you.

This is the new Google / Stack Overflow

## AI Shadow Code

Shadow code gives you coding suggestions.

Hint: Type a comment for what you want to do - then it gives you a better suggestion!

```
# print a greeting
<print("Hello, world")>
```

In [None]:
# print a greeting


## VS Code tips and tricks

1. Run a local AI model with Jan.ai and Continue.dev plugin

2. Run Jupyter notebooks by installing Jupyter plugin and writing a magic cell separator string `#%%`

3. Use Python brackets for multi-line code
```
# df is any Spark dataframe; the empty calls are placeholders
df.filter().select(a,b,c).join().filter().display()

# is the same as:
(
    df
    .filter()
    .select(
        a,
        b,
        c
    )
    .join()
    .filter()
    .display()
)
```

## Continue.dev

Open the config file and enter the settings of your server.
Simple-to-use servers are Jan.ai or Ollama.

```
  "models": [
    {
      "title": "Phind CodeLlama",
      "provider": "openai-aiohttp",
      "model": "phind-codellama-34b",
      "api_base": "http://127.0.0.1:5000"
    }
  ],
```

# Exercise 1

1. Have an AI assistant ready
2. (optional but suggested) Connect AI Shadow Code assistant
3. Load `data/titanic.csv` (or any other) dataset in Spark
4. Print how many lines you have
5. Print a few lines of data

<center>
<img src="./images/00_break.png" style="width:1100px">
</center>

# 4. Basics of PySpark

https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html

# 5. PySpark and SQL

https://docs.databricks.com/en/getting-started/dataframes-python.html

# 6. PySpark operations demo

## Operations
- load, print, save
- read with schema
- count and take a few rows
- filter
- select with its features: change types, rename
- intro to functions inside "select"
- groupby-agg thingy

## SQL 
- inline sql queries
- register dataframe as table
- general SQL
- getting data back as dataframe

## More complex
- joins of different kinds
- window functions

## UDFs and Pandas UDFs
- writing custom functions

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# 7. Exercise/homework: Analyze shopping trends data

1. Load the dataset into a PySpark DataFrame, converting the purchase_amount column to a numeric type.
2. Group the data by age and customer ID to compute total purchases for each customer in every age group.
3. Compute statistics such as mean, median, standard deviation, and quartiles for the total purchase amounts across customers.
4. Use PySpark's SQL API to create a view with customer IDs and total purchases.
5. Write SQL queries to find the following:
    1. The top 10 customers with the highest total purchases.
    2. The average purchase amount per age for all customers, grouped by gender.
    3. The total number of purchases made in each age across all customers.
6. Visualize the results using Matplotlib or other libraries to create bar and line charts.
7. Submit a Jupyter notebook with results.

<center><img src="images/00_thats_all.jpg" style="width:1200px"></center>