# Big Data Analysis with PySpark [BDAS2]

This course teaches you to process and analyze data of any size. Starting from first steps in Python programming, we then cover basic data analysis methods and the Python tools for applying them. For truly big data sets, we teach how to harness the computing power of a cluster with PySpark.

The course requires basic knowledge of Python programming and the fundamentals of data science and data handling. These prerequisites are covered in [📓 **Data Analysis with Python**](../index/dap2-data-analysis-python.ipynb).

## Table of Contents

### Fundamentals
 

1. [**Processing Big Data**](../python/python-big-data.ipynb)<br>
   What strategies do we have available to compute efficiently with increasing amounts of data?  What is a cluster, and when do we need one?
   
1. [**Introducing Apache Spark**](../spark/spark-introduction.ipynb)<br> 
    What is Spark all about, and what are its components?
   
1. [**Spark Fundamentals**](../spark/spark-fundamentals.ipynb)<br>
   An introduction to the fundamental concepts as well as core data structures and operations.
   
1. [**Submitting Spark Jobs**](../spark/spark-submitting.ipynb)<br>
   How to submit jobs to a Spark cluster for batch processing.
   
1. [**Structured Data**](../spark/spark-structured-data.ipynb)<br>
   Working with tabular data in Spark.

1. **[Streaming Data](../spark/spark-streaming.ipynb)**<br>
   Processing large-scale live data streams.

1. **[Graph Data](../spark/spark-graph.ipynb)**<br>
   Working with graph data using the GraphFrames extension module.



### Machine Learning

1. [**Introduction to Machine Learning**](../ml/ml-outlook.ipynb)<br>
    An overview of the field of machine learning.

1. **[ML for Classification](../ml/ml-classification-intro.ipynb)**<br>
      Learn about classifiers and how to measure the quality of their decisions.

1. [**Building a Pipeline for Classification with Spark ML**](../spark/spark-ml-pipeline.ipynb)<br>
    Build a classification model and learn about the building blocks of ML in PySpark.

### Exercises
   
1. [**Exercise: Counting Bigrams**](../exercises/spark-exercise-bigrams.ipynb)<br>
    Using Spark to count bigrams in big text data.


### Exercises: Solution Examples
   
1. [**Spark Exercise Solutions**](../exercises/spark-exercise-solutions.ipynb)


### Additional Resources

- [**Test Notebook**](../test.ipynb)<br>
    Verify that your Python stack is working.

- [**Jupyter Cheat Sheet**](../jupyter/cheatsheet.ipynb)<br>
    Some useful commands for Jupyter Notebook, mostly optional.
    
- [**Spark Test Notebook**](../spark/spark-test.ipynb)<br>
    Verify that your PySpark stack is working.

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_