# (Big) Data Analysis with Python and PySpark [BDAPS3]

This course teaches you to process and analyze data of any size. Starting from first steps in Python programming, we move on to basic data analysis methods and the Python tools for applying them. For truly big data sets, we teach how to harness the computing power of a cluster via PySpark.

## Table of Contents

### Curriculum

1. [**Python Basics**](../python/python-basics.ipynb)<br>
    Learn the basics of the Python programming language.
    
1. [**Efficient Computing with numpy**](../python/python-scientific-numpy.ipynb)<br>
    Apply the `numpy` library to compute efficiently with large amounts of data.
    
1. [**Data Handling with pandas**](../python/python-data-handling-pandas.ipynb)<br>
    Learn to work with tabular data, supported by the `pandas` library.

1. [**Plotting and Data Visualization**](../python/python-datavis.ipynb)<br>
    Visualize data with plots.

1. [**Introduction to Statistics**](../stats/stats-basics.ipynb)<br>
    First steps with statistics concepts needed for data analysis.

1. [**Processing Big Data**](../python/python-big-data.ipynb)<br>
   Which strategies are available for computing efficiently with increasing amounts of data? What is a cluster, and when do we need one?
   
1. [**Spark Fundamentals**](../spark/spark-fundamentals.ipynb)<br>
   An overview of Spark, a framework for programming distributed computation, and of PySpark, its Python API, including core data structures and operations.
   
1. [**Submitting Spark Jobs**](../spark/spark-submitting.ipynb)<br>
   How to submit jobs to a Spark cluster for batch processing.
   
1. [**Spark and Structured Data**](../spark/spark-structured-data.ipynb)<br>
   Working with structured data in Spark.
   



### Bonus Material

1. [**Handling Time Series with pandas**](../python/python-timeseries-pandas.ipynb)

1. [**Outlook: Machine Learning**](../ml/ml-outlook.ipynb)

### Exercises

1. [**Exercise: Museums of France**](../exercises/exercise-museums.ipynb)<br>
    An exercise with a clear task, requiring you to apply what you learned in the course.
   
1. [**Exercise: Titanic**](../exercises/exercise-titanic.ipynb)<br>
    An open-ended exercise to practice answering questions with data.
   
1. [**Exercise: Counting Bigrams**](../spark/spark-exercise-bigrams.ipynb)<br>
    Using Spark to count bigrams in big text data.


### Exercises: Solution Examples

1. [**Exercise Solution: Museums of France**](../exercises/spark-exercise-museums-solution.ipynb)
   
1. [**Spark Exercise Solutions**](../exercises/spark-exercise-solutions.ipynb)


### Additional Resources

- [**Test Notebook**](./../test.ipynb)<br>
    Verify that your Python stack is working.

- [**Jupyter Cheat Sheet**](../jupyter/cheatsheet.ipynb)<br>
    Some useful commands for Jupyter Notebook, mostly optional.
    
- [**Spark Test Notebook**](../spark/spark-test.ipynb)<br>
    Verify that your PySpark stack is working.

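If you want a quick check before opening the test notebooks, the following is a minimal sketch of one way to see whether the libraries named in the curriculum can be found in your environment. It is not the content of the test notebooks themselves; the helper name `check_modules` and the list `COURSE_MODULES` are assumptions made for this illustration. It only checks that each package can be located and does not import it or start Spark.

```python
from importlib.util import find_spec


def check_modules(names):
    """Map each top-level module name to True if it can be located, else False."""
    return {name: find_spec(name) is not None for name in names}


# Assumed package list, based on the libraries named in the curriculum above.
COURSE_MODULES = ["numpy", "pandas", "pyspark"]

if __name__ == "__main__":
    # Print one status line per package.
    for name, available in check_modules(COURSE_MODULES).items():
        print(f"{name}: {'found' if available else 'MISSING'}")
```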
---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_