Lecture: AI I - Basics 

Previous:
[**Chapter 3.5: Preprocessing with Pandas**](../03_data/05_preprocessing.ipynb)

---

# Chapter 3.6: Additional Libraries and Tools

- [Dask](#dask)
- [DuckDB](#duckdb)
- [YData Profiling](#ydata-profiling)
- [Polars](#polars)
- [Apache Arrow & PyArrow](#apache-arrow--pyarrow)
- [RAPIDS](#rapids)
- [Plotly](#plotly)
- [SciPy](#scipy)
- [Statsmodels](#statsmodels)


## Dask

[Dask](https://docs.dask.org/en/stable/) is a parallel computing library in Python that enables scalable data analysis by extending familiar tools like NumPy, pandas, and scikit-learn to larger-than-memory datasets and distributed environments. It is designed for workloads that don’t fit into a single machine’s memory or need to take advantage of multiple CPU cores or even entire clusters:

- Big Data Handling: Works with datasets that exceed memory by splitting them into smaller chunks and processing them in parallel.
- Familiar APIs: Provides collections that mirror much of the API of popular libraries:
    - `dask.array` → NumPy-like parallel arrays.
    - `dask.dataframe` → pandas-like parallel DataFrames.
    - `dask.delayed` → Parallelize arbitrary Python code.
- Scalability: Runs on a laptop, multicore machine, or distributed cluster with minimal code changes.
- Task Scheduling: Uses a dynamic task scheduler to manage computations efficiently.
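
The `dask.delayed` entry in the list above can be illustrated with a minimal sketch. The helper functions `square` and `total` are hypothetical names chosen for this example; the point is only that decorated calls build a task graph and nothing runs until `.compute()` is called.

```python
from dask import delayed


@delayed
def square(x):
    # Calling this function records a task instead of computing a value.
    return x * x


@delayed
def total(values):
    # Receives the already-computed squares once the graph is executed.
    return sum(values)


# Build the task graph; no work has been done at this point.
lazy_result = total([square(i) for i in range(5)])

# Execute the graph; independent `square` tasks may run in parallel.
result = lazy_result.compute()
print(result)
```

Because the five `square` tasks do not depend on each other, the scheduler is free to run them concurrently before `total` combines them.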

With a glob pattern, `dd.read_csv` reads multiple CSV files into a single Dask DataFrame:

In [1]:
import dask.dataframe as dd

df = dd.read_csv("data/dask/students*.csv")
df

,studentid,firstname,lastname,age,course
npartitions=2,,,,,
,int64,string,string,int64,string
,...,...,...,...,...
,...,...,...,...,...


Dask evaluates lazily: operations only build a task graph, and nothing runs until you explicitly call the `.compute()` method. This is why the output above shows only column names, dtypes, and the number of partitions rather than actual data.

In [2]:
df.compute()

,studentid,firstname,lastname,age,course
0,1,Alice,Smith,20,Math
1,2,Bob,Johnson,22,CS
2,3,Charlie,Williams,21,Physics
3,4,Diana,Brown,23,Biology
4,5,Ethan,Jones,22,Math
5,6,Fiona,Garcia,20,History
6,7,George,Miller,24,CS
7,8,Hannah,Davis,21,Physics
8,9,Ian,Martinez,23,Biology
9,10,Julia,Taylor,22,History


Here, `dask.dataframe` works almost like pandas but processes the partitions in parallel across cores. The aggregation is only planned until `.compute()` triggers the actual execution:

In [3]:
df.groupby("course")["age"].mean().compute()

course
Biology    22.25
CS         23.50
History    21.25
Math       21.25
Physics    21.25
Name: age, dtype: float64

## DuckDB

[DuckDB](https://duckdb.org/docs/stable/clients/python/overview.html) is an in-process SQL OLAP database designed for analytical workloads. Unlike traditional client-server database systems, DuckDB runs entirely inside your Python (or R, C++, etc.) process, which makes it lightweight and easy to integrate into data workflows. It targets fast analytical queries on large datasets rather than transaction-heavy use cases. DuckDB can query pandas, Polars, and PyArrow objects directly: a SQL statement may refer to a DataFrame by its Python variable name, and Arrow-backed data can often be scanned without copying. In the cell below, `student_df` in the `FROM` clause is the pandas DataFrame created on the line before the query.

In [4]:
import duckdb

student_df = df.compute()

duckdb.sql("SELECT course, AVG(age) FROM student_df GROUP BY course").to_df()


,course,avg(age)
0,CS,23.5
1,History,21.25
2,Biology,22.25
3,Physics,21.25
4,Math,21.25


## YData Profiling

[YData Profiling](https://docs.profiling.ydata.ai/latest/) (formerly known as Pandas Profiling) is a Python library that automatically generates detailed exploratory data analysis (EDA) reports from a pandas DataFrame. Instead of manually writing code to inspect missing values, distributions, or correlations, YData Profiling creates an interactive report that summarizes your dataset in just a few lines of code.

In [5]:
from ydata_profiling import ProfileReport

ProfileReport(student_df)



## Polars

[Polars](https://docs.pola.rs/) is a high-performance DataFrame library built in Rust with Python bindings, designed for speed, memory efficiency, and parallelism. It uses Apache Arrow as its memory model and offers both an eager API (pandas-like, immediate execution) and a lazy API (query planning + optimization before execution). Polars excels on multi-core machines, large datasets, and complex transformations where predicate pushdown, projection pruning, and parallel execution matter.

## Apache Arrow & PyArrow

[PyArrow](https://arrow.apache.org/docs/python/index.html) is the Python interface to Apache Arrow, a cross-language, columnar in-memory data format designed for high-performance data analytics. It enables fast and efficient data interchange between systems without serialization overhead and forms the backbone of many modern Python data tools.

## RAPIDS

[RAPIDS](https://docs.rapids.ai/user-guide/) is an open-source suite of libraries developed by NVIDIA that brings GPU acceleration to the Python data science ecosystem. It lets data scientists and machine learning practitioners run end-to-end workflows, from data preparation to model training, entirely on NVIDIA GPUs, which can yield substantial speedups over CPU-based tools on large, parallelizable workloads. Its libraries include:

- cuDF: A pandas-like DataFrame library accelerated by GPUs for fast data manipulation.
- cuML: A machine learning library with scikit-learn-like APIs but GPU-backed for faster training and inference.
- ...

## Plotly

[Plotly](https://plotly.com/python/) is a Python library for creating interactive, web-based visualizations. It supports a wide range of chart types, from simple line and bar charts to 3D surfaces and maps, which makes it useful for both data exploration and presentation. Unlike matplotlib or seaborn figures, which are static by default, Plotly charts are interactive out of the box: users can zoom, pan, hover for details, and export the view.

## SciPy

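[SciPy](https://docs.scipy.org/doc/scipy/) is a core library for scientific and technical computing in Python. It builds on top of NumPy, extending it with a wide range of efficient algorithms and functions for mathematics, science, and engineering. While NumPy provides fast arrays and basic numerical tools, SciPy adds specialized submodules for more advanced computations, such as `scipy.optimize` (optimization and root finding), `scipy.integrate` (numerical integration), `scipy.linalg` (linear algebra), `scipy.stats` (statistical distributions and tests), and `scipy.signal` (signal processing).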
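
Two of SciPy's submodules, `scipy.integrate` and `scipy.optimize`, can be sketched in a few lines. The functions being integrated and minimized are toy examples chosen so the exact answers are known.

```python
from scipy import integrate, optimize

# Numerically integrate x^2 from 0 to 1; the exact value is 1/3.
area, abs_error = integrate.quad(lambda x: x**2, 0.0, 1.0)
print(area, abs_error)

# Find the minimum of (x - 2)^2; the exact minimizer is x = 2.
minimum = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
print(minimum.x, minimum.fun)
```

`quad` returns both the estimate and an estimate of its absolute error, which is useful for judging whether the result can be trusted.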

## Statsmodels

[Statsmodels](https://www.statsmodels.org/stable/index.html) is a Python library that provides a wide range of tools for statistical analysis, hypothesis testing, and econometrics. Unlike scikit-learn, which focuses on machine learning and prediction, Statsmodels emphasizes statistical inference, making it ideal for situations where understanding the underlying model and its parameters is as important as the predictions themselves.

---

Next: [**Chapter 4.1: Data Preparation**](../04_ml/01_data_preparation.ipynb)