# Polars for Data Science
## From zero to tackling real-world data science problems with Polars

### Dataset
For this tutorial, we will use the well-known Iris dataset.

# Getting Started
## Install Polars, become familiar with Polars's lazy mode versus in-memory (eager) mode, and understand how to leverage Polars's query optimization.

### Loading the Data with Polars
To start, we'll load the Iris dataset using `polars.read_csv()` and inspect it with `polars.DataFrame.head()`. Polars is explicit about datatypes: every column has a single datatype, which is printed beneath the column name whenever the DataFrame is displayed. This will be helpful when we dive into the Expression API later.

In [1]:
import polars as pl

# Load the data
df = pl.read_csv('iris.csv')

# Inspect the first few rows
print(df.head())

### Selecting Specific Columns
Next, we will select a few columns from the DataFrame using `polars.DataFrame.select()`. This is useful for focusing on specific parts of the data.

In [2]:
# Select specific columns
selected_df = df.select(['sepal_length', 'species'])

# Display the selected columns
print(selected_df.head())

### Lazy Loading with Polars
We can also load the data lazily using `polars.scan_csv()`. Unlike `read_csv()`, `scan_csv()` does not load the data into memory immediately. Instead, it creates a LazyFrame, which allows for query optimization before the data is loaded.

In [3]:
# Lazy load the data
lazy_df = pl.scan_csv('iris.csv')

# Printing a LazyFrame shows its query plan, not rows, because nothing has been executed yet
print(lazy_df.head())

### Selecting Columns in Lazy Mode
We can select columns in a LazyFrame just like we did with a DataFrame. Additionally, we can use `LazyFrame.show_graph()` to visualize the query plan (this requires Graphviz to be installed). The plan shows that the column selection is pushed down into the CSV scan, so columns that aren't selected are never read into memory.

In [4]:
# Select specific columns lazily
lazy_selected_df = lazy_df.select(['sepal_length', 'species'])

# Show the query plan graph
lazy_selected_df.show_graph()

### Converting LazyFrame to DataFrame
To execute the lazy query and load the data into memory, we can use `LazyFrame.collect()`. This converts the LazyFrame into a DataFrame.

In [5]:
# Convert LazyFrame to DataFrame
collected_df = lazy_selected_df.collect()

# Display the collected DataFrame
print(collected_df.head())

# Data Manipulation I: Basics
## Become familiar with the Polars API, and be able to perform basic selecting and filtering queries.

### Understanding Polars API Classes
Polars provides four main classes of tools for data manipulation: query clauses (`select`, `filter`, `sort`, `group_by`, `agg`, etc.), column expressions (`pl.col()`), collection functions and attributes (`collect`, `head`, `shape`), and miscellaneous functions (`value_counts`, `transpose`, `concat`, `plot`).

### Polars vs. SQL Syntax
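Let's compare Polars syntax to SQL syntax. Many Polars query clauses map directly onto SQL clauses: `select` corresponds to `SELECT`, `filter` to `WHERE`, `sort` to `ORDER BY`, and `group_by` to `GROUP BY`. The difference is that a Polars query is built from composable Python expressions rather than from a query string.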

In [6]:
# Filter rows in Polars (similar to SQL's WHERE clause)
filtered_df = df.filter(pl.col('sepal_length') > 5.0)

# Display the filtered DataFrame
print(filtered_df.head())

### Exploring the Expression API
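The Expression API in Polars is powerful and flexible. Here are a few examples of expressions inside `select` statements, including `pl.col()`, `Expr.alias()`, `Expr.name.suffix()`, and `Expr.ne()`. Note that the older `Expr.suffix()` spelling is deprecated in favor of `Expr.name.suffix()`.

In [7]:
# Using expressions in select statements
expr_df = df.select([
    # Rename a column
    pl.col('sepal_length').alias('length'),
    # Append a suffix to the original column name (result: 'sepal_width_cm')
    pl.col('sepal_width').name.suffix('_cm'),
    # Element-wise "not equal" comparison, producing a boolean column
    pl.col('petal_length').ne(1.4).alias('not_1.4')
])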

# Display the DataFrame with expressions applied
print(expr_df.head())

### Basic Aggregations with Expressions
With the Expression API, you can perform basic aggregations, such as calculating the maximum and minimum values of columns. An aggregation inside `select` reduces each column to a single value, so the result has one row.

In [8]:
# Basic aggregations
aggregations_df = df.select([
    pl.col('sepal_length').max().alias('max_length'),
    pl.col('sepal_width').min().alias('min_width')
])

# Display the aggregated DataFrame
print(aggregations_df)

### Exploring the Expression API Documentation
If you want to explore the kinds of expressions you can build, check out the [Expression API docs](https://pola-rs.github.io/polars/py-polars/html/reference/expressions.html). Expression functions are organized by namespace, with datatype-specific namespaces such as `.str` for strings, `.dt` for dates and times, and `.list` for list columns.