
# Pandas vs. Polars: A Guide for the AWS-savvy Machine Learning Researcher

Name: John Hodge

Date: 10/04/2024

As a machine learning researcher working with large datasets on AWS, you have likely run into the performance and memory limits of Pandas. Polars, a DataFrame library written in Rust with Python bindings, offers a faster alternative for many workloads. This guide highlights the key differences between Pandas and Polars, focusing on aspects relevant to an AWS workflow built on S3, Athena, and SageMaker.

## 1. Core Differences & Advantages of Polars:

* **Lazy Evaluation:** Alongside its eager API, Polars offers a lazy API (`LazyFrame`, reached through the `scan_*` readers or `.lazy()`). In lazy mode it builds an execution plan first and only processes data when you call `.collect()`, which lets it skip unneeded work and can substantially reduce memory usage on large datasets. Pandas evaluates every operation eagerly.
* **Parallel Processing:** Polars uses multiple CPU cores by default for many operations, such as filters, joins, and aggregations, which speeds up data manipulation and analysis. Pandas is largely single-threaded.
* **Memory Efficiency:** Polars stores data in the Apache Arrow columnar memory format, which is compact and fast to scan. Pandas defaults to NumPy-backed columns, which can be less memory-efficient, especially for strings and mixed types stored as Python objects.
* **Query Optimization:** In lazy mode, the Polars optimizer rewrites the query plan before running it, for example by pushing filters and column selections down to the data source (predicate and projection pushdown). Pandas has no query optimizer.

## 2. Transitioning from Pandas to Polars:

### 2.1 Data Loading:

**Pandas:**

```python
import pandas as pd

# From S3 using boto3
import boto3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='your-bucket', Key='your-data.csv')
df = pd.read_csv(obj['Body'])

# From Parquet
df = pd.read_parquet('your-data.parquet')
```

**Polars:**

```python
import polars as pl

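# From S3 using fsspec (the s3fs package provides the 's3' backend)
import fsspec
fs = fsspec.filesystem('s3', profile='your-aws-profile')
# The context manager closes the S3 file handle after reading
with fs.open('s3://your-bucket/your-data.csv', 'rb') as f:
    df = pl.read_csv(f)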

# From Parquet
df = pl.read_parquet('your-data.parquet')
```

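**Note:** Polars can read from file-like objects opened with `fsspec` (backed by `s3fs` for S3), and recent Polars versions also accept cloud URIs such as `s3://...` directly in readers like `read_parquet` and `scan_parquet`.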

### 2.2 Data Manipulation:

**Pandas:**

```python
# Filtering
df_filtered = df[df['column_name'] > 10]

# Selecting columns
df_selected = df[['column_1', 'column_2']]

# Grouping and aggregation
df_grouped = df.groupby('column_name').agg({'other_column': 'sum'})
```

**Polars:**

```python
# Filtering
df_filtered = df.filter(pl.col('column_name') > 10)

# Selecting columns
df_selected = df.select(['column_1', 'column_2'])

# Grouping and aggregation
df_grouped = df.group_by('column_name').agg(pl.col('other_column').sum())
```

**Note:** Polars manipulates data through expressions built from `pl.col(...)`. Expressions compose cleanly, and the engine can optimize them and run them in parallel. Pandas instead relies on boolean indexing and chained method calls.

### 2.3 Data Exploration:

**Pandas:**

```python
df.head()
df.describe()
df.info()
```

**Polars:**

```python
df.head()
df.describe()
df.schema
```

**Note:** Polars provides exploration functionality similar to that of Pandas. It has no direct equivalent of `df.info()`; `df.schema` shows the column names and dtypes.

## 3. Utilizing Polars with AWS Tools:

* **S3 Integration:** Polars can read from and write to S3 through `fsspec`/`s3fs` file objects, and recent versions also accept `s3://` URIs directly in readers such as `scan_parquet`.
* **Athena Integration:** You can query data stored in S3 through Athena with PyAthena and convert the results to a Polars DataFrame (for example via Arrow or Pandas) for further analysis.
* **SageMaker Integration:** Use Polars within your SageMaker notebooks, processing jobs, or training scripts to speed up data loading and preprocessing, which shortens the data-preparation portion of a pipeline.


## 4. Example Workflow: Preprocessing data on S3 for SageMaker with Polars

```python
import polars as pl
import fsspec

# 1. Load data from S3
fs = fsspec.filesystem('s3', profile='your-aws-profile')
with fs.open('s3://your-bucket/your-data.parquet', 'rb') as f:
    df = pl.read_parquet(f)

# 2. Data preprocessing with Polars
#   - Filter irrelevant data
df = df.filter(pl.col('column_name') > 10)
#   - Feature engineering
df = df.with_columns(
    (pl.col('column_1') * pl.col('column_2')).alias('new_feature')
)

# 3. Save preprocessed data back to S3
#    (closing the file via the context manager completes the upload)
with fs.open('s3://your-bucket/preprocessed-data.parquet', 'wb') as f:
    df.write_parquet(f)
```

This example demonstrates how to load, preprocess, and save data on S3 using Polars, preparing it for training in your SageMaker environment.

## 5. Conclusion:

Polars offers an efficient alternative to Pandas for machine learning researchers dealing with large datasets on AWS. Its lazy API, parallel execution, and Arrow-based memory model can significantly accelerate data manipulation and analysis tasks. Combined with S3, Athena, and SageMaker, it can help you build faster and more scalable machine learning pipelines. This guide is a starting point for the transition; see the official Polars documentation for detailed information and advanced features.
