In [1]:
# import the libraries
import os
import polars as pl
import numpy as np
import pandas as pd

import time
import random

❓ Count the number of CPUs using the `os` module

In [25]:
# Count the number of CPUs
pass  # YOUR CODE HERE

💡 Pandas runs most operations on only one of these CPUs, while Polars is able to use all of them for **parallel processing**. Its ideal use case is data too big for pandas and too small for Spark. Polars offers both an eager and a **lazy API**. The lazy API is said to be ‘somewhat similar to Spark’. It lets Polars see the whole query and optimise it before it runs, promising ‘blazingly fast’ performance.

In [26]:
# Generate a list of random numbers
values = [random.random() for _ in range(100000)]

# Convert the list to a pandas DataFrame and a polars DataFrame
df_pandas = pd.DataFrame(values)
df_polars = pl.DataFrame(values)

# Compute the mean of the values in the pandas DataFrame
# (time.perf_counter() is the clock intended for measuring short durations)
start_time = time.perf_counter()
mean_pandas = df_pandas.mean()
elapsed_time = time.perf_counter() - start_time
print(f"Mean computation with pandas took {elapsed_time:.6f} seconds")

# Compute the mean of the values in the polars DataFrame
start_time = time.perf_counter()
mean_polars = df_polars.mean()
elapsed_time = time.perf_counter() - start_time
print(f"Mean computation with polars took {elapsed_time:.6f} seconds")

In [27]:
# Compute the standard deviation of the values in the pandas DataFrame
start_time = time.perf_counter()
std_pandas = df_pandas.std()
elapsed_time = time.perf_counter() - start_time
print(f"Std computation with pandas took {elapsed_time:.6f} seconds")

# Compute the standard deviation of the values in the polars DataFrame
start_time = time.perf_counter()
std_polars = df_polars.std()
elapsed_time = time.perf_counter() - start_time
print(f"Std computation with polars took {elapsed_time:.6f} seconds")
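💡 A single timed run like the ones above can be noisy, because caches and background processes affect each measurement. A minimal sketch of a more stable comparison, using the standard library's `timeit` module, could look like this. The helper name `best_time` and its default arguments are assumptions made for illustration, and a plain Python list stands in for the DataFrames so the cell runs on its own.

```python
import timeit


def best_time(func, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeated runs."""
    # timeit.repeat returns one total time per repetition of `number` calls
    totals = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other activity on the machine
    return min(totals) / number


# Stand-in workload: the mean of a plain Python list
values = list(range(100_000))
t = best_time(lambda: sum(values) / len(values))
print(f"Mean of a plain list took {t:.6f} seconds per call (best of 5 runs)")
```

The same helper can be pointed at the notebook's DataFrames, for example `best_time(df_pandas.mean)` and `best_time(df_polars.mean)`.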

### Let's load a CSV and see what lazy evaluation means

We first use pandas to load the data from S3 and save it locally as a CSV file, then scan that file with Polars in lazy mode.

In [23]:
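# Reading an s3:// URL with pandas requires the s3fs package to be installed
df = pd.read_csv('s3://wagon-public-datasets/data-engineering/W3D3-processing/data/winemag-data_first150k.csv')
# Save a local copy so that Polars can scan it from disk
df.to_csv('winemag-data_first150k.csv', index=False)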

In [24]:
lazy_df = pl.scan_csv('winemag-data_first150k.csv', ignore_errors=True)

In [25]:
lazy_df

🤔 Wait, the data is not loaded yet? What if we do some transformations on the data?

In [7]:
lazy_df.filter(
    (pl.col('country').is_not_null()) &
    (pl.col('country') != 'US-France')
)

💡 As we can see, nothing happens right away. From the documentation: ‘This is due to the lazyness, nothing will happen until specifically requested. This allows Polars to see the whole context of a query and optimize just in time for execution.’ Let's see the optimized version!

In [8]:
lazy_df.filter(
    (pl.col('country').is_not_null()) &
    (pl.col('country') != 'US-France')
).show_graph()

To actually see the results we can do two things: 

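1️⃣ **collect()** → runs the query over the full dataset

2️⃣ **fetch()** → runs the query over only the first `n_rows` rows of the source (500 by default)

Because `fetch()` limits the rows read from the source before the query runs, it can return fewer rows than requested once the filter is applied. It is meant for quick checks on a sample, while `collect()` produces the real result.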

In [26]:
%%time
pl.scan_csv('winemag-data_first150k.csv', ignore_errors=True).filter(
    (pl.col('country').is_not_null()) &
    (pl.col('country') != 'US-France')
).fetch(5)

🤔 What if we want to load the data from the CSV file using pandas?

In [27]:
%%time
pd.read_csv("winemag-data_first150k.csv", nrows=5)

🤯 It is faster to load and filter the data using Polars than it is to load the first few rows with pandas without any filtering. Parallel processing and query optimization can bring big advantages in terms of computational speed. Let's continue to PySpark to further explore these concepts! 🚀