# Polars: Part 4 - Working with Large Data Sets

There is a stark difference between large data and big data. Large data typically refers to data sets that are too large to fit into system memory but still small enough to fit on your machine's hard drive. Big data, by contrast, is too large to fit even on a single hard drive and is often distributed over multiple machines on a network.

Nonetheless, working with large data sets poses a challenge, because it requires careful consideration of how to import, transform, and analyze the data. Unlike Pandas, the Polars package supports so-called lazy evaluation, which means that any aggregation, transformation, or other operation on the data is only performed when the result is actually needed. This enables Polars to optimize the query in the background, ordering the operations in the most efficient way and avoiding a pile-up of DataFrame copies in memory.

Let us begin by importing `polars` into our namespace.

In [1]:
import polars as pl



##  Hard Drive Stats

For our example, we will work with [hard drive stats from Backblaze](https://www.backblaze.com/blog/backblaze-hard-drive-stats-q3-2020/), a provider of cloud storage. Backblaze collects and publishes statistics on its hard drives, including internal data reported by the drives themselves. In return, it hopes to get help from the data science community on how to better predict hard drive failures.



### Reading Multiple Files

Let us download their hard drive stats from 2020. The stats are split up into multiple CSV files, compressed into one ZIP file per quarter.

#### In-class exercise

Use Copilot to craft a little function that requests a number of zip files from the following URLs:
- https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q1_2020.zip
- https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q2_2020.zip
- https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q3_2020.zip
- https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q4_2020.zip

The function then unpacks the files into the directory ```csvs/``` and returns a list that contains the path of each CSV.

In [None]:
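# Download each ZIP archive into memory, unpack it into the given directory,
# and return the paths of all extracted CSV files as one flat list.
import io
import os
import zipfile

import requests

def download(urls, directory):
    paths = []
    for url in urls:
        r = requests.get(url)
        r.raise_for_status()  # stop early if a download failed
        # The archive only lives in memory, so there is no ZIP file on disk to delete.
        with zipfile.ZipFile(io.BytesIO(r.content)) as z:
            z.extractall(directory)
            # extend (not append) keeps the result a flat list of paths
            paths.extend(os.path.join(directory, name)
                         for name in z.namelist() if name.endswith(".csv"))
    return paths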

csvs = download([f"https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q{i}_2020.zip" for i in [1,2,3,4]], "csvs/")
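
The unpacking step can be tried out without any network access. The following sketch builds a small ZIP archive in memory and extracts it into a temporary directory; the helper name `extract_csvs` and the file names are made up for illustration, and only the standard library is used.

```python
import io
import os
import tempfile
import zipfile

def extract_csvs(zip_bytes, directory):
    """Unpack a ZIP archive given as bytes and return the paths of its CSV files."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        z.extractall(directory)
        return [os.path.join(directory, name)
                for name in z.namelist() if name.endswith(".csv")]

# Build a tiny archive in memory (hypothetical file names and contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("2020-01-01.csv", "date,failure\n2020-01-01,0\n")
    z.writestr("readme.txt", "not a CSV")

target = tempfile.mkdtemp()
paths = extract_csvs(buffer.getvalue(), target)
```

Only the CSV member ends up in `paths`, and each entry already includes the target directory, so the list can be passed on to a reader directly.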

## Lazy DataFrames

We now have a large directory full of CSVs that consumes about 16 GB of disk space. That is large enough to cause trouble when importing all of the data into a single DataFrame.

Instead of using ```read_csv```, we are now going to create a lazy DataFrame using ```scan_csv```. The advantage is that Polars will only begin importing the data once the method ```collect``` has been called. This gives us the chance to select only those columns that we are actually interested in. Let's take a peek at the data and see what secrets it holds for us.

In [2]:
lazy_df = pl.scan_csv("csvs/*.csv")
pl.DataFrame(lazy_df.schema)


date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,smart_3_raw,smart_4_normalized,smart_4_raw,smart_5_normalized,smart_5_raw,smart_7_normalized,smart_7_raw,smart_8_normalized,smart_8_raw,smart_9_normalized,smart_9_raw,smart_10_normalized,smart_10_raw,smart_11_normalized,smart_11_raw,smart_12_normalized,smart_12_raw,smart_13_normalized,smart_13_raw,smart_15_normalized,smart_15_raw,smart_16_normalized,smart_16_raw,smart_17_normalized,smart_17_raw,smart_18_normalized,smart_18_raw,…,smart_218_raw,smart_220_normalized,smart_220_raw,smart_222_normalized,smart_222_raw,smart_223_normalized,smart_223_raw,smart_224_normalized,smart_224_raw,smart_225_normalized,smart_225_raw,smart_226_normalized,smart_226_raw,smart_231_normalized,smart_231_raw,smart_232_normalized,smart_232_raw,smart_233_normalized,smart_233_raw,smart_235_normalized,smart_235_raw,smart_240_normalized,smart_240_raw,smart_241_normalized,smart_241_raw,smart_242_normalized,smart_242_raw,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,…,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object
Utf8,Utf8,Utf8,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Utf8,Utf8,Int64,Int64,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Int64,Int64,…,Utf8,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Utf8,Utf8,Int64,Int64,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Int64,Int64,Int64,Int64,Int64,Int64,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8,Utf8


That was quick! As we can see, the data contains the status of each drive on each day, information on the drive itself, as well as so-called SMART stats. While the SMART stats may indeed be interesting if we were to build a predictive model, we don't need them to calculate the drive days. Instead, we are going to select all columns except those that start with ```smart``` and then convert the date column from a string to a date.

In [3]:
df = (lazy_df
  .select(["date","serial_number","model","capacity_bytes","failure"])
  .with_columns(pl.col("date").str.strptime(pl.Date))
  .collect()
)
df.describe()

describe,date,serial_number,model,capacity_bytes,failure
str,str,str,str,f64,f64
"""count""","""52286398""","""52286398""","""52286398""",52286398.0,52286398.0
"""null_count""","""0""","""0""","""0""",0.0,0.0
"""mean""",,,,9255600000000.0,2.9e-05
"""std""",,,,3751200000000.0,0.005347
"""min""","""2020-01-01""","""0564f6f3fab900…","""DELLBOSS VD""",-1.0,0.0
"""25%""",,,,4000800000000.0,0.0
"""50%""",,,,12000000000000.0,0.0
"""75%""",,,,12000000000000.0,0.0
"""max""","""2020-12-31""","""fe78b975bfab00…","""WDC WUH721414A…",18000000000000.0,1.0


The final data set has 52.3 million rows. What is important is that we explicitly named the columns that we wanted to import in order to save memory. Despite this, we still end up with about 8 GB of data in memory.

## Parquet Format

A useful format for storing DataFrames is the Parquet format. Parquet is an open source, column-oriented data file format that provides efficient data compression and data type encoding to handle complex data in bulk. Unlike CSV, it stores dates and datetimes as such instead of converting them to strings.

In [None]:
df.write_parquet("concat_data_2021.parquet")

## Exercises

Let us now calculate the drive days, that is, the number of days between the first and the last record of each drive.

In [11]:
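# Native expressions avoid running a slow Python lambda for every group.
# Newer Polars releases spell these `group_by` and `dt.total_days()`.
drive_days = df.groupby("serial_number").agg(
    (pl.col("date").max() - pl.col("date").min()).dt.days().alias("drivedays"),
    pl.exclude("date").first()
    )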
drive_days.sample(10)

serial_number,drivedays,model,capacity_bytes,failure
str,i64,str,i64,i64
"""ZLW18E7Q""",108,"""ST14000NM001G""",14000519643136,0
"""ZJV2EMNP""",182,"""ST12000NM0007""",12000138625024,0
"""S300X152""",365,"""ST4000DM000""",4000787030016,0
"""ZA14ELAJ""",365,"""ST8000NM0055""",8001563222016,0
"""ZA14ELXG""",365,"""ST8000NM0055""",8001563222016,0
"""Z305D6YV""",365,"""ST4000DM000""",4000787030016,0
"""ZCH0CY1X""",155,"""ST12000NM0007""",12000138625024,0
"""PL1331LAHEPRKH…",365,"""HGST HMS5C4040…",4000787030016,0
"""ZJV4PQHV""",281,"""ST12000NM0007""",12000138625024,0
"""ZA17Z5CZ""",365,"""ST8000NM0055""",8001563222016,0


In [13]:
# plot average drivedays grouped by model as horizontal bar chart using plotly express
import plotly.express as px
px.bar(drive_days.groupby("model").agg(pl.col("drivedays").mean()).sort("drivedays"), y="model", x="drivedays")
