# Polars Tutorial and Demonstration
> ### Jonathan Scofield
#### This notebook will help you set up Polars on your computer and query a CSV file. For more information about Polars, please visit the official [website](https://pola.rs/). <br>
#### We will be using public real estate sales data from the State of Connecticut for properties valued at $2,000 or more, covering the years 2001 through 2022.
#### The data used for this project is in the public domain and can be found [here](https://catalog.data.gov/dataset/real-estate-sales-2001-2018).

## Setup <br>
#### You need to have a version of Python installed equal to or greater than 3.10.
#### Use this command to install Polars:
>pip install 'polars[all]' 

In [1]:
!pip show polars

Name: polars
Version: 1.13.1
Summary: Blazingly fast DataFrame library
Home-page: 
Author: 
Author-email: Ritchie Vink <ritchie46@gmail.com>
License: 
Location: /opt/anaconda3/lib/python3.11/site-packages
Requires: 
Required-by: 


# Using the SQL Interface

#### Import required modules.

In [2]:
# "pl" is the conventional alias for the Polars library
import polars as pl
import os

#### Let's take a look at the size of the CSV file we want to examine.

In [3]:
# Get the size of the source .csv file

f"{round(os.path.getsize(r'Real_Estate_Sales_2001-2022_GL.csv') / (1024 ** 2), 2)} MB"

FileNotFoundError: [Errno 2] No such file or directory: 'Real_Estate_Sales_2001-2022_GL.csv'
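The `FileNotFoundError` above means the CSV was not in the notebook's working directory when the cell ran. A minimal sketch of a more forgiving size check follows; the helper name `file_size_mb` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def file_size_mb(path: str) -> Optional[float]:
    """Return the file size in MB, or None if the file does not exist."""
    p = Path(path)
    if not p.is_file():
        return None
    return round(p.stat().st_size / (1024 ** 2), 2)


size = file_size_mb("Real_Estate_Sales_2001-2022_GL.csv")
# Report a readable message instead of raising when the file is missing
print(f"{size} MB" if size is not None else "CSV not found; check the working directory")
```

Returning `None` rather than raising keeps the rest of the notebook runnable while the path is being sorted out.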

#### It is fairly large, so we will scan it to a LazyFrame.

In [None]:
# Create a LazyFrame that infers schema based on first 1000 rows

df = pl.scan_csv( #We are scanning, not reading
    'Real_Estate_Sales_2001-2022_GL.csv', 
    ignore_errors = False,   # Raise on parse errors (True would keep reading past them)
    infer_schema_length = 1000, # Number of rows sampled for schema inference
    low_memory = True, # Favor lower memory use over speed
    try_parse_dates = True, # Try to parse date-like columns automatically
)

#### After scanning, we can see the proposed schema from the given sample size:

In [None]:
df.collect_schema() # Resolve and view the inferred schema



Schema([('Serial Number', Int64),
        ('List Year', Int64),
        ('Date Recorded', String),
        ('Town', String),
        ('Address', String),
        ('Assessed Value', Float64),
        ('Sale Amount', Float64),
        ('Sales Ratio', Float64),
        ('Property Type', String),
        ('Residential Type', String),
        ('Non Use Code', String),
        ('Assessor Remarks', String),
        ('OPM remarks', String),
        ('Location', String)])

#### Because this is a LazyFrame, Polars does not compute anything, including a row count, until we call **collect()**.

In [None]:
df.select(pl.len()) # Only builds a lazy query plan; no count is computed until .collect() is called



## Querying the Data

#### We can query the data as if it were a SQL database using the following syntax:

In [None]:
# Count records for each year

select_df = pl.SQLContext(register_globals = True).execute(
   ''' 
   SELECT
        "List Year",
        count("List Year") as "Record Count"
    FROM 
        df
    GROUP BY "List Year"
    ORDER BY "List Year"
    '''
)

#### To view the data, we must call the **collect()** method.

In [None]:
select_df.collect(streaming = True) # Perform query and load into memory

List Year,Record Count
i64,u32
2001,59584
2002,59682
2003,64239
2004,84056
2005,61602
…,…
2018,50709
2019,58954
2020,66592
2021,56946


#### We can perform most basic SQL queries on the data. Let's try some string manipulation:

In [None]:
# String manipulation with filtering

pl.SQLContext(register_globals = True).execute(
   ''' 
   SELECT
       "Town",
       "Residential Type",
       upper(trim("Town")) + '-' + upper(trim("Residential Type")) as "New Column",
       "Assessed Value"
    FROM 
        df
    WHERE 
        "List Year" = 2021 AND 
        "Assessed Value" > 1000000 
        AND "Residential Type" IS NOT NULL
    ORDER BY 
        "Assessed Value" DESC
   LIMIT 5
    '''
).collect(streaming = True)


Town,Residential Type,New Column,Assessed Value
str,str,str,f64
"""New Canaan""","""Condo""","""NEW CANAAN-CONDO""",37913540.0
"""New Canaan""","""Condo""","""NEW CANAAN-CONDO""",37913540.0
"""New Canaan""","""Condo""","""NEW CANAAN-CONDO""",37913540.0
"""Darien""","""Single Family""","""DARIEN-SINGLE FAMILY""",35030100.0
"""Darien""","""Single Family""","""DARIEN-SINGLE FAMILY""",35030100.0


In [None]:
# String manipulation with calculated column

highest_premium_df = pl.SQLContext(register_globals = True).execute(
   ''' 
   SELECT
        TRIM(UPPER("Town")),
        "Address",
       ("Sale Amount" - "Assessed Value") AS "Premium"
    FROM 
        df
    WHERE 
        "List Year" = 2021 AND 
        "Town" IS NOT NULL
    ORDER BY
       "Premium" DESC
    LIMIT 1
    '''
).collect(streaming = True)

highest_premium_df

Town,Address,Premium
str,str,f64
"""STAMFORD""","""695 EAST MAIN STREET""",209100000.0


# Using the Polars Expression Syntax

#### The Polars Expression Syntax will be immediately familiar to anyone who has worked with PySpark. It offers additional granularity and a "Pythonic" syntax for manipulating data.

## Getting Data Profile

In [None]:
# Descriptive statistics for entire DataFrame

df.describe()

statistic,Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location
str,f64,f64,str,str,str,f64,f64,f64,str,str,str,str,str,str
"""count""",1097629.0,1097629.0,"""1097627""","""1097629""","""1097578""",1097629.0,1097629.0,1097629.0,"""715183""","""699240""","""313451""","""171236""","""13031""","""298111"""
"""null_count""",0.0,0.0,"""2""","""0""","""51""",0.0,0.0,0.0,"""382446""","""398389""","""784178""","""926393""","""1084598""","""799518"""
"""mean""",537035.693168,2011.218395,,,,281801.578617,405314.559762,9.603926,,,,,,
"""std""",7526100.0,6.773485,,,,1657900.0,5143500.0,1801.663865,,,,,,
"""min""",0.0,2001.0,"""01/01/2002""","""***Unknown***""","""#110 &L77 RANSOM HALL RD""",0.0,0.0,0.0,"""Apartments""","""Condo""","""01 - Family""","""""AS IS SALE""""","""#190309 HAS SAME REMARK AND IS…","""POINT (-121.23091 40.30336)"""
"""25%""",30713.0,2005.0,,,,89090.0,145000.0,0.477867,,,,,,
"""50%""",80706.0,2011.0,,,,140580.0,233000.0,0.610566,,,,,,
"""75%""",170341.0,2018.0,,,,228270.0,375000.0,0.77072,,,,,,
"""max""",2000500000.0,2022.0,"""12/31/2021""","""Woodstock""","""parking space only""",881510000.0,5000000000.0,1226420.0,"""Vacant Land""","""Two Family""","""Single Family""","""�non-market transaction includ…","""town site shows assessment as …","""POINT (-89.50175 34.34596)"""


## Basic Selecting and Filtering

#### Basic Selecting

In [None]:
# Count the non-null values in the 'Serial Number' column

df.select(pl.col('Serial Number').count()).collect()

Serial Number
u32
1097629


#### Selecting with Sort and Limit

In [None]:
# Take the first 5 rows, then sort those 5 rows by 'Serial Number'

df.select(['Serial Number', 'Assessor Remarks']
          ).limit(5).sort(by = 'Serial Number').collect()

Serial Number,Assessor Remarks
i64,str
200500,
2020090,
2020177,
2020225,
2020348,


#### Select with Filter

In [None]:
# Keep rows with non-null 'Assessor Remarks', take the first 5, then sort them

df.select(['Serial Number', 'Assessor Remarks']
          ).filter(pl.col('Assessor Remarks').is_not_null()
                   ).limit(5).sort(by = 'Serial Number').collect()

Serial Number,Assessor Remarks
i64,str
20058,"""2003 COLONIAL, 2140 SFLA, 2.99…"
20093,"""ESTATE SALE"""
200110,"""BAA OVERRIDE"""
200142,"""R/C/8"""
200207,"""MULTIPLE LOT SALE"""


#### Select with Transformation and Filter

In [None]:
# Recreate the sales ratio with a column-level calculation

df.select(
    pl.col('List Year').alias('Year'), 
    (pl.col('Assessed Value') / pl.col('Sale Amount')).alias('Sales Ratio')
    ).filter(
        (pl.col('Year') >= 2020) & (pl.col('Sales Ratio') >= 0.8)
    ).sort(by = ['Year', 'Sales Ratio'], descending = [False, True]).collect()

Year,Sales Ratio
i64,f64
2020,679.500889
2020,671.8894
2020,362.841333
2020,318.034706
2020,213.5675
…,…
2022,0.8001
2022,0.8
2022,0.8
2022,0.8


## Adding/Transforming Columns

#### Transforming Existing Columns

In [None]:
# Coalesce remarks and default to "N/A"

df.with_columns(
    pl.coalesce(pl.col(['Assessor Remarks', 'OPM remarks']), pl.lit('N/A')),
    pl.coalesce(pl.col(['OPM remarks', 'Assessor Remarks']), pl.lit('N/A'))
    ).limit(5).collect()

Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location
i64,i64,str,str,str,f64,f64,f64,str,str,str,str,str,str
2020177,2020,"""04/14/2021""","""Ansonia""","""323 BEAVER ST""",133000.0,248400.0,0.5354,"""Residential""","""Single Family""",,"""N/A""","""N/A""","""POINT (-73.06822 41.35014)"""
2020225,2020,"""05/26/2021""","""Ansonia""","""152 JACKSON ST""",110500.0,239900.0,0.4606,"""Residential""","""Three Family""",,"""N/A""","""N/A""",
2020348,2020,"""09/13/2021""","""Ansonia""","""230 WAKELEE AVE""",150500.0,325000.0,0.463,"""Commercial""",,,"""N/A""","""N/A""",
2020090,2020,"""12/14/2020""","""Ansonia""","""57 PLATT ST""",127400.0,202500.0,0.6291,"""Residential""","""Two Family""",,"""N/A""","""N/A""",
200500,2020,"""09/07/2021""","""Avon""","""245 NEW ROAD""",217640.0,400000.0,0.5441,"""Residential""","""Single Family""",,"""N/A""","""N/A""",


#### Creating a New Column

In [None]:
# Create a new column of the struct type

df.filter(
    (pl.col('Assessor Remarks').is_not_null()) & 
    (pl.col('OPM remarks').is_not_null())
    ).with_columns(
        pl.struct(pl.col(['Assessor Remarks','OPM remarks'])).alias('Combined Remarks')
        ).limit(5).collect()

Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location,Combined Remarks
i64,i64,str,str,str,f64,f64,f64,str,str,str,str,str,str,struct[2]
201312,2020,"""08/26/2021""","""Griswold""","""17-19 SCHOOL RD""",108430.0,150000.0,0.722867,"""Residential""","""Two Family""","""07 - Change in Property""","""DUPLEX""","""RENOVATED ONE UNIT PER MLS - S…",,"{""DUPLEX"",""RENOVATED ONE UNIT PER MLS - SEE PREVIOUS SALE #201127""}"
200052,2020,"""02/04/2021""","""Goshen""","""36 SANDY BEACH ROAD""",211730.0,290000.0,0.730103,"""Residential""","""Single Family""","""07 - Change in Property""","""current structure no value/ mu…","""PER MLS SALE IS FOR LAND - ASS…",,"{""current structure no value/ must be removed - PER MLS"",""PER MLS SALE IS FOR LAND - ASSESSMENT INCLUDES BUILDING""}"
200020,2020,"""12/08/2020""","""East Granby""","""96 KIMBERLY ROAD""",187800.0,255000.0,0.736471,"""Residential""","""Single Family""",,"""PP TOO LOW, RATIO TOO HIGH""","""GOOD SALE PER MLS - SOLD OVER …","""POINT (-72.75829 41.95346)""","{""PP TOO LOW, RATIO TOO HIGH"",""GOOD SALE PER MLS - SOLD OVER ASKING""}"
200594,2020,"""02/16/2021""","""Danbury""","""8 HICKORY ST""",121600.0,146216.0,0.831646,"""Residential""","""Single Family""","""25 - Other""","""I11192""","""HOUSE HAS SETTLED PER MLS""","""POINT (-73.44696 41.41179)""","{""I11192"",""HOUSE HAS SETTLED PER MLS""}"
200871,2020,"""04/27/2021""","""Danbury""","""2 CREST AV""",160200.0,370000.0,0.432973,"""Residential""","""Single Family""",,"""I05185""","""GOOD SALE PER MLS""","""POINT (-73.44654 41.44203)""","{""I05185"",""GOOD SALE PER MLS""}"


#### Casting

In [None]:
# Parse a string column into the Polars Date type and cast a float column to Int64

df.select(
    pl.col('Date Recorded').str.strptime(pl.Date, r'%m/%d/%Y'),
    pl.col('Assessed Value').cast(pl.Int64)
    ).collect()

Date Recorded,Assessed Value
date,i64
2021-04-14,133000
2021-05-26,110500
2021-09-13,150500
2020-12-14,127400
2021-09-07,217640
…,…
2022-10-11,483380
2023-09-29,20650
2023-01-09,132900
2023-09-26,1099400


## Joins

#### SQL-like Joins

In [None]:
# Sample data
df1 = df.filter(pl.col('List Year') == 2020)
df2 = df.filter(pl.col('List Year') == 2019)

# Inner join
df1.join(df2, on='Address', how = 'inner').collect()

Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location,Serial Number_right,List Year_right,Date Recorded_right,Town_right,Assessed Value_right,Sale Amount_right,Sales Ratio_right,Property Type_right,Residential Type_right,Non Use Code_right,Assessor Remarks_right,OPM remarks_right,Location_right
i64,i64,str,str,str,f64,f64,f64,str,str,str,str,str,str,i64,i64,str,str,f64,f64,f64,str,str,str,str,str,str
200121,2020,"""12/15/2020""","""Avon""","""63 NORTHGATE""",528490.0,775000.0,0.6819,"""Residential""","""Single Family""",,,,"""POINT (-72.89675 41.79445)""",190320,2019,"""06/23/2020""","""Simsbury""",196570.0,299500.0,0.6563,"""Single Family""","""Single Family""",,,,
200193,2020,"""11/18/2020""","""Bristol""","""14 CREST DR""",150290.0,276000.0,0.5445,"""Residential""","""Single Family""",,,,,190055,2019,"""12/16/2019""","""Cromwell""",249270.0,350000.0,0.7122,"""Single Family""","""Single Family""",,,,
201048,2020,"""07/13/2021""","""Bristol""","""17 LINCOLN ST""",102270.0,190000.0,0.5382,"""Residential""","""Two Family""",,,,,19062,2019,"""11/14/2019""","""Farmington""",146200.0,225000.0,0.6498,"""Single Family""","""Single Family""",,,,
200469,2020,"""05/13/2021""","""East Haven""","""15 ELM CT""",108350.0,247000.0,0.4386,"""Residential""","""Single Family""",,,,,190053,2019,"""11/14/2019""","""Plainville""",121590.0,198000.0,0.6141,"""Single Family""","""Single Family""",,,,
200097,2020,"""12/11/2020""","""Killingly""","""106 MAIN ST""",64400.0,66900.0,0.9626,"""Commercial""",,,,,,1900142,2019,"""03/20/2020""","""East Hampton""",27100.0,33700.0,0.8042,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
200061,2020,"""11/09/2020""","""Vernon""","""2 TERRACE DR""",113910.0,101000.0,1.1278,"""Residential""","""Single Family""","""14 - Foreclosure""",,,"""POINT (-72.47596 41.8638)""",190151,2019,"""01/21/2020""","""Vernon""",113910.0,91383.0,1.2465,"""Single Family""","""Single Family""","""14 - Foreclosure""",,,"""POINT (-72.47596 41.8638)"""
200887,2020,"""07/22/2021""","""Torrington""","""39 PROSPECT ST""",77860.0,200000.0,0.3893,"""Residential""","""Two Family""",,,,"""POINT (-73.12458 41.79851)""",190138,2019,"""07/28/2020""","""Essex""",246500.0,395000.0,0.6241,"""Single Family""","""Single Family""",,,,
200228,2020,"""07/12/2021""","""Windsor Locks""","""58 GROVE ST""",94850.0,180000.0,0.526944,"""Residential""","""Single Family""",,,,,190231,2019,"""01/24/2020""","""Wallingford""",175800.0,199000.0,0.8834,"""Single Family""","""Single Family""","""10 - A Will""",,,"""POINT (-72.82989 41.49695)"""
200228,2020,"""07/12/2021""","""Windsor Locks""","""58 GROVE ST""",94850.0,180000.0,0.526944,"""Residential""","""Single Family""",,,,,190108,2019,"""03/30/2020""","""Windsor Locks""",94850.0,153000.0,0.6199,"""Single Family""","""Single Family""",,,,


#### Union All

In [None]:
# Concatenation of previous DataFrames

pl.concat([df1, df2], how='diagonal_relaxed').collect()

Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location
i64,i64,str,str,str,f64,f64,f64,str,str,str,str,str,str
2020177,2020,"""04/14/2021""","""Ansonia""","""323 BEAVER ST""",133000.0,248400.0,0.5354,"""Residential""","""Single Family""",,,,"""POINT (-73.06822 41.35014)"""
2020225,2020,"""05/26/2021""","""Ansonia""","""152 JACKSON ST""",110500.0,239900.0,0.4606,"""Residential""","""Three Family""",,,,
2020348,2020,"""09/13/2021""","""Ansonia""","""230 WAKELEE AVE""",150500.0,325000.0,0.463,"""Commercial""",,,,,
2020090,2020,"""12/14/2020""","""Ansonia""","""57 PLATT ST""",127400.0,202500.0,0.6291,"""Residential""","""Two Family""",,,,
200500,2020,"""09/07/2021""","""Avon""","""245 NEW ROAD""",217640.0,400000.0,0.5441,"""Residential""","""Single Family""",,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…
190272,2019,"""06/24/2020""","""New London""","""4 BISHOP CT""",60410.0,53100.0,1.137665,"""Single Family""","""Single Family""","""14 - Foreclosure""",,,
190284,2019,"""11/27/2019""","""Waterbury""","""126 PERKINS AVE""",68280.0,76000.0,0.8984,"""Single Family""","""Single Family""","""25 - Other""","""PRIVATE SALE""",,
190129,2019,"""04/27/2020""","""Windsor Locks""","""19 HATHAWAY ST""",121450.0,210000.0,0.5783,"""Single Family""","""Single Family""",,,,
190504,2019,"""06/03/2020""","""Middletown""","""8 BYSTREK DR""",203360.0,280000.0,0.7263,"""Single Family""","""Single Family""",,,,


#### Union

In [None]:
# Concatenation while enforcing uniqueness on the 'Address' column

pl.concat([df1, df2], how='diagonal_relaxed').unique('Address').collect()

Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location
i64,i64,str,str,str,f64,f64,f64,str,str,str,str,str,str
19000191,2019,"""07/27/2020""","""Granby""","""11 OAKRIDGE DR""",126840.0,210000.0,0.604,"""Single Family""","""Single Family""",,,,
201081,2020,"""06/28/2021""","""New Haven""","""53 HARBOUR CLOSE # B-53""",125860.0,228000.0,0.552,"""Residential""","""Condo""",,,,
200047,2020,"""10/20/2020""","""Glastonbury""","""149 POND CIR""",157200.0,260000.0,0.6046,"""Residential""","""Single Family""",,,,
200721,2020,"""12/28/2020""","""Waterbury""","""49 FOX RUN RD""",95370.0,217000.0,0.4394,"""Residential""","""Single Family""",,,,
200132,2020,"""04/06/2021""","""Westbrook""","""18 OLD FORGE RD""",44770.0,76000.0,0.589,"""Residential""","""Single Family""",,"""MANUFACTURED HOME""",,"""POINT (-72.43206 41.28308)"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…
190274,2019,"""02/04/2020""","""Norwich""","""113 OX HILL RD""",278700.0,452000.0,0.6166,"""Single Family""","""Single Family""","""25 - Other""","""INCORRECT LEGAL/WAITING FOR CO…",,
19280,2019,"""07/09/2020""","""Darien""","""97 RAYMOND STREET""",1.29731e6,2.2875e6,0.5671,"""Single Family""","""Single Family""",,,,
20139,2020,"""08/17/2021""","""Lebanon""","""FOWLER RD (265-28.004)""",38610.0,395000.0,0.0977,"""Vacant Land""",,"""25 - Other""","""2 PARCELS""",,"""POINT (-72.23176 41.5951)"""
200459,2020,"""01/14/2021""","""Manchester""","""216 WALEK FARMS ROAD""",181900.0,300000.0,0.6063,"""Residential""","""Single Family""",,,,


## Aggregations

#### Simple Grouping

In [None]:
# Multi-column grouping with basic aggregations

df.group_by(['List Year', 'Town']).agg([
    pl.col('Assessed Value').mean().alias('Avg Assessed Value'),
    pl.col('Sale Amount').mean().alias('Avg Sale Amount'),
    pl.col('Assessed Value').median().alias('Median Assessed Value'),
    pl.col('Sale Amount').median().alias('Median Sale Amount')
]
).sort(
    by = ['List Year', 'Avg Assessed Value', 'Avg Sale Amount']
).collect()

List Year,Town,Avg Assessed Value,Avg Sale Amount,Median Assessed Value,Median Sale Amount
i64,str,f64,f64,f64,f64
2001,"""Eastford""",42143.275862,170997.689655,23510.0,48666.5
2001,"""Union""",44832.258065,98736.774194,36950.0,90000.0
2001,"""Sterling""",54886.99115,109120.672566,59190.0,110000.0
2001,"""Plainfield""",56903.713528,112815.106101,58300.0,113000.0
2001,"""Chaplin""",58055.555556,110212.462963,55500.0,107187.5
…,…,…,…,…,…
2022,"""Wilton""",737009.40613,1.2333e6,473830.0,995000.0
2022,"""New Canaan""",1.0148e6,1.7992e6,855820.0,1.575e6
2022,"""Westport""",1.0663e6,2.2534e6,728750.0,1.65e6
2022,"""Darien""",1.0778e6,2.1694e6,883855.0,1.8495e6


#### Grouping Without Aggregating

In [None]:
# Creates a list if no aggregation is specified 

list_df = df.group_by(['List Year', 'Town']).agg([
    pl.col('Assessed Value').alias('Assessed Values'),
    pl.col('Sale Amount').alias('Sale Amounts') 
]).collect()

list_df

List Year,Town,Assessed Values,Sale Amounts
i64,str,list[f64],list[f64]
2014,"""Warren""","[211400.0, 169050.0, … 7690.0]","[380000.0, 234500.0, … 500000.0]"
2015,"""Ridgefield""","[593560.0, 210670.0, … 110140.0]","[850000.0, 150000.0, … 99225.0]"
2022,"""Middlefield""","[189400.0, 151900.0, … 251000.0]","[305000.0, 195000.0, … 300000.0]"
2020,"""Portland""","[56420.0, 209650.0, … 206570.0]","[56400.0, 393500.0, … 354000.0]"
2001,"""Burlington""","[68600.0, 124180.0, … 214900.0]","[125000.0, 268000.0, … 477000.0]"
…,…,…,…
2003,"""Windham""","[0.0, 49810.0, … 54530.0]","[160000.0, 110000.0, … 134000.0]"
2016,"""New Hartford""","[156800.0, 297850.0, … 185240.0]","[229000.0, 489900.0, … 270000.0]"
2002,"""Suffield""","[152670.0, 118650.0, … 53690.0]","[385000.0, 176000.0, … 89900.0]"
2004,"""Marlborough""","[117530.0, 56900.0, … 160490.0]","[218500.0, 507372.0, … 289000.0]"


#### Using Non-Aggregating Groups to Limit Results by Group

In [None]:
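# List columns can be sorted, sliced, and exploded to get a "top n" result per group.
# Each list is sorted independently, so values on the same output row need not come from the same sale.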

list_df.with_columns(
    pl.col("Assessed Values").list.sort(descending=True).list.slice(0, 4),
    pl.col("Sale Amounts").list.sort(descending=True).list.slice(0, 4),
).explode(pl.col("Assessed Values"), pl.col("Sale Amounts"))

List Year,Town,Assessed Values,Sale Amounts
i64,str,f64,f64
2014,"""Warren""",2.31616e6,8.9915e6
2014,"""Warren""",1.64233e6,3.75e6
2014,"""Warren""",1.43626e6,2.8e6
2014,"""Warren""",619210.0,1.5585e6
2015,"""Ridgefield""",5.53321e6,4.975e6
…,…,…,…
2004,"""Marlborough""",403960.0,750000.0
2001,"""Kent""",1.5888e6,5.4e6
2001,"""Kent""",368400.0,1.059375e6
2001,"""Kent""",361400.0,1e6


## Window Operations

In [None]:
# Create a DataFrame for the window

window_data = df.select(
    [
        'List Year',
        'Address',
        'Assessed Value'
     ]
    ).filter(
        pl.col('Address') == "1 CEDAR ST"
    ).unique().collect().sort('List Year')

window_data

List Year,Address,Assessed Value
i64,str,f64
2001,"""1 CEDAR ST""",73100.0
2002,"""1 CEDAR ST""",66600.0
2004,"""1 CEDAR ST""",67600.0
2005,"""1 CEDAR ST""",42280.0
2008,"""1 CEDAR ST""",99260.0
…,…,…
2013,"""1 CEDAR ST""",166110.0
2013,"""1 CEDAR ST""",136090.0
2013,"""1 CEDAR ST""",118500.0
2019,"""1 CEDAR ST""",106500.0


#### Using Over()

In [None]:
# Two rankings: 'Total Rank' ranks across all rows, while 'Yearly Rank' ranks within each 'List Year'

window_data.with_columns(
    pl.col('Assessed Value').rank('dense', descending = True).alias('Total Rank'),
    pl.col('Assessed Value').rank('dense', descending = True).over('List Year').alias('Yearly Rank')
)
                                                                       

List Year,Address,Assessed Value,Total Rank,Yearly Rank
i64,str,f64,u32,u32
2001,"""1 CEDAR ST""",73100.0,9,1
2002,"""1 CEDAR ST""",66600.0,11,1
2004,"""1 CEDAR ST""",67600.0,10,1
2005,"""1 CEDAR ST""",42280.0,12,1
2008,"""1 CEDAR ST""",99260.0,8,1
…,…,…,…,…
2013,"""1 CEDAR ST""",166110.0,1,1
2013,"""1 CEDAR ST""",136090.0,4,2
2013,"""1 CEDAR ST""",118500.0,5,3
2019,"""1 CEDAR ST""",106500.0,7,1


#### Using Rolling()

In [None]:
# Create example df

rolling_data = df.select(
    [
    pl.col('Date Recorded').str.strptime(pl.Date, r'%m/%d/%Y'),
    pl.col('Address'),
    pl.col('Assessed Value').cast(pl.Int64)
    ]
    ).filter(
        pl.col('Date Recorded') > pl.date(2020, 1, 1)
        ).collect(
            ).unique('Date Recorded').sort('Date Recorded').limit(10)

rolling_data

Date Recorded,Address,Assessed Value
date,str,i64
2020-01-02,"""9 CAPT AMOS STANTON DR""",194230
2020-01-03,"""6 RIVERCLIFF LA""",160210
2020-01-06,"""3 SYLEO LANE""",47600
2020-01-07,"""TWO POMPERAUG OFFICE PARK U#10""",53560
2020-01-08,"""199 WILMOT RD""",103740
2020-01-09,"""130 BROOKVIEW AVENUE""",372470
2020-01-10,"""40 SOUTHWICK CT U212""",64690
2020-01-13,"""31 WINTHROP WOODS RD""",146160
2020-01-14,"""SEELYE RD""",3090
2020-01-15,"""230 OLD GATE LN""",1925000


In [None]:
# 3-day rolling mean, min, and max

rolling_data.with_columns(
    avg_Val=pl.mean('Assessed Value').rolling(index_column='Date Recorded', period='3d'),
    min_Val=pl.min('Assessed Value').rolling(index_column='Date Recorded', period='3d'),
    max_Val=pl.max('Assessed Value').rolling(index_column='Date Recorded', period='3d'),
)

Date Recorded,Address,Assessed Value,avg_Val,min_Val,max_Val
date,str,i64,f64,i64,i64
2020-01-02,"""9 CAPT AMOS STANTON DR""",194230,194230.0,194230,194230
2020-01-03,"""6 RIVERCLIFF LA""",160210,177220.0,160210,194230
2020-01-06,"""3 SYLEO LANE""",47600,47600.0,47600,47600
2020-01-07,"""TWO POMPERAUG OFFICE PARK U#10""",53560,50580.0,47600,53560
2020-01-08,"""199 WILMOT RD""",103740,68300.0,47600,103740
2020-01-09,"""130 BROOKVIEW AVENUE""",372470,176590.0,53560,372470
2020-01-10,"""40 SOUTHWICK CT U212""",64690,180300.0,64690,372470
2020-01-13,"""31 WINTHROP WOODS RD""",146160,146160.0,146160,146160
2020-01-14,"""SEELYE RD""",3090,74625.0,3090,146160
2020-01-15,"""230 OLD GATE LN""",1925000,691416.666667,3090,1925000


## Pivots and Melts

#### Pivot 

In [None]:
# Create sample data by grouping median 'Sale Amount' by 'List Year', 'Property Type'

pivot_sample = df.filter(
    pl.col('Property Type').is_not_null()
    ).group_by(['List Year', 'Property Type']
               ).agg(pl.col('Sale Amount').median().alias('Median Sale Amount')
                     ).collect()

pivot_sample



List Year,Property Type,Median Sale Amount
i64,str,f64
2021,"""Residential""",315000.0
2020,"""Commercial""",400000.0
2020,"""Public Utility""",59950.0
2013,"""Two Family""",130000.0
2020,"""Industrial""",645000.0
…,…,…
2011,"""Condo""",165000.0
2017,"""Two Family""",175000.0
2012,"""Single Family""",246000.0
2012,"""Two Family""",127500.0


In [None]:
# Pivot 'Property Type' into columns

pivot_sample.pivot(
    on = 'Property Type', # 'columns' was renamed to 'on' in Polars 1.0
    index='List Year', 
    values='Median Sale Amount', 
    aggregate_function='mean', 
    sort_columns=True).sort('List Year')


List Year,Apartments,Commercial,Condo,Four Family,Industrial,Public Utility,Residential,Single Family,Three Family,Two Family,Vacant Land
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2006,,,197500.0,276750.0,,,,303000.0,240000.0,235000.0,
2007,,,200000.0,243000.0,,,,270000.0,205000.0,200000.0,
2008,,,178589.5,143000.0,,,,238000.0,103000.0,145000.0,
2009,,,180000.0,128830.5,,,,247000.0,105000.0,138800.0,
2010,,,171000.0,123500.0,,,,239000.0,95000.0,125000.0,
…,…,…,…,…,…,…,…,…,…,…,…
2018,,,160000.0,234500.0,,,,245000.0,199000.0,187000.0,
2019,,,177900.0,265000.0,,,,280000.0,225000.0,205000.0,
2020,400000.0,400000.0,,,645000.0,59950.0,290000.0,,,,100000.0
2021,560000.0,425000.0,,,866250.0,20000.0,315000.0,,,,110000.0


#### Melt

In [None]:
# Create sample DataFrame
melt_sample = df.select(
    [
        'List Year',
        'Address',
        'Assessed Value',
        'Sale Amount'
    ]
).limit(5).collect()

melt_sample

List Year,Address,Assessed Value,Sale Amount
i64,str,f64,f64
2020,"""323 BEAVER ST""",133000.0,248400.0
2020,"""152 JACKSON ST""",110500.0,239900.0
2020,"""230 WAKELEE AVE""",150500.0,325000.0
2020,"""57 PLATT ST""",127400.0,202500.0
2020,"""245 NEW ROAD""",217640.0,400000.0


In [None]:
# Convert into long format by unpivoting 'Assessed Value' and 'Sale Amount'
# (melt() was renamed to unpivot() in Polars 1.0, and id_vars to index)
melt_sample.unpivot(
    index = ['List Year', 'Address'],
    variable_name = 'Value Type',
    value_name = 'Amount'
).sort('Address')


List Year,Address,Value Type,Amount
i64,str,str,f64
2020,"""152 JACKSON ST""","""Assessed Value""",110500.0
2020,"""152 JACKSON ST""","""Sale Amount""",239900.0
2020,"""230 WAKELEE AVE""","""Assessed Value""",150500.0
2020,"""230 WAKELEE AVE""","""Sale Amount""",325000.0
2020,"""245 NEW ROAD""","""Assessed Value""",217640.0
2020,"""245 NEW ROAD""","""Sale Amount""",400000.0
2020,"""323 BEAVER ST""","""Assessed Value""",133000.0
2020,"""323 BEAVER ST""","""Sale Amount""",248400.0
2020,"""57 PLATT ST""","""Assessed Value""",127400.0
2020,"""57 PLATT ST""","""Sale Amount""",202500.0
