## Running Polars within Julia

Set this before loading `PythonCall` so that it uses the existing Python environment, including all Python modules installed in that environment, instead of letting CondaPkg manage its own.

In [1]:
ENV["JULIA_CONDAPKG_BACKEND"] = "Null";
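If the `python` found on `PATH` is not the interpreter you want, PythonCall also reads the `JULIA_PYTHONCALL_EXE` variable. The following is a minimal sketch with a placeholder path (an assumption, not the path used in this notebook); both variables need to be set before `using PythonCall`.

```julia
# Both settings must be in place before `using PythonCall` runs.
ENV["JULIA_CONDAPKG_BACKEND"] = "Null"   # do not let CondaPkg manage Python

# Hypothetical path: replace with the python binary of your own environment.
ENV["JULIA_PYTHONCALL_EXE"] = "/path/to/envs/myenv/bin/python"
```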

#### Julia Packages
- Pkg.add("PythonCall")

In [2]:
using PythonCall, DataFrames, DataFramesMeta, Chain, Pipe

#### Python Packages
- Taken from the active mamba env, which is selected by defining the env variable above

In [3]:
pl = pyimport("polars")

Python: <module 'polars' from '/home/cdaniels/mambaforge/envs/bioinfo/lib/python3.12/site-packages/polars/__init__.py'>

In [4]:
PythonCall.python_executable_path()

"/home/cdaniels/mambaforge/envs/bioinfo/bin/python"

In [5]:
PythonCall.python_version()

v"3.12.11"

The `@py` macro in PythonCall.jl allows you to write Python syntax directly in Julia. It's a special macro that interprets Python-like syntax and converts it to PythonCall operations.

So `df` is a python/polars object.
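For comparison, without the macro the Python containers have to be built explicitly. This is only a sketch of what that could look like, assuming polars is importable from the active Python environment; `data`, `df_small` and `cols` are illustrative names, and the two-row table is made up for the example.

```julia
using PythonCall

pl = pyimport("polars")  # assumes polars is installed in the active Python env

# Build the Python dict and lists by hand instead of using @py literals.
# A vector of pairs is iterated in order, so the columns keep this order.
data = pydict([
    "name" => pylist(["Alice", "Bob"]),
    "age"  => pylist([25, 30]),
])
df_small = pl.DataFrame(data)

# Column names come back as a Python list; convert them to inspect in Julia.
cols = pyconvert(Vector{String}, df_small.columns)
```

The `@py` form is shorter because the dict and list literals are written once, in Python syntax, and converted for you.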

In [6]:
df = @py pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 28, 45],
    "salary": [50000, 60000, 70000, 55000, 80000],
    "dept": ["IT", "HR", "IT", "HR", "IT"],
    "joined": ["2020-01-15", "2019-03-22", "2018-07-01", "2021-06-15", "2017-09-30"]})

name,age,salary,dept,joined
str,i64,i64,str,str
"""Alice""",25,50000,"""IT""","""2020-01-15"""
"""Bob""",30,60000,"""HR""","""2019-03-22"""
"""Charlie""",35,70000,"""IT""","""2018-07-01"""
"""David""",28,55000,"""HR""","""2021-06-15"""
"""Eve""",45,80000,"""IT""","""2017-09-30"""


Here we use natural polars code! The only awkward part is that Julia needs the '.' at the end of each line (trailing) rather than at the start of the next line (leading), which is the usual polars convention.

In [7]:
df.filter(pl.col("age") > 25).
    with_columns([
        (pl.col("salary") * 0.15).alias("bonus"),
        (pl.col("age").gt(30)).alias("is_senior")]).
    group_by("dept").
    agg([
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("bonus").sum().alias("total_bonus"),
        pl.col("name").count().alias("employee_count")]).
    sort("avg_salary", descending=true)

dept,avg_salary,total_bonus,employee_count
str,f64,f64,u32
"""IT""",75000.0,22500.0,2
"""HR""",57500.0,17250.0,2


`p_str` is a custom Julia string macro that allows the polars convention of placing '.' at the beginning of a line, which Julia does not normally accept.

Usage: `p"""<code>"""`

The drawback is that there is no autocomplete inside triple-quoted strings.

In [8]:
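macro p_str(s)
    lines = split(s, '\n')
    transformed = String[]

    for line in lines
        stripped = lstrip(line)
        if startswith(stripped, ".") && !isempty(transformed)
            # Leading '.': move the dot to the end of the previous line,
            # which is the form the Julia parser accepts.
            transformed[end] *= "."
            push!(transformed, " " * stripped[2:end])
        elseif !isempty(strip(line))
            # Keep ordinary non-blank lines unchanged; blank lines are dropped.
            push!(transformed, line)
        end
    end

    # Parse the rewritten text as Julia code and evaluate it in the caller's scope.
    code = join(transformed, '\n')
    return esc(Meta.parse("begin\n$code\nend"))
end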

@p_str (macro with 1 method)
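To see what the macro hands to the parser, the same line-joining logic can be pulled out into an ordinary function and applied to a string. This is a sketch for inspection only; `join_leading_dots` and the `obj`, `first_call`, `second_call` names in the sample text are hypothetical.

```julia
# Same rewriting rule as the macro body, but returning the text instead of parsing it.
function join_leading_dots(s::AbstractString)
    transformed = String[]
    for line in split(s, '\n')
        stripped = lstrip(line)
        if startswith(stripped, ".") && !isempty(transformed)
            transformed[end] *= "."                     # dot moves to the previous line
            push!(transformed, " " * stripped[2:end])   # rest of the line, dot removed
        elseif !isempty(strip(line))
            push!(transformed, line)
        end
    end
    return join(transformed, '\n')
end

src = join_leading_dots("obj\n    .first_call()\n    .second_call()")
println(src)
# obj.
#  first_call().
#  second_call()
```

Each leading dot ends up trailing on the line before it, which is the layout used in the earlier cell that chains the polars calls directly.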

In [9]:
df = p"""(df.filter(pl.col("age") > 25)
    .with_columns([
        (pl.col("salary") * 0.15).alias("bonus"),
        (pl.col("age").gt(30)).alias("is_senior")])
    .group_by("dept")
    .agg([
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("bonus").sum().alias("total_bonus"),
        pl.col("name").count().alias("employee_count")])
    .sort("avg_salary", descending=true))"""

dept,avg_salary,total_bonus,employee_count
str,f64,f64,u32
"""IT""",75000.0,22500.0,2
"""HR""",57500.0,17250.0,2


Technically, `df` is a Julia object of type `Py`, a wrapper that looks and acts like the underlying Python object.

In [10]:
typeof(df)

Py
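Results of Python calls stay wrapped as `Py` until they are converted explicitly with `pyconvert`. A small sketch using a plain Python list rather than the polars frame (the names `xs`, `total` and `n` are illustrative):

```julia
using PythonCall

xs = pylist([1, 2, 3])         # a Python list wrapped in a Julia `Py`
total = pybuiltins.sum(xs)     # calling Python's built-in sum returns another `Py`

typeof(total)                  # Py
n = pyconvert(Int, total)      # explicit conversion to a native Julia integer
```

After the conversion, `n` is an ordinary Julia `Int` and no longer refers to a Python object.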

Normal python/polars syntax for all of this!

In [11]:
df.write_csv("test.tsv",separator="\t");

In [12]:
pl.read_csv("test.tsv",separator="\t")

dept,avg_salary,total_bonus,employee_count
str,f64,f64,i64
"""IT""",75000.0,22500.0,2
"""HR""",57500.0,17250.0,2


In [13]:
df1=pl.read_csv("/home/cdaniels/uofc_data/ubs_seq/Projects/UBS-seq_250427/workspace/capture/CA3.capture.depth.csv",separator="\t")

chrom,start,end,name,score
i64,i64,i64,str,i64
1,1033756,1033997,"""omni1m2878""",158
1,1540183,1540469,"""omni1m0206""",86
1,1540236,1540417,"""omni1m3284""",86
1,1630371,1630664,"""omni1m0430""",142
1,1630842,1631124,"""omni1m0436""",111
…,…,…,…,…
9,134641904,134642209,"""omni1m0234""",134
9,137156663,137156844,"""omni1m3358""",276
9,137877418,137877659,"""omni1m2016""",253
9,137877438,137877639,"""omni1m1481""",251


In [14]:
df2 = df1.
    with_columns(diff=pl.col("end")-pl.col("start")).
    drop("score")

chrom,start,end,name,diff
i64,i64,i64,str,i64
1,1033756,1033997,"""omni1m2878""",241
1,1540183,1540469,"""omni1m0206""",286
1,1540236,1540417,"""omni1m3284""",181
1,1630371,1630664,"""omni1m0430""",293
1,1630842,1631124,"""omni1m0436""",282
…,…,…,…,…
9,134641904,134642209,"""omni1m0234""",305
9,137156663,137156844,"""omni1m3358""",181
9,137877418,137877659,"""omni1m2016""",241
9,137877438,137877639,"""omni1m1481""",201


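Now we convert the Python/polars DataFrame back to a Julia `DataFrame`. The helper below goes through a Python dict of lists and then restores the original column order, since a Julia `Dict` does not preserve it.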

In [15]:
function polars2DataFrame(df_polars::Py)
    dict_data = pyconvert(Dict{String, Vector}, df_polars.to_dict(as_series=false))
    df = DataFrame(dict_data)
    col_order = Symbol.(pyconvert(Vector{String}, df_polars.columns))
    return select(df, col_order...)
end

polars2DataFrame (generic function with 1 method)
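The final `select` matters because a `DataFrame` built from a `Dict` is not guaranteed to keep the source column order. A pure-Julia sketch of that step, using a hypothetical one-row `Dict` with the same column names as the table above:

```julia
using DataFrames

# Hypothetical one-row data with the same column names as the capture table
d = Dict("chrom" => [1], "start" => [1033756], "end" => [1033997])

df_unordered = DataFrame(d)
names(df_unordered)            # order is not guaranteed to match the source table

# Reapply the desired order, as polars2DataFrame does with df_polars.columns
col_order = [:chrom, :start, :end]
df_ordered = select(df_unordered, col_order)
names(df_ordered)
```

Passing the vector `col_order` directly to `select` is equivalent to splatting it as the function above does.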

Pipes are an easy and elegant way to do this.

In [16]:
df2 |> polars2DataFrame |> describe

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,chrom,8.69063,1,7.0,22,0,Int64
2,start,77752100.0,165964,63056800.0,247857335,0,Int64
3,end,77752300.0,166201,63057000.0,247857631,0,Int64
4,name,,omni1m0001,,omni1m3481,0,String
5,diff,247.847,161,241.0,540,0,Int64


In [17]:
df2 |> polars2DataFrame |> x -> first(x, 5)

Row,chrom,start,end,name,diff
,Int64,Int64,Int64,String,Int64
1,1,1033756,1033997,omni1m2878,241
2,1,1540183,1540469,omni1m0206,286
3,1,1540236,1540417,omni1m3284,181
4,1,1630371,1630664,omni1m0430,293
5,1,1630842,1631124,omni1m0436,282


`@pipe` gives more control: the `_` placeholder marks where the output of the previous step is inserted.

In [18]:
@pipe df2 |> polars2DataFrame |> first(_,5)

Row,chrom,start,end,name,diff
,Int64,Int64,Int64,String,Int64
1,1,1033756,1033997,omni1m2878,241
2,1,1540183,1540469,omni1m0206,286
3,1,1540236,1540417,omni1m3284,181
4,1,1630371,1630664,omni1m0430,293
5,1,1630842,1631124,omni1m0436,282
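The placeholder is most useful when the piped value is not the first argument. A small sketch on plain vectors (made-up values, unrelated to the capture data):

```julia
using Pipe

# `_` marks where the previous result goes, so it need not be the first argument.
firsttwo = @pipe 2 |> first([10, 20, 30], _)        # same as first([10, 20, 30], 2)

# Steps without `_` behave like an ordinary pipe; steps with `_` get the value there.
smallest = @pipe [3, 1, 2] |> sort |> first(_, 2)   # same as first(sort([3, 1, 2]), 2)
```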


`@chain` assumes that the output of the previous step is always the first argument of the next call (an explicit `_` overrides this). The calls are cleaner, but the steps are usually wrapped in a begin/end block.

In [19]:
@chain df2 begin
    polars2DataFrame 
    first(5)
end

Row,chrom,start,end,name,diff
,Int64,Int64,Int64,String,Int64
1,1,1033756,1033997,omni1m2878,241
2,1,1540183,1540469,omni1m0206,286
3,1,1540236,1540417,omni1m3284,181
4,1,1630371,1630664,omni1m0430,293
5,1,1630842,1631124,omni1m0436,282
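A short sketch of both insertion rules on plain vectors (made-up values; `a` and `b` are illustrative names):

```julia
using Chain

# Default: the previous result becomes the first argument of the next call.
a = @chain [3, 1, 2] begin
    sort        # sort([3, 1, 2])
    first(2)    # first(<sorted vector>, 2)
end

# With an explicit `_`, the previous result goes wherever the underscore is.
b = @chain 2 begin
    first([10, 20, 30], _)
end
```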


`DataFramesMeta` adds macros similar to `dplyr` in R. Since the data frame is always the first argument of these macros, `@chain` works very well.

In [20]:
@chain df2 begin
    polars2DataFrame
    @transform(:diff2 = :end - :start)
    @select Not(:diff) # Drop col
    @subset(:diff2 .> 250)
    @transform(:bin = 100 .* (:diff2 .÷ 100))
    @by(:bin, :count = length(:bin))
    sort(:bin, rev=true)
end

Row,bin,count
,Int64,Int64
1,500,6
2,400,54
3,300,366
4,200,486
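For reference, the same kind of pipeline can be written with plain `DataFrames` functions and no macros. This is a sketch on a hypothetical four-row table (not the capture data), so the counts differ from the output above; the `@select Not(:diff)` step is omitted because the toy table has no `diff` column.

```julia
using DataFrames

# Hypothetical toy intervals (not the capture data used above)
toy = DataFrame(:start => [0, 100, 200, 300], :end => [260, 420, 510, 390])

step1 = transform(toy, [:end, :start] => ByRow(-) => :diff2)             # interval length
step2 = subset(step1, :diff2 => ByRow(>(250)))                           # keep lengths > 250
step3 = transform(step2, :diff2 => ByRow(d -> 100 * (d ÷ 100)) => :bin)  # round down to 100s
step4 = combine(groupby(step3, :bin), nrow => :count)                    # rows per bin
result = sort(step4, :bin, rev=true)
```

The macro version reads closer to `dplyr`, while this form makes the source and destination columns of each step explicit.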
