## 1. Introduction

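Polars is a high-performance DataFrame library for manipulating structured data. It is written from scratch in Rust and designed to work close to the machine. Its vectorized, columnar processing enables cache-coherent algorithms and high performance on modern processors. If you use pandas regularly, Polars will feel familiar, and it is arguably the most promising drop-in alternative to pandas.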

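Polars has been compared against several other solutions in independent TPC-H benchmarks, which aim to replicate the data-wrangling operations used in practice. Thanks to its parallel execution engine, efficient algorithms, and use of SIMD (Single Instruction, Multiple Data) vectorization, Polars comfortably outperforms the other solutions. **Compared with pandas, it can be more than 30x faster.**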



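The goal of Polars is to provide a lightning-fast `DataFrame` library that:

- Uses all available cores on your machine.
- Optimizes queries to reduce unnecessary work and memory allocations.
- Handles datasets much larger than the available RAM.
- Has a consistent and predictable API.
- Has a strict schema (data types should be known before the query runs).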

<br>

[User guide: https://pola-rs.github.io/polars/user-guide/](https://pola-rs.github.io/polars/user-guide/)
[API reference: https://pola-rs.github.io/polars/py-polars/html/reference/io.html](https://pola-rs.github.io/polars/py-polars/html/reference/io.html)

<br>

Open a terminal and run the Polars installation command:

In [None]:
!pip3 install 'polars[all]'


## 2. Reading and Writing Data

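Polars can read from and write to:

- common data files, such as CSV, XLSX, JSON, and Parquet;
- cloud storage, such as S3, Azure Blob, and BigQuery;
- databases, such as PostgreSQL and MySQL.



This section covers the most common operations.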

In [38]:
import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "idx": [1, 2, 3, 4],
        "name": ["张三", "李四", "王五", "赵六"],
        "birthday": [
            datetime(2009, 5, 1),
            datetime(2005, 10, 15),
            datetime(2000, 12, 31),
            datetime(1995, 6, 15),
        ],
        "gender": ["男", "男", "男", "女"],
        "bio": ["好好学习，天天向上", 
                "泰难了", 
                "学习有毛用", 
                "躺平ing"],
    }
)

# Write to CSV, Excel, JSON, and Parquet
df.write_csv("data.csv")
df.write_excel("data.xlsx")
df.write_json("data.json")
df.write_parquet("data.parquet")


df

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing"""


In [39]:
df_csv = pl.read_csv('data.csv')
df_xlsx = pl.read_excel('data.xlsx')

df_csv

idx,name,birthday,gender,bio
i64,str,str,str,str
1,"""张三""","""2009-05-01T00:…","""男""","""好好学习，天天向上"""
2,"""李四""","""2005-10-15T00:…","""男""","""泰难了"""
3,"""王五""","""2000-12-31T00:…","""男""","""学习有毛用"""
4,"""赵六""","""1995-06-15T00:…","""女""","""躺平ing"""


In [40]:
df_json = pl.read_json("data.json")
df_parquet = pl.read_parquet("data.parquet")

df_json

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing"""


## 3. Expressions

`Expressions` are the core feature of Polars. They handle simple queries and scale easily to complex ones. Expressions are used in the following basic contexts:

- select
- filter
- with_columns
- group_by

### 3.1 select



In [46]:
df.select(
    pl.col("name"), 
    pl.col("birthday"),
)

name,birthday
str,datetime[μs]
"""张三""",2009-05-01 00:00:00
"""李四""",2005-10-15 00:00:00
"""王五""",2000-12-31 00:00:00
"""赵六""",1995-06-15 00:00:00


In [45]:
df.select(
    pl.col("name").alias('姓名'),
    pl.col("birthday").alias('生日')
)

姓名,生日
str,datetime[μs]
"""张三""",2009-05-01 00:00:00
"""李四""",2005-10-15 00:00:00
"""王五""",2000-12-31 00:00:00
"""赵六""",1995-06-15 00:00:00


In [118]:
pl.col("name", "birthday")

In [109]:
df.select(
    pl.col("name", "birthday")
)

name,birthday
str,datetime[μs]
"""张三""",2009-05-01 00:00:00
"""李四""",2005-10-15 00:00:00
"""王五""",2000-12-31 00:00:00
"""赵六""",1995-06-15 00:00:00


In [110]:
df.select(
    ["name", "birthday"]
)

name,birthday
str,datetime[μs]
"""张三""",2009-05-01 00:00:00
"""李四""",2005-10-15 00:00:00
"""王五""",2000-12-31 00:00:00
"""赵六""",1995-06-15 00:00:00


In [64]:
df["name"]    # a single column as a Series
df[["name"]]  # a one-column DataFrame (the output shown below)

name
str
"""张三"""
"""李四"""
"""王五"""
"""赵六"""


In [63]:
df.select(["name"])
#df.select("name")

name
str
"""张三"""
"""李四"""
"""王五"""
"""赵六"""


In [108]:
df.with_columns(
    pl.col('name').alias('姓名')
)

idx,name,birthday,gender,bio,姓名
i64,str,datetime[μs],str,str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上""","""张三"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了""","""李四"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用""","""王五"""
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing""","""赵六"""


In [66]:
df.filter(df['birthday']>datetime(2000, 1, 1))

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""


In [75]:
# Fails: Python evaluates 'birthday' > datetime(...) first, comparing a str with a datetime
df['birthday'>datetime(2000, 1, 1)]

TypeError: '>' not supported between instances of 'str' and 'datetime.datetime'

In [78]:
#df[pl.col('birthday')>datetime(2000, 1, 1)]


df.filter(pl.col('birthday')>datetime(2000, 1, 1))

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""


In [84]:
help(pl.col('birthday').dt)

Help on ExprDateTimeNameSpace in module polars.expr.datetime object:

class ExprDateTimeNameSpace(builtins.object)
 |  ExprDateTimeNameSpace(expr: 'Expr')
 |  
 |  Namespace for datetime related expressions.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, expr: 'Expr')
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  base_utc_offset(self) -> 'Expr'
 |      Base offset from UTC.
 |      
 |      This is usually constant for all datetimes in a given time zone, but
 |      may vary in the rare case that a country switches time zone, like
 |      Samoa (Apia) did at the end of 2011.
 |      
 |      Returns
 |      -------
 |      Expr
 |          Expression of data type :class:`Duration`.
 |      
 |      See Also
 |      --------
 |      Expr.dt.dst_offset : Daylight savings offset from UTC.
 |      
 |      Examples
 |      --------
 |      >>> from datetime import datetime
 |      >>> df = pl.DataFrame(
 |      ...     {
 |      ...         "ts": [

In [83]:
help(pl.col('bio').str)

Help on ExprStringNameSpace in module polars.expr.string object:

class ExprStringNameSpace(builtins.object)
 |  ExprStringNameSpace(expr: 'Expr')
 |  
 |  Namespace for string related expressions.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, expr: 'Expr')
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  concat(self, delimiter: 'str' = '-', *, ignore_nulls: 'bool' = True) -> 'Expr'
 |      Vertically concat the values in the Series to a single string value.
 |      
 |      Parameters
 |      ----------
 |      delimiter
 |          The delimiter to insert between consecutive string values.
 |      ignore_nulls
 |          Ignore null values (default).
 |      
 |          If set to ``False``, null values will be propagated.
 |          if the column contains any null values, the output is ``None``.
 |      
 |      Returns
 |      -------
 |      Expr
 |          Expression of data type :class:`Utf8`.
 |      
 |      Examples
 |      --------


In [82]:
# Fails: the string namespace has no `len` attribute; str.len_chars() gives the character count
df.select(
    pl.col('bio').str.len
)

AttributeError: 'ExprStringNameSpace' object has no attribute 'len'

In [87]:
df.filter(
  pl.col('bio').str.contains("学习")
)

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""


In [88]:
df.group_by('gender')

<polars.dataframe.group_by.GroupBy at 0x132d129d0>

In [91]:
# Iterating over a GroupBy yields (key, sub-DataFrame) pairs
# (newer Polars releases yield the key as a tuple, e.g. ('男',))
for gender, gender_df in df.group_by(pl.col('gender')):
    print(gender, len(gender_df), type(gender_df))

男 3 <class 'polars.dataframe.frame.DataFrame'>
女 1 <class 'polars.dataframe.frame.DataFrame'>


In [97]:
# Mean bio length (in characters) for each group
for gender, gender_df in df.group_by(pl.col('gender')):
    print(gender, gender_df['bio'].str.len_chars().mean())

男 5.666666666666667
女 5.0


In [105]:
# Inside agg(), the function receives each group's whole Series,
# so len(t) is the number of rows in the group, not the string length
df.group_by(pl.col('gender')).agg(pl.col('bio').map_elements(lambda t: len(t)))


gender,bio
str,i64
"""女""",1
"""男""",3


In [119]:
df

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing"""


In [121]:
import polars.selectors as cs

df.select(
    cs.integer(), cs.string()
)

idx,name,gender,bio
i64,str,str,str
1,"""张三""","""男""","""好好学习，天天向上"""
2,"""李四""","""男""","""泰难了"""
3,"""王五""","""男""","""学习有毛用"""
4,"""赵六""","""女""","""躺平ing"""


In [124]:
df.select(
    cs.contains('r')
)

birthday,gender
datetime[μs],str
2009-05-01 00:00:00,"""男"""
2005-10-15 00:00:00,"""男"""
2000-12-31 00:00:00,"""男"""
1995-06-15 00:00:00,"""女"""


In [130]:
# Format datetime columns as strings (%m is the month; %M would be the minute)
df.select(
    cs.datetime().dt.to_string("%Y-%m-%d")
)

birthday
str
"""2009-05-01"""
"""2005-10-15"""
"""2000-12-31"""
"""1995-06-15"""


In [134]:
# Select all temporal (date/time) columns
df.select(
    cs.temporal()
)

birthday
datetime[μs]
2009-05-01 00:00:00
2005-10-15 00:00:00
2000-12-31 00:00:00
1995-06-15 00:00:00


In [132]:
df

idx,name,birthday,gender,bio
i64,str,datetime[μs],str,str
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上"""
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了"""
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用"""
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing"""


In [133]:
df.select(
    cs.matches('na|io')
)

name,bio
str,str
"""张三""","""好好学习，天天向上"""
"""李四""","""泰难了"""
"""王五""","""学习有毛用"""
"""赵六""","""躺平ing"""


In [136]:
df.with_columns(
    pl.when(pl.col('birthday')>datetime(2000, 1, 1))
    .then(True)
    .otherwise(False)
    .alias('00后')
)

idx,name,birthday,gender,bio,00后
i64,str,datetime[μs],str,str,bool
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上""",True
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了""",True
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用""",True
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing""",False


In [153]:
df.with_columns(
    pl.col('bio').str.len_chars().alias('length')
)

idx,name,birthday,gender,bio,length
i64,str,datetime[μs],str,str,u32
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上""",9
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了""",3
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用""",5
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing""",5


In [156]:
df.with_columns(
    pl.col('bio').str.extract_all('躺平|难|毛').alias('neg')
)

idx,name,birthday,gender,bio,neg
i64,str,datetime[μs],str,str,list[str]
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上""",[]
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了""","[""难""]"
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用""","[""毛""]"
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing""","[""躺平""]"


In [148]:
help(pl.col('bio').str.extract)

Help on method extract in module polars.expr.string:

extract(pattern: 'str', group_index: 'int' = 1) -> 'Expr' method of polars.expr.string.ExprStringNameSpace instance
    Extract the target capture group from provided patterns.
    
    Parameters
    ----------
    pattern
        A valid regular expression pattern, compatible with the `regex crate
        <https://docs.rs/regex/latest/regex/>`_.
    group_index
        Index of the targeted capture group.
        Group 0 means the whole pattern, the first group begins at index 1.
        Defaults to the first capture group.
    
    Notes
    -----
    To modify regular expression behaviour (such as multi-line matching)
    with flags, use the inline `(?iLmsuxU)` syntax. For example:
    
    >>> df = pl.DataFrame(
    ...     data={
    ...         "lines": [
    ...             "I Like\nThose\nOdds",
    ...             "This is\nThe Way",
    ...         ]
    ...     }
    ... )
    >>> df.with_columns(
    ...     pl.col("lin

In [161]:
df.with_columns(
    pl.col('birthday').dt.date().alias('date2')
)

idx,name,birthday,gender,bio,date2
i64,str,datetime[μs],str,str,date
1,"""张三""",2009-05-01 00:00:00,"""男""","""好好学习，天天向上""",2009-05-01
2,"""李四""",2005-10-15 00:00:00,"""男""","""泰难了""",2005-10-15
3,"""王五""",2000-12-31 00:00:00,"""男""","""学习有毛用""",2000-12-31
4,"""赵六""",1995-06-15 00:00:00,"""女""","""躺平ing""",1995-06-15


In [168]:
# Build a lazy query; nothing is executed until collect() is called
q = (
    df.lazy()
    .group_by('gender')
    .agg(
        pl.count(),
        pl.col('bio').str.len_chars().mean().alias('mean_len')
    )
)

q.collect()


gender,count,mean_len
str,u32,f64
"""男""",3,5.666667
"""女""",1,5.0


In [169]:
# The same aggregation in eager mode
df.group_by('gender').agg(
    pl.count(),
    pl.col('bio').str.len_chars().mean().alias('mean_len')
)


gender,count,mean_len
str,u32,f64
"""女""",1,5.0
"""男""",3,5.666667
