# Python Data Frames Cheat Sheet

## Importing data

Use Polars functions to read data into a DataFrame or write a DataFrame to a file.

In [1]:
import polars as pl               # import the polars package
data = pl.read_csv("people.csv")  # use read_csv to read CSV files
data                              # display the data frame

| name | birthdate | weight | height |
| --- | --- | --- | --- |
| str | str | f64 | f64 |
| "Alice Archer" | "1997-01-10" | 57.9 | 1.56 |
| "Ben Brown" | "1985-02-15" | 72.5 | 1.77 |
| "Chloe Cooper" | "1983-03-22" | 53.6 | 1.65 |
| "Daniel Donovan" | "1981-04-30" | 83.1 | 1.75 |


## Data frame attributes

Use DataFrame attributes to get information about how the data are organized.

In [4]:
print(data.columns)  # column names in a list of strings
print(data.dtypes)   # data type of each column
print(data.shape)    # tuple with the number of rows and columns
print(data.height)   # number of rows
print(data.width)    # number of columns

['name', 'birthdate', 'weight', 'height']
[String, String, Float64, Float64]
(4, 4)
4
4


In [6]:
print(data.schema)  # dictionary of column names and their data types

Schema({'name': String, 'birthdate': String, 'weight': Float64, 'height': Float64})


## Creating a data frame directly

To create a DataFrame in your code, rather than reading it from a file, use `pl.DataFrame`.

Pass `pl.DataFrame` a dictionary (written with curly braces, `{}`) that has one key for each column. Each key maps to a list of values, one value per row of the DataFrame.

In [7]:
data = pl.DataFrame(
    {
        "participant_id": ["001", "002", "003"],
        "age": [25, 32, 65],
        "score1": [3, 6, 2],
        "score2": [8, 2, 4],
    }
)
data

| participant_id | age | score1 | score2 |
| --- | --- | --- | --- |
| str | i64 | i64 | i64 |
| "001" | 25 | 3 | 8 |
| "002" | 32 | 6 | 2 |
| "003" | 65 | 2 | 4 |


## Accessing data in a data frame

Use indexing (`[]`) to access individual columns.

In [12]:
print(type(data["score1"]))  # each column is a Series
data["score1"]               # indexing a column returns a Series

<class 'polars.series.series.Series'>


| score1 |
| --- |
| i64 |
| 3 |
| 6 |
| 2 |


Columns can be exported to NumPy arrays, so that the data can be analyzed using NumPy functions. Usually, though, it's more efficient to use DataFrame functions for analysis.

In [15]:
score1 = data["score1"].to_numpy()  # access a column and convert to an array
score2 = data["score2"].to_numpy()
diff = score1 - score2              # now can use NumPy operations and functions
print(score1)
print(score2)
print(diff)

[3 6 2]
[8 2 4]
[-5  4 -2]


## Expressions

In Polars, we can use expressions to represent operations on columns in a DataFrame. These expressions are used with the `select`, `with_columns`, `filter`, and `group_by` methods to clean, reorganize, and analyze data.

Expressions let us describe mathematical operations on data columns, using standard math operators. Note that we can define an expression without actually evaluating it on any data.

In [23]:
print(pl.col("score1"))                     # refer to a "score1" column
print(pl.col("score1") / 10)                # divide column by 10
print(pl.col("score1") ** 2)                # square column
print(pl.col("score1").sqrt())              # square root of column
print(pl.col("score1") + pl.col("score2"))  # add two columns
print(pl.col("score1") * pl.col("score2"))  # multiply two columns


col("score1")
[(col("score1")) / (dyn int: 10)]
col("score1").pow([dyn int: 2])
col("score1").sqrt()
[(col("score1")) + (col("score2"))]
[(col("score1")) * (col("score2"))]


Standard statistics are also available, similar to NumPy. By default, missing data (`null` values in Polars) will be ignored, like with NumPy's `nanmean`, `nanstd`, etc.

In [24]:
print(pl.col("score1").sum())           # sum over all rows
print(pl.col("score1").mean())          # mean
print(pl.col("score1").std())           # standard deviation
print(pl.col("score1").min())           # minimum
print(pl.col("score1").max())           # maximum
print(pl.col("score1").median())        # median
print(pl.col("score1").quantile(0.25))  # 25th percentile

col("score1").sum()
col("score1").mean()
col("score1").std()
col("score1").min()
col("score1").max()
col("score1").median()
col("score1").quantile()


## Using select and with_columns

Use `select` to get a subset of columns from a DataFrame, change their order, and transform them. Use `with_columns` to add columns without removing any.

In [16]:
data

| participant_id | age | score1 | score2 |
| --- | --- | --- | --- |
| str | i64 | i64 | i64 |
| "001" | 25 | 3 | 8 |
| "002" | 32 | 6 | 2 |
| "003" | 65 | 2 | 4 |


Pass a list of columns to reorder them and/or get a subset of columns.

In [None]:
data.select(["score1", "score2", "participant_id"])

| score1 | score2 | participant_id |
| --- | --- | --- |
| i64 | i64 | str |
| 3 | 8 | "001" |
| 6 | 2 | "002" |
| 2 | 4 | "003" |


Use an expression to make a new column based on existing columns.

In [19]:
data.select(
    "score1",
    "score2",
    score_total=pl.col("score1") + pl.col("score2"),  # add score 1 and score 2
)

| score1 | score2 | score_total |
| --- | --- | --- |
| i64 | i64 | i64 |
| 3 | 8 | 11 |
| 6 | 2 | 8 |
| 2 | 4 | 6 |


Use `with_columns` to add a column to the existing ones. Otherwise, it works the same as `select`.

In [20]:
data.with_columns(
    score_total=pl.col("score1") + pl.col("score2")
)

| participant_id | age | score1 | score2 | score_total |
| --- | --- | --- | --- | --- |
| str | i64 | i64 | i64 | i64 |
| "001" | 25 | 3 | 8 | 11 |
| "002" | 32 | 6 | 2 | 8 |
| "003" | 65 | 2 | 4 | 6 |
