# DataFrames
A DataFrame organizes tabular data, like a spreadsheet, into rows and named columns. We can import spreadsheet data directly into a DataFrame using Polars.

## Importing data
We can easily import data from spreadsheets into a variable called a DataFrame. The Polars package has powerful tools for working with DataFrames. First, we will import the Polars module.

In [1]:
import polars as pl

We can use `pl.read_csv` to import a CSV file into a DataFrame. The `head` method shows the first several rows of a DataFrame (you can choose the number of rows; the default is 5).

In [2]:
raw = pl.read_csv("exp1.csv")
raw.head()

cycle,trial,phase,type,word1,word2,response,RT,correct,lag,serPos1,serPos2,subj,intactLag,prevResponse,prevRT
i64,i64,str,str,str,str,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64
0,-1,"""study""","""intact""","""formal""","""positive""",-1,-1.0,-1,-1,0,0,101,0,0,0
0,0,"""study""","""intact""","""skin""","""careful""",-1,-1.0,-1,-1,1,1,101,0,0,0
0,1,"""study""","""intact""","""upon""","""miss""",-1,-1.0,-1,-1,2,2,101,0,0,0
0,2,"""study""","""intact""","""single""","""tradition""",-1,-1.0,-1,-1,3,3,101,0,0,0
0,3,"""study""","""intact""","""prove""","""airport""",-1,-1.0,-1,-1,4,4,101,0,0,0


There are a lot of columns, including some that we won't use. Let's use the `select` method to make the data a little simpler to look at (more on this method later).

In [3]:
columns = [
    "subj", 
    "cycle", 
    "trial", 
    "phase", 
    "type", 
    "word1", 
    "word2", 
    "response", 
    "RT", 
    "correct", 
    "lag",
]
df = raw.select(columns)
df.head()

subj,cycle,trial,phase,type,word1,word2,response,RT,correct,lag
i64,i64,i64,str,str,str,str,i64,f64,i64,i64
101,0,-1,"""study""","""intact""","""formal""","""positive""",-1,-1.0,-1,-1
101,0,0,"""study""","""intact""","""skin""","""careful""",-1,-1.0,-1,-1
101,0,1,"""study""","""intact""","""upon""","""miss""",-1,-1.0,-1,-1
101,0,2,"""study""","""intact""","""single""","""tradition""",-1,-1.0,-1,-1
101,0,3,"""study""","""intact""","""prove""","""airport""",-1,-1.0,-1,-1


A DataFrame has multiple columns, each of which can have a different type of data. For example, `i64` means a 64-bit integer, and `str` indicates a string. Polars tries to guess the right data type for each column based on the data in that column during import.

The list of columns and their data types is known as a *schema*. We can access it for an existing DataFrame using the `schema` attribute.

In [4]:
df.schema

Schema([('subj', Int64),
        ('cycle', Int64),
        ('trial', Int64),
        ('phase', String),
        ('type', String),
        ('word1', String),
        ('word2', String),
        ('response', Int64),
        ('RT', Float64),
        ('correct', Int64),
        ('lag', Int64)])

When importing data, we can make changes to the schema to read things in differently. Note that we can chain together method calls by adding `.` and then another DataFrame method at the end. Here, we add `.head(1)` to print out the first row after reading in the CSV file again.

In [5]:
pl.read_csv("exp1.csv", schema_overrides={"response": pl.Float64}).select(columns).head(1)

subj,cycle,trial,phase,type,word1,word2,response,RT,correct,lag
i64,i64,i64,str,str,str,str,f64,f64,i64,i64
101,0,-1,"""study""","""intact""","""formal""","""positive""",-1.0,-1.0,-1,-1


Note that the `response` column is now a 64-bit float (`f64`) instead of an integer.

## Accessing data in a DataFrame
DataFrames can organize a lot of data. Often data analysis starts by getting the specific data that we want to work with. This may involve *filtering* the table to get rows that meet some criteria and *selecting* columns to get a subset of them.

We can access columns of a DataFrame using `df["column"]`, like we've seen before with dictionaries. This gives us a special data type called a `Series`, which Polars uses to represent a single column of data separately from a DataFrame. A Series isn't used very often by itself.

In [6]:
df["type"]

type
str
"""intact"""
"""intact"""
"""intact"""
"""intact"""
"""intact"""
…
"""intact"""
"""rearranged"""
"""rearranged"""
"""rearranged"""


We can convert Series into NumPy arrays using the `to_numpy` method.

In [7]:
x = df["response"].to_numpy()
x

array([-1, -1, -1, ...,  1,  0,  1], shape=(107443,))

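We can then run any NumPy functions we want, like calculating the mean. We'll use `nanmean`, which ignores NaN values. Note that this dataset appears to code missing responses as -1 rather than NaN, so those values still count toward the mean here.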

In [8]:
import numpy as np
np.nanmean(x)

np.float64(-0.31397112887763745)

Usually converting data to NumPy arrays isn't the best way to go, however. Polars has lots of tools for quickly doing things like calculating means and handling missing data by operating on a DataFrame.

We can select multiple columns, and optionally change their order, using the `select` method.

In [9]:
df.select(["subj", "phase", "response", "RT"]).head()

subj,phase,response,RT
i64,str,i64,f64
101,"""study""",-1,-1.0
101,"""study""",-1,-1.0
101,"""study""",-1,-1.0
101,"""study""",-1,-1.0
101,"""study""",-1,-1.0


We can filter the rows included in the table using the `filter` method. This allows for selecting data very flexibly, in a similar way to how we selected data using NumPy expressions when working with NumPy arrays. To create an expression that refers to the value of a column, we must use the `pl.col` function.

In [10]:
df.filter(pl.col("phase") == "test").head()

subj,cycle,trial,phase,type,word1,word2,response,RT,correct,lag
i64,i64,i64,str,str,str,str,i64,f64,i64,i64
101,0,-1,"""test""","""rearranged""","""waste""","""degree""",0,2.312,1,2
101,0,0,"""test""","""rearranged""","""needed""","""able""",0,3.542,1,1
101,0,1,"""test""","""rearranged""","""single""","""clean""",0,2.084,1,3
101,0,2,"""test""","""rearranged""","""train""","""useful""",0,1.669,1,2
101,0,3,"""test""","""rearranged""","""knees""","""various""",0,2.326,1,5


Here, we indicate that we only want rows where the phase is `test`, to get only the test trials and exclude all the study trials. The input to `filter` is called an *expression*. It specifies something we want to do, like comparing a column to some value, without actually doing it yet. This allows Polars to optimize how it runs different operations.

In [11]:
type(pl.col("phase") == "test")

polars.expr.expr.Expr

We can make more complicated expressions using operators for and (`&`), or (`|`), and not (`~`). If you are combining multiple comparisons, you will need to add parentheses around the individual comparisons, because in Python `&` and `|` are evaluated before comparison operators like `==`.

In [12]:
targets = df.filter((pl.col("phase") == "test") & (pl.col("type") == "intact"))
targets.head()

subj,cycle,trial,phase,type,word1,word2,response,RT,correct,lag
i64,i64,i64,str,str,str,str,i64,f64,i64,i64
101,0,4,"""test""","""intact""","""skin""","""careful""",1,1.407,1,-1
101,0,5,"""test""","""intact""","""doctor""","""contrast""",0,4.056,0,-1
101,0,7,"""test""","""intact""","""homes""","""fuel""",1,2.499,1,-1
101,0,8,"""test""","""intact""","""liked""","""tone""",1,1.609,1,-1
101,0,9,"""test""","""intact""","""notice""","""explain""",1,1.352,1,-1


We can put together multiple operations by chaining methods. Here, we'll select some columns and filter the rows in one go.

In [13]:
df_test = (
    df.filter(pl.col("phase") == "test")
    .select(["subj", "trial", "phase", "type", "response", "RT", "correct", "lag"])
)
df_test.head()

subj,trial,phase,type,response,RT,correct,lag
i64,i64,str,str,i64,f64,i64,i64
101,-1,"""test""","""rearranged""",0,2.312,1,2
101,0,"""test""","""rearranged""",0,3.542,1,1
101,1,"""test""","""rearranged""",0,2.084,1,3
101,2,"""test""","""rearranged""",0,1.669,1,2
101,3,"""test""","""rearranged""",0,2.326,1,5


Here, we put parentheses around the whole chain of method calls. This lets us split the chain so there is one call per line, which tends to make things easier to read.

## Calculating summary statistics
When getting a feel for a dataset, it can be very helpful to calculate some summary statistics to measure central tendency and spread.

We can get some common summary statistics for all the columns using `describe`.

In [14]:
df_test.describe()

statistic,subj,trial,phase,type,response,RT,correct,lag
str,f64,f64,str,str,f64,f64,f64,f64
"""count""",53700.0,53700.0,"""53700""","""53700""",53700.0,53700.0,53700.0,53700.0
"""null_count""",0.0,0.0,"""0""","""0""",0.0,0.0,0.0,0.0
"""mean""",165.655866,28.5,,,0.372607,1.331173,0.656648,1.0
"""std""",99.008229,17.318264,,,0.48554,0.76009,0.474832,2.236089
"""min""",101.0,-1.0,"""test""","""intact""",-1.0,-1.0,0.0,-1.0
"""25%""",128.0,14.0,,,0.0,0.882,0.0,-1.0
"""50%""",158.0,29.0,,,0.0,1.125,1.0,1.0
"""75%""",186.0,43.0,,,1.0,1.526,1.0,3.0
"""max""",1150.0,58.0,"""test""","""rearranged""",1.0,7.925,1.0,5.0


The `count` row gives the number of non-null samples in each column, while the `null_count` row indicates how many samples are null.

The `mean` and `std` rows give the mean and standard deviation.

The `min` and `max` rows give the minimum and maximum values in each column. The percentage rows give percentiles in the data. The 50th percentile (`50%`) is also known as the median.

We can calculate more targeted statistics using expressions.

In [15]:
df_test.select(pl.col("response", "RT").mean())

response,RT
f64,f64
0.372607,1.331173


Often, we're interested in calculating statistics for each of a number of groups. For example, we may want to split responses based on whether they were on a target (`intact`) or a lure (`rearranged`) trial. We can do this using the `group_by` method with `agg`.

In [16]:
(
    df_test.group_by("type")
    .agg(pl.mean("response"), pl.mean("RT"))
)

type,response,RT
str,f64,f64
"""intact""",0.530168,1.272813
"""rearranged""",0.215047,1.389533


We can use the `alias` method to rename a column. Here, we'll rename `"RT"` to the more descriptive `"response_time"`.

In [17]:
(
    df_test.group_by("type")
    .agg(pl.mean("response"), pl.mean("RT").alias("response_time"))
)

type,response,response_time
str,f64,f64
"""intact""",0.530168,1.272813
"""rearranged""",0.215047,1.389533
