# Assignment 7

Please fill in the blanks in the *Answer* sections of this notebook. To check your answer for a problem, run its Setup, Answer, and Result sections. DO NOT MODIFY SETUP OR RESULT CELLS. See the [README](https://github.com/mortonne/datascipsych) for instructions on setting up a Python environment to run this notebook.

Write your answers for each problem. Then restart the kernel, run all cells, and save the notebook. Upload your notebook to Canvas.

If you get stuck, read through the other notebooks in this directory, ask us for help in class, or ask other students for help in class or on the weekly discussion board.

## Problem: importing a DataFrame (2 points)

### Import a CSV file (1 point)

Import the `people.csv` file in this directory to a Polars DataFrame called `people`.

### Convert a column to a NumPy array (1 point)

Convert the `height` column to a NumPy array, stored in a variable called `height`.

### Setup

In [1]:
import numpy as np
import polars as pl
people = None
height = None

### Answer

In [2]:
people = pl.read_csv("people.csv")
height = people["height"].to_numpy()

### Result

In [3]:
vars = [people, height]
if all([v is not None for v in vars]):
    # this should print your variables
    print(people)
    print(height)

    # this should not throw any errors
    assert people["name"].equals(
        pl.Series(["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"])
    )
    assert people["weight"].equals(pl.Series([57.9, 72.5, 53.6, 83.1]))
    assert np.array_equal(height, np.array([1.56, 1.77, 1.65, 1.75]))

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ str        ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘
[1.56 1.77 1.65 1.75]


## Problem: creating a DataFrame (2 points)

Create a DataFrame like the table below and assign it to a variable called `data`. The `trial_type` and `response` columns should be strings, and the `correct` column should have integers.

| trial_type | response | correct |
| ---------- | -------- | ------- |
| target     | old      | 1       |
| lure       | old      | 0       |
| target     | new      | 0       |
| lure       | new      | 1       |

The DataFrame should have all the information from the table above (1 point), with the correct datatypes (1 point).

### Setup

In [4]:
data = None

### Answer

In [5]:
data = pl.DataFrame(
    {
        "trial_type": ["target", "lure", "target", "lure"],
        "response": ["old", "old", "new", "new"],
        "correct": [1, 0, 0, 1],
    }
)

### Result

In [6]:
vars = [data]
if all([v is not None for v in vars]):
    # this should print your variables
    print(data)

    # this should not throw any errors
    trial_type = np.array(["target", "lure", "target", "lure"])
    assert np.array_equal(data["trial_type"].to_numpy(), trial_type)
    response = np.array(["old", "old", "new", "new"])
    assert np.array_equal(data["response"].to_numpy(), response)
    assert np.array_equal(data["correct"].to_numpy(), [1, 0, 0, 1])

shape: (4, 3)
┌────────────┬──────────┬─────────┐
│ trial_type ┆ response ┆ correct │
│ ---        ┆ ---      ┆ ---     │
│ str        ┆ str      ┆ i64     │
╞════════════╪══════════╪═════════╡
│ target     ┆ old      ┆ 1       │
│ lure       ┆ old      ┆ 0       │
│ target     ┆ new      ┆ 0       │
│ lure       ┆ new      ┆ 1       │
└────────────┴──────────┴─────────┘


## Problem: using select (2 points)

Given the Osth & Fox dataset (loaded below), use the select method to create a dataset called `subset` with these columns (in order): `subj`, `phase`, `cycle`, `type`.

### Setup

In [7]:
df = pl.read_csv("exp1.csv")
subset = None

### Answer

In [8]:
subset = df.select("subj", "phase", "cycle", "type")

### Result

In [9]:
vars = [subset]
if all([v is not None for v in vars]):
    # this should print your variables
    print(subset)

    # this should not throw any errors
    assert subset.shape == (107443, 4)
    assert subset.columns == ["subj", "phase", "cycle", "type"]


shape: (107_443, 4)
┌──────┬───────┬───────┬────────────┐
│ subj ┆ phase ┆ cycle ┆ type       │
│ ---  ┆ ---   ┆ ---   ┆ ---        │
│ i64  ┆ str   ┆ i64   ┆ str        │
╞══════╪═══════╪═══════╪════════════╡
│ 101  ┆ study ┆ 0     ┆ intact     │
│ 101  ┆ study ┆ 0     ┆ intact     │
│ 101  ┆ study ┆ 0     ┆ intact     │
│ 101  ┆ study ┆ 0     ┆ intact     │
│ 101  ┆ study ┆ 0     ┆ intact     │
│ …    ┆ …     ┆ …     ┆ …          │
│ 213  ┆ test  ┆ 7     ┆ intact     │
│ 213  ┆ test  ┆ 7     ┆ rearranged │
│ 213  ┆ test  ┆ 7     ┆ rearranged │
│ 213  ┆ test  ┆ 7     ┆ rearranged │
│ 213  ┆ test  ┆ 7     ┆ intact     │
└──────┴───────┴───────┴────────────┘


## Problem: using with_columns (2 points)

Given the `people` DataFrame (defined below), which has height in meters in the `height_m` column, add another column with height in feet called `height_ft` (1 meter is 3.28 feet). Assign the modified DataFrame to a variable called `converted`. 1 point for having a correct `height_ft` column; 1 point for keeping all of the old columns in the `converted` DataFrame.

### Setup

In [10]:
people = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "height_m": [1.56, 1.77, 1.65, 1.75],
    }
)
converted = None

### Answer

In [11]:
converted = people.with_columns(
    height_ft=pl.col("height_m") * 3.28
)

### Result

In [12]:
vars = [converted]
if all([v is not None for v in vars]):
    # this should print your variables
    print(converted)

    # this should not throw any errors
    assert "name" in converted
    assert "height_m" in converted
    assert converted["height_ft"].round(2).equals(pl.Series([5.12, 5.81, 5.41, 5.74]))

shape: (4, 3)
┌────────────────┬──────────┬───────────┐
│ name           ┆ height_m ┆ height_ft │
│ ---            ┆ ---      ┆ ---       │
│ str            ┆ f64      ┆ f64       │
╞════════════════╪══════════╪═══════════╡
│ Alice Archer   ┆ 1.56     ┆ 5.1168    │
│ Ben Brown      ┆ 1.77     ┆ 5.8056    │
│ Chloe Cooper   ┆ 1.65     ┆ 5.412     │
│ Daniel Donovan ┆ 1.75     ┆ 5.74      │
└────────────────┴──────────┴───────────┘


## Problem: using filter (2 points)

Given the Osth & Fox dataset (loaded below), use the filter method to get trials where `subj` is `101` (1 point) and `phase` is `"test"` (1 point). Assign the filtered DataFrame to a variable called `filtered`.

### Setup

In [13]:
df = pl.read_csv("exp1.csv")
filtered = None

### Answer

In [14]:
filtered = df.filter((pl.col("subj") == 101) & (pl.col("phase") == "test"))

### Result

In [15]:
vars = [filtered]
if all([v is not None for v in vars]):
    # this should print your variables
    print(filtered)

    # this should not throw any errors
    assert filtered.shape == (480, 16)
    assert (filtered["subj"] == 101).all()
    assert (filtered["phase"] == "test").all()

shape: (480, 16)
┌───────┬───────┬───────┬────────────┬───┬──────┬───────────┬──────────────┬────────┐
│ cycle ┆ trial ┆ phase ┆ type       ┆ … ┆ subj ┆ intactLag ┆ prevResponse ┆ prevRT │
│ ---   ┆ ---   ┆ ---   ┆ ---        ┆   ┆ ---  ┆ ---       ┆ ---          ┆ ---    │
│ i64   ┆ i64   ┆ str   ┆ str        ┆   ┆ i64  ┆ i64       ┆ i64          ┆ i64    │
╞═══════╪═══════╪═══════╪════════════╪═══╪══════╪═══════════╪══════════════╪════════╡
│ 0     ┆ -1    ┆ test  ┆ rearranged ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 0     ┆ test  ┆ rearranged ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 1     ┆ test  ┆ rearranged ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 2     ┆ test  ┆ rearranged ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ 0     ┆ 3     ┆ test  ┆ rearranged ┆ … ┆ 101  ┆ 0         ┆ 0            ┆ 0      │
│ …     ┆ …     ┆ …     ┆ …          ┆ … ┆ …    ┆ …         ┆ …            ┆ …      │
│ 7     ┆ 54    ┆ test  ┆ rearranged 

## Problem: summary statistics (2 points)

Given the `people` DataFrame defined below, create a new DataFrame with one row and two columns that has the mean weight and mean height across people. It should have correct `weight` and `height` columns (1 point), and no other columns (1 point). Assign this DataFrame to a variable called `stats`.

### Setup

In [16]:
people = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
stats = None

### Answer

In [17]:
stats = (
    people.select("weight", "height")
    .mean()
)

### Result

In [18]:
vars = [stats]
if all([v is not None for v in vars]):
    # this should print your variables
    print(stats)

    # this should not throw any errors
    assert stats.shape == (1, 2)
    assert stats["weight"].equals(pl.Series([66.775]))
    assert stats["height"].equals(pl.Series([1.6825]))

shape: (1, 2)
┌────────┬────────┐
│ weight ┆ height │
│ ---    ┆ ---    │
│ f64    ┆ f64    │
╞════════╪════════╡
│ 66.775 ┆ 1.6825 │
└────────┴────────┘


## Problem (graduate students): working with dates (2 points)

Read the `sessions.csv` file in this directory, parsing the dates in the `session1` and `session2` columns. Add a column called `delay` with the number of days separating sessions 1 and 2 (you should get a column with type `duration[ms]`), and a column called `score_change` with the difference between the `score2` and `score1` columns. Assign your new DataFrame to a variable called `result`.

In Polars, time differences are represented with `duration` datatypes, which look like `duration[ms]` in this case. The unit in brackets refers to the *time unit* (here, milliseconds), which determines how the data are stored. The time unit does not change the length of time that a value represents, so the differences here should still be whole numbers of days.

### Setup

In [19]:
result = None

### Answer

In [20]:
sessions = pl.read_csv("sessions.csv", try_parse_dates=True)
result = sessions.with_columns(
    delay=pl.col("session2") - pl.col("session1"),
    score_change=pl.col("score2") - pl.col("score1"),
)

### Result

In [21]:
from datetime import timedelta
vars = [result]
if all([v is not None for v in vars]):
    # this should print your variables
    print(result)

    # this should not throw any errors
    assert "participant_id" in result
    assert result["delay"].equals(pl.Series([timedelta(days=3), timedelta(days=5), timedelta(days=4)]))
    assert result["score_change"].equals(pl.Series([3, 7, 5]))


shape: (3, 7)
┌────────────────┬────────────┬────────┬────────────┬────────┬──────────────┬──────────────┐
│ participant_id ┆ session1   ┆ score1 ┆ session2   ┆ score2 ┆ delay        ┆ score_change │
│ ---            ┆ ---        ┆ ---    ┆ ---        ┆ ---    ┆ ---          ┆ ---          │
│ i64            ┆ date       ┆ i64    ┆ date       ┆ i64    ┆ duration[ms] ┆ i64          │
╞════════════════╪════════════╪════════╪════════════╪════════╪══════════════╪══════════════╡
│ 1              ┆ 2024-03-01 ┆ 15     ┆ 2024-03-04 ┆ 18     ┆ 3d           ┆ 3            │
│ 2              ┆ 2024-03-15 ┆ 12     ┆ 2024-03-20 ┆ 19     ┆ 5d           ┆ 7            │
│ 3              ┆ 2024-03-21 ┆ 15     ┆ 2024-03-25 ┆ 20     ┆ 4d           ┆ 5            │
└────────────────┴────────────┴────────┴────────────┴────────┴──────────────┴──────────────┘


## Problem (graduate students): lazy evaluation (2 points)

Read about the [lazy API](https://docs.pola.rs/user-guide/concepts/lazy-api/) in Polars. Use the lazy API to scan `exp1.csv`, filter to get trials where `phase` equals `"test"`, select the `response` and `RT` columns, and calculate their means. Assign the result to a variable called `lazy_result`. All operations should be completed by one call to the `collect` method.

### Setup

In [22]:
lazy_result = None

### Answer

In [23]:
q = (
    pl.scan_csv("exp1.csv")
    .filter(pl.col("phase") == "test")
    .select("response", "RT")
    .mean()
)
lazy_result = q.collect()

### Result

In [24]:
vars = [lazy_result]
if all([v is not None for v in vars]):
    # this should print your variables
    print(lazy_result)

    # this should not throw any errors
    assert lazy_result.shape == (1, 2)
    assert lazy_result["response"].round(2).equals(pl.Series([0.37]))
    assert lazy_result["RT"].round(2).equals(pl.Series([1.33]))


shape: (1, 2)
┌──────────┬──────────┐
│ response ┆ RT       │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.372607 ┆ 1.331173 │
└──────────┴──────────┘


## Problem (graduate students): expression expansion (2 points)

Read about how [expression expansion](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expression-expansion) can be used to write one expression that operates on multiple columns. Given the `people` DataFrame defined below, use a call to `with_columns` with a single expression to calculate the means of the `weight` and `height` columns and place them in new columns named `mean_weight` and `mean_height`, respectively. Assign the resulting DataFrame to a variable called `stats_1expr`.

### Setup

In [25]:
people = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)
stats_1expr = None

### Answer

In [26]:
stats_1expr = people.with_columns(
    pl.col("weight", "height").mean().name.prefix("mean_")
)

### Result

In [27]:
vars = [stats_1expr]
if all([v is not None for v in vars]):
    # this should print your variables
    print(stats_1expr)

    # this should not throw any errors
    assert stats_1expr.shape == (4, 5)
    assert stats_1expr["mean_weight"].equals(pl.Series([66.775, 66.775, 66.775, 66.775]))
    assert stats_1expr["mean_height"].equals(pl.Series([1.6825, 1.6825, 1.6825, 1.6825]))


shape: (4, 5)
┌────────────────┬────────┬────────┬─────────────┬─────────────┐
│ name           ┆ weight ┆ height ┆ mean_weight ┆ mean_height │
│ ---            ┆ ---    ┆ ---    ┆ ---         ┆ ---         │
│ str            ┆ f64    ┆ f64    ┆ f64         ┆ f64         │
╞════════════════╪════════╪════════╪═════════════╪═════════════╡
│ Alice Archer   ┆ 57.9   ┆ 1.56   ┆ 66.775      ┆ 1.6825      │
│ Ben Brown      ┆ 72.5   ┆ 1.77   ┆ 66.775      ┆ 1.6825      │
│ Chloe Cooper   ┆ 53.6   ┆ 1.65   ┆ 66.775      ┆ 1.6825      │
│ Daniel Donovan ┆ 83.1   ┆ 1.75   ┆ 66.775      ┆ 1.6825      │
└────────────────┴────────┴────────┴─────────────┴─────────────┘


## Problem (graduate students): viewing longer DataFrames in Jupyter (2 points)

When viewing a DataFrame in Jupyter, the number of displayed rows will automatically be limited. The Polars package has configuration [options](https://docs.pola.rs/docs/python/version/0.18/reference/config.html) that are used to determine how DataFrames are displayed. Changing the `tbl_rows` option will change the number of rows displayed. There are two ways to change the configuration in Polars.

One option is to change the configuration by calling one of the methods of `pl.Config`. For example:

```python
pl.Config.set_tbl_rows(50)
```

will change the `tbl_rows` option to set the maximum number of rows displayed to 50. This will be the setting until you change it back, for example by running:

```python
pl.Config.restore_defaults()
```

to restore the default options.

Another option can be used to change the configuration just for the DataFrames where you want to see more rows. To change the configuration temporarily, you can use the [context manager](https://docs.pola.rs/docs/python/version/0.18/reference/config.html#use-as-a-context-manager). For example:

```python
with pl.Config(tbl_rows=50):
    # for commands run here, the tbl_rows option is temporarily set to 50
    pass  # placeholder so the block is valid Python; replace with your commands
```

The catch is that Jupyter only displays a DataFrame automatically when it is the last expression in a cell, so a DataFrame named inside a `with` block will not be shown. Instead of putting the name of a DataFrame on the last line of a cell, you can use the `display` function from `IPython`, which is more flexible. You can import it like this:

```python
from IPython.display import display
```

and then use `display(my_data_frame)` to show it as a nice table in Jupyter. This method will work even if the command is not the last line in a cell.

Given the `df_test_cycle1` DataFrame set up below, which includes the first "cycle" of the test phase for the first participant, display all 60 rows in Jupyter using one of the methods described above (1 point). After your code runs, the Polars configuration should be back to its default value (1 point). Depending on the method you use, the configuration may be at its default value either because you restored the defaults, or because you used a context manager that only temporarily changed the option.

### Setup

In [28]:
df = pl.read_csv("exp1.csv")
df_test_cycle1 = df.filter(pl.col("phase") == "test").head(60)

### Answer

In [29]:
from IPython.display import display
with pl.Config(tbl_rows=60):
    display(df_test_cycle1)
# OR:
# pl.Config.set_tbl_rows(60)
# display(df_test_cycle1)
# pl.Config.restore_defaults()

| cycle | trial | phase | type | word1 | word2 | response | RT | correct | lag | serPos1 | serPos2 | subj | intactLag | prevResponse | prevRT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | str | str | str | str | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | -1 | "test" | "rearranged" | "waste" | "degree" | 0 | 2.312 | 1 | 2 | 12 | 10 | 101 | 0 | 0 | 0 |
| 0 | 0 | "test" | "rearranged" | "needed" | "able" | 0 | 3.542 | 1 | 1 | 27 | 28 | 101 | 0 | 0 | 0 |
| 0 | 1 | "test" | "rearranged" | "single" | "clean" | 0 | 2.084 | 1 | 3 | 3 | 6 | 101 | 0 | 0 | 0 |
| 0 | 2 | "test" | "rearranged" | "train" | "useful" | 0 | 1.669 | 1 | 2 | 55 | 57 | 101 | 0 | 0 | 0 |
| 0 | 3 | "test" | "rearranged" | "knees" | "various" | 0 | 2.326 | 1 | 5 | 44 | 49 | 101 | 0 | 0 | 0 |
| 0 | 4 | "test" | "intact" | "skin" | "careful" | 1 | 1.407 | 1 | -1 | 1 | 1 | 101 | 0 | 0 | 0 |
| 0 | 5 | "test" | "intact" | "doctor" | "contrast" | 0 | 4.056 | 0 | -1 | 35 | 35 | 101 | 34 | 1 | 1 |
| 0 | 6 | "test" | "rearranged" | "critical" | "system" | 0 | 3.366 | 1 | 3 | 8 | 5 | 101 | 0 | 0 | 0 |
| 0 | 7 | "test" | "intact" | "homes" | "fuel" | 1 | 2.499 | 1 | -1 | 29 | 29 | 101 | 0 | 0 | 0 |
| 0 | 8 | "test" | "intact" | "liked" | "tone" | 1 | 1.609 | 1 | -1 | 19 | 19 | 101 | -10 | 1 | 2 |


### Result

In [30]:
assert pl.Config.state()["POLARS_FMT_MAX_ROWS"] is None, "default options should be set"