# Assignment 9

Please fill in the blanks in the *Answer* sections of this notebook. To check your answer for a problem, run the Setup, Answer, and Result sections. DO NOT MODIFY SETUP OR RESULT CELLS. See the [README](https://github.com/mortonne/datascipsych) for instructions on setting up a Python environment to run this notebook.

Write your answers for each problem. Then restart the kernel, run all cells, and save the notebook. Upload your notebook to Canvas.

If you get stuck, read through the other notebooks in this directory, ask us for help in class, or ask other students for help in class or on the weekly discussion board.

## Problem: working with null values (2 points)

### Read a file with null values (1 point)

Read the `study.csv` file in this directory using `pl.read_csv`. Use the optional `null_values` input to treat `n/a` entries as null. Assign the DataFrame to a variable called `study`.

### Check the number of null values (1 point)

Use a Polars function to get the number of null values in the `response` column and assign it to a variable called `null_responses`.

### Setup

In [1]:
import numpy as np
import polars as pl
from IPython.display import display
study = None
null_responses = None

### Answer

In [2]:
# your code here

### Result

In [3]:
vars = [study, null_responses]
if all([v is not None for v in vars]):
    # this should print your variables
    with pl.Config(tbl_rows=50):
        display(study)
    print(null_responses)

    # this should not throw any errors
    assert null_responses == 2
    response = pl.Series(
        [1, 0, 1, 0, 1, 0, None, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, None, 1, 1, 0, 0, 1]
    )
    assert study["response"].equals(response)

## Problem: recoding variables (2 points)

Take the `data` DataFrame (defined below) and recode the `item_type` column. Use Polars methods to replace that column with a version where `1` is now `"word"` and `2` is now `"picture"`. Assign the result to a variable called `recoded`.

### Setup

In [4]:
data = pl.read_csv("study.csv")
recoded = None
data.head()  # display part of the DataFrame for reference

| participant_id | trial_type | item_type | response | response_time |
| -------------- | ---------- | --------- | -------- | ------------- |
| i64            | str        | i64       | str      | str           |
| 1              | "target"   | 1         | "1"      | "1.5"         |
| 1              | "lure"     | 2         | "0"      | "2.3"         |
| 1              | "target"   | 2         | "1"      | "1.7"         |
| 1              | "target"   | 1         | "0"      | "2.2"         |
| 1              | "lure"     | 2         | "1"      | "1.8"         |


### Answer

In [5]:
# your code here

### Result

In [6]:
vars = [recoded]
if all([v is not None for v in vars]):
    # this should print your variables
    with pl.Config(tbl_rows=50):
        display(recoded)

    # this should not throw any errors
    item_type = pl.Series(
        [
            "word", 
            "picture", 
            "picture", 
            "word", 
            "picture", 
            "word", 
            "word", 
            "picture", 
            "picture", 
            "word", 
            "word", 
            "picture", 
            "word", 
            "word",
            "picture",
            "picture",
            "word",
            "word",
            "picture",
            "picture",
            "picture",
            "picture",
            "word",
            "word",
        ]
    )
    assert recoded["item_type"].equals(item_type)

## Problem: grouping and aggregation (2 points)

### One set of groups (1 point)

Take the `data` DataFrame (defined below) and use `group_by` and `agg` to calculate the mean response time for targets and lures. Assign the result to a variable called `rt_trial_type`.

### Two sets of groups (1 point)

Take the `data` DataFrame (defined below) and use `group_by` and `agg` to calculate the mean response time for targets and lures, split by whether the response was `"yes"` or `"no"`. Assign the result to a variable called `rt_trial_type_response`.

### Setup

In [7]:
data = (
    pl.read_csv("study.csv", null_values="n/a")
    .with_columns(pl.col("response").cast(pl.String).replace({"0": "no", "1": "yes"}))
    .filter(pl.col("response").is_not_null())
)
rt_trial_type = None
rt_trial_type_response = None
data.head()  # display part of the DataFrame for reference

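| participant_id | trial_type | item_type | response | response_time |
| -------------- | ---------- | --------- | -------- | ------------- |
| i64            | str        | i64       | str      | f64           |
| 1              | "target"   | 1         | "yes"    | 1.5           |
| 1              | "lure"     | 2         | "no"     | 2.3           |
| 1              | "target"   | 2         | "yes"    | 1.7           |
| 1              | "target"   | 1         | "no"     | 2.2           |
| 1              | "lure"     | 2         | "yes"    | 1.8           |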


### Answer

In [8]:
# your code here

### Result

In [9]:
vars = [rt_trial_type, rt_trial_type_response]
if all([v is not None for v in vars]):
    # this should print your variables
    display(rt_trial_type)
    display(rt_trial_type_response)

    # this should not throw any errors
    assert rt_trial_type.sort("trial_type")["response_time"].round(2).equals(pl.Series([2.26, 1.64]))
    assert rt_trial_type_response.sort("trial_type", "response")["response_time"].round(2).equals(pl.Series([2.37, 2.14, 1.85, 1.59]))

## Problem: reshaping data to long format (2 points)

In the `scores.csv` dataset, there were two experimental conditions (1 or 2) and two tests of performance (test 1 and test 2). The spreadsheet has a column for each test. Say we want to reshape the data into long format, with one observation per row.

Take the `scores` DataFrame defined below and reshape it to long format using `unpivot`. Assign the result to a variable called `long`. There should be 4 columns in the `long` DataFrame: `participant_id`, `condition`, `test_type`, and `test_score`, and 8 rows, where each row represents one test score. The `test_type` column should label each row as `"test1"` or `"test2"`. The `test_score` column should give the test score.

Hint: the `index` input of `unpivot` can take either a string corresponding to one column name, or a list of strings indicating multiple columns.

### Setup

In [10]:
scores = pl.read_csv("scores.csv")
long = None
scores.head()  # display the DataFrame for reference

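| participant_id | condition | test1 | test2 |
| -------------- | --------- | ----- | ----- |
| i64            | i64       | i64   | i64   |
| 1              | 1         | 6     | 9     |
| 1              | 2         | 4     | 8     |
| 2              | 1         | 9     | 10    |
| 2              | 2         | 7     | 9     |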


### Answer

In [11]:
# your code here

### Result

In [12]:
vars = [long]
if all([v is not None for v in vars]):
    # this should print your variables
    display(long)

    # this should not throw any errors
    sorted = long.sort("test_type", "participant_id", "condition")
    assert sorted["participant_id"].equals(pl.Series([1, 1, 2, 2, 1, 1, 2, 2]))
    assert sorted["condition"].equals(pl.Series([1, 2, 1, 2, 1, 2, 1, 2]))
    assert sorted["test_type"].equals(
        pl.Series(
            ["test1", "test1", "test1", "test1", "test2", "test2", "test2", "test2"]
        )
    )
    assert sorted["test_score"].equals(pl.Series([6, 4, 9, 7, 9, 8, 10, 9]))

## Problem: reshaping data to wide format (2 points)

In the `study.csv` data, the item type was either 1 (word) or 2 (picture). Say that we want to know whether there was an average difference in response time for these two types of items.

### Reshape data (1 point)

Given the `rt` DataFrame defined below, "pivot" the data into wide format using the `pivot` function. The resulting DataFrame should have one row per participant, with columns: `participant_id`, `1` (the mean response time for item type 1 for each participant), and `2` (the mean response time for item type 2 for each participant). Assign the result to a variable called `wide`.

### Calculate response time difference (1 point)

Use `with_columns` to modify your `wide` DataFrame by adding a new column called `rt_diff`, which has the difference between response time for item type 2 and item type 1.

### Setup

In [13]:
rt = (
    pl.read_csv("study.csv", null_values="n/a")
    .group_by(pl.col("participant_id", "item_type"))
    .agg(pl.col("response_time").mean())
    .sort("participant_id", "item_type")
)
wide = None
rt.head()  # display part of the DataFrame for reference

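| participant_id | item_type | response_time |
| -------------- | --------- | ------------- |
| i64            | i64       | f64           |
| 1              | 1         | 1.966667      |
| 1              | 2         | 1.85          |
| 2              | 1         | 1.7           |
| 2              | 2         | 1.85          |
| 3              | 1         | 2.125         |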


### Answer

In [14]:
# your code here

### Result

In [15]:
vars = [wide]
if all([v is not None for v in vars]):
    # this should print your variables
    display(wide)

    # this should not throw any errors
    assert wide["participant_id"].equals(pl.Series([1, 2, 3]))
    assert wide["1"].round(2).equals(pl.Series([1.97, 1.70, 2.13]))
    assert wide["2"].round(2).equals(pl.Series([1.85, 1.85, 2.30]))
    assert wide["rt_diff"].round(2).equals(pl.Series([-0.12, 0.15, 0.17]))

## Problem: cleaning and summarizing a dataset (2 points)

When using Polars methods, we can call one method at a time and create a new variable each time, like this:

```python
result1 = df.method1(...)
result2 = result1.method2(...)
```

However, it often makes more sense to chain method calls together and run multiple operations in one command, using code like this:

```python
result = df.method1(...).method2(...)
```

or this, if we split over multiple lines:

```python
result = (
    df.method1(...)
    .method2(...)
)
```

Given the `study` DataFrame imported below, use a chain of Polars method calls to do the following:

* Recode the `item_type` column so that `1` is `"word"` and `2` is `"picture"`.
* Recode the `response` column so that `1` is `"yes"` and `0` is `"no"`.
* Calculate the mean response time for each combination of `participant_id`, `item_type`, and `response`.

Assign the result to a variable called `mean_rt`. Note that some of these commands have been used elsewhere in the assignment; the exercise here is to put them all together.

0.5 points for each of the three steps above; 0.5 points for completing them in one chain of method calls.

### Setup

In [16]:
study = pl.read_csv("study.csv", null_values="n/a").drop_nulls()
mean_rt = None
study.head()  # display part of the DataFrame for reference

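| participant_id | trial_type | item_type | response | response_time |
| -------------- | ---------- | --------- | -------- | ------------- |
| i64            | str        | i64       | i64      | f64           |
| 1              | "target"   | 1         | 1        | 1.5           |
| 1              | "lure"     | 2         | 0        | 2.3           |
| 1              | "target"   | 2         | 1        | 1.7           |
| 1              | "target"   | 1         | 0        | 2.2           |
| 1              | "lure"     | 2         | 1        | 1.8           |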


### Answer

In [17]:
# your code here

### Result

In [18]:
vars = [mean_rt]
if all([v is not None for v in vars]):
    # this should print your variables
    sorted = mean_rt.sort("participant_id", "item_type", "response")
    with pl.Config(tbl_rows=50):
        display(sorted)

    # this should not throw any errors
    assert sorted["item_type"][:4].equals(pl.Series(["picture", "picture", "word", "word"]))
    assert sorted["response"][:4].equals(pl.Series(["no", "yes", "no", "yes"]))
    assert sorted["response_time"].round(1).equals(
        pl.Series([2.3, 1.7, 2.2, 1.5, 2.0, 1.4, 1.7, 2.8, 2.1, 2.4, 2.0])
    )

## Problem (graduate students): cleaning, filtering, and aggregating (6 points)

### Import data and run basic cleaning (2 points)

Import the `datasets` module from the `datascipsych` package. Get the path to the Osth & Fox (2019) dataset using `datasets.get_dataset_file("Osth2019")` and load the data using `pl.read_csv`. Use the `datasets.clean_osth` function to recode null values and add a `probe_type` column. Assign the cleaned dataset to a variable called `cleaned`.

### Filter the data to get test phase data (2 points)

Filter the `cleaned` DataFrame to get just test-phase data and drop trials where `response` is `null`. Assign your result to a variable called `test`.

### Calculate mean response time by condition (2 points)

For the `test` DataFrame, recode the `response` column so that `1` is now `"yes"` and `0` is now `"no"`. Calculate mean response time for each combination of probe type and response. Assign the result to a variable called `rt`.

### Setup

In [19]:
cleaned = None
test = None
rt = None

### Answer

In [20]:
# your code here

### Result

In [21]:
vars = [cleaned, test, rt]
if all([v is not None for v in vars]):
    # this should print your variables
    display(cleaned)
    display(test)
    display(rt)

    # this should not throw any errors
    assert cleaned["response"].null_count() == 53796
    assert (test["phase"] == "test").all()
    assert test["response"].null_count() == 0
    assert rt.sort("response", "probe_type")["RT"].round(2).equals(
        pl.Series([1.40, 1.28, 1.36, 1.27])
    )

## Problem (graduate students): joining datasets (2 points)

Read about dataset [joins](https://docs.pola.rs/user-guide/transformations/joins/) in Polars. There are many options for joining datasets together; we will use a simple *Equi Inner* join.

Say we have separately calculated mean response time and mean accuracy for a set of participants, and now we want to get that information together in one DataFrame. Take the `response_time` and `accuracy` DataFrames below and join them based on the `participant_id` column. Assign the result to a variable called `stats`.

The joined DataFrame should have the following data (the order of rows and columns may vary, depending on how you do the join):

| participant_id | response_time | accuracy |
| -------------- | ------------- | -------- |
| 01             | 1.2           | 0.9      |
| 02             | 1.5           | 0.75     |
| 03             | 1.7           | 0.65     |

### Setup

In [22]:
response_time = pl.DataFrame(
    {
        "participant_id": ["03", "02", "01"],
        "response_time": [1.7, 1.5, 1.2],
    }
)
accuracy = pl.DataFrame(
    {
        "participant_id": ["01", "02", "03"],
        "accuracy": [0.9, 0.75, 0.65],
    }
)
stats = None
display(response_time)
display(accuracy)

| participant_id | response_time |
| -------------- | ------------- |
| str            | f64           |
| "03"           | 1.7           |
| "02"           | 1.5           |
| "01"           | 1.2           |


| participant_id | accuracy |
| -------------- | -------- |
| str            | f64      |
| "01"           | 0.9      |
| "02"           | 0.75     |
| "03"           | 0.65     |


### Answer

In [23]:
# your code here

### Result

In [24]:
vars = [stats]
if all([v is not None for v in vars]):
    # this should print your variables
    display(stats)

    # this should not throw any errors
    sorted = stats.sort("participant_id")
    assert sorted["participant_id"].equals(pl.Series(["01", "02", "03"]))
    assert sorted["response_time"].equals(pl.Series([1.2, 1.5, 1.7]))
    assert sorted["accuracy"].equals(pl.Series([0.9, 0.75, 0.65]))