# Assignment 13 draft (NOT FINALIZED)

Please fill in blanks in the *Modified* and *Answer* sections of this notebook. Most problems just involve taking existing code (in the *Original* section) and writing a modified version of the code (in the *Modified* section). DO NOT MODIFY SETUP OR ORIGINAL CELLS. See the [README](https://github.com/mortonne/datascipsych) for instructions on setting up a Python environment to run this notebook.

After writing your answers for each problem, restart the kernel, run all cells, and save the notebook. Then upload your notebook to Canvas.

If you get stuck, read through the other notebooks in this directory, ask us for help in class, or ask other students for help in class or on the weekly discussion board.

## Problem: code style (2 points)

Change the following code to meet the guidelines described in the [Coding Best Practices](https://mortonne.github.io/datascipsych/assignments/assignment13/coding_best_practices.html#use-consistent-code-style) lecture. Read the code in the Original section and write your modified code in the Modified section.

### Standardize import statements (1 point)

Move the import statements to the top of the code cell (0.5 points) and use standard names for each of the packages (0.5 points).

### Apply Black formatting (1 point)

Apply Black-style formatting to the code.
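To give a sense of what Black-style formatting looks like, here is a minimal sketch on unrelated code. The variable names and values are made up for illustration, and the "before" lines are shown only as comments. The "after" lines follow conventions Black typically applies: double quotes, one space after commas and around operators, and no padding inside brackets.

```python
# Before (inconsistent quotes and spacing):
#     rts = [ 0.52,0.61 , 0.48 ]
#     labels = {'fast':0.5,'slow':1.0}
#     mean_rt=sum(rts)/len(rts)

# After (Black-style formatting of the same statements):
rts = [0.52, 0.61, 0.48]
labels = {"fast": 0.5, "slow": 1.0}
mean_rt = sum(rts) / len(rts)
```

Formatting changes like these do not affect what the code does; they only make it easier to read.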

### Original

In [1]:
from datascipsych import datasets
file = datasets.get_dataset_file('Osth2019')

import polars as pl
data = pl.read_csv( file )
targets = data.filter(pl.col("type") == "intact").group_by("subj").agg(pl.col("response").mean())
import numpy as N
x = N.arange(10)
y = N.sum(x)

### Modified

In [2]:
# your modified version of the code here

## Problem: variable names (2 points)

The code below uses generic variable names that do not indicate what each variable holds.

Rename the variables to describe what they refer to. There isn't a specific right answer, but try to choose names that are descriptive but not too long.

Make sure you rename each variable everywhere it appears in the code cell, so that the code still works.

### Original

In [3]:
import polars as pl
from datascipsych import datasets

a = datasets.get_dataset_file("Osth2019")
b = datasets.clean_osth(pl.read_csv(a))
c = b.filter(pl.col("phase") == "test")
d = c.drop_nulls().group_by("response").agg(pl.col("RT").mean())

### Modified

In [4]:
# your modified version of the code here

## Problem: code comments (2 points)

The code below has good variable names, but could still use a little clarification. The first block sets up some arrays with information about six trials, including the trial number, trial type, and response for each trial. The second block takes these arrays and calculates the hit rate and false alarm rate for those trials.

Add a comment above each of the two code blocks to describe what they are doing. Good comments are usually short, declarative sentences, like `# Read the raw data` or `# Calculate mean recall for each condition`.

There is no specific correct answer here. Try to come up with two relatively short comments (one line each, fewer than 80 characters) that help make the code easier to understand.
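As an illustration of this comment style on unrelated code, here is a minimal sketch. The reaction times and the 0.2-second cutoff are made-up values for the example and are not part of the assignment.

```python
# Define hypothetical reaction times (in seconds) for five trials
reaction_times = [0.52, 0.15, 0.61, 0.48, 0.73]
min_valid_rt = 0.2

# Exclude anticipatory responses and average the remaining trials
valid_rts = [rt for rt in reaction_times if rt >= min_valid_rt]
mean_valid_rt = sum(valid_rts) / len(valid_rts)
```

Each comment summarizes the purpose of a whole block rather than restating a single line.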

### Original

In [5]:
import numpy as np

trial_number = np.array([1, 2, 3, 4, 5, 6])
trial_type = np.array(["target", "lure", "lure", "lure", "target", "target"])
old_response = np.array([1, 0, 1, 0, 0, 1])

hit_rate = np.mean(old_response[trial_type == "target"])
false_alarm_rate = np.mean(old_response[trial_type == "lure"])

### Modified

In [6]:
# your modified version of the code here

## Problem: using loops (2 points)

The code below reads in a CSV file for each subject, calculates the total number of correct trials for that subject, and then sums those totals across subjects.

Change the code to calculate the same sum using a `for` loop instead.
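The general pattern is to start an accumulator at zero and add to it on each pass through the loop. Below is a minimal sketch of that pattern on made-up data; the `session_scores` dictionary and its keys are hypothetical and stand in for values that would normally come from files. The `:02d` format code zero-pads a number to two digits.

```python
# Hypothetical scores for three sessions (illustration only)
session_scores = {"ses-01": 4, "ses-02": 7, "ses-03": 5}

total_score = 0
for session_number in range(1, 4):
    # Build a zero-padded label such as "ses-01"
    session_label = f"ses-{session_number:02d}"
    total_score += session_scores[session_label]

print(total_score)
```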

### Original

In [7]:
import polars as pl

n1 = pl.read_csv("data/sub-01_beh.csv")["correct"].sum()
n2 = pl.read_csv("data/sub-02_beh.csv")["correct"].sum()
n3 = pl.read_csv("data/sub-03_beh.csv")["correct"].sum()
n4 = pl.read_csv("data/sub-04_beh.csv")["correct"].sum()
n5 = pl.read_csv("data/sub-05_beh.csv")["correct"].sum()
n6 = pl.read_csv("data/sub-06_beh.csv")["correct"].sum()
n7 = pl.read_csv("data/sub-07_beh.csv")["correct"].sum()
n8 = pl.read_csv("data/sub-08_beh.csv")["correct"].sum()
total = n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8
print(total)

48


### Modified

In [8]:
# your modified version of the code here

## Problem: using functions (2 points)

Polars expressions can be stored as variables. For example, you could create a variable defining a filter expression, and then use that variable with the filter function to get the rows of a DataFrame that match that expression:

```python
test_expr = pl.col("phase") == "test"
test_trials = df.filter(test_expr)
```

The code in the Original section uses a Polars expression to calculate mean and SEM for conditions in two DataFrames. Write two functions called `mean_expr` and `sem_expr` that take a column name and return a Polars expression to calculate the given statistic. Rewrite the original code to use your functions.

### Setup

In [9]:
import polars as pl
from datascipsych import datasets

dataset_file = datasets.get_dataset_file("Osth2019")
df_osth = datasets.clean_osth(pl.read_csv(dataset_file))
mean_response_type = (
    df_osth.filter(pl.col("phase") == "test")
    .group_by("subj", "probe_type")
    .agg(pl.col("response").mean())
    .sort("subj", "probe_type")
)
mean_response_lag = (
    df_osth.filter((pl.col("phase") == "test") & (pl.col("probe_type") == "lure"))
    .group_by("subj", "lag")
    .agg(pl.col("response").mean())
    .sort("subj", "lag")
)

### Original

In [10]:
stats_response_type = (
    mean_response_type.group_by("probe_type")
    .agg(
        mean=pl.col("response").mean(),
        sem=pl.col("response").std() / pl.col("response").len().sqrt()
    )
)
stats_response_lag = (
    mean_response_lag.group_by("lag")
    .agg(
        mean=pl.col("response").mean(),
        sem=pl.col("response").std() / pl.col("response").len().sqrt()
    )
)

### Modified

In [11]:
# your modified version of the code here

## Problem: making a function more flexible (2 points)

The `subject_mean_response` function defined in the Original section can be used to calculate the mean response for each combination of subject and probe type. 

Rewrite the function to be more flexible. In addition to `df`, your modified version should take a `subject` input, which indicates the name of the column with subject labels, and a `condition` input, which indicates the name of the column with conditions that we want to split up. 

Your new function should work the same as the original function if you set `subject="subj"` and `condition="probe_type"`.

### Setup

In [12]:
import polars as pl
from datascipsych import datasets

dataset_file = datasets.get_dataset_file("Osth2019")
df_osth = datasets.clean_osth(pl.read_csv(dataset_file))
df_test = df_osth.filter(pl.col("phase") == "test")
df_test_lures = df_osth.filter((pl.col("phase") == "test") & (pl.col("probe_type") == "lure"))

### Original

In [13]:
def subject_mean_response(df):
    means = (
        df.group_by("subj", "probe_type")
        .agg(pl.col("response").mean())
        .sort("subj", "probe_type")
    )
    return means


mean_probe_type = subject_mean_response(df_test)

### Modified

In [14]:
# your modified version of the code here

# uncomment the lines below to test your function
# mean_probe_type = subject_mean_response(df_test, "subj", "probe_type")
# mean_lure_lag = subject_mean_response(df_test_lures, "subj", "lag")

## Problem (graduate students): using pathlib (2 points)

### Pathlib example

The [pathlib module](https://docs.python.org/3/library/pathlib.html#basic-use) provides tools for working with filesystem paths. For example, say you have data files stored in `data/sub-101/beh/sub-101_task.tsv`, `data/sub-102/beh/sub-102_task.tsv`, etc., for a number of subjects. If you have the different directory names stored in variables, you can use pathlib to make the full paths easily. By putting `/` in between the different directory names, you can make the full filepath for a given subject:

```python
from pathlib import Path
import polars as pl
data_dir = Path("data")  # make a filepath object; now we can make subdirectories using "/"
data_type = "beh"
subject = "101"
filepath = data_dir / f"sub-{subject}" / data_type / f"sub-{subject}_task.tsv"
data = pl.read_csv(filepath, separator="\t")
```

The `/` separates different directories. This works on Windows, Linux, and macOS; the pathlib module will figure out the correct directory separator to use for your system.
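Path objects also expose the pieces of a path as attributes. The sketch below uses `PurePosixPath`, a pathlib class that manipulates paths without touching the filesystem and always uses `/`, so the results are the same on every operating system. The directory and file names are hypothetical.

```python
from pathlib import PurePosixPath

results_dir = PurePosixPath("results")
figure_path = results_dir / "figures" / "hit_rate.pdf"

print(figure_path)         # results/figures/hit_rate.pdf
print(figure_path.name)    # hit_rate.pdf
print(figure_path.suffix)  # .pdf
print(figure_path.parent)  # results/figures
```

In everyday code, `Path` supports the same operator and attributes and can be passed directly to functions that read files.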

### Adapting non-pathlib code

A common way to construct a path is to create a string by concatenating parts together. The code below uses string concatenation to get the path to the Osth & Fox (2019) dataset. The `".."` path refers to the parent directory, so `"../../src"` goes up two directories and then into the `src` directory in this project.

Rewrite the code to use `Path` instead, starting with `Path("..")` and constructing the full path using the `/` operator.

### Original

In [15]:
import os
package_name = "datascipsych"
data_dir = "data"
filename = "Osth2019.csv"
data_file = "../../src/" + package_name + "/" + data_dir + "/" + filename
pl.read_csv(data_file).head()

cycle,trial,phase,type,word1,word2,response,RT,correct,lag,serPos1,serPos2,subj,intactLag,prevResponse,prevRT
i64,i64,str,str,str,str,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64
0,-1,"""study""","""intact""","""formal""","""positive""",-1,-1.0,-1,-1,0,0,101,0,0,0
0,0,"""study""","""intact""","""skin""","""careful""",-1,-1.0,-1,-1,1,1,101,0,0,0
0,1,"""study""","""intact""","""upon""","""miss""",-1,-1.0,-1,-1,2,2,101,0,0,0
0,2,"""study""","""intact""","""single""","""tradition""",-1,-1.0,-1,-1,3,3,101,0,0,0
0,3,"""study""","""intact""","""prove""","""airport""",-1,-1.0,-1,-1,4,4,101,0,0,0


### Modified

In [16]:
# your modified version of the code here

## Problem (graduate students): using NumPy to speed up calculations (2 points)

One of the big advantages of NumPy is that it is much faster than basic Python code. In Jupyter, you can time how long it takes to run code by placing `%%timeit` at the top of a cell that you want to time. Read about timeit in the [IPython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit).
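Outside of Jupyter, the standard-library `timeit` module offers similar functionality. The following is a minimal sketch, assuming we just want the average run time of a small, made-up function; the function name and the number of runs are arbitrary choices for illustration. Timings vary between machines and between runs, so no expected output is shown.

```python
import timeit


def sum_of_squares():
    """Add up the squares of the integers from 0 to 999."""
    return sum(i * i for i in range(1000))


n_runs = 200
# timeit.timeit returns the total time in seconds across all runs
total_seconds = timeit.timeit(sum_of_squares, number=n_runs)
print(f"{total_seconds / n_runs * 1e6:.1f} microseconds per run")
```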

The Original section has a `for` loop that calculates a total over 100,000 random numbers. Rewrite that code to use a NumPy function instead and use `%%timeit` to time how long it takes to execute on average. How long did it take to run the Original code? The Modified code? Write your answers in the Answer section.

### Setup

In [17]:
rng = np.random.default_rng(1)
x = rng.normal(size=100000)

### Original

In [18]:
%%timeit
total = 0
for xi in x:
    total += xi

3.45 ms ± 19.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Modified

In [19]:
# your code here

### Answer

> How long did it take to run the Original code, on average?

[answer here]

> How long did it take to run the NumPy version, on average?

[answer here]

## Problem (graduate students): type hints (2 points)

Read about [using type hints](https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html#functions) to annotate functions with information about the datatype of arguments and return values. Type hints can be used with tools like [mypy](https://mypy-lang.org/) to analyze code and spot potential problems where datatypes don't match up in different parts of code (for example, if the way a function is called does not match up with the datatypes expected by that function). Type hints can also be useful for documenting your code, for example by making it clear what type of data is expected for each argument and return value of your functions.
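As a minimal sketch of the syntax, here is a hypothetical function (not part of the assignment) with type hints for its arguments and return value. `Optional[str]` indicates that the `label` argument may be either a string or `None`. Python itself does not enforce type hints when the code runs; they are used by tools like mypy and by people reading the code.

```python
from typing import Optional


def proportion_correct(
    n_correct: int, n_trials: int, label: Optional[str] = None
) -> float:
    """Return the proportion of correct trials, optionally printing a label."""
    proportion = n_correct / n_trials
    if label is not None:
        print(f"{label}: {proportion:.2f}")
    return proportion


accuracy = proportion_correct(18, 24, label="sub-01")
```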

Write a modified version of the function below with type hints for the arguments and return value. The `hit_rate` and `false_alarm_rate` arguments will be `float`, the `n` argument will be `int` (or `None`, its default value), and the `adjust` argument will be `bool`. The return value will be `float`.

### Original

In [20]:
from scipy import stats

def dprime(hit_rate, false_alarm_rate, n=None, adjust=True):
    """Calculate d-prime with optional adjustment for rates of 0 or 1."""
    if adjust:
        if n is None:
            raise ValueError("Must specify n to use adjustment.")
        min_rate = 1 / n
        max_rate = 1 - 1 / n
        hit_rate = np.clip(hit_rate, min_rate, max_rate)
        false_alarm_rate = np.clip(false_alarm_rate, min_rate, max_rate)
    d_prime = stats.norm.ppf(hit_rate) - stats.norm.ppf(false_alarm_rate)
    return d_prime

### Modified

In [21]:
# your modified version of the code here

## Problem (graduate students): simplifying Polars code (2 points)

Read about [Polars conditionals](https://docs.pola.rs/user-guide/expressions/basic-operations/#conditionals), which work much like `if/elif/else` statements. The code below uses `when/then` function calls to label each trial's response as a hit ("old" response to a target), miss ("new" response to a target), false alarm ("old" response to a lure), or correct rejection ("new" response to a lure). At the end, the `alias` function call indicates the name we want for the new column. Instead of using `alias`, we could have also indicated this by naming the `when/then` chain `response_type` in the input to the `with_columns` function, but using `alias` makes this example a little easier to read. The `then` calls determine what the value of the new column should be for each set of trials; for example, `then(pl.lit("hit"))` indicates that the new column should be set to `"hit"` for those trials. In Polars, strings are often interpreted as column names, so we cannot just write `"hit"`; instead, we use the `pl.lit` (literal) function to indicate that we want the value to be `"hit"`, as opposed to us trying to access a column called `"hit"`.

Note that there is a lot of repeated code here. For example, the `(pl.col("probe_type") == "target")` expression appears twice and is not very easy to read. 

Rewrite the `add_response_type` function to create four intermediate variables named `target`, `lure`, `old`, and `new`, corresponding to the Polars expressions that filter for those types of trials. Use those variables in the four `when` calls to indicate which trials should have each label. For example, a `"hit"` trial corresponds to `target & old`. Because of how Polars works, this version of the function will run just as fast as the original and will be easier to read.
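For intuition about how a `when/then` chain assigns labels, the plain-Python sketch below applies the same logic to one trial at a time using `if/elif/else`. This is only an analogy, not how Polars evaluates expressions (Polars works on whole columns at once), and the `label_trial` function is hypothetical. A chain with no `otherwise` call leaves unmatched rows as null, which is mirrored here by returning `None`.

```python
def label_trial(probe_type, response):
    """Label one trial, checking conditions in order like a when/then chain."""
    if probe_type == "target" and response == 1:
        return "hit"
    elif probe_type == "target" and response == 0:
        return "miss"
    elif probe_type == "lure" and response == 1:
        return "false alarm"
    elif probe_type == "lure" and response == 0:
        return "correct rejection"
    else:
        # No condition matched; comparable to a null value in Polars
        return None


trials = [("target", 1), ("lure", 0), ("lure", 1), ("target", 0)]
labels = [label_trial(probe_type, response) for probe_type, response in trials]
print(labels)
```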

### Original

In [22]:
from datascipsych import datasets
import polars as pl


def add_response_type(df):
    df = df.with_columns(
        pl.when((pl.col("probe_type") == "target") & (pl.col("response") == 1))
        .then(pl.lit("hit"))
        .when((pl.col("probe_type") == "target") & (pl.col("response") == 0))
        .then(pl.lit("miss"))
        .when((pl.col("probe_type") == "lure") & (pl.col("response") == 1))
        .then(pl.lit("false alarm"))
        .when((pl.col("probe_type") == "lure") & (pl.col("response") == 0))
        .then(pl.lit("correct rejection"))
        .alias("response_type")
    )
    return df


df_osth = datasets.clean_osth(pl.read_csv(dataset_file)).filter(
    pl.col("phase") == "test"
)
add_response_type(df_osth).head()

subj,cycle,phase,trial,type,word1,word2,response,RT,correct,lag,probe_type,response_type
i64,i64,str,i64,str,str,str,i64,f64,i64,i64,str,str
101,0,"""test""",-1,"""rearranged""","""waste""","""degree""",0,2.312,1,2,"""lure""","""correct rejection"""
101,0,"""test""",0,"""rearranged""","""needed""","""able""",0,3.542,1,1,"""lure""","""correct rejection"""
101,0,"""test""",1,"""rearranged""","""single""","""clean""",0,2.084,1,3,"""lure""","""correct rejection"""
101,0,"""test""",2,"""rearranged""","""train""","""useful""",0,1.669,1,2,"""lure""","""correct rejection"""
101,0,"""test""",3,"""rearranged""","""knees""","""various""",0,2.326,1,5,"""lure""","""correct rejection"""


### Modified

In [23]:
# your modified version of the code here