# Coding Best Practices

When first getting started with coding, developers often write code that is "good enough" and then stop making improvements. However, this code may be hard to understand, contain bugs, and be difficult to reuse or extend. Professional software developers have established best practices to help avoid these problems.

## Code style

Python is flexible about how code is formatted, but there is a standard code style that is easy to follow and makes your code easier to read.

Here is some code before formatting. The inconsistent spacing and long lines make it hard to read.

In [1]:
import polars as pl
from datascipsych import datasets

def myfunction( x, y ):
    #add some numbers
    z  = x+y
    return z
l=[1,2,3,4]
d={'a':1,"b":2,"c":3}
df = pl.read_csv(datasets.get_dataset_file("Morton2013"), null_values="n/a").filter(pl.col("study")).group_by("subject", "list_type", "input").agg(pl.col("recall").mean())

Luckily, we can use [Black](https://black.readthedocs.io/en/stable/), a tool for automatically reformatting Python code, to clean it up. Black can be run in several ways, including as a command line tool and through a plugin for VSCode. After reformatting, the code is much easier to read.

In [2]:
import polars as pl
from datascipsych import datasets


def myfunction(x, y):
    # add some numbers
    z = x + y
    return z


l = [1, 2, 3, 4]
d = {"a": 1, "b": 2, "c": 3}
df = (
    pl.read_csv(datasets.get_dataset_file("Morton2013"), null_values="n/a")
    .filter(pl.col("study"))
    .group_by("subject", "list_type", "input")
    .agg(pl.col("recall").mean())
)

Black automatically reformats code to follow the Python style guide ([PEP 8](https://peps.python.org/pep-0008/)), plus some additional rules that Black uses to increase consistency. The name "Black" comes from a quote from Henry Ford about the Model T car: "Any customer can have a car painted any color that he wants so long as it is black."

I used to have a lot of recommendations for how to format code. Now I tell people: "just use Black."
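Reformatting changes only the layout of the code, not the program itself. One way to check this is to compare the syntax trees of the two versions using the standard library `ast` module. This is a minimal sketch that uses part of the example above, copied into strings named `before` and `after` for illustration.

```python
import ast

# Part of the example code before reformatting
before = """
def myfunction( x, y ):
    #add some numbers
    z  = x+y
    return z
l=[1,2,3,4]
d={'a':1,"b":2,"c":3}
"""

# The same code after reformatting
after = """
def myfunction(x, y):
    # add some numbers
    z = x + y
    return z


l = [1, 2, 3, 4]
d = {"a": 1, "b": 2, "c": 3}
"""

# Parse each version and compare the resulting syntax trees.
# Whitespace, comments, and quote style do not appear in the tree.
same_structure = ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))
print(same_structure)
```

The comparison prints `True` because the two versions differ only in spacing, blank lines, comment formatting, and quote style.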

Of course, Black won't change anything about how the code runs, so there are some recommended guidelines that it won't implement. For example, it's recommended that module import statements be placed at the top of a module. This makes it easier to see what modules are being used in the file and how they are named.

In [3]:
def myfunction(x, y):
    # add some numbers
    z = x + y
    return z


import numpy as np  # not recommended (comes after other code)
a = np.arange(6)

We should generally move the import statement to the top of the file before other code, unless there's a good reason to import it somewhere else.

In [4]:
import numpy as np


def myfunction(x, y):
    # add some numbers
    z = x + y
    return z


a = np.arange(6)

Note also that Python style guidelines recommend having two blank lines above and below each function definition, to make functions easier to distinguish from other code.

### Exercise: code style

Use Black to reformat the following code. Also, make it so the import statements are all at the top of the cell.

In [5]:
import numpy as np
b = np.zeros((1,2))
import polars as pl
data = pl.DataFrame({"trial":[1,2,3,4], "correct":[0,1,1,0], "response_time":[1.2,3.4,2.3,5.6]})

## Coding principles

Software developers have tried to explain the difference between good and bad code in various ways. We'll review a few useful principles that can help guide your programming.

### DRY: Don't repeat yourself

The DRY principle says that we should avoid repeating ourselves when writing code. Programming languages are designed so that we should not have to write the same code over and over again. Repetitive code is harder to extend and debug.

For example, say we have data from 8 subjects, in separate files, that we want to read and analyze. One way to do this is by running 8 different calls to `pl.read_csv`, changing the filename each time and assigning each one to a variable. After reading in the files, we can combine them into one DataFrame using `pl.concat`.

In [6]:
df1 = pl.read_csv("data/sub-01_beh.csv")
df2 = pl.read_csv("data/sub-02_beh.csv")
df3 = pl.read_csv("data/sub-03_beh.csv")
df4 = pl.read_csv("data/sub-04_beh.csv")
df5 = pl.read_csv("data/sub-05_beh.csv")
df6 = pl.read_csv("data/sub-06_beh.csv")
df7 = pl.read_csv("data/sub-07_beh.csv")
df8 = pl.read_csv("data/sub-08_beh.csv")
df_all = pl.concat([df1, df2, df3, df4, df5, df6, df7, df8])

If we do lots of copying and pasting, this is relatively simple to write, but hard to work with in the future. What if the study is ongoing, and more subjects are being added? You would have to add and edit code each time to add those new subjects. What if the folder that the data are in changes, say to `rawdata` instead of `data`? You would have to edit the path of each file.

There is a better way: don't repeat yourself.

In this example, we use a `for` loop instead. We have to do more thinking in advance, to figure out how to write the `for` loop, create each filename from the subject number, and add each DataFrame to our list of DataFrames. But the code doesn't need to be edited much as more subjects are added to the dataset. You can just change the `n_subj` variable.

In [7]:
n_subj = 8
df_list = []
for i in range(1, n_subj + 1):
    filename = f"data/sub-{i:02}_beh.csv"
    df = pl.read_csv(filename)
    df_list.append(df)
df_all = pl.concat(df_list)
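The same idea handles the case where the data folder changes. Here is a minimal sketch that keeps the folder in one variable, so a change from `data` to `rawdata` only needs one edit. The `data_dir` and `filenames` names are introduced here for illustration.

```python
from pathlib import Path

data_dir = Path("data")  # change this one line if the folder is renamed
n_subj = 8

# Build the path for each subject from the folder and the subject number
filenames = [data_dir / f"sub-{i:02}_beh.csv" for i in range(1, n_subj + 1)]
print(filenames[0])
```

Each path in `filenames` could then be passed to `pl.read_csv` inside the loop, as in the example above.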

Writing functions can also help you avoid repeating yourself when coding. For example, say we want to exclude trials where the response time was an outlier, according to a standard criterion for detecting outliers. Say that we have two DataFrames with data from different experiments, but we want to apply the same sort of filtering to both.

In [8]:
df_rt1 = pl.DataFrame(
    {
        "subject": ["01", "01", "01", "01", "02", "02", "02", "02"],
        "response_time": [0.3, 0.6, 1.2, 0.9, 0.8, 0.4, 0.5, 3.4],
    }
)
df_rt2 = pl.DataFrame(
    {
        "subject": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
        "condition": [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
        "response_time": [1.2, 1.3, 1.6, 1.1, 1.0, 0.9, 0.3, 1.7, 1.8, 2.2, 2.3, 1.9, 1.8, 4.2, 0.4, 1.0, 2.3, 1.4]
    }
)

We can run this filtering on each DataFrame individually, like below, repeating the same long expression for each DataFrame. If we want to run this calculation again on another dataset in another context, such as a different analysis notebook, we'll have to remember how to define Q1, Q3, the IQR, etc.

In [9]:
rt = pl.col("response_time")
q1 = rt.quantile(0.25)
q3 = rt.quantile(0.75)
iqr = q3 - q1
df_rt1_filt = df_rt1.filter(~((rt < q1 - 1.5 * iqr) | (rt > q3 + 1.5 * iqr)))
df_rt2_filt = df_rt2.filter(~((rt < q1 - 1.5 * iqr) | (rt > q3 + 1.5 * iqr)))

Instead, we could write a function. For example, let's make a function that takes in a DataFrame and filters out outliers.

In [10]:
def filter_rt_outliers(df):
    """Remove trials where the response time is an outlier."""
    rt = pl.col("response_time")
    q1 = rt.quantile(0.25)
    q3 = rt.quantile(0.75)
    iqr = q3 - q1
    return df.filter(~((rt < q1 - 1.5 * iqr) | (rt > q3 + 1.5 * iqr)))

Now we don't need to remember the formula every time we remove outliers; we can just call the function.

In [11]:
df_rt1_filt = filter_rt_outliers(df_rt1)
df_rt2_filt = filter_rt_outliers(df_rt2)

If we need to remember how it works, that's easy to look up. Because we followed the DRY principle, the formula is only defined in one place.
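To make the criterion concrete: a value is treated as an outlier if it is more than 1.5 times the interquartile range (IQR) below the first quartile (Q1) or above the third quartile (Q3). The sketch below works through the rule in plain Python for the response times in `df_rt1`. It assumes quartiles computed with `statistics.quantiles` and `method="inclusive"`, which may differ slightly from the quantiles Polars calculates. The `split_outliers` function is a hypothetical helper written for this illustration.

```python
import statistics


def split_outliers(values):
    """Split values into those kept and those flagged as outliers."""
    # Quartiles using the standard library (an assumption; Polars may
    # use a different interpolation method)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if v < lower or v > upper]
    return kept, removed


response_times = [0.3, 0.6, 1.2, 0.9, 0.8, 0.4, 0.5, 3.4]
kept, removed = split_outliers(response_times)
print(removed)
```

For these data, only the 3.4 s response falls outside the bounds, so it is the only value removed.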

### Enhance flexibility using soft-coding

It's rare to write code that does everything you need on the first draft. Often, you will get code from someone else that does a lot of what you need to do, but will not work for your purposes without changes.

Don't be afraid to make changes. Code is meant to be revised, especially if you're using Git to track your changes.

Let's go back to the outlier filtering example. What limitations does it have?

In [12]:
def filter_rt_outliers(df):
    """Remove trials where the response time is an outlier."""
    rt = pl.col("response_time")
    q1 = rt.quantile(0.25)
    q3 = rt.quantile(0.75)
    iqr = q3 - q1
    return df.filter(~((rt < q1 - 1.5 * iqr) | (rt > q3 + 1.5 * iqr)))

One problem is that it assumes that the column with response time is named `"response_time"`. Another issue is that it assumes that you would only ever want to filter out response time outliers. But you could also have some other measure with outliers. For example, say you wanted to exclude participants whose performance is an outlier.

We can solve both of these problems by adding an input that determines the column to use. This is called shifting from *hard-coding*, where some value is written directly in the code, to *soft-coding*, where the value is taken in by the function, making the function more flexible.

In [13]:
def filter_outliers(df, column):
    """Remove trials where some measure is an outlier."""
    x = pl.col(column)
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    return df.filter(~((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

Note that the column doesn't necessarily represent response time anymore, so we have renamed the `rt` variable to `x` to reflect that.

### The Zen of Python

The [Zen of Python](https://peps.python.org/pep-0020/) is a set of principles written by Tim Peters, an early Python developer, to describe how to write good Python code.

Here are a few of the core principles:

Beautiful is better than ugly. (Good code is elegant.)

Explicit is better than implicit. (Communicate your intentions.)

Simple is better than complex. (Simplify when possible.)

Readability counts. (Code is read more often than it is written, so make it easy to read.)

There should be one-- and preferably only one --obvious way to do it. (Python tries to make it obvious how to do something.)
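The full list of principles is built into Python. Importing the standard library module `this` prints the Zen of Python as a side effect.

```python
# Importing this module prints the full Zen of Python
import this
```

Note that the text is printed only the first time the module is imported in a session.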