# Getting ready for an analysis

[Chris
Gorgolewski](https://reproducibility.stanford.edu/team/chris-gorgolewski)
kindly gave this exercise in our Berkeley courses.  He released the exercise
under the [CC-by license](https://creativecommons.org/licenses/by/4.0), and
the associated data file `24719.f3_beh_CHYM.csv` under the [CC-0
license](https://creativecommons.org/share-your-work/public-domain/cc0/).

The exercise gets you started on:

* using variables;
* numbers and strings;
* opening and reading from files.


## Introduction

Fill in the code cells below that contain `...`.

When you have filled in the code, run any tests for the cell, in the cell
below.  Some cells don't have test cells below them.

If your code was not correct, the tests will give an `AssertionError`.  If the
code appears to be correct, the test cell will run without error.
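
As a minimal sketch of how these tests behave: an `assert` statement does nothing when its condition is true, and raises an `AssertionError` when it is false. The name `value` below is only for illustration, and the `try` / `except` is there just so the sketch runs to the end.

```python
# A passing assertion does nothing at all.
value = 20
assert value == 20

# A failing assertion raises AssertionError.  We catch the error here
# only so that this sketch can keep running.
try:
    assert value == 21
except AssertionError:
    caught = True
else:
    caught = False

# Show whether the failing assertion raised an error.
print(caught)
```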


## The analysis

Define a `number_of_subjects` variable and assign the value 20 to it.

In [1]:
#- Your code here.
number_of_subjects = 20

Now run the cell below to test the code above.  Remember, if it gives an `AssertionError`, then there's a problem with your code above.  If it runs without error, your code above probably gave the right answer.

In [2]:
# Run this cell to test the cell above.
assert 'number_of_subjects' in dir()  # Variable exists
assert number_of_subjects is not ...  # Variable changed from template
assert number_of_subjects == 20  # Variable has correct value.

Display the statement: `We have <x> subjects` where `<x>` is the number of
subjects.  For example, if you had 10 subjects (you don't) you should see `We have 10 subjects`.

In [3]:
#- Your code here
print(f"We have {number_of_subjects} subjects")
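
Here is a small sketch of what the `f` prefix does to a string. Without the prefix, Python keeps the curly braces as literal text. With the prefix, Python replaces the braces with the value of the variable. The name `n_example` is made up for this illustration and uses the value 10 from the example in the text.

```python
n_example = 10  # Hypothetical value, as in the "10 subjects" example.

# No `f` prefix: the braces and the name stay in the string as they are.
plain = "We have {n_example} subjects"
# With the `f` prefix: the braces are replaced by the value of the variable.
formatted = f"We have {n_example} subjects"

print(plain)
print(formatted)
```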

Define two variables: `control_subjects` and `treatment_subjects` and assign values
18 and 12 respectively.

In [4]:
#- Your code here
control_subjects = 18
treatment_subjects = 12

In [5]:
# Run this cell to test the cell above.
assert 'control_subjects' in dir()  # Variable exists
assert 'treatment_subjects' in dir()  # Variable exists
assert control_subjects == 18  # Variable has correct value.
assert treatment_subjects == 12  # Variable has correct value.

Now define a new variable `all_subjects` which is the sum of the two.

In [6]:
#- Your code here
all_subjects = control_subjects + treatment_subjects

In [7]:
# Run this cell to test the cell above.
assert 'all_subjects' in dir()  # Variable exists
assert all_subjects == 30  # Variable has correct value.

Now have a look at the textbook page on [string formatting](https://textbook.nipraxis.org/string_formatting.html).

The directories for each subject have names:

- `sub000`
- `sub001`
- etc, up to
- `sub029`

Here's how to build a string for subject number 20, using string formatting
(see the page above):

In [8]:
val = 20
# Make a suitable string using the value of `val`
f'sub{val:03d}'
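
To see what the `:03d` part of the format does, here is a short sketch applying the same format to a few values. The values and the names `example_values` and `labels` are chosen only for illustration. The `03d` asks for an integer padded with zeros to at least three digits.

```python
# A few hypothetical subject numbers to try.
example_values = [0, 7, 20, 123]

labels = []
for example_val in example_values:
    # Pad the integer with leading zeros to a width of at least 3.
    labels.append(f'sub{example_val:03d}')

# Show the result
print(labels)
```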

Now use a `for` loop to print the subject identifiers for each subject, each subject identifier on one line.  This will start:

```
sub000
sub001
sub002
```

In [9]:
#- Starting at 0 and going up to (not including) the value of `all_subjects`
for i in range(0, all_subjects):
    print(f"sub{i:03d}")

Create a variable `list_of_subjects` containing a list of the identifiers you
just printed.

In [10]:
list_of_subjects = []
for i in range(0, all_subjects):
    list_of_subjects.append(f"sub{i:03d}")
# Show the result
list_of_subjects

In [11]:
# Test the list creation.
assert 'list_of_subjects' in dir()  # Variable exists.
assert isinstance(list_of_subjects, list)  # It's a list.
assert len(list_of_subjects) == all_subjects  # Has right number of elements.
# Test some elements.
assert list_of_subjects[1] == 'sub001'
assert list_of_subjects[12] == 'sub012'
assert list_of_subjects[-1] == 'sub029'
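
Once you are comfortable with the `for` loop above, you might also consider a list comprehension, which builds the same kind of list in one line. This is only a sketch of an alternative. The names `n_subjects` and `subjects_via_comprehension` are made up here, and `n_subjects` is assumed to have the same value as `all_subjects`.

```python
n_subjects = 30  # Assumed to equal `all_subjects` from the cells above.

# Build the list of identifiers in a single expression.
subjects_via_comprehension = [f"sub{i:03d}" for i in range(n_subjects)]

# Show the first three identifiers.
print(subjects_via_comprehension[:3])
```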

Unfortunately the data from some subjects were corrupted. Create another
variable `list_of_excluded_subjects` containing the strings `'sub005'` and
`'sub008'`.

In [12]:
list_of_excluded_subjects = ['sub005', 'sub008']

In [13]:
# Test the list creation.
assert 'list_of_excluded_subjects' in dir()  # Variable exists.
assert isinstance(list_of_excluded_subjects, list)  # It's a list.
assert len(list_of_excluded_subjects) == 2  # Has right number of elements.
assert list_of_excluded_subjects[0] == 'sub005'
assert list_of_excluded_subjects[-1] == 'sub008'

Now create a *list* of all subjects apart from those in the
`list_of_excluded_subjects`.  Put this new list into the variable
`good_subjects`.

**Hint**: you can use a `for` loop.

In [14]:
good_subjects = []
for i in range(all_subjects):
    subject = list_of_subjects[i]
    if subject in list_of_excluded_subjects:
        continue
    good_subjects.append(subject)
# Show the result
good_subjects

In [15]:
# Test the list creation.
assert 'good_subjects' in dir()  # Variable exists.
assert isinstance(good_subjects, list)  # It's a list.
assert len(good_subjects) == all_subjects - 2  # Has right number of elements.
assert good_subjects[5] == 'sub006'
assert good_subjects[-1] == 'sub029'
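
The loop above goes through the list by index. As a sketch of another way to write it, a `for` loop can also step through the elements of a list directly, and `not in` lets us keep an element without needing `continue`. The names `all_ids`, `excluded` and `kept` are made up for this illustration. They stand in for the lists defined above.

```python
# Stand-ins for `list_of_subjects` and `list_of_excluded_subjects`.
all_ids = [f"sub{i:03d}" for i in range(30)]
excluded = ['sub005', 'sub008']

kept = []
# Step through the identifiers themselves, rather than through indices.
for subject in all_ids:
    if subject not in excluded:
        kept.append(subject)

# Show how many identifiers we kept.
print(len(kept))
```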

Do we expect the same identifier to be on the list of subjects twice? If not,
what other data structure could we use to contain the subjects?  Consider
whether you could use this data structure to solve the problem above without a
`for` loop.   See the [solution](first_analysis_solution.ipynb) for a way of
doing this.

**Hint**: you can use the `set` type and subtraction operator (with sets).

In [16]:
#- See the solution for a way of solving the problem above
#- without using a `for` loop.
good_subjects_again = set(list_of_subjects) - set(list_of_excluded_subjects)
# Show the result
good_subjects_again
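
One thing to keep in mind is that a set has no defined order, so the result of the subtraction is not guaranteed to come out in the order of the original list. The sketch below shows one way to get back to an ordered list, using `sorted`. The names `all_ids`, `excluded`, `remaining` and `remaining_sorted` are made up for this illustration.

```python
# Stand-ins for `list_of_subjects` and `list_of_excluded_subjects`.
all_ids = [f"sub{i:03d}" for i in range(30)]
excluded = ['sub005', 'sub008']

# Set subtraction gives a set, in no particular order.
remaining = set(all_ids) - set(excluded)
# `sorted` returns a new list, here in alphabetical order.
remaining_sorted = sorted(remaining)

# Show the first few sorted identifiers.
print(remaining_sorted[:6])
```

Because the identifiers are padded with zeros, alphabetical order is the same as numerical order here.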

For the next step, we return to using the `good_subjects` list you should have
built above.

Now write out your ordered subjects to a text file called `usable_subjects.txt`.
Write one subject identifier per line.

In [17]:
# Make sure we have a sorted list.
good_list = sorted(good_subjects)
# Open the file for writing
fobj = open('usable_subjects.txt', 'wt')
for i in range(len(good_list)):
    fobj.write(good_list[i] + '\n')
# Don't forget to close the file.
fobj.close()

In [18]:
# Some tests
import os.path as op
assert op.isfile('usable_subjects.txt')
lines = open('usable_subjects.txt', 'rt').readlines()
assert len(lines) == len(good_subjects)
for i in range(len(lines)):
    assert lines[i].strip() == good_subjects[i]
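
The cell that writes the file closes it by hand with `fobj.close()`. As a sketch of a common alternative, the `with` statement closes the file for you at the end of the indented block, even if an error happens part-way through. To avoid touching `usable_subjects.txt`, this sketch writes a hypothetical short list (`example_ids`) to a made-up file name inside a temporary directory.

```python
import os
import tempfile

example_ids = ['sub000', 'sub001', 'sub002']  # Hypothetical short list.

# Work in a temporary directory so we leave the real output file alone.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example_subjects.txt')
    # The file is closed automatically at the end of this `with` block.
    with open(example_path, 'wt') as out_file:
        for subject in example_ids:
            out_file.write(subject + '\n')
    # Read the file back to see what we wrote.
    with open(example_path, 'rt') as in_file:
        written = in_file.read()

print(written)
```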

For the rest of the class we would like you to work on the following task:

You are given a log file called `24719.f3_beh_CHYM.csv`.  It contains the
results from a behavioral experiment.

The first line in the file contains the header, giving the names of the
variables.  Each line after the first corresponds to one trial of the
experiment. We are interested in the average response time on trials where
the response was the 'space' key and the displayed shape (`trial_shape`) was a
'red_square'.  Here are the first few lines of this file:

```
response,response_time,trial_ISI,trial_shape
None,0,2000,red_star
None,0,1000,red_circle
None,0,2500,green_triangle
None,0,1500,yellow_square
```

**Hints**

- When you read lines from a text file, they end with a newline character (`\n`);
- Remember `split`;
- Strings are not numbers.
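
The hints above can be sketched on a single line of text. The string `raw_line` below is the first data line from the sample, with a line ending added to mimic what you get when reading from a file. The other names are made up for this illustration.

```python
# The first data line of the sample, as it would come from the file.
raw_line = 'None,0,2000,red_star\n'

# Remove the line ending (and any other surrounding whitespace).
clean_line = raw_line.strip()
# Split the line into its fields at the commas.
fields = clean_line.split(',')

# Each field is a string, even when it looks like a number.
isi_text = fields[2]
# Convert the string to an integer before doing arithmetic with it.
isi_number = int(isi_text)

print(fields)
print(isi_number + 1)
```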

See [the solution](first_analysis_solution.ipynb) for the — *solution*!

In [19]:
total_rt = 0.
n = 0
fobj = open("24719.f3_beh_CHYM.csv", 'rt')
lines = fobj.readlines()
# Close the file now that we have read all the lines.
fobj.close()
for i in range(1, len(lines)):
    line = lines[i].strip()
    parts = line.split(',')
    if parts[0] == 'space' and parts[3] == 'red_square':
        total_rt += int(parts[1])
        n += 1
# Show the average response time.
print(total_rt / n)

In [20]:
# A simple test for the correct answer.
assert round(total_rt / n) == 382
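
The solution above picks out the fields by position (`parts[0]`, `parts[3]`). As a sketch of an alternative, the `csv` module in the Python standard library can read each row as a dictionary, so you can refer to fields by the names in the header. To keep the sketch self-contained, it reads from a short in-memory text instead of the real file. The first two data rows are from the sample above, and the last three rows are invented for illustration, so the result says nothing about the real data.

```python
import csv
import io

# First two data rows from the sample; last three rows are INVENTED.
example_text = """response,response_time,trial_ISI,trial_shape
None,0,2000,red_star
None,0,1000,red_circle
space,400,1500,red_square
space,300,2000,red_square
space,350,1000,green_triangle
"""

# DictReader uses the first line as the field names for each row.
reader = csv.DictReader(io.StringIO(example_text))
matching_times = []
for row in reader:
    if row['response'] == 'space' and row['trial_shape'] == 'red_square':
        # The values are strings; convert before doing arithmetic.
        matching_times.append(int(row['response_time']))

example_mean = sum(matching_times) / len(matching_times)
print(example_mean)
```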

## Bonus exercise

Have a look at the Pandas package, especially the `read_csv` function. Could you
solve the previous exercise using this functionality?

See the [pandas solution](first_analysis_pandas.Rmd) for a solution using Pandas.
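
As a rough sketch of the general idea (not the linked solution), the code below uses `read_csv` on a short in-memory text and then selects rows with Boolean conditions. It assumes you have Pandas installed. The first two data rows are from the sample above, and the last three rows are invented for illustration, so the result says nothing about the real data.

```python
import io

import pandas as pd

# First two data rows from the sample; last three rows are INVENTED.
example_text = """response,response_time,trial_ISI,trial_shape
None,0,2000,red_star
None,0,1000,red_circle
space,400,1500,red_square
space,300,2000,red_square
space,350,1000,green_triangle
"""

# Read the text into a data frame; the first line gives the column names.
df = pd.read_csv(io.StringIO(example_text))
# Keep the rows where both conditions are true.
is_match = (df['response'] == 'space') & (df['trial_shape'] == 'red_square')
selected = df[is_match]
# Average the response times of the selected rows.
pandas_mean = selected['response_time'].mean()
print(pandas_mean)
```

For the real data, you could pass the file name to `read_csv` instead of the `io.StringIO` object.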

## Some final cleanup

Just to be tidy, let us delete the file we created above:

In [21]:
# Delete the usable_subjects.txt file.
# Just in case we are re-running the notebook.
import os
os.unlink('usable_subjects.txt')
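
Note that `os.unlink` raises a `FileNotFoundError` if the file is not there. If that is a problem when re-running, one option is to check for the file first, as in this sketch. The file name `example.txt` is made up, and the sketch works in a temporary directory so it does not touch your own files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example.txt')
    # Make a small file to delete.
    with open(example_path, 'wt') as out_file:
        out_file.write('temporary\n')
    existed_before = os.path.exists(example_path)
    # Delete the file only if it is there.
    if os.path.exists(example_path):
        os.unlink(example_path)
    # Doing the same again is harmless, because of the check.
    if os.path.exists(example_path):
        os.unlink(example_path)
    exists_after = os.path.exists(example_path)

print(existed_before, exists_after)
```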