# Processing tab separated log files

* This notebook builds on what we did in the `processing_simple_log_files.ipynb` notebook to work with log files generated by a PsychoPy task in __TAB SEPARATED FIELDS__ format


* We continue to focus on using:

    1. LISTS
    2. DICTIONARIES
    3. LOOPS


* And add in the use of:

    1. FUNCTIONS
    2. LIST FILTERING
    
    
------

#### HISTORY

* 11/9/18 mbod - initial version

----------------------------

## Task overview

-----------------------------



* This is what a log file looks like.

    * Each line in the file is a logged event with three fields separated by tabs
    
        1. timestamp since beginning of task in seconds
        2. field type (here they are all DATA)
        3. string containing the logged message

```
    4.4605 	DATA 	*** EVENT:start_task CODE:10 ***	
    4.4945 	DATA 	*** EVENT:start_instructions CODE:30 ***	
    4.4945 	DATA 	Showing instruction screen 1	
    11.5125 	DATA 	Keypress: space
    11.5126 	DATA 	Showing instruction screen 2	
    19.0356 	DATA 	Keypress: space
    19.0357 	DATA 	Showing instruction screen 3	
    23.6052 	DATA 	Keypress: space
    23.6053 	DATA 	*** EVENT:end_instructions CODE:31 ***	
    23.6100 	DATA 	*** MESSAGE:Fixation 5.0 secs EVENT:start_fixation CODE:20 ***	
    28.6105 	DATA 	*** EVENT:end_fixation CODE:21 ***	
    29.8764 	DATA 	VIDEO START a2r02h.mp4 dur 12.33	
    29.8764 	DATA 	*** MESSAGE:Showing a2r02h.mp4 dur 12.33 EVENT:start_video CODE:60 ***	
    42.2089 	DATA 	*** EVENT:end_video CODE:61 ***	
    42.2099 	DATA 	VIDEO END a2r02h.mp4	
    42.2303 	DATA 	*** EVENT:get_rating CODE:70 ***	
    45.1843 	DATA 	Keypress: 4
    45.1844 	DATA 	RATING 4	
    45.1844 	DATA 	*** MESSAGE:Key press 4 EVENT:rating_resp4 CODE:74 ***	
    45.6879 	DATA 	*** MESSAGE:Fixation 5.0 secs EVENT:start_fixation CODE:20 ***	
    50.6989 	DATA 	*** EVENT:end_fixation CODE:21 ***	
    51.9619 	DATA 	VIDEO START a1r01n.mp4 dur 12.16	
    51.9619 	DATA 	*** MESSAGE:Showing a1r01n.mp4 dur 12.16 EVENT:start_video CODE:60 ***	
    64.1220 	DATA 	*** EVENT:end_video CODE:61 ***	
    64.1225 	DATA 	VIDEO END a1r01n.mp4	
    64.1361 	DATA 	*** EVENT:get_rating CODE:70 ***	
    66.5044 	DATA 	Keypress: 2
    66.5044 	DATA 	RATING 2	
    ...
```

* Each **TRIAL** in this task involved the subject watching a video and then making a rating on how likely they would be to share it (on a 1-5 scale)


* So a trial looks like this in the log file:

```
    51.9619     DATA     VIDEO START a1r01n.mp4 dur 12.16    
    51.9619     DATA     *** MESSAGE:Showing a1r01n.mp4 dur 12.16 EVENT:start_video CODE:60 ***    
    64.1220     DATA     *** EVENT:end_video CODE:61 ***    
    64.1225     DATA     VIDEO END a1r01n.mp4    
    64.1361     DATA     *** EVENT:get_rating CODE:70 ***    
    66.5044     DATA     Keypress: 2
    66.5044     DATA     RATING 2 
```

* The relevant lines for a single trial

    * start with the message `VIDEO START`

    * and end with the message `RATING n`
    
    
                       

## GOAL

* Produce a list-of-dictionaries data structure:

```
    [
        { 'trial_num': 1,
          'video': 'filename.mp4',
          'share_rating': n
        },
        { 'trial_num': 2,
          'video': 'filename.mp4',
          'share_rating': n
        },
        { 'trial_num': 3,
          'video': 'filename.mp4',
          'share_rating': n
        },

        ...

        { 'trial_num': n,
          'video': 'filename.mp4',
          'share_rating': n
        }


    ]
```

* Do this for all logs in the log directory

### Import modules

* `import` brings groups of functions into your script's workspace

* Modules are packages of related functions organized in a hierarchical structure

In [5]:
import os

import pandas as pd

### Setup parameters

In [6]:
LOG_DIR = './log_data'

* Get a list of files in a directory using the `os.listdir()` function that:
    * takes a string argument indicating the path to the folder/directory
        * this can be _relative_ to the current working directory, e.g. `os.listdir('logs')` = list the `logs` folder in the same parent folder as the script/notebook or `os.listdir('../../data/logs')` = list the contents of the `logs` folder that is in the grandparent directory (up 2 levels) and inside the `data` folder in that directory
        * or absolute path from the __ROOT__ e.g. `os.listdir('/data00/projects/my_project/data/logs')`
        
    * returns a `list` of `strings` where each string item is a file or folder
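
* A directory can hold files other than logs (hidden system files, for example), and `os.listdir()` makes no promise about the order of the names it returns. A minimal sketch of filtering and sorting the result, assuming the log files all end in `.log`. The helper name `list_log_files` is hypothetical and not part of the original notebook:

```python
import os

def list_log_files(log_dir, suffix='.log'):
    """Return a sorted list of the filenames in log_dir that end with suffix."""
    # list filtering: keep only the names that end with the suffix
    names = [name for name in os.listdir(log_dir) if name.endswith(suffix)]
    # os.listdir() returns names in arbitrary order, so sort explicitly
    return sorted(names)
```

* Calling `list_log_files(LOG_DIR)` would then give only the `.log` files, in name order.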

In [7]:
logs_to_process = os.listdir(LOG_DIR)

In [8]:
len(logs_to_process)    # get the length of the list

43

In [13]:
type(logs_to_process)   # show the type of the return objects from os.listdir()

list

* So we can use list indexing, slicing and list functions:

In [9]:
logs_to_process[0]

'log130.log'

In [15]:
logs_to_process[3:11]

['log133.log',
 'log134.log',
 'log135.log',
 'log136.log',
 'log137.log',
 'log138.log',
 'log139.log',
 'log140.log']

### Working out the task steps on one file

* The key element is breaking down the logical steps to go from the text log file to a dictionary of required values on a single test file.

* Once that is figured out then it is just a matter of repeating these steps for each of the files in the `logs_to_process` list.

#### Steps

1. load the contents of the log file into a string
2. create a list of events (i.e. one line per item)
3. create an empty list of trials to hold each `trial dictionary`
4. to process each trial:
    1. for each line look for video start (start of the trial)
    2. start a trial dictionary and include
        1. trial number
        2. video filename
        3. rating
    3. when trial complete push dictionary into trial list

### Step 1: Load contents of log file into a string

* The `open()` function returns a file object that has functions like `.read()` and `.readlines()` to access the content


* For example, this is how to get a file object:

In [20]:
fh=open('log_data/log130.log')

In [21]:
fh  # this is the file object

<_io.TextIOWrapper name='log_data/log130.log' mode='r' encoding='UTF-8'>

* Use `dir()` to list the attributes and functions for this object

In [22]:
print(', '.join(dir(fh)))

_CHUNK_SIZE, __class__, __del__, __delattr__, __dict__, __dir__, __doc__, __enter__, __eq__, __exit__, __format__, __ge__, __getattribute__, __getstate__, __gt__, __hash__, __init__, __init_subclass__, __iter__, __le__, __lt__, __ne__, __new__, __next__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__, _checkClosed, _checkReadable, _checkSeekable, _checkWritable, _finalizing, buffer, close, closed, detach, encoding, errors, fileno, flush, isatty, line_buffering, mode, name, newlines, read, readable, readline, readlines, seek, seekable, tell, truncate, writable, write, writelines


* For example, the `.readline()` function returns one line at a time

* Each time it is called it returns the next line, until the end of the file

In [23]:
fh.readline()

'5.5058 \tDATA \t*** EVENT:start_task CODE:10 ***\t\n'

In [25]:
fh.readline()

'5.5379 \tDATA \t*** EVENT:start_instructions CODE:30 ***\t\n'

* All lines in a file can be returned in a `list` using the `.readlines()` function


* Or the whole contents of the file can be read into a string object using the `.read()` function
    * For example, here we open the first log file in the `logs_to_process` list (__N.B.__ index 0) and read the contents into a string.

In [10]:
log_txt = open(os.path.join(LOG_DIR, logs_to_process[0])).read()
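
* The one-line form above leaves the file object open until Python gets around to cleaning it up. A common alternative, sketched here, is a `with` block that closes the file as soon as the contents have been read. The function name `read_log_text` is an illustrative assumption:

```python
def read_log_text(path):
    """Read the whole file at path into a single string."""
    # the file is closed automatically when the with block ends
    with open(path) as fh:
        return fh.read()
```

* With this helper, `read_log_text(os.path.join(LOG_DIR, logs_to_process[0]))` should return the same string as the cell above.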

* Use slicing to show the first 1000 characters

In [12]:
log_txt[:1000]

'5.5058 \tDATA \t*** EVENT:start_task CODE:10 ***\t\n5.5379 \tDATA \t*** EVENT:start_instructions CODE:30 ***\t\n5.5380 \tDATA \tShowing instruction screen 1\t\n19.1799 \tDATA \tKeypress: space\n19.1800 \tDATA \tShowing instruction screen 2\t\n28.8425 \tDATA \tKeypress: space\n28.8426 \tDATA \tShowing instruction screen 3\t\n40.1126 \tDATA \tKeypress: space\n40.1127 \tDATA \t*** EVENT:end_instructions CODE:31 ***\t\n40.1179 \tDATA \t*** MESSAGE:Fixation 5.0 secs EVENT:start_fixation CODE:20 ***\t\n45.1340 \tDATA \t*** EVENT:end_fixation CODE:21 ***\t\n46.3937 \tDATA \tVIDEO START a2r15h.mp4 dur 13.59\t\n46.3937 \tDATA \t*** MESSAGE:Showing a2r15h.mp4 dur 13.59 EVENT:start_video CODE:60 ***\t\n59.9897 \tDATA \t*** EVENT:end_video CODE:61 ***\t\n59.9905 \tDATA \tVIDEO END a2r15h.mp4\t\n60.0151 \tDATA \t*** EVENT:get_rating CODE:70 ***\t\n63.6517 \tDATA \tKeypress: 1\n63.6519 \tDATA \tRATING 1\t\n63.6519 \tDATA \t*** MESSAGE:Key press 1 EVENT:rating_resp1 CODE:71 ***\t\n64.1555 \tDATA \t*

### Step 2. Create a list of events from log file

* We want a list of events from this log file.


* Each logged event is on a new line, so use the string `.split()` function with `\n` as the splitting delimiter

In [74]:
events = log_txt.split('\n')

In [75]:
len(events)

173

* We get a list of 173 lines from splitting the file on the `\n` character.


* Looking at the first 5 items in the resulting list we can see the three fields in each event are separated by a `\t` character.

In [31]:
events[:5]

['5.5058 \tDATA \t*** EVENT:start_task CODE:10 ***\t',
 '5.5379 \tDATA \t*** EVENT:start_instructions CODE:30 ***\t',
 '5.5380 \tDATA \tShowing instruction screen 1\t',
 '19.1799 \tDATA \tKeypress: space',
 '19.1800 \tDATA \tShowing instruction screen 2\t']

* We can split an event string using `.split('\t')`

In [23]:
events[0].split('\t')

['5.5058 ', 'DATA ', '*** EVENT:start_task CODE:10 ***', '']

* What the `split()` function does is match the delimiter string and add all preceding characters to the list:
```
        |     |xx|    |xx|                             |xx|  
        5.5058 \tDATA \t*** EVENT:start_task CODE:10 ***\t     
```

* Notice that we end up with a list of _four_ and not _three_ items because the final `\t` results in an empty string


* String objects also have a `.strip()` function that removes whitespace (space, tab and new line/linefeed characters) from the start and end of a string.

In [33]:
my_string = '\n\t     NON-WHITESPACE   \t\t\n'

In [36]:
my_string

'\n\t     NON-WHITESPACE   \t\t\n'

In [35]:
my_string.strip()

'NON-WHITESPACE'

In [38]:
print("Length before strip() is {} characters\nLength after strip() is {} characters".format(len(my_string), len(my_string.strip())))

Length before strip() is 27 characters
Length after strip() is 14 characters


* So we can use the `strip()` function on an event line before splitting on the tab character to return a list of three items corresponding to the three fields

In [39]:
events[0].strip().split('\t')

['5.5058 ', 'DATA ', '*** EVENT:start_task CODE:10 ***']

* You can assign named pointers to the items in a list by providing, on the left-hand side, as many names as there are items in the list


* For example:

In [44]:
a,b,c = [1,2,3]

In [45]:
a

1

In [46]:
b

2

In [47]:
c

3

* So for our event fields:

In [40]:
ts, ftype, msg = events[0].strip().split('\t')

In [43]:
print("ts = '{}'\nftype = '{}'\nmsg = '{}'".format(ts, ftype, msg))

ts = '5.5058 '
ftype = 'DATA '
msg = '*** EVENT:start_task CODE:10 ***'


* The `_` character can be used to indicate that you don't want to keep a reference to an intervening item

In [53]:
ts, _, msg = events[0].strip().split('\t')

In [57]:
ts, msg

('5.5058 ', '*** EVENT:start_task CODE:10 ***')

* And `*_` can be used at the end of the left-hand side of an assignment to indicate that all remaining values are of no interest


* This avoids a value unpacking error

In [62]:
a,b,c = [1,2,3,4,5,6,7,8,9,10]

ValueError: too many values to unpack (expected 3)

In [63]:
a,b,c,*_ = [1,2,3,4,5,6,7,8,9,10]

In [64]:
a,b,c

(1, 2, 3)
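
* The pieces of this step (strip, split on tab, unpack) can be combined into one small helper. This is only a sketch: the name `parse_event` is hypothetical, and it additionally strips each field and converts the timestamp to a `float`, which the cells above do not do:

```python
def parse_event(line):
    """Split one log line into a (timestamp, message) pair."""
    # strip the line, split on tabs, then strip the space left on each field
    fields = [field.strip() for field in line.strip().split('\t')]
    # keep the first and third fields; *_ absorbs any extra fields
    ts, _, msg, *_ = fields
    return float(ts), msg
```

* A line with fewer than three tab-separated fields would raise a `ValueError` at the unpacking step, so this assumes every line is a complete event.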

### Step 3. Create an empty list of trials to hold each trial dictionary

* Now we can start to build the rest of the steps to process the trial events in the log

In [66]:
trials = []

### Step 4. Process events to isolate trials

* The last item in the `events` list is an empty string because of the final `\n` in the `log_txt`

In [76]:
log_txt[-5:] # show last 5 characters of the log file string

'***\t\n'

In [77]:
events[-1]

''

* So we can apply `strip()` to the log file string before splitting on `\n`

* The resulting list will have one fewer item than before, as the final item will now be an event line

In [78]:
events2 = log_txt.strip().split('\n')

In [81]:
len(events2)

172

In [82]:
len(events)

173

In [83]:
events2[-1]

'396.1723 \tDATA \t*** EVENT:end_task CODE:11 ***'
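
* Stripping the whole string handles blank lines at the start and end only. As a sketch of two alternatives, list filtering removes empty lines wherever they occur, and the string `.splitlines()` function avoids the trailing empty item. The text in `sample_txt` is made up to stand in for `log_txt`:

```python
# made-up stand-in for log_txt: two events, a blank line, and a final newline
sample_txt = 'first event\t\nsecond event\t\n\n'

# split('\n') keeps empty strings for the blank line and the final newline
raw_lines = sample_txt.split('\n')

# list filtering: keep only lines that contain something besides whitespace
event_lines = [line for line in raw_lines if line.strip()]

# splitlines() adds no empty item after the final newline,
# but the blank line in the middle is still kept as ''
split_lines = sample_txt.splitlines()
```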

* Now we want to walk through each event line and extract the
    * timestamp
    * message
    
  fields and ignore the middle field

In [90]:
# loop over events
for event in events2:
    
    # get ts and msg fields from event string
    ts, _, msg = event.strip().split('\t')

* Next we want to find lines where the `msg` field begins with the string: `'VIDEO START'`


* We can use the `.startswith()` string function, which returns a `True` or `False` value

In [89]:
msg.startswith('VIDEO START')

False

In [93]:
# loop over events
for event in events2:
    
    # get ts and msg fields from event string
    ts, _, msg = event.strip().split('\t')
    
    # test message field for 'VIDEO START'
    if msg.startswith('VIDEO START'):
        print(event)

46.3937 	DATA 	VIDEO START a2r15h.mp4 dur 13.59	
70.3490 	DATA 	VIDEO START a1r12n.mp4 dur 11.35	
93.1325 	DATA 	VIDEO START a1r10n.mp4 dur 12.38	
114.8483 	DATA 	VIDEO START a1r09n.mp4 dur 13.8	
138.6565 	DATA 	VIDEO START a2d16h.mp4 dur 13.8	
162.2774 	DATA 	VIDEO START a1d12n.mp4 dur 13.33	
184.6573 	DATA 	VIDEO START a2d13h.mp4 dur 7.08	
199.5249 	DATA 	VIDEO START a2r23h.mp4 dur 13.03	
220.9387 	DATA 	VIDEO START a1d11n.mp4 dur 12.33	
242.3834 	DATA 	VIDEO START a2r18h.mp4 dur 14.63	
265.5594 	DATA 	VIDEO START a2d18h.mp4 dur 13.12	
287.0370 	DATA 	VIDEO START a1d15n.mp4 dur 8.7	
305.5932 	DATA 	VIDEO START a2d21h.mp4 dur 14.59	
328.8387 	DATA 	VIDEO START a1d10n.mp4 dur 13.95	
352.1696 	DATA 	VIDEO START a1r11n.mp4 dur 8.53	
372.1727 	DATA 	VIDEO START a2r22h.mp4 dur 11.71	


* Once we find an event with `VIDEO START` we need to extract the video file name


* We can do that by splitting the message field string on whitespace and indexing the third item (index 2)

In [97]:
'VIDEO START a2r15h.mp4 dur 13.59'.split()

['VIDEO', 'START', 'a2r15h.mp4', 'dur', '13.59']
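
* The same split also exposes the video duration as the fifth item (index 4). A small sketch that returns both values, assuming the message always has the form `VIDEO START <filename> dur <seconds>` as in the log lines shown. The function name `parse_video_start` is hypothetical:

```python
def parse_video_start(msg):
    """Return (filename, duration_in_seconds) from a 'VIDEO START' message."""
    parts = msg.split()
    # parts looks like ['VIDEO', 'START', filename, 'dur', duration]
    return parts[2], float(parts[4])
```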

In [100]:
trials = []

# loop over events
for event in events2:
    
    # get ts and msg fields from event string
    ts, _, msg = event.strip().split('\t')
    
    # test message field for 'VIDEO START'
    if msg.startswith('VIDEO START'):
        
        # create a dictionary with the value of the mp4 filename
        trial_data = {'video': msg.split()[2]}
        
        # add the trial_data dictionary to the list of trials
        trials.append(trial_data)

* We now have a list of dictionaries for each trial

In [102]:
trials

[{'video': 'a2r15h.mp4'},
 {'video': 'a1r12n.mp4'},
 {'video': 'a1r10n.mp4'},
 {'video': 'a1r09n.mp4'},
 {'video': 'a2d16h.mp4'},
 {'video': 'a1d12n.mp4'},
 {'video': 'a2d13h.mp4'},
 {'video': 'a2r23h.mp4'},
 {'video': 'a1d11n.mp4'},
 {'video': 'a2r18h.mp4'},
 {'video': 'a2d18h.mp4'},
 {'video': 'a1d15n.mp4'},
 {'video': 'a2d21h.mp4'},
 {'video': 'a1d10n.mp4'},
 {'video': 'a1r11n.mp4'},
 {'video': 'a2r22h.mp4'}]

* The next step is to extract the share rating value for each trial.


* The appropriate message field begins with the uppercase string `RATING` followed by the rating value.


* So we:
    1. match the event line with `msg.startswith('RATING')`
    2. split the `msg` string on whitespace and get the second item: `msg.split()[1]`

In [108]:
trials2 = []

# loop over events
for event in events2:
    
    # get ts and msg fields from event string
    ts, _, msg = event.strip().split('\t')
    
    # test message field for 'VIDEO START'
    if msg.startswith('VIDEO START'):
        
        # create a dictionary with the value of the mp4 filename
        trial_data = {'video': msg.split()[2]}
        
    if msg.startswith('RATING'):
        trial_data['share_rating'] = msg.split()[1]
        # add the trial_data dictionary to the list of trials
        trials2.append(trial_data)

In [109]:
trials2

[{'share_rating': '1', 'video': 'a2r15h.mp4'},
 {'share_rating': '5', 'video': 'a1r12n.mp4'},
 {'share_rating': '5', 'video': 'a1r10n.mp4'},
 {'share_rating': '5', 'video': 'a1r09n.mp4'},
 {'share_rating': '2', 'video': 'a2d16h.mp4'},
 {'share_rating': '5', 'video': 'a1d12n.mp4'},
 {'share_rating': '1', 'video': 'a2d13h.mp4'},
 {'share_rating': '1', 'video': 'a2r23h.mp4'},
 {'share_rating': '5', 'video': 'a1d11n.mp4'},
 {'share_rating': '1', 'video': 'a2r18h.mp4'},
 {'share_rating': '2', 'video': 'a2d18h.mp4'},
 {'share_rating': '5', 'video': 'a1d15n.mp4'},
 {'share_rating': '2', 'video': 'a2d21h.mp4'},
 {'share_rating': '5', 'video': 'a1d10n.mp4'},
 {'share_rating': '4', 'video': 'a1r11n.mp4'},
 {'share_rating': '2', 'video': 'a2r22h.mp4'}]
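
* Note that the ratings in the output above are strings (`'1'`, `'5'`), because `split()` always returns strings. If the ratings are going to be used in arithmetic they need converting with `int()`. A minimal sketch using two trials copied from the output above as sample data:

```python
# two trials copied from the output above, used here as sample data
sample_trials = [
    {'share_rating': '1', 'video': 'a2r15h.mp4'},
    {'share_rating': '5', 'video': 'a1r12n.mp4'},
]

# build new dictionaries with the rating converted from str to int
numeric_trials = [
    {'video': t['video'], 'share_rating': int(t['share_rating'])}
    for t in sample_trials
]

# arithmetic now works on the ratings
mean_rating = sum(t['share_rating'] for t in numeric_trials) / len(numeric_trials)
```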

* Lastly we want to have a trial number field in each trial dictionary


* We could do this in two different ways:
    1. keep a trial counter and increase its value each time a trial is matched and add it to the trial dictionary
    2. loop over the trial list (e.g. `trials2`) and add in a trial number in a second pass
    
    
* Here we'll implement the first possibility by using an integer object named `trial_num`
    ```
    trial_num = 0   # create an integer object and name it
    trial_num +=1   # increment the value of the object by one - this is the same as 
                    # trial_num = trial_num + 1 
                    
    ```
    

In [110]:
trials3 = []

trial_num = 0

# loop over events
for event in events2:
    
    # get ts and msg fields from event string
    ts, _, msg = event.strip().split('\t')
    
    # test message field for 'VIDEO START'
    if msg.startswith('VIDEO START'):
        trial_num += 1
        # create a dictionary with the value of the mp4 filename
        trial_data = {'video': msg.split()[2], 'trial_num': trial_num}
        
    if msg.startswith('RATING'):
        trial_data['share_rating'] = msg.split()[1]
        # add the trial_data dictionary to the list of trials
        trials3.append(trial_data)

In [111]:
trials3

[{'share_rating': '1', 'trial_num': 1, 'video': 'a2r15h.mp4'},
 {'share_rating': '5', 'trial_num': 2, 'video': 'a1r12n.mp4'},
 {'share_rating': '5', 'trial_num': 3, 'video': 'a1r10n.mp4'},
 {'share_rating': '5', 'trial_num': 4, 'video': 'a1r09n.mp4'},
 {'share_rating': '2', 'trial_num': 5, 'video': 'a2d16h.mp4'},
 {'share_rating': '5', 'trial_num': 6, 'video': 'a1d12n.mp4'},
 {'share_rating': '1', 'trial_num': 7, 'video': 'a2d13h.mp4'},
 {'share_rating': '1', 'trial_num': 8, 'video': 'a2r23h.mp4'},
 {'share_rating': '5', 'trial_num': 9, 'video': 'a1d11n.mp4'},
 {'share_rating': '1', 'trial_num': 10, 'video': 'a2r18h.mp4'},
 {'share_rating': '2', 'trial_num': 11, 'video': 'a2d18h.mp4'},
 {'share_rating': '5', 'trial_num': 12, 'video': 'a1d15n.mp4'},
 {'share_rating': '2', 'trial_num': 13, 'video': 'a2d21h.mp4'},
 {'share_rating': '5', 'trial_num': 14, 'video': 'a1d10n.mp4'},
 {'share_rating': '4', 'trial_num': 15, 'video': 'a1r11n.mp4'},
 {'share_rating': '2', 'trial_num': 16, 'video': 
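
* For comparison, the second option listed earlier (adding the trial number in a second pass) could be sketched with `enumerate()`, which pairs each list item with a running count. The list `sample_trials` is abbreviated sample data, not the full output:

```python
# abbreviated sample data in the shape of the trials2 list
sample_trials = [
    {'share_rating': '1', 'video': 'a2r15h.mp4'},
    {'share_rating': '5', 'video': 'a1r12n.mp4'},
    {'share_rating': '5', 'video': 'a1r10n.mp4'},
]

# second pass: enumerate() yields (count, item) pairs, counting from 1 here
for trial_num, trial in enumerate(sample_trials, start=1):
    trial['trial_num'] = trial_num
```

* This numbers the trials by their position in the list, so it only matches the counter approach when every `VIDEO START` has a matching `RATING`.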

### Apply these steps to each log file

* Now we have figured out the steps for processing one log file we can apply it to each log file in turn.

#### A note on using `os.path.join()` function

* When referencing a file or directory we often construct the file path with a base directory plus a sub folder plus a filename, e.g.
    ```
    /data/project/logs + task1 + subject1.log
    /data/project/logs + task1 + subject2.log
    /data/project/logs + task1 + subject3.log
    ...etc...
    ```
    
* We can do this by _adding_ strings together with the `+` operator, e.g.

In [112]:
'/data/projects/logs' + '/task1' + '/subject1.log'

'/data/projects/logs/task1/subject1.log'

* And with a loop we can create the file paths for, say, a list of five subjects

In [114]:
for subject in range(1,6):
    fpath = '/data/projects/logs' + '/task1' + '/subject' + str(subject) + '.log'
    print(fpath)

/data/projects/logs/task1/subject1.log
/data/projects/logs/task1/subject2.log
/data/projects/logs/task1/subject3.log
/data/projects/logs/task1/subject4.log
/data/projects/logs/task1/subject5.log


* You will often see code just like that in scripts you'll find online or get from other people


* BUT 
    * it's pretty ugly looking code 
    * it's not very robust, as it requires you to get the path syntax right yourself, i.e. to put every forward slash in the correct place
    
    
* For example, if you forget the `/` when trying to access `log130.log` in the `log_data` folder:

In [26]:
open('./log_data'+'log130.log')

FileNotFoundError: [Errno 2] No such file or directory: './log_datalog130.log'

* The `os` module has a submodule called `path` that has a function called `join` that helps construct a syntactically well-formed filepath

* Arguments for `join()` are the parts of a file path (i.e. folders, sub-folders and filename)

In [28]:
os.path.join('a','c','d')

'a/c/d'

In [116]:
for subject in range(1,6):
    fpath = os.path.join('/data','projects','logs','task1', 'subject{}.log'.format(subject))
    print(fpath)

/data/projects/logs/task1/subject1.log
/data/projects/logs/task1/subject2.log
/data/projects/logs/task1/subject3.log
/data/projects/logs/task1/subject4.log
/data/projects/logs/task1/subject5.log
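
* One practical benefit is that `os.path.join()` does not care whether a folder name already ends with a separator, which is exactly the detail that is easy to get wrong with `+`. A short sketch using the folder and file names from this notebook:

```python
import os

# joining works whether or not the folder name already ends in a separator
with_slash = os.path.join('log_data/', 'log130.log')
without_slash = os.path.join('log_data', 'log130.log')

# normpath tidies a path so the two results can be compared on any platform
same_path = os.path.normpath(with_slash) == os.path.normpath(without_slash)
```

* Here `same_path` should be `True`, whereas `'log_data' + 'log130.log'` silently produces the wrong path.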


* This works well in a loop to process the log files


* For example, here are the first ten log files in the list `logs_to_process` with the correct relative path 

In [124]:
for log in logs_to_process[:10]:
    fpath = os.path.join(LOG_DIR, log)
    print(fpath)

./log_data/log130.log
./log_data/log131.log
./log_data/log132.log
./log_data/log133.log
./log_data/log134.log
./log_data/log135.log
./log_data/log136.log
./log_data/log137.log
./log_data/log138.log
./log_data/log139.log


### The _not-so-good_ way to process all files

* We could combine the block of code we worked out above to extract trial data with this loop to construct the path to each log file, like this.
    * _Try and follow through the code and make sure you understand what it is doing._
    

In [118]:
subject_data = []

for log in logs_to_process:
    log_text = open(os.path.join(LOG_DIR, log)).read()
    
    events = log_text.strip().split('\n')
    
    trials = []
    trial_num = 0
    
    for event in events:
        ts, _, msg = event.strip().split('\t')
        
        if msg.startswith('VIDEO START'):
            trial_num+=1
            trial_data = {'video': msg.split()[2],
                          'trial_num': trial_num}
            
        if msg.startswith('RATING'):
            trial_data['share_rating']=msg.split()[1]
            
            trials.append(trial_data)
            
    subject_data.append((log, trials))
        

* An outer list, `subject_data`, is used to hold the trial list for each subject

In [121]:
len(subject_data)

43

* If you take a look at the first item in the `subject_data` list you can see that each item is a `tuple` (immutable list) with two items:

In [123]:
subject_data[0]

('log130.log',
 [{'share_rating': '1', 'trial_num': 1, 'video': 'a2r15h.mp4'},
  {'share_rating': '5', 'trial_num': 2, 'video': 'a1r12n.mp4'},
  {'share_rating': '5', 'trial_num': 3, 'video': 'a1r10n.mp4'},
  {'share_rating': '5', 'trial_num': 4, 'video': 'a1r09n.mp4'},
  {'share_rating': '2', 'trial_num': 5, 'video': 'a2d16h.mp4'},
  {'share_rating': '5', 'trial_num': 6, 'video': 'a1d12n.mp4'},
  {'share_rating': '1', 'trial_num': 7, 'video': 'a2d13h.mp4'},
  {'share_rating': '1', 'trial_num': 8, 'video': 'a2r23h.mp4'},
  {'share_rating': '5', 'trial_num': 9, 'video': 'a1d11n.mp4'},
  {'share_rating': '1', 'trial_num': 10, 'video': 'a2r18h.mp4'},
  {'share_rating': '2', 'trial_num': 11, 'video': 'a2d18h.mp4'},
  {'share_rating': '5', 'trial_num': 12, 'video': 'a1d15n.mp4'},
  {'share_rating': '2', 'trial_num': 13, 'video': 'a2d21h.mp4'},
  {'share_rating': '5', 'trial_num': 14, 'video': 'a1d10n.mp4'},
  {'share_rating': '4', 'trial_num': 15, 'video': 'a1r11n.mp4'},
  {'share_rating': 

* The first item in the tuple is the log filename

In [125]:
subject_data[0][0]

'log130.log'

* And the second is the log trial list for that file

In [126]:
subject_data[0][1]

[{'share_rating': '1', 'trial_num': 1, 'video': 'a2r15h.mp4'},
 {'share_rating': '5', 'trial_num': 2, 'video': 'a1r12n.mp4'},
 {'share_rating': '5', 'trial_num': 3, 'video': 'a1r10n.mp4'},
 {'share_rating': '5', 'trial_num': 4, 'video': 'a1r09n.mp4'},
 {'share_rating': '2', 'trial_num': 5, 'video': 'a2d16h.mp4'},
 {'share_rating': '5', 'trial_num': 6, 'video': 'a1d12n.mp4'},
 {'share_rating': '1', 'trial_num': 7, 'video': 'a2d13h.mp4'},
 {'share_rating': '1', 'trial_num': 8, 'video': 'a2r23h.mp4'},
 {'share_rating': '5', 'trial_num': 9, 'video': 'a1d11n.mp4'},
 {'share_rating': '1', 'trial_num': 10, 'video': 'a2r18h.mp4'},
 {'share_rating': '2', 'trial_num': 11, 'video': 'a2d18h.mp4'},
 {'share_rating': '5', 'trial_num': 12, 'video': 'a1d15n.mp4'},
 {'share_rating': '2', 'trial_num': 13, 'video': 'a2d21h.mp4'},
 {'share_rating': '5', 'trial_num': 14, 'video': 'a1d10n.mp4'},
 {'share_rating': '4', 'trial_num': 15, 'video': 'a1r11n.mp4'},
 {'share_rating': '2', 'trial_num': 16, 'video': 
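
* Because each item is a two-item tuple, the unpacking syntax from Step 2 can be used directly in a `for` loop instead of indexing with `[0]` and `[1]`. A sketch that counts the trials per log file, using a heavily abbreviated copy of the data as `sample_subject_data`:

```python
# abbreviated sample in the same shape as subject_data: (filename, trial list)
sample_subject_data = [
    ('log130.log', [{'trial_num': 1, 'video': 'a2r15h.mp4', 'share_rating': '1'},
                    {'trial_num': 2, 'video': 'a1r12n.mp4', 'share_rating': '5'}]),
    ('log131.log', [{'trial_num': 1, 'video': 'a1d03n.mp4', 'share_rating': '1'}]),
]

trial_counts = {}

# unpack each tuple into two names as part of the loop statement
for log_name, trials in sample_subject_data:
    trial_counts[log_name] = len(trials)
```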

### We've solved the problem for this task!

* **BUT** it's not the cleanest or most elegant looking bit of code

* **AND** if we want to add some more steps to the trial data extraction, e.g. pulling out onset and durations and other pieces of data in the log events, we'll need to add more to the nested block in the second `for` loop.


#### Refactoring

* If your code starts to look like this and you have nested `for` loops containing blocks of code with more than 3 or 4 lines then you should think about __REFACTORING__ your code. One way to do this is to encapsulate repeated subsets of code into _functions_.

### Create a log file processing function

* A __FUNCTION__ is a block of code that carries out a series of steps
    * It can take various inputs (zero or more) that are objects to be used in the function
    * And can return outputs (zero or more) that are the result of the steps in the function
    
    
* A function is defined using `def` and has a signature which is:
    ```
        def FUNCTION_NAME(arg1, arg2, ..., argN):
            # --- CODE BLOCK ----
            
            return outputs
    ```
    

* The simplest form of a function is one that takes no inputs and returns no output

In [130]:
def my_function():
    
    print('This is my function')
    print('15*74 =', 15*74)

* Once defined you can _call_ a function with its name followed by parentheses:

In [131]:
my_function()

This is my function
15*74 = 1110


* This can be the first step in refactoring your code


* But most often you'll want to pass some values to your function (e.g. the log filename) to be used in the steps in the function's code block.


* Here is a simple function that takes two arguments and returns the result of adding them together:

In [132]:
def my_sum(a,b):
    return a+b

* When you call the function you need to supply two values that will be accessible through named pointers `a` and `b` inside the function


* If you don't pass any or enough arguments you will get an error message

In [61]:
my_sum()

TypeError: my_sum() missing 2 required positional arguments: 'a' and 'b'

In [133]:
my_sum(1,2)

3

In [134]:
my_sum(13232, 123)

13355

In [135]:
my_sum('asdas','asds')

'asdasasds'

In [136]:
my_sum(1, 123.322)

124.322
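
* The examples above work because `+` is defined for each pair of types passed in. Mixing types that cannot be added, such as an integer and a string, raises a `TypeError`. A minimal sketch that catches the error so the message can be inspected (`my_sum` is repeated so the block runs on its own):

```python
def my_sum(a, b):
    return a + b

try:
    result = my_sum(1, 'one')     # int + str is not defined
except TypeError as err:
    result = None
    error_message = str(err)      # keep the message for inspection
```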

#### Including a `docstring` to describe what your function does and what arguments it expects

* It is good practice to include a _docstring_ at the top of your function definition


* This is a multiline string that can contain anything you want but by convention has:
    1. a short description of what the function does
    2. a description of all the arguments and the expected values
    3. a description of what the function returns

In [69]:
def my_sum2(a,b):
    '''
    This is my function it does cool stuff by adding two items
    
    Args:
       a   - item1 
       b   - item2 (NOTE item1 and item2 should be of same type(ish))
       
    Returns:
       the results of apply + operation to item1 and item2
    '''
    return a+b

* The contents of the docstring will be included in what is returned by calling `help()` on your function

In [70]:
help(my_sum2)

Help on function my_sum2 in module __main__:

my_sum2(a, b)
    This is my function it does cool stuff by adding two items
    
    Args:
       a   - item1 
       b   - item2 (NOTE item1 and item2 should be of same type(ish))
       
    Returns:
       the results of apply + operation to item1 and item2



### Defining a function to process a log file

* Now we define a function called `process_log_file` that
    * takes a `filename` as INPUT (a path to the log file to process)
    * and returns a list-of-dictionaries where each dictionary represents a trial

In [137]:
def process_log_file(filename):
    '''
    load a log file and process contents for trial events
    
    Args:
        filename    - path to a .log file
    
    Returns:
        a list of dictionaries where each list item is a trial
        and each trial is a dictionary with:
            trial_num
            video
            share_rating
            
            
    '''
    
    log_txt = open(filename).read()
    
    events = log_txt.strip().split('\n')
    
    trials = []
    trial_num = 0
    
    for event in events:
        
        ts, _, msg = event.strip().split('\t')
        
        if msg.startswith('VIDEO START'):
            trial_num +=1
            
            trial_data = {
                            'trial_num': trial_num,
                            'video': msg.split()[2]
                         }
            
            
        if msg.startswith('RATING'):
            trial_data['share_rating']=msg.split()[1]
            
            trials.append(trial_data)
            
            
    return trials

* Then we can call the function like this:

In [138]:
process_log_file('./log_data/log131.log')

[{'share_rating': '1', 'trial_num': 1, 'video': 'a1d03n.mp4'},
 {'share_rating': '1', 'trial_num': 2, 'video': 'a1r05n.mp4'},
 {'share_rating': '2', 'trial_num': 3, 'video': 'a2r04h.mp4'},
 {'share_rating': '3', 'trial_num': 4, 'video': 'a2r02h.mp4'},
 {'share_rating': '4', 'trial_num': 5, 'video': 'a2d09h.mp4'},
 {'share_rating': '2', 'trial_num': 6, 'video': 'a1d07n.mp4'},
 {'share_rating': '2', 'trial_num': 7, 'video': 'a2d06h.mp4'},
 {'share_rating': '1', 'trial_num': 8, 'video': 'a1r08n.mp4'},
 {'share_rating': '3', 'trial_num': 9, 'video': 'a2r13h.mp4'},
 {'share_rating': '2', 'trial_num': 10, 'video': 'a2d04h.mp4'},
 {'share_rating': '2', 'trial_num': 11, 'video': 'a2r07h.mp4'},
 {'share_rating': '2', 'trial_num': 12, 'video': 'a1d01n.mp4'},
 {'share_rating': '1', 'trial_num': 13, 'video': 'a1d08n.mp4'},
 {'share_rating': '2', 'trial_num': 14, 'video': 'a1r06n.mp4'},
 {'share_rating': '1', 'trial_num': 15, 'video': 'a1r01n.mp4'},
 {'share_rating': '2', 'trial_num': 16, 'video': 
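
* The function above assumes every log is well formed. As a hedged sketch only, a more defensive variant might close the file with a `with` block, skip lines that do not have three fields, ignore a `RATING` that appears before any `VIDEO START`, and store the rating as an integer. The name `process_log_file_checked` and these extra checks are assumptions added here, not part of the original task code:

```python
def process_log_file_checked(filename):
    """Variant of process_log_file with extra checks and integer ratings."""
    # the with block closes the file once it has been read
    with open(filename) as fh:
        events = fh.read().strip().split('\n')

    trials = []
    trial_num = 0
    trial_data = None            # no trial in progress yet

    for event in events:
        fields = event.strip().split('\t')
        if len(fields) < 3:
            continue             # skip lines without all three fields
        msg = fields[2].strip()

        if msg.startswith('VIDEO START'):
            trial_num += 1
            trial_data = {'trial_num': trial_num, 'video': msg.split()[2]}
        elif msg.startswith('RATING') and trial_data is not None:
            trial_data['share_rating'] = int(msg.split()[1])
            trials.append(trial_data)
            trial_data = None    # wait for the next VIDEO START

    return trials
```

* On a well-formed log this should return the same trials as `process_log_file()`, except that `share_rating` holds integers rather than strings.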

### A better solution

* So now with the `process_log_file()` function encapsulating the data extraction code we can remove the inner loop in the previous code


* The result of processing each file could be stored in various data structures. Here we'll use a dictionary where the key is the filename and the corresponding value will be the list-of-dictionaries returned by the function

In [139]:
log_data = {}

for log in logs_to_process:
    
    log_filename = os.path.join(LOG_DIR, log)
    print('Processing', log_filename)
    
    log_data[log] = process_log_file(log_filename)
            

Processing ./log_data/log130.log
Processing ./log_data/log131.log
Processing ./log_data/log132.log
Processing ./log_data/log133.log
Processing ./log_data/log134.log
Processing ./log_data/log135.log
Processing ./log_data/log136.log
Processing ./log_data/log137.log
Processing ./log_data/log138.log
Processing ./log_data/log139.log
Processing ./log_data/log140.log
Processing ./log_data/log141.log
Processing ./log_data/log142.log
Processing ./log_data/log143.log
Processing ./log_data/log144.log
Processing ./log_data/log145.log
Processing ./log_data/log146.log
Processing ./log_data/log147.log
Processing ./log_data/log148.log
Processing ./log_data/log149.log
Processing ./log_data/log150.log
Processing ./log_data/log151.log
Processing ./log_data/log152.log
Processing ./log_data/log153.log
Processing ./log_data/log154.log
Processing ./log_data/log155.log
Processing ./log_data/log156.log
Processing ./log_data/log157.log
Processing ./log_data/log158.log
Processing ./log_data/log159.log
Processing

In [140]:
len(log_data)

43

In [141]:
log_data.keys()

dict_keys(['log130.log', 'log131.log', 'log132.log', 'log133.log', 'log134.log', 'log135.log', 'log136.log', 'log137.log', 'log138.log', 'log139.log', 'log140.log', 'log141.log', 'log142.log', 'log143.log', 'log144.log', 'log145.log', 'log146.log', 'log147.log', 'log148.log', 'log149.log', 'log150.log', 'log151.log', 'log152.log', 'log153.log', 'log154.log', 'log155.log', 'log156.log', 'log157.log', 'log158.log', 'log159.log', 'log160.log', 'log161.log', 'log162.log', 'log163.log', 'log164.log', 'log165.log', 'log166.log', 'log167.log', 'log168.log', 'log169.log', 'log170.log', 'log171.log', 'log172.log'])

In [142]:
log_data['log130.log']

[{'share_rating': '1', 'trial_num': 1, 'video': 'a2r15h.mp4'},
 {'share_rating': '5', 'trial_num': 2, 'video': 'a1r12n.mp4'},
 {'share_rating': '5', 'trial_num': 3, 'video': 'a1r10n.mp4'},
 {'share_rating': '5', 'trial_num': 4, 'video': 'a1r09n.mp4'},
 {'share_rating': '2', 'trial_num': 5, 'video': 'a2d16h.mp4'},
 {'share_rating': '5', 'trial_num': 6, 'video': 'a1d12n.mp4'},
 {'share_rating': '1', 'trial_num': 7, 'video': 'a2d13h.mp4'},
 {'share_rating': '1', 'trial_num': 8, 'video': 'a2r23h.mp4'},
 {'share_rating': '5', 'trial_num': 9, 'video': 'a1d11n.mp4'},
 {'share_rating': '1', 'trial_num': 10, 'video': 'a2r18h.mp4'},
 {'share_rating': '2', 'trial_num': 11, 'video': 'a2d18h.mp4'},
 {'share_rating': '5', 'trial_num': 12, 'video': 'a1d15n.mp4'},
 {'share_rating': '2', 'trial_num': 13, 'video': 'a2d21h.mp4'},
 {'share_rating': '5', 'trial_num': 14, 'video': 'a1d10n.mp4'},
 {'share_rating': '4', 'trial_num': 15, 'video': 'a1r11n.mp4'},
 {'share_rating': '2', 'trial_num': 16, 'video': 
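
* Once every file is stored in the dictionary, we can loop over it and use LIST FILTERING to summarize each participant. The sketch below is only an illustration: it uses a small sample holding the first three trials shown above for `log130.log` and `log131.log`, and the helper names `mean_rating` and `high_rated` (and the threshold of 4) are assumptions, not part of the original notebook.

```python
# Small stand-in for `log_data`: the first three trials shown above for two files.
log_data = {
    'log130.log': [
        {'share_rating': '1', 'trial_num': 1, 'video': 'a2r15h.mp4'},
        {'share_rating': '5', 'trial_num': 2, 'video': 'a1r12n.mp4'},
        {'share_rating': '5', 'trial_num': 3, 'video': 'a1r10n.mp4'},
    ],
    'log131.log': [
        {'share_rating': '1', 'trial_num': 1, 'video': 'a1d03n.mp4'},
        {'share_rating': '1', 'trial_num': 2, 'video': 'a1r05n.mp4'},
        {'share_rating': '2', 'trial_num': 3, 'video': 'a2r04h.mp4'},
    ],
}


def mean_rating(trials):
    """Average the share ratings (stored as strings) of a list of trials."""
    ratings = [int(trial['share_rating']) for trial in trials]
    if not ratings:
        return None  # no trials, so no meaningful average
    return sum(ratings) / len(ratings)


def high_rated(trials, threshold=4):
    """LIST FILTERING: keep only trials rated at or above the threshold."""
    return [trial for trial in trials
            if int(trial['share_rating']) >= threshold]


# Build one summary dictionary per log file.
summary = {}
for log, trials in log_data.items():
    summary[log] = {
        'n_trials': len(trials),
        'mean_rating': mean_rating(trials),
        'n_high': len(high_rated(trials)),
    }

for log in sorted(summary):
    print(log, summary[log])
```

* Note that the ratings have to be converted with `int()` first because `process_log_file()` stores them as strings.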

## Exercises

1. Modify `process_log_file` to use the timestamp value for each trial and the movie duration to add two extra fields to each trial dictionary:

    * `onset`
    * `dur`
    ```
        [
            {'dur': '8.53', 'onset': '34.1293 ', 'share_rating': '2', 'trial_num': 1, 'video': 'a1r11n.mp4'},
            {'dur': '7.08', 'onset': '53.5779 ', 'share_rating': '4', 'trial_num': 2, 'video': 'a2d13h.mp4'},
            ...
        ]
    ```
    
  Call the new function `process_log_file_v2`.
  
2. Modify `process_log_file` to include:
    
    * the participant ID in each trial dictionary, where the participant ID is the number in the filename prefixed with `s`, e.g. `log130.log` => `s130`
    * a `condition` field with values:
        * `HUMOR`
        * `NONHUMOR`
      based on the last character before the `.mp4` in the video filename, e.g.
          `a2d13h.mp4` would be `HUMOR` and `a1r11n.mp4` would be `NONHUMOR`
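
* As a hint for exercise 2, the two string manipulations can be tried out on their own before changing the function. This is a minimal sketch under stated assumptions: the helper names `participant_id_from_filename` and `condition_from_video` are hypothetical, and returning `None` for an unexpected final character is a choice made here, not something the task defines.

```python
import os


def participant_id_from_filename(filename):
    """Turn a path such as './log_data/log130.log' into 's130'."""
    base = os.path.basename(filename)      # 'log130.log'
    stem, _ = os.path.splitext(base)       # 'log130'
    digits = ''.join(ch for ch in stem if ch.isdigit())
    return 's' + digits


def condition_from_video(video):
    """Map the last character before '.mp4' to a condition label."""
    stem, _ = os.path.splitext(video)      # 'a2d13h'
    labels = {'h': 'HUMOR', 'n': 'NONHUMOR'}
    # Assumption: any other final character yields None.
    return labels.get(stem[-1:])


print(participant_id_from_filename('./log_data/log130.log'))
print(condition_from_video('a2d13h.mp4'))
print(condition_from_video('a1r11n.mp4'))
```

* Inside the modified function, these values could be added as extra keys when the `trial_data` dictionary is created.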

## Solutions

In [144]:
def process_log_file_v2(filename):
    '''
    load a log file and process contents for trial events
    
    Args:
        filename    - path to a .log file
    
    Returns:
        a list of dictionaries where each list item is a trial
        and each trial is a dictionary with:
            trial_num
            video
            share_rating
            onset
            dur
    '''
    
    # use a context manager so the file is closed after reading
    with open(filename) as log_file:
        log_txt = log_file.read()
    
    events = log_txt.strip().split('\n')
    
    trials = []
    trial_num = 0
    
    for event in events:
        
        ts, _, msg = event.strip().split('\t')
        
        if msg.startswith('VIDEO START'):
            trial_num += 1
            
            trial_data = {
                            'trial_num': trial_num,
                            'video': msg.split()[2],
                            'onset': ts,
                            'dur': msg.split()[4]
                         }
            
            
        if msg.startswith('RATING'):
            trial_data['share_rating']=msg.split()[1]
            
            trials.append(trial_data)
            
            
    return trials
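
* The function stores `onset`, `dur` and `share_rating` as strings, and `onset` keeps the trailing space that follows the timestamp in the log line. If numbers are needed for arithmetic, a conversion step along these lines could be applied afterwards. This is a sketch only: `convert_trial_types` is a hypothetical helper name, and the sample trials are the two shown in exercise 1.

```python
def convert_trial_types(trials):
    """Return copies of the trial dictionaries with numeric fields converted."""
    converted = []
    for trial in trials:
        new_trial = dict(trial)  # copy so the original list is left untouched
        # float() ignores surrounding whitespace, so '34.1293 ' is fine
        new_trial['onset'] = float(trial['onset'])
        new_trial['dur'] = float(trial['dur'])
        new_trial['share_rating'] = int(trial['share_rating'])
        converted.append(new_trial)
    return converted


# Sample trials copied from the expected output in exercise 1.
sample = [
    {'dur': '8.53', 'onset': '34.1293 ', 'share_rating': '2',
     'trial_num': 1, 'video': 'a1r11n.mp4'},
    {'dur': '7.08', 'onset': '53.5779 ', 'share_rating': '4',
     'trial_num': 2, 'video': 'a2d13h.mp4'},
]

numeric = convert_trial_types(sample)

# With numbers we can, for example, estimate when each video ended.
for trial in numeric:
    trial['offset'] = trial['onset'] + trial['dur']
    print(trial['video'], trial['onset'], trial['offset'])
```

* Keeping the conversion in a separate function leaves `process_log_file_v2` unchanged, so its output still matches the strings shown in the exercise.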

In [145]:
process_log_file_v2('./log_data/log136.log')

[{'dur': '8.53',
  'onset': '34.1293 ',
  'share_rating': '2',
  'trial_num': 1,
  'video': 'a1r11n.mp4'},
 {'dur': '7.08',
  'onset': '53.5779 ',
  'share_rating': '4',
  'trial_num': 2,
  'video': 'a2d13h.mp4'},
 {'dur': '13.12',
  'onset': '70.4277 ',
  'share_rating': '3',
  'trial_num': 3,
  'video': 'a2d18h.mp4'},
 {'dur': '13.95',
  'onset': '93.7531 ',
  'share_rating': '4',
  'trial_num': 4,
  'video': 'a1d10n.mp4'},
 {'dur': '14.59',
  'onset': '119.1040 ',
  'share_rating': '2',
  'trial_num': 5,
  'video': 'a2d21h.mp4'},
 {'dur': '12.33',
  'onset': '142.7124 ',
  'share_rating': '3',
  'trial_num': 6,
  'video': 'a1d11n.mp4'},
 {'dur': '14.63',
  'onset': '163.9275 ',
  'share_rating': '3',
  'trial_num': 7,
  'video': 'a2r18h.mp4'},
 {'dur': '12.38',
  'onset': '187.6043 ',
  'share_rating': '4',
  'trial_num': 8,
  'video': 'a1r10n.mp4'},
 {'dur': '13.59',
  'onset': '208.5594 ',
  'share_rating': '4',
  'trial_num': 9,
  'video': 'a2r15h.mp4'},
 {'dur': '13.8',
  'onset