This script can be used to preprocess diary files (see below) and/or to blank out invalid entries (i.e. records that report no intrusions but still contain values for vividness and distress).

In [18]:
# import some libraries
import os
import pandas as pd
import numpy as np

Input directory. You may have to change this, depending on where you are running this from. os.getcwd() just gets the current directory. Note also that the output directory will be created in the current directory, so either navigate to where you want things to be stored or provide an explicit directory name below (e.g. input_dir = r"path\to\mydir"). The r prefix makes it a raw string, so backslashes won't cause problems.
Also note: I downloaded the raw Qualtrics files and saved them as 'diary1.csv', 'diary2.csv',... If this is NOT what you saved the files as, you need to modify the list comprehension below.
E.g. if you saved the files as some_data1.csv, some_data2.csv, you need to change the list comprehension to:
[file for file in os.listdir(input_dir) if 'some_data' in file]
The point is: you are identifying something that is distinct about the diary file names, so that you only operate on the files you are actually interested in processing.

In [95]:
input_dir = os.getcwd()
input_files = [file for file in os.listdir(input_dir) if 'diary' in file]
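
The substring test above also matches any other file or folder in the directory whose name happens to contain 'diary'. As a sketch of a stricter filter, the snippet below additionally requires a .csv extension. The names in listing and the helper pick_diary_files are made up for illustration and are not part of this script.

```python
# Hypothetical directory listing; in the notebook this would come from
# os.listdir(input_dir).
listing = ["diary1.csv", "diary2.csv", "diary_notes.txt",
           "processed_diaries", "other.csv"]

def pick_diary_files(names, marker="diary", ext=".csv"):
    """Keep names that contain `marker` and end with `ext` (case-insensitive)."""
    return sorted(n for n in names
                  if marker in n.lower() and n.lower().endswith(ext))

diary_files = pick_diary_files(listing)
print(diary_files)
```

Only the two diary CSV files survive. The text file is dropped because of its extension, and other.csv because it lacks the marker.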

Create output directory, but check first if it already exists and do nothing if that's the case.

In [150]:
# make output dir; exist_ok=True means nothing happens if it already exists
output_dir = os.path.join(os.getcwd(),"processed_diaries")
os.makedirs(output_dir, exist_ok=True)

Utilities.
These are the functions used in the main body of the script below. Hopefully the names and docstrings are sufficient explanation, but I'm happy to answer any questions.

In [152]:
def remove_incomplete_rows(in_df,finished_col):
    """
    remove rows containing incomplete records
    
    Parameters
    ----------
    in_df:  pd DataFrame
        input dataframe to operate on
    finished_col:   str
        name of column containing complete/incomplete info
        NB: looking for the col with BOOLEAN not %
    Returns
    -------
        df w/o incomplete records
    """
    in_df = in_df[in_df[finished_col]==True]
    return in_df

def rename_diary_cols(in_df, start_phrase = None, col_num = None):
    """
    Rename the columns referring to diary content
    Parameters
    ----------
    in_df:  pd DataFrame
        input dataframe to operate on
    start_phrase:   str
        phrase to look for in column name
    col_num:    str
        column marker, eg "column 1"
    """
    if col_num is None:
        in_df = in_df.rename(columns = {in_df.filter(like = start_phrase, axis = 1).columns[0]: "had_intrusions"})
    else:
        if "1" in col_num:
            new_names = ['_'.join(['content',str(num)]) for num in np.arange(1,13)]
        elif "2" in col_num:
            new_names = ['_'.join(['freq',str(num)]) for num in np.arange(1,13)]
        elif "3" in col_num:
            new_names = ['_'.join(['distress',str(num)]) for num in np.arange(1,13)]
        else:
            new_names = ['_'.join(['vivid',str(num)]) for num in np.arange(1,13)]
        
        old_names = in_df.filter(like = col_num.upper(), axis = 1).columns
        # return a renamed copy rather than modifying the caller's frame in place
        in_df = in_df.rename(columns = dict(zip(old_names,new_names)))
    return in_df

def select_columns(in_df, select_list):
    """
    Select columns to process.
    
    Parameters
    ----------
    in_df:  pd DataFrame
        dataframe to operate on
    select_list: list[str]
        names of columns to retain
    Returns
    -------
    in_df w/o irrelevant columns
    """
    diary_cols = [f for f in in_df.columns
                    if any
                    (k in f for k in 
                    ["had_intrusions","content","freq","distress","vivid"])]
    # build a new list so the caller's select_list is not modified
    keep_cols = list(select_list) + diary_cols
    in_df = in_df.loc[:,keep_cols]
    return in_df

def strip_col_names(in_df):
    """
    strip col names for easier handling
    
    Parameters
    ---------
    in_df:  pd dataframe
        dataframe to operate on
    
    Returns
    -------
        in_df w stripped col names
    """
    in_df.columns = [f.strip(":") for f in in_df.columns]
    in_df.columns = [f.replace(" ","_") for f in in_df.columns]
    return in_df

def preprocess_frame(in_df,finished_col,select_list):
    """ preprocess dataframe"""
    in_df = remove_incomplete_rows(in_df,finished_col)
    in_df = rename_diary_cols(in_df, start_phrase = "have you experienced")
    col_nums = [' '.join(['column',str(num)]) for num in np.arange(1,5)]
    for col in col_nums:
        in_df = rename_diary_cols(in_df,col_num = col)
    in_df = select_columns(in_df,select_list)
    in_df = strip_col_names(in_df)
    return in_df

def rem_dat_no_ints(in_df):
    """
    Remove data for records w/out intrusions
    if had_intrusions == "No",
    set vivid/distress to NaN
    (content and freq columns are left unchanged)

    Parameters
    ---------
    in_df:  pd DataFrame
        input dataframe to operate on
    Returns
    -------
        in_df w data modified
    """
    diary_cols = [col for col in in_df.columns if 
                    any(name in col for name in
                    ['distress','vivid'])]
    in_df.loc[in_df.had_intrusions=='No',diary_cols] = np.nan
    return in_df
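
As a quick check of what rem_dat_no_ints does, here is a minimal sketch of the same masking on a made-up two-row frame. The values and the single set of diary columns are invented for illustration. The logic is repeated inline so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Toy data: the second participant reports no intrusions but still
# entered distress and vividness ratings.
toy = pd.DataFrame({
    "had_intrusions": ["Yes", "No"],
    "content_1": ["memory A", "memory B"],
    "freq_1": [2, 1],
    "distress_1": [40.0, 55.0],
    "vivid_1": [60.0, 70.0],
})

# Same selection rule as rem_dat_no_ints: only distress/vivid columns.
rating_cols = [c for c in toy.columns
               if any(k in c for k in ("distress", "vivid"))]
toy.loc[toy["had_intrusions"] == "No", rating_cols] = np.nan
print(toy)
```

After masking, the "No" row keeps its content and freq entries but has NaN for distress_1 and vivid_1. The "Yes" row is unchanged.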


Main body.
Set save = 1 if you want the processed files to be stored in output_dir, 0 if not.
For each file in the input folder whose name contains 'diary', the following steps are applied:
- remove incomplete records (retain only rows where Finished == True)
- give the diary-related columns shorter names (had_intrusions, content_1, freq_1, distress_1, vivid_1, ...)
- remove columns we're not interested in (you can customize this by passing a different list of column names to preprocess_frame)
- strip punctuation (:) from column names and replace whitespace with _
- replace data in the distress/vividness columns with np.nan if had_intrusions == No
- if save, save to output_dir as <name>_processed.csv
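
To illustrate the renaming and stripping steps listed above, here is a minimal sketch on a made-up frame. The raw headers are assumptions about what exported column names might look like, and only two of the four column groups are shown. The approach mirrors rename_diary_cols and strip_col_names but is written inline so it runs on its own.

```python
import pandas as pd

# Hypothetical raw headers (assumed for illustration only).
raw_cols = ["Participant number:",
            "Diary - COLUMN 1 - 1", "Diary - COLUMN 1 - 2",
            "Diary - COLUMN 2 - 1", "Diary - COLUMN 2 - 2"]
df = pd.DataFrame([[1, "a", "b", 3, 4]], columns=raw_cols)

# Map each column marker to the short prefix used in this script.
prefixes = {"COLUMN 1": "content", "COLUMN 2": "freq"}
mapping = {}
for marker, prefix in prefixes.items():
    old_names = df.filter(like=marker, axis=1).columns
    for i, name in enumerate(old_names, start=1):
        mapping[name] = f"{prefix}_{i}"

df = df.rename(columns=mapping)
# Strip the colon and replace spaces, as strip_col_names does.
df.columns = [c.strip(":").replace(" ", "_") for c in df.columns]
print(list(df.columns))
```

Note that filter(like=...) is a plain substring match, so a marker such as "COLUMN 1" would also match a header containing "COLUMN 10" if one existed.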

In [157]:
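# save to output dir?
save = 1
for file in input_files:
    # skiprows=[0, 2] drops the first and third lines of the file,
    # so the second line supplies the column names
    diary_file = pd.read_csv(os.path.join(input_dir, file), skiprows=[0, 2])
    diary_file = preprocess_frame(diary_file, 'Finished', ["Start Date", "Participant number:"])
    diary_file = rem_dat_no_ints(diary_file)
    if save:
        # splitext drops the extension whatever its length (file[:-4] assumed ".csv")
        out_name = os.path.splitext(file)[0] + '_processed.csv'
        diary_file.to_csv(os.path.join(output_dir, out_name), index=False)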
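
The skiprows argument in the read_csv call is easy to misread, so here is a small sketch of its effect on an in-memory file. The three-line header layout and all values are invented for illustration. Check your own export before relying on the same row indices.

```python
import io
import pandas as pd

# Hypothetical export: line 0 = short IDs, line 1 = readable labels,
# line 2 = extra metadata, then the data rows.
raw = io.StringIO(
    "V1,V2\n"
    "Participant number:,Finished\n"
    "meta1,meta2\n"
    "7,True\n"
    "8,False\n"
)

# Lines 0 and 2 are skipped; the first remaining line becomes the header.
df = pd.read_csv(raw, skiprows=[0, 2])

# Keep only completed records, as remove_incomplete_rows does.
complete = df[df["Finished"] == True]
print(df.columns.tolist(), len(complete))
```

Because the metadata line is skipped, the Finished column is parsed as boolean, which is what the comparison with True in remove_incomplete_rows relies on.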
