# Harvard MMCA data preparation codes
This notebook prepares the coded data, in particular the data_metric sheet, obtained from a literature review of collaboration analytics datasets conducted by the LTI lab, Harvard University.

The original file was modified for the purpose of this analysis.
* In the data_metric sheet, an additional last column containing '#' was added. This marker allows record-wise parsing of the sheet: records span multiple lines, which makes the sheet difficult to parse with libraries such as pandas.
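
The idea behind the '#' marker can be sketched on a toy example. The lines below are made up for illustration and `group_records` is a hypothetical helper, not the notebook's own function: physical lines are joined until one of them contains the marker, which closes the record.

```python
# Toy lines imitating the sheet; the second record spans two physical lines.
# The content is invented for illustration only.
lines = [
    'id;data;end\n',
    '1;EDA;#\n',
    '2;"VI) EDA\n',
    'VII) ECG";#\n',
]

def group_records(lines, sentinel='#'):
    """Join physical lines until a line containing the sentinel closes the record."""
    records, buffer = [], ''
    for line in lines:
        buffer += line
        if sentinel in line:
            records.append(buffer)
            buffer = ''
    return records

records = group_records(lines[1:])  # skip the header line
print(len(records))
print(repr(records[1]))
```

One caveat of this approach is that a '#' appearing inside a cell would close the record too early, so the marker has to be a character that the data itself never uses.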

In [28]:
import pandas as pd
import pprint as pp

In [13]:
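def getPaperYear(paper_id,year_df=year):
    """
    This function takes a paper_id as input and returns the year in which the paper was published.

    Note: the default `year` must be a DataFrame with 'paper_id_new' and 'year' columns
    that already exists when this cell runs (its loading is not shown in this notebook).
    """
    return year_df.loc[year_df['paper_id_new']==paper_id,'year'].to_list()[0]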


def getRecord(lines,index):
    """
    Function to parse the metric sheet. Proper execution of this function requires adding '#' in 
    the last column of the sheet.
    
    params:
        
        lines: lines read from the csv file.
        index: line number where parsing starts for the new record
        
    returns:
    
        record    : parsed record as a single string (embedded newlines are kept)
        line_index: line number where the next record starts
        
    """
    line_index = index
    record = ''
    # append lines until the stop symbol '#' occurs
    while line_index < len(lines):
        record += lines[line_index]
        if '#' in lines[line_index]:
            break
        line_index += 1
    return record,line_index+1
            

In [65]:
# Loading the csv file (the context manager closes it after reading)
with open('../metric.csv') as file:
    lines = file.readlines()

# Parsing data files record-wise
records = []
current_record_index = 1 # skipping first line (line number:0) which contains column names

while current_record_index < len(lines):
    record,next_record_index = getRecord(lines,current_record_index)
    
    if record != '': # to address cases when empty line occurs
        current_record_index = next_record_index
        records.append(record)
    else:
        current_record_index += 1
    
print('Total {} records are parsed'.format(len(records)))

Total 155 records are parsed


In [114]:
# sample of record
print('Raw record:\n\n',getRecord(lines,30))

Raw record:

 ('3;4;edwin;"VI) EDA\nVII) ECG";"VI) Physiological\nVII) Physiological";"VI) Varioport 16-bit digital skin conductance amplifier\nVII) modified Lead II configuration";"VI) Varioport-B portable recorder system\nVII) NS";1) physiological linkage;1) Physiological;1) EDA;1) individual;1) VI&VII;1) correlation;"A) behavioral Involvement\nB) empathy\nC) negative Feelings\nD) perceived comprehension";"A) Social Presence in Gaming Questionnaire\nB) Social Presence in Gaming Questionnaire\nC) Social Presence in Gaming Questionnaire\nD) Social Presence Inventory Questionnaire";"A) cognitive engagement\nB) affective\nC) affective\nD) learning";"A) process\nB) process\nC) process\nD) product";"1-A:regression: sig\n1-B:regression: sig\n1-C:regression: sig\n1-D:regression: sig";#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;\n', 50)


The record still contains newline characters. We will now process the record to make it Python-ready.

In [32]:
def parseRecord(text):
    """
    This function pre-processes the record: it drops everything from the '#' marker onwards,
    splits the record on ';' into columns, and replaces newlines inside a column with ';'.
    """
    processed = text.split('#')[0]
    items = processed.split(';')
    items_processed = [item.replace('\n',';') for item in items]
    return items_processed

In [115]:
print('Processed record:\n\n')
pp.pprint(parseRecord(records[2]))

Processed record:


['3',
 '4',
 'edwin',
 '"VI) EDA;VII) ECG"',
 '"VI) Physiological;VII) Physiological"',
 '"VI) Varioport 16-bit digital skin conductance amplifier;VII) modified Lead '
 'II configuration"',
 '"VI) Varioport-B portable recorder system;VII) NS"',
 '1) physiological linkage',
 '1) Physiological',
 '1) EDA',
 '1) individual',
 '1) VI&VII',
 '1) correlation',
 '"A) behavioral Involvement;B) empathy;C) negative Feelings;D) perceived '
 'comprehension"',
 '"A) Social Presence in Gaming Questionnaire;B) Social Presence in Gaming '
 'Questionnaire;C) Social Presence in Gaming Questionnaire;D) Social Presence '
 'Inventory Questionnaire"',
 '"A) cognitive engagement;B) affective;C) affective;D) learning"',
 '"A) process;B) process;C) process;D) product"',
 '"1-A:regression: sig;1-B:regression: sig;1-C:regression: sig;1-D:regression: '
 'sig"',
 '']
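
As an alternative to the marker column, Python's standard `csv` module can read quoted cells that contain newlines. The sketch below is not the method used in this notebook. It assumes that every multi-line cell is wrapped in double quotes, as in the sample record above, which may not hold for the whole sheet. The `raw` string is a shortened, illustrative version of that record.

```python
import csv
import io

# Shortened, illustrative version of the sample record; multi-line cells are quoted.
raw = (
    'paper_id;data;data_group;end\n'
    '3;"VI) EDA\nVII) ECG";"VI) Physiological\nVII) Physiological";#\n'
)

# newline='' leaves line endings untouched so the csv module can handle them itself.
reader = csv.reader(io.StringIO(raw, newline=''), delimiter=';', quotechar='"')
rows = list(reader)

header, first_record = rows[0], rows[1]
print(len(rows))
print(first_record[1].split('\n'))
```

With this approach the quotes are removed by the reader and each cell keeps its internal newlines, so the items of a cell can be separated with `split('\n')` instead of being re-joined with ';'.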


## Preparing a CSV file for processing
Now, we will prepare a CSV file which will contain processed records.

In [48]:
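def getId(text):
    """
    Returns the numeric paper id: the text itself if it is all digits,
    otherwise the part before the first space.
    """
    if not text.isdigit():
        paper_id = text.split(' ')[0]
    else:
        paper_id = text
    return int(paper_id)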

In [82]:
# Preparing a pandas data frame
metric_df = pd.DataFrame(columns = ['paper_id','year','data','data_group',
                                   'metric','metric_larger','metric_smaller',
                                  'outcome_smaller','outcome_larger','outcome_type','relationship'])

for record in records:
    col_wise_data = parseRecord(record)
    temp_df = {'paper_id':col_wise_data[0],
              'year':getPaperYear(getId(col_wise_data[0])),'data':col_wise_data[3],'data_group':col_wise_data[4],
              'metric':col_wise_data[7],'metric_larger':col_wise_data[8],
              'metric_smaller':col_wise_data[9],
              'outcome_smaller':col_wise_data[13],'outcome_larger':col_wise_data[15],
              'outcome_type':col_wise_data[16],
              'relationship':col_wise_data[17]}
    temp_df = pd.DataFrame(temp_df,index=[0])
    metric_df = pd.concat([metric_df,temp_df],axis=0)
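
The mapping from column positions to field names used in the loop above can also be kept in one place. The following is a minimal sketch using the parsed sample record shown earlier; `COLUMN_INDEX` and `record_to_row` are hypothetical names, and the `year` field is left out because it depends on the year lookup table.

```python
# Column positions copied from the DataFrame-building loop above.
COLUMN_INDEX = {
    'paper_id': 0, 'data': 3, 'data_group': 4,
    'metric': 7, 'metric_larger': 8, 'metric_smaller': 9,
    'outcome_smaller': 13, 'outcome_larger': 15,
    'outcome_type': 16, 'relationship': 17,
}

def record_to_row(items, column_index=COLUMN_INDEX):
    """Pick the named columns out of one parsed record (a list of strings)."""
    return {name: items[pos] for name, pos in column_index.items()}

# Parsed sample record, as printed by parseRecord above.
sample_items = [
    '3', '4', 'edwin',
    '"VI) EDA;VII) ECG"',
    '"VI) Physiological;VII) Physiological"',
    '"VI) Varioport 16-bit digital skin conductance amplifier;VII) modified Lead II configuration"',
    '"VI) Varioport-B portable recorder system;VII) NS"',
    '1) physiological linkage', '1) Physiological', '1) EDA',
    '1) individual', '1) VI&VII', '1) correlation',
    '"A) behavioral Involvement;B) empathy;C) negative Feelings;D) perceived comprehension"',
    '"A) Social Presence in Gaming Questionnaire;B) Social Presence in Gaming Questionnaire;'
    'C) Social Presence in Gaming Questionnaire;D) Social Presence Inventory Questionnaire"',
    '"A) cognitive engagement;B) affective;C) affective;D) learning"',
    '"A) process;B) process;C) process;D) product"',
    '"1-A:regression: sig;1-B:regression: sig;1-C:regression: sig;1-D:regression: sig"',
    '',
]

rows = [record_to_row(sample_items)]
# A list of such dictionaries can be passed to pd.DataFrame(rows) in a single call.
print(rows[0]['metric'])
```

Collecting all rows in a list and building the frame once avoids calling `pd.concat` inside the loop, which copies the growing frame on every iteration.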
        

In [83]:
metric_df.reset_index(inplace=True)

In [84]:
metric_df.shape

(155, 12)

In [85]:
metric_df.to_csv('metric_dataframe.csv',index=False)

## Adding code to extract information in a formatted manner
Now, we will add functions to extract information from the dataframe.

In [112]:
def getItemsInDict(text):
    """
    This function takes a string containing ';'-separated items in the format 'index) label'
    and extracts them into a dictionary that maps each index to its label.
     sample: VI) EDA
    
    params:
    
        text: string
        
    returns:
    
        dictionary with extracted information
    
    """
    # remove additional quotes
    text_no_quotes = text.replace('\"','')
    text_items = text_no_quotes.split(';')
    
    pre_text_items = [item for item in text_items if item != '' ]

    labels = [item.split(')')[1].strip() for item in pre_text_items]
    index = [item.split(')')[0].strip() for item in pre_text_items]
    
    pre_met = {}
    for ind,lab in zip(index,labels):
        pre_met[ind] = lab
    return pre_met


def getRelationshipData(data,metrics_org,outcome_smaller):
    """
    This function processes relationship codes and relates the metric and outcome indices to the actual
    metrics and outcomes.
    
    params:
    
        data  : relationship string, e.g. '1-A:regression: sig'
        
        metrics_org: dictionary of the actual metrics reported in the paper, keyed by metric index
        outcome_smaller: dictionary of the actual outcomes reported in the paper, keyed by outcome index
        
    returns:
    
        a list of (metric, outcome, method) tuples, one per metric-outcome pair.
    """
    text_no_quotes = data.replace('\"','')
    text_items = text_no_quotes.split(';')
    
    pre_text_items = [item for item in text_items if item != '' ]
    rel_tuples = []
    for rel in pre_text_items:
        parts = rel.split(':')
        rel_type = '' if len(parts) < 3 else parts[2]
        rel_method = parts[1]
        rel_parts = parts[0].split('-')
        metrics = rel_parts[0]
        outcomes = rel_parts[1]
        metrics = [item.strip() for item in metrics.split(',')]
        outcomes = [item.strip() for item in outcomes.split(',')]
        for metric in metrics:
            for outcome in outcomes:
                #print('adding {}=>{}'.format([metrics_org[metric]],outcome_smaller[outcome]))
                try:
                    rel_tuples.append((metrics_org[metric],outcome_smaller[outcome],rel_method))
                except KeyError:
                    # skip codes that refer to an unknown metric or outcome index
                    pass
    return rel_tuples   
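
To make the relationship format concrete, here is a self-contained sketch that decodes the codes of the sample record shown earlier (paper id 3). The helpers `parse_indexed_items` and `parse_relationships` are hypothetical re-implementations written for this example. They differ from the functions above in small ways: labels are split with `partition` so that a ')' inside a label is preserved, the method name is stripped of surrounding spaces, and unknown indices are skipped with an explicit membership test.

```python
def parse_indexed_items(text):
    """Turn 'A) x;B) y' into {'A': 'x', 'B': 'y'}."""
    items = [item for item in text.replace('"', '').split(';') if item.strip()]
    parsed = {}
    for item in items:
        key, _, label = item.partition(')')  # split on the first ')' only
        parsed[key.strip()] = label.strip()
    return parsed

def parse_relationships(text, metrics, outcomes):
    """Decode codes such as '1-A:regression: sig' into (metric, outcome, method) tuples."""
    triples = []
    codes = [code for code in text.replace('"', '').split(';') if code.strip()]
    for code in codes:
        parts = code.split(':')
        method = parts[1].strip() if len(parts) > 1 else ''
        metric_ids, _, outcome_ids = parts[0].partition('-')
        for metric_id in metric_ids.split(','):
            for outcome_id in outcome_ids.split(','):
                m, o = metric_id.strip(), outcome_id.strip()
                if m in metrics and o in outcomes:  # ignore unknown indices
                    triples.append((metrics[m], outcomes[o], method))
    return triples

# Cell contents taken from the parsed sample record (paper id 3).
metrics = parse_indexed_items('1) physiological linkage')
outcomes = parse_indexed_items(
    '"A) behavioral Involvement;B) empathy;C) negative Feelings;D) perceived comprehension"'
)
triples = parse_relationships(
    '"1-A:regression: sig;1-B:regression: sig;1-C:regression: sig;1-D:regression: sig"',
    metrics,
    outcomes,
)
for triple in triples:
    print(triple)
```

Each code has the shape `metric indices-outcome indices:method: result`, so '1-A:regression: sig' links metric 1 to outcome A through a regression. The result part ('sig') is not used in this sketch.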
 

In [116]:
# Showing one record (row position 3 of metric_df) in processed format

paper_record = metric_df.iloc[3,:].to_dict()

year = paper_record['year']
metrics_org = getItemsInDict(paper_record['metric'])
metrics_larger = getItemsInDict(paper_record['metric_larger'])
metrics_smaller = getItemsInDict(paper_record['metric_smaller'])
outcome_larger = getItemsInDict(paper_record['outcome_larger'])
outcome_smaller = getItemsInDict(paper_record['outcome_smaller'])
relationship = getRelationshipData(paper_record['relationship'],metrics_org,outcome_smaller)
    
record_dict = {'year':year,
              'metrics_org':metrics_org,
              'metrics_larger':metrics_larger,
              'metrics_smaller':metrics_smaller,
              'outcome_larger':outcome_larger,
              'outcome_smaller':outcome_smaller,
              'relationship':relationship}
print('Final record:\n\n')
pp.pprint(record_dict)


Final record:


{'metrics_larger': {'1': 'Verbal', '2': 'Gaze', '3': 'Verbal'},
 'metrics_org': {'1': 'non-verbal speaking metrics',
                 '2': 'visual attention',
                 '3': 'verbal dominance and information metrics'},
 'metrics_smaller': {'1': 'Speech Participation',
                     '2': 'Visual Attention',
                     '3': 'Speech Participation'},
 'outcome_larger': {'A': 'interpersonal relationship / perception',
                    'B': 'interpersonal relationship / perception'},
 'outcome_smaller': {'A': 'perceived leadership',
                     'B': 'perceived contribution'},
 'relationship': [('verbal dominance and information metrics',
                   'perceived leadership',
                   ' correlation '),
                  ('verbal dominance and information metrics',
                   'perceived contribution',
                   ' correlation '),
                  ('non-verbal speaking metrics',
                   'perceived lea