# About data_log_data_to_sequence

This notebook converts raw log data into sequences suitable for mining, using a predetermined set of rules that translate log actions into sequence actions.

In [1]:
%load_ext autoreload
%autoreload 1
%aimport utils_read_parsing
%aimport utils_sequence_parsing
from utils_read_parsing import *
from utils_sequence_parsing import converter, Sequence
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.precision', 2)

# Loading log data


We get all log files per student. They are stored in a dictionary like this:
```python
    log_files_per_sim = {'beers': {student1: [log_file_1.txt, log_file_2.txt], ...},
                         'capacitor': {student1: [log_file_1.txt, log_file_2.txt], ...}}
```
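
As a small illustration of that layout, the sketch below builds a toy dictionary with the same nesting and counts the log files per student. The student ids and file names are made up for the example, and `count_files_per_student` is a hypothetical helper; the real dictionary comes from `get_parsed_log_files_per_student_for_sim`.

```python
# Toy stand-in for log_files_per_sim; ids and file names are hypothetical.
toy_log_files_per_sim = {
    'beers': {'student1': ['log_b.txt', 'log_a.txt'], 'student2': ['log_c.txt']},
    'capacitor': {'student1': ['log_d.txt'], 'student2': []},
}

def count_files_per_student(files_per_sim):
    """Return {sim: {student: number of log files}} for the nested dictionary."""
    return {sim: {student: len(files) for student, files in per_student.items()}
            for sim, per_student in files_per_sim.items()}

file_counts = count_files_per_student(toy_log_files_per_sim)
print(file_counts['beers']['student1'])
```

A student with an empty file list still appears in the result with a count of 0, which can help spot students who have no logs for one of the sims.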

In [2]:
log_files_per_sim = {}
for sim in ['beers','capacitor']:
    log_files_per_sim[sim] = get_parsed_log_files_per_student_for_sim(sim)

The file Sarah_beers_log_files_per_student.txt has been unpickled and loaded
The file Sarah_capacitor_log_files_per_student.txt has been unpickled and loaded


# Loading rules

In [3]:
rules_beers = pd.read_csv('sequence_parsing_rules_beers.txt', sep='\t')
rules_capacitor = pd.read_csv('sequence_parsing_rules_capacitor.txt', sep='\t')
rules_capacitor_without_state = pd.read_csv('sequence_parsing_rules_capacitor_without_state.txt', sep='\t')

# Each rule row becomes (sequence action, {column: value}); empty cells are dropped.
def rules_to_tuples(rules):
    return [(rule['Sequence Action'], rule.drop('Sequence Action').dropna().to_dict()) for _, rule in rules.iterrows()]

rules_as_dict = {'beers': rules_to_tuples(rules_beers),
                 'capacitor': rules_to_tuples(rules_capacitor),
                 'capacitor_without_state': rules_to_tuples(rules_capacitor_without_state)}
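
To make the shape of `rules_as_dict` concrete, here is a minimal sketch of how a list of `(sequence action, conditions)` tuples could be matched against a log event. It is only an assumption about how a first-match translation might work, not the actual `converter` from `utils_sequence_parsing`. The function `match_first_rule`, the `Item` column, and the action labels are hypothetical; `Event`, `Time`, and the `'no_match_found'` marker are the names used in this notebook.

```python
# Hypothetical rules in the same (sequence_action, conditions) shape as rules_as_dict[sim].
# The 'Item' column and the action labels 'P', 'T', 'G' are made up for illustration.
toy_rules = [
    ('P', {'Event': 'pause'}),
    ('T', {'Event': 'pressed', 'Item': 'table'}),
    ('G', {'Event': 'pressed', 'Item': 'graph'}),
]

def match_first_rule(event, rules):
    """Return the action of the first rule whose conditions all match the event."""
    for action, conditions in rules:
        if all(event.get(column) == value for column, value in conditions.items()):
            return action
    # Same marker the parsing loop checks for before raising an error.
    return 'no_match_found'

toy_events = [
    {'Event': 'pressed', 'Item': 'graph', 'Time': 3.2},
    {'Event': 'pause', 'Item': None, 'Time': 14.0},
    {'Event': 'released', 'Item': 'graph', 'Time': 14.5},
]
toy_actions = [match_first_rule(event, toy_rules) for event in toy_events]
print(toy_actions)  # ['G', 'P', 'no_match_found']
```

Dropping empty cells when the rules are loaded means a rule only constrains the columns it actually fills in, so a rule with a single condition acts as a wildcard on every other column.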

# Parsing both sims for all students

In [12]:
df_all = get_df_all_factors()

In [13]:
ids = list(set(df_all['sid']))
N = len(ids)
print(N)

147


Note that students without state data in the capacitor sim are not skipped: their logs are parsed with the `capacitor_without_state` rules instead.

In [14]:
%%time
all_seqs = {}
for sim in ['capacitor','beers']:
    all_seqs[sim] = {}
    for student in ids:
        files = log_files_per_sim[sim][student]
        files.sort() #sorts log by date and time
        seq = Sequence([],sid=student,sim=sim, timecoords=[])
        for f in files:
            df = prep_parsing_data(f) #removes model events, adds pauses with threshold of 9s
            df = df[df['Event']!='dragged'] #remove drag events, keep dragStart and dragEnded
            if df.empty:
                continue
            else:
                df['Sequence Action'] = df.apply(converter, args=([rules_as_dict[sim]]), axis=1)
                if sim=='capacitor' and df['Charge'].isnull().values.any():
                    print("Student {0} has no state data so using special parser".format(student))
                    df['Sequence Action'] = df.apply(converter, args=([rules_as_dict[sim+'_without_state']]), axis=1)
                # every event must map to a sequence action; stop at the first file where one does not
                unmatched = df[df['Sequence Action']=='no_match_found']
                unparsed = df[df['Sequence Action'].isnull()]
                if not unmatched.empty or not unparsed.empty:
                    print("Lack of matches:\n{0}".format(unmatched))
                    print("NA events:\n{0}".format(unparsed.head(1)))
                    raise ValueError("some events were not parsed")
            seq.extend_seq(list(df['Sequence Action']))
            seq.extend_timecoords(list(df['Time']))
        all_seqs[sim][student] = seq

  df = pd.read_table(parsing_file, sep='\t')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Student 23836160 has no state data so using special parser
Student 41947147 has no state data so using special parser
Student 11612162 has no state data so using special parser
Student 64006159 has no state data so using special parser
Student 85915167 has no state data so using special parser
Student 11929166 has no state data so using special parser
Student 24511163 has no state data so using special parser
Student 27451164 has no state data so using special parser
Student 90447168 has no state data so using special parser
Student 15055169 has no state data so using special parser
Student 46792161 has no state data so using special parser
Student 24566161 has no state data so using special parser
Student 77047160 has no state data so using special parser
Student 82788161 has no state data so using special parser
Wall time: 4min 48s


## Export parsed sequences

In [22]:
import pickle
# build the output path portably instead of hard-coding a Windows separator
out_path = os.path.join(BIG_FOLDER, 'all_massaged_data', 'dict_by_sim_by_student_parsed_seqs.txt')
# the context manager closes the file even if pickling fails
with open(out_path, 'wb') as pickle_out_seqs:
    pickle.dump(all_seqs, pickle_out_seqs)
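
For reference, reading the exported file back could look like the sketch below. It uses a toy dictionary of plain lists and a temporary folder rather than `BIG_FOLDER`, so the names `toy_seqs` and `toy_path` are illustrative only.

```python
import os
import pickle
import tempfile

# Toy stand-in for all_seqs; the real values are Sequence objects, not lists.
toy_seqs = {'beers': {'student1': ['T', 'P', 'G']}, 'capacitor': {'student1': ['P']}}

# Write to a throwaway folder so the example does not touch real data.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy_parsed_seqs.txt')

with open(toy_path, 'wb') as f:
    pickle.dump(toy_seqs, f)

# Pickle files must be opened in binary mode for reading as well.
with open(toy_path, 'rb') as f:
    reloaded_seqs = pickle.load(f)

print(reloaded_seqs == toy_seqs)
```

When loading the real file, `utils_sequence_parsing` has to be importable, because pickle stores a reference to the `Sequence` class rather than the class definition itself.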