# manipulate_regonline_output

This notebook reads the RegOnline output into a pandas DataFrame and reshapes it so that each row contains one attendee together with that attendee's Doppler Primer session, Monday break-out session, and Tuesday break-out session.

In [73]:
import pandas as pd
import re

#### Read the RegOnline output into a pandas DataFrame

In [213]:
df = pd.read_excel('/Users/matt/projects/EPRV/data/AttendeeReportCrop.xls', encoding='utf-8')

In [214]:
df.columns

Index([u'AgendaItem', u'RegId', u'GroupId', u'FirstName', u'LastName', u'Company'], dtype='object')

In [215]:
# Optional: encode the unicode values to ASCII. This avoids display
# problems with non-ASCII characters in the IPython notebook, but note
# that the 'ignore' handler silently drops those characters.
df['FirstName'] = [el.encode('ascii', 'ignore') for el in df['FirstName'].values]

In [216]:
joao = df.loc[38, 'FirstName']
print(joao)
print(joao.encode('ascii', 'replace'))
print(joao.encode('ascii', 'ignore'))
print(joao.encode('ascii', 'xmlcharrefreplace'))
print(joao.encode('ascii', 'backslashreplace'))

Joo
Joo
Joo
Joo
Joo
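
All five lines above are identical because the previous cell had already stripped the non-ASCII characters from `FirstName`, so every handler receives a plain ASCII string. The sketch below shows how the handlers differ on an unencoded value. It assumes Python 3 and assumes the original name was `João`; that spelling is inferred from the variable name and is not shown in the source.

```python
# Assumption: the original, unencoded first name is 'João' (not shown in the source).
name = 'Jo\u00e3o'

handlers = ['replace', 'ignore', 'xmlcharrefreplace', 'backslashreplace']

# Encode the same string to ASCII with each error handler.
encoded = {handler: name.encode('ascii', handler) for handler in handlers}

for handler in handlers:
    print(handler, encoded[handler])
# replace            -> b'Jo?o'
# ignore             -> b'Joo'
# xmlcharrefreplace  -> b'Jo&#227;o'
# backslashreplace   -> b'Jo\\xe3o'
```

Only `'ignore'` loses the character without leaving a trace, which is worth keeping in mind if the names are later printed on badges or lists.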


In [217]:
df.loc[36:37]

Unnamed: 0,AgendaItem,RegId,GroupId,FirstName,LastName,Company
36,Doppler Primer: Instrumentation Challenges,79809251,79809251,Jason,Eastman,CfA
37,Doppler Primer: Not Attending,79200819,79200819,Michael,Endl,McDonald Observatory / University of Texas


#### Extract the Sunday Sessions

RegOnline outputs one row per attendee per agenda item, so each person appears several times and the rows differ only in `AgendaItem`. There is an `AgendaItem` for every session on every day. In this section, we extract the sessions happening on Sunday, which are all prefixed by "Doppler Primer: ".

In [197]:
sundf = df[df['AgendaItem'].str.contains('Doppler Primer:')].copy()
len(sundf)

110

Let's create two new columns in our DataFrame: `Primer` and `PrimerID`. The `Primer` column will contain the name of the Doppler Primer session (minus the `Doppler Primer: ` prefix), and `PrimerID` will be an integer session identifier that will later be used in plotting.

In [186]:
sundf['PrimerID'] = 0

In [187]:
sundf['Primer'] = [re.search(r'(.*):\s(.*)$', item).group(2) for item in sundf['AgendaItem']]

In [188]:
sundf[['AgendaItem', 'Primer']].head(3)

Unnamed: 0,AgendaItem,Primer
0,Doppler Primer: Instrumentation Challenges,Instrumentation Challenges
1,Doppler Primer: Doppler code,Doppler code
2,Doppler Primer: Spot Modeling,Spot Modeling
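
As an alternative to the list comprehension with `re.search`, the same prefix stripping can be sketched with the vectorized `Series.str.extract`. This is an illustration on made-up rows rather than the notebook's own method. The pattern here captures everything after the *first* colon, whereas the greedy `(.*):\s(.*)$` used above captures what follows the *last* colon-plus-whitespace; the two agree as long as the session name contains no colon.

```python
import pandas as pd

# Toy agenda items mirroring the format shown in the table above.
items = pd.Series(['Doppler Primer: Instrumentation Challenges',
                   'Doppler Primer: Doppler code',
                   'Doppler Primer: Spot Modeling'])

# Capture everything after the first colon and any whitespace that follows it.
# expand=False returns a Series instead of a one-column DataFrame.
names = items.str.extract(r'^[^:]*:\s*(.*)$', expand=False)
print(names.tolist())
```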


In [189]:
sundf['Primer'].unique()[0]

u'Instrumentation Challenges'

In [191]:
dopID = 0
for agItem in sundf['Primer'].unique():
    sundf.loc[sundf['Primer'] == agItem, 'PrimerID'] = dopID
    dopID += 1

In [194]:
sundf[['AgendaItem', 'Primer', 'PrimerID']].head(4)

Unnamed: 0,AgendaItem,Primer,PrimerID
0,Doppler Primer: Instrumentation Challenges,Instrumentation Challenges,0
1,Doppler Primer: Doppler code,Doppler code,1
2,Doppler Primer: Spot Modeling,Spot Modeling,2
3,Doppler Primer: Spot Modeling,Spot Modeling,2
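
The loop above numbers the sessions in order of first appearance. A minimal sketch of the same idea using `pd.factorize`, shown on toy data that mirrors the four rows in the table:

```python
import pandas as pd

toy = pd.DataFrame({'Primer': ['Instrumentation Challenges',
                               'Doppler code',
                               'Spot Modeling',
                               'Spot Modeling']})

# factorize returns integer codes (in order of first appearance by default)
# and the array of unique values that the codes index into.
codes, uniques = pd.factorize(toy['Primer'])
toy['PrimerID'] = codes

print(toy)
print(list(uniques))
```

Here `uniques[k]` is the session name for ID `k`, which can be handy later for labelling plots.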


#### Extract the Monday Sessions

Now we do the same for the Monday sessions, which are prefixed by "Monday Break-out: ".

In [208]:
mondf = df[df['AgendaItem'].str.contains('Monday Break-out:')].copy()
len(mondf)

120

In [209]:
mondf['MonID'] = 0

mondf['Monday'] = [re.search(r'(.*):\s(.*)$', item).group(2) for item in mondf['AgendaItem']]

mondf['Monday'].unique()

monID = 0
for agItem in mondf['Monday'].unique():
    mondf.loc[mondf['Monday'] == agItem, 'MonID'] = monID
    monID += 1

In [210]:
mondf[['AgendaItem', 'Monday', 'MonID']].head(4)

Unnamed: 0,AgendaItem,Monday,MonID
110,Monday Break-out: Fiber Optic Scrambling,Fiber Optic Scrambling,0
111,Monday Break-out: Not attending,Not attending,1
112,Monday Break-out: Fiber Optic Scrambling,Fiber Optic Scrambling,0
113,Monday Break-out: Telluric Contamination,Telluric Contamination,2


#### Extract the Tuesday Sessions

In [206]:
tuedf = df[df['AgendaItem'].str.contains('Tuesday Break-out:')].copy()
len(tuedf)

120

In [211]:
tuedf['TueID'] = 0

tuedf['Tuesday'] = [re.search(r'(.*):\s(.*)$', item).group(2) for item in tuedf['AgendaItem']]

tuedf['Tuesday'].unique()

tuesID = 0
for agItem in tuedf['Tuesday'].unique():
    tuedf.loc[tuedf['Tuesday'] == agItem, 'TueID'] = tuesID
    tuesID += 1

In [212]:
tuedf[['AgendaItem', 'Tuesday', 'TueID']].head(4)

Unnamed: 0,AgendaItem,Tuesday,TueID
230,Tuesday Break-out: Statistical techniques,Statistical techniques,0
231,Tuesday Break-out: Statistical techniques,Statistical techniques,0
232,Tuesday Break-out: Detection Threshold Criteria,Detection Threshold Criteria,1
233,Tuesday Break-out: Detection Threshold Criteria,Detection Threshold Criteria,1
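
The Sunday, Monday, and Tuesday extractions follow the same three steps: filter on a prefix, strip the prefix, and number the sessions. The sketch below folds them into one function. `extract_sessions` is a hypothetical helper that does not appear in the original notebook, and it matches with `str.startswith` rather than `str.contains`, which assumes the prefix always sits at the start of `AgendaItem`.

```python
import pandas as pd

def extract_sessions(df, prefix, name_col, id_col):
    """Hypothetical helper: select rows whose AgendaItem starts with `prefix`,
    store the session name in `name_col` and an integer ID in `id_col`."""
    sub = df[df['AgendaItem'].str.startswith(prefix)].copy()
    # Drop the prefix, then any whitespace left after the colon.
    sub[name_col] = sub['AgendaItem'].str[len(prefix):].str.strip()
    # IDs are assigned in order of first appearance, as in the loops above.
    sub[id_col] = pd.factorize(sub[name_col])[0]
    return sub

# Toy data in the same shape as the RegOnline output.
toy = pd.DataFrame({
    'AgendaItem': ['Doppler Primer: Spot Modeling',
                   'Monday Break-out: Fiber Optic Scrambling',
                   'Monday Break-out: Not attending',
                   'Monday Break-out: Fiber Optic Scrambling'],
    'RegId': [1, 1, 2, 3],
})

mon = extract_sessions(toy, 'Monday Break-out:', 'Monday', 'MonID')
print(mon[['AgendaItem', 'Monday', 'MonID']])
```

With the real data, the three calls would use the prefixes `'Doppler Primer:'`, `'Monday Break-out:'`, and `'Tuesday Break-out:'`.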


#### Combine the DataFrames

In [225]:
fulldf = df[['RegId', 'GroupId', 'FirstName', 'LastName', 'Company']]

In [226]:
print(len(fulldf))
fulldf = fulldf.drop_duplicates()
print(len(fulldf))

350
120
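
The notebook stops after de-duplicating the attendees. One way the per-day frames could then be attached to reach the one-row-per-attendee layout described at the top is a sequence of left merges on `RegId`. This is a sketch on toy stand-ins, not the author's implementation; it assumes `RegId` uniquely identifies an attendee and that each attendee has at most one row per day.

```python
import pandas as pd

# Toy stand-ins for the de-duplicated attendee table and the per-day tables.
fulldf = pd.DataFrame({'RegId': [1, 2], 'FirstName': ['Ann', 'Bo']})
sundf = pd.DataFrame({'RegId': [1, 2],
                      'Primer': ['Spot Modeling', 'Doppler code'],
                      'PrimerID': [0, 1]})
mondf = pd.DataFrame({'RegId': [1, 2],
                      'Monday': ['Fiber Optic Scrambling', 'Not attending'],
                      'MonID': [0, 1]})
# Attendee 2 has no Tuesday entry in this toy example.
tuedf = pd.DataFrame({'RegId': [1],
                      'Tuesday': ['Statistical techniques'],
                      'TueID': [0]})

# Left merges keep every attendee; a missing session shows up as NaN.
combined = fulldf
for part in (sundf, mondf, tuedf):
    combined = combined.merge(part, on='RegId', how='left')

print(combined)
```

With the real frames, each per-day table would first be narrowed to `RegId` plus its two new columns (for example `sundf[['RegId', 'Primer', 'PrimerID']]`) so that the shared name and company columns are not duplicated. Because there are 110 Sunday rows but 120 attendees, some `Primer` values would be missing after a left merge.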
