# Strategy

There are quite a few tools for generating PDFs. The most popular is [reportlab](https://pypi.python.org/pypi/reportlab) (people seem to recommend reportlab's [platypus](http://www.reportlab.com/apis/reportlab/2.4/platypus.html) for "simple" PDF [generation](https://www.reportlab.com/docs/platypus-example.py)).

A [Markdown](http://daringfireball.net/projects/markdown/syntax) document (see [CommonMark](http://commonmark.org/)) is a great intermediate format, so we could make one of those and then convert it to PDF. If that's the desired route, we'll probably use [python-markdown2](https://github.com/trentm/python-markdown2) to generate HTML and [xhtml2pdf](https://github.com/chrisglass/xhtml2pdf) to generate the PDF (similar to what is [done](https://omz-forums.appspot.com/pythonista/post/6427727661891584) [here](https://gist.github.com/SpotlightKid/0efb4d07f28af1c8fc1b)). xhtml2pdf does depend on reportlab, which is a bit of a downside, but not too bad. Another option is to use [pdfkit](https://pypi.python.org/pypi/pdfkit), which wraps [wkhtmltopdf](http://wkhtmltopdf.org/).

Similarly, I thought about [Sphinx](http://sphinx-doc.org/) with [reStructuredText](http://docutils.sourceforge.net/rst.html), but that needs a working LaTeX environment to produce PDFs.

A previous version of this script directly used reportlab/platypus. 

At the moment, I think the only really important output is the HTML/CSS which gets converted to PDF. So, instead of going through Markdown or reStructuredText, I'm just going to spit out my own HTML/CSS. I'll then feed it through one of the HTML/CSS --> PDF conversion chains.
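
As a rough sketch of the last step in that chain, assuming pdfkit (with wkhtmltopdf installed behind it) ends up being the converter. `html_to_pdf` is a made-up helper name, and the notebook below stops at writing the HTML, so treat this as one possible way to finish the job rather than what the script does:

```python
def html_to_pdf(html_path, pdf_path):
    """Try to convert an HTML file to PDF; return True on success, False otherwise."""
    try:
        import pdfkit  # assumption: pdfkit is the chosen converter
    except ImportError:
        return False  # pdfkit is not installed
    try:
        # pdfkit.from_file shells out to the wkhtmltopdf binary
        return bool(pdfkit.from_file(html_path, pdf_path))
    except (IOError, OSError):
        # wkhtmltopdf is missing, the input is unreadable, or the conversion failed
        return False
```

Returning `False` instead of raising keeps the HTML output usable even on a machine without a converter installed.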

# Parsing the Excel

There are a few good Excel parsers these days, but [pandas](http://pandas.pydata.org/) is nice and standard. It also understands XLS files in addition to XLSX, which is important since we don't want to make people convert to XLSX by hand in order to use a different parser.

In this case, we need to know the structure of the XLS documents. There should be two sheets. The first (called "RawData") contains columns like "Path", "CourseCode", etc., and then "Question_1", "Question_2", etc. The second (called "QuestionMapper") contains "Question 1" etc. in column A and the text of each question in column B.

The first row is a header in both cases.

I have no idea how fragile this structure is, so I'll explicitly refer to the sheets by name. That way, if someone changes the underlying format, there's a decent chance this script will fail loudly instead of quietly producing garbage.

In [14]:
import pandas as pd
import numpy as np
import os
xl_filename = 'data/Biophysics-(Spring-2015).xlsx'
pdf_filename = os.path.splitext(xl_filename)[0] + '.pdf'
html_filename = os.path.splitext(xl_filename)[0] + '.html'
answers = pd.io.excel.read_excel(xl_filename,sheetname='RawData')
questionmap = pd.io.excel.read_excel(xl_filename,sheetname='QuestionMapper')
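
Referring to sheets by name only catches a renamed sheet. A minimal sketch of the same idea applied to column names, assuming the layout described above. `missing_columns` and the `required_raw` list are hypothetical, and the example list stands in for `list(answers.columns)` so the sketch runs without the spreadsheet:

```python
def missing_columns(actual, required):
    """Return the required column names that are absent, in the order given."""
    present = set(actual)
    return [name for name in required if name not in present]

# Assumed layout, taken from the description above; adjust if the export changes.
required_raw = ['Path', 'CourseCode', 'CourseTitle', 'InstructorName', 'Enrollments']

# Stand-in for list(answers.columns); 'Enrollments' is left out on purpose.
example_columns = ['Path', 'CourseCode', 'CourseTitle', 'UniqueID',
                   'InstructorName', 'Question_1', 'Question_2']
missing = missing_columns(example_columns, required_raw)
```

Raising an error whenever `missing` is non-empty would make a changed export fail right at load time, before any HTML is written.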

We want a per-student list of questions and answers. My first thought is to stick everything into a dictionary. We want to make sure to return the results in the correct order, so we could use an ordered dict. I think it's easier just to keep an ordered list of questions.

In [15]:
questions = questionmap["Question"].values

In [16]:
print questions

[u'What were the most positive features of this course?'
 u'What is your assessment of the design, materials and assignments in this course?'
 u'How could this course be improved next time it is offered?'
 u"How well were\xa0[InstructorName] 's objectives (stated or implied) fulfilled?"
 u"What were [InstructorName]'s strongest contributions to this course?"
 u"How could [InstructorName]'s teaching be improved?"
 u'What influence did\xa0[InstructorName] have on your interest in this subject?'
 u'In the space below, please provide a statement about the quality of your performance in this course.'
 u'Students are expected to sign these forms (by adding your name), and should know that unsigned forms are unlikely to be taken seriously by evaluating committees.']


About the below code:

When we iterate through the rows, `idx` is the number of the row, and `qd` comes to us as the "question dictionary" where row 1 is expected to name the columns, and we can then look up entries by name. For example, column A happens to be "Column" and column B is "Question", so asking for `qd['Question']` gets the thing in column B.

`qm` is then my "question map": it maps a column name like "Question_1" to its text, e.g. "What were the most positive features of this course?"


In [17]:
#qm will map the column names (Question_1) to question text
qm = {}
for (idx,qd) in questionmap.iterrows():
    qn = qd['Column'].replace(' ','_')
    qt = qd['Question']
    qm[qn] = qt

In [18]:
print qm

{u'Question_3': u'How could this course be improved next time it is offered?', u'Question_2': u'What is your assessment of the design, materials and assignments in this course?', u'Question_1': u'What were the most positive features of this course?', u'Question_7': u'What influence did\xa0[InstructorName] have on your interest in this subject?', u'Question_6': u"How could [InstructorName]'s teaching be improved?", u'Question_5': u"What were [InstructorName]'s strongest contributions to this course?", u'Question_4': u"How well were\xa0[InstructorName] 's objectives (stated or implied) fulfilled?", u'Question_9': u'Students are expected to sign these forms (by adding your name), and should know that unsigned forms are unlikely to be taken seriously by evaluating committees.', u'Question_8': u'In the space below, please provide a statement about the quality of your performance in this course.'}
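
The printed dict comes back in arbitrary order, which is why the ordered `questions` list is kept separately. For comparison, a small sketch of the ordered-dict alternative mentioned earlier, using a few hand-typed rows in place of the QuestionMapper sheet (`rows` and `ordered_qm` are illustrative names, not part of the script):

```python
from collections import OrderedDict

# Stand-in rows for the QuestionMapper sheet: (Column, Question) pairs in sheet order.
rows = [
    ('Question 1', 'What were the most positive features of this course?'),
    ('Question 2', 'What is your assessment of the design, materials and assignments in this course?'),
    ('Question 3', 'How could this course be improved next time it is offered?'),
]

# Insertion order is preserved, so the keys come back in sheet order.
ordered_qm = OrderedDict()
for column, question in rows:
    ordered_qm[column.replace(' ', '_')] = question

ordered_keys = list(ordered_qm)
```

With this structure, one object carries both the mapping and the order, at the cost of an extra import.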


Now let's grab the data that should be common to all rows.

In [19]:
path = answers.Path[0]
course_code = answers.CourseCode[0]
course_title = answers.CourseTitle[0]
instructor_name = answers.InstructorName[0]
enrollments = answers.Enrollments[0]
# We know we're not extracting the following from each row, so keep quiet about it later.
knownskips = ['Path','CourseCode','CourseTitle','UniqueID','InstructorName','Enrollments']

And now let's slurp up the data per student.

In [20]:
a = {}
for (idx,student) in answers.iterrows():
    a[idx] = {}
    for colname in answers.columns:
        col_name = colname.replace(' ','_')
        if col_name in qm:
            #print "Looking up",col_name
            a[idx][qm[col_name]] = student[colname]
        else:
            if colname not in knownskips:
                print "Could not find",colname

Now we're ready to stamp out the text, believe it or not. The only cute thing is that `pandas` uses nan ("not a number") to represent missing data. We'll use `numpy` (imported above as `np`) to test for nan, and turn it into "No answer given."

In [21]:
a[0][questions[0]]

u'This class integrated physics and biology in a fascinating way.'

In [22]:
def is_nan(x):
    try: return np.isnan(x)
    except TypeError: return False #np.isnan raises TypeError on strings, i.e. on real answers
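
For illustration, the same missing-answer substitution can be sketched without numpy, assuming a missing cell arrives as a float NaN (which is how pandas hands it over here). `answer_or_default` is a hypothetical helper, not something the script defines:

```python
import math

def answer_or_default(value, default='No answer given.'):
    """Return `default` for a missing (NaN) cell, otherwise the value unchanged."""
    # Only floats can be NaN; strings (real answers) pass straight through.
    if isinstance(value, float) and math.isnan(value):
        return default
    return value
```

Checking the type first avoids the try/except, since `math.isnan` is never called on a string.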


In [86]:
html = '''
<html>
<head>
<style>
h1 {{
    text-align: center;
}}
p.name {{    
    font-weight: bold;
    font-size: large;
    margin-top: 2em;
    margin-bottom: 0em;
}}
p.question {{
    font-weight: bold;
    margin-top: 0em;
    margin-bottom: 0em;
}}
p.answer {{
    text-align: justify;
    margin-top: 0em;
    margin-bottom: 0em;
}}
</style>
</head>
<body>
<h1>{title}</h1>
<h1>{code}</h1>
<h1>{instructor}</h1>
<h1>Answers from {a} of {b} enrolled students</h1>
'''.format(
    title=course_title,code=course_code,instructor=instructor_name,
    a=len(a),b=enrollments
)

In [87]:
for idx in sorted(a):
    html += '''<div class="response">
    <p class="name">Student {i} ({n})</p>
    '''.format(
            i=idx+1, n=a[idx][questions[-1]]
        )
    for question in questions[:-1]:
        question_cor_name = question.replace("[InstructorName]", instructor_name)
        question_cor_name = question_cor_name.replace(u'\xa0', u' ') #swap non-breaking spaces from the spreadsheet for plain spaces
        answer = a[idx][question]
        if is_nan(answer):
            answer = 'No answer given.'
        html += '''<p class="question">{q}</p>
        <p class="answer">{a}</p>
        '''.format(q=question_cor_name,a=answer)
    html += '</div>'
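
One thing the loop above does not do is escape the answers, so a student who types `<` or `&` could garble the page. A hedged sketch of how a single answer might be escaped before it goes into the template, assuming Python 3's `html.escape` (on Python 2, `cgi.escape` is the counterpart). `render_answer` is a made-up name:

```python
from html import escape  # Python 3; on Python 2, cgi.escape is the counterpart

def render_answer(answer):
    """Return one answer paragraph with HTML special characters escaped."""
    return '<p class="answer">{a}</p>'.format(a=escape(answer))

rendered = render_answer('Great course <3 & more labs, please')
```

The same treatment would apply to the student name and the question text.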

In [88]:
with open(html_filename,'w') as f:
    f.write(html + '</body>\n</html>\n') #close the tags the header opened