# Scraping survey questions

When interpreting the results of a survey, it helps to know what the options were. The options in each question can usually be inferred from the survey responses (e.g., the set of unique responses), but that misses any option nobody picked. A better approach is to construct a representation of the survey independent of the responses. In this notebook, I'll walk through extracting the necessary data from the survey html page. I'll use `requests` to download the survey page, and `BeautifulSoup` to search the page for question elements. The result is a csv data file with one row per question.

## Download the survey html with requests

We need the survey url to scrape the questions. Google survey urls can be a bit unruly, so I've stored the url for the madpy survey in a text file, "madpy-survey-url.txt". Then I just `get()` the survey page with requests.

In [6]:
import requests

# Read the survey url from disk; strip() drops the trailing newline.
with open('madpy-survey-url.txt') as f:
    survey_url = f.read().strip()

response = requests.get(survey_url)
response.content
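
If the download fails, `response.content` will quietly hold an error page instead of the survey. Here is a minimal sketch of a more defensive version of the same step. The helper names `read_survey_url` and `fetch_survey_html` and the 10 second timeout are my own choices for illustration, not part of the original notebook.

```python
from pathlib import Path

import requests


def read_survey_url(path):
    """Read a url stored on a single line, dropping the trailing newline."""
    return Path(path).read_text().strip()


def fetch_survey_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors.

    The timeout value is an arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content
```

With these in place, the cell above would become `fetch_survey_html(read_survey_url('madpy-survey-url.txt'))`.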



## Extracting html elements with BeautifulSoup

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html5lib')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <link href="https://ssl.gstatic.com/docs/spreadsheets/forms/favicon_qp2.png" rel="shortcut icon" sizes="16x16"/>
  <title>
   MadPy
  </title>
  <link data-id="_cl" href="https://www.gstatic.com/_/freebird/_/ss/k=freebird.v.111qn70h46w08.L.W.O/d=1/rs=AMjVe6gcfX-_2qpMPPY6AF2rEy9kUTD1og" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Roboto:300,400,400i,500,700&amp;subset=latin,vietnamese,latin-ext,cyrillic,greek,cyrillic-ext,greek-ext" rel="stylesheet" type="text/css"/>
  <link href="https://fonts.googleapis.com/css?family=Product+Sans&amp;subset=latin,vietnamese,latin-ext,cyrillic,greek,cyrillic-ext,greek-ext" rel="stylesheet" type="text/css"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <script type="text/javascript">
   window.WIZ_global_data = {w2btAe: '%.@.null,null,\x22\x22,true]\n'};
  </script>
  <style id="WTVccd">
   .freebirdThemedTab .exportTab .freebirdThemedBadge {background-color: rgb

We've got the survey page, but it's no fun digging through that to try to find the elements and CSS classes we need. It would be better to just open the page in Chrome or Firefox and use the Developer Tools to locate the elements we want. Here I'm saving the survey html to a file, "madpy-survey.html", and opening it in the default web browser.

In [9]:
with open('madpy-survey.html', 'wb') as f:
    f.write(response.content)

import webbrowser
import os
webbrowser.open('file://' + os.path.abspath('madpy-survey.html'))

True

## Storing html elements in a pandas.DataFrame

My first search is for one `<div>` per question. First I select the `<div>` that contains the question list. Then, within it, I select all `<div>` elements with the question class. I store these in a `pandas.DataFrame`.

In [14]:
import pandas
CLS_QUESTION_LIST = 'freebirdFormviewerViewItemList'
CLS_QUESTION = 'freebirdFormviewerViewItemsItemItem'

question_list = soup.find('div', attrs={'class': CLS_QUESTION_LIST})
questions = pandas.DataFrame({
    'div': question_list.find_all('div', attrs={'class': CLS_QUESTION})
})
questions

Unnamed: 0,div
0,"<div class=""freebirdFormviewerViewItemsItemIte..."
1,"<div class=""freebirdFormviewerViewItemsItemIte..."
2,"<div class=""freebirdFormviewerViewItemsItemIte..."
3,"<div class=""freebirdFormviewerViewItemsItemIte..."
4,"<div class=""freebirdFormviewerViewItemsItemIte..."
5,"<div class=""freebirdFormviewerViewItemsItemIte..."
6,"<div class=""freebirdFormviewerViewItemsItemIte..."
7,"<div class=""freebirdFormviewerViewItemsItemIte..."
8,"<div class=""freebirdFormviewerViewItemsItemIte..."
9,"<div class=""freebirdFormviewerViewItemsItemIte..."
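
To see what the two-step search is doing without the full survey page, here is a small sketch on a made-up html snippet. The class names `question-list`, `question`, and `title` are placeholders I invented for the example, and I use the built-in `'html.parser'` so the sketch doesn't depend on `html5lib`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page that mimics the structure described above.
TOY_HTML = """
<div class="question-list">
  <div class="question extra-class"><div class="title">First?</div></div>
  <div class="question"><div class="title">Second?</div></div>
</div>
<div class="question">outside the list</div>
"""

toy_soup = BeautifulSoup(TOY_HTML, 'html.parser')

# Step 1: narrow the search to the container.
toy_list = toy_soup.find('div', attrs={'class': 'question-list'})

# Step 2: find the question divs inside the container only.
# Matching on 'class' succeeds if any one of the element's classes matches,
# so the div with class="question extra-class" is included.
toy_questions = toy_list.find_all('div', attrs={'class': 'question'})

# The same search written as a single CSS selector.
css_questions = toy_soup.select('div.question-list div.question')
```

Both searches find the two questions inside the list and skip the stray `<div>` outside it, which is the reason for selecting the container first.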


Now I can extract the bits of data I need from each question. First up is getting the question titles.

In [15]:
CLS_QUESTION_TITLE = 'freebirdFormviewerViewItemsItemItemTitle'
questions['title'] = questions['div'].apply(
    lambda div: div.find('div', attrs={'class': CLS_QUESTION_TITLE}).text
)
questions

Unnamed: 0,div,title
0,"<div class=""freebirdFormviewerViewItemsItemIte...",Email address *
1,"<div class=""freebirdFormviewerViewItemsItemIte...",How are you using Python?
2,"<div class=""freebirdFormviewerViewItemsItemIte...",What kind(s) of events would you like to see a...
3,"<div class=""freebirdFormviewerViewItemsItemIte...",What format(s) of events would you like to see...
4,"<div class=""freebirdFormviewerViewItemsItemIte...",At what skill level would you like to see even...
5,"<div class=""freebirdFormviewerViewItemsItemIte...",What day(s) of the week work best for you?
6,"<div class=""freebirdFormviewerViewItemsItemIte...",What time(s) of day work best for you?
7,"<div class=""freebirdFormviewerViewItemsItemIte...",How frequently would you be interested in atte...
8,"<div class=""freebirdFormviewerViewItemsItemIte...",Where in town works best for you to attend the...
9,"<div class=""freebirdFormviewerViewItemsItemIte...",At what kind of facilities would you like to s...
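
One thing to watch for: `find()` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` if a question has no title element. Every question in this survey has a title, but here is a sketch of a more forgiving version. The helper `extract_title` and the toy class names are hypothetical, made up for this example.

```python
from bs4 import BeautifulSoup


def extract_title(div, title_class):
    """Return the stripped title text, or None if the div has no title element."""
    title_div = div.find('div', attrs={'class': title_class})
    if title_div is None:
        return None
    return title_div.get_text(strip=True)


# Made-up html: one question with a title, one without.
toy_soup = BeautifulSoup(
    '<div class="q"><div class="title">  How are you using Python?  </div></div>'
    '<div class="q"></div>',
    'html.parser',
)
with_title, without_title = toy_soup.find_all('div', attrs={'class': 'q'})

title_found = extract_title(with_title, 'title')
title_missing = extract_title(without_title, 'title')
```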


Next I want to extract the choices from each question that has choices.

In [17]:
CLS_QUESTION_CHECKBOX = 'freebirdFormviewerViewItemsCheckboxLabel'
CLS_QUESTION_RADIO = 'freebirdFormviewerViewItemsRadioChoice'
CLS_QUESTION_TYPES = [CLS_QUESTION_CHECKBOX, CLS_QUESTION_RADIO]

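def extract_choices(div):
    """Return the text of every checkbox or radio choice inside a question div."""
    choices = []
    for cls in CLS_QUESTION_TYPES:
        # find_all() returns an empty list when nothing matches,
        # so questions without choices end up with [].
        choice_divs = div.find_all(attrs={'class': cls})
        choices.extend(choice.text for choice in choice_divs)
    return choices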

questions['choices'] = questions['div'].apply(extract_choices)
questions

Unnamed: 0,div,title,choices
0,"<div class=""freebirdFormviewerViewItemsItemIte...",Email address *,[]
1,"<div class=""freebirdFormviewerViewItemsItemIte...",How are you using Python?,"[Software (engineering/developing), Science (s..."
2,"<div class=""freebirdFormviewerViewItemsItemIte...",What kind(s) of events would you like to see a...,"[Software (engineering/developing), Science (s..."
3,"<div class=""freebirdFormviewerViewItemsItemIte...",What format(s) of events would you like to see...,"[Presentations, Hack Nights, Python Schooling,..."
4,"<div class=""freebirdFormviewerViewItemsItemIte...",At what skill level would you like to see even...,"[Introductory, Intermediate, Expert]"
5,"<div class=""freebirdFormviewerViewItemsItemIte...",What day(s) of the week work best for you?,"[Monday, Tuesday, Wednesday, Thursday, Friday,..."
6,"<div class=""freebirdFormviewerViewItemsItemIte...",What time(s) of day work best for you?,"[Work hours (8am - 6pm), After work hours (6pm..."
7,"<div class=""freebirdFormviewerViewItemsItemIte...",How frequently would you be interested in atte...,"[Never, A few times a year, Monthly, Weekly]"
8,"<div class=""freebirdFormviewerViewItemsItemIte...",Where in town works best for you to attend the...,"[Far West, Near West, Downtown, East]"
9,"<div class=""freebirdFormviewerViewItemsItemIte...",At what kind of facilities would you like to s...,"[Libraries, Bars, Restaurants, Offices, Other:]"
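
Here is the same idea as a self-contained sketch on made-up html, to show why a free-text question like the email address ends up with an empty list. The function name `collect_choices` and the class names are placeholders for this example, not the real Google Forms classes.

```python
from bs4 import BeautifulSoup

# Placeholder class names standing in for the checkbox and radio classes.
TOY_CHOICE_CLASSES = ['checkbox-label', 'radio-choice']


def collect_choices(div, choice_classes=TOY_CHOICE_CLASSES):
    """Gather the text of every element whose class is one of choice_classes."""
    choices = []
    for cls in choice_classes:
        choices.extend(
            tag.get_text(strip=True) for tag in div.find_all(attrs={'class': cls})
        )
    return choices


toy_soup = BeautifulSoup(
    """
    <div class="question">
      <span class="radio-choice">Never</span>
      <span class="radio-choice">Monthly</span>
    </div>
    <div class="question"><input type="text"/></div>
    """,
    'html.parser',
)

toy_choices = [
    collect_choices(question)
    for question in toy_soup.find_all('div', attrs={'class': 'question'})
]
```

The radio question yields its two choices, and the text-input question yields an empty list because neither class appears inside it.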


## A pandas "gotcha": writing to csv

Although it's easy to work with python objects like lists and bs4 Tags in pandas DataFrames, saving them to a plaintext file is another matter: pandas writes each object's string representation, which is awkward to parse back. It's better to put everything in a safe plaintext representation, like json, first. Here I'm converting the Series of python list choices to json strings.

In [18]:
import json
questions['choices_json'] = questions['choices'].apply(json.dumps)
questions

Unnamed: 0,div,title,choices,choices_json
0,"<div class=""freebirdFormviewerViewItemsItemIte...",Email address *,[],[]
1,"<div class=""freebirdFormviewerViewItemsItemIte...",How are you using Python?,"[Software (engineering/developing), Science (s...","[""Software (engineering/developing)"", ""Science..."
2,"<div class=""freebirdFormviewerViewItemsItemIte...",What kind(s) of events would you like to see a...,"[Software (engineering/developing), Science (s...","[""Software (engineering/developing)"", ""Science..."
3,"<div class=""freebirdFormviewerViewItemsItemIte...",What format(s) of events would you like to see...,"[Presentations, Hack Nights, Python Schooling,...","[""Presentations"", ""Hack Nights"", ""Python Schoo..."
4,"<div class=""freebirdFormviewerViewItemsItemIte...",At what skill level would you like to see even...,"[Introductory, Intermediate, Expert]","[""Introductory"", ""Intermediate"", ""Expert""]"
5,"<div class=""freebirdFormviewerViewItemsItemIte...",What day(s) of the week work best for you?,"[Monday, Tuesday, Wednesday, Thursday, Friday,...","[""Monday"", ""Tuesday"", ""Wednesday"", ""Thursday"",..."
6,"<div class=""freebirdFormviewerViewItemsItemIte...",What time(s) of day work best for you?,"[Work hours (8am - 6pm), After work hours (6pm...","[""Work hours (8am - 6pm)"", ""After work hours (..."
7,"<div class=""freebirdFormviewerViewItemsItemIte...",How frequently would you be interested in atte...,"[Never, A few times a year, Monthly, Weekly]","[""Never"", ""A few times a year"", ""Monthly"", ""We..."
8,"<div class=""freebirdFormviewerViewItemsItemIte...",Where in town works best for you to attend the...,"[Far West, Near West, Downtown, East]","[""Far West"", ""Near West"", ""Downtown"", ""East""]"
9,"<div class=""freebirdFormviewerViewItemsItemIte...",At what kind of facilities would you like to s...,"[Libraries, Bars, Restaurants, Offices, Other:]","[""Libraries"", ""Bars"", ""Restaurants"", ""Offices""..."
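
To check that the json column really survives a trip through csv, here is a minimal round-trip sketch on a toy DataFrame. I write to an in-memory `io.StringIO` buffer instead of a file, and the two rows are made up for the example. Passing `index=False` is my own choice here: by default `to_csv` also writes the index, which comes back from `read_csv` as a column named "Unnamed: 0".

```python
import io
import json

import pandas

# Two made-up questions: one with choices, one free-text.
toy = pandas.DataFrame({
    'title': ['Skill level?', 'Email address'],
    'choices': [['Introductory', 'Intermediate', 'Expert'], []],
})
toy['choices_json'] = toy['choices'].apply(json.dumps)

# Write only the plaintext-safe columns, leaving out the index.
buffer = io.StringIO()
toy[['title', 'choices_json']].to_csv(buffer, index=False)
buffer.seek(0)

# Read the csv back and decode the json strings into python lists again.
restored = pandas.read_csv(buffer)
restored['choices'] = restored['choices_json'].apply(json.loads)
```

Had I saved the raw `choices` column instead, the csv would hold the python `repr` of each list, with single-quoted strings that `json.loads` can't parse.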
