# Scraping survey questions from a Google form

When interpreting the results of a survey, it helps to know what the options were. The options for each question can usually be inferred from the responses (e.g., the set of unique responses), but any option that nobody selected would be missed, so a better approach is to construct a representation of the survey that is independent of the responses. In this notebook, I'll walk through extracting the necessary data from the HTML page containing the Google form. I'll use `requests` to download the survey page and `BeautifulSoup` to search the page for question elements. The result is a CSV data file with one row per question.

## Download the survey HTML with requests

We need the survey URL to scrape the questions. Google survey URLs can be a bit unruly, so I've stored the URL for the MadPy survey in a text file, "madpy-survey-url.txt". Then I just `get()` the survey page with `requests`.

In [1]:
! cat madpy-survey-url.txt

https://docs.google.com/forms/d/e/1FAIpQLSdIg3yZqSPxCac-ESLjnlfEcE5PLBo02TeBP42lgZJlUlry5w/viewform


In [2]:
import requests
# Read the URL inside a with-block so the file is closed afterwards
with open('madpy-survey-url.txt') as f:
    survey_url = f.read().strip()  # Don't forget to strip the newline!
response = requests.get(survey_url)
response.content
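
A slightly more defensive version of the download step might check the HTTP status before using the body. The sketch below is an assumption about how one could wrap it, not part of the original notebook: `fetch_survey_html` and its default `timeout` value are hypothetical, while `raise_for_status()` and the `timeout` argument are standard `requests` features.

```python
import requests


def fetch_survey_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors.

    Hypothetical helper for illustration; the 10 second timeout is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never mistaken for the survey itself.
    response.raise_for_status()
    return response.content
```

With this in place, `fetch_survey_html(survey_url)` would return the same bytes as `response.content` above, or fail loudly if the form URL is wrong or no longer public.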



## Extracting HTML elements with BeautifulSoup

The next step is to make it easier to search the HTML by turning it into soup with `BeautifulSoup`. I import the `BeautifulSoup` class and create some soup by giving it the HTML from the response. There are a number of parsers that can turn a document into soup, so you should name one explicitly (otherwise BeautifulSoup picks one for you and issues a warning). I'm using "html.parser", which is built into Python. For more information on parsers, see the section on [installing a parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) in the BeautifulSoup docs.

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <link href="https://ssl.gstatic.com/docs/spreadsheets/forms/favicon_qp2.png" rel="shortcut icon" sizes="16x16">
   <title>
    MadPy
   </title>
   <link data-id="_cl" href="https://www.gstatic.com/_/freebird/_/ss/k=freebird.v.-1p1gb81lxpse7.L.W.O/d=1/rs=AMjVe6hs1UUFmJER3aTTWdZ-iezMG14dpg" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Roboto:300,400,400i,500,700&amp;subset=latin,vietnamese,latin-ext,cyrillic,greek,cyrillic-ext,greek-ext" rel="stylesheet" type="text/css">
     <link href="https://fonts.googleapis.com/css?family=Product+Sans&amp;subset=latin,vietnamese,latin-ext,cyrillic,greek,cyrillic-ext,greek-ext" rel="stylesheet" type="text/css">
      <meta content="width=device-width, initial-scale=1" name="viewport">
       <script type="text/javascript">
        window.WIZ_global_data = {w2btAe: '%.@.null,null,\x22\x22,true]\n'};
       </script>
       <style id="WTVccd">
        .freebirdThemedTab .exportTab .freebird

We've got the survey page, but it's no fun digging through that to find the CSS classes we need. It's easier to open the page in Chrome or Firefox and use the Developer Tools to locate the elements we want. Here I'm saving the page to a file, "madpy-survey.html", and opening it in the default web browser.

In [4]:
with open('madpy-survey.html', 'wb') as f:
    f.write(response.content)

import webbrowser
import os
webbrowser.open('file://' + os.path.abspath('madpy-survey.html'))

True

In [5]:
# CSS classes retrieved through manual inspection
CLS_QUESTION = 'freebirdFormviewerViewItemsItemItem'
CLS_QUESTION_TITLE = 'freebirdFormviewerViewItemsItemItemTitle'

CLS_QUESTION_CHECKBOX = 'freebirdFormviewerViewItemsCheckboxLabel'
CLS_QUESTION_RADIO = 'freebirdFormviewerViewItemsRadioChoice'
CLS_QUESTION_TYPES = [CLS_QUESTION_CHECKBOX, CLS_QUESTION_RADIO]
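
To see how searching by class behaves before running it on the real page, here is a minimal sketch on made-up markup. The `toy_html` snippet and the `extraClass` name are assumptions for illustration only; the real form's divs carry more classes and attributes. BeautifulSoup matches whole class names, so the question class does not accidentally match the title class even though one is a prefix of the other.

```python
from bs4 import BeautifulSoup

CLS_QUESTION = 'freebirdFormviewerViewItemsItemItem'

# Hypothetical markup, not copied from the actual survey page.
toy_html = """
<div class="freebirdFormviewerViewItemsItemItem extraClass">
  <div class="freebirdFormviewerViewItemsItemItemTitle">Favorite day?</div>
</div>
<div class="somethingElse">Not a question</div>
"""
toy_soup = BeautifulSoup(toy_html, 'html.parser')

# Three equivalent ways to match elements that have a given CSS class.
by_attrs = toy_soup.find_all('div', attrs={'class': CLS_QUESTION})
by_keyword = toy_soup.find_all('div', class_=CLS_QUESTION)
by_selector = toy_soup.select('div.' + CLS_QUESTION)

# Each search finds only the outer question div, even though it has
# a second class and its child's class starts with the same letters.
print(len(by_attrs), len(by_keyword), len(by_selector))
```

The notebook uses the `attrs={'class': ...}` form throughout; the other two spellings are shown only for comparison.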

## Storing HTML elements in a pandas.DataFrame

My first search is for the `<div>` that wraps each question. I store these in a `pandas.DataFrame`. This is not necessarily the most convenient representation for these divs, but it works, and it makes it easy to visually inspect results and to write the survey data to a plaintext format (CSV).

In [7]:
# Find all the question divs in the soup
question_divs = soup.find_all('div', attrs={'class': CLS_QUESTION})

# Make a DataFrame with a single column holding bs4.element.Tag objects for each question
import pandas
questions = pandas.DataFrame({'div': question_divs})
questions.insert(0, 'qid', ['q{}'.format(i) for i in list(questions.index)])
questions

|   | qid | div |
|---|-----|-----|
| 0 | q0 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 1 | q1 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 2 | q2 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 3 | q3 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 4 | q4 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 5 | q5 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 6 | q6 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 7 | q7 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 8 | q8 | `<div class="freebirdFormviewerViewItemsItemIte...` |
| 9 | q9 | `<div class="freebirdFormviewerViewItemsItemIte...` |


Now I can extract the bits of data I need from each question. First up are the question titles. To get them, I apply a function to each element of a `pandas.Series`, extracting the text of the question title div.

In [8]:
def extract_question_title(div):
    return div.find('div', attrs={'class': CLS_QUESTION_TITLE}).text

questions['question'] = questions['div'].apply(extract_question_title)
questions

|   | qid | div | question |
|---|-----|-----|----------|
| 0 | q0 | `<div class="freebirdFormviewerViewItemsItemIte...` | Email address * |
| 1 | q1 | `<div class="freebirdFormviewerViewItemsItemIte...` | How are you using Python? |
| 2 | q2 | `<div class="freebirdFormviewerViewItemsItemIte...` | What kind(s) of events would you like to see a... |
| 3 | q3 | `<div class="freebirdFormviewerViewItemsItemIte...` | What format(s) of events would you like to see... |
| 4 | q4 | `<div class="freebirdFormviewerViewItemsItemIte...` | At what skill level would you like to see even... |
| 5 | q5 | `<div class="freebirdFormviewerViewItemsItemIte...` | What day(s) of the week work best for you? |
| 6 | q6 | `<div class="freebirdFormviewerViewItemsItemIte...` | What time(s) of day work best for you? |
| 7 | q7 | `<div class="freebirdFormviewerViewItemsItemIte...` | How frequently would you be interested in atte... |
| 8 | q8 | `<div class="freebirdFormviewerViewItemsItemIte...` | Where in town works best for you to attend the... |
| 9 | q9 | `<div class="freebirdFormviewerViewItemsItemIte...` | At what kind of facilities would you like to s... |
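
The title lookup works here because every question div on this form contains a title div. If one did not, `find()` would return `None` and accessing its text would raise an `AttributeError`. A hedged variant that tolerates that case might look like the sketch below; `extract_question_title_safe` and the toy markup are hypothetical, not part of the original notebook.

```python
from bs4 import BeautifulSoup

CLS_QUESTION_TITLE = 'freebirdFormviewerViewItemsItemItemTitle'


def extract_question_title_safe(div):
    """Return the stripped title text, or None when the div has no title."""
    title_div = div.find('div', attrs={'class': CLS_QUESTION_TITLE})
    if title_div is None:
        return None
    # get_text(strip=True) also trims surrounding whitespace from the title
    return title_div.get_text(strip=True)


# Hypothetical markup for illustration only.
with_title = BeautifulSoup(
    '<div><div class="freebirdFormviewerViewItemsItemItemTitle"> Email address * </div></div>',
    'html.parser',
)
without_title = BeautifulSoup('<div><p>Section break</p></div>', 'html.parser')

print(extract_question_title_safe(with_title))
print(extract_question_title_safe(without_title))
```

It can be passed to `questions['div'].apply(...)` in the same way as `extract_question_title`.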


Next I want to extract the choices from each question that has choices.

In [9]:
def extract_choices(div):
    """Extract any choices from any of the question types.
    
    Globals:
        CLS_QUESTION_TYPES
    
    SMELLS! Just tries to grab every choice it can.
    """
    choices = []
    for cls in CLS_QUESTION_TYPES:
        # find_all returns an empty list when nothing matches, so no length check is needed
        choice_divs = div.find_all(attrs={'class': cls})
        choices.extend(choice_div.text for choice_div in choice_divs)
    return choices

questions['choices'] = questions['div'].apply(extract_choices)
questions

|   | qid | div | question | choices |
|---|-----|-----|----------|---------|
| 0 | q0 | `<div class="freebirdFormviewerViewItemsItemIte...` | Email address * | [] |
| 1 | q1 | `<div class="freebirdFormviewerViewItemsItemIte...` | How are you using Python? | [Software (engineering/developing), Science (s... |
| 2 | q2 | `<div class="freebirdFormviewerViewItemsItemIte...` | What kind(s) of events would you like to see a... | [Software (engineering/developing), Science (s... |
| 3 | q3 | `<div class="freebirdFormviewerViewItemsItemIte...` | What format(s) of events would you like to see... | [Presentations, Hack Nights, Python Schooling,... |
| 4 | q4 | `<div class="freebirdFormviewerViewItemsItemIte...` | At what skill level would you like to see even... | [Introductory, Intermediate, Expert] |
| 5 | q5 | `<div class="freebirdFormviewerViewItemsItemIte...` | What day(s) of the week work best for you? | [Monday, Tuesday, Wednesday, Thursday, Friday,... |
| 6 | q6 | `<div class="freebirdFormviewerViewItemsItemIte...` | What time(s) of day work best for you? | [Work hours (8am - 6pm), After work hours (6pm... |
| 7 | q7 | `<div class="freebirdFormviewerViewItemsItemIte...` | How frequently would you be interested in atte... | [Never, A few times a year, Monthly, Weekly] |
| 8 | q8 | `<div class="freebirdFormviewerViewItemsItemIte...` | Where in town works best for you to attend the... | [Far West, Near West, Downtown, East] |
| 9 | q9 | `<div class="freebirdFormviewerViewItemsItemIte...` | At what kind of facilities would you like to s... | [Libraries, Bars, Restaurants, Offices, Other:] |
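
An alternative to looping over the class names is a single CSS selector that matches either kind of choice. This is a sketch of that idea, not the approach used in the notebook: `extract_choices_by_selector`, `CHOICE_SELECTOR`, and the toy markup are hypothetical, and the `<label>` tags are a guess at the element type. Unlike the loop, it returns choices in document order rather than grouped by class, which gives the same result when each question uses only one choice type.

```python
from bs4 import BeautifulSoup

CLS_QUESTION_CHECKBOX = 'freebirdFormviewerViewItemsCheckboxLabel'
CLS_QUESTION_RADIO = 'freebirdFormviewerViewItemsRadioChoice'

# Build a selector such as ".classA, .classB"; the comma means "either one".
CHOICE_SELECTOR = ', '.join(
    '.' + cls for cls in (CLS_QUESTION_CHECKBOX, CLS_QUESTION_RADIO)
)


def extract_choices_by_selector(div):
    """Return the text of every checkbox or radio choice, in document order."""
    return [choice.get_text(strip=True) for choice in div.select(CHOICE_SELECTOR)]


# Hypothetical markup for illustration only.
toy_question = BeautifulSoup(
    """
    <div>
      <label class="freebirdFormviewerViewItemsRadioChoice">Never</label>
      <label class="freebirdFormviewerViewItemsRadioChoice">Monthly</label>
    </div>
    """,
    'html.parser',
)
free_text_question = BeautifulSoup('<div><input type="text"></div>', 'html.parser')

print(extract_choices_by_selector(toy_question))
print(extract_choices_by_selector(free_text_question))
```

Questions without choices, such as the email address field, simply produce an empty list.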


## A pandas "gotcha": writing to csv

Although it's easy to work with Python objects like lists and `bs4.element.Tag` objects in pandas DataFrames, saving them to a plaintext file is another matter: `to_csv` writes each object's string representation, which comes back as a plain string when the file is read again. It's better to first put everything in a safe plaintext representation, like JSON. Here I'm converting the Series of Python lists of choices to JSON strings.

In [10]:
import json
questions['choices_json'] = questions['choices'].apply(json.dumps)
questions = questions[['qid', 'question', 'choices_json']]
questions.head(3)

|   | qid | question | choices_json |
|---|-----|----------|--------------|
| 0 | q0 | Email address * | [] |
| 1 | q1 | How are you using Python? | ["Software (engineering/developing)", "Science... |
| 2 | q2 | What kind(s) of events would you like to see a... | ["Software (engineering/developing)", "Science... |


In [11]:
questions.to_csv("data/questions.csv", index=False)
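
To make the gotcha concrete, the following sketch round-trips a toy table through CSV both ways. The rows are made up, and an in-memory `io.StringIO` buffer stands in for the file on disk; the `converters` argument of `pandas.read_csv` is one way (an assumption about how the file might be loaded later) to turn the JSON strings back into lists.

```python
import io
import json

import pandas

# Toy stand-in for the questions table (made-up rows, not the real survey).
toy = pandas.DataFrame({
    'qid': ['q0', 'q1'],
    'choices': [[], ['Monday', 'Tuesday']],
})

# Writing the raw lists: pandas stores str(list), which reads back as text.
raw_csv = toy.to_csv(index=False)
raw_back = pandas.read_csv(io.StringIO(raw_csv))
print(type(raw_back.loc[1, 'choices']))  # a str, not a list

# Writing JSON strings instead lets json.loads rebuild the lists on load.
toy['choices_json'] = toy['choices'].apply(json.dumps)
json_csv = toy[['qid', 'choices_json']].to_csv(index=False)
json_back = pandas.read_csv(
    io.StringIO(json_csv),
    converters={'choices_json': json.loads},
)
print(json_back.loc[1, 'choices_json'])
```

The same `converters` argument should work when reading "data/questions.csv" from disk.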

Now that we have the survey responses in tidy data format, and an independent representation of survey questions and options, we are ready to visualize the results of the survey.