Intro goes here

In [1]:
import requests

baseUrl = 'https://www.ssa.gov/OACT/babynames/index.html'
session = requests.session()
response = session.get(baseUrl)

In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html5lib')
form = soup.find('form')
form

<form action="/cgi-bin/popularnames.cgi" method="post" name="popnames" onsubmit="return submitIt();">
                <p>
                  <input id="year" maxlength="4" name="year" pattern="\d{4}" required="" size="5" style="width:100px" title="Birth Year: Must be 4 numbers" type="text" value="2016"/>
                  <label for="year" style="display:inline;">  Birth Year</label><br/>
                </p>
                <p>
                  <select id="rank" name="top" size="1" style="width:100px">
                    <option value="20">Top 20</option>
                    <option value="50">Top 50</option>
                    <option value="100">Top 100</option>
                    <option value="500">Top 500</option>
                    <option value="1000">Top 1000</option>
                  </select>
                  <label for="rank" style="display:inline;">  Popularity</label><br/>
                </p>
                <fieldset>
                  <legend>Name rankings may in

While putting together this post I wanted the HTML and JavaScript snippets I include here and there to be syntax-highlighted. It took me a while to figure out how, but in the end [reading the docs](http://pygments.org/docs/quickstart/) turned out to be the best solution. After a bit of fiddling I had [a utility function for highlighting various languages in jupyter notebooks](https://github.com/njvrzm/zeroth/blob/master/zeroth/utils/jupretty.py).

In [3]:
from zeroth.utils import pretty

pretty(str(form), 'html')

Notice that the form has a JavaScript submit handler (`onsubmit="return submitIt();"`) - we'll want to check whether it does anything important before going on.

In [4]:
scripts = soup.findAll('script')
scripts

[<script src="/framework/js/ssa.internet.head.js"></script>,
 <script src="chkinput.js"></script>,
 <script src="/framework/js/ssa.internet.body.js"></script>]

"chkinput.js" sounds promising...

In [5]:
import urllib.parse  # a bare "import urllib" is not guaranteed to load the parse submodule
script_url = urllib.parse.urljoin(baseUrl, scripts[1]['src'])
script_url

'https://www.ssa.gov/OACT/babynames/chkinput.js'

In [6]:
pretty(session.get(script_url).text, 'javascript')

Ok, `submitIt` just does validation. We can ignore that. So, back to the form:

In [7]:
pretty(str(form), 'html')

We need to find the action URL for the form and all the parameters to send. In this case the parameters all come from `input` and `select` elements.

In [8]:
action = urllib.parse.urljoin(baseUrl, form['action'])
print(action)

https://www.ssa.gov/cgi-bin/popularnames.cgi
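
Both calls to `urljoin` so far resolve a path against the same base URL, but they behave differently: a relative path replaces only the last segment of the base, while a path starting with `/` replaces the whole path. A minimal sketch using the two paths seen above:

```python
from urllib.parse import urljoin

base = 'https://www.ssa.gov/OACT/babynames/index.html'

# Relative path: resolved against the directory of the base URL.
script = urljoin(base, 'chkinput.js')

# Root-relative path: replaces everything after the host.
action = urljoin(base, '/cgi-bin/popularnames.cgi')

print(script)  # https://www.ssa.gov/OACT/babynames/chkinput.js
print(action)  # https://www.ssa.gov/cgi-bin/popularnames.cgi
```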


In [9]:
form.findAll('input')

[<input id="year" maxlength="4" name="year" pattern="\d{4}" required="" size="5" style="width:100px" title="Birth Year: Must be 4 numbers" type="text" value="2016"/>,
 <input id="percent" name="number" type="radio" value="p"/>,
 <input id="number" name="number" type="radio" value="n"/>,
 <input type="submit" value="  Go  "/>,
 <input type="reset" value="Reset"/>]

In [10]:
form.findAll('select')

[<select id="rank" name="top" size="1" style="width:100px">
                     <option value="20">Top 20</option>
                     <option value="50">Top 50</option>
                     <option value="100">Top 100</option>
                     <option value="500">Top 500</option>
                     <option value="1000">Top 1000</option>
                   </select>]
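
The two lookups above can also be folded into a single pass that maps each parameter name to the values the form offers. The sketch below is not what this post uses: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra packages. The `FormFields` class and the trimmed-down `sample` form are made up for illustration.

```python
from html.parser import HTMLParser


class FormFields(HTMLParser):
    """Hypothetical helper: map each form parameter name to its offered values."""

    def __init__(self):
        super().__init__()
        self.fields = {}     # parameter name -> list of offered values
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and 'name' in attrs:
            # Unnamed inputs (submit, reset) are never sent, so skip them.
            self.fields.setdefault(attrs['name'], []).append(attrs.get('value'))
        elif tag == 'select' and 'name' in attrs:
            self._select = attrs['name']
            self.fields.setdefault(self._select, [])
        elif tag == 'option' and self._select is not None:
            self.fields[self._select].append(attrs.get('value'))

    def handle_endtag(self, tag):
        if tag == 'select':
            self._select = None


# A shortened stand-in for the real form, not the full markup.
sample = '''
<form action="/cgi-bin/popularnames.cgi" method="post">
  <input name="year" type="text" value="2016"/>
  <select name="top">
    <option value="20">Top 20</option>
    <option value="50">Top 50</option>
  </select>
  <input name="number" type="radio" value="p"/>
  <input name="number" type="radio" value="n"/>
  <input type="submit" value="  Go  "/>
</form>
'''

parser = FormFields()
parser.feed(sample)
fields = parser.fields
print(fields)
```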

So for our post data we need a `year` from 1880 to 2016, a `number` with value either "n" or "p", and a `top` with value 20, 50, 100, 500 or 1000. Later I'll want to get as much data as possible, but for now it'll be more convenient to examine the data if we use 10 for `top`. (10 isn't one of the options in the form, but the server accepts it.) I can see either number or percent being interesting, but the former seems a bit more useful.

In [11]:
result = session.post(action, data={'year': 1880, 'number': 'n', 'top': 10})
page = BeautifulSoup(result.text, 'html5lib')
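
The constraints on the post data described above could be captured in a small helper that fails early on bad input. This is only a sketch: `build_payload` is a hypothetical name, the year range is the one quoted in this post, and `top` is deliberately not restricted to the form's options since 10 works too.

```python
def build_payload(year, number='n', top=10):
    """Hypothetical helper: build the form data for the popular-names request."""
    if not 1880 <= year <= 2016:  # range quoted in this post
        raise ValueError(f'year must be between 1880 and 2016, got {year}')
    if number not in ('n', 'p'):  # "n" for counts, "p" for percentages
        raise ValueError(f"number must be 'n' or 'p', got {number!r}")
    return {'year': year, 'number': number, 'top': top}


payload = build_payload(1880)
print(payload)  # {'year': 1880, 'number': 'n', 'top': 10}
```

The resulting dictionary can be passed as the `data` argument of `session.post`, exactly like the literal one above.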

From having looked at the page in the browser, it's clear the data is presented as a table, but even if you didn't have that to go by, a table would still be a pretty good bet.

In [12]:
tables = page.findAll('table')

In [13]:
len(tables)

4

In [14]:
[(i, len(str(table))) for i, table in enumerate(tables)]

[(0, 357), (1, 2578), (2, 1778), (3, 499)]

Even with just ten rows, our target is almost certainly one of the two bigger tables. My intuition is that the larger of the two is probably just a formatting wrapper around the smaller one, so let's look at the third table.

In [15]:
names = tables[2]
pretty(str(names), 'html')

The table doesn't really have any useful attributes to search by, but it does contain a `caption` element. If that's the only caption on the page, we can use it...

In [16]:
len(page.findAll('caption'))

1

Perfect. When I write the cleaned-up code later I'll use the following to get just the table I want:

In [17]:
names = page.find('caption').find_parent('table')

Now let's look at the structure of the table.

In [18]:
firstRow = names.find('tr')
pretty(str(firstRow), 'html')

In [19]:
columns = [th.text for th in firstRow.findAll('th')]
columns

['Rank', 'Male name', 'Number of males', 'Female name', 'Number of females']

I just learned about this next function while writing this example. Given a BeautifulSoup tag, you can call `findNextSiblings` on it (spelled `find_next_siblings` in current BeautifulSoup) with a tag name; it returns a list of all tags with that name that follow it at the same level of the tree. Very handy for exactly this case, where we want all the rows of the table _after_ the header row.

In [20]:
rows = firstRow.findNextSiblings('tr')
rows[0]

<tr align="right">
 <td>1</td> <td>John</td><td>9,655</td>
 <td>Mary</td>
<td>7,065</td>
</tr>

A small helper function to extract data from the row and clean it up:

In [21]:
def getValues(tr):
    return [td.text.replace(',', '') for td in tr.findAll('td')]

getValues(rows[0])

['1', 'John', '9655', 'Mary', '7065']

Then we merge the data with the headers to make a usable dictionary:

In [22]:
dict(zip(columns, getValues(rows[0])))

{'Female name': 'Mary',
 'Male name': 'John',
 'Number of females': '7065',
 'Number of males': '9655',
 'Rank': '1'}

In [23]:
data = [dict(zip(columns, getValues(row))) for row in rows]
len(data)

11

We asked for 10 rows but got 11, so there must be a footer row.

In [24]:
data[-1]

{'Rank': 'Note: Rank 1 is the most popular\nrank 2 is the next most popular and so forth. All names are from Social Security card applications\n              for births that occurred in the United States.\n'}

Good to know, but it's not data.

In [25]:
data.pop(-1)

{'Rank': 'Note: Rank 1 is the most popular\nrank 2 is the next most popular and so forth. All names are from Social Security card applications\n              for births that occurred in the United States.\n'}

In [26]:
len(data)

10
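
The footer row ended up as a dictionary with only a `Rank` key because `zip` stops at the shorter of its inputs, and that row has a single cell. Instead of popping the last entry, one alternative is to keep only rows with as many values as there are columns. A sketch with hand-written stand-in rows (the footer text is abbreviated here):

```python
columns = ['Rank', 'Male name', 'Number of males',
           'Female name', 'Number of females']

# Stand-ins for what getValues returns for each row.
raw_rows = [
    ['1', 'John', '9655', 'Mary', '7065'],
    ['8', 'Thomas', '2534', 'Alice', '1414'],
    ['Note: Rank 1 is the most popular'],  # abbreviated footer cell
]

# Keep only rows that fill every column; the one-cell footer drops out.
data = [dict(zip(columns, values))
        for values in raw_rows
        if len(values) == len(columns)]

print(len(data))  # 2
```

This avoids relying on the footer always being the last row, at the cost of silently skipping any other malformed row.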

In [27]:
data[7]

{'Female name': 'Alice',
 'Male name': 'Thomas',
 'Number of females': '1414',
 'Number of males': '2534',
 'Rank': '8'}
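
All the values are still strings at this point. If numbers are wanted for ranks and counts, one option is to convert any value made up only of digits. The `to_record` name below is hypothetical, and the sketch assumes counts rather than percentages were requested (percentage values would simply stay as strings).

```python
def to_record(columns, values):
    """Hypothetical helper: pair headers with values, converting digit strings to int."""
    return {
        column: int(value) if value.isdecimal() else value
        for column, value in zip(columns, values)
    }


columns = ['Rank', 'Male name', 'Number of males',
           'Female name', 'Number of females']
record = to_record(columns, ['8', 'Thomas', '2534', 'Alice', '1414'])
print(record['Rank'] + 1)  # 9
```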

And that's it! All the code we need is pretty much here. I cleaned it up a bit and put together a [social security popular names scraper](https://github.com/njvrzm/zeroth/blob/master/zeroth/namesss/namesss.py).