# Background

In this notebook I work out how to query the TTB database so we can get a list of valid `TTBID`s. I do this using the `advanced search` functionality of the online database.

_Origin Codes_ for the US work as follows:
* 00: American (a country-level code, distinct from the state-level codes)
* 01-49: Each state __EXCEPT__ Alaska, which is 4E

When iterating through pages to find the valid URLs/IDs, note that __AT MOST__ 500 results are returned, even if there are more valid entries. To give a sense of scale, a query for the year 2016 returned 147,073 results, of which we could access only 500. At the time of writing it isn't yet clear why or how that limit is imposed.

The workaround is to reduce the scope of each search so that we are guaranteed to get no more than 500 hits per query. We can do this intelligently by parsing the number of results returned and, if it is greater than 500, reducing the scope. The obvious sliders are:
* Location (i.e. state)
* Date (year, month, day)
* Type of product
* Approval status (we really only care about items that are approved and currently in production)
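As a rough illustration of the date slider, the sketch below halves a date window until each piece reports a match count under the cap. The names `split_date_range`, `count_matches` and `MAX_RESULTS` are hypothetical; `count_matches` stands in for whatever function runs a search for a window and returns the total number of matching records.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, Tuple

MAX_RESULTS = 500  # cap on returned results described above


def split_date_range(
    start: date,
    end: date,
    count_matches: Callable[[date, date], int],
) -> Iterator[Tuple[date, date]]:
    """Yield (start, end) windows whose match count fits under MAX_RESULTS.

    `count_matches` is a hypothetical callback: it should run one search
    for the inclusive window and return the total number of matches.
    """
    total = count_matches(start, end)
    # Exactly 500 matches still fit under the cap. A single day cannot be
    # split further by date, so it is yielded as-is and would need another
    # slider (state, product type, ...) to be narrowed.
    if total <= MAX_RESULTS or start == end:
        yield (start, end)
        return
    mid = start + (end - start) // 2
    yield from split_date_range(start, mid, count_matches)
    yield from split_date_range(mid + timedelta(days=1), end, count_matches)
```

Each call to `count_matches` costs one request, so in practice it may be worth starting from month-sized windows rather than a whole year.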

An alternative (and perhaps easier?) method is to simply search by `TTBID` range. Recall that `TTBID`s work as follows:

<div class="alert alert-block alert-info">
TTB ID - This is a unique, 14 digit number assigned by TTB to track each COLA.  The first 5 digits represent the calendar year and Julian date the application was received by TTB. The next 3 digits tell how the application was received (001 = e-filed; 002 & 003 = mailed/overnight; 000 = hand delivered). The last 6 digits is a sequential number that resets for each day and for each received code.
</div>
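To make the layout concrete, here is a minimal sketch that splits a `TTBID` into its three parts and builds the lowest and highest possible ID for a given day. The helper names are hypothetical, and two assumptions are baked in: the two-digit year is read as 20xx, and "Julian date" is read as the day of the year (001 to 366).

```python
from datetime import date, timedelta
from typing import NamedTuple, Tuple


class TTBID(NamedTuple):
    received_on: date    # decoded from the first 5 digits
    received_code: str   # next 3 digits, e.g. '001' = e-filed
    sequence: int        # last 6 digits


def parse_ttbid(ttbid: str) -> TTBID:
    """Split a 14-digit TTB ID into date, received code and sequence."""
    if len(ttbid) != 14 or not (ttbid.isascii() and ttbid.isdigit()):
        raise ValueError(f"expected 14 digits, got {ttbid!r}")
    year = 2000 + int(ttbid[:2])      # ASSUMPTION: all years are 20xx
    day_of_year = int(ttbid[2:5])     # ASSUMPTION: Julian date = day of year
    received_on = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return TTBID(received_on, ttbid[5:8], int(ttbid[8:]))


def ttbid_range_for_day(day: date, received_code: str = "001") -> Tuple[str, str]:
    """Return the smallest and largest ID possible for one day and code."""
    prefix = f"{day:%y%j}{received_code}"
    return prefix + "000000", prefix + "999999"
```

The pair returned by `ttbid_range_for_day` could be tried as the `ttbIdFrom` / `ttbIdTo` search criteria, although whether the search form accepts bounds in exactly this shape is untested here.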

# Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import re

# Basic scraping

Queries to the database are made via `POST` requests with the search criteria outlined below. The results can be found by looking at the table rows (`tr`) whose class is either light (`lt`) or dark (`dk`). Note that we only get one page of results back at a time, and the pagination text in the output (`1 to 20 of 500`) shows that each page holds 20 results.

In [2]:
url = r'https://www.ttbonline.gov/colasonline/publicSearchColasAdvancedProcess.do'

payload = {'searchCriteria.dateCompletedFrom':'09/01/2016',
           'searchCriteria.dateCompletedTo':'09/24/2017',
           'searchCriteria.productOrFancifulName':'',
           'searchCriteria.productNameSearchType':'E',
           'searchCriteria.classTypeDesired':'desc',
           'searchCriteria.classTypeCode':'',
           'searchCriteria.originCodeArray':'00',
           'searchCriteria.ttbIdFrom':'',
           'searchCriteria.ttbIdTo':'',
           'searchCriteria.serialNumFrom':'',
           'searchCriteria.serialNumTo':'',
           'searchCriteria.permitId':'',
           'searchCriteria.vendorCode':''
            }
    
params = {'action': 'search'}

S = requests.Session()

response = S.post(url, params=params, data=payload)
soup = BeautifulSoup(response.text, 'html5lib')

In [3]:
soup.select('tr.dk a')

[<a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=08254001000056">08254001000056</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=08296001000145">08296001000145</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=10281001000077">10281001000077</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=10302001000109">10302001000109</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=11269001000048">11269001000048</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=11299001000094">11299001000094</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=12130001000098">12130001000098</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=12181001000311">12181001000311</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=12181001000323">12181001000323</a>,
 <a href="

In [4]:
soup.select('tr.lt a')

[<a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=07061001000076">07061001000076</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=08254001000085">08254001000085</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=10123001000030">10123001000030</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=10281001000085">10281001000085</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=11033001000335">11033001000335</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=11269001000050">11269001000050</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=11304001000039">11304001000039</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=12181001000308">12181001000308</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=12181001000319">12181001000319</a>,
 <a href="

We can select the link to the next page as follows:

In [5]:
soup.select('div.pagination')

[<div class="pagination">
                 <a href="javascript:void(0)" onclick="win1=popup('publicPrintableResults.do?titlePrefix=COLAs&amp;path=/publicSearchColasAdvancedProcess','printableResultsWin');win1.focus();">Printable Version</a>
                 <br/><br/>
                 <a href="javascript:void(0)" onclick="win2=popup('publicSaveSearchResultsToFile.do?path=/publicSearchColasBasicProcess','saveSearchResults');win2.focus();">Save Search Results To File</a>
                 <br/><br/>
                 1 to 20 of 500 (Total Matching Records: 2010) | <a href="publicPageAdvancedCola.do?action=page&amp;pgfcn=nextset">Next &gt;</a>
               </div>, <div class="pagination">
                 <a href="javascript:void(0)" onclick="win1=popup('publicPrintableResults.do?titlePrefix=COLAs&amp;path=/publicSearchColasAdvancedProcess','printableResultsWin');win1.focus();">Printable Version</a>
                 <br/><br/>              
                 1 to 20 of 500 (Total Matching 

In [6]:
soup.select('div.pagination a[href*=page]')

[<a href="publicPageAdvancedCola.do?action=page&amp;pgfcn=nextset">Next &gt;</a>,
 <a href="publicPageAdvancedCola.do?action=page&amp;pgfcn=nextset">Next &gt;</a>]

We can try to follow along to the next page like so.

__NOTE:__ the URL changes, __AND__ we have to use the same session. I believe this is because the `next` button logic relies upon the `JSESSIONID` cookie to serve up the correct next page.

In [7]:
url = r'https://www.ttbonline.gov/colasonline/publicPageAdvancedCola.do'

params = {'action': 'page',
          'pgfcn': 'nextset'}

response = S.get(url, params=params)
soup = BeautifulSoup(response.text, 'html5lib')

In [8]:
soup.select('tr.dk a')

[<a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13091001000570">13091001000570</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13091001000576">13091001000576</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13156001000001">13156001000001</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=15231001000516">15231001000516</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16188001000613">16188001000613</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16189001000075">16189001000075</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16209001000039">16209001000039</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16215001000435">16215001000435</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16216001000455">16216001000455</a>,
 <a href="

In [9]:
soup.select('tr.lt a')

[<a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13091001000567">13091001000567</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13091001000572">13091001000572</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13151001000328">13151001000328</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=13156001000090">13156001000090</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16158001000184">16158001000184</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16189001000073">16189001000073</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16189001000077">16189001000077</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16209001000766">16209001000766</a>,
 <a href="viewColaDetails.do?action=publicDisplaySearchAdvanced&amp;ttbid=16215001000438">16215001000438</a>,
 <a href="

In [38]:
[link.get_text() for link in soup.select('tr.lt a')]

['13091001000567',
 '13091001000572',
 '13151001000328',
 '13156001000090',
 '16158001000184',
 '16189001000073',
 '16189001000077',
 '16209001000766',
 '16215001000438',
 '16217001000441']
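Since the IDs are split across the light and dark rows, it can be handy to collect them in one pass. The sketch below is a plain regex alternative to the CSS selectors used above, not the same technique: it pulls every 14-digit `ttbid=` value out of the raw page text in page order. The function name `extract_ttbids` is hypothetical.

```python
import re
from typing import List

# Matches the ttbid query parameter inside the result links
TTBID_LINK_RE = re.compile(r'ttbid=(\d{14})')


def extract_ttbids(html: str) -> List[str]:
    """Return the unique TTB IDs linked in a results page, in page order."""
    # dict.fromkeys drops duplicates while preserving first-seen order
    return list(dict.fromkeys(TTBID_LINK_RE.findall(html)))
```

It would be called on `response.text` rather than on the parsed `soup`.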

## Number of responses

In [21]:
res = soup.select('div.pagination')

In [31]:
tmp = res[0]

In [33]:
tmp.get_text()

'\n                Printable Version\n                \n                Save Search Results To File\n                \n                < Previous | 21 to 40 of 500 (Total Matching Records: 2010) | Next >\n              '

Extract the whole line as a matching group:

In [34]:
re.findall(r'(Total Matching Records: [0-9]+)', res[0].get_text())

['Total Matching Records: 2010']

Extract just the number of matches as a group:

In [42]:
re.findall(r'\(Total Matching Records: ([0-9]+)\)', res[0].get_text())

['2010']
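Building on the regexes above, a slightly fuller sketch can read all four numbers from the pagination text, which is enough to tell whether a query hit the 500 cap and how many more `Next` requests remain. The names `PageInfo`, `parse_pagination`, `is_truncated` and `pages_remaining` are hypothetical, and the pattern assumes the counts are printed without thousands separators, as in the sample output.

```python
import math
import re
from typing import NamedTuple

PAGINATION_RE = re.compile(
    r'(\d+) to (\d+) of (\d+) \(Total Matching Records: (\d+)\)'
)


class PageInfo(NamedTuple):
    first: int      # index of the first result on this page
    last: int       # index of the last result on this page
    available: int  # results reachable by paging (capped at 500)
    total: int      # total matching records reported by the site


def parse_pagination(text: str) -> PageInfo:
    """Parse text such as '21 to 40 of 500 (Total Matching Records: 2010)'."""
    match = PAGINATION_RE.search(text)
    if match is None:
        raise ValueError("pagination text not found")
    return PageInfo(*map(int, match.groups()))


def is_truncated(info: PageInfo) -> bool:
    """True when the site matched more records than it will serve."""
    return info.total > info.available


def pages_remaining(info: PageInfo) -> int:
    """Number of further 'Next' requests needed to reach the last page."""
    page_size = info.last - info.first + 1
    return math.ceil((info.available - info.last) / page_size)
```

With the notebook's variables this would be used as `parse_pagination(res[0].get_text())`; a truncated result is the signal to narrow the search criteria.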