# Aims

Grab every piece of key data for a given entry in the TTB database.

# Background research

URLs for TTB entries are formatted like so:

> `https://www.ttbonline.gov/colasonline/viewColaDetails.do?action=publicDisplaySearchBasic&ttbid=17115001000140`

__Note:__ only the `ttbid` really changes between entries. Exactly what it means is described [in the definition of terms page](https://www.ttbonline.gov/colasonline/defOfTerms.do). The `ttbid` definition is copied below.

<div class="alert alert-block alert-info">
TTB ID - This is a unique, 14 digit number assigned by TTB to track each COLA.  The first 5 digits represent the calendar year and Julian date the application was received by TTB. The next 3 digits tell how the application was received (001 = e-filed; 002 & 003 = mailed/overnight; 000 = hand delivered). The last 6 digits is a sequential number that resets for each day and for each received code.
</div>
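
The definition above can be turned into a small parser. This is only a sketch: the names `TTBID`, `parse_ttbid` and `RECEIVED_METHODS` are made up for illustration, and it assumes the first 5 digits split into a 2-digit year followed by a 3-digit day of the year, which the definition does not spell out.

```python
from typing import NamedTuple

# Received codes as listed in the TTB definition quoted above.
RECEIVED_METHODS = {
    '000': 'hand delivered',
    '001': 'e-filed',
    '002': 'mailed/overnight',
    '003': 'mailed/overnight',
}


class TTBID(NamedTuple):
    """Hypothetical container for the parts of a TTB ID."""
    year: int           # two-digit calendar year (assumed), e.g. 17
    day_of_year: int    # Julian date, i.e. day of the year (assumed 3 digits)
    received_code: str  # how the application was received, e.g. '001'
    sequence: int       # sequential number for that day and received code


def parse_ttbid(ttbid):
    """Split a 14-digit TTB ID (int or str) into its components."""
    s = str(ttbid)
    if len(s) != 14 or not s.isdigit():
        raise ValueError('expected a 14-digit TTB ID, got %r' % (ttbid,))
    return TTBID(year=int(s[:2]),
                 day_of_year=int(s[2:5]),
                 received_code=s[5:8],
                 sequence=int(s[8:]))


parsed = parse_ttbid(17115001000140)
```

For the example ID, `parsed.received_code` is `'001'`, which `RECEIVED_METHODS` maps to e-filed.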

__Note:__ the `action` query parameter determines whether we get the actual form with the image of the label (`action=publicFormDisplay`) or the more minimal COLA detail page (`action=publicDisplaySearchBasic`).
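
Since only the `action` and `ttbid` values vary, the two kinds of URL can be built by one helper. A minimal sketch, where `cola_url`, `BASE_URL` and `ACTIONS` are hypothetical names and only the two actions mentioned above are assumed to be of interest:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ttbonline.gov/colasonline/viewColaDetails.do'

# The two actions discussed in this notebook; the site may support others.
ACTIONS = ('publicDisplaySearchBasic', 'publicFormDisplay')


def cola_url(ttbid, action='publicDisplaySearchBasic'):
    """Return the detail-page URL for a TTB ID and one of the known actions."""
    if action not in ACTIONS:
        raise ValueError('unknown action: %r' % (action,))
    # urlencode keeps the dict's insertion order: action first, then ttbid.
    return BASE_URL + '?' + urlencode({'action': action, 'ttbid': ttbid})


basic_url = cola_url(17115001000140)
form_url = cola_url(17115001000140, action='publicFormDisplay')
```

Passing `params=` to `requests.get`, as done in the cells below, achieves the same thing; the helper is mainly useful when a plain URL string is needed.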

# Imports

In [42]:
%matplotlib inline
import matplotlib
import seaborn as sns

import requests
from bs4 import BeautifulSoup

# Basic scraping

### Scraping publicDisplaySearchBasic

In [6]:
url = r'https://www.ttbonline.gov/colasonline/viewColaDetails.do'
params = {'action': 'publicDisplaySearchBasic',
          'ttbid': 17115001000140}
response = requests.get(url, params=params)
soup = BeautifulSoup(response.text, 'html5lib')

Most of the relevant info that we want (at least initially) is stored in `div` elements with `class=box`. There should be two of them.

In [12]:
assert len(soup.select('div.box')) == 2

In [None]:
soup.select('div.box')

### Scraping publicFormDisplay

In [13]:
url = r'https://www.ttbonline.gov/colasonline/viewColaDetails.do'
params = {'action': 'publicFormDisplay',
          'ttbid': 17115001000140}
response = requests.get(url, params=params)
soup = BeautifulSoup(response.text, 'html5lib')

In [22]:
response.url

'https://www.ttbonline.gov/colasonline/viewColaDetails.do?action=publicFormDisplay&ttbid=17115001000140'

In [30]:
imgs = soup.select('img[src]')
imgs

[<img 193"="" alt="Authorized Signature" height="49" src="/colasonline/publicViewSignature.do?filename=MGWebster1.jpg
                         			&amp;source=c width="/>,
 <img alt="Label Image: Brand (front)" height="650" src="/colasonline/publicViewAttachment.do?filename=LEMONGRASSKEG_COLA.jpg&amp;filetype=l"/>]

The first image is the signature, so indexing by position works here, but we could perhaps select the label more robustly by looking at the `alt` attribute.
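
One way to filter on `alt` is sketched below. The helper name `label_images` is hypothetical, and it assumes that label images always have an `alt` starting with `Label Image`, which has only been seen on this one page. It relies solely on `.get`, so it works on BeautifulSoup tags as well as on the plain dicts used here as stand-ins.

```python
def label_images(imgs, prefix='Label Image'):
    """Keep only the images whose alt text starts with the given prefix."""
    # `img.get('alt')` may be None when the attribute is missing.
    return [img for img in imgs if (img.get('alt') or '').startswith(prefix)]


# Plain dicts standing in for the two <img> tags found above.
example_imgs = [
    {'alt': 'Authorized Signature',
     'src': '/colasonline/publicViewSignature.do?filename=MGWebster1.jpg'},
    {'alt': 'Label Image: Brand (front)',
     'src': '/colasonline/publicViewAttachment.do?filename=LEMONGRASSKEG_COLA.jpg&filetype=l'},
]
labels = label_images(example_imgs)
```

In the notebook itself this would be called as `label_images(imgs)`.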

In [32]:
imgs[1]['src']

'/colasonline/publicViewAttachment.do?filename=LEMONGRASSKEG_COLA.jpg&filetype=l'

In [40]:
img_url = 'https://www.ttbonline.gov' + imgs[1]['src']
img_url

'https://www.ttbonline.gov/colasonline/publicViewAttachment.do?filename=LEMONGRASSKEG_COLA.jpg&filetype=l'
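
As an alternative to concatenating the host by hand, the relative `src` can be resolved against the page URL with the standard library's `urljoin`. A small sketch using the values shown above (`page_url` and `src` are just local names for them):

```python
from urllib.parse import urljoin

# The page the image was found on (the value of response.url above).
page_url = ('https://www.ttbonline.gov/colasonline/viewColaDetails.do'
            '?action=publicFormDisplay&ttbid=17115001000140')
# The relative src of the label image (the value of imgs[1]['src'] above).
src = '/colasonline/publicViewAttachment.do?filename=LEMONGRASSKEG_COLA.jpg&filetype=l'

# A src starting with '/' replaces the path and query of the page URL.
img_url = urljoin(page_url, src)
```

This also handles a `src` that is already absolute or that is relative to the current directory, which plain string concatenation would not.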

In [43]:
%%html
<img src="https://www.ttbonline.gov/colasonline/publicViewAttachment.do?filename=LEMONGRASSKEG_COLA.jpg&filetype=l">