# Concept

The goal here is to build a sequential web scraper keyed on the TTB ID. I think the simplest approach is to start at a known date and increment the sequence number until we get an error back. Since the IDs are _supposedly_ sequential, we can iterate easily. A good first test is to try a small range. It may also be worth getting the parallel scraping tools operational.

<div class="alert alert-block alert-info">
TTB ID - This is a unique, 14-digit number assigned by TTB to track each COLA. The first 5 digits represent the calendar year and Julian date the application was received by TTB. The next 3 digits tell how the application was received (001 = e-filed; 002 & 003 = mailed/overnight; 000 = hand delivered). The last 6 digits are a sequential number that resets for each day and for each received code.
</div>
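
To make the layout above concrete, here is a minimal sketch that assembles and splits a TTB ID. The helper names `build_ttbid` and `parse_ttbid` are made up for illustration and are not part of `TTB_scraping`. It assumes the two-digit year can be read with Python's `%y` rule (00-68 map to 20xx), which is fine for the 2016 data used here.

```python
import datetime


def build_ttbid(received, receive_code, seq_num):
    """Assemble a 14-digit TTB ID from its three parts (hypothetical helper)."""
    julian = received.strftime('%y%j')  # 2-digit year + 3-digit day of year
    return '{}{:03d}{:06d}'.format(julian, receive_code, seq_num)


def parse_ttbid(ttbid):
    """Split a TTB ID into (received date, receive code, sequence number)."""
    if len(ttbid) != 14 or not ttbid.isdigit():
        raise ValueError('expected 14 digits, got {!r}'.format(ttbid))
    received = datetime.datetime.strptime(ttbid[:5], '%y%j').date()
    return received, int(ttbid[5:8]), int(ttbid[8:])


first_efiled = build_ttbid(datetime.date(2016, 1, 1), 1, 1)
parts = parse_ttbid('16366003000001')
print(first_efiled, parts)
```

Keeping the ID as a string matters: the leading zeros in the receive code and sequence number would be lost if it were stored as an integer.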

# Imports

In [57]:
import requests
from bs4 import BeautifulSoup
import re

import datetime
import pymongo
import warnings

import pandas as pd


import sys
sys.path.append(r'../ScrapingTools')
from TTB_scraping import TTB_Scraper
from time import sleep


### Early prototypes

In [58]:
start_date = '01/30/2016'
stop_date = '01/1/2017'

In [59]:
# Set up connection to mongodb
client = pymongo.MongoClient() # Connect to default client
db = client.TTB # Get a database (note: lazy evaluation)
TTB = db.TTB # the actual collection

# convert dates to datetime format
date_start = datetime.datetime.strptime(start_date, '%m/%d/%Y')
date_stop = datetime.datetime.strptime(stop_date, '%m/%d/%Y')

# iterate over each date
curr_date = date_start
while (curr_date < date_stop):

    # iterate over each receive code
    curr_reccode = 0
    while curr_reccode <= 3:

        # increment each sequence 
        cont_seq = True
        curr_seqnum = 1
        while cont_seq:
            # prep the strings for the ttbid
            jdate='{year}{day}'.format(year=curr_date.strftime('%y'), day=curr_date.strftime('%j'))
            reccode='{:03d}'.format(curr_reccode)
            seqnum='{:06d}'.format(curr_seqnum)

            # prep the query
            ttbid = '{jdate}{reccode}{seqnum}'.format(jdate=jdate, reccode=reccode, seqnum=seqnum)

            query = TTB_Scraper(ttbid)
            parsed_data = query.get_basic_form_data()

            # if we got a valid response
            if parsed_data:
                query_data = {'_id': ttbid,
                         'recieve_date':curr_date.strftime('%m/%d/%Y'),
                         'recieve_code': reccode,
                         'seq_num': seqnum}
                
                # merged record we will add to our database
                output = {**query_data, **parsed_data}

                # cont_seq is already True here, so just move to the next sequence number
                curr_seqnum += 1
                # Insert result into database
                try:
                    TTB.insert_one(output)
                    #print('Successfully added: {}'.format(ttbid))
                except pymongo.errors.DuplicateKeyError:
                    warnings.warn('_id: {ttbid} is already in database, skipping...'.format(ttbid=ttbid))
            else:
                cont_seq = False
            sleep(0.1)
        curr_reccode += 1
    curr_date += datetime.timedelta(days=1)


In [63]:
ttbid

'16366003000001'

In [9]:
output

{'ApprovalDate': '01/20/2016',
 'BrandName': 'RUBOR VITICULTORES',
 'Class/TypeCode': 'DESSERT /PORT/SHERRY/(COOKING) WINE',
 'ContactInformation': 'CHRISTOPHERTERRELL\nPhone Number:(510) 717-4829\nFax Number:(419) 710-4829\n',
 'FancifulName': '',
 'ForSaleIn': '',
 'Formula': '',
 'GrapeVarietal(s)': 'N/A',
 'OriginCode': 'SPAIN',
 'PlantRegistry/BasicPermit/BrewersNo(Other)': '',
 'PlantRegistry/BasicPermit/BrewersNo(PrincipalPlaceofBusiness)': 'CA-I-15980\nTERRELL WINES, CHRISTOPHER JAMES TERRELL\n751 13TH ST , TREASURE ISLAND\nSAN FRANCISCO, CA 94130\n',
 'Qualifications': 'TTB has not reviewed this label for type size, characters per inch or contrasting background.The responsible industry member must continue to ensure that the mandatory information on the actual labels is displayed in the correct type size, number of characters per inch, and on a contrasting background in accordance with the TTB labeling regulations, 27 CFR parts 4, 5, 7, and 16, as applicable.\nTTB has not revi

In [13]:
assert(output['TTBID'] == output['_id'])

In [14]:
# Set up connection to mongodb
client = pymongo.MongoClient() # Connect to default client
db = client.TTB # Get a database (note: lazy evaluation)
TTB = db.TTB # the actual collection

In [26]:
try:
    res = TTB.insert_one(output)
except pymongo.errors.DuplicateKeyError:
    warnings.warn('_id: {ttbid} is already in database, skipping...'.format(ttbid=ttbid))



In [17]:
res.inserted_id

'16001001000001'

In [None]:
def build_ttb_database(start_date, stop_date):
    pass  # stub: the prototype loop above has not been moved in here yet
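
One way the stub above might be filled in is sketched below. This is not the notebook's final implementation. To keep it runnable without the scraper or a database, it takes two hypothetical callables: `fetch(ttbid)`, standing in for `TTB_Scraper(ttbid).get_basic_form_data()`, and `store(record)`, standing in for `TTB.insert_one`. It also takes `datetime.date` objects rather than date strings, and adds an assumed `max_misses` parameter so a sequence can survive small gaps if the IDs turn out not to be strictly sequential.

```python
import datetime


def iter_ttbids(day, receive_code, start=1):
    """Yield candidate TTB IDs for one day and receive code, counting upward."""
    prefix = '{}{:03d}'.format(day.strftime('%y%j'), receive_code)
    seq = start
    while True:
        yield '{}{:06d}'.format(prefix, seq)
        seq += 1


def build_ttb_database(start_date, stop_date, fetch, store, max_misses=1):
    """Walk every day in [start_date, stop_date) and every receive code.

    fetch(ttbid) returns a dict of form data, or a falsy value on a miss.
    store(record) persists one record.
    A sequence is abandoned after max_misses consecutive misses.
    Returns the number of records stored.
    """
    stored = 0
    day = start_date
    while day < stop_date:
        for receive_code in range(4):  # receive codes 000 through 003
            misses = 0
            for ttbid in iter_ttbids(day, receive_code):
                data = fetch(ttbid)
                if not data:
                    misses += 1
                    if misses >= max_misses:
                        break
                    continue
                misses = 0
                # Key spellings match the columns already in the collection
                store({'_id': ttbid,
                       'recieve_date': day.strftime('%m/%d/%Y'),
                       'recieve_code': ttbid[5:8],
                       'seq_num': ttbid[8:],
                       **data})
                stored += 1
        day += datetime.timedelta(days=1)
    return stored


# Toy stand-ins: a dict plays the website and a list plays the collection
fake_site = {'16001001000001': {'BrandName': 'EXAMPLE A'},
             '16001001000002': {'BrandName': 'EXAMPLE B'}}
records = []
count = build_ttb_database(datetime.date(2016, 1, 1), datetime.date(2016, 1, 2),
                           fake_site.get, records.append)
print(count, [r['_id'] for r in records])
```

Any politeness delay, such as the `sleep(0.1)` in the prototype, would belong inside `fetch`, and duplicate-key handling would belong inside `store`.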

# Getting data from the Mongo DB

In [None]:
# Set up connection to mongodb
client = pymongo.MongoClient() # Connect to default client
db = client.TTB # Get a database (note: lazy evaluation)
TTB = db.TTB # the actual collection

In [62]:
TTB.count_documents({}) # number of documents in the collection (Collection.count() was removed in PyMongo 4)

3978

In [32]:
a = TTB.distinct('TTBID') # list of distinct TTBID's
len(a)

7

In [34]:
TTB.find_one('16001001000002')

{'ApprovalDate': '01/19/2016',
 'BrandName': 'RUBOR VITICULTORES',
 'Class/TypeCode': 'DESSERT /PORT/SHERRY/(COOKING) WINE',
 'ContactInformation': 'CHRISTOPHERTERRELL\nPhone Number:(510) 717-4829\nFax Number:(419) 710-4829\n',
 'FancifulName': '',
 'ForSaleIn': '',
 'Formula': '',
 'GrapeVarietal(s)': 'N/A',
 'OriginCode': 'SPAIN',
 'PlantRegistry/BasicPermit/BrewersNo(Other)': '',
 'PlantRegistry/BasicPermit/BrewersNo(PrincipalPlaceofBusiness)': 'CA-I-15980\nTERRELL WINES, CHRISTOPHER JAMES TERRELL\n751 13TH ST , TREASURE ISLAND\nSAN FRANCISCO, CA 94130\n',
 'Qualifications': 'TTB has not reviewed this label for type size, characters per inch or contrasting background.The responsible industry member must continue to ensure that the mandatory information on the actual labels is displayed in the correct type size, number of characters per inch, and on a contrasting background in accordance with the TTB labeling regulations, 27 CFR parts 4, 5, 7, and 16, as applicable.\nTTB has not revi

We can print out some basic stats like so:

In [61]:
# print collection statistics
#print(db.command("collstats", "TTB"))

# print database statistics
print(db.command({"dbstats": 1,  'scale': 1024}))

{'db': 'TTB', 'collections': 1, 'views': 0, 'objects': 3978, 'avgObjSize': 1664.4544997486173, 'dataSize': 6466.015625, 'storageSize': 1756.0, 'numExtents': 0, 'indexes': 1, 'indexSize': 84.0, 'ok': 1.0}
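
As a quick sanity check on how to read these numbers, the sketch below copies three values from the output above into a plain dict (no database connection needed). With `scale` set to 1024 the size fields are reported in KiB, while `avgObjSize` stays in bytes, so `objects * avgObjSize / scale` should reproduce `dataSize`.

```python
# Values copied from the dbstats output above
stats = {'objects': 3978,
         'avgObjSize': 1664.4544997486173,  # bytes, not affected by scale
         'dataSize': 6466.015625}           # KiB, because scale=1024
scale = 1024

# Recompute the uncompressed data size from the object count and average size
recomputed_kib = stats['objects'] * stats['avgObjSize'] / scale
print(recomputed_kib, stats['dataSize'])
```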


Estimate for one year's worth of entries

In [50]:
(208/408) * 147073

74978.39215686274
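
Taking that entry estimate at face value, a rough storage projection can be sketched from the `dbstats` figures printed earlier. It assumes the average document size and the ratio of `storageSize` to `dataSize` seen in the current collection both carry over to a full year, which is a guess rather than something measured.

```python
# Entry estimate from the cell above
est_entries = (208 / 408) * 147073

# Figures copied from the dbstats output (sizes in KiB, avgObjSize in bytes)
avg_obj_bytes = 1664.4544997486173
data_kib = 6466.015625
storage_kib = 1756.0

# Uncompressed size, then scaled by the observed on-disk ratio
est_data_mib = est_entries * avg_obj_bytes / 1024 ** 2
est_storage_mib = est_data_mib * (storage_kib / data_kib)
print('{:.0f} entries, about {:.0f} MiB of data, about {:.0f} MiB on disk'.format(
    est_entries, est_data_mib, est_storage_mib))
```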

# Mongo into Pandas

The following snippet _should_ load every document in our MongoDB collection into a list, which pandas then parses into a DataFrame.

In [64]:
df = pd.DataFrame(list(TTB.find()))

In [65]:
df.columns

Index(['ApprovalDate', 'BrandName', 'Class/TypeCode', 'ContactInformation',
       'FancifulName', 'ForSaleIn', 'Formula', 'GrapeVarietal(s)',
       'OriginCode', 'PlantRegistry/BasicPermit/BrewersNo(Other)',
       'PlantRegistry/BasicPermit/BrewersNo(PrincipalPlaceofBusiness)',
       'Qualifications', 'Serial#', 'Status', 'TTBID', 'TotalBottleCapacity',
       'TypeofApplication', 'VendorCode', 'WineVintage', '_id', 'recieve_code',
       'recieve_date', 'seq_num'],
      dtype='object')
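
All of these columns arrive as strings, so dates need parsing before they are useful. The sketch below shows one way to do that on a single stand-in record instead of `list(TTB.find())`. The `_id`, `ApprovalDate` and `BrandName` values come from the `find_one` output above. The `TTBID` and `recieve_date` values are assumptions derived from the ID (day 001 of 2016) and from the earlier check that `TTBID` equals `_id`. The `days_to_approval` column is a made-up name for illustration.

```python
import pandas as pd

# Stand-in for list(TTB.find()); see the lead-in for where each value comes from
records = [{'_id': '16001001000002',
            'TTBID': '16001001000002',
            'ApprovalDate': '01/19/2016',
            'BrandName': 'RUBOR VITICULTORES',
            'recieve_date': '01/01/2016'}]

df = pd.DataFrame(records)

# Parse the MM/DD/YYYY strings into real timestamps
for col in ('ApprovalDate', 'recieve_date'):
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

# Days between receipt and approval
df['days_to_approval'] = (df['ApprovalDate'] - df['recieve_date']).dt.days
print(df[['TTBID', 'recieve_date', 'ApprovalDate', 'days_to_approval']])
```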

In [66]:
df['TTBID']

0       16001001000001
1       16001001000002
2       16001001000003
3       16001001000004
4       16001001000005
5       16001001000006
6       16001001000007
7       16002001000001
8       16002001000002
9       16002001000003
10      16002001000004
11      16003001000001
12      16003001000002
13      16003001000003
14      16003001000004
15      16003001000005
16      16003001000006
17      16003001000007
18      16003001000008
19      16003001000009
20      16003001000010
21      16003001000011
22      16003001000012
23      16003001000013
24      16003001000014
25      16003001000015
26      16003001000016
27      16004001000001
28      16004001000002
29      16004001000003
             ...      
3948    16364001000006
3949    16364001000007
3950    16364001000008
3951    16364001000009
3952    16364001000010
3953    16364001000011
3954    16364001000012
3955    16365001000001
3956    16365001000002
3957    16365001000003
3958    16365001000004
3959    16365001000005
3960    163