## Full text search on WRDS SEC Analytics Suite

Link: [WRDS SEC Analytics Suite](https://wrds-web.wharton.upenn.edu/wrds/ds/sec/search/fsearch.cfm?navId=359)
    
Search for: tax
    
In: Correspondence
    
Period: Jan 2016 - Jan 2018

4,629 results

## Load the results (CSV) and download the filings

In [3]:
import csv

# read the search results into a list of dictionaries (one per filing)
with open('WRDS_SEC_search_tax_in_comment_letters.csv', newline='') as f:
    letters = [dict(row) for row in csv.DictReader(f, skipinitialspace=True)]

letters[0:2]

[{'id': '000157/1571371/0001104659-17-053266.txt',
  'conformed_submission_type': 'CORRESP',
  'filer_central_index_key': '0001571371',
  'filer_former_company_former_conformed_name': '',
  'filer_company_conformed_name': 'Summit Materials, LLC',
  'accession_number': '0001104659-17-053266',
  'filed_as_of_date': '2017-08-23 00:00:00+00:00',
  'conformed_period_of_report': '2017-12-27 00:00:00+00:00'},
 {'id': '000162/1621563/0001104659-17-053266.txt',
  'conformed_submission_type': 'CORRESP',
  'filer_central_index_key': '0001621563',
  'filer_former_company_former_conformed_name': '',
  'filer_company_conformed_name': 'Summit Materials, Inc.',
  'accession_number': '0001104659-17-053266',
  'filed_as_of_date': '2017-08-23 00:00:00+00:00',
  'conformed_period_of_report': '2017-12-27 00:00:00+00:00'}]
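The two rows above share the accession number 0001104659-17-053266 and differ only in the filer, so the same letter is listed once per co-registrant. A minimal sketch of dropping such repeats, assuming one copy per accession number is enough for the analysis; `unique_by_accession` is a hypothetical helper, not part of the original notebook.

```python
def unique_by_accession(rows):
    """Keep the first row for each accession number, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = row['accession_number']
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# two rows shaped like the output above (only the fields used here)
sample = [
    {'id': '000157/1571371/0001104659-17-053266.txt',
     'accession_number': '0001104659-17-053266'},
    {'id': '000162/1621563/0001104659-17-053266.txt',
     'accession_number': '0001104659-17-053266'},
]
print(len(unique_by_accession(sample)))
```

Applied to the full list, this would be `letters = unique_by_accession(letters)` before downloading.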

In [4]:
# example id: 000088/880417/0000906318-03-000077.txt
# needs to become: https://www.sec.gov/Archives/edgar/data/880417/000090631803000077/0000906318-03-000077.txt
# this function turns the shortened WRDS id into the full EDGAR URL
def idToUrl(url):
    # split the id into three pieces (e.g. '000088', '880417', and '0000906318-03-000077.txt')
    urlPieces = url.split('/')
    # the folder name is the last piece without the dashes and without the '.txt' extension
    midpiece = urlPieces[2].replace("-", "")[:-4]
    # glue the pieces together
    fullUrl = 'https://www.sec.gov/Archives/edgar/data/' + urlPieces[1] + '/' + midpiece + '/' + urlPieces[2]
    return fullUrl

In [5]:
idToUrl('000157/1571371/0001104659-17-053266.txt')

'https://www.sec.gov/Archives/edgar/data/1571371/000110465917053266/0001104659-17-053266.txt'
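The conversion above trusts that every id has exactly three pieces. A possible stricter variant is sketched below; it assumes ids always look like `prefix/CIK/accession.txt`, with the accession number in the usual 10-2-6 digit layout, and raises an error otherwise. The name `id_to_url_checked` and the pattern are illustrative, not part of the original notebook.

```python
import re

# prefix / CIK / accession number (10-2-6 digits) + '.txt'
ID_PATTERN = re.compile(r'^\d+/(\d+)/(\d{10}-\d{2}-\d{6})\.txt$')

def id_to_url_checked(wrds_id):
    """Build the EDGAR URL, refusing ids that do not match the expected shape."""
    match = ID_PATTERN.match(wrds_id)
    if match is None:
        raise ValueError(f'unexpected id format: {wrds_id!r}')
    cik, accession = match.groups()
    # the folder name is the accession number without dashes
    folder = accession.replace('-', '')
    return f'https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{accession}.txt'

print(id_to_url_checked('000157/1571371/0001104659-17-053266.txt'))
```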

In [9]:
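import os, requests, time

# make sure the output folder exists
os.makedirs('letters', exist_ok=True)

myCounter = 0

for letter in letters[0:20]:
    myCounter += 1
    # get filing
    r = requests.get(idToUrl(letter['id']))
    # write to file (the filing is a .txt submission that embeds the HTML documents)
    with open('letters/' + str(myCounter) + '.html', 'wb') as f:
        f.write(r.content)
    # sleep 1 second between requests
    time.sleep(1)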
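The loop above writes whatever the server returns, including error pages. SEC's EDGAR access guidelines also ask automated clients to identify themselves with a User-Agent header. The sketch below shows one way to handle both. It swaps `requests` for the standard library's `urllib.request` so it runs without extra packages. `USER_AGENT`, `fetch` and `save_filings` are hypothetical names, and the contact string is a placeholder to replace with your own details.

```python
import os
import time
import urllib.request

# assumption: placeholder contact details, replace with your own name and e-mail
USER_AGENT = 'Sample Name sample@example.com'

def fetch(url):
    """Download one URL with an explicit User-Agent; raises on HTTP errors."""
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()

def save_filings(urls, out_dir, fetch=fetch, delay=1.0):
    """Save each URL as out_dir/<counter>.html and return the URLs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for counter, url in enumerate(urls, start=1):
        try:
            content = fetch(url)
        except OSError as err:  # urllib's URLError and HTTPError subclass OSError
            print('Could not download', url, err)
            failed.append(url)
        else:
            with open(os.path.join(out_dir, f'{counter}.html'), 'wb') as f:
                f.write(content)
        # pause between requests
        time.sleep(delay)
    return failed

# usage (not run here because it needs network access):
# failed = save_filings([idToUrl(l['id']) for l in letters[0:20]], 'letters')
```

Passing `fetch` as a parameter keeps the saving logic separate from the network call, so a different downloader (for example one built on `requests`) can be dropped in.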

## Convert HTML files to text files

In [11]:
with open('letters/1.html') as f:
    content = f.read()

In [12]:
# grab the contents of each <DOCUMENT> ... <TEXT> ... </TEXT> block
import re
doc = re.findall(r"<DOCUMENT>.*?<TEXT>(.*?)</TEXT>", content, flags=re.DOTALL)
doc[0][0:100]
len(doc)

2

In [3]:
# need to do: pip install w3lib
from w3lib.html import replace_entities
# function that converts html to text
def cleanhtml(raw_html):
    # strip the tags, then turn entities such as &nbsp; into characters
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return replace_entities(cleantext)

In [4]:
text = cleanhtml( doc[0])
text[0:500]

'\n\n\n\n\n\n\n\n  \n\n\n\xa0\n\n\n\n    \n\n1550   Wynkoop Street, 3rd\xa0Fl   Denver, Colorado 80202\n\xa0\n(303)   893-0012 Office    (303) 893-6993 Fax       \n\n\nsummit-materials.com         \n\xa0\nVIA COURIER AND EDGAR\n\xa0\n\n\n\n\xa0    \n\nAugust\xa023, 2017         \n\xa0\nRe:\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 Summit Materials,\xa0Inc.\nSummit Materials, LLC\nForm\xa010-K for Fiscal Year Ended December\xa031, 2016\nFiled February\xa028, 2017\nFile No.\xa01-36873\nFile No.\xa0333-187556\n\xa0\nTerence O’Brien\nAccounting Branch Chief\nOffice of Manufacturing and Construction\nSecuriti'
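The cleaned text above still contains non-breaking spaces (`\xa0`), runs of spaces and many empty lines. A minimal sketch of tidying these up, assuming the line layout of the letter does not matter for later text analysis; `tidy_whitespace` is a hypothetical helper, not part of the original notebook.

```python
import re

def tidy_whitespace(text):
    """Normalize spaces and drop empty lines."""
    # replace non-breaking spaces with regular spaces
    text = text.replace('\xa0', ' ')
    # collapse runs of spaces and tabs into a single space
    text = re.sub(r'[ \t]+', ' ', text)
    # strip each line and drop the empty ones
    lines = [line.strip() for line in text.split('\n')]
    return '\n'.join(line for line in lines if line)

# a fragment taken from the output above
sample = '\n\n1550   Wynkoop Street, 3rd\xa0Fl   Denver, Colorado 80202\n\xa0\nAugust\xa023, 2017         \n'
print(tidy_whitespace(sample))
```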

In [13]:
import glob, os
# files holds a list of paths to the downloaded html files
files = glob.glob('letters/*.html')
# make sure the output folder exists
os.makedirs('letters_text', exist_ok=True)

for file in files:
    # grab the counter (1, 2, ...) that is part of the path (e.g. 20 in letters\20.html)
    myCounter = re.findall(r'(\d+)\.html', file)[0]
    # read the file
    with open(file) as f:
        content = f.read()
    # grab all documents (exhibits are separate documents)
    doc = re.findall(r"<DOCUMENT>.*?<TEXT>(.*?)</TEXT>", content, flags=re.DOTALL)
    if len(doc) >= 1:
        # clean up the first document
        text = cleanhtml(doc[0])
        # write to disk
        with open('letters_text/' + myCounter + '.txt', 'w', encoding="utf-8") as f:
            f.write(text)
        # all good
        print(myCounter, file, len(content))
    else:
        print('Could not find TEXT in file ', file)

1 letters\1.html 124125
10 letters\10.html 30881
11 letters\11.html 17513
12 letters\12.html 14447
13 letters\13.html 36022
14 letters\14.html 34407
15 letters\15.html 33780
16 letters\16.html 18504
17 letters\17.html 724508
18 letters\18.html 724508
19 letters\19.html 55372
2 letters\2.html 124125
20 letters\20.html 43204
3 letters\3.html 75337
4 letters\4.html 44726
5 letters\5.html 45138
6 letters\6.html 33895
7 letters\7.html 64519
8 letters\8.html 75865
9 letters\9.html 18183
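The listing above is in string order (1, 10, 11, ..., 2, 20, 3) because the paths from `glob` are compared as text. If the files should be processed in download order, they can be sorted on the number in the file name. A small sketch, with `file_number` as a hypothetical helper:

```python
import os
import re

def file_number(path):
    """Return the integer counter in a path such as letters/20.html."""
    # accept both Windows and POSIX separators
    name = os.path.basename(path.replace('\\', '/'))
    return int(re.match(r'(\d+)\.html$', name).group(1))

paths = ['letters\\1.html', 'letters\\10.html', 'letters\\2.html']
print(sorted(paths, key=file_number))
```

In the loop above this would be `for file in sorted(files, key=file_number):`.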


## SEC Master Archive

Link: [http://www.wrds.us/index.php/repository/view/25](http://www.wrds.us/index.php/repository/view/25)
    
The filing dataset holds the CIK, filing date, form type (10-K, 8-K, etc.) and URL of each filing.