## SEC EDGAR web scraping
by Name

#### Step 1: import libraries 

In [1]:
import os
from urllib.request import urlopen

#### Step 2: define the Report class
Each report is stored as a dictionary with 5 elements:

*name* :       the company name

*formtype* :   the form type

*CIK* :        the CIK number of the company

*date* :       the date

*URL* :        the URL of the 10-K filing

##### Dealing with URL
Take an example URL:

https://www.sec.gov/Archives/edgar/data/1798581/0001798581-20-000002-index.htm

Observe that the URL ends with _0001798581-20-000002-index.htm_

The URL of the corresponding 10-K filing text is

https://www.sec.gov/Archives/edgar/data/1798581/000179858120000002/0001798581-20-000002.txt

Therefore, we only need to perform

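texturl = url - _0001798581-20-000002-index.htm_ + _000179858120000002/_ + _0001798581-20-000002.txt_

That is, replace the last part of the URL with a folder named after the accession number without dashes, followed by the accession number with dashes and a _.txt_ extension.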
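
The rewriting rule above can be sketched as a small standalone function. The name `index_url_to_txt_url` is made up for this illustration and is not part of the notebook; it only assumes that the last path segment starts with the three dash-separated parts shown in the example.

```python
def index_url_to_txt_url(index_url):
    """Turn a filing index URL into the URL of the full-text .txt filing."""
    # split off the last path segment, e.g. '0001798581-20-000002-index.htm'
    base, _, last = index_url.rpartition('/')
    # keep the first three dash-separated parts: ['0001798581', '20', '000002']
    parts = last.split('-')[:3]
    folder = ''.join(parts)              # number without dashes
    filename = '-'.join(parts) + '.txt'  # number with dashes
    return f"{base}/{folder}/{filename}"


example = 'https://www.sec.gov/Archives/edgar/data/1798581/0001798581-20-000002-index.htm'
print(index_url_to_txt_url(example))
```

For the example URL, this prints the text URL shown earlier in this section.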

In [2]:
# define a function that builds the dictionary holding the report info
def gen_dict(name, formtype, cik, date, url):
    D = dict()
    # collapse repeated spaces and remove the trailing spaces for name
    D['name'] = ' '.join(name.split())

    # collapse repeated spaces and remove the trailing spaces for formtype
    D['formtype'] = ' '.join(formtype.split())

    # remove the surrounding whitespace for CIK
    D['CIK'] = cik.strip()

    # remove the surrounding whitespace for date
    D['date'] = date.strip()

    # remove the surrounding whitespace (including the newline) for url
    D['URL'] = url.strip()

    return D
        
# define the download function
def download(target, path='./Reports'):
    ''' Download the 10-K filing and save it in the folder'''
    # create the output directory if it does not exist yet
    os.makedirs(path, exist_ok=True)

    # modify the link so that we get the correct url for the .txt filing:
    # the folder is the serial number without dashes, the file name keeps them
    l = target['URL'].split('/')
    s = l[-1]
    serialnbr = s.split('-')[0:3]
    txturl = '/'.join(l[0:-1]) + '/' + ''.join(serialnbr) + '/' + '-'.join(serialnbr) + '.txt'

    # get the text from url and decode the raw bytes into a string
    # (str() on bytes would keep the b'...' wrapper and escaped newlines)
    with urlopen(txturl) as response:
        text = response.read().decode('utf-8', errors='replace')

    # write the text file
    fname = os.path.join(path, '-'.join(serialnbr) + '.txt')
    with open(fname, 'w', encoding='utf-8') as f:
        f.write(text)
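
A side note on the request itself: to my knowledge, sec.gov asks automated clients to identify themselves through a `User-Agent` header, and a bare `urlopen(txturl)` may be refused. The sketch below shows one way to attach such a header with the standard library. The helper names `build_request` and `fetch_text` and the contact string are hypothetical, so please adapt them.

```python
from urllib.request import Request, urlopen

# Hypothetical contact string; replace it with your own name and email.
USER_AGENT = 'Sample Company Name admin@example.com'


def build_request(url, user_agent=USER_AGENT):
    """Wrap a URL in a Request object that carries a User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_text(url):
    """Fetch a URL and decode the body as text (this performs a network call)."""
    with urlopen(build_request(url)) as response:
        return response.read().decode('utf-8', errors='replace')
```

Inside `download`, the call `urlopen(txturl)` could then be swapped for `urlopen(build_request(txturl))`.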


#### Step 3: fetch data from crawler.idx
Since the values in _crawler.idx_ are separated by neither tabs nor commas, we need to split them manually.

In fact, the columns are padded with a variable number of spaces so that they line up visually.

Here, I use a simple method: take each field together with its padding spaces, then remove the spaces, so in the end we get a clean string without unnecessary spaces.
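
As a small illustration of the space-removal step, here is a sketch on a made-up padded value (the company name is invented for the example). `str.split()` without arguments splits on any run of whitespace, so joining the pieces with a single space drops both the padding and any repeated spaces.

```python
# a made-up field as it might look after slicing, padded with spaces
raw = 'ACME HOLDINGS  INC                    '

# split on runs of whitespace, then join with single spaces
clean = ' '.join(raw.split())
print(repr(clean))
```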

In [3]:
# get all lines from crawler.idx
with open('./crawler.idx') as f:
    lines = f.readlines()


In [4]:
# get the table header
header = lines[7]
# ending position of the first column (company name);
# in the variable name, 'e' means ending
pos1_e = 61
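
Instead of hard-coding the position, the column starts could also be read off the header line. The sketch below is only an illustration: `sample_header` is a hypothetical header padded to widths 62/12/12/12, which matches the offsets used in this notebook, and `split_fixed` is a made-up helper name. The real header in _crawler.idx_ should be checked before relying on this.

```python
# hypothetical header line, padded to match the offsets used in this notebook
sample_header = f"{'Company Name':<62}{'Form Type':<12}{'CIK':<12}{'Date Filed':<12}URL"

# the start of each column is where its title begins in the header
titles = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'URL']
starts = [sample_header.index(t) for t in titles]
print(starts)  # [0, 62, 74, 86, 98]


def split_fixed(line, starts):
    """Slice a line at the given column starts and strip the padding."""
    ends = starts[1:] + [len(line)]
    return [line[b:e].strip() for b, e in zip(starts, ends)]


# a made-up data line laid out with the same widths
sample_line = f"{'ACME HOLDINGS INC':<62}{'10-K':<12}{'1798581':<12}{'2020-01-02':<12}https://example.com/x-index.htm\n"
print(split_fixed(sample_line, starts))
```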

In [5]:
# skip the description and header lines at the top of the file
# (note that [9:-1] also drops the last line)
lines = lines[9:-1]

#### Step 4: download all the filing texts listed in crawler.idx

In [None]:
# parse each line into a report dictionary and download the filing
for line in lines:
    valuelist = list()
    # retrieve values from the current line
    valuelist.append(line[:pos1_e])
    valuelist.append(line[pos1_e+1:pos1_e+12])
    valuelist.append(line[pos1_e+13:pos1_e+24])
    valuelist.append(line[pos1_e+25:pos1_e+36])
    valuelist.append(line[pos1_e+37:])
    # create the report dictionary from the extracted values
    target = gen_dict(valuelist[0],valuelist[1],valuelist[2],valuelist[3],valuelist[4])
    # download the filing in text form 
    download(target)
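
Because the loop above stops at the first failed request, a more forgiving variant may be useful for long index files. The following is a minimal sketch: `download_all` is a hypothetical helper, the download function is passed in as a parameter, and the 0.2 second pause is an arbitrary value chosen only to be gentle with the server.

```python
import time


def download_all(targets, download_fn, delay=0.2):
    """Call download_fn on every target and return the targets that failed."""
    failed = []
    for target in targets:
        try:
            download_fn(target)
        except Exception as err:
            # report the failure and keep going with the next filing
            print('failed:', target.get('URL'), err)
            failed.append(target)
        # short pause between requests
        time.sleep(delay)
    return failed
```

With the functions of this notebook, the targets would be the dictionaries returned by `gen_dict` and `download_fn` would be `download`. The returned list can be passed to `download_all` again to retry.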