Forms:
    10-K
    10-Q
    8-K
    Proxy Statement
    Forms 3, 4, 5
    Schedule 13D
    144

10-K:
    NLP:
        Business?
        Risk Factors
        Unresolved Staff Comments
        Properties
        Legal Proceedings
        Mine Safety Disclosures
        

In [1]:
import pandas as pd
import re
import requests
import unicodedata
from bs4 import BeautifulSoup



## Get txt document from SEC website

In [2]:
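# URL of the full-text submission file
filing_url = r"https://www.sec.gov/Archives/edgar/data/1318605/000119312511054847/0001193125-11-054847.txt"

# SEC asks automated clients to identify themselves; the value below is a placeholder
headers = {'User-Agent': 'Sample Name name@example.com'}

# get response
response = requests.get(filing_url, headers=headers)

# parse response
soup = BeautifulSoup(response.content, 'lxml')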
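
The URL above appears to follow a fixed layout: the CIK without leading zeros, the accession number without dashes as a folder, then the dashed accession number plus `.txt`. A minimal sketch that rebuilds it, assuming this layout also holds for other filings; `build_filing_url` is a hypothetical helper name, not part of any library.

```python
def build_filing_url(cik, accession_number):
    """Build the URL of a full-text submission file (hypothetical helper).

    Assumes the path layout seen in the URL above:
    /Archives/edgar/data/<CIK>/<accession without dashes>/<accession>.txt
    """
    base = "https://www.sec.gov/Archives/edgar/data"
    # The folder name is the accession number with the dashes removed.
    folder = accession_number.replace("-", "")
    # The CIK appears in the path without leading zeros.
    return "{}/{}/{}/{}.txt".format(base, int(cik), folder, accession_number)


url = build_filing_url("0001318605", "0001193125-11-054847")
print(url)
```

Both arguments are taken from the SEC header of this filing (CENTRAL INDEX KEY and ACCESSION NUMBER).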

## Decode Text

In [3]:
from unicodedata import normalize

print('%r' % normalize('NFD', u'\u00C7'))  # decompose: convert Ç to "C + ̧"
print('%r' % normalize('NFC', u'C\u0327')) # compose: convert "C + ̧" to Ç

'Ç'
'Ç'
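
NFC and NFD only compose and decompose characters. For cleaning filing text, the compatibility form NFKC may be more useful, because it maps the no-break space `\xa0` to an ordinary space. A small sketch; the sample string is made up for illustration.

```python
import unicodedata

# Made-up sample with no-break spaces (U+00A0) between the words.
raw = "Item\xa01A.\xa0Risk Factors"

# NFKC applies compatibility mappings, so U+00A0 becomes a plain space (U+0020).
clean = unicodedata.normalize("NFKC", raw)

print(repr(raw))
print(repr(clean))
```

Note that NFKC also rewrites other compatibility characters (ligatures, superscripts, full-width forms), so it may change more than whitespace.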


## Get Important Text Positions


Business 

Risk Factors
Unresolved Staff Comments
Properties
Legal Proceedings
Reserved  

Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities 
Selected Financial Data
Management’s Discussion and Analysis of Financial Condition and Results of Operations
Quantitative and Qualitative Disclosures About Market Risk
Financial Statements and Supplementary Data
Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
Controls and Procedures 
Other Information

Directors, Executive Officers and Corporate Governance
Executive Compensation
Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
Certain Relationships and Related Transactions, and Director Independence
Principal Accountant Fees and Services

Exhibits and Financial Statement Schedules
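
One possible way to get the positions of these titles is a case-insensitive search that tolerates any whitespace between words. The sketch below uses only a few of the titles and a made-up sample string; `find_title_positions` is a hypothetical helper name.

```python
import re

# A few of the section titles listed above.
SECTION_TITLES = [
    "Business",
    "Risk Factors",
    "Unresolved Staff Comments",
    "Properties",
    "Legal Proceedings",
]


def find_title_positions(text, titles):
    """Return {title: [start offsets]} for every occurrence of each title."""
    positions = {}
    for title in titles:
        # Allow any run of whitespace (including \xa0) between the words.
        pattern = r"\s+".join(re.escape(word) for word in title.split())
        positions[title] = [
            m.start() for m in re.finditer(pattern, text, flags=re.IGNORECASE)
        ]
    return positions


# Made-up sample text, not taken from the filing.
sample = "ITEM 1. BUSINESS ... ITEM 1A. RISK\xa0FACTORS ... Item 2. Properties"
positions = find_title_positions(sample, SECTION_TITLES)
print(positions)
```

A title can occur many times in a real filing (table of contents, cross-references), so each list of offsets would still need filtering.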

In [30]:
u'\u00C7', u'\xa0'

('Ç', '\xa0')

In [16]:
def restore_windows_1252_characters(restore_string):
    """
        Replace C1 control characters in the Unicode string restore_string
        by the characters at the corresponding code points in Windows-1252,
        where possible.
    """

    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''

    # C1 control characters span U+0080 to U+009F.
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, restore_string)
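
To illustrate the mapping the function above relies on, the short sketch below decodes two single bytes with Python's `windows-1252` codec: one that has a character assigned and one that does not.

```python
# Byte 0x93 is a left double quotation mark in Windows-1252.
left_quote = bytes([0x93]).decode("windows-1252")
print(repr(left_quote))

# Byte 0x81 has no assignment in Python's Windows-1252 codec, so decoding
# raises UnicodeDecodeError; the function above drops such characters.
try:
    bytes([0x81]).decode("windows-1252")
    undefined_raises = False
except UnicodeDecodeError:
    undefined_raises = True
print(undefined_raises)
```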




# def remove_newlines(string):
#     # \p, \xa0

## Define a master dictionary to house all filings

In [5]:
# define new dict to house all filings
master_filings_dict = {}

#define a new key for each filing (the accession number from the SEC header of this filing)
accession_number = '0001193125-11-054847'

#add key to dict and add new level
master_filings_dict[accession_number] = {} 

#add next levels
master_filings_dict[accession_number]['sec_header_content'] = {}
master_filings_dict[accession_number]['filing_documents'] = None

## Extracting SEC header tag

In [6]:
#grab the sec header doc
sec_header_tag = soup.find('sec-header')
#store header info in dict
master_filings_dict[accession_number]['sec_header_content']['sec_header_text'] = sec_header_tag
sec_header_tag

<sec-header>0001193125-11-054847.hdr.sgml : 20110303
<acceptance-datetime>20110303144736
ACCESSION NUMBER:		0001193125-11-054847
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		8
CONFORMED PERIOD OF REPORT:	20101231
FILED AS OF DATE:		20110303
DATE AS OF CHANGE:		20110303

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			TESLA MOTORS INC
		CENTRAL INDEX KEY:			0001318605
		STANDARD INDUSTRIAL CLASSIFICATION:	MOTOR VEHICLES &amp; PASSENGER CAR BODIES [3711]
		IRS NUMBER:				912197729
		STATE OF INCORPORATION:			DE

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-34756
		FILM NUMBER:		11659731

	BUSINESS ADDRESS:	
		STREET 1:		3500 DEER CREEK RD
		CITY:			PALO ALTO
		STATE:			CA
		ZIP:			94070
		BUSINESS PHONE:		650-681-5000

	MAIL ADDRESS:	
		STREET 1:		3500 DEER CREEK RD
		CITY:			PALO ALTO
		STATE:			CA
		ZIP:			94070
</acceptance-datetime></sec-header>
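
The header is mostly `KEY: value` lines, so it can be turned into a dictionary with a small regular expression. A rough sketch on a shortened copy of the header text; `parse_header_fields` is a hypothetical helper, and the flat layout is an assumption made for brevity.

```python
import re

# Shortened copy of the header text shown above.
header_text = """ACCESSION NUMBER:\t\t0001193125-11-054847
CONFORMED SUBMISSION TYPE:\t10-K
FILED AS OF DATE:\t\t20110303

FILER:

\tCOMPANY DATA:\t
\t\tCOMPANY CONFORMED NAME:\t\t\tTESLA MOTORS INC
\t\tCENTRAL INDEX KEY:\t\t\t0001318605
"""


def parse_header_fields(text):
    """Collect 'KEY: value' lines into a flat dict (hypothetical helper).

    Lines with nothing after the colon (section labels such as 'FILER:')
    are skipped. Repeated keys, e.g. the STREET 1 lines of the two address
    blocks, would overwrite each other in this flat layout.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+):\s*(\S.*)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


fields = parse_header_fields(header_text)
print(fields["COMPANY CONFORMED NAME"])
print(fields["CENTRAL INDEX KEY"])
```

On the real tag, `sec_header_tag.get_text()` could supply the input text; the first line of the header contains a colon of its own and would need separate handling.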

## Parsing the docs

In [8]:
#initialize master dict
master_document_dict = {}

#loop through each doc in the filing
for filing_document in soup.find_all('document'):
    
    #define document id
    document_id = filing_document.type.find(text=True, recursive=False).strip()
    
    #document sequence
    document_sequence = filing_document.sequence.find(text=True, recursive=False).strip()
    
    #document filename
    document_filename = filing_document.filename.find(text=True, recursive=False).strip()
    
    #document description
    document_description = filing_document.description.find(text=True, recursive=False).strip()
    
    #insert the key
    master_document_dict[document_id] = {}
    
    
    
    
    master_document_dict[document_id]['document_sequence'] = document_sequence
    master_document_dict[document_id]['document_filename'] = document_filename
    master_document_dict[document_id]['document_description'] = document_description
    
    #add document content
    master_document_dict[document_id]['document_code'] = filing_document.extract()
    
    #get all text
    filing_doc_text = filing_document.find('text').extract()
    master_document_dict[document_id]['filing_doc_text'] = filing_doc_text
    
    #get thematic breaks
    thematic_breaks = filing_doc_text.find_all('hr',{'width':'100%'})
    
    #convert all thematic breaks to string
    all_thematic_breaks = [str(thematic_break) for thematic_break in thematic_breaks]
    
    #prep the document text for splitting - convert to string
    filing_doc_string = str(filing_doc_text)
    
    #if there are thematic breaks
    if len(all_thematic_breaks) > 0:
        
        #define the regex delimiter pattern
        regex_delimiter_pattern = '|'.join(map(re.escape, all_thematic_breaks))
        
        #split doc on each break
        split_filing_string = re.split(regex_delimiter_pattern, filing_doc_string)
        
        #store doc
        master_document_dict[document_id]['pages_code'] = split_filing_string
        
    # handle the case where there are no thematic breaks.
    else:

        # store the document as is, since there are no thematic breaks. In other words, no splitting.
        master_document_dict[document_id]['pages_code'] = [filing_doc_string]
    

    # display some information to the user.
    print('-'*80)
    print('The document {} was parsed.'.format(document_id))
#     print('There was {} page(s) found.'.format(len(all_page_numbers)))
    print('There was {} thematic breaks(s) found.'.format(len(all_thematic_breaks)))
    

# store the documents in the master_filing_dictionary.
master_filings_dict[accession_number]['filing_documents'] = master_document_dict

print('-'*80)
print('All the documents for filing {} were parsed and stored.'.format(accession_number))


# master_document_dict

--------------------------------------------------------------------------------
The document 10-K was parsed.
There was 150 thematic breaks(s) found.
--------------------------------------------------------------------------------
The document EX-10.47 was parsed.
There was 26 thematic breaks(s) found.
--------------------------------------------------------------------------------
The document EX-21.1 was parsed.
There was 0 thematic breaks(s) found.
--------------------------------------------------------------------------------
The document EX-23.1 was parsed.
There was 0 thematic breaks(s) found.
--------------------------------------------------------------------------------
The document EX-31.1 was parsed.
There was 0 thematic breaks(s) found.
--------------------------------------------------------------------------------
The document EX-31.2 was parsed.
There was 0 thematic breaks(s) found.
--------------------------------------------------------------------------------
The do

In [9]:
for x in master_document_dict:
    print(x)

10-K
EX-10.47
EX-21.1
EX-23.1
EX-31.1
EX-31.2
EX-32.1
GRAPHIC


In [10]:
# master_document_dict['10-K']['pages_code']
str(master_document_dict['10-K']['filing_doc_text']).find('Risk Factors')

16375

In [25]:
len(master_document_dict['10-K']['pages_code'])

151
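
The page count is one more than the number of thematic breaks (150 breaks, 151 pages) because splitting a string on N delimiters yields N + 1 pieces. A tiny sketch on made-up markup, using a simplified pattern rather than the escaped tag strings built in the loop above:

```python
import re

# Made-up markup with two thematic breaks.
html = '<p>page one</p><hr width="100%"/><p>page two</p><hr width="100%"/><p>page three</p>'

# Treat each full-width <hr> tag as a delimiter.
pages = re.split(r'<hr\s+width="100%"\s*/?>', html)

print(len(pages))
print(pages[0])
```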

In [17]:
restored_string = restore_windows_1252_characters(master_document_dict['10-K']['pages_code'][0])

In [21]:
restored_string.find('Risk Factors')

-1
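
A result of -1 only says the phrase is not on the first page. To see which pages do contain it, every page can be checked in turn. A minimal sketch with a made-up list standing in for `master_document_dict['10-K']['pages_code']`; `pages_containing` is a hypothetical helper name.

```python
def pages_containing(pages, phrase):
    """Return the indexes of the pages whose text contains the phrase."""
    return [index for index, page in enumerate(pages) if phrase in page]


# Made-up stand-in for the list of page strings.
pages = ["cover page", "table of contents: Risk Factors", "Item 1A. Risk Factors"]
hits = pages_containing(pages, "Risk Factors")
print(hits)
```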

In [39]:
restored_string[7547:7554]

'whether'

# Now we'll break up 10-K documents specifically

### Split up into Parts 1 - 4

In [None]:
Business 

Risk Factors
Unresolved Staff Comments
Properties
Legal Proceedings
Reserved  

Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities 
Selected Financial Data
Management’s Discussion and Analysis of Financial Condition and Results of Operations
Quantitative and Qualitative Disclosures About Market Risk
Financial Statements and Supplementary Data
Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
Controls and Procedures 
Other Information

Directors, Executive Officers and Corporate Governance
Executive Compensation
Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
Certain Relationships and Related Transactions, and Director Independence
Principal Accountant Fees and Services

Exhibits and Financial Statement Schedules
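
The four groups above correspond to Parts I to IV of a 10-K. One way the split might be approached is to look for lines that contain only a part heading and slice the text between them. This is a sketch on made-up text, assuming headings of the form `PART I` on their own line.

```python
import re

# Made-up text; a real filing also lists the parts in its table of contents,
# which this sketch does not handle.
text = """PART I
Item 1. Business
PART II
Item 5. Market for Registrant's Common Equity
PART III
Item 10. Directors
PART IV
Item 15. Exhibits
"""

# Match a line that consists only of "PART" and a Roman numeral I to IV.
part_pattern = re.compile(r"^\s*PART\s+(IV|III|II|I)\s*$", re.MULTILINE | re.IGNORECASE)

matches = list(part_pattern.finditer(text))
parts = {}
for current, following in zip(matches, matches[1:] + [None]):
    end = following.start() if following else len(text)
    # Key by the numeral; keep everything up to the next part heading.
    parts[current.group(1).upper()] = text[current.end():end].strip()

print(list(parts))
print(parts["II"])
```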

## Reading tables