# Capstone Project

## Introduction

For my capstone project, I want to use machine learning to aggregate information from different financial analysis approaches.

Tesla's disruption of the automotive industry began in 2008 with its exclusive high-end Roadster () and has continued through the recent reveal of the mainstream Model Y ().

As the pioneer and leader in EV sales worldwide (), Tesla has perhaps drawn even more attention for its financial sustainability (); as of this writing, Tesla has yet to produce a profitable quarter (). This ***uncertainty about profitability*** has resonated throughout the EV market and the automotive market at large, which in turn has made Tesla one of the most important benchmark indicators of the EV story ().

~~~~~

This study aims to explore whether Tesla's business model is more related to traditional automakers or more closely related to tech companies. We will use a multi-class classification 

In this report I will attempt to predict Tesla's stock price using different financial and statistical models. At the end I will discuss the performance of each model and offer some recommendations for future work.

~~~~~

### **Investment Horizon of Models** 

1) **Intrinsic Value** Long Term (Quarters/Years)

2) **Discounted Cash Flow** Mid Term (Quarters)

3) **Comparable Analysis** Mid Term (Days/Months)

4) **Time Series** Short (Days)

    - Autoregressive Integrated Moving Average (ARIMA)
     https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3168423

Model Evaluation
AUROC


# Libraries

In [2]:
# Import Library 

import re
import requests
import unicodedata
from bs4 import BeautifulSoup
from pprint import pprint
import numpy as np
from numpy import array
import pandas as pd 
import csv
import pickle
from IPython.display import display
from collections import Counter
pd.set_option('display.max_columns', 100)

# Define Text Normalization Function

In [3]:
def restore_windows_1252_characters(restore_string):
    """
    Replace C1 control characters in the Unicode string by the
    characters at the corresponding code points in Windows-1252,
    where possible.
    """

    # The helper must live inside the outer function so that the final
    # re.sub call below belongs to restore_windows_1252_characters.
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''

    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)
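
As a quick check of what this normalization does, here is a minimal, self-contained sketch. The function name `restore_cp1252` and the sample string are made up for illustration; the logic mirrors the cell above. Stray C1 control characters are mapped to the Windows-1252 character at the same byte value, and code points with no Windows-1252 character are dropped.

```python
import re

def restore_cp1252(text):
    """Map stray C1 control characters to their Windows-1252 equivalents."""
    def to_cp1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # This byte value has no Windows-1252 character: drop it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_cp1252, text)

# Hypothetical sample: \x93 and \x94 are mis-decoded curly quotes,
# \x81 is undefined in Windows-1252.
sample = '\x93Model 3\x94 ramp\x81'
restored = restore_cp1252(sample)
print(restored)
```

The curly quotes come back as real quotation marks, while the undefined `\x81` is removed.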

# Grab the Document Content

In [4]:
def grab_doc_content( brand, CIK ):
    
    company = {}
    
    company['auto'] = {}
    auto = {brand : CIK
           }
    company['auto'] = auto 

    key = list(auto.copy().values())
    
    return key

In [5]:
grab_doc_content('Tesla', '0001318605')

['0001318605']

In [6]:
master_filings_dict = {}
    
file_code = {}
    
file_text = {}

Accession_Number_URL = {}

In [7]:
# URL Directory For CIK    
dir_url = r'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=10-&dateb=&owner=include&count=100'
dir_url_list = [dir_url.format(x) for x in grab_doc_content('Tesla', '0001318605')]

#print('Directory URL: {}'.format(dir_url_list))
doc_url_list = [] 

# FOR-loop yielding Accession Numbers from CIK/URL Directory.
for CIK_num in grab_doc_content('Tesla', '0001318605'):

    doc_url = r'https://www.sec.gov/Archives/edgar/data/{CIKx}/{xx}/{yy}.txt'
    doc_url_new = doc_url.format(CIKx = CIK_num, xx='{xx}', yy ='{yy}')
    doc_url_list.append(doc_url_new)


In [8]:
def url_maker(dir_url):
    dir_url_list = [dir_url.format(x) for x in grab_doc_content('Tesla', '0001318605')]

    #print('Directory URL: {}'.format(dir_url_list))
    doc_url_list = [] 

    # FOR-loop yielding Accession Numbers from CIK/URL Directory.
    for CIK_num in grab_doc_content('Tesla', '0001318605'):

        doc_url = r'https://www.sec.gov/Archives/edgar/data/{CIKx}/{xx}/{yy}.txt'
        doc_url_new = doc_url.format(CIKx = CIK_num, xx='{xx}', yy ='{yy}')
        
        doc_url_list.append(doc_url_new)
        
        url_lists = zip( dir_url_list , doc_url_list )
        
    return url_lists

In [9]:
url = r'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=10-&dateb=&owner=include&count=100'
for dir_url, doc_url in url_maker(url): 
    print(1, dir_url)
    print(2, doc_url)

1 https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001318605&type=10-&dateb=&owner=include&count=100
2 https://www.sec.gov/Archives/edgar/data/0001318605/{xx}/{yy}.txt


In [10]:
Accession_Number_URL = {}
url = r'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=10-&dateb=&owner=include&count=100'

for dir_url, doc_url in url_maker(url):
    
    response = requests.get(dir_url)
    soup = BeautifulSoup(response.content, 'lxml')
    text = soup.get_text(strip=True)

    # Raw string so that \d is passed to the regex engine rather than treated as a string escape.
    cleaned_text = re.findall(r'Acc-no: \d+-\d+-\d+' , text)

    accession_number = [n.replace('Acc-no: ', '') for n in cleaned_text]
    accessionnumber = [num.replace('-', '') for num in accession_number]

    accession_numbers = zip(accessionnumber, accession_number)
    
    cikk = [ cikk_.replace('CIK=', '') for cikk_ in re.findall(r'CIK=\d+', dir_url) ][0]
    CIKK = {cikk : accessionnumber}
    
    for (a,b) in accession_numbers: 
        master_filings_dict[b] = {}
        master_filings_dict[b]['sec_header_content'] = {}
        master_filings_dict[b]['filing_documents'] = None

        doc_url_single = doc_url.format(xx = a, yy = b)

        file_url_list = []

        

        file_url_list.append( doc_url_single )
        
        Accession_Number_URL.update({ b : file_url_list})
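
The mapping from accession number to filing URL can be seen in isolation with the sketch below. It makes no network request. The helper name `filing_urls` and the sample listing text are assumptions for illustration; the two accession numbers are the first two printed later in this notebook, and the URL template is the one used in the cells above.

```python
import re

# Hypothetical text shaped like the stripped directory listing.
listing_text = 'Acc-no: 0001564590-20-019931 ... Acc-no: 0001564590-20-018984 ...'

def filing_urls(cik, text):
    """Return {accession number: full-text filing URL} for each 'Acc-no:' match."""
    template = 'https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{name}.txt'
    urls = {}
    for acc in re.findall(r'Acc-no: (\d+-\d+-\d+)', text):
        # The folder is the accession number without hyphens; the file keeps them.
        urls[acc] = template.format(cik=cik, folder=acc.replace('-', ''), name=acc)
    return urls

urls = filing_urls('0001318605', listing_text)
for acc, url in urls.items():
    print(acc, url)
```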

In [11]:
A_N = []
for AN, url in Accession_Number_URL.items():
    
    if AN == '0001193125-14-403635':
        break
        
    else:
        A_N.append(AN)
print(A_N)
print(len(A_N))

['0001564590-20-019931', '0001564590-20-018984', '0001564590-20-004475', '0001564590-19-038256', '0001564590-19-026445', '0001564590-19-013462', '0001564590-19-003165', '0001564590-18-026353', '0001564590-18-019254', '0001564590-18-011086', '0001564590-18-002956', '0001564590-17-021343', '0001564590-17-015705', '0001564590-17-009968', '0001564590-17-003118', '0001564590-16-026820', '0001564590-16-023024', '0001564590-16-018886', '0001564590-16-013195', '0001564590-15-009741', '0001564590-15-006666', '0001564590-15-003789', '0001564590-15-001031']
23
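
The stop-point loop above can also be written with `itertools.takewhile`, which relies on the dictionary keeping insertion order. This is only a sketch: the dictionary below is a shortened, hypothetical stand-in for `Accession_Number_URL`, with placeholder values and a placeholder key after the stop marker.

```python
from itertools import takewhile

STOP_AT = '0001193125-14-403635'  # stop marker used in the cell above

# Hypothetical stand-in for Accession_Number_URL (values are placeholders).
accession_urls = {
    '0001564590-20-019931': ['url-a'],
    '0001564590-20-018984': ['url-b'],
    '0001193125-14-403635': ['url-c'],
    'older-filing-placeholder': ['url-d'],
}

# Keep keys until the stop marker is reached; the marker itself is excluded.
recent = list(takewhile(lambda acc: acc != STOP_AT, accession_urls))
print(recent)
print(len(recent))
```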


## Save each Filing.

In [35]:
QorK = {'10-Q' : [],
        '10-K' : []
       }

for acc_num, url in Accession_Number_URL.items():
    
    master_document_dict = {}
    
    # create a stop point
    if acc_num == '0001193125-14-403635':
        break

    else:
        
        # grab the response
        response = requests.get(url[0])

    # Soupify
        # pass it through the parser, in this case let's just use lxml because the tags seem to follow xml.
        soup = BeautifulSoup(response.content, 'lxml')
        filing_document = soup.find('document')
        
    # Parsing
        # Document type ->> document_id
        document_id = filing_document.type.find(text=True, recursive=False).strip()
        # Document sequence ->> document_sequence
        document_sequence = filing_document.sequence.find(text=True, recursive=False).strip()
        # Document filename ->> document_filename
        document_filename = filing_document.filename.find(text=True, recursive=False).strip()
        # Document description ->> document_description
        document_description = filing_document.description.find(text=True, recursive=False).strip()
        
    # Storage    
        # initalize our document dictionary
        master_document_dict[document_id] = {}

        # add the different parts, we parsed up above.
        master_document_dict[document_id]['document_sequence'] = document_sequence
        master_document_dict[document_id]['document_filename'] = document_filename
        master_document_dict[document_id]['document_description'] = document_description
        
# Scraping
        # grab the text portion of the document, this will be used to split the document into pages.
        filing_doc_text = filing_document.find('text').extract()

        # find all the thematic breaks, these help define page numbers and page breaks.
        all_thematic_breaks = filing_doc_text.find_all('hr')

        # Locate and store page number via list comprehension.
        all_page_numbers = [thematic_break.previous_sibling.previous_sibling.get_text(strip=True) for thematic_break in all_thematic_breaks]

        # convert all thematic breaks to a string so it can be used for parsing
        all_thematic_breaks = [str(thematic_break) for thematic_break in all_thematic_breaks]

        # prep the document text for splitting, this means converting it to a string.
        filing_doc_string = str(filing_doc_text)
    
        # handle the case where there are thematic breaks.
        if len(all_thematic_breaks) > 0: 

            # define the regex delimiter pattern, this would just be all of our thematic breaks.
            regex_delimiter_pattern = '|'.join(map(re.escape, all_thematic_breaks))

            # split the document along each thematic break.
            split_filing_string = re.split(regex_delimiter_pattern, filing_doc_string)

            # store the document itself
            master_document_dict[document_id]['pages_code'] = split_filing_string

        # handle the case where there are no thematic breaks.
        elif len(all_thematic_breaks) == 0:

            # handles so it will display correctly.
            split_filing_string = all_thematic_breaks

            # store the document as is, since there are no thematic breaks. In other words, no splitting.
            master_document_dict[document_id]['pages_code'] = [filing_doc_string]
            
        # display some information to the user.
        print('-'*80)
        print('The document {} was parsed.'.format(document_id))
        print('There was {} page(s) found.'.format(len(all_page_numbers)))
        print('There was {} thematic breaks(s) found.'.format(len(all_thematic_breaks)))

        # store the documents in the master_filing_dictionary.
        master_filings_dict[acc_num]['filing_documents'] = master_document_dict
        
        # if document is 10-Q
        if document_id == '10-Q':
            QorK['10-Q'].append(acc_num) # add acc_num to QorK in 10-Q as key
        
        # if document is 10-K
        if document_id == '10-K':
            QorK['10-K'].append(acc_num) # add acc_num to QorK in 10-K as key
        
        del master_document_dict
        
        print('-'*80)
        print('All the documents for filing {} were parsed and stored.'.format(acc_num))
    


--------------------------------------------------------------------------------
The document 10-Q was parsed.
There was 71 page(s) found.
There was 71 thematic breaks(s) found.
--------------------------------------------------------------------------------
All the documents for filing 0001564590-20-019931 were parsed and stored.
--------------------------------------------------------------------------------
The document 10-K/A was parsed.
There was 67 page(s) found.
There was 67 thematic breaks(s) found.
--------------------------------------------------------------------------------
All the documents for filing 0001564590-20-018984 were parsed and stored.
--------------------------------------------------------------------------------
The document 10-K was parsed.
There was 158 page(s) found.
There was 158 thematic breaks(s) found.
--------------------------------------------------------------------------------
All the documents for filing 0001564590-20-004475 were parsed and store
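
The page-splitting idea can be illustrated without any third-party parser. Note that this sketch swaps in a different technique: the cell above splits on the exact `<hr>` strings that BeautifulSoup serializes, whereas this sketch matches any `<hr ...>` tag with a regular expression. The HTML snippet is a made-up miniature, not taken from a filing.

```python
import re

# Hypothetical miniature filing body with two thematic breaks.
filing_html = '<p>Page one</p><hr style="x"/><p>Page two</p><HR><p>Page three</p>'

# Match any <hr ...> tag, case-insensitively, and split the document there.
pages = re.split(r'<hr\b[^>]*>', filing_html, flags=re.IGNORECASE)

for number, page in enumerate(pages, start=1):
    print(number, page)
```

Two breaks give three pages, which is why a page count and a break count can differ by one depending on how trailing content is handled.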

In [16]:
string_test = master_filings_dict['0001564590-20-019931']['filing_documents']['10-Q']['pages_code']


## Anchor Check Function



In [17]:
# Variations of HTML anchors.
CBS_A = '<a name="Consolidated_Balance_Sheets'
CBS_B = 'id="CONSOLIDATED_BALANCE_SHEETS'
CBS3 = 'name="CONSOLIDATED_BALANCE_SHEETS"' 
CBS10Knew = 'id="Consolidated_Balance_Sheets"'

SoO_A = '<a name="Statements_of_Operations"' 
SoO_B = 'id="CONSOLIDATED_STATEMENTS_OPERATIONS"'
SoO3 = 'name="CONSOLIDATED_STATEMENTS_OPERATIONS"'
CSoO10K = 'name="Consolidated_Statements_of_Operations"'
CSoO10Knew = 'id="Consolidated_Statements_of_Operations"'

CF_A = '<a name="Statements_of_Cash'
CF_B = 'id="CONSOLIDATED_STATEMENTS_CASH_FLOWS"'
CF3 = 'name="CONSOLIDATED_STATEMENTS_CASH_FLOWS"'
CCF10K = 'name="Consolidated_Statements_of_Cash_Flows"'
CCF10Knew = 'id="Consolidated_Statements_of_Cash_Flows"'

def anchor_check(page_code):
    
    # if the page has one of these anchors
    if (CBS_A in page_code) or (CBS_B in page_code) or (CBS3 in page_code) or (CBS10Knew in page_code):
        return True

    #elif (SoO_A in page_code) or (SoO_B in page_code) or (SoO3 in page_code) or (CSoO10K in page_code) or (CSoO10Knew in page_code) :
        #return True

    #elif (CF_A in page_code) or (CF_B in page_code) or (CF3 in page_code) or (CCF10K in page_code) or (CCF10Knew in page_code):
        #return True


    else:
        return False
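
One possible way to keep the anchor check maintainable as more statement types are enabled is to hold each family of anchors in a tuple and test them with `any()`. This is a sketch, not the notebook's implementation; `BALANCE_SHEET_ANCHORS` and `has_anchor` are hypothetical names, and the anchor strings are copied from the balance-sheet variables above.

```python
BALANCE_SHEET_ANCHORS = (
    '<a name="Consolidated_Balance_Sheets',
    'id="CONSOLIDATED_BALANCE_SHEETS',
    'name="CONSOLIDATED_BALANCE_SHEETS"',
    'id="Consolidated_Balance_Sheets"',
)

def has_anchor(page_code, anchors=BALANCE_SHEET_ANCHORS):
    """Return True when any known anchor string occurs in the page markup."""
    return any(anchor in page_code for anchor in anchors)

# Made-up page fragments for illustration.
print(has_anchor('<div id="CONSOLIDATED_BALANCE_SHEETS_1">'))
print(has_anchor('<p>No statement here</p>'))
```

Passing a different tuple as `anchors` would cover the statements of operations or cash flows without adding new branches.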


# Saving string test to local file

### You're looking at the following file.

string_test = master_filings_dict['0001564590-20-019931']['filing_documents']['10-Q']['pages_code']


In [None]:
with open("stringtest.csv", "wb") as fp:
    pickle.dump(string_test, fp)

# Loading string test from local file

In [None]:
import pickle
with open("stringtest.csv", "rb") as fp:
    string_test = pickle.load(fp)
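
The save and load cells above form a pickle round trip. The sketch below shows the same round trip in one place, writing to a temporary directory. The list `pages` is a stand-in for `string_test`, and the `.pkl` extension is a suggestion only, since the file holds pickled bytes rather than CSV text.

```python
import os
import pickle
import tempfile

pages = ['<p>page 1</p>', '<p>page 2</p>']  # stand-in for string_test

# Write to a throwaway directory so the sketch leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), 'stringtest.pkl')

with open(path, 'wb') as fp:
    pickle.dump(pages, fp)

with open(path, 'rb') as fp:
    loaded = pickle.load(fp)

print(loaded == pages)
```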

## Cleaning Columns Function


In [18]:
month_dict = {'December' : 'Q4',
              'November' : 'Q4',
             'October' : 'Q4',
             'September' : 'Q3',
             'August': 'Q3',
             'July' : 'Q3',
             'June' : 'Q2',
             'May' : 'Q2',
             'April' : 'Q2',
             'March' : 'Q1',
             'February' : 'Q1',
             'January' : 'Q1'}

### TESTING.

In [19]:
financials = ['Quarter', 'Year', 'Assets', 'Current assets',
       'Cash and cash equivalents',
       'Restricted cash and marketable securities', 'Accounts receivable',
       'Inventory', 'Prepaid expenses and other current assets',
       'Total current assets', 'Operating lease vehicles, net',
       'Property, plant and equipment, net', 'Restricted cash', 'Other assets',
       'Total assets',
       'Current liabilities', 'Accounts payable', 'Accrued Liabilities',
       'Deferred Revenue into ', 'Capital lease obligations, current portion',
       'Customer deposits', 'Convertible Senior Notes (Note 8 )',
       'Total current liabilities',
       'Capital lease obligations, less current portion', 'Deferred Revenue',
       'Convertible Senior Notes', 'Resale value guarantee',
       'Other long-term liabilities', 'Total liabilities',
       'Commitments and contingencies (Note 11)', 'Convertible Senior Notes']

In [None]:
name_conversion_dict = {
    
}
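def col_name_check( column_name ):
    
    # Placeholder behavior (an assumption, since the original loop was left unfinished):
    # return the mapped name if one exists, otherwise the name unchanged.
    return name_conversion_dict.get(column_name, column_name)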

In [40]:
# Shortening Column name.
def clean_col_name( element ):
    
    if re.findall('Common stock.*', element):
        
        element = element.replace(element, 'Common Stock') 

        return element
        
    elif re.findall('Preferred stock.*', element):
        
        element = element.replace(element, 'Preferred Stock') 
        
        return element
    
    elif re.findall("Total liabilities and stockholders' equity.*", element):
        
        element = element.replace(element, 'Total liabilities and equity' ) 
        
        return element

    elif re.findall('Accounts receivable.*', element):
        
        element = element.replace(element, 'Accounts receivable' ) 
        
        return element

    elif re.findall('Accrued liabilities.*', element):
        
        element = element.replace(element, 'Accrued Liabilities' ) 
        
        return element

    elif re.findall('Deferred revenue.*', element):
        
        element = element.replace(element, 'Deferred Revenue' ) 
        
        return element

    elif re.findall('Redeemable noncontrolling interests in subsidiaries.*', element):
        
        element = element.replace(element, 'Noncontrolling interests in subsidiaries' ) 
        
        return element
    
    elif re.findall('Convertible senior notes.*', element, flags=re.I):
        #print('before: ', element)
        element = element.replace(element, 'Convertible Senior Notes' )
        #print('after: ', element)
        return element
    
    else:
        return element

In [22]:
# Throw column list-like object into cleaning function. 
def cleaning_column(column, m):
     
    column_list = []

    for element_post, element in enumerate(column):
        
        # Handling the first "row" in the table i.e. column name.
        if element_post == 0:

            element = unicodedata.normalize('NFKD', element)
            
            element = element.replace( '\n', '')
            
            clean_ele = clean_col_name( element )
            
            column_list.append(clean_ele)
            
        # Use a dictionary to convert the Months into numbers.
        #elif element_post == :
            #month = re.sub('({})'.format('|'.join(map(re.escape, month_dict.keys()))), lambda m: month_dict[m.group()], dat)
            #column_list.extend(month)
        #elif element == '(' or ')': 
            #pint(element)
            #continue
            
        else:
            
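            # The original chained 'and' expression evaluated to '\n' alone,
            # so only newlines are replaced; this keeps that behavior explicit.
            element = element.replace( '\n', '0')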
            
            element = element.replace( ',', '' )

            pattern = re.compile(r'[\d]+,[\d]+|[\d]+')

            res = pattern.findall(element)

            res = [int(ele) * m for ele in res]

            column_list.extend(res)

    return column_list
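
To make the numeric cleaning step concrete, here is a minimal sketch of turning one table cell into scaled integers. `parse_amounts` is a hypothetical helper, not the notebook's function. Unlike `cleaning_column`, it does not insert placeholder zeros for empty cells, and, like the notebook, it does not treat parenthesized values as negative.

```python
import re
import unicodedata

def parse_amounts(cell_text, multiplier):
    """Extract every integer in a table cell and scale it by the reporting unit."""
    # NFKD turns the non-breaking space \xa0 into a plain space.
    text = unicodedata.normalize('NFKD', cell_text).replace(',', '')
    return [int(number) * multiplier for number in re.findall(r'\d+', text)]

# Made-up cell text in the shape the filings use: newline, dollar sign, nbsp, amount.
cash = parse_amounts('\n$\xa06,268', 10**6)
empty = parse_amounts('\n', 10**3)
print(cash, empty)
```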
    

## Cleaning Tables Function

In [23]:
def date( column ):
    
    date = []
    
    for ele in column:
        
        norm_ele = unicodedata.normalize('NFKD', ele)
        
        date_data = re.sub('({})'.format('|'.join(map(re.escape, month_dict.keys()))), lambda m: month_dict[m.group()], norm_ele)

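        # The original chained 'and' expression evaluated to '\n ' alone,
        # so only that sequence is replaced; this keeps that behavior explicit.
        element = date_data.replace( '\n ', '0')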
        
        pattern = re.compile(r'\w+')

        res = pattern.findall(element)
        
        res = str(res[0]) 

        date.append(res)

    date.pop(-1)

    date.insert(0, 'Quarter')

    return date
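
The month-to-quarter substitution used in `date` can be tried on its own with the sketch below. The names `MONTH_TO_QUARTER` and `months_to_quarters` are hypothetical; the mapping follows the calendar quarters of `month_dict`, and the sample dates are made up.

```python
import re

MONTH_TO_QUARTER = {
    'January': 'Q1', 'February': 'Q1', 'March': 'Q1',
    'April': 'Q2', 'May': 'Q2', 'June': 'Q2',
    'July': 'Q3', 'August': 'Q3', 'September': 'Q3',
    'October': 'Q4', 'November': 'Q4', 'December': 'Q4',
}

# One alternation pattern built from the dictionary keys.
MONTH_PATTERN = re.compile('|'.join(map(re.escape, MONTH_TO_QUARTER)))

def months_to_quarters(text):
    """Replace each month name in the text with its quarter label."""
    return MONTH_PATTERN.sub(lambda m: MONTH_TO_QUARTER[m.group()], text)

print(months_to_quarters('March 31, 2020'))
print(months_to_quarters('December 31,'))
```

A misspelled key in the dictionary silently leaves that month unconverted, so it is worth checking the keys against real header text.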

In [24]:
def year( column ):
    
    years = []
                
    for year in column:

        norm_year = unicodedata.normalize('NFKD', year)

        clean_year = norm_year.replace( '\n', '0')

        years.append(str(int(clean_year)))
        
    years.pop(-1)

    years.insert(0, 'Year')
    
    return years

In [25]:
def table_extractor( page, multiplier ):
    
    # Convert string to BS object
    soup = BeautifulSoup(page, 'html5')

    # then get all the rows in the table.
    table_rows = soup.find_all('tr')
    
    if len( [tr.text for tr in table_rows[0].find_all('td')] ) < 3:
        del table_rows[0]
    
    single_table = []
    
    # Rotate through each column, adding in three structures to single_table.
    for tr_post, tr in enumerate(table_rows):

        td = tr.find_all('td')

        column = [tr.text for tr in td]
        
        if tr_post == 0:
            
            col_date = date( column )

            single_table.append( col_date )
         
        elif tr_post == 1: 
            
            col_year = year( column )
            
            single_table.append( col_year )
            
        else: 
            
            cleaned_data = cleaning_column( column , multiplier )
            
            if len(cleaned_data) > 10:
                del cleaned_data[3]
            
            single_table.append( cleaned_data )
            
    return single_table

In [42]:
def df_clean( single_table ):
    #print(single_table)
    table_df = pd.DataFrame(single_table)

    table_df = table_df.transpose()  

    # Pass axis by keyword; pandas 2.0 no longer accepts it positionally.
    table_df.drop(table_df.index[1:3], axis=0, inplace=True)

    table_df.drop(table_df.index[2:], axis=0, inplace=True)

    table_df.columns = table_df.iloc[0]

    table_df = table_df.drop(table_df.index[0])
    
    return table_df

In [27]:
def table(page_code):
    
    doc_tables=[]

    for page_post, page in enumerate(page_code):
        
        if anchor_check(page) and ('in millions' in page):
            
            single_table = table_extractor( page, 10**6 )
            
            table_df = df_clean(single_table)
                        
            doc_tables.append(table_df)
        
        elif anchor_check(page) and ('in thousands' in page):
            
            single_table = table_extractor( page, 10**3 )
            
            table_df = df_clean(single_table)
            
            doc_tables.append(table_df)
                
    return doc_tables
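
The two branches of `table` differ only in the multiplier, so the unit detection could be factored out. The sketch below is one way to do that; `detect_multiplier` is a hypothetical helper, and the case-insensitive match is an assumption that goes beyond the case-sensitive check in the cell above.

```python
def detect_multiplier(page):
    """Return the scale implied by the table caption, or None if none is stated."""
    lowered = page.lower()
    if 'in millions' in lowered:
        return 10**6
    if 'in thousands' in lowered:
        return 10**3
    return None

# Made-up captions for illustration.
print(detect_multiplier('(in millions, except per share data)'))
print(detect_multiplier('(In thousands)'))
print(detect_multiplier('No unit given'))
```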
    

In [43]:
string_test = master_filings_dict['0001564590-20-004475']['filing_documents']['10-K']['pages_code']
test_df = table(string_test)
table_df = test_df[0]
table_df

Unnamed: 0,Quarter,Year,Assets,Current assets,Cash and cash equivalents,Restricted cash,Accounts receivable,Inventory,Prepaid expenses and other current assets,Total current assets,"Operating lease vehicles, net","Solar energy systems, net","Property, plant and equipment, net",Operating lease right-of-use assets,"Intangible assets, net",Goodwill,"MyPower customer notes receivable, net of current portion","Restricted cash, net of current portion",Other assets,Total assets,Liabilities,Current liabilities,Accounts payable,Accrued Liabilities,Deferred Revenue,Resale value guarantees,Customer deposits,Current portion of debt and finance leases,Total current liabilities,"Debt and finance leases, net of current portion",Deferred Revenue.1,"Resale value guarantees, net of current portion",Other long-term liabilities,Total liabilities,Commitments and contingencies (Note 16),Noncontrolling interests in subsidiaries,Equity,Stockholders' equity,Preferred Stock,Common Stock,Additional paid-in capital,Accumulated other comprehensive loss,Accumulated deficit,Total stockholders' equity,Noncontrolling interests in subsidiaries.1,Total liabilities and equity
3,Q4,2019,0,0,6268000000,246000000,1324000000,3552000000,713000000,12103000000,2447000000,6138000000,10396000000,1218000000,339000000,198000000,393000000,269000000,808000000,34309000000,0,0,3771000000,2905000000,1163000000,317000000,726000000,1785000000,10667000000,11634000000,1207000000,36000000,2655000000,26199000000,0,643000000,0,0,0,0,12737000000,36000000,6083000000,6618000000,849000000,34309000000


In [None]:
table_df.columns

# Save Files Locally. 

In [44]:
for k, v in QorK.items():
    
    #if k == '10-Q' and v == '0001564590-17-009968':
        
        for i, vv in enumerate(v):

            # first grab 10-Q documents only
            pages = master_filings_dict[v[i]]['filing_documents'][k]['pages_code']

            try:
                # Extract Table from html code and wrap it as a list of df.
                table_df = table(pages)                 

            except Exception: 
                # Skip this filing so that a stale or undefined table_df is not pickled.
                print('Something went wrong in file {}, file type = {}'.format(v[i], k))
                continue

            with open('{}-{}.csv'.format(v[i], k), "wb") as fp:   # Pickling
                pickle.dump(table_df, fp)

            # display a status to the user.
            print('All the pages from {} document {} have been tableized and saved.'.format(k, v[i]))
            print('-'*80)      

All the pages from 10-Q document 0001564590-20-019931 have been tableized and saved.
--------------------------------------------------------------------------------
All the pages from 10-Q document 0001564590-19-038256 have been tableized and saved.
--------------------------------------------------------------------------------
All the pages from 10-Q document 0001564590-19-026445 have been tableized and saved.
--------------------------------------------------------------------------------
All the pages from 10-Q document 0001564590-19-013462 have been tableized and saved.
--------------------------------------------------------------------------------
All the pages from 10-Q document 0001564590-18-026353 have been tableized and saved.
--------------------------------------------------------------------------------
All the pages from 10-Q document 0001564590-18-019254 have been tableized and saved.
--------------------------------------------------------------------------------
All 

# Comparing the Old and the New.

In [None]:
# Sample: Older filing style. 
import pickle
pd.set_option('display.max_columns', 100)
with open("0001564590-19-038256-10-Q.csv", "rb") as fp:
    file = pickle.load(fp)
df01 = file[0]

# df01['CSN_SUM'] = df01['Convertible Senior Notes'].values.sum()
# del df01['Convertible Senior Notes']

# df01['DeferredRevenue'] = df01['Deferred Revenue'].values.sum()
# del df01['Deferred Revenue']

# df01['NoncontrollingInterests'] = df01['Noncontrolling interests in subsidiaries'].values.sum()
# del df01['Noncontrolling interests in subsidiaries']

print(df01.shape)
df01

In [None]:
duplicate_columns( df01 )

### Newer Filing

In [None]:
# Sample: Newer filing style. 
import pickle 
import pandas as pd
pd.set_option('display.max_columns', 100)
with open("0001564590-20-019931-10-Q.csv", "rb") as fp:
    file = pickle.load(fp)
df02 = file[0]
print(df02.shape) 
df02

In [None]:
checking_duplicate_columns( df02 )

In [None]:
print('Using the Function')

duplicate_columns( df02 )


In [None]:
# Slamming the Raw DataFrame into the dissected column cleanser. 
print('Not Using the Function')
duplicates = [k for (k,v) in Counter(df02.columns).items() if v > 1] 

for dupli in duplicates: 
    res = df02[dupli].values.sum()
    df02['Total {}'.format(dupli)] = res
    df02.drop(dupli, axis=1, inplace=True)
    
df02

In [None]:
df02.columns

# Duplicate col_name, add column tgt.
df2 = df02.copy()

# df2['CSN_SUM'] = df2['Convertible Senior Notes'].values.sum()
# del df2['Convertible Senior Notes']

df2['DeferredRevenue'] = df2['Deferred Revenue'].values.sum()
del df2['Deferred Revenue']

df2['NoncontrollingInterests'] = df2['Noncontrolling interests in subsidiaries'].values.sum()
del df2['Noncontrolling interests in subsidiaries']

#df2.set_index(['Year', 'Quarter'], inplace=True)

print( df2.shape )
df2

In [None]:
checking_duplicate_columns( df2 )

# Concatenating DataFrames

### Example code: 

In [None]:
c_ = pd.concat([df2, df01], axis=0, sort=False)

print(c_.shape)
c_

In [None]:
# Serious Stuff. 
def duplicate_columns( DataF ):
    
    duplicates = [k for (k,v) in Counter(DataF.columns).items() if v > 1] 

    for dupli in duplicates: 
        
        res = DataF[dupli].values.sum()
        
        DataF['Serious {}'.format(dupli)] = res
        
        DataF.drop(dupli, axis=1, inplace=True)

    return DataF
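
The duplicate-handling logic can be followed without pandas in the sketch below: `Counter` finds labels that occur more than once, and their values are summed under a single `Total ...` key. The row is a small illustrative example with values in millions; the key prefix `Total` follows the earlier inline version rather than the `Serious` prefix used in the function above.

```python
from collections import Counter

# Illustrative column labels and one row of values (in millions).
columns = ['Year', 'Deferred Revenue', 'Total assets', 'Deferred Revenue']
row = [2019, 1163, 34309, 1207]

# Labels that appear more than once.
duplicates = [name for name, count in Counter(columns).items() if count > 1]

merged = {}
for name, value in zip(columns, row):
    if name in duplicates:
        # Sum all occurrences of a duplicated label under one new key.
        key = 'Total {}'.format(name)
        merged[key] = merged.get(key, 0) + value
    else:
        merged[name] = value

print(duplicates)
print(merged)
```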

In [None]:
# Just for checking.
def checking_duplicate_columns( DataF ):
    
    duplicates = [k for (k,v) in Counter(DataF.columns).items() if v > 1] 

    
    print('Shape of df: ', DataF.shape )
    return duplicates

In [None]:
c_.set_index(['Year', 'Quarter'], inplace=True)
print(c_.shape)
c_

In [None]:
c1 = c_.copy()
c1.dropna( axis= 1, inplace=True)
print(c1.shape)
c1

# Loading

In [None]:
QorK = {
    '10-Q': ['0001564590-20-019931', '0001564590-19-038256', '0001564590-19-026445', '0001564590-19-013462', '0001564590-18-026353', '0001564590-18-019254', '0001564590-18-011086', '0001564590-17-021343', '0001564590-17-015705', '0001564590-17-009968', '0001564590-16-026820', '0001564590-16-023024', '0001564590-16-018886', '0001564590-15-009741', '0001564590-15-006666', '0001564590-15-003789'],
    '10-K': ['0001564590-20-004475', '0001564590-19-003165', '0001564590-18-002956', '0001564590-17-003118', '0001564590-16-013195', '0001564590-15-001031']}
QorK

In [None]:
# Concatenating all balance sheets together.

for k, v in QorK.items():
    
    #if k == '10-Q' and v == '0001564590-17-009968':
        
        for i, vv in enumerate(v):

            # first grab 10-Q documents only
            pages = master_filings_dict[v[i]]['filing_documents'][k]['pages_code']

            try:
                # Extract Table from html code and wrap it as a list of df.
                table_df = table(pages)                 

            except Exception: 
                # Skip this filing so that a stale or undefined table_df is not pickled.
                print('Something went wrong in file {}, file type = {}'.format(v[i], k))
                continue

            with open('{}-{}.csv'.format(v[i], k), "wb") as fp:   # Pickling
                pickle.dump(table_df, fp)
                
                  
pd.set_option('display.max_columns', 100)
with open("0001564590-15-001031-10-K.csv", "rb") as fp:
    file = pickle.load(fp)
df1 = file[0]


# Statement Function 

In [None]:
def BS10_K(df):
    
    missing_values = [ '\n', None, '\n—', '\n0', '\n$', '\n()']
    
    # Clean df.
    # drop() with inplace=True returns None, so assign the returned copy instead.
    df = df.drop([1,4,5,6,7,8], axis=1)
    

In [None]:
def CF10_K(df):
    pass  # TODO: cash flow cleaning for 10-K filings is not written yet.

In [None]:
def SoO10_K(df):
    pass  # TODO: statement of operations cleaning for 10-K filings is not written yet.

In [None]:
def BS10_Q(table):
    
    BSdf = []
    missing_values = [ '\n', None, '\n—', '\n0', '\n$', '\n()']
    
    # Clean the table. 
    
    row = table.replace(missing_values, '', regex=True)
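    # Assumed intent of the original "[1,4:]": drop the column at position 1
    # and every column from position 4 onward.
    row = row.drop(columns=row.columns[[1] + list(range(4, row.shape[1]))])

    statement = pd.DataFrame(row)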
    BSdf.append(statement)
    
    return BSdf

In [None]:
def CF10_Q(table):
    pass  # TODO: cash flow cleaning for 10-Q filings is not written yet.

In [None]:
res_table = []
for table_num, table in enumerate(file):
    
    if table_num > 1:
        print(table_num)
        cleaned_row = cleaning_row(table)

        res_table.append(cleaned_row)

print(res_table)

In [None]:
# Creating Statement specific files for 
TSLA_BS10_K = []
TSLA_CF10_K = []
TSLA_SoO10_K = []
TSLA_BS10_Q = []
TSLA_CF10_Q = []
TSLA_SoO10_Q = []

for k, v in QorK.items():
    
    if k == '10-K':
        
        for i in range(len(v)):
            pass  # TODO: loop body not written yet.
            
    if k == '10-Q':
    
        for i in range(len(v)):
            pass  # TODO: loop body not written yet.

https://stackoverflow.com/questions/50950614/converting-column-into-multi-index-column