# Handling the dirtiness of the data

This is a scratch notebook for the data processing code.


Some things to handle:
1) Multiple reports filed and amended



Columns:  
`AssetsCurrent` seems to have a lot of companies  
`AccruedLiabilities` has very few  
`Revenues`: some companies, interesting edge cases  


### Small notes about the data

1) The 2023 FY results are (for the most part) not yet released. This is consistent with other years, where most Form 10-K filings are released in the first quarter of the following year. This particular data is, of course, released quarterly.


In [8]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.pylabtools import figsize

import seaborn as sns
import plotly.express as px

import numpy as np
import pandas as pd
import polars as pl
import sqlite3

import statsmodels.formula.api as smf

In [9]:
%run fetch_data.py

%run web_utilities.py

## Sticking to 2022 for the moment

In [10]:
con = sqlite3.connect('data/processed/all10k.db')

How many companies filed reports in 2022?

In [11]:
con.execute("""SELECT COUNT(DISTINCT(cik)) FROM sub WHERE fy='2022';""").fetchall()

[(6279,)]

In [12]:
all_sub_22 = pd.read_sql_query("""SELECT * FROM sub WHERE fy='2022';""", con)

In [13]:
all_sub_22.shape

(6280, 36)

There is one more row than there are distinct `cik` values, so one company appears twice: see the analysis below.

## What fields do I need?

Possible fields:
- EntityCommonStockSharesOutstanding: number of shares
- EntityPublicFloat: market value of shares held by non-affiliates (a rough proxy for market cap)
- CommonStockSharesAuthorized
- CommonStockSharesOutstanding

- EarningsPerShareBasic
- EarningsPerShareDiluted

- Assets
- AssetsCurrent
- LiabilitiesCurrent
- Liabilities
- LiabilitiesAndStockholdersEquity
- StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest

(I think)
Liabilities = LiabilitiesAndStockholdersEquity - StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest
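
If the identity above holds, a missing `Liabilities` value could be backfilled from the other two tags. The following is a minimal sketch, assuming the data has been pivoted to one row per filing with one column per tag. The function name `fill_liabilities` and the wide layout are assumptions for illustration, not part of `fetch_data.py`, and the numbers are made up.

```python
import numpy as np
import pandas as pd

EQUITY = "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest"

def fill_liabilities(wide: pd.DataFrame) -> pd.DataFrame:
    """Backfill missing Liabilities from the balance sheet identity."""
    out = wide.copy()
    derived = out["LiabilitiesAndStockholdersEquity"] - out[EQUITY]
    # Keep reported values; use the derived value only where Liabilities is missing.
    out["Liabilities"] = out["Liabilities"].fillna(derived)
    return out

# Made-up numbers, for illustration only.
example = pd.DataFrame({
    "Liabilities": [60.0, np.nan],
    "LiabilitiesAndStockholdersEquity": [100.0, 250.0],
    EQUITY: [40.0, 100.0],
})
filled = fill_liabilities(example)
```

Filings that report temporary (mezzanine) equity between liabilities and equity may not satisfy this identity, so derived values are probably worth flagging rather than trusting blindly.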


- StockholdersEquity

- Revenues (does not appear for MSFT)
- CommonStockDividendsPerShareDeclared



In [15]:
all_tags = con.execute("""SELECT DISTINCT tag FROM tag;""").fetchall()

In [16]:
len(all_tags)

571931

In [17]:
all_fields = con.execute("""SELECT DISTINCT tag FROM num;""").fetchall()

In [18]:
len(all_fields)

978805

In [25]:
all_fields_22 = con.execute("""SELECT DISTINCT tag FROM num WHERE dyear = '2022';""").fetchall()

In [26]:
len(all_fields_22)

179602

In [25]:
from fetch_data import get_entries

In [27]:
get_entries(2022, ['CommonStockMember'])

Unnamed: 0,adsh,tag,data_year,version,coreg,qtrs,uom,value,footnote,cik,name,period_filed,prevrpt,instance


In [41]:
'CommonStockMember'

'Common'

### To do:

- Select all fields for a couple of examples to get a handle on what is available.
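
One possible way to approach this to-do item is to list every distinct tag reported in a single filing. This is only a sketch: the helper `tags_for_filing` is hypothetical, and it assumes the `num` table has an `adsh` column (as the `get_entries` output suggests). The in-memory table below is a stand-in for `all10k.db`, with made-up rows.

```python
import sqlite3

def tags_for_filing(con: sqlite3.Connection, adsh: str) -> list[str]:
    """Return the distinct tags reported in one filing, sorted by name."""
    rows = con.execute(
        "SELECT DISTINCT tag FROM num WHERE adsh = ? ORDER BY tag;", (adsh,)
    ).fetchall()
    return [tag for (tag,) in rows]

# Tiny in-memory stand-in for all10k.db, for illustration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE num (adsh TEXT, tag TEXT, value REAL);")
demo.executemany(
    "INSERT INTO num VALUES (?, ?, ?);",
    [
        ("a-1", "Revenues", 1.0),
        ("a-1", "Assets", 2.0),
        ("a-1", "Assets", 3.0),
        ("b-2", "Liabilities", 4.0),
    ],
)
demo_tags = tags_for_filing(demo, "a-1")
```

Against the real database, the same function would be called with `con` and an `adsh` taken from the `sub` table.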


# Edge cases:

I have a feeling that there will be a lot of edge cases with this data, so I'll document them here.

### Multiple revenue values

1. Note that the smaller values sum to the largest value, so this is some breakdown of revenues.
2. However, this breakdown is not the same as that seen in the original document: https://www.sec.gov/ixviewer/ix.html?doc=/Archives/edgar/data/846475/000141057822000453/zyxi-20211231x10k.htm

In [38]:
data = get_entries(2021, ['Revenues'])

data[data['name'] == 'ZYNEX INC']

Unnamed: 0,adsh,tag,data_year,version,coreg,qtrs,uom,value,footnote,cik,name,period_filed,prevrpt,instance
1830,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,24127000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1831,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,31022000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1832,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,34785000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1833,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,40367000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1834,0001410578-22-000453,Revenues,2021,us-gaap/2021,,4,USD,130301000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml


In [24]:
data[data['name'] == 'ZYNEX INC']['value'].sum() / 2

130301000.0

In this case, the key distinguishing factor is `qtrs`, which is 4 for the total. (Note that not all fields use `qtrs` = 4; see the next section, for example, where `qtrs` = 0.)
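
The `qtrs` rule could be sketched as follows: for each filing and tag, keep only the row(s) covering the most quarters. The function name `pick_longest_duration` is an assumption for illustration. The example frame reuses the Zynex values from the output above, reduced to the relevant columns.

```python
import pandas as pd

def pick_longest_duration(entries: pd.DataFrame) -> pd.DataFrame:
    """For each filing and tag, keep the row(s) covering the most quarters."""
    max_qtrs = entries.groupby(["adsh", "tag"])["qtrs"].transform("max")
    return entries[entries["qtrs"] == max_qtrs]

# The Zynex rows shown above, reduced to the relevant columns.
zynex = pd.DataFrame({
    "adsh": ["0001410578-22-000453"] * 5,
    "tag": ["Revenues"] * 5,
    "qtrs": [1, 1, 1, 1, 4],
    "value": [24127000.0, 31022000.0, 34785000.0, 40367000.0, 130301000.0],
})
annual = pick_longest_duration(zynex)
```

When every row for a tag has the same `qtrs` (for example, point-in-time balances with `qtrs` = 0), this filter keeps all of them, so it does not resolve those duplicates on its own.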

### Multiple current assets

In this case, there are multiple subsidiaries. The balances seem right for each subsidiary and for the company as a whole.

Here, the row where `coreg` is empty (None) is the one for the company as a whole.

In [37]:
data = get_entries(2021, ['AssetsCurrent'])
duplicates = data[ data['adsh'].duplicated(keep=False) ]
duplicates[ duplicates.cik == '4904'].sort_values('value')

Unnamed: 0,adsh,tag,data_year,version,coreg,qtrs,uom,value,footnote,cik,name,period_filed,prevrpt,instance
1698,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,OhioPowerCo,0,USD,327500000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1695,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,AEPTransmissionCo,0,USD,331300000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1694,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,AEPTexasInc.,0,USD,347200000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1699,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,PublicServiceCoOfOklahoma,0,USD,386800000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1697,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,IndianaMichiganPowerCo,0,USD,439400000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1700,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,SouthwesternElectricPowerCo,0,USD,668500000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1696,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,AppalachianPowerCo,0,USD,925300000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1693,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,,0,USD,7809200000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml


In [39]:
duplicates[ duplicates.cik == '4904']['value'].sum()/2

5617600000.0

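Unlike the revenue case, half the sum (5,617,600,000) does not match the largest value (7,809,200,000): the subsidiaries add up to only 3,426,000,000, so the company as a whole has some assets of its own.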
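
Based on the observation in this section that the company-wide row has an empty `coreg`, one way to drop the subsidiary rows is sketched here. The helper name `consolidated_only` is an assumption, and the sketch assumes empty `coreg` values are loaded as missing (None/NaN) rather than as empty strings. The example uses three of the American Electric Power rows from the output above.

```python
import pandas as pd

def consolidated_only(entries: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with no co-registrant, i.e. the filer as a whole."""
    return entries[entries["coreg"].isna()]

# Three of the AEP rows shown above, reduced to the relevant columns.
aep = pd.DataFrame({
    "adsh": ["0000004904-22-000024"] * 3,
    "coreg": ["OhioPowerCo", "AppalachianPowerCo", None],
    "value": [327500000.0, 925300000.0, 7809200000.0],
})
parent = consolidated_only(aep)
```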

### Multiple reports due to an acquisition

In [17]:
con = sqlite3.connect('data/processed/all10k.db')
all_sub_22 = pd.read_sql_query("""SELECT * FROM sub WHERE fy='2022';""", con)

all_sub_22[ all_sub_22.duplicated('cik', keep=False) ]

Unnamed: 0,adsh,cik,name,sic,countryba,stprba,cityba,zipba,bas1,bas2,...,period,fy,fp,filed,accepted,prevrpt,detail,instance,nciks,aciks
71,0001493152-22-024263,1847846,8I ACQUISITION 2 CORP.,8000,SG,,SINGAPORE,59817,C/O 6 EU TONG SEN STREET,#08-13 THE CENTRAL,...,20220731,2022,FY,20220829,2022-08-29 14:58:00.0,0,1,form10-k_htm.xml,1,
5528,0001493152-23-022805,1847846,EUDA HEALTH HOLDINGS LTD,8000,SG,,SINGAPORE,59817,C/O 6 EU TONG SEN STREET,#08-13 THE CENTRAL,...,20221231,2022,FY,20230628,2023-06-28 17:16:00.0,0,1,form10-k_htm.xml,1,


These entities have different names but the same `cik`.

In [24]:
different = (all_sub_22[ all_sub_22.duplicated('cik', keep=False) ].iloc[0] != all_sub_22[ all_sub_22.duplicated('cik', keep=False) ].iloc[1])

all_sub_22[ all_sub_22.duplicated('cik', keep=False) ].loc[:,different]

Unnamed: 0,adsh,name,stprba,stprma,stprinc,former,changed,fye,period,filed,accepted,aciks
71,0001493152-22-024263,8I ACQUISITION 2 CORP.,,,,,,731,20220731,20220829,2022-08-29 14:58:00.0,
5528,0001493152-23-022805,EUDA HEALTH HOLDINGS LTD,,,,8I ACQUISITION 2 CORP.,20210224.0,1231,20221231,20230628,2023-06-28 17:16:00.0,


Aha! The `former` field shows the registrant's previous name, which here is the acquirer.
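
If only one row per `cik` is wanted, one possible approach is to keep the most recently filed report. This is a sketch of the mechanics only: whether the later filing is the right one to keep is a judgment call, and the helper name `latest_filing_per_cik` is an assumption. It relies on `filed` being a sortable YYYYMMDD value, as it appears in the output above.

```python
import pandas as pd

def latest_filing_per_cik(sub: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per cik: the filing with the latest `filed` date."""
    ordered = sub.sort_values("filed")
    return ordered.drop_duplicates("cik", keep="last")

# The two rows shown above, reduced to the relevant columns.
sub_example = pd.DataFrame({
    "adsh": ["0001493152-22-024263", "0001493152-23-022805"],
    "cik": [1847846, 1847846],
    "name": ["8I ACQUISITION 2 CORP.", "EUDA HEALTH HOLDINGS LTD"],
    "filed": [20220829, 20230628],
})
latest = latest_filing_per_cik(sub_example)
```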