# Handling the dirtiness of the data

This is a scratch notebook for the data processing code


Some things to handle:
1) Multiple reports filed and amended



Columns:  
`AssetsCurrent`: reported by many companies  
`AccruedLiabilities`: reported by very few  
`Revenues`: reported by some companies, with interesting edge cases  


### Small notes about the data

1) The FY 2023 results are (for the most part) not yet released. This is consistent with other years, where most 10-K forms are filed in the first quarter of the following year. The dataset itself is, of course, released quarterly.


In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.pylabtools import figsize

import seaborn as sns
import plotly.express as px

import numpy as np
import pandas as pd
import polars as pl
import sqlite3

import statsmodels.formula.api as smf

In [2]:
%run fetch_data.py

%run web_utilities.py

## Sticking to 2022 for the moment

In [3]:
con = sqlite3.connect('data/processed/all10k.db')

How many companies filed reports in 2022?

In [4]:
con.execute("""SELECT COUNT(DISTINCT(cik)) FROM sub WHERE fy='2022';""").fetchall()

[(6279,)]

In [5]:
all_sub_22 = pd.read_sql_query("""SELECT * FROM sub WHERE fy='2022';""", con)

In [6]:
all_sub_22.shape

(6280, 36)

One duplicate: see analysis below

### Data processing

In [7]:
data = get_entries(2022, ['EntityCommonStockSharesOutstanding', 'EarningsPerShareBasic', 'Assets', 'Liabilities'])

#### duplicate removal

In [33]:
data[0:1000].groupby(by=['cik', 'tag']).apply( lambda x: x)

cik,tag,,adsh,tag,ddate,dyear,version,coreg,qtrs,uom,value,footnote,cik,name,period_filed,prevrpt,instance
1001250,Assets,145,0001001250-22-000122,Assets,2022-06-30,2022,us-gaap/2021,,0,USD,2.091000e+10,,1001250,ESTEE LAUDER COMPANIES INC,2022-06-30,False,el-20220630_htm.xml
1001250,EarningsPerShareBasic,215,0001001250-22-000122,EarningsPerShareBasic,2022-06-30,2022,us-gaap/2021,,4,USD,6.640000e+00,,1001250,ESTEE LAUDER COMPANIES INC,2022-06-30,False,el-20220630_htm.xml
1001907,Assets,809,0001437749-22-022502,Assets,2022-06-30,2022,us-gaap/2022,,0,USD,5.622100e+07,,1001907,ASTROTECH CORP,2022-06-30,False,astc20220705_10k_htm.xml
1001907,EntityCommonStockSharesOutstanding,361,0001437749-22-022502,EntityCommonStockSharesOutstanding,2022-08-31,2022,dei/2022,,0,shares,5.063085e+07,,1001907,ASTROTECH CORP,2022-06-30,False,astc20220705_10k_htm.xml
1002047,Assets,1,0000950170-22-011708,Assets,2022-04-30,2022,us-gaap/2021,,0,USD,1.002600e+10,,1002047,"NETAPP, INC.",2022-04-30,False,ntap-20220429_htm.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96793,Assets,610,0001564590-22-032555,Assets,2022-06-30,2022,us-gaap/2022,,0,USD,2.794300e+07,,96793,SUNLINK HEALTH SYSTEMS INC,2022-06-30,False,ssy-10k_20220630_htm.xml
96793,EarningsPerShareBasic,841,0001564590-22-032555,EarningsPerShareBasic,2022-06-30,2022,us-gaap/2022,,4,USD,-2.900000e-01,,96793,SUNLINK HEALTH SYSTEMS INC,2022-06-30,False,ssy-10k_20220630_htm.xml
96793,EntityCommonStockSharesOutstanding,483,0001564590-22-032555,EntityCommonStockSharesOutstanding,2022-09-30,2022,dei/2022,,0,shares,7.031603e+06,,96793,SUNLINK HEALTH SYSTEMS INC,2022-06-30,False,ssy-10k_20220630_htm.xml
98338,Assets,770,0001213900-22-048182,Assets,2022-05-31,2022,us-gaap/2022,,0,USD,2.435430e+07,,98338,TSR INC,2022-05-31,False,f10k2022_tsrinc_htm.xml


#### pivoting

In [12]:
print(data[['cik', 'tag']].duplicated().sum())

2065


In [14]:
no_duplicates = data[ ~data[['cik', 'tag']].duplicated()]

In [27]:
pivot = no_duplicates[['cik', 'tag', 'value']].pivot(index='cik', columns='tag', values='value')
pivot

cik,Assets,EarningsPerShareBasic,EntityCommonStockSharesOutstanding,Liabilities
1000045,1.835700e+08,0.39,,6.718400e+07
1000209,2.259879e+09,1.86,,1.889355e+09
1000228,8.607000e+09,3.95,,3.936000e+09
1000229,5.783540e+08,0.42,,
1000230,4.055781e+07,-0.05,7893194.0,1.839809e+07
...,...,...,...,...
99106,9.412000e+06,0.02,,1.973600e+07
99250,1.565959e+10,,,
99302,1.777620e+08,1.52,,9.158600e+07
99780,8.724300e+09,0.73,,7.454700e+09


Notes:
1. almost no one has shares outstanding?

In [None]:
data[~data.duplicated()].shape

## What fields do I need?

Possible fields:
- EntityCommonStockSharesOutstanding: number of shares
- EntityPublicFloat: market value of the shares held by non-affiliates (a rough proxy for market cap)
- CommonStockSharesAuthorized
- CommonStockSharesOutstanding

- EarningsPerShareBasic
- EarningsPerShareDiluted

- Assets
- AssetsCurrent
- LiabilitiesCurrent
- Liabilities
- LiabilitiesAndStockholdersEquity
- StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest

(I think)
Liabilities = LiabilitiesAndStockholdersEquity - StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest


- StockholdersEquity

- Revenues (does not appear for MSFT)
- CommonStockDividendsPerShareDeclared


- NetIncomeLoss
- ProfitLoss
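
The tentative identity noted in the field list above (`Liabilities = LiabilitiesAndStockholdersEquity - StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest`) can be sketched as a fill-in step. This is a minimal sketch, not part of `fetch_data.py`: the helper name `impute_liabilities` and the toy frame are assumptions, and it presumes one row per filing with one column per tag. The first row uses the Air Products figures that appear later in this notebook; the second row is made up.

```python
import pandas as pd

nan = float("nan")
SE_NCI = "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest"

# Toy data: row 0 = Air Products figures from this notebook, row 1 = made up.
toy = pd.DataFrame({
    "LiabilitiesAndStockholdersEquity": [2.719260e10, 5.0e6],
    SE_NCI: [1.370240e10, nan],
    "Liabilities": [nan, nan],
})

def impute_liabilities(df):
    """Hypothetical helper: fill missing Liabilities from the balance identity."""
    implied = df["LiabilitiesAndStockholdersEquity"] - df[SE_NCI]
    out = df.copy()
    # Reported values win; the implied value is used only where Liabilities is NaN.
    out["Liabilities"] = out["Liabilities"].fillna(implied)
    return out

imputed = impute_liabilities(toy)
print(imputed["Liabilities"])
```

Rows where the equity figure is also missing stay NaN, so they can be counted separately.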


In [15]:
all_tags = con.execute("""SELECT DISTINCT tag FROM tag;""").fetchall()

In [16]:
len(all_tags)

571931

In [17]:
all_fields = con.execute("""SELECT DISTINCT tag FROM num;""").fetchall()

In [18]:
len(all_fields)

978805

In [25]:
all_fields_22 = con.execute("""SELECT DISTINCT tag FROM num WHERE dyear = '2022';""").fetchall()

In [26]:
len(all_fields_22)

179602

### To do:

- Select all fields for a couple of examples to get a handle on what is available.


# Exploring different types of fields

Done:
- Are `Assets` and `LiabilitiesAndStockholdersEquity` the same thing? - essentially, yes.


To do:
- Some of the checks that we do for repeated entries are actually required for all fields generally, e.g., checking whether it is a subsidiary reporting.
- Investigate companies with missing assets reports
- data didn't fetch for this one: https://www.sec.gov/ixviewer/ix.html?doc=/Archives/edgar/data/732712/000073271223000012/vz-20221231.htm

- some companies 'incorporate financial statements by reference': how to automatically parse these?



## Balance Sheet

In [3]:
data = get_data(2022, ['Liabilities', 'LiabilitiesAndStockholdersEquity', 'StockholdersEquity', 'Assets'])

### Q: Are 'Assets' and 'LiabilitiesAndStockholdersEquity' the same thing?

In [4]:
same = (data['LiabilitiesAndStockholdersEquity'] == data['Assets'])
print(same.sum(), data.shape)

6097 (6201, 10)


They are almost entirely the same.

In [5]:
nanAssets = data[~same]['Assets'].isna()
nanLplusSE = data[~same]['LiabilitiesAndStockholdersEquity'].isna()

print('total count: ', (~same).sum())
print('nan entries: ', (nanAssets | nanLplusSE ).sum())

total count:  104
nan entries:  98


In most of the mismatched rows (98 of 104), one of the two values is NaN.

In [6]:
diffs = data[~same][ ~nanAssets & ~ nanLplusSE]
diffs[['Assets', 'LiabilitiesAndStockholdersEquity']]

Unnamed: 0,Assets,LiabilitiesAndStockholdersEquity
1416,464145000.0,464146000.0
2210,2537695000.0,19105000.0
3674,6.0,5.0
4074,8114329.0,8114328.0
4334,899910.0,899911.0
6044,3809252.0,3809253.0


In [7]:
diffs['Assets'] - diffs['LiabilitiesAndStockholdersEquity']

1416   -1.000000e+03
2210    2.518590e+09
3674    1.000000e+00
4074    1.000000e+00
4334   -1.000000e+00
6044   -1.000000e+00
dtype: float64

Most of these are essentially rounding errors, except for the second one.
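
To separate rounding noise from real mismatches automatically, a tolerance comparison is one option. The sketch below reuses the six rows printed above and `numpy.isclose`; the tolerances (`rtol=1e-5`, `atol=1.0`) are assumptions picked for this example, not values used elsewhere in the project.

```python
import numpy as np
import pandas as pd

# The six rows where both values are present but differ (copied from the output above).
diffs_toy = pd.DataFrame(
    {
        "Assets": [464145000.0, 2537695000.0, 6.0, 8114329.0, 899910.0, 3809252.0],
        "LiabilitiesAndStockholdersEquity": [464146000.0, 19105000.0, 5.0, 8114328.0, 899911.0, 3809253.0],
    },
    index=[1416, 2210, 3674, 4074, 4334, 6044],
)

# isclose passes when |a - b| <= atol + rtol * |b|.
# atol=1.0 absorbs one-dollar differences; rtol=1e-5 absorbs
# small differences on filings reported in thousands.
close = np.isclose(
    diffs_toy["Assets"],
    diffs_toy["LiabilitiesAndStockholdersEquity"],
    rtol=1e-5,
    atol=1.0,
)
real_mismatches = diffs_toy[~close]
print(real_mismatches)
```

With these tolerances only row 2210 remains.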

In [8]:
diffs

Unnamed: 0,adsh,Assets,Liabilities,LiabilitiesAndStockholdersEquity,StockholdersEquity,cik,name,period_filed,prevrpt,url
1416,0000950170-23-010241,464145000.0,268128000.0,464146000.0,,1825265,TCW DIRECT LENDING VIII LLC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
2210,0001193125-23-053980,2537695000.0,1537964000.0,19105000.0,,1418076,SLR INVESTMENT CORP.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
3674,0001477932-23-002714,6.0,369937.0,5.0,-369932.0,1651992,"APPSOFT TECHNOLOGIES, INC.",2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
4074,0001493152-23-012437,8114329.0,6236132.0,8114328.0,1878196.0,1329606,"CLEAN ENERGY TECHNOLOGIES, INC.",2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
4334,0001553350-22-000785,899910.0,287951.0,899911.0,611960.0,1848334,"OKMIN RESOURCES, INC.",2022-06-30,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6044,0001829126-23-002409,3809252.0,5798393.0,3809253.0,-1989140.0,1514443,"AMERIGUARD SECURITY SERVICES, INC.",2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...


Looking at the balance sheet for the outlier (SLR Investment Corp.), the `LiabilitiesAndStockholdersEquity` value appears to be the entry for a subsidiary. A good thing to check for!

Overall, yes, they are the same.

In [9]:
bothna = (data['LiabilitiesAndStockholdersEquity'].isna() & data['Assets'].isna())

In [10]:
bothna.sum()

28

### Missing Liabilities: how often can we impute them

In [11]:
data.shape

(6201, 10)

In [12]:
data.isna().sum()

adsh                                  0
Assets                               47
Liabilities                         764
LiabilitiesAndStockholdersEquity     79
StockholdersEquity                  432
cik                                   0
name                                  0
period_filed                          0
prevrpt                               0
url                                   0
dtype: int64

Since `Assets` and `LiabilitiesAndStockholdersEquity` are both NaN in only a handful of rows (28 of 6201), we can hope to impute a missing `Liabilities` value in nearly all of the remaining cases.
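
Because the two totals are interchangeable in practice, a first step towards imputation could be to merge them into a single column. A minimal sketch on made-up numbers, using `Series.combine_first`; the column name `total` is an assumption for illustration.

```python
import pandas as pd

nan = float("nan")

# Made-up rows covering the four cases: both present, one missing, both missing.
toy = pd.DataFrame({
    "Assets": [100.0, nan, 250.0, nan],
    "LiabilitiesAndStockholdersEquity": [100.0, 80.0, nan, nan],
})

# Prefer Assets; fall back to LiabilitiesAndStockholdersEquity where Assets is NaN.
toy["total"] = toy["Assets"].combine_first(toy["LiabilitiesAndStockholdersEquity"])
print(toy)
```

Only the rows where both inputs are NaN stay NaN in `total`.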

In [13]:
missingLiab = data.isna()[ 'Liabilities']
missingStockholdersEquity  = data.isna()[ 'StockholdersEquity']

In [14]:
print('both missing: ', (missingLiab & missingStockholdersEquity).sum() )

both missing:  69


Some alternative figures:
- `LiabilitiesFairValueDisclosure`


- `MembersEquity`

- `PartnersCapital`

- `StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest`

- `LiabilitiesNoncurrent`

Some utilities seem to have their own ways of reporting...
- `tve:TotalLiabilities`

### Do the Liabilities and Shareholder Equity add up?

In [36]:
data = get_data(2022, ['Liabilities', 'LiabilitiesAndStockholdersEquity', 'StockholdersEquity'])

data_entries = data[['Liabilities', 'LiabilitiesAndStockholdersEquity', 'StockholdersEquity']]
data_entries = data_entries[ ~data_entries.isna().any(axis=1) ]

In [52]:
differences = data_entries['LiabilitiesAndStockholdersEquity'] - data_entries['Liabilities'] - data_entries['StockholdersEquity']
where = differences[differences != 0].index

In [60]:
data.loc[where]

Unnamed: 0,adsh,Liabilities,LiabilitiesAndStockholdersEquity,StockholdersEquity,cik,name,period_filed,prevrpt,url
2,0000002969-22-000054,1.349020e+10,2.719260e+10,1.314400e+10,2969,"AIR PRODUCTS & CHEMICALS, INC.",2022-09-30,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
8,0000004904-23-000011,6.930110e+10,9.346940e+10,2.389340e+10,4904,AMERICAN ELECTRIC POWER CO INC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
11,0000005272-23-000007,4.843990e+11,5.266340e+11,4.000200e+10,5272,"AMERICAN INTERNATIONAL GROUP, INC.",2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
25,0000008868-23-000005,3.402900e+09,2.161700e+09,-1.244700e+09,8868,AVON PRODUCTS INC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
27,0000009389-23-000011,1.638200e+10,1.990900e+10,3.461000e+09,9389,BALL CORP,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
...,...,...,...,...,...,...,...,...,...
6173,0001902733-23-000024,2.305070e+08,1.301014e+09,1.067625e+09,1902733,"NCINO, INC.",2023-01-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6184,0001903596-23-000263,1.539584e+07,6.563294e+06,-1.010273e+07,792935,ETHEMA HEALTH CORP,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6187,0001903596-23-000319,1.006462e+07,1.372732e+06,-7.975792e+06,1286648,GZ6G TECHNOLOGIES CORP.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6188,0001903596-23-000342,1.844698e+07,8.432410e+05,-1.561957e+07,1530746,"KAYA HOLDINGS, INC.",2022-12-31,True,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...


There are a large number of rows that don't fully add up.

How much of this is due to minority interests in the shares?
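
One direct way to test this is to redo the sum with `StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest` in place of `StockholdersEquity` and see which residuals vanish. The sketch below does that for three filings whose figures are copied from the output tables in this notebook; the helper `residual` and the short index labels are assumptions for illustration.

```python
import pandas as pd

SE_NCI = "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest"

# Figures copied from this notebook's 2022 tables.
rows = pd.DataFrame(
    {
        "Liabilities": [1.349020e10, 6.930110e10, 4.843990e11],
        "LiabilitiesAndStockholdersEquity": [2.719260e10, 9.346940e10, 5.266340e11],
        "StockholdersEquity": [1.314400e10, 2.389340e10, 4.000200e10],
        SE_NCI: [1.370240e10, 2.412240e10, 4.223500e10],
    },
    index=["AIR PRODUCTS", "AEP", "AIG"],
)

def residual(df, equity_col):
    """What is left of the total after subtracting liabilities and equity."""
    return df["LiabilitiesAndStockholdersEquity"] - df["Liabilities"] - df[equity_col]

rows["residual_parent_only"] = residual(rows, "StockholdersEquity")
rows["residual_with_nci"] = residual(rows, SE_NCI)

# A zero residual means the noncontrolling interest fully explains the gap.
explained_by_nci = rows["residual_with_nci"] == 0
print(rows[["residual_parent_only", "residual_with_nci"]])
```

On these three rows the residual drops to zero for Air Products and AIG, while American Electric Power keeps a gap of 45.9 million, so minority interest explains some but not all of the mismatches.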

In [62]:
d2 = get_data(2022, ['Liabilities', 'LiabilitiesAndStockholdersEquity', 'StockholdersEquity', 'Assets', 'StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest'])

In [98]:
bothEquity= d2[ ~d2.StockholdersEquity.isna() & ~d2.StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest.isna() ]
bothEquity[bothEquity.StockholdersEquity >
           bothEquity.StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest
            ]

Unnamed: 0,adsh,Assets,Liabilities,LiabilitiesAndStockholdersEquity,StockholdersEquity,StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest,cik,name,period_filed,prevrpt,url
12,0000005513-23-000034,6.143490e+10,5.223740e+10,6.143490e+10,9.197500e+09,-2.756600e+09,5513,UNUM GROUP,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
37,0000014846-23-000005,7.321180e+08,4.820480e+08,7.321180e+08,2.500880e+08,2.500700e+08,14846,BRT APARTMENTS CORP.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
71,0000029332-23-000015,2.029460e+08,1.714320e+08,2.029460e+08,3.151400e+07,2.190000e+05,29332,DIXIE GROUP INC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
92,0000037996-23-000012,2.558840e+11,2.127170e+11,2.558840e+11,4.324200e+10,4.316700e+10,37996,FORD MOTOR CO,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
148,0000059527-23-000004,3.180546e+09,2.146505e+09,3.180546e+09,1.034140e+09,1.034041e+09,59527,LINCOLN ELECTRIC HOLDINGS INC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
...,...,...,...,...,...,...,...,...,...,...,...
6168,0001879016-23-000003,2.604860e+08,5.803900e+07,2.604860e+08,2.063750e+08,2.024470e+08,1879016,IVANHOE ELECTRIC INC.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6177,0001899883-23-000002,2.478399e+09,1.689015e+09,2.478399e+09,5.516230e+08,5.247940e+08,1899883,FTAI INFRASTRUCTURE INC.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6192,0001903596-23-000319,1.372732e+06,1.006462e+07,1.372732e+06,-7.975792e+06,-8.691883e+06,1286648,GZ6G TECHNOLOGIES CORP.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6193,0001903596-23-000342,8.432410e+05,1.844698e+07,8.432410e+05,-1.561957e+07,-1.760374e+07,1530746,"KAYA HOLDINGS, INC.",2022-12-31,True,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...


In [87]:
d3 = get_numbers(2022,  ['Liabilities', 'LiabilitiesAndStockholdersEquity', 'StockholdersEquity', 'Assets', 'StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest'])

In [92]:
d3[ d3.adsh =='0000029332-23-000015']

Unnamed: 0,adsh,tag,ddate,dyear,version,coreg,qtrs,uom,value,footnote
5241,0000029332-23-000015,Assets,2022-12-31,2022,us-gaap/2022,,0,USD,202946000.0,
10335,0000029332-23-000015,Liabilities,2022-12-31,2022,us-gaap/2022,,0,USD,171432000.0,
17997,0000029332-23-000015,LiabilitiesAndStockholdersEquity,2022-12-31,2022,us-gaap/2022,,0,USD,202946000.0,
25401,0000029332-23-000015,StockholdersEquity,2022-12-31,2022,us-gaap/2022,,0,USD,31514000.0,
30142,0000029332-23-000015,StockholdersEquityIncludingPortionAttributable...,2022-12-31,2022,us-gaap/2022,,0,USD,219000.0,


Here again, we see what has essentially become a theme of this project:
1. There are many possible things that can happen
    - Ford has negative minority interest: it has a subsidiary that has negative value
2. They are not always clear outside the context of the documents themselves
    - Dixie reports a `StockholdersEquityIncludingPortionAttributable...` for some of its investments, not for the company as a whole. This is hard to detect!

In [108]:
bothEquity[bothEquity.StockholdersEquity != bothEquity.StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest]

Unnamed: 0,adsh,Assets,Liabilities,LiabilitiesAndStockholdersEquity,StockholdersEquity,StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest,cik,name,period_filed,prevrpt,url
2,0000002969-22-000054,2.719260e+10,1.349020e+10,2.719260e+10,1.314400e+10,1.370240e+10,2969,"AIR PRODUCTS & CHEMICALS, INC.",2022-09-30,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
4,0000003570-23-000042,4.126600e+10,,4.126600e+10,-2.969000e+09,-1.710000e+08,3570,CHENIERE ENERGY INC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
8,0000004904-23-000011,9.346940e+10,6.930110e+10,9.346940e+10,2.389340e+10,2.412240e+10,4904,AMERICAN ELECTRIC POWER CO INC,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
11,0000005272-23-000007,5.266340e+11,4.843990e+11,5.266340e+11,4.000200e+10,4.223500e+10,5272,"AMERICAN INTERNATIONAL GROUP, INC.",2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
12,0000005513-23-000034,6.143490e+10,5.223740e+10,6.143490e+10,9.197500e+09,-2.756600e+09,5513,UNUM GROUP,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
...,...,...,...,...,...,...,...,...,...,...,...
6189,0001903596-23-000263,6.563294e+06,1.539584e+07,6.563294e+06,-1.010273e+07,-9.232543e+06,792935,ETHEMA HEALTH CORP,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6192,0001903596-23-000319,1.372732e+06,1.006462e+07,1.372732e+06,-7.975792e+06,-8.691883e+06,1286648,GZ6G TECHNOLOGIES CORP.,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6193,0001903596-23-000342,8.432410e+05,1.844698e+07,8.432410e+05,-1.561957e+07,-1.760374e+07,1530746,"KAYA HOLDINGS, INC.",2022-12-31,True,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...
6194,0001906324-23-000014,8.855800e+09,3.921200e+09,8.855800e+09,4.934600e+09,-6.760000e+07,1906324,QUIDELORTHO CORP,2022-12-31,False,https://www.sec.gov/ixviewer/ix.html?doc=/Arch...


Here's what I need to do:
1. Have a 'best in class' type of data that checks for various conditions
2. Have subordinate data quality indicators, i.e. numbers of asterisks, and perform various levels of imputation corresponding to the different qualities.
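
A possible shape for the tiered approach, sketched on made-up numbers. The tier definitions (3 = reported directly, 2 = implied using equity including noncontrolling interest, 1 = implied using parent-only equity, 0 = unavailable) and the column names `Liabilities_best` and `quality` are assumptions for illustration, not a settled design.

```python
import numpy as np
import pandas as pd

nan = float("nan")
SE_NCI = "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest"

# Made-up rows, one per quality tier.
toy = pd.DataFrame({
    "Liabilities": [60.0, nan, nan, nan],
    "LiabilitiesAndStockholdersEquity": [100.0, 100.0, 100.0, nan],
    "StockholdersEquity": [40.0, 35.0, 30.0, nan],
    SE_NCI: [40.0, 40.0, nan, nan],
})

total = toy["LiabilitiesAndStockholdersEquity"]
# Candidate values, ordered from most to least trusted.
candidates = [
    toy["Liabilities"],                 # tier 3: reported directly
    total - toy[SE_NCI],                # tier 2: implied, equity incl. NCI
    total - toy["StockholdersEquity"],  # tier 1: implied, parent-only equity
]
available = [c.notna() for c in candidates]

# np.select takes the first candidate whose condition holds, row by row.
toy["Liabilities_best"] = np.select(available, candidates, default=np.nan)
toy["quality"] = np.select(available, [3, 2, 1], default=0)
print(toy[["Liabilities_best", "quality"]])
```

Keeping the tier next to the value lets later analysis filter on quality instead of silently mixing reported and imputed figures.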

I think I should split my data fetching into various specific fetchers for clarity.

## Revenues and cash flows

Financial question: which profit do I care about: `GrossProfit`, `OperatingIncomeLoss`, or `NetIncomeLoss`? `NetIncomeLoss` appears to be the bottom line associated with earnings per share, so that is probably the correct one to use.

Alternatives:

`RevenueFromContractWithCustomerExcludingAssessedTax`

In [11]:
(data['StockholdersEquity'].isna() & data['LiabilitiesAndStockholdersEquity'].isna()).sum()

33

# Edge cases:

I have a feeling that there will be a lot of edge cases with this data, so I'll document them here

### Multiple revenue values

1. Note that the smaller values sum to the largest value, so this is some breakdown of revenues.
2. However, this breakdown is not the same as that seen in the original document: https://www.sec.gov/ixviewer/ix.html?doc=/Archives/edgar/data/846475/000141057822000453/zyxi-20211231x10k.htm

In [38]:
data = get_entries(2021, ['Revenues'])

data[data['name'] == 'ZYNEX INC']

Unnamed: 0,adsh,tag,data_year,version,coreg,qtrs,uom,value,footnote,cik,name,period_filed,prevrpt,instance
1830,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,24127000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1831,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,31022000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1832,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,34785000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1833,0001410578-22-000453,Revenues,2021,us-gaap/2021,,1,USD,40367000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml
1834,0001410578-22-000453,Revenues,2021,us-gaap/2021,,4,USD,130301000.0,,846475,ZYNEX INC,2021-12-31,False,zyxi-20211231x10k_htm.xml


In [24]:
data[data['name'] == 'ZYNEX INC']['value'].sum() / 2

130301000.0

In this case, the key distinguishing factor is `qtrs`, which is 4 for the annual total and 1 for each of the smaller values. (Not all fields use `qtrs = 4`: see the next section, for example, where `qtrs = 0`.)
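
A filter on `qtrs` could look like the sketch below, which reuses the Zynex rows shown above. The lookup table `EXPECTED_QTRS` and the helper `annual_rows` are assumptions for illustration, as is the idea that `qtrs` is stored as an integer; the real mapping would need one entry per tag.

```python
import pandas as pd

# Zynex rows copied from the output above.
zynex = pd.DataFrame({
    "adsh": ["0001410578-22-000453"] * 5,
    "tag": ["Revenues"] * 5,
    "qtrs": [1, 1, 1, 1, 4],
    "value": [24127000.0, 31022000.0, 34785000.0, 40367000.0, 130301000.0],
})

# Assumed convention: flow tags use qtrs == 4 for the annual figure,
# point-in-time (balance sheet) tags use qtrs == 0.
EXPECTED_QTRS = {"Revenues": 4, "AssetsCurrent": 0}

def annual_rows(df):
    """Hypothetical helper: keep rows whose qtrs matches the value expected for the tag."""
    expected = df["tag"].map(EXPECTED_QTRS)
    return df[df["qtrs"] == expected]

annual = annual_rows(zynex)
quarterly_sum = zynex.loc[zynex["qtrs"] == 1, "value"].sum()
print(annual["value"].iloc[0], quarterly_sum)
```

Comparing the quarterly sum with the annual row doubles as a consistency check on the filing.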

### Multiple current assets

In this case, there are multiple subsidiaries. The balances seem right for each subsidiary and for the company as a whole.

Here, the row where `coreg` is None holds the figure for the company as a whole.

In [37]:
data = get_entries(2021, ['AssetsCurrent'])
duplicates = data[ data['adsh'].duplicated(keep=False) ]
duplicates[ duplicates.cik == '4904'].sort_values('value')

Unnamed: 0,adsh,tag,data_year,version,coreg,qtrs,uom,value,footnote,cik,name,period_filed,prevrpt,instance
1698,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,OhioPowerCo,0,USD,327500000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1695,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,AEPTransmissionCo,0,USD,331300000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1694,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,AEPTexasInc.,0,USD,347200000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1699,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,PublicServiceCoOfOklahoma,0,USD,386800000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1697,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,IndianaMichiganPowerCo,0,USD,439400000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1700,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,SouthwesternElectricPowerCo,0,USD,668500000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1696,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,AppalachianPowerCo,0,USD,925300000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml
1693,0000004904-22-000024,AssetsCurrent,2021,us-gaap/2021,,0,USD,7809200000.0,,4904,AMERICAN ELECTRIC POWER CO INC,2021-12-31,False,aep-20211231_htm.xml


In [39]:
duplicates[ duplicates.cik == '4904']['value'].sum()/2

5617600000.0

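Half the sum (5.6e9) is smaller than the consolidated figure (7.8e9), so the subsidiaries do not add up to the total: the company as a whole has some assets of its own.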
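
Selecting the consolidated row, and measuring how much the listed subsidiaries leave unaccounted for, can be sketched as follows. The rows are the American Electric Power values shown above; the variable name `unattributed` is an assumption, and the gap is not necessarily held by the parent alone.

```python
import pandas as pd

# AEP AssetsCurrent rows copied from the output above.
aep = pd.DataFrame({
    "coreg": [
        "OhioPowerCo", "AEPTransmissionCo", "AEPTexasInc.",
        "PublicServiceCoOfOklahoma", "IndianaMichiganPowerCo",
        "SouthwesternElectricPowerCo", "AppalachianPowerCo", None,
    ],
    "value": [
        327500000.0, 331300000.0, 347200000.0, 386800000.0,
        439400000.0, 668500000.0, 925300000.0, 7809200000.0,
    ],
})

# Rows without a co-registrant carry the consolidated figure.
consolidated = aep[aep["coreg"].isna()]
subsidiaries = aep[aep["coreg"].notna()]

# Part of the consolidated figure not covered by the listed subsidiaries.
unattributed = consolidated["value"].iloc[0] - subsidiaries["value"].sum()
print(unattributed)
```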

### Multiple reports due to an acquisition

In [17]:
con = sqlite3.connect('data/processed/all10k.db')
all_sub_22 = pd.read_sql_query("""SELECT * FROM sub WHERE fy='2022';""", con)

all_sub_22[ all_sub_22.duplicated('cik', keep=False) ]

Unnamed: 0,adsh,cik,name,sic,countryba,stprba,cityba,zipba,bas1,bas2,...,period,fy,fp,filed,accepted,prevrpt,detail,instance,nciks,aciks
71,0001493152-22-024263,1847846,8I ACQUISITION 2 CORP.,8000,SG,,SINGAPORE,59817,C/O 6 EU TONG SEN STREET,#08-13 THE CENTRAL,...,20220731,2022,FY,20220829,2022-08-29 14:58:00.0,0,1,form10-k_htm.xml,1,
5528,0001493152-23-022805,1847846,EUDA HEALTH HOLDINGS LTD,8000,SG,,SINGAPORE,59817,C/O 6 EU TONG SEN STREET,#08-13 THE CENTRAL,...,20221231,2022,FY,20230628,2023-06-28 17:16:00.0,0,1,form10-k_htm.xml,1,


These entities have different names but the same `cik`.

In [24]:
different = (all_sub_22[ all_sub_22.duplicated('cik', keep=False) ].iloc[0] != all_sub_22[ all_sub_22.duplicated('cik', keep=False) ].iloc[1])

all_sub_22[ all_sub_22.duplicated('cik', keep=False) ].loc[:,different]

Unnamed: 0,adsh,name,stprba,stprma,stprinc,former,changed,fye,period,filed,accepted,aciks
71,0001493152-22-024263,8I ACQUISITION 2 CORP.,,,,,,731,20220731,20220829,2022-08-29 14:58:00.0,
5528,0001493152-23-022805,EUDA HEALTH HOLDINGS LTD,,,,8I ACQUISITION 2 CORP.,20210224.0,1231,20221231,20230628,2023-06-28 17:16:00.0,


Aha! We have a `former` field that shows the acquirer: the later filing lists the earlier filer's name as its former name.
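
A sketch of how the `former` field could be used to flag such pairs and keep only the most recent filing per `cik`. The two rows are the ones shown above, reduced to a few columns; treating the latest `filed` date as the one to keep is an assumption, not a decision made in this notebook.

```python
import pandas as pd

# The two submissions sharing a cik, copied from the output above.
subs = pd.DataFrame({
    "adsh": ["0001493152-22-024263", "0001493152-23-022805"],
    "cik": [1847846, 1847846],
    "name": ["8I ACQUISITION 2 CORP.", "EUDA HEALTH HOLDINGS LTD"],
    "former": [None, "8I ACQUISITION 2 CORP."],
    "filed": [20220829, 20230628],
})

# Pair every filing with every other filing of the same cik, then keep the
# pairs where the newer filing's former name equals the older filing's name.
pairs = subs.merge(subs, on="cik", suffixes=("_old", "_new"))
renames = pairs[pairs["former_new"] == pairs["name_old"]]

# Assumed policy: keep the most recently filed submission per cik.
latest = subs.sort_values("filed").drop_duplicates("cik", keep="last")
print(renames[["cik", "name_old", "name_new"]])
print(latest[["cik", "name", "filed"]])
```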