# SI 608 Project – Workspace
<span style="font-size: 18px;">General scratchpad workspace that preloads all the dataframes.</span>
<br>See <code>./modules</code> to review how libraries are installed and imported, and where the data is loaded, cleaned, and formatted. This notebook is only a convenience: make a copy and do whatever you'd like, or skip it entirely if you prefer.

[OpenSecrets Data Dictionary Index](../../docs/open_source_data_dictionary.md)
<br><small><em>(View the index with markdown preview)</em></small>

## Environment

#### Settings
Configure certain behaviors in this notebook.

In [5]:
DISPLAY_DF = True # for showdf() -> df.head()
SAVE_DF = True # for to_csv() -> df.to_csv()
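
A minimal sketch of how flags like these might gate helper behavior. The real `showdf()` and `to_csv()` are defined in `modules/init.ipynb` and are not shown in this notebook, so `showdf_sketch` and `to_csv_sketch` below are hypothetical stand-ins rather than the project's implementation.

```python
import os
import tempfile

import pandas as pd

DISPLAY_DF = True   # mirrors the notebook flag
SAVE_DF = False     # turned off here so the sketch writes nothing


def showdf_sketch(df, n=5):
    """Return the first n rows when DISPLAY_DF is on, otherwise None (hypothetical helper)."""
    return df.head(n) if DISPLAY_DF else None


def to_csv_sketch(df, path):
    """Write df to path only when SAVE_DF is on; return whether a file was written (hypothetical helper)."""
    if not SAVE_DF:
        return False
    df.to_csv(path, index=False)
    return True


# Made-up demo frame, not project data.
demo = pd.DataFrame({"cycle": [2018, 2018, 2018], "amount": [125, 141, 414]})
preview = showdf_sketch(demo, n=2)
wrote = to_csv_sketch(demo, os.path.join(tempfile.gettempdir(), "demo_sketch.csv"))
```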

#### Initialize
Init file contains helper functions used throughout the project.

In [7]:
%run modules/init.ipynb

Initializing project...
pandas is already installed.
matplotlib is already installed.
networkx is already installed.
numpy is already installed.
...initialization complete.
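
The "already installed" messages above come from a helper in the init module whose source is not shown here. A rough sketch of how such a check could work, assuming it tests importability first and only then calls pip; `install_if_needed_sketch` and its `import_name` parameter are hypothetical.

```python
import importlib.util
import subprocess
import sys


def install_if_needed_sketch(package, import_name=None):
    """Install `package` with pip only if it cannot be imported (hypothetical helper).

    Returns True if an install was attempted, False if the module was already available.
    """
    module = import_name or package
    if importlib.util.find_spec(module) is not None:
        print(f"{package} is already installed.")
        return False
    # Use the running interpreter so the package lands in the notebook's environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    return True


# "json" ships with Python, so this call only prints the message and installs nothing.
already_there = install_if_needed_sketch("json")
```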


#### Datasets

This module provides a single function for all of the *contribution* data from OpenSecrets.

In [9]:
%run modules/data.ipynb

Loading data module...
...data module loaded.


---
## Data

### 527 data

#### cmtes527

In [13]:
# OpenSecrets Data Definition: 527 Committees
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20527%20Cmtes.htm
columns_cmtes527 = ['cycle', 'rpt', 'ein', 'crp527name', 'affiliate', 'ultorg', 
                    'recipcode', 'cmteid', 'cid', 'eccmteid', 'party', 
                    'primcode', 'source', 'ffreq', 'ctype', 'csource', 'viewpt',
                    'comments', 'state']

if not os.path.exists('../../data/open_secrets/527/cmtes527.csv'):
    process_data('../../data/open_secrets/527/cmtes527.txt', n_expected_fields=len(columns_cmtes527), headers=columns_cmtes527, show_errs=False)

df_cmtes527 = pd.read_csv('../../data/open_secrets/527/cmtes527.csv', on_bad_lines='skip')

In [14]:
showdf(df_cmtes527)

Unnamed: 0,|2002|,|Q302|,|861006189|,|American Electronics Assn|,||,|American Electronics Assn|.1,|PB|,||.1,||.2,||.3,Unnamed: 10,|C5000|,|WebPN|,|Q|,|F|,Unnamed: 15,|N|,||.4,|AZ|
0,|2008|,|Q308|,|262108560|,|California 2008 GOP Delegation Corporate|,||,|California 2008 GOP Delegation|,|RP|,||,||,||,|R|,|Z5100|,|Name|,|Q|,|F|,||,|C|,||,|CA|
1,|2000|,|Q400|,|912101097|,|Alabama League of Environmental Action|,||,|Alabama League of Environmental Action|,|PI|,||,||,||,,|JE300|,|Name|,|Q|,|S|,|Name|,|L|,||,|AL|
2,|2012|,|Q412|,|522257109|,|International Brotherhood of Electrical Workers|,||,|International Brotherhood of Electrical Workers|,|PL|,|C00027342|,||,||,| |,|LC150|,|PAC|,|Q|,|F|,|Name|,|L|,||,|DC|
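
In the preview above, the first data row has been promoted to the header and the text fields still carry their `|` delimiters. One possible way to read this layout directly is sketched below, using the pandas `read_csv` options `quotechar`, `header=None`, and `names`. The two sample rows are reconstructed from the preview, and whether `process_data` is meant to emit this format is an assumption.

```python
from io import StringIO

import pandas as pd

columns_cmtes527 = ['cycle', 'rpt', 'ein', 'crp527name', 'affiliate', 'ultorg',
                    'recipcode', 'cmteid', 'cid', 'eccmteid', 'party',
                    'primcode', 'source', 'ffreq', 'ctype', 'csource', 'viewpt',
                    'comments', 'state']

# Two rows reconstructed from the preview (index column removed).
sample = (
    "|2002|,|Q302|,|861006189|,|American Electronics Assn|,||,"
    "|American Electronics Assn|,|PB|,||,||,||,,|C5000|,|WebPN|,|Q|,|F|,,|N|,||,|AZ|\n"
    "|2008|,|Q308|,|262108560|,|California 2008 GOP Delegation Corporate|,||,"
    "|California 2008 GOP Delegation|,|RP|,||,||,||,|R|,|Z5100|,|Name|,|Q|,|F|,||,|C|,||,|CA|\n"
)

df_sample = pd.read_csv(
    StringIO(sample),
    quotechar="|",           # treat |...| as the quoting character
    header=None,             # the file has no header row
    names=columns_cmtes527,  # supply the data-dictionary names instead
    dtype=str,               # keep IDs such as EINs as text so leading zeros survive
)
```

Reading everything as text is a deliberate choice for this sketch; numeric columns such as `cycle` can be converted afterwards with `pd.to_numeric`.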


#### expends527

In [16]:
# OpenSecrets Data Dictionary 527 Expenditure Data - from IRS Form 8872B
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20527%20Expenditures.htm
columns_expends527 = ['rpt', 'formid', 'schbid', 'orgname', 'ein', 'recipient', 
                    'recipientcrp', 'amount', 'date', 'expcode', 'source', 
                    'purpose', 'addr1', 'addr2', 'city', 'state', 'zip',
                    'employer', 'occupation']

if not os.path.exists('../../data/open_secrets/527/expends527.csv'):
    process_data('../../data/open_secrets/527/expends527.txt', nrows=500, headers=columns_expends527, n_expected_fields=len(columns_expends527), show_errs=False)

df_expends527 = pd.read_csv('../../data/open_secrets/527/expends527.csv', nrows=10000, on_bad_lines='skip')

In [17]:
showdf(df_expends527)

Unnamed: 0,|Q210|,|9595787|,|2016057|,|Republican State Leadership Cmte|,|050532524|,|VERIZON|,|Verizon Communications|,125,04/16/2010,|A70|,|@new|,|TELEPHONE|,|PO BOX 660720|,||,|DALLAS|,|TX|,|75266|,|NA|,|NA|.1
0,|Q210|,|9595787|,|2016059|,|Republican State Leadership Cmte|,|050532524|,|VERIZON WIRELESS|,|Verizon Wireless|,141,04/09/2010,|A70|,|@new|,|CELL PHONE|,|PO BOX 25505|,||,|LEHIGH VALLEY|,|PA|,|18002|,|NA|,|NA|
1,|Q210|,|9595791|,|2016223|,|GOPAC|,|521237780|,|ADP|,|Automatic Data Processing Inc|,414,04/09/2010,|W10|,|@new|,|PAYROLL TAXES|,|8094 SAND PIPER CIRCLE|,||,|WHITE MARSH|,|MD|,|21236|,|NA|,|NA|
2,|Q210|,|9595791|,|2016225|,|GOPAC|,|521237780|,|ADP|,|Automatic Data Processing Inc|,78,04/23/2010,|W10|,|@new|,|PAYROLL SERVICES|,|8094 SAND PIPER CIRCLE|,||,|WHITE MARSH|,|MD|,|21236|,|NA|,|NA|


#### rcpts527

In [19]:
# OpenSecrets Data Dictionary 527 Contribution Data - from IRS Form 8872A
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20527%20Receipts.htm
columns_rcpts527 = ['id', 'rpt', 'formid', 'schaid', 'contribid', 'contrib', 
                    'amount', 'date', 'orgname', 'ultorg', 'realcode', 
                    'recipid', 'recipcode', 'party', 'recipient', 'city', 'state',
                    'zip', 'zip4', 'pmsa', 'employer', 'occupation', 'ytd', 'gender', 'source']

if not os.path.exists('../../data/open_secrets/527/rcpts527.csv'):
    process_data('../../data/open_secrets/527/rcpts527.txt', nrows=10000, headers=columns_rcpts527, n_expected_fields=len(columns_rcpts527), show_errs=False)

df_rcpts527 = pd.read_csv('../../data/open_secrets/527/rcpts527.csv', nrows=10000, on_bad_lines='skip')

In [20]:
showdf(df_rcpts527)

Unnamed: 0,981,|Q210|,|9595837|,|2017490|,||,|WEST LA DEMOCRATIC CLUB|,1,04/18/2010,|West La Democratic Club|,||.1,|Z9600|,|270160261|,|PI|,||.2,|ActBlue Technical Services|,|BURBANK|,|CA|,|91502|,||.3,|4480|,|NA|,|NA|.1,473,||.4,|Rept|
0,982,|Q210|,|9595837|,|2017492|,||,|WINOGRAD FOR CONGRESS 2010|,259,04/18/2010,|Winograd For Congress 2010|,||,|Z9600|,|270160261|,|PI|,||,|ActBlue Technical Services|,|BURBANK|,|CA|,|91502|,||,|4480|,|NA|,|NA|,1049,||,|Rept|
1,983,|Q210|,|9595837|,|2017387|,||,|FDL ACTION PAC|,4,04/18/2010,|Fdl Action Pac|,||,|Z9600|,|270160261|,|PI|,||,|ActBlue Technical Services|,|WASHINGTON|,|DC|,|20016|,||,|8840|,|NA|,|NA|,1524,||,|Rept|
2,984,|Q210|,|9595837|,|2017390|,||,|FRANKEN MVPS|,190,04/18/2010,|Franken Mvps|,||,|Z9600|,|270160261|,|PI|,||,|ActBlue Technical Services|,|MINNEAPOLIS|,|MN|,|55458|,||,|5120|,|NA|,|NA|,662,||,|Rept|


---
### Campaign Finance 18 data
#### cands18

In [52]:
# OpenSecrets Data Definition: Candidates
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20Candidates%20Data.htm
columns_cands18 = ['cycle', 'feccandid', 'cid', 'firstlastp', 'party', 'distidrunfor', 
                    'distidcurr', 'currcand', 'cyclecand', 'crpico', 'recipcode', 
                    'nopacs']

if not os.path.exists('../../data/open_secrets/CampaignFin18/cands18.csv'):
    process_data('../../data/open_secrets/CampaignFin18/cands18.txt', headers=columns_cands18, n_expected_fields=len(columns_cands18), show_errs=False)

df_cands18 = pd.read_csv('../../data/open_secrets/CampaignFin18/cands18.csv', on_bad_lines='skip')

# # Remove party labels from names: '3', 'R', 'D', 'I', 'L', 'U', 'i'
# df_cands18['firstlast__cands18'] = df_cands18['firstlastp__cands18'].apply(
#     lambda x: x.replace(" (3)", "").replace(" (R)", "").replace(" (D)", "").replace(" (I)", "").replace(" (L)", "").replace(" (U)", "").replace(" (i)", "") if isinstance(x, str) else x
# )

Reading line 1527 of 7639...
Reading line 3054 of 7639...
Reading line 4581 of 7639...
Reading line 6108 of 7639...
Reading line 7635 of 7639...
Processed data saved as ../../data/open_secrets/CampaignFin18/cands18_clean.csv


FileNotFoundError: [Errno 2] No such file or directory: '../../data/open_secrets/CampaignFin18/cands18.csv'
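
The log above reports the processed file as `cands18_clean.csv`, while the read targets `cands18.csv`, which would explain the `FileNotFoundError`. A minimal sketch of one workaround is to probe both names and use whichever exists. `resolve_processed_csv` is a hypothetical helper, and the `_clean` suffix is inferred only from that log line.

```python
import os
import tempfile


def resolve_processed_csv(txt_path, suffixes=("", "_clean")):
    """Return the first existing CSV next to txt_path, trying each suffix in order.

    Hypothetical helper: returns None when no candidate file exists.
    """
    base, _ = os.path.splitext(txt_path)
    for suffix in suffixes:
        candidate = f"{base}{suffix}.csv"
        if os.path.exists(candidate):
            return candidate
    return None


# Demonstration in a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    clean_path = os.path.join(tmp, "cands18_clean.csv")
    with open(clean_path, "w") as fh:
        fh.write("cycle,cid\n2018,N00000001\n")
    found = resolve_processed_csv(os.path.join(tmp, "cands18.txt"))
    found_name = os.path.basename(found)
    missing = resolve_processed_csv(os.path.join(tmp, "cmtes18.txt"))
```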

In [None]:
showdf(df_cands18)

#### cmtes18
*All cmtes, lead cmtes, pac cmtes*

In [None]:
# OpenSecrets Table Definition: Committee table
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20for%20Cmtes.htm
columns_cmtes18 = ['cycle', 'cmteid', 'pacshort', 'affiliate', 'ultorg', 'recipid', 
                    'recipcode', 'feccandid', 'party', 'primcode', 'source', 'sensitive',
                    'foreign', 'active']

if not os.path.exists('../../data/open_secrets/CampaignFin18/cmtes18.csv'):
    process_data('../../data/open_secrets/CampaignFin18/cmtes18.txt', headers=columns_cmtes18, n_expected_fields=len(columns_cmtes18), show_errs=False)

df_cmtes18 = pd.read_csv('../../data/open_secrets/CampaignFin18/cmtes18.csv', on_bad_lines='skip')

**All cmtes**

In [None]:
print(len(df_cmtes18))
showdf(df_cmtes18)

**Split lead and non-lead cmtes**

In [None]:
# Goal: find transactions that flow from a non-lead pac to a lead pac.
# For each pacid__pacs18 value, look up whether it is a lead pac,
# then remove every pacid__pacs18 that represents a lead pac.
df_recipid_cmtes18 = df_cmtes18[['cmteid__cmtes18', 'recipid__cmtes18']]

# Lead pac committees pacids, for filtering.
df_recipid_lead_cmtes18 = df_recipid_cmtes18[df_recipid_cmtes18['recipid__cmtes18'].str.startswith('N', na=False)]
df_recipid_lead_cmtes18 = df_recipid_lead_cmtes18[['cmteid__cmtes18']]

# Non-lead pac committees pacids, for filtering.
df_recipid_pac_cmtes18 = df_recipid_cmtes18[df_recipid_cmtes18['recipid__cmtes18'].str.startswith('C', na=False)]
df_recipid_pac_cmtes18 = df_recipid_pac_cmtes18[['cmteid__cmtes18']]

In [None]:
# Lead pac committees (filtered).
df_lead_cmtes18 = df_cmtes18[df_cmtes18['cmteid__cmtes18'].isin(df_recipid_lead_cmtes18['cmteid__cmtes18'])]
df_lead_cmtes18.columns = df_lead_cmtes18.columns.str.replace(r'(.*?)__(.*)', r'\1_lead__\2', regex=True)
print(len(df_lead_cmtes18))
showdf(df_lead_cmtes18)

In [None]:
# Non-lead pac committees (filtered).
df_pac_cmtes18 = df_cmtes18[df_cmtes18['cmteid__cmtes18'].isin(df_recipid_pac_cmtes18['cmteid__cmtes18'])]
df_pac_cmtes18.columns = df_pac_cmtes18.columns.str.replace(r'(.*?)__(.*)', r'\1_pac__\2', regex=True)
print(len(df_pac_cmtes18))
showdf(df_pac_cmtes18)
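
The prefix-based split and the column renaming can be checked on a toy frame. This is an illustrative sketch only: the IDs are made up, and the `unmatched` frame is an addition to show rows that fall into neither group (for example, a missing `recipid`).

```python
import pandas as pd

# Made-up IDs; only the column names follow the notebook's convention.
toy = pd.DataFrame({
    "cmteid__cmtes18": ["C001", "C002", "C003", "C004"],
    "recipid__cmtes18": ["N001", "C002", None, "N002"],
})

# na=False turns missing recipids into False instead of NaN in the mask.
is_lead = toy["recipid__cmtes18"].str.startswith("N", na=False)
is_pac = toy["recipid__cmtes18"].str.startswith("C", na=False)

# Copy before renaming so the original frame's columns stay untouched.
lead = toy[is_lead].copy()
lead.columns = lead.columns.str.replace(r"(.*?)__(.*)", r"\1_lead__\2", regex=True)

pac = toy[is_pac].copy()
pac.columns = pac.columns.str.replace(r"(.*?)__(.*)", r"\1_pac__\2", regex=True)

# Rows matched by neither prefix are worth inspecting separately.
unmatched = toy[~(is_lead | is_pac)]
```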

#### pac_other18 – pacs to pacs
*All pacs, pac-to-pac, pac-to-cand*

**All pacs**

In [None]:
# OpenSecrets Data Definition for PAC to PAC Data (Pac_other table)
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20PAC%20to%20PAC%20Data.htm
columns_pac_other18 = ['cycle', 'fecrecno', 'filerid', 'donorcmte', 'contriblendtrans', 'city', 'state', 
                            'zip', 'fecoccemp', 'primcode', 'date', 'amount', 'recipid', 'party', 'otherid',
                            'recipcode', 'recipprimcode', 'amend', 'report', 'pg', 'microfilm', 'type',
                            'realcode', 'source']

if not os.path.exists('../../data/open_secrets/CampaignFin18/pac_other18.csv'):
    process_data('../../data/open_secrets/CampaignFin18/pac_other18.txt', headers=columns_pac_other18, n_expected_fields=len(columns_pac_other18), show_errs=False)

df_pac_other18 = pd.read_csv('../../data/open_secrets/CampaignFin18/pac_other18.csv', on_bad_lines='skip')

In [None]:
# Identify the donor pacid.
# The "filerid" is the donor if "type" starts with "2".
# The "otherid" is the donor if "type" starts with "1".
df_pac_other18['donorid__pac_other18'] = df_pac_other18.apply(
    lambda row: row['otherid__pac_other18'] if row['type__pac_other18'].startswith('1')
    else (row['filerid__pac_other18'] if row['type__pac_other18'].startswith('2') else None),
    axis=1
)

In [None]:
# Identify the recipient lead pacid (starting with "C").
# The "filerid" is the recipient if "type" starts with "1".
# The "otherid" is the recipient if "type" starts with "2".
df_pac_other18['recippacid__pac_other18'] = df_pac_other18.apply(
    lambda row: row['otherid__pac_other18'] if row['type__pac_other18'].startswith('2')
    else (row['filerid__pac_other18'] if row['type__pac_other18'].startswith('1') else None),
    axis=1
)
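
The two row-wise `apply` calls above would raise if `type` is missing in any row, since a missing value has no `startswith` method. A vectorized alternative is sketched here with `Series.where`, applying the same rule (type starting with "1": `otherid` donates to `filerid`; type starting with "2": `filerid` donates to `otherid`). The toy IDs and type codes are made up for illustration.

```python
import pandas as pd

# Made-up rows; the third has a missing type on purpose.
toy = pd.DataFrame({
    "filerid__pac_other18": ["C001", "C003", "C005"],
    "otherid__pac_other18": ["C002", "C004", "C006"],
    "type__pac_other18": ["24K", "15", None],
})

type_code = toy["type__pac_other18"]
starts_1 = type_code.str.startswith("1", na=False)
starts_2 = type_code.str.startswith("2", na=False)

filer = toy["filerid__pac_other18"]
other = toy["otherid__pac_other18"]

# Series.where keeps values where the mask is True and takes `other` elsewhere;
# rows matching neither prefix end up as missing values.
toy["donorid__pac_other18"] = other.where(starts_1, filer.where(starts_2))
toy["recippacid__pac_other18"] = filer.where(starts_1, other.where(starts_2))
```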

In [None]:
showdf(df_pac_other18)

In [None]:
# # Notice that candidates are never filers in pac_other18.
# df_pac_other18[df_pac_other18['filerid__pac_other18'].str.startswith('N', na=False)]

**Pacs to pacs**

In [None]:
# Flows from pacs to pacs (non-lead/candidate)
df_pac_to_pac = df_pac_other18[~ df_pac_other18['recipid__pac_other18'].str.startswith('N', na=False)]

In [None]:
showdf(df_pac_to_pac)

**Pacs to cands**

In [None]:
# Flows from pacs *directly* to indiv candidates
df_pac_to_cand = df_pac_other18[df_pac_other18['recipid__pac_other18'].str.startswith('N', na=False)]

In [None]:
showdf(df_pac_to_cand)

#### pacs18 – pacs to cands

In [None]:
# Pacs18 – Lead pacs only.
# OpenSecrets Data Definition: PAC table (PACs to Candidates)
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20for%20PAC%20to%20Cands%20Data.htm
# "pacid" who represents "realcode" (industry or ideology) "di" (directly or indirectly) contributes "amount" to "cid".
# NOTE: pacid__pacs18 never equals cid__pacs18 – no self-contributions.
columns_pacs18 = ['cycle', 'fecrecno', 'pacid', 'cid', 'amount', 'date', 'realcode', 
                            'type', 'di', 'feccandid']

if not os.path.exists('../../data/open_secrets/CampaignFin18/pacs18.csv'):
    process_data('../../data/open_secrets/CampaignFin18/pacs18.txt', headers=columns_pacs18, n_expected_fields=len(columns_pacs18), show_errs=False)

df_pacs18 = pd.read_csv('../../data/open_secrets/CampaignFin18/pacs18.csv', on_bad_lines='skip')

In [None]:
showdf(df_pacs18)

In [None]:
# # Lead pac transactions
# df_lead_pacs18 = df_pacs18[df_pacs18['pacid__pacs18'].isin(df_recipid_lead_cmtes18['cmteid__cmtes18'])]
# print(len(df_lead_pacs18))
# showdf(df_lead_pacs18)

In [None]:
# # Non-lead pac transactions
# df_pac_pacs18 = df_pacs18[df_pacs18['pacid__pacs18'].isin(df_recipid_pac_cmtes18['cmteid__cmtes18'])]
# print(len(df_pac_pacs18))
# showdf(df_pac_pacs18)

#### indivs18

In [None]:
# OpenSecrets Data Definition: Individual Contribution Data
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20for%20Individual%20Contribution%20Data.htm
columns_indivs18 = ['cycle', 'fectransid', 'contribid', 'contrib_last', 'contrib_first', 'recipid', 'orgname', 
                    'ultorg', 'realcode', 'date', 'amount', 'street', 'city', 'state',
                    'zip', 'recipcode', 'type', 'cmteid', 'otherid', 'gender', 'microfilm',
                    'occupation', 'employer', 'source']

# This dataset is huge, and crashes my computer.
# Takes 6.5min to read the file.

if not os.path.exists('../../data/open_secrets/CampaignFin18/indivs18.csv'):
    process_data('../../data/open_secrets/CampaignFin18/indivs18.txt', headers=columns_indivs18, nrows=1000, n_expected_fields=len(columns_indivs18), show_errs=False)

df_indivs18 = pd.read_csv('../../data/open_secrets/CampaignFin18/indivs18.csv', on_bad_lines='skip', nrows=1000)
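
Since the full file is reported to be too large to load comfortably, one option is to stream it with `read_csv(..., chunksize=...)` and keep only a running aggregate instead of the whole table. A minimal sketch on in-memory data follows; the sample rows, the unsuffixed column names, and `sum_amount_by_recipient` are illustrative assumptions.

```python
from io import StringIO

import pandas as pd


def sum_amount_by_recipient(source, chunksize=100_000):
    """Stream a CSV in chunks and total `amount` per `recipid` (hypothetical helper)."""
    totals = None
    reader = pd.read_csv(source, usecols=["recipid", "amount"], chunksize=chunksize)
    for chunk in reader:
        part = chunk.groupby("recipid")["amount"].sum()
        # Align on recipid; fill_value=0 handles ids seen in only one side.
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals


# Made-up rows standing in for the real file.
sample = StringIO(
    "recipid,amount\n"
    "N1,100\n"
    "N2,50\n"
    "N1,25\n"
    "C9,10\n"
    "N2,5\n"
)
totals = sum_amount_by_recipient(sample, chunksize=2)
```

Only two columns and one chunk are held in memory at a time, which is the point of the design; the trade-off is that any analysis must be expressible as an incremental aggregate.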

In [None]:
showdf(df_indivs18)

---
### Expends18 data
#### expends18

In [None]:
# OpenSecrets Data Dictionary for Expenditure Data - from FEC electronic filings
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20Expenditures.htm
columns_expends18 = ['cycle', 'id', 'transid', 'crpfilerid', 
                     'recipcode', 'pacshort', 'crprecipname', 
                     'expcode', 'amount', 'date', 'city', 'state', 
                     'zip', 'cmteid_ef', 'candid', 'type',
                     'descrip', 'pg', 'elecother', 'enttype',
                     'source']

if not os.path.exists('../../data/open_secrets/Expend18/expends18.csv'):
    process_data('../../data/open_secrets/Expend18/expends18.txt', headers=columns_expends18, nrows=1000, n_expected_fields=len(columns_expends18), show_errs=False)

df_expends18 = pd.read_csv('../../data/open_secrets/Expend18/expends18.csv', on_bad_lines='skip', nrows=1000)

In [None]:
# All pac expenditures
showdf(df_expends18)

In [None]:
# Lead pac expenditures.
df_lead__expends18 = df_expends18[df_expends18['crpfilerid__expends18'].str.startswith('N', na=False)]
print(len(df_lead__expends18))
showdf(df_lead__expends18)

In [None]:
# Non-lead pac expenditures.
df_pac__expends18 = df_expends18[df_expends18['crpfilerid__expends18'].str.startswith('C', na=False)]
print(len(df_pac__expends18))
showdf(df_pac__expends18)

---
### Lobby data
#### lob_agency

In [None]:
# OpenSecrets Data Definition for Lobbying Data: Lobby agencies
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_agency.htm
columns_lob_agency = ['uniqid', 'agencyid', 'agency']

if not os.path.exists('../../data/open_secrets/Lobby/lob_agency.csv'):
    process_data('../../data/open_secrets/Lobby/lob_agency.txt', headers=columns_lob_agency, n_expected_fields=len(columns_lob_agency), show_errs=False)

df_lob_agency = pd.read_csv('../../data/open_secrets/Lobby/lob_agency.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_agency)

#### lob_bills

In [None]:
# OpenSecrets Data Definition for Lobbying Data: Lobby bills
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_bills.htm
columns_lob_bills = ['b_id', 'si_id', 'congno', 'bill_name']

if not os.path.exists('../../data/open_secrets/Lobby/lob_bills.csv'):
    process_data('../../data/open_secrets/Lobby/lob_bills.txt', headers=columns_lob_bills, n_expected_fields=len(columns_lob_bills), show_errs=False)

df_lob_bills = pd.read_csv('../../data/open_secrets/Lobby/lob_bills.csv', on_bad_lines='skip')
# Drop the last two characters of each bill name; .str slicing leaves missing values untouched.
df_lob_bills['bill_name__lob_bills'] = df_lob_bills['bill_name__lob_bills'].str[:-2]

In [None]:
showdf(df_lob_bills)

#### lob_indus

In [None]:
# OpenSecrets Data Definition for Lobbying Data: Lobby industries
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_indus.htm
columns_lob_indus = ['client', 'sub', 'total', 'year', 'catcode']

if not os.path.exists('../../data/open_secrets/Lobby/lob_indus.csv'):
    process_data('../../data/open_secrets/Lobby/lob_indus.txt', headers=columns_lob_indus, n_expected_fields=len(columns_lob_indus), show_errs=False)

df_lob_indus = pd.read_csv('../../data/open_secrets/Lobby/lob_indus.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_indus)

#### lob_issue

In [None]:
# OpenSecrets Data Definition for Lobbying Data: Lobby issues
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_issues.htm
columns_lob_issue = ['si_id', 'uniqid', 'issueid', 'issue', 'specificissue', 'year']

if not os.path.exists('../../data/open_secrets/Lobby/lob_issue.csv'):
    process_data('../../data/open_secrets/Lobby/lob_issue.txt', headers=columns_lob_issue, n_expected_fields=len(columns_lob_issue), show_errs=False)

df_lob_issue = pd.read_csv('../../data/open_secrets/Lobby/lob_issue.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_issue)

#### lob_issue_no_specific

In [None]:
# OpenSecrets Data Definition for Lobbying Data: Lobby issues (no specific issue)
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_issues.htm
columns_lob_issue_no_specific = ['si_id', 'uniqid', 'issueid', 'issue', 'year']

if not os.path.exists('../../data/open_secrets/Lobby/lob_issue_NoSpecficIssue.csv'):
    process_data('../../data/open_secrets/Lobby/lob_issue_NoSpecficIssue.txt', headers=columns_lob_issue_no_specific, n_expected_fields=len(columns_lob_issue_no_specific), show_errs=False)

df_lob_issue_no_specific = pd.read_csv('../../data/open_secrets/Lobby/lob_issue_NoSpecficIssue.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_issue_no_specific)

#### lob_lobbying

In [None]:
# OpenSecrets Data Definitions for Lobbying Data: Lobbying
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_lobbying.htm
columns_lob_lobbying = ['uniqid','registrant_raw','registrant','isfirm','client_raw','client','ultorg','amount',
                        'catcode','source','self','includensfs','use',
                       'ind', 'year', 'type', 'typelong', 'affiliate']

if not os.path.exists('../../data/open_secrets/Lobby/lob_lobbying.csv'):
    process_data('../../data/open_secrets/Lobby/lob_lobbying.txt', headers=columns_lob_lobbying, n_expected_fields=len(columns_lob_lobbying), show_errs=False)

df_lob_lobbying = pd.read_csv('../../data/open_secrets/Lobby/lob_lobbying.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_lobbying)

#### lob_lobbyist

In [None]:
# OpenSecrets Data Definition for Lobbyists
# https://www.opensecrets.org/resources/datadictionary/Data%20Dictionary%20lob_lobbyists.htm
columns_lob_lobbyist = ['uniqid', 'lobbyist_lastname_std', 'lobbyist_firstname_std', 'lobbyist_lastname_raw', 
                     'lobbyist_firstname_raw', 'lobbyist_id', 'year', 'officialposition', 'cid', 'formercongmem']

if not os.path.exists('../../data/open_secrets/Lobby/lob_lobbyist.csv'):
    process_data('../../data/open_secrets/Lobby/lob_lobbyist.txt', headers=columns_lob_lobbyist, n_expected_fields=len(columns_lob_lobbyist), show_errs=False)

df_lob_lobbyist = pd.read_csv('../../data/open_secrets/Lobby/lob_lobbyist.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_lobbyist)

#### lob_rpt

In [None]:
# OpenSecrets Data Definitions for Lobbying Data: Report types
# No documentation provided on OpenSecrets.com
columns_lob_rpt = ['typelong', 'typeshort']

if not os.path.exists('../../data/open_secrets/Lobby/lob_rpt.csv'):
    process_data('../../data/open_secrets/Lobby/lob_rpt.txt', headers=columns_lob_rpt, n_expected_fields=len(columns_lob_rpt), show_errs=False)

df_lob_rpt = pd.read_csv('../../data/open_secrets/Lobby/lob_rpt.csv', on_bad_lines='skip')

In [None]:
showdf(df_lob_rpt)

### IDs and categories
#### CRP_ID

In [None]:
install_if_needed('xlrd')
import xlrd

In [None]:
# Candidate ids
# This dataset is very different, so load it independently.
columns_crp_ids = ['blank_excel_column__crp_ids', 'cid__crp_ids', 'crpname__crp_ids', 'party__crp_ids', 'distidrunfor__crp_ids', 'feccandid__crp_ids'] # Blank excel column is necessary.
columns_crp_ids = dict(enumerate(columns_crp_ids))
df_crp_ids = pd.read_excel('../../data/open_secrets/CRP_IDs.xls', header=None, skiprows=15)
df_crp_ids = df_crp_ids.drop(df_crp_ids.columns[0], axis=1)
df_crp_ids = df_crp_ids.rename(columns=columns_crp_ids)

In [None]:
showdf(df_crp_ids)

#### CRP_Categories

In [None]:
from io import StringIO
crp_filepath = '../../data/open_secrets/CRP_Categories.txt'
with open(crp_filepath, 'r') as file:
    lines = file.readlines()

header_line_index = next(i for i, line in enumerate(lines) if line.startswith('Catcode'))
table_data = ''.join(lines[header_line_index:])
df_crp_cats = pd.read_csv(StringIO(table_data), sep='\t')
df_crp_cats.columns = df_crp_cats.columns.str.lower().str.replace(' ', '_')
df_crp_cats.columns = [col + '__crp_cats' for col in df_crp_cats.columns]
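
The header-detection approach above can be tried on in-memory text. This sketch uses a made-up preamble and made-up rows, and adds an explicit error when no `Catcode` line is found, which the `next(...)` call above would otherwise surface as a bare `StopIteration`.

```python
from io import StringIO

import pandas as pd

# Made-up file contents: free-text preamble followed by a tab-separated table.
raw = (
    "Some preamble line\n"
    "\n"
    "Catcode\tCatname\tSector Long\n"
    "A0000\tExample category one\tExample sector\n"
    "B0000\tExample category two\tExample sector\n"
)

lines = raw.splitlines(keepends=True)

# Find the first line that starts the table; None if it is absent.
header_idx = next(
    (i for i, line in enumerate(lines) if line.startswith("Catcode")), None
)
if header_idx is None:
    raise ValueError("No line starting with 'Catcode' was found")

df_toy_cats = pd.read_csv(StringIO("".join(lines[header_idx:])), sep="\t")

# Same normalization as above: lowercase, underscores, dataset suffix.
df_toy_cats.columns = [
    col.lower().replace(" ", "_") + "__crp_cats" for col in df_toy_cats.columns
]
```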

In [None]:
showdf(df_crp_cats)