# Place Stock Trades into Senator Dataframe

## 1. Understand the Senator Trading Report (STR) Dataframe

In [1]:
import pandas as pd
#https://docs.google.com/spreadsheets/d/1lH_LpTgRlfzKvpRnWYgoxlkWvJj0v1r3zN3CeWMAgqI/edit?usp=sharing
try:
    sen_df = pd.read_csv("Senator Stock Trades/Senate Stock Watcher 04_16_2020 All Transactions.csv")
except FileNotFoundError:
    #Fall back to the copy hosted on GitHub when the local file is missing
    sen_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Senate%20Stock%20Trades/Senate%20Stock%20Watcher%2004_16_2020%20All%20Transactions.csv")
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link
0,03/04/2020,Joint,--,INGERSOLL RAND PLC SHARES (Exchanged) <br> TRA...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
1,03/04/2020,Self,--,INGERSOLL RAND PLC SHS (Exchanged) <br> TRANE ...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...


In [2]:
sen_df.type.unique()

array(['Exchange', 'Sale (Full)', 'Purchase', 'Sale (Partial)'],
      dtype=object)

There are four types of trades:
- Exchange: exchanging one stock for another
- Sale (Full): selling the entire holding of a stock
- Purchase: buying a stock
- Sale (Partial): selling part of the holding of a stock

In [3]:
n_exchanges = len(sen_df.loc[sen_df['type'] == "Exchange"])
n_trades = len(sen_df)
print("There are " +str(n_exchanges) +" exchange trades out of a total of " +str(n_trades)+ " trades.")
sen_df = sen_df.loc[sen_df['type'] != "Exchange"]


There are 84 exchange trades out of a total of 8600 trades.


For now, I will exclude exchange trades: there are few of them (84 of 8600), and the priority is to build the basic structure of the project. As the first two rows of the dataframe above show, handling them would require splitting each exchange into two rows, one per company. I may add this step later if time permits.
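If exchange trades are added back later, the split into two rows could look roughly like the sketch below. It assumes the two sides of an exchange are separated by `<br>` in `asset_description`, as the truncated rows above suggest. The example description text and the `split_df` name are made up for illustration.

```python
import pandas as pd

# Hypothetical exchange row; the description text is invented for illustration.
exchanges = pd.DataFrame({
    "transaction_date": ["03/04/2020"],
    "asset_description": ["OLD CO SHARES (Exchanged) <br> NEW CO SHARES (Received)"],
    "type": ["Exchange"],
})

# Split each description on the "<br>" separator into a list of parts,
# then give every part its own row (other columns are repeated).
parts = exchanges["asset_description"].str.split("<br>")
split_df = exchanges.assign(asset_description=parts).explode("asset_description")
split_df = split_df.reset_index(drop=True)

# Remove the whitespace that surrounded the separator.
split_df["asset_description"] = split_df["asset_description"].str.strip()

print(split_df)
```

Each resulting row would still need a ticker lookup of its own, since exchange rows carry `--` in the `ticker` column.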

There should now be 8516 trades remaining in the dataframe. Let's make sure this is so.

In [4]:
n_trades = len(sen_df)
print("There are " +str(n_trades)+ " trades in the dataframe")

There are 8516 trades in the dataframe


In [5]:
n_blank_ticker = len(sen_df.loc[sen_df['ticker'] == "--"])
print("There are " +str(n_blank_ticker) +" trades w/o a ticker out of a total of " +str(n_trades)+ " trades")
sen_df = sen_df.loc[sen_df['ticker'] != "--"]

There are 1872 trades w/o a ticker out of a total of 8516 trades


As with exchange trades, we will also exclude trades without a ticker. Every publicly traded stock has a ticker, which identifies it on the stock exchange, so dropping these rows removes trades in other types of securities (corporate bonds, municipal securities, non-public stock).
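One way to check which kinds of securities this filter drops is to count the `asset_type` values among the rows whose ticker is `--`. The sketch below uses a small toy dataframe; the `asset_type` labels other than `Stock` are assumptions, not values confirmed from the dataset.

```python
import pandas as pd

# Toy stand-in for sen_df; asset_type labels other than "Stock" are assumed.
toy = pd.DataFrame({
    "ticker": ["--", "ILMN", "--", "CGNX", "--"],
    "asset_type": ["Corporate Bond", "Stock", "Municipal Security",
                   "Stock", "Non-Public Stock"],
})

# Boolean mask marking the rows that have no ticker.
no_ticker = toy["ticker"] == "--"

# Breakdown of what would be dropped, by asset type.
dropped_types = toy.loc[no_ticker, "asset_type"].value_counts()
print(dropped_types)

# Rows that survive the filter.
kept = toy.loc[~no_ticker]
print(len(kept), "trades kept")
```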

There should now be 6644 trades remaining in the dataframe. Let's make sure this is so.

In [6]:
n_trades = len(sen_df)
print("There are " +str(n_trades)+ " trades in the dataframe")

There are 6644 trades in the dataframe


## 2. Add Data to STR Dataframe  

### Import Data

In this step we add company information, such as market cap and industry, from the online company lists provided by the NYSE, NASDAQ, and ASXL exchanges; only the NYSE and NASDAQ lists are loaded below. Links can be found here: https://stackoverflow.com/questions/25338608/download-all-stock-symbol-list-of-a-market

In [7]:
ticker_list = list()
try:
    NYSE_df = pd.read_csv("NYSEcompanylist.csv")
except FileNotFoundError:
    #Fall back to the copy hosted on GitHub when the local file is missing
    NYSE_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/NYSEcompanylist.csv")
    
try:
    NASDAQ_df = pd.read_csv("NASDAQcompanylist.csv")
except FileNotFoundError:
    NASDAQ_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/NASDAQcompanylist.csv")
    
ticker_list.append(NYSE_df)
ticker_list.append(NASDAQ_df)

NYSE_df.head()

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Unnamed: 8
0,DDD,3D Systems Corporation,7.46,$886.75M,,Technology,Computer Software: Prepackaged Software,https://old.nasdaq.com/symbol/ddd,
1,MMM,3M Company,145.74,$83.83B,,Health Care,Medical/Dental Instruments,https://old.nasdaq.com/symbol/mmm,
2,WBAI,500.com Limited,4.15,$178.45M,2013.0,Consumer Services,Services-Misc. Amusement & Recreation,https://old.nasdaq.com/symbol/wbai,
3,WUBA,58.com Inc.,52.0,$7.79B,2013.0,Technology,"Computer Software: Programming, Data Processing",https://old.nasdaq.com/symbol/wuba,
4,EGHT,8x8 Inc,18.89,$1.94B,,Technology,EDP Services,https://old.nasdaq.com/symbol/eght,


In [8]:
NASDAQ_df.head()

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Unnamed: 8
0,TXG,"10x Genomics, Inc.",64.89,$6.24B,2019.0,Capital Goods,Biotechnology: Laboratory Analytical Instruments,https://old.nasdaq.com/symbol/txg,
1,YI,"111, Inc.",5.11,$417.28M,2018.0,Health Care,Medical/Nursing Services,https://old.nasdaq.com/symbol/yi,
2,PIH,"1347 Property Insurance Holdings, Inc.",4.64,$27.93M,2014.0,Finance,Property-Casualty Insurers,https://old.nasdaq.com/symbol/pih,
3,PIHPP,"1347 Property Insurance Holdings, Inc.",25.7615,$18.03M,,Finance,Property-Casualty Insurers,https://old.nasdaq.com/symbol/pihpp,
4,TURN,180 Degree Capital Corp.,2.175,$67.69M,,Finance,Finance/Investors Services,https://old.nasdaq.com/symbol/turn,


In [9]:
"""
Add data for Berkshire Hathaway, Lions Gate Entertainment, and Royal Dutch Shell to the NYSE company list. While
#these companies are in the company list, their fields are empty. Also, change the tickers of these companies to 
#match Senate Stock Data (since dashes are used instead of periods in that dataset, we make sure the same is true 
in the NYSE company list). What matters is consistent convention here.
"""

replacement_count = 0

#Market caps below: shares outstanding reported in Q1 2020 financial reports, stock price from May 6,
#when this data is dated
for row_tuple in NYSE_df.itertuples():
    
    #All four tickers have been fixed, so stop scanning
    if replacement_count == 4:
        break
    
    #itertuples rows are read-only, so write through NYSE_df.at using the row's index label
    if row_tuple.Symbol == "BRK.B":
        NYSE_df.at[row_tuple.Index, 'Symbol'] = "BRK-B"
        NYSE_df.at[row_tuple.Index, 'MarketCap'] = "$420.02B"
        NYSE_df.at[row_tuple.Index, 'Sector'] = "Miscellaneous"
        NYSE_df.at[row_tuple.Index, 'industry'] = "Conglomerate"
        replacement_count = replacement_count + 1
        
    if row_tuple.Symbol == "LGF.B":
        NYSE_df.at[row_tuple.Index, 'Symbol'] = "LGF-B"
        NYSE_df.at[row_tuple.Index, 'MarketCap'] = "$14.62B"
        NYSE_df.at[row_tuple.Index, 'Sector'] = "Consumer Services"
        NYSE_df.at[row_tuple.Index, 'industry'] = "Movies/Entertainment"
        replacement_count = replacement_count + 1

    if row_tuple.Symbol == "RDS.A":
        NYSE_df.at[row_tuple.Index, 'Symbol'] = "RDS-A"
        NYSE_df.at[row_tuple.Index, 'MarketCap'] = "$122.28B"
        NYSE_df.at[row_tuple.Index, 'Sector'] = "Energy"
        NYSE_df.at[row_tuple.Index, 'industry'] = "Oil & Gas Production"
        replacement_count = replacement_count + 1

    if row_tuple.Symbol == "RDS.B":
        NYSE_df.at[row_tuple.Index, 'Symbol'] = "RDS-B"
        NYSE_df.at[row_tuple.Index, 'MarketCap'] = "$122.09B"
        NYSE_df.at[row_tuple.Index, 'Sector'] = "Energy"
        NYSE_df.at[row_tuple.Index, 'industry'] = "Oil & Gas Production"
        replacement_count = replacement_count + 1
    
#Confirm changes have been made successfully
for row_tuple in NYSE_df.itertuples():
    
    if row_tuple.Symbol == "BRK-B":
        print (row_tuple)
        
    if row_tuple.Symbol == "LGF-B":
        print (row_tuple)
        
    if row_tuple.Symbol == "RDS-A":
        print (row_tuple)
        
    if row_tuple.Symbol == "RDS-B":
        print (row_tuple)


Pandas(Index=365, Symbol='BRK-B', Name='Berkshire Hathaway Inc.', LastSale=nan, MarketCap='$420.02B', IPOyear=nan, Sector='Miscellaneous', industry='Conglomerate', _8='https://old.nasdaq.com/symbol/brk.b', _9=nan)
Pandas(Index=1700, Symbol='LGF-B', Name='Lions Gate Entertainment Corporation', LastSale=nan, MarketCap='$14.62B', IPOyear=2016.0, Sector='Consumer Services', industry='Movies/Entertainment', _8='https://old.nasdaq.com/symbol/lgf.b', _9=nan)
Pandas(Index=2426, Symbol='RDS-A', Name='Royal Dutch Shell PLC', LastSale=nan, MarketCap='$122.28B', IPOyear=nan, Sector='Energy', industry='Oil & Gas Production', _8='https://old.nasdaq.com/symbol/rds.a', _9=nan)
Pandas(Index=2427, Symbol='RDS-B', Name='Royal Dutch Shell PLC', LastSale=nan, MarketCap='$122.09B', IPOyear=nan, Sector='Energy', industry='Oil & Gas Production', _8='https://old.nasdaq.com/symbol/rds.b', _9=nan)
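The same corrections can also be expressed as a lookup table applied with boolean masks, which avoids scanning the dataframe row by row. This is a minimal sketch on a toy stand-in for `NYSE_df`. The `overrides` name is an assumption, and only two of the four tickers are shown; the values are copied from the cell above.

```python
import pandas as pd

# Toy stand-in for NYSE_df, with empty fields for two share classes.
nyse = pd.DataFrame({
    "Symbol": ["BRK.B", "MMM", "RDS.A"],
    "MarketCap": [None, "$83.83B", None],
    "Sector": [None, "Health Care", None],
    "industry": [None, "Medical/Dental Instruments", None],
})

# Corrections keyed by the symbol as it appears in the exchange list.
overrides = {
    "BRK.B": {"Symbol": "BRK-B", "MarketCap": "$420.02B",
              "Sector": "Miscellaneous", "industry": "Conglomerate"},
    "RDS.A": {"Symbol": "RDS-A", "MarketCap": "$122.28B",
              "Sector": "Energy", "industry": "Oil & Gas Production"},
}

for old_symbol, fields in overrides.items():
    # Compute the mask once, before the Symbol column itself is rewritten.
    mask = nyse["Symbol"] == old_symbol
    for column, value in fields.items():
        nyse.loc[mask, column] = value

print(nyse)
```

Keeping the corrections in one dictionary makes it easier to add another ticker later without copying a block of code.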


In [10]:
#There are also 2 instances where a wrong ticker for Berkshire Hathaway is found in the Senate Stock data
#(BRKB is used as opposed to BRK-B). Thus, we correct for those instances here.

#Find indices of these two trades
for row_tuple in sen_df.itertuples():
    if row_tuple.ticker == "BRKB":
        print (row_tuple)
        
#We can see that the indices are 1207 and 4611, so we will manually modify the ticker field of these trades.

sen_df.at[1207, 'ticker'] = "BRK-B"
sen_df.at[4611, 'ticker'] = "BRK-B"

len(sen_df)

Pandas(Index=1207, transaction_date='12/06/2018', owner='Self', ticker='BRKB', asset_description='Berkshire Hathaway Inc.', asset_type='Stock', type='Purchase', amount='$15,001 - $50,000', comment='--', senator='Jerry Moran,  ', ptr_link='https://efdsearch.senate.gov/search/view/ptr/dafb67b1-d578-402d-9e26-c3c2a575b42c/')
Pandas(Index=4611, transaction_date='03/02/2017', owner='Spouse', ticker='BRKB', asset_description='Berkshire Hathaway Inc.', asset_type='Stock', type='Purchase', amount='$15,001 - $50,000', comment='--', senator='Susan M Collins', ptr_link='https://efdsearch.senate.gov/search/view/ptr/ea48ecad-1959-4f7a-ace0-778a71d06437/')


6644
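Hard-coding the indices 1207 and 4611 works for this snapshot of the data, but it would silently miss new `BRKB` rows if the file were refreshed. A value-based replacement is one alternative, sketched here on a toy dataframe (the `toy` name is only for illustration):

```python
import pandas as pd

# Toy stand-in for sen_df, reusing the two index labels found above.
toy = pd.DataFrame({"ticker": ["BRKB", "ILMN", "BRKB"]}, index=[1207, 2, 4611])

# Replace every exact "BRKB" value, wherever it occurs.
toy["ticker"] = toy["ticker"].replace({"BRKB": "BRK-B"})

print(toy)
```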

In [11]:
#Get sector data for each stock trade

sector_data = list()
for row_tuple in sen_df.itertuples():
    tic = row_tuple.ticker
    count = 0
    for row_tuple_tic in NYSE_df.itertuples():
        sym = row_tuple_tic.Symbol
        if tic == sym:
            count = count+1
            if row_tuple_tic.Sector == "n/a":
                sector_data.append("none")
            else:
                sector_data.append(row_tuple_tic.Sector)
            break
    if count == 0:
        for row_tuple_tic in NASDAQ_df.itertuples():
            sym = row_tuple_tic.Symbol
            if tic == sym:
                count = count+1
                if row_tuple_tic.Sector == "n/a":
                    sector_data.append("none")
                else:
                    sector_data.append(row_tuple_tic.Sector)
                break
    if count == 0:
        sector_data.append("none")
        
print(sector_data[0:9])

['Capital Goods', 'Capital Goods', 'none', 'none', 'Basic Industries', 'Consumer Services', 'Finance', 'Technology', 'Health Care']


In [12]:
#make sure length matches number of rows in df
print(len(sector_data))

#counter for how many times the stock traded by senator not found in exchange data set
no_ticker_cnt = 0

for i in sector_data:
    if i == "none":
        no_ticker_cnt = no_ticker_cnt + 1
    
print(no_ticker_cnt)

6644
1141
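The nested loops above compare every trade against every listed company. A lookup table built once from both company lists should give the same result with far less work. The sketch below uses toy data: the `XYZ` ticker and the names `companies` and `sector_by_symbol` are assumptions. It keeps the NYSE row when a symbol appears in both lists, as the loops do.

```python
import pandas as pd

# Toy stand-ins for NYSE_df, NASDAQ_df and sen_df.
nyse = pd.DataFrame({"Symbol": ["MMM", "RPM"],
                     "Sector": ["Health Care", "Basic Industries"]})
nasdaq = pd.DataFrame({"Symbol": ["ILMN", "XYZ"],
                       "Sector": ["Capital Goods", "n/a"]})
trades = pd.DataFrame({"ticker": ["ILMN", "SIEGY", "RPM", "XYZ"]})

# NYSE comes first, so keep="first" prefers the NYSE row on duplicates.
companies = pd.concat([nyse, nasdaq]).drop_duplicates("Symbol", keep="first")
sector_by_symbol = companies.set_index("Symbol")["Sector"]

# Look up each ticker; unknown tickers become NaN, then "none".
sectors = trades["ticker"].map(sector_by_symbol)
sector_data = sectors.replace("n/a", "none").fillna("none").tolist()

print(sector_data)
```

The same pattern applies to the `industry` and `MarketCap` columns by swapping the column name.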


In [13]:
#Get industry data for each stock trade

industry_data = list()
for row_tuple in sen_df.itertuples():
    tic = row_tuple.ticker
    count = 0
    for row_tuple_tic in NYSE_df.itertuples():
        sym = row_tuple_tic.Symbol
        if tic == sym:
            count = count+1
            if row_tuple_tic.industry == "n/a":
                industry_data.append("none")
            else:
                industry_data.append(row_tuple_tic.industry)
            break
    if count == 0:
        for row_tuple_tic in NASDAQ_df.itertuples():
            sym = row_tuple_tic.Symbol
            if tic == sym:
                count = count+1
                if row_tuple_tic.industry == "n/a":
                    industry_data.append("none")
                else:
                    industry_data.append(row_tuple_tic.industry)
                break
    if count == 0:
        industry_data.append("none")
        
print(industry_data[0:9])

['Biotechnology: Laboratory Analytical Instruments', 'Industrial Machinery/Components', 'none', 'none', 'Paints/Coatings', 'Building operators', 'Major Banks', 'Semiconductors', 'Major Pharmaceuticals']


In [14]:
#make sure length matches number of rows in df
print(len(industry_data))

#counter for how many times the stock traded by senator not found in exchange data set
no_ticker_cnt = 0

for i in industry_data:
    if i == "none":
        no_ticker_cnt = no_ticker_cnt + 1
    
print(no_ticker_cnt)

6644
1141


In [15]:
#Get market cap data for each stock trade

mktcap_data = list()
for row_tuple in sen_df.itertuples():
    tic = row_tuple.ticker
    count = 0
    for row_tuple_tic in NYSE_df.itertuples():
        sym = row_tuple_tic.Symbol
        if tic == sym:
            count = count+1
            if row_tuple_tic.MarketCap == "n/a":
                mktcap_data.append("none")
            else:
                mktcap_data.append(row_tuple_tic.MarketCap)
            break
    if count == 0:
        for row_tuple_tic in NASDAQ_df.itertuples():
            sym = row_tuple_tic.Symbol
            if tic == sym:
                count = count+1
                if row_tuple_tic.MarketCap == "n/a":
                    mktcap_data.append("none")
                else:
                    mktcap_data.append(row_tuple_tic.MarketCap)
                break
    if count == 0:
        mktcap_data.append("none")
        
print(mktcap_data[0:9])

['$47.15B', '$8.58B', 'none', 'none', '$8.74B', '$34.86B', '$92.93B', '$52.6B', '$190.9B']


In [16]:
#make sure length matches number of rows in df
print(len(mktcap_data))

#counter for how many times the stock traded by senator not found in exchange data set
no_ticker_cnt = 0

for i in mktcap_data:
    if i == "none":
        no_ticker_cnt = no_ticker_cnt + 1
        
print(no_ticker_cnt)

6644
1141


In [17]:
#add new columns to df

sen_df['mkt_cap'] = mktcap_data
sen_df['sector'] = sector_data
sen_df['industry'] = industry_data

sen_df = sen_df.fillna("none")
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link,mkt_cap,sector,industry
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$47.15B,Capital Goods,Biotechnology: Laboratory Analytical Instruments
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$8.58B,Capital Goods,Industrial Machinery/Components
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,none,none,none
5,02/27/2020,Joint,WCMIX,WCM Focused International Growth Fund Institut...,Other Securities,Purchase,"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,none,none,none
6,02/27/2020,Joint,RPM,RPM International Inc.,Stock,Sale (Partial),"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,$8.74B,Basic Industries,Paints/Coatings


In [18]:
"""
Print out names of companies with missing data to find out why we have so many misses (~17% of our data).
There seem to be 3 reasons for this:
1. Companies merging with another or being acquired (or even acquiring and taking the acquired company's name - very rare)
2. Foreign companies (listed abroad)
3. American companies listed abroad - this applies to a very small number of trades
"""

from collections import Counter

company_missing_data = list()
for row_tuple in sen_df.itertuples():
    if row_tuple.mkt_cap == "none":
        company_missing_data.append(row_tuple.asset_description)

print(Counter(company_missing_data))


Counter({'First Data Corporation': 76, 'Revere Bank': 36, 'Halyard Health, Inc.': 36, 'CBS Corporation (NYSE)': 35, 'Whole Foods Market, Inc.': 29, 'Williams Partners L.P. (NYSE)': 27, 'Versum Materials, Inc.': 25, 'Bollore': 24, 'SPDR S&amp;P 500 ETF': 22, 'Celgene Corporation': 20, 'Exelis Inc. Common Stock Ex-Dis (NYSE)': 18, 'USG Corporation': 16, 'CBS Corporation': 15, 'Weight Watchers International, Inc.': 14, 'Cablevision Systems Corporation (NYSE)': 13, 'Michael Kors Holdings Limited (NYSE)': 13, 'VCA Inc.': 12, 'ACE Limited (NYSE)': 12, 'Nestl\\u00e9 S.A.': 11, 'DowDuPont Inc.': 11, 'Celgene Corporation (NASDAQ)': 11, 'KLX Inc. (NASDAQ)': 11, 'Axiall Corporation': 11, 'United Technologies Corporation': 10, 'Energy Transfer Partners, L.P.': 9, 'Targa Resources Partners LP (NYSE)': 9, 'Energy Transfer Equity, L.P. (NYSE)': 9, 'Qlik Technologies, Inc. (NASDAQ)': 9, 'Deutsche Global Infrastructure A (NASDAQ)': 9, 'Cohen &amp; Steers Dividend Value A (NASDAQ)': 9, 'Raytheon Company
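In the output above, the same company can appear under more than one label, for example `CBS Corporation (NYSE)` and `CBS Corporation`, and some names contain HTML entities such as `&amp;`. A small normalization step before counting may give a clearer picture. This is a sketch under the assumption that the only suffixes to strip are `(NYSE)` and `(NASDAQ)`. The `normalize` helper is hypothetical.

```python
import html
import re
from collections import Counter

# A few names taken from the output above.
names = [
    "CBS Corporation (NYSE)",
    "CBS Corporation",
    "SPDR S&amp;P 500 ETF",
    "Celgene Corporation (NASDAQ)",
    "Celgene Corporation",
]

def normalize(name):
    """Decode HTML entities and drop a trailing exchange tag."""
    name = html.unescape(name)                          # "&amp;" becomes "&"
    name = re.sub(r"\s*\((NYSE|NASDAQ)\)$", "", name)   # strip " (NYSE)" etc.
    return name.strip()

counts = Counter(normalize(n) for n in names)
print(counts)
```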

In [19]:
#Get a view of how many industries are found in our senate stock data.

industry_dict = Counter(industry_data)

industry_list = list()
for x in industry_dict:
    industry_list.append(x)
    
print(industry_list[0:9])
n_industries = len(industry_list)
#since 'none' is included in our list
n_industries = n_industries - 1
print("There are " + str(n_industries) + " industries covered by the trades of senators.")

Counter({'none': 1141, 'Major Banks': 350, 'Major Pharmaceuticals': 317, 'Natural Gas Distribution': 203, 'Industrial Machinery/Components': 187, 'Computer Manufacturing': 180, 'Oil & Gas Production': 166, 'Consumer Electronics/Appliances': 165, 'Telecommunications Equipment': 157, 'Computer Software: Prepackaged Software': 138, 'Television Services': 137, 'Clothing/Shoe/Accessory Stores': 135, 'Semiconductors': 132, 'Integrated oil Companies': 131, 'Major Chemicals': 112, 'Computer Software: Programming, Data Processing': 112, 'Department/Specialty Retail Stores': 106, 'Beverages (Production/Distribution)': 102, 'Services-Misc. Amusement & Recreation': 97, 'Biotechnology: Biological Products (No Diagnostic Substances)': 96, 'Package Goods/Cosmetics': 91, 'Auto Manufacturing': 90, 'Medical/Dental Instruments': 83, 'Computer peripheral equipment': 81, 'Business Services': 78, 'Diversified Commercial Services': 77, 'Real Estate Investment Trusts': 76, 'Hotels/Resorts': 73, 'Consumer Elec

In [20]:
import string

industry_size_data = list()

for row_tuple in sen_df.itertuples():
    industry_size = row_tuple.industry
    
    if industry_size == 'none':
        industry_size_data.append("none")
        continue
    
    #Convert a market cap string such as "$886.75M" or "$83.83B" to dollars
    size = row_tuple.mkt_cap
    #An "M" marks millions; any other value is treated as billions
    factor = 1000000 if "M" in size else 1000000000
    size = float(size.lstrip("$").rstrip("MB")) * factor
    
    #Pick a size bucket from 1 (smallest) to 7 (largest)
    if size < 500000000:            #under $500M
        bucket = "1"
    elif size < 1000000000:         #$500M up to $1B
        bucket = "2"
    elif size < 10000000000:        #$1B up to $10B
        bucket = "3"
    elif size < 50000000000:        #$10B up to $50B
        bucket = "4"
    elif size < 100000000000:       #$50B up to $100B
        bucket = "5"
    elif size < 500000000000:       #$100B up to $500B
        bucket = "6"
    else:                           #$500B and above
        bucket = "7"
    
    #The classification is the industry name followed by the bucket number
    industry_size_data.append(industry_size + bucket)
    
print(industry_size_data[0:9])
print(len(industry_size_data))

['Biotechnology: Laboratory Analytical Instruments4', 'Industrial Machinery/Components3', 'none', 'none', 'Paints/Coatings3', 'Building operators4', 'Major Banks5', 'Semiconductors5', 'Major Pharmaceuticals6']
6644
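The parsing and bucketing logic can also be pulled into two small functions, which makes the thresholds easy to test on their own. The sketch below is one possible arrangement; the names `THRESHOLDS`, `parse_market_cap` and `size_bucket` are assumptions. It checks for a trailing `M` rather than an `M` anywhere in the string, which is equivalent for values formatted like `$886.75M`.

```python
from bisect import bisect_right

# Upper bounds (in dollars) of buckets 1 through 6; bucket 7 is everything above.
THRESHOLDS = [500e6, 1e9, 10e9, 50e9, 100e9, 500e9]

def parse_market_cap(text):
    """Convert a string such as "$886.75M" or "$83.83B" to dollars."""
    multiplier = 1e6 if text.endswith("M") else 1e9
    return float(text.lstrip("$").rstrip("MB")) * multiplier

def size_bucket(dollars):
    """Return the bucket number, 1 (smallest) to 7 (largest)."""
    # bisect_right counts the thresholds that are <= dollars, so a value
    # exactly on a threshold falls into the higher bucket, as in the cell above.
    return bisect_right(THRESHOLDS, dollars) + 1

for cap in ["$47.15B", "$8.58B", "$92.93B", "$190.9B", "$886.75M"]:
    print(cap, size_bucket(parse_market_cap(cap)))
```

For the first four values this should reproduce the bucket digits shown in the output above (4, 3, 5 and 6).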


In [21]:
#add the new column to df

sen_df['classification'] = industry_size_data
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link,mkt_cap,sector,industry,classification
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$47.15B,Capital Goods,Biotechnology: Laboratory Analytical Instruments,Biotechnology: Laboratory Analytical Instruments4
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$8.58B,Capital Goods,Industrial Machinery/Components,Industrial Machinery/Components3
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,none,none,none,none
5,02/27/2020,Joint,WCMIX,WCM Focused International Growth Fund Institut...,Other Securities,Purchase,"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,none,none,none,none
6,02/27/2020,Joint,RPM,RPM International Inc.,Stock,Sale (Partial),"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,$8.74B,Basic Industries,Paints/Coatings,Paints/Coatings3


In [22]:
#create a list of all the classifications per industry across whole dataframe, to get a view of the breakdown in
#classifications across each industry

classification_industry_breakdown = list()

for x in industry_list:
    y = list()
    for row_tuple in sen_df.itertuples():
        if row_tuple.industry == x:
            y.append(row_tuple.classification)
    classification_industry_breakdown.append(y)
    
print(classification_industry_breakdown[0:9])

[['Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments3', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments3', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments3', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instru
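Printing every classification per industry produces a very long list. A grouped count gives a more compact view of the same breakdown. This is a minimal sketch on toy data; the `Major Banks6` row is invented for illustration, and `breakdown` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for sen_df with only the two relevant columns.
toy = pd.DataFrame({
    "industry": ["Major Banks", "Major Banks", "Major Banks", "Semiconductors"],
    "classification": ["Major Banks5", "Major Banks5", "Major Banks6",
                       "Semiconductors5"],
})

# Count trades for each (industry, classification) pair.
breakdown = toy.groupby(["industry", "classification"]).size()

print(breakdown)
```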