# Place Stock Trades into Senator Dataframe

## 1. Understand the Senator Trading Report (STR) Dataframe

In [1]:
import pandas as pd
#https://docs.google.com/spreadsheets/d/1lH_LpTgRlfzKvpRnWYgoxlkWvJj0v1r3zN3CeWMAgqI/edit?usp=sharing
try:
    sen_df = pd.read_csv("Senator Stock Trades/Senate Stock Watcher 04_16_2020 All Transactions.csv")
except FileNotFoundError:  # local copy missing, fall back to the copy hosted on GitHub
    sen_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Senate%20Stock%20Trades/Senate%20Stock%20Watcher%2004_16_2020%20All%20Transactions.csv")
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link
0,03/04/2020,Joint,--,INGERSOLL RAND PLC SHARES (Exchanged) <br> TRA...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
1,03/04/2020,Self,--,INGERSOLL RAND PLC SHS (Exchanged) <br> TRANE ...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...


In [2]:
sen_df.type.unique()

array(['Exchange', 'Sale (Full)', 'Purchase', 'Sale (Partial)'],
      dtype=object)

There are 4 types of trades:
- Exchange: Exchanging one stock for another
- Sale (Full): Selling the entire position in a stock
- Purchase: Buying a stock
- Sale (Partial): Selling part of the position in a stock

In [3]:
n_exchanges = len(sen_df.loc[sen_df['type'] == "Exchange"])
n_trades = len(sen_df)
print("There are " +str(n_exchanges) +" exchange trades out of a total of " +str(n_trades)+ " trades.")
sen_df = sen_df.loc[sen_df['type'] != "Exchange"]


There are 84 exchange trades out of a total of 8600 trades.


For now, I will exclude exchange trades: there are so few of them (84 out of 8600), and I want to build the basic structure of the project first. As the first two rows of the dataframe above show, including them would require splitting each exchange into two rows, one per company. I may add this step later if time permits.
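As a rough sketch of what that split could look like, the cell below breaks one exchange row into two rows, one per company. It assumes the two legs of an exchange are always separated by `<br>` in `asset_description`, as in the rows shown above; the sample row and its company names are made up for illustration.

```python
import pandas as pd

# Made-up sample row that mimics the shape of the exchange rows shown above
sample = pd.DataFrame({
    "senator": ["Sheldon Whitehouse"],
    "type": ["Exchange"],
    "asset_description": ["COMPANY A SHARES (Exchanged) <br> COMPANY B SHARES (Received)"],
})

# Split the description on the <br> separator, then give each leg its own row
legs = sample.assign(
    asset_description=sample["asset_description"].str.split("<br>")
).explode("asset_description")

# Tidy up whitespace left over from the split and renumber the rows
legs["asset_description"] = legs["asset_description"].str.strip()
legs = legs.reset_index(drop=True)
print(legs)
```

Each resulting row would still need a ticker and a buy/sell direction before it could be treated like the other trades.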

There should now be 8516 trades remaining in the dataframe. Let's make sure this is so.

In [4]:
n_trades = len(sen_df)
print("There are " +str(n_trades)+ " trades in the dataframe")

There are 8516 trades in the dataframe


In [5]:
n_blank_ticker = len(sen_df.loc[sen_df['ticker'] == "--"])
print("There are " +str(n_blank_ticker) +" trades w/o a ticker out of a total of " +str(n_trades)+ " trades")
sen_df = sen_df.loc[sen_df['ticker'] != "--"]

There are 1872 trades w/o a ticker out of a total of 8516 trades


As with exchange trades, we will also exclude trades without a ticker. Every publicly traded stock has a ticker, which is its identifier on the stock exchange, so dropping these rows removes trades in other types of securities (corporate bonds, municipal securities, non-public stock).

There should now be 6644 trades remaining in the dataframe. Let's make sure this is so.

In [6]:
n_trades = len(sen_df)
print("There are " +str(n_trades)+ " trades in the dataframe")

There are 6644 trades in the dataframe
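The two filters above can also be written as one boolean mask. The sketch below uses a small toy dataframe (not the real data) with the same `type` and `ticker` columns; taking a `.copy()` of the filtered frame is a common way to avoid pandas' chained-assignment warnings when new columns are added later.

```python
import pandas as pd

# Toy stand-in for sen_df; the rows are illustrative only
toy = pd.DataFrame({
    "ticker": ["--", "ILMN", "CGNX", "--"],
    "type": ["Exchange", "Sale (Full)", "Purchase", "Purchase"],
})

# Keep rows that are not exchanges and that have a real ticker
keep = (toy["type"] != "Exchange") & (toy["ticker"] != "--")
filtered = toy.loc[keep].copy()

n_dropped = len(toy) - len(filtered)
print("Kept " + str(len(filtered)) + " of " + str(len(toy)) + " trades, dropped " + str(n_dropped))
```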


## 2. Add Data to STR Dataframe  

### Import Data

In this step we will use company information such as market cap and industry from online lists provided by the NYSE, NASDAQ, and ASXL exchanges (the ASXL list is commented out in the code below). Links can be found here: https://stackoverflow.com/questions/25338608/download-all-stock-symbol-list-of-a-market

In [7]:
ticker_list = list()
try:
    NYSE_df = pd.read_csv("NYSEcompanylist.csv")
except FileNotFoundError:  # local copy missing, fall back to the copy hosted on GitHub
    NYSE_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/NYSEcompanylist.csv")
    
try:
    NASDAQ_df = pd.read_csv("NASDAQcompanylist.csv")
except FileNotFoundError:  # local copy missing, fall back to the copy hosted on GitHub
    NASDAQ_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/NASDAQcompanylist.csv")
    
#try:
#    ASXL_df = pd.read_csv("ASXLcompanylist.csv")
#except:
#    ASXL_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/ASXLcompanylist.csv")
    
ticker_list.append(NYSE_df)
ticker_list.append(NASDAQ_df)
#ticker_list.append(ASXL_df)


#ticker_list[0].head()


In [20]:
#Get sector data for each stock trade

sector_data = list()
for row_tuple in sen_df.itertuples():
    tic = row_tuple.ticker
    count = 0
    for row_tuple_tic in ticker_list[0].itertuples():
        sym = row_tuple_tic.Symbol
        if tic == sym:
            count = count+1
            if row_tuple_tic.Sector == "n/a":
                sector_data.append("none")
            else:
                sector_data.append(row_tuple_tic.Sector)
            break
    if count == 0:
        for row_tuple_tic in ticker_list[1].itertuples():
            sym = row_tuple_tic.Symbol
            if tic == sym:
                count = count+1
                if row_tuple_tic.Sector == "n/a":
                    sector_data.append("none")
                else:
                    sector_data.append(row_tuple_tic.Sector)
                break
    if count == 0:
        sector_data.append("none")
        
print(sector_data[0:9])

['Capital Goods', 'none', 'none', 'Basic Industries', 'Consumer Services', 'Finance', 'Technology', 'Health Care', 'Consumer Services']
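The nested loops above scan the exchange lists once per trade. A possible alternative, sketched below on toy data, is a single left merge: stack the NYSE and NASDAQ lists, keep the first entry per `Symbol` (so NYSE wins, as in the loop), and join on the ticker. The `Symbol`, `Sector`, `industry`, and `MarketCap` column names come from the code above; the `DUAL` symbol and its values are invented to show the tie-break.

```python
import pandas as pd

# Toy stand-ins for NYSE_df, NASDAQ_df and sen_df
nyse = pd.DataFrame({
    "Symbol": ["RPM", "DUAL"],
    "Sector": ["Basic Industries", "Finance"],
    "industry": ["Paints/Coatings", "Major Banks"],
    "MarketCap": ["$8.74B", "$1.0B"],
})
nasdaq = pd.DataFrame({
    "Symbol": ["ILMN", "DUAL"],
    "Sector": ["Capital Goods", "Technology"],
    "industry": ["Biotechnology: Laboratory Analytical Instruments", "Semiconductors"],
    "MarketCap": ["$47.15B", "$2.0B"],
})
trades = pd.DataFrame({"ticker": ["ILMN", "SIEGY", "RPM", "DUAL"]})

# Stack both lists; keep="first" gives NYSE priority for duplicate symbols
companies = pd.concat([nyse, nasdaq], ignore_index=True)
companies = companies.drop_duplicates(subset="Symbol", keep="first")

# A left merge keeps every trade, in order; unmatched tickers get NaN
merged = trades.merge(companies, how="left", left_on="ticker", right_on="Symbol")
merged = merged.drop(columns="Symbol")

# Use the same "none" placeholder as the loop version
cols = ["Sector", "industry", "MarketCap"]
merged[cols] = merged[cols].fillna("none")
print(merged)
```

This fills the sector, industry, and market cap columns in one pass instead of three separate loops.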


In [9]:
#make sure length matches number of rows in df
print(len(sector_data))

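#counter for how many times the stock traded by a senator was not found in the exchange data set
no_ticker_cnt = 0
#counter for how many times the stock was found in the exch data set, but that company's data had no value for the field
#(the lookup above already stores "n/a" fields as "none", so this stays 0 and those cases are counted in no_ticker_cnt)
ticker_missing_field_cnt = 0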

for i in sector_data:
    if i == "none":
        no_ticker_cnt = no_ticker_cnt + 1
        
    if i == "n/a":
        ticker_missing_field_cnt = ticker_missing_field_cnt + 1
    
print(no_ticker_cnt)
print(ticker_missing_field_cnt)

6644
1206
0
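Because both cases end up as "none", the two counters above cannot tell them apart. One way to separate them, sketched here on toy data, is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column saying whether each trade found a match. The `XYZ` symbol and the toy frames are invented for illustration.

```python
import pandas as pd

# Toy exchange list: XYZ is listed but has no sector value
companies = pd.DataFrame({"Symbol": ["ILMN", "XYZ"], "Sector": ["Capital Goods", None]})
trades = pd.DataFrame({"ticker": ["ILMN", "SIEGY", "XYZ", "XYZ"]})

merged = trades.merge(
    companies, how="left", left_on="ticker", right_on="Symbol", indicator=True
)

# Ticker not present in the exchange list at all
no_ticker_cnt = int((merged["_merge"] == "left_only").sum())
# Ticker present, but the field itself is empty
found = merged["_merge"] == "both"
ticker_missing_field_cnt = int((found & merged["Sector"].isna()).sum())

print(no_ticker_cnt)
print(ticker_missing_field_cnt)
```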


In [10]:
#Get industry data for each stock trade

industry_data = list()
for row_tuple in sen_df.itertuples():
    tic = row_tuple.ticker
    count = 0
    for row_tuple_tic in ticker_list[0].itertuples():
        sym = row_tuple_tic.Symbol
        if tic == sym:
            count = count+1
            if row_tuple_tic.industry == "n/a":
                industry_data.append("none")
            else:
                industry_data.append(row_tuple_tic.industry)
            break
    if count == 0:
        for row_tuple_tic in ticker_list[1].itertuples():
            sym = row_tuple_tic.Symbol
            if tic == sym:
                count = count+1
                if row_tuple_tic.industry == "n/a":
                    industry_data.append("none")
                else:
                    industry_data.append(row_tuple_tic.industry)
                break
    if count == 0:
        industry_data.append("none")
        
print(industry_data[0:9])

['Biotechnology: Laboratory Analytical Instruments', 'Industrial Machinery/Components', 'none', 'none', 'Paints/Coatings', 'Building operators', 'Major Banks', 'Semiconductors', 'Major Pharmaceuticals', 'Newspapers/Magazines', 'Oil & Gas Production', 'Food Chains', 'Major Chemicals', 'Hotels/Resorts', 'Hotels/Resorts', 'Major Banks', 'Food Chains', 'Food Chains', 'Major Banks', 'Food Chains', 'Clothing/Shoe/Accessory Stores', 'Clothing/Shoe/Accessory Stores', 'Clothing/Shoe/Accessory Stores', 'Auto Manufacturing', 'Food Chains', 'Clothing/Shoe/Accessory Stores', 'Food Chains', 'none', 'none', 'Clothing/Shoe/Accessory Stores', 'none', 'Major Banks', 'none', 'none', 'Industrial Machinery/Components', 'Department/Specialty Retail Stores', 'Newspapers/Magazines', 'Broadcasting', 'Industrial Machinery/Components', 'Oil Refining/Marketing', 'Natural Gas Distribution', 'Food Chains', 'Food Chains', 'Natural Gas Distribution', 'Natural Gas Distribution', 'Food Chains', 'Natural Gas Distributio

In [11]:
#make sure length matches number of rows in df
print(len(industry_data))

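#counter for how many times the stock traded by a senator was not found in the exchange data set
no_ticker_cnt = 0
#counter for how many times the stock was found in the exch data set, but that company's data had no value for the field
#(the lookup above already stores "n/a" fields as "none", so this stays 0 and those cases are counted in no_ticker_cnt)
ticker_missing_field_cnt = 0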

for i in industry_data:
    if i == "none":
        no_ticker_cnt = no_ticker_cnt + 1
        
    if i == "n/a":
        ticker_missing_field_cnt = ticker_missing_field_cnt + 1
    
print(no_ticker_cnt)
print(ticker_missing_field_cnt)

6644
1206
0


In [12]:
#Get market cap data for each stock trade

mktcap_data = list()
for row_tuple in sen_df.itertuples():
    tic = row_tuple.ticker
    count = 0
    for row_tuple_tic in ticker_list[0].itertuples():
        sym = row_tuple_tic.Symbol
        if tic == sym:
            count = count+1
            if row_tuple_tic.MarketCap == "n/a":
                mktcap_data.append("none")
            else:
                mktcap_data.append(row_tuple_tic.MarketCap)
            break
    if count == 0:
        for row_tuple_tic in ticker_list[1].itertuples():
            sym = row_tuple_tic.Symbol
            if tic == sym:
                count = count+1
                if row_tuple_tic.MarketCap == "n/a":
                    mktcap_data.append("none")
                else:
                    mktcap_data.append(row_tuple_tic.MarketCap)
                break
    if count == 0:
        mktcap_data.append("none")
        
print(mktcap_data[0:9])

['$47.15B', '$8.58B', 'none', 'none', '$8.74B', '$34.86B', '$92.93B', '$52.6B', '$190.9B', '$7.75B', '$3.32B', '$25.42B', '$32.84B', '$8.88B', '$8.88B', '$198.15B', '$25.42B', '$25.42B', '$198.15B', '$25.42B', '$2.51B', '$2.51B', '$2.51B', '$32.58B', '$25.42B', '$2.51B', '$25.42B', 'none', 'none', '$2.51B', 'none', '$198.15B', 'none', 'none', '$3.26B', '$1.57B', '$7.57B', '$22.2B', '$1.81B', '$731.8M', '$64.54B', '$25.42B', '$25.42B', '$43.07B', '$64.54B', '$25.42B', '$43.07B', '$25.42B', '$4.48B', '$25.42B', '$25.42B', '$25.42B', '$9.43B', '$21.71B', '$8.47B', '$3.29B', '$7.1B', '$8.47B', '$4.48B', '$277.69B', '$189.41B', '$7.57B', '$3.74B', '$35.1B', '$26.58B', '$3.26B', '$25.42B', '$103.35B', '$3.29B', '$32.58B', '$8.47B', '$25.42B', '$25.42B', '$198.15B', '$25.42B', '$7.1B', '$1.44B', '$1.44B', '$8.47B', '$101.2B', '$25.42B', '$900.01B', '$25.42B', '$25.44B', 'none', '$26.58B', '$1.55B', '$3.26B', '$7.57B', '$1.55B', '$25.42B', '$1.57B', '$198.15B', 'none', '$101.2B', '$198.15B', '

In [13]:
#make sure length matches number of rows in df
print(len(mktcap_data))

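#counter for how many times the stock traded by a senator was not found in the exchange data set
no_ticker_cnt = 0
#counter for how many times the stock was found in the exch data set, but that company's data had no value for the field
#(the lookup above already stores "n/a" fields as "none", so this stays 0 and those cases are counted in no_ticker_cnt)
ticker_missing_field_cnt = 0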

for i in mktcap_data:
    if i == "none":
        no_ticker_cnt = no_ticker_cnt + 1
        
    if i == "n/a":
        ticker_missing_field_cnt = ticker_missing_field_cnt + 1
        
print(no_ticker_cnt)
print(ticker_missing_field_cnt)

6644
1206
0


In [14]:
#add new columns to df

sen_df['mkt_cap'] = mktcap_data
sen_df['sector'] = sector_data
sen_df['industry'] = industry_data

sen_df = sen_df.fillna("none")
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link,mkt_cap,sector,industry
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$47.15B,Capital Goods,Biotechnology: Laboratory Analytical Instruments
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$8.58B,Capital Goods,Industrial Machinery/Components
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,none,none,none
5,02/27/2020,Joint,WCMIX,WCM Focused International Growth Fund Institut...,Other Securities,Purchase,"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,none,none,none
6,02/27/2020,Joint,RPM,RPM International Inc.,Stock,Sale (Partial),"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,$8.74B,Basic Industries,Paints/Coatings


In [15]:
"""
Print out names of companies with missing data to find out why we have so many misses (around 18% of our data).
From quick insights, seem to be 3 reasons for this:
1. Mismatched tickers (e.g. berkshire class b ticker is BRK.B as in exch data but denoted BRK-B in senate data)
2. Foreign companies (listed abroad)
3. American companies listed abroad
"""

company_missing_data = list()
for row_tuple in sen_df.itertuples():
    if row_tuple.mkt_cap == "none":
        company_missing_data.append(row_tuple.asset_description)

print(company_missing_data[0:9])

['Siemens Aktiengesellschaft', 'WCM Focused International Growth Fund Institutiona', 'Berkshire Hathaway Inc.', 'Berkshire Hathaway Inc.', 'Berkshire Hathaway Inc.', 'Berkshire Hathaway Inc.', 'Berkshire Hathaway Inc.', 'Bollore', 'Bollore', 'Bollore', 'Bollore', 'Bollore', 'Cheniere Energy, Inc.', 'Bollore', 'ViacomCBS Inc.', 'SPDR S&amp;P 500 ETF Trust', 'Sprott Physical Gold and Silver Trust', 'Vanguard FTSE Emerging Markets Index Fund ETF Shar', 'Berkshire Hathaway Inc.', 'Tencent Holdings Limited', 'Tencent Holdings Limited', 'ViacomCBS Inc.', 'ViacomCBS Inc.', 'ViacomCBS Inc.', 'ViacomCBS Inc.', 'Cheniere Energy, Inc.', 'ViacomCBS Inc.', 'Cheniere Energy, Inc.', 'Vanguard Intermediate-Term Bond Index Fund ETF Sha', 'Vanguard GNMA Fund Admiral Shares', 'Templeton Global Bond Fund Advisor Class', 'iShares Edge MSCI USA Momentum Factor ETF', 'SPDR S&amp;P 500 ETF', 'CBS Corporation', 'Lions Gate Entertainment Corp.', 'Bollore', 'CBS Corporation', 'SPDR S&amp;P 500 ETF', 'CBS Corpora
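For the first cause (mismatched tickers), one possible fix is to normalize the senate tickers before the lookup. The sketch below assumes the only difference is a hyphen where the exchange list uses a dot, as in the BRK-B / BRK.B example; other mismatches would need their own rules.

```python
import pandas as pd

# Toy senate tickers; BRK-B is the mismatch noted above
tickers = pd.Series(["BRK-B", "ILMN", "CGNX"])

# Swap the share-class hyphen for a dot (plain string replace, not a regex)
normalized = tickers.str.replace("-", ".", regex=False)
print(normalized.tolist())
```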

In [16]:
#Get a view of how many industries are found in our senate stock data.

from collections import Counter
industry_dict = Counter(industry_data)

industry_list = list()
for x in industry_dict:
    industry_list.append(x)
    
print(industry_list[0:9])
n_industries = len(industry_list)
#since 'none' is included in our list
n_industries = n_industries - 1
print("There are " + str(n_industries) + " industries covered by the trades of senators.")

['Biotechnology: Laboratory Analytical Instruments', 'Industrial Machinery/Components', 'none', 'Paints/Coatings', 'Building operators', 'Major Banks', 'Semiconductors', 'Major Pharmaceuticals', 'Newspapers/Magazines', 'Oil & Gas Production', 'Food Chains', 'Major Chemicals', 'Hotels/Resorts', 'Clothing/Shoe/Accessory Stores', 'Auto Manufacturing', 'Department/Specialty Retail Stores', 'Broadcasting', 'Oil Refining/Marketing', 'Natural Gas Distribution', 'Services-Misc. Amusement & Recreation', 'Rental/Leasing Companies', 'Package Goods/Cosmetics', 'Beverages (Production/Distribution)', 'EDP Services', 'Food Distributors', 'Computer Software: Programming, Data Processing', 'Restaurants', 'Specialty Chemicals', 'Apparel', 'Consumer Electronics/Video Chains', 'Computer Software: Prepackaged Software', 'Containers/Packaging', 'Telecommunications Equipment', 'Medical/Nursing Services', 'Business Services', 'Integrated oil Companies', 'Computer Communications Equipment', 'Packaged Foods', '
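The same count can be taken directly from a column, without building the intermediate list. This sketch uses a toy series in place of `sen_df['industry']` and filters out the "none" placeholder explicitly, so it does not rely on "none" being present.

```python
import pandas as pd

# Toy stand-in for sen_df['industry']; values are illustrative only
industry = pd.Series(["Food Chains", "none", "Major Banks", "Food Chains", "none"])

# Count distinct industries, ignoring the "none" placeholder
n_industries = industry[industry != "none"].nunique()
print("There are " + str(n_industries) + " industries covered by the trades of senators.")
```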

In [17]:

industry_size_data = list()

for row_tuple in sen_df.itertuples():
    industry_size = row_tuple.industry
    
    if industry_size == 'none':
        industry_size_data.append("none")
        continue
    
    size = row_tuple.mkt_cap
    factor = 0
    x = size.find("M")
    if x != -1:
        factor = 1000000
    else:
        factor = 1000000000
    
    size = size.lstrip("$")
    size = size.rstrip("MB")
    size = float(size)
    size = size*factor
    
    if size < 500000000:
        industry_size = industry_size + "1"
        industry_size_data.append(industry_size)
        continue
    elif size < 1000000000:
        industry_size = industry_size + "2"
        industry_size_data.append(industry_size)
        continue
    elif size < 10000000000:
        industry_size = industry_size + "3"
        industry_size_data.append(industry_size)
        continue
    elif size < 50000000000:
        industry_size = industry_size + "4"
        industry_size_data.append(industry_size)
        continue
    elif size < 100000000000:
        industry_size = industry_size + "5"
        industry_size_data.append(industry_size)
        continue
    elif size < 500000000000:
        industry_size = industry_size + "6"
        industry_size_data.append(industry_size)
        continue
    else:
        industry_size = industry_size + "7"
        industry_size_data.append(industry_size)
        continue
    
print(industry_size_data[0:9])
#print(len(industry_size_data))

['Biotechnology: Laboratory Analytical Instruments4', 'Industrial Machinery/Components3', 'none', 'none', 'Paints/Coatings3', 'Building operators4', 'Major Banks5', 'Semiconductors5', 'Major Pharmaceuticals6', 'Newspapers/Magazines3', 'Oil & Gas Production3', 'Food Chains4', 'Major Chemicals4', 'Hotels/Resorts3', 'Hotels/Resorts3', 'Major Banks6', 'Food Chains4', 'Food Chains4', 'Major Banks6', 'Food Chains4', 'Clothing/Shoe/Accessory Stores3', 'Clothing/Shoe/Accessory Stores3', 'Clothing/Shoe/Accessory Stores3', 'Auto Manufacturing4', 'Food Chains4', 'Clothing/Shoe/Accessory Stores3', 'Food Chains4', 'none', 'none', 'Clothing/Shoe/Accessory Stores3', 'none', 'Major Banks6', 'none', 'none', 'Industrial Machinery/Components3', 'Department/Specialty Retail Stores3', 'Newspapers/Magazines3', 'Broadcasting4', 'Industrial Machinery/Components3', 'Oil Refining/Marketing2', 'Natural Gas Distribution5', 'Food Chains4', 'Food Chains4', 'Natural Gas Distribution4', 'Natural Gas Distribution5', '
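The size buckets above can be expressed more compactly with a sorted list of thresholds and `bisect`. This is a sketch that uses the same cut-offs as the cell above; the helper names `parse_market_cap`, `size_bucket`, and `classify` are hypothetical, and the parser assumes every market cap string looks like `$47.15B` or `$731.8M`.

```python
from bisect import bisect_right

# Upper bounds of buckets 1 to 6, in dollars; anything above the last is bucket 7
THRESHOLDS = [500e6, 1e9, 10e9, 50e9, 100e9, 500e9]
MULTIPLIERS = {"M": 1e6, "B": 1e9}

def parse_market_cap(text):
    """Turn a string such as '$47.15B' into a number of dollars."""
    number, suffix = text.lstrip("$")[:-1], text[-1]
    # Raises KeyError for an unexpected suffix instead of silently assuming billions
    return float(number) * MULTIPLIERS[suffix]

def size_bucket(text):
    """Return the size bucket (1 to 7) for a market cap string."""
    return bisect_right(THRESHOLDS, parse_market_cap(text)) + 1

def classify(industry, mkt_cap):
    """Combine industry and size bucket, keeping 'none' for missing data."""
    if industry == "none" or mkt_cap == "none":
        return "none"
    return industry + str(size_bucket(mkt_cap))

print(classify("Paints/Coatings", "$8.74B"))
print(classify("none", "none"))
```

With helpers like these, the new column could be built with a list comprehension over `zip(sen_df['industry'], sen_df['mkt_cap'])`.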

In [18]:
#add the new column to df

sen_df['classification'] = industry_size_data
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link,mkt_cap,sector,industry,classification
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$47.15B,Capital Goods,Biotechnology: Laboratory Analytical Instruments,Biotechnology: Laboratory Analytical Instruments4
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,$8.58B,Capital Goods,Industrial Machinery/Components,Industrial Machinery/Components3
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,none,none,none,none
5,02/27/2020,Joint,WCMIX,WCM Focused International Growth Fund Institut...,Other Securities,Purchase,"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,none,none,none,none
6,02/27/2020,Joint,RPM,RPM International Inc.,Stock,Sale (Partial),"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,$8.74B,Basic Industries,Paints/Coatings,Paints/Coatings3


In [19]:
#create a list of all the classifications per industry across whole dataframe, to get a view of the breakdown in
#classifications across each industry

classification_industry_breakdown = list()

for x in industry_list:
    y = list()
    for row_tuple in sen_df.itertuples():
        if row_tuple.industry == x:
            y.append(row_tuple.classification)
    classification_industry_breakdown.append(y)
    
print(classification_industry_breakdown[0:9])

[['Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments3', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments3', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instruments3', 'Biotechnology: Laboratory Analytical Instruments4', 'Biotechnology: Laboratory Analytical Instru
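The long lists above are hard to read at a glance. A more compact view of the same breakdown, sketched here on toy data, is a count per (industry, classification) pair using `groupby`. The toy rows reuse labels from the output above but are not the real counts.

```python
import pandas as pd

# Toy stand-in for the industry and classification columns of sen_df
toy = pd.DataFrame({
    "industry": ["Food Chains", "Food Chains", "Major Banks", "Major Banks", "Major Banks"],
    "classification": ["Food Chains4", "Food Chains4", "Major Banks5", "Major Banks6", "Major Banks6"],
})

# Number of trades per classification within each industry
counts = toy.groupby(["industry", "classification"]).size()
print(counts)
```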