# Place Stock Trades into Senator Dataframe

## 1. Understand the Senator Trading Report (STR) Dataframe

In [1]:
import pandas as pd
#https://docs.google.com/spreadsheets/d/1lH_LpTgRlfzKvpRnWYgoxlkWvJj0v1r3zN3CeWMAgqI/edit?usp=sharing
try:
    sen_df = pd.read_csv("Senator Stock Trades/Senate Stock Watcher 04_16_2020 All Transactions.csv")
except FileNotFoundError:
    # fall back to the copy hosted on GitHub when the local file is missing
    sen_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Senate%20Stock%20Trades/Senate%20Stock%20Watcher%2004_16_2020%20All%20Transactions.csv")
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link
0,03/04/2020,Joint,--,INGERSOLL RAND PLC SHARES (Exchanged) <br> TRA...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
1,03/04/2020,Self,--,INGERSOLL RAND PLC SHS (Exchanged) <br> TRANE ...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...


The amount column holds a dollar range stored as a string (for example, "$1,001 - $15,000"). For later analysis, we split it into two numeric columns, min_amount and max_amount.

In [2]:
#strip characters that are not numeric
string_list = sen_df['amount']
string_list2 = list(map(lambda each:each.replace('$', ""), string_list))
string_list3 = list(map(lambda each:each.replace(',', ""), string_list2))
string_list4 = list(map(lambda each:each.replace('Over ', ""), string_list3))
#split up strings into min and max amounts
string_sep = list(map(lambda each:each.partition(' - '), string_list4))
string_sep_df = pd.DataFrame(string_sep)
string_sep_df.columns = ['min_amount', 'sep', 'max_amount']
#add min and max amounts to sen_df as numerics
sen_df['min_amount'] = pd.to_numeric(string_sep_df['min_amount'])
sen_df['max_amount'] = pd.to_numeric(string_sep_df['max_amount'])
sen_df.head()

Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link,min_amount,max_amount
0,03/04/2020,Joint,--,INGERSOLL RAND PLC SHARES (Exchanged) <br> TRA...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,1001,15000.0
1,03/04/2020,Self,--,INGERSOLL RAND PLC SHS (Exchanged) <br> TRANE ...,Stock,Exchange,"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,1001,15000.0
2,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,1001,15000.0
3,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,15001,50000.0
4,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,15001,50000.0
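
The same parsing can also be written with pandas' vectorized string methods instead of chained `map` calls. The following is a minimal sketch under stated assumptions: the sample `amounts` series is hypothetical and only mirrors the formats handled in the cell above (including an "Over ..." value with no upper bound).

```python
import pandas as pd

# Hypothetical sample mirroring the formats handled above.
amounts = pd.Series(["$1,001 - $15,000", "$15,001 - $50,000", "Over $50,000,000"])

# Remove "$", "," and the "Over " prefix in one vectorized pass.
cleaned = amounts.str.replace(r"[$,]|Over ", "", regex=True)

# Split on the " - " separator; values without a separator get a missing max.
parts = cleaned.str.split(" - ", n=1, expand=True)

parsed = pd.DataFrame({
    "min_amount": pd.to_numeric(parts[0]),
    "max_amount": pd.to_numeric(parts[1]),
})
print(parsed)
```

Rows with no upper bound end up with a missing max_amount, which is why that column is stored as a float rather than an integer.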


In [3]:
sen_df.type.unique()

array(['Exchange', 'Sale (Full)', 'Purchase', 'Sale (Partial)'],
      dtype=object)

There are 4 types of trades:
Exchange: exchanging one stock for another
Sale (Full): selling all shares held of a stock
Purchase: buying a stock
Sale (Partial): selling some of the shares held of a stock

In [4]:
n_exchanges = len(sen_df.loc[sen_df['type'] == "Exchange"])
n_trades = len(sen_df)
print("There are " +str(n_exchanges) +" exchange trades out of a total of " +str(n_trades)+ " trades")
sen_df = sen_df.loc[sen_df['type'] != "Exchange"]


There are 84 exchange trades out of a total of 8600 trades


For now, I will exclude exchange trades because there are so few of them (84 of 8600) and I want to build the basic structure of the project first. As the first two rows of the dataframe show, including an exchange would require splitting it into two rows, one for each company. I may add this step later if time permits.
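
If exchanges are revisited later, the split into one row per company could look roughly like the sketch below. It assumes the two sides of an exchange are separated by `<br>` in asset_description, as the truncated rows in the first output suggest. The sample row and its text are hypothetical.

```python
import pandas as pd

# Hypothetical exchange row; the real descriptions are truncated in the output above.
exchanges = pd.DataFrame({
    "senator": ["Example Senator"],
    "type": ["Exchange"],
    "asset_description": ["OLD CO SHARES (Exchanged) <br> NEW CO SHARES (Received)"],
})

# Assumption: the two sides of an exchange are separated by "<br>".
exchanges["asset_description"] = exchanges["asset_description"].str.split("<br>")

# One row per side of the exchange, with surrounding whitespace trimmed.
legs = exchanges.explode("asset_description", ignore_index=True)
legs["asset_description"] = legs["asset_description"].str.strip()
print(legs)
```

Each resulting row would still need its own ticker symbol before it could be matched to price data.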

## 2. Add Data to STR Dataframe  

### Import Data

In this step we use company information such as market cap and industry from online company lists for the NYSE, NASDAQ, and ASXL exchanges (the ASXL list is currently commented out in the code below). Links can be found here: https://stackoverflow.com/questions/25338608/download-all-stock-symbol-list-of-a-market

In [5]:
ticker_list = list()
try:
    NYSE_df = pd.read_csv("NYSEcompanylist.csv")
except FileNotFoundError:
    NYSE_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/NYSEcompanylist.csv")

try:
    NASDAQ_df = pd.read_csv("NASDAQcompanylist.csv")
except FileNotFoundError:
    NASDAQ_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/NASDAQcompanylist.csv")
    
#try:
#    ASXL_df = pd.read_csv("ASXLcompanylist.csv")
#except:
#    ASXL_df = pd.read_csv("https://github.com/pkm29/big_data_final_project/raw/master/Stocks/ASXLcompanylist.csv")
    
ticker_list.append(NYSE_df)
ticker_list.append(NASDAQ_df)
#ticker_list.append(ASXL_df)


#DataFrame.append was removed in pandas 2.0; pd.concat stacks the lists the same way
pd.concat(ticker_list).head()


Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Unnamed: 8
0,DDD,3D Systems Corporation,7.46,$886.75M,,Technology,Computer Software: Prepackaged Software,https://old.nasdaq.com/symbol/ddd,
1,MMM,3M Company,145.74,$83.83B,,Health Care,Medical/Dental Instruments,https://old.nasdaq.com/symbol/mmm,
2,WBAI,500.com Limited,4.15,$178.45M,2013.0,Consumer Services,Services-Misc. Amusement & Recreation,https://old.nasdaq.com/symbol/wbai,
3,WUBA,58.com Inc.,52.0,$7.79B,2013.0,Technology,"Computer Software: Programming, Data Processing",https://old.nasdaq.com/symbol/wuba,
4,EGHT,8x8 Inc,18.89,$1.94B,,Technology,EDP Services,https://old.nasdaq.com/symbol/eght,
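
MarketCap is stored as a string with a unit suffix, so it needs converting before it can be compared or aggregated. A minimal sketch, assuming only the "M" and "B" suffixes visible in the sample above occur. The `companies` frame and the `market_cap_usd` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the same format as the MarketCap column above.
companies = pd.DataFrame({
    "Symbol": ["DDD", "MMM", "WBAI"],
    "MarketCap": ["$886.75M", "$83.83B", "$178.45M"],
})

# Multipliers for the suffixes seen in the sample; other suffixes are not handled.
multipliers = {"M": 1e6, "B": 1e9}

# Capture the numeric part and the one-letter suffix.
parts = companies["MarketCap"].str.extract(r"^\$([\d.]+)([MB])$")
companies["market_cap_usd"] = parts[0].astype(float) * parts[1].map(multipliers)
print(companies)
```

Values that do not match the pattern come out as missing, so unexpected formats in the real lists would show up as gaps rather than wrong numbers.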


### Create script for RStudio to use quantmod

In this step we attach the stock price data from the transaction date to each trade. Later on we will do more analysis, such as estimating future profits or avoided losses.

First we use the ticker column to match stock prices to each trade. We must then filter out rows that don't have ticker symbols, as some trades are for assets such as municipal securities that have no stock listing. Below is a count of transactions without a ticker symbol; interestingly, a large number of them are stock transactions.

In [6]:
#list(zip(sen_df.transaction_date, sen_df.ticker))
sen_df_tick_error = sen_df.loc[sen_df['ticker'] == "--"]
print(len(sen_df_tick_error))
sen_df_tick_error.asset_type.value_counts()


1872


Stock                 890
Municipal Security    311
Other Securities      254
Corporate Bond        223
Non-Public Stock       64
Stock Option            1
Name: asset_type, dtype: int64

Here we quickly examine the composition of these error trades.

In [7]:
sen_error_count_df = pd.DataFrame({'num_error_transactions': sen_df_tick_error.senator.value_counts()})
sen_error_count_df.index.name = 'Senators with Ticker Errors'
sen_error_count_df['total_transactions'] = sen_df.senator.value_counts()
sen_error_count_df['error_trans/total_trans'] = sen_error_count_df['num_error_transactions']/sen_error_count_df['total_transactions']
sen_error_count_df.style.format({'error_trans/total_trans': '{:,.2f}'.format})
sen_error_count_df.head(n=10)
#sen_df_tick_error.loc[sen_df_tick_error['asset_type'] == 'Stock'].head()

Senators with Ticker Errors,num_error_transactions,total_transactions,error_trans/total_trans
Thomas R Carper,718,819,0.876679
"David A Perdue , Jr",440,3128,0.140665
Rick Scott,102,102,1.0
Gary C Peters,85,123,0.691057
Susan M Collins,61,589,0.103565
Sheldon Whitehouse,59,859,0.068685
Lamar Alexander,59,59,1.0
Pat Roberts,39,407,0.095823
Patrick J Toomey,39,272,0.143382
"Angus S King, Jr.",38,65,0.584615


It appears some senators frequently omit ticker symbols. Since we cannot yet analyze these trades for profits, we set them aside to examine later. Note that for senators such as Thomas R Carper (718 of 819), Rick Scott (102 of 102), and Lamar Alexander (59 of 59), most or all trades have no ticker symbol.

We then remove rows where there is no ticker. Next we need a list of tickers and dates to send to our RStudio script to get the stock data.

In [8]:
sen_df_ticker = sen_df.loc[sen_df['ticker'] != '--']
dates_series = sen_df_ticker['transaction_date']
corrected_dates = list(map(lambda each:each.replace('/', "-"), dates_series))
date_ticker_df = pd.DataFrame({'transaction_date': corrected_dates, 'ticker': sen_df_ticker['ticker']})

date_ticker_df.to_csv("date_ticker_df.csv")
print("Number of entries: " + str(len(sen_df_ticker)))

Number of entries: 6644


### Import data created by RStudio and quantmod library

In [9]:
import os
from glob import glob
filenames = glob("Transaction Date Stocks" + '/*.csv')
dfs = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)
#dfs.to_csv("dfs_test.csv")
sen_df_ticker = sen_df_ticker.reset_index(drop=True)
#sen_df_ticker.to_csv("sen_df_ticker_test.csv")
#dfs.set_index('date', inplace=True)
print("Number of entries: " + str(len(dfs)))
print("Number of entries with error messages: " + str(len(dfs.loc[dfs['Error_message'] != "None"])))
print("Percent error message: " + str(len(dfs.loc[dfs['Error_message'] != "None"])/len(dfs)))
dfs.rename(columns={'date': 'transaction_date'}, inplace=True)

Number of entries: 6706
Number of entries with error messages: 830
Percent error message: 0.12376975842529078


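Next we merge sen_df_ticker, the dataframe with ticker symbols, with the price data in dfs. Note that there's a slight discrepancy in row counts (6706 in dfs versus 6644 in sen_df_ticker) because exchange trades were accidentally included in the list of dates and tickers sent to the R script. This means we scraped a little more data than necessary. Because the concatenation below matches rows by position, the duplicated transaction_date.1 and ticker.1 columns in the output are a useful check that the rows still line up.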

In [10]:
dfs.transaction_date = list(map(lambda each:each.replace('-', "/"), dfs.transaction_date))
#print(sen_df_ticker.transaction_date)
df_inner = pd.concat([sen_df_ticker, dfs], join='inner', axis=1)
df_inner.to_csv("df_inner.csv")
print("Number of entries: " + str(len(df_inner)))
df_inner.head()

Number of entries: 6644


Unnamed: 0,transaction_date,owner,ticker,asset_description,asset_type,type,amount,comment,senator,ptr_link,...,max_amount,transaction_date.1,ticker.1,Open,High,Low,Close,Volume,Adjusted,Error_message
0,03/11/2020,Self,ILMN,"Illumina, Inc.",Stock,Sale (Full),"$1,001 - $15,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,...,15000.0,03/11/2020,ILMN,261.429993,263.929993,241.470001,246.009995,1860200.0,246.009995,
1,03/11/2020,Self,CGNX,Cognex Corporation,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,...,50000.0,03/11/2020,CGNX,41.0,41.099998,38.91,39.400002,1611100.0,39.400002,
2,03/11/2020,Self,SIEGY,Siemens Aktiengesellschaft,Stock,Sale (Full),"$15,001 - $50,000",--,Sheldon Whitehouse,https://efdsearch.senate.gov/search/view/ptr/4...,...,50000.0,03/11/2020,SIEGY,45.700001,45.889999,44.09,44.389999,428600.0,44.389999,
3,02/27/2020,Joint,WCMIX,WCM Focused International Growth Fund Institut...,Other Securities,Purchase,"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,...,100000.0,02/27/2020,WCMIX,45.700001,45.889999,44.09,44.389999,428600.0,44.389999,"Error in getSymbols.yahoo(Symbols = ""WCMIX"", e..."
4,02/27/2020,Joint,RPM,RPM International Inc.,Stock,Sale (Partial),"$50,001 - $100,000",The filer's portfolio is managed by a third pa...,Daniel S Sullivan,https://efdsearch.senate.gov/search/view/ptr/e...,...,100000.0,02/27/2020,RPM,68.18,69.150002,66.040001,66.080002,632300.0,66.080002,
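
Since `pd.concat(..., axis=1)` pairs rows by position, any extra row in the scraped data shifts every row after it. The sketch below uses small hypothetical frames (`trades` and `prices` stand in for sen_df_ticker and dfs) to show how to count mismatched rows and, as one possible alternative, how to join on date and ticker instead. Whether the real data is actually misaligned is not known from the output above.

```python
import pandas as pd

# Hypothetical miniature versions of sen_df_ticker and dfs.
trades = pd.DataFrame({
    "transaction_date": ["03/11/2020", "03/11/2020", "02/27/2020"],
    "ticker": ["ILMN", "CGNX", "RPM"],
    "senator": ["A", "A", "B"],
})
prices = pd.DataFrame({
    "transaction_date": ["03/11/2020", "03/04/2020", "03/11/2020", "02/27/2020"],
    "ticker": ["ILMN", "XYZ", "CGNX", "RPM"],  # "XYZ" stands in for an extra scraped row
    "Close": [246.01, 10.0, 39.40, 66.08],
})

# Positional concat pairs row i with row i, so the extra row shifts everything after it.
positional = pd.concat([trades, prices.add_suffix("_px")], axis=1, join="inner")
mismatched = (positional["ticker"] != positional["ticker_px"]).sum()
print("Rows whose tickers disagree:", mismatched)

# Key-based alternative: drop duplicate price keys, then match on date and ticker.
price_lookup = prices.drop_duplicates(subset=["transaction_date", "ticker"])
merged = trades.merge(
    price_lookup,
    on=["transaction_date", "ticker"],
    how="left",
    validate="many_to_one",
)
print(merged)
```

With `how="left"`, every trade is kept and trades without a matching price row get missing values, so the row count stays equal to the number of trades.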
