# Stock Trades by Members of the US House of Representatives

### Introduction

For this project, I decided to implement a classification model to predict the geographical region of a candidate based on information about a stock trade. I chose to investigate the candidate's region for a few reasons:
- Ideally, I would have tried to predict the state of the candidate. However, due to the limited size of the dataset (where some states are underrepresented) I believed that the region was the next best option, where there would be enough representation to run a good machine learning model.
- I wanted to investigate whether the geographical location of candidates could be predicted based upon the types of stock trades they make. I thought that there would be many interesting questions that could be investigated through this area. Are some regions' candidates more interested in investing in certain industries/companies? Do some regions make more large transactions? Which regions are most active in the stock market?

The evaluation metric I used to measure the "success" of my model was accuracy. I considered using precision and recall as well, but for this data I felt that they were less important and less comprehensive than accuracy.
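
To make the comparison between these metrics concrete, here is a minimal sketch that computes accuracy alongside macro-averaged precision and recall for a multi-class label. The labels are invented for illustration and are not taken from the transactions data; macro averaging is one assumed way of summarizing the per-region scores.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, invented for illustration only
y_true = ["South", "West", "South", "Midwest", "West", "South"]
y_pred = ["South", "West", "West", "Midwest", "South", "South"]

# Accuracy: share of predictions that match the true region
accuracy = accuracy_score(y_true, y_pred)

# Macro averaging: compute the metric per region, then take the unweighted mean
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)

print(accuracy, macro_precision, macro_recall)
```

Accuracy gives a single number across all regions, while the macro-averaged scores weight each region equally regardless of how many trades it has.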

In [400]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [401]:
from datetime import datetime

In [782]:
transactions_raw_init = pd.read_csv("all_transactions.csv")

# Keep only necessary columns
transactions_raw = transactions_raw_init.drop(
    columns=['ptr_link', 'asset_description', 'owner'])


# Clean Transaction and disclosure dates
def fix_date(date):
    if len(date)!=10:
        date_list = date.split('-')
        date_str = '-'.join([date_list[0][:-1], date_list[1], date_list[2]])
        return datetime.strptime(date_str, '%Y-%m-%d')
    else:
        return datetime.strptime(date, '%Y-%m-%d')
    
transactions_raw['disclosure_date'] = transactions_raw['disclosure_date'].apply(
    lambda x: datetime.strptime(x,'%m/%d/%Y'))
transactions_raw['transaction_date'] = transactions_raw['transaction_date'].apply(fix_date)
transactions_raw['transaction_date'] = transactions_raw['transaction_date'].transform(
    lambda x: x if len(str(x.year))==4 else 'drop')
transactions_raw = transactions_raw[transactions_raw['transaction_date']!='drop']


# Cleaning up sale types - replace all sale types as sales
def replace_sale_types(string):
    if string.split('_')[0] == 'sale':
        return 'sale'
    else:
        return string
    
transactions_raw['type'] = transactions_raw['type'].apply(replace_sale_types)


# Adding a state column using district codes
states = {"AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas","CA":"California",
          "CO":"Colorado","CT":"Connecticut","DE":"Delaware","FL":"Florida","GA":"Georgia",
          "HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana","IA":"Iowa","KS":"Kansas",
          "KY":"Kentucky","LA":"Louisiana","ME":"Maine","MD":"Maryland","MA":"Massachusetts",
          "MI":"Michigan","MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana",
          "NE":"Nebraska","NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico",
          "NY":"New York","NC":"North Carolina","ND":"North Dakota","OH":"Ohio","OK":"Oklahoma",
          "OR":"Oregon","PA":"Pennsylvania","RI":"Rhode Island","SC":"South Carolina",
          "SD":"South Dakota","TN":"Tennessee","TX":"Texas","UT":"Utah","VT":"Vermont",
          "VA":"Virginia","WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming",
          "DC":"District of Columbia","GU":'Guam'}

transactions_raw['state'] = transactions_raw['district'].transform(lambda x: x[:2])
transactions = transactions_raw.replace({'state':states})


# Cleaning the Amount column to show bottom end of price range
bottom_end = transactions['amount'].str.split(' - ').transform(lambda x: x[0].strip('$'))
bottoms = bottom_end.str.strip(' +').str.replace(',', '').str.strip(' -').apply(lambda x: int(x))

def reassign_bottom(number):
    if number == 1_001:
        return 1000
    if number == 15_000:
        return 15001
    if number == 1_000_000:
        return 1_000_001
    else:
        return number

bottom_price = bottoms.apply(reassign_bottom)
transactions['amount'] = bottom_price


# Adding a region column by state
states_to_regions = {
    'Washington': 'West', 'Oregon': 'West', 'California': 'West', 'Nevada': 'West',
    'Idaho': 'West', 'Montana': 'West', 'Wyoming': 'West', 'Utah': 'West',
    'Colorado': 'West', 'Alaska': 'West', 'Hawaii': 'West', 'Maine': 'Northeast',
    'Vermont': 'Northeast', 'New York': 'Northeast', 'New Hampshire': 'Northeast',
    'Massachusetts': 'Northeast', 'Rhode Island': 'Northeast', 'Connecticut': 'Northeast',
    'New Jersey': 'Northeast', 'Pennsylvania': 'Northeast', 'North Dakota': 'Midwest',
    'South Dakota': 'Midwest', 'Nebraska': 'Midwest', 'Kansas': 'Midwest',
    'Minnesota': 'Midwest', 'Iowa': 'Midwest', 'Missouri': 'Midwest', 'Wisconsin': 'Midwest',
    'Illinois': 'Midwest', 'Michigan': 'Midwest', 'Indiana': 'Midwest', 'Ohio': 'Midwest',
    'West Virginia': 'South', 'District of Columbia': 'South', 'Maryland': 'South',
    'Virginia': 'South', 'Kentucky': 'South', 'Tennessee': 'South', 'North Carolina': 'South',
    'Mississippi': 'South', 'Arkansas': 'South', 'Louisiana': 'South', 'Alabama': 'South',
    'Georgia': 'South', 'South Carolina': 'South', 'Florida': 'South', 'Delaware': 'South',
    'Arizona': 'Southwest', 'New Mexico': 'Southwest', 'Oklahoma': 'Southwest',
    'Texas': 'Southwest'}

transactions['region'] = transactions['state']
transactions = transactions.replace({'region':states_to_regions})
transactions = transactions[transactions['region']!='Guam'].reset_index().drop(columns=['index'])


# Add disclosure delays (in days) column
transactions['disclosure_delay'] = transactions['disclosure_date'] - transactions['transaction_date']
transactions['disclosure_delay'] = transactions['disclosure_delay'].transform(lambda x: x.days)
transactions['disclosure_delay'].unique()


# Add disclosure month
transactions['disclosure_month'] = transactions['disclosure_date'].transform(lambda x: x.month)


# Turn cap gains over 200 from Bool to Numerical
transactions['cap_gains_over_200_usd'] = transactions[
    'cap_gains_over_200_usd'].transform(lambda x: int(x))


# Replace unknown values of tickers with nan
transactions['ticker'] = transactions['ticker'].replace('--', np.nan)


# After cleaning
transactions.head()



Unnamed: 0,disclosure_year,disclosure_date,transaction_date,ticker,type,amount,representative,district,cap_gains_over_200_usd,state,region,disclosure_delay,disclosure_month
0,2021,2021-10-04,2021-09-27 00:00:00,BP,purchase,1000,Hon. Virginia Foxx,NC05,0,North Carolina,South,7,10
1,2021,2021-10-04,2021-09-13 00:00:00,XOM,purchase,1000,Hon. Virginia Foxx,NC05,0,North Carolina,South,21,10
2,2021,2021-10-04,2021-09-10 00:00:00,ILPT,purchase,15001,Hon. Virginia Foxx,NC05,0,North Carolina,South,24,10
3,2021,2021-10-04,2021-09-28 00:00:00,PM,purchase,15001,Hon. Virginia Foxx,NC05,0,North Carolina,South,6,10
4,2021,2021-10-04,2021-09-17 00:00:00,BLK,sale,1000,Hon. Alan S. Lowenthal,CA47,0,California,West,17,10


### Baseline Model

The baseline model used the cleaned transactions data (as seen above). The main purpose of cleaning was to turn each variable into quantitative or categorical data so that it could be used in the model. Regions and states were not given in the dataset, so I derived those two columns from each candidate's district. After cleaning, the main features of the baseline model were as follows:
- type: 
    - The type of transaction that occurred. Categorical data (purchase, sale, exchange)
- amount: 
    - The minimum transaction amount of the trade based upon the original amount bins in the dataset. Categorical data (1_000, 15_001, 50_001, 100_001, 250_001, 500_001, 1_000_001, 5_000_001, 50_000_000)
- cap_gains_over_200_usd: 
    - A true or false value indicating whether a capital gain of over 200 USD was earned through the trade. Categorical data (0, 1)
- disclosure_delay: 
    - The number of days AFTER the trade that the trade was disclosed. Quantitative data.
- disclosure_year:
    - The year that the transaction was disclosed. Categorical data (2020, 2021, 2022)
- disclosure_month:
    - The month that the transaction was disclosed. Categorical data (1-12)
    
    
The feature of investigation was:
- region:
    - The region of a candidate. Categorical data (South, West, Northeast, Southwest, Midwest)
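
The derivation of a region from a district code can be sketched as follows. This is an illustrative variant rather than the exact cleaning code: the district codes and the tiny lookup table are hypothetical, and it maps the two-letter state code straight to a region with `Series.map`.

```python
import pandas as pd

# Hypothetical district codes and a deliberately tiny lookup, for illustration only
districts = pd.Series(["NC05", "CA47", "TX02", "NY14"])
abbrev_to_region = {"NC": "South", "CA": "West", "TX": "Southwest", "NY": "Northeast"}

# The first two characters of a district code are the state abbreviation
state_abbrev = districts.str[:2]

# map() returns NaN for any code missing from the lookup
region = state_abbrev.map(abbrev_to_region)
print(region.tolist())
```

One design difference worth noting: `map` turns unmapped codes into NaN, whereas `replace` leaves them unchanged, which is why a value such as 'Guam' has to be filtered out explicitly when `replace` is used.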
    
One-hot encoding was applied to the categorical features via a ColumnTransformer, and a pipeline combined it with a Decision Tree Classifier, which was chosen initially for its simplicity. Because the ColumnTransformer uses the default remainder='drop', only the four one-hot-encoded columns (type, amount, disclosure_year, disclosure_month) reach the classifier; cap_gains_over_200_usd and disclosure_delay are selected into X but dropped before fitting. The transactions data was split 80/20 into training and testing sets. The training accuracy came out to about 0.487, and the testing accuracy was a little lower at 0.439. Overall, this was more accurate than labeling all the data as one region, which was promising, but it was not satisfactory, as it remained below 0.5.
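
The "label everything as one region" comparison can be made explicit with a majority-class baseline. The sketch below uses sklearn's `DummyClassifier` on invented labels; the class counts are hypothetical and do not reflect the real distribution of regions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical region labels, invented for illustration only
y = np.array(["South"] * 5 + ["West"] * 3 + ["Midwest"] * 2)
X = np.zeros((len(y), 1))  # features are ignored by the most_frequent strategy

# Always predicts the most common class seen during fit
majority = DummyClassifier(strategy="most_frequent").fit(X, y)

# Accuracy equals the share of the most common class (5 of 10 here)
majority_accuracy = majority.score(X, y)
print(majority_accuracy)
```

Any real model should beat this number on the test split before its accuracy is considered meaningful.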

In [663]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

In [588]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown='ignore'), 
         ['type', 'amount', 'disclosure_year', 'disclosure_month'])
    ])

In [596]:
pl_base = Pipeline([('cat', preprocessor), ('clf', DecisionTreeClassifier())])

X = transactions[['type', 'amount', 'cap_gains_over_200_usd', 'disclosure_delay', 
                  'disclosure_year', 'disclosure_month']]
y = transactions['region']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

pl_base.fit(X_train, y_train)
pl_base.score(X_train, y_train)

0.4867446859326487

In [597]:
pl_base.score(X_test, y_test)

0.43871378541865647

### Final Model

For the final model, the main tasks were to add in a couple more features and to determine which classifier would yield the best accuracy. The features that were added to the base model are as follows:
- disclosure_delay:
    - This feature was already in the base model, but was inputted as the raw count of days. Using a Standard Scaler, the delays were standardized and re-inputted into the final model.
    - This feature was good for the final model because it greatly reduced the range of the dataset, which makes it easier for classification patterns to emerge.
- ticker:
    - This is the ticker symbol of the stock being traded. Unknown tickers (recorded as '--') were replaced with NaN during cleaning, so I used a Simple Imputer to fill in those missing values and fed the column in as categorical data.
    - This feature was good for the final model because it gave another variable to build predictions off of, as the previous features alone were not sufficient and not as relevant to the regional trades.
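
To illustrate what imputing and then one-hot encoding does to a column with missing tickers, here is a small sketch on a hypothetical ticker column. The pipeline mirrors the constant-fill imputer and encoder settings used in this notebook, but the data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical ticker column with missing entries, for illustration only
tickers = pd.DataFrame(
    {"ticker": pd.Series(["BP", np.nan, "XOM", "BP", np.nan], dtype=object)}
)

ticker_encoder = Pipeline(steps=[
    # Missing tickers become the literal category "missing"
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    # Each distinct ticker (including "missing") gets its own 0/1 column
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

encoded = ticker_encoder.fit_transform(tickers).toarray()
categories = ticker_encoder.named_steps["encoder"].categories_[0].tolist()
print(categories)
print(encoded)
```

Because "missing" becomes a category of its own, trades without a ticker are kept in the data rather than discarded.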

Next, I experimented with 3 different classification models to see which would return the most accurate results with a reasonable amount of computational power. I tested sklearn's Decision Tree, Linear SVC, and Random Forest classifiers. The training accuracies were 0.997, 0.687, and 0.436 respectively, leading me to use a Decision Tree as the classifier for the final model. The parameters for the best decision tree model were found using a grid search with 5-fold cross-validation, and turned out to be: {'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 2}. The final test score with the best parameters on the Decision Tree Classifier was about 0.745, which was a great improvement over the test score of the baseline model (0.439).
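
As a complement to training accuracy, a fitted `GridSearchCV` also exposes `best_score_`, the mean cross-validated score of the best parameter combination. The sketch below shows all three numbers side by side on synthetic stand-in data; the dataset, the reduced parameter grid, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the transactions dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

# A deliberately small grid so the sketch runs quickly
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 5, None], "min_samples_split": [2, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

train_accuracy = search.score(X_tr, y_tr)  # accuracy on the data the tree was fit to
cv_accuracy = search.best_score_           # mean held-out accuracy across the 5 folds
test_accuracy = search.score(X_te, y_te)   # accuracy on the untouched test split
print(search.best_params_, train_accuracy, cv_accuracy, test_accuracy)
```

An unpruned decision tree can fit its training data almost perfectly, so the cross-validated score is often a closer guide to how each candidate model will behave on the test split.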

#### Added Features: Standard Scale disclosure delays, Simple Impute the missing tickers and add them in the one hot encoding

In [599]:
quantitative_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

In [600]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

In [601]:
preprocessor_final = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, 
         ['type', 'amount', 'disclosure_year', 'disclosure_month', 'ticker']),
        ('quant', quantitative_transformer, ['disclosure_delay'])
    ])

In [611]:
X = transactions[['type', 'amount', 'cap_gains_over_200_usd', 'disclosure_delay', 
                  'disclosure_year', 'disclosure_month', 'ticker']]
y = transactions['region']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

#### Testing Decision Tree model

In [681]:
dtree_params = {
    'max_depth': [2, 5, 10, 25, 50, 100, None], 
    'min_samples_split': [2, 3, 5, 7, 10, 15, 20, 25],
    'criterion': ['gini', 'entropy']
}
len(dtree_params['max_depth']) * len(dtree_params['min_samples_split']) * len(dtree_params['criterion'])

112

In [682]:
pl_dtree = Pipeline([('process', preprocessor_final), 
                    ('search', GridSearchCV(DecisionTreeClassifier(), dtree_params, cv=5))])

In [683]:
pl_dtree.fit(X_train, y_train)
pl_dtree.score(X_train, y_train)

0.9979301011065997

In [684]:
pl_dtree.named_steps['search'].best_params_

{'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 2}

#### Testing Linear SVC model

In [659]:
svm_params = {
    'penalty': ['l1', 'l2'],
    'max_iter': [9000, 9500, 10000],
    'dual': [False]
}

In [660]:
pl_lsvc = Pipeline([('process', preprocessor_final), 
                    ('search', GridSearchCV(LinearSVC(), svm_params, cv=5))])

In [661]:
pl_lsvc.fit(X_train, y_train)
pl_lsvc.score(X_train, y_train)

0.6873656556006688

In [662]:
pl_lsvc.named_steps['search'].best_params_

{'dual': False, 'max_iter': 9000, 'penalty': 'l2'}

#### Testing Random Forest model

In [674]:
rforest_params = {
    'max_depth': [2, 3, 5], 
    'min_samples_split': [2, 5, 10, 15, 25],
    'criterion': ['gini', 'entropy']
}
len(rforest_params['max_depth']) * len(rforest_params['min_samples_split']) * len(rforest_params['criterion'])

30

In [675]:
pl_rforest = Pipeline([('process', preprocessor_final), 
                    ('search', GridSearchCV(RandomForestClassifier(), rforest_params, cv=5))])

In [676]:
pl_rforest.fit(X_train, y_train)
pl_rforest.score(X_train, y_train)

0.43650983201974364

In [677]:
pl_rforest.named_steps['search'].best_params_

{'criterion': 'gini', 'max_depth': 5, 'min_samples_split': 2}

#### Final score of best model

In [685]:
pl_dtree.score(X_test, y_test)

0.7449856733524355

### Fairness Evaluation

I decided to conduct the fairness analysis on the "large transactions" subset of my data. In this project, "large transactions" means trade amounts that were larger than 100,000 USD. The parity measure I decided to use was accuracy, as I still felt that it would be the best and most generalizable measure of fairness given that the question of investigation does not necessitate extremely high values of precision or recall. I used random permutations to shuffle the regions of the transactions data, then drew samples of the size of the "large transactions" subset to find their testing scores. The null hypothesis of the experiment was that the "large transactions" subset and the transactions data would have no difference in testing score. However, through the experiment I rejected the null hypothesis with a p-value of 0.0 (none of the 100 shuffled samples scored higher) and concluded that the "large transactions" subset had a higher accuracy score than the general dataset. This suggests that certain regions are more likely to make stock trades of large transaction amounts, and my model was successful in picking up this pattern.

#### Will be using accuracy as the measure of fairness - precision/recall is not very relevant for this data.

Subset of data: transactions larger than 100,000 vs whole dataset 

- Null Hypothesis: large transactions do not have a higher measure of accuracy. 
- Alternative Hypothesis: large transactions do have a higher measure of accuracy.
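
The p-value for this kind of permutation test is the share of shuffled scores that reach the observed score. A minimal sketch, with a hypothetical helper name and invented scores, and using `>=` where this notebook's own calculation uses a strict `>`:

```python
import numpy as np

def empirical_p_value(null_scores, observed):
    """Fraction of null-distribution scores at least as large as the observed one."""
    null_scores = np.asarray(null_scores, dtype=float)
    return float(np.mean(null_scores >= observed))

# Hypothetical scores, invented for illustration only
null_scores = [0.30, 0.28, 0.35, 0.80, 0.25]
p_value_demo = empirical_p_value(null_scores, observed=0.77)
print(p_value_demo)
```

With 100 permutations the smallest nonzero value this estimate can take is 0.01, so a result of 0.0 is best read as "fewer than 1 in 100 shuffles matched the observed score".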

In [761]:
transactions_large = transactions[transactions['amount']>100000]


X_large = transactions_large[['type', 'amount', 'cap_gains_over_200_usd', 'disclosure_delay', 
                  'disclosure_year', 'disclosure_month', 'ticker']]
y_large = transactions_large['region']

Xl_train, Xl_test, yl_train, yl_test = train_test_split(X_large, y_large, train_size=0.8)

In [762]:
pl_dtree.fit(Xl_train, yl_train)
pl_dtree.score(Xl_train, yl_train)

1.0

In [766]:
large_score = pl_dtree.score(Xl_test, yl_test)
large_score

0.7668161434977578

#### Randomly permute the regions and get samples of size transactions_large

In [718]:
from tqdm import tqdm

In [759]:
shuffled_scores = []
iterations = 100
shuff_transactions = transactions.copy()

for i in tqdm(range(iterations)):
    shuff_transactions['region'] = np.random.permutation(shuff_transactions['region'])
    sample_shuff = shuff_transactions.sample(transactions_large.shape[0])
    
    X_shuff = sample_shuff[['type', 'amount', 'cap_gains_over_200_usd', 'disclosure_delay', 
                  'disclosure_year', 'disclosure_month', 'ticker']]
    y_shuff = sample_shuff['region']

    Xshuff_train, Xshuff_test, yshuff_train, yshuff_test = train_test_split(
        X_shuff, y_shuff, train_size=0.8)
    
    pl_dtree.fit(Xshuff_train, yshuff_train)
    score = pl_dtree.score(Xshuff_test, yshuff_test)
    shuffled_scores.append(score)

100%|█████████████████████████████████████████| 100/100 [08:24<00:00,  5.04s/it]


In [772]:
shuffled_scores[:5]

[0.3183856502242152,
 0.29596412556053814,
 0.25112107623318386,
 0.21524663677130046,
 0.32286995515695066]

In [770]:
# Convert the list to an array so the comparison is element-wise
p_value = (np.array(shuffled_scores) > large_score).sum()/iterations
p_value

0.0

#### Reject the null hypothesis. Large transactions have a higher measure of accuracy.