# Predicting Buy or Sell via Stock Trades of Congress Members

## Summary of Findings


### Introduction
Objective: Predicting whether a trade is BUY or SELL.

Type: Binary classification: there is only BUY (1) or SELL (0).

Response Variable: type of transaction (BUY/SELL). We chose this as an indicator for detecting market trends.

Metric: F1-Score. We use it because it balances precision and recall, both of which we consider important when predicting whether a trade is a buy or a sell.
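
As a concrete illustration of the metric, the sketch below computes precision, recall, and F1 for the 'buy' class by hand. The labels are made up for illustration, and the helper name `f1_for_label` is an assumption of this sketch, not part of the notebook.

```python
def f1_for_label(y_true, y_pred, pos_label="buy"):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels, for illustration only.
y_true = ["buy", "buy", "buy", "sell", "sell", "buy"]
y_pred = ["buy", "sell", "buy", "buy", "sell", "buy"]
precision, recall, f1 = f1_for_label(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 is the harmonic mean of the two, it stays low whenever either precision or recall is low.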

### Baseline Model
Made our first Pipeline using the KNeighborsClassifier model

    Features: 
        Quantitative:
            1. Day of transaction 
            2. Month of transaction
            3. Year of transaction        
        Nominal:
            4. Party
        

How: For the quantitative features, we used helper functions to extract the respective parts of the date string. For 'Party', we used one-hot encoding to convert the categorical values to numerical values.

Why: For the date features, whether a trade is a buy or a sell can depend on the date, since historical events such as the 2008 recession can cause systematic changes in buying and selling behavior. For 'Party', it can depend on party affiliation, since the parties could have different buying and selling patterns.

Model Performance: Using F1-Score as our evaluation metric, we achieved a score of 0.6520694259012015 on our test set. We consider this model decent, since its score is 0.12677208778528692 above the baseline accuracy of 0.5252973381159146 (guessing all 'buy').

Generalization Ability: Our baseline model's ability to generalize to unseen data is mediocre, since our pipeline improved on the all-'buy' guess by only about 12.7 percentage points.

### Final Model

New Features: 
    
    Quantitative:
        1. Day of disclosure 
        2. Month of disclosure
        3. Year of disclosure
        4. est_amount
    Nominal:
        5. ticker
        6. owner
        
How: For the quantitative features (besides est_amount), we used helper functions to extract the respective parts of the date string. For 'ticker', we used one-hot encoding to convert the categorical values to numerical values. For est_amount, we converted each value to a z-score within its owner type.

Why: For 'ticker', whether a trade is a buy or a sell can depend on the performance of the specific stock. For example, when a company is at risk of being delisted, a higher proportion of individuals sell that stock than buy it. For 'owner', it can depend on the type of stock ownership, since different groups could trade differently, such as joint owners buying more than selling. For the disclosure-date features, we expect that members of Congress would likely disclose their buys and sales on separate dates, so we added the disclosure date in addition to the transaction date. For est_amount, it can depend on the amount traded, since individuals are more likely to make large panic sales than large panic buys.

Generalization Ability: The final model's test F1-Score is 0.7525264394829612, about 0.23 above the all-'buy' baseline accuracy of 0.5252973381159146 and about 0.10 above the baseline model's test F1-Score of 0.6520694259012015. The final model therefore generalizes to unseen data considerably better than the baseline model.

### Fairness Analysis

Null Hypothesis: Our model is fair. Its F1-Scores for Republicans and Democrats are roughly the same, and any difference is due to random chance.

Alternative Hypothesis: Our model is unfair. Its F1-Scores for Republicans and Democrats are significantly different.

alpha = 0.05

p-value: 0.0

Conclusion: We observed a p-value of 0.0 (none of the 200 permutations produced a difference at least as large as the observed one), which is less than our alpha of 0.05. We therefore reject the null hypothesis: the model's F1-Scores differ significantly between the two parties, with more accurate predictions for Republicans.

## Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [51]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score

In [50]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

### Baseline Model

Read in cleaned stock transaction file that we combined with party affiliation from project 3

In [4]:
df = pd.read_csv('transactions_w_party.csv')

In [5]:
# Dropped the index column
df = df.drop(['index'], axis = 1)
# Dropped the three Libertarian Party member's rows to simplify our model
df = df[df['Party']!='Libertarian'].copy()
# Filled NaN owner values with the placeholder 'missing'
df['owner'] = df['owner'].fillna('missing')

Simplified the four transaction types, 'purchase', 'sale_partial', 'sale_full', and 'exchange', to just buy and sell

In [6]:
# We dropped exchange
df = df[df['type']!='exchange'].copy()
df['type'] = df['type'].replace({'purchase': 'buy', 'sale_partial': 'sell', 'sale_full': 'sell'})

Created training and testing data for our model

In [7]:
# train test split
X = df[['transaction_date','Party']]
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Helper functions to be used in our function transformer

In [8]:
# Turns string date into an approximate day count (365-day years, 30-day months) for the transaction_date column
def num_tdays(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[0]) * 365 + int(ser.split('-')[1]) * 30 + int(ser.split('-')[-1])))
# Turns string date into an approximate day count (365-day years, 30-day months) for the disclosure_date column
def num_ddays(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[0]) * 365 + int(ser.split('-')[1]) * 30 + int(ser.split('-')[-1])))
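
The helpers above approximate elapsed days by treating every year as 365 days and every month as 30 days. If an exact day count is preferred, the standard library can supply one. This is a sketch that assumes the dates are ISO-formatted `YYYY-MM-DD` strings, as the `split('-')` calls suggest; `approx_day_number` and `exact_day_number` are hypothetical helpers, not part of the notebook.

```python
from datetime import date

def approx_day_number(s):
    """Same arithmetic as num_tdays: 365-day years, 30-day months."""
    y, m, d = (int(part) for part in s.split("-"))
    return y * 365 + m * 30 + d

def exact_day_number(s):
    """Proleptic Gregorian ordinal, where 0001-01-01 is day 1."""
    return date.fromisoformat(s).toordinal()

a, b = "2020-02-28", "2020-03-01"
approx_gap = approx_day_number(b) - approx_day_number(a)
exact_gap = exact_day_number(b) - exact_day_number(a)
print(approx_gap, exact_gap)
```

The two counts can disagree around month boundaries, which matters mainly if the gap between transaction and disclosure is used as a feature.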

In [9]:
# Turns string date into day of the month for transaction_date
def day(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[-1])))
# Turns string date into month for transaction_date
def month(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[1])))
#Turns string date into year for transaction_date
def year(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[0])))

Made our first Pipeline using the KNeighborsClassifier model

    Features: 
        Quantitative:
            1. Day of transaction 
            2. Month of transaction
            3. Year of transaction        
        Nominal:
            4. Party
How: For the quantitative features, we used helper functions to extract the respective parts of the date string. For 'Party', we used one-hot encoding to convert the categorical values to numerical values.

Why: For the date features, whether a trade is a buy or a sell can depend on the date, since historical events such as the 2008 recession can cause systematic changes in buying and selling behavior. For 'Party', it can depend on party affiliation, since the parties could have different buying and selling patterns.

In [45]:
# Column Transformer to encode the four above features
feature_eng_pipeline = ColumnTransformer([
    ('day', FunctionTransformer(day), ['transaction_date']),
    ('month', FunctionTransformer(month), ['transaction_date']),
    ('year', FunctionTransformer(year), ['transaction_date']),
    ('nominal', OneHotEncoder(), ['Party'])]
)
# Pipeline to combine the column transformer and the KNN classifier
pl = Pipeline([
    # Performs feature engineering 
    ('features', feature_eng_pipeline),
    ('knn', KNeighborsClassifier(n_neighbors=3))
])
# Fits the training data
pl.fit(X_train, y_train)
# F1 Score for the training set (true labels first, then predictions)
f1_score(np.array(y_train), pl.predict(X_train), pos_label='buy')

0.7121951219512195

In [46]:
# F1 Score for the testing set (true labels first, then predictions)
f1_score(np.array(y_test), pl.predict(X_test), pos_label='buy')

0.6520694259012015

In [12]:
# baseline accuracy
np.mean(y_train == 'buy')

0.5248300604229608

In [47]:
# improvement 
0.6520694259012015 - 0.5252973381159146

0.12677208778528692

Model Performance:
    
    Using F1-Score as our evaluation metric, we achieved a score of 0.6520694259012015 on our test set. We consider this model decent, since its score is 0.12677208778528692 above the baseline accuracy of 0.5252973381159146 (guessing all 'buy').
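
The reference value quoted above is an accuracy, while the model is scored with F1. For a like-for-like comparison, the F1 of an always-'buy' predictor can be derived from the share of buys alone: its recall is 1 and its precision equals that share. The sketch below assumes the share of buys in the evaluation data equals the 0.5252973381159146 quoted above; `all_buy_f1` is a hypothetical helper.

```python
def all_buy_f1(buy_share):
    """F1 for the 'buy' class when every trade is predicted as 'buy'."""
    precision = buy_share  # every prediction is 'buy', so precision = share of buys
    recall = 1.0           # no actual 'buy' is ever missed
    return 2 * precision * recall / (precision + recall)

buy_share = 0.5252973381159146  # assumed share of 'buy' rows
reference_f1 = all_buy_f1(buy_share)
print(round(reference_f1, 4))
```

Comparing the printed value with the model's test F1-Score gives a stricter reference point than the accuracy figure.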

### Final Model

New Features: 
    
    Quantitative:
        1. Day of disclosure 
        2. Month of disclosure
        3. Year of disclosure
        4. est_amount
    Nominal:
        5. ticker
        6. owner
        
How: For the quantitative features (besides est_amount), we used helper functions to extract the respective parts of the date string. For 'ticker', we used one-hot encoding to convert the categorical values to numerical values. For est_amount, we converted each value to a z-score within its owner type.

Why: For 'ticker', whether a trade is a buy or a sell can depend on the performance of the specific stock. For example, when a company is at risk of being delisted, a higher proportion of individuals sell that stock than buy it. For 'owner', it can depend on the type of stock ownership, since different groups could trade differently, such as joint owners buying more than selling. For the disclosure-date features, we expect that members of Congress would likely disclose their buys and sales on separate dates, so we added the disclosure date in addition to the transaction date. For est_amount, it can depend on the amount traded, since individuals are more likely to make large panic sales than large panic buys.
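
To make the encoding step concrete, here is a plain-Python sketch of one-hot encoding in which a category unseen during fitting maps to an all-zero row, the behavior requested by `handle_unknown='ignore'` in the pipelines below. The tickers are made up, and `fit_categories` and `one_hot` are hypothetical helpers.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from training values."""
    return sorted(set(values))

def one_hot(value, categories):
    """Return a 0/1 row; an unseen category yields all zeros."""
    return [1 if value == c else 0 for c in categories]

train_tickers = ["AAPL", "MSFT", "AAPL", "TSLA"]  # hypothetical tickers
categories = fit_categories(train_tickers)

print(categories)
print(one_hot("MSFT", categories))
print(one_hot("NFLX", categories))  # not seen in training
```

With one column per distinct ticker, the encoded matrix becomes wide and sparse, which is worth keeping in mind for a distance-based model.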

In [14]:
# train test split
X = df[['transaction_date','est_amount','Party','disclosure_date','ticker', 'owner']]
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Created a transformer to standardize the estimated amount by owner type

In [15]:
# Standardizes to z-score by group
from sklearn.base import BaseEstimator, TransformerMixin
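# Transformer that z-scores every column after the first, within groups given by the first column
class StdScalerByGroup(BaseEstimator, TransformerMixin):

    def __init__(self):
        # Fitted attributes (trailing underscore) are created in fit, not here
        pass

    def fit(self, X, y=None):
        df = pd.DataFrame(X)
        # Per-group mean and std, keyed by (column, 'mean') and (column, 'std')
        self.grps_ = dict(df.groupby(df.columns[0]).agg(['mean', 'std']))
        return self

    def transform(self, X, y=None):

        try:
            getattr(self, "grps_")
        except AttributeError:
            raise RuntimeError("You must fit the transformer before transforming the data!")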
        
        groups = X.iloc[:, 0].unique()
        output = X.iloc[:, 1:].copy()
        for c in X.columns[1:]: 
            total_vals = pd.Series(dtype='float') 
            for g in groups:
                group = X[X.iloc[:, 0] == g][c]
                grouped = (group - self.grps_[(c, 'mean')][g]) / self.grps_[(c, 'std')][g]
                total_vals = pd.concat([total_vals, pd.Series(grouped)])
            output[c] = total_vals
        return output
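
The per-group standardization that the class performs can be illustrated without scikit-learn. The sketch below uses the standard library's `statistics` module on made-up owner types and amounts; `zscore_by_group` is a hypothetical helper, and it uses the sample standard deviation, which matches the pandas default for `std`.

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_by_group(groups, values):
    """Standardize each value using the mean and sample std of its own group."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    stats = {g: (mean(vs), stdev(vs)) for g, vs in by_group.items()}
    return [(v - stats[g][0]) / stats[g][1] for g, v in zip(groups, values)]

# Hypothetical owner types and estimated amounts.
owners = ["self", "self", "self", "joint", "joint", "joint"]
amounts = [1000, 2000, 3000, 10000, 20000, 30000]
zscores = zscore_by_group(owners, amounts)
print(zscores)
```

A group with a single row has no sample standard deviation, so a real implementation needs a rule for that case, and likewise for groups that appear only in the test data.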

In [16]:
# Turns string date into day of the month (as an int) for disclosure_date
def dday(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[-1])))
# Turns string date into month (as an int) for disclosure_date
def dmonth(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[1])))
# Turns string date into year (as an int) for disclosure_date
def dyear(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[0])))

In [27]:
# KNN CLASSIFIER WITH NEW FEATURES

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StdScalerByGroup(), ['owner', 'est_amount'])
])
pl1 = Pipeline([
    ('features', feature_eng_pipeline),
    ('KN', KNeighborsClassifier(n_neighbors=2))
])
pl1.fit(X_train, y_train)
f1_score(np.array(y_train), pl1.predict(X_train), pos_label='buy')

0.877894570457599

In [28]:
f1_score(np.array(y_test), pl1.predict(X_test), pos_label='buy')

0.7525264394829612

 ##### Model Selection:
    We manually checked two other models: DecisionTreeClassifier and RandomForestClassifier. Using the same features in each model, we found that the test F1-Score was highest with KNeighborsClassifier (0.7525), followed by DecisionTreeClassifier (0.7183) and RandomForestClassifier (0.6837).

In [29]:
#RANDOM FOREST CLASSIFIER

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StdScalerByGroup(), ['owner', 'est_amount'])
])
pl2 = Pipeline([
    ('features', feature_eng_pipeline),
    ('tree', RandomForestClassifier(max_depth=19))
])
pl2.fit(X_train, y_train)
f1_score(np.array(y_train), pl2.predict(X_train), pos_label='buy')

0.7672526041666667

In [30]:
f1_score(np.array(y_test), pl2.predict(X_test), pos_label='buy')

0.6836610827870842

In [31]:
#DECISION TREE CLASSIFIER

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StdScalerByGroup(), ['owner', 'est_amount'])
])
pl3 = Pipeline([
    ('features', feature_eng_pipeline),
    ('tree', DecisionTreeClassifier(max_depth=19))
])
pl3.fit(X_train, y_train)
f1_score(np.array(y_train), pl3.predict(X_train), pos_label='buy')

0.87170626349892

In [32]:
f1_score(np.array(y_test), pl3.predict(X_test), pos_label='buy')

0.7182549987016359

 ##### GridSearch:
    Using GridSearchCV, we found that the best n_neighbors hyperparameter is 1. Because a single neighbor overfits the training data, our final model uses n_neighbors = 2.
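
Why a single neighbor tends to overfit can be seen in a small plain-Python sketch: when a model is scored on the same points it memorized, each point's nearest neighbor is itself, so training accuracy is perfect however noisy the labels are. The data and the helper `predict_1nn` are hypothetical, and the sketch assumes no two training points share identical features.

```python
def predict_1nn(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    nearest = distances.index(min(distances))
    return train_y[nearest]

# Hypothetical (month, day) features with deliberately noisy labels.
train_X = [(1, 5), (1, 6), (2, 5), (2, 6), (3, 5)]
train_y = ["buy", "sell", "buy", "sell", "buy"]

train_preds = [predict_1nn(train_X, train_y, x) for x in train_X]
train_accuracy = sum(p == t for p, t in zip(train_preds, train_y)) / len(train_y)
print(train_accuracy)
```

A perfect training score of this kind says nothing about unseen trades, which is why the held-out score is the one to compare across values of n_neighbors.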

In [33]:
hyperparameters = {'KN__n_neighbors': np.arange(1,10)}
# Note: for single-label classification, 'f1_micro' is equal to accuracy
searcher = GridSearchCV(pl1, hyperparameters, scoring='f1_micro')
searcher.fit(X_train, y_train)
searcher.best_estimator_

Pipeline(steps=[('features',
                 ColumnTransformer(transformers=[('dday',
                                                  FunctionTransformer(func=<function dday at 0x000001FBC24ACCA0>),
                                                  ['disclosure_date']),
                                                 ('dmonth',
                                                  FunctionTransformer(func=<function dmonth at 0x000001FBA0BEEE50>),
                                                  ['disclosure_date']),
                                                 ('dyear',
                                                  FunctionTransformer(func=<function dyear at 0x000001FBC2669310>),
                                                  ['disclosure_date']),
                                                 ('tday',
                                                  Functi...
                                                  ['transaction_date']),
                                       

##### Final Model
    Classification Model: K Neighbors Classifier
    Features: Disclosure date, transaction date, party, ticker, owner, and estimated amount
    Hyperparameter: n_neighbors = 2
    F1-Score: 0.7525264394829612

In [34]:
# FINAL MODEL

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StdScalerByGroup(), ['owner', 'est_amount'])
])
finalpl = Pipeline([
    ('features', feature_eng_pipeline),
    ('KN', KNeighborsClassifier(n_neighbors=2))
])
finalpl.fit(X_train, y_train)
f1_score(np.array(y_train), finalpl.predict(X_train), pos_label='buy')

0.877894570457599

In [35]:
# Final Model F1 Score
f1_score(np.array(y_test), finalpl.predict(X_test), pos_label='buy')

0.7525264394829612

### Fairness Analysis

Null Hypothesis: Our model is fair. Its F1-Scores for Republicans and Democrats are roughly the same, and any difference is due to random chance.

Alternative Hypothesis: Our model is unfair. Its F1-Scores for Republicans and Democrats are significantly different.

alpha = 0.05

p-value: 0.0

Conclusion: We observed a p-value of 0.0 (none of the 200 permutations produced a difference at least as large as the observed one), which is less than our alpha of 0.05. We therefore reject the null hypothesis: the model's F1-Scores differ significantly between the two parties, with more accurate predictions for Republicans.

##### Computing observed test statistic

In [38]:
Republican = df[df['Party'] == 'Republican']
Democratic = df[df['Party'] == 'Democratic']

# train test split REPUBLICAN

Xr = Republican[['transaction_date','est_amount','Party','disclosure_date','ticker', 'owner']]
yr = Republican['type']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.25)
finalpl.fit(Xr_train, yr_train)

r_f1 = f1_score(np.array(yr_test), finalpl.predict(Xr_test), pos_label='buy')

# train test split DEMOCRATIC

Xd = Democratic[['transaction_date','est_amount','Party','disclosure_date','ticker', 'owner']]
yd = Democratic['type']
Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.25)
finalpl.fit(Xd_train, yd_train)

d_f1 = f1_score(np.array(yd_test), finalpl.predict(Xd_test), pos_label='buy')

# observed test statistics
observed_abs_diff = np.abs(r_f1-d_f1)
observed_diff = r_f1-d_f1

In [39]:
observed_abs_diff

0.11381726813550685
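
The cell above refits the pipeline separately on each party's rows. An alternative way to compare groups, shown here only as a sketch and not as the method used in this notebook, is to keep one fitted model and score its test predictions separately per group. The labels, predictions, and the helpers `f1_score_binary` and `f1_by_group` are hypothetical.

```python
def f1_score_binary(y_true, y_pred, pos_label="buy"):
    """F1 for pos_label from true-positive, false-positive, false-negative counts."""
    tp = sum(t == pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fp = sum(t != pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fn = sum(t == pos_label and p != pos_label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def f1_by_group(y_true, y_pred, groups):
    """Score the same predictions separately for each group label."""
    scores = {}
    for g in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == g]
        scores[g] = f1_score_binary([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx])
    return scores

# Hypothetical test labels, predictions from one model, and party labels.
y_true = ["buy", "sell", "buy", "buy", "sell", "buy", "sell", "buy"]
y_pred = ["buy", "sell", "sell", "buy", "buy", "buy", "sell", "sell"]
party = ["Republican"] * 4 + ["Democratic"] * 4
scores = f1_by_group(y_true, y_pred, party)
print(scores, abs(scores["Republican"] - scores["Democratic"]))
```

Scoring a single model per group isolates how that model treats each party, whereas refitting per group also reflects differences in how much training data each party contributes.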

##### Permutation test

In [40]:
# Simulation
results = []
df_p = df.copy()

for _ in range(200):
    
    # Permutation
    
    df_p['Shuffled_Party'] = np.random.permutation(df_p['Party'])
    Republican = df_p[df_p['Shuffled_Party'] == 'Republican']
    Democratic = df_p[df_p['Shuffled_Party'] == 'Democratic']
    
    # Train test split Republican
    
    Xr = Republican[['transaction_date','est_amount','Party','disclosure_date','ticker', 'owner']]
    yr = Republican['type']
    Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.25)
    finalpl.fit(Xr_train, yr_train)
    # Republican F1-Score
    r_f1 = f1_score(np.array(yr_test), finalpl.predict(Xr_test), pos_label='buy')


    # Train test split Democratic
    
    Xd = Democratic[['transaction_date','est_amount','Party','disclosure_date','ticker', 'owner']]
    yd = Democratic['type']
    Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.25)
    finalpl.fit(Xd_train, yd_train)
    # Democratic F1-Score
    d_f1 = f1_score(np.array(yd_test), finalpl.predict(Xd_test), pos_label='buy')

    
    abs_diff = np.abs(r_f1-d_f1)
    
    # Test statistics
    results.append(abs_diff)

In [41]:
# Share of permuted differences at least as large as the observed one
p_value = np.mean(np.array(results) >= observed_abs_diff)

In [42]:
p_value

0.0
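
The permutation logic can be reduced to a few lines of plain Python. The sketch below shuffles group labels and compares the absolute difference in a per-group statistic against the observed value. It uses a simple mean of 0/1 correctness flags rather than refitting a model and computing F1, and the data, seed, and helper names are hypothetical.

```python
import random

def abs_mean_diff(values, labels, a="Republican", b="Democratic"):
    """Absolute difference between the mean of `values` in group a and group b."""
    va = [v for v, g in zip(values, labels) if g == a]
    vb = [v for v, g in zip(values, labels) if g == b]
    return abs(sum(va) / len(va) - sum(vb) / len(vb))

def permutation_p_value(values, labels, n_repetitions=200, seed=0):
    """Share of label shufflings whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = abs_mean_diff(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_repetitions):
        rng.shuffle(shuffled)  # group sizes stay fixed, only assignments change
        if abs_mean_diff(values, shuffled) >= observed:
            hits += 1
    return observed, hits / n_repetitions

# Hypothetical 0/1 flags: 1 means the prediction for that trade was correct.
correct = [1] * 9 + [0] * 1 + [1] * 5 + [0] * 5
party = ["Republican"] * 10 + ["Democratic"] * 10
observed, p_value = permutation_p_value(correct, party)
print(observed, p_value)
```

With 200 repetitions, the smallest nonzero p-value that can be reported is 1/200 = 0.005, so a printed 0.0 means that none of the shuffles reached the observed difference.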