## Detection and Use of Insider Trading Information

#### This is one of three notebooks comprising the capstone project.  


#### The other two notebooks are: `get_insider_buys.ipynb`, and `insider_trade_detection.ipynb`.

### The main notebook is  `insider_trade_detection.ipynb`


* Purpose of this notebook: perform the lateral prediction step of the project.
* Input: a list of ticker symbols, read from the file `tickerlist_naz100.csv`.
* Output: the file `lrc_inferred.csv`.

The output file `lrc_inferred.csv` is then used in the notebook `insider_trade_detection.ipynb`.

#### Overall outline of lateral predictor.

1. Read `ohlcv_all` and ensure the dates are the same for every ticker.
2. Compute `lrc_all`. Make sure to set the initial values to 0 instead of NaN.
3. We'll make lateral predictions for each date in `prediction_dates`, so we set up `prediction_dates` as follows.
     - a. We want `prediction_dates` to start as early as possible and end as late as the available data allows.
     - b. "As early as possible": for each target date `tdate`, we want to make the lateral predictions for all stocks for `tdate`, based on a regression over the past `n_days_back` days of data from the neighbors. This constrains how far back `prediction_dates` can begin.
     - c. We also want to see what happens after `tdate`, over the next `n_days_fwd` days (about 20, say). This constrains how late `prediction_dates` can end.
     - d. So for any `tdate` in `prediction_dates`, we can access data in `lrc_all` from `n_days_back` days before `tdate` up to `n_days_fwd` days after `tdate`.
     
For each `tdate` in `prediction_dates`:

4. Create `regression_dates`, the dates on which we'll do the regressions for `tdate`.
    - a. The features are the `lrc` values for the dates leading up to `tdate`. Create a NumPy array consisting of the `lrc` column for all tickers and dates from (`tdate` - `n_days_back`) to (`tdate` - 1).
    - b. Compute the distance matrix based on that NumPy array.
    - c. For each target ticker:
        - i. Use the distance matrix to get the neighborhood.
        - ii. Use the `lrc` data for each neighbor, for all the `regression_dates`, to laterally predict the `lrc` value of the target ticker on the target date.
        - iii. To check for the insider effect, compare the relative return over the next `n_days_fwd` days for the neighborhood as a whole vs. that of the target stock.
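
Step 4.c.iii can be sketched on toy data as follows. This is a minimal illustration, not the project's implementation: the helper `forward_excess_return`, the wide (date by ticker) layout `lrc_wide`, and the toy tickers are all assumptions made for the example. The forward window starts at `tdate` itself, which is also an assumption.

```python
import pandas as pd

def forward_excess_return(lrc_wide, target, neighbors, tdate, n_days_fwd):
    """Hypothetical helper: cumulative log-return of `target` minus the mean
    cumulative log-return of `neighbors` over n_days_fwd days starting at tdate."""
    start = lrc_wide.index.get_loc(tdate)
    window = lrc_wide.iloc[start:start + n_days_fwd]
    target_ret = window[target].sum()          # log-returns add up over time
    nbd_ret = window[neighbors].sum().mean()   # average over the neighborhood
    return target_ret - nbd_ret

# Toy log-returns for three made-up tickers
dates = pd.bdate_range("2024-01-01", periods=6)
lrc_wide = pd.DataFrame(
    {"AAA": [0.00, 0.01, 0.02, 0.01, 0.00, 0.01],
     "BBB": [0.00, 0.00, 0.01, 0.00, 0.00, 0.00],
     "CCC": [0.00, 0.01, 0.00, 0.01, 0.00, 0.00]},
    index=dates)

excess = forward_excess_return(lrc_wide, "AAA", ["BBB", "CCC"], dates[1], n_days_fwd=3)
print(round(excess, 4))
```

A positive value means the target outperformed its neighborhood over the forward window; whether that signals an insider effect is a question for the main notebook.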



In [None]:
from datetime import datetime, timedelta
#import datetime
import insider_trade_detector as itd
import yfinance as yf
import pandas as pd
import numpy as np
import time
from pytz import timezone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt





In [None]:
# def get_datestamp(date_str):
#     """
#     Convert a date string to a timezone-aware timestamp for the 'America/New_York' timezone.

#     :param date_str: String representing the date in 'YYYY-MM-DD' format.
#     :return: Timezone-aware datetime object for the 'America/New_York' timezone.
#     """
#     # Parse the string into a datetime object
#     naive_datetime = datetime.strptime(date_str, "%Y-%m-%d")

#     # Define the New York timezone
#     new_york_tz = timezone("America/New_York")

#     # Localize the datetime object to New York timezone
#     aware_datetime = new_york_tz.localize(naive_datetime)

#     return aware_datetime


In [None]:
def ohlcv_all_to_lrc_all(ohlcv_all):
    """Return a DataFrame with one column, 'lrc' (daily log-return), indexed by (ticker, date)."""
    lrc_all = []
    for ticker in ohlcv_all.index.get_level_values(0).unique():
        ohlcv_single = ohlcv_all.xs(ticker, level='ticker')
        previous_close = ohlcv_single['Close'].shift(1)
        lrc = pd.DataFrame({'lrc': np.log(ohlcv_single['Close'] / previous_close)})
        # The first day has no previous close, so set its value to 0 instead of NaN.
        lrc.iloc[0] = 0
        lrc_all.append(lrc.set_index([pd.Index([ticker] * len(lrc), name='ticker'), ohlcv_single.index]))
    lrc_all_df = pd.concat(lrc_all)
    return lrc_all_df


#### Data Acquisition and Structure
In this section, we use `yfinance` to import OHLCV (Open, High, Low, Close, Volume) data for 40 selected stocks, and additionally QQQ and XLK, covering every trading day over the past 11 years. We then check that all the stocks have the same dates.

The data is structured into a Pandas DataFrame named `ohlcv_all`. This DataFrame is indexed on two levels, `ticker` and `date`, and has the columns `Open`, `High`, `Low`, `Close`, and `Volume`.
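
To make the two-level layout concrete, here is a small sketch with made-up tickers and prices (`AAA`, `BBB` and all values are placeholders, not project data). It builds a frame with the same `ticker`/`date` index via `pd.concat` and slices it with `xs`.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-02", "2024-01-03"])

# One small frame per (made-up) ticker, keyed by ticker symbol
frames = {
    "AAA": pd.DataFrame({"Close": [10.0, 11.0], "Volume": [100, 120]}, index=dates),
    "BBB": pd.DataFrame({"Close": [20.0, 19.0], "Volume": [200, 180]}, index=dates),
}

# The dict keys become the outer index level; `names` labels both levels
toy = pd.concat(frames, names=["ticker", "date"])

one_ticker = toy.xs("AAA", level="ticker")   # all dates for one ticker
one_day = toy.xs(dates[1], level="date")     # all tickers for one date
print(toy)
```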


In [None]:
symbol_list = sorted(itd.read_tickerlist_csv('tickerlist_naz100.csv'))
#symbol_list = symbol_list[20:26]
symbol_list = list({'QQQ','XLK'}.union(set(symbol_list)))
symbol_dict = { sym: yf.Ticker(sym) for sym in symbol_list }
print(symbol_list)

start_date, late_ticker = itd.get_start_date(symbol_list,start_date_pad=20)
start_date = start_date.strftime('%Y-%m-%d')
#start_date = '2020--01'; late_ticker=None

print(len(symbol_list)-2, 'Nasdaq-100 tech stocks, plus QQQ and XLK. \n')
print(start_date, '\t "Latest" ticker is', late_ticker)

yesterday_date =  (datetime.now() - timedelta(days=2)).strftime('%Y-%m-%d')

# Ticker.history takes `period` as its first positional argument, so pass only keyword arguments here.
ohlcv_all = pd.concat(
    {sym : symbol_dict[sym].history(start=start_date, end=yesterday_date, actions=False) for sym in symbol_dict},
    names=['ticker', 'date'])

In [None]:
# Verify that each ticker has the same date indices.
unique_tickers = ohlcv_all.index.get_level_values('ticker').unique()
first_ticker_dates = ohlcv_all.xs(unique_tickers[0], level='ticker').index
all_dates_same = True
for ticker in unique_tickers:
    ticker_dates = ohlcv_all.xs(ticker, level='ticker').index
    if not ticker_dates.equals(first_ticker_dates):
        all_dates_same = False
        print(f'\t {ticker} has different dates from {unique_tickers[0]}')
if all_dates_same:
    print("All tickers have the same Date indices.")
else:
    print("Not all tickers have the same Date indices.")


#### Computing Daily Log-Returns
The next step is to calculate the daily log-return for each stock on each trading day. The results go in a new DataFrame, `lrc_all`, with the same two-level indexing (`ticker` and `date`).

This DataFrame serves as the foundation for the subsequent steps of identifying stock correlations and anomalies.
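
As a worked example, the log-return on day t is ln(Close_t / Close_{t-1}), with the first day of each ticker set to 0. The sketch below computes this on toy data with a `groupby` over the `ticker` level; it is an alternative formulation for illustration (tickers and prices are made up), not the function used in this notebook.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["ticker", "date"])
toy = pd.DataFrame({"Close": [100.0, 110.0, 99.0, 50.0, 50.0, 55.0]}, index=index)

# Shift within each ticker so the first day of a ticker never sees another ticker's close
prev_close = toy.groupby(level="ticker")["Close"].shift(1)

# Log-return; the first day of each ticker has no previous close, so fill NaN with 0
lrc_toy = np.log(toy["Close"] / prev_close).fillna(0.0).to_frame("lrc")
print(lrc_toy)
```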


In [None]:
lrc_all = ohlcv_all_to_lrc_all(ohlcv_all)

Set some parameters and the working variables that depend on them. The key one is `prediction_dates`: we compute a predicted LRC value for each stock for each day in `prediction_dates`.
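
To see how these quantities relate, here is a toy-sized sketch (5 days back, 2 days forward, 10 dates; all values chosen only for illustration). It shows which dates are eligible for prediction and how the exponentially decaying regression weights are built so that the most recent day gets the largest weight.

```python
import numpy as np
import pandas as pd

n_days_back, n_days_fwd, ewm_span = 5, 2, 4   # toy values, not the notebook's
all_dates = pd.bdate_range("2024-01-01", periods=10)

# Leave room for the look-back window before and the look-ahead window after
first = n_days_back + 1
last = len(all_dates) - n_days_fwd
toy_prediction_dates = all_dates[first:last]

# Exponential weights: exponent 0 for the newest day, n_days_back - 1 for the oldest
alpha = 2 / (ewm_span + 1)
weights = (1 - alpha) ** np.arange(n_days_back)[::-1]
weights = weights / weights.sum()             # normalize to sum to 1

print(list(toy_prediction_dates.strftime("%Y-%m-%d")))
print(weights.round(3))
```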

In [None]:
n_days_back = 200
n_days_fwd = 10
ewm_span = 200
nbd_size = 3


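# All tickers share the same dates (verified above), so use the first ticker's date index.
ticker_dates = first_ticker_dates
start_index_for_predictions = n_days_back + 1
end_index_for_predictions = len(ticker_dates) - n_days_fwd
prediction_dates = ticker_dates[start_index_for_predictions:end_index_for_predictions]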

ewm_alpha = 2 / (ewm_span + 1)
ewm_weights = np.array([(1 - ewm_alpha) ** i for i in range(n_days_back)])
ewm_weights = ewm_weights[::-1]
ewm_weights = ewm_weights / ewm_weights.sum()

Initialize and run the main loop. Create a `LinearRegression` called `lateral_predictor` and use it as follows:

* For each `tdate` in `prediction_dates`:
    * Compute the distance matrix.
    * For each `symb`:
        - Find the `nbd_size` nearest neighbors.
        - Use the `LinearRegression` object to compute the inferred LRC value for `symb` on `tdate`, based on the LRCs of the rest of the neighborhood.
        - Record these inferred LRC values, and also the R^2 values for the training data and for the testing data.

The main results of the loop go in `lrc_inferred`.

This takes several minutes. 
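
The core of one loop iteration can be sketched on synthetic data as below. Several parts are assumptions made for the example: a plain Pearson correlation stands in for `itd.corr_ewm`, the helper `nearest_neighbors` is a hypothetical stand-in for `itd.find_neighbors` (whose actual behavior is not shown here), and the linear sample weights replace the exponential ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_obs = 300
factor = rng.normal(0, 0.01, n_obs)              # shared driver of three tickers
lrc_wide = pd.DataFrame({
    "AAA": factor + rng.normal(0, 0.002, n_obs),
    "BBB": factor + rng.normal(0, 0.002, n_obs),
    "CCC": factor + rng.normal(0, 0.002, n_obs),
    "ZZZ": rng.normal(0, 0.01, n_obs),           # unrelated ticker
})
train, target_row = lrc_wide.iloc[:-1], lrc_wide.iloc[[-1]]

# Assumption: distance = 1 - Pearson correlation (the notebook uses an EWM correlation)
distance = 1 - train.corr()

def nearest_neighbors(distance, symb, k):
    """Hypothetical stand-in for itd.find_neighbors: the k tickers closest to symb."""
    return distance[symb].drop(symb).nsmallest(k).index.tolist()

nbrs = nearest_neighbors(distance, "AAA", k=2)

# Weighted regression of the target's returns on its neighbors' returns
weights = np.linspace(0.5, 1.0, len(train))      # newer rows count more
model = LinearRegression()
model.fit(train[nbrs].values, train["AAA"].values, sample_weight=weights)

inferred = model.predict(target_row[nbrs].values)[0]   # lateral prediction for the last day
r2_train = model.score(train[nbrs].values, train["AAA"].values)
print(nbrs, inferred, r2_train)
```

The gap between `inferred` and the actual return of the target on that day is the quantity of interest: a stock that moves in a way its neighbors do not explain is a candidate anomaly.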

In [None]:
lrc_inferred = pd.DataFrame(index=lrc_all.index, columns=['lrc_inferred', 'r2_train', 'r2_test'])
lateral_predictor = LinearRegression()
rsquareds = []
testvals_5day = []
for tdate in prediction_dates:

    tdate_index = ticker_dates.get_loc(tdate)

    traindates_start_index = tdate_index-n_days_back-1
    traindates_end_index = tdate_index-1
    train_dates = ticker_dates[traindates_start_index:traindates_end_index]
    
    testdates_start_index = tdate_index
    testdates_end_index = tdate_index+n_days_fwd
    test_dates = ticker_dates[testdates_start_index:testdates_end_index]


    lrc_train = lrc_all.loc[(slice(None), train_dates),:]
    lrc_test = lrc_all.loc[(slice(None), test_dates), :]
    
    distance_matrix =  1 - itd.corr_ewm(lrc_train,ewm_span)
    
    for symb in symbol_dict.keys():
        nbrs,tightness =  itd.find_neighbors( distance_matrix, symb, k=nbd_size )
        
        X_train = lrc_train.loc[nbrs].unstack(level='ticker')['lrc'].values
        y_train = lrc_train.loc[symb].values.squeeze()
        lateral_predictor.fit(X_train, y_train, sample_weight=ewm_weights)
        y_hat_train = lateral_predictor.predict(X_train)
        residuals = y_train - y_hat_train
        mse_train = mean_squared_error(y_train, y_hat_train)
        r2_train = lateral_predictor.score(X_train, y_train)
        
        X_test = lrc_test.loc[nbrs].unstack(level='ticker')['lrc'].values
        y_test = lrc_test.loc[symb].values.squeeze()
        y_predicted_test = lateral_predictor.predict(X_test)
        residuals = y_test - y_predicted_test
        mse_test = mean_squared_error(y_test, y_predicted_test)
        if r2_train>0.5:
            r2_test = lateral_predictor.score(X_test, y_test)
        else:
            r2_test = 0.0


        rsquareds.append([r2_train, r2_test])
        lrc_inferred.loc[(symb,tdate),'lrc_inferred'] = y_predicted_test[0]
        lrc_inferred.loc[(symb,tdate),'r2_train'] = r2_train
        lrc_inferred.loc[(symb,tdate),'r2_test'] = r2_test

Write `lrc_inferred` to a csv file. 
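
The round trip for a two-level index can be sketched as follows, using an in-memory buffer instead of a file and made-up values. Note that `pd.read_csv` returns the `date` column as strings unless it is asked to parse them; the `parse_dates` argument used here is an optional addition, not something this notebook does.

```python
import io
import pandas as pd

dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["ticker", "date"])
toy = pd.DataFrame({"lrc": [0.0, 0.01, 0.0, -0.02]}, index=index)

# Flatten the index into ordinary columns before writing
buffer = io.StringIO()
toy.reset_index().to_csv(buffer, index=False)
buffer.seek(0)

# Read back, parse the dates, and rebuild the (ticker, date) index
restored = pd.read_csv(buffer, parse_dates=["date"]).set_index(["ticker", "date"])
print(restored)
```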

In [None]:
lrc_inferred['lrc'] = lrc_all['lrc']
lrc_inferred = lrc_inferred[['lrc', 'lrc_inferred', 'r2_train', 'r2_test']]
lrc_inferred = lrc_inferred.loc[lrc_inferred.index.get_level_values('date').isin(prediction_dates)]
lrc_inferred.shape
lrc_inferred.reset_index().to_csv('lrc_inferred.csv', index=False)
time.sleep(1)
x = pd.read_csv('lrc_inferred.csv')
x.set_index(['ticker', 'date'], inplace=True)
x.head(5)