# Extra Credit Opportunity

# Can You Predict a 20-Day Change in Stock Price?
* Get 5 points added to your lowest Major Lab grade (Lab 1, 2, or 3) for completing the extra credit.
* Get 10 additional points if you achieve a Mean Absolute Error < 25% OR a Correct Direction rate > 57.6%.
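
Both thresholds can be checked with a few lines of NumPy. This is only a minimal sketch on made-up numbers: `y_true` and `y_pred` are hypothetical arrays, and a change of exactly zero is counted as an increase, which matches the direction logic used in the observations section of this notebook.

```python
import numpy as np

# Hypothetical actual and predicted 20-day percentage changes (not real data)
y_true = np.array([10.0, -5.0, 2.5, -12.0, 0.0])
y_pred = np.array([4.0, -1.0, -3.0, -8.0, 1.0])

# Mean Absolute Error, in percentage points
mae = np.mean(np.abs(y_true - y_pred))

# Direction accuracy: a change below zero is a decrease, anything else an increase
same_direction = (y_true < 0) == (y_pred < 0)
direction_accuracy = same_direction.mean() * 100

print('MAE:', mae)
print('Correct direction (%):', direction_accuracy)
```

A lower MAE and a higher direction accuracy are both better, and the two do not always move together.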

# Instructions 
1. Create and execute three original 30-day simulation models (examples below), trying to improve your MAE each time.
2. Pickle your predictions for each run using the .gz file extension and the file names shown below.
3. Execute the code in the observations section using the predictions from your best run. 
4. Update with your own observations. Feel free to add any visuals or tables you need to support your observations.
5. State how many Extra Credit points you think you should be awarded.  If you qualify for the 10 additional points, explain how / why you think your model achieves a better result. 
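
The pickling step can be sanity-checked with a quick round trip. The sketch below uses a tiny made-up frame and a temporary folder (both are assumptions for illustration); pandas infers gzip compression from the `.gz` extension, so no extra arguments are needed.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for a simulation's prediction frame
pred_data = pd.DataFrame({'Pct_Change_20': [10.9, -13.0], 'y_pred': [-0.8, -2.1]})

# A temporary folder is used here; in the assignment you would use your own Data folder
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'pred_data_e1.gz')

# pandas infers gzip compression from the .gz extension
pred_data.to_pickle(path)

# Reload and confirm the round trip preserved the predictions
reloaded = pd.read_pickle(path)
print(reloaded.equals(pred_data))
```

Reloading the file before you submit is a cheap way to confirm it is readable.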

# Rules
* All extra credit work and submissions must be individual. Do not share your approaches.
* "Original" simulations include you making changes to the models, parameter tuning, feature selection, feature engineering, training record selection strategies, etc. to achieve a better result.
* Feel free to run more than three simulations, but only include your best 3 simulations in the final submission. 
* You MUST use the make_cv_folds() function and the train / test folds it creates for the simulation.  DO NOT modify this code in any way.
* You MUST use the example code below to execute your simulations.  
* Only modify the simulation code to improve the model predictions.
* You must make a prediction for each of the 30 days and 2,065 test records as shown below.
* You must train your model using some or all of the training data for each respective fold (see examples below).
* DO NOT mix folds in any way. Use the simulation examples as a starting point. Training records for each fold are at least 19 days older than the test records you are trying to predict (make_cv_folds() sets train_end_dt to the test date minus 19 days).
* DO NOT add new or future training data to the folds!!! 
* Simulation predictions must be pickled/saved using the filenames: pred_data_e1.gz, pred_data_e2.gz, and pred_data_e3.gz as shown below.
* Your work MUST be reproducible! I will not award any points if I cannot reproduce your work using the cv_folds below!
* You may use any scikit learn tools to create better predictions.
* I WILL MAKE A FINAL DETERMINATION ON AWARDING EXTRA CREDIT POINTS. PLEASE ASK IF YOU HAVE QUESTIONS, DO NOT ASSUME.

# Things to Consider Trying
* Modify TfidfVectorizer() parameters to make better features. 
* Use different models or model ensembles.
* Use sklearn.feature_selection (SelectFromModel, SelectKBest) to select better features.
* Try out the AdaBoostRegressor()
* Try dimensionality reduction or remove very rare features from the TfidfVectorizer's sparse matrix.  
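
As one possible starting point for these ideas, the sketch below chains rare-term pruning (`min_df`), `SelectKBest`, and a `Ridge` regressor in a single pipeline. The six-document corpus, the target values, and the choices `min_df=2` and `k=4` are all made up for illustration; they are not tuned for the SEC data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

# Tiny hypothetical corpus and targets (not real filings)
docs = [
    'revenue increased strong growth quarter restatement',
    'revenue increased record growth demand',
    'revenue declined weak demand quarter',
    'losses increased weak demand impairment',
    'strong growth record revenue quarter',
    'losses impairment weak revenue declined',
]
y = [5.0, 6.0, -4.0, -7.0, 8.0, -9.0]

pipe = Pipeline(steps=[
    # min_df=2 drops terms that appear in only one document (here: 'restatement')
    ('tfidf', TfidfVectorizer(min_df=2)),
    # Keep the 4 terms with the strongest univariate relationship to y
    ('kbest', SelectKBest(score_func=f_regression, k=4)),
    ('reg', Ridge(alpha=1.0)),
])
pipe.fit(docs, y)

n_vocab = len(pipe.named_steps['tfidf'].vocabulary_)
n_kept = int(pipe.named_steps['kbest'].get_support().sum())
preds = pipe.predict(docs[:2])
print('Terms after min_df pruning:', n_vocab)
print('Terms kept by SelectKBest:', n_kept)
```

Because the selector is fit inside the pipeline, it only ever sees the training data passed to `fit()`, which keeps the folds separate.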

In [4]:
import pandas as pd
import numpy as np

sec_data = pd.read_pickle('./Data/SecData.gz')

### Introduction to the Data
**This dataframe has the raw text and metadata for 6,215 SEC 10-Q filings.  Field definitions are below:**
* CompanyCIK - Used by the SEC to identify each company.
* CompanyName - Name of company filing the report
* FileType - These should all be '10-Q'.
* FileDate - Date the filing was made.
* EdgarTextUrl & EdgarHtmlUrl - Links to the report on the SEC's website.
* AccessionNumber - A unique identifier assigned automatically to an accepted submission by the SEC.
* SecFileName - The SEC quarterly index file where this filing is located. 
* CompanyTicker - The stock ticker for this company (when available)
* FileDate_ClosingPrice - Closing stock price on the FileDate.
* FileDate_Plus_20 - The FileDate + 20 days.
* FileDate_Plus_20_Price - Closing stock price on the FileDate_Plus_20.
* Pct_Change_20 - The percentage change between the FileDate_ClosingPrice and the FileDate_Plus_20_Price.
* FileName - The name of this filing on my local filing system in case we need to look at it. 
* file_text_length - Number of text characters extracted from the filing. I have removed all filings with fewer than 25,000 characters.
* f_text - The actual raw text extracted from the filing. 
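
To make the target concrete, here is a small worked example of Pct_Change_20, assuming the usual percentage-change formula. The helper name `pct_change` is hypothetical; the two prices are the AMTI values shown in the prediction output later in this notebook, where Pct_Change_20 is displayed as roughly 10.90.

```python
def pct_change(start_price, end_price):
    """Percentage change from start_price to end_price."""
    return (end_price - start_price) / start_price * 100

# FileDate_ClosingPrice and FileDate_Plus_20_Price for AMTI, as displayed later on
change = pct_change(41.09, 45.57)
print(round(change, 2))
```

Note that the same dollar move produces a much larger percentage for a low-priced stock, which is why penny stocks can dominate the error.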

In [5]:
sec_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6215 entries, 2021-01-04 to 2021-06-28
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CompanyCIK              6215 non-null   int64         
 1   CompanyName             6215 non-null   object        
 2   FileType                6215 non-null   object        
 3   FileDate                6215 non-null   datetime64[ns]
 4   EdgarTextUrl            6215 non-null   object        
 5   EdgarHtmlUrl            6215 non-null   object        
 6   AccessionNumber         6215 non-null   object        
 7   SecFileName             6215 non-null   object        
 8   CompanyTicker           6215 non-null   object        
 9   FileDate_ClosingPrice   6215 non-null   float64       
 10  FileDate_Plus_20        6215 non-null   datetime64[ns]
 11  FileDate_Plus_20_Price  6215 non-null   float64       
 12  Pct_Change_20           6215 n

### Create Folds for Cross Validation
**Below I create the train and test folds you must use for cross validation.  Here are the rules:**
1. Everyone must use the 30 train and test folds below.
2. Notice that the number of available training records grows from fold to fold.
3. You will likely not be able to use all the training data for each fold due to processing time.
4. Older training data might also make for stale predictions. You can test to find an optimal amount.
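
One way to test for an optimal amount of training data is to loop over a few window sizes and compare MAE. The sketch below does this on synthetic numeric data with a plain `Ridge` model; the frame, the column names `x` and `y`, and the candidate sizes are all assumptions for illustration, not the assignment's text pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)

# Synthetic stand-in for one fold: 500 dated training rows and 50 test rows
train = pd.DataFrame({'x': rng.normal(size=500)},
                     index=pd.date_range('2021-01-01', periods=500, freq='D'))
train['y'] = 2.0 * train['x'] + rng.normal(scale=0.5, size=500)
test = pd.DataFrame({'x': rng.normal(size=50)})
test['y'] = 2.0 * test['x'] + rng.normal(scale=0.5, size=50)

# Try several hypothetical window sizes; tail(n) keeps the n most recent rows
# because the frame is sorted by date
window_mae = {}
for n_recent in [100, 250, 500]:
    recent = train.tail(n_recent)
    model = Ridge(alpha=1.0).fit(recent[['x']], recent['y'])
    window_mae[n_recent] = mean_absolute_error(test['y'], model.predict(test[['x']]))

for n_recent, mae in window_mae.items():
    print(n_recent, round(mae, 3))
```

In the real simulation you would compare window sizes across all 30 folds rather than on a single split.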

# --- DO NOT MODIFY THIS CODE --- 


In [44]:
from datetime import date, timedelta

def make_cv_folds():

    retrain_after_days = 1 
    current_date = sec_data['FileDate'].min()
    end_dt = sec_data['FileDate'].max()
    #end_dt = current_date + timedelta(days=3)

    cv_data = []

    while current_date <= end_dt:

        # Get the start and end dates for the train and test period
        train_end_dt = (current_date - timedelta(days=19))
        test_start_dt = current_date
        test_end_dt = (current_date +  timedelta(days=retrain_after_days-1))

        # Create the train and test datasets for this fold
        train_idx = sec_data[(sec_data.index <= train_end_dt)]
        test_idx = sec_data[(sec_data.index >= current_date) &
                             (sec_data.index <=  test_end_dt)]

        # Get the record counts for the train and test datasets
        train_records_ava = len(train_idx)
        test_records_ava = len(test_idx)

        cv_data.append([train_end_dt, test_start_dt, test_end_dt, 
                        train_records_ava, test_records_ava, train_idx, test_idx])

        # Move the current date forward n days
        current_date = test_end_dt + timedelta(days=1)

    cv_folds = pd.DataFrame(cv_data, columns =['train_end_dt', 'test_start_dt', 'test_end_dt', 
                                               'train_records_ava', 'test_records_ava', 'train_idx', 'test_idx'])

    # Get cv folds that at least have 1000 train records and >= 1 record to predict
    cv_folds = cv_folds[(cv_folds.train_records_ava >= 1000) & 
                        (cv_folds.test_records_ava >= 1)].tail(30).reset_index(drop=True)
    return cv_folds

In [47]:
cv_folds = make_cv_folds()
print('Total CV Folds:', len(cv_folds))
print('Test Start Date:', cv_folds.test_start_dt.min())
print('Test End Date:', cv_folds.test_end_dt.max())
cv_folds

Total CV Folds: 30
Test Start Date: 2021-05-13 00:00:00
Test End Date: 2021-06-28 00:00:00


Unnamed: 0,train_end_dt,test_start_dt,test_end_dt,train_records_ava,test_records_ava,train_idx,test_idx
0,2021-04-24,2021-05-13,2021-05-13,1203,312,CompanyCIK CompanyName ...,CompanyCIK ...
1,2021-04-25,2021-05-14,2021-05-14,1203,291,CompanyCIK CompanyName ...,CompanyCIK Com...
2,2021-04-28,2021-05-17,2021-05-17,1415,490,CompanyCIK CompanyName ...,CompanyCIK ...
3,2021-04-29,2021-05-18,2021-05-18,1640,46,CompanyCIK CompanyNa...,CompanyCIK ...
4,2021-04-30,2021-05-19,2021-05-19,1835,28,CompanyCIK ...,CompanyCIK ...
5,2021-05-01,2021-05-20,2021-05-20,1835,29,CompanyCIK ...,CompanyCIK ...
6,2021-05-02,2021-05-21,2021-05-21,1835,43,CompanyCIK ...,CompanyCIK ...
7,2021-05-05,2021-05-24,2021-05-24,2493,227,CompanyCIK C...,CompanyCIK ...
8,2021-05-06,2021-05-25,2021-05-25,3088,50,CompanyCIK CompanyNam...,CompanyCIK ...
9,2021-05-07,2021-05-26,2021-05-26,3611,24,CompanyCIK CompanyNam...,CompanyCIK ...


# Modify code below to build a better model! 

### Build a Pipeline

In [68]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer

# Ridge regressor; it does no normalization of its own (scaling is handled by the StandardScaler step below)
reg = Ridge(alpha=10, max_iter=1000, tol=0.0001, solver='sparse_cg')

# Pipeline to transform X from text to matrix
trans_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer(binary=True, ngram_range=(2, 3),
                                                         smooth_idf=False,max_df=1.0,min_df= 1,
                                                         norm='l2',stop_words= None,strip_accents=None,
                                                         use_idf=True,sublinear_tf=True,
                                                         token_pattern='\\b[^\\d\\W]{4,}\\b')), 
                                            ('vt',VarianceThreshold()),  
                                            ('scaler', StandardScaler(with_mean=False))
                                            ])

# Tools to transform y to a normal distribution using quantiles 
quant_trans = QuantileTransformer(n_quantiles=1000,
                                      output_distribution='normal',
                                      random_state=0)

ttr = TransformedTargetRegressor(regressor=reg,
                                         transformer=quant_trans)


### Simulation Example 1
* This is the most basic example using all of the available training data.
* So each successive fold will take longer and longer to train. 
* Note this took 6 hours to run on a single thread using one of the fastest regressors in sklearn. 
* In this example I transform y as well using a TransformedTargetRegressor.
* This uses quantiles to project y onto a normal distribution before training the model.
* The TransformedTargetRegressor also does an inverse_transform on predict(), so the values that the model predicts come back in the units we expect prior to transformation.
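
The round trip through the target transformer can be seen on a small synthetic example. In this sketch `ttr_demo`, the single feature, and the heavy-tailed target are all made up for illustration; only the combination of `TransformedTargetRegressor` and `QuantileTransformer` mirrors the setup above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)

# Synthetic data: one feature and a heavy-tailed target standing in for Pct_Change_20
X = rng.normal(size=(200, 1))
y = np.exp(X[:, 0]) * 10 + rng.normal(size=200)

ttr_demo = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    transformer=QuantileTransformer(n_quantiles=100,
                                    output_distribution='normal',
                                    random_state=0))
ttr_demo.fit(X, y)

# predict() applies inverse_transform, so predictions are back in the units of y
preds = ttr_demo.predict(X)

# The fitted inner regressor, by contrast, predicts on the transformed (normal) scale
inner_preds = ttr_demo.regressor_.predict(X)

print('Original units:   ', preds.min(), preds.max())
print('Transformed scale:', inner_preds.min(), inner_preds.max())
```

One consequence worth knowing: with a quantile transformer, the inverse transform cannot return values outside the range of y seen during training.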

In [81]:
%%time

pred_data = []

# Fit and predict on each train and test fold
for train, test in zip(cv_folds.train_idx, cv_folds.test_idx):
    
    # Get X_train and y_train for this fold
    X_train = train.f_text.values
    y_train = train.Pct_Change_20.values
    
    # Transform X_train using the pipeline
    X_train_trans = trans_pipe.fit_transform(X_train,y_train)
    # Transform y and fit model using ttr
    ttr.fit(X_train_trans, y_train)
    
    # Get X_test and y_test for this fold
    X_test = test.f_text.values
    y_test = test.Pct_Change_20.values
    
    # Transform X_test for this fold
    X_test_trans = trans_pipe.transform(X_test)
    preds = ttr.predict(X_test_trans)
    
    # Select all but last column f_text of data
    test = test.iloc[:,:-1].copy()
    test['y_pred'] = preds
    pred_data.append(test)

pred_data = pd.concat(pred_data, axis=0, ignore_index=True)
pred_data

Wall time: 6h 13min 8s


Unnamed: 0,CompanyCIK,CompanyName,FileType,FileDate,EdgarTextUrl,EdgarHtmlUrl,AccessionNumber,SecFileName,CompanyTicker,FileDate_ClosingPrice,FileDate_Plus_20,FileDate_Plus_20_Price,Pct_Change_20,FileName,file_text_length,y_pred
0,1801777,Applied Molecular Transport Inc.,10-Q,2021-05-13,edgar/data/1801777/0001564590-21-027568.txt,edgar/data/1801777/0001564590-21-027568-index....,0001564590-21-027568,2021-QTR2,AMTI,41.090000,2021-06-02,45.570000,10.902895,1801777_0001564590-21-027568.txt,359191,-0.812420
1,1533040,Phio Pharmaceuticals Corp.,10-Q,2021-05-13,edgar/data/1533040/0001683168-21-001971.txt,edgar/data/1533040/0001683168-21-001971-index....,0001683168-21-001971,2021-QTR2,PHIO,1.870000,2021-06-02,2.150000,14.973267,1533040_0001683168-21-001971.txt,83562,-0.867653
2,1808805,ARYA Sciences Acquisition Corp III,10-Q,2021-05-13,edgar/data/1808805/0001140361-21-017275.txt,edgar/data/1808805/0001140361-21-017275-index....,0001140361-21-017275,2021-QTR2,ARYA,9.910000,2021-06-02,10.270000,3.632700,1808805_0001140361-21-017275.txt,110343,-2.899093
3,1533743,"Processa Pharmaceuticals, Inc.",10-Q,2021-05-13,edgar/data/1533743/0001493152-21-011311.txt,edgar/data/1533743/0001493152-21-011311-index....,0001493152-21-011311,2021-QTR2,PCSA,6.160000,2021-06-02,6.330000,2.759742,1533743_0001493152-21-011311.txt,94345,-0.928380
4,1808865,"iTeos Therapeutics, Inc.",10-Q,2021-05-13,edgar/data/1808865/0001564590-21-027588.txt,edgar/data/1808865/0001564590-21-027588-index....,0001564590-21-027588,2021-QTR2,ITOS,22.360001,2021-06-02,19.459999,-12.969595,1808865_0001564590-21-027588.txt,504089,-2.141107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2060,1668010,"Digital Brands Group, Inc.",10-Q,2021-06-28,edgar/data/1668010/0001104659-21-086107.txt,edgar/data/1668010/0001104659-21-086107-index....,0001104659-21-086107,2021-QTR2,DBGI,5.490000,2021-07-16,5.065000,-7.741343,1668010_0001104659-21-086107.txt,77002,0.816602
2061,1170010,CARMAX INC,10-Q,2021-06-28,edgar/data/1170010/0001170010-21-000116.txt,edgar/data/1170010/0001170010-21-000116-index....,0001170010-21-000116,2021-QTR2,KMX,129.009995,2021-07-16,132.009995,2.325401,1170010_0001170010-21-000116.txt,103899,10.203390
2062,82473,INNSUITES HOSPITALITY TRUST,10-Q,2021-06-28,edgar/data/82473/0001493152-21-015419.txt,edgar/data/82473/0001493152-21-015419-index.html,0001493152-21-015419,2021-QTR2,IHT,7.530000,2021-07-16,5.570000,-26.029216,82473_0001493152-21-015419.txt,149080,2.982800
2063,1819510,Atlantic Avenue Acquisition Corp,10-Q,2021-06-28,edgar/data/1819510/0001140361-21-022516.txt,edgar/data/1819510/0001140361-21-022516-index....,0001140361-21-022516,2021-QTR2,ASAQ,9.730000,2021-07-16,9.730000,0.000000,1819510_0001140361-21-022516.txt,91305,0.320832


In [84]:
datapath = 'C:\\Users\\jakem\\Documents\\Belk\\Sec\\ExtraCredit\\Data\\pred_data_e1.gz' 
pred_data.to_pickle(datapath)

from sklearn.metrics import mean_absolute_error
mean_absolute_error(pred_data.Pct_Change_20, pred_data.y_pred)

25.703681964867013

### Simulation Example 2
* This model uses everything in Simulation 1.
* However, it only uses the 1000 most recent filings for each training fold.
* This dramatically speeds up fit() times and (in theory) forces the model to focus on only very recent market events.
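
Taking the most recent rows with `tail()` only works when the frame is in date order. The sketch below shows the idea on a tiny hypothetical frame that is deliberately out of order; sorting by the index first makes the selection safe either way.

```python
import pandas as pd

# Hypothetical training frame whose rows are NOT in date order
train = pd.DataFrame(
    {'Pct_Change_20': [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(['2021-03-01', '2021-01-15', '2021-04-10', '2021-02-20']))

# Sort by the date index first, then keep the n most recent rows
n_recent = 2
recent = train.sort_index().tail(n_recent)
print(recent)
```

If you reorder or filter a fold's training data yourself, a quick check of `train.index.is_monotonic_increasing` before calling `tail()` avoids silently training on the wrong rows.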

In [89]:
%%time

pred_data = []

# Fit and predict on each train and test fold
for train, test in zip(cv_folds.train_idx, cv_folds.test_idx):
    
    # Take only the 1000 most recent filings so the model is focused on current filing events 
    # This only works since the file is indexed and sorted by FileDate!
    train = train.tail(1000)
    
    # Get X_train and y_train for this fold
    X_train = train.f_text.values
    y_train = train.Pct_Change_20.values
    
    # Transform X_train using the pipeline
    X_train_trans = trans_pipe.fit_transform(X_train,y_train)
    # Transform y and fit model using ttr
    ttr.fit(X_train_trans, y_train)
    
    # Get X_test and y_test for this fold
    X_test = test.f_text.values
    y_test = test.Pct_Change_20.values
    
    # Transform X_test for this fold
    X_test_trans = trans_pipe.transform(X_test)
    preds = ttr.predict(X_test_trans)
    
    # Select all but last column f_text of data
    test = test.iloc[:,:-1].copy()
    test['y_pred'] = preds
    pred_data.append(test)
    

# Create the prediction file 
pred_data_2 = pd.concat(pred_data, axis=0, ignore_index=True)

# Save the prediction file to disk
datapath = 'C:\\Users\\jakem\\Documents\\Belk\\Sec\\ExtraCredit\\Data\\pred_data_e2.gz' 
pred_data_2.to_pickle(datapath)

# Calculate the final MAE for our predictions 
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(pred_data_2.Pct_Change_20, pred_data_2.y_pred)

# Display results
print('Example 2 MAE:', mae)
pred_data_2

Example 2 MAE: 25.651764417804994
Wall time: 51min 9s


Unnamed: 0,CompanyCIK,CompanyName,FileType,FileDate,EdgarTextUrl,EdgarHtmlUrl,AccessionNumber,SecFileName,CompanyTicker,FileDate_ClosingPrice,FileDate_Plus_20,FileDate_Plus_20_Price,Pct_Change_20,FileName,file_text_length,y_pred
0,1801777,Applied Molecular Transport Inc.,10-Q,2021-05-13,edgar/data/1801777/0001564590-21-027568.txt,edgar/data/1801777/0001564590-21-027568-index....,0001564590-21-027568,2021-QTR2,AMTI,41.090000,2021-06-02,45.570000,10.902895,1801777_0001564590-21-027568.txt,359191,-2.668639
1,1533040,Phio Pharmaceuticals Corp.,10-Q,2021-05-13,edgar/data/1533040/0001683168-21-001971.txt,edgar/data/1533040/0001683168-21-001971-index....,0001683168-21-001971,2021-QTR2,PHIO,1.870000,2021-06-02,2.150000,14.973267,1533040_0001683168-21-001971.txt,83562,-2.391299
2,1808805,ARYA Sciences Acquisition Corp III,10-Q,2021-05-13,edgar/data/1808805/0001140361-21-017275.txt,edgar/data/1808805/0001140361-21-017275-index....,0001140361-21-017275,2021-QTR2,ARYA,9.910000,2021-06-02,10.270000,3.632700,1808805_0001140361-21-017275.txt,110343,-3.494797
3,1533743,"Processa Pharmaceuticals, Inc.",10-Q,2021-05-13,edgar/data/1533743/0001493152-21-011311.txt,edgar/data/1533743/0001493152-21-011311-index....,0001493152-21-011311,2021-QTR2,PCSA,6.160000,2021-06-02,6.330000,2.759742,1533743_0001493152-21-011311.txt,94345,-2.142371
4,1808865,"iTeos Therapeutics, Inc.",10-Q,2021-05-13,edgar/data/1808865/0001564590-21-027588.txt,edgar/data/1808865/0001564590-21-027588-index....,0001564590-21-027588,2021-QTR2,ITOS,22.360001,2021-06-02,19.459999,-12.969595,1808865_0001564590-21-027588.txt,504089,-3.208599
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2060,1668010,"Digital Brands Group, Inc.",10-Q,2021-06-28,edgar/data/1668010/0001104659-21-086107.txt,edgar/data/1668010/0001104659-21-086107-index....,0001104659-21-086107,2021-QTR2,DBGI,5.490000,2021-07-16,5.065000,-7.741343,1668010_0001104659-21-086107.txt,77002,0.725640
2061,1170010,CARMAX INC,10-Q,2021-06-28,edgar/data/1170010/0001170010-21-000116.txt,edgar/data/1170010/0001170010-21-000116-index....,0001170010-21-000116,2021-QTR2,KMX,129.009995,2021-07-16,132.009995,2.325401,1170010_0001170010-21-000116.txt,103899,0.398278
2062,82473,INNSUITES HOSPITALITY TRUST,10-Q,2021-06-28,edgar/data/82473/0001493152-21-015419.txt,edgar/data/82473/0001493152-21-015419-index.html,0001493152-21-015419,2021-QTR2,IHT,7.530000,2021-07-16,5.570000,-26.029216,82473_0001493152-21-015419.txt,149080,0.429809
2063,1819510,Atlantic Avenue Acquisition Corp,10-Q,2021-06-28,edgar/data/1819510/0001140361-21-022516.txt,edgar/data/1819510/0001140361-21-022516-index....,0001140361-21-022516,2021-QTR2,ASAQ,9.730000,2021-07-16,9.730000,0.000000,1819510_0001140361-21-022516.txt,91305,0.203965


### Simulation Example 3
* This example uses everything in Simulation 2.
* It also adds the impute_y() function.
* This function discretizes y into 20 equal-sized buckets (the code calls them deciles, although each bucket holds about 5% of the records) and then:
 * Sets all values in the lowest bucket to that bucket's max value
 * Sets all values in the highest bucket to that bucket's min value
* The intention is to cap outliers (extreme gains or losses) in the model's training data.
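
A similar capping effect can be sketched with percentiles and `np.clip`. This is an alternative illustration, not the impute_y() function itself: the helper `cap_y` is hypothetical, and `np.percentile` interpolates between observations, so the caps will differ slightly from the bucket edges that `pd.qcut` produces.

```python
import numpy as np

def cap_y(y, lower_pct=5, upper_pct=95):
    """Clip y to the [lower_pct, upper_pct] percentile range (winsorizing)."""
    y = np.asarray(y, dtype=float)
    low, high = np.percentile(y, [lower_pct, upper_pct])
    return np.clip(y, low, high)

# Hypothetical returns with one extreme loss and one extreme gain
y = np.array([-90.0, -5.0, -2.0, 0.0, 1.0, 3.0, 4.0, 6.0, 8.0, 1500.0])
capped = cap_y(y)
print(capped.min(), capped.max())
```

Whichever approach you use, cap only the training targets; the test targets must stay untouched so the reported MAE is honest.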

In [90]:
def impute_y(y):
    df = pd.DataFrame()
    df['y'] = y

    # Discretize y into equal-sized buckets based on rank or based on sample quantiles.
    buckets = 20
    qcut = pd.qcut(df['y'], buckets, duplicates='drop').value_counts().sort_index()

    # Create a new column with the decile number for each record
    df['decile'] = pd.qcut(df['y'], buckets, labels=range(1,len(qcut)+1), duplicates='drop')

    # Get the min and max decile nums
    decile_num_min = df['decile'].min()
    decile_num_max = df['decile'].max()

    # Get the min and max values for the top and bottom decile 
    min_decile_val = df['y'][df['decile'] == decile_num_min].max()
    max_decile_val = df['y'][df['decile'] == decile_num_max].min()

    # Cap all values in the min and max deciles.
    df.loc[df['decile'] == decile_num_min, 'y'] = min_decile_val
    df.loc[df['decile'] == decile_num_max, 'y'] = max_decile_val

    return df['y'].values 

In [91]:
%%time

pred_data = []

# Fit and predict on each train and test fold
for train, test in zip(cv_folds.train_idx, cv_folds.test_idx):
    
    # Take only the 1000 most recent filings so the model is focused on current filing events 
    # This only works since the file is indexed and sorted by FileDate!
    train = train.tail(1000)
    
    # Get X_train and y_train for this fold
    X_train = train.f_text.values
    # Impute y to cap extreme low and high responses in training data.
    y_train = impute_y(train.Pct_Change_20.values)
    
    # Transform X_train using the pipeline
    X_train_trans = trans_pipe.fit_transform(X_train,y_train)
    # Transform y and fit model using ttr
    ttr.fit(X_train_trans, y_train)
    
    # Get X_test and y_test for this fold
    X_test = test.f_text.values
    y_test = test.Pct_Change_20.values
    
    # Transform X_test for this fold
    X_test_trans = trans_pipe.transform(X_test)
    preds = ttr.predict(X_test_trans)
    
    # Select all but last column f_text of data
    test = test.iloc[:,:-1].copy()
    test['y_pred'] = preds
    pred_data.append(test)
    

# Create the prediction file 
pred_data_3 = pd.concat(pred_data, axis=0, ignore_index=True)

# Save the prediction file to disk
datapath = 'C:\\Users\\jakem\\Documents\\Belk\\Sec\\ExtraCredit\\Data\\pred_data_e3.gz' 
pred_data_3.to_pickle(datapath)

# Calculate the final MAE for our predictions 
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(pred_data_3.Pct_Change_20, pred_data_3.y_pred)

# Display results
print('Example 3 MAE:', mae)
pred_data_3

Example 3 MAE: 25.669028262380955
Wall time: 50min 54s


Unnamed: 0,CompanyCIK,CompanyName,FileType,FileDate,EdgarTextUrl,EdgarHtmlUrl,AccessionNumber,SecFileName,CompanyTicker,FileDate_ClosingPrice,FileDate_Plus_20,FileDate_Plus_20_Price,Pct_Change_20,FileName,file_text_length,y_pred
0,1801777,Applied Molecular Transport Inc.,10-Q,2021-05-13,edgar/data/1801777/0001564590-21-027568.txt,edgar/data/1801777/0001564590-21-027568-index....,0001564590-21-027568,2021-QTR2,AMTI,41.090000,2021-06-02,45.570000,10.902895,1801777_0001564590-21-027568.txt,359191,-2.910703
1,1533040,Phio Pharmaceuticals Corp.,10-Q,2021-05-13,edgar/data/1533040/0001683168-21-001971.txt,edgar/data/1533040/0001683168-21-001971-index....,0001683168-21-001971,2021-QTR2,PHIO,1.870000,2021-06-02,2.150000,14.973267,1533040_0001683168-21-001971.txt,83562,-2.347162
2,1808805,ARYA Sciences Acquisition Corp III,10-Q,2021-05-13,edgar/data/1808805/0001140361-21-017275.txt,edgar/data/1808805/0001140361-21-017275-index....,0001140361-21-017275,2021-QTR2,ARYA,9.910000,2021-06-02,10.270000,3.632700,1808805_0001140361-21-017275.txt,110343,-3.008968
3,1533743,"Processa Pharmaceuticals, Inc.",10-Q,2021-05-13,edgar/data/1533743/0001493152-21-011311.txt,edgar/data/1533743/0001493152-21-011311-index....,0001493152-21-011311,2021-QTR2,PCSA,6.160000,2021-06-02,6.330000,2.759742,1533743_0001493152-21-011311.txt,94345,-2.096923
4,1808865,"iTeos Therapeutics, Inc.",10-Q,2021-05-13,edgar/data/1808865/0001564590-21-027588.txt,edgar/data/1808865/0001564590-21-027588-index....,0001564590-21-027588,2021-QTR2,ITOS,22.360001,2021-06-02,19.459999,-12.969595,1808865_0001564590-21-027588.txt,504089,-3.449068
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2060,1668010,"Digital Brands Group, Inc.",10-Q,2021-06-28,edgar/data/1668010/0001104659-21-086107.txt,edgar/data/1668010/0001104659-21-086107-index....,0001104659-21-086107,2021-QTR2,DBGI,5.490000,2021-07-16,5.065000,-7.741343,1668010_0001104659-21-086107.txt,77002,0.647907
2061,1170010,CARMAX INC,10-Q,2021-06-28,edgar/data/1170010/0001170010-21-000116.txt,edgar/data/1170010/0001170010-21-000116-index....,0001170010-21-000116,2021-QTR2,KMX,129.009995,2021-07-16,132.009995,2.325401,1170010_0001170010-21-000116.txt,103899,0.310888
2062,82473,INNSUITES HOSPITALITY TRUST,10-Q,2021-06-28,edgar/data/82473/0001493152-21-015419.txt,edgar/data/82473/0001493152-21-015419-index.html,0001493152-21-015419,2021-QTR2,IHT,7.530000,2021-07-16,5.570000,-26.029216,82473_0001493152-21-015419.txt,149080,0.310888
2063,1819510,Atlantic Avenue Acquisition Corp,10-Q,2021-06-28,edgar/data/1819510/0001140361-21-022516.txt,edgar/data/1819510/0001140361-21-022516-index....,0001140361-21-022516,2021-QTR2,ASAQ,9.730000,2021-07-16,9.730000,0.000000,1819510_0001140361-21-022516.txt,91305,0.201495


### Interesting Observations

**Is 25% Mean Absolute Error really high?**
* It appears that a few penny stocks are dramatically inflating the MAE.
* The current model is getting the direction correct only about 42% of the time during the simulation (874 of 2,065 predictions; the 0.577 share in the output below is the `False` class).
* The model is currently more accurate at predicting decreases.  
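
The effect of a single extreme record on MAE is easy to see on toy numbers. In this sketch the ten actual changes are made up, and the "model" simply predicts no change for every record; the median absolute error is included only as a contrast, since it is not the metric used for grading.

```python
import numpy as np

# Hypothetical actual changes: nine ordinary moves plus one extreme penny-stock gain
y_true = np.array([5.0, -3.0, 2.0, -8.0, 1.0, 4.0, -6.0, 7.0, -2.0, 5000.0])
y_pred = np.zeros_like(y_true)  # a naive model that always predicts no change

abs_err = np.abs(y_true - y_pred)
mae_all = abs_err.mean()
mae_without_outlier = abs_err[y_true < 1000].mean()
median_abs_err = np.median(abs_err)

print('MAE, all records:      ', mae_all)
print('MAE, outlier excluded: ', mae_without_outlier)
print('Median absolute error: ', median_abs_err)
```

One record moves the mean by two orders of magnitude here, while the median barely notices it.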

In [150]:
# ************* Change this to the predictions for your best simulation! *************
all_fold_preds = pred_data_3

In [151]:
# Look at the outliers
all_fold_preds[all_fold_preds.Pct_Change_20 > 1000]

Unnamed: 0,CompanyCIK,CompanyName,FileType,FileDate,EdgarTextUrl,EdgarHtmlUrl,AccessionNumber,SecFileName,CompanyTicker,FileDate_ClosingPrice,...,FileName,file_text_length,y_pred,Predicted_Direction,Actual_Direction,Correct_Direction,Correct_Increase,Correct_Decrease,Share_Unit_Value,Share_Unit_Value_Raw
639,1399306,Simlatus Corp,10-Q,2021-05-17,edgar/data/1399306/0001399306-21-000020.txt,edgar/data/1399306/0001399306-21-000020-index....,0001399306-21-000020,2021-QTR2,SIML,0.0008,...,1399306_0001399306-21-000020.txt,125625,-3.173482,Decrease,Increase,False,False,False,-0.111694,0.111694
718,1662574,"Grom Social Enterprises, Inc.",10-Q,2021-05-17,edgar/data/1662574/0001683168-21-002065.txt,edgar/data/1662574/0001683168-21-002065-index....,0001683168-21-002065,2021-QTR2,GRMM,0.24,...,1662574_0001683168-21-002065.txt,121351,-10.630937,Decrease,Increase,False,False,False,-7.02,7.02
955,1101026,"Zivo Bioscience, Inc.",10-Q,2021-05-17,edgar/data/1101026/0001078782-21-000455.txt,edgar/data/1101026/0001078782-21-000455-index....,0001078782-21-000455,2021-QTR2,ZIVO,0.13,...,1101026_0001078782-21-000455.txt,116920,-2.57432,Decrease,Increase,False,False,False,-4.16,4.16


In [152]:
# Look at the adjusted MAE when the outliers are removed
p = all_fold_preds[all_fold_preds.Pct_Change_20 < 1000]
mae = mean_absolute_error(p.Pct_Change_20, p.y_pred)
print('MAE Excluding Outliers:', mae)

MAE Excluding Outliers: 15.957038846127698


In [125]:
all_fold_preds['Predicted_Direction'] = np.where(all_fold_preds['y_pred'] < 0, 'Decrease', 'Increase')
all_fold_preds['Actual_Direction'] = np.where(all_fold_preds['Pct_Change_20'] < 0, 'Decrease', 'Increase') 
all_fold_preds['Correct_Direction'] = all_fold_preds['Actual_Direction'] == all_fold_preds['Predicted_Direction']
# A correct increase/decrease is a correct call on a record that actually moved that way
all_fold_preds['Correct_Increase'] = ((all_fold_preds['Actual_Direction'] == 'Increase')
                                      & all_fold_preds['Correct_Direction'])
all_fold_preds['Correct_Decrease'] = ((all_fold_preds['Actual_Direction'] == 'Decrease')
                                      & all_fold_preds['Correct_Direction'])

print('Total Predictions:', len(all_fold_preds.index) ,'\n')
print('Correct_Direction')
all_fold_preds['Correct_Direction'].value_counts(normalize=True)

Total Predictions: 2065 

Correct_Direction


False    0.576755
True     0.423245
Name: Correct_Direction, dtype: float64

In [126]:
ti = len(all_fold_preds['Actual_Direction'][all_fold_preds.Actual_Direction == 'Increase']) 
print('Total_Increases: ', ti)
ci = all_fold_preds['Correct_Increase'].sum()
print('Correct_Increases: ', ci)
print('Percent_Correct:', (ci/ti) * 100)

Total_Increases:  1382
Correct_Increases:  485
Percent_Correct: 35.09406657018813


In [127]:
td = len(all_fold_preds['Actual_Direction'][all_fold_preds.Actual_Direction == 'Decrease']) 
print('Total_Decreases: ', td)
cd = all_fold_preds['Correct_Decrease'].sum()
print('Correct_Decreases: ', cd)
print('Percent_Correct:', (cd/td) * 100)

Total_Decreases:  683
Correct_Decreases:  389
Percent_Correct: 56.95461200585652
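
The per-direction percentages above can also be read from a single cross-tabulation. The sketch below uses five made-up records with the same two column names; `normalize='index'` turns each row into fractions of the actual direction.

```python
import pandas as pd

# Hypothetical actual and predicted directions (not real results)
df = pd.DataFrame({
    'Actual_Direction':    ['Increase', 'Increase', 'Increase', 'Decrease', 'Decrease'],
    'Predicted_Direction': ['Increase', 'Decrease', 'Decrease', 'Decrease', 'Increase'],
})

# Each row sums to 1: the share of actual increases/decreases predicted each way
table = pd.crosstab(df['Actual_Direction'], df['Predicted_Direction'], normalize='index')
print(table)
```

The diagonal entries (Increase/Increase and Decrease/Decrease) are the fractions of each actual direction that were called correctly.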
