# General

1. How to run large amounts of code in the cloud
2. How to access the data from the CSV file

# About Lobster Data

LOBSTER is an online limit order book data tool that gives access to the entire universe of NASDAQ-traded stocks.

The LOBSTER database is based on NASDAQ's historical TotalView-ITCH files. These data provide detailed information about every trade and quote for all NASDAQ-listed stocks. In the interface it is possible to customize the level of detail of the data, e.g. the number of order book levels.

Submissions, cancellations and executions (visible and hidden) are identified separately.

For each limit order event the following information is provided: timestamp (up to nanosecond precision), order ID, price, size and buy/sell indicator.
The database contains data from 27 June 2007 up to the day before yesterday.

### Output Structure 

LOBSTER generates two files, a 'message' file and an 'orderbook' file, for each active trading day:
1. The 'orderbook' file contains the evolution of the limit order book up to the desired number of levels.
2. The 'message' file contains indicators for the type of event causing an update of the limit order book in the requested price range.

All events are timestamped in seconds after midnight, with decimal precision up to nanoseconds.
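
As a minimal sketch of what this timestamp convention means in practice, the snippet below converts seconds after midnight into full timestamps for a given trading day. The helper name `seconds_to_timestamp` and the sample values are illustrative assumptions and are not part of LOBSTER itself.

```python
import pandas as pd


def seconds_to_timestamp(seconds, trading_day):
    """Convert seconds after midnight into timestamps on `trading_day`.

    `seconds` is a pandas Series of floats and `trading_day` is a
    'YYYY-MM-DD' string. Hypothetical helper, for illustration only.
    """
    midnight = pd.Timestamp(trading_day)
    return midnight + pd.to_timedelta(seconds, unit="s")


# Illustrative values: 34200 s is 09:30:00 and 57600 s is 16:00:00.
example = seconds_to_timestamp(pd.Series([34200.0, 34200.5, 57600.0]), "2023-03-21")
print(example)
```

Because the conversion is applied to the whole column at once, it avoids a Python-level loop over millions of rows.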

### Message File

This file contains the following columns:
1. Time (ts): seconds after midnight, with up to nanosecond precision
2. Event type:
   - 1: Submission of a new limit order
   - 2: Cancellation (partial deletion of a limit order)
   - 3: Deletion (total deletion of a limit order)
   - 4: Execution of a visible limit order
   - 5: Execution of a hidden limit order
   - 6: Cross trade, e.g. auction trade
   - 7: Trading halt indicator
3. Order ID: Unique order reference number
4. Size: Number of shares
5. Price: Dollar price times 10000
6. Direction:
   - -1: Sell limit order
   - 1: Buy limit order

   Note: execution of a sell (buy) limit order corresponds to a buyer-initiated (seller-initiated) trade, i.e. a buy (sell) trade.
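
The column layout above can be sketched as a small reader. This is only an illustration under assumptions: the column names, the `EVENT_TYPES` labels and the three sample rows are made up for the example, and the rows are synthetic rather than real LOBSTER output.

```python
import io

import pandas as pd

# Hypothetical column names matching the six fields listed above.
MESSAGE_COLUMNS = ["ts", "type", "order_id", "size", "price", "direction"]

# Hypothetical short labels for the seven event types.
EVENT_TYPES = {
    1: "submission",
    2: "cancellation",
    3: "deletion",
    4: "execution_visible",
    5: "execution_hidden",
    6: "cross_trade",
    7: "trading_halt",
}


def read_messages(source):
    """Read a header-less message file and decode prices and event types."""
    messages = pd.read_csv(source, header=None, names=MESSAGE_COLUMNS)
    messages["price"] = messages["price"] / 10000  # stored as dollars * 10000
    messages["event"] = messages["type"].map(EVENT_TYPES)
    return messages


# Synthetic rows, for illustration only.
sample = io.StringIO(
    "34200.1,1,1001,100,1500000,1\n"
    "34200.2,4,1001,100,1500000,1\n"
    "34200.3,3,1002,50,1501000,-1\n"
)
messages = read_messages(sample)
print(messages)
```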


### Order Book File

This file contains ask and bid prices and ask and bid sizes

We have the following variables:
1. Ask Price 1: Level 1 ask price (best ask price)
2. Ask Size 1: Level 1 ask volume (best ask volume)
3. Bid Price 1: Level 1 bid price (best bid price)
4. Bid Size 1: Level 1 bid volume (best bid volume)
5. Ask Price 2: Level 2 ask price (second best ask price)
6. Ask Size 2: Level 2 ask volume (second best ask volume) 
7. ...
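
The repeating pattern of four columns per level can be generated programmatically. The sketch below uses the same naming scheme as the loading code later in this notebook (`ask_price_1`, `ask_size_1`, ...); the function name `orderbook_columns` is a hypothetical helper.

```python
def orderbook_columns(levels):
    """Return the order book column names for the given number of levels."""
    names = []
    for level in range(1, levels + 1):
        names += [
            f"ask_price_{level}",
            f"ask_size_{level}",
            f"bid_price_{level}",
            f"bid_size_{level}",
        ]
    return names


print(orderbook_columns(2))
```

A request with 15 levels therefore yields 60 columns.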

The output of the message file

### General - how does it work?

The limit order book data is constructed from NASDAQ's historical TotalView-ITCH files.

Instead of streaming the state of the entire limit order book after each update, the LOBSTER data only records an entry when a limit order event changes the order book. Each limit order submission, cancellation and execution results in an individual event message being streamed to the market.

"Compare, for example, a level 1 and a level 25 request. In case of a level 1 request only 'trades and quotes', i.e. changes to the best bid and ask prices and their respective volumes are recorded. The limit order book saved to the output file in case of a level 25 request is updated every time a price or volume changes in the range from the 25th best bid to the 25th best ask price."

### About order book modeling (https://towardsdatascience.com/application-of-gradient-boosting-in-order-book-modeling-3cd5f71575a7)

The order book is an electronic list of buy and sell orders for a specific security or financial instrument. The order book lists the amount being bid or offered at each price point.
N.B. Market depth can help traders determine the direction of future price movements. (Market depth indicates whether the market can absorb a large market order without significantly impacting the price.)
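
To make the link between depth and price impact concrete, here is a minimal sketch that walks the ask side of a book to fill a market buy order and returns the average price paid. The function name `market_buy_average_price` and the two-level book are illustrative assumptions; hidden liquidity is ignored.

```python
def market_buy_average_price(ask_levels, quantity):
    """Average price paid when a market buy order walks the ask side.

    `ask_levels` is a list of (price, size) tuples ordered from the best
    ask upward. Raises ValueError if the visible depth is insufficient.
    """
    remaining = quantity
    cost = 0.0
    for price, size in ask_levels:
        filled = min(remaining, size)
        cost += filled * price
        remaining -= filled
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("not enough visible depth to fill the order")
    return cost / quantity


# Illustrative book: 50 shares at 100.0 and 50 shares at 100.5.
book = [(100.0, 50), (100.5, 50)]
small_order = market_buy_average_price(book, 50)  # filled entirely at the best ask
large_order = market_buy_average_price(book, 75)  # also consumes part of level 2
print(small_order, large_order)
```

The larger order pays more on average because it exhausts the best level, which is the price impact that a deep book dampens.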


The mid-price is the midpoint between the best price of the seller and the best price of the buyer (in our data the mid-price is (ask_price_1 + bid_price_1) / 2).
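
As a small worked example, the mid-price and the relative spread in basis points can be computed from the raw level 1 prices as follows. The function names and the input prices are assumptions chosen for illustration; the division by 10000 reflects the price scaling of the raw files described above.

```python
def mid_price(best_ask, best_bid):
    """Midpoint between the best ask and the best bid."""
    return (best_ask + best_bid) / 2


def relative_spread_bps(best_ask, best_bid):
    """Bid-ask spread relative to the mid-price, in basis points."""
    return (best_ask - best_bid) / mid_price(best_ask, best_bid) * 10000


# Illustrative raw prices as stored in the files (dollars times 10000).
raw_ask, raw_bid = 1000200, 999800
ask, bid = raw_ask / 10000, raw_bid / 10000

mid = mid_price(ask, bid)
spread = relative_spread_bps(ask, bid)
print(mid, spread)
```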

# CODE

### GET THE DATA

In [None]:
# The following code is inspired by Stefan Voigt (https://www.voigtstefan.me/post/lobster-1/)
import pandas as pd
from datetime import datetime
import pytz
import numpy as np
import os

def process_orderbook_data(assets, date, level, directory):
    """
    Processes order book data for a given set of assets on one or more dates.

    This function reads message and order book data from CSV files for each 
    asset, processes the data by modifying timestamps, prices, and computing 
    additional metrics like midquote, spread, and volume. Finally, it saves the 
    processed data into Feather format files.

    Parameters:
    assets (list): A list of asset symbols (e.g., ['AAPL', 'TSLA']).
    date (list of str): The dates to process, each in 'YYYY-MM-DD' format.
    level (int): The level of the order book data to process.
    directory (str): The directory path where the data files are located and where 
                     the processed files will be saved.

    Returns:
    None
    """

    # Change the working directory to the specified directory:
    os.chdir(directory)

    # Iterate over each date and each asset:
    for dates in date:
        for asset in assets:
            # Construct file names for messages and order book data:
            messages_filename = f"{asset}_{dates}_34200000_57600000_message_{level}.csv"
            orderbook_filename = f"{asset}_{dates}_34200000_57600000_orderbook_{level}.csv"

            # Read messages data from CSV:
            messages_raw = pd.read_csv(
                messages_filename, 
                names=["ts", "type", "order_id", "m_size", "m_price", "direction", "null"],
                dtype={"ts": float, "type": int, "order_id": int, "m_size": float, "m_price": float, "direction": int, "null": object}
            )

            # Convert timestamp to datetime and adjust for time zone and date:
            messages_raw["ts"] = messages_raw["ts"].apply(
                lambda x: datetime.fromtimestamp(x, tz=pytz.timezone('GMT')).replace(tzinfo=pytz.timezone('GMT')).replace(year=int(dates[0:4]), month=int(dates[5:7]), day=int(dates[8:10]))
            )

            # Adjust message prices:
            messages_raw["m_price"] = messages_raw["m_price"] / 10000

            # Read order book data from CSV:
            orderbook_raw = pd.read_csv(
                orderbook_filename, 
                header=None,
                names=[f"{col}_{i+1}" for i in range(level) for col in ["ask_price", "ask_size", "bid_price", "bid_size"]],
                dtype=np.float64
            )

            # Adjust order book prices:
            orderbook_raw = orderbook_raw.apply(lambda x: x/10000 if "price" in x.name else x)

            # Merge messages and order book data:
            orderbook = pd.concat([messages_raw, orderbook_raw], axis=1)
            del orderbook["null"]  # delete the unnecessary 'null' column

            # Calculate midquote, spread, and volume:
            orderbook["midquote"] = (orderbook["ask_price_1"] + orderbook["bid_price_1"]) / 2
            # Relative bid-ask spread in basis points of the midquote:
            orderbook["spread"] = (orderbook["ask_price_1"] - orderbook["bid_price_1"]) / orderbook["midquote"] * 10000
            orderbook["volume"] = np.where((orderbook["type"] == 4) | (orderbook["type"] == 5), orderbook["m_size"], 0)
            #orderbook["hidden_volume"] = np.where(orderbook["type"] == 5, orderbook["m_size"], 0)

            # Add a column with the asset name:
            orderbook["asset"] = f"{asset}"

            # Save the processed data to a Feather file:
            orderbook.to_feather(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/Feather/{asset}_{dates}_orderbook_{level}.feather")
            print('Done with asset: ', asset)
        print('Done with date: ', dates)

# Usage example:
asset = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']
date = ["2023-03-21","2023-03-22","2023-03-23"]
level = 15
directory = r"/Users/jensknudsen/Desktop/LOBSTER_DATA/Data"
process_orderbook_data(asset, date, level, directory)


### CHECK MAX DEPTH ALLOWED

In [None]:
#Code to detect the lowest value in the orderbook
import importlib
import math
import os
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import config
importlib.reload(config)

def concatenate_orderbooks(asset_list, date, level, directory):
    orderbook_df_list = []
    for asset in asset_list:
        file_path = os.path.join(directory, f'{asset}_{date}_orderbook_{level}.feather')
        orderbook_df_list.append(pd.read_feather(file_path))
    orderbook = pd.concat(orderbook_df_list)
    return orderbook


# df_returns = pd.DataFrame()

asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']

# ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'APO', 'ARM', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BRK.B', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CB', 'CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'E', 'ECL', 'EL', 'ELV', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FI', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITUB', 'ITW', 'JNJ', 'JPM', 'KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCK', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'MUFG', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PBR', 'PBR.A', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RTX', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPGI', 'STLA', 'SYK', 'T', 'TD', 'TDG', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']

date = ["2023-03-21","2023-03-22","2023-03-23"]
level = 15
prediction_ahead = 4
time_interval = '5S'  # 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 500L = 0.5 s; 100L = 0.1 s; 10L = 0.01 s; 1L = 0.001 s; 100U = 0.0001 s
depth = 5 # also 10, 30, 50
window = 5 # also 3, 5, 10
directory = r"/Users/jensknudsen/Desktop/LOBSTER_DATA/Feather"


import pandas as pd
minvalue_list =[]

def find_column_with_value(df, value):
    """
    Find the order book level of the first float64 column that contains a specific value.

    Parameters:
    df (pandas.DataFrame): The DataFrame to search.
    value (float): The float64 value to search for.

    Returns:
    int: The level number taken from the name of the first matching column
         (e.g. 3 for 'ask_price_3'), or 15 if no column matches.
    """
    try:
        columns_with_value = []

        for column in df.columns:
            if df[column].dtype == 'float64' and (df[column] == value).any():
                columns_with_value.append(column)
        return_value = columns_with_value[0]
        # Extract the digits from the column name and convert them to an integer
        return_value = int(''.join(filter(str.isdigit, return_value)))

        return return_value
    except (IndexError, ValueError):
        # No matching column, or the matching column name has no level number
        return 15

minvalue_list =[]

for e in date:
    for i in asset_list:
        df_test = concatenate_orderbooks([i], e, level, directory)
        # Example usage:
        a = find_column_with_value(df_test, 199999)
        b = find_column_with_value(df_test, -999999.9999)
        c = find_column_with_value(df_test, 999999.9999)

        # return the lowest value of the three
        min_value = min(a, b, c)
        minvalue_list.append(min_value)

minvalueoverall = min(minvalue_list)
print(f"The maximum depth allowed in this time horizon is: {minvalueoverall}")
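
The column-by-column search above can also be sketched in a vectorized form. This is an illustration under assumptions: it presumes price columns named `ask_price_<n>` and `bid_price_<n>` as created in the loading cell, it treats the values ±999999.9999 searched for above as placeholders for empty levels, and the function name `first_level_with_placeholder` and the tiny example frame are hypothetical.

```python
import re

import pandas as pd

# Assumption: placeholder prices for empty levels, as searched for above.
PLACEHOLDERS = [999999.9999, -999999.9999]


def first_level_with_placeholder(orderbook, max_level):
    """Return the shallowest level whose price column holds a placeholder.

    Falls back to `max_level` when no placeholder is found.
    """
    price_cols = [
        c for c in orderbook.columns if re.fullmatch(r"(ask|bid)_price_\d+", c)
    ]
    has_placeholder = orderbook[price_cols].isin(PLACEHOLDERS).any()
    levels = [int(c.rsplit("_", 1)[1]) for c in price_cols if has_placeholder[c]]
    return min(levels, default=max_level)


# Synthetic two-level book in which the second ask level is empty once.
toy_book = pd.DataFrame(
    {
        "ask_price_1": [100.0, 100.1],
        "bid_price_1": [99.9, 99.8],
        "ask_price_2": [100.2, 999999.9999],
        "bid_price_2": [99.7, 99.6],
    }
)
print(first_level_with_placeholder(toy_book, max_level=15))
```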

### THE MAIN CODE

In [None]:
asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CNI', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TM', 'TMO', 'TMUS', 'TRI', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UPS', 'USB', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'ZTS']
len(asset_list)

In [None]:
#CODE WITH THE RIGHT PICTURES
import importlib
import math
import os
import shutil
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import config
importlib.reload(config)

# BMO + CAT + CNQ + FDX + GS + TJX + TRV + UNP + V + WMT + XOM = done in 0.1 s
# ADSK + APH + CMG + ECL + ROP = 0.01 s

asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CNI', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TM', 'TMO', 'TMUS', 'TRI', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UPS', 'USB', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'ZTS']#['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 
# 'BX', 'C', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TM', 'TMO', 'TMUS', 'TRI', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UPS', 'USB', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'ZTS'] # ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 
# 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']  #['AAPL', 'ABNB', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AEP', 'ALGN', 'AMAT', 'AMD', 'AMGN', 'AMZN', 'ANSS', 'ASML', 'AVGO', 'AZN', 'BIIB', 'BKNG', 'BKR', 'CDNS', 'CEG', 'CHTR', 'CMCSA', 'COST', 'CPRT', 'CRWD', 'CSCO', 'CSGP', 'CSX', 'CTAS', 'CTSH', 'DDOG', 'DLTR', 'DXCM', 'EA', 'EBAY', 'ENPH', 'EXC', 'FANG', 'FAST', 'FTNT', 'GEHC', 'GFS', 'GILD', 'GOOGL', 'HON', 'IDXX', 'ILMN', 'INTC', 'INTU', 'ISRG', 'JD', 'KDP', 'KHC', 'KLAC', 'LCID', 'LRCX', 'LULU', 'MAR', 'MDLZ', 'MELI', 'META', 'MNST', 'MRNA', 'MRVL', 'MSFT', 'MU', 'NFLX', 'NVDA', 'NXPI', 'ODFL', 'ON', 'ORLY', 'PANW', 'PAYX', 'PCAR', 'PDD', 'PEP', 'PYPL', 'QCOM', 'REGN', 'ROST', 'SBUX', 'SGEN', 'SIRI', 'SNPS', 'TEAM', 'TMUS', 'TSLA', 'TXN', 'VRSK', 'VRTX', 'WBA', 'WBD', 'WDAY', 'XEL', 'ZM', 'ZS'] # ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 
# 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'APO', 'ARM', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BRK.B', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CB', 'CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'E', 'ECL', 'EL', 'ELV', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FI', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITUB', 'ITW', 'JNJ', 'JPM', 'KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCK', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'MUFG', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PBR', 'PBR.A', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RTX', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPGI', 'STLA', 'SYK', 'T', 'TD', 'TDG', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']

#original_starttime = datetime.fromisoformat("2023-03-21 09:30:00.000000+00:00")

level = 15
prediction_ahead = 4  # This is the duration of each picture: 4 * time_interval
depth = 5 # also 10, 30, 50
time_interval = '10L'  # 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 500L = 0.5 s; 100L = 0.1 s; 10L = 0.01 s; 1L = 0.001 s; 100U = 0.0001 s
window = 5 # also 3, 5, 10
date_listen =["2023-03-21"]#["2023-03-21","2023-03-22","2023-03-23"]


def create_folder(depth, time_interval, window):
    """
    Create new folders with a specified naming convention at two different locations.
    If the folders already exist, prompt the user for permission to delete existing files.

    Parameters:
    depth (int): Depth parameter used in the folder names.
    time_interval (str): Time interval parameter used in the folder names (e.g. '10L').
    window (int): Window parameter used in the folder names.
    """
    for dato in date_listen:
        # First folder path
        folder_path_1 = f'/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}_date{dato}'

        # Second folder path
        folder_path_2 = f'/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/Returns/depth{depth}_time{time_interval}_window{window}_date{dato}'

        # Function to create or clear a folder
        def create_or_clear_path(folder_path):
            if os.path.exists(folder_path):
                response = input(f"Folder already exists: {folder_path}. Delete files in it? (yes/no): ").strip().lower()
                if response == 'yes':
                    shutil.rmtree(folder_path)
                    os.makedirs(folder_path, exist_ok=True)
                    print(f"Folder cleared and recreated: {folder_path}")
                else:
                    print(f"Folder retained with existing files: {folder_path}")
            else:
                os.makedirs(folder_path, exist_ok=True)
                print(f"Folder created: {folder_path}")

        # Create or clear the first folder
        create_or_clear_path(folder_path_1)

        # Create or clear the second folder
        create_or_clear_path(folder_path_2)

# Example usage
create_folder(depth,time_interval, window)


for w in date_listen:
    original_starttime = datetime.fromisoformat(f"{w} 09:30:00.000000+00:00")  
    for i in asset_list:
        datetime_obj = datetime.fromisoformat(f"{w} 09:33:00.000+00:00") # default is 16:00:00.000+00:00
        datetime_obj = datetime_obj - pd.to_timedelta(window * prediction_ahead+1) # to avoid problems at the end of the prediction horizon (due to the return); note that pd.to_timedelta interprets a bare integer as nanoseconds
        a = original_starttime
        b = datetime.fromisoformat(f"{w} 09:31:05.000000+00:00")
        c = original_starttime
        while c < datetime_obj:
            # Helper function to concatenate orderbooks
            def concatenate_orderbooks(asset_list, date, level, directory):
                orderbook_df_list = []
                for asset in asset_list:
                    file_path = os.path.join(directory, f'{asset}_{date}_orderbook_{level}.feather')
                    orderbook_df_list.append(pd.read_feather(file_path))
                orderbook = pd.concat(orderbook_df_list)
                return orderbook

            def round_up_timestamp(ts, freq):
                """Round up a timestamp based on a given frequency"""
                return (ts + pd.Timedelta(freq) - pd.Timedelta('1ns')).floor(freq)

            def split_by_seconds_optimized(df, freq, window):
                df['ts'] = pd.to_datetime(df['ts'])
                # Round up the timestamps to the next multiple of freq (vectorized equivalent of round_up_timestamp)
                df['ts'] = df['ts'].dt.ceil(freq)
                df.set_index('ts', inplace=True)
                if 'asset' not in df.columns:
                    raise ValueError("DataFrame must have an 'asset' column")
                asset_results = {}
                for asset, asset_df in df.groupby('asset'):
                    # Resample the dataframe
                    resampled_df = asset_df.resample(freq).last()
                    # Forward-fill any NaN values
                    resampled_df.ffill(inplace=True)
                    # If the first row(s) are still NaN, drop them
                    resampled_df.dropna(inplace=True)
                    interval_dataframes = resampled_df.groupby(resampled_df.index).tail(1)
                    cols_to_drop = ["type", "order_id", "m_size", "m_price","spread", "direction", "volume", "midquote", "asset"] # "hidden_volume"
                    interval_dataframes.drop(columns=cols_to_drop, inplace=True)
                    interval_dataframes = interval_dataframes.T
                    rolling_windows = [None] * (len(interval_dataframes.columns) - (window - 1))
                    for start_col in range(len(interval_dataframes.columns) - (window - 1)):
                        end_col = start_col + window
                        window_df = interval_dataframes.iloc[:, start_col:end_col]
                        rolling_windows[start_col] = window_df
                    asset_results[asset] = rolling_windows
                return asset_results

            # Function to truncate the results to the requested depth
            def truncate_results(asset_results, depth):
                truncated_results = {}
                num_rows = depth * 4  # Convert depth to number of rows to keep
                for asset, windows in asset_results.items():
                    truncated_windows = [window.head(num_rows) for window in windows]
                    truncated_results[asset] = truncated_windows
                return truncated_results

            # Main function to process orderbooks
            def process_orderbooks(asset_list, w, level, freq, window, start_time, end_time, directory, depth):
                orderbook = concatenate_orderbooks(asset_list, w, level, directory)
                orderbook = orderbook.loc[orderbook['ts'].between(start_time, end_time)].reset_index(drop=True)
                asset_results = split_by_seconds_optimized(orderbook, freq, window)
                asset_results = truncate_results(asset_results, depth)
                return asset_results

            # USE:
            asset = [i]
            date = f"{w}"
            directory = r"/Users/jensknudsen/Desktop/LOBSTER_DATA/Feather"
            start_time = a #"2023-03-21 09:30:19.975000+00:00"
            end_time = b
            # Dynamic parameters:
            #prediction_ahead = 3
            #depth = 10 # also 10, 30, 50
            #time_interval = '30S' # 10S = 10 s, 1S = 1 s; 100L = 0.1 s; 10L = 0.01 s; 1L = 0.001 s; 100U = 0.0001 s
            #window = 3 # also 3, 5, 10

            results = process_orderbooks(asset, date, level, time_interval, window, start_time, end_time, directory, depth)
            # Keep a copy of the results under the name asset_results_10ms
            # (shallow copy: the dict is copied, the DataFrames inside are shared)

            asset_results_10ms = results.copy()

            # Scale the DataFrames in parallel
            from helper_functions import scale_dataframe
            from concurrent.futures import ProcessPoolExecutor
            import os

            def scale_dataframes_concurrently(dfs):
                # Use ProcessPoolExecutor to parallelize the scaling across CPU cores,
                # leaving one core free; fall back to a single worker if the count is unknown
                num_workers = max(1, (os.cpu_count() or 1) - 1)
                with ProcessPoolExecutor(max_workers=num_workers) as executor:
                    return list(executor.map(scale_dataframe, dfs))

            def scale_asset_dataframes(asset_data):
                for asset, dfs in asset_data.items():
                    # Use the concurrent scaling function
                    asset_data[asset] = scale_dataframes_concurrently(dfs)
                return asset_data

            from datetime import datetime, timedelta

            # Length of each supported interval code, in seconds
            UNIT_SECONDS = {
                '60S': 60,        # 60 seconds
                '30S': 30,        # 30 seconds
                '10S': 10,        # 10 seconds
                '5S': 5,          # 5 seconds
                '1S': 1,          # 1 second
                '500L': 0.5,      # 0.5 seconds
                '100L': 0.1,      # 0.1 seconds
                '10L': 0.01,      # 0.01 seconds
                '1L': 0.001,      # 0.001 seconds
                '100U': 0.0001    # 0.0001 seconds (100 microseconds)
            }

            def sub_time_units(unit, units_to_add):
                # Return `units_to_add` intervals of length `unit` as a timedelta
                return timedelta(seconds=units_to_add * UNIT_SECONDS[unit])

            def add_time_units(original_starttime, unit, units_to_add):
                # Shift the datetime forward by `units_to_add` intervals
                return original_starttime + sub_time_units(unit, units_to_add)

            def minus_time_units(original_starttime, unit, units_to_add):
                # Shift the datetime backward by `units_to_add` intervals
                return original_starttime - sub_time_units(unit, units_to_add)
            # Render each rolling window as a binary image array and save it as a compressed .npz file
            def save_images_from_dataframes_aligned(asset_data, save_dir, depth, time_interval, window, asset,w):
                # Image dimensions
                IMAGE_WIDTH = (len(asset_data[asset[0]][0].columns) * 20) + len(asset_data[asset[0]][0].columns)
                IMAGE_HEIGHT = int((len(asset_data[asset[0]][0]) / 2) * 4)
                # Define pixel boundaries for price values
                pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT+1)
                def dataframe_to_image_array(df):
                    
                    # Work on a copy so the caller's DataFrame is left untouched
                    df_working_copy = df.copy()
                    
                    # Each timestamp column gets a 21-pixel-wide strip: 10 (bid side) + 1 (price marker) + 10 (ask side)
                    IMAGE_WIDTH = (len(df_working_copy.columns) * 20) + len(df_working_copy.columns)
                    IMAGE_HEIGHT = int((len(df_working_copy) / 2) * 4) # the factor is normally 4
                    pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT)
                    
                    img = np.zeros((IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)
                    
                    for column in df_working_copy.columns:
                        prices = df_working_copy[column].iloc[::2].values
                        sizes = np.copy(df_working_copy[column].iloc[1::2].values)  # Work with a copy to avoid modifying the original
                    
                        price_positions = np.digitize(prices, pixel_boundaries) - 1
                    
                        # Find all unique values and their counts
                        unique_positions, counts = np.unique(price_positions, return_counts=True)
                        
                        # Identify values that occur more than once
                        duplicates = unique_positions[counts > 1]
                        
                        if duplicates.size > 0:
                            for dup in duplicates:
                                # Find all indices where this duplicate value occurs
                                dup_indices = np.where(price_positions == dup)[0]
                                # Sum the sizes for these indices
                                size_sum = sizes[dup_indices].sum()
                                # Assign the summed size to the duplicates
                                sizes[dup_indices] = size_sum
                                #print(f"Column: {column}, Duplicate value {dup} at indices {dup_indices} with adjusted size {size_sum}")
                        #else:
                            #print(f"Column: {column}, No duplicates found")
                        
                        # Write the adjusted sizes back with a single positional assignment
                        # (chained `df[col].iloc[...] = ...` may modify a temporary copy instead)
                        df_working_copy.iloc[1::2, df_working_copy.columns.get_loc(column)] = sizes
                    
                    # Rescale all size rows by the largest size value in the window
                    size_rows = df_working_copy.index.str.contains("size")
                    max_size_value = df_working_copy.loc[size_rows].max().max()
                    df_working_copy.loc[size_rows] = df_working_copy.loc[size_rows] / max_size_value
                    
                    
            
                    for col_idx, column in enumerate(df_working_copy.columns):
                        mid = (col_idx * 21) + 10
                        used_positions = set()  # Set to keep track of used price positions
                        
                        # Extract prices and sizes for the current column
                        prices = df_working_copy[column].iloc[::2].values
                        sizes = df_working_copy[column].iloc[1::2].values
                        price_positions = IMAGE_HEIGHT - np.digitize(prices, pixel_boundaries)
                        price_positions = np.clip(price_positions, 0,IMAGE_HEIGHT-1)
            
                    
                        for idx in range(len(prices)):
                            price_pos = price_positions[idx]
                            #print(f"Price: {prices[idx]}, Position: {price_pos}")
                            if price_pos not in used_positions:  # Check if the position has not been used
                                img[price_pos, mid] = 1  # Mark the price position
                                size_value = sizes[idx]
                                line_length = int(10 * size_value)  # Determine the length of the size line
                                
                                # Determine if it's a bid or ask size and position the size line accordingly
                                if 'ask_size' in df_working_copy.index[2*idx + 1]:  # This might need adjustment based on your DataFrame's structure
                                    img[price_pos, mid+1:mid+1+line_length] = 1  # Position the ask size to the right
                                else:
                                    img[price_pos, mid-line_length:mid] = 1  # Position the bid size to the left
                                
                                used_positions.add(price_pos)  # Mark this position as used
            
            
                    return img
                # Loop through each asset and each of its dataframes
                for asset, dfs in asset_data.items():
                    for idx, df_working_copy in enumerate(dfs):
                        last_column_name = df_working_copy.columns[-1]  # Label (timestamp) of the last column in the window
                        img_array = dataframe_to_image_array(df_working_copy)
                        # Assumes the column labels are datetime-like (e.g. pandas Timestamps)
                        formatted_timestamp = last_column_name.strftime('%Y-%m-%d_%H-%M-%S.%f%z')
                        filename = f"{asset}_depth{depth}_interval{time_interval}_window{window}_date{w}_{formatted_timestamp}.npz"
                        #filename = f"{asset}_depth{depth}_interval{time_interval}_window{window}_{last_column_name}.npz"
                        save_path = os.path.join(save_dir, filename)
                        np.savez_compressed(save_path, img_array)

            asset_results_10ms = scale_asset_dataframes(asset_results_10ms)

            save_dir = f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}_date{w}"
            
            save_images_from_dataframes_aligned(asset_results_10ms, save_dir, depth, time_interval, window,asset,w)

            # Offsets used to move the processing window: 400 intervals back, 3500 intervals ahead
            subtime = sub_time_units(time_interval, 400)
            addtime = sub_time_units(time_interval, 3500)

            #print(asset_results_10ms[asset[0]][-1].columns[-1]- subtime) #timedelta(milliseconds=6000)) # 40ms for 10L; 400ms for 100L
            #print(asset_results_10ms[asset[0]][-1].columns[-1]+ addtime) #timedelta(milliseconds=120000)) # 100000ms for 10L;
            a = asset_results_10ms[asset[0]][-1].columns[-1] - subtime #timedelta(milliseconds=6000) # 40ms for 10L;
            c = asset_results_10ms[asset[0]][-1].columns[-1]
            if b < datetime_obj:
                b = asset_results_10ms[asset[0]][-1].columns[-1] + addtime #timedelta(milliseconds=120000) # 100000ms for 10L;
            else:
                c = datetime_obj  

        else :
            df_mid = concatenate_orderbooks(asset, date,level,directory)
            # Assuming df_mid is your DataFrame

            # Convert the 'ts' column to datetime if it's not already
            df_mid['ts'] = pd.to_datetime(df_mid['ts'])

            # Set the 'ts' column as the index
            df_mid.set_index('ts', inplace=True)

            ask_price_1_df = df_mid['ask_price_1'].resample(time_interval).last()
            ask_size_1_df = df_mid['ask_size_1'].resample(time_interval).last()
            bid_price_1_df = df_mid['bid_price_1'].resample(time_interval).last()
            bid_size_1_df = df_mid['bid_size_1'].resample(time_interval).last()
            midquote_df = df_mid['midquote'].resample(time_interval).last()

            # Resample the midquote to the chosen time_interval,
            # keeping the last observation in each interval
            resampled_df = df_mid['midquote'].resample(time_interval).last()

            # Calculate returns as the percentage change in midquote
            returns = resampled_df.pct_change(periods=prediction_ahead)

            # Forward fill missing data
            returns_filled = returns.ffill()

            # Convert the returns back to a DataFrame with the timestamp reset as a column
            returns_df = returns_filled.reset_index()

            # Shift the return column up by prediction_ahead rows, so that each timestamp
            # carries the return over the NEXT prediction_ahead intervals (the prediction target)
            returns_df['midquote'] = returns_df['midquote'].shift(-prediction_ahead) # should be 4 here (per the original idea)

            # Turn each resampled series into a two-column DataFrame ('ts' + value) and forward fill gaps
            ask_price_1_df = ask_price_1_df.reset_index()
            ask_price_1_df.columns = ['ts', 'ask_price_1']
            ask_price_1_df = ask_price_1_df.ffill()

            ask_size_1_df = ask_size_1_df.reset_index()
            ask_size_1_df.columns = ['ts', 'ask_size_1']
            ask_size_1_df = ask_size_1_df.ffill()

            bid_price_1_df = bid_price_1_df.reset_index()
            bid_price_1_df.columns = ['ts', 'bid_price_1']
            bid_price_1_df = bid_price_1_df.ffill()

            bid_size_1_df = bid_size_1_df.reset_index()
            bid_size_1_df.columns = ['ts', 'bid_size_1']
            bid_size_1_df = bid_size_1_df.ffill()

            # The raw midquote is stored as 'midquote_real'; 'midquote' in returns_df holds the return
            midquote_df = midquote_df.reset_index()
            midquote_df.columns = ['ts', 'midquote_real']
            midquote_df = midquote_df.ffill()
            from functools import reduce

            # The six DataFrames to combine; each has a 'ts' column
            dataframes = [ask_price_1_df, ask_size_1_df, bid_price_1_df, bid_size_1_df, midquote_df, returns_df]

            # Merge all DataFrames on 'ts' using reduce
            returns_df = reduce(lambda left, right: pd.merge(left, right, on='ts', how='inner'), dataframes)

            # returns_df now contains the data from all six DataFrames, merged on 'ts'
            # Shift the ts column up by one row
            returns_df['ts'] = returns_df['ts'].shift(-1)

            ## Create a DataFrame by combining the six Series
            #df = pd.DataFrame({
            #    'Column1': returns_df, 
            #    'Column2': ask_price_1_df, 
            #    'Column3': ask_size_1_df, 
            #    'Column4': bid_price_1_df, 
            #    'Column5': bid_size_1_df, 
            #    'Column6': midquote_df
            #})

            #display(df)

            # Start `window` intervals after original_starttime
            new_datetime_start = add_time_units(original_starttime, time_interval, window)
            # End `window` intervals before datetime_obj
            new_datime_end = minus_time_units(datetime_obj, time_interval, window)

            # Keep only the rows between new_datetime_start and new_datime_end
            returns_df = returns_df.loc[returns_df['ts'].between(new_datetime_start, new_datime_end)].reset_index(drop=True)

            # Append returns_df to df_returns via concat (currently disabled)
            #df_returns = pd.concat([df_returns, returns_df])

            # Re-index returns_df
            returns_df = returns_df.reset_index(drop=True)
            # Save returns_df to CSV
            returns_df.to_csv(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/Returns/depth{depth}_time{time_interval}_window{window}_date{w}/returns_{i}_.csv", index=False)
            print(f"Last value of b is: {b}")
            print("Done with the asset:", i)
    print('Done with the date:', w) 
print("Done with all the dates")

# Remove the saved arrays that are not needed
import os
from datetime import datetime, timezone

# Define the directory where the files are stored
directory = f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}_date{w}"
# List all files in the directory
files = os.listdir(directory)

# Iterate through the files and delete those with timestamps later than datetime_obj
for file_name in files:
    # The 31 characters before the '.npz' extension hold the timestamp
    name_of_file = file_name[-35:-4]
    # Format that was used when the file was saved
    format_str = '%Y-%m-%d_%H-%M-%S.%f%z'
    # Convert the string to a datetime object
    to_time = datetime.strptime(name_of_file, format_str)
    
    if to_time > datetime_obj:
        # Define the path of the file to delete
        file_path = os.path.join(directory, file_name)
        # Delete the file
        os.remove(file_path)
print("All files removed")
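
The fixed slice `file_name[-35:-4]` above only works while every filename ends in a 31-character timestamp followed by `.npz`. A minimal sketch of a more defensive variant, assuming the filename pattern written by `save_images_from_dataframes_aligned`; the helper name `timestamp_from_filename` and the example filename are hypothetical and not part of the notebook:

```python
import re
from datetime import datetime, timezone

# Timestamp as written by strftime('%Y-%m-%d_%H-%M-%S.%f%z'), directly before '.npz'
TS_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}\.\d{6}[+-]\d{4})\.npz$")
TS_FORMAT = "%Y-%m-%d_%H-%M-%S.%f%z"

def timestamp_from_filename(file_name):
    """Return the timezone-aware timestamp encoded in the filename, or None if absent."""
    match = TS_PATTERN.search(file_name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), TS_FORMAT)

# Hypothetical filename following the notebook's naming scheme
example = "AAPL_depth5_interval10L_window5_date2023-03-21_2023-03-21_09-30-19.975000+0000.npz"
parsed = timestamp_from_filename(example)
```

Files that do not match (for example a stray `.DS_Store`) return `None` and can be skipped instead of raising inside `strptime`.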

In [None]:
# Make the code above dynamic - done
# Get the data from calling a function (get the returns and the images)
# Note that the book depth of some stocks is not very large -> missing levels show up as 9999999 (this must be handled) -> maybe just write a function that finds the maximum depth over the time horizon
# Make sure the time b does not go past 24:00:00 -> otherwise it rolls over to the next day (this must be handled)
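
The two open items above could be handled roughly as follows. This is only a sketch under stated assumptions: the book columns are assumed to be named `ask_price_k` / `bid_price_k` (as `ask_price_1` and `bid_price_1` are elsewhere in the notebook), the placeholder magnitude `9_999_999` is taken from the note above and may differ in the raw files, and both helper names are hypothetical.

```python
import pandas as pd

def max_available_depth(book, levels, placeholder=9_999_999):
    """Return the deepest level k such that levels 1..k hold real quotes in every row.

    A price whose absolute value is >= `placeholder` is treated as an empty level
    (assumption based on the note above).
    """
    for k in range(1, levels + 1):
        prices = book[[f"ask_price_{k}", f"bid_price_{k}"]]
        if (prices.abs() >= placeholder).any().any():
            return k - 1
    return levels

def clamp_to_day_end(ts, day_end):
    """Keep a window end time from rolling over past `day_end`."""
    return min(ts, day_end)

# Toy book: level 2 on the ask side is empty in the second row
demo_book = pd.DataFrame({
    "ask_price_1": [100.1, 100.2],
    "bid_price_1": [100.0, 100.1],
    "ask_price_2": [100.3, 9_999_999],
    "bid_price_2": [99.9, 99.8],
})
demo_depth = max_available_depth(demo_book, levels=2)

day_end = pd.Timestamp("2023-03-21 23:59:59.999999")
demo_b = clamp_to_day_end(pd.Timestamp("2023-03-21 23:59:59.5") + pd.Timedelta(seconds=1), day_end)
```

Running `max_available_depth` on the time range of interest before choosing `depth` would avoid drawing placeholder prices into the images.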

### COUNT THE NUMBER OF FILES IN A FOLDER

In [None]:
import importlib
import math
import os
import shutil
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pytz

import config
importlib.reload(config)


asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TM', 'TMO', 'TMUS', 'TRI', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UPS', 'USB', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'ZTS']#['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 
# NOTE: the commented-out list that starts at the end of the previous line is the full universe; compared with the
# active asset_list it additionally contains 'BMO', 'CAT', 'CNQ', 'FDX', 'GS', 'TJX', 'TRV', 'UNP', 'V', 'WMT', 'XOM'.

level = 15
prediction_ahead = 4 # This is the duration of each picture: 4 * time_interval
depth = 5 # also 10, 30, 50
time_interval = '10L' # 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 500L = 0.5 s, 100L = 0.1 s; 10L = 0.01 s; 1L = 0.001 s; 100U = 0.0001 s
window = 5 # also 3, 5, 10
date_listen = ["2023-03-21"]#["2023-03-21","2023-03-22","2023-03-23"]

def count_files_fast(folder_path):
    """Counts the number of files in the specified folder using os.scandir(), which is faster for large directories."""
    count = 0
    with os.scandir(folder_path) as it:
        for entry in it:
            if entry.is_file():
                count += 1
    return count

# Initialize the sum of file counts
total_file_count = 0
# Loop through each date
for w in date_listen:
    folder_path = f'/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}_date{w}'
    file_count = count_files_fast(folder_path)
    total_file_count += file_count
    print(f'Number of files in folder for {w}: {file_count}')

# Print the total number of files across all folders
print(f'Total number of files across all folders: {total_file_count}')
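
The `time_interval` codes listed in the parameter block above (`'30S'`, `'10L'`, `'100U'`, ...) all follow the pattern "count + unit letter". As a sketch, they can be converted to a `timedelta` without a hand-maintained lookup table; the helper name `interval_to_timedelta` is hypothetical, and the sketch assumes only the unit letters `S` (seconds), `L` (milliseconds) and `U` (microseconds) occur:

```python
import re
from datetime import timedelta

# Unit letter -> timedelta keyword argument
_UNIT_TO_KWARG = {"S": "seconds", "L": "milliseconds", "U": "microseconds"}

def interval_to_timedelta(code, count=1):
    """Convert an interval code such as '10L' into `count` intervals as a timedelta."""
    match = re.fullmatch(r"(\d+)([SLU])", code)
    if match is None:
        raise ValueError(f"Unsupported interval code: {code!r}")
    size, unit = int(match.group(1)), match.group(2)
    return count * timedelta(**{_UNIT_TO_KWARG[unit]: size})

# 400 intervals of 10 ms
lookback = interval_to_timedelta("10L", 400)
```

Because `timedelta` stores whole microseconds, this avoids float rounding in products such as `400 * 0.01`.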


### GENERATE THE DF_ARRAY

In [None]:
import os
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

def generate_df_array(depth, time_interval, window,date_listen):
    """
    Load .npz files from a specified folder and process the data into a pandas DataFrame.

    This function reads all .npz files from a given directory, concatenates arrays found within
    each file, and then extracts relevant information like 'Ticker' and 'Time' from the filenames.
    The function assumes a specific naming convention for the files.

    Parameters:
    depth (int): Depth parameter used in the folder path.
    time_interval (str): Time interval code used in the folder path (e.g. '10L').
    window (int): Window parameter used in the folder path.
    date_listen (list of str): Dates (YYYY-MM-DD); one folder is read per date.

    Returns:
    pandas.DataFrame: A DataFrame containing the filenames, combined arrays, tickers, and times.
    """
    all_dfs = []

    # Define the folder path using formatted strings for depth, time_interval, and window
    for w in date_listen:
        folder_path = f'/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}_date{w}'

        # List all .npz files in the specified folder
        file_list = [file for file in os.listdir(folder_path) if file.endswith('.npz')]
        file_list.sort()  # Sort the file list alphabetically

        # Lists to store each file's numpy array and filename
        arrays = []
        filenames = []

        for file in file_list:
            file_path = os.path.join(folder_path, file)
            with np.load(file_path, allow_pickle=True) as data:
                # Combine all arrays in a file, if any
                combined_array = np.concatenate([data[array_name] for array_name in data.files])
                arrays.append(combined_array)
                filenames.append(file)

        # Create a DataFrame with filenames and arrays
        df_array = pd.DataFrame({'Filename': filenames, 'Arrays': arrays})

        # Extract ticker and time from the filename, and convert time to datetime format
        df_array['Ticker'] = df_array['Filename'].str.extract(r'([A-Z]+)_')
        df_array['Time'] = pd.to_datetime(df_array['Filename'].str.extract(r'(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}\.\d{6})')[0], format='%Y-%m-%d_%H-%M-%S.%f')

        def minus_time_units(original_starttime, unit, units_to_add):
            # Define the conversion of each unit to seconds
            unit_conversion = {
                '60S': 60,         # 60 seconds
                '30S': 30,         # 30 seconds
                '10S': 10,         # 10 second
                '5S': 5,         # 5 second
                '1S': 1,         # 1 second
                '500L':0.5,      # 0.5 second
                '100L': 0.1,     # 0.1 seconds
                '10L': 0.01,     # 0.01 seconds
                '1L': 0.001,     # 0.001 seconds
                '100U': 0.0001    # 0.0001 seconds
            }
            # Calculate the total seconds to subtract
            total_seconds_to_add = units_to_add * unit_conversion[unit]
            # Subtract the time from the original datetime
            return original_starttime - timedelta(seconds=total_seconds_to_add)

        datetime_obj = datetime.fromisoformat(f"{w} 09:33:00.000+00:00") # default is 16:00:00.000+00:00
        # Drop some rows that are not needed 
        new_datime_end = minus_time_units(datetime_obj, time_interval, window)
        new_datime_end = new_datime_end.replace(tzinfo=None)
        df_array = df_array[df_array['Time'] <= new_datime_end]
        
        # count the rows in the df_array
        print(len(df_array))

        all_dfs.append(df_array)

    # Concatenate all DataFrames from the list
    combined_df = pd.concat(all_dfs, ignore_index=True) if all_dfs else pd.DataFrame()
    return combined_df

# Example usage:
df_array = generate_df_array(depth, time_interval, window,date_listen)
df_array
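
To make the naming convention that `generate_df_array` relies on concrete, here is a small round-trip sketch: it writes one array with `np.savez_compressed`, reads it back, and recovers ticker and time from the filename with the same regular expressions as above. The helper `load_npz_folder`, the temporary folder and the example filename are assumptions for illustration; unlike the notebook, the sketch keeps the arrays in a plain list aligned with the metadata rows.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_npz_folder(folder_path):
    """Return (metadata DataFrame, list of arrays) for all .npz files in a folder."""
    filenames, arrays = [], []
    for name in sorted(os.listdir(folder_path)):
        if not name.endswith(".npz"):
            continue
        with np.load(os.path.join(folder_path, name)) as data:
            # np.savez_compressed stores a positional array under the key 'arr_0'
            arrays.append(data[data.files[0]])
        filenames.append(name)
    meta = pd.DataFrame({"Filename": filenames})
    # The first run of capitals followed by '_' is the ticker
    meta["Ticker"] = meta["Filename"].str.extract(r"([A-Z]+)_", expand=False)
    stamp = meta["Filename"].str.extract(
        r"(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}\.\d{6})", expand=False
    )
    meta["Time"] = pd.to_datetime(stamp, format="%Y-%m-%d_%H-%M-%S.%f")
    return meta, arrays

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical filename following the notebook's naming scheme
    name = "AAPL_depth5_interval10L_window5_date2023-03-21_2023-03-21_09-30-00.050000+0000.npz"
    np.savez_compressed(os.path.join(tmp, name), np.eye(3, dtype=np.uint8))
    demo_meta, demo_arrays = load_npz_folder(tmp)
```

Note that the time regex deliberately stops before the UTC offset, so `Time` is timezone-naive, which matches the `replace(tzinfo=None)` step in `generate_df_array`.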

In [None]:
plt.figure(figsize=(20,10))
plt.imshow(df_array['Arrays'][23], cmap='gray')
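
Each image shown this way is built from 21-pixel-wide strips, one per timestamp: a price marker in the middle column, bid sizes drawn to the left and ask sizes to the right. The sketch below reproduces that encoding for a single snapshot. It is a simplification of `dataframe_to_image_array`, not a replacement: it assumes prices and sizes are already scaled to [0, 1], it skips the handling of several prices landing on the same pixel row, and the function name `snapshot_to_strip` is hypothetical.

```python
import numpy as np

def snapshot_to_strip(prices, sizes, is_ask, height):
    """Render one order-book snapshot as a (height, 21) uint8 strip."""
    strip = np.zeros((height, 21), dtype=np.uint8)
    mid = 10  # centre column holds the price marker
    boundaries = np.linspace(0, 1, height)
    # Higher prices are drawn nearer the top of the image
    rows = np.clip(height - np.digitize(prices, boundaries), 0, height - 1)
    for row, size, ask in zip(rows, sizes, is_ask):
        strip[row, mid] = 1
        length = int(10 * size)  # at most 10 pixels per side
        if ask:
            strip[row, mid + 1: mid + 1 + length] = 1  # ask size to the right
        else:
            strip[row, mid - length: mid] = 1          # bid size to the left
    return strip

# One ask at scaled price 0.75 (full size) and one bid at 0.25 (half size)
demo_strip = snapshot_to_strip(
    prices=np.array([0.75, 0.25]),
    sizes=np.array([1.0, 0.5]),
    is_ask=[True, False],
    height=8,
)
```

With `height=8` the ask lands on row 2 and the bid on row 6, so the ask line occupies columns 10 to 20 and the bid line columns 5 to 10.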

### GENERATE THE DF_MIDPRICE 

In [None]:
import pandas as pd
import os
import numpy as np
from datetime import datetime  # needed by standardize_timestamps

def generate_df_midprice(depth, time_interval, window,date_listen):
    """
    Generate a DataFrame (df_midprice) from CSV files in a specified folder.
    This function reads CSV files from a given directory, extracts ticker information,
    processes time data, renames columns as necessary, and creates a 'midquote_target' column.

    Parameters:
    depth (int): Depth parameter used in the folder path.
    time_interval (str): Time interval code used in the folder path (e.g. '10L').
    window (int): Window parameter used in the folder path.
    date_listen (list of str): Dates (YYYY-MM-DD); one folder is read per date.

    Returns:
    pandas.DataFrame: A DataFrame containing data from all CSV files with additional columns for Ticker, Time, and midquote_target.
    """
    dataframe_collect = []
    
    for w in date_listen:
        # Define the folder path using formatted strings for depth, time_interval, and window
        folder_path = f'/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/Returns/depth{depth}_time{time_interval}_window{window}_date{w}'

        # List and sort CSV files in the specified folder
        csv_files = sorted([file for file in os.listdir(folder_path) if file.endswith('.csv')])

        dataframes = []  # Initialize an empty list to store DataFrames

        for file in csv_files:
            file_path = os.path.join(folder_path, file)
            df = pd.read_csv(file_path)

            # Extract ticker from the file name
            Ticker = ''.join(filter(str.isupper, os.path.splitext(file)[0]))
            df['Ticker'] = Ticker  # Add the ticker as a new column

            dataframes.append(df)

        # Concatenate all DataFrames into one
        df_midprice = pd.concat(dataframes, ignore_index=True)

        # Reorder columns to have 'Ticker' first
        df_midprice = df_midprice[['Ticker'] + [col for col in df_midprice.columns if col != 'Ticker']]

        # Rename the 'ts' column to 'Time' if it exists
        if 'ts' in df_midprice.columns:
            df_midprice = df_midprice.rename(columns={'ts': 'Time'})

        def standardize_timestamps(timestamp_str):
            try:
                # Parse the timestamp string to a datetime object
                timestamp_obj = datetime.strptime(timestamp_str, '%Y-%m-%d %H:%M:%S%z')
                # Reformat to include microseconds (even if zero)
                return timestamp_obj.strftime('%Y-%m-%d %H:%M:%S.%f%z')
            except ValueError:
                try:
                    # Attempt parsing without timezone if the first format fails
                    timestamp_obj = datetime.strptime(timestamp_str, '%Y-%m-%d %H:%M:%S')
                    return timestamp_obj.strftime('%Y-%m-%d %H:%M:%S.%f')
                except ValueError:
                    # Return the original string if both parsing attempts fail
                    return timestamp_str

        # Apply the function to each element in the 'Time' column
        df_midprice['Time'] = df_midprice['Time'].apply(standardize_timestamps)


        # Convert 'Time' column to datetime64[ns], removing timezone
        df_midprice['Time'] = pd.to_datetime(df_midprice['Time']).dt.tz_localize(None)

        # Convert 'Ticker' column to string
        df_midprice['Ticker'] = df_midprice['Ticker'].astype(str)

        # Create a new column 'midquote_target' based on the 'midquote' values
        df_midprice['midquote_target'] = np.where(df_midprice['midquote'] > 0, 1, 0)
        print(len(df_midprice))
        dataframe_collect.append(df_midprice)
    
    df_midprice = pd.concat(dataframe_collect, ignore_index=True) if dataframe_collect else pd.DataFrame()
    return df_midprice


# Example usage:
df_midprice = generate_df_midprice(depth, time_interval, window, date_listen)
df_midprice
# remove duplicates rows based on the 'Time' and 'Ticker' columns
#df_midprice = df_midprice.drop_duplicates(subset=['Time', 'Ticker'], keep='first')
#df_midprice
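
The `midquote` column loaded here holds a forward return, and `midquote_target` is 1 when that return is positive. A compact sketch of how such a label can be derived from a raw midquote series is given below. It is a simplified illustration, assuming a datetime-indexed series; the function name `forward_return_labels` is hypothetical, and unlike the notebook it forward fills the prices before computing returns.

```python
import numpy as np
import pandas as pd

def forward_return_labels(midquote, freq, horizon):
    """Resample a midquote series and label each bar by the sign of its forward return."""
    sampled = midquote.resample(freq).last().ffill()
    # Return over the NEXT `horizon` bars, aligned to the bar where the prediction is made
    fwd_return = sampled.pct_change(periods=horizon).shift(-horizon)
    out = pd.DataFrame({"midquote_real": sampled, "midquote": fwd_return})
    # NaN returns (the last `horizon` bars) compare as False and are labelled 0
    out["midquote_target"] = np.where(out["midquote"] > 0, 1, 0)
    return out

one_second = pd.Timedelta(seconds=1)
index = pd.date_range("2023-03-21 09:30:00", periods=4, freq=one_second)
demo_labels = forward_return_labels(
    pd.Series([100.0, 101.0, 100.0, 102.0], index=index), one_second, horizon=1
)
```

Since the trailing rows have no forward return, they are labelled 0 here; dropping them before training may be preferable.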

### COMBINE THE DATAFRAMES

In [None]:
## THIS IS FOR THE ONCE THAT DO NOT HAVE THE SAME START TIME

## Find the rows in df_array that are not in df_midprice
df_array_not_in_df_midprice = df_array[~df_array.set_index(['Time', 'Ticker']).index.isin(df_midprice.set_index(['Time', 'Ticker']).index)]
#
## Find the rows in df_midprice that are not in df_array
df_midprice_not_in_df_array = df_midprice[~df_midprice.set_index(['Time', 'Ticker']).index.isin(df_array.set_index(['Time', 'Ticker']).index)]
#
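
The same comparison of `(Time, Ticker)` keys can be done with a single outer merge and pandas' `indicator` flag. This is a sketch on two small toy frames; the frames, their values and the names `left`, `right`, `only_left` and `only_right` are made up for illustration.

```python
import pandas as pd

keys = ['Time', 'Ticker']

# Toy stand-ins for df_array and df_midprice (illustrative values only)
left = pd.DataFrame({'Time': [1, 2, 3], 'Ticker': ['AAPL', 'AAPL', 'MSFT'], 'x': [10, 20, 30]})
right = pd.DataFrame({'Time': [2, 3, 4], 'Ticker': ['AAPL', 'MSFT', 'MSFT'], 'y': [0.1, 0.2, 0.3]})

# The '_merge' column says whether each key pair exists in left, right, or both
flagged = left[keys].merge(right[keys], on=keys, how='outer', indicator=True)

only_left = flagged.loc[flagged['_merge'] == 'left_only', keys]
only_right = flagged.loc[flagged['_merge'] == 'right_only', keys]

print(only_left)
print(only_right)
```

One merge yields both directions of the difference, which may be cheaper than building two MultiIndex objects on large frames.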


In [None]:
# drop the tickers listed below from df_array
df_array = df_array[~df_array['Ticker'].isin(['AJG', 'AMGN', 'AON', 'APD', 'AZO', 'CB', 'CHTR', 'CI', 'CME','CNI', 'CTAS', 'DELL', 'DHR', 'EQIX', 'FMX', 'INTU', 'MMC', 'MSI','NOC', 'PH', 'RSG', 'SCCO', 'SMFG', 'SONY', 'SPGI', 'TMO', 'TM','TT', 'UPS', 'WDAY','AMX'])]

In [None]:
def combine_dataframes(df_array, df_midprice):
    # Combine df_array and df_midprice on their shared (Ticker, Time) keys.
    # Fail loudly on a length mismatch instead of returning an undefined variable.
    if len(df_midprice) != len(df_array):
        raise ValueError("The length of df_midprice is not equal to the length of df_array")
    return pd.merge(df_midprice, df_array, on=['Ticker', 'Time'])
df_combined = combine_dataframes(df_array,df_midprice)
# make the spread column:
df_combined['spread'] = df_combined['ask_price_1'] - df_combined['bid_price_1']
df_combined

In [None]:
# save df_combined to a pickle file
df_combined.to_pickle(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/combined/combined_depth{depth}_time{time_interval}_window{window}.pkl")

In [None]:
#CODE WITH THE RIGHT PICTURES
import pandas as pd
import os
import numpy as np
import math
import config
import importlib
importlib.reload(config)
import matplotlib.pyplot as plt
from datetime import datetime
import shutil

# BMO+XOM 

asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']  #['AAPL', 'ABNB', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AEP', 'ALGN', 'AMAT', 'AMD', 'AMGN', 'AMZN', 'ANSS', 'ASML', 'AVGO', 'AZN', 'BIIB', 'BKNG', 'BKR', 'CDNS', 'CEG', 'CHTR', 'CMCSA', 'COST', 'CPRT', 
# 'CRWD', 'CSCO', 'CSGP', 'CSX', 'CTAS', 'CTSH', 'DDOG', 'DLTR', 'DXCM', 'EA', 'EBAY', 'ENPH', 'EXC', 'FANG', 'FAST', 'FTNT', 'GEHC', 'GFS', 'GILD', 'GOOGL', 'HON', 'IDXX', 'ILMN', 'INTC', 'INTU', 'ISRG', 'JD', 'KDP', 'KHC', 'KLAC', 'LCID', 'LRCX', 'LULU', 'MAR', 'MDLZ', 'MELI', 'META', 'MNST', 'MRNA', 'MRVL', 'MSFT', 'MU', 'NFLX', 'NVDA', 'NXPI', 'ODFL', 'ON', 'ORLY', 'PANW', 'PAYX', 'PCAR', 'PDD', 'PEP', 'PYPL', 'QCOM', 'REGN', 'ROST', 'SBUX', 'SGEN', 'SIRI', 'SNPS', 'TEAM', 'TMUS', 'TSLA', 'TXN', 'VRSK', 'VRTX', 'WBA', 'WBD', 'WDAY', 'XEL', 'ZM', 'ZS'] # ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'APO', 'ARM', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BRK.B', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CB', 'CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'E', 'ECL', 'EL', 'ELV', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FI', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITUB', 'ITW', 'JNJ', 'JPM', 'KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCK', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'MUFG', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PBR', 'PBR.A', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RTX', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 
# 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPGI', 'STLA', 'SYK', 'T', 'TD', 'TDG', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']

#original_starttime = datetime.fromisoformat("2023-03-21 09:30:00.000000+00:00")

level = 15
prediction_ahead = 4 # This is the duration of each picture: 4 * time_interval
depth = 5 # also 10, 30, 50
time_interval = '10L' # pandas offset aliases: 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 500L = 0.5 s, 100L = 0.1 s, 10L = 0.01 s, 1L = 0.001 s; 100U = 0.0001 s
window = 5 # also 3, 5, 10
date_listen = ["2023-03-21"]#["2023-03-21","2023-03-22","2023-03-23"]
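
The offset aliases in `time_interval` are easy to misread (`L` is milliseconds, `U` is microseconds). The helper below is a small sketch that converts such an alias to integer nanoseconds so the picture duration can be checked exactly. `interval_to_ns` and `picture_duration_ns` are hypothetical names, and only the four uppercase units used in this notebook are handled. Recent pandas versions appear to prefer the lowercase spellings `s`, `ms`, `us` and `ns`, which is worth checking against the installed version.

```python
import re

# Nanoseconds per unit for the legacy uppercase aliases used in this notebook
_UNIT_NS = {'S': 1_000_000_000, 'L': 1_000_000, 'U': 1_000, 'N': 1}

def interval_to_ns(alias):
    """Convert an alias such as '10L' or '5S' to an integer number of nanoseconds."""
    match = re.fullmatch(r'(\d+)([SLUN])', alias)
    if match is None:
        raise ValueError(f'unsupported interval alias: {alias!r}')
    count, unit = match.groups()
    return int(count) * _UNIT_NS[unit]

def picture_duration_ns(alias, prediction_ahead):
    """Duration covered by one picture: prediction_ahead * time_interval."""
    return prediction_ahead * interval_to_ns(alias)

print(interval_to_ns('10L'))          # 10 milliseconds in nanoseconds
print(picture_duration_ns('10L', 4))  # four intervals of 10 milliseconds
```

Integer nanoseconds avoid the rounding noise that floating-point seconds would introduce for very short intervals.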


In [None]:
# import the pickle file
import pandas as pd
df = pd.read_pickle(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/combined/combined_depth{depth}_time{time_interval}_window{window}.pkl")

In [None]:
tickers_of_interest = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'AFL','AIG', 'ALL', 'AMAT', 'AMD', 'AMT', 'AMZN', 'ANET', 'ASML', 'AVGO','AXP', 'AZN', 'BABA', 'BAC', 'BA', 'BBVA', 'BDX', 'BHP', 'BKNG','BLK', 'BMY', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'CARR','CCI', 'CDNS', 'CL', 'CMCSA', 'COF', 'COP', 'COST', 'CP', 'CRH','CRM', 'CSCO', 'CSX', 'CVS', 'CVX', 'C', 'DASH', 'DEO','DE', 'DHI', 'DIS', 'DUK', 'EL', 'EMR', 'ENB', 'EOG', 'EPD','EQNR', 'ETN', 'ET', 'EW', 'FCX', 'FTNT', 'GD', 'GE', 'GILD', 'GM','GOOGL', 'GSK', 'GWW', 'HCA', 'HDB', 'HD', 'HLT', 'HMC', 'HON','HSBC', 'HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'ISRG','ITW', 'JNJ', 'JPM', 'KHC', 'KKR', 'KLAC', 'KO', 'LIN', 'LLY','LMT', 'LOW', 'LRCX', 'LULU', 'MAR', 'MA', 'MCD', 'MCHP', 'MCO','MDLZ', 'MDT', 'MELI', 'META', 'MET', 'MMM', 'MNST', 'MO', 'MPC','MRK', 'MRVL', 'MSCI', 'MSFT', 'MS', 'MU', 'NEE', 'NFLX', 'NGG','NKE', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI','ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX', 'PBR', 'PCAR', 'PEP','PFE', 'PGR', 'PG', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD','PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROST', 'RY', 'SAN','SAP', 'SBUX', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SNOW','SNPS', 'SNY', 'SO', 'SPG', 'SPOT', 'STLA', 'SYK', 'TD', 'TEAM','TFC', 'TGT', 'TMUS', 'TRI', 'TSLA', 'TSM', 'TTE', 'TXN', 'T','UBER', 'UBS', 'UL', 'UNH', 'USB', 'VALE', 'VLO', 'VRTX', 'VZ','WELL', 'WFC', 'WM', 'ZTS']

len(tickers_of_interest)

In [None]:
# load df from the pickle file
df = pd.read_pickle(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/combined/combined_depth{depth}_time{time_interval}_window{window}.pkl")

In [None]:
# display one picture from df['Arrays'] (here index 4393)
import matplotlib.pyplot as plt
# make the plot bigger
plt.figure(figsize=(20,10))

plt.imshow(df['Arrays'][4393], cmap='gray')

In [None]:
import pandas as pd

def split_dataframe_by_ticker(df, time_column='Time', train_ratio=0.6, val_ratio=0.2):
    """
    Splits a DataFrame chronologically into training, validation, and test sets within each ticker.
    
    Parameters:
    df (DataFrame): The input DataFrame.
    time_column (str): The name of the column containing time data.
    train_ratio (float): The proportion of data to be used for training.
    val_ratio (float): The proportion of data to be used for validation.

    Returns:
    DataFrame: Training set.
    DataFrame: Validation set.
    DataFrame: Test set.
    """

    # Convert Time column to datetime for proper sorting
    df[time_column] = pd.to_datetime(df[time_column])

    # Define a function to split the data
    def split_data(group):
        group = group.sort_values(by=time_column)
        idx_train = int(len(group) * train_ratio)
        idx_val = int(len(group) * (train_ratio + val_ratio))
        return group.iloc[:idx_train], group.iloc[idx_train:idx_val], group.iloc[idx_val:]

    # Apply the function and get splits
    splits = df.groupby('Ticker').apply(lambda g: split_data(g))

    # Extract splits into separate DataFrames using list comprehension
    train_df = pd.concat([s[0] for s in splits])
    val_df = pd.concat([s[1] for s in splits])
    test_df = pd.concat([s[2] for s in splits])

    return train_df, val_df, test_df

# Usage Example:
train_df, val_df, test_df = split_dataframe_by_ticker(df)
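
The idea behind the per-ticker split is that each ticker's rows are ordered in time and then cut at fixed fractions, so no validation or test row precedes a training row of the same ticker. The following is a dependency-free sketch of that logic on toy `(time, ticker)` tuples. The function name `split_by_ticker` and the toy rows are assumptions, and the ratios 0.5 and 0.25 were picked only because they are exact in floating point.

```python
from collections import defaultdict

def split_by_ticker(rows, train_ratio=0.6, val_ratio=0.2):
    """rows: iterable of (time, ticker) tuples. Returns (train, val, test) lists."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)

    train, val, test = [], [], []
    for ticker in sorted(groups):
        group = sorted(groups[ticker])  # chronological order within the ticker
        idx_train = int(len(group) * train_ratio)
        idx_val = int(len(group) * (train_ratio + val_ratio))
        train += group[:idx_train]
        val += group[idx_train:idx_val]
        test += group[idx_val:]
    return train, val, test

# Toy data: two tickers with eight time steps each
rows = [(t, ticker) for ticker in ('AAPL', 'MSFT') for t in range(8)]
train, val, test = split_by_ticker(rows, train_ratio=0.5, val_ratio=0.25)

print(len(train), len(val), len(test))
```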

### Monte Carlo Simulation of Sharpe Ratio (maybe insert this into the test file)

In [10]:
import pandas as pd
from datetime import datetime
import pytz
import numpy as np
import os
import math
import config
import importlib
importlib.reload(config)
import matplotlib.pyplot as plt
import shutil

df_returns = pd.DataFrame()
asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TM', 'TMO', 'TMUS', 'TRI', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UPS', 'USB', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'ZTS']

level = 15
prediction_ahead = 4 # This is the duration of each picture: 4 * time_interval
depth = 5 # also 10, 30, 50
time_interval = '10L' # pandas offset aliases: 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 500L = 0.5 s, 100L = 0.1 s, 10L = 0.01 s, 1L = 0.001 s; 100U = 0.0001 s
window = 5 # also 3, 5, 10
date_listen = ["2023-03-21"]

In [11]:
# import the pickle file
import pandas as pd
df = pd.read_pickle(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/combined/combined_depth{depth}_time{time_interval}_window{window}.pkl")

In [12]:
# only keep the rows where the ticker column is in the asset_list:
df = df[df['Ticker'].isin(asset_list)]

#sort df first by 'Time' and then by 'Ticker':
df = df.sort_values(by=['Time', 'Ticker'])

training_percent = 0.6

# Calculate the number of samples that should be in the training set
num_samples = len(df) * training_percent

def closest_value(value):
    # Move the value up to the next multiple of 200 (an exact multiple is also moved up by 200)
    return int(value + (200 - value % 200))

# Get the adjusted number of samples for the training set
train_samples = closest_value(num_samples)

# Split the dataframe
train_df = df[:train_samples]
test_df = df[train_samples:]

print(len(train_df))  # This will print the number of samples in the training set
print(len(test_df))   # This will print the number of samples in the test set

test_df = test_df.iloc[:1121000]

print(len(test_df))   # This will print the number of samples in the test set


2189600
1459528
1121000
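
The printed sizes imply 2,189,600 + 1,459,528 = 3,649,128 rows before the split, so 60% is 2,189,476.8 and the training set is rounded up to 2,189,600. The sketch below shows an alternative rounding helper that leaves an exact multiple unchanged, on the assumption that this is the desired behaviour; `round_up_to_multiple` is a hypothetical name and not part of the notebook.

```python
import math

def round_up_to_multiple(value, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return math.ceil(value / multiple) * multiple

total_rows = 2189600 + 1459528          # sizes printed by the cell above
train_samples = round_up_to_multiple(total_rows * 0.6, 200)

print(train_samples)
print(round_up_to_multiple(400, 200))   # an exact multiple stays where it is
```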


### CNN

In [None]:
# NEW TEST OF THE MODEL, RUN THIS IN COLAB
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(32 * 40 * 105, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        # Assuming x is of shape (batch_size, 40, 105)
        x = x.unsqueeze(1)  # Add a channel dimension (batch_size, 1, 40, 105)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.size(0), -1)  # Flatten the tensor for the fully connected layer
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))  # Sigmoid for binary classification
        return x
# Define the model (using the SimpleCNN class defined above)
model = SimpleCNN()

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Add code here to prepare your data, train, and evaluate the model

from torch.utils.data import Dataset, DataLoader
import torch

class TensorDataset(Dataset):
    def __init__(self, dataframe):
        self.tensors = list(dataframe['tensors'])
        self.targets = list(dataframe['midquote_target'])

    def __len__(self):
        return len(self.tensors)

    def __getitem__(self, idx):
        return self.tensors[idx], self.targets[idx]

# Create datasets
train_dataset = TensorDataset(train_df)
val_dataset = TensorDataset(val_df)
test_dataset = TensorDataset(test_df)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

import torch
import torch.nn as nn
import torch.optim as optim

# Check if CUDA is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = SimpleCNN().to(device)  # Replace SimpleCNN with your model class
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 10  # Define the number of epochs

for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    train_loss = 0.0

    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        # squeeze(1) keeps the batch dimension even when the last batch holds a single sample
        loss = criterion(outputs.squeeze(1), targets.float())
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * inputs.size(0)

    train_loss /= len(train_loader.dataset)

    # Validation Loop
    model.eval()  # Set the model to evaluation mode
    val_loss = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            # squeeze(1) keeps the batch dimension even when the last batch holds a single sample
            loss = criterion(outputs.squeeze(1), targets.float())
            val_loss += loss.item() * inputs.size(0)

    val_loss /= len(val_loader.dataset)

    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")


In [None]:
# make sure that there are the same amount of positive and negative returns in the training data, randomly choose the same amount of positive and negative returns
# make a dataframe with only the positive returns
df_positive = train_df[train_df['midquote_target'] == 1]
# make a dataframe with only the negative returns
df_negative = train_df[train_df['midquote_target'] == 0]

# find the length of the positive dataframe
length_positive = len(df_positive)
# find the length of the negative dataframe
length_negative = len(df_negative)

# find the minimum length of the two dataframes
min_length = min(length_positive, length_negative)

# randomly choose the same amount of positive and negative returns
df_positive_fixed = df_positive.sample(n=min_length, random_state=1)
df_negative_fixed = df_negative.sample(n=min_length, random_state=1) 

# combine the two dataframes

df_train = pd.concat([df_positive_fixed, df_negative_fixed])

print('length of the df_train:', len(df_train))
print('length of the df_val:', len(val_df))
print('length of the df_test:', len(test_df))
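
The balancing step is plain random undersampling: the majority class is reduced to the size of the minority class. Below is a dependency-free sketch of the same idea on a toy label list. `undersample`, the seed and the labels are illustrative assumptions, and sorting the kept indices (to preserve the original row order) is a design choice of this sketch that the cell above does not make.

```python
import random

def undersample(labels, seed=1):
    """Return sorted indices that contain equally many 0 and 1 labels."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(positives), len(negatives))
    return sorted(rng.sample(positives, n) + rng.sample(negatives, n))

labels = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]   # three positives, seven negatives
kept = undersample(labels)

print(kept)
print(sum(labels[i] for i in kept), 'positives out of', len(kept))
```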

In [None]:
# share of midquote values above 0, and share of values at or below 0:
print(df_train[df_train['midquote'] > 0]['midquote'].count()/len(df_train['midquote']))
print(df_train[df_train['midquote'] <= 0]['midquote'].count()/len(df_train['midquote']))
print(f"Number of neutral midquotes: {df_train[df_train['midquote'] == 0]['midquote'].count()} out of {len(df_train['midquote'])}")

# make a distribution plot of the midquote
import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(df_train['midquote'], bins=100, kde=True)
plt.show()


### PUSH TO COLAB

In [None]:
import tarfile
import os

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

# Define destination folder path
destination_folder = '/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/tar_folder/'

# Usage for first tar file
source_folder1 = f'/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}'
output_tar_file1 = destination_folder + 'array.tar'  # Save to destination folder

make_tarfile(output_tar_file1, source_folder1)
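
Before uploading, it can be useful to confirm what the archive actually contains. The sketch below runs the same `make_tarfile` helper on a throwaway directory and lists the members; the directory name `depth5_demo` and the file `a.txt` are placeholders created only for this check.

```python
import os
import tarfile
import tempfile

def make_tarfile(output_filename, source_dir):
    # Uncompressed tar; the top-level entry is the folder's own name
    with tarfile.open(output_filename, "w") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

with tempfile.TemporaryDirectory() as tmp:
    source_dir = os.path.join(tmp, 'depth5_demo')
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, 'a.txt'), 'w') as fh:
        fh.write('hello')

    archive = os.path.join(tmp, 'array.tar')
    make_tarfile(archive, source_dir)

    # Read the archive back and collect the member names
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(members)
```

Because of `arcname`, the archive stores paths relative to the folder name rather than the full local path.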


In [None]:
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload  # used by upload_file below
from google.oauth2 import service_account
import os

SCOPES = ['https://www.googleapis.com/auth/drive']
SERVICE_ACCOUNT_FILE = 'credentials.json'
PARENT_FOLDER_ID = "1ARfYF9vuYgHeoffm6I0ezfKU11eAyByI"

def authenticate():
    creds = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)
    return creds

def upload_file(file_path):
    creds = authenticate()
    service = build('drive', 'v3', credentials=creds)

    file_name = os.path.basename(file_path)  # Extracts file name from file_path
    file_metadata = {
        'name': file_name,
        'parents': [PARENT_FOLDER_ID]
    }

    media = MediaFileUpload(file_path, resumable=True)
    file = service.files().create(
        body=file_metadata,
        media_body=media,
        fields='id'
    ).execute()

def upload_folder(folder_path):
    for item in os.listdir(folder_path):
        full_item_path = os.path.join(folder_path, item)
        if os.path.isfile(full_item_path):
            upload_file(full_item_path)

# Usage example: upload all files in a specific local folder
local_folder_path = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/tar_folder"  # Replace with your local folder path
upload_folder(local_folder_path)

In [None]:
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload  # used by upload_file below
from google.oauth2 import service_account
import os

SCOPES = ['https://www.googleapis.com/auth/drive']
SERVICE_ACCOUNT_FILE = 'credentials.json'
PARENT_FOLDER_ID = "1J633912fL3GnUOyWpWgaoTmdm0DM3RhS"

def authenticate():
    creds = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)
    return creds

def upload_file(file_path):
    creds = authenticate()
    service = build('drive', 'v3', credentials=creds)

    file_name = os.path.basename(file_path)  # Extracts file name from file_path
    file_metadata = {
        'name': file_name,
        'parents': [PARENT_FOLDER_ID]
    }

    media = MediaFileUpload(file_path, resumable=True)
    file = service.files().create(
        body=file_metadata,
        media_body=media,
        fields='id'
    ).execute()

def upload_folder(folder_path):
    for item in os.listdir(folder_path):
        full_item_path = os.path.join(folder_path, item)
        if os.path.isfile(full_item_path):
            upload_file(full_item_path)
# Usage example: upload all files in a specific local folder
local_folder_path = f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/Returns/depth{depth}_time{time_interval}_window{window}"  # Replace with your local folder path
upload_folder(local_folder_path)

### CNN TRAIN

In [None]:
#ORIGINAL
import pandas as pd
import os
import numpy as np
import math
import importlib
import matplotlib.pyplot as plt
from datetime import datetime

df_returns = pd.DataFrame()
asset_list = ['AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'ALL','AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CARR','CB', 'CCI','CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DASH','DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'ECL', 'EL', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK','GWW', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC','HUM', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITW', 'JNJ', 'JPM', 'KHC','KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCHP', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NUE', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PAYX','PBR', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPG','SPGI', 'SPOT','STLA', 'SYK', 'T', 'TD', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS']
original_starttime = datetime.fromisoformat("2023-03-21 09:30:00.000000+00:00")

level = 15
prediction_ahead = 4
depth = 5 # also 10, 30, 50
time_interval = '5S' # pandas offset aliases: 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 100L = 0.1 s, 10L = 0.01 s, 1L = 0.001 s; 100U = 0.0001 s
window = 5 # also 3, 5, 10

# import the pickle file
import pandas as pd
df = pd.read_pickle(f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/combined/combined_depth{depth}_time{time_interval}_window{window}.pkl")

#sort df first by 'Time' and then by 'Ticker':
df = df.sort_values(by=['Time', 'Ticker'])


x = len(df)*0.6 

def closest_value(x):
    return int(x + 250 - x%250)

input = closest_value(x)

# keep roughly 60 pct of the data as training data and the remaining 40 pct as test data
train_df = df[:int(input)]
# the test set starts where the training set ends, so the two do not overlap
test_df = df[int(input):]

use_gpu = True
use_ramdon_split = False
use_dataparallel = True
import os
import sys
sys.path.insert(0, '..')

if use_gpu:
    from gpu_tools import *
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join([ str(obj) for obj in select_gpu(query_gpu())])

import time
import datetime
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split


torch.manual_seed(42)

# Set batch size
batch_size = 10

# Initialize an empty list to store batches
batched_images = []

# Process arrays in batches
for i in range(0, len(train_df), batch_size):
    batch = train_df['Arrays'].iloc[i:i+batch_size]
    batched_images.append(np.stack(batch))

# Concatenate batches to create the final array
images = np.concatenate(batched_images)

label_df = train_df

# Check the shape of the resulting array
print(images.shape)
print(label_df.shape)

In [None]:
class MyDataset(Dataset):
    
    def __init__(self, img, label):
        self.img = torch.Tensor(img.copy())
        self.label = torch.Tensor(label)
        self.len = len(img)
  
    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        return self.img[idx], self.label[idx]

In [None]:
if not use_ramdon_split:
    train_val_ratio = 0.7
    split_idx = int(images.shape[0] * train_val_ratio)
    train_dataset = MyDataset(images[:split_idx], (label_df.midquote_target).values[:split_idx])
    val_dataset = MyDataset(images[split_idx:], (label_df.midquote_target).values[split_idx:])
else:
    dataset = MyDataset(images, (label_df.midquote_target).values)
    train_val_ratio = 0.7
    train_dataset, val_dataset = random_split(dataset, \
        [int(dataset.len*train_val_ratio), dataset.len-int(dataset.len*train_val_ratio)], \
        generator=torch.Generator().manual_seed(42))
    del dataset

train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True, pin_memory=True)
val_dataloader = DataLoader(val_dataset, batch_size=256, shuffle=False, pin_memory=True)

In [None]:
def init_weights(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.)
    elif isinstance(m, nn.Conv2d):
        torch.nn.init.xavier_uniform_(m.weight)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(5, 3), stride=(3, 1), dilation=(2, 1), padding=(12, 1)),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
            nn.MaxPool2d((2, 1), stride=(2, 1)),
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=(5, 3), stride=(3, 1), dilation=(2, 1), padding=(12, 1)),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
            nn.MaxPool2d((2, 1), stride=(2, 1)),
        )
        self.layer3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=(5, 3), stride=(3, 1), dilation=(2, 1), padding=(12, 1)),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
            nn.MaxPool2d((2, 1), stride=(2, 1)),
        )
        
        # Dynamically calculate the size of the FC layer
        with torch.no_grad():
            # Dummy input with the expected shape (batch, channel, height, width)
            dummy_input = torch.rand(1, 1, 40, 105)
            self._temp_size = self._get_conv_output(dummy_input).view(-1).shape[0]

        self.fc1 = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(self._temp_size, 2),
        )
        self.softmax = nn.Softmax(dim=1)

    def _get_conv_output(self, input_tensor):
        output = self._forward_features(input_tensor)
        return output

    def _forward_features(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return x

    def forward(self, x):
        x = self._forward_features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.fc1(x)
        return x
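
The size of the flattened feature map that `Net` computes with a dummy forward pass can also be derived by hand from the output-size formula documented for PyTorch's `Conv2d` and `MaxPool2d` (floor mode): `out = floor((in + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1`. The sketch below applies it to the three layers above for a 40 x 105 input. The helper names `conv_out` and `net_feature_shape` are illustrative and not part of the notebook.

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def net_feature_shape(height=40, width=105, channels=(64, 128, 256)):
    """Shape (channels, height, width) after the three conv + pool blocks."""
    out_channels = None
    for out_channels in channels:
        # Conv2d(kernel=(5, 3), stride=(3, 1), dilation=(2, 1), padding=(12, 1))
        height = conv_out(height, kernel=5, stride=3, padding=12, dilation=2)
        width = conv_out(width, kernel=3, stride=1, padding=1, dilation=1)
        # MaxPool2d((2, 1), stride=(2, 1))
        height = conv_out(height, kernel=2, stride=2)
        width = conv_out(width, kernel=1, stride=1)
    return out_channels, height, width

shape = net_feature_shape()
flat_size = shape[0] * shape[1] * shape[2]

print(shape)
print(flat_size)
```

The height shrinks 40 -> 19 -> 9 -> 9 -> 4 -> 7 -> 3 while the width stays at 105, so the linear layer receives 256 * 3 * 105 inputs.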


In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
export_onnx = True
net = Net().to(device)
net.apply(init_weights)  # Ensure init_weights is defined elsewhere

if export_onnx:
    import torch.onnx
    # Adjust the dummy input to match your actual input size and device
    x = torch.randn([1, 1, 40, 105]).to(device)
    torch.onnx.export(net, x, "../cnn_baseline.onnx", export_params=True,  # Changed to export_params=True to include weights
                      opset_version=10, do_constant_folding=True,  # Adjusted for optimization
                      input_names=['input_images'], output_names=['output_prob'],
                      dynamic_axes={'input_images': {0: 'batch_size'}, 'output_prob': {0: 'batch_size'}})


In [None]:
for images, labels in train_dataloader:
    # Add a channel dimension to make it [batch_size, 1, height, width]
    images = images.unsqueeze(1)  # This changes shape from [128, 40, 105] to [128, 1, 40, 105]
    # Now you can forward pass this through your network
    outputs = net(images.to(device))
    # Rest of your training loop...
    break  # This break is just to stop the loop here for demonstration

In [None]:
from thop import profile as thop_profile
# Creating a dummy input tensor with the correct shape [1, 1, 40, 105]
# This assumes a single sample, but you can adjust the first dimension as needed
input_tensor = torch.randn(1, 40, 105).unsqueeze(1).to(device)  # Now shape is [1, 1, 40, 105]

# Profiling with THOP
flops, params = thop_profile(net, inputs=(input_tensor,))
print('FLOPs = ' + str(flops / 1000**3) + 'G')
print('Params = ' + str(params / 1000**2) + 'M')


In [None]:
def train_loop(dataloader, net, loss_fn, optimizer):
    
    running_loss = 0.0
    current = 0
    net.train()
    
    with tqdm(dataloader) as t:
        for batch, (X, y) in enumerate(t):
            X = X.to(device)
            y = y.to(device)
            y_pred = net(X)
            loss = loss_fn(y_pred, y.long())
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss = (len(X) * loss.item() + running_loss * current) / (len(X) + current)
            current += len(X)
            t.set_postfix({'running_loss':running_loss})
    
    return running_loss


In [None]:
def val_loop(dataloader, net, loss_fn):

    running_loss = 0.0
    current = 0
    net.eval()
    
    with torch.no_grad():
        with tqdm(dataloader) as t:
            for batch, (X, y) in enumerate(t):
                X = X.to(device)
                y = y.to(device)
                y_pred = net(X)
                loss = loss_fn(y_pred, y.long())

                # Sample-weighted running mean of the batch losses (same update as in train_loop)
                running_loss = (len(X) * loss.item() + running_loss * current) / (len(X) + current)
                current += len(X)
            
    return running_loss

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer):
    model.train()  # Switch back to training mode; val_loop leaves the model in eval mode
    total_loss = 0
    for batch, (X, y) in enumerate(dataloader):
        X = X.unsqueeze(1)  # Ensure X has the correct shape
        X, y = X.to(device), y.to(device)
        
        # Convert target tensor to long
        y = y.long()  # This line fixes the error

        optimizer.zero_grad()
        pred = model(X)
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    average_loss = total_loss / len(dataloader)
    print(f"Avg training loss: {average_loss}")
    return average_loss

def val_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    total_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X = X.unsqueeze(1)  # Ensure X has the correct shape
            X, y = X.to(device), y.to(device)
            y = y.long()  # Convert target tensor to long

            pred = model(X)
            total_loss += loss_fn(pred, y).item()

    average_loss = total_loss / num_batches
    print(f"Avg validation loss: {average_loss}")
    return average_loss


In [None]:
if use_gpu and use_dataparallel and 'DataParallel' not in str(type(net)):
    net = net.to(device)
    net = nn.DataParallel(net)

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)

start_epoch = 0
min_val_loss = 1e9
last_min_ind = -1
early_stopping_epoch = 5


In [None]:
start_time = datetime.datetime.now().strftime('%Y%m%d_%H:%M:%S')
# Create the checkpoint directory that torch.save writes to below
os.makedirs('../pt' + os.sep + start_time, exist_ok=True)
epochs = 100
for t in range(start_epoch, epochs):
    print(f"Epoch {t}\n-------------------------------")
    time.sleep(0.2)
    train_loss = train_loop(train_dataloader, net, loss_fn, optimizer)
    val_loss = val_loop(val_dataloader, net, loss_fn)
    tb.add_scalar("train_loss", train_loss, t)  # train_loss is a single float, so log it as a scalar
    torch.save(net, '../pt'+os.sep+start_time+os.sep+'baseline_epoch_{}_train_{:5f}_val_{:5f}.pt'.format(t, train_loss, val_loss)) 
    if val_loss < min_val_loss:
        last_min_ind = t
        min_val_loss = val_loss
    elif t - last_min_ind >= early_stopping_epoch:
        break

print('Done!')
print('Best epoch: {}, val_loss: {}'.format(last_min_ind, min_val_loss))
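
The stopping rule in the loop above can be read in isolation: training continues while the validation loss keeps improving, and stops once `early_stopping_epoch` epochs have passed since the best one. The sketch below replays that rule over a hypothetical list of validation losses; `best_epoch_with_patience` is an illustrative helper, not part of the notebook.

```python
def best_epoch_with_patience(val_losses, patience):
    """Replay the early-stopping rule over a sequence of validation losses.

    Returns (best_epoch, best_loss, stopped_at); stopped_at is None when the
    sequence ends before the patience runs out.
    """
    best_loss = float("inf")
    best_epoch = -1
    stopped_at = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            stopped_at = epoch
            break
    return best_epoch, best_loss, stopped_at


# Hypothetical validation losses: the minimum so far is at epoch 1.
losses = [0.90, 0.80, 0.85, 0.83, 0.82, 0.81, 0.79]
result_short = best_epoch_with_patience(losses, patience=3)
result_long = best_epoch_with_patience(losses, patience=5)
print(result_short)
print(result_long)
```

With a patience of 3 the loop stops before it reaches the later improvement at epoch 6, which is the usual trade-off when choosing the patience value.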

### CNN TEST

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import os
import sys
sys.path.insert(0, '..')

if use_gpu:
    from utils.gpu_tools import *
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join([ str(obj) for obj in select_gpu(query_gpu())])

os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split

torch.manual_seed(42)

In [None]:

import time
import datetime
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split


torch.manual_seed(42)

# Set batch size
batch_size = 10

# Initialize an empty list to store batches
batched_images = []

# Process arrays in batches
for i in range(0, len(test_df), batch_size):
    batch = test_df['Arrays'].iloc[i:i+batch_size]
    batched_images.append(np.stack(batch))

# Concatenate batches to create the final array
images = np.concatenate(batched_images)

label_df = test_df

# Check the shape of the resulting array
print(images.shape)
print(label_df.shape)


In [None]:
class MyDataset(Dataset):
    
    def __init__(self, img, label):
        self.img = torch.Tensor(img.copy())
        self.label = torch.Tensor(label)
        self.len = len(img)
  
    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        return self.img[idx], self.label[idx]

In [None]:
dataset = MyDataset(images, (label_df.midquote_target).values)

In [None]:
test_dataloader = DataLoader(dataset, batch_size=2048, shuffle=False)

### PLOT THE IMAGE

In [None]:
from datetime import datetime, timedelta
def add_time_units(original_starttime, unit, units_to_add):
    # Define the conversion of each unit to seconds
    unit_conversion = {
        '60S': 60,       # 60 seconds
        '30S': 30,       # 30 seconds
        '10S': 10,       # 10 seconds
        '5S': 5,         # 5 seconds
        '1S': 1,         # 1 second
        '100L': 0.1,     # 100 milliseconds
        '10L': 0.01,     # 10 milliseconds
        '1L': 0.001,     # 1 millisecond
        '100U': 0.0001   # 100 microseconds
    }
    # Calculate the total seconds to add
    total_seconds_to_add = units_to_add * unit_conversion[unit]
    # Add the time to the original datetime
    return original_starttime + timedelta(seconds=total_seconds_to_add)
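
As a worked example of this conversion, the sketch below shifts the 09:30:00 start by 78 steps of 100 milliseconds and formats the result the way the image filenames expect. It assumes the interval labels follow the pandas offset aliases, where `L` means milliseconds and `U` means microseconds; the table `UNIT_MICROSECONDS` and the helper `shift_start` are hypothetical names, and integer microseconds are used so that no float rounding is involved.

```python
from datetime import datetime, timedelta

# Interval labels mapped to whole microseconds (assumed alias meanings).
UNIT_MICROSECONDS = {
    "1S": 1_000_000,  # 1 second
    "100L": 100_000,  # 100 milliseconds
    "10L": 10_000,    # 10 milliseconds
    "1L": 1_000,      # 1 millisecond
    "100U": 100,      # 100 microseconds
}


def shift_start(start, unit, steps):
    """Return `start` moved forward by `steps` intervals of size `unit`."""
    return start + timedelta(microseconds=steps * UNIT_MICROSECONDS[unit])


start = datetime.fromisoformat("2023-03-21 09:30:00.000000+00:00")
stamp = shift_start(start, "100L", 78).strftime("%Y-%m-%d_%H-%M-%S.%f%z")
print(stamp)  # 2023-03-21_09-30-07.800000+0000
```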

In [None]:
# PLOT THE IMAGE
import matplotlib.pyplot as plt
import numpy as np
import os

original_starttime = datetime.fromisoformat("2023-03-21 09:30:00.000000+00:00")  
# Directory where the image arrays are saved
save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"  # Change this to your desired save directory

# Format new_datetime_start like 2023-03-21_09-30-07.000000+0000
new_datetime_start = add_time_units(original_starttime, time_interval, 78).strftime('%Y-%m-%d_%H-%M-%S.%f%z')
print(new_datetime_start)
# Load the image array {asset}_depth{depth}_interval{time_interval}_window{window}_{idx}.npz
filename = f"depth{depth}_time{time_interval}_window{window}_date2023-03-21/{asset_list[0]}_depth{depth}_interval{time_interval}_window{window}_date2023-03-21_{new_datetime_start}.npz"

load_path = os.path.join(save_dir, filename)
img_array = np.load(load_path)['arr_0']
print(filename)
# Display the image
plt.figure(figsize=(18, 9))
plt.imshow(img_array, cmap='gray')

# display the dimension of the image
img_array.shape

### DELETE THE FOLDERS

In [None]:
# DELETE THE FOLDERS
import os
import shutil

def delete_all_in_folder(folder):
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print('Failed to delete %s. Reason: %s' % (file_path, e))

# Define your folder paths
folder1 = f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/Returns/depth{depth}_time{time_interval}_window{window}"
folder2 = f"/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth{depth}_time{time_interval}_window{window}"

# Delete all elements in both folders
delete_all_in_folder(folder1)
delete_all_in_folder(folder2)


### PLAYING AROUND

In [None]:
# display the image
import matplotlib.pyplot as plt
import numpy as np
import os

# Directory where the image arrays are saved
save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"  # Change this to your desired save directory

# Load the image array {asset}_depth{depth}_interval{time_interval}_window{window}_{idx}.npz
filename = f"depth{depth}_time{time_interval}_window{window}/{asset[0]}_depth{depth}_interval{time_interval}_window{window}_2023-03-21_09-30-03.600000+0000.npz"
load_path = os.path.join(save_dir, filename)
img_array = np.load(load_path)['arr_0']

# Display the image
plt.figure(figsize=(20, 10))
plt.imshow(img_array, cmap='gray')

# display the dimensions of the image
img_array.shape

In [None]:
import numpy as np
import os

def load_npz_files_in_batches(directory, batch_size=10000000000):
    df_list = []
    batch_count = 0

    # Get all .npz files and sort them
    all_files = [f for f in os.listdir(directory) if f.endswith('.npz')]
    sorted_files = sorted(all_files)

    # Iterate through the sorted list of files
    for filename in sorted_files:
        #print(f"Processing file: {filename}")  # Print the filename being processed
        file_path = os.path.join(directory, filename)
        data = np.load(file_path)['arr_0']
        df_list.append(data)
        batch_count += 1

        if batch_count >= batch_size:
            yield df_list
            df_list = []
            batch_count = 0

    if df_list:  # Yield remaining files in the last batch
        yield df_list

# Usage
directory = '/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth10_time10L_window5'
for df_array in load_npz_files_in_batches(directory):
    # Process each batch here
    print(f"Batch size: {len(df_array)}")  # Print the size of each batch
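
The batching pattern in `load_npz_files_in_batches` does not depend on `.npz` files. A minimal sketch on plain strings, with a hypothetical helper `iter_batches`, shows how the generator yields full batches and then the remainder:

```python
def iter_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # Yield the final, possibly shorter, batch
        yield batch


# Hypothetical filenames standing in for the sorted .npz files.
names = sorted(f"img_{i:03d}.npz" for i in range(7))
sizes = [len(b) for b in iter_batches(names, batch_size=3)]
print(sizes)  # [3, 3, 1]
```

With a very large `batch_size`, as in the cell above, the generator yields a single batch containing every file, so memory use is the same as loading everything at once.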


In [None]:
import matplotlib.pyplot as plt

# Display one image from the last loaded batch
plt.figure(figsize=(20, 10))
plt.imshow(df_array[59], cmap='gray')

In [None]:
# These arrays are the predictors; train the model on them.
i = 3
print(df_array[i])
print(returns_df['midquote'][i])

### BEST ATTEMPT

In [None]:
import importlib
import math
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import config
importlib.reload(config)  # Pick up edits to config.py without restarting the kernel


# Helper function to concatenate orderbooks
def concatenate_orderbooks(asset_list, date, level, directory):
    orderbook_df_list = []
    for asset in asset_list:
        file_path = os.path.join(directory, f'{asset}_{date}_orderbook_{level}.feather')
        orderbook_df_list.append(pd.read_feather(file_path))
    orderbook = pd.concat(orderbook_df_list)
    return orderbook

def round_up_timestamp(ts, freq):
    """Round up a timestamp based on a given frequency"""
    return (ts + pd.Timedelta(freq) - pd.Timedelta('1ns')).floor(freq)

def split_by_seconds_optimized(df, freq, window):
    df['ts'] = pd.to_datetime(df['ts'])
    # Round up the timestamps
    df['ts'] = df['ts'].apply(lambda x: round_up_timestamp(x, freq))
    df.set_index('ts', inplace=True)
    if 'asset' not in df.columns:
        raise ValueError("DataFrame must have an 'asset' column")
    asset_results = {}
    for asset, asset_df in df.groupby('asset'):
        # Resample the dataframe
        resampled_df = asset_df.resample(freq).last()
        
        # Forward-fill any NaN values
        resampled_df.ffill(inplace=True)
        
        # If the first row(s) are still NaN, drop them
        resampled_df.dropna(inplace=True)
        
        interval_dataframes = resampled_df.groupby(resampled_df.index).tail(1)
        cols_to_drop = ["type", "order_id", "m_size", "m_price", "direction", "spread", "hidden_volume", "volume", "midquote", "asset"]
        interval_dataframes.drop(columns=cols_to_drop, inplace=True)
        interval_dataframes = interval_dataframes.T
        rolling_windows = [None] * (len(interval_dataframes.columns) - (window - 1))
        for start_col in range(len(interval_dataframes.columns) - (window - 1)):
            end_col = start_col + window
            window_df = interval_dataframes.iloc[:, start_col:end_col]
            rolling_windows[start_col] = window_df
        asset_results[asset] = rolling_windows
    return asset_results


# Truncate each window to the first depth * 4 rows
# (four rows per level: price and size on both the ask and bid side)
def truncate_results(asset_results, depth):
    truncated_results = {}
    num_rows = depth * 4  # Convert depth to number of rows to keep
    
    for asset, windows in asset_results.items():
        truncated_windows = [window.head(num_rows) for window in windows]
        truncated_results[asset] = truncated_windows
    
    return truncated_results


# Main function to process orderbooks
def process_orderbooks(asset_list, date, level, freq, window, start_time, end_time, directory, depth):
    orderbook = concatenate_orderbooks(asset_list, date, level, directory)
    orderbook = orderbook.loc[orderbook['ts'].between(start_time, end_time)].reset_index(drop=True)
    asset_results = split_by_seconds_optimized(orderbook, freq, window)
    asset_results = truncate_results(asset_results, depth)
    return asset_results

# USE:
asset = ['TSLA']
date = "2023-03-21"
level = 50
directory = r"/Users/jensknudsen/Desktop/LOBSTER_DATA/Data"
start_time = "2023-03-21 09:30:00.000000+00:00"
end_time = "2023-03-21 09:31:00.000000+00:00"
# Dynamic parameters:
depth = 50 # also 10, 30, 50
time_interval = '1L' # 1S = 1 s; 100L = 0.1 s; 10L = 0.01 s; 1L = 0.001 s; 100U = 0.0001 s
window = 10 # also 3, 5, 10

results = process_orderbooks(asset, date, level, time_interval, window, start_time, end_time, directory, depth)
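
The rolling-window step inside `split_by_seconds_optimized` slices the resampled timestamps into overlapping groups of `window` consecutive columns, which gives `n - (window - 1)` windows for `n` timestamps. The sketch below shows the same slicing on a plain list; `rolling_windows` and the labels `t0` to `t4` are hypothetical.

```python
def rolling_windows(columns, window):
    """Return every run of `window` consecutive entries, in order."""
    count = len(columns) - (window - 1)
    return [columns[start:start + window] for start in range(max(count, 0))]


# Hypothetical timestamp labels standing in for the resampled columns.
timestamps = ["t0", "t1", "t2", "t3", "t4"]
windows = rolling_windows(timestamps, window=3)
print(len(windows), windows[0], windows[-1])
```

Consecutive windows share `window - 1` timestamps, so neighbouring images built from them overlap heavily; this matters when splitting the data into training and validation sets.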


In [None]:
# Make a shallow copy of the results dict under the name asset_results_10ms
asset_results_10ms = results.copy()

# NEW IMPROVED CODE
from helper_functions import scale_dataframe
from concurrent.futures import ProcessPoolExecutor
import os

def scale_dataframes_concurrently(dfs):
    # Parallelize the scaling across processes, leaving one CPU core free.
    # os.cpu_count() can return None, and a single-core machine would give 0 workers.
    num_workers = max(1, (os.cpu_count() or 2) - 1)
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(scale_dataframe, dfs))

def scale_asset_dataframes(asset_data):
    for asset, dfs in asset_data.items():
        # Use the concurrent scaling function
        asset_data[asset] = scale_dataframes_concurrently(dfs)
    return asset_data

asset_results_10ms = scale_asset_dataframes(asset_results_10ms)
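
`scale_dataframe` is imported from `helper_functions` and is not shown here. Assuming it behaves like the commented-out version further down, it min-max scales all price rows of a window together and all size rows together, so both end up in the range 0 to 1. A minimal sketch of that scaling on plain lists follows; `min_max_scale` and the numbers are hypothetical, and the zero-range guard is an addition that the commented-out version does not have.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series is mapped to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical prices and sizes from one window, scaled separately.
prices = [100.0, 100.5, 101.0, 99.5]
sizes = [10, 40, 25, 10]
scaled_prices = min_max_scale(prices)
scaled_sizes = min_max_scale(sizes)
print(scaled_prices)
print(scaled_sizes)
```

Scaling prices and sizes separately keeps the two quantities comparable inside one image even though their raw magnitudes differ by orders of magnitude.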

In [None]:
import pandas as pd

def detect_missing_timestamps(asset_results):
    missing_timestamps_info = {}

    for asset_name, dataframes_list in asset_results.items():
        missing_timestamps_dataframes = []

        for df_index, df in enumerate(dataframes_list):
            # Convert column names to datetime
            try:
                timestamps = pd.to_datetime(df.columns)
            except Exception as e:
                print(f"Error converting timestamps for {asset_name} at index {df_index}: {e}")
                continue

            # Calculate differences between consecutive timestamps
            time_diffs = timestamps.to_series().diff()

            # Check for missing intervals (gaps larger than 0.01 seconds);
            # adjust this threshold if time_interval is not 10 ms
            if any(time_diffs > pd.Timedelta(seconds=0.01)):
                missing_timestamps_dataframes.append(df_index)

        if missing_timestamps_dataframes:
            missing_timestamps_info[asset_name] = missing_timestamps_dataframes

    return missing_timestamps_info

# Example usage
missing_dfs_info = detect_missing_timestamps(asset_results_10ms)
print("Dataframes with missing timestamps:", missing_dfs_info)
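
The gap check reduces to comparing each timestamp with its predecessor. The sketch below does this with the standard library only; `find_gaps` and the sample timestamps are hypothetical, and the 10 ms step is an assumption matching the threshold used above.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_step):
    """Return (previous, current) pairs that are more than `max_step` apart."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > max_step:
            gaps.append((prev, curr))
    return gaps


# Hypothetical 10 ms grid with the 30 ms and 40 ms points missing.
base = datetime(2023, 3, 21, 9, 30)
stamps = [base + timedelta(milliseconds=ms) for ms in (0, 10, 20, 50, 60)]
gaps = find_gaps(stamps, max_step=timedelta(milliseconds=10))
print(gaps)
```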

In [None]:
# THE ONE THAT WORKS
#import pandas as pd
#import numpy as np
#import numba
#
#@numba.jit(nopython=True)
#def scale_values(values, min_val, max_val):
#    return (values - min_val) / (max_val - min_val)
#
#def scale_rowwise(df, rows):
#    values = df.loc[rows].values
#    min_val = values.min()
#    max_val = values.max()
#    scaled_values = scale_values(values, min_val, max_val)
#    df.loc[rows] = scaled_values
#    return df
#
#def scale_dataframe(dataframe):
#    # Identify rows
#    price_rows = dataframe.index[dataframe.index.str.contains('price')]
#    size_rows = dataframe.index[dataframe.index.str.contains('size')]
#
#    # Scale rows separately
#    dataframe = scale_rowwise(dataframe, price_rows)
#    dataframe = scale_rowwise(dataframe, size_rows)
#    
#    return dataframe
#
#def scale_asset_dataframes(asset_data):
#    for asset, dfs in asset_data.items():
#        scaled_dfs = [scale_dataframe(df.copy()) for df in dfs]
#        asset_data[asset] = scaled_dfs
#    return asset_data
#

In [None]:
#asset_results_10ms['TSLA'][0].loc[asset_results_10ms['TSLA'][0].index.str.contains('bid_price')]
#asset_results_10ms['TSLA'][0].loc[asset_results_10ms['TSLA'][0].index.str.contains('ask_size')]

In [None]:
##delete every element in this folder: /Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays:
#import os, shutil
#folder = '/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays/depth50_time1L_window10'
#
#for filename in os.listdir(folder):
#    file_path = os.path.join(folder, filename)
#    try:
#        if os.path.isfile(file_path) or os.path.islink(file_path):
#            os.unlink(file_path)  # delete the file
#        elif os.path.isdir(file_path):
#            shutil.rmtree(file_path)  # delete the folder
#    except Exception as e:
#        print('Failed to delete %s. Reason: %s' % (file_path, e))

In [None]:
#    # Image dimensions
#def save_images_from_dataframes_fast(asset_data, save_dir, depth, time_interval, window):
#    IMAGE_WIDTH = (len(asset_data['TSLA'][0].columns) * 20) + len(asset_data['TSLA'][0].columns)
#    IMAGE_HEIGHT = int((len(asset_data['TSLA'][0]) / 2) * 2)
#    # Define pixel boundaries for price values
#    pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT+1)
#
#    def dataframe_to_image_arrays(df):
#        # Create an array to store images for all columns
#
#        all_images = np.zeros(shape=(df.shape[1], IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)
#        
#        for col_idx, column in enumerate(df.columns):
#            mid = (col_idx * 21) + 10
#            aggregated_sizes = np.zeros(IMAGE_HEIGHT)
#            # Extract prices and sizes for the current column
#            prices = df[column].iloc[::2].values
#
#            sizes = df[column].iloc[1::2].values
#            price_positions = IMAGE_HEIGHT - np.digitize(prices, pixel_boundaries)
#            price_positions = np.clip(price_positions, 0, IMAGE_HEIGHT-1)
#            
#            # Aggregate sizes for this pixel value
#            np.add.at(aggregated_sizes, price_positions, sizes)
#            
#            # Normalize the aggregated sizes for this column
#            max_size = np.max(aggregated_sizes)
#            normalized_sizes = aggregated_sizes / max_size if max_size != 0 else aggregated_sizes
#            for idx in range(0, len(prices)):
#                price_pos = price_positions[idx]
#
#                all_images[col_idx, price_pos, mid] = 255
#                
#                # Use normalized aggregated size for this pixel value
#                size_value = normalized_sizes[price_pos]
#                line_length = int(10 * size_value)
#                
#                if 'ask_size' in df.index[2*idx + 1]:
#                    all_images[col_idx, price_pos, mid:mid+line_length] = 255
#                else:
#                    all_images[col_idx, price_pos, mid-line_length:mid] = 255
#        
#        return all_images
#    # Loop through each asset and each of its dataframes
#    for asset, dfs in asset_data.items():
#
#        for df in dfs:
#            all_images_array = dataframe_to_image_arrays(df)
#            for col_idx, column in enumerate(df.columns):
#                img_array = all_images_array[col_idx]
#                # Incorporate the column name in the filename and removed the idx
#                filename = f"{asset}_{column}_depth{depth}_interval{time_interval}_window{window}.npz"
#                save_path = os.path.join(save_dir, filename)
#                np.savez_compressed(save_path, img_array)
## The function processes the entire dataframe at once and then saves images for each column afterward.
#
#save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"
#
#save_images_from_dataframes_fast(asset_results_10ms, save_dir, depth, time_interval, window)

In [None]:
## CLAES CODE:
#import re
#
#def sanitize_filename(filename):
#    """Sanitize the filename by replacing spaces and special characters with underscores."""
#    sanitized = re.sub(r'[^\w\s]', '_', filename)  # Replace special characters with underscores
#    sanitized = sanitized.replace(' ', '_')  # Replace spaces with underscores
#    return sanitized
#
#def save_images_from_dataframes_sanitized(asset_data, save_dir, depth, time_interval, window):
#    # Image dimensions
#    IMAGE_WIDTH = (len(asset_data['TSLA'][0].columns) * 20) + len(asset_data['TSLA'][0].columns)
#    IMAGE_HEIGHT = int((len(asset_data['TSLA'][0]) / 2) * 2)
#
#    # Define pixel boundaries for price values
#    pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT+1)
#
#    def dataframe_to_image_array(df):
#        img = np.zeros(shape=(IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)
#        
#        for col_idx, column in enumerate(df.columns):
#            mid = (col_idx * 21) + 10
#            aggregated_sizes = np.zeros(IMAGE_HEIGHT)
#
#            # Extract prices and sizes for the current column
#            prices = df[column].iloc[::2].values
#            sizes = df[column].iloc[1::2].values
#            price_positions = IMAGE_HEIGHT - np.digitize(prices, pixel_boundaries)
#            price_positions = np.clip(price_positions, 0, IMAGE_HEIGHT-1)
#            
#            # Aggregate sizes for this pixel value
#            np.add.at(aggregated_sizes, price_positions, sizes)
#            
#            # Normalize the aggregated sizes for this column
#            max_size = np.max(aggregated_sizes)
#            normalized_sizes = aggregated_sizes / max_size if max_size != 0 else aggregated_sizes
#
#            for idx in range(0, len(prices)):
#                price_pos = price_positions[idx]
#                img[price_pos, mid] = 255
#                
#                # Use normalized aggregated size for this pixel value
#                size_value = normalized_sizes[price_pos]
#                line_length = int(10 * size_value)
#                
#                if 'ask_size' in df.index[2*idx + 1]:
#                    img[price_pos, mid:mid+line_length] = 255
#                else:
#                    img[price_pos, mid-line_length:mid] = 255
#                
#        return img
#
#    # Loop through each asset and each of its dataframes
#    for asset, dfs in asset_data.items():
#        for df in dfs:
#            img_array = dataframe_to_image_array(df)
#            first_column_name = str(df.columns[0]).replace(' ', '_')  # Convert to string and replace spaces
#            filename = f"{asset}_{first_column_name}_depth{depth}_interval{time_interval}_window{window}.npz"
#            sanitized_filename = sanitize_filename(filename)
#            save_path = os.path.join(save_dir, sanitized_filename)
#            np.savez_compressed(save_path, img_array)
#
## The filename now includes the first column of the relevant dataframe, and is sanitized to be safe for file operations.
#save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"
#save_images_from_dataframes_sanitized(asset_results_10ms, save_dir, depth, time_interval, window)

In [None]:
# NEW IMPROVED CODE
def save_images_from_dataframes_aligned(asset_data, save_dir, depth, time_interval, window):
    # Image dimensions, taken from the first dataframe of the first asset
    first_df = next(iter(asset_data.values()))[0]
    IMAGE_WIDTH = len(first_df.columns) * 21  # 21 pixel columns per timestamp
    IMAGE_HEIGHT = len(first_df)  # One pixel row per dataframe row

    # Define pixel boundaries for price values
    pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT+1)

    def dataframe_to_image_array(df):
        img = np.zeros(shape=(IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)
        
        for col_idx, column in enumerate(df.columns):
            mid = (col_idx * 21) + 10
            aggregated_sizes = np.zeros(IMAGE_HEIGHT)

            # Extract prices and sizes for the current column
            prices = df[column].iloc[::2].values
            sizes = df[column].iloc[1::2].values
            price_positions = IMAGE_HEIGHT - np.digitize(prices, pixel_boundaries)
            price_positions = np.clip(price_positions, 0, IMAGE_HEIGHT-1)
            
            # Aggregate sizes for this pixel value
            np.add.at(aggregated_sizes, price_positions, sizes)
            
            # Normalize the aggregated sizes for this column
            max_size = np.max(aggregated_sizes)
            normalized_sizes = aggregated_sizes / max_size if max_size != 0 else aggregated_sizes

            for idx in range(0, len(prices)):
                price_pos = price_positions[idx]
                img[price_pos, mid] = 255
                
                # Use normalized aggregated size for this pixel value
                size_value = normalized_sizes[price_pos]
                line_length = int(10 * size_value)
                
                if 'ask_size' in df.index[2*idx + 1]:
                    img[price_pos, mid:mid+line_length] = 255
                else:
                    img[price_pos, mid-line_length:mid] = 255
                
        return img

    # Loop through each asset and each of its dataframes
    for asset, dfs in asset_data.items():
        for idx, df in enumerate(dfs):
            img_array = dataframe_to_image_array(df)
            filename = f"{asset}_depth{depth}_interval{time_interval}_window{window}_{idx}.npz"
            save_path = os.path.join(save_dir, filename)
            np.savez_compressed(save_path, img_array)

# This version mirrors the loop-based function while using vectorized NumPy
# operations (np.digitize, np.add.at) for speed. It has not been validated yet:
# compare its output with the loop-based version on real data before relying on it.
save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"
save_images_from_dataframes_aligned(asset_results_10ms, save_dir, depth, time_interval, window)
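
The key step in the image construction is mapping a scaled price in the range 0 to 1 to a pixel row, with high prices at the top of the image. The sketch below reproduces that mapping with `bisect_right` from the standard library as a pure-Python stand-in for `np.digitize` with increasing bin edges; `price_rows` is a hypothetical helper and the prices are made up.

```python
from bisect import bisect_right


def price_rows(scaled_prices, height):
    """Map scaled prices in [0, 1] to pixel rows, with row 0 at the top."""
    # Bin edges equivalent to np.linspace(0, 1, height + 1)
    edges = [i / height for i in range(height + 1)]
    rows = []
    for price in scaled_prices:
        bin_index = bisect_right(edges, price)  # Same bin as np.digitize(price, edges)
        row = height - bin_index                # Invert so higher prices sit higher up
        rows.append(max(0, min(height - 1, row)))  # Clip to valid rows
    return rows


rows = price_rows([0.0, 0.3, 0.5, 1.0], height=4)
print(rows)  # [3, 2, 1, 0]
```

A price of exactly 1.0 falls beyond the last bin edge and would land on row -1, which is why the clipping step is needed.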

In [None]:
## OLD GOOD CODE
#import numpy as np
#import os
#
#def save_images_from_dataframes(asset_data, save_dir, depth, time_interval, window):
#    # Image dimensions
#    IMAGE_WIDTH = (len(asset_data['TSLA'][0].columns) * 20) + len(asset_data['TSLA'][0].columns)
#    IMAGE_HEIGHT = int((len(asset_data['TSLA'][0]) / 2) * 2)
#
#    # Define pixel boundaries for price values
#    pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT+1)
#
#    def dataframe_to_image_array(df):
#        img = np.zeros(shape=(IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)
#        for col_idx, column in enumerate(df.columns):
#            mid = (col_idx * 21) + 10
#            aggregated_sizes = {}
#            for idx in range(0, len(df.index), 2):  # Step by 2 because of price-size pairs
#                price_label = df.index[idx]
#                size_label = df.index[idx + 1]
#                
#                price_value = df[column][price_label]
#                price_pos = IMAGE_HEIGHT - np.digitize(price_value, pixel_boundaries)
#                price_pos = max(0, min(IMAGE_HEIGHT-1, price_pos))
#                
#                # Aggregate sizes for this pixel value
#                size_value = df[column][size_label]
#                aggregated_sizes[price_pos] = aggregated_sizes.get(price_pos, 0) + size_value
#
#            # Normalize the aggregated sizes for this column
#            max_size = max(aggregated_sizes.values())
#            for key in aggregated_sizes:
#                aggregated_sizes[key] /= max_size
#
#            for idx in range(0, len(df.index), 2):
#                price_label = df.index[idx]
#                size_label = df.index[idx + 1]
#                
#                price_value = df[column][price_label]
#                price_pos = IMAGE_HEIGHT - np.digitize(price_value, pixel_boundaries)
#                price_pos = max(0, min(IMAGE_HEIGHT-1, price_pos))
#                
#                img[price_pos, mid] = 255
#                
#                # Use normalized aggregated size for this pixel value
#                size_value = aggregated_sizes[price_pos]
#                line_length = int(10 * size_value)
#                
#                if 'ask_size' in size_label:
#                    img[price_pos, mid:mid+line_length] = 255
#                else:
#                    img[price_pos, mid-line_length:mid] = 255
#                
#        return img
#
#    # Loop through each asset and each of its dataframes
#    for asset, dfs in asset_data.items():
#        for idx, df in enumerate(dfs):
#            img_array = dataframe_to_image_array(df)
#            filename = f"{asset}_depth{depth}_interval{time_interval}_window{window}_{idx}.npz"
#            save_path = os.path.join(save_dir, filename)
#            np.savez_compressed(save_path, img_array)
#
#
#save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"
#save_images_from_dataframes(asset_results_10ms, save_dir, depth, time_interval, window)
#

### Display the images:

In [None]:
# display the image
import matplotlib.pyplot as plt
import numpy as np
import os

# Directory where the image arrays are saved
save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"  # Change this to your desired save directory

# Load the image array {asset}_depth{depth}_interval{time_interval}_window{window}_{idx}.npz
filename = "TSLA_depth50_interval100L_window10_1.npz"
load_path = os.path.join(save_dir, filename)
img_array = np.load(load_path)['arr_0']

# Display the image
plt.figure(figsize=(20, 10))
plt.imshow(img_array, cmap='gray')

In [None]:
## display the image
#import matplotlib.pyplot as plt
#import numpy as np
#import os
#
## Directory where the image arrays are saved
#save_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"  # Change this to your desired save directory
#
## Load the image array
#filename = "TSLA_2023_03_21_09_30_40_00_00_depth50_interval100L_window5_npz.npz"
#load_path = os.path.join(save_dir, filename)
#img_array = np.load(load_path)['arr_0']
#
## Display the image
#plt.figure(figsize=(20, 10))
#plt.imshow(img_array, cmap='gray')

### TRAIN the data

In [None]:
use_gpu = True
use_ramdon_split = False
use_dataparallel = True

In [None]:
import os
import sys
import shutil
#sys.path.insert(0, '..')

# if use_gpu:
#     from utils.gpu_tools import *
#     os.environ["CUDA_VISIBLE_DEVICES"] = ','.join([ str(obj) for obj in select_gpu(query_gpu())])

import time
import datetime
import numpy as np
import pandas as pd

from tqdm import tqdm

import config

# Pytorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split

torch.manual_seed(42)

%load_ext autoreload
%autoreload 2

In [None]:
def load_saved_images(directory):
    all_images = []
    all_files = [f for f in os.listdir(directory) if f.endswith('.npz')]
    shapes = []
    
    for file in all_files:
        with np.load(os.path.join(directory, file)) as data:
            img_array = data['arr_0']
            shapes.append(img_array.shape)
            all_images.append(img_array)
            
    unique_shapes = set(shapes)
    if len(unique_shapes) > 1:
        print(f"Found inconsistent shapes: {unique_shapes}")
    
    # Depending on your needs, you can return a list of arrays instead of a single numpy array
    return all_images

# Load the images
image_dir = "/Users/jensknudsen/Desktop/LOBSTER_DATA/PROJECT/container_for_arrays"
images = load_saved_images(image_dir)

### THIS IS THE MAIN PICTURE

In [None]:
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame(asset_results_10ms['TSLA'][100])

# Check if the DataFrame needs transposing
if 'ask_price_1' in df.columns:
    df = df.transpose()

# Image dimensions
IMAGE_WIDTH = (len(df.columns) * 20) + len(df.columns)
IMAGE_HEIGHT = int((len(df) / 2) * 2)

img = np.zeros(shape=(IMAGE_HEIGHT, IMAGE_WIDTH), dtype=np.uint8)

# Define pixel boundaries for price values
pixel_boundaries = np.linspace(0, 1, IMAGE_HEIGHT+1)

# Plotting
for col_idx, column in enumerate(df.columns):
    mid = (col_idx * 21) + 10
    aggregated_sizes = {}

    # Step 1: Aggregate the sizes that fall into the same pixel
    for idx in range(0, len(df.index), 2):  # Step by 2 because of price-size pairs
        price_label = df.index[idx]
        size_label = df.index[idx + 1]
        
        price_value = df[column][price_label]
        
        # Determine the corresponding pixel for the price value (inverted for correct placement)
        price_pos = IMAGE_HEIGHT - np.digitize(price_value, pixel_boundaries)
        price_pos = max(0, min(IMAGE_HEIGHT-1, price_pos))
        
        # Aggregate sizes for this pixel value
        size_value = df[column][size_label]
        aggregated_sizes[price_pos] = aggregated_sizes.get(price_pos, 0) + size_value

    # Step 2: Normalize the aggregated sizes for this column
    max_size = max(aggregated_sizes.values())
    if max_size > 0:  # Skip normalization when every size in this column is zero
        for key in aggregated_sizes:
            aggregated_sizes[key] /= max_size

    for idx in range(0, len(df.index), 2):
        price_label = df.index[idx]
        size_label = df.index[idx + 1]
        
        price_value = df[column][price_label]
        
        # Determine the corresponding pixel for the price value (inverted for correct placement)
        price_pos = IMAGE_HEIGHT - np.digitize(price_value, pixel_boundaries)
        price_pos = max(0, min(IMAGE_HEIGHT-1, price_pos))
        img[price_pos, mid] = 255
        
        # Use normalized aggregated size for this pixel value
        size_value = aggregated_sizes[price_pos]
        
        line_length = int(10 * size_value)
        
        if 'ask_size' in size_label:
            img[price_pos, mid:mid+line_length] = 255
        else:
            img[price_pos, mid-line_length:mid] = 255

# Displaying the image
plt.figure(figsize=(20, 10))
plt.imshow(img, cmap='gray', aspect='auto')
plt.xlabel('Timestamps')
plt.ylabel('Scaled Value')
plt.show()
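
The price-to-pixel mapping used in the cell above can be isolated into a small function. This is a minimal sketch, assuming prices have already been scaled to [0, 1]; the name `price_to_row` is hypothetical and not part of the notebook.

```python
import numpy as np

def price_to_row(scaled_price, image_height):
    """Map a price scaled to [0, 1] to an image row (row 0 is the top)."""
    boundaries = np.linspace(0, 1, image_height + 1)
    # np.digitize returns 1..image_height for values in [0, 1) and
    # image_height + 1 for values >= 1, so the result is clipped afterwards.
    row = image_height - int(np.digitize(scaled_price, boundaries))
    return max(0, min(image_height - 1, row))

# The lowest price lands on the bottom row, the highest on the top row.
print(price_to_row(0.0, 10), price_to_row(0.55, 10), price_to_row(1.0, 10))
```

Values outside [0, 1] are clipped to the bottom or top row instead of raising an error, which matches the `max`/`min` clamp in the plotting loop.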


### EXTRACT .7z ARCHIVES INTO A FOLDER

In [None]:
import py7zr
import os
import shutil


# Set the directory containing the .7z archives
source_directory = '/Users/jensknudsen/Desktop/zip_folder'
# Set the directory where you want to extract the files
destination_directory = '/Users/jensknudsen/Desktop/LOBSTER_DATA/Data'

# Ensure the destination directory exists
os.makedirs(destination_directory, exist_ok=True)

# Iterate over each file in the source directory
for filename in os.listdir(source_directory):
    if filename.endswith('.7z'):
        # Full path to the .7z file
        file_path = os.path.join(source_directory, filename)
        print(f"Processing {filename}...")  # Debugging: print the file being processed
        try:
            # Open the .7z file
            with py7zr.SevenZipFile(file_path, mode='r') as archive:
                # Extract all files into a temporary directory within the loop
                temp_dir = os.path.join(source_directory, "temp_extract")
                archive.extractall(path=temp_dir)
                
                # Move only CSV files to the destination directory
                for root, dirs, files in os.walk(temp_dir):
                    for file in files:
                        if file.endswith('.csv'):
                            shutil.move(os.path.join(root, file), os.path.join(destination_directory, file))
                
                # Cleanup the temporary extraction directory
                shutil.rmtree(temp_dir)
                
            print(f"Extracted CSV files from {filename} successfully.")
        except py7zr.Bad7zFile as e:
            print(f"Failed to extract {filename}. The file may be corrupt or not a .7z file. Error: {e}")
        except Exception as e:
            print(f"An error occurred with {filename}: {e}")

print("Done processing all .7z files.")
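
The "move only CSV files" step can also be written as a standalone helper. In the sketch below, `tempfile.TemporaryDirectory` stands in for the hand-made `temp_extract` folder, so cleanup happens even if an error is raised part-way. The helper name `move_csv_files` and the demo file names are hypothetical, and the archive extraction itself is left out.

```python
import os
import shutil
import tempfile

def move_csv_files(src_dir, dst_dir):
    """Move every .csv file found under src_dir into dst_dir; return the moved names."""
    os.makedirs(dst_dir, exist_ok=True)
    moved = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if name.lower().endswith('.csv'):
                shutil.move(os.path.join(root, name), os.path.join(dst_dir, name))
                moved.append(name)
    return sorted(moved)

# Demo on throwaway folders; the whole tree is deleted when the block exits.
with tempfile.TemporaryDirectory() as work_dir:
    extracted = os.path.join(work_dir, 'extracted')
    nested = os.path.join(extracted, 'nested')
    os.makedirs(nested)
    for name in ('a.csv', 'notes.txt'):
        with open(os.path.join(nested, name), 'w') as handle:
            handle.write('demo')
    moved = move_csv_files(extracted, os.path.join(work_dir, 'Data'))
    remaining = os.listdir(nested)

print(moved, remaining)
```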



### Divide tickers into branches

In [None]:
tickers = [
    'AAPL', 'ABBV', 'ABNB', 'ABT', 'ACN', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AFL', 'AIG', 'AJG', 'AMAT', 'AMD', 'AMGN', 'AMT', 'AMX', 'AMZN', 'ANET', 'AON', 'APD', 'APH', 'APO', 'ARM', 'ASML', 'AVGO', 'AXP', 'AZN', 'AZO', 'BA', 'BABA', 'BAC', 'BBVA', 'BDX', 'BHP', 'BKNG', 'BLK', 'BMO', 'BMY', 'BN', 'BNS', 'BP', 'BRK.B', 'BSX', 'BTI', 'BUD', 'BX', 'C', 'CAT', 'CB', 'CDNS', 'CHTR', 'CI', 'CL', 'CMCSA', 'CME', 'CMG', 'CNI', 'CNQ', 'COF', 'COP', 'COST', 'CP', 'CRH', 'CRM', 'CRWD', 'CSCO', 'CSX', 'CTAS', 'CVS', 'CVX', 'DE', 'DELL', 'DEO', 'DHI', 'DHR', 'DIS', 'DUK', 'E', 'ECL', 'EL', 'ELV', 'EMR', 'ENB', 'EOG', 'EPD', 'EQIX', 'EQNR', 'ET', 'ETN', 'EW', 'FCX', 'FDX', 'FI', 'FMX', 'FTNT', 'GD', 'GE', 'GILD', 'GM', 'GOOGL', 'GS', 'GSK', 'HCA', 'HD', 'HDB', 'HLT', 'HMC', 'HON', 'HSBC', 'IBM', 'IBN', 'ICE', 'INFY', 'ING', 'INTC', 'INTU', 'ISRG', 'ITUB', 'ITW', 'JNJ', 'JPM', 'KKR', 'KLAC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'LRCX', 'LULU', 'MA', 'MAR', 'MCD', 'MCK', 'MCO', 'MDLZ', 'MDT', 'MELI', 'MET', 'META', 'MMC', 'MMM', 'MNST', 'MO', 'MPC', 'MRK', 'MRVL', 'MS', 'MSCI', 'MSFT', 'MSI', 'MU', 'MUFG', 'NEE', 'NFLX', 'NGG', 'NKE', 'NOC', 'NOW', 'NSC', 'NTES', 'NVDA', 'NVO', 'NVS', 'NXPI', 'ORCL', 'ORLY', 'OXY', 'PANW', 'PBR', 'PBR.A', 'PCAR', 'PDD', 'PEP', 'PFE', 'PG', 'PGR', 'PH', 'PLD', 'PM', 'PNC', 'PSA', 'PSX', 'PXD', 'PYPL', 'QCOM', 'RACE', 'REGN', 'RELX', 'RIO', 'ROP', 'ROST', 'RSG', 'RTX', 'RY', 'SAN', 'SAP', 'SBUX', 'SCCO', 'SCHW', 'SHEL', 'SHOP', 'SHW', 'SLB', 'SMFG', 'SNOW', 'SNPS', 'SNY', 'SO', 'SONY', 'SPGI', 'STLA', 'SYK', 'T', 'TD', 'TDG', 'TEAM', 'TFC', 'TGT', 'TJX', 'TM', 'TMO', 'TMUS', 'TRI', 'TRV', 'TSLA', 'TSM', 'TT', 'TTE', 'TXN', 'UBER', 'UBS', 'UL', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VALE', 'VLO', 'VRTX', 'VZ', 'WDAY', 'WELL', 'WFC', 'WM', 'WMT', 'XOM', 'ZTS'
]

branches = {
    'Technology': ['AAPL', 'ADBE', 'ADI', 'ADSK', 'AMAT', 'AMD', 'AMZN', 'ASML', 'AVGO', 'CRM', 'CDNS', 'CSCO', 'CTAS', 'EA', 'IBM', 'INTC', 'INTU', 'MSFT', 'MU', 'NVDA', 'ORCL', 'PANW', 'SHOP', 'SNPS', 'TSM', 'TXN'],
    'Healthcare': ['ABBV', 'ABT', 'AON', 'AZN', 'BDX', 'BIIB', 'BMY', 'BSX', 'CI', 'CVS', 'DHR', 'GILD', 'HCA', 'JNJ', 'LLY', 'MDT', 'MRK', 'PFE', 'REGN', 'SNY', 'SYK', 'UNH', 'VRTX'],
    'Financial Services': ['AIG', 'AXP', 'BAC', 'BKNG', 'BLK', 'BRK.B', 'C', 'CME', 'COF', 'GS', 'ICE', 'JPM', 'MA', 'MCO', 'MMC', 'MS', 'PNC', 'SCHW', 'SPGI', 'TFC', 'V', 'WFC'],
    'Consumer Goods': ['EL', 'GM', 'HD', 'KO', 'LULU', 'MAR', 'MDLZ', 'NKE', 'PG', 'PM', 'PEP', 'RACE', 'RIO', 'RL', 'SBUX', 'TGT', 'TM', 'UL', 'UN', 'VLO', 'WMT'],
    'Energy': ['APA', 'BP', 'COP', 'CVX', 'E', 'ENB', 'EOG', 'EQNR', 'EPD', 'HES', 'KMI', 'MRO', 'MPC', 'OXY', 'PBR', 'PSX', 'RDS.A', 'RDS.B', 'SU', 'VLO', 'XOM'],
    'Industrial': ['BA', 'CAT', 'DE', 'EMR', 'GE', 'HON', 'LMT', 'MMM', 'NOC', 'RTX', 'UPS'],
    'Retail': ['AMZN', 'BABA', 'COST', 'JD'],
    'Telecommunication': ['CHTR', 'CMCSA', 'TMUS', 'T', 'VZ'],
    'Automotive': ['F', 'GM', 'HMC', 'TSLA', 'TM'],
    'Entertainment': ['DIS', 'NFLX', 'SNE'],
    'Agriculture': ['ADM', 'BHP', 'CNI', 'CP', 'FMC', 'MOS', 'NTR'],
    'Utilities': ['AEP', 'DUK', 'EXC', 'NEE', 'SO'],
    'Pharmaceuticals': ['ABBV', 'AZN', 'BMY', 'GILD', 'JNJ', 'MRK', 'PFE', 'SNY'],
    'Insurance': ['AIG', 'AON', 'BRK.B', 'MET', 'PRU', 'TRV']
}

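# Note: a ticker listed under several branches (e.g. 'AMZN') keeps only the last branch it appears in
ticker_to_branch = {}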
for branch, companies in branches.items():
    for company in companies:
        ticker_to_branch[company] = branch

print(ticker_to_branch)
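
Some tickers appear in more than one branch list, and a plain `ticker -> branch` dictionary can hold only one of them. A possible alternative, sketched here with a hypothetical helper `build_ticker_to_branches` and a two-branch excerpt of the table, keeps every branch per ticker.

```python
from collections import defaultdict

def build_ticker_to_branches(branches):
    """Return {ticker: [branch, ...]} keeping every branch a ticker appears in."""
    mapping = defaultdict(list)
    for branch, companies in branches.items():
        for company in companies:
            mapping[company].append(branch)
    return dict(mapping)

# Small excerpt for illustration only, not the full branch table.
example_branches = {
    'Technology': ['AAPL', 'AMZN'],
    'Retail': ['AMZN', 'COST'],
}
ticker_to_branches = build_ticker_to_branches(example_branches)

# Tickers that belong to more than one branch.
multi_branch = {t: b for t, b in ticker_to_branches.items() if len(b) > 1}
print(multi_branch)
```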

In [None]:
ticker_to_branch['AAPL']

# CNN (FROM COLAB)

In [None]:
import torch
import numpy
import matplotlib.pyplot as plt

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of total RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

In [None]:
#ORIGINAL
import os
import math
import importlib
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_returns = pd.DataFrame()
asset_list = ['AAPL', 'ABNB', 'ADBE', 'ADI', 'ADP', 'ADSK', 'AEP', 'ALGN', 'AMAT', 'AMD', 'AMGN', 'AMZN', 'ANSS', 'ASML', 'AVGO', 'AZN', 'BIIB', 'BKNG', 'BKR', 'CDNS', 'CEG', 'CHTR', 'CMCSA', 'COST', 'CPRT', 'CRWD', 'CSCO', 'CSGP', 'CSX', 'CTAS', 'CTSH', 'DDOG', 'DLTR', 'DXCM', 'EA', 'EBAY', 'ENPH', 'EXC', 'FANG', 'FAST', 'FTNT', 'GEHC', 'GFS', 'GILD', 'GOOGL', 'HON', 'IDXX', 'ILMN', 'INTC', 'INTU', 'ISRG', 'JD', 'KDP', 'KHC', 'KLAC', 'LCID', 'LRCX', 'LULU', 'MAR', 'MDLZ', 'MELI', 'META', 'MNST', 'MRNA', 'MRVL', 'MSFT', 'MU', 'NFLX', 'NVDA', 'NXPI', 'ODFL', 'ON', 'ORLY', 'PANW', 'PAYX', 'PCAR', 'PDD', 'PEP', 'PYPL', 'QCOM', 'REGN', 'ROST', 'SBUX', 'SGEN', 'SIRI', 'SNPS', 'TEAM', 'TMUS', 'TSLA', 'TXN', 'VRSK', 'VRTX', 'WBA', 'WBD', 'WDAY', 'XEL', 'ZM', 'ZS']
# Not in asset_list: 'GOOG', 'MCHP', 'FI'

original_starttime = datetime.fromisoformat("2023-03-21 09:30:00.000000+00:00")

level = 50
prediction_ahead = 5
depth = 5 # also 10, 30, 50
time_interval = '1S' # 30S = 30 s, 10S = 10 s, 5S = 5 s, 1S = 1 s; 100L = 0.1 s; 10L = 0.01 s; 1L = 0.001 s; 100U = 0.0001 s
window = 5 # also 3, 5, 10
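
The `time_interval` strings use the pandas offset aliases `S` (seconds), `L` (milliseconds) and `U` (microseconds). Recent pandas releases appear to prefer the lowercase forms `s`, `ms` and `us`. As a quick sanity check on the conversions in the comment above, here is a sketch with a hypothetical helper `interval_to_seconds`, written without pandas.

```python
import re

# How many of each unit fit into one second, for the three aliases used above.
UNITS_PER_SECOND = {'S': 1, 'L': 1_000, 'U': 1_000_000}

def interval_to_seconds(alias):
    """Convert an alias such as '100L' to seconds (hypothetical helper)."""
    match = re.fullmatch(r'(\d*)([SLU])', alias)
    if match is None:
        raise ValueError(f"Unsupported interval alias: {alias!r}")
    count = int(match.group(1) or 1)  # a bare 'S' means one unit
    return count / UNITS_PER_SECOND[match.group(2)]

for alias in ('30S', '1S', '100L', '10L', '1L', '100U'):
    print(alias, interval_to_seconds(alias))
```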

In [None]:
file_path = '/content/drive/MyDrive/DataLOB_array/combined_depth5_time1S_window5.pkl'
df = pd.read_pickle(file_path)

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.set_printoptions(edgeitems=2)
torch.manual_seed(123)
# Convert each NumPy array in the 'Arrays' column to a PyTorch tensor
df['tensors'] = df['Arrays'].apply(lambda x: torch.from_numpy(x))
# Scale pixel values from [0, 255] to [0, 1] and store them as float32
df['tensors'] = df['tensors'].apply(lambda x: (x.float() / 255).to(torch.float32))

# Check if CUDA is available
if torch.cuda.is_available():
    # Apply .to(device='cuda') to each tensor in the column
    df['tensors'] = df['tensors'].apply(lambda x: x.to(device='cuda'))
    print("CUDA is available and is now applied to each tensor")
else:
    print("CUDA is not available. Tensors will remain on CPU.")

In [None]:
import pandas as pd

def split_dataframe_by_ticker(df, time_column='Time', train_ratio=0.6, val_ratio=0.2):
    """
    Splits each ticker's rows chronologically into training, validation, and test sets.

    Parameters:
    df (DataFrame): The input DataFrame.
    time_column (str): The name of the column containing time data.
    train_ratio (float): The proportion of data to be used for training.
    val_ratio (float): The proportion of data to be used for validation.

    Returns:
    DataFrame: Training set.
    DataFrame: Validation set.
    DataFrame: Test set.
    """

    # Convert Time column to datetime for proper sorting
    df[time_column] = pd.to_datetime(df[time_column])

    # Define a function to split the data
    def split_data(group):
        group = group.sort_values(by=time_column)
        idx_train = int(len(group) * train_ratio)
        idx_val = int(len(group) * (train_ratio + val_ratio))
        return group.iloc[:idx_train], group.iloc[idx_train:idx_val], group.iloc[idx_val:]

    # Apply the function and get splits
    splits = df.groupby('Ticker').apply(lambda g: split_data(g))

    # Extract splits into separate DataFrames using list comprehension
    train_df = pd.concat([s[0] for s in splits])
    val_df = pd.concat([s[1] for s in splits])
    test_df = pd.concat([s[2] for s in splits])

    return train_df, val_df, test_df

# Usage Example:
train_df, val_df, test_df = split_dataframe_by_ticker(df)
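
The split boundaries inside `split_data` are plain integer cut points on the time-sorted rows of one ticker. The sketch below reproduces that arithmetic on a toy list, using a hypothetical helper `chronological_split_bounds` and no pandas, to show that the three parts do not overlap and that the latest observations end up in the test set.

```python
def chronological_split_bounds(n_rows, train_ratio=0.6, val_ratio=0.2):
    """Return the two cut points used to slice time-sorted rows."""
    idx_train = int(n_rows * train_ratio)
    idx_val = int(n_rows * (train_ratio + val_ratio))
    return idx_train, idx_val

# Ten time-sorted observations of one ticker, represented by their positions.
rows = list(range(10))
idx_train, idx_val = chronological_split_bounds(len(rows))
train, val, test = rows[:idx_train], rows[idx_train:idx_val], rows[idx_val:]
print(train, val, test)
```

Because nothing is shuffled before slicing, no future observation of a ticker leaks into its training part.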


In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm
import time

# Custom Dataset Class
class StockDataset(Dataset):
    """Custom Dataset for loading stock tensor data"""
    def __init__(self, dataframe):
        self.dataframe = dataframe

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        features = self.dataframe.iloc[idx]['tensors']
        label = self.dataframe.iloc[idx]['midquote_target']
        features = features.unsqueeze(0)  # Ensure there's a channel dimension
        return features, label

# CNN Model Definition
class StockCNN(nn.Module):
    def __init__(self):
        super(StockCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 10 * 26, 120)  # Adjust based on the output size
        self.fc2 = nn.Linear(120, 2)  # For binary classification

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Assuming train_df, val_df, and test_df are already defined and split
train_dataset = StockDataset(train_df)
val_dataset = StockDataset(val_df)
test_dataset = StockDataset(test_df)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Setup for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = StockCNN().to(device)

# Class weights inversely proportional to the class frequencies in the training set
num_class_0 = (train_df['midquote_target'] == 0).sum()
num_class_1 = (train_df['midquote_target'] == 1).sum()

total_samples = len(train_df)
weight_for_class_0 = total_samples / (2 * num_class_0)
weight_for_class_1 = total_samples / (2 * num_class_1)

# Define weights tensor
weights = torch.tensor([weight_for_class_0, weight_for_class_1], dtype=torch.float32).to(device)

# Define the loss function with weighted class
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 10

# Training and Validation Functions
def train_one_epoch(epoch_index, num_epochs, model, device, train_loader, optimizer, criterion):
    model.train()
    running_loss = 0.0
    correct_predictions = 0
    total_predictions = 0
    start_time = time.time()

    for batch_idx, (data, targets) in enumerate(tqdm(train_loader, desc=f"Epoch {epoch_index+1}/{num_epochs} Training")):
        data, targets = data.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, targets.long())
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total_predictions += targets.size(0)
        correct_predictions += (predicted == targets).sum().item()

    epoch_loss = running_loss / len(train_loader)
    epoch_accuracy = (correct_predictions / total_predictions) * 100
    print(f"Training - Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.2f}%, Time: {time.time() - start_time:.2f}s")

def validate(epoch_index, num_epochs, model, device, val_loader, criterion):
    model.eval()
    running_loss = 0.0
    correct_predictions = 0
    total_predictions = 0

    with torch.no_grad():
        for data, targets in tqdm(val_loader, desc=f"Epoch {epoch_index+1}/{num_epochs} Validation"):
            data, targets = data.to(device), targets.to(device)

            outputs = model(data)
            loss = criterion(outputs, targets.long())

            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total_predictions += targets.size(0)
            correct_predictions += (predicted == targets).sum().item()

    epoch_loss = running_loss / len(val_loader)
    epoch_accuracy = (correct_predictions / total_predictions) * 100
    print(f"Validation - Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.2f}%")

# Main Training Loop
for epoch in range(num_epochs):
    train_one_epoch(epoch, num_epochs, model, device, train_loader, optimizer, criterion)
    validate(epoch, num_epochs, model, device, val_loader, criterion)


In [None]:
model.eval()  # Set the model to evaluation mode
test_correct = 0
test_total = 0
predictions = []
true_labels = []

with torch.no_grad():  # No need to track gradients during testing
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        test_total += labels.size(0)
        test_correct += (predicted == labels).sum().item()
        predictions.extend(predicted.view(-1).cpu().numpy())
        true_labels.extend(labels.view(-1).cpu().numpy())

test_accuracy = 100 * test_correct / test_total
print(f'Test Accuracy: {test_accuracy:.2f}%')


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

conf_mat = confusion_matrix(true_labels, predictions)
sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues', xticklabels=['Down', 'Up'], yticklabels=['Down', 'Up'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

def test(model, device, test_loader, criterion):
    model.eval()
    running_loss = 0.0
    all_predictions = []
    all_targets = []

    with torch.no_grad():
        for data, targets in tqdm(test_loader, desc="Testing"):
            data, targets = data.to(device), targets.to(device)

            outputs = model(data)
            loss = criterion(outputs, targets.long())

            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            all_predictions.extend(predicted.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())

    epoch_loss = running_loss / len(test_loader)
    accuracy = accuracy_score(all_targets, all_predictions)
    precision = precision_score(all_targets, all_predictions)
    recall = recall_score(all_targets, all_predictions)
    f1 = f1_score(all_targets, all_predictions)
    confusion_mat = confusion_matrix(all_targets, all_predictions)

    print(f"Test - Loss: {epoch_loss:.4f}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("Confusion Matrix:")
    print(confusion_mat)

# After training and validation, test the model on the test set
test(model, device, test_loader, criterion)
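
For reference, the reported metrics can be recomputed by hand from the confusion matrix. scikit-learn puts the true labels on the rows and the predictions on the columns, so for labels 0 (Down) and 1 (Up) the layout is `[[TN, FP], [FN, TP]]`. The sketch below uses made-up counts and a hypothetical helper `binary_metrics`.

```python
def binary_metrics(conf_mat):
    """Accuracy, precision, recall and F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf_mat
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts, not results from the model above.
accuracy, precision, recall, f1 = binary_metrics([[50, 10], [5, 35]])
print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
```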


# (Re)-imag(in)ing Price Trends

A key component of image-based prediction is the implicit data scaling achieved by the image representation: images put all stocks’ past price data on the same scale so that their recent maximum high and minimum low prices span the height of the image and all other prices (open, high, low, close, and moving average) are rescaled accordingly, and likewise for volume.

The main component of our image is a concatenation of daily OHLC bars over consecutive 5, 20, or 60-day intervals (approximately weekly, monthly, and quarterly price trajectories, respectively). The width of an n-day image is thus 3n pixels. We replace prices by CRSP adjusted returns to translate the opening, closing, high and low prices into relative scales that abstract from price effects of stock splits and dividend issuance.
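
A minimal sketch of such an image, under stated assumptions: each day takes three pixel columns (open tick on the left, high-low bar in the middle, close tick on the right), the minimum low maps to the bottom row and the maximum high to the top row. Moving average and volume are left out, raw prices are used instead of adjusted returns, and the function name `ohlc_image` and the image height are hypothetical.

```python
import numpy as np

def ohlc_image(ohlc, height=32):
    """Draw n daily OHLC bars into a (height, 3n) uint8 image."""
    ohlc = np.asarray(ohlc, dtype=float)  # shape (n, 4): open, high, low, close
    n_days = ohlc.shape[0]
    lo, hi = ohlc[:, 2].min(), ohlc[:, 1].max()
    span = hi - lo if hi > lo else 1.0  # avoid dividing by zero for flat prices
    # Rescale so that the maximum high is row 0 (top) and the minimum low is the last row.
    rows = np.rint((hi - ohlc) / span * (height - 1)).astype(int)
    img = np.zeros((height, 3 * n_days), dtype=np.uint8)
    for day, (o, h, l, c) in enumerate(rows):
        x = 3 * day
        img[o, x] = 255            # open tick, left pixel
        img[h:l + 1, x + 1] = 255  # high-low bar, middle pixel
        img[c, x + 2] = 255        # close tick, right pixel
    return img

# Two hypothetical days: (open, high, low, close).
img = ohlc_image([[10, 12, 9, 11], [11, 13, 10, 12]], height=5)
print(img.shape)  # (5, 6): the width is 3n for n = 2 days
```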