# Data pre-processing

In [1]:
# Enabling the notebook execution from this sub-folder
import sys, os, ipynbname
NOTEBOOK_NAME = f"{ipynbname.name()}.ipynb"
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(NOTEBOOK_NAME), os.path.pardir)))

In [2]:
# Importing the Utils global variables and methods
from src.Utils import *

## Importing initial datasets

In [3]:
# Importing training dataset
TRAIN = pd.read_csv(filepath_or_buffer=Utils.FILENAMES["TRAIN"])

In [4]:
# Importing second training dataset
TRAIN_2 = pd.read_csv(filepath_or_buffer=Utils.FILENAMES["TRAIN_2"])

In [5]:
# Importing test dataset
TEST = pd.read_csv(filepath_or_buffer=Utils.FILENAMES["TEST"])

## Processing & On-disk saving

After an in-depth visualization, we have observed that the few samples contained inside the `TEST` dataset are

In [6]:
# Verify the continuity of the time series
#TRAIN.tail()
#TRAIN_2.head()

In [7]:
# Merging TRAIN and TRAIN_2 as df
df = pd.concat([TRAIN, TRAIN_2], axis = 0)

**Comments**:
- First timestamp: **2018-01-01 00:01:00**
- Last timestamp: **2022-01-24 00:00:00**
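The continuity check hinted at above can be sketched on toy data. The two frames below are hypothetical stand-ins for `TRAIN` and `TRAIN_2` (epoch seconds, one row per minute); the real files hold several rows per timestamp, which is why duplicates are dropped before differencing.

```python
import pandas as pd

# Hypothetical stand-ins for TRAIN and TRAIN_2 (epoch seconds, 1-minute step)
first = pd.DataFrame({"timestamp": [1514764860, 1514764920, 1514764980]})
second = pd.DataFrame({"timestamp": [1514765040, 1514765100]})

# Same merge as in the notebook, followed by the epoch -> datetime conversion
merged = pd.concat([first, second], axis=0, ignore_index=True)
ts = pd.to_datetime(merged["timestamp"], unit="s")

# Gaps between consecutive distinct timestamps: all equal to 1 minute means no hole
gaps = ts.drop_duplicates().diff().dropna()
is_continuous = bool((gaps == pd.Timedelta(minutes=1)).all())

print(f"First: {ts.iloc[0]} | Last: {ts.iloc[-1]} | Continuous: {is_continuous}")
```

On the real data, a `False` result would only indicate that some minutes are missing somewhere; locating them would require inspecting `gaps` directly.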

In [8]:
# Saving on-disk (CSV + Parquet)
SAVE_CSV, SAVE_PQT = False, True

# Deep-copying the DataFrame
df_ = df.copy(deep=True)

# Removing the Target column (not needed here); nothing happens if it is absent
df_.drop(columns=["Target"], errors="ignore", inplace=True)

# Converting the timestamp column
df_["timestamp"] = pd.to_datetime(arg=df_["timestamp"], unit="s", errors="ignore")

# Setting the timestamp column as index column
df_.set_index(["timestamp"], inplace=True)

# Converting the Count number
df_["Count"] = pd.to_numeric(arg=df_["Count"], downcast="integer")

# Replacing +inf/-inf values by NaN (assigning back, since an inplace replace on a selected column may not propagate)
df_["VWAP"] = df_["VWAP"].replace([np.inf, -np.inf], np.nan)

# Checking for NaN values
nb_NaN_values = df_.isnull().sum().sum()

# Replacing the NaN values by the previous value (only VWAP & Volume are concerned ==> no impact)
if nb_NaN_values != 0: # Expected: 9 for the asset_id: 10
    df_.ffill(inplace=True)

if SAVE_CSV:
    # Saving it to a new CSV file in assets/
    df_.to_csv(path_or_buf=f"{Utils.ASSETS_FOLDER}/csv/ALL.csv")

if SAVE_PQT:
    # Saving it to a new Parquet file in assets/ (better for file I/O speed & compression)
    table = pa.Table.from_pandas(df=df_)
    pq.write_table(table=table, where=f"{Utils.ASSETS_FOLDER}/parquet/ALL.parquet")
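One point worth keeping in mind for the merged frame: rows of different assets are interleaved, so a plain forward fill takes the previous row whatever its asset. The sketch below, on hypothetical toy values, contrasts it with a fill grouped by `Asset_ID`. This is an illustrative alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two interleaved assets
toy = pd.DataFrame({
    "Asset_ID": [0, 1, 0, 1, 0, 1],
    "VWAP": [10.0, 200.0, np.inf, np.nan, 12.0, 210.0],
})

# Treating +inf/-inf as missing values
toy["VWAP"] = toy["VWAP"].replace([np.inf, -np.inf], np.nan)

# Plain forward fill: each gap takes the previous row, whatever its asset
plain = toy["VWAP"].ffill()

# Grouped forward fill: each gap takes the previous value of the same asset
per_asset = toy.groupby("Asset_ID")["VWAP"].ffill()

print(pd.DataFrame({"Asset_ID": toy["Asset_ID"], "plain": plain, "per_asset": per_asset}))
```

With only a handful of missing values the difference is negligible in practice, and the per-asset frames built later in this notebook do not have this issue, since each one holds a single asset.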

We now repeat this processing for each asset:
1. Splitting our dataframe by `asset_id`
2. Removing unnecessary columns
3. Converting the timestamp to a pandas datetime
4. Setting this datetime as index
5. Performing some additional operations to compress the data and remove unneeded precision
6. Saving to disk as CSV and Parquet files for each `asset_id`

For instance, the `Count` column is integer-valued by definition.
However, our dataset and/or import procedure cast it to a floating-point number.

$\longrightarrow$ We will convert it back to a compact integer type using `pd.to_numeric` with `downcast="integer"`.
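As a minimal illustration of this conversion, the sketch below applies `pd.to_numeric(..., downcast="integer")`, the call used in the cells of this notebook, to two toy series with hypothetical values. It also shows that a column containing `NaN` stays in floating point, since `NaN` has no integer representation.

```python
import numpy as np
import pandas as pd

# Whole numbers stored as float64, as happens with Count after import
counts = pd.Series([5.0, 12.0, 300.0])
downcast = pd.to_numeric(counts, downcast="integer")
print(f"{counts.dtype} -> {downcast.dtype}")

# With a missing value, no integer type can hold the data: the dtype is unchanged
with_gap = pd.Series([5.0, np.nan, 300.0])
kept = pd.to_numeric(with_gap, downcast="integer")
print(f"{with_gap.dtype} -> {kept.dtype}")
```

The downcast picks the smallest signed integer type able to hold every value, which is where the memory saving comes from.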

In [9]:
# Hashmap of train datasets for each asset id
df_dict = {asset_id: None for asset_id in Utils.ASSET_IDS}

SAVE_CSV, SAVE_PQT = False, True

# For each asset id, perform the pre-processing
for asset_id in Utils.ASSET_IDS:
    print(f"--> Processing Asset #{asset_id} \t ({Utils.get_asset_name(asset_id)})")
    
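    # Retrieving the corresponding data rows (explicit copy, so that later edits do not target a view of df)
    df_dict[asset_id] = df[df["Asset_ID"] == asset_id].copy()
    
    # Removing the Asset_ID and Target columns (no longer needed); absent columns are ignored
    df_dict[asset_id].drop(columns=["Asset_ID", "Target"], errors="ignore", inplace=True)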
    
    # Converting the timestamp column
    df_dict[asset_id]["timestamp"] = pd.to_datetime(arg=df_dict[asset_id]["timestamp"], 
                                                     unit="s", 
                                                     errors="ignore")
    # Setting the timestamp column as index column
    df_dict[asset_id].set_index(["timestamp"], inplace=True)
    
    # Converting the Count number
    df_dict[asset_id]["Count"] = pd.to_numeric(arg=df_dict[asset_id]["Count"], 
                                               downcast="integer")
    
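    # Replacing +inf/-inf values by NaN (assigning back, since an inplace replace on a selected column may not propagate)
    df_dict[asset_id]["VWAP"] = df_dict[asset_id]["VWAP"].replace([np.inf, -np.inf], np.nan)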
    
    # Checking for NaN values
    nb_NaN_values = df_dict[asset_id].isnull().sum().sum()
    
    # Replacing the NaN values by the previous value (only VWAP & Volume are concerned ==> no impact)
    if nb_NaN_values != 0: # Expected: 9 for the asset_id: 10
        df_dict[asset_id].ffill(inplace=True) # .bfill() is also available
    
    if SAVE_CSV:
        # Saving it to a new CSV file in assets/
        df_dict[asset_id].to_csv(path_or_buf=f"{Utils.ASSETS_FOLDER}/csv/{asset_id}.csv")

    if SAVE_PQT:
        # Saving it to a new Parquet file in assets/ (better for file I/O speed & compression)
        table = pa.Table.from_pandas(df=df_dict[asset_id])
        pq.write_table(table=table, where=f"{Utils.ASSETS_FOLDER}/parquet/{asset_id}.parquet")

--> Processing Asset #0 	 (Binance Coin)
--> Processing Asset #1 	 (Bitcoin)
--> Processing Asset #2 	 (Bitcoin Cash)
--> Processing Asset #3 	 (Cardano)
--> Processing Asset #4 	 (Dogecoin)
--> Processing Asset #5 	 (EOS.IO)
--> Processing Asset #6 	 (Ethereum)
--> Processing Asset #7 	 (Ethereum Classic)
--> Processing Asset #8 	 (IOTA)
--> Processing Asset #9 	 (Litecoin)
--> Processing Asset #10 	 (Maker)
--> Processing Asset #11 	 (Monero)
--> Processing Asset #12 	 (Stellar)
--> Processing Asset #13 	 (TRON)


## Sanity check

We check that the row counts of the per-asset DataFrames sum to the number of rows of the merged dataset `df`:

In [10]:
assert sum([df_dict[k].shape[0] for k in Utils.ASSET_IDS]) == df.shape[0]

We can also check the memory usage of the newly created pandas DataFrames:

In [11]:
# Computing the memory usage of each DataFrame (in MB)
MEM_USAGE = {asset_id: df_dict[asset_id].memory_usage(index=True).sum() / 10**6 for asset_id in Utils.ASSET_IDS}

# Computing the global memory usage (in GB)
GLOBAL_MEM_USAGE = sum(MEM_USAGE.values()) / 10**3
print(f"Global mem. usage: \t {GLOBAL_MEM_USAGE:.1f} GB")

Global mem. usage: 	 1.6 GB
