## **1. Data cleaning**

**OVERVIEW**
- This workbook ensures that the price data loaded from `data/raw/prices` is of suitable quality for the analyses that follow
- This involves checking for **missing values (NaNs), formatting, merging of tables, data transformation, etc.**
- All clean data will be saved in `data/processed`

**SUMMARY RESULTS**
- The resulting processed dataset (`asset_universe`) has **1,477 rows and 19 columns**, one per asset (excluding the 'Date' index)
- The date range runs from **2019-01-02** to **2024-12-30** and there are **NO missing values** to deal with

#### **1.1 Importing necessary libraries**

In [1]:
import pandas as pd
import random
from src.helpers_io import raw_path, processed_path, read_csv_raw, save_csv_processed

# Creating path to 'data/raw/prices'
raw_prices_dir = raw_path("prices")

#### **1.2 Loading datasets**

In [2]:
# Quick view of saved files and storing ticker names
tickers = []

for file in list(raw_prices_dir.iterdir()):
    filename = file.name.split(sep="_")[0]
    tickers.append(filename)
    print(file.name)

AMZN_prices.csv
BZ_prices.csv
CL_prices.csv
EURUSD_prices.csv
FTSE_prices.csv
GBPUSD_prices.csv
GC_prices.csv
GSPC_prices.csv
IEF_prices.csv
IRX_prices.csv
IXIC_prices.csv
JPM_prices.csv
MSFT_prices.csv
NG_prices.csv
NVDA_prices.csv
ORCL_prices.csv
SI_prices.csv
TLT_prices.csv
USDJPY_prices.csv
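
`Path.iterdir()` yields entries in arbitrary order and also returns any non-CSV file that happens to sit in the folder. The following is a minimal sketch of a stricter variant, assuming Python 3.9+ and the `<TICKER>_prices.csv` naming shown above; `list_tickers` and the demo file names are hypothetical and not part of `src.helpers_io`.

```python
from pathlib import Path
import tempfile


def list_tickers(prices_dir: Path) -> list[str]:
    """Return sorted ticker names from files named '<TICKER>_prices.csv'."""
    suffix = "_prices.csv"
    return sorted(
        f.name.removesuffix(suffix)          # strip the fixed suffix, keep the ticker
        for f in prices_dir.glob(f"*{suffix}")  # ignores files that do not match
    )


# Demonstration on a throwaway directory (file names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ["MSFT_prices.csv", "AMZN_prices.csv", "notes.txt"]:
        (tmp_dir / name).touch()
    demo_tickers = list_tickers(tmp_dir)

print(demo_tickers)  # ['AMZN', 'MSFT']
```

Sorting makes the ticker list, and therefore the seeded random sample drawn from it, reproducible across machines.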


#### **1.3 Inspecting structure and data types**

Based on a random sample of three assets, all columns have **correct data types** and **no missing values**

In [4]:
# Setting a seed
random.seed(123)

# Small sample of assets to check
assets = random.sample(tickers, k=3)

for asset in assets:
    data = read_csv_raw(f"{raw_prices_dir / asset}_prices.csv", parse_dates=["Date"])
    print(asset)
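    # Note: data.info() prints its report directly and returns None,
    # which is why 'None' appears among the displayed outputs below
    display(data.head(5), data.info(), data.isna().any())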

BZ
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1510 entries, 0 to 1509
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    1510 non-null   datetime64[ns]
 1   Close   1510 non-null   float64       
 2   High    1510 non-null   float64       
 3   Low     1510 non-null   float64       
 4   Open    1510 non-null   float64       
 5   Volume  1510 non-null   int64         
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 70.9 KB


| | Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | 2019-01-02 | 54.91 | 56.560001 | 52.5 | 54.25 | 43517 |
| 1 | 2019-01-03 | 55.950001 | 56.290001 | 53.93 | 54.77 | 36535 |
| 2 | 2019-01-04 | 57.060001 | 58.299999 | 55.360001 | 55.580002 | 42426 |
| 3 | 2019-01-07 | 57.330002 | 58.919998 | 57.279999 | 57.369999 | 41677 |
| 4 | 2019-01-08 | 58.720001 | 58.860001 | 57.110001 | 57.630001 | 34135 |


None

Date      False
Close     False
High      False
Low       False
Open      False
Volume    False
dtype: bool

IEF
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1509 entries, 0 to 1508
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    1509 non-null   datetime64[ns]
 1   Close   1509 non-null   float64       
 2   High    1509 non-null   float64       
 3   Low     1509 non-null   float64       
 4   Open    1509 non-null   float64       
 5   Volume  1509 non-null   int64         
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 70.9 KB


| | Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | 2019-01-02 | 89.408463 | 89.417027 | 89.202967 | 89.314276 | 18668600 |
| 1 | 2019-01-03 | 90.119179 | 90.179115 | 89.425627 | 89.442749 | 10616700 |
| 2 | 2019-01-04 | 89.391335 | 89.554015 | 89.280019 | 89.519772 | 6616700 |
| 3 | 2019-01-07 | 89.143021 | 89.53689 | 89.10877 | 89.494075 | 5459200 |
| 4 | 2019-01-08 | 88.920403 | 89.125898 | 88.920403 | 89.04884 | 6879500 |


None

Date      False
Close     False
High      False
Low       False
Open      False
Volume    False
dtype: bool

CL
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1509 entries, 0 to 1508
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    1509 non-null   datetime64[ns]
 1   Close   1509 non-null   float64       
 2   High    1509 non-null   float64       
 3   Low     1509 non-null   float64       
 4   Open    1509 non-null   float64       
 5   Volume  1509 non-null   int64         
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 70.9 KB


| | Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | 2019-01-02 | 46.540001 | 47.779999 | 44.349998 | 45.799999 | 850480 |
| 1 | 2019-01-03 | 47.09 | 47.490002 | 45.349998 | 46.259998 | 788718 |
| 2 | 2019-01-04 | 47.959999 | 49.220001 | 46.650002 | 46.900002 | 817277 |
| 3 | 2019-01-07 | 48.52 | 49.790001 | 48.110001 | 48.299999 | 819939 |
| 4 | 2019-01-08 | 49.779999 | 49.950001 | 48.310001 | 48.73 | 765981 |


None

Date      False
Close     False
High      False
Low       False
Open      False
Volume    False
dtype: bool
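
The check above inspects only three sampled assets. A minimal sketch of how the same dtype and NaN checks could be applied to every file programmatically is given below; `find_issues` and the `good`/`bad` frames are hypothetical illustrations built on synthetic data, and the expected columns are taken from the `info()` output above.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Columns expected in each raw price file, per the info() output above
EXPECTED_NUMERIC = ["Close", "High", "Low", "Open", "Volume"]


def find_issues(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    issues = []
    if "Date" not in df.columns or not is_datetime64_any_dtype(df["Date"]):
        issues.append("Date is missing or not datetime")
    for col in EXPECTED_NUMERIC:
        if col not in df.columns:
            issues.append(f"{col} is missing")
        elif not is_numeric_dtype(df[col]):
            issues.append(f"{col} is not numeric")
    nan_cols = df.columns[df.isna().any()].tolist()
    if nan_cols:
        issues.append(f"NaNs in: {nan_cols}")
    return issues


# Synthetic frames for illustration only (not real price data)
good = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-02", "2019-01-03"]),
    "Close": [1.0, 2.0], "High": [1.5, 2.5], "Low": [0.5, 1.5],
    "Open": [0.9, 1.9], "Volume": [100, 200],
})
bad = good.assign(Close=[1.0, float("nan")])

print(find_issues(good))  # []
print(find_issues(bad))   # ["NaNs in: ['Close']"]
```

In the notebook, this function could be called inside a loop over all tickers so that the conclusion no longer rests on a sample.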

#### **1.4 Standardizing dataframes**

In [5]:
datasets = {}

for file in raw_prices_dir.glob("*.csv"):
    # Getting asset name ONLY
    filename = file.name.split(sep="_")[0]

    # Reading CSV and converting 'Date' into datetime format
    data = read_csv_raw(f"prices/{file.name}", parse_dates=["Date"])
    data = data.set_index("Date").sort_index(ascending=True)

    # Keeping only the 'Close' column (indexed by 'Date') and renaming it to the ticker
    data = data.rename(columns={"Close": filename})
    data = data[[filename]]

    # Adding it to the 'datasets' dictionary
    datasets[filename] = data
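
The per-file steps above (index by date, sort, keep the close price, rename it to the ticker) can also be wrapped in one small function, which makes them easy to test in isolation. This is a sketch on synthetic data; `to_close_series` and the ticker `XYZ` are hypothetical names.

```python
import pandas as pd


def to_close_series(data: pd.DataFrame, ticker: str) -> pd.DataFrame:
    """Return a one-column frame of closing prices, indexed by ascending 'Date'."""
    return (
        data.set_index("Date")
            .sort_index(ascending=True)
            .loc[:, ["Close"]]                  # keep only the closing price
            .rename(columns={"Close": ticker})  # name the column after the asset
    )


# Synthetic input with rows deliberately out of order
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-03", "2019-01-02"]),
    "Close": [2.0, 1.0],
    "Volume": [200, 100],
})
demo = to_close_series(raw, "XYZ")
print(demo)
```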

#### **1.5 Analyzing common date ranges for merging**

In [7]:
# Creating a dictionary to collect summary statistics for each ticker
comparison_table = {"ticker": [], "start_date": [], "end_date": [], "n_rows": [], "nan_values": []}

# Iterating each dataset
for ticker, dataset in datasets.items():
    comparison_table["ticker"].append(ticker)
    comparison_table["start_date"].append(dataset.index.min())
    comparison_table["end_date"].append(dataset.index.max())
    comparison_table["n_rows"].append(len(dataset))
    comparison_table["nan_values"].append(dataset.isna().sum().iloc[0])

# Converting the dictionary into a DataFrame and sorting values based on 'start_date'
comparison_table = pd.DataFrame(comparison_table)
comparison_table.set_index("ticker", inplace=True)
comparison_table = comparison_table.sort_values("start_date", ascending=True)

# Defining the merging window: latest start date and earliest end date across all tickers
max_start_date = max(comparison_table["start_date"])
min_end_date = min(comparison_table["end_date"])

print(f"""start_date: {max_start_date}
end_date: {min_end_date}""")
comparison_table

start_date: 2019-01-02 00:00:00
end_date: 2024-12-30 00:00:00


| ticker | start_date | end_date | n_rows | nan_values |
|---|---|---|---|---|
| USDJPY | 2019-01-01 | 2024-12-30 | 1564 | 0 |
| EURUSD | 2019-01-01 | 2024-12-30 | 1564 | 0 |
| GBPUSD | 2019-01-01 | 2024-12-30 | 1564 | 0 |
| SI | 2019-01-02 | 2024-12-30 | 1509 | 0 |
| ORCL | 2019-01-02 | 2024-12-30 | 1509 | 0 |
| NVDA | 2019-01-02 | 2024-12-30 | 1509 | 0 |
| NG | 2019-01-02 | 2024-12-30 | 1510 | 0 |
| MSFT | 2019-01-02 | 2024-12-30 | 1509 | 0 |
| JPM | 2019-01-02 | 2024-12-30 | 1509 | 0 |
| IXIC | 2019-01-02 | 2024-12-30 | 1509 | 0 |
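
The merging window is the overlap of all date ranges: the latest first date and the earliest last date. A minimal sketch of that rule on synthetic data follows; `common_window` and the `FX`/`EQ` frames are hypothetical, and the explicit overlap check is an addition that the notebook itself does not perform.

```python
import pandas as pd


def common_window(frames: dict[str, pd.DataFrame]) -> tuple[pd.Timestamp, pd.Timestamp]:
    """Return (latest start, earliest end) across date-indexed frames."""
    start = max(df.index.min() for df in frames.values())
    end = min(df.index.max() for df in frames.values())
    if start > end:
        raise ValueError("Datasets do not overlap in time")
    return start, end


# Synthetic example: one series starts a day earlier than the other
demo_frames = {
    "FX": pd.DataFrame(
        {"FX": [1.0, 1.1, 1.2]},
        index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    ),
    "EQ": pd.DataFrame(
        {"EQ": [10.0, 11.0]},
        index=pd.to_datetime(["2019-01-02", "2019-01-03"]),
    ),
}
start, end = common_window(demo_frames)
print(start.date(), end.date())  # 2019-01-02 2019-01-03
```

This mirrors the table above, where the currency pairs start on 2019-01-01 while the other listed assets start on 2019-01-02, so the window opens on 2019-01-02.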


#### **1.6 Merging datasets**

In [10]:
# Creating the asset universe
first_key = next(iter(datasets))    # Using first key from dictionary as starting dataset
asset_universe = datasets[first_key].loc[max_start_date:min_end_date]

# Merging the remaining datasets on 'Date' (inner join keeps only dates present in every series)
for _, dataset in datasets.items():
    if asset_universe.columns[0] != dataset.columns[0]:
        asset_universe = asset_universe.merge(dataset, on="Date", how="inner")

asset_universe

| Date | AMZN | BZ | CL | EURUSD | FTSE | GBPUSD | GC | GSPC | IEF | IRX | IXIC | JPM | MSFT | NG | NVDA | ORCL | SI | TLT | USDJPY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-01-02 | 76.956497 | 54.910000 | 46.540001 | 1.146171 | 6734.200195 | 1.275429 | 1281.000000 | 2510.030029 | 89.408463 | 2.365 | 6665.939941 | 81.616714 | 94.612610 | 2.958 | 3.377355 | 40.754513 | 15.542000 | 101.310387 | 109.667999 |
| 2019-01-03 | 75.014000 | 55.950001 | 47.090000 | 1.131811 | 6692.700195 | 1.252191 | 1291.800049 | 2447.889893 | 90.119179 | 2.355 | 6463.500000 | 80.456802 | 91.131996 | 2.945 | 3.173304 | 40.357971 | 15.706000 | 102.463188 | 107.441002 |
| 2019-01-04 | 78.769501 | 57.060001 | 47.959999 | 1.139108 | 6837.399902 | 1.262881 | 1282.699951 | 2531.939941 | 89.391335 | 2.358 | 6738.859863 | 83.422867 | 95.370461 | 3.044 | 3.376611 | 42.097382 | 15.695000 | 101.277206 | 107.807999 |
| 2019-01-07 | 81.475502 | 57.330002 | 48.520000 | 1.141044 | 6810.899902 | 1.273496 | 1286.800049 | 2549.689941 | 89.143021 | 2.353 | 6823.470215 | 83.480858 | 95.492081 | 2.944 | 3.555370 | 42.764305 | 15.669000 | 100.978615 | 108.522003 |
| 2019-01-08 | 82.829002 | 58.720001 | 49.779999 | 1.147974 | 6861.600098 | 1.278609 | 1283.199951 | 2574.409912 | 88.920403 | 2.400 | 6897.000000 | 83.323433 | 96.184494 | 2.967 | 3.466858 | 43.151848 | 15.626000 | 100.713219 | 108.615997 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-12-20 | 224.919998 | 72.940002 | 69.459999 | 1.036495 | 8084.600098 | 1.249797 | 2628.699951 | 5930.850098 | 89.719093 | 4.215 | 19572.599609 | 232.674530 | 433.402924 | 3.748 | 134.670654 | 167.989105 | 29.660000 | 85.179329 | 157.643997 |
| 2024-12-23 | 225.059998 | 72.629997 | 69.239998 | 1.043308 | 8102.700195 | 1.256992 | 2612.300049 | 5974.069824 | 89.321884 | 4.215 | 19764.880859 | 233.448151 | 432.062775 | 3.656 | 139.639572 | 167.474243 | 29.888000 | 84.398026 | 156.533005 |
| 2024-12-24 | 229.050003 | 73.580002 | 70.099998 | 1.040583 | 8137.000000 | 1.253447 | 2620.000000 | 6040.040039 | 89.370323 | 4.200 | 20031.130859 | 237.286880 | 436.112885 | 3.946 | 140.189468 | 169.721893 | 29.974001 | 84.754929 | 157.164993 |
| 2024-12-27 | 223.750000 | 74.169998 | 70.599998 | 1.042318 | 8149.799805 | 1.252976 | 2617.199951 | 5970.839844 | 89.205643 | 4.178 | 19722.029297 | 236.170517 | 427.377319 | 3.514 | 136.980164 | 167.296021 | 29.655001 | 84.012230 | 157.748001 |
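
Because every merge is an inner join on the date, only dates present in all series survive, which would explain why the merged table has fewer rows than any single input. As an alternative sketch (not the notebook's method), `pd.concat` with `join="inner"` can perform the same intersection in one call; the frames `A` and `B` below are synthetic.

```python
import pandas as pd

# Two synthetic single-column frames with partially overlapping dates
idx_a = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"])
idx_b = pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"])
frames = {
    "A": pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=idx_a),
    "B": pd.DataFrame({"B": [20.0, 30.0, 40.0]}, index=idx_b),
}
for frame in frames.values():
    frame.index.name = "Date"

# Column-wise concatenation; join="inner" keeps only dates shared by all frames
merged = pd.concat(list(frames.values()), axis=1, join="inner").sort_index()
print(merged)
```

Here only 2019-01-02 and 2019-01-03 remain, since those are the dates both frames share. The single call avoids the loop and the special handling of the first dataset.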


In [11]:
# Quick check of 'asset_universe'
print(asset_universe.info())    # Verifying data length and dtype
display(asset_universe.isna().sum())    # Verifying NaNs

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1477 entries, 2019-01-02 to 2024-12-30
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AMZN    1477 non-null   float64
 1   BZ      1477 non-null   float64
 2   CL      1477 non-null   float64
 3   EURUSD  1477 non-null   float64
 4   FTSE    1477 non-null   float64
 5   GBPUSD  1477 non-null   float64
 6   GC      1477 non-null   float64
 7   GSPC    1477 non-null   float64
 8   IEF     1477 non-null   float64
 9   IRX     1477 non-null   float64
 10  IXIC    1477 non-null   float64
 11  JPM     1477 non-null   float64
 12  MSFT    1477 non-null   float64
 13  NG      1477 non-null   float64
 14  NVDA    1477 non-null   float64
 15  ORCL    1477 non-null   float64
 16  SI      1477 non-null   float64
 17  TLT     1477 non-null   float64
 18  USDJPY  1477 non-null   float64
dtypes: float64(19)
memory usage: 230.8 KB
None


AMZN      0
BZ        0
CL        0
EURUSD    0
FTSE      0
GBPUSD    0
GC        0
GSPC      0
IEF       0
IRX       0
IXIC      0
JPM       0
MSFT      0
NG        0
NVDA      0
ORCL      0
SI        0
TLT       0
USDJPY    0
dtype: int64
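
The visual check above can be complemented by explicit invariants that fail loudly if a later data refresh breaks them. The sketch below is one possible set of checks on a synthetic frame; `validate_universe` is a hypothetical helper, and the specific invariants (unique, ascending dates, float columns, no NaNs) are assumptions drawn from the output above.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def validate_universe(df: pd.DataFrame) -> None:
    """Raise ValueError if the merged price table breaks a basic invariant."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("index is not a DatetimeIndex")
    if not df.index.is_unique:
        raise ValueError("duplicate dates in index")
    if not df.index.is_monotonic_increasing:
        raise ValueError("dates are not sorted ascending")
    if df.isna().any().any():
        raise ValueError("table contains NaNs")
    if not all(is_float_dtype(dtype) for dtype in df.dtypes):
        raise ValueError("non-float price column found")


# Synthetic frame that satisfies every invariant
demo_ok = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [3.0, 4.0]},
    index=pd.to_datetime(["2019-01-02", "2019-01-03"]),
)
validate_universe(demo_ok)  # passes silently

# Same frame with one missing value injected
demo_bad = demo_ok.copy()
demo_bad.loc[demo_bad.index[0], "A"] = float("nan")
try:
    validate_universe(demo_bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # table contains NaNs
```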

#### **1.7 Exporting processed dataset**

In [12]:
# Saving CSV into data/processed
filepath = processed_path("asset_universe.csv")

try:
    save_csv_processed(asset_universe, "asset_universe.csv", index=True)

    if filepath.exists():
        print("Successfully exported! ✅")
    else:
        print("Export failed ⚠️")

except Exception as e:
    print(f"""Error during export ❌
        Details: {e}""")

Successfully exported! ✅
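
Checking that the file exists confirms the write happened but not that the contents survived the round trip. A stricter sketch, using plain pandas rather than the project's `save_csv_processed` helper (whose implementation is not shown here), reads the CSV back and compares it with the original; the frame and file name below are synthetic.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic stand-in for the merged table
original = pd.DataFrame(
    {"A": [1.5, 2.25], "B": [3.0, 4.125]},
    index=pd.Index(pd.to_datetime(["2019-01-02", "2019-01-03"]), name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "asset_universe_demo.csv"
    original.to_csv(csv_path, index=True)
    # float_precision="round_trip" asks the parser to reproduce floats exactly
    restored = pd.read_csv(
        csv_path, index_col="Date", parse_dates=True, float_precision="round_trip"
    )

same_shape = restored.shape == original.shape
same_dates = list(restored.index) == list(original.index)
same_values = bool((restored.to_numpy() == original.to_numpy()).all())

print(same_shape, same_dates, same_values)  # True True True
```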
