Hey there! In this notebook, I'm tackling the challenge of predicting price movements in the cryptocurrency market. Using historical price data and some technical indicators (like moving averages and RSI), my goal is to build a model that can anticipate price direction. The metric I'm optimizing for is the macro-averaged F1 score: F1 (the harmonic mean of precision and recall) is computed for each class separately and then averaged, so both classes count equally. Let's see how this goes!
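
To make the metric concrete, here is a small pure-Python sketch of macro F1 on made-up labels. The helper names `f1_for_class` and `macro_f1` are my own for illustration; in the notebook itself the score comes from scikit-learn's `f1_score(..., average='macro')`.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# toy labels, made up for illustration
y_true = [1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]  # always predict the majority class

score = macro_f1(y_true, y_pred)
print(f"{score:.4f}")  # 0.4286
```

Always predicting the majority class is 75% accurate here, but macro F1 is only about 0.43 because the missed class contributes an F1 of zero.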



First, I load the data and take a quick look at its structure. There are three files: train.csv for training, test.csv for predictions, and sample_submission.csv, which shows the required submission format.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# load the training and test data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')

# display the first few rows of the training data to understand structure
train.head()


Unnamed: 0,timestamp,open,high,low,close,volume,quote_asset_volume,number_of_trades,taker_buy_base_volume,taker_buy_quote_volume,target
0,1525471260,0.9012,0.9013,0.9012,0.9013,134.98,121.646459,4.0,125.08,112.723589,1.0
1,1525471320,0.90185,0.90195,0.90185,0.90195,1070.54,965.505313,12.0,879.94,793.612703,0.0
2,1525471380,0.9014,0.9014,0.90139,0.90139,2293.06,2066.963991,5.0,0.0,0.0,0.0
3,1525471440,0.90139,0.9014,0.90138,0.90139,6850.59,6175.000909,19.0,1786.3,1610.149485,0.0
4,1525471500,0.90139,0.90139,0.9013,0.9013,832.3,750.222624,3.0,784.82,707.4289,0.0


Time to convert the timestamp (Unix epoch seconds) to a readable datetime format and fill any missing values with the column mean, so that no NaNs reach the model.

In [3]:
# convert timestamp to a datetime format for better readability
train['timestamp'] = pd.to_datetime(train['timestamp'], unit='s')
test['timestamp'] = pd.to_datetime(test['timestamp'], unit='s')

# handle any missing values by filling each numeric column with its mean
# this ensures no NaNs interfere with model training
train.fillna(train.mean(numeric_only=True), inplace=True)
test.fillna(test.mean(numeric_only=True), inplace=True)
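
As a quick sanity check on the conversion, this sketch applies `unit='s'` to the first three timestamps from the `train.head()` output above. It assumes nothing beyond those three values.

```python
import pandas as pd

# first three timestamps copied from the train.head() output
raw = pd.Series([1525471260, 1525471320, 1525471380])
ts = pd.to_datetime(raw, unit='s')

print(ts.iloc[0])         # 2018-05-04 22:01:00
print(ts.diff().iloc[1])  # 0 days 00:01:00
```

The first rows are one minute apart, so a 5-period rolling window over them spans five minutes.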


Here's where I add features to help the model pick up on price trends and volatility: the open-to-close price difference, the high-low range (used as a volatility measure), 5-, 10-, and 20-period moving averages, a 14-period RSI, and lagged values of close and volatility.

In [5]:
# create features: price difference, volatility, and moving averages
train['price_diff'] = train['close'] - train['open']
train['volatility'] = train['high'] - train['low']
test['price_diff'] = test['close'] - test['open']
test['volatility'] = test['high'] - test['low']

# calculate moving averages for different windows to capture trends
for period in [5, 10, 20]:
    train[f'ma_{period}'] = train['close'].rolling(window=period).mean()
    test[f'ma_{period}'] = test['close'].rolling(window=period).mean()

# calculate rsi (relative strength index) as a momentum indicator,
# using simple 14-period rolling means of gains and losses
delta = train['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
rs = gain / loss
train['rsi'] = 100 - (100 / (1 + rs))

delta_test = test['close'].diff()
gain_test = (delta_test.where(delta_test > 0, 0)).rolling(window=14).mean()
loss_test = (-delta_test.where(delta_test < 0, 0)).rolling(window=14).mean()
rs_test = gain_test / loss_test
test['rsi'] = 100 - (100 / (1 + rs_test))

# create lagged features to help capture previous values' impact on current values
lags = [1, 5, 10, 15, 30, 60]
for lag in lags:
    train[f'close_lag_{lag}'] = train['close'].shift(lag)
    test[f'close_lag_{lag}'] = test['close'].shift(lag)
    train[f'volatility_lag_{lag}'] = train['volatility'].shift(lag)
    test[f'volatility_lag_{lag}'] = test['volatility'].shift(lag)

# fill nan values created by the rolling windows and lags: backward fill first
# (this copies later values into earlier rows), then forward fill what remains
# DataFrame.bfill()/ffill() replace the deprecated fillna(method=...) form
train = train.bfill().ffill()
test = test.bfill().ffill()

# define the features and target for training
# dropping timestamp as it's not used by the model
# note: the 'Unnamed: 0' index column from train.csv is still included as a feature
features = [col for col in train.columns if col not in ['timestamp', 'target']]
target = 'target'
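
The cell above repeats every transformation once for train and once for test. One way to keep the two in sync is a single helper applied to both frames. This is only a sketch: `add_features` is a hypothetical name, the toy data is made up, and the logic mirrors the cell above rather than changing it.

```python
import pandas as pd


def add_features(df, ma_periods=(5, 10, 20), rsi_window=14,
                 lags=(1, 5, 10, 15, 30, 60)):
    """Return a copy of df with the indicator and lag columns added."""
    out = df.copy()
    out['price_diff'] = out['close'] - out['open']
    out['volatility'] = out['high'] - out['low']

    for period in ma_periods:
        out[f'ma_{period}'] = out['close'].rolling(window=period).mean()

    # rsi from simple rolling means of gains and losses
    delta = out['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(window=rsi_window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=rsi_window).mean()
    out['rsi'] = 100 - (100 / (1 + gain / loss))

    for lag in lags:
        out[f'close_lag_{lag}'] = out['close'].shift(lag)
        out[f'volatility_lag_{lag}'] = out['volatility'].shift(lag)

    # same fill order as the notebook cell: backward, then forward
    return out.bfill().ffill()


# toy data, made up for illustration: a steadily rising close price
toy = pd.DataFrame({
    'open': [float(i) for i in range(1, 31)],
    'close': [i + 0.5 for i in range(1, 31)],
})
toy['high'] = toy['close'] + 0.1
toy['low'] = toy['open'] - 0.1

toy_features = add_features(toy, lags=(1, 5))
print(toy_features[['close', 'ma_5', 'rsi', 'close_lag_1']].tail(3))
```

With a helper like this, `train = add_features(train)` and `test = add_features(test)` would produce the same columns in the same order.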


With the features ready, I split the data into training and validation sets (80/20), scale the features, and apply SMOTE to the training portion only to handle class imbalance. Then I train an XGBoost model with GPU support, which should help speed things up!

In [7]:
# split into training and validation sets (random, shuffled 80/20 split)
X = train[features]
y = train[target]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# scale features to have a mean of 0 and a standard deviation of 1 for better model performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# apply smote for oversampling the minority class
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# define and train the xgboost model on the gpu to speed up training
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=7,
    learning_rate=0.2,
    tree_method='gpu_hist',  # needs a CUDA GPU; xgboost >= 2.0 prefers tree_method='hist', device='cuda'
    eval_metric='logloss',
    random_state=42,
    use_label_encoder=False  # only relevant to older xgboost releases
)
xgb_model.fit(X_train_resampled, y_train_resampled)

# evaluate the model on the validation set using f1 score
y_val_pred = xgb_model.predict(X_val_scaled)
f1 = f1_score(y_val, y_val_pred, average='macro')
print(f"Validation F1 Score: {f1:.4f}")


Validation F1 Score: 0.5324
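
One caveat about this number: `train_test_split` shuffles rows by default, so neighboring minutes land on both sides of the split, and the lag features carry values from those neighbors. A chronological holdout is one alternative worth trying. The sketch below is an illustration only; `chronological_split` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def chronological_split(df, time_col='timestamp', val_fraction=0.2):
    """Hold out the most recent `val_fraction` of rows for validation."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cutoff = int(len(ordered) * (1 - val_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# toy frame, made up for illustration: ten one-minute bars
toy = pd.DataFrame({
    'timestamp': pd.date_range('2018-05-04 22:01', periods=10, freq='min'),
    'close': range(10),
})

train_part, val_part = chronological_split(toy)
print(len(train_part), len(val_part))  # 8 2
```

Every validation row is later than every training row, which is closer to how the model would be used on unseen future data.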


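To check that the score is stable, I'm running 5-fold cross-validation. Note that cross_val_score refits the model on the raw features in X, without the scaling and SMOTE steps used for the validation run, so the two scores are not directly comparable.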

In [8]:
# perform 5-fold cross-validation to check that the score is stable across folds
# note: cross_val_score refits a clone of the model on the raw X, without scaling or smote
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb_model, X, y, cv=kf, scoring='f1_macro')
print(f"5-Fold Cross-Validation Macro F1 Score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")


5-Fold Cross-Validation Macro F1 Score: 0.5107 ± 0.0008
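
If the preprocessing should be part of the cross-validation, one option is to wrap the steps in a pipeline so they are refit inside each fold, and to use time-ordered folds. The sketch below is a hedged illustration on synthetic data: `LogisticRegression` is a stand-in so the example runs without a GPU, and the names `X_toy`, `y_toy`, and `pipeline` are my own. In the notebook the last step would be the XGBoost classifier, and a SMOTE step would need `imblearn.pipeline.Pipeline` rather than scikit-learn's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, made up for illustration (rows assumed to be in time order)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

pipeline = Pipeline([
    ('scaler', StandardScaler()),     # refit on each training fold only
    ('model', LogisticRegression()),  # stand-in for the XGBoost classifier
])

# each fold trains on earlier rows and validates on the rows that follow
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring='f1_macro')
print(f"{scores.mean():.4f} ± {scores.std():.4f}")
```

Because the scaler lives inside the pipeline, its mean and standard deviation never see the validation rows of a fold.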


Reflecting on this project, I'm happy with the progress made but also see room for improvement: the macro F1 scores of 0.53 on the validation split and 0.51 in cross-validation are only slightly above 0.5. Working on a laptop while visiting family in upstate NY added some constraints. I didn't have access to my usual setup, so training larger models was a bit challenging. With more computational power, I could have explored sequence models such as LSTM or GRU networks in TensorFlow, which are designed for sequential data like this.

Additionally, with access to cloud resources or my main system, I’d run more intensive hyperparameter tuning and try out ensemble techniques. Despite these constraints, I think this notebook demonstrates a strong understanding of the end-to-end process for predictive modeling, from data preprocessing and feature engineering to model evaluation and optimization. Looking forward to applying these skills in a high-power environment next time!

In [9]:
# scale test data using the same scaler used for training
X_test_scaled = scaler.transform(test[features])

# make predictions on the test set
test_predictions = xgb_model.predict(X_test_scaled)

# create the submission file as required for competition format
submission = pd.DataFrame({
    'row_id': test.index,
    'target': test_predictions
})
submission.to_csv('submission.csv', index=False)
print(f"Submission file created: submission.csv with {len(submission)} rows")


Submission file created: submission.csv with 909617 rows
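
The notebook loads sample_submission.csv but builds the submission from `test.index`. A possible safeguard is to copy the sample file's layout and check the row count before writing. This is a sketch under assumptions: `build_submission` is a hypothetical helper, and the `row_id` and `target` column names are taken from the submission cell above, not from the sample file, which is not shown here.

```python
import pandas as pd


def build_submission(sample, predictions, target_col='target'):
    """Copy the sample submission's layout and fill in the predictions."""
    if len(sample) != len(predictions):
        raise ValueError(
            f"expected {len(sample)} predictions, got {len(predictions)}"
        )
    out = sample.copy()
    out[target_col] = list(predictions)
    return out


# toy stand-ins, made up for illustration
toy_sample = pd.DataFrame({'row_id': [0, 1, 2], 'target': [0, 0, 0]})
toy_predictions = [1, 0, 1]

toy_submission = build_submission(toy_sample, toy_predictions)
print(toy_submission)
```

In the notebook this would be called as `build_submission(sample_submission, test_predictions)` before `to_csv`.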
