# Data Processing

This notebook shows the processing steps used to clean our datasets and populate the `clean_data` folder.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
%cd drive/MyDrive/fin_big_data_colab/
!ls

/content/drive/MyDrive/fin_big_data_colab
clean_data  data  data_processing.ipynb  helpers


In [3]:
import time
import gc
import torch
from helpers.data_processing import TweetsProcessing, BTCProcessing

%load_ext autoreload
%autoreload 2

## Bitcoin Dataset

In [4]:
start_btc = time.time()
btc = BTCProcessing(data_path='data/BTC-USDT.parquet', clean_folder='clean_data')
btc.hourly_granularity()
btc.compute_returns()
btc.save_clean_data()
end_btc = time.time()

# Convert to minutes, seconds, and milliseconds
elapsed_time = end_btc - start_btc
minutes = int(elapsed_time // 60)
seconds = int(elapsed_time % 60)
milliseconds = int((elapsed_time - int(elapsed_time)) * 1000)
print(f"Time spent for BTC pre-processing: {minutes}min {seconds}sec {milliseconds}ms")

Time spent for BTC pre-processing: 0min 1sec 856ms
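The elapsed-time conversion above can be factored into a small helper so that it does not have to be written out for each dataset. The sketch below is only an illustration: `format_elapsed` is a hypothetical name that does not exist in `helpers`, and it rounds to the nearest millisecond instead of truncating as the cell above does. It also uses `time.perf_counter()`, a monotonic clock that is generally better suited to measuring intervals than `time.time()`.

```python
import time


def format_elapsed(elapsed_seconds: float) -> str:
    """Format a duration in seconds as 'Xmin Ysec Zms' (hypothetical helper)."""
    # Work in whole milliseconds so float truncation cannot drop a millisecond.
    total_ms = round(elapsed_seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    seconds, milliseconds = divmod(remainder_ms, 1000)
    return f"{minutes}min {seconds}sec {milliseconds}ms"


start = time.perf_counter()   # monotonic clock for interval timing
sum(range(1_000_000))         # placeholder standing in for a processing step
print(f"Time spent: {format_elapsed(time.perf_counter() - start)}")
```

With this helper, each timing block reduces to one `print` call around the processing steps.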


## Twitter Dataset

In [5]:
start_tweeter = time.time()
tweeter = TweetsProcessing(data_path='data/tweets.zip', clean_folder='clean_data')
tweeter.select_pertinent_tweets(threshold=100)
tweeter.clean_tweets()
tweeter.remove_non_english_tweets(batch_size=16384)

gc.collect() # Run garbage collection
torch.cuda.empty_cache() # Empty the GPU cache

tweeter.tweets_sentiment_analysis(batch_size=128)
tweeter.save_clean_data()
end_tweeter = time.time()

# Convert to minutes, seconds, and milliseconds
elapsed_time = end_tweeter - start_tweeter
minutes = int(elapsed_time // 60)
seconds = int(elapsed_time % 60)
milliseconds = int((elapsed_time - int(elapsed_time)) * 1000)
print(f"Time spent for Tweeter pre-processing: {minutes}min {seconds}sec {milliseconds}ms")

17chunks [02:35,  9.16s/chunks]


Pertinence Selection Report:
	-Threshold: 100
	-Total number of original tweets: 16889765
	-Number of pertinent tweets, wrt threshold: 78809 (0.47%)


Cleaning tweets: 100%|██████████| 78809/78809 [00:02<00:00, 26460.21tweets/s]
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda
Detecting languages: 100%|██████████| 5/5 [13:32<00:00, 162.55s/it]


Language Detection Report:
	-Languages detected: 20
	['en' 'sw' 'nl' 'pt' 'ur' 'ru' 'de' 'hi' 'tr' 'th' 'es' 'it' 'pl' 'fr'
 'ja' 'ar' 'zh' 'bg' 'el' 'vi']
	-Number of English tweets kept: 59326 (75.28%)
	Note: Tweets from other languages have been stored into clean_data/foreign_lang_tweets.csv (19483 tweets)


Device set to use cuda
Detecting sentiments:   2%|▏         | 10/464 [00:04<03:53,  1.95it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Detecting sentiments: 100%|██████████| 464/464 [05:14<00:00,  1.47it/s]


Sentiment Analysis Report:
	-Number of tweets considered as 'BULLISH': 26359 (Avg. score: 0.80)
	-Number of tweets considered as 'NEUTRAL': 16495 (Avg. score: 0.77)
	-Number of tweets considered as 'BEARISH': 16472 (Avg. score: 0.83)

Time spent for Tweeter pre-processing: 22min 20sec 407ms
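As a quick consistency check on the logs above, the progress-bar totals and the reported percentages can be recomputed from the tweet counts. This is a minimal sketch that assumes each step processes fixed-size batches plus one final partial batch; the helper source is not shown here, so that batching behaviour is an assumption, and `n_batches` and `percent` are hypothetical names used only for illustration.

```python
def n_batches(n_items: int, batch_size: int) -> int:
    """Number of batches needed to cover n_items (ceiling division)."""
    return (n_items + batch_size - 1) // batch_size


def percent(part: int, whole: int) -> str:
    """Share of `part` in `whole`, formatted with two decimals."""
    return f"{100 * part / whole:.2f}%"


# Counts copied from the reports above.
original_tweets = 16889765
pertinent_tweets = 78809
english_tweets = 59326
foreign_tweets = 19483

print(n_batches(pertinent_tweets, 16384))          # language detection batches
print(n_batches(english_tweets, 128))              # sentiment analysis batches
print(percent(pertinent_tweets, original_tweets))  # share of pertinent tweets
print(percent(english_tweets, pertinent_tweets))   # share of English tweets
print(english_tweets + foreign_tweets == pertinent_tweets)
```

Under that assumption, the results agree with the log: 5 language-detection batches, 464 sentiment batches, 0.47% pertinent tweets, and 75.28% English tweets.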
