# Crypto Sentiment on Chart Analysis

This notebook explores whether sentiment on 4chan's Business and Finance board is related to the price action of selected cryptocurrencies. 4chan is a convenient source for sentiment analysis because its posts are freely accessible via its API, making it a cost-effective alternative to platforms like Twitter. Posts are binned into time intervals that match the intervals of a cryptocurrency price chart, a net sentiment score is calculated for each bin, and the scores are overlaid on the price chart so that any patterns or trends can be observed. Here's how it works:

- **Data Collection**: Scrape 4chan thread posts related to cryptocurrencies, focusing on the Business and Finance board. Then, organise the posts in chronological order.
- **Datetime Binning**: Slot the post data into predefined datetime bins that correspond to intervals on a cryptocurrency price chart.
- **Sentiment Analysis**: Perform sentiment analysis on the text data within each bin to categorise posts as bullish, neutral, or bearish, then sum the sentiment scores within each bin to generate a net sentiment score.
- **Net Sentiment Score Calculation**: Calculate the average net sentiment score by subtracting the total bearish score from the total bullish score and dividing by the number of predictions in the bin. A positive score indicates bullish sentiment, while a negative score indicates bearish sentiment.
- **Analysis**: Compare the net sentiment scores with the corresponding price action in the cryptocurrency market. Investigate whether there is a discernible relationship between the sentiment on 4chan and the subsequent price movements.
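
The net sentiment calculation described in the steps above can be sketched as follows. This is a minimal illustration, assuming each prediction is a dict with a `label` and a `score` key; the function name `net_sentiment` and the sample predictions are hypothetical and not part of this project.

```python
def net_sentiment(preds):
    """Return (summed score, average score) for one bin of predictions."""
    total = 0.0
    for pred in preds:
        label = pred["label"].lower()
        if label == "bullish":
            total += pred["score"]
        elif label == "bearish":
            total -= pred["score"]
        # Neutral predictions contribute nothing to the sum.
    # Average over all predictions in the bin, including neutral ones.
    average = total / len(preds) if preds else 0.0
    return total, average


# Hypothetical predictions for a single bin.
sample_preds = [
    {"label": "Bullish", "score": 0.9},
    {"label": "Bearish", "score": 0.4},
    {"label": "Neutral", "score": 0.7},
]
total, average = net_sentiment(sample_preds)
print(total, average)
```

Dividing by the number of predictions keeps busy bins from dominating quiet ones, which is why the notebook tracks both the summed and the averaged score.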

This notebook demonstrates the analysis using cryptocurrency price data sourced from the Binance, OKX, and Bybit APIs and thread post data from 4chan's Business and Finance board.

## Prepare your Environment

Ensure that the 'venv' kernel is selected for this notebook. If not, click on 'Kernel' at the top bar, select 'Change Kernel...' and select 'crypto-trading-analysis' as the kernel. For convenience, ensure that 'Always start the preferred kernel' is ticked. Click 'Select' to confirm the setting.

Install the environment's dependencies using the command below. After installation, restart the kernel to use the updated packages. To restart, click on 'Kernel' at the top bar and select 'Restart Kernel' and click on 'Restart'. Please skip this step if you have already done it.

In [None]:
pip install -r requirements.txt

## Import packages

In [12]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os
import sys
from datetime import datetime
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from statsmodels.tsa.stattools import coint
from itertools import combinations
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
from utils import calculate_profit, plot_strategy
from data_manager import load_ts_df, process_data, sanitize_data, save_sentiment_score_df, load_presaved_df
from social_media_analysis.data_manager import load_df_range as load_post_df_range

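# Machine-specific network settings: route all requests through a local proxy and
# disable CA bundle verification. Remove or adjust these lines if no proxy is listening
# on these ports; otherwise the model download from Hugging Face fails with a ProxyError.
os.environ['CURL_CA_BUNDLE'] = ''
os.environ['REQUESTS_CA_BUNDLE'] = ''
os.environ['https_proxy'] = 'http://127.0.0.1:10809'
os.environ['http_proxy'] = 'http://127.0.0.1:10809'
os.environ['all_proxy'] = 'socks5://127.0.0.1:10808'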

## Process Price Dataframe

- Before proceeding, ensure that the price data has been downloaded using ***'data_manager.py'***.
- Enter the ***cex*** (Centralized Exchange) and ***interval*** values used for data download to load the relevant *.pkl* files and retrieve the dataframe.
- All available pairs will be loaded by default.
- Note that some pairs might be new and may lack sufficient data within the downloaded timeframe. Such pairs will be removed based on the ***nan_remove_threshold*** setting, which defines the maximum percentage of NaN values allowed relative to the total data points. For example, with a ***nan_remove_threshold*** of 0.1, if a pair has 100 data points and 15 are NaN, the pair will be excluded.
- From the remaining pairs, you can filter the top N volume pairs using the ***top_n_volume_pairs*** parameter.
- This part of the code will also ensure that all timeseries columns have the same number of data points.
- The earliest and latest dates for all pairs will be recorded. These dates can then be used to determine the timeframe for slicing the data in the next step.
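
The NaN filter described above can be illustrated with a short pandas sketch. It assumes a pair is dropped when its share of NaN values exceeds the threshold; the helper name `drop_sparse_pairs` and the sample data are hypothetical, and the actual logic inside `process_data` may differ.

```python
import numpy as np
import pandas as pd


def drop_sparse_pairs(df, nan_remove_threshold):
    """Drop columns whose share of NaN values exceeds the threshold."""
    nan_ratio = df.isna().mean()  # fraction of NaN values per column
    keep = nan_ratio[nan_ratio <= nan_remove_threshold].index
    return df[keep]


# Hypothetical close prices: 100 rows, NEWUSDT is missing its first 15.
prices = pd.DataFrame({
    "BTCUSDT": np.linspace(40000.0, 60000.0, 100),
    "NEWUSDT": [np.nan] * 15 + list(np.linspace(1.0, 2.0, 85)),
})
filtered = drop_sparse_pairs(prices, nan_remove_threshold=0.1)
print(list(filtered.columns))
```

Here NEWUSDT has 15% missing values, which is above the 0.1 threshold, so only BTCUSDT remains.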

### Inputs

In [2]:
##### INPUTS #####
cex = 'binance'
interval = '1d'
nan_remove_threshold = 0.1

# Select only the top N mean volume pairs from the selected pairs to analyse.
top_n_volume_pairs = 100

# Select volume filter mode. Options: ['rolling', 'mean'].
volume_filter_mode = 'rolling'
##################

In [3]:
print("\nMode: Crypto Sentiment on Chart Analysis")
print("CEX: {}".format(str(cex).capitalize()))
print("Interval: {}".format(interval))
print("NaN Remove Threshold: {}".format(nan_remove_threshold))
print("Top N Volume Pairs: {}".format(top_n_volume_pairs))
print("Volume Filter Mode: {}".format(str(volume_filter_mode).capitalize()))

merged_price_df = process_data('sentiment_on_chart', cex, interval, nan_remove_threshold, [],
                 top_n_volume_pairs, volume_filter_mode)

print("\n")


Mode: Crypto Sentiment on Chart Analysis
CEX: Binance
Interval: 1d
NaN Remove Threshold: 0.1
Top N Volume Pairs: 100
Volume Filter Mode: Rolling

Columns that contains NaN values:
               Pair  NaN Count          Remark
5           DIAUSDT        359       To Remove
13        EIGENUSDT        358       To Remove
74          COSUSDT        357       To Remove
61          REIUSDT        354       To Remove
37        HMSTRUSDT        353       To Remove
43         LOKAUSDT        351       To Remove
83         GHSTUSDT        350       To Remove
31          FIOUSDT        347       To Remove
99         CATIUSDT        347       To Remove
0          FIDAUSDT        346       To Remove
96          KDAUSDT        345       To Remove
70        NEIROUSDT        343       To Remove
85   1MBABYDOGEUSDT        343       To Remove
42       UXLINKUSDT        342       To Remove
2           POLUSDT        340       To Remove
100       AERGOUSDT        337       To Remove
8           RPLUSDT 

## Sanitize the Price Dataframe

- Slice the dataframe according to the specified ***start_date*** and ***end_date***. Choose ***start_date*** and ***end_date*** within the timeframe shown by the output of the previous cell.
- Interpolate any missing values in the dataframe.
- If interpolation fails, backfill the missing values instead.
- Verify the result by checking the shapes of the first two pairs, which should have the same dimensions.
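
The slicing and gap-filling steps above can be sketched in pandas as follows. The implementation of `sanitize_data` is not shown in this notebook, so this sketch assumes linear interpolation followed by a backfill; the helper name `slice_and_fill` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def slice_and_fill(df, start_date, end_date):
    """Slice a datetime-indexed frame and fill its gaps."""
    sliced = df.loc[start_date:end_date]  # label slicing is inclusive on both ends
    # Linear interpolation fills interior gaps but leaves leading NaNs untouched,
    # so a backfill is applied afterwards to cover them.
    return sliced.interpolate(method="linear").bfill()


index = pd.date_range("2024-01-01", periods=6, freq="D")
raw = pd.DataFrame({"Close": [np.nan, 10.0, np.nan, 14.0, 16.0, 18.0]}, index=index)
clean = slice_and_fill(raw, "2024-01-01", "2024-01-05")
print(clean)
```

After filling, every pair sliced over the same date range has the same number of rows, which is what the shape check in this section relies on.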

### Inputs

In [4]:
##### INPUTS #####
start_date = '2024-01-01'
end_date = '2024-10-07'
##################

In [5]:
print("\n")

price_data_sanitized, sorted_available_pairs = sanitize_data(merged_price_df, start_date, end_date)

if price_data_sanitized:
    print("-Data Check-")
    # Print the shapes of the first two pairs; they should match.
    for key in list(price_data_sanitized)[:2]:
        print("{}'s Data Shape: {}".format(key, price_data_sanitized[key].shape))
else:
    print("No data found.")

print("\n")



-Data Check-
BTCUSDT's Data Shape: (281, 1)
ETHUSDT's Data Shape: (281, 1)




## Process Post Dataframe

- Select the post data source. The available options are *4chan*, *hugging_face*, and *telegram*.
   - To download 4chan's data, change directory into the ***'social_media_analysis'*** folder and run ***'python data_manager.py'*** in the terminal first.
   - To download Hugging Face's data, change directory into the ***'social_media_analysis'*** folder and run ***'python download_hugging_face_data.py'*** in the terminal first.
   - To download Telegram Channel's data, change directory into the ***'social_media_analysis'*** folder and run ***'python scrape_telegram_data.py'*** in the terminal first.
- The relative directory path to the data, ***dir_path***, is built from ***source*** and ***tag*** (e.g. './social_media_analysis/saved_data/4chan/biz').

### Inputs

In [6]:
##### INPUTS #####
# Select data source. Options: ['4chan', 'hugging_face', 'telegram'].
source = 'telegram'

# Select source tag. Eg. 'biz' (source = '4chan'), '-1001164734593-SpiderCrypto Trading Journal' (source = 'telegram')
tag = "-1001369518127-Crypto Mumbles"
##################

In [7]:
dir_path = './social_media_analysis/saved_data/{}/{}'.format(source, tag)
merged_post_df = load_post_df_range(dir_path, source, start_date, end_date)

merged_post_df.head(10)

Unnamed: 0,Date Time,Comment UUID,Chat ID,Chat Title,User ID,Username,Comment
0,2024-01-01 23:43:25,-1001369518127-6401,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,started tapping and farming this like cookie c...
1,2024-01-02 04:19:04,-1001369518127-6407,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,you’ll qualify for the airdrop if you’ve spent...
2,2024-01-02 16:48:03,-1001369518127-6412,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,"new ฿ yearly high: ,386"
3,2024-01-02 21:02:35,-1001369518127-6414,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,thoughts on approval?👇🏼 thumbs up - sell the n...
4,2024-01-02 21:31:24,-1001369518127-6416,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,seems like majority thinks that there won't be...
5,2024-01-02 21:46:00,-1001369518127-6417,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,a comfy 2.5x from entry so far approaching 100...
6,2024-01-02 21:57:49,-1001369518127-6419,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,expected to launch end january
7,2024-01-02 21:59:44,-1001369518127-6420,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,recommended validators to stake with for the c...
8,2024-01-02 22:22:13,-1001369518127-6423,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,check your eligibility for dymension airdrop h...
9,2024-01-02 22:40:23,-1001369518127-6424,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,remember to check all your wallets before the ...


In [8]:
_, sample_price_df = next(iter(price_data_sanitized.items()))
bin_datetime_df = pd.DataFrame(sample_price_df.index)
bin_datetime_df.columns = ['Binned Date Time']

In [9]:
# Use pd.merge_asof to align to bin_datetime_df
post_data_binned_df = pd.merge_asof(merged_post_df, bin_datetime_df, left_on='Date Time', right_on='Binned Date Time')
post_data_binned_df['Date Time'] = post_data_binned_df['Binned Date Time']
post_data_binned_df = post_data_binned_df.drop(columns=['Binned Date Time'])
post_data_binned_df = post_data_binned_df.dropna(subset=['Date Time'])
post_data_binned_df.head(10)

Unnamed: 0,Date Time,Comment UUID,Chat ID,Chat Title,User ID,Username,Comment
0,2024-01-01,-1001369518127-6401,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,started tapping and farming this like cookie c...
1,2024-01-02,-1001369518127-6407,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,you’ll qualify for the airdrop if you’ve spent...
2,2024-01-02,-1001369518127-6412,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,"new ฿ yearly high: ,386"
3,2024-01-02,-1001369518127-6414,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,thoughts on approval?👇🏼 thumbs up - sell the n...
4,2024-01-02,-1001369518127-6416,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,seems like majority thinks that there won't be...
5,2024-01-02,-1001369518127-6417,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,a comfy 2.5x from entry so far approaching 100...
6,2024-01-02,-1001369518127-6419,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,expected to launch end january
7,2024-01-02,-1001369518127-6420,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,recommended validators to stake with for the c...
8,2024-01-02,-1001369518127-6423,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,check your eligibility for dymension airdrop h...
9,2024-01-02,-1001369518127-6424,-1001369518127,Crypto Mumbles,-1001369518127,cryptomumbles,remember to check all your wallets before the ...
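
The effect of the `pd.merge_asof` call above can be seen on a small example. This is a minimal sketch with hypothetical posts and daily bins; it relies on `merge_asof`'s default backward direction, which assigns each post to the latest bin opening time that is not after the post's timestamp.

```python
import pandas as pd

# Hypothetical posts with exact timestamps.
posts = pd.DataFrame({
    "Date Time": pd.to_datetime([
        "2024-01-01 23:43:25",
        "2024-01-02 04:19:04",
        "2024-01-03 00:00:00",
    ]),
    "Comment": ["a", "b", "c"],
})

# Daily bin opening times, as they would appear in a 1d price index.
bins = pd.DataFrame({
    "Binned Date Time": pd.date_range("2024-01-01", periods=3, freq="D"),
})

# Both frames must be sorted on their keys before calling merge_asof.
binned = pd.merge_asof(
    posts.sort_values("Date Time"),
    bins,
    left_on="Date Time",
    right_on="Binned Date Time",
)
print(binned)
```

Posts that fall before the first bin get no match and end up with a missing bin value, which is why the notebook drops rows with a missing 'Date Time' after the merge.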


## Load Pre-saved Sentiment Scores 

The purpose of this section is to identify which datetimes already have sentiment scores. If a datetime has a score, the LLM will skip performing sentiment analysis for that datetime. This helps to save time.

In [10]:
print("\n")

presaved_sentiment_score_df_base = pd.DataFrame(columns=['Open Time', 'Sentiment Score'])
sentiment_score_dir_path = './saved_data/sentiment_score/{}/{}/{}'.format(source, tag, cex + '_' + interval)
normalised_sentiment_score_dir_path = './saved_data/normalised_sentiment_score/{}/{}/{}'.format(source, tag, cex + '_' + interval)

presaved_sentiment_score_df, _ = load_presaved_df(presaved_sentiment_score_df_base, sentiment_score_dir_path)
presaved_normalised_sentiment_score_df, _ = load_presaved_df(presaved_sentiment_score_df_base, normalised_sentiment_score_dir_path)

print("Presaved Sentiment Score:")
print(presaved_sentiment_score_df.tail(5))
print("\n")
print("Presaved Normalised Sentiment Score:")
print(presaved_normalised_sentiment_score_df.tail(5))

presaved_sentiment_score_datetime_dict = pd.Series(True, index=presaved_sentiment_score_df['Open Time']).to_dict()
presaved_normalised_sentiment_score_datetime_dict = pd.Series(True, index=presaved_normalised_sentiment_score_df['Open Time']).to_dict()

overlapped_datetime = set(presaved_sentiment_score_datetime_dict.keys()).intersection(presaved_normalised_sentiment_score_datetime_dict.keys())
# 'dt' is used as the loop variable to avoid shadowing the imported datetime class.
overlapped_presaved_datetime_dict = {dt: True for dt in overlapped_datetime}

print("\n")



Presaved Sentiment Score:
Empty DataFrame
Columns: [Open Time, Sentiment Score]
Index: []


Presaved Normalised Sentiment Score:
Empty DataFrame
Columns: [Open Time, Sentiment Score]
Index: []




## Initialise the LLM and Check Its Inputs

In [13]:
model_name = "ElKulako/cryptobert"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=3)
max_length = 128
pipe = TextClassificationPipeline(model=model,
                                  tokenizer=tokenizer,
                                  max_length=max_length,                           # original max_length is 64
                                  truncation=True,
                                  padding='max_length')


ProxyError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /ElKulako/cryptobert/resolve/main/tokenizer_config.json (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x178a31510>: Failed to establish a new connection: [Errno 61] Connection refused')))

In [None]:
binned_datetime_comments_dict = {}
num_tokens_count_dict = {}

for _, row in post_data_binned_df.iterrows():
    bin_datetime = row['Date Time']
    bin_comment = row['Comment']
    tokens = tokenizer.tokenize(bin_comment)
    num_tokens = len(tokens)

    if num_tokens not in num_tokens_count_dict:
        num_tokens_count_dict[num_tokens] = 1
    else:
        num_tokens_count_dict[num_tokens] += 1
        
    if bin_datetime not in binned_datetime_comments_dict:
        binned_datetime_comments_dict[bin_datetime] = []

    binned_datetime_comments_dict[bin_datetime].append(bin_comment)

In [None]:
print("\n")

num_tokens = list(num_tokens_count_dict.keys())
counts = list(num_tokens_count_dict.values())

plt.figure(figsize=(10, 6))
plt.bar(num_tokens, counts)
plt.axvline(x=max_length, color='red', linestyle='--', label='Truncation Point')

plt.xlabel('Number of Tokens')
plt.ylabel('Count of Comments')
plt.title('Histogram of Number of Tokens')
plt.legend()

plt.show()

print("\n")

## Use LLM to perform Sentiment Analysis

This step can take a while, because every comment in each bin that has no pre-saved score is passed through the model.

In [None]:
print("\n")

post_bin_datetime_list = []
post_bin_sentiment_score_list = []
post_bin_normalised_sentiment_score_list = []

for bin_datetime, comments in binned_datetime_comments_dict.items():

    if bin_datetime in overlapped_presaved_datetime_dict:
        continue

    print("Performing sentiment analysis on comments in Date Time {} bin...".format(bin_datetime))
    
    sentiment_preds = pipe(comments)
    bin_sentiment_score = 0
    
    for sentiment_pred in sentiment_preds:
        if sentiment_pred['label'].lower() == 'bullish':
            bin_sentiment_score += sentiment_pred['score']
        elif sentiment_pred['label'].lower() == 'bearish':
            bin_sentiment_score -= sentiment_pred['score']

    bin_normalised_sentiment_score = bin_sentiment_score / len(sentiment_preds)

    post_bin_datetime_list.append(bin_datetime)
    post_bin_sentiment_score_list.append(bin_sentiment_score)
    post_bin_normalised_sentiment_score_list.append(bin_normalised_sentiment_score)

print("\n")

## Save and Check Sentiment Score Data

### Sentiment Score

In [None]:
print("\n")

binned_sentiment_score_df = pd.DataFrame({
    'Open Time': post_bin_datetime_list,
    'Sentiment Score': post_bin_sentiment_score_list
})

binned_sentiment_score_df = save_sentiment_score_df(binned_sentiment_score_df, sentiment_score_dir_path, start_date, end_date)
binned_sentiment_score_df = binned_sentiment_score_df.set_index('Open Time')

print("Sentiment Score:")
print(binned_sentiment_score_df.head(5))
print(binned_sentiment_score_df.tail(5))

print("\n")

### Normalised Sentiment Score

In [None]:
print("\n")

binned_normalised_sentiment_score_df = pd.DataFrame({
    'Open Time': post_bin_datetime_list,
    'Sentiment Score': post_bin_normalised_sentiment_score_list
})

binned_normalised_sentiment_score_df = save_sentiment_score_df(binned_normalised_sentiment_score_df, normalised_sentiment_score_dir_path, start_date, end_date)
binned_normalised_sentiment_score_df = binned_normalised_sentiment_score_df.set_index('Open Time')

print("Normalised Sentiment Score:")
print(binned_normalised_sentiment_score_df.head(5))
print(binned_normalised_sentiment_score_df.tail(5))

print("\n")

In [None]:
print("\nSelectable pairs:")

for pair in sorted_available_pairs:
    print("- {}".format(pair))

print("\n")

## Select Pairs for Detailed Analysis

- Select any combination of pairs from the list of selectable pairs printed above.

### Inputs

In [None]:
##### INPUTS #####
ticker_pairs = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
##################

## Plot Function

Bullish sentiment is highlighted on the chart in green and bearish sentiment in red. The opacity (alpha) of each highlight represents the magnitude of the sentiment score: the stronger the sentiment, the more opaque the colour.
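
The mapping from score to shading can be isolated in a small helper. This is a minimal sketch that mirrors the scaling and thresholds used by `plot_sentiment_on_chart` (scores scaled to a maximum alpha of 0.5, green only above a scaled value of 0.2, red for any negative value); the helper name `shade_for_bins` and its parameters are hypothetical.

```python
import pandas as pd


def shade_for_bins(sentiment_score, max_alpha=0.5, bullish_threshold=0.2):
    """Map each bin's score to a (colour, alpha) pair, or None for no shading."""
    max_abs = sentiment_score.abs().max()
    if max_abs == 0:
        # No sentiment at all: nothing to shade.
        return [None] * len(sentiment_score)
    scaled = sentiment_score / max_abs * max_alpha
    shades = []
    for value in scaled:
        if value > bullish_threshold:
            shades.append(("green", float(value)))
        elif value < 0:
            shades.append(("red", float(abs(value))))
        else:
            shades.append(None)
    return shades


scores = pd.Series([4.0, 1.0, 0.0, -2.0])
shades = shade_for_bins(scores)
print(shades)
```

Note the asymmetry: weakly bullish bins (scaled value of 0.2 or less) are left unshaded, while every bearish bin is shaded regardless of strength.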

In [None]:
def plot_sentiment_on_chart(ticker_pairs, price_data_sanitized, binned_sentiment_score_df):
    fig, axs = plt.subplots(len(ticker_pairs), 1, figsize=(18, 14))
    if len(ticker_pairs) == 1:
        axs = [axs]
    
    for i, ticker in enumerate(ticker_pairs):
    
        if ticker not in sorted_available_pairs:
            print("{} is not found in the list of selectable pairs. Please choose another one.".format(ticker))
            continue
    
        price_data = price_data_sanitized[ticker]['Close']
        open_time = price_data_sanitized[ticker].index
        sentiment_score = binned_sentiment_score_df['Sentiment Score']
        sentiment_score = sentiment_score.reindex(open_time)
        sentiment_score = sentiment_score.fillna(0)
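        max_abs_sentiment_score = sentiment_score.abs().max()
        # Guard against an all-zero series, which would otherwise produce NaN alpha values.
        if max_abs_sentiment_score == 0:
            max_abs_sentiment_score = 1
        # Scale scores into [-0.5, 0.5] so they can be used directly as alpha values.
        sentiment_score_normalized = (sentiment_score / max_abs_sentiment_score) * 0.5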
        
        axs[i].plot(open_time, price_data, label=f'{ticker}', color='gray', alpha=0.7)
    
        # Apply rolling mean with a window of 15
        price_data_smooth = price_data.rolling(window=15, min_periods=1).mean()
        axs[i].plot(open_time, price_data_smooth, label=f'{ticker} SMA', color='blue')
    
        if len(open_time) > 1:
            # Plot sentiment score regions
            for j in range(len(sentiment_score_normalized) - 1):
                start_time = open_time[j]
                end_time = open_time[j + 1]
                
                if sentiment_score_normalized.iloc[j] > 0.2:
                    axs[i].axvspan(start_time, end_time, color='green', alpha=sentiment_score_normalized.iloc[j])
                elif sentiment_score_normalized.iloc[j] < 0:
                    axs[i].axvspan(start_time, end_time, color='red', alpha=abs(sentiment_score_normalized.iloc[j]))
        
            # Handling the last bin if necessary
            last_time = open_time[-1]
            next_time = last_time + (open_time[-1] - open_time[-2])  # Assumes last bin has the same duration
            if sentiment_score_normalized.iloc[-1] > 0.2:
                axs[i].axvspan(last_time, next_time, color='green', alpha=sentiment_score_normalized.iloc[-1])
            elif sentiment_score_normalized.iloc[-1] < 0:
                axs[i].axvspan(last_time, next_time, color='red', alpha=abs(sentiment_score_normalized.iloc[-1]))
    
        green_patch = mpatches.Patch(color='green', label='Bullish Sentiment')
        red_patch = mpatches.Patch(color='red', label='Bearish Sentiment')
    
        handles, labels = axs[i].get_legend_handles_labels()
        handles.extend([green_patch, red_patch])
        labels.extend(['Bullish Sentiment', 'Bearish Sentiment'])
    
        axs[i].set_ylabel('Price ($)', fontsize=18)
        axs[i].set_xlabel('Open Time', fontsize=18)
        axs[i].set_title(f'{ticker}', fontsize=24)
        axs[i].legend(handles=handles, labels=labels, loc='best')
        axs[i].grid(True)
    
    plt.tight_layout()
    plt.show()

### Sentiment Score

In [None]:
print("\n")

plot_sentiment_on_chart(ticker_pairs, price_data_sanitized, binned_sentiment_score_df)

print("\n")

### Normalised Sentiment Score

In [None]:
print("\n")

plot_sentiment_on_chart(ticker_pairs, price_data_sanitized, binned_normalised_sentiment_score_df)

print("\n")