# Bitcoin Price Trend Prediction with VADER and Logistic Regression
In this Jupyter notebook, we experiment with predicting the Bitcoin price trend from tweet features: the sentiment score and label, and the number of comments, likes, and retweets.
## 1. Setup

In [1]:
import re
import pandas as pd

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MinMaxScaler

### Helper function to get the sentiment score and sentiment label of a text using VADER

In [2]:
# VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool
# that is specifically attuned to sentiments expressed in social media.
# Create the analyzer once: constructing it loads the lexicon, which is wasteful to repeat for every tweet.
analyzer = SentimentIntensityAnalyzer()

def get_sentiment_score_and_label(text):
  # Analyze sentiment; 'compound' is the normalized overall score in [-1, 1]
  scores = analyzer.polarity_scores(text)

  score = scores['compound']
  # positive sentiment : (compound score >= 0.05)
  if score >= 0.05:
    label = 1
  # negative sentiment : (compound score <= -0.05)
  elif score <= -0.05:
    label = -1
  # neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
  else:
    label = 0
  return score, label

### Helper function to get the price trend label
Compute the percentage difference between the price when the tweet was published and the price 15 minutes later.
1. If the percentage difference >= 1%, label as 1 (increase)
2. If the percentage difference <= -1%, label as -1 (decrease)
3. Otherwise, label as 0 (no change)

In [3]:
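import ast

def get_price_trend_label(prices):
  # The column stores each price series as the string form of a list;
  # ast.literal_eval parses it without executing arbitrary code (unlike eval).
  prices_list = ast.literal_eval(prices)
  first_price = prices_list[0]   # price when the tweet was published
  last_price = prices_list[-1]   # price at the end of the 15-minute window

  price_difference = last_price - first_price
  percentage_difference = (price_difference / first_price) * 100

  if percentage_difference >= 1:
    print("increase by 1%", first_price, last_price)
    return 1 # increase
  elif percentage_difference <= -1:
    print("decrease by 1%", first_price, last_price)
    return -1 # decrease
  else:
    return 0 # no change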
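
As a worked example of the rule above, the sketch below recomputes the label for the first case in the notebook output (7395.8 falling to 7304.49). The helper names `percentage_change` and `trend_label` and the intermediate price in the series are hypothetical, introduced only for illustration; only the first and last prices come from the output.

```python
import ast


def percentage_change(first_price, last_price):
    """Relative change from first_price to last_price, in percent."""
    return (last_price - first_price) / first_price * 100


def trend_label(pct, threshold=1.0):
    """Map a percentage change to 1 (increase), -1 (decrease) or 0 (no change)."""
    if pct >= threshold:
        return 1
    if pct <= -threshold:
        return -1
    return 0


# The price column is assumed to hold the string form of a list.
# The middle value is made up; the first and last prices are from the output.
prices = ast.literal_eval("[7395.8, 7350.0, 7304.49]")

pct = percentage_change(prices[0], prices[-1])
label = trend_label(pct)
print(f"{pct:.4f}% -> label {label}")
```

The drop is about 1.23%, which is past the -1% threshold, so the tweet is labeled -1.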

### Helper function to clean up numbers in the Elon tweets dataset

In [4]:
def preprocess_number(num_str_origin, text):
    # Missing values arrive as the string 'nan' after the str() conversion
    if num_str_origin == 'nan':
        return 0

    # Remove any commas from the string
    num_str = num_str_origin.replace(',', '')

    # Extract the numerical value
    num = re.findall(r'\d+\.?\d*', num_str)
    if not num:
        # Nothing numeric found: log it and fall back to 0 instead of returning None
        print("failed to parse number", num_str, text)
        return 0

    value = float(num[0])

    # If K (thousand) is present, multiply by 1000.
    # round() avoids float truncation, e.g. int(2.3 * 1000) == 2299.
    if 'K' in num_str:
        return round(value * 1000)

    # If M (million) is present, multiply by 1000000
    if 'M' in num_str:
        return round(value * 1000000)

    # Otherwise, return the number
    return round(value)
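
To make the intended conversions concrete, here is a compact sketch of the same idea with a few sample inputs. `parse_count` is a hypothetical stand-in, not the notebook's helper: it assumes the K or M suffix directly follows the number, and it returns 0 when no digits are found.

```python
import re

# Suffix multipliers for abbreviated counts such as "1.2K" or "3M"
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(raw):
    """Convert a count string like '1,234', '1.2K' or '3M' to an int."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([KM])?", raw.replace(",", ""))
    if match is None:
        return 0  # covers 'nan' and any other string without digits
    number, suffix = match.groups()
    # round() rather than int(): 2.3 * 1000 is 2299.9999999999995 in floating point
    return round(float(number) * _MULTIPLIERS.get(suffix, 1))


for raw in ["nan", "987", "1,234", "1.2K", "2.3K", "3M"]:
    print(raw, "->", parse_count(raw))
```

Rounding matters for the abbreviated values: truncating with `int()` would turn "2.3K" into 2299 rather than 2300.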


## 2. Generate the training and test datasets
1. Generate the features (sentiment score, sentiment label, comments, likes, retweets) using the helper functions above.
2. Generate the target label using the price trend helper function.
3. Write the features and labels to a CSV file.
4. Split the data into training and test sets at an 80:20 ratio.

In [7]:
file_path = '../../data/elon-tweets-with-price.csv'
data = pd.read_csv(file_path)

data['sentiment_score'], data['sentiment_label'] = zip(*data['text'].apply(get_sentiment_score_and_label))
data['Comments'] = data.apply(lambda row: preprocess_number(str(row['Comments']), row['text']), axis=1)
data['Likes'] = data.apply(lambda row: preprocess_number(str(row['Likes']), row['text']), axis=1)
data['Retweets'] = data.apply(lambda row: preprocess_number(str(row['Retweets']), row['text']), axis=1)
data['price_trend_label'] = data['next_15min_prices'].apply(get_price_trend_label)

features = ['sentiment_score', 'sentiment_label', 'Comments', 'Likes', 'Retweets']
target = 'price_trend_label'

features_target_data = data[features + [target]]
features_target_data.to_csv("../../data/elon-tweets-features-n-labels.csv", index=False)

X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], test_size=0.2, random_state=42)

decrease by 1% 7395.8 7304.49
increase by 1% 7571.11 7666.67
increase by 1% 7583.07 7680.0
decrease by 1% 10045.42 9794.18
decrease by 1% 7770.51 7687.46
decrease by 1% 5298.31 5223.57
decrease by 1% 5280.78 5226.67
decrease by 1% 5335.31 5261.17
decrease by 1% 5410.08 5352.02
increase by 1% 4930.51 4991.3
decrease by 1% 6399.82 6166.66
increase by 1% 5083.46 5141.89
decrease by 1% 5395.94 5323.19
increase by 1% 6297.9 6374.35
increase by 1% 5310.3 5367.72
decrease by 1% 4979.59 4928.06
decrease by 1% 6266.89 6123.6
increase by 1% 6091.6 6203.59
increase by 1% 5804.23 5868.47
increase by 1% 5821.4 5898.61
increase by 1% 5788.63 5868.74
increase by 1% 5874.42 5946.62
increase by 1% 6457.41 6524.62
increase by 1% 6902.66 7090.17
increase by 1% 9823.68 9940.94
decrease by 1% 8710.69 8607.92
decrease by 1% 9368.13 9262.9
increase by 1% 9814.98 9925.94
increase by 1% 9803.97 9931.66
decrease by 1% 9925.0 9806.2
increase by 1% 10165.72 10286.75
increase by 1% 10573.31 10744.59
increase by 1%

### Analyse the distribution of the price_trend_label values

In [8]:
# Display the distribution of values in the price_trend_label column in the training set
print("Training Set:")
print(y_train.value_counts())

# Display the distribution of values in the price_trend_label column in the test set
print("\nTest Set:")
print(y_test.value_counts())

Training Set:
price_trend_label
 0    4235
 1      74
-1      58
Name: count, dtype: int64

Test Set:
price_trend_label
 0    1058
 1      21
-1      13
Name: count, dtype: int64


## 3. Use Logistic Regression to train the Bitcoin Price Trend Prediction model
1. Normalize the input features, since their values are on very different scales.
2. Use Logistic Regression with class_weight='balanced' to give more weight to classes 1 and -1, which have very few samples.
3. Evaluate the performance on the test data.
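
To see what step 2 does in practice, the sketch below reproduces the weighting that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`, using the training-set counts printed in section 2. The function name `balanced_class_weights` is hypothetical; this is plain Python, not the library's internal code.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Training-set distribution of price_trend_label from section 2
train_counts = {0: 4235, 1: 74, -1: 58}

weights = balanced_class_weights(train_counts)
for label, weight in weights.items():
    print(f"class {label:>2}: weight {weight:.2f}")
```

With these counts, an error on a class -1 sample costs roughly 73 times as much as an error on a class 0 sample (about 25.10 versus 0.34), which pushes the model to predict the minority classes far more often.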

In [61]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# give more importance to the minority classes
lr = LogisticRegression(class_weight='balanced')
lr.fit(X_train_scaled, y_train)

y_pred = lr.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.4267399267399267
Classification Report:
               precision    recall  f1-score   support

          -1       0.02      0.54      0.03        13
           0       0.97      0.43      0.60      1058
           1       0.01      0.10      0.02        21

    accuracy                           0.43      1092
   macro avg       0.33      0.36      0.22      1092
weighted avg       0.94      0.43      0.58      1092
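
One way to put these numbers in context is to compare them with a trivial baseline that always predicts class 0. The sketch below uses only the test-set counts and the per-class recalls printed in the report; the variable names are illustrative.

```python
# Test-set support per class, taken from the classification report
test_counts = {-1: 13, 0: 1058, 1: 21}
total = sum(test_counts.values())

# Accuracy of a classifier that always predicts the majority class (0)
baseline_accuracy = test_counts[0] / total

# Per-class recall as printed in the report (rounded to two decimals)
recalls = {-1: 0.54, 0: 0.43, 1: 0.10}
macro_recall = sum(recalls.values()) / len(recalls)

print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"macro-averaged recall:            {macro_recall:.2f}")
```

The always-0 baseline reaches about 0.97 accuracy, well above the model's 0.43, while the macro-averaged recall of about 0.36 is close to the 0.33 expected from guessing uniformly among three classes. This suggests these features carry little signal for the price trend, at least with this model.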

