## Correlation Analysis
The analysis proceeds in the following steps:
1. importing packages
2. loading the data and creating the datasets
3. preprocessing the datasets
4. calculating sentiment polarity using VADER
5. merging price and sentiment data based on their intervals
6. selecting the suitable interval for the analysis using cross-correlation
7. selecting the suitable MA/EMA (moving average / exponential moving average) hyperparameter (assuming the context improves the sentiment index)
8. storing the selected index as a dataset to be used later

### importing packages

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import logging
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from tqdm import tqdm

In [3]:
import sys
import os
current_working_directory = os.getcwd()
sys.path.append(os.path.dirname(current_working_directory))
from utils.eda import generate_report

In [4]:
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logging.debug("using debug mode")

DEBUG:root:using debug mode


In [59]:
tqdm.pandas()

### loading the data and creating the datasets

In [44]:
def clean_dataset(source: pd.DataFrame):
    """
    This function cleans the input DataFrame by:
    - Converting 'user_followers' and 'user_friends' columns to integer type
    - Converting 'user_verified' column to boolean type
    - Converting 'date' column to datetime type
    - Dropping rows with any null values
    - Setting 'date' as the index of the DataFrame
    It then returns the cleaned DataFrame.
    """
    df = source.copy()
    df["user_followers"] = pd.to_numeric(df["user_followers"], errors='coerce').astype('Int64')
    df["user_friends"] = pd.to_numeric(df["user_friends"], errors='coerce').astype('Int64')
    df["user_verified"] = df["user_verified"].astype("bool")
    df["date"] = pd.to_datetime(df["date"], errors='coerce')
    df = df.dropna().set_index("date")
    return df

def handle_dataset(file_path, df=None, columns=None):
    """
    This function handles the reading and writing of datasets.
    - If a DataFrame is provided, it writes the DataFrame to a CSV file at the given file path.
    - If no DataFrame is provided, it checks if a file exists at the given file path and reads it if it does.
    - If no file exists, it returns None.
    """
    if df is not None:
        logging.debug(f"Writing to csv at {file_path}")
        df.to_csv(file_path)
    elif os.path.isfile(file_path):
        logging.debug(f"Reading dataset from {file_path}")
        return pd.read_csv(file_path, lineterminator='\n', usecols=columns).set_index("date")
    return None

def slice_dataframe(sdf, start_datetime, end_datetime):
    """
    This function slices a DataFrame based on a given date range.
    - It first sorts the DataFrame by 'date'.
    - It then slices the DataFrame based on the given start and end datetime.
    - It returns the sliced DataFrame.
    """
    sorted_df = sdf.sort_values(by='date')
    sliced_df = sorted_df.loc[start_datetime:end_datetime]
    return sliced_df

def get_dataframe(cwd, source_dataset_address, clean_dataset_address, sliced_dataset_address,
                  start_datetime=None, end_datetime=None):
    """
    This function gets a DataFrame from a given file path.
    - It first tries to read a sliced dataset from a given file path.
    - If no sliced dataset exists, it tries to read a clean dataset from a given file path.
    - If no clean dataset exists, it generates a clean dataset from a source dataset.
    - It then slices the DataFrame to [start_datetime, end_datetime] and writes it to a CSV file.
      A bound left as None keeps that side of the range open.
    - It returns the DataFrame.
    """
    dataset_directory = os.path.join(os.path.dirname(cwd), "dataset")
    clean_dataset_address = os.path.join(dataset_directory, clean_dataset_address)
    sliced_dataset_address = os.path.join(dataset_directory, sliced_dataset_address)
    columns = ['user_followers', 'user_friends', 'user_verified', 'date', 'text']

    df = handle_dataset(sliced_dataset_address, columns=columns)
    if df is None:
        df = handle_dataset(clean_dataset_address, columns=columns)
        if df is None:
            logging.debug("Generating clean dataset from source")
            df = pd.read_csv(source_dataset_address, lineterminator='\n', usecols=columns)
            df = clean_dataset(df)
            handle_dataset(clean_dataset_address, df=df)
        # slice_dataframe requires both bounds to be passed explicitly
        df = slice_dataframe(df, start_datetime, end_datetime)
        handle_dataset(sliced_dataset_address, df=df)

    return df

In [46]:
# Reuse the sliced or cleaned dataset if it already exists; otherwise build the DataFrame from the source.
source_dataset_address = "../raw/bitcoin-tweets/Bitcoin_tweets.csv"
clean_dataset_address = "_correlation_analysis_clean.csv"
sliced_dataset_address = "_correlation_analysis_sliced.csv"
df = get_dataframe(current_working_directory, source_dataset_address, clean_dataset_address, sliced_dataset_address)

DEBUG:root:Reading dataset from /home/hamid/src/Financial_NLP/dataset/_correlation_analysis_sliced.csv


In [47]:
df

date,user_followers,user_friends,user_verified,text
2021-07-17 12:00:00,1534,2044,False,"Baller, Jack Mallers Calls Out Brian Armstrong..."
2021-07-17 12:00:00,27309,166,False,Square developing #bitcoin-focused business as...
2021-07-17 12:00:00,34,19,False,#Bitcoin\nCurrent Price:\n$ 31397.97\n€ 26640....
2021-07-17 12:00:00,953,119,False,Now tell me looking straight into my eyes that...
2021-07-17 12:00:00,145324,834,True,There are a number of walk-in NHS vaccination ...
...,...,...,...,...
2021-07-29 23:59:12,73,583,False,This only has 1k views? WTF #CRYPTO #Bitcoin #...
2021-07-29 23:59:14,413361,387,False,LET'S DISCUSS THE LATEST #CRYPTO NEWS!\n\n-- P...
2021-07-29 23:59:21,60,1049,False,@fireworksdoge Address : 0x01F69047c11DD924631...
2021-07-29 23:59:28,99,206,False,@drdisrespect doc do you #bitcoin


In [48]:
df.dtypes

user_followers     int64
user_friends       int64
user_verified       bool
text              object
dtype: object

### preprocessing the datasets

- remove ads
- extract emojis
- map hashtags and coin names to a single canonical symbol per coin
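
The notebook does not yet implement these steps. The following is a minimal sketch of what they could look like, using only the standard library. The alias table `COIN_ALIASES`, the ad keywords in `AD_PATTERN`, the emoji ranges, and all three function names are hypothetical placeholders, not part of the original analysis; a real filter would need curated lists.

```python
import re

# Hypothetical alias table: every spelling of a coin maps to one canonical token.
COIN_ALIASES = {
    "bitcoin": "BTC",
    "btc": "BTC",
    "ethereum": "ETH",
    "eth": "ETH",
}

# Hypothetical ad markers (assumption: ads tend to contain these phrases).
AD_PATTERN = re.compile(r"\b(giveaway|airdrop|free signals)\b", re.IGNORECASE)

# Rough emoji ranges only (pictographs, emoticons, transport, supplemental symbols).
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]"
)

# A word, optionally prefixed by a hashtag or cashtag marker.
TOKEN_PATTERN = re.compile(r"[#$]?(\w+)")


def is_ad(text: str) -> bool:
    """Return True if the text contains one of the ad marker phrases."""
    return bool(AD_PATTERN.search(text))


def extract_emojis(text: str) -> list[str]:
    """Return every character of the text that falls in the emoji ranges."""
    return EMOJI_PATTERN.findall(text)


def normalize_coins(text: str) -> str:
    """Replace hashtags, cashtags, and plain coin names with one symbol."""
    def replace(match: re.Match) -> str:
        alias = COIN_ALIASES.get(match.group(1).lower())
        return alias if alias else match.group(0)

    return TOKEN_PATTERN.sub(replace, text)
```

Applied to the dataset, this could look like `df = df[~df["text"].map(is_ad)]` followed by `df["text"] = df["text"].map(normalize_coins)`. Whether emojis should be dropped or kept for VADER is left open here.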

### calculating sentiment polarity using VADER

In [51]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/hamid/nltk_data...


True

In [54]:
sia = SentimentIntensityAnalyzer()

In [60]:
df[['negative', 'neutral', 'positive', 'compound']] = df['text'].progress_apply(lambda text: pd.Series(sia.polarity_scores(text)))

100%|█████████████████████████████████| 331234/331234 [02:15<00:00, 2445.64it/s]


In [65]:
df.index = pd.to_datetime(df.index)

# Initialize a dictionary to store the resampled DataFrames
resampled_dfs = {}

# Resample to intervals and aggregate sentiment scores
for interval in ['5T', '15T', '30T', 'H']:
    resampled_dfs[interval] = df.resample(interval).agg({
        'negative': 'mean',
        'neutral': 'mean',
        'positive': 'mean',
        'compound': 'mean'
    })

In [68]:
resampled_dfs.keys()

dict_keys(['5T', '15T', '30T', 'H'])

### merging price and sentiment data based on their intervals
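
This step is not implemented in the notebook yet. As a rough sketch, assuming a price DataFrame indexed by timestamps at the same interval as one entry of `resampled_dfs`, the merge and a simple lagged correlation (one possible reading of the cross-correlation step) could look like the code below. The `close` column name, the use of percentage returns, and both function names are assumptions; no price data appears in this notebook.

```python
import pandas as pd


def merge_price_sentiment(price: pd.DataFrame, sentiment: pd.DataFrame) -> pd.DataFrame:
    """Inner-join two frames on their shared DatetimeIndex and drop incomplete rows."""
    merged = price.join(sentiment, how="inner")
    return merged.dropna()


def lagged_correlation(
    merged: pd.DataFrame,
    sentiment_col: str = "compound",
    price_col: str = "close",  # hypothetical column name for the price series
    max_lag: int = 3,
) -> pd.Series:
    """Pearson correlation between sentiment at t and price return at t + lag."""
    returns = merged[price_col].pct_change()
    result = {}
    for lag in range(max_lag + 1):
        # shift(-lag) aligns the return observed `lag` steps later with row t
        result[lag] = merged[sentiment_col].corr(returns.shift(-lag))
    return pd.Series(result, name="correlation")
```

Running `lagged_correlation` once per interval in `resampled_dfs` and comparing the resulting series is one way to choose the interval; other criteria are equally plausible.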

### selecting the suitable MA/EMA hyperparameter (assuming the context improves the sentiment index)
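
This step is also still open. A minimal sketch of the two smoothers, assuming the sentiment index is the resampled `compound` column and that the hyperparameter in question is the window length, is shown below. The function name `smooth_sentiment` and the choice of `span=window` for the EMA are assumptions made for illustration.

```python
import pandas as pd


def smooth_sentiment(series: pd.Series, window: int) -> pd.DataFrame:
    """Return the raw series next to its simple and exponential moving averages."""
    # Simple moving average; min_periods=1 avoids leading NaNs
    sma = series.rolling(window=window, min_periods=1).mean()
    # Exponential moving average with smoothing factor alpha = 2 / (window + 1)
    ema = series.ewm(span=window, adjust=False).mean()
    return pd.DataFrame({"raw": series, "sma": sma, "ema": ema})
```

A candidate window could then be chosen by smoothing `resampled_dfs[interval]["compound"]` for several window values and comparing how each smoothed series correlates with the price series.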

### storing the selected index as a dataset to be used later