# News analysis

### Notebook description

This notebook analyzes the correlation between Bitcoin news and the Bitcoin price over the years.

### Data overview

The dataset used in the notebook's charts is the result of merging numerous public datasets with automated crawling; the sources of those datasets are listed in the README file.

Each post has been classified using fast-classifier from the Flair framework (read the README for details about the license).

## Basic Analysis

To keep the charts uniform, a project-wide color palette is used.

In [1]:
from palette import palette
base_path = '/mnt/hgfs/VMs_Shared/datasets/filtered/'

For consistency with the Twitter analysis, the posts' info has been grouped by date (see <notebook name> for grouping details).

In [2]:
daily_csv = base_path + 'news_daily_info.csv'

In [3]:
import pandas as pd

raw_df = pd.read_csv(daily_csv).dropna()

In [4]:
def avg_sentiment(group) -> float:
    total = sum(group['count'])
    sentiment = sum(group['count'] * group['signed_score'])/total
    return sentiment

def score_to_label(score) -> str:
    if score == 0:
        return 'NEUTRAL'
    return 'POSITIVE' if score > 0 else 'NEGATIVE'

def normalize(value: float, range_min: float, range_max: float) -> float:
    return (value-range_min)/(range_max-range_min)

def normalize_series(series, series_min=None, series_max=None) -> pd.Series:
    if series_min is None:
        series_min = min(series)
        
    if series_max is None:
        series_max = max(series)
    return series.apply(lambda x: normalize(x, series_min, series_max))

In [5]:
raw_df['label'] = raw_df['label'].apply(lambda x: x.replace('"', ''))
raw_df['signed_score'] = raw_df['conf'] * raw_df['label'].apply(lambda x: 1 if x == 'POSITIVE' else -1)

The common date range is calculated by intersecting the date range of the market dataframe with that of the news dataframe.

In [6]:
market_daily_csv = base_path + 'market_daily_info.csv'  # base_path already ends with '/'
market_dates = pd.read_csv(market_daily_csv).dropna()['date']

dates_min = max([min(market_dates), min(raw_df['date'])])
dates_max = min([max(market_dates), max(raw_df['date'])])

dates = pd.concat([market_dates, raw_df['date']])
    
dates = dates.drop_duplicates().sort_values()
dates = dates[(dates_min <= dates) & (dates <= dates_max)]
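
The range computation above can be checked on toy data. This is a minimal sketch with hypothetical dates, assuming the `date` columns hold ISO `YYYY-MM-DD` strings (as the `x[:-3]` slicing later in the notebook suggests), so that string comparison matches chronological order.

```python
import pandas as pd

# Toy data (hypothetical): ISO 'YYYY-MM-DD' strings sort chronologically,
# so min/max and <= comparisons work without parsing to datetime.
market_dates = pd.Series(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'], name='date')
news_dates = pd.Series(['2020-01-02', '2020-01-04', '2020-01-05'], name='date')

# The common range starts at the later of the two starts
# and ends at the earlier of the two ends.
dates_min = max(market_dates.min(), news_dates.min())
dates_max = min(market_dates.max(), news_dates.max())

# Union of the dates seen in either dataset, restricted to the common range.
dates = pd.concat([market_dates, news_dates]).drop_duplicates().sort_values()
dates = dates[(dates_min <= dates) & (dates <= dates_max)]

print(dates_min, dates_max)
print(dates.tolist())
```

Note that a date present in only one dataset (here `2020-01-03`) is still kept, because only the range is intersected, not the individual dates.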

In [7]:
raw_df = raw_df[(raw_df['date'] >= dates_min) & (raw_df['date'] <= dates_max)]

In [8]:
date_grouped = raw_df.groupby('date')
daily_df = pd.DataFrame(index=raw_df['date'].drop_duplicates())
daily_df['sentiment'] = date_grouped.apply(avg_sentiment)
daily_df['norm_sent'] = normalize_series(daily_df['sentiment'], -1, 1)
daily_df['label'] = daily_df['sentiment'].apply(score_to_label)
daily_df['count'] = date_grouped.apply(lambda x: sum(x['count']))
daily_df['norm_count'] = normalize_series(daily_df['count'], 0)
negatives = raw_df[raw_df['label'] == 'NEGATIVE'][['date', 'count']]
negatives.columns= ['date', 'negatives']
positives = raw_df[raw_df['label'] == 'POSITIVE'][['date', 'count']]
positives.columns= ['date', 'positives']
daily_df = daily_df.merge(negatives, on='date')
daily_df = daily_df.merge(positives, on='date')
daily_df = daily_df.drop_duplicates(subset=['date'])

In [9]:
daily_df = daily_df[(daily_df['date'] >= dates_min) & (daily_df['date'] <= dates_max)]
daily_df['norm_sent'] = normalize_series(daily_df['sentiment'], -1, 1)

As with the news, the market data has daily granularity.

In [10]:
market_daily_csv = base_path+ 'market_daily_info.csv'
market_raw_df = pd.read_csv(market_daily_csv)
market_raw_df = market_raw_df.dropna()
market_df = pd.DataFrame(dates, columns=['date'])
market_df = market_df.merge(market_raw_df, on='date')
market_df['mid_price'] = (market_df['high'] + market_df['low'])/2
market_df['norm_mid_price'] = normalize_series(market_df['mid_price'])
market_df = market_df[(dates_min <= market_df['date']) & (market_df['date']<= dates_max)]

In [11]:
import altair as alt

alt.data_transformers.enable('json')


### Monthly volume

In [12]:
daily_df['month'] = daily_df['date'].apply(lambda x: x[:-3])  # 'YYYY-MM-DD' -> 'YYYY-MM'

monthly_volume = daily_df[['month', 'count']].groupby(by='month', as_index=False).mean()

plot_title = alt.TitleParams('Monthly volume', subtitle='Average daily volume per month')
volume_chart = alt.Chart(monthly_volume, title=plot_title).mark_area().encode(alt.X('yearmonth(month):T', title='Date'),
                                                                                  alt.Y('count', title='Volume'),
                                                                                  color=alt.value(palette['news']))

volume_chart

The daily volume shows a pattern: there are fewer than 40 news items per day (2021 excluded), and in August the figure drops below 10 per day. This reflects both the crawling method and the nature of news: only the first Google News page was crawled for each day, and interest in financial topics normally drops in August, when people are on holiday.

### Basic data exploration

#### Sentiment

In [13]:
sent_rounded = daily_df[['norm_sent']].copy()
sent_rounded['norm_sent'] = sent_rounded['norm_sent'].apply(lambda x: round(x, 2))

alt.Chart(sent_rounded, title='Sentiment dispersion').mark_boxplot(color=palette['news']).encode(alt.X('norm_sent', title='Normalized sentiment')).properties(height=200)

The IQR shows that either the classifier had some trouble classifying the news or the news is simply impartial (in our experience, Bitcoin-related news is not impartial). Unlike what happens for other media, the news sentiment spans the whole classification range.

In [14]:
sent_dist = sent_rounded.groupby('norm_sent', as_index=False).size()
sent_dist.columns = ['norm_sent', 'count']

alt.Chart(sent_dist, title='Sentiment distribution').mark_area().encode(alt.X('norm_sent', title='Normalized sentiment'), alt.Y('count', title='Count'), color=alt.value(palette['news']))

The sentiment distribution is similar to a normal distribution but wider, so the Pearson correlation could be useful.
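
The visual impression of normality could also be checked numerically. This is a minimal sketch using sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data; `shape_summary` is a hypothetical helper and the values are toy data standing in for `daily_df['norm_sent']`.

```python
import pandas as pd

def shape_summary(series: pd.Series) -> dict:
    """Return sample skewness and excess kurtosis (both near 0 for normal data)."""
    return {'skew': series.skew(), 'excess_kurtosis': series.kurt()}

# Hypothetical, perfectly symmetric toy values standing in for daily_df['norm_sent'].
toy_sentiment = pd.Series([0.1, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.9])
summary = shape_summary(toy_sentiment)
print(summary)
```

On the real data, a skewness far from 0 would argue against relying on the Pearson coefficient alone.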

#### Volume

In [25]:
alt.Chart(daily_df, title='Volume dispersion').mark_boxplot(color=palette['news']).encode(alt.X('count', title=None)).properties(height=200)

As expected, there are a few outliers (news from 2021).

In [26]:
volumes = daily_df[['count']]

volumes_dist = volumes.groupby('count', as_index=False).size()
volumes_dist.columns = ['volume', 'count']

alt.Chart(volumes_dist, title='Volume distribution').mark_line().encode(alt.X('volume', title='Volume'), alt.Y('count', title='Count'), color=alt.value(palette['news']))

The volume distribution is clearly skewed, with most of its mass on the left, so it is far from normal and the Pearson correlation is not appropriate for evaluating the volume correlation.
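
The practical difference between the two kinds of coefficient on skewed data can be illustrated on toy values. This is a sketch under stated assumptions: the series are hypothetical, and Spearman's coefficient is computed as the Pearson correlation of the ranks (its definition) rather than through `method='spearman'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: y is a monotonic but strongly non-linear function of x,
# which makes y heavily skewed.
x = pd.Series(np.arange(1, 21), dtype=float)
y = np.exp(x)

# Pearson measures linear association and is dominated by the few huge y values.
pearson = x.corr(y)

# Spearman is the Pearson correlation of the ranks, so it only looks at ordering.
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

The relation between `x` and `y` is perfectly monotonic, which the rank-based coefficient captures while the linear one understates it.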

## Sentiment analysis

In [17]:
domain = [0, 1]
color_range = [palette['negative'], palette['positive']]

time_selector = alt.selection(type='interval', encodings=['x'])

gradient = alt.Color('norm_sent', scale=alt.Scale(domain=domain, range=color_range), title='Normalized sentiment')

price_chart = alt.Chart(market_df).mark_line(color=palette['strong_price']).encode(
    x=alt.X('yearmonthdate(date):T',
           scale=alt.Scale(domain=time_selector),
           title=None),
    y=alt.Y('mid_price')
)
        
plot_title = alt.TitleParams('Normalized sentiment vs Bitcoin price', subtitle='0:= negative, 1:= positive')
histogram = alt.Chart(daily_df, title=plot_title).mark_bar().encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100, extent=time_selector),
                                                       scale=alt.Scale(domain=time_selector),
                                                       axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                 alt.Y('norm_sent',
                                                      scale=alt.Scale(domain=[0,1]),
                                                      title='Normalized sentiment'),
                                                 color=gradient)




selection_plot = alt.Chart(daily_df).mark_bar().encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100),
                                                            title='Date',
                                                            axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                       alt.Y('norm_sent', title=None),
                                                       color=gradient).add_selection(time_selector).properties(height=50)

(histogram + price_chart).resolve_scale(y='independent') & selection_plot

Bar binning has problems plotting the sentiment when it lies in [-1, 1]. For that reason, the interactive version uses the normalized sentiment and the static one uses the original sentiment values.
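
The two scales are related by a linear map, which is what `normalize` computes when called with `range_min=-1` and `range_max=1`. A minimal sketch, with `to_unit_interval` and `to_signed` as hypothetical helper names:

```python
def to_unit_interval(sentiment: float) -> float:
    """Map a signed sentiment in [-1, 1] to [0, 1] (min-max normalization with fixed bounds)."""
    return (sentiment - (-1)) / (1 - (-1))

def to_signed(norm_sent: float) -> float:
    """Inverse mapping from [0, 1] back to [-1, 1]."""
    return norm_sent * 2 - 1

# A neutral sentiment of 0 sits in the middle of the normalized scale.
print(to_unit_interval(-1), to_unit_interval(0), to_unit_interval(1))
```

On the normalized scale, values above 0.5 therefore correspond to positive sentiment and values below 0.5 to negative sentiment.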

In [18]:
domain = [-1, 1]

gradient = alt.Color('sentiment', scale=alt.Scale(domain=domain, range=color_range), title='Sentiment')

price_chart = alt.Chart(market_df).mark_line(color=palette['strong_price']).encode(
    x=alt.X('yearmonthdate(date):T'),
    y=alt.Y('mid_price', title='Mid price')
)

plot_title = alt.TitleParams('Static sentiment vs Bitcoin price', subtitle='-1:= negative, 1:= positive')
histogram = alt.Chart(daily_df, title=plot_title).mark_bar().encode(alt.X('yearmonthdate(date):T', title='Date'),
                                                 alt.Y('sentiment',
                                                      scale=alt.Scale(domain=[-1,1]),
                                                      title='Sentiment'),
                                                 color=gradient)




(histogram + price_chart).resolve_scale(y='independent')

Plotting the sentiment in [-1, 1] makes it immediately clear whether the price moves in the same direction as the sentiment.

It is hard to see a pattern in the news sentiment: nearby days often have opposite sentiments. This suggests that the news is closer to speculation than to established facts.

#### Correlation

To measure the correlation between sentiment and price, two approaches will be used:
- TLCC (Time Lagged Cross-Correlation): a measure of the correlation of the whole time series given a list of time offsets
- Windowed TLCC: the time series are lagged as in the first case, but the correlation is calculated for each window; this is useful to understand correlation "direction" (so the time series' roles) over time.
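
The meaning of the offset sign can be checked on toy data. This is a minimal sketch assuming the same `shift(-offset)` convention used in the cells that follow; `lagged_corr` is a hypothetical helper and the series are synthetic.

```python
import numpy as np
import pandas as pd

def lagged_corr(leader: pd.Series, follower: pd.Series, offsets) -> pd.Series:
    """Pearson correlation between leader[t] and follower[t + offset] for each offset."""
    return pd.Series({offset: leader.corr(follower.shift(-offset)) for offset in offsets})

# Hypothetical toy data: `follower` repeats `leader` three days later.
rng = np.random.default_rng(0)
leader = pd.Series(rng.normal(size=200))
follower = leader.shift(3)

corrs = lagged_corr(leader, follower, range(-5, 6))
best_offset = corrs.idxmax()
print(best_offset)
```

With this convention, a peak at a positive offset means the first series anticipates the second by that many days, while a peak at a negative offset means the opposite.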

##### TLCC

In [19]:
methods = ['pearson', 'kendall', 'spearman']
offsets = list(range(-150, 151)) # list of days offset to test

correlations = []

sent_vs_price = pd.DataFrame(daily_df['date'], columns=['date'])
sent_vs_price['sent'] = daily_df['norm_sent']
sent_vs_price = sent_vs_price.merge(market_df[['date', 'mid_price']], on='date')

for method in methods:
    method_correlations = [(method, offset, sent_vs_price['sent'].corr(sent_vs_price['mid_price'].shift(-offset), method=method))
                           for offset in offsets]
    correlations.extend(method_correlations)
        
correlations_df = pd.DataFrame(correlations, columns=['method', 'offset', 'correlation'])

spearman_correlations = correlations_df[correlations_df['method'] == 'spearman']

max_corr = max(spearman_correlations['correlation'])
max_corr_offset = spearman_correlations[spearman_correlations['correlation'] == max_corr]['offset'].iloc[0]

min_corr = min(spearman_correlations['correlation'])
min_corr_offset = spearman_correlations[spearman_correlations['correlation'] == min_corr]['offset'].iloc[0]

max_corr_text = f'Max correlation ({round(max_corr, 3)}) with an offset of {max_corr_offset} days'
min_corr_text = f'Min correlation ({round(min_corr, 3)}) with an offset of {min_corr_offset} days'

plot_title = alt.TitleParams('Correlations', subtitle=[max_corr_text, min_corr_text])
corr_chart = alt.Chart(correlations_df, title=plot_title).mark_line().encode(alt.X('offset', title='Offset days'),
                                                          alt.Y('correlation', title='Correlation'),
                                                          alt.Color('method', title='Method'))

corr_chart

In this case, the Pearson correlation could be considered reliable, but its trend is similar to that of the other two.

The near-zero correlation reflects the inconsistency of the sentiment from day to day.

##### WTLCC

For simplicity, the next chart visualizes WTLCC using the Spearman correlation only.

In [20]:
from math import ceil

def get_window(series: pd.Series, window) -> pd.Series:
    return series.iloc[window[0]: window[1]]
    

def windowed_corr(first: pd.Series, second: pd.Series) -> tuple:
    # consecutive, non-overlapping windows of window_size rows (window_size is the global defined below)
    windows = [(window * window_size, (window * window_size)+window_size) for window in range(ceil(len(second)/window_size))]
    windows_corr = [get_window(first, window).corr(get_window(second, window), method = 'spearman') for window in windows]
    return windows_corr, windows

offsets = list(range(-66, 67)) # reduced offsets for better visualization
window_size = 120 # one window = 120 days (about four months)

windowed_correlations = []

for offset in offsets:
    windows_corr, windows = windowed_corr(sent_vs_price['sent'], sent_vs_price['mid_price'].shift(-offset))
    for window, window_corr in enumerate(windows_corr):
        windowed_correlations.append((window, window_corr, offset))
    
    
windowed_correlations_df = pd.DataFrame(windowed_correlations, columns=['window', 'correlation', 'offset'])


plot_title = alt.TitleParams('120 days lagged correlation sentiment/price', subtitle='-1:= price as master, 1:= sentiment as master')
color = alt.Color('correlation', scale=alt.Scale(domain=[-1, 1], range=[palette['negative'], palette['positive']]), title='Correlation')
alt.Chart(windowed_correlations_df, height=800, width=800, title=plot_title).mark_rect().encode(alt.X('window:O', title='Window'), alt.Y('offset:O', title='Offset days'), color)


In [27]:
from math import ceil

def get_window(series: pd.Series, window) -> pd.Series:
    return series.iloc[window[0]: window[1]]
    

def windowed_corr(first: pd.Series, second: pd.Series) -> tuple:
    windows = [(window * window_size, (window * window_size)+window_size) for window in range(ceil(len(second)/window_size))]
    windows_corr = [get_window(first, window).corr(get_window(second, window), method = 'spearman') for window in windows]
    return windows_corr, windows

offsets = list(range(-66, 67)) # reduced offsets for better visualization
window_size = 60 # one window = two months

windowed_correlations = []

for offset in offsets:
    windows_corr, windows = windowed_corr(sent_vs_price['sent'], sent_vs_price['mid_price'].shift(-offset))
    for window, window_corr in enumerate(windows_corr):
        windowed_correlations.append((window, window_corr, offset))
    
    
windowed_correlations_df = pd.DataFrame(windowed_correlations, columns=['window', 'correlation', 'offset'])


plot_title = alt.TitleParams('60 days lagged correlation sentiment/price', subtitle='-1:= price as master, 1:= sentiment as master')
color = alt.Color('correlation', scale=alt.Scale(domain=[-1, 1], range=[palette['negative'], palette['positive']]), title='Correlation')
alt.Chart(windowed_correlations_df, height=800, width=800, title=plot_title).mark_rect().encode(alt.X('window:O', title='Window'), alt.Y('offset:O', title='Offset days'), color)


The heatmaps show, again, that the sentiment is too volatile to be useful.
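
The windowed correlation in the cells above reads `window_size` from a global variable. A self-contained variant could look like the sketch below; `windowed_spearman` is a hypothetical name, the data is synthetic, and Spearman's coefficient is computed as the Pearson correlation of the ranks after dropping incomplete pairs.

```python
from math import ceil

import pandas as pd

def windowed_spearman(first: pd.Series, second: pd.Series, window_size: int) -> list:
    """Spearman correlation of two aligned series over consecutive, non-overlapping windows."""
    n_windows = ceil(len(first) / window_size)
    result = []
    for w in range(n_windows):
        a = first.iloc[w * window_size:(w + 1) * window_size]
        b = second.iloc[w * window_size:(w + 1) * window_size]
        # Keep only rows where both values are present (shifting introduces NaN).
        pair = pd.concat([a, b], axis=1).dropna()
        # Spearman = Pearson correlation of the ranks.
        result.append(pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank()))
    return result

# Hypothetical toy data: the relation flips sign halfway through.
first = pd.Series(range(20), dtype=float)
second = pd.concat([first.iloc[:10], -first.iloc[10:]])
window_corrs = windowed_spearman(first, second, window_size=10)
print(window_corrs)
```

The toy series are positively related in the first window and negatively related in the second, which is the kind of sign change a single whole-series coefficient would hide.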

## Volume analysis

Another aspect of the data is the volume. In other words: does it matter whether people speak well or badly about Bitcoin, or is it enough that people speak about it at all?

In [21]:
time_selector = alt.selection(type='interval', encodings=['x'])

dummy_df = pd.DataFrame({'date': [min(daily_df['date']), max(daily_df['date'])], 'count': [0, 0]})
zero_line = alt.Chart(dummy_df).mark_line(color='grey').encode(x=alt.X('yearmonthdate(date):T'), y=alt.Y('count'))

price = alt.Chart(market_df).mark_line(color=palette['price']).encode(
    x=alt.X('yearmonthdate(date):T',
           scale=alt.Scale(domain=time_selector),
           title=None),
    y=alt.Y('mid_price', title='Mid price')
)
        
plot_title = alt.TitleParams('Volume vs Bitcoin price')

histogram = alt.Chart(daily_df, title=plot_title).mark_bar(color=palette['neutral_1']).encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100, extent=time_selector),
                                                       scale=alt.Scale(domain=time_selector),
                                                       axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                 alt.Y('count',
                                                      title='Volume'))

histogram_reg = histogram.transform_regression('date', 'count', method='poly', order=9).mark_line(color=palette['strong_neutral_1'])

volume_chart = histogram  + zero_line

price_reg = price.transform_regression('date', 'mid_price', method='poly', order=9).mark_line(color=palette['strong_price'])


price_chart = price

selection_plot = alt.Chart(daily_df).mark_bar(color=palette['neutral_1']).encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100),
                                                            title='Date',
                                                            axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                       alt.Y('count', title=None)).add_selection(time_selector).properties(height=50)

alt.layer(volume_chart, price_chart).resolve_scale(y='independent') & selection_plot

The regularity of the news crawling (only the first Google News page for each day) makes the volume analysis useless.

#### Correlation

##### TLCC

In [22]:
methods = ['pearson', 'kendall', 'spearman']
offsets = list(range(-150, 151)) # list of days offset to test

correlations = []

volume_vs_price = pd.DataFrame(daily_df['date'], columns=['date'])
volume_vs_price['volume'] = daily_df['count']
volume_vs_price = volume_vs_price.merge(market_df[['date', 'mid_price']], on='date')

for method in methods:
    method_correlations = [(method, offset, volume_vs_price['volume'].corr(volume_vs_price['mid_price'].shift(-offset), method=method))
                           for offset in offsets]
    correlations.extend(method_correlations)
        
correlations_df = pd.DataFrame(correlations, columns=['method', 'offset', 'correlation'])

spearman_correlations = correlations_df[correlations_df['method'] == 'spearman']

max_corr = max(spearman_correlations['correlation'])
max_corr_offset = spearman_correlations[spearman_correlations['correlation'] == max_corr]['offset'].iloc[0]

min_corr = min(spearman_correlations['correlation'])
min_corr_offset = spearman_correlations[spearman_correlations['correlation'] == min_corr]['offset'].iloc[0]

max_corr_text = f'Max correlation ({round(max_corr, 3)}) with an offset of {max_corr_offset} days'
min_corr_text = f'Min correlation ({round(min_corr, 3)}) with an offset of {min_corr_offset} days'

plot_title = alt.TitleParams('Correlations', subtitle=[max_corr_text, min_corr_text])
corr_chart = alt.Chart(correlations_df, title=plot_title).mark_line().encode(alt.X('offset', title='Offset days'),
                                                          alt.Y('correlation', title='Correlation'),
                                                          alt.Color('method', title='Method'))

corr_chart

As anticipated, the volume correlation is too close to 0 to be considered informative.

##### WTLCC

In [23]:
from math import ceil

def get_window(series: pd.Series, window) -> pd.Series:
    return series.iloc[window[0]: window[1]]
    

def windowed_corr(first: pd.Series, second: pd.Series) -> tuple:
    # consecutive, non-overlapping windows of window_size rows (window_size is the global defined below)
    windows = [(window * window_size, (window * window_size)+window_size) for window in range(ceil(len(second)/window_size))]
    windows_corr = [get_window(first, window).corr(get_window(second, window), method = 'spearman') for window in windows]
    return windows_corr, windows

offsets = list(range(-66, 67)) # reduced offsets for better visualization
window_size = 120 # one window = 120 days (about four months)

windowed_correlations = []

for offset in offsets:
    windows_corr, windows = windowed_corr(volume_vs_price['volume'], volume_vs_price['mid_price'].shift(-offset))
    for window, window_corr in enumerate(windows_corr):
        windowed_correlations.append((window, window_corr, offset))
    
    
windowed_correlations_df = pd.DataFrame(windowed_correlations, columns=['window', 'correlation', 'offset'])


plot_title = alt.TitleParams('120 days lagged correlation volume/price', subtitle='-1:= price as master, 1:= volume as master')
color = alt.Color('correlation', scale=alt.Scale(domain=[-1, 1], range=[palette['negative'], palette['positive']]), title='Correlation')
alt.Chart(windowed_correlations_df, height=800, width=800, title=plot_title).mark_rect().encode(alt.X('window:O', title='Window'), alt.Y('offset:O', title='Offset days'), color)


For each offset, some windows are positively correlated while others are negatively correlated: another demonstration that, with this dataset, the volume correlation analysis is useless.

## Conclusions

- The crawling strategy used makes the volume correlation analysis useless.
- Taking into account the relevance of each news site could help to find sentiment patterns and a better correlation.
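
The second point could be explored with a relevance-weighted daily sentiment, analogous to the count-weighted `avg_sentiment` used above. This is only a sketch: the `source_weight` column is an assumption and does not exist in the notebook's dataset, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical per-article data; `source_weight` is an assumed relevance score,
# not a column of the notebook's dataset.
articles = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'signed_score': [0.9, -0.6, 0.4],
    'source_weight': [3.0, 1.0, 2.0],
})

def weighted_sentiment(group: pd.DataFrame) -> float:
    """Relevance-weighted mean of the signed sentiment scores of one day."""
    return (group['signed_score'] * group['source_weight']).sum() / group['source_weight'].sum()

# One weighted sentiment value per day.
daily = {date: weighted_sentiment(group) for date, group in articles.groupby('date')}
print(daily)
```

How the weights would be chosen (for example, from site traffic or editorial reputation) is left open here.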