# Twitter analysis

### Notebook description

This notebook analyzes the correlation between tweets and the Bitcoin price over the years.

### Data overview

The dataset behind the notebook's charts is the result of merging numerous public datasets; their sources are listed in the README file.

The mixed origin of the dataset brings some problems, such as an inconsistent number of tweets per day and missing values. Some of the tweets' sentiments were already classified, but because the classification process was not described and the labels were not uniform across sources, each tweet has been re-classified using the fast-classifier from the Flair framework (read the README for details about the license).

## Basic Analysis

To achieve uniformity between charts, a project-wide color palette has been used.

In [1]:
from palette import palette
base_path = '/mnt/hgfs/VMs_Shared/datasets/filtered/'

To handle the large amount of data easily (about 26M tweets), the tweets' info has been grouped by date (see <nome notebook> for grouping details).
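
As a rough illustration of how such grouped rows can be turned into one sentiment value per day, the sketch below uses made-up data. It assumes one row per (date, label) pair with the columns `date`, `label`, `conf` and `count` that the cells below rely on; the values are hypothetical.

```python
import pandas as pd

# Hypothetical grouped rows: one row per (date, label), with the
# classifier confidence and the number of tweets in that group.
grouped = pd.DataFrame({
    'date':  ['2021-01-01', '2021-01-01', '2021-01-02'],
    'label': ['POSITIVE', 'NEGATIVE', 'NEGATIVE'],
    'conf':  [0.9, 0.6, 0.8],
    'count': [30, 10, 5],
})

# Sign the confidence: positive groups push the score up, negative ones down.
sign = grouped['label'].map({'POSITIVE': 1, 'NEGATIVE': -1})
grouped['signed_score'] = grouped['conf'] * sign

# Count-weighted mean of the signed score for each day.
weighted = (grouped['signed_score'] * grouped['count']).groupby(grouped['date']).sum()
totals = grouped.groupby('date')['count'].sum()
daily_sentiment = weighted / totals
```

Weighting by `count` keeps a small group of very confident tweets from dominating a day where most tweets point the other way.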

In [2]:
daily_csv = base_path + 'twitter_daily_info.csv'

In [3]:
import pandas as pd

raw_df = pd.read_csv(daily_csv).dropna()

In [4]:
def avg_sentiment(group) -> float:
    total = sum(group['count'])
    sentiment = sum(group['count'] * group['signed_score'])/total
    return sentiment

def score_to_label(score) -> str:
    if score == 0:
        return 'NEUTRAL'
    return 'POSITIVE' if score > 0 else 'NEGATIVE'

def normalize(value: float, range_min: float, range_max: float) -> float:
    return (value-range_min)/(range_max-range_min)

def normalize_series(series, series_min=None, series_max=None) -> pd.Series:
    if series_min is None:
        series_min = min(series)
        
    if series_max is None:
        series_max = max(series)
    return series.apply(lambda x: normalize(x, series_min, series_max))

In [5]:
raw_df['label'] = raw_df['label'].apply(lambda x: x.replace('"', ''))
raw_df['signed_score'] = raw_df['conf'] * raw_df['label'].apply(lambda x: 1 if x == 'POSITIVE' else -1)

The common date range is calculated by intersecting the date range of the market dataframe with that of the Twitter one.
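
The idea can be sketched on toy data as follows. The dates are hypothetical, and the sketch assumes ISO-formatted `YYYY-MM-DD` strings, which sort chronologically when compared as text.

```python
import pandas as pd

# Hypothetical date columns of the two sources.
market_dates = pd.Series(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'])
twitter_dates = pd.Series(['2021-01-02', '2021-01-03', '2021-01-05'])

# The common range starts at the later of the two first dates
# and ends at the earlier of the two last dates.
start = max(market_dates.min(), twitter_dates.min())
end = min(market_dates.max(), twitter_dates.max())

# Union of all dates, restricted to the common range.
all_dates = pd.concat([market_dates, twitter_dates]).drop_duplicates().sort_values()
common = all_dates[(all_dates >= start) & (all_dates <= end)].tolist()
```

Note that the result is the union of the dates inside the shared range, so a day present in only one source is still kept.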

In [6]:
market_daily_csv = base_path + 'market_daily_info.csv'  # base_path already ends with '/'
market_dates = pd.read_csv(market_daily_csv).dropna()['date']

dates_min = max([min(market_dates), min(raw_df['date'])])
dates_max = min([max(market_dates), max(raw_df['date'])])

dates = pd.concat([market_dates, raw_df['date']])
    
dates = dates.drop_duplicates().sort_values()
dates = dates[(dates_min <= dates) & (dates <= dates_max)]

In [7]:
raw_df = raw_df[(raw_df['date'] >= dates_min) & (raw_df['date'] <= dates_max)]

In [8]:
date_grouped = raw_df.groupby('date')
daily_df = pd.DataFrame(index=raw_df['date'].drop_duplicates())
daily_df['sentiment'] = date_grouped.apply(avg_sentiment)
daily_df['norm_sent'] = normalize_series(daily_df['sentiment'], -1, 1)
daily_df['label'] = daily_df['sentiment'].apply(score_to_label)
daily_df['count'] = date_grouped.apply(lambda x: sum(x['count']))
daily_df['norm_count'] = normalize_series(daily_df['count'], 0)
negatives = raw_df[raw_df['label'] == 'NEGATIVE'][['date', 'count']]
negatives.columns= ['date', 'negatives']
positives = raw_df[raw_df['label'] == 'POSITIVE'][['date', 'count']]
positives.columns= ['date', 'positives']
daily_df = daily_df.merge(negatives, on='date')
daily_df = daily_df.merge(positives, on='date')
daily_df = daily_df.drop_duplicates(subset=['date'])

In [9]:
daily_df = daily_df[(daily_df['date'] >= dates_min) & (daily_df['date'] <= dates_max)]
daily_df = daily_df[daily_df['count'] < 100000] # removing outliers
daily_df['norm_sent'] = normalize_series(daily_df['sentiment'], -1, 1)

As with the Twitter data, the market info has daily granularity.

In [10]:
market_daily_csv = base_path+ 'market_daily_info.csv'
market_raw_df = pd.read_csv(market_daily_csv)
market_raw_df = market_raw_df.dropna()
market_df = pd.DataFrame(dates, columns=['date'])
market_df = market_df.merge(market_raw_df, on='date')
market_df['mid_price'] = (market_df['high'] + market_df['low'])/2
market_df['norm_mid_price'] = normalize_series(market_df['mid_price'])
market_df = market_df[(dates_min <= market_df['date']) & (market_df['date']<= dates_max)]

In [11]:
import altair as alt

alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

### Monthly volume

In [12]:
# Assuming 'YYYY-MM-DD' date strings, dropping the last three characters leaves 'YYYY-MM'
daily_df['month'] = daily_df['date'].apply(lambda x: x[:-3])

# Average daily tweet count for each month
monthly_volume = daily_df[['month', 'count']].groupby(by='month', as_index=False).mean()

volume_chart = alt.Chart(monthly_volume, title='Monthly volume').mark_area().encode(
    alt.X('yearmonth(month):T', title='Date'),
    alt.Y('count', title='Average daily volume'),
    color=alt.value(palette['twitter'])
)

volume_chart

As can be seen, the crawling rate is not constant over time, and this will create some issues during the following analysis.

### Basic data exploration

#### Sentiment

In [13]:
sent_rounded = daily_df[['norm_sent']].copy()
sent_rounded['norm_sent'] = sent_rounded['norm_sent'].apply(lambda x: round(x, 2))

alt.Chart(sent_rounded, title='Sentiment dispersion').mark_boxplot().encode(alt.X('norm_sent', title='Normalized sentiment'), color=alt.value(palette['twitter'])).properties(height=200)

The median is very low (0.2), which means that a large part of the tweets are very negative.
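
To read the normalized values, it may help to map them back to the signed scale. The sketch below repeats the `normalize` formula from the helper cell and adds a hypothetical inverse, `denormalize`, which is not part of the notebook.

```python
def normalize(value: float, range_min: float, range_max: float) -> float:
    # Same formula as the notebook helper: maps [range_min, range_max] to [0, 1].
    return (value - range_min) / (range_max - range_min)

def denormalize(norm_value: float, range_min: float, range_max: float) -> float:
    # Hypothetical inverse: maps [0, 1] back to [range_min, range_max].
    return norm_value * (range_max - range_min) + range_min

median_signed = denormalize(0.2, -1, 1)  # signed sentiment behind the median
neutral_norm = normalize(0, -1, 1)       # where a neutral score lands
```

With the [-1, 1] range used here, a normalized median of 0.2 corresponds to a signed sentiment of about -0.6, and a neutral score lands at 0.5.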

In [42]:
sent_dist = sent_rounded.groupby('norm_sent', as_index=False).size()
sent_dist.columns = ['norm_sent', 'count']

alt.Chart(sent_dist, title='Sentiment distribution').mark_area().encode(alt.X('norm_sent', title='Normalized sentiment'), alt.Y('count', title='Count'), color=alt.value(palette['twitter']))

The sentiment distribution shows, again, that many of the tweets are very negative. A good part of them are near 0.5, which could be for two reasons:
- The classification was done with the Flair fast-classifier
- Tweets are normally not written in standard language: users rely on hashtags, acronyms, and the like.

#### Volume

In [44]:
alt.Chart(daily_df, title='Volume dispersion').mark_boxplot().encode(alt.X('count', title=None), color=alt.value(palette['twitter'])).properties(height=200)

There is a large number of outliers, probably caused by spammers and bots; in fact, a good improvement to the dataset cleaning process could be the removal of duplicated tweets.
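
A minimal sketch of such a de-duplication step is shown below. It is not part of the current pipeline, and the `username` and `text` column names are assumptions about the raw tweets, which this notebook never loads.

```python
import pandas as pd

# Hypothetical raw tweets; column names are assumed.
tweets = pd.DataFrame({
    'username': ['bot_1', 'bot_2', 'alice', 'bot_1'],
    'text': ['Buy BTC now!', 'buy  btc NOW!', 'Bitcoin is volatile', 'Buy BTC now!'],
})

# Lower-case and collapse whitespace so trivially altered copies share one key.
key = tweets['text'].str.lower().str.split().str.join(' ')

# Keep only the first tweet for each key.
deduplicated = tweets[~key.duplicated()]
```

Matching on a normalized key rather than on the exact text is a design choice: it catches copies that differ only in case or spacing, at the cost of possibly merging a few legitimate tweets.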

In [16]:
volumes = daily_df[['count']]

volumes_dist = volumes.groupby('count', as_index=False).size()
volumes_dist.columns = ['volume', 'count']

alt.Chart(volumes_dist, title='Volume distribution').mark_line().encode(alt.X('volume', title='Volume'), alt.Y('count', title='Count'), color=alt.value(palette['twitter']))

In [17]:
volumes = daily_df[daily_df['count'] <= 12255][['count']]

volumes_dist = volumes.groupby('count', as_index=False).size()
volumes_dist.columns = ['volume', 'count']

plot_title = alt.TitleParams('Clipped volume distribution', subtitle=['Clipped to q3'])
alt.Chart(volumes_dist, title=plot_title).mark_line().encode(alt.X('volume', title='Volume'), alt.Y('count', title='Count'), color=alt.value(palette['twitter']))

The volume distribution is clearly concentrated on the left and, even after removing the values above q3, it remains skewed toward the left.
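
The cell above uses a hard-coded threshold. As a sketch on hypothetical daily counts, the third quartile could instead be computed from the data:

```python
import pandas as pd

# Hypothetical daily tweet counts with one extreme day.
daily_counts = pd.Series([100, 200, 300, 400, 50_000], name='count')

# Third quartile (pandas interpolates linearly by default).
q3 = daily_counts.quantile(0.75)

# Keep only the days at or below q3.
clipped = daily_counts[daily_counts <= q3]
```

Computing the threshold this way keeps the filter in sync with the data if the dataset is regenerated.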

### Sentiment analysis

In [18]:
domain = [0, 1]
color_range = [palette['negative'], palette['positive']]

time_selector = alt.selection(type='interval', encodings=['x'])

gradient = alt.Color('norm_sent', scale=alt.Scale(domain=domain, range=color_range))

price_chart = alt.Chart(market_df).mark_line(color=palette['strong_price']).encode(
    x=alt.X('yearmonthdate(date):T',
           scale=alt.Scale(domain=time_selector),
           title=None),
    y=alt.Y('mid_price')
)
        
plot_title = alt.TitleParams('Normalized sentiment vs Bitcoin price', subtitle='0:= negative, 1:= positive')
histogram = alt.Chart(daily_df, title=plot_title).mark_bar().encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100, extent=time_selector),
                                                       scale=alt.Scale(domain=time_selector),
                                                       axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                 alt.Y('norm_sent',
                                                      scale=alt.Scale(domain=[0,1]),
                                                      title='Normalized sentiment'),
                                                 color=gradient)




selection_plot = alt.Chart(daily_df).mark_bar().encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100),
                                                            title='Date',
                                                            axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                       alt.Y('norm_sent', title=None),
                                                       color=gradient).add_selection(time_selector).properties(height=50)

(histogram + price_chart).resolve_scale(y='independent') & selection_plot

Bar binning has some problems plotting the sentiment when it is in [-1, 1]; for that reason, the interactive version uses the normalized sentiment and the static one uses the original sentiment values.

In [46]:
domain = [-1, 1]

gradient = alt.Color('sentiment', scale=alt.Scale(domain=domain, range=color_range), title='Sentiment')

price_chart = alt.Chart(market_df).mark_line(color=palette['strong_price']).encode(
    x=alt.X('yearmonthdate(date):T'),
    y=alt.Y('mid_price', title='Mid price')
)

plot_title = alt.TitleParams('Static sentiment vs Bitcoin price', subtitle='-1:= negative, 1:= positive')
histogram = alt.Chart(daily_df, title=plot_title).mark_bar().encode(alt.X('yearmonthdate(date):T', title='Date'),
                                                 alt.Y('sentiment',
                                                      scale=alt.Scale(domain=[-1,1]),
                                                      title='Sentiment'),
                                                 color=gradient)




(histogram + price_chart).resolve_scale(y='independent')

Plotting the sentiment in [-1, 1] makes it immediately clear whether the price moves in the same direction as the sentiment.

#### Correlation

To measure the correlation between sentiment and price, two approaches will be used:
- TLCC (Time Lagged Cross-Correlation): a measure of the correlation of the whole time series given a list of time offsets
- Windowed TLCC: the time series are lagged as in the first case, but the correlation is calculated for each window; this is useful to understand correlation "direction" (so the time series' roles) over time.
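
The first approach can be sketched on synthetic data where the answer is known in advance. The series below are made up: the "price" simply repeats the "sentiment" three days later. The helper names `spearman` and `tlcc` are hypothetical. Spearman is computed here as the Pearson correlation of ranks on complete pairs, which avoids any dependency beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman correlation: Pearson correlation of ranks, on complete pairs only."""
    pair = pd.concat([a, b], axis=1, keys=['a', 'b']).dropna()
    return pair['a'].rank().corr(pair['b'].rank())

def tlcc(first: pd.Series, second: pd.Series, offsets) -> pd.Series:
    """Correlation of `first` at day t with `second` at day t + offset, per offset."""
    return pd.Series({k: spearman(first, second.shift(-k)) for k in offsets})

rng = np.random.default_rng(0)
sentiment = pd.Series(rng.normal(size=200))
price = sentiment.shift(3)  # synthetic price that repeats sentiment 3 days later

lagged = tlcc(sentiment, price, range(-10, 11))
best_offset = lagged.idxmax()
```

With this sign convention a positive offset compares today's sentiment with the future price, so the lag built into the synthetic data is recovered as an offset of 3.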

##### TLCC

In [47]:
methods = ['pearson', 'kendall', 'spearman']
offsets = list(range(-150, 151)) # list of days offset to test

correlations = []

sent_vs_price = pd.DataFrame(daily_df['date'], columns=['date'])
sent_vs_price['sent'] = daily_df['norm_sent']
sent_vs_price = sent_vs_price.merge(market_df[['date', 'mid_price']], on='date')

for method in methods:
    method_correlations = [(method, offset, sent_vs_price['sent'].corr(sent_vs_price['mid_price'].shift(-offset), method=method))
                           for offset in offsets]
    correlations.extend(method_correlations)
        
correlations_df = pd.DataFrame(correlations, columns=['method', 'offset', 'correlation'])

spearman_correlations = correlations_df[correlations_df['method'] == 'spearman']

max_corr = max(spearman_correlations['correlation'])
max_corr_offset = spearman_correlations[spearman_correlations['correlation'] == max_corr]['offset'].iloc[0]

min_corr = min(spearman_correlations['correlation'])
min_corr_offset = spearman_correlations[spearman_correlations['correlation'] == min_corr]['offset'].iloc[0]

max_corr_text = f'Max correlation ({round(max_corr, 3)}) with an offset of {max_corr_offset} days'
min_corr_text = f'Min correlation ({round(min_corr, 3)}) with an offset of {min_corr_offset} days'

plot_title = alt.TitleParams('Correlations', subtitle=[max_corr_text, min_corr_text])
corr_chart = alt.Chart(correlations_df, title=plot_title).mark_line().encode(alt.X('offset', title='Offset days'),
                                                          alt.Y('correlation', title='Correlation'),
                                                          alt.Color('method', title='Method'))

corr_chart

Given the nature of the data, the Pearson correlation is not appropriate here, since the variables are not normally distributed.

Kendall and Spearman are both rank-based, so they use the same information to compute the correlation; in fact, the related curves are simply shifted.

The goal of this section is to find the time offset that maximizes the correlation, so the choice between Spearman and Kendall is not significant.

##### WTLCC

For simplicity, the next chart visualizes the WTLCC using the Spearman correlation only.
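
Before the real computation, the windowing idea can be sketched on a toy pair of series whose relationship flips halfway through. The data are made up, `window_bounds` is a hypothetical helper, and `window_size` is passed as a parameter here instead of being read from a global. There are no missing values in the toy data, so ranking each window directly gives the Spearman correlation.

```python
from math import ceil

import numpy as np
import pandas as pd

def window_bounds(length: int, window_size: int) -> list:
    """Half-open (start, stop) row ranges covering `length` rows."""
    return [(i * window_size, (i + 1) * window_size)
            for i in range(ceil(length / window_size))]

def windowed_corr(first: pd.Series, second: pd.Series, window_size: int) -> list:
    """Rank correlation computed separately inside each window."""
    result = []
    for start, stop in window_bounds(len(first), window_size):
        a = first.iloc[start:stop]
        b = second.iloc[start:stop]
        result.append(a.rank().corr(b.rank()))
    return result

x = pd.Series(np.arange(20, dtype=float))
# First half moves with x, second half moves against it.
y = pd.concat([x.iloc[:10], -x.iloc[10:]])
per_window = windowed_corr(x, y, window_size=10)
```

A single correlation over the whole series would blur the two regimes together; the per-window values keep them apart, which is the point of the heatmap below.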

In [51]:
from math import ceil

def get_window(series: pd.Series, window) -> pd.Series:
    return series.iloc[window[0]: window[1]]
    

def windowed_corr(first: pd.Series, second: pd.Series) -> tuple:
    # Relies on the global window_size defined below; the last window may be shorter
    windows = [(window * window_size, (window * window_size)+window_size) for window in range(ceil(len(second)/window_size))]
    windows_corr = [get_window(first, window).corr(get_window(second, window), method = 'spearman') for window in windows]
    return windows_corr, windows

offsets = list(range(-66, 67)) # reduced offsets for better visualization
window_size = 120 # days per window (about four months, slightly longer than a quarter)

windowed_correlations = []

for offset in offsets:
    windows_corr, windows = windowed_corr(sent_vs_price['sent'], sent_vs_price['mid_price'].shift(-offset))
    for window, window_corr in enumerate(windows_corr):
        windowed_correlations.append((window, window_corr, offset))
    
    
windowed_correlations_df = pd.DataFrame(windowed_correlations, columns=['window', 'correlation', 'offset'])


plot_title = alt.TitleParams('Quarter lagged correlation sentiment/price', subtitle='-1:= price as master, 1:= sentiment as master')
color = alt.Color('correlation', scale=alt.Scale(domain=[-1, 1], range=[palette['negative'], palette['positive']]), title='Correlation')
alt.Chart(windowed_correlations_df, height=800, width=800, title=plot_title).mark_rect().encode(alt.X('window:O', title='Window'), alt.Y('offset:O', title='Offset days'), color)


The above heatmap gives a higher level of detail; in fact, it shows that:
- In windows 7 and 17, the price acts as "master" for most of the tested offsets
- If we compare the sentiment with the future (positive offsets), the correlation is nearly 1 in one window but nearly -1 in another, so either a price prediction based on sentiment is not very reliable or the crawling of the tweets was not consistent
- If we compare the sentiment with the past (negative offsets), the correlation is fairly high in windows 8 to 15, suggesting that the price is the master for this time range

### Volume analysis

Another aspect of the data is its volume. In other words: does it matter whether people speak well or badly about Bitcoin, or is it enough that people speak about it?

In [53]:
time_selector = alt.selection(type='interval', encodings=['x'])

dummy_df = pd.DataFrame({'date': [min(daily_df['date']), max(daily_df['date'])], 'count': [0, 0]})
zero_line = alt.Chart(dummy_df).mark_line(color='grey').encode(x=alt.X('yearmonthdate(date):T'), y=alt.Y('count'))

price = alt.Chart(market_df).mark_line(color=palette['price']).encode(
    x=alt.X('yearmonthdate(date):T',
           scale=alt.Scale(domain=time_selector),
           title=None),
    y=alt.Y('mid_price', title='Mid price')
)
        
plot_title = alt.TitleParams('Volume vs Bitcoin price')

histogram = alt.Chart(daily_df, title=plot_title).mark_bar(color=palette['neutral_1']).encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100, extent=time_selector),
                                                       scale=alt.Scale(domain=time_selector),
                                                       axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                 alt.Y('count',
                                                      title='Volume'))

histogram_reg = histogram.transform_regression('date', 'count', method='poly', order=9).mark_line(color=palette['strong_neutral_1'])

volume_chart = histogram + histogram_reg + zero_line

price_reg = price.transform_regression('date', 'mid_price', method='poly', order=9).mark_line(color=palette['strong_price'])


price_chart = price + price_reg

selection_plot = alt.Chart(daily_df).mark_bar(color=palette['neutral_1']).encode(alt.X('yearmonthdate(date):T',
                                                       bin=alt.Bin(maxbins=100),
                                                            title='Date',
                                                            axis=alt.Axis(labelOverlap='greedy', labelSeparation=6)),
                                                       alt.Y('count', title=None)).add_selection(time_selector).properties(height=50)

alt.layer(volume_chart, price_chart).resolve_scale(y='independent') & selection_plot

Trend lines make it easier to see that a correlation exists, but the time series are not synchronized.

#### Correlation

##### TLCC

In [54]:
methods = ['pearson', 'kendall', 'spearman']
offsets = list(range(-150, 151)) # list of days offset to test

correlations = []

volume_vs_price = pd.DataFrame(daily_df['date'], columns=['date'])
volume_vs_price['volume'] = daily_df['count']
volume_vs_price = volume_vs_price.merge(market_df[['date', 'mid_price']], on='date')

for method in methods:
    method_correlations = [(method, offset, volume_vs_price['volume'].corr(volume_vs_price['mid_price'].shift(-offset), method=method))
                           for offset in offsets]
    correlations.extend(method_correlations)
        
correlations_df = pd.DataFrame(correlations, columns=['method', 'offset', 'correlation'])

spearman_correlations = correlations_df[correlations_df['method'] == 'spearman']

max_corr = max(spearman_correlations['correlation'])
max_corr_offset = spearman_correlations[spearman_correlations['correlation'] == max_corr]['offset'].iloc[0]

min_corr = min(spearman_correlations['correlation'])
min_corr_offset = spearman_correlations[spearman_correlations['correlation'] == min_corr]['offset'].iloc[0]

max_corr_text = f'Max correlation ({round(max_corr, 3)}) with an offset of {max_corr_offset} days'
min_corr_text = f'Min correlation ({round(min_corr, 3)}) with an offset of {min_corr_offset} days'

plot_title = alt.TitleParams('Correlations', subtitle=[max_corr_text, min_corr_text])
corr_chart = alt.Chart(correlations_df, title=plot_title).mark_line().encode(alt.X('offset', title='Offset days'),
                                                          alt.Y('correlation', title='Correlation'),
                                                          alt.Color('method', title='Method'))

corr_chart

As for sentiment, the Pearson correlation is not applicable, but this time its trend is more similar to the other two indices than before.

Interestingly, the max correlation is slightly lower than the max sentiment correlation, but this time with an offset of only 6 days in the past. Is volume more significant than sentiment?

##### WTLCC

In [55]:
from math import ceil

def get_window(series: pd.Series, window) -> pd.Series:
    return series.iloc[window[0]: window[1]]
    

def windowed_corr(first: pd.Series, second: pd.Series) -> tuple:
    # Relies on the global window_size defined below; the last window may be shorter
    windows = [(window * window_size, (window * window_size)+window_size) for window in range(ceil(len(second)/window_size))]
    windows_corr = [get_window(first, window).corr(get_window(second, window), method = 'spearman') for window in windows]
    return windows_corr, windows

offsets = list(range(-66, 67)) # reduced offsets for better visualization
window_size = 120 # days per window (about four months, slightly longer than a quarter)

windowed_correlations = []

for offset in offsets:
    windows_corr, windows = windowed_corr(volume_vs_price['volume'], volume_vs_price['mid_price'].shift(-offset))
    for window, window_corr in enumerate(windows_corr):
        windowed_correlations.append((window, window_corr, offset))
    
    
windowed_correlations_df = pd.DataFrame(windowed_correlations, columns=['window', 'correlation', 'offset'])


plot_title = alt.TitleParams('Quarter lagged correlation volume/price', subtitle='-1:= price as master, 1:= volume as master')
color = alt.Color('correlation', scale=alt.Scale(domain=[-1, 1], range=[palette['negative'], palette['positive']]), title='Correlation')
alt.Chart(windowed_correlations_df, height=800, width=800, title=plot_title).mark_rect().encode(alt.X('window:O', title='Window'), alt.Y('offset:O', title='Offset days'), color)


It's easy to see that the right half of the heatmap is more or less green; this corresponds to the time range where the data crawling is better.

In addition, if we look at the near future (positive offsets), the average correlation is not so bad compared to the maximum correlation.

### Users analysis

It could be interesting to analyze users' metrics; in fact, such metrics could be very useful for deeper analysis (weighted tweets and so on).

In [25]:
top_tweeters_csv = base_path + 'top_tweeters.csv'
top_retweeted_csv = base_path + 'top_retweeted.csv'
words_avg_csv = base_path + 'words_avg.csv'

top_tweeters = pd.read_csv(top_tweeters_csv).iloc[1:] # First element groups users without username
top_retweeted = pd.read_csv(top_retweeted_csv).iloc[1:] # First element groups users without username
words_avg = pd.read_csv(words_avg_csv)

#### Tweets

In [26]:
top_tweeters = pd.DataFrame(top_tweeters['tweets'], columns=['tweets'])
top_tweeters['outliers'] = top_tweeters['tweets']
top_tweeters['no outliers'] = top_tweeters['tweets']
top_tweeters = top_tweeters[['outliers', 'no outliers']]

top_tweeters_long = top_tweeters.melt(value_name='tweets', var_name='viz')

In [27]:
alt.Chart(top_tweeters_long[top_tweeters_long['viz'] == 'no outliers'], title='Tweets per user').mark_boxplot(size=10, outliers=False, median=True).encode(alt.X('tweets:Q', title=None), 
alt.Y('viz:N', axis=None)).properties(width=500, height=300)

In [28]:
alt.Chart(top_tweeters_long[top_tweeters_long['viz'] == 'no outliers'], title='Tweets per user').mark_boxplot(size=10, outliers=False, median=True).encode(alt.X('tweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300) + \
alt.Chart(top_tweeters_long[top_tweeters_long['viz'] == 'outliers'], title='Tweets per user').mark_boxplot(size=10, outliers=True, median=True).encode(alt.X('tweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

As can be seen, there are some outliers that are very far from the median; this indicates the presence of spammers or bots (thousands of tweets from a single human user are implausible).
An improvement to the data processing could be to recognize outliers and remove them.
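
One common way to recognize such outliers is Tukey's rule, which flags values beyond 1.5 times the interquartile range above Q3. The sketch below applies it to hypothetical per-user tweet counts; it is a suggestion, not something the current pipeline does.

```python
import pandas as pd

# Hypothetical number of tweets per user, with one bot-like account.
tweets_per_user = pd.Series([1, 2, 2, 3, 4, 5, 6, 8, 9000], name='tweets')

q1 = tweets_per_user.quantile(0.25)
q3 = tweets_per_user.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: anything above Q3 + 1.5 * IQR is treated as an outlier.
upper_fence = q3 + 1.5 * iqr
regular_users = tweets_per_user[tweets_per_user <= upper_fence]
```

Only the upper fence is used because a very low tweet count is not suspicious in this context.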

#### Retweets

In [29]:
top_retweeted = pd.DataFrame(top_retweeted['retweets'], columns=['retweets'])
top_retweeted['outliers'] = top_retweeted['retweets']
top_retweeted['no outliers'] = top_retweeted['retweets']
top_retweeted = top_retweeted[['outliers', 'no outliers']]

top_retweeted_long = top_retweeted.melt(value_name='retweets', var_name='viz')

In [30]:
alt.Chart(top_retweeted_long[top_retweeted_long['viz'] == 'no outliers'], title='Average retweets per user').mark_boxplot(size=10, outliers=False, median=True).encode(alt.X('retweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300) + \
alt.Chart(top_retweeted_long[top_retweeted_long['viz'] == 'outliers'], title='Average retweets per user').mark_boxplot(size=10, outliers=True, median=True).encode(alt.X('retweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

How many times each user has been retweeted differs greatly from how many tweets they posted. Having outliers, in this case, is part of reality: some users have few followers and their tweets get no retweets, while others (Elon Musk, for example, when speaking about Bitcoin) are what's called a "VIP".

This aspect is very interesting because it could open up another set of possible analyses; for example: after removing "normal people" from the dataset, how much does the correlation with the price change?

#### Tweets average length

For simplicity, the "length of a tweet" here is its number of words; this also reflects the intent behind analyzing this aspect: finding a way to evaluate the relevance of a tweet.
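
As a sketch of this definition, the snippet below counts whitespace-separated words and averages them per user. The tweets and the `username` and `text` column names are hypothetical; the actual `words_avg.csv` file is produced elsewhere and may be computed differently.

```python
import pandas as pd

# Hypothetical raw tweets.
tweets = pd.DataFrame({
    'username': ['alice', 'alice', 'bob'],
    'text': ['to the moon', 'buy', 'Bitcoin just hit a new high'],
})

# Length of a tweet = number of whitespace-separated words.
tweets['words'] = tweets['text'].str.split().str.len()

# Average tweet length for each user.
words_avg = tweets.groupby('username')['words'].mean()
```

Splitting on whitespace counts hashtags and mentions as words, which is a simplification.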

In [None]:
words_avg = pd.DataFrame(words_avg['words_avg'], columns=['words_avg'])
words_avg['words_avg'] = words_avg['words_avg'].apply(lambda x: int(x))
words_avg['outliers'] = words_avg['words_avg']
words_avg['no outliers'] = words_avg['words_avg']
words_avg = words_avg[['outliers', 'no outliers']]

words_avg_long = words_avg.melt(value_name='words_avg', var_name='viz')

In [37]:
alt.Chart(words_avg_long[words_avg_long['viz'] == 'no outliers'], title='Average post length per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('words_avg:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

In [58]:
alt.Chart(words_avg_long[words_avg_long['viz'] == 'no outliers'], title='Average post length per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('words_avg:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300) + \
alt.Chart(words_avg_long[words_avg_long['viz'] == 'outliers'], title='Average post length per user').mark_boxplot(size=10, outliers=True, median=True, color=palette['twitter']).encode(alt.X('words_avg:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

There are clearly outliers that are very far from the median, but in this case the most important thing is the IQR: Q1 is 5 and Q3 is 15. This means that most of the tweets in the dataset have a number of words compatible with a meaningful sentence; therefore the number of bots is probably low.

#### Users' metrics mixed up

In [39]:
top_tweeters = pd.read_csv(top_tweeters_csv).iloc[1:]
top_retweeted = pd.read_csv(top_retweeted_csv).iloc[1:]
words_avg = pd.read_csv(words_avg_csv)

users_summary = top_tweeters.copy()
users_summary = users_summary.merge(top_retweeted, on=['username', 'full_name'])
users_summary = users_summary.merge(words_avg, on=['username', 'full_name'])

users_summary

words_q3 = users_summary['words_avg'].quantile(q=0.75)

users_summary_filtered = users_summary[users_summary['words_avg'] <= words_q3*1.5]

In [57]:
domain = [1, max(users_summary_filtered['words_avg'])]
range_ = [palette['negative'], palette['positive']]


plot_title = alt.TitleParams("Zoomed users' metrics", subtitle=["Re-adjusted words_avg color range"])
alt.Chart(users_summary_filtered, title=plot_title).mark_point(clip=True).encode(alt.X('tweets', scale=alt.Scale(domain=(0, 5000)), title='Tweets'), alt.Y('retweets', scale=alt.Scale(domain=(0, 16000)), title='Retweets'), alt.Color('words_avg', scale=alt.Scale(domain=domain, range=range_), title='Words avg')).properties(height=750, width=750).configure_point(size=10)

It's not possible to identify groups of users with similar metrics, but the above chart is a visualization of the population in the dataset: most of the users are "normal people", but a portion of them have a good number of retweets and a good average tweet length.

Another extension of this analysis could be a process where each user gets a rank based on their metrics; based on that rank, a dedicated service "monitors" the users with a higher rank. This is because, probably (but it's only a hypothesis), a user with a higher rank has a higher impact on the price.
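
One possible shape for such a rank is sketched below on made-up users. The scoring rule (the mean of the percentile ranks of `retweets` and `words_avg`) and the `rank_score` column are purely illustrative assumptions, not a validated measure of influence.

```python
import pandas as pd

# Hypothetical users with the three metrics used in this section.
users = pd.DataFrame({
    'username': ['alice', 'bob', 'carol'],
    'tweets': [120, 4000, 35],
    'retweets': [15000, 20, 300],
    'words_avg': [14, 3, 11],
})

# Percentile rank of each metric, so that different scales become comparable.
percentiles = users[['retweets', 'words_avg']].rank(pct=True)

# Illustrative score: average of the two percentile ranks.
users['rank_score'] = percentiles.mean(axis=1)

# The users to monitor are the ones with the highest score.
monitored = users.sort_values('rank_score', ascending=False).head(1)['username'].tolist()
```

The raw tweet count is deliberately left out of the score, since a very high count is more typical of bots than of influential accounts.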