### Users analysis

The charts of Twitter users' metrics are very heavy, so they were moved from the Twitter notebook to this one.

Analyzing users' metrics is interesting in its own right, and these metrics could also feed deeper analyses (weighted tweets and so on).

In [1]:
import pandas as pd
import altair as alt
from palette import palette

alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

In [2]:
base_path = './Datasets/'

top_tweeters_csv = base_path + 'top_tweeters.csv'
top_retweeted_csv = base_path + 'top_retweeted.csv'
words_avg_csv = base_path + 'words_avg.csv'

#### Tweets

In [3]:
top_tweeters = pd.read_csv(top_tweeters_csv).iloc[1:] # First element groups users without username

top_tweeters = pd.DataFrame(top_tweeters['tweets'], columns=['tweets'])
top_tweeters['outliers'] = top_tweeters['tweets']
top_tweeters['no outliers'] = top_tweeters['tweets']
top_tweeters = top_tweeters[['outliers', 'no outliers']]

top_tweeters_long = top_tweeters.melt(value_name='tweets', var_name='viz')

In [4]:
alt.Chart(top_tweeters_long[top_tweeters_long['viz'] == 'no outliers'], title='Tweets per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('tweets:Q', title=None), 
alt.Y('viz:N', axis=None)).properties(width=500, height=300)

In [5]:
alt.Chart(top_tweeters_long[top_tweeters_long['viz'] == 'no outliers'], title='Tweets per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('tweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300) + \
alt.Chart(top_tweeters_long[top_tweeters_long['viz'] == 'outliers'], title='Tweets per user').mark_boxplot(size=10, outliers=True, median=True, color=palette['twitter']).encode(alt.X('tweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

As can be seen, some outliers lie very far from the median. This suggests the presence of spammers or bots, since thousands of tweets are implausible for a single human user.
A possible improvement to the data processing would be to detect these outliers and remove them.
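One common way to detect such outliers is the rule a boxplot itself uses: anything above Q3 + 1.5 × IQR is treated as an outlier. The sketch below is only an illustration of that idea; the helper name `drop_upper_outliers`, the factor `k=1.5`, and the toy data are assumptions, not part of the original pipeline.

```python
import pandas as pd


def drop_upper_outliers(df, column, k=1.5):
    """Keep rows whose `column` value is at most Q3 + k * IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return df[df[column] <= upper_fence]


# Toy data (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    'username': list('abcdefgh'),
    'tweets': [1, 2, 2, 3, 3, 4, 5, 4000],
})
cleaned = drop_upper_outliers(toy, 'tweets')
print(cleaned)  # the user with 4000 tweets is dropped
```

Only the upper fence is applied here because a suspiciously low tweet count is not a sign of spam; whether 1.5 is the right factor for this dataset would need to be checked against the data.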

#### Retweets

In [6]:
top_retweeted = pd.read_csv(top_retweeted_csv).iloc[1:] # First element groups users without username

top_retweeted = pd.DataFrame(top_retweeted['retweets'], columns=['retweets'])
top_retweeted['outliers'] = top_retweeted['retweets']
top_retweeted['no outliers'] = top_retweeted['retweets']
top_retweeted = top_retweeted[['outliers', 'no outliers']]

top_retweeted_long = top_retweeted.melt(value_name='retweets', var_name='viz')

In [7]:
alt.Chart(top_retweeted_long[top_retweeted_long['viz'] == 'no outliers'], title='Average retweets per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('retweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300) + \
alt.Chart(top_retweeted_long[top_retweeted_long['viz'] == 'outliers'], title='Average retweets per user').mark_boxplot(size=10, outliers=True, median=True, color=palette['twitter']).encode(alt.X('retweets:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

How often a user has been retweeted is very different from how many tweets they posted. Here, outliers are part of reality: some users have few followers and their tweets get no retweets, while others (in the Bitcoin context, Elon Musk for example) are what is called a "VIP".

This aspect is very interesting because it could open up another set of analyses; for example: after removing "normal people" from the dataset, how much does the correlation with the price change?

#### Tweets average length

For simplicity, the "length of a tweet" here is its number of words. This also reflects the purpose of analyzing this aspect: finding a way to evaluate the relevance of a tweet.
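As an illustration of this definition, the sketch below counts whitespace-separated tokens and averages them per user. How `words_avg.csv` was actually produced is not shown in this notebook, so the `text` column and the toy tweets are assumptions.

```python
import pandas as pd

# Toy tweets (hypothetical, not taken from the real dataset)
tweets = pd.DataFrame({
    'username': ['alice', 'alice', 'bob'],
    'text': ['Bitcoin to the moon', 'buy now', 'BTC'],
})

# Length of a tweet = number of whitespace-separated tokens
tweets['words'] = tweets['text'].str.split().str.len()

# Average length per user, analogous to the words_avg column
words_avg_toy = tweets.groupby('username')['words'].mean()
print(words_avg_toy)
```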

In [10]:
words_avg = pd.read_csv(words_avg_csv)

words_avg = pd.DataFrame(words_avg['words_avg'], columns=['words_avg'])
words_avg['words_avg'] = words_avg['words_avg'].astype(int)
words_avg['outliers'] = words_avg['words_avg']
words_avg['no outliers'] = words_avg['words_avg']
words_avg = words_avg[['outliers', 'no outliers']]

words_avg_long = words_avg.melt(value_name='words_avg', var_name='viz')

In [11]:
alt.Chart(words_avg_long[words_avg_long['viz'] == 'no outliers'], title='Average post length per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('words_avg:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

In [12]:
alt.Chart(words_avg_long[words_avg_long['viz'] == 'no outliers'], title='Average post length per user').mark_boxplot(size=10, outliers=False, median=True, color=palette['twitter']).encode(alt.X('words_avg:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300) + \
alt.Chart(words_avg_long[words_avg_long['viz'] == 'outliers'], title='Average post length per user').mark_boxplot(size=10, outliers=True, median=True, color=palette['twitter']).encode(alt.X('words_avg:Q', title=None), alt.Y('viz:N', title=None)).properties(width=500, height=300)

There are clearly outliers that lie very far from the median, but in this case the most important thing is the IQR: Q1 is 5 and Q3 is 15. This means that the middle half of the users average between 5 and 15 words per tweet, a length compatible with a meaningful sentence, so the number of bots is probably low.
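To make the boxplot easier to read, the arithmetic behind it can be spelled out. This assumes the quartiles quoted above and the usual 1.5 × IQR whisker rule, which the charts appear to use since `mark_boxplot` is called without an explicit `extent`:

$$
\mathrm{IQR} = Q_3 - Q_1 = 15 - 5 = 10
$$

$$
Q_3 + 1.5\,\mathrm{IQR} = 15 + 15 = 30, \qquad Q_1 - 1.5\,\mathrm{IQR} = 5 - 15 = -10
$$

Under these assumptions, a user is drawn as an outlier only when their average exceeds 30 words. The lower fence is negative, so no user can be an outlier on the short side.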

#### Users' metrics mixed up

In [13]:
top_tweeters = pd.read_csv(top_tweeters_csv).iloc[1:]
top_retweeted = pd.read_csv(top_retweeted_csv).iloc[1:]
words_avg = pd.read_csv(words_avg_csv)

users_summary = top_tweeters.copy()
users_summary = users_summary.merge(top_retweeted, on=['username', 'full_name'])
users_summary = users_summary.merge(words_avg, on=['username', 'full_name'])

users_summary

words_q3 = users_summary['words_avg'].quantile(q=0.75)

users_summary_filtered = users_summary[users_summary['words_avg'] <= words_q3*1.5]

In [17]:
domain = [1, words_q3*1.5]
range_ = [palette['negative'], palette['positive']]


plot_title = alt.TitleParams("Zoomed users' metrics", subtitle=["Re-adjusted words avg color range"])
alt.Chart(users_summary_filtered, title=plot_title).mark_point(clip=True).encode(alt.X('tweets', scale=alt.Scale(domain=(0, 5000)), title='Tweets'), alt.Y('retweets', scale=alt.Scale(domain=(0, 16000)), title='Retweets'), alt.Color('words_avg', scale=alt.Scale(domain=domain, range=range_), title='Words avg')).properties(height=750, width=750).configure_point(size=10)

It is not possible to identify groups of users with similar metrics, but the chart above visualizes the population in the dataset: most users are "normal people", while a portion of them have a good number of retweets and a good average tweet length.

Another extension of this analysis could be a process that assigns each user a rank based on their metrics; a dedicated service would then "monitor" the highest-ranked users. The rationale is that a higher-ranked user probably has a higher impact on the price (but this is only a hypothesis).
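One possible way to build such a rank is sketched below: convert each metric to a percentile rank and average them. This is only an assumption about how the ranking could work; the helper `rank_users`, the equal weighting of the three metrics, and the toy data are hypothetical.

```python
import pandas as pd


def rank_users(users, metrics=('tweets', 'retweets', 'words_avg')):
    """Score each user by the mean percentile rank of its metrics."""
    ranked = users.copy()
    # rank(pct=True) maps each value to its percentile in (0, 1]
    ranked['score'] = ranked[list(metrics)].rank(pct=True).mean(axis=1)
    return ranked.sort_values('score', ascending=False)


# Toy data (hypothetical, not taken from the real dataset)
toy = pd.DataFrame({
    'username': ['u1', 'u2', 'u3', 'u4'],
    'tweets': [10, 200, 50, 5],
    'retweets': [3, 9000, 40, 0],
    'words_avg': [8, 14, 12, 3],
})
top = rank_users(toy).head(2)
print(top[['username', 'score']])
```

Percentile ranks keep extreme values from dominating the score. Since a very high tweet count was associated with bots earlier, `tweets` might deserve a lower weight or be left out of `metrics` altogether.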