The objective here is to try to predict the trends of the top 10 cryptocurrencies for the next month using Twitter. The general idea is to analyze tweets and categorize them according to whether or not they are in favor of a given cryptocurrency. The first step is to retrieve tweets; we will use [snscrape](https://github.com/JustAnotherArchivist/snscrape) to scrape them. Then, for each tweet, we will run a sentiment analysis using the [Natural Language Toolkit](https://www.nltk.org/) (NLTK) Python package.

In [None]:
import pandas as pd
import snscrape.modules.twitter as sntwitter
from pycoingecko import CoinGeckoAPI
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
import matplotlib.pyplot as plt
import re

Here we define a function to scrape tweets by topic and time period. The `max_tweets` parameter just caps the number of tweets retrieved, to make development faster. The `start_date` and `end_date` parameters should be in ISO 8601 format: YYYY-MM-DD.

In [None]:
def get_tweets(topic, start_date, end_date, max_tweets=10):
    tweets = []
    query = '{topic} since:{start_date} until:{end_date}'.format(topic=topic, start_date=start_date, end_date=end_date)

    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= max_tweets:
            break
        tweets.append([tweet.date, tweet.content, tweet.user.username])
    
    return tweets
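The query string is the only part of `get_tweets` that depends on the date format, so it can be checked without touching the network. The helper below is a minimal sketch: `build_query` is a hypothetical name that is not part of the notebook, and it only uses the standard library to reject dates that are not in the expected format before the scraper is called.

```python
from datetime import date

def build_query(topic, start_date, end_date):
    """Build the search query used by get_tweets, validating the dates first."""
    # date.fromisoformat raises ValueError for strings such as '01/31/2021'
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError('start_date must not be after end_date')
    return '{topic} since:{start} until:{end}'.format(
        topic=topic, start=start.isoformat(), end=end.isoformat())

print(build_query('Bitcoin', '2021-01-01', '2021-01-31'))
# prints: Bitcoin since:2021-01-01 until:2021-01-31
```

Failing early on a malformed date is cheaper than discovering an empty result set after a long scrape.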

For example, we can retrieve up to 10,000 tweets mentioning Bitcoin that were posted between 1 and 31 January 2021.

In [None]:
tweets = get_tweets('Bitcoin', '2021-01-01', '2021-01-31', 10000)
tweets

To make the data easier to manipulate, we can load it into a DataFrame from the `pandas` library.

In [None]:
df = pd.DataFrame(tweets, columns=['Datetime', 'Text', 'Username'])
df.head()

Looking at the tweet contents, we can see that some elements are not relevant, such as hashtag symbols, hyperlinks or @mentions. We can remove them since they may interfere with the sentiment analysis.

In [None]:
def clean_text(text):
    text = re.sub(r'@\w+', '', text) # remove @mentions
    text = re.sub(r'#', '', text) # remove the '#' symbol but keep the hashtag word
    text = re.sub(r'\bRT\s+', '', text) # remove the RT (retweet) marker, but not "RT" inside a word
    text = re.sub(r'https?://\S+', '', text) # remove hyperlinks
    text = re.sub(r' +', ' ', text) # collapse repeated spaces
    text = text.strip() # remove leading and trailing spaces
    return text

In [None]:
df['Text'] = df['Text'].apply(clean_text)
df.head()
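To see what the cleaning does on a single tweet, here is a small self-contained check. It is a sketch, not the notebook's exact cell: it uses a variant of `clean_text` with a word boundary before `RT`, and the two sample tweets are made up for illustration.

```python
import re

def clean_text(text):
    text = re.sub(r'@\w+', '', text)          # remove @mentions
    text = re.sub(r'#', '', text)             # drop '#', keep the hashtag word
    text = re.sub(r'\bRT\s+', '', text)       # remove the retweet marker only
    text = re.sub(r'https?://\S+', '', text)  # remove hyperlinks
    text = re.sub(r' +', ' ', text)           # collapse repeated spaces
    return text.strip()

print(clean_text('RT @user #Bitcoin is   the future! https://t.co/abc123'))
# prints: Bitcoin is the future!
print(clean_text('SMART money buys #BTC'))
# prints: SMART money buys BTC
```

The second sample shows why the word boundary matters: without `\b`, the pattern would also eat the `RT ` at the end of `SMART `.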

Now that the tweets are cleaner, we can run the sentiment analysis.

The idea of sentiment analysis is to analyze plain text to determine whether the message it conveys is positive, neutral or negative. To do that, we could train our own model on a large dataset of positive, neutral and negative examples. In my case, I prefer to use the built-in sentiment analyzer from `nltk` (VADER, which relies on a ready-made lexicon) because I'm not good at ML ^^

In [None]:
# nltk requires one more dependency to make sentiment analysis work correctly
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

one_tweet = df['Text'][0]
score = sia.polarity_scores(one_tweet)
print('tweet: ', one_tweet)
print('score: ', score)

Here, we have 4 indicators:
- neg: negative
- neu: neutral
- pos: positive
- compound: an overall score

The neg, neu and pos values range from 0 to 1 and sum to 1 (up to rounding). The compound score ranges from -1 to 1 and condenses the overall sentiment into a single number, so we will use it to determine the category of each tweet.
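To give an intuition for why compound stays between -1 and 1: as far as I understand the VADER implementation, the valences of the words are summed into a raw score x, which is then squashed with x / sqrt(x² + α), with α = 15 by default. The snippet below is a sketch of that normalization step only, not the analyzer itself. The raw scores are made-up inputs.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded raw valence sum into the interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

for raw in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize(raw), 4))
```

A raw score of 1.0 gives exactly 0.25, and larger raw scores approach 1 without ever reaching it, which is why very strong tweets all end up with similar compound values.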

In [None]:
# create a column 'Scores' containing the polarity score of the associated row
df['Scores'] = df['Text'].apply(lambda text: sia.polarity_scores(text))
# extract the compound value into its own column
df['Compound'] = df['Scores'].apply(lambda score: score['compound'])
df.head()

If the compound value is:
- greater than 0 then we can say that the tweet is positive
- lower than 0 then the tweet is negative
- equal to 0 then the tweet is neutral
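Cutting at exactly 0 is a choice rather than a requirement. The VADER authors suggest a small neutral band instead: positive at 0.05 and above, negative at -0.05 and below. The sketch below shows that variant. `classify` is a hypothetical helper that is not used elsewhere in this notebook, and the sample scores are made up.

```python
def classify(compound, threshold=0.05):
    """Label a compound score, treating values inside the band as neutral."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

samples = [0.6369, 0.02, -0.03, -0.4404, 0.0]
print([classify(c) for c in samples])
# prints: ['positive', 'neutral', 'neutral', 'negative', 'neutral']
```

With this band, barely-scored tweets such as 0.02 count as neutral, so the share of neutral tweets would be higher than with the zero cut-off used in the rest of the notebook.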

We can now compute some statistics:

In [None]:
positive_tweets = df[df['Compound'] > 0]
negative_tweets = df[df['Compound'] < 0]
neutral_tweets = df[df['Compound'] == 0]

percentage_positive = (positive_tweets.shape[0] / df.shape[0]) * 100
percentage_negative = (negative_tweets.shape[0] / df.shape[0]) * 100
percentage_neutral = (neutral_tweets.shape[0] / df.shape[0]) * 100

print('{0}% positive tweets'.format(percentage_positive))
print('{0}% negative tweets'.format(percentage_negative))
print('{0}% neutral tweets'.format(percentage_neutral))

When I ran the above on 10,000 Bitcoin tweets published between 1 and 31 January 2021, I got:
- 38.22% positive tweets
- 17.94% negative tweets
- 43.84% neutral tweets

So the results suggest that people are rather bullish on Bitcoin.

However, the sentiment analysis we did is far from perfect. For example, consider this tweet:

"sooner or later they will understand" tweeted by the user BitcoinSSJ

As humans, we tend to interpret this as a positive tweet, but our model sees it as neutral.

In [None]:
tweet = 'sooner or later they will understand'
score = sia.polarity_scores(tweet)
score

Or even worse, this tweet is positive regarding Bitcoin but is interpreted as negative.

In [None]:
tweet = 'I am sure they are really concerned for us, the people. Meanwhile behind the scenes their loading up! I will never trust any of them! \nbitcoin is the future!'
score = sia.polarity_scores(tweet)
print(score)

So there is still work to do to improve our model. We could do it by training a model on more data. However, teaching a model to understand sarcasm can be difficult.

The person tweeting is also a factor that we should take into account. For example, someone like Christine Lagarde generally has a bigger impact on users' decisions since she has the ability to set regulations.

Another example is Elon Musk who, by changing his Twitter bio to '#Bitcoin' and posting 2 tweets, played a role in the recent 20% increase of Bitcoin and 600% increase of Dogecoin, both within 24 hours.

So users should also carry a weight, but I don't know how to do that :/
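One possible starting point, as a rough sketch rather than a validated method, is to average the compound scores with a weight per author. The example below assumes a follower count is available for each tweet and uses its logarithm as the weight, so that one huge account counts more than a small one without drowning everything else. The function name, the weighting formula and the sample numbers are all hypothetical.

```python
import math

def weighted_sentiment(scored_tweets):
    """Average compound scores, weighting each tweet by log10(followers + 10)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for compound, followers in scored_tweets:
        weight = math.log10(followers + 10)  # at least 1, even with 0 followers
        weighted_sum += weight * compound
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# (compound score, follower count) pairs, made up for illustration
sample = [(0.8, 40_000_000), (-0.5, 120), (-0.4, 35), (0.0, 900)]

unweighted = sum(compound for compound, _ in sample) / len(sample)
print('unweighted:', round(unweighted, 3))
print('weighted:  ', round(weighted_sentiment(sample), 3))
```

In this made-up sample the plain average is slightly negative, while the weighted average is positive because the one very large account is bullish. Whether follower count is a good proxy for influence is an open question.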

At this point, we will just assume that our model works well enough and repeat the process above for the top 10 cryptocurrencies, which we can get by requesting the [CoinGecko API](https://www.coingecko.com/en/api).

In [None]:
def get_top_10_coins():
    coingecko = CoinGeckoAPI()
    all_coins = coingecko.get_coins_markets(vs_currency='usd')
    return [coin['id'] for coin in all_coins[:10]]

top_coins = get_top_10_coins()
top_coins

Let's wrap the whole process in one function.

In [86]:
def sentiment_analysis(coin, start_date, end_date, max_tweets=1000):
    tweets = get_tweets(coin, start_date, end_date, max_tweets)

    df = pd.DataFrame(tweets, columns=['Datetime', 'Text', 'Username'])

    df['Text'] = df['Text'].apply(clean_text)
    df['Scores'] = df['Text'].apply(lambda text: sia.polarity_scores(text))
    # extract the compound value into its own column
    df['Compound'] = df['Scores'].apply(lambda score: score['compound'])

    positive_tweets = df[df['Compound'] > 0]
    negative_tweets = df[df['Compound'] < 0]
    neutral_tweets = df[df['Compound'] == 0]

    percentage_positive = (positive_tweets.shape[0] / df.shape[0]) * 100
    percentage_negative = (negative_tweets.shape[0] / df.shape[0]) * 100
    percentage_neutral = (neutral_tweets.shape[0] / df.shape[0]) * 100

    is_positive = percentage_positive > percentage_negative

    print('{0} sentiment analysis results:'.format(coin))
    print('{0}% positive tweets'.format(percentage_positive))
    print('{0}% negative tweets'.format(percentage_negative))
    print('{0}% neutral tweets'.format(percentage_neutral))
    if is_positive:
        print('People seem bullish about {0}'.format(coin))
    else:
        print('People seem bearish about {0}'.format(coin))
    print('===============================')

In [87]:
start_date = '2021-01-01'
end_date = '2021-01-31'
max_tweets = 1000

for coin in top_coins:
    sentiment_analysis(coin, start_date, end_date, max_tweets)

bitcoin sentiment analysis results:
36.8% positive tweets
17.299999999999997% negative tweets
45.9% neutral tweets
People seem bullish about bitcoin
ethereum sentiment analysis results:
39.2% positive tweets
10.4% negative tweets
50.4% neutral tweets
People seem bullish about ethereum
tether sentiment analysis results:
36.0% positive tweets
28.799999999999997% negative tweets
35.199999999999996% neutral tweets
People seem bullish about tether
ripple sentiment analysis results:
24.3% positive tweets
40.8% negative tweets
34.9% neutral tweets
People seem bearish about ripple
polkadot sentiment analysis results:
46.800000000000004% positive tweets
9.4% negative tweets
43.8% neutral tweets
People seem bullish about polkadot
cardano sentiment analysis results:
43.4% positive tweets
10.9% negative tweets
45.7% neutral tweets
People seem bullish about cardano
chainlink sentiment analysis results:
46.800000000000004% positive tweets
8.0% negative tweets
45.2% neutral tweets
People seem bullish