# Tweet Sentiment Analysis

The aim of this notebook is to train a classifier for sentiment analysis using tweet data, then to assign sentiment scores
to a group of tweets collected for a set of cryptocurrencies over a period of one month. These scores are then plotted against
price and return at a daily frequency, and the correlation between them is computed.

In [None]:
from src.spark import Spark
import pandas as pd
import src.tweet_volume as funcs
import matplotlib.pyplot as plt
from pyspark.sql import functions as fs
from src.plotting import double_plot
from pyspark.ml import PipelineModel
from pyspark.ml.classification import LogisticRegressionModel
import src.nlp.clean as clean
from pyspark.sql.types import StringType, IntegerType, StructField, StructType
from src.nlp.sentiment import SentimentAnalyser


## Load Spark and Data

For this we are going to use a Spark SQL session, as the SQL package contains methods for reading CSV and JSON data into DataFrames.
We start a new Spark session by passing the URL of the master process we want to connect to. This can either be `'local'` or a remote master process specified by a Spark URL. Note that the cell below passes `'local'`; to run against the remote Spark master, which has 25 nodes connected, pass its Spark URL instead.

In [4]:
spark = Spark('load', 'local')
sess = spark.session()

The tweet JSON is read with a schema that selects only the necessary fields from the data.

In [5]:
df = funcs.load_dataframe(sess, '/cs/unique/ls99-kf39-cs5052/data/tweets/*.json', funcs.schema)
df2 = funcs.parse_timestamp(df)


### Training Data

Likewise, we need some training data for the algorithms: another set of tweets that has been pre-labelled for sentiment. The labels are `0` for negative and `1` for positive. This suits us well, as it makes the task a **Binary** classification problem, which is considerably easier than a **Multi-class** one.

##### Data Cleansing

In order to perform sentiment analysis on the input data set, it needs to be simplified to remove a lot of the noise that exists in social media data and natural language in general.

We need to create an input vector to the classifier that contains information about the words in the text, so the first step is to make all words lower-case, as this will reduce the dimensionality of the input vector. Then URLs are stripped from the text, as they are not useful to us. After this, punctuation and other useless characters are also removed.
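The cleaning steps described above can be sketched in plain Python as follows. This is only an illustration under our own assumptions: `clean_tweet`, `URL_RE` and `PUNCT_TABLE` are hypothetical names, and the project's `clean.clean_tweets` works on Spark DataFrames and may apply different rules.

```python
import re
import string

# Hypothetical helper for illustration; not the project's clean.clean_tweets.
URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lower-case a tweet, strip URLs, then drop ASCII punctuation."""
    text = text.lower()
    text = URL_RE.sub("", text)         # URLs go first, while '://' is still intact
    text = text.translate(PUNCT_TABLE)  # removes characters such as # @ ! . ,
    return " ".join(text.split())       # collapse runs of whitespace


print(clean_tweet("Is SO sad for my APL friend... http://t.co/abc123"))
# is so sad for my apl friend
```

Unlike this sketch, the cleaned tweets printed later in the notebook still contain symbols such as `#` and `@`, so the real cleaner is evidently more selective about which characters it removes.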



In [5]:
sent_schema = StructType([
    StructField("target", IntegerType(), True),
    StructField("text", StringType(), True)
])

train_df = sess.read.csv("/cs/unique/ls99-kf39-cs5052/train/train.csv", header=True, schema=sent_schema)

train_clean = clean.clean_tweets(train_df, "text")
train_clean.first()

Row(target=0, text=u'is so sad for my apl friend')

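After loading, the training data is cleaned and split into a training set and an evaluation set. As there are 1.5 million tweets in total, the intention is to hold out only 1% for evaluation, as this is still a large number of tweets. Note that `randomSplit` normalises its weights, so the `[0.99, 0.1]` used below holds out roughly 9% rather than 1%; pass `[0.99, 0.01]` for a true 1% split.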

In [6]:
train, evaluation = train_clean.randomSplit([0.99, 0.1], seed=42)
print("Number of tweets in train: {:,}".format(train.count()))

Number of tweets in train: 1,434,200
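Because `randomSplit` normalises weights that do not sum to one, it is worth checking what fraction each split actually receives. The sketch below does that arithmetic in plain Python; `split_fractions` is a hypothetical helper of ours, not part of Spark, and the real split is random, so the row counts only approximate these fractions.

```python
def split_fractions(weights):
    """Return the fraction of rows each split receives once weights are normalised."""
    total = sum(weights)
    return [w / total for w in weights]


as_written = split_fractions([0.99, 0.1])   # weights used in the cell above
intended = split_fractions([0.99, 0.01])    # weights for a 99% / 1% split

print("as written: train {:.1%}, eval {:.1%}".format(*as_written))
print("intended:   train {:.1%}, eval {:.1%}".format(*intended))
# as written: train 90.8%, eval 9.2%
# intended:   train 99.0%, eval 1.0%
```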


### Model Training

We have already trained a sentiment analysis pipeline and saved the model for reuse later, but should a user wish to train a new model on different data, it is a simple case of using the class we have defined called `SentimentAnalyser`.

```python
# Set up a new model pipeline
model = SentimentAnalyser()

# This will compute the vector representation of the tweets first, then train a binary classifier
model.train(train)

# Get the predictions for the held-out evaluation set
predictions = model.predict(evaluation)

# Count the number of positive and negative tweets and return dataframes of positive + negative
pos, neg = model.count_sentiments(predictions)

# Print the accuracy and ROC AUC
model.classification_report(predictions)

# Save the model to disk for use later
model.save("/path/to/save/to")
```
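The internals of `classification_report` are not shown in this notebook, but the two metrics it prints can be illustrated on their own. The sketch below computes accuracy and ROC AUC in plain Python on made-up labels and scores; the function names and the data are our assumptions, and the pairwise AUC loop is quadratic, so it is only suitable for small examples.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: true labels and predicted probabilities of the positive class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
preds = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy: %.2f" % accuracy(labels, preds))   # accuracy: 0.75
print("ROC AUC:  %.2f" % roc_auc(labels, scores))   # ROC AUC:  0.75
```

Accuracy depends on the 0.5 decision threshold, whereas ROC AUC only depends on how the scores rank positives against negatives, which is why it is useful to report both.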



### Loading Model

The training and evaluation data were used to train a sentiment analysis pipeline, and the model weights were saved to disk, allowing the model to be re-used. The next cell loads the model that we trained earlier, which is then used to compute sentiment scores for the cryptocurrency tweets.



In [6]:
pipeline = "/cs/unique/ls99-kf39-cs5052/models/"

sentiment_model = SentimentAnalyser()
sentiment_model.load(pipeline)

In [9]:
alts = ['ETH']  # other coins: ['BTC', 'XMR', 'DASH', 'LTC', 'ETC', 'BCH']
hashes = {
        'ETH': ['%ethereum%', '%ether%', '%eth%'],
        'BTC': ['%bitcoin%', '%btc%', '%bitcoin%'],
        'XMR': ["%monero%", "%xmr%", "%monero%"], 
        'DASH': ["%digital cash%", "%dash%", "%dash%"], 
        'LTC': ["%litecoin%", "%ltc%", "%litecoin%"], 
        'ETC': ["%ethereum classic%", "%etc%", "%eth classic%"], 
        'BCH': ["%bitcoincash%", "%bch%", "%bitcoin cash%"]
    }

crypto_data = funcs.load_crypto()


Loading KRAKEN_DASH_EUR.csv
Loading KRAKEN_LTC_EUR.csv
Loading KRAKEN_BTC_EUR.csv
Loading KRAKEN_ETC_EUR.csv
Loading KRAKEN_BCH_EUR.csv
Loading KRAKEN_XMR_EUR.csv
Loading KRAKEN_ETH_EUR.csv


In [11]:

for coin in alts:
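    patterns = hashes[coin]  # avoid shadowing the built-in hash()
    text = fs.lower(df2['text'])
    # Keep tweets whose lower-cased text matches any of the coin's LIKE patterns.
    coin_df = df2.filter(text.like(patterns[0]) | text.like(patterns[1]) | text.like(patterns[2]))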
    coin_clean = clean.clean_tweets(coin_df, "text")
    print("Looking at %s" % coin)
    print(coin_clean.first().text)
    
    coin_pred = sentiment_model.predict(coin_clean)
    pos, neg = sentiment_model.count_sentiments(coin_pred)
    
    daily_pos = funcs.aggregate_by_day(pos)
    daily_neg = funcs.aggregate_by_day(neg)
    daily_pos.set_index("date", inplace=True)
    daily_neg.set_index("date", inplace=True)
    daily_pos.index = pd.to_datetime(daily_pos.index)
    daily_neg.index = pd.to_datetime(daily_neg.index)
    
    daily_sent = pd.DataFrame(data={'pos':daily_pos['count'], 'neg':daily_neg['count']}, index=daily_pos.index)
    daily_sent['pos'] = daily_sent['pos'].astype(float)
    daily_sent['neg'] = daily_sent['neg'].astype(float)
    daily_sent['sentiment'] = (daily_sent['pos'] - daily_sent['neg']) / (daily_sent['pos'] + daily_sent['neg'])
    
    # Fetch coin price data, combine with sentiment score and plot.
    # Copy so that the cached frame in crypto_data is not modified in place.
    sent_price = crypto_data[coin].copy()
    sent_price['price'] = sent_price.weightedAverage
    sent_price['return'] = sent_price.price.pct_change()
    sent_price['sentiment'] = daily_sent['sentiment']
    sent_price.dropna(inplace=True)
    
    double_plot([sent_price.price, sent_price.sentiment], 
                ['%s Price' % coin, 'Sentiment'], 
                ['Date', 'Price', 'Sentiment Score'], 
                "%s Price vs Tweet Sentiment" % coin, 
                sent_price.index.tolist())
    double_plot([sent_price['return'], sent_price.sentiment], 
                ['%s Return' % coin, 'Sentiment'], 
                ['Date', 'Return', 'Sentiment Score'], 
                "%s Return vs Tweet Sentiment" % coin, 
                sent_price.index.tolist())
    print("Sentiment-Price correlation: %.4f" % sent_price.corr().sentiment.price)
    print("Sentiment-Return correlation: %.4f" % sent_price.corr()['sentiment']['return'])
    


Looking at ETH
rt @alttradex the alttradex ico starts tomorrow get your 20% bonus#alttradex #ico #bitcoin #ethereum… 
Tweets positive: 815,614, negative: 98,332
Sentiment-Price correlation: -0.2772
Sentiment-Return correlation: -0.0706
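The daily sentiment score used in the loop above is the difference between positive and negative tweet counts divided by their sum, which gives a value between -1 and 1. The sketch below reproduces that calculation in plain Python; `daily_sentiment` and the example rows are hypothetical, while the last line applies the same formula to the ETH totals printed above.

```python
from collections import Counter


def daily_sentiment(rows):
    """rows: iterable of (date_string, label), with label 1 = positive and 0 = negative."""
    pos, neg = Counter(), Counter()
    for day, label in rows:
        (pos if label == 1 else neg)[day] += 1
    scores = {}
    for day in sorted(set(pos) | set(neg)):
        total = pos[day] + neg[day]  # always > 0, as the day has at least one tweet
        scores[day] = (pos[day] - neg[day]) / total
    return scores


# Made-up rows: three positive and one negative tweet on day one, two negatives on day two.
rows = [("2017-11-01", 1), ("2017-11-01", 1), ("2017-11-01", 1), ("2017-11-01", 0),
        ("2017-11-02", 0), ("2017-11-02", 0)]
print(daily_sentiment(rows))
# {'2017-11-01': 0.5, '2017-11-02': -1.0}

# The same formula applied to the overall ETH counts reported above.
eth_overall = (815614 - 98332) / (815614 + 98332)
print("%.4f" % eth_overall)
# 0.7848
```

One difference from the notebook: here a day with no tweets of one class counts as zero for that class, whereas the pandas code above leaves a missing value for such a day, which is later removed by `dropna`.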




### Results

The results of the correlation analysis for daily sentiment scores and cryptocurrency price are given below. For each coin, the raw output is listed first, showing the first matching tweet in the dataset, the positive and negative tweet counts and the calculated correlation scores. This is followed by the graphs of price and return versus sentiment score per day.

```
Looking at ETH
rt @alttradex the alttradex ico starts tomorrow get your 20% bonus#alttradex #ico #bitcoin #ethereum…
Tweets positive: 815,614, negative: 98,332
Sentiment-Price correlation: -0.2772
Sentiment-Return correlation: -0.0706
```

![Price vs Sentiment](../img/ETH_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/ETH_Return_vs_Tweet_Sentiment.png)

```
Looking at BTC
rt @vinnylingham2x #bitcoin needs less developers and more incumbents and intermediaries
Tweets positive: 2,753,006, negative: 403,501
Sentiment-Price correlation: -0.1724
Sentiment-Return correlation: -0.5240
```

![Price vs Sentiment](../img/BTC_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/BTC_Return_vs_Tweet_Sentiment.png)

```
Looking at XMR
rt @andrew0hayes genesis mining discountnpj8st#btc #ltc #eth #xmr #dash #str #gnt #dgb #stratis #xlm #eos #iota…
Tweets positive: 34,201, negative: 8,696
Sentiment-Price correlation: 0.8958
Sentiment-Return correlation: 0.2364
```

![Price vs Sentiment](../img/XMR_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/XMR_Return_vs_Tweet_Sentiment.png)

```
Looking at DASH
rt @ken2020n 1027 あさイチ二宮和也1027 ロンドンハーツ2時間sp二宮和也1028 ラストレシピ公開直前！絶品グルメ打ち上げツアー二宮和也1029 鉄腕dash二宮和也112 アメトーーク二宮和也1112 相葉マナ…
Tweets positive: 451,318, negative: 116,419
Sentiment-Price correlation: -0.3727
Sentiment-Return correlation: -0.6164
```

![Price vs Sentiment](../img/DASH_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/DASH_Return_vs_Tweet_Sentiment.png)

```
Looking at LTC
rt @andrew0hayes genesis mining discountnpj8st#btc #ltc #eth #xmr #dash #str #gnt #dgb #stratis #xlm #eos #iota…
Tweets positive: 214,838, negative: 37,729
Sentiment-Price correlation: 0.4713
Sentiment-Return correlation: -0.0315
```

![Price vs Sentiment](../img/LTC_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/LTC_Return_vs_Tweet_Sentiment.png)

```
Looking at ETC
rt @andrew0hayes #genesismining code npj8st $etc $xmr #litecoin $ltc $steem $zec $dash $btc $eth $rep #eth…
Tweets positive: 46,498, negative: 3,611
Sentiment-Price correlation: -0.2842
Sentiment-Return correlation: -0.5320
```

![Price vs Sentiment](../img/ETC_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/ETC_Return_vs_Tweet_Sentiment.png)

```
Looking at BCH
rt @lisaciarlone what the fork bitcoin cash community preps hard fork slated for november 13
Tweets positive: 196,321, negative: 61,832
Sentiment-Price correlation: 0.5710
Sentiment-Return correlation: -0.1822
```

![Price vs Sentiment](../img/BCH_Price_vs_Tweet_Sentiment.png)
![Return vs Sentiment](../img/BCH_Return_vs_Tweet_Sentiment.png)
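The correlation figures above come from pandas' `DataFrame.corr`, which computes the Pearson coefficient by default. As a rough sketch of what that involves, the plain-Python version below computes simple returns and a Pearson correlation; the function names and the price and sentiment series are made up for illustration and are not taken from the dataset.

```python
import math


def pct_change(prices):
    """Simple return between consecutive prices (one fewer element than the input)."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up daily series, for illustration only.
prices = [100.0, 110.0, 99.0, 108.9]
sentiment = [0.2, 0.4, 0.1, 0.3]
returns = pct_change(prices)

print("Sentiment-Price correlation: %.4f" % pearson(prices, sentiment))
# The first day has no return, so the sentiment series is aligned from day two.
print("Sentiment-Return correlation: %.4f" % pearson(returns, sentiment[1:]))
```

With only about a month of daily observations per coin, each coefficient rests on roughly thirty points, so individual values should be read with caution.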




##### Conclusion

The results of the correlation analysis with sentiment were mixed. A few currencies show a strong correlation, for example XMR (0.8958 between sentiment and price) and DASH (-0.6164 between sentiment and return), while for others, such as ETH, it is weak. However, to fully explore the power of tweets for predicting cryptocurrency prices, more data is needed. We only have one month of tweets and prices, which is not enough to build trading strategies from.

Also, the tweets were only analysed for positive and negative sentiment. In future it may be better to use three classes: positive, neutral and negative.

