# BUSN 20800 Final project

## Due date: 5pm on June 3, 2022 

### Note: 

This is an individual exam. You need to run this notebook on your personal computer because the class server has an issue with the pyLDAvis package.

# Part 1: Introduction

Twitter, one of the world’s largest social media services, has become a platform for politicians, organizations, and companies to give updates to their followers. Users representing a company or a political party use Twitter to state views on current news, push their political campaigns, and even confirm official policy decisions. Important figures who have used the site to broadcast their thoughts to millions of followers include Tesla founder Elon Musk, European Council President Donald Tusk, and current UK Prime Minister Boris Johnson. The use of social networks can have large financial and business implications. For example, Elon Musk attracted a lot of negative attention when he tweeted that ‘funding was secured’ to take his electric vehicle company Tesla private. Tesla’s share price rose as much as 8.5% (1), and this resulted in a punitive investigation into Musk by the Securities and Exchange Commission. This led to Musk being ousted as Chairman of the company that he founded, as well as a $20 million fine, all because of one tweet.

In this project, you are asked to analyze Elon Musk's tweets and link them to the stock returns of Tesla (TSLA).

Run the following code to load the data. You don't need to modify any code here.

In [None]:
# Setup code

import pandas as pd
import numpy as np
import gensim
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS
import pyLDAvis 
import pyLDAvis.gensim_models as gensimvis
import re
import string
import operator
import warnings
from collections import defaultdict
import pprint
import matplotlib.pyplot as plt
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
# load data
df = pd.read_csv("TweetsElonMusk.csv")
df.date = df.date.apply(lambda x: x[:-5])

In [None]:
df.head()

In [None]:
# print out how many tweets we collected
text_data = list(df["texts"].values)
date      = list(df["date"].values)
T         = len(text_data) 
print(T)

In [None]:
import en_core_web_sm
spacy_nlp   = en_core_web_sm.load()

# Part I. Data cleaning and EDA

Please follow the data cleaning procedure you have seen in Trump's tweet example. 


The ```clean_tweets()``` function cleans the original tweet and returns the result as a list of lemmatized tokens.

You don't need to modify any code here.

In [None]:
def clean_tweets(tweet):
    """
    Tokenize and lemmatize an input tweet
    
    Input:
        Tweet: string type
    Output:
        A list containing tokens
    """
    tweet       = re.sub('&amp;', ' ',tweet)
    tweet       = emoji_pattern.sub(r' ', tweet)
    
    word_tokens = spacy_nlp(tweet)
    tokens      = []
    
    for w in word_tokens:
        if not w.is_stop: # not stop words
            s   = w.text.lower()
            s   = re.sub(r'^[@#]', '', s)
            s   = re.sub(r'[^a-zA-Z0-9_]+$', '', s)
            s   = re.sub(r'https?:\S*', '', s)
            s   = re.sub(r'[-,#()@=!\"\'\?\/:]+', ' ', s)
            
            #replace consecutive non-ASCII characters with a space
            s   = re.sub(r'[^\x00-\x7F]+',' ', s)
            tokens += s.split()
    text = " ".join(tokens)

    word_tokens    = spacy_nlp(text)
    filtered_tweet = []
    for w in word_tokens:
        if not w.is_stop:
            if w.lemma_ != "-PRON-":
                s = w.lemma_.lower()
            else:
                s = w.lower_
            s = s.strip('-')
            if len(s) <= 1:
                continue
            if re.match(r'^[a-zA-Z_\.]+$', s):
                filtered_tweet.append(s)
    
    return filtered_tweet

In [None]:
#Emoji patterns
emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)

Test if the tokenization works well.

In [None]:
##############################################################################
### TODO: Give an example to show the results of tokenization.             ###
##############################################################################

#Test tokenization


##############################################################################
#                               END OF YOUR CODE                             #
##############################################################################

Construct the bag-of-words model. It stores the number of times each word appears in each tweet.
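
As a rough illustration of what a bag-of-words representation stores, the sketch below builds one by hand using only the standard library. The helpers `build_vocab` and `doc_to_bow` and the toy documents are hypothetical and exist only for this example; the notebook itself relies on `gensim.corpora.Dictionary` and `doc2bow`, whose id assignment may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id (hypothetical helper, not gensim)."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy tokenized "tweets", made up for illustration
toy_docs = [["tesla", "model", "tesla"], ["rocket", "launch", "tesla"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [doc_to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)  # {'tesla': 0, 'model': 1, 'rocket': 2, 'launch': 3}
print(toy_bow)    # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Each document becomes a sparse list of (word id, count) pairs, so words that do not occur in a tweet take up no space.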

You don't need to modify any code here.

In [None]:
# bag-of-words model
processed_docs = []

for t in text_data:
    try:
        processed_docs.append(clean_tweets(t))
    except Exception:
        # skip tweets that cannot be processed (e.g. non-string entries)
        pass


# create a dictionary for all tweets
dictionary     = gensim.corpora.Dictionary(processed_docs)

# create the bag of words for all tweets; it records the number of times each word appears in each tweet.
bow_corpus     = [dictionary.doc2bow(doc) for doc in processed_docs]

# Part II. Latent Dirichlet Allocation (LDA)


LDA was proposed in 2003 to infer the topic distribution of documents. It can give the topic of each document in the document set in the form of probability distribution, so that after analyzing some documents to extract their topic distribution, topic clustering or text classification can be performed according to the topic distribution.

Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic model of text documents. It assumes a collection of K “topics.” Each topic defines a multinomial distribution over the vocabulary and is assumed to have been drawn from a Dirichlet, $\beta_{k} \sim \text{Dirichlet}(\eta)$. Given the topics, LDA assumes the following generative process for each document $d$. First, draw a distribution over topics $\theta_{d} \sim \text{Dirichlet}(\alpha)$. Then, for each word $i$ in the document, draw a topic index $z_{d_i} \in \lbrace1, \dots , K\rbrace$ from the topic weights $z_{d_i} \sim \theta_{d}$ and draw the observed word $w_{d_i}$ from the selected topic, $w_{d_i} \sim \beta_{z_{d_i}}$ . For simplicity, we assume symmetric priors on $\theta$ and $\beta$, but this assumption is easy to relax.
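
The generative process above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the vocabulary, the values of $K$, $\alpha$ and $\eta$, and the helper names `sample_dirichlet` and `generate_document` are all made up for the example, and the sketch simulates documents rather than fitting a model the way gensim does.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topics, alpha, vocab, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # theta_d ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # topic index z ~ theta_d
        w = rng.choices(vocab, weights=topics[z])[0]           # word w ~ beta_z
        words.append(w)
    return theta, words

rng = random.Random(0)
lda_vocab = ["car", "battery", "rocket", "orbit"]  # hypothetical vocabulary
K, eta, alpha = 2, 0.5, 0.5                        # assumed symmetric priors

# beta_k ~ Dirichlet(eta): one distribution over the vocabulary per topic
lda_topics = [sample_dirichlet([eta] * len(lda_vocab), rng) for _ in range(K)]
theta_d, doc = generate_document(8, lda_topics, [alpha] * K, lda_vocab, rng)

print(theta_d)
print(doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers plausible values of $\theta$, $\beta$ and $z$.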

Note that if we sum over the topic assignments $z$, then we get $p(w_{d_i} | \theta_d, \beta) = \sum_{k} \theta_{dk}\beta_{kw}$. This leads to the “multinomial PCA” interpretation of LDA: we can think of LDA as a probabilistic factorization of the matrix of word counts $n$ (where $n_{dw}$ is the number of times word $w$ appears in document $d$) into a matrix of topic weights $\theta$ and a dictionary of topics $\beta$. The online LDA algorithm in the paper cited below can thus be seen as an extension of online matrix factorization techniques that optimize squared error to more general probabilistic formulations.
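
To make the factorization view concrete, the following sketch evaluates $\sum_{k} \theta_{dk}\beta_{kw}$ for one document with two topics and a three-word vocabulary. The numbers and the helper `word_probabilities` are invented for illustration.

```python
def word_probabilities(theta_d, beta):
    """Return p(w | theta_d, beta) = sum_k theta_dk * beta_kw for every word w."""
    n_words = len(beta[0])
    return [sum(theta_d[k] * beta[k][w] for k in range(len(beta)))
            for w in range(n_words)]

theta_example = [0.75, 0.25]        # topic weights of one document
beta_example = [[0.5, 0.5, 0.0],    # topic 0: distribution over 3 words
                [0.0, 0.5, 0.5]]    # topic 1: distribution over 3 words

p_w = word_probabilities(theta_example, beta_example)
print(p_w)  # [0.375, 0.5, 0.125]
```

The result is again a probability distribution over the vocabulary: a row of the product $\theta\beta$.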

We can analyze a corpus of documents with LDA by examining the posterior distribution of the topics $\beta$, topic proportions $\theta$, and topic assignments $z$ conditioned on the documents. This reveals latent structure in the collection that can be used for prediction or data exploration. This posterior cannot be computed directly, and is usually approximated using Markov Chain Monte Carlo
(MCMC) methods or variational inference. Both classes of methods are effective, but both present significant computational challenges in the face of massive data sets. Developing scalable approximate inference methods for topic models is an active area of research. 

This section is adapted from [Online Learning for Latent Dirichlet Allocation](https://www.di.ens.fr/~fbach/mdhnips2010.pdf).

To learn more, see [gensim.models.LdaMulticore](https://radimrehurek.com/gensim/models/ldamulticore.html).

Perform LDA on these tweets and visualize the topic models.

(Try different parameters to see the difference.)

In [None]:
# Running LDA using Bag of Words
lda_model = gensim.models.LdaMulticore(corpus=bow_corpus,id2word=dictionary,num_topics=8, passes=20,random_state = 1,workers=3)

In [None]:
# Display the topic models
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    vis_data1 = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis_data1)

Based on the visualization results, can you summarize the major topics of Elon Musk's tweets? How many topics would you select?

**Your Answer here:**



# Part III. TF-IDF Model (For your reading)

TF-IDF means term-frequency-inverse-document-frequency. Term Frequency is the number of times a word has occurred in the document or a word's frequency in a document. Its domain remains local to the document. Document frequency is the fraction of documents in which the word has occurred. It’s calculated based on statistics collected from the entire corpus.

In very simple terms, the TF-IDF scheme is used to extract the features or important words that best represent a document. To get an intuitive feel for TF-IDF, consider a recipe book that has recipes for various fast foods.

Here, the recipe book is our corpus and each recipe is our document. Now consider various recipes such as:
1. Burger which will consist of words like "bun", "meat", "lettuce", "ketchup", "heat", "onion", "food", "preparation", "delicious", "fast"
2. French fries which will consist of words like "potato", "fry", "heat", "oil", "ketchup", "food", "preparation", "fast"
3. Pizza which will consist of words like "ketchup", "capsicum", "heat", "food", "delicious", "oregano", "dough", "fast"

Here words like "fast", "food", "preparation", "heat" occur in almost all the recipes. Such words will have a very high document frequency. Now consider words like "bun", "lettuce" for the recipe burger, "potato" and "fry" for the recipe french fries and "dough" and "oregano" for the recipe pizza. These have a high term frequency for the particular recipe they are related to, but will have a comparatively low document frequency.

The TF-IDF scheme is simply the product of the term frequency and the inverse of the document frequency of each term. Words like "potato" in French fries will have a high TF-IDF value and are at the same time the best representatives of the document, which in this case is "French Fries".

Now mathematically speaking:

- **Term frequency**: For a word (or term) $t$ and a document $d$, let $\text{f}_{t,d}$ be the number of times $t$ appears in $d$. The term frequency is then $\text{tf}_{t,d} = \text{f}_{t,d}$.

- **Inverse document frequency**: Taking the log smartirs scheme, the inverse document frequency is calculated as $\text{idf}(t,D) = \log \frac{N}{|\lbrace d \in D : t \in d \rbrace|}$, that is, the log of the inverse fraction of documents that contain the term $t$, where $N$ is the total number of documents in the corpus $D$. To avoid division by zero, 1 is generally added to the denominator.

- **TF-IDF**: Finally, TF-IDF is calculated as the product of term frequency $\text{tf}_{t,d}$ and inverse document frequency $\text{idf}(t,D)$: $\text{TF-IDF}(t,d,D) = \text{tf}_{t,d} \times \text{idf}(t,D)$
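
The definitions above can be checked on the recipe example with a short standard-library sketch. It assumes raw counts for the term frequency and a natural logarithm without the +1 smoothing; gensim's `TfidfModel` uses its own defaults (including normalization), so its numbers will generally differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

recipes = {
    "burger": ["bun", "meat", "lettuce", "ketchup", "heat", "onion",
               "food", "preparation", "delicious", "fast"],
    "french fries": ["potato", "fry", "heat", "oil", "ketchup",
                     "food", "preparation", "fast"],
    "pizza": ["ketchup", "capsicum", "heat", "food", "delicious",
              "oregano", "dough", "fast"],
}

def tf_idf(docs):
    """Compute tf * log(N / df) for every term of every document."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores[name] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return scores

scores = tf_idf(recipes)
fries = scores["french fries"]
print(fries["potato"])       # log(3), about 1.10: appears only in this recipe
print(fries["preparation"])  # log(1.5), about 0.41: appears in two recipes
print(fries["heat"])         # 0.0: appears in every recipe
```

Words shared by all recipes receive a score of zero, while recipe-specific words such as "potato" receive the highest scores.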

This section is adapted from [Pivoted document length normalisation](https://rare-technologies.com/pivoted-document-length-normalisation/).

Run the cells below to construct a TF-IDF model. You don't need to modify any code here.

In [None]:
#Running LDA using TF-IDF

#Create tf-idf model object using models.TfidfModel on bow_corpus and save it to 'tfidf'
tfidf           = gensim.models.TfidfModel(bow_corpus)

#apply transformation to the entire corpus and name it 'corpus_tfidf'
corpus_tfidf    = tfidf[bow_corpus]
tfidf_lda_model = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, 
                                             passes=20, random_state=np.random.RandomState(20800))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    vis_data_tf_idf   = gensimvis.prepare(tfidf_lda_model, corpus_tfidf, dictionary)

pyLDAvis.display(vis_data_tf_idf)

Compare the topic models obtained with TF-IDF and with vanilla LDA. What differences did you find?

**Your Answer here:**

# Part IV. Sentiment Analysis

The function below wraps a `SentimentIntensityAnalyzer()`, which scores the sentiment of each tweet.


It returns the indices of those tweets whose compound sentiment score is at or above the positive cutoff, or at or below the negative cutoff.
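
To see how the cutoffs partition tweets, the sketch below applies the same thresholding to a list of made-up compound scores (VADER's compound score lies between -1 and 1). The helper `classify_by_cutoff` and the scores are hypothetical; no sentiment model is run here.

```python
def classify_by_cutoff(compound_scores, pos_cutoff=0.7, neg_cutoff=-0.7):
    """Return indices of strongly positive and strongly negative scores."""
    pos, neg = [], []
    for i, score in enumerate(compound_scores):
        if score >= pos_cutoff:
            pos.append(i)
        elif score <= neg_cutoff:
            neg.append(i)
    return pos, neg

made_up_scores = [0.85, -0.10, -0.92, 0.40, 0.70]

strict = classify_by_cutoff(made_up_scores)
loose = classify_by_cutoff(made_up_scores, pos_cutoff=0.3, neg_cutoff=-0.05)
print(strict)  # ([0, 4], [2])
print(loose)   # ([0, 3, 4], [1, 2])
```

Loosening the cutoffs produces more signals, at the price of including tweets whose sentiment is weaker.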

You don't need to modify any code here, but you can experiment with different cutoff values to see how they influence the final outcome.

In [None]:
def sentiment_analyzer(sentences, pos_cutoff = 0.7, neg_cutoff = -0.7):
    """
    Given a list of tweets, return those that contain strong positive/negative sentiments
    Args:
        sentences: a list containing the indices of tweets in text_data
        pos_cutoff: tweets with sentiment scores >= pos_cutoff are classified as positive
        neg_cutoff: tweets with sentiment scores <= neg_cutoff are classified as negative
    Returns:
        pos: a list containing the indices of tweets that are classified as positive
        neg: a list containing the indices of tweets that are classified as negative
    """
    analyser  = SentimentIntensityAnalyzer()
    pos, neg  = [], []
    for i in sentences:
        score = analyser.polarity_scores(text_data[i])
        
        if score['compound']   >= pos_cutoff: # positive comments
            pos.append(i)
        elif score['compound'] <= neg_cutoff: # negative comments
            neg.append(i)
    return (pos, neg)

In [None]:
# lists containing indices for positive tweets and negative tweets resp.
pos, neg = sentiment_analyzer(np.arange(len(text_data)))

Display two tweets: one with positive sentiment and one with negative sentiment.

In [None]:
##############################################################################
### TODO: Find a positive example.                                         ###
##############################################################################

# Positive sentence example



##############################################################################
#                               END OF YOUR CODE                             #
##############################################################################

In [None]:
##############################################################################
### TODO: Find a negative example.                                         ###
##############################################################################


# Negative sentence example



##############################################################################
#                               END OF YOUR CODE                             #
##############################################################################

# Part V. Stock trading

Now it's time for us to take the above analysis into trading.

In this section, we will use TSLA as a trading example. We will use the sentiment signals obtained above to decide when to go long TSLA and when to short it.



In [None]:
# Load TSLA return data
tsla =  pd.read_csv("TSLA.csv")
tsla['CLOSEPRC'] = 0.5 * (tsla.BID + tsla.ASK)
tsla.date =tsla.date.apply(lambda x: str(x))
tsla.date = tsla.date.apply(lambda x: x[:4]+'-'+x[4:6]+'-'+x[6:])
tsla.head()

In [None]:
def time_range(date):
    """
    Given a date, return the market opening time and closing time on that day
    Args:
        date: a string of the format 'mm/dd/yy hh:mm'
    Returns:
        a tuple of pd.Timestamp: (that day at 09:30:00, that day at 16:00:00)
    """
    day = pd.Timestamp(pd.to_datetime(date).date())  # midnight of the signal's day
    return (day + pd.Timedelta('09:30:00'), day + pd.Timedelta('16:00:00'))

Here is our trading strategy. 


For positive sentiment signals, the trading strategy is:
+ (1) if signal occurs before market opens: buy at open and sell at close
+ (2) if signal occurs during market hours: buy at close and sell at tomorrow’s close
+ (3) if signal occurs after hours: buy at tomorrow’s open and sell at tomorrow’s close

For negative sentiment signals, the trading strategy is:
+ (1) if signal occurs before market opens: sell at open and return to 100% SPY exposure at close
+ (2) if signal occurs during market hours: sell at close and return to 100% SPY exposure at tomorrow’s close
+ (3) if signal occurs after hours: sell at tomorrow’s open and return to 100% SPY exposure at tomorrow’s close
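
The three timing cases can be sketched independently of any price data. The example below only classifies a timestamp and looks up the entry and exit rule for a positive signal; the names `classify_signal` and `positive_signal_trade` are hypothetical, and the sketch ignores weekends and holidays.

```python
from datetime import datetime, time

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)

def classify_signal(timestamp):
    """Return 'pre-market', 'market hours', or 'after hours' for a datetime."""
    t = timestamp.time()
    if t < MARKET_OPEN:
        return "pre-market"
    if t > MARKET_CLOSE:
        return "after hours"
    return "market hours"

def positive_signal_trade(timestamp):
    """Return the (entry, exit) rule for a positive signal, as text only."""
    rules = {
        "pre-market": ("today's open", "today's close"),
        "market hours": ("today's close", "tomorrow's close"),
        "after hours": ("tomorrow's open", "tomorrow's close"),
    }
    return rules[classify_signal(timestamp)]

print(positive_signal_trade(datetime(2021, 3, 1, 8, 15)))   # ("today's open", "today's close")
print(positive_signal_trade(datetime(2021, 3, 1, 12, 0)))   # ("today's close", "tomorrow's close")
print(positive_signal_trade(datetime(2021, 3, 1, 18, 30)))  # ("tomorrow's open", "tomorrow's close")
```

For a negative signal the entry and exit times are the same, but the position is a sale instead of a purchase.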



We have implemented this trading strategy for you. You can take a look at the ```trade``` function below.



You don't need to modify any code here.

In [None]:
def trade(signals, positive=True):
    """
    Given a list of signals, compute the profits gained.
    For positive sentiment signals, the trading strategy is:
        (1)if signal occurs before market opens: buy at open and sell at close;
        (2)if signal occurs during market hours: buy at close and sell at tomorrow’s close
        (3)if signal occurs after hours: buy at tomorrow’s open and sell at tomorrow’s close
    For negative sentiment signals, the trading strategy is:
        (1)if signal occurs before market opens: sell at open and return to 100% SPY exposure at close
        (2)if signal occurs during market hours: sell at close and return to 100% SPY exposure at tomorrow’s close
        (3)if signal occurs after hours: sell at tomorrow’s open and return to 100% SPY exposure at tomorrow’s close
    Args:
        signals: a list containing the time ('mm/dd/yy hh:mm') when a signal occurred (aka. tweet was published)
        positive: boolean. True means the input are all positive sentiment signals; False means negative sentiment signals. 
    Returns:
        No return value. Modifications are done on the dict 'profits'
    """
    for s in signals:
        op, ed = time_range(s)
        x      = str(pd.to_datetime(s).date())
        #Signal occurs before market opens:
        if pd.Timestamp(s) < op and len(tsla.loc[tsla["date"] == x]) > 0:
            if positive:
                #buy at open and sell at close
                p = tsla.loc[tsla["date"] == x]["CLOSEPRC"].values[0] - tsla.loc[tsla["date"] == x]["OPENPRC"].values[0]
            else:
                #sell at open and return to 100% SPY exposure at close
                p = tsla.loc[tsla["date"] == x]["OPENPRC"].values[0] - tsla.loc[tsla["date"] == x]["CLOSEPRC"].values[0]
            profits[x].append(p)
        else:
            t = pd.Timestamp(pd.to_datetime(s).date()) + pd.Timedelta('1 days') #next day
            y = str(pd.to_datetime(t).date())
            if len(tsla.loc[tsla["date"] == y]) > 0:
                #Signal occurs after hours:
                if pd.Timestamp(s) > ed:
                    if positive:
                        #buy at tomorrow’s open and sell at tomorrow’s close
                        p = tsla.loc[tsla["date"] == y]["CLOSEPRC"].values[0] - tsla.loc[tsla["date"] == y]["OPENPRC"].values[0]
                    else:
                        #sell at tomorrow’s open and return to 100% SPY exposure at tomorrow’s close
                        p = tsla.loc[tsla["date"] == y]["OPENPRC"].values[0] - tsla.loc[tsla["date"] == y]["CLOSEPRC"].values[0]
                # Signal occurs during market hours:
                elif len(tsla.loc[tsla["date"] == x]) > 0:
                    if positive:
                        #buy at close and sell at tomorrow’s close
                        p = tsla.loc[tsla["date"] == y]["CLOSEPRC"].values[0] - tsla.loc[tsla["date"] == x]["CLOSEPRC"].values[0]
                    else:
                        #sell at close and return to 100% SPY exposure at tomorrow’s close
                        p = tsla.loc[tsla["date"] == x]["CLOSEPRC"].values[0] - tsla.loc[tsla["date"] == y]["CLOSEPRC"].values[0]
                else:
                    # no TSLA data on the signal day (e.g. weekend): skip, so that an undefined or stale p is not recorded
                    continue
                profits[y].append(p)

In [None]:
def find_signals(signals):
    """
    Given a list of signals, return the dates of their occurrences and the closing values of TSLA at the corresponding dates
    Args:
        signals: a list containing the time ('mm/dd/yy hh:mm') when a signal occurred (aka. tweet was published)
    Returns:
        time: a list containing the date ('mm/dd/yy') of each signal
        value: a list containing the closing values of TSLA at each day in time
    """
    time, value = [], []
    for s in signals:
        x = str(pd.to_datetime(s).date())
        if len(tsla.loc[tsla["date"] == x]) > 0:
            time.append(x)
            value.append(tsla.loc[tsla["date"] == x]["CLOSEPRC"].values[0])
    return (time, value)

In [None]:
# Initialize the profits, default = []

profits = defaultdict(list)

In [None]:
# Find the positive signals and negative signals and perform the trading respectively.

pos_time, neg_time = [df.iloc[i]["date"] for i in pos], [df.iloc[i]["date"] for i in neg]


trade(pos_time)
trade(neg_time, False)

In [None]:
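time  = list(tsla["date"].values)
y1    = list(tsla["CLOSEPRC"].values)   # TSLA closing price on each day
y2    = []                              # closing price plus cumulative strategy profit
start = 0                               # running sum of the daily average profits
for t, v in zip(time, y1):
    # average the profits of all signals traded on day t (0 if there were none)
    start += np.mean(profits[t] if profits[t] else 0.0)
    y2.append(v + start)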

In [None]:
# Get the buy and sell signals

buy, bval  = find_signals(pos_time)
sell, sval = find_signals(neg_time)

In [None]:
# Visualize the trading strategy

plt.figure(figsize = (16, 8))
plt.plot(time, y1, label = "TSLA closing price")
plt.plot(time, y2, label = "Portfolio Value")
plt.scatter(buy, bval, marker = 'v', c = "green", label = "Buy Signal")
plt.scatter(sell, sval, marker = '^', c = "red", label = "Sell Signal")
plt.xlabel("Time")
plt.ylabel("Price")
plt.legend(loc = "upper left")

plt.xticks(range(1,2960,300))
plt.show()

Based on the visualization above, do you think trading based on sentiment is a good idea? Analyze the returns and report your findings.

**Your Answer here:**