# Text to Features

### Current Issues

The lists of stock indices given in the files `snp500_list.csv` and `nyse_list.csv` contain some indices with single-letter symbols, e.g. `A` for Agilent Technologies. This makes identifying mentions of these indices in tweets very difficult, because a bare letter also matches ordinary words. My proposed solutions are:
- Check if the indices are extracted correctly. 
- See if there is a standard convention of how stock indices are mentioned in Twitter text, e.g. some tweets use `$` in front of stock indices.
- As a last resort, we may toss these stocks from our analysis.
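
As a rough illustration of the `$` convention mentioned above, the sketch below matches a symbol only when it is written as a cashtag, so a single-letter symbol such as `A` is not confused with the ordinary word. The helper name `find_cashtags` and the regular expression are assumptions made for illustration; they are not part of this notebook's helper functions, and the pattern has not been checked against how real tweets write tickers.

```python
import re

# Hypothetical pattern: "$" followed by 1-5 letters and an optional ".X" class
# suffix (e.g. $BRK.B), not glued to other letters, digits, or underscores.
CASHTAG = re.compile(r"(?<![A-Za-z0-9_])\$([A-Za-z]{1,5}(?:\.[A-Za-z])?)(?![A-Za-z0-9_])")

def find_cashtags(tweet, symbols):
    """Return the symbols from `symbols` that appear as cashtags in `tweet`."""
    known = {s.upper() for s in symbols}
    found = {m.group(1).upper() for m in CASHTAG.finditer(tweet)}
    return sorted(found & known)

print(find_cashtags("Buying $A and $tsla, not a fan of $XYZ.", ["A", "TSLA", "AAPL"]))
```

Because the pattern uses lookarounds instead of requiring a trailing space, it also picks up a cashtag at the end of a tweet or directly before punctuation.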


### Overview

In this notebook, we load the table of tweets `tweets_data.csv`. For each tweet, given as a string, we compute the (normalized) number of words that come from various word libraries. The word libraries include:
- `Henry08_poswords.txt` and `Henry08_negwords.txt`, containing positive and negative words, respectively, from Henry (2008).
- `LM11_pos_words.txt` and `LM11_neg_words.txt`, containing words related to positive and negative sentiments, respectively, from Loughran and McDonald (2011).
- `ML_positive_bigram.csv` and `ML_negative_bigram.csv`, containing positive and negative bigrams, respectively, from Hagenau et al. (2013). (Open question: are trigrams also available?)
- `news_library.txt`, containing names of mainstream business news agencies.

Furthermore, we identify the stock indices listed in `snp500_list.csv` and `nyse_list.csv` for each row in the table.

The word counts and the list of mentioned stock indices will be added as new columns to the dataframe and saved to a new file: `tweets_data_features_added.csv`. That file will be read by the notebook that performs our model fit.

This notebook can easily be modified to treat a different tweet table file.


### Libraries

We use `pandas` for dataframes and `spaCy` for linguistic operations.

In [1]:
import pandas as pd
import spacy

We use `spaCy`'s `en_core_web_sm` model as the underlying English language processing model. Throughout this notebook, we denote the loaded model by `nlp`.

In [2]:
nlp = spacy.load('en_core_web_sm')

Also, we use `PhraseMatcher` (https://spacy.io/api/phrasematcher) to count matches, since it conveniently handles multi-word phrases such as bigrams.

In [3]:
from spacy.matcher import PhraseMatcher

### Setup

For the code in this section to work, one must first run the cells in the [Helper Functions](#the_destination) section. The docstrings of the helper functions also provide additional details about the methods employed.
<br><br>
Define the local path of your repository folder.

In [12]:
localpath = "/Users/josht/Documents/GitHub/erdos_twitter_project"

Load the file `tweets_data.csv` into a dataframe. 

In [13]:
# Run this for real data
#df_tweets = pd.read_csv(localpath + "/data/tweets_data.csv")

# Run this during the trial run
df_tweets = pd.read_csv(localpath + "/data/Tweets_Raw/df_tsla OR aapl.csv")

The tweet contents appear as strings in the `text` column.

In [14]:
df_tweets.head(3)

Unnamed: 0,author_id,created_at,created_at_user,location,name,public_metrics_followers_count,public_metrics_following_count,public_metrics_like_count,public_metrics_listed_count,public_metrics_quote_count,public_metrics_reply_count,public_metrics_retweet_count,public_metrics_tweet_count,source,text,tweet_id,username
0,187896454,2018-12-31T23:59:31.000Z,2010-09-07T12:49:16.000Z,Louisiana,lui toth,39249,42551,2,298,0,0,1,30125,Twitter Web Client,"$BLSP huge volume, closes up 12.5%. Shares st...",1079889823161303047,sprtcrdlui
1,19593572,2018-12-31T23:58:57.000Z,2009-01-27T14:21:12.000Z,Denver/Colorado/Everywhere,Technoking Tim,25631,26385,0,577,0,0,2,79787,Twitter for iPhone,"RT @Polixenes13: Ross, please just never stop ...",1079889679774871552,TimWJackson
2,996125781620330500,2018-12-31T23:58:40.000Z,2018-05-14T20:31:08.000Z,"Georgia, USA",TeslaFamily,148,195,2,4,0,0,0,931,Twitter for Android,@kimpaquette I pity the fool that is shorting ...,1079889607515426816,Tesla90801351


We load the libraries of key phrases and put them in a list called `keys`.

In [15]:
keyfiles_words = [localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_poswords.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_negwords.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/LM11_pos_words.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/LM11_neg_words.txt"]

In [16]:
keyfiles_bigrams = [localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/ML_positive_bigram.csv",
                   localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/ML_negative_bigram.csv"]

In [17]:
# Run this for real data
#keys = [get_keywords(keyfile) for keyfile in keyfiles_words] + [get_keybigrams(keyfile) for keyfile in keyfiles_bigrams]

# Run this during the trial run
keys = [get_keywords(keyfile) for keyfile in keyfiles_words[:4]]


To see how many words there are in each library, we compute

In [18]:
[len(keys[i]) for i in range(len(keys))]

[47, 40, 225, 1285]

Define the library names in a legible manner for reference.

In [19]:
# Run this for real data
#key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg", "Hagenau13_pos", "Hagenau13_neg"]

# Run this during the trial run
key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg"]

Now, we prepare the list of stock indices, starting once again from the file paths.

In [20]:
keyfiles_stocks = [localpath + "/data/Stock_indices/snp500_list.csv",
                  localpath + "/data/Stock_indices/nyse_list.csv"]

We load the dataframe from each csv file, then extract the list of stock index symbols from its `Symbol` column. Finally, we store all lists of indices in `stocks`, in a similar manner to `keys`.

In [21]:
stocks = []

for file in keyfiles_stocks:
    df_stocks = pd.read_csv(file)
    stocks.append(list(df_stocks["Symbol"]))

For `snp500_list.csv`, the company names are available under the column `Security`. However, for `nyse_list.csv`, only index names are available, under the column `Name`. Unfortunately, there is no regular pattern to go from index names to company names. Also, index names are so long and specific that they are probably very rarely mentioned in full on Twitter. However, I am open to an alternative solution.

In [22]:
company_names = []

# Add company names from snp500_list.csv
df_stocks = pd.read_csv(keyfiles_stocks[0])
company_names.append(list(df_stocks["Security"]))

# As a placeholder for nyse_list.csv, use an empty list
company_names.append([])

Finally, define the stock index library names for reference.

In [23]:
stock_library = ["S&P500", "NYSE"]

### Mentioned Stock Indices

We apply `get_stock_list` to each tweet in `df_tweets["text"]` and each stock library in `stocks`, and store the resulting stock lists in `mentioned_stocks`. Since `company_names` is not passed in this call, only symbols (not company names) are matched.

In [24]:
mentioned_stocks = [[[] for i in range(df_tweets.shape[0])] for j in range(len(stocks))]

for j in range(len(stocks)):
    for i in range(df_tweets.shape[0]):
        mentioned_stocks[j][i] = get_stock_list(df_tweets["text"].iloc[i], stocks[j])

Then, we put `mentioned_stocks` into corresponding new columns, e.g. `Mentioned_stocks_S&P500`.

In [25]:
for j in range(len(stocks)):
    df_tweets["Mentioned_stocks_" + stock_library[j]] = mentioned_stocks[j]

Now, we see that new columns are added listing the mentioned stocks. Some rows have empty lists for both columns.

In [26]:
df_tweets.sample(3)

Unnamed: 0,author_id,created_at,created_at_user,location,name,public_metrics_followers_count,public_metrics_following_count,public_metrics_like_count,public_metrics_listed_count,public_metrics_quote_count,public_metrics_reply_count,public_metrics_retweet_count,public_metrics_tweet_count,source,text,tweet_id,username,Mentioned_stocks_S&P500,Mentioned_stocks_NYSE
13,26440443,2018-12-31T23:44:18.000Z,2009-03-25T05:42:31.000Z,United States,Shannon Powers,82,225,0,0,0,0,3,2030,Twitter for Android,RT @Hein_The_Sayer: @RationalEtienne @rm_natas...,1079885991739940864,rosepaleblue,[],[]
2,996125781620330500,2018-12-31T23:58:40.000Z,2018-05-14T20:31:08.000Z,"Georgia, USA",TeslaFamily,148,195,2,4,0,0,0,931,Twitter for Android,@kimpaquette I pity the fool that is shorting ...,1079889607515426816,Tesla90801351,[TSLA],[]
14,1214284580,2018-12-31T23:43:57.000Z,2013-02-24T02:46:06.000Z,"Durban, South Africa 🇿🇦",Vaccinated Renaissance Man💉😷,2547,1331,2,15,0,0,0,53716,Twitter for Android,@RationalEtienne @rm_natasha @TradrFloridaFIL ...,1079885905920438272,Hein_The_Slayer,[],[]


Finally, to save time for future steps, we drop from `df_tweets` the rows whose tweet mentions no stock index.

In [27]:
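# Collect the indices of the rows in which no stock from either stock library is mentioned.
# Note: i is a position, while drop() below works on labels; the two agree only for the default 0..n-1 index.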
emp_ind = []
for i in range(df_tweets.shape[0]):
    if len(df_tweets["Mentioned_stocks_S&P500"].iloc[i]) == 0 and len(df_tweets["Mentioned_stocks_NYSE"].iloc[i]) == 0:
        emp_ind.append(i)

df_tweets_shorten = df_tweets.drop(emp_ind).copy()
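
The same filtering can also be written with a boolean mask, which avoids mixing row positions with index labels. The sketch below uses a toy dataframe (`df_demo`, a hypothetical stand-in for `df_tweets`) whose index deliberately does not start at 0; the column names follow the ones created above.

```python
import pandas as pd

# Toy frame standing in for df_tweets; the index is deliberately not 0..n-1.
df_demo = pd.DataFrame(
    {
        "text": ["$TSLA up", "no ticker here", "$HRI news"],
        "Mentioned_stocks_S&P500": [["TSLA"], [], []],
        "Mentioned_stocks_NYSE": [[], [], ["HRI"]],
    },
    index=[10, 11, 12],
)

# True for rows where at least one of the two columns holds a non-empty list.
has_stock = (df_demo["Mentioned_stocks_S&P500"].map(len) > 0) | (
    df_demo["Mentioned_stocks_NYSE"].map(len) > 0
)

df_demo_shorten = df_demo[has_stock].copy()
print(df_demo_shorten.index.tolist())
```

With this form the result does not depend on whether the dataframe was re-indexed or filtered earlier.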


We see that some rows have been removed.

In [30]:
print("Total number of rows:", len(df_tweets.index))

Total number of rows: 30


In [31]:
print("Total number of rows that mention a stock:", len(df_tweets_shorten.index))

Total number of rows that mention a stock: 15


### Word Counts

We apply `tweet_to_wordcounts` to each tweet in `df_tweets_shorten["text"]`, i.e. to each tweet that mentions at least one stock. Then, we store the results in `wordcounts_all`.
<br><br>
Warning: this step may take a while.

In [32]:
wordcounts_all = [[-1 for i in range(df_tweets_shorten.shape[0])] for j in range(len(keys))]

for i in range(df_tweets_shorten.shape[0]):
    wordcounts = tweet_to_wordcounts(df_tweets_shorten["text"].iloc[i], keys)
    for j in range(len(keys)):
        wordcounts_all[j][i] = wordcounts[j]

Then, we put the resulting word counts for each phrase library into the corresponding new column, e.g. `Word_count_Henry08_pos`.

In [33]:
for j in range(len(keys)):
    df_tweets_shorten["Word_count_" + key_library[j]] = wordcounts_all[j]

### Results

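Applying all of the above operations for word counts and mentioned stock indices, we obtain `df_tweets_shorten` in the following form. Note that word counts are normalized, i.e. divided by the tweet length. As written, `tweet_to_wordcounts` passes `len(tweet)`, which is the number of characters in the tweet rather than the number of words.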

In [35]:
df_tweets_shorten[["text", "Mentioned_stocks_S&P500", "Mentioned_stocks_NYSE", "Word_count_Henry08_pos",
                  "Word_count_Henry08_neg", "Word_count_LM11_pos", "Word_count_LM11_neg"]]

Unnamed: 0,text,Mentioned_stocks_S&P500,Mentioned_stocks_NYSE,Word_count_Henry08_pos,Word_count_Henry08_neg,Word_count_LM11_pos,Word_count_LM11_neg
0,"$BLSP huge volume, closes up 12.5%. Shares st...","[FB, MSFT, BRK.B, AAPL, TSLA, AMZN]",[HRI],0.0,0.0,0.0,0.0
1,"RT @Polixenes13: Ross, please just never stop ...",[TSLA],[],0.0,0.0,0.0,0.014286
2,@kimpaquette I pity the fool that is shorting ...,[TSLA],[],0.0,0.0,0.0,0.0
5,$TSLA passed 190K model 3 VIN registered. Yeah...,[TSLA],[],0.0,0.0,0.0,0.0
7,RT @TeslaCharts: Fraud. Fraud. Fraud. Fraud. \...,[TSLA],[],0.0,0.0,0.0,0.053333
8,FCC to suspend most operations this week due t...,[AAPL],[],0.0,0.0,0.0,0.016529
9,https://t.co/0HOIOe45er\n\n$TSLA news,[TSLA],[],0.0,0.0,0.0,0.0
11,Props to @GerberKawasaki for at least spelling...,[TSLA],[],0.0,0.0,0.0,0.0
12,RT @stockmarkettv: Tesla Production Numbers So...,[TSLA],[],0.0,0.0,0.0,0.0
15,"Bottom Filled on $WDBG Closed up 23% , Ready t...","[FB, MSFT, BRK.B, AAPL, TSLA, AMZN]",[HRI],0.003322,0.0,0.003322,0.0


See the Current Issues section at the top of this notebook for my remarks on stock indices.
<br><br>
Finally, we save `df_tweets_shorten` to a new csv file. For the real data, this file is `tweets_data_features_added.csv`.

In [None]:
# Run this for real data
#df_tweets_shorten.to_csv(localpath + "/data/Tweets_Preprocessed/tweets_data_features_added.csv", index=False)

# Run this during the trial run
df_tweets_shorten.to_csv(localpath + "/data/Tweets_Preprocessed/df_tsla_aapl_features_added.csv", index=False)

<a id='the_destination'></a>
### Helper Functions

For brevity, we write down all the necessary but lengthy functions in this section.

In [4]:
def get_keywords(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains keywords separated by space.
                
    Output: 
    keywords_lemm -> The list of strings, each of which is a lemmatized word from the txt file.
    """
    # Load the file into a string, closing the file afterwards
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define keywords
    keywords = []
    
    # This is to keep track of the word we are reading as we traverse text.
    this_word = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == " ":    # When running into " ", we have finished reading a word. 
            keywords.append(this_word)
            this_word = ""
        elif text[i:] == "\n":   # This may occur at the end of the string.
            break
        else:     # With an additional letter, just add it to the current word.
            this_word = this_word + text[i].lower()
    
    # If the string does not end in " ", we will need to append the last word to keywords.
    if this_word != "":
        keywords.append(this_word)
    
    # Lemmatize keywords. See the function below.
    keywords_lemm_rept = key_lemmatize(keywords)
    
    # keywords_lemm_rept likely contains repeated elements due to words that stem from the same root. Eliminate them.
    keywords_lemm = list(set(keywords_lemm_rept))
    
    return keywords_lemm

In [5]:
def get_keybigrams(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains key bigrams separated by "\n". 
    
    This function should work for any n-grams, given that the text file is written in the same format.
                
    Output: 
    keybigrams_lemm -> The list of strings, each of which is a lemmatized bigram from the csv file.
    """
    # Load the file into a string, closing the file afterwards
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define keywords
    keybigrams = []
    
    # This is to keep track of the word we are reading as we traverse text.
    this_bigram = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == "\n":    # When running into "\n", we have finished reading a bigram.
            keybigrams.append(this_bigram)
            this_bigram = ""
        else:     # With an additional letter or space, just add it to the current bigram.
            this_bigram = this_bigram + text[i].lower()
    
    # If the string does not end in "\n", we will need to append the last bigram to keybigrams.
    if this_bigram != "":
        keybigrams.append(this_bigram)
    
    # Lemmatize keybigrams. See the function below.
    keybigrams_lemm_rept = key_lemmatize(keybigrams)
    
    # keybigrams_lemm_rept likely contains repeated elements. Eliminate them.
    keybigrams_lemm = list(set(keybigrams_lemm_rept))
    
    return keybigrams_lemm 
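
For comparison, the character-by-character parsing in the two loaders above can be approximated with built-in string methods. This is only a sketch under stated assumptions: `parse_keywords` and `parse_ngrams` are hypothetical names, they take the file contents as a string, they split on any whitespace (or line break) rather than on a single `" "` or `"\n"`, and they skip the lemmatization step.

```python
def parse_keywords(text):
    """Lowercase whitespace-separated keywords, dropping duplicates but keeping order."""
    return list(dict.fromkeys(word.lower() for word in text.split()))

def parse_ngrams(text):
    """Lowercase one n-gram per line, skipping blank lines and dropping duplicates."""
    lines = (line.strip().lower() for line in text.splitlines())
    return list(dict.fromkeys(line for line in lines if line))

print(parse_keywords("Gain gains  GAIN\n"))
print(parse_ngrams("able add\nStrong Growth\n\nable add\n"))
```

Unlike `list(set(...))`, `dict.fromkeys` keeps the first-seen order, so the resulting lists are the same from run to run.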

In [6]:
def key_lemmatize(keywords):
    """
    Input:
    keywords -> The list of key words/bigrams extracted from a library file.
    
    Output:
    keywords_lemm -> A list with elements from keywords, each converted into its lemma form using spaCy
    """
    # Define keywords_lemm
    keywords_lemm = []
    
    # Populate keywords_lemm by strings written from the lemma of the word.
    for i in range(len(keywords)):
        this_doc = nlp(keywords[i])
        this_word_lemm = ""
        for token in this_doc:
            this_word_lemm = this_word_lemm + token.lemma_ + " "   # Add space in case there are multiple words
        keywords_lemm.append(this_word_lemm[:-1])      # Discard the final space
    
    return keywords_lemm

In [7]:
def preprocess_tweet(tweet):
    """
    Input:
    tweet -> The raw tweet text in string
    
    Output:
    processed_doc -> A spaCy Doc object built from tweet with stop words (do, is, not, you, etc.) removed.
    """
    # Convert tweet into doc
    doc_raw = nlp(tweet)
    
    # Define and populate the list of all tokens in doc_raw
    token_list = []
    for token in doc_raw:
        token_list.append(token)
    
    # Define and write down the tweet without stop words
    tweet_cleaned = ""
    for token in token_list:
        if not token.is_stop:
            tweet_cleaned = tweet_cleaned + token.text + " "
    
    # Finally, convert to doc once again
    return nlp(tweet_cleaned)

In [8]:
def tweet_to_wordlocs(tweet, keys, remove_stop=True):
    """
    Input:
    tweet -> The raw tweet text in string
    keys -> The list of lists of key words/bigrams. For example, keys = [henry08_pos, henry08_neg, ..., newslib]
            Each key word/bigram is assumed to contain only English letters and spaces.
    remove_stop -> If True, remove stop words (do, is, not, you, etc) from tweet as a preprocessing step.
                   This option is recommended if the user would like to use the bigram lists by Hagenau et al (2013)
                   because they come with stop words removed, e.g. "able add" assumes that "to" has been removed.
    
    Output:
    wordlocs -> A list whose elements are of the form [keyloc, start, end], such that for the key word/bigram 
    from the i-th library, keys[i], located in tweet from tweet[j] to tweet[j+k-1] inclusive, we have 
    keyloc = i, start = j and end = j+k. The elements of wordlocs are sorted based on the value of start.
    """
    # Define the case-insensitive phrase matcher.
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
    
    # Convert tweet into a spaCy's doc object with stop words removed if called for.
    if remove_stop:
        tweet_text = preprocess_tweet(tweet)
    else:
        tweet_text = nlp(tweet)
    
    # Tokenize the key phrases and introduce them to the phrase matcher model.
    for i in range(len(keys)):
        phrases = [nlp(phrase) for phrase in keys[i]]
        matcher.add(str(i), phrases)
    
    # Find the matches
    matches = matcher(tweet_text)
    
    # Define and populate wordlocs
    wordlocs = []
    for i in range(len(matches)):
        keyloc_shifted, start, end = matches[i]
        keyloc = int(nlp.vocab.strings[keyloc_shifted])
        wordlocs.append([keyloc, start, end])
    
    return wordlocs
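
Since `tweet_to_wordlocs` builds a new matcher and re-tokenizes every key phrase on each call, one possible way to shorten the slow word-count step is to build the matcher once and reuse it for all tweets. The sketch below illustrates the idea; `build_matcher` and `count_matches` are hypothetical names, and a blank English pipeline is assumed so that the example runs without downloading a model (the notebook itself uses `en_core_web_sm`, and stop-word removal is left out here).

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline (tokenizer only) stands in for `nlp`.
nlp_demo = spacy.blank("en")

def build_matcher(nlp_obj, keys):
    """Create one case-insensitive PhraseMatcher holding every key library."""
    matcher = PhraseMatcher(nlp_obj.vocab, attr="LOWER")
    for i, phrases in enumerate(keys):
        # make_doc only tokenizes, which is all that LOWER matching needs.
        matcher.add(str(i), [nlp_obj.make_doc(p) for p in phrases])
    return matcher

def count_matches(nlp_obj, matcher, tweet, num_keys):
    """Return the raw number of matches per key library for one tweet."""
    counts = [0] * num_keys
    for match_id, start, end in matcher(nlp_obj.make_doc(tweet)):
        counts[int(nlp_obj.vocab.strings[match_id])] += 1
    return counts

demo_keys = [["profit", "strong growth"], ["weak"]]
demo_matcher = build_matcher(nlp_demo, demo_keys)   # built once, reused below
print(count_matches(nlp_demo, demo_matcher, "Profit rose despite weak demand", len(demo_keys)))
```

The matcher is created outside the per-tweet function, so looping over many tweets only pays for tokenizing each tweet.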

In [9]:
def wordlocs_to_wordcounts(wordlocs, tweet_length, num_keys, normalize=True):
    """
    Input:
    wordlocs -> The list of keys and locations found in the tweet, c.f. tweet_to_wordlocs function.
    tweet_length -> The length of tweet in words.
    num_keys -> The number of phrase lists, i.e. len(keys) from tweet_to_wordlocs function.
    normalize -> If True, the word count for each keyword list is normalized by tweet_length.
                 If False, the raw word count will be returned.
    
    Output:
    wordcounts -> A list of length num_keys. Each element is the (normalized) word count corresponding
                  to the number of phrases from one of the phrase lists that appear in the tweet, 
                  as reported in wordlocs.
    """
    # Define and populate wordcounts
    wordcounts = [0 for i in range(num_keys)]
    for j in range(len(wordlocs)):
        wordcounts[wordlocs[j][0]] += 1
    
    # Perform normalization as needed
    if normalize:
        return [wordcounts[i]/tweet_length for i in range(num_keys)]
    
    # In the case where normalization is not called for
    return wordcounts

In [10]:
def tweet_to_wordcounts(tweet, keys, remove_stop=True, normalize=True):
    """
    This function performs tweet_to_wordlocs, then plugs everything into wordlocs_to_wordcounts.
    """
    wordlocs = tweet_to_wordlocs(tweet, keys, remove_stop)
    # Note: len(tweet) is the number of characters in the raw string, not the number of words.
    wordcounts = wordlocs_to_wordcounts(wordlocs, len(tweet), len(keys), normalize=normalize)
    
    return wordcounts
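
To make the effect of the length used for normalization concrete, here is a small sketch that divides raw counts either by the number of characters or by the number of whitespace-separated words. The function `normalize_counts` and its `unit` parameter are hypothetical and are not used elsewhere in this notebook; the guard for empty tweets is likewise an added assumption.

```python
def normalize_counts(raw_counts, tweet, unit="chars"):
    """Divide raw counts by the tweet length in characters or in whitespace-separated words."""
    length = len(tweet) if unit == "chars" else len(tweet.split())
    if length == 0:  # avoid dividing by zero for an empty tweet
        return [0.0 for _ in raw_counts]
    return [count / length for count in raw_counts]

demo_tweet = "Fraud. Fraud. Fraud."
print(normalize_counts([0, 3], demo_tweet, unit="chars"))
print(normalize_counts([0, 3], demo_tweet, unit="words"))
```

The two choices can differ by roughly an order of magnitude, so whichever one is used should be stated alongside the features.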

In [11]:
def get_stock_list(tweet, stock_indices, company_names=[]):
    """
    Input:
    tweet -> The raw tweet text in string
    stock_indices -> The list of stock indices in string
    company_names -> The list of company names corresponding to stock_indices
                     If not provided, company names will not be searched for in tweet.
    
    Output:
    stock_list -> A list of stock indices in stock_indices that are mentioned in tweet,
                  either by indices or by company names.
    """
    # To make this case-insensitive, make tweet all lowercase.
    tweet_processed = tweet.lower()
    
    # Initialize stock_list as an empty list.
    stock_list = []
    
    # For each stock index, make it lowercase then find if it appears in tweet_processed.
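    # To avoid false positives (in finding a mention), we only consider indices followed by " "
    # and preceded by "$" or "#". As a consequence, a symbol at the very end of the tweet or
    # directly followed by punctuation is not detected.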
    for i in range(len(stock_indices)):
        loc_dollar = tweet_processed.find("$" + stock_indices[i].lower() + " ")
        loc_hashtag = tweet_processed.find("#" + stock_indices[i].lower() + " ")
        if max(loc_dollar, loc_hashtag) >= 0:
            stock_list.append(stock_indices[i])
    
    # For each company name, if available, make it lowercase then find if it appears in tweet_processed.
    for i in range(len(company_names)):
        loc = tweet_processed.find(" " + company_names[i].lower() + " ")
        if loc >= 0:
            stock_list.append(stock_indices[i])
    
    return list(set(stock_list))