# Text to Features


### Tl;dr for Frequent Users

- The only lines to customize for each run are `localpath` and `company_name`, which are set [here](#dest3).
- Run the [Libraries](#dest4) section, followed by [Helper Functions](#the_destination), then [Setup](#dest2) and every section below it.


### Overview

In this notebook, we load a table of Tweets that mention a particular publicly traded company, stored as `df_[companyname].csv`. For the text of each Tweet, we count how many entries from each of several word libraries appear, then normalize the counts. The word libraries include:
- `Henry08_poswords.txt` and `Henry08_negwords.txt`, containing positive and negative words, respectively, from Henry (2008).
- `LM11_pos_words.txt` and `LM11_neg_words.txt`, containing words related to positive and negative sentiments, respectively, from Loughran and McDonald (2011).
- `ML_positive_bigram.csv` and `ML_negative_bigram.csv`, containing positive and negative bigrams, respectively, from Hagenau et al. (2013).
- `news_library.txt`, containing names of mainstream business news agencies.
- Vader sentiment scores (not a word library), computed by a function imported from the external file `vader_tweet_sentiment.py`.

The word counts are added as new columns to the dataframe, and the final dataframe is saved to a new file: `df_[companyname]_features_added.csv`. That file is then read by the notebook that performs our model fit.

This notebook can easily be modified to process a different tweet table file.

The main purpose of this notebook is to walk through the data preprocessing steps in a pedagogical fashion. The actual preprocessing work is executed in the notebook `Text_to_Features_Many_Stocks.ipynb`, which more conveniently loops the code of this notebook over all the companies.

<a id='dest4'></a>
### Libraries

We use `pandas` for dataframes and `spaCy` for linguistic operations.

In [1]:
import pandas as pd
import spacy

We use `spaCy`'s `en_core_web_sm` model as the underlying English language processing model. Throughout this notebook, we denote the model by `nlp`.

In [2]:
nlp = spacy.load('en_core_web_sm')

<a id='dest2'></a>
### Setup

For the code in this section to work, first run the cells in the [Helper Functions](#the_destination) section. The docstrings of the helper functions also provide additional details about the methods employed.
<br><br>
Define the local path of your repository folder.
<a id='dest3'></a>

In [8]:
# Customize based on your machine
localpath = "/Users/josht/Documents/GitHub/erdos_twitter_project"

Choose the company, then load its file into the dataframe `df_tweets`.

In [9]:
# Customize based on which stock to work on
company_name = "Starbucks"

In [10]:
df_tweets = pd.read_csv(localpath + "/data/df_" + company_name + ".csv")
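
The paths in this notebook are assembled by string concatenation. As a minimal sketch of an alternative, the same path can be built with `pathlib`. The helper name `tweet_csv_path` is hypothetical and not part of the notebook, and `PurePosixPath` is used only so that the printed result looks the same on every machine.

```python
from pathlib import PurePosixPath


def tweet_csv_path(localpath: str, company_name: str) -> PurePosixPath:
    """Hypothetical helper mirroring localpath + "/data/df_" + company_name + ".csv"."""
    return PurePosixPath(localpath) / "data" / f"df_{company_name}.csv"


path = tweet_csv_path("/Users/josht/Documents/GitHub/erdos_twitter_project", "Starbucks")
print(path)
```

In the notebook itself one would presumably use `pathlib.Path` instead, which `pd.read_csv` accepts directly.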

The tweet contents appear as strings in the `text` column.

In [11]:
df_tweets.head(3)

| | created_at | entities_cashtags | entities_hashtags | entities_urls | public_metrics_like_count | public_metrics_quote_count | public_metrics_reply_count | public_metrics_retweet_count | text | entities_mentions | created_at_user | public_metrics_followers_count | public_metrics_following_count | public_metrics_listed_count | public_metrics_tweet_count | media_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-30 19:59:36 | 0 | 0 | 2 | 4 | 0 | 0 | 0 | Campus labor shortage delays opening of the St... | 1 | 2021-07-26 18:14:59 | 21 | 28 | 0 | 35 | 0 |
| 1 | 2021-09-30 19:59:12 | 0 | 0 | 0 | 4 | 0 | 3 | 0 | Yo what are fire Starbucks drinks | 0 | 2020-05-23 16:28:47 | 96 | 116 | 0 | 919 | 0 |
| 2 | 2021-09-30 19:59:07 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | https://t.co/dTZGW5bmsO\n\nhttps://t.co/APZH7I... | 2 | 2010-09-16 05:43:14 | 4151 | 4962 | 56 | 391400 | 0 |


We load the libraries of key phrases and news agency names, then put them in a list called `keys`. Each element of `keys` is the set of entries from the corresponding library file.

In [12]:
keyfiles_words = [localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_poswords.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/Henry08_negwords.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/LM11_pos_words.txt",
                  localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/LM11_neg_words.txt"]

In [13]:
keyfiles_bigrams = [localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/ML_positive_bigram.csv",
                   localpath + "/Twitter_Sentiment_Analysis/Directional_Feature_Libraries/ML_negative_bigram.csv"]

In [14]:
keyfiles_news = [localpath + "/Twitter_Sentiment_Analysis/Relevance_Feature_Libraries/news_library.txt"]

In [15]:
# All libraries
keys = [get_keywords(keyfile) for keyfile in keyfiles_words] + [get_keybigrams(keyfile) for keyfile in keyfiles_bigrams] + [get_news_agencies(keyfile) for keyfile in keyfiles_news]

# Ignoring bigrams (significantly faster)
#keys = [get_keywords(keyfile) for keyfile in keyfiles_words] + [get_news_agencies(keyfile) for keyfile in keyfiles_news]


To see how many entries each library contains, we compute

In [16]:
[len(keys[i]) for i in range(len(keys))]

[104, 85, 354, 2355, 12130, 13330, 23]

Define the library names in a legible manner for reference.

In [17]:
# Run this for real data
key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg", 
               "Hagenau13_pos", "Hagenau13_neg", "News_agencies"]

# Run this during the trial run
#key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg", "News_agencies"]
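
To make the list of sizes easier to read, the counts can be paired with the library names. The following is a small sketch that uses the sizes printed above rather than the actual library files; `library_sizes` is a name introduced here for illustration.

```python
# Library names in the same order as `keys`
key_library = ["Henry08_pos", "Henry08_neg", "LM11_pos", "LM11_neg",
               "Hagenau13_pos", "Hagenau13_neg", "News_agencies"]

# Sizes copied from the notebook output above (stand-in for [len(k) for k in keys])
sizes = [104, 85, 354, 2355, 12130, 13330, 23]

# Pair each library name with its size
library_sizes = dict(zip(key_library, sizes))
for name, size in library_sizes.items():
    print(f"{name:>15}: {size}")
```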

### Word Counts

We apply `tweet_to_wordcounts` and `tweet_to_bigramcounts` to each tweet in `df_tweets["text"]`. Then, we store the results in `wordcounts_all`.

In [18]:
wordcounts_all = [[-1 for i in range(df_tweets.shape[0])] for j in range(len(keys))]

for i in range(df_tweets.shape[0]):
    
    # Run this if ignoring bigrams
    # wordcounts = tweet_to_wordcounts(df_tweets["text"].iloc[i], keys)
    
    # Run this if including bigrams
    wordcounts = tweet_to_wordcounts(df_tweets["text"].iloc[i], keys[:4] + [keys[-1]])
    bigramcounts = tweet_to_bigramcounts(df_tweets["text"].iloc[i], keys[4:6])
    
    for j in range(len(keys)):
        if j <= 3:
            wordcounts_all[j][i] = wordcounts[j]
        elif j == len(keys) - 1:
            wordcounts_all[-1][i] = wordcounts[-1]
        else:
            wordcounts_all[j][i] = bigramcounts[j - 4]
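
The index bookkeeping in the loop follows from the order of `keys`: four word libraries, two bigram libraries, then the news agencies. Below is a minimal sketch of the same merge for a single tweet, with made-up count values in place of the real helper outputs; `merge_counts` is a hypothetical name.

```python
def merge_counts(wordcounts, bigramcounts):
    """Reorder per-tweet counts to match keys = 4 word libs + 2 bigram libs + news.

    wordcounts   -> counts for keys[:4] + [keys[-1]] (length 5)
    bigramcounts -> counts for keys[4:6] (length 2)
    """
    return wordcounts[:4] + bigramcounts + wordcounts[4:]


# Made-up values standing in for tweet_to_wordcounts / tweet_to_bigramcounts output
wordcounts = [0.01, 0.0, 0.02, 0.0, 0.03]   # Henry08 pos/neg, LM11 pos/neg, news
bigramcounts = [0.04, 0.05]                 # Hagenau13 pos/neg

print(merge_counts(wordcounts, bigramcounts))
```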

Then, we put the resulting word counts for each phrase library into the corresponding new column, e.g. `Word_count_Henry08_pos`.

In [19]:
for j in range(len(keys[:-1])):
    df_tweets["Word_count_" + key_library[j]] = wordcounts_all[j]

Finally, we add the column for the number of news agency names that appear.

In [20]:
df_tweets["News_agencies_names_count"] = wordcounts_all[-1]

### Vader Score

Now, we add extra columns to the dataframe for the Vader scores. This part of the code is imported from `vader_tweet_sentiment.py`.

In [22]:
from vader_tweet_sentiment import vader_tweet_sentiment

In [23]:
df_tweets = vader_tweet_sentiment(df_tweets)

### Results

Applying the word-count and Vader operations above, we modify `df_tweets` to the following form. Note that the word counts are normalized: each count is divided by the length of the raw tweet, which the helper functions compute as `len(tweet)`, i.e. the number of characters rather than the number of words.
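
As implemented in the helper functions, the normalization divides each count by `len(tweet)`, the number of characters in the raw tweet string. Here is a small sketch of that arithmetic on a made-up tweet and a made-up keyword set (neither comes from the actual data or libraries), with whitespace splitting standing in for the spaCy tokenizer.

```python
tweet = "Profits rose despite weak demand"      # made-up example tweet
keywords = {"rose", "weak"}                     # made-up keyword set

tokens = set(tweet.lower().split())             # stand-in for spaCy tokenization
raw_count = len(tokens & keywords)              # number of distinct matches
normalized = raw_count / len(tweet)             # divide by character length

print(raw_count, len(tweet), normalized)
```

By the same arithmetic, a value such as 0.012579 is what two matches in a 159-character tweet would give (2/159 ≈ 0.012579).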

In [24]:
df_tweets[["text", "Word_count_Henry08_pos", "Word_count_Henry08_neg", "Word_count_LM11_pos", "Word_count_LM11_neg", 
           "Word_count_Hagenau13_pos", "Word_count_Hagenau13_neg","News_agencies_names_count", "Compound_vader", 
           "Positive_vader", "Negative_vader", "Neutral_vader"]]

| | text | Word_count_Henry08_pos | Word_count_Henry08_neg | Word_count_LM11_pos | Word_count_LM11_neg | Word_count_Hagenau13_pos | Word_count_Hagenau13_neg | News_agencies_names_count | Compound_vader | Positive_vader | Negative_vader | Neutral_vader |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Campus labor shortage delays opening of the St... | 0.000000 | 0.0 | 0.000000 | 0.012579 | 0.0 | 0.0 | 0.0 | -0.2500 | 0.000 | 0.095 | 0.905 |
| 1 | Yo what are fire Starbucks drinks | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.3400 | 0.000 | 0.324 | 0.676 |
| 2 | https://t.co/dTZGW5bmsO\n\nhttps://t.co/APZH7I... | 0.000000 | 0.0 | 0.005348 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.3716 | 0.192 | 0.084 | 0.724 |
| 3 | Craving Auntie Anne's and Starbucks most deff😭... | 0.014286 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.5209 | 0.000 | 0.252 | 0.748 |
| 4 | I love spoken word music. It’s so ridiculous. ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.9002 | 0.421 | 0.103 | 0.476 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36502 | At starbucks getting my morning coffee..almost... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0000 | 0.000 | 0.000 | 1.000 |
| 36503 | Happy Friday! Coffee is the fuel of humans 😹\n... | 0.000000 | 0.0 | 0.006173 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.7840 | 0.246 | 0.060 | 0.694 |
| 36504 | Freedom has always came at a great cost! Remem... | 0.000000 | 0.0 | 0.004505 | 0.004505 | 0.0 | 0.0 | 0.0 | 0.9258 | 0.352 | 0.132 | 0.516 |
| 36505 | Ten gunning for the Starbucks sponsorship, est... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0000 | 0.000 | 0.000 | 1.000 |


Before we save the result to a file, we delete every row whose word counts are zero for all libraries and whose Vader score is completely neutral. The latter means zero positive, negative and compound Vader scores, with a neutral Vader score of 1. Altogether, we remove 20-25% of the rows.

In [31]:
# Columns that must all be zero for a tweet to count as trivial
zero_columns = ["Word_count_Henry08_pos", "Word_count_Henry08_neg", "Word_count_LM11_pos", "Word_count_LM11_neg",
                "Word_count_Hagenau13_pos", "Word_count_Hagenau13_neg", "News_agencies_names_count",
                "Compound_vader", "Positive_vader", "Negative_vader"]

# Trivial tweets: all of the above are zero and the neutral Vader score is exactly 1
is_trivial = (df_tweets[zero_columns] == 0).all(axis=1) & (df_tweets["Neutral_vader"] == 1)
df_trivial_tweets = df_tweets.loc[is_trivial]

In [33]:
df_tweets_shorten = df_tweets.drop(df_trivial_tweets.index).copy()
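
The removal rule can be stated compactly: a row is trivial when every count and every Vader score is zero except the neutral score, which equals 1. Below is a plain-Python sketch of that rule on made-up rows, where dictionaries stand in for dataframe rows and `is_trivial_row` is a hypothetical helper.

```python
ZERO_COLUMNS = [
    "Word_count_Henry08_pos", "Word_count_Henry08_neg",
    "Word_count_LM11_pos", "Word_count_LM11_neg",
    "Word_count_Hagenau13_pos", "Word_count_Hagenau13_neg",
    "News_agencies_names_count",
    "Compound_vader", "Positive_vader", "Negative_vader",
]


def is_trivial_row(row: dict) -> bool:
    """True if all counts/scores are zero and the neutral Vader score is 1."""
    return all(row[col] == 0 for col in ZERO_COLUMNS) and row["Neutral_vader"] == 1


# Made-up rows: the first carries no signal, the second has a nonzero count
blank = dict({col: 0.0 for col in ZERO_COLUMNS}, Neutral_vader=1.0)
informative = dict(blank, Word_count_LM11_pos=0.006173, Neutral_vader=0.694)

rows = [blank, informative]
kept = [row for row in rows if not is_trivial_row(row)]
print(len(kept))
```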

In [34]:
df_tweets_shorten[["text", "Word_count_Henry08_pos", "Word_count_Henry08_neg", "Word_count_LM11_pos", "Word_count_LM11_neg", 
           "Word_count_Hagenau13_pos", "Word_count_Hagenau13_neg","News_agencies_names_count", "Compound_vader", 
           "Positive_vader", "Negative_vader", "Neutral_vader"]]

| | text | Word_count_Henry08_pos | Word_count_Henry08_neg | Word_count_LM11_pos | Word_count_LM11_neg | Word_count_Hagenau13_pos | Word_count_Hagenau13_neg | News_agencies_names_count | Compound_vader | Positive_vader | Negative_vader | Neutral_vader |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Campus labor shortage delays opening of the St... | 0.000000 | 0.0 | 0.000000 | 0.012579 | 0.0 | 0.0 | 0.0 | -0.2500 | 0.000 | 0.095 | 0.905 |
| 1 | Yo what are fire Starbucks drinks | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.3400 | 0.000 | 0.324 | 0.676 |
| 2 | https://t.co/dTZGW5bmsO\n\nhttps://t.co/APZH7I... | 0.000000 | 0.0 | 0.005348 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.3716 | 0.192 | 0.084 | 0.724 |
| 3 | Craving Auntie Anne's and Starbucks most deff😭... | 0.014286 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.5209 | 0.000 | 0.252 | 0.748 |
| 4 | I love spoken word music. It’s so ridiculous. ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.9002 | 0.421 | 0.103 | 0.476 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36500 | Today’s Starbucks drink 👍💫💫🧋🧋 https://t.co/eS2... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | -0.2263 | 0.000 | 0.213 | 0.787 |
| 36501 | Macmillan coffee morning has arrived in Drakes... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.6597 | 0.172 | 0.000 | 0.828 |
| 36503 | Happy Friday! Coffee is the fuel of humans 😹\n... | 0.000000 | 0.0 | 0.006173 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.7840 | 0.246 | 0.060 | 0.694 |
| 36504 | Freedom has always came at a great cost! Remem... | 0.000000 | 0.0 | 0.004505 | 0.004505 | 0.0 | 0.0 | 0.0 | 0.9258 | 0.352 | 0.132 | 0.516 |


Finally, we save the new `df_tweets_shorten` to two new csv files: `df_[companyname]_features_added_text.csv`, which retains the tweet text, and `df_[companyname]_features_added.csv`, which drops it.

In [36]:
# Retaining the text
df_tweets_shorten.to_csv(localpath + "/Data_Preprocessed/df_" + company_name + "_features_added_text.csv", index=False)

In [38]:
# Get rid of the text to save space further
df_tweets_notext = df_tweets_shorten.drop(columns=["text"]).copy()

df_tweets_notext.to_csv(localpath + "/Data_Preprocessed/df_" + company_name + "_features_added.csv", index=False)

Dropping the text reduces the file size to about 50% of the size with text.

<a id='the_destination'></a>
### Helper Functions

For brevity, we write down all the necessary but lengthy functions in this section.

In [3]:
def get_keywords(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains keywords separated by space.
                
    Output: 
    keywords -> The set of strings, each of which is a word from the txt file.
    """
    # Load the file into a string
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define keywords
    keywords = []
    
    # This is to keep track of the word we are reading as we traverse text.
    this_word = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == " ":    # When running into " ", we have finished reading a word. 
            keywords.append(this_word)
            this_word = ""
        elif text[i:] == "\n":   # This may occur at the end of the string.
            break
        else:     # With an additional letter, just add it to the current word.
            this_word = this_word + text[i].lower()
    
    # If the string does not end in " ", we will need to append the last word to keywords.
    if this_word != "":
        keywords.append(this_word)
    
    # Return the result in the set format.
    return set(keywords)
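
For comparison, a more compact loader can rely on `str.split`. This is a sketch under the assumption that the library file is plain whitespace-separated text. Unlike the loop above, it splits on any whitespace (including newlines) and never yields empty strings, so it is not a drop-in replacement in every case. The name `get_keywords_compact` is hypothetical.

```python
import os
import tempfile


def get_keywords_compact(filename: str) -> set:
    """Read a whitespace-separated keyword file into a set of lowercase words."""
    with open(filename, "r") as f:
        return set(f.read().lower().split())


# Demonstration on a temporary file with made-up contents
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Growth strong  IMPROVED growth\n")
    tmp_name = tmp.name

keywords = get_keywords_compact(tmp_name)
os.remove(tmp_name)
print(sorted(keywords))
```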

In [4]:
def get_keybigrams(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains key bigrams separated by "\n". 
    
    This function should work for any n-grams, given that the text file is written in the same format.
                
    Output: 
    keybigrams -> The set of strings, each of which is a bigram from the csv file.
    """
    # Load the file into a string
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define keywords
    keybigrams = []
    
    # This is to keep track of the word we are reading as we traverse text.
    this_bigram = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == "\n":    # When running into "\n", we have finished reading a bigram.
            keybigrams.append(this_bigram)
            this_bigram = ""
        else:     # With an additional letter or space, just add it to the current bigram.
            this_bigram = this_bigram + text[i].lower()
    
    # If the string does not end in "\n", we will need to append the last bigram to keybigrams.
    if this_bigram != "":
        keybigrams.append(this_bigram)
    
    # Return the result in the set format.
    return set(keybigrams)

In [5]:
def get_news_agencies(filename: str):
    """
    Input: 
    filename -> The file name and its directory in string, with ending included. 
                The directory must be relative to the location of this notebook.
                This file contains names of news agencies separated by space.
                
    Output: 
    news_agencies -> The set of strings, each of which is a name of news agency from the input file.
    """
    # Load the file into a string
    with open(filename, 'r') as f:
        text = f.read()
    
    # Define news_agencies
    news_agencies = []
    
    # This is to keep track of the news agency name we are reading as we traverse text.
    this_word = ""
    
    # Traversing text
    for i in range(len(text)):
        if text[i] == " ":    # When running into " ", we have finished reading an agency's name. 
            news_agencies.append(this_word)
            this_word = ""
        elif text[i:] == "\n":   # This may occur at the end of the string.
            break
        else:     # With an additional letter, just add it to the current agency's name.
            this_word = this_word + text[i].lower()
    
    # If the string does not end in " ", we will need to append the last agency's name to news_agencies.
    if this_word != "":
        news_agencies.append(this_word)
    
    # Return the result in the set format.
    return set(news_agencies)

In [6]:
def tweet_to_wordcounts(tweet, keys, normalize=True):
    """
    Input:
    tweet -> The raw tweet text in string
    keys -> The list of sets of key words. For example, keys = [henry08_pos, henry08_neg, ..., newslib]
            Each keyword is assumed to contain only English letters. WARNING: bigram libraries must
            be excluded; use tweet_to_bigramcounts for those.
    normalize -> If True, the word count for each keyword list is divided by len(tweet),
                 i.e. the number of characters in the raw tweet.
                 If False, the raw word count will be returned.
    
    Output:
    wordcounts -> A list of length len(keys). Each element is the (normalized) number of distinct
                  tokens in the tweet that belong to the corresponding keyword set.
    """
    # Define a spaCy's doc object for the tweet
    tweet_doc = nlp(tweet.lower())
    
    # Convert the doc object into a set of words
    tweet_words = set([token.text for token in tweet_doc])
    
    # Initialize wordcounts
    wordcounts = []
    
    # For each word library, we count the number of distinct tweet words it contains using the
    # more efficient intersection method. Then, if called for, we normalize the count by the
    # character length of the raw tweet.
    for i in range(len(keys)):
        this_wordcount = len(tweet_words.intersection(keys[i])) 
        if normalize:
            this_wordcount_normalized = this_wordcount / len(tweet)
            wordcounts.append(this_wordcount_normalized)
        else:
            wordcounts.append(this_wordcount)
    
    return wordcounts
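
Because the tweet is converted to a set before intersecting, a keyword that appears several times in a tweet is counted once. The sketch below illustrates this with whitespace splitting as a stand-in for the spaCy tokenizer and with made-up keyword sets; `count_distinct_matches` is a hypothetical name and returns raw, unnormalized counts.

```python
def count_distinct_matches(tweet: str, keys: list) -> list:
    """Raw counts of distinct tweet tokens found in each keyword set."""
    tweet_words = set(tweet.lower().split())   # stand-in for spaCy tokens
    return [len(tweet_words & key_set) for key_set in keys]


positive = {"great", "strong"}     # made-up keyword sets
negative = {"weak", "loss"}

# "great" occurs twice in the tweet but contributes only one match
counts = count_distinct_matches("Great quarter great outlook weak guidance",
                                [positive, negative])
print(counts)
```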

In [7]:
def tweet_to_bigramcounts(tweet, keys, normalize=True):
    """
    Input:
    tweet -> The raw tweet text in string
    keys -> The list of sets of key bigrams. For example, keys = [hagenau13_pos, hagenau13_neg]
            Each key bigram is assumed to contain only English letters and spaces.
    normalize -> If True, the bigram count for each bigram list is divided by len(tweet),
                 i.e. the number of characters in the raw tweet.
                 If False, the raw bigram count will be returned.
    
    Output:
    wordcounts -> A list of length len(keys). Each element is the (normalized) number of distinct
                  bigrams in the tweet that belong to the corresponding bigram set.
    """
    # Define a spaCy's doc object for the tweet
    tweet_doc = nlp(tweet.lower())
        
    # Convert the doc object into a list of words with stop words removed, in accordance with the bigram libraries.
    tweet_words = [token.text for token in tweet_doc if not token.is_stop]
    
    # Define the set of bigrams from the tweet, consisting of pairs of neighboring words.
    tweet_bigrams = set([tweet_words[i] + " " + tweet_words[i+1] for i in range(len(tweet_words) - 1)])
    
    # Initialize wordcounts
    wordcounts = []
    
    # For each bigram library, we count the number of distinct tweet bigrams it contains using the
    # more efficient intersection method. Then, if called for, we normalize the count by the
    # character length of the raw tweet.
    for i in range(len(keys)):
        this_wordcount = len(tweet_bigrams.intersection(keys[i])) 
        if normalize:
            this_wordcount_normalized = this_wordcount / len(tweet)
            wordcounts.append(this_wordcount_normalized)
        else:
            wordcounts.append(this_wordcount)
    
    return wordcounts
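
The bigrams are formed after stop words are removed, so two words separated only by stop words become neighbors. Here is a minimal sketch of that construction, using a tiny made-up stop-word list in place of spaCy's `token.is_stop` and whitespace splitting in place of its tokenizer; `tweet_bigrams` is a hypothetical name.

```python
STOP_WORDS = {"the", "of", "a", "is", "in"}   # made-up list, not spaCy's


def tweet_bigrams(tweet: str) -> set:
    """Set of neighboring-word pairs after dropping stop words."""
    words = [w for w in tweet.lower().split() if w not in STOP_WORDS]
    return {f"{first} {second}" for first, second in zip(words, words[1:])}


# "sales" and "company" become neighbors once "of" and "the" are dropped
bigrams = tweet_bigrams("Sales of the company rose")
print(sorted(bigrams))
```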

After running all the helper function cells, jump back to the [Setup](#dest2) section.