# Convert Tweets To Sentiments

Take the roughly 500 tweets per day that I collected previously and run VADER, a basic rule-based sentiment analyzer, on the tweet texts:
  * collect the positive and negative sentiment scores for every tweet
  * take the mean of the positive and negative scores for every day
  * store the positive and negative means in a new CSV file

In [94]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
from nltk import tokenize
import nltk
import glob

In [95]:
nltk.download('punkt')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [96]:
sid = SentimentIntensityAnalyzer()

Each day's 500 tweets are stored in a separate file.
  * The __first__ for loop goes through these files.

Every file is a CSV with the tweet text and additional metadata.
  * The __second__ for loop goes through the tweets in this DataFrame and splits each one into sentences using NLTK.

Every tweet probably consists of multiple sentences.
  * The __third__ for loop goes through these sentences, extracts the sentiment scores for each one, and sums them up separately for negative and positive polarity.

Afterwards the summed scores are divided by the number of tweets in the file.

In [105]:
files = glob.glob("eco_data/#economy*")

tweets_sentiments_list = []
for file_str in files:
    tweets = pd.read_csv(file_str+"/tweets.csv")
    length = tweets.shape[0]      # number of tweets in this file
    date = tweets.iloc[0].date    # date taken from the first tweet of the file
    neg_counter = 0
    pos_counter = 0
    for tweet in tweets.tweet:
        lines_list = tokenize.sent_tokenize(tweet)
        for sen in lines_list:
            ss = sid.polarity_scores(sen)
            neg_counter += ss["neg"]
            pos_counter += ss["pos"]
            
            # Known bug: the summed scores should also be normalized by the number
            # of sentences in each tweet. The effect is small, but it biases the
            # data because tweets with more sentences get higher scores.
    
    neg_counter = neg_counter / length
    pos_counter = pos_counter / length
    
    tweets_sentiments_list.append({"date":date,
                                   "length":length,
                                   "pos": pos_counter,
                                   "neg": neg_counter})
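
The bias described in the comment inside the loop could be avoided by averaging the sentence scores within each tweet before averaging over tweets. The following is only a sketch of that idea, not the code that produced the numbers in this notebook. The function name `mean_tweet_sentiment` and its parameters are hypothetical; the sentence splitter and the scorer are passed in as arguments so the logic can be tried without the NLTK data. Unlike the cell above, it skips tweets that yield no sentences instead of counting them in the denominator.

```python
from typing import Callable, Dict, Iterable, List


def mean_tweet_sentiment(
    tweets: Iterable[str],
    split_sentences: Callable[[str], List[str]],
    score: Callable[[str], Dict[str, float]],
) -> Dict[str, float]:
    """Average sentiment per tweet, where each tweet's score is the
    mean over its sentences (hypothetical helper, see lead-in)."""
    pos_total = 0.0
    neg_total = 0.0
    n_tweets = 0
    for tweet in tweets:
        sentences = split_sentences(tweet)
        if not sentences:
            continue  # nothing to score, avoid dividing by zero
        scores = [score(sentence) for sentence in sentences]
        # Normalize by the number of sentences so long tweets do not dominate
        pos_total += sum(s["pos"] for s in scores) / len(scores)
        neg_total += sum(s["neg"] for s in scores) / len(scores)
        n_tweets += 1
    if n_tweets == 0:
        return {"pos": 0.0, "neg": 0.0}
    return {"pos": pos_total / n_tweets, "neg": neg_total / n_tweets}
```

With the objects defined in this notebook, the call would presumably look like `mean_tweet_sentiment(tweets.tweet, tokenize.sent_tokenize, sid.polarity_scores)`, since `polarity_scores` returns a dict with the keys `neg`, `neu`, `pos` and `compound`.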

In [106]:
tweet_sent_df = pd.DataFrame(tweets_sentiments_list)

In [107]:
tweet_sent_df.head()

,date,length,pos,neg
0,2016-01-12,514,0.086444,0.04994
1,2020-01-29,518,0.109886,0.060622
2,2017-03-08,500,0.07539,0.071758
3,2018-07-22,334,0.094775,0.058647
4,2016-04-25,450,0.087584,0.049909


Set the date column as the index and sort the rows by it, then store the resulting DataFrame in a CSV file for later processing.

In [108]:
tweet_sent_df = tweet_sent_df.set_index("date")

In [109]:
tweet_sent_df = tweet_sent_df.sort_values(by="date")

In [110]:
tweet_sent_df.to_csv("data/economy_sentiments.csv")
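
Because `pd.read_csv` was called without `parse_dates`, the `date` index written above holds plain strings. ISO-formatted dates still sort chronologically as strings, so the sorting works, but a later processing step may want a real `DatetimeIndex`. A minimal sketch of reading the file back that way follows. It uses an in-memory stand-in with two rows copied from the output above instead of the actual file, and assumes the `date` column is in `YYYY-MM-DD` format.

```python
import io

import pandas as pd

# In-memory stand-in for data/economy_sentiments.csv (two rows from the output above)
csv_text = """date,length,pos,neg
2016-04-25,450,0.087584,0.049909
2016-01-12,514,0.086444,0.04994
"""

# parse_dates turns the strings into timestamps, index_col makes them the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")
df = df.sort_index()  # chronological order on the DatetimeIndex
```

For the real file, `io.StringIO(csv_text)` would be replaced by the path `"data/economy_sentiments.csv"`.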

In [92]:
tweet_sent_df.head()

date,length,pos,neg
2015-01-01,1716,0.153879,0.043424
2015-01-02,1699,0.134617,0.035322
2015-01-03,1719,0.141888,0.061588
2015-01-04,1320,0.131757,0.060844
2015-01-05,1228,0.107428,0.072953
