# Sentiment140: Predicting Stock Movement Using Sentiment Analysis of Twitter Feed with Neural Networks

- baseline: https://www.scirp.org/journal/paperinformation.aspx?paperid=104142#ref9
- sentiment kaggle: https://www.kaggle.com/datasets/kazanova/sentiment140 

# Sentiment140 Data
For training, we use a sentiment-tagged Twitter dataset of 1.6 million tweets from Sentiment140. In the raw file, tweets are tagged 0 for negative and 4 for positive; during preprocessing we remap 4 to 1, so the labels become 0 (negative) and 1 (positive).

It contains the following 6 fields:

1. **target**: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

2. **ids**: the id of the tweet (2087)

3. **date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

4. **flag**: the query (lyx). If there is no query, then this value is NO_QUERY.

5. **user**: the user that tweeted (robotickilldozr)

6. **text**: the text of the tweet (Lyx is cool)

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

### Download the data
Download the dataset from Kaggle and save it to your working directory.

Kaggle: https://www.kaggle.com/datasets/kazanova/sentiment140


### Import packages

In [1]:
# utilities
import string 
import re 
import pickle # not used
import pandas as pd 
import time

In [2]:
# nltk
import nltk 
nltk.download('stopwords')
nltk.download('punkt') 
nltk.download('wordnet') 
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /Users/Dell/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/Dell/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Dell/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

### Load in data

In [4]:
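# mount Google Drive when running in Colab; skip this step when running locally
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ModuleNotFoundError:
    print('google.colab not available; assuming the CSV is in the local working directory')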

In [5]:
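# read in data
# note: the file has no header row; without header=None pandas uses the first tweet
# as the column names, which is why the outputs below report 1,599,999 rows
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin')

# overwrite the column names
df.columns = ['sentiment', 'tweet_id', 'time', 'flag', 'user', 'tweet']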
df.head()


Unnamed: 0,sentiment,tweet_id,time,flag,user,tweet
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
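The Sentiment140 file appears to ship without a header row, so the first tweet ends up as the column names unless pandas is told otherwise. A minimal sketch of reading such a file with explicit column names is shown here. The two sample rows and the names `sample_csv` and `sample_df` are made up for illustration; `header=None` and `names` are standard `pandas.read_csv` parameters.

```python
import io
import pandas as pd

# Two made-up rows in the same six-column layout as the Sentiment140 file
sample_csv = io.StringIO(
    "0,1,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_a,first tweet\n"
    "4,2,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,user_b,second tweet\n"
)
columns = ['sentiment', 'tweet_id', 'time', 'flag', 'user', 'tweet']

# header=None tells pandas there is no header row, so every line is kept as data
sample_df = pd.read_csv(sample_csv, header=None, names=columns)
print(len(sample_df))
print(sample_df[['sentiment', 'tweet']])
```

Applied to the real file, the same two arguments would keep all 1.6 million rows, so the row counts would differ by one from the outputs recorded in this notebook.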


In [6]:
# create a new dataframe with only tweets and sentiment
features = 'tweet'
target = 'sentiment'

df = df[[features, target]]
df.head()

Unnamed: 0,tweet,sentiment
0,is upset that he can't update his Facebook by ...,0
1,@Kenichan I dived many times for the ball. Man...,0
2,my whole body feels itchy and like its on fire,0
3,"@nationwideclass no, it's not behaving at all....",0
4,@Kwesidei not the whole crew,0


### Data preprocessing

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   tweet      1599999 non-null  object
 1   sentiment  1599999 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [8]:
# downcast to smaller integer size to reduce memory: int64 -> int8
df['sentiment'] = pd.to_numeric(df['sentiment'], downcast='integer')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   tweet      1599999 non-null  object
 1   sentiment  1599999 non-null  int8  
dtypes: int8(1), object(1)
memory usage: 13.7+ MB


In [9]:
# find all sentiment values in the dataset 
df.sentiment.unique()

array([0, 4], dtype=int8)

In [10]:
# change 4 to 1 (positive)
df['sentiment'] = df['sentiment'].replace(4, 1)
df['sentiment'].value_counts()

1    800000
0    799999
Name: sentiment, dtype: int64

In [11]:
class TweetCleaner:
  def __init__(self):
    self.stop_words = set(stopwords.words('english'))
    # note: clean_onetweet lowercases the tweet before replace_emojis runs, so keys
    # containing uppercase letters ('XD', ':P', ':O', ':X', ':-D', 'O.o') never match
    self.emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':(': 'sad', 'XD': 'laughing',
          ':-(': 'sad', ':-<': 'sad', ':P': 'stuck-out-tongue', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          ':/': 'confused', ':|': 'neutral-face', "<3": 'love',
          # ":'-)" was listed twice; only the last value ('tears-of-happiness') took effect
          ":'-)": 'tears-of-happiness'}

  def lowercase(self, tweet):
    ''' Each text is converted to lowercase. '''
    return tweet.lower()
  
  def replace_url(self, tweet):
    ''' Links starting with “Http” or “https” or “www” are replaced by “URL” '''
    url_regex = re.compile(r'(http[s]?://|www\.)\S+')
    return url_regex.sub('URL', tweet)

  def replace_emojis(self, tweet):
    '''Replace emojis by using a pre-defined dictionary containing emojis 
      along with their meaning. (e.g.: “:)” to “EMOJIsmile”) '''
    for emoji in self.emojis.keys():
      tweet = tweet.replace(emoji, "EMOJI" + self.emojis[emoji]) 
    return tweet

  def replace_username(self, tweet):
    ''' Replace @Usernames with the word “USER”. (e.g.: “@Kaggle” to “USER”)'''
    user_regex = re.compile(r'@[^\s]+')
    return user_regex.sub('USER', tweet)  

  def remove_nonalpha(self, tweet):
    ''' Replacing characters except Digits and Alphabets with space.'''
    nonalpha_regex = re.compile(r'[^a-zA-Z0-9]')
    return nonalpha_regex.sub(" ", tweet)
  
  def remove_consecutives(self, tweet):
    ''' 3 or more consecutive identical characters (letters, digits, or spaces) are 
        replaced by two. (e.g.: “Heyyyy” to “Heyy”) '''
    sequencePattern   = r"(.)\1\1+"
    seqReplacePattern = r"\1\1"
    return re.sub(sequencePattern, seqReplacePattern, tweet)

  def remove_stop_short_words(self, tweet):
    ''' English words that do not add much meaning to a sentence are removed
        and Words with a length of less than two are eliminated.'''
    words = nltk.word_tokenize(tweet)
    words = [word for word in words if word not in self.stop_words and len(word) >= 2]
    return ' '.join(words)

  def lemmatize(self, tweet):
    ''' Converting each word to its base form. '''
    # create the lemmatizer once per tweet rather than once per word
    lemmatizer = WordNetLemmatizer()
    tweetwords = ''
    for word in tweet.split():
      tweetwords += lemmatizer.lemmatize(word) + ' '
    return tweetwords

  def clean_onetweet(self, tweet):
    ''' cleans one single tweet '''
    cleaned = self.lowercase(tweet)
    cleaned = self.replace_url(cleaned)
    cleaned = self.replace_emojis(cleaned)
    cleaned = self.replace_username(cleaned)
    cleaned = self.remove_nonalpha(cleaned)
    cleaned = self.remove_consecutives(cleaned)
    cleaned = self.remove_stop_short_words(cleaned)
    cleaned = self.lemmatize(cleaned)
    return cleaned

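  def clean_alltweets(self, df):
    ''' cleans all tweets in the dataframe.
        Note: the 'tweets_processed' column is added to the caller's dataframe in place;
        the returned dataframe has the raw 'tweet' column replaced by the cleaned text. '''
    df['tweets_processed'] = df['tweet'].apply(self.clean_onetweet)
    df = df.drop(columns=['tweet'])
    df = df.rename(columns={'tweets_processed': 'tweet'})
    return df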


In [12]:
# testing one tweet
tweet = "@jane and her Dogs have g !$MONEY sitting with https://www.mlq.ai/ai-companies-trading-investing/ babies with feet"
tweetCleaner = TweetCleaner()

print(tweetCleaner.clean_onetweet(tweet))

USER dog money sitting URL baby foot 
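To make the regex steps of `TweetCleaner` easier to follow, the sketch below applies the same patterns one at a time to a made-up tweet, outside the class. The sample text and the `step_*` variable names are illustrative only; the patterns are copied from the class above, and the stop-word and lemmatization steps are left out so that the example needs no NLTK downloads.

```python
import re

sample = "@Kaggle heyyyy check www.example.com sooooo cool!!!"

step_lower = sample.lower()                                       # lowercase everything
step_url = re.sub(r'(http[s]?://|www\.)\S+', 'URL', step_lower)   # links -> URL
step_user = re.sub(r'@[^\s]+', 'USER', step_url)                  # @mentions -> USER
step_alnum = re.sub(r'[^a-zA-Z0-9]', ' ', step_user)              # punctuation -> space
step_runs = re.sub(r'(.)\1\1+', r'\1\1', step_alnum)              # runs of 3+ -> 2

print(step_runs.split())

# The run-collapsing pattern uses '.', so it also shortens runs of digits
collapsed_number = re.sub(r'(.)\1\1+', r'\1\1', "1000")
print(collapsed_number)
```

One consequence worth keeping in mind is the last line: because the pattern matches any character, numbers with three or more repeated digits are altered too, which may matter for tweets that quote prices.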


In [13]:
# method for processing tweets
def process_tweet_dataframe(df):
  tweetCleaner = TweetCleaner()
  
  t = time.time()
  df_processed = tweetCleaner.clean_alltweets(df)
  print('Text Preprocessing complete.')
  print(f'Time Taken: {round(time.time()-t)} seconds')
  return df_processed

In [None]:
df_processed = process_tweet_dataframe(df)

In [None]:
# view the original dataframe: clean_alltweets added the tweets_processed column to it in place
df.head()

Unnamed: 0,tweet,sentiment,tweets_processed
0,is upset that he can't update his Facebook by ...,0,upset update facebook texting might cry result...
1,@Kenichan I dived many times for the ball. Man...,0,USER dived many time ball managed save 50 rest...
2,my whole body feels itchy and like its on fire,0,whole body feel itchy like fire
3,"@nationwideclass no, it's not behaving at all....",0,USER behaving mad see
4,@Kwesidei not the whole crew,0,USER whole crew


In [None]:
df_processed.head()

Unnamed: 0,sentiment,tweet
0,0,upset update facebook texting might cry result...
1,0,USER dived many time ball managed save 50 rest...
2,0,whole body feel itchy like fire
3,0,USER behaving mad see
4,0,USER whole crew


### Save processed tweets to csv

In [14]:
# save the DataFrame to a CSV file
df_processed.to_csv('processed_sentiment140_tweets.csv', index=False)

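# To reload the preprocessed data instead, comment out the line above and uncomment the lines below.
# Tweets that became empty after cleaning are read back as NaN, so they are dropped.
# df_processed = pd.read_csv('processed_sentiment140_tweets.csv')
# df_processed.dropna(inplace=True)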

### Splitting data into train and test
We perform a random split to divide the dataset into a training set and a testing set. 5 percent of the Sentiment140 data is held out to test the trained models, so the training set contains about 1.52 million tweets and the testing set about 80,000 tweets.

In [15]:
df_processed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1599538 entries, 0 to 1599998
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1599538 non-null  int64 
 1   tweet      1599538 non-null  object
dtypes: int64(1), object(1)
memory usage: 36.6+ MB


In [16]:
df = df_processed
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(df["tweet"], df["sentiment"], test_size=0.05, random_state=42)
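As a quick check on the split sizes, the sketch below runs the same `train_test_split` call on 100 placeholder samples. The `toy_*` names and the placeholder data are assumptions made for illustration; only the `test_size=0.05` and `random_state=42` arguments come from the cell above.

```python
from sklearn.model_selection import train_test_split

# 100 placeholder samples standing in for the tweets and their labels
toy_X = list(range(100))
toy_y = [i % 2 for i in toy_X]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.05, random_state=42)

# 5 percent of the rows go to the test set, the rest to the training set
print(len(toy_X_train), len(toy_X_test))
```

With the 1,599,538 rows reported by `df_processed.info()`, the same 5 percent corresponds to the 79,977 test tweets that appear as the support in the classification reports.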

### Vectorize the dataframe

In [17]:
# vectorize the text data using TF-IDF
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
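To illustrate what `max_features` and `ngram_range` do, here is a small sketch on three made-up documents. The corpus, the `toy_vectorizer` name, and the reduced settings (`max_features=5`, `ngram_range=(1, 2)`) are assumptions chosen so the result is easy to inspect; they are not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good day", "bad day", "good good movie"]

# Unigrams and bigrams give 8 candidate terms; max_features keeps the 5 most frequent
toy_vectorizer = TfidfVectorizer(max_features=5, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

print(toy_matrix.shape)                    # one row per document, one column per kept term
print(sorted(toy_vectorizer.vocabulary_))  # the terms that survived the cut

# New text is mapped onto the vocabulary learned from the training documents;
# words never seen during fit (here "terrible") are simply ignored
unseen = toy_vectorizer.transform(["terrible day"])
print(unseen.shape)
```

The last step mirrors how `tfidf.transform(X_test)` is used above: the test tweets are projected onto the vocabulary fitted on the training tweets, so no information from the test set leaks into the features.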

### Train the model on SVM

Five models were considered: Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree (DT), Boosted Tree (BT), and Random Forests (RF). The **best reported performance was SVM** (0.83 accuracy, 0.83 F1 score, 0.83 precision, 0.83 recall). Only SVM, LR, and DT are trained in this notebook.

The text data is vectorized with TF-IDF (term frequency-inverse document frequency) through the TfidfVectorizer class from scikit-learn. This converts the text into a numerical feature matrix that can be used to train the SVM model.

The SVM model is trained using the SVC class from scikit-learn with a linear kernel and a regularization parameter C of 1.0.

In [None]:
# train the SVM model
svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_tfidf, y_train)
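The scikit-learn documentation notes that the fit time of `SVC` scales at least quadratically with the number of samples and suggests `LinearSVC` or `SGDClassifier` for large datasets. A hedged sketch of the `LinearSVC` alternative on a made-up four-tweet corpus follows. The toy data and the `toy_*` and `linear_svm` names are illustrative assumptions, and `LinearSVC` uses a different solver and loss than `SVC(kernel='linear')`, so its results are not guaranteed to match.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up, already-cleaned tweets: 1 = positive, 0 = negative
toy_tweets = ["love great day", "great happy love", "sad awful day", "awful hate sad"]
toy_labels = [1, 1, 0, 0]

toy_tfidf = TfidfVectorizer()
toy_features = toy_tfidf.fit_transform(toy_tweets)

# Linear SVM trained with liblinear; C plays the same regularization role as in SVC
linear_svm = LinearSVC(C=1.0, random_state=42)
linear_svm.fit(toy_features, toy_labels)

toy_predictions = linear_svm.predict(toy_tfidf.transform(["love happy", "hate awful"]))
print(toy_predictions)
```

In the notebook itself this would amount to swapping `SVC(kernel='linear', ...)` for `LinearSVC(...)` and fitting on `X_train_tfidf`; whether that reproduces the accuracy quoted above is untested here.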

### Evaluate the model

In [None]:
# evaluate the SVM model on the testing set
y_pred_svm = svm.predict(X_test_tfidf)
print(classification_report(y_test, y_pred_svm))

### Train the model using LR

In [19]:
lr_model = LogisticRegression(C=0.1,max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

#### Evaluate the LR Model

In [20]:
# evaluate the LR model on the testing set
y_pred_lr = lr_model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

           0       0.79      0.75      0.77     39996
           1       0.76      0.80      0.78     39981

    accuracy                           0.78     79977
   macro avg       0.78      0.78      0.78     79977
weighted avg       0.78      0.78      0.78     79977



In [21]:
# accuracy as a percentage: the share of test predictions that match the true labels
lr_accuracy = accuracy_score(y_test, y_pred_lr) * 100
print("LR Accuracy :", lr_accuracy)

LR Accuracy : 77.53479125248509


### Train the model on Decision Tree 

In [18]:
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=1,max_depth=100)
dt_model.fit(X_train_tfidf,y_train)

#### Evaluate DT model

In [19]:
# evaluate the DT model on the testing set
y_pred_dt = dt_model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred_dt))

              precision    recall  f1-score   support

           0       0.69      0.71      0.70     39996
           1       0.70      0.69      0.70     39981

    accuracy                           0.70     79977
   macro avg       0.70      0.70      0.70     79977
weighted avg       0.70      0.70      0.70     79977



In [20]:
# accuracy as a percentage: the share of test predictions that match the true labels
dt_accuracy = accuracy_score(y_test, y_pred_dt) * 100
print("DT Accuracy :", dt_accuracy)

DT Accuracy : 69.81007039523863


### Get old tweets from Jan-Dec 2016

Twitter seems to have updated its policy, so we use an alternate approach: scraping the tweets with Scweet. Start by cloning the GitHub repository.

In [17]:
#!git clone https://github.com/Altimis/Scweet.git

Run the commands below from the Scweet/Scweet directory in a terminal.

In [18]:
# !pip install -r Scweet/requirements.txt
# !pip3 install -U selenium==4.2.0
# !python scweet.py --words "AAPL" --until 2016-12-31 --since 2016-01-01 --limit 100 --interval 1 --display_type Latest --lang="en" --headless False

In [22]:
def get_old_tweets(scraped_tweets_path):
    # read the scraped tweets and timestamps from Scweet
    old_tweets_df = pd.read_csv(scraped_tweets_path)
    # rename columns to match the training set
    old_tweets_df.rename(columns={'Embedded_text': 'tweet', 'Timestamp': 'time'}, inplace=True)
    # copy the selected columns so the in-place edits below do not act on a view
    old_tweets_df = old_tweets_df[['tweet', 'time']].copy()
    # drop rows with missing values
    old_tweets_df.dropna(inplace=True)
    # re-format time to yyyy-mm-dd
    old_tweets_df['time'] = pd.to_datetime(old_tweets_df['time'], format='%Y-%m-%d', errors='coerce')
    old_tweets_df['time'] = old_tweets_df['time'].dt.strftime('%Y-%m-%d')
    # save the csv file
    old_tweets_df.to_csv('old_tweets.csv')
    # pre-process the old tweets dataframe and drop duplicates
    old_tweets_preprocessed = process_tweet_dataframe(old_tweets_df)
    old_tweets_preprocessed.drop_duplicates(inplace=True)
    # save the pre-processed old tweets
    old_tweets_preprocessed.to_csv('old_tweets_preprocessed.csv')
    return old_tweets_preprocessed

In [23]:
old_tweets_preprocessed_df = get_old_tweets('Scweet/Scweet/outputs/AAPL_2016-01-01_2016-12-31.csv')

Text Preprocessing complete.
Time Taken: 0 seconds


Load and check the old_tweets_df

In [24]:
old_tweets_preprocessed_df = pd.read_csv('old_tweets_preprocessed.csv')
old_tweets_preprocessed_df.drop(columns=['Unnamed: 0'],inplace=True)

In [25]:
old_tweets_preprocessed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   time    294 non-null    object
 1   tweet   294 non-null    object
dtypes: object(2)
memory usage: 4.7+ KB


In [26]:
display(old_tweets_preprocessed_df.head())

Unnamed: 0,time,tweet
0,2016-01-01,replying USER
1,2016-01-01,best apple inc headline 2015 apple nasdaq aapl...
2,2016-01-01,aapl flat month dividend sold 75 profit read s...
3,2016-01-01,next apple 2016 new product rumor roundup URL ...
4,2016-01-01,USER hope follower engaged


#### Predict the sentiment using trained LR model on scraped old tweets

In [27]:
X_old = old_tweets_preprocessed_df.tweet
X_old_tfidf = tfidf.transform(X_old)
y_pred_old = lr_model.predict(X_old_tfidf)
old_tweets_preprocessed_df['sentiment'] = y_pred_old

In [28]:
old_tweets_preprocessed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   time       294 non-null    object
 1   tweet      294 non-null    object
 2   sentiment  294 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 7.0+ KB


In [29]:
display(old_tweets_preprocessed_df.head())

Unnamed: 0,time,tweet,sentiment
0,2016-01-01,replying USER,1
1,2016-01-01,best apple inc headline 2015 apple nasdaq aapl...,1
2,2016-01-01,aapl flat month dividend sold 75 profit read s...,0
3,2016-01-01,next apple 2016 new product rumor roundup URL ...,1
4,2016-01-01,USER hope follower engaged,1


#### Group by date and average the sentiment values

In [30]:
old_tweets_avg = old_tweets_preprocessed_df.groupby(['time']).aggregate(
    {'sentiment': 'mean'})
display(old_tweets_avg.head())

Unnamed: 0_level_0,sentiment
time,Unnamed: 1_level_1
2016-01-01,0.670455
2016-01-02,0.746835
2016-01-03,0.666667
2016-01-04,0.0
2016-01-05,0.8
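Because the predicted labels are 0 or 1, the daily mean is the share of tweets classified as positive on that day. The sketch below shows the same aggregation on made-up data and also keeps the number of tweets per day, which helps when reading extreme values such as the 0.0 above. The toy dataframe, the `toy_daily` name, and the `n_tweets` column are illustrative assumptions.

```python
import pandas as pd

# Made-up predictions: three tweets on the first day, one on the second
toy_predictions_df = pd.DataFrame({
    'time': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02'],
    'sentiment': [1, 1, 0, 0],
})

# Named aggregation: mean sentiment and tweet count for each date
toy_daily = toy_predictions_df.groupby('time').agg(
    sentiment=('sentiment', 'mean'),
    n_tweets=('sentiment', 'size'),
)
print(toy_daily)
```

A daily average computed from a single tweet carries far less information than one computed from dozens, so keeping the count alongside the mean is one way to flag such days before relating sentiment to price movement.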
