# Exploring and Cleaning Tweets

This notebook explores the scraped tweets in detail and develops functions to clean the data. The first section loads the tweets and prints a sample of them.

In [1]:
import pandas as pd

in_tweet_file = 'zm-tweet-2020-9-11-to-2020-9-13.csv'
tweet_df = pd.read_csv(in_tweet_file)

In [2]:

tweet_df[['date', 'content']].tail()
tweet_df_date_sorted = tweet_df[['date', 'content']].sort_values('date').reset_index()

tweet_df_date_sorted


|  | index | date | content |
| --- | --- | --- | --- |
| 0 | 463 | 2020-09-11 00:02:59+00:00 | https://t.co/9VjKMnpm7n End-of-Day Sort for 20... |
| 1 | 462 | 2020-09-11 00:07:36+00:00 | @cadeinvests So true! Saw that play out recent... |
| 2 | 461 | 2020-09-11 00:12:33+00:00 | 5 Stocks in ETF Witnessing a Spike on the #Nas... |
| 3 | 460 | 2020-09-11 00:15:58+00:00 | 今後は $ZM $CRWD $FB を長期ホールド、 $FSLY をスイングしつつ一定数ホー... |
| 4 | 459 | 2020-09-11 00:16:27+00:00 | $BYND $VIAC $SPY $PLAY were the plays today!\n... |
| ... | ... | ... | ... |
| 459 | 4 | 2020-09-12 23:16:44+00:00 | @marketmusician @BahamaBen9 $WORK is not $ZM -... |
| 460 | 3 | 2020-09-12 23:20:15+00:00 | @marketmusician @BahamaBen9 $WORK is a land an... |
| 461 | 2 | 2020-09-12 23:22:25+00:00 | @marketmusician @BahamaBen9 Finally, the $WORK... |
| 462 | 1 | 2020-09-12 23:22:59+00:00 | @JaxxTx @JonahLupton $SE 10.46% $LVGO 6% $PINS... |


In [3]:
tweet_content_list = []

for i, row in tweet_df_date_sorted.iterrows():
    tweet_content_list.append([row['date'], row['content']])




print(len(tweet_content_list))

464


In [4]:
start_num = 400

for i in tweet_content_list[start_num:start_num+20]:
    print('--tweet--')
    print(i[0])
    print(i[1])
    print('-- ## -- \n')

--tweet--
2020-09-12 13:20:19+00:00
remove $ZM from my list
-- ## -- 

--tweet--
2020-09-12 13:29:57+00:00
$ZM This stock is 20% off its high &amp; found support at its 10ema. The reason software had such a big run is because that is where the growth is. There are not many companies growing like ZM. Who knows if this one will continue to hold up? Watch the key technical levels for clues https://t.co/iYjzsxH5xp
-- ## -- 

--tweet--
2020-09-12 13:51:25+00:00
$ZM Weekly. Inside wk. Big fat range here, so $ZM could just chop sideways for a bit - the range is big, tho, so that makes it interesting to me b/c I can play in a big range. Anyway, will be watching next wk to see if it can go inside wk + up....or trapped sideways... https://t.co/SWiFWyLO1j
-- ## -- 

--tweet--
2020-09-12 14:15:38+00:00
From ATH 

$SPX -7%
$QQQ -11%

$V -8
$BABA $WMT -9
$MA -10
$ATVI $NOW -11
$AMZN $FB $GOOGL $MSFT -12
$PYPL -13
$CRM $EA $PTON -15
$NFLX $ROKU -16
$NVDA -17
$AAPL $AMD $TTD -19
$SHOP $SQ $ZM -20
$AMA

# Steps to Clean Data


### Step 1: Remove all tweets that are not in English

Each scraped tweet has a 'lang' value that holds the language of the tweet. All tweets whose 'lang' is not 'en' will be removed.
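
A minimal sketch of this filter, assuming the DataFrame has a `lang` column as described above. The sample rows and the helper name `keep_english` are made up for illustration; they are not taken from the scraped file.

```python
import pandas as pd


def keep_english(df: pd.DataFrame, lang_col: str = "lang") -> pd.DataFrame:
    """Return only the rows whose language code is 'en'."""
    mask = df[lang_col] == "en"
    return df.loc[mask].reset_index(drop=True)


# Made-up sample rows, only to show the behaviour
sample_df = pd.DataFrame(
    {
        "content": [
            "remove $ZM from my list",
            "今後は $ZM を長期ホールド",
            "$ZM Weekly. Inside wk.",
        ],
        "lang": ["en", "ja", "en"],
    }
)

english_df = keep_english(sample_df)
print(english_df)
```

Resetting the index with `drop=True` avoids carrying the old row numbers along as an extra column.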

### Step 2: Removing emojis

This step is based on a great article about sentiment analysis: https://heartbeat.comet.ml/twitter-sentiment-analysis-part-1-6063442c06f3

The author used the following functions to remove emojis and non-English characters.

I have used her functions in my project to remove emojis.

In [7]:
import re


def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F" # emoticons
                           u"\U0001F300-\U0001F5FF" # symbols & pictographs
                           u"\U0001F680-\U0001F6FF" # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF" # flags (iOS)
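                           u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range: also covers CJK, kana and Hangul text
                           u"\U0001f926-\U0001f937"
                           u"\U00010000-\U0010ffff"  # all characters outside the Basic Multilingual Plane
                           u"\u2640-\u2642"
                           u"\u2600-\u2B55"
                           u"\u200d"  # zero-width joiner
                           u"\u23cf"
                           u"\u23e9"
                           u"\u231a"
                           u"\ufe0f"  # variation selector-16 (emoji presentation)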
                           u"\u3030"
                           "]+", flags=re.UNICODE)
    
    return emoji_pattern.sub(r'', string)




In [8]:
# examples of using the above functions

tweet_sample_w_emojis = '''A Couple Morning Plays With One        🔥 📈 Running Over 60% 📈 🔥'''

print(f'Original tweet with emojis: \t\t\t{tweet_sample_w_emojis}')
print(f'Tweet after using remove_emoji function: \t{remove_emoji(tweet_sample_w_emojis)}')



Original tweet with emojis: 			A Couple Morning Plays With One        🔥 📈 Running Over 60% 📈 🔥
Tweet after using remove_emoji function: 	A Couple Morning Plays With One          Running Over 60%  


### Step 3: Remove Hyperlinks, Twitter marks and styles 

In [9]:
tweet = 'Stats for the day have arrived. 1 new follower and NO unfollowers :) via http://t.co/0s8GQYOeus.'


print('Original Tweet: ')
print(tweet)

# remove the old-style retweet marker "RT" at the start of the tweet
tweet2 = re.sub(r'^RT[\s]+', '', tweet)

# remove hyperlinks; note that `.*` also removes any text that follows the URL on the same line
tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)

# remove only the hash sign (#) and keep the hashtag text,
# because the words in a hashtag carry useful information
tweet2 = re.sub(r'#', '', tweet2)

# remove every digit in the tweet (including digits inside words and tickers)
tweet2 = re.sub(r'[0-9]', '', tweet2)
print('\nAfter removing old style tweet, hyperlinks and # sign')
print(tweet2)

Original Tweet: 
Stats for the day have arrived. 1 new follower and NO unfollowers :) via http://t.co/0s8GQYOeus.

After removing old style tweet, hyperlinks and # sign
Stats for the day have arrived.  new follower and NO unfollowers :) via 
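
The substitutions above could be wrapped in one helper, sketched below. The name `clean_tweet_text` is hypothetical, and the URL pattern is deliberately different from the cell above: it uses `\S+` so that only the link itself is removed and any text after it is kept. That is an assumption about the intended behaviour, not the notebook's code.

```python
import re


def clean_tweet_text(text: str) -> str:
    """Strip the retweet marker, links, hash signs and digits from a tweet."""
    text = re.sub(r"^RT\s+", "", text)        # old-style retweet marker at the start
    text = re.sub(r"https?://\S+", "", text)  # the link only, up to the next whitespace
    text = re.sub(r"#", "", text)             # hash sign only, the hashtag word is kept
    text = re.sub(r"[0-9]", "", text)         # every digit
    return text


sample = "RT Check #ZM earnings http://t.co/abc123 before the open"
print(clean_tweet_text(sample))
```

Removing a link leaves the spaces on both sides of it in place, so the result can contain double spaces; the tokenizer used later splits on whitespace and is not affected by them.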


### Step 4: Remove Stop Words and Punctuation, and Apply Stemming

Stop words and punctuation will be removed from the data set for the SVM model. I believe that keeping stop words and punctuation will be more beneficial for the BERT model. If there is time, I will test this.

In [14]:
import nltk                             
from nltk.corpus import twitter_samples   
# nltk.download('stopwords')
# nltk.download('twitter_samples')

import re                                  
import string                             
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer  

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\jk\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


#Import the english stop words list from NLTK
stopwords_english = stopwords.words('english') 

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

### Regarding Stop Words

- The NLTK stop words list contains some words that could be important in some contexts, such as i, not, between, because, won, and against. I will customize the stop words list for the SVM model, because it looks only at individual words and does not take sentence context into account. For the RoBERTa model, the stop words will be left in the text.
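
One way to customize the list for the SVM model is sketched below. To stay self-contained it uses a short hand-typed excerpt instead of `stopwords.words('english')`, and the set of words to keep (`words_to_keep`) is only an assumption about which words matter for sentiment.

```python
# Short excerpt standing in for stopwords.words('english')
stopwords_excerpt = ["i", "the", "and", "for", "have", "because",
                     "against", "between", "no", "nor", "not"]

# Hypothetical choice of words that may carry sentiment and should stay in the text
words_to_keep = {"not", "no", "nor", "against", "because"}

# Everything in the excerpt except the words we decided to keep
custom_stopwords = [w for w in stopwords_excerpt if w not in words_to_keep]

tokens = ["i", "would", "not", "buy", "the", "stock"]
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)  # ['would', 'not', 'buy', 'stock']
```

With the unmodified list, "not" would be dropped and the example would lose its negation.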


In [17]:
print('Before Tokenizing: ')
print(tweet2)

# instantiate the tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, 
                           strip_handles=True,
                           reduce_len=True)

# tokenize the tweets
tweet_tokens = tokenizer.tokenize(tweet2)

print('\nTokenized string:')
print(tweet_tokens)

Before Tokenizing: 
Stats for the day have arrived.  new follower and NO unfollowers :) via 

Tokenized string:
['stats', 'for', 'the', 'day', 'have', 'arrived', '.', 'new', 'follower', 'and', 'no', 'unfollowers', ':)', 'via']


In [19]:
print('Before removing stop words and punctuation:')
print(tweet_tokens)


tweets_clean = []

for word in tweet_tokens:  # go through every word in the token list
    if (word not in stopwords_english and  # remove stop words
        word not in string.punctuation):  # remove punctuation
        tweets_clean.append(word)

print('\n\nAfter removing stop words and punctuation:')
print(tweets_clean)

Before removing stop words and punctuation:
['stats', 'for', 'the', 'day', 'have', 'arrived', '.', 'new', 'follower', 'and', 'no', 'unfollowers', ':)', 'via']


After removing stop words and punctuation:
['stats', 'day', 'arrived', 'new', 'follower', 'unfollowers', ':)', 'via']


In [20]:
#Import the english stop words list from NLTK
stopwords_english = stopwords.words('english') 

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so


### Stemming

Stemming is the process of converting a word to its most general form, or stem. This helps in reducing the size of our vocabulary.

Consider the words:

- learn
- learning
- learned
- learnt

All of these words derive from the common root learn. However, in some cases the stemming process produces stems that are not correctly spelled words, for example happi and sunni. That is because the stemmer chooses the most common stem for related words. For example, consider the set of words that comprises the different forms of happy:

- happy
- happiness
- happier

We can see that the prefix happi is more commonly used. We cannot choose happ because it is the stem of unrelated words like happen.

NLTK has different modules for stemming and we will be using the PorterStemmer module which uses the Porter Stemming Algorithm. Let's see how we can use it in the cell below.
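
The two word families above can be passed through `PorterStemmer` to see which stems actually come out. This is only a quick check; it prints the results rather than assuming that every form collapses to the same stem, since irregular forms may be left unchanged.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The word families discussed in the text above
word_families = {
    "learn": ["learn", "learning", "learned", "learnt"],
    "happy": ["happy", "happiness", "happier"],
}

stems = {}
for root, words in word_families.items():
    for word in words:
        stems[word] = stemmer.stem(word)
        print(f"{word:>10} -> {stems[word]}")
```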



Please note that stemming will be used for the SVM model only, as I feel that keeping the full word for RoBERTa will yield a more accurate sentiment score.



In [22]:
# Instantiate stemming class
stemmer = PorterStemmer() 

# Create an empty list to store the stems
tweets_stem = [] 

for word in tweets_clean:
    stem_word = stemmer.stem(word)  # stemming word
    tweets_stem.append(stem_word)  # append to the list

print('Words after stemming: ')
print(tweets_stem)

Words after stemming: 
['stat', 'day', 'arriv', 'new', 'follow', 'unfollow', ':)', 'via']
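
The cells of Steps 3 and 4 could be combined into one function for the SVM data set, sketched below. The name `preprocess_for_svm` and the small `demo_stop_words` set are hypothetical; the set stands in for the NLTK stop words list so that no download is needed. Emoji removal from Step 2 is not included here.

```python
import re
import string

from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def preprocess_for_svm(text, stop_words):
    """Clean, tokenize, filter and stem one tweet (stop_words is any collection of words)."""
    # Same substitutions as in Step 3
    text = re.sub(r"^RT[\s]+", "", text)
    text = re.sub(r"https?:\/\/.*[\r\n]*", "", text)
    text = re.sub(r"#", "", text)
    text = re.sub(r"[0-9]", "", text)

    # Same tokenizer settings as in Step 4
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stemmer = PorterStemmer()

    punctuation = set(string.punctuation)
    return [
        stemmer.stem(token)
        for token in tokenizer.tokenize(text)
        if token not in stop_words and token not in punctuation
    ]


# Hand-typed stand-in for the NLTK stop words list (assumption, to avoid a download)
demo_stop_words = {"for", "the", "have", "and", "no"}

tweet = "Stats for the day have arrived. 1 new follower and NO unfollowers :) via http://t.co/0s8GQYOeus."
result = preprocess_for_svm(tweet, demo_stop_words)
print(result)
```

Passing the stop words in as an argument makes it easy to swap in a customized list for the SVM model without changing the function.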
