### Importing libraries

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import snscrape.modules.twitter as sntwitter
from tqdm.notebook import tqdm

### Scraping tweets

We'll scrape tweets for the following stocks:

- Communication Services - Alphabet (NASDAQ:GOOGL)
- Energy - ExxonMobil (NYSE:XOM)
- Financials - JPMorgan Chase (NYSE:JPM)

We'll start by creating a function that pulls the tweets, puts them in a dataframe, and filters them by English language.
<br>This function will be useful since we'll have to do this for all the stocks.

In [2]:
def get_tweets(stock, from_date, to_date, number_likes):
    '''Function to pull tweets and return dataframe filtered by language/likes'''
    
    # Empty list to store the tweets
    tweets_list = []

    # Using TwitterSearchScraper to scrape tweets
    for tweet in tqdm(
        sntwitter.TwitterSearchScraper(
            f'${stock} lang:en min_faves:{number_likes} since:{from_date} until:{to_date}'
        ).get_items(),
        total=50000  # progress bar size
    ):

        # Store the tweets in list
        tweets_list.append(
            [tweet.date, tweet.rawContent, tweet.user.username, tweet.likeCount, tweet.replyCount,
             tweet.retweetCount, tweet.lang]
        )

    # Creating a dataframe from the tweets list
    tweet_df = pd.DataFrame(tweets_list, columns=['datetime', 'text', 'username', 'likeCount', 'replyCount',
                                                  'retweetCount', 'language'])
    
    # Filter the dataframe by language and likes, since some tweets still get through the search filter
    tweet_df = tweet_df[(tweet_df.likeCount >= number_likes) & (tweet_df.language == 'en')].reset_index(drop=True)
    
    # Drop language column
    tweet_df = tweet_df.drop(columns='language')
    
    # Print how many tweets are left after filtering by language/likes
    print(f'Number of {stock} tweets: {tweet_df.shape[0]}')
    
    return tweet_df
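
The language/likes filter at the end of `get_tweets` can be checked offline on a small hand-made frame. The following is a minimal sketch, not part of the original notebook: the helper name `filter_tweets` and the sample rows are made up for illustration, and it mirrors only the filtering and column-drop steps, not the scraping.

```python
import pandas as pd


def filter_tweets(tweet_df, number_likes):
    """Hypothetical helper: keep English tweets with at least `number_likes` likes."""
    mask = (tweet_df["likeCount"] >= number_likes) & (tweet_df["language"] == "en")
    # Reset the index after filtering, then drop the now-constant language column
    return tweet_df[mask].reset_index(drop=True).drop(columns="language")


# Made-up rows for illustration only
sample = pd.DataFrame({
    "text": ["$XOM earnings beat", "$XOM sube hoy", "buy $XOM now!!!"],
    "likeCount": [5, 3, 0],
    "language": ["en", "es", "en"],
})

filtered = filter_tweets(sample, number_likes=1)
print(filtered)
```

Only the first row survives here: the second is not tagged as English and the third has no likes.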

Let's start by pulling tweets using our function. We'll pull the last five years (2018 through 2022), one year per call, and export the result to a CSV file.
<br>We only pull tweets that have at least 1 like, since during initial inspection we noticed that this gets rid of some "spam" tweets, which usually have 0 likes.
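
The year-by-year pull can also be written as a loop. The sketch below is one possible way to do it, assuming two hypothetical helpers (`yearly_windows` and `pull_all_years`) that are not in the original notebook. `fetch` stands for any callable with the same signature as `get_tweets`, so the logic can be tried without touching the network.

```python
import pandas as pd


def yearly_windows(first_year, last_year):
    """Hypothetical helper: one (since, until) pair of date strings per calendar year."""
    return [(f"{year}-1-1", f"{year + 1}-1-1") for year in range(first_year, last_year + 1)]


def pull_all_years(stock, first_year, last_year, number_likes, fetch):
    """Hypothetical helper: call `fetch` once per year and concatenate the results.

    `fetch` is assumed to take (stock, from_date, to_date, number_likes)
    and return a dataframe, like get_tweets does.
    """
    frames = [
        fetch(stock, since, until, number_likes)
        for since, until in yearly_windows(first_year, last_year)
    ]
    return pd.concat(frames, ignore_index=True)


windows = yearly_windows(2018, 2022)
print(windows)
```

With the scraper available, `pull_all_years('GOOGL', 2018, 2022, 1, get_tweets)` would make the same five calls as the cells that follow. Building the list of frames in a loop means each year appears exactly once in the concatenation.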

### GOOGL

In [15]:
googl_2018 = get_tweets('GOOGL', '2018-1-1', '2019-1-1', 1)
googl_2019 = get_tweets('GOOGL', '2019-1-1', '2020-1-1', 1)
googl_2020 = get_tweets('GOOGL', '2020-1-1', '2021-1-1', 1)
googl_2021 = get_tweets('GOOGL', '2021-1-1', '2022-1-1', 1)
googl_2022 = get_tweets('GOOGL', '2022-1-1', '2023-1-1', 1)

Number of GOOGL tweets: 17982
Number of GOOGL tweets: 18935
Number of GOOGL tweets: 24037
Number of GOOGL tweets: 28733
Number of GOOGL tweets: 38355


In [16]:
googl_tweets = pd.concat([googl_2018, googl_2019, googl_2020, googl_2021, googl_2022], ignore_index=True)
googl_tweets.head()

Unnamed: 0,datetime,text,username,likeCount,replyCount,retweetCount
0,2018-12-31 23:04:00+00:00,'Grandmasters had never seen anything like it....,Wexboy_Value,4,1,3
1,2018-12-31 22:29:34+00:00,Netflix topped the FANG trade as the best perf...,CNBCFastMoney,19,4,5
2,2018-12-31 22:27:53+00:00,Slippery and cold!! \n$FB &amp; $GOOGL trade e...,petenajarian,8,2,0
3,2018-12-31 21:48:52+00:00,Webinar. Right Now.\n\nMake 2019 your best tra...,OphirGottlieb,7,0,5
4,2018-12-31 21:48:34+00:00,Measured vs year ago (rather than vs 2018 peak...,tpetruno,3,1,1


In [19]:
googl_tweets.shape

(128042, 6)

In [18]:
googl_tweets.to_csv('../data/raw/googl_tweets_all.csv')
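
A column named `Unnamed: 0` typically appears when a frame is written to CSV with its default index and later read back without `index_col`. The following minimal sketch uses a made-up frame and an in-memory buffer instead of the files above to show the difference `index=False` makes. Whether to keep the index in the file is a design choice the notebook leaves open.

```python
import io

import pandas as pd

# Made-up frame for illustration only
df = pd.DataFrame({"text": ["a", "b"], "likeCount": [1, 2]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread_default.columns))
print(list(reread_clean.columns))
```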

### XOM

In [4]:
xom_2018 = get_tweets('XOM', '2018-1-1', '2019-1-1', 1)
xom_2019 = get_tweets('XOM', '2019-1-1', '2020-1-1', 1)
xom_2020 = get_tweets('XOM', '2020-1-1', '2021-1-1', 1)
xom_2021 = get_tweets('XOM', '2021-1-1', '2022-1-1', 1)
xom_2022 = get_tweets('XOM', '2022-1-1', '2023-1-1', 1)

Number of XOM tweets: 4061
Number of XOM tweets: 4284
Number of XOM tweets: 11257
Number of XOM tweets: 11677
Number of XOM tweets: 20286


In [5]:
xom_tweets = pd.concat([xom_2018, xom_2019, xom_2020, xom_2021, xom_2022], ignore_index=True)
xom_tweets.head()

Unnamed: 0,datetime,text,username,likeCount,replyCount,retweetCount
0,2018-12-31 22:08:24+00:00,IBM is the top Dog of the Dow for 2019 with a ...,bespokeinvest,34,0,12
1,2018-12-31 21:55:22+00:00,$D $JNJ $SNV $CVX $UNH $HD $FCB $CADE $VLO $XO...,TeresaTrades,2,0,0
2,2018-12-31 19:52:17+00:00,#LNG: The Best of 2018. What were the best LNG...,SusanSakmar,6,0,2
3,2018-12-31 17:11:14+00:00,$SPX $TNX $QQQ $SPY $DJIA $AMZN $RUT $VIX $TWT...,SethCL,1,0,1
4,2018-12-31 13:04:12+00:00,$ATLS 0067?! ONLY 31 MILLION O/S! tiny float. ...,aaaamhim,1,0,0


In [13]:
xom_tweets.shape

(51565, 6)

In [6]:
xom_tweets.to_csv('../data/raw/xom_tweets_all.csv')

### JPM

In [7]:
jpm_2018 = get_tweets('JPM', '2018-1-1', '2019-1-1', 1)
jpm_2019 = get_tweets('JPM', '2019-1-1', '2020-1-1', 1)
jpm_2020 = get_tweets('JPM', '2020-1-1', '2021-1-1', 1)
jpm_2021 = get_tweets('JPM', '2021-1-1', '2022-1-1', 1)
jpm_2022 = get_tweets('JPM', '2022-1-1', '2023-1-1', 1)

Unsupported card type on tweet 1041793302440435712: '2586390716:video_direct_message'

Number of JPM tweets: 8532
Number of JPM tweets: 9301
Number of JPM tweets: 15198
Number of JPM tweets: 12441
Number of JPM tweets: 15525


In [8]:
jpm_tweets = pd.concat([jpm_2018, jpm_2019, jpm_2020, jpm_2021, jpm_2022], ignore_index=True)
jpm_tweets.head()

Unnamed: 0,datetime,text,username,likeCount,replyCount,retweetCount
0,2018-12-31 23:45:26+00:00,Market Strategist Who Nailed S&amp;P 500 This ...,DanielMichael26,2,0,0
1,2018-12-31 23:23:00+00:00,$NSPX 0052? ITS BREAKTHROUGH DRUG ON THE NEWS ...,aaaamhim,1,0,2
2,2018-12-31 22:13:06+00:00,$XLF Daily: #XLF broke through 13MA. Consider...,MysteryTrader99,16,1,6
3,2018-12-31 22:07:30+00:00,The Dogs of the Dow finished 2018 with a total...,bespokeinvest,53,4,21
4,2018-12-31 21:51:41+00:00,$FRC $SLB $GE $UTX $FE $T $HUM $V $DIS $JPM $T...,TeresaTrades,1,0,0


In [14]:
jpm_tweets.shape

(60997, 6)

In [9]:
jpm_tweets.to_csv('../data/raw/jpm_tweets_all.csv')