# Scrape Twitter with snscrape

snscrape: https://github.com/JustAnotherArchivist/snscrape

Reference article: https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af

In [1]:
import snscrape.modules.twitter as sntwitter
import pandas as pd

## Tweet attributes
![tweet attributes](https://miro.medium.com/v2/resize:fit:720/format:webp/1*rQSvTSDhAFkoEas3e4G5tw.png)

### Scraped:

- `url`: permalink to the tweet
- `date`: date and time the tweet was created (timezone-aware, UTC)
- `rawContent`: text content of the tweet
- `id`: numeric ID of the tweet
- `user.username`: Twitter username of the author
- `retweetCount`: number of retweets
- `likeCount`: number of likes
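
The attributes listed above can be collected into one flat record per tweet. The following is a minimal sketch, not part of snscrape: `tweet_to_record` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in with made-up values that only mimics the attribute names above so the example runs without network access.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def tweet_to_record(tweet):
    """Flatten a tweet-like object into a plain dict (hypothetical helper)."""
    return {
        "url": tweet.url,
        "datetime": tweet.date,
        "text": tweet.rawContent,
        "tweet_id": tweet.id,
        "username": tweet.user.username,  # nested attribute on the user object
        "retweet_count": tweet.retweetCount,
        "like_count": tweet.likeCount,
    }


# Stand-in object with made-up values; a real item would come from the scraper.
fake_tweet = SimpleNamespace(
    url="https://example.com/status/1",
    date=datetime(2023, 3, 24, 15, 28, 42, tzinfo=timezone.utc),
    rawContent="hello world",
    id=1,
    user=SimpleNamespace(username="example_user"),
    retweetCount=0,
    likeCount=0,
)

record = tweet_to_record(fake_tweet)
print(record["username"], record["tweet_id"])
```

A list of such dicts can be passed directly to `pd.DataFrame`, which takes the column names from the dict keys.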

In [2]:
max_tweets = 10000
tweets_list = []

# query for tweets containing "elon musk"
for tweet in sntwitter.TwitterSearchScraper("elon musk").get_items():
    # stop once enough English tweets have been collected
    if len(tweets_list) >= max_tweets:
        break
    # keep English tweets only
    if tweet.lang != "en":
        continue
    # note: add or remove tweet attributes as needed (refer to the image above)
    tweets_list.append([tweet.url,
                        tweet.date,
                        tweet.rawContent,
                        tweet.id,
                        tweet.user.username,
                        tweet.retweetCount,
                        tweet.likeCount])
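
The "filter, then stop after N matches" pattern in the loop above can also be written with `itertools.islice`. This is a sketch under stated assumptions: `take_english` is a hypothetical helper, and `fake_stream` is a made-up generator standing in for the scraper's `get_items()` output, so nothing here touches the network.

```python
from itertools import islice
from types import SimpleNamespace


def take_english(items, limit):
    """Return up to `limit` items whose `lang` attribute is "en"."""
    # Generator expression: filters lazily, one item at a time.
    english_only = (item for item in items if item.lang == "en")
    # islice stops pulling from the stream once `limit` items are collected.
    return list(islice(english_only, limit))


# Stand-in stream with made-up items: even ids are "en", odd ids are "fr".
fake_stream = (
    SimpleNamespace(id=i, lang="en" if i % 2 == 0 else "fr") for i in range(10)
)

first_three = take_english(fake_stream, 3)
print([item.id for item in first_three])
```

Because both the filter and `islice` are lazy, the stream is not consumed beyond the last item that was needed.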

In [3]:
# create pandas dataframe for scraped tweets
tweets_df = pd.DataFrame(tweets_list, columns=["url", "datetime", "text", "tweet_id", "username", "retweet_count", "like_count"])

tweets_df.head()

|   | url | datetime | text | tweet_id | username | retweet_count | like_count |
|---|-----|----------|------|----------|----------|---------------|------------|
| 0 | https://twitter.com/nickytwoeyes/status/163928... | 2023-03-24 15:28:42+00:00 | @Cryptowizardd77 @crypto_rand @elonmusk @Hobbe... | 1639288136076328960 | nickytwoeyes | 0 | 0 |
| 1 | https://twitter.com/Dr_Bed_Dr/status/163928813... | 2023-03-24 15:28:42+00:00 | @MarkusWoat @elonmusk Hurrah | 1639288133232590852 | Dr_Bed_Dr | 0 | 0 |
| 2 | https://twitter.com/lill63416788/status/163928... | 2023-03-24 15:28:41+00:00 | @cb_doge @elonmusk wow that's so amazing 2look... | 1639288131819302913 | lill63416788 | 0 | 0 |
| 3 | https://twitter.com/starflower1959/status/1639... | 2023-03-24 15:28:41+00:00 | @elonmusk @BillyM2k Hmmm …how about Australia? | 1639288128962715648 | starflower1959 | 0 | 0 |
| 4 | https://twitter.com/DBrubaker13/status/1639288... | 2023-03-24 15:28:40+00:00 | @jayinneveh @williamlegate @elonmusk The only ... | 1639288127079546880 | DBrubaker13 | 0 | 0 |


In [4]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   url            10000 non-null  object             
 1   datetime       10000 non-null  datetime64[ns, UTC]
 2   text           10000 non-null  object             
 3   tweet_id       10000 non-null  int64              
 4   username       10000 non-null  object             
 5   retweet_count  10000 non-null  int64              
 6   like_count     10000 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(3)
memory usage: 547.0+ KB


In [5]:
# convert the "datetime" column from UTC to Singapore time (UTC+8)
tweets_df["datetime"] = tweets_df["datetime"].dt.tz_convert("Asia/Singapore")

tweets_df.head()

|   | url | datetime | text | tweet_id | username | retweet_count | like_count |
|---|-----|----------|------|----------|----------|---------------|------------|
| 0 | https://twitter.com/nickytwoeyes/status/163928... | 2023-03-24 23:28:42+08:00 | @Cryptowizardd77 @crypto_rand @elonmusk @Hobbe... | 1639288136076328960 | nickytwoeyes | 0 | 0 |
| 1 | https://twitter.com/Dr_Bed_Dr/status/163928813... | 2023-03-24 23:28:42+08:00 | @MarkusWoat @elonmusk Hurrah | 1639288133232590852 | Dr_Bed_Dr | 0 | 0 |
| 2 | https://twitter.com/lill63416788/status/163928... | 2023-03-24 23:28:41+08:00 | @cb_doge @elonmusk wow that's so amazing 2look... | 1639288131819302913 | lill63416788 | 0 | 0 |
| 3 | https://twitter.com/starflower1959/status/1639... | 2023-03-24 23:28:41+08:00 | @elonmusk @BillyM2k Hmmm …how about Australia? | 1639288128962715648 | starflower1959 | 0 | 0 |
| 4 | https://twitter.com/DBrubaker13/status/1639288... | 2023-03-24 23:28:40+08:00 | @jayinneveh @williamlegate @elonmusk The only ... | 1639288127079546880 | DBrubaker13 | 0 | 0 |
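
The conversion above changes only how each timestamp is displayed, not the instant it represents. A small sketch with two made-up UTC timestamps (not taken from the scraped data) illustrates this, including a case where the local date rolls over to the next day.

```python
import pandas as pd

# Two made-up UTC timestamps standing in for the scraped "datetime" column.
utc_times = pd.Series(
    pd.to_datetime(["2023-03-24 15:28:42", "2023-03-24 23:59:59"], utc=True)
)

# tz_convert keeps the instant and changes only the local wall-clock time.
sgt_times = utc_times.dt.tz_convert("Asia/Singapore")

# Local hours and days after shifting by +8 hours.
print(sgt_times.dt.hour.tolist())
print(sgt_times.dt.day.tolist())

# Timezone-aware values compare by instant, so every pair is equal.
print((sgt_times == utc_times).all())
```

Note that `tz_convert` requires timezone-aware data; naive timestamps would first need `tz_localize` to declare which timezone they are in.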


In [6]:
# # export dataframe to csv
# tweets_df.to_csv("tweets_raw.csv", index=False)
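
If the export above is enabled, the timestamps are written as text and come back as plain strings when the file is read again. The sketch below shows one way to restore them; it uses a small made-up frame and an in-memory buffer instead of `tweets_raw.csv`, and the names `sample_df` and `restored` are illustrative only.

```python
import io

import pandas as pd

# Small made-up frame standing in for tweets_df.
sample_df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2023-03-24 15:28:42"], utc=True),
        "tweet_id": [1639288136076328960],
    }
)
sample_df["datetime"] = sample_df["datetime"].dt.tz_convert("Asia/Singapore")

# to_csv returns the CSV text when no path is given.
csv_text = sample_df.to_csv(index=False)

# Read the text back; the datetime column is not parsed automatically here.
restored = pd.read_csv(io.StringIO(csv_text))

# Parse to UTC first, then convert back to Singapore time.
restored["datetime"] = pd.to_datetime(restored["datetime"], utc=True).dt.tz_convert(
    "Asia/Singapore"
)
print(restored.dtypes)
```

Parsing with `utc=True` before converting avoids ending up with a fixed-offset timezone instead of the named `Asia/Singapore` zone.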