# Scraping Tweets from a public Twitter account between two dates

The Twitter API can fetch tweets from a public Twitter account, but it returns only a limited number of recent tweets. Another tool, snscrape, can scrape historical tweets, but in my runs it truncated the tweet content when writing the JSON file (this may have been a mistake on my part). Using the two together avoids both limitations: snscrape gathers the URLs of the tweets, and the Twitter API then fetches the actual tweets.

To set up your Twitter developer account and get the API keys, follow [this video](https://www.youtube.com/watch?v=Lu1nskBkPJU&t=897s).

The `config.ini` file contains the API keys. It looks similar to this:
```
[twitter]

api_key = xxxxxxxxxxxxxxxx
api_key_secret = xxxxxxxxxxxxxxxx

access_token = xxxxxxxxxxxxxxxx
access_token_secret = xxxxxxxxxxxxxxxx
```
Next, run snscrape in the terminal:

`snscrape --progress --since 2019-10-01 twitter-search "from:inquirerdotnet until:2021-08-04" > twitter_inquirerdotnet.txt`

Here `--since` sets the start date, `until:` sets the end date, and `from:` sets the Twitter username. The tweet URLs are saved to a text file, one per line, for later use.
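
To make the next steps concrete, here is a minimal sketch of how a tweet id can be read off one line of that text file. It assumes each line is a tweet URL whose last path segment is the id; the helper name `tweet_id_from_url` and the sample URL are made up for illustration.

```python
def tweet_id_from_url(url):
    """Return the last path segment of a tweet URL (assumed to be the tweet id)."""
    # Strip the trailing newline and any trailing slash before splitting
    return url.strip().rstrip("/").rsplit("/", 1)[-1]


# Made-up URL in the assumed format, with a newline as it would appear in the file
sample_line = "https://twitter.com/inquirerdotnet/status/1234567890\n"
sample_id = tweet_id_from_url(sample_line)
```

The pandas code further down does the same thing for the whole file at once with `str.split('/')`.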

In [2]:
import pandas as pd
import tweepy
import configparser
from tqdm import tqdm

# Get Config

config = configparser.ConfigParser()
config.read('config.ini')

api_key = config['twitter']['api_key']
api_key_secret = config['twitter']['api_key_secret']

access_token = config['twitter']['access_token']
access_token_secret = config['twitter']['access_token_secret']

In [51]:
def setup_tweepy():
    auth = tweepy.OAuthHandler(api_key, api_key_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api

In [52]:
# Open the snscrape output
url_df = pd.read_csv('twitter_inquirerdotnet.txt', index_col=None, header=None, names=['links'])

# Extract the tweet id (the last path segment of each URL)
url_df['tweet_id'] = url_df['links'].str.split('/').str[-1]

# Convert Id to list
ids = url_df['tweet_id'].tolist()

# Process the ids in batches of 100 (the most a single lookup call accepts); count the batches, rounding up
total_count = len(ids)
chunks = (total_count-1) // 100 + 1
chunks
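
The batching arithmetic above can be checked on a small scale. The sketch below is only an illustration: `iter_batches` is a hypothetical helper and `sample_ids` is made-up data, not part of the notebook.

```python
def iter_batches(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 250 made-up ids: expect two full batches and one partial batch
sample_ids = [str(n) for n in range(250)]
batches = list(iter_batches(sample_ids))

# Same ceiling-division formula as used for `chunks`
num_chunks = (len(sample_ids) - 1) // 100 + 1
```

Slicing past the end of a list is safe in Python, so the last batch simply comes out shorter.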

In [56]:
# Fetch the tweets along with some other data
def fetch_tw(ids, name="jack"):
    api = setup_tweepy()
    # tweet_mode="extended" returns the full text (up to 280 characters)
    statuses = api.lookup_statuses(ids, tweet_mode="extended")
    # Collect one dict per tweet; DataFrame.append was removed in pandas 2.0
    rows = []
    for status in statuses:
        rows.append({"tweet_id": status.id,
                     "screen_name": status.user.screen_name,
                     "Tweet": status.full_text,
                     "Date": status.created_at,
                     "retweet_count": status.retweet_count,
                     "favorite_count": status.favorite_count})
    tweet_df = pd.DataFrame(rows)
    # Append this batch to the csv; note that every call also writes a header row
    tweet_df.to_csv(f"tweets_{name}.csv", mode="a", index=False)

In [57]:
# Fetch the tweets in batches of 100; tqdm shows the progress
for i in tqdm(range(chunks)):
    batch = ids[i*100:(i+1)*100]
    fetch_tw(batch, name='inquirerdotnet')

100%|██████████| 2075/2075 [1:43:21<00:00,  2.99s/it]


In [4]:
inquirer_df = pd.read_csv('tweets_inquirerdotnet.csv')
inquirer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209478 entries, 0 to 209477
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tweet_id        209478 non-null  object
 1   screen_name     209478 non-null  object
 2   Tweet           209478 non-null  object
 3   Date            209478 non-null  object
 4   retweet_count   209478 non-null  object
 5   favorite_count  209478 non-null  object
dtypes: object(6)
memory usage: 9.6+ MB
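
Every column above is `object`, including the two counts. A likely cause is that `fetch_tw` appends with `mode="a"` and `to_csv` writes the header on every call, so header rows end up mixed in with the data. Below is a minimal sketch of writing the header only once. It uses the standard `csv` module rather than pandas, and `append_rows`, `demo_path`, and the sample values are hypothetical names and data for illustration only.

```python
import csv
import os
import tempfile


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Two appends to a temporary file, standing in for two batches of tweets
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")
fields = ["tweet_id", "retweet_count"]
append_rows(demo_path, [{"tweet_id": 1, "retweet_count": 5}], fields)
append_rows(demo_path, [{"tweet_id": 2, "retweet_count": 7}], fields)

with open(demo_path, newline="", encoding="utf-8") as f:
    demo_rows = list(csv.reader(f))
```

With pandas, the same idea would be to pass `header=not os.path.exists(path)` to `to_csv`, so that only the first batch writes the column names.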
