<a href="https://colab.research.google.com/github/mratanusarkar/twitter-sentiment-analysis/blob/feature%2Ftwitter-api/Notebooks/Twitter_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter API

The aim is to connect to Twitter via its API (v2) and use it to pull tweets based on filters and conditions such as:
- hashtags (#)
- userId or mentions (@)
- keywords (string)

Using the retrieved data, sort and order the tweets based on various parameters such as:
- number of likes
- number of comments
- number of retweets
- engagement or view count

The sorted data can then be used for sentiment analysis.
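
As a minimal sketch of the ordering step, assume each tweet has already been reduced to a dictionary of counts. The field names, the sample values, and the "engagement" formula below are made up for illustration and do not come from any API response:

```python
# Hypothetical tweet records; field names and numbers are illustrative only.
tweets = [
    {"id": 1, "likes": 10, "replies": 2, "retweets": 1},
    {"id": 2, "likes": 3, "replies": 9, "retweets": 4},
    {"id": 3, "likes": 25, "replies": 0, "retweets": 0},
]


def engagement(tweet):
    """Assumed engagement score: plain sum of likes, replies and retweets."""
    return tweet["likes"] + tweet["replies"] + tweet["retweets"]


# Sort by a single metric, highest first.
by_likes = sorted(tweets, key=lambda t: t["likes"], reverse=True)

# Sort by the combined engagement score, highest first.
by_engagement = sorted(tweets, key=engagement, reverse=True)

print([t["id"] for t in by_likes])
print([t["id"] for t in by_engagement])
```

Any other weighting of the counts could be swapped into `engagement` without changing the sorting calls.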

In the future (v2), we can also go in depth into each "tweet thread" and "quote tweet (retweet)" to find links and relations between tweets and run further analyses.

**Note**: please skip to the **final implementation** [here](#final-implementation)<br>
if you don't want to go through the analysis and the experiments with different packages, methods, and use-cases.

# Using "twitter-api v2"
https://developer.twitter.com/en/docs/twitter-api

## Import Packages

In [None]:
import tweepy
import configparser
import pandas as pd
from pprint import pprint

## Input secrets and keys

In [None]:
# secrets and keys
api_key = "<API_KEY>"
api_key_secret = "<API_KEY_SECRET>"
bearer_token = "<BEARER_TOKEN>"
access_token = "<ACCESS_TOKEN>"
access_token_secret = "<ACCESS_TOKEN_SECRET>"
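
Instead of pasting secrets into the notebook, they could be read from a config file with `configparser` (already imported above). This is only a sketch: the `[twitter]` section name, the key names, and the `config.ini` file name mentioned in the comment are assumptions, and the sample is read from a string so that the example is self-contained:

```python
import configparser

# Hypothetical file layout; the section and key names are assumptions.
SAMPLE_CONFIG = """
[twitter]
api_key = <API_KEY>
api_key_secret = <API_KEY_SECRET>
bearer_token = <BEARER_TOKEN>
access_token = <ACCESS_TOKEN>
access_token_secret = <ACCESS_TOKEN_SECRET>
"""

# interpolation=None keeps '%' characters in secrets from being interpreted.
config = configparser.ConfigParser(interpolation=None)
# For a real file, config.read("config.ini") would be used instead.
config.read_string(SAMPLE_CONFIG)

api_key = config["twitter"]["api_key"]
bearer_token = config["twitter"]["bearer_token"]
print(sorted(config["twitter"].keys()))
```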

## Connect to Twitter API

In [None]:
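# set auth handler (OAuth 1.0a user context)
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)

# get the api handler
# note: tweepy.API wraps the v1.1 endpoints; the v2 endpoints are exposed through tweepy.Client
api = tweepy.API(auth)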

## Pull tweets using various methods

In [None]:
# get public tweets from my timeline
limit = 1

tweets_from_public_timeline = api.home_timeline(count=limit)
# pprint(tweets_from_public_timeline[0]._json)
print(tweets_from_public_timeline[0].text)

In [None]:
# get tweets from specific user
user = 'mratanusarkar'
limit = 1

tweets_from_user_timeline = api.user_timeline(screen_name=user, count=limit, tweet_mode='extended')
# pprint(tweets_from_user_timeline[0]._json)
print(tweets_from_user_timeline[0].full_text)

In [None]:
# get tweets using keywords, #hashtags or @mentions

In [None]:
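# get tweets from keywords
# (Tweepy v4 renamed api.search to api.search_tweets; adjust the calls below if you are on v4 or later)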
keywords = "physics"
limit = 1

tweets_from_keywords = api.search(q=keywords, count=limit, tweet_mode='extended')
# pprint(tweets_from_keywords[0]._json)
print(tweets_from_keywords[0].full_text)

In [None]:
# get tweets from #hashtags
keywords = "#physics"
limit = 1

tweets_from_keywords = api.search(q=keywords, count=limit, tweet_mode='extended', include_entities=True)
# pprint(tweets_from_keywords[0]._json)
print(tweets_from_keywords[0].full_text)

In [None]:
# get tweets from @users or @mentions
keywords = "@3blue1brown"
limit = 1

tweets_from_keywords = api.search(q=keywords, count=limit, tweet_mode='extended', include_entities=True)
# pprint(tweets_from_keywords[0]._json)
print(tweets_from_keywords[0].full_text)

In [None]:
# combination of all
query = "#math OR #mathematics AND @3blue1brown"
limit = 1

tweets_from_keywords = api.search(q=query, count=limit, tweet_mode='extended', include_entities=True)
# pprint(tweets_from_keywords[0]._json)
print(tweets_from_keywords[0].full_text)
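
Queries that mix `OR` and `AND` without grouping can be hard to read. The sketch below builds a query with explicit grouping, assuming the search syntax in which a space between terms acts as AND and parentheses group OR-ed terms. `build_query` is a hypothetical helper and not part of tweepy:

```python
def build_query(hashtags=(), mentions=(), keywords=()):
    """Build a search string: terms inside a group are OR-ed, groups are AND-ed."""
    groups = [
        ["#" + h.lstrip("#") for h in hashtags],
        ["@" + m.lstrip("@") for m in mentions],
        list(keywords),
    ]
    parts = []
    for terms in groups:
        if not terms:
            continue  # skip empty groups entirely
        joined = " OR ".join(terms)
        # Parenthesize only when a group has more than one term.
        parts.append(f"({joined})" if len(terms) > 1 else joined)
    return " ".join(parts)


query = build_query(hashtags=["math", "mathematics"], mentions=["3blue1brown"])
print(query)
```

The resulting string can be passed as the `q` argument in the same way as the hand-written queries.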

In [None]:
# use cursor to avoid the API cap
query = "#math OR #mathematics AND @3blue1brown OR physics"
limit = 300

tweets = tweepy.Cursor(api.search, q=query, count=100, tweet_mode='extended').items(limit)
print(tweets)

In [None]:
# print(list(tweets)[0].full_text)
# the cursor gives us an iterator, so it can be consumed only once!
# so it is better to convert it to a DataFrame
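
The single-pass behaviour is a general property of Python iterators rather than something specific to the cursor. A minimal sketch, using a made-up generator (`fake_cursor`) in place of the real paginated iterator:

```python
def fake_cursor(n):
    """Stand-in for a paginated API iterator; yields n placeholder items."""
    for i in range(n):
        yield f"tweet {i}"


items = fake_cursor(3)
first_pass = list(items)   # consumes the iterator
second_pass = list(items)  # nothing is left to read

print(first_pass)
print(second_pass)
```

Storing the first pass in a list (or a DataFrame) keeps the data available for repeated use.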

In [None]:
# print all tweets
# for i, tweet in enumerate(tweets):
#     print(i, ":", tweet.full_text)

## Save the data in a DataFrame

In [None]:
# define the column names
columns = ["Time", "User", "Tweet"]
data = []

In [None]:
for tweet in tweets:
    data.append([tweet.created_at, tweet.user.screen_name, tweet.full_text])
dataframe = pd.DataFrame(data, columns=columns)
dataframe

## Export

In [None]:
# export data
# dataframe.to_json("tweets.json")
# dataframe.to_csv("tweets.csv")

# Using "snscrape"
https://github.com/JustAnotherArchivist/snscrape

## Import Packages

In [None]:
!pip install snscrape

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from pprint import pprint
from tqdm.notebook import tqdm
import re

## Pull tweets using various methods

In [None]:
# list of available scrapers
[fn_names for fn_names in sntwitter.__all__ if "Scraper" in fn_names]

In [None]:
# using Search Scraper
query = "india"
twitter_search = sntwitter.TwitterSearchScraper(query).get_items()
tweet = next(twitter_search)
print(vars(tweet))

In [None]:
# using User Scraper
user = "mratanusarkar"
twitter_search = sntwitter.TwitterUserScraper(user).get_items()
tweet = next(twitter_search)
print(vars(tweet))

In [None]:
# using Profile Scraper
user = "mratanusarkar"
twitter_search = sntwitter.TwitterProfileScraper(user).get_items()
tweet = next(twitter_search)
print(vars(tweet))

In [None]:
# using Hashtag Scraper
hashtag = "india"
twitter_search = sntwitter.TwitterHashtagScraper(hashtag).get_items()
tweet = next(twitter_search)
print(vars(tweet))

In [None]:
# using Tweet Scraper
tweetId = "1623329586602995712"
twitter_search = sntwitter.TwitterTweetScraper(tweetId).get_items()
tweet = next(twitter_search)
print(vars(tweet))

In [None]:
# using ListPosts Scraper
listName = "Physics"
twitter_search = sntwitter.TwitterListPostsScraper(listName).get_items()
tweet = next(twitter_search)
print(vars(tweet))

In [None]:
# using Trends Scraper (takes no query and yields trend objects rather than tweets)
twitter_search = sntwitter.TwitterTrendsScraper().get_items()
trend = next(twitter_search)
print(vars(trend))

On analysis, the scrapers that are useful for our purpose are:
* TwitterSearchScraper
* TwitterUserScraper
* TwitterProfileScraper
* TwitterHashtagScraper
* TwitterTrendsScraper

Since TwitterSearchScraper looks most useful for our use-case, let's study the response object:

In [None]:
# look at all the keys of the object
vars(tweet).keys()

In [None]:
useful_keys = ['id', 'date', 'user', 'rawContent', 'viewCount', 'likeCount', 'replyCount', 'retweetCount', 'quoteCount', 'url']
data = [str(vars(tweet).get(keys)) for keys in useful_keys]
data

In [None]:
data = [
    tweet.id,
    tweet.date,
    tweet.user.username,
    tweet.rawContent,
    tweet.viewCount,
    tweet.likeCount,
    tweet.replyCount,
    tweet.retweetCount,
    tweet.quoteCount,
    tweet.url
]
data
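
The two cells above differ in how they treat missing or nested fields: `vars(obj).get(key)` returns `None` for an unknown key, while attribute access reaches nested objects such as the user. The sketch below shows this on made-up dataclasses (`FakeUser`, `FakeTweet`) that only imitate the shape of the scraped objects:

```python
from dataclasses import dataclass


@dataclass
class FakeUser:
    username: str


@dataclass
class FakeTweet:
    id: int
    rawContent: str
    likeCount: int
    user: FakeUser


sample = FakeTweet(id=1, rawContent="hello", likeCount=5, user=FakeUser("someone"))

# Dictionary-style lookup: unknown keys give None instead of raising.
keys = ["id", "rawContent", "likeCount", "missingKey"]
row = {key: vars(sample).get(key) for key in keys}

# Attribute access: lets us pull a single field out of the nested user object.
row["username"] = sample.user.username

print(row)
```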

## Build the scraper script

In [None]:
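# scrape and build a dataframe

query = "mratanusarkar"
limit = 10
tweets = []
columns = []

twitter_search = sntwitter.TwitterSearchScraper(query).get_items()
for tweet in tqdm(twitter_search, total=limit):
    # take the column names from the first tweet,
    # so that they are set even if fewer than `limit` tweets are found
    if not columns:
        columns = list(vars(tweet).keys())
    tweets.append(list(vars(tweet).values()))
    # stop as soon as we have enough, without fetching an extra tweet
    if len(tweets) >= limit:
        break

df = pd.DataFrame(tweets, columns=columns)
df.head(1)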

In [None]:
# structuring into a helper function
def get_tweets(query, limit, keep_keys=None):
    keep_keys = keep_keys or []  # avoid a mutable default argument
    tweets = []
    columns = []
    # matches the position before each uppercase letter (camelCase -> snake_case)
    pattern = re.compile(r'(?<!^)(?=[A-Z])')
    if keep_keys:
        columns = [pattern.sub('_', key).lower() for key in keep_keys]

    twitter_search = sntwitter.TwitterSearchScraper(query).get_items()
    for tweet in tqdm(twitter_search, total=limit):
        if keep_keys:
            # keep only the requested fields, as strings
            tweets.append([str(vars(tweet).get(key)) for key in keep_keys])
        else:
            # keep everything; take the column names from the first tweet
            if not columns:
                columns = list(vars(tweet).keys())
            tweets.append(list(vars(tweet).values()))
        # stop as soon as we have enough, even if fewer tweets were expected
        if len(tweets) >= limit:
            break

    df = pd.DataFrame(tweets, columns=columns)
    return df

In [None]:
useful_keys = ['id', 'date', 'user', 'rawContent', 'viewCount', 'likeCount', 'replyCount', 'retweetCount', 'quoteCount', 'url']
get_tweets("mratanusarkar", 10, useful_keys)
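
The regular expression used in the helper for column names can be tried on its own. It matches the empty position before every uppercase letter except at the start of the string, so inserting `_` there and lower-casing turns camelCase into snake_case. The `to_snake_case` wrapper is added here only for illustration:

```python
import re

# Zero-width match before every uppercase letter that is not at the start.
pattern = re.compile(r'(?<!^)(?=[A-Z])')


def to_snake_case(name):
    """Convert a camelCase name to snake_case."""
    return pattern.sub('_', name).lower()


for key in ['id', 'rawContent', 'viewCount', 'retweetCount']:
    print(key, '->', to_snake_case(key))
```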

## Export

In [None]:
# export data
# df.to_json("tweets.json")
# df.to_csv("tweets.csv")

<a name="final-implementation"></a>
# Final Implementation

Since the Twitter API limits the number of tweets that can be pulled and requires authentication, which comes with its own limitations, snscrape is the better choice here.

In [1]:
!pip install snscrape

In [2]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
import re
import traceback
from tqdm.notebook import tqdm

In [3]:
def get_tweets(query: str, limit: int) -> pd.DataFrame:
    """
    Scrape tweets from twitter based on input search query
    Arguments:
        :param query: twitter search query as per https://twitter.com/search?q=
        :param limit: number of tweets you want to scrape
    Returns:
        :return: a pandas dataframe with the tweets
    """
    tweets = []
    columns = [
        'id',
        'date',
        'username',
        'content',
        'view_count',
        'like_count',
        'reply_count',
        'retweet_count',
        'quote_count',
        'url'
    ]
    try:
        twitter_search = sntwitter.TwitterSearchScraper(query).get_items()
        for tweet in tqdm(twitter_search, total=limit):
            if len(tweets) == limit:
                break
            else:
                data = [
                    tweet.id,
                    tweet.date,
                    tweet.user.username,
                    tweet.rawContent,
                    tweet.viewCount,
                    tweet.likeCount,
                    tweet.replyCount,
                    tweet.retweetCount,
                    tweet.quoteCount,
                    tweet.url
                ]
                tweets.append(data)
        df = pd.DataFrame(tweets, columns=columns)
        return df
    except Exception:
        # print_exc() writes the traceback itself and returns None, so it should not be wrapped in print()
        traceback.print_exc()
        return pd.DataFrame()
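
An alternative to counting inside the loop is to cap the generator with `itertools.islice`, which stops after `limit` items without requesting one more. The sketch below uses a made-up endless generator (`endless_items`) in place of the scraper's `get_items()`:

```python
from itertools import islice


def endless_items():
    """Stand-in for a scraper generator that may yield a very large number of items."""
    n = 0
    while True:
        yield {"id": n}
        n += 1


limit = 5
# islice pulls at most `limit` items and then stops reading from the generator.
first_items = list(islice(endless_items(), limit))
print(len(first_items))
```

If the generator runs out before `limit` is reached, `islice` simply returns the items that were available.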

In [4]:
query = 'India'
limit = 100
rawData = get_tweets(query, limit)

In [5]:
rawData.to_csv("tweets.csv")
rawData.to_json("tweets.json")
rawData.to_parquet("tweets.parquet")