This is a quick-and-dirty way to get a sense of what's trending on Twitter for a particular topic. For my use case, I am focusing on the city of Seattle, but you can easily apply this to any topic.

**Use the GPU for this notebook to speed things up:** select the menu option "Runtime" -> "Change runtime type", select "Hardware Accelerator" -> "GPU" and click "SAVE".

The code in this notebook does the following things:


*   Scrapes Tweets related to the Topic you are interested in.
*   Extracts relevant Tags from the text (NER: Named Entity Recognition).
*   Does Sentiment Analysis on those Tweets.
*   Provides some visualizations in an interactive format to get a 'pulse' of what's happening.

We use Tweepy to scrape Twitter data and Flair to do NER / Sentiment Analysis. We use Seaborn for visualizations and all of this is possible because of the wonderful, free and fast (with GPU) Google Colab.

**A bit about NER (Named Entity Recognition)** 

This is the process of extracting labels from text.

So, take an example sentence: 'George Washington went to Washington'. NER will allow us to extract labels such as Person for 'George Washington' and Location for the second 'Washington' (the state). It is one of the most common and useful applications in NLP and, using it, we can extract labels from Tweets and do analysis on them.

**A bit about Sentiment Analysis** 

Most commonly, this is the process of getting a sense of whether some text is Positive or Negative. More generally, you can apply it to any label of your choosing (Spam/No Spam etc.).

So, 'I hated this movie' would be classified as a negative statement but 'I loved this movie' would be classified as positive. Again - it is a very useful application as it allows us to get a sense of people's opinions about something (Twitter topics, Movie reviews etc). 

To learn more about these applications, check out the Flair Github homepage and Tutorials: https://github.com/zalandoresearch/flair


Note: You will need Twitter API keys (and of course a Twitter account) to make this work. You can get those by signing up here: https://developer.twitter.com/en/apps

In [None]:
# import lots of stuff
import sys
import os
import re
import tweepy
from tweepy import OAuthHandler
from textblob import TextBlob

import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from IPython.display import clear_output
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

In [None]:
# install Flair
!pip install --upgrade git+https://github.com/flairNLP/flair.git

clear_output()

In [None]:
# import Flair stuff
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')

clear_output()

In [None]:
#import Flair Classifier
from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')

clear_output()

In [None]:
#@title Enter Twitter Credentials
# paste your own API key and secret here; do not publish real credentials
TWITTER_KEY = '' #@param {type:"string"}
TWITTER_SECRET_KEY = '' #@param {type:"string"}

In [None]:
# Authenticate
auth = tweepy.AppAuthHandler(TWITTER_KEY, TWITTER_SECRET_KEY)

api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

if (not api):
    print ("Can't Authenticate")
    sys.exit(-1)


# Scraping


The Twitter scrape code here was taken from: https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively.

My thanks to the author.

We need to provide a search term and a maximum tweet count. Twitter lets you request 45,000 tweets every 15 minutes, so set the maximum at or below that.
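As a rough sketch of where the 45,000 figure comes from: it assumes the search endpoint allows 450 requests per 15-minute window under app-only authentication, with at most 100 tweets per request (the `tweetsPerQry` value used in this notebook). The constant and function names below are illustrative, not part of Tweepy.

```python
import math

REQUESTS_PER_WINDOW = 450  # assumption: app-only auth limit per 15-minute window
TWEETS_PER_REQUEST = 100   # matches tweetsPerQry in this notebook


def requests_needed(max_tweets, per_request=TWEETS_PER_REQUEST):
    """Number of search calls needed to fetch max_tweets tweets."""
    return math.ceil(max_tweets / per_request)


def windows_needed(max_tweets):
    """Number of 15-minute rate-limit windows those calls span."""
    return math.ceil(requests_needed(max_tweets) / REQUESTS_PER_WINDOW)


# tweets available in a single window
print(REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST)  # 45000
print(requests_needed(45000), windows_needed(45000))  # 450 1
```

Asking for more than 45,000 tweets would spill into a second window, which is when `wait_on_rate_limit=True` makes the client pause.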

In [35]:
#@title Twitter Search API Inputs
#@markdown ### Enter Search Country:
searchCountry = 'India' #@param {type:"string"}
radius = 2000 #@param {type:"number"}

#@markdown ### Enter Search Query:
searchQuery = '' #@param {type:"string"}

# #@markdown ### Enter until date
# #@markdown #### Returns tweets created before the given date. Date should be formatted as YYYY-MM-DD. Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
# until_date = "2021-07-10" #@param {type:"date"}


#@markdown ----------------
#@markdown ### Enter Max Tweets To Scrape:
#@markdown #### The Twitter API Rate Limit (currently) is 45,000 tweets every 15 minutes.
maxTweets = 45000 #@param {type:"slider", min:0, max:45000, step:100}
Filter_Retweets = True #@param {type:"boolean"}

tweetsPerQry = 100  # this is the max the API permits
tweet_lst = []

places = api.geo_search(query=searchCountry, granularity="country")
place_id = places[0].id

# if searchCountry:
#     searchQuery = searchQuery + f' place:{place_id}'
centroid = [str(i) for i in places[0].centroid[::-1]]
place = ",".join(centroid) + "," + f"{radius}km"
print(place)


if Filter_Retweets:
  searchQuery = searchQuery + ' -filter:retweets'  # to exclude retweets

# If results from a specific ID onwards are reqd, set since_id to that ID.
# else default to no lower limit, go as far back as API allows
sinceId = None

# If only results below a specific ID are required, set max_id to that ID.
# else default to no upper limit, start from the most recent tweet matching the search query.
max_id = -10000000000

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
while tweetCount < maxTweets:
    try:
        if (max_id <= 0):
            if (not sinceId):
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry, 
                                        geocode=place,lang="en")
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                        geocode=place, lang="en",
                                        since_id=sinceId)
        else:
            if (not sinceId):
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                        geocode=place,
                                        lang="en", max_id=str(max_id - 1))
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                        lang="en", geocode=place, 
                                        max_id=str(max_id - 1), 
                                        since_id=sinceId)
        if not new_tweets:
            print("No more tweets found")
            break
        for tweet in new_tweets:
          if hasattr(tweet, 'reply_count'):
            reply_count = tweet.reply_count
          else:
            reply_count = 0
          if hasattr(tweet, 'retweeted'):
            retweeted = tweet.retweeted
          else:
            retweeted = "NA"
            
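          # derive the topic from the search query: keep the text before the first '-'
          # (split keeps the whole query when no '-filter:...' suffix is present)
          topic = searchQuery.split('-')[0].capitalize().strip()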
          
          # fixup date
          tweetDate = tweet.created_at.date()
          
          tweet_lst.append([tweetDate, topic, 
                      tweet.id, tweet.user.screen_name, tweet.user.name, 
                      tweet.text, tweet.favorite_count,reply_count, 
                      tweet.retweet_count, retweeted])

        tweetCount += len(new_tweets)
        print("Downloaded {0} tweets".format(tweetCount))
        max_id = new_tweets[-1].id
    except tweepy.TweepError as e:
        # Just exit if any error
        print("some error : " + str(e))
        break

clear_output()
print("Downloaded {0} tweets".format(tweetCount))

Downloaded 66 tweets
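The download loop above pages backwards through the results by asking, on each call, only for tweets older than the oldest one already seen. Here is a minimal, self-contained sketch of that `max_id` pattern. `fake_search` is a hypothetical stand-in for the real search call, and the IDs are made up; unlike the notebook loop, this sketch also trims the result to `max_tweets`.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search call: newest-first IDs that are <= max_id."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]


def fetch_all(all_ids, per_query, max_tweets):
    """Page backwards through results using max_id = (oldest ID seen) - 1."""
    collected = []
    max_id = None  # None means "start from the most recent tweet"
    while len(collected) < max_tweets:
        page = fake_search(all_ids, per_query, max_id)
        if not page:
            break  # no more tweets found
        collected.extend(page)
        max_id = page[-1] - 1  # step just below the oldest ID on this page
    return collected[:max_tweets]


ids = list(range(1, 11))  # hypothetical tweet IDs 1..10
print(fetch_all(ids, per_query=4, max_tweets=100))
# [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Subtracting 1 from the oldest ID avoids fetching the same tweet twice, since `max_id` is inclusive.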


In [36]:
# show full tweet text; None means "no limit" (passing -1 is deprecated in newer pandas)
pd.set_option('display.max_colwidth', None)

# load it into a pandas dataframe
# note: the 'user_id' column holds tweet.id (the tweet's ID), not the author's ID
tweet_df = pd.DataFrame(tweet_lst, columns=['tweet_dt', 'topic', 'user_id', 'username', 'name', 'tweet', 'like_count', 'reply_count', 'retweet_count', 'retweeted'])
tweet_df["search_query"] = searchQuery
tweet_df.head()


Unnamed: 0,tweet_dt,topic,user_id,username,name,tweet,like_count,reply_count,retweet_count,retweeted,search_query
0,2021-08-06,Obcreservationinneet,1423565932002086915,latestly,LatestLY,"BJP Misleading People Regarding OBC Reservation, Says NCP's Nawab Malik\nhttps://t.co/bHH3M236oJ\n@OfficeofNM #BJP… https://t.co/WfolLTMYES",1,0,0,False,OBCreservationinNEET -filter:retweets
1,2021-08-05,Obcreservationinneet,1423261693874868225,iDrMTR,Dr. M. Thirupathi Reddy,"27% OBC, 10% EWS reservation in (NEET) medical seats from all-India quota https://t.co/iO976BuGLS via @timesofindia… https://t.co/mfAiFgcCTp",11,0,10,False,OBCreservationinNEET -filter:retweets
2,2021-08-05,Obcreservationinneet,1423149981804818432,arvindguptaaor,Adv Arvind Gupta (Sahu) 🇮🇳,@KARUNANIDHYG1 @mkstalin @sivasankar1ss Thanks @mkstalin @sivasankar1ss \n\n#OBCreservation… https://t.co/61nFNHwKOS,0,0,0,False,OBCreservationinNEET -filter:retweets
3,2021-08-04,Obcreservationinneet,1422991449994383360,iamsunnytawar,Sunny Tawar,The Madras HC observed that reservation provided by Tamil Nadu (50%) for OBCs must be applied in the AIQ seats surr… https://t.co/GdZIoFBKkJ,8,0,1,False,OBCreservationinNEET -filter:retweets
4,2021-08-04,Obcreservationinneet,1422807064821174272,iDrMoin82,Dr. Moin Uddin,"""Medical Salve for Social Justice: The opening up of medical colleges across India will help improve representation… https://t.co/F65yXxOEiE",0,0,0,False,OBCreservationinNEET -filter:retweets


## NER and Sentiment Analysis

Now let's do some NER / Sentiment Analysis. We will use the Flair library: https://github.com/zalandoresearch/flair

### NER

We extract the Tags and append each one as a separate row in our dataframe. This lets us group by Tag later on.

We also create a new 'Hashtag' Tag as Flair does not recognize it and it's a big one in this context.
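A minimal sketch of the hashtag step, using the same `#\w+` pattern as the NER cell in this notebook. The helper name `extract_hashtags` is illustrative; it returns dicts in the `{'type', 'text'}` shape that the cell appends to the entity list.

```python
import re


def extract_hashtags(text):
    """Return one {'type': 'Hashtag', 'text': ...} dict per hashtag in text."""
    return [{'type': 'Hashtag', 'text': tag} for tag in re.findall(r'#\w+', text)]


print(extract_hashtags("MY JIMINIE #JIMIN https://t.co/LEhJhMKket"))
# [{'type': 'Hashtag', 'text': '#JIMIN'}]
```

Because `\w` stops at punctuation, a truncated tweet such as `#OBCreservation…` still yields `#OBCreservation`.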

### Sentiment Analysis

We use the Flair Classifier to get Polarity and Result and add those fields to our dataframe.
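The classifier's score is a confidence between 0 and 1 regardless of the label, so the notebook also stores a signed version (`adj_polarity`) that is negative for NEGATIVE predictions. A small sketch of that adjustment, with an illustrative helper name:

```python
def signed_polarity(result, polarity):
    """Negate the confidence score when the predicted label is NEGATIVE."""
    return -polarity if result == 'NEGATIVE' else polarity


print(signed_polarity('NEGATIVE', 0.999694))  # -0.999694
print(signed_polarity('POSITIVE', 0.964975))  # 0.964975
```

The signed value can be averaged across tweets, which the raw confidence cannot.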

**Warning:** This can be slow if you have lots of tweets.

In [37]:
# predict NER
nerlst = []

for index, row in tqdm(tweet_df.iterrows(), total=tweet_df.shape[0]):
  cleanedTweet = row['tweet'].replace("#", "")
  sentence = Sentence(cleanedTweet, use_tokenizer=True)
  
  # predict NER tags
  tagger.predict(sentence)

  # get ner
  ners = sentence.to_dict(tag_type='ner')['entities']
  
  # predict sentiment
  classifier.predict(sentence)
  
  label = sentence.labels[0]
  response = {'result': label.value, 'polarity':label.score}
  
  # get hashtags
  hashtags = re.findall(r'#\w+', row['tweet'])
  if len(hashtags) >= 1:
    for hashtag in hashtags:
      ners.append({ 'type': 'Hashtag', 'text': hashtag })
  
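  # signed polarity: negate the score for NEGATIVE predictions
  adj_polarity = response['polarity']
  if response['result'] == 'NEGATIVE':
    adj_polarity = response['polarity'] * -1

  for ner in ners:
    # entities without a 'type' key get an empty tag type
    ner.setdefault('type', '')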
    nerlst.append([ row['tweet_dt'], row['topic'], row['user_id'], row['username'], 
                   row['name'], row['tweet'], ner['type'], ner['text'], response['result'], 
                   response['polarity'], adj_polarity, row['like_count'], row['reply_count'], 
                  row['retweet_count'] ])

clear_output()

In [38]:
df_ner = pd.DataFrame(nerlst, columns=['tweet_dt', 'topic', 'user_id', 'username', 'name', 'tweet', 'tag_type', 'tag', 'sentiment', 'polarity', 
                                       'adj_polarity','like_count', 'reply_count', 'retweet_count'])
df_ner.head()

Unnamed: 0,tweet_dt,topic,user_id,username,name,tweet,tag_type,tag,sentiment,polarity,adj_polarity,like_count,reply_count,retweet_count
0,2021-08-06,Obcreservationinneet,1423565932002086915,latestly,LatestLY,"BJP Misleading People Regarding OBC Reservation, Says NCP's Nawab Malik\nhttps://t.co/bHH3M236oJ\n@OfficeofNM #BJP… https://t.co/WfolLTMYES",,BJP,NEGATIVE,0.999694,-0.999694,1,0,0
1,2021-08-06,Obcreservationinneet,1423565932002086915,latestly,LatestLY,"BJP Misleading People Regarding OBC Reservation, Says NCP's Nawab Malik\nhttps://t.co/bHH3M236oJ\n@OfficeofNM #BJP… https://t.co/WfolLTMYES",,Nawab Malik,NEGATIVE,0.999694,-0.999694,1,0,0
2,2021-08-06,Obcreservationinneet,1423565932002086915,latestly,LatestLY,"BJP Misleading People Regarding OBC Reservation, Says NCP's Nawab Malik\nhttps://t.co/bHH3M236oJ\n@OfficeofNM #BJP… https://t.co/WfolLTMYES",Hashtag,#BJP,NEGATIVE,0.999694,-0.999694,1,0,0
3,2021-08-05,Obcreservationinneet,1423149981804818432,arvindguptaaor,Adv Arvind Gupta (Sahu) 🇮🇳,@KARUNANIDHYG1 @mkstalin @sivasankar1ss Thanks @mkstalin @sivasankar1ss \n\n#OBCreservation… https://t.co/61nFNHwKOS,Hashtag,#OBCreservation,POSITIVE,0.964975,0.964975,0,0,0
4,2021-08-04,Obcreservationinneet,1422991449994383360,iamsunnytawar,Sunny Tawar,The Madras HC observed that reservation provided by Tamil Nadu (50%) for OBCs must be applied in the AIQ seats surr… https://t.co/GdZIoFBKkJ,,Madras HC,NEGATIVE,0.813445,-0.813445,8,0,1
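With one row per (tweet, tag) pair, grouping by tag becomes straightforward. The sketch below uses plain Python and made-up scores to show the idea; the rows and the helper name are hypothetical, not taken from the notebook's data.

```python
from collections import defaultdict

# hypothetical (tag, adj_polarity) pairs, one per row of the long-format table
rows = [
    ('#BJP', -0.5),
    ('#BJP', 0.25),
    ('Madras HC', -0.75),
]


def mean_polarity_by_tag(rows):
    """Average the signed polarity of all rows that share a tag."""
    totals = defaultdict(lambda: [0.0, 0])  # tag -> [sum of scores, row count]
    for tag, score in rows:
        totals[tag][0] += score
        totals[tag][1] += 1
    return {tag: total / n for tag, (total, n) in totals.items()}


print(mean_polarity_by_tag(rows))
# {'#BJP': -0.125, 'Madras HC': -0.75}
```

On the dataframe itself, something like `df_ner.groupby('tag')['adj_polarity'].mean()` should give the equivalent result.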


## Integration with MongoDB

In [39]:
!apt install mongodb
!service mongodb start
!pip install pymongo
!pip install dnspython
!pip3 install pymongo[srv]
clear_output()

### Connecting to MongoDB server

In [40]:
from pymongo import MongoClient
import urllib.parse

password = urllib.parse.quote('social@123')

# instantiating the Mongoclient
client = MongoClient(f"mongodb+srv://socialorg:{password}@cluster0.cbgfb.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")

# list the databases present on the server
client.list_database_names()

['test', 'admin', 'local']
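The connection cell above percent-encodes the password because characters such as `@` and `:` have special meaning inside a MongoDB URI. Here is a small sketch of that step. Note that it swaps in `urllib.parse.quote_plus` (which also escapes `/`), where the notebook uses `quote`; the user, password and host values are placeholders, and `build_uri` is an illustrative helper.

```python
import urllib.parse


def build_uri(user, password, host):
    """Percent-encode the credentials before placing them in a MongoDB URI."""
    user_enc = urllib.parse.quote_plus(user)
    pass_enc = urllib.parse.quote_plus(password)
    return f"mongodb+srv://{user_enc}:{pass_enc}@{host}/"


# placeholder credentials and host, for illustration only
print(build_uri("example_user", "p@ss:word", "cluster.example.net"))
# mongodb+srv://example_user:p%40ss%3Aword@cluster.example.net/
```

Without the encoding, the `@` in the password would be read as the separator between credentials and host.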

In [41]:
# switching to database (if db not there it will automatically create it)
db = client.test

# use this only when want to remove the current collection
# db.drop_collection('test')

# getting a collection
collection = db.test

### Inserting data into MongoDB

In [42]:
df_ner.tweet_dt = pd.to_datetime(df_ner.tweet_dt)

In [46]:
# getting the current data from the db and converting to dataframe
db_df = pd.DataFrame(collection.find())

if not db_df.empty:
    # dropping the _id column
    db_df.drop('_id', axis=1, inplace=True)

    # removing the duplicate data (which are already present in the database)
    df_ner = df_ner[~df_ner.tweet.isin(db_df.tweet)]
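The filter above keeps only rows whose tweet text is not already stored. The same idea in plain Python, as a sketch with made-up records and an illustrative helper name:

```python
def drop_already_stored(new_records, stored_records, key='tweet'):
    """Keep only the new records whose key value is not in the stored set."""
    seen = {rec[key] for rec in stored_records}
    return [rec for rec in new_records if rec[key] not in seen]


# hypothetical records for illustration
stored = [{'tweet': 'hello world'}]
incoming = [{'tweet': 'hello world'}, {'tweet': 'something new'}]
print(drop_already_stored(incoming, stored))
# [{'tweet': 'something new'}]
```

Matching on the text alone treats two different tweets with identical wording as duplicates; matching on the tweet ID would be a stricter alternative.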

In [47]:
# inserting the dataframe to database (insert_many rejects an empty list)
records = df_ner.to_dict(orient='records')
if records:
    collection.insert_many(records)
else:
    print("No data to be inserted in the database")

No data to be inserted in the database


In [48]:
# getting the data from the database and putting it into a dataframe
df = pd.DataFrame(collection.find())

In [49]:
df.head()

Unnamed: 0,_id,tweet_dt,topic,user_id,username,name,tweet,tag_type,tag,sentiment,polarity,adj_polarity,like_count,reply_count,retweet_count
0,610bff126fdf13b1c5c8de95,2021-08-05,place:b850c1bfd38f30e0,1423293301013192707,ParveshLamba02,Parvesh,@narendramodi @manpreetpawar07 @CPDelhi @AmitShah @AmitShahOffice @PIBHindi @PIBHomeAffairs @DelhiPolice… https://t.co/FUHrsVmsJj,,DelhiPolice,POSITIVE,0.67716,0.67716,0,0,0
1,610bff126fdf13b1c5c8de96,2021-08-05,place:b850c1bfd38f30e0,1423293298609778691,Chandan92480060,Chandan Kumar Jha,"@Olympics Money💰Money is new in the market, loot it quickly, it is not fake, it is real.\nIf you also want to earn m… https://t.co/bMBSdeXKsA",,Olympics Money💰Money,POSITIVE,0.983427,0.983427,0,0,0
2,610bff126fdf13b1c5c8de97,2021-08-05,place:b850c1bfd38f30e0,1423293295547994130,kk_nafla,Nafla.kk,MY JIMINIE💋😍\n#JIMIN https://t.co/LEhJhMKket,,JIMIN,POSITIVE,0.802625,0.802625,0,0,0
3,610bff126fdf13b1c5c8de98,2021-08-05,place:b850c1bfd38f30e0,1423293295547994130,kk_nafla,Nafla.kk,MY JIMINIE💋😍\n#JIMIN https://t.co/LEhJhMKket,Hashtag,#JIMIN,POSITIVE,0.802625,0.802625,0,0,0
4,610bff126fdf13b1c5c8de99,2021-08-05,place:b850c1bfd38f30e0,1423293291483734016,tanuj121,TaNuJ GaMbHiR....😎,@CeoNoida @InfoDeptUP @UPGovt @WeAreTeamIndia @Satishmahanaup @PMOIndia @CMOfficeUP @myogiadityanath @IndiaSports… https://t.co/T7BFrkTxov,,CeoNoida,POSITIVE,0.855613,0.855613,0,0,0
