# Spatio-Temporal Analysis of Covid-19 related Tweets using a modified TextRank algorithm

# Project Overview

## Objective
The primary objective of this project is to analyze six months of Twitter data related to COVID-19, spanning from April to September 2020, with an average daily volume close to one million tweets. The focus is on establishing meaningful trends and understanding how interest in specific topics has evolved globally during this period. The chosen approach employs a customized version of TextRank for extractive summarization.

## Text Summarization Categories
In the realm of text summarization, two main categories exist: 
1. **Extractive Summarization:** Involves selecting a subset of sentences from the original text.
2. **Abstractive Summarization:** Expresses the document's ideas using different wording.

For this project, the emphasis lies on extractive summarization.

## Modified TextRank Algorithm
The modified TextRank algorithm draws inspiration from the PageRank algorithm, wherein important web pages are linked to other important web pages. Similarly, our approach assumes that important sentences are linked or similar to other important sentences in the input document. The algorithm constructs a similarity graph where each vertex represents a sentence vector, and edge weights denote their similarity. Edges are established between vertices if the similarity measure exceeds a predefined threshold, effectively removing less coherent nodes (500 nodes removed).

An innovation in this modification involves introducing a damping factor to the algorithm. The formula incorporates TextRank of the sentence for a given node and the degree of that node. A presentation and paper have been included to enhance understanding.

## Data Preprocessing
The initial step involved collecting English tweets related to COVID-19 from April to September 2020. Subsequently, preprocessing steps included removing hashtags, URLs, emoticons, and irrelevant content.

## Keyword Classification
Simultaneously, two processes were executed:
1. The first utilized the entire six months' data to generate keywords above a specified frequency, which were then classified and associated with six different topics or buckets. For instance, keywords like "government" and "leadership" were classified under the administration topic, while "virus" and "cured" belonged to the disease topic.

2. The second process involved applying the modified TextRank algorithm to generate summary files for each of the 183 days. Within these summaries, keywords were identified, and the associated topic counts were incremented, resulting in 183 data points.

## Data Normalization and Visualization
These data points were normalized by dividing them by the number of tweets on that particular day. The final normalized data points were plotted, and a 7-day moving average was applied for enhanced readability.


In [1]:
import re
import pandas as pd

In [5]:
from nltk import *
import re
import urllib
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from bs4 import BeautifulSoup
from nltk.corpus import wordnet
from collections import namedtuple
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import operator
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from graphviz import Digraph

from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


import string
#import preprocessor as p
#from preprocessor.api import clean, tokenize, parse
import unidecode
import os

In [3]:
def remove_stopword(text):
    cachedStopWords = stopwords.words("english")
    text = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return text

The below keywords are generated by parsing the data, looking for words above a specified frequency. While we generate 7 buckets, ultimately due to the lack of data points, the religion bucket is dropped. 

One look at the list of keywords would show that a lot of them are abbreviations, Hindi words typed out in English, or just typos. While we do not expect the final trends to change significantly, for the sake of completeness, we keep working on bettering these keyword lists. 

If the reader wishes to contribute, they could do so by simply adding new relevant keywords or other variations of existing keywords.

In [6]:
administration = ['relief', 'lie', 'pulis', 'srkaar', 'congress', 'police', 'pm', 'chief', 'minister', 'hm', 'members', 'distributed', 'govt', 'government', 'suppo', 'food', 'judgement', 'distributing', 'diyaa', 'modi', 'scandalous', 'cm', 'kovind', 'mjduur', 'food,', 'presidents', 'nehru', 'dynasty', 'attacks', 'opponents', 'leadership', 'narendra', 'prime', 'bjp','aadesh','amendment','ordinance,','commitment','raashn','express','scam','niti','president','modis','fund','ruupaay','producer','constable','maansiktaa','crore']

disease = ['manifests','covid19', 'corona', 'coronavirus', 'cases', 'koronaa', 'positive', 'patients', 'covid', 'pandemic', 'crisis', 'koroonnnaa', 'spread', 'dies', 'virus', 'cured', 'donate', 'diseases', 'maut', 'suffering', 'deaths', 'died','epidemic','death','patient','demise','case','dead','coronavirus','maaro', 'tested', 'fighting','symptoms']

healthcare = ['help', 'nivaarnn', 'humanitarian','aid','treatment', 'ilaaj', 'metabolics', 'stitched', 'labs', 'blood', 'fight', 'rapid', 'stepped', 'respect', 'healthcare', 'plasma', 'icmr', 'hospitals', 'donated', 'wellbeing','commend','lose','save','lives','hospital','dr','doctor','antivenom','treatment','poisonous','medicines','donating','medical','testing','giving','gratitude','villagedoctors']

location = ['india','delhi', 'desh', 'world', 'dillii', 'mumbai', 'indias', 'tamil', 'country', 'china', 'states', 'state', 'tmilnaaddu', 'manipurdignity','area','maharashtra','chennai','nadu','yuupii','gaajiyaabaad','amerikn','countries','nizamuddin','bihar,','central','western','karnataka','south']

precaution = ['proactive','masks', 'test', 'kits', 'face', 'kit', 'kitt', 'walking', 'health', 'measures', 'safety,','mask','purchase', 'ghr','lockdown', 'home', 'month', 'stay', 'lonkddaaun']

public = ['office','people', 'workers', 'indian', 'log', 'lady', 'bhuukhe', 'everyone', 'youth', 'girl', 'human', 'krodd', 'media', 'migrant', 'thalapathy', '12year', 'kmaane', 'shrii', 'actor', 'worker', 'khaanaa', 'lakhs', 'millions', 'appeals','news','everyones','brothers','person','protect','journalists','sisters,','needy','body','private','sir','community','neighbors','shops','privaar','relatives','family']

religion = ['muslims','holy', 'muslim', 'jamaat', 'jaatii', 'mubarak', 'hinduu', 'markaz', 'mohammad','jamaats','pray','msjidon','mndiron','hindu','ramzan','priests']

In [5]:
df2 = pd.read_csv("all_months.csv")

In [6]:
df3 = pd.read_csv("all_months_normalized.csv")

The dataset we are dealing with is a massive 185 million tweets which goes in the order of Gigabytes. The actual dataset is stored in a Google Drive, with a link attached in the README file. 

Below is a tutorial on how to run our modified TextRank algorithm for the month of September. It is also worth noting that due to a lack of compute (the generation of similarity matrix makes the space $ O(n^2) $ ), we were unable to run our modified algorithm on the entire day all at once. Instead, we broke the text down into 4 parts, ran the algorithm on each part to generate 4 summaries, and combined them all to generate a final summary. 

First we remove some irrelevant literature, like stopwords, emoticons, hashtags, etc. We then generate the four summaries and the main summary. In the main summary, we try to search for our keywords, and assign the generate bucket counts for each day. This process is repeated for all 185 days, generating 185 dataponts, through which we generate our bucket trends. 

An important point to note was that interest in Covid-19 also did not stay constant, it ultimately started dying down. So a count of 3000 out of 1 million tweets could not be held to the same standard as a count of 3000 in 500k tweets. To account for this inconsistency, we normalize our bucket counts, dividing them by the daily tweet counts.

In [7]:
for ww in range (1,31):    
    print(ww)
    df = pd.read_csv(str(ww) + "Sep_text_filtered.csv")
    df = df['text']
    a = len(df)//4
    b = len(df)//2
    c = 3*a
    d = len(df)

    text1 =''
    for i in range(0, a):
        
        if type(df[i]) == str:
            text1 = text1 + df[i]

    text2 =''
    for i in range(a+1, b):
        
        if type(df[i]) == str:
            text2 = text2 + df[i]

    text3 =''
    for i in range(b+1, c):
        
        if type(df[i]) == str:
            text3 = text3 + df[i]
    text4 =''
    for i in range(c+1, d):
        
        if type(df[i]) == str:
            text4 = text4 + df[i]

    texts = [text1, text2, text3, text4]
    for k in range(0,len(texts)):
        input_file1=texts[k]     #opening the original dataset

        data=""                 #storing the dataset as a string
        for x in input_file1:
            data=data+x
        data = unidecode.unidecode(data)    #to remove accents in the string
        #print(data)
        #clean_text=p.clean(data)            #to clean the data by removing tweet ids and labels

        #####################

        data = re.sub(r'#\w*','',data) # Remove hashtag
        data = re.sub(r'(RT|rt|FAV|fav|VIA|via)','',data) #Remove twitter reserved words

        def get_url_patern():
            return re.compile(
                r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))'
                r'[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9]\.[^\s]{2,})')

        def remove_urls(self):
                self = re.sub(pattern=get_url_patern(), repl='', string=self)
                return self

        TAG_RE = re.compile(r'<[^>]+>')

        def remove_tags(text):
            return TAG_RE.sub('', text)


        data = remove_urls(data) #remove urls
        data = remove_tags(data) #remove tags
        data = re.sub(r'\.+', ".", data) #remove multiple dots
        data = data.replace('[', '')
        data = data.replace(']', '')
        data = data.replace('|', '')

        #####################


        import urllib.parse as urlparse
        #>>> string = '@peter I really love that shirt at #Macy. http://bit.ly//WjdiW#'
        new_string = ''
        for i in data.split():
            s, n, p, pa, q, f = urlparse.urlparse(i)
            if s and n:
                pass
            elif i[:1] == '@':
                pass
            elif i[:1] == '#':
                new_string = new_string.strip() + ' ' + i[1:]
            else:
                new_string = new_string.strip() + ' ' + i

        clean_text=new_string




        #print("\n\n\n\n\n", clean_text)
        bad_chars = [';', ':', '!', "relevant" , "+" , "-" ,  "not_relevant" , "*", "RT", "0U" , "https", "http", ".html","/" , '@' , "?" , "("  ,  ")" , "^" , "$" , "&" , "_" , "%"]  #to remove the irrelevant characters
        emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                "]+", flags=re.UNICODE)

        emoticons_happy = set([
            ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
            ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
            '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
            'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
            '<3'
            ])

        # Sad Emoticons
        emoticons_sad = set([
            ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
            ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
            ':c', ':{', '>:\\', ';('
            ])

        #combine sad and happy emoticons
        emoticons = list(emoticons_happy.union(emoticons_sad))

        for i in bad_chars or emoji_pattern or emoticons:       #clean_text contains the final cleaned data
            clean_text= clean_text.replace(i, '')  
        #print(clean_text)

        f20 = open(str(k) + '.txt','w')     #open a file to store the cleaned data which is the input for the extractive summarization
        for i in clean_text:
            f20.write(i)
        f20.close()


    r = ''
    l = ['0','1','2','3']



    for k in l:
        def textrank(document):
            sentence_tokenizer = PunktSentenceTokenizer()
            sentences = sentence_tokenizer.tokenize(document)
            f = open('sent_tokenized_file.txt','w+')
            f1 = open('processed_file.txt','w')
            #print sentences
            
            #sentence index tagging
            sentence_mod=[]
            processed_sent=[]
            index=0
            eol = '\n'
            
            for item in sentences:
                lineitem = str(index)+' '+item+eol
                f.writelines(str(lineitem)) 
                new_item=remove_stopword(item)   #remove stopwords
                f1.writelines((str(new_item)+eol)) 
                index=index+1
                item += u' '
                item +=str(index)
                sentence_mod.append(item)
                
                new_item +=u' '
                new_item +=str(index)
                processed_sent.append(new_item)
            f.close()
            
            f1.close
            #print sentence_mod
            #print processed_sent
            bow_matrix = CountVectorizer().fit_transform(processed_sent)
            
            #print bow_matrix
            
            #f1.write(str(bow_matrix))
            normalized = TfidfTransformer().fit_transform(bow_matrix)
            
            #print normalized
        
            similarity_graph = normalized * normalized.T
        
            nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
            scores = nx.pagerank(nx_graph)
            #nx.draw(nx_graph)
            #plt.show()
            return sorted(((scores[i],s) for i,s in enumerate(processed_sent)),
                        reverse=True),sentence_mod

        input_file=open(k+ '.txt', 'r')
        title=input_file.read()
        
        #print title
        txt_rnk,sentence_mod=textrank(title)
        #print sentence_m
        length = len(txt_rnk)
        #print(length)
        n=int(length*0.25)
        #print "n:------------------->",n
        i=0
        result=[]
        while i< n:
            result.append(txt_rnk[i][1])   
            i=i+1
            
        result.sort(key = lambda x: int(x.rsplit(' ',1)[1]))# sort according sentence index #

        result_sent=''
        result_index=[]
        for item in result:
            for k in sentence_mod:
                if item.rsplit(None, 1)[-1]==k.rsplit(None, 1)[-1]:
                    result_sent+=k.rsplit(' ', 1)[0] #remove sentence index from the result #
                    result_index.append(k.rsplit(' ', 1)[1])

        r = r+result_sent
        #print(len(r))


    def textrank(document):
        sentence_tokenizer = PunktSentenceTokenizer()
        sentences = sentence_tokenizer.tokenize(document)
        f = open('sent_tokenized_file.txt','w+')
        f1 = open('processed_file.txt','w')
        #print sentences
        
        #sentence index tagging
        sentence_mod=[]
        processed_sent=[]
        index=0
        eol = '\n'
        
        for item in sentences:
            lineitem = str(index)+' '+item+eol
            f.writelines(str(lineitem)) 
            new_item=remove_stopword(item)   #remove stopwords
            f1.writelines((str(new_item)+eol)) 
            index=index+1
            item += u' '
            item +=str(index)
            sentence_mod.append(item)
            
            new_item +=u' '
            new_item +=str(index)
            processed_sent.append(new_item)
        f.close()
        
        f1.close
        #print sentence_mod
        #print processed_sent
        bow_matrix = CountVectorizer().fit_transform(processed_sent)
        
        #print bow_matrix
        
        #f1.write(str(bow_matrix))
        normalized = TfidfTransformer().fit_transform(bow_matrix)
        
        #print normalized
    
        similarity_graph = normalized * normalized.T
    
        nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
        scores = nx.pagerank(nx_graph)
        #nx.draw(nx_graph)
        #plt.show()
        return sorted(((scores[i],s) for i,s in enumerate(processed_sent)),
                    reverse=True),sentence_mod

    title=r
    
    #print title
    txt_rnk,sentence_mod=textrank(title)
    #print sentence_m
    length = len(txt_rnk)
    #print(length)
    n=int(length*0.1)
    #print "n:------------------->",n
    i=0
    result=[]
    while i< n:
        result.append(txt_rnk[i][1])   
        i=i+1
        
    result.sort(key = lambda x: int(x.rsplit(' ',1)[1]))# sort according sentence index #

    result_sent=''
    result_index=[]
    for item in result:
        for k in sentence_mod:
            if item.rsplit(None, 1)[-1]==k.rsplit(None, 1)[-1]:
                result_sent+=k.rsplit(' ', 1)[0] #remove sentence index from the result #
                result_index.append(k.rsplit(' ', 1)[1])

    f20 = open(str(ww) + 'Sep.txt','w')   

    f20.write(result_sent)
    f20.close()

    f2 = open(str(ww)+ "Sep.txt", 'r', encoding="utf-8")
    a=f2.read()
    f2.close()

    l = re.split('(\W+?)', a)
    words = []
    for word in l:
        if word.isalnum():
            words.append(word)

    admin_count = 0
    disease_count = 0
    healthcare_count = 0
    location_count = 0
    precaution_count = 0
    public_count = 0
    religion_count = 0 
    for word in words:
        if word in administration:
            admin_count += 1
        if word in disease:
            disease_count += 1
        if word in healthcare:
            healthcare_count += 1
        if word in location:
            location_count += 1
        if word in precaution:
            precaution_count += 1
        if word in public:
            public_count += 1
        if word in religion:
            religion_count += 1

    df2[str(ww)+'Sep'] = [admin_count, disease_count, healthcare_count, location_count, precaution_count, public_count, religion_count]
    df3[str(ww)+'Sep'] = df2[str(ww)+'Sep']*20000/len(df)



1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


In [17]:
df2.to_csv("all_months.csv")

In [20]:
df3.to_csv("all_months_normalized.csv")

In [18]:
df3 = df3.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'Unnamed: 0.1.1.1'], axis = 1)
df3

Unnamed: 0,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,buckets,1Apr,2Apr,3Apr,4Apr,5Apr,6Apr,7Apr,...,21Sep,22Sep,23Sep,24Sep,25Sep,26Sep,27Sep,28Sep,29Sep,30Sep
0,0,0,administration,111.59509,83.426443,109.501285,132.645542,105.638541,135.006237,91.450447,...,109.41704,123.466317,71.466586,99.849433,99.594673,103.884861,83.96533,214.82485,101.979086,123.797944
1,1,1,disease,831.563411,693.482309,808.22377,957.995578,967.568295,884.877834,709.627109,...,835.874439,1015.510456,725.309815,675.172359,933.410539,891.67839,880.28169,1316.044126,978.307839,1085.442688
2,2,2,healthcare,201.59113,145.251397,203.359529,214.759448,222.715651,238.462103,194.243584,...,132.735426,132.726291,127.727515,117.283461,182.976259,134.184612,92.091008,292.23921,126.177513,198.960982
3,3,3,location,101.515533,88.640596,102.797125,133.34737,124.478536,108.591973,90.03261,...,75.336323,100.316382,76.028283,49.132261,152.866242,99.556325,73.131094,127.733695,108.892922,126.008622
4,4,4,precaution,209.510782,154.934823,223.47201,238.621609,215.314224,195.17206,177.938466,...,179.372197,219.152712,173.344484,166.415722,233.931673,212.098258,203.141928,315.463518,238.527353,252.017243
5,5,5,public,323.265776,233.147114,283.809453,282.83679,300.767057,325.042189,255.919467,...,252.914798,322.555753,209.83806,196.529044,287.203243,240.233741,238.353196,350.299981,269.639616,320.548248
6,6,6,religion,9.359588,9.683426,7.449067,7.018283,6.055713,8.804755,4.253509,...,1.793722,0.0,3.041131,0.0,0.0,4.328536,2.708559,3.870718,5.185377,0.0


In [19]:
df3 = df3.drop(['Unnamed: 0.1.1.1.1', 'Unnamed: 0.1.1.1.1.1'], axis = 1)
df3

Unnamed: 0,buckets,1Apr,2Apr,3Apr,4Apr,5Apr,6Apr,7Apr,8Apr,9Apr,...,21Sep,22Sep,23Sep,24Sep,25Sep,26Sep,27Sep,28Sep,29Sep,30Sep
0,administration,111.59509,83.426443,109.501285,132.645542,105.638541,135.006237,91.450447,105.671771,127.882193,...,109.41704,123.466317,71.466586,99.849433,99.594673,103.884861,83.96533,214.82485,101.979086,123.797944
1,disease,831.563411,693.482309,808.22377,957.995578,967.568295,884.877834,709.627109,803.680541,950.203449,...,835.874439,1015.510456,725.309815,675.172359,933.410539,891.67839,880.28169,1316.044126,978.307839,1085.442688
2,healthcare,201.59113,145.251397,203.359529,214.759448,222.715651,238.462103,194.243584,196.247574,227.087774,...,132.735426,132.726291,127.727515,117.283461,182.976259,134.184612,92.091008,292.23921,126.177513,198.960982
3,location,101.515533,88.640596,102.797125,133.34737,124.478536,108.591973,90.03261,99.920926,123.231932,...,75.336323,100.316382,76.028283,49.132261,152.866242,99.556325,73.131094,127.733695,108.892922,126.008622
4,precaution,209.510782,154.934823,223.47201,238.621609,215.314224,195.17206,177.938466,185.46474,225.537686,...,179.372197,219.152712,173.344484,166.415722,233.931673,212.098258,203.141928,315.463518,238.527353,252.017243
5,public,323.265776,233.147114,283.809453,282.83679,300.767057,325.042189,255.919467,253.75602,330.943616,...,252.914798,322.555753,209.83806,196.529044,287.203243,240.233741,238.353196,350.299981,269.639616,320.548248
6,religion,9.359588,9.683426,7.449067,7.018283,6.055713,8.804755,4.253509,7.907411,3.875218,...,1.793722,0.0,3.041131,0.0,0.0,4.328536,2.708559,3.870718,5.185377,0.0
