We analyze the demonetization tweets dataset to extract mentions, hashtags, references to the Prime Minister, and noun-preposition-noun phrases.

### 1. Loading Data

In [241]:
import pandas as pd 
import warnings
warnings.filterwarnings("ignore")
#Loading the dataset
df = pd.read_csv("Demonetization_tweets.csv",encoding='unicode_escape')

# Printing first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,text
0,1,RT @rssurjewala: Critical question: Was PayTM ...
1,2,RT @Hemant_80: Did you vote on #Demonetization...
2,3,"RT @roshankar: Former FinSec, RBI Dy Governor,..."
3,4,RT @ANI_news: Gurugram (Haryana): Post office ...
4,5,RT @satishacharya: Reddy Wedding! @mail_today ...


In [242]:
len(df)

14940

In [243]:
# Looking at some Tweets
for index, tweet in enumerate(df["text"][0:20]):
    print(index+1,".",tweet)

1 . RT @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &amp;
2 . RT @Hemant_80: Did you vote on #Demonetization on Modi survey app?
3 . RT @roshankar: Former FinSec, RBI Dy Governor, CBDT Chair + Harvard Professor lambaste #Demonetization.

If not for Aam Aadmi, listen to th
4 . RT @ANI_news: Gurugram (Haryana): Post office employees provide cash exchange to patients in hospitals #demonetization https://t.co/uGMxUP9
5 . RT @satishacharya: Reddy Wedding! @mail_today cartoon #demonetization #ReddyWedding https://t.co/u7gLNrq31F
6 . @DerekScissors1: Indias #demonetization: #Blackmoney a symptom, not the disease https://t.co/HSl6Ihj0Qe via @ambazaarmag
7 . RT @gauravcsawant: Rs 40 lakh looted from a bank in Kishtwar in J&amp;K. Third such incident since #demonetization. That's how terrorists have
8 . RT @Joydeep_911: Calling all Nationalists to join...
Walk for #CorruptionFreeIndia and spread the

### 2. Data Pre-processing using Regex

From the tweets above, we can see that the following pre-processing steps are needed:

1. Remove the retweet marker RT
2. Replace the HTML entity &amp; with &
3. Remove strings like <U+...> and <ed>
4. Remove https links

In [244]:
import re

# Removing RT from a single Tweet
text = "RT @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &amp;"
clean_text = re.sub('RT ','', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &amp;
Text after:
 @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &amp;
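
One caveat with the pattern above: 'RT ' is unanchored, so it deletes those three characters wherever they occur in a tweet, not only at the start. A minimal sketch of an anchored alternative, assuming the retweet marker only ever appears at the very beginning of the tweet (the function names and the sample text are hypothetical, made up for illustration):

```python
import re

def strip_rt_unanchored(text):
    # Same pattern as in the cell above: removes "RT " anywhere in the text.
    return re.sub('RT ', '', text)

def strip_rt_anchored(text):
    # Assumption: a retweet marker is "RT" plus whitespace at the very start.
    return re.sub(r'^RT\s+', '', text)

sample = "RT @user: ALERT people about #demonetization"
print(strip_rt_unanchored(sample))  # @user: ALEpeople about #demonetization
print(strip_rt_anchored(sample))    # @user: ALERT people about #demonetization
```

The anchored version leaves words that merely end in "RT" untouched.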


In [245]:
#replacing the amp symbol with & 

clean_text= re.sub('&amp;', '&', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &amp;
Text after:
 RT @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &
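
&amp; is an HTML character entity, and tweets may contain others such as &lt; or &gt;. As a hedged alternative to handling each entity with its own regex, the standard library's html.unescape decodes named and numeric entities in one call. The sample string below is made up for illustration:

```python
import html

sample = "J&amp;K banks &lt;closed&gt; &amp; ATMs empty"

# html.unescape converts character references back to the characters they stand for
decoded = html.unescape(sample)
print(decoded)  # J&K banks <closed> & ATMs empty
```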


In [246]:
# Removing strings like <U+...> and <ed>

text= "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders."

# raw string, so the backslash in \+ reaches the regex engine unchanged
clean_text= re.sub(r'<U\+[a-zA-Z0-9]+>', '', text)
clean_text= re.sub('<ed>', '', clean_text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders.
Text after:
 @Jaggesh2 Bharat band on 28??Those who  are protesting #demonetization  are all different party leaders.


In [247]:
# Removing https links from a tweet
text= '@DerekScissors1: Indias #demonetization: #Blackmoney a symptom, not the disease https://t.co/HSl6Ihj0Qe via @ambazaarmag'

clean_text= re.sub('https://[a-zA-Z0-9/.]+','', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 @DerekScissors1: Indias #demonetization: #Blackmoney a symptom, not the disease https://t.co/HSl6Ihj0Qe via @ambazaarmag
Text after:
 @DerekScissors1: Indias #demonetization: #Blackmoney a symptom, not the disease  via @ambazaarmag


In [248]:
def preprocessing(text):
    # remove the retweet marker RT
    text= re.sub('RT ','', text)
    # replace &amp; with &
    text= re.sub('&amp;', '&', text)
    # remove strings like <U+...> and <ed>
    text= re.sub(r'<U\+[a-zA-Z0-9]+>', '', text)
    text= re.sub('<ed>', '', text)
    # remove http/https links
    text= re.sub('https?://[a-zA-Z0-9/.]+','', text)

    return text

df['clean_text']= df['text'].apply(preprocessing)
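
The same cleaning can also be expressed with pandas' vectorized string methods instead of apply. This is only a sketch on a two-row toy frame (the rows are made up), chaining Series.str.replace with regex=True and reusing the patterns from the function above:

```python
import pandas as pd

toy = pd.DataFrame({"text": [
    "RT @a: cash &amp; cards <U+00A0><ed> https://t.co/abc123",
    "No markers here",
]})

toy["clean_text"] = (
    toy["text"]
    .str.replace(r"RT ", "", regex=True)                       # retweet marker
    .str.replace(r"&amp;", "&", regex=True)                    # HTML ampersand entity
    .str.replace(r"<U\+[a-zA-Z0-9]+>|<ed>", "", regex=True)    # <U+...> and <ed> debris
    .str.replace(r"https?://[a-zA-Z0-9/.]+", "", regex=True)   # links
)

print(toy["clean_text"].tolist())
```

Both forms should give the same cleaned text; the chained version simply avoids a Python-level function call per row.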
    

In [252]:
df= df.drop('Unnamed: 0', axis=1)

#### Removing duplicate tweets

The dataset contains many duplicate tweets. Because we rank the top 50 mentions and hashtags by count, duplicates would skew the results, so we remove them first.

In [253]:
df[df['text'].duplicated()]

Unnamed: 0,text,clean_text
16,"RT @roshankar: Former FinSec, RBI Dy Governor,...","@roshankar: Former FinSec, RBI Dy Governor, CB..."
19,"RT @roshankar: Former FinSec, RBI Dy Governor,...","@roshankar: Former FinSec, RBI Dy Governor, CB..."
20,RT @Hemant_80: Did you vote on #Demonetization...,@Hemant_80: Did you vote on #Demonetization on...
21,"RT @roshankar: Former FinSec, RBI Dy Governor,...","@roshankar: Former FinSec, RBI Dy Governor, CB..."
22,RT @Atheist_Krishna: BEFORE and AFTER Gandhi j...,@Atheist_Krishna: BEFORE and AFTER Gandhi ji h...
...,...,...
14929,RT @rahulroushan: Prohibition is Demonetizatio...,@rahulroushan: Prohibition is Demonetization 2...
14930,RT @rahulroushan: Prohibition is Demonetizatio...,@rahulroushan: Prohibition is Demonetization 2...
14931,RT @bharat_builder: Lol. Demonetization has fi...,@bharat_builder: Lol. Demonetization has fixed...
14934,RT @Vidyut: How India became Bill Gates' guine...,@Vidyut: How India became Bill Gates' guinea p...


Of the 14,940 tweets, roughly 9.8k are repeats of an earlier tweet and need to be removed.

In [254]:
# .copy() gives an independent frame, so new columns can be added to new_df later
new_df= df.drop_duplicates(subset=['text', 'clean_text']).copy()

In [255]:
len(df), len(new_df)

(14940, 5147)

We now have 5.1k unique tweets for our analysis.
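
To make the de-duplication step concrete, here is a small sketch on a made-up four-row frame. duplicated() flags every repeat after the first occurrence, and drop_duplicates() keeps that first occurrence (keep='first' is the pandas default):

```python
import pandas as pd

toy = pd.DataFrame({"text": ["RT @a: hello", "RT @b: hi", "RT @a: hello", "RT @a: hello"]})

# True for rows whose text already appeared earlier in the frame
is_repeat = toy["text"].duplicated()
print(is_repeat.tolist())  # [False, False, True, True]

# Keep the first occurrence of each text and renumber the rows
unique = toy.drop_duplicates(subset=["text"]).reset_index(drop=True)
print(len(toy), len(unique))  # 4 2
```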

### Find the top 50 mentions in the dataset (a mention here is any @username)

In [256]:
#Extracting the mentions
text= 'RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy'
mentions= re.findall(r'@\w+', text)
mentions

['@Joydas']

In [257]:
new_df['mention']= new_df['clean_text'].apply(lambda x: re.findall(r'@\w+', x))
new_df

Unnamed: 0,text,clean_text,mention
0,RT @rssurjewala: Critical question: Was PayTM ...,@rssurjewala: Critical question: Was PayTM inf...,[@rssurjewala]
1,RT @Hemant_80: Did you vote on #Demonetization...,@Hemant_80: Did you vote on #Demonetization on...,[@Hemant_80]
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...","@roshankar: Former FinSec, RBI Dy Governor, CB...",[@roshankar]
3,RT @ANI_news: Gurugram (Haryana): Post office ...,@ANI_news: Gurugram (Haryana): Post office emp...,[@ANI_news]
4,RT @satishacharya: Reddy Wedding! @mail_today ...,@satishacharya: Reddy Wedding! @mail_today car...,"[@satishacharya, @mail_today]"
...,...,...,...
14933,@thehill To The Hill. Shame on you for your an...,@thehill To The Hill. Shame on you for your an...,[@thehill]
14935,RT @saxenavishakha: Ghost of demonetization re...,@saxenavishakha: Ghost of demonetization retur...,[@saxenavishakha]
14936,N d modi fans-d true nationalists of the count...,N d modi fans-d true nationalists of the count...,[]
14938,RT @Stupidosaur: @Vidyut B team of BJP. CIA ba...,@Stupidosaur: @Vidyut B team of BJP. CIA baby....,"[@Stupidosaur, @Vidyut]"


In [258]:
new_df= new_df.reset_index(drop=True)
new_df

Unnamed: 0,text,clean_text,mention
0,RT @rssurjewala: Critical question: Was PayTM ...,@rssurjewala: Critical question: Was PayTM inf...,[@rssurjewala]
1,RT @Hemant_80: Did you vote on #Demonetization...,@Hemant_80: Did you vote on #Demonetization on...,[@Hemant_80]
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...","@roshankar: Former FinSec, RBI Dy Governor, CB...",[@roshankar]
3,RT @ANI_news: Gurugram (Haryana): Post office ...,@ANI_news: Gurugram (Haryana): Post office emp...,[@ANI_news]
4,RT @satishacharya: Reddy Wedding! @mail_today ...,@satishacharya: Reddy Wedding! @mail_today car...,"[@satishacharya, @mail_today]"
...,...,...,...
5142,@thehill To The Hill. Shame on you for your an...,@thehill To The Hill. Shame on you for your an...,[@thehill]
5143,RT @saxenavishakha: Ghost of demonetization re...,@saxenavishakha: Ghost of demonetization retur...,[@saxenavishakha]
5144,N d modi fans-d true nationalists of the count...,N d modi fans-d true nationalists of the count...,[]
5145,RT @Stupidosaur: @Vidyut B team of BJP. CIA ba...,@Stupidosaur: @Vidyut B team of BJP. CIA baby....,"[@Stupidosaur, @Vidyut]"


In [261]:
mention_count= {}

for i in range(len(new_df)):
    for j in new_df.loc[i,'mention']:
        if j in mention_count:
            mention_count[j]+=1
        else:
            mention_count[j]=1

In [262]:
mention_df= pd.DataFrame({'Mention': list(mention_count.keys()), 'Count_of_Mentions':list(mention_count.values()) })
mention_df.sort_values(by='Count_of_Mentions', ascending= False).head(50)

Unnamed: 0,Mention,Count_of_Mentions
11,@narendramodi,337
416,@YouTube,143
38,@PMOIndia,103
107,@ArvindKejriwal,84
29,@arunjaitley,41
1052,@evanspiegel,36
1097,@arvindsubraman,35
149,@ndtv,30
193,@RBI,26
49,@timesofindia,24


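The above gives the top 50 mentions by count among the unique tweets (only the first rows are shown here). @narendramodi leads with 337 mentions, followed by @YouTube with 143.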
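
The dictionary-based tally can also be written with collections.Counter from the standard library. A sketch on a made-up list of per-tweet mention lists (mention_lists and mention_counter are hypothetical names standing in for the notebook's data):

```python
from collections import Counter

# Stand-in for new_df['mention']: one list of handles per tweet
mention_lists = [["@a", "@b"], ["@a"], [], ["@c", "@a"]]

mention_counter = Counter()
for handles in mention_lists:
    mention_counter.update(handles)  # adds 1 for each handle in the list

# most_common(n) returns the n entries with the highest counts
print(mention_counter.most_common(1))  # [('@a', 3)]
```

In the notebook's setting, most_common(50) would play the role of sort_values(...).head(50).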

### Find the top 50 most frequently used hashtags (#)

This is similar to the task above; the key difference is that the regex uses # instead of @ to find the hashtags.

In [263]:
text= 'RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy'
hashtags= re.findall(r'#\w+', text)
hashtags

['#DeMonetization']

In [264]:
new_df['hashtag']= new_df['clean_text'].apply(lambda x: re.findall(r'#\w+', x))
new_df

Unnamed: 0,text,clean_text,mention,hashtag
0,RT @rssurjewala: Critical question: Was PayTM ...,@rssurjewala: Critical question: Was PayTM inf...,[@rssurjewala],[#Demonetization]
1,RT @Hemant_80: Did you vote on #Demonetization...,@Hemant_80: Did you vote on #Demonetization on...,[@Hemant_80],[#Demonetization]
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...","@roshankar: Former FinSec, RBI Dy Governor, CB...",[@roshankar],[#Demonetization]
3,RT @ANI_news: Gurugram (Haryana): Post office ...,@ANI_news: Gurugram (Haryana): Post office emp...,[@ANI_news],[#demonetization]
4,RT @satishacharya: Reddy Wedding! @mail_today ...,@satishacharya: Reddy Wedding! @mail_today car...,"[@satishacharya, @mail_today]","[#demonetization, #ReddyWedding]"
...,...,...,...,...
5142,@thehill To The Hill. Shame on you for your an...,@thehill To The Hill. Shame on you for your an...,[@thehill],[]
5143,RT @saxenavishakha: Ghost of demonetization re...,@saxenavishakha: Ghost of demonetization retur...,[@saxenavishakha],[]
5144,N d modi fans-d true nationalists of the count...,N d modi fans-d true nationalists of the count...,[],[]
5145,RT @Stupidosaur: @Vidyut B team of BJP. CIA ba...,@Stupidosaur: @Vidyut B team of BJP. CIA baby....,"[@Stupidosaur, @Vidyut]",[]


We notice that the demonetization hashtag appears in several casings (#Demonetization, #demonetization, #DeMonetization). To get the top 50 hashtags, we make the count case-insensitive, which gives a better picture of the topics being discussed through the hashtags.

In [265]:
hashtag_count= {}

for i in range(len(new_df)):
    for j in new_df.loc[i,'hashtag']:
        if j.lower() in hashtag_count: #taking lower case
            hashtag_count[j.lower()]+=1
        else:
            hashtag_count[j.lower()]=1

In [266]:
hashtag_df= pd.DataFrame({'Hashtag': list(hashtag_count.keys()), 'Count_of_Hashtag':list(hashtag_count.values()) })
hashtag_df.sort_values(by='Count_of_Hashtag', ascending= False).head(50)

Unnamed: 0,Hashtag,Count_of_Hashtag
0,#demonetization,2889
29,#india,165
2,#blackmoney,73
87,#demonetisation,46
50,#modi,46
6,#narendramodi,32
808,#youtube,28
51,#bjp,28
279,#news,25
88,#rbi,19


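The above are the most frequent hashtags. #demonetization dominates with 2,889 uses, far ahead of #india with 165. Note that the spelling variant #demonetisation is counted separately (46).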
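
The case-insensitive count can also be sketched with pandas alone: explode the list column to one hashtag per row, lower-case it, and count. The toy frame below is made up, and this is an alternative to the loop above rather than a replacement for it:

```python
import pandas as pd

toy = pd.DataFrame({"hashtag": [
    ["#Demonetization", "#India"],
    ["#demonetization"],
    [],
    ["#DeMonetization"],
]})

counts = (
    toy["hashtag"]
    .explode()        # one row per hashtag; empty lists become NaN
    .dropna()
    .str.lower()      # fold case so casing variants share one key
    .value_counts()   # sorted by count, descending
)
print(counts.head(50))
```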

### Find the tweets that mention the Prime Minister



In [267]:
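def pm_mention(text):

    # case-insensitive patterns for references to the Prime Minister
    patterns = [r'(?i)prime minister',
               r'(?i)pm',
               r'(?i)modi',
              ]

    flag = 0
    for pat in patterns:
        # re.search matches anywhere in the text, including inside longer words
        if re.search(pat, text) is not None:
            flag = 1
            break
    return flag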

# apply function
new_df['mentions_PM'] = new_df['clean_text'].apply(pm_mention)

We have created a list of patterns covering different references to the Prime Minister: pm, prime minister and modi. The patterns are case-insensitive. Because they are unanchored, they also match inside longer words (for example, pm matches in "development"), so some flagged tweets are false positives. The column 'mentions_PM' is set to 1 if the tweet matches any of the patterns.
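
A hedged sketch of a stricter variant that puts word boundaries (\b) around pm, so it only matches as a standalone word. The names PM_PATTERN and pm_mention_strict are hypothetical, and leaving modi unanchored (so that handles such as @narendramodi still count) is an assumption of this sketch, not something the notebook prescribes:

```python
import re

# Assumption: only "pm" needs word boundaries; "modi" stays unanchored so that
# handles such as @narendramodi are still picked up.
PM_PATTERN = re.compile(r'prime\s+minister|\bpm\b|modi', flags=re.IGNORECASE)

def pm_mention_strict(text):
    # 1 if the text refers to the Prime Minister, else 0
    return int(PM_PATTERN.search(text) is not None)

print(pm_mention_strict("This is what called development."))  # 0
print(pm_mention_strict("edict by PM? It's clearly fishy"))   # 1
print(pm_mention_strict("leaders are with @narendramodi"))    # 1
```

Applying this instead of pm_mention would change the flagged share reported in the notebook, so the two numbers should not be compared directly.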

In [270]:
new_df[new_df['mentions_PM']==1]

Unnamed: 0,text,clean_text,mention,hashtag,mentions_PM
0,RT @rssurjewala: Critical question: Was PayTM ...,@rssurjewala: Critical question: Was PayTM inf...,[@rssurjewala],[#Demonetization],1
1,RT @Hemant_80: Did you vote on #Demonetization...,@Hemant_80: Did you vote on #Demonetization on...,[@Hemant_80],[#Demonetization],1
8,RT @sumitbhati2002: Many opposition leaders ar...,@sumitbhati2002: Many opposition leaders are w...,"[@sumitbhati2002, @narendramodi]",[#Demonetization],1
10,Many opposition leaders are with @narendramodi...,Many opposition leaders are with @narendramodi...,[@narendramodi],[#Demonetization],1
11,RT @Joydas: Question in Narendra Modi App wher...,@Joydas: Question in Narendra Modi App where P...,[@Joydas],[#DeMonetization],1
...,...,...,...,...,...
5061,@arunjaitley @FinMinIndia @PMOIndia No machine...,@arunjaitley @FinMinIndia @PMOIndia No machine...,"[@arunjaitley, @FinMinIndia, @PMOIndia, @TheOf...",[],1
5062,@arunjaitley @FinMinIndia @PMOIndia No money i...,@arunjaitley @FinMinIndia @PMOIndia No money i...,"[@arunjaitley, @FinMinIndia, @PMOIndia, @CorpB...",[],1
5097,Currency operater s are still active.#demoneti...,Currency operater s are still active.#demoneti...,"[@adhia03, @PMOIndia]",[#demonetization],1
5107,@TOIIndiaNews This is what called development....,@TOIIndiaNews This is what called development....,[@TOIIndiaNews],[],1


In [275]:
len(new_df[new_df['mentions_PM']==1])/len(new_df) #21% of the tweets make a reference to the Prime Minister.

0.20652807460656694

### Use of prepositions


In [276]:
import spacy
nlp= spacy.load('en_core_web_sm', disable= ['ner', 'textcat'])

In [279]:
text= "@rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure"
from spacy import displacy
displacy.render(nlp(text), style='dep',jupyter=True)

In [280]:
# rule for identifying noun-preposition-noun relationships
def rule(text):

    doc = nlp(text)

    sent = []

    for token in doc:

        # look for prepositions
        if token.pos_=='ADP':

            phrase = ''

            # if its head word is a noun
            if token.head.pos_=='NOUN':

                # append noun and preposition to phrase
                phrase += token.head.text
                phrase += ' '+token.text

                # check the nodes to the right of the preposition
                for right_tok in token.rights:
                    # append if it is a noun or proper noun
                    if (right_tok.pos_ in ['NOUN','PROPN']):
                        phrase += ' '+right_tok.text

                if len(phrase)>2:
                    sent.append(phrase)

    return sent

In [282]:
# create a df containing sentence and its output for rule
row_list = []

for i in range(len(new_df)):

    sent = new_df.loc[i,'clean_text']
    

    # rule
    output = rule(sent)

    dict1 = {'Tweet':sent,'Output':output}
    row_list.append(dict1)

df_rule = pd.DataFrame(row_list)

In [283]:
df_rule.head()

Unnamed: 0,Tweet,Output
0,@rssurjewala: Critical question: Was PayTM inf...,"[edict about, edict by PM]"
1,@Hemant_80: Did you vote on #Demonetization on...,[]
2,"@roshankar: Former FinSec, RBI Dy Governor, CB...",[]
3,@ANI_news: Gurugram (Haryana): Post office emp...,[patients in demonetization]
4,@satishacharya: Reddy Wedding! @mail_today car...,[]


In [284]:
# select non-empty outputs
# (DataFrame.append was removed in pandas 2.0, so filter with a boolean mask instead)
df_show = df_rule[df_rule['Output'].apply(len) != 0]

# reset the index
df_show = df_show.reset_index(drop=True)

In [318]:
df_show.head(25)

Unnamed: 0,Tweet,Output
0,@rssurjewala: Critical question: Was PayTM inf...,"[edict about, edict by PM]"
1,@ANI_news: Gurugram (Haryana): Post office emp...,[patients in demonetization]
2,@DerekScissors1: Indias #demonetization: #Bla...,[disease via @ambazaarmag]
3,@gauravcsawant: Rs 40 lakh looted from a bank ...,[bank in Kishtwar]
4,@Joydeep_911: Calling all Nationalists to join...,"[benefits of Demonetization, Demonetization am..."
5,National reform now destroyed even the essence...,[essence of sagan]
6,Many opposition leaders are with @narendramodi...,[b'coz of party]
7,@Joydas: Question in Narendra Modi App where P...,[Question in App]
8,@Jaggesh2 Bharat band on 28??Those who are pr...,[band on]
9,@Atheist_Krishna: The effect of #Demonetizatio...,[effect of Demonetization]


In [316]:
# separate noun, preposition and noun

prep_dict = dict()
dis_list = []

# iterating over all the tweets
for i in range(len(df_show)):

    # tweet containing the output
    sentence = df_show.loc[i,'Tweet']
    # phrases extracted from the tweet
    output = df_show.loc[i,'Output']

    # iterating over all the phrases from the tweet
    for sent in output:

        # separate first noun, preposition and following noun(s)
        parts = sent.split()
        n1, p, n2 = parts[0], parts[1], parts[2:]

        # append to list, along with the tweet
        dis_dict = {'Tweet':sentence,'Noun1':n1,'Preposition':p,'Noun2':n2}
        dis_list.append(dis_dict)

        # counting the number of phrases containing the preposition
        if p in prep_dict:
            prep_dict[p]+=1
        else:
            prep_dict[p]=1

df_sep= pd.DataFrame(dis_list)
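
Once the phrases are split, ranking the prepositions is a short step. A sketch using collections.Counter on a hand-written list of phrases in the same noun-preposition-noun shape as the rule's output (the list is illustrative, copied from the sample outputs shown earlier; prep_counter is a hypothetical name):

```python
from collections import Counter

phrases = ["edict by PM", "bank in Kishtwar", "effect of Demonetization",
           "essence of sagan", "Question in App"]

# The preposition is the second whitespace-separated token of each phrase
prep_counter = Counter(phrase.split()[1] for phrase in phrases)

# Prepositions with their counts, highest first
print(prep_counter.most_common())
```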

In [317]:
df_sep[df_sep['Preposition']=='for'] # where the preposition is 'for'

Unnamed: 0,Tweet,Noun1,Preposition,Noun2
22,@dineshgrao you have 12.5 k followers yet you ...,likes,for,[tweets]
33,@ModiBharosa: Huge support for PM @narendramod...,support,for,[PM]
94,@MDiplomacyWORLD: Acid test for #Modi and #BJP...,test,for,[Modi]
95,Acid test for #Modi and #BJP: #Demonetization ...,test,for,[Modi]
109,@SriSri Ravi Shankar roots for #demonetizatio...,roots,for,[demonetization]
...,...,...,...,...
3286,Demonetization: Another chance for deposit of ...,chance,for,[deposit]
3291,@mostlyeconomics: Did India get the idea for i...,idea,for,[demonetization]
3363,#India #oil demand shrinks for third month ami...,shrinks,for,[month]
3399,@HDFCBank_Cares The issue is not shortage but ...,charge,for,[withdrawal]


In [289]:
df_sep[df_sep['Preposition']=='against'] # where the preposition is 'against'

Unnamed: 0,Tweet,Noun1,Preposition,Noun2
24,@harshkkapoor: #DeMonetization survey results ...,fight,against,[@na]
37,@bhaiyyajispeaks: Here @sardesairajdeep strugg...,answer,against,[Demonetization]
47,@ashu3page: A man shaved his head at Jantar-Ma...,protest,against,[Demonetization]
67,@IndiaFactsOrg: #Demonetization: The Ultimate ...,Weapon,against,[economy]
112,Do you support the Opposition's agitation agai...,agitation,against,[demonetization]
...,...,...,...,...
3091,#India #National CBI files case against manage...,case,against,[managers]
3100,CBI files case against managers of Bank of Mah...,case,against,[managers]
3103,CBI files case against managers of Bank of Mah...,case,against,[managers]
3114,I think pm should take action against pakistan...,action,against,[ur]
