# Topic Sentiment Analysis in Twitter: A Graph-based Hashtag Sentiment Classification Approach

## Objective
The objective is to build a tweet-level sentiment classifier that combines two scores: one from **TextBlob**, a lexicon-based classifier, and one from a **HashTag sentiment classifier**. The two scores are combined to assign a sentiment class to each tweet.

* **TextBlob:** a Python library that returns a sentiment/polarity score for each input string/tweet.
* **HashTag Sentiment Classifier:** based on the belief propagation approach described in the paper:
> __[Topic Sentiment Analysis in Twitter: A Graph-based Hashtag Sentiment Classification Approach](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.462.3827&rep=rep1&type=pdf)__

The idea is to assign a sentiment value to each hashtag based on the sentiment values of its neighbouring (co-occurring) hashtags. For example, a hashtag that is frequently used together with hashtags of negative sentiment is itself likely to carry negative sentiment.

The task of assigning a sentiment label ['neg', 'neutral', 'pos'] to each hashtag is split across two scripts:

1. **input_HG:** computes the sentiment probability of each hashtag from the sentiments of the tweets in which it occurs, and the co-occurrence ratio for each pair of hashtags.
2. **LBP:** runs loopy belief propagation to assign a sentiment value to each hashtag based on its neighbouring/co-occurring hashtags.
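
To make the second step more concrete, here is a minimal sketch of sum-product loopy belief propagation on a tiny hashtag graph. It is not the project's **LBP** script and does not reproduce the potentials used in the paper: the edge potential, the `strength` parameter, and the toy hashtags are all assumptions chosen for illustration.

```python
import math

LABELS = ("neg", "neutral", "pos")


def loopy_bp(priors, edges, strength=5.0, n_iter=20):
    """Toy sum-product loopy belief propagation over a hashtag graph.

    priors:   {hashtag: {label: probability}}, used as node potentials
    edges:    {(h1, h2): weight}, e.g. a co-occurrence ratio
    strength: hypothetical factor controlling how strongly neighbours
              prefer to share a label (an assumption of this sketch)
    """
    # Undirected neighbour sets.
    nbrs = {h: set() for h in priors}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)

    def psi(a, b, label_a, label_b):
        # Assumed edge potential: equal labels are favoured more on heavier edges.
        w = edges[(a, b)] if (a, b) in edges else edges[(b, a)]
        return math.exp(strength * w) if label_a == label_b else 1.0

    # msgs[(i, j)][label] is the message sent from hashtag i to hashtag j.
    msgs = {(i, j): {l: 1.0 / len(LABELS) for l in LABELS}
            for i in nbrs for j in nbrs[i]}

    for _ in range(n_iter):
        new_msgs = {}
        for (i, j) in msgs:
            out = {}
            for lj in LABELS:
                total = 0.0
                for li in LABELS:
                    # Product of messages into i from all neighbours except j.
                    incoming = 1.0
                    for k in nbrs[i] - {j}:
                        incoming *= msgs[(k, i)][li]
                    total += priors[i][li] * psi(i, j, li, lj) * incoming
                out[lj] = total
            z = sum(out.values())
            new_msgs[(i, j)] = {l: v / z for l, v in out.items()}
        msgs = new_msgs  # synchronous update of all messages

    # Belief = node potential times all incoming messages, normalised.
    beliefs = {}
    for i in priors:
        b = {l: priors[i][l] for l in LABELS}
        for k in nbrs[i]:
            for l in LABELS:
                b[l] *= msgs[(k, i)][l]
        z = sum(b.values())
        beliefs[i] = {l: v / z for l, v in b.items()}
    return beliefs


# Toy data: tag_c leans slightly positive on its own but co-occurs
# with two clearly negative hashtags.
priors = {
    "tag_a": {"neg": 0.8, "neutral": 0.1, "pos": 0.1},
    "tag_b": {"neg": 0.8, "neutral": 0.1, "pos": 0.1},
    "tag_c": {"neg": 0.3, "neutral": 0.3, "pos": 0.4},
}
edges = {("tag_a", "tag_c"): 0.4, ("tag_b", "tag_c"): 0.4}

beliefs = loopy_bp(priors, edges)
label_c = max(beliefs["tag_c"], key=beliefs["tag_c"].get)
print(label_c)
```

In this toy graph the negative neighbours outweigh the weak positive prior of `tag_c`, which is the behaviour described above: a hashtag used alongside negative hashtags is pulled towards a negative label.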

Finally, a classifier that combines the TextBlob output with the HashTag sentiment classifier is used in *similarity_graph.get_hashtag_polarity()* to output a sentiment for each tweet.

**Note:** only the most popular hashtags (~1000) are considered, to keep the running time realistic.

## Tools/Technology Used

* **pandas:** data processing
* **re:** finding matching patterns in strings
* **TextBlob:** library used to extract the sentiment of the tweets
* **itertools:** generating all pairwise combinations from a list of values

## Output
The output of the script is:
* A list of the most popular/frequently used hashtags.
* The sentiment probability of each hashtag in the list, based on the polarity of the tweets in which the hashtag occurred.
* The co-occurrence ratio for each pair of hashtags, i.e.

\begin{equation*}
\frac{\#(H_1,H_2)}{\#(H_1) + \#(H_2)}
\end{equation*}
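
As a small worked example of this ratio, the sketch below computes it from two sets of tweet ids. The function name `cooccurrence_ratio` and the ids are made up for illustration, and both sets are assumed to be non-empty.

```python
def cooccurrence_ratio(ids_h1, ids_h2):
    """Return #(H1, H2) / (#(H1) + #(H2)) for two sets of tweet ids."""
    both = len(ids_h1 & ids_h2)  # tweets that contain both hashtags
    return both / (len(ids_h1) + len(ids_h2))


h1_ids = {1, 2, 3, 4}  # tweets containing hashtag H1
h2_ids = {3, 4, 5}     # tweets containing hashtag H2

ratio = cooccurrence_ratio(h1_ids, h2_ids)  # 2 / (4 + 3)
print(round(ratio, 4))

# Two hashtags that always appear together reach the maximum value of 0.5.
always_together = cooccurrence_ratio({1, 2}, {1, 2})
```

Because the denominator adds the two counts rather than taking their union, the ratio ranges from 0 (never together) to 0.5 (always together).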


In [20]:
import re
import pandas as pd
from textblob import TextBlob
import itertools

In [10]:
def isNaN(num):
    return num != num

In [11]:
df = pd.read_csv('humans.csv', sep=';', index_col='id', usecols = ['id', 'source', 'is_retweet', 'hashtags', 'lemma_filter'])

### Read and Preprocess Data

We keep only records that are not retweets, that have non-empty processed text (*lemma_filter*), and that contain at least one hashtag. A random 10% sample of the non-retweets is used. The hashtags of each record are then lowercased, de-duplicated, and stored as a list in a separate column.
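
The hashtag extraction can be tried on a single string before it is applied to the whole frame. The sketch below uses the same regular expression and length filter as the cell that follows; the helper name `extract_hashtags` is hypothetical, and the result is sorted only to make the output deterministic (the notebook keeps an unordered list).

```python
import re


def extract_hashtags(raw):
    """Lowercase a raw hashtag string and return its unique tokens."""
    tokens = re.findall(r"[\w']+", raw.lower())
    # Drop single-character tokens and duplicates.
    return sorted({t for t in tokens if len(t) > 1})


tags = extract_hashtags("#Metoo, #Aidtoo, #GVAPeaceWeek, #MeToo")
print(tags)  # ['aidtoo', 'gvapeaceweek', 'metoo']
```

Lowercasing before de-duplication is what merges variants such as `#Metoo` and `#MeToo` into a single hashtag.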

In [12]:
df0 = df[df['is_retweet'] == False].sample(frac=0.1)
# Keep rows with non-missing text and hashtags; .copy() avoids SettingWithCopyWarning below.
df0_1 = df0[df0.apply(lambda x: (isNaN(x['lemma_filter'])==False) and (isNaN(x['hashtags'])==False), axis=1)].copy()
# Lowercased, de-duplicated hashtag tokens longer than one character.
df0_1['hashtags_list'] = df0_1.apply(lambda x: [i for i in set(re.findall(r"[\w']+", x['hashtags'].lower())) if len(i) > 1], axis=1)
df1 = df0_1[df0_1.apply(lambda x: len(x['hashtags_list']) > 0, axis=1)].copy()



In [13]:
df1.head(5)

| id | source | is_retweet | hashtags | lemma_filter | hashtags_list |
|---|---|---|---|---|---|
| 183325 | instagram | False | #GirlsLeadTheWorld, #BeYourOwnMuse, #FeministA... | yo quiero que pierdas la cabeza por mí . quier... | [weloversize, womensupportingwomen, feministar... |
| 163851 | instagram | False | #selfie, #glasses, #cute, #asmrvideo, #420, #i... | imperfection www.younow.com/behappybeingyou ww... | [tarot, cute, music, maydayparade, poetry, tre... |
| 336296 | instagram | False | #TanushreeDutta, #MeToo, #metoomovement, #incr... | accusation trigger movement india ? depend u ,... | [toxicmasculinity, metoomovement, genderequali... |
| 254932 | instagram | False | #Metoo, #Aidtoo, #GVAPeaceWeek | morning 's event un peacekeeping : great acco... | [aidtoo, gvapeaceweek, metoo] |
| 5236 | twitter | False | MeToo, jimacosta | silence white house beauty accost rag chauvini... | [jimacosta, metoo] |


### Use TextBlob to extract the sentiment from the tweets.

* **sentiment_polarity:** the polarity value in [-1, 1] predicted by TextBlob (*a lexicon-based sentiment prediction library*) for each tweet.

* **sentiment:** the polarity converted into a categorical value: *'neg' (sentiment_polarity < 0), 'neutral' (sentiment_polarity = 0), 'pos' (sentiment_polarity > 0)*.
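
The mapping from polarity to label can be written as a small standalone function. This is only a sketch of the rule stated above; the name `polarity_to_label` is not part of the notebook.

```python
def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to 'neg', 'neutral' or 'pos'."""
    if polarity > 0:
        return "pos"
    if polarity < 0:
        return "neg"
    return "neutral"  # exactly zero: no lexicon evidence either way


labels = [polarity_to_label(p) for p in (0.394444, 0.0, -0.2)]
print(labels)  # ['pos', 'neutral', 'neg']
```

Only a score of exactly 0 is treated as neutral, so even a very weak positive or negative polarity moves a tweet out of the neutral class.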

In [14]:
df2 = df1.copy()  # work on a copy so df1 is left unchanged and no SettingWithCopyWarning is raised
df2['sentiment_polarity'] = df2['lemma_filter'].apply(lambda text: TextBlob(text).sentiment.polarity)
df2['sentiment'] = df2['sentiment_polarity'].apply(lambda p: 'pos' if p > 0 else 'neg' if p < 0 else 'neutral')



In [15]:
df2.head(5)

| id | source | is_retweet | hashtags | lemma_filter | hashtags_list | sentiment_polarity | sentiment |
|---|---|---|---|---|---|---|---|
| 183325 | instagram | False | #GirlsLeadTheWorld, #BeYourOwnMuse, #FeministA... | yo quiero que pierdas la cabeza por mí . quier... | [weloversize, womensupportingwomen, feministar... | 0.0 | neutral |
| 163851 | instagram | False | #selfie, #glasses, #cute, #asmrvideo, #420, #i... | imperfection www.younow.com/behappybeingyou ww... | [tarot, cute, music, maydayparade, poetry, tre... | 0.394444 | pos |
| 336296 | instagram | False | #TanushreeDutta, #MeToo, #metoomovement, #incr... | accusation trigger movement india ? depend u ,... | [toxicmasculinity, metoomovement, genderequali... | 0.483333 | pos |
| 254932 | instagram | False | #Metoo, #Aidtoo, #GVAPeaceWeek | morning 's event un peacekeeping : great acco... | [aidtoo, gvapeaceweek, metoo] | 0.2 | pos |
| 5236 | twitter | False | MeToo, jimacosta | silence white house beauty accost rag chauvini... | [jimacosta, metoo] | 0.0 | neutral |


### Extract the Sentiment Probability *[hashtag_dic]* and HashTag Set *[hashtags_set]*
**NUM:** only hashtags occurring more than this many times are considered for hashtag sentiment analysis *(to limit computation time)*.
<br>**hashtag_dic:** dictionary mapping each *(sentiment, hashtag)* pair to its sentiment probability, i.e. the fraction of the hashtag's occurrences that fall in tweets with that sentiment.
<br>**hashtags_set:** set of hashtags satisfying the NUM condition. Hashtags containing 'metoo' are excluded.
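
The sentiment probability can be illustrated without pandas on a handful of made-up records. The sketch below is an assumption-laden miniature of the cell that follows: the function name `hashtag_sentiment_probs`, the records, and the threshold of 2 are invented for the example.

```python
from collections import Counter


def hashtag_sentiment_probs(records, min_count):
    """Return {(sentiment, hashtag): probability} for frequent hashtags.

    records:   iterable of (sentiment, list_of_hashtags) pairs
    min_count: a hashtag is kept only if it occurs MORE than this many times
    """
    pair_counts = Counter()  # occurrences per (sentiment, hashtag)
    tag_counts = Counter()   # total occurrences per hashtag
    for sentiment, tags in records:
        for tag in set(tags):
            pair_counts[sentiment, tag] += 1
            tag_counts[tag] += 1
    return {
        (sentiment, tag): count / tag_counts[tag]
        for (sentiment, tag), count in pair_counts.items()
        if tag_counts[tag] > min_count
    }


records = [
    ("pos", ["timesup", "equality"]),
    ("neg", ["timesup"]),
    ("pos", ["timesup"]),
    ("neutral", ["equality"]),
]
probs = hashtag_sentiment_probs(records, min_count=2)
print(probs)
```

Here `timesup` occurs three times and is kept with probabilities 2/3 for 'pos' and 1/3 for 'neg', while `equality` occurs only twice and is dropped by the threshold.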

In [24]:
NUM = 50  # occurrence threshold: a hashtag must appear more than NUM times to be kept
# # Only selecting positive and negative sentiment
# # df3 = df2[df2['sentiment'] != 'neutral'].set_index(['sentiment'])['hashtags_list'].apply(pd.Series).stack()

# Select all three sentiments
df3 = df2.set_index(['sentiment'])['hashtags_list'].apply(pd.Series).stack()
df3 = df3.reset_index()
df3.columns = ['sentiment', 'sample_num','hashtag']
df3_1 = df3.groupby(['hashtag', 'sentiment']).size().reset_index(name='counts')
df3_2 = df3.groupby(['hashtag']).size().reset_index(name='counts')
df3_merged = pd.merge(df3_1, df3_2, on='hashtag', how='inner')
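result = df3_merged[df3_merged['counts_y'] > NUM].copy()  # copy avoids SettingWithCopyWarning
# Sentiment probability = occurrences with this sentiment / total occurrences of the hashtag
result['probe'] = result.apply(lambda x: float(x['counts_x'])/x['counts_y'], axis=1)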
result = result[result.apply(lambda x: 'metoo' not in x['hashtag'], axis=1)]
# result

# Map (sentiment, hashtag) -> sentiment probability
hashtag_dic = {}
for index, row in result.iterrows():
    hashtag_dic[row['sentiment'], row['hashtag']] = row['probe']

# Set of hashtags that passed the NUM filter
hashtags_set = set(result['hashtag'].values)
print('Number of hashtags considered = {}'.format(len(hashtags_set)))



Number of hashtags considered = 1054


### Extract the co-occurrence ratio *[SI]*

In [26]:
# Map each hashtag to the set of record ids in which it occurs.
ht_dic = {}
for idx, tags in df1['hashtags_list'].items():
    for tag in tags:
        ht_dic.setdefault(tag, set()).add(idx)

# Co-occurrence ratio #(H1, H2) / (#(H1) + #(H2)) for every pair that co-occurs at least once.
SI = {}
for (i, j) in itertools.combinations(list(hashtags_set), 2):
    if i not in ht_dic or j not in ht_dic:
        continue  # skip the pair instead of reusing counts from a previous iteration
    i_cnt = len(ht_dic[i])
    j_cnt = len(ht_dic[j])
    i_j_cnt = len(ht_dic[i].intersection(ht_dic[j]))
    if i_j_cnt > 0:
        SI[i, j] = float(i_j_cnt)/(i_cnt + j_cnt)



In [27]:
import pickle
with open('hashtag.pickle', 'wb') as handle:
    pickle.dump([list(hashtags_set), hashtag_dic, SI], handle, protocol=pickle.HIGHEST_PROTOCOL)
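
A downstream script would need to unpack the three objects in the order they were dumped. The sketch below round-trips a tiny stand-in payload through a temporary file to show that order; the sample values are invented, and how the actual **LBP** script reads the file is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Stand-in data with the same shapes as the notebook's objects (values invented).
hashtags = ["timesup", "equality"]
hashtag_dic = {("pos", "timesup"): 0.6, ("neg", "timesup"): 0.4}
SI = {("timesup", "equality"): 0.25}

# Write to a temporary location so the example does not touch hashtag.pickle.
path = os.path.join(tempfile.mkdtemp(), "hashtag.pickle")
with open(path, "wb") as handle:
    pickle.dump([hashtags, hashtag_dic, SI], handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read back: the list unpacks as [hashtag list, sentiment probabilities, co-occurrence ratios].
with open(path, "rb") as handle:
    loaded_hashtags, loaded_dic, loaded_SI = pickle.load(handle)

print(loaded_hashtags, loaded_dic[("pos", "timesup")], loaded_SI)
```

Tuple keys such as `('pos', 'timesup')` survive pickling unchanged, so the dictionaries can be indexed the same way after loading.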