## Objective: 
The objective is to build a sentiment classifier that combines a lexicon-based classifier, **TextBlob**, with a **HashTag sentiment classifier**, merging the sentiment scores of the two to give each tweet an accurate sentiment classification.

* **TextBlob:** a Python library that assigns a sentiment/polarity score to each input string/tweet.
* **HashTag Sentiment Classifier:** based on the belief propagation approach described in the paper:
> __[Topic Sentiment Analysis in Twitter: A Graph-based Hashtag Sentiment Classification Approach](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.462.3827&rep=rep1&type=pdf)__

The idea is to assign a sentiment value to each hashtag based on the sentiment values of its neighbouring/co-occurring hashtags; i.e., a hashtag that is frequently used together with hashtags of negative sentiment will most likely have a negative sentiment value itself.

The task of assigning a sentiment label ['neg', 'neutral', 'pos'] to each hashtag is split across two scripts.

1. **input_HG:** extracts the sentiment probability of each hashtag from the sentiments of the tweets in which it occurs, and extracts the co-occurrence ratio for each pair of hashtags.
2. **LBP:** runs loopy belief propagation to assign a sentiment value to each hashtag based on its neighbouring/co-occurring hashtags.


Finally, a sentiment classifier that combines the TextBlob output with the HashTag sentiment classifier is used in *similarity_graph.get_hashtag_polarity()* to output the sentiment of each tweet.

**Note:** only the most popular hashtags (~1000) are considered, to keep the running time realistic.
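As a rough illustration of restricting the graph to the most popular hashtags, the sketch below counts how many tweets each hashtag appears in and keeps the top `k`. The function name `most_popular_hashtags`, the input layout (one list of hashtags per tweet), and the toy data are assumptions made for this example; the actual selection happens in *input_HG*, which is not shown here.

```python
from collections import Counter

def most_popular_hashtags(tweet_hashtags, k=1000):
    """Return the k hashtags that occur in the largest number of tweets.

    tweet_hashtags: iterable of hashtag lists, one list per tweet (assumed layout).
    """
    # Count each hashtag at most once per tweet
    counts = Counter(tag for tags in tweet_hashtags for tag in set(tags))
    return [tag for tag, _ in counts.most_common(k)]

# Toy data, purely illustrative
tweets = [["love", "happy"], ["love", "sad"], ["love"], ["happy"]]
top2 = most_popular_hashtags(tweets, k=2)
print(top2)
```

Counting each hashtag once per tweet is a design choice for this sketch; counting every occurrence would be an equally plausible reading of "most popular".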

## Tools/Technology Used

* **Pickle:** for loading Python objects from file
* **NumPy:** for creating matrices and performing matrix operations
* **time:** for reading the system time
* **os:** for resolving operating system paths

## Output 
The output of the script is a file containing the **hashtag ID, hashtag, final label for the hashtag, and the [positive, negative, neutral] probabilities**.

In [11]:
!cat ouput.results | head -n 5

ID	Node	Label	pos	neg	neutral
0	thisisnotconsent	neutral	0.0	0.0	1.0
1	bhfyp	pos	0.9999999999999999	0.0	1.879366699436796e-41
2	igers	neutral	1.1487926839616725e-21	0.0	1.0
3	suicide	pos	1.0	0.0	0.0


In [12]:
import pickle
import numpy as np
import time
from os.path import abspath

### Loading the output of *input_HG*.
* **nodes:** List of most popular/frequently used hashtags.
* **PI:** Sentiment probability for each hashtag in the list, based on the polarity values of the tweets in which the hashtag occurred.
* **SI:** Co-occurrence ratio between a pair of hashtags, i.e.

\begin{equation*}
\frac{\#(H_1,H_2)}{\#(H_1) + \#(H_2)}
\end{equation*}
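To make the ratio concrete, here is a minimal sketch that computes it from per-tweet hashtag lists. It assumes that #(H) is the number of tweets containing H and #(H1, H2) is the number of tweets containing both; the function name `cooccurrence_ratio` and the toy tweets are hypothetical, and the real computation lives in *input_HG*.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_ratio(tweet_hashtags):
    """Map each co-occurring hashtag pair to #(H1,H2) / (#(H1) + #(H2))."""
    single = Counter()  # number of tweets containing each hashtag
    pair = Counter()    # number of tweets containing both hashtags of a pair
    for tags in tweet_hashtags:
        unique = sorted(set(tags))  # sort so each pair has one canonical order
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {(h1, h2): n / (single[h1] + single[h2]) for (h1, h2), n in pair.items()}

# Toy data, purely illustrative
tweets = [["love", "happy"], ["love", "sad"], ["love", "happy"]]
ratios = cooccurrence_ratio(tweets)
print(ratios)
```

Here "happy" and "love" appear together in 2 tweets, with 2 and 3 occurrences respectively, giving 2 / (2 + 3) = 0.4.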

In [13]:
labels = ['pos', 'neg', 'neutral']
with open("hashtag.pickle", "rb") as pickle_in: # close the file once loading is done
    nodes, PI, SI = pickle.load(pickle_in)
num_nodes = len(nodes)
print("Number of nodes = {}".format(num_nodes))

Number of nodes = 1054


### Timer to measure the seconds taken by each loopy belief propagation loop

In [14]:
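class Timer:
    """ Context manager that records elapsed seconds in self.i """
    def __enter__(self):
        # time.clock() was removed in Python 3.8; perf_counter() is the replacement
        self.start = time.perf_counter() # start
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter() # end
        self.i = self.end - self.start # time taken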
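A short usage sketch for a timer of this kind follows. It redefines a `Timer` so the example runs on its own and uses `time.perf_counter()`; the summing workload is only a stand-in for one propagation loop.

```python
import time

class Timer:
    """Context manager that stores the elapsed seconds in self.i."""
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        self.i = self.end - self.start

# Time an arbitrary block of work (stand-in for one propagation loop)
with Timer() as t:
    total = sum(range(100_000))

print("[{:.3f}s] summed to {}".format(t.i, total))
```

The elapsed time is read from `t.i` only after the `with` block exits, because `__exit__` is what computes it.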

### Converting the co-occurrence ratio dictionary into a NumPy matrix
$ Shape(PSI) = \#(Nodes) \times \#(Nodes) $ 

In [15]:
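def get_psi(SI):
    """Build the symmetric co-occurrence matrix PSI from the pair dictionary SI."""
    PSI = np.zeros((num_nodes, num_nodes))
    for i in range(num_nodes):
        for j in range(num_nodes):
            # SI may store a pair in either order, so try both keys
            try:
                value = SI[nodes[i], nodes[j]]
            except KeyError:
                try:
                    value = SI[nodes[j], nodes[i]]
                except KeyError:
                    continue # the two hashtags never co-occur; leave the entry at 0
            PSI[i][j] = value
            PSI[j][i] = value
    return PSI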
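The same conversion can be sketched on a toy graph by iterating over the dictionary's pairs instead of over every pair of nodes. This is an alternative formulation rather than the notebook's code: the helper name `pairs_to_symmetric_matrix` and the toy values are assumptions, and it presumes each pair is stored under a single key order.

```python
import numpy as np

def pairs_to_symmetric_matrix(nodes, pair_values):
    """Fill a symmetric matrix from a {(node_a, node_b): value} dictionary."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = np.zeros((len(nodes), len(nodes)))
    for (a, b), value in pair_values.items():
        if a in index and b in index:  # skip pairs outside the node list
            matrix[index[a], index[b]] = value
            matrix[index[b], index[a]] = value
    return matrix

# Toy data, purely illustrative
toy_nodes = ["love", "happy", "sad"]
toy_SI = {("love", "happy"): 0.4, ("sad", "love"): 0.25}
toy_PSI = pairs_to_symmetric_matrix(toy_nodes, toy_SI)
print(toy_PSI)
```

Iterating over the stored pairs touches only the entries that exist, which matters when the graph is sparse compared with the full node-by-node grid.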

### Converting the sentiment probability dictionary into a NumPy matrix
$ Shape(PHI) = \#(Nodes) \times \#(Labels) $ 

In [16]:
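def get_phi(PI):
    """Build the node-by-label sentiment probability matrix PHI from the dictionary PI."""
    PHI = np.zeros((num_nodes, len(labels)))
    for node in range(num_nodes):
        for label in range(len(labels)):
            try:
                PHI[node][label] = PI[labels[label], nodes[node]]
            except KeyError: # no probability recorded for this (label, hashtag) pair
                PHI[node][label] = 0.0
    return PHI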

### Propagation of belief from a node to its neighbours


$ m_{i \rightarrow j} (y_j ) \leftarrow \alpha \sum_{y_i} \psi_{i,j} (y_i, y_j ) \phi_{i}(y_i) \prod_{h_k \in N(h_i) \setminus \{h_j\}}  m_{k \rightarrow i}(y_i)$


In [17]:
def getMessage(i, PHIi, PSIi, label_j, labels, neighbours, messages):
    m = { label: np.zeros(messages[label].shape[0]) for label in labels } # initialise
    zeros = np.zeros(len(PSIi))

    for neighbour in neighbours:
        S = set(neighbours) - {neighbour}
        for label in labels: # compute contribution to each neighbour from all other neighbours of i
            m[label][neighbour] = np.prod([ messages[label][k,i] for k in S ]) + .00000001

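    # idx indexes the labels (it does not refer to node i); psi is non-zero only when both labels agree
    return sum( np.multiply(PHIi[idx] * (PSIi if label_i == label_j else zeros), m[label_i]) for idx, label_i in enumerate(labels))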
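The update rule can be traced on a three-node chain with the simplified sketch below. It is not the notebook's `getMessage`: it stores messages in a hypothetical array `messages[k, i]` (one vector over labels per directed edge), omits the small additive constant, and assumes, as the code above does, that the compatibility term equals the co-occurrence ratio when both labels agree and 0 otherwise, so the sum over y_i collapses to a single term.

```python
import numpy as np

def message(i, j, phi, psi, neighbours, messages):
    """Normalised message m_{i->j} as a vector over labels.

    phi: (n_nodes, n_labels) node potentials
    psi: (n_nodes, n_nodes) co-occurrence ratios
    messages[k, i]: current message from node k to node i
    """
    incoming = np.ones(phi.shape[1])
    for k in neighbours[i]:
        if k != j:  # product over N(i) excluding the recipient j
            incoming *= messages[k, i]
    # Assumption: psi(y_i, y_j) = psi[i, j] if y_i == y_j else 0
    out = psi[i, j] * phi[i] * incoming
    total = out.sum()
    return out / total if total else out  # alpha normalises the message

# Toy chain 0 - 1 - 2 with labels ordered [pos, neg, neutral]
phi = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [1/3, 1/3, 1/3]])
psi = np.array([[0.0, 0.5, 0.0], [0.5, 0.0, 0.25], [0.0, 0.25, 0.0]])
neighbours = [[1], [0, 2], [1]]
messages = np.ones((3, 3, 3))

m01 = message(0, 1, phi, psi, neighbours, messages)
messages[0, 1] = m01
m12 = message(1, 2, phi, psi, neighbours, messages)
print(m01, m12)
```

Node 0 has no neighbour other than node 1, so `m01` is just its normalised potential; `m12` then shifts node 1's mostly neutral potential towards positive because of what node 0 sent.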

### Get the final score for each hashtag after belief propagation is done
$ y_i \leftarrow \arg\max_{y \in \{pos, neg, neutral\}} \alpha \phi_i(y) \prod_{h_j \in N(h_i)}  m_{j \rightarrow i}(y)$ 

In [18]:
def getScores(i, PHIi, labels, neighbours, messages):
    scores = { label: PHIi[label_i] * np.prod([ messages[label][j,i] for j in neighbours ]) for label_i, label in enumerate(labels) }
    # normalise scores
    alpha = sum( scores[label] for label in labels )
    if alpha:
        for label in labels:
            scores[label] *= 1 / alpha

    return scores

def getLabels(scores):
    return [ max(score, key=score.get) for score in scores ]
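The final scoring step can be checked by hand on small numbers. The sketch below multiplies a node's potential by its incoming messages, normalises, and picks the best label; the helper name `belief` and the toy vectors are assumptions for this example, with labels ordered as in `labels`.

```python
import numpy as np

def belief(phi_i, incoming_messages):
    """Normalised product of a node potential and its incoming messages."""
    scores = np.asarray(phi_i, dtype=float)
    for m in incoming_messages:
        scores = scores * np.asarray(m, dtype=float)
    total = scores.sum()
    return scores / total if total else scores  # leave all-zero scores untouched

labels = ['pos', 'neg', 'neutral']
# Toy values: a mostly neutral potential and two neighbours that lean positive
b = belief([0.2, 0.2, 0.6], [[0.8, 0.1, 0.1], [0.5, 0.25, 0.25]])
label = labels[int(np.argmax(b))]
print(b, label)
```

The unnormalised products are 0.08, 0.005 and 0.015, so after dividing by their sum (0.1) the positive label wins even though the node's own potential favoured neutral.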

### Running loopy belief propagation iteratively

In [19]:
def LBP(labels, nodes, PI, SI):
    with Timer() as t: # do some extra initialisation
        PSI = get_psi(SI) # Convert the SI value to matrix representation
        PHI = get_phi(PI) # Convert sentiment probability to matrix representation
        messages = { label: (PSI > 0).astype(float) for label in labels } # Initialize the message values to 1 for co-occurring nodes 
        neighbours = [ list(np.nonzero(PSI[i])[0]) for i in range(num_nodes) ] # extract neighbours for all nodes
        print('PSI = {}, PHI = {}'.format(PSI.shape, PHI.shape)) # shapes printed in the same order as their names
    print("\nStarting propagation on {} node hashtag graph".format(num_nodes))

    loops = 0 # loop variable t
    while True:
        with Timer() as t: # compute messages
            loops += 1
            old_messages = { label: messages[label].copy() for label in labels } # archive messages
            for i in range(num_nodes): # for each node
                for label in labels: # compute message from node i to its neighbours
                    messages[label][i,:] = getMessage(i, PHI[i,:].flatten(), PSI[i,:].flatten(), label, labels, neighbours[i], old_messages)
                    
                alpha = [ alpha for alpha in map(lambda x : 1/x if x else 1, np.sum([ messages[label][i,:] for label in labels ], axis=0)) ] # compute normaliser
                for label in labels:
                    messages[label][i,:] *= alpha # normalise messages
                
        print("[{:.3f}s] Loop {} completed.\tMessage change: {}".format(t.i, loops, np.sum([ abs(old_messages[label] - messages[label]) for label in labels ])))

        if all( np.allclose(old_messages[label], messages[label]) for label in labels ) or loops > 10: # halt if messages stop changing or after more than 10 loops
            break

    print("Propagation complete.\n")
    
    with Timer() as t: # compute final labels
        scores = [ getScores(i, PHI[i,:], labels, neighbours[i], messages) for i in range(num_nodes)]
        results = getLabels(scores)

    print("[{:.3f}s] Final labels computed. Objective value: {}".format(t.i, sum( max(score.values()) for score in scores )))

    return results, scores

### Printing the final result

In [21]:
def printResults(labels, nodes, results, scores):
    filename = 'ouput.results'
    with open(filename, 'w') as fo:
        fo.write("ID\tNode\tLabel\t{}\n".format('\t'.join(labels)))
        for i in range(len(nodes)):
            fo.write("{}\t{}\t{}\t{}\n".format(i, nodes[i], results[i], '\t'.join(str(scores[i][label]) for label in labels)))
            
    print("Results written to '{}'.".format(abspath(filename)))
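For downstream use, the tab-separated file can be read back with the standard library. The sketch below parses the header and converts the three probability columns to floats; the function name `read_results` is hypothetical, and the sample row is copied from the output shown earlier so the example does not depend on the file being present.

```python
import csv
import io

def read_results(handle):
    """Parse the tab-separated results into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter='\t')
    rows = []
    for row in reader:
        # Convert the probability columns from text to float
        for column in ('pos', 'neg', 'neutral'):
            row[column] = float(row[column])
        rows.append(row)
    return rows

# In-memory sample with the same layout as the results file
sample = "ID\tNode\tLabel\tpos\tneg\tneutral\n0\tthisisnotconsent\tneutral\t0.0\t0.0\t1.0\n"
rows = read_results(io.StringIO(sample))
print(rows[0]['Node'], rows[0]['Label'], rows[0]['neutral'])
```

To read the real file, the same function can be given an open file handle in place of the `io.StringIO` object.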

### Running the algorithm and outputting the result to a file

In [22]:
results, scores = LBP(labels, nodes, PI, SI) # run propagation
printResults(labels, nodes, results, scores) # print results

PSI = (1054, 1054), PHI = (1054, 3)

Starting propagation on 1054 node hashtag graph
[198.643s] Loop 1 completed.	Message change: 439288.0000000003
[199.460s] Loop 2 completed.	Message change: 157.43698451538225
[199.616s] Loop 3 completed.	Message change: 9.698930472669753
[198.841s] Loop 4 completed.	Message change: 0.1042547165131901
[201.808s] Loop 5 completed.	Message change: 0.0005052888974507859
[200.982s] Loop 6 completed.	Message change: 2.3558844685393022e-05
[200.157s] Loop 7 completed.	Message change: 5.701835024207925e-07
Propagation complete.

[0.248s] Final labels computed. Objective value: nan
Results written to '/home/ipsita_proff/bd_project/ouput.results'.


