# Supervised Sentiment Analysis 
In this project we analyze tweets about the presidential candidates word by word. We "tokenize" each tweet (that is, split it into individual words or short sequences of words) and then determine sentiment from those tokens. "Supervised" in the title above means that I supply the labels: I have already classified 300 of the 400 tweets as positive, negative, or neutral. I will then train a Multinomial Naive Bayes classifier and a linear Support Vector Machine fitted with an SGD (Stochastic Gradient Descent) classifier. Both are popular choices for text classification problems with more than two response classes. Of the labeled tweets, 200 go into training and 100 into testing; the remaining 100 unlabeled tweets are then predicted and inspected.
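
As a rough illustration of the two classifiers described above, here is a minimal sketch. It assumes scikit-learn is the library in use (the text does not say so explicitly), and the tweets and labels are made up for the example; they are not taken from the project's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up labeled tweets standing in for the real training rows.
train_texts = [
    "great speech tonight, really inspiring",
    "love this candidate and the plan",
    "terrible answer, what a disaster",
    "worst debate performance ever",
    "the debate starts at nine tonight",
    "the rally is scheduled for friday",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# ngram_range=(1, 2) tokenizes each tweet into single words and two-word sequences.
nb_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

# loss="hinge" makes SGDClassifier fit a linear SVM by stochastic gradient descent.
svm_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SGDClassifier(loss="hinge", random_state=0),
)

nb_model.fit(train_texts, train_labels)
svm_model.fit(train_texts, train_labels)

# Predict labels for unseen (again made-up) tweets.
new_texts = ["what a great plan", "this answer was terrible"]
nb_pred = nb_model.predict(new_texts)
svm_pred = svm_model.predict(new_texts)
print(nb_pred, svm_pred)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the tokenization identical at training and prediction time, which is the main reason for that design choice in this sketch.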

# Unsupervised Sentiment Analysis
After completing the supervised analysis, I will perform an unsupervised sentiment analysis on an unlabeled version of the original dataset. This algorithm relies on a small set of seed words to determine sentiment; here I use "create" and "destroy". It computes the semantic orientation of a phrase by subtracting the mutual information between the phrase and the word "destroy" from the mutual information between the phrase and the word "create", so a positive score means the phrase is more strongly associated with "create". For more detail on how the algorithm works, see:
http://nparc.cisti-icist.nrc-cnrc.gc.ca/eng/view/accepted/?id=4bb7a0c8-9d9b-4ded-bcf6-fdf64ee28ccc
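
The subtraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: co-occurrence is counted at the tweet level, terms are single tokens, and the smoothing constant `eps` and the helper names `pmi` and `semantic_orientation` are hypothetical choices made for this example. The toy tweets are invented.

```python
import math


def pmi(docs, a, b, eps=0.01):
    """Pointwise mutual information of tokens a and b over a list of token sets."""
    n = len(docs)
    count_a = sum(1 for d in docs if a in d)
    count_b = sum(1 for d in docs if b in d)
    count_ab = sum(1 for d in docs if a in d and b in d)
    # eps is a small smoothing constant (an assumption) that avoids log(0)
    # when two tokens never appear together.
    p_a = (count_a + eps) / n
    p_b = (count_b + eps) / n
    p_ab = (count_ab + eps) / n
    return math.log2(p_ab / (p_a * p_b))


def semantic_orientation(docs, term, pos_word="create", neg_word="destroy"):
    """PMI(term, pos_word) minus PMI(term, neg_word); positive means closer to pos_word."""
    return pmi(docs, term, pos_word) - pmi(docs, term, neg_word)


# Invented toy corpus: each tweet becomes a set of lowercase tokens.
tweets = [
    "we will create jobs and hope",
    "hope can create change",
    "they destroy trust with lies",
    "lies destroy everything",
]
docs = [set(t.split()) for t in tweets]

so_hope = semantic_orientation(docs, "hope")
so_lies = semantic_orientation(docs, "lies")
print(so_hope, so_lies)
```

In this toy corpus "hope" only ever appears alongside "create" and "lies" only alongside "destroy", so the first score comes out positive and the second negative.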

Here we will import some libraries to get going:

In [1]:
import pandas as pd
import csv
import sys
import random
import numpy as np
import time
from time import strftime

These libraries let us parse the CSV files and extract the data we need.

In [2]:
import pygal
from IPython.display import SVG

In [3]:
predicted_data_NB = "predicted_nb.csv"
predicted_data_LSVM = "predicted_lsvm.csv"

In [5]:
from IPython.display import display  # needed to render output from inside a function

def compare_predictions():
    # Passing names= makes pandas read the files as headerless and label the columns.
    names = ["president", "tweet", "prediction"]
    naive_bayes = pd.read_csv(predicted_data_NB, names=names)
    linear_svm = pd.read_csv(predicted_data_LSVM, names=names)

    naive_bayes_pred = np.array(naive_bayes["prediction"])
    linear_svm_pred = np.array(linear_svm["prediction"])

    # Share of tweets on which both models predict the same label, as a percentage.
    print("The percent similarity between a Multinomial Naive Bayes Algorithm and a Linear SVM algorithm with a SGD Classifier is: ")
    print(100 * np.mean(naive_bayes_pred == linear_svm_pred))

    plot_predictions(naive_bayes_pred)
    plot_predictions(linear_svm_pred)

def plot_predictions(predictions):
    # Share of each sentiment class among the predictions.
    pos_sent = len([k for k in predictions if k == "positive"]) / len(predictions)
    neg_sent = len([k for k in predictions if k == "negative"]) / len(predictions)
    neu_sent = len([k for k in predictions if k == "neutral"]) / len(predictions)

    chart = pygal.HorizontalBar()
    chart.title = 'Positive, Negative & Neutral Sentiment'
    chart.add('Positive', pos_sent * 100)
    chart.add('Negative', neg_sent * 100)
    chart.add('Neutral', neu_sent * 100)
    chart.render_to_file('sentiment.svg')
    # A bare SVG(...) expression inside a function is discarded, so display it explicitly.
    display(SVG(filename='sentiment.svg'))
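
To make the similarity figure concrete: comparing two NumPy arrays element by element gives a boolean array, and its mean is the fraction of positions where the two models agree. The labels below are made up for illustration only.

```python
import numpy as np

# Made-up predictions from two hypothetical models on the same four tweets.
nb_labels = np.array(["positive", "negative", "neutral", "positive"])
svm_labels = np.array(["positive", "neutral", "neutral", "positive"])

# Elementwise comparison: True where the two models agree.
matches = nb_labels == svm_labels

# The mean of a boolean array is the fraction of True entries (3 of 4 here).
agreement = np.mean(matches)
print(agreement)
```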