In [1]:
import sys
import os
sys.path.insert(1, os.path.join(sys.path[0], '..'))
import main
import tweepy

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

api = "Dummy" # We don't need the actual API here so just pass a dummy value in

sa = main.SentimentAnalysis(False, api, 'svm', 1, 10)

## Training and using an SVM classifier

### What is an SVM classifier?
The second option for classifying our preprocessed tweets is a support vector machine, or SVM for short.

A support vector machine is a supervised learning model: it learns from labeled training data and uses what it has learned to assign any new data to one of the categories which appear in the training data. Don't worry if that doesn't quite make sense to you yet, I'll break it down a bit more below.

### Training the SVM classifier
If our training corpus is too large to allow us to run sentiment analysis in a reasonable time then we can trim it to a more manageable size. In this case I'll use a fairly trivial set of 10 tweets for our training corpus, 5 positive and 5 negative.

In [2]:
positive_tweets = [["happy :)", "positive"],["i am very excited!", "positive"],["yay lol, exciting!", "positive"], ["happy time, i have cake!", "positive"], ["this is amazing! :)", "positive"]]
negative_tweets = [["no i am so sad", "negative"], ["my cat just died :(", "negative"], ["Just crashed my car whoops!", "negative"], ["Late for my new job arghh", "negative"], ["I feel like crying!", "negative"]]        

tweets = [i[0] for i in positive_tweets] + [i[0] for i in negative_tweets]
labels = [i[1] for i in positive_tweets] + [i[1] for i in negative_tweets]

sa.training_set_size = min(len(tweets),  sa.svm_training_set_size)

# Trim the training corpus to a smaller more manageable size
tweets = tweets[:int(sa.training_set_size)]
labels = labels[:int(sa.training_set_size)]

for tweet in tweets:
    print("-", tweet)

- happy :)
- i am very excited!
- yay lol, exciting!
- happy time, i have cake!
- this is amazing! :)
- no i am so sad
- my cat just died :(
- Just crashed my car whoops!
- Late for my new job arghh
- I feel like crying!


The next step is to create our vectorizer object, which will convert our trimmed training corpus into a set of vectors using the *tf-idf* statistic.

You can [see here](http://www.tfidf.com/) for more details on tf-idf but to summarise, it stands for **term frequency - inverse document frequency** and it is a statistic that aims to reflect how important a word is to a document in a collection, or in this case, a corpus.
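
To make the idea concrete, here is a minimal sketch that computes a textbook tf-idf score by hand for three of the training tweets. It assumes the plain formula (term count multiplied by ln(N / document frequency)); scikit-learn's TfidfVectorizer uses a smoothed variant and normalises each vector, so its numbers will differ. The helper name `tf_idf` is made up for this illustration.

```python
import math

# Three tiny "documents", already lower-cased and split into words
docs = [
    ["happy", "time", "i", "have", "cake"],
    ["i", "am", "very", "excited"],
    ["no", "i", "am", "so", "sad"],
]

def tf_idf(term, doc, docs):
    """Textbook tf-idf: term count in doc times ln(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# "i" appears in every document, so it carries no weight
print(tf_idf("i", docs[0], docs))                # 0.0
# "cake" appears in only one document, so it scores highest
print(round(tf_idf("cake", docs[0], docs), 3))   # 1.099
# "am" appears in two of the three documents
print(round(tf_idf("am", docs[1], docs), 3))     # 0.405
```

The rarer a word is across the corpus, the more it tells us about the tweet it appears in.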

In [3]:
vectorizer = TfidfVectorizer(min_df=1,
                            max_df=0.95,
                            sublinear_tf = True,
                            use_idf = True,
                            ngram_range = (1, 2))

If you toggle on the code above, you can see the vectorizer setup, which takes a number of parameters:
* min_df - Discards any term which appears in fewer than this many documents (tweets). In the live code this value is set to 5, but for demonstration purposes, as we have a very small training set, the parameter has been set to 1.
* max_df - Discards any term which appears in more than this proportion of our documents, here 95%.
* sublinear_tf - Applies sublinear term frequency scaling, i.e. replaces term frequency with 1 + $\ln$(term frequency).
* use_idf - Lets us choose whether to weight terms by inverse document frequency, which scales down the weight of terms that appear in many documents. scikit-learn uses the natural logarithm here ([See here](https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html) for more details on inverse document frequency).
* ngram_range - The lower and upper bounds on n for the n-grams extracted from each text entry. N-grams are all runs of n adjacent words or letters that you can find in the source text, so (1, 2) extracts single words and pairs of adjacent words. You can [see here](https://en.wikipedia.org/wiki/N-gram) for more on N-grams.

For more details on the TfidfVectorizer parameters you can [see here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
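
As a rough illustration of two of these parameters, the sketch below lists the word n-grams that a range of (1, 2) produces and shows how sublinear scaling dampens repeated words. The helpers `word_ngrams` and `sublinear_tf` are hypothetical and use a simple whitespace split, not scikit-learn's own tokenizer; the scaling uses the natural logarithm, as scikit-learn does.

```python
import math

def word_ngrams(text, n_min=1, n_max=2):
    """Return all runs of n_min..n_max adjacent words, shortest first."""
    words = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

print(word_ngrams("i am very excited"))
# ['i', 'am', 'very', 'excited', 'i am', 'am very', 'very excited']

def sublinear_tf(count):
    """Sublinear scaling: 1 + ln(count) for counts above zero."""
    return 1 + math.log(count) if count > 0 else 0.0

for count in (1, 2, 10):
    print(count, round(sublinear_tf(count), 2))
# 1 1.0
# 2 1.69
# 10 3.3
```

A word used ten times in a tweet ends up with just over three times the weight of a word used once, rather than ten times.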

Now that we have our vectorizer, we can go ahead and use it to transform our trimmed training corpus into a set of training vectors.

In [4]:
train_vectors = vectorizer.fit_transform(tweets)

The final step in the setup phase is to create our classifier. 

The Python library *scikit-learn* comes with a number of different classifiers already built-in. In these experiments, we will use the LinearSVC (linear support vector classification) variation of Support Vector Machine (SVM). 

We pass an argument C = 0.1 to the SVM object. Don't worry about this for now, we'll come back to it later on. 

Once this has been done we can simply fit the classifier to our training vectors and their corresponding sentiment labels from above, and we are ready to classify new tweets.

In [5]:
# Perform classification with SVM
classifier_linear = svm.LinearSVC(C=0.1)

classifier_linear.fit(train_vectors, labels);

### Classifying using the SVM classifier

Now we are ready to analyse new tweets. For demonstration purposes we will take a trivial tweet from a sentiment point of view and analyse it to obtain the sentiment.

We do this by transforming the tweet into a test vector using our vectorizer again and then passing this vector to our classifier, which predicts the sentiment by checking which side of the hyperplane the new vector lies on.

The hyperplane is the boundary (a line in the two-dimensional diagram below) which satisfies two properties:
1. It has the maximum possible distance (the margin) to the closest point on either side of it
2. It correctly separates as many points by class as possible

![SVM diagram](https://www.tweetsentiment.co.uk/static/images/svm.jpg)
<center>[(Image source)](http://blogs.quickheal.com/machine-learning-approach-advanced-threat-hunting/)</center>
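
Checking which side of the hyperplane a vector lies on comes down to the sign of a single number. The sketch below shows this for a hypothetical two-dimensional hyperplane; the weights, bias and helper names are invented for illustration and are not taken from our trained classifier.

```python
def decision_value(weights, bias, vector):
    """Signed score w . x + b; zero means the point lies on the hyperplane."""
    return sum(w * x for w, x in zip(weights, vector)) + bias

def classify(weights, bias, vector):
    """Positive scores fall on one side of the hyperplane, negative on the other."""
    return "positive" if decision_value(weights, bias, vector) > 0 else "negative"

# Hypothetical 2-D hyperplane: the line x1 + x2 = 1
weights, bias = [1.0, 1.0], -1.0

print(classify(weights, bias, [0.9, 0.8]))  # positive
print(classify(weights, bias, [0.1, 0.2]))  # negative
```

The further the score is from zero, the further the point is from the boundary.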

Now remember the SVM parameter C from above? Its value determines how much we want to focus on property 2. A higher C value means that a smaller-margin hyperplane will be chosen in order to try and correctly classify as many training points as possible. Conversely, a lower C value favours a larger-margin hyperplane, even if that means misclassifying more training points. The technical name of this parameter is the penalty parameter C of the error term.

Below you can see two SVM examples demonstrating the difference between a lower C value and a higher C value.

![Low and high C diagram](https://www.tweetsentiment.co.uk/static/images/svm_c_values.png)
<center>[(Image source)](https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel)</center>
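
The trade-off can also be seen numerically. The sketch below scores two candidate boundaries on a toy one-dimensional data set using the standard soft-margin objective, 0.5 * w^2 + C * (sum of hinge losses). This is a simplification: the data, the two candidates and the function name are invented for illustration, and LinearSVC by default uses a squared hinge loss rather than the plain hinge loss shown here.

```python
def soft_margin_objective(w, b, points, C):
    """0.5 * w^2 + C * sum of hinge losses, for 1-D points (x, y) with y in {-1, +1}."""
    hinge = sum(max(0.0, 1 - y * (w * x + b)) for x, y in points)
    return 0.5 * w * w + C * hinge

# Two negatives, two positives, and one positive sitting close to the negatives
points = [(-3, -1), (-2, -1), (-0.5, 1), (2, 1), (3, 1)]

wide = (0.5, 0.0)        # boundary at x = 0, wide margin, misclassifies x = -0.5
narrow = (4 / 3, 5 / 3)  # boundary at x = -1.25, narrow margin, separates everything

for C in (0.1, 10):
    scores = {name: soft_margin_objective(w, b, points, C)
              for name, (w, b) in (("wide", wide), ("narrow", narrow))}
    print(C, min(scores, key=scores.get))
# 0.1 wide
# 10 narrow
```

With a low C the misclassified point is cheap, so the wide margin wins; with a high C the penalty dominates and the narrow margin that separates every point is preferred.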

As you can see for our trivial tweet, our SVM correctly predicts the sentiment to be positive:

In [6]:
tweet = "I am very happy :)"
test_vector = vectorizer.transform([tweet])
prediction = classifier_linear.predict(test_vector)

print('"' + tweet + '"')
print(prediction[0])

"I am very happy :)"
positive


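Once again we can also derive positive and negative sentiment percentages from the classifier for each classification call. LinearSVC does not output probabilities, so we rescale the signed distance returned by decision_function instead; treat these percentages as a rough indication of confidence rather than true probabilities.

Again, if the difference is less than 10% we will classify the sentiment as neutral instead.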

In [7]:
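# decision_function returns a signed distance from the hyperplane, not a probability
probs = classifier_linear.decision_function(test_vector)
# Rescale so that a distance of 0 maps to a 50/50 split
negative = (1 - probs) / 2
positive = 1 - negative

# Format as whole-number percentages (rounding instead of truncating)
positive = '%d%%' % round(positive[0] * 100)
negative = '%d%%' % round(negative[0] * 100)
print("Positive:", positive)
print("Negative:", negative)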

Positive: 59%
Negative: 41%
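
The rescaling and the 10% neutral rule can be wrapped into one small helper. This is only a sketch: the function name is hypothetical, and the clamping step is an added assumption (not part of the code above) so that scores beyond -1 or 1 still give shares between 0% and 100%.

```python
def score_to_sentiment(score, neutral_band=0.1):
    """Map a signed decision score to (label, positive share, negative share)."""
    # Assumption: clamp to [-1, 1] so the shares stay between 0 and 1
    score = max(-1.0, min(1.0, score))
    negative = (1 - score) / 2
    positive = 1 - negative
    if abs(positive - negative) < neutral_band:
        label = "neutral"
    else:
        label = "positive" if score > 0 else "negative"
    return label, positive, negative

for score in (0.18, 0.0, -2.5):
    label, positive, negative = score_to_sentiment(score)
    print(f"{score:+.2f} -> {label}, {positive:.0%} positive, {negative:.0%} negative")
# +0.18 -> positive, 59% positive, 41% negative
# +0.00 -> neutral, 50% positive, 50% negative
# -2.50 -> negative, 0% positive, 100% negative
```

A score of 0.18 reproduces the 59% / 41% split shown above, while a score of exactly 0 sits on the hyperplane and is reported as neutral.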


Looking at the final example below, you can again see that this tweet is analysed as neutral when it clearly isn't. As was the case for our Bayesian classifier, the training set we use to train our SVM classifier is vital. There is no occurrence of the word 'annoying' in our training set, so the classification result is highly likely to be incorrect.

This serves to reinforce that a well chosen and larger training set is highly desirable in order to ensure higher accuracy when it comes to sentiment analysis.

In [8]:
tweet = "You're annoying"
test_vector = vectorizer.transform([tweet])
prediction = classifier_linear.predict(test_vector)
probs = classifier_linear.decision_function(test_vector)
negative = (1 - probs) / 2
positive = 1 - negative

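predict = prediction[0]
# Treat the result as neutral when the two shares differ by less than 10 percentage points
if abs(positive[0] - negative[0]) < 0.1:
    predict = "neutral"

# Format as whole-number percentages (rounding instead of truncating)
positive = '%d%%' % round(positive[0] * 100)
negative = '%d%%' % round(negative[0] * 100)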

print('"' + tweet + '"')
print("Sentiment:", predict)
print("Positive:", positive)
print("Negative:", negative)

"You're annoying"
Sentiment: neutral
Positive: 50%
Negative: 50%
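
One way to see why this tweet gives the classifier nothing to work with is to check which of its words appear in the training vocabulary. The sketch below does this with plain Python; the `tokens` helper only approximates TfidfVectorizer's default tokenisation (lower-casing and keeping words of two or more characters) and ignores bigrams.

```python
import re

training_tweets = [
    "happy :)", "i am very excited!", "yay lol, exciting!",
    "happy time, i have cake!", "this is amazing! :)",
    "no i am so sad", "my cat just died :(", "Just crashed my car whoops!",
    "Late for my new job arghh", "I feel like crying!",
]

def tokens(text):
    """Lower-cased words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Every word the vectorizer could have learned from the training tweets
vocabulary = {word for tweet in training_tweets for word in tokens(tweet)}

for tweet in ("I am very happy :)", "You're annoying"):
    known = [word for word in tokens(tweet) if word in vocabulary]
    print(tweet, "->", known)
# I am very happy :) -> ['am', 'very', 'happy']
# You're annoying -> []
```

With no known words the test vector is all zeros, so the score depends only on the classifier's intercept term rather than on anything in the tweet, which is consistent with the 50% / 50% split above.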
