<center>
    <H1> SUPPORT VECTOR MACHINE </H1>
    <br>
======================================================================================================================
<br>
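Support Vector Machine (SVM) is a discriminative, supervised machine learning algorithm for binary classification. It finds the separating hyperplane with the largest margin between the two classes. With a soft margin and kernel functions, it can also handle data that is not linearly separable.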

## STEP 1: IMPORT LIBRARIES

In [1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split 
from sklearn.metrics import confusion_matrix, accuracy_score

## STEP 2: LOAD DATASET

In [2]:
dataset = pd.read_csv('data/twitter_dataset.csv', encoding = 'latin-1') #load dataset from csv file
dataset.head()           #show first 5 rows

Unnamed: 0,name_wt,statuses_count,followers_count,friends_count,favourites_count,listed_count,label
0,0.9375,43,5,34,0,0,1
1,0.909091,12204,1182,1327,0,4,1
2,0.909091,42,3,34,0,0,1
3,1.0,215,1158,1545,0,21,1
4,0.285714,38420,2293,2198,1987,2,0


## STEP 3: FEATURE SELECTION

In [3]:
# Collect every column except 'label' as a feature, then build a 2D feature matrix

features = [column for column in dataset.columns if column != 'label']

# DataFrame.as_matrix() is no longer available in pandas 1.0+; to_numpy() returns the same 2D array
data = dataset[features].to_numpy()

  


In [4]:
print("Total instances : ", data.shape[0], "\nNumber of features : ", data.shape[1])

Total instances :  6945 
Number of features :  6


In [5]:
# convert the label column into a 1D array

label = np.array(dataset['label'])
label

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

## STEP 4: CREATE TEST AND TRAIN SETS

We randomly split the dataset in an 80–20 ratio: 80% of the data is used as the training set and the remaining 20% as the test set.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=0)

In [7]:
print("Number of training instances: ", X_train.shape[0])

Number of training instances:  5556


In [8]:
print("Number of testing instances: ", X_test.shape[0])

Number of testing instances:  1389
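
As a quick check of the 80–20 arithmetic, the sketch below splits a placeholder array with the same number of rows (6945) and prints the resulting sizes. The arrays `X_demo` and `y_demo` are illustrative stand-ins filled with zeros, not the Twitter data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count as the dataset (values are arbitrary)
X_demo = np.zeros((6945, 6))
y_demo = np.zeros(6945, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test share is 20% of 6945 rows; the training set gets the remainder
print("train:", X_tr.shape[0], "test:", X_te.shape[0])
```

The sizes depend only on the row count and `test_size`, so they should match the 5556 and 1389 instances reported above.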


## STEP 5: TRAIN THE CLASSIFIER 

In [9]:
# Generate the model
# A linear kernel fits a linear decision boundary; the soft margin (parameter C)
# tolerates training points that are not perfectly separable.
svm_model = SVC(kernel='linear')

# Train the model using the training set
svm_model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

## STEP 6: TEST THE CLASSIFIER 

Now our model is ready. We will test it against the given labels. For every test case, the model evaluates its decision function and assigns the class according to which side of the separating hyperplane the test case falls on.
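
To illustrate how a linear SVM turns a test case into a class, here is a minimal sketch on a tiny, made-up 2D dataset (`X_toy`, `y_toy` and `toy_model` are hypothetical names, not part of this notebook). It assumes a binary problem with a linear kernel, where the sign of the decision function decides the class.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, made-up dataset: class 0 near the origin, class 1 further away
X_toy = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = SVC(kernel='linear').fit(X_toy, y_toy)

# Signed score of each point relative to the separating hyperplane w.x + b = 0
scores = toy_model.decision_function(X_toy)

# For a linear kernel the same score can be recomputed from coef_ and intercept_
manual_scores = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# A positive score maps to classes_[1], a negative one to classes_[0]
predicted = toy_model.classes_[(scores > 0).astype(int)]
print(predicted)
```

In this binary, linear-kernel setting, `predict` should give the same labels as thresholding `decision_function` at zero.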

In [10]:
#test set
# X_test

In [11]:
svm_model.predict([X_test[1]])    #testing for single instance

array([0], dtype=int64)

In [12]:
# Now apply the model to the entire test set and predict the label for each test example.
# predict() accepts the whole 2D array at once, so no explicit loop is needed.

y_predict = svm_model.predict(X_test).tolist()    # one predicted label per test example

#predictions

In [13]:
# y_predict

## STEP 7: EVALUATION OF CLASSIFICATION RESULTS

The classifier will be evaluated using accuracy, recall, precision and F-measure. To compute these, a confusion matrix is created first.

In [14]:
#true negatives is C(0,0), false negatives is C(1,0), false positives is C(0,1) and true positives is C(1,1) 
conf_matrix = confusion_matrix(y_test, y_predict)

In [15]:
#true_negative
TN = conf_matrix[0][0]
#false_negative
FN = conf_matrix[1][0]
#false_positive
FP = conf_matrix[0][1]
#true_positive
TP = conf_matrix[1][1]
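
For binary labels, the four cells can also be unpacked in one step with `ravel()`, which flattens the 2x2 matrix row by row. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) purely to show the ordering; it is not the notebook's data.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = positive class, 0 = negative class
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes,
# so the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```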

In [16]:
# Recall is the number of correctly classified positive examples divided by the total number of positive examples.
# High Recall indicates the class is correctly recognized (small number of FN)
recall = (TP)/(TP + FN)

In [17]:
# Precision is the number of correctly classified positive examples divided by the total number of predicted positive examples.
# High Precision indicates an example labeled as positive is indeed positive (small number of FP)
precision = (TP)/(TP + FP)

In [18]:
fmeasure = (2*recall*precision)/(recall+precision)
accuracy = (TP + TN)/(TN + FN + FP + TP)
#accuracy_score(y_test, y_predict)
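
The hand-computed formulas above can be cross-checked against scikit-learn's built-in metric functions. The following sketch does this on a small set of made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's test set), treating label 1 as the positive class.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
TN, FP, FN, TP = cm[0][0], cm[0][1], cm[1][0], cm[1][1]

# Same formulas as in the cells above
manual_recall = TP / (TP + FN)
manual_precision = TP / (TP + FP)
manual_fmeasure = (2 * manual_recall * manual_precision) / (manual_recall + manual_precision)
manual_accuracy = (TP + TN) / (TN + FN + FP + TP)

# Library equivalents; by default they score label 1 as the positive class
lib_recall = recall_score(y_true_demo, y_pred_demo)
lib_precision = precision_score(y_true_demo, y_pred_demo)
lib_fmeasure = f1_score(y_true_demo, y_pred_demo)
lib_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_recall, lib_recall)
print(manual_precision, lib_precision)
print(manual_fmeasure, lib_fmeasure)
print(manual_accuracy, lib_accuracy)
```

If the two columns agree, the manual indexing into the confusion matrix is consistent with the library's definitions.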

In [19]:
print("------ CLASSIFICATION PERFORMANCE OF THE SVM MODEL ------\n"\
      "\n Recall : ", (recall*100) ,"%" \
      "\n Precision : ", (precision*100) ,"%" \
      "\n Accuracy : ", (accuracy*100) ,"%" \
      "\n F-measure : ", (fmeasure*100) ,"%" )



------ CLASSIFICATION PERFORMANCE OF THE SVM MODEL ------ 
 Recall :  92.45647969052224 %
 Precision :  94.84126984126983 %
 Accuracy :  94.67649467649467 %
 F-measure :  93.63369245837414 %
