<center>
    <H1> ROCCHIO CLASSIFIER </H1>
    <br>
======================================================================================================================
<br>


## STEP 1: IMPORT LIBRARIES

In [1]:
import pandas as pd
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import train_test_split        #for train_test_split function
from sklearn.metrics import confusion_matrix, accuracy_score

## STEP 2: LOAD DATASET

In [2]:
dataset = pd.read_csv('data/twitter_dataset.csv', encoding = 'latin-1')
dataset.head()

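|   | name_wt | statuses_count | followers_count | friends_count | favourites_count | listed_count | label |
|---|---------|----------------|-----------------|---------------|------------------|--------------|-------|
| 0 | 0.9375 | 43 | 5 | 34 | 0 | 0 | 1 |
| 1 | 0.909091 | 12204 | 1182 | 1327 | 0 | 4 | 1 |
| 2 | 0.909091 | 42 | 3 | 34 | 0 | 0 | 1 |
| 3 | 1.0 | 215 | 1158 | 1545 | 0 | 21 | 1 |
| 4 | 0.285714 | 38420 | 2293 | 2198 | 1987 | 2 | 0 |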


## STEP 3: FEATURE SELECTION

In [3]:
# Use every column except the target ('label') as a feature
features = [column for column in dataset.columns if column != 'label']
features

['name_wt',
 'statuses_count',
 'followers_count',
 'friends_count',
 'favourites_count',
 'listed_count']

## STEP 4: CREATE TEST AND TRAIN SETS

We randomly split the dataset in an 80–20 ratio: 80% of the data is used as the training set and the remaining 20% as the test set.

In [5]:
#split dataset in features and target variable
X = dataset[features] # Features
y = dataset.label # Target variable

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # 80% training and 20% test

In [7]:
print("Number of training instances: ", X_train.shape[0])
print("Number of testing instances: ", X_test.shape[0])

Number of training instances:  5556
Number of testing instances:  1389


## STEP 5: TRAIN THE CLASSIFIER 

In [12]:
clf = NearestCentroid(shrink_threshold=0.2)
clf.fit(X_train, y_train)

NearestCentroid(metric='euclidean', shrink_threshold=0.2)

## STEP 6: TEST THE CLASSIFIER 

The model is now trained. We run it on the held-out test set and compare its predictions with the given labels. For every test case, the classifier computes the distance (Euclidean here) to each class centroid and assigns the class whose centroid is nearest.
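The idea behind nearest-centroid (Rocchio) classification can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper names `fit_centroids` and `predict_nearest` and the toy data are assumptions made for this example, and the centroid shrinkage that `shrink_threshold=0.2` enables in the notebook is left out.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the sorted class labels and one mean vector (centroid) per class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest(X, classes, centroids):
    """Give each row of X the label of its closest centroid (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    # distances[i, k] = distance from sample i to centroid k
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

# Toy data, invented for illustration only (not taken from the Twitter dataset)
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
y_toy = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(X_toy, y_toy)
toy_pred = predict_nearest(np.array([[2.0, 2.0], [8.0, 7.0]]), classes, centroids)

print(centroids)  # centroids of class 0 and class 1: (0.5, 0.5) and (9.5, 9.5)
print(toy_pred)   # [0 1]
```

The first query point lies closer to the class-0 centroid and the second to the class-1 centroid, so they are labelled 0 and 1 respectively.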

In [13]:
#test set
X_test.head()

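|      | name_wt | statuses_count | followers_count | friends_count | favourites_count | listed_count |
|------|---------|----------------|-----------------|---------------|------------------|--------------|
| 119 | 0.285714 | 2 | 290 | 300 | 0 | 2 |
| 4977 | 0.833333 | 2 | 577 | 522 | 0 | 0 |
| 2529 | 0.916667 | 4948 | 467 | 266 | 772 | 23 |
| 1766 | 0.6 | 5223 | 271 | 738 | 2809 | 3 |
| 5477 | 0.230769 | 23645 | 1008 | 1756 | 897 | 66 |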


In [14]:
#Predict the response for test dataset
y_predict = clf.predict(X_test)

## STEP 7: EVALUATION OF CLASSIFICATION RESULTS

The classifier is evaluated using accuracy, recall, precision, and F-measure. As a first step, a confusion matrix is created.

In [15]:
# C[0][0] = true negatives, C[1][0] = false negatives, C[0][1] = false positives, C[1][1] = true positives
conf_matrix = confusion_matrix(y_test, y_predict)

In [16]:
#true_negative
TN = conf_matrix[0][0]
#false_negative
FN = conf_matrix[1][0]
#false_positive
FP = conf_matrix[0][1]
#true_positive
TP = conf_matrix[1][1]
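To make the layout of the confusion matrix concrete, here is a small sketch on toy labels. The lists `y_true_toy` and `y_pred_toy` are invented for illustration; with binary labels 0 and 1, `confusion_matrix` puts true classes in rows and predicted classes in columns, so `ravel()` yields the four counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only (1 = positive class)
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)
# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)              # [[2 1]
                       #  [1 3]]
print(tn, fp, fn, tp)  # 2 1 1 3
```

Unpacking with `ravel()` is an alternative to indexing the matrix element by element and gives the same four values.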

In [18]:
# Recall is the number of correctly classified positive examples divided by the total number of positive examples.
# High recall means most positive examples are recognized (few false negatives).

recall = (TP)/(TP + FN)

In [19]:
# Precision is the number of correctly classified positive examples divided by the total number of examples predicted as positive.
# High precision means an example labeled as positive is usually positive (few false positives).

precision = (TP)/(TP + FP)

In [20]:
fmeasure = (2*recall*precision)/(recall+precision)
accuracy = (TP + TN)/(TN + FN + FP + TP)
# accuracy_score(y_test, y_predict)
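As a sanity check, the hand-computed formulas can be compared with the metric functions in `sklearn.metrics`. The sketch below uses toy labels invented for this example and assumes label 1 is the positive class, which is the default for `precision_score`, `recall_score`, and `f1_score` on binary data.

```python
import math
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels, invented for illustration only (1 = positive class)
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1]

# Count the four outcomes by hand
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

# Same formulas as in the cells above
recall_manual = tp / (tp + fn)
precision_manual = tp / (tp + fp)
f_manual = 2 * recall_manual * precision_manual / (recall_manual + precision_manual)
accuracy_manual = (tp + tn) / len(pairs)

# Pair each manual value with the library value
checks = {
    "recall": (recall_manual, recall_score(y_true_toy, y_pred_toy)),
    "precision": (precision_manual, precision_score(y_true_toy, y_pred_toy)),
    "f-measure": (f_manual, f1_score(y_true_toy, y_pred_toy)),
    "accuracy": (accuracy_manual, accuracy_score(y_true_toy, y_pred_toy)),
}
for name, (manual, library) in checks.items():
    print(name, manual, library, math.isclose(manual, library))
```

In the notebook, passing `y_test` and `y_predict` to the same functions should give the same values as the manual calculation.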

In [21]:
print("------ CLASSIFICATION PERFORMANCE OF NEAREST-CENTROID MODEL ------"
      "\n Recall : ", (recall*100), "%"
      "\n Precision : ", (precision*100), "%"
      "\n Accuracy : ", (accuracy*100), "%"
      "\n F-measure : ", (fmeasure*100), "%")


------ CLASSIFICATION PERFORMANCE OF NEAREST-CENTROID MODEL ------
 Recall :  92.46987951807229 %
 Precision :  59.496124031007746 %
 Accuracy :  66.30669546436285 %
 F-measure :  72.40566037735849 %
