# Classification Method

For each of the 7 NLP methods:

    1) loadNLPVectors
    
    2) genLabels
    
    3) train_test_split: create X_train, X_test, y_train, y_test
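
The three steps above could be repeated for every NLP method with a loop like the following. This is a minimal sketch: the arrays in `feature_sets` are invented stand-ins for what `loadNLPVectors` would return, and `labels` stands in for the output of `genLabels`. Reusing the same `test_size` and `random_state` for arrays of equal length selects the same rows for every method, so the methods are compared on an identical test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the loaded feature arrays and labels;
# shapes and values are invented for illustration only.
rng = np.random.default_rng(42)
labels = rng.integers(0, 3, size=100)
feature_sets = {
    "unigram": rng.random((100, 20)),
    "word2vec": rng.random((100, 8)),
}

splits = {}
for name, features in feature_sets.items():
    # Identical split settings for every method keep the test rows aligned
    splits[name] = train_test_split(
        features, labels, test_size=0.2, random_state=42, shuffle=True
    )

for name, (X_train, X_test, y_train, y_test) in splits.items():
    print(name, X_train.shape, X_test.shape)
```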

Metrics:

    1) accuracy
    
    2) F-score (possibly also precision and recall)
    
    3) Area under ROC
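
As a rough illustration of the three metrics, the sketch below computes them with `sklearn.metrics` on a small invented binary example. The arrays `y_true`, `y_pred`, and `y_score` are made up for this purpose and are not taken from the tweet data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score,
                             precision_recall_fscore_support,
                             roc_auc_score)

# Invented toy labels, predictions, and scores (illustration only)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.2, 0.6, 0.8, 0.7, 0.4, 0.1])  # estimated P(class 1)

# 1) Fraction of predictions that match the true label
accuracy = accuracy_score(y_true, y_pred)

# 2) Precision, recall, and F-score for the positive class
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# 3) Area under the ROC curve, computed from scores rather than hard labels
auc = roc_auc_score(y_true, y_score)

print(accuracy, precision, recall, f1, auc)
```

For more than two classes, the F-score needs an explicit averaging mode such as `average="macro"`, and `roc_auc_score` needs per-class probabilities together with `multi_class="ovr"` or `multi_class="ovo"`.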

For each machine learning algorithm, applied to each NLP method:

    1) Import classifier
    
    2) cross_val_score: classifier, X_train, y_train, scoring, cv = 10, n_jobs = -1 (one metric per call; cross_validate accepts several metrics at once)
    
    3) average cross_val_score
    
    4) Train classifier on entirety of X_train, y_train
    
    5) Evaluate classifier on X_test, y_test
    
    6) Compare test metric vs cross_val_score metric
    
    7) Save trained model
    
    8) Generate confusion matrices and other visualizations if necessary
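
The eight steps above might look roughly like this for a single classifier. It is a sketch under several assumptions: the data is synthetic (`make_classification`), `LogisticRegression` is only a placeholder classifier, the metric names in `scoring` are one possible choice, and the model file name is hypothetical. It uses `cross_validate` because that function accepts a list of metrics.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic three-class data standing in for one NLP feature array (assumption)
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)

# Step 1: placeholder classifier
clf = LogisticRegression(max_iter=1000)

# Steps 2-3: 10-fold cross-validation on several metrics, averaged per metric
scoring = ["accuracy", "f1_macro", "roc_auc_ovr"]
cv = cross_validate(clf, X_train, y_train, scoring=scoring, cv=10)
cv_means = {metric: cv["test_" + metric].mean() for metric in scoring}

# Steps 4-6: refit on the full training set, then compare test and CV accuracy
model = clf.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(cv_means, test_accuracy)

# Step 7: save the trained model (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Step 8: confusion matrix, rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Passing `n_jobs = -1` to `cross_validate` runs the folds in parallel, as in the plan above; it is left out here only to keep the sketch simple.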

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
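def loadCSV(filename):
    """Read a CSV file into a DataFrame, appending '.csv' if it is missing."""
    file = filename
    # endswith avoids a false match when '.csv' appears mid-name
    if not filename.endswith('.csv'):
        file += '.csv'
    # The file is read as ISO-8859-1 (Latin-1) rather than UTF-8
    data = pd.read_csv(file, encoding = 'ISO-8859-1')
    return data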

In [3]:
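def loadNLPVectors(filename):
    """Load a feature array from nlp_data/, appending '.npy' if it is missing."""
    file = 'nlp_data/' + filename
    if not file.endswith('.npy'):
        file += '.npy'
    return np.load(file)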

In [4]:
def genLabels():
    # Relies on the global DataFrame `data`, which must be loaded before calling
    labels = data["tweet_class"].to_numpy()
    return labels

In [5]:
csvFile = "binary_classification"
data = loadCSV(csvFile)

In [6]:
unigram_array = "feature_array_unigram.npy"
bigram_array = "feature_array_bigram.npy"
tfidf_array = "feature_array_tfidf.npy"
wordvec_array = "feature_array_word2vec.npy"
unigram_reduced = "reduced_unigram.npy"
bigram_reduced = "reduced_bigram.npy"
tfidf_reduced = "reduced_tfidf.npy"

In [6]:
word2vec = loadNLPVectors('feature_array_word2vec')

In [7]:
word2vec_labels = genLabels()

In [8]:
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(word2vec, 
                                                    word2vec_labels, 
                                                    test_size = 0.2, 
                                                    random_state = 42, 
                                                    shuffle = True)

In [9]:
from sklearn.svm import SVC
svm_clf = SVC(probability = True, random_state = 42)

In [10]:
cv_results = cross_val_score(svm_clf, X_train, y_train, scoring = "accuracy", cv = 10, n_jobs = -1)

In [11]:
cv_results

array([0.51151631, 0.51151631, 0.5125    , 0.5125    , 0.5125    ,
       0.5125    , 0.5125    , 0.5125    , 0.5125    , 0.5125    ])

In [12]:
cv_results.mean()

0.5123032629558542
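
The fold scores above are almost identical, which is consistent with (though not proof of) a classifier that predicts the same class for nearly every sample. One hedged way to check is to compare against a majority-class baseline. The sketch below uses invented data with made-up class proportions; on the real data, `X_train` and `y_train` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Invented data: three classes with made-up proportions (illustration only)
rng = np.random.default_rng(0)
y = rng.choice([0, 1, 2], size=1000, p=[0.51, 0.33, 0.16])
X = rng.random((1000, 5))

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, y, scoring="accuracy", cv=10)

# Share of the largest class, for comparison with the baseline accuracy
majority_share = np.bincount(y).max() / len(y)
print(baseline_scores.mean(), majority_share)
```

If a trained model's cross-validated accuracy is no higher than this baseline, accuracy alone is not showing that the model has learned anything from the features.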

In [13]:
svm_model = svm_clf.fit(X_train, y_train)

In [17]:
svm_model.predict_proba(X_test)

array([[0.51455095, 0.328629  , 0.15682006],
       [0.5141496 , 0.32941481, 0.15643558],
       [0.51399339, 0.32948167, 0.15652493],
       ...,
       [0.51460124, 0.32923166, 0.1561671 ],
       [0.51427649, 0.3295108 , 0.15621271],
       [0.51421424, 0.32926606, 0.15651969]])
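
To read the probabilities as class predictions, one option is to take the highest-probability column in each row. The sketch below does this for the first three rows printed above. The values in `classes` are hypothetical placeholders; on the fitted model the real labels are available as `svm_model.classes_`.

```python
import numpy as np

# First three rows of the predict_proba output shown above
proba = np.array([
    [0.51455095, 0.328629, 0.15682006],
    [0.5141496, 0.32941481, 0.15643558],
    [0.51399339, 0.32948167, 0.15652493],
])

# Hypothetical class labels; on the real model use svm_model.classes_
classes = np.array([0, 1, 2])

# Most probable class per row
predicted = classes[np.argmax(proba, axis=1)]

# How much each class probability varies across these rows
spread = proba.max(axis=0) - proba.min(axis=0)

print(predicted)
print(spread)
```

All three rows select the first class, and each class probability varies by less than 0.001 across them. Note that scikit-learn documents that, for `SVC` with `probability = True`, the output of `predict_proba` may be inconsistent with `predict`, so the argmax shown here is not guaranteed to match `svm_model.predict(X_test)`.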