Assuming I already have some amount of labeled training data, what can I do?

Here are a few options:
1. I can train my own classification model, using any kind of feature extraction and choosing from a range of available ML/DL algorithms.
2. I can take an existing deep learning model and "fine-tune" it for my task.

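I will use the train_labelled.txt file as training data and test_labelled.txt as test data. For features, I will first try a bag of words representation and then use Sentence Transformers (sbert.net) to get a feature representation for the text. In both cases, I will use logistic regression as my classifier.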

Note that my goal here is not to exhaustively evaluate how to train, but only to illustrate how training our own model (with a modestly sized labeled dataset) compares with using a cloud provider's solution.

1. Following a typical text classification approach (Bag of words features + logistic regression)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import string

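stopwords = ENGLISH_STOP_WORDS

# Remove punctuation, digits and stopwords.
# Note: passing this function as CountVectorizer's preprocessor replaces the
# default lowercasing step, so the stopword matching here is case-sensitive.
def clean(doc):  # doc is a string of text
    doc = "".join(char for char in doc if char not in string.punctuation and not char.isdigit())
    doc = " ".join(token for token in doc.split() if token not in stopwords)
    return doc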

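# Read a train/test file with one tab-separated "sentence<TAB>label" pair per line
def read_data(filepath):
    texts = []
    labels = []
    with open(filepath) as f:  # close the file once reading is done
        for line in f:
            sentence, label = line.strip().split("\t")
            labels.append(label)
            texts.append(sentence)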
    return texts, labels
    
train_texts, train_labels = read_data("../files/train_labelled.txt")
test_texts, test_labels = read_data("../files/test_labelled.txt")

# BOW feature extraction
vect = CountVectorizer(preprocessor=clean, max_features=1000)  # instantiate a vectorizer
train_vector = vect.fit_transform(train_texts)  # learn the vocabulary and extract training features
# transform test data (using the vocabulary learned from the training data)
test_vector = vect.transform(test_texts)

#Use a logistic regression classifier to train and test the model
logreg = LogisticRegression() # instantiate a logistic regression model
logreg.fit(train_vector, train_labels) # fit the model with training data
# Make predictions on test data
predicted = logreg.predict(test_vector)

#Print results.
print(classification_report(test_labels, predicted))
print(confusion_matrix(test_labels,predicted))

              precision    recall  f1-score   support

           0       0.71      0.82      0.76       500
           1       0.78      0.67      0.72       500

    accuracy                           0.74      1000
   macro avg       0.75      0.74      0.74      1000
weighted avg       0.75      0.74      0.74      1000

[[408  92]
 [166 334]]
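
The classification report and the confusion matrix carry the same information. As a quick sanity check, the sketch below recomputes per-class precision and recall from the matrix printed above. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels. The helper name `per_class_metrics` is my own and is not part of scikit-learn.

```python
def per_class_metrics(cm):
    """Return (precision, recall) lists with one value per class.

    Assumes rows are true labels and columns are predicted labels,
    and that every class is predicted at least once (no zero division).
    """
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: predicted as k
        true_k = sum(cm[k])                            # row sum: truly k
        precision.append(cm[k][k] / predicted_k)
        recall.append(cm[k][k] / true_k)
    return precision, recall

bow_cm = [[408, 92], [166, 334]]  # confusion matrix printed above
precision, recall = per_class_metrics(bow_cm)
accuracy = sum(bow_cm[k][k] for k in range(2)) / sum(map(sum, bow_cm))

print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print(round(accuracy, 2))
```

Rounded to two decimals, these values agree with the report: precision 0.71 and 0.78, recall 0.82 and 0.67, accuracy 0.74.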


2. Use a neural text representation from the Sentence Transformers library

In [7]:
from sentence_transformers import SentenceTransformer

#Choose from the models here: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('paraphrase-TinyBERT-L6-v2')

# Read the train/test files and extract labels and
# textual features using the Sentence Transformers model.
def read_data(filepath):
    data = []
    labels = []
    with open(filepath) as f:  # close the file once reading is done
        for line in f:
            sentence, label = line.strip().split("\t")
            labels.append(label)
            data.append(model.encode(sentence))  # feature extraction, one sentence at a time
    return data, labels
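
Encoding one sentence per call works, but `model.encode` also accepts a list of sentences and batches them internally, which is usually faster. Here is a minimal sketch of that variant. The function name `read_data_batched` and its `encode_fn` parameter are names I made up for illustration, and the explicit UTF-8 encoding is an assumption about the files.

```python
def read_data_batched(filepath, encode_fn):
    """Read tab-separated 'sentence<TAB>label' lines and encode all sentences in one call.

    encode_fn is any callable that maps a list of strings to a sequence of
    feature vectors, for example model.encode from a SentenceTransformer.
    """
    sentences, labels = [], []
    with open(filepath, encoding="utf-8") as f:  # assumption: files are UTF-8
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            sentence, label = line.split("\t")
            sentences.append(sentence)
            labels.append(label)
    features = encode_fn(sentences)  # single batched call instead of one per line
    return features, labels

# Hypothetical usage with the model loaded above:
# train_data, train_labels = read_data_batched("../files/train_labelled.txt", model.encode)
```

Passing the encoder in as an argument also makes it easy to swap in a different model, or a stub when testing the file parsing on its own.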



In [8]:
#feature extraction
train_data, train_labels = read_data("../files/train_labelled.txt")
test_data, test_labels = read_data("../files/test_labelled.txt")


#modeling
clf = LogisticRegression(random_state=0).fit(train_data, train_labels)

#prediction/evaluation
predicted = clf.predict(test_data)
print(classification_report(test_labels, predicted))
print(confusion_matrix(test_labels,predicted))

              precision    recall  f1-score   support

           0       0.89      0.86      0.87       500
           1       0.86      0.89      0.88       500

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.87      1000
weighted avg       0.88      0.88      0.87      1000

[[428  72]
 [ 53 447]]


How does all this compare with training on Snorkel's generated labels?

In [17]:
snorkel_train = []
snorkel_train_labels = []
with open("../files/snorkellabeled_train.csv") as f:
    for line in f:
        sentence, label = line.strip().split("\t") #this file has an extra id column.
        snorkel_train_labels.append(label)
        snorkel_train.append(model.encode(sentence)) #feature extraction

In [18]:
clf = LogisticRegression(random_state=0).fit(snorkel_train, snorkel_train_labels)
#prediction/evaluation
predicted = clf.predict(test_data)
print(classification_report(test_labels, predicted))
print(confusion_matrix(test_labels,predicted))

              precision    recall  f1-score   support

           0       0.93      0.49      0.64       500
           1       0.65      0.96      0.78       500

    accuracy                           0.73      1000
   macro avg       0.79      0.73      0.71      1000
weighted avg       0.79      0.73      0.71      1000

[[247 253]
 [ 20 480]]
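
To put the three runs side by side, the sketch below computes accuracy and the share of test examples predicted as class 1 from the confusion matrices printed above. The helper name `summarize` and the run labels in the dictionary are my own, chosen only for illustration.

```python
def summarize(cm):
    """Return (accuracy, share predicted as class 1) for a 2x2 confusion matrix.

    Assumes rows are true labels and columns are predicted labels.
    """
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    predicted_as_1 = (cm[0][1] + cm[1][1]) / total  # second column: predicted class 1
    return accuracy, predicted_as_1

# Confusion matrices copied from the three outputs above
runs = {
    "BOW, train_labelled": [[408, 92], [166, 334]],
    "SBERT, train_labelled": [[428, 72], [53, 447]],
    "SBERT, Snorkel labels": [[247, 253], [20, 480]],
}

for name, cm in runs.items():
    accuracy, share = summarize(cm)
    print(f"{name:24s} accuracy={accuracy:.3f} predicted_as_1={share:.3f}")
```

On these numbers, the classifier trained on Snorkel's labels assigns class 1 to 733 of the 1,000 test examples even though the test set is balanced. That matches its high recall for class 1 (0.96) and low recall for class 0 (0.49).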
