# TRAINING TEXT CLASSIFIERS WITH SPACY

In this lab we will train different text classifiers with spaCy.

1. Read through the code and try to add more inline documentation as you work out what each part does.

2. We will adapt the code to train two different fake news classifiers: one on general fake news from 6 different domains and another on celebrity news, where there are legitimate news stories but also false gossip.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# We will be using spacy v2, so no need to upgrade to v3

In [None]:
# TODO install and test the language models of your choice, following https://spacy.io/usage

!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md
#!python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
import spacy
import csv
import random
import time
import numpy as np
import pandas as pd
import re
import string

from spacy.util import minibatch, compounding
import sys
from spacy import displacy
from itertools import chain

from sklearn.metrics import classification_report

# TODO add inline documentation describing the functionality of each function
# load data
def load_data(fnames):
    data = []
    for fname in fnames:
        data.append(pd.read_csv(fname, sep='\t', encoding='utf-8'))
    data = pd.concat(data)
    targets = set(data['Target'])
    return data, list(targets)

# pre-process tweets
def cleanup(tweet):
    """Remove URLs, strip the '#' and '@' symbols (the hashtag and user names themselves are kept), and replace newlines and tabs with spaces"""
    tweet = re.sub(r"http\S+", "", tweet.replace("#", "").replace("@", "").replace('\n', ' ').replace('\t', ' '))
    return tweet
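
To see what `cleanup` does in practice, here is a minimal, self-contained sketch that repeats the same logic and applies it to a made-up tweet (the sample text and URL are invented for illustration, not taken from the dataset). Only the `#` and `@` characters are dropped, so the hashtag and user name survive as plain words, and removing a URL leaves the surrounding spaces in place.

```python
import re

def cleanup(tweet):
    """Same logic as the lab's cleanup function, split over two lines."""
    tweet = tweet.replace("#", "").replace("@", "").replace("\n", " ").replace("\t", " ")
    return re.sub(r"http\S+", "", tweet)

# Invented example tweet (assumption, not from the SemEval data)
sample = "@tedcruz And, #HandOverTheServer\nnow http://t.co/abc123 SemST"
cleaned = cleanup(sample)

# repr() makes the double space left behind by the removed URL visible
print(repr(cleaned))
```

If the leftover double spaces matter for a later step, collapsing whitespace would be an extra step that the lab code does not perform.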

In [None]:
# data path. trial data used as training too.
trial_file = "/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/semeval2016-task6-trialdata.utf-8.txt"
train_file = "/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/semeval2016-task6-trainingdata.utf-8.txt"
test_file = "/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/SemEval2016-Task6-subtaskA-testdata-gold.txt"

training_data, targets = load_data([trial_file, train_file])
training_data['Clean_tweet'] = training_data['Tweet'].apply(cleanup)

test_data, _ = load_data([test_file])
test_data['Clean_tweet'] = test_data['Tweet'].apply(cleanup)
display(training_data)

Unnamed: 0,ID,Target,Tweet,Stance,Clean_tweet
0,1,Hillary Clinton,"@tedcruz And, #HandOverTheServer she wiped cle...",AGAINST,"tedcruz And, HandOverTheServer she wiped clean..."
1,2,Hillary Clinton,Hillary is our best choice if we truly want to...,FAVOR,Hillary is our best choice if we truly want to...
2,3,Hillary Clinton,@TheView I think our country is ready for a fe...,AGAINST,TheView I think our country is ready for a fem...
3,4,Hillary Clinton,I just gave an unhealthy amount of my hard-ear...,AGAINST,I just gave an unhealthy amount of my hard-ear...
4,5,Hillary Clinton,@PortiaABoulger Thank you for adding me to you...,NONE,PortiaABoulger Thank you for adding me to your...
...,...,...,...,...,...
2809,2910,Legalization of Abortion,"There's a law protecting unborn eagles, but no...",AGAINST,"There's a law protecting unborn eagles, but no..."
2810,2911,Legalization of Abortion,I am 1 in 3... I have had an abortion #Abortio...,AGAINST,I am 1 in 3... I have had an abortion Abortion...
2811,2912,Legalization of Abortion,How dare you say my sexual preference is a cho...,AGAINST,How dare you say my sexual preference is a cho...
2812,2913,Legalization of Abortion,"Equal rights for those 'born that way', no rig...",AGAINST,"Equal rights for those 'born that way', no rig..."


In [None]:
for target in targets:
  training_data[training_data['Target'] == target][['Stance', 'Clean_tweet']].to_csv(f"/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/train.{target}.tsv",
          sep="\t", index=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar="")
  test_data[test_data['Target'] == target][['Stance', 'Clean_tweet']].to_csv(f"/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/test.{target}.tsv",
          sep="\t", index=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar="")

In [None]:
def load_data_spacy(fname):
  training_data = pd.read_csv(fname, sep='\t', encoding='utf-8')
  #train_data.dropna(axis = 0, how ='any',inplace=True)
  #train_data['Num_words_text'] = train_data['text'].apply(lambda x:len(str(x).split())) 
  #mask = train_data['Num_words_text'] >2
  #train_data = train_data[mask]
  print(training_data['Stance'].value_counts())
   
  train_texts = training_data['Clean_tweet'].tolist()
  train_cats = training_data['Stance'].tolist()
  final_train_cats=[]
  for cat in train_cats:
    cat_list = {}
    if cat == 'AGAINST':
      cat_list['AGAINST'] =  1
      cat_list['FAVOR'] =  0
      cat_list['NONE'] =  0
    elif cat == 'FAVOR':
      cat_list['AGAINST'] =  0
      cat_list['FAVOR'] =  1
      cat_list['NONE'] =  0
    else:
      cat_list['AGAINST'] =  0
      cat_list['FAVOR'] =  0
      cat_list['NONE'] =  1
    final_train_cats.append(cat_list)
    
  train_data = list(zip(train_texts, [{"cats": cats} for cats in final_train_cats]))
  return train_data, train_texts, train_cats
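
The `if`/`elif`/`else` chain in `load_data_spacy` builds a one-hot dictionary per label. When adapting the code to a different label set (for example in the fake news assignment), a dictionary comprehension may be easier to maintain. The following is only a sketch: `to_cats`, `LABELS`, and the placeholder texts are hypothetical names introduced here, not part of the lab code.

```python
LABELS = ["AGAINST", "FAVOR", "NONE"]

def to_cats(label, labels=LABELS, fallback="NONE"):
    """Hypothetical helper: build the one-hot dict for one gold label.

    Any value that is not a known label is mapped to `fallback`,
    mirroring the final `else` branch of load_data_spacy.
    """
    active = label if label in labels else fallback
    return {name: int(name == active) for name in labels}

# Placeholder texts and stances, invented for the example
texts = ["tweet one", "tweet two"]
stances = ["FAVOR", "AGAINST"]

# Same (text, {"cats": {...}}) structure that load_data_spacy returns
train_data = [(text, {"cats": to_cats(stance)}) for text, stance in zip(texts, stances)]
print(train_data[0])
```

With this shape, switching to other labels would only require changing the `LABELS` list and the fallback.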


In [None]:
training_data, train_texts, train_cats = load_data_spacy('/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/train.Feminist Movement.tsv')
print(training_data[:10])
print(len(training_data))
test_data, test_texts, test_cats = load_data_spacy('/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/stance-semeval2016/test.Feminist Movement.tsv')
print(len(test_data))

AGAINST    328
FAVOR      210
NONE       126
Name: Stance, dtype: int64
[('Always a delight to see chest-drumming alpha males hiss and scuttle backwards up the wall when a feminist enters the room. manly SemST', {'cats': {'AGAINST': 0, 'FAVOR': 1, 'NONE': 0}}), ("Sometimes I overheat and want to take off my shirt but can't because of social expectations of people with breasts. ;n; SemST", {'cats': {'AGAINST': 0, 'FAVOR': 1, 'NONE': 0}}), ('If feminists spent 1/2 as much time reading papers as they do tumblr they would be real people, not ignorant sexist bigots. SemST', {'cats': {'AGAINST': 1, 'FAVOR': 0, 'NONE': 0}}), ('Stupid Feminists, the civilization you take for granted was built with the labour, blood sweat and tears of men. SemST', {'cats': {'AGAINST': 1, 'FAVOR': 0, 'NONE': 0}}), ("YOU'RE A GIRL AND HAVE A SEX DRIVE!? YOU MUST BE A SLUT! feminist SemST", {'cats': {'AGAINST': 0, 'FAVOR': 1, 'NONE': 0}}), ("Suns out....  Dresses out...  StreetHarassment out...  This shouldn't be 

In [None]:
def Sort(sub_li):
  # sort (label, score) pairs by score, the second element of each pair;
  # reverse=True sorts in descending order, so the highest score comes first
  return(sorted(sub_li, key = lambda x: x[1],reverse=True))  

# run the predictions on each sentence in the evaluation  dataset, and return the metrics
def evaluate(tokenizer, textcat, test_texts, test_cats ):
  docs = (tokenizer(text) for text in test_texts)
  preds = []
  for i, doc in enumerate(textcat.pipe(docs)):
    #print(doc.cats.items())
    scores = Sort(doc.cats.items())
    #print(scores)
    catList=[]
    for score in scores:
      catList.append(score[0])
    preds.append(catList[0])
        
  labels = ['AGAINST', 'FAVOR']
  print(classification_report(test_cats, preds,labels=labels))
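
The core of `evaluate` is picking the highest-scoring label from `doc.cats` for each text. The sketch below isolates that step on a plain dictionary, so it runs without spaCy. The scores are illustrative values for a single tweet, and `top_label` is a hypothetical helper; it should give the same answer as sorting with `Sort` and taking the first element.

```python
def top_label(cats):
    """Hypothetical helper: return the label with the highest score."""
    return max(cats, key=cats.get)

# Illustrative scores in the shape of doc.cats (assumed values)
scores = {"AGAINST": 0.402, "FAVOR": 0.364, "NONE": 0.233}

# The approach used in evaluate(): sort descending by score, take the first label
via_sort = sorted(scores.items(), key=lambda x: x[1], reverse=True)[0][0]

print(top_label(scores), via_sort)
```

Note that the prediction can be `NONE` even though the report in `evaluate` only lists `AGAINST` and `FAVOR`; such predictions simply count as misses for those two classes.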
    

In [None]:
def train_spacy(  train_data, iterations,test_texts,test_cats, model_arch, dropout = 0.3, model=None, init_tok2vec=None):
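    ''' Train a spaCy text classifier (textcat) and evaluate it on the test data after every iteration
   
    train_data : training data as a list of (text, {"cats": {"AGAINST": 0|1, "FAVOR": 0|1, "NONE": 0|1}}) tuples
    iterations : number of training iterations
    test_texts : list of test texts, used for the per-iteration evaluation
    test_cats : list of gold labels for test_texts
    model_arch : textcat architecture, e.g. "bow" or "simple_cnn"
    dropout : dropout proportion for training
    model : currently unused; en_core_web_sm is always loaded
    init_tok2vec : optional path object pointing to pretrained tok2vec weights
    '''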
    
    nlp = spacy.load('en_core_web_sm')
    

    # add the text classifier to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            "textcat", config={"exclusive_classes": True, "architecture": model_arch}
        )
        nlp.add_pipe(textcat, last=True)
        
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe("textcat")

    # add label to text classifier
    textcat.add_label("AGAINST")
    textcat.add_label("FAVOR")
    textcat.add_label("NONE")


    # get names of other pipes to disable them during training
    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        if init_tok2vec is not None:
            with init_tok2vec.open("rb") as file_:
                textcat.model.tok2vec.from_bytes(file_.read())
        print("Training the model...")
        print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
        batch_sizes = compounding(16.0, 64.0, 1.5)
        for i in range(iterations):
            print('Iteration: '+str(i))
            start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the test data 
                evaluate(nlp.tokenizer, textcat, test_texts,test_cats)
            print ('Elapsed time'+str(time.perf_counter() - start_time)+  "seconds")
        with nlp.use_params(optimizer.averages):
            model_name = model_arch + "_Feminism_Stance_Semeval2016"
            filepath = "/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/resources/" + model_name 
            nlp.to_disk(filepath)
    return nlp
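
The training loop relies on `compounding(16.0, 64.0, 1.5)` and `minibatch` to produce batches that start small and grow. The sketch below is a simplified pure-Python re-implementation for illustration only (the `_sketch` names are hypothetical and this is not spaCy's own code); it shows which batch sizes a schedule like this would produce for 664 examples, the size of the Feminist Movement training file.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Simplified for illustration; assumes start <= stop.
    """
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

def minibatch_sketch(items, sizes):
    """Cut `items` into consecutive batches whose lengths come from `sizes`."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch

# First few sizes of the schedule used in train_spacy
first_sizes = list(islice(compounding_sketch(16.0, 64.0, 1.5), 6))
print(first_sizes)

# Batch lengths for 664 dummy items on a fresh schedule
schedule = compounding_sketch(16.0, 64.0, 1.5)
batch_lengths = [len(batch) for batch in minibatch_sketch(range(664), schedule)]
print(batch_lengths)
```

Because `batch_sizes` is created once, before the loop over iterations, the schedule does not restart at 16 for each iteration: under this reading, the growth phase happens only during the first pass over the data and later passes use batches of 64.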

In [None]:
nlp = train_spacy(training_data, 20, test_texts, test_cats, "bow")

Training the model...
LOSS 	  P  	  R  	  F  
Iteration: 0




              precision    recall  f1-score   support

     AGAINST       0.64      1.00      0.78       183
       FAVOR       0.00      0.00      0.00        58

   micro avg       0.64      0.76      0.70       241
   macro avg       0.32      0.50      0.39       241
weighted avg       0.49      0.76      0.59       241

Elapsed time0.6579739999999994seconds
Iteration: 1
              precision    recall  f1-score   support

     AGAINST       0.64      1.00      0.78       183
       FAVOR       0.00      0.00      0.00        58

   micro avg       0.64      0.76      0.70       241
   macro avg       0.32      0.50      0.39       241
weighted avg       0.49      0.76      0.59       241

Elapsed time0.30694599999999994seconds
Iteration: 2
              precision    recall  f1-score   support

     AGAINST       0.64      1.00      0.78       183
       FAVOR       0.00      0.00      0.00        58

   micro avg       0.64      0.76      0.70       241
   macro avg       0.32  
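
The numbers in these first reports can be reproduced by hand. A precision of 0.64 with recall 1.00 for AGAINST, and zeros for FAVOR, is what one would expect if every test tweet were labelled AGAINST. The sketch below redoes the arithmetic under that scenario. The supports 183 and 58 come from the report; the 44 NONE tweets are an assumption, since `labels = ['AGAINST', 'FAVOR']` leaves that class out of the table.

```python
# Gold counts: 183 and 58 are the supports in the report; 44 is an assumption
gold = {"AGAINST": 183, "FAVOR": 58, "NONE": 44}
n_tweets = sum(gold.values())

# Scenario: the classifier predicts AGAINST for every tweet
tp_against = gold["AGAINST"]       # every AGAINST tweet is found
predicted_against = n_tweets       # all tweets receive this label

precision_against = tp_against / predicted_against
recall_against = tp_against / gold["AGAINST"]

# Micro average over the reported labels only (AGAINST and FAVOR);
# FAVOR contributes no true positives and no predictions
micro_tp = tp_against
micro_predicted = predicted_against
micro_gold = gold["AGAINST"] + gold["FAVOR"]

micro_precision = micro_tp / micro_predicted
micro_recall = micro_tp / micro_gold
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(round(precision_against, 2), round(recall_against, 2))
print(round(micro_precision, 2), round(micro_recall, 2), round(micro_f1, 2))
```

Under these assumptions the rounded values match the AGAINST row and the micro avg row of the report, which suggests the model has not yet learned to separate the classes in the early iterations.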

In [None]:
nlp = train_spacy(training_data, 20, test_texts, test_cats, "simple_cnn")

Training the model...
LOSS 	  P  	  R  	  F  
Iteration: 0




              precision    recall  f1-score   support

     AGAINST       0.64      1.00      0.78       183
       FAVOR       0.00      0.00      0.00        58

   micro avg       0.64      0.76      0.70       241
   macro avg       0.32      0.50      0.39       241
weighted avg       0.49      0.76      0.59       241

Elapsed time2.053317seconds
Iteration: 1
              precision    recall  f1-score   support

     AGAINST       0.64      1.00      0.78       183
       FAVOR       0.00      0.00      0.00        58

   micro avg       0.64      0.76      0.70       241
   macro avg       0.32      0.50      0.39       241
weighted avg       0.49      0.76      0.59       241

Elapsed time1.6899540000000002seconds
Iteration: 2
              precision    recall  f1-score   support

     AGAINST       0.69      0.79      0.73       183
       FAVOR       0.27      0.34      0.30        58

   micro avg       0.58      0.68      0.62       241
   macro avg       0.48      0.57   

In [None]:
textcat_bow = spacy.load("/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/resources/bow_Feminism_Stance_Semeval2016")
tweets = textcat_bow(test_texts[10])
print("Text: "+ test_texts[10])
print("Gold Label:"+ test_cats[10])
print(" Predicted Label:") 
print(tweets.cats)
print("=======================================")

Text: sometiimes you just feel like punching a feminist in the face SemST
Gold Label:AGAINST
 Predicted Label:
{'AGAINST': 0.40220609307289124, 'FAVOR': 0.36449941992759705, 'NONE': 0.23329448699951172}


# ASSIGNMENTS

1. TODO Train the classifiers for the other 4 targets in the Stance SemEval 2016 dataset.

2. TODO Reuse the above code to train a new classifier for fake news using the celebrity and the fake news datasets: 

  Data: "/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/datasets/fake_rada"

  2.1 HINT: You need to (i) load the data into a pandas dataframe; (ii) modify the labels in the converter and training functions.

  2.2 HINT: Once you have a pandas dataframe, it is easy to split the data into 80% for training and 20% for testing.

3. TODO Try the different spaCy language models (e.g. en_core_web_sm, en_core_web_md, en_core_web_lg) to see the difference in performance.
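
For hint 2.2, one possible way to do the 80/20 split with pandas is sketched below. The toy dataframe and the column names `text` and `label` are hypothetical; the real fake news data will have its own layout, so treat this as a starting point rather than the expected solution.

```python
import pandas as pd

# Toy dataframe; the column names "text" and "label" are hypothetical
df = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": ["fake", "legit"] * 5,
})

# Draw 80% of the rows at random (fixed seed so the split is repeatable)
train_df = df.sample(frac=0.8, random_state=42)

# Keep the remaining 20% of the rows for testing
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```

A purely random split like this does not guarantee that both labels are equally represented in each part; for small or imbalanced datasets it may be worth checking `value_counts()` on both splits.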