# Morality Baseline Classifier

This script proceeds in the following steps:
- Data preprocessing and cleaning
- Fitting baseline classifiers
- Grid search on the best baseline classifiers
- Training all baseline classifiers with word embeddings
- Selecting the best-performing candidate model and applying it to the final test set held out in the train-validation-test split
- Using the best-performing classifier to label the unlabelled data (this step was added later, after all models, including BERT, had been run and the best model was determined to be a word embedding model)

## Import required packages

In [2]:
import pandas as pd
import regex
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt")
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
)
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
import emoji
import warnings
import gensim
import gensim.downloader as api
import embeddingvectorizer
from embeddingvectorizer import EmbeddingCountVectorizer, EmbeddingTfidfVectorizer
import pickle 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Load Datasets of Reddit Posts to r/news

#### Posts

In [3]:
# load labelled & unlabelled posts:
labelled_posts = pd.read_excel("labs_labelled_posts_new.xlsx")
unlabelled_posts = pd.read_csv("unlabelled_posts_new.csv")

## Data Preprocessing Functions to Compute Baseline Models:

In [4]:
# This function takes a text and a list of regular expressions as input
def reddit_preprocessing(x, listofregex):
    # loop over the regular expressions stored in the list
    for expression in listofregex:
        # replace every match of the respective regex with ""
        x = re.sub(expression, "", x)
    # remove emojis from the text
    x = emoji.replace_emoji(x, replace="")
    # return the text without the patterns matched by the regular expressions
    return x

In [5]:
# Function to remove duplicate text and missing data (e.g., posts or comments with different IDs but the same content)
def remove_bad_rows(df, textcolumn): 
  df = df.drop_duplicates(subset = textcolumn).dropna()
  return df

## Custom Tokenizer for Baseline Models

In [6]:
# Self-made tokenizer: 
class MyTokenizer:
    def tokenize(self, text):
        result = [] 
        word_pattern = r"\p{letter}" # match any Unicode letter, so tokens without a single letter (punctuation, digits) are dropped
        tokens = nltk.word_tokenize(text, language = 'english') # use word_tokenize (based on an improved TreebankWordTokenizer and the PunktSentenceTokenizer)
        tokens = [e for e in tokens if regex.search(word_pattern, e)]
        result += tokens    
        return result

mytokenizer = MyTokenizer()
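
The letter filter in the tokenizer can be illustrated without NLTK. The sketch below is not the tokenizer itself: it assumes an already tokenized title (the `tokens` list is made up for illustration) and uses `str.isalpha`, which is true for Unicode letters, as a stand-in for the `\p{letter}` pattern of the `regex` package.

```python
def keep_tokens_with_letters(tokens):
    # keep a token if at least one of its characters is a Unicode letter
    return [tok for tok in tokens if any(ch.isalpha() for ch in tok)]


# hypothetical output of a word tokenizer for one title
tokens = ["Judge", "'s", "ruling", ":", "2018", "U.S.", "ban", "!"]
filtered = keep_tokens_with_letters(tokens)
print(filtered)
```

Tokens such as `'s` and `U.S.` survive because they contain letters, whereas pure punctuation and pure digits are dropped.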

# Moralised Frames in Posts – Classifier: 

In [7]:
n = len(labelled_posts)

## Apply Preprocessing Functions:

In [8]:
# store regular expressions in list to remove desired characters
regex_list = [
      r"&[^;]+;", #remove html character escapes (&amp etc.)
      r"</?\w[^>]*>", #remove html tags 
      r"https?://[\w\.]+\b|www\.[\w\.]+\b", # remove links to websites
      r"\s(www.\S+)" # remove links to websites
      ]
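
To see what these patterns remove, they can be applied to a sample string. This is a minimal sketch: the sample titles are invented, the helper `strip_patterns` is a hypothetical stand-in for `reddit_preprocessing`, and the emoji step is left out so that only the standard library is needed.

```python
import re

# same patterns as in regex_list above
regex_list = [
    r"&[^;]+;",                            # html character escapes
    r"</?\w[^>]*>",                        # html tags
    r"https?://[\w\.]+\b|www\.[\w\.]+\b",  # links to websites
    r"\s(www.\S+)",                        # links to websites
]


def strip_patterns(text, patterns):
    # remove every match of every pattern, in list order
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


sample = "Court blocks ban &amp; <b>appeals</b> see https://example.com now"
cleaned = strip_patterns(sample, regex_list)
# collapse the double spaces left behind by the removals
print(" ".join(cleaned.split()))

# the link pattern stops at the first character that is neither a word character nor a dot
with_path = strip_patterns("see https://example.com/a/b", regex_list)
print(with_path)
```

The second call suggests that only the scheme and domain of a URL are removed and that a trailing path stays in the text, which may or may not matter for titles.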

In [9]:
# apply preprocessing to title column
labelled_posts["title_pr_c"] = labelled_posts["title"].apply(lambda x: reddit_preprocessing(x, regex_list))

In [10]:
labelled_posts = remove_bad_rows(labelled_posts, "title")
print(f"{n-len(labelled_posts)} titles were removed")

19 titles were removed


## Splitting Data

In [11]:
# Train Test Split using the preprocessed title column and the overall morality label. 
# X_test_f and y_test_f are set aside to test the final model.
X_train, X_test_f, y_train, y_test_f = train_test_split(
    labelled_posts["title_pr_c"],
    labelled_posts["moral_label"],
    test_size=0.2,
    random_state=99)

# Split the training data again, this time with test_size = 0.25, to achieve a final split of
# 60% training data, 20% validation data (used to evaluate the baselines) and 20% final test data (used to test the best model)
X_train_sec, X_val, y_train_sec, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.25,
    random_state=99)
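
The arithmetic behind the 60/20/20 split can be checked by hand: a quarter of the remaining 80% is 20% of the whole. The sketch below assumes that `train_test_split` rounds a fractional test size up and gives the rest to the training part; `split_sizes` is a hypothetical helper, and the total of 981 titles is taken from the class counts shown further down (715 + 266).

```python
import math


def split_sizes(n_samples, test_size):
    # assumed rounding: the test part is rounded up, the training part gets the remainder
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 715 + 266  # labelled titles after cleaning
n_train, n_test_f = split_sizes(n_total, 0.2)
n_train_sec, n_val = split_sizes(n_train, 0.25)

shares = [round(k / n_total, 2) for k in (n_train_sec, n_val, n_test_f)]
print(n_train_sec, n_val, n_test_f)
print(shares)
```

Under these assumptions the validation and final test sets hold 196 and 197 titles, which matches the support totals in the classification reports.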

## Extending Stopword List

In [12]:
# Extend the stopword list based on a warning that was raised when running the models.
# Some of the tokens named in the warning (e.g., "must")
# were deemed important for moralisation and were not included in the stopword list.
stopwords_ext = ["'re", "'s", 'sha', 'wo'] + stopwords.words("english")

## Baseline Classifiers

#### Inspect balance of labelled data:

In [13]:
# Inspect balance of labelled data: 
labelled_posts.groupby(["moral_label"]).count() 

moral_label,Unnamed: 0,label_threat,label_vict,post_id,title,date,title_pr_c
0,715,715,715,715,715,715,715
1,266,266,266,266,266,266,266


As the classes are unbalanced (moral frames are under-represented in the titles), the classifiers are adjusted for this, either by using ComplementNB() or by setting the parameter class_weight to "balanced".
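
As a rough illustration of what `class_weight = "balanced"` does, scikit-learn documents the weights as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to the class counts of the full labelled set; in the notebook the weights are derived from whichever training split a model is fitted on, so the actual values will differ slightly. `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    # weight per class = n_samples / (n_classes * number of samples in that class)
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


weights = balanced_class_weights({0: 715, 1: 266})
print({label: round(w, 2) for label, w in weights.items()})
```

Errors on the minority class (moral frames) therefore count roughly 2.7 times as much as errors on the majority class.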

### Baseline Models

In [14]:
# Store configurations in "configurations" object

configurations=[('NB_CountV', CountVectorizer(tokenizer = mytokenizer.tokenize,
                                              stop_words = stopwords_ext), ComplementNB()),
('NB_Tfidf', TfidfVectorizer(tokenizer = mytokenizer.tokenize,
                             stop_words = stopwords_ext), ComplementNB()), 
('LR_CountV', CountVectorizer(tokenizer = mytokenizer.tokenize,
                              stop_words = stopwords_ext), LogisticRegression(solver='liblinear',
                                                                                           class_weight = 'balanced')),
('LR_Tfidf', TfidfVectorizer(tokenizer = mytokenizer.tokenize,
                             stop_words = stopwords_ext), LogisticRegression(solver='liblinear', 
                                                                                          class_weight = 'balanced')),
('SVM_CountV', CountVectorizer(tokenizer = mytokenizer.tokenize,
                                              stop_words = stopwords_ext), SVC(gamma = 'scale', 
                                                                                            class_weight = 'balanced')),
('SVM_Tfidf', TfidfVectorizer(tokenizer = mytokenizer.tokenize,
                             stop_words = stopwords_ext),SVC(gamma = 'scale', 
                                                                          class_weight = 'balanced')), 
('RF_CountV', CountVectorizer(tokenizer = mytokenizer.tokenize,
                                              stop_words = stopwords_ext), RandomForestClassifier(class_weight = 'balanced')), 
('RF_Tfidf', TfidfVectorizer(tokenizer = mytokenizer.tokenize,
                             stop_words = stopwords_ext) , RandomForestClassifier(class_weight = 'balanced'))
]

In [15]:
# Classification function
def classification(x):
  for name, vectorizer, classifier in x:
      trans_X_train_sec = vectorizer.fit_transform(X_train_sec)
      trans_X_val = vectorizer.transform(X_val)
      classifier.fit(trans_X_train_sec, y_train_sec)
      pred_y_sm = classifier.predict(trans_X_val)
      print(f"Classification Report for {name}:\n")
      print(classification_report(y_val, pred_y_sm))
      print("\n")

In [16]:
# Baseline classification task & return report:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # the call has to sit inside the block for the warning filter to apply
    classification(configurations)



Classification Report for NB_CountV:

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       143
           1       0.53      0.47      0.50        53

    accuracy                           0.74       196
   macro avg       0.67      0.66      0.66       196
weighted avg       0.74      0.74      0.74       196







Classification Report for NB_Tfidf:

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       143
           1       0.50      0.43      0.46        53

    accuracy                           0.73       196
   macro avg       0.65      0.64      0.64       196
weighted avg       0.72      0.73      0.72       196







Classification Report for LR_CountV:

              precision    recall  f1-score   support

           0       0.81      0.89      0.85       143
           1       0.59      0.43      0.50        53

    accuracy                           0.77       196
   macro avg       0.70      0.66      0.67       196
weighted avg       0.75      0.77      0.75       196







Classification Report for LR_Tfidf:

              precision    recall  f1-score   support

           0       0.82      0.83      0.82       143
           1       0.52      0.49      0.50        53

    accuracy                           0.74       196
   macro avg       0.67      0.66      0.66       196
weighted avg       0.74      0.74      0.74       196







Classification Report for SVM_CountV:

              precision    recall  f1-score   support

           0       0.79      0.91      0.85       143
           1       0.59      0.36      0.45        53

    accuracy                           0.76       196
   macro avg       0.69      0.63      0.65       196
weighted avg       0.74      0.76      0.74       196







Classification Report for SVM_Tfidf:

              precision    recall  f1-score   support

           0       0.77      0.95      0.85       143
           1       0.65      0.25      0.36        53

    accuracy                           0.76       196
   macro avg       0.71      0.60      0.60       196
weighted avg       0.74      0.76      0.72       196







Classification Report for RF_CountV:

              precision    recall  f1-score   support

           0       0.76      1.00      0.86       143
           1       1.00      0.13      0.23        53

    accuracy                           0.77       196
   macro avg       0.88      0.57      0.55       196
weighted avg       0.82      0.77      0.69       196







Classification Report for RF_Tfidf:

              precision    recall  f1-score   support

           0       0.76      0.99      0.86       143
           1       0.89      0.15      0.26        53

    accuracy                           0.77       196
   macro avg       0.82      0.57      0.56       196
weighted avg       0.79      0.77      0.70       196





Both precision and recall were deemed important, so the models were compared on the F1-score of the moral class (label 1).
The best performing models were:
- NB_CountV
- NB_Tfidf
- LR_CountV
- LR_Tfidf
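
The F1-score is the harmonic mean of precision and recall, so a model cannot score well by maximising only one of the two. The sketch below recomputes the class-1 F1 for two of the reports above from their rounded precision and recall values; `f1_score_from_pr` is a hypothetical helper, not the scikit-learn function.

```python
def f1_score_from_pr(precision, recall):
    # harmonic mean of precision and recall; defined as 0 if both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# rounded class-1 values taken from the reports above
rf_countv = f1_score_from_pr(1.00, 0.13)
lr_countv = f1_score_from_pr(0.59, 0.43)
print(round(rf_countv, 2), round(lr_countv, 2))
```

RF_CountV reaches perfect precision on the moral class but finds only 13% of the moral titles, which is why its F1-score ends up well below that of LR_CountV.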

## Grid search for the best models: 

In [17]:
NB_CountV_pipe = Pipeline(
    steps=[
        ("vectorizer", CountVectorizer(tokenizer = mytokenizer.tokenize,
                                       stop_words = stopwords_ext)),
        ("classifier", ComplementNB()),
    ]
)
NB_CountV_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "vectorizer__min_df": [0, 5, 10]
}
search_NB_CountV = GridSearchCV(
    estimator = NB_CountV_pipe, n_jobs=-1, param_grid=NB_CountV_grid, scoring="f1", cv=10)

# ignore warnings for better readability; the fit has to run inside the block for the filter to apply
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search_NB_CountV.fit(X_train, y_train)



In [18]:
print(f"Best parameters: {search_NB_CountV.best_params_}")
print(f"Best score: {round(search_NB_CountV.best_score_,4)}")

Best parameters: {'vectorizer__max_df': 0.5, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 2)}
Best score: 0.4841


In [19]:
NB_Tfidf_pipe = Pipeline(
    steps=[
        ("vectorizer", TfidfVectorizer(tokenizer = mytokenizer.tokenize,
                                       stop_words = stopwords_ext)),
        ("classifier", ComplementNB()),
    ]
)
NB_Tfidf_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "vectorizer__min_df": [0, 5, 10]
}
search_NB_Tfidf = GridSearchCV(
    estimator = NB_Tfidf_pipe, n_jobs=-1, param_grid=NB_Tfidf_grid, scoring="f1", cv=10)

# ignore warnings for better readability; the fit has to run inside the block for the filter to apply
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search_NB_Tfidf.fit(X_train, y_train)



In [20]:
print(f"Best parameters: {search_NB_Tfidf.best_params_}")
print(f"Best score: {round(search_NB_Tfidf.best_score_,4)}")

Best parameters: {'vectorizer__max_df': 0.75, 'vectorizer__min_df': 10, 'vectorizer__ngram_range': (1, 2)}
Best score: 0.4707


In [21]:
LR_Count_GS_pipe = Pipeline(
    steps=[
        ("vectorizer", CountVectorizer(tokenizer = mytokenizer.tokenize,
                                       stop_words = stopwords_ext)),
        ("classifier", LogisticRegression(solver='liblinear',
                                          class_weight = 'balanced')),
    ]
)
LR_Count_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "vectorizer__min_df": [0, 5, 10],
    "classifier__C": [0.01, 0.1, 1, 10, 100],
    "classifier__penalty": ["l1", "l2"]
}
search_LR_Count = GridSearchCV(
    estimator = LR_Count_GS_pipe, n_jobs=-1, param_grid=LR_Count_grid, scoring="f1", cv=10)

# ignore warnings for better readability; the fit has to run inside the block for the filter to apply
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search_LR_Count.fit(X_train, y_train)



In [25]:
print(f"Best parameters: {search_LR_Count.best_params_}")
print(f"Best score: {round(search_LR_Count.best_score_,4)}")

Best parameters: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'vectorizer__max_df': 0.5, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 1)}
Best score: 0.4967


In [26]:
# Store best model with best hyperparameters
LR_Count_model = search_LR_Count.best_estimator_

In [27]:
LR_Tfidf_GS_pipe = Pipeline(
    steps=[
        ("vectorizer", TfidfVectorizer(tokenizer = mytokenizer.tokenize,
                                       stop_words = stopwords_ext)),
        ("classifier", LogisticRegression(solver='liblinear',
                                          class_weight = 'balanced')),
    ]
)
LR_Tfidf_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "vectorizer__min_df": [0, 5, 10],
    "classifier__C": [0.01, 0.1, 1, 10, 100],
    "classifier__penalty": ["l1", "l2"]
}
search_LR_Tfidf = GridSearchCV(
    estimator = LR_Tfidf_GS_pipe, n_jobs=-1, param_grid=LR_Tfidf_grid, scoring="f1", cv=10)

# ignore warnings for better readability; the fit has to run inside the block for the filter to apply
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search_LR_Tfidf.fit(X_train, y_train)



In [28]:
print(f"Best parameters: {search_LR_Tfidf.best_params_}")
print(f"Best score: {round(search_LR_Tfidf.best_score_,4)}")

Best parameters: {'classifier__C': 1, 'classifier__penalty': 'l2', 'vectorizer__max_df': 0.75, 'vectorizer__min_df': 0, 'vectorizer__ngram_range': (1, 1)}
Best score: 0.5028
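
The cost of the four searches above follows from the grid sizes: every combination of parameter values is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below counts candidates with `itertools.product`; the grids are copied from the cells above, `n_candidates` is a hypothetical helper, and the final refit on the full training data is not counted.

```python
from itertools import product

nb_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "vectorizer__min_df": [0, 5, 10],
}
# the logistic regression grids add C and the penalty type
lr_grid = {
    **nb_grid,
    "classifier__C": [0.01, 0.1, 1, 10, 100],
    "classifier__penalty": ["l1", "l2"],
}


def n_candidates(grid):
    # one candidate per element of the Cartesian product of all value lists
    return sum(1 for _ in product(*grid.values()))


cv_folds = 10
nb_fits = n_candidates(nb_grid) * cv_folds
lr_fits = n_candidates(lr_grid) * cv_folds
print(nb_fits, lr_fits)
```

Each Naive Bayes search therefore involves 180 fits, and each logistic regression search involves 1800.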


## Add word embeddings and examine improvements in baseline models

In [29]:
# Download pre-trained GloVe word embeddings (300 dimensions, trained on Wikipedia and Gigaword).
wv = api.load('glove-wiki-gigaword-300')
wv_model = dict(zip(wv.index_to_key, wv.vectors))



In [30]:
# Store configurations in "configurations" object, including word embeddings and the count/tfidf vectorizers

configurations_emb=[('LR_CountV', embeddingvectorizer.EmbeddingCountVectorizer(wv_model, operator='mean'), LogisticRegression(solver='liblinear',
                                                                                           class_weight = 'balanced')),
('LR_Tfidf', embeddingvectorizer.EmbeddingTfidfVectorizer(wv_model, operator='mean'), LogisticRegression(solver='liblinear', 
                                                                                          class_weight = 'balanced')),
('SVM_CountV', embeddingvectorizer.EmbeddingCountVectorizer(wv_model, operator='mean'), SVC(gamma = 'scale', 
                                                                                            class_weight = 'balanced')),
('SVM_Tfidf', embeddingvectorizer.EmbeddingTfidfVectorizer(wv_model, operator='mean'),SVC(gamma = 'scale', 
                                                                          class_weight = 'balanced')), 
('RF_CountV', embeddingvectorizer.EmbeddingCountVectorizer(wv_model, operator='mean'), RandomForestClassifier(class_weight = 'balanced')), 
('RF_Tfidf', embeddingvectorizer.EmbeddingTfidfVectorizer(wv_model, operator='mean'), RandomForestClassifier(class_weight = 'balanced'))
]
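
The idea behind `operator='mean'` can be sketched as averaging the word vectors of a title into a single document vector. The example below is only an illustration under assumptions: the three-dimensional `toy_vectors` are invented, unknown words are simply skipped, and the weighting by counts or tf-idf that the `embeddingvectorizer` classes may apply is not reproduced.

```python
# invented 3-dimensional vectors standing in for the 300-dimensional GloVe vectors
toy_vectors = {
    "court": [1.0, 0.0, 2.0],
    "blocks": [3.0, 2.0, 0.0],
    "ban": [2.0, 4.0, 1.0],
}


def mean_embedding(tokens, vectors, dim=3):
    # collect the vectors of all tokens that are in the vocabulary
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # no known word: fall back to a zero vector (an assumption of this sketch)
        return [0.0] * dim
    # average each dimension over the known tokens
    return [sum(column) / len(known) for column in zip(*known)]


doc_vector = mean_embedding(["court", "blocks", "ban", "unknownword"], toy_vectors)
print(doc_vector)
```

Every title is thus mapped to a dense vector of fixed length regardless of how many words it contains, which is what the classifiers then receive as input.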

In [31]:
# Classification & Report:
classification(configurations_emb)

Classification Report for LR_CountV:

              precision    recall  f1-score   support

           0       0.85      0.85      0.85       143
           1       0.58      0.58      0.58        53

    accuracy                           0.78       196
   macro avg       0.72      0.72      0.72       196
weighted avg       0.78      0.78      0.78       196



Classification Report for LR_Tfidf:

              precision    recall  f1-score   support

           0       0.83      0.82      0.82       143
           1       0.53      0.55      0.54        53

    accuracy                           0.74       196
   macro avg       0.68      0.68      0.68       196
weighted avg       0.75      0.74      0.75       196



Classification Report for SVM_CountV:

              precision    recall  f1-score   support

           0       0.85      0.88      0.86       143
           1       0.64      0.57      0.60        53

    accuracy                           0.80       196
   macro a

On this basis, the best performing model is SVM_Tfidf.

# Final Testing of the Best Morality Baseline Model

In [32]:
#specify pipeline with best model identified in the previous step
mypipe = Pipeline([('SVM_Tfidf', embeddingvectorizer.EmbeddingTfidfVectorizer(wv_model, operator='mean')),
                    ('svm_svc', 
                     SVC(gamma = 'scale', class_weight = 'balanced'))
])

# fit the model on the training data
mypipe.fit(X_train_sec, y_train_sec)

# predict the labels in the final testing data set aside at the start
y_pred = mypipe.predict(X_test_f)

# get classification report
print(metrics.classification_report(y_test_f, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.85      0.82       128
           1       0.67      0.57      0.61        69

    accuracy                           0.75       197
   macro avg       0.73      0.71      0.72       197
weighted avg       0.75      0.75      0.75       197



# Clean unlabelled data:

In [33]:
n_u = len(unlabelled_posts)

In [34]:
# apply preprocessing function to title column in unlabelled data
unlabelled_posts["title"] = unlabelled_posts["title"].apply(lambda x: reddit_preprocessing(x, regex_list))

In [35]:
# apply removal of NAs and duplicates function to unlabelled posts title column
unlabelled_posts = remove_bad_rows(unlabelled_posts, "title")
print(f"{n_u-len(unlabelled_posts)} titles were removed")

292 titles were removed


After these cleaning steps, the word embedding vectorizer inside the pipeline handles the remaining preprocessing required for word embeddings.

# Label the unlabelled data

In [37]:
# Use the stored model ("mypipe") to predict the moral vs. neutral label in the unlabelled title column
unlabelled_posts["moral_labels"] = mypipe.predict(unlabelled_posts["title"])

In [39]:
unlabelled_posts.head()

Unnamed: 0.1,Unnamed: 0,post_id,title,date,moral_labels
0,6807,8vjh0c,Alabama man arrested after shouting 'womp womp...,2018-07-02,1
1,10965,nznzfc,U.S. to expand work permits for immigrants who...,2021-06-14,1
2,2832,5tavj7,"Undocumented Immigrants Arrested Across U.S., ...",2017-02-11,0
3,5587,7vlzmh,Best Immigration Consultant in Melbourne,2018-02-06,0
4,5413,7qyig9,Don Lemon Dumbfounded By Jeff Session's Commen...,2018-01-17,0


In [40]:
# Save model
with open("Moralisation_Model.pkl", mode="wb") as f:
    pickle.dump(mypipe, f)
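
Loading the model again mirrors the saving step. The sketch below shows the round trip with a plain dictionary as a hypothetical stand-in for the fitted pipeline and a temporary directory instead of the working directory. Unpickling the real `Moralisation_Model.pkl` additionally requires that scikit-learn and `embeddingvectorizer` are importable, ideally in the versions used for training.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the fitted pipeline
stand_in_model = {"name": "placeholder for the fitted pipeline"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Moralisation_Model.pkl")
    # save the object in binary mode
    with open(path, mode="wb") as f:
        pickle.dump(stand_in_model, f)
    # load it back, again in binary mode
    with open(path, mode="rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

With the real pipeline, `restored.predict(...)` could then be called on new, preprocessed titles in the same way as `mypipe.predict(...)` above.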

In [41]:
# Save unlabelled posts to .csv
unlabelled_posts.to_csv("model_labelled_posts.csv", index=False)  # index=False avoids writing another unnamed index column