This notebook was downloaded and adapted from the GitHub repo of the MoralBERT authors (https://github.com/vjosapreniqi/MoralBERT/tree/main). It predicts moral values in text according to Moral Foundations Theory, using the MoralBERT model weights hosted on Hugging Face.

## Set up

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

Load data

In [2]:
from google.colab import drive
import os
drive.mount('/content/drive')

Mounted at /content/drive


Paths

In [3]:
selfimprovement = "/content/drive/My Drive/UChicago/Tesis/selfimpr_preprocessed.csv"
investing = "/content/drive/My Drive/UChicago/Tesis/investing_preprocessed.csv"
homeowners = "/content/drive/My Drive/UChicago/Tesis/homeowners_preprocessed.csv"

Create data frames

In [4]:
data_selfimprv = pd.read_csv(selfimprovement)
data_investing = pd.read_csv(investing)
data_homeowners = pd.read_csv(homeowners)


Confirm sizes

In [5]:
print("Shape selfimprovement:", data_selfimprv.shape)
print("Shape investing:", data_investing.shape)
print("Shape homeowners:", data_homeowners.shape)

Shape selfimprovement: (506574, 13)
Shape investing: (503158, 13)
Shape homeowners: (498733, 13)


Check distribution of popularity scores

In [6]:
percentiles = [25, 50, 75, 80, 85, 90, 95, 99, 99.995, 100]

# Calculate percentiles for each subreddit
selfimpr_p = np.percentile(data_selfimprv.score, percentiles)
investing_p = np.percentile(data_investing.score, percentiles)
homeowners_p = np.percentile(data_homeowners.score, percentiles)

# Put results on a df
percentiles_table = pd.DataFrame({
    'Percentile': percentiles,
    'Self-Improvement': selfimpr_p,
    'Investing': investing_p,
    'Homeowners': homeowners_p
})

percentiles_table = percentiles_table.round(3)

print(percentiles_table)


   Percentile  Self-Improvement  Investing  Homeowners
0      25.000             1.000      1.000        1.00
1      50.000             2.000      2.000        2.00
2      75.000             3.000      4.000        4.00
3      80.000             4.000      5.000        5.00
4      85.000             6.000      8.000        7.00
5      90.000             9.000     12.000       11.00
6      95.000            20.000     27.000       23.00
7      99.000           130.000    174.000      102.00
8      99.995          2351.357   4871.847     1044.19
9     100.000          6496.000  14369.000     2753.00


In [11]:
data_selfimprv[data_selfimprv.score > 1000]

Unnamed: 0.1,Unnamed: 0,id,created,author,score,num_comments,link,cleaned_text,word_count,type,link_id,year,month
2501,2501,31jhbr,2015-04-05 13:11,u/Scolez,1031,67.0,https://www.reddit.com/r/selfimprovement/comme...,how to eat healthy eat no processed food and d...,248,submission,,2015,4
8260,8260,773las,2017-10-17 21:17,u/bcbrought96,1340,153.0,https://www.reddit.com/r/selfimprovement/comme...,edit the response to this has been pretty posi...,3364,submission,,2017,10
14987,14987,a984xg,2018-12-24 13:52,u/g00ber88,1004,185.0,https://www.reddit.com/r/selfimprovement/comme...,lightning does strike twice or three times or ...,199,submission,,2018,12
18652,18652,bsvrq2,2019-05-25 10:31,u/eatmenlikeair,1948,99.0,https://www.reddit.com/r/selfimprovement/comme...,i attempted suicide at and began the journey o...,833,submission,,2019,5
19322,19322,c1mppl,2019-06-17 07:04,u/spegelspegel,1499,90.0,https://www.reddit.com/r/selfimprovement/comme...,foundation consistent quality sleep regular ex...,56,submission,,2019,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
429643,429643,jin3ue9,2023-05-02 20:29,u/plytime18,1264,,https://www.reddit.com/r/selfimprovement/comme...,and some people sit on their ass and win the l...,66,comment,t3_13646ok,2023,5
429660,429660,jin8nk7,2023-05-02 21:04,u/Throwawaylam49,1466,,https://www.reddit.com/r/selfimprovement/comme...,so true im and basically work an entry level m...,237,comment,t3_13646ok,2023,5
437643,437643,jmugkw6,2023-06-04 04:56,u/Aggressive-Ad7520,1867,,https://www.reddit.com/r/selfimprovement/comme...,real talk the thing which is holding you back ...,378,comment,t3_1406abo,2023,6
490866,490866,k82s3zd,2023-11-06 09:09,u/queermystic,1077,,https://www.reddit.com/r/selfimprovement/comme...,dude let it go its really weird to contact som...,84,comment,t3_17p46xk,2023,11


In [12]:
data_homeowners[data_homeowners.score > 1000]

Unnamed: 0.1,Unnamed: 0,id,created,author,score,num_comments,link,cleaned_text,word_count,type,link_id,year,month
7886,7886,8cn8eo,2018-04-16 07:44,u/sm1ttysm1t,1289,111.0,https://www.reddit.com/r/homeowners/comments/8...,saturday night my wife and i decided to host a...,374,submission,,2018,4
13153,13153,arfxoe,2019-02-16 20:10,u/kemuridragon,1358,269.0,https://www.reddit.com/r/homeowners/comments/a...,my husband and i bought a house last march tod...,231,submission,,2019,2
20713,20713,gdgxhm,2020-05-04 13:19,u/AustynCunningham,1016,157.0,https://www.reddit.com/r/homeowners/comments/g...,hello all before after photos before was ago w...,103,submission,,2020,5
22870,22870,i0ovnu,2020-07-30 11:15,u/GailaMonster,1044,395.0,https://www.reddit.com/r/homeowners/comments/i...,there was a post recently on r realestate aski...,238,submission,,2020,7
29500,29500,mt252b,2021-04-17 19:02,u/NotSureWTFUmean,1468,344.0,https://www.reddit.com/r/homeowners/comments/m...,today was the last straw i found an ant in my ...,106,submission,,2021,4
31999,31999,omvhzu,2021-07-18 12:48,u/poppadoc696969,1138,69.0,https://www.reddit.com/r/homeowners/comments/o...,we moved into our home in november buying it a...,282,submission,,2021,7
32893,32893,pcuvc0,2021-08-27 15:00,u/vmq,1206,90.0,https://www.reddit.com/r/homeowners/comments/p...,i just turned years old and did drugs for most...,185,submission,,2021,8
34186,34186,qehkhc,2021-10-23 19:54,u/beejers30,1151,79.0,https://www.reddit.com/r/homeowners/comments/q...,im a year old woman who is recently divorced a...,114,submission,,2021,10
42119,42119,y0r1o2,2022-10-10 16:45,u/PlusUltra0000,1455,407.0,https://www.reddit.com/r/homeowners/comments/y...,i was at the store today and got to talking to...,149,submission,,2022,10
44576,44576,10hyqtv,2023-01-21 13:03,u/ughthatsucks,1568,129.0,https://www.reddit.com/r/homeowners/comments/1...,we are fairly new to the neighborhood today my...,184,submission,,2023,1


In [13]:
data_investing[data_investing.score > 1000]

Unnamed: 0.1,Unnamed: 0,id,created,author,score,num_comments,link,cleaned_text,word_count,type,link_id,year,month
12696,12696,1qv5px,2013-11-17 20:09,u/I_hate_alot_a_lot,1772,142.0,https://www.reddit.com/r/investing/comments/1q...,would you guys be interested in bi weekly inve...,215,submission,,2013,11
21012,21012,2pdyhp,2014-12-15 13:32,u/stuyvesantthrowaway,1799,552.0,https://www.reddit.com/r/investing/comments/2p...,there has been a super popular article about m...,153,submission,,2014,12
29930,29930,3x7fky,2015-12-17 06:12,u/FCowperwood,1236,264.0,https://www.reddit.com/r/investing/comments/3x...,year old suspected of plundering retrophin to ...,61,submission,,2015,12
34247,34247,4qbnl8,2016-06-28 15:38,u/Emperor_YSSAC,1321,719.0,https://www.reddit.com/r/investing/comments/4q...,i want to do this for fun and for free i love ...,399,submission,,2016,6
34596,34596,4sz4yi,2016-07-15 08:43,u/rexmorrow,2339,479.0,https://www.reddit.com/r/investing/comments/4s...,the billionaire investor gave away billion wor...,59,submission,,2016,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
440960,440960,hwn04gn,2022-02-12 06:37,u/StoicBogle,1375,,https://www.reddit.com/r/investing/comments/sq...,financial advisor here nobody can accurately t...,123,comment,t3_sqou5g,2022,2
452905,452905,fteko1y,2020-06-08 15:19,u/Soe-Vand,1281,,https://www.reddit.com/r/investing/comments/gz...,number one rule of wall street nobody i dont c...,76,comment,t3_gz6xjl,2020,6
464328,464328,h88tfs5,2021-08-08 22:06,u/theNeumannArchitect,1029,,https://www.reddit.com/r/investing/comments/p0...,i mean what are you trying to do here time the...,131,comment,t3_p0r3o4,2021,8
466885,466885,drppgc8,2017-12-24 14:42,u/r_notfound,2177,,https://www.reddit.com/r/investing/comments/7l...,heres what i have to say about that rather tha...,62,comment,t3_7lxbzi,2017,12


Get only the most popular posts per subreddit (the top n by score; n = 400 in the run below) and put them in lists for the model to use

In [14]:
def create_documents(df, n):
  # Sort based on the "score" column in descending order
  sorted_df = df.sort_values(by='score', ascending=False)

  # Select the top rows
  top_df = sorted_df.head(n)

  # Create list of documents

  lst_docs = top_df['cleaned_text'].tolist()

  return lst_docs
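A possible variant of the selection step, sketched under assumptions: the helper name `top_documents` and the toy frame are hypothetical, and whether the preprocessed CSVs contain rows with missing `cleaned_text` is not shown in this notebook. The sketch uses `DataFrame.nlargest` instead of a full sort and drops rows without text, so the tokenizer would never receive a missing value.

```python
import pandas as pd

def top_documents(df, n, text_col="cleaned_text", score_col="score"):
    """Return the texts of the n highest-scoring rows, skipping rows without text."""
    # Drop rows whose text is missing (hypothetical safeguard)
    with_text = df.dropna(subset=[text_col])
    # nlargest returns the rows ordered by score, highest first
    top_df = with_text.nlargest(n, score_col)
    return top_df[text_col].tolist()

# Toy data for illustration only
toy = pd.DataFrame({
    "score": [5, 40, 12, 7],
    "cleaned_text": ["a", None, "c", "d"],
})
docs = top_documents(toy, 2)
print(docs)
```

The row with the highest score is skipped here because it has no text, which is the behaviour the safeguard is meant to illustrate.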

In [15]:
selfimprovement_list = create_documents(data_selfimprv, 400)
investing_list = create_documents(data_investing, 400)
homeowners_list = create_documents(data_homeowners, 400)

Double-check the list lengths

In [16]:
print(len(selfimprovement_list))
print(len(investing_list))
print(len(homeowners_list))


400
400
400
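As a quick sanity check on how selective this cut is, the share of each corpus that 400 posts represent can be worked out from the row counts printed by the shape check earlier. The variable names below are illustrative; only the row counts and the value 400 come from this notebook. The result is roughly 0.08% of each subreddit.

```python
# Row counts as printed by the shape check earlier in this notebook
corpus_rows = {"selfimprovement": 506574, "investing": 503158, "homeowners": 498733}
n_selected = 400

def share_percent(n, total):
    """Percentage of `total` rows represented by `n` rows."""
    return 100 * n / total

shares = {name: share_percent(n_selected, rows) for name, rows in corpus_rows.items()}
for name, pct in shares.items():
    print(f"{name}: top {n_selected} posts = {pct:.3f}% of {corpus_rows[name]} rows")
```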


Check whether a GPU is available

In [17]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

Implement MoralBERT

In [18]:
# BERT model and tokenizer:
bert_model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [19]:
class MyModel(
    nn.Module,
    PyTorchModelHubMixin,
    # optionally, you can add metadata which gets pushed to the model card
    # repo_url="your-repo-url",
    pipeline_tag="text-classification",
    license="mit",
):
    def __init__(self, bert_model, moral_label=2):
        super().__init__()
        self.bert = bert_model
        bert_dim = 768  # hidden size of bert-base-uncased
        self.invariant_trans = nn.Linear(bert_dim, bert_dim)
        self.moral_classification = nn.Sequential(nn.Linear(bert_dim, bert_dim),
                                                  nn.ReLU(),
                                                  nn.Linear(bert_dim, moral_label))

    def forward(self, input_ids, token_type_ids, attention_mask):
        # use the hidden state of the first ([CLS]) token as the sequence representation
        pooled_output = self.bert(input_ids,
                                  token_type_ids=token_type_ids,
                                  attention_mask=attention_mask).last_hidden_state[:, 0, :]
        pooled_output = self.invariant_trans(pooled_output)
        logits = self.moral_classification(pooled_output)
        return logits

In [20]:
def preprocessing(input_text, tokenizer):
    '''
    Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - token_type_ids: list of token type ids
    - attention_mask: list of indices (0,1) specifying which tokens should be considered by the model (return_attention_mask = True).
    Inputs are padded or truncated to exactly 150 tokens.
    '''
    return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        max_length = 150,
                        padding = 'max_length',
                        return_attention_mask = True,
                        return_token_type_ids = True,
                        return_tensors = 'pt',
                        truncation = True
                   )
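To make the effect of `max_length = 150`, `padding = 'max_length'` and `truncation = True` concrete, here is a toy, pure-Python sketch of what happens to a single sequence of token ids. It is not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper and the body token ids are arbitrary. Only the special-token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]) are those of the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_or_truncate(token_ids, max_length=150):
    """Toy version of padding='max_length' with truncation=True for one sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut whatever does not fit
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remainder with padding that the attention mask hides from the model
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

# A short input is padded; a long input loses everything past the limit
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

One practical consequence: several of the high-scoring posts listed earlier have word counts well above 150, so only the beginning of those posts would be scored.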

In [21]:
# the list of Moral Foundations Theory (MFT) values
mft_values = ["care", "harm", "fairness", "cheating", "loyalty", "betrayal",
              "authority", "subversion", "purity", "degradation"]

# function to load the model, predict the score, and return the second value
def get_model_score(sentence, mft):
    repo_name = f"vjosap/moralBERT-predict-{mft}-in-text"

    # loading the model (note: the weights are reloaded on every call)
    model = MyModel.from_pretrained(repo_name, bert_model=bert_model)
    # moving the model to the selected device and switching to inference mode
    model.to(device).eval()

    # preprocessing the text and moving the tensors to the same device
    encodeds = preprocessing(sentence, tokenizer).to(device)

    # predicting the mft score without tracking gradients
    with torch.no_grad():
        output = model(**encodeds)
    score = F.softmax(output, dim=1)

    # extracting and returning the second value (index 1) from the tensor
    mft_value = score[0, 1].item()

    return mft_value
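To clarify what the returned number means: each MoralBERT head outputs two logits, and the softmax turns them into two probabilities that sum to 1. The function keeps the one at index 1, presumably the "foundation present" class. The pure-Python sketch below uses made-up logits, not real model output, and also shows that with two classes this value equals a sigmoid of the logit difference.

```python
import math

def softmax(logits):
    """Softmax over a list of floats."""
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 0.8]  # made-up values for [class 0, class 1]
probs = softmax(logits)
positive_prob = probs[1]  # corresponds to score[0, 1] in get_model_score

# With two classes, the same value is a sigmoid of the logit difference
sigmoid_of_diff = 1 / (1 + math.exp(-(logits[1] - logits[0])))
print(positive_prob, sigmoid_of_diff)
```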


In [22]:
# Function to analyze the corpus of one of the subreddits

def analyze_corpus(sentences, corpus_name):
  # initialising a list to accumulate the results
  results = []

  # sequential execution of predictions
  for sentence in sentences:
      # dictionary to store scores for the current sentence
      sentence_scores = {"sentence": sentence}

      # iterate through each MFT model and get the score
      for mft in mft_values:
          sentence_scores[mft] = get_model_score(sentence, mft)

      results.append(sentence_scores)

  results_df = pd.DataFrame(results)

  # save the DataFrame to a CSV file
  results_df.to_csv("/content/drive/My Drive/UChicago/Tesis/MoralBERTresults400-{}.csv".format(corpus_name), index=False)
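As written, `get_model_score` is called once per sentence and foundation, so for 400 sentences and 10 foundations the model weights are loaded 4,000 times per corpus. One way to avoid this, sketched here with a stand-in loader rather than the real models, is to swap the loops: load each foundation's model once and score every sentence with it before moving on. The names `score_corpus`, `load_scorer` and `fake_load_scorer` are hypothetical. Loading one model at a time also sidesteps a caveat of caching all ten wrappers at once: they are all built around the same `bert_model` object, so they would likely end up sharing whichever BERT weights were loaded last.

```python
def score_corpus(sentences, labels, load_scorer):
    """Score every sentence for every label, loading each label's model once."""
    rows = [{"sentence": s} for s in sentences]
    for label in labels:
        scorer = load_scorer(label)  # one load per label instead of one per sentence
        for row in rows:
            row[label] = scorer(row["sentence"])
    return rows

# Stand-in loader so the sketch runs without downloading anything;
# a real loader would build the model and return a sentence -> score function
load_calls = []

def fake_load_scorer(label):
    load_calls.append(label)
    return lambda sentence: float(len(sentence) + len(label))

rows = score_corpus(["be kind", "pay your taxes"], ["care", "harm"], fake_load_scorer)
print(len(load_calls), rows[0])
```

The returned list of dictionaries has the same layout as `results` above, so it could be passed to `pd.DataFrame` in the same way.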

In [23]:
analyze_corpus(selfimprovement_list, "selfimprovement")


In [24]:
analyze_corpus(investing_list, "investing")

In [25]:
analyze_corpus(homeowners_list, "homeowners")