<a href="https://colab.research.google.com/github/hlab-repo/purity-and-danger/blob/master/Immigration_Pseudo_Labeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Model for Immigration and 

---

Outsider Language

---



This notebook starts with a baseline system and then provides users the opportunity to attempt to improve performance with their own custom, complete system.

## Set-up

In [2]:
%%capture
!pip install datasets
!pip install transformers

In [3]:
import re
from collections import Counter
import datasets
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

We can start with the Common Crawl news corpus (January 2017 - December 2019). See here for details:

https://huggingface.co/datasets/cc_news

In [4]:
# this could take several minutes
dataset = datasets.load_dataset('cc_news')

Reusing dataset cc_news (/root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b)


In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
        num_rows: 708241
    })
})

In [6]:
# look at the first 10 samples
for i, s in enumerate(dataset['train']):
    print(s)
    if i >= 10:
        break

{'date': '2017-04-17 00:00:00', 'description': "Officials unsealed court documents Monday (April 17) to reveal details surrounding the first searches of Prince's Paisley Park estate.", 'domain': '1041jackfm.cbslocal.com', 'image_url': 'https://cbs1041jackfm.files.wordpress.com/2017/04/prince-young-and-sad.jpg?w=946', 'text': 'By Abby Hassler\nOfficials unsealed court documents Monday (April 17) to reveal details surrounding the first searches of Prince’s Paisley Park estate following his untimely death.\nRelated: Prince’s Ex-Wife Mayte Garcia Says Memoir is not a Tell-All\nThe unsealed search warrants don’t confirm the source of the drug, fentanyl, that led to the 57-year-old singer’s accidental, self-administered overdose last April, according to The Star Tribune.\nInvestigators found no prescriptions in Prince’s name, however, Dr. Michael Todd Schulenberg told detectives he had written a prescription for oxycodone, which is also an opioid, under the name of long-time Prince associate

## Create some samples using a form of self-training (pseudo-labeling)

Maybe use `1` for language overlap between domains and `0` for lack of language overlap?

In [8]:
bert = BertForSequenceClassification.from_pretrained('bert-large-uncased', num_labels=1)
bert_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert = bert.to(device)

In [10]:
# create a temporary classifier that can distinguish between two domains in question (when they are probably not mixed)
immigrant_domain_regex = re.compile(r'immigra.*?\b|foreign.*?\b|border\scrossing.*?\b|border\scontrol.*?\b|undocumented.*?\b|alien.*?\b|naturaliz.*?\b')
virus_domain_regex = re.compile(r'covid.*?\b|coronavirus.*?\b|virus.*?\b|disease.*?\b|infect.*?\b||epidem.*?\b|immun.*?\b|pandem.*?\bviral.*?\b')

min_length = 4_000
optimizer = optim.AdamW(bert.parameters(), lr=3e-5)
criterion = nn.BCELoss()

# limit token counts to 50 for occurences in their own domain
token_counts = Counter()

for epoch in range(3):

    running_loss = 0.
    b = 0

    for s in dataset['train']:
        for paragraph in s['text'].split('\n'):
            sentences = re.findall(r'.*?[.?!]', paragraph)
            for sentence in sentences:
                if sentence.strip():
                    loss = None
                    if immigrant_domain_regex.search(sentence) and not virus_domain_regex.search(sentence):
                        token = immigrant_domain_regex.search(sentence).group()
                        if token_counts[token] < 50:
                            output = bert(**bert_tokenizer(sentence, return_tensors='pt').to(device))
                            loss = criterion(torch.sigmoid(output.logits), torch.tensor([[1.]]).to(device))
                            token_counts[token] += 1
                    elif virus_domain_regex.search(sentence) and not immigrant_domain_regex.search(sentence):
                        token = virus_domain_regex.search(sentence).group()
                        if token_counts[token] < 50:
                            output = bert(**bert_tokenizer(sentence, return_tensors='pt').to(device))
                            loss = criterion(torch.sigmoid(output.logits), torch.tensor([[0.]]).to(device))
                            token_counts[token] += 1
                    elif virus_domain_regex.search(sentence) and immigrant_domain_regex.search(sentence):
                        output = bert(**bert_tokenizer(sentence, return_tensors='pt').to(device))
                        loss = criterion(torch.sigmoid(output.logits), torch.tensor([[0.5]]).to(device))
                    if loss:
                        loss.backward()
                        running_loss += loss.item()

                        if b % 10 == 0:
                            print(f'Epoch {epoch + 1} Batch {b + 1} Loss {loss.item()} Running Loss {running_loss / (b + 1)}')
                            optimizer.step()
                            optimizer.zero_grad()

                        b += 1
optimizer.zero_grad()

Epoch 1 Batch 1 Loss 0.6065901517868042 Running Loss 0.6065901517868042
Epoch 1 Batch 11 Loss 0.6792342066764832 Running Loss 0.6032024188475176
Epoch 1 Batch 21 Loss 0.4643443524837494 Running Loss 0.5677666720889863
Epoch 1 Batch 31 Loss 0.48582637310028076 Running Loss 0.5401253604119823
Epoch 1 Batch 41 Loss 0.4327050745487213 Running Loss 0.5150380367186012
Epoch 1 Batch 51 Loss 0.7478040456771851 Running Loss 0.5027365684509277
Epoch 1 Batch 61 Loss 0.7384192943572998 Running Loss 0.5441865422686593
Epoch 1 Batch 71 Loss 0.7446213364601135 Running Loss 0.5765529644321388
Epoch 1 Batch 81 Loss 0.7387065887451172 Running Loss 0.5975312965887564
Epoch 1 Batch 91 Loss 0.7355107069015503 Running Loss 0.6140055807082208
Epoch 1 Batch 101 Loss 0.7321497797966003 Running Loss 0.6259762119538713
Epoch 1 Batch 111 Loss 0.7328019142150879 Running Loss 0.6356392239665126
Epoch 1 Batch 121 Loss 0.734526515007019 Running Loss 0.6437633879913771
Epoch 1 Batch 131 Loss 0.7334743738174438 Running

Token indices sequence length is longer than the specified maximum sequence length for this model (578 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: ignored

In [1]:
# now look for when the model is ambiguous about how to classify a sentence
i = 1
samples = []
for j, s in enumerate(dataset['train']):
    if j < 1_511:
        continue
    for paragraph in s['text'].split('\n'):
        sentences = re.findall(r'.*?[.?!]', paragraph)
        for sentence in sentences:
            if sentence.strip() and (virus_domain_regex.search(sentence) or immigrant_domain_regex.search(sentence)):
                with torch.no_grad():
                    try:
                        target = bert(**bert_tokenizer(sentence, return_tensors='pt').to(device))
                    except:
                        continue
                    sigmoid = torch.sigmoid(target.logits).item()
                if sigmoid <= 0.3:
                    samples.append((sentence, 0, sigmoid))
                elif sigmoid >= 0.7:
                    samples.append((sentence, 1, sigmoid))
                else:
                    samples.append((sentence, 2, sigmoid))
                i += 1
                if i % 500 == 0:
                    print('On sample', i)

NameError: ignored

In [None]:
df = pd.DataFrame(data=samples, columns=['text', 'target', 'sigmoid'])
df[df['target'] == 2].describe()

Unnamed: 0,target,sigmoid
count,157.0,157.0
mean,2.0,0.532379
std,0.0,0.100923
min,2.0,0.310892
25%,2.0,0.447861
50%,2.0,0.523684
75%,2.0,0.629069
max,2.0,0.699679


In [None]:
df[df['target'] == 2].iloc[3].text

'Trump also related the opioid crisis to immigration, saying he will work to end sanctuary city policies and accusing Democrats of stonewalling progress on DACA because they want to stop construction of the border wall.'

In [None]:
# you can download from the directory (we can concatenate all of our efforts together)
df[['target', 'sigmoid', 'text']].sort_values(by=['target', 'sigmoid', 'text'], ascending=[False, False, True]).to_csv('immigration.csv')