<a href="https://colab.research.google.com/github/hlab-repo/purity-and-danger/blob/master/Immigration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Model for Immigration and Outsider Language

This notebook starts with two baseline systems and then gives users the opportunity to improve on their performance with a complete custom system of their own.

## Set-up

In [None]:
%%capture
!pip install datasets
!pip install transformers

In [None]:
import re
from collections import Counter
import datasets
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

## Getting a test dataset

We can start with the Common Crawl news corpus (January 2017 - December 2019). See here for details:

https://huggingface.co/datasets/cc_news

This will constitute our test dataset. Note that the pseudo-labels were generated from articles at the beginning of this dataset, which came nowhere near exhausting its 708,241 news articles. To work only with unseen data, you could skip the first 20,000 or so articles.
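
Skipping the early articles can be done lazily, without copying the data. The following is a minimal sketch, assuming a cutoff of roughly 20,000 articles as suggested above; `skip_first` is a hypothetical helper, and a small stand-in list replaces `dataset['train']` so the snippet runs without the download.

```python
from itertools import islice


def skip_first(samples, n):
    """Lazily yield every item of `samples` after the first `n`."""
    return islice(samples, n, None)


# Stand-in for dataset['train']: any iterable of article dicts works.
articles = [{'title': f'article {i}'} for i in range(10)]

# Skip the first 3 stand-in articles (use n=20_000 on the real data).
unseen = list(skip_first(articles, 3))
print(len(unseen), unseen[0]['title'])
```

On the real data, the same helper could be applied as `skip_first(dataset['train'], 20_000)`. The `datasets` library also offers `Dataset.select`, which takes an iterable of row indices and may be the more convenient choice if you want a `Dataset` object back.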

In [None]:
# this could take several minutes
dataset = datasets.load_dataset('cc_news')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
        num_rows: 708241
    })
})

In [None]:
# look at the first 10 samples
for i, s in enumerate(dataset['train']):
    if i >= 10:
        break
    print(s)

{'date': '2017-04-17 00:00:00', 'description': "Officials unsealed court documents Monday (April 17) to reveal details surrounding the first searches of Prince's Paisley Park estate.", 'domain': '1041jackfm.cbslocal.com', 'image_url': 'https://cbs1041jackfm.files.wordpress.com/2017/04/prince-young-and-sad.jpg?w=946', 'text': 'By Abby Hassler\nOfficials unsealed court documents Monday (April 17) to reveal details surrounding the first searches of Prince’s Paisley Park estate following his untimely death.\nRelated: Prince’s Ex-Wife Mayte Garcia Says Memoir is not a Tell-All\nThe unsealed search warrants don’t confirm the source of the drug, fentanyl, that led to the 57-year-old singer’s accidental, self-administered overdose last April, according to The Star Tribune.\nInvestigators found no prescriptions in Prince’s name, however, Dr. Michael Todd Schulenberg told detectives he had written a prescription for oxycodone, which is also an opioid, under the name of long-time Prince associate

## Getting pseudo-labeled data for training

`0` represents viral language, `1` immigration language, and `2` a blend of the two. These categorizations are fuzzy and inexact and are not the result of manual annotations. They should be improved upon during the training process (or adjusted manually) when possible.

In [None]:
df = pd.read_csv('https://www.dropbox.com/s/kfbja23kisimedm/immigration.csv?dl=1')
df.head()

Unnamed: 0.1,Unnamed: 0,target,sigmoid,text
0,14823,2,0.699679,"Writing on the Lawfare blog, Weaver noted that..."
1,8950,2,0.69901,"Months after the blazes, many immigrants emplo..."
2,13815,2,0.697366,"“Every day, sanctuary cities release illegal i..."
3,11814,2,0.696693,It’s set on Mars and will have you fighting o...
4,15196,2,0.693552,"On Thursday, Whitman re-tweeted a letter to M..."


In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(df['text'], df['target'], train_size=0.7, random_state=42)

In [None]:
X_valid, y_valid

(17581     Rick Snyder about when the governor learned a...
 2420     When I was five, my family and I immigrated to...
 13476    The ED alleged that the FEMA violations were m...
 12248    The State Department has a list of nearly 60 g...
 12597    Immigrants’ rights groups have condemned the U...
                                ...                        
 10901                   Dreyfuss (who immigrated to the U.
 19562     I had had a kidney infection and scar tissue,...
 11622    The CBI case also accuses the airline's Indian...
 1846     “Impugning the official objective of a formal ...
 15841    The earlier periodontal disease is diagnosed i...
 Name: text, Length: 6226, dtype: object, 17581    0
 2420     1
 13476    1
 12248    1
 12597    1
         ..
 10901    1
 19562    0
 11622    1
 1846     1
 15841    0
 Name: target, Length: 6226, dtype: int64)

# Baseline 1

Let's use Naive Bayes. For the sake of simplicity, I will not weight the classes here, although we probably should. Note that sklearn's `fit` method takes weights per sample (its `sample_weight` argument) rather than per class, so you would need to pass a list of weights the same length as the training set, with each sample's weight determined by its class. Think of the weights as a table like this:

| Sample | Class | Weight |
| --- | --- | --- |
| sample 1 | 1 | 0.05 |
| sample 2 | 2 | 0.8 |
| sample 3 | 1 | 0.05 |
| sample 4 | 0 | 0.15 |
| sample 5 | 2 | 0.8 |
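
The table can be built by looking up each sample's class in a class-to-weight mapping. This is a minimal sketch that reuses the illustrative weights from the table; `to_sample_weights` is a hypothetical helper, not part of sklearn.

```python
# Class weights taken from the illustrative table above.
class_weights = {0: 0.15, 1: 0.05, 2: 0.8}


def to_sample_weights(labels, class_weights):
    """Return one weight per sample, looked up from the sample's class."""
    return [class_weights[label] for label in labels]


# Classes of sample 1 through sample 5 in the table.
labels = [1, 2, 1, 0, 2]
sample_weights = to_sample_weights(labels, class_weights)
print(sample_weights)  # [0.05, 0.8, 0.05, 0.15, 0.8]

# With the notebook's variables, a list built this way from y_train could be
# passed as naive_bayes.fit(train_vectorized, y_train, sample_weight=...).
```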

In [None]:
vectorizer = TfidfVectorizer()
train_vectorized = vectorizer.fit_transform(X_train)
valid_vectorized = vectorizer.transform(X_valid)
train_vectorized

<14527x22580 sparse matrix of type '<class 'numpy.float64'>'
	with 339053 stored elements in Compressed Sparse Row format>

In [None]:
naive_bayes = MultinomialNB()
naive_bayes.fit(train_vectorized, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
predictions = naive_bayes.predict(valid_vectorized)
predictions

array([0, 1, 1, ..., 1, 1, 0])

In [None]:
print(f'Accuracy: {accuracy_score(y_valid, predictions)}\n'
      f'Precision: {precision_score(y_valid, predictions, average=None)}\n'
      f'Recall: {recall_score(y_valid, predictions, average=None)}\n'
      f'F1 Score: {f1_score(y_valid, predictions, average=None)}\n')

Accuracy: 0.9649855444908448
Precision: [0.99604156 0.95005945 0.        ]
Recall: [0.92424242 0.99875    0.        ]
F1 Score: [0.95879971 0.97379647 0.        ]





In [None]:
# y-axis (rows) == true label and x-axis (columns) == predicted label
confusion_matrix(y_valid, predictions)

array([[2013,  165,    0],
       [   5, 3995,    0],
       [   3,   45,    0]])
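
The per-class precision and recall reported above can be read directly off this matrix: precision divides the diagonal entry by its column sum, and recall divides it by its row sum. The sketch below works through that arithmetic in plain Python; `per_class_precision_recall` is a hypothetical helper, and returning 0.0 for an empty column is an assumption that matches the zeros reported above.

```python
def per_class_precision_recall(matrix):
    """Compute per-class precision and recall from a square confusion matrix.

    Rows are true labels and columns are predicted labels.
    """
    n = len(matrix)
    precision, recall = [], []
    for c in range(n):
        true_positives = matrix[c][c]
        predicted_as_c = sum(matrix[r][c] for r in range(n))  # column sum
        actually_c = sum(matrix[c])                            # row sum
        # Report 0.0 when a class is never predicted (or never occurs).
        precision.append(true_positives / predicted_as_c if predicted_as_c else 0.0)
        recall.append(true_positives / actually_c if actually_c else 0.0)
    return precision, recall


# Confusion matrix printed above for the Naive Bayes baseline.
nb_matrix = [[2013, 165, 0],
             [5, 3995, 0],
             [3, 45, 0]]

precision, recall = per_class_precision_recall(nb_matrix)
for c in range(3):
    print(f'class {c}: precision={precision[c]:.4f} recall={recall[c]:.4f}')
```

The third column is all zeros, meaning the model never predicts class `2`; that is why every metric for that class is 0 despite the high overall accuracy.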

# Baseline 2

In [None]:
model = BertForSequenceClassification.from_pretrained('bert-large-uncased', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [None]:
# the classes are extremely unbalanced; let's generate weights that we can feed to loss function
unbalanced_weights = 1 / (y_train.value_counts() / len(y_train)).sort_index()
weights = unbalanced_weights / unbalanced_weights.sum()
weights

0    0.020874
1    0.011266
2    0.967860
Name: target, dtype: float64
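
The weights above are the inverse of each class's relative frequency, normalized to sum to 1, so the rarest class receives nearly all of the weight. The sketch below reproduces that arithmetic in plain Python. As an assumption for illustration, it uses the validation class counts (2178, 4000, and 48, the row sums of the Baseline 1 confusion matrix) rather than the training counts, so the values will differ slightly from those printed above; `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by 1 / frequency, normalized so the weights sum to 1."""
    counts = Counter(labels)
    total = len(labels)
    inverse = {c: total / n for c, n in counts.items()}
    norm = sum(inverse.values())
    return {c: inverse[c] / norm for c in sorted(inverse)}


# Illustrative input (an assumption): validation class counts, not y_train.
labels = [0] * 2178 + [1] * 4000 + [2] * 48
weights_by_class = inverse_frequency_weights(labels)
for c, w in weights_by_class.items():
    print(c, round(w, 6))
```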

In [None]:
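# I will exclude datasets, dataloaders, etc. for the sake of simplicity
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights.values).float().to(device))
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
batch_size = 4
model.train()  # enable dropout; from_pretrained returns the model in eval mode
for epoch in range(1):  # make this up to 3!
    running_loss = 0.
    num_batches = 0
    for batch_start in range(0, len(X_train), batch_size):
        # .iloc makes the positional slicing explicit (the index was shuffled by the split)
        X = X_train.iloc[batch_start:batch_start + batch_size].tolist()
        y = torch.tensor(y_train.iloc[batch_start:batch_start + batch_size].values).to(device)

        # truncate inputs that exceed BERT's maximum sequence length
        inputs = tokenizer(X, return_tensors='pt', padding=True, truncation=True).to(device)
        predictions = model(**inputs)
        # CrossEntropyLoss applies log-softmax internally, so pass the raw logits
        loss = criterion(predictions.logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        num_batches += 1
    # each loss value is already a per-batch mean, so average over batches
    print(f'Finished epoch {epoch} with average batch loss of {running_loss / num_batches}')
model.eval()  # switch back to inference mode for the validation cells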
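
The training loop above walks through the data in fixed-size slices. That batching pattern can be factored into a small generator, sketched below with toy data; `iter_batches` is a hypothetical helper rather than part of the notebook, and a `torch.utils.data.DataLoader` (already imported in the set-up) would be the more usual choice in a full system.

```python
def iter_batches(texts, labels, batch_size):
    """Yield (texts, labels) batches of at most `batch_size` items each."""
    if len(texts) != len(labels):
        raise ValueError('texts and labels must have the same length')
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size], labels[start:start + batch_size]


# Toy data standing in for X_train.tolist() and y_train.tolist().
toy_texts = ['a', 'b', 'c', 'd', 'e']
toy_labels = [0, 1, 1, 0, 2]

batches = list(iter_batches(toy_texts, toy_labels, batch_size=2))
print(len(batches))   # 3
print(batches[-1])    # (['e'], [2])
```

The last batch is smaller than the rest whenever the dataset size is not a multiple of the batch size, which is also true of the slicing in the loop above.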

In [None]:
# make predictions on validation set
valid_predictions = torch.zeros_like(torch.tensor(y_valid.values))
for batch_start in range(0, len(X_valid), 4):
    X = X_valid.iloc[batch_start:batch_start + 4].tolist()

    with torch.no_grad():
        # truncate inputs that exceed BERT's maximum sequence length
        predictions = model(**tokenizer(X, return_tensors='pt', padding=True, truncation=True).to(device))
        # softmax is monotonic, so argmax over the raw logits picks the same class
        indices = torch.argmax(predictions.logits, dim=-1)
    # move predictions back to the CPU, where valid_predictions lives
    valid_predictions[batch_start:batch_start + 4] = indices.cpu()

In [None]:
print(f'Accuracy: {accuracy_score(y_valid, valid_predictions.numpy())}\n'
      f'Precision: {precision_score(y_valid, valid_predictions.numpy(), average=None)}\n'
      f'Recall: {recall_score(y_valid, valid_predictions.numpy(), average=None)}\n'
      f'F1 Score: {f1_score(y_valid, valid_predictions.numpy(), average=None)}\n')

In [None]:
# y-axis (rows) == true label and x-axis (columns) == predicted label
confusion_matrix(y_valid, valid_predictions.numpy())

array([[2164,   14,    0],
       [   1, 3999,    0],
       [   6,   42,    0]])

# Your Original System

Improve upon the baselines above. Feel free to copy cells from one of the baselines, paste them here, and tweak them for improvements. sklearn offers several models to choose from, both for classification and for vectorization of text. Even just trying a different architecture for Baseline 2 (such as RoBERTa or DistilBERT) could help.