## OLD: Blue-or-Red RoBERTa Transformer of 2022 Tweets

In [1]:
from fastai.text.all import *
from ideology_utils import *

### Model Type

In [2]:
#model_nm = 'vinai/bertweet-base'
model_nm = 'cardiffnlp/twitter-roberta-base'

### Build Dataset of 2022 Congressional Tweets

#### Grab Tweets of Each Member of Congress for Each Week

In [3]:
path=Path('tweets')

In [4]:
df = pd.concat(map(tweets2df, path.rglob("tweets-congress*/*")))

#### Label Each Tweet by the Handle of the Legislator

In [5]:
df = label_tweets_of_legislators(df)

#### 2019 Tweets: Use Either the 2022 or the 2019 Tweets, Not Both

In [None]:
df = tweets2df("/home/cdaniels/fastai-projects/blue-or-red/data_full") # 2019 tweets contain party affiliation

#### Preprocess Tweets

In [6]:
df = preprocess_tweets(df)

In [7]:
def party2num(x): return 0 if x == 'Democrat' else 1  # Democrat = 0, Republican = 1; labels must be integers

In [8]:
df.party = df.party.apply(party2num)

The label column must be called `labels`. By convention, the text column is called `input`.

In [9]:
df = df.rename(columns={'text':'input', 'party':'labels'})

In [26]:
df.head()

Unnamed: 0,handle,input,labels
0,RepMikeLevin,The @user is doing incredible work in our community!I was proud to secure $150000 for the museum to fund childhood literacy programs. This funding will help advance elementary students’ reading comprehension through active engagement with works of art. http,0
1,RepMikeLevin,.@OutdoorAlliance is doing amazing work to protect our planet and expand access to public lands.During our meeting we discussed my American Coasts and Oceans Protection Act which would prohibit any new offshore drilling along the Southern California coast. http,0
2,RepMikeLevin,I'm also glad that @user has announced he will delay any new tariffs on the American solar industry. I helped lead a letter last month urging the admin to make this commonsense decision so we can continue to support solar jobs lower energy costs and meet our climate goals.,0
3,RepMikeLevin,Investing in clean energy isn't just good for our planet it's good for our economy and our national security.Today @user invoked the Defense Production Act to ensure we have the energy we need to run our country all year round.https://t.co/g7ksT09v4l,0
4,RepMikeLevin,Today we honor the heroic actions of the servicemembers who landed in Normandy 78 years ago and fought so hard to turn the tides of #WWII and protect our fundamental freedoms.Thank you for your service 🇺🇸https://t.co/J4Y64grM0T,0


### Create Dataset from DataFrame

In [11]:
from datasets import Dataset, DatasetDict

In [12]:
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['handle', 'input', 'labels'],
    num_rows: 30107
})

### Tokenize Dataset
The tokenizer depends on the model, so pass the model name to `AutoTokenizer.from_pretrained(model_nm)` to get the matching one.

In [13]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

**It is important that the tokenizer used for the training is the same as used for new documents**

In [14]:
tokz = AutoTokenizer.from_pretrained(model_nm)

Here's a simple function which tokenizes our inputs:

In [15]:
def tok_func(x): return tokz(x["input"])

Mapping `tok_func` over `ds` tokenizes the `input` field of every row with `tokz` and produces the tokenized dataset `tok_ds`.

Before mapping, check that every row can be tokenized without error:

In [16]:
for d in ds:
    try:
        tok_func(d)
    except Exception:  # print any row the tokenizer cannot handle
        print(d)

In [17]:
tok_ds = ds.map(tok_func, batched=True)

  0%|          | 0/31 [00:00<?, ?ba/s]
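The `31` in the progress bar is consistent with `Dataset.map` processing rows in batches of 1000, which is its default `batch_size` when `batched=True`. A quick sketch of that arithmetic (the helper name `num_batches` is made up for illustration):

```python
import math

def num_batches(num_rows, batch_size=1000):
    """Number of batches needed to cover num_rows rows."""
    return math.ceil(num_rows / batch_size)

# 30107 is the row count reported for `ds` in this notebook
print(num_batches(30107))
```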

### Create the Training and Test/Valid datasets from tok_ds

A `DatasetDict` (here, `dds`) holds the training and validation datasets. To create one with 25% of the data in the validation set and 75% in the training set, use `train_test_split`:

In [18]:
dds = tok_ds.train_test_split(.25) # DataSetDict
dds

DatasetDict({
    train: Dataset({
        features: ['handle', 'input', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 22580
    })
    test: Dataset({
        features: ['handle', 'input', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 7527
    })
})
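The row counts above follow from the 25% test fraction applied to the 30107 rows. As a sketch, assuming the test size is rounded up and the train size is rounded down (this reproduces the numbers shown, but the library's exact rounding rule is an assumption here, and `split_sizes` is a hypothetical helper):

```python
import math

def split_sizes(num_rows, test_fraction):
    """Return (n_train, n_test) for a given test fraction."""
    n_test = math.ceil(num_rows * test_fraction)          # round the test share up
    n_train = math.floor(num_rows * (1 - test_fraction))  # round the train share down
    return n_train, n_test

n_train, n_test = split_sizes(30107, 0.25)
print(n_train, n_test)
```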

### Metrics

In [None]:
#from sklearn.metrics import accuracy_score, f1_score

In [None]:
#def compute_metrics(eval_pred):
#    f1 = f1_score(*eval_pred, average="weighted")
#    acc = accuracy_score(*eval_pred)
#    return {"accuracy": acc, "f1": f1}

In [None]:
#def compute_metrics(eval_pred):
#    preds, labels = eval_pred
#    preds = np.argmax(preds, axis=1)
#    accuracy = sum(preds == labels)/len(preds)
#    return {"accuracy": accuracy}

In [19]:
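from datasets import load_metric  # deprecated in newer `datasets` releases in favor of the `evaluate` library
metric = load_metric("accuracy")  # load once rather than on every evaluation call

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.argmax(preds, axis=1)  # per-class scores -> predicted class index
    return metric.compute(predictions=preds, references=labels)  # returns a dict such as {"accuracy": ...}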

Transformers expects metrics to be returned as a `dict`, so that the trainer knows what label to use for each value. `compute_metrics` above does this by returning the dict produced by `metric.compute`.
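To make the shape of that `dict` concrete, here is a minimal plain-Python sketch of what an accuracy metric computes. It is not the library's implementation; `accuracy_from_logits` and the toy values are hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Return {"accuracy": fraction of rows whose highest-scoring class equals the label}."""
    # argmax over each row of per-class scores
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}

# Toy data: four examples, two classes (0 = Democrat, 1 = Republican)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

The key of the returned dict (`"accuracy"`) is what appears as the column name in the trainer's per-epoch report.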

### Create Model

In [20]:
from transformers import TrainingArguments, Trainer

Another Auto factory method, which uses `model_nm` to create a model consistent with the `AutoTokenizer`:

In [21]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)  # num_labels=2: one output per class (Democrat, Republican)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base and

### Create Trainer

In [22]:
bs = 128
epochs = 4
lr = 8e-5
#lr = 3e-5

All of the parameters related to the `Trainer` go into `TrainingArguments`:

In [23]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none', logging_strategy='epoch')
# logging_strategy='epoch' required to get Training_Loss
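To see what `warmup_ratio=0.1` combined with `lr_scheduler_type='cosine'` means for the learning rate, the sketch below assumes a linear warmup followed by a half-cosine decay to zero, which is my understanding of the scheduler Transformers uses for these settings. The function `lr_at_step` is hypothetical, and `total_steps=356` is the step count reported in this run's training log.

```python
import math

def lr_at_step(step, total_steps=356, base_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (0, 18, 36, 196, 356):
    print(s, lr_at_step(s))
```

Under these assumptions the rate peaks at `lr` once warmup ends and falls smoothly to zero at the final step.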

In [24]:
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=compute_metrics)

Using amp half precision backend


### Execute Training

In [25]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: input, handle. If input, handle are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 22580
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 356


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4859,0.355394,0.836057


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: input, handle. If input, handle are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7527
  Batch size = 512


KeyboardInterrupt: 
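The step count in the log above can be checked by hand. A total batch size of 256 with a per-device batch size of 128 and no gradient accumulation suggests two devices; that is an inference from the log, not something the notebook states. A sketch of the arithmetic:

```python
import math

per_device_batch = 128   # per_device_train_batch_size
total_batch = 256        # "Total train batch size" from the log
num_examples = 22580     # training rows
epochs = 4

num_devices = total_batch // per_device_batch          # inferred, not stated in the notebook
steps_per_epoch = math.ceil(num_examples / total_batch)
total_steps = steps_per_epoch * epochs

print(num_devices, steps_per_epoch, total_steps)
```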

### Save Model

In [31]:
roberta_model = 'blue-or-red-roberta-2022'

In [32]:
tokz.save_pretrained(roberta_model)

tokenizer config file saved in blue-or-red-roberta-2022/tokenizer_config.json
Special tokens file saved in blue-or-red-roberta-2022/special_tokens_map.json


('blue-or-red-roberta-2022/tokenizer_config.json',
 'blue-or-red-roberta-2022/special_tokens_map.json',
 'blue-or-red-roberta-2022/vocab.json',
 'blue-or-red-roberta-2022/merges.txt',
 'blue-or-red-roberta-2022/added_tokens.json',
 'blue-or-red-roberta-2022/tokenizer.json')

In [33]:
model.save_pretrained(roberta_model)

Configuration saved in blue-or-red-roberta-2022/config.json
Model weights saved in blue-or-red-roberta-2022/pytorch_model.bin


## Create Test Dataset and Evaluate using 2022 Tweets

### Load Saved Model and Create Trainer

#### 1) Restarted Kernel: Start Here

In [1]:
from fastai.text.all import *
from ideology_utils import *
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import Dataset, DatasetDict
from transformers import TrainingArguments, Trainer

In [2]:
def tok_func(x): return tokz(x["input"])

In [3]:
def party2num(x): return 0 if x == 'Democrat' else 1  # Democrat = 0, Republican = 1; labels must be integers

In [4]:
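from datasets import load_metric  # deprecated in newer `datasets` releases in favor of the `evaluate` library
metric = load_metric("accuracy")  # load once rather than on every evaluation call

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.argmax(preds, axis=1)  # per-class scores -> predicted class index
    return metric.compute(predictions=preds, references=labels)  # returns a dict such as {"accuracy": ...}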

#### 2) Otherwise: Reload Saved Model from Here

In [5]:
roberta_model = 'blue-or-red-roberta-2022'
model = AutoModelForSequenceClassification.from_pretrained(roberta_model)
tokz  = AutoTokenizer.from_pretrained(roberta_model)
args = TrainingArguments("tmp_trainer", per_device_eval_batch_size=128)
trainer = Trainer(model, args, tokenizer=tokz)

### Create Evaluation Dataset

In [103]:
path = Path('tweets')  # redefined here because it is lost when the kernel restarts
#weeks = ['tweets-congress-2022-05-18','tweets-congress-2022-05-24','tweets-congress-2022-06-01']
weeks = ['tweets-congress-2022-06-01']
weeks = [path/week for week in weeks]

In [104]:
# Read every CSV for the selected weeks and concatenate them in one step
dfs = [pd.read_csv(t) for week in weeks for t in Path(week).ls()]
df = pd.concat(dfs, ignore_index=True)
df = label_tweets_of_legislators(df)

In [106]:
df = preprocess_tweets(df)
df.party = df.party.apply(party2num)
df = df.rename(columns={'text':'input', 'party':'labels'})

In [107]:
eval_df = df[['input']]
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

  0%|          | 0/6 [00:00<?, ?ba/s]

In [108]:
eval_labels = df.labels

### Prediction and Inference

Use `trainer.predict(eval_ds)` to make predictions on `eval_ds`. It returns raw logits, which `F.softmax` converts to per-class probabilities:

In [109]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds = torch.tensor(preds)
preds = F.softmax(preds, dim = 1)

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: input. If input are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 5849
  Batch size = 512
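For reference, the softmax applied to each row of logits above can be sketched in plain Python. This is an illustration of the formula, not the PyTorch implementation, and the toy logits are made up:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0])  # toy logits for (Democrat, Republican)
print(probs)
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same predicted class as taking the argmax of the raw logits.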


In [110]:
compute_metrics([preds,eval_labels])

{'accuracy': 0.9446059155411182}

### Create our CSV Submission Results

In [None]:
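import datasets

# eval_ds has no 'id' column, so identify each tweet by its legislator's handle instead
submission = datasets.Dataset.from_dict({
    'handle': df['handle'].tolist(),
    'score': preds[:, 1].tolist()  # probability of class 1 (Republican)
})

submission.to_csv('submission.csv', index=False)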

In [None]:
pd.read_csv("submission.csv")