## Final Prediction <a class="anchor"  id="chapter11"></a> 

For the final submission I will use the BERT model on the uncleaned data, since it performed slightly better than RoBERTa on that data. The model is retrained on all labelled rows, without a validation split, and then used to predict the test set.

In [27]:
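def preprocess_final(df, text_column, target_column, model_name, max_length=50, device='cuda'):
    """Split df into labelled (train) and unlabelled (test) rows, tokenize the
    text column, and return the train/test DataLoaders and TensorDatasets."""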
    train_cleaned = df[df[target_column].notna()]
    test_cleaned = df[df[target_column].isna()]

    train_dataset = train_cleaned[[text_column, target_column]]
    text = train_dataset[text_column].values
    labels = train_dataset[target_column].values

    test_dataset = test_cleaned[text_column].values

    # Set the device (note: the tensors created below are not moved to it and stay on the CPU)
    device = torch.device(device)

    # Load the BERT tokenizer
    tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

    # Tokenize the training texts
    encoded_dict = tokenizer(text=text.tolist(),
                             add_special_tokens=True,
                             max_length=max_length,
                             truncation=True,
                             padding=True,
                             return_token_type_ids=False,
                             return_attention_mask=True,
                             verbose=True)

    # Tokenize the test texts
    encoded_test = tokenizer(text=test_dataset.tolist(),
                             add_special_tokens=True,
                             max_length=max_length,
                             truncation=True,
                             padding=True,
                             return_token_type_ids=False,
                             return_attention_mask=True,
                             verbose=True)

    # Convert the tokenizer output (plain Python lists) to PyTorch tensors
    input_ids_train = torch.tensor(encoded_dict['input_ids'])
    attention_mask_train = torch.tensor(encoded_dict['attention_mask'])
    labels = torch.tensor(labels)
    labels = labels.to(torch.int64)

    # Test set
    input_ids_test = torch.tensor(encoded_test['input_ids'])
    attention_mask_test = torch.tensor(encoded_test['attention_mask'])

    # Combine the inputs and labels into a TensorDataset
    train_dataset = TensorDataset(input_ids_train, attention_mask_train, labels)
    test_dataset = TensorDataset(input_ids_test, attention_mask_test)

    # Define the dataloaders for the training and test sets
    batch_size = 16
    train_dataloader = DataLoader(train_dataset, 
                                  batch_size=batch_size, 
                                  shuffle=True)
    test_dataloader = DataLoader(test_dataset, 
                                batch_size=batch_size)

    return train_dataloader, test_dataloader, train_dataset, test_dataset
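
The effect of `max_length`, `truncation=True` and `padding=True` in the function above can be illustrated with a small pure-Python sketch. It is only an approximation under stated assumptions: `pad_and_truncate`, `toy_batch` and the token ids are made up for illustration, the pad id of 0 is assumed, and the special-token handling of the real tokenizer (which keeps the closing `[SEP]` token when it truncates) is ignored.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        # Real tokens get mask 1, padding positions get mask 0
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids, not real tokenizer output
toy_batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31]]
ids, mask = pad_and_truncate(toy_batch, max_length=5)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Because `padding=True` pads to the longest sequence within a single tokenizer call, the training and test tensors can end up with different widths, each at most `max_length`. Passing `padding='max_length'` instead would give both the same fixed width.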


In [28]:
train_dataloader, test_dataloader, train_dataset, test_dataset = preprocess_final(df, 'text', 'target', 'bert-base-uncased')

predictions = train_validate_test('bert-base-uncased', train_dataset, test_dataset, epochs=2, validation=False)

# Convert the list of predictions to a numpy array and flatten it
predictions = np.concatenate(predictions).flatten()

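# Create a pandas dataframe with the test ids and the predicted targets.
# test_cleaned only exists inside preprocess_final, so the ids are taken from the
# same unlabelled rows of df, in the same order as the (unshuffled) test dataloader
test_ids = df.loc[df['target'].isna(), 'id'].values
submission_df = pd.DataFrame({'id': test_ids, 'target': predictions})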

# Output the dataframe
submission_df.head()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

|   | id | target |
|---|----|--------|
| 0 | 0  | 1      |
| 1 | 2  | 1      |
| 2 | 3  | 1      |
| 3 | 9  | 1      |
| 4 | 11 | 1      |
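
The flattening step and the pairing of ids with predictions can be sketched without any libraries. This is a minimal illustration that assumes `train_validate_test` returns one sequence of predicted labels per batch, which this section does not show. `batch_predictions` and `test_ids` hold made-up values.

```python
from itertools import chain

# Made-up per-batch outputs: batch size 3, with a smaller final batch
batch_predictions = [[1, 0, 1], [0, 0, 1], [1]]
# Made-up ids, in the same order as the rows fed to the test dataloader
test_ids = [0, 2, 3, 9, 11, 12, 21]

# Pure-Python counterpart of np.concatenate(predictions).flatten()
flat_predictions = list(chain.from_iterable(batch_predictions))

# One prediction per test row, otherwise ids and targets would be misaligned
if len(flat_predictions) != len(test_ids):
    raise ValueError("number of predictions does not match number of test ids")

rows = list(zip(test_ids, flat_predictions))
print(rows[:3])
```

The length check matters because the test dataloader is not shuffled: predictions line up with ids by position only, so a missing or extra batch would silently shift every label.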


In [29]:
submission_df.to_csv('submission.csv', index=False)
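
Before uploading, it can be worth checking the written file. The sketch below uses only the standard library and assumes the competition expects exactly the columns `id` and `target`, with binary targets 0 or 1. `check_submission` is a hypothetical helper, and the demo file contains made-up rows, not the real predictions.

```python
import csv
import os
import tempfile


def check_submission(path, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "target"]:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(row["target"] not in {"0", "1"} for row in rows):
        problems.append("target contains values other than 0 and 1")
    if len({row["id"] for row in rows}) != len(rows):
        problems.append("duplicate ids")
    return problems


# Demonstration on a small stand-in file with made-up rows
demo_path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "target"])
    writer.writerows([[0, 1], [2, 1], [3, 0]])

demo_problems = check_submission(demo_path, expected_rows=3)
print(demo_problems)
```

For the real file, the call would be `check_submission('submission.csv', expected_rows=n)`, where `n` is the number of unlabelled rows in `df`.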