In this competition, we aim to build a model that scores student essays automatically. The competition updates an earlier one held over a decade ago, with the goal of improving essay scoring algorithms to enhance student learning outcomes.

Given an essay of $n$ words $X = \{x_i\}_{i=1}^{n}$, the model must output a single score $y$ that measures the quality of the essay.

In [38]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl
import nltk
cmap = mpl.colormaps['coolwarm']  # mpl.cm.get_cmap is deprecated in recent matplotlib releases
import torch
import torch.nn as nn
import re
import transformers
from transformers import AutoModel, AutoTokenizer, BertModel, BertTokenizer, BertConfig
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import cohen_kappa_score
import sys
sys.path.append('/kaggle/input/py-file')
sys.path.append('/kaggle/input/wordnet')

from torch_shallow_neural_classifier import TorchShallowNeuralClassifier



# ⚙️ | Configuration

In [39]:
class CFG:
    seed = 42  # Random seed 
    sequence_length = 512  # Input sequence length
    batch_size = 16  # Batch size
    #weights_name = "/kaggle/input/bert-medium"# Name of pretrained models
    #weights_name = "/kaggle/input/hf_bert-medium/pytorch/bert_medium/1"
    #weights_name = "/kaggle/input/deberta-v3-small/transformers/v1/1"
    weights_name = "/kaggle/input/distilbert/transformers/distilbert/1/distilB"

In [40]:
# Select the device used for fine-tuning: the GPU if available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ♻️ | Reproducibility 
Sets the random seed so that each run produces similar results.

In [41]:
torch.manual_seed(CFG.seed)
transformers.logging.set_verbosity_error()
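
The cell above seeds only PyTorch's generator. A broader helper, sketched here on the assumption that Python's `random` and NumPy generators also matter for this notebook (both are used later), could look like the following. The name `seed_everything` is hypothetical and not part of the notebook.

```python
import os
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy and (if installed) PyTorch random generators."""
    # Only affects hash randomization in subprocesses started after this call
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed; nothing more to seed
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Seeding twice with the same value reproduces the same draws
seed_everything(42)
first = (random.random(), np.random.rand())
seed_everything(42)
second = (random.random(), np.random.rand())
print(first == second)
```

Even with all generators seeded, some GPU kernels are non-deterministic, so results may still differ slightly between runs.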

# 📁 | Dataset Path 

In [42]:
BASE_PATH = '/kaggle/input/learning-agency-lab-automated-essay-scoring-2'

# 📖 | Meta Data

**Files in the dataset:**

- `{test|train}.csv`
  - `essay_id`: Unique identifier for each essay.
  - `full_text`: Essay text.
  - `score`: Essay's score from `1-6` (present only in `train.csv`).
- `sample_submission.csv`: Valid sample submission.

**What does the `score` mean?**

The `score` represents the quality of student-written argumentative essays. Essays were rated based on a rubric covering perspective development, critical thinking, evidence use, organization, language, and grammar/mechanics. Here's a summary of the scoring criteria:

| Score | Description |
|-------|-------------|
| 6     | Clear mastery with few errors, outstanding critical thinking, appropriate evidence, well-organized, skilled language use. |
| 5     | Reasonable mastery with occasional errors, strong critical thinking, generally appropriate evidence, well-organized, good language use. |
| 4     | Adequate mastery with some lapses, competent critical thinking, adequate evidence, generally organized, fair language use. |
| 3     | Developing mastery with weaknesses, limited critical thinking, inconsistent evidence, limited organization, fair language use with weaknesses. |
| 2     | Little mastery with serious flaws, weak critical thinking, insufficient evidence, poor organization, limited language use with frequent errors. |
| 1     | Very little or no mastery, severely flawed, no viable point of view, disorganized, fundamental language flaws, pervasive grammar/mechanics errors. |

> This grading is very similar to the grading used in the [ETS GRE (Graduate Record Examinations) AWA](https://www.ets.org/gre/test-takers/general-test/prepare/content/analytical-writing.html) exam, where prospective graduate students are asked to write essays to judge their analytical abilities, and their scores are later used for graduate admission. 


In [43]:
# Load data
df = pd.read_csv(f'{BASE_PATH}/train.csv')  # Read CSV file into a DataFrame

In [44]:
# Display information about the train data
print("# Train Data: {:,}".format(len(df)))
# Set the display options to show the maximum column width
pd.set_option('display.max_colwidth', None)
df.head(2)

# Train Data: 17,307


Unnamed: 0,essay_id,full_text,score
0,000d118,"Many people have car where they live. The thing they don't know is that when you use a car alot of thing can happen like you can get in accidet or the smoke that the car has is bad to breath on if someone is walk but in VAUBAN,Germany they dont have that proble because 70 percent of vauban's families do not own cars,and 57 percent sold a car to move there. Street parkig ,driveways and home garages are forbidden on the outskirts of freiburd that near the French and Swiss borders. You probaly won't see a car in Vauban's streets because they are completely ""car free"" but If some that lives in VAUBAN that owns a car ownership is allowed,but there are only two places that you can park a large garages at the edge of the development,where a car owner buys a space but it not cheap to buy one they sell the space for you car for $40,000 along with a home. The vauban people completed this in 2006 ,they said that this an example of a growing trend in Europe,The untile states and some where else are suburban life from auto use this is called ""smart planning"". The current efforts to drastically reduce greenhouse gas emissions from tailes the passengee cars are responsible for 12 percent of greenhouse gas emissions in Europe and up to 50 percent in some car intensive in the United States. I honeslty think that good idea that they did that is Vaudan because that makes cities denser and better for walking and in VAUBAN there are 5,500 residents within a rectangular square mile. 
In the artical David Gold berg said that ""All of our development since World war 2 has been centered on the cars,and that will have to change"" and i think that was very true what David Gold said because alot thing we need cars to do we can go anyway were with out cars beacuse some people are a very lazy to walk to place thats why they alot of people use car and i think that it was a good idea that that they did that in VAUBAN so people can see how we really don't need car to go to place from place because we can walk from were we need to go or we can ride bycles with out the use of a car. It good that they are doing that if you thik about your help the earth in way and thats a very good thing to. In the United states ,the Environmental protection Agency is promoting what is called ""car reduced""communtunties,and the legislators are starting to act,if cautiously. Maany experts expect pubic transport serving suburbs to play a much larger role in a new six years federal transportation bill to approved this year. In previous bill,80 percent of appropriations have by law gone to highways and only 20 percent to other transports. There many good reason why they should do this.",3
1,000fe60,"I am a scientist at NASA that is discussing the ""face"" on mars. I will be explaining how the ""face"" is a land form. By sharing my information about this isue i will tell you just that.\n\nFirst off, how could it be a martions drawing. There is no plant life on mars as of rite now that we know of, which means so far as we know it is not possible for any type of life. That explains how it could not be made by martians. Also why and how would a martion build a face so big. It just does not make any since that a martian did this.\n\nNext, why it is a landform. There are many landforms that are weird here in America, and there is also landforms all around the whole Earth. Many of them look like something we can relate to like a snake a turtle a human... So if there are landforms on earth dont you think landforms are on mars to? Of course! why not? It's just unique that the landform on Mars looks like a human face. Also if there was martians and they were trying to get our attention dont you think we would have saw one by now?\n\nFinaly, why you should listen to me. You should listen to me because i am a member of NASA and i've been dealing with all of this stuff that were talking about and people who say martians did this have no relation with NASA and have never worked with anything to relate to this landform. One last thing is that everyone working at NASA says the same thing i say, that the ""face"" is just a landform.\n\nTo sum all this up the ""face"" on mars is a landform but others would like to beleive it's a martian sculpture. Which every one that works at NASA says it's a landform and they are all the ones working on the planet and taking pictures.",3


In [45]:
count = (df['score'] == 1).sum()
print(count)

1252


In [46]:
# Ensure the necessary nltk data is downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional: for additional WordNet resources


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [47]:
def f7(seq):
    """
    Makes a list unique
    """
    seen = set()
    seen_add = seen.add
    return [x for x in seq if x not in seen and not seen_add(x)]


In [52]:
from nltk.corpus import wordnet

def get_wordnet_syns(word):
    """
    Utilize wordnet (installed with nltk) to get synonyms for words
    word is the input word
    returns a list of unique synonyms
    """
    synonyms = []
    regex = r"_"
    pat = re.compile(regex)
    synset = wordnet.synsets(word)
    for ss in synset:
        for swords in ss.lemma_names():
            synonyms.append(pat.sub(" ", swords.lower()))
    synonyms = f7(synonyms)
    return synonyms

In [49]:
def replace_first_zero_with_variable(s, replacement):
    # Find the first occurrence of '0'
    index = s.find('0')
    if index != -1:
        # Replace the first '0' with the replacement variable
        s = s[:index] + replacement + s[index+1:]
    return s

In [50]:
import random

def generate_additional_essays(row, dictionary=None, max_syns=3):
    """
    Substitute synonyms to generate extra essays from an existing one.
    This is done to increase the amount of training data.
    Only essays scored 5 or 6 are augmented; lower scores yield no records.
    row is a DataFrame row with `essay_id`, `full_text` and `score`.
    dictionary is a fixed dictionary (list) of words to replace.
    max_syns defines the maximum number of additional essays to generate.  Do not set too high.
    Returns a list of new records (dicts) instead of modifying the global `df`.
    """
    essay_id = row['essay_id']
    e_text = row['full_text']
    e_score = row['score']
    if e_score < 5:
        return []
    e_toks = nltk.word_tokenize(e_text)
    all_syns = []
    for word in e_toks:
        synonyms = get_wordnet_syns(word)
        if len(synonyms) > max_syns:
            synonyms = random.sample(synonyms, max_syns)
        all_syns.append(synonyms)
    new_records = []
    for i in range(max_syns):
        syn_toks = list(e_toks)  # Copy so the original tokens are not overwritten
        for z in range(len(e_toks)):
            if len(all_syns[z]) > i and (dictionary is None or e_toks[z] in dictionary):
                syn_toks[z] = all_syns[z][i]
        # The replacement must be a string; i + 1 avoids swapping '0' for '0',
        # which would reproduce the original id
        new_id = replace_first_zero_with_variable(essay_id, str(i + 1))
        new_records.append({'essay_id': new_id, 'full_text': " ".join(syn_toks), 'score': e_score})
    return new_records

In [53]:
# Build the augmented records row by row and append them to the training data
new_records = [rec for _, row in df.iterrows() for rec in generate_additional_essays(row)]
df = pd.concat([df, pd.DataFrame(new_records)], ignore_index=True)


LookupError: 
**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
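
The substitution loop in `generate_additional_essays` can be illustrated without the WordNet corpus. The sketch below is a simplified stand-in, not the notebook's function: it uses a small hand-written synonym table (`TOY_SYNONYMS`, hypothetical) in place of `get_wordnet_syns` and whitespace splitting in place of `nltk.word_tokenize`.

```python
# Hypothetical synonym table used only for this illustration
TOY_SYNONYMS = {
    "good": ["fine", "great"],
    "car": ["automobile"],
}


def make_variants(text, synonyms, max_syns=2):
    """Return `max_syns` variants; variant i uses the i-th synonym of each word."""
    tokens = text.split()
    variants = []
    for i in range(max_syns):
        new_tokens = list(tokens)  # copy so the original tokens stay intact
        for pos, tok in enumerate(tokens):
            options = synonyms.get(tok, [])
            if len(options) > i:  # words with too few synonyms are left unchanged
                new_tokens[pos] = options[i]
        variants.append(" ".join(new_tokens))
    return variants


variants = make_variants("a good car is good", TOY_SYNONYMS)
for v in variants:
    print(v)
```

Copying the token list for each variant matters: without the copy, every variant would be written into the same list and later variants would overwrite earlier ones.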

In [None]:
df['essay_length'] = df['full_text'].str.split().str.len()

Here we load the pretrained encoder and tokenizer specified by `CFG.weights_name` (currently a DistilBERT checkpoint):

In [None]:
bert = AutoModel.from_pretrained(CFG.weights_name)

bert_tokenizer = AutoTokenizer.from_pretrained(CFG.weights_name)

In [None]:
def extract_segments(text, first_n=382, last_n=128):
    """
    Extracts the first `first_n` words and the last `last_n` words from the text if the text has more than first_n + last_n words.
    Otherwise, returns the text as is.

    Args:
        text (str): The input text.
        first_n (int): The number of words to take from the beginning.
        last_n (int): The number of words to take from the end.

    Returns:
        str: The extracted segments or the original text.
    """
    # Split the text into words
    words = text.split()

    # Check if the text has more than first_n + last_n words
    if len(words) > (first_n + last_n):
        # Extract the first first_n words and the last last_n words
        extracted_words = words[:first_n] + words[-last_n:]
    else:
        # Return the entire text as is
        extracted_words = words

    # Join the words back into a single string
    return " ".join(extracted_words)
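
To make the head-plus-tail behaviour concrete, here is a compact copy of the same logic run with small limits. The name `head_tail` and the toy inputs are illustrative only.

```python
def head_tail(text, first_n, last_n):
    """Keep the first `first_n` and last `last_n` words of long texts."""
    words = text.split()
    if len(words) > first_n + last_n:
        words = words[:first_n] + words[-last_n:]
    return " ".join(words)


long_text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
short_text = "w0 w1 w2"

truncated = head_tail(long_text, first_n=3, last_n=2)
unchanged = head_tail(short_text, first_n=3, last_n=2)
print(truncated)
print(unchanged)
```

Note that the limits count whitespace-separated words, not model tokens. A word may be split into several subword tokens, so a text cut to 510 words can still exceed 512 tokens and be truncated again by the tokenizer.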


In [None]:
def to_ordinal(y, num_classes=None, dtype="int"):
    """Converts a class vector (integers) to an ordinal regression matrix.

    This utility encodes class vector to ordinal regression/classification
    matrix where each sample is indicated by a row and rank of that sample is
    indicated by number of ones in that row.

    Args:
        y: Array-like with class values to be converted into a matrix
            (integers from 0 to `num_classes - 1`).
        num_classes: Total number of classes. If `None`, this would be inferred
            as `max(y) + 1`.
        dtype: The data type of the returned matrix. Default: `'int'`.

    Returns:
        An ordinal regression matrix representation of the input as a NumPy
        array. The class axis is placed last.
    """
    y = np.array(y, dtype="int")
    input_shape = y.shape

    # Shrink the last dimension if the shape is (..., 1).
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = tuple(input_shape[:-1])

    y = y.reshape(-1)
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    range_values = np.arange(num_classes - 1)
    range_values = np.tile(np.expand_dims(range_values, 0), [n, 1])
    ordinal = np.zeros((n, num_classes - 1), dtype=dtype)
    ordinal[range_values < np.expand_dims(y, -1)] = 1
    output_shape = input_shape + (num_classes - 1,)
    ordinal = np.reshape(ordinal, output_shape)
    return ordinal

## Label Conversion

In [None]:
df['label'] = to_ordinal(df.score.values).tolist()
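
As a small illustration of the encoding, the sketch below re-implements the core of `to_ordinal` together with the decoding step used later in the notebook (counting the entries above 0.5). The helper names `encode_ordinal` and `decode_ordinal` are hypothetical.

```python
import numpy as np


def encode_ordinal(scores, num_classes):
    """Row for score s has s leading ones followed by zeros (length num_classes - 1)."""
    scores = np.asarray(scores, dtype=int).reshape(-1, 1)
    thresholds = np.arange(num_classes - 1).reshape(1, -1)
    return (thresholds < scores).astype(int)


def decode_ordinal(outputs, threshold=0.5):
    """Recover the score by counting how many positions exceed the threshold."""
    return (np.asarray(outputs) > threshold).sum(axis=-1)


scores = [1, 3, 6]
# Scores run from 1 to 6, so the inferred class count is max + 1 = 7,
# which gives 6 columns per label
encoded = encode_ordinal(scores, num_classes=7)
decoded = decode_ordinal(encoded)
print(encoded)
print(decoded)
```

Because a score of `s` maps to exactly `s` ones, summing the thresholded outputs returns the score directly, with no offset to add back.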

In [None]:
def clean_text(text):
    if text is None:
        print("Error: Received None text input.")
        return ""  # Return an empty string so the output type is always str

    # Remove '@name'
    text = re.sub(r'(@.*?)[\s]', ' ', text)
    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)
    # Replace an escaped apostrophe (\') with a plain one
    text = re.sub(r"\\'", "'", text)
    # Collapse every whitespace run (including "\n\n") into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    text = text.lower()

    return extract_segments(text)

In [None]:
def sub_chars(string):
    """
    Strips illegal characters from a string.  Used to sanitize input essays.
    Replaces every character that is not a letter or one of . ? ! , ' ; : with a space
    (digits are removed too), then puts a space before each of . , ? ! ; :
    Returns sanitized string.
    string - string
    """
    #Define replacement patterns
    sub_pat = r"[^A-Za-z\.\?!,';:]"
    char_pat = r"\."
    com_pat = r","
    ques_pat = r"\?"
    excl_pat = r"!"
    sem_pat = r";"
    col_pat = r":"
    whitespace_pat = r"\s{1,}"

    #Replace text.  Ordering is very important!
    nstring = re.sub(sub_pat, " ", string)
    nstring = re.sub(char_pat," .", nstring)
    nstring = re.sub(com_pat, " ,", nstring)
    nstring = re.sub(ques_pat, " ?", nstring)
    nstring = re.sub(excl_pat, " !", nstring)
    nstring = re.sub(sem_pat, " ;", nstring)
    nstring = re.sub(col_pat, " :", nstring)
    nstring = re.sub(whitespace_pat, " ", nstring)

    return nstring

In [None]:
df['cleaned_full_text'] = df['full_text'].apply(lambda x:clean_text(x))

# 🔪 | Data Split

In the code snippet below, we split the existing **train** data into training (80%) and validation (20%) sets, stratified on the `score` column.

In [None]:
train_df, valid_df = train_test_split(df, test_size=0.2, stratify=df["score"], shuffle=True, random_state=CFG.seed)

In [None]:
train_df.head(2)

# 🎨 | Data Visualization

In [None]:
# Show the distribution of scores using a bar plot (sorted by score, one color per score)
plt.figure(figsize=(8, 4))
df.score.value_counts().sort_index().plot.bar(color=[cmap(v) for v in np.linspace(0.0, 1.0, 6)])
plt.xlabel("Score")
plt.ylabel("Count")
plt.title("Score distribution for Train Data")
plt.show()

# Show the distribution of essay length (in characters) using a histogram
plt.figure(figsize=(8, 4))
df['essay_length'] = df.full_text.map(len)  # Overwrites the earlier word-count column with character counts
df.essay_length.plot.hist(logy=False, color=cmap(0.9))
plt.xlabel("Essay Length (characters)")
plt.ylabel("Count")
plt.title("Essay Length distribution for Train Data")
plt.show()

# 🍽️ | Preprocessing

**What it does:** The preprocessor takes input strings and transforms them into a dictionary (`input_ids`, `attention_mask`) containing preprocessed tensors. This process starts with tokenization, where input strings are converted into sequences of token IDs.

**Why it's important:** Initially, raw text data is complex and challenging for modeling due to its high dimensionality. By converting text into a compact set of tokens, such as transforming `"The quick brown fox"` into `["the", "qu", "##ick", "br", "##own", "fox"]`, we simplify the data. Many models rely on special tokens and additional tensors to understand input. These tokens help divide input and identify padding, among other tasks. Making all sequences the same length through padding boosts computational efficiency, making subsequent steps smoother.
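
The padding and masking step described above can be sketched in plain Python. This is a toy stand-in for what the tokenizer does, not its implementation: the token ids are illustrative, `pad_batch` is a hypothetical helper, and unlike the real tokenizer this simple truncation does not re-append the end-of-sequence token.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Truncate or pad each id list to `max_length` and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]             # truncate long examples
        n_pad = max_length - len(ids)      # number of padding slots needed
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_batch(batch, max_length=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor, while the mask tells the model which positions to ignore.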

### Batch tokenization
Use the tokenizer's `batch_encode_plus` method to tokenize a list of strings.

In [None]:
def get_batch_token_ids(batch, tokenizer):
    """Map `batch` to a tensor of ids. The return
    value should meet the following specification:

    1. The max length should be `CFG.sequence_length` (512).
    2. Examples longer than the max length should be truncated.
    3. Examples should be padded to the max length.
    4. The special [CLS] should be added to the start and the special
       token [SEP] should be added to the end.
    5. The attention mask should be returned
    6. The return value of each component should be a tensor.

    Parameters
    ----------
    batch: list of str
    tokenizer: Hugging Face tokenizer

    Returns
    -------
    dict with at least "input_ids" and "attention_mask" as keys,
    each with Tensor values

    """
    encoding = tokenizer.batch_encode_plus(batch, max_length=CFG.sequence_length, padding='max_length',
                                     truncation=True, return_tensors='pt', add_special_tokens=True)

    return encoding


### Fine-tuning module
1. In the `__init__` method, define `self.classifier_layer` using `nn.Sequential`.
2. Complete the `forward` method.

In [None]:
class BertClassifierModule(nn.Module):
    def __init__(self,
            n_classes,
            hidden_activation,
            weights_name=CFG.weights_name):
        """This module loads a Transformer based on  `weights_name`,
        puts it in train mode, adds a dense layer with the activation
        function given by `hidden_activation`, and puts a classifier
        layer on top of that as the final output. The output of
        the dense layer has the same dimensionality as the
        model's hidden states.

        Parameters
        ----------
        n_classes : int
            Number of classes for the output layer
        hidden_activation : torch activation function
            e.g., nn.Tanh()
        weights_name : str
            Name of pretrained model to load from Hugging Face

        """
        super().__init__()
        self.n_classes = n_classes
        self.weights_name = weights_name  # Use the argument instead of always reading CFG
        self.bert = AutoModel.from_pretrained(self.weights_name)
        self.bert.train()
        self.hidden_activation = hidden_activation
        self.hidden_dim = self.bert.embeddings.word_embeddings.embedding_dim
        # Add the new parameters here using `nn.Sequential`.
        # We can define this layer as
        #
        #  h = f(cW1 + b_h)
        #  y = hW2 + b_y
        #
        # where c is the pooled representation of the final hidden states
        # (max-pooled over the sequence in `forward`),
        # W1 has dimensionality (self.hidden_dim, self.hidden_dim),
        # W2 has dimensionality (self.hidden_dim, self.n_classes),
        # f is the hidden activation (followed by dropout), and we rely on
        # the PyTorch loss function to apply a softmax to y.
        # Define the classifier_layer using nn.Sequential
        
        self.classifier_layer = nn.Sequential(
            nn.Linear(self.hidden_dim,self.hidden_dim),  
            self.hidden_activation,      # Activation function
            nn.Dropout(0.1),
            nn.Linear(self.hidden_dim, self.n_classes)
        )
        #self.classifier_layer.apply(init_weights)
        
    def forward(self, indices, mask):
        """Process `indices` with `mask` by feeding these arguments
        to `self.bert`, max-pooling `last_hidden_state` over the sequence
        dimension, and feeding the result to `self.classifier_layer`

        Parameters
        ----------
        indices : tensor.LongTensor of shape (n_batch, k)
            Indices into the `self.bert` embedding layer. `n_batch` is
            the number of examples and `k` is the sequence length for
            this batch
        mask : tensor.LongTensor of shape (n_batch, k)
            Binary vector indicating which values should be masked.
            `n_batch` is the number of examples and `k` is the
            sequence length for this batch

        Returns
        -------
        tensor.FloatTensor
            Predicted values, shape `(n_batch, self.n_classes)`

        """
        # Process indices and mask through self.bert
        outputs = self.bert(indices, attention_mask=mask)

        # Extract all token representations (the [CLS]-only alternative is kept for reference)
        #cls_token_representation = outputs.last_hidden_state[:, 0, :]
        last_hidden_state = outputs.last_hidden_state

        # Apply max-pooling across the sequence dimension (dim=1).
        # Note: padded positions are not masked out here.
        pooled_representation, _ = torch.max(last_hidden_state, dim=1)

        # Apply the classifier layer
        logits = self.classifier_layer(pooled_representation)
        return logits
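
The `forward` method takes the maximum over every position, including padding. One common alternative excludes padded positions before pooling. The sketch below shows that idea in NumPy rather than PyTorch so that it runs without a deep learning stack; `masked_max_pool` is a hypothetical helper and is not part of the notebook's model.

```python
import numpy as np


def masked_max_pool(hidden, mask):
    """Max-pool over the sequence axis, ignoring positions where mask == 0.

    hidden: array of shape (batch, seq, dim)
    mask:   array of shape (batch, seq), 1 for real tokens and 0 for padding
    """
    hidden = np.asarray(hidden, dtype=float)
    keep = np.asarray(mask, dtype=bool)[:, :, None]  # broadcast over the hidden dim
    masked = np.where(keep, hidden, -np.inf)          # padded positions can never win the max
    return masked.max(axis=1)


# One example with three positions; the last position is padding
hidden = np.array([[[1.0, -2.0], [0.5, 3.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])

pooled_masked = masked_max_pool(hidden, mask)
pooled_unmasked = hidden.max(axis=1)
print(pooled_masked)
print(pooled_unmasked)
```

In this toy input the padded position holds the largest values, so the unmasked pool is dominated by padding while the masked pool reflects only real tokens. A row with no real tokens would pool to negative infinity, so such rows need separate handling.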

In [None]:
bert_module = BertClassifierModule(n_classes=6, hidden_activation=nn.Tanh())

In [None]:
class BertClassifier(TorchShallowNeuralClassifier):
    def __init__(self, weights_name, *args, **kwargs):
        self.weights_name = weights_name  # Use the argument instead of always reading CFG
        self.tokenizer = AutoTokenizer.from_pretrained(self.weights_name)
        super().__init__(*args, **kwargs)
        self.params += ['weights_name']
        self.classes_ = None
        self.n_classes_ = 6
        
    def build_graph(self):
        return BertClassifierModule(
            self.n_classes_, self.hidden_activation, self.weights_name)

    def build_dataset(self, X, y=None):
        data = get_batch_token_ids(X, self.tokenizer)
        if y is None:
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'])
        else:
            self.classes_ = np.unique(y, axis=0)
            self.n_classes_ = self.classes_.shape[1]
            
            y = np.array(y, dtype=np.float32)
            y_tensor = torch.tensor(y, dtype=torch.float32)
            
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'], y_tensor)
        return dataset


In [None]:
bert_finetune = BertClassifier(
    weights_name=CFG.weights_name,
    hidden_activation=nn.GELU(),
    hidden_dim=512,
    max_iter=1,
    eta=3e-5,             # Low learning rate for effective fine-tuning.
    batch_size=CFG.batch_size,         # Small batches to avoid memory overload.
    gradient_accumulation_steps=4,  # Increase the effective batch size to 64 (16 x 4).
    early_stopping=True,  # Early-stopping
    n_iter_no_change=5)   # Early-stopping patience; cannot trigger while max_iter=1
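
The effect of gradient accumulation can be checked numerically. The sketch below assumes the usual scheme, in which each micro-batch gradient is scaled by `1 / accumulation_steps` and summed before the optimizer step; whether `TorchShallowNeuralClassifier` follows exactly this scheme is an assumption. It uses a toy linear model in NumPy, with made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))  # 64 toy examples with 3 features
y = rng.normal(size=64)
w = np.zeros(3)


def mse_grad(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb) ** 2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)


batch_size, accumulation_steps = 16, 4
accumulated = np.zeros_like(w)
for step in range(accumulation_steps):
    rows = slice(step * batch_size, (step + 1) * batch_size)
    # Scale each micro-batch gradient so the sum is an average over all 64 examples
    accumulated += mse_grad(X[rows], y[rows], w) / accumulation_steps

full_batch = mse_grad(X, y, w)
print(batch_size * accumulation_steps)
print(np.allclose(accumulated, full_batch))
```

Four micro-batches of 16 produce the same gradient as one batch of 64, while only 16 examples need to fit in memory at a time.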

In [None]:
%%time
train_labels = train_df.label.to_list() # Extract training labels
_ = bert_finetune.fit(train_df['cleaned_full_text'].to_list(),train_labels)

In [None]:
preds = bert_finetune.predict(valid_df['cleaned_full_text'].tolist())

# 📏 | Metric

The metric for this competition is the **Quadratic Weighted Kappa** (QWK). It is particularly useful for ordinal classification, where labels have an inherent order, because a prediction is penalized more heavily the farther it falls from the true score. Below we compute it with scikit-learn's `cohen_kappa_score` using `weights='quadratic'`. For background, see [this TensorFlow implementation](https://www.tensorflow.org/addons/api_docs/python/tfa/losses/WeightedKappaLoss) of the corresponding loss; you can learn more about the metric [here](https://www.sciencedirect.com/science/article/abs/pii/S0167865517301666).

> An earlier from-scratch implementation of this metric differed slightly from the competition metric; @taichiuemura resolved the difference [here](https://www.kaggle.com/code/taichiuemura/aes-2-0-kerasnlp-starter/#%F0%9F%93%8F-%7C-Metric).

## Result Summary

In [None]:
y_true = []
y_true.extend(valid_df['label'])
y_true = np.array(y_true)
preds = np.vstack(preds)
print('sklearn metric:', cohen_kappa_score(
    np.sum(y_true > 0.5, axis = 1),
    np.sum(preds > 0.5, axis = 1),
    weights = 'quadratic',
))
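
For intuition about what the score measures, the metric can be written out directly as one minus the ratio of weighted observed disagreement to weighted expected disagreement, with weights $(i - j)^2$. The sketch below is a minimal NumPy version under the assumption that ratings are integers from 1 to 6; it is meant as an illustration, not as a replacement for `cohen_kappa_score`.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """QWK = 1 - sum(W * O) / sum(W * E), with W[i, j] = (i - j) ** 2."""
    y_true = np.asarray(y_true) - min_rating
    y_pred = np.asarray(y_pred) - min_rating
    n = max_rating - min_rating + 1

    # Observed confusion matrix: rows are true ratings, columns are predictions
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected counts if true and predicted ratings were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4])
one_off = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 5])
print(perfect, one_off)
```

Perfect agreement gives a score of 1, and a single prediction that is off by one lowers it only slightly. The function divides by zero when every rating in both inputs is identical, so that case needs a guard in real use.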


# 🧪 | Testing

In this section, we will visually test how our model performs on some samples from the validation data.

> Note that we are converting the ordinal regression model outputs with `sum`, unlike a typical classification problem where we would use `argmax`.

In [None]:
# Format predictions and true answers
pred_scores = np.sum((preds > 0.5).astype(int), axis=-1)
true_scores = valid_df.score.values

# Check 5 Predictions
print("# Predictions\n")
for i in range(5):
    row = valid_df.iloc[i]
    text = row.full_text
    pred_answer = pred_scores[i]
    true_answer = true_scores[i]
    print(f"❓ Text {i+1}:\n{text[:150]} .... {text[-150:]}\n")
    print(f"✅ True: {true_answer}\n")
    print(f"🤖 Predicted: {pred_answer}\n")
    print("-" * 90, "\n")


# 📬 | Submission

In this section, we will run inference with our model on the test data and then prepare the submission file.

## Build Test Dataset

In [None]:
#Test Data
test_df = pd.read_csv(f"{BASE_PATH}/test.csv")

# Apply the same cleaning used for the training texts, then extract the test texts
test_texts = test_df.full_text.fillna("").apply(clean_text).tolist()


## Inference on Test Data

In [None]:
# Do inference
test_preds = bert_finetune.predict(test_texts)
test_preds = np.vstack(test_preds)
# Convert ordinal outputs to scores by counting the entries above 0.5, then clip to the valid range 1-6
test_preds = np.sum((test_preds>0.5).astype(int), axis=-1).clip(1, 6)

## Create Submission File

In [None]:
# Create a DataFrame to store the submission
sub_df = test_df[["essay_id"]].copy()

# Add the formatted predictions to the submission DataFrame
sub_df["score"] = test_preds

# Save Submission
sub_df.to_csv('submission.csv',index=False)

# Display the first rows of the submission DataFrame
sub_df.head()