<a href="https://colab.research.google.com/github/lorhzy09/Argument-Effectiveness-Classification/blob/main/Argument_Effectiveness_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

DSU 2nd quarter curriculum training 

Objectives : 
- classify argumentative elements in student writing as "effective," "adequate," or "ineffective." 
- create model trained on data that is representative of the 6th-12th grade population in the United States in order to minimize bias. 
- Models derived will help students to receive enhanced feedback on argumentative writing


References & tutorials: 
- https://www.tensorflow.org/tutorials/keras/classification
- https://www.tensorflow.org/text/tutorials/classify_text_with_bert
- https://www.youtube.com/watch?v=wp9BudYGZyA
- James Briggs: https://www.youtube.com/watch?v=pjtnkCGElcE&t=971s


Dataset source: 
- https://www.kaggle.com/competitions/feedback-prize-effectiveness


In [130]:
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)

Mounted at /content/gdrive


In [131]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

from tensorflow.keras import layers
from tensorflow.keras import losses

!pip install transformers
from transformers import BertTokenizer

assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')

df = pd.read_csv('/content/gdrive/MyDrive/Projects/Predicting Effective Arguments/feedback-prize-effectiveness/train.csv')
df.head()
num_samples = len(df)
df.columns


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Index(['discourse_id', 'essay_id', 'discourse_text', 'discourse_type',
       'discourse_effectiveness'],
      dtype='object')

https://www.tensorflow.org/tutorials/load_data/csv

In [132]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

token = tokenizer.encode_plus(
    df['discourse_text'].iloc[0], 
    max_length=1024,
    truncation=True,
    padding = 'max_length',
    add_specialtokens = True, 
    return_tensor = 'tf'
)


Keyword arguments {'add_specialtokens': True, 'return_tensor': 'tf'} not recognized.


In [133]:
from transformers.models.auto.tokenization_auto import tokenizer_class_from_name
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


#initialize data shape of Xid, XMask, and labels with all zeros
Xids = np.zeros((len(df), 512))
XMask = np.zeros((len(df), 512)) 
# labels = np.zeros((len(df), 3)) # 3 types of labels 

for count, text in enumerate(df['discourse_text']): 
  tokens = tokenizer.encode_plus(text, 
                             max_length=512, truncation=True,
                             padding='max_length', add_special_tokens=True, 
                             return_tensors='tf')
  Xids[count, :] = tokens ['input_ids']
  XMask[count, :] = tokens ['attention_mask']


In [134]:
df.head()

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
0,0013cc385424,007ACE74B050,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate
1,9704a709b505,007ACE74B050,"On my perspective, I think that the face is a ...",Position,Adequate
2,c22adee811b6,007ACE74B050,I think that the face is a natural landform be...,Claim,Adequate
3,a10d361e54e4,007ACE74B050,"If life was on Mars, we would know by now. The...",Evidence,Adequate
4,db3e453ec4e2,007ACE74B050,People thought that the face was formed by ali...,Counterclaim,Adequate


In [150]:
arr = df['discourse_effectiveness'].values

effectiveness_as_num = arr
arr

effectiveness_as_num [ effectiveness_as_num == 'Ineffective'] = 0.0
effectiveness_as_num [ effectiveness_as_num == 'Adequate'] = 1.0
effectiveness_as_num [ effectiveness_as_num == 'Effective'] = 2.0

effectiveness_as_num = effectiveness_as_num.astype(int)

effectiveness_as_num

labels = np.zeros((len(df), 3))
labels[np.arange(len(df)), effectiveness_as_num] = 1

labels


array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       ...,
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])

In [153]:
dataset = tf.data.Dataset.from_tensor_slices((Xids, XMask, labels))
dataset.take(1)


(512,)

In [155]:
def map_func(input_ids, mask, labels): 
  return {'input_ids': input_ids, 'attention_mask': mask}, labels 

dataset = dataset.map(map_func)
dataset.take(1)


<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.float64, name=None)}, TensorSpec(shape=(3,), dtype=tf.float64, name=None))>

Q: what is the difference between batch size and epoch in a neural network? 

A: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/

- batch size: hyperparameter, defines # of samples to work through before updating internal model parameters
- epoch: hyperparameter, defines # of times learning algorithm will work through entire training dataset.





In [138]:
# set up batch size 

BATCH_SIZE = 16 

dataset = dataset.shuffle(10000).batch(BATCH_SIZE, drop_remainder = True )
# dataset.take(1) # tensor shape has changed, now 16 samples for eery tensor 

In [139]:
# create train test split 

SPLIT = 0.9

size = int ((len(df) / BATCH_SIZE)* SPLIT)

train_ds = dataset.take(size)
val_ds = dataset.skip(size)
print(size)

2068


In [140]:
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-uncased')

bert.summary()

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "tf_bert_model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
_________________________________________________________________


Q: What is a "tensor" in tensorflow? 

A: https://www.tensorflow.org/guide/tensor
= multi-dimensional arrays with a uniform dtype
= similar to numpy.arrays
= generalization of scalar scalars and vectors: scalar is zero rank tensor; vector is first rank tensor

In [141]:
# create input and mask layer 

input_ids = tf.keras.layers.Input(shape=(len(df),), name = 'input_ids', dtype = 'int32')
# setting name because we have dictionary input_ids (16, 512), specify which 'input_ids' each input tensor will go into 
# int32 because BErT expect tokens to be integers values 
mask = tf.keras.layers.Input(shape=(len(df),), name = 'attention_mask', dtype = 'int32')



Q: what is a embedding and how does it relate to BErt? 

In [142]:
# feed layer into bert by creating embeddings, access transformer within bert object 

embeddings = bert.bert(input_ids, attention_mask=mask) [1] #3D tensor layer pulled into 2D 
 

In [143]:
# convert embeddings into label predictions 
x = tf.keras.layers.Dense (1024, activation='relu')(embeddings)
y = tf.keras.layers.Dense (3, activation='softmax', name = 'outputs')(x) # add connection to x

In [144]:
# intialize layers into model object 

model = tf.keras.Model (inputs = [input_ids, mask], outputs = y)
model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 36765)]      0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 36765)]      0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 3676                                         

In [145]:
# set up model training parameters to pass into model.compile 

optimizer = tf.keras.optimizers.Adam(lr=1e-5, decay=1e-6)

# loss function 
loss = tf.keras.losses.CategoricalCrossentropy() 
# categorical corss entropy because output are the 3 levels of effectiveness, which is categorical 

acc = tf.keras.metrics.CategoricalAccuracy('accuracy')


  super(Adam, self).__init__(name, **kwargs)


In [146]:
# call the compile method, passing in the parameters specified above 

model.compile (optimizer = optimizer, loss = loss, metrics=[acc])

train_ds

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.float64, name=None)}, TensorSpec(shape=(16, 3), dtype=tf.float64, name=None))>

In [147]:
# now we can finally train our model 

history = model.fit (
    train_ds, 
    validation_data = val_ds,
    epochs = 3
)

Epoch 1/3


ValueError: ignored

In [None]:
# after training, save the model 