<a href="https://colab.research.google.com/github/marco-siino/DA-ESWA/blob/main/code/CNN_ISS_augmented_DE_IT_NB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text preprocessing worth the time: A comparative survey on the impact of common techniques on NLP model performances. 
- - - 
CNN ON FNS DS EXPERIMENTS NOTEBOOK 
- - -
Convolutional Neural Network on Fake News Spreaders Dataset.
Code by M. Siino. 

From the paper: "Text preprocessing worth the time: A comparative survey on the impact of common techniques on NLP model performances." by M. Siino et al.



## Importing modules.

In [3]:
import matplotlib.pyplot as plt
import os
import random
import re
import shutil
import string
import tensorflow as tf

import numpy as np

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from keras.models import Model
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

os.environ['TF_CUDNN_DETERMINISTIC']='true'
os.environ['TF_DETERMINISTIC_OPS']='true'

## Downloading the dataset and extracting it into the current working directory.

In [4]:
# Url obtained starting from this: https://drive.google.com/file/d/19ZcqEv88euKB71HfAWjTGN3uCKp2qsfP/ and forcing export=download.
urlTrainingSet = "https://github.com/marco-siino/DA-ESWA/raw/main/data/iss/iss-training-augmented-de-it.zip"

urlTestSet="https://github.com/marco-siino/DA-ESWA/raw/main/data/iss/iss-test-original.zip"

training_set = tf.keras.utils.get_file("pan22-author-profiling-training-2022-03-29-augmented.zip", urlTrainingSet,
                                   extract=True, archive_format='zip',cache_dir='.',
                                   cache_subdir='')

test_set = tf.keras.utils.get_file("pan22-author-profiling-test-2022-04-22-without_truth.zip", urlTestSet,
                                    extract=True, archive_format='zip',cache_dir='.',
                                    cache_subdir='')

training_set_dir = os.path.join(os.path.dirname(training_set), 'pan22-author-profiling-training-2022-03-29-augmented')
test_set_dir = os.path.join(os.path.dirname(test_set), 'pan22-author-profiling-test-2022-04-22-without_truth')
print('training_set',training_set)
print('training_set_dir',training_set_dir)
print('####')
print('test_set_dir',test_set_dir)
print('####')
!ls -A



Downloading data from https://github.com/marco-siino/DA-ESWA/raw/main/data/iss/iss-training-augmented-de-it.zip
Downloading data from https://github.com/marco-siino/DA-ESWA/raw/main/data/iss/iss-test-original.zip
training_set ./pan22-author-profiling-training-2022-03-29-augmented.zip
training_set_dir ./pan22-author-profiling-training-2022-03-29-augmented
####
test_set_dir ./pan22-author-profiling-test-2022-04-22-without_truth
####
.config
iss-test-original
pan22-author-profiling-test-2022-04-22-without_truth.zip
pan22-author-profiling-training-2022-03-29-augmented
pan22-author-profiling-training-2022-03-29-augmented.zip
sample_data


## Build the folder hierarchy required by the Keras `text_dataset_from_directory` function.



In [5]:
### Training Folders. ###

# First level directory.
if not os.path.exists('train_dir_en'):
    os.makedirs('train_dir_en')

# Class labels directory.
if not os.path.exists('train_dir_en/0'):
    os.makedirs('train_dir_en/0')
if not os.path.exists('train_dir_en/1'):
    os.makedirs('train_dir_en/1')

# Make Py variables.
train_dir='train_dir_'

## Test Folders. ##
# First level directory.
if not os.path.exists('test_dir_en'):
    os.makedirs('test_dir_en')

# Class labels directory.
if not os.path.exists('test_dir_en/0'):
    os.makedirs('test_dir_en/0')
if not os.path.exists('test_dir_en/1'):
    os.makedirs('test_dir_en/1')

# Make Py variables.
test_dir='test_dir_'

!ls -A

.config
iss-test-original
pan22-author-profiling-test-2022-04-22-without_truth.zip
pan22-author-profiling-training-2022-03-29-augmented
pan22-author-profiling-training-2022-03-29-augmented.zip
sample_data
test_dir_en
train_dir_en
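The folder layout built above (one sub-folder per class label under each split) can also be created in a loop. The following is only a minimal sketch: the helper name `make_class_dirs` and the temporary demo root are assumptions introduced for illustration, not part of the original notebook.

```python
import os
import tempfile


def make_class_dirs(root, splits=("train_dir_en", "test_dir_en"), labels=("0", "1")):
    """Create <root>/<split>/<label> for every split and class label."""
    created = []
    for split in splits:
        for label in labels:
            path = os.path.join(root, split, label)
            # exist_ok=True makes the call safe to repeat when the cell is re-run.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# Hypothetical demo root, so the sketch does not touch the working directory.
demo_root = tempfile.mkdtemp()
paths = make_class_dirs(demo_root)
```

Using `exist_ok=True` replaces the explicit `os.path.exists` checks while keeping the same end result.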


## Set language and directory paths.

In [6]:
# Set the English ground-truth file paths for the training and test sets.
language='en'

truth_file_training_dir_en = training_set_dir  +'/' + language  #samples training
truth_file_training_path_en = training_set_dir +'/'+ 'truth.txt'   #train truth file
print('truth_file_training_dir_en', truth_file_training_dir_en)
print('truth_file_training_path_en', truth_file_training_path_en)

truth_file_test_dir = test_set_dir +'/'+language+'/'  # all test samples
truth_file_test_path_en = test_set_dir + '/'+ 'truth'+'.txt'  #path to truth file
print('truth_file_test_dir', truth_file_test_dir)
print('truth_file_test_path_en', truth_file_test_path_en)

truth_file_training_dir_en ./pan22-author-profiling-training-2022-03-29-augmented/en
truth_file_training_path_en ./pan22-author-profiling-training-2022-03-29-augmented/truth.txt
truth_file_test_dir ./pan22-author-profiling-test-2022-04-22-without_truth/en/
truth_file_test_path_en ./pan22-author-profiling-test-2022-04-22-without_truth/truth.txt


## Read truth.txt to organize training and test dataset folders.

In [7]:
# Open the training truth.txt in read-only mode; the file is closed automatically.
training_en_dir = '/content/pan22-author-profiling-training-2022-03-29-augmented/en'
with open(training_en_dir + '/truth.txt', "r") as f:
    # Read one line at a time until the end of the file.
    for line in f:
        # Split the line at ":::" into author id and label.
        x = line.split(":::")
        fNameXml = x[0]+'.xml'
        fNameTxt = x[0]+'.txt'
        # Index [0] keeps only the first character of the label, dropping the trailing \n.
        label = x[1][0]
        #print('label',label)
        if label == 'I':
            label = '1'
        elif label == 'N':
            label = '0'

        # Move the file to the folder of its class, renaming .xml to .txt.
        if os.path.exists(training_en_dir + '/' + fNameXml):
            os.rename(training_en_dir + '/' + fNameXml,
                      './train_dir_'+language+'/'+label+'/'+fNameTxt)

# Open the test truth.txt in read-only mode; the file is closed automatically.
#f = open(truth_file_test_path_en, "r")
test_root_dir = '/content/iss-test-original/pan22-author-profiling-test-2022-04-22-without_truth'
with open(test_root_dir + '/truth.txt', "r") as f:
    # Read one line at a time until the end of the file.
    for line in f:
        # Split the line at ":::" into author id and label.
        #print(line)
        x = line.split(":::")
        fNameXml = x[0]+'.xml'
        fNameTxt = x[0]+'.txt'
        # Index [0] keeps only the first character of the label, dropping the trailing \n.
        label = x[1][0]
        if label == 'I':
            label = '1'
        elif label == 'N':
            label = '0'
        # Move the file to the folder of its class, renaming .xml to .txt.
        #print(truth_file_test_dir+fNameXml)
        if os.path.exists(test_root_dir + '/en/' + fNameXml):
            #print('path exist')
            os.rename(test_root_dir + '/en/' + fNameXml,
                      './test_dir_'+language+'/'+label+'/'+fNameTxt)
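
The per-line logic of the cell above (split at `:::`, keep the first character of the label, map it to a class folder) can be isolated in a small function and checked without touching the file system. This is a sketch only: the function name `parse_truth_line` and the author ids in the usage lines are hypothetical.

```python
def parse_truth_line(line):
    """Return (author_id, class_dir) for one 'id:::label' line of truth.txt."""
    parts = line.split(":::")
    author_id = parts[0]
    # Only the first character of the label is used, so the trailing
    # newline (and any further label characters) is ignored.
    first_char = parts[1][0]
    # 'I' goes to class folder '1', 'N' to '0'; any other character is
    # kept unchanged, mirroring the if/elif in the notebook cell.
    class_dir = {"I": "1", "N": "0"}.get(first_char, first_char)
    return author_id, class_dir


# Hypothetical example lines (the ids are made up for illustration).
example_a = parse_truth_line("abc123:::I\n")
example_b = parse_truth_line("def456:::NI\n")
```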

## Generate the full training and test sets.

In [8]:
# Generate full randomized training set.
batch_size=1

en_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    train_dir+language, 
    batch_size=batch_size,
    shuffle=False
    )

en_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    test_dir+language, 
    batch_size=batch_size,
    shuffle=False
    )

train_ds=en_train_ds.shuffle(300,seed=1, reshuffle_each_iteration=False)
test_ds=en_test_ds.shuffle(200,seed=1, reshuffle_each_iteration=False)

train_ds_size=len(train_ds)
test_ds_size=len(test_ds)

Found 420 files belonging to 2 classes.
Found 180 files belonging to 2 classes.
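
`Dataset.shuffle` draws from a fixed-size buffer, so a buffer smaller than the dataset does not give a uniform shuffle. The pure-Python simulation below sketches that buffer behaviour; the function `buffered_shuffle` is an illustration written for this note, uses Python's own random generator, and does not reproduce the exact order TensorFlow would produce for a given seed.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simulate a buffer-based shuffle over any iterable."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        buffer[idx] = item
    # Input exhausted: emit what is left in the buffer in random order.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output


# 420 elements and a buffer of 300, the sizes used for the training set above.
order = buffered_shuffle(range(420), 300, seed=1)
```

In this simulation the element emitted at position `k` always has an original index below `k + buffer_size`. Since the files are read with `shuffle=False`, the first samples drawn can therefore only come from the first 300 files. Setting the buffer size to at least the dataset size would avoid this, at the cost of changing the sample order.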


## Functions to pre-process the source text. (See our paper for a detailed discussion.)

In [9]:
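# Do-Nothing (DON) preprocessing function: it only strips the CDATA, author and documents wrapper tags.
# Raw strings avoid Python's invalid-escape warnings for the regex backslashes.
def DON(input_data):
  tag_open_CDATA_removed = tf.strings.regex_replace(input_data, r'<!\[CDATA\[', ' ')
  tag_closed_CDATA_removed = tf.strings.regex_replace(tag_open_CDATA_removed, r'\]{1,}>', ' ')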
  tag_author_lang_en_removed = tf.strings.regex_replace(tag_closed_CDATA_removed,'<author lang="en">', ' ')
  tag_closed_author_removed = tf.strings.regex_replace(tag_author_lang_en_removed,'</author>', ' ')
  tag_open_documents_removed = tf.strings.regex_replace(tag_closed_author_removed,'<documents>\n(\t){0,2}', '')
  output_data = tf.strings.regex_replace(tag_open_documents_removed,'</documents>\n(\t){0,2}', ' ')
  return output_data
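
To see what these replacements do to a single document, the same patterns can be applied with Python's standard `re` module. This is a sketch under stated assumptions: `re` stands in for the RE2 engine used by `tf.strings.regex_replace`, the function name `strip_wrapper_tags` is hypothetical, and the sample string is made up to resemble the XML layout the patterns target.

```python
import re


def strip_wrapper_tags(text):
    """Apply the same six replacements as DON, using Python's re module."""
    text = re.sub(r'<!\[CDATA\[', ' ', text)
    text = re.sub(r'\]{1,}>', ' ', text)
    text = re.sub(r'<author lang="en">', ' ', text)
    text = re.sub(r'</author>', ' ', text)
    text = re.sub(r'<documents>\n(\t){0,2}', '', text)
    text = re.sub(r'</documents>\n(\t){0,2}', ' ', text)
    return text


# Made-up sample resembling one author file with a single document.
sample = (
    '<author lang="en">\n'
    '\t<documents>\n'
    '\t\t<document><![CDATA[hello world]]></document>\n'
    '\t</documents>\n'
    '</author>'
)
cleaned = strip_wrapper_tags(sample)
```

Note that none of the six patterns matches the inner `<document>` and `</document>` tags, so those stay in the text passed to the tokenizer.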

## Get the length of the longest sample in the training set, then adapt the vectorization layer.



In [17]:
def preprocess_and_adapt_ts(preprocessing_function,training_set):
  # Set a large sequence length to find the longest sample in the training set.
  sequence_length = 20000
  vectorize_layer = TextVectorization(
      standardize=preprocessing_function,
      output_mode='int',
      output_sequence_length=sequence_length)

  train_text = training_set.map(lambda x, y: x)
  vectorize_layer.adapt(train_text)
  #vectorize_layer.get_vocabulary()

  model = tf.keras.models.Sequential()
  model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
  model.add(vectorize_layer)

  longest_sample_length=1

  for element in training_set:
    authorDocument=element[0]
    label=element[1]
    
    #print("Sample considered is: ", authorDocument[0].numpy())
    #print("Preprocessed: ", str(custom_standardization(authorDocument[0].numpy())))
    #print("And has label: ", label[0].numpy())

    out=model(authorDocument)
    # Convert token list to numpy array.
    token_list = out.numpy()[0]
    token_list = np.trim_zeros(token_list,'b')
    if longest_sample_length < len(token_list):
      longest_sample_length = len(token_list)

  print("Length of the longest sample is:", longest_sample_length)

  # After tokenization longest_sample_length covers all the document lengths in our dataset.
  sequence_length = longest_sample_length

  vectorize_layer = TextVectorization(
      standardize=preprocessing_function,
      output_mode='int',
      output_sequence_length=sequence_length)

  # Finally adapt the vectorize layer.
  train_text = training_set.map(lambda x, y: x)
  vectorize_layer.adapt(train_text)
  return vectorize_layer
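
The length measurement in the function above relies on removing the trailing zeros that the vectorization layer adds as padding. A dependency-free sketch of that step follows; the helper names `unpadded_length` and `longest_unpadded` and the toy token ids are assumptions made for illustration.

```python
def unpadded_length(token_ids):
    """Number of tokens left after dropping trailing zero padding."""
    n = len(token_ids)
    while n > 0 and token_ids[n - 1] == 0:
        n -= 1
    return n


def longest_unpadded(rows):
    """Longest unpadded length over a collection of padded sequences."""
    return max((unpadded_length(row) for row in rows), default=0)


# Toy padded sequences (made-up token ids, padded to length 6).
toy_rows = [
    [5, 3, 0, 9, 0, 0],  # an inner zero is kept, only trailing zeros go
    [7, 0, 0, 0, 0, 0],
    [4, 4, 4, 4, 4, 0],
]
toy_longest = longest_unpadded(toy_rows)
```

Because `output_sequence_length` both pads and truncates, a reported longest length equal to the initial cap (20000 here) may mean that some samples were cut at that cap and are actually longer.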

## Define a dictionary mapping each preprocessing function name to its function, and a dictionary to store the model results.




In [18]:
model_results = {}
prepro_functions_dict_base = {
    'DON':DON,
    }

# 3 prepro functions = 15 combs...+1 for do_nothing

prepro_functions_dict_comb = {
    # 1. Do nothing 
    'DON': DON,
}

for key in prepro_functions_dict_comb:
  print(key)
  model_results[key]=[]

DON


## Training hyperparameters.

In [19]:
# Word embedding dimensions.
embedding_dim = 100

num_runs = 5 
# No need to go beyond the 20th epoch: overfitting begins.
num_epochs_per_run = 20

#opt = tf.keras.optimizers.RMSprop()

## Models definition and evaluation.




In [20]:
tf.random.set_seed(1)

# Reset model_results list.
for key in prepro_functions_dict_comb:
  model_results[key]=[]

for key in prepro_functions_dict_comb:
  runs_accuracy = []

  print("\n\n* * * * EVALUATION USING", key, "AS PREPROCESSING FUNCTION * * * *")

  # Preprocess training set to build a dictionary.
  vectorize_layer = preprocess_and_adapt_ts(prepro_functions_dict_comb[key],train_ds)

  max_features=len(vectorize_layer.get_vocabulary()) + 1
  print("Vocabulary size is:", max_features)

  for run in range(1,(num_runs+1)):
    epochs_accuracy=[]
    model = tf.keras.Sequential([
                                    tf.keras.Input(shape=(1,), dtype=tf.string),
                                    vectorize_layer,
                                    layers.Embedding(max_features + 1, embedding_dim),                     
                                    layers.Dropout(0.8),

                                    layers.Conv1D(256,16,activation='relu'),
                                    layers.MaxPooling1D(),
                                    layers.Dropout(0.6),

                                    layers.Dense(512,activation='relu'),
                           
                                    layers.GlobalAveragePooling1D(),
                                    layers.Dropout(0.2),
                                    layers.Dense(1)                            
    ])
    model.compile(loss=losses.BinaryCrossentropy(from_logits=True), optimizer='RMSprop', metrics=tf.metrics.BinaryAccuracy(threshold=0.0)) 

    for epoch in range (0,num_epochs_per_run):
        history = model.fit(
          train_ds,
          validation_data = test_ds,
          epochs=1,
          shuffle=False,
          # Uncomment the following line to save and download the model.
          #callbacks=[callbacks]
          )
        accuracy = history.history['val_binary_accuracy']
        print("Run: ",run,"/ Accuracy at epoch ",epoch," is: ", accuracy[0],"\n")
        epochs_accuracy.append(accuracy[0])

    print("Accuracies over epochs:",epochs_accuracy,"\n\n")
    runs_accuracy.append(max(epochs_accuracy))

  runs_accuracy.sort()
  print("\n\n Over all runs maximum accuracies on English are:", runs_accuracy)
  print("The median for English is:",runs_accuracy[2],"\n\n\n")
  
  if (runs_accuracy[2]-runs_accuracy[0])>(runs_accuracy[4]-runs_accuracy[2]):
    max_range_from_median = runs_accuracy[2]-runs_accuracy[0]
  else:
    max_range_from_median = runs_accuracy[4]-runs_accuracy[2]
  final_result = str(runs_accuracy[2])+" +/- "+ str(max_range_from_median)
  model_results[key].append(final_result)
  print("CNN Accuracy Score on Test set -> ",model_results[key])



* * * * EVALUATION USING DON AS PREPROCESSING FUNCTION * * * *
Length of the longest sample is: 20000
Vocabulary size is: 259004
Run:  1 / Accuracy at epoch  0  is:  0.5 

Run:  1 / Accuracy at epoch  1  is:  0.5 

Run:  1 / Accuracy at epoch  2  is:  0.5 

Run:  1 / Accuracy at epoch  3  is:  0.5 

Run:  1 / Accuracy at epoch  4  is:  0.5 

Run:  1 / Accuracy at epoch  5  is:  0.5 

Run:  1 / Accuracy at epoch  6  is:  0.5055555701255798 

Run:  1 / Accuracy at epoch  7  is:  0.5055555701255798 

Run:  1 / Accuracy at epoch  8  is:  0.5111111402511597 

Run:  1 / Accuracy at epoch  9  is:  0.5111111402511597 

Run:  1 / Accuracy at epoch  10  is:  0.5055555701255798 

Run:  1 / Accuracy at epoch  11  is:  0.5111111402511597 

Run:  1 / Accuracy at epoch  12  is:  0.5055555701255798 

Run:  1 / Accuracy at epoch  13  is:  0.5 

Run:  1 / Accuracy at epoch  14  is:  0.5111111402511597 

Run:  1 / Accuracy at epoch  15  is:  0.5 

Run:  1 / Accuracy at epoch  16  is:  0.5 

Run:  1 / A
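
The summary computed at the end of the evaluation cell (median of the per-run best accuracies, plus the larger distance from the median to the minimum or maximum) indexes positions 0, 2 and 4 directly, which assumes exactly five runs. A sketch that works for any number of runs is given below; the function name `median_with_spread` and the accuracy values are hypothetical.

```python
import statistics


def median_with_spread(accuracies):
    """Return (median, largest distance from the median to min or max)."""
    ordered = sorted(accuracies)
    med = statistics.median(ordered)
    spread = max(med - ordered[0], ordered[-1] - med)
    return med, spread


# Made-up best accuracies from five runs, for illustration only.
med, spread = median_with_spread([0.50, 0.52, 0.51, 0.55, 0.50])
summary = f"{med} +/- {spread:.4f}"
```

For an odd number of runs this matches the middle element used in the cell; for an even number, `statistics.median` averages the two middle values, which is a design choice rather than something the original code specifies.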

## Now show compact results in a table.

In [21]:
print(" PREPRO FUNCTION    |  Test Accuracy   |",end = '')

print("\n")
for prepro_func in prepro_functions_dict_comb:
  #print(prepro_func,"\t\t\t",format(round(model_results[prepro_func][0],4),'.4f'),"\t\t",end='')
  result = model_results[prepro_func][0]
  # result = format(round(model_results[prepro_func][0],4),'.4f')
  print(f'{prepro_func:27}{ result :12}')
  print("\n")

 PREPRO FUNCTION    |  Test Accuracy   |

DON                        0.5111111402511597 +/- 0.0


